Standard feedforward neural networks process inputs independently. They have no memory of past inputs, which makes them unsuitable for sequential data like text or time series, where order matters.
1. Recurrent Neural Networks (RNNs)
RNNs are a special type of neural network designed for sequences. They have a "loop" in their architecture that allows information to persist. An RNN processes a sequence one element at a time. At each step, the hidden state from the previous step is fed back as an additional input to the current step. This feedback loop creates an internal "memory" that captures information about the preceding elements in the sequence.
You can think of the hidden state as the network's summary of what it has seen so far.
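This recurrence can be sketched in a few lines of NumPy. The tanh nonlinearity and the weight layout below are the standard "vanilla" RNN formulation; the specific dimensions and random initialization are just illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Process a sequence of 5 input vectors, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

print(h.shape)  # (3,)
```

Notice that `h` is the only thing carried between steps: it is the network's entire summary of the sequence so far.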
The Problem with Simple RNNs: In practice, simple RNNs struggle with long-range dependencies. Because the gradient signal has to flow back through the network at every time step (backpropagation through time), it can either diminish to zero (vanishing gradient) or grow uncontrollably (exploding gradient). This makes it very difficult for a simple RNN to learn connections between words that are far apart in a long sentence.
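The core of the problem can be seen with simple arithmetic: the gradient that reaches an early time step is (roughly) a product of one per-step factor for every step in between. If those factors sit consistently below or above 1, the product collapses or blows up:

```python
# Toy illustration of backpropagation through time over T = 100 steps:
# the gradient is scaled by one factor per step.
factor_small, factor_large = 0.9, 1.1
T = 100

print(factor_small ** T)  # ~2.7e-05: the signal has effectively vanished
print(factor_large ** T)  # ~13780:   the signal has exploded
```

Exploding gradients can be tamed by gradient clipping, but vanishing gradients are what motivates the gated architectures below.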
2. Long Short-Term Memory (LSTM)
LSTMs are a special kind of RNN, explicitly designed to solve the vanishing gradient problem and learn long-term dependencies. They do this by introducing a more complex cell structure with special mechanisms called gates.
The LSTM cell has a cell state (a sort of conveyor belt for information) and three gates that regulate the flow of information into and out of this state:
- Forget Gate: Decides what information from the previous cell state should be thrown away or forgotten.
- Input Gate: Decides what new information from the current input should be stored in the cell state.
- Output Gate: Decides what information from the cell state should be used to generate the output for the current time step.
These gates are essentially small neural networks that learn which information is important to keep or discard at each step, allowing LSTMs to maintain relevant context over very long sequences.
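The three gates above can be sketched as one forward step in NumPy. This follows the standard LSTM equations; the stacking order of the four parameter blocks, the dimensions, and the random initialization are assumptions made for illustration (biases for each gate are folded into a single vector `b`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters for the forget gate (f),
    input gate (i), candidate values (g), and output gate (o)."""
    z = x_t @ W + h_prev @ U + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates squashed into (0, 1)
    g = np.tanh(g)                                # candidate new information
    c = f * c_prev + i * g    # forget old info, let new info in
    h = o * np.tanh(c)        # expose a filtered view of the cell state
    return h, c

rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 3
W = rng.normal(scale=0.1, size=(input_dim, 4 * hidden_dim))
U = rng.normal(scale=0.1, size=(hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The key line is `c = f * c_prev + i * g`: because the cell state is updated additively rather than being repeatedly squashed, gradients can flow through it over many steps.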
3. Gated Recurrent Unit (GRU)
The GRU is a newer, simplified variant of the LSTM. It combines the forget and input gates into a single update gate and merges the cell state and hidden state. This makes GRUs computationally more efficient than LSTMs and, on some smaller datasets, they can perform just as well. They have two gates:
- Reset Gate: Decides how much of the past information to forget.
- Update Gate: Decides how much of the past information to keep versus how much new information to add.
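The two gates can be sketched the same way. This follows the common GRU formulation (conventions for which gate does the blending vary between references; biases are omitted here for brevity, and all shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step with an update gate (z) and a reset gate (r)."""
    z = sigmoid(x_t @ Wz + h_prev @ Uz)             # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)             # reset gate
    h_cand = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_cand            # blend old and new

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
mk = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
Wz, Wr, Wh = mk(d_in, d_h), mk(d_in, d_h), mk(d_in, d_h)
Uz, Ur, Uh = mk(d_h, d_h), mk(d_h, d_h), mk(d_h, d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h = gru_step(x_t, h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h.shape)
```

Compared with the LSTM step, there is no separate cell state and one fewer gate, which is where the GRU's efficiency comes from.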
Which to use? There's no hard rule. LSTM is a solid default choice, but if you need a slightly faster model and are willing to experiment, a GRU is a great alternative.
Code Example: Building an LSTM Layer in Keras
Here’s how you would add an LSTM layer to a model in Keras for a task like text classification.
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

# Vocabulary size, embedding dimension, and padded sequence length
vocab_size = 10000
embedding_dim = 16
max_length = 120

model = Sequential([
    # Fix the input shape: each example is a padded sequence of token ids.
    Input(shape=(max_length,)),
    # The Embedding layer turns positive integers (token indexes)
    # into dense vectors of fixed size.
    Embedding(vocab_size, embedding_dim),
    # An LSTM layer with 32 units processes the sequence of embedding
    # vectors and returns its final hidden state.
    LSTM(32),
    # Standard Dense layers for classification
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),  # output layer for binary classification
])
model.summary()
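To see the model end to end, it can be compiled and smoke-tested on random data. The optimizer, loss, and the dummy inputs below are placeholders, not a real training setup:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size, embedding_dim, max_length = 10000, 16, 120

model = Sequential([
    Input(shape=(max_length,)),
    Embedding(vocab_size, embedding_dim),
    LSTM(32),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Dummy data: 8 "sentences" of random token ids, with random binary labels.
x = np.random.randint(0, vocab_size, size=(8, max_length))
y = np.random.randint(0, 2, size=(8,))
model.fit(x, y, epochs=1, verbose=0)

print(model.predict(x, verbose=0).shape)  # (8, 1)
```

Swapping `LSTM(32)` for `GRU(32)` (imported from `tensorflow.keras.layers`) is all it takes to try the GRU variant discussed above.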