RNNs & LSTMs — Sequence Memory
Vanishing gradients, LSTM gates, the precursor to transformers
Prerequisites
- neural-networks
Read this sentence: "The cat sat on the ___." You instantly know the blank should be something like "mat" or "floor." Your brain remembers what came before. Regular neural networks can't do this — they see each input in isolation, with no memory of what came before. RNNs were built to fix that.
Think about how you read. You don't throw away everything you've read so far and start fresh at every word. You carry context forward. When you see the word "bank," whether it means a financial institution or a riverbank depends entirely on the words that came before it. Meaning lives in sequences, not in isolated tokens.
A standard feedforward neural network takes a fixed-size input and produces a fixed-size output. It has no built-in notion of "before" or "after." Feed it a sentence as an unordered bag of words, and you could shuffle the words without the network noticing. Recurrent Neural Networks (RNNs) solve this by adding a loop: the output from one step becomes part of the input for the next step. They process sequences one element at a time, carrying a hidden state that acts as a running summary of everything they've seen so far.
Networks with memory
The key insight of an RNN is simple: order matters. The sentence "dog bites man" means something entirely different from "man bites dog." Same words, different sequence, different meaning. An RNN processes inputs sequentially and maintains a hidden state — a vector of numbers that gets updated at every time step. This hidden state is the network's memory.
At each step, the RNN takes two inputs: the current word (or token) and the hidden state from the previous step. It combines them, applies a transformation, and produces a new hidden state plus an output. The hidden state carries information from all previous words forward through the sequence, like a baton being passed in a relay race.
How an RNN processes a sequence
It's the same RNN cell at every step — it reuses the same weights. This weight sharing is what makes RNNs work regardless of sequence length. A network trained on 10-word sentences can process 100-word sentences because the same cell unrolls for as many steps as needed. The hidden state is the thread that stitches each step together into a coherent understanding of the whole sequence.
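The weight sharing described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a tanh update; the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh`, and `b_h`, the sizes, and the random weights are all hypothetical choices made for the example, not part of any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def run_rnn(sequence, W_xh, W_hh, b_h):
    """Unroll the same cell over a sequence of any length; return the final hidden state."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state: nothing seen yet
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # same weights at every step
    return h

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

short_seq = rng.normal(size=(10, input_size))   # a 10-step sequence
long_seq = rng.normal(size=(100, input_size))   # a 100-step sequence

h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)
print(h_short.shape, h_long.shape)  # (16,) (16,)
```

Both sequences run through the same three parameter arrays, and both end in a hidden state of the same size. Nothing in the cell depends on how many steps it is unrolled for.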
From simple memory to smart memory
The basic RNN idea is elegant, but it has a fatal flaw. Understanding that flaw — and the solutions designed to overcome it — is the story of how we got from simple recurrence all the way to the transformers that power modern AI.
Step 1: The Simple RNN
A simple RNN passes a hidden state from one time step to the next. At each step, the new hidden state is computed from the current input and the previous hidden state using a tanh activation function.
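Written as an equation, the update described above takes roughly the following form. The symbols $W_{xh}$, $W_{hh}$, and $b_h$ (input weights, recurrent weights, and bias) are notation assumed here for illustration.

```latex
h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
```

Here $x_t$ is the input at step $t$ and $h_{t-1}$ is the previous hidden state. The tanh keeps every entry of $h_t$ between -1 and 1.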
This works well for short sequences. Filling in the missing character in "clo_d" (cloud) is easy because the context is right there. But what about remembering something from 50 words ago? That's where things break down.
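One way to see the breakdown is to track how strongly the hidden state at a later step still depends on the very first one. The sketch below assumes NumPy, a tanh RNN, and small random weights chosen so the effect is visible; all names and sizes are hypothetical. It multiplies the per-step Jacobians together, which is what backpropagation through time does with the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 8, 16, 50  # illustrative sizes

# Hypothetical small random weights; a trained network would have learned values.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)  # d h_t / d h_0, the identity before any step
norms = []
for _ in range(steps):
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    # Jacobian of this step with respect to the previous hidden state:
    # diag(1 - h_t^2) @ W_hh
    step_jacobian = (1.0 - h**2)[:, None] * W_hh
    grad = step_jacobian @ grad  # chain rule across time steps
    norms.append(np.linalg.norm(grad))

for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: gradient norm = {norms[t - 1]:.3e}")
```

Under these assumptions the printed norm falls by orders of magnitude as the number of steps grows, so an error signal from step 50 barely reaches the weights that handled step 1.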
Watch sequence processing in action
Here's an interactive visualization of how RNNs and LSTMs process a sentence word by word. Watch the hidden state ball pass from word to word, carrying information forward. In Simple RNN mode, notice how the signal fades over long sequences. Switch to LSTM mode to see how the cell state highway and gates keep information alive. Try typing your own sentences to experiment with different lengths.
Try this: Process a long sentence in Simple RNN mode and watch the hidden state bar shrink. Then switch to LSTM and run the same sentence — notice how the strength stays high. That's the LSTM advantage.
Key Takeaways
- RNNs process sequences one step at a time, maintaining a hidden state that carries context forward. The same cell with the same weights is reused at every step — this weight sharing lets RNNs handle variable-length inputs.
- The vanishing gradient problem is the fundamental limitation of simple RNNs: gradients tend to shrink exponentially during backpropagation through time, making it very hard for the network to learn dependencies between words that are far apart in the sequence.
- LSTMs mitigate the vanishing gradient problem with a cell state highway and three learned gates (forget, input, output) that control information flow. The cell state can carry information across many time steps with far less degradation than a simple RNN's hidden state.
- GRUs are a simplified alternative to LSTMs with only two gates (reset and update). They often perform comparably while being faster to train, making them a practical choice for many sequence tasks.
- RNNs and LSTMs were the dominant architectures for NLP from roughly 2013 to 2017 and directly inspired the attention mechanism and transformer architecture that power today's large language models.
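The three gates and the cell state highway from the takeaways above can be sketched as a single LSTM step. This is a minimal NumPy illustration, not a library implementation: the function `lstm_step`, the stacked weight layout, and the sizes are assumptions, and the gate biases are set by hand purely to show the highway effect. In a real network they are learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev, x_t] to four stacked pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])          # forget gate: how much old cell state to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate content
    c_t = f * c_prev + i * g     # additive update along the cell state highway
    h_t = o * np.tanh(c_t)
    return h_t, c_t

input_size, hidden_size = 8, 16  # illustrative sizes
W = np.zeros((4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
b[0:hidden_size] = 10.0                 # hand-set: forget gate close to 1
b[hidden_size:2 * hidden_size] = -10.0  # hand-set: input gate close to 0

rng = np.random.default_rng(0)
h = np.zeros(hidden_size)
c = np.ones(hidden_size)  # pretend something was stored earlier
for _ in range(100):
    h, c = lstm_step(rng.normal(size=input_size), h, c, W, b)

print(c.min())  # still above 0.99 after 100 steps
```

With the forget gate held near 1 and the input gate near 0, the stored values lose less than 1% over 100 steps. The loss is small but not zero, which is why even LSTMs eventually forget.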
Common Misconceptions
- "RNNs are obsolete." — While transformers dominate NLP, RNNs and LSTMs are still widely used for real-time streaming data, edge devices with limited memory, time-series forecasting, and any task where processing must happen one step at a time. They're also making a comeback in hybrid architectures like RWKV and Mamba.
- "LSTMs can remember everything forever." — LSTMs are much better than simple RNNs at long-range dependencies, but they still struggle with very long sequences (hundreds of tokens). The cell state does degrade over time, just much more slowly. This limitation is precisely what motivated the development of attention mechanisms.
- "More gates always means better performance." — GRUs with 2 gates often match LSTMs with 3 gates. The right architecture depends on the task, data size, and computational budget. Simpler models can outperform complex ones when data is limited.
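For comparison with the LSTM, here is a hedged sketch of a GRU step with its two gates, followed by a rough parameter count for both cell types. It assumes NumPy; `gru_step` and the weight names are hypothetical. The blending convention, `(1 - z)` on the old state, is one of two in common use. Library implementations may also count bias terms differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step with an update gate z and a reset gate r."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)  # update gate: how much of the state to overwrite
    r = sigmoid(W_r @ hx + b_r)  # reset gate: how much past to use in the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_prev + z * h_tilde  # blend old state and candidate

input_size, hidden_size = 8, 16  # illustrative sizes
rng = np.random.default_rng(0)
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_size) for _ in range(3))

h = np.zeros(hidden_size)
for _ in range(20):
    h = gru_step(rng.normal(size=input_size), h, W_z, W_r, W_h, b_z, b_r, b_h)

# Each weight block has hidden_size * (hidden_size + input_size) weights plus a bias.
block = hidden_size * (hidden_size + input_size) + hidden_size
gru_params = 3 * block   # update gate, reset gate, candidate
lstm_params = 4 * block  # forget, input, and output gates, plus candidate
print(gru_params, lstm_params)  # 1200 1600
```

Under this counting a GRU has three quarters of the parameters of an LSTM of the same size, which is one reason it can be cheaper to train.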