The Memory Problem
In the last section we built Backpropagation Through Time and watched an RNN actually learn. It worked beautifully — on sentences of four words. But if we had pushed that same network on a sentence of fifty words and asked it to connect a subject to a verb far, far away, we would have watched it fail silently. No error. No crash. The weights just would not move in the right direction. The network would behave, stubbornly, as if the first half of the sequence did not exist.
This is the vanishing gradient problem, and it is not a bug. It is a mathematical consequence of how RNNs share weights across time. In this section we are going to see, with full rigor and with numbers you can click through line by line, exactly why the signal dies on its way back through time, and why this single difficulty held back sequence modeling for nearly a decade — until LSTM and GRU stepped in.
The one-sentence idea. A vanilla RNN multiplies its gradient by the same matrix at every step backward. A product of many matrices with spectral radius less than 1 collapses to zero exponentially; a product with spectral radius greater than 1 explodes to infinity. There is almost no middle ground.
The Chain Rule Unrolled Through Time
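Recall the RNN update from Chapter 16 Section 2:
$$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b_h)$$
We want to train this with gradient descent, so we need $\partial L / \partial W_{hh}$. The loss $L$ depends on the final hidden state $h_T$, and $h_T$ depends on $h_{T-1}$, which depends on $h_{T-2}$, all the way back to $h_1$. Applying the chain rule through this dependency chain gives
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_T}\,\frac{\partial h_T}{\partial h_t}\,\frac{\partial^{+} h_t}{\partial W_{hh}}$$
where $\partial^{+} h_t / \partial W_{hh}$ denotes the immediate derivative that treats $h_{t-1}$ as a constant.
That sum looks innocent, but every single term contains the factor $\partial h_T / \partial h_t$, a Jacobian that spans $T - t$ timesteps. If even one of those factors is tiny, the corresponding term contributes almost nothing to the update of $W_{hh}$. If every one of them is tiny for early $t$, the network gets essentially no information about what happened at the start of the sequence.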
The Jacobian Product: Where It All Goes Wrong
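Expand $\partial h_T / \partial h_t$ using the chain rule again, one step at a time:
$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$
Each factor is the Jacobian of one RNN step. For the tanh recurrence above it equals
$$\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\!\left(1 - h_k^{2}\right) W_{hh}$$
So the matrix that transports the gradient from time $T$ back to time $t$ is a product of $T - t$ factors of the form $\operatorname{diag}(1 - h_k^{2})\,W_{hh}$. Taking norms and using submultiplicativity of matrix norms,
$$\left\|\frac{\partial h_T}{\partial h_t}\right\| \le \prod_{k=t+1}^{T} \left\|\operatorname{diag}(1 - h_k^{2})\right\| \, \|W_{hh}\|$$
The diagonal factor is bounded by $1$ because $0 < \tanh'(z) = 1 - \tanh^{2}(z) \le 1$. So in the best case the bound collapses to
$$\left\|\frac{\partial h_T}{\partial h_t}\right\| \le \|W_{hh}\|^{\,T-t}$$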
Two regimes jump out immediately:
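- If $\|W_{hh}\| < 1$: the bound is $\gamma^{T-t}$ with $\gamma = \|W_{hh}\| < 1$. The gradient shrinks exponentially in the sequence length. This is the vanishing regime.
- If $\|W_{hh}\| > 1$: the bound no longer forces decay, and the matrix product can grow exponentially. Unless tanh saturation happens to cancel it, gradients explode.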
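The two regimes can be checked numerically. The sketch below is an illustration under stated assumptions, not the chapter's own code: it works in the linear regime (tanh derivative taken as 1) and builds the recurrent matrix as a scaled random orthogonal matrix, so that its norm and spectral radius both equal `scale`. The function name `jacobian_product_norms` and the sizes are hypothetical choices.

```python
import numpy as np

def jacobian_product_norms(scale, steps=20, size=8, seed=0):
    """Spectral norm of W^k for k = 1..steps, with W = scale * Q and Q orthogonal."""
    rng = np.random.default_rng(seed)
    # QR of a Gaussian matrix yields an orthogonal Q (all singular values equal 1).
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    product = np.eye(size)
    norms = []
    for _ in range(steps):
        product = product @ w                      # one more backward step
        norms.append(np.linalg.norm(product, 2))   # largest singular value
    return norms

shrinking = jacobian_product_norms(0.5)   # norm below 1: vanishing regime
growing = jacobian_product_norms(1.5)     # norm above 1: exploding regime
print(f"scale 0.5 after 20 steps: {shrinking[-1]:.3e}")
print(f"scale 1.5 after 20 steps: {growing[-1]:.3e}")
```

For this particular choice of matrix the bound is tight: the norm of the product equals `scale` raised to the number of steps. A generic matrix would typically sit below the bound.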
A Scalar Intuition: Why 0.5²⁰ Destroys Learning
All of the matrix mystery disappears once you look at a scalar version. Consider the one-dimensional recurrence $h_t = \tanh(w\,h_{t-1} + x_t)$ with weight $w$. In the linear regime, where $\tanh'(z) \approx 1$, the local Jacobian is just $w$, and the full gradient over $T$ steps is
$$\frac{\partial h_T}{\partial h_0} = w^{T}$$
Plug in real numbers. For a sentence of 20 words ($T = 20$):
| Weight w | After 20 steps (w²⁰) | Interpretation |
|---|---|---|
| 0.5 | 9.5 × 10⁻⁷ | Gradient has vanished — one part in a million |
| 0.7 | 8.0 × 10⁻⁴ | Still 1000× too small to drive learning |
| 0.9 | 1.2 × 10⁻¹ | Usable, but already losing 90% of signal |
| 1.0 | 1.0 | Edge of stability — no decay, no growth |
| 1.1 | 6.7 | Mild explosion — numerics start to hurt |
| 1.5 | 3325 | Explodes — training will diverge |
| 2.0 | 1,048,576 | Complete numerical meltdown |
This is the entire story in a single table. The vanishing gradient problem is not subtle: it is simply exponential decay. If you want the earliest word of a sentence to influence the last word's loss, every multiplicative factor along the chain must be very close to 1 — and random initialization plus tanh saturation conspire to make that almost impossible.
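The table can be reproduced in a few lines. This is a minimal sketch assuming the linear regime described above; the helper name `gradient_after` is hypothetical. The loop multiplies one factor per step to mirror how the chain rule accumulates the product.

```python
def gradient_after(w, steps=20):
    """Gradient factor for a linear scalar recurrence: w multiplied in once per step."""
    grad = 1.0
    for _ in range(steps):
        grad *= w   # one backward step through time
    return grad

for w in (0.5, 0.7, 0.9, 1.0, 1.1, 1.5, 2.0):
    print(f"w = {w:<4} after 20 steps: {gradient_after(w):.3e}")

# How much stronger the signal is at w = 0.9 than at w = 0.5.
ratio = gradient_after(0.9) / gradient_after(0.5)
print(f"0.9^20 / 0.5^20 = {ratio:,.0f}")
```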
Try it yourself. In a Python shell, type 0.5**20. Then type 0.9**20. The difference between a usable and a useless gradient is a factor of more than 100,000, produced by changing one number from 0.5 to 0.9.
How Activation Functions Make It Worse
The bound assumes the activation derivative sits at its maximum value of 1. In practice it does not. Each activation introduces its own shrinking factor:
| Activation | Max derivative | Typical derivative (\|z\| ≈ 1) | Effect on gradient |
|---|---|---|---|
| Sigmoid σ(z) | 0.25 | ≈ 0.20 | Multiplies by ≤ 1/4 at EVERY step: guaranteed vanishing |
| Tanh(z) | 1.00 | ≈ 0.42 | Best of the classics, but saturates past \|z\| > 2 |
| ReLU | 1.00 | 1 or 0 | No decay on positives, but dies on negatives (exploding risk on positives) |
| Identity | 1.00 | 1.00 | Theoretical ideal — but removes the nonlinearity RNNs need |
Sigmoid is the most cruel: even if $\|W_{hh}\| = 1$, the per-step gradient is bounded by $0.25$. Over 20 steps that gives $0.25^{20} \approx 9.1 \times 10^{-13}$. That is why modern RNN implementations almost always use tanh (or, inside LSTMs, use sigmoid only as a gate, never on the gradient highway).
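The derivative values in the table are easy to verify by hand. The following sketch uses only the standard library; the helper names `d_sigmoid` and `d_tanh` are illustrative, not taken from the chapter's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for name, deriv in (("sigmoid", d_sigmoid), ("tanh", d_tanh)):
    print(f"{name:8s} at z=0: {deriv(0.0):.2f}   at z=1: {deriv(1.0):.2f}   at z=2: {deriv(2.0):.3f}")

# Best case for sigmoid over 20 steps, even with a recurrent weight norm of 1.
print(f"0.25 ** 20 = {0.25 ** 20:.2e}")
```

Both derivatives peak at z = 0, so the "max derivative" column is simply the value printed for z = 0.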
Vanishing vs. Exploding: Two Faces of the Same Problem
Vanishing and exploding gradients are the same phenomenon with different signs in the exponent. Both arise because the gradient is a product, not a sum. Small differences in each factor compound multiplicatively.
| Symptom | Vanishing | Exploding |
|---|---|---|
| Cause | ‖W_hh‖ · max\|σ′\| < 1 | ‖W_hh‖ · max\|σ′\| > 1 |
| Training behavior | Loss plateaus; long-range patterns never learned | Loss → NaN; weights blow up |
| Detectability | Silent — network seems to work, just never improves | Loud — you see the numbers overflow |
| Classical fix | Better architecture (LSTM/GRU), identity init, skip connections | Gradient clipping, smaller learning rate |
| Which is worse? | Harder to detect, so often much worse in practice | Annoying but easy to catch and clip |
Exploding gradients are usually fixed in 3 lines of code with torch.nn.utils.clip_grad_norm_ — the standard technique we already saw in Section 2. Vanishing gradients, however, cannot be clipped away. The signal is simply not there. To recover it, we need to change the architecture so that gradients can flow through a different, non-multiplicative path. That is exactly what LSTM and GRU accomplish.
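To make the asymmetry concrete, here is a framework-free sketch of the idea behind norm clipping. It is not PyTorch's implementation; the function name `clip_by_global_norm` and the small constant in the denominator are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm exceeds the threshold; never scale up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# An exploding gradient is pulled back to the threshold.
exploding = [[300.0, -400.0], [1200.0]]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"exploding: {norm_before:.1f} -> {norm_after:.6f}")

# A vanished gradient passes through untouched: clipping cannot amplify it.
vanished = [[3e-7, -4e-7]]
unchanged, tiny_norm = clip_by_global_norm(vanished, max_norm=1.0)
print(f"vanished:  {tiny_norm:.1e} -> unchanged = {unchanged == vanished}")
```

Clipping preserves the direction of the update and only shortens it, which is why it tames explosion but offers nothing when the gradient is already near zero.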
Watch Gradients Flow Through Time
Before we run any code, let us see the phenomenon directly. Below is an interactive RNN where you can choose the activation, the weight scale, and the sequence length. Watch how the earliest hidden state receives almost no signal at all once you push the sequence past eight or ten steps, and how sigmoid is far worse than tanh.
Python from Scratch: Watching Gradients Die
The surest way to believe the mathematics is to compute it. In this code block we trace the scalar gradient step by step for a one-dimensional RNN with weight $w = 0.5$. Every line has a click-to-reveal annotation showing the exact values flowing through the computer's memory: the hidden state, the local Jacobian at each step, and the cumulative product that is the backpropagated gradient.
Read the right-hand column of the output: cumprod. At $t = 1$ it is 0.4988. At $t = 10$ it is 9.7 × 10⁻⁴. Relative to a starting gradient of 1, that is a thousand-fold shrinkage in ten steps, and every step is the same operation: multiplying by roughly 0.5. No bug, no pathology; just the chain rule doing exactly what it is supposed to do.
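A trace of this kind can be sketched as follows. This is a minimal reconstruction under assumptions, not the original listing: the constant inputs of 0.05, the zero initial state, and the function name `trace_scalar_rnn` are all hypothetical, so the printed values need not match the figures quoted above digit for digit.

```python
import math

def trace_scalar_rnn(w=0.5, inputs=None, h0=0.0):
    """Forward a scalar tanh RNN, accumulating the product of local Jacobians."""
    if inputs is None:
        inputs = [0.05] * 10      # assumed small inputs, keeps tanh near-linear
    h = h0
    cumprod = 1.0
    rows = []
    for t, x in enumerate(inputs, start=1):
        h = math.tanh(w * h + x)          # hidden state at step t
        local = w * (1.0 - h * h)         # d h_t / d h_{t-1}
        cumprod *= local                  # d h_t / d h_0
        rows.append((t, h, local, cumprod))
    return rows

rows = trace_scalar_rnn()
print(" t        h    local    cumprod")
for t, h, local, cumprod in rows:
    print(f"{t:2d} {h:8.4f} {local:8.4f} {cumprod:10.3e}")
```

Each local Jacobian is slightly below 0.5 because the tanh derivative is slightly below 1, so the cumulative product falls a little faster than 0.5 raised to the step count.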
PyTorch: Measuring Vanishing Gradients
Now let us scale up to a real RNN and use PyTorch's autograd to measure $\|\partial L / \partial h_t\|$ at every timestep. The key trick is retain_grad(), which tells PyTorch to keep the gradient on intermediate (non-leaf) tensors so we can inspect them. We initialize $W_{hh}$ with spectral radius ≈ 0.5, pass 20 tiny inputs through the unrolled network, and let backward() do the rest.
The printed gradient norms tell the whole story: at $t = 20$ the gradient norm is 2.00, at $t = 1$ it is 2.35 × 10⁻⁴. The first hidden state receives a learning signal nearly 10,000 times weaker than the last. Whatever information about the earliest input is encoded in $h_1$, gradient descent can barely adjust the weights to use it: that contribution to the update is swamped by the much larger contributions from recent timesteps.
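The measurement can be sketched along the following lines. This is an illustration under assumptions rather than the original listing: the hidden size of 16, the scaled-orthogonal initialization, the input weight `W_xh`, and the loss (sum of the final state) are all hypothetical choices, so the printed values will differ from the figures quoted above.

```python
import torch

torch.manual_seed(0)
hidden, steps = 16, 20

# Assumption: W_hh is a random orthogonal matrix scaled by 0.5, so every
# singular value, and hence the spectral radius, equals 0.5.
q, _ = torch.linalg.qr(torch.randn(hidden, hidden))
W_hh = (0.5 * q).requires_grad_()
W_xh = (0.1 * torch.randn(hidden, hidden)).requires_grad_()

h = torch.zeros(hidden)
states = []
for _ in range(steps):
    x = 0.01 * torch.randn(hidden)        # tiny inputs keep tanh near its linear regime
    h = torch.tanh(W_hh @ h + W_xh @ x)
    h.retain_grad()                       # keep .grad on this non-leaf tensor
    states.append(h)

loss = states[-1].sum()                   # hypothetical loss on the final state only
loss.backward()

grad_norms = [s.grad.norm().item() for s in states]
for t, g in enumerate(grad_norms, start=1):
    print(f"t = {t:2d}   ||dL/dh_t|| = {g:.3e}")
```

Without the `retain_grad()` call, `s.grad` would be `None` for every intermediate state, because autograd frees gradients of non-leaf tensors by default.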
Try it yourself. Change the 0.5 in the W_hh initialization to 1.5. Now the spectral radius is above 1, and the gradient norms explode backward: you will see numbers like 10⁸ at early timesteps before tanh saturation catches up. Same code, same framework, opposite failure mode.
Long-Term Dependencies in Practice
Why does this mathematical curiosity matter? Because language is full of long-distance dependencies. Consider:
“The cats, which we adopted from the shelter that opened last year, sit quietly.”
The verb must agree with cats (plural), but 13 tokens sit between them. For a vanilla RNN with vanishing gradients, by the time we are predicting “sit”, the signal from “cats” has been multiplied by 13 copies of the step Jacobian $\operatorname{diag}(1 - h_k^{2})\,W_{hh}$. With a typical gradient decay factor of ~0.7 per step, that is $0.7^{13} \approx 0.0097$, about 1% of the original signal: the subject is effectively invisible at the verb position.
Below you can select sentences of varying distance between subject and verb and watch the gradient signal decay as it flows back through the tokens. Short-distance agreement is easy; long-distance agreement is nearly hopeless.
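The same decay can be tabulated for a few subject-verb distances. This sketch assumes the per-step decay factor of 0.7 used in the text; the helper name `surviving_signal` and the chosen distances are illustrative.

```python
def surviving_signal(decay, distance):
    """Fraction of the gradient signal left after `distance` backward steps."""
    return decay ** distance

for distance in (1, 3, 5, 13, 25):
    fraction = surviving_signal(0.7, distance)
    print(f"{distance:2d} tokens apart: {fraction:.4f} of the signal survives")
```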
A Historical Detour: Who Discovered This?
The vanishing gradient problem was not understood until almost a decade after backpropagation was popularized. Early RNN practitioners knew their models struggled on long sequences but could not explain why. The first rigorous analysis came from Sepp Hochreiter's 1991 diploma thesis at TU Munich, a largely overlooked document at the time. Yoshua Bengio, Patrice Simard, and Paolo Frasconi gave the problem broader visibility in 1994. In 1997, six years after that thesis, Hochreiter and Jürgen Schmidhuber published the paper that solved it: LSTM.
The Roadmap to LSTM and GRU
If the problem is that every backward step multiplies the gradient, the fix must change that. There are really only three options:
- Change initialization. Initialize as identity (IRNN), orthogonal, or unitary so its eigenvalues sit on the unit circle. This delays vanishing but does not eliminate it — once the weights start training, the spectral radius drifts.
- Change the activation path. Use ReLU or LeakyReLU so the derivative is 1 wherever the activation is positive. Helps, but leaves the matrix product problem untouched.
- Change the architecture. Introduce an additive path alongside the multiplicative one: the cell state in an LSTM, or the gated combination in a GRU. Gradients travel along the additive path almost unchanged, bypassing the vanishing problem entirely. This is what actually worked.
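The difference between the multiplicative and the additive path shows up even in a scalar toy model. The sketch below is a deliberate simplification, not an LSTM: the forget gate is held constant and the indirect gradient paths through the hidden state are ignored. Both function names and all numeric settings are assumptions.

```python
import math

def vanilla_gradient(w, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)     # weight times tanh derivative, every step
    return grad

def additive_cell_gradient(forget_gate, steps):
    """d c_T / d c_0 for c_t = f * c_{t-1} + (new content), with f held constant."""
    grad = 1.0
    for _ in range(steps):
        grad *= forget_gate           # only the gate value scales the gradient
    return grad

inputs = [0.1] * 50
print(f"vanilla tanh RNN, w = 0.9 : {vanilla_gradient(0.9, inputs):.3e}")
print(f"additive cell,   f = 0.99: {additive_cell_gradient(0.99, 50):.3e}")
print(f"additive cell,   f = 1.0 : {additive_cell_gradient(1.0, 50):.3e}")
```

Over 50 steps the vanilla path loses almost everything even with a weight close to 1, while the additive path with a gate near 1 keeps most of the signal.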
Chapter 17 will build LSTMs and GRUs from these principles, and you will see that the “complicated” gates are just the smallest machinery needed to give gradients a non-multiplicative highway through time. Once you understand vanishing gradients, LSTM stops looking mysterious — it looks inevitable.
The design principle. If you want a neural network to remember something across many steps, you must build a path along which the gradient is not continuously multiplied by a learned matrix. Additive skip connections — whether in LSTMs, GRUs, residual networks, or transformers — are the single most important structural idea in modern deep learning.
Key Takeaways
- The gradient is a product, not a sum. $\partial h_T / \partial h_t$ is a chain of Jacobians multiplied together.
- Vanilla RNNs share weights across time. Every factor in that product involves the same $W_{hh}$, so its spectral radius dominates.
- Spectral radius < 1 → vanishing; spectral radius > 1 → exploding. The edge of stability ($\rho(W_{hh}) = 1$) is a knife-edge that training cannot maintain.
- Activation functions shrink gradients further. Sigmoid caps the per-step factor at 0.25; tanh saturates once pre-activations leave $[-2, 2]$.
- Exploding gradients are easy to detect and fix (clip them). Vanishing gradients are silent and devastating — the network simply never learns long-range patterns.
- The architectural fix is an additive gradient path. LSTM, GRU, residual nets, and transformer skip-connections all share this idea: provide a route through the computation where gradients are not multiplied by a learned matrix at every step.
- Hochreiter (1991) and Bengio et al. (1994) gave the problem its rigorous analysis. Hochreiter and Schmidhuber's 1997 LSTM paper gave the first clean solution.