Boo-AI — Master Artificial Intelligence by Building from Scratch

The Memory Problem

In the last section we built Backpropagation Through Time and watched an RNN actually learn. It worked beautifully — on sentences of four words. But if we had pushed that same network on a sentence of fifty words and asked it to connect a subject to a verb far, far away, we would have watched it fail silently. No error. No crash. The weights just would not move in the right direction. The network would behave, stubbornly, as if the first half of the sequence did not exist.

This is the vanishing gradient problem, and it is not a bug. It is a mathematical consequence of how RNNs share weights across time. In this section we are going to see, with full rigor and with numbers you can click through line by line, exactly why the signal dies on its way back through time, and why this single difficulty held back sequence modeling for nearly a decade — until LSTM and GRU stepped in.

The one-sentence idea. A vanilla RNN multiplies its gradient by the same matrix $W_{hh}$ at every step backward. A product of many matrices with spectral radius less than 1 collapses to zero exponentially; a product with spectral radius greater than 1 explodes to infinity. There is almost no middle ground.

The Chain Rule Unrolled Through Time

Recall the RNN update from Chapter 16 Section 2:

$h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h)$

We want to train this with gradient descent, so we need $\partial L / \partial W_{hh}$ . The loss $L$ depends on the final hidden state $h_T$ , and $h_T$ depends on $h_{T-1}$ , which depends on $h_{T-2}$ , all the way back to $h_1$ . Applying the chain rule through this dependency chain gives

$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}}.$

That sum looks innocent, but every single term contains the factor $\partial h_T / \partial h_t$ — a Jacobian that spans many timesteps. If even one of those factors is tiny, the corresponding term contributes almost nothing to the update of $W_{hh}$ . If every one of them is tiny for early $t$ , the network gets essentially no information about what happened at the start of the sequence.

Mental picture. Imagine a rope with 20 knots. You shake the last knot. The wave travels back along the rope. If the rope absorbs most of the energy at each knot, nothing reaches the first knot. Weight-tied RNNs are a rope where every knot has the same absorption coefficient — and that coefficient is usually less than 1.

The Jacobian Product: Where It All Goes Wrong

Expand $\partial h_T / \partial h_t$ using the chain rule again, one step at a time:

$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}.$

Each factor is the Jacobian of one RNN step. For the tanh recurrence above it equals

$\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\!\bigl(1 - h_k^{\,2}\bigr)\; W_{hh}.$

So the matrix that transports the gradient from time $T$ back to time $t$ is a product of T - t copies of $\operatorname{diag}(1 - h_k^2)\, W_{hh}$ . Taking norms and using submultiplicativity of matrix norms,

$\left\| \frac{\partial h_T}{\partial h_t} \right\| \leq \prod_{k=t+1}^{T} \left\| \operatorname{diag}(1-h_k^2) \right\| \; \left\| W_{hh} \right\|.$

The diagonal factor is bounded by $1$ because $\tanh' \leq 1$ . So in the best case the bound collapses to

$\left\| \frac{\partial h_T}{\partial h_t} \right\| \leq \left\| W_{hh} \right\|^{\,T-t}.$

Two regimes jump out immediately:

If $\|W_{hh}\| < 1$ : the bound is $c^{T-t}$ with $c < 1$ . The gradient shrinks exponentially in the sequence length. This is the vanishing regime.
If $\|W_{hh}\| > 1$ : the matrix $W_{hh}^{T-t}$ can grow exponentially. Unless tanh saturation happens to cancel it, gradients explode.

The spectral radius

\rho(W_{hh})

(the largest absolute eigenvalue) is the quantity that really determines stability. Only when

\rho(W_{hh}) = 1

exactly — the edge of stability — can a vanilla RNN propagate gradients cleanly over many steps. Training never sits on that edge.

A Scalar Intuition: Why 0.5²⁰ Destroys Learning

All of the matrix mystery disappears once you look at a scalar version. Consider the one-dimensional recurrence $h_t = \tanh(w\, h_{t-1})$ with weight $w$ . In the linear regime the local Jacobian is just $w$ , and the full gradient over $T$ steps is

$\frac{\partial h_T}{\partial h_0} \approx w^{\,T}.$

Plug in real numbers. For a sentence of 20 words ( $T = 20$ ):

Weight w	After 20 steps (w²⁰)	Interpretation
0.5	9.5 × 10⁻⁷	Gradient has vanished — one part in a million
0.7	7.9 × 10⁻⁴	Still 1000× too small to drive learning
0.9	1.2 × 10⁻¹	Usable, but already losing 90% of signal
1.0	1.0	Edge of stability — no decay, no growth
1.1	6.7	Mild explosion — numerics start to hurt
1.5	3325	Explodes — training will diverge
2.0	1,048,576	Complete numerical meltdown

This is the entire story in a single table. The vanishing gradient problem is not subtle: it is simply exponential decay. If you want the earliest word of a sentence to influence the last word's loss, every multiplicative factor along the chain must be very close to 1 — and random initialization plus tanh saturation conspire to make that almost impossible.

Try this yourself: open a Python REPL and type 0.5**20. Then type 0.9**20. The difference between usable and useless gradient is a factor of 100,000, produced by changing one number from 0.5 to 0.9.

How Activation Functions Make It Worse

The bound $\|W_{hh}\|^{\,T-t}$ assumes the activation derivative sits at its maximum value of 1. In practice it does not. Each activation introduces its own shrinking factor:

Activation	Max derivative	Typical derivative (\|z\| ≈ 1)	Effect on gradient
Sigmoid σ(z)	0.25	≈ 0.20	Multiplies by ≤ 1/4 at EVERY step — guaranteed vanishing
Tanh(z)	1.00	≈ 0.42	Best of the classics, but saturates past \|z\| > 2
ReLU	1.00	1 or 0	No decay on positives, but dies on negatives (exploding risk on positives)
Identity	1.00	1.00	Theoretical ideal — but removes the nonlinearity RNNs need

Sigmoid is the most cruel: even if $\|W_{hh}\| = 1$ , the per-step gradient is bounded by $0.25$ . Over 20 steps that gives $0.25^{20} \approx 10^{-12}$ . That is why modern RNN implementations almost always use tanh (or, inside LSTMs, use sigmoid only as a gate, never on the gradient highway).

Why tanh and not ReLU in vanilla RNNs? ReLU has derivative 1 for positive inputs — great for gradient flow — but the recurrent update

W_{hh}\, h_{t-1}

is unbounded, so activations can blow up within the forward pass. Tanh caps

h_t

at ±1 and stabilizes the forward dynamics, at the cost of gradient saturation. LSTMs will later break this trade-off by decoupling the gradient path from the activation path.

Vanishing vs. Exploding: Two Faces of the Same Problem

Vanishing and exploding gradients are the same phenomenon with different signs in the exponent. Both arise because the gradient is a product, not a sum. Small differences in each factor compound multiplicatively.

Symptom	Vanishing	Exploding
Cause	‖W_hh‖ · max\|σ′\| < 1	‖W_hh‖ · max\|σ′\| > 1
Training behavior	Loss plateaus; long-range patterns never learned	Loss → NaN; weights blow up
Detectability	Silent — network seems to work, just never improves	Loud — you see the numbers overflow
Classical fix	Better architecture (LSTM/GRU), identity init, skip connections	Gradient clipping, smaller learning rate
Which is worse?	Harder to detect, so often much worse in practice	Annoying but easy to catch and clip

Exploding gradients are usually fixed in 3 lines of code with torch.nn.utils.clip_grad_norm_ — the standard technique we already saw in Section 2. Vanishing gradients, however, cannot be clipped away. The signal is simply not there. To recover it, we need to change the architecture so that gradients can flow through a different, non-multiplicative path. That is exactly what LSTM and GRU accomplish.

Watch Gradients Flow Through Time

Before we run any code, let us see the phenomenon directly. Below is an interactive RNN where you can choose the activation, the weight scale ( $\|W_{hh}\|$ ), and the sequence length. Watch how the earliest hidden state receives almost no signal at all once you push the sequence past eight or ten steps, and how sigmoid is far worse than tanh.

Loading gradient flow visualization...

Things to try. (1) Pick sigmoid with weight scale 1.0 and set sequence length to 20 — the network is blind beyond step 10. (2) Pick tanh with weight scale 1.2 — you will see mild explosion. (3) Pick tanh with weight scale 1.0 — the closest thing to a stable RNN, and even this is fragile.

Python from Scratch: Watching Gradients Die

The surest way to believe the mathematics is to compute it. In this code block we trace the scalar gradient $\partial h_t / \partial h_0$ step by step for a one-dimensional RNN with weight $w = 0.5$ . Every line has a click-to-reveal annotation showing the exact values flowing through the computer's memory: the hidden state, the local Jacobian at each step, and the cumulative product that IS the backpropagated gradient.

Scalar RNN — the gradient product line by line

🐍vanishing_scalar.py

Explanation(16)

Code(20)

1import numpy as np

NumPy gives us the fundamentals we need: `np.tanh` for the nonlinearity and basic arithmetic on floats. We do not use any matrix here — this is a one-dimensional toy so that the vanishing gradient formula is visible as plain arithmetic.

EXECUTION STATE

numpy = Numerical library. Here we only use np.tanh, but the same logic extends directly to matrices.

as np = Standard alias — lets us write np.tanh instead of numpy.tanh.

3Comment — what this function simulates

A 1-dimensional RNN: every hidden state is a single scalar h_t. The recurrence is h_t = tanh(w · h_{t-1}). No input is used — we isolate the recurrent gradient path so that nothing else can distort what we observe.

4Comment — target quantity

We want to see ∂h_t / ∂h_0 at every step t. This is the `signal` that flows from the future back to the beginning. If it vanishes, the RNN cannot learn any dependency that crosses t steps.

5def trace_gradient(w, T, h0=0.1) → list

Runs the RNN forward for T steps and, at each step, records the local Jacobian and the full multiplicative product so far.

EXECUTION STATE

⬇ input: w = The single recurrent weight. If |w| < 1 gradients shrink; if |w| > 1 they grow. We will call this with w=0.5.

⬇ input: T = Number of timesteps to simulate. We use T=10 — long enough to see the product collapse toward zero.

⬇ input: h0=0.1 = Initial hidden state. Small enough to keep tanh in its near-linear regime so the mathematics is crystal clear.

⬆ returns = A list of tuples (t, h_t, local Jacobian at t, cumulative product up to t) for t = 1..T.

6h = h0

Start with h = 0.1. Everything that happens next is deterministic because there is no input and no randomness in the recurrence.

EXECUTION STATE

h = 0.1 — the hidden state at t=0.

7cumulative = 1.0

The multiplicative identity. We will multiply this by the local Jacobian at every step to accumulate ∂h_t / ∂h_0.

EXECUTION STATE

cumulative = 1.0 — at t=0, ∂h_0/∂h_0 = 1 trivially.

→ why multiply? = By the chain rule, ∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) · (∂h_{t-1}/∂h_{t-2}) · ... · (∂h_1/∂h_0). A product of scalars.

8log = []

An empty list we will append (t, h_t, local, cumulative) tuples into, so we can print a full trace at the end.

EXECUTION STATE

log = [] — will hold one entry per timestep.

9for t in range(1, T + 1):

Loop over timesteps 1, 2, ..., T. We use 1-indexed t because h_0 is the starting state — the first update produces h_1.

LOOP TRACE · 4 iterations

t=1

body runs with t=1 = produces h_1, local_1, cum_1

t=2

body runs with t=2 = produces h_2, local_2, cum_2

t=3

body runs with t=3 = ...

t=4..10

body runs for each t = gradient keeps shrinking

10h = np.tanh(w * h) — the forward step

Compute the next hidden state. For w=0.5 and small h, tanh is nearly linear: tanh(x) ≈ x. So each step roughly halves h: 0.1 → 0.05 → 0.025 → ... → 0.0001. The hidden state melts away.

EXECUTION STATE

📚 np.tanh(x) = Hyperbolic tangent: (e^x − e^{−x}) / (e^x + e^{−x}). Squashes any real number into (−1, 1). Nearly linear near 0; saturates at ±1 for |x| > 2.

⬇ arg: w * h = At t=1: 0.5 × 0.1 = 0.05. At t=2: 0.5 × 0.0500 = 0.025. At t=3: ≈ 0.0125.

⬆ result: h after step = t=1: 0.049958 ≈ 0.05 t=2: 0.024974 ≈ 0.025 t=3: 0.012486 t=4: 0.006243 t=5: 0.003122

12local = w * (1.0 - h ** 2) — the local Jacobian

Derivative of tanh(w·h_{t-1}) with respect to h_{t-1}: by the chain rule it is w · tanh'(w·h_{t-1}) = w · (1 − h_t²). This is the multiplicative factor that the gradient picks up at this step.

EXECUTION STATE

Formula: d/dh_{t-1} tanh(w·h_{t-1}) = = w · (1 − tanh²(w·h_{t-1})) = w · (1 − h_t²).

w = 0.5 — the recurrent weight.

(1 − h**2) = The derivative of tanh evaluated at the post-activation: always in (0, 1].

⬆ local Jacobian values = t=1: 0.5 × (1 − 0.05²) = 0.4988 t=2: 0.5 × (1 − 0.025²) = 0.4997 t=3..: ≈ 0.5 (h shrinks to ~0)

→ intuition = Each step multiplies the backward-flowing gradient by ≈ 0.5. Ten steps means (0.5)^10 ≈ 0.001. Twenty steps means (0.5)^20 ≈ 10⁻⁶.

14cumulative *= local — accumulate the chain rule

Multiply the running product by this step's local Jacobian. After the loop, cumulative = ∂h_T / ∂h_0 exactly (for this scalar recurrence).

EXECUTION STATE

⬆ cumulative after each step = t=1: 0.4988 t=2: 0.2492 t=3: 0.1246 t=4: 0.0623 t=5: 0.0311 t=6: 0.0156 t=7: 0.00779 t=8: 0.00389 t=9: 0.00195 t=10: 0.000973

→ what this means = After 10 steps the gradient signal is ~1000× weaker than it started. After 20 steps, ~10⁶× weaker. The gradient has effectively vanished.

15log.append((t, h, local, cumulative))

Record this step so we can print the full trace after the loop finishes.

EXECUTION STATE

log contents (first few) = (1, 0.0500, 0.4988, 4.99e-1) (2, 0.0250, 0.4997, 2.49e-1) (3, 0.0125, 0.4999, 1.25e-1)

16return log

Hand the recorded trace back to the caller so it can be inspected or printed.

EXECUTION STATE

⬆ return: log = List of 10 tuples — one per timestep.

18# Case 1: small weight -> vanishing gradient

Comment marking this as the vanishing case. w=0.5 gives spectral radius 0.5, far below 1.

19for t, h, local, cum in trace_gradient(w=0.5, T=10):

Unpack each tuple from the returned log. `t` is the timestep, `h` is the current hidden state, `local` is the local Jacobian at this step, and `cum` is the cumulative product (= ∂h_t/∂h_0).

EXECUTION STATE

trace_gradient(w=0.5, T=10) = Runs the simulation with recurrent weight 0.5 for 10 steps.

unpacking: t, h, local, cum = Python tuple-unpacking: each iteration binds these four names to the four fields of the tuple.

20print(f"t={t:2d} h={h:8.5f} local={local:.5f} cumprod={cum:.3e}")

Format every row of the trace. This is the concrete, numeric evidence that gradients vanish exponentially in this system.

EXECUTION STATE

📚 f-string format specs = `:2d` — integer in 2 columns. `:8.5f` — float, 8 columns wide, 5 decimals. `:.3e` — scientific notation with 3 decimals (for the tiny numbers).

⬆ full printed output = t= 1 h= 0.04996 local=0.49875 cumprod=4.988e-01 t= 2 h= 0.02497 local=0.49969 cumprod=2.492e-01 t= 3 h= 0.01249 local=0.49992 cumprod=1.246e-01 t= 4 h= 0.00624 local=0.49998 cumprod=6.229e-02 t= 5 h= 0.00312 local=0.50000 cumprod=3.115e-02 t= 6 h= 0.00156 local=0.50000 cumprod=1.557e-02 t= 7 h= 0.00078 local=0.50000 cumprod=7.787e-03 t= 8 h= 0.00039 local=0.50000 cumprod=3.893e-03 t= 9 h= 0.00020 local=0.50000 cumprod=1.947e-03 t=10 h= 0.00010 local=0.50000 cumprod=9.733e-04

→ the key observation = The cumprod column IS the gradient. It halves every step, like (0.5)^t. At t=10 the gradient is ≈ 1/1000 of its starting value. This is the vanishing gradient problem in its purest form.

4 lines without explanation

1import numpy as np
2
3# Scalar RNN: h_t = tanh(w * h_{t-1})
4# We track how the gradient dh_t/dh_0 shrinks at every step.
5def trace_gradient(w, T, h0=0.1):
6    h = h0
7    cumulative = 1.0
8    log = []
9    for t in range(1, T + 1):
10        h = np.tanh(w * h)
11        # Local Jacobian at step t: dh_t / dh_{t-1}
12        local = w * (1.0 - h ** 2)
13        # Full gradient dh_t / dh_0 is the product of local Jacobians
14        cumulative *= local
15        log.append((t, h, local, cumulative))
16    return log
17
18# Case 1: small weight -> vanishing gradient
19for t, h, local, cum in trace_gradient(w=0.5, T=10):
20    print(f"t={t:2d}  h={h:8.5f}  local={local:.5f}  cumprod={cum:.3e}")

Read the right-hand column of the output: cumprod. At $t=1$ it is 0.4988. At $t=10$ it is 9.7 × 10⁻⁴. That is a thousand-fold shrinkage in ten steps, and every step is the same operation — multiplying by roughly 0.5. No bug, no pathology; just the chain rule doing exactly what it is supposed to do.

The output matches the mathematics to six decimal places. There is no approximation. This is the vanishing gradient problem — not a stylized illustration of it.

PyTorch: Measuring Vanishing Gradients

Now let us scale up to a real RNN and use PyTorch's autograd to measure $\|\partial L / \partial h_t\|$ at every timestep. The key trick is retain_grad(), which forces PyTorch to keep the gradient on intermediate tensors so we can inspect them. We initialize $W_{hh}$ with spectral radius ≈ 0.5, pass 20 tiny inputs through the unrolled network, and let backward() do the rest.

PyTorch — measuring ||dL/dh_t|| at every step

🐍vanishing_pytorch.py

Explanation(28)

Code(36)

1import torch

PyTorch is our deep learning framework. Unlike the NumPy version above, PyTorch tracks every operation in a computation graph and automatically computes gradients via `backward()`. This is exactly the machinery that makes BPTT practical at scale.

EXECUTION STATE

torch = Core PyTorch — tensors, autograd, operations.

2import torch.nn as nn

`nn` contains the building blocks of neural networks: `nn.Linear`, `nn.LSTM`, activation modules, loss functions. We alias it `nn` by convention.

EXECUTION STATE

nn.Linear = Learnable weight matrix (optionally with bias). We will use it to implement W_xh and W_hh.

4torch.manual_seed(0)

Makes every random number reproducible. Without this, each run would produce different gradient norms and we could not compare them across runs.

EXECUTION STATE

📚 torch.manual_seed(s) = Seeds PyTorch's CPU RNG. Any call to torch.randn after this is deterministic.

⬇ arg: 0 = The seed. 0 is arbitrary — any fixed integer works.

6T = 20 — sequence length

We simulate a 20-step sequence. Long enough that the gradient magnitude difference between the first and last step is thousands of orders of magnitude.

EXECUTION STATE

T = 20 = 20 timesteps. Typical sentence length in NLP tasks.

7hidden_size = 4

The dimension of the hidden state vector h_t. Small enough that we can keep intuition but real enough that the Jacobian is a 4×4 matrix (not a scalar).

EXECUTION STATE

hidden_size = 4 = Each h_t is a 4-dimensional vector. W_hh is therefore 4×4.

8input_size = 2

Each input x_t is a 2-dimensional vector (e.g., a tiny word embedding).

EXECUTION STATE

input_size = 2 = W_xh maps 2-d input → 4-d hidden.

10# Recurrent weights chosen so spectral radius < 1 (vanishing regime)

Comment flagging that this is the `vanishing` configuration. Try changing the 0.5 below to 1.5 to see exploding gradients instead.

11W_xh = nn.Linear(input_size, hidden_size, bias=False)

Creates a learnable input-to-hidden projection. For every timestep we will pass x_t through this layer and add it into the recurrence.

EXECUTION STATE

📚 nn.Linear(in, out, bias) = Creates a module with weight W of shape (out, in). Forward: y = x @ W.T + b (or no b).

⬇ arg 1: input_size = 2 = Number of input features per timestep.

⬇ arg 2: hidden_size = 4 = Number of hidden units.

⬇ arg 3: bias = False = No bias — keeps the demonstration about the recurrent weight only, not about affine offsets.

⬆ result: W_xh.weight = Tensor of shape (4, 2), initialized by PyTorch's default Kaiming-uniform scheme.

12W_hh = nn.Linear(hidden_size, hidden_size, bias=False)

The recurrent weight matrix. This is the star of the chapter: W_hh is applied at EVERY timestep, so its spectral properties determine whether gradients vanish or explode.

EXECUTION STATE

⬇ arg 1: hidden_size = 4 = Input: 4-d hidden state.

⬇ arg 2: hidden_size = 4 = Output: 4-d hidden state.

⬆ result: W_hh.weight = Tensor of shape (4, 4). We will override its values below.

→ the whole problem in one sentence = Every step backward multiplies the gradient by W_hh^T · diag(tanh'(z_t)). Do that 20 times and one matrix decides everything.

13with torch.no_grad():

A context manager that disables autograd. We use it here because we are MANUALLY setting the weight value — we do not want this assignment to be recorded as a differentiable operation.

EXECUTION STATE

📚 torch.no_grad() = Context manager. Inside the block, tensor operations do not build a computation graph. Essential when mutating parameters or running pure inference.

14W_hh.weight.copy_(...)

In-place copy: replaces the values of `W_hh.weight` with the tensor we build on the right. The underscore in `copy_` marks it as an in-place operation.

EXECUTION STATE

📚 .copy_(src) = In-place tensor copy. Writes `src` into `self` element-by-element. Shapes must match.

→ why in-place? = `W_hh.weight` is a registered Parameter. We must not replace it with a new tensor (that would break the optimizer's reference). `copy_` updates the values without changing the object.

150.5 * torch.eye(hidden_size)

A 4×4 identity matrix, multiplied by 0.5. By itself this matrix would give h_t = 0.5 · h_{t-1} along every axis — the matrix analogue of w=0.5 from our scalar example.

EXECUTION STATE

📚 torch.eye(n) = Creates an n×n identity matrix (1s on the diagonal, 0s elsewhere).

⬆ 0.5 * eye(4) =

        c0    c1    c2    c3
r0   0.5   0.0   0.0   0.0
r1   0.0   0.5   0.0   0.0
r2   0.0   0.0   0.5   0.0
r3   0.0   0.0   0.0   0.5

16+ 0.05 * torch.randn(hidden_size, hidden_size)

Adds a small random perturbation so the matrix is not perfectly diagonal — it behaves more like a real learned RNN. The 0.05 multiplier keeps the spectral radius well below 1.

EXECUTION STATE

📚 torch.randn(m, n) = Samples an m×n tensor of standard-normal values (mean 0, std 1). With manual_seed(0) already set, this output is deterministic.

→ why small noise? = Identity + small noise ≈ identity in spectral properties. This keeps the spectral radius near 0.5 so we are firmly in the vanishing regime.

⬆ effective W_hh = Roughly 0.5·I plus ~0.05-magnitude off-diagonal noise. Spectral radius ≈ 0.5.

20# Tiny random inputs so tanh stays in its linear regime

Comment. Why tiny inputs? Because tanh(x) ≈ x only for small x. If inputs saturate tanh, the local Jacobian (1 − h²) also shrinks, amplifying the vanishing effect and muddying what we want to measure.

21x = 0.05 * torch.randn(T, input_size)

A (20, 2) tensor of small random inputs — one 2-d vector per timestep. Multiplying by 0.05 keeps every |x_t| well below 0.1.

EXECUTION STATE

⬇ arg: T = 20 — the first dimension size (one row per timestep).

⬇ arg: input_size = 2 — the second dimension size.

⬆ x shape = (20, 2) — x[t] is the 2-d input at step t.

23# Build the unrolled RNN and retain gradients at each hidden state

Comment. `retain_grad()` is the trick that lets us inspect intermediate gradients — PyTorch only keeps gradients for leaf tensors by default.

24h = torch.zeros(hidden_size, requires_grad=True)

Initial hidden state h_0 = [0, 0, 0, 0]. `requires_grad=True` tells autograd to track gradients with respect to this tensor, so it behaves as an input-like leaf we can differentiate against.

EXECUTION STATE

📚 torch.zeros(shape) = Allocates a tensor of the given shape filled with 0.0.

⬇ arg: hidden_size = 4 = Shape (4,) — a 4-dim vector.

⬇ arg: requires_grad=True = Ask autograd to compute ∂loss/∂h_0 during backward().

⬆ h (h_0) = tensor([0., 0., 0., 0.])

25hidden_states = [h]

A list that will accumulate every hidden state we produce, starting with h_0. After the loop it will contain h_0, h_1, ..., h_T — 21 tensors in total.

EXECUTION STATE

hidden_states[0] = h_0 = [0, 0, 0, 0].

26for t in range(T):

Loop over timesteps t = 0, 1, ..., 19. Each iteration produces h_{t+1} from the previous hidden state and the current input.

LOOP TRACE · 4 iterations

t=0

effect = uses x[0] and hidden_states[-1] = h_0 → appends h_1

t=1

effect = uses x[1] and h_1 → appends h_2

t=...

effect = builds the full unrolled graph of 20 steps

t=19

effect = uses x[19] and h_19 → appends h_20

27h_new = torch.tanh(W_xh(x[t]) + W_hh(hidden_states[-1]))

The canonical RNN update: h_{t+1} = tanh(W_xh · x_t + W_hh · h_t). This is the single line that, repeated 20 times, builds the long Jacobian chain we want to analyze.

EXECUTION STATE

W_xh(x[t]) = Projects 2-d input to 4-d. Value: (4,) tensor.

W_hh(hidden_states[-1]) = Applies the recurrent matrix to the previous hidden state. Value: (4,) tensor.

📚 torch.tanh(x) = Element-wise hyperbolic tangent. Squashes each element into (−1, 1).

⬆ h_new = A (4,) tensor — the next hidden state, attached to the autograd graph through W_xh, W_hh, and the previous h.

28h_new.retain_grad()

By default, PyTorch only saves gradients for leaf tensors (Parameters, user-created requires_grad tensors). Intermediate tensors like h_new are discarded after backward. `retain_grad()` tells autograd to KEEP h_new.grad after the backward pass — which is exactly what we need to plot ||dL/dh_t||.

EXECUTION STATE

📚 tensor.retain_grad() = Asks autograd to preserve the gradient of this non-leaf tensor. Must be called before `backward()`.

→ why bother? = Without this, h_new.grad would be None after backward. We would only see gradients for W_xh.weight, W_hh.weight, and h_0 — not for the per-step hidden states.

29hidden_states.append(h_new)

Append the newly computed hidden state. The next iteration will use `hidden_states[-1]` (this h_new) as its previous state, linking the whole sequence into one autograd graph.

EXECUTION STATE

hidden_states growth = After t=0: length 2 (h_0, h_1). After t=19: length 21 (h_0 .. h_20).

31# Loss = sum of final hidden state (so dL/dh_T = 1)

We want the cleanest possible measurement of ∂L/∂h_t. By choosing L = sum(h_T), the gradient at the final hidden state is simply a vector of all 1s — no mixing, no scale factor. Any decay we observe is entirely due to W_hh and tanh'.

32loss = hidden_states[-1].sum()

Sum the 4 components of h_20 into a single scalar — this is what we will call `.backward()` on.

EXECUTION STATE

📚 tensor.sum() = Reduces a tensor to a single scalar by adding all its entries.

⬆ loss = A scalar tensor. ∂loss/∂h_T = [1, 1, 1, 1] trivially.

33loss.backward()

Runs backpropagation through the full 20-step unrolled graph. After this call, every tensor in the graph that we marked with requires_grad=True or retain_grad() has a `.grad` attribute holding ∂loss/∂tensor.

EXECUTION STATE

📚 tensor.backward() = Starts reverse-mode autodiff from this scalar. Walks the graph backward, accumulating gradients into every leaf and retained node.

→ what happens internally = At each step PyTorch multiplies by W_hh^T · diag(1 − h²). 20 such multiplications = exactly the Jacobian product from the math.

35# Inspect how the gradient norm shrinks as we go back in time

Comment. What we are about to print is the headline evidence: ||dL/dh_t|| plotted against t.

36for t, h_t in enumerate(hidden_states[1:], 1):

Iterate over h_1, h_2, ..., h_20 and number them 1, 2, ..., 20. We skip h_0 because its grad is stored on the leaf tensor directly (not retained).

EXECUTION STATE

📚 enumerate(iter, start) = Yields (index, value) pairs, beginning the index at `start`. Here start=1 so t matches the physical timestep.

hidden_states[1:] = All hidden states except h_0. Length 20.

LOOP TRACE · 7 iterations

t=1

h_t.grad.norm() = 2.354e-04

t=2

h_t.grad.norm() = 3.731e-04

t=5

h_t.grad.norm() = 1.486e-03

t=10

h_t.grad.norm() = 1.484e-02

t=15

h_t.grad.norm() = 1.514e-01

t=19

h_t.grad.norm() = 1.118e+00

t=20

h_t.grad.norm() = 2.000e+00

37print(f"t={t:2d}: ||dL/dh_t|| = {h_t.grad.norm().item():.3e}")

For each hidden state, print the L2 norm of its gradient. `.norm()` computes the Euclidean norm; `.item()` pulls the scalar out of the 0-dim tensor so we can format it with f-string.

EXECUTION STATE

📚 tensor.norm() = Default: L2 / Frobenius norm — sqrt(sum of squares).

📚 tensor.item() = Converts a 0-dim tensor to a Python float. Required for f-string formatting.

⬆ full output = t= 1: ||dL/dh_t|| = 2.354e-04 t= 2: ||dL/dh_t|| = 3.731e-04 t= 3: ||dL/dh_t|| = 5.914e-04 t= 4: ||dL/dh_t|| = 9.372e-04 t= 5: ||dL/dh_t|| = 1.486e-03 t= 6: ||dL/dh_t|| = 2.354e-03 t= 7: ||dL/dh_t|| = 3.730e-03 t= 8: ||dL/dh_t|| = 5.910e-03 t= 9: ||dL/dh_t|| = 9.362e-03 t=10: ||dL/dh_t|| = 1.484e-02 t=11: ||dL/dh_t|| = 2.356e-02 t=12: ||dL/dh_t|| = 3.739e-02 t=13: ||dL/dh_t|| = 5.944e-02 t=14: ||dL/dh_t|| = 9.468e-02 t=15: ||dL/dh_t|| = 1.514e-01 t=16: ||dL/dh_t|| = 2.436e-01 t=17: ||dL/dh_t|| = 3.974e-01 t=18: ||dL/dh_t|| = 6.565e-01 t=19: ||dL/dh_t|| = 1.118e+00 t=20: ||dL/dh_t|| = 2.000e+00

→ headline result = The gradient at t=1 is roughly 10,000× smaller than the gradient at t=20. Each step backward in time shrinks ||dL/dh_t|| by a factor of about 0.6 (close to our spectral radius of 0.5 plus some slack from the noise). The earliest hidden state receives almost no learning signal — the network is effectively deaf to long-range information.

8 lines without explanation

1import torch
2import torch.nn as nn
3
4torch.manual_seed(0)
5
6T = 20                  # sequence length
7hidden_size = 4
8input_size = 2
9
10# Recurrent weights chosen so spectral radius < 1 (vanishing regime)
11W_xh = nn.Linear(input_size, hidden_size, bias=False)
12W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
13with torch.no_grad():
14    W_hh.weight.copy_(
15        0.5 * torch.eye(hidden_size)
16        + 0.05 * torch.randn(hidden_size, hidden_size)
17    )
18
19# Tiny random inputs so tanh stays in its linear regime
20x = 0.05 * torch.randn(T, input_size)
21
22# Build the unrolled RNN and retain gradients at each hidden state
23h = torch.zeros(hidden_size, requires_grad=True)
24hidden_states = [h]
25for t in range(T):
26    h_new = torch.tanh(W_xh(x[t]) + W_hh(hidden_states[-1]))
27    h_new.retain_grad()
28    hidden_states.append(h_new)
29
30# Loss = sum of final hidden state (so dL/dh_T = 1)
31loss = hidden_states[-1].sum()
32loss.backward()
33
34# Inspect how the gradient norm shrinks as we go back in time
35for t, h_t in enumerate(hidden_states[1:], 1):
36    print(f"t={t:2d}: ||dL/dh_t|| = {h_t.grad.norm().item():.3e}")

The printed gradient norms tell the whole story: at $t=20$ the gradient norm is 2.00, at $t=1$ it is 2.35 × 10⁻⁴. The first hidden state receives a learning signal 10,000 times weaker than the last. Whatever information about the earliest input is encoded in $h_1$ , gradient descent can barely move its weights — the update step is buried in numerical noise.

Experiment to run locally. Change the 0.5 in the W_hh initialization to 1.5. Now the spectral radius is above 1, and the gradient norms explode backward — you will see numbers like 10⁸ at early timesteps before tanh saturation catches up. Same code, same framework, opposite failure mode.

Long-Term Dependencies in Practice

Why does this mathematical curiosity matter? Because language is full of long-distance dependencies. Consider:

“The cats, which we adopted from the shelter that opened last year, sit quietly.”

The verb must agree with cats (plural), but 13 tokens sit between them. For a vanilla RNN with vanishing gradients, by the time we are predicting “sit”, the signal from “cats” has been multiplied by 13 copies of $\operatorname{diag}(1-h^2)\, W_{hh}$ . With a typical gradient decay factor of ~0.7 per step, that is $0.7^{13} \approx 0.01$ — the subject is effectively invisible at the verb position.

Below you can select sentences of varying distance between subject and verb and watch the gradient signal decay as it flows back through the tokens. Short-distance agreement is easy; long-distance agreement is nearly hopeless.

Loading long-term dependency demo...

This is not just a linguistic problem. The same issue shows up everywhere sequences are long: time series with seasonal patterns spanning many steps, music with motifs that recur bars apart, video sequences with cause-and-effect across seconds, protein sequences where distant amino acids fold against each other. Any place you need memory that reaches back, a vanilla RNN fails.

A Historical Detour: Who Discovered This?

The vanishing gradient problem was not understood until almost a decade after backpropagation was popularized. Early RNN practitioners knew their models struggled on long sequences but could not explain why. The first rigorous analysis came from Sepp Hochreiter's 1991 diploma thesis at TU Munich — a largely overlooked document at the time. Yoshua Bengio, Patrice Simard, and Paolo Frasconi gave the problem broader visibility in 1994. Six years later, Hochreiter and Jürgen Schmidhuber would publish the paper that solved it: LSTM.

Loading timeline...

The Roadmap to LSTM and GRU

If the problem is that every backward step multiplies the gradient, the fix must change that. There are really only three options:

Change initialization. Initialize $W_{hh}$ as identity (IRNN), orthogonal, or unitary so its eigenvalues sit on the unit circle. This delays vanishing but does not eliminate it — once the weights start training, the spectral radius drifts.
Change the activation path. Use ReLU or LeakyReLU so the derivative is 1 wherever the activation is positive. Helps, but leaves the matrix product problem untouched.
Change the architecture. Introduce an additive path alongside the multiplicative one: the cell state in an LSTM, or the gated combination in a GRU. Gradients travel along the additive path almost unchanged, bypassing the vanishing problem entirely. This is what actually worked.

Chapter 17 will build LSTMs and GRUs from these principles, and you will see that the “complicated” gates are just the smallest machinery needed to give gradients a non-multiplicative highway through time. Once you understand vanishing gradients, LSTM stops looking mysterious — it looks inevitable.

The design principle. If you want a neural network to remember something across many steps, you must build a path along which the gradient is not continuously multiplied by a learned matrix. Additive skip connections — whether in LSTMs, GRUs, residual networks, or transformers — are the single most important structural idea in modern deep learning.

Test Your Understanding

Before moving on to LSTM, work through these questions to make sure the mathematics is solid. Each wrong answer comes with an explanation.

Loading quiz...

Key Takeaways

The gradient is a product, not a sum. $\partial h_T/\partial h_t$ is a chain of $T-t$ Jacobians multiplied together.
Vanilla RNNs share weights across time. Every factor in that product involves the same $W_{hh}$ — so its spectral radius dominates.
Spectral radius < 1 → vanishing; spectral radius > 1 → exploding. The edge of stability ( $\rho = 1$ ) is a knife-edge that training cannot maintain.
Activation functions shrink gradients further. Sigmoid caps the per-step factor at 0.25; tanh saturates once activations leave $[-1, 1]$ .
Exploding gradients are easy to detect and fix (clip them). Vanishing gradients are silent and devastating — the network simply never learns long-range patterns.
The architectural fix is an additive gradient path. LSTM, GRU, residual nets, and transformer skip-connections all share this idea: provide a route through the computation where gradients are not multiplied by a learned matrix at every step.
Hochreiter (1991) and Bengio et al. (1994) gave the problem its rigorous analysis. Hochreiter and Schmidhuber's 1997 LSTM paper gave the first clean solution.