Chapter 16
25 min read
Section 51 of 65

Vanilla RNN Implementation

Recurrent Neural Networks

From Inference to Learning

In Section 1, we built an RNN that could process a sentence and produce a sentiment probability. But our weights were hand-picked — we chose WxhW_{xh}, WhhW_{hh}, and WhyW_{hy} ourselves, and the model barely classified “I love this movie” as positive (probability 0.558 — almost a coin flip).

The question is: how do we find weights that actually work? The answer is the same as for any neural network — gradient descent. But RNNs introduce a twist: the same weights are used at every timestep, so the gradient must flow backward through time. This algorithm is called Backpropagation Through Time (BPTT), and understanding it is essential to understanding why RNNs work — and why they sometimes fail.

What we are building: A complete RNN training pipeline that takes 4 labeled sentences and learns to classify their sentiment. We will implement BPTT from scratch in NumPy, then see how PyTorch automates the same process in 6 lines.

The Loss Function for Sequences

Before we can train, we need a way to measure how wrong the model is. For binary sentiment classification (positive vs. negative), we use Binary Cross-Entropy (BCE):

L=[tlog(p)  +  (1t)log(1p)]L = -\Big[\, t \cdot \log(p) \;+\; (1 - t) \cdot \log(1 - p) \,\Big]

where pp is the model's predicted probability (output of sigmoid) and tt is the target label (1 for positive, 0 for negative). This loss has two intuitive properties:

  • When t=1t = 1 (positive sentiment), the loss is log(p)-\log(p). If the model predicts p=0.99p = 0.99, loss is 0.01 (good). If p=0.01p = 0.01, loss is 4.6 (terrible).
  • When t=0t = 0 (negative sentiment), the loss is log(1p)-\log(1 - p). Now lower pp is better.

For our untrained model on “I love this movie” (target = 1, p=0.558p = 0.558):

L=log(0.558)=0.584L = -\log(0.558) = 0.584

For comparison, a confident correct prediction p=0.98p = 0.98 would give L=log(0.98)=0.020L = -\log(0.98) = 0.020 — 29 times lower. Gradient descent will adjust the weights to push the loss down.

A beautiful gradient simplification

When you combine BCE loss with sigmoid activation, the derivative of the loss with respect to the pre-sigmoid logit yy simplifies to:

Ly=pt\frac{\partial L}{\partial y} = p - t

Just the prediction minus the target. For “I love this movie”: 0.5581=0.4420.558 - 1 = -0.442. The negative sign means “increase yy to push pp toward 1.” This clean gradient is the starting point for BPTT.


Backpropagation Through Time (BPTT)

In a feedforward network, backpropagation flows through a fixed sequence of layers. In an RNN, the computation graph is different: the same weights are used at every timestep. To compute gradients, we conceptually unroll the RNN through time — treating each timestep as a separate “layer” that shares weights with all the others.

The unrolled graph for our 4-word sentence looks like this:

h0Whhh1Whhh2Whhh3Whhh4Whyy^Lh_0 \xrightarrow{W_{hh}} h_1 \xrightarrow{W_{hh}} h_2 \xrightarrow{W_{hh}} h_3 \xrightarrow{W_{hh}} h_4 \xrightarrow{W_{hy}} \hat{y} \rightarrow L

BPTT computes gradients by applying the chain rule backward through this unrolled graph — from the loss LL, through the output layer, through h4h_4, then h3h_3, h2h_2, h1h_1. At each step, two things happen:

  1. Local weight gradients are computed and accumulated: since WxhW_{xh} and WhhW_{hh} are shared, the total gradient is the sum of contributions from every timestep.
  2. The gradient is propagated backward to the previous timestep via htht1=diag(1ht2)Whh\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - h_t^2) \cdot W_{hh}. This multiplication happens at every step — and is the source of both vanishing and exploding gradients.
Loading BPTT visualization...
Key observation: In the visualization above, the gradient magnitude drops from 0.345 at t=3t{=}3 to 0.030 at t=0t{=}0 — a 12× reduction across just 4 timesteps. For a 100-word sentence, this exponential decay makes early words effectively invisible to the gradient. This is the vanishing gradient problem, which we will analyze fully in Section 3.

The Chain Rule Unrolled Through Time

Let's write down the exact mathematics of BPTT. The forward pass at each timestep is:

ht=tanh ⁣(Wxhxt+Whhht1+bh)h_t = \tanh\!\left(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h\right)

The output and loss come from the final hidden state:

y=WhyhT+by,p=σ(y),L=BCE(p,t)y = W_{hy} \cdot h_T + b_y, \quad p = \sigma(y), \quad L = \text{BCE}(p, t)

Step 1: Output layer gradients

Starting from the loss, the gradient w.r.t. the logit is L/y=pt\partial L / \partial y = p - t. The output weight and bias gradients are:

LWhy=LyhT,Lby=Ly\frac{\partial L}{\partial W_{hy}} = \frac{\partial L}{\partial y} \cdot h_T^\top, \qquad \frac{\partial L}{\partial b_y} = \frac{\partial L}{\partial y}

The gradient flowing into the final hidden state is:

LhT=WhyLy\frac{\partial L}{\partial h_T} = W_{hy}^\top \cdot \frac{\partial L}{\partial y}

For our example: Why(0.442)=[0.265,  0.177,  0.133]W_{hy}^\top \cdot (-0.442) = [-0.265,\; 0.177,\; -0.133].

Step 2: BPTT — backward through each timestep

For each timestep from t=T1t = T{-}1 down to t=0t = 0, we apply three operations:

1. Tanh derivative gate: Since ht=tanh(pret)h_t = \tanh(\text{pre}_t), the gradient through tanh is element-wise:

δt=Lht(1ht2)\delta_t = \frac{\partial L}{\partial h_t} \odot (1 - h_t^2)

2. Accumulate weight gradients: Because the weights are shared, we add (not replace) the contribution from each timestep:

LWxh+=δtxt,LWhh+=δtht1,Lbh+=δt\frac{\partial L}{\partial W_{xh}} \mathrel{+}= \delta_t \cdot x_t^\top, \qquad \frac{\partial L}{\partial W_{hh}} \mathrel{+}= \delta_t \cdot h_{t-1}^\top, \qquad \frac{\partial L}{\partial b_h} \mathrel{+}= \delta_t

3. Propagate gradient to previous timestep:

Lht1=Whhδt\frac{\partial L}{\partial h_{t-1}} = W_{hh}^\top \cdot \delta_t

This last equation is the heart of BPTT. At each step, the gradient is multiplied by WhhW_{hh}^\top and element-wise by (1ht2)(1 - h_t^2). If the spectral radius of WhhW_{hh} is less than 1, these repeated multiplications cause the gradient to decay exponentially.

Concrete values: BPTT for “I love this movie”

TimestepToken||∂L/∂h_t||Decay vs t=3Key Observation
t=3"movie"0.34541.00×Full gradient from output layer
t=2"this"0.14000.41×Lost 59% through W_hh and tanh'
t=1"love"0.05930.17×83% gone — 'love' barely updates W_xh
t=0"I"0.02980.09×91% gone — 'I' is almost invisible

The word “movie” (closest to the output) gets a gradient 11.6× stronger than “I” (furthest away). With only 4 timesteps, this is already significant. With 100 timesteps, early words would receive effectively zero gradient.


Implementing BPTT from Scratch

Now let's implement everything we just derived. The code below contains the complete training pipeline: forward pass with state storage, BPTT backward pass, gradient clipping, and SGD weight updates. Click any line to see the exact values flowing through memory.

Complete RNN Training with BPTT — NumPy
🐍rnn_training.py
1import numpy as np

NumPy provides the ndarray type and all the linear algebra operations we need: matrix multiplication (@), element-wise operations (tanh, exp), and array creation (zeros). Every gradient computation in BPTT is a matrix operation.

EXECUTION STATE
numpy = Numerical computing library — provides ndarray, @ operator for matmul, np.tanh, np.exp, np.zeros, np.sqrt, np.sum, np.log
as np = Standard alias — lets us write np.array() instead of numpy.array()
3# --- Training data ---

We define a tiny vocabulary of 8 words with hand-crafted 2D embeddings, and 4 training sentences labeled as positive (1) or negative (0). This is the same setup from Section 1, now extended with negative words ('hate', 'bad') so the model has something to learn.

4vocab = {...} — word embeddings lookup

A dictionary mapping each word to its 2D embedding vector. In a real system, these would be learned (Word2Vec, GloVe) or come from a pretrained model. Here we hand-craft them: positive words ('love', 'great') have high second components, negative words ('hate', 'bad') have negative second components.

EXECUTION STATE
vocab = 8 words, each a 2D vector. Key design: • Positive: love=[0.5, 0.8], great=[0.4, 0.9] — high dim 1 • Negative: hate=[0.7, -0.6], bad=[0.6, -0.5] — negative dim 1 • Neutral: I=[1,0], this=[0.3,0.7], movie=[0.9,0.1], is=[0.2,0.3]
→ Why 2D? = 2 dimensions lets us see every number in every matrix. Real embeddings are 100-768 dimensions, but the math is identical.
10sentences = [...] — labeled training data

Four training examples: two positive (target=1) and two negative (target=0). Each example is a tuple of (word_list, label). The model will learn to output probability near 1.0 for positive sentences and near 0.0 for negative ones.

EXECUTION STATE
sentences[0] = (['I', 'love', 'this', 'movie'], 1) — positive sentiment
sentences[1] = (['I', 'hate', 'this', 'movie'], 0) — negative sentiment
sentences[2] = (['this', 'movie', 'is', 'great'], 1) — positive sentiment
sentences[3] = (['this', 'movie', 'is', 'bad'], 0) — negative sentiment
→ Key challenge = Sentences 0 and 1 differ by ONE word ('love' vs 'hate'). The RNN must learn that this single word flips the entire sentiment — a test of whether BPTT can learn meaningful word-level features.
17# --- Model weights (same as Section 1) ---

We use the exact same weight matrices as Section 1. These are the initial (untrained) weights — BPTT will compute gradients to update them. After training, these weights will have changed to correctly classify all 4 sentences.

18input_size, hidden_size = 2, 3

Tuple unpacking: sets input_size=2 (each word is a 2D vector) and hidden_size=3 (the hidden state has 3 elements). These dimensions determine the shapes of all weight matrices.

EXECUTION STATE
input_size = 2 = Each word embedding has 2 features. Determines W_xh columns.
hidden_size = 3 = Hidden state has 3 elements. Determines W_xh rows, W_hh size, b_h size.
19W_xh — input-to-hidden weights (3×2)

Transforms each 2D input embedding into the 3D hidden space. Same initial values as Section 1. These will be updated by gradient descent during training.

EXECUTION STATE
W_xh shape: (3, 2) =
      x0    x1
h0 [ 0.5  -0.3]
h1 [ 0.2   0.7]
h2 [-0.1   0.4]
→ After training = W_xh will change to: [[0.081, 1.071], [0.336, 1.374], [-0.096, 2.125]] — notice dim 1 weights increase dramatically, because that dimension separates positive/negative words.
22W_hh — hidden-to-hidden weights (3×3)

The recurrence matrix — transforms the previous hidden state to influence the current one. This is the heart of the RNN's memory. BPTT computes gradients that flow backward through this matrix at every timestep.

EXECUTION STATE
W_hh shape: (3, 3) =
       h0    h1    h2
h0 [ 0.1   0.4  -0.2]
h1 [-0.3   0.2   0.5]
h2 [ 0.3  -0.1   0.1]
25b_h = np.zeros(hidden_size)

Hidden bias initialized to zeros. Will be learned during training.

EXECUTION STATE
b_h = [0.0, 0.0, 0.0] = One bias per hidden unit, added before tanh. Zero initialization is standard.
26W_hy — hidden-to-output weights (1×3)

Projects the final hidden state into a single sentiment score. Shape (1, 3) because we have 1 output and 3 hidden units.

EXECUTION STATE
W_hy = [[0.6, -0.4, 0.3]] = Output = 0.6×h[0] − 0.4×h[1] + 0.3×h[2]. This determines which hidden features matter for the prediction.
27b_y = np.array([0.1])

Output bias. Adds a small constant to the logit before sigmoid.

EXECUTION STATE
b_y = [0.1] = Slight positive bias — the model starts with a mild lean toward positive prediction.
29# --- Forward pass (stores states for BPTT) ---

This forward pass is almost identical to Section 1, but with one critical difference: it stores ALL intermediate hidden states in h_list. BPTT needs these stored states to compute gradients — you cannot compute ∂L/∂W_hh at timestep t without knowing h_{t-1} and h_t.

30def forward(X) → (h_list, y, p)

Runs the RNN forward pass through the entire sequence, storing every hidden state for later use in BPTT. Returns the list of hidden states, raw logit, and sigmoid probability.

EXECUTION STATE
⬇ input: X — shape (T, 2) =
The embedding matrix for one sentence. Each row is a word's 2D embedding.
Example for 'I love this movie':
[[1.0, 0.0],
 [0.5, 0.8],
 [0.3, 0.7],
 [0.9, 0.1]]
⬆ returns: h_list = List of T+1 hidden state vectors [h₀, h₁, ..., h_T]. h₀ is zeros, h_T is the final state. Needed for BPTT gradient computation.
⬆ returns: y = Raw logit from W_hy @ h_T + b_y. For 'I love this movie': y = [0.2318]
⬆ returns: p = Sigmoid probability. For 'I love this movie': p = 0.5577
31T = X.shape[0]

Gets the number of timesteps (words) in this sentence. X has shape (T, input_size), so X.shape[0] = T.

EXECUTION STATE
📚 .shape[0] = NumPy attribute: returns the size of the first dimension. For X of shape (4, 2): X.shape[0] = 4 (4 words).
T = 4 = This sentence has 4 words, so the RNN runs for 4 timesteps.
32h_list = [np.zeros(hidden_size)] — store all hidden states

Initialize the list with h₀ = [0, 0, 0]. Unlike Section 1 where we overwrote h at each step, here we APPEND each new hidden state. After the loop, h_list = [h₀, h₁, h₂, h₃, h₄] — all 5 states stored for BPTT.

EXECUTION STATE
h_list[0] = h₀ = [0.0, 0.0, 0.0] = Initial hidden state (blank slate). BPTT needs this to compute dW_hh at t=0.
→ Why store all states? = BPTT computes dW_hh at time t using h_{t-1}: dW_hh += delta_t × h_{t-1}ᵀ. Without storing h_{t-1}, we would need to recompute the entire forward pass at each backward step.
33for t in range(T): — forward loop

Process each word in sequence. At each step: (1) compute pre-activation from current input and previous hidden state, (2) apply tanh, (3) store the new hidden state. This is identical to Section 1 but now we keep every h_t.

LOOP TRACE · 4 iterations
t=0, word='I'
X[0] = [1.0, 0.0] = Embedding for 'I'
h_prev = [0.0, 0.0, 0.0] = No memory yet
→ h₁ = [0.4621, 0.1974, -0.0997] = After tanh
t=1, word='love'
X[1] = [0.5, 0.8] = Embedding for 'love'
h_prev = [0.4621, 0.1974, -0.0997] = Memory of 'I'
→ h₂ = [0.1539, 0.4707, 0.3618] = Captures 'I love'
t=2, word='this'
X[2] = [0.3, 0.7] = Embedding for 'this'
h_prev = [0.1539, 0.4707, 0.3618] = Memory of 'I love'
→ h₃ = [0.0712, 0.6521, 0.2778] = Captures 'I love this'
t=3, word='movie'
X[3] = [0.9, 0.1] = Embedding for 'movie'
h_prev = [0.0712, 0.6521, 0.2778] = Memory of 'I love this'
→ h₄ = [0.5597, 0.4605, -0.0660] = Final state: 'I love this movie'
34pre = W_xh @ X[t] + W_hh @ h_list[-1] + b_h

The core RNN equation (pre-activation). Combines current input X[t] and previous hidden state h_list[-1] through two separate learned transformations, then adds bias. This is computed BEFORE tanh.

EXECUTION STATE
📚 @ (matrix multiply) = Python's matmul operator. W_xh(3,2) @ X[t](2,) → (3,). W_hh(3,3) @ h(3,) → (3,).
📚 h_list[-1] = Python negative indexing: -1 means last element. After appending h₁, h_list[-1] = h₁. This is the previous hidden state.
W_xh @ X[t] = Transforms input into hidden space: 'what does this word contribute?'
W_hh @ h_list[-1] = Transforms previous memory: 'what should be remembered?'
── Example: t=1, word='love' ── =
W_xh @ [0.5, 0.8] = [0.5×0.5+(-0.3)×0.8, 0.2×0.5+0.7×0.8, (-0.1)×0.5+0.4×0.8] = [0.0100, 0.6600, 0.2700]
W_hh @ [0.4621, 0.1974, -0.0997] = [0.1451, -0.1490, 0.1089]
pre = sum + b_h = [0.0100+0.1451+0, 0.6600-0.1490+0, 0.2700+0.1089+0] = [0.1551, 0.5110, 0.3789]
35h_list.append(np.tanh(pre))

Apply tanh to squash the pre-activation into (-1, +1), then APPEND to h_list (don't overwrite). This is the critical difference from Section 1 — we preserve every hidden state for BPTT.

EXECUTION STATE
📚 np.tanh(x) = Element-wise hyperbolic tangent. Maps any real number to (-1, +1). tanh(0)=0, tanh(1)=0.762, tanh(-1)=-0.762. Derivative: 1 - tanh²(x) — needed for BPTT.
📚 list.append() = Adds element to end of list. h_list grows from length 1 to length T+1 = 5.
→ Why append, not overwrite? = BPTT needs h_list[t] (previous state) at every backward step t. If we overwrote h, we'd lose all intermediate states.
36y = W_hy @ h_list[-1] + b_y — output logit

Project the final hidden state into a single output value. h_list[-1] is h₄ (after processing all 4 words). W_hy(1,3) @ h₄(3,) gives a scalar logit.

EXECUTION STATE
h_list[-1] = h₄ = [0.5597, 0.4605, -0.0660] — final hidden state encoding 'I love this movie'
W_hy @ h₄ + b_y = 0.6×0.5597 + (-0.4)×0.4605 + 0.3×(-0.0660) + 0.1 = 0.2318
⬆ y = [0.2318] = Positive logit → lean toward positive sentiment (before sigmoid)
37p = 1.0 / (1.0 + np.exp(-y[0])) — sigmoid

Sigmoid converts the raw logit into a probability between 0 and 1. σ(0.2318) = 0.5577 — barely above 0.5, meaning the untrained model is nearly guessing.

EXECUTION STATE
📚 sigmoid formula = σ(x) = 1/(1 + e^(-x)). Maps R → (0, 1). σ(0)=0.5, σ(2)=0.88, σ(-2)=0.12
np.exp(-0.2318) = e^(-0.2318) = 0.7932
⬆ p = 0.5577 = Probability of positive sentiment. Only 55.8% confident — barely correct. After training, this becomes 0.98.
38return h_list, y, p

Returns all three values needed for BPTT: h_list for gradient computation, y for debugging, and p for loss calculation.

EXECUTION STATE
⬆ return: h_list = [h₀, h₁, h₂, h₃, h₄] — 5 vectors, each 3D. Needed by backward() to compute weight gradients at each timestep.
⬆ return: y = [0.2318] — raw logit
⬆ return: p = 0.5577 — sigmoid probability
40# --- Backward pass: BPTT ---

This is the core of training: Backpropagation Through Time. We compute gradients of the loss with respect to every weight by propagating the error signal backward through the unrolled computation graph — from the output, through h₄, h₃, h₂, h₁, accumulating weight gradients at each step.

41def backward(X, h_list, p, target) → gradients

Computes all weight gradients using BPTT. Takes the stored forward-pass states and propagates the loss gradient backward through every timestep. Returns gradients for all 5 parameter groups.

EXECUTION STATE
⬇ input: X (4×2) = The input embeddings for this sentence. Needed to compute dW_xh at each timestep: dW_xh += delta_t × x_tᵀ
⬇ input: h_list (5 vectors) = [h₀, h₁, h₂, h₃, h₄] — all hidden states from forward pass. Needed for: • tanh derivative: 1 - h_t² (requires h_t) • dW_hh: delta_t × h_{t-1}ᵀ (requires h_{t-1})
⬇ input: p = 0.5577 = Sigmoid probability from forward pass. Used to compute dL/dy = p - target.
⬇ input: target = 1 = Ground truth label (1 = positive). dL/dy = 0.5577 - 1 = -0.4423.
⬆ returns = Tuple of 5 gradient arrays: (dW_xh, dW_hh, db_h, dW_hy, db_y). Same shapes as the weights.
42T = X.shape[0]

Number of timesteps. T=4 for our 4-word sentences.

EXECUTION STATE
T = 4 = BPTT will iterate backward from t=3 to t=0 — 4 steps of gradient propagation.
43dW_xh = np.zeros_like(W_xh) — accumulator for input weight gradients

Initialize gradient accumulators to zero. At each backward timestep, we ADD the local gradient contribution. The final dW_xh is the SUM of contributions from all 4 timesteps — this is because W_xh is shared across time.

EXECUTION STATE
📚 np.zeros_like(arr) = Creates a zero array with the same shape and dtype as arr. zeros_like(W_xh) → shape (3, 2), all zeros.
dW_xh shape: (3, 2) = Same shape as W_xh. Will accumulate: dW_xh = Σ_t delta_t × x_tᵀ
→ Why accumulate? = Because W_xh is used at EVERY timestep, the total gradient is the sum of gradients from each use: ∂L/∂W_xh = ∂L/∂W_xh|_{t=0} + ∂L/∂W_xh|_{t=1} + ... + ∂L/∂W_xh|_{t=3}
44dW_hh = np.zeros_like(W_hh) — accumulator for recurrence weight gradients

Gradient accumulator for the recurrence matrix. Same logic: W_hh is shared across all timesteps, so the total gradient is the sum of per-timestep contributions.

EXECUTION STATE
dW_hh shape: (3, 3) = Same shape as W_hh. Will accumulate: dW_hh = Σ_t delta_t × h_{t-1}ᵀ
45db_h = np.zeros(hidden_size)

Gradient accumulator for the hidden bias. db_h = Σ_t delta_t.

EXECUTION STATE
db_h shape: (3,) = One gradient per hidden unit. Accumulated across all timesteps.
47# Output layer gradients

First, compute the gradient at the output layer. This is the starting point — the error signal that will be propagated backward through time. The output gradient comes from differentiating the binary cross-entropy loss combined with the sigmoid activation.

48dL_dy = np.array([p - target]) — gradient of loss w.r.t. logit

This is a beautiful simplification: when you combine binary cross-entropy loss L = -[t·log(p) + (1-t)·log(1-p)] with sigmoid p = σ(y), the derivative simplifies to just p - target. No need to differentiate sigmoid and loss separately.

EXECUTION STATE
📚 dL/dy derivation = L = -t·log(σ(y)) - (1-t)·log(1-σ(y)) dL/dy = σ(y) - t = p - target This elegant result is why BCE + sigmoid is so common in binary classification.
p - target = 0.5577 - 1 = -0.4423. Negative because p < target: the model predicted 0.558 but should have predicted 1.0. The negative sign tells gradient descent to INCREASE the logit.
⬆ dL_dy = [-0.4423] = This error signal is the seed of BPTT — it flows into h₄, then h₃, h₂, h₁, accumulating weight gradients at each step.
49dW_hy = dL_dy.reshape(-1,1) @ h_list[-1].reshape(1,-1)

Gradient of loss w.r.t. the output weight matrix. Since y = W_hy @ h_T + b_y, the gradient ∂L/∂W_hy = (∂L/∂y) × h_Tᵀ. This is an outer product: column vector × row vector → matrix.

EXECUTION STATE
📚 .reshape(-1, 1) = Reshape to column vector. [-0.4423] → [[-0.4423]] shape (1,1).
📚 .reshape(1, -1) = Reshape to row vector. [0.5597, 0.4605, -0.0660] → [[0.5597, 0.4605, -0.0660]] shape (1,3).
dW_hy = dL_dy × h₄ᵀ =
[[-0.4423]] @ [[0.5597, 0.4605, -0.0660]] = [[-0.2476, -0.2037, 0.0292]]
→ Interpretation = Negative gradient for h[0] and h[1] means: increase W_hy[0] and W_hy[1] to push the output higher (toward target=1). Positive for h[2] means: decrease W_hy[2].
50db_y = dL_dy.copy()

Gradient of loss w.r.t. output bias. Since y = W_hy @ h + b_y, the derivative ∂L/∂b_y = ∂L/∂y = dL_dy. We copy to avoid aliasing.

EXECUTION STATE
📚 .copy() = Creates an independent copy of the array. Without copy(), db_y would be a reference to dL_dy, and modifying one would modify both.
db_y = [-0.4423] = Same as dL_dy. Gradient says: increase b_y to push output toward target.
52# Initial gradient into final hidden state

Now we need to push the error signal from the output layer INTO the final hidden state h₄. This is the gateway from the output gradients into the BPTT loop.

53dL_dh = (W_hy.T @ dL_dy.reshape(-1,1)).flatten()

Compute ∂L/∂h₄ using the chain rule: since y = W_hy @ h₄ + b_y, the gradient flowing backward through this multiplication is W_hyᵀ × dL_dy. This gives us a 3D vector — the error signal for each hidden unit.

EXECUTION STATE
📚 .T (transpose) = W_hy has shape (1, 3). W_hy.T has shape (3, 1). Transposing reverses the multiplication direction for backpropagation.
W_hy.T @ dL_dy = [[0.6], [-0.4], [0.3]] × [[-0.4423]] = [[-0.2654], [0.1769], [-0.1327]]
📚 .flatten() = Removes all dimensions of size 1. Shape (3, 1) → (3,). Converts column vector to 1D array for element-wise operations.
⬆ dL_dh = dL/dh₄ = [-0.2654, 0.1769, -0.1327]. This is the error signal entering the BPTT loop. ||dL/dh₄|| = 0.3454.
→ Interpretation = h₄[0] should INCREASE (negative gradient), h₄[1] should DECREASE (positive gradient), h₄[2] should INCREASE. These signals propagate backward through time.
55# Propagate gradients backward through time

This is THE core loop of BPTT. We iterate backward from t=T-1 to t=0 (i.e., from 'movie' back to 'I'), at each step: (1) apply tanh derivative, (2) compute local weight gradients, (3) propagate gradient to the previous timestep via W_hh.

56for t in reversed(range(T)): — BPTT backward loop

Iterate backward through time: t = 3, 2, 1, 0. At each step, we use the gradient arriving at h_{t+1} (stored in dL_dh) to compute weight gradients and then propagate to h_t. The gradient weakens at each step because it passes through tanh derivatives and W_hh multiplication.

LOOP TRACE · 4 iterations
t=3, word='movie' (most recent → strongest gradient)
||dL/dh₄|| = 0.3454 = Full-strength error signal from output
1 - h₄² = [0.6867, 0.7879, 0.9956] = tanh derivative at h₄
delta = [-0.1822, 0.1394, -0.1321] = Gradient through tanh
→ dL/dh₃ = W_hhᵀ @ delta = [-0.0997, -0.0318, 0.0929] — ||∇|| = 0.1400
t=2, word='this' (gradient weakens)
||dL/dh₃|| = 0.1400 = 2.5× weaker than at t=3
1 - h₃² = [0.9949, 0.5748, 0.9228] = tanh derivative at h₃
delta = [-0.0992, -0.0183, 0.0858] = Gradient through tanh
→ dL/dh₂ = W_hhᵀ @ delta = [0.0213, -0.0519, 0.0193] — ||∇|| = 0.0593
t=1, word='love' (gradient fading)
||dL/dh₂|| = 0.0593 = 5.8× weaker than at t=3
1 - h₂² = [0.9763, 0.7784, 0.8691] = tanh derivative at h₂
delta = [0.0208, -0.0404, 0.0167] = Gradient through tanh
→ dL/dh₁ = W_hhᵀ @ delta = [0.0192, -0.0014, -0.0227] — ||∇|| = 0.0298
t=0, word='I' (gradient nearly vanished)
||dL/dh₁|| = 0.0298 = 11.6× weaker than at t=3 — the vanishing gradient!
1 - h₁² = [0.7864, 0.9610, 0.9901] = tanh derivative at h₁
delta = [0.0151, -0.0014, -0.0225] = Gradient through tanh — tiny adjustments
57dtanh = 1.0 - h_list[t + 1] ** 2 — tanh derivative

The derivative of tanh(z) is 1 - tanh²(z) = 1 - h_t². Since h_t = tanh(pre_t), we can compute the derivative directly from the stored hidden state — no need to store the pre-activation values.

EXECUTION STATE
📚 tanh derivative = d/dz tanh(z) = 1 - tanh²(z). This is always between 0 and 1: • If h_t ≈ 0: derivative ≈ 1 (full gradient flow) • If h_t ≈ ±1: derivative ≈ 0 (gradient vanishes — saturation)
h_list[t + 1] = The hidden state AT timestep t. Index is t+1 because h_list[0] = h₀ (initial), h_list[1] = h₁ (after first word), etc.
** 2 (element-wise square) = NumPy element-wise operation. [0.5597, 0.4605, -0.066]² = [0.3133, 0.2121, 0.0044]
── Example: t=3 ── =
h₄ = [0.5597, 0.4605, -0.0660] = Hidden state after processing 'movie'
1 - h₄² = [1-0.3133, 1-0.2121, 1-0.0044] = [0.6867, 0.7879, 0.9956]
→ Why this matters = h₄[0]=0.56 is moderately saturated → derivative 0.69 (30% gradient lost). h₄[2]=-0.07 is near zero → derivative 0.996 (nearly all gradient passes). Saturated units kill gradients.
58delta = dL_dh * dtanh — gradient through tanh gate

Element-wise multiply the incoming gradient by the tanh derivative. This is the chain rule at the tanh operation: ∂L/∂(pre_t) = ∂L/∂h_t × ∂h_t/∂(pre_t). The tanh derivative acts as a gate — it can shrink or pass the gradient for each hidden unit independently.

EXECUTION STATE
* (element-wise multiply) = NumPy broadcasts element-by-element: delta[i] = dL_dh[i] × dtanh[i]. Not a matrix multiply.
── Example: t=3 ── =
dL_dh = [-0.2654, 0.1769, -0.1327] — incoming gradient
dtanh = [0.6867, 0.7879, 0.9956] — tanh derivative gate
delta = dL_dh * dtanh = [-0.2654×0.6867, 0.1769×0.7879, -0.1327×0.9956] = [-0.1822, 0.1394, -0.1321]
→ Gradient shrinkage = Element 0 lost 31% (0.6867), element 1 lost 21% (0.7879), element 2 lost only 0.4% (0.9956). The tanh derivative is the FIRST source of gradient decay in BPTT.
60dW_xh += delta.reshape(-1,1) @ X[t].reshape(1,-1)

Accumulate the gradient for W_xh at this timestep. Since h_t = tanh(W_xh @ x_t + ...), the gradient ∂L/∂W_xh at time t is delta_t × x_tᵀ (outer product). We ADD to the accumulator because W_xh is shared across all timesteps.

EXECUTION STATE
+= (accumulate) = Add to the running total. W_xh is used at every timestep, so the total gradient is the SUM of per-timestep contributions.
delta.reshape(-1,1) @ X[t].reshape(1,-1) = Outer product: column(3,1) × row(1,2) = matrix(3,2). Each element dW_xh[i,j] = delta[i] × x[j]
── Example: t=3, x₃=[0.9, 0.1] ── =
delta = [-0.1822, 0.1394, -0.1321] = Error signal at this timestep
contribution =
         x0       x1
h0 [-0.1640  -0.0182]
h1 [ 0.1255   0.0139]
h2 [-0.1189  -0.0132]
61dW_hh += delta.reshape(-1,1) @ h_list[t].reshape(1,-1)

Accumulate the gradient for W_hh at this timestep. Since pre_t = ... + W_hh @ h_{t-1} + ..., the gradient ∂L/∂W_hh at time t is delta_t × h_{t-1}ᵀ. Note: h_list[t] is h_{t-1} because h_list[0]=h₀.

EXECUTION STATE
h_list[t] = h_{t-1} = Previous hidden state. h_list[3] = h₃ = [0.0712, 0.6521, 0.2778] (the state BEFORE processing 'movie').
── Example: t=3 ── =
delta × h₃ᵀ =
         h0       h1       h2
h0 [-0.0130  -0.1188  -0.0506]
h1 [ 0.0099   0.0909   0.0387]
h2 [-0.0094  -0.0861  -0.0367]
── Example: t=0 ── =
h_list[0] = h₀ = [0, 0, 0] = All zeros! So dW_hh contribution at t=0 is a zero matrix. The recurrence weight gets NO gradient from the first timestep because there was no previous hidden state.
62db_h += delta

Accumulate bias gradient. Since pre_t = ... + b_h, the gradient ∂L/∂b_h = delta_t (the bias derivative is always 1). Sum across timesteps.

EXECUTION STATE
db_h accumulation = db_h = delta₃ + delta₂ + delta₁ + delta₀ = [-0.2455, 0.0793, -0.0521]
64dL_dh = W_hh.T @ delta — propagate gradient to previous timestep

THIS IS THE KEY STEP OF BPTT. The gradient flows from h_t to h_{t-1} through the recurrence matrix W_hh. Since pre_t depends on h_{t-1} via W_hh @ h_{t-1}, the backward gradient is W_hhᵀ @ delta_t. This multiplication happens at EVERY backward step — it is the second source of gradient decay (or explosion).

EXECUTION STATE
📚 W_hh.T @ delta = Transpose of W_hh (3,3) multiplied by delta (3,). This reverses the forward multiplication W_hh @ h_{t-1}, sending the gradient backward through the recurrence.
── Example: t=3 → t=2 ── =
W_hh.T =
     h0     h1    h2
h0 [ 0.1  -0.3   0.3]
h1 [ 0.4   0.2  -0.1]
h2 [-0.2   0.5   0.1]
W_hh.T @ delta₃ = [-0.0997, -0.0318, 0.0929]
||dL/dh₃|| = 0.1400 = Down from ||dL/dh₄|| = 0.3454. The gradient lost 59% of its magnitude in ONE step through W_hh.
→ Why gradients vanish = Each backward step multiplies by W_hh.T and element-wise by (1-h²). If the spectral norm of W_hh < 1, gradients shrink exponentially: after T steps, ||∇|| ≈ ||∇₀|| × (spectral_norm)^T. For T=100: even 0.9^100 ≈ 2.7×10⁻⁵.
66return dW_xh, dW_hh, db_h, dW_hy, db_y

Return all 5 gradient tensors. These will be used by the optimizer (SGD) to update the weights. Each gradient has the same shape as its corresponding weight.

EXECUTION STATE
⬆ return: dW_xh (3×2) =
[[-0.1683, -0.0710],
 [ 0.0984, -0.0312],
 [-0.1073,  0.0602]]
⬆ return: dW_hh (3×3) =
[[-0.0186, -0.1614, -0.0886],
 [-0.0116,  0.0743,  0.0361],
 [ 0.0115, -0.0425, -0.0073]]
⬆ return: db_h (3,) = [-0.2455, 0.0793, -0.0521]
⬆ return: dW_hy (1×3) =
[[-0.2476, -0.2037, 0.0292]]
⬆ return: db_y (1,) = [-0.4423]
68# --- Training loop ---

Now we put forward and backward together in a training loop. For each epoch: (1) forward pass on all 4 sentences, (2) BPTT backward on each, (3) average gradients, (4) clip gradients for stability, (5) SGD weight update.

69lr = 0.5 — learning rate

The step size for gradient descent. Each weight update: w ← w − lr × gradient. With lr=0.5, each step changes weights by half the gradient magnitude. Too large → divergence, too small → slow training.

EXECUTION STATE
lr = 0.5 = Moderate learning rate. For our small model (25 parameters, 4 examples), this converges in ~6 epochs to 100% accuracy. Real models use lr=0.001-0.01 with Adam optimizer.
70for epoch in range(30): — training epochs

Train for 30 epochs. Each epoch processes all 4 sentences and updates weights once. By epoch 6, the model achieves 100% accuracy. The remaining epochs increase confidence (pushing predictions further from 0.5).

EXECUTION STATE
epoch = One full pass through all training data. Epoch 0: 25% accuracy. Epoch 6: 100% accuracy. Epoch 29: Loss = 0.13.
71total_loss, correct = 0.0, 0 — epoch accumulators

Track total loss and number of correct predictions for this epoch. Reset to zero at the start of each epoch.

EXECUTION STATE
total_loss = 0.0 = Sum of BCE losses across all 4 sentences. Divided by 4 to get average loss.
correct = 0 = Count of correct predictions. Accuracy = correct/4.
72grads = [np.zeros_like(w) for w in [...]] — gradient accumulators

Create a list of 5 zero arrays, one for each weight. We accumulate gradients from all 4 sentences, then average them before updating. This is batch gradient descent (as opposed to updating after each sentence).

EXECUTION STATE
📚 list comprehension = Creates [zeros(3,2), zeros(3,3), zeros(3), zeros(1,3), zeros(1)] — matching shapes of W_xh, W_hh, b_h, W_hy, b_y.
→ Why accumulate? = Averaging gradients over all examples reduces noise and gives more stable updates. This is equivalent to computing the gradient of the average loss.
74for words, target in sentences: — iterate over training examples

Process each sentence: forward pass, compute loss, backward pass (BPTT), accumulate gradients. The weight update happens AFTER all sentences are processed.

LOOP TRACE · 4 iterations
Example 1: 'I love this movie', target=1
p = 0.558 (epoch 0) = Barely positive — should be near 1.0
loss = 0.584 = High loss — prediction is far from target
Example 2: 'I hate this movie', target=0
p = 0.528 (epoch 0) = Slightly positive — should be near 0.0
loss = 0.741 = High loss — wrong side of boundary
Example 3: 'this movie is great', target=1
p = 0.485 (epoch 0) = Slightly negative — should be near 1.0
loss = 0.724 = Wrong prediction entirely
Example 4: 'this movie is bad', target=0
p = 0.597 (epoch 0) = Moderately positive — should be near 0.0
loss = 0.917 = Worst loss — most wrong prediction
75X = np.array([vocab[w] for w in words])

Convert word list to embedding matrix by looking up each word in the vocabulary dictionary.

EXECUTION STATE
📚 list comprehension + np.array = [vocab[w] for w in words] creates a list of 2D vectors, then np.array() stacks them into a (4, 2) matrix.
Example: words=['I','love','this','movie'] = X = [[1.0, 0.0], [0.5, 0.8], [0.3, 0.7], [0.9, 0.1]] shape (4, 2)
76h_list, y, p = forward(X)

Run the forward pass to get hidden states (for BPTT), raw logit, and probability. All intermediate states are stored in h_list.

EXECUTION STATE
h_list = 5 vectors: [h₀, h₁, h₂, h₃, h₄]. Needed for backward().
y = Raw logit. Example: [0.2318] for 'I love this movie'
p = Sigmoid probability. Example: 0.5577 for 'I love this movie'
78loss = -(target * np.log(p + 1e-7) + ...)

Binary cross-entropy loss: L = -[t·log(p) + (1-t)·log(1-p)]. Measures how far the prediction p is from the target. When target=1 and p=0.558: L = -log(0.558) = 0.584. When target=0 and p=0.528: L = -log(1-0.528) = 0.749.

EXECUTION STATE
📚 Binary Cross-Entropy = L = -[t·log(p) + (1-t)·log(1-p)] • Perfect prediction (p=1, t=1): L = -log(1) = 0 • Wrong prediction (p=0.5, t=1): L = -log(0.5) = 0.693 • Very wrong (p=0.01, t=1): L = -log(0.01) = 4.605
1e-7 (epsilon) = Small constant to prevent log(0) = -infinity. Added inside log() for numerical stability.
Epoch 0 losses = 'I love this movie': 0.584 'I hate this movie': 0.749 'this movie is great': 0.724 'this movie is bad': 0.917 Avg: 0.742
80total_loss += loss

Accumulate loss across all sentences for reporting.

81correct += int((p >= 0.5) == target)

Count correct predictions. A prediction is correct if p >= 0.5 matches target=1, or p < 0.5 matches target=0.

EXECUTION STATE
📚 int(bool) = Converts True→1, False→0. Allows counting with +=.
Epoch 0 predictions = 'I love this movie': p=0.558 ≥ 0.5 → pred=1, target=1 ✓ 'I hate this movie': p=0.528 ≥ 0.5 → pred=1, target=0 ✗ 'this movie is great': p=0.485 < 0.5 → pred=0, target=1 ✗ 'this movie is bad': p=0.597 ≥ 0.5 → pred=1, target=0 ✗ correct = 1/4 = 25%
83g = backward(X, h_list, p, target) — BPTT

Run BPTT to compute all weight gradients for this sentence. Returns a tuple of 5 gradient arrays.

EXECUTION STATE
g = Tuple: (dW_xh, dW_hh, db_h, dW_hy, db_y). Each has the same shape as its weight.
84for i in range(5): grads[i] += g[i]

Add this sentence's gradients to the accumulators. After processing all 4 sentences, grads[i] contains the SUM of gradients from all examples.

EXECUTION STATE
grads[0] += g[0] = Accumulate dW_xh from this sentence
grads[1] += g[1] = Accumulate dW_hh from this sentence
86for i in range(5): grads[i] /= len(sentences)

Average the accumulated gradients by dividing by the number of sentences (4). This gives us the gradient of the AVERAGE loss, which is less noisy than individual sentence gradients.

EXECUTION STATE
/= len(sentences) = Divide all gradient accumulators by 4. This is equivalent to computing ∇(L_avg) where L_avg = (L₁ + L₂ + L₃ + L₄) / 4.
88# Gradient clipping

Gradient clipping is a practical necessity for RNN training. Without it, occasional large gradients can cause weight updates that destroy learned features. We compute the global L2 norm of all gradients and scale them down if the norm exceeds a threshold.

89grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))

Compute the global L2 norm of ALL gradients concatenated into one vector. This single number tells us how 'large' the total gradient is — if it exceeds our threshold, we scale everything down proportionally.

EXECUTION STATE
📚 L2 norm = ||g|| = √(Σ g²). The Euclidean length of the gradient vector. For all parameters concatenated: √(||dW_xh||² + ||dW_hh||² + ||db_h||² + ||dW_hy||² + ||db_y||²)
📚 np.sum(g**2) = Sum of squared elements. For a (3,2) array: sums all 6 squared values into a scalar.
📚 sum(... for g in grads) = Python generator sum — adds up the squared norms from all 5 gradient arrays.
grad_norm at epoch 0 = 0.1964 — well below our clip threshold of 5.0, so no clipping needed yet.
90if grad_norm > 5.0: — clip check

Only clip if the gradient norm exceeds 5.0. For our small model, this rarely triggers, but for larger models with longer sequences, gradient explosion is common and clipping is essential.

EXECUTION STATE
threshold = 5.0 = Common values: 1.0 to 10.0. Lower → more aggressive clipping → slower but safer training. Higher → less clipping → faster but riskier.
→ When does clipping trigger? = With longer sequences (50+ words) or larger learning rates, gradients can explode to 100+ or even infinity. Clipping prevents catastrophic weight updates.
91for i in range(5): grads[i] *= 5.0 / grad_norm

Scale ALL gradients by the same factor so the total norm equals exactly 5.0. This preserves the direction of the gradient (which features to update more/less) while limiting the step size.

EXECUTION STATE
scaling factor = 5.0 / grad_norm. Example: if grad_norm=12.5, scale = 5.0/12.5 = 0.4. All gradients shrink by 60%.
→ Direction preserved = Multiplying all gradients by the same scalar doesn't change which weights get larger/smaller updates relative to each other — only the overall magnitude.
93W_xh -= lr * grads[0] — SGD weight update

Stochastic Gradient Descent: update each weight by subtracting the learning rate times its gradient. The minus sign is because we want to MINIMIZE the loss: move in the opposite direction of the gradient.

EXECUTION STATE
📚 SGD update rule = w ← w − lr × ∂L/∂w. If gradient is negative (loss decreases when w increases), we increase w. If positive, we decrease w.
W_xh update (epoch 0) = W_xh -= 0.5 × averaged_dW_xh. Example: W_xh[2,1] was 0.4, gradient was ~0.015, so W_xh[2,1] becomes 0.4 - 0.5×0.015 = 0.3925.
94W_hh -= lr * grads[1]

Update the recurrence matrix. This changes how the RNN transforms its memory from one timestep to the next.

95b_h -= lr * grads[2]

Update the hidden bias. Shifts the pre-activation values.

96W_hy -= lr * grads[3]

Update the output weight matrix. Changes how the final hidden state maps to the prediction.

97b_y -= lr * grads[4]

Update the output bias.

99if epoch % 5 == 0: print(...)

Print progress every 5 epochs. Shows average loss and accuracy improving over time.

EXECUTION STATE
📚 % (modulo) = epoch % 5 == 0 is True when epoch is 0, 5, 10, 15, 20, 25. Prints 6 progress reports.
Expected output = Epoch 0: Loss=0.7416, Acc=1/4 Epoch 5: Loss=0.6601, Acc=2/4 Epoch 10: Loss=0.5706, Acc=4/4 Epoch 15: Loss=0.4278, Acc=4/4 Epoch 20: Loss=0.2506, Acc=4/4 Epoch 25: Loss=0.1554, Acc=4/4
100print(f"Epoch {epoch}: Loss=..., Acc=...")

Formatted string output showing epoch number, average loss, and number of correct predictions.

EXECUTION STATE
total_loss/4 = Average loss over 4 sentences. Epoch 0: 0.742. Epoch 29: 0.129.
correct/4 = Accuracy. Goes from 1/4 (25%) to 4/4 (100%) by epoch 6.
38 lines without explanation
1import numpy as np
2
3# --- Training data ---
4vocab = {
5    "I": [1.0, 0.0], "love": [0.5, 0.8],
6    "hate": [0.7, -0.6], "this": [0.3, 0.7],
7    "movie": [0.9, 0.1], "is": [0.2, 0.3],
8    "great": [0.4, 0.9], "bad": [0.6, -0.5],
9}
10
11sentences = [
12    (["I", "love", "this", "movie"], 1),
13    (["I", "hate", "this", "movie"], 0),
14    (["this", "movie", "is", "great"], 1),
15    (["this", "movie", "is", "bad"], 0),
16]
17
18# --- Model weights (same as Section 1) ---
19input_size, hidden_size = 2, 3
20W_xh = np.array([[ 0.5, -0.3],
21                  [ 0.2,  0.7],
22                  [-0.1,  0.4]])
23W_hh = np.array([[ 0.1,  0.4, -0.2],
24                  [-0.3,  0.2,  0.5],
25                  [ 0.3, -0.1,  0.1]])
26b_h  = np.zeros(hidden_size)
27W_hy = np.array([[0.6, -0.4, 0.3]])
28b_y  = np.array([0.1])
29
30# --- Forward pass (stores states for BPTT) ---
31def forward(X):
32    T = X.shape[0]
33    h_list = [np.zeros(hidden_size)]
34    for t in range(T):
35        pre = W_xh @ X[t] + W_hh @ h_list[-1] + b_h
36        h_list.append(np.tanh(pre))
37    y = W_hy @ h_list[-1] + b_y
38    p = 1.0 / (1.0 + np.exp(-y[0]))
39    return h_list, y, p
40
41# --- Backward pass: BPTT ---
42def backward(X, h_list, p, target):
43    T = X.shape[0]
44    dW_xh = np.zeros_like(W_xh)
45    dW_hh = np.zeros_like(W_hh)
46    db_h  = np.zeros(hidden_size)
47
48    # Output layer gradients
49    dL_dy = np.array([p - target])
50    dW_hy = dL_dy.reshape(-1, 1) @ h_list[-1].reshape(1, -1)
51    db_y  = dL_dy.copy()
52
53    # Initial gradient into final hidden state
54    dL_dh = (W_hy.T @ dL_dy.reshape(-1, 1)).flatten()
55
56    # Propagate gradients backward through time
57    for t in reversed(range(T)):
58        dtanh = 1.0 - h_list[t + 1] ** 2
59        delta = dL_dh * dtanh
60
61        dW_xh += delta.reshape(-1, 1) @ X[t].reshape(1, -1)
62        dW_hh += delta.reshape(-1, 1) @ h_list[t].reshape(1, -1)
63        db_h  += delta
64
65        dL_dh = W_hh.T @ delta
66
67    return dW_xh, dW_hh, db_h, dW_hy, db_y
68
69# --- Training loop ---
70lr = 0.5
71for epoch in range(30):
72    total_loss, correct = 0.0, 0
73    grads = [np.zeros_like(w) for w in [W_xh, W_hh, b_h, W_hy, b_y]]
74
75    for words, target in sentences:
76        X = np.array([vocab[w] for w in words])
77        h_list, y, p = forward(X)
78
79        loss = -(target * np.log(p + 1e-7)
80               + (1 - target) * np.log(1 - p + 1e-7))
81        total_loss += loss
82        correct += int((p >= 0.5) == target)
83
84        g = backward(X, h_list, p, target)
85        for i in range(5):
86            grads[i] += g[i]
87
88    for i in range(5):
89        grads[i] /= len(sentences)
90
91    # Gradient clipping
92    grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
93    if grad_norm > 5.0:
94        for i in range(5):
95            grads[i] *= 5.0 / grad_norm
96
97    W_xh -= lr * grads[0]
98    W_hh -= lr * grads[1]
99    b_h  -= lr * grads[2]
100    W_hy -= lr * grads[3]
101    b_y  -= lr * grads[4]
102
103    if epoch % 5 == 0:
104        print(f"Epoch {epoch}: Loss={total_loss/4:.4f}, "
105              f"Acc={correct}/4")

Training results

EpochAvg LossAccuracyWhat Happened
00.74161/4 (25%)Random weights — almost guessing
50.66012/4 (50%)Starting to separate positive from negative
60.64454/4 (100%)All 4 sentences correctly classified!
100.57064/4 (100%)Confidence growing — predictions moving away from 0.5
200.25064/4 (100%)High confidence on all examples
290.12854/4 (100%)Near convergence — loss approaching zero

In just 6 epochs, our hand-coded BPTT trains the RNN to perfectly classify all 4 sentences. The remaining epochs increase confidence — pushing positive predictions toward 1.0 and negative ones toward 0.0.


Training with PyTorch

PyTorch's autograd engine performs BPTT automatically. When you call loss.backward(), it traverses the computation graph — including the unrolled RNN timesteps — and computes all gradients. Our entire backward() function is replaced by one line.

RNN Training — PyTorch (BPTT is automatic)
🐍rnn_train_pytorch.py
1import torch

PyTorch core library. Provides Tensor objects (like NumPy arrays but with automatic differentiation and GPU support). All neural network computations run on tensors.

EXECUTION STATE
torch = Core PyTorch — provides Tensor, autograd, device management. loss.backward() uses autograd to compute ALL gradients automatically — replacing our entire backward() function.
2import torch.nn as nn

Neural network module with building blocks: nn.RNN, nn.Linear, nn.Module, and loss functions like nn.BCEWithLogitsLoss.

EXECUTION STATE
torch.nn = Provides nn.Module (base class), nn.RNN (recurrent layer), nn.Linear (fully connected), nn.BCEWithLogitsLoss (loss function). Aliased as 'nn'.
4class SentimentRNN(nn.Module):

Same model architecture as Section 1. nn.Module base class provides automatic parameter tracking, gradient computation, and serialization.

EXECUTION STATE
SentimentRNN = Custom model: nn.RNN (recurrence) + nn.Linear (output projection). Identical to Section 1, but now we'll TRAIN it.
5def __init__(self, input_size, hidden_size, output_size):

Constructor defining the architecture. Called once to create the model.

EXECUTION STATE
⬇ input: input_size = 2 = Each word is a 2D embedding
⬇ input: hidden_size = 3 = Hidden state has 3 elements
⬇ input: output_size = 1 = Single output for binary classification
6super().__init__()

Initialize nn.Module internals. Mandatory before defining any layers.

7self.hidden_size = hidden_size

Store for creating h0 in forward(). Just a config value, not a learnable parameter.

EXECUTION STATE
self.hidden_size = 3 = Used in forward() to create torch.zeros(1, batch, 3).
8self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)

Creates the RNN layer. This one line replaces our W_xh, W_hh, b_h AND the entire recurrence loop. PyTorch handles weight initialization, the for loop, and — critically — the gradient computation (BPTT) automatically via autograd.

EXECUTION STATE
📚 nn.RNN(in, hid, batch_first=True) = Creates an RNN with weight_ih (3,2), weight_hh (3,3), bias_ih (3,), bias_hh (3,) = 21 parameters. batch_first=True: input shape is (batch, seq, features).
→ BPTT is automatic = When you call loss.backward(), PyTorch's autograd records every operation in the forward pass and automatically computes BPTT gradients. Our entire backward() function is replaced by one line.
9self.fc = nn.Linear(hidden_size, output_size)

Output layer: maps 3D hidden state to 1D logit. Replaces our W_hy and b_y.

EXECUTION STATE
📚 nn.Linear(3, 1) = Weight (1, 3) + bias (1,) = 4 parameters. Total model: 21 + 4 = 25 params.
11def forward(self, x):

Forward pass — identical to Section 1. PyTorch records all operations for automatic backpropagation.

EXECUTION STATE
⬇ input: x — shape (batch, seq_len, input_size) = For our data: (4, 4, 2) = 4 sentences, each 4 words, each 2D embedding.
12h0 = torch.zeros(1, x.size(0), self.hidden_size)

Initial hidden state of zeros. Shape: (num_layers=1, batch_size, hidden_size).

EXECUTION STATE
📚 x.size(0) = Batch size. For x of shape (4, 4, 2): x.size(0) = 4.
h0 shape: (1, 4, 3) = One layer, 4 sentences in batch, 3 hidden units — all zeros.
13output, h_n = self.rnn(x, h0)

Run the entire RNN forward pass. PyTorch internally loops through all timesteps, applying the recurrence at each step. Returns hidden states at all timesteps (output) and the final hidden state (h_n).

EXECUTION STATE
output shape: (4, 4, 3) = Hidden states at all 4 timesteps for all 4 sentences in the batch.
h_n shape: (1, 4, 3) = Final hidden state for each sentence. This is what we use for classification.
→ Autograd records this = Every matrix multiply and tanh inside self.rnn is recorded in the computation graph. loss.backward() will automatically unroll these for BPTT.
14logit = self.fc(h_n.squeeze(0))

Project final hidden states to sentiment logits. squeeze(0) removes the num_layers dimension.

EXECUTION STATE
📚 .squeeze(0) = h_n: (1, 4, 3) → (4, 3). Removes dim 0 (num_layers=1).
self.fc: (4, 3) → (4, 1) = Each sentence gets one logit value.
⬆ logit shape: (4, 1) = Raw sentiment scores for all 4 sentences.
15return logit

Return raw logits. Sigmoid is applied externally — BCEWithLogitsLoss combines sigmoid + loss for numerical stability.

EXECUTION STATE
⬆ return: logit (4, 1) = Raw sentiment scores. Positive → positive sentiment, negative → negative sentiment.
17# --- Training setup ---

Create model, loss function, and optimizer. These three objects are the standard PyTorch training setup.

18model = SentimentRNN(input_size=2, hidden_size=3, output_size=1)

Instantiate the model. Weights are randomly initialized by PyTorch (using Kaiming uniform for RNN and Linear layers). Total: 25 learnable parameters.

EXECUTION STATE
model = SentimentRNN with: • self.rnn: nn.RNN(2, 3) — 21 params • self.fc: nn.Linear(3, 1) — 4 params Total: 25 parameters to learn
19criterion = nn.BCEWithLogitsLoss()

Binary Cross-Entropy loss WITH sigmoid built in. This is numerically more stable than applying sigmoid separately and then computing BCE, because it uses the log-sum-exp trick internally.

EXECUTION STATE
📚 nn.BCEWithLogitsLoss() = Combines sigmoid + BCE in one step: L = -[t·log(σ(y)) + (1-t)·log(1-σ(y))] Numerically stable because it never computes σ(y) explicitly — avoids log(0).
→ vs nn.BCELoss() = BCELoss requires you to apply sigmoid first, which can produce 0 or 1 due to floating point limits, causing log(0) = -inf. BCEWithLogitsLoss avoids this.
20optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

Stochastic Gradient Descent optimizer. model.parameters() returns ALL learnable parameters (from nn.RNN and nn.Linear). The optimizer stores a reference to these parameters and updates them when optimizer.step() is called.

EXECUTION STATE
📚 torch.optim.SGD(params, lr) = Implements w ← w − lr × ∂L/∂w for each parameter. Same as our manual W_xh -= lr * grads[0], but handles all parameters automatically.
📚 model.parameters() = Iterator over all learnable tensors: rnn.weight_ih_l0, rnn.weight_hh_l0, rnn.bias_ih_l0, rnn.bias_hh_l0, fc.weight, fc.bias. PyTorch discovered these because they are nn.Module attributes.
lr = 0.5 = Same learning rate as our NumPy version for fair comparison.
22# Training data: (batch, seq_len, input_size)

Unlike NumPy where we processed one sentence at a time, PyTorch processes ALL 4 sentences in a single batch. This is efficient and leverages parallelism.

23X_train = torch.tensor([...]) — all 4 sentences as one batch

Creates a single tensor with all training data. Shape (4, 4, 2): 4 sentences × 4 words × 2 embedding dims. Each sentence's embeddings are in a separate 'row' of the batch dimension.

EXECUTION STATE
📚 torch.tensor(data) = Creates a tensor from nested lists. Infers float32 dtype from the float literals.
X_train shape: (4, 4, 2) = Batch=4 sentences, seq_len=4 words each, input_size=2 dims per word. All processed in parallel by nn.RNN.
X_train[0] = [[1.0,0.0],[0.5,0.8],[0.3,0.7],[0.9,0.1]] — 'I love this movie'
X_train[1] = [[1.0,0.0],[0.7,-0.6],[0.3,0.7],[0.9,0.1]] — 'I hate this movie'
X_train[2] = [[0.3,0.7],[0.9,0.1],[0.2,0.3],[0.4,0.9]] — 'this movie is great'
X_train[3] = [[0.3,0.7],[0.9,0.1],[0.2,0.3],[0.6,-0.5]] — 'this movie is bad'
29y_train = torch.tensor([[1.0],[0.0],[1.0],[0.0]])

Target labels. Shape (4, 1) to match the model output shape (4, 1).

EXECUTION STATE
y_train shape: (4, 1) = 4 targets, each a single float. 1.0 = positive, 0.0 = negative.
31# --- Training loop ---

The PyTorch training loop is remarkably concise. What took ~30 lines of NumPy (forward, backward, gradient accumulation, clipping, update) becomes 6 lines. The key magic: loss.backward() performs BPTT automatically.

32for epoch in range(30):

Train for 30 epochs, same as NumPy. Each epoch: zero gradients → forward → loss → backward (BPTT) → clip → update.

33optimizer.zero_grad() — reset gradients to zero

Clear all parameter gradients from the previous iteration. PyTorch ACCUMULATES gradients by default (for techniques like gradient accumulation), so we must explicitly zero them at the start of each step.

EXECUTION STATE
📚 optimizer.zero_grad() = Sets .grad = None (or zeros) for every parameter the optimizer manages. Without this, gradients from epoch 0 would add to epoch 1's gradients, giving wrong updates.
→ Why accumulate by default? = Useful for large batches that don't fit in GPU memory: process sub-batches, accumulate gradients, then step. For our case, we zero every time.
34logits = model(X_train) — forward pass on entire batch

Process ALL 4 sentences in parallel. model(X_train) calls forward(), which runs the RNN on the (4, 4, 2) input tensor. Returns 4 logits in one shot.

EXECUTION STATE
logits shape: (4, 1) = One sentiment score per sentence. All computed in parallel via batch processing.
→ vs NumPy = Our NumPy version processed one sentence at a time in a for loop. PyTorch's nn.RNN handles the entire batch in optimized C++/CUDA code.
35loss = criterion(logits, y_train) — compute batch loss

Compute the average BCE loss across all 4 sentences. BCEWithLogitsLoss applies sigmoid internally and averages across the batch by default.

EXECUTION STATE
📚 criterion(input, target) = BCEWithLogitsLoss: applies sigmoid to logits, computes BCE, averages over batch. Returns a single scalar tensor.
loss (scalar tensor) = Average loss across the batch. Tracks the computation graph for backpropagation.
36loss.backward() — BPTT happens HERE

THIS SINGLE LINE replaces our entire backward() function. PyTorch's autograd engine traverses the computation graph in reverse, computing gradients for EVERY parameter. For the RNN, this means unrolling through all 4 timesteps — full BPTT — automatically.

EXECUTION STATE
📚 loss.backward() = Autograd computes ∂loss/∂p for every parameter p. For the RNN, this includes: 1. ∂L/∂W_hy, ∂L/∂b_y (output layer) 2. ∂L/∂h₄ (into hidden state) 3. BPTT: ∂L/∂h₃ → ∂L/∂h₂ → ∂L/∂h₁ (through time) 4. ∂L/∂W_xh, ∂L/∂W_hh, ∂L/∂b_h (accumulated at each timestep)
→ After this call = Every parameter tensor has a .grad attribute with its gradient. E.g., model.rnn.weight_ih_l0.grad has shape (3, 2).
→ Efficiency = PyTorch computes all gradients in one reverse pass using the chain rule. The cost is roughly 2× the forward pass (need to store intermediate values, then traverse backward).
37torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)

Gradient clipping — same concept as our NumPy version. Compute the global L2 norm of all parameter gradients; if it exceeds 5.0, scale all gradients down proportionally.

EXECUTION STATE
📚 clip_grad_norm_(params, max_norm) = Computes total_norm = √(Σ ||p.grad||²) across all params. If total_norm > max_norm, multiplies every p.grad by max_norm/total_norm. The underscore _ means in-place modification.
⬇ arg 1: model.parameters() = All 6 parameter tensors (weight_ih, weight_hh, bias_ih, bias_hh, fc.weight, fc.bias).
⬇ arg 2: max_norm = 5.0 = Same threshold as our NumPy implementation.
→ Returns = The total gradient norm (before clipping). Useful for logging.
38optimizer.step() — update all weights

Apply the SGD update rule to ALL parameters: w ← w − lr × w.grad. This single line replaces our 5 manual update lines (W_xh -= ..., W_hh -= ..., etc.).

EXECUTION STATE
📚 optimizer.step() = For SGD: param.data -= lr × param.grad for each parameter. For Adam/AdaGrad, the update rule is more complex but the interface is identical.
→ After this call = All 25 parameters have been updated. The model is now slightly better at sentiment classification.
40with torch.no_grad():

Disable gradient tracking for the accuracy computation. We don't need to backpropagate through this — it's just for reporting.

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd. Operations inside don't build a computation graph. Saves memory and computation for inference/evaluation.
41preds = (torch.sigmoid(logits) >= 0.5).float()

Convert logits to binary predictions: apply sigmoid, threshold at 0.5, convert boolean to float (1.0 or 0.0).

EXECUTION STATE
📚 torch.sigmoid(logits) = Element-wise sigmoid: σ(y) = 1/(1+e^(-y)). Shape (4, 1) → (4, 1).
📚 >= 0.5 = Element-wise comparison: returns BoolTensor. True where probability ≥ 0.5.
📚 .float() = Convert bool to float: True → 1.0, False → 0.0.
preds shape: (4, 1) = Binary predictions: [[1.0], [0.0], [1.0], [0.0]] (when correct).
42acc = (preds == y_train).float().mean()

Compute accuracy: compare predictions to targets, convert to float, take the mean.

EXECUTION STATE
📚 .mean() = Average of all elements. For [1,1,1,0]: mean = 0.75 (75% accuracy).
acc (scalar tensor) = Accuracy as a fraction. 1.0 = all correct, 0.25 = 1 of 4 correct.
43if epoch % 5 == 0: print(...)

Print progress every 5 epochs, same as NumPy version.

EXECUTION STATE
📚 loss.item() = Extracts a Python float from a 0-dim tensor. loss is a tensor; .item() gives 0.7416.
📚 acc.item() = Extracts accuracy as float. 0.25 → '25%' with :.0% format.
44print(f"Epoch {epoch}: Loss=..., Acc=...")

Formatted progress output showing training convergence.

EXECUTION STATE
Expected output (with different random init) = Epoch 0: Loss=~0.74, Acc=25% Epoch 5: Loss=~0.66, Acc=50% Epoch 10: Loss=~0.47, Acc=100% Epoch 20: Loss=~0.15, Acc=100% Epoch 25: Loss=~0.08, Acc=100%
→ Note on exact values = PyTorch uses random weight initialization (different from our NumPy weights), so exact loss/accuracy numbers will differ from our NumPy version. The learning PATTERN (loss decreasing, accuracy improving) is the same.
12 lines without explanation
1import torch
2import torch.nn as nn
3
4class SentimentRNN(nn.Module):
5    def __init__(self, input_size, hidden_size, output_size):
6        super().__init__()
7        self.hidden_size = hidden_size
8        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
9        self.fc  = nn.Linear(hidden_size, output_size)
10
11    def forward(self, x):
12        h0 = torch.zeros(1, x.size(0), self.hidden_size)
13        output, h_n = self.rnn(x, h0)
14        logit = self.fc(h_n.squeeze(0))
15        return logit
16
17# --- Training setup ---
18model = SentimentRNN(input_size=2, hidden_size=3, output_size=1)
19criterion = nn.BCEWithLogitsLoss()
20optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
21
22# Training data: (batch, seq_len, input_size)
23X_train = torch.tensor([
24    [[1.0,0.0],[0.5,0.8],[0.3,0.7],[0.9,0.1]],
25    [[1.0,0.0],[0.7,-0.6],[0.3,0.7],[0.9,0.1]],
26    [[0.3,0.7],[0.9,0.1],[0.2,0.3],[0.4,0.9]],
27    [[0.3,0.7],[0.9,0.1],[0.2,0.3],[0.6,-0.5]],
28])
29y_train = torch.tensor([[1.0],[0.0],[1.0],[0.0]])
30
31# --- Training loop ---
32for epoch in range(30):
33    optimizer.zero_grad()
34    logits = model(X_train)
35    loss = criterion(logits, y_train)
36    loss.backward()
37    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
38    optimizer.step()
39
40    with torch.no_grad():
41        preds = (torch.sigmoid(logits) >= 0.5).float()
42        acc = (preds == y_train).float().mean()
43    if epoch % 5 == 0:
44        print(f"Epoch {epoch}: Loss={loss.item():.4f}, "
45              f"Acc={acc.item():.0%}")

NumPy vs. PyTorch: what changed?

AspectNumPy (from scratch)PyTorch
Forward passManual loop + state storagemodel(X_train) — one call
Backward pass40-line backward() functionloss.backward() — one line
Gradient clippingManual norm computationclip_grad_norm_() — one line
Weight update5 separate linesoptimizer.step() — one line
Batch processingLoop over sentencesAll sentences in one tensor
GPU supportNonemodel.to('cuda')
The math is identical. PyTorch's autograd computes exactly the same gradients as our hand-coded BPTT. The forward pass stores intermediate tensors on the computation graph, and backward() traverses it in reverse, applying the chain rule at each node. Understanding the NumPy version means you understand what PyTorch is doing under the hood.

Gradient Clipping: Taming Exploding Gradients

While vanishing gradients cause slow learning for early timesteps, exploding gradients cause catastrophic training failure. If the spectral radius of WhhW_{hh} is greater than 1, the gradient can grow exponentially at each backward step instead of shrinking.

The solution is gradient clipping: compute the global L2 norm of all gradients, and if it exceeds a threshold, scale all gradients down proportionally:

if >τ:τ\text{if } \|\nabla\| > \tau: \quad \nabla \leftarrow \frac{\tau}{\|\nabla\|} \cdot \nabla

This preserves the direction of the gradient (which weights should increase vs. decrease) while limiting the magnitude (how big the step is). Think of it as saying: “adjust the weights in the right direction, but take a smaller step if the gradient is too large.”

ScenarioGrad NormThresholdAction
Normal training0.1965.0No clipping — gradient is small
Moderate explosion12.55.0Scale by 5.0/12.5 = 0.4 — gradients shrink 60%
Severe explosion1000.05.0Scale by 5.0/1000 = 0.005 — gradients shrink 99.5%

In our tiny 4-word model, gradient clipping rarely activates (the norm stays below 5.0). But in production RNNs with hundreds of timesteps, clipping is essential — without it, a single exploding gradient can destroy all learned weights in one step.

Clipping fixes explosion but not vanishing. You can prevent gradients from becoming too large, but you cannot force vanishing gradients to become larger without changing the architecture. This asymmetry is why LSTMs and GRUs (Chapter 17) were invented — they introduce gates that allow gradients to flow undiminished through many timesteps.

Watching an RNN Learn

The interactive visualization below shows 30 epochs of training on our 4-sentence dataset. Watch the loss curve decrease and the prediction probabilities separate: positive sentences push toward 1.0, negative sentences push toward 0.0.

Loading training visualizer...

Several patterns are worth noting in the training dynamics:

  1. Accuracy precedes confidence. The model achieves 100% accuracy at epoch 6, but the loss is still 0.64. The remaining 24 epochs increase confidence without changing any predictions — the model pushes probabilities further from the 0.5 boundary.
  2. “This movie is great” separates first. This sentence has the most distinctive signal (both “great” and “movie” have unique embeddings), so the model learns it fastest. “I love/hate this movie” (which differ by only one word) takes longer.
  3. Loss is not monotonically decreasing. Around epoch 28-29, small fluctuations appear. This is normal — with a learning rate of 0.5 and batch gradient descent, overshooting can occur.

Key Takeaways

  1. BPTT is standard backpropagation on an unrolled graph. The RNN is conceptually “unfolded” through time, and the chain rule is applied backward through each timestep. Because weights are shared, gradients are accumulated (summed) across timesteps.
  2. Gradients decay exponentially through time. Each backward step multiplies by WhhW_{hh}^\top and (1ht2)(1 - h_t^2). In our 4-word example, the gradient decayed 12× from the last word to the first. For long sequences, early inputs become invisible.
  3. Gradient clipping prevents explosions. When gradients grow too large (spectral radius of Whh>1W_{hh} > 1), clipping scales them down while preserving direction. This is a band-aid — it prevents crashes but does not fix the vanishing gradient problem.
  4. PyTorch automates BPTT entirely. loss.backward() computes the same gradients as our hand-coded implementation, by recording the computation graph during the forward pass and traversing it in reverse.
Coming next: In Section 3, we will dissect the vanishing gradient problem mathematically — showing exactly why the spectral radius of WhhW_{hh} determines whether an RNN can learn long-range dependencies, and setting the stage for LSTM and GRU architectures in Chapter 17 that solve this problem through gating mechanisms.
Loading comments...