Apply the chain rule to compute gradients layer by layer
Compute every gradient by hand for our 2×2 diagonal flip network
Understand why ReLU kills some gradients and what "dead neurons" mean for learning
What we're about to do: Trace the gradient backward through every layer, every neuron, every weight—from the loss all the way back to the input. Every single number, computed by hand. This is the core of deep learning.
Recall our network from Chapter 7:
| Variable | Value | Meaning |
|---|---|---|
| x | [1, 0, 1, 1] | Input (flattened 2×2 image) |
| z⁽¹⁾ | [0.0, −0.1, 0.5] | Hidden pre-activation |
| h | [0.0, 0.0, 0.5] | Hidden post-ReLU |
| ŷ | [0.1, −0.1, −0.05, −0.25] | Network output |
| y | [1, 1, 0, 1] | Target |
| L (MSE) | 0.896 | Loss |
The Chain Rule: Our Main Tool
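Backpropagation is just the chain rule of calculus applied repeatedly. If L depends on ŷ, which depends on h, which depends on W₁, then:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$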
We start at the loss and work backward, computing one link of the chain at a time. Each step reuses the results from the previous step—that's what makes it efficient.
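To make the chain concrete, here is a minimal scalar sketch. It is not our network: the one-weight-per-layer model and the values of `x`, `y`, `w1`, and `w2` are hypothetical, chosen only so the arithmetic is easy to follow. It multiplies the three links of the chain and compares the result with a finite-difference estimate.

```python
def relu(z):
    return max(0.0, z)

def loss_fn(w1, w2, x, y):
    # Tiny scalar "network": x -> relu(w1 * x) -> w2 * h -> squared error
    h = relu(w1 * x)
    y_hat = w2 * h
    return (y_hat - y) ** 2

# Hypothetical values, for illustration only
x, y = 2.0, 1.0
w1, w2 = 0.5, 0.3

# Forward pass, keeping the intermediate values
z = w1 * x          # 1.0
h = relu(z)         # 1.0
y_hat = w2 * h      # 0.3

# Backward pass: one link of the chain at a time
dL_dyhat = 2 * (y_hat - y)        # dL/dy_hat
dyhat_dh = w2                     # dy_hat/dh
dh_dw1 = x if z > 0 else 0.0      # dh/dw1 (ReLU passes gradient only if z > 0)
dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)

print(dL_dw1, numeric)  # both approximately -0.84
```

The product of the three local derivatives agrees with the numerical estimate, which is all the chain rule promises.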
Step 1: Gradient of the Loss
Our loss function is Mean Squared Error:
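$$L = \frac{1}{4}\sum_{i=0}^{3}(\hat{y}_i - y_i)^2$$
The gradient of L with respect to each output ŷᵢ is:
$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{4}(\hat{y}_i - y_i) = \frac{1}{2}(\hat{y}_i - y_i)$$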
Let's compute each one:
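| Gradient | ŷᵢ | yᵢ | ŷᵢ − yᵢ | ½(ŷᵢ − yᵢ) |
|---|---|---|---|---|
| ∂L/∂ŷ₀ | 0.10 | 1 | −0.90 | −0.45 |
| ∂L/∂ŷ₁ | −0.10 | 1 | −1.10 | −0.55 |
| ∂L/∂ŷ₂ | −0.05 | 0 | −0.05 | −0.025 |
| ∂L/∂ŷ₃ | −0.25 | 1 | −1.25 | −0.625 |

$$\frac{\partial L}{\partial \hat{y}} = \begin{bmatrix} -0.45 \\ -0.55 \\ -0.025 \\ -0.625 \end{bmatrix}$$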
What do these numbers mean? The gradient for ŷ₃ is the most negative (−0.625), meaning this output is the furthest from its target. The gradient for ŷ₂ is nearly zero (−0.025), meaning this output is already close to correct, which makes sense: ŷ₂ = −0.05 is close to the target of 0.
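The same numbers can be reproduced in a few lines of NumPy. This is a minimal sketch using only the ŷ and y values from the table at the top of this section; the variable names are our own choice.

```python
import numpy as np

y_hat = np.array([0.1, -0.1, -0.05, -0.25])  # network output
y = np.array([1.0, 1.0, 0.0, 1.0])           # target

# Mean squared error over the 4 outputs
loss = np.mean((y_hat - y) ** 2)

# dL/dy_hat[i] = (2/4) * (y_hat[i] - y[i])
dL_dy = 2 * (y_hat - y) / y.size

print(loss)   # approximately 0.89625
print(dL_dy)  # approximately [-0.45, -0.55, -0.025, -0.625]
```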
Step 2: Output Layer Weights (W2)
The output is ŷⱼ = Σᵢ hᵢ · W₂[i][j] + b₂[j]. By the chain rule:
$$\frac{\partial L}{\partial W_2[i][j]} = h_i \cdot \frac{\partial L}{\partial \hat{y}_j}$$
Since h=[0.0,0.0,0.5], the first two hidden neurons are dead (zero after ReLU). Their gradients are all zero:
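| Weight | hᵢ | × ∂L/∂ŷⱼ | = Gradient |
|---|---|---|---|
| W₂[0][0..3] | h₀ = 0.0 | × anything | = 0 (all four) |
| W₂[1][0..3] | h₁ = 0.0 | × anything | = 0 (all four) |
| W₂[2][0] | h₂ = 0.5 | × (−0.45) | = −0.225 |
| W₂[2][1] | h₂ = 0.5 | × (−0.55) | = −0.275 |
| W₂[2][2] | h₂ = 0.5 | × (−0.025) | = −0.0125 |
| W₂[2][3] | h₂ = 0.5 | × (−0.625) | = −0.3125 |

$$\frac{\partial L}{\partial W_2} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -0.225 & -0.275 & -0.0125 & -0.3125 \end{bmatrix}$$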
8 out of 12 gradients are zero. Because hidden neurons 0 and 1 output zero (they were killed by ReLU), none of their outgoing weights get any gradient signal. Only hidden neuron 2 (which output 0.5) contributes to learning. This is a real phenomenon in neural networks—dead neurons don't learn.
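The whole table is one outer product. A minimal sketch, using the h and ∂L/∂ŷ values already computed in this section:

```python
import numpy as np

h = np.array([0.0, 0.0, 0.5])                     # hidden output after ReLU
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])  # gradient from Step 1

# dL/dW2[i][j] = h[i] * dL/dy_hat[j] for every pair (i, j)
dL_dW2 = np.outer(h, dL_dy)   # shape (3, 4)

# Count how many of the 12 entries are zero
zero_count = dL_dW2.size - np.count_nonzero(dL_dW2)

print(dL_dW2[2])    # approximately [-0.225, -0.275, -0.0125, -0.3125]
print(zero_count)   # 8
```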
Step 3: Output Layer Biases (b2)
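The bias gradient is the simplest. Since ŷⱼ = … + b₂[j], the derivative of ŷⱼ with respect to b₂[j] is just 1:
$$\frac{\partial L}{\partial b_2[j]} = \frac{\partial L}{\partial \hat{y}_j} \cdot 1 = \frac{\partial L}{\partial \hat{y}_j}$$
$$\frac{\partial L}{\partial b_2} = \begin{bmatrix} -0.45 \\ -0.55 \\ -0.025 \\ -0.625 \end{bmatrix}$$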
Unlike weights, biases always get gradient signal—they don't depend on the hidden layer output. Even when hidden neurons are dead, the biases of the output layer still learn.
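One way to convince yourself of this is a numerical check. The sketch below relies on a single observation: adding a small amount to b₂[j] shifts ŷⱼ by exactly that amount, so we never need the weights themselves. The helper name `loss_with_bias_shift` is ours, introduced only for this illustration.

```python
import numpy as np

y_hat = np.array([0.1, -0.1, -0.05, -0.25])
y = np.array([1.0, 1.0, 0.0, 1.0])

def loss_with_bias_shift(delta):
    # Adding delta to b2 shifts y_hat by the same delta
    return np.mean((y_hat + delta - y) ** 2)

# Analytic bias gradient: identical to dL/dy_hat
dL_db2 = 2 * (y_hat - y) / y.size

# Central finite differences, one bias at a time
eps = 1e-6
numeric = np.zeros(4)
for j in range(4):
    delta = np.zeros(4)
    delta[j] = eps
    numeric[j] = (loss_with_bias_shift(delta) - loss_with_bias_shift(-delta)) / (2 * eps)

print(dL_db2)   # approximately [-0.45, -0.55, -0.025, -0.625]
print(numeric)  # matches dL_db2 up to floating-point error
```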
Step 4: Backpropagate to Hidden Layer
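Now comes the key step. We need to know: how does the loss change when we change a hidden neuron's output? Each hidden neuron hᵢ connects to all 4 output neurons, so we sum the effects:
$$\frac{\partial L}{\partial h_i} = \sum_{j=0}^{3} W_2[i][j] \cdot \frac{\partial L}{\partial \hat{y}_j}$$
In matrix form:
$$\frac{\partial L}{\partial h} = W_2 \cdot \frac{\partial L}{\partial \hat{y}}$$
With the Chapter 7 weights, this evaluates to ∂L/∂h = [−0.0975, −0.3475, 0.44]. The weight matrix W₂ carries the error from the output layer back to the hidden layer. This is why it's called backpropagation: the gradient flows backward through the same connections the data flowed forward through.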
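The sketch below shows that the per-neuron sum and the matrix form are the same computation. The Chapter 7 weights are not repeated in this section, so the `W2` used here is a hypothetical placeholder; the resulting numbers will not match the ∂L/∂h values used in Step 5.

```python
import numpy as np

# Hypothetical 3x4 weights, for illustration only (NOT the Chapter 7 values)
W2 = np.array([
    [ 0.1, -0.2,  0.3,  0.0],
    [ 0.5,  0.1, -0.1,  0.2],
    [-0.3,  0.4,  0.2, -0.1],
])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])  # gradient from Step 1

# Explicit sum: each hidden neuron collects error from all 4 outputs
dL_dh_loop = np.zeros(3)
for i in range(3):
    for j in range(4):
        dL_dh_loop[i] += W2[i, j] * dL_dy[j]

# Matrix form: (3, 4) @ (4,) -> (3,)
dL_dh = W2 @ dL_dy

print(np.allclose(dL_dh_loop, dL_dh))  # True
```

The matrix form has no explicit transpose because this chapter stores W₂ with one row per hidden neuron, so each row already lists that neuron's outgoing weights.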
Step 5: Backpropagate Through ReLU
We have the gradient at h (after ReLU). But to get gradients for W₁ and b₁, we need the gradient at z⁽¹⁾ (before ReLU). ReLU's derivative is:
$$\mathrm{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$
Apply element-wise:
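| Neuron | z⁽¹⁾ | ReLU′(z) | ∂L/∂h | ∂L/∂z = ∂L/∂h × ReLU′(z) |
|---|---|---|---|---|
| 0 | 0.0 | 0 (dead) | −0.0975 | 0.0 |
| 1 | −0.1 | 0 (dead) | −0.3475 | 0.0 |
| 2 | 0.5 | 1 (alive) | 0.44 | 0.44 |

$$\frac{\partial L}{\partial z^{(1)}} = \begin{bmatrix} 0 \\ 0 \\ 0.44 \end{bmatrix}$$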
ReLU is a gradient gate. Neurons 0 and 1 had z≤0, so ReLU multiplied their gradients by zero. Even though the loss "wants" these neurons to change (the gradient from the output layer was -0.0975 and -0.3475), the ReLU gate blocks the signal. This is the dead neuron problem—neurons that output zero can't learn from this training example.
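In code, the gate is a 0/1 mask built from the pre-activations. A minimal sketch using the z⁽¹⁾ and ∂L/∂h values from the table above:

```python
import numpy as np

z1 = np.array([0.0, -0.1, 0.5])              # hidden pre-activation
dL_dh = np.array([-0.0975, -0.3475, 0.44])   # gradient arriving from Step 4

# ReLU'(z): 1 where z > 0, else 0 (z = 0 counts as blocked)
relu_grad = (z1 > 0).astype(float)

# Element-wise gate: blocked neurons get zero gradient
dL_dz1 = dL_dh * relu_grad

print(relu_grad)  # [0. 0. 1.]
print(dL_dz1)     # zero for neurons 0 and 1, 0.44 for neuron 2
```

NumPy may print the blocked entries as `-0.`, since a negative number times 0.0 is negative zero; it compares equal to 0.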
Step 6: Hidden Layer Weights (W1)
Same pattern as Step 2, but now for the first layer. Since zⱼ⁽¹⁾ = Σᵢ xᵢ · W₁[i][j] + b₁[j]:
$$\frac{\partial L}{\partial W_1[i][j]} = x_i \cdot \frac{\partial L}{\partial z_j^{(1)}}$$
Since ∂L/∂z⁽¹⁾ = [0, 0, 0.44], only column 2 (neuron 2) gets any gradient:
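| Weight | xᵢ | × ∂L/∂zⱼ | = Gradient |
|---|---|---|---|
| W₁[i][0] (all) | any | × 0 | = 0 (dead neuron) |
| W₁[i][1] (all) | any | × 0 | = 0 (dead neuron) |
| W₁[0][2] | x₀ = 1 | × 0.44 | = 0.44 |
| W₁[1][2] | x₁ = 0 | × 0.44 | = 0 (input was 0) |
| W₁[2][2] | x₂ = 1 | × 0.44 | = 0.44 |
| W₁[3][2] | x₃ = 1 | × 0.44 | = 0.44 |

$$\frac{\partial L}{\partial W_1} = \begin{bmatrix} 0 & 0 & 0.44 \\ 0 & 0 & 0 \\ 0 & 0 & 0.44 \\ 0 & 0 & 0.44 \end{bmatrix}$$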
Only 3 out of 12 weights get gradient signal. Two reasons: (a) dead neurons 0 and 1 block gradients to their entire column, (b) input x₁ = 0 zeroes out row 1. On other training examples where different neurons are alive and different inputs are non-zero, different weights will get gradients.
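As in Step 2, the full matrix is a single outer product. A minimal sketch using the input x and the ∂L/∂z⁽¹⁾ values from Step 5:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])    # flattened input image
dL_dz1 = np.array([0.0, 0.0, 0.44])   # gradient after the ReLU gate

# dL/dW1[i][j] = x[i] * dL/dz1[j]
dL_dW1 = np.outer(x, dL_dz1)          # shape (4, 3)

print(dL_dW1[:, 2])                   # [0.44, 0.0, 0.44, 0.44]
print(np.count_nonzero(dL_dW1))       # 3
```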
Step 7: Hidden Layer Biases (b1)
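$$\frac{\partial L}{\partial b_1[j]} = \frac{\partial L}{\partial z_j^{(1)}}$$
$$\frac{\partial L}{\partial b_1} = \begin{bmatrix} 0 \\ 0 \\ 0.44 \end{bmatrix}$$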
Again, only bias 2 gets a gradient. Biases 0 and 1 are blocked by the dead ReLU neurons.
All 31 Gradients at a Glance
Here's every gradient we just computed, organized by parameter:
| Parameter | Gradient | Will it change? |
|---|---|---|
| W₁ column 0 (4 weights) | all 0 | ❌ Dead neuron 0 |
| W₁ column 1 (4 weights) | all 0 | ❌ Dead neuron 1 |
| W₁[0][2] | 0.44 | ✅ Will decrease |
| W₁[1][2] | 0 | ❌ Input was 0 |
| W₁[2][2] | 0.44 | ✅ Will decrease |
| W₁[3][2] | 0.44 | ✅ Will decrease |
| b₁[0], b₁[1] | 0, 0 | ❌ Dead neurons |
| b₁[2] | 0.44 | ✅ Will decrease |
| W₂ rows 0,1 (8 weights) | all 0 | ❌ Dead hidden neurons |
| W₂[2][0] | −0.225 | ✅ Will increase |
| W₂[2][1] | −0.275 | ✅ Will increase |
| W₂[2][2] | −0.0125 | ✅ Tiny increase |
| W₂[2][3] | −0.3125 | ✅ Will increase |
| b₂[0] | −0.45 | ✅ Will increase |
| b₂[1] | −0.55 | ✅ Will increase |
| b₂[2] | −0.025 | ✅ Tiny increase |
| b₂[3] | −0.625 | ✅ Will increase |
Out of 31 parameters, only 12 will actually change on this training step: 3 weights and 1 bias in the hidden layer, plus 4 weights and 4 biases in the output layer. The other 19 are blocked by dead neurons or zero inputs. On other training examples, different parameters will get gradient signal, which is why we train on many examples.
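A quick tally of the non-zero entries, as a sketch built only from the values computed in Steps 1 through 7 (the dictionary layout is our own choice):

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 0.0, 0.5])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])
dL_dz1 = np.array([0.0, 0.0, 0.44])

# One gradient array per parameter group
grads = {
    "W1": np.outer(x, dL_dz1),   # 4x3
    "b1": dL_dz1,                # 3
    "W2": np.outer(h, dL_dy),    # 3x4
    "b2": dL_dy,                 # 4
}

total = sum(g.size for g in grads.values())
nonzero = sum(int(np.count_nonzero(g)) for g in grads.values())

for name, g in grads.items():
    print(name, int(np.count_nonzero(g)), "of", g.size)
print(nonzero, "of", total)   # 12 of 31
```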
Backpropagation Visualized
Below is the same network from Chapter 7, but now showing gradients flowing backward. Step through each phase to see which paths carry signal and which are blocked by dead neurons.
[Interactive visualization: gradients flowing backward through the network]
Python Implementation
Here's the complete backpropagation in NumPy: every gradient we computed by hand, in 10 lines of code. The notes that follow explain each line and the values flowing through it.
Backpropagation — NumPy Implementation
🐍backprop.py
import numpy as np
NumPy provides fast N-dimensional arrays and matrix operations. The @ operator, np.outer(), np.maximum(), and np.round() are all used for backpropagation math.
EXECUTION STATE
numpy = Library for numerical computing — ndarray type, linear algebra, element-wise math
# ── Network weights (same as Chapter 7) ──
We reuse the exact same weight matrices from the forward pass in Chapter 7 so every gradient we compute matches the hand calculations in this section.
W1 is a 4×3 matrix connecting 4 inputs to 3 hidden neurons. Row i holds all weights leaving input x[i]; column j holds all weights entering hidden neuron j.
EXECUTION STATE
📚 np.array(nested_lists) = Creates an ndarray from nested Python lists. Shape is inferred: 4 inner lists of length 3 → shape (4,3).
W2 is a 3×4 matrix connecting 3 hidden neurons to 4 outputs. Row i holds weights leaving hidden neuron h[i]; column j holds weights entering output y[j]. This same matrix is used BACKWARD in Step 4 to propagate gradients from outputs to hidden layer.
The 2×2 diagonal image [[1,0],[1,1]] flattened to a 1D vector. x[1]=0 means row 1 of W1 will get zero gradient (input was zero → no gradient flows through it).
We need all forward-pass intermediate values (z1, h, y_hat) to compute gradients. Backprop requires both the upstream gradient AND the local values from the forward pass.
Compute hidden layer pre-activation. np.round(..., 10) eliminates floating-point noise: z1[0] is mathematically exactly 0.0 (0.2 + 0 + 0.1 - 0.4 + 0.1 = 0.0) but raw computation gives ~2.8e-17.
EXECUTION STATE
📚 np.round(array, decimals) = Round each element to the given number of decimal places. np.round(2.78e-17, 10) = 0.0. Eliminates floating-point noise from exact cancellations.
ReLU clamps negative values to zero. Neurons 0 and 1 die (z≤0), only neuron 2 survives. This determines which neurons can learn during backprop.
EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum of two arrays (or scalar and array). np.maximum(0, [-0.1]) = [0.0]. Different from np.max() which finds the single largest element.
⬇ z1 = [0.0, -0.1, 0.5]
h[0] = max(0, z1[0]) = max(0, 0.0) = 0.0, so neuron 0 is dead (on the boundary)
∂L/∂W₂[i][j] = h[i] × ∂L/∂ŷ[j]. np.outer computes every product h[i]×dL_dy[j] at once. Since h[0]=h[1]=0 (dead neurons), rows 0 and 1 are entirely zero — dead neurons don’t learn.
EXECUTION STATE
📚 np.outer(a, b) = Outer product: creates a matrix where element [i][j] = a[i] × b[j]. For vectors of length m and n, produces an m×n matrix. Every pair of elements gets multiplied.
∂L/∂b₂[j] = ∂L/∂ŷ[j] × 1. Since ŷ = h@W2 + b2, the derivative of ŷ with respect to b2 is just 1. Bias gradients are always equal to the loss gradient — they don’t depend on hidden activations.
EXECUTION STATE
📚 .copy() = Creates an independent copy of the array. Without copy(), dL_db2 would be a reference to dL_dy, and modifying one would change the other.
⬆ result: dL_db2 = [-0.45, -0.55, -0.025, -0.625] — same as dL_dy (biases always get gradient)
Error flows BACKWARD through the same weight matrix W₂. Each hidden neuron’s gradient is the dot product of its outgoing weights with the output gradients: ∂L/∂h[i] = Σⱼ W₂[i][j] × ∂L/∂ŷ[j]
EXECUTION STATE
📚 W2 @ dL_dy = Matrix-vector multiply: W2(3,4) × dL_dy(4,) = result(3,). Each row of W2 dots with dL_dy to produce one hidden gradient.
ReLU’(z) = 1 if z > 0, else 0. This creates a binary mask: neurons with z>0 pass gradient through, neurons with z≤0 block it completely. This is the ‘gradient gate’ that causes the dead neuron problem.
Element-wise multiply: the ReLU mask zeros out gradients for dead neurons. Only neuron 2 (relu_grad=1) passes its gradient through; neurons 0 and 1 are blocked for this training example.
∂L/∂W₁[i][j] = x[i] × ∂L/∂z[j]. Same outer product pattern as W₂. Since dL_dz1=[0,0,0.44], columns 0 and 1 are all zero (dead neurons). Since x[1]=0, row 1 is all zero (zero input).
10 lines of math. That's the entire backpropagation algorithm for our network. Every line corresponds to one of the 7 steps we computed by hand. The code mirrors the math line for line — no magic, no hidden complexity.
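The steps can also be gathered into one function and checked numerically. This is a sketch under stated assumptions, not the chapter's listing: the function names `forward`, `backward`, `loss`, and `numeric_grad` are ours, the `np.round` cleanup is omitted, and the weights are random placeholders because the Chapter 7 values are not repeated in this section. Only the input x and target y are taken from our example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1          # hidden pre-activation, shape (3,)
    h = np.maximum(0, z1)     # ReLU
    y_hat = h @ W2 + b2       # output, shape (4,)
    return z1, h, y_hat

def loss(x, y, W1, b1, W2, b2):
    return np.mean((forward(x, W1, b1, W2, b2)[2] - y) ** 2)

def backward(x, y, W1, b1, W2, b2):
    z1, h, y_hat = forward(x, W1, b1, W2, b2)
    dL_dy = 2 * (y_hat - y) / y.size   # Step 1
    dL_dW2 = np.outer(h, dL_dy)        # Step 2
    dL_db2 = dL_dy.copy()              # Step 3
    dL_dh = W2 @ dL_dy                 # Step 4
    dL_dz1 = dL_dh * (z1 > 0)          # Step 5
    dL_dW1 = np.outer(x, dL_dz1)       # Step 6
    dL_db1 = dL_dz1.copy()             # Step 7
    return dL_dW1, dL_db1, dL_dW2, dL_db2

# Random placeholder weights (NOT the Chapter 7 values)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=4)
x = np.array([1.0, 0.0, 1.0, 1.0])
y = np.array([1.0, 1.0, 0.0, 1.0])

def numeric_grad(param, eps=1e-6):
    # Central finite differences, perturbing one entry of param at a time
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss(x, y, W1, b1, W2, b2)
        param[idx] = old - eps
        minus = loss(x, y, W1, b1, W2, b2)
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

dL_dW1, dL_db1, dL_dW2, dL_db2 = backward(x, y, W1, b1, W2, b2)
for name, analytic, param in [("W1", dL_dW1, W1), ("b1", dL_db1, b1),
                              ("W2", dL_dW2, W2), ("b2", dL_db2, b2)]:
    print(name, np.allclose(analytic, numeric_grad(param), atol=1e-6))
```

A finite-difference check like this can disagree with the analytic gradient when a pre-activation sits exactly on the ReLU boundary (z = 0), which is the situation neuron 0 is in with the Chapter 7 weights.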
Summary
Start at the loss, compute ∂L/∂ŷ
Output layer: for the weights, multiply by the hidden layer output h; for the biases, use ∂L/∂ŷ directly
Backprop to hidden layer: multiply by W₂ (same weights, backward direction)
Through ReLU: zero out gradients for dead neurons (z ≤ 0)
Hidden layer: for the weights, multiply by the input x; for the biases, use ∂L/∂z⁽¹⁾ directly
In the next section, we'll apply these gradients to update all 31 parameters and see the loss drop.