Chapter 8

Backpropagation Algorithm

Backpropagation

Learning Objectives

By the end of this section, you will be able to:

  1. Apply the chain rule to compute gradients layer by layer
  2. Compute every gradient by hand for our 2×2 diagonal flip network
  3. Understand why ReLU kills some gradients and what "dead neurons" mean for learning

What we're about to do: trace the gradient backward through every layer, every neuron, and every weight, from the loss all the way back to the input. Every single number is computed by hand. This is the core of deep learning.

Recall our network from Chapter 7:

| Variable | Value | Meaning |
|---|---|---|
| x | [1, 0, 1, 1] | Input (flattened 2×2 image) |
| z⁽¹⁾ | [0.0, −0.1, 0.5] | Hidden pre-activation |
| h | [0.0, 0.0, 0.5] | Hidden post-ReLU |
| ŷ | [0.1, −0.1, −0.05, −0.25] | Network output |
| y | [1, 1, 0, 1] | Target |
| L (MSE) | 0.896 | Loss |

The Chain Rule: Our Main Tool

Backpropagation is just the chain rule of calculus applied repeatedly. If $L$ depends on $\hat{y}$, which depends on $h$, which depends on $W_1$, then:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$

We start at the loss and work backward, computing one link of the chain at a time. Each step reuses the results from the previous step—that's what makes it efficient.
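
To make the chain rule concrete before the full network, here is a minimal sketch on a scalar chain. The toy values and names (`x`, `w1`, `w2`, `y`) are assumptions chosen for illustration, not values from our network. It multiplies the three local derivatives and compares the product with a finite-difference estimate.

```python
# Toy scalar chain: h = w1 * x, y_hat = w2 * h, L = (y_hat - y) ** 2.
# All values are illustrative assumptions, not the network's weights.
x, w1, w2, y = 2.0, 0.5, -1.5, 1.0

def loss(w1_value):
    """Loss as a function of w1 only, with everything else held fixed."""
    h = w1_value * x
    y_hat = w2 * h
    return (y_hat - y) ** 2

# Forward pass at the current w1.
h = w1 * x
y_hat = w2 * h

# Local derivatives, one link of the chain each.
dL_dyhat = 2 * (y_hat - y)   # dL/dy_hat
dyhat_dh = w2                # dy_hat/dh
dh_dw1 = x                   # dh/dw1

# Chain rule: multiply the links together.
dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print(f"chain rule: {dL_dw1:.4f}")
print(f"numeric:    {numeric:.4f}")
```

The two printed numbers should agree to several decimal places. The same multiply-the-links pattern is what each step below applies to vectors and matrices.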


Step 1: Gradient of the Loss

Our loss function is Mean Squared Error:

$$L = \frac{1}{4} \sum_{i=0}^{3} (\hat{y}_i - y_i)^2$$

The gradient of $L$ with respect to each output $\hat{y}_i$ is:

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{4}(\hat{y}_i - y_i) = \frac{1}{2}(\hat{y}_i - y_i)$$

Let's compute each one:

| Output gradient | ŷᵢ | yᵢ | ŷᵢ − yᵢ | ½(ŷᵢ − yᵢ) |
|---|---|---|---|---|
| ∂L/∂ŷ₀ | 0.10 | 1 | −0.90 | −0.45 |
| ∂L/∂ŷ₁ | −0.10 | 1 | −1.10 | −0.55 |
| ∂L/∂ŷ₂ | −0.05 | 0 | −0.05 | −0.025 |
| ∂L/∂ŷ₃ | −0.25 | 1 | −1.25 | −0.625 |

$$\frac{\partial L}{\partial \hat{\mathbf{y}}} = \begin{bmatrix} -0.45 \\ -0.55 \\ -0.025 \\ -0.625 \end{bmatrix}$$

What do these numbers mean? The gradient for $\hat{y}_3$ is the most negative (−0.625), meaning this output is the furthest from its target. The gradient for $\hat{y}_2$ is nearly zero (−0.025), meaning this output is already close to correct, which makes sense: $\hat{y}_2 = -0.05$ is close to the target of 0.
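
As a quick check on Step 1, the sketch below computes the same gradient in NumPy and compares it with a finite-difference estimate. The helper `mse` and the step size `eps` are assumptions introduced here for illustration.

```python
import numpy as np

# Values from the forward-pass table above.
y_hat = np.array([0.1, -0.1, -0.05, -0.25])
target = np.array([1.0, 1.0, 0.0, 1.0])

def mse(pred):
    """Mean squared error against the fixed target."""
    return np.mean((pred - target) ** 2)

# Analytic gradient: (2 / N) * (y_hat - y) with N = 4.
dL_dy = (2 / y_hat.size) * (y_hat - target)

# Numerical check: nudge one output at a time and watch the loss.
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(y_hat.size):
    step = np.zeros_like(y_hat)
    step[i] = eps
    numeric[i] = (mse(y_hat + step) - mse(y_hat - step)) / (2 * eps)

print(dL_dy)
print(np.round(numeric, 6))
```

Both printed vectors should match the column of the table headed ½(ŷᵢ − yᵢ).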

Step 2: Output Layer Weights ($W_2$)

The output is $\hat{y}_j = \sum_i h_i \cdot W_2[i][j] + b_2[j]$. By the chain rule:

$$\frac{\partial L}{\partial W_2[i][j]} = h_i \cdot \frac{\partial L}{\partial \hat{y}_j}$$

Since $\mathbf{h} = [0.0, 0.0, 0.5]$, the first two hidden neurons are dead (zero after ReLU). Their gradients are all zero:

| Weight | hᵢ | × ∂L/∂ŷⱼ | = Gradient |
|---|---|---|---|
| W₂[0][0..3] | h₀ = 0.0 | × anything | = 0 (all four) |
| W₂[1][0..3] | h₁ = 0.0 | × anything | = 0 (all four) |
| W₂[2][0] | h₂ = 0.5 | × (−0.45) | = −0.225 |
| W₂[2][1] | h₂ = 0.5 | × (−0.55) | = −0.275 |
| W₂[2][2] | h₂ = 0.5 | × (−0.025) | = −0.0125 |
| W₂[2][3] | h₂ = 0.5 | × (−0.625) | = −0.3125 |

$$\frac{\partial L}{\partial \mathbf{W}_2} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -0.225 & -0.275 & -0.0125 & -0.3125 \end{bmatrix}$$

8 out of 12 gradients are zero. Because hidden neurons 0 and 1 output zero (they were killed by ReLU), none of their outgoing weights get any gradient signal. Only hidden neuron 2 (which output 0.5) contributes to learning. This is a real phenomenon in neural networks: dead neurons don't learn.
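
The per-weight products in Step 2 can be sketched as an explicit double loop next to the equivalent `np.outer` call. The variable names are assumptions for illustration. The loop makes the formula visible, and the one-liner is what you would normally write.

```python
import numpy as np

h = np.array([0.0, 0.0, 0.5])                      # hidden activations
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])   # from Step 1

# Explicit double loop: one product per weight W2[i][j].
dL_dW2_loop = np.zeros((3, 4))
for i in range(3):
    for j in range(4):
        dL_dW2_loop[i, j] = h[i] * dL_dy[j]

# The same computation in one call.
dL_dW2 = np.outer(h, dL_dy)

# Count how many entries carry no gradient signal.
zero_count = int(np.sum(dL_dW2 == 0))
print(dL_dW2)
print(f"{zero_count} of {dL_dW2.size} gradients are zero")
```

NumPy may print the zero entries as `-0.`, because 0.0 times a negative number is negative zero in floating point. It still compares equal to 0.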

Step 3: Output Layer Biases ($b_2$)

The bias gradient is the simplest. Since $\hat{y}_j = \ldots + b_2[j]$, the derivative of $\hat{y}_j$ with respect to $b_2[j]$ is just 1:

$$\frac{\partial L}{\partial b_2[j]} = \frac{\partial L}{\partial \hat{y}_j} \cdot 1 = \frac{\partial L}{\partial \hat{y}_j}$$

$$\frac{\partial L}{\partial \mathbf{b}_2} = \begin{bmatrix} -0.45 \\ -0.55 \\ -0.025 \\ -0.625 \end{bmatrix}$$

Unlike the output weights, the output-layer biases get gradient signal whenever an output misses its target, because their gradient does not depend on the hidden layer output. Even when hidden neurons are dead, the biases of the output layer still learn.
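
A minimal sketch of this point, using the Chapter 7 values for `W2` and `b2` as they appear in the implementation later in this section. The helper `loss_for` and the all-zero hidden vector `h_all_dead` are assumptions added for illustration. It checks the bias gradient numerically, then shows that the gradient stays non-zero even if every hidden neuron outputs zero.

```python
import numpy as np

h = np.array([0.0, 0.0, 0.5])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

def loss_for(bias):
    """MSE loss as a function of the output bias only."""
    y_hat = h @ W2 + bias
    return np.mean((y_hat - target) ** 2)

# Analytic: the bias gradient is the output gradient itself.
y_hat = h @ W2 + b2
dL_db2 = 0.5 * (y_hat - target)

# Numerical check, one bias at a time.
eps = 1e-6
basis = np.eye(4)
numeric = np.array([
    (loss_for(b2 + eps * basis[j]) - loss_for(b2 - eps * basis[j])) / (2 * eps)
    for j in range(4)
])

# Hypothetical case: every hidden neuron silenced. The output is then just b2,
# and the bias gradient is still non-zero.
h_all_dead = np.zeros(3)
dL_db2_dead = 0.5 * ((h_all_dead @ W2 + b2) - target)

print(dL_db2)
print(dL_db2_dead)
```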


Step 4: Backpropagate to Hidden Layer

Now comes the key step. We need to know: how does the loss change when we change a hidden neuron's output? Each hidden neuron $h_i$ connects to all 4 output neurons, so we sum the effects:

$$\frac{\partial L}{\partial h_i} = \sum_{j=0}^{3} W_2[i][j] \cdot \frac{\partial L}{\partial \hat{y}_j}$$

Hidden neuron 0

$$\begin{aligned}
\frac{\partial L}{\partial h_0} &= (0.3)(-0.45) + (-0.2)(-0.55) + (0.4)(-0.025) + (0.1)(-0.625) \\
&= -0.135 + 0.11 - 0.01 - 0.0625 = \mathbf{-0.0975}
\end{aligned}$$

Hidden neuron 1

$$\begin{aligned}
\frac{\partial L}{\partial h_1} &= (-0.1)(-0.45) + (0.5)(-0.55) + (-0.3)(-0.025) + (0.2)(-0.625) \\
&= 0.045 - 0.275 + 0.0075 - 0.125 = \mathbf{-0.3475}
\end{aligned}$$

Hidden neuron 2

$$\begin{aligned}
\frac{\partial L}{\partial h_2} &= (0.2)(-0.45) + (-0.4)(-0.55) + (0.1)(-0.025) + (-0.5)(-0.625) \\
&= -0.09 + 0.22 - 0.0025 + 0.3125 = \mathbf{0.44}
\end{aligned}$$

$$\frac{\partial L}{\partial \mathbf{h}} = \begin{bmatrix} -0.0975 \\ -0.3475 \\ 0.44 \end{bmatrix}$$

In matrix form: $\frac{\partial L}{\partial \mathbf{h}} = \mathbf{W}_2 \cdot \frac{\partial L}{\partial \hat{\mathbf{y}}}$. The weight matrix $\mathbf{W}_2$ carries the error from the output layer back to the hidden layer. Because $\mathbf{W}_2$ is stored as a 3×4 matrix (hidden × output), this product maps the 4 output gradients to 3 hidden gradients. This is why it's called backpropagation: the gradient flows backward through the same connections the data flowed forward through.
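
The sum in Step 4 can be sketched three equivalent ways: an explicit loop, the matrix-vector product used in this section, and the transposed form common in texts that store weights the other way round. Variable names are assumptions for illustration. The `W2` values are the Chapter 7 weights used in the hand calculation above.

```python
import numpy as np

W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])   # from Step 1

# Explicit sum over the 4 outputs for each of the 3 hidden neurons.
dL_dh_loop = np.array([
    sum(W2[i, j] * dL_dy[j] for j in range(4))
    for i in range(3)
])

# The same computation as a matrix-vector product.
dL_dh = W2 @ dL_dy

# Equivalent form with the transpose written out (row vector times W2^T).
dL_dh_alt = dL_dy @ W2.T

print(dL_dh)
```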

Step 5: Backpropagate Through ReLU

We have the gradient at $\mathbf{h}$ (after ReLU). But to get gradients for $\mathbf{W}_1$ and $\mathbf{b}_1$, we need the gradient at $\mathbf{z}^{(1)}$ (before ReLU). ReLU's derivative is:

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$

Strictly speaking, the derivative is undefined at exactly $z = 0$. We follow the common convention of using 0 there, which matters for neuron 0 in our network.

Apply element-wise:

| Neuron | z⁽¹⁾ | ReLU′(z) | ∂L/∂h | ∂L/∂z = ∂L/∂h × ReLU′(z) |
|---|---|---|---|---|
| 0 | 0.0 | 0 (dead) | −0.0975 | 0.0 |
| 1 | −0.1 | 0 (dead) | −0.3475 | 0.0 |
| 2 | 0.5 | 1 (alive) | 0.44 | 0.44 |

$$\frac{\partial L}{\partial \mathbf{z}^{(1)}} = \begin{bmatrix} 0 \\ 0 \\ 0.44 \end{bmatrix}$$

ReLU is a gradient gate. Neurons 0 and 1 had $z \leq 0$, so ReLU multiplied their gradients by zero. Even though the loss "wants" these neurons to change (the gradients from the output layer were −0.0975 and −0.3475), the ReLU gate blocks the signal. This is the dead neuron problem: neurons that output zero can't learn from this training example.
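
The gate can be sketched as a small helper. The function name `relu_backward` and the second set of pre-activations `z1_other` are assumptions for illustration. To isolate the gating effect, the hypothetical case reuses the same upstream gradient, although a different example would normally produce a different one.

```python
import numpy as np

def relu_backward(z, upstream):
    """Pass the upstream gradient where z > 0; block it (0.0) elsewhere."""
    return np.where(z > 0, upstream, 0.0)

dL_dh = np.array([-0.0975, -0.3475, 0.44])   # from Step 4

# Pre-activations from this training example: only neuron 2 is open.
z1 = np.array([0.0, -0.1, 0.5])
dL_dz1 = relu_backward(z1, dL_dh)

# Hypothetical pre-activations where neuron 0 is active (illustration only).
z1_other = np.array([0.3, -0.1, 0.5])
dL_dz1_other = relu_backward(z1_other, dL_dh)

print(dL_dz1)
print(dL_dz1_other)
```

In the hypothetical case the gate for neuron 0 opens and its gradient of −0.0975 passes through, which is why a neuron blocked on one example can still learn from another.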

Step 6: Hidden Layer Weights ($W_1$)

Same pattern as Step 2, but now for the first layer. Since $z^{(1)}_j = \sum_i x_i \cdot W_1[i][j] + b_1[j]$:

$$\frac{\partial L}{\partial W_1[i][j]} = x_i \cdot \frac{\partial L}{\partial z^{(1)}_j}$$

Since $\frac{\partial L}{\partial \mathbf{z}^{(1)}} = [0, 0, 0.44]$, only column 2 (neuron 2) gets any gradient:

| Weight | xᵢ | × ∂L/∂zⱼ | = Gradient |
|---|---|---|---|
| W₁[i][0] (all) | any | × 0 | = 0 (dead neuron) |
| W₁[i][1] (all) | any | × 0 | = 0 (dead neuron) |
| W₁[0][2] | x₀ = 1 | × 0.44 | = 0.44 |
| W₁[1][2] | x₁ = 0 | × 0.44 | = 0 (input was 0) |
| W₁[2][2] | x₂ = 1 | × 0.44 | = 0.44 |
| W₁[3][2] | x₃ = 1 | × 0.44 | = 0.44 |

$$\frac{\partial L}{\partial \mathbf{W}_1} = \begin{bmatrix} 0 & 0 & 0.44 \\ 0 & 0 & 0 \\ 0 & 0 & 0.44 \\ 0 & 0 & 0.44 \end{bmatrix}$$

Only 3 out of 12 weights get gradient signal. Two reasons: (a) dead neurons 0 and 1 block gradients to their entire column, (b) input $x_1 = 0$ zeroes out row 1. On other training examples where different neurons are alive and different inputs are non-zero, different weights will get gradients.
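
The two blocking conditions can be made explicit in code. This sketch computes the Step 6 gradient and, separately, a boolean mask that is true only where the input is non-zero and the neuron is alive. The name `receives_signal` is an assumption for illustration.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])      # input
dL_dz1 = np.array([0.0, 0.0, 0.44])     # from Step 5

# Step 6: every product x[i] * dL_dz1[j] at once.
dL_dW1 = np.outer(x, dL_dz1)

# A weight gets gradient only if its input is non-zero AND its neuron is alive.
receives_signal = (x != 0)[:, None] & (dL_dz1 != 0)[None, :]

# Row and column indices of the weights that will actually move.
nonzero = np.argwhere(dL_dW1 != 0)

print(dL_dW1)
print(nonzero)
```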

Step 7: Hidden Layer Biases ($b_1$)

$$\frac{\partial L}{\partial b_1[j]} = \frac{\partial L}{\partial z^{(1)}_j}$$

$$\frac{\partial L}{\partial \mathbf{b}_1} = \begin{bmatrix} 0 \\ 0 \\ 0.44 \end{bmatrix}$$

Again, only bias 2 gets a gradient. Biases 0 and 1 are blocked by the dead ReLU neurons.
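
A finite-difference check of the hidden bias gradients, sketched with the Chapter 7 weights as they appear in the implementation later in this section. The helper `loss_for` and the step size `eps` are assumptions. Neuron 0 sits exactly on the ReLU kink (z = 0), so its numerical estimate is not expected to match the convention ReLU′(0) = 0. The check is meaningful for neurons 1 and 2.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

def loss_for(hidden_bias):
    """Full forward pass with a modified hidden bias."""
    z1 = np.round(x @ W1 + hidden_bias, 10)  # same rounding as the implementation
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    return np.mean((y_hat - target) ** 2)

dL_db1 = np.array([0.0, 0.0, 0.44])  # analytic result from Step 7

# Central differences, one bias at a time.
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    numeric[j] = (loss_for(b1 + step) - loss_for(b1 - step)) / (2 * eps)

# numeric[1] and numeric[2] agree with dL_db1. numeric[0] does not: a central
# difference across the kink averages the two one-sided slopes (-0.0975 and 0).
print(np.round(numeric, 5))
```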


All 31 Gradients at a Glance

Here's every gradient we just computed, organized by parameter:

| Parameter | Gradient | Will it change? |
|---|---|---|
| W₁ column 0 (4 weights) | all 0 | ❌ Dead neuron 0 |
| W₁ column 1 (4 weights) | all 0 | ❌ Dead neuron 1 |
| W₁[0][2] | 0.44 | ✅ Will decrease |
| W₁[1][2] | 0 | ❌ Input was 0 |
| W₁[2][2] | 0.44 | ✅ Will decrease |
| W₁[3][2] | 0.44 | ✅ Will decrease |
| b₁[0], b₁[1] | 0, 0 | ❌ Dead neurons |
| b₁[2] | 0.44 | ✅ Will decrease |
| W₂ rows 0,1 (8 weights) | all 0 | ❌ Dead hidden neurons |
| W₂[2][0] | −0.225 | ✅ Will increase |
| W₂[2][1] | −0.275 | ✅ Will increase |
| W₂[2][2] | −0.0125 | ✅ Tiny increase |
| W₂[2][3] | −0.3125 | ✅ Will increase |
| b₂[0] | −0.45 | ✅ Will increase |
| b₂[1] | −0.55 | ✅ Will increase |
| b₂[2] | −0.025 | ✅ Tiny increase |
| b₂[3] | −0.625 | ✅ Will increase |

Out of 31 parameters, only 12 will actually change on this training step (3 in W₁, 1 in b₁, 4 in W₂, 4 in b₂). The other 19 are blocked by dead neurons or zero inputs. On other training examples, different parameters will get gradient signal, which is why we train on many examples.
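
The tally can be reproduced directly from the gradient values in the table. This is a small sketch in which the dictionary layout and names are assumptions for illustration.

```python
import numpy as np

# Gradients from Steps 1 to 7, rebuilt from the values in the table.
grads = {
    "W1": np.outer([1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.44]),
    "b1": np.array([0.0, 0.0, 0.44]),
    "W2": np.outer([0.0, 0.0, 0.5], [-0.45, -0.55, -0.025, -0.625]),
    "b2": np.array([-0.45, -0.55, -0.025, -0.625]),
}

# Total parameter count and non-zero gradient count per group.
total = sum(g.size for g in grads.values())
active = {name: int(np.count_nonzero(g)) for name, g in grads.items()}

for name, g in grads.items():
    print(f"{name}: {active[name]} of {g.size} non-zero")
print(f"total: {sum(active.values())} of {total}")
```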


Backpropagation Visualized

Below is the same network from Chapter 7, but now showing gradients flowing backward. Step through each phase to see which paths carry signal and which are blocked by dead neurons.

[Interactive visualization: gradients flowing backward through the network, phase by phase]

Python Implementation

Here's the complete backpropagation in NumPy: every gradient we computed by hand, in 8 lines of backward-pass code. The walkthrough below annotates each line with the exact values flowing through it, and the complete listing follows.

Backpropagation — NumPy Implementation
🐍backprop.py
import numpy as np

NumPy provides fast N-dimensional arrays and matrix operations. The @ operator, np.outer(), np.maximum(), and np.round() are all used for backpropagation math.

EXECUTION STATE
numpy = Library for numerical computing — ndarray type, linear algebra, element-wise math
# ── Network weights (same as Chapter 7) ──

We reuse the exact same weight matrices from the forward pass in Chapter 7 so every gradient we compute matches the hand calculations in this section.

W1 = np.array([[0.2, -0.5, 0.1], ...])  # Hidden layer weights

W1 is a 4×3 matrix connecting 4 inputs to 3 hidden neurons. Row i holds all weights leaving input x[i]; column j holds all weights entering hidden neuron j.

EXECUTION STATE
📚 np.array(nested_lists) = Creates an ndarray from nested Python lists. Shape is inferred: 4 inner lists of length 3 → shape (4,3).
⬇ shape = (4, 3) — 4 inputs × 3 hidden neurons
⬆ result: W1 =
       h0     h1     h2
x0   0.20  -0.50   0.10
x1  -0.30   0.40  -0.20
x2   0.10   0.30   0.50
x3  -0.40   0.20  -0.10
b1 = np.array([0.1, -0.1, 0.0])  # Hidden biases

One bias per hidden neuron. Added to z1 after the matrix multiply. b1[2] = 0.0 means neuron 2 gets no bias shift.

EXECUTION STATE
⬆ result: b1 = [0.1, -0.1, 0.0] — one per hidden neuron
W2 = np.array([[0.3, -0.2, 0.4, 0.1], ...])  # Output layer weights

W2 is a 3×4 matrix connecting 3 hidden neurons to 4 outputs. Row i holds weights leaving hidden neuron h[i]; column j holds weights entering output y[j]. This same matrix is used BACKWARD in Step 4 to propagate gradients from outputs to hidden layer.

EXECUTION STATE
⬇ shape = (3, 4) — 3 hidden neurons × 4 outputs
⬆ result: W2 =
       y0     y1     y2     y3
h0   0.30  -0.20   0.40   0.10
h1  -0.10   0.50  -0.30   0.20
h2   0.20  -0.40   0.10  -0.50
b2 = np.array([0.0, 0.1, -0.1, 0.0])  # Output biases

One bias per output neuron. These always receive gradient (unlike weights, which depend on hidden activations).

EXECUTION STATE
⬆ result: b2 = [0.0, 0.1, -0.1, 0.0] — one per output neuron
x = np.array([1.0, 0.0, 1.0, 1.0])  # Input image

The 2×2 diagonal image [[1,0],[1,1]] flattened to a 1D vector. x[1]=0 means row 1 of W1 will get zero gradient (input was zero → no gradient flows through it).

EXECUTION STATE
⬆ result: x = [1.0, 0.0, 1.0, 1.0] — flattened 2×2 image
x[1] = 0 = This pixel is off → W1[1][j] gets zero gradient for every j
target = np.array([1.0, 1.0, 0.0, 1.0])  # Target output

The diagonal flip target [[1,1],[0,1]] flattened. The network should learn to produce these values from the input.

EXECUTION STATE
⬆ result: target = [1.0, 1.0, 0.0, 1.0] — flattened 2×2 flipped image
# ── Forward pass (recompute to have values) ──

We need all forward-pass intermediate values (z1, h, y_hat) to compute gradients. Backprop requires both the upstream gradient AND the local values from the forward pass.

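z1 = np.round(x @ W1 + b1, 10)  # Hidden pre-activation

Compute hidden layer pre-activation. np.round(..., 10) eliminates floating-point noise: z1[0] is mathematically exactly 0.0 (0.2 + 0 + 0.1 - 0.4 + 0.1 = 0.0) but raw computation gives ~2.8e-17. Left unrounded, that tiny positive residue would make z1[0] > 0 evaluate to True in Step 5 and let gradient through a neuron that should be blocked.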

EXECUTION STATE
📚 np.round(array, decimals) = Round each element to the given number of decimal places. np.round(2.78e-17, 10) = 0.0. Eliminates floating-point noise from exact cancellations.
x @ W1 + b1 = Matrix multiply x(4,) × W1(4,3) + b1(3,) = z1(3,)
⬆ result: z1 = [0.0, -0.1, 0.5]
h = np.maximum(0, z1)  # ReLU activation

ReLU clamps negative values to zero. Neurons 0 and 1 die (z≤0), only neuron 2 survives. This determines which neurons can learn during backprop.

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum of two arrays (or scalar and array). np.maximum(0, [-0.1]) = [0.0]. Different from np.max() which finds the single largest element.
⬇ z1 = [0.0, -0.1, 0.5]
z1[0] = 0.0 = max(0, 0.0) = 0.0 — dead (on the boundary)
z1[1] = -0.1 = max(0, -0.1) = 0.0 — dead (negative)
z1[2] = 0.5 = max(0, 0.5) = 0.5 — alive!
⬆ result: h = [0.0, 0.0, 0.5] — only neuron 2 is alive
y_hat = h @ W2 + b2  # Network output

Output layer: multiply hidden activations by output weights and add biases. Since h[0]=h[1]=0, only h[2]=0.5 contributes — effectively y_hat = 0.5 * W2[2] + b2.

EXECUTION STATE
h @ W2 = h(3,) × W2(3,4) = result(4,). Only row 2 of W2 matters since h[0]=h[1]=0.
0.5 * W2[2] + b2 = 0.5*[0.2,-0.4,0.1,-0.5] + [0,0.1,-0.1,0] = [0.1, -0.1, -0.05, -0.25]
⬆ result: y_hat = [0.1, -0.1, -0.05, -0.25]
loss = np.mean((y_hat - target) ** 2)  # MSE loss

Mean Squared Error: average the squared difference between predictions and targets. This is the single number we want to minimize.

EXECUTION STATE
📚 np.mean(array) = Compute the arithmetic mean of all elements. For a length-4 array, divides the sum by 4.
y_hat - target = [-0.9, -1.1, -0.05, -1.25]
(y_hat - target) ** 2 = [0.81, 1.21, 0.0025, 1.5625]
sum / 4 = (0.81 + 1.21 + 0.0025 + 1.5625) / 4 = 3.585 / 4
⬆ result: loss = 0.8963 — far from 0, network is mostly wrong
# ── Backpropagation (7 steps) ──

Now we trace the gradient backward from the loss through every layer. Each line corresponds to one of the 7 steps we computed by hand above.
dL_dy = 0.5 * (y_hat - target)  # Step 1: Loss gradient

MSE gradient: ∂L/∂ŷᵢ = (2/N)(ŷᵢ - yᵢ) = (1/2)(ŷᵢ - yᵢ) since N=4. The factor 1/2 comes from the derivative of x² (gives 2x) divided by N=4.

EXECUTION STATE
y_hat - target = [0.1-1, -0.1-1, -0.05-0, -0.25-1] = [-0.9, -1.1, -0.05, -1.25]
× 0.5 = [-0.45, -0.55, -0.025, -0.625]
⬆ result: dL_dy = [-0.45, -0.55, -0.025, -0.625] (matches the hand calculation in Step 1)
Largest gradient = dL/dŷ₃ = -0.625 — output 3 is furthest from target (predicted -0.25, target 1)
dL_dW2 = np.outer(h, dL_dy)  # Step 2: Output weight gradients

∂L/∂W₂[i][j] = h[i] × ∂L/∂ŷ[j]. np.outer computes every product h[i]×dL_dy[j] at once. Since h[0]=h[1]=0 (dead neurons), rows 0 and 1 are entirely zero — dead neurons don’t learn.

EXECUTION STATE
📚 np.outer(a, b) = Outer product: creates a matrix where element [i][j] = a[i] × b[j]. For vectors of length m and n, produces an m×n matrix. Every pair of elements gets multiplied.
⬇ h = [0.0, 0.0, 0.5] — only neuron 2 is alive
⬇ dL_dy = [-0.45, -0.55, -0.025, -0.625]
── Row 0 (h₀=0.0, dead) ──
h[0] × dL_dy = 0.0 × [-0.45, -0.55, -0.025, -0.625] = [0, 0, 0, 0]
── Row 1 (h₁=0.0, dead) ──
h[1] × dL_dy = 0.0 × [-0.45, -0.55, -0.025, -0.625] = [0, 0, 0, 0]
── Row 2 (h₂=0.5, alive!) ──
h[2] × dL_dy = 0.5 × [-0.45, -0.55, -0.025, -0.625] = [-0.225, -0.275, -0.0125, -0.3125]
⬆ result: dL_dW2 (3×4) =
       y0      y1      y2      y3
h0   0.000   0.000   0.000   0.000
h1   0.000   0.000   0.000   0.000
h2  -0.225  -0.275  -0.013  -0.313
8 of 12 gradients are zero = Dead neurons h0, h1 block ALL gradient to their outgoing weights
dL_db2 = dL_dy.copy()  # Step 3: Output bias gradients

∂L/∂b₂[j] = ∂L/∂ŷ[j] × 1. Since ŷ = h@W2 + b2, the derivative of ŷ with respect to b2 is just 1. Output bias gradients always equal the loss gradient: they don't depend on hidden activations.

EXECUTION STATE
📚 .copy() = Creates an independent copy of the array. Without copy(), dL_db2 would be a reference to dL_dy, and modifying one would change the other.
⬆ result: dL_db2 = [-0.45, -0.55, -0.025, -0.625] — same as dL_dy (biases always get gradient)
dL_dh = W2 @ dL_dy  # Step 4: Backprop to hidden layer

Error flows BACKWARD through the same weight matrix W₂. Each hidden neuron’s gradient is the dot product of its outgoing weights with the output gradients: ∂L/∂h[i] = Σⱼ W₂[i][j] × ∂L/∂ŷ[j]

EXECUTION STATE
📚 W2 @ dL_dy = Matrix-vector multiply: W2(3,4) × dL_dy(4,) = result(3,). Each row of W2 dots with dL_dy to produce one hidden gradient.
── Hidden neuron 0 ──
W2[0] · dL_dy = (0.3)(-0.45) + (-0.2)(-0.55) + (0.4)(-0.025) + (0.1)(-0.625) = -0.0975
── Hidden neuron 1 ──
W2[1] · dL_dy = (-0.1)(-0.45) + (0.5)(-0.55) + (-0.3)(-0.025) + (0.2)(-0.625) = -0.3475
── Hidden neuron 2 ──
W2[2] · dL_dy = (0.2)(-0.45) + (-0.4)(-0.55) + (0.1)(-0.025) + (-0.5)(-0.625) = 0.44
⬆ result: dL_dh = [-0.0975, -0.3475, 0.44] (matches the hand calculation in Step 4)
relu_grad = (z1 > 0).astype(float)  # Step 5: ReLU derivative

ReLU’(z) = 1 if z > 0, else 0. This creates a binary mask: neurons with z>0 pass gradient through, neurons with z≤0 block it completely. This is the ‘gradient gate’ that causes the dead neuron problem.

EXECUTION STATE
📚 (z1 > 0) = Element-wise comparison. Returns boolean array: [False, False, True] for z1=[0.0, -0.1, 0.5]
📚 .astype(float) = Convert booleans to floats: False→0.0, True→1.0
⬇ z1 = [0.0, -0.1, 0.5]
z1[0] = 0.0 = 0.0 > 0 is False → 0.0 (dead — on the boundary)
z1[1] = -0.1 = -0.1 > 0 is False → 0.0 (dead — negative)
z1[2] = 0.5 = 0.5 > 0 is True → 1.0 (alive!)
⬆ result: relu_grad = [0.0, 0.0, 1.0] — only neuron 2 passes gradient
dL_dz1 = dL_dh * relu_grad  # Step 5b: Apply ReLU gate

Element-wise multiply: the ReLU mask zeros out gradients for dead neurons. Only neuron 2 (relu_grad=1) passes its gradient through; neurons 0 and 1 are blocked for this training example.

EXECUTION STATE
⬇ dL_dh = [-0.0975, -0.3475, 0.44]
⬇ relu_grad = [0.0, 0.0, 1.0]
neuron 0 = -0.0975 × 0.0 = 0.0 (gradient blocked!)
neuron 1 = -0.3475 × 0.0 = 0.0 (gradient blocked!)
neuron 2 = 0.44 × 1.0 = 0.44 (gradient passes through!)
⬆ result: dL_dz1 = [0.0, 0.0, 0.44] (matches the hand calculation in Step 5)
dL_dW1 = np.outer(x, dL_dz1)  # Step 6: Hidden weight gradients

∂L/∂W₁[i][j] = x[i] × ∂L/∂z[j]. Same outer product pattern as W₂. Since dL_dz1=[0,0,0.44], columns 0 and 1 are all zero (dead neurons). Since x[1]=0, row 1 is all zero (zero input).

EXECUTION STATE
⬇ x = [1.0, 0.0, 1.0, 1.0]
⬇ dL_dz1 = [0.0, 0.0, 0.44]
⬆ result: dL_dW1 (4×3) =
       h0    h1    h2
x0   0.00  0.00  0.44
x1   0.00  0.00  0.00
x2   0.00  0.00  0.44
x3   0.00  0.00  0.44
Only 3 of 12 non-zero = Col 0,1: dead neurons. Row 1: x[1]=0. Only x[0,2,3]×dL_dz1[2] survive.
dL_db1 = dL_dz1.copy()  # Step 7: Hidden bias gradients

∂L/∂b₁[j] = ∂L/∂z⁽¹⁾[j]. Same as the output bias case: the derivative of z with respect to b is 1. Only bias 2 gets a non-zero gradient.

EXECUTION STATE
📚 .copy() = Independent copy so future modifications to dL_dz1 won’t affect dL_db1.
⬆ result: dL_db1 = [0.0, 0.0, 0.44] — only bias 2 learns
print(f"Loss: {loss:.4f}")

Print the MSE loss, formatted to 4 decimal places.

EXECUTION STATE
Output = Loss: 0.8963
print(f"dL/dy: {dL_dy}")

Print the loss gradient vector (Step 1 result).

EXECUTION STATE
Output = dL/dy: [-0.45 -0.55 -0.025 -0.625]
print(f"dL/dW2: {dL_dW2[2]}")

Print row 2 of dL_dW2 (the only non-zero row). Rows 0 and 1 are all zeros from dead neurons.

EXECUTION STATE
Output = dL/dW2: [-0.225 -0.275 -0.0125 -0.3125]
print(f"dL/dh: {dL_dh}")

Print the hidden-layer gradient (Step 4 result).

EXECUTION STATE
Output = dL/dh: [-0.0975 -0.3475 0.44 ]
print(f"dL/dz1: {dL_dz1}")

Print the gradient at the pre-activation z1, after it has passed back through the ReLU gate (Step 5 result). Two zeros from dead neurons.

EXECUTION STATE
Output = dL/dz1: [0. 0. 0.44]
print(f"dL/dW1:\n{dL_dW1}")

Print the full 4×3 weight gradient matrix (Step 6 result). Only 3 non-zero entries.

EXECUTION STATE
Output = dL/dW1: [[0. 0. 0.44] [0. 0. 0. ] [0. 0. 0.44] [0. 0. 0.44]]
print(f"dL/db1: {dL_db1}")

Print the hidden bias gradient (Step 7 result). Only bias 2 gets a signal.

EXECUTION STATE
Output = dL/db1: [0. 0. 0.44]
The complete listing, without the annotations:

import numpy as np

# ── Network weights (same as Chapter 7) ──
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

# ── Forward pass (recompute to have values) ──
z1 = np.round(x @ W1 + b1, 10)
h = np.maximum(0, z1)
y_hat = h @ W2 + b2
loss = np.mean((y_hat - target) ** 2)

# ── Backpropagation (7 steps) ──
dL_dy = 0.5 * (y_hat - target)
dL_dW2 = np.outer(h, dL_dy)
dL_db2 = dL_dy.copy()
dL_dh = W2 @ dL_dy
relu_grad = (z1 > 0).astype(float)
dL_dz1 = dL_dh * relu_grad
dL_dW1 = np.outer(x, dL_dz1)
dL_db1 = dL_dz1.copy()

print(f"Loss:     {loss:.4f}")
print(f"dL/dy:    {dL_dy}")
print(f"dL/dW2:   {dL_dW2[2]}")
print(f"dL/dh:    {dL_dh}")
print(f"dL/dz1:   {dL_dz1}")
print(f"dL/dW1:\n{dL_dW1}")
print(f"dL/db1:   {dL_db1}")

Eight lines of math. That's the entire backward pass for our network. Each line corresponds to one of the 7 steps we computed by hand (Step 5 takes two lines: the ReLU mask and its application). The code mirrors the math line for line: no magic, no hidden complexity.
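
Since training runs over many examples, it can help to wrap the same steps in a function. The following is a sketch under stated assumptions, not part of the original listing: the function name `forward_backward`, the dictionary layout, and the second example (`x_other`, `target_other`) are all hypothetical and chosen only to show a different neuron receiving gradient.

```python
import numpy as np

def forward_backward(x, target, W1, b1, W2, b2):
    """Return (loss, grads) for one example, following Steps 1-7."""
    # Forward pass (the rounding mirrors the listing above).
    z1 = np.round(x @ W1 + b1, 10)
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - target) ** 2)

    # Backward pass.
    dL_dy = (2 / y_hat.size) * (y_hat - target)   # Step 1
    dL_dh = W2 @ dL_dy                            # Step 4
    dL_dz1 = dL_dh * (z1 > 0)                     # Step 5
    grads = {
        "W2": np.outer(h, dL_dy),                 # Step 2
        "b2": dL_dy.copy(),                       # Step 3
        "W1": np.outer(x, dL_dz1),                # Step 6
        "b1": dL_dz1.copy(),                      # Step 7
    }
    return loss, grads

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

# The example from this section.
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])
loss, grads = forward_backward(x, target, W1, b1, W2, b2)

# Hypothetical second example (an assumption, not from the text).
x_other = np.array([0.0, 1.0, 0.0, 0.0])
target_other = np.array([0.0, 0.0, 1.0, 0.0])
loss_other, grads_other = forward_backward(x_other, target_other, W1, b1, W2, b2)

# Which W1 entries receive gradient on each example?
print(np.argwhere(grads["W1"] != 0))
print(np.argwhere(grads_other["W1"] != 0))
```

On the hypothetical second input, hidden neuron 1 is the only active one, so a different entry of W₁ receives gradient than in the worked example.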

Summary

  1. Start at the loss, compute $\frac{\partial L}{\partial \hat{\mathbf{y}}}$
  2. Output layer: weight gradients multiply by the hidden layer output $\mathbf{h}$; bias gradients equal $\frac{\partial L}{\partial \hat{\mathbf{y}}}$
  3. Backprop to hidden layer: multiply by $\mathbf{W}_2$ (same weights, backward direction)
  4. Through ReLU: zero out gradients for dead neurons ($z \leq 0$)
  5. Hidden layer: weight gradients multiply by the input $\mathbf{x}$; bias gradients equal $\frac{\partial L}{\partial \mathbf{z}^{(1)}}$

In the next section, we'll apply these gradients to update all 31 parameters and see the loss drop.
