Chapter 8

Backpropagation in PyTorch

Backpropagation

Learning Objectives

By the end of this section, you will be able to:

  1. Write a complete training step in NumPy — forward pass, backpropagation, weight update, and verification
  2. Translate the same step to PyTorch and verify autograd produces identical gradients
  3. Build a training loop in both NumPy and PyTorch, watching the loss converge to zero
  4. Train the network on all 16 possible 2×2 images and reach 100% accuracy
  5. Inspect what the network learned about the diagonal flip transformation
The pattern for this section: For each concept, we first build it from scratch in Python/NumPy — so you see every multiplication. Then we show the same thing in PyTorch — so you see how frameworks automate the tedious parts while doing the exact same math underneath.

Python: One Complete Training Step

In Sections 2 and 3, we computed gradients by hand and updated weights one equation at a time. Now let's put it all together (forward pass, backward pass, weight update, and verification) in a single Python script.

Complete Training Step — NumPy
🐍training_step_numpy.py
import numpy as np

NumPy provides fast N-dimensional arrays and matrix operations. We use @ for matrix multiply, np.outer() for outer product, np.maximum() for ReLU, and np.mean() for MSE loss. All math runs as optimized C code, not slow Python loops.

EXECUTION STATE
📚 numpy = Numerical computing library — ndarray type, linear algebra, element-wise math. Aliased as 'np' by universal Python convention.
# ── Network weights (same as Chapter 7) ──

These are the exact same weights we used in Chapter 7 (forward pass) and Section 2 (backprop). Using identical weights lets us verify that every computed gradient matches our hand calculations.

W1 = np.array([...]): hidden layer weights (4×3)

W1 connects 4 inputs to 3 hidden neurons. Row i holds all weights leaving input x[i]; column j holds all weights entering hidden neuron j. Shape (4, 3) — the NumPy convention where x @ W1 works directly.

EXECUTION STATE
⬇ shape = (4, 3) — 4 inputs × 3 hidden neurons = 12 weights
⬆ result: W1 =
       h0     h1     h2
x0   0.20  -0.50   0.10
x1  -0.30   0.40  -0.20
x2   0.10   0.30   0.50
x3  -0.40   0.20  -0.10
b1 = np.array([0.1, -0.1, 0.0]): hidden biases

One bias per hidden neuron, added after the matrix multiply. b1[2] = 0.0 means neuron 2 gets no bias shift.

EXECUTION STATE
⬆ result: b1 = [0.1, -0.1, 0.0] — one per hidden neuron
W2 = np.array([...]): output layer weights (3×4)

W2 connects 3 hidden neurons to 4 outputs. Row i holds weights from hidden neuron h[i]; column j holds weights entering output y[j]. This same matrix is used BACKWARD to propagate gradients.

EXECUTION STATE
⬇ shape = (3, 4) — 3 hidden neurons × 4 outputs = 12 weights
⬆ result: W2 =
       y0     y1     y2     y3
h0   0.30  -0.20   0.40   0.10
h1  -0.10   0.50  -0.30   0.20
h2   0.20  -0.40   0.10  -0.50
b2 = np.array([0.0, 0.1, -0.1, 0.0]): output biases

One bias per output neuron. Unlike weights, biases always receive gradient during backprop because they don’t depend on hidden activations.

EXECUTION STATE
⬆ result: b2 = [0.0, 0.1, -0.1, 0.0]
x = np.array([1.0, 0.0, 1.0, 1.0]): input image

The 2×2 image [[1,0],[1,1]] flattened. x[1] = 0 means input pixel 1 is off — any weight connected to x[1] will get zero gradient (no signal flowed through it).

EXECUTION STATE
⬆ x = [1.0, 0.0, 1.0, 1.0] — flattened 2×2 binary image
target = np.array([1.0, 1.0, 0.0, 1.0]): diagonal flip target

The diagonal flip of the input: [[1,1],[0,1]] flattened. Positions 1 and 2 swap (0→1, 1→0), while positions 0 and 3 stay fixed.

EXECUTION STATE
⬆ target = [1.0, 1.0, 0.0, 1.0] — what the network should output
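
As a quick check on the target, the diagonal flip of a 2×2 image is just its transpose. The sketch below assumes the image is flattened row by row; the helper name `diagonal_flip` is hypothetical and used only for illustration.

```python
import numpy as np

def diagonal_flip(flat_image):
    """Transpose a flattened 2x2 image (hypothetical helper, for illustration only)."""
    return flat_image.reshape(2, 2).T.reshape(-1)

x = np.array([1.0, 0.0, 1.0, 1.0])   # [[1, 0], [1, 1]] flattened row by row
target = diagonal_flip(x)            # positions 1 and 2 swap, 0 and 3 stay put
print(target)
```

Applying the flip twice returns the original image, which is one way to sanity-check the helper.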
# ── Step 1: Forward pass ──

Run the input through the network to get a prediction and compute the loss. We need all intermediate values (z1, h) because backprop multiplies upstream gradients by local values from the forward pass.

z1 = np.round(x @ W1 + b1, 10): pre-activation

Matrix multiply input (4,) by W1 (4,3), add bias (3,). np.round(..., 10) eliminates floating-point noise: z1[0] is mathematically 0.0 (0.2+0+0.1−0.4+0.1 = 0.0) but raw computation gives ~2.8e-17.

EXECUTION STATE
📚 @ operator = Python matrix multiplication. x(4,) @ W1(4,3) = z1_raw(3,)
📚 np.round(array, 10) = Rounds to 10 decimal places. Eliminates floating-point noise (2.8e-17 → 0.0) while keeping all meaningful precision.
── Per-neuron calculation ──
z1[0] = 1×0.2 + 0×(-0.3) + 1×0.1 + 1×(-0.4) + 0.1 = 0.0
z1[1] = 1×(-0.5) + 0×0.4 + 1×0.3 + 1×0.2 + (-0.1) = -0.1
z1[2] = 1×0.1 + 0×(-0.2) + 1×0.5 + 1×(-0.1) + 0.0 = 0.5
⬆ z1 = [0.0, -0.1, 0.5]
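
To see why the rounding matters, here is a small sketch that compares the raw and rounded pre-activations. The exact digits of the raw value depend on the order in which NumPy sums the terms, so treat the printed noise as illustrative rather than guaranteed.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

z1_raw = x @ W1 + b1            # neuron 0 may carry noise on the order of 1e-17
z1 = np.round(z1_raw, 10)       # snap that noise to an exact zero

print(repr(z1_raw[0]))          # tiny value; digits depend on summation order
print(z1)
print((z1 > 0).astype(float))   # ReLU mask computed from the rounded values
```

If the raw value happens to be a tiny positive number, the test `z1 > 0` would treat neuron 0 as alive. Rounding first keeps the code in agreement with the hand calculation, where z1[0] is exactly zero.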
h = np.maximum(0, z1): ReLU activation

ReLU clamps negatives to zero: max(0, value). Neurons 0 and 1 die (z≤0), only neuron 2 survives. Dead neurons cannot learn — their gradients will be zero in the backward pass.

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum. np.maximum(0, [-0.1, 0.5]) = [0.0, 0.5]. Different from np.max() which finds the single largest value.
h[0] = max(0, 0.0) = 0.0 = Dead (boundary) — gradient blocked
h[1] = max(0, -0.1) = 0.0 = Dead (negative) — gradient blocked
h[2] = max(0, 0.5) = 0.5 = Alive! — gradient passes through
⬆ h = [0.0, 0.0, 0.5]
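
The difference between `np.maximum` and `np.max` is easy to mix up, so here is a minimal side-by-side using the z1 values from above.

```python
import numpy as np

z1 = np.array([0.0, -0.1, 0.5])

h = np.maximum(0, z1)    # element-wise: compares every entry with 0
largest = np.max(z1)     # reduction: collapses the array to one scalar

print(h)         # same shape as z1
print(largest)   # a single number
```

ReLU needs the element-wise version, because every hidden neuron is clamped independently.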
y_hat = h @ W2 + b2: network prediction

Output layer: h(3,) @ W2(3,4) + b2(4,) = y_hat(4,). Only h[2]=0.5 is non-zero, so each output = 0.5 × W2[2][j] + b2[j].

EXECUTION STATE
ŷ₀ = 0.5×0.2 + 0.0 = 0.10 = Target 1.0 — too low
ŷ₁ = 0.5×(-0.4) + 0.1 = -0.10 = Target 1.0 — way too low
ŷ₂ = 0.5×0.1 + (-0.1) = -0.05 = Target 0.0 — close!
ŷ₃ = 0.5×(-0.5) + 0.0 = -0.25 = Target 1.0 — worst prediction
⬆ y_hat = [0.10, -0.10, -0.05, -0.25]
loss = np.mean((y_hat - target) ** 2): MSE loss

Mean Squared Error: average of squared differences. This single number measures how wrong the network is — gradient descent will try to make it smaller.

EXECUTION STATE
📚 np.mean(array) = Sum of elements divided by count. For 4 elements: sum / 4.
errors² = (0.1-1)² + (-0.1-1)² + (-0.05-0)² + (-0.25-1)² = 0.81 + 1.21 + 0.0025 + 1.5625 = 3.585
⬆ loss = 3.585 / 4 = 0.8963
# ── Step 2: Backward pass (7 gradient steps) ──

Backpropagation: compute the gradient of the loss with respect to every weight, starting from the output and working backward. Each line corresponds to one of the 7 steps from Section 2.

dL_dy = 0.5 * (y_hat - target): loss gradient

The derivative of MSE loss with respect to each output. The 0.5 comes from the chain rule: d/dŷ[(1/4)∑(ŷ-y)²] = (1/4)×2×(ŷ-y) = 0.5×(ŷ-y). Negative values mean the output should INCREASE.

EXECUTION STATE
── Per-output gradient ──
dL/dŷ₀ = 0.5 × (0.10 - 1.0) = -0.450 ← output should increase
dL/dŷ₁ = 0.5 × (-0.10 - 1.0) = -0.550 ← output should increase
dL/dŷ₂ = 0.5 × (-0.05 - 0.0) = -0.025 ← small, already close
dL/dŷ₃ = 0.5 × (-0.25 - 1.0) = -0.625 ← largest error, biggest gradient
⬆ dL_dy = [-0.450, -0.550, -0.025, -0.625]
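
One way to convince yourself that the 0.5 factor is right is a finite-difference check: nudge each output slightly and measure how the loss changes. This is a sketch under the assumption that a step size of `eps = 1e-6` is small enough; the helper `mse` is defined here only for the check.

```python
import numpy as np

def mse(y_hat, target):
    return np.mean((y_hat - target) ** 2)

y_hat = np.array([0.10, -0.10, -0.05, -0.25])
target = np.array([1.0, 1.0, 0.0, 1.0])

analytic = 0.5 * (y_hat - target)    # (2 / 4) * (y_hat - target)

eps = 1e-6                            # assumed step size for the check
numeric = np.zeros_like(y_hat)
for j in range(y_hat.size):
    bump = np.zeros_like(y_hat)
    bump[j] = eps
    # Central difference: (L(y + eps) - L(y - eps)) / (2 * eps)
    numeric[j] = (mse(y_hat + bump, target) - mse(y_hat - bump, target)) / (2 * eps)

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Because MSE is quadratic in each output, the central difference agrees with the formula up to floating-point rounding.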
dL_dW2 = np.outer(h, dL_dy): output weight gradients

The gradient for each weight W2[i][j] = h[i] × dL/dŷ[j]. Since h[0]=0 and h[1]=0 (dead neurons), rows 0 and 1 are entirely zero. Only row 2 has non-zero gradients.

EXECUTION STATE
📚 np.outer(a, b) = Outer product: creates a matrix where result[i][j] = a[i] × b[j]. h(3,) ⊗ dL_dy(4,) = (3,4) matrix.
⬆ dL_dW2 (3×4) =
       y0       y1       y2       y3
h0  0.000    0.000    0.000    0.000
h1  0.000    0.000    0.000    0.000
h2 -0.225   -0.275   -0.013   -0.313
Why rows 0,1 are zero = h[0]=0 and h[1]=0 (dead neurons) → 0 × anything = 0. Dead neurons block ALL gradient to their output weights.
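
To make the outer product concrete, the sketch below builds the same matrix with two explicit loops and compares it with `np.outer`. It also prints row 2 at full precision (the table above rounds to three decimals).

```python
import numpy as np

h = np.array([0.0, 0.0, 0.5])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])

# Explicit version: one multiplication per weight
dL_dW2_loop = np.zeros((h.size, dL_dy.size))
for i in range(h.size):
    for j in range(dL_dy.size):
        dL_dW2_loop[i, j] = h[i] * dL_dy[j]

dL_dW2 = np.outer(h, dL_dy)    # same result in one call

print(np.allclose(dL_dW2, dL_dW2_loop))
print(dL_dW2[2])               # unrounded gradients for the live neuron
```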
dL_db2 = dL_dy.copy(): output bias gradients

Bias gradients equal the loss gradient directly: ∂L/∂b[j] = dL/dŷ[j] × 1. Output biases always get gradient, even when the hidden neurons feeding the output layer are dead.

EXECUTION STATE
📚 .copy() = Creates an independent copy so modifying dL_db2 later won’t change dL_dy.
⬆ dL_db2 = [-0.450, -0.550, -0.025, -0.625]
dL_dh = W2 @ dL_dy: backprop to hidden layer

Error flows BACKWARD through W2. Each hidden neuron’s gradient is the weighted sum of all output gradients: dL/dh[i] = ∑ⱼ W2[i][j] × dL/dŷ[j]. Same weights, backward direction.

EXECUTION STATE
W2 @ dL_dy = W2(3,4) @ dL_dy(4,) = dL_dh(3,). Matrix-vector multiply.
dL_dh[0] = 0.3×(-0.45) + (-0.2)×(-0.55) + 0.4×(-0.025) + 0.1×(-0.625) = -0.0975
dL_dh[1] = (-0.1)×(-0.45) + 0.5×(-0.55) + (-0.3)×(-0.025) + 0.2×(-0.625) = -0.3475
dL_dh[2] = 0.2×(-0.45) + (-0.4)×(-0.55) + 0.1×(-0.025) + (-0.5)×(-0.625) = 0.4400
⬆ dL_dh = [-0.0975, -0.3475, 0.4400]
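
The weighted-sum formula and the matrix product are the same computation. The sketch below writes it three ways: the matrix-vector product used in the script, the explicit sum, and the transposed form `dL_dy @ W2.T` that many textbooks use.

```python
import numpy as np

W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])

dL_dh = W2 @ dL_dy    # (3, 4) @ (4,) -> (3,)

# Explicit sum over outputs j for each hidden neuron i
dL_dh_sum = np.array([sum(W2[i, j] * dL_dy[j] for j in range(4))
                      for i in range(3)])

# Transposed form: (4,) @ (4, 3) -> (3,)
dL_dh_T = dL_dy @ W2.T

print(dL_dh)
```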
relu_grad = (z1 > 0).astype(float): ReLU gradient gate

ReLU’s derivative is a binary mask: 1 where z1 > 0 (gradient passes), 0 where z1 ≤ 0 (gradient blocked). This is the gate that kills learning in dead neurons.

EXECUTION STATE
📚 (z1 > 0) = Element-wise comparison: returns boolean array [False, False, True]
📚 .astype(float) = Convert booleans to floats: False→0.0, True→1.0
z1[0] = 0.0 > 0? = False → 0.0 — gradient blocked (boundary: exactly zero is not positive)
z1[1] = -0.1 > 0? = False → 0.0 — gradient blocked
z1[2] = 0.5 > 0? = True → 1.0 — gradient passes through
⬆ relu_grad = [0.0, 0.0, 1.0]
dL_dz1 = dL_dh * relu_grad: apply the gate

Element-wise multiply: the gradient only passes where ReLU was active. Neurons 0 and 1 lose their gradient entirely. Only neuron 2 carries signal backward.

EXECUTION STATE
dL_dz1[0] = (-0.0975) × 0.0 = 0.0 — killed by ReLU gate
dL_dz1[1] = (-0.3475) × 0.0 = 0.0 — killed by ReLU gate
dL_dz1[2] = 0.4400 × 1.0 = 0.44 — survives!
⬆ dL_dz1 = [0.0, 0.0, 0.44]
dL_dW1 = np.outer(x, dL_dz1): hidden weight gradients

Each gradient dL/dW1[i][j] = x[i] × dL/dz1[j]. Only column 2 is non-zero (neurons 0,1 are dead). Row 1 is all-zero because x[1] = 0.

EXECUTION STATE
📚 np.outer(x, dL_dz1) = x(4,) ⊗ dL_dz1(3,) = (4,3) matrix. Only 3 of 12 entries are non-zero.
⬆ dL_dW1 (4×3) =
      h0    h1    h2
x0  0.00  0.00  0.44
x1  0.00  0.00  0.00
x2  0.00  0.00  0.44
x3  0.00  0.00  0.44
Why so sparse? = Cols 0,1 = 0 (dead neurons). Row 1 = 0 (input x[1]=0). Only 3 of 12 weights will change.
dL_db1 = dL_dz1.copy(): hidden bias gradients

Bias gradients equal the pre-activation gradients. Only b1[2] has non-zero gradient — the other two neurons are dead.

EXECUTION STATE
⬆ dL_db1 = [0.0, 0.0, 0.44]
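
With all the gradients in hand, a finite-difference check is a useful sanity test for hand-written backprop. The sketch below checks only the weights attached to the live neuron (column 2 of W1 and row 2 of W2). Neuron 0 sits exactly on the ReLU kink at z = 0, where a finite difference is not meaningful, so it is left out on purpose. The helpers `loss_fn` and `numeric_grad` and the step size `eps` are assumptions made for this check.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

def loss_fn(W1, b1, W2, b2):
    h = np.maximum(0, x @ W1 + b1)
    return np.mean((h @ W2 + b2 - target) ** 2)

# Analytic gradients, following the backward-pass steps above
z1 = np.round(x @ W1 + b1, 10)
h = np.maximum(0, z1)
dL_dy = 0.5 * (h @ W2 + b2 - target)
dL_dW2 = np.outer(h, dL_dy)
dL_dW1 = np.outer(x, (W2 @ dL_dy) * (z1 > 0))

def numeric_grad(param, i, j, eps=1e-6):
    """Central difference for one matrix entry (hypothetical helper)."""
    original = param[i, j]
    param[i, j] = original + eps
    up = loss_fn(W1, b1, W2, b2)
    param[i, j] = original - eps
    down = loss_fn(W1, b1, W2, b2)
    param[i, j] = original          # restore the weight
    return (up - down) / (2 * eps)

num_W1_col2 = np.array([numeric_grad(W1, i, 2) for i in range(4)])
num_W2_row2 = np.array([numeric_grad(W2, 2, j) for j in range(4)])

print(np.allclose(num_W1_col2, dL_dW1[:, 2], atol=1e-6))
print(np.allclose(num_W2_row2, dL_dW2[2], atol=1e-6))
```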
# ── Step 3: Update weights ──

Apply the gradient descent update rule: w_new = w_old - η × gradient. Negative gradients mean the weight increases (moves toward lower loss). Positive gradients mean the weight decreases.

eta = 0.1: learning rate

Controls step size. Too large (e.g., 1.0) and the network overshoots, bouncing around. Too small (e.g., 0.001) and training crawls. 0.1 is a common starting point for small networks.

EXECUTION STATE
η (eta) = 0.1: each weight moves by exactly 0.1 × its gradient, in the opposite direction
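
To get a feel for the step size, the sketch below runs one update with a few different learning rates and reports the loss afterwards. The helper `loss_after_one_step` is hypothetical, and the three values tried are simply the ones mentioned above; no particular outcome for the large value is claimed here.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

def loss_after_one_step(eta):
    """Loss on the same example after a single update with step size eta."""
    z1 = np.round(x @ W1 + b1, 10)
    h = np.maximum(0, z1)
    dL_dy = 0.5 * (h @ W2 + b2 - target)
    dL_dz1 = (W2 @ dL_dy) * (z1 > 0)
    # Build updated copies so the original weights stay untouched
    W1_new = W1 - eta * np.outer(x, dL_dz1)
    b1_new = b1 - eta * dL_dz1
    W2_new = W2 - eta * np.outer(h, dL_dy)
    b2_new = b2 - eta * dL_dy
    h_new = np.maximum(0, np.round(x @ W1_new + b1_new, 10))
    return np.mean((h_new @ W2_new + b2_new - target) ** 2)

for eta in (0.001, 0.1, 1.0):
    print(f"eta={eta:<6} loss after one step: {loss_after_one_step(eta):.4f}")
```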
W1_new = W1 - eta * dL_dW1: update hidden weights

Only column 2 changes (neurons 0,1 are dead). Positive gradient (0.44) means the weight DECREASES — we move opposite to the gradient to reduce loss.

EXECUTION STATE
── Only column 2 changes ──
W1[0][2] = 0.100 - 0.1×(0.44) = 0.056
W1[1][2] = -0.200 - 0.1×(0.00) = -0.200 (unchanged, x[1]=0)
W1[2][2] = 0.500 - 0.1×(0.44) = 0.456
W1[3][2] = -0.100 - 0.1×(0.44) = -0.144
b1_new = b1 - eta * dL_db1: update hidden biases

Only b1[2] changes (neurons 0,1 have zero gradient). The positive gradient (0.44) pushes b1[2] downward.

EXECUTION STATE
b1[0] = 0.100 - 0.1×(0.0) = 0.100 (dead neuron)
b1[1] = -0.100 - 0.1×(0.0) = -0.100 (dead neuron)
b1[2] = 0.000 - 0.1×(0.44) = -0.044
W2_new = W2 - eta * dL_dW2: update output weights

Only row 2 changes (rows 0,1 have zero gradient from dead neurons). All gradients are negative, so all weights INCREASE — pushing outputs upward toward targets.

EXECUTION STATE
── Only row 2 changes ──
W2[2][0] = 0.200 - 0.1×(-0.225) = 0.2225
W2[2][1] = -0.400 - 0.1×(-0.275) = -0.3725
W2[2][2] = 0.100 - 0.1×(-0.0125) = 0.1013
W2[2][3] = -0.500 - 0.1×(-0.3125) = -0.4688
b2_new = b2 - eta * dL_db2: update output biases

All 4 biases get updated (biases don’t depend on dead neurons). All gradients are negative, so all biases increase.

EXECUTION STATE
b2[0] = 0.000 - 0.1×(-0.450) = 0.0450
b2[1] = 0.100 - 0.1×(-0.550) = 0.1550
b2[2] = -0.100 - 0.1×(-0.025) = -0.0975
b2[3] = 0.000 - 0.1×(-0.625) = 0.0625
# ── Step 4: Verify improvement ──

Run a new forward pass with the updated weights to confirm the loss actually dropped. This is the payoff — one step of gradient descent made the network measurably better.

z1_v = np.round(x @ W1_new + b1_new, 10): new pre-activation

Forward pass with updated weights. Neuron 2 now produces a smaller pre-activation because W1 column 2 decreased.

EXECUTION STATE
Neuron 2 calculation = 1×(0.056) + 0×(-0.2) + 1×(0.456) + 1×(-0.144) + (-0.044) = 0.324
⬆ z1_v = [0.0, -0.1, 0.324] — neuron 2 lower than before (was 0.5)
h_v = np.maximum(0, z1_v): new hidden activation

ReLU: neurons 0,1 still dead, neuron 2 alive but lower (0.5 → 0.324).

EXECUTION STATE
⬆ h_v = [0.0, 0.0, 0.324]
y_hat_new = h_v @ W2_new + b2_new: new prediction

Both the hidden activation (0.324 vs 0.5) and the weights/biases changed. Three of four outputs moved closer to their targets.

EXECUTION STATE
ŷ₀ = (0.324)(0.2225) + 0.045 = 0.117 (target 1.0, was 0.10)
ŷ₁ = (0.324)(-0.3725) + 0.155 = 0.034 (target 1.0, was -0.10)
ŷ₂ = (0.324)(0.1013) + (-0.0975) = -0.065 (target 0.0, was -0.05)
ŷ₃ = (0.324)(-0.4688) + 0.0625 = -0.089 (target 1.0, was -0.25)
⬆ y_hat_new = [0.117, 0.034, -0.065, -0.089]
loss_new = np.mean((y_hat_new - target) ** 2): new loss

Recompute MSE. Each squared error: (0.117-1)²=0.780, (0.034-1)²=0.933, (-0.065-0)²=0.004, (-0.089-1)²=1.186. Mean = 0.726.

EXECUTION STATE
⬆ loss_new = 0.726 — down from 0.896 (−19.0%)
print(f"Old loss: {loss:.4f}")

Display original loss for comparison.

EXECUTION STATE
⬆ output = Old loss: 0.8963
print(f"New loss: {loss_new:.4f}")

Loss after one gradient descent step.

EXECUTION STATE
⬆ output = New loss: 0.7258
print(f"Improved: {(1 - loss_new/loss)*100:.1f}%")

One step cut the loss by 19%. Not perfect, but measurably better.

EXECUTION STATE
⬆ output = Improved: 19.0%
import numpy as np

# ── Network weights (same as Chapter 7) ──
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

# ── Step 1: Forward pass ──
z1 = np.round(x @ W1 + b1, 10)
h = np.maximum(0, z1)
y_hat = h @ W2 + b2
loss = np.mean((y_hat - target) ** 2)

# ── Step 2: Backward pass (7 gradient steps) ──
dL_dy = 0.5 * (y_hat - target)
dL_dW2 = np.outer(h, dL_dy)
dL_db2 = dL_dy.copy()
dL_dh = W2 @ dL_dy
relu_grad = (z1 > 0).astype(float)
dL_dz1 = dL_dh * relu_grad
dL_dW1 = np.outer(x, dL_dz1)
dL_db1 = dL_dz1.copy()

# ── Step 3: Update weights ──
eta = 0.1
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1
W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2

# ── Step 4: Verify improvement ──
z1_v = np.round(x @ W1_new + b1_new, 10)
h_v = np.maximum(0, z1_v)
y_hat_new = h_v @ W2_new + b2_new
loss_new = np.mean((y_hat_new - target) ** 2)

print(f"Old loss:  {loss:.4f}")
print(f"New loss:  {loss_new:.4f}")
print(f"Improved:  {(1 - loss_new/loss)*100:.1f}%")

Result: One step of gradient descent reduced the loss from 0.896 to 0.726 — a 19% improvement. Three of four outputs moved closer to their targets.

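| Output | Before | After | Target | Direction |
|---|---|---|---|---|
| ŷ₀ | 0.10 | 0.12 | 1.0 | ✅ Toward target |
| ŷ₁ | −0.10 | 0.03 | 1.0 | ✅ Toward target |
| ŷ₂ | −0.05 | −0.06 | 0.0 | ➖ Slightly away from target |
| ŷ₃ | −0.25 | −0.09 | 1.0 | ✅ Toward target |
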
Count the lines: forward pass = 4 lines, backward pass = 8 lines (the 7 steps from Section 2, with the ReLU gate split across two lines), update = 4 lines. That's 16 lines of math for one training step. Next, we'll see PyTorch do the same thing in 3.

PyTorch: Autograd Does It in Three Lines

Now the same computation in PyTorch. The key insight: loss.backward() replaces the 7 steps of backprop math, and optimizer.step() replaces the 4 lines of weight updates. The gradients match our hand calculations to floating-point precision.

Complete Training Step — PyTorch
🐍training_step_pytorch.py
import torch

PyTorch’s core library. Provides tensors (like NumPy arrays but with automatic gradient computation) and GPU acceleration. Operations on tensors that require gradients are recorded in a computation graph for backpropagation.

EXECUTION STATE
📚 torch = Core tensor library — like NumPy but with autograd and GPU support
import torch.nn as nn

Neural network building blocks: layers (Linear, Conv2d), the Module base class, and loss functions. nn.Linear stores its own weight matrix and bias vector.

EXECUTION STATE
📚 torch.nn = Neural network module — nn.Module, nn.Linear, nn.ReLU, loss functions
import torch.nn.functional as F

Stateless versions of activations and operations. F.relu(x) applies ReLU without creating a persistent module. Used in forward() methods.

EXECUTION STATE
📚 torch.nn.functional = Stateless functions: relu, softmax, cross_entropy. Same math as nn.ReLU() but without storing state.
class DiagonalFlipNet(nn.Module): our network

Same architecture as the NumPy version: 4 inputs → 3 hidden (ReLU) → 4 outputs. nn.Module is the base class for all PyTorch models — it tracks parameters, enables saving/loading, and defines the forward() contract.

EXECUTION STATE
📚 nn.Module = Base class for all PyTorch models. Provides parameter tracking (.parameters()), device management (.to()), and serialization (.state_dict()).
def __init__(self)

Constructor: define the layers (weight matrices). super().__init__() initializes the nn.Module machinery that tracks parameters.

super().__init__(): initialize nn.Module

Calls the parent class constructor. Required for nn.Module to work correctly: it sets up the internal registries for parameters, buffers, submodules, and hooks. The computation graph itself is built later, during the forward pass.

self.layer1 = nn.Linear(4, 3): hidden layer

Creates a fully-connected layer: 4 inputs → 3 outputs. Stores a (3,4) weight matrix and (3,) bias. Note: PyTorch uses shape (out_features, in_features) — the transpose of our NumPy convention.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Forward: y = x @ W.T + b. Weight shape: (out, in) = (3, 4). Bias shape: (3,).
⬇ in_features = 4 = Our input has 4 elements (flattened 2×2 image)
⬇ out_features = 3 = We want 3 hidden neurons
self.layer2 = nn.Linear(3, 4): output layer

3 hidden neurons → 4 outputs. Weight shape: (4, 3). Bias shape: (4,).

EXECUTION STATE
⬇ in_features = 3 = Matches layer1’s output size
⬇ out_features = 4 = Our target has 4 elements
def forward(self, x): define computation

PyTorch calls this when you do model(x). It defines the computation graph: linear → ReLU → linear. Every operation is recorded for backpropagation.

EXECUTION STATE
⬇ x = Input tensor — shape (4,) for a single example
h = F.relu(self.layer1(x)): hidden layer + ReLU

self.layer1(x) computes x @ W1.T + b1 = z1. F.relu() applies max(0, z1). Both operations are recorded in the computation graph for loss.backward() later.

EXECUTION STATE
📚 F.relu(input) = Element-wise max(0, x). Same as np.maximum(0, x) but with gradient tracking.
⬆ h = [0.0, 0.0, 0.5] — neurons 0,1 dead; neuron 2 alive
return self.layer2(h): output (no activation)

Passes hidden activations through output layer: h @ W2.T + b2. No activation on the output — this is a regression network (predicting continuous values).

EXECUTION STATE
⬆ return = [0.1, -0.1, -0.05, -0.25] — same as our NumPy result
model = DiagonalFlipNet(): create instance

Creates the network with random weights. We’ll overwrite them with our hand-picked values next.

EXECUTION STATE
model = DiagonalFlipNet with 31 parameters: W1(3×4)=12 + b1(3) + W2(4×3)=12 + b2(4) = 31
with torch.no_grad(): disable gradient tracking

When loading weights manually, we don’t want PyTorch to record these operations in the computation graph. torch.no_grad() is a context manager that temporarily disables gradient computation.

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd. Operations inside won’t be tracked for backpropagation. Used for inference and manual weight loading.
model.layer1.weight.copy_(...): load W1

Copy our hand-picked weights into the model. Note the shape (3,4) — PyTorch stores weights as (out_features, in_features), which is the TRANSPOSE of our NumPy W1 (4,3). The values are the same, just organized differently.

EXECUTION STATE
📚 .copy_(tensor) = In-place copy: overwrites the parameter data. The underscore _ in PyTorch means 'in-place operation'.
PyTorch W1 shape = (3, 4) = (out, in). Row 0 = neuron 0’s weights = [0.2, -0.3, 0.1, -0.4]
NumPy W1 shape = (4, 3) = (in, out). Column 0 = neuron 0’s weights = [0.2, -0.3, 0.1, -0.4]
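
The transposed layout is easy to check directly. The sketch below loads the NumPy W1 into an `nn.Linear` via `W1.T` and confirms that both conventions give the same pre-activations. It assumes PyTorch's default float32 parameters, which is why the comparison uses a small tolerance instead of exact equality.

```python
import numpy as np
import torch
import torch.nn as nn

W1 = np.array([[ 0.2, -0.5,  0.1],     # NumPy layout: (in, out) = (4, 3)
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

layer1 = nn.Linear(4, 3)               # PyTorch layout: (out, in) = (3, 4)
with torch.no_grad():
    layer1.weight.copy_(torch.tensor(W1.T, dtype=torch.float32))
    layer1.bias.copy_(torch.tensor(b1, dtype=torch.float32))

z1_numpy = x @ W1 + b1                                   # x @ W
z1_torch = layer1(torch.tensor(x, dtype=torch.float32))  # x @ weight.T + bias

print(tuple(layer1.weight.shape))
print(np.allclose(z1_numpy, z1_torch.detach().numpy(), atol=1e-6))
```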
model.layer1.bias.copy_(...): load b1

Bias shape is the same in both conventions: one value per neuron.

EXECUTION STATE
b1 = [0.1, -0.1, 0.0]
model.layer2.weight.copy_(...): load W2

Output weights: (4, 3) in PyTorch — the transpose of our NumPy W2 (3, 4).

EXECUTION STATE
PyTorch W2 shape = (4, 3) = (out, in). Each row is one output neuron’s weights.
model.layer2.bias.copy_(...): load b2

Output biases: [0.0, 0.1, -0.1, 0.0].

EXECUTION STATE
b2 = [0.0, 0.1, -0.1, 0.0]
x = torch.tensor([1.0, 0.0, 1.0, 1.0]): same input

Identical to our NumPy input. torch.tensor() creates a PyTorch tensor from a Python list. This tensor does not require gradients itself; the graph is recorded because the model's parameters do.

EXECUTION STATE
📚 torch.tensor(data) = Creates a tensor from a Python list. Unlike np.array, tensors can track gradients.
target = torch.tensor([1.0, 1.0, 0.0, 1.0]): same target

The diagonal flip target: [[1,1],[0,1]] flattened.

y_hat = model(x): forward pass

Calling model(x) runs forward(x) under the hood. PyTorch builds a computation graph as it goes, recording every operation for backprop.

EXECUTION STATE
⬆ y_hat = [0.1, -0.1, -0.05, -0.25] — identical to NumPy
loss = torch.mean((y_hat - target) ** 2): MSE loss

Same MSE computation as NumPy, but now the computation graph connects loss → y_hat → layer2 → relu → layer1 → x. This graph is what makes .backward() possible.

EXECUTION STATE
⬆ loss = 0.8963 — identical to NumPy
# ── Backprop: ONE LINE replaces 7 steps ──

This is the key moment. Eight lines of careful NumPy math (dL_dy, dL_dW2, dL_db2, dL_dh, relu_grad, dL_dz1, dL_dW1, dL_db1), covering the 7 steps from Section 2, are all computed automatically in one call.

loss.backward(): THE magic line

Walks the computation graph backward from loss to every parameter, computing ∂L/∂w for all 31 weights and biases. Uses the chain rule automatically. After this call, every parameter’s .grad attribute holds its gradient.

EXECUTION STATE
📚 .backward() = Computes gradients of the loss with respect to every tensor that has requires_grad=True (all nn.Module parameters do by default).
What it computes = All 31 gradients: dL/dW1 (12), dL/db1 (3), dL/dW2 (12), dL/db2 (4), matching the 7 backprop steps from the NumPy version
Where gradients are stored = model.layer1.weight.grad, model.layer1.bias.grad, model.layer2.weight.grad, model.layer2.bias.grad
print("dL/db2:", model.layer2.bias.grad): verify

Access the computed gradient for the output bias. Every parameter has a .grad attribute after .backward() is called.

EXECUTION STATE
.grad = Attribute set by .backward(). Contains ∂L/∂(this parameter). Initially None before backward is called.
⬆ output = dL/db2: tensor([-0.4500, -0.5500, -0.0250, -0.6250])
NumPy result = [-0.450, -0.550, -0.025, -0.625] — identical ✓
print("dL/db1:", model.layer1.bias.grad): verify

Hidden bias gradients. Only neuron 2 has non-zero gradient (the other two are dead).

EXECUTION STATE
⬆ output = dL/db1: tensor([0.0000, 0.0000, 0.4400])
NumPy result = [0.0, 0.0, 0.44] — identical ✓
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

Creates a Stochastic Gradient Descent optimizer that manages all 31 parameters. lr=0.1 is our learning rate η. The optimizer applies the update rule: w = w - η × grad.

EXECUTION STATE
📚 torch.optim.SGD(params, lr) = Stochastic Gradient Descent optimizer. Applies w = w - lr × w.grad for every parameter.
⬇ model.parameters() = Iterator over all 31 learnable parameters (W1, b1, W2, b2). The optimizer will update all of them.
⬇ lr = 0.1 = Learning rate. Same η = 0.1 as our NumPy code.
optimizer.step(): apply weight updates

Applies w = w - lr × grad for all 31 parameters in one call. This is equivalent to our 4 lines of NumPy updates (W1_new, b1_new, W2_new, b2_new).

EXECUTION STATE
📚 optimizer.step() = For each parameter p: p.data = p.data - lr × p.grad. One call updates all 31 weights.
Equivalent NumPy = W1 -= 0.1 * dL_dW1; b1 -= 0.1 * dL_db1; W2 -= 0.1 * dL_dW2; b2 -= 0.1 * dL_db2
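
To see that plain SGD really is this simple, the sketch below updates one copy of a small layer with `optimizer.step()` and a second copy by hand, then compares them. The layer size, input, seed, and loss are arbitrary choices for illustration, and the equivalence holds for SGD without momentum or weight decay, as used here.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)                       # assumed seed, only for repeatability
layer_opt = nn.Linear(3, 2)                # updated by the optimizer
layer_manual = copy.deepcopy(layer_opt)    # updated by hand

x = torch.tensor([1.0, 2.0, 3.0])
for layer in (layer_opt, layer_manual):
    layer(x).pow(2).mean().backward()      # same loss, so same gradients

optimizer = torch.optim.SGD(layer_opt.parameters(), lr=0.1)
optimizer.step()

with torch.no_grad():                      # the update itself must not be tracked
    for p in layer_manual.parameters():
        p -= 0.1 * p.grad                  # w = w - lr * grad

match = all(torch.allclose(a, b)
            for a, b in zip(layer_opt.parameters(), layer_manual.parameters()))
print(match)
```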
optimizer.zero_grad(): clear old gradients

PyTorch ACCUMULATES gradients by default (adds to .grad instead of replacing). You must zero them before the next backward pass, or gradients from different iterations pile up. Forgetting this is a common beginner bug.

EXECUTION STATE
📚 optimizer.zero_grad() = Sets .grad = None (or zeros) for every parameter. MUST be called before each new loss.backward().
Why accumulate? = Useful for gradient accumulation across mini-batches. But for standard training, always zero first.
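
Accumulation is easiest to see on a tiny tensor. In this sketch the loss is the sum of squares of `w`, so the true gradient is `2 * w`; calling backward twice without clearing doubles the stored value. Setting `w.grad = None` plays the role that `optimizer.zero_grad()` plays for model parameters.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # gradient of sum(w^2) is 2 * w

(w ** 2).sum().backward()       # no clearing in between
accumulated = w.grad.clone()    # the second gradient was ADDED to the first

w.grad = None                   # manual reset for a bare tensor
(w ** 2).sum().backward()
fresh = w.grad.clone()          # back to the true gradient

print(first, accumulated, fresh)
```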
y_hat_new = model(x): new prediction

Forward pass with updated weights.

EXECUTION STATE
⬆ y_hat_new = [0.117, 0.034, -0.065, -0.089] — closer to target [1, 1, 0, 1]
loss_new = torch.mean(...): new loss

Recompute MSE with updated weights.

EXECUTION STATE
⬆ loss_new = 0.7258 — down from 0.8963
print(f"Loss: {loss.item():.4f} -> {loss_new.item():.4f}")

Confirm: one training step reduced the loss by 19%, matching our NumPy result exactly.

EXECUTION STATE
📚 .item() = Extracts a Python number from a single-element tensor. loss is a tensor, loss.item() is a float.
⬆ output = Loss: 0.8963 -> 0.7258 — 19.0% decrease ✓
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Same network, same weights ──
class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        h = F.relu(self.layer1(x))
        return self.layer2(h)

model = DiagonalFlipNet()

# Load our exact weights
with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1]]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5]]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

# ── Forward pass + loss ──
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
y_hat = model(x)
loss = torch.mean((y_hat - target) ** 2)

# ── Backprop: ONE LINE replaces 7 steps ──
loss.backward()

# ── Verify: gradients match hand calculations ──
print("dL/db2:", model.layer2.bias.grad)
print("dL/db1:", model.layer1.bias.grad)

# ── One training step ──
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1)
optimizer.step()
optimizer.zero_grad()

y_hat_new = model(x)
loss_new = torch.mean((y_hat_new - target) ** 2)
print(f"Loss: {loss.item():.4f} -> {loss_new.item():.4f}")
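
Gradient verification: PyTorch's autograd computes the same gradients as our hand-coded NumPy backward pass, to floating-point precision. Every value matches:

| Gradient | NumPy (hand-coded) | PyTorch (autograd) |
|---|---|---|
| ∂L/∂b₂ | [-0.45, -0.55, -0.025, -0.625] | [-0.45, -0.55, -0.025, -0.625] ✓ |
| ∂L/∂W₂ row 2 | [-0.225, -0.275, -0.0125, -0.3125] | [-0.225, -0.275, -0.0125, -0.3125] ✓ |
| ∂L/∂b₁ | [0.0, 0.0, 0.44] | [0.0, 0.0, 0.44] ✓ |
| ∂L/∂W₁ col 2 | [0.44, 0.0, 0.44, 0.44] | [0.44, 0.0, 0.44, 0.44] ✓ |

Rows and columns here refer to the NumPy layout. In PyTorch's transposed (out, in) layout, the same values appear in column 2 of model.layer2.weight.grad and row 2 of model.layer1.weight.grad.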

The trade-off is clear: NumPy gives you full visibility into every gradient computation. PyTorch gives you the same result with far less code. Now that you've verified they match, you can trust PyTorch's autograd and focus on the bigger picture.


Python: The Training Loop

One step gave us 19% improvement. What happens when we repeat 200 times? The training loop is the heart of machine learning: forward → loss → backward → update → repeat.

Training Loop (200 Steps) — NumPy
🐍training_loop_numpy.py
import numpy as np

NumPy for all array math operations.

# ── Same network setup ──

Identical weights as before. We restart from the original hand-picked initialization so we can watch the full learning trajectory from the beginning.

W1, b1, W2, b2: network weights

Same 31 parameters from Chapter 7. The network starts with a loss of 0.8963 and predictions [0.1, -0.1, -0.05, -0.25] — far from the target [1, 1, 0, 1].

EXECUTION STATE
Total parameters = W1(4×3)=12 + b1(3) + W2(3×4)=12 + b2(4) = 31
x, target: training data

Input [1,0,1,1] and its diagonal flip target [1,1,0,1]. We train on this single example for 200 steps.

EXECUTION STATE
x = [1.0, 0.0, 1.0, 1.0]
target = [1.0, 1.0, 0.0, 1.0]
eta = 0.1: learning rate

Same learning rate as our single-step experiment.

EXECUTION STATE
η = 0.1
for step in range(201): training loop

Repeat forward → backward → update 201 times (steps 0 through 200). Each iteration nudges all 31 weights a tiny bit toward lower loss. After enough iterations, the network converges to near-zero loss.

EXECUTION STATE
📚 range(201) = Generates integers 0, 1, 2, ..., 200. Total: 201 iterations.
LOOP TRACE · 8 iterations
Step 0
loss = 0.896250
pred = [0.1, -0.1, -0.05, -0.25]
Step 1
loss = 0.725753 (−19.0%)
pred = [0.117, 0.034, -0.065, -0.089]
Step 5
loss = 0.411635 (−54.1%)
pred = [0.216, 0.321, -0.082, 0.249]
Step 10
loss = 0.246461 (−72.5%)
pred = [0.393, 0.475, -0.064, 0.419]
Step 25
loss = 0.052900 (−94.1%)
pred = [0.719, 0.757, -0.029, 0.731]
Step 50
loss = 0.004070 (−99.5%)
pred = [0.922, 0.933, -0.008, 0.925]
Step 100
loss = 0.000024 (≈ 0)
pred = [0.994, 0.995, -0.001, 0.994]
Step 200
loss = 0.000000 (≈ 0)
pred = [1.0, 1.0, -0.0, 1.0] ← essentially perfect
z1 = np.round(x @ W1 + b1, 10): forward, pre-activation

Compute hidden layer pre-activation with CURRENT weights (updated each iteration). The values change every step as weights are adjusted.

EXECUTION STATE
Step 0 = z1 = [0.0, -0.1, 0.5]
Step 200 = z1 evolves as weights change — dead neurons may come alive in later steps
h = np.maximum(0, z1): forward, ReLU

Apply ReLU. As weights evolve, neurons that were dead at step 0 may become alive in later steps.

y_hat = h @ W2 + b2: forward, prediction

Compute the network’s current prediction. This changes every step as weights are updated.

loss = np.mean((y_hat - target) ** 2): MSE loss

How wrong the network is right now. This number should decrease every step if training is working.

if step in [0, 1, 5, 10, 25, 50, 100, 200]:

Print progress at selected checkpoints. The spacing is logarithmic because most improvement happens early (fast drops), then fine-tuning later (slow improvements).

pred = np.round(y_hat, 3)

Round predictions to 3 decimal places for clean display.

print(f"Step {step:<4} loss={loss:.6f} ...")

Display step number, current loss, and prediction.

EXECUTION STATE
⬆ Sample output (Step 0) = Step 0 loss=0.896250 pred=[ 0.1 -0.1 -0.05 -0.25]
dL_dy = 0.5 * (y_hat - target): backward, loss gradient

Start of backpropagation. Same formula as before, but with the current (evolving) y_hat.

dL_dW2 = np.outer(h, dL_dy): output weight gradients

Outer product of hidden activation and loss gradient.

32dL_db2 = dL_dy.copy() — Output bias gradients

Bias gradient equals loss gradient directly.

33dL_dh = W2 @ dL_dy — Backprop to hidden layer

Error flows backward through W2 (same matrix, backward direction).

34dL_dz1 = dL_dh * (z1 > 0).astype(float) — ReLU gate

Apply ReLU gradient mask. This line combines steps 5 and 5b from Section 2 into one line.

EXECUTION STATE
Combined operation = relu_grad = (z1 > 0).astype(float); dL_dz1 = dL_dh * relu_grad — two steps in one
35dL_dW1 = np.outer(x, dL_dz1) — Hidden weight gradients

Outer product of input and hidden gradient.

36dL_db1 = dL_dz1.copy() — Hidden bias gradients

Bias gradient equals the pre-activation gradient.

39W1 -= eta * dL_dW1 — Update hidden weights

In-place update: W1 = W1 - 0.1 × gradient. Using -= modifies the array directly instead of creating a new one. More memory-efficient.

EXECUTION STATE
📚 -= (in-place subtraction) = a -= b is equivalent to a = a - b, but modifies a directly without creating a new array.
40b1 -= eta * dL_db1 — Update hidden biases

In-place bias update.

41W2 -= eta * dL_dW2 — Update output weights

In-place output weight update.

42b2 -= eta * dL_db2 — Update output biases

After this line, all 31 parameters have been updated. The loop returns to the top and does another forward pass with the improved weights.

EXECUTION STATE
One iteration = Forward (4 lines) + Backward (7 lines) + Update (4 lines) = 15 lines of math per step
1import numpy as np
2
3# ── Same network setup ──
4W1 = np.array([[ 0.2, -0.5,  0.1],
5               [-0.3,  0.4, -0.2],
6               [ 0.1,  0.3,  0.5],
7               [-0.4,  0.2, -0.1]])
8b1 = np.array([0.1, -0.1, 0.0])
9W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
10               [-0.1,  0.5, -0.3,  0.2],
11               [ 0.2, -0.4,  0.1, -0.5]])
12b2 = np.array([0.0, 0.1, -0.1, 0.0])
13
14x = np.array([1.0, 0.0, 1.0, 1.0])
15target = np.array([1.0, 1.0, 0.0, 1.0])
16eta = 0.1
17
18for step in range(201):
19    # Forward
20    z1 = np.round(x @ W1 + b1, 10)
21    h = np.maximum(0, z1)
22    y_hat = h @ W2 + b2
23    loss = np.mean((y_hat - target) ** 2)
24
25    if step in [0, 1, 5, 10, 25, 50, 100, 200]:
26        pred = np.round(y_hat, 3)
27        print(f"Step {step:<4} loss={loss:.6f}  pred={pred}")
28
29    # Backward
30    dL_dy = 0.5 * (y_hat - target)
31    dL_dW2 = np.outer(h, dL_dy)
32    dL_db2 = dL_dy.copy()
33    dL_dh = W2 @ dL_dy
34    dL_dz1 = dL_dh * (z1 > 0).astype(float)
35    dL_dW1 = np.outer(x, dL_dz1)
36    dL_db1 = dL_dz1.copy()
37
38    # Update
39    W1 -= eta * dL_dW1
40    b1 -= eta * dL_db1
41    W2 -= eta * dL_dW2
42    b2 -= eta * dL_db2

The output tells the full story of learning:

Step | Loss | Prediction | Pattern
0 | 0.8963 | [0.1, -0.1, -0.05, -0.25] | Untrained output, far from the target
1 | 0.7258 | [0.12, 0.03, -0.07, -0.09] | First improvement
10 | 0.2465 | [0.39, 0.48, -0.06, 0.42] | Getting the shape right
50 | 0.0041 | [0.92, 0.93, -0.01, 0.93] | Nearly there
200 | ≈0.000 | [1.0, 1.0, -0.0, 1.0] | Essentially perfect
Watch the learning curve. The loss drops fast early (0.896 → 0.246 in 10 steps), then slows down (0.246 → 0.004 over the next 40). This is typical — the easy adjustments happen first, then fine-tuning takes longer.
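To check the curve yourself, here is a minimal sketch that reruns the loop above and keeps a history instead of printing. The names losses and z1_history are bookkeeping introduced for this sketch and are not part of the original script. It also records the hidden pre-activations, which shows that with a single fixed input the two inactive neurons never move.

```python
import numpy as np

# Weights, input, target and learning rate copied from the listing above.
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])
eta = 0.1

losses, z1_history = [], []   # hypothetical bookkeeping, not in the original
for step in range(201):
    # Forward pass (same four lines as the listing)
    z1 = np.round(x @ W1 + b1, 10)
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    losses.append(float(np.mean((y_hat - target) ** 2)))
    z1_history.append(z1.copy())

    # Backward pass; dL_dh is computed before W2 is modified
    dL_dy = 0.5 * (y_hat - target)
    dL_dW2 = np.outer(h, dL_dy)
    dL_dh = W2 @ dL_dy
    dL_dz1 = dL_dh * (z1 > 0)

    # In-place updates
    W1 -= eta * np.outer(x, dL_dz1)
    b1 -= eta * dL_dz1
    W2 -= eta * dL_dW2
    b2 -= eta * dL_dy

z1_history = np.array(z1_history)
print(f"loss[0]={losses[0]:.6f}  loss[1]={losses[1]:.6f}  loss[200]={losses[200]:.6f}")
print("z1[0] range:", z1_history[:, 0].min(), z1_history[:, 0].max())
print("z1[1] range:", z1_history[:, 1].min(), z1_history[:, 1].max())
```

Plotting losses against the step index on a logarithmic y-axis makes the fast-then-slow shape easy to see.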

PyTorch: The Same Loop, Simplified

The same 200-step training loop in PyTorch. The loop body shrinks from 15 lines (forward + backward + update) to just 5: model(x), loss, zero_grad(), backward(), step().

Training Loop (200 Steps) — PyTorch
🐍training_loop_pytorch.py
1import torch

PyTorch core: tensors, autograd, computation graphs.

2import torch.nn as nn

Neural network layers and base class.

3import torch.nn.functional as F

Stateless activation functions like F.relu().

5class DiagonalFlipNet(nn.Module) — Same architecture

Identical model definition. In practice, you define this once and reuse it. We repeat it here for clarity.

11def forward(self, x):

The forward pass in one line: linear → ReLU → linear. Note how compact this is compared to the 3 NumPy lines (z1, h, y_hat). The loss is still computed separately inside the loop.

12return self.layer2(F.relu(self.layer1(x)))

Chains layer1 → relu → layer2 in one expression. PyTorch builds the computation graph automatically as each operation executes.

EXECUTION STATE
Computation chain = self.layer1(x) = z1 → F.relu(z1) = h → self.layer2(h) = y_hat
15model = DiagonalFlipNet()

Create the model.

16with torch.no_grad(): — Load weights

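Load our hand-picked weights (identical to the NumPy version). Lines 17-29 are the same weight loading as before. nn.Linear stores its weight as (out_features, in_features), so each matrix here is the transpose of the NumPy W1 or W2.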

31x = torch.tensor([1.0, 0.0, 1.0, 1.0]) — Input

Same input image as NumPy.

32target = torch.tensor([1.0, 1.0, 0.0, 1.0]) — Target

Same diagonal flip target.

33optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

Create the optimizer ONCE before the loop. It holds references to all parameters and applies updates when .step() is called.

EXECUTION STATE
📚 torch.optim.SGD = Stochastic Gradient Descent. With the default settings used here (no momentum), the update rule is w = w - lr × grad.
36for step in range(201): — Training loop

Same 201 iterations. But the loop body is just 5 lines instead of 15 — PyTorch handles forward graph construction and backpropagation automatically.

LOOP TRACE · 5 iterations
Step 0
loss = 0.896250
pred = [0.1, -0.1, -0.05, -0.25]
Step 1
loss = 0.725753
pred = [0.117, 0.034, -0.065, -0.089]
Step 10
loss = 0.246461
pred = [0.393, 0.475, -0.064, 0.419]
Step 50
loss = 0.004070
pred = [0.922, 0.933, -0.008, 0.925]
Step 200
loss ≈ 0.000000
pred = [1.0, 1.0, -0.0, 1.0] ← perfect
37y_hat = model(x) — Forward pass

One call does the entire forward pass: layer1 → ReLU → layer2. Compare to NumPy: z1 = ..., h = ..., y_hat = ... (3 separate lines).

38loss = torch.mean((y_hat - target) ** 2) — MSE loss

Compute loss. PyTorch records this in the computation graph.

40if step in [0, 1, 5, 10, 25, 50, 100, 200]:

Print at logarithmically-spaced checkpoints.

41pred = [round(v, 3) for v in y_hat.tolist()]

Convert tensor to Python list and round for display.

EXECUTION STATE
📚 .tolist() = Converts a PyTorch tensor to a plain Python list. Required for round() which doesn’t work on tensors.
42print(f"Step {step:<4} loss={loss.item():.6f} ...")

Display progress. .item() extracts a Python float from a 0-dim tensor.

45optimizer.zero_grad() — Clear old gradients

MUST come before .backward(). Without this, gradients from the previous step would accumulate (add up) with the current step’s gradients.

EXECUTION STATE
Order matters = zero_grad() → backward() → step(). This is the PyTorch training loop mantra.
46loss.backward() — Compute all gradients

Replaces 7 lines of NumPy backprop math. Walks the computation graph backward and fills every parameter’s .grad attribute.

EXECUTION STATE
NumPy equivalent = dL_dy, dL_dW2, dL_db2, dL_dh, dL_dz1, dL_dW1, dL_db1 — 7 lines compressed to 1
47optimizer.step() — Update all weights

Replaces 4 lines of NumPy updates. Applies w = w - 0.1 × grad for all 31 parameters.

EXECUTION STATE
NumPy equivalent = W1 -= ...; b1 -= ...; W2 -= ...; b2 -= ... — 4 lines compressed to 1
Total: PyTorch loop body = 5 lines (model, loss, zero_grad, backward, step) vs 15 lines in NumPy
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class DiagonalFlipNet(nn.Module):
6    def __init__(self):
7        super().__init__()
8        self.layer1 = nn.Linear(4, 3)
9        self.layer2 = nn.Linear(3, 4)
10
11    def forward(self, x):
12        return self.layer2(F.relu(self.layer1(x)))
13
14# ── Load same weights ──
15model = DiagonalFlipNet()
16with torch.no_grad():
17    model.layer1.weight.copy_(torch.tensor([
18        [ 0.2, -0.3,  0.1, -0.4],
19        [-0.5,  0.4,  0.3,  0.2],
20        [ 0.1, -0.2,  0.5, -0.1]]))
21    model.layer1.bias.copy_(
22        torch.tensor([0.1, -0.1, 0.0]))
23    model.layer2.weight.copy_(torch.tensor([
24        [ 0.3, -0.1,  0.2],
25        [-0.2,  0.5, -0.4],
26        [ 0.4, -0.3,  0.1],
27        [ 0.1,  0.2, -0.5]]))
28    model.layer2.bias.copy_(
29        torch.tensor([0.0, 0.1, -0.1, 0.0]))
30
31x = torch.tensor([1.0, 0.0, 1.0, 1.0])
32target = torch.tensor([1.0, 1.0, 0.0, 1.0])
33optimizer = torch.optim.SGD(
34    model.parameters(), lr=0.1)
35
36for step in range(201):
37    y_hat = model(x)
38    loss = torch.mean((y_hat - target) ** 2)
39
40    if step in [0, 1, 5, 10, 25, 50, 100, 200]:
41        pred = [round(v, 3) for v in y_hat.tolist()]
42        print(f"Step {step:<4} loss={loss.item():.6f}  "
43              f"pred={pred}")
44
45    optimizer.zero_grad()
46    loss.backward()
47    optimizer.step()

Same convergence, same result, and a loop body one-third the size (5 lines instead of 15). The PyTorch loop body follows a pattern that carries over to almost any neural network:

$$\boxed{\texttt{y\_hat = model(x)} \;\rightarrow\; \texttt{loss} \;\rightarrow\; \texttt{zero\_grad()} \;\rightarrow\; \texttt{backward()} \;\rightarrow\; \texttt{step()}}$$

This five-step pattern is the same whether you're training a 31-parameter toy network or a 175-billion-parameter GPT model. The architecture changes, the loss function changes, the data changes, but the core of the training loop stays these five steps.
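The reason zero_grad() belongs in the pattern is that backward() adds to any gradient already stored on a parameter. A minimal sketch of that behavior, using a toy parameter w and a toy loss that are made up for this illustration (they are not part of DiagonalFlipNet), with w.grad.zero_() standing in for optimizer.zero_grad():

```python
import torch

# Hypothetical toy parameter, used only to show accumulation.
w = torch.tensor([1.0, 2.0], requires_grad=True)

def toy_loss(p):
    # Sum of squares, so the true gradient is 2 * p.
    return (p ** 2).sum()

toy_loss(w).backward()
first = w.grad.clone()          # gradient of one backward pass

toy_loss(w).backward()          # second pass WITHOUT clearing
accumulated = w.grad.clone()    # the two gradients have been summed

w.grad.zero_()                  # manual stand-in for optimizer.zero_grad()
toy_loss(w).backward()
fresh = w.grad.clone()          # back to the single-pass gradient

print("first:      ", first)
print("accumulated:", accumulated)
print("fresh:      ", fresh)
```

With the reset skipped, each update would use the sum of all past gradients rather than the current one, which behaves like an ever-growing step size.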


PyTorch: Training on All 16 Images

Training on a single image is a good demo, but not realistic. A real network must generalize: learn the rule, not memorize one example. Let's train on all 16 possible 2×2 binary images and their diagonal flips:

Full Dataset Training (500 Epochs) — PyTorch
🐍full_training.py
1import torch

PyTorch core library.

2import torch.nn as nn

Neural network building blocks.

3import torch.nn.functional as F

Stateless functions (F.relu).

5class DiagonalFlipNet(nn.Module)

Same architecture: 4 → 3 (ReLU) → 4. But this time with RANDOM weights (not our hand-picked ones).

11def forward(self, x):

When x is a batch (16×4), PyTorch processes all 16 images at once via matrix multiplication. No Python loop needed.

EXECUTION STATE
Batch input = x shape (16, 4) — all 16 images at once. layer1 output: (16, 3). layer2 output: (16, 4).
12return self.layer2(F.relu(self.layer1(x)))

Forward pass. With batch input (16,4), produces batch output (16,4) — all 16 predictions at once.

15def make_dataset(): — All 16 2×2 images

Generates every possible 2×2 binary image and its diagonal flip. There are 2⁴ = 16 possible images (each pixel is 0 or 1).

EXECUTION STATE
Why 16? = 2×2 image with binary pixels: 2⁴ = 16 combinations. From [0,0,0,0] to [1,1,1,1].
17for i in range(16): — Generate each image

Loop through integers 0-15. Each integer’s binary representation IS a 2×2 image.

LOOP TRACE · 4 iterations
i=0 (0000)
bits = [0, 0, 0, 0]
flipped = [0, 0, 0, 0] (diagonal flip = same)
i=6 (0110)
bits = [0, 1, 1, 0]
flipped = [0, 1, 1, 0] (symmetric image, so the flip leaves it unchanged)
i=11 (1011) — our example
bits = [1, 0, 1, 1]
flipped = [1, 1, 0, 1] (our familiar target)
i=15 (1111)
bits = [1, 1, 1, 1]
flipped = [1, 1, 1, 1] (all ones = same)
18bits = [(i >> 3) & 1, (i >> 2) & 1, ...]

Extract individual bits from integer i. For i=11 (binary 1011): bit 3 = 1, bit 2 = 0, bit 1 = 1, bit 0 = 1.

EXECUTION STATE
📚 >> (right shift) = Shifts bits right. 11 >> 2 = 2 (binary 10). Then & 1 extracts the last bit.
📚 & 1 (bitwise AND) = Masks to just the lowest bit. 2 & 1 = 0, 3 & 1 = 1.
20img = torch.tensor(bits, dtype=torch.float32)

Convert bit list to a float tensor for computation.

21flipped = img.reshape(2, 2).T.reshape(-1)

The diagonal flip operation: reshape to 2×2 matrix, transpose (.T swaps rows and columns), flatten back to 1D.

EXECUTION STATE
📚 .reshape(2, 2) = [1,0,1,1] → [[1,0],[1,1]] — 2×2 image
📚 .T = Transpose: [[1,0],[1,1]] → [[1,1],[0,1]] — diagonal flip!
📚 .reshape(-1) = Flatten: [[1,1],[0,1]] → [1,1,0,1] — -1 means 'infer size'
24return torch.stack(inputs), torch.stack(targets)

Combine 16 individual tensors into batch tensors. torch.stack creates a new dimension: 16 tensors of shape (4,) become one tensor of shape (16, 4).

EXECUTION STATE
📚 torch.stack(list_of_tensors) = Stacks tensors along a new dimension 0. [t1(4,), t2(4,), ...] → (16, 4).
⬆ X shape = (16, 4) — all 16 inputs
⬆ Y shape = (16, 4) — all 16 targets
26X, Y = make_dataset() — 16 image-target pairs

Now we have the complete training set: 16 inputs and their diagonal-flipped targets.

29torch.manual_seed(42) — Reproducible random init

Fix the random number generator so the initial weights are the same every time we run. Seed 42 is a convention (Hitchhiker’s Guide reference).

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets the RNG state. All subsequent random operations (including nn.Linear weight init) produce the same sequence.
30model = DiagonalFlipNet() — Fresh random weights

Brand new model with random initialization. NOT our hand-picked weights — this time the network must discover good weights entirely through training.

31optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

A smaller learning rate (0.05 vs 0.1) for the full-dataset run. Treat this as a tuning choice for this experiment rather than a general rule: because the loss is a mean, the gradient's scale does not grow with the number of examples.

EXECUTION STATE
lr = 0.05 = Half of our single-example rate. The gradient is an average over all 16 examples, and a smaller step is a conservative choice here.
35for epoch in range(501): — Train 500 epochs

One epoch = one pass through all 16 training examples. range(501) runs epochs 0 through 500, so the network sees each image 501 times and the epoch-500 checkpoint gets printed.

EXECUTION STATE
epoch vs step = In the single-image loop, step = epoch. With 16 images, 1 epoch = 16 gradient contributions averaged together.
36predictions = model(X) — Forward pass (all 16 at once)

Batch forward pass: X(16,4) produces predictions(16,4). PyTorch handles the batch dimension automatically through matrix multiplication.

EXECUTION STATE
⬆ predictions shape = (16, 4) — one 4-element prediction per image
37loss = torch.mean((predictions - Y) ** 2) — Average MSE

MSE across ALL 16 images: average of 16×4 = 64 squared errors. The gradient points in the direction that improves all 16 predictions simultaneously.

39if epoch in [0, 10, 50, 100, 200, 500]:

Print at selected epochs to show convergence.

40pred = [...predictions[11]...] — Show sample

Show prediction for input index 11 = [1,0,1,1] (our familiar example). Target is [1,1,0,1].

EXECUTION STATE
predictions[11] = The network’s prediction for input [1,0,1,1]. Should converge to [1,1,0,1].
45optimizer.zero_grad() / loss.backward() / optimizer.step()

The three-line training mantra. zero_grad → backward → step. Same as before, but now the gradients are averaged across all 16 images.

50with torch.no_grad(): — Final evaluation

Disable gradient tracking for evaluation. We’re just testing — no need to build a computation graph.

51preds = model(X) — All predictions

Run all 16 images through the trained model.

52rounded = torch.round(preds).clamp(0, 1)

Round predictions to nearest integer (0 or 1) and clamp to valid range. This converts continuous outputs to binary predictions for accuracy check.

EXECUTION STATE
📚 torch.round(tensor) = Round each element to nearest integer. 0.97 → 1.0, 0.03 → 0.0.
📚 .clamp(min, max) = Clip values to [min, max] range. Prevents rounding to -1 or 2.
53correct = (rounded == Y).all(dim=1).sum()

Count how many images are predicted perfectly. .all(dim=1) checks if ALL 4 outputs match for each image. .sum() counts the True values.

EXECUTION STATE
📚 .all(dim=1) = Per-row all: True only if all 4 elements match. Returns (16,) boolean tensor.
📚 .sum() = Count True values: True=1, False=0. Result: number of perfectly-predicted images.
⬆ correct = 16 — all 16 images predicted correctly!
54print(f"Accuracy: {correct}/16")

Display final accuracy.

EXECUTION STATE
⬆ output = Accuracy: 16/16 — 100% ✓
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class DiagonalFlipNet(nn.Module):
6    def __init__(self):
7        super().__init__()
8        self.layer1 = nn.Linear(4, 3)
9        self.layer2 = nn.Linear(3, 4)
10
11    def forward(self, x):
12        return self.layer2(F.relu(self.layer1(x)))
13
14# ── Create ALL 16 training examples ──
15def make_dataset():
16    inputs, targets = [], []
17    for i in range(16):
18        bits = [(i >> 3) & 1, (i >> 2) & 1,
19                (i >> 1) & 1, i & 1]
20        img = torch.tensor(bits, dtype=torch.float32)
21        flipped = img.reshape(2, 2).T.reshape(-1)
22        inputs.append(img)
23        targets.append(flipped)
24    return torch.stack(inputs), torch.stack(targets)
25
26X, Y = make_dataset()
27
28# ── Fresh model, random weights ──
29torch.manual_seed(42)
30model = DiagonalFlipNet()
31optimizer = torch.optim.SGD(
32    model.parameters(), lr=0.05)
33
34# ── Train 500 epochs on all 16 examples ──
35for epoch in range(501):
36    predictions = model(X)
37    loss = torch.mean((predictions - Y) ** 2)
38
39    if epoch in [0, 10, 50, 100, 200, 500]:
40        pred = [round(v, 3)
41                for v in predictions[11].tolist()]
42        print(f"Epoch {epoch:<4} loss={loss.item():.6f}"
43              f"  sample={pred}")
44
45    optimizer.zero_grad()
46    loss.backward()
47    optimizer.step()
48
49# ── Final accuracy check ──
50with torch.no_grad():
51    preds = model(X)
52    rounded = torch.round(preds).clamp(0, 1)
53    correct = (rounded == Y).all(dim=1).sum()
54    print(f"\nAccuracy: {correct}/16")

100% accuracy. The network learned to diagonally flip every possible 2×2 binary image. From random weights producing garbage to a perfect transformation — all through the same loop: forward → loss → backward → update.
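The accuracy count in the final block is strict: an image only counts if all four of its rounded outputs match the target. A small NumPy sketch of the same metric, using np.round, np.clip and .all(axis=1) as stand-ins for the torch calls. The three predictions below are made-up numbers for illustration, not outputs of the trained model.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only to exercise the metric.
Y = np.array([[1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
preds = np.array([[0.97, 1.04, -0.02, 0.91],   # every output rounds correctly
                  [0.10, 0.45,  0.80, 0.20],   # every output rounds correctly
                  [0.95, 0.40,  1.10, 0.99]])  # second output rounds to 0: a miss

rounded = np.clip(np.round(preds), 0, 1)       # snap to {0, 1}
per_image = (rounded == Y).all(axis=1)         # True only if all 4 pixels match
correct = int(per_image.sum())                 # number of fully correct images
per_pixel = float((rounded == Y).mean())       # looser, pixel-level accuracy

print(f"Image accuracy: {correct}/{len(Y)}")
print(f"Pixel accuracy: {per_pixel:.3f}")
```

Reporting both numbers can be useful while debugging, since a single wrong pixel hides behind a high pixel accuracy but fails the whole image.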


What the Network Learned

The diagonal flip is a permutation: it swaps positions 1 and 2 while keeping positions 0 and 3 fixed. Mathematically, it's multiplication by a permutation matrix:

$$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad \Rightarrow \quad \mathbf{P} \begin{bmatrix} p_{00} \\ p_{01} \\ p_{10} \\ p_{11} \end{bmatrix} = \begin{bmatrix} p_{00} \\ p_{10} \\ p_{01} \\ p_{11} \end{bmatrix}$$
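As a sanity check on the matrix form, the sketch below builds all 16 images with the same bit extraction as make_dataset() and confirms that multiplying by P gives the same result as reshape, transpose, flatten. It uses NumPy rather than PyTorch; the variable unchanged is a name introduced here for illustration.

```python
import numpy as np

# Permutation matrix for the diagonal flip: swaps positions 1 and 2.
P = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])

unchanged = 0   # hypothetical counter: images the flip leaves as they are
for i in range(16):
    bits = [(i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1]
    img = np.array(bits)
    flipped = img.reshape(2, 2).T.reshape(-1)   # reshape, transpose, flatten
    assert np.array_equal(P @ img, flipped)     # matrix form agrees
    unchanged += int(np.array_equal(img, flipped))

# Flipping twice restores the original image, so P is its own inverse.
assert np.array_equal(P @ P, np.eye(4, dtype=int))

print("Images unchanged by the flip:", unchanged)
```

The images counted in unchanged are the symmetric ones, where the two off-diagonal pixels are equal.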

The network approximates this permutation using its two layers and ReLU activation. It didn't learn the matrix directly — it learned a decomposition through a 3-neuron hidden layer that achieves the same result.

Input | Network Output | Target | Match?
[0, 0, 0, 0] | [0.00, 0.00, 0.00, 0.00] | [0, 0, 0, 0] | ✓
[0, 1, 0, 0] | [0.00, 0.00, 1.00, 0.00] | [0, 0, 1, 0] | ✓
[1, 0, 1, 1] | [1.00, 1.00, 0.00, 1.00] | [1, 1, 0, 1] | ✓
[1, 1, 1, 1] | [1.00, 1.00, 1.00, 1.00] | [1, 1, 1, 1] | ✓
Key takeaway: We never told the network about permutation matrices, linear algebra, or geometric transformations. We just showed it 16 input-output pairs and said "minimize the error." The network discovered the pattern on its own through gradient descent. That's the power of neural networks.

Exercises

  1. Different learning rates. Try $\eta = 0.01$, $\eta = 0.5$, and $\eta = 2.0$ in the 200-step NumPy loop. How does convergence speed change? Does $\eta = 2.0$ diverge?
  2. Fewer hidden neurons. Change the DiagonalFlipNet to use 2 hidden neurons instead of 3. Can the network still learn the diagonal flip? Why or why not? (Hint: think about how many independent operations the flip requires.)
  3. Different transformation. Instead of diagonal flip (transpose), try horizontal flip: $[p_{00}, p_{01}, p_{10}, p_{11}] \to [p_{01}, p_{00}, p_{11}, p_{10}]$. Modify make_dataset() and retrain. How fast does the network learn this?
  4. Numerical gradient check. For one weight (e.g., W1[0][2]), compute the gradient numerically: run two forward passes with $w \pm 0.0001$ and compare $\frac{L(w+\varepsilon) - L(w-\varepsilon)}{2\varepsilon}$ with the backprop gradient. They should match to 4+ decimal places.
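One possible starting point for exercise 4 is sketched below. It reuses the step-0 weights from the NumPy listing and checks W1[0][2] only. The helper loss_at and the names numeric and analytic are introduced here for illustration.

```python
import numpy as np

# Step-0 weights, input and target from the NumPy listing.
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

def loss_at(W1_candidate):
    """Forward pass and MSE loss for a given hidden weight matrix."""
    z1 = np.round(x @ W1_candidate + b1, 10)
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    return np.mean((y_hat - target) ** 2)

# Central difference on a single weight.
eps = 1e-4
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 2] += eps
W_minus[0, 2] -= eps
numeric = (loss_at(W_plus) - loss_at(W_minus)) / (2 * eps)

# Backprop gradient for the same weight, using the formulas from the loop.
z1 = np.round(x @ W1 + b1, 10)
h = np.maximum(0, z1)
y_hat = h @ W2 + b2
dL_dy = 0.5 * (y_hat - target)
dL_dz1 = (W2 @ dL_dy) * (z1 > 0)
analytic = np.outer(x, dL_dz1)[0, 2]

print(f"numeric={numeric:.6f}  analytic={analytic:.6f}")
```

Repeating the check for a weight that feeds an inactive neuron, such as W1[0][1], is a useful follow-up: both values should then be zero.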

Chapter Summary

In this chapter, you traced the complete learning process of a neural network:

  1. Gradient descent adjusts weights in the direction that reduces the loss
  2. Backpropagation uses the chain rule to compute all gradients in one backward pass
  3. Dead neurons (zero after ReLU) block gradient flow — they can't learn from that example
  4. Weight updates are small steps: $w = w - \eta \cdot \text{grad}$
  5. After one step, loss dropped 19%. After 200 steps, it reached essentially zero.
  6. PyTorch autograd computes gradients identical to the hand calculations: loss.backward() replaces 7 lines of math
  7. Training on all 16 images for 500 epochs achieved 100% accuracy
You now understand the complete deep learning loop at the level of individual multiplications. Everything else — CNNs, RNNs, Transformers, GPT — is built on this exact foundation. The architectures change, the loss functions change, the data changes. But forward pass → loss → backward pass → update is always the core.