Learning Objectives
By the end of this section, you will be able to:
- Write a complete training step in NumPy — forward pass, backpropagation, weight update, and verification
- Translate the same step to PyTorch and verify autograd produces identical gradients
- Build a training loop in both NumPy and PyTorch, watching the loss converge to zero
- Train the network on all 16 possible 2×2 images and measure how many it flips correctly
- Inspect what the network learned about the diagonal flip transformation
The pattern for this section: For each concept, we first build it from scratch in Python/NumPy — so you see every multiplication. Then we show the same thing in PyTorch — so you see how frameworks automate the tedious parts while doing the exact same math underneath.
Python: One Complete Training Step
In Sections 2 and 3, we computed gradients by hand and updated weights one equation at a time. Now let's put it all together (forward pass, backward pass, weight update, and verification) in a single Python script. The notes below follow the script from top to bottom; the full listing comes after them.
NumPy provides fast N-dimensional arrays and matrix operations. We use @ for matrix multiply, np.outer() for outer product, np.maximum() for ReLU, and np.mean() for MSE loss. All math runs as optimized C code, not slow Python loops.
These are the exact same weights we used in Chapter 7 (forward pass) and Section 2 (backprop). Using identical weights lets us verify that every computed gradient matches our hand calculations.
W1 connects 4 inputs to 3 hidden neurons. Row i holds all weights leaving input x[i]; column j holds all weights entering hidden neuron j. Shape (4, 3) — the NumPy convention where x @ W1 works directly.
| | h0 | h1 | h2 |
|---|---|---|---|
| x0 | 0.20 | -0.50 | 0.10 |
| x1 | -0.30 | 0.40 | -0.20 |
| x2 | 0.10 | 0.30 | 0.50 |
| x3 | -0.40 | 0.20 | -0.10 |
One bias per hidden neuron, added after the matrix multiply. b1[2] = 0.0 means neuron 2 gets no bias shift.
W2 connects 3 hidden neurons to 4 outputs. Row i holds weights from hidden neuron h[i]; column j holds weights entering output y[j]. This same matrix is used BACKWARD to propagate gradients.
| | y0 | y1 | y2 | y3 |
|---|---|---|---|---|
| h0 | 0.30 | -0.20 | 0.40 | 0.10 |
| h1 | -0.10 | 0.50 | -0.30 | 0.20 |
| h2 | 0.20 | -0.40 | 0.10 | -0.50 |
One bias per output neuron. Unlike the W2 weights, output biases always receive a gradient during backprop because they don't depend on the hidden activations. (Hidden biases are different: a dead hidden neuron's bias gets zero gradient.)
The 2×2 image [[1,0],[1,1]] flattened. x[1] = 0 means input pixel 1 is off — any weight connected to x[1] will get zero gradient (no signal flowed through it).
The diagonal flip of the input: [[1,1],[0,1]] flattened. Positions 1 and 2 swap (0→1, 1→0), while positions 0 and 3 stay fixed.
Run the input through the network to get a prediction and compute the loss. We need all intermediate values (z1, h) because backprop multiplies upstream gradients by local values from the forward pass.
Matrix multiply input (4,) by W1 (4,3), add bias (3,). np.round(..., 10) eliminates floating-point noise: z1[0] is mathematically 0.0 (0.2+0+0.1−0.4+0.1 = 0.0) but raw computation gives ~2.8e-17.
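To make the floating-point noise concrete, here is a minimal sketch that adds the four non-zero terms of z1[0] as plain Python floats. The left-to-right summation order is an assumption; NumPy's matrix multiply may accumulate in a different order and leave slightly different noise.

```python
# Terms of z1[0]: x[0]*W1[0][0] + x[2]*W1[2][0] + x[3]*W1[3][0] + b1[0]
# (x[1] = 0 contributes nothing). Summed left to right as plain floats.
raw = 0.2 + 0.1 - 0.4 + 0.1
rounded = round(raw, 10)

print(raw)      # a tiny non-zero residue, on the order of 1e-17
print(rounded)  # 0.0

# Without rounding, the ReLU mask (z1 > 0) would treat neuron 0 as barely alive.
print(raw > 0, rounded > 0)
```

Rounding matters here because the backward pass uses `z1 > 0` as a gate: a residue of 1e-17 would open the gate for a neuron that is mathematically at zero.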
ReLU clamps negatives to zero: max(0, value). Neurons 0 and 1 die (z≤0), only neuron 2 survives. Dead neurons cannot learn — their gradients will be zero in the backward pass.
Output layer: h(3,) @ W2(3,4) + b2(4,) = y_hat(4,). Only h[2]=0.5 is non-zero, so each output = 0.5 × W2[2][j] + b2[j].
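The shortcut in the note above can be checked directly. This small sketch reuses the section's W2, b2, and the hidden activation h = [0, 0, 0.5], and compares the full matrix product with the single-row formula.

```python
import numpy as np

W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
h = np.array([0.0, 0.0, 0.5])      # only hidden neuron 2 is alive

full = h @ W2 + b2                 # full matrix product
shortcut = 0.5 * W2[2] + b2        # only the row of the live neuron

print(full)
print(np.allclose(full, shortcut))
```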
Mean Squared Error: average of squared differences. This single number measures how wrong the network is — gradient descent will try to make it smaller.
Backpropagation: compute the gradient of the loss with respect to every weight, starting from the output and working backward. The eight lines cover the 7 steps from Section 2, with steps 5 and 5b written on separate lines.
The derivative of MSE loss with respect to each output. The 0.5 is (1/4) × 2: the 1/4 comes from averaging over 4 outputs and the 2 from differentiating the square, so d/dŷ[(1/4)∑(ŷ-y)²] = (1/4)×2×(ŷ-y) = 0.5×(ŷ-y). Negative values mean the output should INCREASE.
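One way to convince yourself of the 0.5 factor is a finite-difference check on the outputs themselves. The sketch below is illustrative only: the step size `eps` and the helper `mse` are names chosen here, not part of the chapter's script.

```python
import numpy as np

y_hat = np.array([0.1, -0.1, -0.05, -0.25])
target = np.array([1.0, 1.0, 0.0, 1.0])

def mse(pred):
    """Mean squared error against the fixed target."""
    return np.mean((pred - target) ** 2)

analytic = 0.5 * (y_hat - target)

eps = 1e-6  # hypothetical step size for the check
numeric = np.zeros_like(y_hat)
for j in range(len(y_hat)):
    bump = np.zeros_like(y_hat)
    bump[j] = eps
    # Central difference: nudge one output up and down, measure the loss change
    numeric[j] = (mse(y_hat + bump) - mse(y_hat - bump)) / (2 * eps)

print(analytic)
print(np.allclose(numeric, analytic, atol=1e-8))
```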
The gradient for each weight W2[i][j] = h[i] × dL/dŷ[j]. Since h[0]=0 and h[1]=0 (dead neurons), rows 0 and 1 are entirely zero. Only row 2 has non-zero gradients.
| | y0 | y1 | y2 | y3 |
|---|---|---|---|---|
| h0 | 0.000 | 0.000 | 0.000 | 0.000 |
| h1 | 0.000 | 0.000 | 0.000 | 0.000 |
| h2 | -0.225 | -0.275 | -0.013 | -0.313 |
Output-bias gradients equal the loss gradient directly: ∂L/∂b2[j] = dL/dŷ[j] × 1. Output biases get a gradient even when hidden neurons are dead; the hidden biases b1 do not share this property.
Error flows BACKWARD through W2. Each hidden neuron’s gradient is the weighted sum of all output gradients: dL/dh[i] = ∑ⱼ W2[i][j] × dL/dŷ[j]. Same weights, backward direction.
ReLU’s derivative is a binary mask: 1 where z1 > 0 (gradient passes), 0 where z1 ≤ 0 (gradient blocked). This is the gate that kills learning in dead neurons.
Element-wise multiply: the gradient only passes where ReLU was active. Neurons 0 and 1 lose their gradient entirely. Only neuron 2 carries signal backward.
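The three notes above (backward through W2, the ReLU mask, and the element-wise gate) can be traced with the section's own numbers. This is a sketch that writes the weighted sum out explicitly and compares it with the matrix form.

```python
import numpy as np

W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
dL_dy = np.array([-0.45, -0.55, -0.025, -0.625])
z1 = np.array([0.0, -0.1, 0.5])

# dL/dh[i] = sum_j W2[i][j] * dL/dy[j], written out as explicit sums
dL_dh_loop = np.array([sum(W2[i, j] * dL_dy[j] for j in range(4))
                       for i in range(3)])
dL_dh = W2 @ dL_dy                    # same thing as one matrix-vector product

relu_grad = (z1 > 0).astype(float)    # 1 where the neuron was active, else 0
dL_dz1 = dL_dh * relu_grad            # gradient survives only through neuron 2

print(dL_dh)
print(relu_grad)
print(dL_dz1)
```

Neurons 0 and 1 do receive a non-zero dL/dh, but the mask discards it before it can reach W1 or b1.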
Each gradient dL/dW1[i][j] = x[i] × dL/dz1[j]. Only column 2 is non-zero (neurons 0,1 are dead). Row 1 is all-zero because x[1] = 0.
| | h0 | h1 | h2 |
|---|---|---|---|
| x0 | 0.00 | 0.00 | 0.44 |
| x1 | 0.00 | 0.00 | 0.00 |
| x2 | 0.00 | 0.00 | 0.44 |
| x3 | 0.00 | 0.00 | 0.44 |
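The W1 gradient table is just an outer product. As a quick sketch, the code below builds it with `np.outer` and confirms the two zero patterns described above: dead neurons zero out columns, and an input of 0 zeroes out a row.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])
dL_dz1 = np.array([0.0, 0.0, 0.44])

dL_dW1 = np.outer(x, dL_dz1)   # entry [i][j] = x[i] * dL_dz1[j]

print(dL_dW1.shape)            # (4, 3), the same shape as W1
print(dL_dW1)
```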
Bias gradients equal the pre-activation gradients. Only b1[2] has non-zero gradient — the other two neurons are dead.
Apply the gradient descent update rule: w_new = w_old - η × gradient. Negative gradients mean the weight increases (moves toward lower loss). Positive gradients mean the weight decreases.
Controls step size. Too large (e.g., 1.0) and the network overshoots, bouncing around. Too small (e.g., 0.001) and training crawls. 0.1 is a common starting point for small networks.
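The effect of the step size is easiest to see on a toy problem. The sketch below is not the chapter's network: it minimises the hypothetical one-parameter function f(w) = w², whose gradient is 2w, with the three learning rates mentioned above. The exact thresholds for "too large" depend on the function being minimised.

```python
def descend(lr, steps=20, w=1.0):
    """Plain gradient descent on f(w) = w**2, where f'(w) = 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

for lr in (1.0, 0.1, 0.001):
    print(lr, descend(lr))
```

On this toy function, lr=1.0 flips w between +1 and -1 forever, lr=0.1 shrinks it to about 0.01 in 20 steps, and lr=0.001 has barely moved from the starting point.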
Only column 2 changes (neurons 0,1 are dead). Positive gradient (0.44) means the weight DECREASES — we move opposite to the gradient to reduce loss.
Only b1[2] changes (neurons 0,1 have zero gradient). The positive gradient (0.44) pushes b1[2] downward.
Only row 2 changes (rows 0,1 have zero gradient from dead neurons). All gradients are negative, so all weights INCREASE — pushing outputs upward toward targets.
All 4 output biases get updated (they don't depend on the dead hidden neurons). All gradients are negative, so all biases increase.
Run a new forward pass with the updated weights to confirm the loss actually dropped. This is the payoff — one step of gradient descent made the network measurably better.
Forward pass with updated weights. Neuron 2 now produces a smaller pre-activation because W1 column 2 decreased.
ReLU: neurons 0,1 still dead, neuron 2 alive but lower (0.5 → 0.324).
Both the hidden activation (0.324 vs 0.5) and the weights/biases changed. Three of four outputs moved closer to their targets.
Recompute MSE. Each squared error, computed from the unrounded outputs: (0.117-1)²≈0.780, (0.034-1)²≈0.933, (-0.065-0)²≈0.004, (-0.089-1)²≈1.187. Mean = 0.726.
Display original loss for comparison.
Loss after one gradient descent step.
One step cut the loss by 19%. Not perfect, but measurably better.
```python
import numpy as np

# ── Network weights (same as Chapter 7) ──
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])

# ── Step 1: Forward pass ──
z1 = np.round(x @ W1 + b1, 10)
h = np.maximum(0, z1)
y_hat = h @ W2 + b2
loss = np.mean((y_hat - target) ** 2)

# ── Step 2: Backward pass (7 gradient steps) ──
dL_dy = 0.5 * (y_hat - target)
dL_dW2 = np.outer(h, dL_dy)
dL_db2 = dL_dy.copy()
dL_dh = W2 @ dL_dy
relu_grad = (z1 > 0).astype(float)
dL_dz1 = dL_dh * relu_grad
dL_dW1 = np.outer(x, dL_dz1)
dL_db1 = dL_dz1.copy()

# ── Step 3: Update weights ──
eta = 0.1
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1
W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2

# ── Step 4: Verify improvement ──
z1_v = np.round(x @ W1_new + b1_new, 10)
h_v = np.maximum(0, z1_v)
y_hat_new = h_v @ W2_new + b2_new
loss_new = np.mean((y_hat_new - target) ** 2)

print(f"Old loss: {loss:.4f}")
print(f"New loss: {loss_new:.4f}")
print(f"Improved: {(1 - loss_new/loss)*100:.1f}%")
```

Result: One step of gradient descent reduced the loss from 0.896 to 0.726, a 19% improvement. Three of four outputs moved closer to their targets.
| Output | Before | After | Target | Direction |
|---|---|---|---|---|
| ŷ₀ | 0.10 | 0.12 | 1.0 | ✅ Toward target |
| ŷ₁ | −0.10 | 0.03 | 1.0 | ✅ Toward target |
| ŷ₂ | −0.05 | −0.06 | 0.0 | ➖ Slightly away from target |
| ŷ₃ | −0.25 | −0.09 | 1.0 | ✅ Toward target |
PyTorch: Autograd Does It in Three Lines
Now the same computation in PyTorch. The key insight: loss.backward() replaces the 7 backprop steps (8 lines of NumPy), and optimizer.step() replaces 4 lines of weight updates. The gradients are identical to our hand calculations.
PyTorch’s core library. Provides tensors (like NumPy arrays but with automatic gradient computation) and GPU acceleration. Every operation on a tensor is recorded in a computation graph for backpropagation.
Neural network building blocks: layers (Linear, Conv2d), the Module base class, and loss functions. nn.Linear stores its own weight matrix and bias vector.
Stateless versions of activations and operations. F.relu(x) applies ReLU without creating a persistent module. Used in forward() methods.
Same architecture as the NumPy version: 4 inputs → 3 hidden (ReLU) → 4 outputs. nn.Module is the base class for all PyTorch models — it tracks parameters, enables saving/loading, and defines the forward() contract.
Constructor: define the layers (weight matrices). super().__init__() initializes the nn.Module machinery that tracks parameters.
Calls the parent class constructor. Required for nn.Module to work correctly: it sets up the internal registries for parameters, submodules, and hooks. The computation graph itself is built later, during the forward pass.
Creates a fully-connected layer: 4 inputs → 3 outputs. Stores a (3,4) weight matrix and (3,) bias. Note: PyTorch uses shape (out_features, in_features) — the transpose of our NumPy convention.
3 hidden neurons → 4 outputs. Weight shape: (4, 3). Bias shape: (4,).
PyTorch calls this when you do model(x). It defines the computation graph: linear → ReLU → linear. Every operation is recorded for backpropagation.
self.layer1(x) computes x @ W1.T + b1 = z1. F.relu() applies max(0, z1). Both operations are recorded in the computation graph for loss.backward() later.
Passes hidden activations through output layer: h @ W2.T + b2. No activation on the output — this is a regression network (predicting continuous values).
Creates the network with random weights. We’ll overwrite them with our hand-picked values next.
When loading weights manually, we don’t want PyTorch to record these operations in the computation graph. torch.no_grad() is a context manager that temporarily disables gradient computation.
Copy our hand-picked weights into the model. Note the shape (3,4) — PyTorch stores weights as (out_features, in_features), which is the TRANSPOSE of our NumPy W1 (4,3). The values are the same, just organized differently.
Bias shape is the same in both conventions: one value per neuron.
Output weights: (4, 3) in PyTorch — the transpose of our NumPy W2 (3, 4).
Output biases: [0.0, 0.1, -0.1, 0.0].
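The (out_features, in_features) layout can be checked without PyTorch. The sketch below stays in NumPy and assumes only the documented nn.Linear rule, output = x @ weight.T + bias; the variable `W1_torch_layout` is a name chosen here for illustration.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])   # NumPy layout: (in, out) = (4, 3)
b1 = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

W1_torch_layout = W1.T                # PyTorch layout: (out, in) = (3, 4)

z_numpy = x @ W1 + b1                         # this section's NumPy convention
z_torch_style = x @ W1_torch_layout.T + b1    # what nn.Linear computes

print(W1_torch_layout)
print(np.allclose(z_numpy, z_torch_style))
```

The printed matrix is exactly the one passed to `model.layer1.weight.copy_(...)` in the PyTorch listing, which is why the two versions compute the same z1.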
Identical to our NumPy input. torch.tensor() creates a PyTorch tensor from a Python list. The input itself does not require gradients; operations on it are recorded because the model's parameters do.
The diagonal flip target: [[1,1],[0,1]] flattened.
Calling model(x) runs forward(x) under the hood. PyTorch builds a computation graph as it goes, recording every operation for backprop.
Same MSE computation as NumPy, but now the computation graph connects loss → y_hat → layer2 → relu → layer1 → x. This graph is what makes .backward() possible.
This is the key moment. Eight lines of careful NumPy math (dL_dy, dL_dW2, dL_db2, dL_dh, relu_grad, dL_dz1, dL_dW1, dL_db1) are all computed automatically in one call.
Walks the computation graph backward from loss to every parameter, computing ∂L/∂w for all 31 weights and biases. Uses the chain rule automatically. After this call, every parameter’s .grad attribute holds its gradient.
Access the computed gradient for the output bias. Every parameter has a .grad attribute after .backward() is called.
Hidden bias gradients. Only neuron 2 has non-zero gradient (the other two are dead).
Creates a Stochastic Gradient Descent optimizer that manages all 31 parameters. lr=0.1 is our learning rate η. The optimizer applies the update rule: w = w - η × grad.
Applies w = w - lr × grad for all 31 parameters in one call. This is equivalent to our 4 lines of NumPy updates (W1_new, b1_new, W2_new, b2_new).
PyTorch ACCUMULATES gradients by default (adds to .grad instead of replacing). You must zero them before the next backward pass, or gradients from different iterations pile up. This is the #1 PyTorch beginner bug.
Forward pass with updated weights.
Recompute MSE with updated weights.
Confirm: one training step reduced the loss by 19%, matching our NumPy result up to float32 rounding (PyTorch defaults to 32-bit floats, NumPy to 64-bit).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Same network, same weights ──
class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        h = F.relu(self.layer1(x))
        return self.layer2(h)

model = DiagonalFlipNet()

# Load our exact weights
with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1]]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5]]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

# ── Forward pass + loss ──
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
y_hat = model(x)
loss = torch.mean((y_hat - target) ** 2)

# ── Backprop: ONE LINE replaces 7 steps ──
loss.backward()

# ── Verify: gradients match hand calculations ──
print("dL/db2:", model.layer2.bias.grad)
print("dL/db1:", model.layer1.bias.grad)

# ── One training step ──
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1)
optimizer.step()
optimizer.zero_grad()

y_hat_new = model(x)
loss_new = torch.mean((y_hat_new - target) ** 2)
print(f"Loss: {loss.item():.4f} -> {loss_new.item():.4f}")
```

| Gradient | NumPy (hand-coded) | PyTorch (autograd) |
|---|---|---|
| ∂L/∂b₂ | [-0.45, -0.55, -0.025, -0.625] | [-0.45, -0.55, -0.025, -0.625] ✓ |
| ∂L/∂W₂ row 2 | [-0.225, -0.275, -0.013, -0.313] | [-0.225, -0.275, -0.013, -0.313] ✓ |
| ∂L/∂b₁ | [0.0, 0.0, 0.44] | [0.0, 0.0, 0.44] ✓ |
| ∂L/∂W₁ col 2 | [0.44, 0.0, 0.44, 0.44] | [0.44, 0.0, 0.44, 0.44] ✓ |
The trade-off is clear: NumPy gives you full visibility into every gradient computation. PyTorch gives you the same result with far less code. Now that you've verified they match, you can trust PyTorch's autograd and focus on the bigger picture.
Python: The Training Loop
One step gave us 19% improvement. What happens when we repeat 200 times? The training loop is the heart of machine learning: forward → loss → backward → update → repeat.
NumPy for all array math operations.
Identical weights as before. We start from the original hand-picked initialization so we can watch the full learning trajectory from scratch.
Same 31 parameters from Chapter 7. The network starts with a loss of 0.8963 and predictions [0.1, -0.1, -0.05, -0.25] — far from the target [1, 1, 0, 1].
Input [1,0,1,1] and its diagonal flip target [1,1,0,1]. We train on this single example for 200 steps.
Same learning rate as our single-step experiment.
Repeat forward → backward → update 201 times (steps 0 through 200). Each iteration nudges all 31 weights a tiny bit toward lower loss. After enough iterations, the network converges to near-zero loss.
Compute hidden layer pre-activation with CURRENT weights (updated each iteration). The values change every step as weights are adjusted.
Apply ReLU. As weights evolve, neurons that were dead at step 0 may become alive in later steps.
Compute the network’s current prediction. This changes every step as weights are updated.
How wrong the network is right now. This number should decrease every step if training is working.
Print progress at selected checkpoints. The spacing is logarithmic because most improvement happens early (fast drops), then fine-tuning later (slow improvements).
Round predictions to 3 decimal places for clean display.
Display step number, current loss, and prediction.
Start of backpropagation. Same formula as before, but with the current (evolving) y_hat.
Outer product of hidden activation and loss gradient.
Bias gradient equals loss gradient directly.
Error flows backward through W2 (same matrix, backward direction).
Apply ReLU gradient mask. This line combines steps 5 and 5b from Section 2 into one line.
Outer product of input and hidden gradient.
Bias gradient equals the pre-activation gradient.
In-place update: W1 = W1 - 0.1 × gradient. Using -= modifies the array directly instead of creating a new one. More memory-efficient.
In-place bias update.
In-place output weight update.
After this line, all 31 parameters have been updated. The loop returns to the top and does another forward pass with the improved weights.
```python
import numpy as np

# ── Same network setup ──
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])
eta = 0.1

for step in range(201):
    # Forward
    z1 = np.round(x @ W1 + b1, 10)
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - target) ** 2)

    if step in [0, 1, 5, 10, 25, 50, 100, 200]:
        pred = np.round(y_hat, 3)
        print(f"Step {step:<4} loss={loss:.6f} pred={pred}")

    # Backward
    dL_dy = 0.5 * (y_hat - target)
    dL_dW2 = np.outer(h, dL_dy)
    dL_db2 = dL_dy.copy()
    dL_dh = W2 @ dL_dy
    dL_dz1 = dL_dh * (z1 > 0).astype(float)
    dL_dW1 = np.outer(x, dL_dz1)
    dL_db1 = dL_dz1.copy()

    # Update
    W1 -= eta * dL_dW1
    b1 -= eta * dL_db1
    W2 -= eta * dL_dW2
    b2 -= eta * dL_db2
```

The output tells the full story of learning:
| Step | Loss | Prediction | Pattern |
|---|---|---|---|
| 0 | 0.8963 | [0.1, -0.1, -0.05, -0.25] | Random garbage |
| 1 | 0.7258 | [0.12, 0.03, -0.06, -0.09] | First improvement |
| 10 | 0.2465 | [0.39, 0.48, -0.06, 0.42] | Getting the shape right |
| 50 | 0.0041 | [0.92, 0.93, -0.01, 0.93] | Nearly there |
| 200 | ≈0.000 | [1.0, 1.0, -0.0, 1.0] | Essentially perfect |
PyTorch: The Same Loop, Simplified
The same 200-step training loop in PyTorch. The loop body shrinks from 15 lines (forward + backward + update) to just 5: model(x), loss, zero_grad(), backward(), step().
PyTorch core: tensors, autograd, computation graphs.
Neural network layers and base class.
Stateless activation functions like F.relu().
Identical model definition. In practice, you define this once and reuse it. We repeat it here for clarity.
The forward pass in one line: linear → ReLU → linear. Note how compact this is compared to the 3 NumPy lines (z1, h, y_hat).
Chains layer1 → relu → layer2 in one expression. PyTorch builds the computation graph automatically as each operation executes.
Create the model.
Load our hand-picked weights (identical to NumPy version). The block under the "Load same weights" comment is the same weight loading as before.
Same input image as NumPy.
Same diagonal flip target.
Create the optimizer ONCE before the loop. It holds references to all parameters and applies updates when .step() is called.
Same 201 iterations. But the loop body is just 5 lines instead of 15 — PyTorch handles forward graph construction and backpropagation automatically.
One call does the entire forward pass: layer1 → ReLU → layer2. Compare to NumPy: z1 = ..., h = ..., y_hat = ... (3 separate lines).
Compute loss. PyTorch records this in the computation graph.
Print at logarithmically-spaced checkpoints.
Convert tensor to Python list and round for display.
Display progress. .item() extracts a Python float from a 0-dim tensor.
MUST come before .backward(). Without this, gradients from the previous step would accumulate (add up) with the current step’s gradients.
Replaces 7 lines of NumPy backprop math. Walks the computation graph backward and fills every parameter’s .grad attribute.
Replaces 4 lines of NumPy updates. Applies w = w - 0.1 × grad for all 31 parameters.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        return self.layer2(F.relu(self.layer1(x)))

# ── Load same weights ──
model = DiagonalFlipNet()
with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1]]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5]]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

x = torch.tensor([1.0, 0.0, 1.0, 1.0])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1)

for step in range(201):
    y_hat = model(x)
    loss = torch.mean((y_hat - target) ** 2)

    if step in [0, 1, 5, 10, 25, 50, 100, 200]:
        pred = [round(v, 3) for v in y_hat.tolist()]
        print(f"Step {step:<4} loss={loss.item():.6f} "
              f"pred={pred}")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Same convergence, same result, and a loop body one-third the size. The PyTorch loop body follows a universal pattern that works for any neural network:
- y_hat = model(x): run the forward pass
- loss = ...: measure how wrong the prediction is
- optimizer.zero_grad(): clear the gradients from the previous step
- loss.backward(): compute fresh gradients
- optimizer.step(): update the parameters
This five-step pattern is the same whether you're training a 31-parameter toy network or a 175-billion-parameter GPT model. The architecture changes, the loss function changes, the data changes — but the training loop is always these five lines.
PyTorch: Training on All 16 Images
Training on a single image is a good demo, but not realistic. A real network must generalize: learn the rule, not memorize one example. Let's train on all 16 possible 2×2 binary images and their diagonal flips:
PyTorch core library.
Neural network building blocks.
Stateless functions (F.relu).
Same architecture: 4 → 3 (ReLU) → 4. But this time with RANDOM weights (not our hand-picked ones).
When x is a batch (16×4), PyTorch processes all 16 images at once via matrix multiplication. No Python loop needed.
Forward pass. With batch input (16,4), produces batch output (16,4) — all 16 predictions at once.
Generates every possible 2×2 binary image and its diagonal flip. There are 2⁴ = 16 possible images (each pixel is 0 or 1).
Loop through integers 0-15. Each integer’s binary representation IS a 2×2 image.
Extract individual bits from integer i. For i=11 (binary 1011): bit 3 = 1, bit 2 = 0, bit 1 = 1, bit 0 = 1.
Convert bit list to a float tensor for computation.
The diagonal flip operation: reshape to 2×2 matrix, transpose (.T swaps rows and columns), flatten back to 1D.
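The bit extraction and the reshape-transpose-flatten trick can be traced for a single image. The sketch below is a NumPy analogue of the first few lines of make_dataset(), using i = 11 as in the note above; the PyTorch version uses the same operations on tensors.

```python
import numpy as np

i = 11  # binary 1011
bits = [(i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1]
img = np.array(bits, dtype=float)

grid = img.reshape(2, 2)              # [[1, 0], [1, 1]]
flipped = grid.T.reshape(-1)          # transpose, then flatten back to 1D

print(bits)      # [1, 0, 1, 1]
print(grid)
print(flipped)   # positions 1 and 2 have swapped
```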
Combine 16 individual tensors into batch tensors. torch.stack creates a new dimension: 16 tensors of shape (4,) become one tensor of shape (16, 4).
Now we have the complete training set: 16 inputs and their diagonal-flipped targets.
Fix the random number generator so the initial weights are the same every time we run. Seed 42 is a convention (Hitchhiker’s Guide reference).
Brand new model with random initialization. NOT our hand-picked weights — this time the network must discover good weights entirely through training.
Lower learning rate (0.05 vs 0.1) for the 16-example batch. This is a conservative tuning choice rather than a rule; the best value depends on the problem and is usually found by experiment.
One epoch = one pass through all 16 training examples. 500 epochs means the network sees each image 500 times.
Batch forward pass: X(16,4) produces predictions(16,4). PyTorch handles the batch dimension automatically through matrix multiplication.
MSE across ALL 16 images: average of 16×4 = 64 squared errors. The gradient points in the direction that improves all 16 predictions simultaneously.
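To see that a batch forward pass is nothing more than the single-image pass applied to every row, here is a NumPy sketch. It uses this chapter's hand-picked weights rather than the random ones of the PyTorch model, and the helper `forward` is a name chosen here for illustration.

```python
import numpy as np

W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

# All 16 binary images (one per row) and their diagonal flips
X = np.array([[(i >> k) & 1 for k in (3, 2, 1, 0)] for i in range(16)], dtype=float)
Y = X.reshape(16, 2, 2).transpose(0, 2, 1).reshape(16, 4)

def forward(inp):
    """Works for one image of shape (4,) or a batch of shape (N, 4)."""
    return np.maximum(0, inp @ W1 + b1) @ W2 + b2

batch = forward(X)                               # (16, 4) in one call
one_by_one = np.stack([forward(x) for x in X])   # same result, row by row

loss = np.mean((batch - Y) ** 2)                 # mean over 16 * 4 = 64 squared errors
print(batch.shape, np.allclose(batch, one_by_one), loss)
```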
Print at selected epochs to show convergence.
Show prediction for input index 11 = [1,0,1,1] (our familiar example). Target is [1,1,0,1].
The three-line training mantra. zero_grad → backward → step. Same as before, but now the gradients are averaged across all 16 images.
Disable gradient tracking for evaluation. We’re just testing — no need to build a computation graph.
Run all 16 images through the trained model.
Round predictions to nearest integer (0 or 1) and clamp to valid range. This converts continuous outputs to binary predictions for accuracy check.
Count how many images are predicted perfectly. .all(dim=1) checks if ALL 4 outputs match for each image. .sum() counts the True values.
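The accuracy rule (round, clamp, then require all four pixels to match) is easy to misread, so here is a NumPy analogue on three made-up predictions. The numbers in `preds` are hypothetical and are not output from the trained model.

```python
import numpy as np

# Hypothetical predictions for three images (NOT real model output)
preds = np.array([[0.93, 0.08, 1.20, -0.30],
                  [0.40, 0.70, 0.10,  0.90],
                  [0.02, 0.97, 0.61,  0.05]])
Y = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])

rounded = np.clip(np.round(preds), 0, 1)    # round, then clamp to [0, 1]
per_image = (rounded == Y).all(axis=1)      # True only if all 4 pixels match
correct = int(per_image.sum())

print(per_image)                # [ True False  True]
print(f"{correct}/{len(Y)}")    # 2/3
```

The second image fails on a single pixel (0.40 rounds to 0, target 1), and that one miss is enough to count the whole image as wrong.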
Display final accuracy.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        return self.layer2(F.relu(self.layer1(x)))

# ── Create ALL 16 training examples ──
def make_dataset():
    inputs, targets = [], []
    for i in range(16):
        bits = [(i >> 3) & 1, (i >> 2) & 1,
                (i >> 1) & 1, i & 1]
        img = torch.tensor(bits, dtype=torch.float32)
        flipped = img.reshape(2, 2).T.reshape(-1)
        inputs.append(img)
        targets.append(flipped)
    return torch.stack(inputs), torch.stack(targets)

X, Y = make_dataset()

# ── Fresh model, random weights ──
torch.manual_seed(42)
model = DiagonalFlipNet()
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.05)

# ── Train 500 epochs on all 16 examples ──
for epoch in range(501):
    predictions = model(X)
    loss = torch.mean((predictions - Y) ** 2)

    if epoch in [0, 10, 50, 100, 200, 500]:
        pred = [round(v, 3)
                for v in predictions[11].tolist()]
        print(f"Epoch {epoch:<4} loss={loss.item():.6f}"
              f" sample={pred}")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# ── Final accuracy check ──
with torch.no_grad():
    preds = model(X)
    rounded = torch.round(preds).clamp(0, 1)
    correct = (rounded == Y).all(dim=1).sum()
    print(f"\nAccuracy: {correct}/16")
```

The script ends by printing how many of the 16 images are reproduced exactly after rounding. One caveat about that number: the outputs are an affine function of only 3 hidden activations, so every prediction lies in a 3-dimensional slice of the 4-dimensional output space, while the flip is a full-rank permutation of 4 pixels. With this 4 → 3 → 4 architecture at least one of the 16 images must therefore come out wrong, so expect a score below 16/16; a hidden layer of 4 or more neurons removes the bottleneck. The training procedure itself is unchanged: starting from random weights, the same loop applies: forward → loss → backward → update.
What the Network Learned
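The diagonal flip is a permutation: it swaps positions 1 and 2 while keeping positions 0 and 3 fixed. Mathematically, it's multiplication by a permutation matrix:

$$
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\qquad
P \, [p_{00},\, p_{01},\, p_{10},\, p_{11}]^\top = [p_{00},\, p_{10},\, p_{01},\, p_{11}]^\top
$$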
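The network approximates this permutation using its two layers and ReLU activation. It didn't learn the matrix directly: it learned a decomposition through a 3-neuron hidden layer. Because the permutation has rank 4 and the hidden layer has only 3 neurons, that decomposition can come close to the flip but cannot reproduce it exactly for all 16 inputs.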
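As a sanity check on the permutation view, the sketch below writes the matrix down explicitly (the name `P` is chosen here) and compares it with the reshape-transpose-flatten version of the flip on all 16 images. It also prints two ranks, which show why a hidden layer narrower than 4 is a bottleneck for this task.

```python
import numpy as np

# Permutation matrix that swaps positions 1 and 2 and keeps 0 and 3
P = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=float)

# All 16 binary images, one per row
X = np.array([[(i >> k) & 1 for k in (3, 2, 1, 0)] for i in range(16)], dtype=float)

flip_by_matrix = X @ P.T                                             # each row x -> P @ x
flip_by_reshape = X.reshape(16, 2, 2).transpose(0, 2, 1).reshape(16, 4)

print(np.array_equal(flip_by_matrix, flip_by_reshape))               # True

# The flip is full rank, and the 16 targets (after centering) span all 4 dimensions.
# A 3-neuron hidden layer can only produce outputs spanning at most 3.
rank_P = np.linalg.matrix_rank(P)
rank_targets = np.linalg.matrix_rank(flip_by_reshape - flip_by_reshape.mean(axis=0))
print(rank_P, rank_targets)
```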
| Input | Network Output | Target | Match? |
|---|---|---|---|
| [0, 0, 0, 0] | [0.00, 0.00, 0.00, 0.00] | [0, 0, 0, 0] | ✓ |
| [0, 1, 0, 0] | [0.00, 0.00, 1.00, 0.00] | [0, 0, 1, 0] | ✓ |
| [1, 0, 1, 1] | [1.00, 1.00, 0.00, 1.00] | [1, 1, 0, 1] | ✓ |
| [1, 1, 1, 1] | [1.00, 1.00, 1.00, 1.00] | [1, 1, 1, 1] | ✓ |
Exercises
- Different learning rates. Try η=0.01, η=0.5, and η=2.0 in the 200-step NumPy loop. How does convergence speed change? Does η=2.0 diverge?
- Hidden layer size. Change the DiagonalFlipNet to use 2 hidden neurons instead of 3, then try 4. How does the accuracy on the 16 images change? Why? (Hint: think about how many independent operations the flip requires.)
- Different transformation. Instead of diagonal flip (transpose), try horizontal flip: [p00,p01,p10,p11]→[p01,p00,p11,p10]. Modify make_dataset() and retrain. How fast does the network learn this?
- Numerical gradient check. For one weight (e.g., W1[0][2]), compute the gradient numerically: run two forward passes with w ± ε (ε = 0.0001) and compare (L(w+ε) − L(w−ε)) / (2ε) with the backprop gradient. They should match to 4+ decimal places.
Chapter Summary
In this chapter, you traced the complete learning process of a neural network:
- Gradient descent adjusts weights in the direction that reduces the loss
- Backpropagation uses the chain rule to compute all gradients in one backward pass
- Dead neurons (zero after ReLU) block gradient flow — they can't learn from that example
- Weight updates are small steps: w=w−η⋅grad
- After one step, loss dropped 19%. After 200 steps, it reached essentially zero.
- PyTorch autograd computes identical gradients to hand calculations — loss.backward() replaces 7 lines of math
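- Training on all 16 images for 500 epochs uses the same loop on a batch; a 3-neuron hidden layer cannot fit all 16 flips exactly, which takes at least 4 hidden neurons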
You now understand the complete deep learning loop at the level of individual multiplications. Everything else — CNNs, RNNs, Transformers, GPT — is built on this exact foundation. The architectures change, the loss functions change, the data changes. But forward pass → loss → backward pass → update is always the core.