Chapter 8

Computing Gradients Step by Step

Backpropagation

Learning Objectives

By the end of this section, you will be able to:

  1. Apply the weight update rule to every parameter using the gradients from Section 2
  2. See every parameter's old and new value side by side
  3. Run a forward pass with the updated weights and verify the loss decreased

The Update Rule

For every parameter $w$, we apply:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$$

With learning rate $\eta = 0.1$. Let's update every parameter that has a non-zero gradient.
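As a minimal sketch, the rule fits in a one-line Python function. The name `sgd_update` is an illustrative choice, not something defined earlier in this chapter; the sample numbers are the old value and gradient of W₂[2][0] from this section.

```python
def sgd_update(w_old, grad, eta=0.1):
    """Return the parameter value after one gradient descent step."""
    return w_old - eta * grad


# W2[2][0]: old value 0.2, gradient -0.225
w_new = sgd_update(0.2, -0.225)
print(round(w_new, 4))  # 0.2225
```

A negative gradient makes the subtraction add to the weight, which is why every weight with a negative gradient ends up larger.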


Updating $W_2$ (Output Weights)

Only the third row of $W_2$ has non-zero gradients (from hidden neuron 2, the only alive neuron):

| Weight | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| W₂[2][0] | 0.2000 | −0.2250 | −0.0225 | 0.2225 |
| W₂[2][1] | −0.4000 | −0.2750 | −0.0275 | −0.3725 |
| W₂[2][2] | 0.1000 | −0.0125 | −0.00125 | 0.10125 |
| W₂[2][3] | −0.5000 | −0.3125 | −0.03125 | −0.46875 |

Notice the direction: all four gradients are negative, so all four weights increase. The loss is telling us: "the output values are too low—increase the weights that feed into them from the alive hidden neuron."

Largest change: $W_2[2][3]$ changed the most because $\hat{y}_3$ had the largest error (predicted −0.25, target was 1).
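The same row update can be checked with NumPy. This is a small sketch using only the four old values and gradients listed for row 2; the variable names are assumptions made for this example.

```python
import numpy as np

eta = 0.1
w2_row2_old = np.array([0.2, -0.4, 0.1, -0.5])            # W2[2][0..3]
grad_row2 = np.array([-0.225, -0.275, -0.0125, -0.3125])  # dL/dW2[2][0..3]

# One gradient descent step on the whole row at once
w2_row2_new = w2_row2_old - eta * grad_row2
change = w2_row2_new - w2_row2_old

# Index of the weight that moved the most
largest = int(np.argmax(np.abs(change)))

print(w2_row2_new)
print("largest change at column", largest)
```

Every entry of `change` is positive, and the largest one sits at column 3, the output with the biggest error.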

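Updating $b_2$ (Output Biases)

| Bias | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| b₂[0] | 0.0000 | −0.4500 | −0.0450 | 0.0450 |
| b₂[1] | 0.1000 | −0.5500 | −0.0550 | 0.1550 |
| b₂[2] | −0.1000 | −0.0250 | −0.0025 | −0.0975 |
| b₂[3] | 0.0000 | −0.6250 | −0.0625 | 0.0625 |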

All biases increase. The network is learning to shift all outputs upward, since all four predictions were below their targets.


Updating $W_1$ (Hidden Weights)

Only column 2 (hidden neuron 2) has non-zero gradients, and row 1 is zero because $x_1 = 0$:

| Weight | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| W₁[0][2] | 0.1000 | 0.4400 | 0.0440 | 0.0560 |
| W₁[1][2] | −0.2000 | 0.0000 | 0.0000 | −0.2000 |
| W₁[2][2] | 0.5000 | 0.4400 | 0.0440 | 0.4560 |
| W₁[3][2] | −0.1000 | 0.4400 | 0.0440 | −0.1440 |

The positive gradient (0.44) means increasing these weights would increase the loss. So we decrease them. The network is learning to reduce the signal flowing through hidden neuron 2—because the current signal produces too-negative outputs.
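The zero pattern in this gradient follows from its structure: each entry is an input value times the gradient at a hidden pre-activation. The sketch below assumes the pre-activation gradient equals the bias gradient [0, 0, 0.44] (the name `dL_dz1` is chosen for this example) and rebuilds the full matrix as an outer product.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])    # input used throughout this section
dL_dz1 = np.array([0.0, 0.0, 0.44])   # assumed equal to dL/db1; neurons 0 and 1 are dead

# dL/dW1[i][j] = x[i] * dL/dz1[j]
dL_dW1 = np.outer(x, dL_dz1)
print(dL_dW1)

# One update step on column 2 (hidden neuron 2)
w1_col2_old = np.array([0.1, -0.2, 0.5, -0.1])
w1_col2_new = w1_col2_old - 0.1 * dL_dW1[:, 2]
print(w1_col2_new)
```

Columns 0 and 1 vanish because their neurons are dead, and row 1 vanishes because `x[1]` is 0, so only three weights actually move.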


Updating $b_1$ (Hidden Biases)

| Bias | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| b₁[0] | 0.1000 | 0.0000 | 0.0000 | 0.1000 |
| b₁[1] | −0.1000 | 0.0000 | 0.0000 | −0.1000 |
| b₁[2] | 0.0000 | 0.4400 | 0.0440 | −0.0440 |

Complete Before → After Table

Here's every parameter that changed, all in one place:

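| Parameter | Before | After | Change |
|---|---|---|---|
| W₂[2][0] | 0.2000 | 0.2225 | +0.0225 |
| W₂[2][1] | −0.4000 | −0.3725 | +0.0275 |
| W₂[2][2] | 0.1000 | 0.1013 | +0.0013 |
| W₂[2][3] | −0.5000 | −0.4688 | +0.0313 |
| b₂[0] | 0.0000 | 0.0450 | +0.0450 |
| b₂[1] | 0.1000 | 0.1550 | +0.0550 |
| b₂[2] | −0.1000 | −0.0975 | +0.0025 |
| b₂[3] | 0.0000 | 0.0625 | +0.0625 |
| W₁[0][2] | 0.1000 | 0.0560 | −0.0440 |
| W₁[2][2] | 0.5000 | 0.4560 | −0.0440 |
| W₁[3][2] | −0.1000 | −0.1440 | −0.0440 |
| b₁[2] | 0.0000 | −0.0440 | −0.0440 |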

19 parameters unchanged (gradient was zero). 12 parameters updated, out of 31 in total. The changes are small; that's the learning rate doing its job. Each step makes a tiny adjustment.
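The tally of updated versus untouched parameters can be confirmed by counting non-zero gradient entries. The sketch below retypes the four gradient arrays from this section; a parameter changes exactly when its gradient is non-zero.

```python
import numpy as np

dL_dW1 = np.array([[0.0, 0.0, 0.44],
                   [0.0, 0.0, 0.00],
                   [0.0, 0.0, 0.44],
                   [0.0, 0.0, 0.44]])
dL_db1 = np.array([0.0, 0.0, 0.44])
dL_dW2 = np.array([[0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [-0.225, -0.275, -0.0125, -0.3125]])
dL_db2 = np.array([-0.45, -0.55, -0.025, -0.625])

grads = [dL_dW1, dL_db1, dL_dW2, dL_db2]
total = sum(g.size for g in grads)
updated = sum(int(np.count_nonzero(g)) for g in grads)
unchanged = total - updated

print(total, updated, unchanged)  # 31 12 19
```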


Python: Applying Updates

Let's implement the weight update rule in Python and verify the loss drops. This code picks up where Section 2's backprop code left off: all gradient variables (`dL_dW2`, `dL_db1`, etc.) are already computed.

Weight Updates — NumPy Implementation
🐍weight_update.py
import numpy as np

NumPy provides fast array math. We use @ for matrix multiply, np.round() for clean output, np.maximum() for ReLU, and np.mean() for MSE loss.

EXECUTION STATE
numpy = Numerical computing library — ndarray, linear algebra, element-wise ops
# ── Gradients computed in Section 2 ──

All variables from Section 2’s backprop code are still in scope: the weight matrices (W1, W2), biases (b1, b2), input (x), target, forward pass values (y_hat, loss), and all four gradient arrays (dL_dW1, dL_db1, dL_dW2, dL_db2). We pick up right where that code left off.

EXECUTION STATE
dL_dW2 =
      y0       y1       y2       y3
h0  0.000    0.000    0.000    0.000
h1  0.000    0.000    0.000    0.000
h2 -0.225   -0.275   -0.0125  -0.3125
dL_db2 = [-0.45, -0.55, -0.025, -0.625]
dL_dW1 =
      h0    h1    h2
x0  0.00  0.00  0.44
x1  0.00  0.00  0.00
x2  0.00  0.00  0.44
x3  0.00  0.00  0.44
dL_db1 = [0.0, 0.0, 0.44]
eta = 0.1  # Learning rate

The learning rate controls step size. Too large (e.g. 1.0) and the network overshoots, bouncing around the minimum. Too small (e.g. 0.0001) and training takes forever. 0.1 is a common starting point for small networks.

EXECUTION STATE
η (eta) = 0.1: each weight moves by exactly 0.1 × its gradient. A conservative step.
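To see these three regimes in isolation, here is a toy sketch on the one-dimensional loss f(w) = w², whose gradient is 2w. This toy problem and the helper `run_gd` are assumptions for illustration only and are not part of the network in this chapter.

```python
def run_gd(eta, steps=10, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - eta * grad    # same update rule as in this section
    return w


for eta in (1.0, 0.1, 0.0001):
    print(f"eta={eta}: w after 10 steps = {run_gd(eta):.4f}")
```

With eta = 1.0 the value flips between +1 and −1 and never approaches the minimum at 0; with eta = 0.1 it shrinks by a factor of 0.8 per step; with eta = 0.0001 it has barely moved after ten steps.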
W2_new = W2 - eta * dL_dW2  # Update output weights

The gradient descent update rule: w_new = w_old - η × gradient. Negative gradient means the weight should INCREASE (move in the direction that reduces loss). Only row 2 changes — rows 0,1 have zero gradients (dead neurons).

EXECUTION STATE
── Only row 2 changes ── =
W2[2][0] = 0.2000 - 0.1×(-0.225) = 0.2000 + 0.0225 = 0.2225
W2[2][1] = -0.4000 - 0.1×(-0.275) = -0.4000 + 0.0275 = -0.3725
W2[2][2] = 0.1000 - 0.1×(-0.0125) = 0.1000 + 0.00125 = 0.10125
W2[2][3] = -0.5000 - 0.1×(-0.3125) = -0.5000 + 0.03125 = -0.46875
Direction = All gradients are negative → all weights increase → outputs will be larger (closer to targets)
b2_new = b2 - eta * dL_db2  # Update output biases

Output biases receive a gradient regardless of dead hidden neurons, because ∂L/∂b₂ depends only on the output error. All four gradients are negative, so all four biases increase: the network is learning to shift outputs upward.

EXECUTION STATE
b2[0] = 0.0000 - 0.1×(-0.45) = 0.0000 + 0.045 = 0.0450
b2[1] = 0.1000 - 0.1×(-0.55) = 0.1000 + 0.055 = 0.1550
b2[2] = -0.1000 - 0.1×(-0.025) = -0.1000 + 0.0025 = -0.0975
b2[3] = 0.0000 - 0.1×(-0.625) = 0.0000 + 0.0625 = 0.0625
W1_new = W1 - eta * dL_dW1  # Update hidden weights

Only column 2 (hidden neuron 2) has non-zero gradients. Row 1 stays zero because x[1] = 0. Positive gradient (0.44) means increasing these weights increases the loss, so we DECREASE them.

EXECUTION STATE
── Only column 2 changes ── =
W1[0][2] = 0.1000 - 0.1×(0.44) = 0.1000 - 0.044 = 0.0560
W1[1][2] = -0.2000 - 0.1×(0.00) = -0.2000 (unchanged, x[1]=0)
W1[2][2] = 0.5000 - 0.1×(0.44) = 0.5000 - 0.044 = 0.4560
W1[3][2] = -0.1000 - 0.1×(0.44) = -0.1000 - 0.044 = -0.1440
Direction = Positive gradient → weights decrease → hidden neuron 2 produces a smaller activation
b1_new = b1 - eta * dL_db1  # Update hidden biases

Only b1[2] has a non-zero gradient (0.44). Neurons 0 and 1 are dead, so their biases get zero gradient and don’t change.

EXECUTION STATE
b1[0] = 0.1000 - 0.1×(0.0) = 0.1000 (dead neuron)
b1[1] = -0.1000 - 0.1×(0.0) = -0.1000 (dead neuron)
b1[2] = 0.0000 - 0.1×(0.44) = 0.0000 - 0.044 = -0.0440
z1_new = np.round(x @ W1_new + b1_new, 10)  # New pre-activations

Forward pass with updated weights. Only hidden neuron 2 matters (neurons 0,1 are still dead). np.round(..., 10) avoids floating-point noise in the display.

EXECUTION STATE
📚 np.round(array, decimals) = Rounds each element to the given number of decimal places. 10 decimals keeps full precision while eliminating 1e-16 noise.
⬇ x @ W1_new + b1_new = neuron 2: (1)(0.056) + (0)(-0.2) + (1)(0.456) + (1)(-0.144) + (-0.044) = 0.324
⬆ z1_new = [-0.2, -0.5, 0.324] — neurons 0,1 still negative (dead)
h_new = np.maximum(0, z1_new)  # ReLU on new pre-activations

Apply ReLU: kill negative values, keep positives. Neuron 2 is still alive at 0.324 — lower than before (was 0.5). The gradient nudged it down because the too-large hidden activation was feeding too-negative outputs.

EXECUTION STATE
📚 np.maximum(0, array) = Element-wise max with 0 — the ReLU activation function
⬆ h_new = [0.0, 0.0, 0.324] — still alive, but lower (0.5 → 0.324)
y_hat_new = h_new @ W2_new + b2_new  # New predictions

Only h_new[2] = 0.324 is non-zero, so each output = (0.324)(W2_new[2][j]) + b2_new[j]. Both the weight and bias changed, pushing outputs closer to targets.

EXECUTION STATE
ŷ₀ = (0.324)(0.2225) + 0.045 = 0.072 + 0.045 = 0.117
ŷ₁ = (0.324)(-0.3725) + 0.155 = -0.121 + 0.155 = 0.034
ŷ₂ = (0.324)(0.10125) + (-0.0975) = 0.033 - 0.0975 = -0.065
ŷ₃ = (0.324)(-0.46875) + 0.0625 = -0.152 + 0.0625 = -0.089
⬆ y_hat_new = [0.117, 0.034, -0.065, -0.089]
loss_new = np.mean((y_hat_new - target) ** 2)  # New MSE loss

Recompute MSE with the new predictions. Each squared error: (0.117-1)²≈0.780, (0.034-1)²≈0.933, (-0.065-0)²≈0.004, (-0.089-1)²≈1.186. Sum ≈ 2.903, so the mean ≈ 0.7258.

EXECUTION STATE
📚 np.mean(array) = Arithmetic mean of all elements (sum / count)
⬆ loss_new = 0.7258, down from 0.8963 (−19.0%)
print(f"Old loss: {loss:.4f}")

Display the original loss for comparison.

EXECUTION STATE
⬆ output = Old loss: 0.8963
print(f"New loss: {loss_new:.4f}")

Display the loss after one gradient descent step.

EXECUTION STATE
⬆ output = New loss: 0.7258
print(f"Improvement: {(1 - loss_new/loss)*100:.1f}%")

Percentage reduction: (1 - 0.7258/0.8963) × 100 = 19.0%. One step cut the loss by nearly a fifth.

EXECUTION STATE
⬆ output = Improvement: 19.0%
print(f"Old prediction: {np.round(y_hat, 4)}")

The predictions before the weight update.

EXECUTION STATE
⬆ output = Old prediction: [ 0.1 -0.1 -0.05 -0.25]
print(f"New prediction: {np.round(y_hat_new, 4)}")

The predictions after the update. Three out of four moved closer to their targets.

EXECUTION STATE
⬆ output = New prediction: [ 0.1171  0.0343 -0.0647 -0.0894]
print(f"Target: {target}")

The goal: [1, 1, 0, 1]. We’re still far away, but one step moved us measurably closer.

EXECUTION STATE
⬆ output = Target: [1. 1. 0. 1.]
```python
import numpy as np

# ── Gradients computed in Section 2 ──
# (W1, b1, W2, b2, x, target, y_hat, loss, dL_dW1, dL_db1,
#  dL_dW2, dL_db2 are all already computed)

# ── Gradient descent update ──
eta = 0.1
W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1

# ── Verify: forward pass with new weights ──
z1_new = np.round(x @ W1_new + b1_new, 10)
h_new = np.maximum(0, z1_new)
y_hat_new = h_new @ W2_new + b2_new
loss_new = np.mean((y_hat_new - target) ** 2)

print(f"Old loss:       {loss:.4f}")
print(f"New loss:       {loss_new:.4f}")
print(f"Improvement:    {(1 - loss_new/loss)*100:.1f}%")
print(f"Old prediction: {np.round(y_hat, 4)}")
print(f"New prediction: {np.round(y_hat_new, 4)}")
print(f"Target:         {target}")
```

Verify: Forward Pass with New Weights

Let's run the same input through the network with the updated weights and see if things improved.

Layer 1 with new weights

Hidden neuron 2 (the only one that changed):

$$z_2^{\text{new}} = (1)(0.056) + (0)(-0.2) + (1)(0.456) + (1)(-0.144) + (-0.044)$$
$$= 0.056 + 0 + 0.456 - 0.144 - 0.044 = 0.324$$

After ReLU: $h_2^{\text{new}} = \max(0, 0.324) = 0.324$ (still alive, but lower than before: 0.5 → 0.324)

Layer 2 with new weights

$$\hat{y}_0^{\text{new}} = (0.324)(0.2225) + 0.045 = 0.072 + 0.045 = 0.117$$
$$\hat{y}_1^{\text{new}} = (0.324)(-0.3725) + 0.155 = -0.121 + 0.155 = 0.034$$
$$\hat{y}_2^{\text{new}} = (0.324)(0.1013) + (-0.0975) = 0.033 - 0.0975 = -0.065$$
$$\hat{y}_3^{\text{new}} = (0.324)(-0.4688) + 0.0625 = -0.152 + 0.0625 = -0.089$$
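The two layers above can be replayed in a few lines of NumPy. This sketch follows only the path through hidden neuron 2, relying on the section's statement that neurons 0 and 1 stay dead and contribute nothing; the variable names are chosen for this example.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])

# Updated parameters on the path through hidden neuron 2
w1_col2_new = np.array([0.056, -0.2, 0.456, -0.144])
b1_2_new = -0.044
w2_row2_new = np.array([0.2225, -0.3725, 0.10125, -0.46875])
b2_new = np.array([0.045, 0.155, -0.0975, 0.0625])

# Layer 1: pre-activation and ReLU for neuron 2
z2_new = float(x @ w1_col2_new + b1_2_new)
h2_new = max(0.0, z2_new)

# Layer 2: only neuron 2 is active, so each output is h2 * weight + bias
y_hat_new = h2_new * w2_row2_new + b2_new

print(round(z2_new, 3))
print(np.round(y_hat_new, 3))
```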

The Improvement

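| Output | Before | After | Target | Better? |
|---|---|---|---|---|
| ŷ₀ | 0.10 | 0.12 | 1 | ✅ Moved toward 1 |
| ŷ₁ | −0.10 | 0.03 | 1 | ✅ Moved toward 1 |
| ŷ₂ | −0.05 | −0.06 | 0 | ➖ Slightly worse |
| ŷ₃ | −0.25 | −0.09 | 1 | ✅ Moved toward 1 |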

The new MSE loss:

$$L^{\text{new}} = \frac{1}{4}\left[(0.117-1)^2 + (0.034-1)^2 + (-0.065-0)^2 + (-0.089-1)^2\right]$$
$$= \frac{1}{4}(0.780 + 0.933 + 0.004 + 1.186) = \frac{2.903}{4} = 0.726$$

| Metric | Before | After | Change |
|---|---|---|---|
| MSE Loss | 0.896 | 0.726 | −19.0% |
One step of gradient descent reduced the loss by 19%. The network is still far from perfect (loss 0.726 vs. target of 0.0), but it's measurably better after a single weight update. After hundreds of steps on all 16 training examples, the loss will approach zero.
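A short NumPy check of the before and after loss, as a sketch: `y_hat_new` holds the Layer 2 results carried without intermediate rounding, and the helper name `mse` is an assumption for this example.

```python
import numpy as np

target = np.array([1.0, 1.0, 0.0, 1.0])
y_hat_old = np.array([0.1, -0.1, -0.05, -0.25])
# Layer 2 outputs with the updated weights, unrounded
y_hat_new = np.array([0.11709, 0.03431, -0.064695, -0.089375])


def mse(pred, tgt):
    """Mean squared error over all outputs."""
    return float(np.mean((pred - tgt) ** 2))


loss_old = mse(y_hat_old, target)
loss_new = mse(y_hat_new, target)
improvement = (1 - loss_new / loss_old) * 100

print(f"{loss_old:.3f} -> {loss_new:.3f} ({improvement:.1f}% lower)")
# 0.896 -> 0.726 (19.0% lower)
```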
This is the full training loop: Forward pass → compute loss → backpropagate gradients → update weights → repeat. Every framework (PyTorch, TensorFlow, JAX) does exactly this. The only difference is they do it automatically, on GPUs, with millions of parameters instead of 31.
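The loop can be sketched end to end in NumPy for the single example used in this section. Several things here are assumptions: `W1` and `W2` are taken as the transposes of the weight matrices in this section's PyTorch listing, the backward pass is a compact restatement of the chain rule for this two-layer ReLU network rather than the step-by-step code from Section 2, and 50 steps is an arbitrary choice.

```python
import numpy as np

# Assumed initial parameters (transposes of the PyTorch listing's weights)
W1 = np.array([[ 0.2, -0.5,  0.1],
               [-0.3,  0.4, -0.2],
               [ 0.1,  0.3,  0.5],
               [-0.4,  0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.3, -0.2,  0.4,  0.1],
               [-0.1,  0.5, -0.3,  0.2],
               [ 0.2, -0.4,  0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
target = np.array([1.0, 1.0, 0.0, 1.0])
eta = 0.1

losses = []
for step in range(50):
    # Forward pass (rounding as in the NumPy listing to suppress float noise)
    z1 = np.round(x @ W1 + b1, 10)
    h = np.maximum(0, z1)
    y_hat = h @ W2 + b2
    losses.append(float(np.mean((y_hat - target) ** 2)))

    # Backward pass: chain rule from the loss back to every parameter
    dL_dy = 2 * (y_hat - target) / y_hat.size
    dL_dW2 = np.outer(h, dL_dy)
    dL_db2 = dL_dy
    dL_dz1 = (W2 @ dL_dy) * (z1 > 0)   # ReLU gate: dead neurons pass no gradient
    dL_dW1 = np.outer(x, dL_dz1)
    dL_db1 = dL_dz1

    # Update every parameter with the same rule
    W2 -= eta * dL_dW2
    b2 -= eta * dL_db2
    W1 -= eta * dL_dW1
    b1 -= eta * dL_db1

print(f"step 0: {losses[0]:.4f}, step 1: {losses[1]:.4f}, step 49: {losses[-1]:.4f}")
```

Under these assumptions the first two recorded losses correspond to the before and after values of the single update worked through in this section, and later steps keep pushing the loss down.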

PyTorch: One Line Does It All

Everything we did by hand (forward pass, backpropagation, weight updates) PyTorch does in three lines: `optimizer.zero_grad()` clears old gradients, `loss.backward()` computes all gradients, and `optimizer.step()` applies the updates. The numbers match our hand calculations.

Backpropagation + Updates — PyTorch
🐍backprop_pytorch.py
import torch

PyTorch’s core library. Provides tensors (GPU-accelerated arrays), automatic differentiation, and the computation graph that makes .backward() possible.

EXECUTION STATE
📚 torch = Core tensor library — like NumPy but with autograd (automatic gradient computation) and GPU support
import torch.nn as nn

Neural network building blocks: layers (Linear, Conv2d), loss functions (MSELoss, CrossEntropyLoss), and the Module base class that all models inherit from.

EXECUTION STATE
📚 torch.nn = Neural network module — contains nn.Module, nn.Linear, nn.ReLU, loss functions, etc.
import torch.optim as optim

Optimization algorithms that update weights using gradients. SGD, Adam, RMSprop, and many others — all share the same .zero_grad() / .step() interface.

EXECUTION STATE
📚 torch.optim = Optimizer library — SGD, Adam, RMSprop, Adagrad, etc. All implement the same API.
import torch.nn.functional as F

Functional versions of layers and activations. F.relu(x) applies ReLU without creating a persistent layer object — useful in forward() methods.

EXECUTION STATE
📚 torch.nn.functional = Stateless functions: relu, softmax, cross_entropy, etc. Same math as nn.ReLU() but without storing state.
class DiagonalFlipNet(nn.Module):  # Same model from Chapter 7 Section 3

This is identical to the class we defined in Chapter 7 Section 3. nn.Module is the base class for all PyTorch models. It tracks parameters, enables .to(device) for GPU, and defines the forward() interface.

EXECUTION STATE
📚 nn.Module = Base class for all neural network modules. Provides parameter tracking, serialization, and the forward() contract.
self.layer1 = nn.Linear(4, 3)  # Hidden layer

Creates a fully-connected layer: 4 inputs → 3 outputs. Internally stores a (3×4) weight matrix and a (3,) bias vector — note the transposed convention (out_features, in_features).

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = y = x @ W.T + b. Weight shape: (out, in) = (3, 4). Bias shape: (3,).
⬇ in_features = 4 — our input has 4 elements
⬇ out_features = 3 — we want 3 hidden neurons
self.layer2 = nn.Linear(3, 4)  # Output layer

Output layer: 3 hidden neurons → 4 outputs. Weight shape: (4, 3). Bias shape: (4,).

EXECUTION STATE
⬇ in_features = 3 — matches layer1’s output size
⬇ out_features = 4 — our target has 4 elements
def forward(self, x):  # Define the computation

PyTorch calls this when you do model(x). It defines the computation graph: multiply by W1, add b1, apply ReLU, multiply by W2, add b2. Every operation is recorded for backpropagation.

EXECUTION STATE
⬇ x = Input tensor — shape (4,) for a single example
h = F.relu(self.layer1(x))  # Hidden layer + ReLU

self.layer1(x) computes x @ W1.T + b1 (the pre-activation z1). F.relu() applies max(0, z) element-wise. Both operations are recorded in the computation graph.

EXECUTION STATE
📚 F.relu(input) = Element-wise max(0, x). Same as nn.ReLU() but as a function, not a module.
⬆ h = [0.0, 0.0, 0.5] — neurons 0,1 dead; neuron 2 alive
return self.layer2(h)  # Output layer (no activation)

Passes the hidden activations through the output layer: h @ W2.T + b2. No activation function on the output — this is a regression network.

EXECUTION STATE
⬆ return value = [0.1, -0.1, -0.05, -0.25] — same as our hand calculation
model = DiagonalFlipNet()  # Instantiate the model

Creates the network with randomly initialized weights. We’ll immediately overwrite them with our specific values so the results match the hand calculations.

EXECUTION STATE
⬆ model = DiagonalFlipNet with 31 parameters: layer1.weight(3×4) + layer1.bias(3) + layer2.weight(4×3) + layer2.bias(4)
with torch.no_grad():  # Disable gradient tracking

We’re manually setting weights, not training. torch.no_grad() tells PyTorch not to record these operations in the computation graph — saves memory and avoids corrupting future gradients.

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd. Operations inside won’t be tracked for backpropagation.
model.layer1.weight.copy_(...)  # Load W1 (transposed!)

PyTorch stores weights as (out_features, in_features) = (3, 4), which is the TRANSPOSE of our W1 (4×3). So row 0 here is hidden neuron 0’s incoming weights — the same as COLUMN 0 of our W1.

EXECUTION STATE
📚 .copy_(tensor) = In-place copy. The underscore suffix is PyTorch convention for in-place operations.
⬆ layer1.weight =
       x0     x1     x2     x3
h0   0.20  -0.30   0.10  -0.40
h1  -0.50   0.40   0.30   0.20
h2   0.10  -0.20   0.50  -0.10
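The transposed storage convention can be checked directly. The sketch below assumes PyTorch is installed and that our W1 is the 4×3 transpose of the matrix shown above; it confirms that `nn.Linear` with the transposed weights gives the same result as the NumPy-style expression `x @ W1 + b1`.

```python
import torch
import torch.nn as nn

# Our convention: W1 has shape (in_features, out_features) = (4, 3)
W1 = torch.tensor([[ 0.2, -0.5,  0.1],
                   [-0.3,  0.4, -0.2],
                   [ 0.1,  0.3,  0.5],
                   [-0.4,  0.2, -0.1]])
b1 = torch.tensor([0.1, -0.1, 0.0])
x = torch.tensor([1.0, 0.0, 1.0, 1.0])

layer = nn.Linear(4, 3)
with torch.no_grad():
    layer.weight.copy_(W1.T)   # PyTorch stores (out_features, in_features) = (3, 4)
    layer.bias.copy_(b1)
    out = layer(x)

manual = x @ W1 + b1           # same computation in our convention
print(tuple(layer.weight.shape))
print(torch.allclose(out, manual, atol=1e-6))
```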
model.layer1.bias.copy_(...)  # Load b1

Set the hidden layer biases to our hand-calculated values.

EXECUTION STATE
⬆ layer1.bias = [0.1, -0.1, 0.0] — one per hidden neuron
model.layer2.weight.copy_(...)  # Load W2 (transposed!)

Again transposed: (4, 3) in PyTorch vs. our (3, 4). Row 0 here is output neuron 0’s incoming weights = column 0 of our W2.

EXECUTION STATE
⬆ layer2.weight =
       h0     h1     h2
y0   0.30  -0.10   0.20
y1  -0.20   0.50  -0.40
y2   0.40  -0.30   0.10
y3   0.10   0.20  -0.50
model.layer2.bias.copy_(...)  # Load b2

Set the output layer biases.

EXECUTION STATE
⬆ layer2.bias = [0.0, 0.1, -0.1, 0.0] — one per output
x = torch.tensor([1.0, 0.0, 1.0, 1.0])  # Input

Same input as our hand calculations: x = [1, 0, 1, 1]. Writing the values as floats produces a float32 tensor, which matches the dtype of the layer weights.

EXECUTION STATE
⬆ x = tensor([1., 0., 1., 1.]), the input to the diagonal flip network
target = torch.tensor([1.0, 1.0, 0.0, 1.0])  # Target

The desired output. MSE loss will measure how far the model’s prediction is from this.

EXECUTION STATE
⬆ target = tensor([1., 1., 0., 1.])
y_hat = model(x)  # Forward pass

Calls model.forward(x) and records every operation in the computation graph. The graph connects x → W1 multiply → bias add → ReLU → W2 multiply → bias add → y_hat.

EXECUTION STATE
📚 model(x) = Calls __call__ which invokes forward(x) plus hooks. Returns a tensor with grad_fn attached.
⬆ y_hat = tensor([ 0.1000, -0.1000, -0.0500, -0.2500]) — matches hand calculation exactly
loss = torch.mean((y_hat - target) ** 2)  # MSE loss

Mean Squared Error: average of (prediction - target)² across all 4 outputs. This is the scalar we’ll differentiate with respect to every parameter.

EXECUTION STATE
⬆ loss = tensor(0.8963) — same as our hand calculation
print(f"Loss before: {loss.item():.4f}")

.item() extracts a Python float from a 0-dimensional tensor.

EXECUTION STATE
📚 .item() = Converts a single-element tensor to a Python number. Required for printing/logging.
⬆ output = Loss before: 0.8963
optimizer = optim.SGD(model.parameters(), lr=0.1)

Creates a Stochastic Gradient Descent optimizer. It will manage weight updates for ALL model parameters using the learning rate we specify.

EXECUTION STATE
📚 optim.SGD(params, lr) = SGD optimizer. Given gradients, updates each parameter: p_new = p - lr × p.grad. ‘Stochastic’ because in practice we use random mini-batches, not the full dataset.
⬇ model.parameters() = Generator yielding all 31 learnable parameters: layer1.weight(3×4), layer1.bias(3), layer2.weight(4×3), layer2.bias(4)
⬇ lr=0.1 = Learning rate — same η=0.1 we used in the hand calculation
optimizer.zero_grad()  # Clear old gradients

Reset all parameter gradients. PyTorch ACCUMULATES gradients by default (each .backward() ADDS to existing .grad). If you skip this, gradients from the previous step would contaminate the current step.

EXECUTION STATE
📚 .zero_grad() = Clears .grad for every parameter (recent PyTorch versions set it to None by default instead of filling it with zeros). Call it before each new .backward(). Forgetting it is a common bug: gradients from earlier steps keep adding up.
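The accumulation behavior is easy to observe on a single scalar. This is a minimal sketch, assuming PyTorch is installed; it uses a standalone tensor `w` and f(w) = w² instead of the model above, and clears the gradient with the tensor-level `w.grad.zero_()`.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()    # d(w^2)/dw = 2w = 4.0

(w ** 2).backward()      # no reset: the new gradient is added to the old one
second = w.grad.item()   # 4.0 + 4.0 = 8.0

w.grad.zero_()           # manual reset, the job optimizer.zero_grad() does for a model
(w ** 2).backward()
third = w.grad.item()    # back to 4.0

print(first, second, third)  # 4.0 8.0 4.0
```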
loss.backward()  # ALL gradients in one call

THIS SINGLE LINE computes ALL 31 gradients we spent Section 2 calculating by hand. PyTorch recorded every operation during the forward pass (building a ‘computation graph’), then walks it backward applying the chain rule automatically. Every multiplication, every ReLU gate, every bias addition: all differentiated in reverse order.

EXECUTION STATE
📚 .backward() = Backpropagation: traverse the computation graph from loss to inputs, computing ∂loss/∂param for every learnable parameter. Results stored in param.grad for each parameter.
What it computes =
layer1.weight.grad → dL/dW1 (the 4×3 matrix from Step 6, stored transposed as 3×4)
layer1.bias.grad → dL/db1 (same [0, 0, 0.44] from Step 7)
layer2.weight.grad → dL/dW2 (the 3×4 matrix from Step 2, stored transposed as 4×3)
layer2.bias.grad → dL/db2 (same [-0.45, -0.55, -0.025, -0.625] from Step 3)
⬆ result = All .grad attributes populated: 31 gradient values, identical to our hand calculations
optimizer.step()  # Apply all weight updates

Applies w_new = w - lr × w.grad to EVERY parameter in one call. This single line replaces ALL of Section 3’s manual weight update tables.

EXECUTION STATE
📚 .step() = For SGD: param.data -= lr × param.grad for each parameter. Other optimizers (Adam, RMSprop) use more complex rules but the interface is the same.
⬆ result = All 31 parameters updated. 12 actually change (non-zero gradient), 19 stay the same (zero gradient).
y_hat_new = model(x)  # Forward pass with updated weights

Run the same input through the now-updated network. The weights have changed, so the predictions should be closer to the targets.

EXECUTION STATE
⬆ y_hat_new = tensor([ 0.117, 0.034, -0.065, -0.089]) — closer to [1, 1, 0, 1]
loss_new = torch.mean((y_hat_new - target) ** 2)  # New loss

Recompute MSE with the updated predictions.

EXECUTION STATE
⬆ loss_new = tensor(0.7258), down from 0.8963
print(f"Loss after: {loss_new.item():.4f}")

Display the post-update loss.

EXECUTION STATE
⬆ output = Loss after: 0.7258
print(f"Improvement: {(1-loss_new.item()/loss.item())*100:.1f}%")

Same 19.0% improvement as our NumPy code, confirming that PyTorch’s autograd agrees with the hand calculation.

EXECUTION STATE
⬆ output = Improvement: 19.0%
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        h = F.relu(self.layer1(x))
        return self.layer2(h)

# ── Create model and load our weights ──
model = DiagonalFlipNet()
with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1]]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5]]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

# ── Forward pass + loss ──
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
y_hat = model(x)
loss = torch.mean((y_hat - target) ** 2)
print(f"Loss before: {loss.item():.4f}")

# ── Backprop + update (three lines!) ──
optimizer = optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# ── Verify ──
y_hat_new = model(x)
loss_new = torch.mean((y_hat_new - target) ** 2)
print(f"Loss after:  {loss_new.item():.4f}")
print(f"Improvement: {(1-loss_new.item()/loss.item())*100:.1f}%")
```
Three lines replaced two sections of math. `loss.backward()` = all of Section 2. `optimizer.step()` = all of Section 3. But now you know exactly what those lines do under the hood: every multiplication, every gradient gate, every weight nudge.

Summary

  1. Update rule: $w_{\text{new}} = w_{\text{old}} - 0.1 \times \text{gradient}$
  2. 12 of 31 parameters changed. Dead neurons and zero inputs blocked the rest.
  3. Output layer biases got the biggest push—they always learn because they don't depend on dead neurons.
  4. Loss dropped 19% after one update: 0.896 → 0.726
  5. Predictions moved toward targets for 3 out of 4 outputs.

In the next section, we'll implement all of this in PyTorch, verify our hand calculations match, and train the network to convergence.
