Learning Objectives
By the end of this section, you will be able to:
- Apply the weight update rule to every parameter using the gradients from Section 2
- See every parameter's old and new value side by side
- Run a forward pass with the updated weights and verify the loss decreased
The Update Rule
For every parameter $\theta$, we apply:
$$\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}$$
with learning rate $\eta = 0.1$. Let's update every parameter that has a non-zero gradient.
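As a minimal sketch, the rule fits in a one-line helper. The name `sgd_update` is an illustrative choice rather than anything from Section 2, and the default learning rate of 0.1 is the value implied by the η × grad columns in the tables that follow.

```python
def sgd_update(param, grad, lr=0.1):
    """Return the parameter value after one gradient-descent step."""
    return param - lr * grad

# Example: W2[2][0] has old value 0.2 and gradient -0.225.
new_value = sgd_update(0.2, -0.225)
print(round(new_value, 4))  # 0.2225
```

A negative gradient makes the parameter larger and a positive gradient makes it smaller, which is the pattern to watch for in each table.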
Updating W₂ (Output Weights)
Only the third row of W₂ has non-zero gradients (from hidden neuron 2, the only alive neuron):
| Weight | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| W₂[2][0] | 0.2000 | −0.2250 | −0.0225 | 0.2225 |
| W₂[2][1] | −0.4000 | −0.2750 | −0.0275 | −0.3725 |
| W₂[2][2] | 0.1000 | −0.0125 | −0.00125 | 0.10125 |
| W₂[2][3] | −0.5000 | −0.3125 | −0.03125 | −0.46875 |
Notice the direction: all four gradients are negative, so all four weights increase. The loss is telling us: "the output values are too low—increase the weights that feed into them from the alive hidden neuron."
Updating b₂ (Output Biases)
| Bias | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| b₂[0] | 0.0000 | −0.4500 | −0.0450 | 0.0450 |
| b₂[1] | 0.1000 | −0.5500 | −0.0550 | 0.1550 |
| b₂[2] | −0.1000 | −0.0250 | −0.0025 | −0.0975 |
| b₂[3] | 0.0000 | −0.6250 | −0.0625 | 0.0625 |
All biases increase. The network is learning to shift all outputs upward, since all four predictions were below their targets.
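Why is every output-layer gradient negative? Here's a rough sketch that recomputes them from this example's predictions and targets. It assumes the loss is the mean of the four squared errors (consistent with the 0.896 reported later in this section) and that hidden neuron 2's activation is 0.5; the variable names are illustrative.

```python
y_hat = [0.10, -0.10, -0.05, -0.25]   # predictions before the update
target = [1.0, 1.0, 0.0, 1.0]
h2 = 0.5                              # activation of hidden neuron 2 (the alive one)

n = len(y_hat)
# MSE = mean((y_hat - target)^2), so dL/dy_hat_j = 2 * (y_hat_j - target_j) / n.
# Each output bias adds directly to its output, so it shares that gradient.
grad_b2 = [2 * (p - t) / n for p, t in zip(y_hat, target)]
# Each weight W2[2][j] is multiplied by h2, so its gradient is h2 times the bias gradient.
grad_W2_row2 = [h2 * g for g in grad_b2]

print([round(g, 4) for g in grad_b2])       # [-0.45, -0.55, -0.025, -0.625]
print([round(g, 4) for g in grad_W2_row2])  # [-0.225, -0.275, -0.0125, -0.3125]
```

Every prediction sits below its target, so every difference is negative, and so are all eight gradients.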
Updating W₁ (Hidden Weights)
Only column 2 (hidden neuron 2) has non-zero gradients, and row 1 is zero because x₁ = 0:
| Weight | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| W₁[0][2] | 0.1000 | 0.4400 | 0.0440 | 0.0560 |
| W₁[1][2] | −0.2000 | 0.0000 | 0.0000 | −0.2000 |
| W₁[2][2] | 0.5000 | 0.4400 | 0.0440 | 0.4560 |
| W₁[3][2] | −0.1000 | 0.4400 | 0.0440 | −0.1440 |
The positive gradient (0.44) means increasing these weights would increase the loss, so we decrease them. The network is learning to reduce the signal flowing through hidden neuron 2, because that signal currently pushes ŷ₁ and ŷ₃ further below their targets through the negative weights W₂[2][1] and W₂[2][3].
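To see where 0.44 comes from, here's a rough sketch of the backward step through hidden neuron 2. It assumes the input is x = [1, 0, 1, 1] (inferred from which rows have a zero gradient) and reuses the output gradients from the bias table; the names are illustrative, not Section 2's.

```python
lr = 0.1
x = [1.0, 0.0, 1.0, 1.0]                   # assumed input (row 1 is the zero entry)
W2_row2 = [0.2, -0.4, 0.1, -0.5]           # old outgoing weights of hidden neuron 2
grad_out = [-0.45, -0.55, -0.025, -0.625]  # dL/dy_hat for each output

# Gradient arriving at hidden neuron 2. Its ReLU derivative is 1 because
# the neuron is alive, so the sum passes through unchanged.
grad_h2 = sum(w * g for w, g in zip(W2_row2, grad_out))

# Each weight W1[i][2] is multiplied by x[i], so its gradient is x[i] * grad_h2.
grad_W1_col2 = [xi * grad_h2 for xi in x]

W1_col2 = [0.1, -0.2, 0.5, -0.1]
new_W1_col2 = [w - lr * g for w, g in zip(W1_col2, grad_W1_col2)]

print(round(grad_h2, 4))                     # 0.44
print([round(w, 4) for w in new_W1_col2])    # [0.056, -0.2, 0.456, -0.144]
```

The two positive terms, (−0.4)(−0.55) = 0.22 and (−0.5)(−0.625) = 0.3125, dominate the sum, which is why the gradient comes out positive.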
Updating b₁ (Hidden Biases)
| Bias | Old value | Gradient | η × grad | New value |
|---|---|---|---|---|
| b₁[0] | 0.1000 | 0.0000 | 0.0000 | 0.1000 |
| b₁[1] | −0.1000 | 0.0000 | 0.0000 | −0.1000 |
| b₁[2] | 0.0000 | 0.4400 | 0.0440 | −0.0440 |
Complete Before → After Table
Here's every parameter that changed, all in one place:
| Parameter | Before | After | Change |
|---|---|---|---|
| W₂[2][0] | 0.2000 | 0.2225 | +0.0225 |
| W₂[2][1] | −0.4000 | −0.3725 | +0.0275 |
| W₂[2][2] | 0.1000 | 0.1013 | +0.0013 |
| W₂[2][3] | −0.5000 | −0.4688 | +0.0313 |
| b₂[0] | 0.0000 | 0.0450 | +0.0450 |
| b₂[1] | 0.1000 | 0.1550 | +0.0550 |
| b₂[2] | −0.1000 | −0.0975 | +0.0025 |
| b₂[3] | 0.0000 | 0.0625 | +0.0625 |
| W₁[0][2] | 0.1000 | 0.0560 | −0.0440 |
| W₁[2][2] | 0.5000 | 0.4560 | −0.0440 |
| W₁[3][2] | −0.1000 | −0.1440 | −0.0440 |
| b₁[2] | 0.0000 | −0.0440 | −0.0440 |
Of the network's 31 parameters, 12 were updated and 19 stayed unchanged because their gradient was zero. The changes are small: that's the learning rate doing its job. Each step makes a tiny adjustment.
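A quick count makes the bookkeeping explicit. This sketch assumes the 4-3-4 layer sizes implied by the index ranges in the tables (W₁ is 4×3, W₂ is 3×4); the variable names are illustrative.

```python
n_inputs, n_hidden, n_outputs = 4, 3, 4   # sizes implied by the tables

total = (n_inputs * n_hidden + n_hidden        # W1 and b1
         + n_hidden * n_outputs + n_outputs)   # W2 and b2

# W2 row 2 (4), b2 (4), three entries of W1 column 2, and b1[2]
updated = 4 + 4 + 3 + 1

print(total, updated, total - updated)  # 31 12 19
```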
Python: Applying Updates
Let's implement the weight update rule in Python and verify the loss drops. This code picks up where Section 2's backprop code left off: all gradient variables are already computed.
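Section 2's code isn't reproduced here, so the following is a self-contained sketch rather than a continuation of it. It hard-codes the gradients from the tables above and covers only the parameters on the path through hidden neuron 2 (every other gradient is zero). The dictionary keys such as `W1_col2` are hypothetical names chosen for this example.

```python
lr = 0.1  # learning rate

# Old values of the parameters on the path through hidden neuron 2.
params = {
    "W1_col2": [0.1, -0.2, 0.5, -0.1],   # W1[0][2] .. W1[3][2]
    "b1_2":    [0.0],                    # b1[2]
    "W2_row2": [0.2, -0.4, 0.1, -0.5],   # W2[2][0] .. W2[2][3]
    "b2":      [0.0, 0.1, -0.1, 0.0],    # b2[0] .. b2[3]
}

# Gradients copied from the tables (normally produced by backprop).
grads = {
    "W1_col2": [0.44, 0.0, 0.44, 0.44],
    "b1_2":    [0.44],
    "W2_row2": [-0.225, -0.275, -0.0125, -0.3125],
    "b2":      [-0.45, -0.55, -0.025, -0.625],
}

# Apply theta <- theta - lr * grad to every entry.
updated = {
    name: [p - lr * g for p, g in zip(values, grads[name])]
    for name, values in params.items()
}

for name, values in updated.items():
    print(name, [round(v, 5) for v in values])
```

The printed values should line up with the "New value" columns above. Building a new dictionary instead of overwriting `params` keeps the old values around for the before/after comparison.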
Verify: Forward Pass with New Weights
Let's run the same input through the network with the updated weights and see if things improved.
Layer 1 with new weights
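Hidden neuron 2 (the only one that changed):
$$z_2 = (1)(0.056) + (0)(-0.2) + (1)(0.456) + (1)(-0.144) + (-0.044) = 0.324$$
After ReLU: h₂ = max(0, 0.324) = 0.324 (still alive, but lower than before: 0.5 → 0.324)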
Layer 2 with new weights
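Here's a small sketch of the forward pass with the updated values. It assumes the input x = [1, 0, 1, 1] and relies on hidden neurons 0 and 1 still outputting zero (their weights did not change), so only hidden neuron 2 feeds the outputs. Variable names are illustrative.

```python
x = [1.0, 0.0, 1.0, 1.0]                         # assumed input
W1_col2 = [0.056, -0.2, 0.456, -0.144]           # updated W1[:, 2]
b1_2 = -0.044                                    # updated b1[2]
W2_row2 = [0.2225, -0.3725, 0.10125, -0.46875]   # updated W2[2, :]
b2 = [0.045, 0.155, -0.0975, 0.0625]             # updated b2

# Layer 1: pre-activation and ReLU for hidden neuron 2.
z2 = sum(xi * w for xi, w in zip(x, W1_col2)) + b1_2
h2 = max(0.0, z2)

# Layer 2: the dead hidden neurons contribute 0, so each output is h2 * weight + bias.
y_hat = [h2 * w + b for w, b in zip(W2_row2, b2)]

print(round(h2, 3))                  # 0.324
print([round(y, 2) for y in y_hat])  # [0.12, 0.03, -0.06, -0.09]
```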
The Improvement
| Output | Before | After | Target | Better? |
|---|---|---|---|---|
| ŷ₀ | 0.10 | 0.12 | 1 | ✅ Moved toward 1 |
| ŷ₁ | −0.10 | 0.03 | 1 | ✅ Moved toward 1 |
| ŷ₂ | −0.05 | −0.06 | 0 | ➖ Slightly worse |
| ŷ₃ | −0.25 | −0.09 | 1 | ✅ Moved toward 1 |
The new MSE loss:
| Metric | Before | After | Change |
|---|---|---|---|
| MSE Loss | 0.896 | 0.726 | −19.0% |
One step of gradient descent reduced the loss by 19%. The network is still far from perfect (loss 0.726 vs. target of 0.0), but it's measurably better after a single weight update. After hundreds of steps on all 16 training examples, the loss should approach zero.
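The loss figures can be checked with a few lines. This sketch assumes MSE is the mean over the four outputs and uses the unrounded post-update predictions, since the two-decimal values shown in the table would give 0.727 rather than 0.726. The helper name `mse` is illustrative.

```python
def mse(pred, target):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

target = [1.0, 1.0, 0.0, 1.0]
before = [0.10, -0.10, -0.05, -0.25]
after = [0.11709, 0.03431, -0.064695, -0.089375]  # unrounded outputs after the update

loss_before = mse(before, target)
loss_after = mse(after, target)
change_pct = 100 * (loss_after - loss_before) / loss_before

print(round(loss_before, 3), round(loss_after, 3))  # 0.896 0.726
print(round(change_pct, 1))                         # -19.0
```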
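PyTorch: Three Lines Do It All
The backpropagation and weight updates we did by hand take three lines in PyTorch: `optimizer.zero_grad()` clears old gradients, `loss.backward()` computes all gradients, and `optimizer.step()` applies the updates. The order matters, because clearing after the backward pass would discard the gradients just computed. The numbers match our hand calculations up to floating-point rounding.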
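A minimal sketch of that step, assuming PyTorch is installed. It models only the path through hidden neuron 2, because the dead neurons contribute nothing for this input and their weights are not listed in this section. The tensor names are illustrative, and the input x = [1, 0, 1, 1] is inferred from the gradient tables.

```python
import torch

x = torch.tensor([1.0, 0.0, 1.0, 1.0])   # assumed input
y = torch.tensor([1.0, 1.0, 0.0, 1.0])   # targets

# Parameters on the path through hidden neuron 2.
w1 = torch.tensor([0.1, -0.2, 0.5, -0.1], requires_grad=True)   # W1[:, 2]
b1 = torch.tensor(0.0, requires_grad=True)                      # b1[2]
w2 = torch.tensor([0.2, -0.4, 0.1, -0.5], requires_grad=True)   # W2[2, :]
b2 = torch.tensor([0.0, 0.1, -0.1, 0.0], requires_grad=True)

optimizer = torch.optim.SGD([w1, b1, w2, b2], lr=0.1)

# Forward pass and MSE loss.
h2 = torch.relu(x @ w1 + b1)
y_hat = h2 * w2 + b2
loss = torch.mean((y_hat - y) ** 2)

optimizer.zero_grad()   # clear old gradients
loss.backward()         # compute all gradients
optimizer.step()        # apply theta <- theta - lr * grad

print(w2.detach(), b2.detach())
print(w1.detach(), b1.detach())
```

PyTorch uses 32-bit floats by default, so expect agreement with the tables to about six decimal places rather than bit-for-bit equality.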
Summary
- Update rule: $\theta \leftarrow \theta - \eta \, \partial L / \partial \theta$, with $\eta = 0.1$
- 12 of 31 parameters changed. Dead neurons and zero inputs blocked the rest.
- Output-layer biases got the biggest push. Their gradient is just the output error, so they keep learning even when hidden neurons are dead.
- Loss dropped 19% after one update: 0.896 → 0.726
- Predictions moved toward targets for 3 out of 4 outputs.
In the next section, we'll implement all of this in PyTorch, verify our hand calculations match, and train the network to convergence.