Learning Objectives
By the end of this section, you will be able to:
- Explain why we need gradients to improve a neural network
- Understand gradient descent as repeatedly nudging weights downhill
- Describe the role of the learning rate and what happens if it's too large or too small
- Articulate the challenge that backpropagation solves
Where We Left Off
In Chapter 7, we built a neural network for our 2×2 diagonal flip problem and traced the forward pass by hand. Here's where we ended up:
| What | Value |
|---|---|
| Input x | [1, 0, 1, 1] |
| Target y (diagonal flip) | [1, 1, 0, 1] |
| Prediction ŷ | [0.1, −0.1, −0.05, −0.25] |
| Loss (MSE) | 0.896 |
The prediction is terrible. Output pixel 1 should be 1 but the network says -0.1. Output pixel 3 should be 1 but the network says -0.25. The MSE loss of 0.896 confirms: the network is mostly wrong.
The situation: We have 31 random numbers (weights and biases) that produce garbage. We need to change them so the loss goes down. But which ones should increase? Which should decrease? And by how much?
The Key Question
Imagine you're standing on a hilly landscape in thick fog. You can't see where the valley is, but you can feel the slope under your feet. What do you do? You walk downhill. One step at a time, always moving in the direction where the ground slopes down most steeply.
Neural network training works the same way:
- The landscape is the loss function—its shape depends on the 31 weight values
- Your position is the current set of weights
- The altitude is the loss value (0.896 right now)
- The slope is the gradient—it tells you which direction is downhill
What Is a Gradient?
The gradient of the loss with respect to a weight tells you: if I increase this weight by a tiny amount, how much does the loss change?
Three cases:
| Gradient value | Meaning | Action |
|---|---|---|
| Positive (+0.44) | Increasing this weight increases the loss | Decrease the weight |
| Negative (−0.55) | Increasing this weight decreases the loss | Increase the weight |
| Zero (0.0) | Changing this weight has no effect on loss | Leave it alone |
Gradient Descent: The Algorithm
The complete algorithm is embarrassingly simple:
- Run the forward pass to compute the prediction and loss
- Compute the gradient of the loss with respect to every weight
- Update each weight: move it a small step opposite to the gradient
- Repeat from step 1
The update rule for a single weight:
The minus sign is crucial—we go opposite to the gradient. If the gradient is positive (increasing the weight would increase the loss), we decrease it. If the gradient is negative (increasing the weight would decrease the loss), we increase it.
The Learning Rate
The symbol (eta) is the learning rate—it controls how big each step is. It's the single most important number you choose when training a network.
| Learning rate | Behavior | Problem |
|---|---|---|
| η = 0.001 | Very small steps, very slow progress | Takes forever to converge |
| η = 0.1 | Moderate steps, steady progress | Usually a good starting point |
| η = 1.0 | Large steps, fast but unstable | May overshoot and diverge |
| η = 10.0 | Huge steps | Loss explodes, network blows up |
For our example, we'll use . This means each weight will change by 10% of its gradient each step.
The Challenge: Computing Gradients
The update rule is simple. The hard part is step 2: computing the gradient for all 31 parameters. We need:
We could compute each gradient numerically by wiggling each weight and seeing how the loss changes. But that requires 2 forward passes per weight—62 forward passes for 31 weights. For a real network with millions of parameters, this is impossibly slow.
Backpropagation solves this. It computes all gradients in a single backward pass—roughly the same cost as one forward pass. That's why it's the most important algorithm in deep learning.
The key insight of backpropagation: By applying the chain rule of calculus layer by layer from output back to input, we can compute how every single weight in the network affects the loss—all at once. In the next section, we'll trace every gradient by hand through our 2×2 image network.
Summary
- Gradients point uphill. We go in the opposite direction to reduce the loss.
- Update rule:
- Learning rate controls step size. Too big = unstable. Too small = slow.
- The challenge is efficiently computing gradients for all parameters. Backpropagation does this in one backward pass.