Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain why we need gradients to improve a neural network
Understand gradient descent as repeatedly nudging weights downhill
Describe the role of the learning rate and what happens if it's too large or too small
Articulate the challenge that backpropagation solves

Where We Left Off

In Chapter 7, we built a neural network for our 2×2 diagonal flip problem and traced the forward pass by hand. Here's where we ended up:

What	Value
Input x	[1, 0, 1, 1]
Target y (diagonal flip)	[1, 1, 0, 1]
Prediction ŷ	[0.1, −0.1, −0.05, −0.25]
Loss (MSE)	0.896

The prediction is terrible. Output pixel 1 should be 1 but the network says -0.1. Output pixel 3 should be 1 but the network says -0.25. The MSE loss of 0.896 confirms: the network is mostly wrong.

The situation: We have 31 random numbers (weights and biases) that produce garbage. We need to change them so the loss goes down. But which ones should increase? Which should decrease? And by how much?

The Key Question

Imagine you're standing on a hilly landscape in thick fog. You can't see where the valley is, but you can feel the slope under your feet. What do you do? You walk downhill. One step at a time, always moving in the direction where the ground slopes down most steeply.

Neural network training works the same way:

The landscape is the loss function—its shape depends on the 31 weight values
Your position is the current set of weights
The altitude is the loss value (0.896 right now)
The slope is the gradient—it tells you which direction is downhill

What Is a Gradient?

The gradient of the loss with respect to a weight tells you: if I increase this weight by a tiny amount, how much does the loss change?

\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w)}{\epsilon}

Three cases:

Gradient value	Meaning	Action
Positive (+0.44)	Increasing this weight increases the loss	Decrease the weight
Negative (−0.55)	Increasing this weight decreases the loss	Increase the weight
Zero (0.0)	Changing this weight has no effect on loss	Leave it alone

The gradient is a direction, not a destination. It tells you which way is downhill from where you are right now. After you take a step, the gradient changes—because you're at a new position on the landscape. That's why training takes many steps.

Gradient Descent: The Algorithm

The complete algorithm is embarrassingly simple:

Run the forward pass to compute the prediction and loss
Compute the gradient of the loss with respect to every weight
Update each weight: move it a small step opposite to the gradient
Repeat from step 1

The update rule for a single weight:

w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}

The minus sign is crucial—we go opposite to the gradient. If the gradient is positive (increasing the weight would increase the loss), we decrease it. If the gradient is negative (increasing the weight would decrease the loss), we increase it.

The Learning Rate

The symbol $\eta$ (eta) is the learning rate—it controls how big each step is. It's the single most important number you choose when training a network.

Learning rate	Behavior	Problem
η = 0.001	Very small steps, very slow progress	Takes forever to converge
η = 0.1	Moderate steps, steady progress	Usually a good starting point
η = 1.0	Large steps, fast but unstable	May overshoot and diverge
η = 10.0	Huge steps	Loss explodes, network blows up

For our example, we'll use $\eta = 0.1$ . This means each weight will change by 10% of its gradient each step.

Practical rule: If the loss goes down each step, your learning rate is probably fine. If the loss jumps around wildly or goes up, decrease it. If training is painfully slow, try increasing it.

The Challenge: Computing Gradients

The update rule is simple. The hard part is step 2: computing the gradient for all 31 parameters. We need:

\frac{\partial L}{\partial W_1[i][j]}, \quad \frac{\partial L}{\partial b_1[j]}, \quad \frac{\partial L}{\partial W_2[i][j]}, \quad \frac{\partial L}{\partial b_2[j]}

We could compute each gradient numerically by wiggling each weight and seeing how the loss changes. But that requires 2 forward passes per weight—62 forward passes for 31 weights. For a real network with millions of parameters, this is impossibly slow.

Backpropagation solves this. It computes all gradients in a single backward pass—roughly the same cost as one forward pass. That's why it's the most important algorithm in deep learning.

The key insight of backpropagation: By applying the chain rule of calculus layer by layer from output back to input, we can compute how every single weight in the network affects the loss—all at once. In the next section, we'll trace every gradient by hand through our 2×2 image network.

Summary

Gradients point uphill. We go in the opposite direction to reduce the loss.
Update rule: $w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$
Learning rate $\eta$ controls step size. Too big = unstable. Too small = slow.
The challenge is efficiently computing gradients for all parameters. Backpropagation does this in one backward pass.