Chapter 8

The Learning Problem

Backpropagation from Scratch

Learning Objectives

By the end of this section, you will be able to:

  1. Understand what "learning" means in the context of neural networks—it's optimization over a loss landscape
  2. Describe the learning loop that drives all neural network training: forward pass → loss → backward pass → update
  3. Visualize loss surfaces and understand why finding the minimum is challenging in high dimensions
  4. Explain gradient descent as the fundamental algorithm for navigating loss landscapes
  5. Connect loss functions to learning objectives and understand why different tasks need different loss functions
  6. Recognize why backpropagation is essential—computing gradients efficiently is the key to training deep networks
Why This Matters: Before diving into the mechanics of backpropagation, you need to understand what we're trying to achieve. Learning is optimization. The loss function defines success. Gradients show us the way. This section builds the conceptual foundation for everything that follows.

What Does It Mean to Learn?

When we say a neural network "learns," we mean something very specific and mathematical. Unlike biological learning, which remains mysterious, machine learning has a precise definition:

Learning is the process of adjusting a model's parameters to minimize a measure of error on training data.

Let's unpack this definition. A neural network is a parameterized function $f(\mathbf{x}; \boldsymbol{\theta})$ that maps inputs $\mathbf{x}$ to outputs. The parameters $\boldsymbol{\theta}$ (weights and biases) determine the function's behavior. Learning means finding the parameter values that make the function produce the outputs we want.

The Key Insight

The breakthrough that enables machine learning is treating learning as an optimization problem:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

Where:

  • $\boldsymbol{\theta}^*$ is the optimal parameter vector we seek
  • $\mathcal{L}(\boldsymbol{\theta})$ is the loss function, a scalar measure of how wrong our predictions are
  • $\arg\min$ means "find the argument that minimizes"

This framing is powerful because optimization is a well-studied field. We can apply centuries of mathematical insights to make machines learn.

A Concrete Example

Consider training a network to predict house prices from features like square footage and location. Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$, we want parameters that make predictions $\hat{y}^{(i)} = f(\mathbf{x}^{(i)}; \boldsymbol{\theta})$ close to the actual prices $y^{(i)}$.

Using mean squared error as our loss function:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) - y^{(i)} \right)^2$$

Learning means finding $\boldsymbol{\theta}$ that makes this loss as small as possible.
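To make this concrete, here is a minimal NumPy sketch (with made-up toy data, not real house prices) of evaluating the MSE loss at two parameter settings:

```python
import numpy as np

# Toy data: one feature, hypothetical target values generated by y = 2x + 0.5
X = np.array([1.0, 2.0, 3.0])   # inputs
y = np.array([2.5, 4.5, 6.5])   # targets

def predict(X, w, b):
    """A minimal parameterized model f(x; theta) with theta = (w, b)."""
    return w * X + b

def mse_loss(y_hat, y):
    """Mean squared error: average squared prediction error."""
    return np.mean((y_hat - y) ** 2)

# A poor parameter setting gives a high loss...
print(mse_loss(predict(X, 0.0, 0.0), y))   # large
# ...while parameters matching the data-generating values give zero loss here
print(mse_loss(predict(X, 2.0, 0.5), y))   # 0.0 for this toy data
```

Learning amounts to searching the $(w, b)$ space for the setting with the smallest loss.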


The Learning Loop

All neural network training follows the same iterative pattern. Understanding this loop is crucial—it's the heartbeat of deep learning.

Each iteration proceeds through six steps:

  1. Input Data: sample a mini-batch of training examples $(\mathbf{x}, \mathbf{y})$.
  2. Forward Pass: compute predictions $\hat{\mathbf{y}} = f(\mathbf{x}; \boldsymbol{\theta})$.
  3. Compute Loss: measure the error $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$.
  4. Backward Pass: compute the gradients $\nabla_{\boldsymbol{\theta}} \mathcal{L}$.
  5. Update Weights: apply gradient descent, $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}$.
  6. Iterate: repeat until convergence.

The mini-batch sampled in step 1 is typically 32, 64, or 128 examples. Smaller batches add noise to the gradient estimate but enable more frequent updates.

The Four Stages

1. Forward Pass

We feed input data through the network, layer by layer, computing predictions. Each layer applies its transformation:

$$\mathbf{h}^{(l)} = \sigma^{(l)}\left( \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right)$$

The output of the final layer is our prediction $\hat{\mathbf{y}}$.
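The layer-by-layer computation can be sketched in NumPy; the layer sizes and random weights here are arbitrary, chosen only to illustrate the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Elementwise activation sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    """Apply h^(l) = sigma(W^(l) h^(l-1) + b^(l)) layer by layer."""
    h1 = relu(W1 @ x + b1)     # hidden layer
    y_hat = W2 @ h1 + b2       # linear output layer (no activation)
    return y_hat

x = np.array([1.0, -0.5, 2.0])
print(forward(x).shape)  # (1,)
```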

2. Loss Computation

We compare predictions to ground truth using a loss function. This produces a single scalar number that summarizes how wrong we are:

$$L = \mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$$

3. Backward Pass (Backpropagation)

Using the chain rule of calculus, we compute how each parameter contributed to the loss. We calculate the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$, the direction of steepest increase in loss.

4. Parameter Update

We adjust parameters in the opposite direction of the gradient, reducing the loss:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}$$

Where $\eta$ is the learning rate, a crucial hyperparameter controlling the step size.

Why This Works

This loop is a feedback system. Each iteration provides information about errors, which we use to correct the model. Over thousands of iterations, the model converges to a state where it makes good predictions—or at least, as good as its architecture allows.
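The loop described above can be sketched end to end in plain NumPy for a one-feature linear model. The data here is synthetic, and the gradient formulas are the hand-derived MSE derivatives:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1))
y = 3.0 * X + 1.0 + 0.01 * rng.normal(size=(64, 1))  # hypothetical targets

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    y_hat = w * X + b                      # 1. forward pass
    loss = np.mean((y_hat - y) ** 2)       # 2. loss computation
    dw = np.mean(2 * (y_hat - y) * X)      # 3. backward pass (gradients)
    db = np.mean(2 * (y_hat - y))
    w -= lr * dw                           # 4. parameter update
    b -= lr * db

print(round(w, 2), round(b, 2))  # close to the generating values 3.0 and 1.0
```

For deep networks the only stage that changes fundamentally is step 3, where hand-derived gradients are replaced by backpropagation.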

The Optimization Perspective

To truly understand learning, we need to think about it geometrically. The loss function $\mathcal{L}(\boldsymbol{\theta})$ defines a loss surface over parameter space.

The Loss Surface

Imagine the parameters as coordinates in a high-dimensional space. For a network with 1 million parameters, this is a 1,000,000-dimensional space! At each point (parameter configuration), the loss function assigns a height (the loss value).

For visualization, consider just two parameters: the loss surface is then a 3D landscape. Hills represent high loss (bad predictions). Valleys represent low loss (good predictions). Training is like rolling a ball downhill to find the lowest point.

[Interactive loss surface explorer: adjusting weight and bias sliders shows how the MSE loss changes, with an arrow indicating the direction of steepest descent. The optimum for this example is near w ≈ 2.0, b ≈ 0.5.]

What Makes Optimization Hard?

In two dimensions, finding the minimum seems easy—just roll downhill! But real loss surfaces are much more complex:

| Challenge | Description | Impact on Training |
|---|---|---|
| High dimensionality | Millions of parameters = millions of dimensions | Visualization impossible; intuition breaks down |
| Local minima | Valleys that aren't the deepest | May get stuck at suboptimal solutions |
| Saddle points | Flat in some directions, curved in others | Gradients near zero but not at a minimum |
| Plateaus | Flat regions with small gradients | Training stalls, appears to stop learning |
| Sharp valleys | Narrow, steep-sided minima | Generalization issues, sensitivity to perturbations |
| Ill-conditioning | Very different curvature in different directions | Slow convergence, oscillations |

The Surprising Truth

Despite these challenges, deep learning works remarkably well in practice. Research suggests that in high dimensions, most critical points are saddle points rather than local minima, and gradient-based methods can escape them. The loss landscapes of overparameterized networks tend to be more benign than we might expect.

Visualizing the Loss Landscape

The interactive visualization below shows a 3D loss surface. Watch as gradient descent navigates this landscape, following the negative gradient at each step. Try different learning rates to see their effects.

[Interactive 3D loss landscape: drag to rotate the surface and watch gradient descent find the minimum; the learning rate is adjustable.]

What You're Seeing

  • The surface: A loss function with two parameters, showing a global minimum and some local structure
  • The red ball: Current parameter position during optimization
  • The yellow path: Trajectory taken by gradient descent
  • The green marker: Location of the global minimum

Experiment Ideas

  • Try a very high learning rate (0.4+). What happens?
  • Try a very low learning rate (0.02). How does convergence speed change?
  • Drag to rotate and see the surface from different angles. Notice the "valley" structure.

Gradient Descent: Following the Slope

Gradient descent is the algorithm that enables neural network training. The idea is beautifully simple: the gradient tells us the direction of steepest ascent, so we walk in the opposite direction.

The Gradient

The gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ is a vector of partial derivatives:

$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \begin{pmatrix} \frac{\partial \mathcal{L}}{\partial \theta_1} \\ \frac{\partial \mathcal{L}}{\partial \theta_2} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial \theta_n} \end{pmatrix}$$

Each component tells us how much the loss changes when we slightly perturb that parameter. The gradient points in the direction where loss increases fastest.

The Update Rule

Gradient descent iteratively updates parameters:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

Where:

  • $\boldsymbol{\theta}_t$ = parameters at iteration $t$
  • $\eta$ = learning rate (step size)
  • $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ = gradient at the current position
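A minimal illustration of this update rule, using the toy loss $\mathcal{L}(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
def grad(theta):
    """Gradient of the quadratic loss L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta, lr = 0.0, 0.1
for t in range(50):
    theta = theta - lr * grad(theta)   # theta_{t+1} = theta_t - eta * grad

print(round(theta, 4))  # approaches the minimizer theta* = 3
```

Each step shrinks the distance to the minimizer by a constant factor, here $(1 - 2\eta) = 0.8$, so the iterates converge geometrically.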

[Interactive: gradient descent on a 2D loss function, with a selectable loss function, adjustable learning rate, and a step counter.]

The Learning Rate Dilemma

The learning rate $\eta$ is critical:

| Learning Rate | Behavior | Risk |
|---|---|---|
| Too large | Big steps, fast initial progress | Overshooting, oscillations, divergence |
| Too small | Careful steps, stable convergence | Extremely slow, may get stuck |
| Just right | Balanced progress | Still depends on landscape topology |
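These regimes are easy to demonstrate on the toy loss $\mathcal{L}(\theta) = \theta^2$, whose gradient-descent update contracts $\theta$ by a factor of $(1 - 2\eta)$ each step:

```python
def run_gd(lr, steps=30):
    """Run gradient descent on L(theta) = theta^2 from theta = 5."""
    theta = 5.0
    for _ in range(steps):
        theta -= lr * 2 * theta      # gradient of theta^2 is 2*theta
    return theta

print(run_gd(0.05))   # small lr: slow, steady shrink toward 0
print(run_gd(0.4))    # moderate lr: fast convergence to ~0
print(run_gd(1.1))    # too large: |1 - 2*lr| > 1, so |theta| grows -- divergence
```

The divergence threshold $\eta = 1$ here is specific to this quadratic; real loss surfaces have no such clean formula, which is why the learning rate must be tuned empirically.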

Quick Check

If gradient descent starts oscillating without converging, what should you try first?


Loss Functions: Measuring Error

The loss function is the objective we're optimizing. It quantifies how wrong our predictions are. Different tasks require different loss functions.

Mean Squared Error (MSE)

For regression (predicting continuous values):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

Intuition: Penalizes predictions proportionally to the square of their error. Large errors are penalized much more than small ones.

Cross-Entropy Loss

For classification (predicting categories):

$$\mathcal{L}_{\text{CE}} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)})$$

Intuition: Measures the difference between predicted probability distribution and true distribution. Heavily penalizes confident wrong predictions.

Binary Cross-Entropy

For binary classification:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \right]$$
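A quick numerical check of the "confident wrong predictions" claim, computing the binary cross-entropy for a single example (the clipping epsilon is a standard numerical guard, not part of the definition):

```python
import numpy as np

def bce(p_hat, y, eps=1e-12):
    """Binary cross-entropy for one prediction p_hat against label y in {0, 1}."""
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# True label is 1: confident-correct is cheap, confident-wrong is very costly
print(bce(0.99, 1))   # small loss
print(bce(0.5, 1))    # ~0.693 (= ln 2, the "no information" point)
print(bce(0.01, 1))   # ~4.6, a heavy penalty for a confident mistake
```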

Why Different Loss Functions?

| Task | Loss Function | Reason |
|---|---|---|
| Regression | MSE, MAE, Huber | Predictions are continuous values |
| Binary classification | Binary cross-entropy | Output is the probability of class 1 |
| Multi-class classification | Cross-entropy + softmax | Output is a probability distribution over classes |
| Object detection | Combination losses | Multiple objectives: localization + classification |

Loss Functions as Assumptions

Your choice of loss function encodes assumptions about the problem. MSE assumes Gaussian noise in targets. Cross-entropy assumes a probabilistic interpretation of outputs. These assumptions affect what the model learns to optimize.

The Statistical Learning Perspective

So far we've discussed fitting training data. But the real goal is generalization—performing well on new, unseen data.

The True Objective

We actually want to minimize the expected risk over the data distribution:

$$\mathcal{R}(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{x}, y) \sim P_{\text{data}}} \left[ \mathcal{L}(f(\mathbf{x}; \boldsymbol{\theta}), y) \right]$$

But we don't know the true data distribution $P_{\text{data}}$! We only have samples.

Empirical Risk Minimization

Instead, we minimize the empirical risk—the average loss over our training set:

$$\hat{\mathcal{R}}(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f(\mathbf{x}^{(i)}; \boldsymbol{\theta}), y^{(i)})$$

This works when training data is representative of the true distribution. The gap between training loss and test loss tells us how well we've generalized.
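A small simulation of this gap: fit a one-parameter model by minimizing empirical risk on a training split (closed-form least squares here, standing in for gradient descent), then estimate the expected risk on held-out data. The data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 2x + noise, split into training and held-out sets
X = rng.normal(size=(200, 1))
y = 2 * X + 0.1 * rng.normal(size=(200, 1))
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

# Minimize the empirical risk on the training split (least-squares slope)
w = float((X_tr * y_tr).sum() / (X_tr * X_tr).sum())

emp_risk  = float(np.mean((w * X_tr - y_tr) ** 2))   # empirical risk (train)
test_risk = float(np.mean((w * X_te - y_te) ** 2))   # estimate of expected risk
print(emp_risk, test_risk)  # both near the noise floor of 0.1^2 = 0.01
```

Because the training sample is representative here, the two risks nearly coincide; overfitting shows up precisely when the test risk pulls away from the empirical risk.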

The Bias-Variance Tradeoff

Two sources of error in learning:

  • Bias: Error from the model being too simple (underfitting). A linear model can't learn XOR, no matter how much data you have.
  • Variance: Error from the model being too sensitive to training data (overfitting). A very complex model might memorize noise.

Deep learning mitigates this through:

  1. High capacity models (low bias)
  2. Regularization techniques (controlled variance)
  3. Massive datasets (reduces overfitting)
  4. Careful architecture design (inductive biases)

The Challenge: Computing Gradients

We've established that gradient descent requires gradients. But how do we compute $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ for a neural network with millions of parameters?

Three Approaches

1. Numerical Differentiation

Approximate derivatives using finite differences:

$$\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\boldsymbol{\theta} + \epsilon \mathbf{e}_i) - \mathcal{L}(\boldsymbol{\theta} - \epsilon \mathbf{e}_i)}{2\epsilon}$$

Problem: Requires 2 forward passes per parameter. For 1 million parameters, that's 2 million forward passes per gradient! Completely impractical.
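Despite being impractical for training, finite differences remain useful for verifying gradients on small problems. A sketch, using an invented two-parameter quadratic loss:

```python
import numpy as np

def loss(theta):
    """Toy loss: L(theta) = theta_1^2 + 3 * theta_2^2."""
    return theta[0] ** 2 + 3 * theta[1] ** 2

def numerical_grad(loss, theta, eps=1e-5):
    """Central differences: two loss evaluations per parameter."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = np.array([1.0, 2.0])
print(numerical_grad(loss, theta))   # close to the analytic gradient [2, 12]
```

The loop over parameters is exactly the scaling problem described above: the cost grows linearly with the number of parameters, each unit of which is a full loss evaluation.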

2. Symbolic Differentiation

Apply calculus rules to derive gradient expressions analytically.

Problem: Expression size grows exponentially with network depth. Memory-intensive and computationally explosive.

3. Automatic Differentiation (Backpropagation)

Compute gradients through reverse-mode automatic differentiation—the famous backpropagation algorithm.

Advantage: Computes all gradients in time proportional to a single forward pass (a small constant factor), regardless of the number of parameters. This is the breakthrough that makes deep learning possible!

The Key Insight

Backpropagation is just the chain rule, applied systematically. By storing intermediate values during the forward pass and reusing them during the backward pass, we compute gradients efficiently. This will be the focus of subsequent sections.
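A hand-worked example of this idea on the tiny loss $L = (wx - y)^2$: the forward pass stores intermediates ($z$ and $e$) that the backward pass reuses when applying the chain rule.

```python
# Forward and backward passes for L = (w*x - y)^2, caching intermediates
# exactly as backprop does.
x, y, w = 2.0, 5.0, 1.0

# Forward pass (store intermediate values)
z = w * x          # prediction
e = z - y          # error
L = e ** 2         # loss

# Backward pass (chain rule, reusing stored values)
dL_de = 2 * e      # dL/de
dL_dz = dL_de      # de/dz = 1
dL_dw = dL_dz * x  # dz/dw = x

print(L, dL_dw)    # 9.0 and -12.0
```

The negative gradient says that increasing $w$ would decrease the loss, which matches intuition: the prediction $z = 2$ undershoots the target $y = 5$.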

Implementation Preview

Here's a preview of how these concepts translate to PyTorch code. We'll build this understanding fully in subsequent sections.

The Complete Learning Loop in PyTorch
learning_loop.py, annotated walkthrough:

  • The model: a neural network is a parameterized function. nn.Sequential stacks layers, each with weights and biases that we'll learn. This network maps 10 inputs → 64 → 32 → 1 output.
  • Loss function: the loss defines our objective. MSELoss computes mean squared error, $L = \frac{1}{n}\sum(\hat{y} - y)^2$. PyTorch provides many built-in loss functions.
  • Optimizer: the optimizer implements the update rule. SGD (stochastic gradient descent) applies $\theta \leftarrow \theta - \eta \nabla L$; we specify the learning rate and which parameters to update. lr=0.01 means steps of size 0.01 × gradient.
  • Forward pass: passing input through the network computes predictions. PyTorch builds a computational graph as it goes, recording operations for backpropagation.
  • Loss computation: comparing predictions to ground truth yields a single scalar, which becomes the "output" of the computational graph.
  • Zero gradients: PyTorch accumulates gradients by default (useful in some cases), so we clear them before each backward pass to avoid mixing gradients from different batches.
  • Backpropagation: the magic line! .backward() computes $\partial L / \partial \theta$ for every parameter $\theta$ using the chain rule. Afterwards, each parameter's .grad attribute contains its gradient.
  • Parameter update: .step() applies the optimizer's update rule; for SGD, $\theta \leftarrow \theta - \text{lr} \times \theta\text{.grad}$. More advanced optimizers add momentum, adaptive learning rates, and so on.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# 1. Define a simple model (parameterized function)
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
)

# 2. Define the loss function (what we're optimizing)
loss_fn = nn.MSELoss()

# 3. Define the optimizer (how we update parameters)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Synthetic data so the loop runs end to end
X, y = torch.randn(256, 10), torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
num_epochs = 5

# The Learning Loop
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        # Forward pass: compute predictions
        predictions = model(batch_x)

        # Compute loss
        loss = loss_fn(predictions, batch_y)

        # Backward pass: compute gradients
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Backpropagation!

        # Update parameters
        optimizer.step()       # θ ← θ - η∇L

    print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
```

Summary

This section established the conceptual foundation for understanding backpropagation and neural network training.

Key Concepts

| Concept | Definition | Key Equation |
|---|---|---|
| Learning | Adjusting parameters to minimize loss | $\boldsymbol{\theta}^* = \arg\min \mathcal{L}(\boldsymbol{\theta})$ |
| Loss function | Scalar measure of prediction error | $\mathcal{L}(\hat{y}, y)$ |
| Gradient | Direction of steepest increase | $\nabla \mathcal{L} = [\partial \mathcal{L}/\partial \theta_1, \partial \mathcal{L}/\partial \theta_2, \ldots]$ |
| Gradient descent | Update rule to minimize loss | $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla \mathcal{L}$ |
| Learning rate | Step size for updates | $\eta$ (hyperparameter) |
| Forward pass | Compute predictions from input | $\hat{y} = f(\mathbf{x}; \boldsymbol{\theta})$ |
| Backward pass | Compute gradients via chain rule | Backpropagation |

The Big Picture

Learning is optimization. We have a parameterized model, a loss function that measures error, and an update rule (gradient descent) that adjusts parameters to reduce error. The key challenge is computing gradients efficiently—this is what backpropagation solves.

Coming Up: In the next sections, we'll dive deep into gradient descent variants, computational graphs, and the backpropagation algorithm itself. You'll implement backprop from scratch and understand exactly how modern frameworks like PyTorch compute gradients automatically.

Exercises

Conceptual Questions

  1. Why is the loss function always a scalar? What would happen if it were a vector?
  2. Explain why gradient descent moves in the negative gradient direction. What would happen if we moved in the positive gradient direction?
  3. In the loss surface visualization, the path taken by gradient descent isn't a straight line to the minimum. Why does it curve?
  4. What's the difference between batch gradient descent (using all training data), stochastic gradient descent (using one example), and mini-batch gradient descent (using a subset)?

Mathematical Exercises

  1. For the loss function $\mathcal{L}(w, b) = (wx + b - y)^2$ with a single data point $(x, y) = (2, 5)$, compute the gradient $[\partial \mathcal{L}/\partial w, \partial \mathcal{L}/\partial b]$ at $(w, b) = (1, 1)$.
  2. Show that for MSE loss, the gradient with respect to a prediction $\hat{y}$ is $\frac{2}{m}(\hat{y} - y)$.
  3. Prove that gradient descent converges to the minimum for the convex quadratic function $f(x) = ax^2$ with $a > 0$, given an appropriate learning rate.

Coding Exercises

```python
# Exercise: Implement gradient descent for linear regression from scratch
# Given: data points, initial weights, learning rate, number of iterations
# Goal: find optimal weights that minimize MSE

import numpy as np

# Generate data
np.random.seed(42)
X = np.random.randn(100, 1)
y = 2 * X + 0.5 + 0.1 * np.random.randn(100, 1)

def gradient_descent(X, y, lr=0.1, n_iters=100):
    w = np.zeros((1, 1))
    b = 0.0
    m = len(X)
    losses = []

    for i in range(n_iters):
        # Forward pass: compute predictions
        y_pred = ...  # TODO

        # Compute loss
        loss = ...  # TODO
        losses.append(loss)

        # Compute gradients
        dw = ...  # TODO
        db = ...  # TODO

        # Update parameters
        w = ...  # TODO
        b = ...  # TODO

    return w, b, losses

# Test your implementation
w_final, b_final, losses = gradient_descent(X, y)
print(f"Learned: w = {w_final[0,0]:.3f}, b = {b_final:.3f}")
print(f"True:    w = 2.0, b = 0.5")
```
Expected Results

After implementing correctly, you should find $w \approx 2.0$ and $b \approx 0.5$. Plot the loss over iterations to visualize convergence.

You now understand the fundamental learning problem that backpropagation solves. In the next section, we'll explore gradient descent variants—the practical algorithms that make training neural networks efficient.