Learning Objectives
By the end of this section, you will be able to:
- Understand what "learning" means in the context of neural networks—it's optimization over a loss landscape
- Describe the learning loop that drives all neural network training: forward pass → loss → backward pass → update
- Visualize loss surfaces and understand why finding the minimum is challenging in high dimensions
- Explain gradient descent as the fundamental algorithm for navigating loss landscapes
- Connect loss functions to learning objectives and understand why different tasks need different loss functions
- Recognize why backpropagation is essential—computing gradients efficiently is the key to training deep networks
Why This Matters: Before diving into the mechanics of backpropagation, you need to understand what we're trying to achieve. Learning is optimization. The loss function defines success. Gradients show us the way. This section builds the conceptual foundation for everything that follows.
What Does It Mean to Learn?
When we say a neural network "learns," we mean something very specific and mathematical. Unlike biological learning, which remains mysterious, machine learning has a precise definition:
Learning is the process of adjusting a model's parameters to minimize a measure of error on training data.
Let's unpack this definition. A neural network is a parameterized function that maps inputs to outputs. The parameters (weights and biases) determine the function's behavior. Learning means finding the parameter values that make the function produce the outputs we want.
The Key Insight
The breakthrough that enables machine learning is treating learning as an optimization problem:
θ* = argmin_θ L(θ)
Where:
- θ* is the optimal parameter vector we seek
- L(θ) is the loss function—a scalar measure of how wrong our predictions are
- argmin_θ means "find the argument θ that minimizes"
This framing is powerful because optimization is a well-studied field. We can apply centuries of mathematical insights to make machines learn.
A Concrete Example
Consider training a network to predict house prices from features like square footage and location. Given training data {(x₁, y₁), …, (xₙ, yₙ)}, we want parameters θ that make predictions ŷᵢ = f(xᵢ; θ) close to actual prices yᵢ.
Using mean squared error as our loss function:
L(θ) = (1/n) Σᵢ (f(xᵢ; θ) − yᵢ)²
Learning means finding θ that makes this loss as small as possible.
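As a concrete numeric illustration, here is the MSE computation in NumPy. The prices and predictions are made-up example values, not real data:

```python
import numpy as np

# Hypothetical house prices (in $1000s): actual values vs. model predictions
y_true = np.array([250.0, 480.0, 310.0, 195.0])
y_pred = np.array([260.0, 450.0, 300.0, 210.0])

# Mean squared error: average of squared prediction errors
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # 331.25
```

Note how the single 30-unit error contributes 900 to the sum, far more than the three smaller errors combined; squaring makes large mistakes dominate the loss.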
The Learning Loop
All neural network training follows the same iterative pattern. Understanding this loop is crucial—it's the heartbeat of deep learning.
One iteration of the loop:
1. Input Data: sample a batch of training examples (x, y)
2. Forward Pass: compute predictions ŷ = f(x; θ)
3. Compute Loss: measure error L(ŷ, y)
4. Backward Pass: compute gradients ∇θL
5. Update Weights: apply gradient descent θ ← θ - η∇θL
6. Iterate: repeat until convergence
Input Data
What happens: We sample a mini-batch of training examples from our dataset. Each example consists of an input x and a target output y.
Mini-batch size is typically 32, 64, or 128 examples. Smaller batches add noise but enable more frequent updates.
The Four Stages
1. Forward Pass
We feed input data through the network, layer by layer, computing predictions. Each layer applies its transformation:
a⁽ˡ⁾ = σ(W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾)
where W⁽ˡ⁾ and b⁽ˡ⁾ are the layer's weights and biases and σ is its activation function. The output of the final layer is our prediction ŷ.
2. Loss Computation
We compare predictions to ground truth using a loss function. This produces a single scalar number that summarizes how wrong we are:
L(ŷ, y)
3. Backward Pass (Backpropagation)
Using the chain rule of calculus, we compute how each parameter contributed to the loss. We calculate gradients ∇θL—the direction of steepest increase in loss.
4. Parameter Update
We adjust parameters in the opposite direction of the gradient, reducing the loss:
θ ← θ − η∇θL
Where η is the learning rate—a crucial hyperparameter controlling step size.
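To see the four stages in miniature, here is a single hand-computable iteration on the toy loss L(θ) = θ². This is not a neural network, just the arithmetic of one update:

```python
# One iteration of the learning loop on the toy loss L(θ) = θ²
theta = 3.0
eta = 0.1                   # learning rate

loss = theta ** 2           # forward pass + loss computation: L = 9.0
grad = 2 * theta            # backward pass: dL/dθ = 2θ = 6.0
theta = theta - eta * grad  # update: θ ← 3.0 - 0.1 * 6.0 = 2.4

print(theta)  # 2.4
```

Repeating this update shrinks θ toward 0, the minimizer of θ², by a factor of 0.8 per step.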
Why This Works
Each update moves the parameters a small step in the direction that most decreases the loss. Individually the steps are tiny, but repeated over thousands of iterations they steadily drive the loss down, and that steady reduction of error is exactly what we mean by learning.
The Optimization Perspective
To truly understand learning, we need to think about it geometrically. The loss function defines a loss surface over parameter space.
The Loss Surface
Imagine the parameters as coordinates in a high-dimensional space. For a network with 1 million parameters, this is a 1,000,000-dimensional space! At each point (parameter configuration), the loss function assigns a height (the loss value).
For visualization, consider just two parameters: the loss surface is then a 3D landscape. Hills represent high loss (bad predictions). Valleys represent low loss (good predictions). Training is like rolling a ball downhill to find the lowest point.
Interactive Loss Surface Explorer
Adjust the weight and bias sliders to see how the loss changes. The yellow arrow shows the direction of steepest descent.
Quick Check
In the loss surface visualization above, what do the colors represent?
What Makes Optimization Hard?
In two dimensions, finding the minimum seems easy—just roll downhill! But real loss surfaces are much more complex:
| Challenge | Description | Impact on Training |
|---|---|---|
| High Dimensionality | Millions of parameters = millions of dimensions | Visualization impossible, intuition breaks down |
| Local Minima | Valleys that aren't the deepest | May get stuck at suboptimal solutions |
| Saddle Points | Flat in some directions, curved in others | Gradients near zero but not at minimum |
| Plateaus | Flat regions with small gradients | Training stalls, appears to stop learning |
| Sharp Valleys | Narrow, steep-sided minima | Generalization issues, sensitivity to perturbations |
| Ill-Conditioning | Very different curvature in different directions | Slow convergence, oscillations |
The Surprising Truth
In very high dimensions, bad local minima turn out to matter less than intuition suggests: most points where the gradient vanishes are saddle points, and the local minima gradient descent does find typically have loss close to the global minimum. In practice, the main obstacles are saddle points, plateaus, and ill-conditioning rather than getting trapped in a deep-but-wrong valley.
Visualizing the Loss Landscape
The interactive visualization below shows a 3D loss surface. Watch as gradient descent navigates this landscape, following the negative gradient at each step. Try different learning rates to see their effects.
3D Loss Landscape
Drag to rotate the loss surface. Watch gradient descent find the minimum.
What You're Seeing
- The surface: A loss function with two parameters, showing a global minimum and some local structure
- The red ball: Current parameter position during optimization
- The yellow path: Trajectory taken by gradient descent
- The green marker: Location of the global minimum
Experiment Ideas
- Try a very high learning rate (0.4+). What happens?
- Try a very low learning rate (0.02). How does convergence speed change?
- Drag to rotate and see the surface from different angles. Notice the "valley" structure.
Gradient Descent: Following the Slope
Gradient descent is the algorithm that enables neural network training. The idea is beautifully simple: the gradient tells us the direction of steepest ascent, so we walk in the opposite direction.
The Gradient
The gradient is a vector of partial derivatives:
∇L = [∂L/∂θ₁, ∂L/∂θ₂, …, ∂L/∂θₙ]
Each component tells us how much the loss changes when we slightly perturb that parameter. The gradient points in the direction where loss increases fastest.
The Update Rule
Gradient descent iteratively updates parameters:
θₜ₊₁ = θₜ − η∇L(θₜ)
Where:
- θₜ = parameters at iteration t
- η = learning rate (step size)
- ∇L(θₜ) = gradient at current position
Gradient Descent in 2D
The Learning Rate Dilemma
The learning rate is critical:
| Learning Rate | Behavior | Risk |
|---|---|---|
| Too large | Big steps, fast initial progress | Overshooting, oscillations, divergence |
| Too small | Careful steps, stable convergence | Extremely slow, may get stuck |
| Just right | Balanced progress | Still depends on landscape topology |
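The dilemma is easy to reproduce on the toy loss L(θ) = θ², whose gradient is 2θ. The specific thresholds below apply only to this function, not to losses in general:

```python
# Gradient descent on L(θ) = θ², starting from θ = 3.0, with different learning rates
def run(eta, steps=20, theta=3.0):
    for _ in range(steps):
        theta -= eta * 2 * theta  # update: θ ← θ - η · dL/dθ
    return theta

print(run(0.01))  # too small: after 20 steps, still far from the minimum at 0
print(run(0.4))   # well-chosen: essentially at the minimum
print(run(1.1))   # too large: every step overshoots, so |θ| grows (divergence)
```

For this loss each update multiplies θ by (1 − 2η), so any η above 1.0 flips the sign and increases the magnitude at every step.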
Quick Check
If gradient descent starts oscillating without converging, what should you try first?
Loss Functions: Measuring Error
The loss function is the objective we're optimizing. It quantifies how wrong our predictions are. Different tasks require different loss functions.
Mean Squared Error (MSE)
For regression (predicting continuous values):
MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²
Intuition: Penalizes predictions proportionally to the square of their error. Large errors are penalized much more than small ones.
Cross-Entropy Loss
For classification (predicting categories):
L(ŷ, y) = −Σ_c y_c log(ŷ_c)
where y_c is 1 for the true class and 0 otherwise, and ŷ_c is the predicted probability of class c.
Intuition: Measures the difference between predicted probability distribution and true distribution. Heavily penalizes confident wrong predictions.
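A small numeric illustration of that intuition, using hypothetical predicted probabilities for a 3-class problem:

```python
import numpy as np

# Cross-entropy for one example whose true class is index 0 (one-hot target)
y_true = np.array([1.0, 0.0, 0.0])

def cross_entropy(y_pred):
    # Only the true class's log-probability contributes, since y_true is one-hot
    return -np.sum(y_true * np.log(y_pred))

print(cross_entropy(np.array([0.9, 0.05, 0.05])))  # confident and right: ~0.105
print(cross_entropy(np.array([0.4, 0.3, 0.3])))    # unsure: ~0.916
print(cross_entropy(np.array([0.01, 0.9, 0.09])))  # confident and wrong: ~4.605
```

Being confidently wrong costs roughly 40 times as much as being confidently right, which is exactly the pressure that pushes classifiers toward calibrated probabilities.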
Binary Cross-Entropy
For binary classification:
L(ŷ, y) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
Why Different Loss Functions?
| Task | Loss Function | Reason |
|---|---|---|
| Regression | MSE, MAE, Huber | Predictions are continuous values |
| Binary Classification | Binary Cross-Entropy | Output is probability of class 1 |
| Multi-class Classification | Cross-Entropy + Softmax | Output is probability distribution over classes |
| Object Detection | Combination losses | Multiple objectives: location + classification |
Loss Functions as Assumptions
A loss function is not a neutral choice: it encodes statistical assumptions about your data. Minimizing MSE is equivalent to maximum-likelihood estimation under Gaussian output noise, while minimizing cross-entropy is maximum likelihood for a categorical output distribution. Picking a loss function means picking a model of how your targets were generated.
The Statistical Learning Perspective
So far we've discussed fitting training data. But the real goal is generalization—performing well on new, unseen data.
The True Objective
We actually want to minimize the expected risk over the data distribution:
R(θ) = E_(x,y)∼P [L(f(x; θ), y)]
But we don't know the true data distribution P(x, y)! We only have samples.
Empirical Risk Minimization
Instead, we minimize the empirical risk—the average loss over our training set:
R̂(θ) = (1/n) Σᵢ L(f(xᵢ; θ), yᵢ)
This works when training data is representative of the true distribution. The gap between training loss and test loss tells us how well we've generalized.
The Bias-Variance Tradeoff
Two sources of error in learning:
- Bias: Error from the model being too simple (underfitting). A linear model can't learn XOR, no matter how much data you have.
- Variance: Error from the model being too sensitive to training data (overfitting). A very complex model might memorize noise.
Deep learning mitigates this through:
- High capacity models (low bias)
- Regularization techniques (controlled variance)
- Massive datasets (reduces overfitting)
- Careful architecture design (inductive biases)
The Challenge: Computing Gradients
We've established that gradient descent requires gradients. But how do we compute ∇θL for a neural network with millions of parameters?
Three Approaches
1. Numerical Differentiation
Approximate derivatives using finite differences:
∂L/∂θᵢ ≈ [L(θ + ε·eᵢ) − L(θ − ε·eᵢ)] / (2ε)
where eᵢ is the unit vector along parameter i and ε is a small constant.
Problem: Requires 2 forward passes per parameter. For 1 million parameters, that's 2 million forward passes per gradient! Completely impractical.
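To make the cost concrete, here is central-difference estimation on a toy two-parameter loss. The loss function is an arbitrary illustrative choice; note the two loss evaluations per parameter:

```python
import numpy as np

# Central-difference gradient: 2 loss evaluations ("forward passes") per parameter
def numerical_grad(f, theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

# Toy loss L(θ) = θ₀² + 3θ₁, so the analytic gradient is [2θ₀, 3]
def loss(theta):
    return theta[0] ** 2 + 3 * theta[1]

theta = np.array([2.0, -1.0])
print(numerical_grad(loss, theta))  # close to the analytic [4.0, 3.0]
```

For 2 parameters this is fine; for a million parameters the loop runs a million times, each iteration as expensive as a full forward pass, which is why this method is used only to sanity-check analytic gradients.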
2. Symbolic Differentiation
Apply calculus rules to derive gradient expressions analytically.
Problem: Expression size grows exponentially with network depth. Memory-intensive and computationally explosive.
3. Automatic Differentiation (Backpropagation)
Compute gradients through reverse-mode automatic differentiation—the famous backpropagation algorithm.
Advantage: Computes all gradients in essentially the same time as one forward pass. This is the breakthrough that makes deep learning possible!
The Key Insight
Backpropagation applies the chain rule systematically from the output back toward the input, caching and reusing intermediate values from the forward pass. Every parameter's gradient falls out of a single backward sweep, so the total cost is a small constant multiple of one forward pass, no matter how many parameters the network has.
Implementation Preview
Here's a preview of how these concepts translate to PyTorch code. We'll build this understanding fully in subsequent sections.
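As a minimal sketch, the loop looks like this in PyTorch. The data is a toy linear-regression problem and the hyperparameters (lr=0.1, 200 steps) are illustrative choices, not recommendations:

```python
import torch

# Toy data: y = 2x + 0.5 plus noise
torch.manual_seed(0)
X = torch.randn(100, 1)
y = 2 * X + 0.5 + 0.1 * torch.randn(100, 1)

model = torch.nn.Linear(1, 1)                        # parameterized function f(x; θ)
loss_fn = torch.nn.MSELoss()                         # loss function L(ŷ, y)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    y_pred = model(X)             # forward pass
    loss = loss_fn(y_pred, y)     # compute loss
    optimizer.zero_grad()         # clear gradients from the previous iteration
    loss.backward()               # backward pass: autograd computes ∇θL
    optimizer.step()              # update: θ ← θ - η∇θL

print(model.weight.item(), model.bias.item())  # close to 2.0 and 0.5
```

Each line of the loop body corresponds to one stage of the learning loop diagrammed earlier; `loss.backward()` is where backpropagation happens.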
Summary
This section established the conceptual foundation for understanding backpropagation and neural network training.
Key Concepts
| Concept | Definition | Key Equation |
|---|---|---|
| Learning | Adjusting parameters to minimize loss | θ* = argmin L(θ) |
| Loss Function | Scalar measure of prediction error | L(ŷ, y) |
| Gradient | Direction of steepest increase | ∇L = [∂L/∂θ₁, ∂L/∂θ₂, ...] |
| Gradient Descent | Update rule to minimize loss | θ ← θ - η∇L |
| Learning Rate | Step size for updates | η (hyperparameter) |
| Forward Pass | Compute predictions from input | ŷ = f(x; θ) |
| Backward Pass | Compute gradients via chain rule | Backpropagation |
The Big Picture
Learning is optimization. We have a parameterized model, a loss function that measures error, and an update rule (gradient descent) that adjusts parameters to reduce error. The key challenge is computing gradients efficiently—this is what backpropagation solves.
Coming Up: In the next sections, we'll dive deep into gradient descent variants, computational graphs, and the backpropagation algorithm itself. You'll implement backprop from scratch and understand exactly how modern frameworks like PyTorch compute gradients automatically.
Exercises
Conceptual Questions
- Why is the loss function always a scalar? What would happen if it were a vector?
- Explain why gradient descent moves in the negative gradient direction. What would happen if we moved in the positive gradient direction?
- In the loss surface visualization, the path taken by gradient descent isn't a straight line to the minimum. Why does it curve?
- What's the difference between batch gradient descent (using all training data), stochastic gradient descent (using one example), and mini-batch gradient descent (using a subset)?
Mathematical Exercises
- For the squared-error loss L(w) = (wx − y)² on a single data point (x, y), derive the gradient dL/dw.
- Show that for MSE loss, the gradient with respect to a prediction ŷᵢ is (2/n)(ŷᵢ − yᵢ).
- Prove that gradient descent converges to the minimum of a convex quadratic L(θ) = (a/2)θ² with a > 0, given an appropriate learning rate (0 < η < 2/a).
Coding Exercises
```python
# Exercise: Implement gradient descent for linear regression from scratch
# Given: Data points, initial weights, learning rate, number of iterations
# Goal: Find optimal weights that minimize MSE

import numpy as np

# Generate data
np.random.seed(42)
X = np.random.randn(100, 1)
y = 2 * X + 0.5 + 0.1 * np.random.randn(100, 1)

# TODO: Implement gradient descent
def gradient_descent(X, y, lr=0.1, n_iters=100):
    w = np.zeros((1, 1))
    b = 0.0
    m = len(X)
    losses = []

    for i in range(n_iters):
        # Forward pass: compute predictions
        y_pred = ...  # TODO

        # Compute loss
        loss = ...  # TODO
        losses.append(loss)

        # Compute gradients
        dw = ...  # TODO
        db = ...  # TODO

        # Update parameters
        w = ...  # TODO
        b = ...  # TODO

    return w, b, losses

# Test your implementation
w_final, b_final, losses = gradient_descent(X, y)
print(f"Learned: w = {w_final[0,0]:.3f}, b = {b_final:.3f}")
print("True:    w = 2.0, b = 0.5")
```
Expected Results
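For reference, here is one possible solution sketch. The gradient formulas follow from differentiating the MSE loss with respect to w and b:

```python
import numpy as np

# One possible solution: batch gradient descent on MSE for linear regression
np.random.seed(42)
X = np.random.randn(100, 1)
y = 2 * X + 0.5 + 0.1 * np.random.randn(100, 1)

def gradient_descent(X, y, lr=0.1, n_iters=100):
    w = np.zeros((1, 1))
    b = 0.0
    m = len(X)
    losses = []
    for _ in range(n_iters):
        y_pred = X @ w + b                   # forward pass: predictions
        loss = np.mean((y_pred - y) ** 2)    # MSE loss
        losses.append(loss)
        dw = (2 / m) * X.T @ (y_pred - y)    # dL/dw
        db = (2 / m) * np.sum(y_pred - y)    # dL/db
        w = w - lr * dw                      # gradient descent updates
        b = b - lr * db
    return w, b, losses

w_final, b_final, losses = gradient_descent(X, y)
print(f"Learned: w = {w_final[0,0]:.3f}, b = {b_final:.3f}")  # close to 2.0 and 0.5
```

With this seed and these hyperparameters, the learned parameters land near the true values (w = 2.0, b = 0.5), and the recorded losses decrease monotonically.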
You now understand the fundamental learning problem that backpropagation solves. In the next section, we'll explore gradient descent variants—the practical algorithms that make training neural networks efficient.