Learning Objectives
By the end of this section, you will be able to:
- Understand the optimizer landscape: Know the key optimizers (SGD, Momentum, RMSprop, Adam) and the problems each one solves
- Derive optimizer update rules: Understand the mathematical formulation behind each optimizer and what each term contributes
- Apply physical intuition: Connect momentum and adaptive learning rates to physics concepts like velocity, friction, and acceleration
- Implement optimizers in PyTorch: Use PyTorch's optimizer API and understand when to choose each optimizer
- Diagnose optimization problems: Recognize when training issues stem from optimizer choice and how to fix them
Why This Matters: The optimizer determines how your network learns. A well-chosen optimizer can mean the difference between a model that converges in hours versus days—or one that never converges at all. Understanding optimizers is essential for training deep networks effectively.
The Big Picture
The Problem We're Solving
In the previous section, we established the training loop: forward pass, compute loss, backward pass, update weights. The optimizer is the algorithm that performs that final step—deciding exactly how to adjust each weight based on its gradient.
The simplest approach is vanilla gradient descent: move each weight in the direction opposite to its gradient, scaled by a learning rate. But this simple approach has serious problems:
- Oscillation: In ravines (long, narrow valleys in the loss landscape), gradients point steeply across the valley but only gently along it. SGD oscillates back and forth across the ravine instead of moving efficiently toward the minimum.
- Uniform learning rates: All parameters get the same learning rate, but some features are sparse (appear rarely) while others are dense. Sparse features need larger updates when they appear.
- Local minima and saddle points: With no momentum, the optimizer can get stuck in suboptimal regions.
A Brief History
The evolution of optimizers is a story of progressively solving these problems:
| Year | Optimizer | Key Innovation | Introduced By |
|---|---|---|---|
| 1847 | Gradient Descent | Follow the negative gradient | Augustin-Louis Cauchy |
| 1964 | Momentum | Accumulate velocity to smooth updates | Boris Polyak |
| 2011 | AdaGrad | Adapt learning rate per-parameter | John Duchi et al. |
| 2012 | RMSprop | Use moving average instead of sum | Geoffrey Hinton (unpublished) |
| 2014 | Adam | Combine momentum + adaptive LR | Diederik Kingma & Jimmy Ba |
| 2017 | AdamW | Fix weight decay in Adam | Ilya Loshchilov & Frank Hutter |
Today, Adam is the default choice for most practitioners, though SGD with momentum remains competitive for some tasks (especially vision models). Let's understand why by building up from the basics.
Vanilla Stochastic Gradient Descent
The simplest optimizer updates each parameter by stepping in the direction opposite to the gradient:

θₜ₊₁ = θₜ − η·∇L(θₜ)

Where:
- θₜ is the parameter value at step t
- η (eta) is the learning rate
- ∇L(θₜ) is the gradient of the loss
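The update rule can be sketched in a few lines of plain Python. The objective f(θ) = θ², its gradient 2θ, and the learning rate are illustrative choices, not from the text:

```python
# Minimal sketch of the SGD update on f(theta) = theta**2,
# whose gradient is 2*theta. Both f and eta are illustrative.
def sgd_step(theta, grad, eta=0.1):
    return theta - eta * grad  # theta <- theta - eta * gradient

theta = 5.0
for _ in range(50):
    grad = 2 * theta           # d/d(theta) of theta**2
    theta = sgd_step(theta, grad)

print(theta)  # close to the minimum at 0
```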
Why "Stochastic"?
In true gradient descent, we would compute the gradient over the entire dataset. Stochastic gradient descent (SGD) instead computes gradients on small batches, introducing noise into the gradient estimate. This noise:
- Makes training faster (we update after each batch, not each epoch)
- Acts as regularization (noise helps escape local minima)
- Reduces memory requirements (we only need one batch in memory)
The Problem: Ravines
Consider a loss function shaped like a long, narrow valley. The gradient perpendicular to the valley (across it) is large, while the gradient along the valley (toward the minimum) is small.
What happens with vanilla SGD? The optimizer takes large steps across the valley and small steps along it. Result: it oscillates back and forth across the valley, making slow progress toward the minimum.
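The oscillation is easy to reproduce numerically. The ravine f(x, y) = x² + 100·y² below is a standard illustrative example (not from the text): it is shallow along x and steep along y, and the learning rate is chosen near the stability limit for the steep direction:

```python
# Illustrative ravine: f(x, y) = x**2 + 100 * y**2, gradient (2x, 200y).
# eta must satisfy eta < 2/200 = 0.01 to stay stable in the steep
# direction, which forces tiny steps in the shallow one.
eta = 0.0095
x, y = 1.0, 1.0
ys = []
for _ in range(10):
    gx, gy = 2 * x, 200 * y
    x, y = x - eta * gx, y - eta * gy
    ys.append(y)

# y flips sign every step (oscillation across the ravine), while x
# decays slowly (roughly 0.83 after 10 steps) toward the minimum at 0.
print(ys[:4])
print(x)
```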
The Hessian Perspective
The ravine picture can be made precise with the Hessian, the matrix of second derivatives of the loss. A ravine corresponds to an ill-conditioned Hessian: the ratio of its largest to smallest eigenvalue (the condition number) is large. Gradient descent's stable step size is limited by the direction of highest curvature, so progress along the low-curvature direction (the valley floor) is slow.
Quick Check
What causes oscillation in vanilla SGD?
Momentum: Learning from Physics
Momentum solves the oscillation problem by borrowing an idea from physics: a ball rolling down a hill accumulates velocity. If the hill curves back and forth, the ball's inertia smooths out the oscillations.
The Physical Analogy
Imagine a ball rolling on the loss surface. Without friction, the ball would accelerate indefinitely. With friction, it reaches a terminal velocity where the gravitational force (gradient) balances the friction.
Momentum introduces a velocity term that accumulates across updates:

vₜ₊₁ = μ·vₜ + ∇L(θₜ)
θₜ₊₁ = θₜ − η·vₜ₊₁

Where:
- vₜ is the velocity at step t
- μ (mu) is the momentum coefficient, typically 0.9
- ∇L(θₜ) is the current gradient
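A minimal sketch of the momentum update, run on an illustrative quadratic f(θ) = θ² with typical (but arbitrary) values of η and μ:

```python
# Momentum SGD sketch: the parameter moves along the accumulated
# velocity, not the raw gradient. f(theta) = theta**2 is illustrative.
def momentum_step(theta, v, grad, eta=0.1, mu=0.9):
    v = mu * v + grad          # accumulate velocity
    theta = theta - eta * v    # step along the velocity
    return theta, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)

print(theta)  # near the minimum at 0
```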
Why Momentum Works
In a ravine:
- Across the valley: Gradients alternate direction (left, right, left, right). The momentum term averages these out, canceling the oscillation.
- Along the valley: Gradients consistently point toward the minimum. Momentum accumulates, accelerating progress.
The velocity is an exponentially weighted moving average of past gradients. Recent gradients have more influence (the current gradient enters with weight 1) than older ones (a gradient from k steps ago carries weight μᵏ, which shrinks toward zero as k grows).
Nesterov Accelerated Gradient (NAG)
A clever variant of momentum is Nesterov momentum, which "looks ahead" before computing the gradient:

vₜ₊₁ = μ·vₜ + ∇L(θₜ − η·μ·vₜ)
θₜ₊₁ = θₜ − η·vₜ₊₁
Instead of computing the gradient at the current position, NAG computes it at the position we would reach if we followed momentum alone. This provides a correction: if momentum would overshoot, we notice sooner and slow down.
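The look-ahead evaluation can be sketched directly; the quadratic objective and the values of η and μ are illustrative:

```python
# Nesterov momentum sketch: the gradient is evaluated at the
# look-ahead point theta - eta * mu * v rather than at theta.
def nesterov_step(theta, v, grad_fn, eta=0.1, mu=0.9):
    lookahead = theta - eta * mu * v   # where momentum alone would land
    v = mu * v + grad_fn(lookahead)    # gradient at the look-ahead point
    theta = theta - eta * v
    return theta, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda t: 2 * t)  # f = theta**2

print(theta)  # near the minimum at 0
```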
Typical Values
μ = 0.9 is the standard default. μ = 0.99 gives stronger smoothing but responds more sluggishly to changes in the gradient.
Quick Check
In momentum SGD with μ = 0.9, how much weight does a gradient from 10 steps ago have compared to the current gradient?
Adaptive Learning Rates
Momentum solves oscillation, but what about the uniform learning rate problem? Not all parameters should be updated at the same rate:
- Sparse features: Some features (like rare words in NLP) appear infrequently. When they do appear, we want to learn from them quickly.
- Different scales: Weights in different layers may have gradients of different magnitudes. A learning rate that works for one layer may be too large or too small for another.
AdaGrad: Accumulate Squared Gradients
AdaGrad (Adaptive Gradient) adapts the learning rate for each parameter based on how much that parameter has been updated:

Gₜ₊₁ = Gₜ + gₜ ⊙ gₜ
θₜ₊₁ = θₜ − η·gₜ / (√Gₜ₊₁ + ε)

Where:
- Gₜ is the accumulated squared gradient (one value per parameter)
- ⊙ denotes element-wise multiplication
- ε is a small constant (e.g., 10⁻⁸) to prevent division by zero
Intuition: Parameters with large accumulated gradients get smaller learning rates; parameters with small accumulated gradients get larger learning rates. This naturally handles sparse features.
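A single-parameter sketch makes the shrinking step size visible; the gradients, η, and the quadratic objective are made up for illustration:

```python
import math

# AdaGrad sketch for one parameter on f(theta) = theta**2.
def adagrad_step(theta, G, grad, eta=0.5, eps=1e-8):
    G = G + grad * grad                            # accumulate squared gradient
    theta = theta - eta * grad / (math.sqrt(G) + eps)
    return theta, G

theta, G = 5.0, 0.0
steps = []
for _ in range(5):
    grad = 2 * theta
    new_theta, G = adagrad_step(theta, G, grad)
    steps.append(abs(new_theta - theta))           # effective step size
    theta = new_theta

print(steps)  # step sizes shrink monotonically as G accumulates
```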
AdaGrad's Problem
Because G only ever grows, the effective learning rate η/(√G + ε) shrinks monotonically. In long training runs it can decay to nearly zero, and learning stalls even when there is still progress to be made.
RMSprop: Exponential Moving Average
RMSprop (Root Mean Square Propagation) fixes AdaGrad's problem by using an exponentially weighted moving average instead of a sum:

vₜ₊₁ = β·vₜ + (1 − β)·gₜ ⊙ gₜ
θₜ₊₁ = θₜ − η·gₜ / (√vₜ₊₁ + ε)

Where β is typically 0.9 or 0.99.
Key difference: Unlike AdaGrad, RMSprop "forgets" old gradients. If a parameter stops receiving updates, its accumulated gradient decays, and its learning rate can recover.
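The forgetting behavior can be sketched directly. Here the gradient sequence, η, and β are illustrative: the parameter first receives large gradients, then none at all, and v decays:

```python
import math

# RMSprop sketch for one parameter: an exponential moving average of
# squared gradients replaces AdaGrad's ever-growing sum.
def rmsprop_step(theta, v, grad, eta=0.01, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * grad * grad        # EMA of squared gradients
    theta = theta - eta * grad / (math.sqrt(v) + eps)
    return theta, v

theta, v = 5.0, 0.0
for _ in range(10):                  # the parameter receives large gradients...
    theta, v = rmsprop_step(theta, v, 2 * theta)
v_active = v

for _ in range(50):                  # ...then stops receiving any signal
    theta, v = rmsprop_step(theta, v, 0.0)

print(v < 0.01 * v_active)  # True: v decays, so the learning rate recovers
```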
Origin of RMSprop
RMSprop was proposed by Geoffrey Hinton in Lecture 6e of his Coursera course "Neural Networks for Machine Learning" and was never formally published; the lecture slides remain the standard citation.
Quick Check
What problem does RMSprop solve that AdaGrad has?
Adam: The Best of Both Worlds
Adam (Adaptive Moment Estimation) combines the best ideas from momentum and RMSprop: it maintains both a first moment (momentum-like) and a second moment (RMSprop-like) of the gradients.
The Algorithm
At each step t, with gradient gₜ:

mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ           (first moment, like momentum)
vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ ⊙ gₜ      (second moment, like RMSprop)
m̂ₜ = mₜ / (1 − β₁ᵗ)                   (bias correction)
v̂ₜ = vₜ / (1 − β₂ᵗ)
θₜ = θₜ₋₁ − η·m̂ₜ / (√v̂ₜ + ε)

Default hyperparameters (from the original paper):
- β₁ = 0.9 (momentum decay)
- β₂ = 0.999 (RMSprop decay)
- η = 0.001 and ε = 10⁻⁸
Understanding Each Component
First Moment (m): Momentum
The first moment is an exponentially weighted average of past gradients—exactly like momentum. It smooths out oscillations and accumulates velocity in consistent directions.
Second Moment (v): Adaptive Learning Rate
The second moment is an exponentially weighted average of squared gradients—exactly like RMSprop. It provides per-parameter adaptive learning rates.
Bias Correction
Why the correction terms 1 − β₁ᵗ and 1 − β₂ᵗ? Both m and v are initialized to zero. Early in training, they're biased toward zero.

Consider m after one step (with β₁ = 0.9):

m₁ = β₁·0 + (1 − β₁)·g₁ = 0.1·g₁

This is much smaller than g₁! The bias correction divides by 1 − β₁¹ = 0.1, giving m̂₁ = g₁. As t → ∞, β₁ᵗ → 0, and the correction vanishes.
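This is easy to verify numerically. With a constant illustrative gradient g = 1.0, the raw first moment starts far below g while the corrected estimate recovers g exactly at every step:

```python
# Numeric check of the bias-correction argument: constant gradient
# g = 1.0 and beta1 = 0.9 are illustrative choices.
beta1, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
# At t = 1: m = 0.1 but m_hat = 1.0; m_hat stays at 1.0 thereafter.
```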
Adam is Robust
A major reason for Adam's popularity is that its default hyperparameters work well across a wide range of problems; it typically needs far less learning-rate tuning than SGD.
Interactive: Optimizer Comparison
Watch different optimizers navigate the same loss landscape. The Rosenbrock function creates a curved valley—a challenging test for optimizers. Observe how:
- SGD oscillates dramatically, struggling in the curved valley
- Momentum smooths the path but may overshoot
- RMSprop adapts to the landscape's shape
- Adam combines smooth movement with adaptation
The yellow dot marks the global minimum at (1, 1).
Quick Check
Based on the visualization, which optimizer typically reaches the minimum first on the Rosenbrock function?
Other Notable Optimizers
AdamW: Adam with Decoupled Weight Decay
The original Adam paper applied L2 regularization by adding the weight penalty to the loss, which means it gets scaled by the adaptive learning rate. AdamW decouples weight decay, applying it directly to the weights:

θₜ = θₜ₋₁ − η·(m̂ₜ / (√v̂ₜ + ε) + λ·θₜ₋₁)

Where λ is the weight decay coefficient.
This seemingly small change improves generalization significantly. AdamW is now the default for training transformers and many other architectures.
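The difference can be sketched for a single parameter. This is a simplification (real Adam folds the L2 term into the gradient before the moment estimates), and the m̂/v̂ inputs and all values are illustrative:

```python
import math

def adam_l2_update(theta, m_hat, v_hat, lr=1e-3, wd=0.01, eps=1e-8):
    # L2 decay enters the numerator, so it is rescaled by sqrt(v_hat)
    return theta - lr * (m_hat + wd * theta) / (math.sqrt(v_hat) + eps)

def adamw_update(theta, m_hat, v_hat, lr=1e-3, wd=0.01, eps=1e-8):
    # decoupled decay applied directly to the weight, outside the scaling
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)

theta, big_v = 1.0, 100.0   # a parameter with large typical gradients
l2_shrink = theta - adam_l2_update(theta, 0.0, big_v)
adamw_shrink = theta - adamw_update(theta, 0.0, big_v)
print(l2_shrink < adamw_shrink)  # True: L2-in-Adam under-regularizes it
```

The comparison shows the AdamW paper's point: with L2 inside the adaptive update, parameters with large typical gradients (large v̂) receive weaker effective decay, while AdamW decays all weights uniformly.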
NAdam: Nesterov + Adam
NAdam incorporates Nesterov momentum into Adam, computing the gradient at a "look-ahead" position. This can provide faster convergence in some cases.
Lion: Evolved Sign Momentum
Lion (2023) was discovered via automated search over optimizer algorithms. It uses only the sign of the momentum-interpolated update, not its magnitude:

cₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
θₜ = θₜ₋₁ − η·(sign(cₜ) + λ·θₜ₋₁)
mₜ = β₂·mₜ₋₁ + (1 − β₂)·gₜ
Lion uses less memory than Adam (no second moment) and has shown strong performance on large language models.
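A single-parameter sketch of the sign-based update (weight decay omitted for clarity; the hyperparameter values are illustrative):

```python
# Lion update sketch: only the sign of the interpolated direction is
# used, so every step has magnitude lr regardless of gradient size.
def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99):
    c = beta1 * m + (1 - beta1) * grad   # interpolated update direction
    sign = (c > 0) - (c < 0)             # keep only the sign
    theta = theta - lr * sign
    m = beta2 * m + (1 - beta2) * grad   # momentum tracks the gradient
    return theta, m

# Two very different gradient magnitudes produce the same step:
t1, _ = lion_step(0.0, 0.0, grad=100.0, lr=0.01)
t2, _ = lion_step(0.0, 0.0, grad=0.5, lr=0.01)
print(t1 == t2)  # True: the step magnitude is always lr
```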
Summary Table
| Optimizer | Memory | Momentum | Adaptive LR | Best For |
|---|---|---|---|---|
| SGD | O(n) | No | No | Baseline, simple problems |
| SGD + Momentum | O(2n) | Yes | No | Vision, when tuned well |
| RMSprop | O(2n) | No | Yes | RNNs, quick prototyping |
| Adam | O(3n) | Yes | Yes | General default |
| AdamW | O(3n) | Yes | Yes | Transformers, with regularization |
| Lion | O(2n) | Yes (sign) | No | Large models, memory-constrained |
PyTorch Implementation
PyTorch provides all major optimizers in torch.optim. Here's how to use them:
Basic Usage
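A minimal training loop with `torch.optim`; the linear model and random data are toy stand-ins, and the learning rate is an illustrative choice:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)   # toy inputs
y = torch.randn(64, 1)    # toy targets

start = loss_fn(model(x), y).item()
for step in range(200):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward pass
    loss.backward()                  # backward pass: compute gradients
    optimizer.step()                 # apply the Adam update
end = loss_fn(model(x), y).item()
print(end < start)  # True: loss decreased on the toy problem
```

The `zero_grad / backward / step` pattern is the same for every optimizer in `torch.optim`; swapping Adam for SGD or RMSprop only changes the constructor line.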
Per-Parameter Learning Rates
Sometimes you want different learning rates for different parts of the model. PyTorch supports this with parameter groups:
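A sketch of parameter groups; the `backbone`/`head` split is a hypothetical stand-in for pretrained versus freshly initialized parts of a model:

```python
import torch.nn as nn
import torch.optim as optim

class TransferModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 8)   # imagine: pretrained layers
        self.head = nn.Linear(8, 2)        # imagine: new task head

model = TransferModel()
optimizer = optim.Adam([
    {"params": model.backbone.parameters(), "lr": 1e-5},  # gentle updates
    {"params": model.head.parameters()},                  # uses the default lr
], lr=1e-3)

print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.001]
```

Groups that omit a hyperparameter inherit the optimizer-level default, so only the exceptions need to be spelled out.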
Implementing Adam from Scratch
Understanding optimizers deeply means being able to implement them yourself. Here's Adam from scratch:
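A sketch of Adam as a `torch.optim.Optimizer` subclass, following the update rules above (no weight decay, no AMSGrad). The class name is our own; with these defaults it should match `torch.optim.Adam` step for step:

```python
import torch
from torch.optim import Optimizer

class ScratchAdam(Optimizer):
    """Adam from scratch: first/second moments plus bias correction."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:                      # lazy state initialization
                    state["t"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["t"] += 1
                t, m, v, g = state["t"], state["m"], state["v"], p.grad
                m.mul_(beta1).add_(g, alpha=1 - beta1)           # first moment
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # second moment
                m_hat = m / (1 - beta1 ** t)                     # bias correction
                v_hat = v / (1 - beta2 ** t)
                p.addcdiv_(m_hat, v_hat.sqrt_().add_(group["eps"]),
                           value=-group["lr"])
```

To sanity-check an implementation like this, optimize the same parameter with `ScratchAdam` and `torch.optim.Adam` and verify the trajectories agree.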
Choosing the Right Optimizer
With so many optimizers available, how do you choose? Here are practical guidelines:
Decision Tree
- Starting a new project? Use Adam with default hyperparameters. It's robust and works well across domains.
- Training a transformer or LLM? Use AdamW. The decoupled weight decay is important for these architectures.
- Training a CNN for image classification? Consider SGD with momentum and a learning rate schedule. It often gives better final performance than Adam, though it requires more tuning.
- Memory constrained? Consider Lion or SGD with momentum. They use less memory than Adam.
- Training an RNN? RMSprop or Adam work well. Gradient clipping is often needed too.
Common Mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Using Adam's default LR for SGD | Training doesn't progress | SGD needs higher LR (0.01-0.1) |
| Forgetting weight decay | Model overfits | Add weight_decay (0.01 for AdamW) |
| Not using bias correction | Slow start to training | PyTorch optimizers handle this automatically |
| Epsilon too large | Updates too small | Use ε = 1e-8 (the default) |
| Same LR for pretrained and new layers | Pretrained features get destroyed | Use smaller LR for pretrained layers |
When in Doubt, Try Both
If the decision tree leaves you torn between Adam(W) and SGD with momentum, run a short pilot training with each and compare validation curves; a few extra experiments are usually cheaper than committing to the wrong optimizer for a long run.
Related Topics
- Chapter 8 Section 2: Gradient Descent Variants - Mathematical foundations of batch, stochastic, and mini-batch gradient descent
- Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and finding optimal learning rates
Summary
Optimizers determine how neural networks learn from gradients. Let's review the key concepts:
| Optimizer | Key Idea | Update Rule Essence |
|---|---|---|
| SGD | Follow negative gradient | θ ← θ - η·g |
| Momentum | Accumulate velocity | v ← μv + g; θ ← θ - η·v |
| RMSprop | Adapt LR per-parameter | θ ← θ - η·g / √(v + ε) |
| Adam | Momentum + adaptive LR | θ ← θ - η·m̂ / √(v̂ + ε) |
| AdamW | Adam with fixed weight decay | Adam + λθ decay applied directly |
Key Takeaways
- Momentum smooths oscillations by accumulating velocity across updates
- Adaptive learning rates handle parameters of different scales and sparsities
- Adam combines both ideas and works well out-of-the-box
- AdamW is preferred for transformers due to proper weight decay handling
- SGD + momentum can achieve better final performance with careful tuning
```python
# Quick reference: PyTorch optimizers
import torch.optim as optim

# For quick prototyping (most projects)
opt = optim.Adam(model.parameters(), lr=1e-3)

# For transformers and LLMs
opt = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# For CNNs when tuning for best performance
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# For RNNs
opt = optim.RMSprop(model.parameters(), lr=1e-3)
```
Exercises
Conceptual Questions
- Explain why momentum helps with ravines in the loss landscape. What would happen if you used momentum coefficient μ = 0.99 instead of μ = 0.9?
- Why does Adam need bias correction but vanilla momentum doesn't? What would happen to Adam's first steps without bias correction?
- You're training a model and notice that loss oscillates wildly. Which optimizer hyperparameters would you adjust, and how?
- Explain why AdamW is preferred over Adam for transformers. What goes wrong with standard Adam + weight decay?
Solution Hints
- Q1: Higher momentum means more smoothing but slower response to gradient changes. μ=0.99 would be very sluggish and might overshoot more.
- Q2: Momentum starts from zero but quickly gets close to the gradient average. Adam's moments are exponentially weighted averages initialized to zero, which biases them toward zero early on.
- Q3: Reduce learning rate first. If using SGD, add or increase momentum. Consider switching to Adam if not already using it.
- Q4: In Adam, the weight-decay term is divided by the adaptive √v̂ factor, so parameters with large typical gradient magnitudes receive less regularization than intended. AdamW applies weight decay uniformly, outside the adaptive scaling.
Coding Exercises
- Implement SGD with momentum from scratch: Create a custom optimizer class that implements SGD with momentum (with optional Nesterov). Verify it produces the same results as PyTorch's optim.SGD.
- Optimizer comparison experiment: Train a small CNN on MNIST with SGD, SGD+momentum, RMSprop, and Adam. Plot training and validation loss curves. Which converges fastest? Which achieves the best final accuracy?
- Learning rate sensitivity analysis: Train the same model with Adam using learning rates [0.0001, 0.001, 0.01, 0.1]. Repeat with SGD. Compare how sensitive each optimizer is to learning rate choice.
- Visualize optimizer trajectories: Create a 2D loss surface (like the Rosenbrock function) and animate the trajectories of different optimizers. Highlight where each optimizer struggles or excels.
Coding Exercise Hints
- Exercise 1: Store velocity as a list of tensors, one per parameter. For Nesterov, evaluate gradient at θ - η·μ·v.
- Exercise 2: Use a simple CNN (2-3 conv layers + FC). Train for 10 epochs, plot losses with matplotlib.
- Exercise 3: Create a loop over learning rates and store results in a dictionary. Use early stopping if loss explodes.
- Exercise 4: Use matplotlib.animation. Store optimizer state history and plot paths on a contour plot.
In the next section, we'll explore learning rate strategies—techniques for changing the learning rate during training. You'll learn about warmup, cosine annealing, and other schedules that can significantly improve training.