Chapter 9: Training Neural Networks

Optimizers

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the optimizer landscape: Know the key optimizers (SGD, Momentum, RMSprop, Adam) and the problems each one solves
  2. Derive optimizer update rules: Understand the mathematical formulation behind each optimizer and what each term contributes
  3. Apply physical intuition: Connect momentum and adaptive learning rates to physics concepts like velocity, friction, and acceleration
  4. Implement optimizers in PyTorch: Use PyTorch's optimizer API and understand when to choose each optimizer
  5. Diagnose optimization problems: Recognize when training issues stem from optimizer choice and how to fix them
Why This Matters: The optimizer determines how your network learns. A well-chosen optimizer can mean the difference between a model that converges in hours versus days—or one that never converges at all. Understanding optimizers is essential for training deep networks effectively.

The Big Picture

The Problem We're Solving

In the previous section, we established the training loop: forward pass, compute loss, backward pass, update weights. The optimizer is the algorithm that performs that final step—deciding exactly how to adjust each weight based on its gradient.

The simplest approach is vanilla gradient descent: move each weight in the direction opposite to its gradient, scaled by a learning rate. But this simple approach has serious problems:

  • Oscillation: In ravines (long, narrow valleys in the loss landscape), gradients point steeply across the valley but only gently along it. SGD oscillates back and forth across the ravine instead of moving efficiently toward the minimum.
  • Uniform learning rates: All parameters get the same learning rate, but some features are sparse (appear rarely) while others are dense. Sparse features need larger updates when they appear.
  • Local minima and saddle points: With no momentum, the optimizer can get stuck in suboptimal regions.

A Brief History

The evolution of optimizers is a story of progressively solving these problems:

| Year | Optimizer | Key Innovation | Introduced By |
|------|-----------|----------------|---------------|
| 1847 | Gradient Descent | Follow the negative gradient | Augustin-Louis Cauchy |
| 1964 | Momentum | Accumulate velocity to smooth updates | Boris Polyak |
| 2011 | AdaGrad | Adapt learning rate per-parameter | John Duchi et al. |
| 2012 | RMSprop | Use moving average instead of sum | Geoffrey Hinton (unpublished) |
| 2014 | Adam | Combine momentum + adaptive LR | Diederik Kingma & Jimmy Ba |
| 2017 | AdamW | Fix weight decay in Adam | Ilya Loshchilov & Frank Hutter |

Today, Adam is the default choice for most practitioners, though SGD with momentum remains competitive for some tasks (especially vision models). Let's understand why by building up from the basics.


Vanilla Stochastic Gradient Descent

The simplest optimizer updates each parameter by stepping in the direction opposite to the gradient:

$$\theta_{t+1} = \theta_t - \eta \cdot g_t$$

Where:

  • $\theta_t$ is the parameter value at step $t$
  • $\eta$ (eta) is the learning rate
  • $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss
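To make the update rule concrete, here is a minimal sketch in plain Python (no PyTorch); the toy loss $L(\theta) = (\theta - 3)^2$ and the name `sgd_step` are illustrative choices, not library API:

```python
# Vanilla SGD on the toy 1-D loss L(theta) = (theta - 3)^2,
# whose gradient is g = 2 * (theta - 3).
def sgd_step(theta, grad, lr):
    """theta_{t+1} = theta_t - lr * g_t"""
    return theta - lr * grad

theta, lr = 0.0, 0.1
for _ in range(50):
    g = 2 * (theta - 3.0)         # gradient of the loss at theta
    theta = sgd_step(theta, g, lr)

print(round(theta, 4))            # converges toward the minimum at theta = 3
```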

Why "Stochastic"?

In true gradient descent, we would compute the gradient over the entire dataset. Stochastic gradient descent (SGD) instead computes gradients on small batches, introducing noise into the gradient estimate. This noise:

  • Makes training faster (we update after each batch, not each epoch)
  • Acts as regularization (noise helps escape local minima)
  • Reduces memory requirements (we only need one batch in memory)

The Problem: Ravines

Consider a loss function shaped like a long, narrow valley. The gradient perpendicular to the valley (across it) is large, while the gradient along the valley (toward the minimum) is small.

What happens with vanilla SGD? The optimizer takes large steps across the valley and small steps along it. Result: it oscillates back and forth across the valley, making slow progress toward the minimum.

$$\text{Progress along valley} \ll \text{Oscillation across valley}$$

The Hessian Perspective

Mathematically, ravines occur when the Hessian (the matrix of second derivatives) has eigenvalues of very different magnitudes. The condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ measures this: higher condition numbers mean more severe ravines and slower convergence.
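A quick numerical illustration (a sketch using a diagonal Hessian, so the eigenvalues can be read off directly):

```python
# A 2-D quadratic ravine L(x, y) = 0.5 * (100 * x**2 + 1 * y**2)
# has Hessian diag(100, 1), so the eigenvalues are 100 and 1.
lam_max, lam_min = 100.0, 1.0
kappa = lam_max / lam_min
print(kappa)            # 100.0 -- a severe ravine

# Gradient descent on a quadratic is only stable when lr < 2 / lam_max,
# yet progress along the shallow direction scales with lr * lam_min,
# so a large kappa forces small steps and slow progress along the valley.
max_stable_lr = 2 / lam_max
print(max_stable_lr)    # 0.02
```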

Quick Check

What causes oscillation in vanilla SGD?


Momentum: Learning from Physics

Momentum solves the oscillation problem by borrowing an idea from physics: a ball rolling down a hill accumulates velocity. If the hill curves back and forth, the ball's inertia smooths out the oscillations.

The Physical Analogy

Imagine a ball rolling on the loss surface. Without friction, the ball would accelerate indefinitely. With friction, it reaches a terminal velocity where the gravitational force (gradient) balances the friction.

Momentum introduces a velocity term that accumulates across updates:

$$
\begin{aligned}
v_t &= \mu \cdot v_{t-1} + g_t \\
\theta_{t+1} &= \theta_t - \eta \cdot v_t
\end{aligned}
$$

Where:

  • $v_t$ is the velocity at step $t$
  • $\mu$ (mu) is the momentum coefficient, typically 0.9
  • $g_t$ is the current gradient
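A plain-Python sketch shows how the velocity behaves in the two directions of a ravine; the alternating and constant gradient streams below are illustrative stand-ins for the across-valley and along-valley components:

```python
# Momentum update for one parameter (plain Python).
def momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    v = mu * v + grad            # v_t = mu * v_{t-1} + g_t
    return theta - lr * v, v     # theta_{t+1} = theta_t - lr * v_t

# Alternating gradients (the across-valley direction) largely cancel:
v_osc = 0.0
for t in range(100):
    g = 1.0 if t % 2 == 0 else -1.0
    v_osc = 0.9 * v_osc + g      # settles into a small +/-0.53 cycle

# Consistent gradients (the along-valley direction) accumulate:
v_acc = 0.0
for _ in range(100):
    v_acc = 0.9 * v_acc + 1.0    # approaches 1 / (1 - 0.9) = 10

print(abs(v_osc) < 1.0, round(v_acc, 2))   # True 10.0
```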

Why Momentum Works

In a ravine:

  • Across the valley: Gradients alternate direction (left, right, left, right). The momentum term averages these out, canceling the oscillation.
  • Along the valley: Gradients consistently point toward the minimum. Momentum accumulates, accelerating progress.

$$v_t = g_t + \mu g_{t-1} + \mu^2 g_{t-2} + \cdots = \sum_{i=0}^{t} \mu^{t-i} g_i$$

The velocity is an exponentially weighted moving average of past gradients. Recent gradients have more influence (weight $\mu^0 = 1$) than older ones (weight $\mu^t \to 0$ as $t \to \infty$).

Nesterov Accelerated Gradient (NAG)

A clever variant of momentum is Nesterov momentum, which "looks ahead" before computing the gradient:

$$
\begin{aligned}
v_t &= \mu \cdot v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t - \eta \mu v_{t-1}) \\
\theta_{t+1} &= \theta_t - \eta \cdot v_t
\end{aligned}
$$

Instead of computing the gradient at the current position, NAG computes it at the position we would reach if we followed momentum alone. This provides a correction: if momentum would overshoot, we notice sooner and slow down.
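A sketch of the look-ahead step in plain Python; `nesterov_step`, `grad_fn`, and the toy quadratic loss are illustrative, not library API:

```python
# Nesterov momentum: evaluate the gradient at the look-ahead point
# theta - lr * mu * v instead of at theta itself.
def nesterov_step(theta, v, grad_fn, lr=0.1, mu=0.9):
    lookahead = theta - lr * mu * v
    v = mu * v + grad_fn(lookahead)
    return theta - lr * v, v

# On the toy loss L(theta) = (theta - 3)^2:
grad = lambda th: 2 * (th - 3.0)
theta, v = 0.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad)
print(round(theta, 3))   # settles near the minimum at 3
```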

Typical Values

Momentum coefficient: 0.9 is the standard choice. Higher values (0.99) give more smoothing but slower response to changes. Lower values (0.8) respond faster but smooth less.

Quick Check

In momentum SGD with μ = 0.9, how much weight does a gradient from 10 steps ago have compared to the current gradient?


Adaptive Learning Rates

Momentum solves oscillation, but what about the uniform learning rate problem? Not all parameters should be updated at the same rate:

  • Sparse features: Some features (like rare words in NLP) appear infrequently. When they do appear, we want to learn from them quickly.
  • Different scales: Weights in different layers may have gradients of different magnitudes. A learning rate that works for one layer may be too large or too small for another.

AdaGrad: Accumulate Squared Gradients

AdaGrad (Adaptive Gradient) adapts the learning rate for each parameter based on how much that parameter has been updated:

$$
\begin{aligned}
r_t &= r_{t-1} + g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t
\end{aligned}
$$

Where:

  • $r_t$ is the accumulated squared gradient (one value per parameter)
  • $\odot$ denotes element-wise multiplication
  • $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero

Intuition: Parameters with large accumulated gradients get smaller learning rates; parameters with small accumulated gradients get larger learning rates. This naturally handles sparse features.
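A one-parameter sketch in plain Python (`adagrad_step` is an illustrative name) makes the shrinking step size visible:

```python
# AdaGrad step for a single parameter (element-wise for vectors).
def adagrad_step(theta, r, grad, lr=0.1, eps=1e-8):
    r = r + grad * grad                          # accumulate squared gradient
    theta = theta - lr / (r ** 0.5 + eps) * grad
    return theta, r

# With a constant gradient, each step is smaller than the last:
theta, r = 0.0, 0.0
step_sizes = []
for _ in range(3):
    new_theta, r = adagrad_step(theta, r, 1.0)
    step_sizes.append(round(theta - new_theta, 4))
    theta = new_theta
print(step_sizes)   # roughly [0.1, 0.0707, 0.0577]: lr / sqrt(t)
```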

AdaGrad's Problem

The accumulated squared gradient $r_t$ only grows. Over time, the effective learning rate $\eta / \sqrt{r_t}$ shrinks to zero, and learning stops completely. This makes AdaGrad unsuitable for long training runs.

RMSprop: Exponential Moving Average

RMSprop (Root Mean Square Propagation) fixes AdaGrad's problem by using an exponentially weighted moving average instead of a sum:

$$
\begin{aligned}
v_t &= \beta \cdot v_{t-1} + (1 - \beta) \cdot g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t
\end{aligned}
$$

Where $\beta$ is typically 0.9 or 0.99.

Key difference: Unlike AdaGrad, RMSprop "forgets" old gradients. If a parameter stops receiving updates, its accumulated gradient decays, and its learning rate can recover.
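The "forgetting" is easy to see in a plain-Python sketch of the moving average alone:

```python
# RMSprop's moving average decays: if a parameter stops receiving
# gradients, its squared-gradient estimate v shrinks geometrically.
beta = 0.9
v = 1.0                                    # some accumulated squared gradient
for _ in range(50):
    v = beta * v + (1 - beta) * 0.0 ** 2   # fifty steps of zero gradient
print(v < 0.01)   # True: the parameter's effective learning rate recovers
```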

Origin of RMSprop

Geoffrey Hinton introduced RMSprop in Lecture 6e of his Coursera course "Neural Networks for Machine Learning" (2012). It was never formally published in a paper, yet it became one of the most widely used optimizers!

Quick Check

What problem does RMSprop solve that AdaGrad has?


Adam: The Best of Both Worlds

Adam (Adaptive Moment Estimation) combines the best ideas from momentum and RMSprop: it maintains both a first moment (momentum-like) and a second moment (RMSprop-like) of the gradients.

The Algorithm

$$
\begin{aligned}
m_t &= \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t && \text{(first moment estimate)} \\
v_t &= \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t \odot g_t && \text{(second moment estimate)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} && \text{(bias correction)} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} && \text{(bias correction)} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\end{aligned}
$$

Default hyperparameters (from the original paper):

  • $\beta_1 = 0.9$ (momentum decay)
  • $\beta_2 = 0.999$ (RMSprop decay)
  • $\epsilon = 10^{-8}$
  • $\eta = 0.001$

Understanding Each Component

First Moment (m): Momentum

The first moment $m_t$ is an exponentially weighted average of past gradients, exactly like momentum. It smooths out oscillations and accumulates velocity in consistent directions.

Second Moment (v): Adaptive Learning Rate

The second moment $v_t$ is an exponentially weighted average of squared gradients, exactly like RMSprop. It provides per-parameter adaptive learning rates.

Bias Correction

Why the correction terms $\hat{m}_t$ and $\hat{v}_t$? Both $m_t$ and $v_t$ are initialized to zero. Early in training, they're biased toward zero.

Consider $m_t$ after one step:

$$m_1 = (1 - \beta_1) \cdot g_1 = 0.1 \cdot g_1$$

This is much smaller than $g_1$! The bias correction divides by $1 - \beta_1^1 = 0.1$, giving $\hat{m}_1 = g_1$. As $t \to \infty$, $\beta_1^t \to 0$, and the correction vanishes.
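The correction can be checked numerically in plain Python (assuming a constant gradient of 1 for illustration):

```python
beta1 = 0.9
m = 0.0
g = 1.0                              # pretend the true gradient is always 1
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# Raw m starts badly biased (0.1, 0.19, 0.271), but m_hat is 1.0 at
# every step, exactly matching the true gradient.
```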

Adam is Robust

Adam's great advantage is that it works well "out of the box" for a wide range of problems. The default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.001$) rarely need tuning, making Adam the go-to choice for rapid prototyping.

Interactive: Optimizer Comparison

Watch different optimizers navigate the same loss landscape. The Rosenbrock function creates a curved valley—a challenging test for optimizers. Observe how:

  • SGD oscillates dramatically, struggling in the curved valley
  • Momentum smooths the path but may overshoot
  • RMSprop adapts to the landscape's shape
  • Adam combines smooth movement with adaptation
[Interactive visualization: Optimizer Comparison on Loss Landscape. SGD, Momentum, and Adam trajectories start at (-1.5, 2.0); the learning rate is adjustable.]

The yellow dot marks the global minimum at (1, 1). Watch how different optimizers navigate the curved valley of the Rosenbrock function. SGD oscillates, while momentum and Adam find smoother paths.

Experiment

Try different learning rates! Notice how SGD is sensitive to the learning rate choice, while Adam is more robust. Also try selecting different combinations of optimizers to compare them directly.

Quick Check

Based on the visualization, which optimizer typically reaches the minimum first on the Rosenbrock function?


Other Notable Optimizers

AdamW: Adam with Decoupled Weight Decay

The original Adam paper applied L2 regularization by adding the weight penalty to the loss, which means it gets scaled by the adaptive learning rate. AdamW decouples weight decay, applying it directly to the weights:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

This seemingly small change improves generalization significantly. AdamW is now the default for training transformers and many other architectures.
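A sketch of the decoupled update for a single weight (plain Python; `adamw_update` is an illustrative name, and the moments are frozen at zero to isolate the decay term):

```python
def adamw_update(theta, m_hat, v_hat, lr=0.001, eps=1e-8, weight_decay=0.01):
    # The weight_decay * theta term bypasses the adaptive scaling, so
    # every weight decays at the same relative rate lr * weight_decay.
    return theta - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * theta)

# With zero gradient moments, a weight still shrinks multiplicatively:
theta = 1.0
for _ in range(1000):
    theta = adamw_update(theta, m_hat=0.0, v_hat=0.0)
print(round(theta, 4))   # about 0.99, i.e. (1 - 0.001 * 0.01) ** 1000
```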

NAdam: Nesterov + Adam

NAdam incorporates Nesterov momentum into Adam, computing the gradient at a "look-ahead" position. This can provide faster convergence in some cases.

Lion: Evolved Sign Momentum

Lion (2023) was discovered via automated search over optimizer algorithms. It uses only the sign of the momentum, not the magnitude:

$$\theta_{t+1} = \theta_t - \eta \cdot \mathrm{sign}(m_t)$$

Lion uses less memory than Adam (no second moment) and has shown strong performance on large language models.
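A simplified sketch of the sign update in plain Python (the full Lion algorithm also interpolates the current gradient into the update direction before taking the sign, which this sketch omits):

```python
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, m, grad, lr=1e-4, beta=0.9):
    m = beta * m + (1 - beta) * grad   # momentum, as usual
    return theta - lr * sign(m), m     # but only the SIGN sets the step size

# Every parameter moves by exactly +/- lr, regardless of gradient scale:
theta, m = 0.0, 0.0
theta, m = lion_step(theta, m, grad=1000.0)
print(theta)   # -0.0001
```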

Summary Table

| Optimizer | Memory | Momentum | Adaptive LR | Best For |
|-----------|--------|----------|-------------|----------|
| SGD | O(n) | No | No | Baseline, simple problems |
| SGD + Momentum | O(2n) | Yes | No | Vision, when tuned well |
| RMSprop | O(2n) | No | Yes | RNNs, quick prototyping |
| Adam | O(3n) | Yes | Yes | General default |
| AdamW | O(3n) | Yes | Yes | Transformers, with regularization |
| Lion | O(2n) | Yes (sign) | No | Large models, memory-constrained |

PyTorch Implementation

PyTorch provides all major optimizers in torch.optim. Here's how to use them:

Basic Usage

# optimizers.py
import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# SGD with momentum: add nesterov=True for Nesterov accelerated gradient,
# which often converges faster. Good for computer vision tasks when
# carefully tuned. Note the learning rate: SGD typically needs a higher
# LR (0.01-0.1) than Adam (0.001) and is more sensitive to this choice.
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Use Nesterov momentum
)

# Adam: the default choice for most deep learning tasks; it combines
# momentum and adaptive learning rates. beta1 controls momentum (first
# moment decay), beta2 controls the adaptive learning rate (second
# moment decay); the defaults work well in most cases.
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # (β₁, β₂)
    eps=1e-8
)

# AdamW: Adam with decoupled weight decay, preferred for transformers.
# Typical weight_decay values: 0.01-0.1 for transformers, 0.0001-0.001
# for other models; higher values mean stronger regularization.
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # L2 regularization strength
)

Per-Parameter Learning Rates

Sometimes you want different learning rates for different parts of the model. PyTorch supports this with parameter groups:

# per_param_lr.py
# Different learning rates for different layers. Pretrained layers already
# have good weights, so a small learning rate makes fine adjustments
# without destroying learned features; newly initialized layers (like a
# classification head) need larger updates to learn from scratch.
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained: small LR
    {'params': model.head.parameters(), 'lr': 1e-3}       # New layers: larger LR
], lr=1e-4)  # Default LR for any params not in groups

# Fine-tuning a pretrained model. Regularization can also differ per
# group: a small classifier head may not need weight decay.
optimizer = optim.AdamW([
    {'params': model.encoder.parameters(), 'lr': 1e-5, 'weight_decay': 0.01},
    {'params': model.decoder.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': model.classifier.parameters(), 'lr': 1e-3, 'weight_decay': 0.0}
])

Implementing Adam from Scratch

Understanding optimizers deeply means being able to implement them yourself. Here's Adam from scratch:

# adam_scratch.py
class AdamFromScratch:
    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0

        # Moment estimates, initialized to zeros:
        # m stores the exponentially weighted average of gradients (momentum);
        # v stores the average of squared gradients (adaptive learning rates).
        self.m = [torch.zeros_like(p) for p in self.params]  # First moment
        self.v = [torch.zeros_like(p) for p in self.params]  # Second moment

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        self.t += 1

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue

            g = p.grad.data

            # Update biased first moment estimate: blend previous momentum
            # with the current gradient (beta1=0.9 means 90% old, 10% new).
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Update biased second moment estimate: beta2=0.999 gives a
            # very smooth estimate of the squared-gradient scale.
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g

            # Bias correction: divide by (1 - beta^t) to undo the zero
            # initialization. Essential early on; vanishes as t grows.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)

            # Parameter update: step along the corrected momentum, scaled
            # by the inverse of the corrected RMS gradient.
            p.data -= self.lr * m_hat / (torch.sqrt(v_hat) + self.eps)
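To sanity-check the implementation, one Adam step can be traced by hand in plain Python, using the same formulas as step() above (the numbers are toy values chosen for illustration):

```python
# One Adam step, traced by hand.
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
theta, m, v, t = 0.5, 0.0, 0.0, 1
g = 0.2                                # current gradient

m = b1 * m + (1 - b1) * g              # first moment:  0.02
v = b2 * v + (1 - b2) * g * g          # second moment: 0.00004
m_hat = m / (1 - b1 ** t)              # bias-corrected: 0.2 (= g)
v_hat = v / (1 - b2 ** t)              # bias-corrected: 0.04 (= g * g)
theta -= lr * m_hat / (v_hat ** 0.5 + eps)

# The very first step has magnitude ~lr, because m_hat / sqrt(v_hat)
# equals g / |g| = 1 after bias correction.
print(round(theta, 6))                 # 0.499
```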

Choosing the Right Optimizer

With so many optimizers available, how do you choose? Here are practical guidelines:

Decision Tree

  1. Starting a new project? Use Adam with default hyperparameters. It's robust and works well across domains.
  2. Training a transformer or LLM? Use AdamW. The decoupled weight decay is important for these architectures.
  3. Training a CNN for image classification? Consider SGD with momentum and a learning rate schedule. It often gives better final performance than Adam, though it requires more tuning.
  4. Memory constrained? Consider Lion or SGD with momentum. They use less memory than Adam.
  5. Training an RNN? RMSprop or Adam work well. Gradient clipping is often needed too.

Common Mistakes

| Mistake | Symptom | Fix |
|---------|---------|-----|
| Using Adam's default LR for SGD | Training doesn't progress | SGD needs higher LR (0.01-0.1) |
| Forgetting weight decay | Model overfits | Add weight_decay (0.01 for AdamW) |
| Not using bias correction | Slow start to training | PyTorch optimizers handle this automatically |
| Epsilon too large | Updates too small | Use ε = 1e-8 (the default) |
| Same LR for pretrained and new layers | Pretrained features get destroyed | Use smaller LR for pretrained layers |

When in Doubt, Try Both

For important projects, it's worth comparing Adam and SGD+momentum. SGD often achieves slightly better final performance if you have time to tune the learning rate schedule, but Adam is faster to get working.

Related Topics

  • Chapter 8 Section 2: Gradient Descent Variants - Mathematical foundations of batch, stochastic, and mini-batch gradient descent
  • Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and finding optimal learning rates

Summary

Optimizers determine how neural networks learn from gradients. Let's review the key concepts:

| Optimizer | Key Idea | Update Rule Essence |
|-----------|----------|---------------------|
| SGD | Follow negative gradient | θ ← θ − η·g |
| Momentum | Accumulate velocity | v ← μv + g; θ ← θ − η·v |
| RMSprop | Adapt LR per-parameter | θ ← θ − η·g/(√v + ε) |
| Adam | Momentum + adaptive LR | θ ← θ − η·m̂/(√v̂ + ε) |
| AdamW | Adam with fixed weight decay | Adam + λθ decay applied directly |

Key Takeaways

  • Momentum smooths oscillations by accumulating velocity across updates
  • Adaptive learning rates handle parameters of different scales and sparsities
  • Adam combines both ideas and works well out-of-the-box
  • AdamW is preferred for transformers due to proper weight decay handling
  • SGD + momentum can achieve better final performance with careful tuning

# quick_reference.py
# Quick reference: PyTorch optimizers
import torch.optim as optim

# For quick prototyping (most projects)
opt = optim.Adam(model.parameters(), lr=1e-3)

# For transformers and LLMs
opt = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# For CNNs when tuning for best performance
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# For RNNs
opt = optim.RMSprop(model.parameters(), lr=1e-3)

Exercises

Conceptual Questions

  1. Explain why momentum helps with ravines in the loss landscape. What would happen if you used momentum coefficient $\mu = 0.99$ instead of $0.9$?
  2. Why does Adam need bias correction but vanilla momentum doesn't? What would happen to Adam's first steps without bias correction?
  3. You're training a model and notice that loss oscillates wildly. Which optimizer hyperparameters would you adjust, and how?
  4. Explain why AdamW is preferred over Adam for transformers. What goes wrong with standard Adam + weight decay?

Solution Hints

  1. Q1: Higher momentum means more smoothing but slower response to gradient changes. μ=0.99 would be very sluggish and might overshoot more.
  2. Q2: Momentum starts from zero but quickly gets close to the gradient average. Adam's moments are exponentially weighted averages initialized to zero, which biases them toward zero early on.
  3. Q3: Reduce learning rate first. If using SGD, add or increase momentum. Consider switching to Adam if not already using it.
  4. Q4: In Adam, weight decay gets scaled by the adaptive learning rate, so parameters with small gradients (often important features) get less regularization. AdamW applies weight decay uniformly.

Coding Exercises

  1. Implement SGD with momentum from scratch: Create a custom optimizer class that implements SGD with momentum (with optional Nesterov). Verify it produces the same results as PyTorch's optim.SGD.
  2. Optimizer comparison experiment: Train a small CNN on MNIST with SGD, SGD+momentum, RMSprop, and Adam. Plot training and validation loss curves. Which converges fastest? Which achieves the best final accuracy?
  3. Learning rate sensitivity analysis: Train the same model with Adam using learning rates [0.0001, 0.001, 0.01, 0.1]. Repeat with SGD. Compare how sensitive each optimizer is to learning rate choice.
  4. Visualize optimizer trajectories: Create a 2D loss surface (like the Rosenbrock function) and animate the trajectories of different optimizers. Highlight where each optimizer struggles or excels.

Coding Exercise Hints

  • Exercise 1: Store velocity as a list of tensors, one per parameter. For Nesterov, evaluate gradient at θ - η·μ·v.
  • Exercise 2: Use a simple CNN (2-3 conv layers + FC). Train for 10 epochs, plot losses with matplotlib.
  • Exercise 3: Create a loop over learning rates and store results in a dictionary. Use early stopping if loss explodes.
  • Exercise 4: Use matplotlib.animation. Store optimizer state history and plot paths on a contour plot.

In the next section, we'll explore learning rate strategies—techniques for changing the learning rate during training. You'll learn about warmup, cosine annealing, and other schedules that can significantly improve training.