Learning Objectives
By the end of this section, you will be able to:
- Explain why momentum alone is insufficient when different parameters need different learning rates
- Derive the RMSprop update and explain how dividing by the running RMS gradient creates per-parameter adaptive learning rates
- Derive the full Adam algorithm as the combination of momentum (first moment) and RMSprop (second moment) with bias correction
- Implement Adam from scratch in NumPy and trace the $m$, $v$, $\hat{m}$, $\hat{v}$ buffers step by step
- Use PyTorch's Adam optimizer to train a real network and compare its convergence against SGD with momentum
- Explain AdamW and why decoupled weight decay is the standard in modern transformer training
Where Momentum Falls Short
In Section 1, we saw that momentum gives every parameter the same treatment: accumulate past gradients with decay $\beta$, then step with learning rate $\eta$. This works brilliantly when the loss surface is smooth and all directions have similar curvature. But real neural networks have a much harder problem.
The Per-Parameter Problem
Consider our diagonal flip network with 31 parameters. Some of these are weights in $W_1$ (layer 1), some are biases in $b_1$, and others are in $W_2$ and $b_2$. During training:
- Layer 1 weights see gradients that are attenuated by the ReLU and layer 2 weights (chain rule). Typical gradient magnitude: ~0.1
- Layer 2 biases see gradients directly from the loss. Typical gradient magnitude: ~1.0
- Difference: 10× between the smallest and largest gradients among the 31 parameters
With momentum, the learning rate $\eta$ applies to all 31 parameters equally. If we choose $\eta$ to work well for the large-gradient parameters (layer 2 biases), it is too small for the small-gradient parameters (layer 1 weights). If we increase $\eta$, the large-gradient parameters overshoot.
The fundamental limitation of momentum: it smooths the gradient direction but cannot adapt the step SIZE per parameter. All parameters share one learning rate. In a network where gradient magnitudes span 10× or 100× across parameters, one learning rate cannot serve them all optimally.
What we need is an optimizer that automatically gives larger steps to parameters with small gradients and smaller steps to parameters with large gradients. This is the idea behind adaptive learning rate methods.
Adaptive Learning Rates: The Key Insight
The core idea is simple: divide each parameter's update by a measure of how large its gradients typically are. If a parameter consistently has large gradients, divide by a large number (take smaller steps). If it has small gradients, divide by a small number (take larger steps).
AdaGrad: The First Attempt (2011)
AdaGrad (Adaptive Gradient) introduced this idea. For each parameter $w_i$, it accumulates the sum of all past squared gradients:
$$v_{t,i} = v_{t-1,i} + g_{t,i}^2$$
Then the update divides by the square root of this sum (with a small $\epsilon$ added for numerical stability):
$$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}$$
For a parameter with consistently large gradients, $v$ grows quickly and $\sqrt{v}$ becomes large, shrinking the effective step. For a parameter with small gradients, $v$ stays small and the effective step stays large. Per-parameter adaptation, automatically.
But AdaGrad has a fatal flaw: $v$ only grows, never shrinks. After thousands of steps, $\sqrt{v}$ becomes so large that the effective learning rate approaches zero. The optimizer effectively stops learning.
| Step | v (accumulated) | √v | Effective lr (η/√v) |
|---|---|---|---|
| 100 | 10,000 | 100 | 0.01 × η |
| 1,000 | 100,000 | 316 | 0.003 × η |
| 10,000 | 1,000,000 | 1,000 | 0.001 × η |
| 100,000 | 10,000,000 | 3,162 | 0.0003 × η ← nearly dead |
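The numbers in this table can be reproduced with a short sketch. It assumes a constant gradient of magnitude 10 at every step (so each step adds 100 to the accumulator), which is what the table's values imply; the function name is illustrative only.

```python
import math

def adagrad_lr_scale(steps, grad=10.0):
    """Return (v, sqrt(v), 1/sqrt(v)) after `steps` AdaGrad accumulations.

    Assumes the same gradient value at every step, an illustrative
    simplification rather than something a real network would produce.
    """
    v = 0.0
    for _ in range(steps):
        v += grad ** 2              # AdaGrad: running SUM of squared gradients
    root = math.sqrt(v)
    return v, root, 1.0 / root      # effective lr is eta * (1 / sqrt(v))

for steps in (100, 1_000, 10_000, 100_000):
    v, root, scale = adagrad_lr_scale(steps)
    print(f"step {steps:>7,}: v={v:>12,.0f}  sqrt(v)={root:8.1f}  lr scale={scale:.4f}")
```

Because nothing is ever subtracted from the accumulator, the scale factor can only shrink as training continues.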
RMSprop: Scaling by Gradient History
RMSprop (Root Mean Square Propagation), proposed by Geoffrey Hinton in 2012, fixes AdaGrad's decay problem with a simple change: replace the running sum of squared gradients with an exponential moving average, and keep the same form of update:
$$v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$
Here $\beta$ (typically 0.9 or 0.999) is the decay rate. Unlike AdaGrad's sum, $v_t$ now tracks the recent mean squared gradient. Old gradient information fades away exponentially, so $v_t$ stays bounded and the learning rate never dies.
What the Denominator Really Does
The quantity $\sqrt{v_t}$ is an estimate of the root mean square (RMS) of recent gradients. Dividing by it normalizes each parameter's step by its gradient's typical magnitude:
| Parameter | Typical gradient magnitude | √v (RMS) | Effective step |
|---|---|---|---|
| layer2.bias[0] | ~1.0 | ~1.0 | η × 1.0/1.0 = η |
| layer1.weight[0][0] | ~0.1 | ~0.1 | η × 0.1/0.1 = η |
| layer1.weight[2][3] | ~0.01 | ~0.01 | η × 0.01/0.01 = η |
All three parameters end up with an effective step of approximately $\eta$, regardless of their gradient magnitude. RMSprop equalizes the step size across parameters automatically.
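A minimal sketch of this equalization, assuming three parameters whose gradients are held constant at the magnitudes from the table (1.0, 0.1, 0.01). The values `lr=0.01` and `beta=0.9` and the function name are assumptions chosen for illustration.

```python
import math

def rmsprop_step_sizes(grads, lr=0.01, beta=0.9, eps=1e-8, n_steps=100):
    """Run RMSprop on constant per-parameter gradients; return the final step sizes."""
    v = [0.0] * len(grads)
    steps = [0.0] * len(grads)
    for _ in range(n_steps):
        for i, g in enumerate(grads):
            v[i] = beta * v[i] + (1.0 - beta) * g * g     # EMA of squared gradient
            steps[i] = lr * g / (math.sqrt(v[i]) + eps)   # step normalized by RMS
    return steps

step_sizes = rmsprop_step_sizes([1.0, 0.1, 0.01])
for g, s in zip([1.0, 0.1, 0.01], step_sizes):
    print(f"|grad|={g:<5} step={s:.6f}")
```

Once the moving average has settled, all three steps sit close to `lr` even though the raw gradients differ by 100×.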
RMSprop's Limitation
RMSprop has no momentum. It adapts the magnitude of each step but doesn't smooth the direction. On a noisy loss surface, RMSprop still zig-zags — it just zig-zags with adaptive step sizes. What if we could combine the directional smoothing of momentum with the magnitude adaptation of RMSprop?
Adam: The Best of Both Worlds
Adam (Adaptive Moment Estimation), published by Kingma & Ba in 2015, combines momentum and RMSprop into a single optimizer. It maintains two exponential moving averages:
- First moment $m_t$: the mean of gradients (like momentum). Smooths direction.
- Second moment $v_t$: the mean of squared gradients (like RMSprop). Scales magnitude.
The Full Algorithm
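Given gradient $g_t$ at step $t$:
Step 1: Update the first moment (momentum)
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
Step 2: Update the second moment (RMSprop)
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
Step 3: Bias correction
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Step 4: Update the weights
$$w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Default hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.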
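The four steps map almost line for line onto code. The following is a sketch for a single scalar parameter, not a reference implementation; the function name `adam_step` and its signature are illustrative choices.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    t is the 1-based step count. Returns the new (w, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * g               # Step 1: first moment
    v = beta2 * v + (1.0 - beta2) * g * g           # Step 2: second moment
    m_hat = m / (1.0 - beta1 ** t)                  # Step 3: bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # Step 4: weight update
    return w, m, v

# First step from zero-initialized buffers with gradient 0.5.
w, m, v = adam_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w, m, v)
```

On the first step the bias-corrected ratio is 1 in magnitude, so the weight moves by almost exactly `lr` whatever the size of the gradient.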
Reading the Update Rule
The update $\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ can be read as:
- $\hat{m}_t$ says: "the average gradient direction is this way"
- $\sqrt{\hat{v}_t}$ says: "the typical gradient magnitude is this big"
- Their ratio is approximately $\pm 1$ when the gradient is consistent, giving a step of approximately $\eta$
- When gradients are noisy (alternating sign), $|\hat{m}_t|$ is small but $\sqrt{\hat{v}_t}$ is still large, so the step shrinks: automatic noise dampening
| Component | Source | What it provides |
|---|---|---|
| m (first moment) | Momentum | Smooths gradient direction, dampens noise |
| v (second moment) | RMSprop | Per-parameter adaptive step size |
| Bias correction | Adam (new) | Corrects zero-initialization bias |
| Combined update | m̂/√v̂ | Steps of ≈η in all dimensions |
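The noise-dampening behaviour can be checked numerically. This sketch feeds two hypothetical gradient sequences through Adam's moment estimates, one with a constant sign and one that flips sign every step, and compares the resulting ratio $|\hat{m}|/\sqrt{\hat{v}}$. The sequences and the function name are assumptions made for illustration.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return |m_hat| / sqrt(v_hat) after feeding a gradient sequence to Adam's moments."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # t is the final step count here
    v_hat = v / (1.0 - beta2 ** t)
    return abs(m_hat) / math.sqrt(v_hat)

consistent = [1.0] * 200                                    # same sign every step
noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]   # sign flips every step

print("consistent:", adam_ratio(consistent))
print("noisy:     ", adam_ratio(noisy))
```

Both sequences have the same gradient magnitude, so $\sqrt{\hat{v}}$ is the same; only the first moment differs, and it is what shrinks the step when the signs disagree.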
Understanding Bias Correction
Both $m$ and $v$ are initialized to zero. This creates a problem: in the first few steps, they are biased toward zero. Bias correction fixes this.
Why Zero Initialization Causes Bias
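At step 1 with $\beta_1 = 0.9$:
$$m_1 = \beta_1 m_0 + (1 - \beta_1)\, g_1 = 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1$$
The expected value of $m_1$ is $0.1\, \mathbb{E}[g]$, not $\mathbb{E}[g]$. It underestimates the true gradient mean by a factor of 10. In general, after $t$ steps (assuming the gradient distribution is stationary):
$$\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g]$$
Dividing by $1 - \beta_1^t$ gives $\hat{m}_t = m_t / (1 - \beta_1^t)$, an unbiased estimate.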
The Second Moment Bias Is Much Worse
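For $v$ with $\beta_2 = 0.999$, the bias at step 1 is:
$$v_1 = \beta_2 v_0 + (1 - \beta_2)\, g_1^2 = 0.001\, g_1^2$$
That is 1000× too small. Without correction, $\sqrt{v_1}$ would be $\sqrt{0.001} \approx 0.032$ times the true RMS, making the step ~30× too large for a correctly scaled numerator. (If $m$ were left uncorrected as well, its own 10× underestimate would partly cancel this, leaving a step about 3× too large.) The bias correction divides by $1 - \beta_2^1 = 0.001$, multiplying by 1000 to recover the correct scale.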
| Step t | 1 − β₁ᵗ (β₁=0.9) | m correction | 1 − β₂ᵗ (β₂=0.999) | v correction |
|---|---|---|---|---|
| 1 | 0.100 | 10.0× | 0.001 | 1000× |
| 5 | 0.410 | 2.4× | 0.005 | 200× |
| 10 | 0.651 | 1.5× | 0.010 | 100× |
| 50 | 0.995 | 1.005× | 0.049 | 20.5× |
| 100 | ≈1.0 | none | 0.095 | 10.5× |
| 1000 | ≈1.0 | none | 0.632 | 1.58× |
Notice: the first moment correction vanishes by step 50, but the second moment correction is still significant at step 1000. This is because $\beta_2$ is so close to 1: the averaging window is about $1/(1 - \beta_2) = 1000$ steps, so $v$ needs a few thousand steps to reach steady state.
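The correction factors in the table follow directly from $1/(1 - \beta^t)$. A small sketch to recompute them (the helper name is illustrative):

```python
def correction_factor(beta, t):
    """Factor 1 / (1 - beta**t) by which bias correction scales the raw moment."""
    return 1.0 / (1.0 - beta ** t)

for t in (1, 5, 10, 50, 100, 1000):
    m_corr = correction_factor(0.9, t)
    v_corr = correction_factor(0.999, t)
    print(f"t={t:>4}  m correction: {m_corr:8.3f}x   v correction: {v_corr:8.2f}x")
```

Running it for larger `t` shows how slowly the second-moment factor approaches 1 compared with the first-moment factor.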
Interactive: Adam Step by Step
The visualization below lets you step through Adam on a simple 1D loss from a fixed starting point. Watch how the raw moments ($m$, $v$) slowly ramp up while the bias-corrected moments ($\hat{m}$, $\hat{v}$) stay stable from step 1. The purple (Adam) dot moves in constant steps of ~lr, while the green (Momentum) dot accelerates.
Try these experiments:
- Press play at default settings (lr=0.01, β₁=0.9, β₂=0.999). Watch momentum race ahead while Adam takes tiny constant steps. On this 1D problem, momentum wins.
- Increase lr to 0.05: Adam takes bigger steps but still constant-sized. Momentum still wins because velocity builds up.
- Set β₁=0.0: disables the momentum component. Adam becomes pure RMSprop-with-bias-correction. The steps are still approximately lr because the second moment normalizes the gradient.
- Set β₂=0.9 (instead of 0.999): the second moment tracks recent gradients more aggressively. Watch how the raw $v$ ramps up faster, while $\hat{v}$ stays approximately the same due to bias correction.
Why momentum wins here: on a 1D problem, there is only one parameter. Adam's per-parameter adaptation has nothing to adapt between. The ratio $\hat{m}/\sqrt{\hat{v}} \approx 1$ always, so the step is just lr. Momentum accelerates because velocity compounds in a consistent direction. Adam's advantage appears when parameters have different gradient scales, which requires at least two dimensions.
Interactive: Optimizer Comparison
The visualization below shows all four optimizers, SGD (red), Momentum (green), RMSprop (blue), and Adam (purple), racing toward the minimum of the Rosenbrock function, written in its standard form as $f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$. This function has a famous curved valley (banana shape) where the minimum sits at $(x, y) = (1, 1)$. It is a classic test because the curvature differs enormously along the valley floor versus across the walls.
What to look for:
- SGD (red) barely moves — the learning rate must be tiny to avoid divergence in the steep direction, so progress in the gentle direction is glacial
- Momentum (green) follows the valley but oscillates across the walls, especially at higher learning rates
- RMSprop (blue) makes equal-sized steps in both directions but has no momentum to smooth the path
- Adam (purple) combines the best of both — it follows the valley floor smoothly with adaptive steps in both directions
NumPy: Adam from Scratch
Let's implement Adam from scratch and trace every internal variable. We use the same problem from Section 1 so we can directly compare Adam against vanilla SGD (w=4.25 after 8 steps) and momentum (w=2.44 after 8 steps).
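A minimal sketch of such a trace follows. The problem setup is an assumption: the loss $L(w) = w^2$, the start $w_0 = 5$, `lr = 0.01`, and momentum `beta = 0.9` are inferred from the results quoted in this section rather than stated in it. Plain Python floats stand in for NumPy arrays so the sketch runs without dependencies.

```python
import math

def grad(w):
    """Gradient of the assumed loss L(w) = w**2."""
    return 2.0 * w

def run_sgd(w, lr=0.01, steps=8):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(w, lr=0.01, beta=0.9, steps=8):
    vel = 0.0
    for _ in range(steps):
        vel = beta * vel + grad(w)    # accumulate velocity
        w -= lr * vel
    return w

def run_adam(w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=8, trace=False):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # raw first moment
        v = beta2 * v + (1 - beta2) * g * g      # raw second moment
        m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if trace:
            print(f"t={t} g={g:.4f} m={m:.4f} v={v:.4f} "
                  f"m_hat={m_hat:.4f} v_hat={v_hat:.4f} w={w:.4f}")
    return w

w_adam = run_adam(5.0, trace=True)
print(f"SGD {run_sgd(5.0):.4f}  momentum {run_momentum(5.0):.4f}  Adam {w_adam:.4f}")
```

In the trace, the raw `m` and `v` start far below the gradient scale while `m_hat` and `v_hat` match it from the first step, which is why each Adam update moves `w` by roughly `lr`.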
The results reveal Adam's behavior clearly:
| Optimizer | w after 8 steps | Loss | Reduction |
|---|---|---|---|
| SGD + Momentum | 2.4421 | 5.96 | 76% |
| Vanilla SGD | 4.2539 | 18.10 | 28% |
| Adam | 4.9200 | 24.21 | 3% |
Adam is the slowest on this 1D problem. This is not a bug — it is an important property to understand.
Why Adam Is Slow Here (And Why It Doesn't Matter)
On a 1D problem with a consistent gradient, Adam's ratio $\hat{m}/\sqrt{\hat{v}}$ is always approximately 1. So the step is just $\eta$: no acceleration, no adaptation. Momentum, by contrast, builds up velocity: after 8 steps, its effective step is ~4.5× the initial raw gradient step.
But on a real neural network with hundreds or thousands of parameters at different scales, Adam's per-parameter adaptation is transformative. Each parameter's effective learning rate is roughly $\eta / \sqrt{\hat{v}}$: a parameter with large gradients gets a small effective learning rate, and a parameter with tiny gradients gets a large one. Each parameter gets a learning rate matched to its own gradient scale.
PyTorch: Training with Adam
Now let's see Adam on a real network. We train the same diagonal flip network from Section 1, comparing SGD+momentum against Adam. Both models start with identical weights. The only difference is the optimizer.
On the real 31-parameter network, Adam converges significantly faster. The per-parameter adaptation gives each of the 31 parameters an individually tuned learning rate. Layer 1 weights (which receive attenuated gradients through the chain rule) automatically get a larger effective learning rate, while layer 2 biases (which receive direct gradients) get a smaller one.
Why Adam Wins on Real Networks
| Factor | Momentum | Adam |
|---|---|---|
| Learning rate | 0.05 (hand-tuned) | 0.001 (default) |
| Per-parameter adaptation | No — same lr for all 31 params | Yes — each param gets its own effective lr |
| Gradient scale handling | Must tune lr to balance layers | Automatic — √v normalizes each param |
| Default robustness | Sensitive to lr choice | Works well with lr=0.001 out of the box |
| Memory (31 params) | 62 floats (w + v) | 93 floats (w + m + v) |
AdamW: Weight Decay Done Right
In 2019, Loshchilov & Hutter published a seemingly minor paper that turned out to be extremely important. They showed that the standard way of adding L2 regularization to Adam is fundamentally broken, and proposed AdamW as the fix.
The Problem with L2 Regularization in Adam
L2 regularization adds a penalty $\frac{\lambda}{2}\lVert w \rVert^2$ to the loss. The gradient becomes $g_t + \lambda w_t$. In vanilla SGD, this works correctly: the weight decay term shrinks each weight proportionally.
But in Adam, the gradient goes through the $\hat{m}/\sqrt{\hat{v}}$ normalization. The weight decay term $\lambda w$ gets scaled differently for each parameter. A parameter with large $\sqrt{\hat{v}}$ gets less weight decay than intended; a parameter with small $\sqrt{\hat{v}}$ gets more. The adaptive scaling, which is perfect for the gradient, distorts the regularization.
The AdamW Fix
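AdamW decouples weight decay from the gradient update. Instead of adding $\lambda w$ to the gradient, AdamW applies it directly to the weights, separately from the Adam step:
$$w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda\, w_t$$
The gradient update ($\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$) goes through Adam's normalization. The weight decay ($\eta \lambda w_t$) is applied directly, without any scaling. Every parameter gets the same proportional decay, regardless of its gradient history.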
| Method | Weight decay mechanism | Consistent across parameters? |
|---|---|---|
| Adam + L2 | λw goes through m/√v normalization | No — each param gets different effective decay |
| AdamW | λw applied directly to weights | Yes — all params decay at rate λ uniformly |
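The difference in the table can be seen on a toy problem. This sketch assumes a single scalar weight starting at 1.0 whose loss gradient alternates between `+grad_scale` and `-grad_scale` (zero mean, so only the decay should move the weight systematically). The setup, the hyperparameter values, and the function name are all illustrative assumptions.

```python
import math

def train(grad_scale, decoupled, steps=50, lr=0.01, wd=0.1,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar weight whose loss gradient alternates +/- grad_scale.

    decoupled=False: Adam + L2 (wd * w is added to the gradient).
    decoupled=True:  AdamW (wd * w is subtracted from the weight directly).
    """
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 == 1 else -grad_scale   # zero-mean toy gradient
        if not decoupled:
            g = g + wd * w                              # L2 penalty enters the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w_prev = w
        w = w_prev - lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            w -= lr * wd * w_prev                       # decay bypasses the normalization
    return w

for scale in (1.0, 0.01):
    print(f"grad scale {scale}: Adam+L2 w={train(scale, False):.3f}  "
          f"AdamW w={train(scale, True):.3f}")
```

Under these assumptions, Adam + L2 shrinks the small-gradient weight far more than the large-gradient one, while AdamW leaves both weights at essentially the same value.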
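In PyTorch, switching from Adam to AdamW is a one-line change: construct `torch.optim.AdamW` instead of `torch.optim.Adam`, passing the decay strength as `weight_decay`.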
Connection to Modern Training
Adam and AdamW are the workhorses of modern deep learning. Here is how they appear in practice.
Standard Hyperparameters for Transformers
| Hyperparameter | Typical value | Why |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay is critical for regularization |
| lr | 1e-4 to 3e-4 | Lower than the default 1e-3 because transformers are sensitive |
| β₁ | 0.9 | Standard momentum. Some use 0.9 → 0.95 for very large batches |
| β₂ | 0.95 or 0.999 | GPT-3 and LLaMA use 0.95; BERT uses 0.999. Lower β₂ tracks recent gradient magnitudes faster |
| ε | 1e-8 | Rarely changed. Some use 1e-6 for mixed-precision stability |
| weight_decay | 0.01 or 0.1 | Applied only to weight matrices, NOT biases or LayerNorm params |
| Gradient clipping | max_norm=1.0 | Prevents gradient explosions from corrupting Adam’s buffers |
Learning Rate Warmup
Transformers use a learning rate warmup for the first 1,000–4,000 steps. The learning rate starts at 0 and linearly increases to the target value. This interacts with Adam in two ways:
- Adam's bias correction already guards the earliest steps. At step 1, $v$ is corrected by 1000×, preventing oversized steps. Warmup adds additional safety on top of bias correction.
- The second moment needs time to converge. During warmup, $v$ builds up a reasonable estimate of the squared-gradient scale. By the time lr reaches its target, $\hat{v}$ is well-calibrated and the per-parameter scaling is reliable.
Cosine Learning Rate Decay
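After warmup, most transformer training schedules follow a cosine decay:
$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
This smoothly decreases the learning rate from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps, with $t$ counted from the end of warmup. The combination of AdamW + warmup + cosine decay is the standard training recipe for GPT, LLaMA, and most modern language models.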
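Linear warmup and cosine decay combine into one schedule function. The sketch below is one common way to write it; the specific values (`max_lr=3e-4`, `min_lr=3e-5`, 2,000 warmup steps, 100,000 total steps) and the function name are placeholders, not recommendations from this section.

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5,
               warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps            # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)                      # hold at min_lr past the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 51_000, 100_000):
    print(f"step {step:>7,}: lr = {lr_at_step(step):.2e}")
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step.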
Memory Cost at Scale
Adam's two buffers per parameter become significant at scale. For a 7B parameter model (like LLaMA-7B):
| Component | Size (fp32) | Size (fp16/bf16) |
|---|---|---|
| Model weights | 28 GB | 14 GB |
| Adam m buffer | 28 GB | 28 GB (kept in fp32) |
| Adam v buffer | 28 GB | 28 GB (kept in fp32) |
| Gradients | 28 GB | 14 GB |
| Total | 112 GB | 84 GB |
The optimizer state (m + v) alone is 56 GB for a 7B model — twice the size of the model itself. This is why optimizer state sharding (like ZeRO from DeepSpeed and FSDP from PyTorch) is essential for training large models. These techniques distribute the m and v buffers across multiple GPUs.
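The table's totals are simple arithmetic on bytes per parameter. This rough sketch assumes 1 GB = 10⁹ bytes and ignores activations and any extra fp32 master copy of the weights; the function name and its arguments are illustrative.

```python
def training_memory_gb(n_params, weight_bytes, grad_bytes, state_bytes=4):
    """Rough training memory in GB (1 GB = 1e9 bytes), ignoring activations."""
    gb = 1e9
    weights = n_params * weight_bytes / gb
    grads = n_params * grad_bytes / gb
    adam_state = 2 * n_params * state_bytes / gb    # m and v buffers, kept in fp32
    return {"weights": weights, "grads": grads, "adam_state": adam_state,
            "total": weights + grads + adam_state}

print(training_memory_gb(7e9, weight_bytes=4, grad_bytes=4))   # fp32 column
print(training_memory_gb(7e9, weight_bytes=2, grad_bytes=2))   # fp16/bf16 column
```

Sharding the `adam_state` entry across N devices divides that term by N, which is the saving that optimizer state sharding targets.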
Adam Variants in Production
- AdaFactor: reduces memory by storing only row and column statistics instead of the full m and v matrices. Used in T5 and some Google models.
- LAMB (Layer-wise Adaptive Moments for Batch training): scales Adam's update by the ratio of parameter norm to update norm. Enables extremely large batch sizes (up to 64K). Used in BERT pretraining.
- 8-bit Adam: quantizes m and v buffers to 8-bit integers (via the bitsandbytes library), reducing optimizer memory by 4× with minimal quality loss. Increasingly popular for fine-tuning large models on consumer GPUs.
- Lion (EvoLved Sign Momentum): discovered by Google through evolutionary search. Uses only the sign of the momentum update (no second moment), reducing memory to 1 buffer per parameter. Competitive with Adam on many tasks.
Summary
- Momentum alone cannot adapt per-parameter. All parameters share one learning rate, which fails when gradient magnitudes span orders of magnitude across parameters.
- RMSprop divides by the RMS gradient to normalize step sizes across parameters. But it has no momentum (no directional smoothing) and no bias correction (misbehaves early in training).
- Adam combines momentum and RMSprop with bias correction. First moment smooths direction; second moment adapts magnitude. The effective step is approximately $\eta$ for every parameter.
- Bias correction is essential because both $m$ and $v$ start at zero. Without it, the second moment is underestimated by up to 1000× in the first step, causing oversized updates.
- Adam is not always fastest — on simple 1D problems, momentum wins because it builds velocity. Adam's advantage is on multi-dimensional problems where parameters need different step sizes.
- AdamW decouples weight decay from the gradient normalization. This ensures consistent regularization across all parameters and is the standard for modern transformer training.
- The standard transformer recipe: AdamW + linear warmup + cosine decay, with $\beta_1 = 0.9$, $\beta_2 = 0.95$–$0.999$, weight_decay = 0.01–0.1, gradient clipping at max_norm = 1.0.
Looking ahead: Adam and AdamW give us an excellent optimizer, but they still require choosing a learning rate and a training duration. Learning rate scheduling, that is, how to adjust $\eta$ during training, can further improve convergence. We explore this in Section 3.