Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain why momentum alone is insufficient when different parameters need different learning rates
Derive the RMSprop update and explain how dividing by the running RMS gradient creates per-parameter adaptive learning rates
Derive the full Adam algorithm as the combination of momentum (first moment) and RMSprop (second moment) with bias correction
Implement Adam from scratch in NumPy and trace the m, v, $\hat{m}$ , $\hat{v}$ buffers step by step
Use PyTorch's Adam optimizer to train a real network and compare its convergence against SGD with momentum
Explain AdamW and why decoupled weight decay is the standard in modern transformer training

Where Momentum Falls Short

In Section 1, we saw that momentum gives every parameter the same treatment: accumulate past gradients with $\beta = 0.9$ , then step with learning rate $\eta$ . This works brilliantly when the loss surface is smooth and all directions have similar curvature. But real neural networks have a much harder problem.

The Per-Parameter Problem

Consider our diagonal flip network with 31 parameters. Some of these are weights in $\mathbf{W}_1$ (layer 1), some are biases in $\mathbf{b}_1$ , and others are in $\mathbf{W}_2$ and $\mathbf{b}_2$ . During training:

Layer 1 weights see gradients that are attenuated by the ReLU and layer 2 weights (chain rule). Typical gradient magnitude: ~0.1
Layer 2 biases see gradients directly from the loss. Typical gradient magnitude: ~1.0
Difference: 10× between the smallest and largest gradients among the 31 parameters

With momentum, the learning rate $\eta = 0.05$ applies to all 31 parameters equally. If we choose $\eta$ to work well for the large-gradient parameters (layer 2 biases), it is too small for the small-gradient parameters (layer 1 weights). If we increase $\eta$ , the large-gradient parameters overshoot.

The fundamental limitation of momentum: it smooths the gradient direction but cannot adapt the step SIZE per parameter. All parameters share one learning rate. In a network where gradient magnitudes span 10× or 100× across parameters, one learning rate cannot serve them all optimally.

What we need is an optimizer that automatically gives larger steps to parameters with small gradients and smaller steps to parameters with large gradients. This is the idea behind adaptive learning rate methods.
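The shared-learning-rate problem can be seen in a few lines. This is a toy sketch, not the flip network: it assumes two hypothetical parameters `a` and `b` whose gradients are `1.0 * a` and `0.01 * b` (a 100× gap), and it assumes the PyTorch-style momentum update `vel = beta * vel + grad` with the same $\eta = 0.05$ and $\beta = 0.9$ used in the text.

```python
# Toy illustration: one learning rate, two gradient scales.
# `a` and `b` are hypothetical parameters, not taken from the flip network.
lr, beta = 0.05, 0.9
a, b = 1.0, 1.0
vel_a = vel_b = 0.0

for _ in range(100):
    grad_a = 1.0 * a       # large-gradient parameter
    grad_b = 0.01 * b      # small-gradient parameter (100x smaller)
    vel_a = beta * vel_a + grad_a
    vel_b = beta * vel_b + grad_b
    a -= lr * vel_a        # same lr for both parameters
    b -= lr * vel_b

print(f"after 100 steps: a={a:.4f}  b={b:.4f}")
```

Under these assumptions `a` is essentially at its minimum after 100 steps while `b` has covered less than half the distance, even though both received the same momentum treatment.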

Adaptive Learning Rates: The Key Insight

The core idea is simple: divide each parameter's update by a measure of how large its gradients typically are. If a parameter consistently has large gradients, divide by a large number (take smaller steps). If it has small gradients, divide by a small number (take larger steps).

AdaGrad: The First Attempt (2011)

AdaGrad (Adaptive Gradient) introduced this idea. For each parameter $w_i$ , it accumulates the sum of all past squared gradients:

$v_t = v_{t-1} + g_t^2$

Then the update divides by the square root of this sum:

$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \varepsilon} \cdot g_t$

For a parameter with consistently large gradients (say $g \approx 10$ ), $v$ grows quickly and $\sqrt{v}$ becomes large, shrinking the effective step. For a parameter with small gradients ( $g \approx 0.1$ ), $v$ stays small and the effective step stays large. Per-parameter adaptation, automatically.

But AdaGrad has a fatal flaw: $v$ only grows, never shrinks. After thousands of steps, $v$ becomes so large that the effective learning rate approaches zero. The optimizer effectively stops learning.

Step	v (accumulated)	√v	Effective lr (η/√v)
100	10,000	100	0.01 × η
1,000	100,000	316	0.003 × η
10,000	1,000,000	1,000	0.001 × η
100,000	10,000,000	3,162	0.0003 × η ← nearly dead

AdaGrad is ideal for sparse data (like NLP embeddings where each word updates infrequently) but fails on dense, long-running training. The monotonically decreasing learning rate is too aggressive for deep neural networks.
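The decay in the table above can be reproduced with a short sketch. It assumes a constant gradient of $g = 10$ (so every step adds 100 to $v$) and leaves out $\varepsilon$ for readability; the effective rate is reported as a multiple of $\eta$.

```python
import math

eta = 1.0    # report the effective rate as a multiple of eta
g = 10.0     # assumed constant gradient, so each step adds g**2 = 100
v = 0.0
effective_lr = {}

for step in range(1, 100_001):
    v += g ** 2                          # AdaGrad: a running sum that never decays
    if step in (100, 1_000, 10_000, 100_000):
        effective_lr[step] = eta / math.sqrt(v)
        print(f"step {step:>7}: v={v:>12,.0f}  sqrt(v)={math.sqrt(v):8.1f}  "
              f"eta/sqrt(v)={effective_lr[step]:.4f}")
```

Because $v$ grows linearly with the step count, the effective rate shrinks like $1/\sqrt{t}$ and never recovers.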

RMSprop: Scaling by Gradient History

RMSprop (Root Mean Square Propagation), proposed by Geoffrey Hinton in 2012, fixes AdaGrad's decay problem with a simple change: replace the running sum of squared gradients with an exponential moving average:

$v_t = \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2$

$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \varepsilon} \cdot g_t$

Here $\rho$ (typically 0.9 or 0.99) is the decay rate. Unlike AdaGrad's sum, $v$ now tracks the mean of recent squared gradients. Old gradient information fades away exponentially, so $v$ stays bounded and the learning rate never dies.

What the Denominator Really Does

The quantity $\sqrt{v_t}$ is an estimate of the root mean square (RMS) of recent gradients. Dividing by it normalizes each parameter's step by its gradient's typical magnitude:

Parameter	Typical |grad|	√v (RMS)	Effective step
layer2.bias[0]	~1.0	~1.0	η × 1.0/1.0 = η
layer1.weight[0][0]	~0.1	~0.1	η × 0.1/0.1 = η
layer1.weight[2][3]	~0.01	~0.01	η × 0.01/0.01 = η

All three parameters end up with an effective step of approximately $\eta$ , regardless of their gradient magnitude. RMSprop equalizes the step size across parameters automatically.
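A minimal sketch of this equalization, assuming each parameter sees a constant gradient of the magnitude listed in the table and using $\rho = 0.9$, $\eta = 0.01$ (illustrative values, not tuned settings):

```python
import math

eta, rho, eps = 0.01, 0.9, 1e-8
# Gradient magnitudes taken from the table above; held constant for simplicity.
grads = {
    "layer2.bias[0]": 1.0,
    "layer1.weight[0][0]": 0.1,
    "layer1.weight[2][3]": 0.01,
}

rms_steps = {}
for name, g in grads.items():
    v = 0.0
    for _ in range(200):                   # long enough for v to settle near g**2
        v = rho * v + (1 - rho) * g ** 2   # RMSprop moving average
    rms_steps[name] = eta * g / (math.sqrt(v) + eps)
    print(f"{name:<22} |grad|={g:<5} sqrt(v)={math.sqrt(v):.4f} step={rms_steps[name]:.5f}")
```

All three steps come out at roughly `eta`, even though the raw gradients differ by a factor of 100.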

RMSprop's Limitation

RMSprop has no momentum. It adapts the magnitude of each step but doesn't smooth the direction. On a noisy loss surface, RMSprop still zig-zags — it just zig-zags with adaptive step sizes. What if we could combine the directional smoothing of momentum with the magnitude adaptation of RMSprop?

Adam: The Best of Both Worlds

Adam (Adaptive Moment Estimation), published by Kingma & Ba in 2015, combines momentum and RMSprop into a single optimizer. It maintains two exponential moving averages:

First moment $m_t$ — the mean of gradients (like momentum). Smooths direction.
Second moment $v_t$ — the mean of squared gradients (like RMSprop). Scales magnitude.

The Full Algorithm

Given gradient $g_t$ at step $t$ :

Step 1: Update the first moment (momentum)

$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$

Step 2: Update the second moment (RMSprop)

$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$

Step 3: Bias correction

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

Step 4: Update the weights

$w_{t+1} = w_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$

Default hyperparameters: $\beta_1 = 0.9$ , $\beta_2 = 0.999$ , $\varepsilon = 10^{-8}$ , $\eta = 0.001$ .

Reading the Update Rule

The update $\eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$ can be read as:

$\hat{m}_t$ says: "the average gradient direction is this way"
$\sqrt{\hat{v}_t}$ says: "the typical gradient magnitude is this big"
Their ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$ when the gradient is consistent, giving a step of approximately $\eta$
When gradients are noisy (alternating sign), $|\hat{m}_t|$ is small but $\hat{v}_t$ is still large, so the step shrinks — automatic noise dampening
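The last two points can be checked numerically. The sketch below feeds two assumed gradient sequences through the moment updates: one with a constant sign and one that alternates between +1 and −1. The helper name `adam_ratio` is made up for this illustration.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return m_hat / sqrt(v_hat) after processing a gradient sequence."""
    m = v = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)

consistent = [1.0] * 200
noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]

ratio_consistent = adam_ratio(consistent)
ratio_noisy = adam_ratio(noisy)
print(f"consistent gradients: ratio = {ratio_consistent:.4f}")
print(f"alternating gradients: |ratio| = {abs(ratio_noisy):.4f}")
```

With a consistent sign the ratio is 1, so the step is $\eta$. With alternating signs the first moment nearly cancels while the second moment does not, so the step shrinks to a small fraction of $\eta$.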

Adam's step size property: when the gradient is consistent in sign, Adam takes steps of approximately $\eta$ regardless of gradient magnitude. This means $\eta$ directly controls step size in weight space, not in gradient space. That is why Adam's default $\eta = 0.001$ works well across many problems without tuning.

Component	Source	What it provides
mₜ (first moment)	Momentum	Smooths gradient direction, dampens noise
vₜ (second moment)	RMSprop	Per-parameter adaptive step size
Bias correction	Adam (new)	Corrects zero-initialization bias
Combined update	m̂/√v̂	Steps of ≈η in all dimensions
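To make the "≈η in all dimensions" row concrete, here is a small sketch of one Adam step over a list of parameters. The function `adam_step` and the two gradient values are assumptions chosen for illustration (four orders of magnitude apart); it uses plain Python lists rather than NumPy arrays to stay dependency-free.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied element-wise to lists w, g, m, v."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second moment
        m_hat = mi / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

w0 = [1.0, 1.0]
g0 = [100.0, 0.01]     # hypothetical gradients, 10,000x apart
w1, m1, v1 = adam_step(w0, g0, [0.0, 0.0], [0.0, 0.0], t=1)
deltas = [before - after for before, after in zip(w0, w1)]
print("step sizes:", [f"{d:.6f}" for d in deltas])
```

Both parameters move by about `lr` on the first step, which is the behavior the table summarizes.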

Understanding Bias Correction

Both $m$ and $v$ are initialized to zero. This creates a problem: in the first few steps, they are biased toward zero. Bias correction fixes this.

Why Zero Initialization Causes Bias

At step 1 with $\beta_1 = 0.9$ :

$m_1 = 0.9 \times 0 + 0.1 \times g_1 = 0.1 \cdot g_1$

The expected value of $m_1$ is $0.1 \cdot E[g]$ , not $E[g]$ . It underestimates the true gradient mean by a factor of 10. In general, after $t$ steps:

$E[m_t] = (1 - \beta_1^t) \cdot E[g]$

Dividing by $(1 - \beta_1^t)$ gives $E[\hat{m}_t] = E[g]$ — an unbiased estimate.

The Second Moment Bias Is Much Worse

For $v$ with $\beta_2 = 0.999$ , the bias at step 1 is:

$v_1 = 0.001 \cdot g_1^2 \quad \Rightarrow \quad E[v_1] = 0.001 \cdot E[g^2]$

That is 1000× too small. Without correction, $\sqrt{v_1}$ would be $\sqrt{0.001} \approx 0.032$ times the true RMS, making the step ~30× too large. The bias correction divides by $(1 - 0.999^1) = 0.001$ , multiplying $v$ by 1000 to recover the correct scale.

Step t	1 − β₁ᵗ (β₁=0.9)	m correction	1 − β₂ᵗ (β₂=0.999)	v correction
1	0.100	10.0×	0.001	1000×
5	0.410	2.4×	0.005	200×
10	0.651	1.5×	0.010	100×
50	0.995	1.005×	0.049	20.4×
100	≈1.0	none	0.095	10.5×
1000	≈1.0	none	0.632	1.58×

Notice: the first moment correction vanishes by step 50, but the second moment correction is still significant at step 1000. This is because $\beta_2 = 0.999$ is so close to 1 that the averaging window of $v$ is $1/(1 - \beta_2) = 1000$ steps, so $v$ needs a few thousand steps to reach steady state.
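The correction factors in the table follow directly from $1 / (1 - \beta^t)$. A short sketch that recomputes them (the helper name `correction` is ours):

```python
def correction(beta, t):
    """Multiplier applied by bias correction at step t: 1 / (1 - beta**t)."""
    return 1.0 / (1.0 - beta ** t)

print(f"{'t':>5} {'m correction':>14} {'v correction':>14}")
for t in (1, 5, 10, 50, 100, 1000):
    print(f"{t:>5} {correction(0.9, t):>13.3f}x {correction(0.999, t):>13.2f}x")
```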

RMSprop has no bias correction. This is why RMSprop can misbehave in the first few hundred steps: its $v$ is severely underestimated, leading to oversized steps. Adam's bias correction is one of its most important practical contributions.
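As a rough check of how oversized that first step is, the sketch below computes the very first update (starting from $v_0 = 0$) as a multiple of $\eta$, with and without dividing $v$ by $1 - \rho$. The gradient value and the function name are assumptions for illustration only.

```python
import math

def first_step_scale(rho, corrected, g=10.0, eps=1e-8):
    """Size of the first update, as a multiple of eta, starting from v = 0."""
    v = (1 - rho) * g ** 2             # v_1 when v_0 = 0
    if corrected:
        v = v / (1 - rho)              # Adam-style bias correction at t = 1
    return g / (math.sqrt(v) + eps)

for rho in (0.9, 0.999):
    raw = first_step_scale(rho, corrected=False)
    fixed = first_step_scale(rho, corrected=True)
    print(f"rho={rho}: uncorrected step = {raw:.2f} x eta, corrected = {fixed:.2f} x eta")
```

Without correction the first step is about 3× too large for $\rho = 0.9$ and about 30× too large for $\rho = 0.999$; with correction it is $\eta$ in both cases.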

Interactive: Adam Step by Step

The visualization below lets you step through Adam on $f(w) = w^2$ starting at $w = 5$ . Watch how the raw moments (m, v) slowly ramp up while the bias-corrected moments ( $\hat{m}$ , $\hat{v}$ ) stay stable from step 1. The purple (Adam) dot moves in constant steps of ~lr, while the green (Momentum) dot accelerates.

[Interactive visualization: Adam step by step]

Try these experiments:

Press play at default settings (lr=0.01, β₁=0.9, β₂=0.999). Watch momentum race ahead while Adam takes tiny constant steps. On this 1D problem, momentum wins.
Increase lr to 0.05: Adam takes bigger steps but still constant-sized. Momentum still wins because velocity builds up.
Set β₁=0.0: disables the momentum component. Adam becomes pure RMSprop-with-bias-correction. The steps are still approximately lr because the second moment normalizes the gradient.
Set β₂=0.9 (instead of 0.999): the second moment tracks recent gradients more aggressively. Watch how the raw v ramps up faster, while $\hat{v}$ stays approximately the same due to bias correction.

Why momentum wins here: on a 1D problem, there is only one parameter. Adam's per-parameter adaptation has nothing to adapt between. The ratio $\hat{m}/\sqrt{\hat{v}} \approx 1$ always, so the step is just lr. Momentum accelerates because velocity compounds in a consistent direction. Adam's advantage appears when parameters have different gradient scales — which requires at least two dimensions.

Interactive: Optimizer Comparison

The visualization below shows all four optimizers — SGD (red), Momentum (green), RMSprop (blue), and Adam (purple) — racing toward the minimum of the Rosenbrock function: $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$ . This function has a famous curved valley (banana shape) where the minimum sits at $(1, 1)$ . It is a classic test because the curvature differs enormously along the valley floor versus across the walls.

[Interactive visualization: optimizer comparison on the Rosenbrock function]

What to look for:

SGD (red) barely moves — the learning rate must be tiny to avoid divergence in the steep direction, so progress in the gentle direction is glacial
Momentum (green) follows the valley but oscillates across the walls, especially at higher learning rates
RMSprop (blue) makes equal-sized steps in both directions but has no momentum to smooth the path
Adam (purple) combines the best of both — it follows the valley floor smoothly with adaptive steps in both directions

Try increasing the learning rate to see how each optimizer handles it. SGD diverges first, then momentum starts oscillating. RMSprop and Adam are much more robust because they normalize by gradient magnitude.
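For readers who want to experiment outside the visualization, here is a sketch of the Rosenbrock function and its analytic gradient, with a finite-difference sanity check. The function names and the probe points are our own choices.

```python
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    """Analytic gradient (df/dx, df/dy) of the Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return dx, dy

# Central finite differences at an arbitrary probe point.
h = 1e-6
x0, y0 = -0.5, 0.8
fd_dx = (rosenbrock(x0 + h, y0) - rosenbrock(x0 - h, y0)) / (2 * h)
fd_dy = (rosenbrock(x0, y0 + h) - rosenbrock(x0, y0 - h)) / (2 * h)
gx, gy = rosenbrock_grad(x0, y0)
print(f"analytic: ({gx:.3f}, {gy:.3f})  finite-diff: ({fd_dx:.3f}, {fd_dy:.3f})")

grad_at_min = rosenbrock_grad(1.0, 1.0)       # zero at the minimum (1, 1)
grad_off_floor = rosenbrock_grad(0.0, 0.1)    # a small step up the valley wall
print("gradient at minimum:", grad_at_min)
print("gradient just off the valley floor:", grad_off_floor)
```

At the point $(0, 0.1)$ the $y$ component of the gradient is ten times the $x$ component, which is the scale mismatch that forces plain SGD to use a tiny learning rate.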

NumPy: Adam from Scratch

Let's implement Adam from scratch and trace every internal variable. We use the same $f(w) = w^2$ problem from Section 1 so we can directly compare Adam against vanilla SGD (w=4.25 after 8 steps) and momentum (w=2.44 after 8 steps).

NumPy: Adam Step by Step on f(w) = w²

🐍adam_from_scratch.py


Line 1: import numpy as np

NumPy provides the numerical array type and math functions. Here we only need np.sqrt() for the denominator of Adam’s update rule. The rest is scalar arithmetic on our single weight.

EXECUTION STATE

numpy = Numerical computing library — we use np.sqrt() for the square root in Adam’s denominator

Line 3: # Minimize f(w) = w² starting at w=5

The same toy problem from Section 1 so we can directly compare Adam against vanilla SGD (w=4.25 after 8 steps) and momentum (w=2.44 after 8 steps). The minimum is at w=0 where loss=0.

EXECUTION STATE

f(w) = w² = A simple parabola. Starting loss: f(5) = 25. Target: w=0, loss=0. Same problem used in Section 1 for SGD and momentum.

Line 4: # Gradient: df/dw = 2w

The derivative of w² is 2w. This is the only gradient we ever compute. At w=5: grad=10. At w=0: grad=0 (the minimum).

Line 5: # Minimum: w* = 0

The optimal weight where loss is zero. All three optimizers race toward this point.

Line 7: # === Adam Optimizer ===

Adam maintains TWO running averages: the first moment m (mean of gradients, like momentum) and the second moment v (mean of squared gradients, like RMSprop). Together with bias correction, these produce the adaptive step.

Line 8 (starting weight): w = 5.0

Same starting point as Section 1. Loss = 5² = 25. This allows direct comparison: how far does Adam get in 8 steps vs momentum (2.44) and vanilla SGD (4.25)?

EXECUTION STATE

w = 5.0 = Starting weight. Loss = 25. Distance from minimum: 5.0 units.

Line 9 (Adam’s learning rate): lr = 0.01

The default Adam learning rate in PyTorch is 0.001. We use 0.01 here to match Section 1’s SGD learning rate for fair comparison. Importantly, Adam’s effective step size is approximately lr regardless of gradient magnitude, because the m̂/√v̂ ratio normalizes it to ≈1.

EXECUTION STATE

lr = 0.01 = Learning rate. Unlike SGD where step = lr×grad, Adam’s step ≈ lr×sign(grad). So lr directly controls the step size, independent of gradient magnitude.

→ Why different from SGD? = In SGD, step = lr×grad = 0.01×10 = 0.1. In Adam, step ≈ lr = 0.01. Adam takes smaller, more controlled steps.

Line 10 (momentum coefficient): beta1 = 0.9

Controls the exponential decay rate of the first moment (gradient mean). β₁=0.9 means m is a weighted average of the last ~10 gradients. This is identical to the momentum coefficient from Section 1 — m is Adam’s momentum buffer.

EXECUTION STATE

β₁ = 0.9 = First moment decay. Effective window: 1/(1−0.9) = 10 gradients. Higher β₁ = more momentum, smoother but slower to adapt. β₁=0 disables momentum entirely.

→ Typical values = β₁=0.9 (standard), β₁=0.95 (heavy momentum for large batches)

Line 11 (RMSprop coefficient): beta2 = 0.999

Controls the exponential decay rate of the second moment (the mean of squared gradients). β₂=0.999 means v is a weighted average of the last ~1000 squared gradients. This much longer window gives a stable estimate of the gradient’s typical magnitude.

EXECUTION STATE

β₂ = 0.999 = Second moment decay. Effective window: 1/(1−0.999) = 1000 gradients. Much longer than β₁ because we need a stable variance estimate. β₂=0.999 is the standard default.

→ Why β₂ >> β₁? = The mean (m) should track recent gradient direction quickly. The variance (v) should be stable — we don’t want the learning rate to oscillate. Hence β₂ uses a 100× longer window than β₁.

Line 12 (numerical stability): eps = 1e-8

A tiny constant added to the denominator to prevent division by zero. If v̂=0 (every gradient so far was exactly zero), then m̂=0 as well and the update would be 0/0, which is NaN. Adding 10⁻⁸ turns that into a clean zero step instead.

EXECUTION STATE

ε = 1e-8 = Prevents division by zero in the update: lr × m̂ / (√v̂ + ε). This is the value from the original Adam paper and the PyTorch default; the Keras default is 1e-7.

Line 13 (first moment, momentum buffer): m = 0.0

The first moment estimate starts at zero. This creates a bias: after one step, m = 0.1×grad instead of the true gradient mean. Bias correction (line 21) fixes this by dividing by (1−β₁ᵗ).

EXECUTION STATE

m = 0.0 = Initialized to zero. After 1 step: m = 0.1×10.0 = 1.0 (biased low — true gradient is 10.0). After bias correction: m̂ = 1.0/0.1 = 10.0 (correct).

Line 14 (second moment, squared-gradient buffer): v = 0.0

The second moment estimate also starts at zero. This creates an even larger bias because β₂=0.999 is so close to 1. After one step: v = 0.001×100 = 0.1, but the true squared gradient is 100. The bias factor is 1000×! Without correction, the early steps would be wildly off.

EXECUTION STATE

v = 0.0 = Initialized to zero. After 1 step: v = 0.001×100 = 0.1 (biased extremely low — true g²=100). Without correction: step = 0.01×1.0/√0.1 = 0.0316 (3× too large!). With correction: step = 0.01 (correct).

Line 16: print("=== Adam ... ===")

Prints a header with the hyperparameters so we can identify this optimizer’s output.

Line 17 (8 Adam steps): for t in range(1, 9):

Run 8 steps of Adam, matching Section 1’s 8 steps for SGD and momentum. The loop variable t starts at 1 (not 0) because Adam uses t in the bias correction formula: (1 − βᵗ). At t=0, the correction would divide by zero.

LOOP TRACE · 8 iterations

t=1

Adam = g=10.000 → m=1.0000, v=0.1000, m̂=10.000, v̂=100.0 → step=0.0100, w=4.9900

t=2

Adam = g=9.980 → m=1.8980, v=0.1995, m̂=9.989, v̂=99.8 → step=0.0100, w=4.9800

t=3

Adam = g=9.960 → m=2.7042, v=0.2985, m̂=9.978, v̂=99.6 → step=0.0100, w=4.9700

t=4

Adam = g=9.940 → m=3.4278, v=0.3970, m̂=9.968, v̂=99.4 → step=0.0100, w=4.9600

t=5

Adam = g=9.920 → m=4.0770, v=0.4950, m̂=9.956, v̂=99.2 → step=0.0100, w=4.9500

t=6

Adam = g=9.900 → m=4.6593, v=0.5926, m̂=9.943, v̂=99.0 → step=0.0100, w=4.9400

t=7

Adam = g=9.880 → m=5.1814, v=0.6897, m̂=9.931, v̂=98.8 → step=0.0100, w=4.9300

t=8

Adam = g=9.860 → m=5.6492, v=0.7863, m̂=9.919, v̂=98.6 → step=0.0100, w=4.9200

Line 18 (compute gradient): grad = 2 * w

The gradient of f(w) = w² is df/dw = 2w. At step 1: grad = 2×5.0 = 10.0. This value gets fed into both the first moment (m) and second moment (v) updates. Notice the gradient barely changes across steps (10.0 → 9.86) because Adam takes tiny 0.01-size steps.

EXECUTION STATE

grad at t=1 = 2 × 5.0 = 10.000

grad at t=8 = 2 × 4.93 = 9.860

→ Contrast with momentum = In Section 1, momentum reached w=2.44 by step 8, so its gradient was 2×2.44=4.88. Adam’s gradient barely changed because w barely moved.

Line 19 (first moment, momentum): m = β₁ × m + (1−β₁) × grad

This is the momentum step — an exponential moving average of the gradient. With β₁=0.9, new gradients get 10% weight and the old running average gets 90%. After many steps, m approximates the average gradient direction. This is Adam’s first building block.

EXECUTION STATE

📚 Exponential moving average = EMA(x) = α×new + (1−α)×old. Here α=(1−β₁)=0.1. The EMA tracks the gradient’s running mean, smoothing out noise just like momentum does.

β₁ × m (retain 90% of old) = t=1: 0.9 × 0 = 0. t=2: 0.9 × 1.0 = 0.9. The old estimate gets exponentially downweighted.

(1−β₁) × grad (add 10% of new) = t=1: 0.1 × 10.0 = 1.0. t=2: 0.1 × 9.98 = 0.998. Only a fraction of the new gradient enters m.

→ m after each step = t=1: 1.0000 t=2: 1.8980 t=3: 2.7042 t=4: 3.4278 t=5: 4.0770 t=6: 4.6593 t=7: 5.1814 t=8: 5.6492

→ Notice = m is biased LOW. True gradient mean ≈ 10.0, but m = 5.65 after 8 steps. That’s because m started at 0 and only slowly ramps up. Bias correction (line 21) fixes this.

Line 20 (second moment, RMSprop): v = β₂ × v + (1−β₂) × grad²

This is the RMSprop step — an exponential moving average of the SQUARED gradient. With β₂=0.999, only 0.1% of the new squared gradient enters v per step. After many steps, v approximates E[g²] — the typical magnitude of the gradient. This is Adam’s second building block.

EXECUTION STATE

grad ** 2 = t=1: 10.0² = 100.0. t=8: 9.86² = 97.22. Squaring makes everything positive, so v tracks magnitude regardless of gradient sign.

(1−β₂) × grad² = t=1: 0.001 × 100.0 = 0.100. Extremely small contribution — β₂=0.999 means only 0.1% of the new information enters per step.

→ v after each step = t=1: 0.1000 t=2: 0.1995 t=3: 0.2985 t=4: 0.3970 t=5: 0.4950 t=6: 0.5926 t=7: 0.6897 t=8: 0.7863

→ Bias is severe = v = 0.786 after 8 steps, but true E[g²] ≈ 100. That’s 127× too small! This is why bias correction matters so much for v: β₂⁸ = 0.992, so (1−β₂⁸) = 0.008, correction = 125×.

Line 21 (bias-corrected first moment): m_hat = m / (1 − β₁ᵗ)

Divides m by (1 − β₁ᵗ) to correct the zero-initialization bias. At t=1, the correction is 10× (dividing by 0.1). At t=10, the correction is only 1.54×. By t=50, it is 1.005× — essentially no correction needed.

EXECUTION STATE

1 − β₁ᵗ = t=1: 1−0.9 = 0.100 (big correction). t=2: 1−0.81 = 0.190. t=8: 1−0.430 = 0.570.

→ m̂ after each step = t=1: 1.0/0.1 = 10.000 t=2: 1.898/0.19 = 9.989 t=8: 5.649/0.570 = 9.919

→ Key insight = m̂ ≈ 10.0 at every step! Bias correction perfectly recovers the true gradient mean, even though the raw m is heavily biased. This is why Adam works correctly from step 1.

Line 22 (bias-corrected second moment): v_hat = v / (1 − β₂ᵗ)

Same correction for the second moment. Because β₂=0.999, the correction is MUCH larger: at t=1, it multiplies by 1000×. Without this correction, √v would be ~0.316 instead of ~10, making the step 30× too large.

EXECUTION STATE

1 − β₂ᵗ = t=1: 1−0.999 = 0.001 (1000× correction!). t=2: 0.002. t=8: 0.008 (125× correction). t=100: 0.095 (10.5×). t=1000: 0.632 (1.58×).

→ v̂ after each step = t=1: 0.1/0.001 = 100.0 t=2: 0.2/0.002 = 99.8 t=8: 0.786/0.008 = 98.6

→ Correction matters for thousands of steps = The β₂ correction is still 1.58× at t=1000 and takes a few thousand steps to become negligible (vs ~50 for β₁). This is the biggest practical difference between the two bias corrections.

Line 23 (the Adam update): w = w − lr × m̂ / (√v̂ + ε)

The core Adam update: divide the corrected momentum (m̂) by the corrected RMS gradient (√v̂). The ratio m̂/√v̂ is approximately sign(grad) × 1 when the gradient is consistent, so the effective step is approximately lr. This is why Adam takes nearly constant-size steps.

EXECUTION STATE

📚 Adam update formula = w = w − η × m̂ / (√v̂ + ε). The denominator √v̂ normalizes by gradient magnitude. If grads are large (steep), √v̂ is large → smaller step. If grads are small (flat), √v̂ is small → larger step. Per-parameter adaptation!

m̂ / √v̂ at t=1 = 10.0 / √100.0 = 10.0 / 10.0 = 1.000. The ratio is exactly 1 because with consistent gradients, mean(g)/√mean(g²) = g/|g| = 1.

lr × (m̂/√v̂) at t=1 = 0.01 × 1.000 = 0.0100. Step size = lr exactly.

↑ result: step sizes = t=1: 0.0100 t=2: 0.0100 t=3: 0.0100 ... t=8: 0.0100. Every step is ~lr. This is Adam’s signature: constant step size regardless of gradient magnitude.

→ w after each step = t=1: 4.9900 t=2: 4.9800 t=3: 4.9700 t=4: 4.9600 t=5: 4.9500 t=6: 4.9400 t=7: 4.9300 t=8: 4.9200

Line 24 (display step results): print(...)

Prints the gradient, corrected first moment, and current weight at each step. The output reveals Adam’s constant-step-size property: m̂ stays near 10.0 while w decreases by almost exactly 0.01 per step.

EXECUTION STATE

Expected output = t=1: g=10.000 m̂=10.000 w=4.9900 t=2: g=9.980 m̂=9.989 w=4.9800 ... t=8: g=9.860 m̂=9.919 w=4.9200

Line 26: print("\nAfter 8 steps:")

Prints the final comparison showing all three optimizers on the same problem.

Line 27: Adam result: w=4.9200, loss=24.21

Adam moved w from 5.0 to 4.92 — only 0.08 units in 8 steps. Loss dropped from 25 to 24.21, just a 3% reduction. This is MUCH slower than momentum (76% reduction) on this problem. See the text below for why — and when Adam wins.

EXECUTION STATE

↑ Adam: w = 4.9200 = loss = 4.92² = 24.21. Moved 0.08 units from start. 3% loss reduction.

Line 28: Momentum result: w=2.4421, loss=5.96

From Section 1: momentum moved w from 5.0 to 2.44 in 8 steps — a 76% loss reduction. Momentum wins decisively on this 1D problem because its velocity builds up and accelerates.

EXECUTION STATE

Momentum: w = 2.4421 = loss = 5.96. Moved 2.56 units. 76% loss reduction. 32× more progress than Adam on this problem!

Line 29: Vanilla SGD result: w=4.2539, loss=18.10

From Section 1: vanilla SGD moved w from 5.0 to 4.25 in 8 steps. Interestingly, vanilla SGD beats Adam here because SGD’s step size (lr×grad = 0.1) is 10× larger than Adam’s normalized step (lr = 0.01).

EXECUTION STATE

Vanilla SGD: w = 4.2539 = loss = 18.10. Moved 0.75 units. 28% loss reduction. Even vanilla SGD beats Adam on this 1D problem!


import numpy as np

# ── Minimize f(w) = w² starting at w=5 ──
# Gradient: df/dw = 2w
# Minimum: w* = 0

# === Adam Optimizer ===
w = 5.0        # starting weight
lr = 0.01      # learning rate (matches Section 1)
beta1 = 0.9    # first-moment decay
beta2 = 0.999  # second-moment decay
eps = 1e-8     # numerical stability term
m = 0.0        # first moment: moving average of gradients
v = 0.0        # second moment: moving average of squared gradients

print("=== Adam (lr=0.01, β₁=0.9, β₂=0.999) ===")
for t in range(1, 9):  # t starts at 1 so the bias correction never divides by zero
    grad = 2 * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    print(f"t={t}: g={grad:.3f}  m̂={m_hat:.3f}  w={w:.4f}")

print("\nAfter 8 steps:")
print(f"  Adam:        w={w:.4f}  loss={w**2:.2f}")
print("  Momentum:    w=2.4421  loss=5.96")   # hard-coded from Section 1
print("  Vanilla SGD: w=4.2539  loss=18.10")  # hard-coded from Section 1
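The two baseline numbers in the listing are hard-coded from Section 1. The sketch below recomputes them under two assumptions about that section's setup: the same `lr = 0.01`, and the PyTorch-style momentum update `vel = beta * vel + grad` with `beta = 0.9`. The variable names are ours.

```python
lr, beta, n_steps = 0.01, 0.9, 8

# Vanilla SGD on f(w) = w**2
w_sgd = 5.0
for _ in range(n_steps):
    w_sgd -= lr * 2 * w_sgd

# SGD with momentum (velocity accumulates raw gradients, no (1 - beta) factor)
w_mom, vel = 5.0, 0.0
for _ in range(n_steps):
    vel = beta * vel + 2 * w_mom
    w_mom -= lr * vel

print(f"Vanilla SGD: w={w_sgd:.4f}  loss={w_sgd ** 2:.2f}")
print(f"Momentum:    w={w_mom:.4f}  loss={w_mom ** 2:.2f}  final velocity={vel:.2f}")
```

Under these assumptions the results agree with the quoted values to about three decimal places, which suggests this is the update rule Section 1 used.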

The results reveal Adam's behavior clearly:

Optimizer	w after 8 steps	Loss	Reduction
SGD + Momentum	2.4421	5.96	76%
Vanilla SGD	4.2539	18.10	28%
Adam	4.9200	24.21	3%

Adam is the slowest on this 1D problem. This is not a bug — it is an important property to understand.

Why Adam Is Slow Here (And Why It Doesn't Matter)

On a 1D problem with a consistent gradient, Adam's ratio $\hat{m}/\sqrt{\hat{v}}$ is always approximately 1. So the step is just $\eta = 0.01$: no acceleration, no adaptation. Momentum, by contrast, builds up velocity: by step 8 its velocity has grown to about 45, so its step ($0.01 \times 45 \approx 0.45$) is ~4.5× the first SGD step of 0.1.

But on a real neural network with hundreds or thousands of parameters at different scales, Adam's per-parameter adaptation is transformative. A parameter with large gradients ( $\sqrt{\hat{v}} = 10$ ) gets an effective lr of $0.001/10 = 0.0001$ . A parameter with tiny gradients ( $\sqrt{\hat{v}} = 0.001$ ) gets an effective lr of $0.001/0.001 = 1.0$ . Each parameter gets exactly the learning rate it needs.

The lesson: never judge an optimizer on a toy problem. Adam was designed for high-dimensional, noisy, non-stationary optimization. Its power comes from per-parameter adaptation, which is invisible in 1D. The PyTorch example below shows Adam on a real network where it shines.

PyTorch: Training with Adam

Now let's see Adam on a real network. We train the same diagonal flip network from Section 1, comparing SGD+momentum against Adam. Both models start with identical weights. The only difference is the optimizer.

PyTorch: SGD+Momentum vs Adam on the Diagonal Flip Network

🐍train_with_adam.py


Line 1: import torch

PyTorch provides automatic differentiation and GPU-accelerated tensor operations. We use it for neural network training with built-in Adam and SGD optimizers.

EXECUTION STATE

torch = Deep learning framework with autograd, nn.Module, and optimizers (Adam, SGD, etc.)

Line 2: import torch.nn as nn

The neural network module provides layers (Linear, ReLU) and loss functions. We use nn.Module as the base class and nn.Linear for fully-connected layers.

Line 4 (reproducibility): torch.manual_seed(42)

Fixes the random number generator so both models start with identical weights. This ensures any performance difference comes from the optimizer, not random initialization.

Line 6 (same architecture from Chapters 7–8): class FlipNet(nn.Module):

The diagonal flip network: 4→3→4 with ReLU. Takes a flattened 2×2 binary image (4 pixels) and outputs the diagonally flipped version. 31 total parameters: layer1 has 4×3+3=15, layer2 has 3×4+4=16.

Line 7: def __init__(self):

Constructor defining the network layers. Identical to Section 1 — we reuse the same architecture to isolate the optimizer comparison.

Line 8: super().__init__()

Calls nn.Module’s constructor to initialize PyTorch’s parameter tracking system. Required for all nn.Module subclasses.

Line 9 (input layer): self.layer1 = nn.Linear(4, 3)

First linear layer: 4 inputs → 3 outputs. Weight matrix W₁ is (3×4) with bias b₁ of size 3. Total: 15 parameters. Each parameter gets its own m and v buffers in Adam (so 15×2 = 30 extra floats).

EXECUTION STATE

📚 nn.Linear(4, 3) = Creates a fully-connected layer: output = input @ W.T + b. W shape: (3, 4), b shape: (3). Initialized with Kaiming uniform distribution.

Line 10: self.relu = nn.ReLU()

ReLU activation: max(0, x). Introduces non-linearity so the network can learn the diagonal flip mapping.

Line 11 (output layer): self.layer2 = nn.Linear(3, 4)

Second linear layer: 3 inputs → 4 outputs. Weight matrix W₂ is (4×3) with bias b₂ of size 4. Total: 16 parameters. Combined with layer1: 31 parameters, each getting its own adaptive learning rate from Adam.

Line 12 (forward pass): def forward(self, x):

Defines the computation: input → layer1 → ReLU → layer2 → output. Called automatically when you write model(x).

Line 13: return self.layer2(self.relu(self.layer1(x)))

The full forward pass chain. For input [1,0,1,0], this computes: layer1([1,0,1,0]) → ReLU → layer2 → prediction for [0,1,0,1] (diagonal flip).

Line 15 (all 16 binary images): X = torch.tensor([...])

Creates all 16 possible 2×2 binary images as flattened 4-element vectors. This is the complete training set — no data is held out. Example: [0,0,0,0], [0,0,0,1], ..., [1,1,1,1].

EXECUTION STATE

X shape = (16, 4) — 16 images, each with 4 pixels

Line 18 (diagonal flip targets): Y = X[:, [3, 2, 1, 0]]

The target output is the input with pixels reversed: [a,b,c,d] → [d,c,b,a]. This reversal represents a diagonal flip of the 2×2 image. Example: [1,0,1,0] → [0,1,0,1].

EXECUTION STATE

[:, [3,2,1,0]] = NumPy/PyTorch fancy indexing: reorders columns. Column 0→3, 1→2, 2→1, 3→0. Reverses the pixel order.

Line 20 (momentum model): model_mom = FlipNet()

Creates the first network instance with random weights (seeded by manual_seed(42)). This model will be trained with SGD+momentum.

Line 21 (Adam model): model_adam = FlipNet()

Creates a second network. But wait — this has DIFFERENT random weights than model_mom (the seed was consumed). We fix this on line 22.

Line 22 (copy weights): model_adam.load_state_dict(model_mom.state_dict())

Copies model_mom’s weights into model_adam, so both start identical. This is critical: any performance difference is now purely due to the optimizer, not initialization.

EXECUTION STATE

📚 .state_dict() = Returns a dictionary of all parameter tensors: {"layer1.weight": tensor, "layer1.bias": tensor, ...}. load_state_dict() copies these values in.

→ After copy = model_mom.layer1.weight == model_adam.layer1.weight (element-wise True for all 31 parameters). Fair comparison guaranteed.

Line 24 (from Section 1): opt_mom = torch.optim.SGD(..., momentum=0.9)

The same SGD+momentum optimizer from Section 1: lr=0.05, β=0.9. Maintains one velocity buffer per parameter (31 buffers). Update: v = 0.9v + grad, then w = w − 0.05v.

EXECUTION STATE

📚 torch.optim.SGD(params, lr, momentum) = SGD optimizer. With momentum>0: v_t = β×v_{t-1} + grad, w = w − lr×v_t. Memory: 31 parameters + 31 velocity buffers = 62 floats.

lr = 0.05 = 50× larger than Adam’s default (0.001). SGD+momentum needs a larger lr because it doesn’t normalize by gradient magnitude. This lr was tuned in Section 1.

Line 26 (the key line): opt_adam = torch.optim.Adam(..., lr=0.001)

Creates an Adam optimizer with default settings: lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8. Adam maintains TWO buffers per parameter: m (first moment) and v (second moment). So it uses 31×2 = 62 extra floats, compared to momentum’s 31.

EXECUTION STATE

📚 torch.optim.Adam(params, lr, betas, eps) = Adam optimizer. Maintains per-parameter m and v buffers. Update: m = β₁m + (1−β₁)g, v = β₂v + (1−β₂)g², then w = w − lr×m̂/(√v̂+ε).

⬇ lr = 0.001 = Adam’s default. 50× smaller than the momentum lr (0.05). This works because Adam normalizes each step: the step size is roughly lr regardless of gradient magnitude. The default is a strong starting point, although some problems still benefit from tuning it.

→ Default betas = β₁=0.9, β₂=0.999 (not specified, using PyTorch defaults). These are the same values we used in the NumPy code above.

→ Memory cost comparison = SGD+momentum: 31 params + 31 velocity buffers = 62 floats. Adam: 31 params + 31 m buffers + 31 v buffers = 93 floats. Adam uses 50% more memory per parameter.
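The buffer counts can be read directly from each optimizer's `state` after one step. This is a sketch under the assumption that the state keys are named `momentum_buffer`, `exp_avg`, and `exp_avg_sq`, as in current PyTorch releases; the `nn.Sequential` model is a stand-in for FlipNet.

```python
import torch
import torch.nn as nn

def make_net():
    # Stand-in with FlipNet's layer sizes (hypothetical helper).
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 4))

model_s, model_a = make_net(), make_net()
opt_s = torch.optim.SGD(model_s.parameters(), lr=0.05, momentum=0.9)
opt_a = torch.optim.Adam(model_a.parameters(), lr=0.001)

x = torch.ones(4)
for model, opt in ((model_s, opt_s), (model_a, opt_a)):
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()          # optimizer buffers are created lazily on the first step

def count_floats(opt, keys):
    # Sum the element counts of the named buffers over all parameters.
    return sum(state[k].numel() for state in opt.state.values() for k in keys)

sgd_floats = count_floats(opt_s, ["momentum_buffer"])
adam_floats = count_floats(opt_a, ["exp_avg", "exp_avg_sq"])
print(sgd_floats, adam_floats)
```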

29for epoch in range(30): — Training loop

Train both models for 30 epochs. Each epoch processes all 16 images one at a time (batch_size=1). That’s 30×16 = 480 gradient updates per model. We compare the loss curves to see which optimizer converges faster.

30loss_m = loss_a = 0.0 — Reset epoch losses

Track total loss over the 16 images for each model separately, so we can compute average loss per image at the end of each epoch.

31for i in range(16): — Iterate over all images

Processes each of the 16 images individually (SGD, batch_size=1). Each image produces a noisy gradient that gets smoothed by momentum/Adam’s running averages.

32pred = model_mom(X[i]) — Momentum forward pass

Runs input X[i] through the momentum model: layer1 → ReLU → layer2. Returns a 4-element prediction.

33lm = torch.mean((pred - Y[i]) ** 2) — MSE loss

Mean squared error between prediction and target. Averages the 4 squared differences into a single scalar loss.

34opt_mom.zero_grad() — Clear momentum gradients

Clears the .grad attribute of all 31 parameters (recent PyTorch versions set it to None by default instead of filling it with zeros). Does NOT clear the velocity buffer; that persists across steps.

35lm.backward() — Backpropagation

Computes dL/dw for all 31 parameters via the chain rule. After this, each parameter’s .grad is populated.

36opt_mom.step() — SGD+Momentum update

For each parameter: v = 0.9v + grad, then w = w − 0.05v. The velocity buffer accumulates gradient direction, so when successive gradients agree the effective step grows to several times the raw gradient step (at most 1/(1 − 0.9) = 10×).
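The update rule above can be checked against `torch.optim.SGD` by running both side by side. This is a minimal sketch on a made-up 5-element regression problem; `target` and the loss are assumptions chosen so the gradient can be written analytically.

```python
import torch

torch.manual_seed(0)
w_ref = torch.randn(5, requires_grad=True)     # updated by torch.optim.SGD
w_manual = w_ref.detach().clone()               # updated by hand
velocity = torch.zeros_like(w_manual)
target = torch.arange(5.0)                      # hypothetical regression target

opt = torch.optim.SGD([w_ref], lr=0.05, momentum=0.9)

for _ in range(5):
    loss = ((w_ref - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Analytic gradient of the same mean-squared loss.
    grad = 2 * (w_manual - target) / w_manual.numel()
    velocity = 0.9 * velocity + grad             # v = 0.9 v + grad
    w_manual = w_manual - 0.05 * velocity        # w = w - lr * v

print(torch.allclose(w_ref.detach(), w_manual, atol=1e-6))
```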

37loss_m += lm.item() — Track momentum loss

Adds this image’s loss (as a Python float) to the running total. We divide by 16 later for the average.

39pred = model_adam(X[i]) — Adam forward pass

Same architecture, same input, but the Adam model’s weights diverge from momentum’s as training progresses because the two optimizers update differently.

40la = torch.mean((pred - Y[i]) ** 2) — Adam MSE loss

Same loss function as momentum. The difference in loss values reveals which optimizer is making faster progress.

41opt_adam.zero_grad() — Clear Adam gradients

Clears .grad for the Adam model’s 31 parameters. Like momentum, zero_grad() does NOT clear Adam’s internal m and v buffers.

EXECUTION STATE

→ What persists = m (first moment) and v (second moment) buffers are stored inside the optimizer and persist across steps. Only .grad is cleared. This is the whole point of adaptive optimization.
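A quick way to see what survives `zero_grad()` is to snapshot the first-moment buffer before and after the call. The sketch assumes the buffer is stored under the key `exp_avg` in `opt.state`, as in current PyTorch releases; the two-element parameter is a made-up example.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([w], lr=0.001)

(w ** 2).sum().backward()        # gradient is 2w = [2, -4]
opt.step()                       # m = 0.9*0 + 0.1*grad = [0.2, -0.4]

m_before = opt.state[w]["exp_avg"].clone()
opt.zero_grad()

# Depending on the PyTorch version, .grad is now either None or all zeros.
grad_cleared = w.grad is None or bool((w.grad == 0).all())
m_after = opt.state[w]["exp_avg"]

print(grad_cleared, m_after)
```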

42la.backward() — Adam backpropagation

Identical to momentum — backprop doesn’t know which optimizer will use the gradients. The chain rule computation is the same; only the update rule differs.

43opt_adam.step() — The Adam update

For each of the 31 parameters: (1) m = 0.9m + 0.1×grad, (2) v = 0.999v + 0.001×grad², (3) bias-correct both, (4) w = w − 0.001 × m̂/(√v̂+ε). Each parameter gets its OWN m, v, and effective learning rate — 31 independent adaptations.

EXECUTION STATE

→ Per-parameter adaptation = layer1.weight[0][0] might have lr≈ 0.0012 (small grads → large effective lr); layer2.bias[3] might have lr≈ 0.0008 (large grads → small effective lr). Each of the 31 params adapts independently.
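The four steps listed for `opt_adam.step()` can be replayed by hand and compared with PyTorch's result. This is a sketch on a made-up 5-element problem (the `target` vector and the loss are assumptions), not the tutorial's network.

```python
import torch

torch.manual_seed(0)
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

w_ref = torch.randn(5, requires_grad=True)      # updated by torch.optim.Adam
w = w_ref.detach().clone()                       # updated by hand
m = torch.zeros_like(w)
v = torch.zeros_like(w)
target = torch.arange(5.0)                       # hypothetical regression target

opt = torch.optim.Adam([w_ref], lr=lr, betas=(b1, b2), eps=eps)

for t in range(1, 6):
    loss = ((w_ref - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    g = 2 * (w - target) / w.numel()             # analytic gradient of the same loss
    m = b1 * m + (1 - b1) * g                    # (1) first moment
    v = b2 * v + (1 - b2) * g ** 2               # (2) second moment
    m_hat = m / (1 - b1 ** t)                    # (3) bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)    # (4) normalized step

print(torch.allclose(w_ref.detach(), w, atol=1e-6))
```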

44loss_a += la.item() — Track Adam loss

Accumulates the Adam model’s loss for this epoch.

46if epoch % 5 == 0: — Print every 5 epochs

Reports the average loss per image for both optimizers every 5 epochs. This shows the convergence curves side by side.

47print(...) — Epoch loss comparison

Prints momentum and Adam losses at this epoch. Expected: Adam reaches low loss faster, especially in the first 10–15 epochs, because it adapts per-parameter learning rates.

EXECUTION STATE

Expected output pattern (illustrative; exact values depend on the seed and PyTorch version) =
Epoch 0: Mom=0.4200 Adam=0.4200 (identical start)
Epoch 5: Mom=0.2800 Adam=0.1500 (Adam 2× faster)
Epoch 10: Mom=0.1800 Adam=0.0600 (Adam pulling ahead)
Epoch 15: Mom=0.1200 Adam=0.0200 (Adam nearly converged)
Epoch 20: Mom=0.0800 Adam=0.0080
Epoch 25: Mom=0.0500 Adam=0.0030

→ Why Adam wins here = The 31 parameters have different gradient magnitudes: layer1 weights see smaller (attenuated) grads than layer2 biases. Adam adapts each independently. Momentum uses the same lr for all 31.


 1  import torch
 2  import torch.nn as nn
 3
 4  torch.manual_seed(42)
 5
 6  class FlipNet(nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.layer1 = nn.Linear(4, 3)
10          self.relu = nn.ReLU()
11          self.layer2 = nn.Linear(3, 4)
12      def forward(self, x):
13          return self.layer2(self.relu(self.layer1(x)))
14
15  X = torch.tensor([[a, b, c, d] for a in [0, 1]
16                     for b in [0, 1] for c in [0, 1]
17                     for d in [0, 1]], dtype=torch.float32)
18  Y = X[:, [3, 2, 1, 0]]
19
20  model_mom = FlipNet()
21  model_adam = FlipNet()
22  model_adam.load_state_dict(model_mom.state_dict())
23
24  opt_mom = torch.optim.SGD(model_mom.parameters(),
25                             lr=0.05, momentum=0.9)
26  opt_adam = torch.optim.Adam(model_adam.parameters(),
27                               lr=0.001)
28
29  for epoch in range(30):
30      loss_m = loss_a = 0.0
31      for i in range(16):
32          pred = model_mom(X[i])
33          lm = torch.mean((pred - Y[i]) ** 2)
34          opt_mom.zero_grad()
35          lm.backward()
36          opt_mom.step()
37          loss_m += lm.item()
38
39          pred = model_adam(X[i])
40          la = torch.mean((pred - Y[i]) ** 2)
41          opt_adam.zero_grad()
42          la.backward()
43          opt_adam.step()
44          loss_a += la.item()
45
46      if epoch % 5 == 0:
47          print(f"Epoch {epoch:2d}: Mom={loss_m/16:.4f}  Adam={loss_a/16:.4f}")

On the real 31-parameter network, Adam converges significantly faster. The per-parameter adaptation gives each of the 31 parameters an individually tuned learning rate. Layer 1 weights (which receive attenuated gradients through the chain rule) automatically get a larger effective learning rate, while layer 2 biases (which receive direct gradients) get a smaller one.

Why Adam Wins on Real Networks

Factor	Momentum	Adam
Learning rate	0.05 (hand-tuned)	0.001 (default)
Per-parameter adaptation	No — same lr for all 31 params	Yes — each param gets its own effective lr
Gradient scale handling	Must tune lr to balance layers	Automatic — √v normalizes each param
Default robustness	Sensitive to lr choice	Works well with lr=0.001 out of the box
Memory (31 params)	62 floats (w + v)	93 floats (w + m + v)

Adam's default lr=0.001 is 50× smaller than momentum's lr=0.05. This sounds like a disadvantage, but it is not. Adam normalizes each parameter by its gradient RMS, so the effective step is governed by lr directly, not lr×gradient. A smaller lr is fine because Adam compensates with per-parameter scaling.
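The claim that Adam's step is governed by lr rather than lr×gradient is easy to check on the very first update, where the bias-corrected moments reduce to the gradient itself. The sketch below uses made-up gradient values spanning four orders of magnitude; the function names are illustrative.

```python
import math

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update, starting from m = v = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def sgd_step(grad, lr=0.05):
    """Size of a plain gradient step, which scales with the gradient."""
    return lr * grad

for g in (0.01, 0.1, 1.0, 100.0):
    print(f"grad={g:>6}: sgd step={sgd_step(g):.4f}  adam step={adam_first_step(g):.6f}")
```

The plain gradient step varies by a factor of 10,000 across these inputs, while Adam's first step stays at roughly 0.001 for all of them. On later steps the ratio $\hat{m}/\sqrt{\hat{v}}$ can fall below 1 when gradients are noisy, so lr acts as an approximate upper scale rather than an exact step size.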

AdamW: Weight Decay Done Right

In 2019, Loshchilov & Hutter published a seemingly minor paper that turned out to be extremely important. They showed that the standard way of adding L2 regularization to Adam is fundamentally broken, and proposed AdamW as the fix.

The Problem with L2 Regularization in Adam

L2 regularization adds a penalty $\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the loss. The gradient becomes $g_t + \lambda w_t$ . In vanilla SGD, this works correctly — the weight decay term $\lambda w_t$ shrinks each weight proportionally.

But in Adam, the gradient goes through the $\hat{m}/\sqrt{\hat{v}}$ normalization. The weight decay term gets scaled differently for each parameter. A parameter with large $\sqrt{\hat{v}}$ gets less weight decay than intended; a parameter with small $\sqrt{\hat{v}}$ gets more. The adaptive scaling, which is perfect for the gradient, distorts the regularization.

The AdamW Fix

AdamW decouples weight decay from the gradient update. Instead of adding $\lambda w$ to the gradient, AdamW applies it directly to the weights after the Adam step:

$w_{t+1} = w_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} - \eta \cdot \lambda \cdot w_t$

The gradient update ( $\hat{m}/\sqrt{\hat{v}}$ ) goes through Adam's normalization. The weight decay ( $\lambda w$ ) is applied directly, without any scaling. Every parameter gets the same proportional decay, regardless of its gradient history.

Method	Weight decay mechanism	Consistent across parameters?
Adam + L2	λw goes through m/√v normalization	No — each param gets different effective decay
AdamW	λw applied directly to weights	Yes — every param shrinks by the same fraction ηλ per step
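The difference between the two rows can be made visible with a deliberately extreme case: a weight whose loss gradient is exactly zero, so that only the decay term acts. This is a sketch with exaggerated, made-up settings (lr=0.1, weight_decay=0.01) chosen to make the effect easy to read.

```python
import torch

def one_step(optimizer_cls):
    w = torch.ones(1, requires_grad=True)
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.01)
    w.grad = torch.zeros_like(w)   # pretend the loss gradient is exactly zero
    opt.step()
    return w.item()

w_adam_l2 = one_step(torch.optim.Adam)    # decay folded into the gradient
w_adamw = one_step(torch.optim.AdamW)     # decay applied directly to w

print(f"Adam + L2: {w_adam_l2:.4f}   AdamW: {w_adamw:.4f}")
```

With Adam + L2 the decay term λw is the entire gradient, so normalization turns it into a full step of size lr and the weight drops from 1.0 to about 0.9, independent of λ. AdamW shrinks the weight by ηλw = 0.001, to about 0.999, which is the decay that was actually requested.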

In PyTorch, switching from Adam to AdamW is a one-line change:

opt = torch.optim.AdamW(params, lr=0.001, weight_decay=0.01)

AdamW is the standard optimizer for modern transformer training. Published recipes for models such as BERT, GPT-3, and LLaMA use Adam with decoupled weight decay, and it is the default choice for most large language models. If you are training a transformer, use AdamW with $\lambda = 0.01$–$0.1$, not Adam with L2 regularization.

Connection to Modern Training

Adam and AdamW are the workhorses of modern deep learning. Here is how they appear in practice.

Standard Hyperparameters for Transformers

Hyperparameter	Typical value	Why
Optimizer	AdamW	Decoupled weight decay is critical for regularization
lr	1e-4 to 3e-4	Lower than the default 1e-3 because large transformers tend to become unstable at higher learning rates
β₁	0.9	Standard momentum. Some use 0.9 → 0.95 for very large batches
β₂	0.95 or 0.999	LLaMA and GPT-3 use 0.95; BERT uses 0.999. Lower β₂ tracks gradient variance faster
ε	1e-8	Rarely changed. Some use 1e-6 for mixed-precision stability
weight_decay	0.01 or 0.1	Applied only to weight matrices, NOT biases or LayerNorm params
Gradient clipping	max_norm=1.0	Prevents gradient explosions from corrupting Adam’s buffers
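The table's settings translate into PyTorch roughly as follows. This is a sketch, not a reference recipe: the toy model is hypothetical, and the rule "tensors with 2 or more dimensions get weight decay" is a common heuristic assumed here for separating weight matrices from biases and LayerNorm parameters.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, just large enough to contain all three parameter kinds.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 2))

decay, no_decay = [], []
for name, p in model.named_parameters():
    # Heuristic (assumption): 2-D tensors are weight matrices,
    # 1-D tensors are biases or LayerNorm scale/shift parameters.
    (decay if p.ndim >= 2 else no_decay).append(p)

opt = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8,
)

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
opt.zero_grad()
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0,
# before they are folded into Adam's m and v buffers.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Clipping goes between `backward()` and `step()` so that a single exploding batch cannot inflate the second-moment buffer for many later steps.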

Learning Rate Warmup

Transformers use a learning rate warmup for the first 1,000–4,000 steps. The learning rate starts at 0 and linearly increases to the target value. This interacts with Adam in two ways:

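Adam's bias correction already guards the first steps. At step 1 with $\beta_2 = 0.999$, the raw second moment $v_1 = (1 - \beta_2) g_1^2$ is only 1/1000 of $g_1^2$; dividing by $1 - \beta_2$ scales $\hat{v}$ back up by 1000×, preventing oversized steps. Warmup adds additional safety on top of bias correction.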
The second moment $v$ needs time to converge. During warmup, $v$ builds up a reasonable estimate of the gradient variance. By the time lr reaches its target, $v$ is well-calibrated and the per-parameter scaling is reliable.
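The 1000× figure in the first point can be reproduced with a few lines of arithmetic. The gradient value `g` below is a made-up number; the factor does not depend on it.

```python
beta2 = 0.999
g = 0.5                               # hypothetical gradient at step 1

v1 = (1 - beta2) * g ** 2             # raw second moment after one update from v = 0
v1_hat = v1 / (1 - beta2 ** 1)        # bias-corrected estimate, equal to g**2

correction_factor = v1_hat / v1       # how much bias correction scales v up
oversize = (v1_hat ** 0.5) / (v1 ** 0.5)   # how much too large the step would be without it

print(f"v1={v1:.6f}  v1_hat={v1_hat:.6f}  factor={correction_factor:.1f}  oversize={oversize:.1f}x")
```

Because the update divides by the square root of the second moment, an uncorrected $v$ that is 1000× too small would inflate the step by about $\sqrt{1000} \approx 31.6$×.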

Cosine Learning Rate Decay

After warmup, most transformer training schedules follow a cosine decay:

$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T} \cdot \pi\right)\right)$

This smoothly decreases the learning rate from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps. The combination of AdamW + warmup + cosine decay is the standard training recipe for GPT, LLaMA, and most modern language models.
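Linear warmup followed by the cosine formula above fits in one small function. This is a minimal sketch; the default values for `max_lr`, `min_lr`, `warmup_steps`, and `total_steps` are illustrative assumptions, and $t$ is counted from the end of warmup.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to max_lr.
        return max_lr * step / warmup_steps
    T = total_steps - warmup_steps
    t = min(step - warmup_steps, T)          # clamp so lr stays at min_lr after T
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / T))

for s in (0, 1000, 2000, 51_000, 100_000):
    print(s, f"{lr_at(s):.2e}")
```

To use a function like this with a PyTorch optimizer, one option is `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplicative factor relative to the optimizer's base lr, so the function's output would need to be divided by `max_lr`.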

Memory Cost at Scale

Adam's two buffers per parameter become significant at scale. For a 7B parameter model (like LLaMA-7B):

Component	Size (fp32)	Size (fp16/bf16)
Model weights	28 GB	14 GB
Adam m buffer	28 GB	28 GB (kept in fp32)
Adam v buffer	28 GB	28 GB (kept in fp32)
Gradients	28 GB	14 GB
Total	112 GB	84 GB

The optimizer state (m + v) alone is 56 GB for a 7B model — twice the size of the model itself. This is why optimizer state sharding (like ZeRO from DeepSpeed and FSDP from PyTorch) is essential for training large models. These techniques distribute the m and v buffers across multiple GPUs.
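The table's numbers follow from multiplying the parameter count by the bytes per value. The helper below is a sketch with a hypothetical name; it takes 1 GB as 10^9 bytes and ignores activations and any fp32 master copy of the weights.

```python
def training_memory_gb(n_params, weight_bytes, grad_bytes, state_bytes=4):
    """Rough memory footprint of weights, gradients, and Adam state, in GB."""
    gb = 1e9
    parts = {
        "weights": n_params * weight_bytes / gb,
        "adam_m": n_params * state_bytes / gb,    # first moment, kept in fp32
        "adam_v": n_params * state_bytes / gb,    # second moment, kept in fp32
        "gradients": n_params * grad_bytes / gb,
    }
    parts["total"] = sum(parts.values())
    return parts

fp32 = training_memory_gb(7e9, weight_bytes=4, grad_bytes=4)
mixed = training_memory_gb(7e9, weight_bytes=2, grad_bytes=2)

print(fp32)
print(mixed)
```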

Adam Variants in Production

AdaFactor: reduces memory by storing only row and column statistics of the second moment instead of the full v matrix, and can optionally drop the m buffer. Used in T5 and some Google models.
LAMB (Layer-wise Adaptive Moments for Batch training): scales Adam's update by the ratio of parameter norm to update norm. Enables extremely large batch sizes (up to 64K). Used in BERT pretraining.
8-bit Adam: quantizes m and v buffers to 8-bit integers (via the bitsandbytes library), reducing optimizer memory by 4× with minimal quality loss. Increasingly popular for fine-tuning large models on consumer GPUs.
Lion (EvoLved Sign Momentum): discovered by Google through evolutionary search. Uses only the sign of the momentum update (no second moment), reducing memory to 1 buffer per parameter. Competitive with Adam on many tasks.
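To make the contrast with Adam concrete, here is a sketch of a single Lion update in NumPy, following the update rule as described in the Lion paper. The default hyperparameter values and the function name are assumptions for illustration, not a tuned recipe.

```python
import numpy as np

def lion_step(w, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update: sign of an interpolated momentum, one buffer per parameter."""
    # Direction: only the sign survives, so every coordinate moves by exactly lr.
    direction = np.sign(beta1 * m + (1 - beta1) * grad)
    w = w - lr * (direction + weight_decay * w)   # decoupled decay, as in AdamW
    # The single momentum buffer is updated with a second, slower coefficient.
    m = beta2 * m + (1 - beta2) * grad
    return w, m

w0 = np.array([1.0, -1.0, 0.5])
g = np.array([0.3, -20.0, 1e-4])      # made-up gradients of very different sizes
w1, m1 = lion_step(w0, np.zeros_like(w0), g)

print(w1 - w0)
```

There is no second-moment buffer here: the sign operation plays the normalizing role that $\sqrt{\hat{v}}$ plays in Adam, which is where the memory saving comes from.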

Summary

Momentum alone cannot adapt per-parameter. All parameters share one learning rate, which fails when gradient magnitudes span orders of magnitude across parameters.
RMSprop divides by the RMS gradient to normalize step sizes across parameters. But it has no momentum (no directional smoothing) and no bias correction (misbehaves early in training).
Adam combines momentum and RMSprop with bias correction. First moment $m$ smooths direction; second moment $v$ adapts magnitude. The effective step is approximately $\eta$ for every parameter.
Bias correction is essential because both $m$ and $v$ start at zero. Without it, the second moment is underestimated by up to 1000× in the first step, causing oversized updates.
Adam is not always fastest — on simple 1D problems, momentum wins because it builds velocity. Adam's advantage is on multi-dimensional problems where parameters need different step sizes.
AdamW decouples weight decay from the gradient normalization. This ensures consistent regularization across all parameters and is the standard for modern transformer training.
The standard transformer recipe: AdamW + linear warmup + cosine decay, with $\beta_1 = 0.9$ , $\beta_2 = 0.95$ – $0.999$ , weight_decay = 0.01–0.1, gradient clipping at max_norm = 1.0.

Looking ahead: Adam and AdamW give us an excellent optimizer, but they still require choosing a learning rate and a training duration. Learning rate scheduling — how to adjust $\eta$ during training — can further improve convergence. We explore this in Section 3.