Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain why a constant learning rate fails in noisy optimization and why decaying the LR over time enables tight convergence
Derive the cosine annealing formula and explain how a half-cosine curve maps training steps to learning rates
Explain the warmup problem and why starting with a small LR prevents early training instability in transformers
Implement step decay, cosine annealing, and warmup+cosine schedules from scratch in NumPy
Use PyTorch's lr_scheduler classes (StepLR, CosineAnnealingLR) to train a network with scheduled learning rates
Describe the modern transformer training recipe: warmup + cosine decay with AdamW, and why each piece is necessary

Where We Left Off

In Section 1, we learned that SGD with momentum accelerates convergence by accumulating past gradients into a velocity buffer. The velocity $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla L$ smooths out noise and amplifies consistent directions, giving roughly a 3–5× speedup over vanilla SGD.

But we used the same learning rate $\eta = 0.05$ for all 30 training epochs. That raises a question: should the learning rate stay constant throughout training?

The answer is almost always no. A constant learning rate forces a painful tradeoff. Set it high enough to make fast progress early, and the optimizer oscillates wildly near the minimum. Set it low enough to converge precisely, and early training crawls. Learning rate scheduling resolves this tension by changing the learning rate during training — large steps in the beginning, small steps at the end.

Why Schedules Matter

Consider what happens when you train with SGD on noisy mini-batch gradients. At any step, the observed gradient is:

$\hat{g}_t = \nabla L(\mathbf{w}_t) + \varepsilon_t$ where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

The weight update is $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \hat{g}_t$. The noise in the gradient adds a random-walk component whose per-step variance is proportional to $\eta^2 \sigma^2$. This means:

With a large $\eta$: the random walk amplitude is large. The optimizer bounces around the minimum with big jumps. The loss oscillates even after many steps.
With a small $\eta$: the random walk amplitude is small, so convergence is tight. But the signal component $\eta \nabla L$ is also tiny, so early progress is glacial.

The resolution is straightforward: start with a large $\eta$ for fast exploration, then decrease it for precise convergence. This is learning rate scheduling.

Phase	Learning Rate	Behavior
Early training	Large (e.g. 0.05–0.1)	Rapid loss decrease, rough convergence, exploring the loss landscape
Mid training	Medium (e.g. 0.01–0.03)	Refining the solution, approaching the minimum region
Late training	Small (e.g. 0.001)	Fine-tuning, settling into the minimum, noise dampened

The oscillation bound: for SGD with constant learning rate $\eta$, the expected squared distance from the minimum (equivalently, the excess loss on a quadratic) is bounded by $\mathcal{O}(\eta \sigma^2)$. To halve this oscillation, you must halve the learning rate, but halving the LR also halves the convergence speed. Scheduling lets you escape this tradeoff by changing $\eta$ over time.
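The scaling with $\eta$ can be checked numerically on the toy problem $f(w) = w^2$. The sketch below is illustrative only: the function names `simulate_variance` and `predicted_variance` are hypothetical, and the closed form comes from solving the update $w \leftarrow (1 - 2\eta) w - \eta \varepsilon$ for its stationary variance on this specific quadratic, not from a general theorem.

```python
import random

def simulate_variance(eta, sigma, steps=100_000, burn_in=1_000, seed=0):
    """Empirical variance of w under noisy SGD on f(w) = w^2 with constant eta."""
    rng = random.Random(seed)
    w = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(steps):
        grad = 2.0 * w + rng.gauss(0.0, sigma)  # true gradient plus Gaussian noise
        w -= eta * grad
        if step >= burn_in:                     # skip the initial transient
            total += w
            total_sq += w * w
            count += 1
    mean = total / count
    return total_sq / count - mean * mean

def predicted_variance(eta, sigma):
    """Stationary variance for this quadratic: eta * sigma^2 / (4 * (1 - eta))."""
    return eta * sigma ** 2 / (4.0 * (1.0 - eta))

for eta in (0.08, 0.04, 0.02):
    print(f"eta={eta:.2f}  simulated={simulate_variance(eta, 2.0):.4f}  "
          f"predicted={predicted_variance(eta, 2.0):.4f}")
```

Halving $\eta$ roughly halves the variance of $w$ around the minimum, which is the linear-in-$\eta$ behavior described above.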

The Big Picture: Exploration vs Exploitation

There is a deeper reason why scheduling works so well. Think of training as having two phases:

Exploration (high LR): the loss landscape has many regions — some with good minima, some with bad ones. A high learning rate lets the optimizer jump over small bad minima and explore broadly. The noise from mini-batches is actually helpful here: it prevents the optimizer from committing too early to a suboptimal minimum.

Exploitation (low LR): once the optimizer has found a promising region, a low learning rate lets it settle into the best point within that region. The noise is now harmful — it prevents precise convergence. A small LR suppresses the noise and allows fine-tuning.

Every learning rate schedule is a particular strategy for transitioning from exploration to exploitation. The question is how to make that transition.

Step Decay

The simplest schedule: multiply the learning rate by a factor $\gamma < 1$ every $N$ epochs.

$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / N \rfloor}$

For example, with $\eta_0 = 0.05$ , $\gamma = 0.5$ , and $N = 10$ :

Epochs	Learning Rate	Computation
0–9	0.05000	η₀ × 0.5⁰ = 0.05
10–19	0.02500	η₀ × 0.5¹ = 0.025
20–29	0.01250	η₀ × 0.5² = 0.0125
30–39	0.00625	η₀ × 0.5³ = 0.00625
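The table can be reproduced with a one-line function. This is a minimal sketch using the example values above as defaults; the name `step_decay_lr` and the parameter name `period` (standing in for $N$) are illustrative choices.

```python
def step_decay_lr(epoch, eta_0=0.05, gamma=0.5, period=10):
    """Learning rate after multiplying by gamma once every `period` epochs."""
    return eta_0 * gamma ** (epoch // period)  # integer division implements the floor

for epoch in (0, 9, 10, 19, 20, 29, 30, 39):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch):.5f}")
```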

Step decay was the dominant schedule in early deep learning. The original ResNet paper (He et al., 2015) divided the LR by 10 each time the error plateaued, and the widely used 90-epoch ImageNet recipe that followed divides the LR by 10 at epochs 30 and 60. Step decay works well but requires choosing two hyperparameters ($\gamma$ and $N$) and introduces abrupt transitions that can destabilize training.

The Abrupt Transition Problem

When the LR drops suddenly, the optimizer's momentum buffer is still calibrated for the old LR. The velocity built up over 10 epochs at LR=0.05 is suddenly too large for LR=0.025. This causes a brief spike in loss right after each step. The model recovers quickly, but the transition is wasteful.

When to use step decay: it remains popular in computer vision (ImageNet training). If you are training a ResNet or similar CNN for classification, step decay with $\gamma = 0.1$ at epochs [30, 60, 90] in a 100-epoch training run is a well-tested recipe.
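A milestone-based variant of step decay can be sketched as follows. The base learning rate of 0.1 is an assumption made for illustration (the recipe above does not state one), and `multistep_lr` is a hypothetical helper name.

```python
from bisect import bisect_right

def multistep_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(30, 60, 90)):
    """Multiply base_lr by gamma once for every milestone already reached.

    base_lr=0.1 is an assumed value, not taken from the text.
    """
    drops = bisect_right(milestones, epoch)  # number of milestones <= epoch
    return base_lr * gamma ** drops

for epoch in (0, 29, 30, 59, 60, 89, 90, 99):
    print(f"epoch {epoch:2d}: lr = {multistep_lr(epoch):.6f}")
```

In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` accepts a `milestones` list and a `gamma` factor in the same spirit.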

Cosine Annealing

Cosine annealing (Loshchilov & Hutter, 2016) replaces the staircase with a smooth curve. The learning rate follows a half-period of a cosine function:

$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$

Let's unpack this formula piece by piece:

$\cos(\pi t / T)$ traces a half-cosine from $+1$ (at $t=0$ ) to $-1$ (at $t=T$ )
Adding 1 shifts the range to $[0, 2]$
Multiplying by $\frac{1}{2}(\eta_0 - \eta_{\min})$ scales to $[0, \eta_0 - \eta_{\min}]$
Adding $\eta_{\min}$ shifts to $[\eta_{\min}, \eta_0]$

The result is a smooth S-shaped decay: slow at the start, fast in the middle, slow again near the end.
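A short derivation makes the "slow, fast, slow" shape concrete. Differentiating the schedule with respect to $t$ (treating $t$ as continuous, which is an approximation for integer steps) gives the rate of change of the learning rate:

```latex
\frac{d\eta_t}{dt} = -\frac{\pi}{2T}\,(\eta_0 - \eta_{\min})\,\sin\!\left(\frac{\pi t}{T}\right)
```

The slope is zero at $t = 0$ and $t = T$, where the sine vanishes, and steepest at $t = T/2$, where its magnitude is $\frac{\pi}{2T}(\eta_0 - \eta_{\min})$. That is $\pi/2 \approx 1.57$ times the constant slope $(\eta_0 - \eta_{\min})/T$ of a straight-line decay between the same endpoints.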

Why Cosine Beats Step Decay

No abrupt transitions: the LR changes continuously, so the momentum buffer stays calibrated. No loss spikes.
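Easy-to-set hyperparameters: just $\eta_0$, $\eta_{\min}$, and $T$ (total steps). $T$ is normally fixed by the training budget and $\eta_{\min}$ is often set to zero or a small fraction of $\eta_0$, so there are no decay milestones to tune.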
Slow decay near start and end: the cosine curve spends more time near $\eta_0$ (exploring) and near $\eta_{\min}$ (fine-tuning). The fast decay in the middle wastes the least time in the intermediate region.

Cosine annealing is a common default schedule in modern training. When you see a training recipe in a paper and it doesn't specify the schedule, it is often cosine annealing. The formula is simple, the implementation is one line, and it frequently matches or outperforms step decay without any milestone tuning.

Warmup: Stabilizing Early Training

All the schedules above assume that training starts at the peak LR. For simple models and SGD, this works fine. But for transformers and other large models with adaptive optimizers like Adam, starting at the full LR is dangerous.

Why Large Models Need Warmup

Adam (and similar adaptive optimizers) maintains a running estimate of each parameter's squared gradient (the second moment, an uncentered variance). In the first few steps, these estimates are based on very few samples and can be wildly inaccurate. If the learning rate is also high during these early steps, the combination of bad variance estimates and a high LR produces huge, unstable weight updates that can derail training before it starts.
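One way to see how directly the learning rate controls early Adam steps: with the standard bias-corrected update, the very first step has magnitude close to the learning rate itself, whatever the gradient's scale, because the second-moment estimate is built from that single gradient. The sketch below assumes the commonly used defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ (not stated in the text), and `adam_first_update` is a hypothetical helper for a single scalar parameter.

```python
import math

def adam_first_update(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first parameter update for a scalar gradient."""
    m = (1 - beta1) * grad           # first-moment estimate after one step
    v = (1 - beta2) * grad * grad    # second-moment estimate after one step
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for grad in (1e-3, 1.0, 1e3):
    print(f"grad={grad:g}  first update={adam_first_update(grad, lr=0.001):.6f}")
```

Because every parameter initially moves by roughly the full learning rate, a small LR during the first steps is the most direct way to keep those updates small.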

The solution: start with a tiny learning rate and linearly ramp up to the target LR over the first $W$ steps:

$\eta_t = \eta_0 \cdot \frac{t + 1}{W} \quad \text{for } t < W$

During warmup, the optimizer builds up accurate gradient statistics at a safe pace. By the time the LR reaches its peak, the variance estimates are reliable and training proceeds normally.

Step	Warmup LR (W=5, η₀=0.08)	Status
0	0.016	Cautious — variance estimates still cold
1	0.032	Warming up — estimates improving
2	0.048	Estimates more reliable
3	0.064	Almost at full speed
4	0.080	Full LR — estimates now accurate
5+	Decay phase begins	Switch to cosine/step decay

How many warmup steps? Typical values are 1–10% of total training steps. The original transformer paper ("Attention Is All You Need") used 4,000 warmup steps out of ~100,000 total. LLaMA uses 2,000 warmup steps. For small experiments, even 5–10 steps of warmup helps.

The Modern Standard: Warmup + Cosine Decay

The standard training recipe for transformers and most modern deep learning combines warmup with cosine decay:

$\eta_t = \begin{cases} \eta_0 \cdot \dfrac{t + 1}{W} & \text{if } t < W \\[8pt] \eta_{\min} + \dfrac{1}{2}(\eta_0 - \eta_{\min})\!\left(1 + \cos\!\left(\dfrac{\pi(t - W)}{T - W}\right)\right) & \text{if } t \geq W \end{cases}$

The two phases:

Phase 1 (Warmup): linear ramp from $\eta_0 / W$ to $\eta_0$ over steps $0$ to $W-1$
Phase 2 (Cosine Decay): half-cosine from $\eta_0$ to $\eta_{\min}$ over steps $W$ to $T-1$

This gives the optimizer the best of everything: safe startup (warmup prevents early instability), fast progress (high LR after warmup), and precise convergence (cosine decay settles into the minimum).
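The piecewise formula can be packaged as a reusable schedule. This is a minimal sketch: `make_warmup_cosine` is a hypothetical helper, and the example values use the toy FlipNet setting (peak 0.05, 5 warmup steps, 480 total steps) with $\eta_{\min} = 0$, which is an assumption since no minimum is given for that setting.

```python
import math

def make_warmup_cosine(eta_0, eta_min, warmup, total):
    """Return a function lr(step): linear warmup, then half-cosine decay."""
    def lr_at(step):
        if step < warmup:
            # Phase 1: linear ramp from eta_0 / warmup up to eta_0
            return eta_0 * (step + 1) / warmup
        # Phase 2: cosine decay over the remaining (total - warmup) steps
        progress = (step - warmup) / (total - warmup)
        return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * progress))
    return lr_at

# eta_min=0.0 is an assumed value for illustration.
schedule = make_warmup_cosine(eta_0=0.05, eta_min=0.0, warmup=5, total=480)
for step in (0, 4, 5, 240, 479):
    print(f"step {step:3d}: lr = {schedule(step):.6f}")
```

Dividing the returned value by `eta_0` gives a multiplier, which is the form expected by the `lr_lambda` argument of PyTorch's `torch.optim.lr_scheduler.LambdaLR`.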

Model	Peak LR	Warmup Steps	Total Steps	Optimizer
BERT (Base)	1e-4	10,000	1,000,000	AdamW
GPT-2	2.5e-4	2,000	~250,000	AdamW
GPT-3 (175B)	6e-5	~117 (375M tokens)	~94,000 (300B tokens)	AdamW
LLaMA-2 (7B)	3e-4	2,000	~500,000	AdamW
Our toy FlipNet	0.05	5	30×16 = 480	SGD+momentum

The recipe in one line: AdamW with warmup + cosine decay is to modern deep learning what SGD with momentum was to pre-2017 training. If you are starting a new project and need a default optimizer setup, this is it.

Interactive: LR Schedule Explorer

The visualization below lets you explore different learning rate schedules. Switch between LR Curves (the schedule shapes) and Training Loss (how they affect optimization of $f(w) = w^2$ with noisy gradients). Use the sliders to adjust parameters and see how each schedule responds.


Try these experiments:

LR Curves, default settings: press Play. Watch how cosine (blue) curves smoothly while step decay (purple) drops abruptly. Warmup+cosine (green) ramps up before decaying.
Increase warmup steps to 15: the green curve ramps for longer before peaking. With many warmup steps, the available decay time shrinks.
Switch to Training Loss mode: watch the constant LR (orange) oscillate wildly near the end while cosine and warmup+cosine settle down.
Set noise (σ) to 0: without noise, all schedules converge smoothly. Scheduling matters most when gradients are noisy (which they always are in practice).
Set noise (σ) to 5: extreme noise. The constant LR bounces violently. Scheduled LRs still converge because their tiny late-stage LR dampens the noise.

NumPy: Schedules from Scratch

Let's implement cosine annealing and warmup+cosine from scratch. We minimize $f(w) = w^2$ with noisy gradients ( $\sigma = 2.0$ ) over 50 steps, comparing constant LR, cosine annealing, and warmup+cosine. The noise simulates mini-batch SGD: each gradient is the true gradient plus Gaussian noise.

NumPy: Learning Rate Schedules from Scratch

🐍lr_schedules.py


Line 1: import numpy as np

NumPy provides fast numerical arrays and mathematical functions. We use np.cos for the cosine schedule, np.pi for π, and np.random.normal for generating gradient noise. All these are vectorized C operations under the hood.

EXECUTION STATE

numpy = Library for numerical computing — ndarray, trigonometric functions (np.cos, np.pi), random number generation (np.random.normal)

Line 3: # Minimize f(w) = w² with noisy gradients

Same toy function from Section 1. The minimum is at w=0. But now we add random noise to the gradient, simulating the stochastic nature of mini-batch SGD. This noise is what makes scheduling necessary — constant LR can’t converge tightly when gradients are noisy.

EXECUTION STATE

f(w) = w² = Parabola with minimum at w=0. True gradient: df/dw = 2w. Adding noise: grad = 2w + N(0, σ²). The noise prevents clean convergence with a fixed learning rate.

Line 4: # True gradient: df/dw = 2w

Without noise, the gradient is exactly 2w. At w=5 the gradient is 10 (steep). Near w=0 the gradient is small (gentle). Noise adds a random perturbation: some steps the gradient is too large, others too small.

Line 5: # mini-batch SGD where each batch gives a noisy estimate

In real training, each mini-batch produces a noisy estimate of the true gradient because it only sees a subset of the data. The noise standard deviation σ controls how noisy the gradients are. Higher σ = smaller batches = noisier gradients.

Line 7: np.random.seed(42) (Reproducibility)

Seeds the random number generator so the noise sequence is identical across all three schedules. This makes the comparison fair: the only variable is the learning rate schedule, not the noise realization.

EXECUTION STATE

📚 np.random.seed() = NumPy function: seeds the Mersenne Twister RNG. All subsequent calls to np.random.normal() will produce the same sequence. We re-seed before each schedule run so they all see identical noise.

⬇ arg: 42 = Seed value. Any integer works. 42 is convention.

Line 8: eta_0 = 0.08 (Peak learning rate)

The maximum learning rate. For constant LR, every step uses 0.08. For cosine annealing, the LR starts at 0.08 and decays. For warmup+cosine, the LR ramps UP to 0.08 during warmup, then decays. This is large enough to make fast progress early but will cause oscillation if held constant in noisy conditions.

EXECUTION STATE

η₀ = 0.08 = Peak learning rate. Update size per step: lr × gradient. At w=5 with no noise: step = 0.08 × 10 = 0.8 (moves w from 5.0 to 4.2 in one step). Aggressive but effective early on.

Line 9: eta_min = 0.001 (Minimum learning rate)

The learning rate floor. The cosine schedule decays from η₀ = 0.08 down to η_min = 0.001. At the end of training, step size = 0.001 × grad, which is 80× smaller than the initial step. This tiny LR lets the optimizer settle precisely near the minimum without bouncing around.

EXECUTION STATE

η_min = 0.001 = Learning rate floor. At step 49: update = 0.001 × grad ≈ 0.001 × 0.7 = 0.0007. Tiny steps → stable convergence. The noise still exists but its effect is multiplied by this small LR, so w barely moves.

Line 10: T = 50 (Total training steps)

We run 50 gradient descent steps. The cosine schedule completes one half-period of cosine over these 50 steps: cos(0)=1 at step 0, cos(π)=−1 at step 50. The LR transitions smoothly from η₀ to η_min.

EXECUTION STATE

T = 50 = Total steps. In real training, T = total_epochs × steps_per_epoch. For GPT-3 (175B): T ≈ 94,000 steps (about 300B tokens at 3.2M tokens per batch). The cosine curve stretches to fill whatever T you choose.

Line 11: sigma = 2.0 (Gradient noise)

Each gradient has noise ~ N(0, 4.0) added. At w=0.5, the true gradient is 1.0, but the observed gradient might be anything from −5 to +7. This mimics the variance of mini-batch gradients in real training. With constant LR, this noise causes the weight to bounce around the minimum.

EXECUTION STATE

σ = 2.0 = Standard deviation of gradient noise. noise ~ N(0, σ²) = N(0, 4.0). The signal-to-noise ratio at w=0.5 is: |2×0.5| / 2.0 = 0.5 — the noise is TWICE the true gradient! Only scheduling can handle this.

Line 13: # === Schedule 1: Cosine Annealing ===

Cosine annealing follows a half-cosine curve from η₀ to η_min. The decay is gentle at first (near the peak of the cosine), accelerates in the middle, and gentles again as it approaches η_min. This S-shaped decay is more effective than linear decay.

Line 14: def cosine_lr(step) (Cosine annealing schedule)

Computes the learning rate at a given step using the cosine annealing formula from Loshchilov & Hutter (2016). The formula: η(t) = η_min + ½(η₀ − η_min)(1 + cos(πt/T)). At step 0: cos(0)=1, so LR = η₀. At step T: cos(π)=−1, so LR = η_min.

EXECUTION STATE

⬇ input: step = Current training step (0 to T−1). The schedule maps this integer to a learning rate value.

⬆ returns = float — the learning rate for this step. Range: [η_min, η₀] = [0.001, 0.08]

→ Example values = step=0: 0.080000 | step=10: 0.072456 | step=25: 0.040500 | step=40: 0.008544 | step=49: 0.001078

Lines 15–16: return eta_min + 0.5 * (eta_0 - eta_min) * (1 + cos(π*step/T))

The cosine annealing formula. Let’s unpack: (1) cos(π·step/T) traces a half cosine from +1 to −1 as step goes 0→T. (2) Adding 1 shifts the range to [0, 2]. (3) Multiplying by 0.5 gives [0, 1]. (4) Multiplying by (η₀−η_min) scales to [0, 0.079]. (5) Adding η_min shifts to [0.001, 0.08].

EXECUTION STATE

📚 np.cos(x) = NumPy cosine function. Returns cos(x) where x is in radians. cos(0) = 1.0, cos(π/2) = 0.0, cos(π) = −1.0. Smooth and differentiable everywhere.

📚 np.pi = NumPy constant: π ≈ 3.14159. Used here so cos(π × step / T) completes exactly half a cycle over T steps.

→ step=0 = cos(π×0/50) = cos(0) = 1.0 → 0.001 + 0.5×0.079×(1+1) = 0.001 + 0.079 = 0.080

→ step=25 = cos(π×25/50) = cos(π/2) = 0.0 → 0.001 + 0.5×0.079×(1+0) = 0.001 + 0.0395 = 0.0405

→ step=49 = cos(π×49/50) = cos(0.98π) ≈ −0.998 → 0.001 + 0.5×0.079×0.002 ≈ 0.00108

Line 18: # === Schedule 2: Linear Warmup + Cosine Decay ===

The modern standard for transformer training. Instead of starting at the full LR (which can destabilize early training), we ramp up linearly from a small value to η₀ over the first few steps. Then cosine decay takes over. This is the schedule used for GPT, BERT, LLaMA, and most large language models.

Line 19: def warmup_cosine_lr(step, warmup=5) (Warmup + Cosine schedule)

Two-phase schedule: (1) Linear warmup from ~0 to η₀ over the first 5 steps, (2) Cosine decay from η₀ to η_min over the remaining 45 steps. The warmup phase prevents large, unstable updates when the optimizer’s momentum/variance estimates are still cold.

EXECUTION STATE

⬇ input: step = Current training step (0 to T−1 = 49)

⬇ input: warmup = 5 = Number of warmup steps. During steps 0–4, the LR ramps linearly and reaches η₀ = 0.08 at step 4. Cosine decay begins at step 5, which still uses η₀. In practice: 1–10% of total steps.

⬆ returns = float — learning rate. Warmup phase: [0.016, 0.032, 0.048, 0.064, 0.080]. Decay phase: 0.080 → 0.001

Line 20: if step < warmup: (Warmup phase)

During the first 5 steps (step = 0, 1, 2, 3, 4), we’re in the warmup phase. The LR increases linearly from η₀/5 = 0.016 to η₀ = 0.080. This gives the optimizer time to build up accurate gradient statistics before taking large steps.

EXECUTION STATE

warmup = 5 = Steps 0–4 are warmup. step < 5 is True for these steps.

→ Why warmup? = Adam and other adaptive optimizers estimate gradient variance from a running average. In the first few steps, these estimates are inaccurate. Large LR + bad estimates = exploding updates. Warmup avoids this.

Line 21: return eta_0 * (step + 1) / warmup (Linear ramp)

Linear warmup formula. We use (step+1) so step=0 gives LR = η₀×1/5 = 0.016 (not zero). At step=4: LR = η₀×5/5 = 0.08 (full LR). The ramp is: 0.016 → 0.032 → 0.048 → 0.064 → 0.080.

EXECUTION STATE

(step + 1) / warmup = Linear fraction from 1/5 to 5/5. The +1 ensures step=0 gets a nonzero LR. Without +1, step=0 would have lr=0 (no learning at all).

→ step=0 = 0.08 × 1/5 = 0.016

→ step=1 = 0.08 × 2/5 = 0.032

→ step=2 = 0.08 × 3/5 = 0.048

→ step=3 = 0.08 × 4/5 = 0.064

→ step=4 = 0.08 × 5/5 = 0.080 — full learning rate reached!

Line 22: progress = (step - warmup) / (T - warmup) (Decay fraction)

After warmup, we compute how far through the decay phase we are. progress goes from 0.0 (start of decay, step=5) to 44/45 ≈ 0.978 (last step, step=49); it would reach 1.0 only at step=T=50. This is fed into the cosine function. The denominator is T−warmup = 50−5 = 45 decay steps.

EXECUTION STATE

T - warmup = 45 = Number of steps in the decay phase. 50 total minus 5 warmup = 45 cosine decay steps.

→ step=5 = (5−5)/45 = 0.000 — start of decay

→ step=27 = (27−5)/45 = 0.489 — halfway through decay

→ step=49 = (49−5)/45 = 0.978 — near end of decay

Lines 23–24: return ... (cosine decay after warmup)

Same cosine formula as cosine_lr, but using ‘progress’ instead of ‘step/T’. This ensures the cosine decay spans only the post-warmup steps. The LR at step=5 is η₀ (matching the end of warmup) and decays to just above η_min (about 0.0011) by step=49.

EXECUTION STATE

→ step=5 = progress=0.0 → cos(0)=1 → lr = 0.001 + 0.0395×2 = 0.080

→ step=27 = progress=0.489 → cos(0.489π)≈0.035 → lr ≈ 0.042

→ step=49 = progress=0.978 → cos(0.978π)≈−0.998 → lr ≈ 0.0011

Line 26: # === Compare three schedules ===

We train f(w)=w² three times with identical noise but different LR schedules. The constant schedule uses η₀ = 0.08 at every step. The cosine schedule decays from 0.08 to 0.001. The warmup+cosine first ramps up to 0.08, then decays. Watch the final oscillation: constant bounces, schedules settle.

Lines 27–29: for name, get_lr in [...] (Three schedule functions)

Iterates over three schedules. Each is a function that takes a step number and returns the LR. The constant schedule is an inline lambda: lambda s: eta_0 (always returns 0.08). The other two are the functions we defined above.

EXECUTION STATE

📚 lambda s: eta_0 = Python anonymous function. Takes step s and returns eta_0=0.08 regardless of the step. This is the constant LR baseline.

Schedule 1: Constant = get_lr(any_step) = 0.08 always. No adaptation. Fast early progress but oscillates near minimum.

Schedule 2: Cosine = get_lr(0) = 0.080, get_lr(25) = 0.041, get_lr(49) = 0.001. Smooth decay.

Schedule 3: Warmup+Cosine = get_lr(0) = 0.016, get_lr(4) = 0.080, get_lr(49) = 0.001. Ramp then decay.

Line 30: np.random.seed(42) (Reset noise for fair comparison)

Critical: we re-seed the RNG before each schedule run. This means all three schedules encounter the exact same noise sequence at each step. The only difference in the results is due to the learning rate schedule, not random chance.

EXECUTION STATE

→ Why re-seed? = Without re-seeding, each schedule would see different random noise, making comparison meaningless. Re-seeding ensures the ONLY variable is the LR schedule.

Line 31: w = 5.0 (Same starting weight)

Start at w=5.0 for each schedule. Initial loss = 25.0. All three begin from the same point and try to reach w=0 (loss=0) through the noisy gradient landscape.

EXECUTION STATE

w = 5.0 = Starting weight. Loss = 5² = 25.0. Same starting point for all three schedules.

Line 32: for step in range(T): (50 training steps)

Main training loop. Each iteration: (1) sample noise, (2) compute noisy gradient, (3) get LR from schedule, (4) update weight. After 50 steps, compare the final w and oscillation pattern.

LOOP TRACE · 6 iterations

Constant: step=0

w: 5.000 → 4.121 = lr=0.080, noise=+0.99, grad=10.99 — fast initial descent

Constant: step=10

w: 0.521 → 0.512 = lr=0.080, noise=−0.93, grad=0.12 — near minimum but LR still large

Constant: step=40

w: 0.467 → 0.274 = lr=0.080, noise=+1.48, grad=2.41 — bouncing! noise dominates

Constant: step=49

w: 0.127 → 0.389 = lr=0.080, noise=−3.53, grad=−3.27 — jumped 0.26 on noise alone!

Cosine: step=49

w: 0.362 → 0.365 = lr=0.001, noise=−3.53, grad=−2.80 — moved only 0.003. Stable!

Warmup+Cos: step=49

w: 0.358 → 0.361 = lr=0.001, noise=−3.53, grad=−2.81 — moved only 0.003. Stable!

Line 33: noise = np.random.normal(0, sigma) (Gradient noise)

Samples one noise value from a Gaussian distribution with mean 0 and std dev σ=2.0. This simulates the stochasticity of mini-batch gradients. Each step gets a different noise value but the sequence is deterministic (seeded).

EXECUTION STATE

📚 np.random.normal(loc, scale) = NumPy function: draws one sample from a normal distribution. loc=0 (mean), scale=sigma=2.0 (std dev). Returns a float, e.g., +0.99, −0.28, +1.30, etc.

⬇ arg 1: loc = 0 = Mean of the distribution. Noise is centered at zero — on average the gradient is correct, but any single step might be off.

⬇ arg 2: scale = sigma = 2.0 = Standard deviation. 68% of noise values fall within [−2, +2]. 95% fall within [−4, +4]. Occasional large values like +3.0 or −3.5 cause big jumps.

→ step 0 noise = +0.993 (gradient pushed slightly higher than the true value)

→ step 49 noise = −3.526 (large negative noise — flips the gradient sign at small w!)

Line 34: grad = 2 * w + noise (Noisy gradient)

The observed gradient is the true gradient (2w) plus noise. Near the minimum (w≈0), the true gradient is small but noise is still σ=2.0 — the noise dominates! This is why constant LR oscillates: the optimizer keeps taking big steps based on noise, not signal.

EXECUTION STATE

→ step=0 (all schedules) = grad = 2×5.0 + 0.993 = 10.993 (noise is 10% of signal — minor)

→ step=49 (Constant) = grad = 2×0.127 + (−3.526) = −3.271 (noise FLIPS the sign! True grad is +0.25 but observed is −3.3)

→ The problem = Near minimum, signal (2w ≈ 0.3) is dwarfed by noise (σ=2.0). With lr=0.08: step = 0.08×3.3 = 0.26. That’s a HUGE jump when w is only 0.13!

Line 35: lr = get_lr(step) (Get scheduled learning rate)

Calls the schedule function for the current step. For constant: always 0.08. For cosine: smoothly decays from 0.08 to 0.001. For warmup+cosine: ramps up in steps 0–4, then decays. This is where the magic happens — at step 49, cosine uses lr=0.001 while constant uses lr=0.08.

EXECUTION STATE

Constant at step 49 = get_lr(49) = 0.08 (unchanged — still taking big steps!)

Cosine at step 49 = get_lr(49) = 0.00108 (about 74× smaller than constant)

Warmup+Cos at step 49 = get_lr(49) = 0.00110 (also tiny — barely moves)

Line 36: w = w - lr * grad (Weight update)

Standard gradient descent update. The step size = lr × grad. With constant LR near the minimum: step = 0.08 × (−3.27) = +0.262 (jumps from 0.127 to 0.389!). With cosine LR: step = 0.001 × (−2.80) = +0.003 (barely moves from 0.362 to 0.365). Scheduling turns wild bouncing into gentle settling.

EXECUTION STATE

→ Constant, step 49 = w = 0.127 − 0.08×(−3.271) = 0.127 + 0.262 = 0.389 (jumped 0.26 due to noise!)

→ Cosine, step 49 = w = 0.362 − 0.001×(−2.802) = 0.362 + 0.003 = 0.365 (barely moved — stable!)

→ Warmup+Cos, step 49 = w = 0.358 − 0.001×(−2.810) = 0.358 + 0.003 = 0.361 (also stable!)

→ Key insight = Same noise, same gradient, different step sizes. Constant LR: noise causes ±0.26 jumps. Cosine: noise causes ±0.003 wiggles. That’s the power of scheduling.

Line 37: if step % 10 == 0 or step == T - 1: (Print at intervals)

Prints every 10 steps plus the final step. This shows the trajectory: fast descent in early steps, then either oscillation (constant) or settling (cosine/warmup).

Lines 38–39: print(...) (Output formatted results)

Prints schedule name, step number, current LR, weight, and loss. After the initial descent, the losses printed at steps 10, 20, 30, 40, and 49 show the pattern. Constant LR oscillates (0.26, 0.18, 0.18, 0.07, 0.15). Cosine settles (0.31, 0.28, 0.22, 0.14, 0.13). Warmup+Cosine also settles (0.76, 0.33, 0.24, 0.13, 0.13).

EXECUTION STATE

→ Constant final 10 steps = w oscillates: std=0.118, range=[0.13, 0.46]. Loss bounces between 0.02 and 0.21. Unstable!

→ Cosine final 10 steps = w stable: std=0.004, range=[0.358, 0.369]. Loss steady at ~0.13. Converged!

→ Warmup+Cos final 10 steps = w stable: std=0.005, range=[0.352, 0.365]. Loss steady at ~0.13. Converged!

Line 40: print() (Blank line between schedules)

Separates the output of each schedule with a blank line for readability.


```python
import numpy as np

# ── Minimize f(w) = w² with noisy gradients ──
# True gradient: df/dw = 2w.  We add noise to simulate
# mini-batch SGD where each batch gives a noisy estimate.

np.random.seed(42)
eta_0 = 0.08       # peak learning rate
eta_min = 0.001    # minimum learning rate
T = 50             # total training steps
sigma = 2.0        # gradient noise std dev

# === Schedule 1: Cosine Annealing ===
def cosine_lr(step):
    return eta_min + 0.5 * (eta_0 - eta_min) * (
        1 + np.cos(np.pi * step / T))

# === Schedule 2: Linear Warmup + Cosine Decay ===
def warmup_cosine_lr(step, warmup=5):
    if step < warmup:
        return eta_0 * (step + 1) / warmup
    progress = (step - warmup) / (T - warmup)
    return eta_min + 0.5 * (eta_0 - eta_min) * (
        1 + np.cos(np.pi * progress))

# === Compare three schedules ===
for name, get_lr in [("Constant",      lambda s: eta_0),
                     ("Cosine",         cosine_lr),
                     ("Warmup+Cosine",  warmup_cosine_lr)]:
    np.random.seed(42)
    w = 5.0
    for step in range(T):
        noise = np.random.normal(0, sigma)
        grad = 2 * w + noise
        lr = get_lr(step)
        w = w - lr * grad
        if step % 10 == 0 or step == T - 1:
            print(f"[{name:14s}] Step {step:2d}: "
                  f"lr={lr:.5f}  w={w:7.4f}  loss={w**2:.4f}")
    print()
```

The results reveal the key insight:

Schedule	Final w	Final Loss	Last 10 Steps Std
Constant (lr=0.08)	0.389	0.151	0.118 (bouncing!)
Cosine Annealing	0.365	0.133	0.004 (stable)
Warmup + Cosine	0.361	0.130	0.005 (stable)

All three reach similar neighborhoods, but constant LR oscillates with a standard deviation of 0.118 while the scheduled approaches have std ≤ 0.005, roughly a 25× reduction in oscillation. The schedules "park" the optimizer at a good point and stop bouncing.
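The "last 10 steps" statistic is not printed by the listing, so here is one way it could be measured. This sketch is an assumption-laden variant rather than the original experiment: it uses Python's standard `random` module instead of NumPy, so the noise sequence (and therefore the exact numbers) will differ from the table, and it averages over 20 seeds to reduce dependence on a single noise draw. The helper names `tail_std` and `mean_tail_std` are hypothetical.

```python
import math
import random
import statistics

ETA_0, ETA_MIN, T, SIGMA = 0.08, 0.001, 50, 2.0

def constant_lr(step):
    return ETA_0

def cosine_lr(step):
    return ETA_MIN + 0.5 * (ETA_0 - ETA_MIN) * (1 + math.cos(math.pi * step / T))

def tail_std(get_lr, seed, tail=10):
    """Std of w over the last `tail` steps of one noisy run on f(w) = w^2."""
    rng = random.Random(seed)
    w, history = 5.0, []
    for step in range(T):
        grad = 2 * w + rng.gauss(0.0, SIGMA)  # true gradient plus noise
        w -= get_lr(step) * grad
        history.append(w)
    return statistics.pstdev(history[-tail:])

def mean_tail_std(get_lr, seeds=range(20)):
    """Average the tail std over several independent noise sequences."""
    return statistics.mean(tail_std(get_lr, s) for s in seeds)

print(f"constant: {mean_tail_std(constant_lr):.4f}")
print(f"cosine:   {mean_tail_std(cosine_lr):.4f}")
```

The constant schedule should show a clearly larger tail standard deviation than the cosine schedule, in line with the comparison above.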

Why are the losses similar? With only 50 steps on a simple function, all schedules make good progress. The difference is in stability, not final average loss. In real training with millions of steps, this stability matters more: the model settles into one minimum instead of bouncing among several nearby minima, which usually yields a lower and more reproducible final loss.

PyTorch: Training with Schedulers

PyTorch provides built-in schedulers in $\texttt{torch.optim.lr\_scheduler}$ . Let's train our FlipNet from Chapters 7–8 with three approaches: constant LR, StepLR, and CosineAnnealingLR. All three models start with identical weights. The only difference is how the learning rate evolves.

PyTorch: Training with LR Schedulers

🐍train_with_scheduling.py


Line 1: import torch

PyTorch — the deep learning framework. Provides tensors, autograd, and the full training ecosystem including optimizers and learning rate schedulers (torch.optim.lr_scheduler).

EXECUTION STATE

torch = Core PyTorch library — tensors, autograd, nn, optimizers, schedulers

Line 2: import torch.nn as nn

Neural network module containing layers (nn.Linear), activations (nn.ReLU), and the base Module class. Same import as Chapters 7–9.

EXECUTION STATE

torch.nn = Neural network building blocks — nn.Linear, nn.ReLU, nn.Module

Line 4: torch.manual_seed(42) (Global seed)

Sets the global random seed. But we also re-seed before creating each model so all three start with identical weights, making the LR schedule the only difference.

EXECUTION STATE

📚 torch.manual_seed() = Seeds PyTorch’s RNG for weight initialization, dropout, etc. We call this again before each FlipNet() below.

Line 7: class FlipNet(nn.Module): (Our test network)

The same 4→3→4 network from Chapters 7–8 and Section 1 of this chapter. 31 learnable parameters. Input: flattened 2×2 binary image. Output: the diagonally flipped image. A simple but real network for testing optimizer behavior.

EXECUTION STATE

Architecture = Input(4) → Linear(4,3) → ReLU → Linear(3,4) → Output(4). 15 + 16 = 31 parameters total.

Line 8: def __init__(self):

Constructor creates two linear layers and a ReLU activation. Same architecture as all previous chapters.

Line 9: super().__init__()

Initializes nn.Module internals: parameter tracking, hooks, training mode.

Line 10: self.layer1 = nn.Linear(4, 3)

Hidden layer: 4 inputs → 3 outputs. Weight matrix (3×4) + bias (3) = 15 parameters.

EXECUTION STATE

📚 nn.Linear(4, 3) = Fully-connected layer. y = x @ W.T + b. W shape: (3,4), b shape: (3).

Line 11: self.layer2 = nn.Linear(3, 4)

Output layer: 3 hidden → 4 outputs. Weight (4×3) + bias (4) = 16 parameters.

EXECUTION STATE

📚 nn.Linear(3, 4) = Output layer. y = x @ W.T + b. W shape: (4,3), b shape: (4).

Line 12: self.relu = nn.ReLU()

ReLU activation: f(x) = max(0, x). Applied between layers to add nonlinearity.

Line 14: def forward(self, x):

Defines the forward pass: input → layer1 → relu → layer2 → output. Called automatically by model(x).

EXECUTION STATE

⬇ input: x = 4-element tensor: one flattened 2×2 image, e.g., [1,0,1,1]

⬆ returns = 4-element tensor: predicted flipped image

return self.layer2(self.relu(self.layer1(x)))

Chain: linear → relu → linear. PyTorch records all ops for autograd.

X = torch.tensor([...]) - All 16 binary 2×2 images

Creates the dataset: all 16 possible 4-bit binary strings as float tensors. X[0] = [0,0,0,0], X[1] = [0,0,0,1], ..., X[15] = [1,1,1,1]. Each represents a flattened 2×2 image.

EXECUTION STATE

X shape = (16, 4) — 16 images, 4 pixels each

f"{i:04b}" = Python f-string: formats integer i as 4-digit binary. i=5 → "0101" → [0,1,0,1]

Y = X[:, [0, 2, 1, 3]] - Diagonal flip targets

The target output is the diagonal flip of each image. Reordering columns [0,2,1,3] swaps elements 1 and 2 (the off-diagonal pixels in a 2×2 matrix). This is the operation the network must learn.

EXECUTION STATE

[:, [0, 2, 1, 3]] = Fancy indexing: keep columns 0 and 3, swap columns 1 and 2. Example: [1,0,1,1] → [1,1,0,1]
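
To see why the column order [0, 2, 1, 3] is a diagonal flip, the sketch below rebuilds the dataset in plain Python and compares the index permutation with an explicit 2×2 transpose. It assumes the images are flattened row by row; `flip_by_index` and `flip_by_transpose` are illustrative names.

```python
# All 16 binary 2x2 images, flattened row by row: [a, b, c, d] is [[a, b], [c, d]].
X = [[int(bit) for bit in f"{i:04b}"] for i in range(16)]

def flip_by_index(img):
    """Reorder columns as [0, 2, 1, 3], the same permutation used for Y."""
    return [img[0], img[2], img[1], img[3]]

def flip_by_transpose(img):
    """Reshape to 2x2, transpose, and flatten again."""
    rows = [img[0:2], img[2:4]]
    transposed = [[rows[r][c] for r in range(2)] for c in range(2)]
    return transposed[0] + transposed[1]

Y = [flip_by_index(img) for img in X]
print(flip_by_index([1, 0, 1, 1]))
```

Both functions agree on all 16 images, so the fancy-indexing shortcut is equivalent to transposing each image.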

def train_epoch(model, optimizer) - One training epoch

Trains on all 16 images sequentially (stochastic gradient descent, batch size=1). Returns the average loss. This function is called with different optimizers to compare schedulers.

EXECUTION STATE

⬇ input: model = A FlipNet instance with its own weights

⬇ input: optimizer = SGD optimizer bound to model’s parameters. May have a scheduler attached externally.

⬆ returns = float — average MSE loss over 16 images this epoch

total_loss = 0.0

Accumulator for summing losses across all 16 images.

for i in range(16):

Loop over all 16 training images. Each iteration: forward pass, compute loss, backprop, update weights.

pred = model(X[i])

Forward pass: runs image X[i] through the network to get a 4-element prediction.

loss = torch.mean((pred - Y[i]) ** 2)

MSE loss: average squared difference between prediction and target. Differentiable — PyTorch can compute gradients through this.

EXECUTION STATE

📚 torch.mean() = Computes the mean of all elements. For a 4-element vector: sum/4.
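
As a quick check of what this loss computes, here is the same mean-squared-error arithmetic in plain Python. The prediction values are made up for illustration; they are not outputs of the trained network.

```python
def mse(pred, target):
    """Mean of the squared element-wise differences."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical prediction for the target image [1, 1, 0, 1].
pred = [0.9, 0.8, 0.1, 1.0]
target = [1.0, 1.0, 0.0, 1.0]
loss = mse(pred, target)   # (0.01 + 0.04 + 0.01 + 0.0) / 4
print(round(loss, 4))
```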

optimizer.zero_grad()

Clears accumulated gradients from the previous step. Without this, gradients would accumulate across images.

loss.backward()

Backpropagation: computes ∂loss/∂w for all 31 parameters. Stores gradients in each parameter’s .grad attribute.

optimizer.step()

Applies the SGD+momentum update: v = βv + grad; w = w - lr·v. The lr used here is whatever the scheduler has set in opt.param_groups[0]["lr"]. This is how scheduling works: the scheduler modifies this lr between epochs.

EXECUTION STATE

→ How schedulers connect = Scheduler.step() modifies optimizer.param_groups[0]["lr"]. When optimizer.step() runs, it reads the current lr from param_groups. No code change needed inside the training loop.
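
The hand-off between scheduler and optimizer can be mimicked without PyTorch. The sketch below is a toy model of the mechanism, not PyTorch's implementation: `group`, `state`, `optimizer_step`, and `scheduler_step` are illustrative stand-ins, and the loss is assumed to be f(w) = w² so the gradient is 2w.

```python
# Toy stand-ins for an optimizer's param group and a step scheduler.
# These names are illustrative, not PyTorch's classes.
group = {"lr": 0.05, "momentum": 0.9}
state = {"w": 1.0, "v": 0.0}

def optimizer_step(grad):
    """SGD with momentum: v = beta*v + grad; w = w - lr*v (lr read at call time)."""
    state["v"] = group["momentum"] * state["v"] + grad
    state["w"] -= group["lr"] * state["v"]

def scheduler_step(epoch_just_finished, step_size=10, gamma=0.5):
    """Halve the LR in place once every `step_size` epochs."""
    if (epoch_just_finished + 1) % step_size == 0:
        group["lr"] *= gamma

lrs_used = []
for epoch in range(30):
    lrs_used.append(group["lr"])                # the LR this epoch trains with
    for _ in range(16):                         # 16 updates per epoch, as in train_epoch
        optimizer_step(grad=2.0 * state["w"])   # gradient of f(w) = w**2
    scheduler_step(epoch)

print(lrs_used[0], lrs_used[10], lrs_used[20], group["lr"])
```

The update function never receives the learning rate as an argument. It reads whatever value is stored in the shared dictionary at call time, which is why changing that value between epochs is enough.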

total_loss += loss.item()

Adds this image’s loss to the accumulator. .item() extracts the Python float from a 0-dim tensor.

return total_loss / 16

Returns average loss per image this epoch. Dividing by 16 normalizes for the dataset size.

# Model 1: Constant LR (no scheduler)

The baseline: SGD with momentum and no learning rate decay. The LR stays at 0.05 for all 30 epochs. This is the approach from Section 1.

torch.manual_seed(42); m_const = FlipNet()

Re-seed and create the first model. All three models start with identical weights thanks to the same seed.

EXECUTION STATE

→ Why re-seed? = nn.Linear initializes its weights and biases by sampling from PyTorch’s global RNG (a Kaiming-uniform scheme by default). Same seed → same initial weights → fair comparison.

opt_const = SGD(params, lr=0.05, momentum=0.9)

Creates an SGD optimizer with momentum 0.9 and constant LR 0.05. No scheduler will modify this LR — it stays 0.05 forever.

EXECUTION STATE

⬇ lr = 0.05 = Learning rate. Stays constant for all 30 epochs. The update: v = 0.9v + grad; w = w − 0.05·v

⬇ momentum = 0.9 = Momentum coefficient from Section 1. Velocity buffer averages ~10 recent gradients.
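
The "about 10 gradients" figure comes from the geometric series 1 + β + β² + … = 1/(1 − β). A minimal sketch, assuming a constant gradient of 1.0 so the limit is easy to see:

```python
beta = 0.9
velocity = 0.0
for step in range(200):
    velocity = beta * velocity + 1.0   # constant gradient of 1.0

steady_state = 1.0 / (1.0 - beta)      # geometric-series limit
# Share of the steady-state velocity contributed by the 10 most recent gradients.
recent_share = sum(beta ** k for k in range(10)) / steady_state
print(round(velocity, 4), round(steady_state, 4), round(recent_share, 3))
```

The velocity settles at roughly ten times the gradient, and the ten most recent gradients account for about two thirds of it. The "window" is therefore a soft, exponentially weighted one rather than a hard cutoff.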

# Model 2: StepLR (halve every 10 epochs)

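Step decay: multiply the LR by γ=0.5 every 10 epochs. Over the 30 training epochs the LR goes 0.05 → 0.025 → 0.0125; the final scheduler step then sets it to 0.00625, a value that no epoch actually trains with. Simple and widely used in computer vision (ResNet papers).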

torch.manual_seed(42); m_step = FlipNet()

Second model with identical initial weights.

opt_step = SGD(params, lr=0.05, momentum=0.9)

Same optimizer settings as model 1. The scheduler will modify this optimizer’s LR.

sched_step = StepLR(opt_step, step_size=10, gamma=0.5)

Creates a step decay scheduler. Every 10 epochs, multiplies the optimizer’s LR by 0.5. The scheduler reads and writes opt_step.param_groups[0]["lr"]. Calling sched_step.step() after each epoch applies the decay.

EXECUTION STATE

📚 torch.optim.lr_scheduler.StepLR = PyTorch scheduler: lr = initial_lr × gamma^(epoch // step_size). Simple multiplicative decay at fixed intervals.

⬇ arg: opt_step = The optimizer whose LR to modify. The scheduler stores a reference to the optimizer.

⬇ arg: step_size = 10 = Decay every 10 epochs. Epochs 0–9: lr=0.05. Epochs 10–19: lr=0.025. Epochs 20–29: lr=0.0125.

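⬇ arg: gamma = 0.5 = Multiplication factor at each decay. 0.5 = halve the LR. After 3 decays: 0.05 × 0.5³ = 0.00625 (the third decay happens after the last epoch, so training itself only uses 0.05, 0.025, and 0.0125).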
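
The step-decay rule is short enough to write from scratch. The sketch below is a closed-form version of the formula quoted above, using this section's settings as defaults; `step_decay_lr` is an illustrative name, not a PyTorch function.

```python
def step_decay_lr(epoch, initial_lr=0.05, step_size=10, gamma=0.5):
    """LR in effect during `epoch` (0-indexed) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 19, 20, 29, 30):
    print(epoch, step_decay_lr(epoch))
```

Epoch 30 is included only to show the value the scheduler holds after its last step. The 30-epoch run stops at epoch 29.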

# Model 3: CosineAnnealingLR

Cosine annealing: smooth decay following a half-cosine curve. The most popular schedule for modern training — no step_size or gamma to tune, just T_max and eta_min.

torch.manual_seed(42); m_cos = FlipNet()

Third model with identical initial weights.

opt_cos = SGD(params, lr=0.05, momentum=0.9)

Same optimizer settings. The CosineAnnealingLR scheduler will modify this LR smoothly.

sched_cos = CosineAnnealingLR(opt_cos, T_max=30, eta_min=0.001)

Creates a cosine annealing scheduler. The LR follows: η(t) = η_min + ½(η₀ − η_min)(1 + cos(πt/T_max)). Over 30 epochs, smoothly decays from 0.05 to 0.001. No abrupt drops like StepLR.

EXECUTION STATE

📚 CosineAnnealingLR = PyTorch scheduler: follows a half-cosine curve from initial_lr to eta_min over T_max epochs. Proposed by Loshchilov & Hutter (SGDR, 2016).

⬇ arg: opt_cos = The optimizer to schedule. When the scheduler is constructed, it records the initial LR from opt_cos.param_groups[0]["lr"] = 0.05.

⬇ arg: T_max = 30 = Total epochs for one cosine half-cycle. LR at epoch 0: 0.05 (cos(0)=1). LR at epoch 15: 0.0255 (cos(π/2)=0, midpoint). LR at epoch 30: 0.001 (cos(π)=−1).

⬇ arg: eta_min = 0.001 = Minimum LR. The cosine curve reaches this value at epoch T_max. With the default eta_min=0, the LR would decay all the way to 0.
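
The cosine formula can be evaluated directly to confirm the values quoted for epochs 0, 15, and 30. This is a from-scratch sketch of the closed form, with this section's settings as defaults, not a call into PyTorch.

```python
import math

def cosine_lr(t, initial_lr=0.05, eta_min=0.001, t_max=30):
    """Closed-form cosine annealing for 0 <= t <= t_max."""
    cosine = math.cos(math.pi * t / t_max)
    return eta_min + 0.5 * (initial_lr - eta_min) * (1.0 + cosine)

for t in (0, 1, 15, 29, 30):
    print(t, round(cosine_lr(t), 5))
```

The curve is flat near both ends and steepest at the midpoint, which is what distinguishes it from a linear ramp between the same two endpoints.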

for epoch in range(30): - Train all three models

Main training loop: 30 epochs. Each epoch trains all three models on all 16 images. After each epoch, the schedulers step to update the LR for the next epoch.

LOOP TRACE · 7 iterations

Epoch 0

All identical = Constant=0.3928 Step(lr=0.050)=0.3928 Cosine(lr=0.050)=0.3928

Epoch 5

Diverging slightly = Constant=0.2174 Step(lr=0.050)=0.2174 Cosine(lr=0.045)=0.2071

Epoch 10

StepLR drops to 0.025 = Constant=0.1390 Step(lr=0.025)=0.1341 Cosine(lr=0.035)=0.1361

Epoch 15

Scheduled models pull ahead = Constant=0.1381 Step(lr=0.025)=0.1295 Cosine(lr=0.023)=0.1296

Epoch 20

Constant starts oscillating = Constant=0.1348 Step(lr=0.013)=0.1270 Cosine(lr=0.011)=0.1272

Epoch 25

Constant loss RISES = Constant=0.1503 Step(lr=0.013)=0.1270 Cosine(lr=0.003)=0.1257

Epoch 29

Final: Cosine wins = Constant=0.1516 Step(lr=0.006)=0.1269 Cosine(lr=0.001)=0.1252

l1 = train_epoch(m_const, opt_const)

Train model 1 (constant LR) for one epoch. Returns average loss over 16 images.

l2 = train_epoch(m_step, opt_step)

Train model 2 (step decay). The optimizer uses whatever LR the scheduler set before this epoch.

sched_step.step() - Update StepLR

Advances the StepLR scheduler by one epoch. If epoch+1 is a multiple of step_size (10), the optimizer’s LR is multiplied by gamma (0.5). Otherwise, LR stays the same. IMPORTANT: call scheduler.step() AFTER optimizer.step(), once per epoch.

EXECUTION STATE

📚 scheduler.step() = PyTorch scheduler method: updates the LR for the next epoch. Reads the current epoch count internally. Must be called after optimizer.step().

→ After epoch 9 = sched_step.step() checks: epoch 10 % 10 == 0? Yes → lr = 0.05 × 0.5 = 0.025

→ After epoch 10 = sched_step.step() checks: epoch 11 % 10 == 0? No → lr stays 0.025

l3 = train_epoch(m_cos, opt_cos)

Train model 3 (cosine annealing). The LR changes smoothly every epoch.

sched_cos.step() - Update CosineAnnealingLR

Advances the cosine scheduler. Unlike StepLR, the LR changes EVERY epoch (not just at milestones). The change is computed from the cosine formula.

EXECUTION STATE

→ After epoch 0 = New lr = 0.001 + 0.5×0.049×(1+cos(π×1/30)) = 0.04987

→ After epoch 14 = New lr = 0.001 + 0.5×0.049×(1+cos(π×15/30)) = 0.02550 (midpoint)

→ After epoch 28 = New lr = 0.001 + 0.5×0.049×(1+cos(π×29/30)) = 0.00113

if epoch % 5 == 0 or epoch == 29: - Print every 5 epochs

Prints at epochs 0, 5, 10, 15, 20, 25, and 29 (the final epoch) to show the training progression.

lr_s = opt_step.param_groups[0]['lr'] - Read current StepLR

Reads the current learning rate from the optimizer’s parameter groups. This is the value that the scheduler modifies. Because it is read after sched_step.step(), it is the LR the next epoch will train with, not the one the epoch that just finished used.

EXECUTION STATE

📚 param_groups[0]['lr'] = PyTorch stores optimizer parameters in groups. Most common: one group with all parameters. The 'lr' key holds the current learning rate. Schedulers modify this dict directly.

lr_c = opt_cos.param_groups[0]['lr'] - Read current cosine LR

Same pattern: reads the cosine scheduler’s current LR for display.
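
Because both reads happen after the scheduler has stepped, the printed LR can differ from the LR the epoch trained with. The sketch below tabulates the two for the step schedule, assuming the closed-form rule lr = 0.05 × 0.5^(epoch // 10); the dictionary names are illustrative.

```python
def step_decay_lr(num_scheduler_steps, initial_lr=0.05, step_size=10, gamma=0.5):
    """LR held by the optimizer after the given number of scheduler steps."""
    return initial_lr * gamma ** (num_scheduler_steps // step_size)

lr_used = {}      # LR the optimizer trained with during each epoch
lr_printed = {}   # LR read from the optimizer after scheduler.step()
for epoch in range(30):
    lr_used[epoch] = step_decay_lr(epoch)          # before the scheduler steps
    lr_printed[epoch] = step_decay_lr(epoch + 1)   # after the scheduler steps

for epoch in (9, 10, 29):
    print(epoch, lr_used[epoch], lr_printed[epoch])
```

The two columns differ only at epochs where a decay fires (9, 19, and 29 here), which is why the final printed value of about 0.006 is smaller than the 0.0125 the last epoch actually used.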

print(...) - Formatted epoch results

Prints all three losses with their current LRs. Key observation at epoch 29: Constant=0.1516, Step=0.1269, Cosine=0.1252. The constant LR model has HIGHER loss than it did at epoch 20 (0.1348) — it’s oscillating! Both scheduled models converged to lower loss.

EXECUTION STATE

→ Epoch 0 (all same) = All three models are identical at this point — same weights, same LR. Loss = 0.3928.

→ Epoch 10 (divergence begins) = StepLR just dropped to 0.025. All models are similar (~0.13–0.14) but constant is slightly higher.

→ Epoch 29 (the reveal) = Constant: 0.1516 (HIGHER than epoch 20!). Step: 0.1269. Cosine: 0.1252 (best). Scheduling wins.


import torch
import torch.nn as nn

torch.manual_seed(42)

# ── Same FlipNet from Chapters 7-8 ──
class FlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))

# ── Dataset: all 16 binary 2×2 images ──
X = torch.tensor([[int(b) for b in f"{i:04b}"]
                   for i in range(16)], dtype=torch.float32)
Y = X[:, [0, 2, 1, 3]]  # diagonal flip

def train_epoch(model, optimizer):
    total_loss = 0.0
    for i in range(16):
        pred = model(X[i])
        loss = torch.mean((pred - Y[i]) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / 16

# ── Model 1: Constant LR (no scheduler) ──
torch.manual_seed(42)
m_const = FlipNet()
opt_const = torch.optim.SGD(m_const.parameters(),
                            lr=0.05, momentum=0.9)

# ── Model 2: StepLR (halve every 10 epochs) ──
torch.manual_seed(42)
m_step = FlipNet()
opt_step = torch.optim.SGD(m_step.parameters(),
                           lr=0.05, momentum=0.9)
sched_step = torch.optim.lr_scheduler.StepLR(
    opt_step, step_size=10, gamma=0.5)

# ── Model 3: CosineAnnealingLR ──
torch.manual_seed(42)
m_cos = FlipNet()
opt_cos = torch.optim.SGD(m_cos.parameters(),
                          lr=0.05, momentum=0.9)
sched_cos = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt_cos, T_max=30, eta_min=0.001)

for epoch in range(30):
    l1 = train_epoch(m_const, opt_const)
    l2 = train_epoch(m_step, opt_step)
    sched_step.step()
    l3 = train_epoch(m_cos, opt_cos)
    sched_cos.step()
    if epoch % 5 == 0 or epoch == 29:
        # Read after scheduler.step(): these are the LRs for the next epoch.
        lr_s = opt_step.param_groups[0]['lr']
        lr_c = opt_cos.param_groups[0]['lr']
        print(f"Epoch {epoch:2d}:  Constant={l1:.4f}  "
              f"Step(lr={lr_s:.4f})={l2:.4f}  "
              f"Cosine(lr={lr_c:.5f})={l3:.4f}")

The final epoch losses tell the story:

Method	Loss at Epoch 0	Loss at Epoch 29	LR after final scheduler.step()
Constant LR	0.3928	0.1516	0.050 (unchanged)
StepLR (γ=0.5, N=10)	0.3928	0.1269	0.006
CosineAnnealingLR	0.3928	0.1252	0.001

Notice two things. First, the constant LR model has higher loss at epoch 29 (0.152) than at epoch 20 (0.135) — it is oscillating around the minimum. Second, cosine annealing achieves the lowest final loss (0.125) with the smoothest convergence trajectory.

The scheduler pattern in PyTorch: create the optimizer, create the scheduler wrapping the optimizer, and call $\texttt{scheduler.step()}$ once per epoch (after that epoch's $\texttt{optimizer.step()}$ calls). The scheduler directly modifies the optimizer's $\texttt{param\_groups[0]["lr"]}$, so the code that computes the loss and calls $\texttt{optimizer.step()}$ does not need to change.

Other PyTorch schedulers: $\texttt{ExponentialLR}$ (multiply by $\gamma$ each epoch), $\texttt{ReduceLROnPlateau}$ (reduce when validation loss stops improving), $\texttt{OneCycleLR}$ (cosine with warmup in one call), and $\texttt{CosineAnnealingWarmRestarts}$ (periodic cosine with warm restarts). For most cases, $\texttt{CosineAnnealingLR}$ with a separate warmup phase is sufficient.
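
Of the schedules just listed, exponential decay is the simplest to write by hand. The sketch below is a from-scratch version of the rule "multiply by γ each epoch", not PyTorch's class. The helper `gamma_for_target` is hypothetical and picks γ so the schedule ends at a chosen LR.

```python
def exponential_lr(epoch, initial_lr=0.05, gamma=0.9):
    """Exponential decay: multiply the LR by gamma once per epoch."""
    return initial_lr * gamma ** epoch

def gamma_for_target(initial_lr, final_lr, num_epochs):
    """Gamma that takes initial_lr to final_lr in exactly num_epochs decays."""
    return (final_lr / initial_lr) ** (1.0 / num_epochs)

# Hypothetical target: match the cosine run's endpoints (0.05 down to 0.001 over 30 epochs).
gamma = gamma_for_target(0.05, 0.001, 30)
print(round(gamma, 4), round(exponential_lr(30, gamma=gamma), 6))
```

With matched endpoints, the exponential curve drops fastest at the start, whereas the cosine curve stays near the peak LR for the first few epochs.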

Connection to Modern Training

Learning rate scheduling is not just a nice-to-have: it is a standard component of most modern large-scale training pipelines. Here is how it connects to the systems that power today's AI.

The Transformer Training Recipe

Nearly every large language model follows the same recipe:

Optimizer: AdamW (Adam with decoupled weight decay)
Schedule: Linear warmup + cosine decay to $\eta_{\min} = 0.1 \cdot \eta_0$
Warmup: 0.1–2% of total training steps
Weight decay: 0.1 (regularization, independent of LR)
Gradient clipping: max norm = 1.0 (prevents exploding gradients)

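The pieces of this recipe accumulated over several years. BERT (2018) paired Adam with weight decay and linear warmup, though it decayed the LR linearly rather than with a cosine. GPT-3 (2020) reports the combination listed above: linear warmup, cosine decay to 10% of the peak LR, weight decay of 0.1, and gradient clipping at a global norm of 1.0. The recipe has been remarkably stable since, and LLaMA-2 (2023) reports essentially the same setup.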
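
The schedule in the recipe above can be sketched as a single function of the step index. This is a minimal from-scratch version under stated assumptions: warmup starts at peak_lr / warmup_steps rather than at zero, and the peak LR, step counts, and the name `warmup_cosine_lr` are hypothetical choices for illustration. Real training codebases differ in these details.

```python
import math

def warmup_cosine_lr(step, peak_lr, total_steps, warmup_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    min_lr = min_ratio * peak_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 10,000 steps with 1% warmup and a peak LR of 3e-4.
schedule = [warmup_cosine_lr(s, 3e-4, 10_000, 100) for s in range(10_001)]
print(schedule[0], schedule[99], schedule[100], schedule[10_000])
```

In a framework, a function like this is typically evaluated once per optimizer step (not once per epoch), and the result is written into the optimizer's learning rate before the update.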

Why Warmup Is Essential for Transformers

Transformers have a specific instability that makes warmup non-optional. The attention mechanism computes $\text{softmax}(QK^T / \sqrt{d_k})$ . In the first few steps, the $Q$ and $K$ matrices are randomly initialized, so the attention weights are nearly uniform (each token attends equally to all others). The gradients from these uniform attention patterns are large and noisy.

If the learning rate is also large at this point, the weight updates are so big that the attention weights can become sharply peaked (one token dominates) before the model has learned which tokens should actually attend to which. This creates a self-reinforcing cycle where bad attention patterns produce bad gradients that reinforce the bad patterns. Warmup breaks this cycle by keeping updates small until the attention weights have time to organize.
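
The contrast between near-uniform and sharply peaked attention is easy to reproduce with a hand-written softmax. The logit values below are made up for illustration and are not taken from a real model. The second set is the first scaled by 400, standing in for what an overly large early update could do to $QK^T / \sqrt{d_k}$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over four keys.
small_logits = [0.02, -0.01, 0.00, 0.01]   # small, as at random initialization
large_logits = [8.0, -4.0, 0.0, 4.0]       # same direction, 400x larger

small_weights = softmax(small_logits)
large_weights = softmax(large_logits)
print([round(p, 3) for p in small_weights])
print([round(p, 3) for p in large_weights])
```

With small logits every key receives close to a quarter of the attention. After scaling, one key takes almost all of it, and the gradient signal to the other keys largely vanishes.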

Connection to KV-Cache and Inference

The learning rate schedule affects not just training but also the quality of the trained model. Models trained with proper scheduling (warmup + cosine decay) tend to produce more stable attention patterns, which may carry over to inference in two ways. Both are plausible indirect effects rather than established results:

KV-Cache efficiency: models with well-formed attention patterns may be easier to serve with a KV-cache during autoregressive generation, although for standard dense attention the cache's size and memory access pattern are set by the architecture and sequence length, not by the schedule
Quantization robustness: models that converged smoothly (via scheduling) may have weight distributions with fewer outliers, which would make them more amenable to post-training quantization (INT8, INT4) with little quality loss

Flash Attention and Training Speed

Flash Attention (Dao et al., 2022) accelerates the attention computation, reducing the wall-clock time per training step. This means more steps per hour, which makes the total number of training steps $T$ larger for a given time budget. A larger $T$ means the cosine decay stretches over more steps — slower, more gradual decay — which generally improves final model quality. In this sense, Flash Attention indirectly improves model quality by enabling longer, more gradual schedules.

Scaling Laws and Schedule Design

The Chinchilla scaling laws (Hoffmann et al., 2022) established that the optimal balance of model size and training data depends on the total compute budget. This directly affects scheduling: if you know you will train for $T$ steps, you set $T_{\max} = T$ in the cosine schedule. Training for more steps with a proportionally slower decay consistently improves performance — one reason why organizations invest in longer training runs rather than just larger models.
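
The effect of a longer step budget on the schedule can be seen by evaluating the same cosine formula at the same step index under two values of $T$. The peak LR, minimum LR, and step counts below are hypothetical, chosen only to make the comparison concrete.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay from peak_lr to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical budgets: the same step index under a short and a long run.
step = 50_000
short_run = cosine_lr(step, total_steps=100_000)   # halfway through
long_run = cosine_lr(step, total_steps=200_000)    # a quarter of the way through
print(short_run, long_run)
```

At step 50,000 the longer run is still well above the midpoint LR, so it spends more steps at a high learning rate before settling. This is also why the schedule has to be recomputed when the planned number of steps changes.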

Concept	How LR Scheduling Relates
Flash Attention	Faster steps → more total steps T → slower cosine decay → better convergence
Multi-head Attention	Each head has its own Q/K/V weights. Warmup prevents early attention collapse.
Positional Encodings	Stable training via scheduling helps the model learn position-dependent patterns reliably
KV-Cache	Better-trained models (via scheduling) may have more cache-friendly attention patterns; the effect is indirect at most
Transformer Scaling	More compute → longer T → more gradual decay. Schedule is recomputed for each training run.

Summary

Constant learning rates force a tradeoff: high LR for fast progress causes oscillation near the minimum; low LR for precise convergence makes early training slow. Scheduling resolves this by decaying the LR over time.
Step decay ( $\eta_t = \eta_0 \gamma^{\lfloor t/N \rfloor}$ ) multiplies the LR by $\gamma$ at fixed intervals. Simple but has abrupt transitions.
Cosine annealing ( $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$ ) provides smooth, continuous decay. The modern default.
Warmup ramps the LR linearly from a small value to $\eta_0$ over the first $W$ steps. Essential for transformers and adaptive optimizers where variance estimates are unreliable early on.
Warmup + cosine decay is the standard recipe for modern training: safe startup, fast exploration at peak LR, smooth convergence at the end.
In PyTorch: use $\texttt{torch.optim.lr\_scheduler.CosineAnnealingLR}$ for cosine decay and $\texttt{StepLR}$ for step decay. Call $\texttt{scheduler.step()}$ once per epoch after $\texttt{optimizer.step()}$ .

The complete optimizer toolbox: SGD with momentum (Section 1) gives every parameter the same smoothed learning rate. Adam (Section 2) adds per-parameter adaptive rates. Learning rate scheduling (this section) controls the global trajectory of the LR over training. Together, they form the optimization recipe behind most modern large models: AdamW + warmup + cosine decay.