Explain why a constant learning rate fails in noisy optimization and why decaying the LR over time enables tight convergence
Derive the cosine annealing formula and explain how a half-cosine curve maps training steps to learning rates
Explain the warmup problem and why starting with a small LR prevents early training instability in transformers
Implement step decay, cosine annealing, and warmup+cosine schedules from scratch in NumPy
Use PyTorch's lr_scheduler classes (StepLR, CosineAnnealingLR) to train a network with scheduled learning rates
Describe the modern transformer training recipe: warmup + cosine decay with AdamW, and why each piece is necessary
Where We Left Off
In Section 1, we learned that SGD with momentum accelerates convergence by accumulating past gradients into a velocity buffer. The velocity $v_t = \beta v_{t-1} + \nabla L$ smooths out noise and amplifies consistent directions, giving roughly a 3–5× speedup over vanilla SGD.
But we used the same learning rate η=0.05 for all 30 training epochs. That raises a question: should the learning rate stay constant throughout training?
The answer is almost always no. A constant learning rate forces a painful tradeoff. Set it high enough to make fast progress early, and the optimizer oscillates wildly near the minimum. Set it low enough to converge precisely, and early training crawls. Learning rate scheduling resolves this tension by changing the learning rate during training — large steps in the beginning, small steps at the end.
Why Schedules Matter
Consider what happens when you train with SGD on noisy mini-batch gradients. At any step, the observed gradient is:
$\hat{g}_t = \nabla L(w_t) + \varepsilon_t$, where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$
The weight update is $w_{t+1} = w_t - \eta \hat{g}_t$. The noise in the gradient introduces a random walk component with variance proportional to $\eta^2 \sigma^2$. This means:
With a large η: the random walk amplitude is large. The optimizer bounces around the minimum with big jumps. The loss oscillates even after many steps.
With a small η: the random walk amplitude is small, so convergence is tight. But the signal component η∇L is also tiny, so early progress is glacial.
The resolution is straightforward: start with a large η for fast exploration, then decrease it for precise convergence. This is learning rate scheduling.
| Phase | Learning Rate | Behavior |
| --- | --- | --- |
| Early training | Large (e.g. 0.05–0.1) | Rapid loss decrease, rough convergence, exploring the loss landscape |
| Mid training | Medium (e.g. 0.01–0.03) | Refining the solution, approaching the minimum region |
| Late training | Small (e.g. 0.001) | Fine-tuning, settling into the minimum, noise dampened |
The oscillation bound: for SGD with a constant learning rate η, the expected squared distance from the minimum (and hence the excess loss) settles at a noise floor of order $O(\eta \sigma^2)$. To halve that floor, you must halve the learning rate, but halving the LR also halves the convergence speed. Scheduling lets you escape this tradeoff by changing η over time.
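A quick simulation can make the noise floor concrete. The sketch below uses assumed settings (the toy objective f(w) = w², σ = 2, and a hypothetical helper named `stationary_spread`) to run constant-LR SGD long enough to reach its steady state, then measures the mean squared distance from the minimum. For this particular quadratic, the update w ← (1 − 2η)w − ηε has a stationary second moment of ησ² / (4(1 − η)), so halving η should roughly halve the measured value.

```python
import numpy as np

def stationary_spread(eta, sigma=2.0, steps=200_000, burn_in=1_000, seed=0):
    """Mean of w^2 after burn-in for constant-LR SGD on f(w) = w^2 (illustrative helper)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=steps)
    w = 5.0
    total = 0.0
    for t in range(steps):
        w -= eta * (2.0 * w + noise[t])   # noisy gradient step
        if t >= burn_in:
            total += w * w                # squared distance from the minimum at 0
    return total / (steps - burn_in)

for eta in (0.08, 0.04):
    measured = stationary_spread(eta)
    predicted = eta * 2.0**2 / (4.0 * (1.0 - eta))  # closed form for this quadratic only
    print(f"eta={eta:.2f}  measured E[w^2]={measured:.4f}  predicted={predicted:.4f}")
```

The closed form is specific to this one-dimensional quadratic; for general losses only the proportionality to ησ² should be expected to carry over.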
The Big Picture: Exploration vs Exploitation
There is a deeper reason why scheduling works so well. Think of training as having two phases:
Exploration (high LR): the loss landscape has many regions — some with good minima, some with bad ones. A high learning rate lets the optimizer jump over small bad minima and explore broadly. The noise from mini-batches is actually helpful here: it prevents the optimizer from committing too early to a suboptimal minimum.
Exploitation (low LR): once the optimizer has found a promising region, a low learning rate lets it settle into the best point within that region. The noise is now harmful — it prevents precise convergence. A small LR suppresses the noise and allows fine-tuning.
Every learning rate schedule is a particular strategy for transitioning from exploration to exploitation. The question is how to make that transition.
Step Decay
The simplest schedule: multiply the learning rate by a factor γ<1 every N epochs.
$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/N \rfloor}$
For example, with η0=0.05, γ=0.5, and N=10:
| Epochs | Learning Rate | Computation |
| --- | --- | --- |
| 0–9 | 0.05000 | η₀ × 0.5⁰ = 0.05 |
| 10–19 | 0.02500 | η₀ × 0.5¹ = 0.025 |
| 20–29 | 0.01250 | η₀ × 0.5² = 0.0125 |
| 30–39 | 0.00625 | η₀ × 0.5³ = 0.00625 |
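The step decay rule is short enough to write directly. This is a minimal sketch, assuming the example values above (η₀ = 0.05, γ = 0.5, N = 10); the function name `step_decay_lr` is just illustrative.

```python
def step_decay_lr(epoch, eta_0=0.05, gamma=0.5, step_size=10):
    """Step decay: multiply the LR by gamma once every step_size epochs."""
    return eta_0 * gamma ** (epoch // step_size)

# Print the LR at the first and last epoch of each plateau.
for epoch in (0, 9, 10, 19, 20, 29, 30, 39):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch):.5f}")
```

Integer division (`epoch // step_size`) plays the role of the floor in the formula, which is what produces the staircase shape.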
Step decay was the dominant schedule in early deep learning. The widely used ImageNet recipe for ResNets (He et al., 2015) trains for 90 epochs, dividing the LR by 10 at epochs 30 and 60; the paper itself describes dividing the LR by 10 whenever the error plateaus. Step decay works well but requires choosing two hyperparameters (γ and N) and introduces abrupt transitions that can destabilize training.
The Abrupt Transition Problem
When the LR drops suddenly, the optimizer's momentum buffer still reflects the old LR. This is most visible in momentum formulations that fold the learning rate into the velocity ($v_t = \beta v_{t-1} + \eta \nabla L$): the velocity built up over 10 epochs at LR=0.05 is suddenly too large for LR=0.025. This can cause a brief spike in loss right after each step. The model recovers quickly, but the transition is wasteful.
When to use step decay: it remains popular in computer vision (ImageNet training). If you are training a ResNet or similar CNN for classification, step decay with γ=0.1 at epochs [30, 60, 90] in a 100-epoch training run is a well-tested recipe.
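When the drops happen at hand-picked epochs rather than at a fixed interval, the same idea can be expressed with a milestone list. The sketch below is one possible way to do it; the function name `multistep_lr` and the starting rate η₀ = 0.1 are assumptions for illustration, since the recipe above only fixes γ and the milestones.

```python
from bisect import bisect_right

def multistep_lr(epoch, eta_0=0.1, gamma=0.1, milestones=(30, 60, 90)):
    """Multiply the LR by gamma once for every milestone already reached."""
    drops = bisect_right(milestones, epoch)  # number of milestones <= epoch
    return eta_0 * gamma ** drops

for epoch in (0, 29, 30, 59, 60, 89, 90, 99):
    print(f"epoch {epoch:2d}: lr = {multistep_lr(epoch):.6f}")
```

`bisect_right` counts how many milestones are less than or equal to the current epoch, so the LR changes exactly at epochs 30, 60, and 90.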
Cosine Annealing
Cosine annealing (Loshchilov & Hutter, 2016) replaces the staircase with a smooth curve. The learning rate follows a half-period of a cosine function:
$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$
Let's unpack this formula piece by piece:
cos(πt/T) traces a half-cosine from +1 (at t=0) to −1 (at t=T)
Adding 1 shifts the range to [0,2]
Multiplying by $\frac{1}{2}(\eta_0 - \eta_{\min})$ scales to $[0, \eta_0 - \eta_{\min}]$
Adding $\eta_{\min}$ shifts to $[\eta_{\min}, \eta_0]$
The result is a smooth S-shaped decay: slow at the start, fast in the middle, slow again near the end.
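The endpoints and the S-shape can be checked numerically. The following is a small sketch, assuming η₀ = 0.08, η_min = 0.001, and T = 50 as illustrative values; this `cosine_lr` takes T as an explicit argument and is a standalone helper for this check.

```python
import math

def cosine_lr(t, T, eta_0=0.08, eta_min=0.001):
    """Half-cosine decay from eta_0 (at t=0) to eta_min (at t=T)."""
    return eta_min + 0.5 * (eta_0 - eta_min) * (1.0 + math.cos(math.pi * t / T))

T = 50
for t in (0, T // 4, T // 2, 3 * T // 4, T):
    print(f"t={t:2d}: lr = {cosine_lr(t, T):.5f}")

# Compare how much the LR falls over 5 steps at the start vs. around the middle.
early_drop = cosine_lr(0, T) - cosine_lr(5, T)
mid_drop = cosine_lr(20, T) - cosine_lr(25, T)
print(f"drop over steps 0-5: {early_drop:.5f}, drop over steps 20-25: {mid_drop:.5f}")
```

The midpoint value is the average of η₀ and η_min, and the per-step change is largest near t = T/2, which is where the curve is steepest.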
Why Cosine Beats Step Decay
No abrupt transitions: the LR changes continuously, so the momentum buffer stays calibrated. No loss spikes.
Simpler hyperparameters: just η0, ηmin (often 0 or a small fraction of η0), and T (the total step count, which is usually fixed already). No need to choose decay milestones.
Slow decay near start and end: the cosine curve spends more time near η0 (exploring) and near ηmin (fine-tuning). The fast decay in the middle wastes the least time in the intermediate region.
Cosine annealing is the default schedule for most modern training. When you see a training recipe in a paper and it doesn't specify the schedule, it is probably cosine annealing. The formula is simple, the implementation is one line, and in practice it typically matches or outperforms step decay across a wide range of tasks.
Warmup: Stabilizing Early Training
All the schedules above assume that training starts at the peak LR. For simple models and SGD, this works fine. But for transformers and other large models with adaptive optimizers like Adam, starting at the full LR is dangerous.
Why Large Models Need Warmup
Adam (and similar adaptive optimizers) maintains a running estimate of each parameter's squared gradient (its second moment, an uncentered variance). In the first few steps, these estimates are based on very few samples and can be wildly inaccurate. If the learning rate is also high during these early steps, the combination of bad variance estimates + high LR produces huge, unstable weight updates that can derail training before it starts.
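One way to see how little the optimizer knows at the start is to work out Adam's very first update by hand. The sketch below follows the standard Adam update with bias correction; the helper name `adam_first_update` and the example gradient values are illustrative assumptions. With only one gradient sample, the bias-corrected first moment equals g and the second moment equals g², so each parameter moves by roughly the full learning rate no matter how large or small its gradient is.

```python
import numpy as np

def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the parameter change Adam makes on step t=1 (illustrative helper)."""
    m = (1 - beta1) * grad           # first-moment estimate after one sample
    v = (1 - beta2) * grad**2        # second-moment estimate after one sample
    m_hat = m / (1 - beta1)          # bias correction at t=1 recovers grad
    v_hat = v / (1 - beta2)          # bias correction at t=1 recovers grad**2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning more than six orders of magnitude (made-up values).
grads = np.array([1e-4, 1.0, -250.0])
print(adam_first_update(grads))      # each entry has magnitude close to lr
```

Because the step size is set by the learning rate alone on this first update, a small LR during the opening steps directly limits how far a single noisy gradient sample can move the weights.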
The solution: start with a tiny learning rate and linearly ramp up to the target LR over the first W steps:
$\eta_t = \eta_0 \cdot \frac{t+1}{W}$ for $t < W$
During warmup, the optimizer builds up accurate gradient statistics at a safe pace. By the time the LR reaches its peak, the variance estimates are reliable and training proceeds normally.
| Step | Warmup LR (W=5, η₀=0.08) | Status |
| --- | --- | --- |
| 0 | 0.016 | Cautious: variance estimates still cold |
| 1 | 0.032 | Warming up: estimates improving |
| 2 | 0.048 | Estimates more reliable |
| 3 | 0.064 | Almost at full speed |
| 4 | 0.080 | Full LR: estimates now accurate |
| 5+ | Decay phase begins | Switch to cosine/step decay |
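The warmup ramp in the table can be reproduced with a few lines. This is a minimal sketch using the same W = 5 and η₀ = 0.08; the function name `linear_warmup_lr` is illustrative, and it simply holds the LR at η₀ once warmup is over rather than modelling the decay phase.

```python
def linear_warmup_lr(t, eta_0=0.08, W=5):
    """Linear ramp from eta_0/W at t=0 to eta_0 at t=W-1, then hold at eta_0."""
    if t < W:
        return eta_0 * (t + 1) / W
    return eta_0  # placeholder: a decay schedule would take over from here

for t in range(7):
    print(f"step {t}: lr = {linear_warmup_lr(t):.3f}")
```

Using `t + 1` in the numerator keeps the very first step from having a learning rate of exactly zero.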
How many warmup steps? Typical values are 1–10% of total training steps. The original transformer paper ("Attention Is All You Need") used 4,000 warmup steps out of ~100,000 total. LLaMA uses 2,000 warmup steps. For small experiments, even 5–10 steps of warmup helps.
The Modern Standard: Warmup + Cosine Decay
The standard training recipe for transformers and most modern deep learning combines warmup with cosine decay:
Phase 1 (Warmup): linear ramp from η0/W to η0 over steps 0 to W−1
Phase 2 (Cosine Decay): half-cosine from η0 to ηmin over steps W to T−1
This gives the optimizer the best of everything: safe startup (warmup prevents early instability), fast progress (high LR after warmup), and precise convergence (cosine decay settles into the minimum).
| Model | Peak LR | Warmup Steps | Total Steps | Optimizer |
| --- | --- | --- | --- | --- |
| BERT (Base) | 1e-4 | 10,000 | 1,000,000 | AdamW |
| GPT-2 | 2.5e-4 | 2,000 | ~250,000 | AdamW |
| GPT-3 (175B) | 6e-5 | 375 | ~300,000 | AdamW |
| LLaMA-2 (7B) | 3e-4 | 2,000 | ~1,000,000 | AdamW |
| Our toy FlipNet | 0.05 | 5 | 30×16 = 480 | SGD+momentum |
The recipe in one line: AdamW with warmup + cosine decay is to modern deep learning what SGD with momentum was to pre-2017 training. If you are starting a new project and need a default optimizer setup, this is it.
Interactive: LR Schedule Explorer
The visualization below lets you explore different learning rate schedules. Switch between LR Curves (the schedule shapes) and Training Loss (how they affect optimization of $f(w) = w^2$ with noisy gradients). Use the sliders to adjust parameters and see how each schedule responds.
Try these experiments:
LR Curves, default settings: press Play. Watch how cosine (blue) curves smoothly while step decay (purple) drops abruptly. Warmup+cosine (green) ramps up before decaying.
Increase warmup steps to 15: the green curve ramps for longer before peaking. With many warmup steps, the available decay time shrinks.
Switch to Training Loss mode: watch the constant LR (orange) oscillate wildly near the end while cosine and warmup+cosine settle down.
Set noise (σ) to 0: without noise, all schedules converge smoothly. Scheduling matters most when gradients are noisy (which they always are in practice).
Set noise (σ) to 5: extreme noise. The constant LR bounces violently. Scheduled LRs still converge because their tiny late-stage LR dampens the noise.
NumPy: Schedules from Scratch
Let's implement cosine annealing and warmup+cosine from scratch. We minimize $f(w) = w^2$ with noisy gradients (σ=2.0) over 50 steps, comparing constant LR, cosine annealing, and warmup+cosine. The noise simulates mini-batch SGD: each gradient is the true gradient plus Gaussian noise.
NumPy: Learning Rate Schedules from Scratch (lr_schedules.py)
1import numpy as np
NumPy provides fast numerical arrays and mathematical functions. We use np.cos for the cosine schedule, np.pi for π, and np.random.normal for generating gradient noise. All these are vectorized C operations under the hood.
EXECUTION STATE
numpy = Library for numerical computing — ndarray, trigonometric functions (np.cos, np.pi), random number generation (np.random.normal)
3# Minimize f(w) = w² with noisy gradients
Same toy function from Section 1. The minimum is at w=0. But now we add random noise to the gradient, simulating the stochastic nature of mini-batch SGD. This noise is what makes scheduling necessary — constant LR can’t converge tightly when gradients are noisy.
EXECUTION STATE
f(w) = w² = Parabola with minimum at w=0. True gradient: df/dw = 2w. Adding noise: grad = 2w + N(0, σ²). The noise prevents clean convergence with a fixed learning rate.
4# True gradient: df/dw = 2w
Without noise, the gradient is exactly 2w. At w=5 the gradient is 10 (steep). Near w=0 the gradient is small (gentle). Noise adds a random perturbation: some steps the gradient is too large, others too small.
5# mini-batch SGD where each batch gives a noisy estimate
In real training, each mini-batch produces a noisy estimate of the true gradient because it only sees a subset of the data. The noise standard deviation σ controls how noisy the gradients are. Higher σ = smaller batches = noisier gradients.
7np.random.seed(42) — Reproducibility
Seeds the random number generator so the noise sequence is identical across all three schedules. This makes the comparison fair: the only variable is the learning rate schedule, not the noise realization.
EXECUTION STATE
📚 np.random.seed() = NumPy function: seeds the Mersenne Twister RNG. All subsequent calls to np.random.normal() will produce the same sequence. We re-seed before each schedule run so they all see identical noise.
⬇ arg: 42 = Seed value. Any integer works. 42 is convention.
8eta_0 = 0.08 — Peak learning rate
The maximum learning rate. For constant LR, every step uses 0.08. For cosine annealing, the LR starts at 0.08 and decays. For warmup+cosine, the LR ramps UP to 0.08 during warmup, then decays. This is large enough to make fast progress early but will cause oscillation if held constant in noisy conditions.
EXECUTION STATE
η₀ = 0.08 = Peak learning rate. Update size per step: lr × gradient. At w=5 with no noise: step = 0.08 × 10 = 0.8 (moves w from 5.0 to 4.2 in one step). Aggressive but effective early on.
9eta_min = 0.001 — Minimum learning rate
The learning rate floor. The cosine schedule decays from η₀ = 0.08 down to η_min = 0.001. At the end of training, step size = 0.001 × grad, which is 80× smaller than the initial step. This tiny LR lets the optimizer settle precisely near the minimum without bouncing around.
EXECUTION STATE
η_min = 0.001 = Learning rate floor. At step 49: update = 0.001 × grad ≈ 0.001 × 0.7 = 0.0007. Tiny steps → stable convergence. The noise still exists but its effect is multiplied by this small LR, so w barely moves.
10T = 50 — Total training steps
We run 50 gradient descent steps. The cosine schedule completes one half-period of cosine over these 50 steps: cos(0)=1 at step 0, cos(π)=−1 at step 50. The LR transitions smoothly from η₀ to η_min.
EXECUTION STATE
T = 50 = Total steps. In real training, T = total_epochs × steps_per_epoch. For GPT-3: T ≈ 300,000 steps. The cosine curve stretches to fill whatever T you choose.
11sigma = 2.0 — Gradient noise
Each gradient has noise ~ N(0, 4.0) added. At w=0.5, the true gradient is 1.0, but the observed gradient might be anything from −5 to +7. This mimics the variance of mini-batch gradients in real training. With constant LR, this noise causes the weight to bounce around the minimum.
EXECUTION STATE
σ = 2.0 = Standard deviation of gradient noise. noise ~ N(0, σ²) = N(0, 4.0). The signal-to-noise ratio at w=0.5 is: |2×0.5| / 2.0 = 0.5 — the noise is TWICE the true gradient! Only scheduling can handle this.
13# === Schedule 1: Cosine Annealing ===
Cosine annealing follows a half-cosine curve from η₀ to η_min. The decay is gentle at first (near the peak of the cosine), accelerates in the middle, and gentles again as it approaches η_min. This S-shaped decay is more effective than linear decay.
14def cosine_lr(step) — Cosine annealing schedule
Computes the learning rate at a given step using the cosine annealing formula from Loshchilov & Hutter (2016). The formula: η(t) = η_min + ½(η₀ − η_min)(1 + cos(πt/T)). At step 0: cos(0)=1, so LR = η₀. At step T: cos(π)=−1, so LR = η_min.
EXECUTION STATE
⬇ input: step = Current training step (0 to T−1). The schedule maps this integer to a learning rate value.
⬆ returns = float — the learning rate for this step. Range: [η_min, η₀] = [0.001, 0.08]
The cosine annealing formula. Let’s unpack: (1) cos(π·step/T) traces a half cosine from +1 to −1 as step goes 0→T. (2) Adding 1 shifts the range to [0, 2]. (3) Multiplying by 0.5 gives [0, 1]. (4) Multiplying by (η₀−η_min) scales to [0, 0.079]. (5) Adding η_min shifts to [0.001, 0.08].
EXECUTION STATE
📚 np.cos(x) = NumPy cosine function. Returns cos(x) where x is in radians. cos(0) = 1.0, cos(π/2) = 0.0, cos(π) = −1.0. Smooth and differentiable everywhere.
📚 np.pi = NumPy constant: π ≈ 3.14159. Used here so cos(π × step / T) completes exactly half a cycle over T steps.
18# === Schedule 2: Linear Warmup + Cosine Decay ===
The modern standard for transformer training. Instead of starting at the full LR (which can destabilize early training), we ramp up linearly from a small value to η₀ over the first few steps. Then cosine decay takes over. This is the schedule used for GPT, BERT, LLaMA, and most large language models.
Two-phase schedule: (1) Linear warmup from ~0 to η₀ over the first 5 steps, (2) Cosine decay from η₀ to η_min over the remaining 45 steps. The warmup phase prevents large, unstable updates when the optimizer’s momentum/variance estimates are still cold.
EXECUTION STATE
⬇ input: step = Current training step (0 to T−1 = 49)
⬇ input: warmup = 5 = Number of warmup steps. During steps 0–4, LR ramps linearly and reaches η₀ = 0.08 at step 4. At step 5, cosine decay begins (still at η₀, since progress = 0). In practice: 1–10% of total steps.
During the first 5 steps (step = 0, 1, 2, 3, 4), we’re in the warmup phase. The LR increases linearly from η₀/5 = 0.016 to η₀ = 0.080. This gives the optimizer time to build up accurate gradient statistics before taking large steps.
EXECUTION STATE
warmup = 5 = Steps 0–4 are warmup. step < 5 is True for these steps.
→ Why warmup? = Adam and other adaptive optimizers estimate gradient variance from a running average. In the first few steps, these estimates are inaccurate. Large LR + bad estimates = exploding updates. Warmup avoids this.
Linear warmup formula. We use (step+1) so step=0 gives LR = η₀×1/5 = 0.016 (not zero). At step=4: LR = η₀×5/5 = 0.08 (full LR). The ramp is: 0.016 → 0.032 → 0.048 → 0.064 → 0.080.
EXECUTION STATE
(step + 1) / warmup = Linear fraction from 1/5 to 5/5. The +1 ensures step=0 gets a nonzero LR. Without +1, step=0 would have lr=0 (no learning at all).
After warmup, we compute how far through the decay phase we are. progress goes from 0.0 (start of decay, step=5) to 0.978 (final step, step=49); it would reach 1.0 only at step=T=50, which the loop never executes. This is fed into the cosine function. The denominator is T−warmup = 50−5 = 45 decay steps.
EXECUTION STATE
T - warmup = 45 = Number of steps in the decay phase. 50 total minus 5 warmup = 45 cosine decay steps.
→ step=5 = (5−5)/45 = 0.000 — start of decay
→ step=27 = (27−5)/45 = 0.489 — halfway through decay
→ step=49 = (49−5)/45 = 0.978 — near end of decay
23return ... cosine decay after warmup
Same cosine formula as cosine_lr, but using ‘progress’ (0→1) instead of ‘step/T’. This ensures the cosine decay spans only the post-warmup steps. The LR at step=5 is η₀ (matching the end of warmup) and decays to nearly η_min (≈0.0011) by step=49.
→ step=27 = progress=0.489 → cos(0.489π)≈0.035 → lr ≈ 0.042
→ step=49 = progress=0.978 → cos(0.978π)≈−0.998 → lr ≈ 0.0011
26# === Compare three schedules ===
We train f(w)=w² three times with identical noise but different LR schedules. The constant schedule uses η₀ = 0.08 at every step. The cosine schedule decays from 0.08 to 0.001. The warmup+cosine first ramps up to 0.08, then decays. Watch the final oscillation: constant bounces, schedules settle.
27for name, get_lr in [...] — Three schedule functions
Iterates over three schedules. Each is a function that takes a step number and returns the LR. The constant schedule is an inline lambda: lambda s: eta_0 (always returns 0.08). The other two are the functions we defined above.
EXECUTION STATE
📚 lambda s: eta_0 = Python anonymous function. Takes step s and returns eta_0=0.08 regardless of the step. This is the constant LR baseline.
Schedule 1: Constant = get_lr(any_step) = 0.08 always. No adaptation. Fast early progress but oscillates near minimum.
30np.random.seed(42) — Reset noise for fair comparison
Critical: we re-seed the RNG before each schedule run. This means all three schedules encounter the exact same noise sequence at each step. The only difference in the results is due to the learning rate schedule, not random chance.
EXECUTION STATE
→ Why re-seed? = Without re-seeding, each schedule would see different random noise, making comparison meaningless. Re-seeding ensures the ONLY variable is the LR schedule.
31w = 5.0 — Same starting weight
Start at w=5.0 for each schedule. Initial loss = 25.0. All three begin from the same point and try to reach w=0 (loss=0) through the noisy gradient landscape.
EXECUTION STATE
w = 5.0 = Starting weight. Loss = 5² = 25.0. Same starting point for all three schedules.
32for step in range(T): — 50 training steps
Main training loop. Each iteration: (1) sample noise, (2) compute noisy gradient, (3) get LR from schedule, (4) update weight. After 50 steps, compare the final w and oscillation pattern.
Samples one noise value from a Gaussian distribution with mean 0 and std dev σ=2.0. This simulates the stochasticity of mini-batch gradients. Each step gets a different noise value but the sequence is deterministic (seeded).
EXECUTION STATE
📚 np.random.normal(loc, scale) = NumPy function: draws one sample from a normal distribution. loc=0 (mean), scale=sigma=2.0 (std dev). Returns a float, e.g., +0.99, −0.28, +1.30, etc.
⬇ arg 1: loc = 0 = Mean of the distribution. Noise is centered at zero — on average the gradient is correct, but any single step might be off.
⬇ arg 2: scale = sigma = 2.0 = Standard deviation. 68% of noise values fall within [−2, +2]. 95% fall within [−4, +4]. Occasional large values like +3.0 or −3.5 cause big jumps.
→ step 0 noise = +0.993 (gradient pushed slightly higher than the true value)
→ step 49 noise = −3.526 (large negative noise — flips the gradient sign at small w!)
34grad = 2 * w + noise — Noisy gradient
The observed gradient is the true gradient (2w) plus noise. Near the minimum (w≈0), the true gradient is small but noise is still σ=2.0 — the noise dominates! This is why constant LR oscillates: the optimizer keeps taking big steps based on noise, not signal.
EXECUTION STATE
→ step=0 (all schedules) = grad = 2×5.0 + 0.993 = 10.993 (noise is 10% of signal — minor)
→ step=49 (Constant) = grad = 2×0.127 + (−3.526) = −3.271 (noise FLIPS the sign! True grad is +0.25 but observed is −3.3)
→ The problem = Near minimum, signal (2w ≈ 0.3) is dwarfed by noise (σ=2.0). With lr=0.08: step = 0.08×3.3 = 0.26. That’s a HUGE jump when w is only 0.13!
35lr = get_lr(step) — Get scheduled learning rate
Calls the schedule function for the current step. For constant: always 0.08. For cosine: smoothly decays from 0.08 to 0.001. For warmup+cosine: ramps up in steps 0–4, then decays. This is where the magic happens — at step 49, cosine uses lr=0.001 while constant uses lr=0.08.
EXECUTION STATE
Constant at step 49 = get_lr(49) = 0.08 (unchanged — still taking big steps!)
Cosine at step 49 = get_lr(49) = 0.00108 (about 74× smaller than constant)
Standard gradient descent update. The step size = lr × grad. With constant LR near the minimum: step = 0.08 × (−3.27) = +0.262 (jumps from 0.127 to 0.389!). With cosine LR: step = 0.001 × (−2.80) = +0.003 (barely moves from 0.362 to 0.365). Scheduling turns wild bouncing into gentle settling.
EXECUTION STATE
→ Constant, step 49 = w = 0.127 − 0.08×(−3.271) = 0.127 + 0.262 = 0.389 (jumped 0.26 due to noise!)
→ Cosine, step 49 = w = 0.362 − 0.001×(−2.802) = 0.362 + 0.003 = 0.365 (barely moved — stable!)
→ Key insight = Same noise, same gradient, different step sizes. Constant LR: noise causes ±0.26 jumps. Cosine: noise causes ±0.003 wiggles. That’s the power of scheduling.
37if step % 10 == 0 or step == T - 1: — Print at intervals
Prints every 10 steps plus the final step. This shows the trajectory: fast descent in early steps, then either oscillation (constant) or settling (cosine/warmup).
38print(...) — Output formatted results
Prints schedule name, step number, current LR, weight, and loss. The output reveals: Constant LR oscillates (loss bounces: 0.26, 0.18, 0.18, 0.07, 0.15). Cosine settles (loss: 0.31, 0.28, 0.22, 0.14, 0.13). Warmup+Cosine also settles (loss: 0.76, 0.33, 0.24, 0.13, 0.13).
EXECUTION STATE
→ Constant final 10 steps = w oscillates: std=0.118, range=[0.13, 0.46]. Loss bounces between 0.02 and 0.21. Unstable!
→ Cosine final 10 steps = w stable: std=0.004, range=[0.358, 0.369]. Loss steady at ~0.13. Converged!
→ Warmup+Cos final 10 steps = w stable: std=0.005, range=[0.352, 0.365]. Loss steady at ~0.13. Converged!
40print() — Blank line between schedules
Separates the output of each schedule with a blank line for readability.
```python
import numpy as np

# ── Minimize f(w) = w² with noisy gradients ──
# True gradient: df/dw = 2w. We add noise to simulate
# mini-batch SGD where each batch gives a noisy estimate.

np.random.seed(42)
eta_0 = 0.08     # peak learning rate
eta_min = 0.001  # minimum learning rate
T = 50           # total training steps
sigma = 2.0      # gradient noise std dev

# === Schedule 1: Cosine Annealing ===
def cosine_lr(step):
    return eta_min + 0.5 * (eta_0 - eta_min) * (
        1 + np.cos(np.pi * step / T))

# === Schedule 2: Linear Warmup + Cosine Decay ===
def warmup_cosine_lr(step, warmup=5):
    if step < warmup:
        return eta_0 * (step + 1) / warmup
    progress = (step - warmup) / (T - warmup)
    return eta_min + 0.5 * (eta_0 - eta_min) * (
        1 + np.cos(np.pi * progress))

# === Compare three schedules ===
for name, get_lr in [("Constant", lambda s: eta_0),
                     ("Cosine", cosine_lr),
                     ("Warmup+Cosine", warmup_cosine_lr)]:
    np.random.seed(42)
    w = 5.0
    for step in range(T):
        noise = np.random.normal(0, sigma)
        grad = 2 * w + noise
        lr = get_lr(step)
        w = w - lr * grad
        if step % 10 == 0 or step == T - 1:
            print(f"[{name:14s}] Step {step:2d}: "
                  f"lr={lr:.5f} w={w:7.4f} loss={w**2:.4f}")
    print()
```
The results reveal the key insight:
| Schedule | Final w | Final Loss | Last 10 Steps Std |
| --- | --- | --- | --- |
| Constant (lr=0.08) | 0.389 | 0.151 | 0.118 (bouncing!) |
| Cosine Annealing | 0.365 | 0.133 | 0.004 (stable) |
| Warmup + Cosine | 0.361 | 0.130 | 0.005 (stable) |
All three reach similar neighborhoods, but constant LR oscillates with a standard deviation of 0.118 while the scheduled approaches have std < 0.005 — a 25× reduction in oscillation. The schedules "park" the optimizer at a good point and stop bouncing.
Why are the losses similar? With only 50 steps on a simple function, all schedules make good progress. The difference is in stability, not final average loss. In real training with millions of steps, this stability matters more: the optimizer can settle deep into the basin it has found rather than bouncing around it, which typically shows up as a lower final loss.
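The script above prints the trajectory but not the "Last 10 Steps Std" column. One way it could be computed is sketched below, assuming the same constants, seed, and update rule as the script; the helper name `run` and the choice of the population standard deviation (NumPy's default) are assumptions, so the exact figures may differ slightly from the table depending on how the original statistic was taken.

```python
import numpy as np

eta_0, eta_min, T, sigma = 0.08, 0.001, 50, 2.0

def cosine_lr(step):
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + np.cos(np.pi * step / T))

def run(get_lr, seed=42):
    """Run noisy gradient descent on f(w) = w^2 and return every visited w."""
    np.random.seed(seed)          # same noise sequence for every schedule
    w, path = 5.0, []
    for step in range(T):
        grad = 2 * w + np.random.normal(0, sigma)
        w -= get_lr(step) * grad
        path.append(w)
    return np.array(path)

schedules = [("Constant", lambda s: eta_0), ("Cosine", cosine_lr)]
tail_std = {name: run(fn)[-10:].std() for name, fn in schedules}
for name, s in tail_std.items():
    print(f"{name:8s} std of w over last 10 steps: {s:.4f}")
```

Comparing the spread of w over the final steps, rather than a single final value, is what separates an optimizer that has settled from one that merely happened to land near the minimum on the last step.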
PyTorch: Training with Schedulers
PyTorch provides built-in schedulers in torch.optim.lr_scheduler. Let's train our FlipNet from Chapters 7–8 with three approaches: constant LR, StepLR, and CosineAnnealingLR. All three models start with identical weights. The only difference is how the learning rate evolves.
PyTorch: Training with LR Schedulers (train_with_scheduling.py)
1import torch
PyTorch — the deep learning framework. Provides tensors, autograd, and the full training ecosystem including optimizers and learning rate schedulers (torch.optim.lr_scheduler).
Neural network module containing layers (nn.Linear), activations (nn.ReLU), and the base Module class. Same import as Chapters 7–9.
EXECUTION STATE
torch.nn = Neural network building blocks — nn.Linear, nn.ReLU, nn.Module
4torch.manual_seed(42) — Global seed
Sets the global random seed. But we also re-seed before creating each model so all three start with identical weights, making the LR schedule the only difference.
EXECUTION STATE
📚 torch.manual_seed() = Seeds PyTorch’s RNG for weight initialization, dropout, etc. We call this again before each FlipNet() below.
7class FlipNet(nn.Module): — Our test network
The same 4→3→4 network from Chapters 7–8 and Section 1 of this chapter. 31 learnable parameters. Input: flattened 2×2 binary image. Output: the diagonally flipped image. A simple but real network for testing optimizer behavior.
Chain: linear → relu → linear. PyTorch records all ops for autograd.
18X = torch.tensor([...]) — All 16 binary 2×2 images
Creates the dataset: all 16 possible 4-bit binary strings as float tensors. X[0] = [0,0,0,0], X[1] = [0,0,0,1], ..., X[15] = [1,1,1,1]. Each represents a flattened 2×2 image.
EXECUTION STATE
X shape = (16, 4) — 16 images, 4 pixels each
f"{i:04b}" = Python f-string: formats integer i as 4-digit binary. i=5 → "0101" → [0,1,0,1]
20Y = X[:, [0, 2, 1, 3]] — Diagonal flip targets
The target output is the diagonal flip of each image. Reordering columns [0,2,1,3] swaps elements 1 and 2 (the off-diagonal pixels in a 2×2 matrix). This is the operation the network must learn.
22def train_epoch(model, optimizer) — One training epoch
Trains on all 16 images sequentially (stochastic gradient descent, batch size=1). Returns the average loss. This function is called with different optimizers to compare schedulers.
EXECUTION STATE
⬇ input: model = A FlipNet instance with its own weights
⬇ input: optimizer = SGD optimizer bound to model’s parameters. May have a scheduler attached externally.
⬆ returns = float — average MSE loss over 16 images this epoch
23total_loss = 0.0
Accumulator for summing losses across all 16 images.
24for i in range(16):
Loop over all 16 training images. Each iteration: forward pass, compute loss, backprop, update weights.
25pred = model(X[i])
Forward pass: runs image X[i] through the network to get a 4-element prediction.
26loss = torch.mean((pred - Y[i]) ** 2)
MSE loss: average squared difference between prediction and target. Differentiable — PyTorch can compute gradients through this.
EXECUTION STATE
📚 torch.mean() = Computes the mean of all elements. For a 4-element vector: sum/4.
27optimizer.zero_grad()
Clears accumulated gradients from the previous step. Without this, gradients would accumulate across images.
28loss.backward()
Backpropagation: computes ∂loss/∂w for all 31 parameters. Stores gradients in each parameter’s .grad attribute.
29optimizer.step()
Applies the SGD+momentum update: v = βv + grad; w = w - lr·v. The lr used here is whatever the scheduler has set in opt.param_groups[0]["lr"]. This is how scheduling works: the scheduler modifies this lr between epochs.
EXECUTION STATE
→ How schedulers connect = Scheduler.step() modifies optimizer.param_groups[0]["lr"]. When optimizer.step() runs, it reads the current lr from param_groups. No code change needed inside the training loop.
30total_loss += loss.item()
Adds this image’s loss to the accumulator. .item() extracts the Python float from a 0-dim tensor.
31return total_loss / 16
Returns average loss per image this epoch. Dividing by 16 normalizes for the dataset size.
33# Model 1: Constant LR (no scheduler)
The baseline: SGD with momentum and no learning rate decay. The LR stays at 0.05 for all 30 epochs. This is the approach from Section 1.
34torch.manual_seed(42); m_const = FlipNet()
Re-seed and create the first model. All three models start with identical weights thanks to the same seed.
EXECUTION STATE
→ Why re-seed? = nn.Linear draws its initial weights from PyTorch’s global RNG (a Kaiming-uniform scheme by default). Same seed → same initial weights → fair comparison.
35opt_const = SGD(params, lr=0.05, momentum=0.9)
Creates an SGD optimizer with momentum 0.9 and constant LR 0.05. No scheduler will modify this LR — it stays 0.05 forever.
EXECUTION STATE
⬇ lr = 0.05 = Learning rate. Stays constant for all 30 epochs. The update: v = 0.9v + grad; w = w − 0.05·v
Step decay: multiply the LR by γ=0.5 every 10 epochs. LR goes: 0.05 → 0.025 → 0.0125 → 0.00625. Simple and widely used in computer vision (ResNet papers).
39torch.manual_seed(42); m_step = FlipNet()
Second model with identical initial weights.
40opt_step = SGD(params, lr=0.05, momentum=0.9)
Same optimizer settings as model 1. The scheduler will modify this optimizer’s LR.
Creates a step decay scheduler. Every 10 epochs, multiplies the optimizer’s LR by 0.5. The scheduler reads and writes opt_step.param_groups[0]["lr"]. Calling sched_step.step() after each epoch applies the decay.
EXECUTION STATE
📚 torch.optim.lr_scheduler.StepLR = PyTorch scheduler: lr = initial_lr × gamma^(epoch // step_size). Simple multiplicative decay at fixed intervals.
⬇ arg: opt_step = The optimizer whose LR to modify. The scheduler stores a reference to the optimizer.
⬇ arg: gamma = 0.5 = Multiplication factor at each decay. 0.5 = halve the LR. After 3 decays: 0.05 × 0.5³ = 0.00625.
# Model 3: CosineAnnealingLR
Cosine annealing: smooth decay following a half-cosine curve. A popular schedule for modern training, with no step_size or gamma to tune, just T_max and eta_min.
torch.manual_seed(42); m_cos = FlipNet()
Third model with identical initial weights.
opt_cos = SGD(params, lr=0.05, momentum=0.9)
Same optimizer settings. The CosineAnnealingLR scheduler will modify this LR smoothly.
sched_cos = CosineAnnealingLR(opt_cos, T_max=30, eta_min=0.001)
Creates a cosine annealing scheduler. The LR follows: η(t) = η_min + ½(η₀ − η_min)(1 + cos(πt/T_max)). Over 30 epochs, smoothly decays from 0.05 to 0.001. No abrupt drops like StepLR.
EXECUTION STATE
📚 CosineAnnealingLR = PyTorch scheduler: follows a half-cosine curve from initial_lr to eta_min over T_max epochs. Proposed by Loshchilov & Hutter (SGDR, 2016).
⬇ arg: opt_cos = The optimizer to schedule. The scheduler records each param group's starting LR as param_groups[i]["initial_lr"] = 0.05.
⬇ arg: T_max = 30 = Total epochs for one cosine half-cycle. LR at epoch 0: 0.05 (cos(0)=1). LR at epoch 15: 0.0255 (cos(π/2)=0, midpoint). LR at epoch 30: 0.001 (cos(π)=−1).
⬇ arg: eta_min = 0.001 = Minimum LR. The cosine curve reaches this value at epoch T_max. With the default eta_min=0, the LR would decay all the way to 0.
for epoch in range(30):  # Train all three models
Main training loop: 30 epochs. Each epoch trains all three models on all 16 images. After each epoch, the schedulers step to update the LR for the next epoch.
LOOP TRACE · 7 iterations
Epoch 0
All identical = Constant=0.3928 Step(lr=0.050)=0.3928 Cosine(lr=0.050)=0.3928
l1 = train_epoch(m_const, opt_const)
Train model 1 (constant LR) for one epoch. Returns average loss over 16 images.
l2 = train_epoch(m_step, opt_step)
Train model 2 (step decay). The optimizer uses whatever LR the scheduler set before this epoch.
sched_step.step()  # Update StepLR
Advances the StepLR scheduler by one epoch. If epoch+1 is a multiple of step_size (10), the optimizer’s LR is multiplied by gamma (0.5). Otherwise, LR stays the same. IMPORTANT: call scheduler.step() AFTER optimizer.step(), once per epoch.
EXECUTION STATE
📚 scheduler.step() = PyTorch scheduler method: updates the LR for the next epoch. Reads the current epoch count internally. Must be called after optimizer.step().
→ After epoch 10 = sched_step.step() checks: epoch 11 % 10 == 0? No → lr stays 0.025
l3 = train_epoch(m_cos, opt_cos)
Train model 3 (cosine annealing). The LR changes smoothly every epoch.
sched_cos.step()  # Update CosineAnnealingLR
Advances the cosine scheduler. Unlike StepLR, the LR changes EVERY epoch (not just at milestones). The change is computed from the cosine formula.
EXECUTION STATE
→ After epoch 0 = New lr = 0.001 + 0.5×0.049×(1+cos(π×1/30)) = 0.04987
→ After epoch 14 = New lr = 0.001 + 0.5×0.049×(1+cos(π×15/30)) = 0.02550 (midpoint)
→ After epoch 28 = New lr = 0.001 + 0.5×0.049×(1+cos(π×29/30)) = 0.00113
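The three values above follow directly from the closed-form cosine formula. As a small sanity check in plain Python (the function name cosine_annealing_lr is hypothetical, and this evaluates the formula rather than calling PyTorch):

```python
import math

def cosine_annealing_lr(epoch, initial_lr=0.05, eta_min=0.001, t_max=30):
    """Closed-form cosine annealing: eta_min + 0.5*(eta_0 - eta_min)*(1 + cos(pi*t/T_max))."""
    return eta_min + 0.5 * (initial_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Epoch counts 1, 15, and 29 correspond to the LR set after epochs 0, 14, and 28.
for epoch in (0, 1, 15, 29, 30):
    print(f"t = {epoch:2d}: lr = {cosine_annealing_lr(epoch):.5f}")
```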
if epoch % 5 == 0 or epoch == 29:  # Print every 5 epochs
Prints at epochs 0, 5, 10, 15, 20, 25, and 29 (the final epoch) to show the training progression.
lr_s = opt_step.param_groups[0]['lr']  # Read current StepLR
Reads the current learning rate from the optimizer’s parameter groups. This is the value that the scheduler modifies. It’s the ground truth of what LR was actually used.
EXECUTION STATE
📚 param_groups[0]['lr'] = PyTorch stores optimizer parameters in groups. Most common: one group with all parameters. The 'lr' key holds the current learning rate. Schedulers modify this dict directly.
lr_c = opt_cos.param_groups[0]['lr']  # Read current cosine LR
Same pattern: reads the cosine scheduler’s current LR for display.
print(...)  # Formatted epoch results
Prints all three losses with their current LRs. Key observation at epoch 29: Constant=0.1516, Step=0.1269, Cosine=0.1252. The constant LR model has HIGHER loss than it did at epoch 20 (0.1348) — it’s oscillating! Both scheduled models converged to lower loss.
EXECUTION STATE
→ Epoch 0 (all same) = All three models are identical at this point — same weights, same LR. Loss = 0.3928.
→ Epoch 10 (divergence begins) = StepLR just dropped to 0.025. All models are similar (~0.13–0.14) but constant is slightly higher.
→ Epoch 29 (the reveal) = Constant: 0.1516 (HIGHER than epoch 20!). Step: 0.1269. Cosine: 0.1252 (best). Scheduling wins.
```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# ── Same FlipNet from Chapters 7-8 ──
class FlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))

# ── Dataset: all 16 binary 2×2 images ──
X = torch.tensor([[int(b) for b in f"{i:04b}"]
                  for i in range(16)], dtype=torch.float32)
Y = X[:, [0, 2, 1, 3]]  # diagonal flip

def train_epoch(model, optimizer):
    total_loss = 0.0
    for i in range(16):
        pred = model(X[i])
        loss = torch.mean((pred - Y[i]) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / 16

# ── Model 1: Constant LR (no scheduler) ──
torch.manual_seed(42)
m_const = FlipNet()
opt_const = torch.optim.SGD(m_const.parameters(),
                            lr=0.05, momentum=0.9)

# ── Model 2: StepLR (halve every 10 epochs) ──
torch.manual_seed(42)
m_step = FlipNet()
opt_step = torch.optim.SGD(m_step.parameters(),
                           lr=0.05, momentum=0.9)
sched_step = torch.optim.lr_scheduler.StepLR(
    opt_step, step_size=10, gamma=0.5)

# ── Model 3: CosineAnnealingLR ──
torch.manual_seed(42)
m_cos = FlipNet()
opt_cos = torch.optim.SGD(m_cos.parameters(),
                          lr=0.05, momentum=0.9)
sched_cos = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt_cos, T_max=30, eta_min=0.001)

for epoch in range(30):
    l1 = train_epoch(m_const, opt_const)
    l2 = train_epoch(m_step, opt_step)
    sched_step.step()
    l3 = train_epoch(m_cos, opt_cos)
    sched_cos.step()
    if epoch % 5 == 0 or epoch == 29:
        lr_s = opt_step.param_groups[0]['lr']
        lr_c = opt_cos.param_groups[0]['lr']
        print(f"Epoch {epoch:2d}: Constant={l1:.4f} "
              f"Step(lr={lr_s:.4f})={l2:.4f} "
              f"Cosine(lr={lr_c:.5f})={l3:.4f}")
```
The final epoch losses tell the story:
| Method | Loss at Epoch 0 | Loss at Epoch 29 | Final LR |
|---|---|---|---|
| Constant LR | 0.3928 | 0.1516 | 0.050 (unchanged) |
| StepLR (γ=0.5, N=10) | 0.3928 | 0.1269 | 0.006 |
| CosineAnnealingLR | 0.3928 | 0.1252 | 0.001 |
Notice two things. First, the constant LR model has higher loss at epoch 29 (0.152) than at epoch 20 (0.135) — it is oscillating around the minimum. Second, cosine annealing achieves the lowest final loss (0.125) with the smoothest convergence trajectory.
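The oscillation described above can be reproduced on a toy problem. The sketch below is not the FlipNet experiment: it assumes a 1-D quadratic loss L(w) = w²/2 with Gaussian gradient noise, and the names tail_error and cosine_lr are hypothetical. It compares the average squared distance from the minimum over the last 200 steps for a constant LR and for a cosine schedule that starts at the same LR.

```python
import math
import random

def cosine_lr(step, total_steps, peak_lr, min_lr):
    """Half-cosine decay from peak_lr to min_lr over total_steps."""
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))

def tail_error(schedule, total_steps=2000, tail=200, noise_std=1.0, seed=0):
    """Noisy SGD on L(w) = w^2 / 2; returns mean w^2 over the last `tail` steps."""
    rng = random.Random(seed)
    w = 5.0
    tail_sq = 0.0
    for step in range(total_steps):
        grad = w + rng.gauss(0.0, noise_std)  # true gradient (w) plus noise
        w -= schedule(step) * grad
        if step >= total_steps - tail:
            tail_sq += w * w
    return tail_sq / tail

n_seeds = 20  # average over several noise realizations
const_err = sum(tail_error(lambda s: 0.1, seed=k) for k in range(n_seeds)) / n_seeds
cos_err = sum(tail_error(lambda s: cosine_lr(s, 2000, 0.1, 0.001), seed=k)
              for k in range(n_seeds)) / n_seeds

print(f"constant LR tail error: {const_err:.5f}")
print(f"cosine LR tail error:   {cos_err:.5f}")
```

With a constant LR the iterate keeps jittering at a noise floor set by the LR, while the decayed schedule lets that floor shrink toward the end of training.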
The scheduler pattern in PyTorch: create the optimizer, create the scheduler wrapping the optimizer, and call scheduler.step() once per epoch (after optimizer.step()). The scheduler directly modifies the "lr" entry of each of the optimizer's param_groups (here, param_groups[0]["lr"]). No other changes to your training loop are needed.
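To see the pattern without any training noise, the sketch below steps both schedulers on a single dummy parameter and records the LR in effect during each epoch. The helper lr_trajectory and the dummy parameter are assumptions made for illustration; StepLR, CosineAnnealingLR, and param_groups are the real PyTorch API used in this section.

```python
import math
import torch

def lr_trajectory(make_scheduler, epochs=30, lr=0.05):
    """Record the LR used in each epoch for a scheduler built by `make_scheduler`."""
    # One dummy parameter is enough: schedulers only touch the optimizer's LR.
    param = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.SGD([param], lr=lr, momentum=0.9)
    sched = make_scheduler(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(opt.param_groups[0]["lr"])  # LR in effect during this epoch
        opt.step()    # optimizer first (a no-op here, since there are no gradients)
        sched.step()  # then the scheduler, once per epoch
    return lrs

step_lrs = lr_trajectory(
    lambda o: torch.optim.lr_scheduler.StepLR(o, step_size=10, gamma=0.5))
cos_lrs = lr_trajectory(
    lambda o: torch.optim.lr_scheduler.CosineAnnealingLR(o, T_max=30, eta_min=0.001))

# Compare the cosine scheduler with the closed-form formula.
closed_form = [0.001 + 0.5 * (0.05 - 0.001) * (1 + math.cos(math.pi * t / 30))
               for t in range(30)]
max_gap = max(abs(a - b) for a, b in zip(cos_lrs, closed_form))
print(step_lrs[0], step_lrs[10], step_lrs[20])
print(f"largest gap to closed form: {max_gap:.2e}")
```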
Other PyTorch schedulers: ExponentialLR (multiply by γ each epoch), ReduceLROnPlateau (reduce when validation loss stops improving), OneCycleLR (cosine with warmup in one call), and CosineAnnealingWarmRestarts (periodic cosine with warm restarts). For most cases, CosineAnnealingLR with a separate warmup phase is sufficient.
Connection to Modern Training
Learning rate scheduling is not just a nice-to-have: it is a standard component of modern training pipelines. Here is how it connects to the systems that power today's AI.
The Transformer Training Recipe
Most recent large language models follow a similar recipe:
Optimizer: AdamW (Adam with decoupled weight decay)
Schedule: Linear warmup + cosine decay to $\eta_{\min} = 0.1 \cdot \eta_0$
Warmup: 0.1–2% of total training steps
Weight decay: 0.1 (regularization, independent of LR)
Gradient clipping: max norm = 1.0 (prevents exploding gradients)
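Warmup combined with weight-decayed Adam was popularized by BERT (2018), which paired it with linear rather than cosine decay; the GPT family paired warmup with cosine decay, and the recipe has been remarkably stable since. GPT-3 (2020) and LLaMA-2 (2023) report essentially this setup.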
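A minimal sketch of how the pieces of this recipe fit together in PyTorch. Everything specific here is an assumption for illustration: the helper warmup_cosine_factor, the tiny nn.Linear stand-in for a transformer, the peak LR of 3e-4, and the 1000-step run with 2% warmup. LambdaLR multiplies the optimizer's base LR by the returned factor, and the scheduler is stepped once per optimizer step rather than once per epoch.

```python
import math
import torch

def warmup_cosine_factor(step, total_steps, warmup_steps, min_ratio=0.1):
    """Multiplier on the peak LR: linear warmup, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))

torch.manual_seed(0)
model = torch.nn.Linear(8, 8)          # stand-in for a transformer (assumption)
total_steps, warmup_steps = 1000, 20   # 2% warmup (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: warmup_cosine_factor(s, total_steps, warmup_steps))

lrs = []
for step in range(total_steps):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()      # dummy loss, only to produce gradients
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    lrs.append(optimizer.param_groups[0]["lr"])  # LR used for this update
    optimizer.step()
    scheduler.step()                   # per step, not per epoch

print(f"first LR {lrs[0]:.2e}, peak LR {max(lrs):.2e}, last LR {lrs[-1]:.2e}")
```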
Why Warmup Is Essential for Transformers
Transformers have a specific instability that makes warmup non-optional. The attention mechanism computes $\mathrm{softmax}(QK^\top/\sqrt{d_k})$. In the first few steps, the Q and K matrices are randomly initialized, so the attention weights are nearly uniform (each token attends equally to all others). The gradients from these uniform attention patterns are large and noisy.
If the learning rate is also large at this point, the weight updates are so big that the attention weights can become sharply peaked (one token dominates) before the model has learned which tokens should actually attend to which. This creates a self-reinforcing cycle where bad attention patterns produce bad gradients that reinforce the bad patterns. Warmup breaks this cycle by keeping updates small until the attention weights have time to organize.
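A toy illustration of the two regimes described above, in plain Python. The score values are hypothetical and no real attention layer is involved: small pre-softmax scores stand in for a freshly initialized model, and the same scores scaled up by 200 stand in for the assumed effect of an oversized early update.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scaled scores q·k / sqrt(d_k) for one query over four keys.
base_scores = [0.02, -0.01, 0.03, -0.04]
near_uniform = softmax(base_scores)          # small scores: close to 1/4 each

# Assumed effect of a too-large update: the same scores, 200x larger.
scaled_scores = [200 * s for s in base_scores]
peaked = softmax(scaled_scores)              # one key takes most of the weight

print("near init:   ", [round(p, 3) for p in near_uniform])
print("after blowup:", [round(p, 3) for p in peaked])
```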
Connection to KV-Cache and Inference
The learning rate schedule affects not just the training run but also the quality of the trained model. Models trained with proper scheduling (warmup + cosine decay) tend to produce more stable attention patterns, which may have two implications for inference:
KV-Cache efficiency: the size of the KV-cache is fixed by the architecture and sequence length, but well-formed attention patterns may make memory access during autoregressive generation more predictable
Quantization robustness: models that converged smoothly (via scheduling) may have weight distributions with fewer outliers, making them more amenable to post-training quantization (INT8, INT4) with less quality loss
Flash Attention and Training Speed
Flash Attention (Dao et al., 2022) accelerates the attention computation, reducing the wall-clock time per training step. This means more steps per hour, which makes the total number of training steps T larger for a given time budget. A larger T means the cosine decay stretches over more steps — slower, more gradual decay — which generally improves final model quality. In this sense, Flash Attention indirectly improves model quality by enabling longer, more gradual schedules.
Scaling Laws and Schedule Design
The Chinchilla scaling laws (Hoffmann et al., 2022) established that the optimal balance of model size and training data depends on the total compute budget. This directly affects scheduling: if you know you will train for $T$ steps, you set $T_{\max} = T$ in the cosine schedule. Training for more steps with a proportionally slower decay generally improves performance, which is one reason organizations invest in longer training runs rather than just larger models.
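To make "proportionally slower decay" concrete, the sketch below stretches the same cosine curve over a run that is ten times longer and compares the largest single-step LR drop. The peak and minimum LRs and the two run lengths are hypothetical values chosen for illustration.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay with T_max set to the full run length."""
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))

def largest_step_drop(total_steps):
    """Biggest LR decrease between two consecutive steps of the schedule."""
    return max(cosine_lr(t, total_steps) - cosine_lr(t + 1, total_steps)
               for t in range(total_steps))

short_run = largest_step_drop(1_000)
long_run = largest_step_drop(10_000)

# Both runs start at peak_lr and end at min_lr; the longer run gets there in
# steps that are roughly 10x smaller.
print(f"T = 1,000:  largest drop per step = {short_run:.3e}")
print(f"T = 10,000: largest drop per step = {long_run:.3e}")
```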
| Concept | How LR Scheduling Relates |
|---|---|
| Flash Attention | Faster steps → more total steps T → slower cosine decay → better convergence |
| Multi-head Attention | Each head has its own Q/K/V weights. Warmup prevents early attention collapse. |
| Positional Encodings | Stable training via scheduling helps the model learn position-dependent patterns reliably |
| KV-Cache | Better-trained models (via scheduling) have more cache-friendly attention patterns |
| Transformer Scaling | More compute → longer T → more gradual decay. Schedule is recomputed for each training run. |
Summary
Constant learning rates force a tradeoff: high LR for fast progress causes oscillation near the minimum; low LR for precise convergence makes early training slow. Scheduling resolves this by decaying the LR over time.
Step decay ($\eta_t = \eta_0 \gamma^{\lfloor t/N \rfloor}$) multiplies the LR by γ at fixed intervals. Simple but has abrupt transitions.
Cosine annealing ($\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$) provides smooth, continuous decay. The modern default.
Warmup ramps the LR linearly from a small value to $\eta_0$ over the first W steps. Essential for transformers and adaptive optimizers where variance estimates are unreliable early on.
Warmup + cosine decay is the standard recipe for modern training: safe startup, fast exploration at peak LR, smooth convergence at the end.
In PyTorch: use torch.optim.lr_scheduler.CosineAnnealingLR for cosine decay and StepLR for step decay. Call scheduler.step() once per epoch after optimizer.step().
The complete optimizer toolbox: SGD with momentum (Section 1) gives every parameter the same smoothed learning rate. Adam (Section 2) adds per-parameter adaptive rates. Learning rate scheduling (this section) controls the global trajectory of the LR over training. Together, they form the optimization setup used to train most modern large models: AdamW + warmup + cosine decay.