Learning Objectives
By the end of this section, you will be able to:
- Understand the critical role of learning rate: See how the learning rate controls the speed and stability of training, and why getting it right is often the most important hyperparameter decision
- Find optimal learning rates: Use the learning rate range test to systematically find the best starting learning rate for your model
- Implement learning rate schedules: Apply step decay, exponential decay, cosine annealing, and other scheduling strategies in PyTorch
- Apply warmup strategies: Understand why warmup improves training stability and implement it effectively
- Use cyclical learning rates: Leverage oscillating learning rates to escape local minima and achieve better generalization
Why This Matters: The learning rate is often called the most important hyperparameter in deep learning. A well-chosen learning rate schedule can be the difference between a model that converges in hours versus days, or between 90% and 95% accuracy. Mastering learning rate strategies is essential for training models efficiently.
Why Learning Rate Matters
The learning rate η controls the step size in gradient descent:

θ ← θ − η ∇L(θ)
This single number has a profound impact on training dynamics:
| Learning Rate | Effect | Symptoms |
|---|---|---|
| Too high | Overshoots optima, oscillates wildly | Loss spikes, NaN values, training fails |
| Too low | Crawls toward optimum | Days/weeks to converge, stuck in local minima |
| Just right | Converges efficiently | Smooth loss curve, reaches good minimum |
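The failure modes in the table above are easy to see on a toy problem. This sketch (illustrative, not from the text) runs gradient descent on f(x) = x², where the update x ← x − η·2x either contracts toward the minimum or explodes depending on η:

```python
# Toy illustration: gradient descent on f(x) = x^2.
# The gradient is 2x, so the update is x <- x - lr * 2x = x * (1 - 2*lr).
def descend(lr, steps=20, x=1.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(0.1))    # just right: |x| shrinks steadily toward 0
print(descend(0.001))  # too low: after 20 steps x has barely moved
print(descend(1.2))    # too high: |x| grows every step -- the loss explodes
```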
The Goldilocks Problem
Finding the right learning rate is a balancing act:
- Too high: The optimizer takes huge steps, overshooting valleys and bouncing off loss landscape walls. The loss oscillates or explodes.
- Too low: The optimizer makes tiny, timid steps. Training takes forever and may get stuck in poor local minima because it lacks the momentum to escape.
- Just right: The optimizer moves confidently toward good solutions, fast enough to be practical but controlled enough to converge.
The Changing Landscape
Here's the key insight: the optimal learning rate changes during training. Early in training, when the model is far from any optimum, you can take larger steps without overshooting. Later, when approaching a minimum, you need smaller, more precise steps to settle into the optimum without bouncing out.
This motivates learning rate schedules—systematic ways to adjust the learning rate as training progresses.
Finding the Optimal Learning Rate
Before choosing a schedule, you need to find a good starting learning rate. The learning rate range test (Smith, 2017) is a systematic approach:
The Algorithm
- Start very small: Begin with a tiny learning rate (e.g., 1e-7)
- Increase exponentially: After each batch, multiply the learning rate by a constant factor
- Track the loss: Record the loss at each step
- Stop when loss explodes: End the test when loss becomes much larger than the minimum observed
- Analyze the curve: Plot loss vs. learning rate on a log scale
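The steps above can be sketched in a few lines. This simulation (illustrative, not from the text) sweeps the learning rate exponentially on a toy quadratic loss, measuring the loss after a single update from a fixed starting point, and stops once the loss blows past the best value seen:

```python
# Simulated LR range test on the toy loss f(x) = x^2.
def lr_range_test(lr_start=1e-5, lr_end=10.0, num_steps=100):
    history = []  # (lr, loss) pairs
    for step in range(num_steps):
        # exponential sweep from lr_start to lr_end
        lr = lr_start * (lr_end / lr_start) ** (step / (num_steps - 1))
        x = 1.0                # reset to the same starting point each time
        x -= lr * 2 * x        # one gradient step on f(x) = x^2
        loss = x ** 2
        history.append((lr, loss))
        # stop once the loss blows far past the best value seen so far
        if loss > 4 * min(l for _, l in history):
            break
    return history

history = lr_range_test()
best_lr, best_loss = min(history, key=lambda p: p[1])
print(f"suggested LR region: around {best_lr:.3g}")
```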
Interpreting the Results
The loss-vs-LR curve typically shows three regions:
- Flat region (low LR): Learning rate too small to make progress
- Decreasing region: The "sweet spot" where training works well
- Increasing region: Learning rate too high, training destabilizes
Where to Set the Learning Rate
A common rule of thumb: do not use the learning rate at which the loss is lowest — that point sits on the edge of instability. Instead, pick a value in the steeply decreasing region, often about one order of magnitude below the loss minimum.
Interactive: LR Range Test
Run a simulated learning rate range test below. Watch how loss changes as the learning rate increases exponentially, and observe the suggested optimal learning rate:
Quick Check
In a learning rate range test, where should you set your initial learning rate?
Learning Rate Schedules
A learning rate schedule systematically adjusts the learning rate η during training. Here are the most common approaches:
Step Decay
Reduce the learning rate by a factor γ every N epochs:

η_t = η₀ · γ^⌊t/N⌋
Example: Start at 0.1, reduce by 10× every 30 epochs: 0.1 → 0.01 → 0.001
Exponential Decay
Decay the learning rate by a constant factor γ each epoch:

η_t = η₀ · γ^t
This creates a smooth, continuous decrease rather than abrupt drops.
Cosine Annealing
Smoothly decrease the learning rate following a cosine curve from η_max to η_min:

η_t = η_min + ½ (η_max − η_min) (1 + cos(π t / T))

where T is the total number of epochs (or steps).
The cosine shape provides a gentle decrease that slows near the end, allowing fine-grained convergence.
Linear Decay
Simply interpolate linearly from the starting learning rate to the final one:

η_t = η₀ + (η_final − η₀) · t/T
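For comparison, here are the four schedules written as plain functions of the epoch t. The hyperparameters (initial LR 0.1, floor 1e-4, horizon 100 epochs) are illustrative assumptions, not values from the text:

```python
import math

ETA0, ETA_MIN, T = 0.1, 1e-4, 100

def step_decay(t, gamma=0.1, n=30):      # drop by gamma every n epochs
    return ETA0 * gamma ** (t // n)

def exp_decay(t, gamma=0.95):            # constant-factor decay per epoch
    return ETA0 * gamma ** t

def cosine(t):                           # cosine annealing from ETA0 to ETA_MIN
    return ETA_MIN + 0.5 * (ETA0 - ETA_MIN) * (1 + math.cos(math.pi * t / T))

def linear(t):                           # straight line from ETA0 to ETA_MIN
    return ETA0 + (ETA_MIN - ETA0) * t / T

for t in (0, 30, 60, 90):
    print(t, step_decay(t), exp_decay(t), cosine(t), linear(t))
```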
| Schedule | Pros | Cons | Best For |
|---|---|---|---|
| Step Decay | Simple, interpretable, proven | Abrupt changes, requires tuning N | Classic baselines, ResNets |
| Exponential | Smooth decay | Hard to control final LR | NLP, older architectures |
| Cosine Annealing | Smooth, effective, principled | More complex formula | Modern vision models, transformers |
| Linear | Simple, predictable | May be too aggressive early | Fine-tuning, short training runs |
Interactive: Schedule Explorer
Compare different learning rate schedules side by side. Toggle schedules on/off and adjust parameters to see how they affect the learning rate over time:
Quick Check
Which learning rate schedule is most commonly used in modern transformer training?
Warmup Strategies
Warmup is the practice of starting with a very low learning rate and gradually increasing it to the target value over the first few epochs or steps.
Why Warmup Helps
At the start of training:
- Model weights are randomly initialized—they're far from any reasonable solution
- Gradients can be large and noisy—initial estimates are unreliable
- High learning rates cause wild updates that may push weights into bad regions
Warmup allows the model to "find its footing" before taking larger optimization steps. This is especially important for:
- Large batch training: Bigger batches enable higher learning rates, but starting high causes instability
- Transformers: Self-attention can produce large gradients early on
- Adam optimizer: The adaptive moment estimates need time to stabilize
Common Warmup Strategies
Linear Warmup
Increase the learning rate linearly from 0 (or a small value) to the target over N steps:

η_t = η_target · t/N, for t ≤ N
Exponential Warmup
Increase more aggressively at first, then slow down as the learning rate approaches the target.
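Both warmup shapes can be expressed as multipliers applied to the target learning rate. A sketch (the warmup length and the exponential rate k are illustrative assumptions):

```python
import math

def linear_warmup(step, warmup_steps=1000):
    # 0 at step 0, rising in a straight line to 1.0 at warmup_steps
    return min(1.0, step / warmup_steps)

def exponential_warmup(step, warmup_steps=1000, k=5.0):
    # rises steeply at first, then flattens as it approaches 1.0
    return 1.0 - math.exp(-k * step / warmup_steps)

# LR used at a given step = target LR * multiplier
target_lr = 1e-3
print(target_lr * linear_warmup(100))       # early in warmup
print(target_lr * exponential_warmup(100))  # already much closer to target
```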
How Long Should Warmup Be?
There is no universal answer, but 5-10% of the total training steps is a common default; larger batches and less stable models generally benefit from longer warmup, while short fine-tuning runs need less.
Interactive: Warmup Demo
See how warmup stabilizes training compared to jumping straight to a high learning rate. Watch the loss curves and observe how the warmup phase prevents early instability:
Why warmup helps: At initialization, network weights are random and gradients can be large and noisy. Starting with a high learning rate can cause the optimizer to take huge steps, leading to unstable loss spikes. Warmup allows the model to find a reasonable region of parameter space before aggressive optimization begins.
Quick Check
What is the main purpose of learning rate warmup?
Cyclical Learning Rates
Cyclical Learning Rates (CLR) (Smith, 2017) take a different approach: instead of monotonically decreasing the learning rate, they oscillate it between minimum and maximum bounds.
The Key Insight
Traditional schedules assume we want to converge to a single minimum. But the loss landscape has many local minima, and some are better than others. By periodically increasing the learning rate, we can:
- Escape poor local minima: High learning rates let us jump out of shallow local minima
- Explore the loss landscape: Oscillations help us find flatter, more generalizable minima
- Regularize training: The noise from varying LR acts as implicit regularization
Common Cyclical Policies
Triangular
Linear ramp up and down between η_min and η_max:

η_t = η_min + (η_max − η_min) · max(0, 1 − x), with x = |t/s − 2·cycle + 1| and cycle = ⌊1 + t/(2s)⌋,

where s is the step size (half a cycle).
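The triangular policy can be written directly as a function of the iteration t (the η_min, η_max, and step_size values here are illustrative, not from the text):

```python
# Smith's triangular cyclical LR policy.
def triangular_lr(t, eta_min=0.001, eta_max=0.1, step_size=2000):
    cycle = 1 + t // (2 * step_size)          # which triangle we are in
    x = abs(t / step_size - 2 * cycle + 1)    # 1 at the base, 0 at the peak
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - x)

# One full cycle: base -> peak at t = step_size -> base at t = 2 * step_size
print(triangular_lr(0), triangular_lr(2000), triangular_lr(4000))
```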
Triangular2
Like triangular, but the amplitude (range between min and max) halves after each cycle. This combines exploration early on with convergence later.
One-Cycle Policy
The one-cycle policy (Smith, 2018) is particularly effective:
- Warmup (first ~30%): Increase LR from low to maximum
- Annealing (remaining ~70%): Decrease from maximum to very low (often 1/1000th of max)
This single, large cycle provides both exploration (high LR phase) and fine-grained convergence (annealing phase). It often achieves better results with fewer epochs than traditional schedules.
Super-Convergence
Smith and Topin (2018) showed that the one-cycle policy with an unusually high maximum learning rate can train some networks to the same accuracy in far fewer iterations than standard schedules — a phenomenon they named super-convergence.
Interactive: Cyclical LR
Experiment with different cyclical learning rate policies. Watch how the oscillating learning rate affects training dynamics:
Quick Check
What is the main advantage of cyclical learning rates over monotonic decay?
PyTorch Implementation
PyTorch provides built-in schedulers in torch.optim.lr_scheduler. Here's how to use the most common ones:
Step Decay
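A minimal sketch using PyTorch's built-in StepLR (the Linear model and empty epoch body are stand-ins for a real training loop):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # LR x0.1 every 30 epochs

for epoch in range(90):
    # ... train for one epoch ...
    scheduler.step()  # per-epoch scheduler
# LR used: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89
```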
Cosine Annealing
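A minimal CosineAnnealingLR sketch (model and training body are stand-ins; T_max is the number of epochs over which to anneal):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    # ... train for one epoch ...
    scheduler.step()
# The LR has followed a cosine curve from 0.1 down to eta_min = 1e-6
```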
Cosine with Warm Restarts
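CosineAnnealingWarmRestarts anneals over T_0 epochs, then resets the learning rate to the maximum and repeats, with each cycle T_mult times longer. A sketch (hyperparameters are illustrative):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# First cycle lasts T_0 = 10 epochs; each later cycle is T_mult = 2x longer
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(10):
    # ... train for one epoch ...
    scheduler.step()
# After 10 epochs the schedule restarts: the LR jumps back to the max (0.1)
```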
Linear Warmup + Cosine Decay (Transformers Style)
```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    """Creates a schedule with linear warmup then cosine decay."""

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # Linear warmup
            return float(current_step) / float(max(1, num_warmup_steps))
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return LambdaLR(optimizer, lr_lambda)

# Usage
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=10000,
)

for step, batch in enumerate(train_loader):
    loss = train_step(model, batch, optimizer)
    scheduler.step()  # Call after each step, not each epoch!
```

One-Cycle Policy
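A sketch of the one-cycle policy with PyTorch's built-in OneCycleLR (total_steps and max_lr are illustrative, and the per-batch training call is omitted):

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 1)
# Note: OneCycleLR manages the LR itself, starting at max_lr / div_factor
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=1000, pct_start=0.3)

peak = 0.0
for step in range(999):
    # ... train_step(model, batch, optimizer) ...
    peak = max(peak, optimizer.param_groups[0]["lr"])
    scheduler.step()  # per-batch scheduler
# peak reached max_lr at ~30% of training; the final LR is far below it
```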
Epoch vs. Step Scheduling
- StepLR, CosineAnnealingLR: call scheduler.step() after each epoch
- OneCycleLR, LambdaLR for warmup: call scheduler.step() after each batch/step
Practical Guidelines
Choosing a Schedule
| Scenario | Recommended Schedule |
|---|---|
| Training from scratch (CNNs) | Step decay or cosine annealing |
| Training transformers | Linear warmup + cosine decay |
| Limited training budget | One-cycle policy |
| Transfer learning / fine-tuning | Linear warmup + constant or linear decay |
| Hyperparameter search | Constant (to isolate other effects) |
Key Hyperparameters
- Initial learning rate: Use the LR range test. Common starting points: 0.1 for SGD, 0.001 for Adam on CNNs, 5e-5 for transformer fine-tuning.
- Warmup steps: 5-10% of total steps is a good default. More warmup for larger batches or more unstable models.
- Decay factor (gamma): For step decay, 0.1 (10× reduction) is common. For exponential, 0.95-0.99 per epoch.
- Final learning rate: Typically 1/100 to 1/1000 of the initial rate. Too high may prevent fine convergence.
Common Mistakes
- Calling scheduler at wrong frequency: Some are per-epoch, others per-step
- Forgetting warmup for transformers: Large models often require warmup for stability
- Using the same schedule for fine-tuning: Fine-tuning typically needs lower LR and less aggressive schedules
- Not adapting to batch size: Larger batches can support higher learning rates
Linear Scaling Rule
When you multiply the batch size by k, multiply the learning rate by k as well (Goyal et al., 2017). Pair this with warmup: very large batches usually need a longer warmup phase to remain stable.
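The linear scaling rule reduces to a one-line helper (base_lr and base_batch here are illustrative reference values, not from the text):

```python
# Scale the LR in proportion to the batch size.
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    return base_lr * batch_size / base_batch

print(scaled_lr(256))   # reference setting
print(scaled_lr(1024))  # 4x the batch -> 4x the learning rate
```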
Summary
Learning rate management is crucial for efficient and effective neural network training. Here's what we covered:
| Concept | Key Point |
|---|---|
| Learning Rate | Controls optimization step size; too high causes instability, too low wastes time |
| LR Range Test | Systematically find optimal starting LR by increasing exponentially and watching loss |
| Step Decay | Reduce LR by factor γ every N epochs; simple and proven |
| Cosine Annealing | Smooth decrease following cosine curve; modern default for many tasks |
| Warmup | Start low and increase gradually; essential for large batches and transformers |
| Cyclical LR | Oscillate between min and max; helps escape local minima |
| One-Cycle | Single cycle of warmup + annealing; can achieve super-convergence |
PyTorch Scheduler Quick Reference
```python
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR, LambdaLR

# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (step after each batch!)
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=num_batches)

# Custom schedule with warmup
scheduler = LambdaLR(optimizer, lr_lambda=custom_schedule_function)
```

Exercises
Conceptual Questions
- Explain why the optimal learning rate changes during training. What happens if you use the same (high) learning rate throughout?
- Why is warmup particularly important for training transformers? What would happen if you started with a high learning rate immediately?
- In the one-cycle policy, why does the warmup phase act as regularization? How does this enable "super-convergence"?
- Compare step decay and cosine annealing. In what situations might you prefer one over the other?
Solution Hints
- Q1: Early training: far from optimum, can take larger steps. Later: near optimum, need small precise steps. Constant high LR causes overshooting/oscillation near the minimum.
- Q2: Transformers have self-attention which can produce large gradients early on. High initial LR causes exploding updates. Warmup lets the adaptive optimizer (Adam) accumulate stable statistics.
- Q3: High LR during warmup prevents settling into sharp minima (which overfit). The optimizer can only converge to minima it doesn't "bounce out of," favoring flatter, more generalizable solutions.
- Q4: Step decay is simpler and has fewer hyperparameters. Cosine provides smoother transitions that may work better for modern architectures. Step decay works well when you know specific epochs where performance plateaus.
Coding Exercises
- Implement an LR finder: Write a function that performs the learning rate range test on a model, returning the loss vs. LR data. Plot the results and identify the optimal LR.
- Compare schedules: Train the same model with step decay, cosine annealing, and one-cycle. Plot the learning rate and loss curves for each. Which converges fastest?
- Custom warmup scheduler: Implement a scheduler that does linear warmup for 10% of steps, then applies your choice of decay. Use LambdaLR with a custom function.
- Batch size scaling: Train a model with batch size 32 and LR 0.01. Then try batch size 128 with scaled LR 0.04. Compare training curves. Does the linear scaling rule hold?
Coding Exercise Hints
- Exercise 1: Use an exponential increase: lr = initial_lr * (final_lr/initial_lr)**(step/num_steps). Use a small model for speed.
- Exercise 2: Log LR at each step using scheduler.get_last_lr(). Run for the same number of epochs with each schedule.
- Exercise 3: Define lr_lambda(step) that returns a multiplier. If step < warmup_steps, return step/warmup_steps.
- Exercise 4: The rule is approximate. You may need to tune warmup duration differently for larger batches.
In the next section, we'll explore regularization techniques—methods like dropout, weight decay, and data augmentation that help neural networks generalize better to unseen data.