Chapter 9

Learning Rate Strategies

Training Neural Networks

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the critical role of learning rate: See how the learning rate controls the speed and stability of training, and why getting it right is often the most important hyperparameter decision
  2. Find optimal learning rates: Use the learning rate range test to systematically find the best starting learning rate for your model
  3. Implement learning rate schedules: Apply step decay, exponential decay, cosine annealing, and other scheduling strategies in PyTorch
  4. Apply warmup strategies: Understand why warmup improves training stability and implement it effectively
  5. Use cyclical learning rates: Leverage oscillating learning rates to escape local minima and achieve better generalization
Why This Matters: The learning rate is often called the most important hyperparameter in deep learning. A well-chosen learning rate schedule can be the difference between a model that converges in hours versus days, or between 90% and 95% accuracy. Mastering learning rate strategies is essential for training models efficiently.

Why Learning Rate Matters

The learning rate $\eta$ controls the step size in gradient descent:

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}$$

This single number has a profound impact on training dynamics:

| Learning Rate | Effect | Symptoms |
|---|---|---|
| Too high | Overshoots optima, oscillates wildly | Loss spikes, NaN values, training fails |
| Too low | Crawls toward optimum | Days/weeks to converge, stuck in local minima |
| Just right | Converges efficiently | Smooth loss curve, reaches good minimum |
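To see these three regimes concretely, here is a minimal sketch: plain gradient descent on the toy loss $f(w) = w^2$ (gradient $2w$), run with a too-low, a reasonable, and a too-high learning rate. The specific values are illustrative, not recommendations:

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimize the toy loss f(w) = w^2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # theta_{t+1} = theta_t - eta * grad
    return w

# Too low: barely moves. Just right: converges. Too high: diverges.
for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

With lr=1.1 the update multiplies $w$ by $(1 - 2 \cdot 1.1) = -1.2$ each step, so the iterate oscillates in sign while growing in magnitude — exactly the "bouncing off loss landscape walls" behavior described above.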

The Goldilocks Problem

Finding the right learning rate is a balancing act:

  • Too high: The optimizer takes huge steps, overshooting valleys and bouncing off loss landscape walls. The loss oscillates or explodes.
  • Too low: The optimizer makes tiny, timid steps. Training takes forever and may get stuck in poor local minima because it lacks the momentum to escape.
  • Just right: The optimizer moves confidently toward good solutions, fast enough to be practical but controlled enough to converge.

The Changing Landscape

Here's the key insight: the optimal learning rate changes during training. Early in training, when the model is far from any optimum, you can take larger steps without overshooting. Later, when approaching a minimum, you need smaller, more precise steps to settle into the optimum without bouncing out.

This motivates learning rate schedules—systematic ways to adjust the learning rate as training progresses.


Finding the Optimal Learning Rate

Before choosing a schedule, you need to find a good starting learning rate. The learning rate range test (Smith, 2017) is a systematic approach:

The Algorithm

  1. Start very small: Begin with a tiny learning rate (e.g., $10^{-7}$)
  2. Increase exponentially: After each batch, multiply the learning rate by a constant factor
  3. Track the loss: Record the loss at each step
  4. Stop when loss explodes: End the test when loss becomes much larger than the minimum observed
  5. Analyze the curve: Plot loss vs. learning rate on a log scale
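The loop itself is short. Below is a minimal, self-contained sketch of steps 1–5; since there is no real model here, a single parameter minimizing $w^2$ stands in for the training step, and the stopping threshold and toy loss are illustrative choices, not part of Smith's recipe:

```python
def lr_range_test(initial_lr=1e-7, final_lr=10.0, num_steps=100):
    """Sweep the LR exponentially, recording the loss at each step.

    A single parameter minimizing loss = w^2 stands in for a real model;
    the curve still shows the flat -> decreasing -> exploding regions.
    """
    factor = (final_lr / initial_lr) ** (1.0 / num_steps)  # per-batch multiplier
    lr, w, history = initial_lr, 5.0, []
    best_loss = float("inf")
    for _ in range(num_steps):
        w = w - lr * 2 * w            # one "training step"
        loss = w * w
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss and loss > 1e3:  # stop once the loss explodes
            break
        lr *= factor
    return history

history = lr_range_test()
# Heuristic: take the LR at the lowest loss, then back off ~10x for headroom.
suggested = min(history, key=lambda p: p[1])[0] / 10
print(f"suggested starting LR ~ {suggested:.2e}")
```

In practice you would replace the toy update with a real forward/backward pass over one batch, and smooth the recorded losses before picking the suggested LR.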

Interpreting the Results

The loss-vs-LR curve typically shows three regions:

  • Flat region (low LR): Learning rate too small to make progress
  • Decreasing region: The "sweet spot" where training works well
  • Increasing region: Learning rate too high, training destabilizes

Where to Set the Learning Rate

A common heuristic is to choose the learning rate where loss is decreasing fastest—typically about 10× lower than the rate where loss starts to increase. This gives you headroom before instability.

Interactive: LR Range Test

Run a simulated learning rate range test below. Watch how loss changes as the learning rate increases exponentially, and observe the suggested optimal learning rate:

Quick Check

In a learning rate range test, where should you set your initial learning rate?


Learning Rate Schedules

A learning rate schedule systematically adjusts $\eta$ during training. Here are the most common approaches:

Step Decay

Reduce learning rate by a factor $\gamma$ every $N$ epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / N \rfloor}$$

Example: Start at 0.1, reduce by 10× every 30 epochs: 0.1 → 0.01 → 0.001

Exponential Decay

Decay the learning rate by a constant factor each epoch:

$$\eta_t = \eta_0 \cdot \gamma^t$$

This creates a smooth, continuous decrease rather than abrupt drops.

Cosine Annealing

Smoothly decrease the learning rate following a cosine curve from $\eta_0$ to $\eta_{min}$:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})\left(1 + \cos\left(\frac{t \pi}{T}\right)\right)$$

The cosine shape provides a gentle decrease that slows near the end, allowing fine-grained convergence.

Linear Decay

Simply interpolate linearly from start to end:

$$\eta_t = \eta_0 - (\eta_0 - \eta_{min}) \cdot \frac{t}{T}$$

| Schedule | Pros | Cons | Best For |
|---|---|---|---|
| Step Decay | Simple, interpretable, proven | Abrupt changes, requires tuning N | Classic baselines, ResNets |
| Exponential | Smooth decay | Hard to control final LR | NLP, older architectures |
| Cosine Annealing | Smooth, effective, principled | More complex formula | Modern vision models, transformers |
| Linear | Simple, predictable | May be too aggressive early | Fine-tuning, short training runs |
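The four formulas above translate directly into code. This sketch uses the example hyperparameters from this section ($\eta_0 = 0.1$, $\gamma = 0.1$, $N = 30$, $T = 100$); all defaults are illustrative:

```python
import math

def step_decay(t, lr0=0.1, gamma=0.1, N=30):
    """Drop the LR by a factor gamma every N epochs."""
    return lr0 * gamma ** (t // N)

def exponential_decay(t, lr0=0.1, gamma=0.95):
    """Multiply the LR by gamma every epoch."""
    return lr0 * gamma ** t

def cosine_annealing(t, lr0=0.1, lr_min=1e-6, T=100):
    """Follow a cosine curve from lr0 down to lr_min over T epochs."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))

def linear_decay(t, lr0=0.1, lr_min=1e-6, T=100):
    """Interpolate linearly from lr0 to lr_min over T epochs."""
    return lr0 - (lr0 - lr_min) * t / T

# Step decay drops abruptly at epochs 30, 60, 90; cosine glides down smoothly.
for t in [0, 30, 60, 99]:
    print(f"epoch {t:3d}  step={step_decay(t):.4g}  cosine={cosine_annealing(t):.4g}")
```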

Interactive: Schedule Explorer

Compare different learning rate schedules side by side. Toggle schedules on/off and adjust parameters to see how they affect the learning rate over time:


Quick Check

Which learning rate schedule is most commonly used in modern transformer training?


Warmup Strategies

Warmup is the practice of starting with a very low learning rate and gradually increasing it to the target value over the first few epochs or steps.

Why Warmup Helps

At the start of training:

  • Model weights are randomly initialized—they're far from any reasonable solution
  • Gradients can be large and noisy—initial estimates are unreliable
  • High learning rates cause wild updates that may push weights into bad regions

Warmup allows the model to "find its footing" before taking larger optimization steps. This is especially important for:

  • Large batch training: Bigger batches enable higher learning rates, but starting high causes instability
  • Transformers: Self-attention can produce large gradients early on
  • Adam optimizer: The adaptive moment estimates need time to stabilize

Common Warmup Strategies

Linear Warmup

Increase learning rate linearly from 0 (or a small value) to the target over $W$ steps:

$$\eta_t = \eta_{target} \cdot \frac{t}{W}, \quad t < W$$

Exponential Warmup

Increase more aggressively at first, then slow down as approaching target:

$$\eta_t = \eta_{target} \cdot (1 - e^{-\alpha t})$$

How Long Should Warmup Be?

A common rule of thumb is to use warmup for 5-10% of total training steps. For example, if training for 10,000 steps, use 500-1,000 warmup steps. Some practitioners use 1-2 epochs for image classification.
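Both warmup variants above are one-liners. In this sketch, the target LR, $W$, and $\alpha$ values are illustrative, with $W = 500$ matching the 5% rule of thumb for a 10,000-step run:

```python
import math

def linear_warmup(t, target_lr=1e-3, W=500):
    """Ramp linearly from 0 to target_lr over W steps, then hold."""
    return target_lr * min(t, W) / W

def exponential_warmup(t, target_lr=1e-3, alpha=0.01):
    """Approach target_lr via 1 - e^(-alpha*t): fast at first, slower near the target."""
    return target_lr * (1 - math.exp(-alpha * t))

print(f"{linear_warmup(250):.1e}")  # halfway through warmup -> half the target LR
```

After step $W$ a real schedule would hand off to a decay phase; the warmup-plus-cosine scheduler in the PyTorch section below composes exactly these two pieces.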

Interactive: Warmup Demo

See how warmup stabilizes training compared to jumping straight to a high learning rate. Watch the loss curves and observe how the warmup phase prevents early instability:

Why warmup helps: At initialization, network weights are random and gradients can be large and noisy. Starting with a high learning rate can cause the optimizer to take huge steps, leading to unstable loss spikes. Warmup allows the model to find a reasonable region of parameter space before aggressive optimization begins.

Quick Check

What is the main purpose of learning rate warmup?


Cyclical Learning Rates

Cyclical Learning Rates (CLR) (Smith, 2017) take a different approach: instead of monotonically decreasing the learning rate, they oscillate it between minimum and maximum bounds.

The Key Insight

Traditional schedules assume we want to converge to a single minimum. But the loss landscape has many local minima, and some are better than others. By periodically increasing the learning rate, we can:

  • Escape poor local minima: High learning rates let us jump out of shallow local minima
  • Explore the loss landscape: Oscillations help us find flatter, more generalizable minima
  • Regularize training: The noise from varying LR acts as implicit regularization

Common Cyclical Policies

Triangular

Linear ramp up and down between $\eta_{min}$ and $\eta_{max}$:

$$\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \left|1 - \left|\frac{t}{S} \bmod 2 - 1\right|\right|$$

where $S$ is the step size (half a cycle).

Triangular2

Like triangular, but the amplitude (range between min and max) halves after each cycle. This combines exploration early on with convergence later.
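Both policies can be sketched in a few lines; the $S$, $\eta_{min}$, and $\eta_{max}$ defaults below are illustrative:

```python
def triangular(t, lr_min=1e-4, lr_max=1e-2, S=50):
    """Triangular CLR: linear ramp up then down, S steps per half-cycle."""
    x = (t / S) % 2               # position within the cycle, in [0, 2)
    frac = 1 - abs(x - 1)         # rises 0 -> 1, then falls 1 -> 0
    return lr_min + (lr_max - lr_min) * frac

def triangular2(t, lr_min=1e-4, lr_max=1e-2, S=50):
    """Like triangular, but the amplitude halves after each full cycle."""
    cycle = int(t // (2 * S))                     # completed full cycles
    amplitude = (lr_max - lr_min) / 2 ** cycle
    x = (t / S) % 2
    return lr_min + amplitude * (1 - abs(x - 1))
```

With `S=50`, `triangular` peaks at $\eta_{max}$ on steps 50, 150, 250, …; `triangular2` peaks at $\eta_{min} + (\eta_{max}-\eta_{min})/2$ on its second cycle, then at a quarter of the range, and so on.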

One-Cycle Policy

The one-cycle policy (Smith, 2018) is particularly effective:

  1. Warmup (first ~30%): Increase LR from low to maximum
  2. Annealing (remaining ~70%): Decrease from maximum to very low (often 1/1000th of max)

This single, large cycle provides both exploration (high LR phase) and fine-grained convergence (annealing phase). It often achieves better results with fewer epochs than traditional schedules.

Super-Convergence

The one-cycle policy can enable super-convergence—training to the same accuracy in 10× fewer epochs by using learning rates that would normally be considered too high. This happens because the high LR phase acts as strong regularization.

Interactive: Cyclical LR

Experiment with different cyclical learning rate policies. Watch how the oscillating learning rate affects training dynamics:


Quick Check

What is the main advantage of cyclical learning rates over monotonic decay?


PyTorch Implementation

PyTorch provides built-in schedulers in torch.optim.lr_scheduler. Here's how to use the most common ones:

Step Decay

Step Decay Scheduler

🐍step_lr.py

```python
from torch.optim.lr_scheduler import StepLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Reduce LR every 30 epochs
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")
```

StepLR parameters: `step_size=30` reduces the LR every 30 epochs, and `gamma=0.1` multiplies the LR by 0.1 (divides by 10) at each step, so the LR goes 0.1 → 0.01 → 0.001.

`scheduler.step()` must be called after each epoch (or after each batch for some schedulers); it updates the learning rate according to the schedule.

Cosine Annealing

Cosine Annealing Scheduler

🐍cosine_lr.py

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

`T_max` is the total number of epochs and `eta_min` is the minimum learning rate at the end of the schedule; the LR follows a cosine curve from the initial LR down to `eta_min`.

Cosine with Warm Restarts

Cosine Annealing with Warm Restarts

🐍cosine_restarts.py

```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Restarts happen automatically at cycle boundaries
```

`T_0=10` means the first cycle is 10 epochs; `T_mult=2` makes each subsequent cycle 2× longer than the previous (10, 20, 40 epochs).

Linear Warmup + Cosine Decay (Transformers Style)

🐍warmup_cosine.py

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    """Creates a schedule with linear warmup then cosine decay."""

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # Linear warmup
            return float(current_step) / float(max(1, num_warmup_steps))
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return LambdaLR(optimizer, lr_lambda)

# Usage
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=10000,
)

for step, batch in enumerate(train_loader):
    loss = train_step(model, batch, optimizer)
    scheduler.step()  # Call after each step, not each epoch!
```

One-Cycle Policy

One-Cycle Learning Rate

🐍one_cycle.py

```python
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=100,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # Warmup for 30% of training
    anneal_strategy='cos',
)

for epoch in range(100):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        scheduler.step()  # Call after EVERY batch
```

OneCycleLR needs to know the total number of training steps upfront; provide `steps_per_epoch=len(train_loader)` so it can calculate the schedule correctly. `pct_start=0.3` means 30% of training is warmup (increasing to `max_lr`) and the remaining 70% is annealing (decreasing toward ~0). Unlike epoch-based schedulers, OneCycleLR must be stepped after every batch, not every epoch.

Epoch vs. Step Scheduling

Different schedulers expect different calling patterns:
  • StepLR, CosineAnnealingLR: Call scheduler.step() after each epoch
  • OneCycleLR, LambdaLR for warmup: Call scheduler.step() after each batch/step
Check the documentation to ensure you're calling at the right frequency!

Practical Guidelines

Choosing a Schedule

| Scenario | Recommended Schedule |
|---|---|
| Training from scratch (CNNs) | Step decay or cosine annealing |
| Training transformers | Linear warmup + cosine decay |
| Limited training budget | One-cycle policy |
| Transfer learning / fine-tuning | Linear warmup + constant or linear decay |
| Hyperparameter search | Constant (to isolate other effects) |

Key Hyperparameters

  1. Initial learning rate: Use the LR range test. Common starting points: 0.1 for SGD, 0.001 for Adam on CNNs, 5e-5 for transformer fine-tuning.
  2. Warmup steps: 5-10% of total steps is a good default. More warmup for larger batches or more unstable models.
  3. Decay factor (gamma): For step decay, 0.1 (10× reduction) is common. For exponential, 0.95-0.99 per epoch.
  4. Final learning rate: Typically 1/100 to 1/1000 of the initial rate. Too high may prevent fine convergence.

Common Mistakes

  • Calling scheduler at wrong frequency: Some are per-epoch, others per-step
  • Forgetting warmup for transformers: Large models often require warmup for stability
  • Using the same schedule for fine-tuning: Fine-tuning typically needs lower LR and less aggressive schedules
  • Not adapting to batch size: Larger batches can support higher learning rates

Linear Scaling Rule

When increasing the batch size by a factor $k$, you can often scale the learning rate by the same factor. For example, if batch size 32 works with LR 0.1, batch size 128 (4× larger) may work with LR 0.4. Always use warmup when applying this rule.
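As a quick sketch of the arithmetic:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the LR proportionally to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# 4x the batch size -> 4x the learning rate (combine with warmup)
print(scaled_lr(0.1, 32, 128))
```

Treat the result as a starting point, not a guarantee: the rule breaks down at very large batch sizes, which is part of why warmup matters here.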

Summary

Learning rate management is crucial for efficient and effective neural network training. Here's what we covered:

| Concept | Key Point |
|---|---|
| Learning Rate | Controls optimization step size; too high causes instability, too low wastes time |
| LR Range Test | Systematically find optimal starting LR by increasing exponentially and watching loss |
| Step Decay | Reduce LR by factor γ every N epochs; simple and proven |
| Cosine Annealing | Smooth decrease following cosine curve; modern default for many tasks |
| Warmup | Start low and increase gradually; essential for large batches and transformers |
| Cyclical LR | Oscillate between min and max; helps escape local minima |
| One-Cycle | Single cycle of warmup + annealing; can achieve super-convergence |

PyTorch Scheduler Quick Reference

🐍quick_reference.py

```python
# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (step after each batch!)
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=num_batches)

# Custom schedule with warmup
scheduler = LambdaLR(optimizer, lr_lambda=custom_schedule_function)
```

Exercises

Conceptual Questions

  1. Explain why the optimal learning rate changes during training. What happens if you use the same (high) learning rate throughout?
  2. Why is warmup particularly important for training transformers? What would happen if you started with a high learning rate immediately?
  3. In the one-cycle policy, why does the warmup phase act as regularization? How does this enable "super-convergence"?
  4. Compare step decay and cosine annealing. In what situations might you prefer one over the other?

Solution Hints

  1. Q1: Early training: far from optimum, can take larger steps. Later: near optimum, need small precise steps. Constant high LR causes overshooting/oscillation near the minimum.
  2. Q2: Transformers have self-attention which can produce large gradients early on. High initial LR causes exploding updates. Warmup lets the adaptive optimizer (Adam) accumulate stable statistics.
  3. Q3: High LR during warmup prevents settling into sharp minima (which overfit). The optimizer can only converge to minima it doesn't "bounce out of," favoring flatter, more generalizable solutions.
  4. Q4: Step decay is simpler and has fewer hyperparameters. Cosine provides smoother transitions that may work better for modern architectures. Step decay works well when you know specific epochs where performance plateaus.

Coding Exercises

  1. Implement an LR finder: Write a function that performs the learning rate range test on a model, returning the loss vs. LR data. Plot the results and identify the optimal LR.
  2. Compare schedules: Train the same model with step decay, cosine annealing, and one-cycle. Plot the learning rate and loss curves for each. Which converges fastest?
  3. Custom warmup scheduler: Implement a scheduler that does linear warmup for 10% of steps, then applies your choice of decay. Use LambdaLR with a custom function.
  4. Batch size scaling: Train a model with batch size 32 and LR 0.01. Then try batch size 128 with scaled LR 0.04. Compare training curves. Does the linear scaling rule hold?

Coding Exercise Hints

  • Exercise 1: Use an exponential increase: lr = initial_lr * (final_lr/initial_lr)**(step/num_steps). Use a small model for speed.
  • Exercise 2: Log LR at each step using scheduler.get_last_lr(). Run for the same number of epochs with each schedule.
  • Exercise 3: Define lr_lambda(step) that returns a multiplier. If step < warmup_steps, return step/warmup_steps.
  • Exercise 4: The rule is approximate. You may need to tune warmup duration differently for larger batches.

In the next section, we'll explore regularization techniques—methods like dropout, weight decay, and data augmentation that help neural networks generalize better to unseen data.