Learning Objectives
By the end of this section, you will:
- Understand why warmup prevents early training failure
- Compare linear and exponential warmup strategies
- Choose appropriate warmup duration for your model
- Implement warmup schedulers in PyTorch
- Diagnose warmup-related training issues
Why This Matters: Large learning rates at the start of training can cause gradient explosion, NaN losses, or convergence to poor local minima. Warmup starts with a tiny learning rate and gradually increases it, giving the optimizer time to estimate gradient statistics before taking large steps.
Why Warmup is Necessary
Several factors make early training particularly fragile.
The Cold Start Problem
At initialization, Adam's moment estimates are unreliable:
- First moment (m): Initialized to 0, takes ~10 steps to stabilize
- Second moment (v): Initialized to 0, takes ~100 steps to stabilize
- Bias correction: Helps but doesn't fully compensate
Large learning rates with inaccurate moment estimates cause erratic updates.
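To make the cold-start problem concrete, here is a small sketch (plain Python, no training) of Adam's bias-correction factors 1/(1 − β^t) with the PyTorch defaults β₁ = 0.9, β₂ = 0.999. The factors are enormous at t = 1 and only approach 1 as the moment estimates stabilize:

```python
# Adam's moment estimates start at 0; the bias correction 1/(1 - beta^t)
# is huge at t=1 and decays toward 1 as the estimates stabilize.
beta1, beta2 = 0.9, 0.999  # PyTorch defaults

for t in [1, 10, 100, 1000]:
    c1 = 1 / (1 - beta1 ** t)  # first-moment (m) correction
    c2 = 1 / (1 - beta2 ** t)  # second-moment (v) correction
    print(f"t={t:4d}  m-correction={c1:10.3f}  v-correction={c2:10.3f}")
```

Note that the second-moment correction is still above 10 at t = 100, matching the "~100 steps to stabilize" rule of thumb above.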
Gradient Magnitude at Initialization
At initialization the loss is far from any minimum, so gradients tend to be large and poorly scaled. A large learning rate multiplied by large early gradients produces oversized parameter updates, which is exactly what warmup avoids.
When Warmup is Most Important
| Scenario | Warmup Need | Reason |
|---|---|---|
| Large batch size | High | Gradient variance lower, can overshoot |
| Deep networks | High | Gradient flow unstable early |
| Transformers/Attention | Critical | Attention weights highly sensitive |
| Small batch size | Moderate | High variance provides implicit regularization |
| Transfer learning | Low | Weights already reasonable |
AMNL Uses Warmup
Our CNN-BiLSTM-Attention model includes attention layers, making warmup important. We use 5-10 epochs of warmup for stable training.
Warmup Strategies
There are several ways to increase the learning rate during warmup.
Linear Warmup
The most common approach: increase the learning rate linearly from 0 to the target:

η(t) = η_target × t / T_warmup

Where:
- η(t): Learning rate at step t
- η_target: Target learning rate (e.g., 1e-3)
- T_warmup: Warmup duration (steps or epochs)
```
Linear Warmup Schedule:

LR
 η ─┤         ────────────
    │        ╱
    │       ╱
    │      ╱
    │     ╱
    │    ╱
    │   ╱
    │  ╱
 0 ─┼─╱─────────────────────────────
    └──┬──────┬─────────────────────
       0   T_warmup          Epochs

Linear: η(t) = η_target × t / T_warmup
```

Exponential Warmup
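As a concrete sketch, linear warmup expressed as a plain Python function (the function name and argument names are illustrative):

```python
def linear_warmup_lr(step, target_lr, warmup_steps):
    """eta(t) = eta_target * t / T_warmup during warmup, then eta_target."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps

# LR ramps linearly: 0 at step 0, half the target at the midpoint,
# and the full target once warmup completes.
print(linear_warmup_lr(0, 1e-3, 1000))     # 0.0
print(linear_warmup_lr(500, 1e-3, 1000))   # 0.0005
print(linear_warmup_lr(1000, 1e-3, 1000))  # 0.001
```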
Start very small and grow exponentially:

η(t) = η_target × (η₀ / η_target)^(1 − t / T_warmup)

where η₀ is the starting learning rate (e.g., 1e-7). At t = 0 this gives η₀, and at t = T_warmup it reaches η_target.
Exponential warmup spends more time at low learning rates, which can be beneficial for very sensitive models.
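A sketch of the exponential variant (the default start value of 1e-7 is an illustrative choice, matching the configuration used later in this section):

```python
def exponential_warmup_lr(step, target_lr, warmup_steps, start_lr=1e-7):
    """eta(t) = eta_target * (eta_0 / eta_target) ** (1 - t / T_warmup)."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (start_lr / target_lr) ** (1 - step / warmup_steps)

# Starts at start_lr and spends most of warmup at low learning rates:
# the midpoint is the geometric mean of start and target.
print(exponential_warmup_lr(0, 1e-3, 1000))    # ~1e-07
print(exponential_warmup_lr(500, 1e-3, 1000))  # ~1e-05
```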
Gradual Warmup (Recommended)
Start from a small non-zero value:

η(t) = η₀ + (η_target − η₀) × t / T_warmup

With η₀ = 0.1 × η_target, this provides a gentler start than pure linear warmup.
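A sketch of the gradual variant; the 0.1 start fraction is an illustrative default:

```python
def gradual_warmup_lr(step, target_lr, warmup_steps, start_fraction=0.1):
    """eta(t) = eta_0 + (eta_target - eta_0) * t / T_warmup."""
    if step >= warmup_steps:
        return target_lr
    start_lr = start_fraction * target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# Begins at 10% of the target rather than 0, then climbs linearly.
print(gradual_warmup_lr(0, 1e-3, 1000))    # 0.0001
print(gradual_warmup_lr(500, 1e-3, 1000))  # 0.00055
```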
Comparison
| Strategy | Formula | Best For |
|---|---|---|
| Linear | η × t / T | Most cases (recommended) |
| Exponential | η × (η₀/η)^(1-t/T) | Very sensitive models |
| Gradual | η₀ + (η - η₀) × t / T | Balanced approach |
| Constant then switch | η₀ then η | Simple, less smooth |
Warmup Duration Selection
How long should warmup last?
Rules of Thumb
| Training Length | Warmup Duration | Warmup % |
|---|---|---|
| 100 epochs | 5-10 epochs | 5-10% |
| 50 epochs | 3-5 epochs | 6-10% |
| 200 epochs | 10-20 epochs | 5-10% |
Batch-Based vs. Epoch-Based
Warmup can be specified in steps (batches) or epochs:
| Method | Advantage | Disadvantage |
|---|---|---|
| Steps | Consistent across batch sizes | Must recalculate for different datasets |
| Epochs | Intuitive, dataset-independent | Varies with batch size |
Epoch-Based for Simplicity
We recommend epoch-based warmup for its simplicity. 5-10% of total epochs is a good starting point. For 100 epochs, use 5-10 epochs of warmup.
AMNL Warmup Configuration
| Parameter | Value |
|---|---|
| Total epochs | 100 |
| Warmup epochs | 5 |
| Warmup percentage | 5% |
| Warmup strategy | Linear |
| Start LR | 1e-7 (effectively 0) |
| Target LR | 1e-3 |
Implementation
Our research implementation uses a simple but effective warmup approach that integrates cleanly with the training loop.
AMNL Research Implementation
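The original listing is not reproduced here; below is a minimal sketch consistent with the configuration table above (5 warmup epochs, start LR 1e-7, target LR 1e-3), implemented with PyTorch's `torch.optim.lr_scheduler.LambdaLR`. The helper name `warmup_factor` and the placeholder model are illustrative:

```python
import torch
from torch import nn

model = nn.Linear(16, 1)  # placeholder for the CNN-BiLSTM-Attention model
target_lr = 1e-3
start_lr = 1e-7
warmup_epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=target_lr)

def warmup_factor(epoch):
    # Linear ramp from start_lr/target_lr to 1.0 over warmup_epochs,
    # then hold at 1.0 (a decay schedule takes over post-warmup).
    if epoch >= warmup_epochs:
        return 1.0
    start = start_lr / target_lr
    return start + (1.0 - start) * epoch / warmup_epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for epoch in range(warmup_epochs):
    # ... one training epoch would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # reaches the target after warmup
```

`LambdaLR` multiplies the optimizer's base LR by the returned factor, so the schedule starts at 1e-7 and reaches 1e-3 after epoch 5.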
Integration with Training Loop
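A sketch of epoch-based warmup integrated into a standard training loop, setting the learning rate manually at the start of each epoch (the gradual 0.1 start fraction matches the visualization code below; the tiny model and random batch are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
target_lr = 1e-3
warmup_epochs, total_epochs = 5, 20  # shortened for illustration

lr_history = []
for epoch in range(total_epochs):
    # Epoch-based warmup: set the LR once at the start of each epoch.
    if epoch < warmup_epochs:
        lr = target_lr * (0.1 + 0.9 * epoch / warmup_epochs)
    else:
        lr = target_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
    lr_history.append(lr)

    # ... one pass over the training data would go here ...
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```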
Warmup + ReduceLROnPlateau
The key insight is that ReduceLROnPlateau should only start monitoring after warmup completes. During warmup, learning rate changes are expected and should not trigger plateau detection.
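One way to realize this is to run warmup manually and only call `plateau.step()` once warmup has finished, as in this sketch (constant placeholder validation loss; values mirror the configuration above):

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder model
target_lr = 1e-3
warmup_epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=target_lr)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10)

for epoch in range(15):
    if epoch < warmup_epochs:
        # Manual warmup: plateau detection is skipped entirely.
        for g in optimizer.param_groups:
            g["lr"] = target_lr * (0.1 + 0.9 * epoch / warmup_epochs)
    elif epoch == warmup_epochs:
        # Warmup done: restore the target LR before monitoring begins.
        for g in optimizer.param_groups:
            g["lr"] = target_lr
    # ... training and validation would go here ...
    val_loss = 1.0  # placeholder validation loss
    if epoch >= warmup_epochs:
        # Only hand control to ReduceLROnPlateau after warmup ends.
        plateau.step(val_loss)
```

During warmup the LR rises by design, so feeding those epochs to the plateau scheduler would corrupt its patience counter; gating `plateau.step()` on the epoch avoids that.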
Combined Schedule Visualization
```python
# Visualize the combined warmup + plateau schedule
import matplotlib.pyplot as plt

epochs = range(100)
lrs = []
base_lr = 1e-3
warmup_epochs = 10

# Simulate warmup phase
for e in epochs:
    if e < warmup_epochs:
        factor = 0.1 + 0.9 * (e / warmup_epochs)
        lrs.append(base_lr * factor)
    else:
        # After warmup, assume some plateau reductions
        if e < 50:
            lrs.append(base_lr)
        elif e < 70:
            lrs.append(base_lr * 0.5)   # First reduction
        else:
            lrs.append(base_lr * 0.25)  # Second reduction

plt.figure(figsize=(10, 5))
plt.plot(epochs, lrs, 'b-', linewidth=2)
plt.axvline(x=warmup_epochs, color='red', linestyle='--', alpha=0.5,
            label='Warmup ends')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Warmup + ReduceLROnPlateau Schedule')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

Summary
In this section, we covered learning rate warmup:
- Why needed: Adam's moments need time to stabilize
- Linear warmup: η(t) = η_target × t / T_warmup (recommended)
- Duration: 5-10% of total epochs
- Combined scheduling: warmup followed by a decay schedule (ReduceLROnPlateau or cosine decay)
- Step-level updates: for step-based warmup, call scheduler.step() after each batch
| Parameter | Value |
|---|---|
| Warmup strategy | Linear |
| Warmup epochs | 5 (of 100) |
| Start LR | ~0 (1e-7) |
| Target LR | 1e-3 |
| Post-warmup | Cosine decay to 1e-6 |
Looking Ahead: After warmup, we need a strategy for the main training phase. The next section covers cosine annealing with warm restarts—a schedule that can escape local minima through periodic learning rate increases.
With warmup understood, we explore cosine annealing schedules.