Learning Objectives
By the end of this section, you will:
- Understand cosine annealing and its smooth decay properties
- Explain warm restarts for escaping local minima
- Configure cycle length and multipliers
- Compare with other schedules (step, exponential)
- Implement SGDR in PyTorch
Why This Matters: Cosine annealing (Loshchilov & Hutter, 2017) provides smooth learning rate decay that often outperforms step-based schedules. Warm restarts can help the optimizer escape local minima by periodically increasing the learning rate, potentially finding better solutions.
Cosine Annealing
Cosine annealing smoothly decreases the learning rate following a cosine curve.
Mathematical Formulation
η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π · t / T))

Where:
- η_t: Learning rate at step/epoch t
- η_max: Maximum (initial) learning rate
- η_min: Minimum (final) learning rate
- T: Total number of steps/epochs
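As a quick sanity check, the formula can be evaluated directly. This is a minimal sketch in plain Python; the values η_max = 1e-3, η_min = 1e-6, T = 100 are illustrative, not prescribed:

```python
import math

def cosine_lr(t: int, T: int, eta_max: float = 1e-3, eta_min: float = 1e-6) -> float:
    """Cosine-annealed learning rate at step t of a horizon of T steps."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

print(cosine_lr(0, 100))    # start: equals eta_max (1e-3)
print(cosine_lr(50, 100))   # midpoint: halfway between the bounds, ~5.005e-4
print(cosine_lr(100, 100))  # end: equals eta_min (1e-6)
```

At t = 0 the cosine term is 1, giving η_max; at t = T it is −1, giving η_min; in between the rate glides smoothly between the two.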
Properties of Cosine Decay
- Continuous: the learning rate changes a little every step, with no sudden drops
- Decay is slowest near the start (close to η_max) and the end (close to η_min), and fastest mid-training
- Only one schedule hyperparameter beyond the learning rate bounds: the horizon T
Cosine vs. Step Decay
Figure: learning rate schedule comparison over a horizon of T epochs. Step decay holds η_max constant and drops sharply at fixed intervals; cosine decay declines smoothly and continuously from η_max to η_min.

| Aspect | Cosine Decay | Step Decay |
|---|---|---|
| Smoothness | Continuous | Discontinuous |
| Hyperparameters | 1 (T) | 2+ (steps, factor) |
| Early training | Slow decay | Constant until step |
| Late training | Slow convergence | Sharp drops |
| Typical performance | Often better | Baseline |
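The contrast in the table can be made concrete by tabulating both schedules over a run. A hedged sketch in plain Python; the step size of 30 epochs and decay factor of 0.1 are illustrative choices, not values from this section:

```python
import math

def cosine_lr(t, T, eta_max=1e-3, eta_min=1e-6):
    """Cosine annealing from eta_max down to eta_min over T epochs."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def step_lr(t, eta_max=1e-3, step_size=30, factor=0.1):
    """Step decay: multiply the rate by `factor` every `step_size` epochs."""
    return eta_max * factor ** (t // step_size)

for t in (0, 29, 30, 59, 60, 99):
    print(f"epoch {t:3d}  cosine {cosine_lr(t, 100):.2e}  step {step_lr(t):.2e}")
```

Cosine shifts slightly every epoch, while step decay is flat and then falls 10x at epochs 30 and 60, which is exactly the discontinuity the table refers to.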
Warm Restarts (SGDR)
SGDR (Stochastic Gradient Descent with Warm Restarts) periodically resets the learning rate.
The Restart Concept
Instead of decaying to minimum once, SGDR uses multiple cycles:
η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π · T_cur / T_i))

Where:
- T_cur: Steps since the last restart
- T_i: Length of the current cycle
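A restart simply resets T_cur to zero, which sends the learning rate back up to η_max. A minimal sketch in plain Python, assuming for illustration a fixed cycle length of 10 epochs (no multiplier) and the usual bounds η_max = 1e-3, η_min = 1e-6:

```python
import math

def sgdr_lr(epoch, T_i=10, eta_max=1e-3, eta_min=1e-6):
    """SGDR learning rate with a fixed cycle length T_i."""
    t_cur = epoch % T_i  # steps since the last restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

for epoch in (0, 5, 9, 10, 11):
    print(f"epoch {epoch:2d}: {sgdr_lr(epoch):.2e}")
# The rate decays toward eta_min through epoch 9; the restart at
# epoch 10 resets t_cur to 0, so the rate jumps back to eta_max.
```

The sudden jump at each restart is the "warm restart": optimizer state (e.g., momentum) is kept, only the learning rate is reset.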
Cycle Length Multiplier
Cycles can grow longer over time:

T_i = T₀ · T_mult^i

Where:
- T₀: Initial cycle length
- T_mult: Cycle length multiplier (e.g., 2)
- i: Cycle index (0, 1, 2, ...)
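With T₀ = 10 and T_mult = 2, the cycle lengths and the epochs at which restarts occur work out as follows (a small arithmetic check in plain Python):

```python
T_0, T_mult = 10, 2

# Cycle lengths: T_i = T_0 * T_mult**i
lengths = [T_0 * T_mult ** i for i in range(3)]
print(lengths)  # [10, 20, 40]

# Restart epochs are the cumulative cycle lengths
restarts = []
total = 0
for length in lengths:
    total += length
    restarts.append(total)
print(restarts)  # [10, 30, 70]
```

So later cycles spend progressively longer exploring before the next restart.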
Warm Restart Visualization
Figure: SGDR schedule with T₀ = 10 and T_mult = 2. Each cycle decays from η_max to η_min, then restarts at η_max; restarts occur at epochs 10, 30, and 70.

Cycle 1: T₀ = 10 epochs
Cycle 2: T₁ = 10 × 2 = 20 epochs
Cycle 3: T₂ = 10 × 4 = 40 epochs
Why Warm Restarts Help
Raising the learning rate at each restart can kick the optimizer out of sharp local minima and lets it re-explore the loss landscape. The model state at the end of each cycle also provides a snapshot that can be saved and later ensembled.
Schedule Selection
Which schedule should you use for RUL prediction?
Schedule Comparison
| Schedule | FD002 RMSE | Stability | Complexity |
|---|---|---|---|
| Constant | 16.2 | Low | Simple |
| Step decay | 14.8 | Medium | Low |
| Exponential decay | 14.5 | High | Low |
| Cosine annealing | 14.1 | High | Low |
| SGDR (warm restarts) | 13.9 | Medium | Medium |
| Warmup + Cosine | 13.9 | High | Medium |
Recommended Configuration
For AMNL, we recommend warmup + cosine annealing without restarts:
- Warmup: 5 epochs linear warmup
- Decay: Cosine decay from η_max to η_min
- No restarts: Simpler, more stable for our use case
When to Use Warm Restarts
Warm restarts are most beneficial when: (1) the loss landscape has many local minima, (2) you have computational budget for longer training, (3) you want to generate multiple model snapshots. For standard RUL prediction, simple cosine annealing usually suffices.
Implementation
PyTorch implementations of cosine annealing schedules.
Simple Cosine Annealing
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# Simple cosine annealing
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,    # Total epochs
    eta_min=1e-6  # Minimum learning rate
)

# Training loop
for epoch in range(100):
    train_epoch(model, dataloader, optimizer)
    scheduler.step()  # Update after each epoch
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.2e}")
```

Cosine Annealing with Warm Restarts
```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# SGDR scheduler
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,       # Initial cycle length (epochs)
    T_mult=2,     # Cycle length multiplier
    eta_min=1e-6  # Minimum learning rate
)

# Training loop - step after each batch for smooth restarts
for epoch in range(100):
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()

        # Step with fractional epoch for smooth restarts
        scheduler.step(epoch + batch_idx / len(dataloader))
```

Custom Warmup + Cosine Schedule
```python
import math

import torch


class WarmupCosineAnnealingLR:
    """
    Custom scheduler combining linear warmup with cosine annealing.

    Args:
        optimizer: PyTorch optimizer
        warmup_epochs: Number of warmup epochs
        total_epochs: Total training epochs
        eta_max: Maximum learning rate
        eta_min: Minimum learning rate
    """

    def __init__(
        self,
        optimizer: torch.optim.Optimizer,
        warmup_epochs: int,
        total_epochs: int,
        eta_max: float = 1e-3,
        eta_min: float = 1e-6
    ):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.current_epoch = 0

    def step(self):
        """Update learning rate based on current epoch. Call once per epoch."""
        self.current_epoch += 1

        if self.current_epoch <= self.warmup_epochs:
            # Linear warmup from 0 toward eta_max
            lr = self.eta_max * (self.current_epoch / self.warmup_epochs)
        else:
            # Cosine annealing from eta_max down to eta_min
            progress = (self.current_epoch - self.warmup_epochs) / (
                self.total_epochs - self.warmup_epochs
            )
            lr = self.eta_min + 0.5 * (self.eta_max - self.eta_min) * (
                1 + math.cos(math.pi * progress)
            )

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

    def get_lr(self) -> float:
        return self.optimizer.param_groups[0]['lr']


# Usage
scheduler = WarmupCosineAnnealingLR(
    optimizer=optimizer,
    warmup_epochs=5,
    total_epochs=100,
    eta_max=1e-3,
    eta_min=1e-6
)
```

Summary
In this section, we covered cosine annealing schedules:
- Cosine annealing: Smooth decay following cosine curve
- Warm restarts: Periodic LR increases to escape local minima
- Cycle multiplier: Growing cycle lengths for deeper exploration
- Recommendation: Warmup + cosine annealing (no restarts)
- PyTorch: CosineAnnealingLR and CosineAnnealingWarmRestarts
| Parameter | Value |
|---|---|
| Schedule | Warmup + Cosine |
| Warmup epochs | 5 |
| Total epochs | 100 |
| η_max | 1e-3 |
| η_min | 1e-6 |
| Restarts | No |
Looking Ahead: With learning rate scheduling configured, the next section explores adaptive weight decay: techniques for adjusting regularization strength during training.