Chapter 12

Cosine Annealing with Warm Restarts

Optimization Strategy

Learning Objectives

By the end of this section, you will:

  1. Understand cosine annealing and its smooth decay properties
  2. Explain warm restarts for escaping local minima
  3. Configure cycle length and multipliers
  4. Compare with other schedules (step, exponential)
  5. Implement SGDR in PyTorch
Why This Matters: Cosine annealing (Loshchilov & Hutter, 2017) provides smooth learning rate decay that often outperforms step-based schedules. Warm restarts can help the optimizer escape local minima by periodically increasing the learning rate, potentially finding better solutions.

Cosine Annealing

Cosine annealing smoothly decreases the learning rate following a cosine curve.

Mathematical Formulation

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)

Where:

  • η_t: Learning rate at step/epoch t
  • η_max: Maximum (initial) learning rate
  • η_min: Minimum (final) learning rate
  • T: Total number of steps/epochs
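To make the formula concrete, it can be evaluated directly in plain Python. This is a minimal sketch; the values η_max = 1e-3 and η_min = 1e-6 match the hyperparameters recommended later in this section:

```python
import math

def cosine_lr(t: int, T: int, eta_max: float = 1e-3, eta_min: float = 1e-6) -> float:
    """Learning rate at step t under cosine annealing (no restarts)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# At t=0 the rate equals eta_max; at t=T it reaches eta_min;
# at t=T/2 it sits exactly halfway between the two.
for t in (0, 50, 100):
    print(f"t={t:3d}  lr={cosine_lr(t, 100):.3e}")
```

Note that the decay is slow near both endpoints (where the cosine is flat) and fastest around t = T/2.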

Properties of Cosine Decay

Cosine vs. Step Decay

```text
Learning Rate Schedule Comparison:

LR
  η_max ─┤●─────────╮
         │ ╲        ╰─────╮           ← Step decay (sharp drops)
         │  ╲             ╰──────────
         │   ╲
         │    ╲
         │     ╲___                   ← Cosine decay (smooth)
  η_min ─┤         ╲─────────────────
         └─────┬─────┬─────┬─────┬──
          0   T/4   T/2   3T/4   T

Cosine: Smooth, gradual decay
Step: Sharp drops at fixed intervals
```
| Aspect | Cosine Decay | Step Decay |
| --- | --- | --- |
| Smoothness | Continuous | Discontinuous |
| Hyperparameters | 1 (T) | 2+ (step size, factor) |
| Early training | Slow decay | Constant until first step |
| Late training | Slow convergence | Sharp drops |
| Typical performance | Often better | Baseline |

Warm Restarts (SGDR)

SGDR (Stochastic Gradient Descent with Warm Restarts) periodically resets the learning rate.

The Restart Concept

Instead of decaying to minimum once, SGDR uses multiple cycles:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t_{\text{cur}}}{T_i}\pi\right)\right)

Where:

  • t_cur: Steps since the last restart
  • T_i: Length of the current cycle

Cycle Length Multiplier

Cycles can grow longer over time:

T_i = T_0 \cdot T_{\text{mult}}^i

Where:

  • T_0: Initial cycle length
  • T_mult: Cycle length multiplier (e.g., 2)
  • i: Cycle index (0, 1, 2, ...)
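The growing cycle lengths translate directly into the epochs at which restarts occur. A small helper (hypothetical, for illustration) with the T_0 = 10, T_mult = 2 setting used in this section:

```python
def restart_epochs(T_0: int, T_mult: int, n_cycles: int) -> list[int]:
    """Epochs at which each warm restart occurs (end of each cycle)."""
    epochs, t, T_i = [], 0, T_0
    for _ in range(n_cycles):
        t += T_i          # advance by the current cycle length
        epochs.append(t)  # restart happens here
        T_i *= T_mult     # next cycle is T_mult times longer
    return epochs

print(restart_epochs(T_0=10, T_mult=2, n_cycles=3))  # [10, 30, 70]
```

With T_mult = 2, total training time roughly doubles with each extra cycle, so budget accordingly.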

Warm Restart Visualization

```text
SGDR with T_0=10, T_mult=2:

LR
  η_max ─┤●╮    ●──╮        ●────────╮
         │ ╲    │   ╲       │         ╲
         │  ╲   │    ╲      │          ╲
         │   ╲  │     ╲     │           ╲
         │    ╲ │      ╲    │            ╲
         │     ╲│       ╲   │             ╲
  η_min ─┤      ●        ╲──┘              ╲──
         └──────┬────────┬─────────────────┬──
               10       30                70   Epochs

Cycle 1: T₀ = 10 epochs
Cycle 2: T₁ = 10 × 2 = 20 epochs
Cycle 3: T₂ = 10 × 4 = 40 epochs
```
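The trajectory above can be reproduced numerically by combining the restart formula with the cycle bookkeeping. A minimal sketch at epoch granularity (`sgdr_lr` is a hypothetical helper, not a PyTorch API):

```python
import math

def sgdr_lr(epoch: float, T_0: int = 10, T_mult: int = 2,
            eta_max: float = 1e-3, eta_min: float = 1e-6) -> float:
    """SGDR learning rate at a given (possibly fractional) epoch."""
    # Locate the current cycle by subtracting completed cycle lengths.
    t_cur, T_i = epoch, T_0
    while t_cur >= T_i:
        t_cur -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

# The rate resets to eta_max at each restart (epochs 10, 30, 70).
for e in (0, 9, 10, 30, 70):
    print(f"epoch {e:3d}  lr={sgdr_lr(e):.3e}")
```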

Why Warm Restarts Help

Periodically raising the learning rate gives the optimizer enough energy to escape sharp or poor local minima and explore a different basin of the loss landscape. Each cycle then anneals back down, converging toward a (possibly better) minimum; the models at the end of successive cycles can also be saved as snapshots for ensembling.
Schedule Selection

Which schedule should you use for RUL prediction?

Schedule Comparison

| Schedule | FD002 RMSE | Stability | Complexity |
| --- | --- | --- | --- |
| Constant | 16.2 | Low | Simple |
| Step decay | 14.8 | Medium | Low |
| Exponential decay | 14.5 | High | Low |
| Cosine annealing | 14.1 | High | Low |
| SGDR (warm restarts) | 13.9 | Medium | Medium |
| Warmup + Cosine | 13.9 | High | Medium |

For AMNL, we recommend warmup + cosine annealing without restarts:

  • Warmup: 5 epochs linear warmup
  • Decay: Cosine decay from η_max to η_min
  • No restarts: Simpler, more stable for our use case

When to Use Warm Restarts

Warm restarts are most beneficial when: (1) the loss landscape has many local minima, (2) you have computational budget for longer training, (3) you want to generate multiple model snapshots. For standard RUL prediction, simple cosine annealing usually suffices.
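For point (3), one common pattern is to snapshot the model just before each restart, when the learning rate (and typically the loss) is at its lowest. A sketch of detecting restart boundaries in plain Python (`is_restart_boundary` is a hypothetical helper; the `torch.save` call is indicated as a comment):

```python
def is_restart_boundary(epoch: int, T_0: int, T_mult: int) -> bool:
    """True if a warm restart occurs at the end of this (0-indexed) epoch."""
    t, T_i = 0, T_0
    while t + T_i <= epoch + 1:
        t += T_i              # end of a completed cycle
        if t == epoch + 1:
            return True       # this epoch is the last of a cycle
        T_i *= T_mult
    return False

# In the training loop, snapshot right before each restart:
# for epoch in range(100):
#     train_epoch(model, dataloader, optimizer)
#     scheduler.step()
#     if is_restart_boundary(epoch, T_0=10, T_mult=2):
#         torch.save(model.state_dict(), f"snapshot_epoch{epoch}.pt")
```

With T_0=10 and T_mult=2, snapshots would be taken at the end of epochs 9, 29, and 69; averaging their predictions gives a simple snapshot ensemble at no extra training cost.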


Implementation

PyTorch implementations of cosine annealing schedules.

Simple Cosine Annealing

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# Simple cosine annealing
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,      # Total epochs
    eta_min=1e-6    # Minimum learning rate
)

# Training loop
for epoch in range(100):
    train_epoch(model, dataloader, optimizer)
    scheduler.step()  # Update after each epoch
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.2e}")
```

Cosine Annealing with Warm Restarts

```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# SGDR scheduler
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,         # Initial cycle length (epochs)
    T_mult=2,       # Cycle length multiplier
    eta_min=1e-6    # Minimum learning rate
)

# Training loop - step after each batch for smooth restarts
for epoch in range(100):
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()

        # Step with fractional epoch for smooth restarts
        scheduler.step(epoch + batch_idx / len(dataloader))
```

Custom Warmup + Cosine Schedule

```python
import math

import torch


class WarmupCosineAnnealingLR:
    """
    Custom scheduler combining linear warmup with cosine annealing.

    Args:
        optimizer: PyTorch optimizer
        warmup_epochs: Number of warmup epochs
        total_epochs: Total training epochs
        eta_max: Maximum learning rate
        eta_min: Minimum learning rate
    """

    def __init__(
        self,
        optimizer: torch.optim.Optimizer,
        warmup_epochs: int,
        total_epochs: int,
        eta_max: float = 1e-3,
        eta_min: float = 1e-6
    ):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.current_epoch = 0

    def step(self):
        """Update learning rate based on current epoch."""
        self.current_epoch += 1

        if self.current_epoch <= self.warmup_epochs:
            # Linear warmup
            lr = self.eta_max * (self.current_epoch / self.warmup_epochs)
        else:
            # Cosine annealing
            progress = (self.current_epoch - self.warmup_epochs) / (
                self.total_epochs - self.warmup_epochs
            )
            lr = self.eta_min + 0.5 * (self.eta_max - self.eta_min) * (
                1 + math.cos(math.pi * progress)
            )

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

    def get_lr(self) -> float:
        return self.optimizer.param_groups[0]['lr']


# Usage
scheduler = WarmupCosineAnnealingLR(
    optimizer=optimizer,
    warmup_epochs=5,
    total_epochs=100,
    eta_max=1e-3,
    eta_min=1e-6
)
```

Summary

In this section, we covered cosine annealing schedules:

  1. Cosine annealing: Smooth decay following cosine curve
  2. Warm restarts: Periodic LR increases to escape local minima
  3. Cycle multiplier: Growing cycle lengths for deeper exploration
  4. Recommendation: Warmup + cosine annealing (no restarts)
  5. PyTorch: CosineAnnealingLR and CosineAnnealingWarmRestarts
| Parameter | Value |
| --- | --- |
| Schedule | Warmup + Cosine |
| Warmup epochs | 5 |
| Total epochs | 100 |
| η_max | 1e-3 |
| η_min | 1e-6 |
| Restarts | No |
Looking Ahead: We have configured learning rate schedules. The next section explores adaptive weight decay—techniques for adjusting regularization strength during training.

With learning rate scheduling covered, we examine weight decay strategies.