AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand cosine annealing and its smooth decay properties
Explain warm restarts for escaping local minima
Configure cycle length and multipliers
Compare with other schedules (step, exponential)
Implement SGDR in PyTorch

Why This Matters: Cosine annealing (Loshchilov & Hutter, 2017) provides smooth learning rate decay that often outperforms step-based schedules. Warm restarts can help the optimizer escape local minima by periodically increasing the learning rate, potentially finding better solutions.

Cosine Annealing

Cosine annealing smoothly decreases the learning rate following a cosine curve.

Mathematical Formulation

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

Where:

$\eta_t$ : Learning rate at step/epoch t
$\eta_{\max}$ : Maximum (initial) learning rate
$\eta_{\min}$ : Minimum (final) learning rate
$T$ : Total number of steps/epochs
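The formula above translates directly into a few lines of Python. This is a minimal sketch, not a library API: the function name `cosine_annealing_lr` and the example values (T = 100, η_max = 1e-3, η_min = 1e-6) are illustrative assumptions.

```python
import math


def cosine_annealing_lr(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Learning rate at step t (0 <= t <= T) under cosine annealing."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))


# Sample the schedule at the start, midpoint, and end of training
for t in (0, 50, 100):
    lr = cosine_annealing_lr(t, T=100, eta_max=1e-3, eta_min=1e-6)
    print(f"t={t:3d}  lr={lr:.3e}")
```

At t = 0 the cosine term equals 1, so the rate is η_max. At t = T/2 it is the midpoint of η_max and η_min, and at t = T it reaches η_min.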

Properties of Cosine Decay

Cosine vs. Step Decay

📝text

Learning Rate Schedule Comparison:

LR
  η_max ─┤●─────╮                    ●────╮
         │     ╲                          ├── Step decay
         │      ╲                    ●────┘
         │       ╲
         │        ╲
         │         ╲     ╱── Cosine decay
         │          ╲___╱
         │
  η_min ─┤                    ●───────────────
         └───┬───┬───┬───┬───┬───┬───┬───┬
            0       T/4     T/2     3T/4    T

Cosine: Smooth, gradual decay
Step: Sharp drops at fixed intervals

Aspect	Cosine Decay	Step Decay
Smoothness	Continuous	Discontinuous
Hyperparameters	1 (T)	2+ (steps, factor)
Early training	Slow decay	Constant until step
Late training	Slow convergence	Sharp drops
Typical performance	Often better	Baseline
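To make the smoothness contrast in the table concrete, the sketch below evaluates step, exponential, and cosine decay side by side. The function names and the settings (step size 30, decay factors 0.1 and 0.95, T = 100) are assumptions chosen for illustration, not values taken from the experiments in this chapter.

```python
import math


def step_decay_lr(t: int, eta_max: float, step_size: int, gamma: float) -> float:
    """Multiply the rate by gamma once every step_size epochs."""
    return eta_max * gamma ** (t // step_size)


def exponential_decay_lr(t: int, eta_max: float, gamma: float) -> float:
    """Multiply the rate by gamma every epoch."""
    return eta_max * gamma ** t


def cosine_decay_lr(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine annealing from eta_max (t = 0) to eta_min (t = T)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))


T = 100  # hypothetical training length in epochs
for t in range(0, T + 1, 10):
    step = step_decay_lr(t, 1e-3, step_size=30, gamma=0.1)
    expo = exponential_decay_lr(t, 1e-3, gamma=0.95)
    cosine = cosine_decay_lr(t, T, 1e-3, 1e-6)
    print(f"{t:3d}  step={step:.2e}  exp={expo:.2e}  cosine={cosine:.2e}")
```

With these settings the step schedule drops by a factor of ten between epochs 29 and 30, while the cosine schedule changes by only a few percent over the same epoch.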

Warm Restarts (SGDR)

SGDR (Stochastic Gradient Descent with Warm Restarts) periodically resets the learning rate to its maximum and then anneals it again.

The Restart Concept

Instead of decaying to the minimum once, SGDR runs multiple cosine cycles, each following:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t_{\text{cur}}}{T_i}\pi\right)\right)$$

Where:

$t_{\text{cur}}$ : Steps since last restart
$T_i$ : Length of current cycle

Cycle Length Multiplier

Cycles can grow longer over time:

$$T_i = T_0 \cdot T_{\text{mult}}^i$$

Where:

$T_0$ : Initial cycle length
$T_{\text{mult}}$ : Cycle length multiplier (e.g., 2)
$i$ : Cycle index (0, 1, 2, ...)
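The two SGDR formulas can be combined into a small pure-Python sketch that locates the current cycle and evaluates the cosine inside it. The helper names `sgdr_lr` and `restart_epochs` are hypothetical, and the example uses T_0 = 10 and T_mult = 2 as an assumed configuration.

```python
import math


def sgdr_lr(epoch: float, T_0: int, T_mult: int,
            eta_max: float, eta_min: float = 0.0) -> float:
    """Learning rate at a (possibly fractional) epoch under SGDR."""
    T_i = T_0        # length of the current cycle
    t_cur = epoch    # time elapsed since the last restart
    # Walk through completed cycles; each is T_mult times longer than the last
    while t_cur >= T_i:
        t_cur -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))


def restart_epochs(T_0: int, T_mult: int, n_cycles: int) -> list:
    """Epochs at which each of the first n_cycles cycles ends (a restart follows)."""
    ends, end, T_i = [], 0, T_0
    for _ in range(n_cycles):
        end += T_i
        ends.append(end)
        T_i *= T_mult
    return ends


print(restart_epochs(10, 2, 3))        # [10, 30, 70]
print(sgdr_lr(10, 10, 2, 1e-3, 1e-6))  # back at eta_max right after the first restart
```

With T_0 = 10 and T_mult = 2 the cycles last 10, 20, and 40 epochs, so restarts occur at epochs 10, 30, and 70.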

Warm Restart Visualization

📝text

SGDR with T_0=10, T_mult=2:

LR
  η_max ─┤●╮   ●──╮       ●────────╮
         │ ╲   │   ╲       │         ╲
         │  ╲  │    ╲      │          ╲
         │   ╲ │     ╲     │           ╲
         │    ╲│      ╲    │            ╲
         │     ●       ╲   │             ╲
         │              ╲  │              ╲
  η_min ─┤               ●─┘               ●──
         └──┬───────┬───────────┬─────────────
           10      30          70           Epochs

Cycle 1: T₀ = 10 epochs
Cycle 2: T₁ = 10 × 2 = 20 epochs
Cycle 3: T₂ = 10 × 4 = 40 epochs

Why Warm Restarts Help

Schedule Selection

Which schedule should you use for RUL prediction?

Schedule Comparison

Schedule	FD002 RMSE	Stability	Complexity
Constant	16.2	Low	Simple
Step decay	14.8	Medium	Low
Exponential decay	14.5	High	Low
Cosine annealing	14.1	High	Low
SGDR (warm restarts)	13.9	Medium	Medium
Warmup + Cosine	13.9	High	Medium

Recommended Configuration

For AMNL, we recommend warmup + cosine annealing without restarts:

Warmup: 5 epochs linear warmup
Decay: Cosine decay from η_max to η_min
No restarts: Simpler, more stable for our use case
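Written as a single piecewise schedule, the recommended configuration might look like the following. This is a sketch under stated assumptions: $T_w$ (the number of warmup epochs, 5 here) is a symbol introduced for illustration, and $t$ is counted from 1.

```latex
\eta_t =
\begin{cases}
\eta_{\max} \cdot \dfrac{t}{T_w}, & 1 \le t \le T_w \\[6pt]
\eta_{\min} + \dfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\dfrac{t - T_w}{T - T_w}\pi\right)\right), & T_w < t \le T
\end{cases}
```

Both branches give $\eta_{\max}$ at $t = T_w$, so the schedule is continuous where warmup hands over to cosine decay, and the second branch reaches $\eta_{\min}$ at $t = T$.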

When to Use Warm Restarts

Warm restarts are most beneficial when: (1) the loss landscape has many local minima, (2) you have computational budget for longer training, (3) you want to generate multiple model snapshots. For standard RUL prediction, simple cosine annealing usually suffices.

Implementation

PyTorch implementations of cosine annealing schedules.

Simple Cosine Annealing

🐍python

from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes `model`, `dataloader`, `optimizer`, and `train_epoch` are defined elsewhere

# Simple cosine annealing
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,      # Total epochs
    eta_min=1e-6    # Minimum learning rate
)

# Training loop
for epoch in range(100):
    # Report the LR this epoch actually trains with (before it is updated)
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.2e}")
    train_epoch(model, dataloader, optimizer)
    scheduler.step()  # Update after each epoch

Cosine Annealing with Warm Restarts

🐍python

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# SGDR scheduler
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,         # Initial cycle length (epochs)
    T_mult=2,       # Cycle length multiplier
    eta_min=1e-6    # Minimum learning rate
)

# Training loop - step after each batch for smooth restarts
for epoch in range(100):
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()

        # Step with fractional epoch for smooth restarts
        scheduler.step(epoch + batch_idx / len(dataloader))

Custom Warmup + Cosine Schedule

🐍python

import math

import torch


class WarmupCosineAnnealingLR:
    """
    Custom scheduler combining linear warmup with cosine annealing.

    Args:
        optimizer: PyTorch optimizer
        warmup_epochs: Number of warmup epochs
        total_epochs: Total training epochs
        eta_max: Maximum learning rate
        eta_min: Minimum learning rate
    """

    def __init__(
        self,
        optimizer: torch.optim.Optimizer,
        warmup_epochs: int,
        total_epochs: int,
        eta_max: float = 1e-3,
        eta_min: float = 1e-6
    ):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.current_epoch = 0
        # Set the LR for the first epoch right away; otherwise it would
        # train at whatever LR the optimizer was constructed with.
        self._apply_lr()

    def _apply_lr(self):
        """Compute the LR for the current epoch and write it to the optimizer."""
        # 1-based position in the schedule: the first epoch gets eta_max / warmup_epochs
        t = self.current_epoch + 1

        if t <= self.warmup_epochs:
            # Linear warmup
            lr = self.eta_max * (t / self.warmup_epochs)
        else:
            # Cosine annealing; clamp so the LR stays at eta_min past total_epochs
            progress = min(1.0, (t - self.warmup_epochs) / (
                self.total_epochs - self.warmup_epochs
            ))
            lr = self.eta_min + 0.5 * (self.eta_max - self.eta_min) * (
                1 + math.cos(math.pi * progress)
            )

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

    def step(self):
        """Advance to the next epoch and update the learning rate."""
        self.current_epoch += 1
        self._apply_lr()

    def get_lr(self) -> float:
        return self.optimizer.param_groups[0]['lr']


# Usage (assumes `optimizer` already exists); call scheduler.step() once per epoch
scheduler = WarmupCosineAnnealingLR(
    optimizer=optimizer,
    warmup_epochs=5,
    total_epochs=100,
    eta_max=1e-3,
    eta_min=1e-6
)

Summary

In this section, we covered cosine annealing schedules:

Cosine annealing: Smooth decay following a cosine curve
Warm restarts: Periodic LR increases to escape local minima
Cycle multiplier: Growing cycle lengths for deeper exploration
Recommendation: Warmup + cosine annealing (no restarts)
PyTorch: CosineAnnealingLR and CosineAnnealingWarmRestarts

Parameter	Value
Schedule	Warmup + Cosine
Warmup epochs	5
Total epochs	100
η_max	1e-3
η_min	1e-6
Restarts	No

Looking Ahead: We have configured learning rate schedules. The next section explores adaptive weight decay—techniques for adjusting regularization strength during training.

With learning rate scheduling covered, we examine weight decay strategies.