Chapter 9

Learning Rate Strategies

Training Neural Networks

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the critical role of learning rate: See how the learning rate controls the speed and stability of training, and why getting it right is often the most important hyperparameter decision
  2. Find optimal learning rates: Use the learning rate range test to systematically find the best starting learning rate for your model
  3. Implement learning rate schedules: Apply step decay, exponential decay, cosine annealing, and other scheduling strategies in PyTorch
  4. Apply warmup strategies: Understand why warmup improves training stability and implement it effectively
  5. Use cyclical learning rates: Leverage oscillating learning rates to escape local minima and achieve better generalization
Why This Matters: The learning rate is often called the most important hyperparameter in deep learning. A well-chosen learning rate schedule can be the difference between a model that converges in hours versus days, or between 90% and 95% accuracy. Mastering learning rate strategies is essential for training models efficiently.

Why Learning Rate Matters

The learning rate $\eta$ controls the step size in gradient descent:

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}$$

This single number has a profound impact on training dynamics:

| Learning Rate | Effect | Symptoms |
|---|---|---|
| Too high | Overshoots optima, oscillates wildly | Loss spikes, NaN values, training fails |
| Too low | Crawls toward optimum | Days/weeks to converge, stuck in local minima |
| Just right | Converges efficiently | Smooth loss curve, reaches good minimum |
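To see these three regimes concretely, here is a minimal sketch: plain gradient descent on the toy loss $f(w) = w^2$ (gradient $2w$), run with a too-low, a reasonable, and a too-high learning rate. The specific values are illustrative, not recommendations:

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimize the toy loss f(w) = w^2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # theta_{t+1} = theta_t - eta * grad
    return w

# Too low: barely moves. Just right: converges. Too high: diverges.
for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

With lr=1.1 the update multiplies $w$ by $(1 - 2 \cdot 1.1) = -1.2$ each step, so the iterate oscillates in sign while growing in magnitude — exactly the "bouncing off loss landscape walls" behavior described above.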

The Goldilocks Problem

Finding the right learning rate is a balancing act:

  • Too high: The optimizer takes huge steps, overshooting valleys and bouncing off loss landscape walls. The loss oscillates or explodes.
  • Too low: The optimizer makes tiny, timid steps. Training takes forever and may get stuck in poor local minima because it lacks the momentum to escape.
  • Just right: The optimizer moves confidently toward good solutions, fast enough to be practical but controlled enough to converge.

The Changing Landscape

Here's the key insight: the optimal learning rate changes during training. Early in training, when the model is far from any optimum, you can take larger steps without overshooting. Later, when approaching a minimum, you need smaller, more precise steps to settle into the optimum without bouncing out.

This motivates learning rate schedules—systematic ways to adjust the learning rate as training progresses.


Finding the Optimal Learning Rate

Before choosing a schedule, you need to find a good starting learning rate. The learning rate range test (Smith, 2017) is a systematic approach:

The Algorithm

  1. Start very small: Begin with a tiny learning rate (e.g., $10^{-7}$)
  2. Increase exponentially: After each batch, multiply the learning rate by a constant factor
  3. Track the loss: Record the loss at each step
  4. Stop when loss explodes: End the test when loss becomes much larger than the minimum observed
  5. Analyze the curve: Plot loss vs. learning rate on a log scale
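The loop itself is short. Below is a minimal, self-contained sketch of steps 1–5; since there is no real model here, a single parameter minimizing $w^2$ stands in for the training step, and the stopping threshold and toy loss are illustrative choices, not part of Smith's recipe:

```python
def lr_range_test(initial_lr=1e-7, final_lr=10.0, num_steps=100):
    """Sweep the LR exponentially, recording the loss at each step.

    A single parameter minimizing loss = w^2 stands in for a real model;
    the curve still shows the flat -> decreasing -> exploding regions.
    """
    factor = (final_lr / initial_lr) ** (1.0 / num_steps)  # per-batch multiplier
    lr, w, history = initial_lr, 5.0, []
    best_loss = float("inf")
    for _ in range(num_steps):
        w = w - lr * 2 * w            # one "training step"
        loss = w * w
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss and loss > 1e3:  # stop once the loss explodes
            break
        lr *= factor
    return history

history = lr_range_test()
# Heuristic: take the LR at the lowest loss, then back off ~10x for headroom.
suggested = min(history, key=lambda p: p[1])[0] / 10
print(f"suggested starting LR ~ {suggested:.2e}")
```

In practice you would replace the toy update with a real forward/backward pass over one batch, and smooth the recorded losses before picking the suggested LR.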

Interpreting the Results

The loss-vs-LR curve typically shows three regions:

  • Flat region (low LR): Learning rate too small to make progress
  • Decreasing region: The "sweet spot" where training works well
  • Increasing region: Learning rate too high, training destabilizes

Where to Set the Learning Rate

A common heuristic is to choose the learning rate where loss is decreasing fastest—typically about 10× lower than the rate where loss starts to increase. This gives you headroom before instability.

Interactive: LR Range Test

Run a simulated learning rate range test below. Watch how loss changes as the learning rate increases exponentially, and observe the suggested optimal learning rate:

Quick Check

In a learning rate range test, where should you set your initial learning rate?


Learning Rate Schedules

A learning rate schedule systematically adjusts $\eta$ during training. Here are the most common approaches:

Step Decay

Reduce learning rate by a factor $\gamma$ every $N$ epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / N \rfloor}$$

Example: Start at 0.1, reduce by 10× every 30 epochs: 0.1 → 0.01 → 0.001

Exponential Decay

Decay the learning rate by a constant factor each epoch:

$$\eta_t = \eta_0 \cdot \gamma^t$$

This creates a smooth, continuous decrease rather than abrupt drops.

Cosine Annealing

Smoothly decrease the learning rate following a cosine curve from $\eta_0$ to $\eta_{min}$:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})\left(1 + \cos\left(\frac{t \pi}{T}\right)\right)$$

The cosine shape provides a gentle decrease that slows near the end, allowing fine-grained convergence.

Linear Decay

Simply interpolate linearly from start to end:

$$\eta_t = \eta_0 - (\eta_0 - \eta_{min}) \cdot \frac{t}{T}$$

| Schedule | Pros | Cons | Best For |
|---|---|---|---|
| Step Decay | Simple, interpretable, proven | Abrupt changes, requires tuning N | Classic baselines, ResNets |
| Exponential | Smooth decay | Hard to control final LR | NLP, older architectures |
| Cosine Annealing | Smooth, effective, principled | More complex formula | Modern vision models, transformers |
| Linear | Simple, predictable | May be too aggressive early | Fine-tuning, short training runs |
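The four formulas above translate directly into code. This sketch uses the example hyperparameters from this section ($\eta_0 = 0.1$, $\gamma = 0.1$, $N = 30$, $T = 100$); all defaults are illustrative:

```python
import math

def step_decay(t, lr0=0.1, gamma=0.1, N=30):
    """Drop the LR by a factor gamma every N epochs."""
    return lr0 * gamma ** (t // N)

def exponential_decay(t, lr0=0.1, gamma=0.95):
    """Multiply the LR by gamma every epoch."""
    return lr0 * gamma ** t

def cosine_annealing(t, lr0=0.1, lr_min=1e-6, T=100):
    """Follow a cosine curve from lr0 down to lr_min over T epochs."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))

def linear_decay(t, lr0=0.1, lr_min=1e-6, T=100):
    """Interpolate linearly from lr0 to lr_min over T epochs."""
    return lr0 - (lr0 - lr_min) * t / T

# Step decay drops abruptly at epochs 30, 60, 90; cosine glides down smoothly.
for t in [0, 30, 60, 99]:
    print(f"epoch {t:3d}  step={step_decay(t):.4g}  cosine={cosine_annealing(t):.4g}")
```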

Interactive: Schedule Explorer

Compare different learning rate schedules side by side. Toggle schedules on/off and adjust parameters to see how they affect the learning rate over time:


Quick Check

Which learning rate schedule is most commonly used in modern transformer training?


Warmup Strategies

Warmup is the practice of starting with a very low learning rate and gradually increasing it to the target value over the first few epochs or steps.

Why Warmup Helps

At the start of training:

  • Model weights are randomly initialized—they're far from any reasonable solution
  • Gradients can be large and noisy—initial estimates are unreliable
  • High learning rates cause wild updates that may push weights into bad regions

Warmup allows the model to "find its footing" before taking larger optimization steps. This is especially important for:

  • Large batch training: Bigger batches enable higher learning rates, but starting high causes instability
  • Transformers: Self-attention can produce large gradients early on
  • Adam optimizer: The adaptive moment estimates need time to stabilize

Common Warmup Strategies

Linear Warmup

Increase learning rate linearly from 0 (or a small value) to the target over $W$ steps:

$$\eta_t = \eta_{target} \cdot \frac{t}{W}, \quad t < W$$

Exponential Warmup

Increase more aggressively at first, then slow down as approaching target:

$$\eta_t = \eta_{target} \cdot (1 - e^{-\alpha t})$$

How Long Should Warmup Be?

A common rule of thumb is to use warmup for 5-10% of total training steps. For example, if training for 10,000 steps, use 500-1,000 warmup steps. Some practitioners use 1-2 epochs for image classification.
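Both warmup variants above are one-liners. In this sketch, the target LR, $W$, and $\alpha$ values are illustrative, with $W = 500$ matching the 5% rule of thumb for a 10,000-step run:

```python
import math

def linear_warmup(t, target_lr=1e-3, W=500):
    """Ramp linearly from 0 to target_lr over W steps, then hold."""
    return target_lr * min(t, W) / W

def exponential_warmup(t, target_lr=1e-3, alpha=0.01):
    """Approach target_lr via 1 - e^(-alpha*t): fast at first, slower near the target."""
    return target_lr * (1 - math.exp(-alpha * t))

print(f"{linear_warmup(250):.1e}")  # halfway through warmup -> half the target LR
```

After step $W$ a real schedule would hand off to a decay phase; the warmup-plus-cosine scheduler in the PyTorch section below composes exactly these two pieces.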

Interactive: Warmup Demo

See how warmup stabilizes training compared to jumping straight to a high learning rate. Watch the loss curves and observe how the warmup phase prevents early instability:

Why warmup helps: At initialization, network weights are random and gradients can be large and noisy. Starting with a high learning rate can cause the optimizer to take huge steps, leading to unstable loss spikes. Warmup allows the model to find a reasonable region of parameter space before aggressive optimization begins.

Quick Check

What is the main purpose of learning rate warmup?


Cyclical Learning Rates

Cyclical Learning Rates (CLR) (Smith, 2017) take a different approach: instead of monotonically decreasing the learning rate, they oscillate it between minimum and maximum bounds.

The Key Insight

Traditional schedules assume we want to converge to a single minimum. But the loss landscape has many local minima, and some are better than others. By periodically increasing the learning rate, we can:

  • Escape poor local minima: High learning rates let us jump out of shallow local minima
  • Explore the loss landscape: Oscillations help us find flatter, more generalizable minima
  • Regularize training: The noise from varying LR acts as implicit regularization

Common Cyclical Policies

Triangular

Linear ramp up and down between $\eta_{min}$ and $\eta_{max}$:

$$\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \cdot \left|1 - \left|\frac{t}{S} \bmod 2 - 1\right|\right|$$

where $S$ is the step size (half a cycle).

Triangular2

Like triangular, but the amplitude (range between min and max) halves after each cycle. This combines exploration early on with convergence later.
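Both policies can be sketched in a few lines; the $S$, $\eta_{min}$, and $\eta_{max}$ defaults below are illustrative:

```python
def triangular(t, lr_min=1e-4, lr_max=1e-2, S=50):
    """Triangular CLR: linear ramp up then down, S steps per half-cycle."""
    x = (t / S) % 2               # position within the cycle, in [0, 2)
    frac = 1 - abs(x - 1)         # rises 0 -> 1, then falls 1 -> 0
    return lr_min + (lr_max - lr_min) * frac

def triangular2(t, lr_min=1e-4, lr_max=1e-2, S=50):
    """Like triangular, but the amplitude halves after each full cycle."""
    cycle = int(t // (2 * S))                     # completed full cycles
    amplitude = (lr_max - lr_min) / 2 ** cycle
    x = (t / S) % 2
    return lr_min + amplitude * (1 - abs(x - 1))
```

With `S=50`, `triangular` peaks at $\eta_{max}$ on steps 50, 150, 250, …; `triangular2` peaks at $\eta_{min} + (\eta_{max}-\eta_{min})/2$ on its second cycle, then at a quarter of the range, and so on.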

One-Cycle Policy

The one-cycle policy (Smith, 2018) is particularly effective:

  1. Warmup (first ~30%): Increase LR from low to maximum
  2. Annealing (remaining ~70%): Decrease from maximum to very low (often 1/1000th of max)

This single, large cycle provides both exploration (high LR phase) and fine-grained convergence (annealing phase). It often achieves better results with fewer epochs than traditional schedules.

Super-Convergence

The one-cycle policy can enable super-convergence—training to the same accuracy in 10× fewer epochs by using learning rates that would normally be considered too high. This happens because the high LR phase acts as strong regularization.

Interactive: Cyclical LR

Experiment with different cyclical learning rate policies. Watch how the oscillating learning rate affects training dynamics:


Quick Check

What is the main advantage of cyclical learning rates over monotonic decay?


PyTorch Implementation

PyTorch provides built-in schedulers in torch.optim.lr_scheduler. Here's how to use the most common ones:

Step Decay

Step Decay Scheduler

🐍step_lr.py

```python
from torch.optim.lr_scheduler import StepLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Reduce LR every 30 epochs
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")
```

StepLR parameters: `step_size=30` reduces the LR every 30 epochs, and `gamma=0.1` multiplies the LR by 0.1 (divides by 10) at each step, so the LR goes 0.1 → 0.01 → 0.001.

`scheduler.step()` must be called after each epoch (or after each batch for some schedulers); it updates the learning rate according to the schedule.

Cosine Annealing

Cosine Annealing Scheduler

🐍cosine_lr.py

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

`T_max` is the total number of epochs and `eta_min` is the minimum learning rate at the end of the schedule; the LR follows a cosine curve from the initial LR down to `eta_min`.

Cosine with Warm Restarts

Cosine Annealing with Warm Restarts

🐍cosine_restarts.py

```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Restarts happen automatically at cycle boundaries
```

`T_0=10` means the first cycle is 10 epochs; `T_mult=2` makes each subsequent cycle 2× longer than the previous (10, 20, 40 epochs).

Linear Warmup + Cosine Decay (Transformers Style)

🐍warmup_cosine.py

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    """Creates a schedule with linear warmup then cosine decay."""

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # Linear warmup
            return float(current_step) / float(max(1, num_warmup_steps))
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

    return LambdaLR(optimizer, lr_lambda)

# Usage
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=10000,
)

for step, batch in enumerate(train_loader):
    loss = train_step(model, batch, optimizer)
    scheduler.step()  # Call after each step, not each epoch!
```

One-Cycle Policy

One-Cycle Learning Rate

🐍one_cycle.py

```python
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=100,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # Warmup for 30% of training
    anneal_strategy='cos',
)

for epoch in range(100):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        scheduler.step()  # Call after EVERY batch
```

OneCycleLR needs to know the total number of training steps upfront; provide `steps_per_epoch=len(train_loader)` so it can calculate the schedule correctly. `pct_start=0.3` means 30% of training is warmup (increasing to `max_lr`) and the remaining 70% is annealing (decreasing toward ~0). Unlike epoch-based schedulers, OneCycleLR must be stepped after every batch, not every epoch.

Epoch vs. Step Scheduling

Different schedulers expect different calling patterns:
  • StepLR, CosineAnnealingLR: Call scheduler.step() after each epoch
  • OneCycleLR, LambdaLR for warmup: Call scheduler.step() after each batch/step
Check the documentation to ensure you're calling at the right frequency!

Practical Guidelines

Choosing a Schedule

| Scenario | Recommended Schedule |
|---|---|
| Training from scratch (CNNs) | Step decay or cosine annealing |
| Training transformers | Linear warmup + cosine decay |
| Limited training budget | One-cycle policy |
| Transfer learning / fine-tuning | Linear warmup + constant or linear decay |
| Hyperparameter search | Constant (to isolate other effects) |

Key Hyperparameters

  1. Initial learning rate: Use the LR range test. Common starting points: 0.1 for SGD, 0.001 for Adam on CNNs, 5e-5 for transformer fine-tuning.
  2. Warmup steps: 5-10% of total steps is a good default. More warmup for larger batches or more unstable models.
  3. Decay factor (gamma): For step decay, 0.1 (10× reduction) is common. For exponential, 0.95-0.99 per epoch.
  4. Final learning rate: Typically 1/100 to 1/1000 of the initial rate. Too high may prevent fine convergence.

Common Mistakes

  • Calling scheduler at wrong frequency: Some are per-epoch, others per-step
  • Forgetting warmup for transformers: Large models often require warmup for stability
  • Using the same schedule for fine-tuning: Fine-tuning typically needs lower LR and less aggressive schedules
  • Not adapting to batch size: Larger batches can support higher learning rates

Linear Scaling Rule

When increasing the batch size by a factor $k$, you can often scale the learning rate by the same factor. For example, if batch size 32 works with LR 0.1, batch size 128 (4× larger) may work with LR 0.4. Always use warmup when applying this rule.
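As a quick sketch of the arithmetic:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the LR proportionally to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# 4x the batch size -> 4x the learning rate (combine with warmup)
print(scaled_lr(0.1, 32, 128))
```

Treat the result as a starting point, not a guarantee: the rule breaks down at very large batch sizes, which is part of why warmup matters here.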

Summary

Learning rate management is crucial for efficient and effective neural network training. Here's what we covered:

| Concept | Key Point |
|---|---|
| Learning Rate | Controls optimization step size; too high causes instability, too low wastes time |
| LR Range Test | Systematically find optimal starting LR by increasing exponentially and watching loss |
| Step Decay | Reduce LR by factor γ every N epochs; simple and proven |
| Cosine Annealing | Smooth decrease following cosine curve; modern default for many tasks |
| Warmup | Start low and increase gradually; essential for large batches and transformers |
| Cyclical LR | Oscillate between min and max; helps escape local minima |
| One-Cycle | Single cycle of warmup + annealing; can achieve super-convergence |

PyTorch Scheduler Quick Reference

🐍quick_reference.py

```python
# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (step after each batch!)
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=num_batches)

# Custom schedule with warmup
scheduler = LambdaLR(optimizer, lr_lambda=custom_schedule_function)
```

Exercises

Conceptual Questions

  1. Explain why the optimal learning rate changes during training. What happens if you use the same (high) learning rate throughout?
  2. Why is warmup particularly important for training transformers? What would happen if you started with a high learning rate immediately?
  3. In the one-cycle policy, why does the warmup phase act as regularization? How does this enable "super-convergence"?
  4. Compare step decay and cosine annealing. In what situations might you prefer one over the other?

Solution Hints

  1. Q1: Early training: far from optimum, can take larger steps. Later: near optimum, need small precise steps. Constant high LR causes overshooting/oscillation near the minimum.
  2. Q2: Transformers have self-attention which can produce large gradients early on. High initial LR causes exploding updates. Warmup lets the adaptive optimizer (Adam) accumulate stable statistics.
  3. Q3: High LR during warmup prevents settling into sharp minima (which overfit). The optimizer can only converge to minima it doesn't "bounce out of," favoring flatter, more generalizable solutions.
  4. Q4: Step decay is simpler and has fewer hyperparameters. Cosine provides smoother transitions that may work better for modern architectures. Step decay works well when you know specific epochs where performance plateaus.

Coding Exercises

  1. Implement an LR finder: Write a function that performs the learning rate range test on a model, returning the loss vs. LR data. Plot the results and identify the optimal LR.
  2. Compare schedules: Train the same model with step decay, cosine annealing, and one-cycle. Plot the learning rate and loss curves for each. Which converges fastest?
  3. Custom warmup scheduler: Implement a scheduler that does linear warmup for 10% of steps, then applies your choice of decay. Use LambdaLR with a custom function.
  4. Batch size scaling: Train a model with batch size 32 and LR 0.01. Then try batch size 128 with scaled LR 0.04. Compare training curves. Does the linear scaling rule hold?

Coding Exercise Hints

  • Exercise 1: Use an exponential increase: lr = initial_lr * (final_lr/initial_lr)**(step/num_steps). Use a small model for speed.
  • Exercise 2: Log LR at each step using scheduler.get_last_lr(). Run for the same number of epochs with each schedule.
  • Exercise 3: Define lr_lambda(step) that returns a multiplier. If step < warmup_steps, return step/warmup_steps.
  • Exercise 4: The rule is approximate. You may need to tune warmup duration differently for larger batches.

In the next section, we'll explore regularization techniques—methods like dropout, weight decay, and data augmentation that help neural networks generalize better to unseen data.