Chapter 9

Training Neural Networks

Overfitting and Regularization

Learning Objectives

By the end of this section, you will be able to:

  1. Detect overfitting: Use training and validation curves to diagnose overfitting and underfitting in your models
  2. Tune regularization hyperparameters: Find optimal values for weight decay, dropout rate, and early stopping patience
  3. Combine techniques effectively: Know which regularization methods work well together and which conflict
  4. Debug regularization problems: Diagnose under-regularization, over-regularization, and technique-specific issues

Prerequisite: Chapter 5

This section assumes you understand what regularization techniques are. See Chapter 5, Section 5 (Normalization Layers) and Section 6 (Dropout and Regularization) for comprehensive coverage of how dropout, weight decay, batch normalization, and other techniques work. Here, we focus on practical training strategies.

The Big Picture

You've learned what regularization techniques do (Chapter 5). Now the question is: how do you actually use them in practice?

Regularization is not just a matter of adding dropout or weight decay. It's about:

  1. Detection: Recognizing when your model is overfitting (or underfitting)
  2. Selection: Choosing which techniques to apply for your architecture
  3. Tuning: Finding the right hyperparameter values
  4. Combination: Making multiple techniques work together
  5. Debugging: Fixing problems when regularization isn't working
| Situation | Signs | Action |
| --- | --- | --- |
| Under-regularization | Large train/val gap, val loss increasing | Add or strengthen regularization |
| Over-regularization | Both losses high, model underfits | Reduce regularization strength |
| Good balance | Small gap, both losses low and stable | Continue training or stop |

Detecting Overfitting

Explore how model complexity affects fitting. Increase the polynomial degree to see how the model can fit training points perfectly while losing the ability to generalize:

Overfitting Demonstration

[Interactive demo: a polynomial of adjustable degree fit to noisy data; as the degree grows, the fit passes through every training point while the test error climbs.]

Quick Check

In the demo above, what happens to the test error when you increase the polynomial degree from 3 to 15?

Recognizing the Patterns

The primary tool for detecting overfitting is comparing training and validation loss curves over time:

Healthy Training

  • Both losses decrease together
  • The gap between them remains small and stable
  • Both eventually plateau at similar values

Overfitting

  • Training loss continues to decrease
  • Validation loss decreases initially, then starts increasing
  • The gap between them grows larger over time

Underfitting

  • Both losses plateau at high values
  • Neither loss shows significant improvement
  • The model lacks capacity to learn the pattern

The Silent Killer

Overfitting is insidious because training metrics look great! Always monitor validation loss, not just training loss. If you only look at training loss, you might think your model is improving when it's actually getting worse at the task you care about.
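A minimal monitoring loop on synthetic data illustrates the practice of tracking both curves. Everything here (data, model, hyperparameters) is illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(200, 1)
train, val = (X[:150], y[:150]), (X[150:], y[150:])

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

history = {"train": [], "val": []}
for epoch in range(50):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(train[0]), train[1])
    loss.backward()
    opt.step()
    # Record BOTH losses after each epoch, in eval mode
    model.eval()
    with torch.no_grad():
        history["train"].append(loss_fn(model(train[0]), train[1]).item())
        history["val"].append(loss_fn(model(val[0]), val[1]).item())

gap = history["val"][-1] - history["train"][-1]
print(f"final train={history['train'][-1]:.3f} val={history['val'][-1]:.3f} gap={gap:.3f}")
```

A growing `gap` or a rising tail in `history["val"]` is the overfitting signal the curves above describe.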

The Bias-Variance Tradeoff

The bias-variance tradeoff is fundamental to understanding why regularization works. Explore it interactively:

The Bias-Variance Tradeoff in Machine Learning

[Interactive figure: Bias², Variance, Irreducible error, and Total Error plotted against model complexity, from simple (e.g., linear) to complex (e.g., deep NN). Total Error is minimized at a "sweet spot" between the underfitting and overfitting regimes.]
Optimal Complexity

The sweet spot! Model is complex enough to capture the pattern but not so complex that it fits noise. This minimizes total prediction error. In practice, found via cross-validation.

Expected Prediction Error = Bias² + Variance + Irreducible Noise

The key insight: optimal generalization occurs at a model complexity where the sum of bias and variance is minimized—not where either one is minimized individually.
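This tradeoff is easy to reproduce with a polynomial fit to a noisy sine wave. The setup below is an illustrative sketch (the target function, noise level, and degrees are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 200)
true_f = lambda x: np.sin(2.0 * np.pi * x)
y_train = true_f(x_train) + 0.2 * rng.standard_normal(x_train.size)

errors = {}
for degree in (1, 3, 12):
    fit = Polynomial.fit(x_train, y_train, degree)  # rescales the domain for stability
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse = np.mean((fit(x_test) - true_f(x_test)) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error falls monotonically as the degree grows, while test error is typically lowest at an intermediate degree: low capacity underfits (high bias), high capacity fits the noise (high variance).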

See Chapter 5 for Theory

For the mathematical formulation of the bias-variance decomposition, L1/L2 regularization theory, and dropout mechanics, see Chapter 5, Section 6. Here we focus on practical tuning strategies.

Regularization Strategy

You know what regularization techniques do from Chapter 5. Now let's focus on the practical question: how do you tune them?

Tuning Weight Decay

Weight decay (L2 regularization) is almost always beneficial. The question is finding the right strength:

Weight Decay (L2 Regularization) Effect

[Interactive figure: contours of the data loss (optimum θ*) and of the L2 penalty (centered at the origin) in (w₁, w₂) space, with a slider for the regularization strength λ from 0 (none) to 2 (strong). At λ = 0.50, the unregularized optimum θ* = (2.00, 1.50) with ‖θ*‖ = 2.50 moves to the regularized optimum θ̃* = (1.00, 1.20) with ‖θ̃*‖ = 1.56: the weights shrink toward the origin by 37.5%.]

Total Loss Function
L = L_data + λ·‖θ‖²

L2 regularization adds a penalty proportional to squared weight magnitude

Understanding Weight Decay

Blue ellipses show contours of constant data loss (where data loss is the same). The blue dot marks where data loss is minimized. Orange circles show contours of constant regularization penalty (centered at origin). The green dot shows where the total loss (data + regularization) is minimized. Notice how increasing λ pulls the optimum closer to the origin, shrinking the weight magnitudes. This prevents weights from growing too large, which helps prevent overfitting.

| Weight Decay Value | When to Use | Signs It's Wrong |
| --- | --- | --- |
| 1e-5 to 1e-4 | Small models, little overfitting risk | Large train/val gap persists |
| 1e-4 to 1e-3 | Most cases, good starting point | Check both gaps and absolute performance |
| 1e-3 to 1e-2 | Large models, severe overfitting | Training loss stays high (underfitting) |
| 1e-1+ | Rarely used | Almost always too strong |

AdamW vs Adam with weight_decay

Always use AdamW for weight decay with Adam-family optimizers. Adam(weight_decay=...) folds the L2 penalty into the gradient, so the adaptive learning rates rescale it per parameter; AdamW decouples the decay from the adaptive update, giving more predictable regularization.
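In practice, weight decay is often applied only to weight matrices. The parameter-group pattern below is a common sketch; the `p.ndim <= 1` test (biases and normalization scales are 1-D) is a widespread heuristic, not a PyTorch rule:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10)
)

decay, no_decay = [], []
for p in model.parameters():
    # Biases and BatchNorm affine parameters are 1-D; exclude them from decay
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = optim.AdamW([
    {"params": decay, "weight_decay": 1e-2},
    {"params": no_decay, "weight_decay": 0.0},
], lr=1e-3)
```

Decaying normalization parameters toward zero rarely helps and can hurt, which is why they are usually placed in the zero-decay group.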

Tuning Dropout

Dropout rate depends on layer type and model capacity:

| Layer Type | Typical Rate | Reasoning |
| --- | --- | --- |
| Input layer | 0.1 - 0.2 | Don't drop too much input information |
| Hidden (FC) layers | 0.3 - 0.5 | Standard range, start at 0.5 |
| Conv layers (Dropout2d) | 0.1 - 0.3 | Spatial dropout, lower rates work |
| Before output | 0 (none) | Final layer needs all features |
| Transformers | 0.1 | Lower rates common with LayerNorm |

Dropout + BatchNorm Interaction

Using dropout directly before or after BatchNorm can cause issues. BatchNorm expects consistent statistics, but dropout changes the activation distribution. If using both, place BatchNorm first, then dropout: Linear → BatchNorm → ReLU → Dropout.
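The recommended ordering from the tip above looks like this as a small sketch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Ordering from the tip above: normalize, activate, then drop
block = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),
)

block.train()  # dropout active, BatchNorm uses batch statistics
out = block(torch.randn(32, 256))
print(out.shape)  # torch.Size([32, 128])
```

Because BatchNorm sees the activations before any units are dropped, its batch statistics are not distorted by the dropout mask.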

Interactive: Dropout Visualization

See how dropout randomly disables neurons during training. Toggle between training and inference modes to see the difference:

Dropout Visualization

[Interactive figure: a small Input → Hidden 1 → Hidden 2 → Output network with a training/inference toggle and a dropout-rate slider (p from 0, no dropout, to 0.9, heavy). In training mode at p = 0.50, 4 of 10 hidden neurons are dropped, leaving 24 of 50 connections active.]

How Dropout Works

During Training: Each hidden neuron is randomly "dropped" (set to zero) with probability p = 0.50. This means each forward pass uses a different random subset of the network.

This prevents neurons from co-adapting too much to each other. Each neuron must learn useful features independently, without relying on specific other neurons being present.
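PyTorch uses "inverted" dropout: surviving activations are scaled up by 1/(1-p) during training, so the expected activation matches inference mode and no rescaling is needed at test time. This is easy to verify numerically:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(100_000)

# Training mode: survivors are scaled by 1/(1-p), so the mean is preserved
out = F.dropout(x, p=0.5, training=True)
print(out.unique())       # values are 0 or 2 (= 1/(1-0.5))
print(out.mean().item())  # close to 1.0 in expectation

# Eval mode: identity, all neurons active
print(F.dropout(x, p=0.5, training=False).mean().item())  # exactly 1.0
```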

Quick Check

Why do we need to scale neuron outputs during dropout?


Early Stopping

Early stopping is a simple but effective regularization strategy: stop training before the model overfits.

The Algorithm

  1. Monitor validation loss after each epoch
  2. Track the best validation loss seen so far and save that model checkpoint
  3. If validation loss doesn't improve for patience consecutive epochs, stop training
  4. Restore the model weights from the best checkpoint

Why It Works

Early stopping is a form of implicit regularization. As training progresses:

  1. Early epochs: Model learns general patterns (bias decreases, variance stays low)
  2. Middle epochs: Model reaches optimal generalization point
  3. Late epochs: Model starts memorizing training data (bias stays low, variance increases)

By stopping at the optimal point, we prevent the variance from increasing while maintaining the learned general patterns.

Early Stopping as Regularization

Early stopping limits the model's effective capacity by limiting how far the weights can travel from their initialization. For linear models with a quadratic loss, this is approximately equivalent to L2 regularization whose effective strength decreases the longer training runs.
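The equivalence can be sketched for the quadratic case. Informally (this is the standard argument, stated without proof; H denotes the Hessian of the loss at its minimum θ*, η the learning rate):

```latex
% Quadratic approximation of the loss around its minimum \theta^*:
L(\theta) \approx L(\theta^*) + \tfrac{1}{2}\,(\theta-\theta^*)^\top H\,(\theta-\theta^*)

% Gradient descent from \theta_0 = 0 with step size \eta, written in the
% eigenbasis of H (eigenvalues \lambda_i): after t steps,
\theta_t^{(i)} = \bigl[1-(1-\eta\lambda_i)^t\bigr]\,\theta^{*(i)}

% L2 regularization with strength \alpha shrinks each component to
\hat{\theta}^{(i)} = \frac{\lambda_i}{\lambda_i+\alpha}\,\theta^{*(i)}

% The two shrinkage factors agree to first order when
\alpha \approx \frac{1}{\eta t}
% i.e. training longer (larger t) corresponds to weaker effective L2.
```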

Interactive: Early Stopping

Watch a training run and see how early stopping detects the optimal point to stop training:

Early Stopping Demonstration

[Interactive figure: training and validation loss curves over 100 epochs with a playback control and patience set to 10 epochs. The panel marks the best epoch and the early-stop point, and reports the current train loss, validation loss, best validation loss so far, and epochs without improvement.]

How Early Stopping Works

Early stopping monitors the validation loss during training. If the validation loss doesn't improve for 10 consecutive epochs (the "patience"), training stops and we restore the model weights from the best epoch. This prevents the model from continuing to memorize training data after it has stopped learning generalizable patterns.

Quick Check

What happens if you set patience too low?


Early Stopping in Practice

Early stopping prevents overfitting by stopping training when validation loss stops improving. The key decisions are:

| Parameter | Typical Values | Trade-off |
| --- | --- | --- |
| Patience | 5-20 epochs | Too low may stop prematurely; too high may overfit before stopping |
| Min delta | 0 to 1e-4 | Ignores tiny improvements that may be noise |
| Monitor metric | val_loss (usually) | val_accuracy for classification if loss is noisy |

Normalization in Training

For detailed coverage of BatchNorm, LayerNorm, and other normalization techniques—including when to use each—see Chapter 5, Section 5. Key training considerations:
  • Always call model.train() / model.eval(): BatchNorm behaves differently in each mode
  • CNNs: Use BatchNorm2d (benefits from batch statistics)
  • Transformers/RNNs: Use LayerNorm (batch-size independent)
  • Small batches: Prefer LayerNorm or GroupNorm over BatchNorm
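The first bullet can be seen directly: a BatchNorm layer produces different outputs for the same input depending on the mode. A small sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = 3.0 + 2.0 * torch.randn(64, 4)  # inputs deliberately far from zero mean

bn.train()
out_train = bn(x)  # normalized with this batch's own statistics
bn.eval()
out_eval = bn(x)   # normalized with running statistics (barely updated yet)

print(f"train-mode mean: {out_train.mean().item():.4f}")  # ~0
print(f"eval-mode mean:  {out_eval.mean().item():.4f}")   # far from 0
```

Forgetting to call model.eval() at inference therefore silently changes the predictions, which is a common source of the "works in training, fails at inference" bug discussed below.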

Combining Regularization Techniques

Regularization techniques can be combined, but some combinations work better than others:

Complementary Combinations

| Combination | Effect | Recommended |
| --- | --- | --- |
| Weight Decay + Early Stopping | Continuous shrinkage + stopping at optimal point | Almost always |
| Weight Decay + Data Augmentation | Model regularization + data regularization | Yes, especially for images |
| Dropout + Weight Decay | Different mechanisms, complementary | Yes, may reduce both strengths slightly |
| BatchNorm + Weight Decay | Normalization + shrinkage | Yes, but use lower weight decay |
| Dropout + BatchNorm | Can conflict; use carefully | BatchNorm first, then Dropout |

Conflicting Combinations

Watch for Conflicts

  • Heavy Dropout + Heavy Weight Decay: Can cause severe underfitting. Use moderate amounts of each.
  • Dropout immediately before BatchNorm: Dropout distorts the activation statistics that BatchNorm estimates, causing a train/test mismatch. Place BatchNorm before the activation and dropout after it.
  • Multiple strong regularizers: Start with one technique, add others gradually.
🐍regularization_baseline.py

# A solid baseline for most tasks
# (model and EarlyStopping are defined elsewhere)
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # Start here, tune if needed
)

# Architecture includes appropriate normalization
# (BatchNorm for CNNs, LayerNorm for Transformers)

# Moderate dropout in FC layers
nn.Dropout(0.3)  # Not too aggressive

# Early stopping as safety net
early_stopping = EarlyStopping(patience=10)

# Data augmentation for image tasks
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])

Debugging Regularization Issues

When regularization isn't working as expected, use this diagnostic guide:

Problem: Still Overfitting

Symptoms: Large train/val gap, val loss increasing while train loss decreases.

  1. Increase weight decay (try 10x higher)
  2. Add or increase dropout rate
  3. Add more data augmentation
  4. Reduce model capacity (fewer layers/neurons)
  5. Lower early stopping patience

Problem: Underfitting

Symptoms: Both train and val loss high, neither improving.

  1. Reduce weight decay (try 10x lower)
  2. Reduce or remove dropout
  3. Increase model capacity
  4. Train longer (increase patience)
  5. Check if learning rate is too low

Problem: Training Unstable

Symptoms: Loss spikes, NaN values, erratic behavior.

  1. Check BatchNorm: are you calling model.train()/model.eval()?
  2. Reduce learning rate
  3. Check for gradient explosion (add gradient clipping)
  4. Ensure batch size is large enough for BatchNorm (at least 16-32)
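Gradient clipping (step 3) is a one-line addition before the optimizer step. A minimal sketch with a deliberately exploding gradient (model and loss are synthetic):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Deliberately large inputs to produce a huge gradient
loss = model(torch.randn(8, 10) * 100).pow(2).mean()
loss.backward()

# Rescale all gradients together so their global L2 norm is at most 1.0;
# the call returns the norm measured BEFORE clipping
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"grad norm before clipping: {float(before):.1f}")
print(f"grad norm after clipping:  {float(after):.3f}")  # at most 1.0
opt.step()
```

Clipping by global norm preserves the direction of the update while bounding its size, which tames occasional loss spikes without changing well-behaved steps.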

Systematic Tuning

Change one hyperparameter at a time. If you change weight decay and dropout simultaneously, you won't know which change helped (or hurt). Start with defaults, observe behavior, then adjust methodically.

PyTorch Implementation

L2 Regularization (Weight Decay)

L2 Regularization in PyTorch
🐍weight_decay.py

Weight Decay Parameter

weight_decay adds L2 regularization to all parameters. PyTorch applies the penalty during the optimizer step, not by modifying the loss.

AdamW Optimizer

AdamW implements "decoupled weight decay," which works correctly with adaptive learning rates. Use it instead of Adam with weight_decay for better results.

Manual L2 Penalty

Manually computing the L2 penalty gives you more control. You can exclude certain parameters (like biases) or use different strengths for different layers.
import torch
import torch.nn as nn
import torch.optim as optim

# Method 1: Using weight_decay in optimizer
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Add L2 regularization via weight_decay
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    weight_decay=1e-4  # L2 regularization strength
)

# Method 2: AdamW for proper weight decay with Adam
optimizer = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Decoupled weight decay
)

# Method 3: Manual L2 regularization
def train_with_manual_l2(model, data_loader, loss_fn, optimizer, l2_lambda):
    model.train()
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        outputs = model(inputs)

        # Compute data loss
        data_loss = loss_fn(outputs, targets)

        # Compute L2 penalty manually
        l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

        # Total loss
        total_loss = data_loss + l2_lambda * l2_penalty

        total_loss.backward()
        optimizer.step()

Dropout

Dropout Implementation
🐍dropout.py

Dropout Placement

Place dropout after activation functions; this is the standard convention. Some research suggests dropout before the activation may work better in certain cases, but after remains the default choice.

No Dropout Before Output

Generally don't use dropout immediately before the output layer. The output needs stable, full-capacity predictions.

Training Mode

model.train() enables dropout (and BatchNorm's training mode). Always call this before training!

Evaluation Mode

model.eval() disables dropout (all neurons active) and switches BatchNorm to its running statistics. Always call this before evaluation!

Spatial Dropout

Dropout2d drops entire feature maps (channels) rather than individual elements. This preserves spatial structure and works better for convolutional layers.
import torch
import torch.nn as nn

class MLPWithDropout(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, dropout_rate=0.5):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # Dropout after activation

            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),

            nn.Linear(hidden_size, num_classes)  # No dropout before output
        )

    def forward(self, x):
        return self.model(x)

# Training: dropout is active
model = MLPWithDropout(784, 256, 10, dropout_rate=0.5)
model.train()  # Enables dropout
output = model(input_tensor)  # Some neurons are dropped

# Inference: dropout is disabled
model.eval()  # Disables dropout
output = model(input_tensor)  # All neurons active

# For CNNs: use Dropout2d (drops entire feature maps)
class CNNWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(0.2),  # Spatial dropout for conv layers
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(0.2),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)

Early Stopping

Early Stopping Implementation
🐍early_stopping.py

Patience Parameter

Number of epochs to wait for improvement before stopping. Typical values are 5-20, depending on how noisy your validation loss is.

Minimum Delta

Only count an improvement if the loss decreases by at least this amount. This helps ignore tiny fluctuations. Set to 0 for any improvement to count.

Save Best Weights

We save a copy of the model weights whenever we find a new best validation loss. This lets us restore the best model at the end.

Restore Best Model

When stopping, we restore the weights from the best epoch, not the final epoch. This ensures we use the model with the lowest validation loss.
import copy

class EarlyStopping:
    """Early stopping to prevent overfitting."""

    def __init__(self, patience=10, min_delta=0.0, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.best_weights = None
        self.epochs_without_improvement = 0

    def __call__(self, val_loss, model):
        """Returns True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement found
            self.best_loss = val_loss
            # Deep-copy the weights: state_dict() returns references to the
            # live tensors, which later training steps would overwrite
            self.best_weights = copy.deepcopy(model.state_dict())
            self.epochs_without_improvement = 0
            return False
        else:
            # No improvement
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                if self.restore_best and self.best_weights is not None:
                    model.load_state_dict(self.best_weights)
                return True  # Stop training
            return False

# Usage in training loop
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    if early_stopping(val_loss, model):
        print(f"Early stopping triggered at epoch {epoch}")
        break

# Model now has best weights restored

Summary

This section covered practical regularization strategies for training neural networks. Key takeaways:

| Topic | Key Practical Insight |
| --- | --- |
| Detecting Overfitting | Monitor train vs val loss gap; val increasing = overfitting |
| Weight Decay Tuning | Start at 1e-3 to 1e-2 with AdamW; adjust based on gap |
| Dropout Tuning | 0.3-0.5 for FC layers, 0.1-0.2 for conv; none before output |
| Early Stopping | patience=10-15; saves best weights automatically |
| Combining Techniques | Weight decay + early stopping always; add dropout gradually |
| Debugging | Change one thing at a time; check train() vs eval() mode |

Remember: Theory is in Chapter 5

For the mathematical foundations of regularization techniques (L1/L2 penalties, dropout derivation, normalization algorithms), see Chapter 5, Sections 5 and 6. This section focused on when and how to apply these techniques during training.

Quiz

Test your understanding of overfitting and regularization:

Overfitting & Regularization Quiz

Question 1/8

What does it mean when training loss is very low but validation loss is much higher?


Exercises

Diagnostic Scenarios

  1. Scenario: Your model has 99% training accuracy but only 60% validation accuracy. The validation loss started increasing after epoch 5. What's happening, and what steps would you take (in order of priority)?
  2. Scenario: Both your training and validation loss are stuck at high values (neither improving after 50 epochs). What might be wrong, and how would you diagnose it?
  3. Scenario: You added dropout(0.5) to all layers, and now both train and validation accuracy dropped significantly. What went wrong?
  4. Scenario: Your model works perfectly in training but gives random predictions during inference. What's the most likely cause?

Solution Hints

  1. Q1: Severe overfitting. Priority: (1) Early stopping at epoch 5, (2) Add data augmentation, (3) Increase weight decay, (4) Add dropout, (5) Reduce model capacity.
  2. Q2: Underfitting. Check: (1) Learning rate too low? (2) Too much regularization? (3) Model too small? (4) Data issues?
  3. Q3: Over-regularization. Reduce dropout rate (0.2-0.3) or remove from some layers. Also check if dropout is on output layer (it shouldn't be).
  4. Q4: Forgot to call model.eval() before inference. BatchNorm and Dropout behave differently in train vs eval mode.

Hands-On Experiments

  1. Weight decay sweep: Train a CNN on CIFAR-10 with weight_decay values (0, 1e-5, 1e-4, 1e-3, 1e-2). Plot train/val curves for each. Identify which value gives the best validation accuracy.
  2. Regularization combination: Compare three setups: (a) weight decay only, (b) dropout only, (c) both. Measure final validation accuracy and the train/val gap for each.
  3. Early stopping patience: Train with patience values (3, 5, 10, 20, 50). Record at which epoch training stopped and the final validation loss. Find the sweet spot.
  4. train() vs eval() experiment: Train a model with BatchNorm. Run inference both with and without calling model.eval(). Compare the predictions and accuracy.

Experiment Tips

  • Exercise 1: Use fixed random seed for fair comparison. Use TensorBoard to visualize all curves together.
  • Exercise 2: Keep total regularization budget similar (if comparing dropout 0.5 vs weight_decay 1e-2, neither should be overwhelming).
  • Exercise 3: Track not just final loss but also which epoch was "best" for each patience value.
  • Exercise 4: Test on a batch of 100+ samples to get statistically meaningful accuracy difference.

In the next section, we'll explore hyperparameter tuning—systematic approaches to finding optimal values for learning rate, regularization strength, and other hyperparameters.