Chapter 9

Training Neural Networks

Overfitting and Regularization

Learning Objectives

By the end of this section, you will be able to:

  1. Detect overfitting: Use training and validation curves to diagnose overfitting and underfitting in your models
  2. Tune regularization hyperparameters: Find optimal values for weight decay, dropout rate, and early stopping patience
  3. Combine techniques effectively: Know which regularization methods work well together and which conflict
  4. Debug regularization problems: Diagnose under-regularization, over-regularization, and technique-specific issues

Prerequisite: Chapter 5

This section assumes you understand what regularization techniques are. See Chapter 5, Section 5 (Normalization Layers) and Section 6 (Dropout and Regularization) for comprehensive coverage of how dropout, weight decay, batch normalization, and other techniques work. Here, we focus on practical training strategies.

The Big Picture

You've learned what regularization techniques do (Chapter 5). Now the question is: how do you actually use them in practice?

Regularization is not just a matter of adding dropout or weight decay. It's about:

  1. Detection: Recognizing when your model is overfitting (or underfitting)
  2. Selection: Choosing which techniques to apply for your architecture
  3. Tuning: Finding the right hyperparameter values
  4. Combination: Making multiple techniques work together
  5. Debugging: Fixing problems when regularization isn't working
| Situation | Signs | Action |
| --- | --- | --- |
| Under-regularization | Large train/val gap, val loss increasing | Add or strengthen regularization |
| Over-regularization | Both losses high, model underfits | Reduce regularization strength |
| Good balance | Small gap, both losses low and stable | Continue training or stop |

Detecting Overfitting

Explore how model complexity affects fitting. Increase the polynomial degree to see how the model can fit training points perfectly while losing the ability to generalize:

Overfitting Demonstration

[Interactive demo: a polynomial of adjustable degree fit to noisy data; as the degree grows, the fit passes through every training point while the test error climbs.]

Quick Check

In the demo above, what happens to the test error when you increase the polynomial degree from 3 to 15?

Recognizing the Patterns

The primary tool for detecting overfitting is comparing training and validation loss curves over time:

Healthy Training

  • Both losses decrease together
  • The gap between them remains small and stable
  • Both eventually plateau at similar values

Overfitting

  • Training loss continues to decrease
  • Validation loss decreases initially, then starts increasing
  • The gap between them grows larger over time

Underfitting

  • Both losses plateau at high values
  • Neither loss shows significant improvement
  • The model lacks capacity to learn the pattern

The Silent Killer

Overfitting is insidious because training metrics look great! Always monitor validation loss, not just training loss. If you only look at training loss, you might think your model is improving when it's actually getting worse at the task you care about.
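A minimal monitoring loop on synthetic data illustrates the practice of tracking both curves. Everything here (data, model, hyperparameters) is illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(200, 1)
train, val = (X[:150], y[:150]), (X[150:], y[150:])

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

history = {"train": [], "val": []}
for epoch in range(50):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(train[0]), train[1])
    loss.backward()
    opt.step()
    # Record BOTH losses after each epoch, in eval mode
    model.eval()
    with torch.no_grad():
        history["train"].append(loss_fn(model(train[0]), train[1]).item())
        history["val"].append(loss_fn(model(val[0]), val[1]).item())

gap = history["val"][-1] - history["train"][-1]
print(f"final train={history['train'][-1]:.3f} val={history['val'][-1]:.3f} gap={gap:.3f}")
```

A growing `gap` or a rising tail in `history["val"]` is the overfitting signal the curves above describe.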

The Bias-Variance Tradeoff

The bias-variance tradeoff is fundamental to understanding why regularization works. Explore it interactively:

The Bias-Variance Tradeoff in Machine Learning

[Interactive figure: Bias², Variance, Irreducible error, and Total Error plotted against model complexity, from simple (e.g., linear) to complex (e.g., deep NN). Total Error is minimized at a "sweet spot" between the underfitting and overfitting regimes.]
Optimal Complexity

The sweet spot! Model is complex enough to capture the pattern but not so complex that it fits noise. This minimizes total prediction error. In practice, found via cross-validation.

Expected Prediction Error = Bias² + Variance + Irreducible Noise

The key insight: optimal generalization occurs at a model complexity where the sum of bias and variance is minimized—not where either one is minimized individually.
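This tradeoff is easy to reproduce with a polynomial fit to a noisy sine wave. The setup below is an illustrative sketch (the target function, noise level, and degrees are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 200)
true_f = lambda x: np.sin(2.0 * np.pi * x)
y_train = true_f(x_train) + 0.2 * rng.standard_normal(x_train.size)

errors = {}
for degree in (1, 3, 12):
    fit = Polynomial.fit(x_train, y_train, degree)  # rescales the domain for stability
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse = np.mean((fit(x_test) - true_f(x_test)) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error falls monotonically as the degree grows, while test error is typically lowest at an intermediate degree: low capacity underfits (high bias), high capacity fits the noise (high variance).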

See Chapter 5 for Theory

For the mathematical formulation of the bias-variance decomposition, L1/L2 regularization theory, and dropout mechanics, see Chapter 5, Section 6. Here we focus on practical tuning strategies.

Regularization Strategy

You know what regularization techniques do from Chapter 5. Now let's focus on the practical question: how do you tune them?

Tuning Weight Decay

Weight decay (L2 regularization) is almost always beneficial. The question is finding the right strength:

Weight Decay (L2 Regularization) Effect

[Interactive figure: contours of the data loss (optimum θ*) and of the L2 penalty (centered at the origin) in (w₁, w₂) space, with a slider for the regularization strength λ from 0 (none) to 2 (strong). At λ = 0.50, the unregularized optimum θ* = (2.00, 1.50) with ‖θ*‖ = 2.50 moves to the regularized optimum θ̃* = (1.00, 1.20) with ‖θ̃*‖ = 1.56: the weights shrink toward the origin by 37.5%.]

Total Loss Function
L = L_data + λ·‖θ‖²

L2 regularization adds a penalty proportional to squared weight magnitude

Understanding Weight Decay

Blue ellipses show contours of constant data loss (where data loss is the same). The blue dot marks where data loss is minimized. Orange circles show contours of constant regularization penalty (centered at origin). The green dot shows where the total loss (data + regularization) is minimized. Notice how increasing λ pulls the optimum closer to the origin, shrinking the weight magnitudes. This prevents weights from growing too large, which helps prevent overfitting.

| Weight Decay Value | When to Use | Signs It's Wrong |
| --- | --- | --- |
| 1e-5 to 1e-4 | Small models, little overfitting risk | Large train/val gap persists |
| 1e-4 to 1e-3 | Most cases, good starting point | Check both gaps and absolute performance |
| 1e-3 to 1e-2 | Large models, severe overfitting | Training loss stays high (underfitting) |
| 1e-1+ | Rarely used | Almost always too strong |

AdamW vs Adam with weight_decay

Always use AdamW for weight decay with Adam-family optimizers. Adam(weight_decay=...) folds the L2 penalty into the gradient, so the adaptive learning rates rescale it per parameter; AdamW decouples the decay from the adaptive update, giving more predictable regularization.
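In practice, weight decay is often applied only to weight matrices. The parameter-group pattern below is a common sketch; the `p.ndim <= 1` test (biases and normalization scales are 1-D) is a widespread heuristic, not a PyTorch rule:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10)
)

decay, no_decay = [], []
for p in model.parameters():
    # Biases and BatchNorm affine parameters are 1-D; exclude them from decay
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = optim.AdamW([
    {"params": decay, "weight_decay": 1e-2},
    {"params": no_decay, "weight_decay": 0.0},
], lr=1e-3)
```

Decaying normalization parameters toward zero rarely helps and can hurt, which is why they are usually placed in the zero-decay group.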

Tuning Dropout

Dropout rate depends on layer type and model capacity:

| Layer Type | Typical Rate | Reasoning |
| --- | --- | --- |
| Input layer | 0.1 - 0.2 | Don't drop too much input information |
| Hidden (FC) layers | 0.3 - 0.5 | Standard range, start at 0.5 |
| Conv layers (Dropout2d) | 0.1 - 0.3 | Spatial dropout, lower rates work |
| Before output | 0 (none) | Final layer needs all features |
| Transformers | 0.1 | Lower rates common with LayerNorm |

Dropout + BatchNorm Interaction

Using dropout directly before or after BatchNorm can cause issues. BatchNorm expects consistent statistics, but dropout changes the activation distribution. If using both, place BatchNorm first, then dropout: Linear → BatchNorm → ReLU → Dropout.
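The recommended ordering from the tip above looks like this as a small sketch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Ordering from the tip above: normalize, activate, then drop
block = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),
)

block.train()  # dropout active, BatchNorm uses batch statistics
out = block(torch.randn(32, 256))
print(out.shape)  # torch.Size([32, 128])
```

Because BatchNorm sees the activations before any units are dropped, its batch statistics are not distorted by the dropout mask.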

Interactive: Dropout Visualization

See how dropout randomly disables neurons during training. Toggle between training and inference modes to see the difference:

Dropout Visualization

[Interactive figure: a small Input → Hidden 1 → Hidden 2 → Output network with a training/inference toggle and a dropout-rate slider (p from 0, no dropout, to 0.9, heavy). In training mode at p = 0.50, 4 of 10 hidden neurons are dropped, leaving 24 of 50 connections active.]

How Dropout Works

During Training: Each hidden neuron is randomly "dropped" (set to zero) with probability p = 0.50. This means each forward pass uses a different random subset of the network.

This prevents neurons from co-adapting too much to each other. Each neuron must learn useful features independently, without relying on specific other neurons being present.
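PyTorch uses "inverted" dropout: surviving activations are scaled up by 1/(1-p) during training, so the expected activation matches inference mode and no rescaling is needed at test time. This is easy to verify numerically:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(100_000)

# Training mode: survivors are scaled by 1/(1-p), so the mean is preserved
out = F.dropout(x, p=0.5, training=True)
print(out.unique())       # values are 0 or 2 (= 1/(1-0.5))
print(out.mean().item())  # close to 1.0 in expectation

# Eval mode: identity, all neurons active
print(F.dropout(x, p=0.5, training=False).mean().item())  # exactly 1.0
```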

Quick Check

Why do we need to scale neuron outputs during dropout?


Early Stopping

Early stopping is a simple but effective regularization strategy: stop training before the model overfits.

The Algorithm

  1. Monitor validation loss after each epoch
  2. Track the best validation loss seen so far and save that model checkpoint
  3. If validation loss doesn't improve for patience consecutive epochs, stop training
  4. Restore the model weights from the best checkpoint

Why It Works

Early stopping is a form of implicit regularization. As training progresses:

  1. Early epochs: Model learns general patterns (bias decreases, variance stays low)
  2. Middle epochs: Model reaches optimal generalization point
  3. Late epochs: Model starts memorizing training data (bias stays low, variance increases)

By stopping at the optimal point, we prevent the variance from increasing while maintaining the learned general patterns.

Early Stopping as Regularization

Early stopping limits the model's effective capacity by limiting how far the weights can travel from their initialization. For linear models with a quadratic loss, this is approximately equivalent to L2 regularization whose effective strength decreases the longer training runs.
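The equivalence can be sketched for the quadratic case. Informally (this is the standard argument, stated without proof; H denotes the Hessian of the loss at its minimum θ*, η the learning rate):

```latex
% Quadratic approximation of the loss around its minimum \theta^*:
L(\theta) \approx L(\theta^*) + \tfrac{1}{2}\,(\theta-\theta^*)^\top H\,(\theta-\theta^*)

% Gradient descent from \theta_0 = 0 with step size \eta, written in the
% eigenbasis of H (eigenvalues \lambda_i): after t steps,
\theta_t^{(i)} = \bigl[1-(1-\eta\lambda_i)^t\bigr]\,\theta^{*(i)}

% L2 regularization with strength \alpha shrinks each component to
\hat{\theta}^{(i)} = \frac{\lambda_i}{\lambda_i+\alpha}\,\theta^{*(i)}

% The two shrinkage factors agree to first order when
\alpha \approx \frac{1}{\eta t}
% i.e. training longer (larger t) corresponds to weaker effective L2.
```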

Interactive: Early Stopping

Watch a training run and see how early stopping detects the optimal point to stop training:

Early Stopping Demonstration

[Interactive figure: training and validation loss curves over 100 epochs with a playback control and patience set to 10 epochs. The panel marks the best epoch and the early-stop point, and reports the current train loss, validation loss, best validation loss so far, and epochs without improvement.]

How Early Stopping Works

Early stopping monitors the validation loss during training. If the validation loss doesn't improve for 10 consecutive epochs (the "patience"), training stops and we restore the model weights from the best epoch. This prevents the model from continuing to memorize training data after it has stopped learning generalizable patterns.

Quick Check

What happens if you set patience too low?


Early Stopping in Practice

Early stopping prevents overfitting by stopping training when validation loss stops improving. The key decisions are:

| Parameter | Typical Values | Trade-off |
| --- | --- | --- |
| Patience | 5-20 epochs | Too low may stop prematurely; too high may overfit before stopping |
| Min delta | 0 to 1e-4 | Ignores tiny improvements that may be noise |
| Monitor metric | val_loss (usually) | val_accuracy for classification if loss is noisy |

Normalization in Training

For detailed coverage of BatchNorm, LayerNorm, and other normalization techniques—including when to use each—see Chapter 5, Section 5. Key training considerations:
  • Always call model.train() / model.eval(): BatchNorm behaves differently in each mode
  • CNNs: Use BatchNorm2d (benefits from batch statistics)
  • Transformers/RNNs: Use LayerNorm (batch-size independent)
  • Small batches: Prefer LayerNorm or GroupNorm over BatchNorm
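The first bullet can be seen directly: a BatchNorm layer produces different outputs for the same input depending on the mode. A small sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = 3.0 + 2.0 * torch.randn(64, 4)  # inputs deliberately far from zero mean

bn.train()
out_train = bn(x)  # normalized with this batch's own statistics
bn.eval()
out_eval = bn(x)   # normalized with running statistics (barely updated yet)

print(f"train-mode mean: {out_train.mean().item():.4f}")  # ~0
print(f"eval-mode mean:  {out_eval.mean().item():.4f}")   # far from 0
```

Forgetting to call model.eval() at inference therefore silently changes the predictions, which is a common source of the "works in training, fails at inference" bug discussed below.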

Combining Regularization Techniques

Regularization techniques can be combined, but some combinations work better than others:

Complementary Combinations

| Combination | Effect | Recommended |
| --- | --- | --- |
| Weight Decay + Early Stopping | Continuous shrinkage + stopping at optimal point | Almost always |
| Weight Decay + Data Augmentation | Model regularization + data regularization | Yes, especially for images |
| Dropout + Weight Decay | Different mechanisms, complementary | Yes, may reduce both strengths slightly |
| BatchNorm + Weight Decay | Normalization + shrinkage | Yes, but use lower weight decay |
| Dropout + BatchNorm | Can conflict; use carefully | BatchNorm first, then Dropout |

Conflicting Combinations

Watch for Conflicts

  • Heavy Dropout + Heavy Weight Decay: Can cause severe underfitting. Use moderate amounts of each.
  • Dropout immediately before BatchNorm: Dropout distorts the activation statistics that BatchNorm estimates, causing a train/test mismatch. Place BatchNorm before the activation and dropout after it.
  • Multiple strong regularizers: Start with one technique, add others gradually.
🐍regularization_baseline.py

# A solid baseline for most tasks
# (model and EarlyStopping are defined elsewhere)
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # Start here, tune if needed
)

# Architecture includes appropriate normalization
# (BatchNorm for CNNs, LayerNorm for Transformers)

# Moderate dropout in FC layers
nn.Dropout(0.3)  # Not too aggressive

# Early stopping as safety net
early_stopping = EarlyStopping(patience=10)

# Data augmentation for image tasks
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])

Debugging Regularization Issues

When regularization isn't working as expected, use this diagnostic guide:

Problem: Still Overfitting

Symptoms: Large train/val gap, val loss increasing while train loss decreases.

  1. Increase weight decay (try 10x higher)
  2. Add or increase dropout rate
  3. Add more data augmentation
  4. Reduce model capacity (fewer layers/neurons)
  5. Lower early stopping patience

Problem: Underfitting

Symptoms: Both train and val loss high, neither improving.

  1. Reduce weight decay (try 10x lower)
  2. Reduce or remove dropout
  3. Increase model capacity
  4. Train longer (increase patience)
  5. Check if learning rate is too low

Problem: Training Unstable

Symptoms: Loss spikes, NaN values, erratic behavior.

  1. Check BatchNorm: are you calling model.train()/model.eval()?
  2. Reduce learning rate
  3. Check for gradient explosion (add gradient clipping)
  4. Ensure batch size is large enough for BatchNorm (at least 16-32)
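Gradient clipping (step 3) is a one-line addition before the optimizer step. A minimal sketch with a deliberately exploding gradient (model and loss are synthetic):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Deliberately large inputs to produce a huge gradient
loss = model(torch.randn(8, 10) * 100).pow(2).mean()
loss.backward()

# Rescale all gradients together so their global L2 norm is at most 1.0;
# the call returns the norm measured BEFORE clipping
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"grad norm before clipping: {float(before):.1f}")
print(f"grad norm after clipping:  {float(after):.3f}")  # at most 1.0
opt.step()
```

Clipping by global norm preserves the direction of the update while bounding its size, which tames occasional loss spikes without changing well-behaved steps.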

Systematic Tuning

Change one hyperparameter at a time. If you change weight decay and dropout simultaneously, you won't know which change helped (or hurt). Start with defaults, observe behavior, then adjust methodically.

PyTorch Implementation

L2 Regularization (Weight Decay)

L2 Regularization in PyTorch
🐍weight_decay.py

Weight Decay Parameter

weight_decay adds L2 regularization to all parameters. PyTorch applies the penalty during the optimizer step, not by modifying the loss.

AdamW Optimizer

AdamW implements "decoupled weight decay," which works correctly with adaptive learning rates. Use it instead of Adam with weight_decay for better results.

Manual L2 Penalty

Manually computing the L2 penalty gives you more control. You can exclude certain parameters (like biases) or use different strengths for different layers.
import torch
import torch.nn as nn
import torch.optim as optim

# Method 1: Using weight_decay in optimizer
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Add L2 regularization via weight_decay
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    weight_decay=1e-4  # L2 regularization strength
)

# Method 2: AdamW for proper weight decay with Adam
optimizer = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Decoupled weight decay
)

# Method 3: Manual L2 regularization
def train_with_manual_l2(model, data_loader, loss_fn, optimizer, l2_lambda):
    model.train()
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        outputs = model(inputs)

        # Compute data loss
        data_loss = loss_fn(outputs, targets)

        # Compute L2 penalty manually
        l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

        # Total loss
        total_loss = data_loss + l2_lambda * l2_penalty

        total_loss.backward()
        optimizer.step()

Dropout

Dropout Implementation
🐍dropout.py

Dropout Placement

Place dropout after activation functions; this is the standard convention. Some research suggests dropout before the activation may work better in certain cases, but after remains the default choice.

No Dropout Before Output

Generally don't use dropout immediately before the output layer. The output needs stable, full-capacity predictions.

Training Mode

model.train() enables dropout (and BatchNorm's training mode). Always call this before training!

Evaluation Mode

model.eval() disables dropout (all neurons active) and switches BatchNorm to its running statistics. Always call this before evaluation!

Spatial Dropout

Dropout2d drops entire feature maps (channels) rather than individual elements. This preserves spatial structure and works better for convolutional layers.
import torch
import torch.nn as nn

class MLPWithDropout(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, dropout_rate=0.5):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # Dropout after activation

            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout_rate),

            nn.Linear(hidden_size, num_classes)  # No dropout before output
        )

    def forward(self, x):
        return self.model(x)

# Training: dropout is active
model = MLPWithDropout(784, 256, 10, dropout_rate=0.5)
model.train()  # Enables dropout
output = model(input_tensor)  # Some neurons are dropped

# Inference: dropout is disabled
model.eval()  # Disables dropout
output = model(input_tensor)  # All neurons active

# For CNNs: use Dropout2d (drops entire feature maps)
class CNNWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(0.2),  # Spatial dropout for conv layers
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(0.2),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)

Early Stopping

Early Stopping Implementation
🐍early_stopping.py

Patience Parameter

Number of epochs to wait for improvement before stopping. Typical values are 5-20, depending on how noisy your validation loss is.

Minimum Delta

Only count an improvement if the loss decreases by at least this amount. This helps ignore tiny fluctuations. Set to 0 for any improvement to count.

Save Best Weights

We save a copy of the model weights whenever we find a new best validation loss. This lets us restore the best model at the end.

Restore Best Model

When stopping, we restore the weights from the best epoch, not the final epoch. This ensures we use the model with the lowest validation loss.
import copy

class EarlyStopping:
    """Early stopping to prevent overfitting."""

    def __init__(self, patience=10, min_delta=0.0, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.best_weights = None
        self.epochs_without_improvement = 0

    def __call__(self, val_loss, model):
        """Returns True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement found
            self.best_loss = val_loss
            # Deep-copy the weights: state_dict() returns references to the
            # live tensors, which later training steps would overwrite
            self.best_weights = copy.deepcopy(model.state_dict())
            self.epochs_without_improvement = 0
            return False
        else:
            # No improvement
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                if self.restore_best and self.best_weights is not None:
                    model.load_state_dict(self.best_weights)
                return True  # Stop training
            return False

# Usage in training loop
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    if early_stopping(val_loss, model):
        print(f"Early stopping triggered at epoch {epoch}")
        break

# Model now has best weights restored

Summary

This section covered practical regularization strategies for training neural networks. Key takeaways:

| Topic | Key Practical Insight |
| --- | --- |
| Detecting Overfitting | Monitor train vs val loss gap; val increasing = overfitting |
| Weight Decay Tuning | Start at 1e-3 to 1e-2 with AdamW; adjust based on gap |
| Dropout Tuning | 0.3-0.5 for FC layers, 0.1-0.2 for conv; none before output |
| Early Stopping | patience=10-15; saves best weights automatically |
| Combining Techniques | Weight decay + early stopping always; add dropout gradually |
| Debugging | Change one thing at a time; check train() vs eval() mode |

Remember: Theory is in Chapter 5

For the mathematical foundations of regularization techniques (L1/L2 penalties, dropout derivation, normalization algorithms), see Chapter 5, Sections 5 and 6. This section focused on when and how to apply these techniques during training.

Quiz

Test your understanding of overfitting and regularization:

Overfitting & Regularization Quiz

Question 1/8

What does it mean when training loss is very low but validation loss is much higher?


Exercises

Diagnostic Scenarios

  1. Scenario: Your model has 99% training accuracy but only 60% validation accuracy. The validation loss started increasing after epoch 5. What's happening, and what steps would you take (in order of priority)?
  2. Scenario: Both your training and validation loss are stuck at high values (neither improving after 50 epochs). What might be wrong, and how would you diagnose it?
  3. Scenario: You added dropout(0.5) to all layers, and now both train and validation accuracy dropped significantly. What went wrong?
  4. Scenario: Your model works perfectly in training but gives random predictions during inference. What's the most likely cause?

Solution Hints

  1. Q1: Severe overfitting. Priority: (1) Early stopping at epoch 5, (2) Add data augmentation, (3) Increase weight decay, (4) Add dropout, (5) Reduce model capacity.
  2. Q2: Underfitting. Check: (1) Learning rate too low? (2) Too much regularization? (3) Model too small? (4) Data issues?
  3. Q3: Over-regularization. Reduce dropout rate (0.2-0.3) or remove from some layers. Also check if dropout is on output layer (it shouldn't be).
  4. Q4: Forgot to call model.eval() before inference. BatchNorm and Dropout behave differently in train vs eval mode.

Hands-On Experiments

  1. Weight decay sweep: Train a CNN on CIFAR-10 with weight_decay values (0, 1e-5, 1e-4, 1e-3, 1e-2). Plot train/val curves for each. Identify which value gives the best validation accuracy.
  2. Regularization combination: Compare three setups: (a) weight decay only, (b) dropout only, (c) both. Measure final validation accuracy and the train/val gap for each.
  3. Early stopping patience: Train with patience values (3, 5, 10, 20, 50). Record at which epoch training stopped and the final validation loss. Find the sweet spot.
  4. train() vs eval() experiment: Train a model with BatchNorm. Run inference both with and without calling model.eval(). Compare the predictions and accuracy.

Experiment Tips

  • Exercise 1: Use fixed random seed for fair comparison. Use TensorBoard to visualize all curves together.
  • Exercise 2: Keep total regularization budget similar (if comparing dropout 0.5 vs weight_decay 1e-2, neither should be overwhelming).
  • Exercise 3: Track not just final loss but also which epoch was "best" for each patience value.
  • Exercise 4: Test on a batch of 100+ samples to get statistically meaningful accuracy difference.

In the next section, we'll explore hyperparameter tuning—systematic approaches to finding optimal values for learning rate, regularization strength, and other hyperparameters.