Learning Objectives
By the end of this section, you will be able to:
- Detect overfitting: Use training and validation curves to diagnose overfitting and underfitting in your models
- Tune regularization hyperparameters: Find optimal values for weight decay, dropout rate, and early stopping patience
- Combine techniques effectively: Know which regularization methods work well together and which conflict
- Debug regularization problems: Diagnose under-regularization, over-regularization, and technique-specific issues
Prerequisite: Chapter 5
The Big Picture
You've learned what regularization techniques do (Chapter 5). Now the question is: how do you actually use them in practice?
Regularization is not just a matter of adding dropout or weight decay. It's about:
- Detection: Recognizing when your model is overfitting (or underfitting)
- Selection: Choosing which techniques to apply for your architecture
- Tuning: Finding the right hyperparameter values
- Combination: Making multiple techniques work together
- Debugging: Fixing problems when regularization isn't working
| Situation | Signs | Action |
|---|---|---|
| Under-regularization | Large train/val gap, val loss increasing | Add or strengthen regularization |
| Over-regularization | Both losses high, model underfits | Reduce regularization strength |
| Good balance | Small gap, both losses low and stable | Continue training or stop |
Detecting Overfitting
Explore how model complexity affects fitting. Increase the polynomial degree to see how the model can fit training points perfectly while losing the ability to generalize:
Overfitting Demonstration
Quick Check
In the demo above, what happens to the test error when you increase model complexity from 3 to 15?
Recognizing the Patterns
The primary tool for detecting overfitting is comparing training and validation loss curves over time:
Healthy Training
- Both losses decrease together
- The gap between them remains small and stable
- Both eventually plateau at similar values
Overfitting
- Training loss continues to decrease
- Validation loss decreases initially, then starts increasing
- The gap between them grows larger over time
Underfitting
- Both losses plateau at high values
- Neither loss shows significant improvement
- The model lacks capacity to learn the pattern
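The three patterns above can be turned into a rough diagnostic function over recorded loss curves. This is a minimal illustrative sketch, not a library utility; the thresholds (`high_loss`, `gap_ratio`) are arbitrary assumptions you would tune for your task:

```python
def diagnose(train_losses, val_losses, high_loss=1.0, gap_ratio=1.5):
    """Rough diagnosis from train/val loss histories.

    high_loss and gap_ratio are illustrative thresholds, not standards.
    """
    train, val = train_losses[-1], val_losses[-1]
    # Overfitting signature: val loss has turned back up from its minimum
    val_rising = len(val_losses) >= 2 and val_losses[-1] > min(val_losses)
    if train > high_loss and val > high_loss:
        return "underfitting"   # both losses plateau at high values
    if val > gap_ratio * train or val_rising:
        return "overfitting"    # gap grows or val loss starts increasing
    return "healthy"            # small stable gap, both losses low

# Train loss keeps falling while val loss turns upward -> overfitting
print(diagnose([0.9, 0.5, 0.3, 0.2], [0.9, 0.6, 0.55, 0.7]))
```

In practice you would call this on the loss histories your training loop already records, once per epoch.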
The Bias-Variance Tradeoff
The bias-variance tradeoff is fundamental to understanding why regularization works. Explore it interactively:
The Bias-Variance Tradeoff in Machine Learning
Optimal Complexity
The sweet spot! Model is complex enough to capture the pattern but not so complex that it fits noise. This minimizes total prediction error. In practice, found via cross-validation.
Expected Prediction Error = Bias² + Variance + Irreducible Noise
The key insight: optimal generalization occurs at a model complexity where the sum of bias and variance is minimized—not where either one is minimized individually.
See Chapter 5 for Theory
Regularization Strategy
You know what regularization techniques do from Chapter 5. Now let's focus on the practical question: how do you tune them?
Tuning Weight Decay
Weight decay (L2 regularization) is almost always beneficial. The question is finding the right strength:
Weight Decay (L2 Regularization) Effect
L2 regularization adds a penalty proportional to squared weight magnitude
Understanding Weight Decay
Blue ellipses show contours of constant data loss (where data loss is the same). The blue dot marks where data loss is minimized. Orange circles show contours of constant regularization penalty (centered at origin). The green dot shows where the total loss (data + regularization) is minimized. Notice how increasing λ pulls the optimum closer to the origin, shrinking the weight magnitudes. This prevents weights from growing too large, which helps prevent overfitting.
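This pull toward the origin can be computed exactly in one dimension. For a quadratic data loss (w − w*)² plus penalty λw², the total loss is minimized at w*/(1 + λ). A tiny sketch (w* = 2.0 is an arbitrary example value):

```python
def l2_optimum(w_star, lam):
    """Minimizer of (w - w_star)**2 + lam * w**2.

    Setting the derivative 2*(w - w_star) + 2*lam*w = 0 gives
    w = w_star / (1 + lam): larger lam shrinks w toward 0.
    """
    return w_star / (1 + lam)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(f"lambda={lam:5.1f} -> optimum w = {l2_optimum(2.0, lam):.3f}")
```

With λ = 0 the optimum sits at w* itself; as λ grows, the green dot in the demo slides toward the origin exactly as this formula predicts.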
| Weight Decay Value | When to Use | Signs It's Wrong |
|---|---|---|
| 1e-5 to 1e-4 | Small models, little overfitting risk | Large train/val gap persists |
| 1e-4 to 1e-3 | Most cases, good starting point | Check both gaps and absolute performance |
| 1e-3 to 1e-2 | Large models, severe overfitting | Training loss stays high (underfitting) |
| 1e-1+ | Rarely used | Almost always too strong |
AdamW vs Adam with weight_decay
Use AdamW for weight decay with Adam-family optimizers. The standard Adam(weight_decay=...) doesn't decouple weight decay from the adaptive learning rates, leading to suboptimal regularization.
Tuning Dropout
Dropout rate depends on layer type and model capacity:
| Layer Type | Typical Rate | Reasoning |
|---|---|---|
| Input layer | 0.1 - 0.2 | Don't drop too much input information |
| Hidden (FC) layers | 0.3 - 0.5 | Standard range, start at 0.5 |
| Conv layers (Dropout2d) | 0.1 - 0.3 | Spatial dropout, lower rates work |
| Before output | 0 (none) | Final layer needs all features |
| Transformers | 0.1 | Lower rates common with LayerNorm |
Dropout + BatchNorm Interaction
A safe layer ordering is Linear → BatchNorm → ReLU → Dropout.
Interactive: Dropout Visualization
See how dropout randomly disables neurons during training. Toggle between training and inference modes to see the difference:
Dropout Visualization
How Dropout Works
During Training: Each hidden neuron is randomly "dropped" (set to zero) with probability p = 0.50. This means each forward pass uses a different random subset of the network.
This prevents neurons from co-adapting too much to each other. Each neuron must learn useful features independently, without relying on specific other neurons being present.
Quick Check
Why do we need to scale neuron outputs during dropout?
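The scaling question can be answered with a from-scratch sketch of inverted dropout (the variant PyTorch uses): survivors are scaled by 1/(1 − p) during training so the expected activation matches inference, where no dropout is applied. This is an illustration, not PyTorch's actual implementation:

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
dropped = inverted_dropout(acts, p=0.5, rng=rng)

# Scaling preserves the expected value: the mean stays close to 1.0,
# so the next layer sees the same average input at train and test time.
print(f"mean after dropout: {sum(dropped) / len(dropped):.3f}")
```

Without the 1/(1 − p) factor, activations would be roughly halved during training (for p = 0.5), and the network would see systematically larger inputs at inference time.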
Early Stopping
Early stopping is a simple but effective regularization strategy: stop training before the model overfits.
The Algorithm
- Monitor validation loss after each epoch
- Track the best validation loss seen so far and save that model checkpoint
- If validation loss doesn't improve for patience consecutive epochs, stop training
- Restore the model weights from the best checkpoint
Why It Works
Early stopping is a form of implicit regularization. As training progresses:
- Early epochs: Model learns general patterns (bias decreases, variance stays low)
- Middle epochs: Model reaches optimal generalization point
- Late epochs: Model starts memorizing training data (bias stays low, variance increases)
By stopping at the optimal point, we prevent the variance from increasing while maintaining the learned general patterns.
Early Stopping as Regularization
Interactive: Early Stopping
Watch a training run and see how early stopping detects the optimal point to stop training:
Early Stopping Demonstration
How Early Stopping Works
Early stopping monitors the validation loss during training. If the validation loss doesn't improve for 10 consecutive epochs (the "patience"), training stops and we restore the model weights from the best epoch. This prevents the model from continuing to memorize training data after it has stopped learning generalizable patterns.
Quick Check
What happens if you set patience too low?
Early Stopping in Practice
Early stopping prevents overfitting by stopping training when validation loss stops improving. The key decisions are:
| Parameter | Typical Values | Trade-off |
|---|---|---|
| Patience | 5-20 epochs | Low: may stop prematurely. High: may overfit before stopping |
| Min delta | 0 to 1e-4 | Ignore tiny improvements that may be noise |
| Monitor metric | val_loss (usually) | val_accuracy for classification if loss is noisy |
Normalization in Training
- Always call model.train() / model.eval(): BatchNorm behaves differently in each mode
- CNNs: Use BatchNorm2d (benefits from batch statistics)
- Transformers/RNNs: Use LayerNorm (batch-size independent)
- Small batches: Prefer LayerNorm or GroupNorm over BatchNorm
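The batch-size dependence comes from which axis the statistics are computed over. A minimal numerical sketch (no PyTorch, toy 2×3 batch): BatchNorm averages each feature across the batch, while LayerNorm averages each sample across its own features, so LayerNorm gives the same answer regardless of batch size:

```python
def mean(xs):
    return sum(xs) / len(xs)

batch = [[1.0, 2.0, 3.0],   # sample 0
         [3.0, 4.0, 5.0]]   # sample 1

# BatchNorm statistics: one mean per feature, computed across the batch
bn_means = [mean([row[j] for row in batch]) for j in range(3)]

# LayerNorm statistics: one mean per sample, computed across its features
ln_means = [mean(row) for row in batch]

print("BatchNorm per-feature means:", bn_means)  # depend on batch composition
print("LayerNorm per-sample means:", ln_means)   # unchanged by batch size
```

With a batch of one, the BatchNorm means collapse to the single sample's raw values (making normalization meaningless), while the LayerNorm means are unaffected; this is why small batches favor LayerNorm or GroupNorm.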
Combining Regularization Techniques
Regularization techniques can be combined, but some combinations work better than others:
Complementary Combinations
| Combination | Effect | Recommended |
|---|---|---|
| Weight Decay + Early Stopping | Continuous shrinkage + stopping at optimal point | Almost always |
| Weight Decay + Data Augmentation | Model regularization + data regularization | Yes, especially for images |
| Dropout + Weight Decay | Different mechanisms, complementary | Yes, may reduce both strengths slightly |
| BatchNorm + Weight Decay | Normalization + shrinkage | Yes, but use lower weight decay |
| Dropout + BatchNorm | Can conflict; use carefully | BatchNorm first, then Dropout |
Conflicting Combinations
Watch for Conflicts
- Heavy Dropout + Heavy Weight Decay: Can cause severe underfitting. Use moderate amounts of each.
- Dropout immediately after BatchNorm: Disrupts batch statistics. Place BatchNorm before activation, dropout after.
- Multiple strong regularizers: Start with one technique, add others gradually.
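The safe ordering described above can be written as a small PyTorch block. This is a sketch of a single hidden layer, not a full architecture; the sizes (64, 32) and dropout rate are arbitrary example values:

```python
import torch
import torch.nn as nn

# BatchNorm before the activation, dropout after it,
# so dropout never disturbs the batch statistics
block = nn.Sequential(
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(0.3),
)

x = torch.randn(16, 64)   # batch of 16 samples
block.train()
print(block(x).shape)     # torch.Size([16, 32])
```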
Recommended Baseline Setup
# Solid baseline for most tasks
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # Start here, tune if needed
)

# Architecture includes appropriate normalization
# (BatchNorm for CNNs, LayerNorm for Transformers)

# Moderate dropout in FC layers
nn.Dropout(0.3)  # Not too aggressive

# Early stopping as safety net
early_stopping = EarlyStopping(patience=10)

# Data augmentation for image tasks
transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
Debugging Regularization Issues
When regularization isn't working as expected, use this diagnostic guide:
Problem: Still Overfitting
Symptoms: Large train/val gap, val loss increasing while train loss decreases.
- Increase weight decay (try 10x higher)
- Add or increase dropout rate
- Add more data augmentation
- Reduce model capacity (fewer layers/neurons)
- Lower early stopping patience
Problem: Underfitting
Symptoms: Both train and val loss high, neither improving.
- Reduce weight decay (try 10x lower)
- Reduce or remove dropout
- Increase model capacity
- Train longer (increase patience)
- Check if learning rate is too low
Problem: Training Unstable
Symptoms: Loss spikes, NaN values, erratic behavior.
- Check BatchNorm: are you calling model.train()/model.eval()?
- Reduce learning rate
- Check for gradient explosion (add gradient clipping)
- Ensure batch size is large enough for BatchNorm (at least 16-32)
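Gradient clipping goes between backward() and step(). A minimal sketch of the placement, using a tiny linear model and random data as stand-ins for your real model and batch:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 1.0;
# returns the norm measured before clipping (useful for logging)
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(f"gradient norm before clipping: {total_norm:.3f}")
```

Logging the returned pre-clip norm over time is a cheap way to spot the gradient spikes that cause the instability described above.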
Systematic Tuning
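Systematic tuning means changing one hyperparameter at a time over a log-spaced grid and keeping the value with the best validation score. A minimal sketch of that loop; train_and_validate here is a hypothetical stand-in for your real training run (it fakes a validation loss curve, minimized near wd = 1e-3, purely for illustration):

```python
import math

def train_and_validate(weight_decay):
    """Hypothetical stand-in: pretend val loss is minimized near wd=1e-3."""
    return (math.log10(weight_decay) + 3) ** 2 + 0.5

# Log-spaced grid, everything else held fixed
grid = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
results = {wd: train_and_validate(wd) for wd in grid}
best_wd = min(results, key=results.get)
print(f"best weight_decay: {best_wd}")
```

The same loop works for any single hyperparameter (dropout rate, patience); the key discipline is keeping the random seed and all other settings fixed across runs.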
PyTorch Implementation
L2 Regularization (Weight Decay)
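A sketch of setting weight decay with AdamW, including the common refinement of excluding biases and normalization parameters from decay via parameter groups. The split rule used here (decay anything with dimension ≥ 2) is one common convention, not the only one:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1)
)

decay, no_decay = [], []
for name, p in model.named_parameters():
    # Weight matrices get decay; biases and norm scales/shifts do not
    (decay if p.dim() >= 2 else no_decay).append(p)

optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 1e-2},
    {"params": no_decay, "weight_decay": 0.0},
], lr=1e-3)
print(len(decay), len(no_decay))  # 2 weight matrices, 4 bias/norm tensors
```

Passing weight_decay=1e-2 directly to AdamW (as in the baseline setup above) is the simpler starting point; parameter groups are an optional refinement once the basics work.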
Dropout
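A sketch of nn.Dropout in use, showing the train()/eval() behavior difference this section stresses: active and scaled during training, an identity function at inference:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))  # identity: dropout is disabled at inference
```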
Early Stopping
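A minimal sketch of an EarlyStopping helper matching the algorithm described earlier (monitor validation loss, patience, min_delta, remember the best epoch). Saving and restoring the checkpoint is left to the caller; this class only tracks when to stop:

```python
class EarlyStopping:
    """Stop when val loss hasn't improved by min_delta for patience epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.counter = 0

    def step(self, val_loss, epoch):
        """Record this epoch's val loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # improvement: reset the counter
            self.best_epoch = epoch     # caller should checkpoint here
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.75, 0.74, 0.73, 0.76]):
    if stopper.step(loss, epoch):
        print(f"stopped at epoch {epoch}, best was epoch {stopper.best_epoch}")
        break
```

In a real loop, you would save model.state_dict() whenever step() records a new best epoch and reload that checkpoint after stopping.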
Summary
This section covered practical regularization strategies for training neural networks. Key takeaways:
| Topic | Key Practical Insight |
|---|---|
| Detecting Overfitting | Monitor train vs val loss gap; val increasing = overfitting |
| Weight Decay Tuning | Start at 1e-3 to 1e-2 with AdamW; adjust based on gap |
| Dropout Tuning | 0.3-0.5 for FC layers, 0.1-0.2 for conv; none before output |
| Early Stopping | patience=10-15; saves best weights automatically |
| Combining Techniques | Weight decay + early stopping always; add dropout gradually |
| Debugging | Change one thing at a time; check train() vs eval() mode |
Remember: Theory is in Chapter 5
Quiz
Test your understanding of overfitting and regularization:
Overfitting & Regularization Quiz
What does it mean when training loss is very low but validation loss is much higher?
Exercises
Diagnostic Scenarios
- Scenario: Your model has 99% training accuracy but only 60% validation accuracy. The validation loss started increasing after epoch 5. What's happening, and what steps would you take (in order of priority)?
- Scenario: Both your training and validation loss are stuck at high values (neither improving after 50 epochs). What might be wrong, and how would you diagnose it?
- Scenario: You added dropout(0.5) to all layers, and now both train and validation accuracy dropped significantly. What went wrong?
- Scenario: Your model works perfectly in training but gives random predictions during inference. What's the most likely cause?
Solution Hints
- Q1: Severe overfitting. Priority: (1) Early stopping at epoch 5, (2) Add data augmentation, (3) Increase weight decay, (4) Add dropout, (5) Reduce model capacity.
- Q2: Underfitting. Check: (1) Learning rate too low? (2) Too much regularization? (3) Model too small? (4) Data issues?
- Q3: Over-regularization. Reduce dropout rate (0.2-0.3) or remove from some layers. Also check if dropout is on output layer (it shouldn't be).
- Q4: Forgot to call model.eval() before inference. BatchNorm and Dropout behave differently in train vs eval mode.
Hands-On Experiments
- Weight decay sweep: Train a CNN on CIFAR-10 with weight_decay values (0, 1e-5, 1e-4, 1e-3, 1e-2). Plot train/val curves for each. Identify which value gives the best validation accuracy.
- Regularization combination: Compare three setups: (a) weight decay only, (b) dropout only, (c) both. Measure final validation accuracy and the train/val gap for each.
- Early stopping patience: Train with patience values (3, 5, 10, 20, 50). Record at which epoch training stopped and the final validation loss. Find the sweet spot.
- train() vs eval() experiment: Train a model with BatchNorm. Run inference both with and without calling model.eval(). Compare the predictions and accuracy.
Experiment Tips
- Exercise 1: Use fixed random seed for fair comparison. Use TensorBoard to visualize all curves together.
- Exercise 2: Keep total regularization budget similar (if comparing dropout 0.5 vs weight_decay 1e-2, neither should be overwhelming).
- Exercise 3: Track not just final loss but also which epoch was "best" for each patience value.
- Exercise 4: Test on a batch of 100+ samples to get statistically meaningful accuracy difference.
In the next section, we'll explore hyperparameter tuning—systematic approaches to finding optimal values for learning rate, regularization strength, and other hyperparameters.