Learning Objectives
By the end of this section, you will:
- Understand the overfitting problem and why neural networks are prone to memorizing training data
- Master dropout—the ingenious technique of randomly silencing neurons to prevent co-adaptation
- Know the mathematics behind dropout, including inverted dropout and the ensemble interpretation
- Understand weight decay (L2 regularization) and its geometric interpretation as shrinking weights toward zero
- Implement early stopping to halt training at the optimal generalization point
- Recognize when and how to combine multiple regularization techniques
- Implement all regularization methods in PyTorch with production-ready code
The Big Picture
In the previous sections, we learned how to build powerful neural networks with linear layers, activation functions, and loss functions. But there's a fundamental tension in machine learning: a model that perfectly fits the training data often fails on new, unseen data. This is the overfitting problem—perhaps the most important challenge in deep learning.
The Core Tension: We want our models to learn meaningful patterns, not memorize specific examples. A neural network with millions of parameters can easily memorize the training set, achieving zero training error while failing completely on test data. Regularization techniques constrain the model to learn simpler, more generalizable patterns.
Regularization is the umbrella term for techniques that prevent overfitting. In this section, we'll focus on the most important ones:
| Technique | Key Idea | When to Use |
|---|---|---|
| Dropout | Randomly silence neurons during training | Almost always for deep networks |
| Weight Decay (L2) | Penalize large weights | Default in most optimizers |
| Early Stopping | Stop before overfitting occurs | When validation loss plateaus |
| Data Augmentation | Create synthetic training examples | Limited data scenarios |
| Batch Normalization | Normalize layer inputs | Stabilizes training, mild regularization |
Historical Context
Regularization has been studied since the early days of machine learning. L2 regularization (ridge regression) dates back to 1970. But dropout, introduced by Hinton et al. in 2012, was revolutionary. The idea seemed counterintuitive: randomly disabling neurons during training should hurt performance, right? Instead, dropout became one of the most effective regularization techniques, enabling the training of much deeper networks and winning competitions like ImageNet.
The Overfitting Problem
To understand regularization, we first need to deeply understand the problem it solves. Overfitting occurs when a model learns the training data too well, capturing noise and peculiarities specific to that data rather than the underlying pattern.
The Bias-Variance Tradeoff
The prediction error of any model can be decomposed into three parts:
- Bias: Error from oversimplified assumptions. High bias models underfit—they can't capture the true pattern.
- Variance: Error from sensitivity to training data fluctuations. High variance models overfit—they change dramatically with different training sets.
- Irreducible Noise: Inherent randomness in the data. No model can predict this.
Neural networks are high-capacity models—they have low bias but can have very high variance. Regularization reduces variance at the cost of slightly increased bias, achieving better overall performance.
Signs of Overfitting
| Indicator | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training loss | High | Low | Very low (near zero) |
| Validation loss | High | Low | High (and increasing) |
| Train-val gap | Small | Small | Large (and growing) |
| Model complexity | Too simple | Appropriate | Too complex |
The Fundamental Problem
Interactive: Overfitting in Action
Watch overfitting happen in real-time. Train a neural network on a small dataset and observe how training loss decreases while validation loss eventually increases. The gap between them is the hallmark of overfitting.
Quick Check
You're training a neural network and observe: training accuracy = 99.8%, validation accuracy = 72%. What's happening?
Regularization: The Core Intuition
At its heart, regularization is about constraining the model to prefer simpler solutions. The intuition comes from Occam's Razor: among models that fit the data equally well, the simpler one is more likely to generalize.
Mathematical Framework
In the standard formulation, we minimize a combined objective:

$$L_{\text{total}}(\theta) = L_{\text{data}}(\theta) + \lambda\, \Omega(\theta)$$

Where:
- $L_{\text{data}}$ is the standard loss on training data (e.g., cross-entropy)
- $\Omega(\theta)$ is the regularization term that penalizes complexity
- $\lambda$ is the regularization strength (hyperparameter)
Different choices of $\Omega$ give different regularization methods:
| Regularization | Formula | Effect on Weights |
|---|---|---|
| L2 (Weight Decay) | ∑ wᵢ² | Small but non-zero |
| L1 (Lasso) | ∑ \|wᵢ\| | Many exactly zero (sparse) |
| Elastic Net | α∑\|wᵢ\| + (1−α)∑wᵢ² | Combination of both |
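To make the table concrete, here is a sketch of computing these penalties by hand in PyTorch (the model and coefficient values are purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for any model

# Sum the penalty over all parameters
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
l1_penalty = sum(p.abs().sum() for p in model.parameters())

alpha, lam = 0.5, 1e-4  # illustrative mixing and strength values
elastic_net = alpha * l1_penalty + (1 - alpha) * l2_penalty

# The penalty is simply added to the data loss before calling backward():
# loss = criterion(output, target) + lam * l2_penalty
```

In practice you rarely write the L2 term yourself—optimizers expose it as `weight_decay`—but L1 and Elastic Net penalties are still added to the loss manually like this.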
Implicit Regularization
Dropout: Random Neuron Silencing
Dropout is elegantly simple: during training, randomly set a fraction of neuron activations to zero. At each forward pass, a different random subset of neurons is "dropped out."
Surviving activations are scaled up by $1/(1-p)$, where $p$ is the drop probability. This scaling is crucial—the scheme is called inverted dropout—because it ensures the expected value of the outputs remains the same during training and inference.
Why Does Dropout Work?
There are several complementary explanations:
- Prevents Co-adaptation: Without dropout, neurons can develop complex co-dependencies where specific combinations produce the output. Dropout forces each neuron to be useful on its own.
- Implicit Ensemble: Each forward pass uses a different "thinned" network. Training with dropout is like training an exponential number of networks ($2^n$ for $n$ droppable neurons) and averaging their predictions.
- Noise Injection: Dropout adds stochasticity, making the model robust to small perturbations and preventing it from fitting noise.
- Adaptive Regularization: Neurons that are more useful get updated more often. Less useful neurons are "tested" less, providing implicit feature selection.
Hinton's Analogy: "If you want to develop a robust economy, don't allow companies to become too big to fail. Dropout does the same for neurons—no neuron can become so important that the network fails without it."
Interactive: Dropout Visualization
Explore dropout interactively. Toggle between training and inference modes to see how neurons are randomly dropped. Adjust the dropout rate and watch how the network changes. Click "Animate Dropout" to see different dropout masks in action.
Typical Dropout Rates

As rough defaults (echoed in the architecture table later in this section): around 0.5 for fully connected hidden layers, 0.2–0.3 for convolutional layers (as Dropout2D), and about 0.1 for transformers; input layers, if regularized at all, use lower rates.
Quick Check
If you apply dropout with p=0.3 during training and don't scale the outputs, what happens during inference?
The Mathematics of Dropout
Inverted Dropout
In the original dropout paper, outputs were scaled by the keep probability $(1-p)$ at inference. Modern implementations use inverted dropout, which scales by $1/(1-p)$ during training instead. This has a key advantage: no modification is needed at inference time.
$$\tilde{h} = \frac{m \odot h}{1 - p}$$

Where $m$ is the binary mask with $m_i \sim \text{Bernoulli}(1-p)$ and $\odot$ denotes element-wise multiplication.
Expected Value Analysis
Let's verify that inverted dropout preserves expected values. For a single activation $h$ with drop probability $p$:

$$\mathbb{E}[\tilde{h}] = (1 - p) \cdot \frac{h}{1 - p} + p \cdot 0 = h$$
The expected output during training equals the actual output during inference. This means subsequent layers see consistent input magnitudes regardless of mode.
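This expectation is easy to check numerically. The sketch below applies inverted dropout to a constant activation and confirms the mean is preserved (the vector size and drop rate are arbitrary choices):

```python
import torch

torch.manual_seed(0)
p = 0.5                    # drop probability
h = torch.ones(1_000_000)  # constant activation, so the mean is easy to read

# Inverted dropout: zero with probability p, scale survivors by 1/(1-p)
mask = (torch.rand_like(h) >= p).float()
h_train = h * mask / (1 - p)

print(h_train.mean())  # ≈ 1.0, matching the untouched inference-time output
```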
Ensemble Interpretation
Training with dropout can be viewed as training an exponentially large ensemble of networks. For a network with $n$ droppable units, there are $2^n$ possible "thinned" networks.
At inference, using all neurons with scaled weights approximates the geometric mean of predictions from all these networks—a form of model averaging that's typically very effective for reducing variance.
Dropout and Bayesian Inference
Weight Decay (L2 Regularization)
Weight decay (also called L2 regularization) adds a penalty term proportional to the sum of squared weights:

$$L_{\text{total}} = L_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2$$

This modifies the gradient update rule:

$$w \leftarrow w - \eta \left( \frac{\partial L_{\text{data}}}{\partial w} + \lambda w \right) = (1 - \eta \lambda)\, w - \eta\, \frac{\partial L_{\text{data}}}{\partial w}$$

The $(1 - \eta\lambda)$ factor multiplicatively shrinks weights toward zero at each step, hence the name "weight decay."
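The two forms of the update are algebraically identical, as a quick numerical sketch confirms (the values of $\eta$, $\lambda$, $w$, and the gradient are arbitrary):

```python
import torch

eta, lam = 0.1, 0.01      # learning rate and decay strength (illustrative)
w = torch.tensor(2.0)
grad = torch.tensor(0.5)  # gradient of the data loss alone

# Penalty folded into the gradient...
w_new = w - eta * (grad + lam * w)

# ...equals the shrink-then-step ("decay") form
w_decay = (1 - eta * lam) * w - eta * grad

print(w_new.item(), w_decay.item())  # same value, up to float rounding
```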
Why Does Weight Decay Help?
- Prevents Large Weights: Large weights indicate the model is very sensitive to specific features. Penalizing them encourages smoother, more robust functions.
- Reduces Effective Capacity: Small weights mean the network behaves more linearly, reducing its ability to fit complex noise patterns.
- Improves Conditioning: Weight decay acts as a prior toward simpler models, improving the optimization landscape and making training more stable.
L1 vs L2 Regularization
| Property | L1 (Lasso) | L2 (Ridge/Weight Decay) |
|---|---|---|
| Penalty | ∑\|wᵢ\| | ∑wᵢ² |
| Effect on weights | Many exactly zero | Small but non-zero |
| Feature selection | Automatic (sparse) | No (all features used) |
| Computational cost | Slightly higher | Negligible |
| Gradient at zero | Undefined (subgradient) | Zero |
Weight Decay vs Adam
Prefer torch.optim.AdamW over torch.optim.Adam with weight_decay. In Adam, an L2 penalty added to the loss is rescaled by the adaptive per-parameter learning rates, so it no longer acts as a uniform decay; AdamW decouples the decay from the gradient update, which is usually what you want.

Interactive: Weight Decay Effects
Visualize how weight decay affects the learned decision boundary. Without regularization, the model fits a complex boundary that perfectly separates training points but may not generalize. With weight decay, the boundary becomes smoother.
Quick Check
If you increase the weight decay parameter λ too much, what happens?
Early Stopping
Early stopping is perhaps the simplest regularization technique: just stop training before the model overfits. We monitor validation loss and halt when it stops improving.
The Overfitting Trajectory
During training, we typically observe three phases:
- Underfitting phase: Both training and validation loss decrease. The model is learning useful patterns.
- Sweet spot: Validation loss reaches its minimum. This is where we should stop.
- Overfitting phase: Training loss continues decreasing, but validation loss increases. The model is memorizing training data.
Implementation Details
A robust early stopping implementation needs:
- Patience: Number of epochs to wait for improvement before stopping
- Min delta: Minimum change to qualify as improvement
- Best model checkpointing: Save the model when validation loss improves
- Restore best: After stopping, restore the best model (not the last one)
Effective Model Complexity

Training time itself acts as a complexity control: the longer training runs, the farther weights can travel from their small initial values. Stopping early therefore restricts the reachable hypothesis space—an effect closely related to L2 regularization.
Interactive: Early Stopping
Watch early stopping in action. The visualization shows training and validation loss curves. See how the gap between them grows during overfitting, and how early stopping finds the optimal stopping point.
Quick Check
Why should we restore the best model after early stopping, rather than using the model at the stopping epoch?
Stochastic Depth
Stochastic Depth (also called DropPath) is a regularization technique that randomly drops entire layers (or residual blocks) during training. Unlike dropout which drops individual neurons, stochastic depth drops entire computational paths.
$$y = x + b \cdot F(x), \qquad b \sim \text{Bernoulli}(1 - p_{\text{drop}})$$

Where $F$ is a residual block, $b$ is a per-sample binary gate, and $p_{\text{drop}}$ is the drop probability. When a layer is dropped ($b = 0$), the input simply passes through unchanged (identity). As with inverted dropout, implementations also divide the surviving path by $1 - p_{\text{drop}}$ to preserve expected values.
Why Does It Work?
- Implicit Ensemble: Like dropout, training samples different sub-networks with varying depths, creating an implicit ensemble
- Shorter Effective Paths: Reduces the effective depth during training, making gradients flow more easily to early layers
- Regularization: Forces the network to not rely on any single layer being present
- Faster Training: Dropped layers don't need forward/backward computation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticDepth(nn.Module):
    """Drop entire residual blocks during training (DropPath)."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x, residual):
        if not self.training or self.drop_prob == 0.0:
            return x + residual

        # Survival probability
        keep_prob = 1 - self.drop_prob

        # Random mask with shape (batch_size, 1, 1, ...) for broadcasting
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = (torch.rand(shape, device=x.device) < keep_prob).float()

        # Scale surviving paths to maintain the expected value
        return x + residual * random_tensor / keep_prob

# Usage in a ResNet block
class ResBlock(nn.Module):
    def __init__(self, channels, drop_prob=0.2):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.stochastic_depth = StochasticDepth(drop_prob)

    def forward(self, x):
        residual = F.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return F.relu(self.stochastic_depth(x, residual))

# Linear decay schedule (deeper layers have higher drop probability)
def get_drop_prob(layer_idx, total_layers, max_drop_prob=0.5):
    return (layer_idx / total_layers) * max_drop_prob
```

Linear Decay Schedule

`get_drop_prob` above implements the schedule from the original stochastic depth paper: the drop probability grows linearly with depth, so early layers (which extract low-level features) are almost always kept while deeper layers are dropped more often.
Mixup and CutMix
Mixup and CutMix are data augmentation techniques that create new training examples by combining existing ones. They act as powerful regularizers by forcing the model to learn smoother decision boundaries.
Mixup
Mixup creates new training examples by taking convex combinations of pairs of examples:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

Where $\lambda \sim \text{Beta}(\alpha, \alpha)$, with $\alpha \in [0.1, 0.4]$ being typical.
```python
import torch
import numpy as np

def mixup_data(x, y, alpha=0.2):
    """Apply mixup augmentation to a batch."""
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1.0

    batch_size = x.size(0)
    # Random permutation for pairing
    index = torch.randperm(batch_size, device=x.device)

    # Mix inputs; keep both sets of targets for the mixed loss
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]

    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    """Mixup loss: weighted combination of losses."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Training loop with mixup
for x, y in dataloader:
    optimizer.zero_grad()
    x, y_a, y_b, lam = mixup_data(x, y, alpha=0.2)
    output = model(x)
    loss = mixup_criterion(criterion, output, y_a, y_b, lam)
    loss.backward()
    optimizer.step()
```

CutMix
CutMix cuts and pastes rectangular regions between training images:
$$\tilde{x} = M \odot x_i + (1 - M) \odot x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

Where $M$ is a binary mask indicating the cut region, $\odot$ is element-wise multiplication, and $\lambda$ is the proportion of the first image that remains.
```python
import torch
import numpy as np

def rand_bbox(size, lam):
    """Generate a random bounding box for CutMix."""
    H, W = size[2], size[3]  # size is (N, C, H, W)
    cut_rat = np.sqrt(1. - lam)
    cut_h = int(H * cut_rat)
    cut_w = int(W * cut_rat)

    # Random center
    cy = np.random.randint(H)
    cx = np.random.randint(W)

    # Bounding box coordinates (clipped to the image)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)

    return bby1, bby2, bbx1, bbx2

def cutmix_data(x, y, alpha=1.0):
    """Apply CutMix augmentation to a batch."""
    lam = np.random.beta(alpha, alpha)

    batch_size = x.size(0)
    index = torch.randperm(batch_size, device=x.device)

    # Generate cut region
    bby1, bby2, bbx1, bbx2 = rand_bbox(x.size(), lam)

    # Cut and paste
    x_cutmix = x.clone()
    x_cutmix[:, :, bby1:bby2, bbx1:bbx2] = x[index, :, bby1:bby2, bbx1:bbx2]

    # Adjust lambda to the actual cut area (clipping can shrink the box)
    lam = 1 - ((bby2 - bby1) * (bbx2 - bbx1) / (x.size(-2) * x.size(-1)))

    return x_cutmix, y, y[index], lam
```

| Technique | How It Works | Best For |
|---|---|---|
| Mixup | Blends entire images with linear interpolation | General image classification |
| CutMix | Cuts and pastes rectangular regions; stronger spatial regularization | Object localization; objects spanning small regions |
| Both | Soft labels improve calibration | Uncertainty estimation |
Expected Improvements
Gradient Clipping
Gradient clipping limits the magnitude of gradients during training to prevent the exploding gradient problem. This is essential for training RNNs and transformers, and useful as a safety measure in any deep network.
Two Types of Gradient Clipping
There are two main approaches:
- Clip by Value: Clip each gradient element independently to the range $[-c, c]$
- Clip by Norm: Scale the entire gradient vector if its norm exceeds a threshold
```python
import torch
import torch.nn as nn

model = YourModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training loop with gradient clipping (in practice, pick ONE of the two methods)
for x, y in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()

    # Method 1: Clip by norm (RECOMMENDED)
    # Scales all gradients if their total norm exceeds max_norm
    torch.nn.utils.clip_grad_norm_(
        model.parameters(),
        max_norm=1.0  # Maximum gradient norm
    )

    # Method 2: Clip by value (less common)
    # Clips each gradient element independently
    torch.nn.utils.clip_grad_value_(
        model.parameters(),
        clip_value=0.5  # Max absolute value
    )

    optimizer.step()

# Monitoring gradient norms (useful for debugging)
def get_grad_norm(model):
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    return total_norm ** 0.5

# In the training loop
grad_norm = get_grad_norm(model)
if grad_norm > 100:
    print(f"Warning: Large gradient norm {grad_norm:.2f}")
```

When to Use Gradient Clipping
| Scenario | Recommended max_norm | Notes |
|---|---|---|
| Transformers | 1.0 | Standard for attention models |
| RNNs/LSTMs | 5.0 | Longer sequences need more tolerance |
| Deep CNNs | Usually not needed | BatchNorm helps stabilize gradients |
| GANs | 1.0 - 10.0 | Helps stabilize adversarial training |
| Debugging | Monitor first | Find typical norm, then clip 2-5x that |
Clipping Too Aggressively

If max_norm is set far below the gradient norms your model typically produces, clipping rescales nearly every update and effectively lowers the learning rate, slowing or stalling training. Monitor typical gradient norms first, then clip at roughly 2–5× that value, as the table above suggests.
Knowledge Distillation
Knowledge distillation transfers knowledge from a large, powerful "teacher" model to a smaller, efficient "student" model. The student learns to mimic the teacher's outputs, often achieving better performance than if trained directly on the labels.
How It Works
Instead of training on hard labels $y$ alone, the student learns from the teacher's soft predictions $\text{softmax}(z_t / T)$. These soft targets contain more information—they tell the student which classes are similar.

$$L = \alpha\, T^2\, \mathrm{KL}\!\left(\mathrm{softmax}(z_t / T) \,\middle\|\, \mathrm{softmax}(z_s / T)\right) + (1 - \alpha)\, \mathrm{CE}(z_s,\, y)$$

Where $T$ is the temperature (higher = softer probabilities), $z_s$ and $z_t$ are the student/teacher logits, and $\alpha$ balances the distillation vs ground-truth loss. The $T^2$ factor keeps the gradient magnitudes of the two terms comparable.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """Knowledge distillation loss."""

    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # Hard label loss (standard cross-entropy)
        hard_loss = self.ce_loss(student_logits, labels)

        # Soft label loss (KL divergence with temperature)
        soft_student = F.log_softmax(student_logits / self.temperature, dim=1)
        soft_teacher = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')

        # Scale by T^2 as per Hinton et al.
        soft_loss = soft_loss * (self.temperature ** 2)

        # Combined loss
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Training with distillation
teacher_model = load_pretrained_teacher()
teacher_model.eval()  # Teacher in eval mode, frozen

student_model = SmallModel()
criterion = DistillationLoss(temperature=4.0, alpha=0.7)
optimizer = torch.optim.Adam(student_model.parameters())

for x, y in dataloader:
    # Get teacher predictions (no gradients)
    with torch.no_grad():
        teacher_logits = teacher_model(x)

    # Train student
    student_logits = student_model(x)
    loss = criterion(student_logits, teacher_logits, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Why Distillation Works as Regularization
- Soft labels are smoother: They prevent the student from becoming overconfident, similar to label smoothing
- Dark knowledge: Teacher's mistakes reveal class relationships (e.g., confusing "cat" with "tiger" indicates they're similar)
- Implicit data augmentation: The teacher's ensemble of features effectively augments the training signal
| Use Case | Teacher | Student | Temperature |
|---|---|---|---|
| Model compression | Large pretrained model | Small deployment model | 4-20 |
| Self-distillation | Same model (trained longer) | Same model (fresh) | 1-4 |
| Ensemble distillation | Ensemble of models | Single model | 2-10 |
| Label smoothing alternative | Soft labels | Any model | 1-2 |
Self-Distillation
Quick Check
Why does knowledge distillation use a temperature parameter T > 1 for the softmax?
Other Regularization Techniques
Data Augmentation
Artificially expand the training set by applying transformations that preserve labels: rotations, flips, crops, color jitter, etc. This is one of the most effective regularization techniques when data is limited.
- Images: Random crops, flips, rotations, color augmentation, cutout, mixup
- Text: Synonym replacement, back-translation, random insertion/deletion
- Audio: Time stretching, pitch shifting, adding noise
Batch Normalization
While primarily designed for training stability, batch normalization has a regularization effect due to the noise introduced by mini-batch statistics. Each sample's normalization depends on other samples in the batch, adding stochasticity similar to dropout.
Label Smoothing
Instead of hard targets (probability 1 on the true class, 0 elsewhere), use soft targets such as $1 - \epsilon$ on the true class with $\epsilon$ spread over the remaining classes (e.g., $\epsilon = 0.1$). This prevents the model from becoming overconfident and improves generalization, especially in classification tasks.
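PyTorch builds this directly into its cross-entropy loss via the label_smoothing argument; a minimal sketch (random logits and an arbitrary $\epsilon$):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon = 0.1

logits = torch.randn(4, 10)           # batch of 4, 10 classes
targets = torch.tensor([1, 3, 5, 7])
loss = criterion(logits, targets)     # smoothed cross-entropy
```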
Noise Injection
- Input noise: Add Gaussian noise to inputs (denoising autoencoders)
- Weight noise: Add noise to weights during training
- Gradient noise: Add noise to gradients (can help escape local minima)
Max-Norm Constraint
Clip weight vectors to have at most a maximum norm $c$:

$$w \leftarrow c\, \frac{w}{\lVert w \rVert_2} \quad \text{whenever } \lVert w \rVert_2 > c$$
This bounds the capacity of individual neurons and is often used with dropout.
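A minimal sketch of enforcing the constraint after an optimizer step (the layer sizes and the choice of max_norm = 3.0 are illustrative; the weights are inflated first so the constraint visibly binds):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 50)
max_norm = 3.0  # a common choice when combined with dropout

with torch.no_grad():
    layer.weight.mul_(100.0)  # inflate weights so the constraint actually binds

    # After each optimizer step: rescale any neuron whose incoming-weight
    # vector exceeds max_norm back onto the constraint surface
    w = layer.weight                      # shape (out_features, in_features)
    norms = w.norm(dim=1, keepdim=True)   # per-neuron weight norms
    scale = (max_norm / norms).clamp(max=1.0)
    w.mul_(scale)
```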
PyTorch Implementation
Let's implement all the regularization techniques we've discussed using PyTorch.
Dropout Basics
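A minimal sketch of nn.Dropout and the train/eval distinction (the input is a constant vector so the $1/(1-p)$ scaling is easy to see):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()   # training mode: each element is zeroed with probability p,
out = drop(x)  # survivors are scaled by 1/(1-p) — so values are 0.0 or 2.0
print(out)

drop.eval()    # eval mode: dropout is the identity
print(drop(x))  # all ones
```

Forgetting to call model.eval() before validation or inference is one of the most common dropout bugs.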
Dropout Variants
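Beyond plain nn.Dropout, PyTorch ships variants for specific architectures. A sketch of two common ones (shapes and rates are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 8, 16, 16)  # (batch, channels, H, W)

# Dropout2d zeroes entire feature maps. In conv layers neighboring pixels
# are strongly correlated, so per-pixel dropout barely regularizes;
# dropping whole channels is more effective.
spatial_drop = nn.Dropout2d(p=0.3)
spatial_drop.train()
y = spatial_drop(x)  # each channel: all zeros, or scaled by 1/(1-p)

# AlphaDropout preserves mean and variance, designed for SELU networks
alpha_drop = nn.AlphaDropout(p=0.1)
z = alpha_drop(torch.randn(100))
```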
Weight Decay
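A sketch of weight decay with AdamW, including the common practice of excluding biases and normalization parameters from decay (the architecture and the 1e-2 value are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.BatchNorm1d(64),
    nn.Linear(64, 10),
)

# Biases and norm scales/shifts are 1-D; decaying them rarely helps
decay, no_decay = [], []
for name, param in model.named_parameters():
    if param.ndim <= 1:
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 1e-2},
    {"params": no_decay, "weight_decay": 0.0},
], lr=1e-3)
```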
Early Stopping
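A sketch of an early-stopping helper with patience, min-delta, best-model checkpointing, and restore (the class and method names are our own):

```python
import copy

class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_state = None
        self.counter = 0
        self.should_stop = False

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: checkpoint the model and reset the counter
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(model.state_dict())
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

    def restore_best(self, model):
        if self.best_state is not None:
            model.load_state_dict(self.best_state)
```

In a training loop: call `stopper.step(val_loss, model)` after each epoch, break when `stopper.should_stop`, and finish with `stopper.restore_best(model)` so you keep the best checkpoint, not the last.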
Combining Regularization Techniques
In practice, we often combine multiple regularization techniques. They address different aspects of overfitting and can work synergistically.
Production-Ready Network
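One possible sketch of a network combining several techniques from this section—batch norm, dropout, label smoothing, and AdamW weight decay (all dimensions and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class RegularizedMLP(nn.Module):
    """MLP with batch norm + dropout; pair with AdamW for weight decay."""

    def __init__(self, in_dim=784, hidden=256, num_classes=10, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = RegularizedMLP()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# model.train() for training; model.eval() for validation/inference
model.eval()
logits = model(torch.randn(4, 784))
```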
Common Combinations
| Architecture | Typical Regularization Stack |
|---|---|
| MLPs | Dropout (0.5) + Weight Decay + Early Stopping |
| CNNs | Dropout2D (0.2-0.3) + BatchNorm + Data Augmentation + Weight Decay |
| Transformers | Dropout (0.1) + Layer Norm + Weight Decay + Label Smoothing |
| RNNs/LSTMs | Recurrent Dropout + Weight Decay + Gradient Clipping |
| Small Datasets | Heavy augmentation + Dropout (0.5) + Weight Decay + Early Stopping |
Start Simple, Add If Needed
- Weight decay (0.01) in AdamW—almost always beneficial
- Early stopping with patience=10—prevents wasted computation
- Dropout (0.5) if you see overfitting after the above
- Data augmentation if you have limited data
Test Your Understanding
Test your knowledge of dropout and regularization with this comprehensive quiz. Each question has a detailed explanation.
Summary
Key Takeaways
- Overfitting is the central challenge—neural networks can memorize training data instead of learning generalizable patterns
- Dropout randomly silences neurons (p=0.5 typical), preventing co-adaptation and acting like an ensemble of networks
- Inverted dropout scales outputs during training so no change is needed at inference
- Weight decay (L2 regularization) penalizes large weights, preferring simpler functions; use AdamW, not Adam with weight_decay
- Early stopping halts training when validation loss stops improving; always restore the best model
- Combine techniques judiciously—start with weight decay and early stopping, add dropout and augmentation if needed
- model.train() vs model.eval() is critical—dropout (and batch norm) behave differently in each mode
| Technique | PyTorch API | Key Hyperparameter |
|---|---|---|
| Dropout | nn.Dropout(p=0.5) | p: drop probability |
| Dropout2D | nn.Dropout2d(p=0.3) | p: drop probability |
| Weight Decay | optim.AdamW(..., weight_decay=1e-2) | weight_decay: λ |
| Early Stopping | Custom callback | patience, min_delta |
| BatchNorm | nn.BatchNorm1d/2d(num_features) | momentum, eps |
| Stochastic Depth | timm.DropPath or custom | drop_prob: 0.0-0.5 |
| Gradient Clipping | torch.nn.utils.clip_grad_norm_ | max_norm: 1.0 |
| Mixup/CutMix | Custom or torchvision.transforms.v2 | alpha: 0.2-1.0 |
Exercises
Conceptual Questions
- Explain why dropout can be interpreted as training an ensemble of neural networks. How many networks are in this implicit ensemble for a network with 100 droppable neurons?
- Why is inverted dropout preferred over standard dropout in modern implementations? What practical advantage does it provide?
- The bias-variance tradeoff suggests there's a sweet spot between underfitting and overfitting. How does each regularization technique (dropout, weight decay, early stopping) affect this tradeoff?
Coding Exercises
- Implement Dropout from Scratch: Write a dropout function that takes a tensor and dropout probability as input. Implement both standard and inverted dropout. Verify against nn.Dropout.
- Dropout Ablation Study: Train the same architecture on MNIST with dropout rates [0, 0.25, 0.5, 0.75]. Plot training/validation accuracy curves for each. Which performs best? Is there a point where dropout hurts?
- Weight Decay Exploration: Train a model with weight_decay values [0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]. Plot final validation accuracy vs weight_decay. What's the optimal value for your dataset?
- Monte Carlo Dropout: Implement Monte Carlo Dropout for uncertainty estimation. Run the model N times with dropout enabled at inference, and compute the mean prediction and variance. Use this to identify uncertain predictions.
Challenge: Regularization Strategy
You're training a model with the following characteristics:
- Training accuracy: 98%, Validation accuracy: 71%
- Dataset: 5,000 images, 50 classes
- Model: ResNet-50 (25 million parameters)
- Currently using: No regularization
Design a regularization strategy to address the severe overfitting. Consider:
- Which techniques would you apply and in what order?
- What hyperparameter values would you start with?
- How would you validate that each technique is helping?
- Would you consider using a smaller model?
Solution Hint
Coming Up in Chapter 9: This section covered what regularization techniques are and how they work. Chapter 9 (Training Neural Networks) provides practical guidance on when and how much: tuning regularization hyperparameters, combining multiple techniques effectively, and debugging under-regularization vs over-regularization.
Congratulations! You've completed Chapter 5: Neural Network Building Blocks. You now understand the fundamental components that make up modern neural networks—from the nn.Module class to layers, activations, losses, normalization, and regularization. In the next chapter, we'll put these pieces together to understand how information flows from perceptrons to deep networks, exploring why depth matters and building your first complete neural network from scratch.