Learning Objectives
By the end of this section, you will:
- Understand internal covariate shift and why it hinders training
- Master the batch normalization formula including learnable parameters
- Distinguish training from inference behavior with running statistics
- Apply BatchNorm1d correctly to time series convolutional layers
- Position batch normalization optimally within CNN blocks
Why This Matters: Batch normalization is one of the most important innovations in deep learning. It stabilizes training, allows higher learning rates, and provides mild regularization. Understanding its mechanics is essential for training deep networks effectively.
Internal Covariate Shift
Internal covariate shift refers to the change in the distribution of layer inputs during training, as the parameters of preceding layers change.
The Problem
Consider a deep network where each layer transforms its input. As training progresses:
- Layer 1 parameters update → its output distribution shifts
- Layer 2 must adapt to this new input distribution
- This adaptation is undone when Layer 1 updates again
- The cycle continues, slowing convergence
```
Training dynamics without normalization:

Epoch 1: Layer 1 output ~ N(μ₁, σ₁) → Layer 2 adapts
Epoch 2: Layer 1 output ~ N(μ₂, σ₂) → Layer 2 re-adapts
Epoch 3: Layer 1 output ~ N(μ₃, σ₃) → Layer 2 re-adapts again

Each layer is "chasing" a moving target!
```
Consequences
| Effect | Impact on Training |
|---|---|
| Slower convergence | More epochs needed to reach optimum |
| Lower learning rates required | Large steps cause instability |
| Gradient vanishing/exploding | Activations drift to saturation regions |
| Sensitive initialization | Poor initialization causes divergence |
The Solution: Normalize Inputs
Batch normalization addresses this by normalizing layer inputs to have zero mean and unit variance, making each layer's job easier.
Batch Normalization Formulation
Batch normalization normalizes activations across the batch dimension, then applies a learnable affine transformation.
Step 1: Compute Batch Statistics
For a mini-batch B = {x₁, x₂, …, x_m}:

μ_B = (1/m) Σᵢ xᵢ

σ_B² = (1/m) Σᵢ (xᵢ − μ_B)²
Step 2: Normalize
x̂ᵢ = (xᵢ − μ_B) / √(σ_B² + ε)

Where ε (typically 10⁻⁵) is a small constant for numerical stability.
Step 3: Scale and Shift

yᵢ = γ · x̂ᵢ + β

Where:
- γ: Learnable scale parameter (initialized to 1)
- β: Learnable shift parameter (initialized to 0)
Why Scale and Shift?
Pure normalization to N(0, 1) would limit the network's representational power. The learnable γ and β allow the network to recover any mean and variance if that's optimal—including undoing the normalization entirely.
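The three steps can be sketched with a minimal NumPy forward pass (a didactic sketch of the math above, not a production implementation; the function name and shapes are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass for a (batch, features) input.

    gamma and beta play the roles of the learnable scale and shift.
    """
    mu = x.mean(axis=0)                     # Step 1: batch mean
    var = x.var(axis=0)                     # Step 1: batch variance (biased)
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 2: normalize
    return gamma * x_hat + beta             # Step 3: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ≈ 0 per feature
print(y.std(axis=0))   # ≈ 1 per feature
```

With γ = 1 and β = 0 the output is exactly the normalized x̂; training would then adjust γ and β per feature.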
Training vs Inference
Batch normalization behaves differently during training and inference.
During Training
- Use batch statistics (μ_B, σ_B²) for normalization
- Update running estimates with an exponential moving average (EMA):

μ_running ← (1 − α) · μ_running + α · μ_B

σ²_running ← (1 − α) · σ²_running + α · σ_B²

Where α is the momentum (typically 0.1 in PyTorch).
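The EMA update can be sketched as follows (note PyTorch's convention, where `momentum` weights the *new* batch statistic; the function name is illustrative):

```python
def update_running_stats(running_mean, running_var,
                         batch_mean, batch_var, momentum=0.1):
    """One EMA step: new = (1 - momentum) * old + momentum * batch."""
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var

# Starting from the PyTorch defaults (mean 0, var 1), one update with
# batch statistics mean=10, var=4:
rm, rv = update_running_stats(0.0, 1.0, 10.0, 4.0)
print(rm, rv)  # rm ≈ 1.0, rv ≈ 1.3
```

Because each batch contributes only a 0.1 weight, the running statistics change slowly and converge toward population estimates over many batches.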
During Inference
- Use running statistics (population estimates)
- No batch statistics computed—inference is deterministic
- Single sample inference is well-defined
PyTorch Behavior
```python
# Training mode: uses batch statistics, updates running stats
model.train()
output = model(batch)          # Uses μ_B, σ_B²

# Evaluation mode: uses running statistics, frozen
model.eval()
output = model(single_sample)  # Uses μ_running, σ²_running
```
Always Set Correct Mode
Forgetting to call model.eval() before inference is a common bug: the layer keeps using batch statistics, which are meaningless for a single sample (the batch variance collapses toward 0), causing incorrect predictions.
Training vs Inference Comparison
| Aspect | Training | Inference |
|---|---|---|
| Statistics used | Batch (μ_B, σ_B²) | Running (μ_run, σ²_run) |
| Running stats | Updated via EMA | Frozen |
| Batch dependency | Yes (needs mini-batch) | No (single sample OK) |
| Deterministic | No (varies with batch) | Yes |
| PyTorch mode | model.train() | model.eval() |
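The table's determinism row can be checked directly with a standalone BatchNorm1d layer (a small sketch; the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# Training mode: normalizes with batch statistics and updates
# running_mean / running_var via EMA as a side effect.
bn.train()
_ = bn(torch.randn(8, 4, 16))

# Evaluation mode: running stats are frozen, so single-sample
# inference is well-defined and repeatable.
bn.eval()
single = torch.randn(1, 4, 16)
out1 = bn(single)
out2 = bn(single)
assert torch.equal(out1, out2)  # deterministic in eval mode
```

In train mode the same input would normalize differently depending on which batch it appears in; in eval mode the output depends only on the input and the frozen running statistics.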
BatchNorm1d for Time Series
For 1D convolutional layers processing time series, we use BatchNorm1d which normalizes across the batch and time dimensions for each channel.
Input Shape Convention
PyTorch's BatchNorm1d expects input of shape (N, C, L):
- N: Batch size
- C: Number of channels (features)
- L: Sequence length (time)
Normalization Dimension
For each channel c, statistics are computed over all N batch samples and all L time positions:

μ_c = (1 / (N·L)) Σₙ Σₜ x₍n,c,t₎

σ_c² = (1 / (N·L)) Σₙ Σₜ (x₍n,c,t₎ − μ_c)²
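This pooling over batch and time can be verified empirically (a sketch with arbitrary sizes): after normalization, each channel's mean over dimensions (0, 2) is ≈ 0 and its standard deviation ≈ 1.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4).train()          # γ=1, β=0 at initialization
x = torch.randn(8, 4, 16) * 3 + 1      # (N, C, L) with shifted scale

y = bn(x)

# Per-channel statistics are taken over batch AND time: dims (0, 2)
per_channel_mean = y.mean(dim=(0, 2))
per_channel_std = y.std(dim=(0, 2), unbiased=False)
print(per_channel_mean)  # ≈ 0 for each of the 4 channels
print(per_channel_std)   # ≈ 1 for each of the 4 channels
```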
Parameter Count
For C channels:
| Parameter | Count | Purpose |
|---|---|---|
| γ (gamma) | C | Learnable scale |
| β (beta) | C | Learnable shift |
| running_mean | C | Population mean estimate |
| running_var | C | Population variance estimate |
Learnable parameters: 2C (γ and β). For our layer with 64 channels: 2 × 64 = 128 learnable parameters.
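The counts in the table can be confirmed by inspecting a BatchNorm1d layer's parameters and buffers (note PyTorch also keeps a scalar `num_batches_tracked` buffer, not listed above):

```python
import torch.nn as nn

bn = nn.BatchNorm1d(64)

# Learnable parameters: γ (weight) and β (bias), one per channel
learnable = sum(p.numel() for p in bn.parameters())
print(learnable)  # 128

# Non-learnable buffers: running stats plus a batch counter
buffers = {name: b.numel() for name, b in bn.named_buffers()}
print(buffers)  # running_mean: 64, running_var: 64, num_batches_tracked: 1
```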
Placement in CNN Blocks
The placement of batch normalization within the block affects training dynamics.
Standard Placement: After Convolution, Before Activation
```
Recommended order (what we use):
  Conv1D → BatchNorm1d → ReLU → Dropout

The convolution output is normalized before the non-linearity.
```
Alternative: After Activation
```
Alternative order:
  Conv1D → ReLU → BatchNorm1d → Dropout

Normalizes after the non-linearity. Less common.
```
Comparison
| Placement | Pros | Cons |
|---|---|---|
| Before ReLU | Controls pre-activation scale, standard practice | Normalized values may be clipped by ReLU |
| After ReLU | Normalizes actual activations | ReLU zeros may skew statistics |
Our choice: Before ReLU. This is the original formulation and works well in practice.
Complete Block Order
```
Input
  ↓
Conv1D(in_channels, out_channels, kernel_size=3, padding=1)
  ↓
BatchNorm1d(out_channels)   ← Normalize here
  ↓
ReLU()
  ↓
Dropout(p=0.2)
  ↓
Output
```
Bias in Convolution
When BatchNorm follows Conv1D, the convolution bias is redundant: the batch norm's β parameter serves the same purpose. Many implementations set bias=False in the convolution to save parameters.
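The complete block above translates directly into a small nn.Module (the class name and channel sizes in the usage example are illustrative; `bias=False` reflects the redundancy just described):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1D → BatchNorm1d → ReLU → Dropout, in the recommended order."""

    def __init__(self, in_channels, out_channels, p=0.2):
        super().__init__()
        self.block = nn.Sequential(
            # bias=False: BatchNorm's β makes the conv bias redundant
            nn.Conv1d(in_channels, out_channels,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm1d(out_channels),   # normalize pre-activations
            nn.ReLU(),
            nn.Dropout(p),
        )

    def forward(self, x):
        return self.block(x)

block = ConvBlock(32, 64)
x = torch.randn(8, 32, 100)   # (batch, channels, time)
out = block(x)
print(out.shape)              # torch.Size([8, 64, 100])
```

With kernel_size=3 and padding=1, the sequence length is preserved, so only the channel dimension changes.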
Summary
In this section, we explained batch normalization for CNN training:
- Internal covariate shift: Changing input distributions slow training
- Batch normalization: Normalizes, then applies learnable γ and β
- Training vs inference: Batch stats vs running stats
- BatchNorm1d: Normalizes over batch and time for each channel
- Placement: After convolution, before activation
| Property | Value |
|---|---|
| Input shape | (batch, channels, time) |
| Statistics computed over | Batch and time dimensions |
| Learnable parameters | 2 × channels (γ and β) |
| Running stats | Mean and variance per channel |
| Momentum (PyTorch) | 0.1 (for EMA) |
| Epsilon | 1e-5 (numerical stability) |
Looking Ahead: Batch normalization helps training stability but doesn't prevent overfitting. For that, we need regularization. The next section introduces dropout—randomly zeroing activations during training to prevent co-adaptation and improve generalization.
With batch normalization understood, we now examine dropout strategies for regularization.