Chapter 5

Batch Normalization for Training Stability

CNN Feature Extractor

Learning Objectives

By the end of this section, you will:

  1. Understand internal covariate shift and why it hinders training
  2. Master the batch normalization formula including learnable parameters
  3. Distinguish training from inference behavior with running statistics
  4. Apply BatchNorm1d correctly to time series convolutional layers
  5. Position batch normalization optimally within CNN blocks
Why This Matters: Batch normalization is one of the most important innovations in deep learning. It stabilizes training, allows higher learning rates, and provides mild regularization. Understanding its mechanics is essential for training deep networks effectively.

Internal Covariate Shift

Internal covariate shift refers to the change in the distribution of layer inputs during training, as the parameters of preceding layers change.

The Problem

Consider a deep network where each layer transforms its input. As training progresses:

  1. Layer 1 parameters update → its output distribution shifts
  2. Layer 2 must adapt to this new input distribution
  3. This adaptation is undone when Layer 1 updates again
  4. The cycle continues, slowing convergence
```text
Training dynamics without normalization:

Epoch 1:  Layer 1 output ~ N(μ₁, σ₁)  →  Layer 2 adapts
Epoch 2:  Layer 1 output ~ N(μ₂, σ₂)  →  Layer 2 re-adapts
Epoch 3:  Layer 1 output ~ N(μ₃, σ₃)  →  Layer 2 re-adapts again

Each layer is "chasing" a moving target!
```

Consequences

| Effect | Impact on Training |
| --- | --- |
| Slower convergence | More epochs needed to reach the optimum |
| Lower learning rates required | Large steps cause instability |
| Gradient vanishing/exploding | Activations drift into saturation regions |
| Sensitive initialization | Poor initialization causes divergence |

The Solution: Normalize Inputs

Batch normalization addresses this by normalizing layer inputs to have zero mean and unit variance, making each layer's job easier.

$$\hat{x} = \frac{x - \mu}{\sigma} \implies \hat{x} \sim N(0, 1)$$

Batch Normalization Formulation

Batch normalization normalizes activations across the batch dimension, then applies a learnable affine transformation.

Step 1: Compute Batch Statistics

For a mini-batch $\mathcal{B} = \{x_1, ..., x_m\}$:

$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$$

Step 2: Normalize

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$$

Where $\epsilon$ (typically $10^{-5}$) is a small constant for numerical stability.

Step 3: Scale and Shift

$$y_i = \gamma \hat{x}_i + \beta$$

Where:

  • $\gamma$: Learnable scale parameter (initialized to 1)
  • $\beta$: Learnable shift parameter (initialized to 0)

Why Scale and Shift?

Pure normalization to N(0, 1) would limit the network's representational power. The learnable $\gamma$ and $\beta$ allow the network to recover any mean and variance if that is optimal, including undoing the normalization entirely.
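The three steps can be combined into a short NumPy sketch (a minimal illustration of the math, not PyTorch's implementation; the batch size and feature count are arbitrary):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # Step 1: batch mean
    var = x.var(axis=0)                     # Step 1: batch variance (biased, 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 2: normalize
    return gamma * x_hat + beta             # Step 3: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))  # ≈ 0 for every feature
print(y.std(axis=0))   # ≈ 1 for every feature
```

With γ = 1 and β = 0 the output is simply the normalized activations; during training the network is free to move these two parameters away from their initial values.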


Training vs Inference

Batch normalization behaves differently during training and inference.

During Training

  • Use batch statistics (μ_B, σ_B²) for normalization
  • Update running estimates with exponential moving average (EMA)
$$\mu_{\text{running}} \leftarrow (1 - m) \cdot \mu_{\text{running}} + m \cdot \mu_{\mathcal{B}}$$

$$\sigma^2_{\text{running}} \leftarrow (1 - m) \cdot \sigma^2_{\text{running}} + m \cdot \sigma^2_{\mathcal{B}}$$

Where $m$ is the momentum (typically 0.1 in PyTorch).
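A minimal sketch of the EMA update for the running mean, using made-up batch means and PyTorch's default momentum of 0.1:

```python
# EMA update of the running mean, as in batch normalization.
# The batch means below are hypothetical values for illustration.
momentum = 0.1
running_mean = 0.0                      # PyTorch initializes running_mean to 0
for mu_batch in [2.0, 2.2, 1.8, 2.1]:
    running_mean = (1 - momentum) * running_mean + momentum * mu_batch

print(round(running_mean, 3))  # 0.696 -- slowly drifting toward the true mean ~2.0
```

Because each batch contributes only 10%, the running estimate changes slowly and smooths out batch-to-batch noise; it takes many batches to converge to the population statistics.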

During Inference

  • Use running statistics (population estimates)
  • No batch statistics computed—inference is deterministic
  • Single sample inference is well-defined
$$\hat{x} = \frac{x - \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}$$

PyTorch Behavior

```python
# Training mode: uses batch statistics, updates running stats
model.train()
output = model(batch)  # Uses μ_B, σ_B²

# Evaluation mode: uses running statistics, frozen
model.eval()
output = model(single_sample)  # Uses μ_running, σ²_running
```

Always Set Correct Mode

Forgetting to call model.eval() before inference is a common bug. With a batch size of 1, batch statistics become meaningless (the variance of a single sample is zero), causing incorrect predictions.
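The batch-size-1 failure is easy to demonstrate with the normalization formula alone (a NumPy sketch; the input value is made up):

```python
import numpy as np

x = np.array([[3.7]])                   # a single sample with one feature
mu = x.mean(axis=0)                     # equals the sample itself: 3.7
var = x.var(axis=0)                     # variance of a single point is 0
x_hat = (x - mu) / np.sqrt(var + 1e-5)

print(x_hat)  # [[0.]] -- the input value is destroyed; the output is just β
```

Whatever the input, normalizing a sample against itself yields zero, so every single-sample prediction in training mode collapses to the same value.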

Training vs Inference Comparison

| Aspect | Training | Inference |
| --- | --- | --- |
| Statistics used | Batch (μ_B, σ_B²) | Running (μ_run, σ²_run) |
| Running stats | Updated via EMA | Frozen |
| Batch dependency | Yes (needs mini-batch) | No (single sample OK) |
| Deterministic | No (varies with batch) | Yes |
| PyTorch mode | model.train() | model.eval() |

BatchNorm1d for Time Series

For 1D convolutional layers processing time series, we use BatchNorm1d, which normalizes across the batch and time dimensions for each channel.

Input Shape Convention

PyTorch's BatchNorm1d expects input shape $(N, C, L)$:

  • $N$: Batch size
  • $C$: Number of channels (features)
  • $L$: Sequence length (time)

Normalization Dimension

For each channel $c$, statistics are computed over all batch samples and all time positions:

$$\mu_c = \frac{1}{N \cdot L} \sum_{n=1}^{N} \sum_{t=1}^{L} x_{n,c,t}$$

$$\sigma_c^2 = \frac{1}{N \cdot L} \sum_{n=1}^{N} \sum_{t=1}^{L} (x_{n,c,t} - \mu_c)^2$$
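The per-channel statistics can be checked with a NumPy sketch (the shapes here are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 50))   # (N, C, L): 8 samples, 3 channels, 50 time steps

mu_c = x.mean(axis=(0, 2))        # one mean per channel, shape (3,)
var_c = x.var(axis=(0, 2))        # one variance per channel, shape (3,)
x_hat = (x - mu_c[None, :, None]) / np.sqrt(var_c[None, :, None] + 1e-5)

print(mu_c.shape)                 # (3,)
print(x_hat.mean(axis=(0, 2)))    # ≈ 0 for every channel
```

Each channel is normalized by a single mean/variance pair shared across all samples and all time steps, which is why the statistics have shape $(C,)$.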

Parameter Count

For $C$ channels:

| Parameter | Count | Purpose |
| --- | --- | --- |
| γ (gamma) | C | Learnable scale |
| β (beta) | C | Learnable shift |
| running_mean | C | Population mean estimate |
| running_var | C | Population variance estimate |

Learnable parameters: $2C$. For our layer with 64 channels: 128 learnable parameters.
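A quick check in PyTorch (the channel count 64 matches the layer mentioned above):

```python
import torch.nn as nn

bn = nn.BatchNorm1d(64)
learnable = sum(p.numel() for p in bn.parameters())   # γ (weight) + β (bias)
buffers = {name: b.numel() for name, b in bn.named_buffers()}

print(learnable)  # 128
print(buffers)    # running_mean: 64, running_var: 64, num_batches_tracked: 1
```

The running statistics are buffers, not parameters: they are saved with the model but never receive gradients.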


Placement in CNN Blocks

The placement of batch normalization within the block affects training dynamics.

Standard Placement: After Convolution, Before Activation

```text
Recommended order (what we use):
  Conv1D → BatchNorm1d → ReLU → Dropout

The convolution output is normalized before the non-linearity.
```

Alternative: After Activation

```text
Alternative order:
  Conv1D → ReLU → BatchNorm1d → Dropout

Normalizes after the non-linearity. Less common.
```

Comparison

| Placement | Pros | Cons |
| --- | --- | --- |
| Before ReLU | Controls pre-activation scale; standard practice | Normalized values may be clipped by ReLU |
| After ReLU | Normalizes actual activations | ReLU zeros may skew statistics |

Our choice: Before ReLU. This is the original formulation and works well in practice.

Complete Block Order

```text
Input
  ↓
Conv1D(in_channels, out_channels, kernel_size=3, padding=1)
  ↓
BatchNorm1d(out_channels)  ← Normalize here
  ↓
ReLU()
  ↓
Dropout(p=0.2)
  ↓
Output
```

Bias in Convolution

When BatchNorm follows Conv1D, the convolution bias is redundant—the batch norm's β parameter serves the same purpose. Many implementations use bias=False in the convolution to save parameters.
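A sketch of such a block in PyTorch; the channel sizes (32 in, 64 out) are illustrative assumptions:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv1d(32, 64, kernel_size=3, padding=1, bias=False),  # bias=False: BN's β takes its place
    nn.BatchNorm1d(64),   # normalizes the 64 conv output channels
    nn.ReLU(),
    nn.Dropout(p=0.2),
)

x = torch.randn(4, 32, 100)   # (batch, channels, time)
y = block(x)
print(y.shape)                # torch.Size([4, 64, 100])
print(sum(p.numel() for p in block[0].parameters()))  # 6144 = 32 × 64 × 3, no bias term
```

Dropping the bias saves one parameter per output channel; any constant offset the layer needs is absorbed by batch norm's β anyway, since BN subtracts the mean immediately after the convolution.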


Summary

In this section, we explained batch normalization for CNN training:

  1. Internal covariate shift: Changing input distributions slow training
  2. Batch normalization: Normalizes, then applies learnable γ and β
  3. Training vs inference: Batch stats vs running stats
  4. BatchNorm1d: Normalizes over batch and time for each channel
  5. Placement: After convolution, before activation
| Property | Value |
| --- | --- |
| Input shape | (batch, channels, time) |
| Statistics computed over | Batch and time dimensions |
| Learnable parameters | 2 × channels (γ and β) |
| Running stats | Mean and variance per channel |
| Momentum (PyTorch) | 0.1 (for EMA) |
| Epsilon | 1e-5 (numerical stability) |
Looking Ahead: Batch normalization helps training stability but doesn't prevent overfitting. For that, we need regularization. The next section introduces dropout—randomly zeroing activations during training to prevent co-adaptation and improve generalization.

With batch normalization understood, we now examine dropout strategies for regularization.