Boo-AI — Master Artificial Intelligence by Building from Scratch

The Evolution of CNNs

In Section 1, we built a CNN from scratch using the same principles that Yann LeCun pioneered in 1998. But the field did not stop there. Between 1998 and 2015, a series of architectural innovations transformed CNNs from a niche technique for digit recognition into the dominant approach for virtually all computer vision tasks.

Each breakthrough solved a specific problem — and each one builds on the same foundation you already understand. The innovations are not mystical. They are engineering solutions to concrete problems: “How do we go deeper without vanishing gradients?” “How do we capture features at multiple scales?” “How do we reduce parameters without losing accuracy?”

CNN Architecture Evolution (1998–2015)

[Interactive figure: a timeline of architectures; selecting one shows its key innovation. The ResNet card and the depth comparison are summarized below.]

ResNet (2015), He, Zhang, Ren, Sun (Microsoft): 152 layers, 60M parameters, 3.57% top-5 error.

Key innovation: skip connections (residual learning)

One of the most influential architectural ideas since CNNs themselves. Skip connections give gradients a direct path through the network, which eases the vanishing gradient problem at extreme depth. Instead of learning H(x), each block learns the residual F(x) = H(x) - x, which is easier to optimize. An ensemble of residual networks reached 3.57% top-5 error on ImageNet, below the roughly 5% usually quoted for a human annotator. Skip connections now appear in most modern architectures, including transformers and U-Net; DenseNet uses a closely related concatenation-based variant.

Architecture: Conv7 → Pool → ResBlock×3 → ResBlock×4 → ResBlock×6 → ResBlock×3 → AvgPool → 1000 (this is the ResNet-34/50 stage layout; ResNet-152 uses ×3, ×8, ×36, ×3)

Depth comparison: VGG 19 layers, GoogLeNet 22 layers, ResNet 152 layers.

LeNet-5: Where It All Began (1998)

LeNet-5, published by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in 1998, was the first CNN to be successfully deployed at scale. AT&T used it to read handwritten digits on millions of bank checks and zip codes on postal mail.

The architecture is remarkably simple by today's standards: two convolutional layers with 5×5 kernels, two average pooling layers, and three fully-connected layers. It used tanh activations instead of ReLU and average pooling instead of max pooling — choices that were standard at the time.

LeNet-5 in PyTorch

🐍lenet5.py


1import torch

PyTorch core library for tensor operations and automatic differentiation.

2import torch.nn as nn

Neural network building blocks: layers, loss functions, and the Module base class.

4class LeNet5(nn.Module):

LeNet-5 was designed by Yann LeCun in 1998 for reading handwritten digits on bank checks and postal mail. It was the first CNN to be successfully deployed in a commercial product (reading zip codes). The architecture uses 5×5 kernels, average pooling, and tanh activations — all choices that were standard before the modern era of ReLU and max pooling.

EXECUTION STATE

LeNet5 = The original CNN architecture. 5 learnable layers (2 conv + 3 FC), ~44K parameters. Input: 28×28 grayscale. Output: 10 digit scores.

9self.conv1 = nn.Conv2d(1, 6, kernel_size=5)

First convolutional layer: 6 filters of size 5×5. No padding, so the spatial size shrinks: 28 - 5 + 1 = 24. Only 6 filters (compare to our 16 in Section 1) — compute was precious in 1998.

EXECUTION STATE

⬇ in_channels = 1 = Grayscale input (single channel)

⬇ out_channels = 6 = 6 feature maps. LeCun chose 6 because it was enough to capture the basic edge orientations in digit images.

⬇ kernel_size = 5 = 5×5 kernels — larger than modern 3×3. Captures more context per operation but uses more parameters per filter: 5×5×1 = 25 vs 3×3×1 = 9.

→ params = 6 × (5×5×1 + 1) = 156 parameters

11self.conv2 = nn.Conv2d(6, 16, kernel_size=5)

Second convolutional layer: 16 filters spanning all 6 input channels. After pooling, input is 12×12, so output will be 12 - 5 + 1 = 8.

EXECUTION STATE

→ params = 16 × (5×5×6 + 1) = 2,416 parameters

13self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

Average pooling — the original LeNet used this instead of max pooling. It computes the mean of each 2×2 window. Modern CNNs prefer max pooling because it preserves stronger activations, but average pooling works fine for simple tasks like MNIST.

EXECUTION STATE

📚 nn.AvgPool2d = Computes the mean (not max) of each pooling window. More ‘democratic’ — all values contribute equally. Used in LeNet (1998), later replaced by MaxPool in AlexNet (2012).

15self.fc1 = nn.Linear(16 * 4 * 4, 120)

First fully-connected layer. After two conv+pool blocks, the feature maps are 16×4×4 = 256 values. This projects them into 120 dimensions.

EXECUTION STATE

16 * 4 * 4 = 256 = 16 channels × 4 height × 4 width. The spatial size went: 28 → 24 (conv1) → 12 (pool) → 8 (conv2) → 4 (pool).

→ params = 256 × 120 + 120 = 30,840 — most parameters in the network

16self.fc2 = nn.Linear(120, 84)

Second FC layer. The number 84 was chosen by LeCun to match the 7×12 bitmap representation of ASCII characters — a nod to the character recognition task.

EXECUTION STATE

84 = 7 × 12 = 84. LeCun designed a 7×12 pixel grid for each character class, and this layer maps to that representation space.

17self.fc3 = nn.Linear(84, 10)

Output layer: 10 classes for digits 0–9.

20def forward(self, x): # x: [batch, 1, 28, 28]

Forward pass: data flows through conv1 → tanh → pool → conv2 → tanh → pool → flatten → fc1 → tanh → fc2 → tanh → fc3. Notice tanh everywhere — ReLU had not been popularized yet.

EXECUTION STATE

⬇ input: x = [batch, 1, 28, 28] — MNIST images

21x = torch.tanh(self.conv1(x))

Conv1 + tanh activation. The original LeNet used a scaled tanh as its ‘squashing function’. tanh outputs lie in (-1, 1), unlike ReLU’s [0, ∞), and its gradient shrinks toward zero for large |x|. That saturation limits gradient flow in deep networks but works fine for the 5-layer LeNet.

EXECUTION STATE

📚 torch.tanh(x) = tanh(x) = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ). Output range: (-1, 1). Zero-centered (unlike sigmoid) but saturates at extremes, causing vanishing gradients.

→ shape = [batch, 6, 24, 24] — 6 feature maps, size reduced from 28 to 24 (no padding)

22x = self.pool(x)

Average pooling halves spatial dimensions: 24 → 12.

EXECUTION STATE

→ shape = [batch, 6, 12, 12]

23x = torch.tanh(self.conv2(x))

Second convolution: 12 → 8 (no padding), then tanh.

EXECUTION STATE

→ shape = [batch, 16, 8, 8]

24x = self.pool(x)

Second pooling: 8 → 4.

EXECUTION STATE

→ shape = [batch, 16, 4, 4] — 16 feature maps of 4×4 each

25x = x.view(x.size(0), -1)

Flatten: 16 × 4 × 4 = 256 values per image.

EXECUTION STATE

→ shape = [batch, 256]

26x = torch.tanh(self.fc1(x))

Dense projection: 256 → 120, then tanh.

EXECUTION STATE

→ shape = [batch, 120]

27x = torch.tanh(self.fc2(x))

Dense projection: 120 → 84, then tanh.

EXECUTION STATE

→ shape = [batch, 84]

28x = self.fc3(x)

Output: 84 → 10 logits. No activation — raw scores for each digit class.

EXECUTION STATE

⬆ return: x = [batch, 10] — one score per digit

31total = sum(p.numel() for p in model.parameters())

Count all parameters. LeNet-5 has only ~44K, tiny by modern standards. AlexNet (2012) has about 61M, roughly 1,400× more.

EXECUTION STATE

total = 44,426 parameters — conv1: 156, conv2: 2,416, fc1: 30,840, fc2: 10,164, fc3: 850


import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """The original CNN (LeCun, 1998), adapted for MNIST."""
    def __init__(self):
        super().__init__()
        # Layer 1: 1 -> 6 channels, 5x5 kernels
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        # Layer 2: 6 -> 16 channels, 5x5 kernels
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        # Average pooling (original used subsampling)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # Fully connected layers
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # x: [batch, 1, 28, 28]
        x = torch.tanh(self.conv1(x))  # -> [batch, 6, 24, 24]
        x = self.pool(x)               # -> [batch, 6, 12, 12]
        x = torch.tanh(self.conv2(x))  # -> [batch, 16, 8, 8]
        x = self.pool(x)               # -> [batch, 16, 4, 4]
        x = x.view(x.size(0), -1)      # -> [batch, 256]
        x = torch.tanh(self.fc1(x))    # -> [batch, 120]
        x = torch.tanh(self.fc2(x))    # -> [batch, 84]
        x = self.fc3(x)                # -> [batch, 10]
        return x

model = LeNet5()
total = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 parameters: {total:,}")
# LeNet-5 parameters: 44,426
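The 44,426 total can be checked by hand without PyTorch. The sketch below recomputes each layer's count from its shape; the helper names `conv2d_params` and `linear_params` are illustrative and not part of any library.

```python
def conv2d_params(in_ch, out_ch, k):
    """k*k*in_ch weights per filter, plus one bias per output channel."""
    return out_ch * (in_ch * k * k + 1)

def linear_params(in_features, out_features):
    """One weight per (input, output) pair, plus one bias per output."""
    return in_features * out_features + out_features

layers = {
    "conv1": conv2d_params(1, 6, 5),
    "conv2": conv2d_params(6, 16, 5),
    "fc1": linear_params(16 * 4 * 4, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name:>5}: {count:>6,}")
print(f"total: {total:,}")  # total: 44,426
```

Most of the budget sits in `fc1`, which is why later architectures work hard to shrink or remove the fully-connected head.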

LeNet-5 Dimension Flow

Layer	Output Shape	Key Difference from Our CNN
Input	1 × 28 × 28	Same input
Conv1 (5×5, no pad)	6 × 24 × 24	Larger kernels, fewer filters, no padding → size shrinks
AvgPool (2×2)	6 × 12 × 12	Average pooling instead of max pooling
Conv2 (5×5, no pad)	16 × 8 × 8	5×5 kernels again
AvgPool (2×2)	16 × 4 × 4	Spatial size: 4 (vs our 7)
Flatten	256	256 (vs our 1,568)
FC1 → FC2 → FC3	120 → 84 → 10	Three FC layers (vs our two)
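The spatial sizes in the table follow from the usual output-size formula, out = (in + 2·padding − kernel) // stride + 1. A minimal sketch, assuming the layer order used in this section (5×5 conv, 2×2 pool, 5×5 conv, 2×2 pool); `conv_out` and `lenet_spatial_trace` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def lenet_spatial_trace(input_size):
    """Spatial size after each of conv1, pool, conv2, pool."""
    sizes = [input_size]
    for kernel, stride in ((5, 1), (2, 2), (5, 1), (2, 2)):
        sizes.append(conv_out(sizes[-1], kernel, stride))
    return sizes

mnist = lenet_spatial_trace(28)
print(mnist)                     # [28, 24, 12, 8, 4]
print(16 * mnist[-1] ** 2)       # 256 values after flattening

# The 1998 design took 32x32 inputs, which leaves a 5x5 map before flattening.
original = lenet_spatial_trace(32)
print(original)                  # [32, 28, 14, 10, 5]
print(16 * original[-1] ** 2)    # 400
```

Changing the input size only changes the first fully-connected layer, which is the usual reason a hard-coded `16 * 4 * 4` breaks when images are resized.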

Historical Impact: LeNet-5 proved that learned features outperform hand-engineered ones. But in the 2000s, CNNs fell out of favor as SVMs and other methods seemed to work just as well on small datasets. It took 14 years and the ImageNet dataset for CNNs to return — spectacularly.

AlexNet: The Deep Learning Big Bang (2012)

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a CNN called AlexNet into the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It achieved a top-5 error rate of 15.3% — while the runner-up (a non-deep-learning method) had 26.2%. That 10-point gap was like a thunderbolt. Overnight, the entire computer vision community pivoted to deep learning.

What Made AlexNet Special

AlexNet was not a single breakthrough — it was a collection of engineering insights that made deep CNNs trainable:

ReLU activation instead of tanh/sigmoid. ReLU does not saturate for positive inputs, so gradients flow more freely through deeper layers. In the paper's CIFAR-10 experiment, a four-layer ReLU network reached 25% training error about 6× faster than the same network with tanh.
GPU training across two NVIDIA GTX 580 GPUs (3GB VRAM each). This was one of the first high-profile deep learning results trained on GPUs. The model was split across the two GPUs, which communicated only at specific layers.
Dropout (p=0.5) in the fully-connected layers to prevent overfitting on the 1.2M training images.
Data augmentation: random crops, horizontal flips, and PCA-based color perturbation to artificially expand the training set.
Local Response Normalization (LRN): a form of lateral inhibition across channels. Later abandoned in favor of batch normalization.
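To make the ReLU point concrete, the sketch below compares the local derivative of tanh and ReLU at a few pre-activation values, then multiplies eight such derivatives together as a rough stand-in for backpropagating through eight layers. The pre-activation value 2.0 and the depth 8 are illustrative assumptions, and weights are ignored.

```python
import math

def tanh_grad(z):
    """d/dz tanh(z) = 1 - tanh(z)^2, which shrinks quickly as |z| grows."""
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    """d/dz max(0, z): exactly 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (0.5, 2.0, 5.0):
    print(f"z={z}: tanh'={tanh_grad(z):.6f}  relu'={relu_grad(z):.1f}")

depth = 8
tanh_chain = tanh_grad(2.0) ** depth   # product of eight saturating derivatives
relu_chain = relu_grad(2.0) ** depth   # stays at 1.0 on the active path
print(f"tanh chain: {tanh_chain:.2e}, relu chain: {relu_chain}")
```

The tanh product collapses toward zero while the ReLU product does not, which is the saturation effect described above. ReLU has its own failure mode (units stuck at zero gradient for negative inputs) that this sketch does not show.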

Property	LeNet-5 (1998)	AlexNet (2012)
Depth	5 layers	8 layers
Parameters	~60K (original 32×32 design; 44K in our 28×28 adaptation)	61M (about 1,000× more)
Activation	tanh	ReLU
Pooling	Average	Max
Training hardware	CPU	2× GPU
Training data	60K images (MNIST)	1.2M images (ImageNet)
Input size	28 × 28 grayscale	224 × 224 RGB
Classes	10 digits	1,000 categories

The scale insight: AlexNet proved that the combination of more data + more compute + deeper networks dramatically improves performance. This insight, that scale matters, has driven the field ever since, from VGGNet to GPT-4.

VGGNet: The Power of Depth (2014)

Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what happens if we just make the network deeper, using only 3×3 kernels?

The answer was VGGNet (VGG-16 and VGG-19), which made the case that depth matters more than large kernels. Instead of the 11×11 and 5×5 kernels in AlexNet's early layers, VGG uses only 3×3 convolutions stacked deep.

The Key Insight: Two 3×3 = One 5×5

Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution (both see a 5×5 region of the input). But the stacked version is better:

	One 5×5 Conv	Two 3×3 Convs
Receptive field	5 × 5	5 × 5 (same!)
Parameters (C channels)	25C²	18C² (28% fewer)
Non-linearities	1 ReLU	2 ReLUs (more expressive)
Computation	25C²HW	18C²HW (28% less)

By the same logic, three 3×3 convolutions replace one 7×7 with even bigger savings: $3 \times 9C^2 = 27C^2$ vs $49C^2$ (45% fewer parameters). This is why virtually every modern CNN uses 3×3 kernels.
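The arithmetic in this comparison is easy to reproduce. The sketch below assumes stride-1 convolutions with C input and C output channels and ignores biases; `stacked_receptive_field` and `conv_weights` are hypothetical helpers, and C = 64 is an arbitrary choice.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convs that share one kernel size."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights in one conv with `channels` inputs and `channels` outputs."""
    return kernel * kernel * channels * channels

C = 64
results = {}
for num_layers, big_kernel in ((2, 5), (3, 7)):
    stacked = num_layers * conv_weights(3, C)
    single = conv_weights(big_kernel, C)
    saving = 1 - stacked / single
    results[big_kernel] = (stacked_receptive_field(num_layers), stacked, single, saving)
    print(f"{num_layers} x 3x3 vs one {big_kernel}x{big_kernel}: "
          f"field {stacked_receptive_field(num_layers)}, "
          f"{stacked:,} vs {single:,} weights ({saving:.0%} fewer)")
```

The saving does not depend on C, since both counts scale with C²; only the ratio 18/25 or 27/49 matters.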

VGG-16 has 138M parameters — most in the fully-connected layers. The convolutional backbone itself is relatively efficient. When used for transfer learning (Section 3), only the conv layers are typically kept, and the FC layers are replaced.

Quick Check

Why do modern CNNs prefer stacking two 3×3 convolutions instead of using one 5×5?

Batch Normalization: The Invisible Ingredient (2015)

VGG pushed depth from 8 to 19 layers; the next attempts to go deeper ran into a training wall. Early layers produced activations with wildly swinging mean and variance, which the optimiser could never settle. Batch Normalisation (Ioffe & Szegedy, 2015) broke the wall. It is the single line of code that made ResNet, DenseNet, and the Inception family practical to train.

The idea is simple. After each convolution, look at the distribution of activations across the batch, height, and width for each channel. If that distribution has mean $\mu_B$ and variance $\sigma_B^2$ , rewrite every activation $x_i$ as

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$ , then $y_i = \gamma \cdot \hat{x}_i + \beta$ .

The first step forces the activations to zero mean and unit variance, stable targets that make gradient descent's job much easier. The second step lets the network undo that normalisation if it hurts representational power: if the optimal $\gamma$ happens to be $\sqrt{\sigma_B^2 + \varepsilon}$ and the optimal $\beta$ happens to be $\mu_B$ , BatchNorm becomes the identity and no information is lost.
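The "undo" claim can be checked numerically on a single channel. This is a minimal pure-Python sketch rather than a real layer: `batch_norm_1d` is a hypothetical helper that normalises a flat list of activations with the biased variance and then applies a scalar gamma and beta.

```python
import math

def batch_norm_1d(xs, gamma, beta, eps=1e-5):
    """Normalise one channel's activations, then apply the affine transform."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n      # biased (population) variance
    ys = [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]
    return ys, mu, var

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
eps = 1e-5

# Default parameters: output is (approximately) zero-mean, unit-variance.
normed, mu, var = batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=eps)

# gamma = sqrt(var + eps) and beta = mu cancel the normalisation exactly.
restored, _, _ = batch_norm_1d(xs, gamma=math.sqrt(var + eps), beta=mu, eps=eps)

print([round(v, 3) for v in normed])
print([round(v, 3) for v in restored])   # matches xs up to float rounding
```

In practice gamma and beta are learned, so the network lands somewhere between full normalisation and the identity, per channel.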

Batch Normalization Step-by-Step

[Interactive figure: raw activations x_i for a selectable batch, shown with their distribution. For Batch 1 the widget reports batch mean μ = 4.880, batch variance σ² = 6.843, and batch standard deviation σ = 2.616.]

Notice how different batches have different means (internal covariate shift). BatchNorm normalizes each batch to have mean = 0 and variance = 1.

Per-Channel, Not Per-Activation

For 2-D feature maps BatchNorm does not normalise each activation independently. For each channel it pools across the batch dimension $N$ and the two spatial dimensions $H, W$ . So a conv layer with 64 channels learns 64 $(\gamma, \beta)$ pairs, and that is all. A ResNet-18 carries roughly $2 \times 4{,}800 \approx 10{,}000$ BN parameters out of about 11.7 M in total: a rounding error in size, but critical for trainability.
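The 4,800 figure can be reconstructed by counting BatchNorm channels. The layout below is an assumption: it follows the commonly used ResNet-18 structure (one stem BN, four stages of two basic blocks with two BN layers each, and one extra BN on the downsampling shortcut of stages 2 to 4), not anything defined in this section.

```python
# (channel width, number of BatchNorm layers with that width) -- assumed layout
bn_layers = [
    (64, 1 + 4),    # stem BN + 2 blocks x 2 BN
    (128, 4 + 1),   # 2 blocks x 2 BN + 1 BN on the downsample shortcut
    (256, 4 + 1),
    (512, 4 + 1),
]

total_channels = sum(width * count for width, count in bn_layers)
bn_params = 2 * total_channels            # one gamma and one beta per channel
share = bn_params / 11_700_000            # against ~11.7M total parameters

print(total_channels)                     # 4800
print(bn_params)                          # 9600
print(f"{share:.3%} of the model")
```

The running mean and variance add another two values per channel, but those are buffers rather than learnable parameters.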

Batch Norm from Scratch (NumPy)

The whole idea fits in seven lines. This is the clearest way to see what nn.BatchNorm2d does when you are not looking.

BatchNorm from scratch — interactive trace

🐍batch_norm_numpy.py


1import numpy as np

NumPy gives us ndarrays and the axis-based reductions (mean, var) we need to compute batch statistics along arbitrary combinations of axes.

EXECUTION STATE

numpy = Numerical array library. .mean(axis=...), .var(axis=...), broadcasting, and np.sqrt are all we need.

3def batch_norm_2d(x, gamma, beta, eps=1e-5) → ndarray

The full BatchNorm forward pass for 2-D convolutional feature maps. Ioffe & Szegedy (2015) normalise each channel independently, pooling across the batch and spatial dimensions so every channel sees many activations.

EXECUTION STATE

⬇ input: x (N=2, C=2, H=2, W=2) = Shape (2, 2, 2, 2) = 16 activations. Channel 0 values: [1,2,3,4,5,6,7,8] Channel 1 values: [10,11,12,13,14,15,16,17]

⬇ input: gamma (1, C, 1, 1) = Learnable per-channel scale. One value per channel; broadcasts across N, H, W. Here: [[[[1.]]], [[[1.]]]] (identity).

⬇ input: beta (1, C, 1, 1) = Learnable per-channel shift. Broadcasts the same way. Here: zeros.

⬇ input: eps = 1e-5 = Small constant added to var before sqrt so a dead channel (var = 0) cannot produce division by zero.

⬆ returns = ndarray shape (N, C, H, W) — same shape as input, but every channel has been re-centred, re-scaled, then affine-transformed by (gamma, beta).

10mu = x.mean(axis=(0, 2, 3), keepdims=True)

Batch mean for each channel. Pooling axes (0, 2, 3) means 'keep axis 1 (channels) and average over everything else'. Every channel gets one mean.

EXECUTION STATE

📚 .mean(axis, keepdims) = NumPy reduction: averages values along the given axes. With keepdims=True, those axes remain as size-1 so broadcasting against the original shape still works.

axis=(0, 2, 3) = Average over batch (axis 0), height (axis 2), width (axis 3). Only channels (axis 1) are preserved. Result has shape (1, C, 1, 1).

→ channel 0 mean = (1+2+3+4+5+6+7+8) / 8 = 4.5

→ channel 1 mean = (10+11+12+13+14+15+16+17) / 8 = 13.5

⬆ mu shape (1, 2, 1, 1) =

[[[[4.5]]], [[[13.5]]]]

11var = x.var(axis=(0, 2, 3), keepdims=True)

Population variance per channel (NumPy default: dividing by N, not N-1). PyTorch's BatchNorm uses the biased (population) variance for the normalisation and the unbiased one only for tracking running statistics.

EXECUTION STATE

📚 .var() = Mean of squared deviations from the per-axis mean. NumPy default: sum of (x - mu)^2 divided by N (biased estimator).

→ channel 0 var = Sum of (1-4.5)^2 + ... + (8-4.5)^2 = 42. Divide by 8 → var = 5.25

→ channel 1 var = Same pattern; 5.25 (the numbers are shifted but the spread is identical).

⬆ var shape (1, 2, 1, 1) =

[[[[5.25]]], [[[5.25]]]]

14x_hat = (x - mu) / np.sqrt(var + eps)

Per-channel standardisation. This is the heart of BatchNorm — zero-mean, unit-variance activations for every channel. Broadcasting does the heavy lifting: x (2, 2, 2, 2) − mu (1, 2, 1, 1) expands to the full shape.

EXECUTION STATE

📚 np.sqrt = Element-wise square root. Called on (var + eps). Produces the per-channel batch standard deviation.

→ sqrt(5.25 + 1e-5) = ≈ 2.2913

x_hat channel 0 (sample 0) =

(1-4.5)/2.2913  (2-4.5)/2.2913
(3-4.5)/2.2913  (4-4.5)/2.2913
= [-1.528, -1.091]
  [-0.655, -0.218]

x_hat channel 0 (sample 1) =

[ 0.218,  0.655]
[ 1.091,  1.528]

→ channel-0 mean after norm = (-1.528 -1.091 -0.655 -0.218 +0.218 +0.655 +1.091 +1.528) / 8 = 0.000 ✓

→ channel-0 var after norm = ≈ 1.000 ✓

17return gamma * x_hat + beta

The learnable affine transform. Without this, BN would force every layer's output to be zero-mean unit-variance forever — which could hurt representational power. With it, the network can undo the normalisation if that is actually what the task needs.

EXECUTION STATE

gamma * x_hat = Element-wise multiply. gamma (1, C, 1, 1) broadcasts to (N, C, H, W), scaling every channel independently. Here gamma = 1 everywhere so x_hat is unchanged.

+ beta = Element-wise add with broadcast. Shifts every channel by its learnable beta. Here beta = 0 so no shift.

⬆ return y = Same as x_hat because gamma = 1, beta = 0. After one gradient step gamma and beta would move off these identity values and start adjusting each channel's distribution.

20x = np.array([...]) # shape (2, 2, 2, 2)

A hand-crafted 16-number batch. Channel 0 holds [1..8], channel 1 holds [10..17]. Different means (4.5 vs 13.5) make it easy to see BN flattening everything to mean 0 var 1.

EXECUTION STATE

⬆ x.shape = (2, 2, 2, 2) — N=2 samples, C=2 channels, H=2, W=2

28gamma = np.ones((1, 2, 1, 1))

The learnable scale parameter, initialised to 1 (identity). Shape (1, C, 1, 1) so it broadcasts cleanly against x during multiplication. After training, gamma diverges from 1 wherever the network wants to amplify or attenuate specific channels.

EXECUTION STATE

📚 np.ones(shape) = Creates an array filled with 1.0. Here: [[[[1.]]], [[[1.]]]] — two scalars, one per channel.

29beta = np.zeros((1, 2, 1, 1))

The learnable shift parameter, initialised to 0. Shape matches gamma.

31y = batch_norm_2d(x, gamma, beta)

Run the whole pipeline: mean, var, standardise, affine. Output shape matches input shape exactly.

EXECUTION STATE

⬆ y.shape = (2, 2, 2, 2)

⬆ y channel 0 (sample 0) =

[-1.528 -1.091]
[-0.655 -0.218]

⬆ y channel 0 (sample 1) =

[ 0.218  0.655]
[ 1.091  1.528]

⬆ y channel 1 (sample 0) =

[-1.528 -1.091]
[-0.655 -0.218]  (same because channel 1 is just channel 0 + 9)

32print(y.mean(axis=(0, 2, 3)))

Sanity check: every channel of the output should have mean ~ 0.

EXECUTION STATE

⬆ stdout = per-channel mean: [0. 0.]

33print(y.var(axis=(0, 2, 3)))

And variance ~ 1. This confirms the normalisation did its job regardless of the channel's original mean.

EXECUTION STATE

⬆ stdout = per-channel var : [1. 1.]


import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """BatchNorm for a 4-D tensor of shape (N, C, H, W).

    Statistics are per-channel: we pool over the batch and spatial
    dimensions, so each channel has its own mu and sigma.
    """
    # axis=(0, 2, 3) = pool N, H, W. keepdims keeps (1, C, 1, 1).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)

    # Step 1: centre, step 2: rescale to unit variance
    x_hat = (x - mu) / np.sqrt(var + eps)

    # Step 3: learnable per-channel scale and shift
    return gamma * x_hat + beta


# Tiny deterministic batch: 2 samples, 2 channels, 2x2 feature maps
x = np.array([
    [[[ 1.,  2.], [ 3.,  4.]],    # sample 0, channel 0
     [[10., 11.], [12., 13.]]],   # sample 0, channel 1
    [[[ 5.,  6.], [ 7.,  8.]],    # sample 1, channel 0
     [[14., 15.], [16., 17.]]],   # sample 1, channel 1
])

gamma = np.ones((1, 2, 1, 1))    # one scale per channel
beta = np.zeros((1, 2, 1, 1))    # one shift per channel

y = batch_norm_2d(x, gamma, beta)
print("per-channel mean:", y.mean(axis=(0, 2, 3)))  # ~ [0., 0.]
print("per-channel var :", y.var(axis=(0, 2, 3)))   # ~ [1., 1.]

The PyTorch Equivalent

PyTorch's nn.BatchNorm2d packages the same computation but adds two production essentials: running statistics (an exponential moving average of the batch mean and variance, used at inference) and a train/eval switch that chooses between batch statistics and running statistics.

nn.BatchNorm2d — interactive trace

🐍batch_norm_torch.py


1import torch

PyTorch core: tensors, autograd, device management. Everything below uses torch.tensor or torch.nn types.

2import torch.nn as nn

The nn module provides layer classes — Linear, Conv2d, BatchNorm2d — that package learnable parameters with their forward pass.

9bn = nn.BatchNorm2d(num_features=2, eps=1e-5, momentum=0.1)

Creates a 2-D BatchNorm layer for feature maps with C=2 channels. PyTorch packages the gamma/beta parameters, the running_mean/running_var buffers, and the train/eval switch into one nn.Module.

EXECUTION STATE

📚 nn.BatchNorm2d = Implements the Ioffe-Szegedy 2015 BatchNorm for 4-D (N, C, H, W) tensors. Learnable: weight (gamma) and bias (beta), each of shape (C,). Non-learnable buffers: running_mean and running_var, each (C,).

⬇ arg 1: num_features = 2 = Number of input channels C. Must match x.shape[1] at forward time. Controls the size of weight, bias, running_mean, running_var.

⬇ arg 2: eps = 1e-5 = Numerical safety constant added under the sqrt. Same role as in the NumPy version.

⬇ arg 3: momentum = 0.1 = Exponential moving-average coefficient for the running statistics. PyTorch's convention: running_mean ← (1 - m) · running_mean + m · batch_mean. Note this is the opposite convention from, say, TensorFlow's Keras.

→ bn.weight (gamma) = tensor([1., 1.], requires_grad=True) — one scalar per channel, init 1.

→ bn.bias (beta) = tensor([0., 0.], requires_grad=True) — one scalar per channel, init 0.

→ bn.running_mean = tensor([0., 0.]) — buffer, NOT learnable. Updated during train() mode only.

→ bn.running_var = tensor([1., 1.]) — buffer, NOT learnable. Init 1 because that is the expected post-normalisation variance.

11x = torch.tensor([...]) # shape (2, 2, 2, 2)

The exact same batch we used in the NumPy version, now as a torch.Tensor. Channel 0 values [1..8], channel 1 values [10..17].

18bn.train()

Switches the module into TRAINING mode. In this mode, forward() uses the BATCH's own statistics (the same mu, var we computed by hand in NumPy) and also updates the running_mean / running_var buffers via the momentum EMA.

EXECUTION STATE

📚 .train() = Sets self.training = True. For BatchNorm2d this changes forward() to use batch statistics and to update running_mean / running_var. For Dropout it enables masking. It does NOT by itself run any training.

19y_train = bn(x)

Forward pass in training mode. Internally PyTorch computes the per-channel batch mean and var (the same 4.5, 13.5, 5.25 we derived), normalises, then multiplies by bn.weight and adds bn.bias.

EXECUTION STATE

⬆ y_train.shape = torch.Size([2, 2, 2, 2]) — identical to input

⬆ y_train channel 0 (sample 0) =

tensor([[-1.528, -1.091],
        [-0.655, -0.218]])

→ matches NumPy = Same values as our hand-rolled batch_norm_2d — PyTorch is doing the exact same computation, just in optimised C++ / CUDA.

20print(y_train.mean(dim=(0, 2, 3)))

Per-channel mean of the normalised output. Should be ~0 for every channel.

EXECUTION STATE

⬆ stdout = train mean: tensor([0., 0.])

21print(y_train.var(dim=(0, 2, 3)))

Per-channel variance. Note: torch.var defaults to the unbiased estimator (divides by n-1), and each channel here pools n = 8 values, so you will see about 8/7 ≈ 1.143 instead of 1.0. That is an artefact of torch.var's default, not of BatchNorm; pass unbiased=False to get ~1.

EXECUTION STATE

⬆ stdout = train var : ≈ [1.1429, 1.1429] # unbiased estimator quirk

27print(bn.running_mean)

After one forward pass in train() mode, the running_mean buffer was updated with momentum=0.1 toward the current batch's mean. Initial 0 → 0.9 * 0 + 0.1 * [4.5, 13.5] = [0.45, 1.35].

EXECUTION STATE

⬆ bn.running_mean = tensor([0.4500, 1.3500])

28print(bn.running_var)

running_var is updated the same way, except that PyTorch feeds the unbiased batch variance into this buffer: 42 / 7 = 6.0 per channel rather than the biased 5.25. Initial 1 → 0.9 * 1 + 0.1 * 6.0 = 1.5 for both channels.

31bn.eval()

Switches to EVAL mode. From now on, forward() ignores the current batch's statistics and uses the stored running_mean / running_var instead. This is what makes inference deterministic — one image or a batch of ten, same output per image.

EXECUTION STATE

📚 .eval() = Sets self.training = False. For BatchNorm, the running stats take over. For Dropout, masks are disabled.

⚠ production gotcha = If you forget .eval() at inference time, BN will compute statistics on whatever batch you happened to feed in. One-image inference will produce degenerate activations. This is the single most common BatchNorm bug.

32y_eval = bn(x)

Forward pass in eval mode. Normalises using running_mean = [0.45, 1.35] and running_var = [1.5, 1.5]. Because those running stats are nowhere near the batch's true mean (4.5, 13.5), the output is NOT zero-mean unit-variance. That is expected after a single update; in real training the running stats accumulate over many batches before deployment.

EXECUTION STATE

⬆ y_eval channel 0 (sample 0) =

x = [1, 2, 3, 4]
(x - 0.45) / sqrt(1.5 + 1e-5)
≈ [0.45, 1.27, 2.08, 2.90]

→ takeaway = Eval mode does NOT re-normalise to zero-mean unit-variance. It applies a fixed linear transform learned during training. That fixed transform is stable and deterministic.


import torch
import torch.nn as nn

# nn.BatchNorm2d tracks 4 tensors, each with one entry per channel:
#   weight (gamma)      : learnable, shape (C,), init 1
#   bias   (beta)       : learnable, shape (C,), init 0
#   running_mean        : buffer, shape (C,), init 0, updated in train()
#   running_var         : buffer, shape (C,), init 1, updated in train()
bn = nn.BatchNorm2d(num_features=2, eps=1e-5, momentum=0.1)

x = torch.tensor([
    [[[ 1.,  2.], [ 3.,  4.]],
     [[10., 11.], [12., 13.]]],
    [[[ 5.,  6.], [ 7.,  8.]],
     [[14., 15.], [16., 17.]]],
])  # shape (2, 2, 2, 2)

# ---- TRAINING mode: use the BATCH's own mean and var ----
bn.train()
y_train = bn(x)
print("train mean:", y_train.mean(dim=(0, 2, 3)))  # ~ [0., 0.]
# torch.var is unbiased by default (divides by n-1 = 7), so this prints
# about 8/7 = 1.1429 per channel rather than 1.0
print("train var :", y_train.var(dim=(0, 2, 3)))   # ~ [1.1429, 1.1429]

# Each forward pass in train() updates the running stats:
#   running_mean ← (1 - m) * running_mean + m * batch_mean
#   running_var  ← (1 - m) * running_var  + m * unbiased batch_var
print("running_mean:", bn.running_mean)   # [0.45, 1.35]
print("running_var :", bn.running_var)    # [1.5, 1.5] = 0.9 * 1 + 0.1 * 6.0

# ---- EVAL mode: use the STORED running stats, not the batch's ----
bn.eval()
y_eval = bn(x)   # normalises x using running_mean / running_var, NOT batch stats

The four states of a BatchNorm layer. (1) Fresh init: $\gamma = 1, \beta = 0$ , running stats not yet updated. (2) During training: forward uses batch stats, running stats drift toward the training distribution. (3) During eval: forward uses running stats, output is a fixed deterministic transform. (4) Under fine-tuning: you often want to freeze the running stats of pretrained layers; we revisit this in Section 3.
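The move from state (1) to state (3) can be traced by hand for channel 0 of the example batch. This is a pure-Python sketch under two assumptions taken from PyTorch's documented behaviour: the update is running ← (1 − m)·running + m·batch_stat, and the variance fed into the buffer is the unbiased one. `update_running` and `bn_eval` are hypothetical helpers.

```python
import math

def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average used for the running statistics."""
    return (1 - momentum) * running + momentum * batch_stat

def bn_eval(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Eval-mode BatchNorm for one value: a fixed affine map, no batch stats."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

channel0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
n = len(channel0)
batch_mean = sum(channel0) / n
sq_dev = sum((v - batch_mean) ** 2 for v in channel0)
biased_var = sq_dev / n          # 5.25, used to normalise in train mode
unbiased_var = sq_dev / (n - 1)  # 6.0, used for the running_var update

running_mean = update_running(0.0, batch_mean)     # fresh init 0 -> 0.45
running_var = update_running(1.0, unbiased_var)    # fresh init 1 -> 1.5

print(round(running_mean, 4), round(running_var, 4))
print([round(bn_eval(v, running_mean, running_var), 2) for v in channel0[:4]])
```

After one update the running statistics are still far from the batch statistics, so the eval-mode output is not zero-mean; many training batches are needed before the two agree.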

GoogLeNet: Thinking Multi-Scale (2014)

While VGG went deeper with uniform 3×3 kernels, Google's team (Szegedy et al.) asked a different question: what if we look at multiple scales simultaneously?

The result was the Inception module — a block that applies 1×1, 3×3, and 5×5 convolutions in parallel, plus max pooling, then concatenates all the results along the channel dimension.

The Inception Module

The key innovation is computing features at multiple scales within a single layer:

Branch	Operation	What It Captures
Branch 1	1×1 conv	Point-wise features (channel mixing)
Branch 2	1×1 conv → 3×3 conv	Local patterns (edges, textures)
Branch 3	1×1 conv → 5×5 conv	Larger-scale patterns (object parts)
Branch 4	3×3 max pool → 1×1 conv	Spatial subsampling features

The 1×1 convolutions before the 3×3 and 5×5 branches serve as bottlenecks: they reduce the channel count before the expensive spatial convolutions. This is how GoogLeNet achieved 6.7% top-5 error with only 6.8M parameters — 20× fewer than VGG's 138M.
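A rough weight count shows how much the bottlenecks save inside a single module. The channel widths below are an assumption: they are chosen to match the configuration usually reported for GoogLeNet's first Inception module (192 input channels), and biases are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Weights in a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

c_in = 192
# Assumed widths: 1x1 branch, 3x3 reduce, 3x3, 5x5 reduce, 5x5, pool projection
n1x1, r3x3, n3x3, r5x5, n5x5, pool_proj = 64, 96, 128, 16, 32, 32

with_bottlenecks = (
    conv_weights(1, c_in, n1x1)
    + conv_weights(1, c_in, r3x3) + conv_weights(3, r3x3, n3x3)
    + conv_weights(1, c_in, r5x5) + conv_weights(5, r5x5, n5x5)
    + conv_weights(1, c_in, pool_proj)
)

# Naive module: 3x3 and 5x5 applied directly to all 192 input channels.
# (The pooling branch would then pass 192 channels through unprojected.)
without_bottlenecks = (
    conv_weights(1, c_in, n1x1)
    + conv_weights(3, c_in, n3x3)
    + conv_weights(5, c_in, n5x5)
)

out_channels = n1x1 + n3x3 + n5x5 + pool_proj   # branches are concatenated

print(f"with bottlenecks   : {with_bottlenecks:,}")     # 163,328
print(f"without bottlenecks: {without_bottlenecks:,}")  # 387,072
print(f"output channels    : {out_channels}")           # 256
```

The 1×1 reductions cut this module's weights by more than half while also keeping the concatenated output from growing with every pooling branch.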

1×1 convolutions are one of the most important tricks in CNN design. Despite having no spatial extent, they are full linear transformations across the channel dimension. Think of them as a per-pixel fully-connected layer: each spatial position's channel vector is projected to a new space. They can expand channels (add information), reduce channels (compress/bottleneck), or mix channels (learn cross-channel interactions).
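The "per-pixel fully-connected layer" view can be verified directly in NumPy. In this sketch the shapes (batch 2, 8 input channels, 4 output channels, 5×5 maps) are arbitrary, and the 1×1 kernel is stored as a plain (C_out, C_in) matrix since it has no spatial extent; biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 5, 5))   # (N, C_in, H, W)
w = rng.standard_normal((4, 8))         # (C_out, C_in): the 1x1 kernel

# 1x1 convolution: mix channels independently at every (n, h, w) position.
y_conv = np.einsum("oc,nchw->nohw", w, x)

# The same result from a linear layer applied to each pixel's channel vector.
pixels = x.transpose(0, 2, 3, 1).reshape(-1, 8)                    # (N*H*W, C_in)
y_fc = (pixels @ w.T).reshape(2, 5, 5, 4).transpose(0, 3, 1, 2)    # back to NCHW

print(y_conv.shape)                 # (2, 4, 5, 5)
print(np.allclose(y_conv, y_fc))    # True
```

Because the same matrix is shared across all positions, the layer costs C_in × C_out weights regardless of the feature map's height and width.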

1×1 Convolutions: Parameter Economy

1×1 convolutions look trivial — they have no spatial footprint. Why would you want one? Because they are astonishingly cheap and they let the network decide which channels matter. Lin, Chen & Yan (2013) called this idea Network in Network and it is the backbone of every efficient architecture since.

The Parameter-Count Argument

Compare two ways to pass 256 channels through a 5×5 convolution: the naive way, a single 5×5 conv from 256 input channels to 256 output channels, and a bottleneck that squeezes the channels with a 1×1 conv first and restores them with another 1×1 afterwards.

Design	Weight count (biases ignored)	Parameters
Naive: one 5×5 conv, 256 → 256	5 × 5 × 256 × 256	1,638,400
Bottleneck: 1×1 (256→64) → 5×5 (64→64) → 1×1 (64→256)	(1·1·256·64) + (5·5·64·64) + (1·1·64·256)	16,384 + 102,400 + 16,384 = 135,168

About twelve times fewer parameters for the same receptive field. The 1×1 at the entrance compresses the 256-channel input into a 64-channel summary; the expensive 5×5 conv operates in that low-dimensional space; the 1×1 at the exit projects back to 256. This is the logic the Inception module exploits in its 3×3 and 5×5 branches, and the same shape as the ResNet-50 bottleneck block (which uses a 3×3 in the middle) that we will build in a moment.
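The table's numbers can be reproduced in a few lines. The sketch ignores biases, as the table does; the 3×3 variant at the end is added for comparison with the ResNet-style bottleneck and uses the same 256 → 64 → 256 widths as an assumption.

```python
def conv_weights(k, c_in, c_out):
    """Weights in a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

naive_5x5 = conv_weights(5, 256, 256)
bottleneck_5x5 = (
    conv_weights(1, 256, 64)     # squeeze 256 -> 64
    + conv_weights(5, 64, 64)    # spatial conv in the narrow space
    + conv_weights(1, 64, 256)   # expand 64 -> 256
)
print(f"{naive_5x5:,} vs {bottleneck_5x5:,} "
      f"({naive_5x5 / bottleneck_5x5:.1f}x fewer)")   # 1,638,400 vs 135,168 (12.1x fewer)

# Same pattern with a 3x3 in the middle, as in ResNet-style bottlenecks.
naive_3x3 = conv_weights(3, 256, 256)
bottleneck_3x3 = (
    conv_weights(1, 256, 64) + conv_weights(3, 64, 64) + conv_weights(1, 64, 256)
)
print(f"{naive_3x3:,} vs {bottleneck_3x3:,} "
      f"({naive_3x3 / bottleneck_3x3:.1f}x fewer)")   # 589,824 vs 69,632 (8.5x fewer)
```

The saving shrinks as the spatial kernel gets smaller, because the two 1×1 layers become a larger share of the total.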

Why it works in practice. Many of the 256 input channels are redundant — they encode similar features. The first 1×1 conv learns a lossy compression (the 64 most informative linear combinations). The spatial conv works on that compressed representation. The second 1×1 restores the full channel count with another learned linear map. The compression is exactly as aggressive as the task permits, because gradient descent picks its parameters.

ResNet: The Skip Connection Revolution (2015)

By 2015, the trend was clear: deeper networks perform better. VGG went to 19 layers, GoogLeNet to 22. But there was a wall. Plain networks much deeper than about 20 layers often performed worse than shallower ones, and the gap showed up in training error as well as test error. The cause was therefore not overfitting: the deeper networks could not be trained effectively.

Kaiming He and his team at Microsoft Research diagnosed the problem as degradation: as networks get deeper, the optimization landscape becomes increasingly difficult. Even the identity function (passing input through unchanged) is hard for a deep stack of layers to learn. Their solution was elegant:

The Residual Learning Idea: Instead of asking each block to learn the desired mapping $H(x)$ , ask it to learn the residual $F(x) = H(x) - x$ . Then reconstruct: $H(x) = F(x) + x$ . If the optimal transformation is close to identity, learning $F(x) \approx 0$ is trivially easy — just set all weights to near zero.

The implementation is a single line of code: add the input to the output. This is the skip connection (also called shortcut or residual connection). Explore it interactively:

[Interactive figure: Residual Block with skip connection. Example values: input x = 2.0; the main path (Conv + BN + ReLU, then Conv + BN) produces F(x) = 1.10; the identity path carries x = 2.0; the sum is F(x) + x = 1.10 + 2.0 = 3.10; the final ReLU leaves 3.10.]

With skip connection: The block learns F(x) = H(x) - x (the residual), then adds the original input back: H(x) = F(x) + x.

Why this helps: If the optimal transformation is close to identity (the layer should not change the input much), learning F(x) ≈ 0 is much easier than learning H(x) ≈ x. Gradients also flow directly through the skip path during backpropagation, preventing vanishing gradients even at 152 layers deep.

Why Skip Connections Solve Vanishing Gradients

During backpropagation through a residual block, the gradient takes two paths:

$\frac{\partial H}{\partial x} = \frac{\partial F(x)}{\partial x} + \frac{\partial x}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$

That $+1$ is the key. Even if the gradient through the convolutional path $\frac{\partial F}{\partial x}$ is tiny (approaching zero), the gradient through the skip path is always exactly 1. This means gradients can flow directly from the loss to any layer in the network, no matter how deep.
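A scalar toy model makes the effect of the $+1$ concrete. The sketch below assumes every block's convolutional path has a small local derivative $\frac{\partial F}{\partial x}$ (drawn at random from $[-0.1, 0.1]$, a made-up range chosen for illustration) and ignores the ReLU after the addition. By the chain rule, a plain stack multiplies these derivatives together, while a residual stack multiplies the factors $\frac{\partial F}{\partial x} + 1$.

```python
import random

random.seed(0)
depth = 50

# Hypothetical local derivatives dF/dx of each block's conv path, all small.
local = [random.uniform(-0.1, 0.1) for _ in range(depth)]

plain = 1.0      # plain stack: product of the dF/dx terms
residual = 1.0   # residual stack: product of the (dF/dx + 1) terms
for d in local:
    plain *= d
    residual *= d + 1.0

print(f"plain    gradient factor: {plain:.3e}")
print(f"residual gradient factor: {residual:.3f}")
```

The plain product is bounded by $0.1^{50}$ and is numerically negligible; every residual factor lies between 0.9 and 1.1, so the product stays within a few orders of magnitude of 1. Real networks are not scalar, but the same mechanism applies to the Jacobians.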

With skip connections, He et al. trained networks with 152 layers on ImageNet, and even tested one with 1,202 layers on CIFAR-10. An ensemble of ResNets achieved a top-5 error of 3.57% on the ImageNet test set, below the ~5.1% human error estimated by Andrej Karpathy.

Depth	Without Skip Connections	With Skip Connections (ResNet)
20 layers	Trains well	Trains well (same)
56 layers	WORSE than 20-layer (degradation)	Better than 20-layer
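110 layers	Degradation expected to worsen (plain nets this deep are not reported by He et al.)	Even better
152 layers	Degradation expected to worsen (not reported by He et al.)	3.57% top-5 error as an ensemble (below the ~5.1% human estimate)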

Residual Block in NumPy

Before we reach for nn.Module, let us strip the residual block to its essentials. No BatchNorm, no 3×3 kernels, no bias — just the skip connection. The point is to see the single design choice that makes ResNet work.

The skip connection in 20 lines of NumPy

residual_numpy.py: line-by-line walkthrough

import numpy as np
We only need NumPy for this sketch. The key idea is the skip connection, not the convolution, so a 1×1 channel mix stands in for the 3×3 spatial convolution and keeps the per-pixel arithmetic easy to follow.

def relu(x)
Element-wise ReLU, used for the activation inside the residual branch and for the final post-skip activation. np.maximum(0, x) takes the element-wise maximum of 0 and each entry of x, so it works on any ndarray and returns an array of the same shape with negative entries clipped to 0.

def linear_mix(x, W)  # a 1×1 conv in disguise
Pure channel mixing. A 1×1 convolution is mathematically the same as multiplying each pixel's channel vector by a matrix W. No spatial footprint, no padding worries: exactly what we need to isolate the skip-connection idea.
np.einsum('cij,oc->oij', x, W) is Einstein-summation shorthand. Read it as: for each output channel o and each pixel (i, j), sum over input channels c of W[o, c] * x[c, i, j].
Input x: a 3-D feature map of shape (C, H, W), e.g. (2, 3, 3). Input W: a channel-mixing matrix of shape (O, C), e.g. (2, 2). Returns: an array of shape (O, H, W), with the same spatial dims and possibly a different channel count.

def residual_block(x, W1, W2)
The entire ResNet innovation distilled into three lines. F(x) is the 'side road' (two transformations with a ReLU in between). The skip connection is the '+ x'. After the addition, a final ReLU gates the combined signal.
Input x: shape (C, H, W); it also serves as the skip (identity) path. Input W1: weights for the first transformation. Input W2: weights for the second transformation, which must produce the same channel count as x so the two can be added. Returns: ReLU(F(x) + x), the same shape as x.

h = relu(linear_mix(x, W1))
Step 1 of the side road F: mix channels, then activate. The shape is unchanged because W1 is a (2, 2) matrix. In Case A (W1 = 0), linear_mix(x, W1) is all zeros and relu(0) = 0, so h = 0.

h = linear_mix(h, W2)  # no ReLU yet
Step 2 of F. The ReLU is deliberately omitted here: He et al. put the final activation AFTER adding the identity so the residual branch can output negative values if that helps. Applying ReLU here would clip those values and weaken the gradient highway.
Design choice: the original ResNet paper (He et al., 2015) orders the block as conv-BN-ReLU-conv-BN-ADD-ReLU. The later 'pre-activation' ResNet-v2 moves BN and ReLU in front of each conv (BN-ReLU-conv-BN-ReLU-conv-ADD) and was found to train even deeper networks. Both exist in the wild.

return relu(h + x)  # the skip connection
This is the single line the whole paper rests on. Add the identity path, THEN apply ReLU. In Case A (W1 = W2 = 0) this becomes ReLU(0 + x) = ReLU(x): the block is the identity, modulo ReLU clipping, and the gradient arriving here is dy/dx = 1 on positive entries. A plain bias-free block with zero weights would emit zeros and pass no gradient back. In Case B the result is a small perturbation of ReLU(x), since W1 and W2 are small random values.

x = np.array([...])
Toy feature map of shape (2, 3, 3): 2 channels, 3×3 each. Some values are negative so we can see ReLU do its thing.

W1 = np.zeros((2, 2))
Case A weights: exactly zero, so F(x) is zero. The listing calls this an 'identity initialisation'. It is a forward-pass demonstration: in this bias-free sketch neither matrix would receive a gradient if both were exactly zero, so practical schemes zero only the last layer of the branch.

W2 = np.zeros((2, 2))
Same: the residual branch is off.

y_id = residual_block(x, W1, W2)
Run the block. F(x) = 0, so y_id = ReLU(x). The skip connection has given us a free identity, where a plain network with zero weights would have output all zeros.

print('identity test:', np.allclose(y_id, relu(x)))
Verify that the block's output equals ReLU of the input when both weight matrices are zero. This is the property that lets ResNets start very deep: added layers can be nearly 'skipped' at first and only gradually learn to contribute.
stdout: identity test  : True

rng = np.random.default_rng(0)
Seeded RNG for reproducibility. default_rng is NumPy's modern, recommended API.

W1 = rng.standard_normal((2, 2)) * 0.1
Small random weights, standing in for an early-training state. This is a hand-picked scale, not He initialisation itself: the factor 0.1 keeps the perturbation small so the block is a gentle modification of the identity, not a chaotic transformation.

W2 = rng.standard_normal((2, 2)) * 0.1
Same for the second layer.

y_small = residual_block(x, W1, W2)
Normal (noisy) residual forward pass. The result is close to ReLU(x) but not exactly equal: the side road has contributed a small learnable perturbation.

print('small-weights output magnitude:', ...)
Sanity check. Because the final ReLU zeroes the negative entries, the fair baseline is ReLU(x), whose mean absolute value is 37/18 ≈ 2.056, half of the input's 74/18 ≈ 4.111. The output magnitude should sit close to 2.056: the block has neither blown up nor collapsed the signal relative to ReLU(x), a property that makes it safe to stack dozens of these blocks.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def linear_mix(x, W):
    """Per-pixel channel mix. Equivalent to a 1x1 conv:
    output channel o = sum over input channels c of W[o, c] * x[c]."""
    return np.einsum('cij,oc->oij', x, W)

def residual_block(x, W1, W2):
    """y = ReLU( F(x) + x ),  where F(x) = W2 * ReLU(W1 * x).

    The skip connection is the '+ x'. If the network decides F should
    do nothing, it can simply drive W2 towards zero and the block becomes
    the identity (up to the final ReLU): a free 'do-nothing' option that
    plain networks lack."""
    h = relu(linear_mix(x, W1))   # step 1: first transformation + ReLU
    h = linear_mix(h, W2)         # step 2: second transformation (no ReLU yet)
    return relu(h + x)            # step 3: add skip, THEN activate


# Toy input: 2 channels, 3x3 feature map
x = np.array([
    [[ 1., -2.,  3.],
     [-1.,  4., -5.],
     [ 6., -7.,  8.]],        # channel 0
    [[-1.,  2., -3.],
     [ 1., -4.,  5.],
     [-6.,  7., -8.]],        # channel 1
])

# Case A: both weight matrices ZERO (the "identity initialisation")
W1 = np.zeros((2, 2))
W2 = np.zeros((2, 2))
y_id = residual_block(x, W1, W2)

# In a PLAIN network these zero weights would kill the signal.
# In a residual block the skip connection rescues it:
#   h = 0, so y = ReLU(0 + x) = ReLU(x)
print("identity test  :", np.allclose(y_id, relu(x)))   # True

# Case B: small random weights (stand-in for an early-training state)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 2)) * 0.1
W2 = rng.standard_normal((2, 2)) * 0.1
y_small = residual_block(x, W1, W2)

# The final ReLU zeroes negative entries, so ReLU(x) is the fair baseline.
print("small-weights output magnitude:", np.abs(y_small).mean().round(3))
print("ReLU(x)              magnitude:", np.abs(relu(x)).mean().round(3))
print("input                magnitude:", np.abs(x).mean().round(3))

The identity-preservation property. When both weight matrices are exactly zero, a residual block reduces to $y = \mathrm{ReLU}(0 + x) = \mathrm{ReLU}(x)$. A plain (non-residual) block with zero weights would output zero and kill the gradient. The skip connection turns “weights not yet trained” from catastrophic into harmless, which is exactly why you can stack 152 of these blocks without the early layers collapsing.
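The contrast with a plain block is only described in words above, so here is a minimal side-by-side sketch. It reuses the bias-free 1×1 channel mix from the listing above; `plain_block` is a hypothetical counterpart written for this comparison, and the depth of 152 simply echoes the number in the text. With every weight set to zero, the plain stack wipes out the signal after one block, while the residual stack still carries ReLU(x) through all 152.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def mix(x, W):
    # 1x1 channel mix, same role as linear_mix in the listing above
    return np.einsum('cij,oc->oij', x, W)

def plain_block(x, W1, W2):
    # hypothetical non-residual counterpart: no '+ x'
    return relu(mix(relu(mix(x, W1)), W2))

def residual_block(x, W1, W2):
    return relu(mix(relu(mix(x, W1)), W2) + x)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 3, 3))
Z = np.zeros((2, 2))                 # untrained branch: all weights zero

y_plain, y_res = x, x
for _ in range(152):
    y_plain = plain_block(y_plain, Z, Z)
    y_res = residual_block(y_res, Z, Z)

print(np.abs(y_plain).max())          # 0.0
print(np.allclose(y_res, relu(x)))    # True
```

Applying ReLU repeatedly changes nothing after the first application, which is why the residual stack returns exactly ReLU(x) regardless of depth.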

ResNet Block in PyTorch

Let's implement a residual block and a simple ResNet for MNIST. The pattern is: two 3×3 convolutions with batch normalization, plus a skip connection that adds the input directly to the output.

Residual Block + Simple ResNet

simple_resnet.py: line-by-line walkthrough

import torch
PyTorch core library.

import torch.nn as nn
Neural network module with layers and the Module base class.

import torch.nn.functional as F
Functional API: stateless operations such as F.relu() and F.softmax(). Unlike nn.ReLU() (a module), F.relu() is a plain function, useful when you do not need to store the layer as an attribute. F.relu(x) is equivalent to torch.relu(x), and F.conv2d(x, w) applies a convolution with weights you pass in yourself. The functional form is a common choice in forward() for operations with no learnable parameters.

class ResidualBlock(nn.Module):
The fundamental building block of ResNet: two convolutions with batch normalization, plus a skip connection that adds the original input to the output. This is the innovation that let networks grow from ~20 layers to 152 and beyond. The block learns the residual F(x) = H(x) - x instead of the full mapping H(x); the skip connection adds x back, so output = F(x) + x. If the block should be the identity, it only has to learn F(x) ≈ 0.

def __init__(self, channels):
The constructor takes a single argument: the number of channels (e.g., 32). Input and output channels are the same (a 'same-size' residual block), because the skip addition out + identity needs both tensors to have the same shape. When channels change, a 1×1 conv is needed on the skip path.

self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
First 3×3 convolution. Padding of 1 preserves spatial dimensions. Channels stay the same (32 → 32) so the skip connection can add tensors directly. Weight shape: [channels, channels, 3, 3], e.g. [32, 32, 3, 3] = 9,216 weights, plus 32 biases.

self.bn1 = nn.BatchNorm2d(channels)
Batch normalization after the first convolution. For each channel it normalizes across the batch and spatial dims to zero mean and unit variance, then applies a learnable scale (γ) and shift (β). This stabilizes training and allows higher learning rates. Parameters: 2 × channels, here 2 × 32 = 64 (32 gammas + 32 betas).

self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
Second 3×3 convolution, same as conv1. Together, two 3×3 convs have a 5×5 effective receptive field but with two non-linearities instead of one.

self.bn2 = nn.BatchNorm2d(channels)
Batch norm after the second conv. Note: ReLU is NOT applied here; it comes AFTER the skip addition. This ordering (BN before addition) is important for gradient flow.

def forward(self, x):
Forward pass with the skip connection pattern: save input, process through main path, add input back, then activate. Input x: [batch, channels, H, W], e.g. [64, 32, 28, 28].

identity = x
Save the input tensor for the skip connection. This costs no extra memory (a Python reference, not a copy). identity is the same tensor as x, [batch, 32, 28, 28]: the 'shortcut' or 'skip' that bypasses the conv layers and is added back after them.

out = F.relu(self.bn1(self.conv1(x)))
Main path step 1: Conv → BatchNorm → ReLU. The convolution detects patterns, batch norm stabilizes, ReLU activates. Shape: [batch, 32, 28, 28], the same spatial size (padding=1 with kernel=3).

out = self.bn2(self.conv2(out))
Main path step 2: Conv → BatchNorm (NO ReLU yet). Shape: [batch, 32, 28, 28]; this is F(x), the learned residual. Applying ReLU before the addition would clip negative residuals, and the network needs to learn both positive and negative corrections to the input.

out = out + identity  # <-- THE KEY LINE
The skip connection: element-wise addition of the original input (identity) and the processed output F(x), two [batch, 32, 28, 28] tensors that must have identical shapes. The result is F(x) + x = H(x). During backpropagation the gradient flows through BOTH paths: ∂L/∂x = ∂L/∂out · (∂F/∂x + 1). Even if ∂F/∂x ≈ 0 (vanishing gradient through the convs), the '+1' gives the gradient a direct path past them. This is why ResNets can be 152 layers deep.

out = F.relu(out)
Apply ReLU AFTER the addition. The combined signal H(x) = F(x) + x is activated, then passed to the next block. Returns: [batch, 32, 28, 28], the activated output of the residual block.

class SimpleResNet(nn.Module):
A minimal ResNet for MNIST: one initial conv, three residual blocks, global average pooling, and a single FC layer. Only ~57K parameters, but it demonstrates the full ResNet pattern. 3 residual blocks = 6 conv layers; adding 1 initial conv and 1 FC gives effectively 8 learnable layers. Each block has a skip connection.

self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
Initial convolution: maps the 1-channel grayscale input to a 32-channel feature representation. This is the only conv layer without a skip connection.

self.res1 = ResidualBlock(32)
First residual block: 32 channels in, 32 channels out. Two 3×3 convolutions with batch norm, plus skip connection.

self.res2 = ResidualBlock(32)
Second residual block. Same architecture as res1 but with different learned weights.

self.res3 = ResidualBlock(32)
Third residual block. Three blocks with 2 convs each = 6 conv layers in the residual stack.

self.pool = nn.AdaptiveAvgPool2d(1)
Global Average Pooling: reduces each 28×28 feature map to a single value by averaging all spatial positions, so [batch, 32, 28, 28] becomes [batch, 32, 1, 1]. This replaces the flatten + large FC layer pattern, dramatically reducing parameters. nn.AdaptiveAvgPool2d(output_size) adapts the pooling window to produce the target output size regardless of input size; with output_size = 1 it pools the ENTIRE feature map into one value per channel. Used in ResNet, GoogLeNet, and most modern architectures.

self.fc = nn.Linear(32, 10)
Single FC layer: 32 → 10 classes. Only 32 × 10 + 10 = 330 parameters! Compare to Section 1's fc1 with 200,832 parameters. Global average pooling makes the classification head orders of magnitude smaller than a flatten + FC approach.

x = F.relu(self.bn1(self.conv1(x)))
Initial conv block: 1 → 32 channels. This transforms the raw pixel input into a 32-channel feature representation that the residual blocks can work with. Shape: [batch, 32, 28, 28].

x = self.res1(x)  # [batch, 32, 28, 28]
First residual block: conv → bn → relu → conv → bn → add identity → relu. The output has the same shape as the input, which the skip addition requires.

x = self.res2(x)  # [batch, 32, 28, 28]
Second residual block. Gradients flow through both the block and the skip connection back to the input.

x = self.res3(x)  # [batch, 32, 28, 28]
Third residual block. The gradient for this block has a direct path through three skip connections back to conv1.

x = self.pool(x)  # [batch, 32, 1, 1]
Global average pooling: the 28×28 = 784 spatial positions are averaged into 1. Each of the 32 channels becomes a single number summarizing 'how much of this feature is present in the entire image'.

x = x.view(x.size(0), -1)  # [batch, 32]
Remove the spatial dimensions (1×1) by flattening. 32 channels × 1 × 1 = 32, so the shape is [batch, 32].

x = self.fc(x)  # [batch, 10]
Final classification: 32 → 10 digit scores. The entire classification head is just this one small layer. Returns: [batch, 10], the logits for digits 0-9.

total = sum(p.numel() for p in model.parameters())
Our SimpleResNet has 56,586 parameters (~57K): 384 in the initial conv + BN, 3 × 18,624 in the residual blocks, and 330 in the FC layer. That is comparable to LeNet-5 but with skip connections, batch norm, and ReLU. It would achieve ~99% on MNIST, same as our CNN from Section 1. For scale, the real ResNet-152 has about 60M parameters for 1,000 ImageNet classes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A single residual block: two convolutions + skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Save input for skip connection
        identity = x

        # Main path: conv -> bn -> relu -> conv -> bn
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        # Add skip connection, then activate
        out = out + identity   # <-- THE KEY LINE
        out = F.relu(out)
        return out

class SimpleResNet(nn.Module):
    """Minimal ResNet for MNIST demonstration."""
    def __init__(self):
        super().__init__()
        # Initial convolution
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        # Stack of residual blocks
        self.res1 = ResidualBlock(32)
        self.res2 = ResidualBlock(32)
        self.res3 = ResidualBlock(32)

        # Classification head
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, 10)

    def forward(self, x):
        # x: [batch, 1, 28, 28]
        x = F.relu(self.bn1(self.conv1(x)))  # [batch, 32, 28, 28]
        x = self.res1(x)                      # [batch, 32, 28, 28]
        x = self.res2(x)                      # [batch, 32, 28, 28]
        x = self.res3(x)                      # [batch, 32, 28, 28]
        x = self.pool(x)                      # [batch, 32, 1, 1]
        x = x.view(x.size(0), -1)             # [batch, 32]
        x = self.fc(x)                        # [batch, 10]
        return x

model = SimpleResNet()
total = sum(p.numel() for p in model.parameters())
print(f"SimpleResNet parameters: {total:,}")
# SimpleResNet parameters: 56,586
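The parameter total can be cross-checked by hand without running PyTorch. The sketch below assumes the layer definitions exactly as listed (convolutions with the default bias, BatchNorm contributing γ and β per channel, running statistics not counted because they are buffers rather than parameters). The helper names are made up for this check.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights (plus optional biases) of a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def batchnorm_params(c):
    """Learnable gamma and beta per channel; running stats are buffers."""
    return 2 * c

def linear_params(n_in, n_out):
    return n_in * n_out + n_out

stem = conv2d_params(1, 32, 3) + batchnorm_params(32)          # conv1 + bn1
block = 2 * (conv2d_params(32, 32, 3) + batchnorm_params(32))  # one ResidualBlock
head = linear_params(32, 10)                                   # fc
total = stem + 3 * block + head

print(stem, block, head)   # 384 18624 330
print(total)               # 56586
```

Almost all of the budget sits in the six 3×3 convolutions of the residual stack; the classification head is under 1% of the total.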

The Bottleneck Block (ResNet-50+)

ResNet-18 and ResNet-34 use the basic block we just built. Deeper variants — ResNet-50, -101, -152 — cannot afford two 3×3 convolutions at full channel width; the parameter count and compute would explode. He et al. (2015) replaced the basic block with a bottleneck block that squeezes channels with a 1×1 conv, does the spatial work in the cheap low-dimensional space, then restores channels with another 1×1.

For a 256-channel stage the whole block has about $70{,}000$ parameters instead of about $1{,}180{,}000$ for two 3×3 convs at full width, a 17× reduction. This is how ResNet-50 stays close to ResNet-34 in parameter count (roughly 25.6M versus 21.8M) despite being 16 layers deeper, and how ResNet-152 still needs fewer FLOPs than VGG-19.
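The figures above cover parameters; the same ratio carries over to compute. The sketch below counts convolution weights only (no BatchNorm, bias=False) and estimates multiply-accumulates (MACs) under the assumption of stride 1 and padding that preserves the spatial size. The 56×56 feature-map size is an assumed value for illustration, and the helper name is invented.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a k x k convolution with bias=False."""
    return c_in * c_out * k * k

channels, mid = 256, 64
h = w = 56   # assumed feature-map size, for illustration only

basic = 2 * conv_weights(channels, channels, 3)
bottleneck = (conv_weights(channels, mid, 1)
              + conv_weights(mid, mid, 3)
              + conv_weights(mid, channels, 1))

print(f"basic block weights:      {basic:,}")        # 1,179,648
print(f"bottleneck block weights: {bottleneck:,}")   # 69,632
print(f"ratio: {basic / bottleneck:.1f}x")           # 16.9x

# With stride 1 and size-preserving padding, every weight is used once
# per output pixel, so MACs = weights * H * W.
print(f"basic MACs:      {basic * h * w:,}")         # 3,699,376,128
print(f"bottleneck MACs: {bottleneck * h * w:,}")    # 218,365,952
```

Adding the three BatchNorm layers (2 × (64 + 64 + 256) = 768 parameters) brings the bottleneck total to 70,400, which is the "about 70,000" quoted in the text.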

Bottleneck block — the heart of ResNet-50+

bottleneck_block.py: line-by-line walkthrough

class BottleneckBlock(nn.Module)
The block used everywhere in ResNet-50, ResNet-101, ResNet-152. Replaces the basic block's two 3×3 convs with a 1×1 / 3×3 / 1×1 sandwich that is dramatically cheaper at the same channel width.

def __init__(self, channels, reduction=4)
channels is the input/output channel count, e.g. 256 in a mid-ResNet stage; the block must produce the same 256-channel output so it can be added to the skip path. reduction is the 'squeeze factor': with reduction=4 the middle (3×3) conv only operates on channels/4 feature maps. The ResNet paper uses 4. A larger reduction is cheaper but less expressive.

mid = channels // reduction
The bottleneck width. For channels=256, reduction=4: mid = 64.

self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
1×1 convolution that REDUCES channels from 256 to 64. Zero spatial footprint. bias=False because the immediately following BatchNorm has its own shift parameter, so the conv bias would be redundant. In nn.Conv2d(C_in, C_out, k, bias), k=1 means a 'pointwise' conv: each output pixel is a linear combination of the input's channels at the same location. Parameters: C_in × C_out × k² = 256 × 64 × 1 = 16,384 weights. No bias.

self.bn1 = nn.BatchNorm2d(mid)
BatchNorm over the 64-channel intermediate. Adds 2 × 64 = 128 learnable parameters (γ, β per channel) and 2 × 64 = 128 non-learnable running-stats buffers.

self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
The actual spatial work: a 3×3 conv, but ONLY on the 64-channel reduced representation. This is the entire point of the bottleneck: do the expensive spatial convolution in the cheap low-channel space. Parameters: 64 × 64 × 9 = 36,864 weights. Compare with a plain 3×3 on 256 channels: 256 × 256 × 9 = 589,824, which is 16× more.

self.bn2 = nn.BatchNorm2d(mid)
BatchNorm after the 3×3 conv. Same channel count (64).

self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)
1×1 conv that RESTORES channels from 64 back to 256. Symmetric with conv1. Required because the skip path carries the original 256-channel tensor and the two must match to be added. Parameters: 64 × 256 × 1 = 16,384 weights. Together the two 1×1 convs cost 32,768, still far less than one plain 3×3 on 256 channels (589,824).

self.bn3 = nn.BatchNorm2d(channels)
BatchNorm on the restored 256-channel output. Note: no ReLU on the output of conv3/bn3 before adding the skip; the activation happens AFTER the add.

self.relu = nn.ReLU(inplace=True)
Single ReLU module reused across the forward pass. inplace=True overwrites the input tensor to save memory, which is safe here because the pre-ReLU value is not needed elsewhere.

def forward(self, x)
The forward pass. x has shape (N, channels, H, W). The skip path keeps x unchanged; the residual path transforms it through the three convs.

identity = x
Remember the input for the skip connection. No copy is made, just a reference.

out = self.relu(self.bn1(self.conv1(x)))  # 1×1 reduce
Step 1: reduce channels 256 → 64, then BN, then ReLU. This 1×1 conv is far cheaper than the 3×3 that follows, and it prepares the signal for the spatial work.

out = self.relu(self.bn2(self.conv2(out)))  # 3×3 spatial work
Step 2: the actual spatial convolution, on the cheap 64-channel representation. All of the block's 'understanding' of local patterns happens here.

out = self.bn3(self.conv3(out))  # 1×1 restore, no ReLU
Step 3: restore channels 64 → 256, BN, NO ReLU. The skip addition comes before the final activation so the residual branch can still output negative values.

out = out + identity  # skip connection
The core of ResNet. The 256-channel output of the residual branch is added element-wise to the original 256-channel input. If the residual branch outputs zero, the block is effectively the identity.

return self.relu(out)  # single activation after add
Final activation. Only ONE ReLU sits on the skip path, which keeps the gradient highway clean. Returns: a tensor of shape (N, channels, H, W), the same as the input and ready to feed into the next block.

import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet-50+ bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore.

    All the spatial work happens in the cheap low-channel middle.
    The 1x1s on either side shrink and restore the channel count.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction                     # e.g. 256 -> 64

        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.bn1   = nn.BatchNorm2d(mid)

        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(mid)

        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.bn3   = nn.BatchNorm2d(channels)

        self.relu  = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))    # reduce channels (1x1)
        out = self.relu(self.bn2(self.conv2(out)))  # spatial work (3x3)
        out = self.bn3(self.conv3(out))             # restore channels (1x1, NO ReLU)
        out = out + identity                        # skip
        return self.relu(out)                       # activate once after add

Why the skip path does not need a projection here. When channel count changes between blocks (e.g., from 256 to 512 at a stage boundary) the skip path needs a 1×1 projection so its channel count matches the residual branch before adding. Within a stage the bottleneck block keeps the input/output channel count equal — that is the entire reason the third 1×1 conv restores to channels, not to mid.
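For the stage-boundary case, a projection on the skip path might look like the following NumPy sketch. It reuses the bias-free 1×1 channel mix from the earlier NumPy listing and leaves out BatchNorm and the stride-2 downsampling that real ResNets also apply at stage boundaries. `projected_residual_block` and `Ws` are hypothetical names introduced here, and the 4 → 8 channel change is a toy stand-in for 256 → 512.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def mix(x, W):
    # 1x1 conv as a per-pixel channel mix: (C_in, H, W) -> (C_out, H, W)
    return np.einsum('cij,oc->oij', x, W)

def projected_residual_block(x, W1, W2, Ws):
    """Residual block whose branch changes the channel count.

    Ws is a hypothetical 1x1 projection on the skip path, so both
    operands of the addition end up with C_out channels."""
    branch = mix(relu(mix(x, W1)), W2)   # (C_out, H, W)
    skip = mix(x, Ws)                    # (C_out, H, W)
    return relu(branch + skip)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))        # 4 input channels
W1 = rng.standard_normal((8, 4)) * 0.1    # branch: 4 -> 8
W2 = rng.standard_normal((8, 8)) * 0.1    # branch: 8 -> 8
Ws = rng.standard_normal((8, 4)) * 0.1    # skip projection: 4 -> 8

y = projected_residual_block(x, W1, W2, Ws)
print(y.shape)                            # (8, 3, 3)

# Without the projection the shapes (8, 3, 3) and (4, 3, 3) cannot be added.
try:
    mix(relu(mix(x, W1)), W2) + x
except ValueError as err:
    print("identity skip fails:", type(err).__name__)
```

The projection costs extra parameters and is no longer a pure identity, which is why it is used only where the shapes force it.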

Comparing Our Three Architectures

	Our CNN (Section 1)	LeNet-5	SimpleResNet
Year of design	Modern	1998	2015
Parameters	206,922	44,426	56,586
Conv layers	2	2	7 (1 + 3×2)
Skip connections	No	No	Yes (3 blocks)
Activation	ReLU	tanh	ReLU
Pooling	MaxPool	AvgPool	Global AvgPool
Batch Norm	No	No	Yes
MNIST accuracy	~99%	~98.5%	~99.2%
Key strength	Simple, effective	Historical first	Scales to extreme depth

Skip connections are everywhere now. ResNet's residual learning idea has become one of the most widely adopted architectural innovations in deep learning. Transformers use residual connections around every attention and FFN layer. U-Net uses skip connections between encoder and decoder. DenseNet connects each layer to every subsequent layer within a dense block. The principle is universal: give gradients a highway to flow through.

Choosing an Architecture

With so many architectures available, how do you choose? Here is a practical decision framework:

Scenario	Recommended Architecture	Why
Learning/teaching CNNs	Custom small CNN (Section 1)	Transparent, easy to trace, fast to train
Small dataset, limited compute	ResNet-18 (pretrained)	Pretrained ImageNet features transfer well; the small model is cheap to fine-tune
Medium dataset, good GPU	ResNet-50	Best accuracy/efficiency trade-off
Mobile deployment	MobileNet v3 or EfficientNet	Designed for low latency and memory
Maximum accuracy, unlimited compute	EfficientNet-B7 or ConvNeXt	Among the most accurate convolutional models on ImageNet
Object detection	ResNet/ResNeXt backbone + FPN	Standard backbone for detection frameworks

In practice, you almost never design a CNN from scratch. You pick a pretrained backbone (usually ResNet or EfficientNet), freeze or fine-tune it for your task, and add a custom classification head. This is transfer learning — the topic of the next section.

Looking Ahead: In the next section, we will take a pretrained ResNet that has already learned to recognize 1,000 categories of objects, and adapt it to a completely new task with just a few hundred images. The features learned from ImageNet (edges, textures, shapes, parts, objects) transfer remarkably well to almost any visual task.

References

The architectural timeline above compresses roughly two decades of research into a single thread. Each entry below is the original paper that introduced a named innovation. Cite these, not this section, in academic work.

LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324. DOI: 10.1109/5.726791. — LeNet-5.
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NeurIPS 2012). — AlexNet.
Lin, M., Chen, Q. & Yan, S. (2013). Network In Network. ICLR 2014 / arXiv:1312.4400. — 1×1 convolutions and Global Average Pooling.
Simonyan, K. & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 / arXiv:1409.1556. — VGG.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. & Rabinovich, A. (2014). Going Deeper with Convolutions. CVPR 2015 / arXiv:1409.4842. — GoogLeNet / Inception-v1.
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015 / arXiv:1502.03167. — BatchNorm.
He, K., Zhang, X., Ren, S. & Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016 / arXiv:1512.03385. — ResNet (basic block and bottleneck).