Chapter 14

Building a CNN from Scratch

CNN Architectures

From Pixels to Predictions

In Chapter 13, we mastered the individual pieces: convolution operations, kernel mechanics, multi-channel filtering, and the output size formula. Now it's time to assemble these pieces into a complete, working CNN that can look at a handwritten digit and tell you what number it is.

We will build a CNN that classifies MNIST digits — 28×28 grayscale images of handwritten numbers 0 through 9. By the end of this section, you will have a network that achieves 99% accuracy with just two convolutional layers and about 207,000 parameters.

The CNN Recipe: A CNN is not just convolutions. It is a carefully designed pipeline: convolution (detect patterns) → activation (add non-linearity) → pooling (compress spatial info) → repeat → flatten (reshape for classification) → fully-connected layers (make the final decision). Each piece has a specific job.

But before we can build this pipeline, we need one crucial piece that Chapter 13 introduced but did not fully explore: pooling.


The Missing Piece: Pooling

After a convolutional layer detects features across the image, we face a problem: the feature maps are the same spatial size as the input (when using padding). A 28×28 image produces 28×28 feature maps. If we stack many conv layers, the computational cost grows rapidly. We need a way to shrink the spatial dimensions while keeping the important information.

That is exactly what pooling does. It slides a window across each feature map and summarizes each region with a single value. The two most common types are:

  • Max Pooling: Keep the maximum value in each window. This preserves the strongest activation — if a vertical edge was detected somewhere in a 2×2 region, max pooling keeps that detection regardless of its exact position.
  • Average Pooling: Compute the mean of all values in the window. This creates a smooth summary of activations, useful as a final aggregation layer.
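To make the two summaries concrete, here is a minimal NumPy sketch applied to the 4×4 feature map used in the examples in this section. The helper name `avg_pool_2x2` and the reshape trick are illustrative choices, not part of the original material, and the helper assumes the input height and width are even.

```python
import numpy as np

# Same 4x4 feature map used in the examples in this section
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
], dtype=float)

def avg_pool_2x2(x):
    """Non-overlapping 2x2 average pooling (hypothetical helper).

    Assumes x is 2D with even height and width.
    """
    H, W = x.shape
    # Split each axis into (blocks, 2) and average over the within-block axes
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

window = feature_map[0:2, 0:2]   # top-left window: [[1, 3], [5, 6]]
print(window.max())              # max pooling keeps 6.0
print(window.mean())             # average pooling gives 3.75

avg = avg_pool_2x2(feature_map)
print(avg)                       # block means: 3.75, 2.25, 5.25, 3.75
```

Max pooling reports only the strongest response in the window, while average pooling blends all four values, which is why the two outputs differ even on the same window.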

Why Pooling Matters

Pooling provides three critical benefits:

  1. Spatial reduction: A 2×2 pooling with stride 2 halves both height and width, reducing the number of values by 4×. This means the next conv layer has 4× fewer positions to process.
  2. Translation invariance: If a feature shifts by 1 pixel in the input, max pooling often produces the same output. The network cares about what was detected, not exactly where.
  3. Larger receptive field: After pooling, each position in the next layer's feature map “sees” a larger region of the original image. This enables higher layers to detect larger-scale patterns.
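The translation-invariance claim can be checked with a toy experiment. The sketch below is an illustration under simple assumptions (a single strong activation on an otherwise empty 4×4 map, and hypothetical helper names): a shift that stays inside one 2×2 window leaves the pooled map unchanged, while a shift that crosses a window boundary does not.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def single_activation(row, col, size=4):
    """Feature map that is zero everywhere except one strong response."""
    fm = np.zeros((size, size))
    fm[row, col] = 9.0
    return fm

base = max_pool_2x2(single_activation(0, 0))
shift_inside = max_pool_2x2(single_activation(0, 1))   # same 2x2 window
shift_across = max_pool_2x2(single_activation(0, 2))   # next window over

same_inside = np.array_equal(base, shift_inside)
same_across = np.array_equal(base, shift_across)
print(same_inside)   # True: the pooled map did not change
print(same_across)   # False: the detection moved to a neighbouring cell
```

This is why the invariance is described as holding "often" rather than always: it depends on where the feature sits relative to the pooling grid.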

The Pooling Formula

The output size formula for pooling is the same as for convolution (no padding is typically used):

$$O = \left\lfloor \frac{I - K}{S} \right\rfloor + 1$$

For the standard 2×2 max pooling with stride 2: $O = \lfloor \frac{I - 2}{2} \rfloor + 1 = \lfloor \frac{I}{2} \rfloor$, which is exactly $\frac{I}{2}$ when $I$ is even. So a 28×28 feature map becomes 14×14, and a 14×14 becomes 7×7.
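The formula is easy to check in a few lines of plain Python. The function name `pool_output_size` is a hypothetical helper; it assumes no padding, as in the formula above.

```python
def pool_output_size(size, kernel, stride):
    """O = floor((I - K) / S) + 1 for an unpadded sliding window."""
    return (size - kernel) // stride + 1

# Two rounds of 2x2, stride-2 pooling starting from a 28x28 map
sizes = [28]
for _ in range(2):
    sizes.append(pool_output_size(sizes[-1], kernel=2, stride=2))
print(sizes)                        # [28, 14, 7]

print(pool_output_size(7, 2, 2))    # 3: the last row and column are dropped
print(pool_output_size(4, 2, 1))    # 3: stride 1 gives overlapping windows
```

The odd-sized case shows what the floor does in practice: a 7×7 map pooled with a 2×2 window and stride 2 yields 3×3, because the window cannot fit a fourth time.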

Pooling has no learnable parameters. It is a fixed operation — the network does not learn how to pool, only what to pool (via the preceding conv layer's learned filters). This is fundamentally different from convolution, where the kernel weights are learned.

[Interactive figure: Max Pooling Visualizer. A 2×2 pool with stride 2 slides over a 4×4 input feature map with rows (1, 3, 2, 4), (5, 6, 1, 2), (7, 2, 3, 1), (4, 8, 5, 6). Step 1 of 4 covers window [0:2, 0:2] = [1, 3, 5, 6], and max(1, 3, 5, 6) = 6 becomes the first cell of the 2×2 output.]
Max Pooling keeps the strongest activation in each window. It preserves what was detected while discarding where exactly it appeared — this gives the network translation invariance.

Max Pooling from Scratch

Let's implement max pooling ourselves to fully understand the mechanics. We will process a 4×4 feature map with a 2×2 pooling window and stride 2:

Max Pooling — Step-by-Step
🐍max_pooling.py
import numpy as np

NumPy provides the ndarray type and mathematical functions we need for matrix operations. We use np.array() to create our feature map, np.zeros() to allocate output, and np.max() to find window maxima.

EXECUTION STATE
numpy = Numerical computing library — provides fast N-dimensional arrays and element-wise operations implemented in C
feature_map = np.array([[1, 3, 2, 4], ...])

We create a 4x4 matrix representing the output of a convolutional layer. In a real CNN, this would contain activation values from one filter — higher values mean the filter detected its pattern more strongly at that location.

EXECUTION STATE
feature_map (4×4) =
  col0  col1  col2  col3
    1     3     2     4
    5     6     1     2
    7     2     3     1
    4     8     5     6
→ shape = (4, 4) — 4 rows, 4 columns. 16 activation values total.
def max_pool_2d(x, pool_size=2, stride=2)

Defines a function that performs 2D max pooling. The pool window slides across the input, and at each position we keep only the maximum value. This reduces spatial dimensions while preserving the strongest activations.

EXECUTION STATE
⬇ input: x (4×4) =
  col0  col1  col2  col3
    1     3     2     4
    5     6     1     2
    7     2     3     1
    4     8     5     6
⬇ input: pool_size = 2 = Height and width of the pooling window. A 2×2 window looks at 4 values at a time. Larger windows = more aggressive downsampling.
⬇ input: stride = 2 = How many pixels the window moves between positions. stride=2 means non-overlapping windows (the standard choice). stride=1 would create overlapping windows.
⬆ returns = np.ndarray of shape (H_out, W_out) — the pooled feature map with reduced spatial dimensions
H, W = x.shape

Unpack the spatial dimensions of the input. NumPy’s .shape attribute returns a tuple of dimension sizes.

EXECUTION STATE
H = 4 — number of rows in the input feature map
W = 4 — number of columns in the input feature map
📚 .shape = NumPy ndarray attribute: returns a tuple of array dimensions. For a 2D array: (rows, cols)
H_out = (H - pool_size) // stride + 1

Computes the output height using the same dimension formula from Chapter 13, but now applied to pooling instead of convolution. The formula works identically because pooling is also a sliding window operation.

EXECUTION STATE
H_out = (4 - 2) // 2 + 1 = 2 // 2 + 1 = 1 + 1 = 2
→ formula = O = ⌊(I - K) / S⌋ + 1. With I=4, K=2, S=2: the 2×2 window fits exactly 2 times along each axis.
W_out = (W - pool_size) // stride + 1

Same formula for the width dimension. Since our input is square (4×4), the output is also square (2×2).

EXECUTION STATE
W_out = (4 - 2) // 2 + 1 = 2
output = np.zeros((H_out, W_out))

Allocate a 2×2 output array filled with zeros. We will fill each cell as the pooling window slides across the input.

EXECUTION STATE
📚 np.zeros(shape) = Creates an array of the given shape filled with 0.0. Example: np.zeros((2,2)) → [[0, 0], [0, 0]]
output (2×2) =
[[0.0, 0.0],
 [0.0, 0.0]]
for i in range(H_out):  # row loop

Outer loop iterates over output rows. With H_out=2, i takes values 0 and 1. Each i corresponds to a horizontal band of the input.

LOOP TRACE · 2 iterations
i=0
row band = Input rows 0–1 (top half of the 4×4 grid)
i=1
row band = Input rows 2–3 (bottom half of the 4×4 grid)
for j in range(W_out):  # column loop

Inner loop iterates over output columns. Combined with the outer loop, we visit every non-overlapping 2×2 region of the input.

LOOP TRACE · 4 iterations
i=0, j=0
window position = Top-left 2×2 block: rows[0:2], cols[0:2]
i=0, j=1
window position = Top-right 2×2 block: rows[0:2], cols[2:4]
i=1, j=0
window position = Bottom-left 2×2 block: rows[2:4], cols[0:2]
i=1, j=1
window position = Bottom-right 2×2 block: rows[2:4], cols[2:4]
r = i * stride

Compute the starting row index in the input. Multiplying by stride converts output coordinates to input coordinates.

EXECUTION STATE
r (when i=0) = 0 * 2 = 0 — window starts at input row 0
r (when i=1) = 1 * 2 = 2 — window starts at input row 2
c = j * stride

Compute the starting column index in the input.

EXECUTION STATE
c (when j=0) = 0 * 2 = 0 — window starts at input col 0
c (when j=1) = 1 * 2 = 2 — window starts at input col 2
window = x[r:r+pool_size, c:c+pool_size]

Extract the 2×2 window from the input using NumPy slice notation. x[r:r+2, c:c+2] selects rows r to r+1 and columns c to c+1.

EXECUTION STATE
── Window at (i=0, j=0) ── =
x[0:2, 0:2] =
[[1, 3],
 [5, 6]]
── Window at (i=0, j=1) ── =
x[0:2, 2:4] =
[[2, 4],
 [1, 2]]
── Window at (i=1, j=0) ── =
x[2:4, 0:2] =
[[7, 2],
 [4, 8]]
── Window at (i=1, j=1) ── =
x[2:4, 2:4] =
[[3, 1],
 [5, 6]]
output[i, j] = np.max(window)

Take the maximum value from the 2×2 window and store it in the output. This is the core of max pooling — we keep only the strongest activation.

EXECUTION STATE
📚 np.max(array) = Returns the single largest element in the array. For a 2×2 array, it checks all 4 values and returns the biggest.
── (i=0, j=0) ── =
max([1,3,5,6]) = 6 ← the ‘6’ dominates this region
── (i=0, j=1) ── =
max([2,4,1,2]) = 4 ← the ‘4’ dominates this region
── (i=1, j=0) ── =
max([7,2,4,8]) = 8 ← the ‘8’ dominates this region
── (i=1, j=1) ── =
max([3,1,5,6]) = 6 ← the ‘6’ dominates this region
return output

Return the completed 2×2 pooled feature map. We reduced a 4×4 grid (16 values) to a 2×2 grid (4 values) — a 4x compression while keeping the strongest signals.

EXECUTION STATE
⬆ return: output (2×2) =
[[6.0, 4.0],
 [8.0, 6.0]]
result = max_pool_2d(feature_map)

Call our max pooling function on the 4×4 feature map. The function processes all four 2×2 windows and returns the pooled result.

EXECUTION STATE
result (2×2) =
[[6.0, 4.0],
 [8.0, 6.0]]
→ compression = 4×4 (16 values) → 2×2 (4 values). 75% of the data is discarded, but the strongest activations survive.
print(result)

Display the pooled output. Each value is the maximum from its corresponding 2×2 region of the input.

EXECUTION STATE
printed output =
[[6. 4.]
 [8. 6.]]
Complete listing of max_pooling.py:

```python
import numpy as np

# A 4x4 feature map (output from a conv layer)
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6]
])

def max_pool_2d(x, pool_size=2, stride=2):
    H, W = x.shape
    # Output size: floor((I - K) / S) + 1 along each axis
    H_out = (H - pool_size) // stride + 1
    W_out = (W - pool_size) // stride + 1
    output = np.zeros((H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            # Map output coordinates back to the window's top-left corner
            r = i * stride
            c = j * stride
            window = x[r:r+pool_size, c:c+pool_size]
            output[i, j] = np.max(window)

    return output

result = max_pool_2d(feature_map)
print(result)
# [[6. 4.]
#  [8. 6.]]
```

Compare max pooling to the convolution operation from Chapter 13. Both slide a window across the input, but convolution computes a weighted sum (dot product with a learned kernel), while max pooling simply picks the maximum value. No multiplication, no learning — just selection.

Max Pooling in PyTorch

PyTorch provides nn.MaxPool2d, which handles batches, multiple channels, and GPU acceleration:

| Our Implementation | PyTorch Equivalent |
|---|---|
| max_pool_2d(x, pool_size=2, stride=2) | nn.MaxPool2d(kernel_size=2, stride=2)(x) |
| Processes single 2D array | Processes [batch, channels, H, W] tensors |
| CPU only, Python loops | GPU-accelerated, optimized C++/CUDA |
| Educational | Production-ready |
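As a quick cross-check, the sketch below runs nn.MaxPool2d on the same 4×4 feature map and on a random batch. The batch shape [8, 16, 28, 28] is an arbitrary illustration, and the two `unsqueeze` calls are just one way to add the batch and channel dimensions the layer expects.

```python
import torch
import torch.nn as nn

feature_map = torch.tensor([
    [1., 3., 2., 4.],
    [5., 6., 1., 2.],
    [7., 2., 3., 1.],
    [4., 8., 5., 6.],
])

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# MaxPool2d expects [batch, channels, H, W], so add two leading dimensions
x = feature_map.unsqueeze(0).unsqueeze(0)   # shape [1, 1, 4, 4]
pooled = pool(x)                            # shape [1, 1, 2, 2]
print(pooled.squeeze())                     # values 6, 4 / 8, 6

# Each image and channel is pooled independently; only H and W shrink
batch = torch.randn(8, 16, 28, 28)
out_shape = tuple(pool(batch).shape)
print(out_shape)                            # (8, 16, 14, 14)
```

The pooled values match the from-scratch result, which suggests the two implementations agree on this input.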

Quick Check

What is the output shape when applying MaxPool2d(2, 2) to a tensor of shape [8, 32, 14, 14]?


Why ReLU, Not tanh or Sigmoid

Before we assemble the network we need to answer a practical question: which activation function goes between the convolution and the pooling? The 1998 LeNet paper used $\tanh$. Modern CNNs almost universally use $\mathrm{ReLU}(x) = \max(0, x)$. What changed?

The Vanishing-Gradient Problem of Smooth Squashers

Both $\tanh$ and $\sigma(x) = 1/(1+e^{-x})$ are S-shaped “squashers” that saturate at the extremes. When the pre-activation $x$ is large in magnitude, the derivative is almost zero:

| Function | f(x) | f'(x) | f'(x=5) |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^-x) | f(x) · (1 - f(x)) | ≈ 0.0066 |
| tanh | (e^x - e^-x) / (e^x + e^-x) | 1 - tanh(x)² | ≈ 0.00018 |
| ReLU | max(0, x) | 0 if x<0, else 1 | 1.0 |

In a 10-layer network using $\tanh$, gradients propagating backward through the chain rule multiply ten derivatives, each typically much less than 1. The gradient arriving at the earliest layer can be $10^{-10}$ of the loss gradient, which is effectively zero. The early filters never learn. This is the vanishing-gradient problem, documented by Glorot & Bengio (2010) and solved architecturally in several ways; ReLU was the simplest.
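The derivative values and the shrinking chain-rule product can be reproduced in plain Python. This is a simplified sketch: it assumes every layer sits at the same pre-activation (x = 2 is an arbitrary choice) and ignores the weight matrices, so it illustrates the trend rather than any real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

x = 5.0
print(f"sigmoid'(5) = {d_sigmoid(x):.4f}")   # 0.0066
print(f"tanh'(5)    = {d_tanh(x):.5f}")      # 0.00018
print(f"relu'(5)    = {d_relu(x):.1f}")      # 1.0

# Chain rule through 10 layers, all assumed to sit at pre-activation 2.0
layers = 10
tanh_chain = d_tanh(2.0) ** layers
relu_chain = d_relu(2.0) ** layers
print(f"tanh chain: {tanh_chain:.1e}")       # far below 1e-10
print(f"relu chain: {relu_chain:.1f}")       # 1.0
```

Even at a moderate pre-activation the tanh product collapses by many orders of magnitude, while the ReLU product stays at 1 for active units.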

Nair & Hinton (2010) and Glorot, Bordes & Bengio (2011) showed that ReLU trains deeper networks faster. Its gradient is either exactly 0 (for $x < 0$) or exactly 1 (for $x > 0$). There is no gradual decay. When the signal is “on”, the gradient flows through untouched.

The dead-ReLU caveat. If a neuron's pre-activation stays negative for every input, the gradient is always zero and its weights never update, leaving a “dead” neuron. Two common escape hatches: Leaky ReLU ($f(x) = \max(0.01x, x)$) and He initialisation (weight variance scaled to $2/n_{\text{in}}$), which tunes initial weights so most neurons start “alive”. We use plain ReLU here because MNIST is forgiving; He initialisation is a one-line fix for larger networks.
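Both escape hatches fit in a few lines of NumPy. In this sketch the layer sizes, the random seed, and the choice of a normal distribution for the weights are illustrative assumptions; only the 0.01 slope and the 2/n_in variance come from the text.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """f(x) = max(slope * x, x): negative inputs keep a small gradient."""
    return np.maximum(slope * x, x)

activated = leaky_relu(np.array([-2.0, 0.0, 3.0]))
print(activated)                 # values -0.02, 0.0, 3.0

# He initialisation: zero-mean weights with variance 2 / n_in.
# n_in and n_out are illustrative layer sizes, not prescribed values.
rng = np.random.default_rng(0)
n_in, n_out = 1568, 128
he_std = np.sqrt(2.0 / n_in)
weights = rng.normal(0.0, he_std, size=(n_out, n_in))

print(round(he_std, 4))          # 0.0357
print(weights.std())             # close to he_std
```

The negative input still produces a small non-zero output under Leaky ReLU, so its gradient never vanishes completely.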

Our CNN Blueprint

Now we have all the building blocks: convolution (Chapter 13), activation functions (Chapter 5), pooling (above), and fully-connected layers (Chapter 10). Let's assemble them into a complete CNN for digit classification.

[Interactive diagram: CNN Architecture: MNIST Digit Classifier, 206,922 parameters. Layer types shown: Convolution, Activation, Pooling, Reshape, Fully Connected, Dropout.]

Shape flow: 1×28×28 → 16×28×28 → 16×14×14 → 32×14×14 → 32×7×7 → 1568 → 128 → 10

Dimension Tracking: Every Tensor Shape

Understanding exactly how the tensor shape transforms at each layer is critical. Here is the complete flow for a single image:

| Layer | Operation | Output Shape | Parameters | Purpose |
|---|---|---|---|---|
| Input | MNIST image | 1 × 28 × 28 | 0 | Raw grayscale pixels |
| Conv1 | Conv2d(1, 16, 3, padding=1) | 16 × 28 × 28 | 160 | Detect edges and textures |
| ReLU | max(0, x) | 16 × 28 × 28 | 0 | Non-linearity |
| Pool1 | MaxPool2d(2, 2) | 16 × 14 × 14 | 0 | Halve spatial size |
| Conv2 | Conv2d(16, 32, 3, padding=1) | 32 × 14 × 14 | 4,640 | Combine edges into shapes |
| ReLU | max(0, x) | 32 × 14 × 14 | 0 | Non-linearity |
| Pool2 | MaxPool2d(2, 2) | 32 × 7 × 7 | 0 | Halve again |
| Flatten | Reshape | 1,568 | 0 | Conv → FC bridge |
| FC1 | Linear(1568, 128) | 128 | 200,832 | Dense reasoning |
| ReLU | max(0, x) | 128 | 0 | Non-linearity |
| Dropout | Drop 25% | 128 | 0 | Regularization |
| FC2 | Linear(128, 10) | 10 | 1,290 | Digit class scores |

Total: 206,922 parameters. Notice that the FC1 layer alone accounts for 97% of all parameters. This is a common pattern in CNNs — the convolutional layers are parameter-efficient (weight sharing across spatial positions), but the first fully-connected layer creates a large weight matrix.
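The parameter counts follow directly from the layer shapes. The two helper functions below are hypothetical names for the standard counting rules (weights plus one bias per output), written out so the total and the FC1 share can be verified by hand.

```python
def conv2d_params(in_ch, out_ch, k):
    """Weights (out_ch * in_ch * k * k) plus one bias per output channel."""
    return out_ch * in_ch * k * k + out_ch

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features

counts = {
    "conv1": conv2d_params(1, 16, 3),        # 160
    "conv2": conv2d_params(16, 32, 3),       # 4,640
    "fc1": linear_params(32 * 7 * 7, 128),   # 200,832
    "fc2": linear_params(128, 10),           # 1,290
}
total = sum(counts.values())
fc1_share = counts["fc1"] / total

print(total)                 # 206922
print(f"{fc1_share:.1%}")    # 97.1%
```

Note that the convolutional counts do not depend on the image size at all, which is the weight-sharing effect described above.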

Design pattern: As spatial dimensions decrease (28 → 14 → 7), channel count increases (1 → 16 → 32). This is the classic CNN trade-off: we lose spatial resolution but gain representational depth. Compare the two ends of the convolutional stack: $28 \times 28 \times 1 = 784$ pixels go in and $7 \times 7 \times 32 = 1{,}568$ features come out, so the final representation is only about twice the size of the raw input.

The PyTorch Implementation

Let's build this CNN in PyTorch using nn.Module. The pattern is the same one we learned in Chapter 10 for MLPs: define layers in __init__, wire them together in forward.

Complete CNN in PyTorch
🐍digit_cnn.py
import torch

PyTorch is the deep learning framework we use for tensor operations, automatic differentiation, and GPU acceleration. It provides the core Tensor type and functions like torch.relu().

EXECUTION STATE
torch = Core PyTorch library — provides Tensor class, math operations, and autograd engine for computing gradients
import torch.nn as nn

The neural network module contains all the building blocks: layers (Conv2d, Linear), loss functions (CrossEntropyLoss), and the Module base class for defining networks.

EXECUTION STATE
torch.nn = Neural network module — contains Conv2d, Linear, MaxPool2d, Dropout, ReLU, and 100+ other layer types
class DigitCNN(nn.Module):

Define our CNN as a class inheriting from nn.Module. This is PyTorch’s standard pattern — nn.Module provides parameter tracking, .train()/.eval() mode switching, GPU transfer via .to(device), and model saving/loading. Every PyTorch network follows this pattern.

EXECUTION STATE
📚 nn.Module = Base class for all PyTorch neural networks. Tracks all learnable parameters, handles training/eval modes, and enables .parameters() iteration for optimizers.
DigitCNN = Our custom CNN for classifying MNIST digits (0–9). Input: 28×28 grayscale images. Output: 10 class scores.
def __init__(self):

Constructor where we define all layers. Layers should be created here and assigned as attributes (not created inside forward()) so that nn.Module can register their parameters and the same weights are reused on every call.

super().__init__()

Call the parent nn.Module constructor. This initializes the internal parameter registry, hooks system, and training mode flag. Forgetting this line raises an error as soon as the first layer is assigned to self.

EXECUTION STATE
super() = Calls nn.Module.__init__() — sets up self._parameters, self._modules, self.training=True, and other internal state
self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)

First convolutional layer: 16 filters scan the input image for basic patterns like edges, corners, and gradients. With padding=1, the spatial size stays 28×28.

EXECUTION STATE
📚 nn.Conv2d(in, out, kernel_size, padding) = Creates a 2D convolutional layer. Stores a weight tensor of shape [out_channels, in_channels, kH, kW] and a bias of shape [out_channels]. Forward: slides filters across input and computes dot products.
⬇ arg: in_channels = 1 = Input has 1 channel (grayscale). For RGB images this would be 3.
⬇ arg: out_channels = 16 = Learn 16 different filters. Each filter detects a different pattern. More filters = more patterns detected, but more computation.
⬇ arg: kernel_size = 3 = Each filter is 3×3 pixels. Small enough to detect local patterns, large enough to capture edges. Most common choice in modern CNNs.
⬇ arg: padding = 1 = Add 1 pixel of zeros around the input border. With kernel=3 and padding=1: output_size = (28 - 3 + 2×1)/1 + 1 = 28. Preserves spatial dimensions.
→ weight shape = [16, 1, 3, 3] — 16 filters × 1 input channel × 3×3 kernel = 144 weights + 16 biases = 160 parameters
self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

Max pooling layer that halves the spatial dimensions. Takes the maximum from each 2×2 window, reducing 28×28 to 14×14.

EXECUTION STATE
📚 nn.MaxPool2d(kernel_size, stride) = Slides a window across each channel independently, keeping only the max value. No learnable parameters — it is a fixed operation.
⬇ arg: kernel_size = 2 = Window size 2×2. Looks at 4 values at a time.
⬇ arg: stride = 2 = Move 2 pixels between windows (non-overlapping). Output = 28/2 = 14.
→ parameters = 0 — max pooling has no learnable parameters, it is a deterministic operation
self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

Second convolutional layer: 32 filters that each span all 16 input channels. These combine the edge features from conv1 into higher-level patterns like curves, loops, and corners — the building blocks of digits.

EXECUTION STATE
⬇ arg: in_channels = 16 = Receives 16 feature maps from conv1+pool1. Each of the 32 new filters looks at all 16 channels simultaneously.
⬇ arg: out_channels = 32 = 32 filters to detect 32 different higher-level patterns. Doubling channels when halving spatial size is a classic CNN design pattern.
→ weight shape = [32, 16, 3, 3] — 32 filters × 16 channels × 3×3 kernel = 4,608 weights + 32 biases = 4,640 parameters
self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

Second pooling layer halves dimensions again: 14×14 → 7×7. After this, we have 32 feature maps of 7×7 = 32×49 = 1,568 total feature values.

EXECUTION STATE
→ spatial reduction = 14/2 = 7. Two poolings reduced 28×28=784 pixels to 7×7=49 spatial positions, but with 32 channels of information.
self.fc1 = nn.Linear(32 * 7 * 7, 128)

First fully-connected layer. Takes the flattened feature maps (1,568 values) and projects them into 128 dimensions. This is where the network reasons about which digit the extracted features represent.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Fully-connected (dense) layer. Stores a weight matrix W of shape [out, in] and bias b of shape [out]. Computes: output = input @ W.T + b
⬇ arg: in_features = 32 * 7 * 7 = 1568 = 32 channels × 7 height × 7 width. This must match the flattened size of the last conv block’s output.
⬇ arg: out_features = 128 = Compress 1,568 features into 128 high-level representations. This is a bottleneck that forces the network to learn compact digit representations.
→ parameters = 1568 × 128 + 128 = 200,832 — this single layer has 97% of the network’s parameters!
self.dropout = nn.Dropout(0.25)

Dropout layer for regularization. During training, randomly zeroes 25% of the values to prevent the network from memorizing the training data.

EXECUTION STATE
📚 nn.Dropout(p) = During training: randomly sets p fraction of inputs to 0, scales rest by 1/(1-p). During eval: passes all values through unchanged. Prevents co-adaptation of neurons.
⬇ arg: p = 0.25 = Drop 25% of neurons each forward pass. The network cannot rely on any single neuron. Mild regularization — 0.5 is more aggressive.
self.fc2 = nn.Linear(128, 10)

Output layer: 10 neurons, one per digit class (0 through 9). The highest-scoring neuron is the network’s prediction.

EXECUTION STATE
⬇ arg: in_features = 128 = Takes the 128-dimensional hidden representation.
⬇ arg: out_features = 10 = One score per digit class: [score_0, score_1, ..., score_9]. These raw scores are called logits.
→ parameters = 128 × 10 + 10 = 1,290
def forward(self, x):

The forward pass: defines how data flows through the network. PyTorch calls this automatically when you write model(input). The input x is a batch of MNIST images.

EXECUTION STATE
⬇ input: x = Tensor of shape [batch, 1, 28, 28]. Example: a batch of 64 grayscale 28×28 images → shape [64, 1, 28, 28]
⬆ returns = Tensor of shape [batch, 10] — 10 logit scores per image
x = torch.relu(self.conv1(x))

Apply the first convolution followed by ReLU activation. Conv1 detects patterns, ReLU zeros out negative responses. The conv preserves spatial size (padding=1), so the shape changes only in the channel dimension: 1 → 16.

EXECUTION STATE
📚 torch.relu(x) = Element-wise ReLU: max(0, x). Negative values become 0, positive values pass through unchanged. Introduces non-linearity.
→ after conv1 = [batch, 16, 28, 28] — 16 feature maps, each 28×28
→ after relu = [batch, 16, 28, 28] — same shape, but all negative activations are now 0
x = self.pool1(x)

Max pooling halves the spatial dimensions. Each 2×2 region is replaced by its maximum value. This gives translation invariance and reduces computation for subsequent layers.

EXECUTION STATE
→ shape = [batch, 16, 14, 14] (spatial size halved: 28 → 14, channels unchanged at 16)
→ data reduction = 28×28 = 784 → 14×14 = 196 per channel (4× compression)
x = torch.relu(self.conv2(x))

Second conv block: 32 filters combine the 16 edge maps into higher-level patterns. Each filter spans all 16 input channels, so it can detect combinations like “horizontal edge above vertical edge” = corner.

EXECUTION STATE
→ after conv2 + relu = [batch, 32, 14, 14] — 32 feature maps, each 14×14. These encode shapes and textures, not just edges.
x = self.pool2(x)

Second pooling reduces 14×14 to 7×7. After two conv-pool blocks, we have compact but rich feature representations.

EXECUTION STATE
→ shape = [batch, 32, 7, 7] — 32 channels of 7×7 spatial information
→ total values = 32 × 7 × 7 = 1,568 feature values per image
x = x.view(x.size(0), -1)

Flatten the 3D feature maps into a 1D vector. This is the bridge between convolutional layers (which need spatial structure) and fully-connected layers (which need flat vectors). view() reshapes without copying data.

EXECUTION STATE
📚 .view(shape) = Reshapes a tensor without changing the underlying data. Like numpy.reshape(). The -1 tells PyTorch to infer that dimension automatically.
⬇ arg: x.size(0) = The batch size (e.g., 64). We keep the batch dimension intact.
⬇ arg: -1 = Infer this dimension: total elements / batch_size = 32×7×7 = 1568. So -1 becomes 1568.
→ shape = [batch, 1568] — each image is now a flat vector of 1,568 features
x = torch.relu(self.fc1(x))

Dense layer projects 1,568 features to 128 hidden units, then ReLU activates. This layer learns to combine spatial features into digit-level representations.

EXECUTION STATE
→ shape = [batch, 128] — 128 high-level features summarizing the entire image
x = self.dropout(x)

During training: randomly zero out 25% of the 128 values (forces robustness) and scale the survivors by 1/(1 - 0.25) so the expected activation stays the same. During evaluation (model.eval()): all values pass through unchanged, with no scaling.

EXECUTION STATE
→ shape = [batch, 128] — same shape, but ~32 values are zeroed (during training only)
→ training example = [0.42, 0.00, 0.87, 0.00, 0.31, 0.65, ...] — some values randomly set to 0
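The training-versus-evaluation behaviour can be sketched in NumPy. This is a simplified stand-in for nn.Dropout rather than its actual implementation; the function name, the seed, and the all-ones input are illustrative assumptions.

```python
import numpy as np

def dropout(x, p=0.25, training=True, rng=None):
    """Inverted dropout: zero a fraction p of values, rescale the survivors."""
    if not training or p == 0.0:
        return x                                  # evaluation: identity
    rng = rng if rng is not None else np.random.default_rng()
    keep_mask = rng.random(x.shape) >= p          # True with probability 1 - p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)                    # fixed seed for repeatability
activations = np.ones(128)

train_out = dropout(activations, p=0.25, training=True, rng=rng)
eval_out = dropout(activations, p=0.25, training=False)

print(np.unique(train_out))          # survivors are 1 / 0.75, the rest are 0
print(int((train_out == 0).sum()))   # roughly 32 of 128, varies with the seed
print(np.array_equal(eval_out, activations))   # True
```

Scaling during training is what lets evaluation skip any correction: the expected value of each activation is already the same in both modes.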
x = self.fc2(x)

Final layer: 128 → 10 logits. No activation function here — CrossEntropyLoss applies softmax internally. The highest logit is the predicted digit.

EXECUTION STATE
→ shape = [batch, 10] — one score per digit class
→ example output = [-2.1, -1.5, 0.3, 8.7, -0.2, 0.1, -1.8, 3.2, -0.5, 0.4] — highest score at index 3 → predicted digit: 3
return x

Return the 10 logit scores. During training, these go to the loss function. During inference, we take argmax to get the predicted digit.

EXECUTION STATE
⬆ return: x = Tensor [batch, 10] — raw scores (logits) for digits 0–9
model = DigitCNN()

Instantiate the CNN. This creates all layers with randomly initialized weights (PyTorch's default for both Conv2d and Linear is a Kaiming-uniform scheme). The model is ready for training.

EXECUTION STATE
model = DigitCNN instance with 206,922 randomly initialized parameters across 4 layers with learnable weights (conv1, conv2, fc1, fc2)
total = sum(p.numel() for p in model.parameters())

Count all learnable parameters in the model. model.parameters() yields every weight and bias tensor; .numel() returns the number of elements in each tensor.

EXECUTION STATE
📚 model.parameters() = Generator yielding all learnable tensors: conv1.weight(16,1,3,3), conv1.bias(16), conv2.weight(32,16,3,3), conv2.bias(32), fc1.weight(128,1568), fc1.bias(128), fc2.weight(10,128), fc2.bias(10)
📚 .numel() = Returns the total number of elements in a tensor. E.g., tensor of shape [16,1,3,3].numel() = 144
→ breakdown = conv1: 160 + conv2: 4,640 + fc1: 200,832 + fc2: 1,290 = 206,922
Complete listing of digit_cnn.py:

```python
import torch
import torch.nn as nn

class DigitCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv Block 1: detect edges and simple textures
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Conv Block 2: combine edges into shapes
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Classification Head
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.dropout = nn.Dropout(0.25)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # x: [batch, 1, 28, 28]
        x = torch.relu(self.conv1(x))   # -> [batch, 16, 28, 28]
        x = self.pool1(x)                # -> [batch, 16, 14, 14]
        x = torch.relu(self.conv2(x))   # -> [batch, 32, 14, 14]
        x = self.pool2(x)                # -> [batch, 32, 7, 7]
        x = x.view(x.size(0), -1)       # -> [batch, 1568]
        x = torch.relu(self.fc1(x))     # -> [batch, 128]
        x = self.dropout(x)              # -> [batch, 128]
        x = self.fc2(x)                  # -> [batch, 10]
        return x

model = DigitCNN()
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")
# Total parameters: 206,922
```
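Before training, it can help to push a dummy batch through the network and confirm every intermediate shape. The sketch below rebuilds the same layer stack with nn.Sequential so that it is self-contained; that rewrite, the use of nn.Flatten in place of x.view, and the batch size of 4 are assumptions made for illustration, not the section's own code.

```python
import torch
import torch.nn as nn

# Same layer stack as DigitCNN, expressed with nn.Sequential
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(128, 10),
)

model.eval()                              # disable dropout for the check
dummy = torch.randn(4, 1, 28, 28)         # fake batch of 4 "images"

shapes = []
with torch.no_grad():
    x = dummy
    for layer in model:
        x = layer(x)
        shapes.append((type(layer).__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:<10} {shape}")

logits = x
num_params = sum(p.numel() for p in model.parameters())
print(logits.argmax(dim=1))               # one predicted class index per image
print(num_params)                         # 206922
```

With untrained weights the predicted classes are meaningless; the point of the check is only that the shapes line up with the dimension table.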

Key Design Decisions

| Decision | Our Choice | Why |
|---|---|---|
| Kernel size | 3 × 3 | Small enough for local patterns, standard in modern CNNs (VGG, ResNet) |
| Padding | 1 (same padding) | Preserves spatial dimensions through conv layers, only pooling reduces size |
| Channels | 1 → 16 → 32 | Double channels when halving spatial size (classic pattern from VGG) |
| Pooling | MaxPool 2 × 2, stride 2 | Halves dimensions, provides translation invariance |
| Hidden FC size | 128 | Small enough to prevent overfitting on MNIST, large enough for 10 classes |
| Dropout | 0.25 | Mild regularization; MNIST is simple, so heavy dropout is unnecessary |
| Output activation | None (raw logits) | CrossEntropyLoss applies softmax internally for numerical stability |

Quick Check

Why do we NOT apply softmax after the last linear layer?


Training on MNIST

Now let's train our CNN on the MNIST dataset. The training loop follows the same pattern from Chapter 11: load data → forward pass → compute loss → backward pass → update weights. The only difference is that we're now feeding images through a CNN instead of flat vectors through an MLP.

Training the CNN on MNIST
🐍train_mnist.py
import torch.optim as optim

The optimization module contains all gradient-based optimizers: SGD, Adam, AdaGrad, RMSProp, etc. These update model weights using the gradients computed by backpropagation.

EXECUTION STATE
torch.optim = Optimizer library. Adam (our choice) combines momentum + adaptive learning rates. SGD is simpler but requires careful tuning.
from torchvision import datasets, transforms

torchvision provides standard vision datasets (MNIST, CIFAR, ImageNet) and image transformation utilities (resize, normalize, augment).

EXECUTION STATE
datasets = Pre-built dataset classes: MNIST, CIFAR10, ImageNet, FashionMNIST, etc. Handles downloading and loading.
transforms = Image preprocessing pipeline: ToTensor converts PIL images to tensors, Normalize standardizes pixel values, and Compose chains multiple transforms.
from torch.utils.data import DataLoader

DataLoader wraps a dataset and provides batching, shuffling, and parallel data loading. It is the standard way to feed data to a training loop.

EXECUTION STATE
DataLoader = Iterator that yields (batch_images, batch_labels) tuples. Handles mini-batch creation, random shuffling, and multi-process loading.
transform = transforms.Compose([...])

Create a preprocessing pipeline that will be applied to every image when it is loaded. Compose chains multiple transforms sequentially: first convert to tensor, then normalize.

EXECUTION STATE
📚 transforms.Compose(list) = Chains transforms into a single callable. Compose([A, B])(img) is equivalent to B(A(img)). Applies transforms in order.
transforms.ToTensor()

Converts a PIL Image (H×W×C, values 0–255) to a PyTorch tensor (C×H×W, values 0.0–1.0). Also reorders dimensions from HWC to CHW, which is what Conv2d expects.

EXECUTION STATE
📚 ToTensor() = PIL Image (28×28, uint8 0-255) → Tensor (1×28×28, float32 0.0-1.0). Divides by 255 and adds channel dimension.
transforms.Normalize((0.1307,), (0.3081,))

Standardize pixel values to zero mean and unit variance. The magic numbers 0.1307 and 0.3081 are the precomputed mean and standard deviation of the entire MNIST training set.

EXECUTION STATE
📚 Normalize(mean, std) = Applies: output = (input - mean) / std for each channel. Shifts the data distribution to center around 0 with spread ±1.
⬇ arg: mean = (0.1307,) = Average pixel intensity across all MNIST training images. One value because MNIST is single-channel (grayscale).
⬇ arg: std = (0.3081,) = Standard deviation of pixel intensities. After normalization: pixel_new = (pixel - 0.1307) / 0.3081.
→ why normalize? = Centering inputs around 0 helps gradient descent converge faster. The network’s initial weights assume inputs near 0.
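To see what the normalization does to actual pixel values, the arithmetic can be applied by hand. The sketch below uses plain Python rather than torchvision; the function name is hypothetical, and the two inputs are the extremes of the 0.0 to 1.0 range produced by ToTensor.

```python
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

def normalize(pixel, mean=MNIST_MEAN, std=MNIST_STD):
    """Same arithmetic as Normalize: (input - mean) / std."""
    return (pixel - mean) / std

background = normalize(0.0)   # darkest possible pixel after ToTensor
ink = normalize(1.0)          # brightest possible pixel after ToTensor

print(round(background, 4))   # -0.4242
print(round(ink, 4))          # 2.8215
```

The range is asymmetric around zero because most MNIST pixels are dark background, which pulls the mean well below 0.5.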
10train_data = datasets.MNIST(..., train=True, ...)

Load the MNIST training set: 60,000 handwritten digit images, each 28×28 pixels grayscale with a label 0–9.

EXECUTION STATE
⬇ arg: train=True = Use the 60,000-image training split (not the 10,000-image test split).
⬇ arg: download=True = Download MNIST from the internet if not already cached locally. Only downloads once.
⬇ arg: transform=transform = Apply our ToTensor + Normalize pipeline to each image when it is accessed.
13test_data = datasets.MNIST(..., train=False, ...)

Load the MNIST test set: 10,000 images the model has never seen during training. Used to measure how well the model generalizes.

EXECUTION STATE
⬇ arg: train=False = Use the 10,000-image test split. These images are from different writers than the training set.
16train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

Create an iterator that yields batches of 64 images. Shuffling randomizes the order each epoch so the model does not memorize sequence patterns.

EXECUTION STATE
⬇ arg: batch_size = 64 = Process 64 images per gradient update. Larger batches = smoother gradients but more memory. 64 is a common default.
⬇ arg: shuffle = True = Randomize image order each epoch. Prevents the model from learning the order of examples.
→ batches per epoch = 60,000 / 64 = 938 batches (last batch has 32 images)
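The batch count can be checked with a few lines of arithmetic. This sketch assumes the loader keeps the final partial batch, which is the DataLoader default (drop_last=False); the function name is hypothetical.

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int) -> tuple[int, int]:
    """Return (number of batches, size of the last batch) when partial batches are kept."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch

print(batches_per_epoch(60_000, 64))    # (938, 32)
print(batches_per_epoch(10_000, 1000))  # (10, 1000)
```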
17test_loader = DataLoader(test_data, batch_size=1000)

Larger batch for testing (1000 images at a time) since we do not need gradients during evaluation, so memory is less constrained.

20model = DigitCNN()

Create a fresh instance of our CNN with randomly initialized weights.

21criterion = nn.CrossEntropyLoss()

Cross-entropy loss for multi-class classification. It combines log-softmax and negative log-likelihood into one numerically stable operation. Lower loss = better predictions.

EXECUTION STATE
📚 nn.CrossEntropyLoss() = Expects raw logits [batch, 10] and integer labels [batch]. Internally: softmax(logits) → -log(probability of correct class). Perfect for classification.
22optimizer = optim.Adam(model.parameters(), lr=0.001)

Adam optimizer: adaptive learning rate + momentum. It automatically adjusts the step size for each parameter based on the history of its gradients.

EXECUTION STATE
📚 optim.Adam(params, lr) = Combines: (1) momentum (remembers past gradient direction), (2) adaptive lr (larger steps for infrequent features, smaller for frequent). Default betas=(0.9, 0.999).
⬇ arg: lr = 0.001 = Initial learning rate. Adam’s adaptive behavior makes it less sensitive to this choice than SGD. 0.001 is the standard starting point.
25for epoch in range(3):

Train for 3 full passes through the training set. Each epoch processes all 60,000 images. MNIST is simple enough that 3 epochs reach ~99% accuracy.

LOOP TRACE · 3 iterations
epoch=0 (Epoch 1)
state = Random weights → learning basic patterns. Accuracy jumps from ~10% to ~94%
epoch=1 (Epoch 2)
state = Fine-tuning. Accuracy: 94% → 98%. Most digits are now correctly classified.
epoch=2 (Epoch 3)
state = Polish. Accuracy: 98% → 99%. Remaining errors are ambiguous/badly written digits.
26model.train()

Set the model to training mode. This activates dropout (randomly zeroing neurons) and ensures batch normalization uses batch statistics. Must be called before each training epoch.

EXECUTION STATE
📚 .train() = Sets self.training = True for all modules. Affects Dropout (active), BatchNorm (uses batch stats). Does NOT start training — just sets the mode flag.
29for images, labels in train_loader:

Iterate over mini-batches. Each iteration yields 64 images and their corresponding digit labels.

EXECUTION STATE
images = Tensor [64, 1, 28, 28] — batch of 64 grayscale 28×28 images, normalized
labels = Tensor [64] — integer labels 0–9 for each image. Example: tensor([3, 7, 1, 0, ...])
31outputs = model(images)

Forward pass: run the batch through all layers of the CNN. This calls model.forward(images) and returns 10 logit scores per image.

EXECUTION STATE
outputs = Tensor [64, 10] — 10 raw scores per image. Higher score = model thinks that digit is more likely.
32loss = criterion(outputs, labels)

Compute how wrong the predictions are. CrossEntropyLoss applies softmax to the logits, then measures how far the predicted probability distribution is from the true labels.

EXECUTION STATE
loss = Scalar tensor, e.g., tensor(0.2341). Lower is better. A perfect model would have loss ≈ 0.
→ example = If outputs[0] = [.1,.2,.1,.1,8.5,.1,.1,.1,.1,.1] and labels[0] = 4, softmax(outputs[0])[4] ≈ 0.998, loss = -log(0.998) ≈ 0.002 (very good)
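The loss for a single example can be reproduced by hand. The sketch below is plain Python rather than PyTorch: it computes log-softmax with the usual max-subtraction trick and returns the negative log-probability of the true class, which is what cross-entropy on one sample amounts to. The function name is illustrative.

```python
import math

def cross_entropy(logits: list[float], label: int) -> float:
    """Negative log-softmax probability of the true class, computed stably."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

logits = [0.1, 0.2, 0.1, 0.1, 8.5, 0.1, 0.1, 0.1, 0.1, 0.1]
loss = cross_entropy(logits, label=4)
prob = math.exp(-loss)  # softmax probability of the true class
print(f"p(correct) = {prob:.3f}, loss = {loss:.4f}")
# p(correct) = 0.998, loss = 0.0020

# The same logits with a wrong label are heavily penalised.
wrong_loss = cross_entropy(logits, label=0)
```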
35optimizer.zero_grad()

Reset all gradient accumulators to zero. PyTorch accumulates gradients by default (useful for gradient accumulation), so we must clear them before each backward pass.

36loss.backward()

Backpropagation: compute the gradient of the loss with respect to every learnable parameter in the model. PyTorch’s autograd traces the computation graph and applies the chain rule automatically.

EXECUTION STATE
→ result = Every parameter tensor now has a .grad attribute containing its gradient. Example: conv1.weight.grad has shape [16, 1, 3, 3].
37optimizer.step()

Update all parameters using the computed gradients. Adam computes adaptive learning rates and applies the update: param = param - lr * adjusted_gradient.
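For intuition about "adjusted_gradient", here is a sketch of the standard Adam update for a single scalar parameter, written in plain Python. It follows the published update rule with the default hyperparameters mentioned above (eps=1e-8 is assumed); it is not PyTorch's actual implementation, and `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

w, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(round(w, 6))  # 0.999

# A gradient 100x larger produces almost the same first step.
w_big, _, _ = adam_step(1.0, grad=50.0, m=0.0, v=0.0, t=1)
```

On the first step the update is close to lr × sign(grad) regardless of the gradient's magnitude, which is one reason Adam is less sensitive to the learning-rate choice than plain SGD.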

40running_loss += loss.item()

Accumulate the scalar loss value for this batch. .item() extracts a Python float from a 0-dimensional tensor, which is more memory-efficient than keeping the full computation graph.

EXECUTION STATE
📚 .item() = Extracts a Python number from a single-element tensor. float(loss) would also work; the important point is to accumulate a plain number rather than the loss tensor itself, which would keep each batch's computation graph alive.
41_, predicted = outputs.max(1)

Find the predicted digit for each image in the batch. outputs.max(1) returns (max_values, indices) along dimension 1 (the class dimension). We only need the indices.

EXECUTION STATE
📚 .max(dim) = Returns a named tuple (values, indices) of the max along the specified dimension. dim=1 means find the max across the 10 class scores for each image.
_ = The max values (we discard these — we only care about which class has the highest score)
predicted = Tensor [64] — predicted digit for each image. Example: tensor([3, 7, 1, 0, ...])
42correct += predicted.eq(labels).sum().item()

Count how many predictions match the true labels in this batch.

EXECUTION STATE
📚 .eq(other) = Element-wise equality comparison. Returns a boolean tensor: True where predicted matches labels, False otherwise.
→ example = predicted=[3,7,1,0], labels=[3,7,2,0] → eq=[True,True,False,True] → sum=3 correct
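The prediction and counting steps can be mirrored in plain Python. This is a sketch of the logic only; the helpers `argmax` and `count_correct` are illustrative stand-ins for outputs.max(1) and predicted.eq(labels).sum(), and the toy scores use 4 classes instead of 10 for brevity.

```python
def argmax(scores: list[float]) -> int:
    """Index of the largest score (the first one wins on ties in this sketch)."""
    return max(range(len(scores)), key=lambda i: scores[i])

def count_correct(predicted: list[int], labels: list[int]) -> int:
    """Number of positions where the prediction equals the label."""
    return sum(1 for p, y in zip(predicted, labels) if p == y)

# Toy batch: 3 images, 4 class scores each.
batch_scores = [
    [0.1, 0.2, 0.0, 3.1],
    [2.5, 0.3, 0.1, 0.0],
    [0.0, 0.4, 1.9, 0.2],
]
toy_predicted = [argmax(row) for row in batch_scores]
print(toy_predicted)  # [3, 0, 2]

# The example from the text: three of four predictions match.
print(count_correct([3, 7, 1, 0], [3, 7, 2, 0]))  # 3
```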
46model.eval()

Switch to evaluation mode. Disables dropout (all neurons active) and switches batch normalization to use running statistics. Must be called before testing.

EXECUTION STATE
📚 .eval() = Sets self.training = False. Dropout passes all values through (no zeroing). BatchNorm uses stored running mean/var instead of batch statistics.
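The difference between the two modes is easiest to see on dropout. The sketch below is a plain-Python approximation of inverted dropout, the scheme PyTorch uses: surviving activations are scaled by 1/(1 - p) during training so that evaluation needs no rescaling. The function and its `rng` argument are hypothetical, and p = 0.25 is the rate quoted for this model.

```python
import random

def dropout(values, p=0.25, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training:
        return list(values)  # eval mode: identity
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # seeded so repeated runs drop the same units
activations = [1.0] * 8

train_out = dropout(activations, training=True, rng=rng)
eval_out = dropout(activations, training=False)

print(train_out)  # a mix of 0.0 and 1.333...
print(eval_out)   # unchanged
```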
48with torch.no_grad():

Context manager that disables gradient tracking. Since we are only evaluating (not training), we do not need gradients. Skipping the computation graph saves memory and usually makes the forward pass faster.

EXECUTION STATE
📚 torch.no_grad() = Disables autograd. Tensors created inside this block will not track operations for backprop. Essential for evaluation loops.
55train_acc = 100.0 * correct / total

Convert the running correct/total count into a percentage.

EXECUTION STATE
→ epoch 1 example = 100 × 56,580 / 60,000 = 94.3%
56test_acc = 100.0 * test_correct / test_total

Test accuracy on the held-out 10,000 images the model has never seen during training.

EXECUTION STATE
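→ epoch 1 example = 100 × 9,820 / 10,000 = 98.2%. This is higher than the epoch's train accuracy for two reasons: the train figure is a running average that includes early batches processed while the weights were still nearly random, and dropout is active during training but disabled at test time.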
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Step 1: Prepare MNIST data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST(
    'data', train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
    'data', train=False, transform=transform
)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Step 2: Set up model, loss, optimizer
model = DigitCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Step 3: Train for 3 epochs
for epoch in range(3):
    model.train()
    running_loss, correct, total = 0.0, 0, 0

    for images, labels in train_loader:
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass + update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track metrics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

    # Evaluate on test set
    model.eval()
    test_correct, test_total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = outputs.max(1)
            test_correct += predicted.eq(labels).sum().item()
            test_total += labels.size(0)

    train_acc = 100.0 * correct / total
    test_acc = 100.0 * test_correct / test_total
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}  "
          f"Train={train_acc:.1f}%  Test={test_acc:.1f}%")

# Epoch 1: Loss=0.1842  Train=94.3%  Test=98.2%
# Epoch 2: Loss=0.0571  Train=98.2%  Test=98.8%
# Epoch 3: Loss=0.0392  Train=98.7%  Test=99.0%

Understanding the Results

| Epoch | Training Loss | Train Accuracy | Test Accuracy |
| --- | --- | --- | --- |
| 1 | 0.1842 | 94.3% | 98.2% |
| 2 | 0.0571 | 98.2% | 98.8% |
| 3 | 0.0392 | 98.7% | 99.0% |

Several things stand out:

  1. Rapid learning: After just 1 epoch (one pass through 60,000 images), the model already achieves 94% training accuracy. The convolutional structure makes learning dramatically easier than a fully-connected network.
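  2. Test > Train accuracy: Notice that test accuracy (98.2%) exceeds training accuracy (94.3%) in epoch 1. Two effects contribute. First, the training figure is a running average over the whole epoch, so it includes the early batches processed while the weights were still close to random, whereas the test figure is measured once, after the epoch's final update. Second, dropout is active during training (randomly disabling 25% of neurons) but disabled during testing. The model is more capable than its training accuracy suggests.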
  3. 99% in 3 epochs: Our tiny 207K-parameter CNN reaches 99% test accuracy on MNIST. For context, a simple logistic regression achieves ~92% and a fully-connected MLP achieves ~97%. The convolutional structure provides a meaningful advantage on image data.
Why CNNs Beat MLPs on Images: An MLP has no built-in notion of which pixels are neighbours: as far as its weights are concerned, pixel (0,0) has no special relationship to pixel (0,1). A CNN exploits spatial locality (nearby pixels are related) and weight sharing (the same filter scans every position). This inductive bias matches the structure of images, so the network learns more from less data.
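Weight sharing can be quantified with a quick parameter count. The sketch below compares the first conv layer (16 filters of 3×3 on one channel) with a hypothetical fully-connected layer that would produce the same 16×28×28 output from the 28×28 input; that dense layer is an illustration only and is not part of the model.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus one bias per filter for a square-kernel conv layer."""
    return out_channels * (in_channels * kernel * kernel + 1)

def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit for a fully-connected layer."""
    return out_features * (in_features + 1)

conv = conv2d_params(1, 16, 3)
dense = dense_params(28 * 28, 16 * 28 * 28)  # hypothetical dense equivalent

print(f"conv:  {conv:,}")               # conv:  160
print(f"dense: {dense:,}")              # dense: 9,847,040
print(f"ratio: {dense // conv:,}x")     # ratio: 61,544x
```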

What the CNN Learned

The most fascinating part of training a CNN is examining what the filters actually learned. Remember: we never told the network about edge detection, Sobel filters, or any image processing concept. We only showed it digits and labels. Let's see what gradient descent discovered on its own.

What Did the CNN Learn?
🐍visualize_filters.py
2filters = model.conv1.weight.data

Access the learned weight tensor of the first convolutional layer. After training, these weights have been optimized by gradient descent to detect useful patterns in digit images.

EXECUTION STATE
📚 .weight.data = Every nn.Conv2d stores its learnable kernel weights in .weight (a Parameter). .data accesses the raw tensor without gradient tracking.
filters shape = [16, 1, 3, 3] — 16 filters, each spanning 1 input channel, each 3×3 pixels
→ interpretation = Each of the 16 filters is a 3×3 pattern detector. Filter 0 detects one pattern, filter 1 detects another, etc. Together they form the network’s ‘visual vocabulary’ for low-level features.
6print(filters[0, 0].numpy().round(2))

Display filter 0’s weights as a readable 3×3 matrix. filters[0, 0] selects filter index 0, channel 0 (the only input channel). Notice the pattern: negative on the left, positive on the right — this is a vertical edge detector!

EXECUTION STATE
filters[0, 0] (3×3) =
  col0   col1   col2
-0.18   0.02   0.31
-0.27   0.05   0.34
-0.15   0.01   0.28
→ pattern = Left column: negative (-0.18, -0.27, -0.15). Right column: positive (0.31, 0.34, 0.28). This creates a strong response at vertical brightness transitions — exactly like a Sobel-X filter!
📚 .numpy() = Converts a PyTorch tensor to a NumPy array for printing. Both share the same memory.
📚 .round(2) = Round all values to 2 decimal places for readability.
13print(filters[5, 0].numpy().round(2))

Filter 5 shows a horizontal edge pattern: positive values on top, negative on bottom. This detects horizontal brightness transitions.

EXECUTION STATE
filters[5, 0] (3×3) =
  col0    col1    col2
 0.22    0.25    0.19
 0.01   -0.03    0.02
-0.20   -0.28   -0.17
→ pattern = Top row: positive (0.22, 0.25, 0.19). Bottom row: negative (-0.20, -0.28, -0.17). This responds to horizontal edges — resembles a Sobel-Y filter!
20Sobel vertical edge kernel comparison

The hand-designed Sobel kernel from Chapter 13 has the same structure as our learned Filter 0: negative left, zero center, positive right. The CNN independently discovered this fundamental image processing operation!

EXECUTION STATE
Sobel-X kernel =
[[-1, 0, 1],
 [-2, 0, 2],
 [-1, 0, 1]]
Learned Filter 0 =
[[-0.18, 0.02, 0.31],
 [-0.27, 0.05, 0.34],
 [-0.15, 0.01, 0.28]]
→ key insight = Same left-negative, right-positive structure. The magnitudes differ because the CNN also adapts to MNIST’s specific pixel statistics, but the pattern is unmistakably a vertical edge detector.
25# The CNN discovered edge detectors on its own!

This is one of the most remarkable results in deep learning: given only pixel values and digit labels, gradient descent converges to filters that match what neuroscientists found in the biological visual cortex (V1 simple cells). The math of optimization discovers the same structures that evolution found over millions of years.

# After training, extract what conv1 learned
filters = model.conv1.weight.data   # shape: [16, 1, 3, 3]
print(f"Layer 1: {filters.shape[0]} learned filters")

# Filter 0: resembles a vertical edge detector
print("\nFilter 0 (vertical edges):")
print(filters[0, 0].numpy().round(2))
# [[-0.18  0.02  0.31]
#  [-0.27  0.05  0.34]
#  [-0.15  0.01  0.28]]

# Filter 5: resembles a horizontal edge detector
print("\nFilter 5 (horizontal edges):")
print(filters[5, 0].numpy().round(2))
# [[ 0.22  0.25  0.19]
#  [ 0.01 -0.03  0.02]
#  [-0.20 -0.28 -0.17]]

# Compare with the Sobel kernel from Chapter 13!
print("\nSobel vertical edge kernel:")
print("[[-1, 0, 1],")
print(" [-2, 0, 2],")
print(" [-1, 0, 1]]")

# The CNN discovered edge detectors on its own!
# No one told it about Sobel — it learned from data.
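
To see how similar the two kernels behave, the following plain-Python sketch applies the Filter 0 values printed above and the Sobel-X kernel to three hand-made 3×3 patches. The patches and the `response` helper are illustrative assumptions, not taken from the trained model's inputs.

```python
LEARNED_FILTER_0 = [
    [-0.18, 0.02, 0.31],
    [-0.27, 0.05, 0.34],
    [-0.15, 0.01, 0.28],
]
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

def response(kernel, patch):
    """Sum of element-wise products: one output value of a 3x3 cross-correlation."""
    return sum(kernel[r][c] * patch[r][c] for r in range(3) for c in range(3))

patches = {
    "dark -> bright": [[0, 0, 1]] * 3,  # vertical edge, bright on the right
    "bright -> dark": [[1, 0, 0]] * 3,  # vertical edge, bright on the left
    "flat bright":    [[1, 1, 1]] * 3,  # no edge at all
}

results = {}
for name, patch in patches.items():
    results[name] = (response(LEARNED_FILTER_0, patch), response(SOBEL_X, patch))
    learned, sobel = results[name]
    print(f"{name:>14}: learned {learned:+.2f}, sobel {sobel:+d}")
# dark -> bright: learned +0.93, sobel +4
# bright -> dark: learned -0.60, sobel -4
#    flat bright: learned +0.41, sobel +0
```

Both kernels flip sign with the edge direction. Unlike Sobel, the learned weights do not sum to zero, so the filter also responds somewhat to uniformly bright regions.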

The Feature Hierarchy

The two convolutional layers form a feature hierarchy — each layer builds on the patterns detected by the previous one:

| Layer | What It Detects | Example Patterns | Receptive Field |
| --- | --- | --- | --- |
| Conv1 (16 filters) | Low-level features | Vertical edges, horizontal edges, diagonal lines, corners | 3 × 3 pixels |
| Conv2 (32 filters) | Mid-level features | Curves, loops, T-junctions, line endings | 7 × 7 pixels (via pooling) |
| FC1 + FC2 | High-level reasoning | Digit identity: "this combination of curves and loops is a 3" | Entire 28 × 28 image |

This hierarchy is why CNNs work so well. Layer 1 learns the same edge detectors that neuroscientists found in the primary visual cortex (V1 simple cells). Layer 2 combines those edges into parts (curves for digit “8”, straight lines for “1”, loops for “0”). The fully-connected layers integrate everything into a final classification.

The key insight: We never designed these filters. We only defined the architecture (how many layers, how many filters) and let gradient descent figure out the optimal weights by minimizing the classification loss. The fact that it independently discovers structures matching human-designed image processing operators (Sobel, Gabor) and biological visual systems (V1 simple cells) is one of the most compelling results in deep learning.

What Happens to a Digit “7”

Let's trace what happens when our trained CNN processes a handwritten “7”:

  1. Input (1×28×28): The raw pixel values — a bright “7” shape on a dark background.
  2. After Conv1 + ReLU (16×28×28): 16 different edge maps. Some highlight the horizontal stroke at the top, others highlight the diagonal stroke going down. Filters that do not match any part of the “7” produce near-zero maps.
  3. After Pool1 (16×14×14): Same edge information, but spatially compressed. The exact pixel position of each edge is slightly blurred, but the edges are still clearly present.
  4. After Conv2 + ReLU (32×14×14): Higher-level features emerge. One filter might respond to the corner where the horizontal and diagonal strokes meet. Another might respond to the sharp angle at the base.
  5. After Pool2 (32×7×7): Compact feature maps encoding the structural properties of the digit.
  6. After Flatten + FC layers (10): The logit scores: $[-2.1, -3.4, 0.5, -1.2, -0.8, -2.3, -1.9, \mathbf{9.1}, -0.3, -1.5]$. The score at index 7 is overwhelmingly the highest, so the prediction is digit 7.
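The shapes in the trace above follow from the output size formula of Chapter 13. This plain-Python sketch recomputes them, assuming 3×3 convolutions with padding 1 and 2×2 pools with stride 2 (the configuration used in the receptive-field discussion of this section), and then picks the prediction from the example logits.

```python
def out_size(n: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a conv or pool layer."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
layers = [
    ("Conv1", 3, 1, 1, 16),
    ("Pool1", 2, 2, 0, 16),
    ("Conv2", 3, 1, 1, 32),
    ("Pool2", 2, 2, 0, 32),
]

size = 28
shapes = []
for name, kernel, stride, pad, channels in layers:
    size = out_size(size, kernel, stride, pad)
    shapes.append((channels, size, size))
    print(f"{name}: {channels}x{size}x{size}")
# Conv1: 16x28x28
# Pool1: 16x14x14
# Conv2: 32x14x14
# Pool2: 32x7x7

flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flat_features)  # 1568

logits = [-2.1, -3.4, 0.5, -1.2, -0.8, -2.3, -1.9, 9.1, -0.3, -1.5]
predicted_digit = max(range(len(logits)), key=lambda i: logits[i])
print(predicted_digit)  # 7
```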

Computing the Receptive Field

We just said Conv2 has a 7×7 receptive field. Where does that number come from? The receptive field of a unit at layer $L$ is the patch of input pixels that can influence its value. For a chain of convolutions and pools, there is a clean recurrence (Araujo, Norris & Sim, 2019, Distill):

$$r_L = r_{L-1} + (k_L - 1) \cdot \prod_{i=1}^{L-1} s_i$$

where $r_L$ is the receptive field at layer $L$, $k_L$ is that layer's kernel size, and the product is the cumulative stride of everything that came before. Start with $r_0 = 1$ (a single input pixel). For our CNN:

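| Layer | k | s (this layer) | Cumulative stride (before this layer) | (k − 1) · cum. stride | Receptive field r |
| --- | --- | --- | --- | --- | --- |
| Input | - | - | 1 | - | 1 |
| Conv1 (3×3, pad 1) | 3 | 1 | 1 | 2 | 1 + 2 = 3 |
| Pool1 (2×2, stride 2) | 2 | 2 | 1 | 1 | 3 + 1 = 4 |
| Conv2 (3×3, pad 1) | 3 | 1 | 2 | 4 | 4 + 4 = 8 |
| Pool2 (2×2, stride 2) | 2 | 2 | 2 | 2 | 8 + 2 = 10 |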

So a Conv2 unit actually sees an 8×8 patch of the input, slightly larger than the informal “7×7” quoted earlier, which was only a rough figure; the recurrence gives the exact value (for units near the border, part of that patch falls on zero padding). After Pool2 each unit sees a 10×10 patch, roughly a third of the image's 28-pixel width. That is why Conv2 filters detect curves and corners: they have enough spatial context to compose edges from Conv1 into parts.
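The recurrence is short enough to run directly. This plain-Python sketch tracks the receptive field and the cumulative stride layer by layer; the function name and the (name, kernel, stride) tuples are illustrative choices, with the layer settings taken from the table above.

```python
def receptive_fields(layers):
    """Apply r_L = r_{L-1} + (k_L - 1) * (product of earlier strides) layer by layer."""
    r, cumulative_stride = 1, 1  # a single input pixel
    result = []
    for name, kernel, stride in layers:
        r += (kernel - 1) * cumulative_stride
        result.append((name, r))
        cumulative_stride *= stride  # affects only the layers that follow
    return result

digit_cnn = [("Conv1", 3, 1), ("Pool1", 2, 2), ("Conv2", 3, 1), ("Pool2", 2, 2)]
cnn_rf = receptive_fields(digit_cnn)
print(cnn_rf)
# [('Conv1', 3), ('Pool1', 4), ('Conv2', 8), ('Pool2', 10)]

# Three stacked 3x3 convolutions without pooling grow by 2 pixels per layer.
plain_rf = receptive_fields([("ConvA", 3, 1), ("ConvB", 3, 1), ("ConvC", 3, 1)])
print([r for _, r in plain_rf])  # [3, 5, 7]
```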

[Interactive figure: Receptive Field Growth. Shows how the receptive field expands with each convolutional layer, starting from a 1×1 receptive field on a 7×7 input image.]

Receptive Field Formula: $RF_n = RF_{n-1} + (k - 1) \times \text{stride}$

With 3×3 kernels and stride 1: RF grows by 2 pixels per layer (1 → 3 → 5 → 7)

The visualiser above uses a simpler stack of three 3×3 convolutions (no pooling) to make the geometry easy to see — you can watch the receptive field grow from 1 to 3 to 5 to 7 as depth increases. The same recurrence governs both cases; pooling just multiplies the stride and accelerates the growth.

The FC Parameter Problem (Bridge to GAP)

Look back at the parameter count for our CNN: FC1 alone contains about 97% of the total parameters. That is a structural problem, not a bug in our code. A fully-connected layer from a 32×7×7 = 1568-dim feature vector to 128 hidden units needs $1568 \times 128 + 128 \approx 200{,}000$ parameters (weights plus biases), dwarfing the ~5000 parameters in both convolutional layers combined.
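The 97% figure can be recomputed from the layer shapes. This sketch assumes 3×3 kernels with 16 and 32 filters, FC1 mapping 1568 to 128 units, and a final FC2 mapping 128 to 10 logits; the FC2 shape is inferred from the 10 output scores and the ~207K total rather than stated explicitly here.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus one bias per filter."""
    return out_channels * (in_channels * kernel * kernel + 1)

def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit."""
    return out_features * (in_features + 1)

counts = {
    "conv1": conv2d_params(1, 16, 3),
    "conv2": conv2d_params(16, 32, 3),
    "fc1": linear_params(32 * 7 * 7, 128),
    "fc2": linear_params(128, 10),  # assumed output layer
}
total = sum(counts.values())
fc1_share = counts["fc1"] / total

for name, n in counts.items():
    print(f"{name:>5}: {n:>7,}")
print(f"total: {total:,}")           # total: 206,922
print(f"FC1 share: {fc1_share:.1%}") # FC1 share: 97.1%
```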

Two things follow: (a) most of the model's capacity is spent on a layer that throws away spatial structure, and (b) a bigger input (say 224×224 instead of 28×28) would make FC1 enormous — think tens of millions of parameters for a single FC transition. Early architectures (LeNet, AlexNet, VGG) paid that price. Lin, Chen & Yan (2013) pointed out that a much simpler fix works: Global Average Pooling. Replace Flatten + FC1 with one operation that averages each channel's spatial map down to a single scalar, producing a 32-d vector (one number per channel) with zero parameters.
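As a rough illustration of both points, the sketch below implements Global Average Pooling on nested Python lists and compares parameter counts. It assumes the same two-pool architecture applied to a 224×224 input (so the final maps are 56×56) and a hypothetical 32-to-10 classifier placed directly after the pooled vector; neither is specified in the text.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W map down to one scalar. No learnable parameters."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy input: 2 channels, each 2x2.
toy_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
pooled = global_average_pool(toy_maps)
print(pooled)  # [4.0, 1.0]

def linear_params(in_features: int, out_features: int) -> int:
    return out_features * (in_features + 1)

fc1_small = linear_params(32 * 7 * 7, 128)    # 28x28 input
fc1_large = linear_params(32 * 56 * 56, 128)  # 224x224 input, same two pools (assumed)
gap_head = linear_params(32, 10)              # hypothetical classifier after GAP

print(f"{fc1_small:,}  {fc1_large:,}  {gap_head:,}")
# 200,832  12,845,184  330
```

The pooled vector has one entry per channel regardless of the input resolution, so the classifier that follows it does not grow when the image does.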

Why this matters now. Every modern architecture we meet in the next section (AlexNet, VGG, GoogLeNet, ResNet) either uses Global Average Pooling directly (ResNet, GoogLeNet) or is criticised for not using it (in VGG-16, the fully-connected tail holds roughly 124 M of the network's 138 M parameters). Keep the FC-parameter problem in mind; it is the thread that ties the historical tour together.

Looking Ahead: In the next section, we will see how the same principles scale to much deeper and more powerful architectures — from LeNet (the original CNN from 1998) through AlexNet, VGG, and the revolutionary ResNet. The building blocks are identical to what we built here. The difference is depth, skip connections, and clever engineering.

References

Every claim about biological vision, architectural history, and regularisation theory in this section is grounded in the papers below. The interactive filter-visualisation and the hierarchy table reproduce findings that the original authors reported; you can read the originals directly.

  • Hubel, D. H. & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology 160(1), 106–154. DOI: 10.1113/jphysiol.1962.sp006837. — Original discovery of edge-selective “simple cells” in V1.
  • LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324. DOI: 10.1109/5.726791. — The LeNet paper; established the conv + pool + FC template we build here.
  • Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010. — Introduced ReLU for deep networks.
  • Glorot, X., Bordes, A. & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011. — Showed ReLU trains faster and deeper than tanh/sigmoid.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15, 1929–1958. — The dropout paper; motivates our p = 0.25.
  • Lin, M., Chen, Q. & Yan, S. (2013). Network In Network. ICLR 2014 / arXiv:1312.4400. — Introduced Global Average Pooling as the fix for huge FC layers.
  • Araujo, A., Norris, W. & Sim, J. (2019). Computing Receptive Fields of Convolutional Neural Networks. Distill. DOI: 10.23915/distill.00021. — The reference for the recurrence $r_L = r_{L-1} + (k_L - 1) \prod_{i=1}^{L-1} s_i$.