From Pixels to Predictions
In Chapter 13, we mastered the individual pieces: convolution operations, kernel mechanics, multi-channel filtering, and the output size formula. Now it's time to assemble these pieces into a complete, working CNN that can look at a handwritten digit and tell you what number it is.
We will build a CNN that classifies MNIST digits — 28×28 grayscale images of handwritten numbers 0 through 9. By the end of this section, you will have a network that achieves 99% accuracy with just two convolutional layers and about 207,000 parameters.
The CNN Recipe: A CNN is not just convolutions. It is a carefully designed pipeline: convolution (detect patterns) → activation (add non-linearity) → pooling (compress spatial info) → repeat → flatten (reshape for classification) → fully-connected layers (make the final decision). Each piece has a specific job.
But before we can build this pipeline, we need one crucial piece that Chapter 13 introduced but did not fully explore: pooling.
The Missing Piece: Pooling
After a convolutional layer detects features across the image, we face a problem: the feature maps are the same spatial size as the input (when using padding). A 28×28 image produces 28×28 feature maps. If we stack many conv layers, the computational cost grows rapidly. We need a way to shrink the spatial dimensions while keeping the important information.
That is exactly what pooling does. It slides a window across each feature map and summarizes each region with a single value. The two most common types are:
- Max Pooling: Keep the maximum value in each window. This preserves the strongest activation — if a vertical edge was detected somewhere in a 2×2 region, max pooling keeps that detection regardless of its exact position.
- Average Pooling: Compute the mean of all values in the window. This creates a smooth summary of activations, useful as a final aggregation layer.
Why Pooling Matters
Pooling provides three critical benefits:
- Spatial reduction: A 2×2 pooling with stride 2 halves both height and width, reducing the number of values by 4×. This means the next conv layer has 4× fewer positions to process.
- Translation invariance: If a feature shifts by 1 pixel in the input, max pooling often produces the same output. The network cares about what was detected, not exactly where.
- Larger receptive field: After pooling, each position in the next layer's feature map “sees” a larger region of the original image. This enables higher layers to detect larger-scale patterns.
The Pooling Formula
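The output size formula for pooling is the same as for convolution (no padding is typically used, so $P = 0$):
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
where $W$ is the input width (or height), $K$ is the pooling window size, and $S$ is the stride.
For the standard 2×2 max pooling with stride 2: $O = \lfloor (W - 2)/2 \rfloor + 1 = W/2$ for even $W$. So a 28×28 feature map becomes 14×14, and a 14×14 becomes 7×7.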
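The output-size arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `pool_output_size` and its parameter names are illustrative and do not come from the text.

```python
def pool_output_size(size: int, kernel: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Output length along one spatial axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


print(pool_output_size(28))  # 14
print(pool_output_size(14))  # 7
```

Integer division (`//`) plays the role of the floor, which matters when the input size is odd: a 7×7 map pooled with a 2×2 window and stride 2 gives 3×3, and the last row and column are dropped.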
Max Pooling from Scratch
Let's implement max pooling ourselves to fully understand the mechanics. We will process a 4×4 feature map with a 2×2 pooling window and stride 2:
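A minimal NumPy sketch of such an implementation follows. The function name and signature mirror the `max_pool_2d(x, pool_size=2, stride=2)` call used in the comparison table in this section; the 4×4 input values are made up for illustration.

```python
import numpy as np


def max_pool_2d(x: np.ndarray, pool_size: int = 2, stride: int = 2) -> np.ndarray:
    """Max-pool a single 2D feature map with explicit Python loops."""
    h, w = x.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride                  # top-left corner of the window
            window = x[r:r + pool_size, c:c + pool_size]
            out[i, j] = window.max()                       # keep the strongest activation
    return out


# Hypothetical 4x4 feature map (values chosen only for illustration)
feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
], dtype=np.float32)

pooled = max_pool_2d(feature_map, pool_size=2, stride=2)
print(pooled)
# [[6. 5.]
#  [7. 9.]]
```

Each output value is the maximum of one non-overlapping 2×2 block, so the 4×4 map shrinks to 2×2.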
Max Pooling in PyTorch
PyTorch provides `nn.MaxPool2d`, which handles batches, multiple channels, and GPU acceleration:
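A short usage sketch, assuming a random batch of 8 feature-map stacks with 16 channels each (the shapes are chosen only for illustration):

```python
import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(8, 16, 28, 28)   # [batch, channels, H, W]
y = pool(x)                      # pooling is applied to every channel independently

print(y.shape)  # torch.Size([8, 16, 14, 14])
```

Batch size and channel count pass through unchanged; only the two spatial dimensions are halved.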
| Our Implementation | PyTorch Equivalent |
|---|---|
| max_pool_2d(x, pool_size=2, stride=2) | nn.MaxPool2d(kernel_size=2, stride=2)(x) |
| Processes single 2D array | Processes [batch, channels, H, W] tensors |
| CPU only, Python loops | GPU-accelerated, optimized C++/CUDA |
| Educational | Production-ready |
Quick Check
What is the output shape when applying MaxPool2d(2, 2) to a tensor of shape [8, 32, 14, 14]?
Why ReLU, Not tanh or Sigmoid
Before we assemble the network we need to answer a practical question: which activation function goes between the convolution and the pooling? The 1998 LeNet paper used tanh. Modern CNNs almost universally use ReLU. What changed?
The Vanishing-Gradient Problem of Smooth Squashers
Both sigmoid and tanh are S-shaped “squashers” that saturate at the extremes. When the pre-activation $x$ is large in magnitude, the derivative is almost zero:
| Function | f(x) | f'(x) | f'(x=5) |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^-x) | f(x) · (1 - f(x)) | ≈ 0.0066 |
| tanh | (e^x - e^-x)/(e^x + e^-x) | 1 - tanh(x)² | ≈ 0.00018 |
| ReLU | max(0, x) | 0 if x<0, else 1 | 1.0 |
In a 10-layer network using sigmoid, gradients propagating backward through the chain rule multiply ten activation derivatives, each at most 0.25 and typically much less than 1. Even in the best case, the gradient arriving at the earliest layer can be about $10^{-6}$ of the loss gradient ($0.25^{10}$, ignoring the weight factors), which is effectively zero. The early filters never learn. This is the vanishing-gradient problem, documented by Glorot & Bengio (2010) and solved architecturally in several ways; ReLU was the simplest.
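The derivative column of the table above, and the effect of chaining ten such derivatives, can be checked numerically. The sketch below uses only the standard library. The ten-layer product takes the sigmoid derivative at its maximum (0.25, reached at $x = 0$) and ignores the weight matrices that also appear in the chain rule, so it is a simplified best case rather than a full backpropagation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def d_relu(x: float) -> float:
    return 1.0 if x > 0 else 0.0


x = 5.0
print(f"sigmoid'(5) = {d_sigmoid(x):.4f}")   # sigmoid'(5) = 0.0066
print(f"tanh'(5)    = {d_tanh(x):.5f}")      # tanh'(5)    = 0.00018
print(f"relu'(5)    = {d_relu(x):.1f}")      # relu'(5)    = 1.0

# Best case for sigmoid: every layer sits at x = 0, where the derivative peaks at 0.25.
best_case = d_sigmoid(0.0) ** 10
print(f"ten sigmoid layers, best case: {best_case:.2e}")   # 9.54e-07
print(f"ten active ReLU layers: {d_relu(1.0) ** 10:.1f}")  # 1.0
```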
Nair & Hinton (2010) and Glorot, Bordes & Bengio (2011) showed that ReLU trains deeper networks faster. Its gradient is either exactly 0 (for $x < 0$) or exactly 1 (for $x > 0$). There is no gradual decay. When the signal is “on”, the gradient flows through untouched.
The dead-ReLU caveat. If a neuron's pre-activation stays negative for every input, the gradient is always zero and its weights never update: a “dead” neuron. Two common escape hatches: Leaky ReLU ($f(x) = \max(\alpha x, x)$ with a small slope such as $\alpha = 0.01$), which keeps a nonzero gradient for negative inputs, and He initialisation (weight variance scaled by $2/n_{\text{in}}$, where $n_{\text{in}}$ is the layer's fan-in), which tunes initial weights so most neurons start “alive”. We use plain ReLU here because MNIST is forgiving; He initialisation is a one-line fix for larger networks.
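Both escape hatches are available in PyTorch. The sketch below applies He (Kaiming) initialisation to a single convolutional layer and checks the gradient of a Leaky ReLU on a negative input. The layer sizes and the slope of 0.01 are illustrative assumptions, not settings taken from this chapter's network.

```python
import torch
from torch import nn

torch.manual_seed(0)

# He initialisation: weights drawn with variance 2 / fan_in.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(conv.bias)

fan_in = 16 * 3 * 3                       # in_channels * kernel_h * kernel_w
weight_var = conv.weight.var().item()
print(weight_var, 2.0 / fan_in)           # the two values should be close

# Leaky ReLU: the gradient is the small slope for x < 0 instead of exactly 0.
leaky = nn.LeakyReLU(negative_slope=0.01)
x = torch.tensor([-2.0, 3.0], requires_grad=True)
leaky(x).sum().backward()
print(x.grad)  # tensor([0.0100, 1.0000])
```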
Our CNN Blueprint
Now we have all the building blocks: convolution (Chapter 13), activation functions (Chapter 5), pooling (above), and fully-connected layers (Chapter 10). Let's assemble them into a complete CNN for digit classification.
Dimension Tracking: Every Tensor Shape
Understanding exactly how the tensor shape transforms at each layer is critical. Here is the complete flow for a single image:
| Layer | Operation | Output Shape | Parameters | Purpose |
|---|---|---|---|---|
| Input | MNIST image | 1 × 28 × 28 | 0 | Raw grayscale pixels |
| Conv1 | Conv2d(1, 16, 3, pad=1) | 16 × 28 × 28 | 160 | Detect edges and textures |
| ReLU | max(0, x) | 16 × 28 × 28 | 0 | Non-linearity |
| Pool1 | MaxPool2d(2, 2) | 16 × 14 × 14 | 0 | Halve spatial size |
| Conv2 | Conv2d(16, 32, 3, pad=1) | 32 × 14 × 14 | 4,640 | Combine edges into shapes |
| ReLU | max(0, x) | 32 × 14 × 14 | 0 | Non-linearity |
| Pool2 | MaxPool2d(2, 2) | 32 × 7 × 7 | 0 | Halve again |
| Flatten | Reshape | 1,568 | 0 | Conv → FC bridge |
| FC1 | Linear(1568, 128) | 128 | 200,832 | Dense reasoning |
| ReLU | max(0, x) | 128 | 0 | Non-linearity |
| Dropout | Drop 25% | 128 | 0 | Regularization |
| FC2 | Linear(128, 10) | 10 | 1,290 | Digit class scores |
Total: 206,922 parameters. Notice that the FC1 layer alone accounts for 97% of all parameters. This is a common pattern in CNNs — the convolutional layers are parameter-efficient (weight sharing across spatial positions), but the first fully-connected layer creates a large weight matrix.
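The parameter column can be reproduced with two small formulas: a convolution has `c_in * c_out * k * k` weights plus `c_out` biases, and a linear layer has `n_in * n_out` weights plus `n_out` biases. The helper names below are illustrative.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


params = {
    "Conv1": conv_params(1, 16, 3),
    "Conv2": conv_params(16, 32, 3),
    "FC1": linear_params(32 * 7 * 7, 128),
    "FC2": linear_params(128, 10),
}
total = sum(params.values())

for name, n in params.items():
    print(f"{name}: {n:>7,} ({100 * n / total:.1f}% of total)")
print(f"Total: {total:,}")  # Total: 206,922
```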
The PyTorch Implementation
Let's build this CNN in PyTorch using `nn.Module`. The pattern is the same one we learned in Chapter 10 for MLPs: define layers in `__init__`, wire them together in `forward`.
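One way to write the network from the dimension-tracking table is sketched below. The class name `MNISTCNN` and the attribute names are illustrative choices; the layer sizes, padding, and dropout rate follow the table.

```python
import torch
from torch import nn


class MNISTCNN(nn.Module):
    """Two conv blocks followed by two fully-connected layers (sizes from the table)."""

    def __init__(self, dropout: float = 0.25):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)   # no weights, safe to reuse
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.relu(self.conv1(x)))    # [B, 16, 14, 14]
        x = self.pool(self.relu(self.conv2(x)))    # [B, 32, 7, 7]
        x = torch.flatten(x, start_dim=1)          # [B, 1568]
        x = self.dropout(self.relu(self.fc1(x)))   # [B, 128]
        return self.fc2(x)                         # [B, 10] raw logits, no softmax


model = MNISTCNN()
logits = model(torch.randn(4, 1, 28, 28))          # a random batch of 4 "images"
n_params = sum(p.numel() for p in model.parameters())

print(logits.shape)      # torch.Size([4, 10])
print(f"{n_params:,}")   # 206,922
```

Passing a random batch through the model is a cheap way to confirm that the shapes line up before any training happens.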
Key Design Decisions
| Decision | Our Choice | Why |
|---|---|---|
| Kernel size | 3 × 3 | Small enough for local patterns, standard in modern CNNs (VGG, ResNet) |
| Padding | 1 (same padding) | Preserves spatial dimensions through conv layers, only pooling reduces size |
| Channels | 1 → 16 → 32 | Double channels when halving spatial size (classic pattern from VGG) |
| Pooling | MaxPool 2 × 2, stride 2 | Halves dimensions, provides translation invariance |
| Hidden FC size | 128 | Small enough to prevent overfitting on MNIST, large enough for 10 classes |
| Dropout | 0.25 | Mild regularization — MNIST is simple, heavy dropout is unnecessary |
| Output activation | None (raw logits) | CrossEntropyLoss applies softmax internally for numerical stability |
Quick Check
Why do we NOT apply softmax after the last linear layer?
Training on MNIST
Now let's train our CNN on the MNIST dataset. The training loop follows the same pattern from Chapter 11: load data → forward pass → compute loss → backward pass → update weights. The only difference is that we're now feeding images through a CNN instead of flat vectors through an MLP.
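A sketch of such a training loop is shown below. Several details are assumptions rather than settings confirmed by the text: the Adam optimizer, the learning rate of 1e-3, the batch size of 64, the `data` download directory, and all function names. The model is rebuilt with `nn.Sequential` only to keep the block self-contained. `main()` downloads MNIST through torchvision, so it is defined but not called here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


def build_cnn() -> nn.Module:
    """Same layer stack as the dimension-tracking table, written as nn.Sequential."""
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 128), nn.ReLU(), nn.Dropout(0.25),
        nn.Linear(128, 10),
    )


def train_one_epoch(model, loader, optimizer, criterion, device):
    """One pass over the data; returns (mean loss, accuracy) accumulated over the epoch."""
    model.train()                                   # enables dropout
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)                      # forward pass
        loss = criterion(logits, labels)            # CrossEntropyLoss expects raw logits
        loss.backward()                             # backward pass
        optimizer.step()                            # update weights
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


@torch.no_grad()
def evaluate(model, loader, device) -> float:
    """Accuracy with dropout disabled and no gradient tracking."""
    model.eval()
    correct, seen = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return correct / seen


def main(epochs: int = 3, batch_size: int = 64, lr: float = 1e-3) -> None:
    """Train on MNIST; downloads the dataset into ./data on first use."""
    from torchvision import datasets, transforms    # imported lazily: only needed here

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    to_tensor = transforms.ToTensor()               # pixels scaled to [0, 1], shape [1, 28, 28]
    train_set = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
    test_set = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=1000)

    model = build_cnn().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(1, epochs + 1):
        loss, train_acc = train_one_epoch(model, train_loader, optimizer, criterion, device)
        test_acc = evaluate(model, test_loader, device)
        print(f"epoch {epoch}: loss={loss:.4f} train_acc={train_acc:.1%} test_acc={test_acc:.1%}")
```

Calling `main()` runs three epochs. Exact numbers will vary with the random seed, optimizer, and hardware, so they may not match a published table digit for digit.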
Understanding the Results
| Epoch | Training Loss | Train Accuracy | Test Accuracy |
|---|---|---|---|
| 1 | 0.1842 | 94.3% | 98.2% |
| 2 | 0.0571 | 98.2% | 98.8% |
| 3 | 0.0392 | 98.7% | 99.0% |
Several things stand out:
- Rapid learning: After just 1 epoch (one pass through 60,000 images), the model already achieves 94% training accuracy. The convolutional structure makes learning dramatically easier than a fully-connected network.
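- Test > Train accuracy: Notice that test accuracy (98.2%) exceeds training accuracy (94.3%) in epoch 1. One reason is that dropout is active during training (randomly disabling 25% of neurons) but disabled during testing. Another is timing: training accuracy is typically accumulated over the whole epoch, including the early batches when the weights were still nearly random, whereas test accuracy is measured once at the end of the epoch. The model is actually more capable than its training performance suggests.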
- 99% in 3 epochs: Our tiny 207K-parameter CNN reaches 99% test accuracy on MNIST. For context, a simple logistic regression achieves ~92% and a fully-connected MLP achieves ~97%. The convolutional structure provides a meaningful advantage on image data.
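The dropout effect mentioned in the list above can be seen directly by switching a dropout layer between training and evaluation mode. This is a small sketch on a vector of ones; the vector length and the seed are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.25)
x = torch.ones(1, 10_000)

drop.train()                 # training mode: units are zeroed at random
y_train = drop(x)
drop.eval()                  # evaluation mode: dropout is the identity
y_eval = drop(x)

zero_fraction = (y_train == 0).float().mean().item()
print(zero_fraction)                 # close to 0.25
print(y_train.max().item())          # survivors are scaled by 1 / (1 - 0.25)
print(torch.equal(y_eval, x))        # True
```

In training mode PyTorch rescales the surviving activations by `1 / (1 - p)` so the expected activation stays the same, which is why no extra scaling is needed at test time.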
Why CNNs Beat MLPs on Images: An MLP treats each pixel independently — pixel (0,0) has no special relationship to pixel (0,1). A CNN exploits spatial locality (nearby pixels are related) and weight sharing (the same filter scans every position). This inductive bias matches the structure of images, so the network learns more from less data.
What the CNN Learned
The most fascinating part of training a CNN is examining what the filters actually learned. Remember: we never told the network about edge detection, Sobel filters, or any image processing concept. We only showed it digits and labels. Let's see what gradient descent discovered on its own.
The Feature Hierarchy
The two convolutional layers form a feature hierarchy — each layer builds on the patterns detected by the previous one:
| Layer | What It Detects | Example Patterns | Receptive Field |
|---|---|---|---|
| Conv1 (16 filters) | Low-level features | Vertical edges, horizontal edges, diagonal lines, corners | 3 × 3 pixels |
| Conv2 (32 filters) | Mid-level features | Curves, loops, T-junctions, line endings | 7 × 7 pixels (via pooling) |
| FC1 + FC2 | High-level reasoning | Digit identity: "this combination of curves and loops is a 3" | Entire 28 × 28 image |
This hierarchy is why CNNs work so well. Layer 1 learns the same edge detectors that neuroscientists found in the primary visual cortex (V1 simple cells). Layer 2 combines those edges into parts (curves for digit “8”, straight lines for “1”, loops for “0”). The fully-connected layers integrate everything into a final classification.
What Happens to a Digit “7”
Let's trace what happens when our trained CNN processes a handwritten “7”:
- Input (1×28×28): The raw pixel values — a bright “7” shape on a dark background.
- After Conv1 + ReLU (16×28×28): 16 different edge maps. Some highlight the horizontal stroke at the top, others highlight the diagonal stroke going down. Filters that do not match any part of the “7” produce near-zero maps.
- After Pool1 (16×14×14): Same edge information, but spatially compressed. The exact pixel position of each edge is slightly blurred, but the edges are still clearly present.
- After Conv2 + ReLU (32×14×14): Higher-level features emerge. One filter might respond to the corner where the horizontal and diagonal strokes meet. Another might respond to the sharp angle at the base.
- After Pool2 (32×7×7): Compact feature maps encoding the structural properties of the digit.
- After Flatten + FC layers (10): The output is a vector of ten logit scores, one per digit class. The score at index 7 is overwhelmingly the highest, so the prediction is digit 7.
Computing the Receptive Field
We just said Conv2 has a 7×7 receptive field. Where does that number come from? The receptive field of a unit at layer $\ell$ is the patch of input pixels that can influence its value. For a chain of convolutions and pools, there is a clean recurrence (Araujo, Norris & Sim, 2019, Distill):
$$r_\ell = r_{\ell-1} + (k_\ell - 1) \prod_{i=1}^{\ell-1} s_i$$
where $r_\ell$ is the receptive field at layer $\ell$, $k_\ell$ is that layer's kernel size, and the product is the cumulative stride of everything that came before. Start with $r_0 = 1$ (a single input pixel). For our CNN:
| Layer | k | s (this layer) | Cumulative stride (before this layer) | (k − 1) · cum. stride | Receptive field r |
|---|---|---|---|---|---|
| Input | — | — | 1 | — | 1 |
| Conv1 (3×3, pad 1) | 3 | 1 | 1 | 2 | 1 + 2 = 3 |
| Pool1 (2×2, stride 2) | 2 | 2 | 1 | 1 | 3 + 1 = 4 |
| Conv2 (3×3, pad 1) | 3 | 1 | 2 | 4 | 4 + 4 = 8 |
| Pool2 (2×2, stride 2) | 2 | 2 | 2 | 2 | 8 + 2 = 10 |
So a Conv2 unit actually sees an 8×8 patch of the input, slightly larger than the earlier informal “7×7”. The informal figure treats pooling as pure stride-2 subsampling; counting the extra pixel contributed by the 2×2 pooling window gives 8. After Pool2 each unit sees a 10×10 patch, roughly a third of the image's width and height (about 13% of its area). That is why Conv2 filters detect curves and corners: they have enough spatial context to compose edges from Conv1 into parts.
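The table can be reproduced by applying the recurrence layer by layer. In the sketch below the function name and the `(name, kernel, stride)` tuple layout are illustrative; `jump` holds the cumulative stride of all earlier layers.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel, stride). Returns [(name, receptive_field)] in order."""
    rf, jump = 1, 1          # a single input pixel; cumulative stride of earlier layers
    out = []
    for name, k, s in layers:
        rf += (k - 1) * jump   # grow by (k - 1) steps, each `jump` input pixels wide
        jump *= s              # this layer's stride affects every later layer
        out.append((name, rf))
    return out


cnn = [("Conv1", 3, 1), ("Pool1", 2, 2), ("Conv2", 3, 1), ("Pool2", 2, 2)]
for name, rf in receptive_fields(cnn):
    print(f"{name}: {rf}x{rf}")
# Conv1: 3x3
# Pool1: 4x4
# Conv2: 8x8
# Pool2: 10x10
```

Setting the pooling kernel to 1 while keeping its stride of 2 (pure subsampling) yields 7 for Conv2, which is where the informal 7×7 figure comes from.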
Receptive Field Growth: with 3×3 kernels and stride 1 the cumulative stride stays at 1, so the receptive field grows by 2 pixels per layer (1 → 3 → 5 → 7).
The FC Parameter Problem (Bridge to GAP)
Look back at the parameter count for our CNN: FC1 alone contains about 97% of the total parameters. That is a structural problem, not a bug in our code. A fully-connected layer from a 32×7×7 = 1568-dim feature vector to 128 hidden units needs $1568 \times 128 = 200{,}704$ weights (plus 128 biases), dwarfing the ~5000 parameters in both convolutional layers combined.
Two things follow: (a) most of the model's capacity is spent on a layer that throws away spatial structure, and (b) a bigger input (say 224×224 instead of 28×28) would make FC1 enormous — think tens of millions of parameters for a single FC transition. Early architectures (LeNet, AlexNet, VGG) paid that price. Lin, Chen & Yan (2013) pointed out that a much simpler fix works: Global Average Pooling. Replace Flatten + FC1 with one operation that averages each channel's spatial map down to a single scalar, producing a 32-d vector (one number per channel) with zero parameters.
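The difference in parameter count can be sketched in PyTorch with `nn.AdaptiveAvgPool2d(1)`, which averages each channel's spatial map down to one value. Connecting the resulting 32-d vector straight to the 10 class scores is an assumption made here for illustration; the text only specifies that Flatten + FC1 is replaced.

```python
import torch
from torch import nn


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


# Classifier head used in this section: Flatten -> FC1 -> ReLU -> FC2
fc_head = nn.Sequential(
    nn.Flatten(), nn.Linear(32 * 7 * 7, 128), nn.ReLU(), nn.Linear(128, 10)
)

# Global Average Pooling head: one scalar per channel, then a small linear layer
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # [B, 32, H, W] -> [B, 32, 1, 1], zero parameters
    nn.Flatten(),              # [B, 32]
    nn.Linear(32, 10),
)

features = torch.randn(4, 32, 7, 7)          # stand-in for the Pool2 output
print(fc_head(features).shape, count_params(fc_head))     # torch.Size([4, 10]) 202122
print(gap_head(features).shape, count_params(gap_head))   # torch.Size([4, 10]) 330

# The GAP head also accepts larger feature maps without any change in parameters.
print(gap_head(torch.randn(4, 32, 56, 56)).shape)          # torch.Size([4, 10])
```

Because the averaging step does not depend on the spatial size, the GAP head's parameter count stays fixed when the input image grows, whereas the flattened FC head would have to be rebuilt with a much larger weight matrix.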
Why this matters now. Every modern architecture we meet in the next section (AlexNet, VGG, GoogLeNet, ResNet) either uses Global Average Pooling directly (ResNet, GoogLeNet) or is immediately criticised for not using it (roughly 124 M of VGG-16's 138 M parameters sit in its fully-connected tail). Keep the FC-parameter problem in mind; it is the thread that ties the historical tour together.
Looking Ahead: In the next section, we will see how the same principles scale to much deeper and more powerful architectures — from LeNet (the original CNN from 1998) through AlexNet, VGG, and the revolutionary ResNet. The building blocks are identical to what we built here. The difference is depth, skip connections, and clever engineering.
References
Every claim about biological vision, architectural history, and regularisation theory in this section is grounded in the papers below. The interactive filter-visualisation and the hierarchy table reproduce findings that the original authors reported; you can read the originals directly.
- Hubel, D. H. & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology 160(1), 106–154. DOI: 10.1113/jphysiol.1962.sp006837. Original discovery of edge-selective “simple cells” in V1.
- LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324. DOI: 10.1109/5.726791. The LeNet paper; established the conv + pool + FC template we build here.
- Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010. Introduced ReLU for deep networks.
- Glorot, X., Bordes, A. & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011. Showed ReLU trains faster and deeper than tanh/sigmoid.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15, 1929–1958. The dropout paper; motivates our p = 0.25.
- Lin, M., Chen, Q. & Yan, S. (2013). Network In Network. ICLR 2014 / arXiv:1312.4400. Introduced Global Average Pooling as the fix for huge FC layers.
- Araujo, A., Norris, W. & Sim, J. (2019). Computing Receptive Fields of Convolutional Neural Networks. Distill. DOI: 10.23915/distill.00021. The reference for the receptive-field recurrence.