Introduction
Convolutional Neural Networks power nearly every modern vision system — from the face-unlock on your phone to the perception stack of a self-driving car. But why convolution specifically? Why not a plain fully-connected network like the one in Chapter 7?
This chapter answers that question in three stages. First, we show why fully-connected networks cannot scale to realistic images. Then we present the three inductive biases — parameter sharing, translation equivariance, and local connectivity — that make convolutions the right tool for pixel data. Finally we build the convolution operation from scratch in 1-D and 2-D so that when you meet nn.Conv2d in the next section, there is nothing mysterious left.
The core claim of this chapter: a convolutional layer is not just “another kind of layer” — it is a fully-connected layer under three strong architectural constraints. Those constraints happen to match the statistics of natural images, and that match is the entire reason CNNs work.
Learning Objectives
After this section you will be able to:
- Quantify why fully-connected layers fail on images — compute the parameter count for a 224×224×3 input and explain why it is infeasible.
- Name the three CNN inductive biases — parameter sharing, translation equivariance, and sparse / local connectivity — and explain the statistical assumption each one encodes.
- Compute 1-D and 2-D convolutions by hand and in NumPy, then reproduce the exact same result with PyTorch's F.conv1d / F.conv2d.
- Predict output size using $T_{out} = \lfloor (T_{in} + 2P - K) / S \rfloor + 1$ for any stride and padding.
- Explain the feature hierarchy: why stacked conv layers go from edges to textures to object parts, and why this mirrors the mammalian visual cortex.
Reference
Hubel & Wiesel, 1962, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex”, J. Physiology 160(1).
The Parameter Explosion Problem
Suppose we want to classify a single RGB image of ImageNet resolution — 224×224 pixels, 3 channels — into 1,000 categories. The most naïve approach is to flatten the image to a vector and push it through a fully-connected (FC) hidden layer.
The numbers, without waving our hands
Flattening the image gives a vector of length $224 \times 224 \times 3 = 150{,}528$. Connecting that to just one hidden neuron requires one weight per input value, that is, 150,528 weights per neuron. If we want a hidden layer with 1,000 neurons (modest by modern standards), the parameter count is:
$$150{,}528 \times 1{,}000 = 150{,}528{,}000 \approx 150\text{M}$$
… in a single hidden layer. For comparison, the entire ResNet-50 network (50 layers deep, ImageNet-trained) has about 25M parameters. The fully-connected approach is already six times larger than ResNet-50 after one layer. Training this at scale is impractical for three independent reasons:
- Memory. 150M float32 weights alone occupy 600 MB — plus gradients, Adam moments, and activations.
- Sample complexity. Classical learning-theory bounds scale roughly with the parameter count. With this many weights you need an astronomically large training set to avoid overfitting.
- It ignores structure. Pixel (12, 37) and pixel (12, 38) are almost certainly related — they are neighbours. An FC layer treats them as arbitrary independent features. The spatial topology of the image is thrown away the moment we flatten.
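The arithmetic behind these figures can be checked in a few lines. This is a minimal sketch; the variable names are illustrative, and the memory figure counts only the weights (no biases, gradients, or optimiser state).

```python
# Parameter count and float32 memory for one fully-connected hidden layer.
height, width, channels = 224, 224, 3   # ImageNet-resolution RGB image
hidden_units = 1_000                    # hidden-layer width used in the text

input_features = height * width * channels    # length of the flattened image
fc_weights = input_features * hidden_units    # one weight per (input, neuron) pair
fc_megabytes = fc_weights * 4 / 1e6           # float32 = 4 bytes per weight

print(f"flattened input length: {input_features:,}")    # 150,528
print(f"FC weights (no bias):   {fc_weights:,}")         # 150,528,000
print(f"float32 storage:        {fc_megabytes:.0f} MB")  # 602 MB
```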
What a 3×3 convolution would use instead
A convolutional layer with 3×3 kernels that produces 64 feature channels from a 3-channel input needs only:
$$3 \times 3 \times 3 \times 64 + 64 = 1{,}792 \text{ parameters}$$
That is about 84,000× fewer parameters than the FC layer above — and it preserves spatial structure. How is that possible? The interactive diagram below shows the key idea: an FC layer connects every input pixel to every output unit, while a conv layer connects only a tiny local neighbourhood, and shares the same weights across every spatial location.
[Interactive diagram: Connectivity Comparison. In FC, every input connects to every output.]
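As a rough check of the comparison, the sketch below counts the parameters of a 3×3 convolution from 3 to 64 channels and divides the FC weight count by it. It assumes one bias per output channel is included in the conv count and that FC biases are excluded; under that assumption the ratio comes out at exactly 84,000.

```python
# Parameters in a 3x3 convolution mapping 3 input channels to 64 output channels.
kernel_h, kernel_w = 3, 3
in_channels, out_channels = 3, 64

conv_weights = kernel_h * kernel_w * in_channels * out_channels  # 1,728 weights
conv_params = conv_weights + out_channels                        # plus one bias per channel

fc_weights = 224 * 224 * 3 * 1_000   # the FC layer from the text, biases excluded

print(conv_params)                 # 1792
print(fc_weights // conv_params)   # 84000
```

Note that the conv count does not depend on the image size at all, whereas the FC count grows with every extra pixel.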
Quick Check
A fully-connected layer maps a 32×32×3 input (CIFAR-10 images) to 512 hidden units. How many weights (excluding bias) does it have?
Three Pillars of CNN Design
A convolutional layer is best understood as a fully-connected layer under three architectural constraints. Each constraint encodes a specific assumption about natural images. If the assumption holds (it does, empirically) we get massive parameter savings and better generalisation for free.
Pillar 1: Parameter sharing
In an FC layer, every output unit has its own private set of input weights. In a conv layer, the same kernel weights are reused at every spatial position. If a 3×3 filter is useful for detecting an edge at pixel (5, 5), it is equally useful at pixel (100, 200) — so we store the filter once and apply it everywhere.
The assumption this encodes is stationarity of image statistics: the distribution of local pixel patterns is roughly the same across the image. That assumption is false for a passport photo centred on a face, but true for natural images on average — and more importantly, the filters we actually want to learn (edges, colour blobs, textures) are themselves translation-independent.
Pillar 2: Translation equivariance
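A function $f$ is translation equivariant if shifting the input by $\delta$ produces an output shifted by the same $\delta$, that is, $f(T_\delta x) = T_\delta f(x)$, where $T_\delta$ denotes a shift by $\delta$. Convolution with a kernel $g$ satisfies this exactly:
$$(T_\delta x) * g = T_\delta (x * g)$$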
Concretely: if a cat appears 50 pixels to the right of where the network saw cats during training, the feature map for “cat” simply shifts 50 pixels — the network does not have to relearn the concept from scratch at every location. This is a structural guarantee of the operation, not something the network has to learn.
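The property can be checked numerically on a toy 1-D signal. The sketch below is illustrative only: the signal, the kernel, and the helper names `cross_correlate_1d` and `shift_right` are assumptions made for this example, and the signal carries a zero margin so that boundary effects do not interfere with the comparison.

```python
def cross_correlate_1d(signal, kernel):
    """Valid (no padding, stride 1) 1-D cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[n + i] * kernel[i] for i in range(k))
        for n in range(len(signal) - k + 1)
    ]

def shift_right(values, steps):
    """Shift a list right by `steps`, filling with zeros and dropping the overflow."""
    return [0] * steps + values[:len(values) - steps]

signal = [0, 0, 1, 3, 2, 0, 0, 0]   # hypothetical signal with a zero margin
kernel = [1, 0, -1]
shift = 2

shifted_then_filtered = cross_correlate_1d(shift_right(signal, shift), kernel)
filtered_then_shifted = shift_right(cross_correlate_1d(signal, kernel), shift)

print(shifted_then_filtered)   # [0, 0, -1, -3, -1, 3]
print(filtered_then_shifted)   # [0, 0, -1, -3, -1, 3]
```

Both orders give the same feature map: shifting the input simply shifts the response.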
Equivariance vs invariance
Pillar 3: Sparse / local connectivity
Each output unit in a conv layer depends on only a tiny local patch of the input (a $K \times K$ window) rather than on every pixel. This matches two well-known facts:
- Image statistics: Nearby pixels are highly correlated; distant pixels are nearly independent. Local filters exploit this correlation; global connections waste capacity on irrelevant pairs.
- Biological precedent: Hubel & Wiesel showed that neurons in the primary visual cortex (V1) respond to small, localised regions of the visual field and to oriented edges within those regions. CNNs are a deliberate computational echo of that architecture.
Reference
Hubel & Wiesel, 1962, J. Physiology 160(1). Their single-unit recordings in cat striate cortex were the direct biological inspiration for the Neocognitron (Fukushima 1980), the immediate ancestor of modern CNNs.
The whole chapter in one line: a conv layer is an FC layer plus parameter sharing plus locality. Translation equivariance follows from those two constraints, and the math we are about to do is just their explicit form.
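One way to make the "constrained FC layer" view concrete is to write a 1-D convolution as an ordinary weight matrix. The sketch below is an illustration under stated assumptions: the helper name `conv_as_fc_matrix` and the kernel values are hypothetical, and the layer has no padding, stride 1, and no bias.

```python
def conv_as_fc_matrix(kernel, input_length):
    """Build the FC weight matrix equivalent to a valid 1-D cross-correlation."""
    k = len(kernel)
    rows = input_length - k + 1
    matrix = [[0] * input_length for _ in range(rows)]
    for r in range(rows):
        for i in range(k):
            matrix[r][r + i] = kernel[i]   # same kernel, shifted one column per row
    return matrix

signal = [1, 2, 3, 4, 5]
kernel = [2, 1, -1]                        # hypothetical weights

W = conv_as_fc_matrix(kernel, len(signal))
fc_output = [sum(w * x for w, x in zip(row, signal)) for row in W]

for row in W:
    print(row)
# [2, 1, -1, 0, 0]
# [0, 2, 1, -1, 0]
# [0, 0, 2, 1, -1]
print(fc_output)                           # [1, 3, 5]
```

The zeros outside the diagonal band are the locality constraint, and the repetition of the same three numbers in every row is the parameter sharing: the matrix has 15 entries but only 3 free parameters.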
Starting Simple: 1D Convolution
Before we attack images, let us build intuition on a 1-D signal. Exactly the same operation — slide a small filter across the input, sum weighted values — powers audio processing, time-series forecasting, and 1-D sensor fusion.
Intuition: a sliding weighted sum
Imagine a short ruler with three numbers written on it. You lay it on top of a signal, multiply each signal value by the number on the ruler above it, and add the three products. That single number is the output at that position. Slide the ruler one step to the right, repeat.
Mathematical definition
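The classical (flipped) definition of 1-D convolution on infinite signals is:
$$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k]\, g[n - k]$$
In deep learning we almost always skip the kernel flip and use the simpler cross-correlation definition, which for a finite signal of length $L$ and a kernel of length $K$ reads:
$$y[n] = \sum_{k=0}^{K-1} f[n + k]\, g[k], \qquad n = 0, 1, \dots, L - K$$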
Why two definitions?
| Symbol | Meaning | Example value |
|---|---|---|
| f | Input 1-D signal | [1, 2, 3, 4, 5] |
| g | Kernel / filter | [1, 0, -1] |
| n | Output position | 0, 1, 2 |
| k | Index into the kernel | 0, 1, 2 |
| K | Kernel length | 3 |
| L − K + 1 | Output length (no padding, stride 1) | 5 − 3 + 1 = 3 |
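The difference between the flipped and unflipped definitions is easy to see on the example values from the table. In this sketch the function names are illustrative, and the classical convolution is restricted to positions where the kernel fully overlaps the signal so that both outputs have length L − K + 1.

```python
def cross_correlate_1d(f, g):
    """y[n] = sum_k f[n + k] * g[k], no padding, stride 1."""
    K = len(g)
    return [sum(f[n + k] * g[k] for k in range(K)) for n in range(len(f) - K + 1)]

def convolve_1d_valid(f, g):
    """Classical convolution on the full-overlap region: flip g, then cross-correlate."""
    return cross_correlate_1d(f, g[::-1])

f = [1, 2, 3, 4, 5]
g = [1, 0, -1]

print(cross_correlate_1d(f, g))   # [-2, -2, -2]
print(convolve_1d_valid(f, g))    # [2, 2, 2]
```

Because a learned kernel can absorb the flip into its weights, the choice makes no practical difference during training.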
Worked example in Python — every line, every value
Below is a complete hand-rolled 1-D convolution.
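A minimal pure-Python sketch of such a hand-rolled loop, using the signal and kernel from the table above, might look like the following. It is a reconstruction for illustration rather than the original listing, and it prints the window and the element-wise products at each output position.

```python
signal = [1, 2, 3, 4, 5]   # f in the table above
kernel = [1, 0, -1]        # g in the table above

L, K = len(signal), len(kernel)
output = []

for n in range(L - K + 1):                  # output positions 0, 1, 2
    window = signal[n:n + K]                # the K input values under the kernel
    products = [x * w for x, w in zip(window, kernel)]
    output.append(sum(products))            # one output value per position
    print(f"n={n}: window={window} products={products} sum={sum(products)}")

print(output)   # [-2, -2, -2]
```

Every output is −2 because the kernel [1, 0, −1] subtracts the value two steps ahead from the current one, and this signal rises by 1 at every step.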
The same calculation in PyTorch
We now reproduce the identical numerical result using torch.nn.functional.conv1d. The point is to see the mapping between our hand-written loop and the optimised library call — not to use a new algorithm.
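A sketch of the PyTorch call, assuming PyTorch is installed and using the same example signal and kernel as the table above. The only real work is reshaping: `F.conv1d` expects a batch and a channel dimension on both the input and the weight.

```python
import torch
import torch.nn.functional as F

signal = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = torch.tensor([1.0, 0.0, -1.0])

# F.conv1d expects input of shape (batch, in_channels, length)
# and weight of shape (out_channels, in_channels, kernel_length).
x = signal.view(1, 1, -1)
w = kernel.view(1, 1, -1)

y = F.conv1d(x, w)            # stride 1 and no padding by default
print(y.flatten().tolist())   # [-2.0, -2.0, -2.0]
```

The result matches the cross-correlation definition, which confirms that `F.conv1d` does not flip the kernel.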
Mental model
Interactive 1-D convolution
[Interactive widget: 1D Convolution Visualizer. Dragging the kernel across the signal and changing its weights updates each output value; it is the same operation we just coded by hand.]
Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)
What happens when we declare this line?
- input_size = Number of input channels (17 sensors in C-MAPSS)
- 64 = Output channels (64 learned feature detectors)
- kernel_size=3 = Window looks at 3 consecutive timesteps
- padding=1 = Add zeros at boundaries to preserve length
1D Convolution Equation
$$y_t = \sum_{k=0}^{K-1} w_k \, x_{t+k} + b$$
- K = kernel size (3 in our case)
- w = learned weights
- b = bias term
- t = output position
Output Dimension Formula
$$T_{out} = \left\lfloor \frac{T_{in} + 2P - K}{S} \right\rfloor + 1$$
With $T_{in}=8$, $P=1$, $K=3$, $S=1$:
$$T_{out} = \left\lfloor \frac{8 + 2 - 3}{1} \right\rfloor + 1 = 8$$
Padding preserves sequence length!
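The output-length formula translates directly into a one-line helper. The function name `conv_output_length` is an illustrative choice, and the sketch assumes the padded input is at least as long as the kernel.

```python
def conv_output_length(t_in, kernel_size, padding=0, stride=1):
    """T_out = floor((T_in + 2P - K) / S) + 1."""
    return (t_in + 2 * padding - kernel_size) // stride + 1

print(conv_output_length(8, 3, padding=1, stride=1))   # 8
print(conv_output_length(5, 3))                        # 3
print(conv_output_length(8, 3, padding=1, stride=2))   # 4
```

The first call reproduces the worked case above; the third shows that the same padding no longer preserves the length once the stride is greater than 1.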
Parameter Count
For Conv1d(17, 64, kernel_size=3):
Weights = 64 × 17 × 3 = 3,264
Biases = 64
Total = 3,328 parameters
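The same count can be written as a small helper, sketched here with a hypothetical function name. It assumes one weight per (output channel, input channel, kernel position) triple and one bias per output channel, as in the breakdown above.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a 1-D convolutional layer."""
    weights = out_channels * in_channels * kernel_size
    biases = out_channels if bias else 0
    return weights + biases

print(conv1d_param_count(17, 64, 3))               # 3328
print(conv1d_param_count(17, 64, 3, bias=False))   # 3264
```

Neither the padding nor the sequence length appears in the formula: the layer costs the same number of parameters whether the input has 8 timesteps or 8,000.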
What the Kernel Learns
The kernel weights are learned during training. Different patterns emerge:
- [1, 0, -1] → Detects rising/falling edges
- [0.33, 0.33, 0.33] → Smoothing/averaging
- [−1, 2, −1] → Detects spikes
64 different kernels learn 64 different patterns!
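To see what these three patterns respond to, the sketch below applies them to a hypothetical flat signal containing a single spike. The weights are set by hand here purely for illustration; in a trained layer they would come out of gradient descent.

```python
def cross_correlate_1d(signal, kernel):
    """Valid (no padding, stride 1) 1-D cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[n + i] * kernel[i] for i in range(k))
        for n in range(len(signal) - k + 1)
    ]

signal = [1, 1, 1, 1, 5, 1, 1, 1, 1]   # hypothetical: flat signal with one spike

edge = cross_correlate_1d(signal, [1, 0, -1])
smooth = cross_correlate_1d(signal, [1 / 3, 1 / 3, 1 / 3])
spike = cross_correlate_1d(signal, [-1, 2, -1])

print(edge)                            # [0, 0, -4, 0, 4, 0, 0]
print([round(v, 2) for v in smooth])   # [1.0, 1.0, 2.33, 2.33, 2.33, 1.0, 1.0]
print(spike)                           # [0, 0, -4, 8, -4, 0, 0]
```

The edge kernel fires with opposite signs on the rising and falling flanks, the averaging kernel spreads the spike over three positions, and the spike kernel peaks exactly where the window is centred on the outlier.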
Quick Check
A signal of length 10 is convolved with a kernel of length 4 (stride 1, no padding). What is the output length?
2D Convolution for Images
Images are 2-D grids of pixels, so we upgrade the sliding window from a line to a square. Exactly the same principle applies — multiply, sum, slide — just along two axes instead of one.
Mathematical definition
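For an image $I$ of size $H \times W$ and a kernel $K$ of size $M \times N$, the (cross-correlation flavour of) 2-D convolution is:
$$S(i, j) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I(i + m,\, j + n)\, K(m, n)$$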
| Symbol | Meaning | Typical value |
|---|---|---|
| I | Input image | 224×224 |
| K | Kernel | 3×3 or 5×5 |
| i, j | Output row / column | 0 … H−M, 0 … W−N |
| m, n | Index inside the kernel | 0 … M−1, 0 … N−1 |
| M, N | Kernel height / width | 3, 3 |
The intuitive picture
Place a small transparency on the image. Each cell of the transparency carries a number (the kernel weight). At each position:
- Multiply every pixel under the transparency by its corresponding kernel weight.
- Add the nine products.
- Write that single number to the output at this position.
- Slide the transparency one pixel right (or down), repeat.
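The four steps above can be sketched as a pair of nested loops in plain Python. The image and kernel values here are hypothetical, chosen so that the output is easy to verify by eye; no padding is used and the stride is 1.

```python
def cross_correlate_2d(image, kernel):
    """Valid 2-D cross-correlation: no padding, stride 1."""
    H, W = len(image), len(image[0])
    M, N = len(kernel), len(kernel[0])
    output = []
    for i in range(H - M + 1):              # output rows
        row = []
        for j in range(W - N + 1):          # output columns
            total = 0
            for m in range(M):              # multiply every pixel under the
                for n in range(N):          # kernel by its weight and add
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(4)]
# Right-minus-left kernel: responds where brightness increases to the right.
kernel = [[-1, 0, 1] for _ in range(3)]

for row in cross_correlate_2d(image, kernel):
    print(row)
# [0, 27, 27]
# [0, 27, 27]
```

The output is 2×3, as expected from (H − M + 1) × (W − N + 1), and it is non-zero only where the window straddles the dark-to-bright boundary.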
Watch it move
The animation below runs exactly the operation we just described. The highlighted 3×3 window on the left is the current position; the output cell it fills in on the right is computed by the multiply-and-sum rule above.
[Animation: Convolution Animation. Panels: Input (5×5), Kernel (3×3), Output (3×3).]
Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2
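The sum at position (0, 0) can be reproduced directly. The patch and kernel below are read off the listed products, assuming the first factor in each pair is the pixel and the second is the kernel weight; under that reading the kernel is a horizontal Sobel filter.

```python
# Values read off the products listed for position (0, 0).
patch = [[1, 2, 0],
         [0, 1, 2],
         [2, 0, 1]]
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

# Element-wise products in row-major order, then their sum.
products = [patch[m][n] * kernel[m][n] for m in range(3) for n in range(3)]
print(products)        # [-1, 0, 0, 0, 0, 4, -2, 0, 1]
print(sum(products))   # 2
```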
What the kernel weights encode: positive on one side and negative on the other → edge detector. All equal and positive → blur. Large positive centre with negative neighbours → sharpen. In classical vision these were hand-designed; in deep learning they are learned by gradient descent, and the ones that emerge look strikingly like Gabor filters, the same pattern biologists find in V1.
Reference
Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV. Also Olshausen & Field, 1996, Nature 381, on sparse-coding models that produce similar filters from natural image statistics alone.
From Edges to Objects: The Feature Hierarchy
A single conv layer can only detect small, local patterns. The reason CNNs work on object recognition is that we stack many of them. Each successive layer sees a slightly larger piece of the input (a growing receptive field, which we formalise in Section 3) and combines the previous layer's features into more complex ones.
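As a preview of how quickly that view widens, the sketch below computes the receptive field of a stack of identical conv layers. It is a simplified illustration that assumes stride 1, no pooling, and no dilation; the function name is hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (input pixels per axis) of stacked stride-1 conv layers."""
    return 1 + num_layers * (kernel_size - 1)

for depth in range(1, 5):
    print(depth, receptive_field(depth))
# 1 3
# 2 5
# 3 7
# 4 9
```

Each extra 3×3 layer adds two pixels per axis, so under these assumptions depth alone grows the view only linearly.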
Feature Hierarchy in CNNs
How neural networks build complex features from simple ones
- Input: raw pixels
- Layer 1: edges and gradients
- Layer 2: textures and patterns
- Layer 3: object parts
- Layer 4+: objects and scenes
Key Insight: Each layer combines features from the previous layer. Early layers detect low-level features; deeper layers capture high-level concepts.
This is not a hand-waved analogy. Feature-visualisation techniques applied to trained CNNs (activation maximisation, deconvolutional projections, saliency maps) repeatedly recover the same hierarchy: edges and colour blobs in layer 1, simple textures and corners in layer 2, motifs like eyes and wheels in the mid layers, and full object-level concepts near the top.
We now have the what and the why. In Section 2 we zoom in on the convolution operation itself — the distinction between cross-correlation and true convolution, a fully worked 2-D example, interactive kernels, and multi-channel (RGB) convolution. In Section 3 we add stride, padding, pooling, and receptive fields to complete the picture.