Chapter 13

From Pixels to Features: Why CNNs?

Understanding Convolutions

Introduction

Convolutional Neural Networks power nearly every modern vision system — from the face-unlock on your phone to the perception stack of a self-driving car. But why convolution specifically? Why not a plain fully-connected network like the one in Chapter 7?

This chapter answers that question in three stages. First, we show why fully-connected networks cannot scale to realistic images. Then we present the three inductive biases — parameter sharing, translation equivariance, and local connectivity — that make convolutions the right tool for pixel data. Finally we build the convolution operation from scratch in 1-D and 2-D so that when you meet nn.Conv2d in the next section, there is nothing mysterious left.

Reference: Goodfellow, Bengio & Courville, 2016, Deep Learning, MIT Press, ch. 9.2 “Motivation”.

The core claim of this chapter: a convolutional layer is not just “another kind of layer”; it is a fully-connected layer under three strong architectural constraints. Those constraints happen to match the statistics of natural images, and that match is the entire reason CNNs work.

Learning Objectives

After this section you will be able to:

  1. Quantify why fully-connected layers fail on images — compute the parameter count for a 224×224×3 input and explain why it is infeasible.
  2. Name the three CNN inductive biases — parameter sharing, translation equivariance, and sparse / local connectivity — and explain the statistical assumption each one encodes.
  3. Compute 1-D and 2-D convolutions by hand and in NumPy, then reproduce the exact same result with PyTorch's F.conv1d / F.conv2d.
  4. Predict output size using $O = \left\lfloor \tfrac{I - K + 2P}{S} \right\rfloor + 1$ for any stride and padding.
  5. Explain the feature hierarchy: why stacked conv layers go from edges to textures to object parts, and why this mirrors the mammalian visual cortex (Hubel & Wiesel, 1962, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex”, J. Physiology 160(1)).

The Parameter Explosion Problem

Suppose we want to classify a single RGB image of ImageNet resolution — 224×224 pixels, 3 channels — into 1,000 categories. The most naïve approach is to flatten the image to a vector and push it through a fully-connected (FC) hidden layer.

The numbers, without waving our hands

Flattening the image gives a vector of length $224 \times 224 \times 3 = 150{,}528$. Connecting that to just one hidden neuron requires one weight per input value, that is, 150,528 weights per neuron. If we want a hidden layer with 1,000 neurons (modest by modern standards), the parameter count is:

$$150{,}528 \times 1{,}000 + 1{,}000 \;=\; 150{,}529{,}000 \approx 150\text{M parameters}$$

… in a single hidden layer. For comparison, the entire ResNet-50 network (50 layers deep, ImageNet-trained) has about 25.6M parameters (He, Zhang, Ren & Sun, 2016, “Deep Residual Learning for Image Recognition”, CVPR). The fully-connected approach is already about six times larger than ResNet-50 after one layer. Training this at scale is impractical for three independent reasons:

  1. Memory. 150M float32 weights alone occupy 600 MB — plus gradients, Adam moments, and activations.
  2. Sample complexity. Classical learning-theory bounds scale roughly with the parameter count. With this many weights you need an astronomically large training set to avoid overfitting.
  3. It ignores structure. Pixel (12, 37) and pixel (12, 38) are almost certainly related — they are neighbours. An FC layer treats them as arbitrary independent features. The spatial topology of the image is thrown away the moment we flatten.
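
The arithmetic behind the first reason can be checked in a few lines. This is a minimal sketch: the variable names are illustrative, and the memory figure counts only the float32 weights and biases (4 bytes each), not gradients, optimiser state, or activations.

```python
# Parameter count and weight memory for one FC hidden layer on a 224x224x3 input.
height, width, channels = 224, 224, 3
hidden_units = 1_000

inputs = height * width * channels                 # 150,528 input values
fc_params = inputs * hidden_units + hidden_units   # weights + one bias per unit

bytes_per_float32 = 4
weight_megabytes = fc_params * bytes_per_float32 / 1e6

print(f"inputs          : {inputs:,}")
print(f"FC parameters   : {fc_params:,}")
print(f"float32 storage : {weight_megabytes:.1f} MB")
```

Training with Adam would typically keep a gradient and two moment estimates per weight on top of this, so the real footprint is several times the printed figure.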

What a 3×3 convolution would use instead

A 3×3 convolutional layer that produces 64 feature channels from the same 3-channel input needs only:

$$3 \times 3 \times 3 \times 64 + 64 \;=\; 1{,}792 \text{ parameters}$$

That is about 84,000× fewer parameters than the FC layer above — and it preserves spatial structure. How is that possible? The interactive diagram below shows the key idea: an FC layer connects every input pixel to every output unit, while a conv layer connects only a tiny local neighbourhood, and shares the same weights across every spatial location.
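
The comparison above can be reproduced directly. The sketch below is illustrative only: it compares raw parameter counts of the two layers as described in the text (1,000 hidden units versus 64 conv channels), and the list of image sizes in the loop is an assumption added to show that the conv count does not depend on resolution.

```python
# Compare one 3x3 conv layer (3 -> 64 channels) with the FC layer from the text.
kernel_h, kernel_w = 3, 3
in_channels, out_channels = 3, 64

conv_params = kernel_h * kernel_w * in_channels * out_channels + out_channels
fc_params = 224 * 224 * 3 * 1_000 + 1_000      # FC layer with 1,000 hidden units

ratio = fc_params / conv_params
print(f"conv parameters : {conv_params:,}")
print(f"FC / conv ratio : {ratio:,.0f}x")

# The conv count stays fixed as the image grows; the FC count does not.
for side in (32, 224, 512):
    fc = side * side * 3 * 1_000 + 1_000
    print(f"{side}x{side} input -> FC {fc:,} vs conv {conv_params:,}")
```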

Connectivity Comparison

[Interactive diagram, fully-connected view: a 4×4 input (units 0 to 15) is wired to a 4×4 output (units 0 to 15) with 16 × 16 = 256 connections, each with its own weight. In FC, every input connects to every output.]

Quick Check

A fully-connected layer maps a 32×32×3 input (CIFAR-10 images) to 512 hidden units. How many weights (excluding bias) does it have?


Three Pillars of CNN Design

A convolutional layer is best understood as a fully-connected layer under three architectural constraints. Each constraint encodes a specific assumption about natural images. If the assumption holds (it does, empirically) we get massive parameter savings and better generalisation for free.

Pillar 1: Parameter sharing

In an FC layer, every output unit has its own private set of input weights. In a conv layer, the same kernel weights are reused at every spatial position. If a 3×3 filter is useful for detecting an edge at pixel (5, 5), it is equally useful at pixel (100, 200) — so we store the filter once and apply it everywhere.

The assumption this encodes is stationarity of image statistics: the distribution of local pixel patterns is roughly the same across the image. That assumption is false for a passport photo centred on a face, but true for natural images on average — and more importantly, the filters we actually want to learn (edges, colour blobs, textures) are themselves translation-independent.

Pillar 2: Translation equivariance

A function $f$ is translation equivariant if shifting the input by $\Delta$ produces an output shifted by the same $\Delta$. Convolution satisfies this exactly:

$$(T_{\Delta} I) * K \;=\; T_{\Delta} (I * K)$$

Concretely: if a cat appears 50 pixels to the right of where the network saw cats during training, the feature map for “cat” simply shifts 50 pixels — the network does not have to relearn the concept from scratch at every location. This is a structural guarantee of the operation, not something the network has to learn.

Equivariance vs invariance

Convolution is equivariant (output shifts with input), not invariant (output unchanged). Invariance is usually what the final classifier needs and is typically achieved by combining conv layers with pooling or global average pooling, which we cover in Section 3.
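
Both properties can be seen on a toy 1-D signal. The sketch below is an illustration under stated assumptions: the signal, the shift of 2, and the helper name `correlate_valid` are made up for the example, the bump is kept away from the borders so that `np.roll` behaves like a plain shift, and global max pooling stands in for whatever pooling the final network would use.

```python
import numpy as np

def correlate_valid(signal, kernel):
    """Unflipped sliding dot product (what deep learning calls convolution)."""
    return np.correlate(signal, kernel, mode="valid")

kernel = np.array([1.0, 0.0, -1.0])

# A small bump surrounded by zeros, so shifting it never touches the borders.
signal = np.zeros(12)
signal[3:6] = [1.0, 3.0, 2.0]

shift = 2
shifted_signal = np.roll(signal, shift)

out = correlate_valid(signal, kernel)
out_of_shifted = correlate_valid(shifted_signal, kernel)

# Equivariance: filtering the shifted input equals shifting the filtered output.
print(np.array_equal(out_of_shifted, np.roll(out, shift)))   # True

# Invariance appears only after pooling over positions (global max here).
print(out.max() == out_of_shifted.max())                     # True
```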

Pillar 3: Sparse / local connectivity

Each output unit in a conv layer depends on only a tiny local patch of the input (a $K \times K$ window) rather than every pixel. This matches two well-known facts:

  • Image statistics: Nearby pixels are highly correlated; distant pixels are nearly independent. Local filters exploit this correlation; global connections waste capacity on irrelevant pairs.
  • Biological precedent: Hubel & Wiesel showed that neurons in the primary visual cortex (V1) respond to small, localised regions of the visual field and to oriented edges within those regions (Hubel & Wiesel, 1962, J. Physiology 160(1)). Their single-unit recordings in cat striate cortex were the direct biological inspiration for the Neocognitron (Fukushima 1980), the immediate ancestor of modern CNNs. CNNs are a deliberate computational echo of that architecture.

The whole chapter in one line: a conv layer is an FC layer plus parameter sharing plus locality. The math we are about to do is just the explicit form of those two constraints.
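
To make that one-line summary concrete, the sketch below writes a tiny 1-D conv as an explicit fully-connected weight matrix. It is an illustration, not how any library stores conv weights: the signal and kernel are small example values, and the name `fc_weights` is hypothetical.

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

n_out = len(signal) - len(kernel) + 1          # 3 output positions

# Build the equivalent FC weight matrix: one row per output unit.
fc_weights = np.zeros((n_out, len(signal)), dtype=int)
for row in range(n_out):
    # Locality: only K consecutive inputs get a weight; the rest stay 0.
    # Sharing: every row reuses the same K kernel values.
    fc_weights[row, row:row + len(kernel)] = kernel

print(fc_weights)
print(fc_weights @ signal)          # same result as sliding the kernel

free_dense = fc_weights.size        # independent weights in an unconstrained FC layer
free_conv = kernel.size             # independent weights once both constraints apply
print(free_dense, free_conv)        # 15 3
```

The matrix has a banded structure with the kernel repeated along the diagonal; an unconstrained FC layer would be free to put any value in any of the 15 cells.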

Starting Simple: 1D Convolution

Before we attack images, let us build intuition on a 1-D signal. Exactly the same operation — slide a small filter across the input, sum weighted values — powers audio processing, time-series forecasting, and 1-D sensor fusion.

Intuition: a sliding weighted sum

Imagine a short ruler with three numbers written on it. You lay it on top of a signal, multiply each signal value by the number on the ruler above it, and add the three products. That single number is the output at that position. Slide the ruler one step to the right, repeat.

Mathematical definition

The classical (flipped) definition of 1-D convolution on infinite signals is:

$$(f * g)[n] \;=\; \sum_{k=-\infty}^{\infty} f[k]\, g[n - k]$$

In deep learning we almost always skip the kernel flip and use the simpler cross-correlation definition, which for a finite signal of length $L$ and a kernel of length $K$ reads:

$$(f \star g)[n] \;=\; \sum_{k=0}^{K-1} f[n + k]\, g[k]$$

Why two definitions?

The flipped version (first equation) is the operation signal-processing textbooks call “convolution”. The unflipped version (second equation) is cross-correlation. Because CNN kernels are learned, the flip makes no functional difference — the network absorbs it into the weights. The deep-learning community uses cross-correlation everywhere but still calls it “convolution”. We explain this more carefully in Section 2. For the rest of this chapter, * means the unflipped version.

| Symbol | Meaning | Example value |
|---|---|---|
| f | Input 1-D signal | [1, 2, 3, 4, 5] |
| g | Kernel / filter | [1, 0, -1] |
| n | Output position | 0, 1, 2 |
| k | Index into the kernel | 0, 1, 2 |
| K | Kernel length | 3 |
| L − K + 1 | Output length (no padding, stride 1) | 5 − 3 + 1 = 3 |
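
The difference between the two definitions is easy to see with NumPy, which provides both: `np.convolve` flips the kernel, `np.correlate` does not. The sketch below uses the example values from the table and is meant only to show the sign change caused by the flip.

```python
import numpy as np

f = np.array([1, 2, 3, 4, 5])
g = np.array([1, 0, -1])

true_conv = np.convolve(f, g, mode="valid")     # flips g before sliding
cross_corr = np.correlate(f, g, mode="valid")   # slides g as written

print("convolution      :", true_conv)          # [2 2 2]
print("cross-correlation:", cross_corr)         # [-2 -2 -2]

# Flipping the kernel by hand turns one operation into the other.
print(np.array_equal(np.convolve(f, g[::-1], mode="valid"), cross_corr))  # True
```

For a symmetric kernel the two results coincide; for a learned kernel the flip simply corresponds to a different ordering of the learned weights.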

Worked example in Python — every line, every value

Below is a complete hand-rolled 1-D convolution, annotated line by line. The full listing follows the annotations.

1-D convolution from scratch (NumPy): conv1d_numpy.py
Line 1: import numpy as np

NumPy provides fast N-dimensional arrays and element-wise math. Every operation we need (slice, element-wise multiply, sum) runs as optimized C code rather than Python loops. The alias 'np' is a universal convention.

EXECUTION STATE
numpy = Numerical computing library: ndarray, broadcasting, linear algebra.
as np = Alias — we write np.array() instead of numpy.array().
Line 4: signal = np.array([1, 2, 3, 4, 5])

Creates a 1-D NumPy array of length 5. In practice this might be audio samples, daily stock prices, or sensor readings over time. We picked a simple ramp (+1 per step) so the outputs are trivial to verify by hand.

EXECUTION STATE
📚 np.array() = Constructs an ndarray from a Python list. Infers dtype (here int64 on most 64-bit platforms).
⬇ arg: [1, 2, 3, 4, 5] = Python list of 5 integers — the raw data we want to wrap in an ndarray.
⬆ signal = [1, 2, 3, 4, 5] — shape (5,), dtype int64
Line 7: kernel = np.array([1, 0, -1])

The filter we will slide across the signal. [1, 0, -1] computes signal[n] − signal[n+2]: the negative of a central difference, i.e. −2× a discrete derivative estimate at the middle sample. Every filter we will meet later (Sobel, Gaussian, sharpen) is the same shape of object, just with different numbers.

EXECUTION STATE
⬇ arg: [1, 0, -1] = The three kernel weights. Left weight = +1, middle = 0, right = −1.
⬆ kernel = [1, 0, -1] — shape (3,), dtype int64
→ what this detects = Outputs +c when the signal is DECREASING (left > right) and −c when INCREASING.
Line 10: output = []

An empty Python list that will accumulate one scalar per sliding position. Using a list (then converting later) is simpler than pre-allocating here.

EXECUTION STATE
output = [] — empty list, length 0.
Line 12: for n in range(len(signal) - len(kernel) + 1):

Slide the kernel across every valid position where it fits entirely inside the signal. The number of valid positions is len(signal) − len(kernel) + 1 = 5 − 3 + 1 = 3. This is THE output-size formula, in its simplest form (stride=1, padding=0).

EXECUTION STATE
📚 range(n) = Python builtin: yields 0, 1, 2, …, n−1. Here n = 3 so we get n ∈ {0, 1, 2}.
len(signal) = 5
len(kernel) = 3
→ positions to visit = n = 0, 1, 2 — exactly 3 windows.
LOOP TRACE · 3 iterations
n = 0
window start idx = 0 → covers signal[0:3]
n = 1
window start idx = 1 → covers signal[1:4]
n = 2
window start idx = 2 → covers signal[2:5]
Line 13: window = signal[n : n + len(kernel)]

Slice out len(kernel) consecutive samples starting at index n. This is the piece of the signal that sits directly under the kernel right now. NumPy slices share memory with the original array (view, not copy).

EXECUTION STATE
📚 arr[a:b] = Python slice: elements at indices a, a+1, …, b−1. Stop index is exclusive.
⬇ start = n = The current sliding position (0, 1, or 2).
⬇ stop = n + 3 = Exclusive upper bound so we get exactly 3 elements.
LOOP TRACE · 3 iterations
n = 0
window = [1, 2, 3]
n = 1
window = [2, 3, 4]
n = 2
window = [3, 4, 5]
Line 14: products = window * kernel

Element-wise multiplication between two equal-length 1-D arrays. NumPy pairs up matching positions and multiplies each. No loop — a single vectorised op under the hood. This is the 'weighting' step of the convolution.

EXECUTION STATE
* = On ndarrays: element-wise multiply (NOT matrix multiply — that would be @).
⬇ window = current 3-element slice of the signal
⬇ kernel = [1, 0, -1]
LOOP TRACE · 3 iterations
n = 0
window * kernel = [1, 2, 3] * [1, 0, -1] = [1×1, 2×0, 3×(-1)] = [1, 0, -3]
n = 1
window * kernel = [2, 3, 4] * [1, 0, -1] = [2×1, 3×0, 4×(-1)] = [2, 0, -4]
n = 2
window * kernel = [3, 4, 5] * [1, 0, -1] = [3×1, 4×0, 5×(-1)] = [3, 0, -5]
Line 15: value = np.sum(products)

Reduce the 3 products to a single scalar by adding them. Together with the element-wise multiply above, this is the 'dot product between window and kernel' — the heart of every convolution we will ever write.

EXECUTION STATE
📚 np.sum(arr) = NumPy reduction: adds all elements. For a 1-D array of length 3, returns a scalar. Equivalent to arr[0] + arr[1] + arr[2].
⬇ arg: products = The 3-element array from line 14.
LOOP TRACE · 3 iterations
n = 0
sum([1, 0, -3]) = 1 + 0 + (-3) = -2
n = 1
sum([2, 0, -4]) = 2 + 0 + (-4) = -2
n = 2
sum([3, 0, -5]) = 3 + 0 + (-5) = -2
Line 16: output.append(value)

Append the scalar we just computed to the running output list. After all three iterations, output has length 3 — exactly what the output-size formula predicted.

EXECUTION STATE
📚 list.append(x) = Python list method: adds x to the end in O(1) amortised time.
LOOP TRACE · 3 iterations
after n = 0
output = [-2]
after n = 1
output = [-2, -2]
after n = 2
output = [-2, -2, -2]
Line 18: print("signal:", signal)

Display the input signal for comparison with the kernel and output.

EXECUTION STATE
signal = [1 2 3 4 5]
Line 19: print("kernel:", kernel)

Display the kernel weights.

EXECUTION STATE
kernel = [ 1 0 -1]
Line 20: print("output:", output)

Display the final output list. The constant −2 tells us the ramp has constant slope = +1: the kernel detects INCREASE and returns −1 × 2 = −2 (the first-vs-last gap across a 3-wide window).

EXECUTION STATE
⬆ output = [-2, -2, -2]
→ interpretation = Flat, negative output ⇒ signal increases at constant rate. If the signal dipped, the corresponding output entry would be positive.
import numpy as np

# Input signal: a simple linearly-increasing ramp
signal = np.array([1, 2, 3, 4, 5])

# Kernel [1, 0, -1]: computes signal[n] - signal[n+2] (a negated central difference)
kernel = np.array([1, 0, -1])

# Output has length = len(signal) - len(kernel) + 1 = 5 - 3 + 1 = 3
output = []

for n in range(len(signal) - len(kernel) + 1):
    window = signal[n : n + len(kernel)]   # slice 3 consecutive values
    products = window * kernel             # element-wise multiply (equal-length arrays)
    value = np.sum(products)               # reduce to a single scalar
    output.append(value)

print("signal:", signal)
print("kernel:", kernel)
print("output:", output)
# Expected → output: [-2, -2, -2]
# (NumPy 2.x prints each entry as np.int64(-2), because the list holds NumPy scalars)
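
The loop above can also be written without an explicit Python loop. The sketch below is one possible vectorised form, assuming NumPy 1.20 or later for `sliding_window_view`; it is shown only to make the "dot product per window" reading explicit, not as the way any library implements convolution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Each row is one window the loop would have sliced: shape (3, 3).
windows = sliding_window_view(signal, len(kernel))

# One matrix-vector product replaces the multiply-and-sum of every iteration.
output = windows @ kernel

print(windows)
print(output.tolist())   # [-2, -2, -2]
```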

The same calculation in PyTorch

We now reproduce the identical numerical result using torch.nn.functional.conv1d. The point is to see the mapping between our hand-written loop and the optimised library call — not to use a new algorithm.

1-D convolution with PyTorch F.conv1d: conv1d_torch.py

Line 1: import torch

PyTorch is the deep-learning framework we will use throughout the book. The torch package gives us Tensor (the GPU-aware analog of ndarray) and the autograd system.

Line 2: import torch.nn.functional as F

torch.nn.functional contains the STATELESS versions of neural-net operations: F.conv1d, F.conv2d, F.relu, F.softmax… We use 'F' as the standard alias. The module-style nn.Conv1d (capital C) wraps these functions with learnable parameters — we will see that form later.

EXECUTION STATE
torch.nn.functional = Functional API — pure functions that take weights as arguments.
as F = Convention: import torch.nn.functional as F.
Line 6: signal = torch.tensor([...]).view(1, 1, 5)

Wrap our 1-D data into the shape PyTorch's conv ops require. F.conv1d demands exactly 3 dimensions: (batch_size, in_channels, length). We have 1 sample, 1 channel, 5 timesteps — hence .view(1, 1, 5).

EXECUTION STATE
📚 torch.tensor(data) = Builds a new tensor and infers dtype. Floats → float32; we used 1. instead of 1 to force float.
📚 .view(a, b, c) = Reinterprets a contiguous tensor with a new shape WITHOUT copying memory. Total element count must match: 1 × 1 × 5 = 5 ✓.
⬇ arg: (1, 1, 5) = (N=batch, C_in=channels, L=length). Batch axis is first in PyTorch.
⬆ signal.shape = torch.Size([1, 1, 5])
Line 7: kernel = torch.tensor([...]).view(1, 1, 3)

The kernel for F.conv1d must have shape (out_channels, in_channels, kernel_size). One output channel × one input channel × 3-wide kernel ⇒ (1, 1, 3). The SAME [1, 0, -1] weights as the NumPy version.

EXECUTION STATE
⬇ arg: (1, 1, 3) = (C_out=1, C_in=1, K=3). C_in MUST match the signal's C_in.
⬆ kernel.shape = torch.Size([1, 1, 3])
Line 10: output = F.conv1d(signal, kernel, stride=1, padding=0)

The PyTorch call that replaces our entire Python for-loop. Under the hood PyTorch dispatches to optimised CPU kernels (oneDNN, formerly MKL-DNN) or cuDNN on GPU, which at realistic sizes are orders of magnitude faster than our loop.

EXECUTION STATE
📚 F.conv1d(input, weight, ...) = 1-D cross-correlation. Slides 'weight' across 'input' along the L (length) axis and computes a weighted sum at each position.
⬇ arg 1: signal = Shape (1, 1, 5) — what we slide over.
⬇ arg 2: kernel = Shape (1, 1, 3) — the filter weights. IDENTICAL numerical behaviour to NumPy's window*kernel·sum loop above.
⬇ arg 3: stride=1 = Advance the window by 1 element every step (so every valid position is visited).
⬇ arg 4: padding=0 = No zeros added at the boundary ⇒ output length shrinks. With padding=1 we would see output length 5 instead of 3.
⬆ output.shape = torch.Size([1, 1, 3])
Line 12: print("signal shape :", signal.shape)

Verify the signal really is (N, C, L) = (1, 1, 5). If you see (5,) instead, you forgot .view and F.conv1d will raise a shape error.

EXECUTION STATE
→ prints = torch.Size([1, 1, 5])
Line 13: print("kernel shape :", kernel.shape)

Verify the kernel has 3 axes too. A common mistake is passing a 1-D tensor here.

EXECUTION STATE
→ prints = torch.Size([1, 1, 3])
Line 14: print("output       :", output.squeeze().tolist())

Collapse the two length-1 axes and dump to a Python list so we can compare value for value with the NumPy version.

EXECUTION STATE
📚 .squeeze() = Removes all dimensions of size 1: (1, 1, 3) → (3,).
📚 .tolist() = Convert tensor to nested Python list. Triggers CPU sync if the tensor lived on GPU.
⬆ prints = [-2.0, -2.0, -2.0] — matches NumPy exactly.
import torch
import torch.nn.functional as F

# Same signal and kernel as above, but as PyTorch tensors.
# Conv layers expect shape (batch, channels, length), so we add two dummy axes.
signal = torch.tensor([1., 2., 3., 4., 5.]).view(1, 1, 5)  # (N=1, C=1, L=5)
kernel = torch.tensor([1., 0., -1.]).view(1, 1, 3)         # (C_out=1, C_in=1, K=3)

# Cross-correlation (what deep learning calls "convolution")
output = F.conv1d(signal, kernel, stride=1, padding=0)

print("signal shape :", signal.shape)
print("kernel shape :", kernel.shape)
print("output       :", output.squeeze().tolist())
# Expected → output: [-2.0, -2.0, -2.0]

Mental model

Every PyTorch conv call is just a carefully batched, GPU-accelerated version of the NumPy loop you just read. Any time a conv layer surprises you, reach for the loop form first — the mystery usually evaporates.

Interactive 1-D convolution

Drag the kernel across the signal, change the weights, and watch each output value update. This is the same operation we just coded by hand — only now you can feel it move.

Interactive 1D Convolution Visualizer

Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)

What happens when we declare this line?

input_size = Number of input channels (17 sensors in C-MAPSS)
64 = Output channels (64 learned feature detectors)
kernel_size=3 = Window looks at 3 consecutive timesteps
padding=1 = Add zeros at boundaries to preserve length
Data: NASA C-MAPSS FD001, sensor T30 (total temperature at HPC outlet)

Input (padded): [pad 0.00, t0 0.82, t1 0.91, t2 0.76, t3 0.88, t4 0.95, t5 0.71, t6 0.84, t7 0.93, pad 0.00]
Kernel (size 3): w0 = 0.33, w1 = 0.34, w2 = 0.33
Step 1/8, position 0: y0 = 0.33 × 0.00 + 0.34 × 0.82 + 0.33 × 0.91 = 0.000 + 0.279 + 0.300 = 0.579
Output so far: y0 = 0.58 (y1 … y7 are filled in by the remaining seven steps)

1D Convolution Equation

$$y_t = \sum_{k=0}^{K-1} w_k \, x_{t+k} + b$$

  • K = kernel size (3 in our case)
  • w = learned weights
  • b = bias term
  • t = output position
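
The equation can be evaluated on the values shown in the visualizer to reproduce its first step. This is a sketch under one assumption: the visualizer's arithmetic shows no bias term, so `b` is set to 0 here.

```python
import numpy as np

# Padded input and kernel weights as displayed in the visualizer (bias assumed 0).
x_padded = np.array([0.00, 0.82, 0.91, 0.76, 0.88, 0.95, 0.71, 0.84, 0.93, 0.00])
w = np.array([0.33, 0.34, 0.33])
b = 0.0

K = len(w)
T_out = len(x_padded) - K + 1           # 8 output positions

# y_t = sum_k w_k * x_{t+k} + b, written as an explicit loop over t.
y = np.array([np.dot(w, x_padded[t:t + K]) + b for t in range(T_out)])

print(T_out)                      # 8
print(round(float(y[0]), 3))      # 0.579
```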

Output Dimension Formula

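$$T_{\text{out}} = \left\lfloor \tfrac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$

With $T_{\text{in}} = 8$, $P = 1$, $K = 3$, $S = 1$:

$$T_{\text{out}} = \left\lfloor \tfrac{8 + 2 - 3}{1} \right\rfloor + 1 = 8$$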

Padding preserves sequence length!
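
The formula can be checked against a loop-based implementation for several stride and padding settings. The sketch below is illustrative: `output_length` and `conv1d` are hypothetical helper names, zero padding is assumed, and the test signal is arbitrary.

```python
import numpy as np

def output_length(t_in, k, p=0, s=1):
    """floor((T_in + 2P - K) / S) + 1, the formula quoted above."""
    return (t_in + 2 * p - k) // s + 1

def conv1d(signal, kernel, stride=1, padding=0):
    """Loop-based cross-correlation with zero padding and stride."""
    padded = np.pad(signal, padding)               # zeros on both ends
    k = len(kernel)
    starts = range(0, len(padded) - k + 1, stride)
    return np.array([np.dot(padded[i:i + k], kernel) for i in starts])

signal = np.arange(1, 9)            # length 8
kernel = np.array([1, 0, -1])

for stride, padding in [(1, 0), (1, 1), (2, 0), (2, 1), (3, 1)]:
    measured = len(conv1d(signal, kernel, stride, padding))
    predicted = output_length(len(signal), len(kernel), padding, stride)
    print(f"S={stride} P={padding}: measured {measured}, predicted {predicted}")
```

With `stride=1, padding=1` both numbers are 8, the length-preserving case described above.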

Parameter Count

For Conv1d(17, 64, kernel_size=3):

Weights = 64 × 17 × 3 = 3,264

Biases = 64

Total = 3,328 parameters
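
The count follows from the weight tensor's shape, (out_channels, in_channels, kernel_size). The NumPy sketch below imitates a 17-channel to 64-channel layer with kernel size 3 and padding 1 on random data. It is an assumption-laden illustration rather than PyTorch's implementation: the weights are random, the bias is zero, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
c_in, c_out, k, t = 17, 64, 3, 8

x = rng.standard_normal((c_in, t))              # one sample: 17 channels, 8 timesteps
weight = rng.standard_normal((c_out, c_in, k))  # one (17 x 3) kernel per output channel
bias = np.zeros(c_out)

x_pad = np.pad(x, ((0, 0), (1, 1)))             # zero-pad the time axis only -> (17, 10)
windows = sliding_window_view(x_pad, k, axis=1) # (17, 8, 3): every length-3 window

# Sum over input channels c and kernel taps k for each output channel o and time t.
y = np.einsum("ctk,ock->ot", windows, weight) + bias[:, None]

print(y.shape)                      # (64, 8)
print(weight.size + bias.size)      # 3328
```

Each output channel mixes all 17 input channels, which is why the factor 17 appears in the weight count.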

What the Kernel Learns

The kernel weights are learned during training. Different patterns emerge:

  • [1, 0, -1] → Detects rising/falling edges
  • [0.33, 0.33, 0.33] → Smoothing/averaging
  • [−1, 2, −1] → Detects spikes

64 different kernels learn 64 different patterns!

Quick Check

A signal of length 10 is convolved with a kernel of length 4 (stride 1, no padding). What is the output length?


2D Convolution for Images

Images are 2-D grids of pixels, so we upgrade the sliding window from a line to a square. Exactly the same principle applies — multiply, sum, slide — just along two axes instead of one.

Mathematical definition

For an image $I$ and a kernel $K$ of size $M \times N$, the (cross-correlation flavour of) 2-D convolution is:

$$(I * K)[i, j] \;=\; \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} I[i+m,\, j+n] \cdot K[m, n]$$

| Symbol | Meaning | Typical value |
|---|---|---|
| I | Input image (height H, width W) | 224×224 |
| K | Kernel | 3×3 or 5×5 |
| i, j | Output row / column | 0 … H−M, 0 … W−N |
| m, n | Index inside the kernel | 0 … M−1, 0 … N−1 |
| M, N | Kernel height / width | 3, 3 |

The intuitive picture

Place a small $3 \times 3$ transparency on the image. Each cell of the transparency carries a number (the kernel weight). At each position:

  1. Multiply every pixel under the transparency by its corresponding kernel weight.
  2. Add the nine products.
  3. Write that single number to the output at this position.
  4. Slide the transparency one pixel right (or down), repeat.

Watch it move

The animation below runs exactly the operation we just described. The highlighted 3×3 window on the left is the current position; the output cell it fills in on the right is computed by the multiply-and-sum rule above.

Convolution Animation (static snapshot)

Input (5×5):

1 2 0 1 2
0 1 2 1 0
2 0 1 0 1
1 2 0 2 1
0 1 2 1 0

Kernel (3×3):

-1  0 +1
-2  0 +2
-1  0 +1

Output (3×3):

 2 -1 -2
-1  0 -1
-1  0  0

Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2
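
The four steps listed earlier translate directly into two nested loops. The sketch below applies them to the input and kernel from the animation; the function name `conv2d` is a hypothetical helper, and the code covers only the stride-1, no-padding case.

```python
import numpy as np

image = np.array([
    [1, 2, 0, 1, 2],
    [0, 1, 2, 1, 0],
    [2, 0, 1, 0, 1],
    [1, 2, 0, 2, 1],
    [0, 1, 2, 1, 0],
])
kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: multiply, add, write, slide."""
    H, W = image.shape
    M, N = kernel.shape
    out = np.zeros((H - M + 1, W - N + 1), dtype=image.dtype)
    for i in range(H - M + 1):                  # slide down
        for j in range(W - N + 1):              # slide right
            patch = image[i:i + M, j:j + N]     # pixels under the kernel
            out[i, j] = np.sum(patch * kernel)  # multiply, then add
    return out

print(conv2d(image, kernel))
# [[ 2 -1 -2]
#  [-1  0 -1]
#  [-1  0  0]]
```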

What the kernel weights encode: positive on one side and negative on the other → edge detector. All equal and positive → blur. Large positive centre with negative neighbours → sharpen. In classical vision these were hand-designed; in deep learning they are learned by gradient descent, and the ones that emerge look strikingly like Gabor filters, the same pattern biologists find in V1 (Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV; see also Olshausen & Field, 1996, Nature 381, on sparse-coding models that produce similar filters from natural image statistics alone).
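
The three kernel families named above can be tried on a tiny synthetic image. The sketch below is illustrative only: the 5×5 step-edge image and the specific kernel values (a Sobel-style edge kernel, a 3×3 box blur, and a common sharpen kernel) are assumptions chosen for the example, and `conv2d` is a hypothetical helper built on `sliding_window_view` (NumPy 1.20 or later).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(image, kernel):
    """Valid 2-D cross-correlation via a view of all kernel-sized patches."""
    patches = sliding_window_view(image, kernel.shape)   # (H-M+1, W-N+1, M, N)
    return np.einsum("ijmn,mn->ij", patches, kernel)

# 5x5 test image: dark on the left (0), bright on the right (1).
image = np.zeros((5, 5))
image[:, 2:] = 1.0

edge = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
blur = np.full((3, 3), 1 / 9)
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

# Every output row is identical for this image, so print only the first.
for name, k in [("edge", edge), ("blur", blur), ("sharpen", sharpen)]:
    print(name, np.round(conv2d(image, k)[0], 3))
```

The edge kernel responds only where the window straddles the step, the blur turns the step into a ramp, and the sharpen kernel undershoots and overshoots on either side of it.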

From Edges to Objects: The Feature Hierarchy

A single conv layer can only detect small, local patterns. The reason CNNs work on object recognition is that we stack many of them. Each successive layer sees a slightly larger piece of the input (a growing receptive field, which we formalise in Section 3) and combines the previous layer's features into more complex ones.
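
The growth of the receptive field can be observed on a 1-D toy case before it is formalised. The sketch below makes several simplifying assumptions: stride 1, a 3-tap kernel of ones, no nonlinearity, and a signal long enough that the borders are never reached. It tracks how far a single input sample spreads, which for stride-1 layers has the same width as the region of input a single output depends on.

```python
import numpy as np

kernel = np.ones(3)                 # any 3-tap kernel with non-zero weights

# A single non-zero input sample in the middle of a zero signal.
x = np.zeros(15)
x[7] = 1.0

widths = []
for layer in range(1, 5):
    x = np.correlate(x, kernel, mode="same")     # one more stride-1 layer
    widths.append(int(np.count_nonzero(x)))      # how far the sample has spread

print(widths)   # [3, 5, 7, 9]
```

Each extra 3-tap layer widens the footprint by 2 samples, which is the "slightly larger piece of the input" described above.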

Feature Hierarchy in CNNs

How neural networks build complex features from simple ones

| Level | Stage | What it represents | Examples |
|---|---|---|---|
| Level 0 (concrete) | Input | Raw pixels | RGB values, grayscale intensity |
| Level 1 | Layer 1 | Edges & Gradients | Horizontal edges, vertical edges, diagonal lines, color blobs |
| Level 2 | Layer 2 | Textures & Patterns | Corners, simple textures, gradients, color patterns |
| Level 3 | Layer 3 | Object Parts | Eyes, wheels, fur patterns, windows |
| Level 4 (abstract) | Layer 4+ | Objects & Scenes | Faces, cars, animals, buildings |

Key Insight: Each layer combines features from the previous layer. Early layers detect low-level features; deeper layers capture high-level concepts.

This is not a hand-waved analogy. Feature-visualisation techniques applied to trained CNNs (activation maximisation, deconvolutional projections, saliency maps) repeatedly recover the same hierarchy: edges and colour blobs in layer 1, simple textures and corners in layer 2, motifs like eyes and wheels in the mid layers, and full object-level concepts near the top (Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV).


We now have the what and the why. In Section 2 we zoom in on the convolution operation itself — the distinction between cross-correlation and true convolution, a fully worked 2-D example, interactive kernels, and multi-channel (RGB) convolution. In Section 3 we add stride, padding, pooling, and receptive fields to complete the picture.
