Chapter 13

Stride, Padding, Pooling & Receptive Fields

Understanding Convolutions

Stride — Controlling Output Resolution

In Section 2 our kernel moved one pixel at a time. That is stride 1. If we instead move s pixels per step, we evaluate fewer windows: the output shrinks by roughly a factor of s along each spatial axis, so the number of windows, and with it the compute, drops by roughly a factor of s².

Definition

Let $S$ be the stride. Then the output size (no padding) becomes:

$$O = \left\lfloor \frac{I - K}{S} \right\rfloor + 1$$

For the 5×5 image and 3×3 Sobel-X kernel of Section 2: stride 1 gives a 3×3 output; stride 2 gives a 2×2 output; stride 3 gives a 1×1 output. Every stride greater than 1 throws information away, but that is often exactly what we want: modern architectures like ResNet use stride-2 convolutions in place of (or alongside) pooling precisely to down-sample.

Reference

He et al., 2016, “Deep Residual Learning for Image Recognition”, CVPR. See also Springenberg, Dosovitskiy, Brox & Riedmiller, 2015, “Striving for Simplicity: The All Convolutional Net”, ICLR workshop, which argues strided convolutions can fully replace max pooling in classification networks.

Python & PyTorch — stride in action

Stride, from the loop up
🐍stride_demo.py
import numpy as np

NumPy for our hand-rolled version.

import torch

PyTorch so we can cross-check against F.conv2d at the end.

import torch.nn.functional as F

Functional conv API.

def conv2d_stride(image, kernel, stride=1):

Same conv as Section 2, but with an extra `stride` parameter. `stride` is the number of pixels the kernel moves each step. stride=1 visits every position; stride=2 skips every other, halving the output resolution in both axes.

EXECUTION STATE
⬇ stride (default 1) = Integer step size. stride=1 → heavily overlapping windows. stride=2 → windows of a 3×3 kernel still overlap by one row/column; windows stop overlapping only when stride ≥ kernel size. Larger stride ⇒ smaller output, less compute.
H, W = image.shape

Unpack input spatial dims. Here H = W = 5.

kH, kW = kernel.shape

Unpack kernel dims. Here kH = kW = 3.

out_H = (H - kH) // stride + 1

Output size formula, stride-aware. Integer-divide the usable span (H − kH) by the stride, then add 1 for the very first position. The // operator is floor division — important when the division is not exact.

EXECUTION STATE
📚 a // b = Python floor division — drops the fractional part. Equivalent to math.floor(a / b) for non-negative a, b.
stride=1 = (5 − 3) // 1 + 1 = 2 + 1 = 3
stride=2 = (5 − 3) // 2 + 1 = 1 + 1 = 2
out_W = (W - kW) // stride + 1

Same derivation along the width axis.

output = np.zeros((out_H, out_W))

Pre-allocate the output matrix with zeros.

for i in range(out_H):

Iterate over OUTPUT rows. i indexes into the OUTPUT, not the image; we derive the image row from h0 = i*stride.

for j in range(out_W):

Iterate over OUTPUT columns.

h0 = i * stride

Translate the output row index into the IMAGE row. With stride=2, output row 1 corresponds to image row 2 (not 1). This single multiplication is the only difference between stride=1 and stride=k convolution.

LOOP TRACE · 2 iterations
stride=2, i=0
h0 = 0
stride=2, i=1
h0 = 2
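w0 = j * stride

Same for the column axis.

window = image[h0 : h0 + kH, w0 : w0 + kW]

Slice the kH×kW patch starting at (h0, w0). Because h0 and w0 advance by `stride`, neighbouring windows overlap by kH − stride rows (kW − stride columns); they stop overlapping only once stride ≥ kernel size. With stride=2 and a 3×3 kernel, adjacent windows still share one row or column, as the trace shows.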

LOOP TRACE · 4 iterations
stride=2, (0,0)
window = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]
stride=2, (0,1)
window = [[30, 40, 50], [60, 80,100], [90,120,150]]
stride=2, (1,0)
window = [[30, 60, 90], [40, 80,120], [50,100,150]]
stride=2, (1,1)
window = [[90,120,150], [120,160,200], [150,200,250]]
output[i, j] = np.sum(window * kernel)

Standard multiply-and-sum. Values are identical to Section 2 for matching (h0, w0), because stride only changes WHICH windows we evaluate — not the formula within each window.

LOOP TRACE · 4 iterations
stride=2, (0,0)
sum = 160.0
stride=2, (0,1)
sum = 160.0
stride=2, (1,0)
sum = 320.0
stride=2, (1,1)
sum = 320.0
return output

Return the stride-aware output.

out_s1 = conv2d_stride(I, K, stride=1)

Stride=1 call — recovers exactly the Section 2 result.

EXECUTION STATE
⬆ out_s1.shape = (3, 3)
⬆ out_s1 =
[[160. 160. 160.]
 [240. 240. 240.]
 [320. 320. 320.]]
out_s2 = conv2d_stride(I, K, stride=2)

Stride=2 call: roughly half the rows and half the columns. Here the output goes from 3×3 to 2×2, i.e. from 9 elements to 4.

EXECUTION STATE
⬆ out_s2.shape = (2, 2)
⬆ out_s2 =
[[160. 160.]
 [320. 320.]]
print(out_s1.shape, out_s1)

Print stride=1 result for comparison.

print(out_s2.shape, out_s2)

Print stride=2 result.

I_t = torch.tensor(I).unsqueeze(0).unsqueeze(0)

Promote the NumPy image to a 4-D PyTorch tensor of shape (1, 1, 5, 5), the (N, C, H, W) layout F.conv2d expects.

K_t = torch.tensor(K).unsqueeze(0).unsqueeze(0)

Same for the kernel → (1, 1, 3, 3).

F.conv2d(I_t, K_t, stride=2).squeeze()

PyTorch call. Identical numerical result to our loop version, confirming our understanding of the stride parameter.

EXECUTION STATE
⬆ torch result =
tensor([[160., 160.],
        [320., 320.]])
import numpy as np
import torch
import torch.nn.functional as F

# Same 5×5 gradient image + Sobel-X kernel as Section 2
I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250],
], dtype=float)

K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def conv2d_stride(image, kernel, stride=1):
    H, W = image.shape
    kH, kW = kernel.shape
    # Output size depends on stride: O = (I - K) // S + 1
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            h0 = i * stride                       # top-left row in the IMAGE
            w0 = j * stride                       # top-left col in the IMAGE
            window = image[h0 : h0 + kH, w0 : w0 + kW]
            output[i, j] = np.sum(window * kernel)
    return output

out_s1 = conv2d_stride(I, K, stride=1)
out_s2 = conv2d_stride(I, K, stride=2)

print("stride=1 output shape:", out_s1.shape, "\n", out_s1)
print("stride=2 output shape:", out_s2.shape, "\n", out_s2)

# PyTorch check: identical numbers
I_t = torch.tensor(I).unsqueeze(0).unsqueeze(0)
K_t = torch.tensor(K).unsqueeze(0).unsqueeze(0)
print("torch stride=2:\n", F.conv2d(I_t, K_t, stride=2).squeeze())

# Expected:
# stride=1 output: (3, 3)  → [[160,160,160],[240,240,240],[320,320,320]]
# stride=2 output: (2, 2)  → [[160,160],[320,320]]
| stride | Output shape | Output values |
|---|---|---|
| 1 | (3, 3) | [[160,160,160],[240,240,240],[320,320,320]] |
| 2 | (2, 2) | [[160,160],[320,320]] |
| 3 | (1, 1) | [[160]] |

When to reach for stride > 1

  • You need to down-sample while simultaneously computing features — a strided conv replaces a conv + pool pair.
  • You want a parameterised down-sampler: unlike pooling, the strided conv still has learnable weights.
  • You are building a generator and need the reverse: fractional/transposed strided conv to upsample.
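
The three cases above can be sketched as a shape comparison. This is only an illustration: the 32×32 input, the channel counts, and the variable names are arbitrary choices for this example, not values from the chapter.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one RGB image, 32×32 (arbitrary example size)

# Option 1: conv at stride 1, then a separate 2×2 max pool to down-sample.
conv_then_pool = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2),
)

# Option 2: a single stride-2 conv computes features and down-samples at once,
# and its down-sampling is governed by learnable weights.
strided_conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

# The reverse direction: a stride-2 transposed conv doubles the resolution.
upsample = nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1)

y_pool = conv_then_pool(x)
y_strided = strided_conv(x)
x_up = upsample(y_strided)

print(y_pool.shape)     # torch.Size([1, 16, 16, 16])
print(y_strided.shape)  # torch.Size([1, 16, 16, 16])
print(x_up.shape)       # torch.Size([1, 3, 32, 32])
```

Both down-sampling routes produce the same output shape; they differ in what is learned and in how much compute is spent at full resolution.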

Quick Check

Input is 28×28, kernel is 5×5, stride is 2, no padding. What is the output size?


Padding — Taming the Boundary

Without padding, every stride-1 conv layer shrinks the spatial size by $K - 1$. Stack 10 layers with 3×3 kernels and you lose 20 pixels in each dimension. Padding fixes this by adding extra rows and columns around the input, so the kernel has somewhere to land at the boundary.

The four common padding modes

| Mode | What it does | Use case |
|---|---|---|
| constant (zero) | Pad with a constant (usually 0) | Default in nn.Conv2d; simple, fast, slightly biases edges toward zero |
| replicate | Repeat the nearest edge pixel | Smooth boundary behaviour; common in image restoration |
| reflect | Mirror the input without duplicating the edge | Even smoother; default in many classical image-processing libraries |
| circular | Wrap around (right ↔ left, top ↔ bottom) | Periodic data: spherical imagery, torus simulations |

Terminology: ‘valid’, ‘same’, ‘full’

Higher-level libraries often use named padding shortcuts:

  • valid: no padding. Output shrinks by $K - 1$.
  • same: pad so output spatial size equals input (when stride=1). For odd kernel sizes: $P = (K - 1)/2$.
  • full: pad $K - 1$ on each side. Output is larger than input. Rare in deep learning; common in signal processing.

The same-size recipe

To keep the output the same spatial size as the input with stride=1, use $P = (K - 1)/2$: $K = 3 \rightarrow P = 1$; $K = 5 \rightarrow P = 2$; $K = 7 \rightarrow P = 3$. This is why almost every 3×3 conv in modern CNNs ships with padding=1.
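
A quick way to see the three named modes is to compare output shapes on a small input. The sketch below assumes a PyTorch version in which nn.Conv2d accepts the strings "valid" and "same" for padding; the 8×8 input and the layer names are illustrative choices. PyTorch has no "full" keyword, so that case is emulated by setting the padding to K − 1 by hand.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # arbitrary 8×8 single-channel input
K = 3

valid = nn.Conv2d(1, 1, kernel_size=K, padding="valid")  # P = 0
same = nn.Conv2d(1, 1, kernel_size=K, padding="same")    # P = (K - 1) / 2 at stride 1
full = nn.Conv2d(1, 1, kernel_size=K, padding=K - 1)     # "full" emulated with P = K - 1

print(valid(x).shape)  # torch.Size([1, 1, 6, 6])
print(same(x).shape)   # torch.Size([1, 1, 8, 8])
print(full(x).shape)   # torch.Size([1, 1, 10, 10])
```

Note that padding="same" is only accepted for stride 1 in PyTorch; for strided layers the padding has to be given as a number.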

F.pad — all four modes on the same tensor

Padding modes — side-by-side
🐍padding_modes.py
import torch

PyTorch tensors.

import torch.nn.functional as F

F.pad and F.conv2d live here.

x = torch.tensor([...]).unsqueeze(0).unsqueeze(0)

A 3×3 matrix reshaped to 4-D (batch, channel, height, width) so that F.pad, which takes two pad widths per padded axis, can pad the last two dims.

EXECUTION STATE
⬆ x.shape = torch.Size([1, 1, 3, 3])
→ values =
[[1 2 3]
 [4 5 6]
 [7 8 9]]
p = (1, 1, 1, 1)

The padding spec for F.pad reads from the LAST dim toward the first, two numbers per axis: (left, right) for the last axis, then (top, bottom) for the next-to-last. Here we pad 1 pixel on every side.

EXECUTION STATE
p = (1, 1, 1, 1) = (left=1, right=1, top=1, bottom=1). Result after padding: 5×5 instead of 3×3.
zeros = F.pad(x, p, mode="constant", value=0)

Zero padding — the default in nn.Conv2d. Adds rows/columns of zeros around the input. Cheap and simple but biases edges toward zero.

EXECUTION STATE
⬇ mode="constant" = Pad with a constant value.
⬇ value=0 = The constant used when mode="constant". Set to any number — zero is standard.
⬆ result =
[[0 0 0 0 0]
 [0 1 2 3 0]
 [0 4 5 6 0]
 [0 7 8 9 0]
 [0 0 0 0 0]]
replicate = F.pad(x, p, mode="replicate")

Replicate mode — repeat the nearest edge pixel. Useful when a zero border would introduce a bright/dark boundary artifact.

EXECUTION STATE
⬇ mode="replicate" = Extend the input by repeating the boundary row/column.
⬆ result =
[[1 1 2 3 3]
 [1 1 2 3 3]
 [4 4 5 6 6]
 [7 7 8 9 9]
 [7 7 8 9 9]]
reflect = F.pad(x, p, mode="reflect")

Reflect mode — mirror the input WITHOUT duplicating the edge. Produces smoother boundaries than replicate and is the default in many image-processing libraries.

EXECUTION STATE
⬇ mode="reflect" = Mirror around the edge. For x=[a, b, c], 1-pixel reflect pad → [b, a, b, c, b]. The edge element (a) is NOT duplicated.
⬆ result =
[[5 4 5 6 5]
 [2 1 2 3 2]
 [5 4 5 6 5]
 [8 7 8 9 8]
 [5 4 5 6 5]]
circular = F.pad(x, p, mode="circular")

Circular / wrap-around padding. Appropriate for data with intrinsic periodicity (e.g., spherical imagery, toroidal simulations).

EXECUTION STATE
⬇ mode="circular" = Wrap around: the left neighbour of column 0 is column W−1.
⬆ result =
[[9 7 8 9 7]
 [3 1 2 3 1]
 [6 4 5 6 4]
 [9 7 8 9 7]
 [3 1 2 3 1]]
print('constant:', zeros.squeeze())

Dump the zero-padded result to stdout.

print('replicate:', replicate.squeeze())

Dump replicate-padded result.

print('reflect:', reflect.squeeze())

Dump reflect-padded result.

print('circular:', circular.squeeze())

Dump circular-padded result.

import torch
import torch.nn.functional as F

# A tiny 3×3 matrix so padding is easy to see
x = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
]).unsqueeze(0).unsqueeze(0)   # (1, 1, 3, 3) for F.pad

# pad = (left, right, top, bottom), 1 pixel each side
p = (1, 1, 1, 1)

zeros     = F.pad(x, p, mode="constant", value=0)   # zero-padding
replicate = F.pad(x, p, mode="replicate")            # repeat edge pixel
reflect   = F.pad(x, p, mode="reflect")              # mirror without duplicating edge
circular  = F.pad(x, p, mode="circular")             # wrap-around

print("constant (zero):\n", zeros.squeeze())
print("replicate:\n",      replicate.squeeze())
print("reflect:\n",        reflect.squeeze())
print("circular:\n",       circular.squeeze())
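
The same modes can be requested directly on a conv layer through the padding_mode argument of nn.Conv2d. The sketch below checks, for the reflect case, that padding inside the layer matches padding by hand with F.pad followed by an unpadded convolution. The random 6×6 input, the seed, and the variable names are assumptions made for this example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6)  # arbitrary test input

# Layer that reflect-pads internally by 1 pixel on every side.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="reflect", bias=False)
y_layer = conv(x)

# Manual route: reflect-pad first, then convolve with no further padding.
x_padded = F.pad(x, (1, 1, 1, 1), mode="reflect")
y_manual = F.conv2d(x_padded, conv.weight)

print(y_layer.shape)                                   # torch.Size([1, 1, 6, 6])
print(torch.allclose(y_layer, y_manual, atol=1e-6))    # True
```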

The Output-Size Formula

Combining stride and padding, the full output-size formula becomes:

$$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$$

This is the single most important piece of arithmetic in CNN design.

Reference

Dumoulin & Visin, 2016, “A guide to convolution arithmetic for deep learning”, arXiv:1603.07285. A concise and highly recommended reference covering all the output-size identities including transposed and dilated convolutions.

| Scenario | Settings | Calculation | Output |
|---|---|---|---|
| Same-size 3×3 | I=224, K=3, P=1, S=1 | (224−3+2)/1 + 1 | 224 |
| Same-size 5×5 | I=224, K=5, P=2, S=1 | (224−5+4)/1 + 1 | 224 |
| Halve via stride | I=224, K=3, P=1, S=2 | ⌊(224−3+2)/2⌋ + 1 | 112 |
| No padding | I=224, K=3, P=0, S=1 | (224−3)/1 + 1 | 222 |
| Classic VGG block | I=224, K=3, P=1, S=1 then 2×2 maxpool | 224 → 224 → 112 | 112 |
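
The scenarios above can be re-checked with a few lines of plain Python. The helper name conv_output_size is a hypothetical name chosen for this sketch; it simply encodes the output-size formula along one axis.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one axis: O = floor((I - K + 2P) / S) + 1."""
    return (i - k + 2 * p) // s + 1

same_3x3 = conv_output_size(224, 3, p=1, s=1)
same_5x5 = conv_output_size(224, 5, p=2, s=1)
halved = conv_output_size(224, 3, p=1, s=2)
no_pad = conv_output_size(224, 3, p=0, s=1)

# A 2×2 max pool with stride 2 obeys the same formula with K=2, P=0, S=2.
vgg_block = conv_output_size(conv_output_size(224, 3, p=1, s=1), 2, p=0, s=2)

print(same_3x3, same_5x5, halved, no_pad, vgg_block)  # 224 224 112 222 112
```

Chaining calls like this is a cheap sanity check on a layer stack before building it in a framework.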

Quick Check

Input 64, kernel 5, padding 2, stride 2. Output?


Pooling — Down-sampling Without Parameters

Stride-2 convolution is one way to halve the spatial resolution; pooling is the other. A pooling layer slides a window across the input and reduces each window to a single number, typically the max or the mean, with zero learnable parameters. Keeping the down-sampler (almost) parameter-free is deliberate: LeNet-5 already used sub-sampling layers (a local average scaled by one trainable coefficient and bias per feature map) to make the representation less sensitive to small shifts while adding almost no capacity.

Reference

LeCun, Bottou, Bengio & Haffner, 1998, “Gradient-Based Learning Applied to Document Recognition”, Proc. IEEE 86(11). Scherer, Müller & Behnke, 2010, “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition”, ICANN, compares max vs. average pooling empirically and finds max wins for classification.

Max vs. average pool

| Variant | Reduction | What it keeps | Typical use |
|---|---|---|---|
| Max pool | max of the window | the strongest activation only | default in most classical CNNs (VGG, AlexNet, ResNet's first block) |
| Average pool | mean of the window | an average signal | smoother, less noisy; preferred when you do NOT want to throw information away |
| Global average pool | mean over the WHOLE feature map | one scalar per channel | replaces large FC classifier heads (GoogLeNet, ResNet, most modern CNN classifiers) |
| Adaptive pool | output size fixed; window size computed | user-specified output shape | useful when input sizes vary |
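
The last two rows can be illustrated with PyTorch's adaptive pooling functions. This is a minimal sketch: the tensor sizes and variable names are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F

x_small = torch.randn(2, 8, 7, 7)     # batch of 2, 8 channels, 7×7 maps
x_large = torch.randn(2, 8, 13, 10)   # same channels, different spatial size

# Global average pool = adaptive average pool with a 1×1 target.
gap = F.adaptive_avg_pool2d(x_small, output_size=1)          # (2, 8, 1, 1)
matches_mean = torch.allclose(gap.flatten(1), x_small.mean(dim=(2, 3)), atol=1e-6)

# Adaptive pooling fixes the OUTPUT size; the window is derived from the input size.
a = F.adaptive_max_pool2d(x_small, output_size=(4, 4))
b = F.adaptive_max_pool2d(x_large, output_size=(4, 4))

print(gap.shape, matches_mean)   # torch.Size([2, 8, 1, 1]) True
print(a.shape, b.shape)          # both torch.Size([2, 8, 4, 4])
```

Because the output shape no longer depends on the input resolution, a classifier head placed after such a layer can accept images of varying size.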

Why max pool 'works'

Max pool is a cheap proxy for translation invariance over SMALL shifts: moving a bright feature by one pixel within the pooling window does not change the output. It also acts as a crude form of non-linear feature selection — only the strongest response in each region survives.
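
The small-shift claim can be checked on a toy feature map with a single active pixel. The helper one_hot_map is a hypothetical name introduced for this sketch. The third case shows the limit of the invariance: a one-pixel shift that crosses a window boundary does change the output.

```python
import torch
import torch.nn.functional as F

def one_hot_map(row, col, size=4):
    """A size×size feature map that is zero except for a single 1 at (row, col)."""
    m = torch.zeros(1, 1, size, size)
    m[0, 0, row, col] = 1.0
    return m

a = F.max_pool2d(one_hot_map(0, 0), kernel_size=2)  # feature in the top-left window
b = F.max_pool2d(one_hot_map(1, 1), kernel_size=2)  # shifted, still the same window
c = F.max_pool2d(one_hot_map(1, 2), kernel_size=2)  # shifted across a window boundary

print(torch.equal(a, b))  # True: the shift stayed inside one pooling window
print(torch.equal(a, c))  # False: the feature moved into the neighbouring window
```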

Interactive pooling visualiser

The visualiser below lets you toggle between max and average pooling on a 4×4 or 6×6 feature map. Step through each window and confirm the arithmetic matches your mental model.

Max Pooling Visualizer (interactive widget)

Input feature map (4×4):
[[1 3 2 4]
 [5 6 1 2]
 [7 2 3 1]
 [4 8 5 6]]

2×2 pool, stride 2. Step 1/4: window [0:2, 0:2] = [1, 3, 5, 6], max(1, 3, 5, 6) = 6, so the top-left entry of the 2×2 output is 6.

Max pooling keeps the strongest activation in each window. It preserves what was detected while discarding where exactly it appeared, which gives the network a degree of translation invariance.

Pooling by hand — Python and PyTorch

Max and average pooling — hand-rolled then PyTorch
🐍pooling_demo.py
import numpy as np

NumPy for the hand-rolled pool.

import torch

Tensors for the PyTorch comparison.

import torch.nn.functional as F

F.max_pool2d and F.avg_pool2d live here.

X = np.array([...], dtype=float)

A 4×4 feature map with contrived values. These are the same numbers shown in the interactive Pooling Visualizer above.

EXECUTION STATE
X (4×4) =
[[1. 3. 2. 4.]
 [5. 6. 1. 2.]
 [7. 2. 3. 1.]
 [4. 8. 5. 6.]]
def max_pool_2x2(x) -> np.ndarray:

2×2 non-overlapping MAX pooling with stride=2 — the single most common pooling operator in classical CNNs. Each 2×2 tile is reduced to its maximum.

EXECUTION STATE
⬇ input: x (2-D) = A feature map. Height and width assumed even.
⬆ returns = Down-sampled feature map of shape (H/2, W/2). Half the linear resolution, a quarter of the elements.
H, W = x.shape

Unpack spatial dims. Here H = W = 4.

out = np.zeros((H // 2, W // 2))

Pre-allocate the output: shape (2, 2) for our 4×4 input.

for i in range(H // 2):

Iterate over output rows (0, 1).

for j in range(W // 2):

Iterate over output columns (0, 1).

window = x[2*i : 2*i+2, 2*j : 2*j+2]

Slice a 2×2 non-overlapping tile. The multiplier 2 is the stride; it equals the kernel size so tiles tile the image with no overlap.

LOOP TRACE · 4 iterations
(0,0)
window = [[1 3] [5 6]]
(0,1)
window = [[2 4] [1 2]]
(1,0)
window = [[7 2] [4 8]]
(1,1)
window = [[3 1] [5 6]]
out[i, j] = window.max()

Reduce the 4 values in the window to their MAXIMUM. Max pooling is invariant to small shifts in a limited sense: as long as the strongest activation stays inside the same window, moving it does not change the maximum value that is written to the output.

LOOP TRACE · 4 iterations
(0,0)
max([1,3,5,6]) = 6
(0,1)
max([2,4,1,2]) = 4
(1,0)
max([7,2,4,8]) = 8
(1,1)
max([3,1,5,6]) = 6
return out

Return the 2×2 pooled output.

EXECUTION STATE
⬆ max-pool result =
[[6. 4.]
 [8. 6.]]
def avg_pool_2x2(x) -> np.ndarray:

Same scaffolding, but we average instead of taking the max. Every input contributes to the output, so less information is discarded and the result is smoother, but a single strong activation is diluted by its weaker neighbours.

H, W = x.shape

Shape unpack.

out = np.zeros((H // 2, W // 2))

Pre-allocate.

for i in range(H // 2):

Iterate output rows.

for j in range(W // 2):

Iterate output columns.

window = x[2*i:2*i+2, 2*j:2*j+2]

Same 2×2 non-overlapping slicing.

out[i, j] = window.mean()

Reduce to the arithmetic mean.

LOOP TRACE · 4 iterations
(0,0)
mean([1,3,5,6]) = 15/4 = 3.75
(0,1)
mean([2,4,1,2]) = 9/4 = 2.25
(1,0)
mean([7,2,4,8]) = 21/4 = 5.25
(1,1)
mean([3,1,5,6]) = 15/4 = 3.75
return out

Return the averaged result.

EXECUTION STATE
⬆ avg-pool result =
[[3.75 2.25]
 [5.25 3.75]]
print('max:', ...)

Print max-pool output.

print('avg:', ...)

Print avg-pool output.

X_t = torch.tensor(X).unsqueeze(0).unsqueeze(0)

Promote to 4-D tensor (1, 1, 4, 4) for the PyTorch calls.

F.max_pool2d(X_t, kernel_size=2, stride=2)

PyTorch max-pool. ZERO learnable parameters: pooling has no weights and no bias. The two hyperparameters that matter here are kernel_size and stride; padding and dilation are also available but left at their defaults.

EXECUTION STATE
📚 F.max_pool2d(input, kernel_size, stride, padding, dilation, ...) = Applies max-pool over a sliding window.
⬇ kernel_size=2 = 2×2 window (kernel_size=(2,2) for non-square).
⬇ stride=2 = Non-overlapping tiles. If omitted, stride defaults to kernel_size (non-overlap) for pooling — NOT 1 like conv.
⬆ result (squeezed) =
[[6., 4.],
 [8., 6.]]
F.avg_pool2d(X_t, kernel_size=2, stride=2)

PyTorch average-pool. Same shape behaviour; different reduction.

EXECUTION STATE
⬆ result (squeezed) =
[[3.75, 2.25],
 [5.25, 3.75]]
import numpy as np
import torch
import torch.nn.functional as F

# A 4×4 feature map (pretend it came out of a conv layer)
X = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
], dtype=float)

def max_pool_2x2(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i : 2*i+2, 2*j : 2*j+2]  # 2×2 non-overlapping tile
            out[i, j] = window.max()
    return out

def avg_pool_2x2(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i : 2*i+2, 2*j : 2*j+2]
            out[i, j] = window.mean()
    return out

print("max  :", max_pool_2x2(X).tolist())
print("avg  :", avg_pool_2x2(X).tolist())

# PyTorch equivalent: zero learnable parameters
X_t = torch.tensor(X).unsqueeze(0).unsqueeze(0)
print("torch max:", F.max_pool2d(X_t, kernel_size=2, stride=2).squeeze().tolist())
print("torch avg:", F.avg_pool2d(X_t, kernel_size=2, stride=2).squeeze().tolist())

# Expected:
# max: [[6, 4], [8, 6]]
# avg: [[3.75, 2.25], [5.25, 3.75]]

Stride vs. pool: a modern debate

A much-cited 2015 paper argued that max-pooling can be entirely replaced by strided convolutions without loss of accuracy on CIFAR and ImageNet-style tasks. Most modern architectures still include pooling somewhere (often a single global-average-pool before the classifier), but many intermediate down-sampling steps are now strided convs.

Reference

Springenberg, Dosovitskiy, Brox & Riedmiller, 2015, “Striving for Simplicity: The All Convolutional Net”, ICLR workshop.

Receptive Field — Why Depth Matters

Each output unit of a conv layer depends on only a small input neighbourhood. The receptive field of a unit is the set of input pixels that can affect its value. Stacking conv layers grows the receptive field layer by layer — which is the fundamental reason CNN depth helps.

How receptive field grows

For a stack of conv layers with kernel size $K_\ell$ and stride $S_\ell$ at layer $\ell$, the receptive field after $L$ layers is:

$$R_L = 1 + \sum_{\ell=1}^{L} (K_\ell - 1) \prod_{i=1}^{\ell-1} S_i$$

where $S_i$ is the stride of layer $i$ (the empty product for $\ell = 1$ equals 1). For an all-stride-1 stack of 3×3 convs the product collapses to 1 and the formula simplifies to $R_L = 1 + 2L$.

| After layer | Receptive field (3×3 stride-1 stack) |
|---|---|
| 1 | 3×3 |
| 2 | 5×5 |
| 3 | 7×7 |
| 4 | 9×9 |
| 5 | 11×11 |
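
The receptive-field formula can be evaluated with a short loop that keeps a running product of strides. The function name receptive_field and the (kernel_size, stride) list format are choices made for this sketch.

```python
def receptive_field(layers):
    """Receptive field of the last layer.

    layers: list of (kernel_size, stride) pairs, ordered from first to last layer.
    """
    rf, jump = 1, 1           # jump = product of the strides of all earlier layers
    for k, s in layers:
        rf += (k - 1) * jump  # each layer adds (K - 1) steps of the current jump
        jump *= s
    return rf

# All-stride-1 stack of 3×3 convs: R_L = 1 + 2L
print([receptive_field([(3, 1)] * n) for n in range(1, 6)])  # [3, 5, 7, 9, 11]

# Strides compound: two stride-2 layers followed by a stride-1 layer.
print(receptive_field([(3, 2), (3, 2), (3, 1)]))  # 1 + 2*1 + 2*2 + 2*4 = 15
```

The second call shows why early down-sampling grows the receptive field much faster than depth alone.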
Why VGG uses stacks of 3×3: Two 3×3 convs give the same receptive field as one 5×5 conv, but with fewer parameters ($2 \cdot 9 = 18$ vs $25$ weights per input-output channel pair) and one extra non-linearity in between. Three 3×3 convs match a 7×7 receptive field with $3 \cdot 9 = 27$ weights instead of $49$, roughly 45% fewer.

Reference

Simonyan & Zisserman, 2015, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR. This is one of the central design lessons of modern CNNs.
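
The parameter comparison can be reproduced by counting weights in PyTorch. This is a sketch under stated assumptions: C = 64 channels in and out, no bias terms, and the non-linearities between layers omitted because they carry no parameters. The helpers n_params and stack_3x3 are hypothetical names for this example.

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

C = 64  # assumed channel count, kept constant through the stack

def stack_3x3(n):
    """n stacked 3×3 convs, C -> C channels, bias omitted for a clean count."""
    return nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(n)])

two_3x3 = n_params(stack_3x3(2))                                  # 2 * 9 * C * C
one_5x5 = n_params(nn.Conv2d(C, C, 5, padding=2, bias=False))     # 25 * C * C
three_3x3 = n_params(stack_3x3(3))                                # 27 * C * C
one_7x7 = n_params(nn.Conv2d(C, C, 7, padding=3, bias=False))     # 49 * C * C

print(two_3x3, one_5x5)      # 73728 102400
print(three_3x3, one_7x7)    # 110592 200704
print(round(three_3x3 / one_7x7, 2))  # 0.55
```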

Interactive: walk through layers, watch the field grow

Receptive Field Growth (interactive widget)

See how the receptive field expands with each convolutional layer. Initial state: input image 7×7, output size 7×7, receptive field 1×1, RF coverage 2.0% of the input.

Receptive field recurrence: RF_n = RF_{n-1} + (k − 1) × j_{n-1}, where j_{n-1} is the product of the strides of all earlier layers (j = 1 for an all-stride-1 stack).

With 3×3 kernels and stride 1, the RF grows by 2 pixels per layer (1 → 3 → 5 → 7).

Effective receptive field

The formula above gives the theoretical receptive field: the set of input pixels that could influence the output. The effective receptive field (the set that actually does, in a trained network) is usually much smaller and roughly Gaussian-shaped.

Reference

Luo, Li, Urtasun & Zemel, 2016, “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks”, NeurIPS.
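
One way to build intuition for the centre-heavy shape is to backpropagate from a single output unit and look at the gradient on the input. The toy below is not the experiment from the cited paper: it assumes five 3×3 convs with uniform averaging weights, no non-linearities, and a 21×21 input, all chosen only to keep the example small.

```python
import torch
import torch.nn as nn

L = 5  # stacked 3×3 stride-1 convs; theoretical receptive field = 1 + 2L = 11
layers = []
for _ in range(L):
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
    nn.init.constant_(conv.weight, 1.0 / 9.0)  # assumption: uniform averaging weights
    layers.append(conv)
net = nn.Sequential(*layers)

x = torch.zeros(1, 1, 21, 21, requires_grad=True)
y = net(x)
y[0, 0, 10, 10].backward()   # gradient of the centre output w.r.t. every input pixel

g = x.grad[0, 0]
support = int((g > 1e-7).sum())   # small tolerance instead of an exact zero test
centre, corner = g[10, 10].item(), g[5, 5].item()

print(support)           # 121 = 11 × 11, the theoretical receptive field
print(centre > corner)   # True: the centre pixel has far more influence
```

Every pixel inside the 11×11 window has some influence, but the weight falls off quickly away from the centre, which is the qualitative picture behind the effective receptive field.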

PyTorch nn.Conv2d Anatomy

Every hyperparameter you have met so far maps directly to an argument of nn.Conv2d:

🐍conv2d_anatomy.py
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=64,   # 64 filters → 64 output feature maps
    kernel_size=3,     # 3×3 spatial window
    stride=1,          # move 1 pixel per step
    padding=1,         # "same" padding for K=3
    dilation=1,        # 1 = ordinary conv; >1 = atrous/dilated
    groups=1,          # 1 = full multi-channel; C_in = depthwise
    bias=True,         # learnable bias per output channel
)

print(conv.weight.shape)   # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)     # torch.Size([64])

x = torch.randn(8, 3, 224, 224)   # 8 RGB images, 224×224
y = conv(x)
print(y.shape)                    # torch.Size([8, 64, 224, 224]), same spatial size
| Argument | Meaning | Typical value |
|---|---|---|
| in_channels | C_in of the input tensor | 3 (RGB) or whatever the previous layer produced |
| out_channels | number of learnable kernels (= number of output feature maps) | 32, 64, 128, 256, … |
| kernel_size | spatial size of each kernel; int or (h, w) tuple | 3 is the modern default |
| stride | pixels moved per step; int or tuple | 1 normally; 2 for down-sampling |
| padding | zeros added around the input; int, tuple, or "same" | (K−1)/2 for same-size |
| dilation | spacing between kernel elements; enlarges the receptive field without more params | 1 (normal); 2, 4 for dilated / atrous conv |
| groups | splits channels into independent groups; groups=C_in gives depthwise conv | 1 (default); C_in for depthwise-separable |
| bias | add a learnable per-channel bias | True, unless followed by BatchNorm, which has its own shift |

Dilated and depthwise conv — a brief preview

Dilated / atrous convolutions (dilation>1) leave holes between kernel samples, enlarging the receptive field without shrinking the output or adding parameters, which is crucial for semantic segmentation.

Reference

Yu & Koltun, 2016, “Multi-Scale Context Aggregation by Dilated Convolutions”, ICLR.

Depthwise-separable convolutions (groups=C_in followed by a 1×1 conv) dramatically cut FLOPs and are the basis of MobileNet and Xception.

Reference

Chollet, 2017, “Xception: Deep Learning with Depthwise Separable Convolutions”, CVPR; Howard et al., 2017, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv:1704.04861.

We meet both in Chapter 14.
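
Both ideas map onto the dilation and groups arguments of nn.Conv2d. The sketch below compares output shapes and weight counts only; the channel counts (32 in, 64 out), the 56×56 input, and the helper n_params are assumptions made for this example.

```python
import torch
import torch.nn as nn

def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

x = torch.randn(1, 32, 56, 56)

# Dilated 3×3 conv: taps sit 2 pixels apart, so the kernel spans 5×5
# while still holding only 9 weights per channel pair. padding=2 keeps the size.
dilated = nn.Conv2d(32, 32, kernel_size=3, dilation=2, padding=2, bias=False)

# Standard 3×3 conv versus a depthwise-separable replacement (32 -> 64 channels).
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32, bias=False),  # depthwise
    nn.Conv2d(32, 64, kernel_size=1, bias=False),                        # pointwise 1×1
)

print(dilated(x).shape)                          # torch.Size([1, 32, 56, 56])
print(standard(x).shape, separable(x).shape)     # both torch.Size([1, 64, 56, 56])
print(n_params(standard), n_params(separable))   # 18432 2336
```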

Manual Implementation (For Reference)

Putting every piece together — stride, padding, multi-channel, bias — into one explicit function. Reading this code is the fastest way to convince yourself that nn.Conv2d holds no mysteries.

Full conv from scratch — verified against nn.Conv2d
🐍conv2d_full_manual.py
def conv2d_manual(x, w, bias=None, stride=1, padding=0):

Full multi-sample, multi-channel 2-D convolution in explicit Python. 4 nested loops means O(N · C_out · H · W · C_in · kH · kW). Educational; not production.

EXECUTION STATE
⬇ x = Input tensor of shape (N, C_in, H, W).
⬇ w = Weight tensor of shape (C_out, C_in, kH, kW).
⬇ bias = Optional (C_out,) tensor. Added to every output position of the corresponding output channel.
⬇ stride, padding = Same meaning as in F.conv2d.
N, C_in, H, W = x.shape

Unpack the 4-D input shape.

C_out, _, kH, kW = w.shape

Unpack the 4-D weight shape. The second axis is discarded with _ because it must equal C_in from x, so we do not need to name it again.

if padding > 0:

Apply zero padding before any window extraction. Padding increases H and W by 2·padding each (padding on both sides).

x = F.pad(x, (padding,) * 4)

Short-hand for (padding, padding, padding, padding) — equal padding on left, right, top, bottom of the last two axes.

EXECUTION STATE
→ (padding,) * 4 = Python tuple-repeat. For padding=1 → (1, 1, 1, 1).
H += 2 * padding

Update the effective height to include the pad rows.

W += 2 * padding

Same for width.

out_H = (H - kH) // stride + 1

The padding-aware output-size formula. We apply it to the ALREADY-padded H, which is why 2·padding appears here implicitly.

out_W = (W - kW) // stride + 1

Width version.

out = torch.zeros(N, C_out, out_H, out_W)

Pre-allocate the 4-D output tensor.

for n in range(N):

Loop over every sample in the batch. The samples are independent, so this loop could be trivially parallelised.

for c in range(C_out):

Loop over every output channel. Each uses its own 3-D kernel w[c].

for i in range(out_H):

Loop over output rows.

for j in range(out_W):

Loop over output columns.

h0, w0 = i * stride, j * stride

Translate output coords to image coords (stride-aware).

window = x[n, :, h0:h0+kH, w0:w0+kW]

Extract the (C_in, kH, kW) window across ALL input channels. The `:` in the second axis is the crucial bit — it grabs every input channel at once.

EXECUTION STATE
⬆ window.shape = (C_in, kH, kW) — e.g. (3, 3, 3) for RGB and 3×3.
out[n, c, i, j] = (window * w[c]).sum()

Element-wise multiply the 3-D window with ONE 3-D kernel w[c] (shape matches: both (C_in, kH, kW)), then sum the entire resulting tensor. One dot product → one output scalar.

EXECUTION STATE
w[c] = The c-th output channel's 3-D kernel — shape (C_in, kH, kW).
(window * w[c]).sum() = Sum of C_in · kH · kW products — e.g. 27 for RGB with 3×3.
if bias is not None:

Bias is optional; when it is None we skip the addition.

out += bias.view(1, -1, 1, 1)

Broadcast the (C_out,) bias over the (N, C_out, H, W) output by reshaping it to (1, C_out, 1, 1). Each output channel gets its own constant added to every spatial position.

EXECUTION STATE
📚 .view(1, -1, 1, 1) = Reshape so broadcasting works. -1 means 'infer this dim from the total element count' — here it resolves to C_out.
return out

Return the 4-D output tensor.

x = torch.randn(2, 3, 8, 8)

Test input: 2 samples, 3 channels, 8×8 spatial. torch.randn draws from N(0, 1).

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)

Reference PyTorch layer with matching settings. `padding=1` + kernel=3 preserves spatial size.

ref = conv(x)

Run the built-in layer to get the reference output.

mine = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)

Run our hand-rolled version with the SAME weights pulled from the layer.

print('max abs diff:', ...)

The two should agree to within float32 round-off (~1e-6). Any larger difference is a bug in our implementation.

EXECUTION STATE
⬆ expected = max abs diff: 0.0 (or ~1e-6)
import torch
import torch.nn.functional as F

def conv2d_manual(x, w, bias=None, stride=1, padding=0):
    """Multi-channel 2-D cross-correlation: the slow but explicit version."""
    N, C_in, H, W = x.shape
    C_out, _, kH, kW = w.shape

    if padding > 0:
        x = F.pad(x, (padding,) * 4)
        H += 2 * padding
        W += 2 * padding

    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    out = torch.zeros(N, C_out, out_H, out_W)

    for n in range(N):
        for c in range(C_out):
            for i in range(out_H):
                for j in range(out_W):
                    h0, w0 = i * stride, j * stride
                    window = x[n, :, h0:h0+kH, w0:w0+kW]
                    out[n, c, i, j] = (window * w[c]).sum()

    if bias is not None:
        out += bias.view(1, -1, 1, 1)
    return out

# Verify against PyTorch
x = torch.randn(2, 3, 8, 8)
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
ref  = conv(x)
mine = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)
print("max abs diff:", (ref - mine).abs().max().item())
# Expected: ≈ 0 (within float32 round-off)

This is slow on purpose

The four nested loops are $O(N \cdot C_{\text{out}} \cdot H \cdot W \cdot C_{\text{in}} \cdot kH \cdot kW)$ and run on a single CPU thread. Real libraries typically lower the conv to one large matrix multiply (for example via im2col) and dispatch to BLAS or cuDNN. The speedup is typically 100×–1000×.

Reference

Chellapilla, Puri & Simard, 2006, “High Performance Convolutional Neural Networks for Document Processing”, introduced im2col in the neural-net context; the idea goes back further in linear-algebra literature.
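
The im2col idea can be sketched with F.unfold, which extracts every sliding window as a column. This is an illustration of the principle rather than how any particular backend is implemented; the tensor sizes mirror the verification example above and the seed is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)      # (N, C_in, H, W)
w = torch.randn(16, 3, 3, 3)     # (C_out, C_in, kH, kW)

# im2col: each 3×3×C_in window becomes one column of length C_in*kH*kW = 27.
cols = F.unfold(x, kernel_size=3, padding=1)   # (N, 27, H*W) = (2, 27, 64)

# One batched matrix multiply replaces the nested loops:
# (16, 27) @ (2, 27, 64) -> (2, 16, 64)
out = w.view(16, -1) @ cols
out = out.view(2, 16, 8, 8)                    # fold the columns back into a map

ref = F.conv2d(x, w, padding=1)
print(cols.shape)                              # torch.Size([2, 27, 64])
print(torch.allclose(out, ref, atol=1e-4))     # True
```

The price of this layout is memory: each input pixel is copied into up to kH·kW columns, which is one reason libraries choose between several convolution algorithms.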

Putting It All Together: A Full CNN Pipeline

We can now read the full CNN pipeline end to end. Conv layers extract features; pooling (or strided conv) down-samples; the receptive field grows with depth; the final pooled/flattened vector feeds a classifier. The interactive below ties every stage to the concepts of this chapter.

2D Convolution: Complete Process Visualization

Watch how a CNN processes an image through convolution and pooling layers, reducing dimensions while extracting features.

CNN Architecture: Dimension Reduction Pipeline

Input
1@28×28
Conv1
32@26×26
K=3×3
Pool1
32@13×13
P=2×2
Conv2
64@11×11
K=3×3
Pool2
64@5×5
P=2×2
Flatten
1×1600
FC
10 units
K = Kernel SizeP = Pool Size@ = Channels × Height × Width

Notice how each layer transforms the data:

28×28 Input: Raw grayscale image (like MNIST digits)
Conv1 (32@26×26): 32 different 3×3 kernels extract 32 feature maps, each detecting different patterns
Pool1 (32@13×13): 2×2 max pooling halves spatial dimensions, keeping strongest activations
Conv2 (64@11×11): 64 kernels build on previous features, learning higher-level patterns
Pool2 (64@5×5): Further spatial reduction (⌊11 / 2⌋ = 5, so the last row and column are dropped)
Flatten (1600): Reshape 64×5×5 = 1600 values into a 1D vector
FC (10): Fully connected layer outputs one score per class (class probabilities after a softmax)
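The shape bookkeeping above can be checked with a small PyTorch model. This is a sketch under assumptions: the diagram does not say which activation is used, so the `nn.ReLU()` layers are our choice, and the zero-filled input merely stands in for an MNIST-sized image.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3),   # 1@28x28  -> 32@26x26
    nn.ReLU(),                         # assumed activation (not shown in the diagram)
    nn.MaxPool2d(2),                   # -> 32@13x13
    nn.Conv2d(32, 64, kernel_size=3),  # -> 64@11x11
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 64@5x5, since floor(11 / 2) = 5
    nn.Flatten(),                      # -> 1600
    nn.Linear(64 * 5 * 5, 10),         # -> 10 class scores
)

x = torch.zeros(1, 1, 28, 28)          # [batch, channels, height, width]
shapes = []
for layer in model:
    x = layer(x)
    shapes.append((type(layer).__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:10s} {shape}")
```

Printing the shape after every layer is a cheap habit that catches most dimension mistakes before training starts.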

Kernel Filtering

Input (7×7):

3 3 2 1 0 2 1
0 0 1 3 1 0 2
3 1 2 2 3 1 0
2 0 0 2 2 3 1
2 0 0 0 1 2 2
1 3 2 1 0 1 3
0 2 1 3 2 0 1

Kernel (3×3):

0 1 2
2 2 0
0 1 2

Sliding the kernel over the input at stride 1 with no padding takes 25 steps and fills a 5×5 feature map.

  • Without padding (P=0): Output size = (7-3)/1 + 1 = 5, i.e. 5×5
  • With padding (P=1): Output size = (7+2-3)/1 + 1 = 7, i.e. 7×7 (same as input!)
  • Stride=2: Kernel moves 2 pixels at a time, reducing the output to ⌊(7-3)/2⌋ + 1 = 3, i.e. 3×3
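The 25 steps can be reproduced in a few lines of NumPy. The sketch below uses the 7×7 input and 3×3 kernel listed above; `cross_correlate` is an illustrative helper name, and it assumes no padding.

```python
import numpy as np

image = np.array([
    [3, 3, 2, 1, 0, 2, 1],
    [0, 0, 1, 3, 1, 0, 2],
    [3, 1, 2, 2, 3, 1, 0],
    [2, 0, 0, 2, 2, 3, 1],
    [2, 0, 0, 0, 1, 2, 2],
    [1, 3, 2, 1, 0, 1, 3],
    [0, 2, 1, 3, 2, 0, 1],
])
kernel = np.array([
    [0, 1, 2],
    [2, 2, 0],
    [0, 1, 2],
])

def cross_correlate(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the products."""
    kH, kW = kernel.shape
    oH = (image.shape[0] - kH) // stride + 1
    oW = (image.shape[1] - kW) // stride + 1
    out = np.zeros((oH, oW), dtype=image.dtype)
    for i in range(oH):
        for j in range(oW):
            r, c = i * stride, j * stride          # top-left corner of the window
            out[i, j] = np.sum(image[r:r + kH, c:c + kW] * kernel)
    return out

feature_map = cross_correlate(image, kernel)
print(feature_map)
# [[12 12 17 17  7]
#  [10 17 19 19 17]
#  [ 9  6 14 18 17]
#  [11  8  7 12 18]
#  [12 17 15  9 10]]
print(cross_correlate(image, kernel, stride=2).shape)  # (3, 3)
```

With stride 2 the kernel visits only every other position, so the 3×3 result is exactly `feature_map[::2, ::2]`.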

Pooling Operation

A 2×2 max pool reduces the 5×5 feature map to a 2×2 pooled output: Output = ⌊Input / Pool Size⌋ = ⌊5 / 2⌋ = 2, which takes 4 pooling steps. The fifth row and column do not fill a complete window and are dropped.

Feature Map (from Conv), 5×5:

12 12 17 17 7
10 17 19 19 17
9 6 14 18 17
11 8 7 12 18
12 17 15 9 10

Max Pooling
  • Operation: Takes the maximum value from each 2×2 window
  • Effect: Keeps strongest activations, provides some invariance to small translations
  • Use case: Most common in CNNs (VGG, ResNet, etc.)
  • Output size: ⌊5 / 2⌋ = 2, i.e. 2×2
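The four pooling steps can be done at once with a reshape. This is a minimal NumPy sketch that assumes non-overlapping windows (stride equal to the pool size) and drops any incomplete window at the border; `max_pool2d` here is our own helper, not the PyTorch function of the same name.

```python
import numpy as np

feature_map = np.array([
    [12, 12, 17, 17,  7],
    [10, 17, 19, 19, 17],
    [ 9,  6, 14, 18, 17],
    [11,  8,  7, 12, 18],
    [12, 17, 15,  9, 10],
])

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    oH, oW = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:oH * size, :oW * size]                  # 5x5 -> 4x4 here
    # Split each axis into (block index, offset within block), then take the block max
    return trimmed.reshape(oH, size, oW, size).max(axis=(1, 3))

pooled = max_pool2d(feature_map)
print(pooled)
# [[17 19]
#  [11 18]]
```

Note that the values 7, 17, 17, 18 in the last column and the whole last row never reach the output, which is the price of the floor in the size formula.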

Output Size Formula

O = ⌊(I - K + 2P) / S⌋ + 1
O = Output size
I = Input size
K = Kernel size
P = Padding
S = Stride
For the 7×7 input and 3×3 kernel above: O = ⌊(7 - 3 + 0) / 1⌋ + 1 = 5

Kernel = filter = feature detector
Stride = step size of kernel movement
Padding = zeros added around the input
Feature Map = output of a convolution
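The formula is easy to turn into a helper for sanity-checking layer stacks. In this sketch `conv_output_size` is an illustrative name, and pooling is treated as a window of size K moved with stride S equal to K, which is an assumption matching the 2×2 pools used in this chapter.

```python
def conv_output_size(I, K, P=0, S=1):
    """O = floor((I - K + 2P) / S) + 1 along one spatial axis."""
    return (I - K + 2 * P) // S + 1

print(conv_output_size(7, 3))        # 5  (no padding, stride 1)
print(conv_output_size(7, 3, P=1))   # 7  (same size as the input)
print(conv_output_size(7, 3, S=2))   # 3  (stride 2)

# Trace the 28x28 pipeline: conv 3x3, pool 2x2, conv 3x3, pool 2x2
size = 28
trace = [size]
for K, P, S in [(3, 0, 1), (2, 0, 2), (3, 0, 1), (2, 0, 2)]:
    size = conv_output_size(size, K, P, S)
    trace.append(size)
print(trace)                         # [28, 26, 13, 11, 5]
```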

Feature learning vs. classification

The conv + pool backbone is a learnable feature extractor. The final classifier is typically a global-average-pool followed by one small FC layer, a design choice popularised by GoogLeNet and retained by most major CNN architectures since.

Reference

Szegedy et al., 2015, “Going Deeper with Convolutions” (GoogLeNet / Inception v1), CVPR; Lin, Chen & Yan, 2014, “Network in Network”, ICLR.
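A global-average-pool head can be sketched in a few lines of PyTorch. The channel count (64) and class count (10) are assumptions borrowed from the pipeline earlier in this section; `head` is an illustrative name.

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # [B, C, H, W] -> [B, C, 1, 1]: one average per channel
    nn.Flatten(),             # -> [B, C]
    nn.Linear(64, 10),        # -> [B, 10] class scores
)

features = torch.randn(2, 64, 5, 5)             # stand-in for backbone output
logits = head(features)
n_params = sum(p.numel() for p in head.parameters())
print(logits.shape, n_params)                   # torch.Size([2, 10]) 650

# The same head accepts a different spatial size without any change
print(head(torch.randn(2, 64, 9, 9)).shape)     # torch.Size([2, 10])
```

Compared with flattening 64×5×5 = 1600 values into a 10-unit FC layer (16,010 parameters), the pooled head needs only 650 and no longer depends on the input resolution.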

AI / Deep Learning Applications

Every component of this chapter — conv, stride, padding, pooling, receptive field — is used in the production systems below.

Object detection (YOLO, Faster R-CNN)

A CNN backbone extracts features at multiple scales; a head predicts bounding boxes and class probabilities at each spatial location. The increasing receptive field with depth is what lets the network reason about whole objects while still operating on convolutional feature maps.

Reference

Redmon, Divvala, Girshick & Farhadi, 2016, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR; Ren, He, Girshick & Sun, 2017, “Faster R-CNN”, IEEE TPAMI 39(6).

Semantic segmentation (U-Net)

Every pixel gets classified. An encoder of conv + pool layers compresses the image; a decoder of (transposed) conv layers restores spatial resolution; skip connections re-inject fine detail. Medical-image segmentation adopted this architecture wholesale.

Reference

Ronneberger, Fischer & Brox, 2015, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI.
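The encoder / decoder / skip pattern fits in a one-level toy model. This is a sketch, not the published U-Net: `TinyUNet`, its channel widths, and the use of `padding=1` (the original paper uses unpadded convolutions) are all simplifying assumptions, and the input height and width must be even.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One down-sampling level, one up-sampling level, one skip connection."""

    def __init__(self, in_ch=1, n_classes=2, width=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * width, width, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU())
        self.classify = nn.Conv2d(width, n_classes, kernel_size=1)

    def forward(self, x):
        skip = self.enc(x)                        # full resolution, fine detail
        mid = self.bottleneck(self.pool(skip))    # half resolution, more context
        up = self.up(mid)                         # transposed conv: back to full resolution
        merged = torch.cat([up, skip], dim=1)     # skip connection re-injects detail
        return self.classify(self.dec(merged))    # per-pixel class scores

net = TinyUNet()
scores = net(torch.randn(1, 1, 32, 32))
print(scores.shape)  # torch.Size([1, 2, 32, 32])
```

The output has the same height and width as the input, with one score map per class, which is exactly what per-pixel classification requires.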

Neural style transfer

Convolutions factor an image into “content” (responses of deeper layers, roughly object identity) and “style” (Gram-matrix statistics of shallower layers, roughly texture). Optimising an image so its content matches one reference and its style matches another yields Van Gogh-ified photos.

Reference

Gatys, Ecker & Bethge, 2016, “Image Style Transfer Using Convolutional Neural Networks”, CVPR.
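The Gram matrix is simply the table of inner products between channels of one layer's activations. The sketch below uses random numbers as stand-in activations and leaves the matrix unnormalised (implementations differ on the scaling factor); `gram_matrix` is an illustrative helper.

```python
import numpy as np

def gram_matrix(features):
    """features: (C, H, W) activations of one layer -> (C, C) channel correlations."""
    C, H, W = features.shape
    flat = features.reshape(C, H * W)   # each row is one channel, flattened over space
    return flat @ flat.T                # G[i, j] = sum over positions of F_i * F_j

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 6))  # stand-in for real CNN activations
G = gram_matrix(feats)
print(G.shape)                          # (4, 4)

# Shuffling the spatial positions (same shuffle for every channel) leaves G unchanged
perm = rng.permutation(36)
shuffled = feats.reshape(4, 36)[:, perm].reshape(4, 6, 6)
print(np.allclose(G, gram_matrix(shuffled)))  # True
```

Because the sum runs over all positions, the Gram matrix forgets where things are and keeps only which features co-occur, which is why it behaves like a texture descriptor.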

Generative image models

DCGAN-style generators use transposed (fractionally-strided) convolutions to go from low-resolution noise or latent tensors to a full image, the inverse direction of the down-sampling pipeline we have just built. Later generators such as StyleGAN and the U-Nets inside diffusion models rely on the same idea of learned up-sampling, often implemented as an up-sample step followed by an ordinary convolution.

Reference

Radford, Metz & Chintala, 2016, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR; Karras, Laine & Aila, 2019, “A Style-Based Generator Architecture for Generative Adversarial Networks”, CVPR; Ho, Jain & Abbeel, 2020, “Denoising Diffusion Probabilistic Models”, NeurIPS.
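For a transposed convolution the size formula runs in reverse: with no dilation and no output padding, O = (I - 1)·S - 2P + K. A minimal sketch, assuming the DCGAN-style setting K = 4, S = 2, P = 1 (the channel counts are arbitrary and `transposed_output_size` is an illustrative helper):

```python
import torch
import torch.nn as nn

def transposed_output_size(I, K, S=1, P=0):
    """Output size of a transposed conv (no dilation, no output padding)."""
    return (I - 1) * S - 2 * P + K

up = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
z = torch.randn(1, 16, 4, 4)                 # low-resolution latent tensor
print(up(z).shape)                           # torch.Size([1, 8, 8, 8])
print(transposed_output_size(4, 4, S=2, P=1))  # 8

# Repeated doubling: 4 -> 8 -> 16 -> 32
size, sizes = 4, [4]
for _ in range(3):
    size = transposed_output_size(size, 4, S=2, P=1)
    sizes.append(size)
print(sizes)                                 # [4, 8, 16, 32]
```

Each such layer doubles the height and width, mirroring how a stride-2 convolution halves them.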

A profound empirical observation

If you visualise the first-layer kernels of a trained ImageNet CNN you see oriented edge detectors, colour blobs, and Gabor-like frequency patterns, strikingly close to what electrophysiology finds in V1 of mammals. Nobody trained the network to learn Gabor filters. Gradient descent discovered them from the data.

Reference

Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV; Hubel & Wiesel, 1962, J. Physiology 160(1); Olshausen & Field, 1996, Nature 381.

Summary

Key concepts

Concept | Definition | Why it matters
Convolution | sliding weighted sum, shared weights | the core feature-extraction operation
Kernel / filter | small tensor of learnable weights | each kernel learns one feature detector
Feature map | the output of a conv layer | indicates where a feature appears in the input
Stride | kernel step size | controls output resolution and compute cost
Padding | zeros (or reflection, etc.) added at the border | lets the kernel reach edge pixels; preserves spatial size
Pooling | max/avg reduction over a window | down-sampling with zero parameters; small-shift invariance
Receptive field | set of input pixels influencing one output | grows with depth; motivates stacking small kernels

Critical formulas

  1. 2-D cross-correlation: $(I * K)[i, j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m, n]$
  2. Output size: $O = \lfloor (I - K + 2P) / S \rfloor + 1$
  3. Parameter count: $K \times K \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}$
  4. Receptive field for a stack of $L$ stride-1 layers with kernel size $K$: $R_L = 1 + L(K - 1)$
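Formulas 3 and 4 are simple enough to check by hand or in a couple of lines of Python. The helper names below are illustrative; the first example uses the 3-to-16-channel, 3×3 layer from the manual-convolution check earlier in this section.

```python
def conv_param_count(K, C_in, C_out, bias=True):
    """Weights K*K*C_in*C_out plus one bias per output channel."""
    return K * K * C_in * C_out + (C_out if bias else 0)

def receptive_field(L, K):
    """Receptive field of L stacked stride-1 convs with K x K kernels."""
    return 1 + L * (K - 1)

print(conv_param_count(3, 3, 16))   # 448  (= 3*3*3*16 + 16)
print(receptive_field(1, 3))        # 3
print(receptive_field(2, 3))        # 5  (two 3x3 layers see as far as one 5x5)
print(receptive_field(3, 3))        # 7  (three 3x3 layers see as far as one 7x7)
```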

Exercises

Conceptual

  1. A 128×128×3 image passes through nn.Conv2d(3, 32, kernel_size=5, padding=2, stride=2). What is the output shape and how many parameters does the layer have?
  2. Explain in one sentence why Sobel-X, with weights [[-1,0,1],[-2,0,2],[-1,0,1]], detects vertical edges rather than horizontal ones.
  3. You need to preserve spatial size with a 7×7 kernel at stride 1. What padding?
  4. Two stacked 3×3 convs vs. one 5×5 conv — same receptive field. Give two reasons to prefer the stacked version.
  5. Max pool vs. stride-2 conv — when would you reach for each?

Hints

  1. O = ⌊(128 − 5 + 4) / 2⌋ + 1 = 63 + 1 = 64 → output [B, 32, 64, 64]. Params: 5×5×3×32 + 32 = 2,432.
  2. Sobel-X computes the horizontal intensity difference, and a vertical edge is precisely a location where brightness changes horizontally.
  3. $P = (7-1)/2 = 3$.
  4. (i) Fewer weights per input-output channel pair (2 × 9 = 18 vs. 25); (ii) one extra non-linearity in between, which lets the stack represent functions that a single linear 5×5 filter cannot.
  5. Max pool: no params, robust small-shift invariance, cheap. Strided conv: learnable, can shape the down-sampling to the task.

Coding

  1. Edge magnitude. Apply Sobel-X and Sobel-Y to an image and combine as $\sqrt{G_x^2 + G_y^2}$.
  2. Box vs. Gaussian. Apply both to a noisy photograph and explain why Gaussian looks more natural.
  3. Verify the formula. Write a test that varies I, K, P, S and asserts F.conv2d output shape matches $\lfloor (I - K + 2P)/S \rfloor + 1$.
  4. Implement im2col. Rewrite our manual conv as a single matrix multiplication via unfold. Measure the speedup on a 64×64 input with 128 filters.

References

All factual claims in this chapter are drawn from the following primary sources.

  1. Hubel, D. H. and Wiesel, T. N. (1962). “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex.” Journal of Physiology, 160(1), 106–154.
  2. Olshausen, B. A. and Field, D. J. (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature, 381, 607–609.
  3. LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), 2278–2324.
  4. Chellapilla, K., Puri, S. and Simard, P. (2006). “High performance convolutional neural networks for document processing.” Intl. Workshop on Frontiers in Handwriting Recognition.
  5. Scherer, D., Müller, A. and Behnke, S. (2010). “Evaluation of pooling operations in convolutional architectures for object recognition.” ICANN.
  6. Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks.” NeurIPS.
  7. Lin, M., Chen, Q. and Yan, S. (2014). “Network in network.” ICLR.
  8. Zeiler, M. D. and Fergus, R. (2014). “Visualizing and understanding convolutional networks.” ECCV.
  9. Simonyan, K. and Zisserman, A. (2015). “Very deep convolutional networks for large-scale image recognition.” ICLR.
  10. Szegedy, C. et al. (2015). “Going deeper with convolutions” (GoogLeNet / Inception). CVPR.
  11. Springenberg, J. T., Dosovitskiy, A., Brox, T. and Riedmiller, M. (2015). “Striving for simplicity: the all convolutional net.” ICLR workshop.
  12. Ronneberger, O., Fischer, P. and Brox, T. (2015). “U-Net: convolutional networks for biomedical image segmentation.” MICCAI.
  13. He, K., Zhang, X., Ren, S. and Sun, J. (2016). “Deep residual learning for image recognition.” CVPR.
  14. Yu, F. and Koltun, V. (2016). “Multi-scale context aggregation by dilated convolutions.” ICLR.
  15. Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016). “You only look once: unified, real-time object detection.” CVPR.
  16. Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2016). “Understanding the effective receptive field in deep convolutional neural networks.” NeurIPS.
  17. Gatys, L. A., Ecker, A. S. and Bethge, M. (2016). “Image style transfer using convolutional neural networks.” CVPR.
  18. Radford, A., Metz, L. and Chintala, S. (2016). “Unsupervised representation learning with deep convolutional generative adversarial networks.” ICLR.
  19. Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press. (Chapter 9 — Convolutional Networks — is the standard graduate reference.)
  20. Dumoulin, V. and Visin, F. (2016). “A guide to convolution arithmetic for deep learning.” arXiv:1603.07285.
  21. Chollet, F. (2017). “Xception: deep learning with depthwise separable convolutions.” CVPR.
  22. Howard, A. G. et al. (2017). “MobileNets: efficient convolutional neural networks for mobile vision applications.” arXiv:1704.04861.
  23. Ren, S., He, K., Girshick, R. and Sun, J. (2017). “Faster R-CNN: towards real-time object detection with region proposal networks.” IEEE TPAMI, 39(6), 1137–1149.
  24. Karras, T., Laine, S. and Aila, T. (2019). “A style-based generator architecture for generative adversarial networks.” CVPR.
  25. Ho, J., Jain, A. and Abbeel, P. (2020). “Denoising diffusion probabilistic models.” NeurIPS.

With stride, padding, pooling, and receptive fields now firmly in hand, we can move from individual layers to full CNN architectures. In Chapter 14 we build LeNet-5, VGG, and ResNet from scratch, then use transfer learning to adapt pre-trained models to new tasks.
