Boo-AI — Master Artificial Intelligence by Building from Scratch

Stride — Controlling Output Resolution

In Section 2 our kernel moved one pixel at a time. That is stride 1. If we instead move $S$ pixels per step, the output shrinks by roughly a factor of $S$ along each spatial axis, so we evaluate roughly $S^2$ times fewer windows and save compute in the same proportion.

Definition

Let $S$ be the stride. Then the output size (no padding) becomes:

$O \;=\; \left\lfloor \frac{I - K}{S} \right\rfloor + 1$

For the 5×5 image and 3×3 Sobel-X kernel of Section 2: stride 1 gives a 3×3 output; stride 2 gives a 2×2 output; stride 3 gives a 1×1 output. Every stride greater than 1 throws information away, but that is often exactly what we want: modern architectures like ResNet use stride-2 convolutions in place of (or alongside) pooling precisely to down-sample.

Reference

He et al., 2016, “Deep Residual Learning for Image Recognition”, CVPR. See also Springenberg, Dosovitskiy, Brox & Riedmiller, 2015, “Striving for Simplicity: The All Convolutional Net”, ICLR workshop — which argues strided convolutions can fully replace max pooling in classification networks.

Python & PyTorch — stride in action

Stride, from the loop up

🐍stride_demo.py

import numpy as np

NumPy for our hand-rolled version.

import torch

PyTorch so we can cross-check against F.conv2d at the end.

import torch.nn.functional as F

Functional conv API.

def conv2d_stride(image, kernel, stride=1):

Same conv as Section 2, but with an extra `stride` parameter. `stride` is the number of pixels the kernel moves each step. stride=1 visits every position; stride=2 visits every other position, roughly halving the output resolution along both axes.

EXECUTION STATE

⬇ stride (default 1) = Integer step size. stride=1 → heavily overlapping windows. stride=2 → the windows of a 3×3 kernel still overlap by one row or column; they stop overlapping only when stride ≥ kernel size. Larger stride ⇒ smaller output, less compute.

H, W = image.shape

Unpack input spatial dims. Here H = W = 5.

kH, kW = kernel.shape

Unpack kernel dims. Here kH = kW = 3.

out_H = (H - kH) // stride + 1

Output size formula, stride-aware. Integer-divide the usable span (H − kH) by the stride, then add 1 for the very first position. The // operator is floor division — important when the division is not exact.

EXECUTION STATE

📚 a // b = Python floor division — drops the fractional part. Equivalent to math.floor(a / b) for non-negative a, b.

stride=1 = (5 − 3) // 1 + 1 = 2 + 1 = 3

stride=2 = (5 − 3) // 2 + 1 = 1 + 1 = 2

out_W = (W - kW) // stride + 1

Same derivation along the width axis.

output = np.zeros((out_H, out_W))

Pre-allocate the output matrix with zeros.

for i in range(out_H):

Iterate over OUTPUT rows. i indexes into the OUTPUT, not the image; we derive the image row from h0 = i*stride.

for j in range(out_W):

Iterate over OUTPUT columns.

h0 = i * stride

Translate the output row index into the IMAGE row. With stride=2, output row 1 corresponds to image row 2 (not 1). Apart from the output-size formula, this multiplication (and its column twin) is the only difference between stride-1 and strided convolution.

LOOP TRACE · 2 iterations

stride=2, i=0

h0 = 0

stride=2, i=1

h0 = 2

w0 = j * stride

Same for the column axis.

window = image[h0 : h0 + kH, w0 : w0 + kW]

Slice the kH×kW patch starting at (h0, w0). Because h0 and w0 advance by `stride`, consecutive windows share only kH − stride rows or kW − stride columns (one here), and they stop overlapping once stride ≥ kernel size.

LOOP TRACE · 4 iterations

stride=2, (0,0)

window = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]

stride=2, (0,1)

window = [[30, 40, 50], [60, 80,100], [90,120,150]]

stride=2, (1,0)

window = [[30, 60, 90], [40, 80,120], [50,100,150]]

stride=2, (1,1)

window = [[90,120,150], [120,160,200], [150,200,250]]

output[i, j] = np.sum(window * kernel)

Standard multiply-and-sum. Values are identical to Section 2 for matching (h0, w0), because stride only changes WHICH windows we evaluate — not the formula within each window.

LOOP TRACE · 4 iterations

stride=2, (0,0)

sum = 160.0

stride=2, (0,1)

sum = 160.0

stride=2, (1,0)

sum = 320.0

stride=2, (1,1)

sum = 320.0

return output

Return the stride-aware output.

out_s1 = conv2d_stride(I, K, stride=1)

Stride=1 call — recovers exactly the Section 2 result.

EXECUTION STATE

⬆ out_s1.shape = (3, 3)

⬆ out_s1 =

[[160. 160. 160.]
 [240. 240. 240.]
 [320. 320. 320.]]

out_s2 = conv2d_stride(I, K, stride=2)

Stride=2 call: roughly half the rows and half the columns. Here the 3×3 output shrinks to 2×2, so 4 of the 9 stride-1 values survive.

EXECUTION STATE

⬆ out_s2.shape = (2, 2)

⬆ out_s2 =

[[160. 160.]
 [320. 320.]]

print("stride=1 output shape:", out_s1.shape, "\n", out_s1)

Print stride=1 result for comparison.

print("stride=2 output shape:", out_s2.shape, "\n", out_s2)

Print stride=2 result.

I_t = torch.tensor(I).unsqueeze(0).unsqueeze(0)

Promote the NumPy image to a 4-D PyTorch tensor (1, 1, 5, 5), the shape F.conv2d expects.

K_t = torch.tensor(K).unsqueeze(0).unsqueeze(0)

Same for the kernel → (1, 1, 3, 3).

F.conv2d(I_t, K_t, stride=2).squeeze()

PyTorch call. Identical numerical result to our loop version, confirming our understanding of the stride parameter.

EXECUTION STATE

⬆ torch result =

tensor([[160., 160.],
        [320., 320.]])

import numpy as np
import torch
import torch.nn.functional as F

# Same 5×5 gradient image + Sobel-X kernel as Section 2
I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250],
], dtype=float)

K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def conv2d_stride(image, kernel, stride=1):
    H, W = image.shape
    kH, kW = kernel.shape
    # Output size depends on stride: O = (I - K) // S + 1
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            h0 = i * stride                       # top-left row in the IMAGE
            w0 = j * stride                       # top-left col in the IMAGE
            window = image[h0 : h0 + kH, w0 : w0 + kW]
            output[i, j] = np.sum(window * kernel)
    return output

out_s1 = conv2d_stride(I, K, stride=1)
out_s2 = conv2d_stride(I, K, stride=2)

print("stride=1 output shape:", out_s1.shape, "\n", out_s1)
print("stride=2 output shape:", out_s2.shape, "\n", out_s2)

# PyTorch check: identical numbers
I_t = torch.tensor(I).unsqueeze(0).unsqueeze(0)
K_t = torch.tensor(K).unsqueeze(0).unsqueeze(0)
print("torch stride=2:\n", F.conv2d(I_t, K_t, stride=2).squeeze())

# Expected:
# stride=1 output: (3, 3)  → [[160,160,160],[240,240,240],[320,320,320]]
# stride=2 output: (2, 2)  → [[160,160],[320,320]]

stride	Output shape	Output values
1	(3, 3)	[[160,160,160],[240,240,240],[320,320,320]]
2	(2, 2)	[[160,160],[320,320]]
3	(1, 1)	[[160]]
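
The table suggests a handy mental model: a strided output is the stride-1 output with rows and columns skipped. The sketch below checks this on the same image and kernel. `conv2d_valid` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250],
], dtype=float)
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def conv2d_valid(image, kernel):
    """Stride-1, no-padding cross-correlation (hypothetical helper)."""
    kH, kW = kernel.shape
    out_H = image.shape[0] - kH + 1
    out_W = image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

dense = conv2d_valid(I, K)                          # the 3x3 stride-1 output
strided = {s: dense[::s, ::s] for s in (1, 2, 3)}   # keep every s-th row and column

for s, out in strided.items():
    print(f"stride={s}: shape={out.shape}")
    print(out)
```

A real implementation never computes the dense output first. It simply skips the unwanted windows, which is where the compute saving comes from.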

When to reach for stride > 1

You need to down-sample while simultaneously computing features — a strided conv replaces a conv + pool pair.
You want a parameterised down-sampler: unlike pooling, the strided conv still has learnable weights.
You are building a generator and need the reverse: fractional/transposed strided conv to upsample.

Quick Check

Input is 28×28, kernel is 5×5, stride is 2, no padding. What is the output size?

Padding — Taming the Boundary

Without padding, every stride-1 conv layer shrinks the spatial size by $K - 1$. Stack 10 layers with 3×3 kernels and you lose 20 pixels in each dimension. Padding fixes this by adding extra rows and columns around the input, so the kernel has somewhere to land at the boundary.

The four common padding modes

Mode	What it does	Use case
constant (zero)	Pad with a constant (usually 0)	Default in nn.Conv2d — simple, fast, slightly biases edges toward zero
replicate	Repeat the nearest edge pixel	Smooth boundary behaviour; common in image restoration
reflect	Mirror the input without duplicating the edge	Even smoother; default in many classical image-processing libraries
circular	Wrap around (right ↔ left, top ↔ bottom)	Periodic data — spherical imagery, torus simulations

Terminology: ‘valid’, ‘same’, ‘full’

Higher-level libraries often use named padding shortcuts:

valid: no padding. Output shrinks by $K - 1$.
same: pad so the output spatial size equals the input (when stride=1). For odd kernel sizes: $P = (K - 1)/2$.
full: pad $K - 1$ on each side. Output is larger than input. Rare in deep learning; common in signal processing.

The same-size recipe

To keep the output the same spatial size as the input with stride=1, use $P = (K - 1)/2$. For $K = 3 \rightarrow P = 1$; $K = 5 \rightarrow P = 2$; $K = 7 \rightarrow P = 3$. This is why almost every 3×3 conv in modern CNNs ships with padding=1.

F.pad — all four modes on the same tensor

Padding modes — side-by-side

🐍padding_modes.py

import torch

PyTorch tensors.

import torch.nn.functional as F

F.pad and F.conv2d live here.

x = torch.tensor([...]).unsqueeze(0).unsqueeze(0)

A 3×3 matrix reshaped to 4-D so F.pad (called on 4-D tensors with the 2-element-per-axis convention) can pad the last two dims.

EXECUTION STATE

⬆ x.shape = torch.Size([1, 1, 3, 3])

→ values =

[[1 2 3]
 [4 5 6]
 [7 8 9]]

p = (1, 1, 1, 1)

The padding spec for F.pad reads from the LAST dim toward the first, two numbers per axis: (left, right) for the last axis, then (top, bottom) for the next-to-last. Here we pad 1 pixel on every side.

EXECUTION STATE

p = (1, 1, 1, 1) = (left=1, right=1, top=1, bottom=1). Result after padding: 5×5 instead of 3×3.

zeros = F.pad(x, p, mode="constant", value=0)

Zero padding — the default in nn.Conv2d. Adds rows/columns of zeros around the input. Cheap and simple but biases edges toward zero.

EXECUTION STATE

⬇ mode="constant" = Pad with a constant value.

⬇ value=0 = The constant used when mode="constant". Set to any number — zero is standard.

⬆ result =

[[0 0 0 0 0]
 [0 1 2 3 0]
 [0 4 5 6 0]
 [0 7 8 9 0]
 [0 0 0 0 0]]

replicate = F.pad(x, p, mode="replicate")

Replicate mode — repeat the nearest edge pixel. Useful when a zero border would introduce a bright/dark boundary artifact.

EXECUTION STATE

⬇ mode="replicate" = Extend the input by repeating the boundary row/column.

⬆ result =

[[1 1 2 3 3]
 [1 1 2 3 3]
 [4 4 5 6 6]
 [7 7 8 9 9]
 [7 7 8 9 9]]

reflect = F.pad(x, p, mode="reflect")

Reflect mode — mirror the input WITHOUT duplicating the edge. Produces smoother boundaries than replicate and is the default in many image-processing libraries.

EXECUTION STATE

⬇ mode="reflect" = Mirror around the edge. For x=[a, b, c], 1-pixel reflect pad → [b, a, b, c, b]. The edge element (a) is NOT duplicated.

⬆ result =

[[5 4 5 6 5]
 [2 1 2 3 2]
 [5 4 5 6 5]
 [8 7 8 9 8]
 [5 4 5 6 5]]

circular = F.pad(x, p, mode="circular")

Circular / wrap-around padding. Appropriate for data with intrinsic periodicity (e.g., spherical imagery, toroidal simulations).

EXECUTION STATE

⬇ mode="circular" = Wrap around: the left neighbour of column 0 is column W−1.

⬆ result =

[[9 7 8 9 7]
 [3 1 2 3 1]
 [6 4 5 6 4]
 [9 7 8 9 7]
 [3 1 2 3 1]]

print("constant (zero):\n", zeros.squeeze())

Dump the zero-padded result to stdout.

print("replicate:\n", replicate.squeeze())

Dump replicate-padded result.

print("reflect:\n", reflect.squeeze())

Dump reflect-padded result.

print("circular:\n", circular.squeeze())

Dump circular-padded result.

import torch
import torch.nn.functional as F

# A tiny 3×3 matrix so padding is easy to see
x = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
]).unsqueeze(0).unsqueeze(0)   # (1, 1, 3, 3) for F.pad

# pad = (left, right, top, bottom), 1 pixel each side
p = (1, 1, 1, 1)

zeros     = F.pad(x, p, mode="constant", value=0)   # zero-padding
replicate = F.pad(x, p, mode="replicate")            # repeat edge pixel
reflect   = F.pad(x, p, mode="reflect")              # mirror without duplicating edge
circular  = F.pad(x, p, mode="circular")             # wrap-around

print("constant (zero):\n", zeros.squeeze())
print("replicate:\n",      replicate.squeeze())
print("reflect:\n",        reflect.squeeze())
print("circular:\n",       circular.squeeze())
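
One way to internalise the four modes is to see each as a rule that maps an out-of-range index back to a source index. The pure-Python sketch below does this for a single row. `pad_index` and `pad_1d` are hypothetical names for this illustration and are not how PyTorch implements padding; the reflect rule assumes the pad width is smaller than the row length.

```python
def pad_index(i, n, mode):
    """Map a padded-coordinate index i (may be < 0 or >= n) to a source index.

    Returns None for constant padding outside the input, meaning
    "use the fill value". The reflect rule assumes pad width < n.
    """
    if 0 <= i < n:
        return i
    if mode == "constant":
        return None
    if mode == "replicate":
        return min(max(i, 0), n - 1)              # clamp to the nearest edge
    if mode == "reflect":
        return -i if i < 0 else 2 * (n - 1) - i   # mirror about the edge element
    if mode == "circular":
        return i % n                              # wrap around
    raise ValueError(f"unknown mode: {mode}")

def pad_1d(row, pad, mode, value=0):
    """Pad a 1-D list by `pad` elements on each side."""
    n = len(row)
    out = []
    for i in range(-pad, n + pad):
        src = pad_index(i, n, mode)
        out.append(value if src is None else row[src])
    return out

row = [1, 2, 3]
for mode in ("constant", "replicate", "reflect", "circular"):
    print(f"{mode:9s}", pad_1d(row, 1, mode))
```

Applying the same rule along rows and then columns reproduces the 2-D results shown in the walkthrough; for instance the reflect row `[2, 1, 2, 3, 2]` is the second row of the reflect output.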

The Output-Size Formula

Combining stride and padding, the full output-size formula becomes:

$O \;=\; \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$

This is the single most important piece of arithmetic in CNN design.

Reference

Dumoulin & Visin, 2016, “A guide to convolution arithmetic for deep learning”, arXiv:1603.07285. A concise and highly recommended reference covering all the output-size identities including transposed and dilated convolutions.

Scenario	Settings	Calculation	Output
Same-size 3×3	I=224, K=3, P=1, S=1	(224−3+2)/1 + 1	224
Same-size 5×5	I=224, K=5, P=2, S=1	(224−5+4)/1 + 1	224
Halve via stride	I=224, K=3, P=1, S=2	⌊(224−3+2)/2⌋ + 1	112
No padding	I=224, K=3, P=0, S=1	(224−3)/1 + 1	222
Classic VGG block	I=224, K=3, P=1, S=1 then 2×2 maxpool	224 → 224 → 112	112
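
The rows of the table can be reproduced with a few lines of Python. This is a minimal sketch: `conv_output_size` is a hypothetical helper name, and the VGG row assumes that a 2×2 max pool with stride 2 follows the same output-size arithmetic as a convolution with $K = 2$, $P = 0$, $S = 2$.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one axis: floor((I - K + 2P) / S) + 1."""
    return (i - k + 2 * p) // s + 1

scenarios = {
    "Same-size 3x3":    dict(i=224, k=3, p=1, s=1),
    "Same-size 5x5":    dict(i=224, k=5, p=2, s=1),
    "Halve via stride": dict(i=224, k=3, p=1, s=2),
    "No padding":       dict(i=224, k=3, p=0, s=1),
}
results = {name: conv_output_size(**cfg) for name, cfg in scenarios.items()}

# Classic VGG block: same-size 3x3 conv, then a 2x2 max pool with stride 2.
after_conv = conv_output_size(224, 3, p=1, s=1)
after_pool = conv_output_size(after_conv, 2, p=0, s=2)
results["Classic VGG block"] = after_pool

for name, size in results.items():
    print(f"{name:18s} -> {size}")
```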

Quick Check

Input 64, kernel 5, padding 2, stride 2. Output?

Pooling — Down-sampling Without Parameters

Stride-2 convolution is one way to halve the spatial resolution; pooling is the other. A pooling layer slides a window across the input and reduces each window to a single number, typically the MAX or the MEAN, with zero learnable parameters. That parsimony is deliberate: LeNet-5 already used sub-sampling layers (there with a single trainable scale and bias per feature map) to build some tolerance to small shifts and distortions into the representation while adding almost no capacity.

Reference

LeCun, Bottou, Bengio & Haffner, 1998, “Gradient-Based Learning Applied to Document Recognition”, Proc. IEEE 86(11). Scherer, Müller & Behnke, 2010, “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition”, ICANN, compares max vs. average pooling empirically and finds max wins for classification.

Max vs. average pool

Variant	Reduction	What it keeps	Typical use
Max pool	max of the window	the strongest activation only	default in most classical CNNs (VGG, AlexNet, ResNet's first block)
Average pool	mean of the window	an average signal	smoother, less noisy; preferred when you do NOT want to throw information away
Global average pool	mean over the WHOLE feature map	one scalar per channel	replaces large FC classifier heads (GoogLeNet, ResNet, most modern CNN classifiers)
Adaptive pool	output size fixed; window size computed	user-specified output shape	useful when input sizes vary

Why max pool 'works'

Max pool is a cheap proxy for translation invariance over SMALL shifts: moving a bright feature by one pixel within the pooling window does not change the output. It also acts as a crude form of non-linear feature selection — only the strongest response in each region survives.
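
The invariance claim is easy to test on a toy input. In the sketch below a single bright activation is moved around a 4×4 map: as long as it stays inside the same 2×2 window the pooled output is unchanged, and it changes as soon as the activation crosses into the next window. `single_feature` is a hypothetical helper invented for this illustration.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a list-of-lists map (even H and W)."""
    return [
        [max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
         for c in range(0, len(x[0]), 2)]
        for r in range(0, len(x), 2)
    ]

def single_feature(row, col, size=4, strength=9):
    """A size x size map that is zero except for one bright activation."""
    fmap = [[0] * size for _ in range(size)]
    fmap[row][col] = strength
    return fmap

inside_a = max_pool_2x2(single_feature(0, 0))
inside_b = max_pool_2x2(single_feature(1, 1))   # shifted, still in the same window
crossed = max_pool_2x2(single_feature(0, 2))    # shifted into the neighbouring window

print("same window  :", inside_a == inside_b)
print("crossed edge :", inside_a == crossed)
```

So the invariance is local: it holds only for shifts that keep the feature inside one pooling window.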

Interactive pooling visualiser

The visualiser below lets you toggle between max and average pooling on a 4×4 or 6×6 feature map. Step through each window and confirm the arithmetic matches your mental model.

[Interactive figure: Max Pooling Visualizer. A 2×2 pool with stride 2 steps across a 4×4 input feature map in four steps. Step 1 covers window [0:2, 0:2] with values 1, 3, 5, 6, so max(1, 3, 5, 6) = 6 is written to the 2×2 output.]

Max pooling keeps the strongest activation in each window. It preserves what was detected while discarding exactly where it appeared, which gives the network a degree of translation invariance.

Pooling by hand — Python and PyTorch

Max and average pooling — hand-rolled then PyTorch

🐍pooling_demo.py

import numpy as np

NumPy for the hand-rolled pool.

import torch

Tensors for the PyTorch comparison.

import torch.nn.functional as F

F.max_pool2d and F.avg_pool2d live here.

X = np.array([...], dtype=float)

A 4×4 feature map with contrived values. The same numbers appear in the interactive Pooling Visualizer above.

EXECUTION STATE

X (4×4) =

[[1. 3. 2. 4.]
 [5. 6. 1. 2.]
 [7. 2. 3. 1.]
 [4. 8. 5. 6.]]

def max_pool_2x2(x) → np.ndarray

2×2 non-overlapping MAX pooling with stride=2 — the single most common pooling operator in classical CNNs. Each 2×2 tile is reduced to its maximum.

EXECUTION STATE

⬇ input: x (2-D) = A feature map. Height and width assumed even.

⬆ returns = Down-sampled feature map of shape (H/2, W/2). Half the linear resolution, a quarter of the elements.

H, W = x.shape

Unpack spatial dims. Here H = W = 4.

out = np.zeros((H // 2, W // 2))

Pre-allocate the output: shape (2, 2) for our 4×4 input.

for i in range(H // 2):

Iterate over output rows (0, 1).

for j in range(W // 2):

Iterate over output columns (0, 1).

window = x[2*i : 2*i+2, 2*j : 2*j+2]

Slice a 2×2 non-overlapping tile. The multiplier 2 is the stride; it equals the kernel size, so the tiles cover the image with no overlap.

LOOP TRACE · 4 iterations

(0,0)

window = [[1 3] [5 6]]

(0,1)

window = [[2 4] [1 2]]

(1,0)

window = [[7 2] [4 8]]

(1,1)

window = [[3 1] [5 6]]

out[i, j] = window.max()

Reduce the 4 values in the window to their MAXIMUM. This is the source of max pooling's tolerance to small shifts: as long as the strongest activation stays inside the same window, the output does not change.

LOOP TRACE · 4 iterations

(0,0)

max([1,3,5,6]) = 6

(0,1)

max([2,4,1,2]) = 4

(1,0)

max([7,2,4,8]) = 8

(1,1)

max([3,1,5,6]) = 6

return out

Return the 2×2 pooled output.

EXECUTION STATE

⬆ max-pool result =

[[6. 4.]
 [8. 6.]]

def avg_pool_2x2(x) → np.ndarray

Same scaffolding, but we average instead of taking the max. Every input contributes to the result, which smooths the signal but dilutes a single strong activation among its weaker neighbours.

H, W = x.shape

Shape unpack.

out = np.zeros((H // 2, W // 2))

Pre-allocate.

for i in range(H // 2):

Iterate output rows.

for j in range(W // 2):

Iterate output columns.

window = x[2*i : 2*i+2, 2*j : 2*j+2]

Same 2×2 non-overlapping slicing.

out[i, j] = window.mean()

Reduce to the arithmetic mean.

LOOP TRACE · 4 iterations

(0,0)

mean([1,3,5,6]) = 15/4 = 3.75

(0,1)

mean([2,4,1,2]) = 9/4 = 2.25

(1,0)

mean([7,2,4,8]) = 21/4 = 5.25

(1,1)

mean([3,1,5,6]) = 15/4 = 3.75

return out

Return the averaged result.

EXECUTION STATE

⬆ avg-pool result =

[[3.75 2.25]
 [5.25 3.75]]

print('max:', ...)

Print max-pool output.

print('avg:', ...)

Print avg-pool output.

X_t = torch.tensor(X).unsqueeze(0).unsqueeze(0)

Promote to 4-D tensor (1, 1, 4, 4) for the PyTorch calls.

F.max_pool2d(X_t, kernel_size=2, stride=2)

PyTorch max-pool. ZERO learnable parameters: pooling has no weights and no bias. The two hyperparameters set here are kernel_size and stride; padding and dilation keep their defaults.

EXECUTION STATE

📚 F.max_pool2d(input, kernel_size, stride, padding, dilation, ...) = Applies max-pool over a sliding window.

⬇ kernel_size=2 = 2×2 window (kernel_size=(2,2) for non-square).

⬇ stride=2 = Non-overlapping tiles. If omitted, stride defaults to kernel_size (non-overlap) for pooling — NOT 1 like conv.

⬆ result (squeezed) =

[[6., 4.],
 [8., 6.]]

F.avg_pool2d(X_t, kernel_size=2, stride=2)

PyTorch average-pool. Same shape behaviour; different reduction.

EXECUTION STATE

⬆ result (squeezed) =

[[3.75, 2.25],
 [5.25, 3.75]]

import numpy as np
import torch
import torch.nn.functional as F

# A 4×4 feature map (pretend it came out of a conv layer)
X = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
], dtype=float)

def max_pool_2x2(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i : 2*i+2, 2*j : 2*j+2]  # 2×2 non-overlapping tile
            out[i, j] = window.max()
    return out

def avg_pool_2x2(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i : 2*i+2, 2*j : 2*j+2]
            out[i, j] = window.mean()
    return out

print("max  :", max_pool_2x2(X).tolist())
print("avg  :", avg_pool_2x2(X).tolist())

# PyTorch equivalent: zero learnable parameters
X_t = torch.tensor(X).unsqueeze(0).unsqueeze(0)
print("torch max:", F.max_pool2d(X_t, kernel_size=2, stride=2).squeeze().tolist())
print("torch avg:", F.avg_pool2d(X_t, kernel_size=2, stride=2).squeeze().tolist())

# Expected:
# max: [[6, 4], [8, 6]]
# avg: [[3.75, 2.25], [5.25, 3.75]]
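
The two hand-rolled functions hard-code a 2×2 window and stride 2. As a sketch of how they generalise, the hypothetical `pool2d` below takes the window size, the stride, and the reduction as arguments. It assumes a square window and no padding, and it lets the stride fall back to the kernel size, mirroring the pooling default mentioned in the walkthrough.

```python
import numpy as np

def pool2d(x, kernel_size, stride=None, reduce_fn=np.max):
    """Pool a 2-D array with a square window and no padding.

    stride=None falls back to kernel_size (non-overlapping tiles).
    reduce_fn is any function that maps a window to a scalar.
    """
    if stride is None:
        stride = kernel_size
    H, W = x.shape
    out_H = (H - kernel_size) // stride + 1
    out_W = (W - kernel_size) // stride + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            h0, w0 = i * stride, j * stride
            out[i, j] = reduce_fn(x[h0:h0 + kernel_size, w0:w0 + kernel_size])
    return out

X = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
], dtype=float)

print("max, 2x2 tiles   :", pool2d(X, 2).tolist())
print("avg, 2x2 tiles   :", pool2d(X, 2, reduce_fn=np.mean).tolist())
print("max, overlapping :", pool2d(X, 2, stride=1).tolist())
print("global average   :", pool2d(X, 4, reduce_fn=np.mean).tolist())
```

With the window set to the whole map, the same loop gives global average pooling: one scalar for the entire feature map.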

Stride vs. pool: a modern debate

A much-cited 2015 paper argued that max-pooling can be entirely replaced by strided convolutions without loss of accuracy on CIFAR and ImageNet-style tasks. Most modern architectures still include pooling somewhere (often a single global-average-pool before the classifier), but many intermediate down-sampling steps are now strided convs.

Reference

Springenberg, Dosovitskiy, Brox & Riedmiller, 2015, “Striving for Simplicity: The All Convolutional Net”, ICLR workshop.

Receptive Field — Why Depth Matters

Each output unit of a conv layer depends on only a small input neighbourhood. The receptive field of a unit is the set of input pixels that can affect its value. Stacking conv layers grows the receptive field layer by layer — which is the fundamental reason CNN depth helps.

How receptive field grows

For a stack of conv layers with kernel size $K_\ell$ and stride $S_\ell$ at layer $\ell$, the receptive field after $L$ layers is:

$R_L \;=\; 1 + \sum_{\ell=1}^{L} (K_\ell - 1) \prod_{i=1}^{\ell-1} S_i$

Here $S_i$ is the stride of layer $i$. For an all-stride-1 stack of 3×3 convs the product collapses to 1 and the formula simplifies to $R_L = 1 + 2L$.

After layer	Receptive field (3×3 stride-1 stack)
1	3×3
2	5×5
3	7×7
4	9×9
5	11×11

Why VGG uses stacks of 3×3: Two 3×3 convs give the same receptive field as one 5×5 conv, but with fewer parameters ( $2 \cdot 9 = 18$ vs $25$ per input-output channel pair) and one extra non-linearity in between. Three 3×3 convs match a 7×7 receptive field with $3 \cdot 9 = 27$ weights instead of $49$, roughly 45% fewer. This is one of the central design lessons of modern CNNs.
Reference
Simonyan & Zisserman, 2015, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR.

Interactive: walk through layers, watch the field grow

[Interactive figure: Receptive Field Growth. Steps through the conv layers over a 7×7 input image and reports the output size, receptive field, and RF coverage at each step. At the input the receptive field is 1×1, about 2.0% of the 7×7 image.]

Receptive field recurrence: $R_n = R_{n-1} + (K_n - 1) \prod_{i=1}^{n-1} S_i$

With 3×3 kernels and stride 1: RF grows by 2 pixels per layer (1 → 3 → 5 → 7)

Effective receptive field

The formula above gives the theoretical receptive field: the set of input pixels that could influence the output. The effective receptive field (the set that actually does, in a trained network) is usually much smaller and roughly Gaussian-shaped.

Reference

Luo, Li, Urtasun & Zemel, 2016, “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks”, NeurIPS.

PyTorch `nn.Conv2d` Anatomy

Every hyperparameter you have met so far maps directly to an argument of nn.Conv2d:

🐍conv2d_anatomy.py

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,       # RGB input
    out_channels=64,     # 64 filters → 64 output feature maps
    kernel_size=3,       # 3×3 spatial window
    stride=1,            # move 1 pixel per step
    padding=1,           # "same" padding for K=3
    dilation=1,          # 1 = ordinary conv; >1 = atrous/dilated
    groups=1,            # 1 = full multi-channel; C_in = depthwise
    bias=True,           # learnable bias per output channel
)

print(conv.weight.shape)   # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)     # torch.Size([64])

x = torch.randn(8, 3, 224, 224)   # 8 RGB images, 224×224
y = conv(x)
print(y.shape)                    # torch.Size([8, 64, 224, 224]), same spatial size

Argument	Meaning	Typical value
in_channels	C_in of the input tensor	3 (RGB) or whatever the previous layer produced
out_channels	number of learnable kernels (= number of output feature maps)	32, 64, 128, 256, …
kernel_size	spatial size of each kernel; int or (h, w) tuple	3 is the modern default
stride	pixels moved per step; int or tuple	1 normally; 2 for down-sampling
padding	zeros added around the input; int, tuple, or "same"	(K−1)/2 for same-size
dilation	spacing between kernel elements — makes the receptive field bigger without more params	1 (normal); 2, 4 for dilated / atrous conv
groups	splits channels into independent groups; groups=C_in gives depthwise conv	1 (default); C_in for depthwise-separable
bias	add a learnable per-channel bias	True — unless followed by BatchNorm, which has its own shift

Dilated and depthwise conv — a brief preview

Dilated / atrous convolutions (dilation>1) leave holes between kernel samples, enlarging the receptive field without down-sampling the feature map or adding parameters, which is crucial for semantic segmentation. Depthwise-separable convolutions (groups=C_in followed by a 1×1 conv) dramatically cut FLOPs and are the basis of MobileNet and Xception. We meet both in Chapter 14.

Reference

Yu & Koltun, 2016, “Multi-Scale Context Aggregation by Dilated Convolutions”, ICLR.

Reference

Chollet, 2017, “Xception: Deep Learning with Depthwise Separable Convolutions”, CVPR; Howard et al., 2017, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv:1704.04861.
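
Both ideas come down to simple arithmetic, sketched below. `effective_kernel_size` and `conv_params` are hypothetical helpers for this illustration; the channel counts (64 in, 128 out) are arbitrary example values, and biases are left out of the comparison.

```python
def effective_kernel_size(k, dilation=1):
    """Input span covered by a k-tap kernel with (dilation - 1) holes between taps."""
    return dilation * (k - 1) + 1

def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Weight count of a k x k conv; each group sees c_in // groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k + (c_out if bias else 0)

# Dilation: a 3x3 kernel covers a wider span with the same 9 weights.
spans = {d: effective_kernel_size(3, d) for d in (1, 2, 4)}
print("effective kernel size by dilation:", spans)

# Depthwise-separable vs standard 3x3 conv, 64 -> 128 channels.
c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)
depthwise = conv_params(c_in, c_in, k, groups=c_in)   # one k x k filter per channel
pointwise = conv_params(c_in, c_out, 1)               # 1x1 conv mixes the channels
separable = depthwise + pointwise

print("standard :", standard)
print("separable:", separable, f"({standard / separable:.1f}x fewer weights)")
```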

Manual Implementation (For Reference)

Putting every piece together — stride, padding, multi-channel, bias — into one explicit function. Reading this code is the fastest way to convince yourself that nn.Conv2d holds no mysteries.

Full conv from scratch — verified against nn.Conv2d

🐍conv2d_full_manual.py

def conv2d_manual(x, w, bias=None, stride=1, padding=0):

Full multi-sample, multi-channel 2-D convolution in explicit Python. The 4 nested loops, each iteration performing a C_in · kH · kW dot product, cost O(N · C_out · out_H · out_W · C_in · kH · kW). Educational; not production.

EXECUTION STATE

⬇ x = Input tensor of shape (N, C_in, H, W).

⬇ w = Weight tensor of shape (C_out, C_in, kH, kW).

⬇ bias = Optional (C_out,) tensor. Added to every output position of the corresponding output channel.

⬇ stride, padding = Same meaning as in F.conv2d.

N, C_in, H, W = x.shape

Unpack the 4-D input shape.

C_out, _, kH, kW = w.shape

Unpack the 4-D weight shape. The second axis (C_in) is discarded with _ because it must equal C_in from x, so we do not need it named.

if padding > 0:

Apply zero padding before any window extraction. Padding increases H and W by 2·padding each (padding on both sides).

x = F.pad(x, (padding,) * 4)

Short-hand for (padding, padding, padding, padding) — equal padding on left, right, top, bottom of the last two axes.

EXECUTION STATE

→ (padding,) * 4 = Python tuple-repeat. For padding=1 → (1, 1, 1, 1).

H += 2 * padding

Update the effective height to include the pad rows.

W += 2 * padding

Same for width.

out_H = (H - kH) // stride + 1

The padding-aware output-size formula. We apply it to the ALREADY-padded H, which is why 2·padding appears here implicitly.

out_W = (W - kW) // stride + 1

Width version.

out = torch.zeros(N, C_out, out_H, out_W)

Pre-allocate the 4-D output tensor.

for n in range(N):

Loop over every sample in the batch. The iterations are independent and could be trivially parallelised.

for c in range(C_out):

Loop over every output channel. Each uses its own 3-D kernel w[c].

for i in range(out_H):

Loop over output rows.

for j in range(out_W):

Loop over output columns.

h0, w0 = i * stride, j * stride

Translate output coords to image coords (stride-aware).

window = x[n, :, h0:h0+kH, w0:w0+kW]

Extract the (C_in, kH, kW) window across ALL input channels. The `:` in the second axis is the crucial bit — it grabs every input channel at once.

EXECUTION STATE

⬆ window.shape = (C_in, kH, kW) — e.g. (3, 3, 3) for RGB and 3×3.

out[n, c, i, j] = (window * w[c]).sum()

Element-wise multiply the 3-D window with ONE 3-D kernel w[c] (shape matches: both (C_in, kH, kW)), then sum the entire resulting tensor. One dot product → one output scalar.

EXECUTION STATE

w[c] = The c-th output channel's 3-D kernel — shape (C_in, kH, kW).

(window * w[c]).sum() = Sum of C_in · kH · kW products — e.g. 27 for RGB with 3×3.

if bias is not None:

Bias is optional; when it is None we skip the addition.

out += bias.view(1, -1, 1, 1)

Broadcast the (C_out,) bias over the (N, C_out, H, W) output by reshaping it to (1, C_out, 1, 1). Each output channel gets its own constant added to every spatial position.

EXECUTION STATE

📚 .view(1, -1, 1, 1) = Reshape so broadcasting works. -1 means 'infer this dim from the total element count' — here it resolves to C_out.

return out

Return the 4-D output tensor.

x = torch.randn(2, 3, 8, 8)

Test input: 2 samples, 3 channels, 8×8 spatial. torch.randn draws from N(0, 1).

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)

Reference PyTorch layer with matching settings. `padding=1` + kernel=3 preserves spatial size.

ref = conv(x)

Run the built-in layer to get the reference output.

mine = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)

Run our hand-rolled version with the SAME weights pulled from the layer.

print('max abs diff:', ...)

The two should agree to within float32 round-off (~1e-6). Any larger difference is a bug in our implementation.

EXECUTION STATE

⬆ expected = max abs diff: 0.0 (or ~1e-6)

import torch
import torch.nn.functional as F

def conv2d_manual(x, w, bias=None, stride=1, padding=0):
    """Multi-channel 2-D cross-correlation: the slow but explicit version."""
    N, C_in, H, W = x.shape
    C_out, _, kH, kW = w.shape

    if padding > 0:
        x = F.pad(x, (padding,) * 4)
        H += 2 * padding
        W += 2 * padding

    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    out = torch.zeros(N, C_out, out_H, out_W)

    for n in range(N):
        for c in range(C_out):
            for i in range(out_H):
                for j in range(out_W):
                    h0, w0 = i * stride, j * stride
                    window = x[n, :, h0:h0+kH, w0:w0+kW]
                    out[n, c, i, j] = (window * w[c]).sum()

    if bias is not None:
        out += bias.view(1, -1, 1, 1)
    return out

# Verify against PyTorch
x = torch.randn(2, 3, 8, 8)
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
ref  = conv(x)
mine = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)
print("max abs diff:", (ref - mine).abs().max().item())
# Expected: ≈ 0 (within float32 round-off)

This is slow on purpose

The four nested loops are $O(N \cdot C_{\text{out}} \cdot H \cdot W \cdot C_{\text{in}} \cdot kH \cdot kW)$ and run on a single CPU thread. Real libraries flatten conv into a single big matrix multiply via im2col and dispatch to BLAS or cuDNN. The speedup is typically 100×–1000×.

Reference

Chellapilla, Puri & Simard, 2006, “High Performance Convolutional Neural Networks for Document Processing”, introduced im2col in the neural-net context; the idea goes back further in linear-algebra literature.
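
To make the im2col idea concrete, here is a minimal single-channel NumPy sketch. It unrolls every window into one row of a matrix, so the whole convolution becomes one matrix-vector product. The function names are hypothetical, the unrolling itself still uses Python loops, and this is not how BLAS- or cuDNN-backed libraries actually lay out their buffers.

```python
import numpy as np

def im2col(image, kH, kW, stride=1):
    """Unroll every kH x kW window of a 2-D image into one row of a matrix."""
    H, W = image.shape
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    cols = np.empty((out_H * out_W, kH * kW), dtype=image.dtype)
    for i in range(out_H):
        for j in range(out_W):
            h0, w0 = i * stride, j * stride
            cols[i * out_W + j] = image[h0:h0 + kH, w0:w0 + kW].ravel()
    return cols, (out_H, out_W)

def conv2d_im2col(image, kernel, stride=1):
    """Cross-correlation as a single matrix-vector product."""
    cols, out_shape = im2col(image, *kernel.shape, stride=stride)
    return (cols @ kernel.ravel()).reshape(out_shape)

I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250],
], dtype=float)
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

print(conv2d_im2col(I, K, stride=1))
print(conv2d_im2col(I, K, stride=2))
```

With multiple output channels the kernel vector becomes a matrix with one column per filter, and the product becomes a full matrix multiply, which is the operation optimised linear-algebra libraries are built for.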

Putting It All Together: A Full CNN Pipeline

We can now read the full CNN pipeline end to end. Conv layers extract features; pooling (or strided conv) down-samples; the receptive field grows with depth; the final pooled/flattened vector feeds a classifier. The interactive below ties every stage to the concepts of this chapter.

2D Convolution: Complete Process Visualization

Watch how a CNN processes an image through convolution and pooling layers, reducing dimensions while extracting features.

CNN Architecture: Dimension Reduction Pipeline

Stage	Output	Operation
Input	1@28×28	-
Conv1	32@26×26	K=3×3
Pool1	32@13×13	P=2×2
Conv2	64@11×11	K=3×3
Pool2	64@5×5	P=2×2
Flatten	1×1600	-
FC	10 units	-

K = kernel size; P = pool size; C@H×W means C channels of height × width.

Notice how each layer transforms the data:

28×28 Input: Raw grayscale image (like MNIST digits)

Conv1 (32@26×26): 32 different 3×3 kernels produce 32 feature maps, each detecting a different pattern; with no padding, 28 − 3 + 1 = 26

Pool1 (32@13×13): 2×2 max pooling halves the spatial dimensions, keeping the strongest activations

Conv2 (64@11×11): 64 kernels build on the previous features, learning higher-level patterns; 13 − 3 + 1 = 11

Pool2 (64@5×5): Further spatial reduction; ⌊11 / 2⌋ = 5, so the last row and column are dropped

Flatten (1600): Reshape 64×5×5 = 1600 values into a 1D vector

FC (10): Fully connected layer outputs 10 class scores (logits), which a softmax turns into class probabilities
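The shape bookkeeping above can be checked by building the same stack in PyTorch and printing the tensor shape after every layer. This is a sketch under assumptions: the ReLU activations are not shown in the diagram and are added here as a typical choice, and the names `model` and `shapes` are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3),   # 1@28×28  -> 32@26×26
    nn.ReLU(),                         # assumed activation, shape unchanged
    nn.MaxPool2d(2),                   # 32@26×26 -> 32@13×13
    nn.Conv2d(32, 64, kernel_size=3),  # 32@13×13 -> 64@11×11
    nn.ReLU(),                         # assumed activation, shape unchanged
    nn.MaxPool2d(2),                   # 64@11×11 -> 64@5×5
    nn.Flatten(),                      # 64@5×5   -> 1600
    nn.Linear(64 * 5 * 5, 10),         # 1600     -> 10 logits
)

x = torch.zeros(1, 1, 28, 28)          # one MNIST-sized grayscale image
shapes = []
with torch.no_grad():
    for layer in model:
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:<10} {shape}")
```

Changing the input to a different size makes `nn.Linear` fail with a shape mismatch, which is a quick way to see why the flattened size has to be worked out from the output-size formula.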

Kernel Filtering

Interactive: a 3×3 kernel steps across a 7×7 input and fills in a 5×5 feature map over 25 steps, with selectable kernel, padding, and stride.

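• Without padding (P=0), for a 5×5 input and 3×3 kernel: output size = ⌊(5 − 3)/1⌋ + 1 = 3, i.e. 3×3
• With padding (P=1): output size = ⌊(5 + 2 − 3)/1⌋ + 1 = 5, i.e. 5×5 (same as the input)
• Stride=2: the kernel moves 2 pixels at a time; for the same unpadded input, output size = ⌊(5 − 3)/2⌋ + 1 = 2, i.e. 2×2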

Pooling Operation

Interactive: a 2×2 max pool steps over a 5×5 feature map and fills in a 2×2 pooled output over 4 steps, since Output = ⌊Input / Pool Size⌋ = ⌊5 / 2⌋ = 2.

Max Pooling

• Operation: Takes the maximum value from each 2×2 window
• Effect: Keeps the strongest activations and provides invariance to small translations
• Use case: The most common pooling choice in classic CNNs (VGG, ResNet, etc.)
• Output size: ⌊5 / 2⌋ = 2, i.e. 2×2 (the fifth row and column are dropped)
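The 5×5 case is worth running once, because the floor in the output size means part of the input is never looked at. The sketch below uses `F.max_pool2d` on a feature map filled with 0 to 24; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# A 5×5 feature map holding 0..24, shaped [batch, channels, H, W].
fmap = torch.arange(25.0).reshape(1, 1, 5, 5)

# kernel_size=2 with the default stride (equal to kernel_size), no padding.
pooled = F.max_pool2d(fmap, kernel_size=2)
print(pooled.shape)    # torch.Size([1, 1, 2, 2])
print(pooled[0, 0])    # values [[6, 8], [16, 18]]

# ceil_mode=True keeps the partial windows at the border instead.
pooled_ceil = F.max_pool2d(fmap, kernel_size=2, ceil_mode=True)
print(pooled_ceil.shape)   # torch.Size([1, 1, 3, 3])
```

With the default floor behaviour the value 24 in the bottom-right corner never reaches the output, which is why odd-sized feature maps are often padded before pooling.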

Output Size Formula

$O = \lfloor (I - K + 2P) / S \rfloor + 1$

where $O$ = output size, $I$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.

For the 7×7 input with $K = 3$, $P = 0$, $S = 1$: $O = \lfloor (7 - 3 + 0) / 1 \rfloor + 1 = 5$

Kernel = filter = feature detector
Stride = step size of the kernel movement
Padding = zeros added around the input
Feature map = output of a convolution
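The formula is small enough to keep around as a helper. The function below is a sketch with a hypothetical name, `conv_out_size`, and it reproduces the numbers quoted in this section for one spatial axis.

```python
def conv_out_size(i: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length along one axis: floor((I - K + 2P) / S) + 1."""
    return (i - k + 2 * p) // s + 1

print(conv_out_size(7, 3))         # 5  (7×7 input, 3×3 kernel)
print(conv_out_size(5, 3))         # 3  (no padding)
print(conv_out_size(5, 3, p=1))    # 5  (padding 1 preserves the size)
print(conv_out_size(5, 3, s=2))    # 2  (stride 2)
print(conv_out_size(5, 2, s=2))    # 2  (2×2 pooling is K = S = 2)
```

The last line shows that pooling obeys the same arithmetic: with K = S and no padding the formula reduces to ⌊I / K⌋.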

Feature learning vs. classification

The conv + pool backbone is a learnable feature extractor. The final classifier is typically a global average pool followed by one small FC layer, a design introduced in Network in Network, popularised by GoogLeNet, and retained by most major classification architectures since.

Reference

Szegedy et al., 2015, “Going Deeper with Convolutions” (GoogLeNet / Inception v1), CVPR; Lin, Chen & Yan, 2014, “Network in Network”, ICLR.

AI / Deep Learning Applications

Every component of this chapter — conv, stride, padding, pooling, receptive field — is used in the production systems below.

Object detection (YOLO, Faster R-CNN)

A CNN backbone extracts features at multiple scales; a head predicts bounding boxes and class probabilities at each spatial location. The receptive field, which grows with depth, is what lets the network reason about whole objects while still operating on convolutional feature maps.

Reference

Redmon, Divvala, Girshick & Farhadi, 2016, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR; Ren, He, Girshick & Sun, 2017, “Faster R-CNN”, IEEE TPAMI 39(6).

Semantic segmentation (U-Net)

Every pixel gets classified. An encoder of conv + pool layers compresses the image; a decoder of (transposed) conv layers restores spatial resolution; skip connections re-inject fine detail. The architecture was designed for biomedical images and has been widely adopted in medical-image segmentation.

Reference

Ronneberger, Fischer & Brox, 2015, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI.
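The encoder, decoder, and skip connection can be seen in a one-level toy model. This is a sketch only: `TinyUNet`, its channel widths, and the single down/up step are assumptions chosen for brevity, and the original U-Net is much deeper and uses unpadded convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    """One down-sampling step, one up-sampling step, one skip connection."""

    def __init__(self, in_ch=1, n_classes=2, width=8):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, width, 3, padding=1)
        self.pool = nn.MaxPool2d(2)                        # H×W -> H/2×W/2
        self.mid = nn.Conv2d(width, 2 * width, 3, padding=1)
        self.up = nn.ConvTranspose2d(2 * width, width, kernel_size=2, stride=2)
        self.dec = nn.Conv2d(2 * width, width, 3, padding=1)
        self.head = nn.Conv2d(width, n_classes, 1)         # per-pixel class scores

    def forward(self, x):
        e = F.relu(self.enc(x))                  # full-resolution features
        m = F.relu(self.mid(self.pool(e)))       # half-resolution features
        u = self.up(m)                           # back to full resolution
        d = F.relu(self.dec(torch.cat([u, e], dim=1)))   # skip: concat encoder features
        return self.head(d)

net = TinyUNet()
with torch.no_grad():
    logits = net(torch.zeros(1, 1, 32, 32))
print(logits.shape)   # torch.Size([1, 2, 32, 32])
```

The output has one score per class at every input pixel, which is what makes this a segmentation network rather than a classifier.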

Neural style transfer

A pretrained CNN lets us separate an image into “content” (responses of a deeper layer, roughly object identity) and “style” (Gram-matrix statistics of feature maps taken from several layers, including shallow ones, roughly texture). Optimising an image so its content matches one reference and its style matches another yields Van Gogh-ified photos.

Reference

Gatys, Ecker & Bethge, 2016, “Image Style Transfer Using Convolutional Neural Networks”, CVPR.
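The Gram matrix behind the “style” statistic is a few lines of code. In this sketch the function name `gram_matrix`, the division by H·W, and the random tensor standing in for real CNN activations are all assumptions made for illustration.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: [C, H, W] activations of one layer -> [C, C] Gram matrix."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)      # one row per channel
    return flat @ flat.T / (h * w)         # channel-by-channel correlations

torch.manual_seed(0)
feats = torch.randn(8, 16, 16)             # stand-in for real activations
g = gram_matrix(feats)
print(g.shape)                             # torch.Size([8, 8])

# Shuffling the spatial positions leaves the Gram matrix unchanged.
perm = torch.randperm(16 * 16)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 16, 16)
print(torch.allclose(g, gram_matrix(shuffled), atol=1e-5))
```

The shuffle test is the point: the Gram matrix records which channels fire together but not where, which is why it captures texture rather than layout.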

Generative image models

Many generators, from DCGAN to StyleGAN and the U-Nets inside diffusion models, contain an up-sampling path built from transposed (fractionally-strided) convolutions or from interpolation followed by a convolution. That path takes low-resolution feature maps back up to image resolution, the inverse direction of the down-sampling pipeline we have just built.

Reference

Radford, Metz & Chintala, 2016, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR; Karras, Laine & Aila, 2019, “A Style-Based Generator Architecture for Generative Adversarial Networks”, CVPR; Ho, Jain & Abbeel, 2020, “Denoising Diffusion Probabilistic Models”, NeurIPS.
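The “inverse direction” can be read off the shapes. The sketch below pairs a stride-2 convolution with a transposed convolution that uses the same kernel size, stride, and padding; the specific values (kernel 4, stride 2, padding 1) and channel counts are assumptions picked so the sizes round-trip exactly.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 8, 16, 16)

# Down: floor((16 - 4 + 2*1) / 2) + 1 = 8
down = nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1)
# Up:   (8 - 1) * 2 - 2*1 + 4 = 16
up = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)

with torch.no_grad():
    small = down(x)
    restored = up(small)

print(small.shape)      # torch.Size([1, 16, 8, 8])
print(restored.shape)   # torch.Size([1, 8, 16, 16])
```

Only the shape is inverted, not the values: a transposed convolution is a learned up-sampler, not a mathematical inverse of the forward convolution.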

A profound empirical observation

If you visualise the first-layer kernels of a trained ImageNet CNN you see oriented edge detectors, colour blobs, and Gabor-like frequency patterns, strikingly close to what electrophysiology finds in V1 of mammals. Nobody trained the network to learn Gabor filters. Gradient descent discovered them from the data.

Reference

Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV; Hubel & Wiesel, 1962, J. Physiology 160(1); Olshausen & Field, 1996, Nature 381.

Summary

Key concepts

Concept	Definition	Why it matters
Convolution	sliding weighted sum, shared weights	the core feature-extraction operation
Kernel / filter	small tensor of learnable weights	each kernel learns one feature detector
Feature map	the output of a conv layer	indicates where a feature appears in the input
Stride	kernel step size	controls output resolution and compute cost
Padding	zeros (or reflection, etc.) added at the border	lets the kernel reach edge pixels; preserves spatial size
Pooling	max/avg reduction over a window	down-sampling with zero parameters; small-shift invariance
Receptive field	set of input pixels influencing one output	grows with depth; motivates stacking small kernels

Critical formulas

2-D cross-correlation: $(I * K)[i, j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m, n]$
Output size: $O = \lfloor (I - K + 2P) / S \rfloor + 1$
Parameter count: $K \times K \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}$
Receptive field of a stack of $L$ stride-1 layers with $K \times K$ kernels: $R_L = 1 + L(K - 1)$
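Two of these formulas are easy to sanity-check in code. In the sketch below the helper names `conv_param_count` and `receptive_field` are hypothetical; the parameter formula is compared against what `nn.Conv2d` actually allocates.

```python
import torch.nn as nn

def conv_param_count(k: int, c_in: int, c_out: int) -> int:
    """K × K × C_in × C_out weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

def receptive_field(num_layers: int, k: int) -> int:
    """Receptive field of a stack of stride-1 layers with K × K kernels."""
    return 1 + num_layers * (k - 1)

layer = nn.Conv2d(3, 32, kernel_size=5)
actual = sum(p.numel() for p in layer.parameters())
print(actual, conv_param_count(5, 3, 32))   # 2432 2432

print(receptive_field(1, 5))   # 5: one 5×5 layer
print(receptive_field(2, 3))   # 5: two stacked 3×3 layers see the same region
print(receptive_field(3, 3))   # 7
```

Stride and padding do not appear in the parameter count, only in the output size.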

Exercises

Conceptual

1. A 128×128×3 image passes through nn.Conv2d(3, 32, kernel_size=5, padding=2, stride=2). What is the output shape and how many parameters does the layer have?
2. Explain in one sentence why Sobel-X, with weights [[-1,0,1],[-2,0,2],[-1,0,1]], detects vertical edges rather than horizontal ones.
3. You need to preserve spatial size with a 7×7 kernel at stride 1. What padding?
4. Two stacked 3×3 convs vs. one 5×5 conv: same receptive field. Give two reasons to prefer the stacked version.
5. Max pool vs. stride-2 conv: when would you reach for each?

Hints

1. $O = \lfloor (128 - 5 + 4) / 2 \rfloor + 1 = 63 + 1 = 64$ → output [B, 32, 64, 64]. Params: 5×5×3×32 + 32 = 2,432.
2. Sobel-X computes the horizontal intensity difference, and a vertical edge is precisely a location where brightness changes horizontally.
3. $P = (7-1)/2 = 3$.
4. (i) Fewer parameters (18 vs. 25 weights per input-output channel pair); (ii) one extra non-linearity in between, which makes the stacked version more expressive for the same receptive field.
5. Max pool: no parameters, robust small-shift invariance, cheap. Strided conv: learnable, can shape the down-sampling to the task.

Coding

1. Edge magnitude. Apply Sobel-X and Sobel-Y to an image and combine as $\sqrt{G_x^2 + G_y^2}$.
2. Box vs. Gaussian. Apply both to a noisy photograph and explain why Gaussian looks more natural.
3. Verify the formula. Write a test that varies I, K, P, S and asserts F.conv2d output shape matches $\lfloor (I - K + 2P)/S \rfloor + 1$.
4. Implement im2col. Rewrite our manual conv as a single matrix multiplication via unfold. Measure the speedup on a 64×64 input with 128 filters.

References

All factual claims in this chapter are drawn from the following primary sources.

Hubel, D. H. and Wiesel, T. N. (1962). “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex.” Journal of Physiology, 160(1), 106–154.
Olshausen, B. A. and Field, D. J. (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature, 381, 607–609.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), 2278–2324.
Chellapilla, K., Puri, S. and Simard, P. (2006). “High performance convolutional neural networks for document processing.” Intl. Workshop on Frontiers in Handwriting Recognition.
Scherer, D., Müller, A. and Behnke, S. (2010). “Evaluation of pooling operations in convolutional architectures for object recognition.” ICANN.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks.” NeurIPS.
Lin, M., Chen, Q. and Yan, S. (2014). “Network in network.” ICLR.
Zeiler, M. D. and Fergus, R. (2014). “Visualizing and understanding convolutional networks.” ECCV.
Simonyan, K. and Zisserman, A. (2015). “Very deep convolutional networks for large-scale image recognition.” ICLR.
Szegedy, C. et al. (2015). “Going deeper with convolutions” (GoogLeNet / Inception). CVPR.
Springenberg, J. T., Dosovitskiy, A., Brox, T. and Riedmiller, M. (2015). “Striving for simplicity: the all convolutional net.” ICLR workshop.
Ronneberger, O., Fischer, P. and Brox, T. (2015). “U-Net: convolutional networks for biomedical image segmentation.” MICCAI.
He, K., Zhang, X., Ren, S. and Sun, J. (2016). “Deep residual learning for image recognition.” CVPR.
Yu, F. and Koltun, V. (2016). “Multi-scale context aggregation by dilated convolutions.” ICLR.
Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016). “You only look once: unified, real-time object detection.” CVPR.
Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2016). “Understanding the effective receptive field in deep convolutional neural networks.” NeurIPS.
Gatys, L. A., Ecker, A. S. and Bethge, M. (2016). “Image style transfer using convolutional neural networks.” CVPR.
Radford, A., Metz, L. and Chintala, S. (2016). “Unsupervised representation learning with deep convolutional generative adversarial networks.” ICLR.
Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press. (Chapter 9 — Convolutional Networks — is the standard graduate reference.)
Dumoulin, V. and Visin, F. (2016). “A guide to convolution arithmetic for deep learning.” arXiv:1603.07285.
Chollet, F. (2017). “Xception: deep learning with depthwise separable convolutions.” CVPR.
Howard, A. G. et al. (2017). “MobileNets: efficient convolutional neural networks for mobile vision applications.” arXiv:1704.04861.
Ren, S., He, K., Girshick, R. and Sun, J. (2017). “Faster R-CNN: towards real-time object detection with region proposal networks.” IEEE TPAMI, 39(6), 1137–1149.
Karras, T., Laine, S. and Aila, T. (2019). “A style-based generator architecture for generative adversarial networks.” CVPR.
Ho, J., Jain, A. and Abbeel, P. (2020). “Denoising diffusion probabilistic models.” NeurIPS.

With stride, padding, pooling, and receptive fields now firmly in hand, we can move from individual layers to full CNN architectures. In Chapter 14 we build LeNet-5, VGG, and ResNet from scratch, then use transfer learning to adapt pre-trained models to new tasks.


PyTorch `nn.Conv2d` Anatomy