Chapter 10

Convolution Parameters

Convolution Operations

Introduction

In Section 10.1 we argued why convolutions exist; in Section 10.2 we defined what a convolution is and worked out a single $5 \times 5$ example by hand. We now answer the natural follow-up: how do we control the shape, receptive field, and computational cost of a convolutional layer?

Three knobs do essentially all the work:

  • Stride $S$: how far the kernel jumps between evaluations. Controls output resolution.
  • Padding $P$: how we treat the boundary. Controls whether spatial size shrinks each layer.
  • Dilation $D$: how far apart the kernel's taps sit. Controls the receptive field for a fixed parameter budget.

The one-sentence summary: stride shrinks the output, padding offsets the border shrinkage that the kernel itself causes, and dilation enlarges the field of view without adding parameters. Master these three and you can read any CNN architecture diagram on sight.

A single formula from Dumoulin & Visin (2016) [1] ties them all together, and it is exactly the formula PyTorch uses internally for nn.Conv2d [2]. We will derive it, implement it twice (NumPy first, then PyTorch), and verify every number produced along the way.


Learning Objectives

  1. Predict output shape for any combination of $I, K, P, S, D$ using the master formula, and know when you need the floor function.
  2. Choose stride vs. pooling appropriately for downsampling — including why strided convolutions replaced max-pooling in many modern architectures (Springenberg et al., 2015) [3].
  3. Pick a padding mode (zeros, reflect, replicate, circular) based on what the input actually represents.
  4. Reason about dilation and compute the receptive field of a stack of dilated convolutions — the core idea behind WaveNet [4] and DeepLab [5].
  5. Implement each knob twice: once from scratch in NumPy, once with PyTorch, and verify that both produce the same numbers.

The Master Output-Shape Formula

For a 1-D spatial dimension with input size $I$, kernel size $K$, padding $P$ applied on each side, stride $S$, and dilation $D$, the output size is

$$O = \left\lfloor \dfrac{I + 2P - D(K-1) - 1}{S} \right\rfloor + 1$$

This is the exact formula listed in the PyTorch documentation for nn.Conv2d [2] and derived geometrically in Dumoulin & Visin [1]. It applies independently to height and width (2-D images have two such equations).

Breaking It Down

| Symbol | Meaning | Typical values |
|---|---|---|
| I | Input size (height or width) | 32, 224, … |
| K | Kernel size | 3, 5, 7 |
| P | Padding added on each side | 0, 1, 2 |
| S | Stride | 1, 2 |
| D | Dilation rate | 1, 2, 4, 8 (WaveNet) |
| D·(K−1)+1 | Effective kernel size k_eff | 3 when D=1, K=3; 5 when D=2, K=3 |
| ⌊·⌋ | Floor: drop any fractional leftover | Only changes the result when S > 1 |

A cleaner way to remember it

Define the effective kernel size $k_\text{eff} = D(K-1) + 1$. Then the formula becomes the much more recognizable $O = \lfloor (I + 2P - k_\text{eff}) / S \rfloor + 1$. All the dilation complexity is absorbed into $k_\text{eff}$.

Why the floor?

If $(I + 2P - k_\text{eff})$ is not a multiple of $S$, the final step of the kernel would overrun the right edge, so a few trailing input positions are never covered by any window. PyTorch, like most frameworks, silently drops that overflowing position, which is what the floor encodes. If you need the opposite rounding (ceil), you must add extra padding manually.
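To make the formula concrete, here is a minimal sketch in plain Python. The helper name `conv_out_size` and its argument order are illustrative choices made here, not a PyTorch API; it simply evaluates the master formula for one spatial axis.

```python
def conv_out_size(i, k, p=0, s=1, d=1):
    """Output size along one axis (hypothetical helper, not part of PyTorch)."""
    k_eff = d * (k - 1) + 1                # effective kernel size after dilation
    if i + 2 * p < k_eff:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k_eff) // s + 1    # // is floor division

print(conv_out_size(32, 3))              # 30
print(conv_out_size(32, 3, s=2))         # 15
print(conv_out_size(32, 3, p=1))         # 32
print(conv_out_size(32, 3, p=2, d=2))    # 32
# The floor at work: inputs of size 7 and 8 give the same output at stride 2,
# because the last column of the size-8 input is never reached.
print(conv_out_size(7, 3, s=2), conv_out_size(8, 3, s=2))  # 3 3
```

The last line shows why the floor matters only for strides above 1: with stride 1 every extra input column produces one extra output.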

Stride — Controlling Output Resolution

Stride $S$ is how many pixels the kernel jumps between evaluations. $S = 1$ is dense sampling: every valid kernel position gets its own output. $S = 2$ visits every other position along each axis, so the output has roughly a quarter as many values as the input.

Intuition

Imagine the kernel as a flashlight sweeping a wall. Stride is the distance between successive flashes. A fast sweep (large stride) covers the whole wall in fewer frames but loses detail; a slow sweep (small stride) gives you more frames of fine-grained information at higher cost.

Plugging $P = 0, D = 1$ into the master formula gives the stride-only special case $O = \lfloor (I - K)/S \rfloor + 1$.

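| I | K | S | O = ⌊(I−K)/S⌋ + 1 | Effect |
|---|---|---|---|---|
| 32 | 3 | 1 | 30 | Barely shrinks (loses K−1 = 2) |
| 32 | 3 | 2 | 15 | Roughly halves resolution |
| 32 | 3 | 4 | 8 | Quarter resolution |
| 32 | 7 | 2 | 13 | Roughly halves; K=7 yields 2 fewer outputs than K=3 |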

Stride vs. pooling for downsampling

Classical CNNs (LeNet, AlexNet, VGG) downsampled with pooling layers, max-pooling in the case of AlexNet and VGG. Springenberg et al. (2015) [3] showed that pooling can be replaced by a plain stride-2 convolution without losing accuracy on several image-recognition benchmarks. The strided version has two advantages: it is learnable end-to-end, and it fits more naturally into residual blocks. ResNet [6] downsamples between stages with stride-2 convolutions inside its residual blocks.

Stride in Pure Python

Let's start where every working implementation of a new idea should: with a few lines of NumPy that we can trace on paper. The input is the same $5 \times 5$ grid we used in Section 10.2, and we compare stride = 1 against stride = 2.

Stride from Scratch — NumPy
🐍stride_numpy.py
1import numpy as np

NumPy is the numerical backbone. We use ndarray for the image and kernel, slicing for windows, and element-wise multiplication + np.sum() to compute the dot product at each sliding position.

EXECUTION STATE
numpy = Provides ndarray + vectorized arithmetic written in C
as np = Conventional alias so we can write np.array(), np.sum(), etc.
3Input image I (5×5)

A 5×5 grayscale image with values 1…25 arranged row-major. Easy numbers let us verify each output by hand.

EXECUTION STATE
I (5×5) =
      c0   c1   c2   c3   c4
r0   1    2    3    4    5
r1   6    7    8    9   10
r2  11   12   13   14   15
r3  16   17   18   19   20
r4  21   22   23   24   25
→ dtype=float = Force floating-point so dot products don't accidentally overflow for large kernel sums. Integer dtypes silently wrap in NumPy.
11Kernel K (3×3)

A fixed 'corners-plus-center' kernel. The 5 non-zero taps sum up 5 specific cells of the window; the 4 zero taps are ignored. We keep it simple so arithmetic is easy to trace.

EXECUTION STATE
K (3×3) =
     c0  c1  c2
r0    1   0   1
r1    0   1   0
r2    1   0   1
→ why this shape? = Non-zero taps sit at the 4 corners + center of the 3×3 window. Useful for teaching because the output equals the sum of 5 cells — easy to verify.
17def conv2d_strided(img, kern, stride) → np.ndarray

Pure-Python cross-correlation with an explicit stride. The only change from a stride=1 version is that the window's top-left corner advances by `stride` each step and the output shape shrinks accordingly.

EXECUTION STATE
⬇ input: img (5×5) = The NumPy 2-D image to convolve over.
⬇ input: kern (3×3) = The NumPy 2-D kernel whose weights we slide across the image.
⬇ input: stride (int) = How many pixels to jump between successive windows. stride=1 = dense sampling (every window position). stride=2 = skip every other position → roughly ½ the output size along each axis.
⬆ returns = np.ndarray of shape (out_H, out_W), one weighted sum per valid window.
18H, W = img.shape

Unpack the 2-D image size. In PyTorch the shape would be (N, C, H, W); here we're working with a single 2-D plane so img.shape has two numbers.

EXECUTION STATE
img.shape = (5, 5) — height × width
H = 5 — number of rows
W = 5 — number of columns
19k, _ = kern.shape

We unpack only the first dimension into `k` because the kernel is square. The underscore `_` is Python's conventional throwaway name for values we don't use.

EXECUTION STATE
kern.shape = (3, 3)
k = 3 — kernel height (and width)
_ (underscore) = Discards the second value. Pythonic way to say 'I'm unpacking but don't need this'.
20out_H = (H - k) // stride + 1

The canonical output-size formula from Dumoulin & Visin (2016), specialised to padding=0 and dilation=1. `//` is floor-division — the usual choice because the kernel must fit entirely inside the image.

EXECUTION STATE
📚 (H − K) // S + 1 = Counts how many positions a K-wide window can occupy when it moves S pixels each step. Floor-division means any leftover pixels at the right edge are simply ignored.
→ stride=1 case = (5 − 3)//1 + 1 = 2 + 1 = 3
→ stride=2 case = (5 − 3)//2 + 1 = 1 + 1 = 2
out_H = 3 or 2 depending on stride
21out_W = (W - k) // stride + 1

Same formula, this time for width. For a square image it equals out_H.

EXECUTION STATE
out_W = 3 (stride=1) or 2 (stride=2)
22out = np.zeros((out_H, out_W))

Allocate the output buffer up front. Pre-allocating is both faster than appending to a Python list and clearer about the expected shape.

EXECUTION STATE
📚 np.zeros(shape) = Creates an ndarray of zeros with the given shape and default dtype float64.
⬇ arg: shape = (3, 3) or (2, 2) — matches the formula above.
23for i in range(out_H):

Outer loop over output rows. Each i corresponds to one vertical position of the kernel window.

LOOP TRACE · 5 iterations
stride=1, i=0
window rows = img[0:3, :]
stride=1, i=1
window rows = img[1:4, :]
stride=1, i=2
window rows = img[2:5, :]
stride=2, i=0
window rows = img[0:3, :]
stride=2, i=1
window rows = img[2:5, :]
24for j in range(out_W):

Inner loop over output columns. Combined with i, we visit every valid kernel position exactly once.

EXECUTION STATE
total positions (s=1) = 3 × 3 = 9
total positions (s=2) = 2 × 2 = 4, fewer than half of the 9 dense positions (the ratio approaches ¼ on large inputs)
25window = img[i*stride : i*stride+k, j*stride : j*stride+k]

NumPy slice that extracts the current k×k patch. The stride shows up in both the start (i*stride) and implicitly in the spacing between successive windows.

EXECUTION STATE
📚 NumPy slicing arr[a:b, c:d] = Returns a view (not a copy) of rows a..b-1 and columns c..d-1. Very cheap — no data is copied.
→ s=1, i=1, j=1 =
img[1:4, 1:4] →
  7   8   9
 12  13  14
 17  18  19
→ s=2, i=1, j=1 =
img[2:5, 2:5] →
 13  14  15
 18  19  20
 23  24  25
27out[i, j] = np.sum(window * kern)

Element-wise multiply the patch by the kernel, then sum all 9 products. This single scalar is one output pixel. Because 4 of the 9 kernel taps are zero, only 5 patch cells actually contribute.

EXECUTION STATE
📚 * (ndarray × ndarray) = Element-wise multiplication when shapes match. Result has the same shape as the inputs.
📚 np.sum() = Reduces all elements to a single scalar. Equivalent here to np.dot(window.ravel(), kern.ravel()).
→ example (s=1, i=1, j=1) = window = [[7,8,9],[12,13,14],[17,18,19]] kern = [[1,0,1],[0,1,0],[1,0,1]] non-zero taps: 7 + 9 + 13 + 17 + 19 = 65
28return out

Return the finished feature map. For stride=1 we get 3×3; for stride=2 we get 2×2. That is 4 outputs instead of 9 here, and on large inputs the saving in both footprint and work approaches a factor of 4.

EXECUTION STATE
⬆ return: stride=1 (3×3) =
     c0  c1  c2
r0  35  40  45
r1  60  65  70
r2  85  90  95
⬆ return: stride=2 (2×2) =
     c0  c1
r0  35  45
r1  85  95
→ observation = Stride=2 picks exactly the corners of the stride=1 output: (0,0),(0,2),(2,0),(2,2) → 35, 45, 85, 95. Striding = subsampling in a very literal sense.
30print(conv2d_strided(I, K, stride=1))

Dense sampling — every pixel position produces an output.

EXECUTION STATE
output shape = (3, 3) — 9 values
36print(conv2d_strided(I, K, stride=2))

Stride=2 downsamples by roughly a factor of 2 along each axis, cutting the output footprint to ~¼.

EXECUTION STATE
output shape = (2, 2) — 4 values
import numpy as np

I = np.array([
    [1,  2,  3,  4,  5],
    [6,  7,  8,  9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
], dtype=float)

K = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

def conv2d_strided(img, kern, stride):
    H, W = img.shape
    k, _ = kern.shape
    out_H = (H - k) // stride + 1
    out_W = (W - k) // stride + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            window = img[i*stride : i*stride + k,
                         j*stride : j*stride + k]
            out[i, j] = np.sum(window * kern)
    return out

print("stride=1 →")
print(conv2d_strided(I, K, stride=1))
# [[35. 40. 45.]
#  [60. 65. 70.]
#  [85. 90. 95.]]

print("stride=2 →")
print(conv2d_strided(I, K, stride=2))
# [[35. 45.]
#  [85. 95.]]

Why stride=2 output equals a subsample of stride=1

Compare the numbers. stride=1 gives [[35,40,45],[60,65,70],[85,90,95]]; stride=2 gives [[35,45],[85,95]]. These are exactly the corners of the stride=1 output — (0,0), (0,2), (2,0), (2,2). Striding a convolution is mathematically identical to computing the dense convolution and then subsampling — it's just much cheaper.
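The equivalence can be checked directly. The sketch below is self-contained and uses an illustrative helper named `conv2d_valid` (a name chosen here, equivalent in spirit to the listing above); it computes the dense output, keeps every second row and column, and compares the result with the stride-2 output.

```python
import numpy as np

def conv2d_valid(img, kern, stride=1):
    """Unpadded cross-correlation with an integer stride (illustrative helper)."""
    k = kern.shape[0]
    out_h = (img.shape[0] - k) // stride + 1
    out_w = (img.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride          # top-left corner of the window
            out[i, j] = np.sum(img[r:r + k, c:c + k] * kern)
    return out

img = np.arange(1, 26, dtype=float).reshape(5, 5)
kern = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)

dense = conv2d_valid(img, kern, stride=1)
strided = conv2d_valid(img, kern, stride=2)
subsampled = dense[::2, ::2]                       # keep every second row and column

print(np.array_equal(strided, subsampled))         # True
```

The strided version never computes the values that subsampling would throw away, which is where the saving comes from.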

Stride in PyTorch

PyTorch expresses stride through the stride keyword of both F.conv2d and nn.Conv2d. The math is identical; only the tensor layout (NCHW) and the call site differ.

Stride in PyTorch
🐍stride_pytorch.py
1import torch

Core PyTorch tensor library. Tensors are like NumPy arrays but live on CPU or GPU and track gradients for autograd.

EXECUTION STATE
torch = Provides Tensor, autograd, device transfers, and the nn / functional APIs.
2import torch.nn.functional as F

The functional API. torch.nn.Conv2d is a stateful Module (holds learned weights); F.conv2d is a pure function — useful when you want to pass weights explicitly, e.g. for fixed kernels or shared weights.

EXECUTION STATE
F.conv2d = Pure function: F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
5I = torch.tensor([...]).view(1, 1, 5, 5)

Build the image tensor, then reshape to PyTorch's 4-D NCHW layout: (batch, channels, height, width). A single grayscale image → N=1, C=1.

EXECUTION STATE
📚 torch.tensor(data) = Copy a Python list/NumPy array into a new tensor on CPU.
📚 .view(1, 1, 5, 5) = Returns a new tensor with the requested shape that shares storage with the original, so no data is copied. Equivalent to .reshape() when the tensor is contiguous.
⬇ arg: dtype=torch.float32 = 32-bit float — the default for most CNN weights and what cuDNN expects on GPU.
I.shape = torch.Size([1, 1, 5, 5]) — (N, C, H, W)
13K = torch.tensor([...]).view(1, 1, 3, 3)

Kernel reshaped to (out_channels, in_channels, kH, kW) — the layout F.conv2d expects. 1 output channel, 1 input channel, 3×3 spatial.

EXECUTION STATE
K.shape = torch.Size([1, 1, 3, 3]) — (C_out, C_in, kH, kW)
→ why this layout? = One 'filter' in a conv layer spans all input channels and produces one output channel. The tensor stores all C_out filters stacked along axis 0.
20out_s1 = F.conv2d(I, K, stride=1)

Run cross-correlation with stride=1. PyTorch calls this 'convolution' even though it doesn't flip the kernel, the same convention we saw in Section 10.2.

EXECUTION STATE
📚 F.conv2d = Executes cross-correlation on the GPU/CPU via cuDNN/oneDNN. Much faster than nested Python loops.
⬇ arg 1: I (1,1,5,5) = The NCHW input tensor.
⬇ arg 2: K (1,1,3,3) = The kernel weight tensor.
⬇ arg 3: stride=1 = Move 1 pixel per step along H and W. A single int applies to both axes; a tuple (sH, sW) sets a different stride for each axis.
⬆ out_s1.shape = torch.Size([1, 1, 3, 3])
21out_s2 = F.conv2d(I, K, stride=2)

Same operation but skipping every other pixel. The formula gives ⌊(5−3)/2⌋+1 = 2, so we get a 2×2 feature map — identical to the NumPy version.

EXECUTION STATE
⬇ arg: stride=2 = Double step along H and W. Output is roughly halved in each spatial dim.
⬆ out_s2.shape = torch.Size([1, 1, 2, 2])
23print(out_s1.shape)

Sanity-check the shape. Reading the shape of intermediate tensors is the #1 debugging technique for CNNs.

EXECUTION STATE
out_s1.shape = torch.Size([1, 1, 3, 3])
24print(out_s1.squeeze())

squeeze() removes all size-1 dimensions so we can print the feature map as a plain 2-D matrix.

EXECUTION STATE
📚 .squeeze() = Strips every singleton dim. (1,1,3,3) → (3,3). Use .squeeze(dim) to target a specific axis.
→ values =
[[35, 40, 45],
 [60, 65, 70],
 [85, 90, 95]]
29print(out_s2.shape)

The stride=2 output. Same math, just fewer positions.

EXECUTION STATE
out_s2.shape = torch.Size([1, 1, 2, 2])
30print(out_s2.squeeze())

The four 'corner' values of the stride=1 result, identical to what conv2d_strided produced in pure NumPy. The agreement confirms, on this example, that PyTorch's implementation follows the same mathematical definition.

EXECUTION STATE
→ values =
[[35, 45],
 [85, 95]]
import torch
import torch.nn.functional as F

# Same 5x5 input and 3x3 kernel, but in PyTorch's NCHW layout
I = torch.tensor([
    [1,  2,  3,  4,  5],
    [6,  7,  8,  9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
], dtype=torch.float32).view(1, 1, 5, 5)

K = torch.tensor([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=torch.float32).view(1, 1, 3, 3)

# Functional API: lets us pass weights explicitly
out_s1 = F.conv2d(I, K, stride=1)
out_s2 = F.conv2d(I, K, stride=2)

print("stride=1 shape:", out_s1.shape)   # torch.Size([1, 1, 3, 3])
print(out_s1.squeeze())
# tensor([[35., 40., 45.],
#         [60., 65., 70.],
#         [85., 90., 95.]])

print("stride=2 shape:", out_s2.shape)   # torch.Size([1, 1, 2, 2])
print(out_s2.squeeze())
# tensor([[35., 45.],
#         [85., 95.]])

Quick Check

A 224×224 feature map goes through nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2). What is the output spatial size?


Padding — Handling the Boundary

Without padding, every convolution with $K > 1$ shrinks the output. After $n$ stride-1 layers with $K = 3$, a $32 \times 32$ input becomes $(32 - 2n) \times (32 - 2n)$: after 15 layers only a $2 \times 2$ map is left, which is too small for a 16th $3 \times 3$ kernel. Padding prevents that collapse by inserting extra rows and columns around the border before the convolution runs.
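The shrinkage is easy to tabulate. The following plain-Python sketch (variable names chosen here for illustration) counts how many unpadded, stride-1, $3 \times 3$ layers can be stacked on a 32-pixel axis before the kernel no longer fits.

```python
size, kernel, layers = 32, 3, 0
while size >= kernel:            # another layer only fits if the kernel fits
    size = size - kernel + 1     # valid (unpadded), stride-1 convolution
    layers += 1

print(layers, size)              # 15 2
```

Fifteen layers fit, leaving a 2-pixel axis that cannot host another 3-wide kernel.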

Valid, Same, and Full Padding

| Name | Padding P | Output size (S=1, D=1) | When to use |
|---|---|---|---|
| valid | 0 | I − K + 1 | You are happy to lose pixels at the border, e.g. the final conv before a classifier. |
| same | ⌊(K−1)/2⌋ | I (unchanged, for odd K) | The dominant choice in VGG, ResNet, EfficientNet. Keeps spatial size so you can stack many layers cleanly. |
| full | K − 1 | I + K − 1 | Every position where image and kernel overlap by at least one pixel is represented. Rare in forward conv; shows up implicitly in the backward pass / transposed conv. |

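“Same” is really the recipe $P = (k_\text{eff} - 1)/2$, which reduces to $(K-1)/2$ when $D = 1$ and $K$ is odd. When $k_\text{eff}$ is even, $k_\text{eff} - 1$ is odd, so the padding must be split unevenly between the two sides; a single integer $P$ cannot express that, which is why even kernels are awkward to combine with “same” padding.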
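A small sketch of the recipe, in plain Python. The helper names `same_padding` and `out_size` are hypothetical, and putting the extra pixel on the trailing side for even $k_\text{eff}$ is a convention assumed here for illustration; libraries may split it differently.

```python
def same_padding(k, d=1):
    """Return (before, after) padding that preserves size at stride 1.

    Hypothetical helper; the trailing side receives the extra pixel
    when the total padding is odd (an assumption made for this sketch).
    """
    total = d * (k - 1)          # equals k_eff - 1
    before = total // 2
    after = total - before
    return before, after

def out_size(i, k, before, after, d=1):
    """Stride-1 output size with possibly unequal padding on the two sides."""
    k_eff = d * (k - 1) + 1
    return i + before + after - k_eff + 1

for k, d in [(3, 1), (5, 1), (3, 2), (4, 1)]:
    before, after = same_padding(k, d)
    print(k, d, (before, after), out_size(32, k, before, after, d))
# 3 1 (1, 1) 32
# 5 1 (2, 2) 32
# 3 2 (2, 2) 32
# 4 1 (1, 2) 32
```

Only the even kernel (K = 4) needs unequal padding on the two sides.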

Padding Modes (Zero, Reflect, Replicate, Circular)

PyTorch's nn.Conv2d supports four padding_modes [2], each appropriate for different data.

[Interactive figure: Padding Modes. Shows how each mode fills the 2-pixel border ($P = 2$) around a 5×5 input holding the values 1 to 25. In the default view, zero padding (padding_mode='zeros', the PyTorch default), every border cell is 0.]

| Mode | Border fill | Best for |
|---|---|---|
| zeros | 0.0 everywhere | Default. Classification CNNs where the boundary is rarely informative. |
| reflect | Mirror without repeating the edge pixel | Style transfer, image-to-image models: avoids fake dark borders. |
| replicate | Repeat the edge pixel outward | Low-level vision, derivative filters; keeps intensity continuous. |
| circular | Wrap around as if periodic | Inherently periodic data: angular/spherical signals, cylindrical images. |
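The four fills are easiest to compare on a single row. This sketch uses NumPy's `np.pad`, whose mode names differ from PyTorch's; the dictionary below maps each PyTorch `padding_mode` to the corresponding `np.pad` mode.

```python
import numpy as np

row = np.array([1, 2, 3, 4, 5])

# PyTorch padding_mode name -> equivalent np.pad mode name
modes = {"zeros": "constant", "reflect": "reflect",
         "replicate": "edge", "circular": "wrap"}

padded = {name: np.pad(row, 2, mode=np_mode) for name, np_mode in modes.items()}
for name, values in padded.items():
    print(f"{name:9s}", values)
# zeros     [0 0 1 2 3 4 5 0 0]
# reflect   [3 2 1 2 3 4 5 4 3]
# replicate [1 1 1 2 3 4 5 5 5]
# circular  [4 5 1 2 3 4 5 1 2]
```

Note that reflect mirrors around the edge pixel without repeating it, while replicate repeats the edge pixel itself.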

Zero-padding bias

Zero-padding is convenient but injects a constant 0 signal at the edges. Networks pick up on this and, by the last few layers, can tell where the boundary is. Islam et al. (2020) [7] showed this effectively gives CNNs a weak form of absolute position encoding — sometimes useful, sometimes an unwanted shortcut.
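A toy case makes the leak visible. This is an illustration constructed here, not an experiment from Islam et al.: a constant image of ones is convolved with an all-ones $3 \times 3$ kernel under zero padding. The image carries no positional information at all, yet the output differs between corners, edges, and interior.

```python
import numpy as np

img = np.ones((4, 4))
kern = np.ones((3, 3))
padded = np.pad(img, 1, mode="constant")      # zero padding, P = 1

out = np.zeros_like(img)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kern)

print(out)
# [[4. 6. 6. 4.]
#  [6. 9. 9. 6.]
#  [6. 9. 9. 6.]
#  [4. 6. 6. 4.]]
```

Each output counts how many real pixels its window covered, so a later layer can read off distance to the border from the activation alone.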

Padding in Pure Python

With NumPy's np.pad we can implement all four modes without any extra dependencies. Note how padding becomes a pre-processing step: we enlarge the input, then run the same unpadded convolution we already wrote.

Padding from Scratch — NumPy
🐍padding_numpy.py
1import numpy as np

NumPy gives us np.pad — a fast, general padding routine that supports every mode PyTorch does and more.

EXECUTION STATE
numpy = Ships with np.pad, which handles constant, reflect, edge, wrap, symmetric, …
3I = np.array(...) — 5×5 input

Same 1…25 grid as before. Using familiar data makes it easy to spot what changed when padding is added.

EXECUTION STATE
I (5×5) = Values 1..25 row-major. Corner pixel I[0,0]=1, bottom-right I[4,4]=25.
11K = np.array(...) — corner-plus-center kernel

Same 5-tap kernel we used for stride.

EXECUTION STATE
K (3×3) =
[[1,0,1],[0,1,0],[1,0,1]]
17def conv2d_padded(img, kern, padding, pad_mode) → np.ndarray

Padding is a pre-processing step: we enlarge the input, then run an unpadded stride-1 convolution on top. Conceptually simpler than threading padding through the main loop.

EXECUTION STATE
⬇ input: img = NumPy 2-D image to convolve.
⬇ input: kern = NumPy 2-D kernel.
⬇ input: padding (int) = How many rows/cols of fill are added on each of the 4 sides. p=0 = valid (no padding); p=(k−1)/2 = same (output=input when stride=1).
⬇ input: pad_mode (str) = How to fill the border. 'constant' = zeros (default), 'reflect' = mirror, 'edge' = replicate boundary, 'wrap' = circular. Matches PyTorch's padding_mode options.
⬆ returns = np.ndarray of shape (out_H, out_W).
19if padding > 0: img = np.pad(img, padding, mode=pad_mode)

np.pad wraps a border of width `padding` around the array. Because we rebind `img` inside the function, the caller's original array is untouched.

EXECUTION STATE
📚 np.pad(array, pad_width, mode) = Pads an ndarray along every axis. pad_width can be an int (same on all sides) or a tuple per axis.
⬇ arg: pad_width=padding = Integer → padding is applied symmetrically on all 2*ndim sides.
⬇ arg: mode=pad_mode = Selects the fill strategy.
→ p=1, mode='constant' → img (7×7) =
[[ 0,  0,  0,  0,  0,  0,  0],
 [ 0,  1,  2,  3,  4,  5,  0],
 [ 0,  6,  7,  8,  9, 10,  0],
 [ 0, 11, 12, 13, 14, 15,  0],
 [ 0, 16, 17, 18, 19, 20,  0],
 [ 0, 21, 22, 23, 24, 25,  0],
 [ 0,  0,  0,  0,  0,  0,  0]]
20H, W = img.shape

After the optional pad, re-read the (possibly enlarged) shape. With p=1 the 5×5 input becomes 7×7.

EXECUTION STATE
H = 5 if padding=0, else 5 + 2·padding = 7
W = same as H for this square example
21k, _ = kern.shape

Kernel is square, so one number describes both dimensions.

EXECUTION STATE
k = 3
22out_H = H - k + 1

Stride-1 output size on the padded image. Combined with the padding step above, this realises the classic formula O = I + 2P − K + 1.

EXECUTION STATE
→ p=0 = 5 − 3 + 1 = 3
→ p=1 = 7 − 3 + 1 = 5 (same as original!)
23out_W = W - k + 1

Symmetric formula for width.

24out = np.zeros((out_H, out_W))

Pre-allocate the output feature map.

25for i in range(out_H):

Outer loop — rows of the output.

26for j in range(out_W):

Inner loop — columns of the output.

27out[i, j] = np.sum(img[i:i+k, j:j+k] * kern)

The same core computation as before. Because padding was applied to `img` before the loop, windows near the original edge now overlap the synthetic border pixels.

EXECUTION STATE
→ p=1, (i,j)=(0,0) → corner = window = [[0,0,0],[0,1,2],[0,6,7]] non-zero taps hit: 0 + 0 + 1 + 0 + 7 = 8
→ p=1, (i,j)=(2,2) → center = window = [[7,8,9],[12,13,14],[17,18,19]] non-zero taps: 7 + 9 + 13 + 17 + 19 = 65
28return out

Ship the feature map back to the caller.

EXECUTION STATE
⬆ return: p=1 (5×5) =
[[ 8, 16, 19, 22, 14],
 [20, 35, 40, 45, 28],
 [35, 60, 65, 70, 43],
 [50, 85, 90, 95, 58],
 [38, 56, 59, 62, 44]]
→ notice = The interior 3×3 block (rows 1..3, cols 1..3) is exactly the p=0 output [[35,40,45],[60,65,70],[85,90,95]]. Padding only changes the boundary.
31valid_out = conv2d_padded(I, K, padding=0)

'Valid' padding means no padding: the kernel only visits positions where it fits entirely inside the image. Output shrinks by (k−1) per axis.

EXECUTION STATE
valid_out.shape = (3, 3)
32same_out = conv2d_padded(I, K, padding=1, pad_mode='constant')

'Same' padding with p=(k−1)/2 and stride=1 preserves spatial size. This is the pattern used throughout VGG and ResNet.

EXECUTION STATE
📚 'same' recipe = For odd k and stride=1: set p = (k−1)/2. k=3 → p=1. k=5 → p=2. k=7 → p=3.
same_out.shape = (5, 5) — same as input!
34print("valid:", valid_out.shape)

Confirms the shape shrink without padding.

35print("same :", same_out.shape)

Confirms the shape is preserved with p=(k−1)/2.

import numpy as np

I = np.array([
    [1,  2,  3,  4,  5],
    [6,  7,  8,  9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
], dtype=float)

K = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

def conv2d_padded(img, kern, padding=0, pad_mode='constant'):
    # Pad the image first, then do a stride-1 conv
    if padding > 0:
        img = np.pad(img, padding, mode=pad_mode)
    H, W = img.shape
    k, _ = kern.shape
    out_H = H - k + 1
    out_W = W - k + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(img[i:i+k, j:j+k] * kern)
    return out

# Same padding: p = (k-1)//2 keeps output = input
valid_out = conv2d_padded(I, K, padding=0)
same_out  = conv2d_padded(I, K, padding=1, pad_mode='constant')

print("valid:", valid_out.shape)   # (3, 3)
print("same :", same_out.shape)    # (5, 5)
print(same_out)
# [[ 8. 16. 19. 22. 14.]
#  [20. 35. 40. 45. 28.]
#  [35. 60. 65. 70. 43.]
#  [50. 85. 90. 95. 58.]
#  [38. 56. 59. 62. 44.]]

Padding in PyTorch

In PyTorch, padding is a keyword on both the functional and module APIs. The string shortcut padding='same' is convenient but only legal when every spatial stride is 1 — an easy-to-miss restriction.

Padding in PyTorch — Functional and Module APIs
🐍padding_pytorch.py
1import torch

Main PyTorch namespace — provides Tensor and torch.arange here.

2import torch.nn as nn

nn contains Module classes with learned state — e.g. nn.Conv2d stores the weight tensor and biases.

EXECUTION STATE
nn.Conv2d = Module wrapping a convolution layer. Registers weight, bias, and padding_mode as parameters/attributes.
3import torch.nn.functional as F

F is the stateless twin of nn. F.conv2d takes weights as an argument; useful for fixed kernels and for research code.

5I = torch.arange(1, 26).view(1, 1, 5, 5)

Compact way to build the 1..25 tensor. arange(1,26) stops before 26, giving exactly 25 values.

EXECUTION STATE
📚 torch.arange(start, end) = 1-D sequence [start, start+1, …, end−1]. Half-open like Python range.
⬇ arg 1: start=1 = first value
⬇ arg 2: end=26 = exclusive upper bound — value 26 is NOT included
I.shape = torch.Size([1, 1, 5, 5])
6K = torch.tensor([...]).view(1, 1, 3, 3)

Same corner-plus-center kernel, reshaped to (C_out=1, C_in=1, kH=3, kW=3).

11valid = F.conv2d(I, K, padding=0)

No padding. Output is (5 − 3 + 1) = 3 in each spatial dim, matching the NumPy version exactly.

EXECUTION STATE
⬇ arg: padding=0 = No border added. Kernel must fit entirely inside the input.
⬆ valid.shape = torch.Size([1, 1, 3, 3])
12same = F.conv2d(I, K, padding=1)

Pads 1 zero on each of the 4 sides. With k=3 and stride=1 this preserves the spatial size — the 'same' pattern used everywhere in modern CNNs.

EXECUTION STATE
⬇ arg: padding=1 = Integer → same padding on all sides. A tuple (pH, pW) sets different padding for height and width, still applied equally to both sides of each axis.
⬆ same.shape = torch.Size([1, 1, 5, 5])
→ F.conv2d padding semantics = F.conv2d always uses zero-padding when `padding` is an int. For other modes use F.pad(x, ...) explicitly before calling conv2d, or use nn.Conv2d with padding_mode=.
17conv_same = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, padding_mode='reflect', bias=False)

The Module API. Registers learnable weights of shape (1, 1, 3, 3) and configures how the boundary is filled at every forward pass.

EXECUTION STATE
📚 nn.Conv2d = class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')
⬇ in_channels=1 = Grayscale input — 1 channel.
⬇ out_channels=1 = We'll learn a single filter.
⬇ kernel_size=3 = 3×3 spatial kernel (shorthand for (3,3)).
⬇ padding=1 = One-pixel border on each side.
⬇ padding_mode='reflect' = Border is filled by mirroring the inside without repeating the edge pixel. Allowed values: 'zeros' (default), 'reflect', 'replicate', 'circular'.
⬇ bias=False = Skip the per-output-channel bias — handy when we want our output to match a bias-free reference implementation.
26with torch.no_grad(): conv_same.weight.copy_(K)

Overwrite the randomly initialised kernel with our hand-picked K. Wrapping it in torch.no_grad() avoids creating autograd graph nodes for this setup step.

EXECUTION STATE
📚 torch.no_grad() = Context manager that temporarily disables gradient tracking. Used for weight-surgery, inference, or manual parameter copies.
📚 .copy_(other) = In-place copy of another tensor's values. The trailing underscore signals in-place — a PyTorch convention.
28y = conv_same(I)

Forward pass. Under the hood PyTorch pads I with the reflect rule, then runs cross-correlation with the stored weight.

EXECUTION STATE
⬆ y.shape = torch.Size([1, 1, 5, 5])
→ reflect vs zeros = With reflect padding the corner no longer sees synthetic zeros — it sees a mirrored copy of nearby pixels. The interior values are identical; only the border row/col differs.
29print(y.shape)

Confirms the shape matches the input — the hallmark of 'same' padding.

32conv_auto = nn.Conv2d(1, 1, kernel_size=3, padding='same', bias=False)

PyTorch ≥1.9 lets you pass the string 'same' instead of an integer. PyTorch computes p for you so that H_out == H_in. Restriction: works only when stride=1 in every spatial dim.

EXECUTION STATE
⬇ padding='same' = String shortcut. Internally PyTorch computes p = (k_eff − 1)/2 on each side, with an extra pixel on one side if k_eff is even.
⬇ padding='valid' = Equivalent to padding=0. Provided for symmetry.
→ restriction = Passing padding='same' with stride>1 raises ValueError. In that case, compute padding manually.
import torch
import torch.nn as nn
import torch.nn.functional as F

I = torch.arange(1, 26, dtype=torch.float32).view(1, 1, 5, 5)
K = torch.tensor([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]], dtype=torch.float32).view(1, 1, 3, 3)

# 1) Functional API: pass padding explicitly
valid = F.conv2d(I, K, padding=0)          # "valid"
same  = F.conv2d(I, K, padding=1)          # "same" when stride=1, k=3

print("valid:", valid.shape)  # torch.Size([1, 1, 3, 3])
print("same :", same.shape)   # torch.Size([1, 1, 5, 5])

# 2) Module API: Conv2d stores weights and padding config
conv_same = nn.Conv2d(
    in_channels=1, out_channels=1,
    kernel_size=3,
    padding=1,
    padding_mode='reflect',   # 'zeros' | 'reflect' | 'replicate' | 'circular'
    bias=False,
)

# Replace the random init with our fixed kernel to see deterministic output
with torch.no_grad():
    conv_same.weight.copy_(K)

y = conv_same(I)
print(y.shape)                 # torch.Size([1, 1, 5, 5])

# 3) String shortcut (PyTorch >= 1.9)
conv_auto = nn.Conv2d(1, 1, kernel_size=3, padding='same', bias=False)
# padding='same' auto-computes p so that H_out == H_in (stride=1 only)

Quick Check

You want a 5×5 convolution to preserve spatial dimensions with stride=1. What padding do you need?


Dilation — Larger Receptive Field for Free

Dilation (also called “atrous” convolution, from the French à trous = “with holes”) inserts $D - 1$ zeros between adjacent kernel taps. The result behaves like a much larger kernel for the purpose of receptive field, while still storing only the original $K \times K$ weights.
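The “holes” picture can be made literal. In this NumPy sketch the helper `dilate_kernel` (a name invented here) writes the original taps into every $D$-th cell of a zero-filled array. Real implementations skip the zeros rather than storing them, so treat this as a way to visualize the footprint, not as how a framework computes a dilated convolution.

```python
import numpy as np

def dilate_kernel(kern, d):
    """Insert d - 1 zeros between adjacent taps (illustrative helper)."""
    k = kern.shape[0]
    k_eff = d * (k - 1) + 1                        # effective kernel size
    out = np.zeros((k_eff, k_eff), dtype=kern.dtype)
    out[::d, ::d] = kern                           # original taps land on every d-th cell
    return out

kern = np.arange(1, 10).reshape(3, 3)
print(dilate_kernel(kern, 2))
# [[1 0 2 0 3]
#  [0 0 0 0 0]
#  [4 0 5 0 6]
#  [0 0 0 0 0]
#  [7 0 8 0 9]]
```

The footprint grows from 3×3 to 5×5 while the number of non-zero weights stays at 9.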

The formalism is due to Yu & Koltun (ICLR 2016) [8], who used it to capture multi-scale context in dense prediction. WaveNet [4] uses stacks of dilated causal 1-D convolutions with dilations $1, 2, 4, 8, \ldots$ to cover long stretches of audio with far fewer layers than an undilated stack would need. DeepLab [5] uses them (“atrous spatial pyramid pooling”) to capture multi-scale image context for semantic segmentation.

Visualizing Dilation

The effective kernel size $k_\text{eff} = D(K-1) + 1$ grows with the dilation while the parameter count stays fixed at 9.

Interactive figure: Dilation: Expanding Receptive Field for Free. A dilation slider drives a 9×9 input grid in which the yellow cells are the 9 taps, showing what the kernel “sees”.

A 3×3 kernel always has 9 parameters, but dilation D inserts D−1 zeros between each tap. The kernel covers a larger area of the input without learning any new weights. At the initial setting D = 1, k_eff = D·(K−1) + 1 = 1·(3−1) + 1 = 3 (still only 9 learned weights).

D = 1 (standard): taps are adjacent, 3×3 window.

D = 2: taps skip every other cell, 5×5 window but still 9 weights. WaveNet uses the same dilation factor in its 1-D stack.

D = 3: window is 7×7, still 9 weights.

Why it matters: stacking layers with dilations 1, 2, 4, 8, … grows the receptive field exponentially with depth, not linearly — which is how WaveNet reaches seconds of audio context and DeepLab captures multi-scale image context.
| D | k_eff = D·(K−1) + 1 | Parameters | Comment |
|---|---|---|---|
| 1 | 3 | 9 | Ordinary 3×3 convolution. |
| 2 | 5 | 9 | 5×5 coverage, still 9 weights. Used in many segmentation heads. |
| 4 | 9 | 9 | 9×9 coverage; one layer already covers a chunk. |
| 8 | 17 | 9 | WaveNet-scale; used in the upper layers of the dilated stack. |
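
The rows of this table can be regenerated with a few lines of plain Python. This is a minimal sketch; the helper name `effective_kernel_size` is made up for illustration and is not part of NumPy or PyTorch.

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Span, in input cells, of a k-tap kernel whose taps sit d cells apart."""
    return d * (k - 1) + 1

K = 3  # physical kernel edge length used throughout this section
for d in (1, 2, 4, 8):
    k_eff = effective_kernel_size(K, d)
    # The weight count depends only on K, never on the dilation.
    print(f"D={d}: k_eff={k_eff}, coverage={k_eff}x{k_eff}, weights={K * K}")
```

The loop prints k_eff values of 3, 5, 9, and 17 with 9 weights in every row, matching the table.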

Dilation in Pure Python

Implementing dilation is a one-line change to the Python loop: where we used to index img[i+a, j+b], we now index img[i+a·D, j+b·D]. The formula for the output size uses $k_\text{eff}$ in place of $K$.

Dilated Convolution from Scratch — NumPy
🐍dilation_numpy.py
1import numpy as np

Same setup as before.

3I = np.arange(1, 26).reshape(5, 5)

Compact way to spell the 1…25 matrix. np.arange(1,26) gives a 25-element 1-D array; reshape turns it into 5×5.

EXECUTION STATE
📚 np.arange(start, stop) = Half-open integer/float range like Python's range, returned as an ndarray.
📚 .reshape(5, 5) = View the same data with a new shape. Total element count must match.
4K = corner-plus-center 3×3

Same kernel — the 5 non-zero taps make it easy to see which input cells a dilated kernel samples.

10def conv2d_dilated(img, kern, dilation) → np.ndarray

A single extra parameter — dilation — turns our plain convolution into a dilated (also called 'atrous') convolution. From Yu & Koltun (ICLR 2016).

EXECUTION STATE
⬇ input: dilation (int) = Gap between successive kernel taps. D=1 reduces to the ordinary conv. D=2 inserts 1 zero between each tap → 5×5 footprint. D=3 → 7×7 footprint. Parameter count never changes.
11H, W = img.shape

Read off the input dimensions.

EXECUTION STATE
H, W = 5, 5
12k, _ = kern.shape

Square kernel size.

EXECUTION STATE
k = 3 — physical kernel edge length
13k_eff = dilation * (k - 1) + 1

The effective receptive span of the kernel once dilation is applied. This is THE key formula — reuse it everywhere (output shape, receptive field, padding).

EXECUTION STATE
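📚 k_eff = D·(K−1) + 1 = Derivation: K taps spaced D apart take K−1 steps of D cells each from the first tap to the last, plus the first tap's own cell → D·(K−1)+1 cells. Equivalently, (K−1)·(D−1) inserted zeros plus the K taps themselves.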
→ D=1, K=3 = 1·(3−1)+1 = 3 (plain 3×3)
→ D=2, K=3 = 2·(3−1)+1 = 5 (5×5 footprint, still 9 weights)
→ D=3, K=3 = 3·(3−1)+1 = 7
14out_H = H - k_eff + 1

Same (I − k + 1) formula as plain conv, but with the *effective* kernel size. With D=2 the 3×3 kernel behaves like a 5×5 one for shape purposes.

EXECUTION STATE
→ D=1 = 5 − 3 + 1 = 3
→ D=2 = 5 − 5 + 1 = 1
15out_W = W - k_eff + 1

Symmetric for width.

16out = np.zeros((out_H, out_W))

Pre-allocate.

17for i in range(out_H):

Iterate over output rows.

18for j in range(out_W):

Iterate over output columns.

19s = 0.0

Accumulator for the current output pixel's weighted sum.

20for a in range(k):

Iterate over the k physical kernel rows (3 rows regardless of dilation).

21for b in range(k):

Iterate over the k physical kernel cols.

22s += img[i + a*dilation, j + b*dilation] * kern[a, b]

The only line where dilation shows up: we multiply `a` and `b` by `dilation` when indexing the image. The kernel itself stays the same 3×3 tensor — no new parameters.

EXECUTION STATE
→ D=2, (i,j)=(0,0), tap (a,b)=(0,0) = img[0, 0] = 1 × kern[0,0]=1 → 1
→ D=2, (i,j)=(0,0), tap (a,b)=(0,2) = img[0, 4] = 5 × kern[0,2]=1 → 5
→ D=2, (i,j)=(0,0), tap (a,b)=(1,1) = img[2, 2] = 13 × kern[1,1]=1 → 13
→ D=2, (i,j)=(0,0), tap (a,b)=(2,0) = img[4, 0] = 21 × kern[2,0]=1 → 21
→ D=2, (i,j)=(0,0), tap (a,b)=(2,2) = img[4, 4] = 25 × kern[2,2]=1 → 25
→ total = 1 + 5 + 13 + 21 + 25 = 65
24out[i, j] = s

Commit the accumulated sum to the output array.

25return out

Hand the feature map back.

EXECUTION STATE
⬆ return: D=2 (1×1) =
[[65.]]
→ interpretation = Only one valid position exists for a 5×5 effective kernel on a 5×5 image. The single output pixel 'sees' the four corners + the center of the input.
27print("D=1 →", conv2d_dilated(I, K, dilation=1).shape)

Sanity check — D=1 reproduces the plain conv shape (3, 3).

30print("D=2 →", conv2d_dilated(I, K, dilation=2))

The dilated result. Same 9 weights; bigger footprint; fewer valid positions.

import numpy as np

I = np.arange(1, 26, dtype=float).reshape(5, 5)
K = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

def conv2d_dilated(img, kern, dilation=1):
    H, W = img.shape
    k, _ = kern.shape
    k_eff = dilation * (k - 1) + 1       # effective kernel size
    out_H = H - k_eff + 1
    out_W = W - k_eff + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            s = 0.0
            for a in range(k):
                for b in range(k):
                    s += img[i + a * dilation,
                             j + b * dilation] * kern[a, b]
            out[i, j] = s
    return out

print("D=1 →", conv2d_dilated(I, K, dilation=1).shape)
# (3, 3)

print("D=2 →", conv2d_dilated(I, K, dilation=2))
# [[65.]]
# The sole output samples the 5 non-zero taps at (0,0),(0,4),(2,2),(4,0),(4,4)
# → 1 + 5 + 13 + 21 + 25 = 65
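
The description of dilation as "inserting D−1 zeros between taps" can be checked directly: build the zero-filled kernel explicitly and run an ordinary, undilated cross-correlation with it. The sketch below assumes square kernels; `dilate_kernel` and `correlate_valid` are illustrative names, not library functions.

```python
import numpy as np

def dilate_kernel(kern: np.ndarray, dilation: int) -> np.ndarray:
    """Insert (dilation - 1) zeros between adjacent taps of a square kernel."""
    k = kern.shape[0]
    k_eff = dilation * (k - 1) + 1
    big = np.zeros((k_eff, k_eff), dtype=kern.dtype)
    big[::dilation, ::dilation] = kern  # original weights land every `dilation` cells
    return big

def correlate_valid(img: np.ndarray, kern: np.ndarray) -> np.ndarray:
    """Plain stride-1, no-padding cross-correlation."""
    kh, kw = kern.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

I = np.arange(1, 26, dtype=float).reshape(5, 5)
K = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)

K2 = dilate_kernel(K, dilation=2)
print(K2.shape)                 # (5, 5)
print(correlate_valid(I, K2))   # [[65.]]
```

The explicit 5×5 kernel gives the same 65 as the index-skipping loop, but it stores 25 numbers instead of 9, which is why the loop version never materialises the zeros.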

Dilation in PyTorch

PyTorch exposes dilation as a keyword on both F.conv2d and nn.Conv2d. The “same” padding recipe generalises cleanly: $P = D(K-1)/2$ for odd $K$ and stride 1.
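
For readers who want the missing step, the recipe follows from the output-size formula $O = \lfloor (I + 2P - D(K-1) - 1)/S \rfloor + 1$. A short derivation, assuming $S = 1$ so that the floor has no effect:

```latex
\begin{aligned}
O &= (I + 2P - D(K-1) - 1) + 1 \;=\; I + 2P - D(K-1) \\
O = I \;&\Longleftrightarrow\; 2P = D(K-1) \;\Longleftrightarrow\; P = \frac{D(K-1)}{2}
\end{aligned}
```

When $K$ is odd, $K - 1$ is even, so $P$ is an integer for every dilation $D$.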

Dilation in PyTorch
🐍dilation_pytorch.py
1import torch

PyTorch core.

2import torch.nn.functional as F

Functional API — we'll call F.conv2d with an explicit weight tensor.

4I = torch.arange(1, 26, dtype=torch.float32).view(1,1,5,5)

Reshape the 1..25 vector to NCHW = (1, 1, 5, 5).

EXECUTION STATE
I.shape = torch.Size([1, 1, 5, 5])
5K = kernel reshaped to (1,1,3,3)

PyTorch needs 4-D weight tensors: (C_out, C_in, kH, kW).

10y1 = F.conv2d(I, K, dilation=1)

Baseline: plain convolution, output shape (1,1,3,3). Matches the shape of our NumPy D=1 result.

EXECUTION STATE
⬇ arg: dilation=1 = No gaps between taps. This is the default.
⬆ y1.shape = torch.Size([1, 1, 3, 3])
14y2 = F.conv2d(I, K, dilation=2)

Dilated convolution. The result is the same as if (D−1)=1 zero had been inserted between each pair of adjacent taps in the kernel before running the cross-correlation; the stored weight tensor itself stays 3×3.

EXECUTION STATE
⬇ arg: dilation=2 = One zero between taps → 5×5 effective kernel, still 9 learned weights.
⬆ y2.shape = torch.Size([1, 1, 1, 1])
16print(y2.item())

`.item()` pulls the value out of a single-element tensor as a Python float (y2 has shape (1, 1, 1, 1), so it qualifies). Here it's 65.0, exactly matching our hand-traced NumPy computation of 1+5+13+21+25.

EXECUTION STATE
📚 .item() = Converts a single-element tensor to a native Python number. Raises if the tensor has more than one element.
→ value = 65.0
20y2_same = F.conv2d(I, K, dilation=2, padding=2)

Preserve the spatial size under dilation by padding with (k_eff − 1)/2 = 2. General rule: p = d·(k−1)/2 for odd k.

EXECUTION STATE
⬇ arg: padding=2 = Pads 2 pixels on each side → input becomes effectively 9×9 before the 5×5-effective kernel slides.
⬆ y2_same.shape = torch.Size([1, 1, 5, 5])
→ 'same' recipe for dilation = p = d·(k−1)/2. k=3,d=1 → p=1. k=3,d=2 → p=2. k=3,d=4 → p=4.
import torch
import torch.nn.functional as F

I = torch.arange(1, 26, dtype=torch.float32).view(1, 1, 5, 5)
K = torch.tensor([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]], dtype=torch.float32).view(1, 1, 3, 3)

# Dilation = 1 → ordinary conv
y1 = F.conv2d(I, K, dilation=1)
print("D=1 shape:", y1.shape)         # (1, 1, 3, 3)

# Dilation = 2 → 5x5 effective kernel on 5x5 input → 1x1 output
y2 = F.conv2d(I, K, dilation=2)
print("D=2 shape:", y2.shape)         # (1, 1, 1, 1)
print("D=2 value:", y2.item())        # 65.0

# Dilation = 2 WITH padding to keep spatial size
#   k_eff = 2*(3-1)+1 = 5, so p = (5-1)/2 = 2
y2_same = F.conv2d(I, K, dilation=2, padding=2)
print("D=2 same:", y2_same.shape)     # (1, 1, 5, 5)

Quick Check

A 3×3 convolution with dilation=4 is applied to a feature map. What is the effective kernel size?


Interactive Parameter Playground

Now that each knob is understood in isolation, try combining them. The playground below runs a live convolution with a fixed 3×3 kernel on a fixed 5×5 input; move the stride, padding, and dilation sliders to see the effective kernel, the output shape, and the numeric output update in real time:

Convolution Parameter Playground

Move the sliders to see how stride, padding, and dilation reshape the output. Yellow squares show which input positions the kernel samples at the current step.

Initial slider state:

  • Stride S = 1: how far the kernel jumps each step
  • Padding P = 0: border added on each side
  • Dilation D = 1: gap between kernel taps
  • Padding mode selector: active only when P > 0

effective kernel k_eff = D·(K−1) + 1 = 1·(3−1) + 1 = 3
O = ⌊(I + 2P − k_eff) / S⌋ + 1 = ⌊(5 + 2·0 − 3) / 1⌋ + 1 = 3

Input (5×5)
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25

Kernel (3×3)
1 0 1
0 1 0
1 0 1

Fixed 3×3 kernel. With dilation, non-zero taps spread apart but the weights stay the same.

Output (3×3)
35 40 45
60 65 70
85 90 95

What to try

  • Set S=2, P=1 — the standard ResNet downsampling block; output is (5+2−3)/2+1 = 3.
  • Set D=2 with P=2 — “same” dilated conv; output shape matches the input.
  • Set D=2 with P=0 — output collapses to 1×1 because k_eff = 5 matches the input.
  • Toggle padding mode between zero / reflect / replicate and watch the corner cells change.
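
The shape predictions in the first three suggestions can be checked without the playground by evaluating the output-size formula directly. A minimal sketch; `conv_output_size` is an illustrative helper name, not a PyTorch function.

```python
def conv_output_size(i: int, k: int, p: int = 0, s: int = 1, d: int = 1) -> int:
    """Output length along one axis: floor((I + 2P - D(K-1) - 1) / S) + 1."""
    k_eff = d * (k - 1) + 1
    if i + 2 * p < k_eff:
        raise ValueError("effective kernel is larger than the padded input")
    return (i + 2 * p - k_eff) // s + 1

I, K = 5, 3  # the playground's fixed input and kernel sizes
print(conv_output_size(I, K))             # 3  (defaults S=1, P=0, D=1)
print(conv_output_size(I, K, p=1, s=2))   # 3  (S=2, P=1)
print(conv_output_size(I, K, p=2, d=2))   # 5  (D=2, P=2: "same")
print(conv_output_size(I, K, p=0, d=2))   # 1  (D=2, P=0: collapses to 1×1)
```

Integer division `//` plays the role of the floor, which only matters when the stride does not divide the numerator evenly.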

Receptive Field Arithmetic

The receptive field (RF) of a unit in a deep feature map is the size of the region in the input image that can influence its value. Stride, dilation, and depth all increase it. Araujo et al. (2019) [9] derive the following recursion — this is the formula you should commit to memory:

$$r_\ell \;=\; r_{\ell - 1} + (k_\ell - 1)\, d_\ell \prod_{i=1}^{\ell - 1} s_i, \qquad r_0 = 1$$

In words: each new layer adds $(k-1)\,d$ “input pixels” to the RF, scaled by the product of all previous strides (the cumulative downsampling, called the jump). Pooling layers obey the same formula; they just use different $k, s$ values.

Worked Example: three 3×3 convs, stride 1

With $k = 3, s = 1, d = 1$ at every layer: $r_0 = 1$, $r_1 = 1 + 2 = 3$, $r_2 = 3 + 2 = 5$, $r_3 = 5 + 2 = 7$. Three 3×3 convs “see” a 7×7 window, and two already see 5×5 ($r_2 = 5$). That is why VGG [10] stacks two 3×3s instead of using a single 5×5: same receptive field, fewer parameters (18 vs 25 per channel pair), more non-linearities.

Worked Example: WaveNet-style dilated stack

Four 1-D causal convolutions with $k = 2, s = 1$ and dilations $1, 2, 4, 8$ give: $r_1 = 1 + 1 \cdot 1 = 2$, $r_2 = 2 + 1 \cdot 2 = 4$, $r_3 = 4 + 1 \cdot 4 = 8$, $r_4 = 8 + 1 \cdot 8 = 16$. The receptive field doubles every layer. Stacking such blocks is how WaveNet [4] reaches audio context windows of thousands of samples with fewer than 20 layers.
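
Both worked examples can be reproduced by running the recursion in a loop. The sketch below tracks the jump (the product of earlier strides) alongside the receptive field; the function name `receptive_fields` and the `(k, s, d)` tuple layout are illustrative choices, not an established API.

```python
def receptive_fields(layers):
    """Return [r_1, ..., r_L] for a list of (k, s, d) layers, starting from r_0 = 1."""
    r, jump = 1, 1          # jump = product of the strides of all earlier layers
    history = []
    for k, s, d in layers:
        r += (k - 1) * d * jump   # this layer's contribution, scaled by the jump
        jump *= s                 # the layer's own stride only affects later layers
        history.append(r)
    return history

print(receptive_fields([(3, 1, 1)] * 3))                     # [3, 5, 7]
print(receptive_fields([(2, 1, d) for d in (1, 2, 4, 8)]))   # [2, 4, 8, 16]
```

Note the order inside the loop: the jump is updated after the receptive field, because a layer's stride only scales the contributions of the layers that follow it.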

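Receptive Field Growth

Interactive figure: see how the receptive field expands with each convolutional layer on a 7×7 input image. At the initial step (the input itself) the output size is 7×7, the receptive field is 1×1, and the RF coverage is 2.0% of the input.

Receptive Field Formula: RF_n = RF_{n−1} + (k − 1) × jump, where the jump is the product of the strides of all earlier layers (it equals 1 when every stride is 1).

With 3×3 kernels and stride 1: RF grows by 2 pixels per layer (1 → 3 → 5 → 7)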

Quick Check

A network has layers: conv3×3(s=1), conv3×3(s=1), pool2×2(s=2), conv3×3(s=1), conv3×3(s=1). What is the receptive field of the final output unit?


Common Design Patterns

The VGG pattern: stack of same-padded 3×3s

$K = 3, S = 1, P = 1, D = 1$ everywhere, punctuated by stride-2 or pool-2 every few layers. Simonyan & Zisserman [10] showed that this beats larger kernels because two stacked 3×3 layers have the same 5×5 receptive field but use $2 \cdot 3^2 C_\text{in} C_\text{out} = 18 C_\text{in} C_\text{out}$ parameters instead of $25 C_\text{in} C_\text{out}$, and they insert an extra non-linearity.
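
To put concrete numbers on the 18-versus-25 comparison, the weight counts can be tallied for a specific channel width. This sketch ignores biases, and the width of 64 channels is an arbitrary choice for illustration; `conv_weights` is a made-up helper name.

```python
def conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Number of weights (no bias) in one k x k convolution layer."""
    return k * k * c_in * c_out

C = 64  # hypothetical channel width, same for input and output
two_3x3 = 2 * conv_weights(3, C, C)
one_5x5 = conv_weights(5, C, C)
print(two_3x3, one_5x5, two_3x3 / one_5x5)   # 73728 102400 0.72
```

The ratio is 18/25 regardless of the channel width, as long as the input and output widths match across the layers being compared.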

The ResNet pattern: stride-2 conv for downsampling

He et al. [6] downsample with $K = 3, S = 2, P = 1$ (halves resolution, still “same-ish” at stride-2) and rely on 1×1 convs inside the bottleneck to change channel counts cheaply. The pool-free downsampling preserves gradient flow through residual connections.

The WaveNet pattern: exponentially dilated causal convolutions

Dilation schedule $1, 2, 4, \ldots, 2^{L-1}$ with $K = 2, S = 1$ yields $r_L = 2^L$. $L = 10$ layers already reach 1,024 audio samples of context [4]. Causality (only left-hand taps) is orthogonal and implemented via asymmetric padding.
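
The asymmetric padding can be illustrated with a single 1-D layer in NumPy: pad $D(K-1)$ zeros on the left and none on the right, so that each output sample depends only on the current and earlier inputs. This is a sketch of the idea rather than WaveNet's actual implementation (it has no gating, residual connections, or learned weights), and `causal_dilated_conv1d` is an illustrative name.

```python
import numpy as np

def causal_dilated_conv1d(x: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """y[t] = sum_a w[a] * x[t - (k - 1 - a) * dilation], with zeros before t = 0."""
    k = len(w)
    left = dilation * (k - 1)                  # pad on the left only
    xp = np.concatenate([np.zeros(left), x])   # no right padding, so no look-ahead
    y = np.zeros(len(x))
    for t in range(len(x)):
        for a in range(k):
            y[t] += w[a] * xp[t + a * dilation]
    return y

x = np.zeros(16)
x[0] = 1.0               # unit impulse at t = 0
w = np.ones(2)           # k = 2, all-ones weights so the reach is easy to track
y = x
for d in (1, 2, 4, 8):
    y = causal_dilated_conv1d(y, w, d)

print(len(y))            # 16: the left padding preserves the length
print(y)                 # all ones: the impulse reaches exactly 16 time steps
```

After four layers the impulse at t = 0 has spread to t = 0 through t = 15 and no further, which is the 16-sample receptive field predicted by $r_L = 2^L$ seen from the input side.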

The DeepLab pattern: atrous spatial pyramid pooling

Apply several parallel 3×3 dilated convs with dilations $D \in \{6, 12, 18, 24\}$ to the same feature map, then fuse. Each branch captures a different scale; together they cover a wide range of context without needing an image pyramid [5].


Summary

| Knob | Effect on output size | Effect on receptive field | Parameter cost |
|---|---|---|---|
| Stride S | Divides by S | Multiplies downstream contributions by S (via the 'jump') | None |
| Padding P | Adds 2P | No direct effect | None |
| Dilation D | Subtracts D·(K−1) instead of (K−1) | Multiplies (K−1) contribution by D | None |
| Kernel K | Subtracts K−1 (or k_eff−1) | Adds (K−1) at this layer | Scales as K²·C_in·C_out |

Commit these to memory

  1. Master formula: $O = \lfloor (I + 2P - D(K-1) - 1)/S \rfloor + 1$
  2. Same padding recipe: $P = D(K-1)/2$ for odd $K$ and stride 1 preserves spatial size.
  3. Receptive field recursion: $r_\ell = r_{\ell-1} + (k_\ell - 1)\, d_\ell \prod_{i<\ell} s_i$
  4. Stride subsamples, padding preserves size, dilation enlarges the view for free.

Exercises

Conceptual

  1. Compute by hand the output spatial size of nn.Conv2d(64, 128, kernel_size=5, padding=2, stride=1, dilation=1) applied to a 56×56 feature map.
  2. Repeat with dilation=2, padding=4. What stays the same; what changes?
  3. Two stacked 3×3 convolutions (stride 1) have the same receptive field as what single k×k kernel? How many fewer parameters do the two 3×3s use, per input-output channel pair?
  4. Why does padding='same' raise an error when stride>1 in PyTorch? (Hint: the master formula is no longer invertible to a single integer $P$.)

Hints

  1. $O = \lfloor (56 + 4 - 5)/1 \rfloor + 1 = 56$, i.e. “same” with K=5.
  2. k_eff = 2·(5−1) + 1 = 9, and $O = \lfloor (56 + 8 - 9)/1 \rfloor + 1 = 56$, so the output is still 56×56. Same shape and parameter count; larger receptive field.
  3. 5×5. The two 3×3s use 2·9 = 18 weights against 25, i.e. 7 fewer (18/25 = 72% of the 5×5 count).
  4. With stride > 1 the output can no longer equal the input size, and the usual target $H_\text{out} = \lceil H_\text{in}/S \rceil$ needs a total padding that depends on $H_\text{in}$ and can be odd, so no single symmetric integer $P$ works in general.

Coding

  1. Extend conv2d_dilated to accept stride and padding in addition to dilation. Verify it matches F.conv2d on random inputs to within floating-point tolerance (for example with np.allclose).
  2. Implement a function receptive_field(layers) where layers is a list of (k, s, d) tuples and the function returns the final RF using the recursion above. Test it against the three-conv worked example above (should give 7).
  3. Implement padding modes reflect and replicate in pure Python (no np.pad) and compare to the output of np.pad with mode='reflect' and mode='edge'.

Challenge

Reproduce the WaveNet dilated-causal-conv stack (dilations $1, 2, 4, \ldots, 512$, kernel size 2, 10 layers) in PyTorch and print the receptive field after each layer. Verify that it doubles each time. The “causal” part means you must pad on the left only; search the PyTorch docs for F.pad if you get stuck.


References

  1. Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. (Canonical derivation of the output-shape formula.)
  2. PyTorch documentation, torch.nn.Conv2d. https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html (Explicit output-size formula and padding_mode options.)
  3. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All-Convolutional Net. ICLR workshop. arXiv:1412.6806.
  4. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.
  5. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE TPAMI 40(4). arXiv:1606.00915.
  6. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. arXiv:1512.03385.
  7. Islam, M. A., Jia, S., & Bruce, N. D. B. (2020). How Much Position Information Do Convolutional Neural Networks Encode? ICLR. arXiv:2001.08248.
  8. Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR. arXiv:1511.07122.
  9. Araujo, A., Norris, W., & Sim, J. (2019). Computing Receptive Fields of Convolutional Neural Networks. Distill. https://distill.pub/2019/computing-receptive-fields/
  10. Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG). ICLR. arXiv:1409.1556.

In the next section we'll combine every knob from this chapter into a single production-grade conv2d implementation — including the im2col trick that turns a quadruple loop into a single matrix multiplication.
