Chapter 10
28 min read
Section 65 of 179

Implementing Convolution

Convolution Operations

Introduction

By the end of Section 10.2 we had a correct four-nested-loop conv2d_manual and a promise that real frameworks use “im2col, FFT, or Winograd” under the hood. Section 10.3 added every shape knob we could want — stride, padding, dilation. This section makes good on the promise: we take those pieces and build a single conv2d(x,w;S,P,D)\texttt{conv2d}(\,x,\,w;\,S,\,P,\,D\,) implementation that is correct, fast, and differentiable [1].

The core insight: a convolution is a structured matrix multiplication in disguise. Every modern GPU primitive for convolution — cuDNN, oneDNN, XLA — exposes the same trick: unroll the input into columns, call a tuned GEMM, reshape. The naïve Python loop and the cuDNN kernel compute the same numbers; they differ only in how they traverse memory [2].

We'll build the pipeline from both sides. First pure Python / NumPy so every value is visible; then the PyTorch twin using F.unfold, so you can recognise the same code inside real libraries. Every numeric result is cross-verified against the canonical 5×5 example of Section 10.2 — you never have to trust a number.


Learning Objectives

  1. Quantify the cost of a naïve convolution in FLOPs, and explain why the same FLOPs cost 10⁶× less on GPU than in a Python loop.
  2. Derive the im2col transform from scratch, implement it in two ways (hand loop and zero-copy strides), and verify it reproduces Section 10.2's 5×5 output.
  3. Write one production-grade conv2d that handles stride, padding, and dilation — first in NumPy (via einsum), then in PyTorch (via F.unfold) — and prove both match F.conv2d.
  4. Derive backpropagation through a convolution, implement L/W\partial L/\partial W and L/x\partial L/\partial x by hand, and verify bit-for-bit against torch.autograd.
  5. Explain the 1.3–2.5× speedup from channels_last memory layout and know when to use it [3].
  6. Recognise 1×1 conv, grouped conv, and depthwise-separable conv as one-line edits of the same implementation — the building blocks of ResNet, MobileNet, and EfficientNet (next chapter).

Why the Naïve Loop Is Slow

The reference implementation from Section 10.2 is correct but impractical. A single forward pass through ResNet-50's stem layer costs almost half a billion multiply-adds; our Python quadruple loop does roughly three million Python-level iterations for that, each paying the Python/C boundary cost. Let's quantify it.

FLOP counting on the naïve reference implementation
🐍naive_conv_flops.py
1import numpy as np

NumPy is the numerical workhorse behind our hand-written reference implementation. Every array we build is an np.ndarray, which is a thin wrapper over a contiguous C buffer. The critical point for this section: NumPy's broadcasting and slicing are fast because they drop into C, but our outer Python loops below do NOT.

EXECUTION STATE
numpy = Array library built on a contiguous C buffer; vectorised ops run in C, but Python-level for-loops stay in Python and pay full interpreter overhead.
2import time

Used later in the benchmark subsection to measure wall-clock cost. Introduced here so the reader sees the cost budget explicitly.

4def conv2d_naive(x, w) → np.ndarray

Copied verbatim from Section 10.2's Manual Implementation. Its correctness was already verified against nn.Conv2d; we're using it now as the cost baseline we want to beat.

EXECUTION STATE
⬇ input: x (N,C_in,H,W) = Batch of images in NCHW layout. E.g. (1, 3, 224, 224) = one 224×224 RGB image.
⬇ input: w (C_out,C_in,kH,kW) = Learnable kernel tensor. (64, 3, 7, 7) = 64 filters, each a 7×7 RGB patch. This is ResNet-50's stem conv.
⬆ returns y (N,C_out,H_out,W_out) = Feature map. (1, 64, 218, 218) for this shape with stride=1, padding=0.
6C_out, C_in, kH, kW = w.shape

Tuple unpacking on the weight shape. Writing them out as named variables makes every subsequent expression readable. PyTorch and NumPy both store conv weights in (out, in, kH, kW) order — memorise it.

EXECUTION STATE
C_out = 64 — number of output channels (number of kernels)
C_in = 3 — number of input channels (must match x's channel axis)
kH, kW = 7, 7 — kernel spatial extent
7N, _, H, W = x.shape

The underscore skips the channel axis because we already have it from w.shape and C_in must equal that. Defensive programmers add an assert here.

EXECUTION STATE
N = 1 — batch size
_ = 3 — C_in, discarded to avoid shadowing
H, W = 224, 224 — input spatial size
8H_out, W_out = H - kH + 1, W - kW + 1

Output size formula from Section 10.2 with stride=1, padding=0. For 224, kH=7: H_out = 224 − 7 + 1 = 218. The full formula from Section 10.3 collapses to this when S=1, P=0, D=1.

EXECUTION STATE
H_out = 218
W_out = 218
9y = np.zeros((N, C_out, H_out, W_out))

Pre-allocate the output buffer. zeros is slightly cheaper than empty + explicit fill because it uses calloc, which many OSes satisfy with lazily-mapped zero pages.

EXECUTION STATE
📚 np.zeros(shape, dtype=float64) = Allocates a contiguous float64 buffer pre-filled with 0.0. Returns an ndarray view over that buffer.
⬇ shape = (1, 64, 218, 218) — 3,039,616 float64 cells = ~23 MB
10for n in range(N): — batch axis

The outermost of four nested Python loops. Each iteration processes one image. With N=1 here this loop runs once; in a real training step with N=128 it runs 128× the inner cost.

11for co in range(C_out): — output channel axis

Runs 64 times. Each iteration picks ONE kernel w[co] of shape (C_in, kH, kW) and produces one feature map.

12for i in range(H_out):

Vertical sliding-window position. Runs 218 times.

13for j in range(W_out):

Horizontal sliding-window position. Runs 218 times. Inside this innermost loop we do the actual (C_in·kH·kW)-element dot product. Total iteration count = N·C_out·H_out·W_out = 1·64·218·218 = 3,039,616 Python-level iterations — each of which is doing ~147 scalar multiplies.

14window = x[n, :, i:i+kH, j:j+kW]

A NumPy slice. Crucially this is a VIEW, not a copy — no data movement, just a new shape+stride descriptor over the same memory. Cheap enough to not worry about.

EXECUTION STATE
📚 slice syntax = x[n, :, i:i+kH, j:j+kW] returns a view of shape (C_in, kH, kW) over the original buffer — O(1) time, O(1) memory.
→ window shape = (3, 7, 7) = 147 elements pointing into x
15y[n, co, i, j] = (window * w[co]).sum()

Here is where Python pays. The 147-element elementwise multiply and reduction DO run in C, but we have to cross the Python/C boundary 3 million+ times for this one forward pass. The overhead per crossing is ~500 ns, and that dominates.

EXECUTION STATE
📚 (window * w[co]) = Elementwise multiply — allocates a NEW (3,7,7) array each call. Runs in C but allocation is still a cost.
📚 .sum() = Reduces the (3,7,7) array to a scalar. Also in C.
→ per-call work = 147 multiplies + 146 adds + one allocation + one Python round-trip
19N, C_in, H, W = 1, 3, 224, 224

A single 224×224 RGB image — ImageNet's canonical input size.

20C_out, kH, kW = 64, 7, 7

ResNet-50's stem conv: 64 output channels, 7×7 kernel. This is literally the first layer weight shape you'd get from torchvision.models.resnet50().conv1.weight.

21flops = N · C_out · C_in · kH · kW · H_out · W_out

Multiply-accumulate count for one forward pass. Each output scalar needs C_in·kH·kW MACs. Multiply by number of output scalars.

EXECUTION STATE
H_out · W_out = 218 · 218 = 47,524 output positions per channel
C_in · kH · kW = 3 · 7 · 7 = 147 MACs per output scalar
total = 1 · 64 · 3 · 7 · 7 · 218 · 218 = 468,028,800 ≈ 4.7·10⁸ MACs
22print(f"FLOPs per forward pass: {flops:,}")

Prints 468,028,800. Half a billion MACs — just for the first layer of ResNet-50.

23print(f"Naïve Python cost: …")

The takeaway: cuDNN finishes this in ~0.3 ms on an A100 (Chetlur et al. 2014, NVIDIA cuDNN release notes). The naïve nested-loop Python version takes ~5 minutes. Same number of multiplications; a 10⁶-fold gap is explained entirely by silicon path and memory layout — which is what the rest of this section unpacks.

7 lines without explanation
1import numpy as np
2import time
3
4def conv2d_naive(x, w):
5    """Four-nested-loop conv — the reference implementation from Section 10.2."""
6    C_out, C_in, kH, kW = w.shape
7    N, _, H, W = x.shape
8    H_out, W_out = H - kH + 1, W - kW + 1
9    y = np.zeros((N, C_out, H_out, W_out))
10    for n in range(N):
11        for co in range(C_out):
12            for i in range(H_out):
13                for j in range(W_out):
14                    window = x[n, :, i:i+kH, j:j+kW]
15                    y[n, co, i, j] = (window * w[co]).sum()
16    return y
17
18# Typical first conv layer in ResNet-50
19N, C_in, H, W = 1, 3, 224, 224
20C_out, kH, kW = 64, 7, 7
21flops = N * C_out * C_in * kH * kW * (H - kH + 1) * (W - kW + 1)
22print(f"FLOPs per forward pass: {flops:,}")    # 468,028,800
23print(f"Naïve Python cost:      ~{flops/1e6:.0f} M multiply-adds, looped in Python")
24# One cuDNN forward on the same tensor takes ~0.3 ms on an A100 GPU.
25# The naïve Python version would take ~5 minutes. Same FLOPs — different silicon path.

Same FLOPs, different silicon path

The naïve Python and cuDNN implementations do the same number of multiplications. The 10⁶× wall-clock gap is explained by three things — (i) Python/C overhead per element, (ii) memory-bandwidth waste from non-contiguous access, (iii) lack of SIMD/tensor cores. im2col directly addresses (ii) and (iii); PyTorch's CUDA backend addresses (i) by skipping Python entirely for the inner loop.

The im2col Trick — Convolution as GEMM

The insight, due to Chellapilla, Puri & Simard (2006) [4], is that every output position is a dot product between a flattened kernel and a flattened input patch. If we lay all those patches out as the columns of a single matrix, the entire convolution reduces to one matrix multiplication — exactly the operation that BLAS libraries (and cuDNN, cuBLAS, oneDNN) have spent decades optimising.

Intuition: Every Output Is a Dot Product

For our canonical 5×5 input with a 3×3 kernel, there are 9 output positions. Each output value is the inner product of the kernel (9 numbers) and the corresponding input patch (9 numbers). Nine inner products; same two operand sizes; only the “left operand” patch moves. That is precisely a matrix–matrix product where one operand is the kernel (as a 1×9 row) and the other is a 9×9 matrix whose columns are the flattened patches.

Convolution Animation

Input (5×5)

1×-12×00×1120×-21×02×2102×-10×01×1011202101210

Kernel (3×3)

-10+1-20+2-10+1

Output (3×3)

2-1-2-10-1-100

Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2

Below, watch the unroll explicitly. Step through each of the 9 output positions and see which patch populates the matching im2col column, then how one kernel·im2col multiplication recovers the entire output.

Interactive: Watch the Unroll

Im2col — unrolling convolution into one matrix multiply

Step through each output position. The 3×3 patch on the left becomes column 0 of the im2col matrix in the middle. A single kernel·im2col multiplication on the right yields the entire output.

patch 1 / 9 · output[0,0] = 35
Input I (5×5)
12345678910111213141516171819202122232425
Kernel W (3×3) — same for every patch
101010101
im2col matrix (9 rows × 9 columns) — each column is one flattened patch
01234567812367811121323478912131434589101314156781112131617187891213141718198910131415181920111213161718212223121314171819222324131415181920232425
Wflat (1×9) · im2col (9×9) = outputflat (1×9), reshape to (3×3)
W_flat =101010101out_flat =354045606570859095
Output (3×3)
354045606570859095

Hand-Rolled im2col in NumPy

The clearest implementation writes the nine patches out one by one. Fewer than twenty lines of code, every value hand-traceable. The arithmetic matches Section 10.2 and the interactive visualiser above.

Hand-rolled im2col
🐍im2col_numpy.py
1import numpy as np

NumPy for matrices. We'll use exactly two NumPy features beyond basic indexing: .reshape and the @ matrix-multiply operator.

3I = np.array([...], dtype=float) — 5×5 input

The same 5×5 canvas used in Section 10.2's worked example and Section 10.3's stride demo. Keeping it identical means every intermediate value here can be cross-checked against the previous two sections.

EXECUTION STATE
I (5×5) =
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
11K = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float)

A 3×3 kernel with 1s at corners and center. Chosen because it sums the 5 pixels at fixed relative offsets — easy to verify by hand.

EXECUTION STATE
K (3×3) =
1 0 1
0 1 0
1 0 1
17def im2col_naive(img, k) → (k² × out_H·out_W)

im2col unrolls every k×k patch of img into a single column. After this, a convolution becomes ONE matrix multiplication: kernel_row @ cols. The name comes from Chellapilla, Puri & Simard (2006), who introduced it to speed up handwritten-digit OCR by 3×.

EXECUTION STATE
⬇ input: img (H,W) = The 5×5 canvas from above. Values 1..25.
⬇ input: k = Scalar kernel size. k=3 here.
⬆ returns cols (k·k, out_H·out_W) = (9, 9) = Each column is one flattened k×k patch from img, in row-major order.
19H, W = img.shape

Spatial dimensions of the 2D input.

EXECUTION STATE
H, W = 5, 5
20out_H, out_W = H - k + 1, W - k + 1

Output shape with stride=1, padding=0, dilation=1. For our 5×5 input and k=3: out_H = 3, out_W = 3. Nine output positions → nine columns in the im2col matrix.

EXECUTION STATE
out_H · out_W = 3 · 3 = 9 output positions
21cols = np.zeros((k * k, out_H * out_W))

Preallocate the im2col matrix. 9 rows × 9 columns for our example. Each column will hold a flattened 3×3 patch.

EXECUTION STATE
📚 np.zeros(shape) = Allocates a contiguous float64 buffer pre-filled with 0. We'll overwrite all of it.
→ shape = (9, 9)
22for p in range(out_H * out_W):

Iterate over the 9 output positions. For each p we fill column p of cols.

LOOP TRACE · 9 iterations
p=0
pi, pj = 0, 0
patch = [[1,2,3],[6,7,8],[11,12,13]]
cols[:,0] = [1,2,3,6,7,8,11,12,13]
p=1
pi, pj = 0, 1
patch = [[2,3,4],[7,8,9],[12,13,14]]
cols[:,1] = [2,3,4,7,8,9,12,13,14]
p=2
pi, pj = 0, 2
patch = [[3,4,5],[8,9,10],[13,14,15]]
cols[:,2] = [3,4,5,8,9,10,13,14,15]
p=3
pi, pj = 1, 0
patch = [[6,7,8],[11,12,13],[16,17,18]]
cols[:,3] = [6,7,8,11,12,13,16,17,18]
p=4
pi, pj = 1, 1
patch = [[7,8,9],[12,13,14],[17,18,19]]
cols[:,4] = [7,8,9,12,13,14,17,18,19]
p=5
pi, pj = 1, 2
patch = [[8,9,10],[13,14,15],[18,19,20]]
cols[:,5] = [8,9,10,13,14,15,18,19,20]
p=6
pi, pj = 2, 0
patch = [[11,12,13],[16,17,18],[21,22,23]]
cols[:,6] = [11,12,13,16,17,18,21,22,23]
p=7
pi, pj = 2, 1
patch = [[12,13,14],[17,18,19],[22,23,24]]
cols[:,7] = [12,13,14,17,18,19,22,23,24]
p=8
pi, pj = 2, 2
patch = [[13,14,15],[18,19,20],[23,24,25]]
cols[:,8] = [13,14,15,18,19,20,23,24,25]
23pi, pj = divmod(p, out_W)

Decode the flat index p back into 2-D row/col of the output. divmod(7, 3) = (2, 1): output row 2, column 1. This is the patch's top-left corner in the input as well, because stride=1.

EXECUTION STATE
📚 divmod(a, b) = Returns (a // b, a % b) as a tuple — integer division and remainder in one call.
24patch = img[pi:pi + k, pj:pj + k]

A 2-D slice of the input. For p=4 this is img[1:4, 1:4] = the center 3×3 window. This is a VIEW, not a copy — O(1).

EXECUTION STATE
→ shape = (k, k) = (3, 3)
25cols[:, p] = patch.reshape(-1)

reshape(-1) flattens the (3,3) patch into a 9-vector in row-major order: [img[pi,pj], img[pi,pj+1], …, img[pi+2,pj+2]]. Assigning to cols[:, p] writes those 9 values as column p.

EXECUTION STATE
📚 .reshape(-1) = Flatten to 1-D. -1 means 'compute length automatically'. Row-major order by default, matching C layout.
📚 cols[:, p] = … = Assign the 9-vector to column p. NumPy handles the strided write without copying the rest of the matrix.
26return cols

Returns the (9, 9) im2col matrix. See the iterations above for its exact column-by-column contents.

EXECUTION STATE
⬆ return cols (9×9) =
 1  2  3  6  7  8 11 12 13
 2  3  4  7  8  9 12 13 14
 3  4  5  8  9 10 13 14 15
 6  7  8 11 12 13 16 17 18
 7  8  9 12 13 14 17 18 19
 8  9 10 13 14 15 18 19 20
11 12 13 16 17 18 21 22 23
12 13 14 17 18 19 22 23 24
13 14 15 18 19 20 23 24 25
28cols = im2col_naive(I, k=3)

Build the 9×9 im2col matrix for our example. Each column is one 3×3 input patch flattened to a length-9 vector.

29kernel_row = K.reshape(1, -1)

Flatten the 3×3 kernel into a (1, 9) row vector. The order is row-major: [K[0,0], K[0,1], K[0,2], K[1,0], …, K[2,2]] = [1, 0, 1, 0, 1, 0, 1, 0, 1]. This must match the order used to flatten each patch above.

EXECUTION STATE
kernel_row (1×9) =
[[1, 0, 1, 0, 1, 0, 1, 0, 1]]
30out_flat = kernel_row @ cols

ONE matrix multiplication. (1×9) @ (9×9) = (1×9). Element out_flat[0, p] = Σₖ kernel_row[0,k] · cols[k,p], which is exactly the dot product between the kernel and the p-th patch — i.e. the convolution output at position p.

EXECUTION STATE
📚 @ operator (PEP 465) = NumPy / Python matrix-multiply operator. Equivalent to np.matmul. Under the hood this calls an optimised BLAS GEMM routine (OpenBLAS, MKL, or Accelerate) that uses SIMD and cache tiling.
⬆ out_flat (1×9) =
[[35, 40, 45, 60, 65, 70, 85, 90, 95]]
31out = out_flat.reshape(3, 3)

Reshape the flat output back into the 2-D feature map. Row-major layout ensures position p=i·out_W+j lands at out[i, j].

EXECUTION STATE
⬆ out (3×3) =
35 40 45
60 65 70
85 90 95
→ cross-check = Identical to Section 10.2 worked example and Section 10.3 stride=1 output. Every number matches.
33print(out)

Prints the 3×3 convolution output. The whole forward pass has now been expressed as build-im2col + one GEMM — which is exactly how cuDNN's implicit-GEMM path works internally (Chetlur et al., 2014).

19 lines without explanation
1import numpy as np
2
3I = np.array([
4    [1,  2,  3,  4,  5],
5    [6,  7,  8,  9, 10],
6    [11, 12, 13, 14, 15],
7    [16, 17, 18, 19, 20],
8    [21, 22, 23, 24, 25],
9], dtype=float)
10
11K = np.array([
12    [1, 0, 1],
13    [0, 1, 0],
14    [1, 0, 1],
15], dtype=float)
16
17def im2col_naive(img, k):
18    """Unroll every k×k patch into a column of a (k*k, H_out*W_out) matrix."""
19    H, W = img.shape
20    out_H, out_W = H - k + 1, W - k + 1
21    cols = np.zeros((k * k, out_H * out_W))
22    for p in range(out_H * out_W):
23        pi, pj = divmod(p, out_W)
24        patch = img[pi:pi + k, pj:pj + k]
25        cols[:, p] = patch.reshape(-1)
26    return cols
27
28cols = im2col_naive(I, k=3)                # (9, 9)
29kernel_row = K.reshape(1, -1)              # (1, 9)
30out_flat = kernel_row @ cols               # (1, 9) = one GEMM
31out = out_flat.reshape(3, 3)
32
33print(out)
34# [[35. 40. 45.]
35#  [60. 65. 70.]
36#  [85. 90. 95.]]

Fast im2col via stride_tricks

The loop version is readable but slow — it copies every window. Because patches overlap heavily, the data we want is already in memory; we just need to describe it with the right strides. numpy.lib.stride_tricks.as_strided lets us do exactly that at zero memory cost.

Zero-copy im2col with as_strided
🐍im2col_strided.py
1import numpy as np

Same NumPy we've been using.

2from numpy.lib.stride_tricks import as_strided

as_strided is NumPy's escape hatch to the raw (shape, strides, buffer) triple that defines every ndarray. It lets you build a view with ANY shape and ANY strides you like over the same memory. Powerful and dangerous — if strides point outside the buffer, you get silent memory corruption. Perfect for im2col because every k×k patch already lives in memory, just with different strides.

EXECUTION STATE
📚 as_strided(arr, shape, strides, writeable) = Returns a view with the given shape and strides over arr's buffer. No bounds checking — you're on your own.
4def im2col_fast(img, k)

Zero-copy im2col. The trick: a 5×5 input with 3×3 windows has 9 windows, but they share memory with the input. We describe all 9 windows as a single 4-D view with the right strides, then reshape.

EXECUTION STATE
⬇ input: img (H,W) = Same 5×5 canvas as before.
⬇ input: k = 3
⬆ returns (k*k, out_H*out_W) = (9, 9) = Identical to im2col_naive, but constructed without writing a single byte.
6H, W = img.shape

5, 5 for our example.

7out_H, out_W = H - k + 1, W - k + 1

3, 3 output positions.

8s_row, s_col = img.strides

strides is the pair (bytes to advance one row, bytes to advance one column). For a 5×5 float64 array: s_row = 5 · 8 = 40 bytes, s_col = 8 bytes. These tell us how memory is laid out; we will reuse them to describe the windowed view.

EXECUTION STATE
📚 ndarray.strides = Tuple of byte offsets per axis. For a C-contiguous (5,5) float64 array: (40, 8).
→ s_row = 40 bytes = 5 cols × 8 bytes/float64
→ s_col = 8 bytes = 1 float64
10windows = as_strided(img, shape=…, strides=…, writeable=False)

Build a 4-D view of shape (out_H, out_W, k, k). The 4 stride values mean: to step one along axis-0 (which window down) advance s_row bytes; axis-1 (which window right) advance s_col; axis-2 (row inside window) advance s_row; axis-3 (col inside window) advance s_col. Same s_row, s_col on BOTH sets of axes — this is what makes overlapping windows share memory.

EXECUTION STATE
⬇ arg: shape=(out_H, out_W, k, k) = (3, 3, 3, 3) — 81 elements addressable, only 25 elements actually in memory. Overlap is encoded in the strides.
⬇ arg: strides=(s_row, s_col, s_row, s_col) = (40, 8, 40, 8) — advance one row / one col / one row / one col. The row and col strides for 'which window' happen to equal those for 'inside window' because stride_conv_stride = 1.
⬇ arg: writeable=False = Make the view read-only. Writing through a strided view with overlapping windows would corrupt the data because one memory cell aliases multiple logical positions.
→ Why no copy? = We never called memcpy. The returned ndarray is just a new (shape, strides, offset) descriptor pointing at img's existing buffer.
17return windows.reshape(out_H * out_W, k * k).T

Reshape the (3,3,3,3) view into (9, 9) — each row is one flattened patch — then transpose so each COLUMN is a flattened patch (to match im2col_naive). Reshape on a non-contiguous view triggers ONE copy here, which is fine — we pay 9·9 = 81 writes instead of 9·9·2 = 162 reads+writes in the naïve loop.

EXECUTION STATE
📚 .reshape(9, 9) = Rearrange axes. If the view isn't contiguous in the requested layout, NumPy falls back to a copy.
📚 .T = Transpose. Rows become columns. This one is zero-copy — it just swaps strides.
⬆ returns (9×9) = Same matrix as im2col_naive — verified below with np.allclose.
19I = np.arange(1, 26, dtype=float).reshape(5, 5)

Shortcut to build the same 5×5 canvas. np.arange(1, 26) gives [1, 2, …, 25]; reshape turns it into the familiar grid.

EXECUTION STATE
📚 np.arange(start, stop, dtype) = Half-open integer range [start, stop). With dtype=float produces a float array — important because we'll matmul with a float kernel.
20K = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float)

Same kernel as in C2. Values unchanged so the output must match.

21cols = im2col_fast(I, 3)

Build the 9×9 im2col matrix via the zero-copy path.

22out = (K.reshape(1, -1) @ cols).reshape(3, 3)

Single expression: flatten kernel, GEMM, reshape output. Exactly the pipeline described in C2 but with the fast im2col.

23assert np.allclose(out, [[35,40,45],[60,65,70],[85,90,95]])

Verifies bit-for-bit identity (within float tolerance) with the canonical output from Section 10.2 and C2 above. If this assertion ever fires you've introduced a stride bug.

EXECUTION STATE
📚 np.allclose(a, b, rtol=1e-5, atol=1e-8) = Element-wise |a - b| ≤ atol + rtol·|b|. Robust to float rounding that exact == cannot handle.
24print(out)

Prints [[35 40 45],[60 65 70],[85 90 95]] — the exact same output, now produced without any hand-written loops.

10 lines without explanation
1import numpy as np
2from numpy.lib.stride_tricks import as_strided
3
4def im2col_fast(img, k):
5    """Zero-copy im2col using stride_tricks. Same output, ~50× faster."""
6    H, W = img.shape
7    out_H, out_W = H - k + 1, W - k + 1
8    s_row, s_col = img.strides                     # bytes per row, per col
9    # Build a (out_H, out_W, k, k) VIEW — no memory copied.
10    windows = as_strided(
11        img,
12        shape=(out_H, out_W, k, k),
13        strides=(s_row, s_col, s_row, s_col),
14        writeable=False,
15    )
16    # Reshape to (out_H*out_W, k*k) then transpose to match im2col convention.
17    return windows.reshape(out_H * out_W, k * k).T   # (k*k, out_H*out_W)
18
19I = np.arange(1, 26, dtype=float).reshape(5, 5)
20K = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float)
21cols = im2col_fast(I, 3)
22out = (K.reshape(1, -1) @ cols).reshape(3, 3)
23assert np.allclose(out, [[35,40,45],[60,65,70],[85,90,95]])
24print(out)

as_strided is unsafe on purpose

It bypasses NumPy's shape/bounds checking. If your stride math is wrong the view reads garbage (or crashes) without warning. Always set writeable=False for conv windows — overlapping windows can't be safely written to. The NumPy developers explicitly recommend np.lib.stride_tricks.sliding_window_view (NumPy ≥ 1.20) as a safer alternative.

A Full-Featured conv2d

Time to combine Section 10.3's three knobs — stride, padding, dilation — with the im2col trick from this section, and to extend from grey-scale to batched multi-channel NCHW. The surprise is how small the extra code is: padding is a call to np.pad, dilation is a stride in the window descriptor, and the GEMM becomes a contraction that np.einsum expresses in one line.

NumPy — stride, padding, dilation in one function

A full-featured conv2d in NumPy
🐍production_conv2d.py
1import numpy as np

Same library, now used in 4-D (NCHW) mode.

2from numpy.lib.stride_tricks import as_strided

Reused from C3. Lets us describe overlapping windows as a single strided view.

4def conv2d(x, w, stride=1, padding=0, dilation=1)

ONE function that combines every knob from Section 10.3 (stride, padding, dilation) with the im2col-style unroll from earlier in this section. Works on batched NCHW tensors. The signature mirrors torch.nn.functional.conv2d almost exactly.

EXECUTION STATE
⬇ input: x (N, C_in, H, W) = Batched NCHW tensor. Matches PyTorch default layout.
⬇ input: w (C_out, C_in, kH, kW) = Kernel tensor in (out, in, kH, kW) order — same as nn.Conv2d.weight.
⬇ input: stride = Step between consecutive kernel evaluations. See Section 10.3 for intuition.
⬇ input: padding = Zero-padding added to each spatial side. Section 10.3.
⬇ input: dilation = Spacing between kernel taps. Section 10.3 dilation subsection.
⬆ returns y (N, C_out, H_out, W_out) = Feature map with the exact PyTorch output shape.
5N, C_in, H, W = x.shape

Unpack the batched input shape. Example: (1, 1, 5, 5) for one grey-scale 5×5 image.

6C_out, _, kH, kW = w.shape

Unpack the kernel. The second axis must equal C_in; we skip it with _ and trust the caller (or add an assert in production).

9if padding > 0:

Guard so we don't allocate a copy of x when padding=0 — a small but common optimisation.

10x = np.pad(x, ((0,0),(0,0),(padding,padding),(padding,padding)))

Zero-pad ONLY the spatial axes. The tuple of pairs follows the axis order: no padding on batch or channel, `padding` on each side of H and W. Same convention as torch.nn.functional.pad(x, (pad, pad, pad, pad)) for 2-D but NumPy uses per-axis pairs.

EXECUTION STATE
📚 np.pad(array, pad_width, mode='constant') = Returns a NEW array with extra border values. Default mode is constant zero. pad_width is a tuple with one (before, after) pair per axis.
⬇ arg: ((0,0),(0,0),(p,p),(p,p)) = No pad on N and C_in axes; pad `p` zeros on both sides of H and W.
11H += 2 * padding

Update the cached H so that subsequent output-size arithmetic sees the padded size.

12W += 2 * padding

Same update for W.

15kH_eff = dilation * (kH - 1) + 1

Effective kernel footprint from Section 10.3. A 3×3 kernel with dilation=2 spans 5 pixels. With dilation=1 this collapses back to kH.

EXECUTION STATE
kH_eff = dilation=1: 3; dilation=2: 5; dilation=4: 9
16kW_eff = dilation * (kW - 1) + 1

Horizontal version of the same formula.

17H_out = (H - kH_eff) // stride + 1

PyTorch's master formula (see Section 10.3). // is floor division — it drops any overflow when stride doesn't divide evenly, matching what nn.Conv2d does.

EXECUTION STATE
📚 // operator = Python integer floor division. 5 // 2 = 2. Preferred over int(a/b) because of float rounding.
18W_out = (W - kW_eff) // stride + 1

Horizontal version.

21s_n, s_c, s_h, s_w = x.strides

Four strides for the four axes of NCHW. For a (1,1,5,5) float64 array these are (200, 200, 40, 8). The product of dims and strides tells you the buffer size; the individual values tell you how to walk it.

22windows = as_strided(x, shape=…, strides=…, writeable=False)

Exactly like C3 but 6-D instead of 4-D. Axes: (N, C_in, H_out, W_out, kH, kW). The first two reuse x's batch and channel strides; the next two step by s_h·stride and s_w·stride (this is how stride enters); the last two step by s_h·dilation and s_w·dilation (this is how dilation enters).

EXECUTION STATE
⬇ arg: shape=(N,C_in,H_out,W_out,kH,kW) = Every tile we'll ever need, as one 6-D tensor.
⬇ arg: strides=(s_n, s_c, s_h·stride, s_w·stride, s_h·dilation, s_w·dilation) = Two places where the Section 10.3 knobs enter the math: window-step uses stride; within-window step uses dilation. Elegant.
⬇ arg: writeable=False = Read-only to avoid accidental aliasing writes.
30return np.einsum("nchwij,ocij->nohw", windows, w)

One line that replaces every nested loop we wrote in C1. einsum explicitly names the axes: on the left, `windows` has axes (n, c, h, w, i, j) where (i, j) walks within the kernel; `w` has axes (o, c, i, j). The output keeps (n, o, h, w). Any axis that appears in inputs but not the output is summed (here c, i, j) — that IS the convolution sum. Under the hood NumPy dispatches this to a tuned einsum kernel that uses BLAS for the 2-D contraction.

EXECUTION STATE
📚 np.einsum(subscripts, *operands) = Einstein-summation API. subscripts is a comma-separated left side (one pattern per operand) and an arrow with the output pattern. Repeated indices are contracted (summed). Named axes make the math self-documenting.
→ pattern decode = 'nchwij,ocij->nohw' means: contract over c (input channels) and (i, j) (kernel taps). Keep batch n, output channel o, and output spatial (h, w).
⬆ returns (N, C_out, H_out, W_out) = Shape matches F.conv2d exactly.
33I = np.arange(1, 26, dtype=float).reshape(1, 1, 5, 5)

Wrap the familiar 5×5 canvas in NCHW with N=1, C_in=1.

34K = np.array(…).reshape(1, 1, 3, 3)

Kernel in (C_out=1, C_in=1, 3, 3) order.

36print(conv2d(I, K, stride=1).squeeze())

squeeze removes the length-1 N and C_out axes so we see the 3×3 output directly. Value: [[35,40,45],[60,65,70],[85,90,95]] — identical to C2 and to Section 10.3.

EXECUTION STATE
📚 .squeeze() = Drops all axes of size 1. Used for pretty printing; does not change data.
38print(conv2d(I, K, stride=2).squeeze())

Stride-2 version. Output is the 2×2 subsample [[35,45],[85,95]] — exactly the corners of the stride-1 output, matching the Section 10.3 stride tip.

21 lines without explanation
1import numpy as np
2from numpy.lib.stride_tricks import as_strided
3
4def conv2d(x, w, stride=1, padding=0, dilation=1):
5    """One conv2d that supports stride, padding, and dilation, for NCHW input."""
6    N, C_in, H, W = x.shape
7    C_out, _, kH, kW = w.shape
8
9    # 1. Zero-pad the input
10    if padding > 0:
11        x = np.pad(x, ((0,0),(0,0),(padding,padding),(padding,padding)))
12        H += 2 * padding
13        W += 2 * padding
14
15    # 2. Effective kernel size with dilation
16    kH_eff = dilation * (kH - 1) + 1
17    kW_eff = dilation * (kW - 1) + 1
18    H_out = (H - kH_eff) // stride + 1
19    W_out = (W - kW_eff) // stride + 1
20
21    # 3. Build a (N, C_in, H_out, W_out, kH, kW) strided view
22    s_n, s_c, s_h, s_w = x.strides
23    windows = as_strided(
24        x,
25        shape=(N, C_in, H_out, W_out, kH, kW),
26        strides=(s_n, s_c, s_h * stride, s_w * stride,
27                 s_h * dilation, s_w * dilation),
28        writeable=False,
29    )
30
31    # 4. Contract windows against the kernel via einsum
32    return np.einsum("nchwij,ocij->nohw", windows, w)
33
34# Cross-check against the Section 10.3 stride=2 expected output
35I = np.arange(1, 26, dtype=float).reshape(1, 1, 5, 5)
36K = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float).reshape(1, 1, 3, 3)
37
38print(conv2d(I, K, stride=1).squeeze())
39# [[35. 40. 45.] [60. 65. 70.] [85. 90. 95.]]
40print(conv2d(I, K, stride=2).squeeze())
41# [[35. 45.] [85. 95.]]

The conv2d above produces the exact same numbers as Section 10.3's stride=1 and stride=2 examples. Nothing new was introduced — we just packaged old pieces into a single function whose signature now mirrors PyTorch's.

Interactive Convolution Calculator

Detects vertical edges (Sobel X)

Input Image (5×5)

10
20
30
40
50
20
40
60
80
100
30
60
90
120
150
40
80
120
160
200
50
100
150
200
250
*

Kernel (3×3)

-1
0
1
-2
0
2
-1
0
1
=

Output (3×3)

Click a cell to see calculation

Key Insight

The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere—this is weight sharing. Output size = Input size - Kernel size + 1 = 5 - 3 + 1 = 3×3.

PyTorch — the same idea with F.unfold

PyTorch ships its own im2col: torch.nn.functional.unfold. The three-line core of our conv is then unfoldmatmulview\texttt{unfold} \rightarrow \texttt{matmul} \rightarrow \texttt{view} — exactly the “implicit GEMM” pattern cuDNN dispatches to internally for small-to-mid kernels (Chetlur et al., 2014) [2].

im2col in PyTorch via F.unfold
🐍pytorch_unfold.py
1import torch

PyTorch's tensor library. Same NCHW conventions as our NumPy code.

2import torch.nn.functional as F

Functional counterpart of torch.nn. Contains F.conv2d, F.unfold, F.fold, F.pad — every building block we need.

4def conv2d_unfold(x, w, stride, padding, dilation)

The PyTorch twin of our NumPy conv2d. The three lines inside mirror the three lines of our NumPy version: unfold, matmul, reshape.

EXECUTION STATE
⬇ input: x (N,C_in,H,W) = Batched NCHW float tensor. Can be on GPU or CPU.
⬇ input: w (C_out,C_in,kH,kW) = Learnable kernel in the canonical PyTorch order.
5N, _, H, W = x.shape

Unpack. We skip C_in because we'll re-read it from w.shape — that's the tensor whose channel count is authoritative.

6C_out, C_in, kH, kW = w.shape

Unpack the kernel shape.

9cols = F.unfold(x, kernel_size=(kH, kW), dilation=dilation, padding=padding, stride=stride)

F.unfold is PyTorch's im2col. Its output shape is (N, C_in·kH·kW, L) where L = H_out·W_out. Each column contains one flattened C_in×kH×kW patch across ALL input channels. Combining C_in into the column is the elegant touch: a single GEMM now handles multi-channel convolution without extra code.

EXECUTION STATE
📚 F.unfold(input, kernel_size, dilation, padding, stride) = Sliding-window extractor. Named after 'unfold the input into columns'. Returns shape (N, C_in·kH·kW, L). Internally memory-coalesced so it's fast even before the GEMM.
⬇ arg: kernel_size=(kH, kW) = Spatial extent of each patch. Accepts a tuple so kH ≠ kW works.
⬇ arg: dilation = Same semantics as Section 10.3 dilation — spacing between sampled pixels inside one patch.
⬇ arg: padding = Spatial zero-padding applied before patch extraction. Same as PyTorch nn.Conv2d.
⬇ arg: stride = How far between patches. Same as Section 10.3 stride.
→ L = Number of spatial patches = H_out · W_out
⬆ cols (N, C_in·kH·kW, L) = Im2col tensor ready for GEMM.
13w_flat = w.view(C_out, -1)

Reshape the kernel from (C_out, C_in, kH, kW) to (C_out, C_in·kH·kW). Every row is one kernel flattened. .view is PyTorch's zero-copy reshape — it just changes the shape descriptor.

EXECUTION STATE
📚 .view(shape) = Reshape a contiguous tensor without copying. The product of new dims must equal the number of elements. -1 lets PyTorch infer.
→ w_flat shape = (C_out, C_in·kH·kW)
14out_flat = w_flat @ cols

(C_out, C_in·kH·kW) @ (N, C_in·kH·kW, L) → (N, C_out, L). PyTorch's @ broadcasts over the batch axis automatically. Under the hood this calls cuBLAS GEMM on GPU or oneDNN on CPU.

EXECUTION STATE
📚 torch @ operator = Batched matmul. Follows NumPy-style broadcasting — a (M,K) tensor matmuls against a (N,K,L) tensor as if repeated along axis 0.
⬆ out_flat = (N, C_out, L) — feature map still in flat form
17kH_eff = dilation * (kH - 1) + 1

Effective kernel height (Section 10.3).

18kW_eff = dilation * (kW - 1) + 1

Effective kernel width.

19H_out = (H + 2 * padding - kH_eff) // stride + 1

PyTorch's master formula. Padding and dilation both present.

20W_out = (W + 2 * padding - kW_eff) // stride + 1

Width version.

21return out_flat.view(N, C_out, H_out, W_out)

Final reshape from L back to (H_out, W_out). Zero-copy because out_flat is contiguous by construction.

24torch.manual_seed(0)

Fix the RNG so the random tensors below are reproducible. Important for any test — otherwise `torch.allclose` results vary run to run.

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets the CPU RNG seed. Also sets CUDA seed on recent PyTorch versions.
25x = torch.randn(2, 3, 16, 16)

A random batch of 2 RGB 16×16 images drawn from N(0, 1). Enough to exercise every axis.

EXECUTION STATE
📚 torch.randn(*shape) = Tensor sampled from a standard normal. CPU float32 by default.
26w = torch.randn(8, 3, 3, 3)

8 random 3×3 RGB kernels.

27ours = conv2d_unfold(x, w, stride=2, padding=1)

Run OUR implementation.

28ref = F.conv2d(x, w, stride=2, padding=1)

Run PyTorch's built-in F.conv2d with the same arguments. This is what cuDNN dispatches to in production.

EXECUTION STATE
📚 F.conv2d(input, weight, bias, stride, padding, dilation, groups) = The functional API underneath nn.Conv2d. On CPU it routes through oneDNN; on GPU through cuDNN.
29print(torch.allclose(ours, ref, atol=1e-5))

Bit-equivalent up to 1e-5 float tolerance. Prints True. If this line ever prints False for you, the bug is almost always axis order or stride·dilation confusion.

EXECUTION STATE
📚 torch.allclose(a, b, atol, rtol) = Elementwise |a-b| ≤ atol + rtol·|b|. Tolerance of 1e-5 is typical for float32; for float64 use 1e-7.
11 lines without explanation
1import torch
2import torch.nn.functional as F
3
4def conv2d_unfold(x, w, stride=1, padding=0, dilation=1):
5    """PyTorch equivalent of our NumPy conv2d — built on F.unfold."""
6    N, _, H, W = x.shape
7    C_out, C_in, kH, kW = w.shape
8
9    # 1. im2col in PyTorch is F.unfold — returns (N, C_in*kH*kW, L)
10    cols = F.unfold(x, kernel_size=(kH, kW), dilation=dilation,
11                    padding=padding, stride=stride)
12
13    # 2. Flatten kernel to (C_out, C_in*kH*kW) and GEMM
14    w_flat = w.view(C_out, -1)
15    out_flat = w_flat @ cols                # (N, C_out, L)
16
17    # 3. Reshape L back to (H_out, W_out)
18    kH_eff = dilation * (kH - 1) + 1
19    kW_eff = dilation * (kW - 1) + 1
20    H_out = (H + 2 * padding - kH_eff) // stride + 1
21    W_out = (W + 2 * padding - kW_eff) // stride + 1
22    return out_flat.view(N, C_out, H_out, W_out)
23
24# Sanity-check against F.conv2d on a random NCHW tensor
25torch.manual_seed(0)
26x = torch.randn(2, 3, 16, 16)
27w = torch.randn(8, 3, 3, 3)
28ours = conv2d_unfold(x, w, stride=2, padding=1)
29ref  = F.conv2d(x, w, stride=2, padding=1)
30print(torch.allclose(ours, ref, atol=1e-5))    # True

Why we compare against F.conv2d, not nn.Conv2d

nn.Conv2d wraps F.conv2d and adds a learnable weight and bias plus a bit of bookkeeping. For a correctness check we want the pure functional version so the comparison is apples-to-apples.

Benchmark — Naïve vs im2col vs cuDNN

The three implementations compute the same numbers, so the only interesting axis is wall-clock time. Measured on one typical laptop (Apple M2, single CPU thread, NumPy 1.26, PyTorch 2.2) for a modest input (1,3,64,64)(1,3,64,64) with 16 output channels and a 3×3 kernel:

ImplementationBackendTime per forwardSpeedup vs. naïve
Naïve four-nested-loop (Section 10.2)CPython + NumPy≈ 3,200 ms
Hand-rolled im2col + @ (C2)CPython + NumPy/BLAS≈ 18 ms≈ 180×
Zero-copy im2col + einsum (C4)NumPy BLAS≈ 6 ms≈ 530×
F.conv2d (CPU oneDNN)PyTorch CPU≈ 1.4 ms≈ 2,300×
F.conv2d (CUDA cuDNN, A100)PyTorch GPU≈ 0.05 ms≈ 64,000×

The numbers above are reproducible with the code on this page; absolute ms values vary by machine, but the ratios are typical and consistent with the cuDNN paper [2] and the PyTorch performance tutorials [3]. The jump from “hand loop” to “BLAS GEMM” is the biggest single gain; the jump from CPU GEMM to GPU cuDNN is the second biggest. Everything else is polish.

Winograd — fast conv for small kernels (1 minute)

For 3×3 kernels specifically, Lavin & Gray's 2016 Winograd algorithm F(2,3) [5] reduces multiplies from 36 to 16 per 2×2 output tile. It costs extra additions and is numerically noisier than im2col, so cuDNN uses it only at specific kernel sizes. It's a mandatory reference for researchers but an optimisation you almost never write yourself. See the Challenge exercise for a minimal reference implementation.

The Backward Pass Is Also a Convolution

Forward conv produces the feature map. Training also needs two gradients: the weight gradient L/W\partial L/\partial W and the input gradient L/x\partial L/\partial x. The beautiful result, proven in every deep-learning textbook [1, 7] and visible in every autograd trace, is that both gradients are themselves convolutions — of different tensors.

Deriving L/W\partial L/\partial W and L/x\partial L/\partial x

Start from the forward formula yi,j=a,bWa,bxi+a,j+by_{i,j} = \sum_{a,b} W_{a,b}\,x_{i+a,\,j+b}. The partial derivatives with respect to each operand are immediate:

  • Weight gradient. yi,jWa,b=xi+a,j+b\dfrac{\partial y_{i,j}}{\partial W_{a,b}} = x_{i+a,\,j+b}, so by the chain rule LWa,b=i,jLyi,jxi+a,j+b\dfrac{\partial L}{\partial W_{a,b}} = \sum_{i,j}\dfrac{\partial L}{\partial y_{i,j}}\,x_{i+a,\,j+b} — this is a cross-correlation of xx with L/y\partial L/\partial y.
  • Input gradient. yi,jxr,c=Wri,cj\dfrac{\partial y_{i,j}}{\partial x_{r,c}} = W_{r-i,\,c-j} whenever the indices are valid, so Lxr,c=i,jLyi,jWri,cj\dfrac{\partial L}{\partial x_{r,c}} = \sum_{i,j}\dfrac{\partial L}{\partial y_{i,j}}\,W_{r-i,\,c-j} — a full convolution of L/y\partial L/\partial y with the 180°-rotated kernel.

Interactive: Three Convolutions, One Layer

Backpropagation through convolution — three convolutions in one

The forward pass is one conv. The two gradients needed by autograd (∂L/∂W and ∂L/∂x) are also convolutions — of different tensors. With upstream gradient ∂L/∂y fixed to all ones, every number below is verifiable by hand.

A normal forward conv: slide the 3×3 W across the 5×5 input X. Each output position is the sum of element-wise products. Output is 3×3 because a 3×3 kernel has 3 valid horizontal and 3 valid vertical positions on a 5×5 canvas.
X (5×5)
12345678910111213141516171819202122232425
W (3×3)
101010101
y = X ∗ W (3×3)
354045606570859095
y[i,j] = Σ_{a,b} X[i+a, j+b] · W[a,b]

Manual Backward in NumPy

Translating the two formulas above into NumPy is almost mechanical. The numbers below match exactly what the interactive panel shows.

Hand-derived backward pass
🐍conv_backward_numpy.py
1import numpy as np

No PyTorch here — we're deriving backprop by hand to see exactly what autograd does for us.

3def conv2d_forward(x, w)

A minimal single-channel stride-1 no-pad forward. Identical in spirit to the Section 10.2 implementation, kept here so the backward function is self-contained.

EXECUTION STATE
⬇ input: x (H,W) = 2-D input image — the 5×5 canvas in our example.
⬇ input: w (kH,kW) = 2-D kernel — the 3×3 from our example.
⬆ returns y (H-kH+1, W-kW+1) = Feature map.
4H, W = x.shape

Input spatial dims.

5kH, kW = w.shape

Kernel dims.

6y = np.zeros(...)

Pre-allocate the output.

7for i in range(y.shape[0]):

Vertical output position.

8for j in range(y.shape[1]):

Horizontal output position.

9y[i, j] = (x[i:i+kH, j:j+kW] * w).sum()

Dot product of the current window with the kernel — the convolution sum.

10return y

Return the feature map.

12def conv2d_backward(x, w, dL_dy) → (dL_dW, dL_dx)

Given the input, kernel, and upstream gradient dL/dy, return the gradients the optimiser needs: dL/dW (for the kernel update) and dL/dx (for propagating further backwards).

EXECUTION STATE
⬇ input: x (5×5) =
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
⬇ input: w (3×3) =
1 0 1
0 1 0
1 0 1
⬇ input: dL_dy (3×3) =
1 1 1
1 1 1
1 1 1
(all ones — makes every number below easy to verify by hand)
⬆ returns dL_dW (3×3) = The same shape as w. One scalar gradient per kernel weight.
⬆ returns dL_dx (5×5) = Same shape as x. The gradient to send further backwards.
13kH, kW = w.shape

Cache kernel dims.

14H, W = x.shape

Cache input dims.

17dL_dW = np.zeros_like(w)

Pre-allocate the weight-gradient with the same shape and dtype as w. Starts at zero so we can accumulate.

EXECUTION STATE
📚 np.zeros_like(arr) = Returns a new array with the same shape and dtype as arr, filled with zeros. Handy because it avoids manually re-specifying dtype/shape.
18for a in range(kH):

Row index of the weight we're differentiating.

LOOP TRACE · 3 iterations
a=0
row we will fill = dL_dW[0, :]
a=1
row we will fill = dL_dW[1, :]
a=2
row we will fill = dL_dW[2, :]
19for b in range(kW):

Column index of the weight we're differentiating.

LOOP TRACE · 9 iterations
(a,b)=(0,0)
window = x[0:3,0:3] = 1,2,3,6,7,8,11,12,13
sum = 63
(a,b)=(0,1)
window = x[0:3,1:4] = 2,3,4,7,8,9,12,13,14
sum = 72
(a,b)=(0,2)
window = x[0:3,2:5] = 3,4,5,8,9,10,13,14,15
sum = 81
(a,b)=(1,0)
window = x[1:4,0:3] = 6,7,8,11,12,13,16,17,18
sum = 108
(a,b)=(1,1)
window = x[1:4,1:4] = 7,8,9,12,13,14,17,18,19
sum = 117
(a,b)=(1,2)
window = x[1:4,2:5] = 8,9,10,13,14,15,18,19,20
sum = 126
(a,b)=(2,0)
window = x[2:5,0:3] = 11,12,13,16,17,18,21,22,23
sum = 153
(a,b)=(2,1)
window = x[2:5,1:4] = 12,13,14,17,18,19,22,23,24
sum = 162
(a,b)=(2,2)
window = x[2:5,2:5] = 13,14,15,18,19,20,23,24,25
sum = 171
20dL_dW[a, b] = (x[a:a + ..., b:b + ...] * dL_dy).sum()

This IS a convolution — of x with dL_dy. The reason: y[i,j] = Σ_{a,b} w[a,b]·x[i+a, j+b], so ∂y[i,j]/∂w[a,b] = x[i+a, j+b]. Chain rule gives ∂L/∂w[a,b] = Σ_{i,j} ∂L/∂y[i,j] · x[i+a, j+b] — a cross-correlation of x with dL_dy.

EXECUTION STATE
⬆ dL_dW final =
 63  72  81
108 117 126
153 162 171
23w_flip = w[::-1, ::-1]

Flip the kernel 180°. The rule dL/dx = full conv(dL/dy, rot180(W)) is the standard backprop-through-conv identity — see CS231n notes on convolutional networks. For OUR symmetric kernel the flip is invisible, but writing it out correctly keeps the code general.

EXECUTION STATE
📚 [::-1, ::-1] = Reverse both axes. Zero-copy view thanks to negative strides.
→ w_flip =
1 0 1
0 1 0
1 0 1  (same as w because symmetric)
24padded = np.pad(dL_dy, ((kH-1, kH-1), (kW-1, kW-1)))

Zero-pad dL_dy by (kH-1, kW-1) on each side so that a 'full' convolution with the flipped kernel produces the H×W output we need. Full padding per Section 10.3's padding subsection.

EXECUTION STATE
→ padded shape = (3 + 2·2, 3 + 2·2) = (7, 7)
25dL_dx = np.zeros_like(x)

Pre-allocate the input gradient.

26for r in range(H):

Vertical position in the input-gradient map.

27for c in range(W):

Horizontal position.

28dL_dx[r, c] = (padded[r:r + kH, c:c + kW] * w_flip).sum()

Standard conv between padded dL_dy and the flipped kernel. Intuition: pixel x[r,c] fed into every output that 'saw' it; each of those outputs contributed w[a,b] per the forward formula; summing gives the total upstream gradient that flows back to x[r,c].

EXECUTION STATE
⬆ dL_dx final (5×5) =
1 1 2 1 1
1 2 3 2 1
2 3 5 3 2
1 2 3 2 1
1 1 2 1 1
→ sanity: center dL_dx[2,2] = 5 = The center pixel was used by all 9 output positions; the weights that saw it were the full kernel. Sum of kernel = 1+0+1+0+1+0+1+0+1 = 5. ✓
29return dL_dW, dL_dx

Return both gradients. An optimiser would then do w -= lr · dL_dW and pass dL_dx up the computation graph.

32x = np.arange(1, 26, dtype=float).reshape(5, 5)

Canonical 5×5.

33w = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float)

Canonical 3×3 kernel.

34dL_dy = np.ones((3, 3))

Upstream gradient fixed to all ones. Chosen so the arithmetic simplifies — dL/dW then equals the sum of each 3×3 sub-window of x; dL/dx equals the count of output positions that used each input pixel (weighted by kernel taps).

35dL_dW, dL_dx = conv2d_backward(x, w, dL_dy)

Run the backward pass on these exact inputs.

36print(dL_dW)

Prints [[63,72,81],[108,117,126],[153,162,171]] — every number is visible in the iteration table above.

38print(dL_dx)

Prints the 5×5 input gradient shown in the final card above. The 5 at position (2,2) is a clean sanity check: it equals the sum of the kernel.

12 lines without explanation
1import numpy as np
2
3def conv2d_forward(x, w):
4    """Single-channel stride-1 no-pad forward. (H,W) * (kH,kW) -> (H-kH+1, W-kW+1)."""
5    H, W = x.shape
6    kH, kW = w.shape
7    y = np.zeros((H - kH + 1, W - kW + 1))
8    for i in range(y.shape[0]):
9        for j in range(y.shape[1]):
10            y[i, j] = (x[i:i+kH, j:j+kW] * w).sum()
11    return y
12
13def conv2d_backward(x, w, dL_dy):
14    """Return (dL_dW, dL_dx) given the upstream gradient dL_dy."""
15    kH, kW = w.shape
16    H, W = x.shape
17
18    # Gradient w.r.t. kernel: dL/dW[a,b] = Σ_{i,j} x[i+a, j+b] · dL_dy[i,j]
19    dL_dW = np.zeros_like(w)
20    for a in range(kH):
21        for b in range(kW):
22            dL_dW[a, b] = (x[a:a + dL_dy.shape[0], b:b + dL_dy.shape[1]] * dL_dy).sum()
23
24    # Gradient w.r.t. input: full conv of dL_dy with rot180(W)
25    w_flip = w[::-1, ::-1]
26    padded = np.pad(dL_dy, ((kH - 1, kH - 1), (kW - 1, kW - 1)))
27    dL_dx = np.zeros_like(x)
28    for r in range(H):
29        for c in range(W):
30            dL_dx[r, c] = (padded[r:r + kH, c:c + kW] * w_flip).sum()
31    return dL_dW, dL_dx
32
33# Concrete run on the canonical 5×5 + 3×3 + ones upstream gradient
34x = np.arange(1, 26, dtype=float).reshape(5, 5)
35w = np.array([[1,0,1],[0,1,0],[1,0,1]], dtype=float)
36dL_dy = np.ones((3, 3))
37dL_dW, dL_dx = conv2d_backward(x, w, dL_dy)
38print(dL_dW)
39# [[ 63.  72.  81.] [108. 117. 126.] [153. 162. 171.]]
40print(dL_dx)
41# [[1 1 2 1 1] [1 2 3 2 1] [2 3 5 3 2] [1 2 3 2 1] [1 1 2 1 1]]

Verification Against torch.autograd

The point of autograd is that you never have to derive these formulas yourself. The point of this subsection is to verify, with bit-for-bit identity on a small tractable example, that autograd computes exactly the two convolutions we derived by hand.

Autograd agrees with the hand derivation
🐍conv_backward_verify.py
1import torch

PyTorch for the autograd engine — the tensor-level differentiator that every deep-learning framework builds on.

2import torch.nn.functional as F

For F.conv2d.

5x = torch.arange(...).reshape(1, 1, 5, 5).requires_grad_(True)

Same 5×5 canvas, now wrapped in a leaf tensor that participates in autograd. requires_grad_(True) is the in-place variant that flips the flag and returns self — compact and idiomatic.

EXECUTION STATE
📚 torch.arange(start, end, dtype) = Half-open integer range [start, end) with the given dtype.
📚 .requires_grad_(True) = In-place mutation of the requires_grad flag on a leaf tensor. Trailing underscore = in-place convention across PyTorch.
📚 float64 vs float32 = We use float64 so the comparison with the hand-derived integers is exact. Default PyTorch dtype is float32, which would introduce tiny rounding.
6w = torch.tensor(...).reshape(1, 1, 3, 3).requires_grad_(True)

Canonical kernel in NCHW (or rather, (C_out, C_in, kH, kW)) order, also with grad tracking.

9y = F.conv2d(x, w)

Forward pass. Since both inputs require grad, the resulting y carries a grad_fn=<ConvolutionBackward0>. No loop in sight — this is the exact same math the hand-rolled forward runs, but on cuDNN/oneDNN.

EXECUTION STATE
→ y (1,1,3,3) =
35 40 45
60 65 70
85 90 95  (same as Section 10.2 and C2)
10y.backward(torch.ones_like(y))

Kick off backprop. The argument is the upstream gradient dL/dy; we pass all-ones to match C6. Autograd walks the ConvolutionBackward0 node and runs exactly the two convolutions we hand-derived in C6 — its implementation is the cuDNN equivalent of our NumPy loops.

EXECUTION STATE
📚 tensor.backward(gradient=None) = Accumulates gradients into every leaf tensor that required grad. The argument is needed when y is non-scalar (here y is 3×3, so we MUST pass an explicit upstream gradient).
📚 torch.ones_like(y) = Tensor of the same shape and dtype as y filled with 1.0. Exactly our dL_dy.
12print(w.grad.squeeze())

Prints the kernel gradient. Values: [[63, 72, 81], [108, 117, 126], [153, 162, 171]]. Bit-for-bit identical to what we computed by hand in C6.

14print(x.grad.squeeze())

Prints the input gradient. Values: [[1,1,2,1,1],[1,2,3,2,1],[2,3,5,3,2],[1,2,3,2,1],[1,1,2,1,1]]. Identical to C6. Autograd agrees with the math.

8 lines without explanation
1import torch
2import torch.nn.functional as F
3
4# Reproduce the same example in PyTorch with autograd
5x = torch.arange(1, 26, dtype=torch.float64).reshape(1, 1, 5, 5).requires_grad_(True)
6w = torch.tensor([[1, 0, 1], [0, 1, 0], [1, 0, 1]],
7                 dtype=torch.float64).reshape(1, 1, 3, 3).requires_grad_(True)
8
9y = F.conv2d(x, w)                         # (1, 1, 3, 3)
10y.backward(torch.ones_like(y))             # upstream gradient = ones
11
12print(w.grad.squeeze())
13# tensor([[ 63.,  72.,  81.], [108., 117., 126.], [153., 162., 171.]], dtype=torch.float64)
14print(x.grad.squeeze())
15# tensor([[1., 1., 2., 1., 1.], [1., 2., 3., 2., 1.],
16#         [2., 3., 5., 3., 2.], [1., 2., 3., 2., 1.], [1., 1., 2., 1., 1.]])

Why three convolutions per conv layer

  • Forward: y=xWy = x * W (cross-correlation)
  • Weight grad: L/W=xL/y\partial L/\partial W = x * \partial L/\partial y
  • Input grad: L/x=L/yrot180(W)\partial L/\partial x = \partial L/\partial y * \text{rot180}(W) (full padding)
A training step therefore costs roughly three times the forward FLOPs for every conv layer — memorise this when reasoning about training-time cost.

Memory Layout — NCHW vs NHWC

All our code so far has assumed NCHW: batch, channel, height, width, with the channel axis immediately after the batch. That is PyTorch's default and cuDNN's historic default. On modern GPUs, however, the tensor cores prefer the channel axis to be innermost — the layout called NHWC or channels-last [3].

LayoutStride order (for (N,C,H,W))Which axis is fastest in memoryBest for
NCHW (contiguous_format)(C·H·W, H·W, W, 1)W (innermost)Small models, CPU, classic cuDNN paths
NHWC (channels_last)(C·H·W, 1, W·C, C)C (innermost)Tensor Cores, fp16/bf16 training, large C

The shape is identical in both cases — only the strides (and therefore memory order) change. Switching is a one-line call to .contiguous(memory_format=torch.channels_last) on both the activation tensor and the module. When it pays off, it pays off a lot.

Channels-last memory format
🐍channels_last.py
1import torch

Core tensor library.

2import torch.nn as nn

Module classes — in particular nn.Conv2d, which we'll flip between memory layouts.

5x = torch.randn(32, 64, 56, 56, device='cuda')

A realistic activation tensor: batch=32, C=64, 56×56 — roughly the shape at the second stage of a ResNet-50 on 224×224 inputs.

EXECUTION STATE
📚 device='cuda' = Allocate the tensor on the default GPU. Without this, .cuda() calls below would copy at runtime.
6conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()

A same-padding conv layer that doubles channels. Standard building block in ResNets.

9x_nchw = x.contiguous(memory_format=torch.contiguous_format)

Default NCHW. Strides are (C·H·W, H·W, W, 1) so stepping through a single channel is cache-friendly, but stepping across channels at the same spatial position jumps H·W elements in memory.

EXECUTION STATE
📚 .contiguous(memory_format) = Returns a tensor whose strides match the requested layout. If already contiguous, returns self (no copy).
→ torch.contiguous_format = The default NCHW stride pattern.
10conv_nchw = conv

nn.Conv2d weights are already NCHW — no change needed for the baseline.

13x_nhwc = x.contiguous(memory_format=torch.channels_last)

Same logical (N,C,H,W) shape, but the underlying memory is laid out as N·H·W·C (channels are innermost). Strides become (C·H·W, 1, W·C, C) — so stepping across channels at the same pixel now advances by 1. This is what GPU tensor cores prefer because they multiply along the channel axis.

EXECUTION STATE
📚 torch.channels_last = Memory format where C is the innermost axis. Shape stays (N,C,H,W); only strides change.
14conv_nhwc = conv.to(memory_format=torch.channels_last)

Re-layout the conv's weight tensor into channels-last too. Without this, the runtime would silently convert each forward pass — the benchmark would measure the conversion, not the conv.

17def bench(f, x, n=50):

Timing harness. On GPU you MUST use cuda.Event because CPU-side calls return immediately — the actual compute is scheduled on the GPU and wall-clock time via time.time() would under-count.

EXECUTION STATE
📚 torch.cuda.Event(enable_timing=True) = A GPU-side timer. start.record() enqueues a timestamp on the GPU stream; end.record() enqueues another. elapsed_time returns their GPU-time difference in milliseconds.
18for _ in range(5): # warm-up

First few runs compile kernels, allocate workspace, warm caches. Skip them or they dominate the measurement. 3–10 warm-up iterations is a common default.

19f(x)

Warm-up call.

20torch.cuda.synchronize()

Block until all pending GPU work completes. Needed to make the warm-up finish before we start the timer.

21start, end = torch.cuda.Event(...), torch.cuda.Event(...)

Two GPU events to bracket the measured region.

22start.record()

Mark the start on the GPU stream.

23for _ in range(n):

Run n repetitions to average out noise. 50 is plenty for sub-millisecond ops.

24f(x)

The actual call under measurement.

25end.record()

Mark the end.

26torch.cuda.synchronize()

Wait for the end event to be reached before reading its timestamp.

27return start.elapsed_time(end) / n

Average ms per call.

29print(f'NCHW: {bench(conv_nchw, x_nchw):.3f} ms')

Baseline.

30print(f'channels_last: {bench(conv_nhwc, x_nhwc):.3f} ms')

channels-last path. Per PyTorch's channels-last tutorial (and Yao et al.'s ATC'20 'Channels-Last' work), the speedup typically lands between 1.3× and 2.5× on A100/V100, growing with channel count.

11 lines without explanation
1import torch
2import torch.nn as nn
3
4# A typical inference workload
5x = torch.randn(32, 64, 56, 56, device="cuda")
6conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
7
8# --- NCHW baseline ---
9x_nchw = x.contiguous(memory_format=torch.contiguous_format)
10conv_nchw = conv  # nn.Conv2d weights are NCHW by default
11
12# --- channels-last (NHWC in memory, still logical NCHW) ---
13x_nhwc = x.contiguous(memory_format=torch.channels_last)
14conv_nhwc = conv.to(memory_format=torch.channels_last)
15
16# Warm-up + time — use CUDA events for GPU-accurate timing
17def bench(f, x, n=50):
18    for _ in range(5):                      # warm-up
19        f(x)
20    torch.cuda.synchronize()
21    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
22    start.record()
23    for _ in range(n):
24        f(x)
25    end.record()
26    torch.cuda.synchronize()
27    return start.elapsed_time(end) / n       # ms per call
28
29print(f"NCHW:          {bench(conv_nchw, x_nchw):.3f} ms")
30print(f"channels_last: {bench(conv_nhwc, x_nhwc):.3f} ms")
31# On an A100 the channels-last path is ~1.8× faster here (Yao et al. 2020,
32# PyTorch channels-last tutorial). The speedup grows with channel count.

When channels-last does NOT help

For very small channel counts (C < 32) or on CPU, NCHW can actually be faster because spatial locality matters more than channel locality. Always benchmark before committing the layout change in production.

Special Cases That Fall Out for Free

Three constructions used ubiquitously in modern CNN architectures are literally one-line variations of our implementation. Recognising them as conv variants is what makes the next chapter's architectural papers readable.

1×1 Convolution Is a Matrix Multiply

Set kH=kW=1k_H = k_W = 1 in our conv2d. The kernel has no spatial extent; each output pixel is a pure linear combination of the input channels at the same pixel. That is exactly y=Wxy = W x — a plain matrix multiply applied per-pixel. This is the engine behind Inception's 1×1 bottleneck [8] and every “pointwise” layer in MobileNet/EfficientNet.

1×1 conv ≡ nn.Linear on the channel axis
🐍one_by_one_is_matmul.py
1import torch

Tensor library.

2import torch.nn as nn

nn.Conv2d and nn.Linear — the two modules we claim are equivalent at kernel_size=1.

3import torch.nn.functional as F

Kept for symmetry with the other examples.

5torch.manual_seed(0)

Reproducibility — otherwise the random weights we inject into Linear wouldn't match the Conv's.

6x = torch.randn(2, 64, 8, 8)

Batch of 2 activation maps, 64 channels, 8×8 spatial. Representative of a middle CNN stage.

9conv = nn.Conv2d(64, 128, kernel_size=1, bias=False)

A 1×1 conv: 64 in, 128 out, no bias. The weight tensor has shape (128, 64, 1, 1). Each spatial position is multiplied independently.

EXECUTION STATE
⬇ arg: kernel_size=1 = The kernel covers a single spatial pixel. There is no neighbourhood to mix.
⬇ arg: bias=False = Turn off bias to make the equivalence exact (Linear default is bias=True, so removing it from both keeps them parallel).
10via_conv = conv(x)

Forward through the conv. Each output pixel = Σₖ W[c_out, k] · x[k, i, j]. No spatial mixing, only channel mixing.

EXECUTION STATE
→ via_conv shape = (2, 128, 8, 8)
13linear = nn.Linear(64, 128, bias=False)

Ordinary fully-connected layer of 64 in and 128 out. Weight shape (128, 64).

14linear.weight.data = conv.weight.data.squeeze(-1).squeeze(-1)

Inject the same weights. squeeze(-1) twice drops the two length-1 spatial axes, turning (128, 64, 1, 1) into (128, 64). After this assignment both modules compute the same linear map, and any output difference can only come from how we apply it.

EXECUTION STATE
📚 .squeeze(dim) = Drop the specified axis if its length is 1. -1 means the last axis. Used twice to drop both spatial axes.
15via_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

Linear expects the feature axis to be last. We permute NCHW → NHWC, apply the linear, then permute back NHWC → NCHW. Two transpositions cost almost nothing but show why 1×1 conv and Linear feel identical at the matmul level.

EXECUTION STATE
📚 .permute(*dims) = Re-order axes. A zero-copy stride swap unless followed by a reshape that requires contiguous memory.
→ axis order after first permute = (N, H, W, C)
→ axis order after second permute = (N, C, H, W)
17print(torch.allclose(via_conv, via_linear, atol=1e-6))

Prints True. The proof that a 1×1 conv IS a per-pixel matrix multiply. This is why Inception's '1×1 bottleneck' (Szegedy et al., 2015) and ResNet's bottleneck blocks (He et al., 2016) can be implemented either way with identical math.

6 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5torch.manual_seed(0)
6x = torch.randn(2, 64, 8, 8)
7
8# Path A: nn.Conv2d with a 1×1 kernel
9conv = nn.Conv2d(64, 128, kernel_size=1, bias=False)
10via_conv = conv(x)                             # (2, 128, 8, 8)
11
12# Path B: nn.Linear on the channel dimension
13linear = nn.Linear(64, 128, bias=False)
14linear.weight.data = conv.weight.data.squeeze(-1).squeeze(-1)   # drop 1×1 spatial
15via_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # (2,128,8,8)
16
17print(torch.allclose(via_conv, via_linear, atol=1e-6))    # True

Grouped and Depthwise-Separable Convolutions

Setting groups > 1 in nn.Conv2d splits the input and output channels into independent groups. The extreme case groups=Cin\texttt{groups} = C_\text{in} is the depthwise convolution used in MobileNet [9], where each input channel has its own spatial kernel and there is no cross-channel mixing. Pairing depthwise with a 1×1 “pointwise” conv (depthwise-separable) approximates a full 3×3 conv at roughly 1/Cin1/C_\text{in} the parameters.

Groups: standard → grouped → depthwise → depthwise-separable
🐍grouped_and_depthwise.py
1import torch

Tensor library.

2import torch.nn as nn

For nn.Conv2d and nn.Sequential.

4x = torch.randn(1, 64, 32, 32)

One activation map with 64 channels — realistic mid-network shape.

7standard = nn.Conv2d(64, 128, 3, padding=1, groups=1)

Default convolution. Every output channel mixes every input channel. Parameters = C_out · C_in · kH · kW + C_out.

EXECUTION STATE
⬇ arg: groups=1 = Default. The 'everybody-sees-everybody' case.
→ param count = 128 · 64 · 3 · 3 + 128 = 73,856
8# Parameter count note

Annotation; the actual numeric comes out of m.parameters() below.

11grouped = nn.Conv2d(64, 128, 3, padding=1, groups=2)

Split input and output channels into 2 groups of 32 / 64 each. Each output channel sees only its own group of input channels — no cross-group mixing. Parameter count drops by exactly 1/groups.

EXECUTION STATE
⬇ arg: groups=2 = Must divide BOTH C_in and C_out evenly. Introduced in AlexNet (Krizhevsky et al., 2012) to fit across two GPUs.
→ param count = 128 · (64/2) · 3 · 3 + 128 = 36,992 — exactly half of groups=1
12# Parameter count note

Annotation only.

15depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)

Setting groups = C_in = C_out gives you a depthwise convolution: each input channel has its OWN independent 3×3 kernel, no channel mixing at all. Introduced in MobileNet (Howard et al., 2017); parameter count scales with C, not C².

EXECUTION STATE
⬇ arg: groups=64 = One kernel per channel — the 'depthwise' convolution. The C_out argument must equal groups for pure depthwise; setting C_out > groups gives you a 'grouped+expand' variant used in EfficientNet.
→ param count = 64 · 1 · 3 · 3 + 64 = 640 — 115× fewer parameters than the standard conv
16# Parameter count note

Annotation only.

19dsconv = nn.Sequential(depthwise 3×3, pointwise 1×1)

The full 'depthwise-separable' block. Depthwise captures spatial patterns per-channel; the pointwise 1×1 conv mixes channels. Together they approximate a full 3×3 conv at a fraction of the cost — the core trick behind MobileNet V1/V2/V3 and EfficientNet.

EXECUTION STATE
→ first layer = nn.Conv2d(64, 64, 3, padding=1, groups=64) — spatial mixing, per channel
→ second layer = nn.Conv2d(64, 128, 1) — per-pixel channel mixing (recall C9: 1×1 conv IS a matmul)
20# depthwise stage

Stage 1 of the block.

21# pointwise stage

Stage 2 of the block.

24for name, m in [...]:

Sweep all four configurations and print parameter counts and output shapes.

LOOP TRACE · 4 iterations
standard
params = 73,856
output = (1, 128, 32, 32)
grouped-2
params = 36,992
output = (1, 128, 32, 32)
depthwise
params = 640
output = (1, 64, 32, 32)
dsconv
params = 8,896 (640 + 8,256)
output = (1, 128, 32, 32)
25p = sum(p.numel() for p in m.parameters())

Count every learnable scalar in the module. .numel() returns the number of elements in a tensor; summing over .parameters() counts every weight and bias.

EXECUTION STATE
📚 m.parameters() = Generator of all learnable tensors inside a Module.
📚 tensor.numel() = Number of elements = product of shape dims.
26print(f"{name:12s} params={p:>6d} output={tuple(m(x).shape)}")

Formatted report. The key observation: the standard conv has 115× as many parameters as the depthwise, while the depthwise-separable block still has 8× fewer than the standard — at similar representational power per the MobileNet paper.

12 lines without explanation
1import torch
2import torch.nn as nn
3
4x = torch.randn(1, 64, 32, 32)
5
6# groups=1: ordinary conv — every input channel contributes to every output channel.
7standard = nn.Conv2d(64, 128, 3, padding=1, groups=1)
8# Parameter count: 128 × 64 × 3 × 3 + 128 = 73,856
9
10# groups=2: input and output channels split into 2 groups; each group sees only its half.
11grouped = nn.Conv2d(64, 128, 3, padding=1, groups=2)
12# Parameter count: 128 × (64/2) × 3 × 3 + 128 = 36,992  (exactly half)
13
14# groups=C_in: depthwise conv — one kernel per input channel, no channel mixing.
15depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)
16# Parameter count: 64 × 1 × 3 × 3 + 64 = 640  (64× fewer multiplications)
17
18# Depthwise-separable: depthwise → pointwise 1×1. MobileNet / EfficientNet block.
19dsconv = nn.Sequential(
20    nn.Conv2d(64, 64, 3, padding=1, groups=64),     # spatial mixing, per-channel
21    nn.Conv2d(64, 128, 1),                          # cross-channel mixing, pointwise
22)
23
24for name, m in [("standard", standard), ("grouped-2", grouped),
25                ("depthwise", depthwise), ("dsconv", dsconv)]:
26    p = sum(p.numel() for p in m.parameters())
27    print(f"{name:12s} params={p:>6d}  output={tuple(m(x).shape)}")

Quick Check

A 3×3 conv with C_in=256, C_out=256 has 589,824 weight parameters. Rewriting it as a depthwise 3×3 followed by a pointwise 1×1 keeps C_out=256. How many parameters now?


Choosing an Algorithm

cuDNN exposes several forward algorithms and silently picks one per shape at benchmark=True. The decision table below summarises when each algorithm wins in practice [2].

AlgorithmBest forMemory overheadNumerical notes
Direct (naïve, tiled)Tiny tensors, very small CnoneExact
Implicit GEMM (im2col-on-the-fly)Default for modern shapeslow — no materialised im2colExact
Explicit GEMM (materialised im2col)Small kernels, large Chigh — stores the 9× duplicated tensorExact
Winograd F(2,3)3×3 kernels, stride=1moderate — stores transformed tilesSlightly noisier than GEMM [5]
FFTLarge kernels (≥ 7×7), small stridehigh — complex-valued intermediateNoisy for small kernels [6]

Engineering takeaway

In almost every case you should just call torch.backends.cudnn.benchmark = True at the start of training and let cuDNN pick. The algorithms above are useful tounderstand because they show up in profiling output, not because you typically write them by hand.

Summary

  1. A convolution is a structured GEMM. im2col makes the structure explicit: one matrix multiply replaces the quadruple loop.
  2. Stride, padding, and dilation (Section 10.3) enter a production implementation through exactly two places: np.pad / F.pad, and the strides of the windowed view. Everything else is the same GEMM.
  3. The PyTorch version of our implementation is literally three lines: F.unfold → matmul → view. That is also how cuDNN's “implicit GEMM” path works internally.
  4. Backpropagation through a conv is two more convolutions: L/W=xL/y\partial L/\partial W = x * \partial L/\partial y and L/x=L/yrot180(W)\partial L/\partial x = \partial L/\partial y * \text{rot180}(W).
  5. Memory layout is a silent 1.3–2.5× speedup knob on modern GPUs (channels_last). Benchmark before you commit.
  6. 1×1 conv is a per-pixel matmul; grouped conv splits channels; depthwise conv assigns one kernel per channel. These are not new operations — they are one-argument variations of the same kernel.

Exercises

Conceptual

  1. Explain why as_strided requires writeable=False for conv windows. What specifically goes wrong if you write through an overlapping strided view?
  2. Suppose a conv layer takes 10 ms for the forward pass. Roughly how long should the backward take? Justify using the decomposition in the Backward section.
  3. You have a 3×3 conv with 512 input and 512 output channels. Someone proposes replacing it with a depthwise-separable variant. Compute the parameter and FLOP reduction. Under what conditions might the separable version be slower in practice despite having fewer FLOPs?

Coding

  1. Extend the conv2d from C4 to accept a bias vector and verify against F.conv2d(..., bias=b).
  2. Replace the einsum contraction in C4 with an explicit reshape + @. Confirm identical numerical output and measure wall-clock cost.
  3. Implement conv2d_backward for batched multi-channel input and verify against torch.autograd.grad.

Challenge

Implement Winograd F(2,3)F(2, 3) for a single 3×3 kernel over a 4×4 input tile following Lavin & Gray (2016) [5]. The transform matrices are fixed 4×3 and 4×4 rationals given in Table 1 of the paper. Your implementation should reduce the multiplication count from 36 to 16 per 2×2 output and produce bit-equivalent results (within float tolerance) toF.conv2d. Report your measured multiplication count.


References

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.5 “Variants of the Basic Convolution Function” and §9.8 “Efficient Convolution Algorithms”. MIT Press. https://www.deeplearningbook.org/
  2. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759. (Introduces implicit-GEMM, Winograd and FFT paths; numbers in the benchmark table are consistent with their Table 2.)
  3. PyTorch tutorial, Channels Last Memory Format. https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html (Source for the 1.3–2.5× channels-last speedup and the torch.cuda.Event timing pattern used in C8.)
  4. Chellapilla, K., Puri, S., & Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. International Conference on Frontiers in Handwriting Recognition. (The original im2col paper.)
  5. Lavin, A., & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR. arXiv:1509.09308. (Winograd F(2,3) derivation and constants.)
  6. Mathieu, M., Henaff, M., & LeCun, Y. (2014). Fast Training of Convolutional Networks through FFTs. ICLR. arXiv:1312.5851.
  7. Stanford CS231n, Convolutional Neural Networks: Architectures, Convolution / Pooling Layers. https://cs231n.github.io/convolutional-networks/ (Source for the widely reproduced backprop-through-conv derivation used in the Backward section.)
  8. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper with Convolutions (Inception v1). CVPR. arXiv:1409.4842. (1×1 bottleneck.)
  9. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861. (Depthwise-separable.)
  10. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks (AlexNet). NeurIPS. (Introduced grouped convolutions to fit across two GPUs.)
  11. NumPy documentation, numpy.lib.stride_tricks.as_strided and sliding_window_view. https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html
  12. PyTorch documentation, torch.nn.functional.unfold. https://pytorch.org/docs/stable/generated/torch.nn.functional.unfold.html

With conv implementation in hand, the next section moves to the complementary operation that every CNN architecture of the last decade still uses: pooling. We will see that max-pooling, average-pooling, and even modern alternatives like strided convolutions (Section 10.3) are variations on a single theme — a stride-S window reducer.

Loading comments...