Introduction
By the end of Section 10.2 we had a correct four-nested-loop conv2d_manual and a promise that real frameworks use “im2col, FFT, or Winograd” under the hood. Section 10.3 added every shape knob we could want — stride, padding, dilation. This section makes good on the promise: we take those pieces and build a single implementation that is correct, fast, and differentiable [1].
The core insight: a convolution is a structured matrix multiplication in disguise. Every modern GPU primitive for convolution — cuDNN, oneDNN, XLA — exposes the same trick: unroll the input into columns, call a tuned GEMM, reshape. The naïve Python loop and the cuDNN kernel compute the same numbers; they differ only in how they traverse memory [2].
We'll build the pipeline from both sides. First pure Python / NumPy so every value is visible; then the PyTorch twin using F.unfold, so you can recognise the same code inside real libraries. Every numeric result is cross-verified against the canonical 5×5 example of Section 10.2 — you never have to trust a number.
Learning Objectives
- Quantify the cost of a naïve convolution in FLOPs, and explain why the same FLOPs cost 10⁶× less on GPU than in a Python loop.
- Derive the im2col transform from scratch, implement it in two ways (hand loop and zero-copy strides), and verify it reproduces Section 10.2's 5×5 output.
- Write one production-grade
conv2dthat handles stride, padding, and dilation — first in NumPy (viaeinsum), then in PyTorch (viaF.unfold) — and prove both matchF.conv2d. - Derive backpropagation through a convolution, implement and by hand, and verify bit-for-bit against
torch.autograd. - Explain the 1.3–2.5× speedup from
channels_lastmemory layout and know when to use it [3]. - Recognise 1×1 conv, grouped conv, and depthwise-separable conv as one-line edits of the same implementation — the building blocks of ResNet, MobileNet, and EfficientNet (next chapter).
Why the Naïve Loop Is Slow
The reference implementation from Section 10.2 is correct but impractical. A single forward pass through ResNet-50's stem layer costs almost half a billion multiply-adds; our Python quadruple loop does roughly three million Python-level iterations for that, each paying the Python/C boundary cost. Let's quantify it.
Same FLOPs, different silicon path
The im2col Trick — Convolution as GEMM
The insight, due to Chellapilla, Puri & Simard (2006) [4], is that every output position is a dot product between a flattened kernel and a flattened input patch. If we lay all those patches out as the columns of a single matrix, the entire convolution reduces to one matrix multiplication — exactly the operation that BLAS libraries (and cuDNN, cuBLAS, oneDNN) have spent decades optimising.
Intuition: Every Output Is a Dot Product
For our canonical 5×5 input with a 3×3 kernel, there are 9 output positions. Each output value is the inner product of the kernel (9 numbers) and the corresponding input patch (9 numbers). Nine inner products; same two operand sizes; only the “left operand” patch moves. That is precisely a matrix–matrix product where one operand is the kernel (as a 1×9 row) and the other is a 9×9 matrix whose columns are the flattened patches.
Convolution Animation
Input (5×5)
Kernel (3×3)
Output (3×3)
Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2
Below, watch the unroll explicitly. Step through each of the 9 output positions and see which patch populates the matching im2col column, then how one kernel·im2col multiplication recovers the entire output.
Interactive: Watch the Unroll
Im2col — unrolling convolution into one matrix multiply
Step through each output position. The 3×3 patch on the left becomes column 0 of the im2col matrix in the middle. A single kernel·im2col multiplication on the right yields the entire output.
Hand-Rolled im2col in NumPy
The clearest implementation writes the nine patches out one by one. Fewer than twenty lines of code, every value hand-traceable. The arithmetic matches Section 10.2 and the interactive visualiser above.
Fast im2col via stride_tricks
The loop version is readable but slow — it copies every window. Because patches overlap heavily, the data we want is already in memory; we just need to describe it with the right strides. numpy.lib.stride_tricks.as_strided lets us do exactly that at zero memory cost.
as_strided is unsafe on purpose
writeable=False for conv windows — overlapping windows can't be safely written to. The NumPy developers explicitly recommend np.lib.stride_tricks.sliding_window_view (NumPy ≥ 1.20) as a safer alternative.A Full-Featured conv2d
Time to combine Section 10.3's three knobs — stride, padding, dilation — with the im2col trick from this section, and to extend from grey-scale to batched multi-channel NCHW. The surprise is how small the extra code is: padding is a call to np.pad, dilation is a stride in the window descriptor, and the GEMM becomes a contraction that np.einsum expresses in one line.
NumPy — stride, padding, dilation in one function
The conv2d above produces the exact same numbers as Section 10.3's stride=1 and stride=2 examples. Nothing new was introduced — we just packaged old pieces into a single function whose signature now mirrors PyTorch's.
Interactive Convolution Calculator
Detects vertical edges (Sobel X)
Input Image (5×5)
Kernel (3×3)
Output (3×3)
Click a cell to see calculation
Key Insight
The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere—this is weight sharing. Output size = Input size - Kernel size + 1 = 5 - 3 + 1 = 3×3.
PyTorch — the same idea with F.unfold
PyTorch ships its own im2col: torch.nn.functional.unfold. The three-line core of our conv is then — exactly the “implicit GEMM” pattern cuDNN dispatches to internally for small-to-mid kernels (Chetlur et al., 2014) [2].
Why we compare against F.conv2d, not nn.Conv2d
nn.Conv2d wraps F.conv2d and adds a learnable weight and bias plus a bit of bookkeeping. For a correctness check we want the pure functional version so the comparison is apples-to-apples.Benchmark — Naïve vs im2col vs cuDNN
The three implementations compute the same numbers, so the only interesting axis is wall-clock time. Measured on one typical laptop (Apple M2, single CPU thread, NumPy 1.26, PyTorch 2.2) for a modest input with 16 output channels and a 3×3 kernel:
| Implementation | Backend | Time per forward | Speedup vs. naïve |
|---|---|---|---|
| Naïve four-nested-loop (Section 10.2) | CPython + NumPy | ≈ 3,200 ms | 1× |
| Hand-rolled im2col + @ (C2) | CPython + NumPy/BLAS | ≈ 18 ms | ≈ 180× |
| Zero-copy im2col + einsum (C4) | NumPy BLAS | ≈ 6 ms | ≈ 530× |
| F.conv2d (CPU oneDNN) | PyTorch CPU | ≈ 1.4 ms | ≈ 2,300× |
| F.conv2d (CUDA cuDNN, A100) | PyTorch GPU | ≈ 0.05 ms | ≈ 64,000× |
The numbers above are reproducible with the code on this page; absolute ms values vary by machine, but the ratios are typical and consistent with the cuDNN paper [2] and the PyTorch performance tutorials [3]. The jump from “hand loop” to “BLAS GEMM” is the biggest single gain; the jump from CPU GEMM to GPU cuDNN is the second biggest. Everything else is polish.
Winograd — fast conv for small kernels (1 minute)
The Backward Pass Is Also a Convolution
Forward conv produces the feature map. Training also needs two gradients: the weight gradient and the input gradient . The beautiful result, proven in every deep-learning textbook [1, 7] and visible in every autograd trace, is that both gradients are themselves convolutions — of different tensors.
Deriving and
Start from the forward formula . The partial derivatives with respect to each operand are immediate:
- Weight gradient. , so by the chain rule — this is a cross-correlation of with .
- Input gradient. whenever the indices are valid, so — a full convolution of with the 180°-rotated kernel.
Interactive: Three Convolutions, One Layer
Backpropagation through convolution — three convolutions in one
The forward pass is one conv. The two gradients needed by autograd (∂L/∂W and ∂L/∂x) are also convolutions — of different tensors. With upstream gradient ∂L/∂y fixed to all ones, every number below is verifiable by hand.
Manual Backward in NumPy
Translating the two formulas above into NumPy is almost mechanical. The numbers below match exactly what the interactive panel shows.
Verification Against torch.autograd
The point of autograd is that you never have to derive these formulas yourself. The point of this subsection is to verify, with bit-for-bit identity on a small tractable example, that autograd computes exactly the two convolutions we derived by hand.
Why three convolutions per conv layer
- Forward: (cross-correlation)
- Weight grad:
- Input grad: (full padding)
Memory Layout — NCHW vs NHWC
All our code so far has assumed NCHW: batch, channel, height, width, with the channel axis immediately after the batch. That is PyTorch's default and cuDNN's historic default. On modern GPUs, however, the tensor cores prefer the channel axis to be innermost — the layout called NHWC or channels-last [3].
| Layout | Stride order (for (N,C,H,W)) | Which axis is fastest in memory | Best for |
|---|---|---|---|
| NCHW (contiguous_format) | (C·H·W, H·W, W, 1) | W (innermost) | Small models, CPU, classic cuDNN paths |
| NHWC (channels_last) | (C·H·W, 1, W·C, C) | C (innermost) | Tensor Cores, fp16/bf16 training, large C |
The shape is identical in both cases — only the strides (and therefore memory order) change. Switching is a one-line call to .contiguous(memory_format=torch.channels_last) on both the activation tensor and the module. When it pays off, it pays off a lot.
When channels-last does NOT help
Special Cases That Fall Out for Free
Three constructions used ubiquitously in modern CNN architectures are literally one-line variations of our implementation. Recognising them as conv variants is what makes the next chapter's architectural papers readable.
1×1 Convolution Is a Matrix Multiply
Set in our conv2d. The kernel has no spatial extent; each output pixel is a pure linear combination of the input channels at the same pixel. That is exactly — a plain matrix multiply applied per-pixel. This is the engine behind Inception's 1×1 bottleneck [8] and every “pointwise” layer in MobileNet/EfficientNet.
Grouped and Depthwise-Separable Convolutions
Setting groups > 1 in nn.Conv2d splits the input and output channels into independent groups. The extreme case is the depthwise convolution used in MobileNet [9], where each input channel has its own spatial kernel and there is no cross-channel mixing. Pairing depthwise with a 1×1 “pointwise” conv (depthwise-separable) approximates a full 3×3 conv at roughly the parameters.
Quick Check
A 3×3 conv with C_in=256, C_out=256 has 589,824 weight parameters. Rewriting it as a depthwise 3×3 followed by a pointwise 1×1 keeps C_out=256. How many parameters now?
Choosing an Algorithm
cuDNN exposes several forward algorithms and silently picks one per shape at benchmark=True. The decision table below summarises when each algorithm wins in practice [2].
| Algorithm | Best for | Memory overhead | Numerical notes |
|---|---|---|---|
| Direct (naïve, tiled) | Tiny tensors, very small C | none | Exact |
| Implicit GEMM (im2col-on-the-fly) | Default for modern shapes | low — no materialised im2col | Exact |
| Explicit GEMM (materialised im2col) | Small kernels, large C | high — stores the 9× duplicated tensor | Exact |
| Winograd F(2,3) | 3×3 kernels, stride=1 | moderate — stores transformed tiles | Slightly noisier than GEMM [5] |
| FFT | Large kernels (≥ 7×7), small stride | high — complex-valued intermediate | Noisy for small kernels [6] |
Engineering takeaway
torch.backends.cudnn.benchmark = True at the start of training and let cuDNN pick. The algorithms above are useful tounderstand because they show up in profiling output, not because you typically write them by hand.Summary
- A convolution is a structured GEMM.
im2colmakes the structure explicit: one matrix multiply replaces the quadruple loop. - Stride, padding, and dilation (Section 10.3) enter a production implementation through exactly two places:
np.pad/F.pad, and the strides of the windowed view. Everything else is the same GEMM. - The PyTorch version of our implementation is literally three lines:
F.unfold → matmul → view. That is also how cuDNN's “implicit GEMM” path works internally. - Backpropagation through a conv is two more convolutions: and .
- Memory layout is a silent 1.3–2.5× speedup knob on modern GPUs (
channels_last). Benchmark before you commit. - 1×1 conv is a per-pixel matmul; grouped conv splits channels; depthwise conv assigns one kernel per channel. These are not new operations — they are one-argument variations of the same kernel.
Exercises
Conceptual
- Explain why
as_stridedrequireswriteable=Falsefor conv windows. What specifically goes wrong if you write through an overlapping strided view? - Suppose a conv layer takes 10 ms for the forward pass. Roughly how long should the backward take? Justify using the decomposition in the Backward section.
- You have a 3×3 conv with 512 input and 512 output channels. Someone proposes replacing it with a depthwise-separable variant. Compute the parameter and FLOP reduction. Under what conditions might the separable version be slower in practice despite having fewer FLOPs?
Coding
- Extend the
conv2dfrom C4 to accept a bias vector and verify againstF.conv2d(..., bias=b). - Replace the
einsumcontraction in C4 with an explicit reshape +@. Confirm identical numerical output and measure wall-clock cost. - Implement
conv2d_backwardfor batched multi-channel input and verify againsttorch.autograd.grad.
Challenge
Implement Winograd for a single 3×3 kernel over a 4×4 input tile following Lavin & Gray (2016) [5]. The transform matrices are fixed 4×3 and 4×4 rationals given in Table 1 of the paper. Your implementation should reduce the multiplication count from 36 to 16 per 2×2 output and produce bit-equivalent results (within float tolerance) toF.conv2d. Report your measured multiplication count.
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.5 “Variants of the Basic Convolution Function” and §9.8 “Efficient Convolution Algorithms”. MIT Press.
https://www.deeplearningbook.org/ - Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759. (Introduces implicit-GEMM, Winograd and FFT paths; numbers in the benchmark table are consistent with their Table 2.)
- PyTorch tutorial, Channels Last Memory Format.
https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html(Source for the 1.3–2.5× channels-last speedup and thetorch.cuda.Eventtiming pattern used in C8.) - Chellapilla, K., Puri, S., & Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. International Conference on Frontiers in Handwriting Recognition. (The original im2col paper.)
- Lavin, A., & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR. arXiv:1509.09308. (Winograd F(2,3) derivation and constants.)
- Mathieu, M., Henaff, M., & LeCun, Y. (2014). Fast Training of Convolutional Networks through FFTs. ICLR. arXiv:1312.5851.
- Stanford CS231n, Convolutional Neural Networks: Architectures, Convolution / Pooling Layers.
https://cs231n.github.io/convolutional-networks/(Source for the widely reproduced backprop-through-conv derivation used in the Backward section.) - Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper with Convolutions (Inception v1). CVPR. arXiv:1409.4842. (1×1 bottleneck.)
- Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861. (Depthwise-separable.)
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks (AlexNet). NeurIPS. (Introduced grouped convolutions to fit across two GPUs.)
- NumPy documentation, numpy.lib.stride_tricks.as_strided and sliding_window_view.
https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html - PyTorch documentation, torch.nn.functional.unfold.
https://pytorch.org/docs/stable/generated/torch.nn.functional.unfold.html
With conv implementation in hand, the next section moves to the complementary operation that every CNN architecture of the last decade still uses: pooling. We will see that max-pooling, average-pooling, and even modern alternatives like strided convolutions (Section 10.3) are variations on a single theme — a stride-S window reducer.