Introduction
In Section 10.1 we argued why convolutions exist; in Section 10.2 we defined what a convolution is and worked out a single example by hand. We now answer the natural follow-up: how do we control the shape, receptive field, and computational cost of a convolutional layer?
Three knobs do essentially all the work:
- Stride — how far the kernel jumps between evaluations. Controls output resolution.
- Padding — how we treat the boundary. Controls whether spatial size shrinks each layer.
- Dilation — how far apart the kernel's taps sit. Controls the receptive field for a fixed parameter budget.
The one sentence summary: stride shrinks the output, padding prevents that shrinkage, and dilation enlarges the field of view without adding parameters. Master these three and you can read any CNN architecture diagram on sight.
A single formula from Dumoulin & Visin (2016) [1] ties them all together, and it is exactly the formula PyTorch uses internally for nn.Conv2d [2]. We will derive it, implement it twice (NumPy first, then PyTorch), and verify every number produced along the way.
Learning Objectives
- Predict output shape for any combination of using the master formula, and know when you need the floor function.
- Choose stride vs. pooling appropriately for downsampling — including why strided convolutions replaced max-pooling in many modern architectures (Springenberg et al., 2015) [3].
- Pick a padding mode (zeros, reflect, replicate, circular) based on what the input actually represents.
- Reason about dilation and compute the receptive field of a stack of dilated convolutions — the core idea behind WaveNet [4] and DeepLab [5].
- Implement each knob twice: once from scratch in NumPy, once with PyTorch, and verify that both produce the same numbers.
The Master Output-Shape Formula
For a 1-D spatial dimension with input size , kernel size , padding applied on each side, stride , and dilation , the output size is
This is the exact formula listed in the PyTorch documentation for nn.Conv2d [2] and derived geometrically in Dumoulin & Visin [1]. It applies independently to height and width (2-D images have two such equations).
Breaking It Down
| Symbol | Meaning | Typical values |
|---|---|---|
| I | Input size (height or width) | 32, 224, … |
| K | Kernel size | 3, 5, 7 |
| P | Padding per side (both sides added) | 0, 1, 2 |
| S | Stride | 1, 2 |
| D | Dilation rate | 1, 2, 4, 8 (WaveNet) |
| D·(K−1)+1 | Effective kernel size k_eff | 3 when D=1, 5 when D=2, K=3 |
| ⌊·⌋ | Floor: drop any fractional leftover | — required when S > 1 |
A cleaner way to remember it
Why the floor?
Stride — Controlling Output Resolution
Stride is how many pixels the kernel jumps between evaluations. is dense sampling; every pixel gets its own output. visits every other pixel along each axis, so the output footprint is roughly a quarter of the input.
Intuition
Imagine the kernel as a flashlight sweeping a wall. Stride is the distance between successive flashes. A fast sweep (large stride) covers the whole wall in fewer frames but loses detail; a slow sweep (small stride) gives you more frames of fine-grained information at higher cost.
Plugging into the master formula gives the stride-only special case .
| I | K | S | O = ⌊(I−K)/S⌋ + 1 | Effect |
|---|---|---|---|---|
| 32 | 3 | 1 | 30 | Barely shrinks (K−1 = 2) |
| 32 | 3 | 2 | 15 | Halves resolution |
| 32 | 3 | 4 | 8 | Quarter resolution |
| 32 | 7 | 2 | 13 | Halves + K=7 costs 2 pixels |
Stride vs. pooling for downsampling
Stride in Pure Python
Let's start where every working implementation of a new idea should: with a few lines of NumPy that we can trace on paper. The input is the same grid we used in Section 10.2, and we compare stride = 1 against stride = 2.
Why stride=2 output equals a subsample of stride=1
[[35,40,45],[60,65,70],[85,90,95]]; stride=2 gives [[35,45],[85,95]]. These are exactly the corners of the stride=1 output — (0,0), (0,2), (2,0), (2,2). Striding a convolution is mathematically identical to computing the dense convolution and then subsampling — it's just much cheaper.Stride in PyTorch
PyTorch expresses stride through the stride keyword of both F.conv2d and nn.Conv2d. The math is identical; only the tensor layout (NCHW) and the call site differ.
Quick Check
A 224×224 feature map goes through nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2). What is the output spatial size?
Padding — Handling the Boundary
Without padding, every convolution with shrinks the output. After n stride-1 layers with , a input becomes ; by 16 layers it has vanished. Padding prevents that collapse by inserting extra rows and columns around the border before the convolution runs.
Valid, Same, and Full Padding
| Name | Padding P | Output size (S=1, D=1) | When to use |
|---|---|---|---|
| valid | 0 | I − K + 1 | You are happy to lose pixels at the border — e.g. the final conv before a classifier. |
| same | ⌊(K−1)/2⌋ | I (unchanged) | The dominant choice in VGG, ResNet, EfficientNet. Keeps spatial size so you can stack many layers cleanly. |
| full | K − 1 | I + K − 1 | Every valid overlap between image and kernel is represented. Rare in forward conv; shows up implicitly in the backward pass / transposed conv. |
“Same” is really the recipe , which reduces to when and is odd. Even requires asymmetric padding, which is why library support is rare.
Padding Modes (Zero, Reflect, Replicate, Circular)
PyTorch's nn.Conv2d supports four padding_modes [2], each appropriate for different data. Toggle the visualizer below to see the effect of each choice on a 5×5 input with :
Padding Modes
Padding doesn't have to be zeros. PyTorch's nn.Conv2d supports four modes. Switch between them to see how each fills the 2-pixel border around a 5×5 input.
Border is filled with 0. The default in most CNNs and the one torch.nn.Conv2d uses when padding_mode='zeros'.
| Mode | Border fill | Best for |
|---|---|---|
| zeros | 0.0 everywhere | Default. Classification CNNs where the boundary is rarely informative. |
| reflect | Mirror without repeating the edge pixel | Style transfer, image-to-image models: avoids fake dark borders. |
| replicate | Repeat the edge pixel outward | Low-level vision, derivative filters; keeps intensity continuous. |
| circular | Wrap around as if periodic | Inherently periodic data — angular/ spherical signals, cylindrical images. |
Zero-padding bias
Padding in Pure Python
With NumPy's np.pad we can implement all four modes without leaving the standard library. Note how padding becomes a pre-processing step: we enlarge the input, then run the same unpadded convolution we already wrote.
Padding in PyTorch
In PyTorch, padding is a keyword on both the functional and module APIs. The string shortcut padding='same' is convenient but only legal when every spatial stride is 1 — an easy-to-miss restriction.
Quick Check
You want a 5×5 convolution to preserve spatial dimensions with stride=1. What padding do you need?
Dilation — Larger Receptive Field for Free
Dilation (also called “atrous” convolution, from the French à trous = “with holes”) inserts zeros between adjacent kernel taps. The result behaves like a much larger kernel for the purpose of receptive field, while still storing only the original weights.
The formalism is due to Yu & Koltun (ICLR 2016) [8], who used it to capture multi-scale context in dense prediction. WaveNet [4] uses stacks of dilated causal 1-D convolutions with dilations to model seconds of audio with only a handful of layers. DeepLab [5] uses them (“atrous spatial pyramid pooling”) to capture multi-scale image context for semantic segmentation.
Visualizing Dilation
Drag the slider to see how the effective kernel size grows while the parameter count stays fixed at 9:
Dilation: Expanding Receptive Field for Free
A 3×3 kernel always has 9 parameters, but dilation D inserts D−1 zeros between each tap. The kernel covers a larger area of the input without learning any new weights.
D = 1 (standard): taps are adjacent, 3×3 window.
D = 2: taps skip every other cell, 5×5 window but still 9 weights. Used in WaveNet.
D = 3: window is 7×7, still 9 weights.
| D | k_eff = D·(K−1) + 1 | Parameters | Comment |
|---|---|---|---|
| 1 | 3 | 9 | Ordinary 3×3 convolution. |
| 2 | 5 | 9 | 5×5 coverage, still 9 weights. Used in many segmentation heads. |
| 4 | 9 | 9 | 9×9 coverage; one layer already covers a chunk. |
| 8 | 17 | 9 | WaveNet-scale; used in the upper layers of the dilated stack. |
Dilation in Pure Python
Implementing dilation is a one-line change to the Python loop: where we used to index img[i+a, j+b], we now index img[i+a·D, j+b·D]. The formula for the output size uses in place of .
Dilation in PyTorch
PyTorch exposes dilation as a keyword on both F.conv2d and nn.Conv2d. The “same” padding recipe generalises cleanly: for odd .
Quick Check
A 3×3 convolution with dilation=4 is applied to a feature map. What is the effective kernel size?
Interactive Parameter Playground
Now that each knob is understood in isolation, try combining them. The playground below runs a live convolution with a fixed 3×3 kernel on a fixed 5×5 input; move the stride, padding, and dilation sliders to see the effective kernel, the output shape, and the numeric output update in real time:
Convolution Parameter Playground
Move the sliders to see how stride, padding, and dilation reshape the output. Yellow squares show which input positions the kernel samples at the current step.
how far the kernel jumps each step
border added on each side
gap between kernel taps
active only when P > 0
Fixed 3×3 kernel. With dilation, non-zero taps spread apart but the weights stay the same.
What to try
- Set S=2, P=1 — the standard ResNet downsampling block; output is (5+2−3)/2+1 = 3.
- Set D=2 with P=2 — “same” dilated conv; output shape matches the input.
- Set D=2 with P=0 — output collapses to 1×1 because k_eff = 5 matches the input.
- Toggle padding mode between zero / reflect / replicate and watch the corner cells change.
Receptive Field Arithmetic
The receptive field (RF) of a unit in a deep feature map is the size of the region in the input image that can influence its value. Stride, dilation, and depth all increase it. Araujo et al. (2019) [9] derive the following recursion — this is the formula you should commit to memory:
In words: each new layer adds “input pixels” to the RF, scaled by the product of all previous strides (the cumulative downsampling, called the jump). Pooling layers obey the same formula — they just use different values.
Worked Example: three 3×3 convs, stride 1
With at every layer: , , , . Three 3×3 convs “see” a 7×7 window — which is why VGG [10] stacks two 3×3s instead of using a single 5×5: same receptive field, fewer parameters (18 vs 25 per channel pair), more non-linearities.
Worked Example: WaveNet-style dilated stack
Four 1-D causal convolutions with and dilations give: , , , . The receptive field doubles every layer. Stacking such blocks is how WaveNet [4] reaches audio context windows of thousands of samples with fewer than 20 layers.
Receptive Field Growth
See how the receptive field expands with each convolutional layer
Input Image (7×7)
Output Size
7×7
Receptive Field
1×1
RF Coverage
2.0% of input
Receptive Field Formula: RFn = RFn-1 + (k - 1) × stride
With 3×3 kernels and stride 1: RF grows by 2 pixels per layer (1 → 3 → 5 → 7)
Quick Check
A network has layers: conv3×3(s=1), conv3×3(s=1), pool2×2(s=2), conv3×3(s=1), conv3×3(s=1). What is the receptive field of the final output unit?
Common Design Patterns
The VGG pattern: stack of same-padded 3×3s
everywhere, punctuated by stride-2 or pool-2 every few layers. Simonyan & Zisserman [10] showed that this beats larger kernels because two stacked 3×3 layers have the same 5×5 receptive field but use parameters instead of , and they insert an extra non-linearity.
The ResNet pattern: stride-2 conv for downsampling
He et al. [6] downsample with (halves resolution, still “same-ish” at stride-2) and rely on 1×1 convs inside the bottleneck to change channel counts cheaply. The pool-free downsampling preserves gradient flow through residual connections.
The WaveNet pattern: exponentially dilated causal convolutions
Dilation schedule with yields . L=10 layers already reach 1,024 audio samples of context [4]. Causality (only left-hand taps) is orthogonal and implemented via asymmetric padding.
The DeepLab pattern: atrous spatial pyramid pooling
Apply several parallel 3×3 dilated convs with dilations to the same feature map, then fuse. Each branch captures a different scale; together they cover a wide range of context without needing an image pyramid [5].
Summary
| Knob | Effect on output size | Effect on receptive field | Parameter cost |
|---|---|---|---|
| Stride S | Divides by S | Multiplies downstream contributions by S (via the 'jump') | None |
| Padding P | Adds 2P | No direct effect | None |
| Dilation D | Subtracts D·(K−1) instead of (K−1) | Multiplies (K−1) contribution by D | None |
| Kernel K | Subtracts K−1 (or k_eff−1) | Adds (K−1) at this layer | Scales as K²·C_in·C_out |
Commit these to memory
- Master formula:
- Same padding recipe: for odd and stride 1 preserves spatial size.
- Receptive field recursion:
- Stride subsamples, padding preserves size, dilation enlarges the view for free.
Exercises
Conceptual
- Compute by hand the output spatial size of
nn.Conv2d(64, 128, kernel_size=5, padding=2, stride=1, dilation=1)applied to a 56×56 feature map. - Repeat with
dilation=2, padding=4. What stays the same; what changes? - Two stacked 3×3 convolutions (stride 1) have the same receptive field as what singlek×k kernel? How many fewer parameters do the two 3×3s use, per input-output channel pair?
- Why does
padding='same'raise an error whenstride>1in PyTorch? (Hint: the master formula is no longer invertible to a single integer .)
Hints
- 1: — “same” with K=5.
- 2: k_eff = 5; output still 56×56. Same shape, larger receptive field, zero extra parameters.
- 3: 5×5. Param ratio: 2·9 / 25 = 18/25 ≈ 72%.
- 4: Same-padding requires , but with stride > 1 that target depends on both and the ceiling/floor rounding — no single works in general.
Coding
- Extend
conv2d_dilatedto accept stride and padding in addition to dilation. Verify it matchesF.conv2dbyte-for-byte on random inputs. - Implement a function
receptive_field(layers)wherelayersis a list of(k, s, d)tuples and the function returns the final RF using the recursion above. Test it against the playground's 3-conv VGG example (should give 7). - Implement padding modes reflect and replicate in pure Python (no
np.pad) and compare to the NumPy output.
Challenge
Reproduce the WaveNet dilated-causal-conv stack (dilations , kernel size 2, 10 layers) in PyTorch and print the receptive field after each layer. Verify that it doubles each time. The “causal” part means you must pad on the left only — search the PyTorch docs for F.pad if you get stuck.
References
- Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. (Canonical derivation of the output-shape formula.)
- PyTorch documentation, torch.nn.Conv2d.
https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html(Explicit output-size formula andpadding_modeoptions.) - Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All-Convolutional Net. ICLR workshop. arXiv:1412.6806.
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.
- Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE TPAMI 40(4). arXiv:1606.00915.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. arXiv:1512.03385.
- Islam, M. A., Jia, S., & Bruce, N. D. B. (2020). How Much Position Information Do Convolutional Neural Networks Encode? ICLR. arXiv:2001.08248.
- Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR. arXiv:1511.07122.
- Araujo, A., Norris, W., & Sim, J. (2019). Computing Receptive Fields of Convolutional Neural Networks. Distill.
https://distill.pub/2019/computing-receptive-fields/ - Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG). ICLR. arXiv:1409.1556.
In the next section we'll combine every knob from this chapter into a single production-grade conv2d implementation — including the im2col trick that turns a quadruple loop into a single matrix multiplication.