Stride — Controlling Output Resolution
In Section 2 our kernel moved one pixel at a time. That is stride 1. If we instead move $S$ pixels per step, we evaluate fewer windows: the output shrinks by roughly a factor of $S$ along each spatial axis, so for a 2-D input the number of windows, and with it the compute, drops by roughly $S^2$.
Definition
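Let $S$ be the stride. Then the output size (no padding) becomes:
$$O = \left\lfloor \frac{I - K}{S} \right\rfloor + 1$$
where $I$ is the input size and $K$ the kernel size.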
For the 5×5 image and 3×3 Sobel-X kernel of Section 2: stride 1 gives a 3×3 output; stride 2 gives a 2×2 output; stride 3 gives a 1×1 output. Every stride greater than 1 throws information away, but that is often exactly what we want: modern architectures like ResNet use stride-2 convolutions in place of (or alongside) pooling precisely to down-sample.
Python & PyTorch — stride in action
| stride | Output shape | Output values |
|---|---|---|
| 1 | (3, 3) | [[160,160,160],[240,240,240],[320,320,320]] |
| 2 | (2, 2) | [[160,160],[320,320]] |
| 3 | (1, 1) | [[160]] |
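The table above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than the chapter's original listing: `correlate2d` is a hypothetical helper, and since the Section 2 image is not shown in this section, `IMAGE` is an assumed stand-in (pixel (i, j) = 10·(i+1)·(j+1)) chosen only because it yields the same values as the table.

```python
# Assumed stand-in for the Section 2 image (not taken from the text).
IMAGE = [[10 * (i + 1) * (j + 1) for j in range(5)] for i in range(5)]
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def correlate2d(image, kernel, stride=1):
    """Unpadded 2-D cross-correlation of a square image with a square kernel."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for oi in range(out_size):
        row = []
        for oj in range(out_size):
            top, left = oi * stride, oj * stride  # window origin in the input
            row.append(sum(image[top + m][left + n] * kernel[m][n]
                           for m in range(k) for n in range(k)))
        output.append(row)
    return output


for s in (1, 2, 3):
    print(f"stride {s}:", correlate2d(IMAGE, SOBEL_X, stride=s))
```

With stride 2 the function simply keeps every second window of the stride-1 result, which is why the 2×2 output contains the corner values of the 3×3 one.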
When to reach for stride > 1
- You need to down-sample while simultaneously computing features — a strided conv replaces a conv + pool pair.
- You want a parameterised down-sampler: unlike pooling, the strided conv still has learnable weights.
- You are building a generator and need the reverse: fractional/transposed strided conv to upsample.
Quick Check
Input is 28×28, kernel is 5×5, stride is 2, no padding. What is the output size?
Padding — Taming the Boundary
Without padding, every conv layer (at stride 1) shrinks the spatial size by $K-1$. Stack 10 layers with 3×3 kernels and you lose 20 pixels in each dimension. Padding fixes this by adding extra rows and columns around the input, so the kernel has somewhere to land at the boundary.
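The shrinkage is easy to check numerically. The sketch below uses a hypothetical helper, `size_after_valid_stack`, and an assumed 224-pixel input; neither comes from the original text.

```python
def size_after_valid_stack(input_size, kernel_size, num_layers):
    """Spatial size after num_layers unpadded, stride-1 conv layers."""
    size = input_size
    for _ in range(num_layers):
        size -= kernel_size - 1  # each unpadded conv trims K - 1 pixels
    return size


# Ten 3×3 layers on an assumed 224-pixel input lose 20 pixels in total.
print(size_after_valid_stack(224, kernel_size=3, num_layers=10))  # 204
```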
The four common padding modes
| Mode | What it does | Use case |
|---|---|---|
| constant (zero) | Pad with a constant (usually 0) | Default in nn.Conv2d — simple, fast, slightly biases edges toward zero |
| replicate | Repeat the nearest edge pixel | Smooth boundary behaviour; common in image restoration |
| reflect | Mirror the input without duplicating the edge | Even smoother; default in many classical image-processing libraries |
| circular | Wrap around (right ↔ left, top ↔ bottom) | Periodic data — spherical imagery, torus simulations |
Terminology: ‘valid’, ‘same’, ‘full’
Higher-level libraries often use named padding shortcuts:
- valid: no padding. Output shrinks by $K-1$.
- same: pad so output spatial size equals input (when stride=1). For odd kernel sizes: $P = (K-1)/2$.
- full: pad $K-1$ on each side. Output is larger than input. Rare in deep learning; common in signal processing.
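The three named modes differ only in how much padding they add per side. The following sketch makes that concrete for stride 1 and an odd kernel size; `padding_for_mode` and `stride1_output_size` are hypothetical helper names, and the 28-pixel input is an arbitrary example.

```python
def padding_for_mode(mode, kernel_size):
    """Per-side padding for a named mode (assumes stride 1 and an odd kernel)."""
    if mode == "valid":
        return 0
    if mode == "same":
        return (kernel_size - 1) // 2
    if mode == "full":
        return kernel_size - 1
    raise ValueError(f"unknown padding mode: {mode}")


def stride1_output_size(input_size, kernel_size, padding):
    """Output size of a stride-1 conv: I - K + 2P + 1."""
    return input_size - kernel_size + 2 * padding + 1


for mode in ("valid", "same", "full"):
    p = padding_for_mode(mode, kernel_size=3)
    print(mode, "padding:", p, "output:", stride1_output_size(28, 3, p))
```

For a 28-pixel input and a 3×3 kernel this gives 26 (valid), 28 (same), and 30 (full).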
The same-size recipe
For stride 1 and an odd kernel size $K$, set the padding to $(K-1)/2$; for a 3×3 kernel that is padding=1.
F.pad — all four modes on the same tensor
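To see exactly which values each mode inserts, here is a dependency-free 1-D sketch. It is an illustration under stated assumptions, not the original listing: `pad1d` is a hypothetical helper that mimics, for a single row, the four mode names that `torch.nn.functional.pad` accepts ("constant", "replicate", "reflect", "circular").

```python
MODES = ("constant", "replicate", "reflect", "circular")


def pad1d(row, pad, mode="constant", value=0):
    """Pad a 1-D list with `pad` elements on each side using the given mode."""
    n = len(row)
    if mode not in MODES:
        raise ValueError(f"unknown padding mode: {mode}")
    if mode == "reflect" and pad > n - 1:
        raise ValueError("reflect padding must be smaller than the input length")
    padded = []
    for j in range(-pad, n + pad):
        if 0 <= j < n:
            padded.append(row[j])                      # original sample
        elif mode == "constant":
            padded.append(value)                       # fixed fill value
        elif mode == "replicate":
            padded.append(row[0] if j < 0 else row[-1])  # nearest edge
        elif mode == "reflect":
            # Mirror about the edge sample without repeating it.
            padded.append(row[-j] if j < 0 else row[2 * (n - 1) - j])
        else:  # circular
            padded.append(row[j % n])                  # wrap around
    return padded


row = [1, 2, 3, 4]
for mode in MODES:
    print(f"{mode:>9}:", pad1d(row, pad=2, mode=mode))
```

For the row `[1, 2, 3, 4]` with two elements of padding, replicate repeats the 1 and the 4, reflect mirrors inward (`3, 2` on the left and `3, 2` on the right), and circular borrows from the opposite end.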
The Output-Size Formula
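Combining stride and padding, the full output-size formula becomes:
$$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$$
Here $I$ is the input size, $K$ the kernel size, $P$ the padding added to each side, and $S$ the stride. This is the single most important piece of arithmetic in CNN design.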
| Scenario | Settings | Calculation | Output |
|---|---|---|---|
| Same-size 3×3 | I=224, K=3, P=1, S=1 | (224−3+2)/1 + 1 | 224 |
| Same-size 5×5 | I=224, K=5, P=2, S=1 | (224−5+4)/1 + 1 | 224 |
| Halve via stride | I=224, K=3, P=1, S=2 | ⌊(224−3+2)/2⌋ + 1 | 112 |
| No padding | I=224, K=3, P=0, S=1 | (224−3)/1 + 1 | 222 |
| Classic VGG block | I=224, K=3, P=1, S=1 then 2×2 maxpool | 224 → 224 → 112 | 112 |
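Each row of the table can be checked mechanically. The sketch below assumes a hypothetical helper, `conv_output_size`, whose argument names follow the I, K, P, S symbols used in the table; the 2×2 max pool of the VGG row is treated as a window of size 2 with stride 2 and no padding.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size floor((I - K + 2P) / S) + 1 for one spatial axis."""
    return (i - k + 2 * p) // s + 1


rows = [
    ("Same-size 3×3", conv_output_size(224, 3, p=1, s=1)),
    ("Same-size 5×5", conv_output_size(224, 5, p=2, s=1)),
    ("Halve via stride", conv_output_size(224, 3, p=1, s=2)),
    ("No padding", conv_output_size(224, 3, p=0, s=1)),
    # VGG block: same-size conv, then a 2×2 pool with stride 2.
    ("Classic VGG block",
     conv_output_size(conv_output_size(224, 3, p=1, s=1), 2, p=0, s=2)),
]
for name, size in rows:
    print(f"{name}: {size}")
```

Integer division (`//`) plays the role of the floor in the formula, which matters whenever I − K + 2P is not a multiple of S.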
Quick Check
Input 64, kernel 5, padding 2, stride 2. Output?
Pooling — Down-sampling Without Parameters
Stride-2 convolution is one way to halve the spatial resolution; pooling is the other. A pooling layer slides a window across the input and reduces each window to a single number, typically the max or the mean, with zero learnable parameters. That is by design: the sub-sampling layers of LeCun's original LeNet-5 played the same role (they averaged each window and added only a trainable scale and bias per feature map), building some invariance to small shifts into the representation while adding almost no capacity.
Max vs. average pool
| Variant | Reduction | What it keeps | Typical use |
|---|---|---|---|
| Max pool | max of the window | the strongest activation only | default in most classical CNNs (VGG, AlexNet, ResNet's first block) |
| Average pool | mean of the window | an average signal | smoother, less noisy; preferred when you do NOT want to throw information away |
| Global average pool | mean over the WHOLE feature map | one scalar per channel | replaces large FC classifier heads (GoogLeNet, ResNet, and most modern CNN classifiers) |
| Adaptive pool | output size fixed; window size computed | user-specified output shape | useful when input sizes vary |
Why max pool 'works'
Pooling by hand — Python and PyTorch
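The original listing for this subsection is not reproduced here, so the following is a minimal pure-Python sketch of the same idea. `pool2d` and `global_average_pool` are hypothetical helpers, and the 4×4 `FEATURE_MAP` values are made up for illustration.

```python
FEATURE_MAP = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]


def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Max or average pooling over a single-channel 2-D feature map."""
    out_rows = (len(feature_map) - window) // stride + 1
    out_cols = (len(feature_map[0]) - window) // stride + 1
    output = []
    for oi in range(out_rows):
        row = []
        for oj in range(out_cols):
            values = [feature_map[oi * stride + m][oj * stride + n]
                      for m in range(window) for n in range(window)]
            row.append(max(values) if mode == "max" else sum(values) / len(values))
        output.append(row)
    return output


def global_average_pool(feature_map):
    """Collapse the whole map to one scalar (one value per channel)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


print(pool2d(FEATURE_MAP, mode="max"))   # [[6, 4], [7, 9]]
print(pool2d(FEATURE_MAP, mode="avg"))   # [[3.75, 2.25], [4.0, 4.5]]
print(global_average_pool(FEATURE_MAP))  # 3.625
```

Neither function has anything to train: the only choices are the window size, the stride, and the reduction.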
Stride vs. pool: a modern debate
Receptive Field — Why Depth Matters
Each output unit of a conv layer depends on only a small input neighbourhood. The receptive field of a unit is the set of input pixels that can affect its value. Stacking conv layers grows the receptive field layer by layer — which is the fundamental reason CNN depth helps.
How receptive field grows
For a stack of $L$ conv layers with kernel size $K_l$ and stride $S_l$ at layer $l$, the receptive field after $L$ layers is:
$$RF_L = 1 + \sum_{l=1}^{L} (K_l - 1) \prod_{i=1}^{l-1} S_i$$
where $S_i$ is the stride of layer $i$. For an all-stride-1 stack of 3×3 convs the product collapses to 1 and the formula simplifies to $RF_L = 1 + 2L$.
| After layer | Receptive field (3×3 stride-1 stack) |
|---|---|
| 1 | 3×3 |
| 2 | 5×5 |
| 3 | 7×7 |
| 4 | 9×9 |
| 5 | 11×11 |
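The table follows from a short running computation: keep track of the receptive field and of the "jump" (the product of all strides so far), and let each layer add (K − 1) × jump. The sketch below assumes a hypothetical helper, `receptive_field`, that takes the layers as (kernel_size, stride) pairs.

```python
def receptive_field(layers):
    """Receptive field of a conv stack given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # a single input pixel; adjacent outputs are 1 pixel apart
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth is scaled by earlier strides
        jump *= stride
    return rf


for depth in range(1, 6):
    print(depth, receptive_field([(3, 1)] * depth))  # 3, 5, 7, 9, 11

# A stride-2 layer makes every later layer grow the field twice as fast.
print(receptive_field([(3, 2), (3, 1)]))  # 7
```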
Why VGG uses stacks of 3×3: Two 3×3 convs give the same receptive field as one 5×5 conv, but with fewer parameters ($18C^2$ vs $25C^2$ for $C$ input and output channels) and one extra non-linearity in between. Three 3×3 convs match a 7×7 receptive field with roughly 45% fewer parameters ($27C^2$ vs $49C^2$).
Simonyan & Zisserman, 2015, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR. This is one of the central design lessons of modern CNNs.
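The parameter comparison is simple to verify. In the sketch below, `conv_stack_weights` is a hypothetical helper that counts weights only (biases ignored) and assumes every layer has the same number of input and output channels; C = 64 is an arbitrary choice.

```python
def conv_stack_weights(kernel_size, num_layers, channels):
    """Weight count of num_layers K×K convs with `channels` in and out."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64
two_3x3 = conv_stack_weights(3, 2, C)    # 18 * C^2
one_5x5 = conv_stack_weights(5, 1, C)    # 25 * C^2
three_3x3 = conv_stack_weights(3, 3, C)  # 27 * C^2
one_7x7 = conv_stack_weights(7, 1, C)    # 49 * C^2

print(two_3x3, one_5x5, round(two_3x3 / one_5x5, 2))      # ratio 0.72
print(three_3x3, one_7x7, round(three_3x3 / one_7x7, 2))  # ratio 0.55
```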
Interactive: walk through layers, watch the field grow
[Interactive widget: receptive field growth on a 7×7 input image, stepping through successive conv layers.]
Receptive field recurrence: $RF_n = RF_{n-1} + (K_n - 1) \prod_{i=1}^{n-1} S_i$, starting from $RF_0 = 1$. With 3×3 kernels and stride 1, the receptive field grows by 2 pixels per layer (1 → 3 → 5 → 7).
Effective receptive field
PyTorch nn.Conv2d Anatomy
Every hyperparameter you have met so far maps directly to an argument of nn.Conv2d:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=64,   # 64 filters → 64 output feature maps
    kernel_size=3,     # 3×3 spatial window
    stride=1,          # move 1 pixel per step
    padding=1,         # "same" padding for K=3
    dilation=1,        # 1 = ordinary conv; >1 = atrous/dilated
    groups=1,          # 1 = full multi-channel; C_in = depthwise
    bias=True,         # learnable bias per output channel
)

print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([64])

x = torch.randn(8, 3, 224, 224)  # 8 RGB images, 224×224
y = conv(x)
print(y.shape)  # torch.Size([8, 64, 224, 224]); same spatial size as the input
```

| Argument | Meaning | Typical value |
|---|---|---|
| in_channels | C_in of the input tensor | 3 (RGB) or whatever the previous layer produced |
| out_channels | number of learnable kernels (= number of output feature maps) | 32, 64, 128, 256, … |
| kernel_size | spatial size of each kernel; int or (h, w) tuple | 3 is the modern default |
| stride | pixels moved per step; int or tuple | 1 normally; 2 for down-sampling |
| padding | zeros added around the input; int, tuple, or "same" | (K−1)/2 for same-size |
| dilation | spacing between kernel elements — makes the receptive field bigger without more params | 1 (normal); 2, 4 for dilated / atrous conv |
| groups | splits channels into independent groups; groups=C_in gives depthwise conv | 1 (default); C_in for depthwise-separable |
| bias | add a learnable per-channel bias | True — unless followed by BatchNorm, which has its own shift |
Dilated and depthwise conv — a brief preview
- Dilated (atrous) convolutions (dilation>1) leave holes between kernel samples, enlarging the receptive field without down-sampling or adding parameters, which is crucial for semantic segmentation.
- Depthwise-separable convolutions (groups=C_in followed by a 1×1 conv) dramatically cut FLOPs and are the basis of MobileNet and Xception.
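Both ideas reduce to short arithmetic. The sketch below is illustrative only: the helper names are hypothetical, weights are counted without biases, and the channel sizes (64 in, 128 out) are arbitrary.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: D * (K - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def standard_conv_weights(c_in, c_out, k):
    """Weights of an ordinary K×K conv (groups=1)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(c_in, c_out, k):
    """Weights of a depthwise K×K conv followed by a 1×1 pointwise conv."""
    depthwise = k * k * c_in  # one K×K filter per input channel (groups=C_in)
    pointwise = c_in * c_out  # the 1×1 conv mixes information across channels
    return depthwise + pointwise


# A 3×3 kernel with dilation 2 spans 5 pixels but still has 9 weights.
print(effective_kernel_size(3, dilation=2))      # 5
print(standard_conv_weights(64, 128, 3))         # 73728
print(depthwise_separable_weights(64, 128, 3))   # 8768
```

For these sizes the separable version needs roughly 8 times fewer weights than the standard convolution.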
Manual Implementation (For Reference)
Putting every piece together — stride, padding, multi-channel, bias — into one explicit function. Reading this code is the fastest way to convince yourself that nn.Conv2d holds no mysteries.
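The original listing is not reproduced in this section, so what follows is a minimal sketch of such a function under stated assumptions rather than the chapter's own code. `conv2d_manual` is a hypothetical name; it works on nested Python lists shaped [C_in][H][W] for a single image, with weights shaped [C_out][C_in][K_h][K_w], and supports zero padding only. The 5×5 test image is an assumed example.

```python
def conv2d_manual(x, weight, bias=None, stride=1, padding=0):
    """Multi-channel 2-D cross-correlation with stride, zero padding, and bias."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out, k_h, k_w = len(weight), len(weight[0][0]), len(weight[0][0][0])

    # Zero-pad every input channel on all four sides.
    padded = [[[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
              for _ in range(c_in)]
    for c in range(c_in):
        for i in range(h):
            for j in range(w):
                padded[c][i + padding][j + padding] = x[c][i][j]

    out_h = (h - k_h + 2 * padding) // stride + 1
    out_w = (w - k_w + 2 * padding) // stride + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    for o in range(c_out):                 # one output map per filter
        for i in range(out_h):
            for j in range(out_w):
                total = bias[o] if bias is not None else 0.0
                for c in range(c_in):      # sum over input channels
                    for m in range(k_h):
                        for n in range(k_w):
                            total += (padded[c][i * stride + m][j * stride + n]
                                      * weight[o][c][m][n])
                out[o][i][j] = total
    return out


# Assumed single-channel 5×5 image and one Sobel-X filter.
image = [[[10 * (i + 1) * (j + 1) for j in range(5)] for i in range(5)]]
sobel_x = [[[[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]]]
print(conv2d_manual(image, sobel_x, stride=2))
```

The six nested loops are the point: every output value is a bias plus a sum over input channels and kernel positions, and stride and padding only change which input pixels each window reads.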
This is slow on purpose
Putting It All Together: A Full CNN Pipeline
We can now read the full CNN pipeline end to end. Conv layers extract features; pooling (or strided conv) down-samples; the receptive field grows with depth; the final pooled/flattened vector feeds a classifier. The interactive below ties every stage to the concepts of this chapter.
[Interactive widget: 2D convolution and pooling walkthrough, showing a 3×3 kernel sliding over the input to produce a feature map, followed by 2×2 max pooling of that feature map.]
Convolution output sizes for a 5×5 input and a 3×3 kernel:
- Without padding (P=0): output size = (5−3)/1 + 1 = 3, i.e. 3×3
- With padding (P=1): output size = (5+2−3)/1 + 1 = 5, i.e. 5×5 (same as input)
- Stride=2: the kernel moves 2 pixels at a time, reducing the output size
Max pooling:
- Operation: takes the maximum value from each 2×2 window
- Effect: keeps the strongest activations and provides some translation invariance
- Use case: the most common pooling choice in CNNs (VGG, ResNet, etc.)
- Output size: ⌊5/2⌋ = 2, i.e. 2×2
Output size formula: $O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$
Feature learning vs. classification
AI / Deep Learning Applications
Every component of this chapter — conv, stride, padding, pooling, receptive field — is used in the production systems below.
Object detection (YOLO, Faster R-CNN)
A CNN backbone extracts features at multiple scales; a head predicts bounding boxes and class probabilities at each spatial location. The increasing receptive field with depth is what lets the network reason about whole objects while still operating on convolutional feature maps.
Semantic segmentation (U-Net)
Every pixel gets classified. An encoder of conv + pool layers compresses the image; a decoder of (transposed) conv layers restores spatial resolution; skip connections re-inject fine detail. Medical-image segmentation adopted this architecture wholesale.
Neural style transfer
The feature maps of a pre-trained CNN let us separate an image into “content” (responses of deeper layers, roughly object identity) and “style” (Gram-matrix statistics of shallower layers, roughly texture). Optimising an image so its content matches one reference and its style matches another yields Van Gogh-ified photos.
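The Gram matrix mentioned above is just the table of inner products between flattened channels. Here is a minimal, un-normalised sketch; `gram_matrix` is a hypothetical helper, the tiny two-channel input is made up, and real style-transfer code usually divides by a size-dependent constant.

```python
def gram_matrix(features):
    """C×C Gram matrix of a feature tensor given as nested lists [C][H][W]."""
    # Flatten each channel into one long vector of H*W activations.
    flat = [[v for row in channel for v in row] for channel in features]
    # Entry (i, j) is the inner product of channel i with channel j.
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]


features = [
    [[1, 2], [3, 4]],  # channel 0
    [[0, 1], [0, 1]],  # channel 1
]
print(gram_matrix(features))  # [[30, 6], [6, 2]]
```

Because the spatial positions are summed out, the result records which channels fire together but not where, which is why it behaves like a texture descriptor.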
Generative image models
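DCGAN-style generators use transposed (fractionally-strided) convolutions to go from low-resolution noise or latent tensors to a full image. StyleGAN and the U-Nets inside diffusion models up-sample in a similar stage-by-stage fashion, in some implementations with interpolation followed by an ordinary convolution rather than a transposed one. Either way, this is the inverse direction of the down-sampling pipeline we have just built.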
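A 1-D sketch shows how a transposed convolution up-samples: each input value stamps a scaled copy of the kernel into the output, and the stride sets how far apart the stamps land. `conv_transpose1d` is a hypothetical helper written for illustration (no padding, single channel), not a library API.

```python
def conv_transpose1d(x, kernel, stride=1):
    """Unpadded 1-D transposed convolution; output length is (I - 1) * S + K."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for m in range(k):
            # Each input sample adds a scaled kernel at offset i * stride.
            out[i * stride + m] += value * kernel[m]
    return out


# With stride 2 and a kernel of two ones, every sample is simply duplicated.
print(conv_transpose1d([1, 2, 3], [1, 1], stride=2))  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

In a trained generator the kernel values are learned, so the layer can produce something smoother or sharper than plain duplication.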
A profound empirical observation
If you visualise the first-layer kernels of a trained ImageNet CNN you see oriented edge detectors, colour blobs, and Gabor-like frequency patterns, strikingly close to what electrophysiology finds in V1 of mammals. Nobody trained the network to learn Gabor filters. Gradient descent discovered them from the data.
Zeiler & Fergus, 2014, “Visualizing and Understanding Convolutional Networks”, ECCV; Hubel & Wiesel, 1962, J. Physiology 160(1); Olshausen & Field, 1996, Nature 381.
Summary
Key concepts
| Concept | Definition | Why it matters |
|---|---|---|
| Convolution | sliding weighted sum, shared weights | the core feature-extraction operation |
| Kernel / filter | small tensor of learnable weights | each kernel learns one feature detector |
| Feature map | the output of a conv layer | indicates where a feature appears in the input |
| Stride | kernel step size | controls output resolution and compute cost |
| Padding | zeros (or reflection, etc.) added at the border | lets the kernel reach edge pixels; preserves spatial size |
| Pooling | max/avg reduction over a window | down-sampling with zero parameters; small-shift invariance |
| Receptive field | set of input pixels influencing one output | grows with depth; motivates stacking small kernels |
Critical formulas
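- 2-D cross-correlation (stride 1, input $X$, kernel $W$, output $Y$): $Y[i,j] = \sum_{m=0}^{K-1}\sum_{n=0}^{K-1} X[i+m,\,j+n]\,W[m,n]$
- Output size: $O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$
- Parameter count: $K \cdot K \cdot C_{in} \cdot C_{out} + C_{out}$ (weights plus one bias per output channel)
- Receptive field for a stride-1, $K$-kernel stack of $L$ layers: $RF_L = 1 + L(K-1)$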
Exercises
Conceptual
- A 128×128×3 image passes through nn.Conv2d(3, 32, kernel_size=5, padding=2, stride=2). What is the output shape and how many parameters does the layer have?
- Explain in one sentence why Sobel-X, with weights [[-1,0,1],[-2,0,2],[-1,0,1]], detects vertical edges rather than horizontal ones.
- You need to preserve spatial size with a 7×7 kernel at stride 1. What padding?
- Two stacked 3×3 convs vs. one 5×5 conv: same receptive field. Give two reasons to prefer the stacked version.
- Max pool vs. stride-2 conv: when would you reach for each?
Hints
- O = ⌊(128 − 5 + 4) / 2⌋ + 1 = 64 → output [B, 32, 64, 64]. Params: 5×5×3×32 + 32 = 2,432.
- Sobel-X computes the horizontal intensity difference, and a vertical edge is precisely a location where brightness changes horizontally.
- P = (K − 1)/2 = (7 − 1)/2 = 3.
- (i) Fewer parameters (18C² vs. 25C² weights for C channels in and out); (ii) one extra non-linearity in between, which makes the learned mapping more discriminative.
- Max pool: no params, robust small-shift invariance, cheap. Stride conv: learnable, can shape the down-sampling to the task.
Coding
- Edge magnitude. Apply Sobel-X and Sobel-Y to an image and combine as $\sqrt{G_x^2 + G_y^2}$.
- Box vs. Gaussian. Apply both to a noisy photograph and explain why Gaussian looks more natural.
- Verify the formula. Write a test that varies I, K, P, S and asserts that the F.conv2d output shape matches $\lfloor (I - K + 2P)/S \rfloor + 1$.
- Implement im2col. Rewrite our manual conv as a single matrix multiplication via unfold. Measure the speedup on a 64×64 input with 128 filters.
References
All factual claims in this chapter are drawn from the following primary sources.
- Hubel, D. H. and Wiesel, T. N. (1962). “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex.” Journal of Physiology, 160(1), 106–154.
- Olshausen, B. A. and Field, D. J. (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature, 381, 607–609.
- LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), 2278–2324.
- Chellapilla, K., Puri, S. and Simard, P. (2006). “High performance convolutional neural networks for document processing.” Intl. Workshop on Frontiers in Handwriting Recognition.
- Scherer, D., Müller, A. and Behnke, S. (2010). “Evaluation of pooling operations in convolutional architectures for object recognition.” ICANN.
- Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks.” NeurIPS.
- Lin, M., Chen, Q. and Yan, S. (2014). “Network in network.” ICLR.
- Zeiler, M. D. and Fergus, R. (2014). “Visualizing and understanding convolutional networks.” ECCV.
- Simonyan, K. and Zisserman, A. (2015). “Very deep convolutional networks for large-scale image recognition.” ICLR.
- Szegedy, C. et al. (2015). “Going deeper with convolutions” (GoogLeNet / Inception). CVPR.
- Springenberg, J. T., Dosovitskiy, A., Brox, T. and Riedmiller, M. (2015). “Striving for simplicity: the all convolutional net.” ICLR workshop.
- Ronneberger, O., Fischer, P. and Brox, T. (2015). “U-Net: convolutional networks for biomedical image segmentation.” MICCAI.
- He, K., Zhang, X., Ren, S. and Sun, J. (2016). “Deep residual learning for image recognition.” CVPR.
- Yu, F. and Koltun, V. (2016). “Multi-scale context aggregation by dilated convolutions.” ICLR.
- Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016). “You only look once: unified, real-time object detection.” CVPR.
- Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2016). “Understanding the effective receptive field in deep convolutional neural networks.” NeurIPS.
- Gatys, L. A., Ecker, A. S. and Bethge, M. (2016). “Image style transfer using convolutional neural networks.” CVPR.
- Radford, A., Metz, L. and Chintala, S. (2016). “Unsupervised representation learning with deep convolutional generative adversarial networks.” ICLR.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press. (Chapter 9 — Convolutional Networks — is the standard graduate reference.)
- Dumoulin, V. and Visin, F. (2016). “A guide to convolution arithmetic for deep learning.” arXiv:1603.07285.
- Chollet, F. (2017). “Xception: deep learning with depthwise separable convolutions.” CVPR.
- Howard, A. G. et al. (2017). “MobileNets: efficient convolutional neural networks for mobile vision applications.” arXiv:1704.04861.
- Ren, S., He, K., Girshick, R. and Sun, J. (2017). “Faster R-CNN: towards real-time object detection with region proposal networks.” IEEE TPAMI, 39(6), 1137–1149.
- Karras, T., Laine, S. and Aila, T. (2019). “A style-based generator architecture for generative adversarial networks.” CVPR.
- Ho, J., Jain, A. and Abbeel, P. (2020). “Denoising diffusion probabilistic models.” NeurIPS.
With stride, padding, pooling, and receptive fields now firmly in hand, we can move from individual layers to full CNN architectures. In Chapter 14 we build LeNet-5, VGG, and ResNet from scratch, then use transfer learning to adapt pre-trained models to new tasks.