Introduction
§10.4 closed with a promise: “max-pooling, average-pooling, and even modern alternatives like strided convolutions are variations on a single theme — a stride-S window reducer.” This section pays that off. We will show that the whole family of pooling operators is specified by two choices: the window geometry (which inherits the formula from §10.3), and a reducer function applied inside each window. Swap the reducer and you traverse the whole design space: max, average, Lp, stochastic, fractional, mixed, and eventually the all-convolutional replacement.
The Core Insight: Pooling is a deliberately parameter-free, non-linear downsampler. It trades a tiny amount of expressiveness for three concrete wins: smaller feature maps, approximate translation invariance, and a bigger effective receptive field for every downstream layer. The famous “complex cells” of Hubel & Wiesel that §10.1 introduced map almost directly onto this operator.
You already met pooling informally — the animation in §10.2's CNN2DConvolutionVisualizer halves a 7×7 feature map with a 2×2 max-pool, and every CNN in Chapter 11 (LeNet, AlexNet, VGG) will lean on it. Here we open the hood: what each variant does mathematically, how gradients route through it, and why modern architectures have been steadily replacing it with strided convolutions.
Learning Objectives
After working through this section, you will be able to:
- Implement max-pool and average-pool from scratch in NumPy and reproduce PyTorch's output byte-for-byte.
- Predict output shapes using $O = \lfloor (I + 2P - K)/S \rfloor + 1$ for any kernel, stride, and padding.
- Derive the backward pass for both max and average pool and verify it against torch.autograd.
- Explain Global Average Pooling and why it replaced fully-connected heads in GoogLeNet, ResNet, and most modern CNN classifiers.
- Use nn.AdaptiveAvgPool2d to make a network accept variable input sizes.
- Argue, with citations, why ResNet and ConvNeXt replaced most or all max-pooling with stride-2 convolutions.
The Four Jobs Pooling Does
Before any maths, let us be honest about why pooling exists. It is not a decorative layer: it does four concrete jobs, and if you remove it you must replace it with something that does the same jobs. Identifying these jobs lets us reason about what the modern stride-2 conv actually replaces.
| Job | What it buys | Typical alternative |
|---|---|---|
| 1. Spatial downsampling | Halving H and W at each stage keeps compute tractable — 224×224 → 7×7 over 5 stages. | Stride-2 convolution (Springenberg 2015 [10]) |
| 2. Approximate translation invariance | A one-pixel shift of the input is mostly absorbed by the reducer — the same feature still fires within its window. | Data augmentation + architectural choices (GAP) |
| 3. Enlarging the effective receptive field | Two stride-2 pools multiply the receptive field's growth rate by 4 (§10.3 recursion). | Dilated convolutions (DeepLab [§10.3 R5]) |
| 4. Cheapening compute | Halves compute in the next layer without adding parameters. LeNet-5 depended on this in 1998 [1]. | Depthwise-separable convs (MobileNet, §10.4) |
The complex-cell analogy, made precise
Max Pooling
The most common pooling variant, and the one you will see in AlexNet, VGG, and every textbook diagram. For a window of size $K \times K$ anchored at output position $(i, j)$, with stride $S$:
$$y_{i,j} = \max_{0 \le m,\, n < K} \; x_{iS+m,\; jS+n}$$
Intuition: “report the loudest activation in this neighbourhood.” Under a ReLU activation (§5.3), pixel values represent how strongly a learned feature fires — so max-pool asks whether the feature is present, not exactly where. This is Boureau, Ponce & LeCun (2010) [3]'s formal argument for why max-pool works well on sparse codes: when most cells are zero, the max of a window effectively detects presence.
Hand-Computation on a 4×4 Feature Map
Let us do a full worked example on the smallest non-trivial input. The 4×4 grid below is exactly the one you will see animated in the Pooling Playground further down, so you can match each hand-computed value against the live widget.
| Window (i,j) | Cells | Max |
|---|---|---|
| (0, 0) | [[1, 3], [5, 6]] | 6 |
| (0, 1) | [[2, 4], [1, 2]] | 4 |
| (1, 0) | [[7, 2], [4, 8]] | 8 |
| (1, 1) | [[3, 1], [5, 6]] | 6 |
Result: [[6, 4], [8, 6]], a 2×2 output from a 4×4 input, exactly as the master formula predicts: $\lfloor (4 + 2 \cdot 0 - 2)/2 \rfloor + 1 = 2$.
Max Pool in Pure Python
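The following is a minimal sketch of a from-scratch max-pool, not a reference implementation. It assumes no padding and a square window, and the 4×4 grid is reconstructed from the window table in the hand computation above. The function name max_pool2d and its signature are illustrative choices.

```python
def max_pool2d(x, K=2, S=2):
    """Max-pool a 2-D list with a KxK window and stride S (no padding)."""
    H, W = len(x), len(x[0])
    H_out = (H - K) // S + 1  # master formula with P = 0
    W_out = (W - K) // S + 1
    out = []
    for i in range(H_out):
        row = []
        for j in range(W_out):
            # Gather the K*K cells covered by window (i, j)
            window = [x[i * S + m][j * S + n] for m in range(K) for n in range(K)]
            row.append(max(window))
        out.append(row)
    return out


# 4x4 grid reconstructed from the window table (assumed layout)
grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
]
pooled = max_pool2d(grid)
print(pooled)  # [[6, 4], [8, 6]]
```

The printed result matches the Max column of the table row by row.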
Max Pool in PyTorch
Quick Check
A 32×32 feature map is fed into nn.MaxPool2d(kernel_size=3, stride=2, padding=1). What is the output spatial size?
Average Pooling
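The other classical reducer. For the same $K \times K$ window at output position $(i, j)$:
$$y_{i,j} = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{iS+m,\; jS+n}$$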
LeCun et al.'s original LeNet-5 [1] actually used a learnable scaled average pool (one coefficient and one bias per feature map), but the modern convention settled on plain unweighted averaging. Average-pool preserves the total energy of the window and, unlike max, routes gradient to every input cell — a property that will matter a great deal in the Backward Pass section below.
Average Pool in Pure Python
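As a hedged sketch, the same loop structure used for max-pool works for average-pool once the reducer is swapped for a mean. The example assumes no padding (so count_include_pad plays no role) and reuses the 4×4 grid reconstructed from the hand-computation table. The name avg_pool2d is illustrative.

```python
def avg_pool2d(x, K=2, S=2):
    """Average-pool a 2-D list with a KxK window and stride S (no padding)."""
    H, W = len(x), len(x[0])
    H_out = (H - K) // S + 1
    W_out = (W - K) // S + 1
    out = []
    for i in range(H_out):
        row = []
        for j in range(W_out):
            window = [x[i * S + m][j * S + n] for m in range(K) for n in range(K)]
            row.append(sum(window) / (K * K))  # unweighted mean over the window
        out.append(row)
    return out


# Same 4x4 grid as the max-pool hand computation (assumed layout)
grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
]
averaged = avg_pool2d(grid)
print(averaged)  # [[3.75, 2.25], [5.25, 3.75]]
```

Every output is pulled below the corresponding max (6, 4, 8, 6) because the quieter cells in each window now contribute.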
Average Pool in PyTorch
count_include_pad — the silent bug
PyTorch's default is count_include_pad=True; TensorFlow's default is the equivalent of False. Porting average-pool with padding between frameworks without flipping this flag silently changes every border output. The PyTorch docs (torch.nn.AvgPool2d) spell out the formula.
The Master Pooling Formula
Pooling is a convolution that reduces instead of weight-summing, so it inherits §10.3's master formula. Set dilation to 1 and you have:
$$O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1$$
with the same symbol dictionary: $I$ = input size, $K$ = kernel size, $P$ = padding per side, $S$ = stride, and $O$ = output size. The formula applies identically to max, avg, Lp, stochastic and mixed pooling: only the reducer changes, never the geometry.
| Setting | Formula | Output |
|---|---|---|
| I=28, K=2, S=2, P=0 (LeNet-style halving) | (28−2)/2+1 | 14 |
| I=7, K=2, S=2, P=0 (on a 7×7 fmap, drops last row/col) | (7−2)//2+1 | 3 |
| I=7, K=7, S=1, P=0 (Global pool on 7×7 fmap) | (7−7)/1+1 | 1 |
| I=32, K=3, S=2, P=1 (overlapping 3×3 pool with padding, as after the ResNet stem) | (32+2−3)//2+1 | 16 |
| I=4, K=2, S=1, P=0 (overlapping non-downsampling) | (4−2)/1+1 | 3 |
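The rows of the table can be checked with a few lines of Python. This is a sketch under stated assumptions: pool_output_size is a hypothetical helper name, and the ceil_mode branch (including the guard that drops a window starting beyond the input plus left padding) reflects my understanding of PyTorch's behaviour rather than a quotation of its source.

```python
import math


def pool_output_size(I, K, S, P=0, ceil_mode=False):
    """Output length along one axis for a pooling layer with dilation 1."""
    span = I + 2 * P - K
    if not ceil_mode:
        return span // S + 1  # floor division implements the floor in the formula
    O = math.ceil(span / S) + 1
    # Assumed PyTorch-style guard: the last window must start
    # inside the input or the left padding, otherwise drop it.
    if (O - 1) * S >= I + P:
        O -= 1
    return O


# (I, K, S, P) settings taken from the table above
settings = [(28, 2, 2, 0), (7, 2, 2, 0), (7, 7, 1, 0), (32, 3, 2, 1), (4, 2, 1, 0)]
sizes = [pool_output_size(*s) for s in settings]
print(sizes)  # [14, 3, 1, 16, 3]
print(pool_output_size(7, 2, 2, ceil_mode=True))  # 4
```

Writing the division as `//` matters for the odd cases: (32 + 2 − 3) / 2 is 15.5, and only the floor gives the correct 16.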
Ceil mode in a sentence
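ceil_mode=True replaces the floor with a ceiling in the formula above, so a 7×7 input with K=2, S=2 produces a 4×4 output instead of 3×3. The extra row/column reaches past the edge of the input: max-pool treats those out-of-bounds cells as $-\infty$ (so the max ignores them), while for avg-pool their effect on the divisor depends on the padding and on count_include_pad, so check the torch.nn.AvgPool2d docs before relying on border values. Most modern code leaves this as False.
Interactive: Pooling Playground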
Switch between max and average modes, step through the windows, and watch how the output builds up cell by cell. The 4×4 grid matches the hand-computation above; the 6×6 grid lets you see a longer sliding pattern.
Max Pooling Visualizer
Quick Check
With the 4×4 grid in Max Pool mode, why does output cell (1, 0) display the value 8?
Max vs Average — When to Use Which
The choice is not arbitrary; there is a real theoretical and empirical literature. Boureau, Ponce & LeCun (2010) [3] analysed pooling on sparse codes under simple Bernoulli activation models, and Scherer, Müller & Behnke (2010) [4] ran a direct bake-off on object-recognition benchmarks. Their findings cohere:
| Dimension | Max pooling wins when… | Average pooling wins when… |
|---|---|---|
| Feature sparsity | Features are sparse — most cells are zero or near zero. Max detects presence. | Features are dense and every cell carries signal you want to integrate. |
| Network position | Early/mid layers that look for local patterns (edges, textures, parts). | Final layer head — GAP preserves the C-dim signature of the whole image [7]. |
| Gradient flow | Acceptable only when training data is plentiful — many cells get zero gradient. | Every cell gets (1/K²) of the upstream gradient — kinder in small-data regimes. |
| Translation invariance | Slightly better: max is unchanged by any shift that keeps the argmax in the window. | Smoother but also slightly less invariant — the mean shifts continuously. |
| Empirical benchmark [4] | Object recognition on NORB, CIFAR-10 (Scherer 2010). | Medical image segmentation heads, audio classifiers (post-2016 literature). |
Rule of thumb
Worked Example — Continuing the 7×7 from §10.2
§10.2 animated a 7×7 feature map sliding under a 3×3 kernel inside CNN2DConvolutionVisualizer. We reuse exactly that matrix here so you can track specific cells through the conv → pool pipeline end-to-end.
Continuity across sections
Global Average Pooling
If you set the pooling window equal to the full spatial extent of the feature map, the output is one number per channel — a C-dimensional vector. This is Global Average Pooling (GAP), introduced by Lin, Chen & Yan in Network-in-Network (2014) [7] and immediately adopted by GoogLeNet (2015) [8] and then every ResNet-style architecture [9].
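A minimal NumPy sketch of GAP, assuming the usual (N, C, H, W) layout; global_avg_pool is an illustrative name. The point it shows is that the output shape depends only on N and C, never on the spatial size.

```python
import numpy as np


def global_avg_pool(x):
    """Reduce an (N, C, H, W) array to (N, C) by averaging over H and W."""
    return x.mean(axis=(2, 3))


rng = np.random.default_rng(0)
small = rng.standard_normal((2, 8, 7, 7))    # 7x7 feature maps
large = rng.standard_normal((2, 8, 12, 12))  # 12x12 feature maps

print(global_avg_pool(small).shape)  # (2, 8)
print(global_avg_pool(large).shape)  # (2, 8)
```

Because both inputs collapse to the same (2, 8) shape, a linear classifier placed after this step does not need to know the input resolution.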
Why GAP replaced the giant FC head
The AlexNet (2012) head looked like Flatten(9216) → FC(4096) → FC(4096) → FC(1000), roughly 58 million parameters just in the head (9216·4096 + 4096·4096 + 4096·1000 ≈ 58.6M weights). GoogLeNet swapped this for GAP(1024) → FC(1000), i.e. ~1 million parameters, and still reached higher ImageNet accuracy than AlexNet, with far less overfitting risk in the head. Four structural wins:
- Zero parameters in the pool itself.
- Works for any input size. GAP always produces a C-dim vector regardless of H and W, so the classifier downstream is size-agnostic.
- Acts as a structural regulariser. Lin et al. (2014) [7] note that GAP forces each feature map to correspond to a class, because the classifier can only average over it.
- Localisation comes for free. Class Activation Maps (Zhou et al. 2016; reference 9 of §10.6) use the fact that GAP followed by a linear layer is a weighted sum of feature maps to highlight which spatial cells drove the prediction.
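The parameter comparison can be reproduced with simple arithmetic. This sketch assumes AlexNet's flattened feature size is 256×6×6 = 9216 and counts one bias per output unit; linear_params is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


# AlexNet-style dense head: Flatten(9216) -> FC(4096) -> FC(4096) -> FC(1000)
alexnet_head = (
    linear_params(9216, 4096)
    + linear_params(4096, 4096)
    + linear_params(4096, 1000)
)

# GoogLeNet-style head: GAP(1024) -> FC(1000); the pool itself has no parameters
googlenet_head = linear_params(1024, 1000)

print(alexnet_head)    # 58631144
print(googlenet_head)  # 1025000
print(round(alexnet_head / googlenet_head, 1))  # 57.2
```

Almost two thirds of the dense head's parameters sit in the first FC layer, which is exactly the layer GAP makes unnecessary.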
The modern classifier head template
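features → nn.AdaptiveAvgPool2d(1) → nn.Flatten() → nn.Linear(C, num_classes). Up to small details such as dropout or a normalisation layer, this is the head of torchvision's ResNet, MobileNet, ConvNeXt and EfficientNet models. (ViT is the exception: it classifies from a class token instead.)
Adaptive Pooling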
Adaptive pooling is the generalisation that enables GAP: you specify the output size, and PyTorch picks the kernel and stride schedule for you. This removes the hard-coded 7×7 feature-map assumption that plagued early CNN code and is the reason torchvision models accept any reasonable input resolution.
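For an input of height $H_{in}$ and a target height $H_{out}$, PyTorch's rule for output cell $i$ (with $0 \le i < H_{out}$) is:
$$\text{start}_i = \left\lfloor \frac{i \cdot H_{in}}{H_{out}} \right\rfloor, \qquad \text{end}_i = \left\lceil \frac{(i+1) \cdot H_{in}}{H_{out}} \right\rceil$$
so that cell $i$ reduces over input rows $\text{start}_i, \dots, \text{end}_i - 1$. Two consequences: bins can have different widths when $H_{in}$ is not a multiple of $H_{out}$, and adjacent bins can overlap by one cell. Adaptive pool is therefore not a plain “pick the right kernel” reduction.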
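Both consequences are easy to see by listing the bins. The sketch below assumes the floor/ceil rule for bin boundaries (start = floor(i·H_in / H_out), end = ceil((i+1)·H_in / H_out), end exclusive) and uses integer arithmetic to avoid floating-point rounding. The name adaptive_bin_edges is illustrative.

```python
def adaptive_bin_edges(H_in, H_out):
    """Return (start, end) pairs, end exclusive, for each adaptive-pool output cell."""
    edges = []
    for i in range(H_out):
        start = (i * H_in) // H_out                      # floor(i * H_in / H_out)
        end = ((i + 1) * H_in + H_out - 1) // H_out      # ceil((i + 1) * H_in / H_out)
        edges.append((start, end))
    return edges


print(adaptive_bin_edges(8, 4))  # [(0, 2), (2, 4), (4, 6), (6, 8)]
print(adaptive_bin_edges(5, 3))  # [(0, 2), (1, 4), (3, 5)]
print(adaptive_bin_edges(7, 1))  # [(0, 7)]
```

With 8 → 4 the bins are a plain kernel-2, stride-2 pool. With 5 → 3 the middle bin is wider than its neighbours and shares one cell with each of them. With a target of 1 the single bin covers the whole axis, which is global pooling.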
The Backward Pass — How Gradients Route
Pooling has no parameters, but it still has a non-trivial backward pass: an upstream gradient must be routed back to the input cells. How it is routed depends entirely on the reducer.
Max pool — the argmax router
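The derivative of $y_{i,j}$ with respect to $x_{p,q}$ is 1 if $x_{p,q}$ is the max of window $(i,j)$, and 0 otherwise (technically a sub-gradient since max is non-smooth at ties, but ties are measure-zero in practice). Therefore, writing $\mathcal{W}_{i,j}$ for the set of input cells covered by window $(i,j)$:
$$\frac{\partial L}{\partial x_{p,q}} = \sum_{(i,j)} \frac{\partial L}{\partial y_{i,j}} \cdot \mathbb{1}\!\left[(p,q) = \arg\max_{(m,n) \in \mathcal{W}_{i,j}} x_{m,n}\right]$$
The consequence is striking: in a window of size $K \times K$, exactly one cell receives gradient and $K^2 - 1$ receive none. For our 4×4 example with 2×2 pools, 12 of 16 input cells get nothing.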
Average pool — the uniform splitter
Average pool is a linear operator, so its derivative is constant: each input cell of each window gets $1/K^2$ of the upstream gradient. No zero-gradient cells, no dead zones.
Backward Pass in Pure Python
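A minimal pure-Python sketch of both routing rules, assuming no padding and reusing the 4×4 grid reconstructed from the max-pool hand computation. The function names are illustrative, and ties in the max are broken by taking the first maximal cell, which is one possible convention.

```python
def max_pool_backward(x, grad_out, K=2, S=2):
    """Route each upstream gradient to the argmax cell of its window."""
    grad_in = [[0.0] * len(x[0]) for _ in range(len(x))]
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            cells = [(i * S + m, j * S + n) for m in range(K) for n in range(K)]
            p, q = max(cells, key=lambda c: x[c[0]][c[1]])  # first max wins on ties
            grad_in[p][q] += g
    return grad_in


def avg_pool_backward(x, grad_out, K=2, S=2):
    """Split each upstream gradient evenly over the K*K cells of its window."""
    grad_in = [[0.0] * len(x[0]) for _ in range(len(x))]
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            for m in range(K):
                for n in range(K):
                    grad_in[i * S + m][j * S + n] += g / (K * K)
    return grad_in


grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6],
]
upstream = [[1.0, 1.0], [1.0, 1.0]]  # pretend dL/dy is all ones

g_max = max_pool_backward(grid, upstream)
g_avg = avg_pool_backward(grid, upstream)
print(g_max)  # [[0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
print(g_avg[0])  # [0.25, 0.25, 0.25, 0.25]
```

The max-pool gradient is non-zero only at the positions of 6, 4, 8 and 6, leaving 12 of the 16 cells at zero, while the average-pool gradient is 0.25 everywhere. Both sum to 4, the total upstream gradient.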
Verification Against torch.autograd
Practical implication of the argmax router
Variants — Stochastic, Lp, Mixed, Fractional
The 2013–2015 literature explored many alternative reducers. None dethroned max and avg as defaults, but several are useful tools and they make the “single family, different reducer” framing concrete.
| Variant | Reducer | Original paper |
|---|---|---|
| Max pool | max of the window | LeNet-5 used subsampling [1]; max-pool popularised by AlexNet [Krizhevsky 2012] |
| Average pool | mean of the window | LeCun et al. (1998) [1] (learnable scaled version) |
| Stochastic pool | sample a cell, probability ∝ value | Zeiler & Fergus (2013) [5] |
| L^p pool | (mean(x^p))^(1/p) — interpolates avg (p=1) ↔ max (p→∞) | Sermanet, Chintala, LeCun (2013) [6] |
| Mixed pool | λ·max + (1−λ)·mean, λ random or learnable | Yu, Wang, Chen, Wei (2014) [12] |
| Fractional max pool | Non-integer kernel/stride — output ≈ input / √2 | Graham (2014) [11] |
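The interpolation claim for L^p pooling can be checked numerically. This sketch uses the mean-based form from the table, (mean(x^p))^(1/p), and assumes non-negative activations (for example after a ReLU); lp_pool is an illustrative name, and the window is the top-left 2×2 block of the hand-computation grid.

```python
def lp_pool(window, p):
    """Power mean of a window: p = 1 is the average, large p approaches the max."""
    return (sum(v ** p for v in window) / len(window)) ** (1.0 / p)


window = [1.0, 3.0, 5.0, 6.0]  # mean 3.75, max 6.0

values = {p: lp_pool(window, p) for p in (1, 2, 8, 64, 256)}
for p, v in values.items():
    print(p, round(v, 3))
```

The printed values start at the mean for p = 1 and increase monotonically towards the max as p grows, though the approach is slow: even at p = 256 the result is still slightly below 6.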
Fractional max-pool in PyTorch
PyTorch ships this as torch.nn.FractionalMaxPool2d. It draws random boundaries so the effective downsampling ratio is an arbitrary real number rather than an integer, which acts as a regulariser similar to stochastic pooling.
Pool vs Strided Convolution
The central architectural question of the late 2010s: if pooling just halves the spatial resolution, why not use a stride-2 convolution that halves it and learns what to keep? Springenberg, Dosovitskiy, Brox & Riedmiller (2015) [10] made this case explicitly in “Striving for Simplicity: The All Convolutional Net” and showed that replacing every max-pool with a stride-2 conv reaches equal or better accuracy on CIFAR-10 and ImageNet.
| Property | nn.MaxPool2d(2, 2) | nn.Conv2d(C, C, 3, stride=2, padding=1) |
|---|---|---|
| Learnable parameters | 0 | 9·C² + C |
| Output shape on (N, C, H, W) | (N, C, H/2, W/2) | (N, C, H/2, W/2) |
| Reducer | max (fixed, non-linear) | learned linear combination + activation |
| Gradient flow to input cells | 1 of K² cells per window | all cells via learned weights |
| Inductive bias | Strong — assumes max is informative | Weak — must be learned |
| Memory cost (forward) | cache of argmax indices | activations |
| Best use case | Early layers when data is limited | Modern architectures with enough data |
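The first two rows of the table reduce to arithmetic. The sketch below assumes a square 3×3 kernel with bias and a 56×56 input chosen purely for illustration; both helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def out_size(I, K, S, P):
    """Master formula with dilation 1."""
    return (I + 2 * P - K) // S + 1


# Parameter cost of replacing a parameter-free pool with a 3x3 stride-2 conv
costs = {C: conv2d_params(C, C, 3) for C in (64, 256)}
print(costs)  # {64: 36928, 256: 590080}

# Both layers halve an (assumed) 56x56 feature map
print(out_size(56, K=2, S=2, P=0))  # 28  (MaxPool2d(2, 2))
print(out_size(56, K=3, S=2, P=1))  # 28  (Conv2d(C, C, 3, stride=2, padding=1))
```

The cost grows quadratically with the channel count, which is why the swap is cheap in early, narrow layers and noticeably more expensive deeper in the network.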
Why Pooling Declined in Modern CNNs
Post-2016, every flagship architecture has moved max-pool out of the spine:
- ResNet (He et al. 2016) [9] uses stride-2 3×3 conv inside every transition block and keeps only a single MaxPool2d(3, 2, 1) after the stem. Residual connections mean the signal can skip past the learned downsampler, so the network never has to choose between preserving information and reducing resolution.
- DenseNet (2017) uses average pool in transition blocks, not max.
- Vision Transformer (Dosovitskiy et al. 2021) [14] contains no pooling at all. Tokenisation via a 16×16 stride-16 patch embedding simultaneously reduces resolution and learns a linear projection.
- ConvNeXt (Liu et al. 2022) [15], the explicitly ResNet-modernising architecture, uses separate 2×2 stride-2 conv layers for downsampling between stages and a “patchify stem” of 4×4 stride-4 conv up front. Its depthwise convolutions live inside the blocks and run at stride 1. Zero max-pool.
What has survived almost everywhere is global average pool as the head: cheap, parameter-free, size-agnostic, class-activation-map-friendly.
Takeaway: Pooling in the body of the network is an optional inductive-bias choice that trades parameters for regularisation. Pooling at the head (GAP) is nearly universal and there is no sign of it going away.
Design Patterns in Real Architectures
| Architecture | Pooling strategy | Reference |
|---|---|---|
| LeNet-5 (1998) | Learnable scaled avg-pool (2×2, stride 2) after each conv | LeCun et al. [1] |
| AlexNet (2012) | Overlapping max-pool (3×3, stride 2) — found to reduce error vs 2×2 | Krizhevsky et al. |
| VGG-16 (2015) | Non-overlapping max-pool (2×2, stride 2) after each conv block | Simonyan & Zisserman |
| GoogLeNet (2015) | Max-pool inside Inception modules + Global Avg Pool head (replaced 4096-d FC) | Szegedy et al. [8] |
| ResNet (2016) | Stem: 7×7 stride 2 conv then 3×3 stride 2 max-pool. Stages: stride-2 conv. Head: GAP. | He et al. [9] |
| DenseNet (2017) | Avg-pool transitions between dense blocks; GAP head | Huang et al. |
| U-Net (2015) | Max-pool in encoder, transposed conv in decoder (see §10.6) | Ronneberger et al. [13] |
| ViT (2021) | No pooling. Patchify stem is a stride-16 conv. Class token replaces GAP. | Dosovitskiy et al. [14] |
| ConvNeXt (2022) | Stride-4 patchify stem + 2×2 stride-2 conv downsampling between stages. No max-pool. | Liu et al. [15] |
Summary
| Knob | Effect on output size | Effect on gradient flow | Parameter cost |
|---|---|---|---|
| Reducer = max | no effect | 1 cell per window receives grad | 0 |
| Reducer = avg | no effect | Every cell receives 1/K² of upstream grad | 0 |
| Kernel K | Subtracts K from numerator | Smaller windows = less aggregation | 0 |
| Stride S | Divides by S | No direct effect | 0 |
| Padding P | Adds 2P to numerator | Zero-pad acts as constant input to reducer | 0 |
| Adaptive(target) | Forces output to exactly target × target | Varies per cell (bin-size dependent) | 0 |
Commit these to memory
- Output-shape formula: $O = \lfloor (I + 2P - K)/S \rfloor + 1$ (same as conv with dilation = 1).
- Max-pool gradient rule: one cell per window receives the full upstream gradient; the rest receive zero.
- Avg-pool gradient rule: every cell receives $1/K^2$ of the upstream gradient.
- GAP = AdaptiveAvgPool2d(1) = .mean(dim=(2,3)). Three spellings for the modern CNN classifier head.
- Stride-2 conv can replace max-pool at the cost of $9C^2 + C$ parameters per layer (for a 3×3 convolution with C input and C output channels). Modern architectures prefer this when data is plentiful.
Exercises
Conceptual
- Given a (1, 3, 28, 28) input, what is the output shape after nn.MaxPool2d(kernel_size=3, stride=2, padding=1)? Show your use of the master formula.
- Explain in one sentence why max-pool's backward pass creates zero-gradient cells but avg-pool's does not.
- A classifier is trained on 224×224 images but deployed on 384×384 images. Which single layer change lets the model work at the new resolution without retraining? (Hint: not the conv layers.)
- You observe that every cell of a particular feature map eventually becomes an argmax of at least one window during training. Is max-pool still problematic for gradient flow here? Why or why not?
- Why did Springenberg et al. (2015) [10] get a parameter increase when replacing pool with stride-2 conv, yet their total model was still competitive? (Hint: compare to the parameters they could remove elsewhere.)
Hints
- 1: $\lfloor (28 + 2 \cdot 1 - 3)/2 \rfloor + 1 = \lfloor 13.5 \rfloor + 1 = 14$. Output shape (1, 3, 14, 14).
- 2: Max is non-linear; its (sub)gradient is 1 at the argmax and 0 elsewhere. Avg is linear; its gradient is the constant 1/K².
- 3: Swap the final pool for nn.AdaptiveAvgPool2d((1,1)). Every other layer is already size-agnostic.
- 4: Not for the reason you might think: the problem is per-batch sparsity, not per-training-run. Even if every cell wins eventually, within any single backward pass most cells still get zero gradient, which slows convergence.
- 5: Shallower and fewer FC layers — they removed the giant dense head, so overall param count could stay roughly constant or even drop.
Coding
- Extend max_pool2d to support padding. Pad with $-\infty$ (not 0) so the max is unaffected by the padded cells, and verify against F.max_pool2d(x, K, S, padding=P).
- Implement max_unpool2d: given the argmax indices cached during the forward pass, reconstruct a sparse tensor where each argmax cell holds the corresponding output value. This is the core operation of the Zeiler & Fergus (2014) deconvnet visualisation. Check your implementation against PyTorch's nn.MaxUnpool2d.
- Replace every max-pool in a small VGG-style network with a stride-2 3×3 conv (Springenberg et al. [10]). Train both on CIFAR-10 for 10 epochs and compare final accuracy and training curves.
- Implement the bin schedule in adaptive_bins(H_in, H_out) and feed your output into a manual NumPy avg-pool. Verify the result matches F.adaptive_avg_pool2d to within floating-point tolerance on 20 random input sizes.
Challenge
Reproduce the GoogLeNet head switch. Take a small pre-trained CNN with a dense classifier head (e.g. Flatten → FC(4096) → FC(num_classes)), replace the head with AdaptiveAvgPool2d(1) → Flatten → FC(C, num_classes), and fine-tune for a few epochs. Report (a) the parameter reduction, (b) the change in validation accuracy, and (c) a visualisation of the Class Activation Map produced by the GAP head using the technique of Zhou et al. (2016). This single experiment is the clearest demonstration of why GAP replaced dense heads across nearly every vision architecture from 2015 onwards.
References
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324. DOI:10.1109/5.726791. (First large-scale use of subsampling layers in LeNet-5.)
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.3 “Pooling”. MIT Press. https://www.deeplearningbook.org/ (Canonical textbook treatment of pooling as approximate invariance.)
- Boureau, Y.-L., Ponce, J., & LeCun, Y. (2010). A Theoretical Analysis of Feature Pooling in Visual Recognition. Proc. 27th ICML, 111–118. (Bernoulli-activation analysis of max vs avg, cited for the sparse-features argument.)
- Scherer, D., Müller, A., & Behnke, S. (2010). Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. ICANN, LNCS 6354, 92–101. Springer. (Empirical NORB / CIFAR comparison.)
- Zeiler, M. D., & Fergus, R. (2013). Stochastic Pooling for Regularization of Deep Convolutional Neural Networks. ICLR. arXiv:1301.3557.
- Sermanet, P., Chintala, S., & LeCun, Y. (2013). Convolutional Neural Networks Applied to House Numbers Digit Classification. ICPR. arXiv:1204.3968. (Introduces Lp-pooling in the SVHN classifier.)
- Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. ICLR. arXiv:1312.4400. (Introduces Global Average Pooling.)
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper with Convolutions (GoogLeNet). CVPR. arXiv:1409.4842. (GAP replacing giant FC head.)
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. arXiv:1512.03385. (Stride-2 conv + GAP head in ResNet.)
- Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. ICLR workshop. arXiv:1412.6806. (Empirical case that stride-2 conv ≥ max-pool.)
- Graham, B. (2014). Fractional Max-Pooling. arXiv:1412.6071.
- Yu, D., Wang, H., Chen, P., & Wei, Z. (2014). Mixed Pooling for Convolutional Neural Networks. RSKT, LNCS 8818. Springer. (Random and learnable convex combination of max and average.)
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI. arXiv:1505.04597. (Max-pool in the encoder path, inverted by transposed convolution in the decoder — motivating §10.6.)
- Dosovitskiy, A. et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT). ICLR. arXiv:2010.11929. (First flagship vision architecture with no pooling at all.)
- Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). CVPR. arXiv:2201.03545. (Modern ResNet-style CNN with a patchify stem and stride-2 conv downsampling layers, no max-pool.)
- PyTorch documentation, torch.nn.MaxPool2d, AvgPool2d, AdaptiveAvgPool2d, AdaptiveMaxPool2d, FractionalMaxPool2d, functional.max_pool2d, functional.avg_pool2d, functional.adaptive_avg_pool2d. https://pytorch.org/docs/stable/nn.html (Authoritative reference for count_include_pad, ceil_mode, and adaptive-pool bin schedule.)
In the next section we tackle the inverse of every downsampling operation we have seen so far — transposed convolution. It is the upsampler that lets decoders (U-Net), generators (DCGAN), and super-resolution networks grow feature maps back up to image resolution.