Chapter 10
25 min read
Section 66 of 179

Pooling Operations

Convolution Operations

Introduction

§10.4 closed with a promise: “max-pooling, average-pooling, and even modern alternatives like strided convolutions are variations on a single theme — a stride-S window reducer.” This section pays that off. We will show that the whole family of pooling operators is specified by two choices: the window geometry (which inherits the O=(I+2PK)/S+1O = \lfloor (I + 2P - K)/S \rfloor + 1 formula from §10.3), and a reducer function applied inside each window. Swap the reducer and you traverse the whole design space: max, average, Lp, stochastic, fractional, mixed, and eventually the all-convolutional replacement.

The Core Insight: Pooling is a deliberately parameter-free, non-linear downsampler. It trades a tiny amount of expressiveness for three concrete wins: smaller feature maps, approximate translation invariance, and a bigger effective receptive field for every downstream layer. The famous “complex cells” of Hubel & Wiesel that §10.1 introduced map almost directly onto this operator.

You already met pooling informally — the animation in §10.2's CNN2DConvolutionVisualizer halves a 7×7 feature map with a 2×2 max-pool, and every CNN in Chapter 11 (LeNet, AlexNet, VGG) will lean on it. Here we open the hood: what each variant does mathematically, how gradients route through it, and why modern architectures have been steadily replacing it with strided convolutions.


Learning Objectives

After working through this section, you will be able to:

  1. Implement max-pool and average-pool from scratch in NumPy and reproduce PyTorch's output byte-for-byte.
  2. Predict output shapes using O=(I+2PK)/S+1O = \lfloor (I + 2P - K)/S \rfloor + 1 for any kernel, stride, and padding.
  3. Derive the backward pass for both max and average pool and verify it against torch.autograd.
  4. Explain Global Average Pooling and why it replaced fully-connected heads in GoogLeNet, ResNet, and every modern classifier.
  5. Use nn.AdaptiveAvgPool2d to make a network accept variable input sizes.
  6. Argue — with citations — why ResNet and ConvNeXt eliminated max-pool in favour of stride-2 convolutions.

The Four Jobs Pooling Does

Before any maths, let us be honest about why pooling exists. It is not a decorative layer: it does four concrete jobs, and if you remove it you must replace it with something that does the same jobs. Identifying these jobs lets us reason about what the modern stride-2 conv actually replaces.

JobWhat it buysTypical alternative
1. Spatial downsamplingHalving H and W at each stage keeps compute tractable — 224×224 → 7×7 over 5 stages.Stride-2 convolution (Springenberg 2015 [10])
2. Approximate translation invarianceA one-pixel shift of the input is mostly absorbed by the reducer — the same feature still fires within its window.Data augmentation + architectural choices (GAP)
3. Enlarging the effective receptive fieldTwo stride-2 pools multiply the receptive field's growth rate by 4 (§10.3 recursion).Dilated convolutions (DeepLab [§10.3 R5])
4. Cheapening computeHalves compute in the next layer without adding parameters. LeNet-5 depended on this in 1998 [1].Depthwise-separable convs (MobileNet, §10.4)

The complex-cell analogy, made precise

§10.1 introduced Hubel & Wiesel's “complex cells” as neurons that respond to a feature regardless of its exact position within a small window. A 2×2 max-pool is the simplest possible model of this: it fires with the maximum of four adjacent simple cells, so a 1-pixel shift of the stimulus rarely changes the output. LeCun et al. (1998) [1] called their version “subsampling” for exactly this reason.

Max Pooling

The most common pooling variant, and the one you will see in AlexNet, VGG, and every textbook diagram. For a window WijW_{ij} of size K×KK \times K centred at output position (i,j)(i,j):

y[i,j]=max(m,n)Wijx[m,n]y[i,j] = \max_{(m,n) \in W_{ij}} x[m,n]

Intuition: “report the loudest activation in this neighbourhood.” Under a ReLU activation (§5.3), pixel values represent how strongly a learned feature fires — so max-pool asks whether the feature is present, not exactly where. This is Boureau, Ponce & LeCun (2010) [3]'s formal argument for why max-pool works well on sparse codes: when most cells are zero, the max of a window effectively detects presence.

Hand-Computation on a 4×4 Feature Map

Let us do a full worked example on the smallest non-trivial input. The 4×4 grid below is exactly the one you will see animated in the Pooling Playground further down, so you can match each hand-computed value against the live widget.

Window (i,j)CellsMax
(0, 0)[[1, 3], [5, 6]]6
(0, 1)[[2, 4], [1, 2]]4
(1, 0)[[7, 2], [4, 8]]8
(1, 1)[[3, 1], [5, 6]]6

Result: [[6, 4], [8, 6]] — a 2×2 output from a 4×4 input, exactly as the master formula predicts: (42)/2+1=2(4 - 2)/2 + 1 = 2.

Max Pool in Pure Python

Max Pool — NumPy Reference Implementation
🐍max_pool_numpy.py
1import numpy as np

NumPy gives us np.ndarray plus vectorised slicing and reductions (window.max(), np.mean, np.argmax). The window.max() call inside the loop runs in C; only the outer Python loop is slow.

EXECUTION STATE
numpy = N-dimensional array library on a contiguous C buffer. Slicing returns a view (no copy); reductions run in optimised C.
3x — the 4×4 input feature map

This is the same 4×4 grid the Pooling Visualizer above animates, so you can match cell-by-cell what the code does to what the picture shows. dtype=float to match the output type PyTorch uses.

EXECUTION STATE
x (4×4) =
     c0  c1  c2  c3
r0    1   3   2   4
r1    5   6   1   2
r2    7   2   3   1
r3    4   8   5   6
10def max_pool2d(x, K, S) → np.ndarray

The whole operation is one nested loop: for every output position, slide to the window in the input, take its max. No learnable parameters — pooling is a pure reducer.

EXECUTION STATE
⬇ input: x (H×W) = 2-D feature map. Real CNNs apply this per-channel; we'll add the channel axis in the PyTorch version.
⬇ input: K (int) = Kernel (window) side length. K=2 is the canonical choice since LeNet-5 [Ref 1].
⬇ input: S (int) = Stride — how many pixels the window jumps between positions. S=K (non-overlapping) is the standard; S<K gives overlapping pools [Ref 9 §3.3].
⬆ returns = np.ndarray of shape (out_H, out_W) with one maximum per window.
12H, W = x.shape

Unpack the 2-D shape. For a batched 4-D tensor in PyTorch the shape would be (N, C, H, W); here we strip that down to pure spatial dimensions to make the arithmetic clear.

EXECUTION STATE
H = 4 — number of rows
W = 4 — number of columns
13out_H = (H - K) // S + 1

This is the same master formula we derived in §10.3 for convolution, specialised to padding=0 and dilation=1. Floor-division `//` discards any leftover pixels at the bottom/right — the same behaviour as nn.MaxPool2d with ceil_mode=False.

EXECUTION STATE
📚 (H − K) // S + 1 = Counts how many K-wide windows fit along an axis when each window moves S pixels. Any remainder is dropped — this is why nn.MaxPool2d(2,2) on a 7-pixel row produces 3 outputs, not 4.
→ K=2, S=2 case = (4 − 2)//2 + 1 = 2 → out_H = 2
15y = np.zeros((out_H, out_W))

Pre-allocate the output with zeros so we can fill it in column-by-column during the loop. Pre-allocation is faster than repeated np.append because np.zeros does one bulk malloc.

EXECUTION STATE
y (2×2) =
     c0   c1
r0    0    0
r1    0    0  (before loop)
16for i in range(out_H):

Outer loop over output rows.

LOOP TRACE · 2 iterations
i=0 (processes input rows 0–1)
i=1 (processes input rows 2–3)
17for j in range(out_W):

Inner loop over output columns. Combined with the outer loop this visits all 4 output positions of the 2×2 output grid.

LOOP TRACE · 4 iterations
j=0, i=0 → window rows 0–1, cols 0–1
window = [[1, 3], [5, 6]]
window.max() = 6
j=1, i=0 → window rows 0–1, cols 2–3
window = [[2, 4], [1, 2]]
window.max() = 4
j=0, i=1 → window rows 2–3, cols 0–1
window = [[7, 2], [4, 8]]
window.max() = 8
j=1, i=1 → window rows 2–3, cols 2–3
window = [[3, 1], [5, 6]]
window.max() = 6
18window = x[i*S : i*S + K, j*S : j*S + K]

Slice a K×K view out of x starting at (i·S, j·S). Because NumPy slices are views (not copies), this costs no memory — we're just looking at the same buffer through a smaller lens.

EXECUTION STATE
📚 array[start:stop] = NumPy half-open slice: includes start, excludes stop. Two slices on a 2-D array produce a rectangular view.
→ i·S : i·S+K = Row window [i·S, i·S+K). For i=0, S=2, K=2: rows 0–1.
→ j·S : j·S+K = Column window [j·S, j·S+K). For j=1, S=2, K=2: cols 2–3.
20y[i, j] = window.max()

window.max() is the core operation. It is the single difference between max-pool and average-pool — swap .max() for .mean() and you have the other operator.

EXECUTION STATE
📚 ndarray.max() = Vectorised reduction over all elements of the slice. O(K²) FLOPs, runs in C. Equivalent to np.max(window).
→ why max? = Picks the strongest activation in the window. Under ReLU activations (§5.3), a large value signals 'this feature is present here'. Max-pool says: I only care that the feature IS there, not exactly where.
22return y

Return the filled 2×2 output.

EXECUTION STATE
⬆ return: y (2×2) =
[[6., 4.],
 [8., 6.]]
24print(max_pool2d(x, K=2, S=2))

Call with the canonical kernel=stride=2 configuration. The output is exactly what the Pooling Visualizer animates — 6, 4, 8, 6 in reading order.

14 lines without explanation
1import numpy as np
2
3# 4x4 feature map — same values shown in the Pooling Visualizer above
4x = np.array([
5    [1, 3, 2, 4],
6    [5, 6, 1, 2],
7    [7, 2, 3, 1],
8    [4, 8, 5, 6],
9], dtype=float)
10
11def max_pool2d(x, K, S):
12    """2-D max-pool with kernel size K and stride S, no padding."""
13    H, W = x.shape
14    out_H = (H - K) // S + 1
15    out_W = (W - K) // S + 1
16    y = np.zeros((out_H, out_W))
17    for i in range(out_H):
18        for j in range(out_W):
19            window = x[i*S : i*S + K,
20                       j*S : j*S + K]
21            y[i, j] = window.max()
22    return y
23
24print(max_pool2d(x, K=2, S=2))
25# [[6. 4.]
26#  [8. 6.]]

Max Pool in PyTorch

Max Pool — PyTorch Module and Functional Forms
🐍max_pool_pytorch.py
1Three PyTorch imports

torch for the tensor type, torch.nn for the stateful Module wrapper, torch.nn.functional for the stateless function form. Pooling has no weights so both forms produce identical outputs; choose whichever reads better in your model code.

5x — the input tensor in NCHW layout

PyTorch conv/pool layers expect 4-D tensors in (N, C, H, W) order. We wrap our 4×4 grid in two extra axes so it reads as 'batch of 1 image, 1 channel, 4×4 spatial'.

EXECUTION STATE
⬇ x.shape = (1, 1, 4, 4) — N=1 batch, C=1 channel, H=4, W=4
→ why NCHW? = PyTorch stores tensors channels-first to match cuDNN's layout. TensorFlow's default is NHWC. §10.4 covers the memory-layout trade-offs.
13pool = nn.MaxPool2d(kernel_size=2, stride=2)

Create the module. nn.MaxPool2d records its configuration (kernel, stride, padding, dilation, ceil_mode, return_indices) but holds zero learnable parameters. Building it is free.

EXECUTION STATE
📚 nn.MaxPool2d = PyTorch Module wrapping F.max_pool2d. Its .forward() is literally a call to the functional form. Lives in torch/nn/modules/pooling.py.
⬇ arg: kernel_size=2 = Side length of the pooling window. Accepts int or (kH, kW) tuple. For odd sizes pad asymmetrically if you want same-size output.
⬇ arg: stride=2 = Window step. Defaults to kernel_size if omitted (so MaxPool2d(2) alone is equivalent to MaxPool2d(2, 2)). Set stride<kernel_size for overlapping pools.
→ hidden defaults = padding=0, dilation=1, return_indices=False, ceil_mode=False. Set return_indices=True if you plan to invert the operation later (MaxUnpool2d).
14y1 = pool(x)

Forward pass. Dispatches through __call__ → forward → F.max_pool2d → ATen → cuDNN (on GPU) or a hand-written CPU kernel. The actual reduction runs in C++/CUDA, not Python.

EXECUTION STATE
y1.shape = (1, 1, 2, 2) — spatial halved, batch & channel preserved
y1 =
[[[[6., 4.],
    [8., 6.]]]]
17y2 = F.max_pool2d(x, kernel_size=2, stride=2)

Stateless functional form — no Module object, just a function call. Useful inside Module.forward() when you want to keep the pooling parameters next to the rest of the forward logic instead of storing them as attributes.

EXECUTION STATE
📚 F.max_pool2d = Direct ATen binding. Same kernel as the Module version, same output, same gradient. Lives in torch/nn/functional.py.
19print(y1.squeeze())

.squeeze() drops all size-1 dimensions so we can see the 2×2 output without the batch and channel wrappers cluttering the print.

EXECUTION STATE
📚 tensor.squeeze(dim=None) = Remove axes of size 1. squeeze() with no argument removes all such axes. Use squeeze(0) to remove only the batch axis if you want to keep a potentially size-1 channel axis.
→ shape change = (1, 1, 2, 2) → (2, 2)
23assert torch.equal(y1, y2)

Proves the Module form and functional form produce byte-equal output. They share the same underlying kernel; the wrapper is a thin convenience.

24assert sum(p.numel() for p in pool.parameters()) == 0

Pooling is parameter-free. pool.parameters() returns an empty iterator, so the sum is 0. This is the defining property that separates pooling from strided convolution: same geometric effect on the feature map, but zero gradient signal flowing into learnable weights.

EXECUTION STATE
→ consequence = Adding more pooling layers never grows your parameter count. Adding more stride-2 convs does — every stride-2 3×3 conv costs 9·C_in·C_out weights.
17 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5# Same 4x4 feature map, now as an (N, C, H, W) tensor
6x = torch.tensor([[[
7    [1., 3., 2., 4.],
8    [5., 6., 1., 2.],
9    [7., 2., 3., 1.],
10    [4., 8., 5., 6.],
11]]])  # shape: (1, 1, 4, 4)
12
13# Option A — the Module form (preferred inside nn.Sequential)
14pool = nn.MaxPool2d(kernel_size=2, stride=2)
15y1 = pool(x)
16
17# Option B — the functional form (preferred when you don't need state)
18y2 = F.max_pool2d(x, kernel_size=2, stride=2)
19
20print(y1.squeeze())
21# tensor([[6., 4.],
22#         [8., 6.]])
23
24assert torch.equal(y1, y2)                  # same numbers
25assert sum(p.numel() for p in pool.parameters()) == 0   # ZERO learnable params

Quick Check

A 32×32 feature map is fed into nn.MaxPool2d(kernel_size=3, stride=2, padding=1). What is the output spatial size?


Average Pooling

The other classical reducer. For the same window WijW_{ij}:

y[i,j]=1Wij(m,n)Wijx[m,n]y[i,j] = \frac{1}{|W_{ij}|} \sum_{(m,n) \in W_{ij}} x[m,n]

LeCun et al.'s original LeNet-5 [1] actually used a learnable scaled average pool (one coefficient and one bias per feature map), but the modern convention settled on plain unweighted averaging. Average-pool preserves the total energy of the window and, unlike max, routes gradient to every input cell — a property that will matter a great deal in the Backward Pass section below.

Average Pool in Pure Python

Average Pool — NumPy Reference Implementation
🐍avg_pool_numpy.py
10def avg_pool2d(x, K, S) — only one line differs from max_pool2d

We intentionally keep everything else identical so the difference is surgical: window.mean() replaces window.max(). That one swap changes gradient behaviour completely (see §Backward Pass below).

EXECUTION STATE
→ design lesson = Both max and avg pool are instances of a single pattern: 'stride-S windowed reducer'. The reducer is the only degree of freedom.
18y[i, j] = window.mean()

Average of all K² cells in the window. For K=2 this divides by 4.

EXECUTION STATE
📚 ndarray.mean() = Sum of elements divided by their count. Runs in C. Equivalent to window.sum() / window.size.
LOOP TRACE · 4 iterations
i=0, j=0 → window = [[1,3],[5,6]]
sum = 1 + 3 + 5 + 6 = 15
mean = 15/4 = 3.75
i=0, j=1 → window = [[2,4],[1,2]]
sum = 2 + 4 + 1 + 2 = 9
mean = 9/4 = 2.25
i=1, j=0 → window = [[7,2],[4,8]]
sum = 7 + 2 + 4 + 8 = 21
mean = 21/4 = 5.25
i=1, j=1 → window = [[3,1],[5,6]]
sum = 3 + 1 + 5 + 6 = 15
mean = 15/4 = 3.75
21print(avg_pool2d(x, K=2, S=2))

Compare with the max-pool output: [[6,4],[8,6]] vs [[3.75,2.25],[5.25,3.75]]. Max preserves peak activations; average preserves the overall energy level — the same input, two different statistics.

EXECUTION STATE
⬆ result =
[[3.75, 2.25],
 [5.25, 3.75]]
21 lines without explanation
1import numpy as np
2
3x = np.array([
4    [1, 3, 2, 4],
5    [5, 6, 1, 2],
6    [7, 2, 3, 1],
7    [4, 8, 5, 6],
8], dtype=float)
9
10def avg_pool2d(x, K, S):
11    """Identical to max_pool2d but swaps .max() for .mean()."""
12    H, W = x.shape
13    out_H = (H - K) // S + 1
14    out_W = (W - K) // S + 1
15    y = np.zeros((out_H, out_W))
16    for i in range(out_H):
17        for j in range(out_W):
18            window = x[i*S : i*S + K, j*S : j*S + K]
19            y[i, j] = window.mean()
20    return y
21
22print(avg_pool2d(x, K=2, S=2))
23# [[3.75 2.25]
24#  [5.25 3.75]]

Average Pool in PyTorch

Average Pool — PyTorch with the count_include_pad subtlety
🐍avg_pool_pytorch.py
12avg = nn.AvgPool2d(kernel_size=2, stride=2)

Sibling of nn.MaxPool2d. Same API surface; only the reducer differs.

EXECUTION STATE
📚 nn.AvgPool2d = Average-pooling Module. Also parameter-free.
→ extra flag: count_include_pad = Default True. Controls whether padding zeros are counted in the denominator when you add padding.
21count_include_pad=True vs False

This flag is the most common source of bugs when you port code between PyTorch and TensorFlow. TF divides only by real (non-padded) cells; PyTorch's default divides by all K² cells including zero-padded ones.

EXECUTION STATE
📚 count_include_pad (PyTorch-specific) = True: denominator = K² always. False: denominator = number of real cells in the window (matches TF's behaviour).
→ why it matters = At a corner cell with padding=1 and K=2, a 2×2 window covers 1 real cell and 3 zero-padded cells. With include_pad=True, the average is (real_value + 0 + 0 + 0)/4 = real_value/4 — a 4× attenuation at the border.
26 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5x = torch.tensor([[[
6    [1., 3., 2., 4.],
7    [5., 6., 1., 2.],
8    [7., 2., 3., 1.],
9    [4., 8., 5., 6.],
10]]])
11
12# Module form
13avg = nn.AvgPool2d(kernel_size=2, stride=2)
14y_avg = avg(x)
15
16# Functional form
17y_func = F.avg_pool2d(x, kernel_size=2, stride=2)
18
19print(y_avg.squeeze())
20# tensor([[3.7500, 2.2500],
21#         [5.2500, 3.7500]])
22
23# Important subtlety: padding + count_include_pad
24x_padded_demo = torch.tensor([[[[1., 2.], [3., 4.]]]])
25print(F.avg_pool2d(x_padded_demo, 2, 1, padding=1,
26                   count_include_pad=True))   # divides by 4 (incl. zeros)
27print(F.avg_pool2d(x_padded_demo, 2, 1, padding=1,
28                   count_include_pad=False))  # divides only by real cells

count_include_pad — the silent bug

PyTorch's default is count_include_pad=True; TensorFlow's default is the equivalent of False. Porting average-pool with padding between frameworks without flipping this flag silently changes every border output. The PyTorch docs (torch.nn.AvgPool2d) spell out the formula.

The Master Pooling Formula

Pooling is a convolution that reduces instead of weight-summing, so it inherits §10.3's master formula. Set dilation to 1 and you have:

O=I+2PKS+1O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1

with the same symbol dictionary: II = input size, KK = kernel size, PP = padding per side, SS = stride. The formula applies identically to max, avg, Lp, stochastic and mixed pooling — only the reducer changes, never the geometry.

Verify the master formula against PyTorch on 20 random configurations
🐍pool_output_size.py
5def pool_out(I, K, S, P) — the master formula

This is exactly the §10.3 convolution formula with dilation D=1 hard-coded. Pooling is a convolution that reduces instead of weight-sums, so it inherits the same geometry.

EXECUTION STATE
📚 (I + 2P − K) // S + 1 = Counts integer window positions after adding P pixels of padding on each side. Verified in Dumoulin & Visin (2016) [Ref §10.3 R1] and PyTorch docs.
14assert predicted == actual

Sanity check: our formula must match PyTorch's cuDNN output on every valid configuration. If this ever fails, suspect ceil_mode (which uses ⌈·⌉ instead of ⌊·⌋) or dilation ≠ 1.

23 lines without explanation
1import torch
2import torch.nn.functional as F
3import math
4
5def pool_out(I, K, S, P):
6    """Master pooling formula (same as conv with dilation=1, §10.3)."""
7    return (I + 2*P - K) // S + 1
8
9# Verify against PyTorch for 20 random valid configurations
10torch.manual_seed(0)
11for _ in range(20):
12    I = torch.randint(4, 65, (1,)).item()
13    K = torch.randint(2, 8, (1,)).item()
14    S = torch.randint(1, K+1, (1,)).item()
15    P = torch.randint(0, K//2 + 1, (1,)).item()
16    if 2*P >= K:  # PyTorch rejects pad >= kernel/2
17        continue
18
19    x = torch.randn(1, 1, I, I)
20    y = F.max_pool2d(x, kernel_size=K, stride=S, padding=P)
21
22    predicted = pool_out(I, K, S, P)
23    actual = y.shape[-1]
24    assert predicted == actual, (I, K, S, P, predicted, actual)
25    print(f"I={I:2d}  K={K}  S={S}  P={P}{predicted}×{predicted}  ✓")
SettingFormulaOutput
I=28, K=2, S=2, P=0 (LeNet/AlexNet halving)(28−2)/2+114
I=7, K=2, S=2, P=0 (on a 7×7 fmap, drops last row/col)(7−2)//2+13
I=7, K=7, S=1, P=0 (Global pool on 7×7 fmap)(7−7)/1+11
I=32, K=3, S=2, P=1 (AlexNet overlapping pool)(32+2−3)/2+116
I=4, K=2, S=1, P=0 (overlapping non-downsampling)(4−2)/1+13

Ceil mode in a sentence

Setting ceil_mode=True replaces the floor with a ceiling in the formula above, so a 7×7 input with K=2, S=2 produces a 4×4 output instead of 3×3. The extra row/column is “reached into” using -\infty for max-pool (so the max ignores it) or 0 for avg-pool with count_include_pad=True. Most modern code leaves this asFalse.

Interactive: Pooling Playground

Switch between max and average modes, step through the windows, and watch how the output builds up cell by cell. The 4×4 grid matches the hand-computation above; the 6×6 grid lets you see a longer sliding pattern.

Max Pooling Visualizer

Input Feature Map (4x4)
1
3
2
4
5
6
1
2
7
2
3
1
4
8
5
6
2x2 pool, stride 2
Window [0:2, 0:2]
[1, 3, 5, 6]
max(1,3,5,6) = 6
Output (2x2)
6
?
?
?
Step 1/4
Max Pooling keeps the strongest activation in each window. It preserves what was detected while discarding where exactly it appeared — this gives the network translation invariance.

Quick Check

With the 4×4 grid in Max Pool mode, why does output cell (1, 0) display the value 8?


Max vs Average — When to Use Which

The choice is not arbitrary; there is a real theoretical and empirical literature. Boureau, Ponce & LeCun (2010) [3] analysed pooling on sparse codes under simple Bernoulli activation models, and Scherer, Müller & Behnke (2010) [4] ran a direct bake-off on object-recognition benchmarks. Their findings cohere:

DimensionMax pooling wins when…Average pooling wins when…
Feature sparsityFeatures are sparse — most cells are zero or near zero. Max detects presence.Features are dense and every cell carries signal you want to integrate.
Network positionEarly/mid layers that look for local patterns (edges, textures, parts).Final layer head — GAP preserves the C-dim signature of the whole image [7].
Gradient flowAcceptable only when training data is plentiful — many cells get zero gradient.Every cell gets (1/K²) of the upstream gradient — kinder in small-data regimes.
Translation invarianceSlightly better: max is unchanged by any shift that keeps the argmax in the window.Smoother but also slightly less invariant — the mean shifts continuously.
Empirical benchmark [4]Object recognition on NORB, CIFAR-10 (Scherer 2010).Medical image segmentation heads, audio classifiers (post-2016 literature).

Rule of thumb

Use max-pool in the body of a CNN when you have it in your budget, and global average pool right before the classifier. This is the pattern GoogLeNet (2015) [8] introduced and ResNet (2016) [9] canonised. The decision to drop max-pool entirely in favour of stride-2 convs is orthogonal — see the “Why Pooling Declined” section below.

Worked Example — Continuing the 7×7 from §10.2

§10.2 animated a 7×7 feature map sliding under a 3×3 kernel inside CNN2DConvolutionVisualizer. We reuse exactly that matrix here so you can track specific cells through the conv → pool pipeline end-to-end.

Same 7×7 matrix from §10.2 — now flowed through max-pool and avg-pool
🐍seven_by_seven_pool.py
3x — 7×7 feature map (continuity with §10.2)

This is deliberately the same matrix you saw sliding under a 3×3 kernel in §10.2's CNN2DConvolutionVisualizer. Using identical numbers across sections lets you track the exact same cells through the full conv → pool pipeline.

EXECUTION STATE
x (7×7) =
     c0 c1 c2 c3 c4 c5 c6
r0    3  3  2  1  0  2  1
r1    0  0  1  3  1  0  2
r2    3  1  2  2  3  1  0
r3    2  0  0  2  2  3  1
r4    2  0  0  0  1  2  2
r5    1  3  2  1  0  1  3
r6    0  2  1  3  2  0  1
14Output size for 7×7 input, K=2, S=2, P=0

(7 − 2)//2 + 1 = 5//2 + 1 = 2 + 1 = 3. The last row and last column of the input are DROPPED — this is the default ceil_mode=False behaviour. If you set ceil_mode=True the output would be 4×4 and PyTorch would 'reach out' beyond the feature map using −∞ for max-pool or 0 for avg-pool.

EXECUTION STATE
out_H = (7-2)//2 + 1 = 3
→ dropped cells = Input column 6 and input row 6 are not visited by any window.
24Max-pool result (3×3)

Each output cell is the max of a 2×2 patch of the 6×6 top-left portion of x. Notice how the output is quite flat — lots of 3s and 2s. Max-pool compresses information by keeping only the strongest activation per window, which saturates quickly when values repeat.

EXECUTION STATE
⬆ max output (3×3) =
     c0  c1  c2
r0    3   3   2
r1    3   2   3
r2    3   2   2
42Average-pool result (3×3)

Same windows, mean instead of max. The output now has fractional values and captures a 'local density' rather than a peak. This is why average-pool is favoured in the final head of classifiers — it integrates evidence across a region rather than arguing from a single loud cell.

EXECUTION STATE
⬆ avg output (3×3) =
      c0    c1    c2
r0   1.50  1.75  0.75
r1   1.50  1.50  2.25
r2   1.50  0.75  1.00
42 lines without explanation
1import numpy as np
2
3# 7x7 feature map — SAME matrix used in the §10.2 CNN2DConvolutionVisualizer
4x = np.array([
5    [3, 3, 2, 1, 0, 2, 1],
6    [0, 0, 1, 3, 1, 0, 2],
7    [3, 1, 2, 2, 3, 1, 0],
8    [2, 0, 0, 2, 2, 3, 1],
9    [2, 0, 0, 0, 1, 2, 2],
10    [1, 3, 2, 1, 0, 1, 3],
11    [0, 2, 1, 3, 2, 0, 1],
12], dtype=float)
13
14# 2x2 max pool, stride 2 — the 'VGG/AlexNet halving block'
15def max_pool2d(x, K, S):
16    H, W = x.shape
17    out_H = (H - K) // S + 1
18    out_W = (W - K) // S + 1
19    y = np.zeros((out_H, out_W))
20    for i in range(out_H):
21        for j in range(out_W):
22            y[i, j] = x[i*S:i*S+K, j*S:j*S+K].max()
23    return y
24
25print("max-pool 2x2 stride 2 →")
26print(max_pool2d(x, 2, 2))
27# [[3. 3. 2.]
28#  [3. 2. 3.]
29#  [3. 2. 2.]]
30
31# Same function, same input, with .mean() instead of .max():
32def avg_pool2d(x, K, S):
33    H, W = x.shape
34    out_H = (H - K) // S + 1
35    out_W = (W - K) // S + 1
36    y = np.zeros((out_H, out_W))
37    for i in range(out_H):
38        for j in range(out_W):
39            y[i, j] = x[i*S:i*S+K, j*S:j*S+K].mean()
40    return y
41
42print("avg-pool 2x2 stride 2 →")
43print(avg_pool2d(x, 2, 2))
44# [[1.50 1.75 0.75]
45#  [1.50 1.50 2.25]
46#  [1.50 0.75 1.00]]

Continuity across sections

The 7×7 matrix is the canonical example tensor for this chapter. It appears in §10.2 (conv animation), here in §10.5 (pool walk-through), and will come back in the transposed-convolution walk-through of §10.6. Keeping the input constant lets you isolate the effect of each operation.

Global Average Pooling

If you set the pooling window equal to the full spatial extent of the feature map, the output is one number per channel — a C-dimensional vector. This is Global Average Pooling (GAP), introduced by Lin, Chen & Yan in Network-in-Network (2014) [7] and immediately adopted by GoogLeNet (2015) [8] and then every ResNet-style architecture [9].

yc=1HWi=1Hj=1Wxc,i,jy_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c,i,j}

Why GAP replaced the giant FC head

The AlexNet (2012) head looked like Flatten(6144) → FC(4096) → FC(4096) → FC(1000) — roughly 58 million parameters just in the head. GoogLeNet swapped this for GAP(1024) → FC(1000), i.e. ~1 million parameters. Same accuracy on ImageNet, with vastly less overfitting risk. Four structural wins:

  1. Zero parameters in the pool itself.
  2. Works for any input size. GAP always produces a C-dim vector regardless of H and W, so the classifier downstream is size-agnostic.
  3. Acts as a structural regulariser. Lin et al. (2014) [7] note that GAP forces each feature map to correspond to a class, because the classifier can only average over it.
  4. Localisation comes for free. Class-Activation Maps (Zhou et al. 2016, [Ref 9 of §10.6 foreshadow]) use the fact that GAP is a weighted sum to highlight which spatial cells drove the prediction.
Three ways to spell GAP — all equivalent
🐍global_avg_pool.py
6x — feature map from the last conv stage

A (1, 3, 4, 4) tensor: 1 image, 3 channels, 4×4 spatial. In a real ResNet-50 this shape would be (N, 2048, 7, 7) right before the head.

EXECUTION STATE
x.shape = (1, 3, 4, 4)
channel 0 mean = (1+3+2+4+5+6+1+2+7+2+3+1+4+8+5+6)/16 = 60/16 = 3.75
channel 1 mean = 32/16 = 2.0 (all values are 2)
channel 2 mean = 80/16 = 5.0 (all values are 5)
23nn.AdaptiveAvgPool2d(output_size=1) — this IS Global Average Pooling

GAP is a special case of AdaptiveAvgPool2d where the target spatial size is 1×1. It collapses each channel to its mean, yielding a C-dim vector per image — exactly what Lin, Chen & Yan introduced in Network-in-Network (2014) [Ref 7] and what GoogLeNet (2015) [Ref 8] and ResNet (2016) [Ref 9] adopted to replace giant FC heads.

EXECUTION STATE
📚 nn.AdaptiveAvgPool2d(output_size) = Takes a target spatial size. Internally computes a kernel/stride schedule that produces exactly that size for any input H and W. With output_size=1, every channel reduces to one number.
→ parameter count = 0. GAP is the cheapest head you can attach to a CNN.
24y1 = gap_module(x).flatten(1)

After GAP the shape is (1, 3, 1, 1). .flatten(1) drops the two trailing 1s so downstream code sees a clean (N, C) feature vector ready for a final nn.Linear(C, num_classes).

EXECUTION STATE
📚 tensor.flatten(start_dim) = Collapse all dimensions from start_dim onwards into one. flatten(1) on (1,3,1,1) → (1,3).
⬆ y1 =
tensor([[3.75, 2.00, 5.00]])
28y3 = x.mean(dim=(2, 3))

The 'no module needed' version: GAP is mathematically just the mean over the spatial dimensions. dim=(2, 3) reduces the two spatial axes, leaving (N, C). This is the form you'll see in many research codebases.

EXECUTION STATE
📚 tensor.mean(dim) = Reduces along the listed dimensions. dim=(2,3) means 'reduce H and W, keep N and C'.
→ equivalent to = y = x.sum(dim=(2,3)) / (H * W)
29 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5# Imagine the final feature map of a classifier: 3 channels, 4x4 spatial
6x = torch.tensor([[
7    # channel 0 — same 4x4 matrix we've been using
8    [[1., 3., 2., 4.],
9     [5., 6., 1., 2.],
10     [7., 2., 3., 1.],
11     [4., 8., 5., 6.]],
12    # channel 1 — uniform response
13    [[2., 2., 2., 2.],
14     [2., 2., 2., 2.],
15     [2., 2., 2., 2.],
16     [2., 2., 2., 2.]],
17    # channel 2 — uniform high response
18    [[5., 5., 5., 5.],
19     [5., 5., 5., 5.],
20     [5., 5., 5., 5.],
21     [5., 5., 5., 5.]],
22]])  # shape: (1, 3, 4, 4)
23
24# Three equivalent ways to spell GAP — pick the one that reads best
25gap_module = nn.AdaptiveAvgPool2d(output_size=1)       # (1) the canonical module
26y1 = gap_module(x).flatten(1)                          # (1, 3)
27
28y2 = F.adaptive_avg_pool2d(x, 1).flatten(1)            # (2) functional
29y3 = x.mean(dim=(2, 3))                                # (3) pure tensor ops, same result
30
31print(y1)   # tensor([[3.7500, 2.0000, 5.0000]])
32assert torch.allclose(y1, y2)
33assert torch.allclose(y1, y3)

The modern classifier head template

features → nn.AdaptiveAvgPool2d(1) → nn.Flatten() → nn.Linear(C, num_classes). This is literally the head of every torchvision ResNet, MobileNet, ConvNeXt, EfficientNet and ViT.

Adaptive Pooling

Adaptive pooling is the generalisation that enables GAP: you specify the output size, and PyTorch picks the kernel and stride schedule for you. This removes the hard-coded 7×7 feature-map assumption that plagued early CNN code and is the reason torchvision models accept any reasonable input resolution.

For an input of height HinH_{\text{in}} and a target height HoutH_{\text{out}}, PyTorch's rule for output cell ii is:

starti=iHinHout,endi=(i+1)HinHout\text{start}_i = \left\lfloor \frac{i \cdot H_{\text{in}}}{H_{\text{out}}} \right\rfloor, \quad \text{end}_i = \left\lceil \frac{(i+1) \cdot H_{\text{in}}}{H_{\text{out}}} \right\rceil

Two consequences: bins can have different widths when HinH_{\text{in}} is not a multiple of HoutH_{\text{out}}, and adjacent bins can overlap by one cell. Adaptive pool is therefore not a plain “pick the right kernel” reduction.

Adaptive pooling — same output size from any input size
🐍adaptive_pool_bins.py
6Same target size, any input size

This is the super-power of adaptive pooling: you can feed in any spatial size and get out exactly 7×7, 1×1, or whatever you asked for. That's why every torchvision classification model ends with nn.AdaptiveAvgPool2d((1,1)) instead of a hard-coded pool — models trained on 224×224 still work on 299×299 or 384×384 inputs.

19def adaptive_bins — the actual bin schedule

PyTorch's rule (documented in torch.nn.functional.adaptive_avg_pool2d source): output cell i averages input cells from floor(i·H_in/H_out) to ceil((i+1)·H_in/H_out). Bins can overlap and have different widths, which is why AdaptiveAvgPool2d is not just 'average-pool with auto-picked kernel'.

EXECUTION STATE
📚 bin formula = start_i = ⌊i · H_in / H_out⌋, end_i = ⌈(i+1) · H_in / H_out⌉
→ H_in=10, H_out=7 = bins: widths 2,2,3,2,3,2,2 — not uniform
25 lines without explanation
1import torch
2import torch.nn.functional as F
3
4# Feed the SAME network feature sizes that don't divide evenly into 7
5# AdaptiveAvgPool2d has to invent a variable-size binning schedule on the fly.
6for H in [7, 8, 10, 14, 17, 224]:
7    x = torch.arange(H*H, dtype=torch.float32).view(1, 1, H, H)
8    y = F.adaptive_avg_pool2d(x, 7)
9    print(f"input {H:3d}x{H:3d}  →  output {y.shape[-2]}x{y.shape[-1]}")
10# input   7x  7  →  output 7x7
11# input   8x  8  →  output 7x7
12# input  10x 10  →  output 7x7
13# input  14x 14  →  output 7x7
14# input  17x 17  →  output 7x7
15# input 224x224  →  output 7x7
16
17# The bin boundaries for output cell i are:
18#   start = floor(i  *  H_in / H_out)
19#   end   = ceil((i+1) * H_in / H_out)
20# Bins can have different widths when H_in is not a multiple of H_out.
21def adaptive_bins(H_in, H_out):
22    import math
23    return [(math.floor(i*H_in/H_out), math.ceil((i+1)*H_in/H_out))
24            for i in range(H_out)]
25
26print(adaptive_bins(10, 7))
27# [(0, 2), (1, 3), (2, 5), (4, 6), (5, 8), (7, 9), (8, 10)]

The Backward Pass — How Gradients Route

Pooling has no parameters, but it still has a non-trivial backward pass: an upstream gradient L/y\partial L / \partial y must be routed back to the input cells. How it is routed depends entirely on the reducer.

Max pool — the argmax router

The derivative of y=max(x1,,xK2)y = \max(x_1, \ldots, x_{K^2}) with respect to xkx_k is 1 if xkx_k is the max, and 0 otherwise (technically a sub-gradient since max is non-smooth at ties, but ties are measure-zero in practice). Therefore:

Lx[m,n]=Ly[i,j]if (m,n)=argmaxWijx, else 0\frac{\partial L}{\partial x[m,n]} = \frac{\partial L}{\partial y[i,j]} \quad \text{if } (m,n) = \arg\max_{W_{ij}} x, \text{ else } 0

The consequence is striking: in a window of size K×KK \times K, exactly one cell receives gradient and K21K^2 - 1 receive none. For our 4×4 example with 2×2 pools, 12 of 16 input cells get nothing.

Average pool — the uniform splitter

Average pool is a linear operator, so its derivative is constant: each input cell of each window gets 1/K21/K^2 of the upstream gradient. No zero-gradient cells, no dead zones.

Lx[m,n]=1K2Wij(m,n)Ly[i,j]\frac{\partial L}{\partial x[m,n]} = \frac{1}{K^2} \sum_{W_{ij} \ni (m,n)} \frac{\partial L}{\partial y[i,j]}

Backward Pass in Pure Python

Max-pool and Avg-pool backward — from scratch
🐍pool_backward_numpy.py
12def max_pool_forward — ALSO remembers argmax

The forward pass has an extra responsibility when we'll need gradients: cache the input location of the max in every window. Without this, the backward pass would have to re-scan every window to find the argmax — doubling work.

EXECUTION STATE
→ why cache argmax in input coords? = Makes the backward loop a simple scatter. If we cached local (di,dj) we'd need to recompute i·S+di every time.
📚 np.unravel_index(k, shape) = Turns a flat index k back into N-D coordinates for an array of the given shape. For a 2×2 window, k=3 (last element) becomes (1,1).
25def max_pool_backward — the argmax router

This is the core of max-pool backprop: exactly one cell per output window receives the upstream gradient, and it's the cell that 'won' the forward max. Every other input cell gets zero gradient from this window.

EXECUTION STATE
📚 the max-pool gradient rule = ∂L/∂x[m,n] = ∂L/∂y[i,j] if (m,n) == argmax_window(i,j), else 0. Derived by differentiating y = max(x₁,…,x_{K²}) through the sub-gradient of max.
→ why += and not =? = When stride < kernel size (overlapping pools) or when two adjacent windows happen to have the same argmax cell, multiple windows may route gradient to the same input cell. The accumulation += is necessary.
35def avg_pool_backward — the uniform splitter

Average-pool is a linear operator y = (1/K²) Σ xₖ, so its gradient is a constant 1/K² per input cell of each window. Every cell gets SOMETHING, which is exactly why average pool is kinder to gradient flow than max pool — no zero-gradient dead zones.

EXECUTION STATE
📚 the avg-pool gradient rule = ∂L/∂x[m,n] = (1/K²) · Σ_{windows ⊃ (m,n)} ∂L/∂y[i,j]. For non-overlapping windows (S=K) the sum has exactly one term.
45dL_dy = [[1, 2], [3, 4]]

Simple integer upstream gradient so the routing is easy to trace by eye. In training the upstream gradient would come from whatever layer sits after the pool.

48max-pool backward output — sparse

Exactly 4 of the 16 input cells are non-zero. Cells (0,3), (1,1), (3,1), (3,3) are the argmaxes of the 4 windows (values 4, 6, 8, 6) and they receive upstream gradients 2, 1, 3, 4 respectively.

EXECUTION STATE
⬆ dL/dx (max) =
[[0. 0. 0. 2.]
 [0. 1. 0. 0.]
 [0. 0. 0. 0.]
 [0. 3. 0. 4.]]
→ key observation = 12 of 16 cells received zero gradient. If a cell is NEVER the winner across training, it gets no learning signal at all.
54avg-pool backward output — dense

Every cell gets a signal. The 4 windows produce grads 1, 2, 3, 4; each divides by 4 and spreads uniformly across its 4 cells. No dead zones.

EXECUTION STATE
⬆ dL/dx (avg) =
[[0.25 0.25 0.50 0.50]
 [0.25 0.25 0.50 0.50]
 [0.75 0.75 1.00 1.00]
 [0.75 0.75 1.00 1.00]]
55 lines without explanation
1import numpy as np
2
3# FORWARD — same 4x4 input, 2x2 stride-2 max pool
4x = np.array([
5    [1, 3, 2, 4],
6    [5, 6, 1, 2],
7    [7, 2, 3, 1],
8    [4, 8, 5, 6],
9], dtype=float)
10
11def max_pool_forward(x, K, S):
12    H, W = x.shape
13    out_H = (H - K) // S + 1
14    out_W = (W - K) // S + 1
15    y = np.zeros((out_H, out_W))
16    argmax = np.zeros((out_H, out_W, 2), dtype=int)   # cache WHERE the max was
17    for i in range(out_H):
18        for j in range(out_W):
19            window = x[i*S:i*S+K, j*S:j*S+K]
20            y[i, j] = window.max()
21            # np.unravel_index turns a flat argmax into (row, col) within the window
22            di, dj = np.unravel_index(window.argmax(), window.shape)
23            argmax[i, j] = [i*S + di, j*S + dj]        # store in INPUT coordinates
24    return y, argmax
25
26# BACKWARD — route upstream gradient to the argmax cell of each window
27def max_pool_backward(dL_dy, argmax, x_shape):
28    dL_dx = np.zeros(x_shape)
29    out_H, out_W = dL_dy.shape
30    for i in range(out_H):
31        for j in range(out_W):
32            r, c = argmax[i, j]
33            dL_dx[r, c] += dL_dy[i, j]   # += handles the case two windows share the same max
34    return dL_dx
35
36# AVERAGE POOL — gradient splits uniformly over all K^2 cells
37def avg_pool_backward(dL_dy, x_shape, K, S):
38    dL_dx = np.zeros(x_shape)
39    out_H, out_W = dL_dy.shape
40    for i in range(out_H):
41        for j in range(out_W):
42            dL_dx[i*S:i*S+K, j*S:j*S+K] += dL_dy[i, j] / (K * K)
43    return dL_dx
44
45# Let the upstream gradient be [[1,2],[3,4]] to make tracing easy
46y, argmax = max_pool_forward(x, 2, 2)
47dL_dy = np.array([[1, 2], [3, 4]], dtype=float)
48
49print("max-pool backward dL/dx:")
50print(max_pool_backward(dL_dy, argmax, x.shape))
51# [[0. 0. 0. 2.]
52#  [0. 1. 0. 0.]
53#  [0. 0. 0. 0.]
54#  [0. 3. 0. 4.]]
55
56print("avg-pool backward dL/dx:")
57print(avg_pool_backward(dL_dy, x.shape, 2, 2))
58# [[0.25 0.25 0.50 0.50]
59#  [0.25 0.25 0.50 0.50]
60#  [0.75 0.75 1.00 1.00]
61#  [0.75 0.75 1.00 1.00]]

Verification Against torch.autograd

The NumPy reference matches torch.autograd exactly
🐍pool_backward_verify.py
9requires_grad=True and the .unsqueeze chain

requires_grad tells autograd to track every operation applied to this tensor and build a graph that can be differentiated. The two .unsqueeze(0) calls add the batch and channel axes needed by F.max_pool2d.

EXECUTION STATE
📚 tensor.unsqueeze(dim) = Insert a size-1 axis at position dim. unsqueeze(0) on (4,4) → (1,4,4).
→ shape after both unsqueezes = (1, 1, 4, 4) = (N, C, H, W)
13y.backward(dL_dy)

Manually-supplied upstream gradient. Usually you'd call loss.backward() and PyTorch starts from a scalar dL/dL=1; here we pass a 2×2 tensor because y itself is 2-D, making the trace easier to match against the NumPy reference.

EXECUTION STATE
📚 tensor.backward(grad) = Propagates grad backwards through the autograd graph, accumulating partial derivatives into the .grad attribute of every leaf tensor with requires_grad=True.
21x_t.grad = None

PyTorch ACCUMULATES gradients across backward calls (that's what the += in max_pool_backward above mirrors). If we don't zero x_t.grad before the second backward call, we'd get the SUM of the max-pool and avg-pool gradients — usually not what you want. In a training loop, optimizer.zero_grad() serves the same purpose.

31 lines without explanation
1import torch
2import torch.nn.functional as F
3import numpy as np
4
5x_t = torch.tensor([
6    [1., 3., 2., 4.],
7    [5., 6., 1., 2.],
8    [7., 2., 3., 1.],
9    [4., 8., 5., 6.],
10], requires_grad=True).unsqueeze(0).unsqueeze(0)   # (1,1,4,4)
11
12# MAX-POOL — let autograd compute the gradient
13y = F.max_pool2d(x_t, kernel_size=2, stride=2)
14dL_dy = torch.tensor([[[[1., 2.], [3., 4.]]]])
15y.backward(dL_dy)
16print("autograd (max):")
17print(x_t.grad.squeeze())
18# tensor([[0., 0., 0., 2.],
19#         [0., 1., 0., 0.],
20#         [0., 0., 0., 0.],
21#         [0., 3., 0., 4.]])
22
23# AVG-POOL — reset grads first
24x_t.grad = None
25y = F.avg_pool2d(x_t, kernel_size=2, stride=2)
26y.backward(dL_dy)
27print("autograd (avg):")
28print(x_t.grad.squeeze())
29# tensor([[0.2500, 0.2500, 0.5000, 0.5000],
30#         [0.2500, 0.2500, 0.5000, 0.5000],
31#         [0.7500, 0.7500, 1.0000, 1.0000],
32#         [0.7500, 0.7500, 1.0000, 1.0000]])
33
34# Both match the NumPy reference implementation byte-for-byte.

Practical implication of the argmax router

In a deep network, a cell that is never the argmax of any pool during training receives zero gradient from the pool layer — it can only be updated by the layer feeding into it. This is part of why very deep pool-heavy networks were hard to train before ResNet added skip connections [9].

Variants — Stochastic, Lp, Mixed, Fractional

The 2013–2015 literature explored many alternative reducers. None dethroned max and avg as defaults, but several are useful tools and they make the “single family, different reducer” framing concrete.

VariantReducerOriginal paper
Max poolmax of the windowLeNet-5 used subsampling [1]; max-pool popularised by AlexNet [Krizhevsky 2012]
Average poolmean of the windowLeCun et al. (1998) [1] (learnable scaled version)
Stochastic poolsample a cell, probability ∝ valueZeiler & Fergus (2013) [5]
L^p pool(mean(x^p))^(1/p) — interpolates avg (p=1) ↔ max (p→∞)Sermanet, Chintala, LeCun (2013) [6]
Mixed poolλ·max + (1−λ)·mean, λ random or learnableYu, Wang, Chen, Wei (2014) [12]
Fractional max poolNon-integer kernel/stride — output ≈ input / √2Graham (2014) [11]
Stochastic, L^p, and Mixed pooling — reference implementations
🐍variants.py
5Stochastic pooling — a regulariser

Instead of always taking the max (deterministic) or always the mean (also deterministic), pick a cell stochastically with probability proportional to its activation. Behaves like dropout inside the pool: at train time it injects noise; at test time Zeiler & Fergus (2013) [Ref 5] propose a probabilistic weighting that reduces to a deterministic expectation.

EXECUTION STATE
→ constraint = Requires non-negative values in the window, which is automatic after ReLU.
18L^p pooling — the continuous knob

y = (mean(x^p))^(1/p). p=1 gives the arithmetic mean (average-pool). p→∞ gives the max. p=2 gives RMS pooling, which Sermanet et al. (2013) [Ref 6] used in their street-view digit classifier.

EXECUTION STATE
📚 (mean(x^p))^(1/p) = The p-norm generalised mean. A well-known identity in real analysis: lim_{p→∞} = max, lim_{p→0⁺} = geometric mean.
28Mixed pooling — convex combination

y = λ·max + (1−λ)·mean. λ can be a per-layer random draw (Yu et al. 2014 [Ref 12]) or a learnable parameter. A middle ground that preserves peak information but still lets gradient flow to every cell.

36 lines without explanation
1import numpy as np
2
3# Stochastic pooling (Zeiler & Fergus, 2013) — sample a cell with probability
4# proportional to its value, not argmax
5def stochastic_pool2d(x, K, S, rng):
6    H, W = x.shape
7    out_H = (H - K) // S + 1
8    out_W = (W - K) // S + 1
9    y = np.zeros((out_H, out_W))
10    for i in range(out_H):
11        for j in range(out_W):
12            w = x[i*S:i*S+K, j*S:j*S+K].flatten()
13            probs = w / w.sum()           # requires non-negative inputs (e.g., post-ReLU)
14            y[i, j] = rng.choice(w, p=probs)
15    return y
16
17# L^p pooling (Sermanet, Chintala, LeCun 2013) — interpolates between avg (p=1) and max (p→∞)
18def lp_pool2d(x, K, S, p):
19    H, W = x.shape
20    out_H = (H - K) // S + 1
21    out_W = (W - K) // S + 1
22    y = np.zeros((out_H, out_W))
23    for i in range(out_H):
24        for j in range(out_W):
25            w = x[i*S:i*S+K, j*S:j*S+K]
26            y[i, j] = (np.mean(w**p)) ** (1/p)
27    return y
28
29# Mixed pooling (Yu, Wang, Chen, Wei 2014) — random (or learnable) convex mix
30def mixed_pool2d(x, K, S, lam):
31    H, W = x.shape
32    out_H = (H - K) // S + 1
33    out_W = (W - K) // S + 1
34    y = np.zeros((out_H, out_W))
35    for i in range(out_H):
36        for j in range(out_W):
37            w = x[i*S:i*S+K, j*S:j*S+K]
38            y[i, j] = lam * w.max() + (1.0 - lam) * w.mean()
39    return y

Fractional max-pool in PyTorch

Graham (2014) [11]'s fractional max-pool is available as torch.nn.FractionalMaxPool2d. It draws random boundaries so the effective downsampling ratio is an arbitrary real number rather than an integer, which acts as a regulariser similar to stochastic pooling.

Pool vs Strided Convolution

The central architectural question of the late 2010s: if pooling just halves the spatial resolution, why not use a stride-2 convolution that halves it and learns what to keep? Springenberg, Dosovitskiy, Brox & Riedmiller (2015) [10] made this case explicitly in “Striving for Simplicity: The All Convolutional Net” and showed that replacing every max-pool with a stride-2 conv reaches equal or better accuracy on CIFAR-10 and ImageNet.

Same output shape, very different parameter count
🐍pool_vs_stride.py
4Two ways to halve a feature map

Both layers produce output shape (1, 64, 16, 16) from input (1, 64, 32, 32). The pool is a hand-picked reducer; the stride-2 conv is a learned reducer. This single-line difference became the core argument of Springenberg et al. (2015) [Ref 10] — 'if it has to reduce anyway, let it LEARN how'.

13Parameter count comparison

Pool: 0 parameters. Stride-2 3×3 conv: K·K·C_in·C_out + C_out = 3·3·64·64 + 64 = 36,928 parameters. The extra capacity is the whole point — it lets the network learn a data-dependent downsampling policy instead of a fixed max rule.

EXECUTION STATE
→ trade-off = Pool: cheaper, well-understood inductive bias, always works. Strided conv: more capacity, but needs enough data to learn the reduction without overfitting.
18 lines without explanation
1import torch
2import torch.nn as nn
3
4# Same goal — halve a (C=64, 32x32) feature map in both dimensions.
5pool = nn.MaxPool2d(kernel_size=2, stride=2)           # 0 parameters
6sconv = nn.Conv2d(64, 64, kernel_size=3,
7                  stride=2, padding=1)                 # 64·64·3·3 + 64 = 37,000 parameters
8
9x = torch.randn(1, 64, 32, 32)
10print("pool  output:", pool(x).shape)    # torch.Size([1, 64, 16, 16])
11print("sconv output:", sconv(x).shape)   # torch.Size([1, 64, 16, 16])
12
13pool_params = sum(p.numel() for p in pool.parameters())
14sconv_params = sum(p.numel() for p in sconv.parameters())
15print(f"pool params:  {pool_params:>6d}")   #       0
16print(f"sconv params: {sconv_params:>6d}")  #  36,928
17
18# Same spatial downsampling, but sconv LEARNS what to keep.
19# Springenberg et al. (2015) showed that replacing every max-pool with
20# a stride-2 conv reaches equal or better accuracy on CIFAR-10 / ImageNet.
Propertynn.MaxPool2d(2, 2)nn.Conv2d(C, C, 3, stride=2, padding=1)
Learnable parameters09·C² + C
Output shape on (N, C, H, W)(N, C, H/2, W/2)(N, C, H/2, W/2)
Reducermax (fixed, non-linear)learned linear combination + activation
Gradient flow to input cells1 of K² cells per windowall cells via learned weights
Inductive biasStrong — assumes max is informativeWeak — must be learned
Memory cost (forward)cache of argmax indicesactivations
Best use caseEarly layers when data is limitedModern architectures with enough data

Why Pooling Declined in Modern CNNs

Post-2016, every flagship architecture has moved max-pool out of the spine:

  1. ResNet (He et al. 2016) [9] uses stride-2 3×3 conv inside every transition block and keeps only a single MaxPool2d(3, 2, 1) after the stem. Residual connections mean the signal can skip past the learned downsampler, so the network never has to choose between preserving information and reducing resolution.
  2. DenseNet (2017) uses average pool in transition blocks, not max.
  3. Vision Transformer (Dosovitskiy et al. 2021) [14] contains no pooling at all. Tokenisation via a 16×16 stride-16 patch embedding simultaneously reduces resolution and learns a linear projection.
  4. ConvNeXt (Liu et al. 2022) [15] — the explicitly ResNet-modernising architecture — uses stride-2 depthwise convs for downsampling inside stages and a “patchify stem” of stride-4 conv up front. Zero max-pool.

What has survived everywhere is global average pool as the head — cheap, learnable-free, size-agnostic, class-activation-map-friendly.

Takeaway: Pooling in the body of the network is an optional inductive-bias choice that trades parameters for regularisation. Pooling at the head (GAP) is nearly universal and there is no sign of it going away.

Design Patterns in Real Architectures

ArchitecturePooling strategyReference
LeNet-5 (1998)Learnable scaled avg-pool (2×2, stride 2) after each convLeCun et al. [1]
AlexNet (2012)Overlapping max-pool (3×3, stride 2) — found to reduce error vs 2×2Krizhevsky et al.
VGG-16 (2015)Non-overlapping max-pool (2×2, stride 2) after each conv blockSimonyan & Zisserman
GoogLeNet (2015)Max-pool inside Inception modules + Global Avg Pool head (replaced 4096-d FC)Szegedy et al. [8]
ResNet (2016)Stem: 7×7 stride 2 conv then 3×3 stride 2 max-pool. Stages: stride-2 conv. Head: GAP.He et al. [9]
DenseNet (2017)Avg-pool transitions between dense blocks; GAP headHuang et al.
U-Net (2015)Max-pool in encoder, transposed conv in decoder (see §10.6)Ronneberger et al. [13]
ViT (2021)No pooling. Patchify stem is a stride-16 conv. Class token replaces GAP.Dosovitskiy et al. [14]
ConvNeXt (2022)Stride-4 patchify stem + stride-2 depthwise downsampling between stages. No max-pool.Liu et al. [15]

Summary

KnobEffect on output sizeEffect on gradient flowParameter cost
Reducer = maxno effect1 cell per window receives grad0
Reducer = avgno effectEvery cell receives 1/K² of upstream grad0
Kernel KSubtracts K-1 from numeratorSmaller windows = less aggregation0
Stride SDivides by SNo direct effect0
Padding PAdds 2P to numeratorZero-pad acts as constant input to reducer0
Adaptive(target)Forces output to exactly target × targetVaries per cell (bin-size dependent)0

Commit these to memory

  1. Output-shape formula: O=(I+2PK)/S+1O = \lfloor (I + 2P - K)/S \rfloor + 1 (same as conv with dilation = 1).
  2. Max-pool gradient rule: one cell per window receives the full upstream gradient; the rest receive zero.
  3. Avg-pool gradient rule: every cell receives 1/K21/K^2 of the upstream gradient.
  4. GAP = AdaptiveAvgPool2d(1) = .mean(dim=(2,3)). Three spellings for the modern CNN classifier head.
  5. Stride-2 conv can replace max-pool at the cost of K2C2K^2 C^2 parameters per layer. Modern architectures prefer this when data is plentiful.

Exercises

Conceptual

  1. Given a (1, 3, 28, 28) input, what is the output shape after nn.MaxPool2d(kernel_size=3, stride=2, padding=1)? Show your use of the master formula.
  2. Explain in one sentence why max-pool's backward pass creates zero-gradient cells but avg-pool's does not.
  3. A classifier is trained on 224×224 images but deployed on 384×384 images. Which single layer change lets the model work at the new resolution without retraining? (Hint: not the conv layers.)
  4. You observe that every cell of a particular feature map eventually becomes an argmax of at least one window during training. Is max-pool still problematic for gradient flow here? Why or why not?
  5. Why did Springenberg et al. (2015) [10] get a parameter increase when replacing pool with stride-2 conv, yet their total model was still competitive? (Hint: compare to the parameters they could remove elsewhere.)

Hints

  1. 1: O=(28+23)/2+1=14O = \lfloor (28 + 2 - 3) / 2 \rfloor + 1 = 14. Output shape (1, 3, 14, 14).
  2. 2: Max is non-linear; its (sub)gradient is 1 at the argmax and 0 elsewhere. Avg is linear; its gradient is the constant 1/K².
  3. 3: Swap the final pool for nn.AdaptiveAvgPool2d((1,1)). Every other layer is already size-agnostic.
  4. 4: Not for the reason you might think: the problem is per-batch sparsity, not per-training-run. Even if every cell wins eventually, within any single backward pass most cells still get zero gradient, which slows convergence.
  5. 5: Shallower and fewer FC layers — they removed the giant dense head, so overall param count could stay roughly constant or even drop.

Coding

  1. Extend max_pool2d to support padding. Pad with -\infty (not 0) so the max is unaffected by the padded cells, and verify against F.max_pool2d(x, K, S, padding=P).
  2. Implement max_unpool2d: given the argmax indices cached during the forward pass, reconstruct a sparse tensor where each argmax cell holds the corresponding output value. This is the core operation of the Zeiler & Fergus (2014) deconvnet visualisation. Check your implementation against PyTorch's nn.MaxUnpool2d.
  3. Replace every max-pool in a small VGG-style network with a stride-2 3×3 conv (Springenberg et al. [10]). Train both on CIFAR-10 for 10 epochs and compare final accuracy and training curves.
  4. Implement the bin schedule in adaptive_bins(H_in, H_out) and feed your output into a manual NumPy avg-pool. Verify the result matches F.adaptive_avg_pool2d byte-for-byte on 20 random input sizes.

Challenge

Reproduce the GoogLeNet head switch. Take a small pre-trained CNN with a dense classifier head (e.g. Flatten → FC(4096) → FC(num_classes)), replace the head with AdaptiveAvgPool2d(1) → Flatten → FC(C, num_classes), and fine-tune for a few epochs. Report (a) the parameter reduction, (b) the change in validation accuracy, and (c) a visualisation of the Class Activation Map produced by the GAP head using the technique of Zhou et al. (2016). This single experiment is the clearest demonstration of why GAP replaced dense heads across nearly every vision architecture from 2015 onwards.


References

  1. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324. DOI:10.1109/5.726791. (First large-scale use of subsampling layers in LeNet-5.)
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.3 “Pooling”. MIT Press. https://www.deeplearningbook.org/ (Canonical textbook treatment of pooling as approximate invariance.)
  3. Boureau, Y.-L., Ponce, J., & LeCun, Y. (2010). A Theoretical Analysis of Feature Pooling in Visual Recognition. Proc. 27th ICML, 111–118. (Bernoulli-activation analysis of max vs avg — cited for the sparse-features argument.)
  4. Scherer, D., Müller, A., & Behnke, S. (2010). Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. ICANN, LNCS 6354, 92–101. Springer. (Empirical NORB / CIFAR comparison.)
  5. Zeiler, M. D., & Fergus, R. (2013). Stochastic Pooling for Regularization of Deep Convolutional Neural Networks. ICLR. arXiv:1301.3557.
  6. Sermanet, P., Chintala, S., & LeCun, Y. (2013). Convolutional Neural Networks Applied to House Numbers Digit Classification. ICPR. arXiv:1204.3968. (Introduces Lp-pooling in the SVHN classifier.)
  7. Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. ICLR. arXiv:1312.4400. (Introduces Global Average Pooling.)
  8. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper with Convolutions (GoogLeNet). CVPR. arXiv:1409.4842. (GAP replacing giant FC head.)
  9. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. arXiv:1512.03385. (Stride-2 conv + GAP head in ResNet.)
  10. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. ICLR workshop. arXiv:1412.6806. (Empirical case that stride-2 conv ≥ max-pool.)
  11. Graham, B. (2014). Fractional Max-Pooling. arXiv:1412.6071.
  12. Yu, D., Wang, H., Chen, P., & Wei, Z. (2014). Mixed Pooling for Convolutional Neural Networks. RSKT, LNCS 8818. Springer. (Random and learnable convex combination of max and average.)
  13. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI. arXiv:1505.04597. (Max-pool in the encoder path, inverted by transposed convolution in the decoder — motivating §10.6.)
  14. Dosovitskiy, A. et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT). ICLR. arXiv:2010.11929. (First flagship vision architecture with no pooling at all.)
  15. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). CVPR. arXiv:2201.03545. (Modern ResNet-style CNN with stride-2 depthwise downsampling, no max-pool.)
  16. PyTorch documentation, torch.nn.MaxPool2d, AvgPool2d, AdaptiveAvgPool2d, AdaptiveMaxPool2d, FractionalMaxPool2d, functional.max_pool2d, functional.avg_pool2d, functional.adaptive_avg_pool2d. https://pytorch.org/docs/stable/nn.html (Authoritative reference forcount_include_pad, ceil_mode, and adaptive-pool bin schedule.)

In the next section we tackle the inverse of every downsampling operation we have seen so far — transposed convolution. It is the upsampler that lets decoders (U-Net), generators (DCGAN), and super-resolution networks grow feature maps back up to image resolution.

Loading comments...