Introduction
Every operator we have met so far in this chapter reduces spatial resolution: a 3×3 conv with stride 1 shrinks the feature map by 2 pixels, a 2×2 max-pool halves it. This is fine for classification, where the pipeline is image → features → class. But some of the most interesting networks in deep learning need to go the other way. A GAN generator takes a 100-dim random vector and produces a 64×64 image. A U-Net decoder takes a 7×7 feature map and recovers a 224×224 segmentation mask. A super-resolution model turns a low-resolution photo into a high-resolution one. For all of these, we need a learned upsampler.
The Core Insight: Transposed convolution is the adjoint of a regular convolution, literally the transpose of its im2col matrix. It is NOT the inverse. “Deconvolution,” the name in older papers, is a mathematical misnomer: the operation undoes the shape change of a convolution but never recovers the original pixel values. Everything interesting about transposed conv follows from this distinction.
In this section we derive transposed convolution from two complementary viewpoints — a sparse-matrix transpose and a scatter-add operation — and show the two views compute identical numbers. We will also meet the famous checkerboard artifact (Odena, Dumoulin & Olah 2016 [Ref 2]) and the two modern fixes that have mostly replaced transposed conv in state-of-the-art generators and super-resolution networks.
Learning Objectives
After working through this section, you will be able to:
- Derive the transposed-conv output-size formula from the matrix view.
- Implement transposed conv from scratch in pure NumPy (scatter-add) and verify that the result matches nn.ConvTranspose2d numerically.
- Explain the checkerboard artifact: reproduce the 1-2-4 overlap pattern and predict when it will appear from K and S alone.
- Use the two modern alternatives: upsample + conv (Odena) and pixel shuffle (Shi et al. [4]).
- Read and write DCGAN, U-Net and FCN decoder blocks with confidence.
- Know when to use output_padding and why PyTorch requires it to be less than the stride.
Why We Need Learned Upsampling
Several families of vision architecture are fundamentally dilative: they start small and end large, or they need to reconstruct fine detail from coarse representations.
| Architecture family | Start → end shape | Why a learned upsampler? |
|---|---|---|
| GAN generators (DCGAN, StyleGAN, BigGAN) | (N, 100) latent → (N, 3, 64, 64) or larger image | Must synthesise every pixel; the upsampler LEARNS where edges go. |
| Segmentation decoders (FCN, U-Net, DeepLab) | Encoder's 7×7 feature map → H×W per-pixel class map | Per-pixel labels demand per-pixel resolution; bilinear resize loses class-specific detail. |
| Super-resolution (SRCNN, ESRGAN) | Low-res image → 2×/4× higher resolution | Hallucinating plausible high-frequency detail is the WHOLE task. |
| Autoencoder decoders | Latent z → reconstruction | Mirror the encoder's shape trajectory. |
You could use torch.nn.Upsample(mode='bilinear') for every one of these and add a stride-1 conv afterwards. That is in fact a perfectly reasonable modern design (and we will see why below). Historically, though, each new upsampling step was implemented as one transposed-conv layer, so the weights had to learn both the resize and the filter at once. We need to understand transposed conv to read a decade of papers, state dicts, and reference implementations.
A Name Problem — Deconvolution, Transposed, Fractionally-Strided
The operation has three names in the literature. They all refer to the same thing but carry different baggage.
| Name | Used by | Accuracy |
|---|---|---|
| Deconvolution | Zeiler et al. (2010) [5], early image-processing literature | Misleading — 'deconvolution' in signal processing means inverting a convolution (approximately), which requires the kernel's pseudo-inverse. This operation does NOT do that. |
| Transposed convolution | Dumoulin & Visin (2016) [1], PyTorch, modern papers | Mathematically precise — the operator's matrix representation is the transpose of the forward conv's matrix. |
| Fractionally-strided convolution | Older PyTorch docs, some CNN papers | Operational — describes the implementation trick of 'insert zeros between input cells, then apply regular conv with stride 1'. |
Read carefully
The Matrix View — It Really Is a Transpose
From §10.4 we know that a forward convolution can be written as $y = Wx$, where $W$ is the sparse im2col matrix and $x$ is the flattened input. For a 3×3 kernel mapping a 4×4 image to a 2×2 output, $W$ is a $4 \times 16$ matrix with 9 non-zeros per row.
The transposed convolution with the same kernel is, by definition:
$$y = W^\top x$$
where $x$ is now the flattened 2×2 input (length 4) and $y$ is the 4×4 output (length 16). $W^\top$ has shape $16 \times 4$. Every row corresponds to one output cell and is populated by a specific subset of the 9 kernel taps.
The adjoint, not the inverse
Interactive: Transpose as Adjoint
Hover any cell of the 2×2 input or the 4×4 output below. The column of $W^\top$ belonging to that input cell lights up; so do the output cells it paints into. Hovering an output cell highlights its row of $W^\top$ and the input cells whose kernel taps feed it. The adjoint relationship is no longer a definition; it's a click.
The Scatter-Add View — Hand Computation
Multiplying a 16×4 matrix by a 4-vector is the mathematical definition, but nobody actually implements transposed conv that way. The practical trick: every input cell $(i, j)$ “paints” a scaled copy of the kernel into the output, starting at position $(i \cdot S,\ j \cdot S)$, and overlapping copies accumulate. This is exactly the rule you get by reading $W^\top$ column by column.
Worked example — 2×2 input, 3×3 kernel, stride 1
Let us paint by hand. Our input is the 2×2 tensor [[1, 2], [3, 4]] and our kernel is the sparse [[1, 0, 1], [0, 1, 0], [1, 0, 1]]. Stride 1, padding 0. The output size is $(2 - 1) \cdot 1 + 3 = 4$, so the result is 4×4.
| Input cell | Value | Paints into | Contribution |
|---|---|---|---|
| (0, 0) | 1 | output[0:3, 0:3] | [[1,0,1],[0,1,0],[1,0,1]] |
| (0, 1) | 2 | output[0:3, 1:4] | [[2,0,2],[0,2,0],[2,0,2]] |
| (1, 0) | 3 | output[1:4, 0:3] | [[3,0,3],[0,3,0],[3,0,3]] |
| (1, 1) | 4 | output[1:4, 1:4] | [[4,0,4],[0,4,0],[4,0,4]] |
Summing overlapping contributions gives:
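    [[1, 2, 1, 2],
     [3, 5, 5, 4],
     [1, 5, 5, 2],
     [3, 4, 3, 4]]

Try e.g. cell (1, 1): it lies inside the painted window of all four input cells, but only two of the four taps that land on it are non-zero. Writing $k_{a,b}$ for the kernel entry at row $a$, column $b$: $1 \cdot k_{1,1} + 2 \cdot k_{1,0} + 3 \cdot k_{0,1} + 4 \cdot k_{0,0} = 1 + 0 + 0 + 4 = 5$. ✓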
Interactive: Scatter-Add Painter
Click an input cell or press Play to watch every input cell paint a scaled copy of the kernel into the output. Switch the kernel between “sparse corners-plus-centre” (the example above) and all-ones, and try strides 1, 2, 3 to see how the painted windows cease to overlap as the stride grows. The numbers in green are running sums of the contributions you have painted so far; once all four input cells are painted, you should see the same [[1,2,1,2],[3,5,5,4],[1,5,5,2],[3,4,3,4]] the table predicts.
Transposed Conv in Pure Python
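The scatter-add rule can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not a production implementation; the function name `conv_transpose2d` and its argument names are chosen here for illustration, and dilation and output_padding are left out.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=1, padding=0):
    """Single-channel transposed convolution via scatter-add (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    in_h, in_w = x.shape
    k_h, k_w = kernel.shape
    # Uncropped output size along each axis: (I - 1) * S + K.
    full = np.zeros(((in_h - 1) * stride + k_h, (in_w - 1) * stride + k_w))
    for i in range(in_h):
        for j in range(in_w):
            # Input cell (i, j) paints x[i, j] * kernel with its top-left corner at (i*S, j*S).
            top, left = i * stride, j * stride
            full[top:top + k_h, left:left + k_w] += x[i, j] * kernel
    # Padding crops P rows and columns from every border of the full output.
    if padding > 0:
        full = full[padding:-padding, padding:-padding]
    return full

x = np.array([[1, 2], [3, 4]])
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
y = conv_transpose2d(x, kernel)
print(y)
```

Running it on the 2×2 input and sparse kernel from the worked example should print the same 4×4 result as the hand computation.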
Equivalence with the Transposed Matrix
The two views must agree numerically. Let us build $W^\top$ explicitly and verify:
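One way to run that check, sketched under the assumption of a stride-1, padding-0 forward conv from 4×4 to 2×2: build the dense forward matrix, transpose it, and compare against a scatter-add reference. The helper names `conv_matrix` and `scatter_add` are illustrative.

```python
import numpy as np

def conv_matrix(kernel, in_h, in_w):
    """Dense matrix W of a stride-1, padding-0 forward conv (cross-correlation)."""
    k_h, k_w = kernel.shape
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    W = np.zeros((out_h * out_w, in_h * in_w))
    for oi in range(out_h):
        for oj in range(out_w):
            row = oi * out_w + oj
            for a in range(k_h):
                for b in range(k_w):
                    # Output cell (oi, oj) reads input cell (oi + a, oj + b) with weight kernel[a, b].
                    W[row, (oi + a) * in_w + (oj + b)] = kernel[a, b]
    return W

def scatter_add(x, kernel):
    """Stride-1 scatter-add transposed conv, used here as the reference."""
    k_h, k_w = kernel.shape
    out = np.zeros((x.shape[0] + k_h - 1, x.shape[1] + k_w - 1))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i:i + k_h, j:j + k_w] += x[i, j] * kernel
    return out

kernel = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
x = np.array([[1.0, 2.0], [3.0, 4.0]])

W = conv_matrix(kernel, in_h=4, in_w=4)          # forward matrix, shape (4, 16)
y_matrix = (W.T @ x.reshape(-1)).reshape(4, 4)   # transposed conv as a matrix product
y_scatter = scatter_add(x, kernel)
print(W.shape, np.allclose(y_matrix, y_scatter))
```

Each column of the transposed matrix holds the kernel copy painted by one input cell, which is why the matrix product and the scatter-add loop produce the same numbers.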
Transposed Conv in PyTorch
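A short sketch of the PyTorch side, assuming a standard PyTorch install. It reruns the 2×2 worked example through F.conv_transpose2d and then inspects the weight layout and output shape of an nn.ConvTranspose2d layer; the channel counts 8 and 16 are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)   # (N, C_in, H, W)
kernel = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])

# Functional form: the weight is laid out as (C_in, C_out, kH, kW).
y = F.conv_transpose2d(x, kernel.reshape(1, 1, 3, 3), stride=1, padding=0)
print(y[0, 0])

# Module form: check the weight layout and the 2x upsampling of a K=4, S=2, P=1 layer.
layer = nn.ConvTranspose2d(in_channels=8, out_channels=16, kernel_size=4,
                           stride=2, padding=1, bias=False)
print(tuple(layer.weight.shape))   # input channels come first
out = layer(torch.randn(1, 8, 14, 14))
print(tuple(out.shape))
```

The first print should reproduce the hand-computed 4×4 result; the last two show the input-channels-first weight layout and a 14×14 map growing to 28×28.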
Weight shape is (C_in, C_out, kH, kW), not (C_out, C_in, kH, kW)
nn.Conv2d stores weights as (C_out, C_in, kH, kW). nn.ConvTranspose2d stores them as (C_in, C_out, kH, kW). Converting between them requires weight.transpose(0, 1), not just renaming the layer.
The Output-Size Formula
The full formula, from Dumoulin & Visin (2016) [1], Eq. 15, and repeated verbatim in the PyTorch docs for torch.nn.ConvTranspose2d:
$$O = (I - 1) \cdot S - 2P + D \cdot (K - 1) + \text{output\_padding} + 1$$
With each symbol:
| Symbol | Role | Typical values |
|---|---|---|
| I | Input spatial size | 2, 7, 14, 32, … |
| K | Kernel size | 3, 4, 5 |
| S | Stride | 1 (no upsample), 2 (2× upsample) |
| P | Padding — CROPS the output | 0, 1, 2 |
| D | Dilation | 1 (default) |
| output_padding | Extra pixels on right/bottom only. Must be < S. | 0 or 1 |
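The table below pairs forward-conv configurations with the transposed conv that restores the input size:
| Forward conv | Shape-restoring transposed conv | Transposed-conv output size |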
|---|---|---|
| (I=4, K=3, S=1, P=0) → 2 | (I=2, K=3, S=1, P=0) | 4 |
| (I=4, K=3, S=1, P=1) → 4 (same padding) | (I=4, K=3, S=1, P=1) | 4 |
| (I=8, K=4, S=2, P=1) → 4 | (I=4, K=4, S=2, P=1, output_padding=0) | 8 |
| (I=7, K=3, S=2, P=1) → 4 (floor) | (I=4, K=3, S=2, P=1, output_padding=0) | 7 |
| (I=8, K=3, S=2, P=1) → 4 | (I=4, K=3, S=2, P=1, output_padding=1) | 8 |
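The rows above can be checked mechanically. The sketch below encodes the forward and transposed size formulas as two small functions (the names are illustrative) and replays each table row.

```python
def conv_out_size(i, k, s=1, p=0, d=1):
    """Forward conv output size; note the floor division."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

def conv_transpose_out_size(i, k, s=1, p=0, d=1, output_padding=0):
    """Transposed conv output size: (I-1)*S - 2P + D*(K-1) + output_padding + 1."""
    # PyTorch accepts output_padding smaller than either the stride or the dilation;
    # with D = 1 this reduces to output_padding < S.
    if not 0 <= output_padding < max(s, d):
        raise ValueError("output_padding must be smaller than the stride or the dilation")
    return (i - 1) * s - 2 * p + d * (k - 1) + output_padding + 1

# (forward input, K, S, P, output_padding) for each table row.
rows = [(4, 3, 1, 0, 0), (4, 3, 1, 1, 0), (8, 4, 2, 1, 0), (7, 3, 2, 1, 0), (8, 3, 2, 1, 1)]
for i, k, s, p, op in rows:
    small = conv_out_size(i, k, s, p)
    restored = conv_transpose_out_size(small, k, s, p, output_padding=op)
    print(f"forward {i} -> {small}, transposed {small} -> {restored}")
```

Every row should round-trip back to its forward input size, which is the sense in which transposed conv undoes the shape change.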
Interactive: Output-Size Calculator
Drag the six sliders to see every term of the output formula light up as it is added. The forward-conv counterpart range tells you which forward-conv inputs all collapse to the current $I$, and output_padding is automatically clamped to $[0, S - 1]$ (PyTorch's constraint). The copy-paste-ready PyTorch invocation is the same line you would type when reproducing this configuration in code.
Strided Transposed Convolution — Zero Insertion
When $S > 1$, transposed conv becomes an upsampler. The classic “fractionally-strided” interpretation makes this intuitive: insert $S - 1$ zero rows and columns between every pair of adjacent input cells, pad the result by $K - 1 - P$ zeros on each side, and then apply a regular stride-1 convolution with the same kernel (rotated by 180° when convolution is implemented as cross-correlation, as in PyTorch). The output size matches the formula above.
Example — 2×2 input, K=3, S=2:
    Input (2x2):
      [[1, 2],
       [3, 4]]

    After inserting S-1 = 1 zero between cells (3x3):
      [[1, 0, 2],
       [0, 0, 0],
       [3, 0, 4]]

    Pad by K-1-P = 2 zeros around (7x7):
      [[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 2, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 3, 0, 4, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]]

    Apply regular 3x3 conv, stride 1 → 5x5 output.
    (Output size check: (2-1)*2 + 3 = 5 ✓)

Modern implementations skip the explicit zero insertion (too wasteful) and go straight to the scatter-add form or to a cuDNN routine that fuses both steps. But the zero-insertion picture is useful for intuition and is the origin of the name “fractionally-strided convolution”.
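The zero-insertion picture can be checked against scatter-add directly. The sketch below assumes a square kernel and padding 0, and uses a deliberately asymmetric kernel so that one detail becomes visible: with cross-correlation as the "regular conv", the kernel has to be rotated by 180° for the two routes to agree. Function names are illustrative.

```python
import numpy as np

def scatter_add(x, kernel, stride):
    """Reference transposed conv: each input cell paints a scaled kernel copy."""
    k = kernel.shape[0]
    out = np.zeros(((x.shape[0] - 1) * stride + k, (x.shape[1] - 1) * stride + k))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

def zero_insert_conv(x, kernel, stride):
    """Fractionally-strided route: zero-insert, pad, then stride-1 cross-correlation."""
    k = kernel.shape[0]
    h, w = x.shape
    # Step 1: insert S - 1 zeros between neighbouring input cells.
    dilated = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    dilated[::stride, ::stride] = x
    # Step 2: pad by K - 1 zeros on every side (P = 0 in this sketch).
    padded = np.pad(dilated, k - 1)
    # Step 3: slide the 180-degree-rotated kernel with stride 1.
    flipped = kernel[::-1, ::-1]
    out_h, out_w = padded.shape[0] - k + 1, padded.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * flipped)
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
kernel = np.arange(1.0, 10.0).reshape(3, 3)   # asymmetric on purpose
a = scatter_add(x, kernel, stride=2)
b = zero_insert_conv(x, kernel, stride=2)
print(a.shape, np.allclose(a, b))
```

For symmetric kernels such as the all-ones or corners-plus-centre examples the rotation changes nothing, which is why it is easy to overlook.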
Interactive: Fractionally-Strided View
Step through the four stages below — original input, zero-inserted, zero-padded, then a regular stride-1 conv slid across the result. The output size of the final conv equals the transposed-conv output size from the formula above, exactly. Toggle stride and padding to confirm.
output_padding — Resolving the Ambiguity
A forward conv with $S > 1$ does floor-division in its output formula. Two adjacent input sizes can map to the same output size, so the transposed conv going the other way is ambiguous: given a 4×4 output, was the forward input 7×7 or 8×8? output_padding picks the answer.
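To make the ambiguity concrete, the sketch below computes which output_padding restores a given forward input size, assuming dilation 1. The helper `required_output_padding` is a hypothetical name: it returns the remainder that the forward floor division discarded, and that remainder is always smaller than the stride.

```python
def conv_out_size(i, k, s, p):
    """Forward conv output size with dilation 1."""
    return (i + 2 * p - k) // s + 1

def required_output_padding(target, k, s, p):
    """output_padding that makes the transposed conv return `target` (dilation 1)."""
    return (target + 2 * p - k) % s

k, s, p = 3, 2, 1
results = []
for target in (7, 8):
    small = conv_out_size(target, k, s, p)
    op = required_output_padding(target, k, s, p)
    restored = (small - 1) * s - 2 * p + (k - 1) + op + 1
    results.append((target, small, op, restored))
    print(f"forward {target} -> {small}; output_padding={op} restores {restored}")
```

Both forward sizes collapse to 4, and the two output_padding values separate them again.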
PyTorch's constraint: output_padding < stride
The Checkerboard Artifact — Odena, Dumoulin & Olah (2016)
The most important known pathology of transposed convolution. Odena, Dumoulin & Olah (2016) [Ref 2], in a short and influential Distill article, showed that GAN generators built from stacked ConvTranspose2d(K=3, S=2) layers produce outputs with visible grid-like patterns — even when the training signal contains no such patterns. The cause is a purely geometric property of transposed conv, not a training bug.
Reproducing the 1-2-4 Overlap Pattern
Take a uniform 3×3 input of ones, a uniform 3×3 kernel of ones, and apply stride-2 transposed conv. The output — which should be uniform by symmetry — is not.
Why does this happen? Each output cell is the sum of contributions from every input cell whose scattered kernel touches it. With $K = 3$ and $S = 2$, the number of input cells that touch a given output cell alternates between 1 and 2 along each axis, producing the 1-2-4 pattern we see. Whenever $K$ is not a multiple of $S$, the overlap is non-uniform and a checkerboard appears.
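The overlap count is easy to reproduce. With an all-ones input and an all-ones kernel, the transposed-conv output equals the number of kernel copies covering each output cell, so counting along one axis and taking an outer product gives the 2-D picture. The function name `overlap_counts` is illustrative.

```python
import numpy as np

def overlap_counts(n_in, k, s):
    """1-D count of input cells whose scattered kernel covers each output position."""
    counts = np.zeros((n_in - 1) * s + k, dtype=int)
    for i in range(n_in):
        counts[i * s:i * s + k] += 1
    return counts

row = overlap_counts(3, k=3, s=2)     # 3 input cells, K=3, S=2
grid = np.outer(row, row)             # 2-D overlap counts for the 3x3 all-ones case
print(row)
print(grid)

safe = overlap_counts(4, k=4, s=2)    # K is a multiple of S
print(safe)
```

In the K=3, S=2 case the grid contains only the values 1, 2 and 4. In the K=4, S=2 case the counts are constant away from the border, where fewer kernel copies reach.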
Diagnostic rule
Interactive: Overlap Heatmap Explorer
Slide $K$ and $S$ below. Whenever $K \bmod S = 0$, the heatmap is a single uniform colour and the diagnostic strip turns green; otherwise it shows the periodic checkerboard pattern. This is exactly Figure 3 from Odena, Dumoulin & Olah (2016) [Ref 2], but you can drag the knobs.
Fix 1 — Nearest Upsample + Regular Conv
Odena's recommended fix: decouple upsampling from filtering. First use a parameter-free nearest-neighbour (or bilinear) nn.Upsample to resize, then apply a regular nn.Conv2d to filter. Same receptive field, same parameter count, no checkerboard.
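A NumPy sketch of the decoupled design, single-channel and with hypothetical helper names. In PyTorch the same two steps would be nn.Upsample followed by nn.Conv2d; here the point is only to show that every interior output pixel receives the same number of kernel taps.

```python
import numpy as np

def upsample_nearest(x, scale):
    """Nearest-neighbour upsampling: repeat every row and column `scale` times."""
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

def conv2d_same(x, kernel):
    """Stride-1 cross-correlation with zero padding that keeps the spatial size (odd K)."""
    k = kernel.shape[0]
    padded = np.pad(x, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return np.einsum("ijab,ab->ij", windows, kernel)

x = np.ones((3, 3))
kernel = np.ones((3, 3))
y = conv2d_same(upsample_nearest(x, 2), kernel)   # 3x3 -> 6x6 -> 6x6
print(y)
```

With the same all-ones input and kernel that produced the 1-2-4 pattern, the interior of this output is constant; only the outermost ring is lower, because of zero padding at the image border rather than any periodic overlap.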
Fix 2 — Pixel Shuffle / Sub-Pixel Convolution
Shi et al. (2016) [Ref 4]'s ESPCN introduced a different approach: do the convolution in low resolution with $C \cdot r^2$ output channels, then use a zero-parameter pixel-shuffle to permute those channels into an $r$-times larger spatial extent.
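The rearrangement itself is a reshape and an axis permutation. The sketch below is a NumPy version that follows the channel ordering used by torch.nn.PixelShuffle, where channel $c \cdot r^2 + i \cdot r + j$ lands at sub-pixel offset $(i, j)$; the function name `pixel_shuffle` is illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (N, C*r*r, H, W) into (N, C, H*r, W*r)."""
    n, c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_r2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # reorder axes to (n, c, h, i, w, j)
    return x.reshape(n, c, h * r, w * r)   # merge (h, i) into rows and (w, j) into columns

x = np.arange(4 * 2 * 2).reshape(1, 4, 2, 2)   # 4 channels of 2x2, r = 2
y = pixel_shuffle(x, r=2)
print(y[0, 0])
```

Each 2×2 block of the output gathers one value from each of the four input channels, so no output pixel is written twice.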
Interactive: PixelShuffle Rearrangement
Hover any cell on either side of the visualizer below. The mapping rule $\text{out}[c,\ hr + i,\ wr + j] = \text{in}[cr^2 + ir + j,\ h,\ w]$ becomes click-tracking; the $r^2$ input channels collapse onto exactly $r^2$ output pixels per super-pixel, with no overlap, which is why pixel shuffle is artifact-free by construction.
Which upsampler should I choose?
- GAN generator (modern): upsample + conv (StyleGAN, BigGAN) OR pixel shuffle.
- U-Net segmentation: transposed conv (historically) or upsample + conv. Both work; the skip connections hide most artifacts.
- Super-resolution (ESRGAN, EDSR): pixel shuffle — lowest compute and no checkerboard.
- Legacy code you must reproduce: transposed conv, unfortunately. Know the pitfalls and hope the training pipeline has already learned to cancel the pattern.
The Backward Pass — It Really Is a Regular Convolution
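§10.4 derived the backward pass of a forward conv: $\partial L / \partial x$ is itself a convolution of $\partial L / \partial y$ with the flipped kernel. The analogous fact for transposed conv is even cleaner: because the forward pass is $y = W^\top x$, differentiating gives:
$$\frac{\partial L}{\partial x} = W \, \frac{\partial L}{\partial y}$$
But multiplication by $W$ is exactly the forward pass of a regular convolution. So: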
The backward pass of ConvTranspose2d is a forward Conv2d. And the backward pass of Conv2d is a forward ConvTranspose2d. The two operators are each other's backward pass, which is why PyTorch implements them with shared cuDNN kernels.
This duality is the reason they have the same parameter count for matching hyperparameters, and the reason cuDNN can reuse the same highly-optimised im2col-and-GEMM primitives (§10.4) to accelerate both.
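The duality can be checked numerically without any autograd library. The sketch below, assuming stride 1, padding 0 and single channels, compares a finite-difference gradient of a transposed conv with a plain forward conv applied to the upstream gradient; function and variable names are illustrative.

```python
import numpy as np

def conv2d(x, kernel):
    """Forward conv (cross-correlation), stride 1, padding 0."""
    k = kernel.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * kernel)
    return out

def conv_transpose2d(x, kernel):
    """Transposed conv via scatter-add, stride 1, padding 0."""
    k = kernel.shape[0]
    out = np.zeros((x.shape[0] + k - 1, x.shape[1] + k - 1))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i:i + k, j:j + k] += x[i, j] * kernel
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
x = rng.normal(size=(2, 2))   # input of the transposed conv
g = rng.normal(size=(4, 4))   # upstream gradient dL/dy

# Scalar loss L = sum(conv_transpose2d(x) * g); the claim is dL/dx = conv2d(g).
analytic = conv2d(g, kernel)
numeric = np.zeros_like(x)
eps = 1e-6
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        bump = np.zeros_like(x)
        bump[i, j] = eps
        plus = np.sum(conv_transpose2d(x + bump, kernel) * g)
        minus = np.sum(conv_transpose2d(x - bump, kernel) * g)
        numeric[i, j] = (plus - minus) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Because the operator is linear, the central difference matches the analytic gradient up to floating-point rounding.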
Applications in the Wild
DCGAN Generator
Radford, Metz & Chintala (2016) [Ref 3] introduced the deep convolutional GAN: a generator built entirely from transposed convolutions stacked to upsample a 100-dim latent vector into a 64×64 RGB image.
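The spatial trajectory of such a generator follows from the output-size formula. The layer list below is an assumption: it mirrors the layout commonly used in PyTorch DCGAN reimplementations (one K=4, S=1, P=0 layer to reach 4×4, then four K=4, S=2, P=1 layers) and is not taken from this section or verified against the paper's exact configuration.

```python
def conv_transpose_out_size(i, k, s, p):
    """Transposed conv output size with dilation 1 and output_padding 0."""
    return (i - 1) * s - 2 * p + (k - 1) + 1

# Assumed (K, S, P) per layer; the latent vector is treated as a 1x1 feature map.
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1
trace = [size]
for k, s, p in layers:
    size = conv_transpose_out_size(size, k, s, p)
    trace.append(size)
print(" -> ".join(str(t) for t in trace))
```

Under these assumptions each stride-2 layer exactly doubles the resolution, taking the 1×1 latent to a 64×64 image in five steps.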
Why K=4, S=2 in DCGAN?
U-Net Decoder
Ronneberger, Fischer & Brox (2015) [Ref 7] introduced U-Net for biomedical segmentation. The encoder halves the spatial resolution stage-by-stage with max-pool (§10.5); the decoder doubles it back with transposed conv, concatenating the matching-resolution encoder feature map at every step via a skip connection.
The K=2, S=2 transposed conv exactly doubles spatial resolution and is checkerboard-safe (2 is a multiple of 2). Modern U-Net variants often replace the transposed conv with nn.Upsample(scale_factor=2) + nn.Conv2d(...): same shape, no artifacts.
FCN for Semantic Segmentation
Long, Shelhamer & Darrell (2015) [Ref 6] introduced the first fully convolutional network for semantic segmentation. They used transposed conv to upsample the coarse class-score map from the last conv stage back to image resolution, initialising the kernel to bilinear interpolation and then letting the network fine-tune it. This initialisation trick is still used today when a decoder is trained from scratch on small data.
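A sketch of the bilinear initialisation, using the construction that circulates in FCN reimplementations; treat the exact centring convention as an assumption rather than the authors' code. The function name `bilinear_kernel` is illustrative.

```python
import numpy as np

def bilinear_kernel(k):
    """2-D bilinear interpolation weights for a K x K transposed-conv kernel."""
    factor = (k + 1) // 2
    # Centre of the kernel in index coordinates (half-integer for even K).
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    pos = np.arange(k)
    one_d = 1 - np.abs(pos - center) / factor   # triangular (tent) profile
    return np.outer(one_d, one_d)

kernel = bilinear_kernel(4)   # intended for a K=4, S=2 upsampling layer
print(kernel)

# For stride 2, the taps hitting any one output pixel form one of four phases.
phase_sums = [kernel[a::2, b::2].sum() for a in (0, 1) for b in (0, 1)]
print(phase_sums)
```

Each phase sums to 1, so a constant input stays constant in the interior of the upsampled output. To initialise a multi-channel layer, this kernel is typically copied onto the weight slices that connect each channel to itself, with the remaining slices set to zero.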
Design Patterns — Pick the Right Upsampler
| Architecture | Upsampler used | Checkerboard-safe? | Reference |
|---|---|---|---|
| DCGAN (2016) | ConvTranspose2d(K=4, S=2) | Yes — K divisible by S | Radford et al. [3] |
| StyleGAN2 (2020) | Upsample (bilinear) + modulated conv | Yes | Karras et al. |
| BigGAN (2019) | Upsample (nearest) + conv | Yes | Brock et al. |
| U-Net (2015) | ConvTranspose2d(K=2, S=2) | Yes | Ronneberger et al. [7] |
| FCN-8s (2015) | ConvTranspose2d with bilinear init | Yes after fine-tuning | Long et al. [6] |
| ESPCN / ESRGAN (2016+) | Conv + PixelShuffle | Yes by construction | Shi et al. [4] |
| Deconvnet (2015) | ConvTranspose2d(K=3, S=2) | No — classic checkerboard | Noh et al. [8] |
| Autoencoder tutorials | ConvTranspose2d(K=3, S=2) | No — common pedagogical bug | (widespread) |
Quick Check
You are designing a 2× upsampling block with a 3×3 kernel. Which option avoids the checkerboard artifact with the least change to downstream param counts?
Summary
| Knob | Effect on output size | Artifact risk | Parameter cost |
|---|---|---|---|
| Stride S | Multiplies by ~S | Checkerboard if S does not divide K | None |
| Padding P | CROPS 2P pixels | None | None |
| Kernel K | Adds (K-1) | Safe iff K is multiple of S | Scales as K² · C_in · C_out |
| Dilation D | Adds D(K-1) − (K-1) | None | None |
| output_padding | Adds 0..S-1 on right/bottom | None | None |
Commit these to memory
- Output formula: $O = (I - 1) \cdot S - 2P + D \cdot (K - 1) + \text{output\_padding} + 1$.
- Matrix identity: transposed conv computes $y = W^\top x$, where $W$ is the forward im2col matrix. NOT the inverse of convolution.
- Checkerboard diagnostic: stride-S transposed conv is artifact-free iff kernel size $K$ is a multiple of $S$. Prefer K=2,S=2 or K=4,S=2.
- The two modern replacements: nearest-upsample + conv (Odena [2]) and conv + pixel-shuffle (Shi [4]). Same shape, same parameter count, no artifacts.
- Weight-shape gotcha: ConvTranspose2d stores weights as (C_in, C_out, kH, kW), opposite to Conv2d.
- Duality: backward of ConvTranspose2d = forward of Conv2d, and vice versa.
Exercises
Conceptual
- Compute by hand the output spatial size of nn.ConvTranspose2d(32, 64, kernel_size=4, stride=2, padding=1) applied to a 14×14 feature map.
- For $K = 2$, $S = 1$, $P = 0$, compare the output sizes of transposed conv and forward conv on the same input size. Do they agree? Why or why not? (Hint: plug into both formulas.)
- Explain why transposed conv is NOT the inverse of convolution. Construct a small example where $W^\top (W x) \neq x$.
- If forward conv(K=3, S=2, P=1) maps inputs 7 and 8 both to 4, which value of output_padding recovers each?
- Why do DCGAN generators use K=4, S=2 instead of K=3, S=2? Answer in one sentence referencing Odena et al. (2016) [2].
Hints
- 1: $(14 - 1) \cdot 2 - 2 \cdot 1 + (4 - 1) + 1 = 28$. Exactly 2× upsample.
- 2: With K=2, S=1, P=0. Forward: O = (I-K+2P)/S + 1 = I - 1. Transposed: O = (I-1)·1 + 2 = I + 1. They differ by 2; at stride 1 the two sizes coincide only when 2P = K - 1 (for example K=3, P=1).
- 3: Pick any non-injective forward conv (most are). E.g. a 2×2 avg kernel. W·x loses information; W.T·(W·x) cannot recover x.
- 4: output_padding=0 → 7, output_padding=1 → 8.
- 5: K=4 is a multiple of S=2, so the scattered-kernel overlaps are uniform and no checkerboard appears.
Coding
- Extend the scatter-add conv_transpose2d to handle multi-channel inputs and outputs. Verify against F.conv_transpose2d on 10 random configurations.
- Reproduce Figure 3 from Odena et al. (2016) [2]: plot the overlap-count heatmap for a range of (K, S) pairs and verify that uniformity occurs iff K is a multiple of S.
- Train two tiny image autoencoders on MNIST: one using stacked ConvTranspose2d(K=3, S=2) in the decoder, the other using Upsample + Conv2d. Compare reconstruction quality and plot a few samples side by side. Do you see checkerboards?
- Implement nn.PixelShuffle(r) from scratch using tensor.reshape and tensor.permute. Verify against the built-in module.
Challenge
Reproduce the Odena et al. (2016) [2] experiment. Train a small DCGAN on CIFAR-10 for 20 epochs twice: once with ConvTranspose2d(K=3, S=2) layers and once with Upsample + Conv2d. Visualise a batch of generated samples from both models and measure the FID score. The original article makes its case through side-by-side samples rather than a reported FID figure (FID was only introduced in 2017), so treat any FID gap you measure as your own result. Reproducing even a qualitative version of this result is the clearest possible demonstration of the theoretical argument.
References
- Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. (The canonical derivation of the transposed conv output formula, §4. Figure 4.1 is the 2×2 → 4×4 example we hand-computed.)
- Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and Checkerboard Artifacts. Distill. https://distill.pub/2016/deconv-checkerboard/ (Diagnoses the checkerboard, proposes upsample-then-conv.)
- Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN). ICLR. arXiv:1511.06434.
- Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., & Wang, Z. (2016). Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN). CVPR. arXiv:1609.05158. (Introduces pixel shuffle.)
- Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional Networks. CVPR. (Early use of “deconvolution” for reconstructing inputs from feature maps; source of the misleading name.)
- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR. arXiv:1411.4038. (First large-scale use of transposed conv in vision. Introduces bilinear-init trick for the upsample layer.)
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI. arXiv:1505.04597. (Encoder pool + decoder transposed-conv + skip connections.)
- Noh, H., Hong, S., & Han, B. (2015). Learning Deconvolution Network for Semantic Segmentation. ICCV. arXiv:1505.04366. (The “Deconvnet” architecture — representative of the checkerboard-prone K=3, S=2 stack.)
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.5 “Variants of the Basic Convolution Function”. MIT Press. https://www.deeplearningbook.org/ (Textbook treatment of transposed conv and its role as the adjoint of ordinary conv.)
- PyTorch documentation, torch.nn.ConvTranspose2d, torch.nn.functional.conv_transpose2d, torch.nn.Upsample, torch.nn.PixelShuffle. https://pytorch.org/docs/stable/nn.html (Authoritative reference for output_padding semantics and the weight-shape ordering.)
- Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2). CVPR. arXiv:1912.04958. (Moves from transposed-conv to bilinear-upsample-then-conv to eliminate residual artifacts.)
This concludes Chapter 10. You now have a complete, mathematically grounded view of the convolution operator and its entire ecosystem — the forward pass, the parameters, the efficient implementation, pooling, and transposed convolution. Chapter 11 builds on these foundations to walk through the architectural lineage that shaped modern computer vision: LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet.