Chapter 10

Transposed Convolutions

Convolution Operations

Introduction

Every operator we have met so far in this chapter reduces spatial resolution: a 3×3 conv with stride 1 shrinks the feature map by 2 pixels, a 2×2 max-pool halves it. This is fine for classification, where the pipeline is image → features → class. But some of the most interesting networks in deep learning need to go the other way. A GAN generator takes a 100-dim random vector and produces a 64×64 image. A U-Net decoder takes a 7×7 feature map and recovers a 224×224 segmentation mask. A super-resolution model turns a low-resolution photo into a high-resolution one. For all of these, we need a learned upsampler.

The Core Insight: Transposed convolution is the adjoint of a regular convolution, literally the transpose of its im2col matrix. It is NOT the inverse. “Deconvolution,” the name in older papers, is a mathematical misnomer: the operation undoes the shape change of a convolution but never recovers the original pixel values. Everything interesting about transposed conv follows from this distinction.

In this section we derive transposed convolution from two complementary viewpoints — a sparse-matrix transpose and a scatter-add operation — and show the two views compute identical numbers. We will also meet the famous checkerboard artifact (Odena, Dumoulin & Olah 2016 [Ref 2]) and the two modern fixes that have mostly replaced transposed conv in state-of-the-art generators and super-resolution networks.


Learning Objectives

After working through this section, you will be able to:

  1. Derive the transposed-conv output-size formula $O = (I - 1)S - 2P + D(K-1) + \text{output\_padding} + 1$ from the matrix view.
  2. Implement transposed conv from scratch in pure NumPy (scatter-add) and verify the result against nn.ConvTranspose2d byte-for-byte.
  3. Explain the checkerboard artifact — reproduce the 1-2-4 overlap pattern and predict when it will appear from K, S alone.
  4. Use the two modern alternatives: upsample + conv (Odena) and pixel shuffle (Shi et al. [4]).
  5. Read and write DCGAN, U-Net and FCN decoder blocks with confidence.
  6. Know when to use output_padding and why PyTorch requires it to be less than the stride.

Why We Need Learned Upsampling

Four families of vision architecture fundamentally need to expand resolution: they start small and end large, or they need to reconstruct fine detail from coarse representations.

| Architecture family | Start → end shape | Why a learned upsampler? |
|---|---|---|
| GAN generators (DCGAN, StyleGAN, BigGAN) | (N, 100) latent → (N, 3, 64, 64) or larger image | Must synthesise every pixel; the upsampler LEARNS where edges go. |
| Segmentation decoders (FCN, U-Net, DeepLab) | Encoder's 7×7 feature map → H×W per-pixel class map | Per-pixel labels demand per-pixel resolution; bilinear resize loses class-specific detail. |
| Super-resolution (SRCNN, ESRGAN) | Low-res image → 2×/4× higher resolution | Hallucinating plausible high-frequency detail is the WHOLE task. |
| Autoencoder decoders | Latent z → reconstruction | Mirror the encoder's shape trajectory. |

You could use torch.nn.Upsample(mode='bilinear') for every one of these and add a stride-1 conv afterwards. That is in fact a perfectly reasonable modern design (and we will see why below). Historically, though, each new upsampling step was implemented as one transposed-conv layer, so the weights had to learn both the resize and the filter at once. We need to understand transposed conv to read a decade of papers, state dicts, and reference implementations.


A Name Problem — Deconvolution, Transposed, Fractionally-Strided

The operation has three names in the literature. They all refer to the same thing but carry different baggage.

| Name | Used by | Accuracy |
|---|---|---|
| Deconvolution | Zeiler et al. (2010) [5], early image-processing literature | Misleading: 'deconvolution' in signal processing means inverting a convolution (approximately), which requires the kernel's pseudo-inverse. This operation does NOT do that. |
| Transposed convolution | Dumoulin & Visin (2016) [1], PyTorch, modern papers | Mathematically precise: the operator's matrix representation is the transpose of the forward conv's matrix. |
| Fractionally-strided convolution | Older PyTorch docs, some CNN papers | Operational: describes the implementation trick of 'insert zeros between input cells, then apply regular conv with stride 1'. |

Read carefully

When a paper says “deconvolutional layer”, it almost always means transposed convolution. True deconvolution (Wiener, Richardson-Lucy) exists as a signal-processing technique, but is almost never what neural networks do.

The Matrix View — It Really Is a Transpose

From §10.4 we know that a forward convolution can be written as $\mathbf{y} = W\mathbf{x}$, where $W$ is the sparse im2col matrix and $\mathbf{x}$ is the flattened input. For a 3×3 kernel mapping a 4×4 image to a 2×2 output, $W$ is a $(4, 16)$ matrix with 9 non-zeros per row.

The transposed convolution with the same kernel is, by definition:

$$\mathbf{y}' = W^{\top} \mathbf{x}'$$

where $\mathbf{x}'$ is now the flattened 2×2 input (length 4) and $\mathbf{y}'$ is the 4×4 output (length 16). $W^{\top}$ has shape $(16, 4)$. Every row corresponds to one output cell and is populated by a specific subset of the 9 kernel taps.

The adjoint, not the inverse

$W^{\top}$ is the adjoint of $W$, not its inverse. In general $W^{\top} W \neq I$. So if you apply a forward conv then a transposed conv (with the same weights!), you get a tensor of the right shape but not the original values. Transposed conv undoes the shape, not the information.
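The adjoint claim can be checked numerically. The following is a minimal sketch, not the section's reference code: `conv_matrix` is a hypothetical helper introduced here, and it assumes an unpadded, stride-1 cross-correlation with a random 3×3 kernel on a 4×4 image.

```python
import numpy as np

def conv_matrix(w, H_in, W_in):
    """Dense matrix of a stride-1, unpadded 2-D cross-correlation with kernel w."""
    kH, kW = w.shape
    H_out, W_out = H_in - kH + 1, W_in - kW + 1
    W = np.zeros((H_out * W_out, H_in * W_in))
    for i in range(H_out):
        for j in range(W_out):
            for di in range(kH):
                for dj in range(kW):
                    W[i * W_out + j, (i + di) * W_in + (j + dj)] = w[di, dj]
    return W

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
W = conv_matrix(w, 4, 4)          # shape (4, 16)

x = rng.standard_normal(16)       # flattened 4x4 image
y = rng.standard_normal(4)        # flattened 2x2 map

# Adjoint identity: <W x, y> == <x, W^T y> for any x and y.
adjoint_ok = np.isclose(np.dot(W @ x, y), np.dot(x, W.T @ y))

# Not an inverse: W^T W is 16x16 but its rank is at most 4, so it cannot be I.
gram = W.T @ W
is_identity = np.allclose(gram, np.eye(16))
rank = np.linalg.matrix_rank(gram)

print(adjoint_ok, is_identity, rank)
```

The rank argument is the useful part: a 4×16 matrix throws away 12 dimensions of the input, and no linear map applied afterwards can bring them back.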

Interactive: Transpose as Adjoint

Hover any cell of the 2×2 input or the 4×4 output below. The column of $W^{\top}$ belonging to that input cell lights up; so do the output cells it paints into. Hovering an output cell highlights its row of $W^{\top}$ and the input cells whose kernel taps feed it. The adjoint relationship is no longer a definition; it's a click.


The Scatter-Add View — Hand Computation

Multiplying a 16×4 matrix by a 4-vector is the mathematical definition, but nobody actually implements transposed conv that way. The practical trick: every input cell $x[i,j]$ “paints” a scaled copy of the kernel into the output, starting at position $(i \cdot S, j \cdot S)$, and overlapping copies accumulate. This is exactly the rule you get by reading $W^{\top} \mathbf{x}$ column by column.

Worked example — 2×2 input, 3×3 kernel, stride 1

Let us paint by hand. Our input is the 2×2 tensor [[1, 2], [3, 4]] and our kernel is the sparse [[1, 0, 1], [0, 1, 0], [1, 0, 1]]. Stride 1, padding 0. The output size is $(2-1)\cdot 1 + 3 = 4$, so the result is 4×4.

| Input cell | Value | Paints into | Contribution |
|---|---|---|---|
| (0, 0) | 1 | output[0:3, 0:3] | [[1,0,1],[0,1,0],[1,0,1]] |
| (0, 1) | 2 | output[0:3, 1:4] | [[2,0,2],[0,2,0],[2,0,2]] |
| (1, 0) | 3 | output[1:4, 0:3] | [[3,0,3],[0,3,0],[3,0,3]] |
| (1, 1) | 4 | output[1:4, 1:4] | [[4,0,4],[0,4,0],[4,0,4]] |

Summing overlapping contributions gives:

📝result.txt
[[1, 2, 1, 2],
 [3, 5, 5, 4],
 [1, 5, 5, 2],
 [3, 4, 3, 4]]

Check, for example, cell (1, 1). All four input cells paint over it, each through a different kernel tap (w[1,1], w[1,0], w[0,1] and w[0,0] respectively), and two of those taps are zero: $1 \cdot 1 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot 1 = 5$. ✓

Interactive: Scatter-Add Painter

Click an input cell or press Play to watch every input cell paint a scaled copy of the kernel into the output. Switch the kernel between “sparse corners-plus-centre” (the example above) and all-ones, and try strides 1, 2, 3 to see how the painted windows stop overlapping once the stride reaches the kernel size. The numbers in green are running sums of the contributions you have painted so far; once all four input cells are painted, you should see the same [[1,2,1,2],[3,5,5,4],[1,5,5,2],[3,4,3,4]] the table predicts.


Transposed Conv in Pure NumPy

Transposed Convolution — NumPy scatter-add reference
🐍conv_transpose_numpy.py
import numpy as np

NumPy's ndarray plus broadcast scalar-times-matrix (x[i,j] * w) is all we need. No fancy stride tricks or im2col — transposed conv is easy to understand in its scatter-add form.

x: the 2×2 input feature map

In a GAN generator this would be an early-stage feature map; in a U-Net decoder it is the output of the previous up-block. We keep it tiny so the 4×4 output can be verified by hand.

EXECUTION STATE
x (2×2) =
    c0 c1
r0   1  2
r1   3  4
w: sparse 'corners + center' kernel

Nine kernel entries, but five are 1 and four are 0. The sparsity makes every contribution obvious when you read the output by eye: any non-zero output cell is a sum of a specific subset of the five non-zero kernel taps times specific input cells.

EXECUTION STATE
w (3×3) =
    c0 c1 c2
r0   1  0  1
r1   0  1  0
r2   1  0  1
def conv_transpose2d(x, w, stride, padding) → np.ndarray

The whole operation is ONE nested loop with a scatter-add. For every input cell, multiply the kernel by that scalar and ADD the 3×3 block into the output — at an offset determined by the input position and the stride.

EXECUTION STATE
⬇ input: x = (H_in, W_in) input feature map (single channel for clarity).
⬇ input: w = (kH, kW) kernel. Same dtype as x.
⬇ input: stride = Same role as in forward conv but with inverted effect: stride S makes the output (I−1)·S+K instead of (I−K)/S+1. Stride>1 is the 'upsampling' regime.
⬇ input: padding = In the FORWARD conv, P adds pixels around the input. In the transpose, P CROPS the same number of pixels off the output. This preserves the adjoint relationship.
⬆ returns = np.ndarray with shape (H_out, W_out) computed via the output-size formula in §Output-Size Formula below.
H_big = (H_in - 1) * stride + kH

This is the output-size formula from Dumoulin & Visin (2016) [Ref 1], Eq. 15, specialised to no padding and no output_padding. We derive it below; for now, plug in: (2−1)·1 + 3 = 4.

EXECUTION STATE
📚 (I-1)·S + K = Exactly the inverse of (I−K)/S+1 when S=1 and no padding. Shown equal to the adjoint operator's output dimension in Dumoulin & Visin §4.1.
→ our case = (2−1)·1 + 3 = 4 → output is 4×4
y_big = np.zeros((H_big, W_big))

Pre-allocate the full-size output with zeros. The loop then accumulates contributions — this is the 'scatter-add' — so every cell that never gets touched stays 0.

EXECUTION STATE
y_big (4×4) = All zeros, 16 cells, before the loop writes into it.
for i in range(H_in): / for j in range(W_in):

Unlike forward conv, the outer loop is over INPUTS, not outputs. Every input cell 'paints' an entire kernel-shaped region into the output.

LOOP TRACE · 4 iterations
i=0, j=0 → paint x[0,0]=1 into y_big[0:3, 0:3]
x[0,0] * w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
write region = y_big[0:3, 0:3]
i=0, j=1 → paint x[0,1]=2 into y_big[0:3, 1:4]
x[0,1] * w = [[2, 0, 2], [0, 2, 0], [2, 0, 2]]
write region = y_big[0:3, 1:4] (OVERLAPS previous region at cols 1, 2)
i=1, j=0 → paint x[1,0]=3 into y_big[1:4, 0:3]
x[1,0] * w = [[3, 0, 3], [0, 3, 0], [3, 0, 3]]
i=1, j=1 → paint x[1,1]=4 into y_big[1:4, 1:4]
x[1,1] * w = [[4, 0, 4], [0, 4, 0], [4, 0, 4]]
y_big[r:r+kH, c:c+kW] += x[i, j] * w

The single most important line in this section. Two NumPy idioms combine:

EXECUTION STATE
📚 scalar × ndarray = Broadcasts the scalar x[i,j] across every cell of w, producing a scalar-scaled kernel.
📚 += on a slice = In-place accumulation into a sub-region of y_big. Overlapping writes from different (i,j) iterations ACCUMULATE — this is why output cells in the middle can be larger than any single scaled-kernel entry.
→ why += and not =? = Windows overlap whenever stride < kernel_size. At every pixel, the output is a SUM of contributions, which is exactly W.T·x in matrix form (see the next code block).
if padding > 0: y_big = y_big[padding:-padding, padding:-padding]

Forward-conv padding P adds pixels around the input; transposed-conv padding CROPS P pixels off each side of the result. This is the correct adjoint: if F_P is 'pad by P then convolve', then F_P^T is 'conv-transpose then crop by P'.

EXECUTION STATE
📚 array[P:-P, P:-P] = NumPy slicing with negative end index. P:-P excludes the first P and last P entries along that axis.
print(...): the 4×4 result

Trace check: output[1,1] = x[0,0]·w[1,1] + x[0,1]·w[1,0] + x[1,0]·w[0,1] + x[1,1]·w[0,0] = 1·1 + 2·0 + 3·0 + 4·1 = 5. ✓

EXECUTION STATE
⬆ returns (4×4) =
[[1. 2. 1. 2.]
 [3. 5. 5. 4.]
 [1. 5. 5. 2.]
 [3. 4. 3. 4.]]
import numpy as np

# Input feature map (2x2)
x = np.array([
    [1, 2],
    [3, 4],
], dtype=float)

# Kernel (3x3): sparse 'corners+center' so we can verify by hand
w = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

def conv_transpose2d(x, w, stride=1, padding=0):
    """Transposed 2-D convolution via the scatter-add interpretation.

    For every input cell x[i,j], add x[i,j] * w to the output region
    starting at (i*stride, j*stride). Then strip 'padding' pixels off
    each side of the result.
    """
    H_in, W_in = x.shape
    kH, kW = w.shape
    H_big = (H_in - 1) * stride + kH
    W_big = (W_in - 1) * stride + kW
    y_big = np.zeros((H_big, W_big))

    for i in range(H_in):
        for j in range(W_in):
            r = i * stride
            c = j * stride
            y_big[r:r+kH, c:c+kW] += x[i, j] * w

    # The padding P in the forward conv becomes a CROP P in the transpose
    if padding > 0:
        y_big = y_big[padding:-padding, padding:-padding]
    return y_big

print(conv_transpose2d(x, w, stride=1, padding=0))
# [[1. 2. 1. 2.]
#  [3. 5. 5. 4.]
#  [1. 5. 5. 2.]
#  [3. 4. 3. 4.]]

Equivalence with the Transposed Matrix

The two views must agree numerically. Let us build $W$ explicitly and verify:

The scatter-add view equals W.T · x byte-for-byte
🐍matrix_verify.py
def forward_conv_matrix: build the sparse im2col matrix W

A forward 3×3 conv on a 4×4 input is a linear map from 16 input pixels to 4 output pixels. That map is a (4, 16) matrix W with exactly 9 non-zeros per row (the kernel taps) — the same im2col matrix you met in §10.4.

EXECUTION STATE
→ why this matters = If the forward pass is y = W·x, then by the definition of the adjoint, the transposed-conv forward pass is y = W.T·x. No ambiguity, no hand-waving.
y_flat = W_fwd.T @ x_flat

Literally 'W transposed, times x'. This is where the operation gets its name. W_fwd has shape (4, 16); W_fwd.T has shape (16, 4); multiplied by x_flat (shape 4) gives y_flat (shape 16), which reshapes to 4×4.

EXECUTION STATE
📚 @ operator = Python's matrix-multiplication operator (PEP 465). Equivalent to np.matmul.
W_fwd.shape = (4, 16)
W_fwd.T.shape = (16, 4)
x_flat.shape = (4,)
y_flat.shape = (16,)
Equivalence with the scatter-add result

The matrix view and the scatter-add view compute the SAME numbers. The matrix view is the clean mathematical definition; the scatter-add view is how anybody actually implements it in code. They are dual descriptions of one operator.

EXECUTION STATE
→ Dumoulin & Visin (2016) [1] = The canonical reference. Their Figure 4.1 is exactly this 2×2 → 4×4 construction, and §4.1 of their guide proves the equivalence.
import numpy as np
from numpy.testing import assert_allclose

# Same 2x2 input and 3x3 kernel as before
x = np.array([[1., 2.], [3., 4.]])
w = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 0., 1.]])

# STEP 1: build the (4, 16) im2col-style matrix W for the FORWARD
# conv mapping a 4x4 image to a 2x2 output. Each row of W picks out
# the 3x3 window that produces one output cell.
# (Reads the module-level kernel w defined above.)
def forward_conv_matrix(kH, kW, H_in, W_in):
    H_out = H_in - kH + 1
    W_out = W_in - kW + 1
    rows = H_out * W_out
    cols = H_in * W_in
    W_mat = np.zeros((rows, cols))
    for i in range(H_out):
        for j in range(W_out):
            r = i * W_out + j
            for di in range(kH):
                for dj in range(kW):
                    c = (i + di) * W_in + (j + dj)
                    W_mat[r, c] = w[di, dj]
    return W_mat

W_fwd = forward_conv_matrix(3, 3, 4, 4)   # shape (4, 16)
print("forward matrix shape:", W_fwd.shape)

# STEP 2: transposed conv IS multiplication by W_fwd.T
x_flat = x.flatten()                       # shape (4,)
y_flat = W_fwd.T @ x_flat                  # shape (16,)
y_matrix_view = y_flat.reshape(4, 4)

print("matrix-transpose view:")
print(y_matrix_view)
# [[1. 2. 1. 2.]
#  [3. 5. 5. 4.]
#  [1. 5. 5. 2.]
#  [3. 4. 3. 4.]]

# Must match the scatter-add result:
# (assuming conv_transpose2d from the previous block is in scope)
# assert_allclose(conv_transpose2d(x, w), y_matrix_view)

Transposed Conv in PyTorch

nn.ConvTranspose2d with the same inputs as the NumPy reference
🐍pytorch_convT.py
nn.ConvTranspose2d: the PyTorch wrapper

The stateful Module wrapping F.conv_transpose2d. Stores the (C_in, C_out, kH, kW) kernel — note the channel-order flip vs nn.Conv2d: transposed conv weights are (C_in, C_out, kH, kW), NOT (C_out, C_in, kH, kW).

EXECUTION STATE
📚 nn.ConvTranspose2d = PyTorch Module. Inherits from _ConvNd. Weight shape is (in_channels, out_channels, *kernel_size). Default initialisation is Kaiming-uniform like nn.Conv2d.
→ weight-shape flip = Because W.T maps C_in → C_out in the forward pass of the transpose, the weight tensor is stored with C_in as the leading axis. This catches many people off guard when moving state dicts between ConvNd and ConvTransposeNd.
stride, padding, bias arguments

Same API surface as nn.Conv2d — but padding CROPS the output here instead of expanding the input. This is the single biggest source of confusion for students: set padding=0 to match the scatter-add reference.

EXECUTION STATE
→ padding (transpose semantics) = If forward conv uses P, transposed conv must also use P to be its adjoint. PyTorch's semantic is: CROP P pixels off each side of the result. Set padding=1 with a 3×3 kernel and stride=1 to get an output the SAME spatial size as the input.
→ bias = One scalar per output channel, added to every spatial position. bias=False when followed by BatchNorm.
ct.weight.copy_(torch.tensor(...))

Manually inject the same kernel we used in NumPy so the outputs can be compared byte-for-byte. The weight tensor shape is (in_channels, out_channels, kH, kW) = (1, 1, 3, 3) — hence the triple bracket nesting.

EXECUTION STATE
📚 tensor.copy_(other) = In-place copy of 'other' into tensor, keeping the destination's dtype and device. The trailing underscore signals in-place; the call must sit inside torch.no_grad() because PyTorch raises a RuntimeError on an in-place write to a leaf tensor that requires grad.
y = ct(x): the forward pass

Dispatches through __call__ → forward → F.conv_transpose2d → ATen, and on a CUDA device typically on to cuDNN. Backends generally do not run a literal matrix-transpose multiply; a common strategy is to reuse the routine that computes the input gradient of the matching forward convolution, which gives the same output with a better memory access pattern.

EXECUTION STATE
y.shape = (1, 1, 4, 4) — (N, C_out, H_out, W_out) with H_out = (H_in-1)·S − 2P + K
Functional form F.conv_transpose2d

Same operation, no Module. Useful in Module.forward() when you prefer to keep all hyperparameters in one place instead of as attributes.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Same 2x2 input, same 3x3 kernel, but now in (N, C_in, H, W) layout
x = torch.tensor([[[
    [1., 2.],
    [3., 4.],
]]])  # shape (1, 1, 2, 2)

# Build the ConvTranspose2d module and manually set its weights so we
# can cross-check against the NumPy result.
ct = nn.ConvTranspose2d(
    in_channels=1,
    out_channels=1,
    kernel_size=3,
    stride=1,
    padding=0,
    bias=False,
)
with torch.no_grad():
    ct.weight.copy_(torch.tensor([[[
        [1., 0., 1.],
        [0., 1., 0.],
        [1., 0., 1.],
    ]]]))

y = ct(x)
print(y.squeeze())
# tensor([[1., 2., 1., 2.],
#         [3., 5., 5., 4.],
#         [1., 5., 5., 2.],
#         [3., 4., 3., 4.]])

# Functional form
y2 = F.conv_transpose2d(
    x,
    ct.weight,
    stride=1,
    padding=0,
)
assert torch.equal(y, y2)

Weight shape is (C_in, C_out, kH, kW), not (C_out, C_in, kH, kW)

This is the single most common state-dict bug. nn.Conv2d stores weights as $(C_\text{out}, C_\text{in}, kH, kW)$. nn.ConvTranspose2d stores them as $(C_\text{in}, C_\text{out}, kH, kW)$. Converting between them requires weight.transpose(0, 1), not just renaming the layer.
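One way to see why the input channel comes first is to extend the scatter-add loop to several channels. The sketch below is an illustration in NumPy, not PyTorch's implementation; `conv_transpose2d_multichannel` is a hypothetical name, and it assumes no padding, no dilation and no groups.

```python
import numpy as np

def conv_transpose2d_multichannel(x, w, stride=1):
    """Scatter-add transposed conv.

    x: (C_in, H, W) input, w: (C_in, C_out, kH, kW) kernel bank,
    returns (C_out, H_out, W_out).
    """
    C_in, H_in, W_in = x.shape
    C_in_w, C_out, kH, kW = w.shape
    assert C_in == C_in_w, "leading weight axis must match the input channels"
    H_out = (H_in - 1) * stride + kH
    W_out = (W_in - 1) * stride + kW
    y = np.zeros((C_out, H_out, W_out))
    for ci in range(C_in):
        for i in range(H_in):
            for j in range(W_in):
                r, c = i * stride, j * stride
                # w[ci] holds one kH x kW kernel per OUTPUT channel, so a single
                # input value paints into every output channel at once.
                y[:, r:r + kH, c:c + kW] += x[ci, i, j] * w[ci]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 2, 2))       # C_in = 2
w = rng.standard_normal((2, 3, 3, 3))    # (C_in = 2, C_out = 3, 3, 3)
y = conv_transpose2d_multichannel(x, w, stride=2)
print(y.shape)  # (3, 5, 5)
```

Indexing the kernel bank by the input channel (`w[ci]`) is the natural access pattern for a loop that iterates over inputs, which matches the $(C_\text{in}, C_\text{out}, kH, kW)$ layout described above.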

The Output-Size Formula

The full formula, from Dumoulin & Visin (2016) [1], Eq. 15, and repeated verbatim in the PyTorch docs for torch.nn.ConvTranspose2d:

$$O = (I - 1) S - 2P + D(K - 1) + \text{output\_padding} + 1$$

With each symbol:

| Symbol | Role | Typical values |
|---|---|---|
| I | Input spatial size | 2, 7, 14, 32, … |
| K | Kernel size | 3, 4, 5 |
| S | Stride | 1 (no upsample), 2 (2× upsample) |
| P | Padding (CROPS the output) | 0, 1, 2 |
| D | Dilation | 1 (default) |
| output_padding | Extra pixels on right/bottom only. Must be < S. | 0 or 1 |
Verify the formula against PyTorch on 20 random configurations
🐍convT_output_formula.py
def convT_out: the canonical output formula

This is the exact formula in the PyTorch docs for ConvTranspose2d and in Dumoulin & Visin (2016) [1], Eq. 15. Every term has a pedagogical reading:

EXECUTION STATE
📚 (I-1)·S = Spacing between scattered kernel copies. With S=1 they overlap by (K-1); with S>1 gaps appear and output gets larger.
📚 −2P = Crops P pixels off each side of the result — the adjoint of padding the input in a forward conv.
📚 dilation·(K-1) = The effective kernel footprint when dilation > 1. For dilation=1 this is just (K-1).
📚 output_padding = Adds 0..S-1 pixels ONLY on the right/bottom of the output. Exists to break the ambiguity that a forward conv with stride S can map multiple input sizes to the same output size — see next code block.
📚 + 1 = The usual 'off-by-one' conversion between (size) and (end−start+1).
assert predicted == actual: PyTorch matches the formula

Sanity check over 20 random valid configurations. If this ever fails, suspect a dilation or output_padding misuse — not a PyTorch bug.

import torch
import torch.nn.functional as F

def convT_out(I, K, S, P, output_padding=0, dilation=1):
    """Dumoulin & Visin (2016), Eq. 15, reproduced verbatim from the PyTorch
    torch.nn.ConvTranspose2d docs."""
    return (I - 1) * S - 2 * P + dilation * (K - 1) + output_padding + 1

# Verify against PyTorch on 20 random configurations
torch.manual_seed(0)
for _ in range(20):
    I = torch.randint(2, 20, (1,)).item()
    K = torch.randint(1, 6, (1,)).item()
    S = torch.randint(1, 4, (1,)).item()
    # Keep 2P <= K - 1 so the output size stays positive; a larger P can
    # crop the whole output away, which PyTorch rejects.
    P = torch.randint(0, (K + 1) // 2, (1,)).item()
    op = torch.randint(0, S, (1,)).item()  # output_padding < stride (PyTorch requires this)

    kernel = torch.randn(1, 1, K, K)
    x = torch.randn(1, 1, I, I)
    y = F.conv_transpose2d(x, kernel, stride=S, padding=P, output_padding=op)

    predicted = convT_out(I, K, S, P, op)
    actual = y.shape[-1]
    assert predicted == actual, (I, K, S, P, op, predicted, actual)
print("All 20 random configurations match the formula.")
| Forward conv | Matching transposed conv | Transposed output |
|---|---|---|
| (I=4, K=3, S=1, P=0) → 2 | (I=2, K=3, S=1, P=0) | 4 |
| (I=4, K=3, S=1, P=1) → 4 (same padding) | (I=4, K=3, S=1, P=1) | 4 |
| (I=8, K=4, S=2, P=1) → 4 | (I=4, K=4, S=2, P=1, output_padding=0) | 8 |
| (I=7, K=3, S=2, P=1) → 4 (exact division) | (I=4, K=3, S=2, P=1, output_padding=0) | 7 |
| (I=8, K=3, S=2, P=1) → 4 (floor) | (I=4, K=3, S=2, P=1, output_padding=1) | 8 |
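The rows of this table can be reproduced with two small size functions. This is a sketch in plain Python that assumes dilation 1 for the forward conv; the function names `conv_out` and `conv_transpose_out` are chosen here for illustration.

```python
def conv_out(I, K, S=1, P=0):
    """Forward conv output size (dilation 1): floor((I + 2P - K) / S) + 1."""
    return (I + 2 * P - K) // S + 1

def conv_transpose_out(I, K, S=1, P=0, output_padding=0, D=1):
    """Transposed conv output size, the same formula as in the text."""
    return (I - 1) * S - 2 * P + D * (K - 1) + output_padding + 1

# (forward input size, K, S, P, output_padding), one tuple per table row
rows = [
    (4, 3, 1, 0, 0),
    (4, 3, 1, 1, 0),
    (8, 4, 2, 1, 0),
    (7, 3, 2, 1, 0),
    (8, 3, 2, 1, 1),
]

round_trips = []
for I, K, S, P, op in rows:
    down = conv_out(I, K, S, P)                      # size after the forward conv
    up = conv_transpose_out(down, K, S, P, op)       # size after going back up
    round_trips.append((I, down, up))
    print(f"I={I} -> conv -> {down} -> convT(output_padding={op}) -> {up}")
```

In every row the final size equals the starting size, which is the shape round trip the table describes.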

Interactive: Output-Size Calculator

Drag the six sliders to see every term of the output formula light up as it is added. The forward-conv counterpart range tells you which forward-conv inputs all collapse to the current $I$, and output_padding is automatically clamped to $< S$ (PyTorch's constraint). The copy-paste-ready PyTorch invocation is the same line you would type when reproducing this configuration in code.


Strided Transposed Convolution — Zero Insertion

When $S > 1$, transposed conv becomes an upsampler. The classic “fractionally-strided” interpretation makes this intuitive: insert $S-1$ zeros between every pair of adjacent input cells along each axis, pad the result by $K - 1 - P$ zeros on each side, and then apply a regular stride-1 convolution with the same kernel rotated by 180° (for a symmetric kernel such as the one in our example, the rotation changes nothing). The output size matches the formula above.

Example — 2×2 input, K=3, S=2:

📝zero_insertion.txt
Input (2x2):
    [[1, 2],
     [3, 4]]

After inserting S-1 = 1 zero between cells (3x3):
    [[1, 0, 2],
     [0, 0, 0],
     [3, 0, 4]]

Pad by K-1-P = 2 zeros around (7x7):
    [[0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 2, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 3, 0, 4, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0]]

Apply regular 3x3 conv, stride 1 → 5x5 output.
(Output size check: (2-1)*2 + 3 = 5 ✓)

Modern implementations skip the explicit zero insertion (too wasteful) and go straight to the scatter-add form or to a cuDNN routine that fuses both steps. But the zero-insertion picture is useful for intuition and is the origin of the name “fractionally-strided convolution”.
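The equivalence between the zero-insertion picture and scatter-add can be confirmed in a few lines of NumPy. This is a sketch for the P = 0 case only; `scatter_add` and `zero_insert_conv` are hypothetical helper names. It uses an asymmetric kernel on purpose: the sliding step is a cross-correlation, so the kernel has to be rotated by 180° for the two views to agree.

```python
import numpy as np

def scatter_add(x, w, stride):
    """Reference transposed conv (no padding) via scatter-add."""
    H, W = x.shape
    kH, kW = w.shape
    y = np.zeros(((H - 1) * stride + kH, (W - 1) * stride + kW))
    for i in range(H):
        for j in range(W):
            y[i * stride:i * stride + kH, j * stride:j * stride + kW] += x[i, j] * w
    return y

def zero_insert_conv(x, w, stride):
    """Same result via zero insertion, zero padding and a stride-1 correlation."""
    H, W = x.shape
    kH, kW = w.shape
    # 1. insert (stride - 1) zeros between neighbouring input cells
    z = np.zeros(((H - 1) * stride + 1, (W - 1) * stride + 1))
    z[::stride, ::stride] = x
    # 2. pad by K - 1 on every side (the P = 0 case of K - 1 - P)
    z = np.pad(z, ((kH - 1, kH - 1), (kW - 1, kW - 1)))
    # 3. slide the 180-degree-rotated kernel across z with stride 1
    w_rot = w[::-1, ::-1]
    H_out, W_out = z.shape[0] - kH + 1, z.shape[1] - kW + 1
    y = np.zeros((H_out, W_out))
    for r in range(H_out):
        for c in range(W_out):
            y[r, c] = np.sum(z[r:r + kH, c:c + kW] * w_rot)
    return y

x = np.array([[1., 2.], [3., 4.]])
w = np.arange(1., 10.).reshape(3, 3)   # asymmetric, so the rotation matters

a = scatter_add(x, w, stride=2)
b = zero_insert_conv(x, w, stride=2)
print(a.shape, np.allclose(a, b))      # (5, 5) True
```

The zero-insertion version multiplies mostly by zeros, which is the waste the scatter-add form avoids.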

Interactive: Fractionally-Strided View

Step through the four stages below — original input, zero-inserted, zero-padded, then a regular stride-1 conv slid across the result. The output size of the final conv equals the transposed-conv output size from the formula above, exactly. Toggle stride and padding to confirm.


output_padding — Resolving the Ambiguity

A forward conv with $S > 1$ does floor-division in its output formula. Two adjacent input sizes can map to the same output size, so the transposed conv going the other way is ambiguous: given a 4×4 output, was the forward input 7×7 or 8×8? output_padding picks the answer.

output_padding breaks the size ambiguity of strided forward convs
🐍output_padding_demo.py
Why output_padding exists

A forward conv with stride S > 1 has floor-division in its output formula. Two adjacent input sizes can therefore map to the same output size — and so the transpose, which is a map in the other direction, has an irreducible ambiguity. output_padding picks which of the (up to S) candidate sizes you want.

EXECUTION STATE
📚 ambiguity example = Forward conv(K=3, S=2, P=1) maps input=7 → 4 AND input=8 → 4. Two different forward inputs, same forward output. Transpose of 4 back to 'original' is either 7 or 8, and you have to say which.
output_padding=0 vs output_padding=1

output_padding adds 0 ≤ op < S pixels ONLY to the right and bottom of the output — never symmetric padding. PyTorch requires op < S so the choice is unique.

EXECUTION STATE
→ rule of thumb = If your network uses conv(K, S, P) to go DOWN, mirror it with convT(K, S, P, output_padding = (target_size − naive_size)) to go back UP. Usually 0 or 1.
Shapes round-trip but VALUES do not

This is the single most misunderstood fact about transposed convolution: it is the ADJOINT operator, NOT the inverse. The shape round-trips (input → forward conv → output → conv_transpose → same shape as input) but the values do not — recovering exact values would require the kernel's pseudo-inverse, which a transposed conv does not compute.

EXECUTION STATE
→ therefore = 'Deconvolution' is a historical misnomer. See the Name Problem section above.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two DIFFERENT forward convs that produce the SAME output size:
#   input 7,  K=3, S=2, P=1  →  (7 + 2 - 3) / 2 + 1 = 4          (exact division)
#   input 8,  K=3, S=2, P=1  →  floor((8 + 2 - 3) / 2) + 1 = 4   (floor of 3.5)
# Going in reverse, we get a 4x4 input and want either 7x7 OR 8x8 back.
# The ambiguity is resolved by output_padding.

x = torch.randn(1, 1, 4, 4)
ct = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1, bias=False)

# output_padding=0 → smaller of the two candidates (7x7)
y_small = F.conv_transpose2d(x, ct.weight, stride=2, padding=1, output_padding=0)
print("output_padding=0:", y_small.shape)  # torch.Size([1, 1, 7, 7])

# output_padding=1 → larger candidate (8x8)
y_large = F.conv_transpose2d(x, ct.weight, stride=2, padding=1, output_padding=1)
print("output_padding=1:", y_large.shape)  # torch.Size([1, 1, 8, 8])

# Round-trip check: forward conv on the 8x8 output maps back to 4x4
x_back = F.conv2d(y_large, ct.weight, stride=2, padding=1)
print("round-trip:", x_back.shape)          # torch.Size([1, 1, 4, 4])
# NOTE: values differ. Transposed conv is NOT the inverse of convolution.
# Only the SHAPE round-trips; recovering the values would take a pseudo-inverse,
# which neither operation computes.

PyTorch's constraint: output_padding < stride

The ambiguity is exactly $S-1$ pixels wide, so any integer $0 \leq \text{output\_padding} < S$ uniquely identifies one of the candidate output sizes. PyTorch raises an error when the constraint is violated.
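For a decoder layer that mirrors an encoder layer, the required value follows from the forward formula. Write $I + 2P - K = qS + r$ with $0 \leq r < S$. The forward conv outputs $q + 1$, and the transposed conv with output_padding = 0 maps that back to $qS - 2P + K = I - r$, so the missing amount is exactly $r$. The sketch below checks this in plain Python, assuming dilation 1; `mirror_output_padding` is a hypothetical helper name.

```python
def conv_out(I, K, S, P):
    """Forward conv output size (dilation 1)."""
    return (I + 2 * P - K) // S + 1

def conv_transpose_out(I, K, S, P, output_padding=0):
    """Transposed conv output size (dilation 1)."""
    return (I - 1) * S - 2 * P + (K - 1) + output_padding + 1

def mirror_output_padding(I, K, S, P):
    """output_padding that makes convT(K, S, P) map conv_out(I) back to I.

    The forward floor discards r = (I + 2P - K) mod S pixels, and that
    remainder is what the transposed conv has to add back.
    """
    return (I + 2 * P - K) % S

K, S, P = 3, 2, 1
for I in range(7, 13):
    op = mirror_output_padding(I, K, S, P)
    small = conv_out(I, K, S, P)
    restored = conv_transpose_out(small, K, S, P, op)
    print(f"I={I}: conv -> {small}, output_padding={op}, convT -> {restored}")
```

Because the result is a remainder modulo $S$, it always lands in the range $0 \leq \text{output\_padding} < S$.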

The Checkerboard Artifact — Odena, Dumoulin & Olah (2016)

The most important known pathology of transposed convolution. Odena, Dumoulin & Olah (2016) [Ref 2], in a short and influential Distill article, showed that GAN generators built from stacked ConvTranspose2d(K=3, S=2) layers produce outputs with visible grid-like patterns — even when the training signal contains no such patterns. The cause is a purely geometric property of transposed conv, not a training bug.

Reproducing the 1-2-4 Overlap Pattern

Take a uniform 3×3 input of ones, a uniform 3×3 kernel of ones, and apply stride-2 transposed conv. The output — which should be uniform by symmetry — is not.

Uniform input + uniform kernel + stride 2 ≠ uniform output
🐍checkerboard.py
x = np.ones((3, 3)), w = np.ones((3, 3))

The cleanest possible test case: uniform input, uniform kernel. If the transposed-conv output were uniform too, there would be no artifact. It is not.

print(...): the 1-2-4 pattern

The output is 7×7 with a periodic structure of 1s, 2s, and 4s. The 4s are where FOUR scattered-kernel copies overlap; the 2s where TWO overlap; the 1s where exactly ONE copy lands. This is the checkerboard artifact.

EXECUTION STATE
⬆ output (7×7) =
[[1 1 2 1 2 1 1]
 [1 1 2 1 2 1 1]
 [2 2 4 2 4 2 2]
 [1 1 2 1 2 1 1]
 [2 2 4 2 4 2 2]
 [1 1 2 1 2 1 1]
 [1 1 2 1 2 1 1]]
→ diagnostic formula = For stride S and kernel size K, the overlap count varies along each axis with period S. Uniform overlap ONLY happens when K is a multiple of S (Odena et al. 2016 [Ref 2], Fig. 3).
Why this matters in practice

A trained GAN generator with ConvTranspose2d(K=3, S=2) learns kernel weights that try to compensate for the 1-2-4 overlap. But with very limited capacity at that depth, the generator often cannot fully cancel the pattern — and we see visible grid-like artifacts in the generated images (see the DCGAN outputs in Radford et al. 2016 [3] for the canonical early examples).

import numpy as np

# Odena, Dumoulin & Olah (2016), "Deconvolution and Checkerboard Artifacts"
# Setup: 3x3 feature map of all ones, K=3 kernel of all ones, stride=2.
x = np.ones((3, 3))
w = np.ones((3, 3))

def conv_transpose2d_strided(x, w, stride):
    H_in, W_in = x.shape
    kH, kW = w.shape
    H_out = (H_in - 1) * stride + kH
    W_out = (W_in - 1) * stride + kW
    y = np.zeros((H_out, W_out))
    for i in range(H_in):
        for j in range(W_in):
            y[i*stride:i*stride+kH, j*stride:j*stride+kW] += x[i, j] * w
    return y

print(conv_transpose2d_strided(x, w, stride=2))
# [[1. 1. 2. 1. 2. 1. 1.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [2. 2. 4. 2. 4. 2. 2.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [2. 2. 4. 2. 4. 2. 2.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [1. 1. 2. 1. 2. 1. 1.]]

# The periodic 1-2-4 pattern is the CHECKERBOARD. Even with a perfectly
# uniform input and a perfectly uniform kernel, transposed-conv output
# is non-uniform whenever kernel_size is not divisible by stride.

Why does this happen? Each output cell $(r, c)$ is the sum of contributions from every input cell whose scattered kernel touches it. With $K=3$ and $S=2$, the number of input cells that touch a given output cell alternates between 1 and 2 along each axis, producing the 1-2-4 pattern we see. Whenever $K$ is not a multiple of $S$, the overlap is non-uniform and a checkerboard appears.

Diagnostic rule

Transposed conv is artifact-free (in this sense) if and only if $K$ is a multiple of $S$. Odena et al. (2016) [2] recommend $K=4, S=2$ if you must use transposed conv, but prefer the two alternatives below.
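The rule can be checked numerically. The following is a minimal NumPy sketch, not part of the original listing; the helper names `overlap_counts_1d` and `interior_is_uniform` are made up for this illustration. It counts, along one axis, how many scattered kernel copies cover each output position, and it tests uniformity only away from the borders, where fewer copies always land.

```python
import numpy as np

def overlap_counts_1d(n_in, K, S):
    """How many scattered kernel copies cover each output position (one axis)."""
    n_out = (n_in - 1) * S + K
    counts = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        counts[i * S : i * S + K] += 1   # copy i covers K positions starting at i*S
    return counts

def interior_is_uniform(K, S, n_in=8):
    """True if every position at least K-1 away from both ends has the same count."""
    counts = overlap_counts_1d(n_in, K, S)
    interior = counts[K - 1 : len(counts) - (K - 1)]
    return len(set(interior.tolist())) == 1

print(overlap_counts_1d(3, K=3, S=2))   # [1 1 2 1 2 1 1]
print(overlap_counts_1d(3, K=4, S=2))   # [1 1 2 2 2 2 1 1]

for K in range(2, 6):
    for S in range(1, 4):
        print(K, S, interior_is_uniform(K, S), K % S == 0)
```

The 2-D overlap map is the outer product of two such 1-D count vectors, which is where the 1-2-4 values come from. In every (K, S) pair of the loop, the last two printed columns agree.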

Interactive: Overlap Heatmap Explorer

Slide $K$ and $S$ below. Whenever $K \bmod S = 0$, the heatmap is a single uniform colour and the diagnostic strip turns green; otherwise it shows the periodic checkerboard pattern. This is exactly Figure 3 from Odena, Dumoulin & Olah (2016) [Ref 2], but you can drag the knobs.

[Interactive figure: overlap-count heatmap for adjustable K and S]

Fix 1 — Nearest Upsample + Regular Conv

Odena's recommended fix: decouple upsampling from filtering. First use a parameter-free nearest-neighbour (or bilinear) nn.Upsample to resize, then apply a regular nn.Conv2d to filter. Same receptive field, same parameter count, no checkerboard.

Odena's fix: nn.Upsample + nn.Conv2d
🐍upsample_conv.py
9convT — the classic (and prone to checkerboard)

Canonical generator/decoder block since DCGAN (2016). Learns a 3×3 kernel that 'upsamples + convolves' in one step. Fast, compact, but the overlap pattern from the previous example leaks into the learned kernel as an implicit constraint — the kernel must cancel the checkerboard bias, which it cannot always do perfectly.

15class UpsampleConv — Odena's proposed fix

Decouple upsampling from feature extraction: first resize the feature map deterministically (nearest-neighbour has no learnable bias), then apply a regular stride-1 conv that can focus entirely on filtering. Same effective receptive field, same output shape, no checkerboard.

EXECUTION STATE
📚 nn.Upsample(scale_factor, mode) = Parameter-free resize. mode can be 'nearest' (no artifact), 'bilinear' (smoother), 'bicubic' (smoother still). Nearest is the Odena default and is fastest.
→ parameter count = UpsampleConv: 9·4·8 + 8 = 296 for the conv. ConvTranspose2d: 3·3·4·8 + 8 = 296. Identical. The fix is free.
25Same shape, kinder gradients

Both variants output (1, 8, 14, 14). The upsample+conv variant is now the default in many modern GAN architectures (StyleGAN1 and later, BigGAN) and in most U-Net reimplementations.

1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5# Same (N=1, C_in=4, H=7, W=7) input for both variants
6x = torch.randn(1, 4, 7, 7)
7
8# Variant A — the classic (prone to checkerboard)
9convT = nn.ConvTranspose2d(4, 8, kernel_size=3, stride=2, padding=1, output_padding=1)
10yA = convT(x)
11print("convT:        ", yA.shape)  # torch.Size([1, 8, 14, 14])
12
13# Variant B — Odena et al. (2016) recommendation:
14#             nearest-upsample, then stride-1 conv
15class UpsampleConv(nn.Module):
16    def __init__(self, C_in, C_out, scale_factor=2, kernel_size=3, padding=1):
17        super().__init__()
18        self.up = nn.Upsample(scale_factor=scale_factor, mode='nearest')
19        self.conv = nn.Conv2d(C_in, C_out, kernel_size=kernel_size, padding=padding)
20    def forward(self, x):
21        return self.conv(self.up(x))
22
23up_conv = UpsampleConv(4, 8)
24yB = up_conv(x)
25print("upsample+conv:", yB.shape)  # torch.Size([1, 8, 14, 14])
26
27# Same shape, roughly same parameter count (9·4·8 + 8 = 296 either way),
28# but the upsample+conv variant eliminates the 1-2-4 overlap pattern.
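To see why this variant has no overlap pattern, the all-ones experiment from the checkerboard example can be repeated for it. The sketch below is a hedged NumPy illustration with made-up helper names (`nearest_upsample`, `conv2d_same`); it uses a single channel and a naive loop rather than nn.Upsample and nn.Conv2d.

```python
import numpy as np

def nearest_upsample(x, r):
    """Repeat every pixel r times along both axes (nearest-neighbour resize)."""
    return np.repeat(np.repeat(x, r, axis=0), r, axis=1)

def conv2d_same(x, w):
    """Stride-1 cross-correlation with zero padding; assumes an odd square kernel."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)                        # zero padding on all four sides
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

x = np.ones((3, 3))                          # same uniform input as before
w = np.ones((3, 3))                          # same uniform kernel as before
y = conv2d_same(nearest_upsample(x, 2), w)

print(y.shape)                               # (6, 6)
print(np.unique(y[1:-1, 1:-1]))              # [9.]
```

Every interior output equals 9 because each one sees a full 3×3 window of the upsampled map. The only smaller values (4 in the corners, 6 along the edges) come from zero padding at the border, not from a periodic overlap pattern.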

Fix 2 — Pixel Shuffle / Sub-Pixel Convolution

Shi et al. (2016) [Ref 4]'s ESPCN introduced a different approach: do the convolution in low resolution with $r^2 C_\text{out}$ output channels, then use a zero-parameter pixel shuffle to permute those channels into an $r$-times larger spatial extent.

Interactive: PixelShuffle Rearrangement

Hover any cell on either side of the visualizer below to trace the mapping rule $y[c,\ ri+di,\ rj+dj] = x[c\,r^2 + di\,r + dj,\ i,\ j]$, which is the channel ordering used by nn.PixelShuffle. Each group of $r^2$ input channels lands on exactly $r^2$ output pixels per super-pixel, with no overlap, which is why pixel shuffle has no overlap pattern by construction.

[Interactive figure: PixelShuffle channel-to-space rearrangement]
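The rearrangement can also be written out as explicit loops. The sketch below is a minimal NumPy illustration that assumes the channel ordering of PyTorch's nn.PixelShuffle (output channel c reads the r² consecutive input channels starting at c·r²). The function name `pixel_shuffle_loops` is hypothetical and the batch dimension is omitted.

```python
import numpy as np

def pixel_shuffle_loops(x, r):
    """Rearrange x of shape (C*r*r, H, W) into (C, r*H, r*W).

    Assumed rule: y[c, r*i + di, r*j + dj] = x[c*r*r + di*r + dj, i, j].
    """
    Crr, H, W = x.shape
    assert Crr % (r * r) == 0, "channel count must be divisible by r*r"
    C = Crr // (r * r)
    y = np.empty((C, r * H, r * W), dtype=x.dtype)
    for c in range(C):
        for di in range(r):
            for dj in range(r):
                # One input channel fills one of the r*r sub-pixel phases.
                y[c, di::r, dj::r] = x[c * r * r + di * r + dj]
    return y

x = np.arange(8 * 2 * 3).reshape(8, 2, 3)    # C=2, r=2, H=2, W=3
y = pixel_shuffle_loops(x, 2)

print(y.shape)                               # (2, 4, 6)
print(y[1, 3, 4] == x[1 * 4 + 1 * 2 + 0, 1, 2])   # True
```

Each input element is written exactly once, so the operation is a pure permutation with no sums and no overlap.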
Pixel shuffle (Shi et al. 2016) — the super-resolution upsampler
🐍pixel_shuffle.py
6class SubPixelUp — the ESPCN pattern

Shi et al. (2016) [4] observed that rather than upsample first and then filter, we can filter first in the low-resolution space and then rearrange channels into space: sub-pixel convolution. The saving comes from running the preceding layers at low resolution; for the final layer alone, the multiply count matches upsample-then-conv. This became a standard upsampler in super-resolution (e.g. EDSR) and is used in many modern generators too.

EXECUTION STATE
→ cost efficiency = Sub-pixel conv in low-res space: O(K²·C_in·C_out·r²·H·W). Upsample-then-conv: O(K²·C_in·C_out·r²·H·W). The multiply counts are the same; any speed difference comes from how the backend handles many channels on a small feature map versus fewer channels on a large one.
13self.conv — produces r² × C_out channels

Key trick: conv output has r² times more channels than the final output. For 2× upsampling and C_out=8 final channels, conv produces 32 channels at the same spatial resolution.

EXECUTION STATE
→ conv output shape = (N, r² · C_out, H, W) = (1, 32, 7, 7) for our example
15nn.PixelShuffle — the zero-param rearrange

Reorders a (N, C·r², H, W) tensor into (N, C, r·H, r·W) by interleaving the 'extra' channels into new spatial positions. No multiplies, no adds — it is a pure index permutation.

EXECUTION STATE
📚 nn.PixelShuffle(r) = Also available as torch.nn.functional.pixel_shuffle. Inverse: nn.PixelUnshuffle (introduced in PyTorch 1.8). Parameter count: 0.
→ mapping rule = Output cell (c, r·i+di, r·j+dj) = input cell (c·r² + di·r + dj, i, j). Each output channel c draws on r² consecutive input channels, which become the r² pixels of each super-pixel in the output.
25No checkerboard by construction

The rearrangement is exact: every output pixel is copied from exactly ONE element of the conv output, and no two contributions are ever summed. There is no overlap pattern to create a 1-2-4 checkerboard. This is why many super-resolution networks prefer pixel shuffle over transposed conv.

1import torch
2import torch.nn as nn
3
4# Pixel shuffle (Shi et al. 2016 [4]) — aka 'sub-pixel convolution'.
5# Key idea: produce r^2 * C_out output channels with a regular conv,
6# then rearrange them into one tensor with r^2 times the spatial area.
7
8class SubPixelUp(nn.Module):
9    """Replace ConvTranspose2d with conv → pixel_shuffle. No checkerboard."""
10    def __init__(self, C_in, C_out, scale_factor=2):
11        super().__init__()
12        r = scale_factor
13        # Conv produces r^2 * C_out channels at the SAME spatial size
14        self.conv = nn.Conv2d(C_in, C_out * r * r, kernel_size=3, padding=1)
15        # PixelShuffle rearranges (r^2 * C_out, H, W) → (C_out, r*H, r*W)
16        self.shuffle = nn.PixelShuffle(upscale_factor=r)
17
18    def forward(self, x):
19        return self.shuffle(self.conv(x))
20
21x = torch.randn(1, 4, 7, 7)
22up = SubPixelUp(C_in=4, C_out=8, scale_factor=2)
23y = up(x)
24print(y.shape)  # torch.Size([1, 8, 14, 14])
25
26# Parameter count — same 3x3 conv as before, just with 4x more output
27# channels (8 * 4 = 32), so 9*4*32 + 32 = 1184 parameters.
28n_params = sum(p.numel() for p in up.parameters())
29print(f"params: {n_params}")  # 1184
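The parameter and compute figures for the three upsamplers in this section can be put side by side with a few lines of arithmetic. This is a rough sketch under stated assumptions: the helper names are made up, the multiply-accumulate (MAC) count charges one full kernel placement per site and ignores border effects, and bias additions are not counted as MACs.

```python
def conv_params(C_in, C_out, K, bias=True):
    """Weights (plus optional bias) of a K x K conv or transposed conv."""
    return K * K * C_in * C_out + (C_out if bias else 0)

def conv_macs(C_in, C_out, K, H_sites, W_sites):
    """K*K*C_in*C_out multiply-accumulates per kernel placement."""
    return K * K * C_in * C_out * H_sites * W_sites

# Same running example as the listings: 4 -> 8 channels, 7x7 -> 14x14
C_in, C_out, K, r, H, W = 4, 8, 3, 2, 7, 7

options = {
    # Transposed conv scatters one kernel per INPUT pixel.
    "conv_transpose": (conv_params(C_in, C_out, K),
                       conv_macs(C_in, C_out, K, H, W)),
    # Upsample + conv places one kernel per OUTPUT pixel.
    "upsample_conv": (conv_params(C_in, C_out, K),
                      conv_macs(C_in, C_out, K, r * H, r * W)),
    # Conv + pixel shuffle: one kernel per INPUT pixel, r*r times the channels.
    "sub_pixel": (conv_params(C_in, C_out * r * r, K),
                  conv_macs(C_in, C_out * r * r, K, H, W)),
}

for name, (params, macs) in options.items():
    print(name, params, macs)
# conv_transpose 296 14112
# upsample_conv 296 56448
# sub_pixel 1184 56448
```

Under these assumptions, upsample + conv matches the transposed conv in parameters but costs r² times the multiplies, while conv + pixel shuffle matches upsample + conv in multiplies but carries r² times the parameters.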

Which upsampler should I choose?

  1. GAN generator (modern): upsample + conv (StyleGAN, BigGAN) OR pixel shuffle.
  2. U-Net segmentation: transposed conv (historically) or upsample + conv. Both work; the skip connections hide most artifacts.
  3. Super-resolution (ESPCN, EDSR): pixel shuffle, which has no overlap pattern and keeps the convolution at low resolution.
  4. Legacy code you must reproduce: transposed conv, unfortunately. Know the pitfalls and hope the training pipeline has already learned to cancel the pattern.

The Backward Pass — It Really Is a Regular Convolution

§10.4 derived the backward pass of a forward conv: $\partial L / \partial \mathbf{x}$ is itself a convolution of $\partial L / \partial \mathbf{y}$ with the flipped kernel. The analogous fact for transposed conv is even cleaner: because the forward pass is $\mathbf{y} = W^{\top} \mathbf{x}$, differentiating gives:

$$\frac{\partial L}{\partial \mathbf{x}} = W \cdot \frac{\partial L}{\partial \mathbf{y}}$$

But $W \cdot \mathbf{v}$ is exactly the forward pass of a regular convolution. So:

The backward pass of ConvTranspose2d (with respect to its input) is a forward Conv2d. And the backward pass of Conv2d is a forward ConvTranspose2d. The two operators are each other's backward pass, which is why PyTorch implements them with shared cuDNN kernels.

This duality is the reason they have the same parameter count for matching hyperparameters, and the reason cuDNN can reuse the same highly-optimised im2col-and-GEMM primitives (§10.4) to accelerate both.
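The duality is easy to confirm on a small 1-D case. The sketch below is an illustrative NumPy check, not the PyTorch or cuDNN implementation; the helper names and the sizes (length-7 signal, K=3, S=2) are assumptions chosen for the example. It builds the dense matrix $W$ of a strided convolution, confirms that scatter-add equals $W^{\top}\mathbf{x}$, and compares $W \cdot \partial L/\partial \mathbf{y}$ with a finite-difference gradient.

```python
import numpy as np

def conv_matrix_1d(n_in, w, stride):
    """Dense matrix W of a 1-D valid cross-correlation, so that y = W @ x."""
    K = len(w)
    n_out = (n_in - K) // stride + 1
    W = np.zeros((n_out, n_in))
    for o in range(n_out):
        W[o, o * stride : o * stride + K] = w
    return W

def conv_transpose_1d(x, w, stride):
    """Scatter-add transposed conv: each input value stamps a scaled kernel copy."""
    K = len(w)
    y = np.zeros((len(x) - 1) * stride + K)
    for i, xi in enumerate(x):
        y[i * stride : i * stride + K] += xi * w
    return y

rng = np.random.default_rng(0)
w = rng.standard_normal(3)
W = conv_matrix_1d(n_in=7, w=w, stride=2)      # shape (3, 7)

x = rng.standard_normal(3)                      # input of the transposed conv
y = conv_transpose_1d(x, w, stride=2)           # length 7
forward_matches = np.allclose(y, W.T @ x)       # forward pass is W^T x

grad_y = rng.standard_normal(7)                 # stand-in for dL/dy
grad_x = W @ grad_y                             # claimed dL/dx: a regular conv of grad_y

# Central differences on L(x) = <grad_y, conv_transpose_1d(x)>
eps = 1e-6
numeric = np.array([
    (grad_y @ conv_transpose_1d(x + eps * e, w, 2)
     - grad_y @ conv_transpose_1d(x - eps * e, w, 2)) / (2 * eps)
    for e in np.eye(3)
])
backward_matches = np.allclose(grad_x, numeric, atol=1e-6)

print(forward_matches, backward_matches)        # True True
```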


Applications in the Wild

DCGAN Generator

Radford, Metz & Chintala (2016) [Ref 3] introduced the deep convolutional GAN: a generator built entirely from transposed convolutions stacked to upsample a 100-dim latent vector into a 64×64 RGB image.

DCGAN generator — five ConvTranspose2d blocks, 100-dim latent → 64×64 RGB image
🐍dcgan_generator.py
1import torch.nn as nn

Brings in PyTorch's neural-network module collection. We'll use nn.Module (base class), nn.Sequential (container), nn.ConvTranspose2d (the upsampler this section is about), nn.BatchNorm2d, nn.ReLU, and nn.Tanh.

3class DCGANGenerator(nn.Module)

The generator network from Radford, Metz & Chintala (2016) [Ref 3]. Takes a 100-dim latent noise vector and produces a 64×64 RGB image. Inheriting from nn.Module gives us .parameters(), .to(device), .train()/.eval(), and a hookable forward pass.

EXECUTION STATE
📚 nn.Module = PyTorch's base class for all neural networks. Anything you put in self.* gets registered as a child Module or Parameter and is automatically tracked for gradient updates and device transfer.
5def __init__(self, nz=100, ngf=64, nc=3)

Standard DCGAN hyperparameters. nz = latent-space dim (size of input noise), ngf = base number of generator filters, nc = number of output channels.

EXECUTION STATE
⬇ input: nz = 100 = Length of the input noise vector z. Each z is sampled from N(0, 1). The 100-dim default comes directly from the DCGAN paper.
⬇ input: ngf = 64 = Generator filter base width. Layer i has ngf · 2^(4−i) filters: 512, 256, 128, 64, then 3 (RGB). The channel count halves each time the spatial size doubles, so the activation volume only doubles per stage instead of quadrupling.
⬇ input: nc = 3 = Output channels. 3 for RGB images. Set to 1 for greyscale (e.g. MNIST).
6super().__init__()

Mandatory first line of every nn.Module subclass — initialises the parent's parameter registry. Skipping it produces the infamous 'cannot assign parameter before Module.__init__' error.

7self.main = nn.Sequential(...)

Stacks the upsampling blocks in order. nn.Sequential calls each child in turn during forward(), so we don't have to write the boilerplate.

EXECUTION STATE
📚 nn.Sequential = Container Module. forward(x) ≡ for layer in self: x = layer(x). Useful when the dataflow is a linear chain — DCGAN's generator is exactly that.
9ConvTranspose2d(nz, ngf*8, 4, 1, 0, bias=False) — (1,1) → (4,4)

First upsampling block. K=4, S=1, P=0 expands 1×1 → 4×4 (formula: (1−1)·1 − 0 + 3 + 0 + 1 = 4). Conceptually: 'unfold' the latent vector into a 4×4 spatial grid of 512 channels.

EXECUTION STATE
args (in_ch, out_ch, K, S, P, bias) = (100, 512, 4, 1, 0, bias=False). K=4 is divisible by S=1, so trivially checkerboard-safe (any K is, when S=1).
→ params = K²·C_in·C_out = 4·4·100·512 = 819,200 weights, no bias.
→ bias=False = Standard pattern: every ConvTranspose2d that is followed by BatchNorm2d uses bias=False, because BN's β term subsumes any constant offset.
10BatchNorm2d(512), ReLU(True)

Stabilise activation statistics across the batch (BN), then apply pointwise non-linearity. inplace=True saves memory by writing the result back into the input buffer instead of allocating a new tensor.

EXECUTION STATE
📚 BatchNorm2d = Per-channel running mean/var normalisation plus learnable γ, β. For a (N, C, H, W) tensor it normalises over the (N, H, W) axes, leaving C statistics. Crucial for GAN training stability.
📚 ReLU(True) = max(0, x) applied elementwise; True flips it to in-place mode. ReLU is used in the generator; the discriminator uses LeakyReLU. Both come straight from the DCGAN paper.
12ConvTranspose2d(ngf*8, ngf*4, 4, 2, 1, bias=False) — (4,4) → (8,8)

Second block. K=4, S=2, P=1 doubles spatial size (formula: (4−1)·2 − 2 + 3 + 0 + 1 = 8). K is divisible by S → uniform overlap → no checkerboard. This (K=4, S=2, P=1) is the canonical DCGAN upsample block.

EXECUTION STATE
→ params = 4·4·512·256 = 2,097,152 — the largest layer of the generator by parameter count.
13BatchNorm2d(256), ReLU(True)

Same pattern as line 10, with 256 channels.

15ConvTranspose2d(ngf*4, ngf*2, 4, 2, 1) — (8,8) → (16,16)

Third block. Same K=4, S=2, P=1 recipe. Halves channels (256 → 128), doubles spatial size. Output formula check: (8−1)·2 − 2 + 3 + 0 + 1 = 16. ✓

EXECUTION STATE
→ params = 4·4·256·128 = 524,288.
16BatchNorm2d(128), ReLU(True)

Same pattern. 128 channels.

18ConvTranspose2d(ngf*2, ngf, 4, 2, 1) — (16,16) → (32,32)

Fourth block. Output: (16−1)·2 − 2 + 3 + 0 + 1 = 32. ✓

EXECUTION STATE
→ params = 4·4·128·64 = 131,072.
19BatchNorm2d(64), ReLU(True)

Same pattern. 64 channels.

21ConvTranspose2d(ngf, nc, 4, 2, 1) — (32,32) → (64,64)

Final upsampling block. Output: (32−1)·2 − 2 + 3 + 0 + 1 = 64. ✓ Note: NO BatchNorm on the final layer — see line 22 for why.

EXECUTION STATE
→ params = 4·4·64·3 = 3,072 — the smallest layer.
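→ why no BN here? = The DCGAN paper reports that applying batchnorm to every layer caused sample oscillation and model instability, which the authors avoided by leaving it off the generator output layer and the discriminator input layer (DCGAN paper, §3 Architecture Guidelines).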
22nn.Tanh()

Squashes every output pixel into [−1, 1]. The training images are pre-scaled to the same range (using transforms.Normalize((0.5,)*3, (0.5,)*3)), so the generator and the data live in the same space. Using Sigmoid here would put outputs in [0, 1] — also valid, but the original DCGAN paper used Tanh because it gave them faster convergence.

24def forward(self, z)

Forward pass — just delegate to self.main, which is the entire Sequential stack.

EXECUTION STATE
⬇ input: z = Latent noise tensor. Shape (N, nz, 1, 1) — the trailing 1×1 is needed because nn.ConvTranspose2d expects 4-D (N, C, H, W) input.
⬆ returns = Image tensor, shape (N, 3, 64, 64), values in [−1, 1].
25return self.main(z)

Five ConvTranspose2d layers later, you have a 64×64 RGB image. Total parameters: 819,200 + 2,097,152 + 524,288 + 131,072 + 3,072 + (BN params) ≈ 3.6M. Tiny by 2024 standards, but enough to learn CIFAR-10, MNIST or LSUN bedrooms.

1import torch.nn as nn
2
3class DCGANGenerator(nn.Module):
4    """Radford, Metz & Chintala (2016), canonical 64x64 generator."""
5    def __init__(self, nz=100, ngf=64, nc=3):
6        super().__init__()
7        self.main = nn.Sequential(
8            # Latent z: (nz, 1, 1)  →  (ngf*8, 4, 4)
9            nn.ConvTranspose2d(nz, ngf*8, 4, 1, 0, bias=False),
10            nn.BatchNorm2d(ngf*8), nn.ReLU(True),
11            # (ngf*8, 4, 4)  →  (ngf*4, 8, 8)
12            nn.ConvTranspose2d(ngf*8, ngf*4, 4, 2, 1, bias=False),
13            nn.BatchNorm2d(ngf*4), nn.ReLU(True),
14            # (ngf*4, 8, 8)  →  (ngf*2, 16, 16)
15            nn.ConvTranspose2d(ngf*4, ngf*2, 4, 2, 1, bias=False),
16            nn.BatchNorm2d(ngf*2), nn.ReLU(True),
17            # (ngf*2, 16, 16)  →  (ngf, 32, 32)
18            nn.ConvTranspose2d(ngf*2, ngf,   4, 2, 1, bias=False),
19            nn.BatchNorm2d(ngf),   nn.ReLU(True),
20            # (ngf, 32, 32)  →  (nc, 64, 64)
21            nn.ConvTranspose2d(ngf,  nc,     4, 2, 1, bias=False),
22            nn.Tanh(),
23        )
24    def forward(self, z):
25        return self.main(z)

Why K=4, S=2 in DCGAN?

Exactly the diagnostic rule we derived: with $K=4$ divisible by $S=2$, the overlap is uniform and there is no 1-2-4 pattern. DCGAN's kernel choice was empirical (Radford et al. arXiv:1511.06434, Nov 2015); Odena, Dumoulin & Olah's Distill article (Oct 2016) [2] gave it its theoretical justification roughly eleven months later.
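The per-layer shape checks and parameter counts from the walkthrough can be reproduced with the output-size formula alone. The sketch below is plain Python with a hypothetical helper name (`conv_transpose_out`); the layer tuples assume the default nz=100, ngf=64, nc=3 and bias=False, as in the listing.

```python
def conv_transpose_out(I, K, S=1, P=0, D=1, output_padding=0):
    """O = (I - 1)*S - 2*P + D*(K - 1) + output_padding + 1"""
    return (I - 1) * S - 2 * P + D * (K - 1) + output_padding + 1

# (C_in, C_out, K, S, P) for the five ConvTranspose2d blocks
layers = [
    (100, 512, 4, 1, 0),
    (512, 256, 4, 2, 1),
    (256, 128, 4, 2, 1),
    (128,  64, 4, 2, 1),
    ( 64,   3, 4, 2, 1),
]

size, sizes, weights = 1, [], []     # the latent z starts as a 1x1 map
for C_in, C_out, K, S, P in layers:
    size = conv_transpose_out(size, K, S, P)
    sizes.append(size)
    weights.append(K * K * C_in * C_out)   # no bias term

print(sizes)          # [4, 8, 16, 32, 64]
print(weights)        # [819200, 2097152, 524288, 131072, 3072]
print(sum(weights))   # 3574784
```

Adding the BatchNorm scale and shift parameters, 2·(512 + 256 + 128 + 64) = 1,920, gives the roughly 3.6M total quoted above.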

U-Net Decoder

Ronneberger, Fischer & Brox (2015) [Ref 7] introduced U-Net for biomedical segmentation. The encoder halves the spatial resolution stage-by-stage with max-pool (§10.5); the decoder doubles it back with transposed conv, concatenating the matching-resolution encoder feature map at every step via a skip connection.

U-Net decoder block — ConvTranspose2d(K=2, S=2) + skip-concat + two refinement convs
🐍unet_decoder_block.py
1import torch, torch.nn as nn

torch for tensor primitives (we'll use torch.cat to fuse with the skip connection); nn for the building blocks (ConvTranspose2d, Conv2d, ReLU, Sequential).

3class UNetUpBlock(nn.Module)

One stage of the U-Net decoder (Ronneberger, Fischer & Brox, 2015 [Ref 7]). Doubles the spatial resolution, fuses the matching encoder feature map via a skip connection, and refines with two 3×3 convs. Stack four such blocks to undo a typical encoder.

5def __init__(self, C_in, C_out)

C_in is the channel count of the deeper feature map (lower resolution, more channels). C_out is the channel count after upsampling — and also the number of channels in the matching skip connection.

EXECUTION STATE
⬇ input: C_in = e.g. 1024 in the deepest stage; 512 / 256 / 128 in shallower stages.
⬇ input: C_out = Halved channel count = the skip's channel count. Concatenation of (skip, up) along channels yields C_out·2 channels for the conv block to digest.
7self.up = ConvTranspose2d(C_in, C_out, K=2, S=2)

K=2, S=2 ConvTranspose2d. K is divisible by S (2 by 2) → checkerboard-safe by Odena et al.'s diagnostic. Output formula: (I−1)·2 − 0 + 1 + 0 + 1 = 2·I — exactly doubles spatial resolution.

EXECUTION STATE
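→ why this exact recipe? = Ronneberger et al. (2015) describe each decoder step as an upsampling followed by a 2×2 'up-convolution' that halves the number of feature channels; ConvTranspose2d(K=2, S=2) is the usual PyTorch rendering of that step. Because the stride equals the kernel size, the scattered kernel copies tile the output without overlapping. Modern reimplementations often replace this with nn.Upsample(scale_factor=2) + nn.Conv2d.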
→ params = K²·C_in·C_out + C_out (default bias=True). For C_in=1024, C_out=512: 2·2·1024·512 + 512 = 2,097,664.
8self.conv = nn.Sequential(...)

Two 3×3 convs with padding=1 (so spatial size is preserved) and ReLU between them. Their job is to refine the fused feature map after concatenation.

9Conv2d(C_out*2, C_out, 3, padding=1), ReLU(True)

First refinement conv. The input has 2·C_out channels because we will concatenate the upsampled tensor (C_out channels) with the skip feature map (C_out channels) along dim=1. Output: C_out channels — the conv 'mixes' the up-stream and the skip-stream.

EXECUTION STATE
→ why concat, not add? = Concatenation lets the network learn how to combine the two streams (it can choose to ignore one channel-wise). Adding (as in ResNet) forces them to share a representation. The original U-Net uses concatenation, and most segmentation reimplementations keep it.
10Conv2d(C_out, C_out, 3, padding=1), ReLU(True)

Second refinement conv. Input and output channels match (C_out). Same 3×3 + padding=1 → preserves spatial size. Two stacked 3×3 convs have an effective 5×5 receptive field — see §10.3.

13def forward(self, x, skip)

Two inputs this time: x is the lower-resolution decoder tensor (C_in channels), skip is the matching encoder feature map (C_out channels, double the spatial size of x).

EXECUTION STATE
⬇ input: x = Shape (N, C_in, H, W). The deeper, narrower feature map from the previous decoder stage (or from the bottleneck on the first call).
⬇ input: skip = Shape (N, C_out, 2H, 2W). The matching-resolution encoder feature map. Channel count is half of x's — the shape difference is exactly what self.up + concat is designed to bridge.
14x = self.up(x) — double spatial

Doubles the spatial size and halves the channel count. Output shape: (N, C_out, 2H, 2W).

EXECUTION STATE
→ numerical example = Input (1, 1024, 28, 28) → up → (1, 512, 56, 56). Same spatial size as the matching encoder skip.
15x = torch.cat([skip, x], dim=1)

Concatenate along the channel axis. The two tensors have identical (N, *, 2H, 2W) shape except for channels (C_out and C_out), giving an output with 2·C_out channels.

EXECUTION STATE
📚 torch.cat(tensors, dim) = Joins tensors along an existing axis. For a 4-D (N, C, H, W) tensor: dim=0 stacks batches, dim=1 stacks channels, dim=2/3 stack spatial. We want dim=1 (channels) here.
→ output shape = (N, 2·C_out, 2H, 2W). For our running example: (1, 1024, 56, 56).
→ why skip first? = Convention only — torch.cat([x, skip], dim=1) gives an identical capacity network with shuffled channel indices. The order matters only for matching pretrained state dicts.
16return self.conv(x)

Two 3×3 convs collapse 2·C_out → C_out → C_out, returning a (N, C_out, 2H, 2W) tensor. This becomes the input x for the next UNetUpBlock.

1import torch, torch.nn as nn
2
3class UNetUpBlock(nn.Module):
4    """One U-Net decoder stage: upsample then fuse with the matching skip."""
5    def __init__(self, C_in, C_out):
6        super().__init__()
7        self.up = nn.ConvTranspose2d(C_in, C_out, kernel_size=2, stride=2)
8        self.conv = nn.Sequential(
9            nn.Conv2d(C_out*2, C_out, 3, padding=1), nn.ReLU(True),
10            nn.Conv2d(C_out,   C_out, 3, padding=1), nn.ReLU(True),
11        )
12
13    def forward(self, x, skip):
14        x = self.up(x)                                   # double spatial
15        x = torch.cat([skip, x], dim=1)                  # concat along channels
16        return self.conv(x)

The K=2, S=2 transposed conv exactly doubles spatial resolution and is checkerboard-safe (2 is a multiple of 2). Modern U-Net variants often replace the transposed conv with nn.Upsample(scale_factor=2) + nn.Conv2d(...), which gives the same output shape with no overlap artifacts.

FCN for Semantic Segmentation

Long, Shelhamer & Darrell (2015) [Ref 6] introduced the first fully convolutional network for semantic segmentation. They used transposed conv to upsample the coarse class-score map from the last conv stage back to image resolution, initialising the kernel to bilinear interpolation and then letting the network fine-tune it. This initialisation trick is still used today when a decoder is trained from scratch on small data.

FCN bilinear-init upsampler — Long, Shelhamer & Darrell (2015)
🐍fcn_decoder.py
1import torch / import torch.nn as nn

torch for arange/zeros/abs; nn for ConvTranspose2d. The bilinear kernel itself is computed on the CPU — no autograd needed because we use it only as a deterministic initialiser.

4def bilinear_kernel(C_in, C_out, K) → torch.Tensor

Builds the canonical bilinear-interpolation kernel for a transposed conv, exactly as described in the supplementary material of Long, Shelhamer & Darrell (2015) [Ref 6]. The result has shape (C_in, C_out, K, K) — i.e. ConvTranspose2d's weight layout.

EXECUTION STATE
⬇ input: C_in / C_out = Number of input/output channels of the transposed conv. For semantic segmentation with 21 PASCAL VOC classes, C_in = C_out = 21.
⬇ input: K = Kernel size. For an S× upsampler, the standard recipe is K = 2·S (K=64 for S=32, K=8 for S=4, etc.).
⬆ returns = torch.Tensor of shape (C_in, C_out, K, K). Block-diagonal: identical bilinear filt2d on each diagonal channel pair, zero on cross-channel pairs.
6factor = (K + 1) // 2

Half-width of the support of the bilinear kernel. For K=64: factor = 65//2 = 32. For K=4: factor = 5//2 = 2.

EXECUTION STATE
→ why ceil division? = The bilinear kernel reaches out factor pixels from its centre. With (K+1)//2 the kernel exactly fills K pixels for both even and odd K.
7center = factor - 1 if K % 2 == 1 else factor - 0.5

Coordinate of the kernel's centre. Odd K places the centre at integer pixel (factor − 1); even K places it at the boundary between pixels (factor − 0.5). This is the standard convention in image-processing libraries.

EXECUTION STATE
→ K=64 (even) → center = 31.5 = Half-pixel offset, lying between pixels 31 and 32.
→ K=4 (even) → center = 1.5 = Half-pixel offset between pixels 1 and 2.
8og = torch.arange(K).float()

Coordinate axis. og = [0, 1, 2, …, K−1] as a float32 tensor.

EXECUTION STATE
📚 torch.arange(K) = Returns a 1-D tensor of integers [0, K). Equivalent to Python's range(K) but as a tensor. We call .float() because we'll subtract the half-pixel center next, which requires float arithmetic.
→ K=4 example = og = tensor([0., 1., 2., 3.]).
9filt1d = 1 - (og - center).abs() / factor

The 1-D bilinear weight: linearly decays from 1 at the centre to 0 at distance = factor. This is exactly the triangular ('tent') function used in classical bilinear interpolation.

EXECUTION STATE
→ K=4, factor=2, center=1.5 = |og − 1.5| = [1.5, 0.5, 0.5, 1.5]; divided by factor 2 → [0.75, 0.25, 0.25, 0.75]; 1 − that → [0.25, 0.75, 0.75, 0.25] = filt1d
10filt2d = filt1d.unsqueeze(0) * filt1d.unsqueeze(1)

Outer product of the 1-D kernel with itself → 2-D bilinear kernel (separable). unsqueeze(0) gives shape (1, K); unsqueeze(1) gives (K, 1); broadcasted multiplication gives (K, K).

EXECUTION STATE
📚 .unsqueeze(dim) = Inserts a size-1 axis at position dim. unsqueeze(0): (K,) → (1, K). unsqueeze(1): (K,) → (K, 1). The product (1,K) × (K,1) broadcasts to (K, K).
→ K=4 result (2-D bilinear) =
[[0.0625, 0.1875, 0.1875, 0.0625],
 [0.1875, 0.5625, 0.5625, 0.1875],
 [0.1875, 0.5625, 0.5625, 0.1875],
 [0.0625, 0.1875, 0.1875, 0.0625]]
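→ sum check = With stride S = K/2, the kernel taps that land on any one output pixel (every S-th tap along each axis) sum to 1, e.g. 0.25 + 0.75 per axis for K=4. A constant input therefore maps to the same constant in the interior, which preserves overall image intensity.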
11weight = torch.zeros(C_in, C_out, K, K)

Allocate the full ConvTranspose2d weight tensor. Note the channel ordering: ConvTranspose2d stores its weight as (C_in, C_out, kH, kW) — the OPPOSITE of nn.Conv2d. (See the Warning earlier in this section.)

EXECUTION STATE
📚 torch.zeros(*sizes) = All-zero tensor of the requested shape. We start at zero and only fill the diagonal channel pairs in the next step.
12for c in range(min(C_in, C_out)):

Iterate over the diagonal channel index. We treat the upsampler as 'apply bilinear filter independently to each class score channel' — so cross-channel weights are zero.

LOOP TRACE · 21 iterations (first two and last shown)
c=0 → weight[0, 0] = filt2d
weight[0, 0] = Set to the (K, K) bilinear kernel.
c=1 → weight[1, 1] = filt2d
weight[1, 1] = Same kernel on channel 1.
...
c=20 → weight[20, 20] = filt2d
weight[20, 20] = Last PASCAL VOC class — final diagonal entry.
13weight[c, c] = filt2d

Copy the (K, K) bilinear kernel onto the (c, c) channel-pair slot. All off-diagonal slots stay zero — so the upsampler treats each class score map independently.

14return weight

Returns the (21, 21, 64, 64) tensor with bilinear kernels on the diagonal. With these weights frozen, the layer closely matches F.interpolate(mode='bilinear') away from the image borders; because we copy them into a learnable nn.Parameter, the network can also fine-tune them during training.

17upsample = nn.ConvTranspose2d(21, 21, K=64, S=32, P=16, bias=False)

The 32× upsampling layer used at the very end of FCN-32s (Long et al. 2015 [Ref 6]). Output formula: (I−1)·32 − 32 + 63 + 0 + 1 = 32·I — exactly 32× upsample. K=64 is divisible by S=32 → checkerboard-safe.

EXECUTION STATE
→ why K = 2·S? = A bilinear (tent) kernel for S× upsampling with even S has a support of 2·S taps, so K = 2·S is exactly the size needed to hold it. With K=64 and S=32 the kernel spans two output super-pixels along each axis.
→ params = K²·C_in·C_out = 64·64·21·21 = 1,806,336 weights, no bias. Most of these stay close to their bilinear initialisation throughout training.
19upsample.weight.data.copy_(bilinear_kernel(21, 21, 64))

Replace the random Kaiming-uniform init with the bilinear init we just built. After this line the layer behaves, away from the borders, like F.interpolate(mode='bilinear', scale_factor=32) but is fully learnable. This is the FCN bilinear-init trick: a strong inductive bias plus the freedom to deviate as training demands.

EXECUTION STATE
📚 .copy_(other) = In-place copy of 'other' into the tensor. The trailing underscore is PyTorch's convention for in-place ops. Used here because we want to overwrite the layer's existing parameter buffer without breaking gradient tracking.
→ .data attribute = Direct access to the underlying tensor, bypassing autograd. Modifying .data is safe outside of training; modifying .weight inside a training step would break grad flow.
1import torch
2import torch.nn as nn
3
4def bilinear_kernel(C_in, C_out, K):
5    """Bilinear-interpolation kernel (Long et al. 2015, supplementary)."""
6    factor = (K + 1) // 2
7    center = factor - 1 if K % 2 == 1 else factor - 0.5
8    og = torch.arange(K).float()
9    filt1d = 1 - (og - center).abs() / factor
10    filt2d = filt1d.unsqueeze(0) * filt1d.unsqueeze(1)
11    weight = torch.zeros(C_in, C_out, K, K)
12    for c in range(min(C_in, C_out)):
13        weight[c, c] = filt2d
14    return weight
15
16# 32x upsample at the end of FCN-32s
17upsample = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32,
18                              padding=16, bias=False)
19upsample.weight.data.copy_(bilinear_kernel(21, 21, 64))
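A small 1-D experiment shows what the bilinear initialisation does. The sketch below is a NumPy illustration under stated assumptions: the helper names are made up, the example uses S=2 and K=4 instead of FCN's S=32 and K=64, and padding is applied by cropping the full scatter-add output, mirroring the role of P in the output formula.

```python
import numpy as np

def bilinear_kernel_1d(K):
    """1-D tent kernel, same recipe as filt1d in the listing above."""
    factor = (K + 1) // 2
    center = factor - 1 if K % 2 == 1 else factor - 0.5
    return 1 - np.abs(np.arange(K) - center) / factor

def conv_transpose_1d(x, w, stride, padding):
    """Scatter-add transposed conv, then crop `padding` samples from each end."""
    K = len(w)
    full = np.zeros((len(x) - 1) * stride + K)
    for i, xi in enumerate(x):
        full[i * stride : i * stride + K] += xi * w
    return full[padding : len(full) - padding]

S = 2
filt = bilinear_kernel_1d(2 * S)
print(filt)                                   # [0.25 0.75 0.75 0.25]

# Taps that land on the same output phase sum to 1.
phase_sums = [filt[p::S].sum() for p in range(S)]
print(phase_sums)

# Upsampling a linear ramp gives a linear ramp with half the step size.
ramp = np.arange(6, dtype=float)
up = conv_transpose_1d(ramp, filt, stride=S, padding=S // 2)
print(len(up))                                # 12
print(up[1:-1])
# [0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75]
```

The first and last output samples fall outside this pattern because only one kernel copy reaches them. That is the border effect that separates this layer from a true bilinear resize.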

Design Patterns — Pick the Right Upsampler

| Architecture | Upsampler used | Checkerboard-safe? | Reference |
|---|---|---|---|
| DCGAN (2016) | ConvTranspose2d(K=4, S=2) | Yes (K divisible by S) | Radford et al. [3] |
| StyleGAN2 (2020) | Upsample (bilinear) + modulated conv | Yes | Karras et al. |
| BigGAN (2019) | Upsample (nearest) + conv | Yes | Brock et al. |
| U-Net (2015) | ConvTranspose2d(K=2, S=2) | Yes | Ronneberger et al. [7] |
| FCN-8s (2015) | ConvTranspose2d with bilinear init | Yes after fine-tuning | Long et al. [6] |
| ESPCN / EDSR (2016+) | Conv + PixelShuffle | Yes by construction | Shi et al. [4] |
| Deconvnet (2015) | ConvTranspose2d(K=3, S=2) | No (classic checkerboard) | Noh et al. [8] |
| Autoencoder tutorials | ConvTranspose2d(K=3, S=2) | No (common pedagogical bug) | (widespread) |

Quick Check

You are designing a 2× upsampling block with a 3×3 kernel. Which option avoids the checkerboard artifact with the least change to downstream param counts?


Summary

| Knob | Effect on output size | Artifact risk | Parameter cost |
|---|---|---|---|
| Stride S | Multiplies by ~S | Checkerboard if S does not divide K | None |
| Padding P | CROPS 2P pixels | None | None |
| Kernel K | Adds (K-1) | Safe iff K is multiple of S | Scales as K² · C_in · C_out |
| Dilation D | Adds D(K-1) − (K-1) | None | None |
| output_padding | Adds 0..S-1 on right/bottom | None | None |

Commit these to memory

  1. Output formula: $O = (I-1)S - 2P + D(K-1) + \text{output\_padding} + 1$
  2. Matrix identity: transposed conv $= W^{\top} \mathbf{x}$ where $W$ is the forward im2col matrix. NOT the inverse of convolution.
  3. Checkerboard diagnostic: stride-S transposed conv is artifact-free iff kernel size $K$ is a multiple of $S$. Prefer K=2,S=2 or K=4,S=2.
  4. The two modern replacements: nearest-upsample + conv (Odena [2]) and conv + pixel-shuffle (Shi [4]). Both give the same output shape with no overlap artifacts; upsample + conv also keeps the parameter count, while conv + pixel-shuffle needs $r^2$ times as many conv output channels.
  5. Weight-shape gotcha: ConvTranspose2d stores weights as $(C_\text{in}, C_\text{out}, kH, kW)$, opposite to Conv2d.
  6. Duality: backward of ConvTranspose2d = forward of Conv2d, and vice versa.

Exercises

Conceptual

  1. Compute by hand the output spatial size of nn.ConvTranspose2d(32, 64, kernel_size=4, stride=2, padding=1) applied to a 14×14 feature map.
  2. Compare the output sizes of transposed conv and forward conv for $K = 2, S = 1, P = 0$. By how much do they differ, and why? (Hint: plug into both formulas.)
  3. Explain why transposed conv is NOT the inverse of convolution. Construct a small example where $W^{\top} W \mathbf{x} \neq \mathbf{x}$.
  4. If forward conv(K=3, S=2, P=1) maps inputs 7 and 8 both to 4, which value of output_padding recovers each?
  5. Why do DCGAN generators use K=4, S=2 instead of K=3, S=2? Answer in one sentence referencing Odena et al. (2016) [2].

Hints

  1. $O = (14-1)\cdot 2 - 2 + 3 + 0 + 1 = 28$. Exactly 2× upsample.
  2. Forward: O = (I-K+2P)/S + 1 = I - 1. Transposed: O = (I-1)·1 + 2 = I + 1. They differ by 2. With S = 1 the two sizes coincide only when 2P = K - 1 (for example K = 3, P = 1), which has no integer solution for K = 2.
  3. Pick any non-injective forward conv (most are), e.g. a 2×2 avg kernel. W·x loses information; W.T·(W·x) cannot recover x.
  4. output_padding=0 → 7, output_padding=1 → 8.
  5. K=4 is a multiple of S=2, so the scattered-kernel overlaps are uniform and no checkerboard appears.
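
The arithmetic in the hints can be confirmed with a few lines of Python. This is a minimal sketch, assuming NumPy is available; `conv_out` and `conv_transpose_out` are hypothetical helper names, and the 1-D, 2-tap averaging kernel stands in for the 2×2 average mentioned in the third hint.

```python
import numpy as np


def conv_out(i, k, s=1, p=0, d=1):
    """Forward conv output size (floor division)."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1


def conv_transpose_out(i, k, s=1, p=0, d=1, output_padding=0):
    """Transposed conv output size from the section's formula."""
    return (i - 1) * s - 2 * p + d * (k - 1) + output_padding + 1


# Exercise 1: K=4, S=2, P=1 doubles a 14-pixel side.
print(conv_transpose_out(14, k=4, s=2, p=1))            # 28

# Exercise 2: K=2, S=1, P=0 gives I-1 forward and I+1 transposed (I = 10 here).
print(conv_out(10, k=2), conv_transpose_out(10, k=2))   # 9 11

# Exercise 4: 7 and 8 both map to 4; output_padding selects which one comes back.
print(conv_out(7, k=3, s=2, p=1), conv_out(8, k=3, s=2, p=1))   # 4 4
print([conv_transpose_out(4, k=3, s=2, p=1, output_padding=op) for op in (0, 1)])  # [7, 8]

# Exercise 3: a 1-D, 2-tap averaging conv written as a 3 x 4 matrix W.
W = np.zeros((3, 4))
for row in range(3):
    W[row, row : row + 2] = 0.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = W @ x           # forward conv: values 1.5, 2.5, 3.5
x_back = W.T @ y    # transposed conv: values 0.75, 2.0, 3.0, 1.75
print(x_back.shape == x.shape, np.allclose(x_back, x))  # True False
```

The last line makes the point of the third exercise concrete: the transpose restores the shape of `x` but not its values.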

Coding

  1. Extend the scatter-add conv_transpose2d to handle multi-channel inputs and outputs. Verify against F.conv_transpose2d on 10 random configurations.
  2. Reproduce Figure 3 from Odena et al. (2016) [2]: plot the overlap-count heatmap for (K, S) pairs in $\{2,3,4\} \times \{1,2,3\}$ and verify that uniformity occurs iff K is a multiple of S.
  3. Train two tiny image autoencoders on MNIST: one using stacked ConvTranspose2d(K=3, S=2) in the decoder, the other using Upsample + Conv2d. Compare reconstruction quality and plot a few samples side by side. Do you see checkerboards?
  4. Implement nn.PixelShuffle(r) from scratch using tensor.reshape and tensor.permute. Verify against the built-in module.

Challenge

Reproduce the Odena et al. (2016) [2] experiment. Train a small DCGAN on CIFAR-10 for 20 epochs twice: once with ConvTranspose2d(K=3, S=2) layers and once with Upsample + Conv2d. Visualise a batch of generated samples from both models and measure the FID score. The original article argues qualitatively from generated samples and predates the FID metric (introduced in 2017), so there is no published FID figure to match; treat the FID comparison as your own measurement. Reproducing even a qualitative version of this result is the clearest possible demonstration of the theoretical argument.


References

  1. Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. (The canonical derivation of the transposed conv output formula, §4. Figure 4.1 is the 2×2 → 4×4 example we hand-computed.)
  2. Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and Checkerboard Artifacts. Distill. https://distill.pub/2016/deconv-checkerboard/ (Diagnoses the checkerboard, proposes upsample-then-conv.)
  3. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN). ICLR. arXiv:1511.06434.
  4. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., & Wang, Z. (2016). Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (ESPCN). CVPR. arXiv:1609.05158. (Introduces pixel shuffle.)
  5. Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional Networks. CVPR. (Early use of “deconvolution” for reconstructing inputs from feature maps; source of the misleading name.)
  6. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR. arXiv:1411.4038. (First large-scale use of transposed conv in vision. Introduces bilinear-init trick for the upsample layer.)
  7. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI. arXiv:1505.04597. (Encoder pool + decoder transposed-conv + skip connections.)
  8. Noh, H., Hong, S., & Han, B. (2015). Learning Deconvolution Network for Semantic Segmentation. ICCV. arXiv:1505.04366. (The “Deconvnet” architecture — representative of the checkerboard-prone K=3, S=2 stack.)
  9. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, §9.5 “Variants of the Basic Convolution Function”. MIT Press. https://www.deeplearningbook.org/ (Textbook treatment of transposed conv and its role as the adjoint of ordinary conv.)
  10. PyTorch documentation, torch.nn.ConvTranspose2d, torch.nn.functional.conv_transpose2d, torch.nn.Upsample, torch.nn.PixelShuffle. https://pytorch.org/docs/stable/nn.html (Authoritative reference for output_padding semantics and the weight-shape ordering.)
  11. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2). CVPR. arXiv:1912.04958. (Moves from transposed-conv to bilinear-upsample-then-conv to eliminate residual artifacts.)

This concludes Chapter 10. You now have a complete, mathematically grounded view of the convolution operator and its entire ecosystem — the forward pass, the parameters, the efficient implementation, pooling, and transposed convolution. Chapter 11 builds on these foundations to walk through the architectural lineage that shaped modern computer vision: LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet.
