Chapter 13

The Convolution Operation Explained

Understanding Convolutions

Cross-Correlation vs Convolution

In signal-processing textbooks, convolution flips the kernel before sliding. In deep-learning libraries it does not. Both operations are called “convolution” — which is a recipe for confusion the first time you read a paper.

The mathematical (flipped) version

True convolution rotates the kernel 180° (i.e. flips it both horizontally and vertically) and then slides:

$$(I * K)[i, j] \;=\; \sum_{m}\sum_{n} I[i-m,\, j-n] \cdot K[m, n]$$

The flip is what gives convolution the property of commutativity: $f * g = g * f$. It also makes convolution play nicely with the Fourier transform: multiplication in frequency becomes convolution in time/space.
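The commutativity claim is easy to check numerically. The sketch below uses two made-up 1-D signals (the values are illustrative assumptions, not from this page) with NumPy's `np.convolve` (flipped) and `np.correlate` (unflipped):

```python
import numpy as np

# Two short, deliberately asymmetric 1-D signals (illustrative values).
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

# np.convolve implements true (flipped) convolution; mode="full" is the default.
conv_fg = np.convolve(f, g)
conv_gf = np.convolve(g, f)

# np.correlate implements the unflipped sliding dot product.
corr_fg = np.correlate(f, g, mode="full")
corr_gf = np.correlate(g, f, mode="full")

print("f * g:", conv_fg)
print("g * f:", conv_gf)
print("convolution commutes:", np.array_equal(conv_fg, conv_gf))
print("correlation commutes:", np.array_equal(corr_fg, corr_gf))
```

Swapping the arguments leaves the convolution unchanged, while the cross-correlation comes out reversed.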

The deep-learning (unflipped) version

Deep-learning frameworks skip the flip. The operation they implement is cross-correlation:

$$(I \star K)[i, j] \;=\; \sum_{m}\sum_{n} I[i+m,\, j+n] \cdot K[m, n]$$

PyTorch, TensorFlow, JAX, MXNet, Caffe — all of them call this “convolution.” The reason it does not matter in practice is that the kernel weights are learned: whatever flipping we skip, gradient descent absorbs into the weights. A kernel that would detect edges as a cross-correlation can be re-expressed as a (different) kernel that detects the same edges as a true convolution.
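The "re-expressed as a different kernel" point can be sketched directly: cross-correlating with a kernel gives the same result as truly convolving with that kernel rotated 180°. The helper names below are hypothetical, the inputs are random, and the true-convolution helper shifts the output index by (kH − 1, kW − 1) so that every `I[i−m, j−n]` lookup stays inside the image ("valid" mode).

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Unflipped sliding dot product (what DL libraries call convolution)."""
    kH, kW = kernel.shape
    out_H, out_W = image.shape[0] - kH + 1, image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def true_convolve2d(image, kernel):
    """Flipped version: kernel index (m, n) pairs with image index (i - m, j - n),
    offset by (kH - 1, kW - 1) to stay in bounds."""
    kH, kW = kernel.shape
    out_H, out_W = image.shape[0] - kH + 1, image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            for m in range(kH):
                for n in range(kW):
                    out[i, j] += image[i + kH - 1 - m, j + kW - 1 - n] * kernel[m, n]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
kernel = rng.normal(size=(3, 3))
flipped = np.flip(kernel)  # flip both axes = rotate by 180 degrees

same = np.allclose(cross_correlate2d(image, kernel), true_convolve2d(image, flipped))
print("cross-correlation with K == true convolution with flipped K:", same)
```

A learned kernel and its flipped twin are equally reachable by gradient descent, which is why the convention does not affect training.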

Terminology alert

When a deep-learning paper or library says “convolution,” assume cross-correlation unless stated otherwise. In signal-processing literature the opposite convention holds, so the only safe move is to check the formula.

| Aspect | True convolution | Cross-correlation (DL) |
| --- | --- | --- |
| Kernel flipped? | Yes (rotate 180°) | No |
| Commutative? | Yes: f * g = g * f | No |
| Where used | Signal processing, math, DSP | Deep learning (PyTorch, TF, …) |
| Matters for training? | No, weights adapt | No, weights adapt |
| Speedup via FFT? | Yes (convolution theorem) | Yes (after a flip) |
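The FFT row of the table can be illustrated in 1-D. This is a minimal sketch with made-up signals, not a performance benchmark: multiplying zero-padded spectra reproduces `np.convolve`, and reversing the kernel first reproduces `np.correlate`.

```python
import numpy as np

# Illustrative signals (assumed values).
f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])

n = len(f) + len(g) - 1  # length of the full linear convolution

# Convolution theorem: multiply spectra, then transform back.
# rfft(x, n) zero-pads x to length n before transforming.
conv_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)
conv_direct = np.convolve(f, g)

# Cross-correlation via the same route: reverse g first ("after a flip").
corr_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g[::-1], n), n)
corr_direct = np.correlate(f, g, mode="full")

print(np.allclose(conv_fft, conv_direct))
print(np.allclose(corr_fft, corr_direct))
```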

Practical shortcut

Treat every CNN you meet as “slide an unflipped kernel, compute a dot product.” The flip is a historical footnote unless you are reading a signal-processing paper.

Step-by-Step Computation (NumPy)

We will do one 2-D convolution end to end — by hand, then in NumPy, then in PyTorch — so that every number you see in the output is traceable to an explicit pair of window and kernel values.

Input: a 5×5 gradient image

📝input.txt
```text
I = [[10, 20, 30, 40, 50],
     [20, 40, 60, 80, 100],
     [30, 60, 90, 120, 150],
     [40, 80, 120, 160, 200],
     [50, 100, 150, 200, 250]]
```

Kernel: 3×3 Sobel-X (vertical-edge detector)

📝kernel.txt
```text
K = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]
```

One position by hand: output[0, 0]

The top-left window is the 3×3 block $I[0:3,\, 0:3]$:

📝output_00.txt
```text
Window I[0:3, 0:3]:     Kernel K:            Element-wise product:
[[10, 20, 30],          [[-1, 0, 1],         [[-10,   0,  30],
 [20, 40, 60],     ×     [-2, 0, 2],    =     [-40,   0, 120],
 [30, 60, 90]]           [-1, 0, 1]]          [-30,   0,  90]]

Sum = -10 + 0 + 30 - 40 + 0 + 120 - 30 + 0 + 90 = 160

→ output[0, 0] = 160
```

All nine positions — the interactive trace

The code below runs the full loop. Each numbered line is annotated with what it does, and the loop lines (27, 28, 29, 30) come with per-iteration traces showing the window, the element-wise product, and the final sum at the 9 output positions.

2-D convolution from scratch (NumPy, fully traced)
🐍conv2d_numpy.py
Line 1: import numpy as np

NumPy gives us ndarray, slicing, element-wise arithmetic, and np.sum — all the ingredients of a hand-rolled convolution. The alias np is universal.

Line 4: I = np.array([...], dtype=float)

Build the 5×5 input image. Every row increases by a fixed step, but the step itself differs between rows (+10 in row 0, +20 in row 1, …). That asymmetry is what makes the Sobel-X response below non-constant — a fact the previous version of this page got wrong.

EXECUTION STATE
📚 np.array(data, dtype) = Creates an ndarray from a Python (nested) list. dtype=float forces float64, so downstream arithmetic never silently truncates.
⬇ data = Nested list of 5 rows × 5 columns.
⬇ dtype=float = Force float64 — Sobel produces signed values and we want negatives preserved.
⬆ I (5×5) =
     c0   c1   c2   c3   c4
r0   10   20   30   40   50
r1   20   40   60   80  100
r2   30   60   90  120  150
r3   40   80  120  160  200
r4   50  100  150  200  250
Line 13: K = np.array([[-1,0,1],[-2,0,2],[-1,0,1]], dtype=float)

The Sobel-X kernel. It subtracts the left column of any 3×3 patch from the right column (the middle column contributes 0), with the centre row weighted twice. Large |output| at a pixel ⇒ strong horizontal intensity change at that pixel ⇒ a vertical edge.

EXECUTION STATE
⬆ K (3×3) =
     c0  c1  c2
r0   -1   0   1
r1   -2   0   2
r2   -1   0   1
→ row weights = top & bottom rows × 1, middle row × 2 — this is a smoothed horizontal gradient, which is why it is less noisy than a plain [-1, 0, 1] filter.
Line 19: def conv2d_manual(image, kernel) → np.ndarray

Our from-scratch 2-D cross-correlation. It accepts an image and a kernel of arbitrary size, walks every valid position, and returns the output feature map. Assumes stride=1, padding=0 — we will generalise in Section 3.

EXECUTION STATE
⬇ input: image (2-D) = The input tensor. In this call, the 5×5 matrix I above.
⬇ input: kernel (2-D) = The filter. In this call, the 3×3 Sobel-X.
⬆ returns = np.ndarray of shape (H−kH+1, W−kW+1). For 5×5 input and 3×3 kernel: (3, 3).
Line 20: H, W = image.shape

Python tuple unpacking. .shape returns (rows, cols); we name them H (height) and W (width) because these are images.

EXECUTION STATE
📚 arr.shape = ndarray attribute: tuple of dimension sizes.
⬆ H, W = H = 5, W = 5
Line 21: kH, kW = kernel.shape

Same trick for the kernel. Naming them kH, kW (kernel H, kernel W) avoids confusion with the input dimensions.

EXECUTION STATE
⬆ kH, kW = kH = 3, kW = 3
Line 22: out_H = H - kH + 1

Output height for stride=1, no padding. Derivation: the kernel's top-left corner must land at row 0, 1, …, H−kH inclusive → H − kH + 1 valid rows.

EXECUTION STATE
out_H = 5 − 3 + 1 = 3
Line 23: out_W = W - kW + 1

Output width, same argument along the other axis.

EXECUTION STATE
out_W = 5 − 3 + 1 = 3
Line 25: output = np.zeros((out_H, out_W))

Pre-allocate the output matrix with zeros. Pre-allocation is faster than growing a list and converting at the end, especially for large outputs.

EXECUTION STATE
📚 np.zeros(shape) = Create an ndarray of the given shape filled with 0.0 (float64 by default).
⬆ output (3×3) =
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Line 27: for i in range(out_H):

Outer loop — iterate over output rows. For each i, the kernel's top edge will sit at row i of the input.

LOOP TRACE · 3 iterations
i = 0
kernel top row = aligns with image row 0
i = 1
kernel top row = aligns with image row 1
i = 2
kernel top row = aligns with image row 2
Line 28: for j in range(out_W):

Inner loop — iterate over output columns. For each (i, j) we do one multiply-and-sum.

Line 29: window = image[i:i+kH, j:j+kW]

Slice out the kH × kW patch currently under the kernel. NumPy slicing uses half-open intervals: [i : i+kH] gives rows i, i+1, …, i+kH−1. Slicing does NOT copy — it creates a view, which is fast.

EXECUTION STATE
📚 arr[a:b, c:d] = 2-D slice. Rows a through b−1, columns c through d−1. Returns a VIEW into the original array (shares memory).
LOOP TRACE · 9 iterations
(i, j) = (0, 0)
window = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]
(0, 1)
window = [[20, 30, 40], [40, 60, 80], [60, 90,120]]
(0, 2)
window = [[30, 40, 50], [60, 80,100], [90,120,150]]
(1, 0)
window = [[20, 40, 60], [30, 60, 90], [40, 80,120]]
(1, 1)
window = [[40, 60, 80], [60, 90,120], [80,120,160]]
(1, 2)
window = [[60, 80,100], [90,120,150], [120,160,200]]
(2, 0)
window = [[30, 60, 90], [40, 80,120], [50,100,150]]
(2, 1)
window = [[60, 90,120], [80,120,160], [100,150,200]]
(2, 2)
window = [[90,120,150], [120,160,200], [150,200,250]]
Line 30: output[i, j] = np.sum(window * kernel)

The heart of 2-D convolution. window * kernel is element-wise (9 products); np.sum collapses them to a scalar; the scalar is written to output[i, j].

EXECUTION STATE
window * kernel = element-wise multiply of two 3×3 matrices, giving 9 individual products.
📚 np.sum(arr) = Adds every element of arr. For a 3×3 matrix that is 9 values (8 additions) reduced to a single float.
LOOP TRACE · 9 iterations
(0, 0)
element-wise product = [[-10, 0, 30], [-40, 0,120], [-30, 0, 90]]
sum = −10+30 −40+120 −30+90 = 160.0
(0, 1)
sum = 160.0
(0, 2)
sum = 160.0
(1, 0)
element-wise product = [[-20, 0, 60], [-60, 0,180], [-40, 0,120]]
sum = −20+60 −60+180 −40+120 = 240.0
(1, 1)
sum = 240.0
(1, 2)
sum = 240.0
(2, 0)
element-wise product = [[-30, 0, 90], [-80, 0,240], [-50, 0,150]]
sum = −30+90 −80+240 −50+150 = 320.0
(2, 1)
sum = 320.0
(2, 2)
sum = 320.0
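Line 32: return output

Return the fully-populated output matrix. Note the pattern: top row 160, middle row 240, bottom row 320. For this image each value is 8 × the per-column step of the window's middle image row (8 × 20, 8 × 30, 8 × 40): Sobel-X differences pixels two columns apart, and its row weights 1, 2, 1 sum to 4.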

EXECUTION STATE
⬆ output (3×3) =
[[160. 160. 160.]
 [240. 240. 240.]
 [320. 320. 320.]]
Line 34: out = conv2d_manual(I, K)

Call our function on the 5×5 image and the Sobel-X kernel. All 9 output positions are filled in by the two nested loops.

EXECUTION STATE
⬆ out.shape = (3, 3)
Line 35: print("Output shape:", out.shape)

Sanity check that the output really is (3, 3) as predicted by H − kH + 1.

Line 36: print(out)

Display the full 3×3 output. Observe the per-row differences: Sobel-X responds to the LOCAL horizontal step, which in this image grows linearly with row index.

EXECUTION STATE
⬆ final output =
[[160. 160. 160.]
 [240. 240. 240.]
 [320. 320. 320.]]
```python
import numpy as np

# 5×5 input image: a "smooth diagonal gradient"
I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250],
], dtype=float)

# 3×3 Sobel-X kernel: detects vertical edges (horizontal intensity change)
K = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

def conv2d_manual(image, kernel):
    H, W   = image.shape          # input spatial size
    kH, kW = kernel.shape         # kernel spatial size
    out_H  = H - kH + 1           # output height (stride=1, padding=0)
    out_W  = W - kW + 1           # output width

    output = np.zeros((out_H, out_W))

    for i in range(out_H):
        for j in range(out_W):
            window = image[i : i + kH, j : j + kW]   # kH × kW patch
            output[i, j] = np.sum(window * kernel)    # dot product → scalar

    return output

out = conv2d_manual(I, K)
print("Output shape:", out.shape)
print(out)
# Expected (3×3):
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]
```

Why is the output NOT constant?

Row 0 of $I$ advances by 10 per column, row 1 by 20, row 2 by 30, and so on. Sobel-X measures exactly this per-column rate of change, blended over the three image rows under the window with weights 1, 2, 1. Writing $s_r$ for the step of image row $r$, the response at output row $i$ is $2\,(s_i + 2 s_{i+1} + s_{i+2})$: $2(10 + 40 + 30) = 160$, $2(20 + 60 + 40) = 240$, $2(30 + 80 + 50) = 320$. So the output is flat across each row but different between rows, not uniform. A common misconception (and an error in an earlier version of this page) was to claim the output is 160 everywhere.
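The row-by-row argument can be checked without running any convolution. The sketch below rebuilds the 5×5 image as an outer product (an observation about this particular image, not a general property) and predicts each output row from the per-row steps alone:

```python
import numpy as np

# The 5×5 image satisfies I[r, c] = 10 * (r + 1) * (c + 1) for 0-indexed r, c.
I = 10.0 * np.outer(np.arange(1, 6), np.arange(1, 6))

steps = I[:, 1] - I[:, 0]            # per-column step of each row: 10, 20, 30, 40, 50
smooth = np.array([1.0, 2.0, 1.0])   # Sobel-X row weights (top, middle, bottom)

# Each kernel row computes (right - left) = 2 * step; the three rows are blended 1-2-1.
predicted = np.array([2.0 * np.dot(smooth, steps[i:i + 3]) for i in range(3)])
print(predicted)   # [160. 240. 320.]
```

Because every row of the image is a straight ramp, the prediction does not depend on the column, which is why each output row is constant.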

Quick Check

At position (i, j) = (1, 1), which slice of the input does the 3×3 kernel cover?


The Same Computation in PyTorch

We now replay the same 2-D convolution with torch.nn.functional.conv2d. Every number should match the NumPy version to the last digit. The point is to see how the hand-rolled loop maps onto a production framework call.

2-D convolution with PyTorch F.conv2d
🐍conv2d_torch.py
Line 1: import torch

Top-level PyTorch package. Provides Tensor, autograd, CUDA dispatch, and the nn module.

Line 2: import torch.nn.functional as F

Pure-function versions of neural-net ops. F.conv2d here is the stateless counterpart of nn.Conv2d — same math, but weights are passed in as an argument instead of being stored inside the module.

Line 6: I = torch.tensor([...]).unsqueeze(0).unsqueeze(0)

Build the same 5×5 image as a PyTorch tensor. F.conv2d requires a 4-D input (batch, channels, H, W), so we insert two length-1 axes at the front.

EXECUTION STATE
📚 torch.tensor(data) = Copy a (nested) list into a new tensor. Float literals (50.) give PyTorch's default floating-point dtype, float32.
📚 .unsqueeze(dim) = Insert a new axis of size 1 at position dim. First call: (5,5) → (1,5,5). Second call: (1,5,5) → (1,1,5,5).
→ why two unsqueeze? = F.conv2d demands (N, C_in, H, W). We have N=1 sample and C_in=1 channel. Two unsqueezes supply the two missing axes.
⬆ I.shape = torch.Size([1, 1, 5, 5])
Line 14: K = torch.tensor([...]).unsqueeze(0).unsqueeze(0)

Same reshape trick for the kernel. F.conv2d needs weights of shape (C_out, C_in, kH, kW). We have 1 output channel × 1 input channel × 3×3 — hence two unsqueezes on a 2-D tensor.

EXECUTION STATE
⬆ K.shape = torch.Size([1, 1, 3, 3])
→ weight layout = (C_out, C_in, kH, kW). For a learnable conv layer this becomes nn.Parameter and gets gradients.
Line 20: out = F.conv2d(I, K, stride=1, padding=0)

Replaces the whole loop of the previous listing with a single call. Under the hood PyTorch dispatches to optimised backends (oneDNN, formerly MKL-DNN, on CPU; cuDNN on GPU) that choose among tuned algorithms such as im2col + GEMM. For realistic sizes these are typically orders of magnitude faster than the Python loop.

EXECUTION STATE
📚 F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1) = PyTorch's 2-D cross-correlation. Despite the name, it does NOT flip the kernel (that would be 'true' convolution).
⬇ arg 1: input = I = Shape (1, 1, 5, 5). The image(s) to slide over.
⬇ arg 2: weight = K = Shape (1, 1, 3, 3). The filter bank.
⬇ arg 3: stride = 1 = Advance the window one pixel at a time in each spatial direction. stride=2 would halve the output resolution — we explore this in Section 3.
⬇ arg 4: padding = 0 = No zero border added → the output shrinks by (kH − 1) and (kW − 1). padding=1 with a 3×3 kernel would preserve spatial size.
⬆ out.shape = torch.Size([1, 1, 3, 3])
Line 22: print("Input shape:", I.shape)

Confirm the 4-D layout. If you see shape (5, 5) here you forgot both unsqueezes and F.conv2d will error.

EXECUTION STATE
→ prints = torch.Size([1, 1, 5, 5])
Line 23: print("Kernel shape:", K.shape)

Confirm the kernel shape (C_out, C_in, kH, kW).

EXECUTION STATE
→ prints = torch.Size([1, 1, 3, 3])
Line 24: print("Output shape:", out.shape)

(1, 1, 3, 3): batch and channel preserved, each spatial dimension reduced by kernel size − 1 (5 − 3 + 1 = 3).

EXECUTION STATE
→ prints = torch.Size([1, 1, 3, 3])
Line 25: print(out.squeeze())

Collapse the two length-1 axes to reveal the 3×3 output matrix. Notice the exact same pattern as the NumPy version — PyTorch and our hand-rolled loop agree to the last decimal.

EXECUTION STATE
📚 .squeeze() = Remove dimensions of size 1. (1, 1, 3, 3) → (3, 3).
⬆ printed tensor =
tensor([[160., 160., 160.],
        [240., 240., 240.],
        [320., 320., 320.]])
```python
import torch
import torch.nn.functional as F

# Same I and K as above, promoted to PyTorch tensors.
# F.conv2d expects input (N, C_in, H, W) and weight (C_out, C_in, kH, kW).
I = torch.tensor([
    [10., 20., 30., 40., 50.],
    [20., 40., 60., 80.,100.],
    [30., 60., 90.,120.,150.],
    [40., 80.,120.,160.,200.],
    [50.,100.,150.,200.,250.],
]).unsqueeze(0).unsqueeze(0)   # shape (1, 1, 5, 5)

K = torch.tensor([
    [-1., 0., 1.],
    [-2., 0., 2.],
    [-1., 0., 1.],
]).unsqueeze(0).unsqueeze(0)   # shape (1, 1, 3, 3)

out = F.conv2d(I, K, stride=1, padding=0)

print("Input  shape:", I.shape)
print("Kernel shape:", K.shape)
print("Output shape:", out.shape)
print(out.squeeze())
# Expected → identical to the NumPy version:
# tensor([[160., 160., 160.],
#         [240., 240., 240.],
#         [320., 320., 320.]])
```
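To give a feel for the im2col + GEMM idea mentioned in the notes on the F.conv2d call, here is a minimal NumPy sketch. It is not how PyTorch's backends are written; it only shows the reshaping trick on this section's 5×5 image, and it assumes NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# The 5×5 image and Sobel-X kernel used throughout this section.
I = 10.0 * np.outer(np.arange(1, 6), np.arange(1, 6))
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

# im2col: every 3×3 window becomes one row of a (9 windows × 9 values) matrix.
windows = sliding_window_view(I, K.shape)        # shape (3, 3, 3, 3)
out_H, out_W = windows.shape[:2]
cols = windows.reshape(out_H * out_W, K.size)    # shape (9, 9)

# GEMM: a single matrix-vector product replaces the two Python loops.
out = (cols @ K.ravel()).reshape(out_H, out_W)
print(out)
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]
```

The trade-off is memory: the unrolled matrix stores each pixel once per window it appears in, in exchange for handing all the arithmetic to one optimised matrix multiply.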

When to use F.conv2d vs nn.Conv2d

F.conv2d is stateless: you pass the weights in. nn.Conv2d is a Module that owns the weights (and bias) as parameters, so they show up in .parameters() (which is what you hand to the optimiser) and in checkpoints. Use nn.Conv2d for anything learnable; use F.conv2d for one-off operations or when you want a custom weight source.

Interactive Convolution Calculator

[Interactive widget: convolution calculator. Selecting an output cell shows its step-by-step calculation, “Animate” slides the kernel across the input, and a dropdown swaps in different kernel weights. Shown: Input Image (5×5, the gradient image above) * Kernel (3×3, Sobel X, detects vertical edges) = Output (3×3).]

Key Insight

The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere; this is weight sharing. Output size per axis = Input size − Kernel size + 1 = 5 − 3 + 1 = 3, so the output is 3×3.

  • Identity — output equals input (useful as a sanity check).
  • Vertical edge (Sobel X) — high response where brightness changes left → right.
  • Horizontal edge (Sobel Y) — high response where brightness changes top → bottom.
  • Box blur — smooths by averaging all nine neighbours.
  • Sharpen — enhances local differences, hence edges.
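Weight sharing, as described in the Key Insight above, can be made concrete by writing the same operation as one matrix multiply. The sketch below is an illustration only, not how libraries implement it: it builds the matrix `T` that a fully connected layer would need to reproduce the Sobel-X output, and every row of `T` turns out to hold the same nine kernel weights at a different position.

```python
import numpy as np

I = 10.0 * np.outer(np.arange(1, 6), np.arange(1, 6))   # the 5×5 image from this section
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

H, W = I.shape
kH, kW = K.shape
out_H, out_W = H - kH + 1, W - kW + 1

# Row (i, j) of T is the kernel pasted where the window sits, then flattened.
T = np.zeros((out_H * out_W, H * W))
for i in range(out_H):
    for j in range(out_W):
        canvas = np.zeros((H, W))
        canvas[i:i + kH, j:j + kW] = K
        T[i * out_W + j] = canvas.ravel()

out = (T @ I.ravel()).reshape(out_H, out_W)
print(out)
print("dense-layer weights:", T.size, "| shared kernel weights:", K.size)
```

A dense layer mapping 25 inputs to 9 outputs would carry 225 independent weights; the convolution reuses 9.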

Kernel Effects Gallery

Different weights → different outputs, applied to the same input. The gallery below is the whole idea of classical image processing in a box: each filter was hand-designed by someone studying a specific problem.

[Interactive widget: Kernel Effect Gallery, showing Input * Kernel = Output for a selectable kernel. Kernels available: Identity, Sobel X (Vertical Edges), Sobel Y (Horizontal Edges), Box Blur, Gaussian Blur, Sharpen, Laplacian, Emboss.]

Selected kernel: Sobel X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

Description: Detects vertical edges by computing horizontal gradients. Left pixels are subtracted from right pixels.

Formula: Gx = ∂I/∂x ≈ I(x+1) - I(x-1)

Use case in deep learning: Edge detection, feature extraction in CNNs

Key Insight

Each kernel acts as a feature detector. In CNNs, instead of hand-designing these kernels, we let the network learn optimal kernels from data. The first layers often learn edge detectors similar to Sobel, while deeper layers learn more complex patterns.

Why each kernel works

| Kernel | Weight pattern | Why it works |
| --- | --- | --- |
| Vertical edge (Sobel X) | negative left, positive right | subtracts left column from right column → response ∝ horizontal intensity change |
| Horizontal edge (Sobel Y) | negative top, positive bottom | same idea, rotated 90°; detects horizontal edges |
| Box blur (3×3) | all 1/9 | arithmetic mean of the 9-pixel neighbourhood; suppresses noise, blurs detail |
| Gaussian blur (3×3) | bell-curve weights | weighted mean with peak at the centre; smoother than box blur for the same radius |
| Sharpen | 5 at the centre, −1 at the 4-neighbours, 0 at corners | boosts the centre relative to its neighbours; enhances edges |
| Laplacian | −4 at the centre, 1 at the 4-neighbours | discrete second derivative; responds to edges of any orientation |
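The weight patterns in the table can be tried out on this section's 5×5 gradient image. The sketch below uses a hypothetical helper `cross_correlate2d`. The Sobel, sharpen and Laplacian weights follow the table, and the box blur is all 1/9 as stated. The Gaussian weights (1-2-1 outer product divided by 16) are a common choice assumed here, since the table only says "bell-curve".

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Stride-1, no-padding sliding dot product (hypothetical helper)."""
    kH, kW = kernel.shape
    out_H, out_W = image.shape[0] - kH + 1, image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

I = 10.0 * np.outer(np.arange(1, 6), np.arange(1, 6))   # the 5×5 gradient image

kernels = {
    "sobel_x":   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "sobel_y":   np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    "box_blur":  np.full((3, 3), 1.0 / 9.0),
    "gaussian":  np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0,  # assumed weights
    "sharpen":   np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "laplacian": np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float),
}

responses = {name: cross_correlate2d(I, k) for name, k in kernels.items()}
for name, response in responses.items():
    print(name)
    print(np.round(response, 6))
```

On this particular image the blurs and the sharpen filter return the central 3×3 crop unchanged and the Laplacian is zero everywhere: the image is a smooth ramp with no noise to average away and no curvature to respond to. Only the two Sobel filters produce a distinctive response.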

From hand-designed to learned

Classical vision required experts to hand-tune these kernels. Deep learning's breakthrough was letting gradient descent learn optimal kernel weights from data. The first layer of a trained CNN typically converges to oriented edge detectors and colour blobs, filters that are strikingly similar to the Gabor-like receptive fields Hubel & Wiesel recorded in V1.

Reference

LeCun, Bottou, Bengio & Haffner, 1998, “Gradient-Based Learning Applied to Document Recognition”, Proc. IEEE 86(11); Krizhevsky, Sutskever & Hinton, 2012, “ImageNet Classification with Deep CNNs”, NeurIPS.

Multi-Channel Convolution

Real images have 3 channels (RGB). Intermediate feature maps can have hundreds. How does convolution handle this? The answer is a single, tidy rule: one kernel spans every input channel and produces exactly one output channel. To get multiple output channels you use multiple kernels.

RGB convolution, spelled out

  • Input: shape $(C_{\text{in}} = 3,\, H,\, W)$, i.e. 3 colour channels.
  • One kernel: shape $(C_{\text{in}} = 3,\, K,\, K)$, with separate weights for each input channel, covering the same spatial $K \times K$ window.
  • Output: shape $(1,\, H',\, W')$, one scalar per position.

At each output position we compute three element-wise multiply-and-sum operations (one per input channel) and add them together to get one scalar. The diagram below is the visual version of that recipe.

Multi-Channel (RGB) Convolution

Input Image (4×4×3 RGB)

R = [[255, 200, 150, 100],
     [220, 180, 140,  80],
     [180, 140, 100,  60],
     [140, 100,  60,  40]]

G = [[100, 120, 140, 160],
     [ 80, 100, 120, 140],
     [ 60,  80, 100, 120],
     [ 40,  60,  80, 100]]

B = [[ 50,  80, 110, 140],
     [ 70, 100, 130, 160],
     [ 90, 120, 150, 180],
     [110, 140, 170, 200]]

Kernel (3×3×3)

KR = [[ 1,  0, -1], [ 2,  0, -2], [ 1,  0, -1]]
KG = [[ 0,  1,  0], [ 1, -4,  1], [ 0,  1,  0]]
KB = [[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]]

Calculation at position (0, 0):

R channel: Σ(IR × KR) = 345
G channel: Σ(IG × KG) = 0
B channel: Σ(IB × KB) = 0
Total: 345 + 0 + 0 = 345

Output Feature Map (2×2×1)

[[345, 380],
 [320, 320]]
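The numbers in the 4×4 RGB example above can be reproduced in a few lines of NumPy. This is a minimal sketch; the function name `conv2d_single_kernel` is hypothetical, and it assumes stride 1, no padding, and channels-first layout.

```python
import numpy as np

image = np.array([
    [[255, 200, 150, 100], [220, 180, 140, 80], [180, 140, 100, 60], [140, 100, 60, 40]],     # R
    [[100, 120, 140, 160], [80, 100, 120, 140], [60, 80, 100, 120], [40, 60, 80, 100]],       # G
    [[50, 80, 110, 140], [70, 100, 130, 160], [90, 120, 150, 180], [110, 140, 170, 200]],     # B
], dtype=float)                                   # shape (3, 4, 4)

kernel = np.array([
    [[1, 0, -1], [2, 0, -2], [1, 0, -1]],         # KR
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]],           # KG
    [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],    # KB
], dtype=float)                                   # shape (3, 3, 3)

def conv2d_single_kernel(image, kernel):
    """One multi-channel kernel → one output channel."""
    C, H, W = image.shape
    kC, kH, kW = kernel.shape
    assert C == kC, "kernel depth must equal the number of input channels"
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply across all channels at once, then sum everything.
            out[i, j] = np.sum(image[:, i:i + kH, j:j + kW] * kernel)
    return out

per_channel = (image[:, 0:3, 0:3] * kernel).sum(axis=(1, 2))   # R, G, B sums at (0, 0)
out = conv2d_single_kernel(image, kernel)
print(per_channel)
print(out)
# [[345. 380.]
#  [320. 320.]]
```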

Key Insight 1: Single Kernel Spans All Channels

One 3×3×3 kernel covers all input channels and produces one output value per position. The kernel has separate weights for R, G, and B, but their contributions are summed.

Key Insight 2: Multiple Kernels = Multiple Outputs

To produce multiple output channels (feature maps), we use multiple kernels. 64 kernels → 64 output channels. Each kernel learns different features!

Parameter Count Formula:

Parameters = KH × KW × Cin × Cout + Cout
Example: 3 × 3 × 3 × 64 + 64 = 1,792 parameters

Numerical trace: a 2×2 RGB patch by one 2×2×3 kernel

The code below walks the full arithmetic with tiny integer values so you can check every multiplication in your head.

Multi-channel convolution — every product visible
🐍multichannel_conv.py
Line 1: import numpy as np

Same NumPy import as before — ndarray + element-wise arithmetic is all we need.

Line 4: I = np.array([...]), shape (3, 2, 2)

A tiny RGB-like input stored in 'channels-first' layout (C, H, W), which matches PyTorch's convention. Each channel is a 2×2 matrix.

EXECUTION STATE
I[0] — R channel =
[[1 2]
 [3 4]]
I[1] — G channel =
[[5 6]
 [7 8]]
I[2] — B channel =
[[ 9 10]
 [11 12]]
I.shape = (3, 2, 2) — (channels, height, width)
Line 11: K = np.array([...]), shape (3, 2, 2)

A single multi-channel kernel. ONE kernel has weights for EVERY input channel — that is the critical fact. Here the 3 input channels each get their own 2×2 weight matrix. The kernel spans the entire input-channel dimension.

EXECUTION STATE
K[0] — R weights =
[[1 0]
 [0 1]]
K[1] — G weights =
[[0 1]
 [1 0]]
K[2] — B weights =
[[1 1]
 [1 1]]
→ key insight = If we had 3 channels and 64 kernels we would need a weight tensor of shape (64, 3, 2, 2) — one 3-channel kernel per output feature map.
Line 18: red_contrib = np.sum(I[0] * K[0])

Compute the scalar contribution of the R channel. Element-wise multiply (4 products) then reduce-sum. This is identical in spirit to the single-channel case — one channel at a time.

EXECUTION STATE
I[0] * K[0] =
[[1×1, 2×0],   [[1, 0],
 [3×0, 4×1]] =  [0, 4]]
np.sum(...) = 1 + 0 + 0 + 4 = 5
Line 19: green_contrib = np.sum(I[1] * K[1])

Same recipe on the G channel.

EXECUTION STATE
I[1] * K[1] =
[[5×0, 6×1],   [[0,  6],
 [7×1, 8×0]] =  [7,  0]]
np.sum(...) = 0 + 6 + 7 + 0 = 13
Line 20: blue_contrib = np.sum(I[2] * K[2])

And the B channel.

EXECUTION STATE
I[2] * K[2] =
[[ 9×1,10×1],  [[ 9,10],
 [11×1,12×1]]= [11,12]]
np.sum(...) = 9 + 10 + 11 + 12 = 42
Line 23: output = red_contrib + green_contrib + blue_contrib

The three channel contributions are ADDED to produce a single output value. This is what makes a multi-channel kernel a SINGLE feature detector: it looks for a pattern that combines information from every input channel simultaneously.

EXECUTION STATE
⬆ output = 5 + 13 + 42 = 60
→ shape accounting = input (C=3, H=2, W=2) → one output scalar. If the kernel were smaller than the image (e.g. 1×1) we would get a 2×2 output feature map instead.
Line 25: print("R contribution:", red_contrib)

Display each channel's isolated contribution so you can verify the math by hand.

EXECUTION STATE
→ prints = R contribution: 5
Line 26: print("G contribution:", green_contrib)

G channel contribution.

EXECUTION STATE
→ prints = G contribution: 13
Line 27: print("B contribution:", blue_contrib)

B channel contribution.

EXECUTION STATE
→ prints = B contribution: 42
Line 28: print("output pixel :", output)

Final per-pixel output — the SUM of the per-channel contributions.

EXECUTION STATE
→ prints = output pixel : 60
```python
import numpy as np

# Tiny 2×2 RGB image: 3 channels, each 2×2 pixels
I = np.array([
    [[1, 2], [3, 4]],      # R channel
    [[5, 6], [7, 8]],      # G channel
    [[9,10], [11,12]],     # B channel
])  # shape (C=3, H=2, W=2)

# One 2×2×3 kernel: same spatial size as the image, so we get a scalar out
K = np.array([
    [[1, 0], [0, 1]],      # R weights
    [[0, 1], [1, 0]],      # G weights
    [[1, 1], [1, 1]],      # B weights
])  # shape (C=3, kH=2, kW=2)

# Per-channel contribution: element-wise multiply then sum each channel
red_contrib   = np.sum(I[0] * K[0])
green_contrib = np.sum(I[1] * K[1])
blue_contrib  = np.sum(I[2] * K[2])

# The output pixel is the SUM of the three channel contributions
output = red_contrib + green_contrib + blue_contrib

print("R contribution:", red_contrib)
print("G contribution:", green_contrib)
print("B contribution:", blue_contrib)
print("output pixel  :", output)
# Expected → R=5, G=13, B=42, output=60
```
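Since the per-channel sums are simply added, the three steps collapse into one reduction over channels, height and width together. A small sketch of that equivalence, reusing the same tiny arrays:

```python
import numpy as np

I = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])   # (C, H, W)
K = np.array([[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[1, 1], [1, 1]]])      # (C, kH, kW)

per_channel = (I * K).sum(axis=(1, 2))        # one sum per channel: 5, 13, 42
total = per_channel.sum()                     # 60

# Same number in a single step: contract over all three axes at once.
total_einsum = np.einsum("chw,chw->", I, K)

print(per_channel, total, total_einsum)
```

This is why a multi-channel kernel is best thought of as one 3-D block of weights rather than three separate 2-D filters.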

Multiple output channels

To produce $C_{\text{out}}$ output feature maps we use $C_{\text{out}}$ independent multi-channel kernels. The full weight tensor of a conv layer therefore has shape:

$$\mathbf{W} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K \times K}$$

The first layer of many CNNs chooses $C_{\text{out}} = 64$: 64 different kernels, each watching all 3 RGB channels. Different kernels converge to detect different features: horizontal edges, vertical edges, diagonals, colour blobs, …
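A full layer with several output channels just repeats the single-kernel recipe once per kernel. The following is a loop-based sketch under stated assumptions (stride 1, no padding, random weights, a hypothetical function name `conv2d_layer`), meant to show the shapes rather than to be fast:

```python
import numpy as np

def conv2d_layer(image, weights, bias):
    """image: (C_in, H, W); weights: (C_out, C_in, K, K); bias: (C_out,)."""
    C_out, C_in, kH, kW = weights.shape
    assert image.shape[0] == C_in, "kernel depth must match input channels"
    out_H = image.shape[1] - kH + 1
    out_W = image.shape[2] - kW + 1
    out = np.zeros((C_out, out_H, out_W))
    for o in range(C_out):                       # one kernel per output channel
        for i in range(out_H):
            for j in range(out_W):
                patch = image[:, i:i + kH, j:j + kW]
                out[o, i, j] = np.sum(patch * weights[o]) + bias[o]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))               # illustrative RGB-like input
weights = rng.normal(size=(64, 3, 3, 3))         # 64 kernels, each spanning 3 channels
bias = np.zeros(64)

features = conv2d_layer(image, weights, bias)
print(features.shape)                            # (64, 6, 6)
print(weights.size + bias.size)                  # 1792
```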

Parameter count

$$\text{params} \;=\; K \times K \times C_{\text{in}} \times C_{\text{out}} \;+\; C_{\text{out}}$$

… where the trailing $C_{\text{out}}$ is one bias per output channel.

🐍param_count.py
```python
# A canonical "first conv layer": RGB input, 64 filters, 3×3 kernels
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
params = sum(p.numel() for p in conv1.parameters())
print(f"Parameters: {params}")
# 3 × 3 × 3 × 64 + 64 = 1,728 + 64 = 1,792
```

Quick Check

How many parameters does nn.Conv2d(64, 128, kernel_size=3) have?


We now have the operation itself down cold — in math, in NumPy, and in PyTorch. Section 3 adds the three knobs that turn this raw operation into a practical CNN building block: stride, padding, and pooling. It also introduces the concept that explains why CNN depth matters — the receptive field.
