Chapter 10
Section 63 of 178

The Convolution Operation

Introduction

In the previous section, we established why convolutions are essential for processing images. Now we'll dive deep into what the convolution operation actually does and how it works mathematically.

The Core Insight: Before convolutions, image processing required hand-crafted feature detectors for every possible pattern at every possible location. The convolution operation elegantly solves this by sliding a small learnable filter across the entire image, computing weighted sums at each position. This single mathematical operation replaced thousands of lines of hand-coded pattern matching.

The convolution operation is used billions of times per second in production AI systems:

  • Instagram: Every photo filter applies multiple convolution operations
  • Tesla Autopilot: Processes 30+ frames/second through deep CNN stacks
  • Medical AI: Detects tumors in X-rays using learned convolution filters
  • Face ID: Unlocks your phone via convolutions detecting facial features

By the end of this section, you'll understand exactly what happens when you write nn.Conv2d() in PyTorch—not as a black box, but as a mathematical operation you can compute by hand.


Learning Objectives

After completing this section, you will be able to:

  1. Master the Mathematical Definition: Understand the convolution formula, explain what each symbol means, and recognize the difference between convolution and cross-correlation (and why deep learning uses cross-correlation but calls it convolution)
  2. Compute Convolutions by Hand: Given a 5×5 image and a 3×3 kernel, calculate the complete output matrix step-by-step
  3. Predict Output Dimensions: Use the formula O = ⌊(I - K + 2P) / S⌋ + 1 to calculate output sizes for any configuration
  4. Understand Kernel Design: Explain why certain kernel values detect edges, blur images, or sharpen details
  5. Handle Multi-Channel Images: Describe how RGB convolution works and calculate the parameter count for any Conv2d layer
  6. Implement from Scratch: Write convolution using raw NumPy/PyTorch operations without using nn.Conv2d

Where You'll Apply This Knowledge

  • Building CNNs: Every conv layer in ResNet, VGG, EfficientNet, YOLO uses this exact operation
  • Debugging: When models fail, understanding feature map computation helps diagnose issues
  • Research: Novel architectures (depthwise separable, dilated, deformable convolutions) are variations of this core operation
  • Optimization: Knowing the math helps you understand memory usage and computational cost

Starting Simple: 1D Convolution

Before tackling 2D images, let's build intuition with 1D signals. This is exactly how audio processing and time series analysis work.

The Intuition: Sliding Window

Imagine you have a ruler (the kernel) that you slide across a signal. At each position, you multiply the signal values under the ruler by the ruler's markings and sum the results.

Mathematical Definition

For a 1D signal f and kernel g, the convolution is:

(f * g)[n] = Σ_{k=-∞}^{∞} f[k] · g[n - k]

In practice, with finite signals and kernels, deep learning libraries compute the unflipped version (strictly speaking, cross-correlation; more on this below):

(f * g)[n] = Σ_{k=0}^{K-1} f[n + k] · g[k]

Let's break down each symbol:

Symbol | Meaning | Example
f | Input signal (1D array) | [1, 2, 3, 4, 5, 6, 7]
g | Kernel/filter (1D array) | [1, 0, -1]
n | Output position index | n = 0, 1, 2, ...
k | Kernel element index | k = 0, 1, 2 for a 3-element kernel
K | Kernel size | K = 3
* | Convolution operator | (f * g) produces a new signal

Worked Example: 1D Convolution

Let's compute the convolution of signal [1, 2, 3, 4, 5] with kernel [1, 0, -1]:

1D Convolution Step-by-Step
🐍conv1d_example.py

Before reading the code, note the key pieces:

  • Input signal: a simple 1D signal with 5 elements. In real applications, this could be audio samples, stock prices, or sensor readings.
  • Kernel/filter: the 3-element kernel [1, 0, -1] computes the difference between the first and last element in each window, a simple derivative approximation (gradient detection).
  • Output length: input_length - kernel_length + 1 = 5 - 3 + 1 = 3, so range(3) gives n = 0, 1, 2.
  • Window extraction: at each position n, we extract kernel_length elements from the signal; this is the 'sliding window' moving across it (n=0: [1, 2, 3], n=1: [2, 3, 4]).
  • Dot product: element-wise multiplication followed by a sum, asking how much this window 'matches' the kernel pattern ([1, 2, 3] · [1, 0, -1] = 1 + 0 - 3 = -2).
  • Position n=0 trace: the first output is -2, indicating the signal is increasing at this position.
import numpy as np

# Input signal (1D array)
signal = np.array([1, 2, 3, 4, 5])

# Kernel (1D filter)
kernel = np.array([1, 0, -1])

# Manual convolution computation
output = []
for n in range(len(signal) - len(kernel) + 1):
    # Extract the window from the signal
    window = signal[n:n+len(kernel)]
    # Compute dot product: element-wise multiply, then sum
    value = np.sum(window * kernel)
    output.append(value)

print(f"Input signal: {signal}")
print(f"Kernel: {kernel}")
print(f"Output: {output}")

# Let's trace position n=0:
# window = [1, 2, 3]
# kernel = [1, 0, -1]
# 1*1 + 2*0 + 3*(-1) = 1 + 0 - 3 = -2

# Position n=1:
# window = [2, 3, 4]
# kernel = [1, 0, -1]
# 2*1 + 3*0 + 4*(-1) = 2 + 0 - 4 = -2

# Position n=2:
# window = [3, 4, 5]
# kernel = [1, 0, -1]
# 3*1 + 4*0 + 5*(-1) = 3 + 0 - 5 = -2

What does this kernel detect?

The kernel [1, 0, -1] computes signal[n] - signal[n+2], a discrete derivative approximation across each window. Positive output means the signal is decreasing; negative means increasing. Our signal increases by 1 per step, so the difference across two steps is always 2, giving a constant output of -2.
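NumPy ships both operations, which makes the manual loop easy to check: np.correlate matches our no-flip loop, while np.convolve flips the kernel first (a distinction discussed in detail later in this section).

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Cross-correlation: slide the kernel with no flip (what our manual loop does)
xcorr = np.correlate(signal, kernel, mode="valid")
print(xcorr)  # [-2 -2 -2]

# True convolution: NumPy flips the kernel first, so the signs flip too
conv = np.convolve(signal, kernel, mode="valid")
print(conv)   # [2 2 2]
```

The "valid" mode keeps only positions where the kernel fits entirely inside the signal, matching the output length formula above.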

Quick Check

What is the output length when convolving a signal of length 10 with a kernel of length 4?


2D Convolution for Images

Now we extend to 2D—the foundation of all image processing in deep learning. The concept is identical: slide a kernel across the input, compute weighted sums at each position.

Mathematical Definition

For a 2D image I and kernel K, the 2D convolution is:

(I * K)[i, j] = Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} I[i+m, j+n] · K[m, n]

Let's decode every symbol:

Symbol | Meaning | Typical Value
I | Input image (2D matrix) | 224×224 pixels
K | Kernel/filter (2D matrix) | 3×3 or 5×5
i, j | Output position (row, column) | i = 0..H-M, j = 0..W-N
m, n | Kernel indices (row, column) | m, n = 0..2 for a 3×3 kernel
M, N | Kernel height and width | M = N = 3 for a 3×3 kernel
* | 2D convolution operator | Produces a feature map

The Intuitive Picture

Imagine placing a small 3×3 transparent overlay on an image. Each cell of the overlay has a number (the kernel weight). At each position:

  1. Multiply each pixel under the overlay by its corresponding kernel weight
  2. Sum all 9 products
  3. Write this sum to the output at that position
  4. Slide the overlay one pixel to the right (or down)
  5. Repeat until you've covered the entire image

Visualizing the Sliding Operation

Watch the convolution operation in action. This animation shows exactly how the kernel slides across the input, computing one output value at a time:

Convolution Animation

[Interactive animation: a 3×3 kernel (the Sobel X filter: -1 0 1 / -2 0 2 / -1 0 1) slides across a 5×5 input, computing one output value of the 3×3 feature map at a time.]

Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2

Why This Works

The power of convolution comes from what the kernel weights encode:

  • Edge detection: Kernels with positive weights on one side and negative on the other detect transitions
  • Blurring: Kernels with equal positive weights average neighboring pixels
  • Sharpening: Kernels that emphasize the center relative to neighbors enhance details

Key Insight: In classical image processing, engineers hand-designed kernels for specific tasks. In deep learning, we let the network learn optimal kernel values from data. The backpropagation algorithm adjusts kernel weights to minimize the loss function.

Full CNN Pipeline: The Big Picture

Now that you understand the basic convolution operation, let's see how multiple convolution and pooling layers work together in a real CNN. This interactive visualization shows the complete flow from input image to classification:

2D Convolution: Complete Process Visualization

Watch how a CNN processes an image through convolution and pooling layers, reducing dimensions while extracting features.

CNN Architecture: Dimension Reduction Pipeline

Input 1@28×28 → Conv1 32@26×26 (K=3×3) → Pool1 32@13×13 (P=2×2) → Conv2 64@11×11 (K=3×3) → Pool2 64@5×5 (P=2×2) → Flatten 1×1600 → FC 10 units

Legend: K = kernel size, P = pool size, C@H×W = channels @ height×width

Notice how each layer transforms the data:

  • 28×28 Input: raw grayscale image (like MNIST digits)
  • Conv1 (32@26×26): 32 different 3×3 kernels extract 32 feature maps, each detecting different patterns
  • Pool1 (32@13×13): 2×2 max pooling halves spatial dimensions, keeping the strongest activations
  • Conv2 (64@11×11): 64 kernels build on previous features, learning higher-level patterns
  • Pool2 (64@5×5): further spatial reduction
  • Flatten (1600): reshape 64×5×5 = 1600 values into a 1D vector
  • FC (10): fully connected layer outputs class probabilities
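The dimension bookkeeping above can be reproduced with two small helper functions (a sketch of the output-size formula covered later in this section; the names are just labels):

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a conv layer: floor((I - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, pool):
    """Spatial output size of non-overlapping pooling (stride = pool size)."""
    return size // pool

size = 28                  # MNIST-style input
size = conv_out(size, 3)   # Conv1: 28 -> 26
size = pool_out(size, 2)   # Pool1: 26 -> 13
size = conv_out(size, 3)   # Conv2: 13 -> 11
size = pool_out(size, 2)   # Pool2: 11 -> 5
flat = 64 * size * size    # Flatten: 64 channels x 5 x 5 = 1600
print(size, flat)          # 5 1600
```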

Kernel Filtering

[Interactive: step through a 7×7 input convolved with a 3×3 kernel, filling in the 5×5 feature map one cell at a time.]

Input (7×7):

3 3 2 1 0 2 1
0 0 1 3 1 0 2
3 1 2 2 3 1 0
2 0 0 2 2 3 1
2 0 0 0 1 2 2
1 3 2 1 0 1 3
0 2 1 3 2 0 1

Kernel (3×3):

0 1 2
2 2 0
0 1 2

Feature Map (Output): 5×5, since ⌊(7 - 3 + 0) / 1⌋ + 1 = 5.
For the earlier 5×5 input:

  • Without padding (P=0): output size = (5-3)/1 + 1 = 3×3
  • With padding (P=1): output size = (5+2-3)/1 + 1 = 5×5 (same as the input!)
  • With stride=2: the kernel moves 2 pixels at a time, shrinking the output further
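A quick NumPy sketch of the padding case: a 1-pixel zero border keeps a 5×5 input at 5×5 after a 3×3 convolution. Here conv2d_valid is a throwaway helper mirroring the manual sliding-window loop used throughout this section, not a library call.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "Valid" cross-correlation: slide the kernel over every position
    # where it fits entirely inside the image
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                 # box blur

print(conv2d_valid(image, kernel).shape)       # (3, 3): no padding
padded = np.pad(image, 1)                      # 1-pixel zero border -> 7x7
print(conv2d_valid(padded, kernel).shape)      # (5, 5): same size as the input
```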

Pooling Operation

[Interactive: a 2×2 max pool with stride 2 reduces a 5×5 feature map to a 2×2 output; output size = ⌊5 / 2⌋ = 2, so there are 4 pooling steps.]

Feature Map (from Conv, 5×5):

12 12 17 17  7
10 17 19 19 17
 9  6 14 18 17
11  8  7 12 18
12 17 15  9 10

Pooled Output: 2×2
Max Pooling
  • Operation: takes the maximum value from each 2×2 window
  • Effect: keeps the strongest activations and provides a degree of translation invariance
  • Use case: the most common pooling in CNNs (VGG, ResNet, etc.)
  • Output size: ⌊5 / 2⌋ = 2, giving a 2×2 output
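Max pooling is only a few lines of NumPy. This sketch (max_pool2d is an illustrative helper, not a library function) reproduces the 2×2 result for the feature map shown above:

```python
import numpy as np

def max_pool2d(x, pool=2):
    # Non-overlapping max pooling: stride equals the pool size, and any
    # leftover rows/columns that don't fill a full window are dropped
    H, W = x.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*pool:(i+1)*pool, j*pool:(j+1)*pool].max()
    return out

fmap = np.array([
    [12, 12, 17, 17,  7],
    [10, 17, 19, 19, 17],
    [ 9,  6, 14, 18, 17],
    [11,  8,  7, 12, 18],
    [12, 17, 15,  9, 10],
])
print(max_pool2d(fmap))
# [[17. 19.]
#  [11. 18.]]
```

Note that the fifth row and column are dropped: ⌊5 / 2⌋ = 2 windows fit in each dimension.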

Output Size Formula

O = ⌊(I - K + 2P) / S⌋ + 1
O = Output size
I = Input size
K = Kernel size
P = Padding
S = Stride
Current: O = ⌊(7 - 3 + 0) / 1⌋ + 1 = 5
  • Kernel = filter = feature detector
  • Stride = step size of kernel movement
  • Padding = zero-padding around the input
  • Feature map = output of convolution

Feature Learning vs Classification

The convolution and pooling layers form the feature learning part of the network—they automatically discover useful representations. The final fully connected layers perform classification based on these learned features.

Cross-Correlation vs Convolution

There's an important subtlety that causes confusion: deep learning uses cross-correlation, but calls it convolution.

Mathematical Convolution (Signal Processing)

In mathematics and signal processing, convolution flips the kernel before sliding:

(I * K)[i, j] = Σ_m Σ_n I[i-m, j-n] · K[m, n]

The kernel is flipped both horizontally and vertically (rotated 180°). This ensures certain mathematical properties like commutativity: f * g = g * f.

Cross-Correlation (Deep Learning)

In deep learning, we use cross-correlation—no flipping:

(I ⋆ K)[i, j] = Σ_m Σ_n I[i+m, j+n] · K[m, n]

The kernel slides across the image without being flipped.

Why Does Deep Learning Use Cross-Correlation?

  1. Learned kernels adapt: Since we learn kernel weights, it doesn't matter if we flip or not—the network will learn the appropriate (possibly flipped) pattern
  2. Simpler implementation: No flip operation needed
  3. Same result: For symmetric kernels (like Gaussian blur), convolution = cross-correlation
  4. Historical convention: The deep learning community standardized on this approach

Terminology Alert

When a deep learning paper or library says "convolution," they almost always mean cross-correlation. PyTorch's nn.Conv2d performs cross-correlation. Be aware of this when reading signal processing literature.
Aspect | True Convolution | Cross-Correlation (DL)
Kernel flip | Yes (rotate 180°) | No
Commutative | Yes: f*g = g*f | No
Used in | Signal processing, math | Deep learning
PyTorch | Not the default | nn.Conv2d
Matters for learning? | No, weights adapt | No, weights adapt

Practical advice

For deep learning, ignore the flip distinction. Just think of convolution as "slide the kernel, compute weighted sums." The network learns what it needs either way.
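To see the flip distinction concretely, here is a small NumPy sketch comparing cross-correlation with true convolution on a symmetric (Gaussian) and an asymmetric (Sobel X) kernel. Both helper functions are illustrative, not library calls:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    # Slide the kernel WITHOUT flipping (what deep learning calls "convolution")
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def convolve2d(image, kernel):
    # True convolution: rotate the kernel 180 degrees, then slide
    return cross_correlate2d(image, kernel[::-1, ::-1])

rng = np.random.default_rng(0)
img = rng.standard_normal((5, 5))

gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

# Symmetric kernel: flipping changes nothing, so the two operations agree
print(np.allclose(convolve2d(img, gaussian), cross_correlate2d(img, gaussian)))  # True

# Asymmetric kernel: the flipped Sobel X is its own negation, so they differ
print(np.allclose(convolve2d(img, sobel_x), cross_correlate2d(img, sobel_x)))    # False
```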

Step-by-Step Computation

Let's work through a complete example by hand. This builds the intuition you need to debug CNNs and understand what's happening inside.

Example: 5×5 Image with 3×3 Kernel

Input image I (5×5):

📝input.txt
I = [[10, 20, 30, 40, 50],
     [20, 40, 60, 80, 100],
     [30, 60, 90, 120, 150],
     [40, 80, 120, 160, 200],
     [50, 100, 150, 200, 250]]

Kernel K (3×3 Sobel vertical-edge detector):

📝kernel.txt
K = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]

Computing Output[0,0]

Position (0,0) overlays the kernel on the top-left 3×3 region of the input:

📝calculation.txt
Window at (0,0):         Kernel:
[[10, 20, 30],           [[-1, 0, 1],
 [20, 40, 60],     ×      [-2, 0, 2],
 [30, 60, 90]]            [-1, 0, 1]]

Element-wise multiply:
[[-10, 0, 30],
 [-40, 0, 120],
 [-30, 0, 90]]

Sum all elements: -10 + 0 + 30 + (-40) + 0 + 120 + (-30) + 0 + 90 = 160

Output[0,0] = 160

Computing All Positions

The output is a 3×3 matrix (since 5-3+1=3 for both dimensions):

2D Convolution from Scratch
🐍conv2d_manual.py

Key points in the code below:

  • Input image: a 5×5 grayscale image following a multiplication-table pattern, I[i][j] = 10·(i+1)·(j+1); brightness grows toward the bottom-right (top-left = 10, bottom-right = 250).
  • Sobel X kernel: detects vertical edges by computing horizontal gradients; it subtracts left pixels from right pixels, with the center row weighted 2× (detects transitions like dark|bright).
  • Output dimensions: with no padding, output size = input size - kernel size + 1; here 5 - 3 + 1 = 3 for both height and width.
  • Nested loops: we iterate over every position (i, j) where the kernel fits entirely within the image; i, j ∈ {0, 1, 2}, giving 9 positions.
  • Window extraction: at each position, Python slicing extracts the 3×3 region the kernel currently overlays (image[1:4, 2:5] extracts rows 1-3, cols 2-4).
  • Core computation: element-wise multiply the window with the kernel, then sum all 9 products; this single number becomes one output pixel.
  • Result analysis: each output row is constant, but the rows differ (160, 240, 320). Row r of the input increases by 10·(r+1) per column, so the horizontal gradient the Sobel X kernel measures grows as we move down the image.
import numpy as np

# Input image (5x5)
I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250]
], dtype=float)

# Sobel X kernel (3x3) - detects vertical edges
K = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=float)

# Compute 2D convolution (cross-correlation)
def conv2d_manual(image, kernel):
    H, W = image.shape
    kH, kW = kernel.shape

    # Output dimensions
    out_H = H - kH + 1
    out_W = W - kW + 1

    output = np.zeros((out_H, out_W))

    # Slide kernel across image
    for i in range(out_H):
        for j in range(out_W):
            # Extract window
            window = image[i:i+kH, j:j+kW]
            # Element-wise multiply and sum
            output[i, j] = np.sum(window * kernel)

    return output

output = conv2d_manual(I, K)
print("Output (3x3):")
print(output)

# Output:
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]

Why is each output row constant?

Within any single row of the input, the horizontal step is uniform (row r increases by 10·(r+1) per column), so the Sobel X response is identical at every column of that row. The step grows from row to row, however, which is why the output rows increase: 160, 240, 320. A natural image, with edges in varying places, would produce a much more varied feature map.

Quick Check

At position (1,1), which input pixels does the 3×3 kernel overlay?


Interactive Convolution Calculator

Now it's your turn! Use this interactive calculator to see exactly how convolution works. Click on any output cell to see the step-by-step calculation, or press "Animate" to watch the kernel slide across the input.

Interactive Convolution Calculator

[Interactive: the 5×5 gradient image from the worked example, convolved with the 3×3 Sobel X kernel (-1 0 1 / -2 0 2 / -1 0 1), which detects vertical edges. Click any cell of the 3×3 output to see its step-by-step calculation.]

Key Insight

The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere—this is weight sharing. Output size = Input size - Kernel size + 1 = 5 - 3 + 1 = 3×3.

Try different kernels to see how they produce different outputs:

  • Identity: Output equals input (useful for testing)
  • Vertical Edge: High response where brightness changes left-to-right
  • Horizontal Edge: High response where brightness changes top-to-bottom
  • Box Blur: Smooths the image by averaging neighbors
  • Sharpen: Enhances edges and details

Kernel Effects Gallery

Different kernel weights produce dramatically different outputs. This gallery lets you compare how various kernels transform the same input pattern.

Kernel Effect Gallery

[Interactive: apply a chosen kernel to the same input pattern and compare outputs. Shown: Sobel X (-1 0 1 / -2 0 2 / -1 0 1).]

Description: detects vertical edges by computing horizontal gradients; left pixels are subtracted from right pixels.

Formula: Gx = ∂I/∂x ≈ I(x+1) - I(x-1)

Use case in deep learning: edge detection, feature extraction in CNNs.

Kernels to compare: Identity, Sobel X (vertical edges), Sobel Y (horizontal edges), Box Blur, Gaussian Blur, Sharpen, Laplacian, Emboss.

Key Insight

Each kernel acts as a feature detector. In CNNs, instead of hand-designing these kernels, we let the network learn optimal kernels from data. The first layers often learn edge detectors similar to Sobel, while deeper layers learn more complex patterns.

What Makes Each Kernel Work?

Kernel | Weight Pattern | Why It Works
Vertical Edge (Sobel X) | Negative left, positive right | Subtracts left from right; large |value| means brightness changes horizontally
Horizontal Edge (Sobel Y) | Negative top, positive bottom | Subtracts top from bottom; large |value| means brightness changes vertically
Box Blur | All equal (1/9 each) | Averages all 9 neighbors equally, smoothing out variations
Gaussian Blur | Bell-curve weights | Weights decay with distance from center, giving a smoother blur than box
Sharpen | Large positive center, negative neighbors | Amplifies center relative to neighbors; enhances differences = edges
Laplacian | Negative center, positive neighbors | Second derivative; responds to edges regardless of direction

From Hand-Designed to Learned

Classical computer vision required experts to design these kernels. Deep learning's breakthrough: let the network learn optimal kernels via gradient descent. The first layer of a trained CNN often learns Gabor-like filters (edge detectors at various orientations)—similar to what neuroscientists find in the visual cortex!

Multi-Channel Convolution

Real images have multiple channels (RGB). How does convolution handle this? The key insight: one kernel spans ALL input channels and produces one output channel.

RGB Convolution Explained

For an RGB image:

  • Input: H × W × 3 (height × width × RGB channels)
  • One kernel: K × K × 3 (covers all 3 channels)
  • Output: H' × W' × 1 (one value per position)

The kernel has separate weights for each input channel. At each position, we compute three separate sums (one per channel), then add them together.

Multi-Channel (RGB) Convolution

[Interactive: step through each output position.]

Input Image (4×4×3 RGB):

R channel:           G channel:           B channel:
255 200 150 100      100 120 140 160       50  80 110 140
220 180 140  80       80 100 120 140       70 100 130 160
180 140 100  60       60  80 100 120       90 120 150 180
140 100  60  40       40  60  80 100      110 140 170 200

Kernel (3×3×3):

KR:          KG:          KB:
 1  0 -1     0  1  0      -1 -1 -1
 2  0 -2     1 -4  1      -1  8 -1
 1  0 -1     0  1  0      -1 -1 -1

Calculation at position (0, 0):

  • R channel: Σ(IR × KR) = 345
  • G channel: Σ(IG × KG) = 0
  • B channel: Σ(IB × KB) = 0
  • Total: 345 + 0 + 0 = 345

Output Feature Map (2×2×1):

345 380
320 320

Key Insight 1: Single Kernel Spans All Channels

One 3×3×3 kernel covers all input channels and produces one output value per position. The kernel has separate weights for R, G, and B, but their contributions are summed.

Key Insight 2: Multiple Kernels = Multiple Outputs

To produce multiple output channels (feature maps), we use multiple kernels. 64 kernels → 64 output channels. Each kernel learns different features!

Parameter Count Formula:

Parameters = KH × KW × Cin × Cout + Cout
Example: 3 × 3 × 3 × 64 + 64 = 1,792 parameters
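The widget's numbers can be reproduced in a few lines of NumPy: one kernel spans all three channels, and everything is summed into a single scalar per position.

```python
import numpy as np

# The RGB input and 3-channel kernel from the example above,
# stored channels-first: shapes (3, 4, 4) and (3, 3, 3)
image = np.array([
    [[255, 200, 150, 100], [220, 180, 140, 80],
     [180, 140, 100, 60], [140, 100, 60, 40]],        # R
    [[100, 120, 140, 160], [80, 100, 120, 140],
     [60, 80, 100, 120], [40, 60, 80, 100]],          # G
    [[50, 80, 110, 140], [70, 100, 130, 160],
     [90, 120, 150, 180], [110, 140, 170, 200]],      # B
], dtype=float)

kernel = np.array([
    [[1, 0, -1], [2, 0, -2], [1, 0, -1]],             # KR: Sobel X
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]],               # KG: Laplacian
    [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],        # KB: edge enhance
], dtype=float)

C, H, W = image.shape
_, kH, kW = kernel.shape
out = np.zeros((H - kH + 1, W - kW + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # One kernel spans ALL channels: multiply each channel's window
        # by that channel's weights, then sum everything into one scalar
        out[i, j] = np.sum(image[:, i:i+kH, j:j+kW] * kernel)

print(out)
# [[345. 380.]
#  [320. 320.]]
```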

Multiple Output Channels

To produce multiple output channels (multiple feature maps), we use multiple kernels:

64 kernels → 64 output channels (feature maps)

Each kernel learns to detect a different feature. The first layer might learn:

  • Kernel 1: Horizontal edges
  • Kernel 2: Vertical edges
  • Kernel 3: Diagonal edges (45°)
  • Kernel 4: Diagonal edges (135°)
  • Kernel 5-64: Various orientations, frequencies, colors...

General Formula

For a layer with C_in input channels and C_out output channels using K×K kernels:

Parameters = K × K × C_in × C_out + C_out

Where:

  • K × K × C_in × C_out = kernel weights
  • C_out = bias terms (one per output channel)

Example: First Conv Layer

🐍param_count.py
# Typical first conv layer: RGB input, 64 filters, 3x3 kernels
# Parameters = 3 × 3 × 3 × 64 + 64 = 1,728 + 64 = 1,792

import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
params = sum(p.numel() for p in conv1.parameters())
print(f"Parameters: {params}")  # Output: 1792

Quick Check

How many parameters does nn.Conv2d(64, 128, kernel_size=3) have?


PyTorch Implementation

Now let's see how to use convolution in PyTorch, mapping every parameter to what we've learned.

nn.Conv2d Anatomy

PyTorch Conv2d Complete Guide
🐍pytorch_conv.py

Parameter by parameter:

  • in_channels: number of channels in the input. RGB images have 3; a previous conv layer with 64 filters outputs 64.
  • out_channels: number of filters (kernels) to learn. Each filter produces one output channel (feature map); more filters = more features detected.
  • kernel_size: spatial size of each filter. 3×3 is most common (captures local patterns with few parameters); a tuple like (3, 5) gives a non-square kernel.
  • stride: how many pixels the kernel moves each step. stride=1 moves pixel by pixel; stride=2 skips every other position, roughly halving the output size.
  • padding: zeros added around the input border. padding=1 with kernel_size=3 preserves spatial dimensions; padding='same' auto-calculates.
  • bias: whether to add a learnable constant to each output channel. Usually True; set False when the layer is followed by BatchNorm (which has its own shift).

Shapes to know:

  • Weight shape: (out_channels, in_channels, kernel_H, kernel_W), e.g. [64, 3, 3, 3] means 64 kernels, each 3×3×3. This tensor holds ALL the learnable kernel weights.
  • Bias shape: one scalar per output channel, e.g. [64]. After the convolution, this constant is added to every position of that feature map.
  • Input format (NCHW): PyTorch uses (Batch, Channels, Height, Width). This differs from TensorFlow's default NHWC, so make sure your data matches. Example: [8, 3, 224, 224] is 8 RGB images at 224×224.
  • Output shape: with padding=1 and kernel_size=3, spatial dimensions are preserved; channels change from 3 to 64 and batch size is unchanged, giving [8, 64, 224, 224].
import torch
import torch.nn as nn

# Create a convolutional layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input (3 channels)
    out_channels=64,    # 64 different filters
    kernel_size=3,      # 3x3 kernels
    stride=1,           # Move 1 pixel at a time
    padding=1,          # Add 1 pixel border of zeros
    bias=True           # Include bias terms
)

# Examine the shapes
print(f"Weight shape: {conv.weight.shape}")
# Output: torch.Size([64, 3, 3, 3])
# Interpretation: 64 kernels, each 3×3×3 (3×3 spatial, 3 channels)

print(f"Bias shape: {conv.bias.shape}")
# Output: torch.Size([64])
# Interpretation: One bias per output channel

# Create a batch of images
batch_size = 8
height, width = 224, 224
images = torch.randn(batch_size, 3, height, width)
print(f"Input shape: {images.shape}")
# Output: torch.Size([8, 3, 224, 224])
# Format: (batch, channels, height, width) - NCHW format

# Forward pass
output = conv(images)
print(f"Output shape: {output.shape}")
# Output: torch.Size([8, 64, 224, 224])
# Same spatial size due to padding=1

Manual Implementation (No nn.Conv2d)

To truly understand convolution, let's implement it using only basic tensor operations:

Convolution Implementation from Scratch
🐍conv2d_from_scratch.py

What to watch for in the code:

  • Shape extraction: dimensions come from the input (batch, channels, height, width) and the weight (out_channels, in_channels, kernel_height, kernel_width), e.g. input=[2,3,8,8], weight=[16,3,3,3].
  • Padding: F.pad adds zeros around the borders; the tuple (p, p, p, p) pads (left, right, top, bottom) so the kernel can process edge pixels. An 8×8 input with padding=1 becomes 10×10.
  • Output size: (input - kernel) // stride + 1, with the input size already adjusted for padding; integer division handles the case where it doesn't divide evenly.
  • Nested loops: batch → output channel → row → column. This is O(N × C_out × H_out × W_out × C_in × kH × kW), very slow! Real implementations use matrix tricks.
  • Window coordinates: stride determines how far we jump between positions; stride=2 makes h_start go 0, 2, 4, ... instead of 0, 1, 2, ...
  • Window extraction: each window covers ALL input channels (the : in dimension 1), shape [C_in, kH, kW].
  • Core computation: multiply the window by ONE kernel (weight[c_out], shape [C_in, kH, kW]) and sum all products; a dot product producing one scalar.
  • Bias broadcasting: reshape bias from [C_out] to [1, C_out, 1, 1] so each output channel's bias is added at every position.
import torch

def conv2d_manual(input, weight, bias=None, stride=1, padding=0):
    """
    Manual 2D convolution implementation.

    Args:
        input: (N, C_in, H, W) input tensor
        weight: (C_out, C_in, kH, kW) kernel weights
        bias: (C_out,) optional bias
        stride: step size for sliding
        padding: zero-padding on each side

    Returns:
        (N, C_out, H_out, W_out) output tensor
    """
    N, C_in, H, W = input.shape
    C_out, _, kH, kW = weight.shape

    # Add padding if needed
    if padding > 0:
        input = torch.nn.functional.pad(
            input, (padding, padding, padding, padding)
        )
        H += 2 * padding
        W += 2 * padding

    # Calculate output dimensions
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    # Initialize output
    output = torch.zeros(N, C_out, H_out, W_out)

    # Perform convolution
    for n in range(N):                    # Each image in batch
        for c_out in range(C_out):        # Each output channel
            for i in range(H_out):        # Each output row
                for j in range(W_out):    # Each output column
                    # Extract window
                    h_start = i * stride
                    w_start = j * stride
                    window = input[n, :, h_start:h_start+kH, w_start:w_start+kW]

                    # Multiply and sum
                    output[n, c_out, i, j] = (window * weight[c_out]).sum()

    # Add bias
    if bias is not None:
        output += bias.view(1, -1, 1, 1)  # Broadcast bias to all positions

    return output

# Test against PyTorch
x = torch.randn(2, 3, 8, 8)  # 2 images, 3 channels, 8x8
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True)

official_output = conv(x)
manual_output = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)

print(f"Max difference: {(official_output - manual_output).abs().max():.10f}")
# Output: Max difference: 0.0000000000 (or a very small floating-point error)

Performance Note

This naive implementation is extremely slow. Real convolution uses optimized algorithms like im2col (reshape input so convolution becomes matrix multiplication) or FFT-based methods. PyTorch's nn.Conv2d uses cuDNN on GPU, which is 100-1000× faster.
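To make the im2col idea concrete, here is a minimal single-channel sketch (not PyTorch's actual implementation): every kernel-sized window becomes one row of a matrix, so the whole convolution collapses into a single matrix-vector product. On the 5×5 gradient image from the worked example it reproduces the same feature map:

```python
import numpy as np

def conv2d_im2col(image, kernel):
    # im2col: copy every kernel-sized window into one row of a matrix,
    # then compute all output values at once as a matrix-vector product
    H, W = image.shape
    kH, kW = kernel.shape
    out_H, out_W = H - kH + 1, W - kW + 1
    cols = np.empty((out_H * out_W, kH * kW))
    for i in range(out_H):
        for j in range(out_W):
            cols[i * out_W + j] = image[i:i+kH, j:j+kW].ravel()
    return (cols @ kernel.ravel()).reshape(out_H, out_W)

# The 5x5 gradient image and Sobel X kernel from the worked example
I = np.array([[10 * (r + 1) * (c + 1) for c in range(5)]
              for r in range(5)], dtype=float)
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

print(conv2d_im2col(I, K))
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]
```

With multiple output channels, the kernel vectors stack into a matrix, and the whole layer becomes one large matrix multiplication, which is exactly what GPUs are fastest at.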

Output Size Formula

The output dimensions depend on input size, kernel size, padding, and stride. Master this formula:

O = ⌊(I - K + 2P) / S⌋ + 1

Where:

Symbol | Meaning | Common Values
O | Output size (height or width) | calculated
I | Input size | 224, 32, etc.
K | Kernel size | 3, 5, 7
P | Padding (added to each side) | 0, 1, 2
S | Stride (step size) | 1, 2
⌊ ⌋ | Floor function (round down) | -

Common Scenarios

Scenario | Settings | Formula | Result
Same size (padding) | K=3, P=1, S=1 | (224-3+2)/1 + 1 | 224
Same size (padding) | K=5, P=2, S=1 | (224-5+4)/1 + 1 | 224
Halve size (stride) | K=3, P=1, S=2 | ⌊(224-3+2)/2⌋ + 1 | 112
No padding | K=3, P=0, S=1 | (224-3+0)/1 + 1 | 222
VGG style | K=3, P=1, S=1 + pool | pool halves | 112 after pool
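Each row of the table can be checked with a tiny helper implementing the formula:

```python
def conv_out(I, K, P=0, S=1):
    """Output size: floor((I - K + 2*P) / S) + 1."""
    return (I - K + 2 * P) // S + 1

print(conv_out(224, 3, P=1))        # 224: same size
print(conv_out(224, 5, P=2))        # 224: same size
print(conv_out(224, 3, P=1, S=2))   # 112: halved
print(conv_out(224, 3))             # 222: no padding
```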

Preserve dimensions recipe

To keep output size = input size with stride=1, use padding = (kernel_size - 1) / 2. For 3×3: padding=1. For 5×5: padding=2. For 7×7: padding=3.

Quick Check

What is the output size for input=64, kernel=5, padding=2, stride=2?


AI/Deep Learning Applications

The convolution operation you've learned is the foundation of modern computer vision. Here's how it's applied in cutting-edge AI systems:

Object Detection (YOLO, Faster R-CNN)

Convolutions extract hierarchical features: edges → textures → parts → objects. The network learns to detect "wheel," "headlight," and "car body" through different layers of convolution.

🐍yolo_concept.py
import torch.nn as nn

# Conceptual YOLO architecture (activations and downsampling omitted)
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),    # Edges, colors
    nn.Conv2d(64, 128, 3, padding=1),  # Textures
    nn.Conv2d(128, 256, 3, padding=1), # Parts
    nn.Conv2d(256, 512, 3, padding=1), # Objects
)
# Final layer predicts: (x, y, w, h, confidence, class)

Semantic Segmentation (U-Net)

Every pixel gets classified. Convolutions in the encoder capture context; transposed convolutions (often called deconvolutions) in the decoder restore spatial resolution. Medical imaging (tumor segmentation) relies heavily on this.

Neural Style Transfer

Convolutions capture "style" (textures, brush strokes) vs "content" (shapes, objects). By matching feature statistics from a style image to a content image, we can paint photos in the style of Van Gogh.
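The "feature statistics" that capture style are commonly summarized by the Gram matrix of a conv layer's activations: channel-to-channel correlations, independent of where in the image each texture appears. A minimal sketch (the function name `gram_matrix` is illustrative):

```python
import torch

def gram_matrix(features):
    """Channel correlation matrix of a conv feature map [N, C, H, W] -> [N, C, C]."""
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)
    # Correlate every channel with every other, normalized by element count
    return f @ f.transpose(1, 2) / (c * h * w)

feats = torch.randn(1, 64, 32, 32)   # e.g. activations from one conv layer
print(gram_matrix(feats).shape)      # torch.Size([1, 64, 64])
```

Style transfer then minimizes the difference between the Gram matrices of the generated image and the style image, while keeping content features close to the content image.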

Generative Models (StyleGAN, Diffusion)

Image generation uses convolutions in reverse: starting from noise, transposed convolutions (upsampling) progressively build images. Each conv layer adds more detail.
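The upsampling step can be sketched with `nn.ConvTranspose2d` (channel counts here are illustrative): with stride 2, kernel_size 4, and padding 1, each layer doubles the spatial resolution.

```python
import torch
import torch.nn as nn

# One decoder step: transposed convolution with stride 2 doubles H and W
# (kernel_size=4, padding=1 is a common choice for exact doubling)
up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)   # a low-resolution feature map
y = up(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

A generator stacks several such layers, growing e.g. 4×4 noise to a 64×64 or larger image.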

Key Insight: First Layer Kernels

If you visualize the learned kernels in the first conv layer of a trained network (like ResNet), you'll see:

  • Edge detectors at various orientations (0°, 45°, 90°, 135°)
  • Color blob detectors (red, green, blue regions)
  • Gabor-like filters (oriented frequency patterns)

These closely match what neuroscientists find in the primary visual cortex (V1)! The network "discovers" biologically-relevant features through gradient descent.

The Profound Insight: We didn't tell the network to learn Gabor filters. We just said "minimize classification error" and let backpropagation adjust the kernel weights. The fact that it converges to filters similar to biological neurons suggests something fundamental about optimal visual feature extraction.

Summary

You've now mastered the convolution operation—the foundation of all CNNs:

Key Concepts

| Concept | Definition | Why It Matters |
|---------|------------|----------------|
| Convolution | Sliding window weighted sum | Core feature extraction operation |
| Kernel/Filter | Small weight matrix | Learns to detect specific patterns |
| Feature Map | Convolution output | Encodes presence of patterns at each location |
| Stride | Kernel step size | Controls output resolution |
| Padding | Border zeros | Preserves spatial dimensions |

Critical Formulas

  1. 2D Convolution: $(I * K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]$
  2. Output Size: $O = \lfloor(I - K + 2P)/S\rfloor + 1$
  3. Parameters: $K \times K \times C_{in} \times C_{out} + C_{out}$
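The parameter-count formula is easy to confirm by counting the weights of an actual layer (sizes below are illustrative):

```python
import torch.nn as nn

# A 5x5 conv from 3 input channels to 32 output channels
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5)

# K*K*C_in*C_out weights plus C_out biases: 5*5*3*32 + 32 = 2432
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 2432
```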

Remember

  • Deep learning uses cross-correlation but calls it convolution
  • One kernel spans all input channels, produces one output channel
  • Multiple kernels = multiple feature maps
  • CNNs learn kernels via backpropagation, discovering optimal features

Exercises

Conceptual Questions

  1. A 128×128×3 RGB image passes through nn.Conv2d(3, 32, kernel_size=5, padding=2, stride=2). What is the output shape? How many parameters does the layer have?
  2. Why does the Sobel X kernel [[-1,0,1], [-2,0,2], [-1,0,1]] detect vertical edges rather than horizontal edges?
  3. If you wanted to preserve spatial dimensions with a 7×7 kernel and stride=1, how much padding would you need?
  4. Explain why a CNN with learned 3×3 kernels might be better than using hand-designed Sobel/Laplacian kernels for image classification.

Solution Hints for Conceptual Questions

  1. Q1: Output spatial: (128-5+4)/2+1 = 64. Output shape: [batch, 32, 64, 64]. Params: 5×5×3×32 + 32 = 2,432.
  2. Q2: It computes horizontal differences (left-right). Vertical edges ARE horizontal transitions in brightness!
  3. Q3: P = (K-1)/2 = (7-1)/2 = 3.
  4. Q4: Learned kernels adapt to the specific task, can be asymmetric, and deeper layers build on early features.

Coding Exercises

  1. Implement edge magnitude: Apply Sobel X and Sobel Y to an image, then compute edge magnitude as $\sqrt{G_x^2 + G_y^2}$.
  2. Box blur vs Gaussian: Apply both to an image and visualize the difference. Why does Gaussian look more natural?
  3. Verify the output formula: Create inputs of various sizes and verify that nn.Conv2d produces the output size you calculate.

Challenge Exercise

Implement im2col convolution: The naive nested-loop implementation is O(N × H × W × K² × C). The im2col technique reshapes the input so convolution becomes a single matrix multiplication, leveraging optimized BLAS libraries. Research and implement this technique.

im2col hint

Each output position corresponds to one row in the im2col matrix. That row contains all K×K×C values from the input that contribute to that position. The kernel weights become a column matrix. One matrix multiply computes all outputs!
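The hint above can be sketched with PyTorch's built-in `torch.nn.functional.unfold`, which builds the im2col matrix for you. This is one possible shape of the solution, assuming stride 1 and no padding (the helper name `conv2d_im2col` is my own):

```python
import torch
import torch.nn.functional as F

def conv2d_im2col(x, weight):
    """Convolution as a single matrix multiply (stride=1, no padding).

    x: [N, C, H, W], weight: [C_out, C_in, K, K]
    """
    n, c, h, w = x.shape
    c_out, _, k, _ = weight.shape
    # One column per output position, each holding the K*K*C input values
    cols = F.unfold(x, kernel_size=k)        # [N, C*K*K, L]
    w_mat = weight.view(c_out, -1)           # [C_out, C*K*K]
    out = w_mat @ cols                       # one matmul computes all outputs
    return out.view(n, c_out, h - k + 1, w - k + 1)

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
assert torch.allclose(conv2d_im2col(x, w), F.conv2d(x, w), atol=1e-5)
```

Supporting stride and padding (via `unfold`'s `stride` and `padding` arguments) is left as part of the challenge.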

In the next section, we'll explore convolution parameters—stride, padding, and dilation—in depth, seeing how they control the output size and receptive field of your CNN.