Chapter 10
Section 63 of 178

The Convolution Operation

Introduction

In the previous section, we established why convolutions are essential for processing images. Now we'll dive deep into what the convolution operation actually does and how it works mathematically.

The Core Insight: Before convolutions, image processing required hand-crafted feature detectors for every possible pattern at every possible location. The convolution operation elegantly solves this by sliding a small learnable filter across the entire image, computing weighted sums at each position. This single mathematical operation replaced thousands of lines of hand-coded pattern matching.

The convolution operation is used billions of times per second in production AI systems:

  • Instagram: Every photo filter applies multiple convolution operations
  • Tesla Autopilot: Processes 30+ frames/second through deep CNN stacks
  • Medical AI: Detects tumors in X-rays using learned convolution filters
  • Face ID: Unlocks your phone via convolutions detecting facial features

By the end of this section, you'll understand exactly what happens when you write nn.Conv2d() in PyTorch—not as a black box, but as a mathematical operation you can compute by hand.


Learning Objectives

After completing this section, you will be able to:

  1. Master the Mathematical Definition: Understand the convolution formula, explain what each symbol means, and recognize the difference between convolution and cross-correlation (and why deep learning uses cross-correlation but calls it convolution)
  2. Compute Convolutions by Hand: Given a 5×5 image and a 3×3 kernel, calculate the complete output matrix step-by-step
  3. Predict Output Dimensions: Use the formula O = ⌊(I - K + 2P) / S⌋ + 1 to calculate output sizes for any configuration
  4. Understand Kernel Design: Explain why certain kernel values detect edges, blur images, or sharpen details
  5. Handle Multi-Channel Images: Describe how RGB convolution works and calculate the parameter count for any Conv2d layer
  6. Implement from Scratch: Write convolution using raw NumPy/PyTorch operations without using nn.Conv2d

Where You'll Apply This Knowledge

  • Building CNNs: Every conv layer in ResNet, VGG, EfficientNet, YOLO uses this exact operation
  • Debugging: When models fail, understanding feature map computation helps diagnose issues
  • Research: Novel architectures (depthwise separable, dilated, deformable convolutions) are variations of this core operation
  • Optimization: Knowing the math helps you understand memory usage and computational cost

Starting Simple: 1D Convolution

Before tackling 2D images, let's build intuition with 1D signals. This is exactly how audio processing and time series analysis work.

The Intuition: Sliding Window

Imagine you have a ruler (the kernel) that you slide across a signal. At each position, you multiply the signal values under the ruler by the ruler's markings and sum the results.

Mathematical Definition

For a 1D signal f and kernel g, the convolution is:

(f * g)[n] = Σ_{k=-∞}^{∞} f[k] · g[n - k]

In practice, with finite signals and kernels, deep learning libraries compute the unflipped version (strictly speaking, cross-correlation; more on this below):

(f * g)[n] = Σ_{k=0}^{K-1} f[n + k] · g[k]

Let's break down each symbol:

Symbol | Meaning | Example
f | Input signal (1D array) | [1, 2, 3, 4, 5, 6, 7]
g | Kernel/filter (1D array) | [1, 0, -1]
n | Output position index | n = 0, 1, 2, ...
k | Kernel element index | k = 0, 1, 2 for a 3-element kernel
K | Kernel size | K = 3
* | Convolution operator | (f * g) produces a new signal

Worked Example: 1D Convolution

Let's compute the convolution of signal [1, 2, 3, 4, 5] with kernel [1, 0, -1]:

1D Convolution Step-by-Step
🐍conv1d_example.py

Before reading the code, note the key pieces:

  • Input signal: a simple 1D signal with 5 elements. In real applications, this could be audio samples, stock prices, or sensor readings.
  • Kernel/filter: the 3-element kernel [1, 0, -1] computes the difference between the first and last element in each window, a simple derivative approximation (gradient detection).
  • Output length: input_length - kernel_length + 1 = 5 - 3 + 1 = 3, so range(3) gives n = 0, 1, 2.
  • Window extraction: at each position n, we extract kernel_length elements from the signal; this is the 'sliding window' moving across it (n=0: [1, 2, 3], n=1: [2, 3, 4]).
  • Dot product: element-wise multiplication followed by a sum, asking how much this window 'matches' the kernel pattern ([1, 2, 3] · [1, 0, -1] = 1 + 0 - 3 = -2).
  • Position n=0 trace: the first output is -2, indicating the signal is increasing at this position.
import numpy as np

# Input signal (1D array)
signal = np.array([1, 2, 3, 4, 5])

# Kernel (1D filter)
kernel = np.array([1, 0, -1])

# Manual convolution computation
output = []
for n in range(len(signal) - len(kernel) + 1):
    # Extract the window from the signal
    window = signal[n:n+len(kernel)]
    # Compute dot product: element-wise multiply, then sum
    value = np.sum(window * kernel)
    output.append(value)

print(f"Input signal: {signal}")
print(f"Kernel: {kernel}")
print(f"Output: {output}")

# Let's trace position n=0:
# window = [1, 2, 3]
# kernel = [1, 0, -1]
# 1*1 + 2*0 + 3*(-1) = 1 + 0 - 3 = -2

# Position n=1:
# window = [2, 3, 4]
# kernel = [1, 0, -1]
# 2*1 + 3*0 + 4*(-1) = 2 + 0 - 4 = -2

# Position n=2:
# window = [3, 4, 5]
# kernel = [1, 0, -1]
# 3*1 + 4*0 + 5*(-1) = 3 + 0 - 5 = -2

What does this kernel detect?

The kernel [1, 0, -1] computes signal[n] - signal[n+2], a discrete derivative approximation across each window. Positive output means the signal is decreasing; negative means increasing. Our signal increases by 1 per step, so the difference across two steps is always 2, giving a constant output of -2.
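NumPy ships both operations, which makes the manual loop easy to check: np.correlate matches our no-flip loop, while np.convolve flips the kernel first (a distinction discussed in detail later in this section).

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Cross-correlation: slide the kernel with no flip (what our manual loop does)
xcorr = np.correlate(signal, kernel, mode="valid")
print(xcorr)  # [-2 -2 -2]

# True convolution: NumPy flips the kernel first, so the signs flip too
conv = np.convolve(signal, kernel, mode="valid")
print(conv)   # [2 2 2]
```

The "valid" mode keeps only positions where the kernel fits entirely inside the signal, matching the output length formula above.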

Quick Check

What is the output length when convolving a signal of length 10 with a kernel of length 4?


2D Convolution for Images

Now we extend to 2D—the foundation of all image processing in deep learning. The concept is identical: slide a kernel across the input, compute weighted sums at each position.

Mathematical Definition

For a 2D image I and kernel K, the 2D convolution is:

(I * K)[i, j] = Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} I[i+m, j+n] · K[m, n]

Let's decode every symbol:

Symbol | Meaning | Typical Value
I | Input image (2D matrix) | 224×224 pixels
K | Kernel/filter (2D matrix) | 3×3 or 5×5
i, j | Output position (row, column) | i = 0..H-M, j = 0..W-N
m, n | Kernel indices (row, column) | m, n = 0..2 for a 3×3 kernel
M, N | Kernel height and width | M = N = 3 for a 3×3 kernel
* | 2D convolution operator | Produces a feature map

The Intuitive Picture

Imagine placing a small 3×3 transparent overlay on an image. Each cell of the overlay has a number (the kernel weight). At each position:

  1. Multiply each pixel under the overlay by its corresponding kernel weight
  2. Sum all 9 products
  3. Write this sum to the output at that position
  4. Slide the overlay one pixel to the right (or down)
  5. Repeat until you've covered the entire image

Visualizing the Sliding Operation

Watch the convolution operation in action. This animation shows exactly how the kernel slides across the input, computing one output value at a time:

Convolution Animation

[Interactive animation: a 3×3 kernel (the Sobel X filter: -1 0 1 / -2 0 2 / -1 0 1) slides across a 5×5 input, computing one output value of the 3×3 feature map at a time.]

Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2

Why This Works

The power of convolution comes from what the kernel weights encode:

  • Edge detection: Kernels with positive weights on one side and negative on the other detect transitions
  • Blurring: Kernels with equal positive weights average neighboring pixels
  • Sharpening: Kernels that emphasize the center relative to neighbors enhance details

Key Insight: In classical image processing, engineers hand-designed kernels for specific tasks. In deep learning, we let the network learn optimal kernel values from data. The backpropagation algorithm adjusts kernel weights to minimize the loss function.

Full CNN Pipeline: The Big Picture

Now that you understand the basic convolution operation, let's see how multiple convolution and pooling layers work together in a real CNN. This interactive visualization shows the complete flow from input image to classification:

2D Convolution: Complete Process Visualization

Watch how a CNN processes an image through convolution and pooling layers, reducing dimensions while extracting features.

CNN Architecture: Dimension Reduction Pipeline

Input 1@28×28 → Conv1 32@26×26 (K=3×3) → Pool1 32@13×13 (P=2×2) → Conv2 64@11×11 (K=3×3) → Pool2 64@5×5 (P=2×2) → Flatten 1×1600 → FC 10 units

Legend: K = kernel size, P = pool size, C@H×W = channels @ height×width

Notice how each layer transforms the data:

  • 28×28 Input: raw grayscale image (like MNIST digits)
  • Conv1 (32@26×26): 32 different 3×3 kernels extract 32 feature maps, each detecting different patterns
  • Pool1 (32@13×13): 2×2 max pooling halves spatial dimensions, keeping the strongest activations
  • Conv2 (64@11×11): 64 kernels build on previous features, learning higher-level patterns
  • Pool2 (64@5×5): further spatial reduction
  • Flatten (1600): reshape 64×5×5 = 1600 values into a 1D vector
  • FC (10): fully connected layer outputs class probabilities
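The dimension bookkeeping above can be reproduced with two small helper functions (a sketch of the output-size formula covered later in this section; the names are just labels):

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a conv layer: floor((I - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, pool):
    """Spatial output size of non-overlapping pooling (stride = pool size)."""
    return size // pool

size = 28                  # MNIST-style input
size = conv_out(size, 3)   # Conv1: 28 -> 26
size = pool_out(size, 2)   # Pool1: 26 -> 13
size = conv_out(size, 3)   # Conv2: 13 -> 11
size = pool_out(size, 2)   # Pool2: 11 -> 5
flat = 64 * size * size    # Flatten: 64 channels x 5 x 5 = 1600
print(size, flat)          # 5 1600
```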

Kernel Filtering

[Interactive: step through a 7×7 input convolved with a 3×3 kernel, filling in the 5×5 feature map one cell at a time.]

Input (7×7):

3 3 2 1 0 2 1
0 0 1 3 1 0 2
3 1 2 2 3 1 0
2 0 0 2 2 3 1
2 0 0 0 1 2 2
1 3 2 1 0 1 3
0 2 1 3 2 0 1

Kernel (3×3):

0 1 2
2 2 0
0 1 2

Feature Map (Output): 5×5, since ⌊(7 - 3 + 0) / 1⌋ + 1 = 5.
For the earlier 5×5 input:

  • Without padding (P=0): output size = (5-3)/1 + 1 = 3×3
  • With padding (P=1): output size = (5+2-3)/1 + 1 = 5×5 (same as the input!)
  • With stride=2: the kernel moves 2 pixels at a time, shrinking the output further
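A quick NumPy sketch of the padding case: a 1-pixel zero border keeps a 5×5 input at 5×5 after a 3×3 convolution. Here conv2d_valid is a throwaway helper mirroring the manual sliding-window loop used throughout this section, not a library call.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "Valid" cross-correlation: slide the kernel over every position
    # where it fits entirely inside the image
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                 # box blur

print(conv2d_valid(image, kernel).shape)       # (3, 3): no padding
padded = np.pad(image, 1)                      # 1-pixel zero border -> 7x7
print(conv2d_valid(padded, kernel).shape)      # (5, 5): same size as the input
```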

Pooling Operation

[Interactive: a 2×2 max pool with stride 2 reduces a 5×5 feature map to a 2×2 output; output size = ⌊5 / 2⌋ = 2, so there are 4 pooling steps.]

Feature Map (from Conv, 5×5):

12 12 17 17  7
10 17 19 19 17
 9  6 14 18 17
11  8  7 12 18
12 17 15  9 10

Pooled Output: 2×2
Max Pooling
  • Operation: takes the maximum value from each 2×2 window
  • Effect: keeps the strongest activations and provides a degree of translation invariance
  • Use case: the most common pooling in CNNs (VGG, ResNet, etc.)
  • Output size: ⌊5 / 2⌋ = 2, giving a 2×2 output
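Max pooling is only a few lines of NumPy. This sketch (max_pool2d is an illustrative helper, not a library function) reproduces the 2×2 result for the feature map shown above:

```python
import numpy as np

def max_pool2d(x, pool=2):
    # Non-overlapping max pooling: stride equals the pool size, and any
    # leftover rows/columns that don't fill a full window are dropped
    H, W = x.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*pool:(i+1)*pool, j*pool:(j+1)*pool].max()
    return out

fmap = np.array([
    [12, 12, 17, 17,  7],
    [10, 17, 19, 19, 17],
    [ 9,  6, 14, 18, 17],
    [11,  8,  7, 12, 18],
    [12, 17, 15,  9, 10],
])
print(max_pool2d(fmap))
# [[17. 19.]
#  [11. 18.]]
```

Note that the fifth row and column are dropped: ⌊5 / 2⌋ = 2 windows fit in each dimension.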

Output Size Formula

O = ⌊(I - K + 2P) / S⌋ + 1
O = Output size
I = Input size
K = Kernel size
P = Padding
S = Stride
Current: O = ⌊(7 - 3 + 0) / 1⌋ + 1 = 5
  • Kernel = filter = feature detector
  • Stride = step size of kernel movement
  • Padding = zero-padding around the input
  • Feature map = output of convolution

Feature Learning vs Classification

The convolution and pooling layers form the feature learning part of the network—they automatically discover useful representations. The final fully connected layers perform classification based on these learned features.

Cross-Correlation vs Convolution

There's an important subtlety that causes confusion: deep learning uses cross-correlation, but calls it convolution.

Mathematical Convolution (Signal Processing)

In mathematics and signal processing, convolution flips the kernel before sliding:

(I * K)[i, j] = Σ_m Σ_n I[i-m, j-n] · K[m, n]

The kernel is flipped both horizontally and vertically (rotated 180°). This ensures certain mathematical properties like commutativity: f * g = g * f.

Cross-Correlation (Deep Learning)

In deep learning, we use cross-correlation—no flipping:

(I ⋆ K)[i, j] = Σ_m Σ_n I[i+m, j+n] · K[m, n]

The kernel slides across the image without being flipped.

Why Does Deep Learning Use Cross-Correlation?

  1. Learned kernels adapt: Since we learn kernel weights, it doesn't matter if we flip or not—the network will learn the appropriate (possibly flipped) pattern
  2. Simpler implementation: No flip operation needed
  3. Same result: For symmetric kernels (like Gaussian blur), convolution = cross-correlation
  4. Historical convention: The deep learning community standardized on this approach

Terminology Alert

When a deep learning paper or library says "convolution," they almost always mean cross-correlation. PyTorch's nn.Conv2d performs cross-correlation. Be aware of this when reading signal processing literature.
Aspect | True Convolution | Cross-Correlation (DL)
Kernel flip | Yes (rotate 180°) | No
Commutative | Yes: f*g = g*f | No
Used in | Signal processing, math | Deep learning
PyTorch | Not the default | nn.Conv2d
Matters for learning? | No, weights adapt | No, weights adapt

Practical advice

For deep learning, ignore the flip distinction. Just think of convolution as "slide the kernel, compute weighted sums." The network learns what it needs either way.
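To see the flip distinction concretely, here is a small NumPy sketch comparing cross-correlation with true convolution on a symmetric (Gaussian) and an asymmetric (Sobel X) kernel. Both helper functions are illustrative, not library calls:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    # Slide the kernel WITHOUT flipping (what deep learning calls "convolution")
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def convolve2d(image, kernel):
    # True convolution: rotate the kernel 180 degrees, then slide
    return cross_correlate2d(image, kernel[::-1, ::-1])

rng = np.random.default_rng(0)
img = rng.standard_normal((5, 5))

gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

# Symmetric kernel: flipping changes nothing, so the two operations agree
print(np.allclose(convolve2d(img, gaussian), cross_correlate2d(img, gaussian)))  # True

# Asymmetric kernel: the flipped Sobel X is its own negation, so they differ
print(np.allclose(convolve2d(img, sobel_x), cross_correlate2d(img, sobel_x)))    # False
```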

Step-by-Step Computation

Let's work through a complete example by hand. This builds the intuition you need to debug CNNs and understand what's happening inside.

Example: 5×5 Image with 3×3 Kernel

Input image I (5×5):

📝input.txt
I = [[10, 20, 30, 40, 50],
     [20, 40, 60, 80, 100],
     [30, 60, 90, 120, 150],
     [40, 80, 120, 160, 200],
     [50, 100, 150, 200, 250]]

Kernel K (3×3 Sobel vertical-edge detector):

📝kernel.txt
K = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]

Computing Output[0,0]

Position (0,0) overlays the kernel on the top-left 3×3 region of the input:

📝calculation.txt
Window at (0,0):         Kernel:
[[10, 20, 30],           [[-1, 0, 1],
 [20, 40, 60],     ×      [-2, 0, 2],
 [30, 60, 90]]            [-1, 0, 1]]

Element-wise multiply:
[[-10, 0, 30],
 [-40, 0, 120],
 [-30, 0, 90]]

Sum all elements: -10 + 0 + 30 + (-40) + 0 + 120 + (-30) + 0 + 90 = 160

Output[0,0] = 160

Computing All Positions

The output is a 3×3 matrix (since 5-3+1=3 for both dimensions):

2D Convolution from Scratch
🐍conv2d_manual.py

Key points in the code below:

  • Input image: a 5×5 grayscale image following a multiplication-table pattern, I[i][j] = 10·(i+1)·(j+1); brightness grows toward the bottom-right (top-left = 10, bottom-right = 250).
  • Sobel X kernel: detects vertical edges by computing horizontal gradients; it subtracts left pixels from right pixels, with the center row weighted 2× (detects transitions like dark|bright).
  • Output dimensions: with no padding, output size = input size - kernel size + 1; here 5 - 3 + 1 = 3 for both height and width.
  • Nested loops: we iterate over every position (i, j) where the kernel fits entirely within the image; i, j ∈ {0, 1, 2}, giving 9 positions.
  • Window extraction: at each position, Python slicing extracts the 3×3 region the kernel currently overlays (image[1:4, 2:5] extracts rows 1-3, cols 2-4).
  • Core computation: element-wise multiply the window with the kernel, then sum all 9 products; this single number becomes one output pixel.
  • Result analysis: each output row is constant, but the rows differ (160, 240, 320). Row r of the input increases by 10·(r+1) per column, so the horizontal gradient the Sobel X kernel measures grows as we move down the image.
import numpy as np

# Input image (5x5)
I = np.array([
    [10, 20, 30, 40, 50],
    [20, 40, 60, 80, 100],
    [30, 60, 90, 120, 150],
    [40, 80, 120, 160, 200],
    [50, 100, 150, 200, 250]
], dtype=float)

# Sobel X kernel (3x3) - detects vertical edges
K = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=float)

# Compute 2D convolution (cross-correlation)
def conv2d_manual(image, kernel):
    H, W = image.shape
    kH, kW = kernel.shape

    # Output dimensions
    out_H = H - kH + 1
    out_W = W - kW + 1

    output = np.zeros((out_H, out_W))

    # Slide kernel across image
    for i in range(out_H):
        for j in range(out_W):
            # Extract window
            window = image[i:i+kH, j:j+kW]
            # Element-wise multiply and sum
            output[i, j] = np.sum(window * kernel)

    return output

output = conv2d_manual(I, K)
print("Output (3x3):")
print(output)

# Output:
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]

Why is each output row constant?

Within any single row of the input, the horizontal step is uniform (row r increases by 10·(r+1) per column), so the Sobel X response is identical at every column of that row. The step grows from row to row, however, which is why the output rows increase: 160, 240, 320. A natural image, with edges in varying places, would produce a much more varied feature map.

Quick Check

At position (1,1), which input pixels does the 3×3 kernel overlay?


Interactive Convolution Calculator

Now it's your turn! Use this interactive calculator to see exactly how convolution works. Click on any output cell to see the step-by-step calculation, or press "Animate" to watch the kernel slide across the input.

Interactive Convolution Calculator

[Interactive: the 5×5 gradient image from the worked example, convolved with the 3×3 Sobel X kernel (-1 0 1 / -2 0 2 / -1 0 1), which detects vertical edges. Click any cell of the 3×3 output to see its step-by-step calculation.]

Key Insight

The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere—this is weight sharing. Output size = Input size - Kernel size + 1 = 5 - 3 + 1 = 3×3.

Try different kernels to see how they produce different outputs:

  • Identity: Output equals input (useful for testing)
  • Vertical Edge: High response where brightness changes left-to-right
  • Horizontal Edge: High response where brightness changes top-to-bottom
  • Box Blur: Smooths the image by averaging neighbors
  • Sharpen: Enhances edges and details

Kernel Effects Gallery

Different kernel weights produce dramatically different outputs. This gallery lets you compare how various kernels transform the same input pattern.

Kernel Effect Gallery

[Interactive: apply a chosen kernel to the same input pattern and compare outputs. Shown: Sobel X (-1 0 1 / -2 0 2 / -1 0 1).]

Description: detects vertical edges by computing horizontal gradients; left pixels are subtracted from right pixels.

Formula: Gx = ∂I/∂x ≈ I(x+1) - I(x-1)

Use case in deep learning: edge detection, feature extraction in CNNs.

Kernels to compare: Identity, Sobel X (vertical edges), Sobel Y (horizontal edges), Box Blur, Gaussian Blur, Sharpen, Laplacian, Emboss.

Key Insight

Each kernel acts as a feature detector. In CNNs, instead of hand-designing these kernels, we let the network learn optimal kernels from data. The first layers often learn edge detectors similar to Sobel, while deeper layers learn more complex patterns.

What Makes Each Kernel Work?

Kernel | Weight Pattern | Why It Works
Vertical Edge (Sobel X) | Negative left, positive right | Subtracts left from right; large |value| means brightness changes horizontally
Horizontal Edge (Sobel Y) | Negative top, positive bottom | Subtracts top from bottom; large |value| means brightness changes vertically
Box Blur | All equal (1/9 each) | Averages all 9 neighbors equally, smoothing out variations
Gaussian Blur | Bell-curve weights | Weights decay with distance from center, giving a smoother blur than box
Sharpen | Large positive center, negative neighbors | Amplifies center relative to neighbors; enhances differences = edges
Laplacian | Negative center, positive neighbors | Second derivative; responds to edges regardless of direction

From Hand-Designed to Learned

Classical computer vision required experts to design these kernels. Deep learning's breakthrough: let the network learn optimal kernels via gradient descent. The first layer of a trained CNN often learns Gabor-like filters (edge detectors at various orientations)—similar to what neuroscientists find in the visual cortex!

Multi-Channel Convolution

Real images have multiple channels (RGB). How does convolution handle this? The key insight: one kernel spans ALL input channels and produces one output channel.

RGB Convolution Explained

For an RGB image:

  • Input: H × W × 3 (height × width × RGB channels)
  • One kernel: K × K × 3 (covers all 3 channels)
  • Output: H' × W' × 1 (one value per position)

The kernel has separate weights for each input channel. At each position, we compute three separate sums (one per channel), then add them together.

Multi-Channel (RGB) Convolution

[Interactive: step through each output position.]

Input Image (4×4×3 RGB):

R channel:           G channel:           B channel:
255 200 150 100      100 120 140 160       50  80 110 140
220 180 140  80       80 100 120 140       70 100 130 160
180 140 100  60       60  80 100 120       90 120 150 180
140 100  60  40       40  60  80 100      110 140 170 200

Kernel (3×3×3):

KR:          KG:          KB:
 1  0 -1     0  1  0      -1 -1 -1
 2  0 -2     1 -4  1      -1  8 -1
 1  0 -1     0  1  0      -1 -1 -1

Calculation at position (0, 0):

  • R channel: Σ(IR × KR) = 345
  • G channel: Σ(IG × KG) = 0
  • B channel: Σ(IB × KB) = 0
  • Total: 345 + 0 + 0 = 345

Output Feature Map (2×2×1):

345 380
320 320

Key Insight 1: Single Kernel Spans All Channels

One 3×3×3 kernel covers all input channels and produces one output value per position. The kernel has separate weights for R, G, and B, but their contributions are summed.

Key Insight 2: Multiple Kernels = Multiple Outputs

To produce multiple output channels (feature maps), we use multiple kernels. 64 kernels → 64 output channels. Each kernel learns different features!

Parameter Count Formula:

Parameters = KH × KW × Cin × Cout + Cout
Example: 3 × 3 × 3 × 64 + 64 = 1,792 parameters
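The widget's numbers can be reproduced in a few lines of NumPy: one kernel spans all three channels, and everything is summed into a single scalar per position.

```python
import numpy as np

# The RGB input and 3-channel kernel from the example above,
# stored channels-first: shapes (3, 4, 4) and (3, 3, 3)
image = np.array([
    [[255, 200, 150, 100], [220, 180, 140, 80],
     [180, 140, 100, 60], [140, 100, 60, 40]],        # R
    [[100, 120, 140, 160], [80, 100, 120, 140],
     [60, 80, 100, 120], [40, 60, 80, 100]],          # G
    [[50, 80, 110, 140], [70, 100, 130, 160],
     [90, 120, 150, 180], [110, 140, 170, 200]],      # B
], dtype=float)

kernel = np.array([
    [[1, 0, -1], [2, 0, -2], [1, 0, -1]],             # KR: Sobel X
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]],               # KG: Laplacian
    [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],        # KB: edge enhance
], dtype=float)

C, H, W = image.shape
_, kH, kW = kernel.shape
out = np.zeros((H - kH + 1, W - kW + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # One kernel spans ALL channels: multiply each channel's window
        # by that channel's weights, then sum everything into one scalar
        out[i, j] = np.sum(image[:, i:i+kH, j:j+kW] * kernel)

print(out)
# [[345. 380.]
#  [320. 320.]]
```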

Multiple Output Channels

To produce multiple output channels (multiple feature maps), we use multiple kernels:

64 kernels → 64 output channels (feature maps)

Each kernel learns to detect a different feature. The first layer might learn:

  • Kernel 1: Horizontal edges
  • Kernel 2: Vertical edges
  • Kernel 3: Diagonal edges (45°)
  • Kernel 4: Diagonal edges (135°)
  • Kernel 5-64: Various orientations, frequencies, colors...

General Formula

For a layer with C_in input channels and C_out output channels using K×K kernels:

Parameters = K × K × C_in × C_out + C_out

Where:

  • K × K × C_in × C_out = kernel weights
  • C_out = bias terms (one per output channel)

Example: First Conv Layer

🐍param_count.py
# Typical first conv layer: RGB input, 64 filters, 3x3 kernels
# Parameters = 3 × 3 × 3 × 64 + 64 = 1,728 + 64 = 1,792

import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
params = sum(p.numel() for p in conv1.parameters())
print(f"Parameters: {params}")  # Output: 1792

Quick Check

How many parameters does nn.Conv2d(64, 128, kernel_size=3) have?


PyTorch Implementation

Now let's see how to use convolution in PyTorch, mapping every parameter to what we've learned.

nn.Conv2d Anatomy

PyTorch Conv2d Complete Guide
🐍pytorch_conv.py

Parameter by parameter:

  • in_channels: number of channels in the input. RGB images have 3; a previous conv layer with 64 filters outputs 64.
  • out_channels: number of filters (kernels) to learn. Each filter produces one output channel (feature map); more filters = more features detected.
  • kernel_size: spatial size of each filter. 3×3 is most common (captures local patterns with few parameters); a tuple like (3, 5) gives a non-square kernel.
  • stride: how many pixels the kernel moves each step. stride=1 moves pixel by pixel; stride=2 skips every other position, roughly halving the output size.
  • padding: zeros added around the input border. padding=1 with kernel_size=3 preserves spatial dimensions; padding='same' auto-calculates.
  • bias: whether to add a learnable constant to each output channel. Usually True; set False when the layer is followed by BatchNorm (which has its own shift).

Shapes to know:

  • Weight shape: (out_channels, in_channels, kernel_H, kernel_W), e.g. [64, 3, 3, 3] means 64 kernels, each 3×3×3. This tensor holds ALL the learnable kernel weights.
  • Bias shape: one scalar per output channel, e.g. [64]. After the convolution, this constant is added to every position of that feature map.
  • Input format (NCHW): PyTorch uses (Batch, Channels, Height, Width). This differs from TensorFlow's default NHWC, so make sure your data matches. Example: [8, 3, 224, 224] is 8 RGB images at 224×224.
  • Output shape: with padding=1 and kernel_size=3, spatial dimensions are preserved; channels change from 3 to 64 and batch size is unchanged, giving [8, 64, 224, 224].
import torch
import torch.nn as nn

# Create a convolutional layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input (3 channels)
    out_channels=64,    # 64 different filters
    kernel_size=3,      # 3x3 kernels
    stride=1,           # Move 1 pixel at a time
    padding=1,          # Add 1 pixel border of zeros
    bias=True           # Include bias terms
)

# Examine the shapes
print(f"Weight shape: {conv.weight.shape}")
# Output: torch.Size([64, 3, 3, 3])
# Interpretation: 64 kernels, each 3×3×3 (3×3 spatial, 3 channels)

print(f"Bias shape: {conv.bias.shape}")
# Output: torch.Size([64])
# Interpretation: One bias per output channel

# Create a batch of images
batch_size = 8
height, width = 224, 224
images = torch.randn(batch_size, 3, height, width)
print(f"Input shape: {images.shape}")
# Output: torch.Size([8, 3, 224, 224])
# Format: (batch, channels, height, width) - NCHW format

# Forward pass
output = conv(images)
print(f"Output shape: {output.shape}")
# Output: torch.Size([8, 64, 224, 224])
# Same spatial size due to padding=1

Manual Implementation (No nn.Conv2d)

To truly understand convolution, let's implement it using only basic tensor operations:

Convolution Implementation from Scratch
🐍conv2d_from_scratch.py

What to watch for in the code:

  • Shape extraction: dimensions come from the input (batch, channels, height, width) and the weight (out_channels, in_channels, kernel_height, kernel_width), e.g. input=[2,3,8,8], weight=[16,3,3,3].
  • Padding: F.pad adds zeros around the borders; the tuple (p, p, p, p) pads (left, right, top, bottom) so the kernel can process edge pixels. An 8×8 input with padding=1 becomes 10×10.
  • Output size: (input - kernel) // stride + 1, with the input size already adjusted for padding; integer division handles the case where it doesn't divide evenly.
  • Nested loops: batch → output channel → row → column. This is O(N × C_out × H_out × W_out × C_in × kH × kW), very slow! Real implementations use matrix tricks.
  • Window coordinates: stride determines how far we jump between positions; stride=2 makes h_start go 0, 2, 4, ... instead of 0, 1, 2, ...
  • Window extraction: each window covers ALL input channels (the : in dimension 1), shape [C_in, kH, kW].
  • Core computation: multiply the window by ONE kernel (weight[c_out], shape [C_in, kH, kW]) and sum all products; a dot product producing one scalar.
  • Bias broadcasting: reshape bias from [C_out] to [1, C_out, 1, 1] so each output channel's bias is added at every position.
import torch

def conv2d_manual(input, weight, bias=None, stride=1, padding=0):
    """
    Manual 2D convolution implementation.

    Args:
        input: (N, C_in, H, W) input tensor
        weight: (C_out, C_in, kH, kW) kernel weights
        bias: (C_out,) optional bias
        stride: step size for sliding
        padding: zero-padding on each side

    Returns:
        (N, C_out, H_out, W_out) output tensor
    """
    N, C_in, H, W = input.shape
    C_out, _, kH, kW = weight.shape

    # Add padding if needed
    if padding > 0:
        input = torch.nn.functional.pad(
            input, (padding, padding, padding, padding)
        )
        H += 2 * padding
        W += 2 * padding

    # Calculate output dimensions
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    # Initialize output
    output = torch.zeros(N, C_out, H_out, W_out)

    # Perform convolution
    for n in range(N):                    # Each image in batch
        for c_out in range(C_out):        # Each output channel
            for i in range(H_out):        # Each output row
                for j in range(W_out):    # Each output column
                    # Extract window
                    h_start = i * stride
                    w_start = j * stride
                    window = input[n, :, h_start:h_start+kH, w_start:w_start+kW]

                    # Multiply and sum
                    output[n, c_out, i, j] = (window * weight[c_out]).sum()

    # Add bias
    if bias is not None:
        output += bias.view(1, -1, 1, 1)  # Broadcast bias to all positions

    return output

# Test against PyTorch
x = torch.randn(2, 3, 8, 8)  # 2 images, 3 channels, 8x8
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True)

official_output = conv(x)
manual_output = conv2d_manual(x, conv.weight, conv.bias, stride=1, padding=1)

print(f"Max difference: {(official_output - manual_output).abs().max():.10f}")
# Output: Max difference: 0.0000000000 (or a very small floating-point error)

Performance Note

This naive implementation is extremely slow. Real convolution uses optimized algorithms like im2col (reshape input so convolution becomes matrix multiplication) or FFT-based methods. PyTorch's nn.Conv2d uses cuDNN on GPU, which is 100-1000× faster.
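To make the im2col idea concrete, here is a minimal single-channel sketch (not PyTorch's actual implementation): every kernel-sized window becomes one row of a matrix, so the whole convolution collapses into a single matrix-vector product. On the 5×5 gradient image from the worked example it reproduces the same feature map:

```python
import numpy as np

def conv2d_im2col(image, kernel):
    # im2col: copy every kernel-sized window into one row of a matrix,
    # then compute all output values at once as a matrix-vector product
    H, W = image.shape
    kH, kW = kernel.shape
    out_H, out_W = H - kH + 1, W - kW + 1
    cols = np.empty((out_H * out_W, kH * kW))
    for i in range(out_H):
        for j in range(out_W):
            cols[i * out_W + j] = image[i:i+kH, j:j+kW].ravel()
    return (cols @ kernel.ravel()).reshape(out_H, out_W)

# The 5x5 gradient image and Sobel X kernel from the worked example
I = np.array([[10 * (r + 1) * (c + 1) for c in range(5)]
              for r in range(5)], dtype=float)
K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

print(conv2d_im2col(I, K))
# [[160. 160. 160.]
#  [240. 240. 240.]
#  [320. 320. 320.]]
```

With multiple output channels, the kernel vectors stack into a matrix, and the whole layer becomes one large matrix multiplication, which is exactly what GPUs are fastest at.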

Output Size Formula

The output dimensions depend on input size, kernel size, padding, and stride. Master this formula:

O = ⌊(I - K + 2P) / S⌋ + 1

Where:

Symbol | Meaning | Common Values
O | Output size (height or width) | calculated
I | Input size | 224, 32, etc.
K | Kernel size | 3, 5, 7
P | Padding (added to each side) | 0, 1, 2
S | Stride (step size) | 1, 2
⌊ ⌋ | Floor function (round down) | -

Common Scenarios

Scenario | Settings | Formula | Result
Same size (padding) | K=3, P=1, S=1 | (224-3+2)/1 + 1 | 224
Same size (padding) | K=5, P=2, S=1 | (224-5+4)/1 + 1 | 224
Halve size (stride) | K=3, P=1, S=2 | ⌊(224-3+2)/2⌋ + 1 | 112
No padding | K=3, P=0, S=1 | (224-3+0)/1 + 1 | 222
VGG style | K=3, P=1, S=1 + pool | pool halves | 112 after pool
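Each row of the table can be checked with a tiny helper implementing the formula:

```python
def conv_out(I, K, P=0, S=1):
    """Output size: floor((I - K + 2*P) / S) + 1."""
    return (I - K + 2 * P) // S + 1

print(conv_out(224, 3, P=1))        # 224: same size
print(conv_out(224, 5, P=2))        # 224: same size
print(conv_out(224, 3, P=1, S=2))   # 112: halved
print(conv_out(224, 3))             # 222: no padding
```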

Preserve dimensions recipe

To keep output size = input size with stride=1, use padding = (kernel_size - 1) / 2. For 3×3: padding=1. For 5×5: padding=2. For 7×7: padding=3.

Quick Check

What is the output size for input=64, kernel=5, padding=2, stride=2?


AI/Deep Learning Applications

The convolution operation you've learned is the foundation of modern computer vision. Here's how it's applied in cutting-edge AI systems:

Object Detection (YOLO, Faster R-CNN)

Convolutions extract hierarchical features: edges → textures → parts → objects. The network learns to detect "wheel," "headlight," and "car body" through different layers of convolution.

🐍yolo_concept.py
import torch.nn as nn

# Conceptual YOLO architecture (activations and downsampling omitted)
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),    # Edges, colors
    nn.Conv2d(64, 128, 3, padding=1),  # Textures
    nn.Conv2d(128, 256, 3, padding=1), # Parts
    nn.Conv2d(256, 512, 3, padding=1), # Objects
)
# Final layer predicts: (x, y, w, h, confidence, class)

Semantic Segmentation (U-Net)

Every pixel gets classified. Convolutions in the encoder capture context; transposed convolutions (often called deconvolutions) in the decoder restore spatial resolution. Medical imaging (tumor segmentation) relies heavily on this.

Neural Style Transfer

Convolutions capture "style" (textures, brush strokes) vs "content" (shapes, objects). By matching feature statistics from a style image to a content image, we can paint photos in the style of Van Gogh.
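The "feature statistics" that capture style are commonly summarized by the Gram matrix of a conv layer's activations: channel-to-channel correlations, independent of where in the image each texture appears. A minimal sketch (the function name `gram_matrix` is illustrative):

```python
import torch

def gram_matrix(features):
    """Channel correlation matrix of a conv feature map [N, C, H, W] -> [N, C, C]."""
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)
    # Correlate every channel with every other, normalized by element count
    return f @ f.transpose(1, 2) / (c * h * w)

feats = torch.randn(1, 64, 32, 32)   # e.g. activations from one conv layer
print(gram_matrix(feats).shape)      # torch.Size([1, 64, 64])
```

Style transfer then minimizes the difference between the Gram matrices of the generated image and the style image, while keeping content features close to the content image.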

Generative Models (StyleGAN, Diffusion)

Image generation uses convolutions in reverse: starting from noise, transposed convolutions (upsampling) progressively build images. Each conv layer adds more detail.
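The upsampling step can be sketched with `nn.ConvTranspose2d` (channel counts here are illustrative): with stride 2, kernel_size 4, and padding 1, each layer doubles the spatial resolution.

```python
import torch
import torch.nn as nn

# One decoder step: transposed convolution with stride 2 doubles H and W
# (kernel_size=4, padding=1 is a common choice for exact doubling)
up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)   # a low-resolution feature map
y = up(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

A generator stacks several such layers, growing e.g. 4×4 noise to a 64×64 or larger image.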

Key Insight: First Layer Kernels

If you visualize the learned kernels in the first conv layer of a trained network (like ResNet), you'll see:

  • Edge detectors at various orientations (0°, 45°, 90°, 135°)
  • Color blob detectors (red, green, blue regions)
  • Gabor-like filters (oriented frequency patterns)

These closely match what neuroscientists find in the primary visual cortex (V1)! The network "discovers" biologically-relevant features through gradient descent.

The Profound Insight: We didn't tell the network to learn Gabor filters. We just said "minimize classification error" and let backpropagation adjust the kernel weights. The fact that it converges to filters similar to biological neurons suggests something fundamental about optimal visual feature extraction.

Summary

You've now mastered the convolution operation—the foundation of all CNNs:

Key Concepts

| Concept | Definition | Why It Matters |
|---------|------------|----------------|
| Convolution | Sliding window weighted sum | Core feature extraction operation |
| Kernel/Filter | Small weight matrix | Learns to detect specific patterns |
| Feature Map | Convolution output | Encodes presence of patterns at each location |
| Stride | Kernel step size | Controls output resolution |
| Padding | Border zeros | Preserves spatial dimensions |

Critical Formulas

  1. 2D Convolution: $(I * K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]$
  2. Output Size: $O = \lfloor(I - K + 2P)/S\rfloor + 1$
  3. Parameters: $K \times K \times C_{in} \times C_{out} + C_{out}$
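The parameter-count formula is easy to confirm by counting the weights of an actual layer (sizes below are illustrative):

```python
import torch.nn as nn

# A 5x5 conv from 3 input channels to 32 output channels
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5)

# K*K*C_in*C_out weights plus C_out biases: 5*5*3*32 + 32 = 2432
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 2432
```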

Remember

  • Deep learning uses cross-correlation but calls it convolution
  • One kernel spans all input channels, produces one output channel
  • Multiple kernels = multiple feature maps
  • CNNs learn kernels via backpropagation, discovering optimal features

Exercises

Conceptual Questions

  1. A 128×128×3 RGB image passes through nn.Conv2d(3, 32, kernel_size=5, padding=2, stride=2). What is the output shape? How many parameters does the layer have?
  2. Why does the Sobel X kernel [[-1,0,1], [-2,0,2], [-1,0,1]] detect vertical edges rather than horizontal edges?
  3. If you wanted to preserve spatial dimensions with a 7×7 kernel and stride=1, how much padding would you need?
  4. Explain why a CNN with learned 3×3 kernels might be better than using hand-designed Sobel/Laplacian kernels for image classification.

Solution Hints for Conceptual Questions

  1. Q1: Output spatial: (128-5+4)/2+1 = 64. Output shape: [batch, 32, 64, 64]. Params: 5×5×3×32 + 32 = 2,432.
  2. Q2: It computes horizontal differences (left-right). Vertical edges ARE horizontal transitions in brightness!
  3. Q3: P = (K-1)/2 = (7-1)/2 = 3.
  4. Q4: Learned kernels adapt to the specific task, can be asymmetric, and deeper layers build on early features.

Coding Exercises

  1. Implement edge magnitude: Apply Sobel X and Sobel Y to an image, then compute edge magnitude as $\sqrt{G_x^2 + G_y^2}$.
  2. Box blur vs Gaussian: Apply both to an image and visualize the difference. Why does Gaussian look more natural?
  3. Verify the output formula: Create inputs of various sizes and verify that nn.Conv2d produces the output size you calculate.

Challenge Exercise

Implement im2col convolution: The naive nested-loop implementation is O(N × H × W × K² × C). The im2col technique reshapes the input so convolution becomes a single matrix multiplication, leveraging optimized BLAS libraries. Research and implement this technique.

im2col hint

Each output position corresponds to one row in the im2col matrix. That row contains all K×K×C values from the input that contribute to that position. The kernel weights become a column matrix. One matrix multiply computes all outputs!
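The hint above can be sketched with PyTorch's built-in `torch.nn.functional.unfold`, which builds the im2col matrix for you. This is one possible shape of the solution, assuming stride 1 and no padding (the helper name `conv2d_im2col` is my own):

```python
import torch
import torch.nn.functional as F

def conv2d_im2col(x, weight):
    """Convolution as a single matrix multiply (stride=1, no padding).

    x: [N, C, H, W], weight: [C_out, C_in, K, K]
    """
    n, c, h, w = x.shape
    c_out, _, k, _ = weight.shape
    # One column per output position, each holding the K*K*C input values
    cols = F.unfold(x, kernel_size=k)        # [N, C*K*K, L]
    w_mat = weight.view(c_out, -1)           # [C_out, C*K*K]
    out = w_mat @ cols                       # one matmul computes all outputs
    return out.view(n, c_out, h - k + 1, w - k + 1)

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
assert torch.allclose(conv2d_im2col(x, w), F.conv2d(x, w), atol=1e-5)
```

Supporting stride and padding (via `unfold`'s `stride` and `padding` arguments) is left as part of the challenge.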

In the next section, we'll explore convolution parameters—stride, padding, and dilation—in depth, seeing how they control the output size and receptive field of your CNN.