Chapter 10

Motivation for Convolutions

Convolution Operations

Introduction

We've spent the previous chapters building fully connected neural networks that can learn complex functions. These networks are universal function approximators—given enough neurons, they can theoretically learn any function. So why do we need a new architecture for images?

The short answer: images are special. They have structure that fully connected networks ignore, and exploiting this structure leads to dramatically better models. Convolutional Neural Networks (CNNs) were designed specifically to leverage the spatial structure inherent in visual data.

The Core Insight: Fully connected networks treat every input pixel as independent, ignoring the fact that nearby pixels are highly correlated and that the same patterns (edges, textures, shapes) appear throughout an image.

In this section, we'll build the motivation for convolutions from first principles. By the end, you'll understand why convolutions are not just an optimization trick, but a fundamentally better way to process visual information.


The Fully Connected Problem

What Happens When We Apply MLPs to Images?

Let's think about what happens when we try to classify images using a standard fully connected network. Consider a modest 224 × 224 RGB image—a common input size for image classification.

Counting Parameters

First, we need to flatten the image into a vector:

\text{Input size} = 224 \times 224 \times 3 = 150{,}528 \text{ pixels}

Now, if our first hidden layer has just 1,000 neurons (quite modest), we need:

\text{Parameters} = 150{,}528 \times 1{,}000 + 1{,}000 = 150{,}529{,}000

That's over 150 million parameters in just the first layer! For comparison, the entire VGG-16 network (a large CNN) has about 138 million parameters total.

| Image Size | Input Neurons | Hidden Units | Parameters (Layer 1) |
|---|---|---|---|
| 28 × 28 (MNIST) | 784 | 1,000 | 785,000 |
| 32 × 32 × 3 (CIFAR) | 3,072 | 1,000 | 3,073,000 |
| 224 × 224 × 3 (ImageNet) | 150,528 | 1,000 | 150,529,000 |
| 1920 × 1080 × 3 (HD) | 6,220,800 | 1,000 | 6,221,801,000 |

The parameter explosion

For HD images, a single fully connected layer would require over 6 billion parameters—four times the size of GPT-2! This is clearly impractical.

Problems with the Fully Connected Approach

  1. Memory and Computation: Storing and computing with billions of parameters is prohibitively expensive
  2. Overfitting: With so many parameters and relatively few training examples, the network will memorize rather than generalize
  3. Wasted Capacity: Most of these parameters learn nothing useful because they connect unrelated pixels
  4. No Spatial Awareness: The network treats pixel (0,0) and pixel (223,223) as equally related to pixel (112,112)
FC vs Conv Parameter Count

For a 224×224 RGB image, flattening gives 224 × 224 × 3 = 150,528 input neurons. nn.Linear connects every input to every output, so a 1,000-unit hidden layer needs 150,528 × 1,000 weights plus 1,000 biases—over 150 million parameters (the weight tensor has shape (1000, 150528)). By contrast, nn.Conv2d with 3 input channels, 64 output channels, and a 3×3 kernel needs only 3 × 64 × 3 × 3 + 64 = 1,792 parameters. That's roughly 84,000× fewer parameters for the same input—possible because of sparse connectivity and weight sharing.

🐍fc_params.py
```python
import torch.nn as nn

# Parameters for a fully connected network on an ImageNet-sized input
input_size = 224 * 224 * 3  # 150,528
hidden_size = 1000

fc_layer = nn.Linear(input_size, hidden_size)
print(f"Parameters: {sum(p.numel() for p in fc_layer.parameters()):,}")
# Output: Parameters: 150,529,000

# For comparison: a convolutional layer
conv_layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(f"Conv parameters: {sum(p.numel() for p in conv_layer.parameters()):,}")
# Output: Conv parameters: 1,792

# Ratio: 150,529,000 / 1,792 ≈ 84,000x more parameters!
```

Connectivity Comparison

Interactive diagram: a 4×4 input fully connected to a 4×4 output requires 16 × 16 = 256 connections, each with its own unique weight. In a fully connected layer, every input connects to every output.

Quick Check

A 512×512 RGB image is fed into a fully connected layer with 1,000 output neurons. How many parameters are in this layer (including biases)?


The Curse of Dimensionality

The parameter explosion is a symptom of a deeper problem: the curse of dimensionality. As input dimensionality grows, the amount of data needed to adequately cover the space grows exponentially.

Why More Parameters Means More Data

A rough rule of thumb in machine learning: you need at least 5-10 training examples per parameter to avoid severe overfitting. Let's apply this to our fully connected image classifier:

| Network | Parameters | Min Training Examples | Reality Check |
|---|---|---|---|
| FC on MNIST | ~800K | 4–8 million | MNIST has 60K (insufficient!) |
| FC on CIFAR | ~3M | 15–30 million | CIFAR has 50K (way insufficient!) |
| FC on ImageNet | ~150M | 750M–1.5B | ImageNet has 1.2M (nowhere close!) |

Why do MLPs work on MNIST at all?

MNIST is an unusually easy dataset. The digits are centered, normalized, and have minimal variation. Even with far less data than the rule of thumb demands, such simple patterns are learnable. Real-world images are far more complex.

The Sparsity of Image Space

Here's another way to think about the problem. A 28 × 28 grayscale image has 256^784 possible configurations. That's approximately 10^1888 possible images—vastly more than the number of atoms in the observable universe (~10^80).

Yet most of these configurations are meaningless noise. Real images occupy a tiny, highly structured subspace. A good model should encode this structure, not try to memorize every possible pixel configuration.
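As a quick sanity check of that count, we can compute the decimal exponent of 256^784 directly:

```python
import math

# Decimal exponent of 256**784, the number of possible 28x28 8-bit images
digits = 784 * math.log10(256)
print(f"about 10^{digits:.0f} images")  # about 10^1888 images
```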


Images Have Spatial Structure

The key insight that motivates convolutions: images are not random collections of pixels. They have rich spatial structure that we can exploit.

Property 1: Locality

Nearby pixels are more related than distant pixels.

When you look at a pixel in an image, its immediate neighbors tell you almost everything about it. The pixel at location (100, 100) is highly correlated with pixels at (99, 100), (101, 100), (100, 99), and (100, 101). It has almost no correlation with the pixel at (0, 0).

This suggests: don't connect every input to every output. Instead, connect each output to only a local neighborhood of inputs.

🐍locality.py
```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Simulate a "natural" image: smoothed noise has the strong local
# correlations of real photographs (raw noise would not)
np.random.seed(42)
image = gaussian_filter(np.random.randn(100, 100), sigma=3)

# Correlation between vertically adjacent pixels
adjacent_corr = np.corrcoef(image[:-1, :].flatten(),
                            image[1:, :].flatten())[0, 1]
print(f"Adjacent pixel correlation: {adjacent_corr:.4f}")
# For natural images: typically 0.9+

# Correlation between distant pixels (top-left vs. bottom-right quadrant)
distant_corr = np.corrcoef(image[:50, :50].flatten(),
                           image[50:, 50:].flatten())[0, 1]
print(f"Distant pixel correlation: {distant_corr:.4f}")
# For natural images: typically near 0
```

Property 2: Translation Invariance

The same patterns appear at different locations.

A cat's ear looks like a cat's ear whether it appears in the top-left corner, center, or bottom-right of the image. An edge is an edge regardless of where it is. This means the same feature detector should be useful everywhere in the image.

Implication: We should use the same weights (parameters) at every spatial location. This is called parameter sharing or weight tying.

Property 3: Compositionality

Complex patterns are built from simpler patterns.

A face is made of eyes, nose, and mouth. An eye is made of edges forming specific shapes. This hierarchical structure suggests that we should build layers of feature detectors, where each layer combines features from the previous layer.

| Layer | Features Detected | Receptive Field |
|---|---|---|
| Layer 1 | Edges, color blobs | 3×3 to 5×5 pixels |
| Layer 2 | Corners, textures | ~20×20 pixels |
| Layer 3 | Object parts (eyes, wheels) | ~50×50 pixels |
| Layer 4 | Whole objects | ~100×100 pixels |

Feature Hierarchy in CNNs

How neural networks build complex features from simple ones:

  • Input (Level 0, concrete): raw pixels—RGB values, grayscale intensities
  • Layer 1: edges and gradients—horizontal, vertical, and diagonal edges; color blobs
  • Layer 2: textures and patterns—corners, simple textures, gradients, color patterns
  • Layer 3: object parts—eyes, wheels, fur patterns, windows
  • Layer 4+ (Level 4, abstract): objects and scenes—faces, cars, animals, buildings

Key Insight: Each layer combines features from the previous layer. Early layers detect low-level features; deeper layers capture high-level concepts.

Quick Check

Which property of images does parameter sharing exploit?


Three Key Insights That Define Convolutions

Based on the properties of images, we can identify three key design principles that convolutions implement:

1. Sparse Connectivity (Local Receptive Fields)

Instead of connecting each output to all inputs, connect it to only a small, local region called the receptive field.

\text{FC weights: } n_{\text{in}} \times n_{\text{out}} = 150{,}528 \times 1{,}000 \approx 150\text{M}
\text{Conv weights: } k^2 \times c_{\text{in}} \times c_{\text{out}} = 3^2 \times 3 \times 64 = 1{,}728

This reduces the parameter count by a factor of roughly 84,000!

2. Parameter Sharing (Weight Tying)

Use the same weights at every spatial location. A 3 × 3 edge detector applied to the top-left uses the same 9 weights as when applied to the bottom-right.

🐍weight_sharing.py
```python
import torch

# In a fully connected layer, every connection has its own weight:
# the weights attached to input position j differ from those at position k.

# In a convolutional layer, the same kernel is applied everywhere
kernel = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32)  # Sobel edge detector

# This SAME kernel is applied at position (0,0), (0,1), (1,0), ...
# The kernel "slides" across the entire image
```

3. Translation Equivariance

Because we use the same weights everywhere, if the input shifts, the output shifts by the same amount. This property is called translation equivariance.

f(\text{shift}(x)) = \text{shift}(f(x))

If a cat moves from the left to the right of the image, the feature maps detecting "cat parts" also move correspondingly.

Equivariance vs Invariance

Equivariance: Output shifts when input shifts (convolutional layers).
Invariance: Output stays the same when input shifts (achieved by pooling or global operations).
Both are valuable: equivariance preserves spatial information, invariance enables position-independent recognition.
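Translation equivariance is easy to verify empirically. Here is a minimal sketch (not from the original text) that shifts an input with torch.roll, applies the same random kernel to both versions, and compares the outputs away from the columns affected by zero padding and circular wrap-around:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 1, 8, 8)
kernel = torch.randn(1, 1, 3, 3)

# Shift the input 2 pixels to the right (circular shift for simplicity)
shifted = torch.roll(image, shifts=2, dims=3)

out = F.conv2d(image, kernel, padding=1)
out_shifted = F.conv2d(shifted, kernel, padding=1)

# Shifting the original output matches the output of the shifted input,
# away from the border columns touched by padding and wrap-around
diff = (torch.roll(out, shifts=2, dims=3) - out_shifted)[..., 3:-3].abs().max()
print(f"Max difference away from borders: {diff:.2e}")  # ~0
```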

Convolution in Action

Now that we understand the theory, let's walk through the convolution operation step by step.

Watch the Kernel Slide

The convolution operation slides a small kernel (filter) across the input image, computing a weighted sum at each position. The example below shows the result at every position.

Convolution Animation

Input (5×5):

| 1 | 2 | 0 | 1 | 2 |
| 0 | 1 | 2 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 |
| 1 | 2 | 0 | 2 | 1 |
| 0 | 1 | 2 | 1 | 0 |

Kernel (3×3):

| -1 | 0 | +1 |
| -2 | 0 | +2 |
| -1 | 0 | +1 |

Output (3×3):

|  2 | -1 | -2 |
| -1 |  0 | -1 |
| -1 |  0 |  0 |

Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2
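You can reproduce this output with F.conv2d. (Note that deep-learning "convolution" is implemented as cross-correlation—the kernel is applied without flipping, exactly as in the worked example.)

```python
import torch
import torch.nn.functional as F

# The 5x5 input and 3x3 Sobel-x kernel from the example above
x = torch.tensor([[1, 2, 0, 1, 2],
                  [0, 1, 2, 1, 0],
                  [2, 0, 1, 0, 1],
                  [1, 2, 0, 2, 1],
                  [0, 1, 2, 1, 0]], dtype=torch.float32).view(1, 1, 5, 5)
k = torch.tensor([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]], dtype=torch.float32).view(1, 1, 3, 3)

out = F.conv2d(x, k)  # no padding: a 5x5 input gives a 3x3 output
print(out.squeeze())
# tensor([[ 2., -1., -2.],
#         [-1.,  0., -1.],
#         [-1.,  0.,  0.]])
```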

Receptive Field Growth

As we stack more convolutional layers, each output neuron "sees" a larger region of the original input. This region is called the receptive field.

Interactive demo (initial state): a 7×7 input with a 7×7 output; before any convolution is applied, each output location's receptive field is 1×1, covering 2.0% of the input. Each 3×3 layer expands it.

Receptive Field Formula: RF_n = RF_{n-1} + (k − 1) × j_{n-1}, where the "jump" j_{n-1} is the product of the strides of all earlier layers.

With 3×3 kernels and stride 1 everywhere, j = 1 and the receptive field grows by 2 pixels per layer (1 → 3 → 5 → 7)
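The recurrence above can be sketched as a small helper (receptive_field is a hypothetical function for illustration, not a library call):

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of conv layers, via the standard recurrence."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer adds (k-1) times the cumulative stride
        jump *= s
    return rf

print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7
```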

Quick Check

After 3 layers of 3×3 convolutions (no padding, stride 1), what is the receptive field size?


Biological Inspiration

The design of CNNs was heavily inspired by neuroscience, particularly the work of Hubel and Wiesel on the visual cortex (Nobel Prize, 1981).

The Visual Cortex

Key discoveries about how mammals process visual information:

  1. Simple Cells: Neurons that respond to edges at specific orientations in specific locations. Like our convolutional filters!
  2. Complex Cells: Neurons that respond to edges regardless of exact position within their receptive field. Like our pooling layers!
  3. Hierarchical Processing: Visual information flows through layers V1 → V2 → V4 → IT, with increasingly abstract representations. Like our deep CNN layers!
| Visual Cortex | CNN Equivalent | What It Does |
|---|---|---|
| Simple cells | Conv filters | Detect oriented edges at specific locations |
| Complex cells | Pooling layers | Provide local translation invariance |
| V1 → V2 → V4 → IT | Conv → Conv → Conv → FC | Build hierarchy of features |
| Receptive field | Kernel size + depth | Region of input that affects output |

Historical context

The neocognitron (Fukushima, 1980) was the first neural network explicitly inspired by Hubel and Wiesel's work. LeCun's LeNet (1989) added backpropagation training, creating the modern CNN.

Edge Detection: A Motivating Example

Before diving into the mathematics of convolution, let's see a concrete example of why local operations are powerful. Edge detection is one of the most fundamental operations in image processing.

The Sobel Filter

The Sobel filter detects edges by computing the gradient (rate of change) of pixel intensities:

Sobel Edge Detection with Convolution

The Sobel X filter detects vertical edges by computing horizontal gradients: it subtracts left pixels from right pixels, with double weight in the center row. At a black-to-white edge: -1(0) + 1(1) + -2(0) + 2(1) + -1(0) + 1(1) = 4. The Sobel Y filter is its transpose, detecting horizontal edges via vertical gradients. PyTorch's conv2d expects filters of shape (out_channels, in_channels, H, W), so each 3×3 filter is reshaped to (1, 1, 3, 3) for a single filter on a grayscale image.

We test on a 32×32 grayscale image with a sharp vertical edge at column 16: the left half is black (0), the right half white (1). F.conv2d slides each kernel across the image—padding=1 keeps the output the same size as the input—and we combine the two gradients with the Pythagorean theorem, magnitude = √(Gx² + Gy²), to get total edge strength regardless of orientation.

🐍sobel_edge.py
```python
import torch
import torch.nn.functional as F

# Sobel filters for vertical and horizontal edges
sobel_x = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32).view(1, 1, 3, 3)

sobel_y = torch.tensor([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1]
], dtype=torch.float32).view(1, 1, 3, 3)

# Create a simple test image with a vertical edge
image = torch.zeros(1, 1, 32, 32)
image[:, :, :, 16:] = 1.0  # Right half is white

# Apply Sobel filters (this IS convolution!)
edges_x = F.conv2d(image, sobel_x, padding=1)
edges_y = F.conv2d(image, sobel_y, padding=1)

# Combine for edge magnitude
edges = torch.sqrt(edges_x**2 + edges_y**2)

print(f"Input shape: {image.shape}")
print(f"Output shape: {edges.shape}")
print(f"Max edge response: {edges.max():.2f}")
```

What the Sobel Filter Computes

At each pixel location, the Sobel filter computes a weighted sum of the 3 × 3 neighborhood:

G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * I

This computes the horizontal gradient: bright pixels on the right minus bright pixels on the left. Large values indicate a vertical edge.

Why This Matters

  • Only 9 parameters: The entire operation uses just 9 numbers, not millions
  • Works everywhere: The same 9 weights detect edges anywhere in the image
  • Meaningful output: The result tells us something useful about image structure
  • Composable: We can stack edge detectors to find corners, then shapes, then objects

The CNN insight

Classical image processing uses hand-designed filters like Sobel. CNNs learn the filter weights from data. This allows them to discover the optimal features for each task.

From Fully Connected to Convolution

Let's see how convolution relates to fully connected layers, and why it's a strict generalization.

A Fully Connected Layer as a Sparse Matrix

Consider a tiny 4 × 4 image flattened to 16 pixels, connected to a 4 × 4 output (also 16 values). A fully connected layer has a 16 × 16 = 256 weight matrix where every input connects to every output.

A convolutional layer with a 3 × 3 kernel can be viewed as the same 16 × 16 matrix, but with two constraints:

  1. Sparsity: Most entries are zero (only local connections exist)
  2. Weight Sharing: Non-zero entries share the same values
Convolution as Sparse Matrix Multiplication

The code below creates a 2×2 convolution with one input and one output channel (bias=False, so only the 4 kernel weights are learnable) and manually sets the kernel to [[1, 2], [3, 4]] so we can verify the computation by hand; the weight tensor has shape (out_channels, in_channels, H, W). The input is a 4×4 image with the values 0–15 in order, reshaped to (batch=1, channels=1, 4, 4). Sliding the 2×2 kernel with stride 1 gives an output of size (4−2+1) × (4−2+1) = 3×3. At position (0,0): 1×0 + 2×1 + 3×4 + 4×5 = 34.

🐍conv_as_matrix.py
```python
import torch
import torch.nn as nn

# Create a simple 2x2 convolution on a 4x4 input
conv = nn.Conv2d(1, 1, kernel_size=2, bias=False)
conv.weight.data = torch.tensor([[[[1, 2], [3, 4]]]], dtype=torch.float32)

# Input: 4x4 image with values 0-15
x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)
print("Input:")
print(x.squeeze())

# Apply convolution
y = conv(x)
print("Output (via conv):")
print(y.squeeze())

# Verify: manual computation for position (0,0)
# output[0,0] = 1*input[0,0] + 2*input[0,1] + 3*input[1,0] + 4*input[1,1]
manual = 1*0 + 2*1 + 3*4 + 4*5
print(f"Manual computation for (0,0): {manual}")  # 34, matches y[0,0]
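To make the sparse-matrix view concrete, here is a sketch (not from the original text) that builds the dense 9×16 matrix equivalent to that 2×2 convolution and checks it against F.conv2d:

```python
import torch
import torch.nn.functional as F

# Dense matrix equivalent of a 2x2 convolution on a 4x4 input:
# each of the 9 output positions gets one row of W, with the 4 shared
# kernel values placed at the columns of its 2x2 input window
kernel = torch.tensor([[1., 2.], [3., 4.]])
W = torch.zeros(9, 16)
for i in range(3):          # output row
    for j in range(3):      # output column
        for di in range(2):
            for dj in range(2):
                W[i * 3 + j, (i + di) * 4 + (j + dj)] = kernel[di, dj]

x = torch.arange(16, dtype=torch.float32)
y_matrix = (W @ x).view(3, 3)
y_conv = F.conv2d(x.view(1, 1, 4, 4), kernel.view(1, 1, 2, 2)).squeeze()
print(torch.equal(y_matrix, y_conv))  # True
# Only 4 of 16 entries per row are nonzero, and all rows share the same 4 values
```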

The Efficiency Gain

For a 224 × 224 image with a 3 × 3 convolution:

| Approach | Parameters | Connections |
|---|---|---|
| Fully Connected | 50,176 × 50,176 = 2.5B | 2.5 billion unique |
| Convolution | 3 × 3 = 9 | ~450K (9 weights reused everywhere) |
| Savings | ~280 million× | Same expressiveness for images |

What We Give Up

The constraints of convolution mean we cannot learn certain functions:

  • Global dependencies: A single conv layer cannot relate pixels far apart
  • Position-specific processing: We cannot learn "if pixel is in top-left, do X; if bottom-right, do Y"
  • Asymmetric relationships: The relationship between positions (0,0) and (1,0) must be the same as between (50,50) and (51,50)

For images, these constraints are features, not bugs! They encode the prior knowledge that images have translation-invariant local structure.

When convolutions are wrong

For data without spatial structure (tabular data, graphs with arbitrary structure), convolutions are inappropriate. Use fully connected networks, graph neural networks, or transformers instead.

Real-World Applications

CNNs have revolutionized computer vision and beyond. Here are some real-world applications that demonstrate the power of convolutional architectures:

| Domain | Application | Example Models | Impact |
|---|---|---|---|
| Image Classification | Categorizing images into classes | ResNet, EfficientNet, ViT | ImageNet: 1000 classes, 90%+ accuracy |
| Object Detection | Locating and identifying objects | YOLO, Faster R-CNN, DETR | Real-time detection at 60+ FPS |
| Medical Imaging | Disease diagnosis from scans | U-Net, DenseNet | Detecting cancer, COVID-19, retinal diseases |
| Autonomous Vehicles | Scene understanding for self-driving | Custom CNNs, 3D convolutions | Tesla, Waymo, Cruise systems |
| Face Recognition | Identity verification | FaceNet, ArcFace | Phone unlocking, security systems |
| Image Generation | Creating new images | StyleGAN, Stable Diffusion | Art generation, deepfakes, design tools |

Industry Impact

  • Healthcare: CNNs detect diabetic retinopathy, skin cancer, and COVID-19 from X-rays with accuracy matching or exceeding human experts
  • Agriculture: Drone-mounted CNN systems identify crop diseases, estimate yields, and optimize irrigation
  • Manufacturing: Visual inspection systems detect defects in products at superhuman speed and accuracy
  • Security: Surveillance systems use CNNs for face recognition, anomaly detection, and threat identification
  • Entertainment: Video games and movies use CNNs for real-time style transfer, upscaling, and visual effects

Beyond Images

While CNNs were designed for images, they're also used for 1D signals (audio, time series) and 3D data (video, medical scans). The key insight—local patterns that repeat—applies to many domains.

Summary

We've established why convolutions are essential for processing images:

| Problem with FC | CNN Solution | Benefit |
|---|---|---|
| Too many parameters | Sparse connectivity | Tractable model size |
| Ignores locality | Local receptive fields | Exploits spatial structure |
| No weight sharing | Same kernel everywhere | Fewer parameters, better generalization |
| Position-dependent | Translation equivariance | Robust to object location |

Key Takeaways

  1. Fully connected layers have too many parameters for high-dimensional inputs like images, leading to overfitting and computational issues
  2. Images have spatial structure: locality (nearby pixels are related), translation invariance (patterns repeat), and compositionality (complex features built from simple ones)
  3. Convolutions exploit this structure through sparse connectivity, parameter sharing, and translation equivariance
  4. The design is biologically inspired by how the visual cortex processes information
  5. Convolution is a constrained fully connected layer—we trade expressiveness for efficiency and better inductive bias

Exercises

Conceptual Questions

  1. Calculate the number of parameters in the first layer of a fully connected network that takes a 64 × 64 × 3 image and outputs 512 features. Compare to a 5 × 5 convolution with 64 output channels.
  2. Why would a fully connected network struggle to classify an image of a cat in the top-left corner if it was only trained on cats in the center?
  3. Explain the difference between translation equivariance and translation invariance. Which does convolution provide, and how do we achieve the other?
  4. In what scenarios would you NOT want to use convolutions? Give at least two examples.

Solution Hints for Conceptual Questions

  1. Q1: FC: (64×64×3) × 512 + 512 = 6,291,968. Conv: (5×5×3) × 64 + 64 = 4,864. That's ~1,300× fewer parameters!
  2. Q2: Think about how FC learns position-specific patterns. The weights that detect "cat features" are tied to specific input positions.
  3. Q3: Equivariance: f(shift(x)) = shift(f(x)). Invariance: f(shift(x)) = f(x). Convolution provides equivariance; pooling provides local invariance.
  4. Q4: Tabular data (no spatial structure), graphs (arbitrary connectivity), sequences where order matters more than locality.

Coding Exercises

  1. Implement a vertical edge detector and horizontal edge detector as 3 × 3 convolutions. Apply them to a checkerboard image and visualize the results.
  2. Write code to demonstrate translation equivariance: shift an input image by 5 pixels, apply a convolution, and show that the output is also shifted by 5 pixels.
  3. Create a fully connected layer that implements a 3 × 3 convolution on a 6 × 6 input (hint: most weights will be zero, and many will be shared).

Coding Exercise Hints

  • Exercise 1: Use Sobel filters: vertical = [[-1,0,1],[-2,0,2],[-1,0,1]], horizontal = transpose. Create checkerboard with np.indices((8,8)).sum(axis=0) % 2.
  • Exercise 2: Use torch.roll() to shift the image. Apply same conv to both original and shifted. Compare outputs with another roll.
  • Exercise 3: Create a sparse weight matrix of shape (16, 36) where each row has exactly 9 non-zero entries corresponding to the 3×3 receptive field. All rows share the same 9 kernel values.

Challenge Exercise

Implement a CNN from scratch using only NumPy. Create a simple 2-layer CNN (conv → relu → conv → relu → flatten → fc) and train it on a toy dataset like a simplified MNIST (e.g., just digits 0 and 1, resized to 14×14). This will solidify your understanding of forward and backward passes through convolutional layers.

Difficulty Level

The challenge exercise is advanced and may take several hours. Start with the forward pass only, then add backpropagation. Consider implementing just the 2D convolution first before tackling the full network.

In the next section, we'll dive deep into the mathematics of the convolution operation itself, understanding exactly how kernels slide across images to produce feature maps.