Introduction
We've spent the previous chapters building fully connected neural networks that can learn complex functions. These networks are universal function approximators—given enough neurons, they can theoretically learn any function. So why do we need a new architecture for images?
The short answer: images are special. They have structure that fully connected networks ignore, and exploiting this structure leads to dramatically better models. Convolutional Neural Networks (CNNs) were designed specifically to leverage the spatial structure inherent in visual data.
The Core Insight: Fully connected networks treat every input pixel as independent, ignoring the fact that nearby pixels are highly correlated and that the same patterns (edges, textures, shapes) appear throughout an image.
In this section, we'll build the motivation for convolutions from first principles. By the end, you'll understand why convolutions are not just an optimization trick, but a fundamentally better way to process visual information.
The Fully Connected Problem
What Happens When We Apply MLPs to Images?
Let's think about what happens when we try to classify images using a standard fully connected network. Consider a modest 224 × 224 RGB image—a common input size for image classification.
Counting Parameters
First, we need to flatten the image into a vector: 224 × 224 × 3 = 150,528 input values.
Now, if our first hidden layer has just 1,000 neurons (quite modest), we need 150,528 × 1,000 weights + 1,000 biases = 150,529,000 parameters.
That's over 150 million parameters in just the first layer! For comparison, the entire VGG-16 network (a large CNN) has about 138 million parameters total.
| Image Size | Input Neurons | Hidden (1000) | Parameters (Layer 1) |
|---|---|---|---|
| 28 × 28 (MNIST) | 784 | 1,000 | 785,000 |
| 32 × 32 × 3 (CIFAR) | 3,072 | 1,000 | 3,073,000 |
| 224 × 224 × 3 (ImageNet) | 150,528 | 1,000 | 150,529,000 |
| 1920 × 1080 × 3 (HD) | 6,220,800 | 1,000 | 6,220,801,000 |
The parameter explosion
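Each entry in the table is just (flattened inputs) × (hidden units) + (hidden biases). A quick sanity check in plain Python:

```python
from math import prod

def fc_params(shape, hidden=1000):
    """Parameters in a fully connected first layer: weights + biases."""
    return prod(shape) * hidden + hidden

for name, shape in [("MNIST", (28, 28)),
                    ("CIFAR", (32, 32, 3)),
                    ("ImageNet", (224, 224, 3)),
                    ("HD", (1920, 1080, 3))]:
    print(f"{name}: {fc_params(shape):,}")
# MNIST: 785,000 / CIFAR: 3,073,000 / ImageNet: 150,529,000
# HD: 6,220,801,000
```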
Problems with the Fully Connected Approach
- Memory and Computation: Storing and computing with billions of parameters is prohibitively expensive
- Overfitting: With so many parameters and relatively few training examples, the network will memorize rather than generalize
- Wasted Capacity: Most of these parameters learn nothing useful because they connect unrelated pixels
- No Spatial Awareness: The network treats pixel (0,0) and pixel (223,223) as equally related to pixel (112,112)
Connectivity Comparison
In a fully connected layer, every input connects to every output.
Quick Check
A 512×512 RGB image is fed into a fully connected layer with 1,000 output neurons. How many parameters are in this layer (including biases)?
The Curse of Dimensionality
The parameter explosion is a symptom of a deeper problem: the curse of dimensionality. As input dimensionality grows, the amount of data needed to adequately cover the space grows exponentially.
Why More Parameters Means More Data
A rough rule of thumb in machine learning: you need at least 5-10 training examples per parameter to avoid severe overfitting. Let's apply this to our fully connected image classifier:
| Network | Parameters | Min Training Examples | Reality Check |
|---|---|---|---|
| FC on MNIST | ~800K | 4-8 million | MNIST has 60K (insufficient!) |
| FC on CIFAR | ~3M | 15-30 million | CIFAR has 50K (way insufficient!) |
| FC on ImageNet | ~150M | 750M-1.5B | ImageNet has 1.2M (nowhere close!) |
Why do MLPs work on MNIST at all?
The Sparsity of Image Space
Here's another way to think about the problem. A 28 × 28 grayscale image with 256 intensity levels has 256^784 possible configurations. That's approximately 10^1888 possible images—more than the number of atoms in the observable universe (roughly 10^80).
Yet most of these configurations are meaningless noise. Real images occupy a tiny, highly structured subspace. A good model should encode this structure, not try to memorize every possible pixel configuration.
Images Have Spatial Structure
The key insight that motivates convolutions: images are not random collections of pixels. They have rich spatial structure that we can exploit.
Property 1: Locality
Nearby pixels are more related than distant pixels.
When you look at a pixel in an image, its immediate neighbors tell you almost everything about it. The pixel at location (100, 100) is highly correlated with pixels at (99, 100), (101, 100), (100, 99), and (100, 101). It has almost no correlation with the pixel at (0, 0).
This suggests: don't connect every input to every output. Instead, connect each output to only a local neighborhood of inputs.
```python
import numpy as np

# Load any image (simulated here with random noise; a real photograph
# would show the strong local correlations described below)
np.random.seed(42)
image = np.random.randn(100, 100)

# Correlation between vertically adjacent pixels
adjacent_corr = np.corrcoef(image[:-1, :].flatten(),
                            image[1:, :].flatten())[0, 1]
print(f"Adjacent pixel correlation: {adjacent_corr:.4f}")
# For natural images: typically 0.9+ (near 0 for pure noise like this)

# Correlation between distant pixels (opposite quadrants)
distant_corr = np.corrcoef(image[:50, :50].flatten(),
                           image[50:, 50:].flatten())[0, 1]
print(f"Distant pixel correlation: {distant_corr:.4f}")
# For natural images: typically near 0
```

Property 2: Translation Invariance
The same patterns appear at different locations.
A cat's ear looks like a cat's ear whether it appears in the top-left corner, center, or bottom-right of the image. An edge is an edge regardless of where it is. This means the same feature detector should be useful everywhere in the image.
Implication: We should use the same weights (parameters) at every spatial location. This is called parameter sharing or weight tying.
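To see why sharing makes sense, here's a small sketch (hand-rolled correlation, with a hypothetical "blob" template as the detector): the same 9 weights locate the same pattern at whichever position it appears.

```python
import numpy as np

# A single 3x3 detector: a small "blob" template (illustrative values)
pattern = np.array([[0., 1., 0.],
                    [1., 2., 1.],
                    [0., 1., 0.]])

def correlate2d_valid(img, k):
    """Hand-rolled 'valid' cross-correlation (no padding)."""
    kh, kw = k.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    return np.array([[np.sum(img[i:i+kh, j:j+kw] * k)
                      for j in range(W)] for i in range(H)])

for top, left in [(0, 0), (5, 6)]:          # same pattern, two locations
    img = np.zeros((10, 10))
    img[top:top+3, left:left+3] = pattern
    response = correlate2d_valid(img, pattern)
    peak = tuple(int(v) for v in
                 np.unravel_index(response.argmax(), response.shape))
    print(peak)  # the response peaks exactly at (top, left)
```

The same nine numbers find the blob in the corner and in the middle; a fully connected layer would have to learn each position separately.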
Property 3: Compositionality
Complex patterns are built from simpler patterns.
A face is made of eyes, nose, and mouth. An eye is made of edges forming specific shapes. This hierarchical structure suggests that we should build layers of feature detectors, where each layer combines features from the previous layer.
| Layer | Features Detected | Receptive Field |
|---|---|---|
| Layer 1 | Edges, color blobs | 3×3 to 5×5 pixels |
| Layer 2 | Corners, textures | ~20×20 pixels |
| Layer 3 | Object parts (eyes, wheels) | ~50×50 pixels |
| Layer 4 | Whole objects | ~100×100 pixels |
Feature Hierarchy in CNNs
How neural networks build complex features from simple ones
Input
— Raw pixelsLayer 1
— Edges & GradientsLayer 2
— Textures & PatternsLayer 3
— Object PartsLayer 4+
— Objects & ScenesKey Insight: Each layer combines features from the previous layer. Early layers detect low-level features; deeper layers capture high-level concepts.
Quick Check
Which property of images does parameter sharing exploit?
Three Key Insights That Define Convolutions
Based on the properties of images, we can identify three key design principles that convolutions implement:
1. Sparse Connectivity (Local Receptive Fields)
Instead of connecting each output to all inputs, connect it to only a small, local region called the receptive field.
For our 224 × 224 × 3 example, each output neuron now needs only 27 weights (3 × 3 × 3) instead of 150,528: a reduction of over 5,000× from locality alone, before weight sharing shrinks the total even further.
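Back-of-the-envelope, for the 224 × 224 × 3 input from earlier (counts per output neuron, before any weight sharing):

```python
# Per-output-neuron connection counts for a 224 x 224 x 3 input
full_fan_in = 224 * 224 * 3    # fully connected: sees every pixel
local_fan_in = 3 * 3 * 3       # 3x3 receptive field across all channels
print(full_fan_in, local_fan_in, full_fan_in // local_fan_in)
# 150528 27 5575
```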
2. Parameter Sharing (Weight Tying)
Use the same weights at every spatial location. A 3×3 edge detector applied to the top-left uses the same 9 weights as when applied to the bottom-right.
```python
import torch

# In a fully connected layer, every connection has unique weights:
# fc_layer[i, j] != fc_layer[i, k] for different input positions j, k

# In a convolutional layer, the same kernel is applied everywhere
kernel = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32)  # Sobel edge detector (horizontal gradient)

# This SAME kernel is applied at position (0,0), (0,1), (1,0), ...
# The kernel "slides" across the entire image
```

3. Translation Equivariance
Because we use the same weights everywhere, if the input shifts, the output shifts by the same amount. This property is called translation equivariance.
If a cat moves from the left to the right of the image, the feature maps detecting "cat parts" also move correspondingly.
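Here's a small numerical check of equivariance (a sketch using wrap-around borders so the identity holds exactly; real conv layers behave slightly differently at image edges):

```python
import numpy as np

def circular_correlate(img, k):
    """3x3 cross-correlation with wrap-around (circular) borders."""
    out = np.zeros_like(img)
    for di in range(-1, 2):
        for dj in range(-1, 2):
            # np.roll aligns img[i+di, j+dj] with out[i, j]
            out += k[di + 1, dj + 1] * np.roll(img, (-di, -dj), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

# conv(shift(x)) equals shift(conv(x)): translation equivariance
shift_then_conv = circular_correlate(np.roll(image, (2, 3), axis=(0, 1)), kernel)
conv_then_shift = np.roll(circular_correlate(image, kernel), (2, 3), axis=(0, 1))
print(np.allclose(shift_then_conv, conv_then_shift))  # True
```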
Equivariance vs Invariance
Equivariance: Output shifts along with the input (this is what convolution itself provides).
Invariance: Output stays the same when input shifts (achieved by pooling or global operations).
Both are valuable: equivariance preserves spatial information, invariance enables position-independent recognition.
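A minimal sketch of invariance: after a global max-pool, the exact position of a feature no longer matters (real CNNs typically use local pooling, which gives only local invariance).

```python
import numpy as np

rng = np.random.default_rng(1)
feature_map = rng.standard_normal((8, 8))

# Circularly shifting the feature map rearranges values but keeps them
# all, so a global max-pool is completely unaffected by the shift.
shifted = np.roll(feature_map, (3, 5), axis=(0, 1))
print(feature_map.max() == shifted.max())  # True
```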
Interactive: Convolution in Action
Now that we understand the theory, let's see convolution in action. The interactive demos below let you explore how convolutions work step by step.
Watch the Kernel Slide
The convolution operation slides a small kernel (filter) across the input image, computing a weighted sum at each position.
Convolution Animation
Input (5×5)
Kernel (3×3)
Output (3×3)
Position (0, 0): (1×-1) + (2×0) + (0×1) + (0×-2) + (1×0) + (2×2) + (2×-1) + (0×0) + (1×1) = 2
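You can verify the demo's arithmetic directly (the patch and kernel values below are copied from the position (0, 0) step above):

```python
import numpy as np

# 3x3 input patch and kernel from the demo's position (0, 0)
patch = np.array([[1, 2, 0],
                  [0, 1, 2],
                  [2, 0, 1]])
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])
print(int(np.sum(patch * kernel)))  # 2
```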
Receptive Field Growth
As we stack more convolutional layers, each output neuron "sees" a larger region of the original input. This is called the receptive field.
Starting from a 7×7 input, each output neuron initially sees just a 1×1 region (about 2% of the image); the receptive field then expands with every convolutional layer added.
Receptive Field Formula: RF_n = RF_(n-1) + (k − 1) × j_(n-1), where j_(n-1) is the product of the strides of all layers before layer n
With 3×3 kernels and stride 1: RF grows by 2 pixels per layer (1 → 3 → 5 → 7)
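The formula translates into a few lines of code (a sketch; `kernel_sizes` and `strides` are per-layer lists):

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field after each layer, via RF += (k - 1) * jump."""
    rf, jump = 1, 1
    history = []
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # growth scales with the cumulative stride
        jump *= s              # jump = product of strides so far
        history.append(rf)
    return history

print(receptive_field([3, 3, 3], [1, 1, 1]))  # [3, 5, 7]
print(receptive_field([3, 3, 3], [2, 2, 2]))  # [3, 7, 15]
```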
Quick Check
After 3 layers of 3×3 convolutions (no padding, stride 1), what is the receptive field size?
Biological Inspiration
The design of CNNs was heavily inspired by neuroscience, particularly the work of Hubel and Wiesel on the visual cortex (Nobel Prize, 1981).
The Visual Cortex
Key discoveries about how mammals process visual information:
- Simple Cells: Neurons that respond to edges at specific orientations in specific locations. Like our convolutional filters!
- Complex Cells: Neurons that respond to edges regardless of exact position within their receptive field. Like our pooling layers!
- Hierarchical Processing: Visual information flows through layers V1 → V2 → V4 → IT, with increasingly abstract representations. Like our deep CNN layers!
| Visual Cortex | CNN Equivalent | What It Does |
|---|---|---|
| Simple cells | Conv filters | Detect oriented edges at specific locations |
| Complex cells | Pooling layers | Provide local translation invariance |
| V1 → V2 → V4 → IT | Conv → Conv → Conv → FC | Build hierarchy of features |
| Receptive field | Kernel size + depth | Region of input that affects output |
Historical context
Edge Detection: A Motivating Example
Before diving into the mathematics of convolution, let's see a concrete example of why local operations are powerful. Edge detection is one of the most fundamental operations in image processing.
The Sobel Filter
The Sobel filter detects edges by computing the gradient (rate of change) of pixel intensities. Its horizontal-gradient kernel is G_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]].
What the Sobel Filter Computes
At each pixel location, the Sobel filter computes a weighted sum of the 3×3 neighborhood: output(i, j) = Σ_u Σ_v G_x(u, v) · image(i + u, j + v), with u and v ranging over the 3×3 window.
This computes the horizontal gradient: bright pixels on the right minus bright pixels on the left. Large values indicate a vertical edge.
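To make this concrete, here's a small hand-rolled sketch (no image library): applying the horizontal Sobel kernel to a synthetic image that is dark on the left and bright on the right produces large responses only at the edge.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Synthetic image: dark left half, bright right half -> one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-rolled 'valid' cross-correlation (3x3 kernel on 6x6 -> 4x4 output)
H, W = image.shape
out = np.array([[np.sum(image[i:i+3, j:j+3] * sobel_x)
                 for j in range(W - 2)] for i in range(H - 2)])
print(out[0])  # strong response only at the edge columns: [0. 4. 4. 0.]
```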
Why This Matters
- Only 9 parameters: The entire operation uses just 9 numbers, not millions
- Works everywhere: The same 9 weights detect edges anywhere in the image
- Meaningful output: The result tells us something useful about image structure
- Composable: We can stack edge detectors to find corners, then shapes, then objects
The CNN insight
From Fully Connected to Convolution
Let's see how convolution relates to fully connected layers, and why it's a strict generalization.
A Fully Connected Layer as a Sparse Matrix
Consider a tiny 4×4 image flattened to 16 pixels, connected to a 4×4 output (also 16 values). A fully connected layer has a 16×16 weight matrix where every input connects to every output.
A convolutional layer with a 3×3 kernel (zero-padded so the output is also 4×4) can be viewed as the same 16×16 matrix, but with two constraints:
- Sparsity: Most entries are zero (only local connections exist)
- Weight Sharing: Non-zero entries share the same values
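This viewpoint can be checked directly. The sketch below (an illustrative construction; the kernel values are arbitrary) builds the 16×16 matrix for a zero-padded 3×3 convolution on a 4×4 input and verifies it matches a sliding-window computation:

```python
import numpy as np

H = W = 4
kernel = np.arange(1.0, 10.0).reshape(3, 3)  # arbitrary 3x3 kernel

# Build the 16x16 matrix: row (i*W + j) holds the kernel weights for
# output pixel (i, j), placed at the columns of its 3x3 neighborhood.
M = np.zeros((H * W, H * W))
for i in range(H):
    for j in range(W):
        for di in range(-1, 2):
            for dj in range(-1, 2):
                ii, jj = i + di, j + dj
                if 0 <= ii < H and 0 <= jj < W:
                    M[i * W + j, ii * W + jj] = kernel[di + 1, dj + 1]

# Compare against a direct sliding-window (zero-padded) convolution
x = np.random.default_rng(0).standard_normal((H, W))
padded = np.pad(x, 1)
direct = np.array([[np.sum(padded[i:i+3, j:j+3] * kernel)
                    for j in range(W)] for i in range(H)])
assert np.allclose(M @ x.flatten(), direct.flatten())

# Sparsity: corner outputs connect to 4 inputs, interior ones to 9
print(int((M[0] != 0).sum()), int((M[5] != 0).sum()))  # 4 9
```

Every non-zero entry of `M` is one of the same 9 kernel values: sparsity plus sharing, exactly as described above.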
The Efficiency Gain
For a 224 × 224 grayscale image with a 3 × 3 convolution (input and output each flattened to 50,176 values):
| Approach | Parameters | Connections |
|---|---|---|
| Fully Connected | 50,176 × 50,176 = 2.5B | 2.5 billion unique |
| Convolution | 3 × 3 = 9 | ~450K (9 reused everywhere) |
| Savings | ~280 million× | Same expressiveness for images |
What We Give Up
The constraints of convolution mean we cannot learn certain functions:
- Global dependencies: A single conv layer cannot relate pixels far apart
- Position-specific processing: We cannot learn "if pixel is in top-left, do X; if bottom-right, do Y"
- Asymmetric relationships: The relationship between positions (0,0) and (1,0) must be the same as between (50,50) and (51,50)
For images, these constraints are features, not bugs! They encode the prior knowledge that images have translation-invariant local structure.
When convolutions are wrong
Real-World Applications
CNNs have revolutionized computer vision and beyond. Here are some real-world applications that demonstrate the power of convolutional architectures:
| Domain | Application | Example Models | Impact |
|---|---|---|---|
| Image Classification | Categorizing images into classes | ResNet, EfficientNet, ViT | ImageNet: 1000 classes, 90%+ accuracy |
| Object Detection | Locating and identifying objects | YOLO, Faster R-CNN, DETR | Real-time detection at 60+ FPS |
| Medical Imaging | Disease diagnosis from scans | U-Net, DenseNet | Detecting cancer, COVID-19, retinal diseases |
| Autonomous Vehicles | Scene understanding for self-driving | Custom CNNs, 3D convolutions | Tesla, Waymo, Cruise systems |
| Face Recognition | Identity verification | FaceNet, ArcFace | Phone unlocking, security systems |
| Image Generation | Creating new images | StyleGAN, Stable Diffusion | Art generation, deepfakes, design tools |
Industry Impact
- Healthcare: CNNs detect diabetic retinopathy, skin cancer, and COVID-19 from X-rays with accuracy matching or exceeding human experts
- Agriculture: Drone-mounted CNN systems identify crop diseases, estimate yields, and optimize irrigation
- Manufacturing: Visual inspection systems detect defects in products at superhuman speed and accuracy
- Security: Surveillance systems use CNNs for face recognition, anomaly detection, and threat identification
- Entertainment: Video games and movies use CNNs for real-time style transfer, upscaling, and visual effects
Beyond Images
Summary
We've established why convolutions are essential for processing images:
| Problem with FC | CNN Solution | Benefit |
|---|---|---|
| Too many parameters | Sparse connectivity | Tractable model size |
| Ignores locality | Local receptive fields | Exploits spatial structure |
| No weight sharing | Same kernel everywhere | Fewer parameters, better generalization |
| Position-dependent | Translation equivariance | Robust to object location |
Key Takeaways
- Fully connected layers have too many parameters for high-dimensional inputs like images, leading to overfitting and computational issues
- Images have spatial structure: locality (nearby pixels are related), translation invariance (patterns repeat), and compositionality (complex features built from simple ones)
- Convolutions exploit this structure through sparse connectivity, parameter sharing, and translation equivariance
- The design is biologically inspired by how the visual cortex processes information
- Convolution is a constrained fully connected layer—we trade expressiveness for efficiency and better inductive bias
Exercises
Conceptual Questions
- Calculate the number of parameters in the first layer of a fully connected network that takes a 64 × 64 × 3 image and outputs 512 features. Compare to a 5 × 5 convolution with 64 output channels.
- Why would a fully connected network struggle to classify an image of a cat in the top-left corner if it was only trained on cats in the center?
- Explain the difference between translation equivariance and translation invariance. Which does convolution provide, and how do we achieve the other?
- In what scenarios would you NOT want to use convolutions? Give at least two examples.
Solution Hints for Conceptual Questions
- Q1: FC: (64×64×3) × 512 + 512 = 6,291,968. Conv: (5×5×3) × 64 + 64 = 4,864. That's ~1,300× fewer parameters!
- Q2: Think about how FC learns position-specific patterns. The weights that detect "cat features" are tied to specific input positions.
- Q3: Equivariance: f(shift(x)) = shift(f(x)). Invariance: f(shift(x)) = f(x). Convolution provides equivariance; pooling provides local invariance.
- Q4: Tabular data (no spatial structure), graphs (arbitrary connectivity), sequences where order matters more than locality.
Coding Exercises
- Implement a vertical edge detector and horizontal edge detector as convolutions. Apply them to a checkerboard image and visualize the results.
- Write code to demonstrate translation equivariance: shift an input image by 5 pixels, apply a convolution, and show that the output is also shifted by 5 pixels.
- Create a fully connected layer that implements a 3×3 convolution on a 6×6 input (hint: most weights will be zero, and many will be shared).
Coding Exercise Hints
- Exercise 1: Use Sobel filters: vertical = [[-1,0,1],[-2,0,2],[-1,0,1]], horizontal = transpose. Create a checkerboard with `np.indices((8,8)).sum(axis=0) % 2`.
- Exercise 2: Use `torch.roll()` to shift the image. Apply the same conv to both the original and shifted inputs, and compare the outputs with another roll.
- Exercise 3: Create a sparse weight matrix of shape (16, 36) where each row has exactly 9 non-zero entries corresponding to the 3×3 receptive field. All rows share the same 9 kernel values.
Challenge Exercise
Implement a CNN from scratch using only NumPy. Create a simple 2-layer CNN (conv → relu → conv → relu → flatten → fc) and train it on a toy dataset like a simplified MNIST (e.g., just digits 0 and 1, resized to 14×14). This will solidify your understanding of forward and backward passes through convolutional layers.
In the next section, we'll dive deep into the mathematics of the convolution operation itself, understanding exactly how kernels slide across images to produce feature maps.