Boo-AI — Master Artificial Intelligence by Building from Scratch

The Mission: Build a Real Image Classifier

Over the last two chapters, you learned how convolutions detect patterns in images (Chapter 13) and how to assemble them into CNN architectures (Chapter 14). Now it's time to put it all together. In this three-section project, you will build a complete image classifier from scratch — from raw pixels to a model that identifies objects with over 90% accuracy.

This section covers the crucial first step that most tutorials skip over: preparing your data properly. A model is only as good as the data you feed it. Get this wrong, and no amount of architecture tuning will save you. Get it right, and even a simple CNN performs surprisingly well.

What you'll learn in this section:
How CIFAR-10 stores 60,000 tiny images as raw numbers
Why we normalize pixel values and the exact math behind it
How data augmentation creates “virtual” training examples to fight overfitting
Building efficient DataLoaders with proper train/validation/test splits

Meet CIFAR-10: Your Training Ground

CIFAR-10 is one of the most widely used benchmark datasets in computer vision. Collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton and described in Krizhevsky's 2009 technical report, it takes its name from the Canadian Institute for Advanced Research (CIFAR), which funded the work. It contains 60,000 color images split evenly across 10 everyday object categories. It sits in a sweet spot: complex enough that a simple logistic regression fails (~40% accuracy), yet small enough that you can train a CNN on a laptop in minutes.

[Interactive figure: CIFAR-10 Dataset Explorer. 60,000 tiny color images across 10 classes (50,000 training, 10,000 testing), each 32 × 32 px with 3 RGB channels. One image has 1,024 pixels per channel and 3,072 values in total; a modern phone photo of about 4000 × 3000 = 12 million pixels has roughly 12,000× as many pixels.]

Each image is just 32 × 32 pixels with 3 color channels (RGB). That's only $32 \times 32 \times 3 = 3{,}072$ values per image. To put this in perspective, a single iPhone photo has about 12 million pixels, roughly 12,000 times as many as the 1,024 pixels in a CIFAR-10 image. Yet a well-trained CNN can look at these 3,072 numbers and correctly tell you whether it's a photo of a cat or an airplane. That's the power of learned feature extraction.

Property	Value
Total images	60,000 (50,000 train + 10,000 test)
Image size	32 × 32 pixels, 3 channels (RGB)
Classes	10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
Images per class	6,000 (5,000 train + 1,000 test) — perfectly balanced
Pixel values	Integers in [0, 255]
File format	Python pickle (binary serialized NumPy arrays)

CIFAR-10's balanced class distribution means a random-guessing baseline achieves 10% accuracy (1 in 10). Any model worth its weights should beat this easily. A simple fully-connected network reaches ~55%. Our CNN will aim for 90%+.
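The balance claim and the 10% baseline are easy to check once you have a label list. The sketch below uses a synthetic label list as a stand-in for the real one, and the helper names `class_balance` and `majority_baseline` are made up for this illustration.

```python
from collections import Counter

CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]


def class_balance(labels):
    """Return {class_name: count} for a sequence of integer labels in [0, 9]."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(CLASSES)}


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Synthetic stand-in for the real labels: 5,000 per class,
# matching the training-set balance described above.
synthetic_labels = [i for i in range(10) for _ in range(5000)]

balance = class_balance(synthetic_labels)
baseline = majority_baseline(synthetic_labels)
print(balance["frog"], baseline)
```

On a perfectly balanced dataset the majority-class baseline equals the random-guessing baseline. On an imbalanced one it would be higher, which is why it is worth computing before trusting an accuracy number.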

Loading and Exploring the Data

Before building any model, you must understand your data at the lowest level. What shape are the arrays? What range are the values? How are the channels arranged? Let's start with raw Python to see exactly how CIFAR-10 stores its images, then switch to PyTorch's convenient torchvision interface.

The Raw Binary Format

CIFAR-10 distributes its data as Python pickle files. Each batch file contains a dictionary with the raw pixel data (a flat NumPy array) and labels (a list of integers). The pixel layout is unusual: all red values come first, then all green, then all blue — not the interleaved RGB-per-pixel format you might expect.

Exploring CIFAR-10 with Pure Python

🐍explore_cifar10.py

import numpy as np

NumPy is a numerical computing library for Python. We use it here for fast array operations — reshaping the flat pixel data into image format and inspecting value ranges. All array operations run as optimized C code, not slow Python loops.

EXECUTION STATE

numpy = Numerical computing library — provides ndarray (N-dimensional array), mathematical functions, and fast element-wise operations

as np = Standard alias so we write np.array() instead of numpy.array()

import pickle

pickle is Python’s built-in serialization module. CIFAR-10’s raw download format stores data as serialized Python dictionaries. pickle.load() deserializes the binary file back into a Python dict containing the image arrays and labels.

EXECUTION STATE

📚 pickle = Python standard library module for serializing/deserializing Python objects to binary files. The CIFAR-10 dataset uses this format for its batch files.

with open(...) as f:

Opens the first of five training batch files in binary read mode. CIFAR-10 splits its 50,000 training images into five batches of 10,000 each (data_batch_1 through data_batch_5). The 'with' statement ensures the file is automatically closed when done.

EXECUTION STATE

"cifar-10-batches-py/data_batch_1" = Path to the first training batch file. After downloading and extracting CIFAR-10, this directory contains: data_batch_1..5 (training), test_batch (testing), batches.meta (class names)

"rb" = Read Binary mode — required because pickle files contain raw bytes, not human-readable text

batch = pickle.load(f, encoding="bytes")

Deserializes the binary file into a Python dictionary. The encoding="bytes" parameter is needed because CIFAR-10 was created with Python 2 where dictionary keys are byte strings (b"data") rather than regular strings.

EXECUTION STATE

📚 pickle.load(file, encoding) = Reads a pickle file and reconstructs the original Python object. Returns whatever was serialized — in this case, a dictionary.

⬇ arg: f = The open file handle from the 'with' statement above

⬇ arg: encoding="bytes" = Tells pickle to keep dictionary keys as byte strings (b"data") rather than trying to decode them. Required for Python 2 pickles loaded in Python 3.

⬆ return: batch = A dict with keys: b"data" (pixel arrays), b"labels" (class integers), b"filenames" (original filenames), b"batch_label" ("training batch 1 of 5")

images = batch[b"data"]

Extracts the raw pixel data from the dictionary. Each image is stored as a flat 1D array of 3,072 values: 1,024 red pixels, then 1,024 green, then 1,024 blue. All 10,000 images are stacked into one 2D array.

EXECUTION STATE

b"data" = Byte string key — the 'b' prefix creates a bytes literal. Required because CIFAR-10 was pickled with Python 2.

images = np.ndarray of shape (10000, 3072) — each row is one image’s pixels stored flat: [R0, R1, ..., R1023, G0, G1, ..., G1023, B0, B1, ..., B1023]

→ 3072 = 3 × 32 × 32 = 3 color channels × 32 rows × 32 columns = 3,072 values per image

labels = batch[b"labels"]

Extracts the class labels. Each label is an integer from 0 to 9 corresponding to one of the 10 CIFAR-10 classes. The mapping is: 0=airplane, 1=automobile, 2=bird, 3=cat, 4=deer, 5=dog, 6=frog, 7=horse, 8=ship, 9=truck.

EXECUTION STATE

labels = Python list of 10,000 integers, each in range [0, 9]. Example first few: [6, 9, 9, 4, 1, 1, 2, 7, 8, 3] — the first image is a frog (6).

print(f"Raw shape: {images.shape}")

Prints the dimensions of the raw pixel array. The output confirms 10,000 images, each stored as 3,072 flat values.

EXECUTION STATE

images.shape = (10000, 3072) — 10,000 images, each with 3,072 pixel values

print(f"Pixel range: [{images.min()}, {images.max()}]")

Checks the minimum and maximum pixel values across the entire array. Raw CIFAR-10 uses standard 8-bit unsigned integers: 0 (black) to 255 (maximum brightness).

EXECUTION STATE

📚 np.ndarray.min() = Returns the single smallest element in the entire array. Scans all 10,000 × 3,072 = 30.72 million values.

📚 np.ndarray.max() = Returns the single largest element. Together with min(), this tells us the data range.

⬆ output = Pixel range: [0, 255] — standard 8-bit unsigned integer range

images = images.reshape(-1, 3, 32, 32)

Reshapes the flat 2D array into a 4D tensor with explicit channel, height, and width dimensions. The -1 tells NumPy to infer the first dimension (10,000) automatically. This transforms 3,072 flat values into a structured (3, 32, 32) image per sample.

EXECUTION STATE

📚 ndarray.reshape(*shape) = Returns a new view of the array with a different shape. Total elements must be the same: 10000 × 3072 = 10000 × 3 × 32 × 32 = 30,720,000 ✓

⬇ arg: -1 = Wildcard: NumPy computes this dimension automatically. 30,720,000 / (3 × 32 × 32) = 10,000

⬇ arg: 3 = 3 color channels: Red, Green, Blue

⬇ arg: 32, 32 = Image height and width in pixels

⬆ result shape = (10000, 3, 32, 32) — (N, C, H, W) format. This is the standard PyTorch convention: batch × channels × height × width.

print(f"Reshaped: {images.shape}")

Confirms the reshape succeeded. The (N, C, H, W) layout matches PyTorch’s convention where channels come before spatial dimensions.

EXECUTION STATE

⬆ output = Reshaped: (10000, 3, 32, 32)

print(f"RGB at [0,0]: {images[0, :, 0, 0]}")

Extracts the RGB values of the top-left pixel from the first image. images[0, :, 0, 0] means: first image (0), all 3 channels (:), row 0, column 0. This gives us the Red, Green, and Blue intensity of one pixel.

EXECUTION STATE

images[0, :, 0, 0] = Basic indexing with integers and a slice: [image_index, channels, row, col]. The ':' selects all 3 channels for that single pixel position.

── Indexing breakdown ──

0 (first dim) = First image in the batch (a frog)

: (second dim) = All 3 channels, giving [R, G, B]

0, 0 (third, fourth) = Top-left pixel: row 0, column 0

⬆ output = RGB at [0,0]: [59 62 63] (a dark gray pixel; low values = dark)

import numpy as np
import pickle

# Load one CIFAR-10 batch (raw binary format)
with open("cifar-10-batches-py/data_batch_1", "rb") as f:
    batch = pickle.load(f, encoding="bytes")

images = batch[b"data"]
labels = batch[b"labels"]

print(f"Raw shape: {images.shape}")
print(f"Pixel range: [{images.min()}, {images.max()}]")

# Reshape: 3072 = 3 channels x 32 height x 32 width
images = images.reshape(-1, 3, 32, 32)
print(f"Reshaped: {images.shape}")

# First image, first pixel (top-left corner)
print(f"RGB at [0,0]: {images[0, :, 0, 0]}")

The key insight: a single CIFAR-10 image lives as 3,072 flat numbers. The reshape to $(N, C, H, W)$ gives structure to this flat data — separating the 3 color channels and the 32 × 32 spatial grid. This is exactly the format that PyTorch CNN layers (like nn.Conv2d) expect as input.
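To see why the reshape has to split channels first, here is a minimal sketch on a synthetic 2 × 2 "image" stored in the same planar order (all red, then all green, then all blue). The array values are invented for illustration. It also shows the transpose to channel-last order that plotting libraries usually expect.

```python
import numpy as np

# Synthetic stand-in for one CIFAR-10 row: a 2x2 image stored planar,
# i.e. all red values, then all green, then all blue.
H, W = 2, 2
flat = np.array([10, 11, 12, 13,      # red plane
                 20, 21, 22, 23,      # green plane
                 30, 31, 32, 33],     # blue plane
                dtype=np.uint8)

# Correct: split into channel planes first, giving (C, H, W).
chw = flat.reshape(3, H, W)

# For tools that expect channel-last data, move C to the end: (H, W, C).
hwc = chw.transpose(1, 2, 0)

# Incorrect: reshaping straight to (H, W, C) treats the data as interleaved
# and packs three consecutive red values into one "pixel".
scrambled = flat.reshape(H, W, 3)

print(hwc[0, 0])        # top-left pixel as (R, G, B)
print(scrambled[0, 0])  # three red values, not one pixel
```

The same pattern applies to the real data: reshape to (N, 3, 32, 32) first, then transpose to (N, 32, 32, 3) only when a channel-last consumer needs it.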

The PyTorch Way

In practice, you rarely load CIFAR-10 manually. PyTorch's torchvision.datasets handles downloading, caching, and applying transforms automatically. The transforms.ToTensor() transform does two things: reorders dimensions from $(H, W, C)$ to $(C, H, W)$ and rescales pixel values from $[0, 255]$ integers to $[0.0, 1.0]$ floats.

Loading CIFAR-10 with PyTorch

🐍load_cifar10_pytorch.py

import torch

PyTorch is a deep learning framework that provides tensor computation with GPU acceleration and automatic differentiation. It is the foundation for all model training in this project.

EXECUTION STATE

torch = Core PyTorch library — provides Tensor (like NumPy ndarray but with GPU support and autograd), nn (neural network layers), optim (optimizers)

import torchvision

torchvision extends PyTorch with computer vision utilities: popular datasets (CIFAR-10, ImageNet, MNIST), pretrained models (ResNet, VGG), and image transforms. It handles downloading, caching, and preprocessing automatically.

EXECUTION STATE

torchvision = Computer vision library for PyTorch — provides datasets, models, and transforms. Saves you from manually downloading and parsing image data.

import torchvision.transforms as transforms

The transforms module contains image preprocessing operations: resizing, cropping, normalization, augmentation, and conversion between PIL Images and PyTorch Tensors. We alias it as 'transforms' for brevity.

EXECUTION STATE

torchvision.transforms = Module with composable image transformations. Key classes: ToTensor(), Normalize(), RandomHorizontalFlip(), RandomCrop(), Compose()

basic_transform = transforms.Compose([...])

Creates a pipeline of image transformations that will be applied sequentially to every image when it is loaded. Compose chains transforms: the output of each becomes the input to the next. Here we have just one transform, but later we will chain several.

EXECUTION STATE

📚 transforms.Compose(list) = Takes a list of transforms and returns a single callable that applies them in order. Example: Compose([A, B, C]) produces output = C(B(A(input)))

⬇ arg: transform list = [transforms.ToTensor()] — a list with one transform. The pipeline currently only converts PIL Images to Tensors.

transforms.ToTensor()

Converts a PIL Image (H, W, C) with values [0, 255] into a PyTorch Tensor (C, H, W) with values [0.0, 1.0]. Two critical changes happen: (1) channels move from last to first dimension, and (2) pixel values are divided by 255 to normalize into [0, 1].

EXECUTION STATE

📚 transforms.ToTensor() = PIL Image → torch.FloatTensor. Shape: (H, W, C) → (C, H, W). Values: [0, 255] integers → [0.0, 1.0] floats (divides by 255).

→ Why channel-first? = PyTorch convention is (batch, channels, height, width). GPU convolution kernels are optimized for this layout (NCHW format). NumPy/PIL use channel-last (HWC).

→ Example = PIL pixel [128, 64, 255] → Tensor [128/255, 64/255, 255/255] = [0.502, 0.251, 1.000]

trainset = torchvision.datasets.CIFAR10(...)

Creates a CIFAR-10 dataset object. On first run with download=True, it downloads ~170 MB of data from the internet and extracts it to the root directory. On subsequent runs, it loads from the local cache. Every time you access trainset[i], the transform pipeline is applied to that image.

EXECUTION STATE

📚 torchvision.datasets.CIFAR10 = Dataset class for CIFAR-10. Handles downloading, caching, loading individual images, and applying transforms. Implements __len__ and __getitem__ for indexing.

⬇ arg: root="./data" = Directory where CIFAR-10 files are stored/downloaded. Creates ./data/cifar-10-batches-py/ containing the batch files.

⬇ arg: train=True = Load the 50,000-image training set. Set to False for the 10,000-image test set.

⬇ arg: download=True = Download CIFAR-10 if not already present in root. Safe to leave True — it skips the download if files already exist.

⬇ arg: transform=basic_transform = The transform pipeline to apply to every image. Each call to trainset[i] loads a PIL Image, applies this pipeline, and returns the result.

print(f"Dataset size: {len(trainset)}")

CIFAR-10’s training set contains exactly 50,000 images (5,000 per class × 10 classes). The len() function calls trainset.__len__() under the hood.

EXECUTION STATE

len(trainset) = 50000 — total training images

print(f"Classes: {trainset.classes}")

The .classes attribute lists all 10 category names in label order. This means label integer 0 maps to 'airplane', label 1 to 'automobile', and so on.

EXECUTION STATE

trainset.classes = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

→ mapping = Label 0='airplane', 1='automobile', 2='bird', 3='cat', 4='deer', 5='dog', 6='frog', 7='horse', 8='ship', 9='truck'

image, label = trainset[0]

Accesses the first image-label pair. Behind the scenes, trainset.__getitem__(0) takes image 0 from the NumPy array that the dataset loaded into memory when it was constructed, converts it to a PIL Image, applies basic_transform (ToTensor), and returns the tensor plus its integer label. This is the same mechanism DataLoader uses internally.

EXECUTION STATE

image = torch.Tensor of shape (3, 32, 32) and dtype float32. Values in [0.0, 1.0] because ToTensor divides by 255.

label = 6 — integer class label (6 = 'frog'). The first image in CIFAR-10 training set is a frog.

print(f"Tensor shape: {image.shape}")

Confirms the tensor is in PyTorch’s (C, H, W) format: 3 channels (RGB), 32 rows, 32 columns. This is the format every PyTorch CNN layer expects.

EXECUTION STATE

image.shape = torch.Size([3, 32, 32]) — (channels, height, width)

print(f"Range: [{image.min():.3f}, {image.max():.3f}]")

Verifies that ToTensor successfully scaled pixel values from [0, 255] integers to [0.0, 1.0] floats. The .3f format shows 3 decimal places.

EXECUTION STATE

image.min() = 0.000 — the darkest pixel (was 0/255 = 0.0)

image.max() = 1.000 — the brightest pixel (was 255/255 = 1.0)

⬆ output = Range: [0.000, 1.000]

print(f"Label: {label} = {trainset.classes[label]}")

Translates the integer label into a human-readable class name by indexing into the classes list. This pattern — classes[label] — is how you convert model predictions back to class names.

EXECUTION STATE

label = 6

trainset.classes[6] = "frog"

⬆ output = Label: 6 = frog

import torch
import torchvision
import torchvision.transforms as transforms

# Basic transform: PIL Image -> Tensor [0.0, 1.0]
basic_transform = transforms.Compose([
    transforms.ToTensor(),
])

# Download and load CIFAR-10
trainset = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=basic_transform,
)

print(f"Dataset size: {len(trainset)}")
print(f"Classes: {trainset.classes}")

# Get one sample
image, label = trainset[0]
print(f"Tensor shape: {image.shape}")
print(f"Range: [{image.min():.3f}, {image.max():.3f}]")
print(f"Label: {label} = {trainset.classes[label]}")

Quick Check: After ToTensor(), what is the tensor shape and value range for a single CIFAR-10 image?

Answer: Shape: (3, 32, 32) in (C, H, W) format. Values: [0.0, 1.0] floats. ToTensor divides each pixel by 255 and moves channels from last to first dimension.
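The two effects of ToTensor() can be reproduced in a few lines of NumPy. This is only a sketch of the behavior described above for uint8 RGB input, not torchvision's implementation. The function name `to_tensor_like` and the random test image are assumptions made for the example.

```python
import numpy as np


def to_tensor_like(image_hwc):
    """Mimic the two effects of ToTensor on a uint8 (H, W, C) array."""
    if image_hwc.dtype != np.uint8:
        raise TypeError("expected uint8 pixel data")
    chw = image_hwc.transpose(2, 0, 1)        # (H, W, C) -> (C, H, W)
    return chw.astype(np.float32) / 255.0     # [0, 255] -> [0.0, 1.0]


# Synthetic 32x32 RGB image; the top-left pixel reuses the example
# values [128, 64, 255] from the walkthrough above.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
image[0, 0] = [128, 64, 255]

tensor_like = to_tensor_like(image)
print(tensor_like.shape, tensor_like.dtype)
print(np.round(tensor_like[:, 0, 0], 3))
```

Reading `tensor_like[:, 0, 0]` gives the three channel values of the top-left pixel, which should match the 0.502, 0.251, 1.000 example from the walkthrough.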

Normalization: Centering Your Pixels

After ToTensor(), our pixels are in $[0, 1]$. That's better than $[0, 255]$, but still not ideal for training. The problem is that different channels have different average brightness and spread. The red channel averages around 0.49, while blue averages around 0.45. This asymmetry forces the optimizer to compensate with unequal weight updates, slowing down convergence.

The fix is per-channel normalization. For each channel $c$, we compute the mean $\mu_c$ and standard deviation $\sigma_c$ across all training images, then transform each pixel:

$x_{\text{norm}} = \frac{x - \mu_c}{\sigma_c}$

This centers each channel around zero and scales it to approximately unit variance. After normalization, a pixel at the channel mean becomes 0.0, a pixel one standard deviation above the mean becomes +1.0, and a pixel one standard deviation below becomes -1.0. Most values fall in the range $[-2, +2]$.

Computing the Statistics

Where do the normalization constants come from? We compute them from the training data itself. This is a one-time computation that produces six numbers: three means and three standard deviations, one for each RGB channel.

Computing Per-Channel Normalization Statistics

🐍compute_normalization.py

import torch

PyTorch core library. We need it here for DataLoader (to batch-load all images) and tensor operations (.mean(), .std()).

EXECUTION STATE

torch = Core PyTorch — provides Tensor, DataLoader, and mathematical operations

from torchvision import datasets, transforms

Imports the datasets and transforms modules directly, so we can write datasets.CIFAR10 and transforms.ToTensor without the full torchvision prefix.

EXECUTION STATE

datasets = torchvision.datasets module — contains CIFAR10, MNIST, ImageNet, etc.

transforms = torchvision.transforms module — contains ToTensor, Normalize, etc.

trainset = datasets.CIFAR10(...)

Loads the training set with just ToTensor (no normalization yet). We need the raw [0, 1] values to compute the dataset’s true mean and standard deviation. You cannot normalize before computing the statistics — that would be circular.

EXECUTION STATE

→ Why no Normalize? = We are computing the normalization parameters FROM this data. Applying Normalize first would give us the stats of already-normalized data — useless.

loader = torch.utils.data.DataLoader(...)

Creates a DataLoader that will batch all 50,000 images into a single giant batch. This is a trick to get all images into one tensor without a manual loop. In production training, you would never use batch_size=50000 — it would exhaust GPU memory.

EXECUTION STATE

📚 torch.utils.data.DataLoader = Wraps a dataset and provides batched iteration. Handles batching, shuffling, parallel loading, and collation of individual samples into tensors.

⬇ arg: trainset = The CIFAR-10 dataset object. DataLoader calls trainset[i] for each sample and stacks results into a batch.

⬇ arg: batch_size=50000 = Load ALL 50,000 images in one batch. This is only for computing statistics — never for training (would OOM on GPU). Works here because CIFAR-10 is small: 50000 × 3 × 32 × 32 × 4 bytes = ~600 MB.

⬇ arg: shuffle=False = Don’t randomize order. For statistics computation, order doesn’t matter — mean is the same regardless of order. But False is slightly faster.

all_images, all_labels = next(iter(loader))

Pulls the single batch from the DataLoader. iter(loader) creates an iterator, and next() grabs the first (and only) batch. This is a common one-liner to extract one batch from any DataLoader.

EXECUTION STATE

📚 iter(loader) = Creates a Python iterator from the DataLoader. Each call to next() returns one batch.

📚 next() = Gets the next item from the iterator. Since batch_size=50000 = full dataset, there’s only one batch.

all_images = torch.Tensor of shape (50000, 3, 32, 32) — all training images stacked

all_labels = torch.Tensor of shape (50000,) — all training labels

print(f"Shape: {all_images.shape}")

Confirms we have all 50,000 images in a single 4D tensor. The shape (50000, 3, 32, 32) means: 50K images, each with 3 channels, each channel is 32×32 pixels.

EXECUTION STATE

⬆ output = Shape: torch.Size([50000, 3, 32, 32])

→ memory = 50000 × 3 × 32 × 32 × 4 bytes (float32) = 614.4 MB in RAM

mean = all_images.mean(dim=[0, 2, 3])

Computes the mean pixel value for each color channel independently. dim=[0, 2, 3] means: average over dimension 0 (images), dimension 2 (height), and dimension 3 (width). Only dimension 1 (channels) survives, giving us one mean per channel.

EXECUTION STATE

📚 tensor.mean(dim=...) = Computes arithmetic mean along specified dimensions, reducing those dimensions. The result has shape equal to the surviving dimensions.

⬇ arg: dim=[0, 2, 3] = Average over: dim 0 (50000 images), dim 2 (32 rows), dim 3 (32 cols). Surviving: dim 1 (3 channels). We average 50000 × 32 × 32 = 51,200,000 values per channel.

⬆ result: mean = tensor([0.4914, 0.4822, 0.4465]) — one mean per channel (R, G, B)

→ interpretation = Red channel: average pixel is 0.4914 (49.1% brightness). Green: 0.4822. Blue: 0.4465 (darkest channel — CIFAR-10 images tend to have warm tones).

std = all_images.std(dim=[0, 2, 3])

Computes the standard deviation for each channel, measuring how spread out pixel values are. Same dimension logic as mean: average the variance over images, rows, and columns, keep channels separate.

EXECUTION STATE

📚 tensor.std(dim=...) = Computes standard deviation along specified dimensions. Uses Bessel’s correction (divides by N-1) by default for unbiased estimation.

⬆ result: std = tensor([0.2470, 0.2435, 0.2616]) — one std per channel (R, G, B)

→ interpretation = All three channels have similar spread (~0.25). This means pixel values are roughly distributed over [mean ± 2×std] ≈ [0.0, 1.0] for all channels.

print(f"Mean (R, G, B): {mean}")

Displays the per-channel means. These three numbers are the CIFAR-10 normalization constants you’ll see in every CIFAR-10 tutorial and paper. They are dataset-specific — ImageNet has different values.

EXECUTION STATE

⬆ output = Mean (R, G, B): tensor([0.4914, 0.4822, 0.4465])

print(f"Std (R, G, B): {std}")

Displays the per-channel standard deviations. Together with the means, these six numbers fully characterize CIFAR-10’s pixel distribution and are used in transforms.Normalize() for training.

EXECUTION STATE

⬆ output = Std (R, G, B): tensor([0.2470, 0.2435, 0.2616])

import torch
from torchvision import datasets, transforms

# Load all training images as tensors
trainset = datasets.CIFAR10(
    root="./data", train=True,
    transform=transforms.ToTensor(),
)

# Stack all 50,000 images into one big tensor
loader = torch.utils.data.DataLoader(
    trainset, batch_size=50000, shuffle=False,
)
all_images, all_labels = next(iter(loader))
print(f"Shape: {all_images.shape}")

# Per-channel mean: average over (N, H, W) dims
mean = all_images.mean(dim=[0, 2, 3])
std = all_images.std(dim=[0, 2, 3])

print(f"Mean (R, G, B): {mean}")
print(f"Std  (R, G, B): {std}")
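Loading everything in one batch works for CIFAR-10 but would not scale to a dataset that does not fit in memory. One alternative, sketched here in NumPy on synthetic data, is to accumulate per-channel sums and sums of squares batch by batch. The function name `streaming_channel_stats` and the chunk size of 100 are assumptions for the illustration. In practice the chunks would come from a DataLoader with an ordinary batch size.

```python
import numpy as np


def streaming_channel_stats(batches):
    """Per-channel mean and sample std over an iterable of (N, C, H, W) arrays."""
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        x = batch.astype(np.float64)          # accumulate in float64 for accuracy
        if total is None:
            total = np.zeros(x.shape[1])
            total_sq = np.zeros(x.shape[1])
        total += x.sum(axis=(0, 2, 3))
        total_sq += (x ** 2).sum(axis=(0, 2, 3))
        count += x.shape[0] * x.shape[2] * x.shape[3]   # values seen per channel
    mean = total / count
    # Sample variance with Bessel's correction (N - 1), the same convention
    # that tensor.std() uses by default.
    var = (total_sq - count * mean ** 2) / (count - 1)
    return mean, np.sqrt(var)


# Synthetic stand-in for the dataset: 1,000 random "images" in [0, 1).
rng = np.random.default_rng(0)
data = rng.random((1000, 3, 32, 32), dtype=np.float32)

# Feed the data in chunks of 100 instead of one giant batch.
chunks = (data[i:i + 100] for i in range(0, len(data), 100))
mean, std = streaming_channel_stats(chunks)
print(mean, std)
```

Only one chunk is held in memory at a time, and the result should agree with the single-batch computation up to floating-point rounding.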

Let's verify what normalization does to concrete pixel values. Take a reddish pixel with raw value 0.75 in the red channel:

$x_{\text{norm}} = \frac{0.75 - 0.4914}{0.2470} = \frac{0.2586}{0.2470} = 1.047$

This pixel is about one standard deviation above the mean — a moderately bright red. Now consider the extremes:

Pixel Value	Normalized (Red Channel)	Interpretation
0.00 (black)	(0.00 − 0.4914) / 0.2470 = −1.99	About 2σ below the mean
0.4914 (mean)	(0.4914 − 0.4914) / 0.2470 = 0.00	Exactly at the mean
0.75 (bright)	(0.75 − 0.4914) / 0.2470 = +1.05	About 1σ above the mean
1.00 (white)	(1.00 − 0.4914) / 0.2470 = +2.06	About 2σ above the mean
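The table rows can be reproduced with a few lines of plain Python. The sketch also includes the inverse mapping, which is handy when you want to display a normalized image again. The helper names `normalize` and `denormalize` are made up for this example, and the constants are the ones computed above.

```python
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)  # per-channel means (R, G, B) from the text
CIFAR10_STD = (0.2470, 0.2435, 0.2616)   # per-channel standard deviations


def normalize(value, channel):
    """Map a [0, 1] pixel value to its normalized value for one channel."""
    return (value - CIFAR10_MEAN[channel]) / CIFAR10_STD[channel]


def denormalize(value, channel):
    """Invert normalize(): recover the original [0, 1] pixel value."""
    return value * CIFAR10_STD[channel] + CIFAR10_MEAN[channel]


RED = 0
for raw in (0.0, 0.4914, 0.75, 1.0):
    z = normalize(raw, RED)
    print(f"{raw:.4f} -> {z:+.2f} -> {denormalize(z, RED):.4f}")
```

Each printed line shows the raw value, its normalized value, and the round trip back to the raw value.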

Why does normalization help training? Recall from Chapter 9 that optimizers like SGD and Adam compute every weight update from the gradients. If input features have wildly different scales, some gradients will be enormous and others tiny, making the loss landscape elongated and hard to navigate. Normalization makes the landscape more spherical, so a single learning rate works well in all directions. In practice, this can shorten training substantially, though the exact gain depends on the model and optimizer.
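The effect of an elongated landscape can be illustrated with a toy problem. The sketch below runs plain gradient descent on a two-parameter quadratic bowl, once with equal curvature in both directions and once with one direction 100 times steeper. It is a deliberately simplified stand-in, not a model of CNN training, and the function name `steps_to_converge` is invented for the example.

```python
def steps_to_converge(curvatures, tol=1e-3, max_steps=100_000):
    """Count gradient-descent steps on f(w) = 0.5 * sum(c_i * w_i**2).

    The learning rate is 1 / max(curvatures), a safe choice that does not
    overshoot along the steepest direction.
    """
    lr = 1.0 / max(curvatures)
    w = [1.0] * len(curvatures)
    for step in range(1, max_steps + 1):
        # The gradient along axis i is c_i * w_i.
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
        if max(abs(wi) for wi in w) < tol:
            return step
    return max_steps


spherical = steps_to_converge([1.0, 1.0])     # equal scale in both directions
elongated = steps_to_converge([1.0, 100.0])   # one direction 100x steeper

print(spherical, elongated)
```

The steep direction forces a small learning rate, and the shallow direction then crawls toward the minimum. Bringing inputs to a common scale pushes real loss surfaces closer to the first case.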

Always compute normalization statistics from the training set only. Never use the test set — that would leak test information into your preprocessing pipeline. Apply the same training-set statistics when normalizing validation and test data.

Data Augmentation: Teaching Generalization

In Chapter 12, you learned that overfitting happens when a model memorizes training examples instead of learning general patterns. One of the most powerful defenses is data augmentation: applying random transformations to training images so the model never sees the exact same image twice.

The idea is simple but profound. A cat flipped horizontally is still a cat. A truck shifted 2 pixels to the right is still a truck. A bird in slightly different lighting is still a bird. By randomly applying these transformations during training, we teach the network that these variations don't change the class label. The network learns to focus on the essential what (shape, texture, parts) rather than the incidental where and how (position, orientation, brightness).

[Interactive figure: Data Augmentation Playground, showing an original image next to its augmented version. Flip mirrors pixels across an axis; Crop pads with black and takes a random sub-region; Jitter randomly shifts RGB brightness and contrast.]

Each transformation addresses a specific type of variation the model should be invariant to:

Transform	What It Does	What It Teaches the CNN
RandomHorizontalFlip	Mirrors image left ↔ right (50% chance)	Object identity doesn’t depend on facing direction
RandomCrop(32, padding=4)	Shifts image by up to 4px in any direction	Objects can appear anywhere in the frame
ColorJitter	Randomly adjusts brightness/contrast by ±20%	Same object under different lighting conditions
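At the pixel level, the first two transforms are simple array operations. Here is a rough NumPy sketch of the idea on a (C, H, W) array. It is not how torchvision implements them, the function names are invented for the example, and it assumes a square image.

```python
import numpy as np


def random_horizontal_flip(image_chw, rng, p=0.5):
    """Mirror a (C, H, W) image left-to-right with probability p."""
    if rng.random() < p:
        return image_chw[:, :, ::-1]
    return image_chw


def random_crop_with_padding(image_chw, rng, size=32, padding=4):
    """Zero-pad H and W by `padding`, then cut a random size x size window."""
    padded = np.pad(image_chw, ((0, 0), (padding, padding), (padding, padding)))
    max_offset = padded.shape[1] - size       # 40 - 32 = 8 for CIFAR-10
    top = int(rng.integers(0, max_offset + 1))
    left = int(rng.integers(0, max_offset + 1))
    return padded[:, top:top + size, left:left + size]


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(3, 32, 32), dtype=np.uint8)

flipped = random_horizontal_flip(image, rng, p=1.0)   # p=1.0 forces the flip
cropped = random_crop_with_padding(image, rng)

print(flipped.shape, cropped.shape)
```

Both outputs keep the 32 × 32 resolution. Only the position and orientation of the content change, which is exactly the variation the network is meant to ignore.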

Now let's define the actual transform pipelines in PyTorch. We need two separate pipelines: one for training (with augmentation) and one for testing (without).

Defining Training and Test Transform Pipelines

🐍augmentation_pipeline.py

import torchvision.transforms as transforms

The transforms module contains all image preprocessing operations. We define two separate pipelines: one for training (with augmentation) and one for testing (without). This distinction is critical.

EXECUTION STATE

transforms = Module with composable image operations: RandomHorizontalFlip, RandomCrop, ColorJitter, ToTensor, Normalize, and many more

train_transform = transforms.Compose([...])

Creates the training pipeline. The order matters: here the augmentation transforms operate on PIL Images and come BEFORE ToTensor(), while Normalize() operates on tensors and must come AFTER ToTensor(). (Older torchvision releases accepted only PIL Images for the augmentation transforms; releases from 0.8 onward also accept tensors.) The pipeline processes every training image on-the-fly during training.

EXECUTION STATE

📚 transforms.Compose(list) = Chains transforms sequentially: output of each becomes input to the next. The full chain runs every time you access trainset[i].

→ pipeline order = PIL augmentations → ToTensor → Normalize. Keep this order: Normalize() requires a tensor, and in older torchvision versions the augmentation transforms accept only PIL Images.

→ on-the-fly = Augmentation runs every epoch, so each time the network sees an image, it looks slightly different. This is like having a much larger dataset.

transforms.RandomHorizontalFlip(p=0.5)

With 50% probability, flips the image left-to-right. A horizontally flipped car is still a car. This teaches the CNN that object identity doesn’t depend on horizontal orientation. Note: we do NOT use vertical flip because an upside-down car looks unnatural.

EXECUTION STATE

📚 RandomHorizontalFlip(p) = Flips the PIL Image along the vertical axis (swaps left↔right columns) with probability p. p=0.5 means each image has a coin-flip chance of being mirrored.

⬇ arg: p=0.5 = Probability of flipping. 0.5 = fair coin flip. 0.0 = never flip. 1.0 = always flip. 0.5 is standard for natural images.

→ effect = Pixel at column j moves to column (width - 1 - j). A cat facing left becomes a cat facing right. The label stays the same.

transforms.RandomCrop(32, padding=4)

First pads the 32×32 image with 4 black pixels on each side (making it 40×40), then takes a random 32×32 crop. The net effect: the image shifts randomly by up to 4 pixels in any direction. This teaches the CNN that objects can appear anywhere in the frame, not just centered.

EXECUTION STATE

📚 RandomCrop(size, padding) = Pads the image, then extracts a random sub-region of the specified size. Equivalent to slight translation augmentation.

⬇ arg: size=32 = Output crop size: 32×32 pixels. Same as input, so the image stays the same resolution.

⬇ arg: padding=4 = Add 4 pixels of zero-padding on all four sides before cropping. Input 32×32 → padded 40×40 → random crop 32×32.

→ max shift = The crop origin can range from (0,0) to (8,8) in the 40×40 padded image. At (4,4) the crop is centered (no shift). At (0,0) the image shifts 4px right and down.

transforms.ColorJitter(brightness=0.2, contrast=0.2)

Randomly perturbs brightness and contrast by up to ±20%. This simulates lighting variation — the same object photographed in bright sunlight vs. dim shade should still be recognized. The jitter amount is drawn uniformly from [1-0.2, 1+0.2] = [0.8, 1.2] for each parameter.

EXECUTION STATE

📚 ColorJitter(brightness, contrast, saturation, hue) = Randomly changes the visual properties of a PIL Image. Each parameter specifies the maximum deviation from the original. Set to 0 (or omit) to leave unchanged.

⬇ arg: brightness=0.2 = Multiply pixel brightness by a random factor in [0.8, 1.2]. Factor 0.8 = 20% darker. Factor 1.2 = 20% brighter.

⬇ arg: contrast=0.2 = Adjust contrast by a random factor in [0.8, 1.2]. Lower contrast pushes pixels toward the mean gray. Higher contrast pushes them apart.
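The arithmetic behind these two factors can be approximated in NumPy. This is a sketch for float images in [0, 1], not torchvision's code: it applies brightness and then contrast in a fixed order, while ColorJitter picks the order at random, and the grayscale weights are the usual luma coefficients. The function names are illustrative.

```python
import numpy as np


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, then clip back into [0, 1]."""
    return np.clip(image * factor, 0.0, 1.0)


def adjust_contrast(image_chw, factor):
    """Pull pixels toward (factor < 1) or push them away from (factor > 1) the mean gray."""
    r, g, b = image_chw
    gray_mean = (0.299 * r + 0.587 * g + 0.114 * b).mean()
    return np.clip(gray_mean + factor * (image_chw - gray_mean), 0.0, 1.0)


def color_jitter(image_chw, rng, brightness=0.2, contrast=0.2):
    """Draw one factor per property from [1 - s, 1 + s] and apply both."""
    b = rng.uniform(1 - brightness, 1 + brightness)
    c = rng.uniform(1 - contrast, 1 + contrast)
    return adjust_contrast(adjust_brightness(image_chw, b), c)


rng = np.random.default_rng(0)
image = rng.random((3, 32, 32))          # synthetic float image in [0, 1)

darker = adjust_brightness(image, 0.8)   # lower end of the brightness range
flatter = adjust_contrast(image, 0.8)    # lower end of the contrast range
jittered = color_jitter(image, rng)

print(darker.mean() < image.mean(), flatter.std() < image.std())
```

A factor of 0.8 lowers the mean for brightness and lowers the spread for contrast, matching the descriptions of the two arguments above.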

15transforms.ToTensor()

Converts the augmented PIL Image to a PyTorch tensor. This is the boundary between PIL-based augmentations (above) and tensor-based operations (below). After this point, the data is a float32 tensor in [0, 1].

EXECUTION STATE

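→ position matters = ToTensor MUST come BEFORE Normalize, which accepts only tensors. In this pipeline it also comes after the augmentations, which receive the PIL Image that the dataset returns (recent torchvision releases let these transforms accept tensors as well).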

Line 18: transforms.Normalize(mean=..., std=...)

Applies per-channel normalization: for each channel c, compute (pixel - mean[c]) / std[c]. This centers the data around zero and scales it to unit variance. The exact values come from our computation in the previous code block.

EXECUTION STATE

📚 transforms.Normalize(mean, std) = Per-channel normalization. For channel c: output[c] = (input[c] - mean[c]) / std[c]. Operates on tensors, not PIL Images.

⬇ arg: mean=[0.4914, 0.4822, 0.4465] = CIFAR-10 training set per-channel means. Red is brightest on average (0.49), Blue is darkest (0.45). These are the values we computed.

⬇ arg: std=[0.2470, 0.2435, 0.2616] = CIFAR-10 per-channel standard deviations. Dividing by these scales each channel to approximately unit variance.

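→ output range = After normalization, most values fall in roughly [-2, +2]. A pixel at the mean → 0.0. In the red channel, a pixel at 0.0 (black) → -1.99 and a pixel at 1.0 (white) → +2.06; the green and blue channels give slightly different extremes because their mean and std differ.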
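The per-channel extremes follow directly from the formula. The sketch below evaluates (pixel - mean) / std at black, at the channel mean, and at white, using the mean and std values quoted above; the function name normalize is just an illustrative helper.

```python
MEAN = [0.4914, 0.4822, 0.4465]  # per-channel means quoted above
STD = [0.2470, 0.2435, 0.2616]   # per-channel standard deviations quoted above


def normalize(value, channel):
    """Apply (pixel - mean[c]) / std[c] to a single pixel value in [0, 1]."""
    return (value - MEAN[channel]) / STD[channel]


for channel, name in enumerate(["red", "green", "blue"]):
    black = normalize(0.0, channel)
    middle = normalize(MEAN[channel], channel)
    white = normalize(1.0, channel)
    print(f"{name:5s} black -> {black:+.2f}   mean -> {middle:+.2f}   white -> {white:+.2f}")
```

The red channel reproduces the -1.99 and +2.06 figures, while blue spans a slightly different range because it has a lower mean and a larger standard deviation.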

Line 25: test_transform = transforms.Compose([...])

The test pipeline has NO augmentation — only ToTensor and Normalize. Why? Test data must be evaluated consistently. If we flipped or cropped test images randomly, accuracy would vary between evaluations. The same normalization ensures test pixels are on the same scale as training pixels.

EXECUTION STATE

→ critical rule = NEVER augment test data. Augmentation is a training-time regularization technique. Test data represents the real world — it must be processed deterministically.

→ same Normalize = We use the SAME mean and std from the TRAINING set for both train and test. We never compute stats on the test set — that would be data leakage.

Line 26: transforms.ToTensor()

Same ToTensor as in the training pipeline. PIL Image [0, 255] becomes tensor [0.0, 1.0] in (C, H, W) format.

EXECUTION STATE

→ identical conversion = Both pipelines use the same ToTensor → Normalize chain for the final conversion step

Line 27: transforms.Normalize(mean=[...], std=[...])

Identical normalization using TRAINING set statistics. This is a fundamental ML principle: never use test set information during any phase of data processing.

EXECUTION STATE

→ Why training stats? = In a real scenario, you don’t have the test set when you build your pipeline. You only know the training distribution. Using training stats for both ensures no information leakage.
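The rule can be illustrated on a few made-up numbers: statistics are computed once, from training values only, and then reused unchanged for held-out values. This is a toy sketch using Python's statistics module, not the CIFAR-10 computation itself; all values are invented.

```python
from statistics import fmean, pstdev

train_pixels = [0.10, 0.40, 0.50, 0.60, 0.90]  # made-up training values
test_pixels = [0.20, 0.95]                     # made-up held-out values

# Statistics come from the training values ONLY.
mean = fmean(train_pixels)
std = pstdev(train_pixels)


def normalize(values):
    """Normalize any split with the training-set mean and std."""
    return [(v - mean) / std for v in values]


train_norm = normalize(train_pixels)
test_norm = normalize(test_pixels)

print(f"train mean after normalization: {fmean(train_norm):.3f}")
print(f"test values after normalization: {[round(v, 3) for v in test_norm]}")
```

The normalized training values have zero mean and unit standard deviation by construction. The held-out values generally do not, and that is expected: they are simply placed on the same scale the model saw during training.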


 1  import torchvision.transforms as transforms
 2
 3  # --- Training transforms: augment + normalize ---
 4  train_transform = transforms.Compose([
 5      # 50% chance to mirror the image horizontally
 6      transforms.RandomHorizontalFlip(p=0.5),
 7
 8      # Pad 4px on each side, then random crop to 32x32
 9      transforms.RandomCrop(32, padding=4),
10
11      # Randomly adjust brightness and contrast
12      transforms.ColorJitter(brightness=0.2, contrast=0.2),
13
14      # PIL [0, 255] -> Tensor [0.0, 1.0]
15      transforms.ToTensor(),
16
17      # Center around zero: (pixel - mean) / std
18      transforms.Normalize(
19          mean=[0.4914, 0.4822, 0.4465],
20          std=[0.2470, 0.2435, 0.2616],
21      ),
22  ])
23
24  # --- Test transforms: normalize only (NO augmentation) ---
25  test_transform = transforms.Compose([
26      transforms.ToTensor(),
27      transforms.Normalize(
28          mean=[0.4914, 0.4822, 0.4465],
29          std=[0.2470, 0.2435, 0.2616],
30      ),
31  ])

Notice the order of transforms in the training pipeline: the augmentation transforms (RandomHorizontalFlip, RandomCrop, ColorJitter) come first and operate on the PIL Image. Then ToTensor() converts it to a tensor. Finally, Normalize() operates on the tensor. Getting this order wrong raises errors; for example, Normalize placed before ToTensor fails because it expects a tensor, not a PIL Image.

Quick Check: Why do we NOT apply RandomHorizontalFlip to the test set?

Answer: Test data must be evaluated deterministically. If we augmented test images, accuracy would vary between evaluations because random flips would change the model's predictions. Augmentation is a training-time regularization technique — it helps prevent overfitting but has no place in evaluation.

The Complete Data Pipeline

With our transforms defined, we can now build the complete data pipeline. Three key decisions remain: (1) how to split training data into train and validation sets, (2) what batch size to use, and (3) whether to shuffle.

The Train/Validation Split

CIFAR-10 provides a predefined train/test split (50K/10K), but we need a validation set too. The validation set is your feedback loop during training — you check accuracy on it after each epoch to detect overfitting. The test set is reserved for the final evaluation only. If you tune hyperparameters based on test accuracy, you are implicitly overfitting to the test set.

We split the 50,000 training images into 45,000 for training and 5,000 for validation (a 90/10 split). This gives us roughly 500 validation images per class (the random split is not stratified, so the per-class counts vary slightly), enough for a reliable accuracy estimate.

Split	Size	Purpose	Augmentation?
Training	45,000	Model learns from these images	Yes — random transforms each epoch
Validation	5,000	Monitor overfitting during training	No — deterministic evaluation
Test	10,000	Final accuracy after all tuning is done	No — deterministic evaluation

Batching and Shuffling

We feed images to the CNN in batches of 128. Why not one at a time? Batching has three benefits: (1) GPU parallelism — processing 128 images simultaneously is barely slower than processing 1, (2) stable gradients — averaging gradients over 128 images reduces noise compared to single-sample updates (recall SGD from Chapter 9), and (3) efficient memory use — the GPU can optimize matrix operations for fixed-size batches.

We shuffle the training data each epoch so the model sees images in a different random order. Without shuffling, the model might learn spurious correlations from the batch composition (e.g., “batch 5 always has cats followed by dogs”).
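A small sketch shows what "shuffle, then batch" means in terms of sample indices. It is a plain-Python illustration of the idea, not how DataLoader is implemented internally; the function name epoch_batches and the tiny sizes (10 samples, batches of 4) are chosen only to keep the output readable.

```python
import random


def epoch_batches(num_samples, batch_size, rng, shuffle=True):
    """Return one epoch's worth of index batches.

    With shuffle=True the sample order is re-randomized on every call,
    so each epoch groups the samples into different batches.
    """
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


rng = random.Random(42)
epoch1 = epoch_batches(10, 4, rng)
epoch2 = epoch_batches(10, 4, rng)

print("epoch 1:", epoch1)
print("epoch 2:", epoch2)
print("no shuffle:", epoch_batches(10, 4, rng, shuffle=False))
```

Every epoch still visits each index exactly once; only the grouping changes. The final batch is smaller whenever the batch size does not divide the dataset size evenly.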

Building DataLoaders with Train/Validation/Test Split

🐍build_dataloaders.py


Line 1: import torch

We need torch for DataLoader, random_split, and Generator — the data pipeline utilities that feed batches to the training loop.

EXECUTION STATE

torch = Provides torch.utils.data.DataLoader, torch.utils.data.random_split, torch.Generator

Line 2: from torchvision import datasets

Import the datasets module for CIFAR10 class.

EXECUTION STATE

datasets = torchvision.datasets — contains CIFAR10 and other standard vision datasets

Line 5: trainset = datasets.CIFAR10(..., transform=train_transform)

Loads the full 50,000-image training set WITH the augmentation pipeline. Every access to trainset[i] triggers the random augmentation chain. During training, the same image will look slightly different each epoch.

EXECUTION STATE

⬇ arg: transform=train_transform = The augmentation pipeline from above: RandomHorizontalFlip → RandomCrop → ColorJitter → ToTensor → Normalize

trainset = CIFAR10 dataset with 50,000 images. Will be split into 45K train + 5K val next.

Line 8: testset = datasets.CIFAR10(..., transform=test_transform)

Loads the 10,000-image test set with the non-augmented pipeline (ToTensor + Normalize only). Test data is never augmented — we want deterministic, repeatable evaluation.

EXECUTION STATE

⬇ arg: train=False = Load the test split (10,000 images) instead of training split

⬇ arg: transform=test_transform = No augmentation: only ToTensor → Normalize

Line 13: train_data, val_data = torch.utils.data.random_split(...)

Splits the 50,000 training images into 45,000 for training and 5,000 for validation. The validation set acts as a "mini test set" you check during training to detect overfitting. We use a fixed random seed so the split is reproducible.

EXECUTION STATE

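📚 random_split(dataset, lengths, generator) = Randomly partitions a dataset into non-overlapping subsets of specified sizes. Returns a list of Subset objects. Each Subset stores only indices into its parent dataset, so it applies whatever transform that dataset carries; a subset that needs a different transform (such as validation without augmentation) must point at a dataset built with that transform.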

⬇ arg: trainset = The full 50,000-image CIFAR-10 training set

⬇ arg: [45000, 5000] = Split sizes. Must sum to len(trainset) = 50000. 45K for training, 5K for validation.

⬇ arg: generator=...manual_seed(42) = Fixed random number generator ensures the same 5,000 images go to validation every time. Without this, the split changes on each run, making results non-reproducible.

→ why 45K/5K? = 90%/10% is a standard split ratio. 5,000 validation samples (about 500 per class, since the split is random rather than stratified) give a reliable accuracy estimate. Using too many for validation wastes training data.
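How a seeded split interacts with transforms can be pictured with toy stand-ins. Everything in this sketch is hypothetical: ToyDataset, ToySubset, and seeded_split are simplified illustrations of the idea, not the torch.utils.data classes, and the "transforms" are labels rather than real image operations.

```python
import random


class ToyDataset:
    """Stand-in for a vision dataset: raw items plus a transform."""

    def __init__(self, items, transform):
        self.items = items
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.transform(self.items[i])


class ToySubset:
    """Holds a parent dataset and a list of indices, nothing else."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # The parent's transform is applied here, on every access.
        return self.dataset[self.indices[i]]


def seeded_split(n, n_val, seed):
    """Shuffle indices with a fixed seed and cut off n_val for validation."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order[n_val:], order[:n_val]


items = list(range(10))
augmented = ToyDataset(items, transform=lambda x: ("augmented", x))
plain = ToyDataset(items, transform=lambda x: ("plain", x))

train_idx, val_idx = seeded_split(len(items), n_val=2, seed=42)
train_data = ToySubset(augmented, train_idx)  # training subset: augmented parent
val_data = ToySubset(plain, val_idx)          # validation subset: plain parent

print(train_data[0], val_data[0])
```

The same seed always yields the same index lists, and each subset's behavior is determined entirely by the parent dataset it points at.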

Line 19: train_loader = torch.utils.data.DataLoader(...)

Creates an iterator that serves batches of 128 images during training. Shuffling randomizes the order each epoch so the model doesn’t memorize batch patterns. num_workers=2 loads data in two parallel background worker processes.

EXECUTION STATE

📚 DataLoader(dataset, batch_size, shuffle, num_workers) = Wraps a dataset for efficient batched iteration. Each call to next() on its iterator returns a batch of (images, labels) CPU tensors, ready to be moved to the GPU.

⬇ arg: train_data = The 45,000-image training subset from random_split

⬇ arg: batch_size=128 = Each batch contains 128 images. Total batches = ceil(45000/128) = 352. Last batch has 45000 mod 128 = 72 images.

⬇ arg: shuffle=True = Randomize sample order each epoch. CRITICAL for training — without shuffle, the model sees the same sequence every epoch and may learn spurious order-dependent patterns.

⬇ arg: num_workers=2 = Use 2 background processes for data loading. While the GPU trains on batch N, workers pre-load batch N+1. Prevents the GPU from waiting idle for data.

Line 22: val_loader = torch.utils.data.DataLoader(..., shuffle=False)

Validation DataLoader with shuffle=False. We never shuffle validation data because: (1) evaluation doesn’t benefit from random order, and (2) deterministic order makes debugging easier. The same batch_size works here; because validation involves no backpropagation, batch size affects only speed and memory use, not the measured accuracy.

EXECUTION STATE

⬇ arg: val_data = The 5,000-image validation subset

⬇ arg: shuffle=False = No shuffling for validation — we want repeatable evaluation. Same images, same order every time.

→ total batches = ceil(5000/128) = 40 batches

Line 25: test_loader = torch.utils.data.DataLoader(...)

Test DataLoader — also unshuffled. We evaluate on the test set only once, at the very end, after all hyperparameter tuning is done. This gives the final, unbiased accuracy number.

EXECUTION STATE

⬇ arg: testset = The 10,000-image CIFAR-10 test set (separate from training)

→ total batches = ceil(10000/128) = 79 batches

Line 30: images, labels = next(iter(train_loader))

Grabs one batch to verify everything works. This triggers the full pipeline: load PIL Image → RandomHorizontalFlip → RandomCrop → ColorJitter → ToTensor → Normalize → stack into batch tensor.

EXECUTION STATE

images = torch.Tensor of shape (128, 3, 32, 32) — a batch of 128 normalized, augmented images

labels = torch.Tensor of shape (128,) — integer class labels for each image in the batch

Line 31: print(f"Batch images: {images.shape}")

Confirms the batch tensor has the expected 4D shape: (batch_size, channels, height, width) = (128, 3, 32, 32).

EXECUTION STATE

⬆ output = Batch images: torch.Size([128, 3, 32, 32])

Line 32: print(f"Batch labels: {labels.shape}")

Labels are a 1D tensor with one integer per image. These get passed to nn.CrossEntropyLoss during training.

EXECUTION STATE

⬆ output = Batch labels: torch.Size([128])

Line 33: print(f"Train batches: {len(train_loader)}")

The number of batches per epoch. With 45,000 images and batch_size=128, we get 352 batches. Each epoch, the model processes all 352 batches (seeing every training image exactly once).

EXECUTION STATE

⬆ output = Train batches: 352

→ math = ceil(45000 / 128) = ceil(351.5625) = 352. Last batch has 45000 - 351×128 = 72 images.

Line 34: print(f"Val batches: {len(val_loader)}")

40 validation batches. After each training epoch, we loop through all 40 to compute validation accuracy.

EXECUTION STATE

⬆ output = Val batches: 40

Line 35: print(f"Test batches: {len(test_loader)}")

79 test batches. Used only once at the very end to report final accuracy. If you evaluate on the test set during training and tune hyperparameters based on it, you are overfitting to the test set.

EXECUTION STATE

⬆ output = Test batches: 79
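The three batch counts all come from the same ceiling division, which can be verified without loading any data. The helper name batch_stats below is made up for this sketch; the split sizes and the batch size of 128 are the ones used in this section.

```python
import math


def batch_stats(num_samples, batch_size=128):
    """Return (number of batches, size of the final batch)."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch


for name, n in [("train", 45_000), ("val", 5_000), ("test", 10_000)]:
    batches, last = batch_stats(n)
    print(f"{name:5s} {n:6d} images -> {batches:3d} batches, last batch has {last} images")
```

None of the three split sizes is a multiple of 128, so each loader ends with a partial batch: 72, 8, and 16 images respectively.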


 1  import torch
 2  from torchvision import datasets
 3
 4  # Load datasets with their respective transforms
 5  trainset = datasets.CIFAR10(
 6      root="./data", train=True, transform=train_transform,
 7  )
 8  testset = datasets.CIFAR10(
 9      root="./data", train=False, transform=test_transform,
10  )
11
12  # Split training into train (45K) + validation (5K)
13  train_data, val_data = torch.utils.data.random_split(
14      trainset, [45000, 5000],
15      generator=torch.Generator().manual_seed(42),
16  )
17  val_data.dataset = datasets.CIFAR10(root="./data", train=True, transform=test_transform)  # validation: no augmentation
18  # Create DataLoaders for batched iteration
19  train_loader = torch.utils.data.DataLoader(
20      train_data, batch_size=128, shuffle=True, num_workers=2,
21  )
22  val_loader = torch.utils.data.DataLoader(
23      val_data, batch_size=128, shuffle=False, num_workers=2,
24  )
25  test_loader = torch.utils.data.DataLoader(
26      testset, batch_size=128, shuffle=False, num_workers=2,
27  )
28
29  # Verify one batch
30  images, labels = next(iter(train_loader))
31  print(f"Batch images: {images.shape}")
32  print(f"Batch labels: {labels.shape}")
33  print(f"Train batches: {len(train_loader)}")
34  print(f"Val batches:   {len(val_loader)}")
35  print(f"Test batches:  {len(test_loader)}")

The complete data pipeline at a glance:
Load — torchvision downloads and caches CIFAR-10 on disk
Split — 50K training images become 45K train + 5K validation
Transform (train) — RandomFlip → RandomCrop → ColorJitter → ToTensor → Normalize
Transform (val/test) — ToTensor → Normalize only
Batch — DataLoader serves groups of 128 images each iteration
Shuffle — Training batches are randomized each epoch

That's it — our data pipeline is complete. With 352 training batches, 40 validation batches, and 79 test batches, we have everything needed to train and evaluate a CNN. Each training image goes through a random augmentation pipeline, so the network effectively trains on a much larger, more diverse dataset than the original 45,000 images.

In the next section, we'll build the CNN architecture itself and train it using these DataLoaders. You'll see exactly how the data preparation choices we made here — normalization, augmentation, batch size — affect training speed and final accuracy.