Chapter 11

Data Loading and Batching

Training in Practice

Learning Objectives

By the end of this section, you will be able to:

  1. Explain why we split training data into mini-batches instead of using the full dataset at once
  2. Derive the key mathematical property: mini-batch gradients are unbiased estimates of the true gradient with variance $\sigma^2 / B$
  3. Implement a complete mini-batch training loop from scratch in NumPy
  4. Use PyTorch's Dataset and DataLoader to handle batching, shuffling, and parallel loading
  5. Choose appropriate batch sizes for different hardware and training scenarios

Where We Left Off

In Chapter 10, we built multi-layer perceptrons (MLPs) that can learn complex functions by stacking layers. In Chapters 7-9, we learned how data flows forward through a network, how gradients flow backward via backpropagation, and how optimizers like SGD and Adam use those gradients to update weights.

But there is a question we have been quietly glossing over: how do we actually feed data to the network? In all our examples so far, we computed gradients on the entire dataset at once. For 8 or 16 samples, that is fine. But real datasets have thousands to billions of samples. Computing the gradient on the entire dataset before making a single weight update is both computationally wasteful and, surprisingly, mathematically suboptimal.

This section introduces the concept of mini-batch training — the practical engine that powers every modern neural network, from small classifiers to GPT-4.


The Training Data Challenge

Consider a concrete scenario. You are training a model on ImageNet, which has 1.2 million images. Each image is 224×224×3 = 150,528 floats. At 4 bytes per float, the raw input data alone is 720 GB. This does not fit in GPU memory (typically 16-80 GB), and computing the gradient over all 1.2 million samples before making a single weight update would be absurdly slow.
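
The 720 GB figure can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes every image is held as uncompressed float32 at 224×224×3, and the variable names are illustrative.

```python
# Back-of-the-envelope memory estimate for the raw ImageNet inputs.
# Assumption: every image is held as uncompressed float32 at 224x224x3.
num_images = 1_200_000
floats_per_image = 224 * 224 * 3      # 150,528 values per image
bytes_per_float = 4                   # float32

total_bytes = num_images * floats_per_image * bytes_per_float
total_gb = total_bytes / 1e9          # decimal gigabytes

print(f"{floats_per_image:,} floats per image")
print(f"{total_gb:.1f} GB for the full dataset")

# A mini-batch of 256 images needs only a small fraction of that.
batch_mb = 256 * floats_per_image * bytes_per_float / 1e6
print(f"{batch_mb:.1f} MB for a batch of 256")
```

The exact product is a little above 720 GB; the point is the order of magnitude compared with a single batch.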

Even if memory were unlimited, there is a deeper mathematical reason not to use the full dataset for every gradient computation. Recall from Chapter 8 that the gradient of the loss over the entire dataset is:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \theta)$$

This is an average of per-sample gradients. And here is the key insight: you do not need to compute every single term in this sum to get a useful estimate of which direction to move. A random subset of samples gives you a noisy but unbiased approximation of the true gradient — and that approximation is often good enough.

The Core Insight: If you are standing on a mountain in fog and ask 100 people which way is downhill, you get a reliable answer. You do not need to ask all 10,000 people on the mountain. Asking a small random group is faster and points you in roughly the same direction.

Three Flavors of Gradient Descent

Based on how many samples we use to compute each gradient update, there are three regimes.

Full-Batch Gradient Descent

Use the entire dataset for every gradient computation:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \theta)$$

  • Advantage: The gradient is exact — no noise, smooth convergence
  • Disadvantage: Must process all N samples before one update. For N = 1,000,000 this is extremely slow
  • Disadvantage: Requires all data in memory at once
  • Disadvantage: Smooth gradients can get trapped in sharp local minima

Stochastic Gradient Descent (B = 1)

Use a single random sample for each gradient:

$$\theta \leftarrow \theta - \eta \cdot \nabla \ell(x_i, y_i; \theta)$$

  • Advantage: Updates after every sample — N updates per epoch instead of 1
  • Advantage: The noise can help escape sharp local minima (acts like a regularizer)
  • Disadvantage: Very noisy gradients — the path toward the optimum is erratic
  • Disadvantage: Cannot exploit GPU parallelism (processing one sample at a time wastes GPU compute)

Mini-Batch Gradient Descent: The Sweet Spot

Use a random subset of $B$ samples:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla \ell(x_i, y_i; \theta)$$

where $\mathcal{B}$ is a random mini-batch of size $B$. This is what everyone actually uses. When people say “SGD” in deep learning, they almost always mean mini-batch SGD.

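| Property | Full Batch (B=N) | Mini-Batch (B=32-512) | SGD (B=1) |
|---|---|---|---|
| Updates per epoch | 1 | N/B (many) | N (most) |
| Gradient noise | Zero | Moderate | High |
| GPU utilization | High | High | Low (waste) |
| Memory per step | All data | B samples | 1 sample |
| Convergence path | Smooth | Slightly noisy | Very noisy |
| Generalization | May overfit | Good (noise helps) | Good but slow |

Why powers of 2? You will often see batch sizes like 32, 64, 128, 256, 512. NVIDIA GPUs schedule threads in warps of 32, and many kernels and memory layouts are tuned for sizes that are multiples of 32 or powers of 2. A batch size just past such a boundary (33, say) can leave part of the hardware idle, so it may cost noticeably more than a batch of 32 without doing much more useful work. The size of this effect depends on the model and the kernels involved, so treat powers of 2 as a convenient default rather than a hard rule.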
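
The three regimes differ only in how many indices are sampled per update. The sketch below makes that explicit for a linear model with MSE loss. The function name `gd_step` and its signature are hypothetical, and the data is the 8-sample toy dataset used later in this section.

```python
import numpy as np

def gd_step(X, y, w, b, lr, batch_size, rng):
    """One gradient-descent update on a linear model with MSE loss.

    batch_size == len(X) -> full-batch gradient descent
    batch_size == 1      -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    # Sample batch_size distinct indices uniformly at random.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]
    error = X_b @ w + b - y_b
    dw = (2 / batch_size) * (X_b.T @ error)
    db = (2 / batch_size) * error.sum()
    return w - lr * dw, b - lr * db

# Toy data: 8 samples of y = 2*x1 - x2 + 1.
X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

rng = np.random.default_rng(0)
for B in (len(X), 2, 1):
    w, b = gd_step(X, y, np.zeros(2), 0.0, lr=0.05, batch_size=B, rng=rng)
    print(f"B={B}: w={w}, b={b:.4f}")
```

With `batch_size=len(X)` the result does not depend on the random generator at all; with smaller values it changes from call to call.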

The Mathematics of Mini-Batch Gradients

Let us formalize why mini-batch training works. The true gradient over the full dataset is:

$$g = \nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell_i(\theta)$$

where $\nabla \ell_i(\theta)$ is the per-sample gradient for sample $i$. When we draw a random mini-batch $\mathcal{B}$ of size $B$, the mini-batch gradient is:

$$\hat{g}_B = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)$$

Property 1: Unbiased Estimation

The mini-batch gradient is an unbiased estimator of the true gradient. That is, its expected value equals the true gradient:

$$\mathbb{E}[\hat{g}_B] = g$$

This follows directly from the linearity of expectation. Each sample is drawn uniformly at random from the dataset, so $\mathbb{E}[\nabla \ell_i] = g$ for any randomly chosen sample $i$. Averaging $B$ of them still has the same expected value.

In plain English: on average, the mini-batch gradient points in the same direction as the full gradient. Any single mini-batch may be off, but there is no systematic bias.
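
Unbiasedness can be checked exhaustively on a tiny dataset. The sketch below uses the 8-sample linear dataset from later in this section at w = [0, 0], b = 0, enumerates every possible mini-batch of size 2, and compares the average of the batch gradients with the full gradient. The variable names are illustrative.

```python
import itertools
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

# Per-sample gradient of the squared error w.r.t. w at w = [0, 0], b = 0:
# d/dw (x.w + b - y)^2 = 2 * (x.w + b - y) * x, and every prediction is 0 here.
errors = 0.0 - y
per_sample = 2 * errors[:, None] * X          # shape (8, 2)
true_grad = per_sample.mean(axis=0)

# Every possible mini-batch of size 2 (28 of them), each equally likely.
batch_grads = np.array([per_sample[list(batch)].mean(axis=0)
                        for batch in itertools.combinations(range(len(X)), 2)])

print("true gradient:        ", true_grad)
print("mean over all batches:", batch_grads.mean(axis=0))
print("one particular batch: ", batch_grads[0])
```

Individual batches differ from the true gradient, but their average over all equally likely batches matches it.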

Property 2: Variance Reduction

The variance of the mini-batch gradient estimator is:

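$$\text{Var}(\hat{g}_B) = \frac{\sigma^2}{B}$$

where $\sigma^2$ is the variance of individual per-sample gradients. This is the standard result for the mean of independent random variables: averaging $B$ independent draws divides the variance by $B$. The formula assumes the $B$ samples are drawn independently (with replacement). When a batch is drawn without replacement, as in an epoch-based loop, the variance is somewhat smaller, and it reaches zero at $B = N$.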
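
For readers who want the intermediate step, the $1/B$ factor can be sketched as follows. The derivation treats one gradient component at a time and assumes the $B$ per-sample gradients in the batch are independent draws with common variance $\sigma^2$:

```latex
\begin{aligned}
\mathrm{Var}(\hat{g}_B)
  &= \mathrm{Var}\!\left(\frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)\right) \\
  &= \frac{1}{B^2}\sum_{i \in \mathcal{B}} \mathrm{Var}\big(\nabla \ell_i(\theta)\big)
     && \text{(independence: no covariance terms)} \\
  &= \frac{1}{B^2}\cdot B\,\sigma^2 = \frac{\sigma^2}{B}.
\end{aligned}
```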

Let us verify this with our running example: the 8-sample dataset $y = 2x_1 - x_2 + 1$ used in the NumPy code later in this section. At the initial weights $w = [0, 0]$ and bias $b = 0$, each sample produces a different gradient:

| Sample | Error | Per-sample gradient (dw) | db |
|---|---|---|---|
| 0: x=[1.0, 0.5] | -2.50 | [-5.000, -2.500] | -5.000 |
| 1: x=[2.0, 1.0] | -4.00 | [-16.000, -8.000] | -8.000 |
| 2: x=[0.5, 2.0] | 0.00 | [0.000, 0.000] | 0.000 |
| 3: x=[1.5, 0.5] | -3.50 | [-10.500, -3.500] | -7.000 |
| 4: x=[3.0, 1.5] | -5.50 | [-33.000, -16.500] | -11.000 |
| 5: x=[0.5, 0.5] | -1.50 | [-1.500, -1.500] | -3.000 |
| 6: x=[2.5, 2.0] | -4.00 | [-20.000, -16.000] | -8.000 |
| 7: x=[1.0, 1.5] | -1.50 | [-3.000, -4.500] | -3.000 |

The true gradient (average of all 8) is $\nabla L = [-11.125, -6.5625]$. The per-sample variance in the first component ($dw_1$) is $\sigma^2 = 112.67$. For different batch sizes:

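| Batch Size B | Variance of dw₁ | Std Dev of dw₁ | Relative to B=1 |
|---|---|---|---|
| 1 (SGD) | 112.67 | ±10.61 | 1.00× |
| 2 | 56.34 | ±7.51 | 0.50× |
| 4 | 28.17 | ±5.31 | 0.25× |
| 8 (size of the full dataset) | 14.08 | ±3.75 | 0.125× |

These values apply $\sigma^2 / B$ directly, which assumes independent draws with replacement. A true full batch that uses each of the 8 samples exactly once has zero variance.

Doubling the batch size halves the variance. But here is the crucial trade-off: doubling the batch size also halves the number of updates per epoch. With $B = 1$ you get 8 noisy updates per epoch. With $B = 8$ you get 1 exact update. In practice, the many noisy updates often lead to faster convergence than one exact update.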
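
The variance figures can be checked by brute force on the 8-sample example. The sketch below enumerates every possible batch and computes the exact variance of the batch mean, both with replacement (the setting where σ²/B holds) and without replacement (where the finite-population factor (N−B)/(N−1) applies). The helper name `batch_mean_variance` is hypothetical.

```python
import itertools
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

g1 = 2 * (0.0 - y) * X[:, 0]   # per-sample gradient, first component, at w=0, b=0
sigma2 = g1.var()              # population variance of the 8 values

def batch_mean_variance(B, with_replacement):
    """Exact variance of the batch-mean gradient, by enumerating every batch."""
    if with_replacement:
        batches = itertools.product(g1, repeat=B)   # 8**B ordered batches
    else:
        batches = itertools.combinations(g1, B)     # C(8, B) distinct batches
    means = np.array([np.mean(batch) for batch in batches])
    return means.var()

print(f"sigma^2 = {sigma2:.2f}")
for B in (1, 2, 4):
    print(f"B={B}: sigma^2/B = {sigma2 / B:.2f}, "
          f"with replacement = {batch_mean_variance(B, True):.2f}, "
          f"without replacement = {batch_mean_variance(B, False):.2f}")
print(f"B=8 without replacement: {batch_mean_variance(8, False):.2f}")
```

Enumeration with replacement is skipped for B=8 only because it would visit 8⁸ batches; the formula σ²/B still applies there.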

The Noise-Progress Trade-off

Think of the trade-off this way. At each step, your gradient estimate has two components:

$$\hat{g}_B = \underbrace{g}_{\text{signal}} + \underbrace{\epsilon_B}_{\text{noise, } \sim \mathcal{N}(0, \sigma^2/B)}$$

The signal $g$ is the true gradient, so stepping along $-g$ follows the direction of steepest descent on the full loss. The noise $\epsilon_B$ is random (modeled here as approximately Gaussian) but shrinks as $B$ grows. With a high enough signal-to-noise ratio ($\|g\| \gg \|\epsilon_B\|$), each step makes progress. The question is: do you take many cheap noisy steps or a few expensive clean steps? The answer depends on the loss landscape.
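
The decomposition can be made concrete for a single batch. The sketch below uses the 8-sample linear dataset from later in this section at w = [0, 0], b = 0, and picks the batch containing samples 1 and 5 as an arbitrary example.

```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

per_sample = 2 * (0.0 - y)[:, None] * X   # per-sample gradients at w=[0,0], b=0
g = per_sample.mean(axis=0)               # signal: the full-dataset gradient

batch = [1, 5]                            # one particular mini-batch (B = 2)
g_hat = per_sample[batch].mean(axis=0)    # what the mini-batch actually sees
noise = g_hat - g                         # the epsilon_B term for this batch

cos = g_hat @ g / (np.linalg.norm(g_hat) * np.linalg.norm(g))
print("signal g      :", g)
print("estimate g_hat:", g_hat)
print("noise         :", noise)
print(f"cosine similarity between g_hat and g: {cos:.3f}")
```

For this batch the estimate is shorter than the true gradient but points in almost the same direction; other batches will be off in other ways.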

An important subtlety: The gradient noise from mini-batching actually helps generalization. Models trained with smaller batch sizes tend to find flatter minima (which generalize better) because the noise prevents them from settling into sharp, narrow minima that only work on the training set. This is why some researchers argue that batch size is implicitly a regularization hyperparameter.

Visualizing Batch Size Effects

The interactive visualization below shows gradient descent paths on a 2D loss surface $f(w_1, w_2) = w_1^2 + 8w_2^2$ (an elongated valley). All five paths start from the same point. Watch how the batch size controls the amount of noise in the gradient:

[Interactive visualization: gradient descent paths for five batch sizes on the loss surface above]

Notice several things as you experiment:

  • B=1 (red) takes a wild, erratic path but may reach the minimum faster because it gets 8 updates per “epoch” while full-batch gets only 1
  • B=N (purple) follows the smoothest, most direct path but takes fewer steps per epoch
  • The elongated valley reveals an important effect: in the narrow direction ($w_2$), even small noise causes overshooting. Larger batches help stability along sensitive dimensions
  • Increasing the learning rate amplifies the noise effect — large lr + small batch can diverge
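
For readers without access to the interactive figure, here is a rough simulation sketch of the same idea. It is not the code behind the visualization: the starting point, learning rate, step count, and the choice to model mini-batch noise as Gaussian with standard deviation `noise_scale / sqrt(B)` are all assumptions made for illustration.

```python
import numpy as np

def noisy_descent(B, lr=0.05, steps=60, noise_scale=4.0, seed=0):
    """Gradient descent on f(w1, w2) = w1**2 + 8*w2**2 with simulated batch noise.

    Assumption: mini-batch noise is modeled as zero-mean Gaussian noise whose
    standard deviation shrinks like 1/sqrt(B). B=None means no noise at all.
    """
    rng = np.random.default_rng(seed)
    w = np.array([4.0, 2.0])                        # hypothetical starting point
    path = [w.copy()]
    for _ in range(steps):
        grad = np.array([2.0 * w[0], 16.0 * w[1]])  # exact gradient of f
        if B is not None:
            grad = grad + rng.normal(0.0, noise_scale / np.sqrt(B), size=2)
        w = w - lr * grad
        path.append(w.copy())
    return np.array(path)

for B in (1, 4, 16, 64, None):
    path = noisy_descent(B)
    label = "no noise" if B is None else f"B={B}"
    # Distance from the minimum at (0, 0), averaged over the last 20 steps.
    tail = np.linalg.norm(path[-20:], axis=1).mean()
    print(f"{label:>8}: mean distance to optimum over last 20 steps = {tail:.3f}")
```

Raising `lr` or `noise_scale` in this sketch is a quick way to see the interaction between learning rate and batch size described in the last bullet.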

Building Mini-Batch Training from Scratch

Before using PyTorch's DataLoader, let us build the entire mini-batch pipeline from scratch in NumPy. This makes the mechanism completely transparent: you will see exactly how data is shuffled, split into batches, and used to compute gradients.

We use 8 data points from the linear function $y = 2x_1 - x_2 + 1$, starting with weights $w = [0, 0]$ and bias $b = 0$. The code performs one complete epoch with batch size 2, giving us 4 mini-batch gradient updates.

NumPy — One Epoch of Mini-Batch Training
🐍minibatch_training.py
import numpy as np

NumPy provides fast N-dimensional arrays (ndarray) and vectorized math operations. Every matrix multiply (@ operator), element-wise operation, and random number call in this file runs as optimized C code. We alias it as ‘np’ by universal convention.

EXECUTION STATE
📚 numpy = Library for numerical computing — provides ndarray type, matrix operations via @, broadcasting, random number generation. All math runs in compiled C, not slow Python loops.
# Dataset: 8 samples of a linear function

We create a tiny dataset where the true relationship is y = 2x₁ - x₂ + 1. With only 8 samples, we can track every gradient by hand. The goal: learn the weights w = [2, -1] and bias b = 1 from these samples.

EXECUTION STATE
true relationship = y = 2·x₁ - 1·x₂ + 1. Example: x=[1.0, 0.5] → y = 2(1.0) - 0.5 + 1 = 2.5
8 samples = Small enough to show every batch step. In real training, N = 50,000+ (CIFAR-10) or 1.2M (ImageNet).
X = np.array([...])  # Input features (8×2)

Creates an 8×2 matrix where each row is one data sample with two features (x₁, x₂). Think of these as 8 measurements, each with two properties.

EXECUTION STATE
⬇ X shape = (8, 2) — 8 samples, each with 2 features
X (8×2) =
       x₁   x₂
[0]  1.0   0.5
[1]  2.0   1.0
[2]  0.5   2.0
[3]  1.5   0.5
[4]  3.0   1.5
[5]  0.5   0.5
[6]  2.5   2.0
[7]  1.0   1.5
y = 2 * X[:, 0] - X[:, 1] + 1  # Target values

Computes target values using the true linear relationship. X[:, 0] selects all values in column 0 (x₁), X[:, 1] selects column 1 (x₂). The result is a 1D array of 8 target values.

EXECUTION STATE
📚 X[:, 0] = NumPy slicing: all rows (:), column 0. Returns [1.0, 2.0, 0.5, 1.5, 3.0, 0.5, 2.5, 1.0] — a 1D array of all x₁ values.
📚 X[:, 1] = All rows, column 1. Returns [0.5, 1.0, 2.0, 0.5, 1.5, 0.5, 2.0, 1.5] — all x₂ values.
y (8,) = [2.5, 4.0, 0.0, 3.5, 5.5, 1.5, 4.0, 1.5]
→ verification = y[0] = 2(1.0) - 0.5 + 1 = 2.5 ✓ y[1] = 2(2.0) - 1.0 + 1 = 4.0 ✓ y[2] = 2(0.5) - 2.0 + 1 = 0.0 ✓ y[4] = 2(3.0) - 1.5 + 1 = 5.5 ✓
N = len(X)  # Dataset size

N = 8 is the total number of training samples. This determines how many batches we get per epoch: n_batches = N // batch_size.

EXECUTION STATE
N = 8 = Total samples. With batch_size=2: 8//2 = 4 batches per epoch. With batch_size=4: 8//4 = 2 batches.
# Initialize weights

Before training, all learnable parameters start at zero. The network ‘knows nothing’ and predicts ŷ = 0 for every input. Training will push w toward [2, -1] and b toward 1.

w = np.zeros(2)  # Weight vector

Creates a 1D array [0.0, 0.0]. These are the two weights we want to learn. The true values are w = [2, -1]. The distance from the optimum is ||(0,0) - (2,-1)|| = √5 ≈ 2.24.

EXECUTION STATE
w = [0.0, 0.0] = Initial weights — the model predicts ŷ = x₁·0 + x₂·0 + 0 = 0 for all inputs. Target: w = [2.0, -1.0].
📚 np.zeros(2) = Creates a 1D array of 2 zeros. Equivalent to np.array([0.0, 0.0]). For larger networks: np.zeros((784, 256)) creates a 784×256 weight matrix.
b = 0.0  # Bias

The bias term allows the model to output non-zero values even when all inputs are zero. Target value: b = 1.0.

EXECUTION STATE
b = 0.0 = Initial bias. Without bias, the model can only learn functions that pass through the origin. Target: b = 1.0.
lr = 0.05  # Learning rate

How big each weight update step is. lr = 0.05 means we move 5% of the gradient magnitude each step. Too large → diverge. Too small → slow convergence. This interacts with batch size: smaller batches have noisier gradients, so we often use smaller lr.

EXECUTION STATE
lr = 0.05 = Learning rate. Update rule: w = w - 0.05 × gradient. If gradient = [-11.125, -6.5625], the step is 0.05 × 11.125 = 0.556 in the w₁ direction.
batch_size = 2  # Samples per gradient update

Each gradient update uses 2 samples instead of all 8. This is a mini-batch. With N=8 and B=2, we get 4 batches per epoch. Each batch gives a noisy but unbiased estimate of the true gradient.

EXECUTION STATE
batch_size = 2 = B=2 means each gradient is computed from 2 samples. Variance ∝ 1/B, so each estimate has half the variance of single-sample SGD (B=1), and we get 4 updates per epoch instead of the single update of full-batch training.
→ trade-off = B=1: noisiest, 8 updates/epoch. B=2: moderate noise, 4 updates. B=8: no noise, 1 update. More updates often means faster convergence despite noise.
# Shuffle: randomize sample order

Shuffling is critical. Without it, the network sees samples in the same order every epoch, which can create systematic bias in gradient estimates. Shuffling ensures each batch is a random subset.

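np.random.seed(42)  # Fix random state

Seeds the random number generator for reproducibility. With seed 42, the first call to np.random.permutation(8) always returns the same shuffled order. Seeding once at the start of a run is fine in real training, because each later call advances the generator and returns a new permutation. What you should avoid is re-seeding with the same value before every epoch, which would repeat the same shuffle and the same batch compositions.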

EXECUTION STATE
📚 np.random.seed(42) = Sets the internal state of NumPy’s random number generator. After this call, all ‘random’ operations produce the same sequence. Used for debugging and reproducibility only.
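
One detail worth checking for yourself: seeding once does not freeze the shuffle order for the whole run, while re-seeding before every shuffle does. A minimal sketch using the same legacy `np.random` interface as the walkthrough:

```python
import numpy as np

# Seed once: successive calls advance the generator, so each epoch
# gets its own permutation, and the whole run is still reproducible.
np.random.seed(42)
epoch_orders = [np.random.permutation(8) for _ in range(3)]
for epoch, order in enumerate(epoch_orders):
    print(f"epoch {epoch}: {order}")

# Re-seeding before every shuffle repeats the same order each time.
repeated = []
for _ in range(3):
    np.random.seed(42)
    repeated.append(np.random.permutation(8))
print("re-seeded orders identical:",
      all(np.array_equal(repeated[0], r) for r in repeated))
```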
indices = np.random.permutation(N)  # Shuffle order

Generates a random permutation of [0, 1, 2, 3, 4, 5, 6, 7]. This is the order in which we will visit the 8 samples. Each epoch should use a different permutation.

EXECUTION STATE
📚 np.random.permutation(N) = Returns a random ordering of integers 0 to N-1. Like shuffling a deck of 8 cards. Each successive call (or a different seed) gives a different order.
⬇ arg: N = 8 = Generate a permutation of 8 elements: [0, 1, 2, 3, 4, 5, 6, 7] in random order.
⬆ indices = [1, 5, 0, 7, 2, 4, 3, 6]
→ meaning = Visit sample 1 first, then 5, then 0, then 7, then 2, 4, 3, 6. Batches: [1,5], [0,7], [2,4], [3,6].
X_shuf = X[indices]  # Reorder inputs

Fancy indexing: X[indices] selects rows of X in the order given by indices. Row 1 becomes the first row, row 5 becomes the second, etc.

EXECUTION STATE
📚 X[indices] = NumPy fancy indexing: X[[1,5,0,7,2,4,3,6]] returns rows 1,5,0,7,2,4,3,6 in that order. Creates a new array — X itself is unchanged.
X_shuf (8×2) =
       x₁   x₂
[0]  2.0   1.0   (was sample 1)
[1]  0.5   0.5   (was sample 5)
[2]  1.0   0.5   (was sample 0)
[3]  1.0   1.5   (was sample 7)
[4]  0.5   2.0   (was sample 2)
[5]  3.0   1.5   (was sample 4)
[6]  1.5   0.5   (was sample 3)
[7]  2.5   2.0   (was sample 6)
y_shuf = y[indices]  # Reorder targets

Apply the same permutation to targets. X_shuf[i] and y_shuf[i] remain paired — we shuffled the order but kept each (input, target) pair together.

EXECUTION STATE
y_shuf = [4.0, 1.5, 2.5, 1.5, 0.0, 5.5, 3.5, 4.0]
→ pairing check = X_shuf[0]=[2.0,1.0], y_shuf[0]=4.0. Check: 2(2.0)-1.0+1=4.0 ✓ X_shuf[1]=[0.5,0.5], y_shuf[1]=1.5. Check: 2(0.5)-0.5+1=1.5 ✓
# One epoch: process all batches

One epoch = one complete pass through the entire dataset. With batch_size=2 and N=8, one epoch consists of 4 mini-batch gradient updates. After 4 batches, every sample has been used exactly once.

n_batches = N // batch_size  # Batches per epoch

Integer division: 8 // 2 = 4 batches. Each batch uses 2 samples. If N isn’t divisible by batch_size, this loop silently drops the leftover samples; the alternative is to process a smaller final batch.

EXECUTION STATE
N // batch_size = 8 // 2 = 4. Integer division drops the remainder. If N=10, batch_size=3: 10//3 = 3 batches (9 samples used, 1 dropped).
n_batches = 4 = 4 weight updates per epoch. Compare: full-batch = 1 update, SGD (B=1) = 8 updates.
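
If dropping the leftover samples is not what you want, the batch boundaries can be generated so that the final, smaller batch is kept. This is a sketch, not part of the walkthrough code; the helper name `batch_slices` is hypothetical, and its `drop_last` flag is named after the PyTorch DataLoader option with the same meaning.

```python
def batch_slices(n_samples, batch_size, drop_last=False):
    """Yield (start, end) index pairs covering one epoch."""
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        if drop_last and end - start < batch_size:
            break                      # skip the incomplete final batch
        yield start, end

print(list(batch_slices(10, 3)))                  # keeps the final batch of 1
print(list(batch_slices(10, 3, drop_last=True)))  # same count as N // batch_size
```

When the final batch is kept, the gradient should be averaged over the actual number of samples in the slice (`end - start`), not over `batch_size`.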
for batch_idx in range(n_batches):  # Batch loop

Iterates through each mini-batch. batch_idx goes 0, 1, 2, 3. Each iteration: (1) slice a batch, (2) compute predictions, (3) compute gradients, (4) update weights. This is the core training rhythm.

LOOP TRACE · 4 iterations
batch_idx=0
Batch 0 = Samples [1,5]: X_b=[[2.0,1.0],[0.5,0.5]], y_b=[4.0,1.5]
loss = 9.1250 — predictions are all 0 (weights are zero)
after update = w=[0.4375, 0.2375], b=0.2750
batch_idx=1
Batch 1 = Samples [0,7]: X_b=[[1.0,0.5],[1.0,1.5]], y_b=[2.5,1.5]
loss = 1.4854 — much lower, weights already improving
after update = w=[0.5425, 0.3116], b=0.3800
batch_idx=2
Batch 2 = Samples [2,4]: X_b=[[0.5,2.0],[3.0,1.5]], y_b=[0.0,5.5]
loss = 5.3878 — loss spikes! This batch has very different samples
after update = w=[0.9644, 0.4110], b=0.4675
batch_idx=3
Batch 3 = Samples [3,6]: X_b=[[1.5,0.5],[2.5,2.0]], y_b=[3.5,4.0]
loss = 0.9975 — loss drops again as weights approach true values
after update = w=[1.1054, 0.4755], b=0.5515
start = batch_idx * batch_size  # Batch start index

Calculates where this batch begins in the shuffled array. batch_idx=0 → start=0, batch_idx=1 → start=2, batch_idx=2 → start=4, batch_idx=3 → start=6.

EXECUTION STATE
start values = batch 0: 0×2=0, batch 1: 1×2=2, batch 2: 2×2=4, batch 3: 3×2=6
end = start + batch_size  # Batch end index

The slice X_shuf[start:end] selects ‘batch_size’ consecutive rows. Python slicing is exclusive at the end, so [0:2] gives rows 0 and 1.

EXECUTION STATE
end values = batch 0: 0+2=2, batch 1: 2+2=4, batch 2: 4+2=6, batch 3: 6+2=8
X_b = X_shuf[start:end]  # Batch inputs

Slices 2 rows from the shuffled input array. For batch 0: X_shuf[0:2] = the first two shuffled samples. This is the mini-batch that will produce this step’s gradient.

EXECUTION STATE
📚 array[start:end] = NumPy slicing: selects rows from index ‘start’ to ‘end-1’ (exclusive end). X_shuf[0:2] returns rows 0 and 1.
X_b (batch 0) =
[[2.0, 1.0],    (was original sample 1)
 [0.5, 0.5]]    (was original sample 5)
y_b = y_shuf[start:end]  # Batch targets

The corresponding target values for this batch. Always sliced with the same indices as X_b to maintain pairing.

EXECUTION STATE
y_b (batch 0) = [4.0, 1.5]
# Forward pass

The forward pass computes predictions for the current batch only — not the entire dataset. This is the key efficiency win of mini-batching: we only process B samples, not N.

y_pred = X_b @ w + b  # Predictions

Matrix multiply: each sample’s features dot-product with the weight vector, plus bias. X_b is (2×2), w is (2,), so X_b @ w is (2,) — one prediction per sample.

EXECUTION STATE
📚 @ operator = Matrix multiplication in NumPy. For X_b(2×2) @ w(2,): each row of X_b is dotted with w. Row 0: [2.0,1.0]·[0,0]+0 = 0. Row 1: [0.5,0.5]·[0,0]+0 = 0.
⬆ y_pred (batch 0, step 0) = [0.0000, 0.0000] — all zeros because w=[0,0] and b=0. The model knows nothing yet.
error = y_pred - y_b  # Prediction errors

Element-wise subtraction: how far each prediction is from the truth. Negative error means we predicted too low.

EXECUTION STATE
error (batch 0) = [0.0-4.0, 0.0-1.5] = [-4.0000, -1.5000]
→ meaning = Sample 1: predicted 0, truth 4.0 → off by -4.0 (too low). Sample 5: predicted 0, truth 1.5 → off by -1.5.
loss = np.mean(error ** 2)  # MSE loss

Mean Squared Error: average of squared errors. Squaring makes all errors positive and penalizes large errors more. This loss is computed over the mini-batch only, not the full dataset.

EXECUTION STATE
📚 np.mean() = Computes the arithmetic mean of an array. np.mean([16.0, 2.25]) = (16.0 + 2.25) / 2 = 9.125.
error ** 2 = [(-4.0)², (-1.5)²] = [16.0, 2.25]
⬆ loss (batch 0) = 9.1250 — high because weights are still at zero. The full-dataset loss at w=[0,0] is 10.6562.
# Compute gradients

The gradient tells us which direction to adjust weights to reduce the loss. For MSE with linear model ŷ = Xw + b, the gradients have closed-form expressions. We compute them over the mini-batch.

dw = (2/batch_size) * (X_b.T @ error)  # Weight gradient

The gradient of MSE loss with respect to w. The factor 2/B comes from the derivative of (1/B)∑(ŷ-y)². X_b.T transposes (2×2) to (2×2), then multiplies by error (2,) to get a (2,) gradient vector.

EXECUTION STATE
📚 X_b.T = NumPy transpose: flips rows and columns. X_b (2×2) = [[2.0, 1.0], [0.5, 0.5]] becomes X_b.T (2×2) = [[2.0, 0.5], [1.0, 0.5]].
X_b.T @ error = [[2.0, 0.5], [1.0, 0.5]] @ [-4.0, -1.5] = [2.0×(-4.0) + 0.5×(-1.5), 1.0×(-4.0) + 0.5×(-1.5)] = [-8.75, -4.75]
2/batch_size = 2/2 = 1.0. This is the MSE derivative factor: d/dw[(1/B)∑(Xw+b-y)²] = (2/B)·Xᵀ·error.
⬆ dw (batch 0) = [-8.7500, -4.7500] — negative gradient means ‘increase w to reduce loss’. After update: w moves in positive direction.
db = (2/batch_size) * np.sum(error)  # Bias gradient

The gradient of MSE with respect to bias b. Since b affects all predictions equally, its gradient is the sum (not dot product) of errors, times 2/B.

EXECUTION STATE
📚 np.sum(error) = Sum of all elements: (-4.0) + (-1.5) = -5.5. For bias gradient, we sum errors because ∂loss/∂b involves summing over all samples.
⬆ db (batch 0) = -5.5000 — negative means ‘increase b to reduce loss’.
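
A quick way to gain confidence in closed-form gradients like these is to compare them with finite differences. The sketch below does that for batch 0 of the walkthrough; the helper `mse` and the step size `eps` are illustrative choices.

```python
import numpy as np

X_b = np.array([[2.0, 1.0], [0.5, 0.5]])   # batch 0 from the walkthrough
y_b = np.array([4.0, 1.5])
w, b = np.zeros(2), 0.0

def mse(w, b):
    return np.mean((X_b @ w + b - y_b) ** 2)

# Closed-form gradients, as in the training loop.
error = X_b @ w + b - y_b
dw = (2 / len(X_b)) * (X_b.T @ error)
db = (2 / len(X_b)) * np.sum(error)

# Central finite differences as an independent check.
eps = 1e-6
dw_num = np.array([(mse(w + eps * e, b) - mse(w - eps * e, b)) / (2 * eps)
                   for e in np.eye(2)])
db_num = (mse(w, b + eps) - mse(w, b - eps)) / (2 * eps)

print("analytic dw:", dw, " numeric dw:", dw_num)
print("analytic db:", db, " numeric db:", db_num)
```

Because the loss is quadratic in w and b, the two agree up to floating-point rounding.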
# Update weights

The weight update step: move in the negative gradient direction. This is vanilla SGD applied to a mini-batch gradient. The update is the same formula from Chapter 8, just using a batch gradient instead of the full gradient.

w = w - lr * dw  # Weight update

SGD update rule: subtract lr times the gradient. Both components of dw are negative here, so subtracting them increases both weights. That moves w₁ toward its target of 2, but moves w₂ away from its target of -1 for now.

EXECUTION STATE
lr * dw = 0.05 × [-8.75, -4.75] = [-0.4375, -0.2375]
⬆ w_new (after batch 0) = [0, 0] - [-0.4375, -0.2375] = [0.4375, 0.2375]
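→ progress = Target: [2.0, -1.0]. After 1 batch: [0.4375, 0.2375]. We moved 22% toward w₁=2 but w₂ went the wrong way (toward +0.24 instead of -1.0). This is not only batch noise: the full-dataset gradient at w=[0,0] is [-11.125, -6.5625], which also pushes w₂ upward, because every prediction starts too low and all features are positive. w₂ starts to decrease only once some samples are over-predicted.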
b = b - lr * db  # Bias update

Same SGD update for the bias. db = -5.5, so b increases by 0.05 × 5.5 = 0.275.

EXECUTION STATE
⬆ b_new (after batch 0) = 0.0 - 0.05×(-5.5) = 0.2750. Target: 1.0. Moved 27.5% of the way.
print(f"Batch {batch_idx}: loss=...")  # Monitor training

Prints the loss for each mini-batch. Notice how loss fluctuates between batches — this is the characteristic noise of mini-batch training. The overall trend should be downward.

EXECUTION STATE
output = Batch 0: loss=9.1250 Batch 1: loss=1.4854 Batch 2: loss=5.3878 Batch 3: loss=0.9975
→ note the spike at batch 2 = Loss went 9.13 → 1.49 → 5.39 → 1.00. The spike at batch 2 is normal: each loss is measured before that batch's update, and the current weights happen to fit this batch (samples 2 and 4, with features [0.5,2.0] and [3.0,1.5]) poorly. Sample 4 has the largest target (5.5) but is predicted at only about 2.47.
print(f" w=[...], b=...")  # Weight trajectory

Tracking how weights evolve across batches shows the ‘random walk toward the optimum’ characteristic of mini-batch training.

EXECUTION STATE
weight trajectory = Start: w=[0.0000, 0.0000], b=0.0000 B0: w=[0.4375, 0.2375], b=0.2750 B1: w=[0.5425, 0.3116], b=0.3800 B2: w=[0.9644, 0.4110], b=0.4675 B3: w=[1.1054, 0.4755], b=0.5515
→ target = w=[2.0, -1.0], b=1.0. After 1 epoch (4 batches): w₁ moved from 0 to 1.11 (55% of the way). w₂ is at +0.48 but should be -1.0 — needs more epochs.
# Evaluate on full dataset

After processing all 4 batches (one epoch), we evaluate the model on the entire dataset to measure overall progress. This is the loss we actually care about — the per-batch losses during training are just noisy estimates.

y_final = X @ w + b  # Full dataset predictions

Computes predictions for all 8 samples using the learned weights. X(8×2) @ w(2,) + b gives 8 predictions.

EXECUTION STATE
w = [1.1054, 0.4755]
b = 0.5515
⬆ y_final = Predictions for all 8 samples using learned w and b. These are closer to the true y values than the initial all-zeros predictions.
full_loss = np.mean((y_final - y) ** 2)  # Epoch-end loss

The true MSE loss over the entire dataset. This dropped from 10.6562 (at initialization) to 0.9971 after just one epoch of mini-batch training. The model is learning.

EXECUTION STATE
⬆ full_loss = 0.9971 — down from 10.6562 (90.6% reduction in 1 epoch!)
→ learned model = The model predicts ŷ = 1.105·x₁ + 0.476·x₂ + 0.552. It is moving toward the true 2·x₁ - 1·x₂ + 1 but has not converged; w₂ still has the wrong sign. More epochs needed.
print(f"After 1 epoch: full_loss=...")

Final output: After 1 epoch: full_loss=0.9971. With continued training (more epochs), the loss would decrease further and the parameters would approach the true values w = [2, -1] and b = 1.

EXECUTION STATE
output = After 1 epoch: full_loss=0.9971
import numpy as np

# ── Dataset: 8 samples, y = 2*x1 - x2 + 1 ──
X = np.array([
    [1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
    [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]
])
y = 2 * X[:, 0] - X[:, 1] + 1
N = len(X)

# ── Initialize weights ──
w = np.zeros(2)
b = 0.0
lr = 0.05
batch_size = 2

# ── Shuffle: randomize sample order ──
np.random.seed(42)
indices = np.random.permutation(N)
X_shuf = X[indices]
y_shuf = y[indices]

# ── One epoch: process all batches ──
n_batches = N // batch_size

for batch_idx in range(n_batches):
    start = batch_idx * batch_size
    end = start + batch_size
    X_b = X_shuf[start:end]
    y_b = y_shuf[start:end]

    # Forward pass
    y_pred = X_b @ w + b
    error = y_pred - y_b
    loss = np.mean(error ** 2)

    # Compute gradients
    dw = (2 / batch_size) * (X_b.T @ error)
    db = (2 / batch_size) * np.sum(error)

    # Update weights
    w = w - lr * dw
    b = b - lr * db

    print(f"Batch {batch_idx}: loss={loss:.4f}")
    print(f"  w=[{w[0]:.4f}, {w[1]:.4f}], b={b:.4f}")

# ── Evaluate on full dataset ──
y_final = X @ w + b
full_loss = np.mean((y_final - y) ** 2)
print(f"\nAfter 1 epoch: full_loss={full_loss:.4f}")

Look at how the loss fluctuates across batches: 9.13 → 1.49 → 5.39 → 1.00. This is completely normal. Each batch is a different random subset of the data, so the loss it sees differs from the overall loss. The spike at batch 2 happened because the weights at that point fit those particular samples (with features [0.5, 2.0] and [3.0, 1.5]) poorly; the loss is measured before the batch's own update is applied.

Key Observation: After just one epoch of mini-batch training (4 gradient updates), the full-dataset loss dropped from 10.66 to 1.00 — a 90.6% reduction. Full-batch gradient descent with the same learning rate would have made only 1 update in the same time, achieving much less progress.

This is the fundamental advantage of mini-batch training: more frequent updates lead to faster learning, even though each individual update is noisier.
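
The comparison with full-batch gradient descent can be run directly. The sketch below hard-codes the batch order reported in the walkthrough ([1,5], [0,7], [2,4], [3,6]) so that it does not depend on the random generator; the helper names are illustrative.

```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

def mse_grads(X_b, y_b, w, b):
    """Gradients of the MSE loss on one batch w.r.t. w and b."""
    error = X_b @ w + b - y_b
    return (2 / len(X_b)) * (X_b.T @ error), (2 / len(X_b)) * error.sum()

def full_loss(w, b):
    return np.mean((X @ w + b - y) ** 2)

lr = 0.05

# Mini-batch: four updates, using the batch order from the walkthrough.
w, b = np.zeros(2), 0.0
for batch in ([1, 5], [0, 7], [2, 4], [3, 6]):
    dw, db = mse_grads(X[batch], y[batch], w, b)
    w, b = w - lr * dw, b - lr * db
minibatch_loss = full_loss(w, b)

# Full batch: one update from the same starting point, same data budget.
w, b = np.zeros(2), 0.0
dw, db = mse_grads(X, y, w, b)
w, b = w - lr * dw, b - lr * db
fullbatch_loss = full_loss(w, b)

print(f"initial loss:               {full_loss(np.zeros(2), 0.0):.4f}")
print(f"after 4 mini-batch updates: {minibatch_loss:.4f}")
print(f"after 1 full-batch update:  {fullbatch_loss:.4f}")
```

This is one dataset, one learning rate, and one shuffle, so the size of the gap should not be read as a general rule; a full-batch run could also use a larger learning rate.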


PyTorch Dataset and DataLoader

PyTorch provides two classes that handle all the batching machinery we just built by hand:

  • Dataset — defines how to access individual samples. You implement two methods: __len__() returns the total number of samples, and __getitem__(idx) returns one (features, target) pair
  • DataLoader — wraps a Dataset and handles batching, shuffling, and parallel loading. You iterate over it and get ready-made batches

This is a clean separation of concerns. The Dataset knows about your data format (images? text? tabular?). The DataLoader knows about training logistics (batch size, shuffling, workers). You can swap either independently.

The Dataset Contract

Every map-style PyTorch Dataset must implement two methods:

| Method | What It Returns | When PyTorch Calls It |
|---|---|---|
| `__len__(self)` | int: total number of samples | Once, to determine how many batches per epoch |
| `__getitem__(self, idx)` | (features, target) tuple | B times per batch, once per sample |

The DataLoader calls __getitem__ for each index in the current batch, then stacks the results into tensors. If each __getitem__ returns a tuple of (tensor(2,), tensor()), the batch becomes (tensor(B, 2), tensor(B,)).
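
To make the contract concrete without depending on PyTorch, here is a small stand-in written with NumPy. The class name `ToyDataset` and the helper `collate` are hypothetical; they mimic the shape behaviour described above but are not PyTorch's implementation.

```python
import numpy as np

class ToyDataset:
    """Hypothetical stand-in that follows the same two-method contract."""

    def __init__(self, X, y):
        self.X = np.asarray(X, dtype=np.float32)
        self.y = np.asarray(y, dtype=np.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]      # one (features, target) pair

def collate(samples):
    """Stack a list of (features, target) pairs into two batch arrays."""
    features, targets = zip(*samples)
    return np.stack(features), np.array(targets)

X = [[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5]]
y = [2.5, 4.0, 0.0, 3.5]
dataset = ToyDataset(X, y)

batch_indices = [3, 0, 2]                               # B = 3
X_batch, y_batch = collate([dataset[i] for i in batch_indices])
print(len(dataset), X_batch.shape, y_batch.shape)       # 4 (3, 2) (3,)
```

Each sample has shape (2,) for the features and is a scalar for the target, so three samples collate to shapes (3, 2) and (3,).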

DataLoader: The Batch Machine

When you write for X_batch, y_batch in loader, the DataLoader does the following each iteration:

  1. Generate indices: Pick B indices from [0, N-1]. If shuffle=True, the order is randomized at the start of each epoch
  2. Fetch samples: Call dataset[idx] for each index. With num_workers > 0, this happens in parallel processes
  3. Collate: Stack individual sample tensors into batch tensors. The default collate function handles most cases
  4. Transfer: With pin_memory=True, data is placed in pinned (page-locked) memory for faster GPU transfer
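
Steps 1 to 3 above can be sketched as a small generator. This is a simplified, single-process illustration written with NumPy, not PyTorch's actual implementation: the function name `iterate_batches` is hypothetical, and parallel workers and pinned memory (step 4) are left out.

```python
import numpy as np

def iterate_batches(dataset, batch_size, shuffle=True, rng=None):
    """Hypothetical, single-process sketch of one epoch of batch delivery."""
    n = len(dataset)
    # Step 1: generate indices (reshuffled on every call, i.e. every epoch).
    if shuffle:
        rng = rng if rng is not None else np.random.default_rng()
        order = rng.permutation(n)
    else:
        order = np.arange(n)
    for start in range(0, n, batch_size):
        batch_idx = order[start:start + batch_size]
        # Step 2: fetch samples one at a time via dataset[idx].
        samples = [dataset[int(i)] for i in batch_idx]
        # Step 3: collate, stacking per-sample arrays into batch arrays.
        features, targets = zip(*samples)
        yield np.stack(features), np.array(targets)

# Any object with __len__ and __getitem__ works; a list of pairs is enough here.
X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
              [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1
dataset = list(zip(X, y))

for X_batch, y_batch in iterate_batches(dataset, batch_size=4,
                                        rng=np.random.default_rng(0)):
    print(X_batch.shape, y_batch.shape)
```

Calling the generator again starts a new epoch with a fresh permutation, which is the behaviour `shuffle=True` provides.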

Why Shuffling Matters

Without shuffling, the network sees the same batch compositions every epoch. If samples 0-3 happen to be easy and samples 4-7 are hard, the first batches always underestimate the true gradient while the last batches always overestimate it. This creates a systematic oscillation in the weight trajectory.

Shuffling each epoch ensures that batch compositions are random, making each mini-batch gradient an independent, unbiased sample. This is a practical requirement for the theoretical guarantees (unbiasedness, variance reduction) we derived earlier.

Always shuffle training data. Set shuffle=True for training loaders. For validation and test loaders, set shuffle=False so results are reproducible.
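
The difference is easy to see by printing the batch compositions over a few epochs. A minimal sketch, with a hypothetical helper `epoch_batches` and 8 samples split into batches of 4:

```python
import numpy as np

def epoch_batches(n_samples, batch_size, shuffle, rng):
    """Return the sample indices in each batch for one epoch."""
    order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
    return [order[i:i + batch_size].tolist()
            for i in range(0, n_samples, batch_size)]

rng = np.random.default_rng(0)
for shuffle in (False, True):
    print(f"shuffle={shuffle}")
    for epoch in range(3):
        print(f"  epoch {epoch}: {epoch_batches(8, 4, shuffle, rng)}")
```

Without shuffling, samples 0-3 and 4-7 are grouped together in every epoch; with shuffling, the groupings are redrawn each time while every sample is still visited once per epoch.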

Here is the complete PyTorch implementation — the same problem as the NumPy version, but using Dataset and DataLoader:

PyTorch — Dataset, DataLoader, and Training Loop
🐍pytorch_dataloader.py
import torch

PyTorch is the deep learning framework. It provides tensors (like NumPy arrays but with GPU support and automatic differentiation), neural network modules, optimizers, and the data loading utilities we’ll use here.

EXECUTION STATE
📚 torch = PyTorch core library. Provides: Tensor (GPU-accelerated arrays), autograd (automatic differentiation), nn (neural network layers), optim (optimizers). Version 2.0+ supports torch.compile() for faster execution.
from torch.utils.data import Dataset, DataLoader

The two key classes for data loading in PyTorch. Dataset defines HOW to access individual samples. DataLoader handles batching, shuffling, and parallel loading. This separation is a clean design pattern: data storage is decoupled from data delivery.

EXECUTION STATE
📚 Dataset = Abstract base class. You subclass it and implement __len__() and __getitem__(). Any data source (files, database, API) can be wrapped as a Dataset.
📚 DataLoader = Wraps a Dataset and provides: batching, shuffling, parallel loading (num_workers), memory pinning (pin_memory for GPU), and automatic collation of samples into tensors.
4# Custom Dataset class

PyTorch’s data pipeline starts with a Dataset. You tell PyTorch HOW to access one sample, and DataLoader handles the rest (batching, shuffling, parallelism). This is the Strategy pattern: swap different Datasets without changing the training loop.

5class RegressionDataset(Dataset):

Inherits from torch.utils.data.Dataset. Must implement two methods: __len__() returns the total number of samples, __getitem__(idx) returns one sample. PyTorch calls these internally when iterating.

EXECUTION STATE
⬇ input: Dataset (parent class) = Abstract base class from torch.utils.data. Defines the interface that DataLoader expects. Any class that implements __len__ and __getitem__ works as a Dataset.
→ required methods = __len__(): return int (dataset size) __getitem__(idx): return tuple (features, label) for sample at index idx
6def __init__(self, X, y):

Constructor: receives raw data (NumPy arrays) and converts them to PyTorch tensors for efficient computation. This is where you’d also apply any one-time preprocessing (normalization, encoding, etc.).

EXECUTION STATE
⬇ input: X = NumPy array (8,2) — input features. Will be converted to a float32 tensor.
⬇ input: y = NumPy array (8,) — target values. Will be converted to a float32 tensor.
7self.X = torch.tensor(X, dtype=torch.float32)

Converts the NumPy array to a PyTorch tensor with 32-bit floating point precision. float32 is the standard dtype for neural network training — it balances precision and memory. (float16/bfloat16 is used in mixed-precision training.)

EXECUTION STATE
📚 torch.tensor() = Creates a new tensor from data. Unlike torch.as_tensor(), this always copies the data. dtype=torch.float32 ensures 32-bit precision.
⬇ arg: dtype=torch.float32 = 32-bit floating point. Each value uses 4 bytes. For 8 samples × 2 features: 8×2×4 = 64 bytes. With float64: 128 bytes. With float16: 32 bytes.
self.X = tensor([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5], [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]]) shape (8, 2)
8self.y = torch.tensor(y, dtype=torch.float32)

Converts target values to a float32 tensor. Storing tensors in __init__ means the conversion happens once, not every time __getitem__ is called.

EXECUTION STATE
self.y = tensor([2.5, 4.0, 0.0, 3.5, 5.5, 1.5, 4.0, 1.5]) shape (8,)
10def __len__(self):

Returns the total number of samples. DataLoader calls this to determine how many batches to create. len(dataset) = N = 8.

EXECUTION STATE
⬆ returns = 8 — the number of samples in the dataset. DataLoader uses this with batch_size to compute n_batches = 8 // 2 = 4.
11return len(self.X)

len() on a tensor returns the size of the first dimension. For self.X with shape (8, 2), len returns 8.

EXECUTION STATE
📚 len(tensor) = Returns tensor.shape[0] — the size of dimension 0. For (8,2): returns 8. For (3,4,5): returns 3.
13def __getitem__(self, idx):

Returns ONE sample given its index. DataLoader calls this for each sample index in the current batch. For batch [3, 6]: calls __getitem__(3) and __getitem__(6), then stacks results into tensors.

EXECUTION STATE
⬇ input: idx = Integer index (0 to 7). DataLoader generates these indices, handles shuffling, and calls __getitem__ for each.
⬆ returns = Tuple (X[idx], y[idx]). Example: idx=0 returns (tensor([1.0, 0.5]), tensor(2.5)).
14return self.X[idx], self.y[idx]

Returns a tuple of (features, target) for the requested index. DataLoader’s default collate function automatically stacks these tuples into batched tensors.

EXECUTION STATE
example: idx=0 = returns (tensor([1.0, 0.5]), tensor(2.5))
example: idx=4 = returns (tensor([3.0, 1.5]), tensor(5.5))
16# Create dataset

Now we instantiate our custom Dataset with the same 8-sample data from the NumPy example. The Dataset wraps raw arrays into PyTorch’s tensor ecosystem.

17import numpy as np

We use NumPy to create the raw data arrays, then pass them to our Dataset which converts to PyTorch tensors. In practice, data often starts as NumPy arrays, pandas DataFrames, or files on disk.

18X_np = np.array([...]) — Same 8 samples

Identical to our NumPy example: 8 data points with 2 features each. The underscore _np suffix is a convention to distinguish NumPy arrays from PyTorch tensors.

EXECUTION STATE
X_np shape = (8, 2) — same data as before
23y_np = 2 * X_np[:, 0] - X_np[:, 1] + 1

Same target computation: y = 2x₁ - x₂ + 1. Produces [2.5, 4.0, 0.0, 3.5, 5.5, 1.5, 4.0, 1.5].

EXECUTION STATE
y_np = [2.5, 4.0, 0.0, 3.5, 5.5, 1.5, 4.0, 1.5]
25dataset = RegressionDataset(X_np, y_np) — Instantiate

Creates our Dataset object. Internally, X_np and y_np are converted to float32 tensors. From now on, we interact with this dataset through __len__ and __getitem__.

EXECUTION STATE
dataset = RegressionDataset with 8 samples. dataset[0] = (tensor([1.0, 0.5]), tensor(2.5)). len(dataset) = 8.
26print(f"Dataset size: {len(dataset)}")

Calls dataset.__len__() which returns 8. Output: Dataset size: 8.

EXECUTION STATE
output = Dataset size: 8
27print(f"Sample 0: X=..., y=...") — Inspect one sample

Calls dataset.__getitem__(0). Returns (tensor([1.0, 0.5]), tensor(2.5)). This is how DataLoader accesses individual samples.

EXECUTION STATE
dataset[0] = (tensor([1.0000, 0.5000]), tensor(2.5000))
output = Sample 0: X=tensor([1.0000, 0.5000]), y=2.5
29# Create DataLoader

DataLoader is the workhorse that converts a single-sample Dataset into an efficient batch iterator. It handles shuffling, batching, parallel loading, and memory pinning — all with one constructor call.

30loader = DataLoader(...) — The batch machine

Creates a DataLoader that wraps our dataset. When we iterate over ‘loader’, it yields (X_batch, y_batch) tuples where each batch contains batch_size samples.

EXECUTION STATE
📚 DataLoader() = torch.utils.data.DataLoader: wraps a Dataset and provides an iterable of batches. Handles shuffling, batching, parallel workers, and memory pinning for GPU transfer.
⬇ arg 1: dataset = Our RegressionDataset with 8 samples. DataLoader will call dataset[i] to fetch individual samples, then stack them into batch tensors.
⬇ arg 2: batch_size=2 = Each iteration yields 2 samples. With 8 total: 8/2 = 4 batches per epoch. Common values: 32, 64, 128, 256 (powers of 2 for GPU efficiency).
⬇ arg 3: shuffle=True = Re-randomizes sample order at the start of each epoch. Critical for training: prevents the model from learning order-dependent patterns. Set False for validation/test (reproducibility).
⬇ arg 4: drop_last=False = If N is not divisible by batch_size, keep the smaller final batch. With drop_last=True, that batch is discarded. Example: N=10, B=3: 3 full batches + 1 partial (1 sample) or drop it.
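The N=10, B=3 case can be checked directly. A small sketch, assuming a toy TensorDataset of 10 integers rather than the 8-sample regression data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with N = 10 samples (an assumption for illustration).
data = TensorDataset(torch.arange(10))

keep_loader = DataLoader(data, batch_size=3, drop_last=False)
drop_loader = DataLoader(data, batch_size=3, drop_last=True)

# Collect the size of every batch each loader yields.
keep_sizes = [len(batch) for (batch,) in keep_loader]
drop_sizes = [len(batch) for (batch,) in drop_loader]

print(keep_sizes, len(keep_loader))  # [3, 3, 3, 1] 4
print(drop_sizes, len(drop_loader))  # [3, 3, 3] 3
```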
36# Training loop

The standard PyTorch training loop: (1) iterate over DataLoader, (2) forward pass, (3) compute loss, (4) backward pass (autograd), (5) update weights, (6) zero gradients. This pattern is universal across PyTorch projects.

37model_w = torch.zeros(2, requires_grad=True)

Creates a weight tensor initialized to zeros with gradient tracking enabled. requires_grad=True tells PyTorch’s autograd to record operations on this tensor so it can compute gradients during backward().

EXECUTION STATE
📚 torch.zeros(2) = Creates a 1D tensor of 2 zeros: tensor([0., 0.]). Shape: (2,).
⬇ requires_grad=True = Enables automatic differentiation. PyTorch will build a computation graph for every operation involving model_w, then compute ∂loss/∂w during .backward().
model_w = tensor([0., 0.], requires_grad=True)
38model_b = torch.zeros(1, requires_grad=True)

Bias parameter, also gradient-tracked. Shape (1,) instead of scalar so broadcasting works consistently with batch operations.

EXECUTION STATE
model_b = tensor([0.], requires_grad=True)
39lr = 0.05 — Same learning rate as NumPy version

Identical learning rate for fair comparison with the NumPy implementation.

41for epoch in range(3): — Epoch loop

Trains for 3 full passes through the dataset. Each epoch: DataLoader shuffles the data and yields 4 batches. So we do 3 × 4 = 12 weight updates total.

LOOP TRACE · 3 iterations
epoch=0
Epoch 0 = First pass through all 8 samples in 4 batches. Shuffled into a new random order. Weights start at [0, 0].
epoch=1
Epoch 1 = Second pass. DataLoader re-shuffles the 8 samples into a different order. Different batches than epoch 0.
epoch=2
Epoch 2 = Third pass. Again re-shuffled. By now, weights should be closer to [2, -1], b closer to 1.
42epoch_loss = 0.0 — Accumulate batch losses

Running sum of losses across all batches in this epoch. Divided by n_batches at the end to get the average loss for reporting.

43n_batches = 0 — Count batches

Counter for how many batches we process. len(loader) would give the same number here (it already accounts for drop_last), but counting explicitly also works for data sources that do not report a length.

45for X_batch, y_batch in loader: — Batch iteration

This is the magic line. DataLoader handles everything: (1) shuffles indices, (2) groups them into batches of 2, (3) calls dataset[idx] for each, (4) stacks results into tensors. Each iteration yields a (features, targets) tuple.

EXECUTION STATE
📚 iterating DataLoader = Each iteration: DataLoader picks batch_size indices (shuffled), calls dataset.__getitem__ for each, stacks them into tensors via the collate function. Returns (X_batch, y_batch).
X_batch shape = (2, 2) — 2 samples, 2 features each. Automatically stacked from individual (2,) tensors.
y_batch shape = (2,) — 2 target values
46# Forward pass

Compute predictions for this batch. PyTorch automatically builds the computation graph as operations are performed on gradient-tracked tensors.

47y_pred = X_batch @ model_w + model_b — Predictions

Same linear model as NumPy: ŷ = Xw + b. The @ operator works on PyTorch tensors just like NumPy. Because model_w has requires_grad=True, PyTorch records this matmul in the computation graph.

EXECUTION STATE
📚 @ on tensors = torch.matmul under the hood. X_batch(2,2) @ model_w(2,) = y_pred(2,). Identical to NumPy’s @ but with autograd support.
y_pred shape = (2,) — one prediction per sample in the batch
48loss = torch.mean((y_pred - y_batch) ** 2) — MSE

Same MSE loss as NumPy. torch.mean computes the average. The entire expression is tracked by autograd — when we call loss.backward(), gradients flow back through ** 2, subtraction, @, to model_w and model_b.

EXECUTION STATE
📚 torch.mean() = Computes the mean of all elements in the tensor. Same as np.mean() but with autograd support. Can also take a dim argument for per-dimension means.
loss = A scalar tensor (0-dimensional). Contains the MSE value. Has grad_fn=MeanBackward0 indicating it’s part of the computation graph.
50# Backward pass (autograd computes gradients)

This is where PyTorch shines over NumPy. Instead of manually deriving and coding gradient formulas, we call .backward() and autograd computes all gradients automatically.

51loss.backward() — Automatic differentiation

Computes ∂loss/∂model_w and ∂loss/∂model_b by traversing the computation graph in reverse. After this call, model_w.grad contains the gradient of loss w.r.t. w, and model_b.grad contains the gradient w.r.t. b. Exactly matches our manual NumPy formulas.

EXECUTION STATE
📚 .backward() = Runs reverse-mode automatic differentiation. Walks the computation graph from loss back to every leaf tensor with requires_grad=True. Accumulates gradients into .grad attributes.
after backward() = model_w.grad = ∂loss/∂w (same as our manual dw formula) model_b.grad = ∂loss/∂b (same as our manual db formula) These are the SAME numbers we computed by hand in the NumPy version.
53# Update weights (no_grad prevents tracking)

We need to modify the weights but don’t want PyTorch to track the update as part of the computation graph. torch.no_grad() temporarily disables gradient tracking.

54with torch.no_grad(): — Disable gradient tracking

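Context manager that disables autograd. Without it, the in-place update model_w -= lr * model_w.grad fails: PyTorch raises a RuntimeError because a leaf tensor that requires grad cannot be modified in place while tracking is on. Inside no_grad(), operations are not tracked, so the update is allowed.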

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables gradient computation. Operations inside execute faster and use less memory. Essential for weight updates and inference.
→ why needed = Without no_grad(): w -= lr*grad raises RuntimeError (a leaf Variable that requires grad is being used in an in-place operation). no_grad() tells PyTorch ‘this is not part of the model, don’t track it’.
55model_w -= lr * model_w.grad — SGD update

Identical to the NumPy version: w = w - 0.05 × gradient. The -= operator modifies the tensor in-place (important: in-place ops on leaf tensors only work inside no_grad()).

EXECUTION STATE
model_w.grad = The gradient computed by .backward(). Same values as our manual (2/B) * (X_b.T @ error) formula.
lr * model_w.grad = 0.05 × gradient — the step size in weight space
56model_b -= lr * model_b.grad — Bias update

Same SGD update for the bias parameter.

58# Zero gradients for next batch

CRITICAL step that beginners often forget. PyTorch ACCUMULATES gradients by default — .backward() ADDS to .grad, not replaces it. We must explicitly zero them before the next batch.

59model_w.grad.zero_() — Reset weight gradients

Sets model_w.grad to all zeros. The underscore suffix (_) indicates an in-place operation. Without this, gradients from batch 0 would be ADDED to gradients from batch 1, giving wrong updates.

EXECUTION STATE
📚 .zero_() = In-place operation: fills the tensor with zeros. The trailing _ is PyTorch convention for in-place ops (.add_(), .mul_(), .fill_(), .zero_()). Modifies the tensor directly.
→ why zero? = PyTorch accumulates gradients to support gradient accumulation (summing gradients across multiple forward passes before updating). Default behavior: grad += new_grad. We need grad = new_grad, so we zero first.
→ what if we forget? = Gradients keep growing: batch 0 grad + batch 1 grad + batch 2 grad + ... The updates become enormous and training diverges. A common beginner bug.
60model_b.grad.zero_() — Reset bias gradients

Same zeroing for the bias gradient. Both must be zeroed before the next forward+backward pass.
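The accumulation behavior is easy to observe on a tiny example. This is an illustrative sketch; the tensors w and x are made up and are not part of the training script.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

# First backward pass: d(w·x)/dw = x
(w @ x).backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on top.
(w @ x).backward()
accumulated = w.grad.clone()

# Zero, then run backward again: back to a single copy of x.
w.grad.zero_()
(w @ x).backward()
after_zero = w.grad.clone()

print(first)        # tensor([3., 4.])
print(accumulated)  # tensor([6., 8.])
print(after_zero)   # tensor([3., 4.])
```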

62epoch_loss += loss.item() — Accumulate

Extracts the loss as a plain Python float (no gradient tracking) and adds it to the running total. .item() is important: it detaches the value from the computation graph, preventing memory leaks.

EXECUTION STATE
📚 .item() = Converts a scalar tensor to a Python float. tensor(9.125).item() = 9.125. Also detaches from the computation graph, freeing memory. Always use .item() for logging/printing.
63n_batches += 1

Count batches processed. After the inner loop: n_batches = 4 (since 8 samples / 2 batch_size = 4).

65avg = epoch_loss / n_batches — Average epoch loss

The average loss over all 4 batches gives a smoother estimate of training progress than individual batch losses.

EXECUTION STATE
avg = epoch_loss / 4. This is the number we track to monitor training progress.
66print(f"Epoch {epoch}: avg_loss=...") — Progress report

Prints the average loss for each epoch. You should see this decrease over epochs as the model learns.

EXECUTION STATE
typical output = Epoch 0: avg_loss=~4.5 Epoch 1: avg_loss=~1.0 Epoch 2: avg_loss=~0.5 (Exact values depend on shuffle order)
68print(f"Learned: w=..., b=...") — Final weights

After 3 epochs (12 weight updates), the weights should have moved toward the true values [2, -1] and the bias toward 1. Twelve small steps are not enough to converge, so expect a visible gap; more epochs would bring them closer.

EXECUTION STATE
model_w.data = Moving toward [2.0, -1.0] after 3 epochs, not yet converged. .data accesses the underlying tensor without gradient tracking.
📚 .tolist() = Converts a PyTorch tensor to a Python list. tensor([1.5, -0.3]).tolist() = [1.5, -0.3].
📚 .item() = Converts a scalar tensor to Python float. model_b is shape (1,) so model_b.item() gives a single float.
import torch
from torch.utils.data import Dataset, DataLoader

# ── Custom Dataset class ──
class RegressionDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# ── Create dataset ──
import numpy as np
X_np = np.array([
    [1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
    [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]
])
y_np = 2 * X_np[:, 0] - X_np[:, 1] + 1

dataset = RegressionDataset(X_np, y_np)
print(f"Dataset size: {len(dataset)}")
print(f"Sample 0: X={dataset[0][0]}, y={dataset[0][1]}")

# ── Create DataLoader ──
loader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    drop_last=False
)

# ── Training loop ──
model_w = torch.zeros(2, requires_grad=True)
model_b = torch.zeros(1, requires_grad=True)
lr = 0.05

for epoch in range(3):
    epoch_loss = 0.0
    n_batches = 0

    for X_batch, y_batch in loader:
        # Forward pass
        y_pred = X_batch @ model_w + model_b
        loss = torch.mean((y_pred - y_batch) ** 2)

        # Backward pass (autograd computes gradients)
        loss.backward()

        # Update weights (no_grad prevents tracking)
        with torch.no_grad():
            model_w -= lr * model_w.grad
            model_b -= lr * model_b.grad

        # Zero gradients for next batch
        model_w.grad.zero_()
        model_b.grad.zero_()

        epoch_loss += loss.item()
        n_batches += 1

    avg = epoch_loss / n_batches
    print(f"Epoch {epoch}: avg_loss={avg:.4f}")

print(f"Learned: w={model_w.data.tolist()}, b={model_b.item():.4f}")

Compare the two implementations. The PyTorch version replaces our manual index slicing with DataLoader iteration, our manual gradient formulas with loss.backward(), and our manual weight updates with direct tensor operations. The core training rhythm is identical: iterate batches, compute loss, compute gradients, update weights.
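In everyday PyTorch code the hand-written update and zeroing steps are usually delegated to an optimizer object. The sketch below shows that variant under a few assumptions that are not in the script above: it uses nn.Linear for the model, torch.optim.SGD for the update, nn.MSELoss for the loss, and TensorDataset in place of the custom RegressionDataset. Note that nn.Linear starts from small random weights rather than zeros.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixes both the initial weights and the shuffle order

# Same 8 samples and target rule y = 2*x1 - x2 + 1 as in the section.
X = torch.tensor([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [1.5, 0.5],
                  [3.0, 1.5], [0.5, 0.5], [2.5, 2.0], [1.0, 1.5]])
y = 2 * X[:, 0] - X[:, 1] + 1

loader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)
model = nn.Linear(2, 1)                                   # holds w and b
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for X_batch, y_batch in loader:
        optimizer.zero_grad()                    # replaces the manual .grad.zero_()
        y_pred = model(X_batch).squeeze(-1)      # (B, 1) -> (B,)
        loss = loss_fn(y_pred, y_batch)
        loss.backward()
        optimizer.step()                         # replaces the manual w -= lr * grad
        total += loss.item()
    epoch_losses.append(total / len(loader))
    print(f"Epoch {epoch}: avg_loss={epoch_losses[-1]:.4f}")
```

The rhythm is unchanged; only the bookkeeping moves into optimizer.zero_grad() and optimizer.step().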


Batch Size in Practice

Memory Constraints

The maximum batch size is often dictated by GPU memory. Each sample in a batch requires memory for:

  • Activations: The intermediate values at every layer, stored for backpropagation. For a ResNet-50 processing 224×224 images, each sample needs ~100 MB of activation memory
  • Gradients: Same size as the activations (one gradient per activation)
  • Model parameters: Fixed cost, independent of batch size (ResNet-50: ~100 MB)

A rough formula: $\text{Memory} \approx \text{Params} + B \times \text{ActivationsPerSample}$. On a 16 GB GPU with a model using 2 GB for parameters, you have ~14 GB for activations. At 100 MB per sample, the maximum batch size is ~140. In practice, you need headroom, so $B = 64$ or $B = 128$ is typical.
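The arithmetic in this estimate can be written as a small helper. This is only a sketch of the rough formula above; the function name and the choice to work in whole megabytes are assumptions for illustration, and real memory use also depends on optimizer state and framework overhead.

```python
def max_batch_size(gpu_mb: int, params_mb: int, act_mb_per_sample: int) -> int:
    """Largest B with params + B * activations_per_sample <= GPU memory (all in MB)."""
    return (gpu_mb - params_mb) // act_mb_per_sample

# Numbers from the text: 16 GB GPU, 2 GB of parameters, 100 MB per sample.
limit = max_batch_size(gpu_mb=16_000, params_mb=2_000, act_mb_per_sample=100)

# Largest power of two that does not exceed the limit, leaving some headroom.
power_of_two = 1 << (limit.bit_length() - 1)

print(limit)         # 140
print(power_of_two)  # 128
```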

Generalization Effects

Research by Keskar et al. (2017) and Hoffer et al. (2017) showed that large batch sizes can hurt generalization. The intuition: large batches produce smoother gradients that converge to sharp minima, while small batches inject noise that helps find flatter, more generalizable minima.

| Batch Size | Training Loss | Test Accuracy | Character |
| --- | --- | --- | --- |
| B = 32-64 | Slightly higher | Best | Noisy, explores broadly |
| B = 256-512 | Lower | Good | Moderate noise |
| B = 4096+ | Lowest | Often worse | Too smooth, sharp minima |

The linear scaling rule (Goyal et al., 2017) provides a practical remedy: when you increase the batch size by a factor of $k$, increase the learning rate by the same factor $k$. This keeps the effective step size roughly constant across batch sizes. Combined with learning rate warmup (gradually increasing lr for the first few epochs), this allows training with very large batches without degrading generalization.
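As a rough sketch, the rule and a simple linear warmup can be expressed as two helpers. The function names, the reference values (lr 0.1 at batch size 256), and the linear warmup shape are assumptions for illustration, not a prescribed schedule.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: multiply the learning rate by k = batch / base_batch."""
    return base_lr * batch / base_batch

def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# Going from batch size 256 to 1024 is k = 4, so the learning rate is scaled by 4.
target = scaled_lr(base_lr=0.1, base_batch=256, batch=1024)
schedule = [warmup_lr(s, warmup_steps=100, start_lr=0.1, target_lr=target)
            for s in (0, 50, 100, 200)]
print(target, schedule)
```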

When to Use Gradient Accumulation

If your desired batch size exceeds GPU memory, you can simulate larger batches by accumulating gradients across multiple forward-backward passes before updating:

  1. Run forward + backward for a mini-batch of size $B_{\text{micro}}$, dividing the loss by $k$ so the accumulated gradient is an average rather than a sum
  2. Do NOT zero gradients (they accumulate via PyTorch's default behavior)
  3. Repeat for $k$ micro-batches
  4. Now update weights, then zero the gradients. Effective batch size: $B_{\text{eff}} = k \times B_{\text{micro}}$

Large language models rely on this idea alongside data parallelism: the largest GPT-3 model, for example, was trained with an effective batch size of 3.2 million tokens, far more than a single GPU can process in one pass.
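A minimal sketch of gradient accumulation for a linear model with MSE loss follows. The random data, the helper mse, and the values k = 4 and a micro-batch size of 2 are assumptions for illustration. Each micro-batch loss is divided by k so that the accumulated gradient matches the gradient of one batch of size 8.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 2)   # made-up data: 8 samples, 2 features
y = torch.randn(8)
k, B_micro = 4, 2       # 4 micro-batches of 2 samples -> effective batch of 8

def mse(w, Xb, yb):
    return torch.mean((Xb @ w - yb) ** 2)

# Reference: one backward pass over the full effective batch.
w_full = torch.zeros(2, requires_grad=True)
mse(w_full, X, y).backward()

# Accumulation: k backward passes, no zeroing in between.
w_acc = torch.zeros(2, requires_grad=True)
for i in range(k):
    Xb = X[i * B_micro:(i + 1) * B_micro]
    yb = y[i * B_micro:(i + 1) * B_micro]
    (mse(w_acc, Xb, yb) / k).backward()   # gradients add up in w_acc.grad

accumulated_grad = w_acc.grad.clone()

# One weight update for the whole effective batch, then reset.
with torch.no_grad():
    w_acc -= 0.05 * w_acc.grad
w_acc.grad.zero_()

print(accumulated_grad, w_full.grad)
```

With equal-sized micro-batches, the two gradients agree up to floating-point rounding.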


Connection to Modern Systems

The batching concepts in this section scale directly to the largest models in production:

Data Parallelism

In distributed training, the batch is split across multiple GPUs. Each GPU processes a micro-batch, computes local gradients, then all GPUs synchronize gradients via AllReduce. If you have 8 GPUs each processing 32 samples, the effective batch size is $8 \times 32 = 256$. PyTorch's DistributedDataParallel (DDP) handles this automatically: you wrap the model, and DDP inserts gradient synchronization hooks after every backward pass.

The mathematical guarantee: AllReduce computes the average gradient across all GPUs. If each GPU processes an independent mini-batch, the averaged gradient has variance $\sigma^2 / (B \times G)$ where $G$ is the number of GPUs. This is exactly equivalent to a single GPU processing a batch of size $B \times G$.
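The equivalence can be checked numerically without any GPUs. The sketch below simulates G workers in NumPy: each computes the MSE gradient on its own shard, the results are averaged (standing in for AllReduce), and the average is compared with the gradient of the combined batch. The data and the helper mse_grad are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G, B = 4, 8                       # 4 simulated GPUs, 8 samples each
X = rng.normal(size=(G * B, 2))
y = rng.normal(size=G * B)
w = np.zeros(2)

def mse_grad(w, Xb, yb):
    """Gradient of mean((Xb @ w - yb)**2) with respect to w."""
    return (2 / len(yb)) * Xb.T @ (Xb @ w - yb)

# Each "GPU" computes a gradient on its own shard of the batch.
local_grads = [mse_grad(w, X[g * B:(g + 1) * B], y[g * B:(g + 1) * B])
               for g in range(G)]

# AllReduce with averaging is simulated by a plain mean over workers.
allreduce_avg = np.mean(local_grads, axis=0)

# Single-device gradient over the combined batch of size B * G.
big_batch_grad = mse_grad(w, X, y)

print(allreduce_avg, big_batch_grad)
```

The match relies on every worker holding the same number of samples; with unequal shards a weighted average would be needed.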

Flash Attention and Batching

In transformer models, batching has an additional dimension: sequence length $T$. The attention mechanism computes $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) \cdot V$ for each sample in the batch. The score matrix $QK^\top$ has shape $(B, H, T, T)$ where $H$ is the number of attention heads. For a batch of 32 sequences, 32 heads, and $T = 2048$ tokens, this matrix alone uses $32 \times 32 \times 2048^2 \times 2 \approx 8.6$ GB in float16 (about 17 GB in float32) for a single layer, a large share of the memory on most GPUs.
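The size of the score matrix is a one-line calculation. The helper name below is an assumption for illustration; it simply multiplies the four dimensions by the bytes per element.

```python
def score_matrix_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int) -> int:
    """Bytes needed to store a (B, H, T, T) attention score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

fp16_bytes = score_matrix_bytes(batch=32, heads=32, seq_len=2048, bytes_per_elem=2)
fp32_bytes = score_matrix_bytes(batch=32, heads=32, seq_len=2048, bytes_per_elem=4)

print(f"float16: {fp16_bytes / 1e9:.1f} GB")  # float16: 8.6 GB
print(f"float32: {fp32_bytes / 1e9:.1f} GB")  # float32: 17.2 GB
```

Because the cost grows with the square of the sequence length, doubling T to 4096 quadruples both figures.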

Flash Attention (Dao et al., 2022) solves this by tiling: instead of materializing the full $T \times T$ attention matrix, it processes small tiles (typically 128×128) in SRAM (fast on-chip cache), computes partial softmax results, and accumulates the output tile by tile. The outer loop iterates over blocks of keys/values; the inner loop iterates over blocks of queries. Each tile computes a local softmax and uses the online softmax trick (tracking running max and sum) to merge tile results into the exact global softmax.
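The online softmax trick can be illustrated in NumPy. This is a simplified sketch, not the Flash Attention kernel: it tiles only over key/value blocks for a single head, keeps all queries together, and runs on the CPU. The function names and the block size are assumptions for illustration.

```python
import numpy as np

def attention_reference(Q, K, V):
    """Standard attention that materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    """Same result, but only one block of keys/values is processed at a time."""
    T, d = Q.shape
    m = np.full((T, 1), -np.inf)       # running row-wise max of the scores
    l = np.zeros((T, 1))               # running softmax denominator
    O = np.zeros((T, V.shape[-1]))     # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                      # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                      # rescale earlier accumulators
        P = np.exp(S - m_new)
        l = l * scale + P.sum(axis=-1, keepdims=True)
        O = O * scale + P @ Vb
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 3))
tiled_out = attention_tiled(Q, K, V, block=4)
reference_out = attention_reference(Q, K, V)
print(np.abs(tiled_out - reference_out).max())
```

The tiled version never holds more than a (T, block) slice of scores, yet the running max and sum make the final result match the full softmax up to rounding.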

The result: the extra memory for attention drops from $O(B \cdot H \cdot T^2)$ to $O(B \cdot H \cdot T)$, and speed improves 2–4×. The speedup comes from memory traffic rather than fewer arithmetic operations: standard attention is memory-bandwidth-bound, and tiling avoids writing the full score matrix to HBM and reading it back (SRAM is 10–20× faster than HBM). From the training loop's perspective, Flash Attention is a drop-in replacement for standard attention: same input, same output, but faster and with less memory.

KV-Cache and Batched Inference

During autoregressive inference, each newly generated token attends to all previous tokens. The KV-cache stores the key and value projections of all previous tokens: $K_{\text{cache}} \in \mathbb{R}^{T_{\text{past}} \times d_k}$ and $V_{\text{cache}} \in \mathbb{R}^{T_{\text{past}} \times d_v}$ per layer per head. Instead of recomputing keys and values for the entire sequence at every step, the model computes only the new token's query, key, and value, appends the key/value to the cache, and runs attention against the full cache.
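A single-head NumPy sketch makes the caching step concrete. The dimensions, the random projection matrices, and the helper softmax are assumptions for illustration; the check at the end compares cached decoding with causal attention recomputed from scratch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 8, 4
X = rng.normal(size=(T, d_model))                     # made-up token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Decoding with a cache: one new row of K and V per generated token.
K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_out = []
for t in range(T):
    q = X[t] @ Wq                                     # query for the new token only
    K_cache = np.vstack([K_cache, X[t] @ Wk])         # append the new key
    V_cache = np.vstack([V_cache, X[t] @ Wv])         # append the new value
    weights = softmax(K_cache @ q / np.sqrt(d_k))     # attend over the whole cache
    cached_out.append(weights @ V_cache)
cached_out = np.array(cached_out)

# Reference: full causal attention recomputed over the entire sequence.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
S = Q @ K.T / np.sqrt(d_k)
S[np.triu_indices(T, k=1)] = -np.inf                  # mask out future tokens
P = np.exp(S - S.max(axis=-1, keepdims=True))
full_out = (P / P.sum(axis=-1, keepdims=True)) @ V

print(np.abs(cached_out - full_out).max())
```

Each decoding step projects only one token, while the cache grows by one row per step.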

Batched inference serves multiple requests simultaneously by packing their sequences into a single batch. The challenge: different requests are at different generation stages, with different cache lengths. Systems like vLLM use PagedAttention to manage KV-cache memory in fixed-size pages that need not be contiguous (similar to virtual memory in operating systems), allowing efficient batching even when sequences have very different lengths. This relates directly to our DataLoader discussion: just as training batches different samples together, inference systems batch different requests together for GPU utilization.

DataLoader Performance Tips

For production training, DataLoader performance can become a bottleneck. Key settings:

  • num_workers > 0: Use parallel processes to load and preprocess data while the GPU trains on the current batch. A common starting point: num_workers = 4 per GPU
  • pin_memory = True: Allocates data in page-locked (pinned) memory, enabling faster CPU-to-GPU transfer via DMA. Always use with CUDA GPUs
  • prefetch_factor = 2: Each worker prefetches this many batches ahead. Keeps the GPU fed even if occasional samples are slow to load (e.g., large images from disk)
  • persistent_workers = True: Keeps worker processes alive between epochs instead of respawning them. Eliminates the overhead of process creation at epoch boundaries
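Put together, the four settings might look like the sketch below. The random TensorDataset, the batch size, and the worker count are placeholders, and the right values depend on the machine; pin_memory is tied to CUDA availability here so the same code also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data (an assumption for illustration): 1024 samples, 2 features.
dataset = TensorDataset(torch.randn(1024, 2), torch.randn(1024))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,                          # parallel loading processes
    pin_memory=torch.cuda.is_available(),   # page-locked memory only helps with CUDA
    prefetch_factor=2,                      # batches prefetched per worker
    persistent_workers=True,                # keep workers alive between epochs
)

print(len(loader))  # 16 batches per epoch
```

prefetch_factor and persistent_workers only apply when num_workers is greater than 0. On platforms that start workers with spawn (Windows, macOS), the loop that iterates over the loader should sit under an if __name__ == "__main__": guard.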
The theme: From the simple 8-sample mini-batch loop you just wrote to training GPT-4 on thousands of GPUs, the fundamental concepts are the same — shuffle, batch, compute gradient, update, repeat. The engineering gets harder, but the mathematics stays the same.

Summary

In this section, we learned why mini-batch training is essential and how it works:

  1. Mini-batch gradients are unbiased: $\mathbb{E}[\hat{g}_B] = \nabla L(\theta)$. On average, they point in the same direction as the full gradient
  2. Variance decreases with batch size: $\text{Var}(\hat{g}_B) = \sigma^2 / B$. Larger batches give smoother gradients
  3. More updates beat fewer perfect updates: Mini-batching gives $N/B$ updates per epoch vs. 1 for full-batch. The frequent noisy updates converge faster in practice
  4. Shuffling is mandatory: Without it, batch compositions are correlated across epochs, violating the independence assumption
  5. PyTorch Dataset + DataLoader separate data access (what is a sample?) from training logistics (batch size, shuffle, parallel loading)
  6. Batch size is a hyperparameter that affects memory, convergence speed, and generalization. The linear scaling rule helps adapt learning rate to batch size

In the next section, we will put this batching machinery inside a complete training loop with proper loss tracking, gradient clipping, and checkpoint saving.

References

  • Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics 22(3), 400–407. Origin of stochastic approximation, the mathematical foundation of SGD.
  • Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT 2010, 177–186.
  • Bottou, L., Curtis, F. E. & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning. SIAM Review 60(2), 223–311. Rigorous treatment of mini-batch convergence and batch-size tradeoffs.
  • Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 / arXiv:1609.04836.
  • Smith, S. L., Kindermans, P.-J., Ying, C. & Le, Q. V. (2018). Don't Decay the Learning Rate, Increase the Batch Size. ICLR 2018 / arXiv:1711.00489.
  • Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677. Linear scaling rule for batch size and learning rate.
  • He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284.
  • Paszke, A. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
  • PyTorch Documentation. torch.utils.data — DataLoader, Dataset, Sampler. https://pytorch.org/docs/stable/data.html