Chapter 11

The Training Loop

Training in Practice

Learning Objectives

By the end of this section, you will be able to:

  1. Describe the five-step training rhythm (forward → loss → backward → clip → update) and explain why the order matters
  2. Implement gradient clipping from scratch and explain the mathematical guarantee it provides: $\|\hat{g}\| \leq c$ after clipping
  3. Compare common learning rate schedules (step decay, cosine annealing, warmup + cosine) and explain when each is appropriate
  4. Build a complete training loop in NumPy that includes batching, gradient clipping, and learning rate decay
  5. Translate the same loop into idiomatic PyTorch using nn.Module, optimizer, scheduler, and checkpointing

Where We Left Off

In Section 1, we learned how to split data into mini-batches and feed them to a model efficiently. We proved that mini-batch gradients are unbiased estimators with variance $\sigma^2 / B$, and built both NumPy and PyTorch data pipelines.

But batching is only one piece of the puzzle. In a real training run, you need to orchestrate the entire process: shuffle data each epoch, compute the forward and backward passes, clip gradients when they explode, schedule the learning rate, track losses, and save checkpoints. This section teaches that orchestration — the training loop itself.

The Central Metaphor: If mini-batching is the engine of a car, the training loop is the driver. The engine generates force (gradients), but the driver decides the speed (learning rate), applies the brakes (gradient clipping), and knows when to stop (convergence monitoring).

Anatomy of a Training Loop

Every neural network training loop, from a 22-parameter toy model to frontier LLMs with hundreds of billions of parameters or more, follows the same nested structure:

| Loop Level | What It Does | Frequency |
|---|---|---|
| Outer: for epoch in range(E) | One complete pass through all training data | 10–300 times (small models), 1–5 times (LLMs on large corpora) |
| Inner: for batch in loader | Process one mini-batch | N/B times per epoch |
| Step: forward → loss → backward → clip → update | One weight update | Once per batch |

The mathematical update that happens at each step is:

$$\theta_{t+1} = \theta_t - \eta_t \cdot \text{clip}(\nabla_{\mathcal{B}} L(\theta_t),\, c)$$

where $\eta_t$ is the learning rate at step $t$ (possibly from a schedule), $\nabla_{\mathcal{B}} L$ is the mini-batch gradient, and $\text{clip}(g, c)$ rescales $g$ so that the result has norm at most $c$.

Let us unpack each component of this formula.


The Five Sacred Steps

Inside the batch loop, every training step follows five operations in a strict order. Getting the order wrong causes subtle, hard-to-debug failures.

Step 1: Forward Pass

Push the batch through the model to compute predictions: $\hat{y} = f(X_b;\, \theta)$. For our 2-layer MLP: $z_1 = X_b W_1 + b_1$, $h_1 = \text{ReLU}(z_1)$, $z_2 = h_1 W_2 + b_2$. The forward pass is deterministic given fixed weights and inputs.

Step 2: Compute Loss

Measure how far the predictions are from the truth: $L = -\frac{1}{B} \sum_{i \in \mathcal{B}} \log \hat{p}(y_i \mid x_i)$. The loss is a single scalar that summarizes the model's performance on this batch. Everything in the backward pass flows from this number.
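As a minimal sketch of this loss in NumPy: the helper name `cross_entropy` and the example logits are illustrative assumptions, not part of the chapter's training script. It computes log-probabilities with a stabilized softmax and averages the negative log-likelihood of the correct class over the batch.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class over a batch."""
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out log p(y_i | x_i) for each sample, then average over the batch.
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical batch: B = 2 samples, 2 classes.
logits = np.array([[2.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0]])   # undecided between the classes
labels = np.array([0, 1])

loss = cross_entropy(logits, labels)
print(f"batch loss = {loss:.4f}")
```

The second sample contributes log 2 ≈ 0.693 on its own, which is the loss of a model that guesses 50/50 between two classes.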

Step 3: Backward Pass (Backpropagation)

Compute $\nabla_\theta L$, the gradient of the loss with respect to every parameter. The chain rule propagates gradients from the loss backward through each layer. In PyTorch, loss.backward() does this automatically. In NumPy, we computed each gradient by hand in Chapters 7-8.

Critical: Zero gradients before backward! PyTorch accumulates gradients by default. If you call loss.backward() without first calling optimizer.zero_grad(), gradients from the previous batch add to the current batch's gradients. This is a common beginner bug that causes loss to explode.

Step 4: Gradient Clipping

If the gradient norm is too large, scale it down:

$$\hat{g} = \begin{cases} g & \text{if } \|g\| \leq c \\ \frac{c}{\|g\|} \cdot g & \text{if } \|g\| > c \end{cases}$$

This is a safety net. Most batches produce reasonable gradients, but occasionally a pathological batch creates a gradient so large that the weight update would catapult the model far from the current optimum. Clipping caps the maximum step size while preserving the gradient direction.
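The rule above can be sketched in a few lines of NumPy. The function name `clip_by_global_norm` and the example gradients are assumptions made for illustration; the norm is taken jointly over all parameter arrays, which is one common convention.

```python
import numpy as np

def clip_by_global_norm(grads, c):
    """Rescale a list of gradient arrays so their combined L2 norm is at most c."""
    # The norm is computed over ALL parameters jointly, not per array.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= c:
        return grads, total_norm           # small enough: leave untouched
    scale = c / total_norm                 # shrink factor in (0, 1)
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter arrays; combined norm is 5.0.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, c=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before:", norm_before)
print("norm after :", norm_after)
```

Every component is multiplied by the same factor, so the clipped gradient points in the same direction as the original.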

Step 5: Update Weights

Apply the gradient to adjust each parameter: $\theta \leftarrow \theta - \eta \cdot \hat{g}$. In PyTorch, optimizer.step() does this for all parameters at once.

Why this order? Forward must come first (produces predictions). Loss needs predictions. Backward needs the loss (and clean gradients). Clipping needs computed gradients. Update needs clipped gradients. Each step depends on the previous one. Swapping any two steps produces either errors or silently wrong results.

A Note on zero_grad(set_to_none=True)

PyTorch 1.7 added the set_to_none argument to optimizer.zero_grad(). Before that, the call always overwrote every .grad tensor with zeros, and zeroing remained the default through the 1.x releases. Since PyTorch 2.0 the default is set_to_none=True, so .grad is set to None instead. Two practical wins: (1) no memory is spent on a zero tensor you are about to overwrite on the next backward pass, and (2) the optimizer can skip a parameter entirely, because a None gradient signals “nothing to update for this parameter this step.”

In modern code you can just write optimizer.zero_grad() with no argument. The snippet below makes the behavior visible so the semantics are never mysterious.

set_to_none=False vs True — same clearing step, different state
🐍zero_grad_set_to_none.py
Line 1: import torch

PyTorch gives us autograd-enabled tensors and optimizers. Autograd is what lets loss.backward() populate .grad on every leaf tensor.

EXECUTION STATE
📚 torch = PyTorch core: tensors, autograd, optimizers, nn modules.
Line 2: import torch.nn as nn

nn holds the building blocks (Linear, Dropout, BatchNorm, ...). We alias it as nn by convention.

Line 5: model = nn.Linear(3, 1, bias=False)

A single linear layer with input dim 3 and output dim 1. bias=False keeps the demo focused on weight.grad. The learnable tensor is model.weight with shape (1, 3).

EXECUTION STATE
📚 nn.Linear(in, out, bias) = Fully-connected layer. Forward pass: output = input @ weight.T (+ bias if enabled).
⬇ arg: in_features=3 = Each input vector has 3 entries.
⬇ arg: out_features=1 = Produces a single scalar per input.
⬇ arg: bias=False = Skip the bias term so only weight.grad exists — simpler demo.
Line 6: with torch.no_grad():

Disable gradient tracking for the setup we are about to do. We are overwriting the weight tensor, not running a forward pass that should be differentiated.

EXECUTION STATE
📚 torch.no_grad() = Context manager that stops autograd from recording operations. Any tensor modified inside is not part of the graph.
Line 7: model.weight.copy_(torch.tensor([[1.0, 2.0, 3.0]]))

Manually set the weights to [[1, 2, 3]]. Using copy_ (in-place) keeps the same tensor id so autograd still sees it as a leaf parameter.

EXECUTION STATE
⬇ arg: torch.tensor([[1., 2., 3.]]) = Shape (1, 3) matches model.weight.
⬆ result: model.weight = tensor([[1., 2., 3.]])
Line 10: x = torch.tensor([[1.0, 1.0, 1.0]])

A single input with shape (1, 3). All-ones makes the forward and backward pass easy to verify by hand.

EXECUTION STATE
x = tensor([[1., 1., 1.]]) shape (1, 3)
Line 11: loss = (model(x) ** 2).mean()

Forward pass + a dummy squared loss. pred = 1*1 + 2*1 + 3*1 = 6.0. loss = mean(6**2) = 36.0. The single element means .mean() is just the identity, but it keeps loss a scalar so backward() works.

EXECUTION STATE
pred = model(x) = 1*1 + 2*1 + 3*1 = 6.0 → tensor([[6.]])
loss = (6.0)**2 = 36.0 → tensor(36., grad_fn=<MeanBackward0>)
Line 12: loss.backward()

Autograd walks the computation graph and populates .grad on every leaf tensor. dL/dw_i = 2 * pred * x_i = 2 * 6 * 1 = 12 for each weight. So model.weight.grad becomes [[12., 12., 12.]].

EXECUTION STATE
📚 loss.backward() = Triggers reverse-mode automatic differentiation. Writes into .grad of every tensor that has requires_grad=True.
⬆ result: model.weight.grad = tensor([[12., 12., 12.]])
Line 13: print("after backward :", model.weight.grad)

Confirms .grad is now a real tensor (not None) holding [[12, 12, 12]].

EXECUTION STATE
stdout = after backward : tensor([[12., 12., 12.]])
Line 16: opt = torch.optim.SGD(model.parameters(), lr=0.01)

Plain SGD optimizer over the model's parameters. We only need it for its .zero_grad() method in this demo — we will not actually take a step.

EXECUTION STATE
📚 torch.optim.SGD = Vanilla stochastic gradient descent. Stores references to the parameters it controls.
⬇ arg: model.parameters() = Generator yielding every learnable tensor (here just model.weight).
⬇ arg: lr=0.01 = Learning rate — unused here since we never call .step().
Line 17: opt.zero_grad(set_to_none=False)

The default before PyTorch 2.0: walk every parameter and call .grad.zero_() on it. The .grad tensor is preserved, just overwritten with zeros. Memory stays allocated; a zeroing kernel runs.

EXECUTION STATE
📚 Optimizer.zero_grad(set_to_none) = Clear gradients for all parameters controlled by this optimizer.
⬇ arg: set_to_none=False = Overwrite .grad with zeros instead of dropping it. Legacy behavior; slower and uses more memory.
⬆ result: model.weight.grad = tensor([[0., 0., 0.]])
⬆ is None? = False
Line 18: print("set_to_none=False :", model.weight.grad, "is None?", ...)

Shows that .grad is still a live zero tensor.

EXECUTION STATE
stdout = set_to_none=False : tensor([[0., 0., 0.]]) is None? False
Line 22: opt.zero_grad(set_to_none=True)

Modern default since PyTorch 2.0. Instead of zeroing the tensor, PyTorch drops the reference so .grad becomes None. The next loss.backward() will allocate a fresh tensor. Two benefits: (1) no memory is held for a zero tensor that the next backward pass would overwrite anyway, and (2) the optimizer can skip the parameter, because None signals 'no update needed for this parameter'.

EXECUTION STATE
⬇ arg: set_to_none=True = Set each .grad to None. This is the default from PyTorch 2.0 onward.
⬆ result: model.weight.grad = None
⬆ is None? = True
→ practical takeaway = In modern PyTorch you can just write opt.zero_grad() with no argument — set_to_none=True is already the default. You only pass set_to_none=False if you explicitly want the old behavior (rare; mostly for tests or custom training utilities that read .grad directly).
Line 23: print("set_to_none=True :", model.weight.grad, "is None?", ...)

Confirms .grad has been cleared all the way to None.

EXECUTION STATE
stdout = set_to_none=True : None is None? True
    import torch
    import torch.nn as nn

    # --- Tiny, deterministic model ---
    model = nn.Linear(3, 1, bias=False)
    with torch.no_grad():
        model.weight.copy_(torch.tensor([[1.0, 2.0, 3.0]]))

    # --- Forward + backward populates .grad ---
    x = torch.tensor([[1.0, 1.0, 1.0]])
    loss = (model(x) ** 2).mean()    # pred = 6.0, loss = 36.0
    loss.backward()                   # dL/dw = 2 * 6 * [1,1,1] = [12,12,12]
    print("after backward    :", model.weight.grad)

    # --- Classic behavior: set_to_none=False writes zeros ---
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    opt.zero_grad(set_to_none=False)
    print("set_to_none=False :", model.weight.grad,
          "is None?", model.weight.grad is None)

    # --- Modern default (PyTorch >= 2.0): set_to_none=True drops the tensor ---
    opt.zero_grad(set_to_none=True)
    print("set_to_none=True  :", model.weight.grad,
          "is None?", model.weight.grad is None)

model.train() vs model.eval() — a Runnable Demo

Every nn.Module carries a self.training flag. model.train() sets it to True on the module and all its children; model.eval() sets it to False. Two standard layers behave differently depending on this flag:

  • Dropout: in train mode, samples a fresh Bernoulli mask and rescales survivors by $1/(1-p)$. In eval mode, identity.
  • BatchNorm: in train mode, normalizes with batch statistics and updates the running mean/variance. In eval mode, uses the accumulated running statistics and updates nothing.

Forgetting model.eval() before validation is one of the most common silent bugs: Dropout keeps zeroing activations and BatchNorm keeps using the tiny batch you happen to be validating on. The demo below makes both modes concrete.

Dropout under model.train() vs model.eval()
🐍train_vs_eval.py
Line 1: import torch

PyTorch core, for tensors and the RNG controlled by torch.manual_seed.

Line 2: import torch.nn as nn

We will use nn.Dropout. Same alias convention as before.

Line 4: torch.manual_seed(0)

Fix the PyTorch RNG so the demo is reproducible. Dropout's random mask is drawn from this RNG; without a seed, outputs would differ every time you re-run the cell.

EXECUTION STATE
📚 torch.manual_seed(seed) = Seeds the random number generators on all devices (CPU and CUDA), which are used by Dropout, nn.init, torch.randn, etc. torch.cuda.manual_seed_all is only needed when you want to seed the CUDA generators separately.
Line 5: layer = nn.Dropout(p=0.5)

Construct a Dropout module. In training mode it zeros each element of the input independently with probability p, and rescales survivors by 1/(1-p) so the expected sum is preserved.

EXECUTION STATE
📚 nn.Dropout(p) = Stochastic regularizer. Training: zero each element with prob p, scale survivors by 1/(1 - p). Eval: identity.
⬇ arg: p=0.5 = Drop 50% of elements on average. Scale factor for survivors = 1 / (1 - 0.5) = 2.0. Choosing p is a regularization hyperparameter (Ch 12).
Line 8: x = torch.ones(1, 10)

Shape (1, 10), every entry = 1.0. The uniform input makes the mask visible: zeros in the output mean 'dropped', 2.0s mean 'kept and scaled'.

EXECUTION STATE
x = tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]) shape (1, 10)
Line 11: layer.train()

Switch the module into training mode. Every nn.Module has a self.training flag; layer.train() sets it to True on this module and all its children. Dropout checks this flag on every forward pass. So does BatchNorm (use batch statistics + update running mean/var), and a few other modules (LayerNorm is mode-independent and ignores it).

EXECUTION STATE
📚 Module.train() = Sets self.training = True recursively. Standard call at the top of every training epoch: model.train().
→ effect on Dropout = Samples a fresh Bernoulli mask per forward pass.
→ effect on BatchNorm = Uses batch statistics (mean, var of current batch) AND updates running_mean / running_var.
Lines 12-13: for i in range(3): print(f"train() run {i}:", layer(x))

Call the layer three times with the same input. Each call draws a fresh random mask, so outputs differ across runs.

EXECUTION STATE
run 0 (example) = tensor([[0., 0., 2., 2., 0., 2., 2., 0., 2., 2.]])
run 1 (example) = tensor([[2., 0., 2., 0., 0., 0., 2., 2., 2., 2.]])
run 2 (example) = tensor([[0., 2., 0., 2., 2., 2., 2., 0., 2., 0.]])
→ survivors scaled by 2.0 = Because 1/(1-p) = 1/0.5 = 2. This compensation keeps E[output_i] = x_i, so downstream layers see the same expected scale with or without Dropout active.
→ exact masks depend on RNG = The specific zeros/twos pattern you see depends on your PyTorch build's RNG. The statistical behavior (~5 zeros, ~5 twos, different each run) is what matters.
Line 18: layer.eval()

Switch into evaluation mode. self.training becomes False. From now on Dropout is a no-op (identity), and BatchNorm uses its frozen running statistics instead of batch statistics. This is the single most important line in validation code — forgetting it silently corrupts your validation metrics.

EXECUTION STATE
📚 Module.eval() = Sets self.training = False recursively. Standard call at the top of every validation / test pass: model.eval().
→ effect on Dropout = Identity. No masking, no rescaling.
→ effect on BatchNorm = Normalizes using running_mean and running_var accumulated during training. Does NOT update them.
Lines 19-20: for i in range(3): print(f"eval() run {i}:", layer(x))

Three calls, same input, same output. Evaluation is deterministic because the stochastic operations are turned off.

EXECUTION STATE
run 0 = tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
run 1 = tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
run 2 = tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
→ rule of thumb = Always bracket your validation pass with model.eval() ... model.train() so Dropout and BatchNorm behave correctly. This is separate from torch.no_grad(), which only disables gradient tracking.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Dropout(p=0.5)

    # Input: a single row of all-ones so the effect of Dropout is obvious
    x = torch.ones(1, 10)

    # --- TRAIN mode: stochastic masking + rescaling ---
    layer.train()
    for i in range(3):
        print(f"train() run {i}:", layer(x))
    # Each run draws a fresh Bernoulli(0.5) mask.
    # Zeros stay 0; survivors are scaled by 1/(1 - 0.5) = 2.0.

    # --- EVAL mode: identity ---
    layer.eval()
    for i in range(3):
        print(f"eval()  run {i}:", layer(x))
    # Same output every time: Dropout becomes a no-op.
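To see what the training flag does under the hood, here is a rough NumPy sketch of inverted dropout. The function `dropout` and its `training` argument are hypothetical stand-ins for nn.Dropout and the module's self.training flag, not PyTorch's actual implementation.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: mask and rescale in training mode, identity in eval mode."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p        # True with probability 1 - p
    return x * keep / (1.0 - p)            # rescale survivors so E[output] = x

rng = np.random.default_rng(0)             # seed chosen arbitrarily
x = np.ones((1, 10))

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)

print("train:", train_out)   # a mix of 0.0 and 2.0; the pattern depends on the RNG
print("eval :", eval_out)    # unchanged input
```

Averaged over many draws, the training-mode output has the same mean as the input, which is why no rescaling is needed at evaluation time.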

Interactive Training Simulator

The visualization below simulates gradient descent on the loss surface $L(w_1, w_2) = w_1^2 + 3 w_2^2$. You can adjust the learning rate, LR scheduler, gradient clipping threshold, and batch size to see how each affects the training trajectory, loss curve, and gradient behavior.

[Interactive visualization: training loop simulator]

Experiment with different settings and observe:

  • No scheduler + high LR: The loss may oscillate or even diverge as the model overshoots the minimum repeatedly
  • Cosine schedule: Smooth descent that slows naturally near the end, often reaching a lower final loss than step decay
  • Warmup + cosine: Starts with small steps (warmup prevents early instability), peaks, then decays smoothly. This is the standard schedule for transformer training
  • Gradient clipping disabled + small batch: Noisy gradients occasionally cause large jumps. Enable clipping to see the trajectory stabilize
  • Large batch size: Smoother trajectory but fewer updates for the same number of epochs. Compare B=1 (very noisy) vs B=64 (smooth but slow)

Gradient Clipping: Preventing Explosions

As we saw in Chapter 8, gradients can explode during backpropagation. In a deep network with $L$ layers, the gradient of the loss with respect to early layers involves a product of $L$ Jacobian matrices:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_L} \cdot \frac{\partial z_L}{\partial z_{L-1}} \cdots \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$

If the spectral norms of these Jacobians are greater than 1, the product can grow exponentially with depth. In a 50-layer network where each Jacobian scales the gradient by a factor of 1.1, the gradient can be up to $1.1^{50} \approx 117$ times larger at layer 1 than at layer 50. The weight update then becomes so large that the model jumps to a completely unrelated point in parameter space, and training diverges.
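The arithmetic behind that figure is easy to check. The sketch below treats each layer as multiplying the gradient norm by a fixed factor of 1.1, which is a simplifying assumption (real Jacobians vary by layer and by input), and prints the resulting growth at a few depths.

```python
per_layer_factor = 1.1   # assumed gain in gradient norm per layer

for depth in (10, 25, 50):
    # Worst-case scaling of the gradient norm after `depth` layers.
    factor = per_layer_factor ** depth
    print(f"depth {depth:2d}: gradient scaled by up to {factor:8.2f}x")

growth = per_layer_factor ** 50
```

A per-layer gain that looks harmless in isolation compounds into a change of two orders of magnitude across 50 layers.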

Norm Clipping vs. Value Clipping

There are two flavors of gradient clipping:

| Method | Formula | Properties |
|---|---|---|
| Norm clipping (standard) | g ← (c / ‖g‖) · g if ‖g‖ > c | Preserves direction. Scales ALL gradient components equally. Used by GPT, BERT, most transformers. |
| Value clipping | g_i ← clamp(g_i, -c, c) | Clips each component independently. Can change gradient direction. Less common. |

Norm clipping is overwhelmingly preferred because it preserves the gradient direction. Think of it this way: the gradient tells you where to go, and the norm tells you how fast. Norm clipping says “go in the same direction, but at a safe speed.” Value clipping says “cap each axis independently,” which can rotate the gradient direction.
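A small numerical comparison makes the rotation visible. The gradient vector below is a made-up example with one dominant component; cosine similarity with the original gradient measures how much each method changes the direction.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.array([10.0, 0.5])    # hypothetical gradient with one dominant axis
c = 1.0

norm_clipped = g * min(1.0, c / np.linalg.norm(g))   # rescale the whole vector
value_clipped = np.clip(g, -c, c)                    # clamp each component

print("norm clip :", norm_clipped, "cos =", cosine_similarity(g, norm_clipped))
print("value clip:", value_clipped, "cos =", cosine_similarity(g, value_clipped))
```

Norm clipping leaves the cosine similarity at 1, while value clipping turns the vector toward the smaller component because only the large one gets clamped.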

Mathematically, norm clipping guarantees:

$$\|\hat{g}\| = \min(\|g\|,\, c)$$

For a plain SGD update, this bounds the maximum change in parameters per step: $\|\Delta \theta\| = \eta \|\hat{g}\| \leq \eta \cdot c$. With $\eta = 0.001$ and $c = 1.0$, no single step can move the parameters by more than 0.001 in the L2 norm. This is the fundamental stability guarantee.

Choosing the Clip Threshold

The clip threshold $c$ is a hyperparameter. A common practice:

  1. Run a few epochs without clipping and record the gradient norms
  2. Set $c$ to the 90th or 95th percentile of observed norms
  3. This clips only the most extreme gradients while leaving the typical ones untouched
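The recipe above might look like this in NumPy. The gradient norms here are synthetic (drawn from a log-normal distribution purely as a stand-in); in a real run they would be logged from your own unclipped training steps.

```python
import numpy as np

# Stand-in for gradient norms recorded during a short unclipped run.
# These values are synthetic; replace them with norms from real training.
rng = np.random.default_rng(0)
grad_norms = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

clip_threshold = np.percentile(grad_norms, 95)          # 95th percentile
fraction_clipped = np.mean(grad_norms > clip_threshold)

print(f"median norm    : {np.median(grad_norms):.3f}")
print(f"clip threshold : {clip_threshold:.3f}")
print(f"steps clipped  : {fraction_clipped:.1%}")
```

With the 95th percentile as threshold, roughly one step in twenty is clipped and the typical step is left alone.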

Common defaults: $c = 1.0$ for transformers (the value commonly cited for GPT-2, GPT-3, and BERT), $c = 5.0$ for RNNs, and smaller values such as $c = 0.25$ for very deep or sensitive architectures.


Learning Rate Scheduling

The learning rate $\eta$ controls how far each gradient step moves in parameter space. Using a fixed learning rate throughout training is suboptimal: early in training, when the model is far from the optimum, we want large steps for fast progress. Late in training, when the model is near a good solution, we want small steps for precise convergence.

A learning rate schedule is a function $\eta(t)$ that adjusts the learning rate based on the training step $t$ or epoch $e$.

Step Decay

The simplest schedule: multiply the learning rate by a factor $\gamma < 1$ every few epochs:

$$\eta_e = \eta_0 \cdot \gamma^{\lfloor e / S \rfloor}$$

where $S$ is the step size (how many epochs between decays) and $\gamma$ is the decay factor (typically 0.1 or 0.5). This creates a staircase pattern: constant LR within each interval, discrete drops between intervals.

| Epoch | γ=0.8, S=1 (η₀=0.5) | γ=0.1, S=30 (η₀=0.1) |
|---|---|---|
| 0 | 0.5000 | 0.1000 |
| 1 | 0.4000 | 0.1000 |
| 2 | 0.3200 | 0.1000 |
| 5 | 0.1638 | 0.1000 |
| 30 | 0.0006 | 0.0100 |
| 60 | 0.0000 | 0.0010 |
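The table can be reproduced with a one-line schedule function. The name `step_decay` is an illustrative choice, and the starting rates 0.5 and 0.1 are read off the epoch 0 row of the table.

```python
def step_decay(epoch, lr0, gamma, step_size):
    """Learning rate at `epoch`: lr0 * gamma ** floor(epoch / step_size)."""
    return lr0 * gamma ** (epoch // step_size)

for epoch in (0, 1, 2, 5, 30, 60):
    fast = step_decay(epoch, lr0=0.5, gamma=0.8, step_size=1)    # decays every epoch
    slow = step_decay(epoch, lr0=0.1, gamma=0.1, step_size=30)   # drops every 30 epochs
    print(f"epoch {epoch:2d}: {fast:.4f}  {slow:.4f}")
```

Integer division implements the floor in the formula, so the rate stays constant within each block of `step_size` epochs.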

Cosine Annealing

A smooth schedule with no decay factor or step size to tune. It decays the learning rate along a cosine curve from $\eta_{\max}$ down to $\eta_{\min}$ (often 0):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

where $T$ is the total number of training steps. The cosine shape keeps the LR close to $\eta_{\max}$ at first, drops fastest around the midpoint $t = T/2$, and flattens out again as it approaches $\eta_{\min}$. This matches the intuition that large steps are most useful early, and the final phase is fine-tuning.
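A direct transcription of the formula, as a sketch; the function name and the values T = 100 and a peak rate of 0.1 are arbitrary choices for illustration. Evaluating it at a few steps lets you inspect the shape of the curve directly.

```python
import math

def cosine_annealing(t, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at t = 0 to lr_min at t = total_steps."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / total_steps))
    return lr_min + (lr_max - lr_min) * cosine

T = 100
for t in (0, 10, 45, 55, 90, 100):
    print(f"t = {t:3d}: lr = {cosine_annealing(t, T, lr_max=0.1):.5f}")
```

Comparing the change over steps 0 to 10 with the change over steps 45 to 55 shows where along the curve the rate falls fastest.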

Warmup + Cosine (The Transformer Standard)

Modern transformer training almost universally uses a linear warmup followed by cosine decay:

$$\eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi(t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right)\right) & t \geq T_{\text{warmup}} \end{cases}$$

Warmup gradually increases the LR from near-zero to the peak value over the first $T_{\text{warmup}}$ steps. Why? At the start of training, the model's activations and gradients are poorly calibrated, so a large learning rate would push the model into unstable territory before it has had a chance to “settle in.” Warmup lets the model find a reasonable starting point before accelerating.
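The piecewise formula translates almost line for line into code. In this sketch the function name, the 1000-step run, the 100 warmup steps, and the peak rate of 3e-4 are all hypothetical settings chosen for illustration.

```python
import math

def warmup_cosine(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min at total_steps."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)   # 0 → 1
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1000 steps in total, the first 100 of them warmup.
schedule = [warmup_cosine(t, 1000, 100, lr_max=3e-4) for t in range(1001)]

for t in (0, 50, 100, 550, 1000):
    print(f"t = {t:4d}: lr = {schedule[t]:.6f}")
```

The two branches meet at the peak: at t = 100 the warmup ramp reaches lr_max and the cosine phase starts from the same value, so the schedule has no jump.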

| Schedule | When to Use | Examples |
|---|---|---|
| Step decay | CNNs, well-understood tasks, fine-tuning | ResNet training (γ=0.1 at epochs 30, 60, 90) |
| Cosine annealing | When you know the total training length | CIFAR-10/100 training |
| Warmup + cosine | Transformers, large learning rates, unstable early training | GPT-2, GPT-3, ViT, LLaMA (BERT pairs warmup with linear decay) |

The original transformer paper (Attention Is All You Need, 2017) used an inverse square root schedule: $\eta_t = d_{\text{model}}^{-0.5} \cdot \min(t^{-0.5},\, t \cdot T_{\text{warmup}}^{-1.5})$. Modern practice has largely replaced this with warmup + cosine, which is simpler to tune and typically performs comparably.
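For comparison, the inverse square root formula can be sketched as follows. The function name is illustrative, and the values d_model = 512 with 4000 warmup steps are used here only as example settings.

```python
def inverse_sqrt_lr(t, d_model, warmup_steps):
    """Inverse square root schedule; t counts from 1 to avoid dividing by zero."""
    return d_model ** -0.5 * min(t ** -0.5, t * warmup_steps ** -1.5)

d_model, warmup_steps = 512, 4000    # example settings (assumed for illustration)

for t in (1, 1000, 4000, 16000, 64000):
    print(f"t = {t:6d}: lr = {inverse_sqrt_lr(t, d_model, warmup_steps):.6f}")

peak = inverse_sqrt_lr(warmup_steps, d_model, warmup_steps)
```

The two arguments of min are equal exactly at t = T_warmup, so the rate rises linearly up to that step and decays proportionally to 1/sqrt(t) afterwards.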

Building the Training Loop from Scratch

Now let us put everything together. The code below implements a complete training loop in NumPy for a 2-layer MLP classifying 6 data points into 2 classes. It includes all five steps: forward pass, loss computation, backpropagation, gradient clipping, and weight updates with learning rate decay.

Pay close attention to the loop structure: the outer epoch loop handles shuffling and learning rate decay; the inner batch loop handles the five training steps. This nested structure is universal across all neural network training.

NumPy — Complete Training Loop with Gradient Clipping
🐍training_loop.py
Line 1: import numpy as np

NumPy provides fast N-dimensional arrays and vectorized math. Every matrix multiply (@), element-wise operation, and random call in this file runs as compiled C code, not slow Python loops.

EXECUTION STATE
📚 numpy = Library for numerical computing — ndarray type, matrix operations via @, broadcasting, and random number generation.
Line 3: # Dataset: 6 points, 2 classes

We create a small binary classification dataset. 3 points belong to class 0 and 3 to class 1. Small enough to trace every computation, large enough to show batching.

EXECUTION STATE
classification task = Given (x₁, x₂), predict class 0 or class 1. Unlike regression in Section 1, we now predict categories.
Line 4: X = np.array([...]) (input features, 6×2)

6 data points, each with 2 features. The first 3 points are class 0, the last 3 are class 1. These clusters are not linearly separable — the MLP’s hidden layer must learn a nonlinear boundary.

EXECUTION STATE
⬇ X (6×2) =
       x₁   x₂
[0]  0.5   0.1   (class 0)
[1]  0.2   0.8   (class 0)
[2]  0.9   0.6   (class 0)
[3]  0.1   0.5   (class 1)
[4]  0.8   0.2   (class 1)
[5]  0.6   0.9   (class 1)
Line 6: y = np.array([0, 0, 0, 1, 1, 1]) (labels)

Integer class labels. 0 and 1 are used as indices into the softmax output: probs[sample, y[sample]] gives the predicted probability for the correct class.

EXECUTION STATE
y = [0, 0, 0, 1, 1, 1] — first 3 samples are class 0, last 3 are class 1
Line 8: # 2-layer MLP: 2 → 4 → 2

Our model has 2 input features, 4 hidden neurons with ReLU activation, and 2 output neurons (one per class) with softmax. Total parameters: (2×4) + 4 + (4×2) + 2 = 22.

EXECUTION STATE
architecture = Input(2) → Linear(2,4) → ReLU → Linear(4,2) → Softmax → class probability
parameter count = W1: 2×4=8, b1: 4, W2: 4×2=8, b2: 2. Total: 22 learnable parameters.
Line 9: np.random.seed(7) (fix initialization)

Seeds the random number generator for reproducible weight initialization. With seed 7, we get specific initial weights that produce interesting training behavior (gradient clipping triggers on some batches).

EXECUTION STATE
📚 np.random.seed() = Sets the internal state of NumPy’s PRNG. All subsequent random calls produce the same sequence. Used for reproducibility — remove in production.
Line 10: W1 = 1.5 * np.random.randn(2, 4) (layer 1 weights)

Plain Gaussian initialization scaled by 1.5, deliberately larger than a Xavier-style scale would be for this layer. np.random.randn draws from the standard normal distribution N(0, 1), and we scale by 1.5 to make some initial weights large enough that gradient clipping will activate, which is a realistic scenario.

EXECUTION STATE
📚 np.random.randn(2, 4) = Draws a 2×4 matrix from N(0, 1). Each element is independently sampled. After ×1.5, the values have std dev = 1.5.
W1 (2×4) =
      h₀      h₁      h₂      h₃
x₁ [ 2.5358, -0.6989,  0.0492,  0.6113]
x₂ [-1.1834,  0.0031, -0.0013, -2.6321]
→ note large values = W1[0,0]=2.54 and W1[1,3]=-2.63 are quite large. These create strong gradients that may need clipping.
Line 11: b1 = np.zeros(4) (layer 1 biases)

Biases start at zero. Standard practice — the initial bias does not matter much because the random weights already break symmetry between neurons.

EXECUTION STATE
b1 = [0.0, 0.0, 0.0, 0.0]
Line 12: W2 = 1.5 * np.random.randn(4, 2) (layer 2 weights)

The output layer maps 4 hidden features to 2 class logits. Each column corresponds to one class.

EXECUTION STATE
W2 (4×2) =
      class0  class1
h₀ [ 1.5265,  0.9007]
h₁ [-0.9381, -0.2573]
h₂ [ 0.7579, -0.3920]
h₃ [-0.3641, -2.1799]
Line 13: b2 = np.zeros(2) (output biases)

Two output biases, one per class. Starting at zero means the biases add no initial preference for either class; the initial predictions are determined entirely by the random weights.

EXECUTION STATE
b2 = [0.0, 0.0]
Line 15: lr = 0.5 (initial learning rate)

Starting learning rate. We will decay this by 0.8× each epoch (step decay schedule). Epoch 0: 0.5, Epoch 1: 0.4, Epoch 2: 0.32. This gradually reduces step size as we approach the minimum.

EXECUTION STATE
lr = 0.5 = Moderately aggressive for this small problem. In practice, Adam is often run at around 3e-4 to 1e-3, and SGD at 0.01-0.1 for image classification.
→ schedule preview = Epoch 0: lr=0.500 Epoch 1: lr=0.400 (×0.8) Epoch 2: lr=0.320 (×0.8)
Line 16: clip_norm = 1.0 (gradient clipping threshold)

If the total gradient norm exceeds 1.0, we scale ALL gradients down proportionally so the norm becomes exactly 1.0. This prevents any single batch from making a catastrophically large weight update.

EXECUTION STATE
clip_norm = 1.0 = Maximum allowed L2 norm of the entire gradient vector. If ‖grad‖ = 1.86, we multiply all gradients by 1.0/1.86 = 0.537.
→ why 1.0? = Chosen to demonstrate clipping. Common values in practice: 1.0 for transformers (GPT uses 1.0), 5.0 for RNNs, 0.25 to 0.5 for very sensitive models.
Line 17: batch_size = 2 (mini-batch size)

Each gradient update uses 2 samples. With N=6 and B=2: 3 batches per epoch, so 3 weight updates per epoch, 9 total updates across 3 epochs.

EXECUTION STATE
batch_size = 2 = 6 samples / 2 per batch = 3 batches per epoch. Each batch gives one gradient update.
Line 18: losses = [] (loss tracking list)

We record the average loss at the end of each epoch. This creates the loss curve, the single most important diagnostic tool for training. A healthy training run shows the loss trending downward, though individual epochs can fluctuate.

EXECUTION STATE
losses = Empty list. After 3 epochs: [0.8517, 0.8307, 0.7130]. This downward trend confirms the model is learning.
Line 20: np.random.seed(42) (fix shuffle order)

Seeds the RNG used for shuffling. This ensures the same batch compositions each run for reproducibility. In real training, you would NOT fix this seed — different shuffles each run help generalization.

Line 22: for epoch in range(3): (epoch loop, outer loop)

The outer loop. One epoch = one complete pass through all 6 training samples. We train for 3 epochs, giving the model 3 chances to see every data point. Real training uses 10-300 epochs.

LOOP TRACE · 3 iterations
epoch=0
Epoch 0 = lr=0.500, perm=[0,1,5,2,4,3]. Batches: {0,1}, {5,2}, {4,3}. Avg loss: 0.8517
clipping = Batch 2 gradient norm = 1.86 > 1.0 → CLIPPED to 1.0
epoch=1
Epoch 1 = lr=0.400, perm=[3,0,1,2,5,4]. Batches: {3,0}, {1,2}, {5,4}. Avg loss: 0.8307
clipping = Batches 1 and 2 gradient norms (1.05, 1.52) > 1.0 → CLIPPED
epoch=2
Epoch 2 = lr=0.320, perm=[4,1,0,3,5,2]. Batches: {4,1}, {0,3}, {5,2}. Avg loss: 0.7130
clipping = No clipping needed — all gradient norms < 1.0. Training has stabilized!
Line 23: perm = np.random.permutation(len(X)) (shuffle each epoch)

Generate a fresh random ordering of sample indices. This is the first action inside each epoch — critical for preventing the model from memorizing batch-order patterns. Different permutation each epoch means different batch compositions.

EXECUTION STATE
📚 np.random.permutation(6) = Returns a random ordering of [0,1,2,3,4,5]. Each epoch gets a different order.
epoch 0 perm = [0, 1, 5, 2, 4, 3] → batches: {0,1}, {5,2}, {4,3}
epoch 1 perm = [3, 0, 1, 2, 5, 4] → batches: {3,0}, {1,2}, {5,4}
epoch 2 perm = [4, 1, 0, 3, 5, 2] → batches: {4,1}, {0,3}, {5,2}
Line 24: X_s, y_s = X[perm], y[perm] (apply shuffle)

Reorder both inputs and targets using the same permutation. The pairing X_s[i], y_s[i] is preserved — we shuffled the order, not the correspondence.

EXECUTION STATE
X_s (epoch 0) = X[[0,1,5,2,4,3]]: rows 0,1 are batch 0; rows 5,2 are batch 1; rows 4,3 are batch 2
Line 25: epoch_loss = 0.0 (accumulator for epoch average)

Running sum of losses across all batches in this epoch. Divided by the number of batches at the end to get the average loss for the loss curve.

Line 27: for i in range(0, len(X), batch_size): (batch loop, inner loop)

The inner loop. Iterates through the shuffled data in chunks of batch_size=2. range(0, 6, 2) gives i = 0, 2, 4, creating 3 batches. Each iteration of this loop performs one complete training step: forward → loss → backward → clip → update.

EXECUTION STATE
📚 range(0, 6, 2) = Start at 0, stop before 6, step by 2. Yields: 0, 2, 4. Three batches.
batch 0: i=0 = X_s[0:2] — first 2 shuffled samples
batch 1: i=2 = X_s[2:4] — middle 2 shuffled samples
batch 2: i=4 = X_s[4:6] — last 2 shuffled samples
28Xb = X_s[i:i+batch_size] — Batch inputs

Slice batch_size=2 rows from the shuffled input array. For epoch 0, batch 0: X_s[0:2] selects samples originally at indices 0 and 1.

EXECUTION STATE
Xb (epoch 0, batch 0) =
[[0.5, 0.1],   (sample 0, class 0)
 [0.2, 0.8]]   (sample 1, class 0)
29yb = y_s[i:i+batch_size] — Batch labels

Corresponding class labels for this batch.

EXECUTION STATE
yb (epoch 0, batch 0) = [0, 0] — both samples are class 0
30B = len(yb) — Actual batch size

Usually equals batch_size, but the last batch is smaller when N is not divisible by batch_size. Using the actual length keeps the indexing probs[range(B), yb] and the averaging dz2 /= B correct for that short batch.

EXECUTION STATE
B = 2 = For this dataset, every batch has exactly 2 samples (6 / 2 = 3 even batches).
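The slicing pattern can be sketched on its own to show what happens when N is not divisible by the batch size. The helper name `batch_slices` and the 7-sample array are hypothetical, chosen only to produce a short final batch.

```python
import numpy as np

def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover n_samples in order."""
    for i in range(0, n_samples, batch_size):
        yield i, min(i + batch_size, n_samples)

# Hypothetical 7-sample dataset: the last batch holds a single sample.
data = np.arange(7)
sizes = [len(data[a:b]) for a, b in batch_slices(len(data), 2)]
print(sizes)  # [2, 2, 2, 1]

# The six-sample dataset from this section divides evenly.
even_sizes = [b - a for a, b in batch_slices(6, 2)]
print(even_sizes)  # [2, 2, 2]
```

With 7 samples the loop runs 4 batches while 7 // 2 is 3, which is why code that averages over batches should count them rather than assume N // batch_size.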
32# Step 1: Forward pass

The forward pass computes the model’s prediction for the current batch. Data flows: Input → Linear(W1) → ReLU → Linear(W2) → Softmax → probabilities. This was covered in detail in Chapter 7.

33z1 = Xb @ W1 + b1 — Layer 1 pre-activation

Matrix multiply: Xb(2×2) @ W1(2×4) = z1(2×4). Each row of z1 is one sample’s pre-activation values for the 4 hidden neurons.

EXECUTION STATE
📚 @ operator = Matrix multiplication. Xb(2×2) @ W1(2×4) = z1(2×4). Inner dimensions (2) match.
z1 (epoch 0, batch 0) =
[[ 1.1496, -0.3491,  0.0245,  0.0424],
 [-0.4395, -0.1373,  0.0088, -1.9834]]
34h1 = np.maximum(0, z1) — ReLU activation

ReLU zeros out all negative values: max(0, x). This introduces nonlinearity — without it, stacking linear layers would just be one big linear layer. Covered in Chapter 5.

EXECUTION STATE
📚 np.maximum(0, z1) = Element-wise max with 0. Negative → 0, positive → unchanged.
h1 (epoch 0, batch 0) =
[[1.1496, 0.0000, 0.0245, 0.0424],
 [0.0000, 0.0000, 0.0088, 0.0000]]
→ sparsity = Sample 1 has 3 of 4 neurons inactive (zero output). This is normal for ReLU: sparse activations are a feature, not a bug. A neuron is only called dead when it outputs zero for every input.
35z2 = h1 @ W2 + b2 — Output logits

Layer 2 pre-activations (logits). h1(2×4) @ W2(4×2) = z2(2×2). Each row has 2 values: one raw score per class. These become probabilities after softmax.

EXECUTION STATE
z2 (epoch 0, batch 0) =
[[1.7579, 0.9334],
 [0.0067, -0.0034]]
→ sample 0 = logit for class 0 = 1.76, class 1 = 0.93. Class 0 has higher logit → model leans toward class 0 (correct!).
36exp_z = np.exp(z2 - z2.max(axis=1, keepdims=True)) — Numerically stable exp

First subtract the row-wise max, then apply exp(). This is the same max-shift used in the log-sum-exp trick. It prevents overflow: exp(1000) overflows to inf in floating point, but exp(1000-1000) = exp(0) = 1. The subtraction does not change the softmax output because softmax(x) = softmax(x - c) for any constant c.

EXECUTION STATE
📚 z2.max(axis=1, keepdims=True) = Row-wise max: [[1.7579], [0.0067]]. keepdims=True keeps shape (2,1) for broadcasting.
⬇ axis=1 = Operates along axis 1 (the columns within each row). For this 2-D array it is the same as axis=-1, the last axis.
⬇ keepdims=True = Result shape (2,1) not (2,) — so z2(2,2) - max(2,1) broadcasts correctly.
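A short sketch of the two properties claimed for the max-shift: the output is unchanged by adding a constant to a row, and large logits no longer overflow. The function name `softmax_stable` and the logits `[1000, 999]` are illustrative choices, not part of the original listing.

```python
import numpy as np

def softmax_stable(z):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

z = np.array([[1.7579, 0.9334],
              [0.0067, -0.0034]])   # logits from the trace above

# Shift invariance: adding a constant to every logit in a row changes nothing.
same = np.allclose(softmax_stable(z), softmax_stable(z + 100.0))

# Large logits: a naive np.exp(1000.0) overflows to inf, the shifted version stays finite.
big = np.array([[1000.0, 999.0]])
p_big = softmax_stable(big)
print(same, p_big)
```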
37probs = exp_z / exp_z.sum(axis=1, keepdims=True) — Softmax

Divides each row by its sum to get probabilities that sum to 1. This is the complete softmax function: softmax(z_i) = exp(z_i) / ∑ exp(z_j).

EXECUTION STATE
probs (epoch 0, batch 0) =
[[0.6952, 0.3048],   → sample 0: 69.5% class 0 (correct=0 ✓)
 [0.5025, 0.4975]]   → sample 1: 50.2% class 0 (correct=0, barely)
39# Step 2: Compute loss

Cross-entropy loss measures how far the predicted probabilities are from the true labels. Lower = better. A perfect model gives loss = 0 (probability 1.0 for the correct class).

40loss = -np.mean(np.log(probs[range(B), yb] + 1e-8))

Cross-entropy loss: pick the predicted probability for the correct class, take its negative log, average across the batch. probs[range(B), yb] uses advanced indexing to select probs[0, y[0]] and probs[1, y[1]].

EXECUTION STATE
📚 probs[range(B), yb] = Advanced indexing: probs[0, 0] = 0.6952, probs[1, 0] = 0.5025. These are the predicted probabilities for the correct class.
→ log of correct probs = log(0.6952) = -0.3635, log(0.5025) = -0.6882
⬇ 1e-8 = Tiny epsilon to prevent log(0) = -infinity. In practice, softmax never outputs exactly 0, but this is a safety net.
⬆ loss (epoch 0, batch 0) = -mean([-0.3635, -0.6882]) = 0.5258. Lower is better (perfect = 0).
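The arithmetic in this step can be reproduced from the rounded probabilities in the trace. Because the inputs are rounded to four decimals, the result should agree with the reported 0.5258 only to about three decimals.

```python
import numpy as np

probs = np.array([[0.6952, 0.3048],
                  [0.5025, 0.4975]])   # rounded values from the trace above
yb = np.array([0, 0])
B = len(yb)

correct = probs[np.arange(B), yb]          # probability assigned to the true class
loss = -np.mean(np.log(correct + 1e-8))    # same formula as the listing
print(correct, loss)
```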
41epoch_loss += loss — Accumulate batch loss

Add this batch’s loss to the running sum. At epoch end, we divide by number of batches to get the average loss for the epoch.

43# Step 3: Backward pass

Compute gradients of the loss with respect to every weight and bias. These gradients tell us which direction to adjust each parameter to reduce the loss. The chain rule propagates gradients from the output back to the input (covered in Chapter 8).

44dz2 = probs.copy() — Start of softmax+CE gradient

The gradient of cross-entropy loss combined with softmax has a beautifully simple form: dL/dz2 = probs - one_hot(y). We start by copying probs, then subtract 1 from the correct class position.

EXECUTION STATE
dz2 before subtraction =
[[0.6952, 0.3048],
 [0.5025, 0.4975]] — copy of probs
45dz2[range(B), yb] -= 1 — Subtract 1 at correct class

This creates the softmax+CE gradient in one step. For each sample, subtract 1 from the position of the correct class. The result: dz2[i,j] = probs[i,j] - 1 if j is the correct class, else probs[i,j].

EXECUTION STATE
dz2 after subtraction =
[[-0.3048,  0.3048],    → sample 0: correct class 0, subtracted 1
 [-0.4975,  0.4975]]    → sample 1: correct class 0, subtracted 1
→ interpretation = Negative at correct class = ‘increase this logit’. Positive at wrong class = ‘decrease this logit’. The magnitude is the error.
46dz2 /= B — Average over batch

Divide by batch size to get the mean gradient. This matches the 1/B in the loss = (1/B)∑ losses formula.

EXECUTION STATE
dz2 / 2 =
[[-0.1524,  0.1524],
 [-0.2488,  0.2488]]
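One way to convince yourself that (probs - one_hot) / B really is the gradient is a finite-difference check. The sketch below uses the logits from the trace; the helper `mean_ce` and the step size `eps` are assumptions made for this check only.

```python
import numpy as np

def mean_ce(z, y):
    """Mean cross-entropy of softmax(z) against integer labels y."""
    shifted = z - z.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

z2 = np.array([[1.7579, 0.9334],
               [0.0067, -0.0034]])
yb = np.array([0, 0])
B = len(yb)

# Analytic gradient, built the same way as in the listing.
exp_z = np.exp(z2 - z2.max(axis=1, keepdims=True))
dz2 = exp_z / exp_z.sum(axis=1, keepdims=True)
dz2[np.arange(B), yb] -= 1
dz2 /= B

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z2)
for idx in np.ndindex(*z2.shape):
    step = np.zeros_like(z2)
    step[idx] = eps
    numeric[idx] = (mean_ce(z2 + step, yb) - mean_ce(z2 - step, yb)) / (2 * eps)

max_gap = np.abs(dz2 - numeric).max()
print(dz2.round(4), max_gap)
```

The analytic and numeric gradients should agree to many decimal places; a large gap would point to a mistake in the derivation.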
47dW2 = h1.T @ dz2 — Weight gradient for layer 2

Gradient of loss w.r.t. W2. The formula dW = activationsᵀ × output_gradient comes from the chain rule applied to z2 = h1 @ W2 + b2. Shape: h1.T(4×2) @ dz2(2×2) = dW2(4×2).

EXECUTION STATE
📚 h1.T @ dz2 = Transpose h1 from (2×4) to (4×2), then multiply by dz2(2×2) = dW2(4×2). Each element dW2[i,j] tells us how to adjust the weight connecting hidden neuron i to output class j.
48db2 = dz2.sum(axis=0) — Bias gradient for layer 2

Sum the output gradient across samples. Since bias adds the same value to all samples, its gradient is the sum of per-sample gradients.

49dh1 = dz2 @ W2.T — Gradient flowing back through layer 2

Backpropagate the gradient through the weight matrix: dz2(2×2) @ W2.T(2×4) = dh1(2×4). This is the gradient of loss w.r.t. the hidden layer activations.

50dz1 = dh1 * (z1 > 0) — ReLU backward

The derivative of ReLU: 1 if the pre-activation was positive (neuron was active), 0 otherwise (neuron was inactive for this sample). This masks out gradients for neurons that were zeroed by ReLU.

EXECUTION STATE
📚 (z1 > 0) = Boolean mask: True where z1 > 0, False where z1 ≤ 0. Element-wise multiply zeros out gradients wherever the neuron was inactive.
51dW1 = Xb.T @ dz1 — Weight gradient for layer 1

Same pattern as dW2: activationsᵀ × gradient. Here the ‘activations’ are the raw inputs Xb. Shape: Xb.T(2×2) @ dz1(2×4) = dW1(2×4).

52db1 = dz1.sum(axis=0) — Bias gradient for layer 1

Sum across samples, same pattern as db2.
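The whole backward pass can be checked the same way: perturb each of the 22 parameters and compare the change in loss with the manual gradient. This is a sketch under stated assumptions: the function names, the random seed, and the step size `eps` are chosen for illustration, and the weights are freshly sampled rather than taken from the training run.

```python
import numpy as np

def forward_loss(params, Xb, yb):
    """Forward pass of the 2 -> 4 -> 2 MLP followed by mean cross-entropy."""
    W1, b1, W2, b2 = params
    z1 = Xb @ W1 + b1
    h1 = np.maximum(0, z1)
    z2 = h1 @ W2 + b2
    shifted = z2 - z2.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(yb)), yb].mean(), (z1, h1, log_probs)

def backward(params, Xb, yb, cache):
    """Manual gradients, following the backward-pass lines of the listing."""
    W1, b1, W2, b2 = params
    z1, h1, log_probs = cache
    dz2 = np.exp(log_probs)                 # softmax probabilities
    dz2[np.arange(len(yb)), yb] -= 1
    dz2 /= len(yb)
    dW2, db2 = h1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)           # ReLU mask
    dW1, db1 = Xb.T @ dz1, dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)              # seed chosen arbitrarily for this check
params = [rng.normal(size=(2, 4)), rng.normal(size=4),
          rng.normal(size=(4, 2)), rng.normal(size=2)]
Xb = np.array([[0.5, 0.1], [0.2, 0.8]])
yb = np.array([0, 0])

_, cache = forward_loss(params, Xb, yb)
analytic = backward(params, Xb, yb, cache)

eps, max_gap = 1e-6, 0.0
for p, g in zip(params, analytic):
    for idx in np.ndindex(*p.shape):
        old = p[idx]
        p[idx] = old + eps
        up, _ = forward_loss(params, Xb, yb)
        p[idx] = old - eps
        down, _ = forward_loss(params, Xb, yb)
        p[idx] = old                        # restore the parameter
        max_gap = max(max_gap, abs((up - down) / (2 * eps) - g[idx]))
print(max_gap)
```

A check like this is slow and is meant for debugging a hand-written backward pass, not for use inside the training loop.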

54# Step 4: Gradient clipping

Before updating weights, we check if the total gradient is too large. If it is, we scale it down. This prevents a single bad batch from making a catastrophically large update that destabilizes training. Think of it as a speed limit on weight updates.

55grads = np.concatenate([g.ravel() for g in [dW1, db1, dW2, db2]])

Flatten all gradient tensors into a single 1D vector. dW1(2×4)=8 values, db1(4)=4, dW2(4×2)=8, db2(2)=2. Total: 22 gradient values — one per parameter.

EXECUTION STATE
📚 .ravel() = Flattens a multi-dimensional array into 1D. [[1,2],[3,4]].ravel() = [1,2,3,4]. Returns a view when the memory layout allows it and a copy otherwise.
📚 np.concatenate() = Joins arrays end-to-end. concatenate([[1,2], [3,4,5]]) = [1,2,3,4,5].
grads shape = (22,) — all 22 gradient values in one vector, ready for norm computation
56norm = np.linalg.norm(grads) — L2 norm of all gradients

Computes the Euclidean norm (L2 norm): √(g₁² + g₂² + ... + g₂₂²). This single number summarizes the total ‘size’ of the gradient. A large norm means the model wants to make a big jump in weight space.

EXECUTION STATE
📚 np.linalg.norm() = Euclidean (L2) norm: sqrt(sum of squares). norm([3,4]) = sqrt(9+16) = 5.0.
norm values across training = E0B0: 0.89, E0B1: 0.90, E0B2: 1.86 (will be clipped!) E1B0: 0.34, E1B1: 1.05 (clipped), E1B2: 1.52 (clipped) E2B0: 0.58, E2B1: 0.31, E2B2: 0.38 (all under threshold)
57if norm > clip_norm: — Check if clipping needed

Only clip when the gradient norm exceeds the threshold. If norm ≤ clip_norm, leave the gradients unchanged. This is a conditional safeguard, not a constant rescaling.

EXECUTION STATE
→ epoch 0, batch 2 = norm = 1.8621 > 1.0 → CLIP! scale = 1.0 / 1.8621 = 0.537
→ epoch 2, batch 0 = norm = 0.5781 < 1.0 → no clip needed
58scale = clip_norm / norm — Compute scaling factor

The ratio that will bring the gradient norm down to exactly clip_norm. If norm = 1.86 and clip_norm = 1.0, scale = 0.537. Every gradient component is multiplied by this factor, preserving the gradient direction while reducing its magnitude.

EXECUTION STATE
scale formula = clip_norm / norm = 1.0 / 1.8621 = 0.5370. All 22 gradient values are multiplied by 0.537.
→ after scaling = New norm = 1.8621 × 0.537 = 1.0000 exactly. Direction preserved, magnitude capped.
59dW1 *= scale; db1 *= scale — Scale layer 1 gradients

In-place multiplication reduces all layer 1 gradients proportionally. The *= operator modifies the arrays without creating copies.

60dW2 *= scale; db2 *= scale — Scale layer 2 gradients

Same scaling for layer 2. All gradient tensors are scaled by the same factor, preserving relative magnitudes between layers.

61print(f" Clipped: {norm:.2f} -> {clip_norm}")

Logs when clipping happens. Monitoring clip frequency is important: if clipping happens very often, the learning rate may be too high. If it never happens, the threshold may be too generous.

EXECUTION STATE
output examples = Epoch 0, Batch 2: ‘Clipped: 1.86 -> 1.0’ Epoch 1, Batch 1: ‘Clipped: 1.05 -> 1.0’ Epoch 1, Batch 2: ‘Clipped: 1.52 -> 1.0’
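Step 4 can be packaged as a small function. The sketch below is one possible from-scratch version: the name `clip_by_global_norm` and the example gradients are hypothetical, and unlike the listing it returns new arrays instead of scaling in place.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_norm.

    Returns the (possibly scaled) gradients and the norm measured before clipping.
    """
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > clip_norm:
        scale = clip_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm

# Hypothetical gradients with a joint norm of 13 (a 3-4-12 triple).
grads = [np.array([[3.0, 0.0], [0.0, 4.0]]), np.array([12.0])]
clipped, before = clip_by_global_norm(grads, clip_norm=1.0)
after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))

# Direction check: every component shrank by the same factor 1/13.
ratio = clipped[1][0] / grads[1][0]
print(before, after, ratio)

# A gradient already inside the limit passes through unchanged.
small, small_norm = clip_by_global_norm([np.array([0.3, 0.4])], clip_norm=1.0)
```

Because one scale factor is applied to all tensors, the clipped gradient points in the same direction as the original and its norm never exceeds clip_norm.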
63# Step 5: Update weights

Apply the (possibly clipped) gradients to update all model parameters. This is the SGD update rule: θ ← θ - lr × gradient. After this, the model is slightly better at classifying the current batch’s samples.

64W1 -= lr * dW1; b1 -= lr * db1 — Update layer 1

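SGD update for layer 1 weights and biases. With lr=0.5 in epoch 0, each weight moves by 0.5 × (its gradient component) in the direction that reduces the loss on this batch. Because the gradient norm is at most clip_norm after Step 4, the full update vector has length at most lr × clip_norm = 0.5.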

EXECUTION STATE
update formula = W1_new = W1_old - 0.5 × dW1. The -= operator is in-place (efficient, no copy).
65W2 -= lr * dW2; b2 -= lr * db2 — Update layer 2

Same SGD update for layer 2. After this line, one complete training step is done: forward → loss → backward → clip → update. The inner loop moves to the next batch.

67avg_loss = epoch_loss / (len(X) // batch_size) — Epoch average

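Compute the average loss over the 3 batches in this epoch. This is the value we plot on the loss curve: it smooths the per-batch fluctuations into a single number per epoch. The divisor len(X) // batch_size is exact only when N is divisible by batch_size; with a short final batch it undercounts by one, so counting batches inside the loop is safer.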

EXECUTION STATE
len(X) // batch_size = 6 // 2 = 3 batches per epoch
⬆ avg_loss values = Epoch 0: (0.5258 + 0.8223 + 1.2069) / 3 = 0.8517 Epoch 1: (0.6255 + 0.6803 + 1.1863) / 3 = 0.8307 Epoch 2: (0.8215 + 0.6814 + 0.6360) / 3 = 0.7130
68losses.append(avg_loss) — Record for loss curve

Append to the losses list. After 3 epochs: losses = [0.8517, 0.8307, 0.7130]. The downward trend (0.85 → 0.83 → 0.71) is what we expect from a model that is learning, though three epochs on six points is too short a run to prove it.

EXECUTION STATE
losses after 3 epochs = [0.8517, 0.8307, 0.7130] — consistent decrease = healthy training
69lr *= 0.8 — Step decay learning rate schedule

Reduce the learning rate by 20% after each epoch. This is the simplest LR schedule: start with large steps to make fast initial progress, then take smaller steps for fine-grained convergence near the optimum.

EXECUTION STATE
→ schedule = After epoch 0: lr = 0.5 × 0.8 = 0.400 After epoch 1: lr = 0.4 × 0.8 = 0.320 After epoch 2: lr = 0.32 × 0.8 = 0.256
→ why decay? = Early in training, we are far from the optimum → large steps are fine. Late in training, we are close → large steps would overshoot. Decaying lr matches step size to distance from optimum.
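The schedule has a closed form: the rate in effect during epoch k is lr0 × 0.8^k. The sketch below compares that with the iterative update used in the loop; the variable names are illustrative.

```python
lr0, gamma = 0.5, 0.8

# Learning rate in effect DURING epoch k: lr0 * gamma**k
during = [lr0 * gamma ** k for k in range(3)]

# Iterative form used in the loop: multiply once at the end of every epoch.
lr, after = lr0, []
for _ in range(3):
    lr *= gamma
    after.append(lr)

print([round(v, 3) for v in during])  # [0.5, 0.4, 0.32]
print([round(v, 3) for v in after])   # [0.4, 0.32, 0.256]
```

The second list is the first shifted by one epoch: each value in it is the rate the following epoch will train with.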
70print(f"Epoch {epoch}: loss=..., lr=...") — Progress report

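Print the epoch summary. This is the minimum monitoring for any training run: epoch number, average loss, and learning rate. Because the print runs after lr *= 0.8, the value shown is the rate the next epoch will use, not the one this epoch trained with.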

EXECUTION STATE
output = Epoch 0: loss=0.8517, lr=0.4000 Epoch 1: loss=0.8307, lr=0.3200 Epoch 2: loss=0.7130, lr=0.2560
72print(f"Loss trajectory: ...") — Final summary

Print the complete loss history. [0.8517, 0.8307, 0.7130] falls in every epoch. All three values are still above the random-chance level (~0.69 for binary cross-entropy), so after 3 epochs the model has improved but has not yet learned to separate the two classes.

EXECUTION STATE
output = Loss trajectory: [‘0.8517’, ‘0.8307’, ‘0.7130’]
→ random baseline = Binary CE at 50/50 prediction = -log(0.5) = 0.693. Our initial loss (0.85) is above this because the random weights do not predict 50/50 and are wrong on some samples. After 3 epochs, 0.71 is still slightly above the baseline: the model is improving, but it needs more epochs before it beats a coin flip.
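The baseline and the distance of each epoch's loss from it can be computed directly. The loss values are the epoch averages reported above; the variable names are illustrative.

```python
import math

n_classes = 2
baseline = -math.log(1.0 / n_classes)      # loss of a uniform 50/50 prediction
losses = [0.8517, 0.8307, 0.7130]          # epoch averages reported above

gaps = [round(l - baseline, 4) for l in losses]
print(round(baseline, 4), gaps)
```

The gap shrinks every epoch and is still positive after epoch 2. For C classes the same formula gives a baseline of log(C).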
1import numpy as np
2
3# ── Dataset: 6 points, 2 classes ──
4X = np.array([[0.5, 0.1], [0.2, 0.8], [0.9, 0.6],
5              [0.1, 0.5], [0.8, 0.2], [0.6, 0.9]])
6y = np.array([0, 0, 0, 1, 1, 1])
7
8# ── 2-layer MLP: 2 → 4 → 2 ──
9np.random.seed(7)
10W1 = 1.5 * np.random.randn(2, 4)
11b1 = np.zeros(4)
12W2 = 1.5 * np.random.randn(4, 2)
13b2 = np.zeros(2)
14
15lr = 0.5
16clip_norm = 1.0
17batch_size = 2
18losses = []
19
20np.random.seed(42)
21
22for epoch in range(3):
23    perm = np.random.permutation(len(X))
24    X_s, y_s = X[perm], y[perm]
25    epoch_loss = 0.0
26
27    for i in range(0, len(X), batch_size):
28        Xb = X_s[i:i+batch_size]
29        yb = y_s[i:i+batch_size]
30        B = len(yb)
31
32        # ── Step 1: Forward pass ──
33        z1 = Xb @ W1 + b1
34        h1 = np.maximum(0, z1)
35        z2 = h1 @ W2 + b2
36        exp_z = np.exp(z2 - z2.max(axis=1, keepdims=True))
37        probs = exp_z / exp_z.sum(axis=1, keepdims=True)
38
39        # ── Step 2: Compute loss ──
40        loss = -np.mean(np.log(probs[range(B), yb] + 1e-8))
41        epoch_loss += loss
42
43        # ── Step 3: Backward pass ──
44        dz2 = probs.copy()
45        dz2[range(B), yb] -= 1
46        dz2 /= B
47        dW2 = h1.T @ dz2
48        db2 = dz2.sum(axis=0)
49        dh1 = dz2 @ W2.T
50        dz1 = dh1 * (z1 > 0)
51        dW1 = Xb.T @ dz1
52        db1 = dz1.sum(axis=0)
53
54        # ── Step 4: Gradient clipping ──
55        grads = np.concatenate([g.ravel() for g in [dW1, db1, dW2, db2]])
56        norm = np.linalg.norm(grads)
57        if norm > clip_norm:
58            scale = clip_norm / norm
59            dW1 *= scale; db1 *= scale
60            dW2 *= scale; db2 *= scale
61            print(f"  Clipped: {norm:.2f} -> {clip_norm}")
62
63        # ── Step 5: Update weights ──
64        W1 -= lr * dW1; b1 -= lr * db1
65        W2 -= lr * dW2; b2 -= lr * db2
66
67    avg_loss = epoch_loss / (len(X) // batch_size)
68    losses.append(avg_loss)
69    lr *= 0.8
70    print(f"Epoch {epoch}: loss={avg_loss:.4f}, lr={lr:.4f}")
71
72print(f"Loss trajectory: {[f'{l:.4f}' for l in losses]}")

The loss trajectory [0.8517, 0.8307, 0.7130] decreases every epoch. Gradient clipping activated on 3 of the 9 batches, all of them in the first two epochs when gradients were large. By epoch 2, every gradient norm was below the threshold (highest: 0.58), which suggests the model has moved to a region of the loss landscape with smaller gradients.

Key Insight: Gradient clipping is a transient stabilizer. It is most active early in training when the model is in unstable territory with large gradients. As training progresses and the model approaches a minimum, gradients naturally shrink and clipping becomes inactive. If clipping is always active, the clip threshold is too low or the learning rate is too high.
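Clip frequency is easy to monitor if the pre-clipping norms are logged. A minimal sketch using the nine norms from the trace, with hypothetical variable names:

```python
clip_norm = 1.0
# Pre-clipping gradient norms from the trace, grouped by epoch.
norms = [[0.89, 0.90, 1.86],
         [0.34, 1.05, 1.52],
         [0.58, 0.31, 0.38]]

clips_per_epoch = [sum(n > clip_norm for n in epoch) for epoch in norms]
clip_fraction = sum(clips_per_epoch) / sum(len(epoch) for epoch in norms)
print(clips_per_epoch, round(clip_fraction, 2))  # [1, 2, 0] 0.33
```

A per-epoch count that falls to zero matches the transient pattern described above; a count that stays near the number of batches is the signal to revisit the threshold or the learning rate.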

The PyTorch Training Loop

The PyTorch version follows the same five-step structure, but replaces manual operations with PyTorch's built-in tools:

| Operation | NumPy (manual) | PyTorch (automatic) |
| --- | --- | --- |
| Forward pass | z1 = Xb @ W1 + b1; h1 = ... | logits = model(Xb) |
| Loss | Manual softmax + cross-entropy (3 lines) | loss = criterion(logits, yb) |
| Backward pass | Manual chain rule (9 lines) | loss.backward() |
| Gradient clipping | Concatenate + norm + scale (6 lines) | nn.utils.clip_grad_norm_(...) |
| Update weights | W1 -= lr * dW1 (4 statements) | optimizer.step() |
| LR schedule | lr *= 0.8 (1 line) | scheduler.step() |
| Checkpointing | Not shown | torch.save({...}) |

The core rhythm is identical. PyTorch handles the bookkeeping; you handle the architecture decisions.

PyTorch — Production Training Loop with All Components
🐍pytorch_training_loop.py
1import torch

PyTorch core: provides tensors with GPU acceleration and automatic differentiation. Every computation on tensors with requires_grad=True is recorded in a computational graph for backpropagation.

EXECUTION STATE
📚 torch = PyTorch core. Key subsystems: Tensor (arrays), autograd (automatic derivatives), nn (neural network building blocks), optim (optimizers).
2import torch.nn as nn

The neural network module. Contains layers (nn.Linear, nn.Conv2d), loss functions (nn.CrossEntropyLoss), containers (nn.Module, nn.Sequential), and utilities (nn.utils.clip_grad_norm_).

EXECUTION STATE
📚 torch.nn = Building blocks for neural networks. nn.Module is the base class for all models. nn.Linear is a fully-connected layer. nn.CrossEntropyLoss combines softmax + negative log likelihood.
3from torch.utils.data import DataLoader, TensorDataset

DataLoader handles batching/shuffling (covered in Section 1). TensorDataset is a convenience wrapper that pairs input and target tensors into a Dataset without writing a custom class.

EXECUTION STATE
📚 TensorDataset = Wraps tensors into a Dataset. TensorDataset(X, y) makes dataset[i] return (X[i], y[i]). No custom class needed for simple tensor data.
5# Dataset (same 6 points)

Same classification dataset as the NumPy version: 6 points from 2 classes. Using identical data lets us directly compare the NumPy and PyTorch implementations.

6X = torch.tensor([...]) — Input features

Creates a (6, 2) float32 tensor. dtype=torch.float32 is the standard for neural network training (balances precision and memory).

EXECUTION STATE
⬇ dtype=torch.float32 = 32-bit floating point. Each value uses 4 bytes. float64 is more precise but 2× memory. float16/bfloat16 is used in mixed-precision training.
8y = torch.tensor([...], dtype=torch.long) — Class labels

Labels must be torch.long (64-bit integers) for nn.CrossEntropyLoss. The loss function uses these as indices into the logit tensor: logits[sample, y[sample]].

EXECUTION STATE
⬇ dtype=torch.long = 64-bit integer type. Required by CrossEntropyLoss for indexing. Using float32 would cause a runtime error.
9loader = DataLoader(TensorDataset(X, y), ...) — Batch iterator

One line replaces all the manual shuffling, slicing, and batching we wrote in NumPy. The DataLoader handles everything: shuffling indices, creating batches of 2, iterating through the dataset.

EXECUTION STATE
⬇ batch_size=2 = Same as NumPy version: 3 batches per epoch.
⬇ shuffle=True = Re-randomize order at each epoch start. Equivalent to our np.random.permutation() call.
11# Model

In PyTorch, models are defined as classes that inherit from nn.Module. This gives them parameter tracking, gradient computation, and serialization for free.

12class TwoLayerMLP(nn.Module): — Model class

Inheriting from nn.Module is the PyTorch convention for all models. nn.Module provides: automatic parameter registration, .parameters() method for optimizers, .state_dict() for checkpointing, and .train()/.eval() mode switching.

EXECUTION STATE
📚 nn.Module = Base class for all neural network modules. Tracks parameters, supports save/load, provides forward() hook. Every PyTorch model inherits from this.
13def __init__(self):

Constructor: define all layers here. Each nn.Linear layer you assign as a self.attribute is automatically registered as a parameter. The optimizer will find and update them.

14super().__init__() — Initialize nn.Module

Must call the parent class constructor first. This sets up nn.Module’s internal parameter tracking, hooks, and buffers. Without it, the first self.fc1 = nn.Linear(...) assignment raises an AttributeError.

EXECUTION STATE
📚 super().__init__() = Calls nn.Module.__init__(). Sets up internal dictionaries: _parameters, _modules, _buffers. Without this, self.fc1 = nn.Linear(...) has nowhere to be registered and the assignment fails.
15self.fc1 = nn.Linear(2, 4) — Layer 1

Creates a fully-connected layer: input 2 features → output 4 features. Internally stores a weight matrix W(4×2) and bias vector b(4). Parameters are initialized randomly (Kaiming uniform by default).

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Stores W(out×in) and b(out). Forward: output = input @ W.T + b. Default init: Kaiming uniform for weights, uniform for bias.
⬇ in_features=2 = Each input has 2 features (x₁, x₂). This sets W1 to shape (4, 2).
⬇ out_features=4 = 4 hidden neurons. More neurons = more capacity but more parameters.
→ parameters = W1: (4, 2) = 8 params. b1: (4,) = 4 params. Total: 12 learnable values.
16self.fc2 = nn.Linear(4, 2) — Layer 2 (output)

Maps 4 hidden features to 2 class logits. No activation here — nn.CrossEntropyLoss applies softmax internally.

EXECUTION STATE
→ parameters = W2: (2, 4) = 8 params. b2: (2,) = 2 params. Total: 10.
→ model total = fc1: 12 + fc2: 10 = 22 parameters. Same as our NumPy version.
18def forward(self, x): — Forward pass method

Defines how data flows through the model. Called automatically when you write model(Xb). PyTorch builds the computation graph as operations execute, enabling automatic gradient computation.

EXECUTION STATE
⬇ input: x = Tensor of shape (B, 2) where B is batch size. Each row is one sample’s features.
⬆ returns = Tensor of shape (B, 2) — raw logits (unnormalized scores) for each class.
19x = torch.relu(self.fc1(x)) — Hidden layer + ReLU

self.fc1(x) computes x @ W1.T + b1, then torch.relu zeros negatives. This single line replaces our 2-line z1 = Xb @ W1 + b1; h1 = np.maximum(0, z1) from NumPy.

EXECUTION STATE
📚 torch.relu() = Element-wise ReLU: max(0, x). Same as np.maximum(0, x) but with autograd support.
→ shape = Input: (B, 2) → fc1: (B, 4) → relu: (B, 4). Same shape, negative values zeroed.
20return self.fc2(x) — Output logits (no softmax!)

Returns raw logits, NOT probabilities. nn.CrossEntropyLoss expects raw logits and applies softmax internally for numerical stability. Never apply softmax before CrossEntropyLoss — you would get wrong gradients.

EXECUTION STATE
⬆ return shape = (B, 2): two logits per sample. The class with the larger logit is the one the model favors; the sign of an individual logit carries no meaning on its own.
→ why no softmax? = CrossEntropyLoss = log_softmax + nll_loss internally. Applying softmax yourself would double-apply it, producing tiny gradients and broken training.
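The effect of a double softmax can be seen without PyTorch. The sketch below uses a NumPy stand-in for the log_softmax + nll computation described above; the function names and the logits [5, -5] are assumptions made for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

def softmax(z):
    return np.exp(log_softmax(z))

# Hypothetical logits for a confident, correct prediction of class 0.
logits = np.array([5.0, -5.0])
target = 0

loss_correct = -log_softmax(logits)[target]            # logits passed straight in
loss_double = -log_softmax(softmax(logits))[target]    # softmax applied first by mistake

# With two classes and inputs squeezed into [0, 1], the loss cannot drop below this value.
floor = np.log(1.0 + np.exp(-1.0))
print(loss_correct, loss_double, floor)
```

With the extra softmax the loss stays near 0.31 however confident the model becomes, because the values fed to the loss can differ by at most 1. That flattened signal is what slows training.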
22model = TwoLayerMLP() — Instantiate model

Creates the model with randomly initialized weights. All nn.Linear layers inside are automatically registered. model.parameters() returns an iterator over the 4 parameter tensors, which hold 22 learnable values in total.

EXECUTION STATE
model.parameters() = Returns: [fc1.weight(4,2), fc1.bias(4), fc2.weight(2,4), fc2.bias(2)] — 4 parameter tensors, 22 values total.
23criterion = nn.CrossEntropyLoss() — Loss function

Combines softmax and negative log-likelihood in one numerically stable operation. Input: logits (B, C), targets (B,). Output: scalar loss. This replaces our 3-line softmax + log + mean from NumPy.

EXECUTION STATE
📚 nn.CrossEntropyLoss() = Computes: -log(softmax(logits)[correct_class]), averaged over the batch. Expects raw logits (NOT softmax output). Targets must be torch.long integers.
→ internals = Step 1: log_softmax = logits - log(sum(exp(logits))) Step 2: pick log_softmax[correct_class] Step 3: negate and average
24optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

Creates an SGD optimizer that will update all model parameters using the gradients computed by .backward(). The optimizer stores a reference to model.parameters() and the learning rate.

EXECUTION STATE
📚 torch.optim.SGD() = Stochastic Gradient Descent optimizer. Update rule: param = param - lr × param.grad. Supports momentum, weight_decay, and Nesterov momentum via optional arguments.
⬇ model.parameters() = Iterator over all learnable tensors. The optimizer will update all of these during .step().
⬇ lr=0.5 = Initial learning rate. Same as NumPy version. Will be decayed by the scheduler.
25scheduler = torch.optim.lr_scheduler.StepLR(...) — LR schedule

Automatically decays the learning rate. StepLR multiplies lr by gamma every step_size epochs. With step_size=1 and gamma=0.8: lr decays by 20% every epoch, matching our manual lr *= 0.8.

EXECUTION STATE
📚 StepLR(optimizer, step_size, gamma) = Every step_size epochs, multiply the optimizer’s lr by gamma. lr_new = lr_old × gamma.
⬇ step_size=1 = Decay every 1 epoch. step_size=5 would decay every 5 epochs.
⬇ gamma=0.8 = Multiply lr by 0.8 each step. After 3 steps: 0.5 → 0.4 → 0.32 → 0.256.
27# Training loop

The standard PyTorch training loop. Compare with the NumPy version: the structure is identical, but PyTorch handles gradient computation, optimizer updates, and LR scheduling automatically.

28best_loss = float('inf') — Track best model

Initialize to infinity so the first epoch’s loss is always ‘better’. We save a checkpoint whenever the loss improves, so the file on disk always holds the lowest-loss model seen so far, even if later epochs get worse.

EXECUTION STATE
float('inf') = Python positive infinity. Any real number is less than inf, so the first comparison always triggers a checkpoint save.
30for epoch in range(3): — Epoch loop

Same outer loop as NumPy. Each epoch: iterate through all batches, update weights, decay LR, check if best model.

31epoch_loss = 0.0 — Loss accumulator

Reset at the start of each epoch. We sum batch losses and divide by n_batches for the average.

32n_batches = 0 — Batch counter

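Count batches explicitly instead of relying on len(loader). For a map-style dataset like this one, len(loader) gives the same number, including with drop_last=True; the explicit counter also works for iterable-style datasets that do not define a length.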

34for Xb, yb in loader: — Batch iteration

This single line replaces our manual shuffling, index computation, and slicing from NumPy. DataLoader handles everything: shuffle indices, create batches, stack into tensors.

EXECUTION STATE
📚 iterating DataLoader = Each iteration: (1) generate B shuffled indices, (2) call dataset[idx] for each, (3) stack into tensors. Returns (Xb, yb) tuple.
Xb shape = (2, 2) — 2 samples, 2 features each
yb shape = (2,) — 2 class labels (long integers)
35# Step 1: Forward pass

Same as NumPy Step 1, but PyTorch records the computation graph automatically.

36logits = model(Xb) — Forward pass

Calls model.forward(Xb) which runs: fc1 → relu → fc2. Returns (2, 2) tensor of raw logits. PyTorch builds the computation graph in the background as each operation executes.

EXECUTION STATE
logits shape = (2, 2) — two logits per sample. These are raw scores, not probabilities.
38# Step 2: Compute loss

CrossEntropyLoss applies softmax and computes negative log-likelihood in one step.

39loss = criterion(logits, yb) — Cross-entropy loss

Internally computes: softmax(logits), picks the correct class probabilities, takes -log, averages. Returns a scalar tensor with grad_fn (part of the computation graph).

EXECUTION STATE
📚 criterion(logits, yb) = nn.CrossEntropyLoss: (1) log_softmax(logits), (2) select [correct class], (3) negate and mean. Returns scalar tensor.
loss = Scalar tensor, e.g. tensor(0.8517, grad_fn=<NllLossBackward0>). The grad_fn shows it is connected to the computation graph.
41# Step 3: Backward pass

Two lines replace our entire manual backpropagation: zero_grad() clears old gradients, backward() computes new ones via autograd.

42optimizer.zero_grad() — Clear old gradients

Clears the .grad of every parameter the optimizer manages. CRITICAL: PyTorch accumulates gradients by default. Without this, gradients from previous batches would add up, giving wrong updates.

EXECUTION STATE
📚 optimizer.zero_grad() = Iterates over all parameter groups and resets each param.grad. Since PyTorch 2.0 the default (set_to_none=True) sets .grad to None; with set_to_none=False it fills the existing tensors with zeros, like calling p.grad.zero_() on each parameter. Must be called BEFORE .backward().
→ common bug = Forgetting zero_grad() causes gradients to accumulate across batches. Loss initially drops, then explodes. One of the most common PyTorch bugs.
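Gradient accumulation is easy to observe on a single scalar parameter. This is a minimal sketch: the tensor `w` and the function 3w are hypothetical, and the gradient is cleared by hand to show what zero_grad() does for every managed parameter.

```python
import torch

# A single scalar parameter so the accumulated gradient is easy to read.
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3.
(3 * w).backward()
first = w.grad.item()

# Second backward pass WITHOUT clearing: the new gradient is added to the old one.
(3 * w).backward()
accumulated = w.grad.item()

# Clearing by hand, then recomputing, gives the true gradient again.
w.grad = None
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

The second value is twice the true gradient, which is exactly the error a training loop makes when zero_grad() is left out.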
43loss.backward() — Compute all gradients

Traverses the computation graph backward from loss to every leaf tensor with requires_grad=True. After this call, every model parameter has a .grad attribute containing ∂loss/∂param. This replaces our 9 lines of manual gradient computation.

EXECUTION STATE
📚 .backward() = Reverse-mode automatic differentiation. Walks the graph: loss → nll → log_softmax → fc2 → relu → fc1. Computes and stores ∂loss/∂W for every parameter.
→ after backward() = model.fc1.weight.grad = dW1 (shape 4×2) model.fc1.bias.grad = db1 (shape 4) model.fc2.weight.grad = dW2 (shape 2×4) model.fc2.bias.grad = db2 (shape 2)
45# Step 4: Gradient clipping

PyTorch provides a built-in function for gradient clipping. One line replaces our 6-line manual implementation.

46grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Computes the total L2 norm of all gradients, then scales them down if the norm exceeds max_norm. Returns the original (pre-clipping) norm for monitoring. The trailing underscore _ means it modifies gradients in-place.

EXECUTION STATE
📚 nn.utils.clip_grad_norm_() = 1. Compute total norm: sqrt(sum of all grad² across all params) 2. If norm > max_norm: scale all grads by (max_norm / norm) 3. Return the original norm (before clipping) The _ suffix means in-place modification.
⬇ model.parameters() = All model parameters. The function iterates over them to compute the total norm.
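⬇ max_norm=1.0 = Maximum allowed gradient norm. 1.0 is a common choice for transformer language models (GPT-3 and the reference BERT implementation both clip at 1.0). Larger thresholds such as 5.0 or 10.0 are also used in practice.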
⬆ grad_norm = The pre-clipping norm (useful for monitoring). If this is consistently > max_norm, consider reducing the learning rate.
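The return value and the in-place effect can be checked on gradients set by hand. In this sketch the tensors `w` and `v` and their gradient values are hypothetical; only `torch.nn.utils.clip_grad_norm_` is the real API.

```python
import torch

# Hypothetical parameter whose gradient we set by hand: the norm of (3, 4, 12) is 13.
w = torch.zeros(3, requires_grad=True)
w.grad = torch.tensor([3.0, 4.0, 12.0])

pre_norm = float(torch.nn.utils.clip_grad_norm_([w], max_norm=1.0))
post_norm = float(w.grad.norm())

# A gradient already inside the limit is left alone.
v = torch.zeros(2, requires_grad=True)
v.grad = torch.tensor([0.3, 0.4])
small_pre = float(torch.nn.utils.clip_grad_norm_([v], max_norm=1.0))
small_post = float(v.grad.norm())

print(pre_norm, post_norm, small_pre, small_post)
```

The function returns the norm measured before clipping, so logging its return value gives the monitoring signal described above at no extra cost.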
48# Step 5: Update weights

The optimizer applies the (possibly clipped) gradients to update all model parameters.

49optimizer.step() — Apply gradients

For SGD: param = param - lr * param.grad for every parameter. The optimizer reads each param.grad and updates the parameter tensor. One line replaces the four manual update statements.

EXECUTION STATE
📚 optimizer.step() = Applies the update rule to all parameters. For SGD: p.data -= lr * p.grad. For Adam: includes momentum and adaptive learning rates per parameter.
→ order matters! = Must come AFTER backward() (so gradients exist) and AFTER clip_grad_norm_ (so gradients are clipped). Must come BEFORE zero_grad() of the next iteration.
51epoch_loss += loss.item() — Accumulate (detached)

.item() extracts the scalar value as a Python float, detaching it from the computation graph. This prevents memory leaks — storing the full loss tensor would keep the entire graph alive.

EXECUTION STATE
📚 .item() = Converts scalar tensor to Python float. tensor(0.85).item() = 0.85. Critical: detaches from graph, preventing memory leaks during loss tracking.
52n_batches += 1

Increment counter. After inner loop: n_batches = 3.

54# End-of-epoch bookkeeping

After processing all batches in an epoch, we compute the average loss, step the LR scheduler, print progress, and optionally save a checkpoint. This is the ‘maintenance’ phase of the training loop.

avg_loss = epoch_loss / n_batches  # Epoch average

Same as NumPy: average the per-batch losses. This is the value for the loss curve.

scheduler.step()  # Decay learning rate

Tells the scheduler that one epoch has passed. StepLR multiplies the optimizer’s lr by gamma=0.8. Equivalent to our manual lr *= 0.8.

EXECUTION STATE
📚 scheduler.step() = Advances the scheduler by one step. For StepLR: if (epoch+1) % step_size == 0, multiply lr by gamma. The optimizer’s param_groups[0]['lr'] is updated in-place.
→ LR after each call = After epoch 0: lr = 0.5 × 0.8 = 0.400; after epoch 1: lr = 0.4 × 0.8 = 0.320; after epoch 2: lr = 0.32 × 0.8 = 0.256
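
The decay values above follow from the closed form lr = initial_lr × gamma^(steps // step_size). A small sketch reproduces them; the function name `step_lr` is an assumption for illustration and is not part of PyTorch.

```python
def step_lr(initial_lr, gamma, step_size, num_steps):
    """Learning rate after `num_steps` calls to the scheduler under step decay."""
    return initial_lr * gamma ** (num_steps // step_size)

# Same settings as the walkthrough: lr=0.5, gamma=0.8, step_size=1.
schedule = [step_lr(0.5, 0.8, 1, n) for n in range(4)]
for n, lr in enumerate(schedule):
    print(f"after {n} scheduler steps: lr = {lr:.3f}")
```

With a larger step_size the learning rate stays flat between decays: `step_lr(0.5, 0.8, 2, 3)` is still 0.4, because only one full interval of two steps has elapsed.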
lr = scheduler.get_last_lr()[0]  # Get current LR

Retrieves the learning rate that will be used for the NEXT epoch. Returns a list (one lr per param group), so we index [0] for our single param group.

EXECUTION STATE
📚 get_last_lr() = Returns the learning rates computed by the last scheduler.step() call. List of floats, one per optimizer param group.
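print(f"Epoch {epoch}: loss=..., lr=...")  # Progress

Same monitoring output as the NumPy version: epoch number, average loss, and learning rate. Because scheduler.step() has already run at this point, the printed lr is the value the next epoch will use, not the one used during the epoch just finished.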

# Checkpoint best model

Checkpointing saves the model state to disk so training can be resumed after a crash, and so we can recover the best model even if later epochs overfit.

if avg_loss < best_loss:  # New best?

Only save when the loss improves. This keeps the checkpoint file at the state with the lowest training loss. In practice, you would also track validation loss and checkpoint based on that.

EXECUTION STATE
best_loss tracking = Epoch 0: 0.85 < inf → save! best_loss = 0.85; Epoch 1: 0.83 < 0.85 → save! best_loss = 0.83; Epoch 2: 0.71 < 0.83 → save! best_loss = 0.71
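
The same comparison logic can be traced on its own. In this sketch the helper name `checkpoint_epochs` is an assumption, and the second loss sequence (with a 0.90 epoch) is a made-up example added to show an epoch that does not trigger a save.

```python
def checkpoint_epochs(epoch_losses):
    """Return the epochs at which a checkpoint would be written, plus the best loss."""
    best_loss = float("inf")
    saved = []
    for epoch, avg_loss in enumerate(epoch_losses):
        if avg_loss < best_loss:      # strictly better than anything seen so far
            best_loss = avg_loss
            saved.append(epoch)
    return saved, best_loss

# Losses from the walkthrough: every epoch improves, so every epoch saves.
print(checkpoint_epochs([0.85, 0.83, 0.71]))
# Hypothetical run where epoch 2 gets worse: no save at epoch 2.
print(checkpoint_epochs([0.85, 0.83, 0.90, 0.71]))
```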
best_loss = avg_loss  # Update best

Record the new best loss for comparison in future epochs.

torch.save({...}, 'best_model.pt')  # Save checkpoint

Saves a dictionary with what this example needs to resume training: model weights, optimizer state (momentum buffers, etc.), current epoch, and the loss. The scheduler state is not included here, so a resumed run would also have to restore the learning-rate schedule. The .pt extension is the convention for PyTorch checkpoints.

EXECUTION STATE
📚 torch.save(obj, path) = Serializes any Python object (dict, tensor, model) to a file using pickle. For models, always save state_dict() rather than the whole model object.
⬇ 'epoch': epoch = Which epoch this checkpoint is from. Needed to resume training from the right point.
⬇ 'model_state_dict' = model.state_dict() returns an OrderedDict mapping parameter names to tensors. Contains all weights and biases.
⬇ 'optimizer_state_dict' = optimizer.state_dict() includes learning rates, momentum buffers, and per-parameter statistics. Essential for resuming without a training loss spike.
⬇ 'loss': avg_loss = Record the loss at checkpoint time for comparison when loading.
print(f"  Saved checkpoint (loss=...)")

Log checkpoint saves. In production, you would also log the file path and size. For long training runs, periodically saving prevents losing days of work to a crash.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# ── Dataset (same 6 points) ──
X = torch.tensor([[0.5,0.1],[0.2,0.8],[0.9,0.6],
                  [0.1,0.5],[0.8,0.2],[0.6,0.9]], dtype=torch.float32)
y = torch.tensor([0, 0, 0, 1, 1, 1], dtype=torch.long)
loader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

# ── Model ──
class TwoLayerMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 4)
        self.fc2 = nn.Linear(4, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = TwoLayerMLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.8)

# ── Training loop ──
best_loss = float('inf')

for epoch in range(3):
    epoch_loss = 0.0
    n_batches = 0

    for Xb, yb in loader:
        # Step 1: Forward pass
        logits = model(Xb)

        # Step 2: Compute loss
        loss = criterion(logits, yb)

        # Step 3: Backward pass
        optimizer.zero_grad()
        loss.backward()

        # Step 4: Gradient clipping
        grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Step 5: Update weights
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1

    # End-of-epoch bookkeeping
    avg_loss = epoch_loss / n_batches
    scheduler.step()
    lr = scheduler.get_last_lr()[0]
    print(f"Epoch {epoch}: loss={avg_loss:.4f}, lr={lr:.4f}")

    # Checkpoint best model
    if avg_loss < best_loss:
        best_loss = avg_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': avg_loss,
        }, 'best_model.pt')
        print(f"  Saved checkpoint (loss={avg_loss:.4f})")

Compare the two implementations side by side. The structure is identical: outer epoch loop, inner batch loop, five steps per batch. PyTorch adds three capabilities that the NumPy version lacks: automatic differentiation (loss.backward()), optimizer abstraction (optimizer.step() works the same for SGD, Adam, or any optimizer), and LR scheduling as a first-class concept.


Model Checkpointing

Training a large model can take hours, days, or even weeks. If the process crashes (GPU failure, power outage, preempted cloud instance), you lose all progress unless you have checkpoints — periodic snapshots of the model state saved to disk.

What to Save

A complete checkpoint should include everything needed to resume training from exactly where you left off:

| Component | Why It Is Needed | PyTorch Method |
| --- | --- | --- |
| Model weights | The learned parameters: the whole point of training | model.state_dict() |
| Optimizer state | Momentum buffers, adaptive LR statistics (Adam's m and v). Without these, resuming causes a loss spike | optimizer.state_dict() |
| Epoch / step number | Know where to resume from | Manual tracking |
| LR scheduler state | Resume the LR schedule from the right point | scheduler.state_dict() |
| Best validation loss | Track best model for early stopping | Manual tracking |
| Random state | Reproducible shuffling after resume | torch.get_rng_state() |

Loading a Checkpoint

To resume training, load each component back:

  1. Recreate the model architecture (same class definition)
  2. Call model.load_state_dict(checkpoint['model_state_dict'])
  3. Recreate the optimizer with the same hyperparameters
  4. Call optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  5. Set the starting epoch to checkpoint['epoch'] + 1
Always save state_dict(), not the model object. Saving the full model object (torch.save(model)) pickles the object together with a reference to its class and module path, which is fragile: loading breaks if you rename the class, move it to a different module, or change Python versions. The state_dict is just a dictionary mapping parameter names to tensors, which is portable and robust.
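
The save-then-resume pattern from the loading steps above can be sketched with the standard library alone. This is a stand-in, not PyTorch code: plain Python lists play the role of the state_dict tensors, JSON replaces torch.save, and the helper names `save_checkpoint` and `load_checkpoint` and the placeholder "training" update are all assumptions made for illustration.

```python
import json
import os
import tempfile

def save_checkpoint(path, epoch, weights, loss):
    """Write everything needed to resume into a single file."""
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "weights": weights, "loss": loss}, f)

def load_checkpoint(path):
    """Read the checkpoint dictionary back from disk."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "best_model.json")

# First run: two epochs, then the process stops as if it had crashed.
weights = [0.0, 0.0]
for epoch in range(2):
    weights = [w + 1.0 for w in weights]      # placeholder for one epoch of training
    save_checkpoint(path, epoch, weights, loss=1.0 / (epoch + 1))

# Second run: restore the state, then continue from the following epoch.
ckpt = load_checkpoint(path)
weights = ckpt["weights"]
start_epoch = ckpt["epoch"] + 1               # resume at checkpoint['epoch'] + 1
for epoch in range(start_epoch, 4):
    weights = [w + 1.0 for w in weights]

print(start_epoch, weights)
```

The resumed run ends in the same state as an uninterrupted four-epoch run, which is the property a complete checkpoint is meant to provide.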

Connection to Modern Training Systems

The training loop you just built scales directly to the largest models in existence. Here is how each concept maps to production systems:

Mixed Precision Training

Modern GPUs (A100, H100) have specialized hardware for float16 and bfloat16 operations that run 2–4× faster than float32. Mixed precision training keeps a float32 “master copy” of the weights but performs forward and backward passes in lower precision. With float16, the training loop adds a gradient scaler: the loss is multiplied by a scale factor before backward() so that small gradients do not underflow, and the gradients are divided by the same factor before optimizer.step() applies the weight update.
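
The underflow problem that scaling addresses can be shown with Python's `struct` module, whose `"e"` format code is IEEE 754 half precision. This is only a numerical illustration: the gradient value 1e-8 and the scale factor 65536 are assumed numbers, and real training would rely on the framework's scaler rather than anything like this.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8           # below the smallest positive float16 value (about 6e-8)
loss_scale = 65536.0       # 2**16, an illustrative scale factor

# Without scaling, the gradient is rounded to zero in float16.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the value lands in float16's representable range ...
scaled_fp16 = to_fp16(true_grad * loss_scale)
# ... and is divided by the same factor in higher precision before the update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, recovered)
```

The recovered value differs from the true gradient only by float16 rounding error, whereas the unscaled path loses the gradient entirely.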

Gradient Accumulation Revisited

When the desired batch size exceeds GPU memory, gradient accumulation simulates larger batches by running multiple forward-backward passes before calling optimizer.step(). The training loop change is simple: skip optimizer.step() and optimizer.zero_grad() for $k-1$ iterations, then step and zero on the $k$-th. Effective batch size: $B_{\text{eff}} = k \times B_{\text{micro}}$. The gradients accumulate naturally because PyTorch adds to .grad by default. Dividing each micro-batch loss by $k$ keeps the accumulated gradient equal to the average over the effective batch rather than $k$ times larger.
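
A quick numerical check makes the equivalence concrete. The sketch below assumes a one-parameter model y = w·x with a mean-squared-error loss and made-up data; the variable `accumulated` stands in for a parameter's .grad buffer.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

micro_batch = 2
k = len(xs) // micro_batch           # 3 accumulation steps

# Reference: gradient computed on the full batch of 6 in one pass.
full_grad = mse_grad(w, xs, ys)

accumulated = 0.0                    # plays the role of .grad
for i in range(k):
    start, stop = i * micro_batch, (i + 1) * micro_batch
    # Dividing by k mirrors scaling each micro-batch loss by 1/k.
    accumulated += mse_grad(w, xs[start:stop], ys[start:stop]) / k
    # No zeroing inside the loop: contributions add up across micro-batches.
# Here a real loop would call optimizer.step() and then zero the gradients.

print(full_grad, accumulated)
```

Both numbers come out to -45.5, so the optimizer sees the same gradient either way; only the peak memory differs.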

Distributed Training

In multi-GPU training, each GPU runs the same training loop on a different mini-batch. After the backward pass, gradients are synchronized across GPUs using AllReduce (average all gradient tensors). The optimizer step then uses the averaged gradient, equivalent to training with a batch size of $B \times \text{num\_gpus}$. PyTorch's DistributedDataParallel wraps the model and handles synchronization transparently.
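
The averaging step can be simulated in a single process. This sketch assumes two workers with equally sized shards and a toy linear model y = w·x + b; `all_reduce_mean` and `local_grad` are hypothetical helpers and do not correspond to any PyTorch API.

```python
def all_reduce_mean(per_worker_grads):
    """Element-wise average of the workers' gradients; every worker gets a copy."""
    n = len(per_worker_grads)
    averaged = [sum(vals) / n for vals in zip(*per_worker_grads)]
    return [list(averaged) for _ in range(n)]

def local_grad(w, b, xs, ys):
    """Gradient of mean((w*x + b - y)^2) with respect to (w, b) on one shard."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [dw, db]

w, b = 0.5, 0.0
# Two simulated GPUs, each holding a different mini-batch of size 2.
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]

synced = all_reduce_mean([local_grad(w, b, xs, ys) for xs, ys in shards])

# Reference: one worker processing the combined batch of size 4.
all_x = [x for xs, _ in shards for x in xs]
all_y = [y for _, ys in shards for y in ys]
global_grad = local_grad(w, b, all_x, all_y)

print(synced[0], global_grad)
```

The match relies on every shard having the same size; with unequal shards a plain average of per-worker means would weight examples unevenly.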

Training Stability in Transformers

Large transformers are notoriously sensitive to training stability. The techniques from this section are not optional — they are essential:

  • Gradient clipping (c=1.0): Used by GPT-2, GPT-3, PaLM, and LLaMA. Without it, occasional gradient spikes cause loss spikes that can permanently damage the model
  • Warmup + cosine LR: GPT-3 uses 375M tokens of warmup. LLaMA uses 2000 steps. The warmup period lets LayerNorm and attention statistics stabilize before large updates begin
  • Checkpointing every N steps: LLaMA saves checkpoints every 1000 steps. GPT-3 training ran for weeks — without checkpoints, a single crash would lose days of compute
  • Loss monitoring: Training runs are watched by automated systems that alert engineers when the loss plateaus, spikes, or diverges. Some systems automatically reduce LR or roll back to a checkpoint
The production training loop is exactly what you built in this section, plus mixed precision, gradient accumulation, distributed synchronization, and logging. The five-step rhythm — forward, loss, backward, clip, update — is the same at every scale.
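
The warmup + cosine schedule mentioned in the list above can be sketched in a few lines. The cosine part follows the formula eta_t = eta_max / 2 · (1 + cos(pi · t / T)); the linear shape of the warmup ramp, the step-offset convention, and the function names are assumptions of this sketch rather than a description of any particular model's schedule.

```python
import math

def cosine_lr(t, total_steps, eta_max):
    """Cosine annealing: eta_max at t = 0, falling to 0 at t = total_steps."""
    return 0.5 * eta_max * (1.0 + math.cos(math.pi * t / total_steps))

def warmup_cosine_lr(t, warmup_steps, total_steps, eta_max):
    """Linear ramp up to eta_max, then cosine decay over the remaining steps."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps
    return cosine_lr(t - warmup_steps, total_steps - warmup_steps, eta_max)

# Pure cosine: start, midpoint, end.
for t in (0, 50, 100):
    print(t, round(cosine_lr(t, 100, 1.0), 4))

# Warmup over 10 steps followed by 100 steps of cosine decay.
for t in (0, 9, 10, 60, 110):
    print(t, round(warmup_cosine_lr(t, 10, 110, 1.0), 4))
```

The learning rate starts small, reaches its peak at the end of warmup, passes through half the peak midway through the decay, and ends at zero.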

Summary

In this section, we built the complete training loop from the ground up:

  1. Five-step rhythm: Every training step follows forward → loss → backward → clip → update. The order is strict and the reasons are mathematical
  2. Gradient clipping bounds the update magnitude: $\|\Delta\theta\| \leq \eta \cdot c$. Norm clipping (not value clipping) preserves gradient direction. It is most active early in training and naturally fades as gradients shrink
  3. Learning rate scheduling matches step size to training phase. Step decay is simple; cosine annealing is smooth; warmup + cosine is the transformer standard. The key formula: $\eta_t = \frac{\eta_{\max}}{2}\left(1 + \cos(\pi t / T)\right)$
  4. Checkpointing saves model, optimizer, scheduler, and epoch state. Always save state_dict(), not the model object. Checkpoint the best model based on validation loss
  5. NumPy to PyTorch: The same five steps, same structure, same math. PyTorch automates gradient computation, optimizer bookkeeping, and LR scheduling

In the next section, we will add the critical missing piece: validation and testing. The training loop we built optimizes for the training data, but how do we know if the model will perform well on new, unseen data? That requires splitting our data, tracking separate metrics, and understanding the train-test gap.

References

  • Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature 323, 533–536. The original backpropagation paper that justifies step 3 (the backward pass) of the training loop.
  • Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 / arXiv:1211.5063. The paper that introduced norm-based gradient clipping and analyzed exploding gradients.
  • Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677. Linear scaling rule and gradual warm-up schedule.
  • Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 / arXiv:1608.03983. Origin of cosine annealing with warm restarts.
  • Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks. WACV 2017 / arXiv:1506.01186.
  • Smith, L. N. & Topin, N. (2019). Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. SPIE Defense 2019 / arXiv:1708.07120. The OneCycleLR schedule.
  • Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018 / arXiv:1710.03740. Foundational paper on FP16 training with loss scaling.
  • Kalamkar, D. et al. (2019). A Study of BFLOAT16 for Deep Learning Training. arXiv:1905.12322.
  • PyTorch Documentation. Automatic Mixed Precision (torch.cuda.amp), Learning Rate Schedulers (torch.optim.lr_scheduler), and Optimizer.zero_grad(set_to_none=True). https://pytorch.org/docs/stable/amp.html