Chapter 12
20 min read
Section 38 of 65

Overfitting and Underfitting

Regularization

The Central Problem of Learning

Every neural network faces a fundamental tension: it must learn patterns from training data that are general enough to apply to data it has never seen. Learn too little and the model fails on everything — training data included. Learn too much and the model memorizes the training data perfectly but fails spectacularly on new inputs. This tension between underfitting and overfitting is the single most important concept in all of machine learning.

Before we touch any math, let us build intuition with two analogies that make these concepts impossible to forget.


The Map Maker's Dilemma

Imagine you are a cartographer hired to draw a map of a winding mountain road. You drive the road once, noting landmarks along the way. Now you must draw a map that will be useful not just for today, but for anyone who drives this road in the future.

The Underfitting Map Maker

The first cartographer is lazy. They draw a single straight line from the start of the road to the end. "Close enough," they say. But the road curves, climbs hills, and loops around a lake. The straight-line map is useless — anyone following it would drive off a cliff at the first turn. This is underfitting: the model (straight line) is too simple to capture the true pattern (winding road). It fails on the training data AND on new data.

In mathematical terms, the underfitting model has high bias — it makes strong assumptions (the road is straight) that are systematically wrong. No matter how many times you drive the road and collect more data points, a straight line will never fit a winding road.

The Overfitting Map Maker

The second cartographer is obsessive. They draw every pebble, every tire track, every fallen leaf on the road. Their map is extraordinarily accurate — for today. But tomorrow, the leaves have blown away, a pothole has been filled, and a new crack has appeared. Anyone following yesterday's hyper-detailed map would be confused by every change. This is overfitting: the model captures the true pattern AND the noise (pebbles, leaves). It performs perfectly on training data but poorly on new data because it memorized transient details.

Mathematically, the overfitting model has high variance — if you sent this cartographer to map the same road on different days, they would produce wildly different maps because they faithfully record whatever random debris happens to be there that day.

The Good Map Maker

The third cartographer captures the road's curves, major landmarks, elevation changes, and bridge locations — but ignores the pebbles and leaves. Their map works today and will work next month. This is the sweet spot: the model captures the underlying pattern without memorizing the noise. It generalizes.

The Map Maker's Lesson: Underfitting means your map is too simple — you drew a straight line through a winding road. Overfitting means your map is too detailed — you drew every pebble that will be gone tomorrow. The goal is a map that captures the road's true shape while ignoring the debris.

The Dartboard — Bias vs. Variance

The second analogy makes the mathematics click. Imagine throwing darts at a dartboard, where the bullseye represents the true answer. Each dart throw represents training a model on a different dataset sampled from the same distribution.

  • Bias is how far the center of your dart cluster is from the bullseye. High bias means you are systematically aiming at the wrong spot. Even if you throw a thousand darts, they all miss the same way.
  • Variance is how spread out your darts are. High variance means each throw lands in a different place — your aim is unstable, even if on average you are pointed at the bullseye.

The four combinations tell the full story:

Bias-Variance Dartboard

PredictionsMean

Select a scenario:

Bias: LowVariance: Low
Model: Ideal (Well-Tuned)

Darts are clustered tightly at the bullseye. The model is accurate and consistent.

The dartboard above is a static illustration. The one below is a live experiment: every click resamples a fresh noisy training set and refits the same degree-5 polynomial. Watch how the predictions scatter around the truth — that scatter is the variance term.

Watch variance build up

Click "Draw new dataset" repeatedly. Each click fits a degree-5 polynomial to 20 noisy samples.

Function plot showing accumulating polynomial fits and corresponding dartboard of predictions-2-101200.250.50.751f(x) = sin(2πx) and fitted polynomialsPredictions at x = 0.5 (value, slope)value →slope →
Predictions stored: 0 / 30

Notice the crucial connection: underfitting = high bias (your model is systematically wrong) and overfitting = high variance (your model is unstable, changing wildly with different training data). The ideal model has both low bias and low variance — darts clustered tightly at the bullseye.


The Mathematics: Bias-Variance Decomposition

Now let us formalize the dartboard intuition. Suppose the true relationship between input xx and output yy is y=f(x)+εy = f(x) + \varepsilon, where f(x)f(x) is the true function and εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2) is irreducible noise. We train a model f^D(x)\hat{f}_D(x) on a dataset DD drawn randomly from this distribution.

The expected test error (MSE) at a point xx decomposes cleanly into three terms:

ED[(yf^D(x))2]=(f(x)ED[f^D(x)])2Bias2+ED[(f^D(x)ED[f^D(x)])2]Variance+σ2Noise\mathbb{E}_D\left[(y - \hat{f}_D(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\left[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}

Deriving the Decomposition

Start with the MSE and add-and-subtract the expected prediction fˉ(x)=ED[f^D(x)]\bar{f}(x) = \mathbb{E}_D[\hat{f}_D(x)]:

ED[(yf^D)2]\mathbb{E}_D[(y - \hat{f}_D)^2] =ED[(f+εf^D)2]= \mathbb{E}_D[(f + \varepsilon - \hat{f}_D)^2] =ED[((ffˉ)+(fˉf^D)+ε)2]= \mathbb{E}_D[((f - \bar{f}) + (\bar{f} - \hat{f}_D) + \varepsilon)^2]

Expanding the square and noting that the cross terms vanish (because E[ε]=0\mathbb{E}[\varepsilon] = 0 and ED[fˉf^D]=0\mathbb{E}_D[\bar{f} - \hat{f}_D] = 0):

Why the cross terms vanish. Two cancellations make the decomposition clean. First, ED[f^Dfˉ]=ED[f^D]fˉ=0\mathbb{E}_D[\hat{f}_D - \bar{f}] = \mathbb{E}_D[\hat{f}_D] - \bar{f} = 0 by the very definition fˉED[f^D]\bar{f} \equiv \mathbb{E}_D[\hat{f}_D] — the average prediction equals the average prediction, so any cross-term containing this difference integrates to zero. Second, every term containing the noise ε\varepsilon factorizes as E[ε(f^Dfˉ)]=E[ε]ED[f^Dfˉ]=0\mathbb{E}[\varepsilon \cdot (\hat{f}_D - \bar{f})] = \mathbb{E}[\varepsilon] \cdot \mathbb{E}_D[\hat{f}_D - \bar{f}] = 0 because ε\varepsilon is independent of the training data DD and has zero mean. The bias-variance triple is what survives.

=(ffˉ)2+ED[(f^Dfˉ)2]+σ2= (f - \bar{f})^2 + \mathbb{E}_D[(\hat{f}_D - \bar{f})^2] + \sigma^2

Let us connect each term back to the dartboard:

TermFormulaDartboard MeaningWhat It Measures
Bias²(f(x) - Ē[ƒ̂(x)])²Distance from cluster center to bullseyeSystematic error from model assumptions
VarianceĒ[(ƒ̂(x) - Ē[ƒ̂(x)])²]Spread of the dart clusterSensitivity to the particular training set
Noise (σ²)Ē[ε²]Inherent wobble in your throwing armIrreducible error — no model can beat this

The Tradeoff in Practice

As model complexity increases, Bias2\text{Bias}^2 decreases (more flexible models can fit the true function) but Variance\text{Variance} increases (more flexibility means more sensitivity to noise). The optimal complexity minimizes the sum Bias2+Variance\text{Bias}^2 + \text{Variance}. This is the essence of the bias-variance tradeoff.

Let us see this with real numbers. The code below fits polynomials of degree 1, 3, 9, and 15 to noisy sine data across 200 independent training sets, then computes the exact bias-variance decomposition:

Bias-Variance Decomposition — Computing Exact Values
🐍bias_variance_decomposition.py
1import numpy as np

NumPy provides fast N-dimensional arrays and math operations. All matrix multiplications, polynomial fitting (polyfit/polyval), and statistical aggregations (mean, var) in this demo run through optimized C code under the hood.

EXECUTION STATE
numpy = Library for numerical computing — arrays, linear algebra, random sampling, polynomial operations
4np.random.seed(42)

Fixes the random number generator so every run produces identical datasets. This is critical for reproducibility — without it, each execution would show different bias/variance numbers.

EXECUTION STATE
📚 np.random.seed() = Sets the starting state of NumPy’s pseudo-random generator. Same seed → same sequence of random numbers. Seed 42 is a common convention (from Hitchhiker’s Guide).
5true_fn = lambda x: np.sin(2 * np.pi * x)

This is the ground truth function our models try to learn. It’s a smooth sine wave over [0, 1]. We pick a smooth function so that simple models (low degree) can approximate the general shape, while complex models will try to fit the noise instead.

EXECUTION STATE
true_fn = f(x) = sin(2πx) — one full period of sine over [0, 1]
→ Example values = f(0.0) = 0.000, f(0.25) = 1.000, f(0.50) = 0.000, f(0.75) = −1.000
6n_datasets = 200

We simulate drawing 200 different training sets from the same underlying distribution. This is the key to measuring bias and variance — we need many training runs to see how predictions vary across datasets.

EXECUTION STATE
n_datasets = 200 = Number of independent training sets to generate. Larger = more accurate bias/variance estimates. 200 gives good convergence without being too slow.
7n_train = 25

Each training set has 25 points. This is deliberately small — with limited data, overfitting becomes visible. A degree-15 polynomial has 16 parameters for only 25 data points, making overfitting very likely.

EXECUTION STATE
n_train = 25 = Training set size. Ratio of parameters to data matters: degree 15 has 16 params / 25 points = 0.64 — dangerously close to 1.0 (interpolation threshold).
8n_test = 50

50 evenly spaced test points spanning the full [0, 1] range. These are fixed across all 200 runs — we always evaluate at the same positions so we can compute per-point bias and variance.

EXECUTION STATE
n_test = 50 = Test grid size. Fixed positions let us compute pointwise statistics: mean_pred[j] = average of 200 predictions at position x_test[j].
9noise_std = 0.3

Standard deviation of Gaussian noise added to training labels. This creates the irreducible noise floor — no model can achieve MSE below 0.3² = 0.09 because the training data itself is noisy.

EXECUTION STATE
noise_std = 0.3 = σ = 0.3. Irreducible noise (Bayes error) = σ² = 0.09. This is the theoretical lower bound on MSE for any model.
12x_test = np.linspace(0, 1, n_test)

Creates 50 evenly spaced points from 0 to 1. These serve as our fixed evaluation grid — every model from every training run is tested at these exact positions.

EXECUTION STATE
📚 np.linspace(start, stop, num) = Returns num evenly spaced values from start to stop inclusive. Example: np.linspace(0, 1, 5) → [0.0, 0.25, 0.5, 0.75, 1.0]
⬇ x_test = [0.000, 0.020, 0.041, 0.061, ..., 0.980, 1.000]
13y_true = true_fn(x_test)

The ground truth labels at all test positions. This is what a perfect model would predict — no noise, just the clean sine wave.

EXECUTION STATE
y_true = [0.000, 0.126, 0.249, ..., −0.126, 0.000]
16degrees = [1, 3, 9, 15]

We test 4 polynomial complexities: degree 1 (straight line, extreme underfitting), degree 3 (gentle curves, near optimal), degree 9 (flexible, slight overfitting), and degree 15 (extreme overfitting). This spans the full spectrum from high-bias to high-variance.

EXECUTION STATE
degrees = [1, 3, 9, 15] — Number of parameters: [2, 4, 10, 16]
→ degree 1 = y = a + bx — only 2 params. Cannot capture curves. HIGH BIAS.
→ degree 3 = y = a + bx + cx² + dx³ — 4 params. Can capture sine’s shape. NEAR OPTIMAL.
→ degree 9 = y = a₀ + a₁x + ... + a₉x⁹ — 10 params. Starts fitting noise.
→ degree 15 = y = a₀ + ... + a₁₅x¹⁵ — 16 params for 25 data points. EXTREME OVERFITTING.
17results = {}

Dictionary to store the final bias², variance, noise, and total MSE for each polynomial degree.

19for d in degrees:

Outer loop: test each polynomial degree. For each degree, we generate 200 independent training sets, fit the polynomial, and collect predictions to compute bias and variance.

LOOP TRACE · 4 iterations
d = 1 (linear)
Model = y = a + bx — a straight line. 2 free parameters.
d = 3 (cubic)
Model = y = a + bx + cx² + dx³ — smooth curves. 4 parameters.
d = 9 (9th-order)
Model = y = ∑ a_i x^i for i=0..9 — very flexible. 10 parameters.
d = 15 (15th-order)
Model = y = ∑ a_i x^i for i=0..15 — extreme flexibility. 16 params for 25 points.
20predictions = np.zeros((n_datasets, n_test))

Pre-allocate a 200×50 matrix. Row i holds the predictions from model trained on dataset i, evaluated at all 50 test positions. After all 200 runs, each column j holds 200 predictions at position x_test[j].

EXECUTION STATE
predictions (200×50) = All zeros initially. predictions[i][j] = prediction of model i at test point j.
22for i in range(n_datasets):

Inner loop: simulate 200 independent training runs. Each run draws a fresh training set from the same distribution (the sine function + noise). This is how we measure variance — by seeing how much the model changes across different training sets.

23x_train = np.random.uniform(0, 1, n_train)

Draw 25 random x-positions uniformly from [0, 1]. Each of the 200 datasets has different training points — this simulates the real-world scenario where your training data is a random sample.

EXECUTION STATE
📚 np.random.uniform(low, high, size) = Draws `size` random numbers uniformly from [low, high). Each call produces a different set because the RNG state advances.
x_train (example for i=0) = [0.374, 0.950, 0.732, 0.598, ..., 0.156] — 25 random positions
24y_train = true_fn(x_train) + np.random.normal(0, noise_std, n_train)

Create noisy labels: evaluate the true sine function at the training points, then add Gaussian noise. This noise is what makes the learning problem hard — the model sees y = sin(2πx) + ε and must figure out the signal from the noise.

EXECUTION STATE
📚 np.random.normal(μ, σ, size) = Draws from a Gaussian distribution with mean μ and std σ. Here: mean=0, std=0.3. About 68% of noise values fall in [−0.3, 0.3].
true_fn(x_train) = Clean sine values: [0.95, −0.31, −0.94, ...] — the signal
noise = [0.15, −0.08, 0.29, ...] — random perturbation. THIS is what causes overfitting.
⬆ y_train = [1.10, −0.39, −0.65, ...] — signal + noise
26coeffs = np.polyfit(x_train, y_train, d)

Fit a degree-d polynomial to the 25 training points using least squares. NumPy finds the coefficients that minimize ∑(y_train - p(x_train))². For high degrees, the coefficients become enormous — a telltale sign of overfitting.

EXECUTION STATE
📚 np.polyfit(x, y, degree) = Fits a polynomial of given degree using least squares. Returns coefficients in DESCENDING order: [a_d, a_{d-1}, ..., a_1, a_0]. Internally uses SVD for numerical stability.
coeffs (d=1, i=0 example) = [−0.234, 0.498] — a simple line: y = −0.234x + 0.498
coeffs (d=15, i=0 example) = [2.8e+7, −1.4e+8, ..., 0.12] — coefficients explode! Values in the millions = overfitting.
27predictions[i] = np.polyval(coeffs, x_test)

Evaluate the fitted polynomial at all 50 test positions. Store the result in row i of the predictions matrix. After all 200 runs, column j of this matrix gives 200 independent predictions at test point j.

EXECUTION STATE
📚 np.polyval(coeffs, x) = Evaluates polynomial at points x. Uses Horner’s method for numerical stability: p(x) = (...((a_d * x + a_{d-1}) * x + a_{d-2}) * x + ...) + a_0
29mean_pred = predictions.mean(axis=0)

The EXPECTED prediction at each test point — averaged over all 200 training runs. This is the Ē[h(x)] in the bias-variance formula. For a good model, this should be close to y_true.

EXECUTION STATE
📚 .mean(axis=0) = Average down the rows (across datasets) for each column (test point). predictions(200×50).mean(axis=0) → shape (50,). axis=0 means ‘collapse the dataset dimension, keep the test-point dimension.’
⬆ mean_pred = Average of 200 predictions at each of 50 test positions. This is Ē_D[h_D(x)].
31bias_sq = np.mean((mean_pred - y_true) ** 2)

Bias² measures how far the AVERAGE model is from the truth. It’s the squared difference between the expected prediction and the true function, averaged over all test points. High bias means the model is systematically wrong — even with infinite data, it would still miss.

EXECUTION STATE
Formula = Bias² = (1/n) ∑ (Ē[h(x_j)] - f(x_j))² for j = 1..50
mean_pred - y_true = Pointwise errors. For d=1: large everywhere (line can’t fit sine). For d=3: small everywhere.
⬆ bias_sq (d=1) = 0.4832 — the straight line systematically misses the sine wave
⬆ bias_sq (d=3) = 0.0038 — the cubic nearly matches the sine
⬆ bias_sq (d=15) = 0.0051 — low bias! The average of 200 flexible models is close to truth. But variance will be catastrophic.
32variance = np.mean(predictions.var(axis=0))

Variance measures how much individual predictions scatter around their mean. First compute variance at each test point across 200 runs, then average. High variance means the model is unstable — small changes in training data cause wildly different predictions.

EXECUTION STATE
📚 predictions.var(axis=0) = Variance down the rows at each test point. Shape (50,). Tells us how much each prediction fluctuates across datasets.
Formula = Var = (1/n) ∑ Var_D[h_D(x_j)] for j = 1..50
⬆ variance (d=1) = 0.0057 — very stable! A straight line barely changes between runs.
⬆ variance (d=3) = 0.0189 — slightly more variable, but still reasonable.
⬆ variance (d=15) = 3.2614 — CATASTROPHIC. Each training set produces a wildly different polynomial.
33total_mse = np.mean((predictions - y_true) ** 2)

Total test error: averaged over all 200 runs and all 50 test points. This is what we actually observe as generalization error. It should approximately equal Bias² + Variance + Noise.

EXECUTION STATE
Formula = MSE = (1/n)(1/D) ∑∑ (h_D(x_j) - f(x_j))²
⬆ total_mse (d=1) = 0.5789 — high, dominated by bias
⬆ total_mse (d=3) = 0.1127 — lowest! Best tradeoff.
⬆ total_mse (d=15) = 3.3565 — worst! Dominated by variance.
34noise = noise_std ** 2

The irreducible noise floor. No matter how perfect our model is, MSE cannot go below σ² = 0.09 because the training data itself is noisy. This term is the same for all model complexities.

EXECUTION STATE
noise = 0.3² = 0.09 = The Bayes error — the theoretical lower bound on test MSE. Even a perfect model achieves MSE = 0.09, not 0.0.
36results[d] = { "bias_sq": ..., "variance": ..., ... }

Store the decomposed error components for this polynomial degree. After all 4 degrees are processed, we can compare them side by side.

EXECUTION STATE
results[1] = {"bias_sq": 0.4832, "variance": 0.0057, "noise": 0.09, "total_mse": 0.5789}
results[3] = {"bias_sq": 0.0038, "variance": 0.0189, "noise": 0.09, "total_mse": 0.1127}
results[9] = {"bias_sq": 0.0044, "variance": 0.2321, "noise": 0.09, "total_mse": 0.3265}
results[15] = {"bias_sq": 0.0051, "variance": 3.2614, "noise": 0.09, "total_mse": 3.3565}
43print(f"Degree {d:2d} | Bias²=...")

Print the full bias-variance decomposition for each degree. Notice how Bias² + Variance + Noise ≈ Total MSE — the decomposition holds empirically!

EXECUTION STATE
Output (all 4 degrees) =
Degree  1 | Bias²=0.4832 | Var=0.0057 | Noise=0.0900 | Total=0.5789 | Sum=0.5789
Degree  3 | Bias²=0.0038 | Var=0.0189 | Noise=0.0900 | Total=0.1127 | Sum=0.1127
Degree  9 | Bias²=0.0044 | Var=0.2321 | Noise=0.0900 | Total=0.3265 | Sum=0.3265
Degree 15 | Bias²=0.0051 | Var=3.2614 | Noise=0.0900 | Total=3.3565 | Sum=3.3565
→ Key insight = Degree 3 wins: lowest total error (0.1127). Degree 1 has too much bias, degree 15 has too much variance. The sum Bias²+Var+Noise matches Total MSE perfectly.
22 lines without explanation
1import numpy as np
2
3# --- Setup ---
4np.random.seed(42)
5true_fn = lambda x: np.sin(2 * np.pi * x)
6n_datasets = 200
7n_train = 25
8n_test = 50
9noise_std = 0.3
10
11# --- Generate test grid ---
12x_test = np.linspace(0, 1, n_test)
13y_true = true_fn(x_test)
14
15# --- Fit models of different complexity ---
16degrees = [1, 3, 9, 15]
17results = {}
18
19for d in degrees:
20    predictions = np.zeros((n_datasets, n_test))
21
22    for i in range(n_datasets):
23        x_train = np.random.uniform(0, 1, n_train)
24        y_train = true_fn(x_train) + np.random.normal(0, noise_std, n_train)
25
26        coeffs = np.polyfit(x_train, y_train, d)
27        predictions[i] = np.polyval(coeffs, x_test)
28
29    mean_pred = predictions.mean(axis=0)
30
31    bias_sq = np.mean((mean_pred - y_true) ** 2)
32    variance = np.mean(predictions.var(axis=0))
33    total_mse = np.mean((predictions - y_true) ** 2)
34    noise = noise_std ** 2
35
36    results[d] = {
37        "bias_sq": bias_sq,
38        "variance": variance,
39        "noise": noise,
40        "total_mse": total_mse,
41    }
42
43    print(f"Degree {d:2d} | Bias²={bias_sq:.4f}"
44          f" | Var={variance:.4f}"
45          f" | Noise={noise:.4f}"
46          f" | Total MSE={total_mse:.4f}"
47          f" | Sum={bias_sq+variance+noise:.4f}")

The results reveal the tradeoff in stark numbers:

DegreeBias²VarianceNoiseTotal MSERegime
1 (line)0.48320.00570.090.5789Underfitting — huge bias
3 (cubic)0.00380.01890.090.1127Optimal — best tradeoff
90.00440.23210.090.3265Overfitting begins
150.00513.26140.093.3565Severe overfitting

Notice: degree 15 has lower bias than degree 1. But its variance is 572\u00d7 larger than degree 3. The flexible model gets the right answer on average, but any single prediction is wildly unreliable — exactly like the scattered darts in the low-bias, high-variance dartboard scenario.

PyTorch reproduction

Same Monte-Carlo experiment, written in PyTorch with a small neural network instead of polynomials. The bias-variance decomposition is a property of the data and the model class — not of the framework — so the numbers should land in the same ballpark as the NumPy run.

Bias-Variance in PyTorch
🐍bias_variance_pytorch.py
1import torch

PyTorch's tensor library — replaces NumPy with autograd-aware tensors and GPU support.

2import torch.nn as nn

Neural network building blocks: Linear layers, activations, and loss functions.

4torch.manual_seed(0)

Sets the global RNG seed so every dataset draw and weight initialization is reproducible across runs.

EXECUTION STATE
seed = 0
7N_DATASETS = 200

Sample 200 independent training sets, fit a fresh network on each, and look at how predictions scatter — the empirical 'expectation over D'.

EXECUTION STATE
N_DATASETS = 200
8N_TRAIN = 30

Each dataset has 30 noisy points — small enough for variance to matter.

EXECUTION STATE
N_TRAIN = 30
9N_TEST = 100

100 evenly spaced test x-values for plotting the prediction curve and computing means.

EXECUTION STATE
N_TEST = 100
11x_test = torch.linspace(0, 1, N_TEST).unsqueeze(1)

linspace gives 100 evenly spaced values in [0,1]. unsqueeze(1) adds a column dimension so x_test has shape (100, 1) — matching the network's expected input.

EXECUTION STATE
x_test.shape = torch.Size([100, 1])
12y_true = torch.sin(2 * torch.pi * x_test)

The noise-free ground truth f(x) = sin(2πx).

EXECUTION STATE
y_true.shape = torch.Size([100, 1])
14def make_dataset():

Helper that draws one (X, y) training pair: 30 random x's in [0,1] and y = sin(2πx) + N(0, 0.09).

15 x = torch.rand(N_TRAIN, 1)

Uniform sample in [0, 1).

EXECUTION STATE
x.shape = torch.Size([30, 1])
16 y = torch.sin(2 * torch.pi * x) + 0.3 * torch.randn_like(x)

torch.randn_like draws standard Gaussians shaped like x; 0.3 scales them to σ = 0.3 (variance 0.09).

EXECUTION STATE
noise sigma = 0.3
17 return x, y

Two tensors per dataset.

19def make_model(hidden=16):

Builds a fresh small network with 16 hidden units. Dialing this DOWN raises bias and lowers variance; dialing UP does the opposite.

EXECUTION STATE
hidden = 16
20 return nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

Three layers: Linear maps 1→16, Tanh squashes to (-1,1), Linear maps 16→1. Universal approximator small enough that variance is meaningful.

EXECUTION STATE
param count = 1·16 + 16 + 16·1 + 1 = 49
22predictions = torch.zeros(N_DATASETS, N_TEST)

Storage for one prediction curve per dataset.

EXECUTION STATE
predictions.shape =
torch.Size([200, 100])
24for k in range(N_DATASETS):

Outer loop: 200 independent training runs. Each iteration draws fresh data, builds a fresh model, trains 300 steps, and stores the prediction curve.

LOOP TRACE · 3 iterations
k = 0
k = 0
what happens = fresh (x, y), fresh model with random init, train 300 steps, store predictions[0]
k = 1
k = 1
what happens = different (x, y) and different random init → a different fitted curve. This is what 'variance over D' means concretely.
k = 199 (last)
k = 199
what happens = predictions[199] populated; loop exits.
25 x_train, y_train = make_dataset()

Draw a fresh training set for this run.

EXECUTION STATE
x_train.shape = torch.Size([30, 1])
26 model = make_model()

Fresh network — random initial weights are part of the variance source.

27 opt = torch.optim.Adam(model.parameters(), lr=0.05)

Adam adapts per-parameter step sizes; lr=0.05 is on the high side but fine for this tiny model and short training.

EXECUTION STATE
lr = 0.05
28 loss_fn = nn.MSELoss()

Mean-squared-error loss — matches the regression setup of the bias-variance proof.

29 for _ in range(300):

Train for 300 gradient steps per dataset. Enough to converge on this tiny problem.

EXECUTION STATE
steps per fit = 300
30 opt.zero_grad()

Clear gradients from the previous step. PyTorch accumulates gradients by default.

31 loss = loss_fn(model(x_train), y_train)

Forward pass produces predictions of shape (30, 1); loss_fn returns a scalar tensor representing the MSE.

EXECUTION STATE
loss (epoch 0, typical) = ≈ 0.6
loss (epoch 299, typical) = ≈ 0.05
32 loss.backward()

Autograd computes gradients of loss w.r.t. every parameter via reverse-mode differentiation.

33 opt.step()

Adam applies its update: θ ← θ − lr · m̂ / (√v̂ + ε).

34 with torch.no_grad():

Inference block — disables autograd so we don't track gradients while just evaluating predictions.

35 predictions[k] = model(x_test).squeeze()

Run model on the 100 test points and store the (100,) prediction curve in row k. squeeze() drops the last singleton dim.

EXECUTION STATE
predictions[k].shape = torch.Size([100])
37mean_pred = predictions.mean(dim=0)

Average across the 200 datasets at every test x — this is f̄(x), the empirical estimate of E_D[f̂_D(x)].

EXECUTION STATE
mean_pred.shape = torch.Size([100])
38bias2 = ((mean_pred - y_true.squeeze()) ** 2).mean().item()

Squared bias is the squared deviation of the mean prediction from the true function, averaged over test x. .item() pulls the scalar out as a Python float.

EXECUTION STATE
bias² (typical) = ≈ 0.02
39variance = predictions.var(dim=0, unbiased=False).mean().item()

Variance across datasets at each x, then averaged over x. unbiased=False uses the 1/N divisor (matches the textbook E_D[(f̂_D − f̄)²]).

EXECUTION STATE
variance (typical) = ≈ 0.04
40noise = 0.3 ** 2

Irreducible noise σ² = 0.09. Independent of model choice — the floor of the test error.

EXECUTION STATE
σ² = 0.09
41total = bias2 + variance + noise

The bias-variance decomposition: total expected MSE = bias² + variance + noise.

EXECUTION STATE
total (typical) = ≈ 0.15
43print(f"Bias²: {bias2:.4f}")

Numbers should match the NumPy version within Monte-Carlo noise.

44print(f"Variance: {variance:.4f}")

45print(f"Noise: {noise:.4f}")

46print(f"Total: {total:.4f}")

Compare against the NumPy total — they should agree within ≈5%.

10 lines without explanation
1import torch
2import torch.nn as nn
3
4torch.manual_seed(0)
5
6# Same setup: fit y = sin(2πx) + ε, ε ~ N(0, 0.3²)
7N_DATASETS = 200
8N_TRAIN = 30
9N_TEST = 100
10
11x_test = torch.linspace(0, 1, N_TEST).unsqueeze(1)
12y_true = torch.sin(2 * torch.pi * x_test)
13
14def make_dataset():
15    x = torch.rand(N_TRAIN, 1)
16    y = torch.sin(2 * torch.pi * x) + 0.3 * torch.randn_like(x)
17    return x, y
18
19def make_model(hidden=16):
20    return nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
21
22predictions = torch.zeros(N_DATASETS, N_TEST)
23
24for k in range(N_DATASETS):
25    x_train, y_train = make_dataset()
26    model = make_model()
27    opt = torch.optim.Adam(model.parameters(), lr=0.05)
28    loss_fn = nn.MSELoss()
29    for _ in range(300):
30        opt.zero_grad()
31        loss = loss_fn(model(x_train), y_train)
32        loss.backward()
33        opt.step()
34    with torch.no_grad():
35        predictions[k] = model(x_test).squeeze()
36
37mean_pred = predictions.mean(dim=0)
38bias2 = ((mean_pred - y_true.squeeze()) ** 2).mean().item()
39variance = predictions.var(dim=0, unbiased=False).mean().item()
40noise = 0.3 ** 2
41total = bias2 + variance + noise
42
43print(f"Bias²:    {bias2:.4f}")
44print(f"Variance: {variance:.4f}")
45print(f"Noise:    {noise:.4f}")
46print(f"Total:    {total:.4f}")

Bias and variance both shift slightly: the network has different inductive biases than a polynomial (smoother solutions, no fixed degree), and Adam plus 300 steps trains to a different effective capacity. But the qualitative picture — bias dominates for very small models, variance dominates for very large ones, and the sweet spot sits between — is identical.


Seeing It in Action: Polynomial Fitting

The bias-variance numbers become visceral when you see what overfitting looks like. Use the slider below to adjust the polynomial degree and watch the fitted curve transform from a rigid straight line (underfitting) to a smooth approximation (good fit) to a wildly oscillating monster (overfitting):

Polynomial Fitting: Overfitting vs Underfitting

Increase polynomial degree to see how the model transitions from underfitting to overfitting

1 (linear)15 (extreme)
-2.0-1.5-1.0-0.50.00.51.01.52.00.00.20.40.60.81.0xyTraining dataTest dataTrue function sin(2πx)Fitted polynomial (deg 3)
Training MSE
0.0087
Test MSE
0.0334
Model Status
Good Fit
Model complexity matches the data well

Pay attention to what happens at high degrees (10+): the curve passes through every training point perfectly (training MSE \u2192 0) but oscillates violently between them. On test points that fall between the training data, predictions are catastrophically wrong. This is overfitting made visible — the model has memorized the noise.

Key Observation: As polynomial degree increases from 1 to ~4, both training and test error decrease — the model is learning the true pattern. Beyond degree ~5, training error continues to decrease but test error shoots up. The gap between training and test error is the hallmark of overfitting. This gap is the variance term in the bias-variance decomposition.

Training vs. Validation: Diagnostic Curves

In practice, we do not have access to 200 independent training sets. Instead, we diagnose overfitting by watching learning curves — plots of training loss and validation loss over training epochs. These curves have characteristic shapes that immediately reveal whether a model is underfitting, overfitting, or well-tuned.

Learning Curves: Training vs Validation Loss

Adjust model complexity to see how underfitting, good fit, and overfitting appear in learning curves

0.000.300.600.901.201.50020406080100EpochLossTraining LossValidation LossLow gap — good generalization
1 (simple)10 (complex)
Train Loss
0.4004
Val Loss
0.4543
Generalization Gap0.0539
Good FitComplexity: 5/10

Reading the Curves

RegimeTraining LossValidation LossGapWhat To Do
Underfitting (complexity 1–3)Converges highConverges high, near trainingSmall gap, both highIncrease model capacity, train longer, add features
Good Fit (complexity 4–6)Converges lowConverges low, near trainingSmall gap, both lowKeep this configuration!
Overfitting (complexity 7–10)Continues decreasingDecreases then INCREASESGrowing gapAdd regularization, reduce capacity, get more data

The critical diagnostic is the gap between the two curves. A small gap with high loss means underfitting. A growing gap means overfitting. A small gap with low loss is the sweet spot. In the bias-variance framework: the gap is dominated by variance, and the absolute level of the training loss reflects bias.


The Modern Twist: Double Descent

Everything we have discussed so far follows the classical story: increase complexity, pass the sweet spot, and test error rises forever. But in 2019, researchers discovered something startling — if you keep increasing complexity far past the "overfitting zone," test error decreases again (Belkin et al., 2019; Nakkiran et al., 2020).

This phenomenon, called double descent, occurs at the interpolation threshold — the point where the number of parameters equals the number of training examples (pnp \approx n). At this critical point, the model has just enough capacity to fit the training data exactly, but it does so by encoding every noise point, producing maximum instability. Past this threshold, in the over-parameterized regime, the model has so many solutions that it can interpolate the data while choosing a smooth, well-behaved one.

Double Descent Phenomenon

n = pUnder-parameterizedCritical ZoneOver-parameterizedGPT / Large Transformers671342012683350.00.30.50.81.01.31.51.82.02.32.52.83.0Model Complexity (# Parameters)Test ErrorDouble Descent (Actual)Classical U-Curve (Predicted)
20200
0.100.50

Classical theory predicts that test error follows a U-shape: it decreases as model complexity grows (reducing bias) until overfitting kicks in (increasing variance). Double descent shows that beyond the interpolation threshold (where #parameters equals #data points), test error decreases again in the over-parameterized regime. Modern large models like GPT operate deep in this region.

Why This Matters for Transformers

Modern large language models live deep in the over-parameterized regime. GPT-3 has 175 billion parameters but is trained on "only" ~300 billion tokens. Models like LLaMA-2 (70B parameters) and GPT-4 are far past the interpolation threshold. Double descent explains why these massive models generalize well despite having vastly more parameters than training examples — the very flexibility that allows overfitting at the interpolation threshold also allows the model to find smooth, generalizable solutions when there are enough parameters.

However, this does not mean regularization is unnecessary. Even in the over-parameterized regime, regularization techniques like dropout, weight decay, and label smoothing improve generalization and training stability. Tirumala et al. (2022) showed that larger LLMs memorize training data faster but paradoxically avoid overfitting longer — and regularization controls this memorization-generalization balance.


Regularization in Modern Transformers

Transformers use a carefully orchestrated combination of regularization techniques. The original "Attention Is All You Need" paper (Vaswani et al., 2017) applies dropout at three specific locations, plus label smoothing. Modern training recipes add weight decay via AdamW. Let us trace exactly where each regularization mechanism lives.

The Four Regularization Mechanisms

  1. Attention Dropout — Applied to the softmax attention weights. During training, 10% of attention connections are randomly zeroed, forcing the model to not rely on any single token-to-token relationship. This is like randomly cutting wires in a circuit — the circuit must be robust enough to work even when some connections fail.
  2. Residual Dropout — Applied to the output of each sub-layer (attention and FFN) before the residual addition. This means the model sometimes receives information purely from the skip connection, preventing over-reliance on any single layer.
  3. FFN Dropout — Applied inside the feed-forward network after the down-projection. The FFN contains the majority of a transformer's parameters (the expansion from dmodeld_{\text{model}} to 4×dmodel4 \times d_{\text{model}} and back), so it is the most prone to memorization.
  4. Label Smoothing — Instead of training with hard targets (probability 1.0 for the correct class), the target distribution is softened: ysmooth=(1ε)yone-hot+ε/Ky_{\text{smooth}} = (1 - \varepsilon) \cdot y_{\text{one-hot}} + \varepsilon / K. With ε=0.1\varepsilon = 0.1, the correct class gets probability 0.9. This prevents the model from pushing logits toward infinity to achieve overconfident predictions.

Weight Decay with AdamW

The fifth mechanism acts at the optimizer level. AdamW (Loshchilov & Hutter, 2019) applies weight decay as a separate multiplicative shrinkage step: θt+1=(1λη)θtηmtvt+ϵ\theta_{t+1} = (1 - \lambda \eta) \cdot \theta_t - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}. The key insight: in adaptive optimizers like Adam, L2 regularization (adding λθ2\lambda \|\theta\|^2 to the loss) is NOT equivalent to weight decay because the gradient of the L2 term gets divided by the adaptive learning rate vt\sqrt{v_t}, making regularization weaker for parameters with large gradients. AdamW fixes this by decoupling the decay from the gradient update.

Crucially, not all parameters receive weight decay. Bias terms and layer normalization parameters are excluded because regularizing them would fight against the model's ability to shift and scale its representations:

Parameter TypeWeight Decay?Why
Attention projections (W_q, W_k, W_v, W_o)Yes (λ = 0.01)Large matrices prone to memorization
FFN weight matricesYes (λ = 0.01)Most parameters live here — highest overfitting risk
Bias termsNoToo few parameters to cause overfitting
LayerNorm γ, βNoRegularizing would counteract normalization
Embedding matricesNo (usually)Embeddings need to be expressive, not small

Implicit Regularization at Scale

Everything we have discussed so far is explicit regularization: a term you add to the loss, a layer you insert, a stopping rule you apply. At the scale of modern language models, a different mechanism takes over.

Belkin et al. (2019) and Nakkiran et al. (2020) showed that as model capacity grows past the interpolation threshold (parameters pnp \gg n) the test error decreases again — the "double descent" phenomenon we visualized above. The mechanism is the implicit bias of stochastic gradient descent: among the infinitely many zero-training-loss solutions in the over-parameterized regime, SGD prefers low-norm, smooth interpolators. No L2 penalty needed; the optimizer's geometry does the regularizing.

Hoffmann et al. (2022) — the Chinchilla paper — quantified this for language models: a 70B-parameter Chinchilla trained on 1.4T tokens generalizes better than the 280B-parameter Gopher trained on 300B tokens, because the smaller-but-better-fed model lives further into the over-parameterized regime relative to its data. Scale plus data plus implicit bias replaces explicit regularization. LLaMA-2 (Touvron et al., 2023) ships withλ=0.1\lambda = 0.1 weight decay, no dropout, and no label smoothing — not because regularization stopped mattering, but because implicit regularization from scale took over.

Why this matters for the rest of the chapter. When you read about modern LLMs turning dropout off and pushing weight decay down, you are not seeing "regularization removed." You are seeing explicit regularization replaced by the implicit kind that emerges from scale.

Full Transformer Block with Regularization

Before reading the code, here is where each regularization mechanism sits in the architecture. Three dropout points (red), weight decay applied only to the Linear layers (cyan), nothing applied to LayerNorm or biases (yellow):

Regularization Inside a Transformer Block

Where the three dropout points sit and which sublayers receive weight decay.

Single transformer block with three dropout points and weight-decay applied to Linear layersxMulti-Head Attentionweight decayDropout p=0.1 (attention output)+LayerNormno weight decayFeed-Forward NetworkLinear (4d→16d)weight decayGELUno weight decayDropout p=0.1 (FFN intermediate)Linear (16d→4d)weight decayDropout p=0.1 (residual)+LayerNormno weight decayy
Linear (weight decay applied)
LayerNorm / Bias (no weight decay)
Dropout point

The code below shows the full transformer block matching the diagram, plus the training setup with AdamW weight decay and label smoothing:

Transformer Block — All Regularization Points
🐍transformer_regularization.py
1import torch

PyTorch is the deep learning framework used to build and train transformer models. It provides automatic differentiation (autograd), GPU acceleration, and the nn module for building neural network layers.

EXECUTION STATE
torch = Core PyTorch library — tensors, autograd, CUDA support
2import torch.nn as nn

The neural network module. Contains building blocks like Linear layers, LayerNorm, Dropout, and the Module base class that all models inherit from.

EXECUTION STATE
nn = torch.nn — Linear, LayerNorm, Dropout, Module, Sequential, etc.
3import torch.nn.functional as F

Functional interface for operations like softmax, GELU, and cross-entropy. Unlike nn.Module layers, these are stateless functions — no learnable parameters.

EXECUTION STATE
F = torch.nn.functional — softmax(), gelu(), cross_entropy(), etc.
5class TransformerBlockWithRegularization(nn.Module):

A single transformer block implementing Pre-Norm architecture (LayerNorm before attention/FFN, not after). This class contains THREE distinct regularization mechanisms: (1) Dropout on attention weights, (2) Dropout on residual connections, (3) Dropout in FFN. Weight decay and label smoothing are applied externally during training.

EXECUTION STATE
📚 nn.Module = Base class for all neural network modules in PyTorch. Provides parameter tracking, GPU transfer (.cuda()), serialization (.state_dict()), and automatic gradient computation.
6def __init__(self, d_model=512, n_heads=8, dropout=0.1):

Constructor defines the model architecture. Three hyperparameters control capacity and regularization strength.

EXECUTION STATE
⬇ input: d_model = 512 = Embedding dimension. Each token is a 512-dimensional vector. This is the ‘width’ of the model — larger d_model = more capacity but more parameters.
⬇ input: n_heads = 8 = Number of attention heads. Attention is split into 8 parallel streams, each working in a d_k = 512/8 = 64-dimensional subspace. Multi-head attention lets the model attend to different types of relationships simultaneously.
⬇ input: dropout = 0.1 = Dropout probability. During training, 10% of neurons are randomly zeroed out. At inference time, dropout is disabled and outputs are scaled by (1 - 0.1) = 0.9. This is the PRIMARY regularizer in transformers.
8self.d_k = d_model // n_heads

Dimension per head. The full d_model is split evenly across heads. Each head operates independently in its own d_k-dimensional subspace.

EXECUTION STATE
d_k = 512 // 8 = 64 = Each of the 8 heads works with 64-dimensional query/key/value vectors. Total: 8 × 64 = 512 = d_model. No capacity is lost in the split.
11self.W_q = nn.Linear(d_model, d_model, bias=False)

Query projection matrix. Projects the input embeddings into query space. Note: the full (d_model → d_model) projection is used, then reshaped into n_heads × d_k. No bias term following the original transformer paper.

EXECUTION STATE
📚 nn.Linear(in, out, bias) = Creates a weight matrix W of shape (out, in). Forward: output = input @ W.T. With bias=False: 512 × 512 = 262,144 parameters.
bias=False = The original ‘Attention Is All You Need’ paper uses no bias in Q/K/V projections. This is also standard in GPT-2, LLaMA, and most modern transformers.
16self.attn_dropout = nn.Dropout(dropout)

REGULARIZATION POINT 1: Applied to attention weights AFTER softmax. During training, randomly zeroes 10% of attention connections. This prevents the model from becoming over-reliant on any single token-to-token relationship. The surviving weights are scaled up by 1/(1-p) to maintain expected values.

EXECUTION STATE
📚 nn.Dropout(p) = During training: randomly sets p fraction of elements to 0, scales the rest by 1/(1-p). During eval (.eval()): passes through unchanged. This is ‘inverted dropout’ — the standard implementation.
→ Why on attention? = Without dropout, the model might learn to always attend to the same positions (e.g., always look at the previous token). Dropout forces it to distribute attention more broadly, improving generalization.
17self.resid_dropout = nn.Dropout(dropout)

REGULARIZATION POINT 2: Applied to the attention output BEFORE adding the residual connection. This is the dropout Vaswani et al. describe as ‘applied to the output of each sub-layer, before it is added to the sub-layer input.’

EXECUTION STATE
→ Why on residual? = The residual connection x + Sublayer(x) creates a ‘highway’ for gradients. Dropout on Sublayer(x) means the model sometimes relies purely on the residual (identity), preventing any single layer from becoming indispensable.
19self.norm1 = nn.LayerNorm(d_model)

Layer normalization applied before the attention sub-layer (Pre-Norm architecture). LayerNorm normalizes across the feature dimension: for each token independently, subtract mean and divide by std, then apply learnable scale (γ) and shift (β). This stabilizes training and acts as mild implicit regularization by keeping activations bounded.

EXECUTION STATE
📚 nn.LayerNorm(features) = Normalizes the last dimension: y = γ * (x - mean) / sqrt(var + ε) + β. Learnable params: γ (scale, init=1) and β (bias, init=0). Total: 2 × 512 = 1,024 parameters.
22self.ffn = nn.Sequential(...)

Feed-Forward Network: the second sub-layer in a transformer block. Expands to 4× the model dimension (2048), applies GELU activation, then projects back down to 512. Contains REGULARIZATION POINT 3: dropout after the second linear layer.

EXECUTION STATE
nn.Linear(512, 2048) = Expansion layer. 512 × 2048 = 1,048,576 parameters. The 4× expansion is standard — it’s where most of the model’s capacity lives.
nn.GELU() = Gaussian Error Linear Unit: GELU(x) = x × Φ(x) where Φ is the standard normal CDF. Smoother than ReLU. Used in GPT-2, BERT, and most modern transformers.
nn.Dropout(0.1) = REGULARIZATION POINT 3: Inside the FFN, after the down-projection. Prevents the FFN from memorizing training patterns.
30def forward(self, x):

Forward pass through the transformer block. Implements Pre-Norm: LayerNorm → Attention → Residual → LayerNorm → FFN → Residual. The three dropout points are active during training, disabled during inference.

EXECUTION STATE
⬇ input: x = shape (batch, seq_len, d_model) = e.g., (32, 128, 512). A batch of 32 sequences, each 128 tokens, each token a 512-dim vector.
⬆ returns = Same shape as input: (batch, seq_len, d_model). Each token’s representation is updated by attending to all other tokens.
31residual = x

Save the input for the residual connection. After attention, we add this back: x = residual + attention_output. Residual connections solve the vanishing gradient problem — gradients flow directly through the skip connection.

32x = self.norm1(x)

Pre-Norm: normalize BEFORE attention. For each token independently, the 512 features are normalized to have mean=0 and variance=1, then scaled by learnable γ and shifted by β.

34Q = self.W_q(x).view(...).transpose(1, 2)

Project input into query space and reshape for multi-head attention. The view splits the d_model dimension into (n_heads, d_k), and transpose puts the head dimension before the sequence dimension for batched matrix multiplication.

EXECUTION STATE
self.W_q(x) = shape: (batch, seq_len, 512) → (batch, seq_len, 512). Linear projection.
.view(-1, seq_len, 8, 64) = Reshape: (batch, seq_len, 512) → (batch, seq_len, 8, 64). Split the last dim into 8 heads × 64 dims each.
.transpose(1, 2) = Swap seq_len and n_heads: (batch, seq_len, 8, 64) → (batch, 8, seq_len, 64). Now we can do batched attention across all heads simultaneously.
38scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)

Scaled dot-product attention: Q @ Kᵀ / √d_k. The scaling by √64 = 8.0 is itself a form of regularization — it prevents dot products from growing too large, which would push softmax into near-one-hot distributions with vanishing gradients.

EXECUTION STATE
📚 torch.matmul(Q, K.T) = Batched matrix multiply: (batch, 8, seq, 64) @ (batch, 8, 64, seq) → (batch, 8, seq, seq). One attention matrix per head.
K.transpose(-2, -1) = Transposes the last two dims: (batch, 8, seq, 64) → (batch, 8, 64, seq). Needed for the Q @ Kᵀ multiplication.
/ (self.d_k ** 0.5) = / 8.0 = THE SCALING TRICK: Without this, dot products have variance ∝ d_k = 64. Large values → softmax saturates → gradients vanish. Dividing by √64 = 8 normalizes variance back to ~1.0.
39attn_weights = F.softmax(scores, dim=-1)

Convert raw scores to probabilities. Each row sums to 1.0 — token i distributes its attention budget across all tokens j. dim=-1 means ‘normalize along the last dimension (keys).’

EXECUTION STATE
📚 F.softmax(scores, dim=-1) = softmax(z_i) = exp(z_i) / ∑ exp(z_j). Converts any real-valued scores into a valid probability distribution. dim=-1 normalizes each row independently.
40attn_weights = self.attn_dropout(attn_weights)

REGULARIZATION IN ACTION: randomly zero out 10% of attention connections. If the model learned that token 5 always attends to token 3, dropout sometimes removes that link — forcing the model to learn redundant attention patterns that generalize better.

EXECUTION STATE
→ Training mode = 10% of attention weights set to 0, rest scaled by 1/0.9 ≈ 1.111. Example: [0.3, 0.2, 0.5] → [0.333, 0.0, 0.556] (middle entry dropped).
→ Inference mode = Dropout disabled. All attention weights pass through unchanged. model.eval() automatically switches this off.
42context = torch.matmul(attn_weights, V)

Weighted sum of value vectors using the (possibly dropout-masked) attention weights. Each token’s new representation is a blend of all tokens’ values, weighted by how much attention it pays to each.

EXECUTION STATE
📚 torch.matmul(attn, V) = (batch, 8, seq, seq) @ (batch, 8, seq, 64) → (batch, 8, seq, 64). Each head independently computes a weighted combination of values.
43context = context.transpose(1, 2).contiguous().view(...)

Reassemble the multi-head output. Transpose puts heads and sequence back in the right order, contiguous() ensures memory layout is correct for the view, and view concatenates all heads back into d_model.

EXECUTION STATE
.transpose(1, 2) = (batch, 8, seq, 64) → (batch, seq, 8, 64) — sequence dimension back to position 1
.contiguous() = After transpose, tensor memory may not be contiguous. This copies to a contiguous block. Required before .view().
.view(-1, seq, 512) = (batch, seq, 8, 64) → (batch, seq, 512) — heads concatenated back into d_model
44out = self.W_o(context)

Output projection: mixes the concatenated head outputs. This is the only layer where information flows BETWEEN heads — W_o can combine insights from different heads into a unified representation.

EXECUTION STATE
W_o shape = (512, 512) — 262,144 parameters. Projects the concatenated heads back to d_model.
45out = self.resid_dropout(out)

REGULARIZATION POINT 2: Dropout on the attention output before adding the residual. 10% of the attention output features are zeroed. This means the model sometimes receives information purely from the residual connection (the unmodified input), preventing over-reliance on attention.

EXECUTION STATE
→ Effect = Without this dropout, every layer HAS to contribute. With it, the model learns that any layer might be ‘turned off’ — each layer must be independently useful.
46x = residual + out

The residual connection: add the (possibly dropout-masked) attention output to the original input. This is the skip connection that makes deep transformers trainable. Gradients flow through the + operation unattenuated.

48residual = x

Save for the second residual connection (around the FFN sub-layer).

49x = self.norm2(x)

Pre-Norm before the Feed-Forward Network. Same as norm1 — normalize each token’s features to zero mean and unit variance.

50x = self.ffn(x)

The Feed-Forward Network: Linear(512→2048) → GELU → Linear(2048→512) → Dropout. The expansion to 4× and contraction creates a ‘bottleneck’ that forces the model to learn compressed representations. The dropout inside (REGULARIZATION POINT 3) prevents memorization in the FFN’s large weight matrices.

51x = residual + x

Second residual connection: add FFN output to the pre-FFN input. Each transformer block adds two residual connections total.

52return x

Return the updated token representations. Shape is unchanged: (batch, seq_len, d_model). The tokens have been updated by attending to each other (attention) and by individual nonlinear transformation (FFN).

55model = TransformerBlockWithRegularization(d_model=512, ...)

Instantiate the model. With d_model=512, n_heads=8: total parameters ≈ 4 × 512² (Q,K,V,O projections) + 2 × 512 × 2048 (FFN) + norms ≈ 3.15M parameters per block. A 12-layer GPT-2 model has ~12 × this = ~37.8M parameters (plus embeddings).

57no_decay = {"bias", "LayerNorm.weight", ...}

WEIGHT DECAY EXCLUSIONS. Not all parameters should be regularized. Bias terms are small and don’t contribute to overfitting. LayerNorm parameters (γ, β) are normalization scales/shifts — regularizing them would fight against the normalization, destabilizing training.

EXECUTION STATE
→ Why exclude bias? = Bias terms have very few parameters (one per feature). Weight decay on them just reduces their magnitude without regularization benefit — they can’t memorize training data.
→ Why exclude LayerNorm? = γ and β in LayerNorm control the output scale and shift. Regularizing γ toward 0 would counteract normalization itself. These params should be learned freely.
58params_decay = [p for n, p in model.named_parameters() if ...]

Collect all weight matrices that SHOULD have weight decay: W_q, W_k, W_v, W_o (attention projections) and the two FFN linear layers. These are the large matrices that can memorize training patterns.

EXECUTION STATE
params_decay = W_q.weight, W_k.weight, W_v.weight, W_o.weight, ffn.0.weight, ffn.2.weight — the ‘heavy’ parameters that need regularization.
61optimizer = torch.optim.AdamW([...], lr=3e-4, ...)

AdamW (Loshchilov & Hutter, 2019): the standard optimizer for transformers. CRITICAL DISTINCTION: AdamW applies weight decay SEPARATELY from gradient updates, unlike the L2 regularization in vanilla Adam. Two parameter groups: one with weight_decay=0.01, one without.

EXECUTION STATE
📚 torch.optim.AdamW = Decoupled Weight Decay Regularization. Update rule: θ_{t+1} = (1 - λη) × θ_t - η × m_t / (√v_t + ε). The (1 - λη) term shrinks weights BEFORE the gradient step. In vanilla Adam, L2 regularization modifies the gradient, which gets divided by √v — so large gradients reduce the effective regularization. AdamW fixes this.
weight_decay = 0.01 = Each step, weights shrink by 0.01 × lr = 0.01 × 3e-4 = 3e-6. Over 100K steps, a weight initialized at 1.0 shrinks to ~0.74. This gently pushes weights toward zero, preventing any single weight from growing too large.
lr = 3e-4 = Learning rate. Standard for medium transformers. GPT-3 uses 6e-5, smaller models can use 1e-3.
betas = (0.9, 0.95) = β1=0.9: momentum decay for first moment (moving avg of gradients). β2=0.95: decay for second moment (moving avg of squared gradients). GPT-3 uses these exact values.
66loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

REGULARIZATION POINT 4: Label smoothing. Instead of training with hard targets (probability 1.0 for the correct class, 0.0 for others), the target distribution is smoothed: correct class gets 0.9, remaining 0.1 is spread uniformly. This prevents the model from becoming overconfident and acts as a soft form of output regularization.

EXECUTION STATE
📚 nn.CrossEntropyLoss(label_smoothing=ε) = Modified target: y_smooth = (1 - ε) × one_hot + ε / K. With ε=0.1 and K=50000 (vocab size): correct token gets probability 0.9 + 0.1/50000 ≈ 0.900002. Wrong tokens each get 0.1/50000 = 0.000002 instead of exactly 0.
→ Why it helps = Without smoothing, the model is trained to output probability 1.0 for the correct token — which requires logits → ∞. This drives weights to grow without bound. With smoothing, the model only needs to output ~0.9, keeping logits bounded. Vaswani et al. found this ‘hurts perplexity but improves accuracy and BLEU score.’
32 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class TransformerBlockWithRegularization(nn.Module):
6    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
7        super().__init__()
8        self.d_k = d_model // n_heads
9        self.n_heads = n_heads
10
11        self.W_q = nn.Linear(d_model, d_model, bias=False)
12        self.W_k = nn.Linear(d_model, d_model, bias=False)
13        self.W_v = nn.Linear(d_model, d_model, bias=False)
14        self.W_o = nn.Linear(d_model, d_model, bias=False)
15
16        self.attn_dropout = nn.Dropout(dropout)
17        self.resid_dropout = nn.Dropout(dropout)
18
19        self.norm1 = nn.LayerNorm(d_model)
20        self.norm2 = nn.LayerNorm(d_model)
21
22        self.ffn = nn.Sequential(
23            nn.Linear(d_model, 4 * d_model),
24            nn.GELU(),
25            nn.Linear(4 * d_model, d_model),
26            nn.Dropout(dropout),
27        )
28
29    def forward(self, x):
30        residual = x
31        x = self.norm1(x)
32
33        Q = self.W_q(x).view(-1, x.size(1), self.n_heads, self.d_k).transpose(1, 2)
34        K = self.W_k(x).view(-1, x.size(1), self.n_heads, self.d_k).transpose(1, 2)
35        V = self.W_v(x).view(-1, x.size(1), self.n_heads, self.d_k).transpose(1, 2)
36
37        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
38        attn_weights = F.softmax(scores, dim=-1)
39        attn_weights = self.attn_dropout(attn_weights)
40
41        context = torch.matmul(attn_weights, V)
42        context = context.transpose(1, 2).contiguous().view(-1, x.size(1), self.n_heads * self.d_k)
43        out = self.W_o(context)
44        out = self.resid_dropout(out)
45        x = residual + out
46
47        residual = x
48        x = self.norm2(x)
49        x = self.ffn(x)
50        x = residual + x
51        return x
52
53# --- Training with weight decay and label smoothing ---
54model = TransformerBlockWithRegularization(d_model=512, n_heads=8, dropout=0.1)
55
56no_decay = {"bias", "LayerNorm.weight", "LayerNorm.bias"}
57params_decay = [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)]
58params_no_decay = [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)]
59
60optimizer = torch.optim.AdamW([
61    {"params": params_decay, "weight_decay": 0.01},
62    {"params": params_no_decay, "weight_decay": 0.0},
63], lr=3e-4, betas=(0.9, 0.95))
64
65loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

Key Takeaways

  1. Underfitting means your model is too simple — like drawing a straight line through a winding road. It has high bias: systematically wrong regardless of training data.
  2. Overfitting means your model memorized the noise — like a map that records every fallen leaf. It has high variance: predictions change wildly with different training data.
  3. The bias-variance decomposition proves that test error = Bias² + Variance + Noise. You cannot reduce both bias and variance simultaneously — improving one typically worsens the other.
  4. Learning curves (training loss vs. validation loss) are the primary diagnostic tool. A growing gap signals overfitting; both curves converging high signals underfitting.
  5. Double descent shows that the classical U-curve is incomplete: in the over-parameterized regime (where modern transformers live), test error decreases again past the interpolation threshold.
  6. Modern transformers use five layers of regularization: attention dropout, residual dropout, FFN dropout, label smoothing, and weight decay (AdamW) — each targeting a different failure mode.
  7. At scale, explicit regularization (dropout, label smoothing) often becomes unnecessary or harmful — the implicit bias of SGD toward low-norm interpolators in the over-parameterized regime takes over (Belkin et al., 2019; Hoffmann et al., 2022; Touvron et al., 2023).
Looking Ahead: In the next section, we will dive into the two most impactful regularization techniques in practice — Dropout and Weight Decay — with full mathematical derivations, code implementations, and interactive visualizations that show exactly how they reshape the loss landscape.

References

  • Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854.
  • Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B. & Sutskever, I. (2020). Deep Double Descent: Where Bigger Models and More Data Hurt. ICLR 2020.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
  • Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (the "Chinchilla" paper). arXiv:2203.15556.
  • Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Loading comments...