Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain why a model that performs well on training data can fail on new data, and formalize this as the generalization gap
Implement a correct train / validation / test split and explain the role of each subset
Derive the bias-variance decomposition: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$
Implement early stopping, one of the simplest and most widely used regularization techniques
Use PyTorch's model.eval() and torch.no_grad() correctly for evaluation

Where We Left Off

In Section 2, we built a complete training loop: forward → loss → backward → clip → update, with learning rate scheduling and checkpointing. But our entire focus was on reducing the training loss. We measured success by how well the model fit the training data.

This creates a dangerous blind spot. A model with enough capacity can memorize the training data — achieving near-zero training loss — while being completely useless on new data. This is overfitting, and detecting it requires data that the model has never seen during training.

The Central Question: Training loss tells you how well the model memorizes. But what you actually care about is how well it generalizes. How do you measure something you have never seen?

Why Validation? The Overfitting Problem

Consider a student preparing for an exam. They study the practice problems until they can solve each one perfectly. On exam day, they face new problems and fail. The student memorized the answers instead of learning the concepts. Neural networks can fail in exactly the same way.

Formally, the generalization gap is the difference between the model's performance on training data and on new, unseen data:

$\text{Generalization Gap} = L_{\text{test}}(\theta) - L_{\text{train}}(\theta)$

A model that has memorized the training data has $L_{\text{train}} \approx 0$ but potentially large $L_{\text{test}}$. The gap is wide. A model that has learned the underlying patterns has both losses small and roughly equal. The gap is narrow.
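To make the gap concrete, here is a minimal sketch. It assumes a toy 1-D regression problem (a noisy sine) and a deliberately extreme memorizer, a 1-nearest-neighbour lookup table. The data generator, the `memorizer` helper, and the noise level are illustrative assumptions, not part of the training loop from Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of y = sin(x); the noise level 0.3 is an arbitrary choice."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.3, size=n)
    return x, y

def memorizer(x_train, y_train, x_query):
    """1-nearest-neighbour lookup: return the label of the closest training point."""
    nearest = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(50)
x_test, y_test = make_data(1000)

train_loss = mse(y_train, memorizer(x_train, y_train, x_train))
test_loss = mse(y_test, memorizer(x_train, y_train, x_test))
gap = test_loss - train_loss

print(f"train MSE = {train_loss:.4f}")   # exactly 0: every point is its own neighbour
print(f"test  MSE = {test_loss:.4f}")
print(f"generalization gap = {gap:.4f}")
```

The memorizer reproduces every training label, noise included, so its training loss is zero while its test loss stays well above zero. The whole test loss is generalization gap.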

Why does overfitting happen?

Every dataset is a sample from an underlying data distribution. The training data contains the true pattern plus noise (measurement error, label noise, finite sampling). A model with enough parameters can fit both the pattern and the noise. On new data, the noise is different, so the noise-fitting part hurts rather than helps.

The risk of overfitting increases when:

The model is too complex for the data (e.g., many more parameters than data points)
Training data is too small (easier to memorize)
Training runs too long (model eventually memorizes after learning the pattern)
No regularization is used (nothing prevents memorization)

The Three-Way Split

The solution is to split your data into three disjoint subsets, each with a specific role:

Subset	Typical Size	Purpose	Used During
Training set	60–80%	Train the model (compute gradients, update weights)	Every epoch
Validation set	10–20%	Monitor overfitting, tune hyperparameters, select best model	After each epoch (no gradients)
Test set	10–20%	Final, unbiased performance estimate	Once, at the very end

The mathematical guarantee: if the test set is drawn i.i.d. from the same distribution as the training data, and is never used for any training decision, then the test loss is an unbiased estimator of the true generalization error:

$\mathbb{E}[L_{\text{test}}] = \mathbb{E}_{(x,y) \sim P_{\text{data}}}[\ell(f_\theta(x), y)]$

The moment you use test data to make any decision (choosing hyperparameters, selecting which epoch to stop at, comparing model architectures), the test loss becomes a biased, optimistic estimate. This is why the validation set exists: it absorbs all the selection bias so the test set stays clean.

The Cardinal Rule: The test set must never influence any decision during training. Not the learning rate, not the architecture, not when to stop, not which model to select. Violating this rule is called data leakage and makes your reported performance unreliable.
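The optimism that comes from reusing the test set can be simulated directly. The sketch below is an illustration under stated assumptions: labels are fair coin flips, and each of the 200 "models" is a set of random guesses standing in for hyperparameter settings with no real skill. Picking the best one on the test set still produces an impressive-looking number.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_test, n_fresh = 200, 100, 100_000

# Labels are fair coin flips, so no model can truly beat 50% accuracy.
y_test = rng.integers(0, 2, size=n_test)
y_fresh = rng.integers(0, 2, size=n_fresh)

# Each "model" is just a set of random guesses (a stand-in for
# 200 hyperparameter settings that all have zero real skill).
preds_test = rng.integers(0, 2, size=(n_models, n_test))
preds_fresh = rng.integers(0, 2, size=(n_models, n_fresh))

acc_test = (preds_test == y_test).mean(axis=1)
best = int(acc_test.argmax())                 # selection made ON the test set

reported = float(acc_test[best])              # optimistic: max of 200 noisy estimates
honest = float((preds_fresh[best] == y_fresh).mean())

print(f"accuracy reported on the reused test set: {reported:.3f}")
print(f"accuracy of the same model on fresh data: {honest:.3f}")
```

The reported number is the maximum of 200 noisy estimates, so it sits well above 50%, while the same model on fresh data falls back to chance. A validation set absorbs exactly this selection effect.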

Common Split Ratios

Dataset Size	Recommended Split	Reasoning
< 1,000	70 / 15 / 15	Small data — need enough train samples, but val/test must be meaningful
1,000–100,000	80 / 10 / 10	Moderate data — can afford smaller val/test fractions
> 100,000	98 / 1 / 1	Large data — even 1% is thousands of samples, plenty for reliable estimates
ImageNet (ILSVRC-2012)	~89.5 / 3.5 / 7	~1.28M train, 50K val, 100K test; standard benchmark

Shuffling before splitting is critical. If your data has any ordering (e.g., all cats first, then all dogs), splitting without shuffling puts all cats in training and all dogs in test. The model has never seen a dog during training and fails completely.
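A shuffled three-way split takes only a few lines. This is a minimal sketch: the function name `train_val_test_split`, its default fractions, and the ordered "cats then dogs" toy data are assumptions chosen for illustration.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into three disjoint index ranges."""
    n = len(X)
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test_idx = perm[:n_test]
    val_idx = perm[n_test:n_test + n_val]
    train_idx = perm[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Ordered toy data: 50 "cats" (label 0) followed by 50 "dogs" (label 1).
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.repeat([0, 1], 50)

(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = train_val_test_split(X, y)
print(len(X_tr), len(X_va), len(X_te))       # 70 15 15
print("classes in train:", np.unique(y_tr))
print("classes in test: ", np.unique(y_te))
```

Because the permutation is drawn before slicing, each subset receives a random mix of both classes even though the raw data is sorted by label. Fixing `seed` keeps the split reproducible across runs.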

Data Leakage: The Silent Killer

A split only protects you if no information from the val/test sets sneaks back into training. When it does, your validation metrics look artificially good in the lab and then collapse in production. Kaufman et al. (2012) formalize this as data leakage and document cases where it distorted the results of data-mining competitions.

Leakage comes in four distinct flavors. The first concerns the features, the second the split itself, the third the preprocessing pipeline that wraps the model, and the fourth how the validation set is used during tuning.

Type	What it looks like	Fix
Target leakage	A feature whose value only becomes available AFTER the moment of prediction (e.g., 'was the loan paid off' as a feature to predict 'will the loan default').	Audit features for a strict chronology: would this value exist at inference time?
Train/test contamination	Duplicate or near-duplicate rows exist in both train and test (e.g., the same image under a different filename, the same user's two sessions).	Deduplicate on a stable identifier before splitting; prefer group-based splits (e.g., GroupKFold by user id).
Preprocessing leakage	A scaler, imputer, PCA, or encoder is fit on the FULL dataset before splitting, so test statistics influence training features.	Fit preprocessing on TRAIN ONLY, then apply to val/test. scikit-learn Pipelines do this correctly by construction.
Validation-set tuning leakage	Tuning thresholds, hyperparameters, or architecture repeatedly against the same validation set eventually overfits it. The test set stays untouched to catch this.	Keep a sealed test set. If you must iterate heavily, use k-fold CV (Kohavi 1995) or nested CV (Cawley & Talbot 2010) for honest estimates.
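The group-based fix for train/test contamination can be sketched in plain NumPy. The helper `group_train_test_split` and the user/session data below are hypothetical; scikit-learn ships its own group-aware splitters (such as GroupKFold), which are the better choice in practice.

```python
import numpy as np

def group_train_test_split(groups, test_frac=0.2, seed=0):
    """Split row indices so that no group id appears on both sides."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)
    n_test_groups = max(1, int(round(len(unique_groups) * test_frac)))
    test_groups = shuffled[:n_test_groups]
    is_test = np.isin(groups, test_groups)
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)

# Hypothetical data: 10 users with 3 sessions each (30 rows).
user_ids = np.repeat(np.arange(10), 3)
train_idx, test_idx = group_train_test_split(user_ids, test_frac=0.2)

overlap = set(user_ids[train_idx]) & set(user_ids[test_idx])
print("train rows:", len(train_idx), "test rows:", len(test_idx))   # 24 and 6
print("users on both sides:", overlap)                              # set()
```

The split is drawn over group ids rather than rows, so all sessions of a user land on the same side and the model is never evaluated on a user it has already seen.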

The preprocessing case is the one most practitioners have accidentally committed. Here is the smallest example that shows what it hides:

StandardScaler leakage: a 20-line demo that exposes an 87 σ outlier

scaler_leakage.py (annotated line by line; the full listing follows the walkthrough)

import numpy as np

NumPy gives us fast array math so we can compute means, stds, and element-wise differences without Python loops.

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

Six one-dimensional samples. The first five look innocuous; the last one (100.0) is an extreme outlier that will end up in the test set. Real-world analogues: a rare fraud transaction, a failure mode with catastrophic magnitude, a mis-calibrated sensor reading.

EXECUTION STATE

X = array([1., 2., 3., 4., 5., 100.])

X_train, X_test = X[:4], X[4:]

Slice the first four into training, the last two into testing. The outlier (100.0) lives exclusively in the test set, which is what simulates 'the future we cannot see yet.'

EXECUTION STATE

X_train = array([1., 2., 3., 4.])

X_test = array([ 5., 100.])

mu_full, std_full = X.mean(), X.std()

WRONG PATH — compute scaler statistics on the ENTIRE X, including the test outlier. This is the textbook leakage mistake: your preprocessing step has seen data it will be evaluated on.

EXECUTION STATE

mu_full = (1+2+3+4+5+100)/6 = 115/6 = 19.17

std_full = sqrt(((1-19.17)² + (2-19.17)² + ... + (100-19.17)²)/6) ≈ 36.17

→ why bad = std_full is dominated by the outlier (|100 - 19.17|² = 6534). The scaler you trained can only produce these particular stats BECAUSE it peeked at the test set.

X_train_wrong = (X_train - mu_full) / std_full

Apply the leaked statistics to train.

EXECUTION STATE

X_train_wrong = ([1,2,3,4] - 19.17)/36.17 ≈ [-0.502, -0.475, -0.447, -0.419]

→ what you see = Training features look almost constant. The scaler has hidden the real variation inside the training set — because the outlier's distance blew up the std.

X_test_wrong = (X_test - mu_full) / std_full

Apply the same leaked statistics to test.

EXECUTION STATE

X_test_wrong = ([5, 100] - 19.17)/36.17 ≈ [-0.392, 2.235]

→ the danger = The outlier LOOKS like a modest 2.2 σ — nothing alarming. In the leaked world you trained your model inside, a 2 σ sample is routine. Your model will happily predict on it and its error estimate will be too rosy.

mu_train, std_train = X_train.mean(), X_train.std()

RIGHT PATH — compute stats on X_train ONLY. The test set stays sealed.

EXECUTION STATE

mu_train = (1+2+3+4)/4 = 2.5

std_train = sqrt(((1-2.5)² + (2-2.5)² + (3-2.5)² + (4-2.5)²)/4) = sqrt(1.25) ≈ 1.118

X_train_right = (X_train - mu_train) / std_train

Apply the honest training-only statistics to X_train.

EXECUTION STATE

X_train_right = ([1,2,3,4] - 2.5)/1.118 ≈ [-1.342, -0.447, 0.447, 1.342]

→ sanity check = Mean = 0, std = 1 across X_train as expected for proper standardization.

X_test_right = (X_test - mu_train) / std_train

Apply the SAME train-only statistics to X_test. The test set never influences mu_train or std_train.

EXECUTION STATE

X_test_right = ([5, 100] - 2.5)/1.11803 ≈ [2.236, 87.207]

→ the signal the leaked version hid = The outlier now shows up as 87 σ above the training mean. Your model is extrapolating wildly. This is a signal — you can detect it, decide to clip, to retrain with more data, or to route this example to a human reviewer. The leaked-stats version silently hid this signal.

print(f"full-data  stats: mean={mu_full:.2f}  std={std_full:.2f}")

Inspect the two parameter sets the scaler learned.

EXECUTION STATE

stdout = full-data stats: mean=19.17 std=36.17

print(f"train-only stats: mean={mu_train:.2f}  std={std_train:.2f}")

The honest scaler's parameters.

EXECUTION STATE

stdout = train-only stats: mean=2.50 std=1.12

print(f"WRONG X_test: {np.round(X_test_wrong, 3)}")

The leaked view of the outlier.

EXECUTION STATE

stdout = WRONG X_test: [-0.392 2.235]

print(f"RIGHT X_test: {np.round(X_test_right, 3)}")

The honest view — the outlier is exposed at 87 σ.

EXECUTION STATE

stdout = RIGHT X_test: [ 2.236 87.207]

→ bottom line = Always fit scalers, imputers, PCA, encoders, feature selectors — every preprocessing step that learns parameters — on TRAIN ONLY, then APPLY to val / test. In scikit-learn: fit_transform on train, transform on test. In PyTorch: compute stats on train_loader, apply as a frozen torch.nn.Module.

Full listing of scaler_leakage.py:

import numpy as np

# 6 samples, one giant outlier waiting in the test set
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
X_train, X_test = X[:4], X[4:]     # train = [1,2,3,4], test = [5, 100]

# --- WRONG: fit the scaler on the entire X (leaks test info) ---
mu_full,  std_full  = X.mean(),       X.std()
X_train_wrong = (X_train - mu_full)  / std_full
X_test_wrong  = (X_test  - mu_full)  / std_full

# --- RIGHT: fit the scaler on X_train only, then apply to X_test ---
mu_train, std_train = X_train.mean(), X_train.std()
X_train_right = (X_train - mu_train) / std_train
X_test_right  = (X_test  - mu_train) / std_train

print(f"full-data  stats: mean={mu_full:.2f}  std={std_full:.2f}")
print(f"train-only stats: mean={mu_train:.2f}  std={std_train:.2f}")
print(f"WRONG X_test: {np.round(X_test_wrong, 3)}")
print(f"RIGHT X_test: {np.round(X_test_right, 3)}")

Why this matters: the leaked scaler does not just add a small bias. It can systematically hide the out-of-distribution samples that your production monitoring most needs to see. A model trained on leaked features looks miraculous in offline evaluation and fails quietly in deployment.
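The fit-on-train, apply-everywhere pattern can be wrapped in a small reusable object. This is a sketch only: the class name `TrainOnlyScaler` and the 6 σ alert threshold are assumptions chosen for illustration, and in practice scikit-learn's StandardScaler plays this role.

```python
import numpy as np

class TrainOnlyScaler:
    """Standardizer whose statistics come from the training set alone."""

    def fit(self, X_train):
        self.mean_ = X_train.mean(axis=0)
        self.std_ = X_train.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0], [100.0]])

scaler = TrainOnlyScaler().fit(X_train)        # fit: train only
Z_test = scaler.transform(X_test)              # apply: frozen statistics

# Flag anything further than 6 training standard deviations from the
# training mean (the threshold 6 is an arbitrary illustrative choice).
suspicious = np.abs(Z_test).max(axis=1) > 6.0
print(np.round(Z_test.ravel(), 3))   # [ 2.236 87.207]
print(suspicious)                     # [False  True]
```

Because the statistics are frozen after `fit`, the standardized value doubles as an out-of-distribution score: the outlier is flagged for clipping or human review instead of being silently rescaled.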

K-Fold Cross-Validation

A single train / val / test split is wasteful when data is scarce: the model trains on only 70% of the data, and the validation estimate rests on whichever 15% happened to land in the validation set, so it is noisy. K-fold cross-validation (Stone 1974) spreads that evaluation across the entire dataset by rotating every example through the validation role exactly once.

The dataset is partitioned into $K$ (nearly) equal-sized folds $D_1, \ldots, D_K$. For each $j \in \{1, \ldots, K\}$, fit a fresh model $f_{-j}$ on the union of the other folds and score it on $D_j$. The CV estimate of generalization error is the mean of those fold scores:

$\mathrm{CV}_K(f) \;=\; \frac{1}{K} \sum_{j=1}^{K} L\!\left(f_{-j};\; D_j\right)$

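The standard choices are $K=5$ (cheaper, but each model trains on only 80% of the data, so the error estimate is slightly pessimistic) and $K=10$ (more expensive, less pessimistic). Going all the way to $K=N$ is leave-one-out CV: almost unbiased, but the $N$ training sets are nearly identical, so the fold scores are strongly correlated and the variance of their mean can be high. Bengio & Grandvalet (2004) show that this correlation between folds also rules out any unbiased estimator of the variance of the K-fold estimate. In applied deep learning the cost of refitting a large model $K$ times usually rules out K-fold entirely; it is most useful with small datasets or classical models.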

K-fold CV by hand — the fold loop, made visible

kfold_from_scratch.py (annotated line by line; the full listing follows the walkthrough)

import numpy as np

All the math (sampling, distances, means) is vectorized through NumPy.

np.random.seed(0)

Seed the data-generating RNG so the 40 samples are reproducible. k-fold CV is about estimator variance; we want to isolate that from dataset randomness.

X = np.vstack([...])

Stack two arrays top-to-bottom. The result is a (40, 2) matrix: 20 class-1 points near (+1.5, +1.5), 20 class-0 points near (-1.5, -1.5). Well-separated but with some overlap (since σ = 1 for each Gaussian, the distributions touch in the middle).

EXECUTION STATE

📚 np.vstack(tup) = Stack arrays in sequence vertically (row-wise). Equivalent to np.concatenate with axis=0.

X.shape = (40, 2)

y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])

Labels aligned with X. 20 ones followed by 20 zeros. The labels are perfectly ordered here — this is exactly why we will shuffle before splitting into folds.

EXECUTION STATE

y[:5] = [1, 1, 1, 1, 1]

y[-5:] = [0, 0, 0, 0, 0]

def mean_classifier(X_tr, y_tr, X_va, y_va):

A tiny, fit-in-two-lines classifier: compute the mean of each class on training data; at validation time predict whichever class mean is closer. Choosing a weak model keeps the focus on the CV mechanics, not on model performance.

EXECUTION STATE

⬇ input: X_tr = Training features. Shape (~32, 2) for 5-fold CV on 40 samples.

⬇ input: y_tr = Training labels aligned with X_tr.

⬇ input: X_va = Validation features for this fold. Shape (8, 2).

⬇ input: y_va = Validation labels for this fold.

⬆ returns = float — accuracy on the validation fold.

mu0 = X_tr[y_tr == 0].mean(axis=0)

Boolean indexing: select training rows whose label is 0, then column-wise mean. Produces a 2-vector — the centroid of class 0 in the training set.

EXECUTION STATE

X_tr[y_tr == 0] = all training points in class 0

axis=0 = reduce along the ROW axis — take the mean of each feature column independently

mu0 = approximately (-1.5, -1.5) — training centroid of class 0

mu1 = X_tr[y_tr == 1].mean(axis=0)

Same for class 1.

EXECUTION STATE

mu1 = approximately (+1.5, +1.5) — training centroid of class 1

d0 = np.linalg.norm(X_va - mu0, axis=1)

Distance from each validation point to the class-0 centroid. np.linalg.norm with axis=1 computes the row-wise L2 norm, returning a 1-D vector of length = len(X_va).

EXECUTION STATE

📚 np.linalg.norm(a, axis) = Vector / matrix norm. Default is L2. With axis=1 it treats each row as a vector and returns the per-row norm.

X_va - mu0 = broadcast: shape (8, 2) − shape (2,) → shape (8, 2)

d0 (shape) = (8,) — one distance per validation sample

d1 = np.linalg.norm(X_va - mu1, axis=1)

Distance from each validation point to the class-1 centroid.

pred = (d1 < d0).astype(int)

Element-wise comparison gives a boolean vector: True where class 1's centroid is closer. Cast to int so pred ∈ {0, 1}.

EXECUTION STATE

(d1 < d0) = length-8 bool array

.astype(int) = turns True → 1, False → 0

return float((pred == y_va).mean())

(pred == y_va) produces a bool array where True marks correct predictions. .mean() averages True = 1, False = 0 — so the result is the fraction correct. Cast to float so downstream code doesn't mix in NumPy scalars.

EXECUTION STATE

⬆ return: accuracy = e.g. 0.875 — 7 of 8 validation samples correct

np.random.seed(1)

A SEPARATE seed for the CV shuffling, distinct from the data-generation seed. This mirrors real practice: you use one seed to create a reproducible dataset, and a different seed to produce a reproducible split for an experiment. Changing only the CV seed lets you see how sensitive your conclusions are to the specific fold assignment.

perm = np.random.permutation(len(X))

Shuffle indices 0..39 into a random order. We will later slice this permutation into 5 contiguous chunks, each of which serves once as the validation fold.

EXECUTION STATE

📚 np.random.permutation(n) = Returns a random permutation of arange(n).

perm (example) = [11, 3, 27, 38, 5, 19, 33, 21, 14, 30, 1, 15, 6, 26, 18, 0, 4, 37, 8, 35, 32, 22, 29, 9, 28, 25, 7, 36, 24, 34, 23, 12, 16, 20, 17, 2, 39, 13, 31, 10]

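→ why shuffle = Without this, the folds would be contiguous slices of the ordered labels: folds 0 and 1 would be ALL class 1, fold 2 half and half, and folds 3 and 4 ALL class 0. Most validation folds would then test a single class while the matching training set is skewed toward the other one. Shuffling gives roughly balanced folds.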

K = 5

Number of folds. K = 5 and K = 10 are the two most common choices. K = 5 trains on 80% of the data in each round; K = 10 trains on 90%. More folds → less bias (each training set resembles the full dataset more closely) but more compute, and the training sets overlap more, so the fold scores are more strongly correlated.

fold_size = len(X) // K

Integer division. 40 divides evenly by 5, so every fold has exactly 8 samples. When N is not a multiple of K, this floor-division version never validates the trailing N mod K samples (they stay in every training set); sklearn's KFold instead makes the first N mod K folds one sample larger.

EXECUTION STATE

fold_size = 40 // 5 = 8 samples per fold

accs = []

Collect one accuracy per fold.

for k in range(K):

The five-fold rotation. Each iteration pops fold k out as the validation set and trains on the other four folds.

LOOP TRACE · 5 iterations

k=0

va_idx = perm[0:8] = first 8 shuffled indices

tr_idx = perm[8:40] = remaining 32 indices

acc (typical) = ≈ 1.000 — clean fold

k=1

va_idx = perm[8:16]

tr_idx = perm[0:8] ∪ perm[16:40]

acc (typical) = ≈ 0.875 — one validation point sits in the overlap zone

k=2

va_idx = perm[16:24]

acc (typical) = ≈ 1.000

k=3

va_idx = perm[24:32]

acc (typical) = ≈ 0.875

k=4

va_idx = perm[32:40]

acc (typical) = ≈ 1.000

va_idx = perm[k*fold_size:(k+1)*fold_size]

The 8 indices that are validation this round. A contiguous slice of the shuffled permutation.

tr_idx = np.concatenate([perm[:k*fold_size], perm[(k+1)*fold_size:]])

Training indices = everything EXCEPT the current validation slice. Two-piece concatenation: the portion before the slice and the portion after.

acc = mean_classifier(X[tr_idx], y[tr_idx], X[va_idx], y[va_idx])

Train a FRESH classifier on the training indices (our toy version just recomputes the class means — a real classifier would re-fit from scratch here) and score it on the validation fold.

EXECUTION STATE

→ key principle = Every fold trains a NEW model from scratch. NEVER carry training state from fold to fold — that leaks information across folds.

accs.append(acc)

Record the per-fold accuracy.

print(f"fold {k}: train={len(tr_idx):2d}  val={len(va_idx):2d}  acc={acc:.3f}")

Per-fold diagnostic. Seeing the sizes printed confirms the fold arithmetic is right.

EXECUTION STATE

stdout (example run) = fold 0: train=32 val= 8 acc=1.000 fold 1: train=32 val= 8 acc=0.875 fold 2: train=32 val= 8 acc=1.000 fold 3: train=32 val= 8 acc=0.875 fold 4: train=32 val= 8 acc=1.000

cv_mean, cv_std = float(np.mean(accs)), float(np.std(accs))

Two numbers summarize the CV result. cv_mean is your generalization estimate. cv_std is the variability across folds — a proxy for how much your accuracy depends on which 20% of the data you held out.

EXECUTION STATE

cv_mean = 0.950 — average of the 5 fold accuracies

cv_std = 0.061 — per-fold scatter

print(f"5-fold CV accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")

The single most useful summary in applied ML: 'mean ± std over K folds'. Two models with mean accuracy 0.93 and 0.94 but stds 0.02 and 0.08 are VERY different beasts — the first is reliably good, the second might be randomly lucky. Always report both.

EXECUTION STATE

stdout = 5-fold CV accuracy: 0.950 +/- 0.061

→ interpretation = Generalization accuracy is ~0.95 with per-fold scatter ~0.06. If you re-split with a different seed and get 0.94 ± 0.07, your estimate is stable. If you get 0.80 ± 0.15, the model is sensitive to the split and you need more data or a simpler model.

Full listing of kfold_from_scratch.py:

import numpy as np

# Two Gaussian blobs in 2D: class 1 near (+1.5, +1.5), class 0 near (-1.5, -1.5)
np.random.seed(0)
X = np.vstack([
    np.random.randn(20, 2) + np.array([ 1.5,  1.5]),
    np.random.randn(20, 2) + np.array([-1.5, -1.5]),
])
y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])

def mean_classifier(X_tr, y_tr, X_va, y_va):
    """Predict the class whose training mean is nearer (nearest-centroid)."""
    mu0 = X_tr[y_tr == 0].mean(axis=0)
    mu1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_va - mu0, axis=1)
    d1 = np.linalg.norm(X_va - mu1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_va).mean())

# Shuffle once, then walk K equal-sized folds
np.random.seed(1)
perm = np.random.permutation(len(X))     # 40 indices in random order

K = 5
fold_size = len(X) // K                  # 8 per fold
accs = []
for k in range(K):
    va_idx = perm[k*fold_size:(k+1)*fold_size]
    tr_idx = np.concatenate([perm[:k*fold_size], perm[(k+1)*fold_size:]])
    acc = mean_classifier(X[tr_idx], y[tr_idx], X[va_idx], y[va_idx])
    accs.append(acc)
    print(f"fold {k}: train={len(tr_idx):2d}  val={len(va_idx):2d}  acc={acc:.3f}")

cv_mean, cv_std = float(np.mean(accs)), float(np.std(accs))
print(f"5-fold CV accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
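The floor-division fold arithmetic works here because 40 is a multiple of 5. When N is not a multiple of K, one option is to let fold sizes differ by at most one, so that every sample is still validated exactly once. A minimal sketch, in which the generator name `kfold_indices` is an assumption for illustration:

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(perm, n_splits)     # sizes differ by at most 1
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

# 43 samples do not divide evenly into 5 folds.
sizes = [len(val) for _, val in kfold_indices(43, 5)]
print(sizes)                                   # [9, 9, 9, 8, 8]
```

`np.array_split` hands the extra samples to the first folds, which avoids silently excluding the remainder from validation.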

Stratified K-Fold

Plain K-fold splits purely by index. If the class proportions are skewed (say 90% class 0, 10% class 1), a random fold might by chance contain no minority samples at all. Its metrics become meaningless and your CV estimate picks up spurious variance. Stratified K-fold (Kohavi 1995) constrains every fold to mirror the overall class ratio. It is the default for classification and should be a reflex:

KFold and StratifiedKFold in scikit-learn

kfold_sklearn.py (annotated line by line; the full listing follows the walkthrough)

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

The three standard tools. KFold and StratifiedKFold produce splits; cross_val_score ties together split + fit + score.

EXECUTION STATE

📚 KFold(n_splits, shuffle, random_state) = Returns an iterable of (train_idx, val_idx) pairs. Ignores labels — just splits by index.

📚 StratifiedKFold(...) = Same interface, but each fold preserves the class proportions of the full dataset. Uses y when generating splits.

📚 cross_val_score(estimator, X, y, cv, scoring) = Does the full loop internally: split → clone estimator → fit on train → score on val → record. Returns array of per-fold scores.

from sklearn.linear_model import LogisticRegression

A real classifier this time. Each fold will re-fit a fresh LogisticRegression model on that fold's training split.

kf = KFold(n_splits=5, shuffle=True, random_state=1)

Build the plain K-fold splitter. shuffle=True randomizes before splitting (otherwise you get consecutive slices). random_state pins the shuffle for reproducibility.

EXECUTION STATE

⬇ arg: n_splits=5 = K — number of folds.

⬇ arg: shuffle=True = Shuffle indices before slicing into folds. Almost always what you want; only set False for time-series (see TimeSeriesSplit).

⬇ arg: random_state=1 = Seed for the shuffle. Pin it for reproducible experiments; vary it to measure split-to-split variance.

acc_plain = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring="accuracy")

The full CV pipeline in one call. For each (train_idx, val_idx) yielded by kf, cross_val_score clones the estimator (so training state does NOT leak across folds), fits, scores, and appends to a 1-D array of length n_splits.

EXECUTION STATE

⬇ arg: estimator=LogisticRegression() = Unfit prototype. A fresh clone is used per fold.

⬇ arg: scoring='accuracy' = Metric to evaluate each fold. Swap in 'f1', 'roc_auc', 'neg_log_loss', ... as needed.

⬆ acc_plain = array([1.0, 1.0, 1.0, 0.875, 1.0]) (example)

⬆ acc_plain.mean() = 0.975

⬆ acc_plain.std() = 0.050

print("plain       :", np.round(acc_plain, 3), ...)

One-line diagnostic dump. Mean ± std is the number you quote in a paper or a report.

EXECUTION STATE

stdout = plain : [1. 1. 1. 0.875 1. ] mean 0.975 +/- 0.050

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

The stratified variant. Each fold is built so that class proportions match the full dataset. For our 20/20 data the effect is subtle, but on a 90/10 imbalance plain K-fold can produce validation folds that happen to contain only majority-class samples, which wrecks that fold's score. StratifiedKFold prevents this as long as every class has at least n_splits samples.

EXECUTION STATE

→ when it matters = ALWAYS use StratifiedKFold for classification with class imbalance. It is the default in cross_val_score when you pass an integer cv for a classifier (cv=5 uses StratifiedKFold under the hood). Making it explicit removes surprise.

acc_strat = cross_val_score(LogisticRegression(), X, y, cv=skf, scoring="accuracy")

Same call, different splitter. Produces folds whose class ratios all match 50/50 here.

EXECUTION STATE

⬆ acc_strat = array([1.0, 1.0, 1.0, 1.0, 1.0]) (example — stratification stabilizes the result)

print("stratified  :", np.round(acc_strat, 3), ...)

On balanced data, the two splitters agree within noise. On imbalanced data you would see StratifiedKFold's std drop sharply because no fold ends up with a degenerate class mix.

EXECUTION STATE

stdout (example) = stratified : [1. 1. 1. 1. 1. ] mean 1.000 +/- 0.000

→ practical rule = Classification ⇒ StratifiedKFold. Regression ⇒ plain KFold (stratification needs discrete labels). Time series ⇒ TimeSeriesSplit (no shuffling; forward-walk splits).

Full listing of kfold_sklearn.py:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.vstack([
    np.random.randn(20, 2) + [ 1.5,  1.5],
    np.random.randn(20, 2) + [-1.5, -1.5],
])
y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])

# Plain K-fold: ignores labels when splitting
kf = KFold(n_splits=5, shuffle=True, random_state=1)
acc_plain = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring="accuracy")
print("plain       :", np.round(acc_plain, 3),
      f"mean {acc_plain.mean():.3f} +/- {acc_plain.std():.3f}")

# Stratified K-fold: preserves class ratio inside EVERY fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
acc_strat = cross_val_score(LogisticRegression(), X, y, cv=skf, scoring="accuracy")
print("stratified  :", np.round(acc_strat, 3),
      f"mean {acc_strat.mean():.3f} +/- {acc_strat.std():.3f}")
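To see what stratification does mechanically, here is a from-scratch sketch that deals each class's samples round-robin across the folds. The helper `stratified_fold_ids` is hypothetical and deliberately simple; it illustrates the idea and is not a description of how scikit-learn's StratifiedKFold allocates samples internally.

```python
import numpy as np

def stratified_fold_ids(y, n_splits, seed=0):
    """Assign each sample a fold id so class ratios are (nearly) equal across folds."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        # Deal this class's samples round-robin across the folds.
        fold_ids[cls_idx] = np.arange(len(cls_idx)) % n_splits
    return fold_ids

# Imbalanced toy labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
fold_ids = stratified_fold_ids(y, n_splits=5)

for k in range(5):
    in_fold = fold_ids == k
    print(f"fold {k}: size={in_fold.sum()}  positives={y[in_fold].sum()}")
# every fold: size=20  positives=2
```

Each fold receives exactly 2 of the 10 positives, so no validation fold can end up without minority samples.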

Rule of thumb for neural nets: k-fold CV is invaluable for small datasets (under ~10,000 samples) or when comparing model families. For deep networks trained for hours per run on large datasets, a single honest train / val / test split is the norm, because run-to-run training noise typically dominates split noise anyway. Save k-fold for preprocessing-pipeline selection, hyperparameter sweeps on small data, and methodological comparisons that will make it into a paper.

The Bias-Variance Tradeoff

For squared-error loss, the expected generalization error at a point $x$ decomposes into three components, of which only the noise term is irreducible:

$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$

Bias measures how far the model's average prediction is from the truth. High bias = the model is too simple to capture the pattern (underfitting). A linear model fitting a sine wave has high bias.
Variance measures how much the model's predictions change when trained on different samples. High variance = the model is too sensitive to the specific training data (overfitting). A degree-15 polynomial has high variance.
Noise ($\sigma^2$) is the irreducible error from the data-generating process. No model can do better than this.
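The decomposition follows in two steps. What follows is a sketch of the standard argument, assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the training set (and therefore of $\hat{f}$). The shorthand $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$ is introduced here for brevity.

```latex
\begin{align*}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \hat{f}(x) + \varepsilon)^2\big] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big]
     + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(x) - \hat{f}(x)\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2 \\[6pt]
\mathbb{E}\big[(f(x) - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2\big] \\
  &= (\bar{f}(x) - f(x))^2
     + 2\,(f(x) - \bar{f}(x))\,\mathbb{E}\big[\bar{f}(x) - \hat{f}(x)\big]
     + \mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big] \\
  &= \underbrace{(\bar{f}(x) - f(x))^2}_{\text{Bias}^2}
     + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big]}_{\text{Variance}}
\end{align*}
```

Both cross terms vanish: the first because the noise has zero mean and is independent of $\hat{f}$, the second because $\mathbb{E}[\bar{f}(x) - \hat{f}(x)] = 0$ by the definition of $\bar{f}$. Substituting the second result into the first gives the three-term formula above.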

The tradeoff: increasing model complexity decreases bias (better fit to the true pattern) but increases variance (more sensitive to training noise). The optimal complexity minimizes the sum.
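Bias and variance can be estimated numerically by retraining the same model class on many independent training sets. The sketch below assumes a noisy sine target, 30 training points per run, and polynomial models of degree 1, 3, and 9; all of these settings are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                  # assumed noise standard deviation
x_grid = np.linspace(0.0, 1.0, 50)           # points where bias and variance are measured
f_true = np.sin(2.0 * np.pi * x_grid)

def fit_and_predict(degree, n_train=30):
    """Draw a fresh training set, fit a polynomial, predict on x_grid."""
    x = rng.uniform(0.0, 1.0, size=n_train)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polynomial.polynomial.polyfit(x, y, degree)
    return np.polynomial.polynomial.polyval(x_grid, coeffs)

results = {}
for degree in (1, 3, 9):
    preds = np.stack([fit_and_predict(degree) for _ in range(300)])   # (300, 50)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}  variance = {variance:.4f}")
```

In runs like this, the linear model typically shows the largest squared bias and the degree-9 model the largest variance. The exact numbers depend on the seed, the noise level, and the training-set size.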

Regime	Bias	Variance	Train Loss	Val Loss	Gap
Underfitting	High	Low	High	High	Small
Good fit	Low	Low	Low	Low	Small
Overfitting	Low	High	Very low	High	Large

The key diagnostic: compare train loss to val loss. If both are high, increase complexity (more layers, more neurons). If train loss is low but val loss is high, decrease complexity or add regularization. If both are low and close together, you have found a good fit.

Interactive: Overfitting Explorer

The visualization below fits polynomial models of increasing degree to a noisy sine wave. Drag the complexity slider from left (linear, underfitting) to right (degree 15, extreme overfitting) and observe how the train error keeps decreasing while the validation error follows a U-shape — first decreasing, then increasing.

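[Interactive visualization: polynomial fits of degree 1 to 15 on a noisy sine wave, with train and validation MSE plotted against degree.]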

Key observations as you experiment:

Degree 1-2 (underfitting): The line is too rigid to follow the sine curve. Both train and val MSE are high. The model has high bias.
Degree 3-5 (good fit): The curve follows the sine pattern without chasing individual points. Val MSE reaches its minimum. This is the sweet spot.
Degree 10+ (overfitting): The curve wiggles wildly to pass through every training point. Train MSE is near zero, but val MSE climbs sharply. The model has high variance.
Increase noise: With more noise, overfitting starts at a lower degree because there is more noise to memorize.
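The same experiment can be reproduced offline. The following sketch sweeps the polynomial degree from 1 to 15 on a noisy sine wave; the sample sizes (20 train, 200 validation) and the noise level are assumptions, and the interactive tool may use different settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n, noise=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise, size=n)

x_train, y_train = noisy_sine(20)
x_val, y_val = noisy_sine(200)

train_mse, val_mse = {}, {}
for degree in range(1, 16):
    # Polynomial.fit rescales x internally, which keeps high degrees numerically stable.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    val_mse[degree] = float(np.mean((poly(x_val) - y_val) ** 2))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}  val MSE = {val_mse[degree]:.4f}")

best_degree = min(val_mse, key=val_mse.get)
print("degree with the lowest validation MSE:", best_degree)
```

Reading the printed table from top to bottom shows the pattern described above: train MSE keeps shrinking as the degree grows, while validation MSE bottoms out at a moderate degree and then rises.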

Early Stopping

Early stopping is one of the simplest and most widely used regularization techniques. The idea: monitor the validation loss during training, and stop once it has stopped improving.

The Algorithm

After each epoch, compute the validation loss
If the val loss is the best so far, save a snapshot of the model weights and reset a counter
If the val loss did NOT improve, increment the counter
If the counter reaches a threshold called patience, stop training
Restore the model weights from the snapshot (the best-validated state)

Formally, we define the stopping criterion:

$\text{stop if } L_{\text{val}}(\theta_e) \ge \min_{e' < e} L_{\text{val}}(\theta_{e'}) \text{ for } P \text{ consecutive epochs}$

where $P$ is the patience. The patience parameter is important because validation loss is noisy — it may temporarily increase for a few epochs before decreasing again. Without patience, you would stop prematurely on a random fluctuation.
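The five steps of the algorithm map onto a small helper object. This is a framework-free sketch: the class name `EarlyStopping`, the dictionary standing in for model weights, and the validation-loss values are all made up for illustration.

```python
import copy

class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = -1
        self.counter = 0

    def step(self, val_loss, model_state, epoch):
        """Return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(model_state)   # snapshot, not a reference
            self.best_epoch = epoch
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# Simulated validation curve: improves, then drifts upward (values are made up).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
weights = {"w": 0.0}                       # stand-in for real model parameters

stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    weights["w"] = float(epoch)            # pretend a training epoch changed the weights
    if stopper.step(loss, weights, epoch):
        print(f"stopping at epoch {epoch}")            # epoch 6
        break

weights = stopper.best_state               # restore the best-validated weights
print(f"best epoch = {stopper.best_epoch}, best val loss = {stopper.best_loss}")
```

The deep copy matters: storing a reference to the live weights would let later epochs overwrite the snapshot. With patience 3, the loop stops at epoch 6 and restores the weights from epoch 3, where the validation loss was lowest.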

Choosing Patience

Patience	Behavior	When to Use
3-5	Stops quickly. May miss slow improvements.	Fast experiments, well-tuned models
10-15	Balanced. Standard default.	Most training scenarios
20-50	Very patient. Allows long plateaus.	LR schedulers with warmup, noisy small datasets

In certain settings, early stopping acts like L2 regularization (weight decay) (Bishop, 1995): for a quadratic error surface trained by gradient descent from small initial weights, stopping after a limited number of steps has approximately the same effect as adding a weight-decay penalty. As training progresses, the effective model complexity increases. Stopping early limits how complex the model becomes, just as weight decay penalizes large weights. The difference: early stopping sets the regularization strength implicitly through the validation loss, while weight decay requires manually choosing $\lambda$ .
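A sketch of where this correspondence comes from, under simplifying assumptions that are not part of the text above: the loss is approximated by a quadratic around its minimum $w^*$, training is plain gradient descent with learning rate $\eta$ starting from $w = 0$, and $h_i$ denotes the $i$-th eigenvalue of the Hessian $H$, with $w_i$ the weight component along the matching eigenvector.

$L(w) \approx L(w^*) + \tfrac{1}{2}(w - w^*)^\top H (w - w^*)$

After $t$ gradient steps, each component has moved only part of the way to its optimum:

$w_i^{(t)} = \left[1 - (1 - \eta h_i)^t\right] w_i^*$

Minimizing the same quadratic with an added penalty $\tfrac{\lambda}{2}\|w\|^2$ instead gives:

$\tilde{w}_i = \frac{h_i}{h_i + \lambda}\, w_i^*$

The two coincide when $(1 - \eta h_i)^t = \frac{\lambda}{h_i + \lambda}$ . For $\eta h_i \ll 1$ and $h_i \ll \lambda$ , taking logarithms on both sides gives $t \eta h_i \approx h_i / \lambda$ , that is:

$t \approx \frac{1}{\eta \lambda}$

Under these assumptions, fewer training steps correspond to a larger effective $\lambda$ , which is the sense in which the number of epochs plays the role of a regularization strength.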

Combine early stopping with other techniques. Early stopping is not an alternative to dropout, weight decay, or data augmentation; it is complementary. Use early stopping as a safety net on top of other regularization methods. It costs almost nothing (one validation evaluation per epoch) and prevents wasted compute.

Building Validation from Scratch

The code below implements a complete training pipeline with train/val/test evaluation and early stopping in NumPy. Pay close attention to the evaluate() function — it runs a forward pass only, with no backward pass and no weight updates. This is what makes it safe to use on validation and test data.

NumPy — Training with Validation and Early Stopping

🐍validation_training.py

1import numpy as np

NumPy provides the ndarray type and vectorized math operations we need for matrix multiplications, element-wise operations, and random number generation.

EXECUTION STATE

📚 numpy = Numerical computing library. All matrix math (@, broadcasting, exp, log) runs in optimized C, not Python.

3# Generate two-moons dataset (200 points)

We create a synthetic 2D classification dataset where two crescent-shaped clusters overlap slightly. This is a classic ML test problem: simple enough to visualize, complex enough that a linear model fails.

EXECUTION STATE

two moons = Two interleaved half-circle clusters. The overlap (controlled by noise) makes the boundary fuzzy — there is no perfect separator.

4np.random.seed(0) — Reproducibility

Fixes the random number generator so we get identical data every run. Critical for debugging and reproducible experiments.

5N = 200 — Total dataset size

200 total samples: 100 per class. After splitting: 140 train, 30 val, 30 test. Enough for smooth loss curves while being small enough to demonstrate overfitting.

EXECUTION STATE

N = 200 = 100 class-0 points + 100 class-1 points. Split 70/15/15 → 140 + 30 + 30.

6t = np.linspace(0, π, N // 2) — Angular positions

Creates 100 evenly spaced angles from 0 to π. These define the positions along each crescent. linspace(0, π, 100) gives [0, 0.0317, 0.0635, ..., 3.1416].

EXECUTION STATE

📚 np.linspace(0, π, 100) = 100 values evenly spaced from 0 to π. The angular parameter for the half-circle.

7X0 = np.c_[cos(t), sin(t)] + noise — Moon 0 (class 0)

The first crescent: (cos(t), sin(t)) traces a half-circle from (1,0) to (-1,0). Adding Gaussian noise with std=0.3 makes the boundary fuzzy.

EXECUTION STATE

📚 np.c_[a, b] = Column-stack: combines two 1D arrays into an (N, 2) matrix. np.c_[[1,2], [3,4]] = [[1,3],[2,4]].

0.3 * randn(100, 2) = Gaussian noise with std=0.3 on both x and y. This controls how much the two moons overlap.

X0 shape = (100, 2) — 100 points in 2D, centered on a half-circle

8X1 = np.c_[cos(t)+0.5, -sin(t)+0.3] + noise — Moon 1 (class 1)

The second crescent: flipped vertically (-sin) and shifted right (+0.5) and up (+0.3) so the two moons interleave. Same noise level.

EXECUTION STATE

+0.5, +0.3 = Offsets that position moon 1 to interleave with moon 0. Without these, the two half-circles would join into a single noisy ring around the origin.

X1 shape = (100, 2) — 100 points for the second class

9X = np.vstack([X0, X1]) — Combine both classes

Stack vertically: X0 (100×2) on top of X1 (100×2) = X (200×2). First 100 rows are class 0, last 100 are class 1.

EXECUTION STATE

📚 np.vstack() = Vertical stack: combines arrays along axis 0 (rows). vstack([(100,2), (100,2)]) = (200,2).

⬆ X = (200, 2) — all data points, first 100 are class 0, last 100 are class 1.

10y = np.array([0]*100 + [1]*100) — Labels

First 100 labels are 0, last 100 are 1. Matches the order of X (X0 first, X1 second).

EXECUTION STATE

y = [0, 0, ..., 0, 1, 1, ..., 1] — 200 labels, 100 of each class.

12# Shuffle and split: 70% train, 15% val, 15% test

The three-way split is the foundation of model evaluation. Training data trains the model, validation data tunes hyperparameters and detects overfitting, test data gives the final unbiased performance estimate.

13perm = np.random.permutation(N) — Random shuffle

Random permutation of [0, 1, ..., 199]. Shuffling before splitting ensures each subset is a representative sample of the full dataset. Without shuffling, the training set would contain all 100 class-0 points but only 40 class-1 points, and the validation and test sets would contain class 1 only.

EXECUTION STATE

📚 np.random.permutation(200) = Random ordering of 0..199. Ensures train/val/test each get a mix of both classes.

14X, y = X[perm], y[perm] — Apply shuffle

Reorder both features and labels using the same permutation. The pairing X[i], y[i] is preserved.

15n_train = int(0.7 * N) — Train size = 140

70% of 200 = 140 training samples. The 70/15/15 split is a common default for medium-sized datasets.

EXECUTION STATE

n_train = 140 = 70% of data. More training data → better model. Less → noisier gradients. 70% is a reasonable balance.

16n_val = int(0.15 * N) — Val size = 30

15% of 200 = 30 validation samples. Used to detect overfitting and select the best model. NEVER used for training.

EXECUTION STATE

n_val = 30 = 15% of data. Enough for a usable loss estimate, though with 30 samples it will be noisy. The val set is the ‘early warning system’ for overfitting.

17X_train, y_train = X[:n_train], y[:n_train]

First 140 shuffled samples become the training set. These are the ONLY samples the model sees during gradient updates.

EXECUTION STATE

X_train shape = (140, 2). Class balance: 70 zeros, 70 ones (well-balanced).

18X_val, y_val = X[n_train:n_train+n_val], y[...]

Next 30 samples become the validation set. Used after each epoch to check if the model is overfitting. No gradients are computed on this data.

EXECUTION STATE

X_val shape = (30, 2). Class balance: 17 zeros, 13 ones.

19X_test, y_test = X[n_train+n_val:], y[...]

Last 30 samples become the test set. Only evaluated ONCE at the very end. Never influences any training decision. This is the final, unbiased performance report.

EXECUTION STATE

X_test shape = (30, 2). Class balance: 13 zeros, 17 ones.

→ critical rule = Test data must NEVER be used to select hyperparameters, choose when to stop, or pick the best model. It is for the final evaluation only.

21print(f"Train: ..., Val: ..., Test: ...")

Output: Train: 140, Val: 30, Test: 30. Always verify your split sizes and class balance before training.

EXECUTION STATE

output = Train: 140, Val: 30, Test: 30
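Checking class balance takes one extra line per split. The sketch below is an illustration rather than the section's own code: three_way_split is a hypothetical helper, and it uses np.random.default_rng, so the per-class counts it prints will differ from the ones quoted in this walkthrough.

```python
import numpy as np

def three_way_split(n, train_frac=0.7, val_frac=0.15, seed=0):
    """Return disjoint index arrays (train, val, test) that together cover range(n)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# Same label layout as the section: 100 zeros followed by 100 ones
y = np.array([0] * 100 + [1] * 100)
train_idx, val_idx, test_idx = three_way_split(len(y))

for name, idx in (("train", train_idx), ("val", val_idx), ("test", test_idx)):
    counts = np.bincount(y[idx], minlength=2)   # samples per class in this split
    print(f"{name}: size={len(idx)}  class counts={counts.tolist()}")
```

Splitting indices instead of the arrays themselves keeps X and y aligned automatically and makes it easy to assert that no sample appears in two subsets.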

23# Model: 2 → 64 → 2 (overparameterized)

We intentionally use a very wide hidden layer (64 neurons). With 322 parameters for 140 training points, this model has enough capacity to memorize the training data, which makes overfitting likely. This is the setup that makes validation essential.

EXECUTION STATE

overparameterization = Parameters: (2×64)+64+(64×2)+2 = 322. With only 140 training points, the model has more parameters than data, a classic recipe for overfitting.
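The parameter count is easy to verify from the array shapes alone. A minimal check, where the dictionary names shapes and counts are illustrative:

```python
import numpy as np

H = 64
# Shapes of the four parameter arrays in the 2 -> 64 -> 2 network
shapes = {"W1": (2, H), "b1": (H,), "W2": (H, 2), "b2": (2,)}

counts = {name: int(np.prod(shape)) for name, shape in shapes.items()}
total = sum(counts.values())

print(counts)                 # per-array parameter counts
print("total:", total)        # 128 + 64 + 128 + 2
```

The total comes to 322, more than twice the 140 training points.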

24np.random.seed(3)

Fix weight initialization seed for reproducibility.

25H = 64 — Hidden layer width

64 hidden neurons. Compare with Section 2 which used 4. The excess capacity is what enables overfitting.

26W1 = 0.5 * np.random.randn(2, H) — Layer 1 weights

Random initialization from N(0, 0.25), i.e. standard deviation 0.5 (variance 0.25). Shape (2, 64): 2 input features → 64 hidden neurons. 128 parameters in this matrix alone.

EXECUTION STATE

W1 shape = (2, 64) = 128 parameters

27b1 = np.zeros(H) — Layer 1 biases

64 bias terms, all starting at zero.

28W2 = 0.5 * np.random.randn(H, 2) — Layer 2 weights

Shape (64, 2): maps 64 hidden features to 2 class logits. 128 parameters.

EXECUTION STATE

W2 shape = (64, 2) = 128 parameters

29b2 = np.zeros(2) — Output biases

2 output biases, one per class.

31# Helper: evaluate loss and accuracy (NO weight updates)

THIS IS THE KEY FUNCTION. The evaluate function computes loss and accuracy on ANY data subset WITHOUT modifying the model weights. This is what makes validation and test evaluation possible — it is a pure measurement, not a training step.

32def evaluate(X_set, y_set): — Pure evaluation function

Runs a forward pass on the given data and returns loss + accuracy. No backward pass, no weight updates. Safe to call on val/test data because it does not change the model at all.

EXECUTION STATE

⬇ input: X_set = Feature matrix for any split. Can be X_train (140×2), X_val (30×2), or X_test (30×2).

⬇ input: y_set = Labels for the same split. Must match X_set in length.

⬆ returns = Tuple (loss, acc). loss = cross-entropy (lower is better). acc = fraction correct (higher is better).

→ critical difference from training = No backward pass, no dW, no weight update. The model is not modified. This function is READ-ONLY.

33z1 = X_set @ W1 + b1 — Forward: layer 1

Standard linear transform. Note: this uses the CURRENT values of W1 and b1 (which are global variables that change during training).

34h1 = np.maximum(0, z1) — ReLU activation

Element-wise ReLU. Same as the training forward pass.

35z2 = h1 @ W2 + b2 — Output logits

Raw class scores. Shape: (N_set, 2) where N_set is the size of the data subset.

36exp_z = np.exp(z2 - z2.max(...)) — Stable softmax (numerator)

Numerically stable exponentiation: subtract row max before exp to prevent overflow.

37probs = exp_z / exp_z.sum(...) — Softmax probabilities

Each row sums to 1.0. probs[i, c] = P(class=c | x_i).
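To see why the row maximum is subtracted, it helps to compare against the naive formula on large logits. The function names softmax_naive and softmax_stable and the example logits are chosen for illustration.

```python
import numpy as np

def softmax_naive(z):
    """Textbook softmax: overflows once logits get large."""
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(z):
    """Same function, but the largest exponent in each row is exp(0) = 1."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[1000.0, 1001.0],    # large logits: exp overflows to inf
              [-2.0, 3.0]])        # ordinary logits: both versions agree

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)       # first row becomes inf / inf = nan
stable = softmax_stable(z)

print("naive: ", naive)
print("stable:", stable)
```

Subtracting a per-row constant does not change the result mathematically, because the factor exp(-max) cancels between numerator and denominator.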

38loss = -np.mean(np.log(probs[...])) — Cross-entropy loss

Average negative log probability of the correct class across all samples in this set. This is the same loss function used during training.

EXECUTION STATE

⬆ loss = Scalar. Initial val loss ≈ 1.11 (untrained weights, worse than a uniform guess). After training: val loss ≈ 0.19 at best epoch.
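A useful reference point when reading these numbers: a two-class model that always predicts 50/50 has cross-entropy ln 2 ≈ 0.693, so any loss above that means the model is confidently wrong on some samples. The sketch below uses made-up probabilities, and cross_entropy is a standalone copy of the loss line for illustration.

```python
import numpy as np

def cross_entropy(probs, y):
    """Mean negative log-probability of the correct class (same formula as evaluate())."""
    return float(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-8)))

y = np.array([0, 1, 1, 0])

uniform = np.full((4, 2), 0.5)                 # always predicts 50/50
confident_wrong = np.array([[0.1, 0.9],        # puts 0.9 on the wrong class every time
                            [0.9, 0.1],
                            [0.9, 0.1],
                            [0.1, 0.9]])

ce_uniform = cross_entropy(uniform, y)
ce_wrong = cross_entropy(confident_wrong, y)
print(f"uniform guess:   {ce_uniform:.4f}  (ln 2 = {np.log(2):.4f})")
print(f"confident wrong: {ce_wrong:.4f}")
```

Accuracy cannot distinguish these cases as sharply: it only records whether the argmax is right, not how much probability was placed on the wrong answer.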

39acc = np.mean(z2.argmax(axis=1) == y_set) — Accuracy

Fraction of samples where the predicted class (argmax of logits) matches the true label. Unlike loss, accuracy is a discrete metric (0% to 100%).

EXECUTION STATE

📚 .argmax(axis=1) = Returns the index of the maximum value in each row. For (N, 2): returns 0 or 1 for each sample — the predicted class.

⬆ acc = Scalar between 0 and 1. Initial val acc ≈ 53% (near random). After training: ≈ 93%.

40return loss, acc

Return both metrics. We track loss for early stopping (it is continuous, so it registers small changes that accuracy would miss) and accuracy for human interpretability.

42# Training loop with validation and early stopping

The core structure: after each epoch of training, we evaluate on the validation set. If val loss has not improved for ‘patience’ epochs, we stop and restore the best model.

43lr = 0.05 — Learning rate

Same learning rate as Section 2.

44patience = 10 — How long to wait before stopping

If the validation loss does not improve for 10 consecutive epochs, we stop training. This prevents wasting compute on a model that is already overfitting.

EXECUTION STATE

patience = 10 = 10 epochs of no improvement → stop. Smaller patience (3-5) stops earlier but may miss slow improvements. Larger (15-20) is safer but wastes more compute.

45best_val_loss = float('inf') — Track best validation loss

Initialize to infinity so the first epoch’s loss is always ‘the best so far’. This is the value we compare against to decide whether to save or wait.

46best_epoch = 0 — Which epoch was best

Records when the best val loss was achieved. Used in the final report.

47wait = 0 — Epochs since last improvement

Counter that resets to 0 when val loss improves, increments when it does not. When wait >= patience, we stop.

EXECUTION STATE

wait = 0 initially. Counts up to patience (10) then triggers early stopping.

48best_W1, best_b1 = W1.copy(), b1.copy() — Snapshot best weights

Save a COPY of the current weights. When we find a better val loss, we overwrite these. At the end, we restore these as the final model. The .copy() is critical — without it, best_W1 would be a reference to the same array that keeps changing.
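The aliasing problem is easy to reproduce in isolation. In this small illustration the array W stands in for any weight matrix, and the in-place update mirrors the `W1 -= lr*dW1` style used in the training loop.

```python
import numpy as np

W = np.ones(3)

alias = W            # second name for the SAME array
snapshot = W.copy()  # independent array with its own memory

W -= 0.5             # in-place update, as in the SGD step

print("W:       ", W)
print("alias:   ", alias)      # changed along with W
print("snapshot:", snapshot)   # still holds the old values
print("alias shares memory with W:   ", np.shares_memory(alias, W))
print("snapshot shares memory with W:", np.shares_memory(snapshot, W))
```

Only the copied array preserves the weights as they were at the moment of the snapshot.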

49best_W2, best_b2 = W2.copy(), b2.copy()

Same snapshot for layer 2 weights.

51np.random.seed(42) — Fix shuffle order

Seeds the shuffling used inside the training loop for reproducibility.

52for epoch in range(100): — Epoch loop

Train for up to 100 epochs. With early stopping, we may stop much earlier (typically around epoch 60-80 for this problem). The max of 100 is a safety limit.

53# Train (mini-batch, B=10)

The training section — covered in detail in Section 2. Here we focus on the NEW part: validation evaluation and early stopping after each epoch.

54p = np.random.permutation(n_train)

Shuffle training indices for this epoch.

55Xs, ys = X_train[p], y_train[p]

Reorder training data. Different shuffle each epoch.

56for i in range(0, n_train, 10): — Mini-batch loop

Process 14 batches of 10 samples each. Standard inner training loop from Section 2.

57Xb, yb = Xs[i:i+10], ys[i:i+10]

Slice one mini-batch.

58B = len(yb)

Actual batch size (10 for all batches since 140/10 = 14 exactly).

59z1 = Xb @ W1 + b1 — Forward pass (training batch)

Standard forward pass through layer 1. This IS part of training — gradients will follow.

60h1 = np.maximum(0, z1)

ReLU activation.

61z2 = h1 @ W2 + b2

Output logits.

62ez = np.exp(z2 - z2.max(axis=1, keepdims=True))

Numerically stable exp for softmax.

63pr = ez / ez.sum(axis=1, keepdims=True)

Softmax probabilities.

64dz2 = pr.copy()

Start of backward pass: copy probabilities.

65dz2[range(B), yb] -= 1

Softmax + cross-entropy gradient: subtract 1 at correct class position.

66dz2 /= B

Average gradient over batch.
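The three gradient lines above (copy the probabilities, subtract 1 at the correct class, divide by B) can be validated against finite differences. This is a sketch on random logits and labels that are not from the two-moons data; loss_fn is a helper written for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 5, 2
z = rng.standard_normal((B, C))        # random logits for one mini-batch
y = rng.integers(0, C, size=B)         # random integer labels

def loss_fn(logits):
    """Mean softmax cross-entropy over the batch."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[np.arange(B), y]))

# Analytic gradient, written exactly like the training loop
e = np.exp(z - z.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)
dz = p.copy()
dz[np.arange(B), y] -= 1
dz /= B

# Numerical gradient by central differences, one logit at a time
eps = 1e-6
num = np.zeros_like(z)
for i in range(B):
    for j in range(C):
        zp = z.copy(); zp[i, j] += eps
        zm = z.copy(); zm[i, j] -= eps
        num[i, j] = (loss_fn(zp) - loss_fn(zm)) / (2 * eps)

max_err = float(np.max(np.abs(dz - num)))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

A tiny maximum error confirms that the shortcut matches the derivative of the mean cross-entropy with respect to the logits.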

67dW2 = h1.T @ dz2; db2 = dz2.sum(0)

Layer 2 weight and bias gradients.

68dh1 = dz2 @ W2.T; dz1 = dh1 * (z1 > 0)

Backpropagate through W2 and ReLU.

69dW1 = Xb.T @ dz1; db1_ = dz1.sum(0)

Layer 1 gradients.

70W1 -= lr*dW1; b1 -= lr*db1_

SGD update for layer 1.

71W2 -= lr*dW2; b2 -= lr*db2

SGD update for layer 2. After this line, one training step is complete.

73# Evaluate on train AND validation (no gradients)

THIS IS THE NEW PART compared to Section 2. After all 14 training batches, we measure how the model performs on both training data and held-out validation data. The training loss tells us how well the model fits the training data. The validation loss tells us how well it generalizes.

74train_loss, train_acc = evaluate(X_train, y_train)

Call our read-only evaluate function on the TRAINING set. Note: this is NOT the per-batch loss from inside the training loop. We recompute loss on the full training set for a cleaner metric.

EXECUTION STATE

typical trajectory = E0: 0.42 (87%) → E20: 0.21 (92%) → E60: 0.19 (93%) → E80: 0.17 (93%). Keeps decreasing.

75val_loss, val_acc = evaluate(X_val, y_val)

Evaluate on the VALIDATION set. This is the number we watch for overfitting. Initially it decreases (model is learning useful patterns), but eventually it may increase (model starts memorizing training noise).

EXECUTION STATE

typical trajectory = E0: 0.46 (73%) → E20: 0.30 (80%) → E60: 0.22 (90%) → best then rises.

→ overfitting signal = When train_loss keeps going down but val_loss starts going UP, the model is overfitting. The gap between them is the generalization gap.

77# Early stopping check

The early stopping algorithm: (1) if val_loss improved, save model and reset counter; (2) if not, increment counter; (3) if counter reaches patience, stop training and restore the best model.

78if val_loss < best_val_loss: — Did val loss improve?

Compare current validation loss against the best we have ever seen. If it is lower (better), this epoch produced a better model.

EXECUTION STATE

→ why val_loss, not train_loss? = We care about generalization, not memorization. Train loss typically keeps improving with more training. Val loss reflects performance on unseen data.

79best_val_loss = val_loss — Update best

Record the new best validation loss for future comparisons.

80best_epoch = epoch — Record when

Remember which epoch achieved this best loss. Reported at the end.

81best_W1, best_b1 = W1.copy(), b1.copy() — Save model snapshot

Save a copy of the current weights. If the model starts overfitting later, we can restore these ‘best’ weights. The .copy() creates independent arrays that will not be affected by future weight updates.

82best_W2, best_b2 = W2.copy(), b2.copy()

Save layer 2 weights too.

83wait = 0 — Reset patience counter

The model just improved, so reset the counter. We give it another ‘patience’ epochs before considering stopping.

84else: — Val loss did NOT improve

The model did not beat its previous best on the validation set. This could be a temporary fluctuation or the start of overfitting.

85wait += 1 — Increment patience counter

One more epoch without improvement. When this reaches patience (10), we stop.

EXECUTION STATE

wait progression = If val_loss keeps not improving: wait = 1, 2, 3, ..., 10 → STOP. If val_loss improves at wait=7: wait resets to 0.

87if epoch % 20 == 0 or wait == patience:

Print progress every 20 epochs and at the stopping point.

88print(f"Epoch ...: train=... val=... wait=...")

Log both train and val metrics. The key diagnostic: is the gap between train_loss and val_loss growing? If so, the model is overfitting.

91if wait >= patience: — Patience exhausted?

The trigger condition for early stopping. If we have gone 10 epochs without improvement, stop training and restore the best model.

92print(f"Early stopping at epoch {epoch}! Best: epoch {best_epoch}")

Announce the early stop. The best model was saved patience epochs ago.

93break — Exit the epoch loop

Stop training immediately. Without this break, we would waste compute training a model that is only getting worse on validation data.

95# Final evaluation on test set (using best model)

The FINAL step. We restore the best model (the one with lowest val loss) and evaluate it on the test set, which has never influenced any decision during training. This gives an unbiased estimate of the model's performance on new data, though with only 30 test samples the estimate is still noisy.

96# Restore best weights

The current model weights have been trained past the best-validated point and may have started to overfit. We replace them with the saved best weights from the epoch with lowest validation loss.

97W1, b1 = best_W1, best_b1 — Restore layer 1

Replace the overfit weights with the saved best weights. After this, the model is in its best state.

98W2, b2 = best_W2, best_b2 — Restore layer 2

Same for layer 2.

99test_loss, test_acc = evaluate(X_test, y_test) — THE FINAL NUMBER

This is the moment of truth. We call evaluate on the test set using the restored best model. This number is the unbiased estimate of how the model will perform on real, unseen data. It should only be computed ONCE.

EXECUTION STATE

⬆ test_loss, test_acc = The final reported metrics. This is what you put in your paper, your report, your model card. It was computed on data the model has never seen and that never influenced any training decision.

100print(f"Test (best model): ...")

Final output: the test performance of the best model selected by validation.

EXECUTION STATE

output = Test (best model, epoch XX): loss=X.XXXX, acc=XX%

1import numpy as np
2
3# ── Generate two-moons dataset (200 points) ──
4np.random.seed(0)
5N = 200
6t = np.linspace(0, np.pi, N // 2)
7X0 = np.c_[np.cos(t), np.sin(t)] + 0.3 * np.random.randn(N//2, 2)
8X1 = np.c_[np.cos(t)+0.5, -np.sin(t)+0.3] + 0.3 * np.random.randn(N//2, 2)
9X = np.vstack([X0, X1])
10y = np.array([0]*(N//2) + [1]*(N//2))
11
12# ── Shuffle and split: 70% train, 15% val, 15% test ──
13perm = np.random.permutation(N)
14X, y = X[perm], y[perm]
15n_train = int(0.7 * N)
16n_val = int(0.15 * N)
17X_train, y_train = X[:n_train], y[:n_train]
18X_val, y_val = X[n_train:n_train+n_val], y[n_train:n_train+n_val]
19X_test, y_test = X[n_train+n_val:], y[n_train+n_val:]
20
21print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
22
23# ── Model: 2 → 64 → 2 (overparameterized) ──
24np.random.seed(3)
25H = 64
26W1 = 0.5 * np.random.randn(2, H)
27b1 = np.zeros(H)
28W2 = 0.5 * np.random.randn(H, 2)
29b2 = np.zeros(2)
30
31# ── Helper: evaluate loss and accuracy (NO weight updates) ──
32def evaluate(X_set, y_set):
33    z1 = X_set @ W1 + b1
34    h1 = np.maximum(0, z1)
35    z2 = h1 @ W2 + b2
36    exp_z = np.exp(z2 - z2.max(axis=1, keepdims=True))
37    probs = exp_z / exp_z.sum(axis=1, keepdims=True)
38    loss = -np.mean(np.log(probs[range(len(y_set)), y_set] + 1e-8))
39    acc = np.mean(z2.argmax(axis=1) == y_set)
40    return loss, acc
41
42# ── Training loop with validation and early stopping ──
43lr = 0.05
44patience = 10
45best_val_loss = float('inf')
46best_epoch = 0
47wait = 0
48best_W1, best_b1 = W1.copy(), b1.copy()
49best_W2, best_b2 = W2.copy(), b2.copy()
50
51np.random.seed(42)
52for epoch in range(100):
53    # Train (mini-batch, B=10)
54    p = np.random.permutation(n_train)
55    Xs, ys = X_train[p], y_train[p]
56    for i in range(0, n_train, 10):
57        Xb, yb = Xs[i:i+10], ys[i:i+10]
58        B = len(yb)
59        z1 = Xb @ W1 + b1
60        h1 = np.maximum(0, z1)
61        z2 = h1 @ W2 + b2
62        ez = np.exp(z2 - z2.max(axis=1, keepdims=True))
63        pr = ez / ez.sum(axis=1, keepdims=True)
64        dz2 = pr.copy()
65        dz2[range(B), yb] -= 1
66        dz2 /= B
67        dW2 = h1.T @ dz2; db2 = dz2.sum(0)
68        dh1 = dz2 @ W2.T; dz1 = dh1 * (z1 > 0)
69        dW1 = Xb.T @ dz1; db1_ = dz1.sum(0)
70        W1 -= lr*dW1; b1 -= lr*db1_
71        W2 -= lr*dW2; b2 -= lr*db2
72
73    # Evaluate on train AND validation (no gradients)
74    train_loss, train_acc = evaluate(X_train, y_train)
75    val_loss, val_acc = evaluate(X_val, y_val)
76
77    # Early stopping check
78    if val_loss < best_val_loss:
79        best_val_loss = val_loss
80        best_epoch = epoch
81        best_W1, best_b1 = W1.copy(), b1.copy()
82        best_W2, best_b2 = W2.copy(), b2.copy()
83        wait = 0
84    else:
85        wait += 1
86
87    if epoch % 20 == 0 or wait == patience:
88        print(f"Epoch {epoch}: train={train_loss:.4f} ({train_acc:.0%})"
89              f"  val={val_loss:.4f} ({val_acc:.0%})  wait={wait}")
90
91    if wait >= patience:
92        print(f"Early stopping at epoch {epoch}! Best: epoch {best_epoch}")
93        break
94
95# ── Final evaluation on test set (using best model) ──
96# Restore best weights
97W1, b1 = best_W1, best_b1
98W2, b2 = best_W2, best_b2
99test_loss, test_acc = evaluate(X_test, y_test)
100print(f"\nTest (best model, epoch {best_epoch}): loss={test_loss:.4f}, acc={test_acc:.0%}")

Watch the loss trajectory. In the early epochs, both train and val loss decrease together — the model is learning the true pattern. Eventually, the train loss keeps dropping while the val loss plateaus or rises — the model is now memorizing training noise. Early stopping catches this moment and restores the model to its best state.

The evaluate function is the key addition in this section. Compare Section 2 (training loop only, no evaluation) with this code. The only addition is a read-only forward pass on the validation set after each epoch, plus a few lines of early stopping logic. This small change transforms a training loop that might overfit into one that stops near the epoch with the lowest validation loss.

PyTorch: eval(), no_grad(), and Early Stopping

The PyTorch version introduces two critical features that do not exist in NumPy:

model.eval() switches the model to evaluation mode, disabling dropout and using running BatchNorm statistics. Without this, validation metrics for models containing those layers are noisy or biased.
@torch.no_grad() disables gradient tracking, saving memory and compute. Without this, PyTorch builds a computation graph for the validation forward pass, wasting GPU memory.

Feature	Training	Evaluation
model.train() / model.eval()	model.train() — dropout active, BN uses batch stats	model.eval() — dropout disabled, BN uses running stats
Gradient tracking	Enabled — autograd builds computation graph	@torch.no_grad() — no graph, saves memory
Weight updates	Yes — optimizer.step() modifies parameters	No — parameters are read-only
Data used	Training set only	Validation set (per epoch), test set (once at end)
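The two switches are independent, which a small experiment makes visible. The sketch below uses a hypothetical toy model with a Dropout layer (the section's own model has none) purely to show the effect of each switch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model with dropout, used only to make the mode switch visible
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(16, 4)

model.train()                          # dropout active, autograd on
train_a, train_b = model(x), model(x)  # two passes draw two different dropout masks

model.eval()                           # dropout disabled
with torch.no_grad():                  # no computation graph is recorded
    eval_a, eval_b = model(x), model(x)

print("train mode, outputs identical:", torch.equal(train_a, train_b))
print("eval mode, outputs identical: ", torch.equal(eval_a, eval_b))
print("train output tracks gradients:", train_a.requires_grad)
print("eval output tracks gradients: ", eval_a.requires_grad)
```

model.eval() changes what the layers compute, while torch.no_grad() changes what autograd records. Evaluation code normally wants both.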

PyTorch — Evaluation Mode, no_grad(), Early Stopping

🐍pytorch_validation.py

1import torch

PyTorch core: tensors with GPU support and automatic differentiation.

2import torch.nn as nn

Neural network building blocks: layers, loss functions, utilities.

3from torch.utils.data import DataLoader, TensorDataset

DataLoader for batching/shuffling, TensorDataset for pairing tensors.

5# Dataset and splits (same two-moons data)

Convert our NumPy arrays to PyTorch tensors. We assume X_train, y_train, X_val, y_val, X_test, y_test were already created by the NumPy code above.

6X_train_t = torch.tensor(X_train, dtype=torch.float32)

Convert NumPy training features to float32 tensor.

7y_train_t = torch.tensor(y_train, dtype=torch.long)

Convert labels to long integers (required by CrossEntropyLoss).

8X_val_t = torch.tensor(X_val, dtype=torch.float32)

Validation features as tensor.

9y_val_t = torch.tensor(y_val, dtype=torch.long)

Validation labels.

10X_test_t = torch.tensor(X_test, dtype=torch.float32)

Test features — only used at the very end.

11y_test_t = torch.tensor(y_test, dtype=torch.long)

Test labels.

13train_loader = DataLoader(..., batch_size=10, shuffle=True)

DataLoader for the training set only. Validation and test do not need DataLoader — we evaluate them in one pass since they are small enough to fit in memory.

EXECUTION STATE

→ why no val_loader? = Val/test sets are small (30 samples). We pass the full tensor to model() directly. DataLoader is only needed for large datasets or when shuffling/batching matters.
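When the validation set does not fit in one pass, the same evaluation can be accumulated over batches. This is a sketch under assumptions: evaluate_batched is a hypothetical helper, the data is random placeholder data instead of the two-moons split, and each batch loss is weighted by its size so an uneven last batch does not skew the mean.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate_batched(model, loader, criterion):
    """Average loss and accuracy over a DataLoader, weighting each batch by its size."""
    was_training = model.training
    model.eval()
    total_loss, total_correct, total_n = 0.0, 0, 0
    for Xb, yb in loader:
        logits = model(Xb)
        total_loss += criterion(logits, yb).item() * len(yb)   # undo the per-batch mean
        total_correct += (logits.argmax(dim=1) == yb).sum().item()
        total_n += len(yb)
    model.train(was_training)          # restore whichever mode the caller had set
    return total_loss / total_n, total_correct / total_n

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
X = torch.randn(30, 2)                 # placeholder data, not the two-moons set
y = torch.randint(0, 2, (30,))
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=False)  # batches of 8, 8, 8, 6

batched_loss, batched_acc = evaluate_batched(model, loader, criterion)
with torch.no_grad():
    full_loss = criterion(model(X), y).item()
print(f"batched loss={batched_loss:.6f}  full-pass loss={full_loss:.6f}")
```

The batched and full-pass losses agree up to floating-point rounding, so switching to a loader changes memory use but not the metric.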

17# Model

Define the same 2→64→2 architecture using PyTorch’s nn.Sequential.

18model = nn.Sequential(...) — Build model

nn.Sequential chains layers in order: Linear(2,64) → ReLU → Linear(64,2). Same architecture as the NumPy version but defined declaratively.

EXECUTION STATE

📚 nn.Sequential = Container that chains modules. model(x) calls each layer in order. No need for a custom forward() method.

→ parameters = Linear(2,64): 192. Linear(64,2): 130. Total: 322.

20criterion = nn.CrossEntropyLoss()

Same loss function: softmax + negative log likelihood combined.

21optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

SGD optimizer with same learning rate as NumPy.

23# Evaluation helper

The PyTorch equivalent of our NumPy evaluate function. Uses two key features: @torch.no_grad() and model.eval().

24@torch.no_grad() — Decorator: disable gradient tracking

This decorator wraps the entire function in a torch.no_grad() context. No computation graph is built inside this function, so: (1) forward pass is faster, (2) uses less memory, (3) gradients are not accumulated. Essential for evaluation.

EXECUTION STATE

📚 @torch.no_grad() = Decorator form of ‘with torch.no_grad():’. Disables autograd for all operations inside the function. The model’s parameters are NOT modified.

→ without this = PyTorch would build a computation graph for the val forward pass, wasting memory. On large models, this can cause OOM errors during evaluation.

25def evaluate(model, X, y): — Evaluate any split

Takes a model and any data split. Returns (loss, accuracy). Works for train, val, or test data.

EXECUTION STATE

⬇ input: model = The nn.Sequential model. We call model(X) to get predictions.

⬇ input: X = Feature tensor. Can be X_train_t (140,2), X_val_t (30,2), or X_test_t (30,2).

⬇ input: y = Label tensor. Same split as X.

⬆ returns = Tuple (loss_float, acc_float). Both are plain Python floats (detached from graph).

26model.eval() — Switch to evaluation mode

Sets the model to evaluation mode. This disables dropout and switches BatchNorm to use running statistics instead of batch statistics. Our model has neither, but calling eval() is a best practice — always do it before evaluation.

EXECUTION STATE

📚 model.eval() = Sets self.training = False for all submodules. Affects: Dropout (disabled), BatchNorm (uses running mean/var), and any custom layer that checks self.training.

→ why it matters = Dropout randomly zeros neurons during training (regularization) but NOT during eval. If you forget model.eval(), dropout masks the val predictions, giving wrong loss values.

27logits = model(X) — Forward pass (no graph built)

Runs the full forward pass on the data. Because of @torch.no_grad(), no computation graph is built, saving memory.

28loss = criterion(logits, y).item() — Compute and extract loss

Compute cross-entropy loss, then .item() extracts it as a Python float (detaching from graph).

29acc = (logits.argmax(dim=1) == y).float().mean().item()

Compute accuracy: predicted class == true class → boolean → float (0/1) → mean → Python float.

EXECUTION STATE

📚 .argmax(dim=1) = Index of max value along dim 1 (classes). Returns predicted class (0 or 1) for each sample.

.float() = Converts boolean tensor (True/False) to float (1.0/0.0) for computing mean.

30model.train() — Switch back to training mode

Restores the model to training mode (re-enables dropout, etc.) so the next training epoch works correctly. Always pair model.eval() with model.train().

31return loss, acc

Return the scalar loss and accuracy values.

33# Training with early stopping

Same early stopping logic as NumPy, but using PyTorch’s state_dict for checkpointing.

34best_val_loss = float('inf')

Initialize to infinity.

35best_epoch = 0

Track which epoch was best.

36patience, wait = 10, 0

Same patience=10 as NumPy. wait counts epochs without improvement.

37best_state = model.state_dict().copy() — Save initial state

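model.state_dict() returns an OrderedDict of all parameter tensors. Caution: .copy() on that dictionary is a shallow copy. It creates a new dictionary, but the tensors inside still share storage with the live parameters, so the snapshot changes whenever optimizer.step() updates the weights in place. For a snapshot that stays fixed, use copy.deepcopy(model.state_dict()) or clone each tensor. This replaces our manual W1.copy(), b1.copy() etc.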

EXECUTION STATE

📚 model.state_dict() = Returns {'0.weight': tensor(64,2), '0.bias': tensor(64), '2.weight': tensor(2,64), '2.bias': tensor(2)}. All model parameters as a dictionary.
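Whether a state_dict snapshot is truly independent of the live model can be checked directly. The sketch below uses a small nn.Linear as a stand-in model and an in-place add_ as a stand-in for an optimizer update; both are assumptions made for the demonstration.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(2, 2)

shallow = model.state_dict().copy()          # new dict, same tensor storage
deep = copy.deepcopy(model.state_dict())     # new dict, new tensors
before = model.weight.detach().clone()       # reference copy of the original weights

with torch.no_grad():
    model.weight.add_(1.0)                   # stand-in for an optimizer.step() update

shallow_followed = torch.equal(shallow["weight"], model.weight)
deep_kept = torch.equal(deep["weight"], before)
print("shallow copy followed the update:", shallow_followed)
print("deep copy kept the old weights:  ", deep_kept)
```

With the shallow copy, load_state_dict at the end of training would restore the final weights instead of the best ones, which silently defeats early stopping.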

39for epoch in range(100):

Same epoch loop. May stop early via break.

40model.train() — Ensure training mode at epoch start

Sets model to training mode. evaluate() already switches back to training mode before returning, but setting it explicitly here is a cheap safeguard.

41for Xb, yb in train_loader: — Training batches

Iterate mini-batches from DataLoader. 14 batches of 10 per epoch.

42logits = model(Xb) — Forward

Model predictions for this training batch.

43loss = criterion(logits, yb) — Loss

Cross-entropy loss for this batch.

44optimizer.zero_grad() — Clear gradients

Reset all gradient accumulators.

45loss.backward() — Compute gradients

Backpropagate through the model.

46optimizer.step() — Update weights

Apply SGD update.

48train_loss, train_acc = evaluate(model, X_train_t, y_train_t)

Evaluate on full training set. The evaluate function handles model.eval() and torch.no_grad() internally.

49val_loss, val_acc = evaluate(model, X_val_t, y_val_t)

Evaluate on validation set. This is the metric that drives early stopping.

51if val_loss < best_val_loss:

Check for improvement.

52best_val_loss = val_loss

Update best.

53best_epoch = epoch

Record best epoch.

54best_state = model.state_dict().copy() — Checkpoint best

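Save the current parameters as the new best checkpoint. state_dict() is the standard way to capture model state in PyTorch and replaces the manual per-tensor copying from the NumPy version. Note that .copy() here copies only the dictionary, not the tensors; copy.deepcopy(model.state_dict()) is needed for a snapshot that later updates cannot overwrite.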

55wait = 0

Reset counter.

56else:

No improvement.

57wait += 1

Increment patience counter.

59if wait >= patience:

Check if patience exhausted.

60print(f"Early stopping ...")

Announce stopping.

61break

Exit epoch loop.

63# Restore best model and evaluate on test set

Load the saved best weights back into the model and compute the final, unbiased test metric.

64model.load_state_dict(best_state) — Restore best weights

Replaces all model parameters with the saved best state. After this call, model is in its best-validated state, not the potentially overfit final state.

EXECUTION STATE

📚 model.load_state_dict(state) = Loads parameters from an OrderedDict into the model. Shape and key names must match. This is the standard way to restore checkpoints in PyTorch.

65test_loss, test_acc = evaluate(model, X_test_t, y_test_t)

THE FINAL NUMBER. Evaluate the best-validated model on completely held-out test data. This number is only computed once and is the unbiased estimate of real-world performance.

EXECUTION STATE

⬆ test_loss, test_acc = The metrics you report. Computed on data that never influenced training, hyperparameter selection, or model selection.

66print(f"Test: loss=..., acc=...")

Final output: the test performance. This is the number that goes in the paper.

EXECUTION STATE

output = Test: loss=X.XXXX, acc=XX%

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# ── Dataset and splits (same two-moons data) ──
# X_train, y_train, X_val, y_val, X_test, y_test are the NumPy arrays from the earlier split.
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.long)

train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t), batch_size=10, shuffle=True
)

# ── Model ──
model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2)
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# ── Evaluation helper ──
@torch.no_grad()
def evaluate(model, X, y):
    model.eval()   # inference behavior for Dropout / BatchNorm
    logits = model(X)
    loss = criterion(logits, y).item()
    acc = (logits.argmax(dim=1) == y).float().mean().item()
    model.train()  # hand the model back in training mode
    return loss, acc

# ── Training with early stopping ──
best_val_loss = float('inf')
best_epoch = 0
patience, wait = 10, 0
# Clone every tensor: state_dict().copy() is shallow and would track later updates.
best_state = {k: v.clone() for k, v in model.state_dict().items()}

for epoch in range(100):
    model.train()
    for Xb, yb in train_loader:
        logits = model(Xb)
        loss = criterion(logits, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss, train_acc = evaluate(model, X_train_t, y_train_t)
    val_loss, val_acc = evaluate(model, X_val_t, y_val_t)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
        wait = 0
    else:
        wait += 1

    if wait >= patience:
        print(f"Early stopping at epoch {epoch}. Best: {best_epoch}")
        break

# ── Restore best model and evaluate on test set ──
model.load_state_dict(best_state)
test_loss, test_acc = evaluate(model, X_test_t, y_test_t)
print(f"Test: loss={test_loss:.4f}, acc={test_acc:.0%}")

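The PyTorch version replaces manual weight copying (W1.copy()) with a snapshot of model.state_dict() and weight restoration with model.load_state_dict(best_state). The snapshot has to clone the tensors (or use copy.deepcopy), because model.state_dict().copy() on its own is a shallow copy whose tensors keep changing as training continues. The rest is structurally identical to the NumPy version.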
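The difference between a shallow and an independent snapshot of the state dict is easy to check directly. The following is a minimal sketch, assuming a tiny stand-in nn.Linear model and a hand-written in-place update in place of a real optimizer step; the variable names are illustrative only.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)  # stand-in model, not the tutorial's network

# Three candidate ways to "save" the current weights.
shallow = model.state_dict().copy()                             # new dict, same tensors
cloned = {k: v.clone() for k, v in model.state_dict().items()}  # independent tensors
deep = copy.deepcopy(model.state_dict())                        # independent tensors

original_weight = model.weight.detach().clone()

# Simulate what optimizer.step() does: update the parameter in place.
with torch.no_grad():
    model.weight.add_(1.0)

print("shallow copy unchanged:", torch.equal(shallow["weight"], original_weight))
print("cloned copy unchanged: ", torch.equal(cloned["weight"], original_weight))
print("deepcopy unchanged:    ", torch.equal(deep["weight"], original_weight))
```

Only the cloned and deep-copied snapshots still hold the pre-update weights; the shallow copy has followed the parameter, which is why it cannot serve as a "best model" checkpoint.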

Going further: `torch.inference_mode()`

Since PyTorch 1.9, there is a second, stricter alternative to no_grad(): torch.inference_mode(). Both disable autograd, but inference_mode also turns off the version counters and view tracking that autograd normally maintains to make in-place operations safe. Skipping that bookkeeping makes it measurably faster — roughly 5–15% for pure-inference workloads per the PyTorch 1.9 release notes — at the price of a stricter contract: any tensor created inside is an “inference tensor” and cannot later be re-enabled for gradient tracking.

Property	torch.no_grad()	torch.inference_mode()
Disables autograd	Yes	Yes
Disables version counters / view tracking	No	Yes
Output tensors can later enter autograd	Yes	No (must .clone() first)
Speed	Baseline	~5-15% faster on inference-heavy CPU / GPU code
Recommended for	Validation inside a training loop	Pure inference services; end-of-training test

torch.no_grad() vs torch.inference_mode() — same numbers, different contract

🐍no_grad_vs_inference_mode.py

1import torch

PyTorch core — we need it for tensors, the no_grad context, and the newer inference_mode context (added in PyTorch 1.9).

3x = torch.randn(3, requires_grad=False)

A length-3 random vector with no gradient tracking. requires_grad=False is the default for fresh tensors; making it explicit here shows that x itself is not part of any autograd graph.

EXECUTION STATE

📚 torch.randn(*shape, requires_grad) = Draws from the standard normal distribution N(0, 1) and returns a tensor.

⬇ arg: shape=(3,) = 1D tensor with 3 elements.

⬇ arg: requires_grad=False = Do not track operations on this tensor for autograd. Default; stated here for clarity.

⬆ x (example) = tensor([ 0.34, -1.12, 0.58]) (your values will differ — no seed)

6with torch.no_grad():

The standard context for validation since PyTorch 0.4 (2018). Inside the block, every operation is executed but no autograd graph is recorded. Outputs have requires_grad=False. You can still call .requires_grad_(True) on them afterwards, and you can still feed them into a later gradient-tracked computation.

EXECUTION STATE

📚 torch.no_grad() = Context manager that disables autograd recording. Used in every validation loop.

→ guarantees = No graph is built; memory for intermediate activations is freed as soon as possible.

7y_ng = x * 2

Element-wise multiply. Runs under no_grad, so y_ng has no grad_fn. Shape matches x.

EXECUTION STATE

y_ng (example) = tensor([ 0.68, -2.24, 1.16])

y_ng.requires_grad = False

8print("no_grad : requires_grad =", ..., "is_inference =", ...)

Prints both flags. Under no_grad, the output's is_inference() returns False — it is a regular tensor that merely happens to have no gradient. You could still pass it into an autograd-tracked computation later.

EXECUTION STATE

stdout = no_grad : requires_grad = False , is_inference = False

12with torch.inference_mode():

A stricter, faster context introduced in PyTorch 1.9. It disables autograd the same way no_grad does, but also disables version-counter bookkeeping and view-tracking — the internal machinery that makes autograd safe against in-place modifications. Skipping that bookkeeping makes inference_mode measurably faster than no_grad (especially for many small ops), at the cost of stricter rules on the resulting tensors.

EXECUTION STATE

📚 torch.inference_mode() = Context manager introduced in PyTorch 1.9 for the inference fast path.

→ what it gives up = Tensors created inside are 'inference tensors'. They cannot be saved for backward, cannot be used as leaves in a later autograd graph, and cannot have requires_grad flipped to True afterwards.

→ when to use = Production inference services; end-of-training test evaluation. Avoid inside the training loop's validation pass if any of those tensors need to flow back into gradient-tracked code.

13y_im = x * 2

Same computation, but the result is an inference tensor.

EXECUTION STATE

y_im (example) = tensor([ 0.68, -2.24, 1.16])

y_im.is_inference() = True — the tensor is marked as inference-only.

14print("inference_mode: requires_grad =", ..., "is_inference =", ...)

Output: is_inference is now True. That is the ONLY observable difference from no_grad at this point — until we try to re-enable grad tracking below.

EXECUTION STATE

stdout = inference_mode: requires_grad = False , is_inference = True

18print("outputs equal :", torch.allclose(y_ng, y_im))

Both contexts compute exactly the same numerical result. The difference is purely about what autograd is allowed to do with the outputs afterwards.

EXECUTION STATE

📚 torch.allclose(a, b) = Returns True iff every element-wise difference is within a tolerance (default rtol=1e-5, atol=1e-8).

stdout = outputs equal : True

21try: y_im.requires_grad_(True)

Attempt to retrofit gradient tracking onto the inference tensor. This is the strictness showing up: PyTorch refuses.

EXECUTION STATE

📚 Tensor.requires_grad_(flag) = In-place setter for the requires_grad flag.

→ attempted action = Turn y_im into a leaf tensor that accumulates gradients.

22except RuntimeError as e: print(...)

PyTorch raises a RuntimeError reporting that setting requires_grad=True on an inference tensor outside InferenceMode is not allowed. This strict contract is what gives inference_mode its speed: the C++ engine does not have to reason about whether an inference tensor might later participate in backward.

EXECUTION STATE

stdout = cannot flip requires_grad on inference tensor: RuntimeError

→ workaround if needed = If you need a tensor from inference_mode to re-enter autograd, make a differentiable copy: z = y_im.clone().requires_grad_(True). That breaks the inference-tensor lineage.

import torch

x = torch.randn(3, requires_grad=False)

# --- torch.no_grad(): disables autograd inside the block ---
with torch.no_grad():
    y_ng = x * 2
print("no_grad       : requires_grad =", y_ng.requires_grad,
      ", is_inference =", y_ng.is_inference())

# --- torch.inference_mode(): stricter — marks outputs as inference tensors ---
with torch.inference_mode():
    y_im = x * 2
print("inference_mode: requires_grad =", y_im.requires_grad,
      ", is_inference =", y_im.is_inference())

# Numerical results are identical:
print("outputs equal :", torch.allclose(y_ng, y_im))

# An inference tensor cannot be retro-fitted into a grad-tracked computation:
try:
    y_im.requires_grad_(True)
except RuntimeError as e:
    print("cannot flip requires_grad on inference tensor:",
          type(e).__name__)
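The .clone() workaround mentioned in the walkthrough can be sketched as follows. This is a minimal illustration, assuming the same toy computation as above; the names z and loss are introduced here only for the example.

```python
import torch

x = torch.randn(3)

with torch.inference_mode():
    y_im = x * 2  # inference tensor: cannot be given requires_grad=True

# Outside the context, clone() allocates an ordinary tensor, so gradient
# tracking can be switched on for the copy.
z = y_im.clone().requires_grad_(True)

loss = (z * 3).sum()
loss.backward()

print("y_im.is_inference():", y_im.is_inference())
print("z.is_inference():   ", z.is_inference())
print("z.grad:", z.grad)
```

The copy costs one extra allocation, so it only makes sense when a result computed for inference really does need to feed a later gradient-tracked computation.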

Rule of thumb: during the validation pass inside a training loop, stick with torch.no_grad(). For the final test evaluation or a production inference service, switch to torch.inference_mode(). Both must be paired with model.eval(): model.eval() fixes Dropout and BatchNorm behavior, while no_grad / inference_mode fixes autograd behavior.
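That model.eval() and no_grad() control two independent things can be seen in a small experiment. The sketch below assumes a standalone nn.Dropout layer and a small nn.Linear layer as stand-ins; neither is part of the tutorial's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# no_grad alone: autograd is off, but Dropout is still in training mode.
drop.train()
with torch.no_grad():
    out_train_mode = drop(x)

# eval + no_grad: Dropout becomes the identity.
drop.eval()
with torch.no_grad():
    out_eval_mode = drop(x)

zeros_in_train = int((out_train_mode == 0).sum())
zeros_in_eval = int((out_eval_mode == 0).sum())
print("zeroed entries, train mode under no_grad:", zeros_in_train)
print("zeroed entries, eval mode under no_grad: ", zeros_in_eval)

# eval alone: layer behavior is fixed, but autograd still records a graph.
layer = nn.Linear(4, 2).eval()
tracked = layer(torch.ones(1, 4))
with torch.no_grad():
    untracked = layer(torch.ones(1, 4))
print("eval only, requires_grad:     ", tracked.requires_grad)
print("eval + no_grad, requires_grad:", untracked.requires_grad)
```

Forgetting model.eval() gives noisy validation numbers; forgetting no_grad() gives correct numbers at a higher memory cost.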

Connection to Modern Systems

The validation concepts from this section are used at every scale of modern ML:

LLM Evaluation

Large language models like GPT-4 and LLaMA are evaluated on held-out text that was never part of the training corpus. The primary metric is perplexity on validation data: $\text{PPL} = \exp(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t | x_{<t}))$ . Lower perplexity means the model assigns higher probability to the correct next tokens. Chinchilla (Hoffmann et al., 2022) used validation loss scaling laws to determine the optimal balance of model size vs. training data size.
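As a worked instance of the perplexity formula, the sketch below computes PPL from a list of per-token log-probabilities. The probabilities are made-up values chosen for illustration, not the output of any real model, and the helper name perplexity is an assumption of this example.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the observed tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values of log P(x_t | x_<t) on a held-out sequence of 8 tokens.
uniform_over_4 = [math.log(0.25)] * 8  # model is guessing among 4 options
confident = [math.log(0.9)] * 8        # model is nearly sure of each token

print("guessing among 4 tokens:", perplexity(uniform_over_4))
print("confident model:        ", perplexity(confident))
```

A model that spreads probability uniformly over 4 candidates has perplexity of about 4, which is why perplexity is often read as an effective branching factor; the confident model comes out close to 1.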

Benchmark Contamination

A major concern in LLM evaluation is benchmark contamination — when test data accidentally appears in the training corpus. If GPT-4's training data included questions from a benchmark, its score on that benchmark is meaningless (just like a student who memorized the exam answers). Modern evaluations use contamination detection and held-out dynamic benchmarks.
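One simple heuristic for contamination detection is to look for long word n-grams shared between benchmark items and training documents. The sketch below is an illustration under stated assumptions, not the procedure of any particular lab: the function names, the whitespace tokenization, and the toy strings are all hypothetical, and n is kept small only so the toy example triggers.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged


# Toy data, invented for this example.
training_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
benchmark_items = [
    "what animal jumps over the lazy dog in the story",
    "how many moons does mars have",
]

rate, flagged = contamination_rate(benchmark_items, training_docs, n=4)
print("contamination rate:", rate)
print("flagged items:", flagged)
```

Exact n-gram matching misses paraphrased or translated leaks, so it gives a lower bound on contamination rather than a clean bill of health.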

Cross-Validation

When data is scarce, k-fold cross-validation maximizes use of available data. Split data into $k$ equal folds. Train on $k-1$ folds, validate on the remaining one. Repeat $k$ times, each fold serving as validation once. The average validation score across all folds is a more reliable estimate than a single split. Common choice: $k = 5$ or $k = 10$ .
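The fold bookkeeping described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a hypothetical helper k_fold_indices; the model training step is left out, and in practice a library routine would usually be preferred.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


n_samples, k = 10, 5
val_counts = np.zeros(n_samples, dtype=int)
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples, k)):
    val_counts[val_idx] += 1
    # A real run would train on train_idx and score on val_idx here.
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} val")

print("times each sample was used for validation:", val_counts)
```

Averaging the k validation scores then gives the cross-validated estimate; the test set stays outside this loop entirely.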

Transformer Early Stopping

Pre-training large transformers typically does NOT use early stopping: the models are trained for a fixed token budget (LLaMA-65B: 1.4T tokens, Chinchilla: 1.4T tokens). Instead, early stopping is used during fine-tuning, where a pre-trained model is adapted to a specific task with a small dataset. Fine-tuning is highly prone to overfitting (billions of parameters, thousands of training examples), making early stopping essential.

The three-way split is not just a technique — it is a scientific principle. Every claim about model performance must be backed by evaluation on truly held-out data. Without this, we are measuring memorization, not intelligence.

Summary

In this section, we learned how to measure and prevent overfitting:

Generalization gap: $L_{\text{test}} - L_{\text{train}}$ measures how much the model has overfit. A small gap indicates good generalization
Three-way split: Train (learn parameters), validation (tune hyperparameters, detect overfitting), test (final unbiased estimate). The test set must NEVER influence any training decision
Bias-variance tradeoff: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$ . Underfitting = high bias. Overfitting = high variance. The optimal model complexity minimizes total error
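Early stopping: Monitor val loss, stop after patience epochs without improvement, restore best model. Mathematically equivalent to L2 regularization in certain settings (for example, a quadratic loss trained with plain gradient descent)
PyTorch evaluation: Always use model.eval() and @torch.no_grad() during evaluation. eval() switches Dropout off and makes BatchNorm use its running statistics; no_grad() skips building the autograd graph, which saves memory and time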

In the next section, we will expand our monitoring beyond simple loss tracking to include learning curves, confusion matrices, precision/recall, and other metrics that give deeper insight into what the model is learning and where it fails.

References

Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, Series B 36(2), 111–147. The paper that formalized k-fold cross-validation.
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995. Introduces stratified k-fold and compares it to plain k-fold on imbalanced data.
Bengio, Y. & Grandvalet, Y. (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. JMLR 5, 1089–1105.
Varma, S. & Simon, R. (2006). Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91. Motivates nested cross-validation.
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11, 2079–2107. Canonical reference for nested cross-validation.
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data 6(4), Article 15.
Prechelt, L. (1998). Early Stopping — But When? In Neural Networks: Tricks of the Trade (LNCS 1524), 55–69. Classic reference on early-stopping criteria.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Springer. Chapter 7 gives the canonical bias-variance decomposition and cross-validation treatment.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, MIT Press. Chapters 5 and 7.8 cover generalization, cross-validation, and early stopping.
PyTorch 1.9 Release Notes. Introduction of torch.inference_mode. https://github.com/pytorch/pytorch/releases/tag/v1.9.0
