Explain why a model that performs well on training data can fail on new data, and formalize this as the generalization gap
Implement a correct train / validation / test split and explain the role of each subset
Derive the bias-variance decomposition: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$
Implement early stopping, one of the simplest and most widely used regularization techniques
Use PyTorch's model.eval() and torch.no_grad() correctly for evaluation
Where We Left Off
In Section 2, we built a complete training loop: forward → loss → backward → clip → update, with learning rate scheduling and checkpointing. But our entire focus was on reducing the training loss. We measured success by how well the model fit the training data.
This creates a dangerous blind spot. A model with enough capacity can memorize the training data — achieving near-zero training loss — while being completely useless on new data. This is overfitting, and detecting it requires data that the model has never seen during training.
The Central Question: Training loss tells you how well the model memorizes. But what you actually care about is how well it generalizes. How do you measure something you have never seen?
Why Validation? The Overfitting Problem
Consider a student preparing for an exam. They study the practice problems until they can solve each one perfectly. On exam day, they face new problems and fail. The student memorized the answers instead of learning the concepts. Neural networks can fail in exactly the same way.
Formally, the generalization gap is the difference between the model's performance on training data and on new, unseen data:
$$\text{Generalization Gap} = L_{\text{test}}(\theta) - L_{\text{train}}(\theta)$$
A model that has memorized the training data has $L_{\text{train}} \approx 0$ but potentially large $L_{\text{test}}$, so the gap is wide. A model that has learned the underlying patterns has both losses small and roughly equal, so the gap is narrow.
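To make the gap concrete, here is a minimal sketch that uses a pure memorizer: a 1-nearest-neighbor lookup on a toy dataset with label noise. The data distribution, the 20% label-flip rate, and the helper names `sample` and `one_nn_predict` are assumptions made for illustration, and the loss here is the 0-1 error rate rather than cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, flip=0.2):
    """Two overlapping Gaussian blobs; a fraction `flip` of labels is flipped (label noise)."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    noisy = rng.random(n) < flip
    return X, np.where(noisy, 1 - y, y)

def one_nn_predict(X_train, y_train, X_query):
    """A pure memorizer: copy the label of the nearest training point."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]

X_train, y_train = sample(100)
X_test, y_test = sample(2000)   # fresh draw from the same distribution

# 0-1 loss on the data the model memorized vs. on data it has never seen
l_train = float(np.mean(one_nn_predict(X_train, y_train, X_train) != y_train))
l_test = float(np.mean(one_nn_predict(X_train, y_train, X_test) != y_test))
print(f"L_train={l_train:.3f}  L_test={l_test:.3f}  gap={l_test - l_train:.3f}")
```

Every training point is its own nearest neighbor, so the training error is zero by construction, noisy labels included. The test error is what reveals how little of that fit carries over.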
Why does overfitting happen?
Every dataset is a sample from an underlying data distribution. The training data contains the true pattern plus noise (measurement error, label noise, finite sampling). A model with enough parameters can fit both the pattern and the noise. On new data, the noise is different, so the noise-fitting part hurts rather than helps.
The risk of overfitting increases when:
Model is too complex for the data (more parameters than data points)
Training data is too small (easier to memorize)
Training runs too long (model eventually memorizes after learning the pattern)
No regularization is used (nothing prevents memorization)
The Three-Way Split
The solution is to split your data into three disjoint subsets, each with a specific role:
| Subset | Typical Size | Purpose | Used During |
| --- | --- | --- | --- |
| Training set | 60–80% | Train the model (compute gradients, update weights) | Every epoch |
| Validation set | 10–20% | Monitor overfitting, tune hyperparameters, select best model | After each epoch (no gradients) |
| Test set | 10–20% | Final, unbiased performance estimate | Once, at the very end |
The mathematical guarantee: if the test set is drawn i.i.d. from the same distribution as the training data, and is never used for any training decision, then the test loss is an unbiased estimator of the true generalization error:
$$\mathbb{E}[L_{\text{test}}] = \mathbb{E}_{(x,y)\sim P_{\text{data}}}\big[\ell(f_\theta(x), y)\big]$$
The moment you use test data to make any decision (choosing hyperparameters, selecting which epoch to stop at, comparing model architectures), the test loss becomes a biased, optimistic estimate. This is why the validation set exists: it absorbs all the selection bias so the test set stays clean.
The Cardinal Rule: The test set must never influence any decision during training. Not the learning rate, not the architecture, not when to stop, not which model to select. Violating this rule is called data leakage and makes your reported performance unreliable.
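The optimism that comes from selecting on the test set can be seen in a small simulation. This is a sketch under deliberately extreme assumptions: labels are coin flips and all 200 hypothetical "models" guess at random, so the true accuracy of every model is exactly 50%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_test = 200, 100

# Labels are fair coin flips, so no model can truly beat 50% accuracy.
y_test = rng.integers(0, 2, n_test)
# Each hypothetical "model" just guesses at random.
preds = rng.integers(0, 2, (n_models, n_test))

test_acc = (preds == y_test).mean(axis=1)
best = int(test_acc.argmax())   # model selection that peeks at the test set
print(f"best accuracy on the test set used for selection: {test_acc[best]:.2f}")

# Score the selected model on fresh data that played no part in the selection.
y_fresh = rng.integers(0, 2, 10_000)
fresh_preds = rng.integers(0, 2, 10_000)   # the same guessing "model" on new inputs
fresh_acc = float((fresh_preds == y_fresh).mean())
print(f"accuracy of the selected model on fresh data: {fresh_acc:.2f}")
```

The winner looks clearly better than chance on the set it was selected on, and falls back to roughly 50% on fresh data. Selection on a validation set is biased in the same way, which is why the final number must come from a set that was never used to choose anything.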
Common Split Ratios
| Dataset Size | Recommended Split (train / val / test) | Reasoning |
| --- | --- | --- |
| < 1,000 | 70 / 15 / 15 | Small data: need enough train samples, but val/test must be meaningful |
| 1,000–100,000 | 80 / 10 / 10 | Moderate data: can afford smaller val/test fractions |
| > 100,000 | 98 / 1 / 1 | Large data: even 1% is thousands of samples, plenty for reliable estimates |
| ImageNet (ILSVRC-2012, ~1.28M train) | ~90 / 3 / 7 | 50K val, 100K test; fixed by the benchmark |
Shuffling before splitting is critical. If your data has any ordering (e.g., all cats first, then all dogs), splitting without shuffling puts all cats in training and all dogs in test. The model has never seen a dog during training and fails completely.
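A shuffle-then-split helper can be sketched in a few lines of NumPy. The function name `train_val_test_split`, its default fractions, and the ordered "cats then dogs" toy data are assumptions for illustration, not part of any library.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut the permutation into three disjoint subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    n_val = int(round(val_frac * len(X)))
    test_idx = perm[:n_test]
    val_idx = perm[n_test:n_test + n_val]
    train_idx = perm[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Ordered toy data: 100 "cats" (label 0) followed by 100 "dogs" (label 1).
X = np.arange(200, dtype=float).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = train_val_test_split(X, y)
print("sizes:", len(train_y), len(val_y), len(test_y))
print("dog fraction per subset:",
      round(float(train_y.mean()), 2), round(float(val_y.mean()), 2), round(float(test_y.mean()), 2))
```

Because all three index sets are slices of one permutation, they are disjoint by construction, and each subset receives a mix of both classes even though the source data was sorted by label.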
Data Leakage: The Silent Killer
A split only protects you if no information from the val/test sets sneaks back into training. When it does, your validation metrics look artificially good in the lab and then collapse in production. Kaufman et al. (2012) formalize this as data leakage and document cases of it in well-known data-mining competitions.
Leakage comes in four distinct flavors. The first two come from the data itself (which features and which rows you use); the last two come from the preprocessing pipeline and the selection process that wrap the model.
| Type | What it looks like | Fix |
| --- | --- | --- |
| Target leakage | A feature that is only available AFTER the prediction time of the label (e.g., 'was the loan paid off' as a feature to predict 'will the loan default'). | Audit features for a strict chronology: would this value exist at inference time? |
| Train/test contamination | Duplicate or near-duplicate rows exist in both train and test (e.g., the same image under a different filename, the same user's two sessions). | Deduplicate on a stable identifier before splitting; prefer group-based splits (e.g., GroupKFold by user id). |
| Preprocessing leakage | A scaler, imputer, PCA, or encoder is fit on the FULL dataset before splitting, so test statistics influence training features. | Fit preprocessing on TRAIN ONLY, then apply to val/test. scikit-learn Pipelines do this correctly by construction. |
| Validation-set tuning leakage | Tuning thresholds, hyperparameters, or architecture repeatedly against the same validation set eventually overfits it. The test set stays untouched to catch this. | Keep a sealed test set. If you must iterate heavily, use k-fold CV (Kohavi 1995) or nested CV (Cawley & Talbot 2010) for honest estimates. |
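The group-based split recommended for train/test contamination can be sketched with scikit-learn's `GroupKFold`. The user ids, features, and labels below are made up for illustration; the point is only that no user ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 rows from 4 users, 3 sessions each.
user_id = np.repeat(np.array([101, 102, 103, 104]), 3)
X = np.arange(12, dtype=float).reshape(-1, 1)
y = np.tile([0, 1, 0], 4)

overlaps = []
splitter = GroupKFold(n_splits=4)
for fold, (tr_idx, va_idx) in enumerate(splitter.split(X, y, groups=user_id)):
    # Users that appear in BOTH the training and validation rows of this fold
    shared = set(user_id[tr_idx].tolist()) & set(user_id[va_idx].tolist())
    overlaps.append(len(shared))
    val_users = sorted(set(user_id[va_idx].tolist()))
    print(f"fold {fold}: val users={val_users}  users shared with train={len(shared)}")
```

A plain row-level shuffle would scatter each user's sessions across train and validation, letting the model recognize users instead of learning the task.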
The preprocessing case is the one most practitioners have accidentally committed. Here is the smallest example that shows what it hides:
StandardScaler leakage: a 20-line demo that exposes an 87 σ outlier (scaler_leakage.py)
import numpy as np
NumPy gives us fast array math so we can compute means, stds, and element-wise differences without Python loops.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
Six one-dimensional samples. The first five look innocuous; the last one (100.0) is an extreme outlier that will end up in the test set. Real-world analogues: a rare fraud transaction, a failure mode with catastrophic magnitude, a mis-calibrated sensor reading.
EXECUTION STATE
X = array([1., 2., 3., 4., 5., 100.])
X_train, X_test = X[:4], X[4:]
Slice the first four into training, the last two into testing. The outlier (100.0) lives exclusively in the test set, which is what simulates 'the future we cannot see yet.'
EXECUTION STATE
X_train = array([1., 2., 3., 4.])
X_test = array([ 5., 100.])
mu_full, std_full = X.mean(), X.std()
WRONG PATH: compute scaler statistics on the ENTIRE X, including the test outlier. This is the textbook leakage mistake: your preprocessing step has seen data it will be evaluated on.
→ why bad = std_full is dominated by the outlier (|100 - 19.17|² = 6534). The scaler you trained can only produce these particular stats BECAUSE it peeked at the test set.
X_train_wrong = (X_train - mu_full) / std_full
→ what you see = Training features look almost constant. The scaler has hidden the real variation inside the training set, because the outlier's distance blew up the std.
X_test_wrong = (X_test - mu_full) / std_full
→ the danger = The outlier LOOKS like a modest 2.2 σ, nothing alarming. In the leaked world you trained your model inside, a 2 σ sample is routine. Your model will happily predict on it and its error estimate will be too rosy.
X_test_right = (X_test - mu_train) / std_train
→ the signal the leaked version hid = The outlier now shows up as 87 σ above the training mean. Your model is extrapolating wildly. This is a signal: you can detect it, decide to clip, to retrain with more data, or to route this example to a human reviewer. The leaked-stats version silently hid this signal.
→ bottom line = Always fit scalers, imputers, PCA, encoders, feature selectors (every preprocessing step that learns parameters) on TRAIN ONLY, then APPLY to val / test. In scikit-learn: fit_transform on train, transform on test. In PyTorch: compute stats on train_loader, apply as a frozen torch.nn.Module.
```python
import numpy as np

# 6 samples, one giant outlier waiting in the test set
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
X_train, X_test = X[:4], X[4:]  # train = [1,2,3,4], test = [5, 100]

# --- WRONG: fit the scaler on the entire X (leaks test info) ---
mu_full, std_full = X.mean(), X.std()
X_train_wrong = (X_train - mu_full) / std_full
X_test_wrong = (X_test - mu_full) / std_full

# --- RIGHT: fit the scaler on X_train only, then apply to X_test ---
mu_train, std_train = X_train.mean(), X_train.std()
X_train_right = (X_train - mu_train) / std_train
X_test_right = (X_test - mu_train) / std_train

print(f"full-data stats: mean={mu_full:.2f} std={std_full:.2f}")
print(f"train-only stats: mean={mu_train:.2f} std={std_train:.2f}")
print(f"WRONG X_test: {np.round(X_test_wrong, 3)}")
print(f"RIGHT X_test: {np.round(X_test_right, 3)}")
```
Why this matters: the leaked scaler does not just add a small bias. It can systematically hide the out-of-distribution samples that your production monitoring most needs to see. A model trained on leaked features looks miraculous in offline evaluation and fails quietly in deployment.
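In scikit-learn, one way to get the train-only fit without bookkeeping is to put the scaler inside a pipeline and cross-validate the pipeline as a whole. The sketch below assumes a toy two-blob dataset whose second feature is on a much larger scale; the data and the choice of `LogisticRegression` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumed toy data: two Gaussian blobs, second feature scaled up by 1000.
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))]) * [1.0, 1000.0]
y = np.array([1] * 50 + [0] * 50)

# The scaler lives inside the pipeline, so each CV fold refits it on that
# fold's training rows only; validation rows are transformed, never fitted on.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The leaky alternative would call `StandardScaler().fit_transform(X)` on the full array first and then cross-validate the classifier alone, which lets every validation fold contribute to the scaling statistics.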
K-Fold Cross-Validation
A single train / val / test split is wasteful when data is scarce. You train on 70%, evaluate on 15%, and your estimate rests entirely on whichever 15% happened to land in the validation set, so it is noisy. K-fold cross-validation (Stone 1974) spreads that evaluation across the entire dataset by rotating every example through the validation role exactly once.
The dataset is partitioned into $K$ equal-sized folds $D_1, \dots, D_K$. For each $j \in \{1, \dots, K\}$, fit a fresh model $f^{-j}$ on the union of the other folds and score it on $D_j$. The CV estimate of generalization error is the mean of those fold scores:
$$\mathrm{CV}_K(f) = \frac{1}{K} \sum_{j=1}^{K} L\big(f^{-j}; D_j\big)$$
The standard choices are $K=5$ (fast, slight upward bias in the error estimate) and $K=10$ (slower, lower bias, often higher variance of the estimate). Going all the way to $K=N$ is leave-one-out CV: almost unbiased, but the $N$ training sets overlap so heavily that the fold estimates are strongly correlated and the variance of their mean can be large; Bengio & Grandvalet (2004) show that this variance cannot be estimated without bias. In applied deep learning the cost of refitting a large model $K$ times usually rules out K-fold entirely; it is most useful with small datasets or classical models.
K-fold CV by hand: the fold loop, made visible (kfold_from_scratch.py)
import numpy as np
All the math (sampling, distances, means) is vectorized through NumPy.
np.random.seed(0)
Seed the data-generating RNG so the 40 samples are reproducible. k-fold CV is about estimator variance; we want to isolate that from dataset randomness.
X = np.vstack([...])
Stack two arrays top-to-bottom. The result is a (40, 2) matrix: 20 class-1 points near (+1.5, +1.5), 20 class-0 points near (-1.5, -1.5). Well-separated but with some overlap (since σ = 1 for each Gaussian, the distributions touch in the middle).
EXECUTION STATE
📚 np.vstack(tup) = Stack arrays in sequence vertically (row-wise). Equivalent to np.concatenate with axis=0.
y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])
Labels aligned with X. 20 ones followed by 20 zeros. The labels are perfectly ordered here, which is exactly why we will shuffle before splitting into folds.
EXECUTION STATE
y[:5] = [1, 1, 1, 1, 1]
y[-5:] = [0, 0, 0, 0, 0]
def mean_classifier(X_tr, y_tr, X_va, y_va):
A tiny, fit-in-two-lines classifier: compute the mean of each class on training data; at validation time predict whichever class mean is closer. Choosing a weak model keeps the focus on the CV mechanics, not on model performance.
EXECUTION STATE
⬇ input: X_tr = Training features. Shape (~32, 2) for 5-fold CV on 40 samples.
⬇ input: y_tr = Training labels aligned with X_tr.
⬇ input: X_va = Validation features for this fold. Shape (8, 2).
⬇ input: y_va = Validation labels for this fold.
⬆ returns = float — accuracy on the validation fold.
mu0 = X_tr[y_tr == 0].mean(axis=0)
Boolean indexing: select training rows whose label is 0, then column-wise mean. Produces a 2-vector — the centroid of class 0 in the training set.
EXECUTION STATE
X_tr[y_tr == 0] = all training points in class 0
axis=0 = reduce along the ROW axis — take the mean of each feature column independently
mu0 = approximately (-1.5, -1.5) — training centroid of class 0
mu1 = X_tr[y_tr == 1].mean(axis=0)
Same for class 1.
EXECUTION STATE
mu1 = approximately (+1.5, +1.5) — training centroid of class 1
d0 = np.linalg.norm(X_va - mu0, axis=1)
Distance from each validation point to the class-0 centroid. np.linalg.norm with axis=1 computes the row-wise L2 norm, returning a 1-D vector of length = len(X_va).
EXECUTION STATE
📚 np.linalg.norm(a, axis) = Vector / matrix norm. Default is L2. With axis=1 it treats each row as a vector and returns the per-row norm.
d0 (shape) = (8,) — one distance per validation sample
d1 = np.linalg.norm(X_va - mu1, axis=1)
Distance from each validation point to the class-1 centroid.
pred = (d1 < d0).astype(int)
Element-wise comparison gives a boolean vector: True where class 1's centroid is closer. Cast to int so pred ∈ {0, 1}.
EXECUTION STATE
(d1 < d0) = length-8 bool array
.astype(int) = turns True → 1, False → 0
return float((pred == y_va).mean())
(pred == y_va) produces a bool array where True marks correct predictions. .mean() averages True = 1, False = 0 — so the result is the fraction correct. Cast to float so downstream code doesn't mix in NumPy scalars.
EXECUTION STATE
⬆ return: accuracy = e.g. 0.875 — 7 of 8 validation samples correct
np.random.seed(1)
A SEPARATE seed for the CV shuffling, distinct from the data-generation seed. This mirrors real practice: you use one seed to create a reproducible dataset, and a different seed to produce a reproducible split for an experiment. Changing only the CV seed lets you see how sensitive your conclusions are to the specific fold assignment.
perm = np.random.permutation(len(X))
Shuffle indices 0..39 into a random order. We will later slice this permutation into 5 contiguous chunks, each of which serves once as the validation fold.
EXECUTION STATE
📚 np.random.permutation(n) = Returns a random permutation of arange(n).
→ why shuffle = Without this, folds 0 and 1 would be ALL class 1, fold 2 would be half and half, and folds 3 and 4 would be ALL class 0. Most validation folds would then contain a single class, and every training set would be noticeably imbalanced. Shuffling ensures roughly balanced folds.
K = 5
Number of folds. K = 5 and K = 10 are the two overwhelmingly common choices. K = 5 uses 80% of the data per training fold; K = 10 uses 90%. More folds → less bias (each training set resembles the full dataset more closely) but higher variance in the estimate (the training sets overlap more, so fold scores are more correlated) and more compute.
fold_size = len(X) // K
Integer division. If N does not divide evenly by K, some folds must be larger or the remainder must be handled separately; sklearn's KFold makes the first N mod K folds one sample larger. Our educational version just uses floor division, and since 40 divides evenly by 5, every fold has exactly 8 samples.
EXECUTION STATE
fold_size = 40 // 5 = 8 samples per fold
accs = []
Collect one accuracy per fold.
for k in range(K):
The five-fold rotation. Each iteration pops fold k out as the validation set and trains on the other four folds.
LOOP TRACE · 5 iterations
k=0
va_idx = perm[0:8] = first 8 shuffled indices
tr_idx = perm[8:40] = remaining 32 indices
acc (typical) = ≈ 1.000 — clean fold
k=1
va_idx = perm[8:16]
tr_idx = perm[0:8] ∪ perm[16:40]
acc (typical) = ≈ 0.875 — one validation point sits in the overlap zone
k=2
va_idx = perm[16:24]
acc (typical) = ≈ 1.000
k=3
va_idx = perm[24:32]
acc (typical) = ≈ 0.875
k=4
va_idx = perm[32:40]
acc (typical) = ≈ 1.000
va_idx = perm[k*fold_size:(k+1)*fold_size]
The 8 indices that are validation this round. A contiguous slice of the shuffled permutation.
acc = mean_classifier(X[tr_idx], y[tr_idx], X[va_idx], y[va_idx])
Train a FRESH classifier on the training indices (our toy version just recomputes the class means; a real classifier would re-fit from scratch here) and score it on the validation fold.
EXECUTION STATE
→ key principle = Every fold trains a NEW model from scratch. NEVER carry training state from fold to fold — that leaks information across folds.
cv_mean, cv_std = float(np.mean(accs)), float(np.std(accs))
Two numbers summarize the CV result. cv_mean is your generalization estimate. cv_std is the variability across folds, a proxy for how much your accuracy depends on which 20% of the data you held out.
EXECUTION STATE
cv_mean = 0.950 — average of the 5 fold accuracies
print(f"5-fold CV accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
The single most useful summary in applied ML: 'mean ± std over K folds'. Two models with mean accuracy 0.93 and 0.94 but stds 0.02 and 0.08 are VERY different beasts: the first is reliably good, the second might be randomly lucky. Always report both.
EXECUTION STATE
stdout = 5-fold CV accuracy: 0.950 +/- 0.061
→ interpretation = Generalization accuracy is ~0.95 with per-fold scatter ~0.06. If you re-split with a different seed and get 0.94 ± 0.07, your estimate is stable. If you get 0.80 ± 0.15, the model is sensitive to the split and you need more data or a simpler model.
```python
import numpy as np

# Two Gaussian blobs in 2D: class 1 near (+1.5, +1.5), class 0 near (-1.5, -1.5)
np.random.seed(0)
X = np.vstack([
    np.random.randn(20, 2) + np.array([1.5, 1.5]),
    np.random.randn(20, 2) + np.array([-1.5, -1.5]),
])
y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])

def mean_classifier(X_tr, y_tr, X_va, y_va):
    """Predict the class whose training mean is nearer (nearest-centroid)."""
    mu0 = X_tr[y_tr == 0].mean(axis=0)
    mu1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_va - mu0, axis=1)
    d1 = np.linalg.norm(X_va - mu1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_va).mean())

# Shuffle once, then walk K equal-sized folds
np.random.seed(1)
perm = np.random.permutation(len(X))  # 40 indices in random order

K = 5
fold_size = len(X) // K  # 8 per fold
accs = []
for k in range(K):
    va_idx = perm[k*fold_size:(k+1)*fold_size]
    tr_idx = np.concatenate([perm[:k*fold_size], perm[(k+1)*fold_size:]])
    acc = mean_classifier(X[tr_idx], y[tr_idx], X[va_idx], y[va_idx])
    accs.append(acc)
    print(f"fold {k}: train={len(tr_idx):2d} val={len(va_idx):2d} acc={acc:.3f}")

cv_mean, cv_std = float(np.mean(accs)), float(np.std(accs))
print(f"5-fold CV accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
```
Stratified K-Fold
Plain K-fold splits purely by index. If the class proportions are skewed (say 90% class 0, 10% class 1), a random fold might by chance contain no minority samples at all. Its metrics become meaningless and your CV estimate picks up spurious variance. Stratified K-fold (Kohavi 1995) constrains every fold to mirror the overall class ratio. It is the default for classification and should be a reflex:
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
The three standard tools. KFold and StratifiedKFold produce splits; cross_val_score ties together split + fit + score.
EXECUTION STATE
📚 KFold(n_splits, shuffle, random_state) = A splitter whose split(X) method yields (train_idx, val_idx) pairs. Ignores labels and just splits by index.
📚 StratifiedKFold(...) = Same interface, but each fold preserves the class proportions of the full dataset. Uses y when generating splits.
📚 cross_val_score(estimator, X, y, cv, scoring) = Does the full loop internally: split → clone estimator → fit on train → score on val → record. Returns array of per-fold scores.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
Build the plain K-fold splitter. shuffle=True randomizes before splitting (otherwise you get consecutive slices). random_state pins the shuffle for reproducibility.
EXECUTION STATE
⬇ arg: n_splits=5 = K — number of folds.
⬇ arg: shuffle=True = Shuffle indices before slicing into folds. Almost always what you want; only set False for time-series (see TimeSeriesSplit).
⬇ arg: random_state=1 = Seed for the shuffle. Pin it for reproducible experiments; vary it to measure split-to-split variance.
acc_plain = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring="accuracy")
The full CV pipeline in one call. For each (train_idx, val_idx) yielded by kf, cross_val_score clones the estimator (so training state does NOT leak across folds), fits, scores, and appends to a 1-D array of length n_splits.
EXECUTION STATE
⬇ arg: estimator=LogisticRegression() = Unfit prototype. A fresh clone is used per fold.
⬇ arg: scoring='accuracy' = Metric to evaluate each fold. Swap in 'f1', 'roc_auc', 'neg_log_loss', ... as needed.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
The stratified variant. Each fold is built so that class proportions match the full dataset. For our 20/20 data the effect is subtle, but on a 90/10 imbalance plain K-fold can produce small validation folds that just happen to be all-majority, wrecking the fold's score. StratifiedKFold never does that as long as the minority class has at least as many samples as there are folds.
EXECUTION STATE
→ when it matters = ALWAYS use StratifiedKFold for classification with class imbalance. It is the default in cross_val_score when you pass an integer cv for a classifier (cv=5 uses StratifiedKFold under the hood). Making it explicit removes surprise.
acc_strat = cross_val_score(LogisticRegression(), X, y, cv=skf, scoring="accuracy")
Same call, different splitter. Produces folds whose class ratios all match 50/50 here.
On balanced data, the two splitters agree within noise. On imbalanced data you would see StratifiedKFold's std drop sharply because no fold ends up with a degenerate class mix.
→ practical rule = Classification ⇒ StratifiedKFold. Regression ⇒ plain KFold (stratification needs discrete labels). Time series ⇒ TimeSeriesSplit (no shuffling; forward-walk splits).
```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.vstack([
    np.random.randn(20, 2) + [1.5, 1.5],
    np.random.randn(20, 2) + [-1.5, -1.5],
])
y = np.concatenate([np.ones(20, dtype=int), np.zeros(20, dtype=int)])

# Plain K-fold: ignores labels when splitting
kf = KFold(n_splits=5, shuffle=True, random_state=1)
acc_plain = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring="accuracy")
print("plain      :", np.round(acc_plain, 3),
      f"mean {acc_plain.mean():.3f} +/- {acc_plain.std():.3f}")

# Stratified K-fold: preserves class ratio inside EVERY fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
acc_strat = cross_val_score(LogisticRegression(), X, y, cv=skf, scoring="accuracy")
print("stratified :", np.round(acc_strat, 3),
      f"mean {acc_strat.mean():.3f} +/- {acc_strat.std():.3f}")
```
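The balanced 20/20 data above hides what stratification buys you. As a rough illustration on an assumed 90/10 label vector (features are placeholders, and `minority_per_fold` is a helper defined here, not a scikit-learn function), the sketch below counts how many minority samples land in each validation fold under the two splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed toy labels: 90 majority (0) and 10 minority (1) samples.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for counting

def minority_per_fold(splitter):
    """Count minority-class samples in each validation fold."""
    return [int(y[va_idx].sum()) for _, va_idx in splitter.split(X, y)]

plain = minority_per_fold(KFold(n_splits=5, shuffle=True, random_state=1))
strat = minority_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=1))
print("plain      :", plain)
print("stratified :", strat)
```

The stratified splitter places exactly two minority samples in every fold, while the plain splitter's counts depend on the shuffle and can leave a fold with very few or none.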
Rule of thumb for neural nets: k-fold CV is invaluable for small datasets (under ~10 000 samples) or when comparing model families. For deep networks trained for hours per run on large datasets, a single honest train / val / test split is the norm; the training noise per run dominates split noise anyway. Save k-fold for preprocessing-pipeline selection, hyperparameter sweeps on small data, and methodological comparisons that will make it into a paper.
The Bias-Variance Tradeoff
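Under squared-error loss, the expected generalization error of a model decomposes into three components, of which only the noise is irreducible: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$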
Bias measures how far the model's average prediction is from the truth. High bias = the model is too simple to capture the pattern (underfitting). A linear model fitting a sine wave has high bias.
Variance measures how much the model's predictions change when trained on different samples. High variance = the model is too sensitive to the specific training data (overfitting). A degree-15 polynomial has high variance.
Noise ($\sigma^2$) is the irreducible error from the data-generating process. No model can do better than this.
The tradeoff: increasing model complexity decreases bias (better fit to the true pattern) but increases variance (more sensitive to training noise). The optimal complexity minimizes the sum.
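For reference, the decomposition can be derived in a few lines. This is a sketch under standard assumptions that the text does not spell out: squared-error loss, data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, a model $\hat f_D$ trained on a random dataset $D$, and noise $\varepsilon$ independent of $D$. The symbol $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$ denotes the average prediction over training sets.

```latex
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}_{D,\varepsilon}\big[(f(x) + \varepsilon - \hat f_D(x))^2\big] \\
  &= \mathbb{E}_D\big[(f(x) - \hat f_D(x))^2\big] + \sigma^2
     && \text{(cross term vanishes: } \mathbb{E}[\varepsilon] = 0,\ \varepsilon \perp D\text{)} \\
  &= \big(f(x) - \bar f(x)\big)^2
     + \mathbb{E}_D\big[(\hat f_D(x) - \bar f(x))^2\big] + \sigma^2
     && \text{(add and subtract } \bar f(x)\text{)} \\
  &= \text{Bias}^2 + \text{Variance} + \text{Noise}.
\end{aligned}
```

The second cross term also vanishes, because $\mathbb{E}_D[\hat f_D(x) - \bar f(x)] = 0$ by the definition of $\bar f$.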
| Regime | Bias | Variance | Train Loss | Val Loss | Gap |
| --- | --- | --- | --- | --- | --- |
| Underfitting | High | Low | High | High | Small |
| Good fit | Low | Low | Low | Low | Small |
| Overfitting | Low | High | Very low | High | Large |
The key diagnostic: compare train loss to val loss. If both are high, increase complexity (more layers, more neurons). If train loss is low but val loss is high, decrease complexity or add regularization. If both are low and close together, you have found a good fit.
Interactive: Overfitting Explorer
The visualization below fits polynomial models of increasing degree to a noisy sine wave. Drag the complexity slider from left (linear, underfitting) to right (degree 15, extreme overfitting) and observe how the train error keeps decreasing while the validation error follows a U-shape — first decreasing, then increasing.
Key observations as you experiment:
Degree 1-2 (underfitting): The line is too rigid to follow the sine curve. Both train and val MSE are high. The model has high bias.
Degree 3-5 (good fit): The curve follows the sine pattern without chasing individual points. Val MSE reaches its minimum. This is the sweet spot.
Degree 10+ (overfitting): The curve wiggles wildly to pass through every training point. Train MSE is near zero, but val MSE climbs sharply. The model has high variance.
Increase noise: With more noise, overfitting starts at a lower degree because there is more noise to memorize.
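The same experiment can be approximated offline. The sketch below is not the code behind the visualization; the noise level, sample sizes, and use of `np.polynomial.Polynomial.fit` are assumptions chosen to keep it short. It fits polynomials of degree 1 through 15 to a noisy sine wave and prints train and validation MSE for each degree.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.3  # assumed noise level; raise it to see overfitting start earlier

def sample(n):
    """Draw n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return x, np.sin(x) + rng.normal(0.0, noise, n)

x_train, y_train = sample(30)
x_val, y_val = sample(200)

train_mse, val_mse = {}, {}
for degree in range(1, 16):
    # Least-squares polynomial fit on the training points only
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = float(np.mean((model(x_train) - y_train) ** 2))
    val_mse[degree] = float(np.mean((model(x_val) - y_val) ** 2))
    print(f"degree {degree:2d}: train MSE={train_mse[degree]:.4f}  val MSE={val_mse[degree]:.4f}")

best = min(val_mse, key=val_mse.get)
print("degree with the lowest validation MSE:", best)
```

Training MSE can only go down as the degree grows, since each higher-degree model contains the lower-degree ones. The validation MSE is the column to watch for the turn from underfitting to overfitting.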
Early Stopping
Early stopping is one of the simplest and most widely used regularization techniques. The idea: monitor the validation loss during training, and stop once it has stopped improving.
The Algorithm
After each epoch, compute the validation loss
If the val loss is the best so far, save a snapshot of the model weights and reset a counter
If the val loss did NOT improve, increment the counter
If the counter reaches a threshold called patience, stop training
Restore the model weights from the snapshot (the best-validated state)
Formally, we define the stopping criterion:
$$\text{stop if } L_{\text{val}}(\theta_e) > \min_{e' < e} L_{\text{val}}(\theta_{e'}) \ \text{ for } P \text{ consecutive epochs}$$
where $P$ is the patience. The patience parameter is important because validation loss is noisy: it may temporarily increase for a few epochs before decreasing again. Without patience, you would stop prematurely on a random fluctuation.
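The five steps above can be sketched as a small framework-agnostic helper. The class name `EarlyStopping`, its `step` method, the dictionary standing in for model weights, and the validation-loss sequence are all made up for illustration.

```python
import copy

class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = -1
        self.counter = 0

    def step(self, val_loss, state, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # snapshot of the weights
            self.best_epoch = epoch
            self.counter = 0                        # reset on improvement
        else:
            self.counter += 1                       # no improvement this epoch
        return self.counter >= self.patience

# Simulated validation losses (made-up numbers): improve, wobble, then rise.
val_losses = [0.90, 0.70, 0.55, 0.57, 0.50, 0.52, 0.53, 0.55, 0.60, 0.66]
stopper = EarlyStopping(patience=3)
weights = {"w": 0.0}  # stand-in for real model parameters

stopped_at = None
for epoch, loss in enumerate(val_losses):
    weights["w"] = float(epoch)  # pretend one epoch of training changed the weights
    if stopper.step(loss, weights, epoch):
        stopped_at = epoch
        break

weights = stopper.best_state  # restore the best-validated snapshot
print(f"stopped at epoch {stopped_at}, restored epoch {stopper.best_epoch} "
      f"(val loss {stopper.best_loss})")
```

The brief rise at epoch 3 does not end training because the counter resets when epoch 4 improves. Training stops at epoch 7, after three epochs without improvement, and the weights from epoch 4 are restored. The deep copy matters: storing a reference would let later updates overwrite the snapshot.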
Choosing Patience
| Patience | Behavior | When to Use |
| --- | --- | --- |
| 3-5 | Stops quickly. May miss slow improvements. | Fast experiments, well-tuned models |
| 10-15 | Balanced. Standard default. | Most training scenarios |
| 20-50 | Very patient. Allows long plateaus. | LR schedulers with warmup, noisy small datasets |
In certain settings, early stopping is approximately equivalent to L2 regularization (weight decay); the classic case is a quadratic error surface minimized by gradient descent (Bishop, 1995). As training progresses, the effective model complexity increases. Stopping early limits how complex the model becomes, just as weight decay penalizes large weights. The difference: early stopping automatically tunes the regularization strength based on the validation loss, while weight decay requires manually choosing $\lambda$.
Combine early stopping with other techniques. Early stopping is not an alternative to dropout, weight decay, or data augmentation — it is complementary. Use early stopping as a safety net on top of other regularization methods. It costs nothing (just a validation evaluation per epoch) and prevents wasted compute.
Building Validation from Scratch
The code below implements a complete training pipeline with train/val/test evaluation and early stopping in NumPy. Pay close attention to the evaluate() function — it runs a forward pass only, with no backward pass and no weight updates. This is what makes it safe to use on validation and test data.
NumPy: Training with Validation and Early Stopping (validation_training.py)
import numpy as np
NumPy provides the ndarray type and vectorized math operations we need for matrix multiplications, element-wise operations, and random number generation.
EXECUTION STATE
📚 numpy = Numerical computing library. All matrix math (@, broadcasting, exp, log) runs in optimized C, not Python.
# Generate two-moons dataset (200 points)
We create a synthetic 2D classification dataset where two crescent-shaped clusters overlap slightly. This is a classic ML test problem: simple enough to visualize, complex enough that a linear model fails.
EXECUTION STATE
two moons = Two interleaved half-circle clusters. The overlap (controlled by noise) makes the boundary fuzzy — there is no perfect separator.
Reproducibility: np.random.seed(0)
Fixes the random number generator so we get identical data every run. Critical for debugging and reproducible experiments.
Total dataset size: N = 200
200 total samples: 100 per class. After splitting: 140 train, 30 val, 30 test. Enough for smooth loss curves while being small enough to demonstrate overfitting.
Angular positions: t = np.linspace(0, π, N // 2)
Creates 100 evenly spaced angles from 0 to π. These define the positions along each crescent. linspace(0, π, 100) gives [0, 0.0317, 0.0634, ..., 3.1416].
EXECUTION STATE
📚 np.linspace(0, π, 100) = 100 values evenly spaced from 0 to π. The angular parameter for the half-circle.
⬆ X = (200, 2) — all data points, first 100 are class 0, last 100 are class 1.
Labels: y = np.array([0]*100 + [1]*100)
First 100 labels are 0, last 100 are 1. Matches the order of X (X0 first, X1 second).
EXECUTION STATE
y = [0, 0, ..., 0, 1, 1, ..., 1] — 200 labels, 100 of each class.
# Shuffle and split: 70% train, 15% val, 15% test
The three-way split is the foundation of model evaluation. Training data trains the model, validation data tunes hyperparameters and detects overfitting, test data gives the final unbiased performance estimate.
Random shuffle: perm = np.random.permutation(N)
Random permutation of [0, 1, ..., 199]. Shuffling before splitting ensures each subset is a representative sample of the full dataset. Without shuffling, the training set would hold all 100 class-0 points plus only 40 class-1 points, and the validation and test sets would contain nothing but class 1.
EXECUTION STATE
📚 np.random.permutation(200) = Random ordering of 0..199. Ensures train/val/test each get a mix of both classes.
Apply shuffle: X, y = X[perm], y[perm]
Reorder both features and labels using the same permutation. The pairing X[i], y[i] is preserved.
Train size = 140: n_train = int(0.7 * N)
70% of 200 = 140 training samples. The 70/15/15 split is a common default for small datasets like this one.
EXECUTION STATE
n_train = 140 = 70% of data. More training data → better model. Less → noisier gradients. 70% is a reasonable balance.
Val size = 30: n_val = int(0.15 * N)
15% of 200 = 30 validation samples. Used to detect overfitting and select the best model. NEVER used for training.
EXECUTION STATE
n_val = 30 = 15% of data. Enough for a reliable loss estimate. The val set is the ‘early warning system’ for overfitting.
X_train, y_train = X[:n_train], y[:n_train]
First 140 shuffled samples become the training set. These are the ONLY samples the model sees during gradient updates.
Last 30 samples become the test set. Only evaluated ONCE at the very end. Never influences any training decision. This is the final, unbiased performance report.
→ critical rule = Test data must NEVER be used to select hyperparameters, choose when to stop, or pick the best model. It is for the final evaluation only.
print(f"Train: ..., Val: ..., Test: ...")
Output: Train: 140, Val: 30, Test: 30. Always verify your split sizes and class balance before training.
EXECUTION STATE
output = Train: 140, Val: 30, Test: 30
# Model: 2 → 64 → 2 (overparameterized)
We intentionally use a very wide hidden layer (64 neurons). With 322 parameters for 140 training points, this model has enough capacity to memorize the training data, making overfitting likely. This is the setup that makes validation essential.
EXECUTION STATE
overparameterization = Parameters: (2×64)+64+(64×2)+2 = 322. With only 140 training points, the model has more parameters than data, a classic recipe for overfitting.
24np.random.seed(3)
Fix weight initialization seed for reproducibility.
25H = 64 — Hidden layer width
64 hidden neurons. Compare with Section 2 which used 4. The excess capacity is what enables overfitting.
Shape (64, 2): maps 64 hidden features to 2 class logits. 128 parameters.
EXECUTION STATE
W2 shape = (64, 2) = 128 parameters
29b2 = np.zeros(2) — Output biases
2 output biases, one per class.
31# Helper: evaluate loss and accuracy (NO weight updates)
THIS IS THE KEY FUNCTION. The evaluate function computes loss and accuracy on ANY data subset WITHOUT modifying the model weights. This is what makes validation and test evaluation possible — it is a pure measurement, not a training step.
32def evaluate(X_set, y_set): — Pure evaluation function
Runs a forward pass on the given data and returns loss + accuracy. No backward pass, no weight updates. Safe to call on val/test data because it does not change the model at all.
EXECUTION STATE
⬇ input: X_set = Feature matrix for any split. Can be X_train (140×2), X_val (30×2), or X_test (30×2).
⬇ input: y_set = Labels for the same split. Must match X_set in length.
⬆ returns = Tuple (loss, acc). loss = cross-entropy (lower is better). acc = fraction correct (higher is better).
→ critical difference from training = No backward pass, no dW, no weight update. The model is not modified. This function is READ-ONLY.
33z1 = X_set @ W1 + b1 — Forward: layer 1
Standard linear transform. Note: this uses the CURRENT values of W1 and b1 (which are global variables that change during training).
34h1 = np.maximum(0, z1) — ReLU activation
Element-wise ReLU. Same as the training forward pass.
35z2 = h1 @ W2 + b2 — Output logits
Raw class scores. Shape: (N_set, 2) where N_set is the size of the data subset.
Fraction of samples where the predicted class (argmax of logits) matches the true label. Unlike loss, accuracy is a discrete metric (0% to 100%).
EXECUTION STATE
📚 .argmax(axis=1) = Returns the index of the maximum value in each row. For (N, 2): returns 0 or 1 for each sample — the predicted class.
⬆ acc = Scalar between 0 and 1. Initial val acc ≈ 53% (near random). After training: ≈ 93%.
40return loss, acc
Return both metrics. We track loss for early stopping (continuous, differentiable) and accuracy for human interpretability.
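A read-only evaluation function of the kind described above might look like the following sketch. Two assumptions differ from the walkthrough: the weights are passed in explicitly instead of being read from globals, and the loss uses a numerically stable log-softmax. The demo data and weight values are hypothetical.

```python
import numpy as np

def evaluate(X_set, y_set, W1, b1, W2, b2):
    """Forward pass only: returns (mean cross-entropy loss, accuracy)."""
    z1 = X_set @ W1 + b1
    h1 = np.maximum(0, z1)                    # ReLU
    z2 = h1 @ W2 + b2                         # logits, shape (N_set, 2)
    z2 = z2 - z2.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = z2 - np.log(np.exp(z2).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(y_set)), y_set].mean()
    acc = (z2.argmax(axis=1) == y_set).mean()
    return float(loss), float(acc)

# Hypothetical demo data and weights (not the section's trained model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = rng.integers(0, 2, size=30)
W1 = rng.normal(scale=0.1, size=(2, 64))
b1 = np.zeros(64)
W2 = np.zeros((64, 2))   # zero output weights -> uniform predictions
b2 = np.zeros(2)

before = [p.copy() for p in (W1, b1, W2, b2)]
loss, acc = evaluate(X_demo, y_demo, W1, b1, W2, b2)
unchanged = all(np.array_equal(p, q) for p, q in zip((W1, b1, W2, b2), before))
print(f"loss={loss:.4f}, acc={acc:.2f}, weights unchanged: {unchanged}")
```

With all-zero output weights the model predicts 50/50 for every sample, so the loss equals ln 2 ≈ 0.6931, which is a handy baseline to compare early validation losses against.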
42# Training loop with validation and early stopping
The core structure: after each epoch of training, we evaluate on the validation set. If val loss has not improved for ‘patience’ epochs, we stop and restore the best model.
43lr = 0.05 — Learning rate
Same learning rate as Section 2.
44patience = 10 — How long to wait before stopping
If the validation loss does not improve for 10 consecutive epochs, we stop training. This prevents wasting compute on a model that is already overfitting.
EXECUTION STATE
patience = 10 = 10 epochs of no improvement → stop. Smaller patience (3-5) stops earlier but may miss slow improvements. Larger (15-20) is safer but wastes more compute.
45best_val_loss = float('inf') — Track best validation loss
Initialize to infinity so the first epoch’s loss is always ‘the best so far’. This is the value we compare against to decide whether to save or wait.
46best_epoch = 0 — Which epoch was best
Records when the best val loss was achieved. Used in the final report.
47wait = 0 — Epochs since last improvement
Counter that resets to 0 when val loss improves, increments when it does not. When wait >= patience, we stop.
EXECUTION STATE
wait = 0 initially. Counts up to patience (10) then triggers early stopping.
48best_W1, best_b1 = W1.copy(), b1.copy() — Snapshot best weights
Save a COPY of the current weights. When we find a better val loss, we overwrite these. At the end, we restore these as the final model. The .copy() is critical — without it, best_W1 would be a reference to the same array that keeps changing.
49best_W2, best_b2 = W2.copy(), b2.copy()
Same snapshot for layer 2 weights.
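Why the .copy() matters can be seen in a few lines. This is an illustrative sketch with a toy array W (a stand-in for W1 or W2): a plain assignment creates a second name for the same array, so an in-place update changes both.

```python
import numpy as np

W = np.ones(3)
alias = W             # second name for the SAME array
snapshot = W.copy()   # independent array with its own memory

W -= 0.5              # in-place update, like `W1 -= lr * dW1`

print(alias)     # [0.5 0.5 0.5]  (followed the update)
print(snapshot)  # [1. 1. 1.]     (preserved)
```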
51np.random.seed(42) — Fix shuffle order
Seeds the shuffling used inside the training loop for reproducibility.
52for epoch in range(100): — Epoch loop
Train for up to 100 epochs. With early stopping, we may stop much earlier (typically around epoch 60-80 for this problem). The max of 100 is a safety limit.
53# Train (mini-batch, B=10)
The training section — covered in detail in Section 2. Here we focus on the NEW part: validation evaluation and early stopping after each epoch.
54p = np.random.permutation(n_train)
Shuffle training indices for this epoch.
55Xs, ys = X_train[p], y_train[p]
Reorder training data. Different shuffle each epoch.
56for i in range(0, n_train, 10): — Mini-batch loop
Process 14 batches of 10 samples each. Standard inner training loop from Section 2.
57Xb, yb = Xs[i:i+10], ys[i:i+10]
Slice one mini-batch.
58B = len(yb)
Actual batch size (10 for all batches since 140/10 = 14 exactly).
Standard forward pass through layer 1. This IS part of training — gradients will follow.
60h1 = np.maximum(0, z1)
ReLU activation.
61z2 = h1 @ W2 + b2
Output logits.
62ez = np.exp(z2 - z2.max(axis=1, keepdims=True))
Numerically stable exp for softmax.
63pr = ez / ez.sum(axis=1, keepdims=True)
Softmax probabilities.
64dz2 = pr.copy()
Start of backward pass: copy probabilities.
65dz2[range(B), yb] -= 1
Softmax + cross-entropy gradient: subtract 1 at correct class position.
66dz2 /= B
Average gradient over batch.
67dW2 = h1.T @ dz2; db2 = dz2.sum(0)
Layer 2 weight and bias gradients.
68dh1 = dz2 @ W2.T; dz1 = dh1 * (z1 > 0)
Backpropagate through W2 and ReLU.
69dW1 = Xb.T @ dz1; db1_ = dz1.sum(0)
Layer 1 gradients.
70W1 -= lr*dW1; b1 -= lr*db1_
SGD update for layer 1.
71W2 -= lr*dW2; b2 -= lr*db2
SGD update for layer 2. After this line, one training step is complete.
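The softmax plus cross-entropy gradient used in the backward pass (probabilities, minus 1 at the true class, divided by the batch size) can be verified against finite differences. The sketch below uses hypothetical random logits and labels; the helper name mean_cross_entropy is introduced only for this check.

```python
import numpy as np

def mean_cross_entropy(z, y):
    """Mean cross-entropy of logits z (B, C) against integer labels y (B,)."""
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
B, C = 10, 2
z2 = rng.normal(size=(B, C))       # hypothetical logits
yb = rng.integers(0, C, size=B)    # hypothetical labels

# Analytic gradient, written the same way as in the training loop.
ez = np.exp(z2 - z2.max(axis=1, keepdims=True))
dz2 = ez / ez.sum(axis=1, keepdims=True)
dz2[np.arange(B), yb] -= 1
dz2 /= B

# Numerical gradient via central differences.
eps = 1e-6
num = np.zeros_like(z2)
for i in range(B):
    for j in range(C):
        zp = z2.copy()
        zp[i, j] += eps
        zm = z2.copy()
        zm[i, j] -= eps
        num[i, j] = (mean_cross_entropy(zp, yb) - mean_cross_entropy(zm, yb)) / (2 * eps)

max_err = np.abs(dz2 - num).max()
print(f"max |analytic - numerical| = {max_err:.2e}")
```

A tiny maximum error indicates that the three backward-pass lines implement the gradient of the mean cross-entropy with respect to the logits.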
73# Evaluate on train AND validation (no gradients)
THIS IS THE NEW PART compared to Section 2. After all 14 training batches, we measure how the model performs on both training data and held-out validation data. The training loss tells us how well the model fits the training data. The validation loss tells us how well it generalizes.
Call our read-only evaluate function on the TRAINING set. Note: this is NOT the per-batch loss from inside the training loop. We recompute loss on the full training set for a cleaner metric.
Evaluate on the VALIDATION set. This is the number we watch for overfitting. Initially it decreases (model is learning useful patterns), but eventually it may increase (model starts memorizing training noise).
EXECUTION STATE
typical trajectory = E0: 0.46 (73%) → E20: 0.30 (80%) → E60: 0.22 (90%) → best then rises.
→ overfitting signal = When train_loss keeps going down but val_loss starts going UP, the model is overfitting. The gap between them is the generalization gap.
77# Early stopping check
The early stopping algorithm: (1) if val_loss improved, save model and reset counter; (2) if not, increment counter; (3) if counter reaches patience, stop training and restore the best model.
78if val_loss < best_val_loss: — Did val loss improve?
Compare current validation loss against the best we have ever seen. If it is lower (better), this epoch produced a better model.
EXECUTION STATE
→ why val_loss, not train_loss? = We care about generalization, not memorization. Train loss usually keeps falling with more training, even when the model is only memorizing noise, so it cannot tell us when to stop. Val loss reflects performance on unseen data.
79best_val_loss = val_loss — Update best
Record the new best validation loss for future comparisons.
80best_epoch = epoch — Record when
Remember which epoch achieved this best loss. Reported at the end.
81best_W1, best_b1 = W1.copy(), b1.copy() — Save model snapshot
Save a copy of the current weights. If the model starts overfitting later, we can restore these ‘best’ weights. The .copy() creates independent arrays that will not be affected by future weight updates.
82best_W2, best_b2 = W2.copy(), b2.copy()
Save layer 2 weights too.
83wait = 0 — Reset patience counter
The model just improved, so reset the counter. We give it another ‘patience’ epochs before considering stopping.
84else: — Val loss did NOT improve
The model did not beat its previous best on the validation set. This could be a temporary fluctuation or the start of overfitting.
85wait += 1 — Increment patience counter
One more epoch without improvement. When this reaches patience (10), we stop.
EXECUTION STATE
wait progression = If val_loss keeps not improving: wait = 1, 2, 3, ..., 10 → STOP.
If val_loss improves at wait=7: wait resets to 0.
87if epoch % 20 == 0 or wait == patience:
Print progress every 20 epochs and at the stopping point.
88print(f"Epoch ...: train=... val=... wait=...")
Log both train and val metrics. The key diagnostic: is the gap between train_loss and val_loss growing? If so, the model is overfitting.
91if wait >= patience: — Patience exhausted?
The trigger condition for early stopping. If we have gone 10 epochs without improvement, stop training and restore the best model.
92print(f"Early stopping at epoch {epoch}! Best: epoch {best_epoch}")
Announce the early stop. The best model was saved patience epochs ago.
93break — Exit the epoch loop
Stop training immediately. Without this break, we would waste compute training a model that is only getting worse on validation data.
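The patience logic can be isolated from the training code and traced on a fixed sequence of validation losses. The function name early_stopping_trace and the loss values below are hypothetical; the update rules mirror the ones in the walkthrough (reset on improvement, increment otherwise, stop when wait reaches patience).

```python
def early_stopping_trace(val_losses, patience):
    """Return (stop_epoch, best_epoch, best_loss) for a sequence of val losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    stop_epoch = len(val_losses) - 1   # default: ran through every epoch
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
        if wait >= patience:
            stop_epoch = epoch
            break
    return stop_epoch, best_epoch, best_loss

# Made-up validation curve: improves, wobbles, then rises.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.34, 0.37, 0.39, 0.41, 0.44, 0.50]
stop_epoch, best_epoch, best_loss = early_stopping_trace(losses, patience=3)
print(stop_epoch, best_epoch, best_loss)  # 8 5 0.34
```

Note that the temporary rise at epoch 4 does not end training, because the loss improves again at epoch 5 and resets the counter. When stopping does trigger, the best epoch is exactly patience epochs before the stop epoch.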
95# Final evaluation on test set (using best model)
The FINAL step. We restore the best model (the one with lowest val loss) and evaluate it on the test set, which has never influenced any decision during training. This gives an unbiased estimate of how the model performs on new data.
96# Restore best weights
The current model weights have been training past the optimal point (overfitting). We replace them with the saved best weights from the epoch with lowest validation loss.
97W1, b1 = best_W1, best_b1 — Restore layer 1
Replace the overfit weights with the saved best weights. After this, the model is in its best state.
98W2, b2 = best_W2, best_b2 — Restore layer 2
Same for layer 2.
99test_loss, test_acc = evaluate(X_test, y_test) — THE FINAL NUMBER
This is the moment of truth. We call evaluate on the test set using the restored best model. This number is the unbiased estimate of how the model will perform on real, unseen data. It should only be computed ONCE.
EXECUTION STATE
⬆ test_loss, test_acc = The final reported metrics. This is what you put in your paper, your report, your model card. It was computed on data the model has never seen and that never influenced any training decision.
100print(f"Test (best model): ...")
Final output: the test performance of the best model selected by validation.
EXECUTION STATE
output = Test (best model, epoch XX): loss=X.XXXX, acc=XX%
Watch the loss trajectory. In the early epochs, both train and val loss decrease together — the model is learning the true pattern. Eventually, the train loss keeps dropping while the val loss plateaus or rises — the model is now memorizing training noise. Early stopping catches this moment and restores the model to its best state.
The evaluate function is the key innovation of this section. Compare Section 2 (training loop only, no evaluation) with this code. The only addition is a read-only forward pass on the validation set after each epoch, plus a few lines of early stopping logic. This small change transforms a training loop that might overfit into one that automatically finds the optimal stopping point.
PyTorch: eval(), no_grad(), and Early Stopping
The PyTorch version introduces two critical features that do not exist in NumPy:
model.eval() switches the model to evaluation mode, disabling dropout and making BatchNorm use its running statistics. For a model that contains such layers, validation metrics computed without it are noisy or simply wrong.
@torch.no_grad() disables gradient tracking, saving memory and compute. Without this, PyTorch builds a computation graph for the validation forward pass, wasting GPU memory.
DataLoader for the training set only. Validation and test do not need DataLoader — we evaluate them in one pass since they are small enough to fit in memory.
EXECUTION STATE
→ why no val_loader? = Val/test sets are small (30 samples). We pass the full tensor to model() directly. DataLoader is only needed for large datasets or when shuffling/batching matters.
17# Model
Define the same 2→64→2 architecture using PyTorch’s nn.Sequential.
18model = nn.Sequential(...) — Build model
nn.Sequential chains layers in order: Linear(2,64) → ReLU → Linear(64,2). Same architecture as the NumPy version but defined declaratively.
EXECUTION STATE
📚 nn.Sequential = Container that chains modules. model(x) calls each layer in order. No need for a custom forward() method.
This decorator wraps the entire function in a torch.no_grad() context. No computation graph is built inside this function, so: (1) forward pass is faster, (2) uses less memory, (3) gradients are not accumulated. Essential for evaluation.
EXECUTION STATE
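📚 @torch.no_grad() = Decorator form of ‘with torch.no_grad():’. Disables autograd recording for all operations inside the function. It does not by itself protect parameters from change; they stay unchanged here because evaluate() never calls backward() or optimizer.step().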
→ without this = PyTorch would build a computation graph for the val forward pass, wasting memory. On large models, this can cause OOM errors during evaluation.
25def evaluate(model, X, y): — Evaluate any split
Takes a model and any data split. Returns (loss, accuracy). Works for train, val, or test data.
EXECUTION STATE
⬇ input: model = The nn.Sequential model. We call model(X) to get predictions.
⬇ input: X = Feature tensor. Can be X_train_t (140,2), X_val_t (30,2), or X_test_t (30,2).
⬇ input: y = Label tensor. Same split as X.
⬆ returns = Tuple (loss_float, acc_float). Both are plain Python floats (detached from graph).
26model.eval() — Switch to evaluation mode
Sets the model to evaluation mode. This disables dropout and switches BatchNorm to use running statistics instead of batch statistics. Our model has neither, but calling eval() is a best practice — always do it before evaluation.
EXECUTION STATE
📚 model.eval() = Sets self.training = False for all submodules. Affects: Dropout (disabled), BatchNorm (uses running mean/var), and any custom layer that checks self.training.
→ why it matters = Dropout randomly zeros neurons during training (regularization) but NOT during eval. If you forget model.eval(), dropout masks the val predictions, giving wrong loss values.
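The effect of forgetting model.eval() is easy to observe on a model that does contain dropout. The sketch below is an assumption-laden illustration: the section's model has no dropout, so a hypothetical nn.Dropout(p=0.5) layer is added here purely to make the difference visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical variant of the 2 -> 64 -> 2 model with a dropout layer added.
model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2)
)
X = torch.randn(30, 2)

with torch.no_grad():
    model.train()                        # dropout active: fresh random mask per call
    train_a, train_b = model(X), model(X)
    model.eval()                         # dropout disabled: deterministic output
    eval_a, eval_b = model(X), model(X)

train_mode_differs = not torch.equal(train_a, train_b)
eval_mode_stable = torch.equal(eval_a, eval_b)
print("train mode gives different outputs on repeat calls:", train_mode_differs)
print("eval mode gives identical outputs on repeat calls: ", eval_mode_stable)
```

The example also shows that torch.no_grad() and model.eval() are independent switches: dropout stays active under no_grad() until eval() is called.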
27logits = model(X) — Forward pass (no graph built)
Runs the full forward pass on the data. Because of @torch.no_grad(), no computation graph is built, saving memory.
28loss = criterion(logits, y).item() — Compute and extract loss
Compute cross-entropy loss, then .item() extracts it as a Python float (detaching from graph).
Compute accuracy: predicted class == true class → boolean → float (0/1) → mean → Python float.
EXECUTION STATE
📚 .argmax(dim=1) = Index of max value along dim 1 (classes). Returns predicted class (0 or 1) for each sample.
.float() = Converts boolean tensor (True/False) to float (1.0/0.0) for computing mean.
30model.train() — Switch back to training mode
Restores the model to training mode (re-enables dropout, etc.) so the next training epoch works correctly. Always pair model.eval() with model.train().
31return loss, acc
Return the scalar loss and accuracy values.
33# Training with early stopping
Same early stopping logic as NumPy, but using PyTorch’s state_dict for checkpointing.
34best_val_loss = float('inf')
Initialize to infinity.
35best_epoch = 0
Track which epoch was best.
36patience, wait = 10, 0
Same patience=10 as NumPy. wait counts epochs without improvement.
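37best_state = {k: v.clone() for k, v in model.state_dict().items()} — Save initial state
model.state_dict() returns an OrderedDict of all parameter tensors. Those tensors share storage with the live parameters, and dict.copy() is only a shallow copy, so model.state_dict().copy() would silently follow every optimizer step. Cloning each tensor (or calling copy.deepcopy on the state dict) gives an independent snapshot. This replaces our manual W1.copy(), b1.copy() etc.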
EXECUTION STATE
📚 model.state_dict() = Returns {'0.weight': tensor(64,2), '0.bias': tensor(64), '2.weight': tensor(2,64), '2.bias': tensor(2)}. All model parameters as a dictionary.
39for epoch in range(100):
Same epoch loop. May stop early via break.
40model.train() — Ensure training mode at epoch start
Sets model to training mode. evaluate() already switches to eval mode and back, so this is redundant here, but setting it explicitly at the start of each epoch is a cheap safeguard.
41for Xb, yb in train_loader: — Training batches
Iterate mini-batches from DataLoader. 14 batches of 10 per epoch.
Evaluate on validation set. This is the metric that drives early stopping.
51if val_loss < best_val_loss:
Check for improvement.
52best_val_loss = val_loss
Update best.
53best_epoch = epoch
Record best epoch.
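54best_state = {k: v.clone() for k, v in model.state_dict().items()} — Checkpoint best
Save an independent copy of all model parameters by cloning each tensor in the state dict. A plain model.state_dict().copy() is not enough: it copies the dictionary but not the tensors, which keep changing as training continues. Snapshotting the state dict replaces our manual per-tensor copying from the NumPy version.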
55wait = 0
Reset counter.
56else:
No improvement.
57wait += 1
Increment patience counter.
59if wait >= patience:
Check if patience exhausted.
60print(f"Early stopping ...")
Announce stopping.
61break
Exit epoch loop.
63# Restore best model and evaluate on test set
Load the saved best weights back into the model and compute the final, unbiased test metric.
64model.load_state_dict(best_state) — Restore best weights
Replaces all model parameters with the saved best state. After this call, model is in its best-validated state, not the potentially overfit final state.
EXECUTION STATE
📚 model.load_state_dict(state) = Loads parameters from an OrderedDict into the model. Shape and key names must match. This is the standard way to restore checkpoints in PyTorch.
THE FINAL NUMBER. Evaluate the best-validated model on completely held-out test data. This number is only computed once and is the unbiased estimate of real-world performance.
EXECUTION STATE
⬆ test_loss, test_acc = The metrics you report. Computed on data that never influenced training, hyperparameter selection, or model selection.
66print(f"Test: loss=..., acc=...")
Final output: the test performance. This is the number that goes in the paper.
EXECUTION STATE
output = Test: loss=X.XXXX, acc=XX%
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# ── Dataset and splits (same two-moons data) ──
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.long)

train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t), batch_size=10, shuffle=True
)

# ── Model ──
model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2)
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# ── Evaluation helper ──
@torch.no_grad()
def evaluate(model, X, y):
    model.eval()
    logits = model(X)
    loss = criterion(logits, y).item()
    acc = (logits.argmax(dim=1) == y).float().mean().item()
    model.train()
    return loss, acc

# ── Training with early stopping ──
best_val_loss = float('inf')
best_epoch = 0
patience, wait = 10, 0
# Clone each tensor: state_dict().copy() is shallow and would track live weights.
best_state = {k: v.clone() for k, v in model.state_dict().items()}

for epoch in range(100):
    model.train()
    for Xb, yb in train_loader:
        logits = model(Xb)
        loss = criterion(logits, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss, train_acc = evaluate(model, X_train_t, y_train_t)
    val_loss, val_acc = evaluate(model, X_val_t, y_val_t)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
        wait = 0
    else:
        wait += 1

    if wait >= patience:
        print(f"Early stopping at epoch {epoch}. Best: {best_epoch}")
        break

# ── Restore best model and evaluate on test set ──
model.load_state_dict(best_state)
test_loss, test_acc = evaluate(model, X_test_t, y_test_t)
print(f"Test: loss={test_loss:.4f}, acc={test_acc:.0%}")
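The PyTorch version replaces manual weight copying (W1.copy()) with a cloned snapshot of model.state_dict(), and weight restoration with model.load_state_dict(best_state). Note that model.state_dict().copy() alone is a shallow copy whose tensors keep tracking the live parameters, so each tensor must be cloned (or the whole dict deep-copied). The rest is structurally identical to the NumPy version.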
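Whether a checkpoint really is an independent snapshot can be checked directly. The sketch below uses a small hypothetical nn.Linear layer and simulates an optimizer step with an in-place update; it compares a shallow dictionary copy against copy.deepcopy.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(2, 2)   # hypothetical stand-in for the full model

shallow = layer.state_dict().copy()        # new dict, SAME tensor storage
deep = copy.deepcopy(layer.state_dict())   # new dict, new tensors

# Simulate an optimizer step: parameters are updated in place.
with torch.no_grad():
    layer.weight.add_(1.0)

shallow_followed = torch.equal(shallow["weight"], layer.weight)
deep_preserved = not torch.equal(deep["weight"], layer.weight)
print("shallow copy followed the update:", shallow_followed)
print("deep copy kept the old weights:  ", deep_preserved)
```

If the shallow copy were used as best_state, restoring it at the end of training would simply reload the final weights, and early stopping would have no effect on the reported model.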
Going further: torch.inference_mode()
Since PyTorch 1.9, there is a second, stricter alternative to no_grad(): torch.inference_mode(). Both disable autograd, but inference_mode also turns off the version counters and view tracking that autograd normally maintains to make in-place operations safe. Skipping that bookkeeping can make it faster (roughly 5–15% is a typical figure for pure-inference workloads, though the gain depends on the model and hardware), at the price of a stricter contract: any tensor created inside is an “inference tensor” and cannot later be re-enabled for gradient tracking.
| | torch.no_grad() | torch.inference_mode() |
| --- | --- | --- |
| Disables autograd | Yes | Yes |
| Disables version counters / view tracking | No | Yes |
| Output tensors can later enter autograd | Yes | No (must .clone() first) |
| Speed | Baseline | ~5-15% faster on inference-heavy CPU / GPU code |
| Recommended for | Validation inside a training loop | Pure inference services; end-of-training test |
torch.no_grad() vs torch.inference_mode() — same numbers, different contract
🐍no_grad_vs_inference_mode.py
1import torch
PyTorch core — we need it for tensors, the no_grad context, and the newer inference_mode context (added in PyTorch 1.9).
3x = torch.randn(3, requires_grad=False)
A length-3 random vector with no gradient tracking. requires_grad=False is the default for fresh tensors; making it explicit here shows that x itself is not part of any autograd graph.
EXECUTION STATE
📚 torch.randn(*shape, requires_grad) = Draws from the standard normal distribution N(0, 1) and returns a tensor.
⬇ arg: shape=(3,) = 1D tensor with 3 elements.
⬇ arg: requires_grad=False = Do not track operations on this tensor for autograd. Default; stated here for clarity.
⬆ x (example) = tensor([ 0.34, -1.12, 0.58]) (your values will differ — no seed)
6with torch.no_grad():
The standard context for validation since PyTorch 0.4 (2018). Inside the block, every operation is executed but no autograd graph is recorded. Outputs have requires_grad=False. You can still call .requires_grad_(True) on them afterwards, and you can still feed them into a later gradient-tracked computation.
EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd recording. Used in every validation loop.
→ guarantees = No graph is built; memory for intermediate activations is freed as soon as possible.
7y_ng = x * 2
Element-wise multiply. Runs under no_grad, so y_ng has no grad_fn. Shape matches x.
Prints both flags. Under no_grad, the output's is_inference() returns False — it is a regular tensor that merely happens to have no gradient. You could still pass it into an autograd-tracked computation later.
A stricter, faster context introduced in PyTorch 1.9. It disables autograd the same way no_grad does, but also disables version-counter bookkeeping and view-tracking — the internal machinery that makes autograd safe against in-place modifications. Skipping that bookkeeping makes inference_mode measurably faster than no_grad (especially for many small ops), at the cost of stricter rules on the resulting tensors.
EXECUTION STATE
📚 torch.inference_mode() = Context manager introduced in PyTorch 1.9 for the inference fast path.
→ what it gives up = Tensors created inside are 'inference tensors'. They cannot be saved for backward, cannot be used as leaves in a later autograd graph, and cannot have requires_grad flipped to True afterwards.
→ when to use = Production inference services; end-of-training test evaluation. Avoid inside the training loop's validation pass if any of those tensors need to flow back into gradient-tracked code.
13y_im = x * 2
Same computation, but the result is an inference tensor.
EXECUTION STATE
y_im (example) = tensor([ 0.68, -2.24, 1.16])
y_im.is_inference() = True — the tensor is marked as inference-only.
Both contexts compute exactly the same numerical result. The difference is purely about what autograd is allowed to do with the outputs afterwards.
EXECUTION STATE
📚 torch.allclose(a, b) = Returns True iff every element-wise difference is within a tolerance (default rtol=1e-5, atol=1e-8).
stdout = outputs equal : True
21try: y_im.requires_grad_(True)
Attempt to retrofit gradient tracking onto the inference tensor. This is the strictness showing up: PyTorch refuses.
EXECUTION STATE
📚 Tensor.requires_grad_(flag) = In-place setter for the requires_grad flag.
→ attempted action = Turn y_im into a leaf tensor that accumulates gradients.
22except RuntimeError as e: print(...)
PyTorch raises RuntimeError: 'Setting requires_grad=True on inference tensor outside InferenceMode is not allowed. Use .clone() to make a differentiable copy first.' This is the correctness-first design that gives inference_mode its speed: the C++ engine does not have to reason about whether an inference tensor might later participate in backward.
EXECUTION STATE
stdout = cannot flip requires_grad on inference tensor: RuntimeError
→ workaround if needed = If you need a tensor from inference_mode to re-enter autograd, make a differentiable copy: z = y_im.clone().requires_grad_(True). That breaks the inference-tensor lineage.
import torch

x = torch.randn(3, requires_grad=False)

# --- torch.no_grad(): disables autograd inside the block ---
with torch.no_grad():
    y_ng = x * 2
print("no_grad : requires_grad =", y_ng.requires_grad,
      ", is_inference =", y_ng.is_inference())

# --- torch.inference_mode(): stricter, marks outputs as inference tensors ---
with torch.inference_mode():
    y_im = x * 2
print("inference_mode: requires_grad =", y_im.requires_grad,
      ", is_inference =", y_im.is_inference())

# Numerical results are identical:
print("outputs equal :", torch.allclose(y_ng, y_im))

# An inference tensor cannot be retro-fitted into a grad-tracked computation:
try:
    y_im.requires_grad_(True)
except RuntimeError as e:
    print("cannot flip requires_grad on inference tensor:",
          type(e).__name__)
Rule of thumb: during the validation pass inside a training loop, stick with torch.no_grad(). For the final test evaluation or a production inference service, switch to torch.inference_mode(). Both must be paired with model.eval(): model.eval() fixes Dropout and BatchNorm behavior, while no_grad / inference_mode fixes autograd behavior.
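A final test-time forward pass under this rule of thumb might look like the sketch below. The model and X_test tensor are hypothetical stand-ins (untrained weights, random inputs). The second half illustrates the .clone() escape hatch for the rare case where an inference-mode output has to re-enter autograd.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in model
X_test = torch.randn(30, 2)                                            # stand-in data

model.eval()                     # fixes Dropout / BatchNorm behavior
with torch.inference_mode():     # fixes autograd behavior
    logits = model(X_test)

print("inference tensor:", logits.is_inference())

# Cloning outside the context produces a normal tensor that autograd accepts.
z = logits.clone().requires_grad_(True)
z.sum().backward()
print("clone is inference tensor:", z.is_inference(), "| grad shape:", tuple(z.grad.shape))
```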
Connection to Modern Systems
The validation concepts from this section are used at every scale of modern ML:
LLM Evaluation
Large language models like GPT-4 and LLaMA are evaluated on held-out text that was never part of the training corpus. The primary metric is perplexity on validation data: $\mathrm{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right)$. Lower perplexity means the model assigns higher probability to the correct next tokens. Chinchilla (Hoffmann et al., 2022) used validation loss scaling laws to determine the optimal balance of model size vs. training data size.
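The perplexity formula can be made concrete with a few lines of arithmetic. In this sketch, token_probs is a hypothetical list holding the probability the model assigned to each correct next token; real evaluations obtain these values from the model's softmax output.

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/T) * sum_t log P(x_t | x_<t)) over the true tokens."""
    T = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / T
    return math.exp(avg_nll)

# A model that always spreads probability evenly over 4 candidates.
uniform = perplexity([0.25] * 8)
# A model that is fairly confident about each correct token (made-up values).
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(f"uniform over 4 tokens: {uniform:.4f}")
print(f"confident model:       {confident:.4f}")
```

The uniform case gives a perplexity of 4, which matches the common reading of perplexity as the effective number of equally likely choices the model is hesitating between.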
Benchmark Contamination
A major concern in LLM evaluation is benchmark contamination — when test data accidentally appears in the training corpus. If GPT-4's training data included questions from a benchmark, its score on that benchmark is meaningless (just like a student who memorized the exam answers). Modern evaluations use contamination detection and held-out dynamic benchmarks.
Cross-Validation
When data is scarce, k-fold cross-validation maximizes use of available data. Split data into k equal folds. Train on k−1 folds, validate on the remaining one. Repeat k times, each fold serving as validation once. The average validation score across all folds is a more reliable estimate than a single split. Common choice: k=5 or k=10.
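The k-fold procedure can be sketched as an index generator. The function name k_fold_indices and its seed parameter are introduced here for illustration; libraries such as scikit-learn provide production versions. Only the index bookkeeping is shown, with model training left out.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx); each sample is in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)   # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=200, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")

# Every sample should serve as validation data exactly once.
all_val = np.sort(np.concatenate([val for _, val in splits]))
print("each sample validated once:", np.array_equal(all_val, np.arange(200)))
```

In a full run, a fresh model would be trained on each train_idx and scored on the matching val_idx, and the k scores averaged.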
Transformer Early Stopping
Pre-training large transformers typically does NOT use early stopping: the models are trained for a fixed token budget (the largest LLaMA models and Chinchilla were each trained on roughly 1.4T tokens). Instead, early stopping is used during fine-tuning, where a pre-trained model is adapted to a specific task with a small dataset. Fine-tuning is highly prone to overfitting (billions of parameters, thousands of training examples), making early stopping essential.
The three-way split is not just a technique — it is a scientific principle. Every claim about model performance must be backed by evaluation on truly held-out data. Without this, we are measuring memorization, not intelligence.
Summary
In this section, we learned how to measure and prevent overfitting:
Generalization gap: $L_{\text{test}} - L_{\text{train}}$ measures how much the model has overfit. A small gap indicates good generalization
Three-way split: Train (learn parameters), validation (tune hyperparameters, detect overfitting), test (final unbiased estimate). The test set must NEVER influence any training decision
Bias-variance tradeoff: $\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$. Underfitting = high bias. Overfitting = high variance. The optimal model complexity minimizes total error
Early stopping: Monitor val loss, stop after patience epochs without improvement, restore best model. Under a quadratic approximation of the loss with plain gradient descent, it is equivalent to L2 regularization (Goodfellow et al., 2016, Section 7.8)
PyTorch evaluation: Always use model.eval() and @torch.no_grad() during evaluation. eval() disables dropout and switches BatchNorm to its running statistics; no_grad() skips graph construction, which saves memory
In the next section, we will expand our monitoring beyond simple loss tracking to include learning curves, confusion matrices, precision/recall, and other metrics that give deeper insight into what the model is learning and where it fails.
References
Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, Series B 36(2), 111–147. A foundational paper on cross-validation for model choice and assessment.
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995. Compares cross-validation and bootstrap empirically and recommends stratified 10-fold cross-validation for model selection.
Bengio, Y. & Grandvalet, Y. (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. JMLR 5, 1089–1105.
Varma, S. & Simon, R. (2006). Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91. Motivates nested cross-validation.
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11, 2079–2107. Canonical reference for nested cross-validation.
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data 6(4), Article 15.
Prechelt, L. (1998). Early Stopping — But When? In Neural Networks: Tricks of the Trade (LNCS 1524), 55–69. Classic reference on early-stopping criteria.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Springer. Chapter 7 gives the canonical bias-variance decomposition and cross-validation treatment.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, MIT Press. Chapters 5 and 7.8 cover generalization, cross-validation, and early stopping.