The Central Problem of Learning
Every neural network faces a fundamental tension: it must learn patterns from training data that are general enough to apply to data it has never seen. Learn too little and the model fails on everything, training data included. Learn too much and the model memorizes the training data perfectly but fails spectacularly on new inputs. This tension between underfitting and overfitting is one of the central concepts in machine learning.
Before we touch any math, let us build intuition with two analogies that make these concepts impossible to forget.
The Map Maker's Dilemma
Imagine you are a cartographer hired to draw a map of a winding mountain road. You drive the road once, noting landmarks along the way. Now you must draw a map that will be useful not just for today, but for anyone who drives this road in the future.
The Underfitting Map Maker
The first cartographer is lazy. They draw a single straight line from the start of the road to the end. "Close enough," they say. But the road curves, climbs hills, and loops around a lake. The straight-line map is useless — anyone following it would drive off a cliff at the first turn. This is underfitting: the model (straight line) is too simple to capture the true pattern (winding road). It fails on the training data AND on new data.
In mathematical terms, the underfitting model has high bias — it makes strong assumptions (the road is straight) that are systematically wrong. No matter how many times you drive the road and collect more data points, a straight line will never fit a winding road.
The Overfitting Map Maker
The second cartographer is obsessive. They draw every pebble, every tire track, every fallen leaf on the road. Their map is extraordinarily accurate — for today. But tomorrow, the leaves have blown away, a pothole has been filled, and a new crack has appeared. Anyone following yesterday's hyper-detailed map would be confused by every change. This is overfitting: the model captures the true pattern AND the noise (pebbles, leaves). It performs perfectly on training data but poorly on new data because it memorized transient details.
Mathematically, the overfitting model has high variance — if you sent this cartographer to map the same road on different days, they would produce wildly different maps because they faithfully record whatever random debris happens to be there that day.
The Good Map Maker
The third cartographer captures the road's curves, major landmarks, elevation changes, and bridge locations — but ignores the pebbles and leaves. Their map works today and will work next month. This is the sweet spot: the model captures the underlying pattern without memorizing the noise. It generalizes.
The Map Maker's Lesson: Underfitting means your map is too simple — you drew a straight line through a winding road. Overfitting means your map is too detailed — you drew every pebble that will be gone tomorrow. The goal is a map that captures the road's true shape while ignoring the debris.
The Dartboard — Bias vs. Variance
The second analogy makes the mathematics click. Imagine throwing darts at a dartboard, where the bullseye represents the true answer. Each dart throw represents training a model on a different dataset sampled from the same distribution.
- Bias is how far the center of your dart cluster is from the bullseye. High bias means you are systematically aiming at the wrong spot. Even if you throw a thousand darts, they all miss the same way.
- Variance is how spread out your darts are. High variance means each throw lands in a different place — your aim is unstable, even if on average you are pointed at the bullseye.
The four combinations tell the full story:
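[Interactive figure: Bias-Variance Dartboard, with one selectable scenario per combination.]
- Low bias, low variance: darts are clustered tightly at the bullseye. The model is accurate and consistent.
- Low bias, high variance: darts are scattered widely, but their center is on the bullseye. The model is right on average and unreliable on any single training set.
- High bias, low variance: darts are clustered tightly, away from the bullseye. The model is consistent and systematically wrong.
- High bias, high variance: darts are scattered widely around a point away from the bullseye. The model is both systematically wrong and unstable.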
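The four combinations can also be simulated numerically. The sketch below is illustrative only: the helper `throw_darts`, the aim offsets, and the spreads are assumptions chosen for the example, not values from the figure. It treats the bullseye as the origin and reports the squared distance from the cluster center to the bullseye (bias²), the spread around the cluster center (variance), and the mean squared distance to the bullseye.

```python
import numpy as np

def throw_darts(aim_offset, spread, n_throws=1000, seed=0):
    """Simulate 2-D dart throws; the bullseye is the origin (0, 0)."""
    rng = np.random.default_rng(seed)
    darts = np.asarray(aim_offset) + spread * rng.standard_normal((n_throws, 2))
    center = darts.mean(axis=0)
    bias_sq = float(np.sum(center ** 2))                                # cluster center vs. bullseye
    variance = float(np.mean(np.sum((darts - center) ** 2, axis=1)))    # spread around the center
    mse = float(np.mean(np.sum(darts ** 2, axis=1)))                    # mean squared miss distance
    return bias_sq, variance, mse

# Hypothetical settings: (aim offset, spread) for each scenario.
scenarios = {
    "low bias, low variance": ((0.0, 0.0), 0.1),
    "low bias, high variance": ((0.0, 0.0), 1.0),
    "high bias, low variance": ((1.5, 1.0), 0.1),
    "high bias, high variance": ((1.5, 1.0), 1.0),
}

results = {name: throw_darts(offset, spread) for name, (offset, spread) in scenarios.items()}
for name, (bias_sq, variance, mse) in results.items():
    print(f"{name:<26s} bias^2={bias_sq:7.3f}  variance={variance:7.3f}  mse={mse:7.3f}")
```

In every scenario the mean squared miss distance equals bias² plus variance, which is the noise-free version of the decomposition formalized in the next section.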
The dartboard above is a static illustration. The one below is a live experiment: every click resamples a fresh noisy training set and refits the same degree-5 polynomial. Watch how the predictions scatter around the truth — that scatter is the variance term.
[Interactive figure: Watch variance build up. Each click on "Draw new dataset" fits a degree-5 polynomial to 20 fresh noisy samples.]
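For readers without the interactive widget, the same experiment can be approximated offline. This is a minimal sketch under assumptions the text does not state: the true function is taken to be sin(2πx) on [0, 1], the noise standard deviation is 0.3, and `refit_predictions` is a hypothetical helper name. Each loop iteration plays the role of one click.

```python
import numpy as np
from numpy.polynomial import Polynomial

def refit_predictions(degree, n_datasets=200, n_samples=20, sigma=0.3, seed=0):
    """Refit a polynomial to fresh noisy samples and collect its predictions."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 50)                 # fixed evaluation points
    preds = np.empty((n_datasets, grid.size))
    for i in range(n_datasets):                      # one iteration = one "click"
        x = rng.uniform(0.0, 1.0, n_samples)
        y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n_samples)
        preds[i] = Polynomial.fit(x, y, degree, domain=[0.0, 1.0])(grid)
    return grid, preds

grid, preds = refit_predictions(degree=5)
variance = preds.var(axis=0).mean()                  # scatter of predictions across datasets
print(f"average prediction variance for degree 5: {variance:.4f}")
```

The array `preds` holds one fitted curve per row; the column-wise variance is the scatter the widget makes visible.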
Notice the crucial connection: underfitting = high bias (your model is systematically wrong) and overfitting = high variance (your model is unstable, changing wildly with different training data). The ideal model has both low bias and low variance — darts clustered tightly at the bullseye.
The Mathematics: Bias-Variance Decomposition
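Now let us formalize the dartboard intuition. Suppose the true relationship between input and output is $y = f(x) + \varepsilon$, where $f$ is the true function and $\varepsilon$ is irreducible noise with $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}[\varepsilon^2] = \sigma^2$. We train a model $\hat{f}$ on a dataset drawn randomly from this distribution.
The expected test error (MSE) at a point $x$, averaged over training sets and noise, decomposes cleanly into three terms:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$
Deriving the Decomposition
Start with the MSE and add-and-subtract the expected prediction $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathbb{E}\Big[\big(\varepsilon + (f(x) - \bar{f}(x)) + (\bar{f}(x) - \hat{f}(x))\big)^2\Big]$$
Expanding the square and noting that the cross terms vanish (because $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}[\hat{f}(x) - \bar{f}(x)] = 0$):
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(f(x) - \bar{f}(x)\big)^2 + \mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big] + \mathbb{E}[\varepsilon^2]$$
Why the cross terms vanish. Two cancellations make the decomposition clean. First, $\mathbb{E}[\hat{f}(x) - \bar{f}(x)] = 0$ by the very definition of $\bar{f}$: the average prediction equals the average prediction, so the cross term containing this difference has zero expectation (the factor $f(x) - \bar{f}(x)$ multiplying it is a constant). Second, every term containing the noise factorizes as $\mathbb{E}[\varepsilon \cdot g] = \mathbb{E}[\varepsilon]\,\mathbb{E}[g] = 0$ because $\varepsilon$ is independent of the training data and has zero mean. The bias-variance triple is what survives.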
Let us connect each term back to the dartboard:
| Term | Formula | Dartboard Meaning | What It Measures |
|---|---|---|---|
| Bias² | $(f(x) - \mathbb{E}[\hat{f}(x)])^2$ | Distance from cluster center to bullseye | Systematic error from model assumptions |
| Variance | $\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$ | Spread of the dart cluster | Sensitivity to the particular training set |
| Noise (σ²) | $\mathbb{E}[\varepsilon^2]$ | Inherent wobble in your throwing arm | Irreducible error: no model can beat this |
The Tradeoff in Practice
As model complexity increases, $\text{Bias}^2$ decreases (more flexible models can fit the true function) but $\text{Variance}$ increases (more flexibility means more sensitivity to noise). The optimal complexity minimizes the sum $\text{Bias}^2 + \text{Variance}$, since the noise term $\sigma^2$ does not depend on the model. This is the essence of the bias-variance tradeoff.
Let us see this with real numbers. The code below fits polynomials of degree 1, 3, 9, and 15 to noisy sine data across 200 independent training sets, then computes a Monte Carlo estimate of the bias-variance decomposition:
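A minimal NumPy sketch of such an experiment follows. It is not the listing that produced the table: the target sin(2πx) on [0, 1], the 30 samples per training set, and the helper name `bias_variance` are assumptions made for illustration, and only the noise level (σ = 0.3, so σ² = 0.09) is taken from the table. Its numbers will therefore differ from the tabulated ones, though the qualitative pattern should be similar.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_datasets=200, n_samples=30, sigma=0.3, seed=0):
    """Monte Carlo estimate of bias^2, variance, and noise for a polynomial fit."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 100)            # fixed test inputs
    f_true = np.sin(2 * np.pi * grid)            # assumed true function
    preds = np.empty((n_datasets, grid.size))
    for i in range(n_datasets):                  # one independent training set per row
        x = rng.uniform(0.0, 1.0, n_samples)
        y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n_samples)
        preds[i] = Polynomial.fit(x, y, degree, domain=[0.0, 1.0])(grid)
    mean_pred = preds.mean(axis=0)               # estimate of E[f_hat(x)]
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance, sigma ** 2

estimates = {degree: bias_variance(degree) for degree in (1, 3, 9, 15)}
for degree, (bias_sq, variance, noise) in estimates.items():
    total = bias_sq + variance + noise
    print(f"degree {degree:2d}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"noise={noise:.2f}  total={total:.4f}")
```

Averaging over the test grid turns the pointwise decomposition into a single number per degree, which is how a table like the one below can be filled in.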
The results reveal the tradeoff in stark numbers:
| Degree | Bias² | Variance | Noise | Total MSE | Regime |
|---|---|---|---|---|---|
| 1 (line) | 0.4832 | 0.0057 | 0.09 | 0.5789 | Underfitting — huge bias |
| 3 (cubic) | 0.0038 | 0.0189 | 0.09 | 0.1127 | Optimal — best tradeoff |
| 9 | 0.0044 | 0.2321 | 0.09 | 0.3265 | Overfitting begins |
| 15 | 0.0051 | 3.2614 | 0.09 | 3.3565 | Severe overfitting |
Notice: degree 15 has lower bias than degree 1. But its variance is 572× larger than degree 1's, and roughly 173× larger than degree 3's. The flexible model gets the right answer on average, but any single prediction is wildly unreliable, exactly like the scattered darts in the low-bias, high-variance dartboard scenario.
PyTorch reproduction
Same Monte-Carlo experiment, written in PyTorch with a small neural network instead of polynomials. The bias-variance decomposition is a property of the data and the model class — not of the framework — so the numbers should land in the same ballpark as the NumPy run.
Bias and variance both shift slightly: the network has different inductive biases than a polynomial (smoother solutions, no fixed degree), and Adam plus 300 steps trains to a different effective capacity. But the qualitative picture — bias dominates for very small models, variance dominates for very large ones, and the sweet spot sits between — is identical.
Seeing It in Action: Polynomial Fitting
The bias-variance numbers become visceral when you see what overfitting looks like. Use the slider below to adjust the polynomial degree and watch the fitted curve transform from a rigid straight line (underfitting) to a smooth approximation (good fit) to a wildly oscillating monster (overfitting):
[Interactive figure: Polynomial Fitting: Overfitting vs Underfitting. A slider increases the polynomial degree to show the transition from underfitting to overfitting.]
Pay attention to what happens at high degrees (10+): the curve passes through every training point perfectly (training MSE → 0) but oscillates violently between them. On test points that fall between the training data, predictions are catastrophically wrong. This is overfitting made visible: the model has memorized the noise.
Key Observation: As polynomial degree increases from 1 to ~4, both training and test error decrease because the model is learning the true pattern. Beyond degree ~5, training error continues to decrease but test error shoots up. The gap between training and test error is the hallmark of overfitting, and it is driven mainly by the variance term in the bias-variance decomposition.
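The train/test pattern described above can be checked with a few lines of NumPy. This is a sketch under assumed settings (16 equally spaced training points, target sin(2πx), noise standard deviation 0.3), so the exact degrees at which the curves turn will not match the interactive figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)                 # assumed true function

x_train = np.linspace(0.0, 1.0, 16)
y_train = f(x_train) + 0.3 * rng.standard_normal(x_train.size)
x_test = rng.uniform(0.0, 1.0, 200)              # fresh points between the training inputs
y_test = f(x_test) + 0.3 * rng.standard_normal(x_test.size)

errors = {}
for degree in (1, 3, 5, 9, 15):
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE={train_mse:.4g}  test MSE={test_mse:.4g}")
```

With 16 training points, degree 15 has exactly enough coefficients to interpolate the data, so its training error collapses while its test error is dominated by oscillation between the points.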
Training vs. Validation: Diagnostic Curves
In practice, we do not have access to 200 independent training sets. Instead, we diagnose overfitting by watching learning curves — plots of training loss and validation loss over training epochs. These curves have characteristic shapes that immediately reveal whether a model is underfitting, overfitting, or well-tuned.
[Interactive figure: Learning Curves: Training vs Validation Loss. A complexity control shows how underfitting, good fit, and overfitting appear in the learning curves.]
Reading the Curves
| Regime | Training Loss | Validation Loss | Gap | What To Do |
|---|---|---|---|---|
| Underfitting (complexity 1–3) | Converges high | Converges high, near training | Small gap, both high | Increase model capacity, train longer, add features |
| Good Fit (complexity 4–6) | Converges low | Converges low, near training | Small gap, both low | Keep this configuration! |
| Overfitting (complexity 7–10) | Continues decreasing | Decreases then INCREASES | Growing gap | Add regularization, reduce capacity, get more data |
The critical diagnostic is the gap between the two curves. A small gap with high loss means underfitting. A growing gap means overfitting. A small gap with low loss is the sweet spot. In the bias-variance framework: the gap is dominated by variance, and the absolute level of the training loss reflects bias.
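The reading rules in the table can be turned into a rough automatic check. The function below is a hypothetical helper, and its thresholds (`high_loss`, `gap_tol`) are arbitrary values chosen for illustration; in practice they depend on the loss scale of the task.

```python
def diagnose(train_loss, val_loss, high_loss=0.5, gap_tol=0.1):
    """Classify a pair of learning curves (lists of per-epoch losses)."""
    final_train, final_val = train_loss[-1], val_loss[-1]
    gap = final_val - final_train
    # Validation loss that has climbed back above its best value suggests overfitting.
    val_rising = final_val > min(val_loss) + gap_tol
    if gap > gap_tol or val_rising:
        return "overfitting"
    if final_train > high_loss:
        return "underfitting"
    return "good fit"

# Synthetic curves, one per regime (made-up numbers).
curves = {
    "small model": ([1.00, 0.80, 0.70, 0.70], [1.05, 0.85, 0.75, 0.74]),
    "medium model": ([1.00, 0.40, 0.20, 0.15], [1.05, 0.45, 0.25, 0.20]),
    "large model": ([1.00, 0.30, 0.10, 0.02], [1.05, 0.40, 0.35, 0.60]),
}
verdicts = {name: diagnose(train, val) for name, (train, val) in curves.items()}
for name, verdict in verdicts.items():
    print(f"{name:<13s} {verdict}")
```

The overfitting test runs first because a large or growing gap matters regardless of how low the training loss is.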
The Modern Twist: Double Descent
Everything we have discussed so far follows the classical story: increase complexity, pass the sweet spot, and test error rises forever. But in 2019, researchers discovered something startling — if you keep increasing complexity far past the "overfitting zone," test error decreases again (Belkin et al., 2019; Nakkiran et al., 2020).
This phenomenon, called double descent, is organized around the interpolation threshold: the point where the number of parameters $p$ roughly equals the number of training examples $n$ ($p \approx n$). At this critical point, the model has just enough capacity to fit the training data exactly, but it does so by encoding every noise point, producing maximum instability. Past this threshold, in the over-parameterized regime, the model has so many solutions that it can interpolate the data while choosing a smooth, well-behaved one.
[Interactive figure: Double Descent Phenomenon]
Classical theory predicts that test error follows a U-shape: it decreases as model complexity grows (reducing bias) until overfitting kicks in (increasing variance). Double descent shows that beyond the interpolation threshold (where #parameters equals #data points), test error decreases again in the over-parameterized regime. Modern large models like GPT operate deep in this region.
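The peak at the interpolation threshold can be reproduced in a toy setting. The sketch below is not a neural network: it assumes a linear teacher with 400 Gaussian features whose weights decay as 1/j, fits only the first `p` features by least squares, and relies on `np.linalg.lstsq` returning the minimum-norm solution when `p` exceeds the number of training points. All sizes and the helper name are illustrative choices.

```python
import numpy as np

def min_norm_test_mse(p, n_train=40, n_test=500, d_total=400, sigma=0.3, seed=0):
    """Test MSE of least squares using only the first p of d_total features."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.arange(1, d_total + 1)       # early features matter most
    beta /= np.linalg.norm(beta)
    X_tr = rng.standard_normal((n_train, d_total))
    X_te = rng.standard_normal((n_test, d_total))
    y_tr = X_tr @ beta + sigma * rng.standard_normal(n_train)
    y_te = X_te @ beta + sigma * rng.standard_normal(n_test)
    # For p > n_train the system is under-determined and lstsq picks the
    # minimum-norm interpolating solution.
    w, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    return float(np.mean((X_te[:, :p] @ w - y_te) ** 2))

feature_counts = [5, 20, 40, 80, 400]            # 40 = number of training points
curve = {p: float(np.mean([min_norm_test_mse(p, seed=s) for s in range(20)]))
         for p in feature_counts}
for p, mse in curve.items():
    print(f"p={p:3d}  mean test MSE={mse:.3f}")
```

In this toy problem the error spikes at `p = 40` and falls again for larger `p`. How low the second descent goes depends on the problem; here it is not guaranteed to beat the best under-parameterized model.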
Why This Matters for Transformers
Modern large language models are usually placed deep in the over-parameterized regime. GPT-3 has 175 billion parameters and was trained on roughly 300 billion tokens, so it has more than one parameter for every two training tokens, a ratio that classical theory would consider wildly over-flexible. Models like LLaMA-2 (70B parameters) and GPT-4 have similarly enormous capacity. Double descent offers one explanation for why these massive models generalize well despite that capacity: the very flexibility that allows overfitting at the interpolation threshold also allows the model to find smooth, generalizable solutions when there are enough parameters.
However, this does not mean regularization is unnecessary. Even in the over-parameterized regime, regularization techniques like dropout, weight decay, and label smoothing improve generalization and training stability. Tirumala et al. (2022) showed that larger LLMs memorize training data faster but paradoxically avoid overfitting longer — and regularization controls this memorization-generalization balance.
Regularization in Modern Transformers
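Transformers use a carefully orchestrated combination of regularization techniques. The original "Attention Is All You Need" paper (Vaswani et al., 2017) describes dropout on each sub-layer's output and on the summed embeddings and positional encodings, plus label smoothing; common implementations also apply dropout to the attention weights and inside the feed-forward network, which gives the three dropout locations within a block traced here. Modern training recipes add weight decay via AdamW. Let us trace exactly where each regularization mechanism lives.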
The Four Regularization Mechanisms
- Attention Dropout: Applied to the softmax attention weights. During training with $p_{\text{drop}} = 0.1$, about 10% of attention connections are randomly zeroed, forcing the model to not rely on any single token-to-token relationship. This is like randomly cutting wires in a circuit: the circuit must be robust enough to work even when some connections fail.
- Residual Dropout: Applied to the output of each sub-layer (attention and FFN) before the residual addition. This means the model sometimes receives information purely from the skip connection, preventing over-reliance on any single layer.
- FFN Dropout: Applied inside the feed-forward network after the down-projection. The FFN contains the majority of a transformer's parameters (the expansion from $d_{\text{model}}$ to $d_{\text{ff}}$, typically $4\,d_{\text{model}}$, and back), so it is the most prone to memorization.
- Label Smoothing: Instead of training with hard targets (probability 1.0 for the correct class), the target distribution is softened: $q_k = (1 - \epsilon)\,\mathbb{1}[k = y] + \epsilon / K$ for $K$ classes. With $\epsilon = 0.1$, the correct class gets probability $0.9 + 0.1/K$, which is about 0.9 for a large vocabulary. This prevents the model from pushing logits toward infinity to achieve overconfident predictions.
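Two of the mechanisms in this list are simple enough to write out by hand. The NumPy sketch below shows inverted dropout (the rescaling variant used by most frameworks) and uniform label smoothing; the function names are illustrative, and the smoothing formula is the uniform-mixture form, which is an assumption about which variant is meant.

```python
import numpy as np

def inverted_dropout(x, p_drop, rng, training=True):
    """Zero each element with probability p_drop and rescale the survivors."""
    if not training or p_drop == 0.0:
        return x                                   # dropout is the identity at eval time
    keep = rng.random(x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)               # rescale so the expected value is unchanged

def smooth_labels(targets, num_classes, eps=0.1):
    """Mix one-hot targets with a uniform distribution over all classes."""
    one_hot = np.eye(num_classes)[targets]
    return (1.0 - eps) * one_hot + eps / num_classes

rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = inverted_dropout(activations, p_drop=0.1, rng=rng)
print("fraction zeroed:", float(np.mean(dropped == 0.0)))
print("mean after dropout:", float(dropped.mean()))

soft = smooth_labels(np.array([2]), num_classes=5, eps=0.1)
print("smoothed target:", soft[0])
```

The rescaling by `1 / (1 - p_drop)` is why no correction is needed at inference time, and the smoothed target still sums to one.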
Weight Decay with AdamW
The fifth mechanism acts at the optimizer level. AdamW (Loshchilov & Hutter, 2019) applies weight decay as a separate multiplicative shrinkage step: $\theta \leftarrow (1 - \eta\lambda)\,\theta$, with learning rate $\eta$ and decay coefficient $\lambda$, applied alongside the usual Adam update. The key insight: in adaptive optimizers like Adam, L2 regularization (adding $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss) is NOT equivalent to weight decay because the gradient of the L2 term, $\lambda\theta$, gets divided by the adaptive denominator $\sqrt{\hat{v}_t} + \epsilon$, making regularization weaker for parameters with large gradients. AdamW fixes this by decoupling the decay from the gradient update.
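The difference is easy to see on a single scalar parameter. The sketch below hand-implements Adam with an L2 penalty and Adam with decoupled decay; it is a simplified illustration, not the library implementation. As an assumption for the demo, the "data" gradient is pure zero-mean noise, so any systematic shrinkage of the parameter comes from regularization alone.

```python
import numpy as np

def run(grad_scale, decoupled, steps=2000, lr=1e-3, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Train one scalar with Adam; return its final value (it starts at 1.0)."""
    rng = np.random.default_rng(seed)
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale * rng.standard_normal()    # zero-mean data gradient
        if not decoupled:
            g = g + lam * theta                   # L2: penalty gradient joins the data gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        if decoupled:
            theta -= lr * lam * theta             # AdamW: shrink weights directly
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

l2_large, l2_small = run(10.0, decoupled=False), run(0.1, decoupled=False)
adamw_large, adamw_small = run(10.0, decoupled=True), run(0.1, decoupled=True)
print(f"Adam + L2: large-gradient param={l2_large:.3f}  small-gradient param={l2_small:.3f}")
print(f"AdamW:     large-gradient param={adamw_large:.3f}  small-gradient param={adamw_small:.3f}")
```

With the L2 penalty, the parameter with large gradients is barely regularized while the one with small gradients is pulled strongly toward zero. With decoupled decay, both shrink by the same factor regardless of gradient scale.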
Crucially, not all parameters receive weight decay. Bias terms and layer normalization parameters are excluded because regularizing them would fight against the model's ability to shift and scale its representations:
| Parameter Type | Weight Decay? | Why |
|---|---|---|
| Attention projections (W_q, W_k, W_v, W_o) | Yes (λ = 0.01) | Large matrices prone to memorization |
| FFN weight matrices | Yes (λ = 0.01) | Most parameters live here — highest overfitting risk |
| Bias terms | No | Too few parameters to cause overfitting |
| LayerNorm γ, β | No | Regularizing would counteract normalization |
| Embedding matrices | No (usually) | Embeddings need to be expressive, not small |
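In PyTorch, this split is usually expressed as two optimizer parameter groups. The sketch below assumes a toy model and a hypothetical helper `build_param_groups`; the rule it applies (no decay for biases, LayerNorm, and embeddings) mirrors the table above rather than any particular library default.

```python
import torch
from torch import nn

def build_param_groups(model, weight_decay=0.01):
    """Split parameters into a decayed group and a non-decayed group."""
    no_decay_types = (nn.LayerNorm, nn.Embedding)
    decay, no_decay = [], []
    for module in model.modules():
        # recurse=False visits only the parameters owned directly by this module.
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if isinstance(module, no_decay_types) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy model standing in for a transformer (illustrative only).
toy_model = nn.Sequential(
    nn.Embedding(100, 16),
    nn.Linear(16, 32),
    nn.LayerNorm(32),
    nn.Linear(32, 100),
)
groups = build_param_groups(toy_model)
optimizer = torch.optim.AdamW(groups, lr=3e-4)
print("decayed tensors:", len(groups[0]["params"]))
print("non-decayed tensors:", len(groups[1]["params"]))
```

Grouping by module type rather than by parameter name avoids relying on naming conventions, which differ between codebases.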
Implicit Regularization at Scale
Everything we have discussed so far is explicit regularization: a term you add to the loss, a layer you insert, a stopping rule you apply. At the scale of modern language models, a different mechanism takes over.
Belkin et al. (2019) and Nakkiran et al. (2020) showed that as model capacity grows past the interpolation threshold (parameters ≈ training examples) the test error decreases again: the "double descent" phenomenon we visualized above. The mechanism usually proposed is the implicit bias of stochastic gradient descent: among the infinitely many zero-training-loss solutions in the over-parameterized regime, SGD prefers low-norm, smooth interpolators. No L2 penalty needed; the optimizer's geometry does the regularizing.
Hoffmann et al. (2022), the Chinchilla paper, quantified the data side of this for language models: a 70B-parameter Chinchilla trained on 1.4T tokens generalizes better than the 280B-parameter Gopher trained on 300B tokens. Note the direction: Chinchilla sees about 20 tokens per parameter against roughly 1 for Gopher, so the better model is the less over-parameterized one relative to its data. Scale plus data plus implicit bias replaces explicit regularization. LLaMA-2 (Touvron et al., 2023) ships with weight decay, no dropout, and no label smoothing, not because regularization stopped mattering, but because implicit regularization from scale took over.
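The parameter and token counts quoted in this chapter give a quick tokens-per-parameter comparison. The figures below are the ones stated in the text (GPT-3 from the double descent discussion, Gopher and Chinchilla from the paragraph above); the dictionary layout is just for illustration.

```python
# (parameters, training tokens) as quoted in the text
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

tokens_per_param = {name: tokens / params for name, (params, tokens) in models.items()}
for name, ratio in tokens_per_param.items():
    print(f"{name:<11s} {ratio:5.1f} tokens per parameter")
```

Chinchilla comes out at about 20 tokens per parameter, while GPT-3 and Gopher sit below 2, so the model that generalizes better here is the one with far more data per parameter.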
Why this matters for the rest of the chapter. When you read about modern LLMs turning dropout off and pushing weight decay down, you are not seeing "regularization removed." You are seeing explicit regularization replaced by the implicit kind that emerges from scale.
Full Transformer Block with Regularization
Before reading the code, here is where each regularization mechanism sits in the architecture. Three dropout points (red), weight decay applied only to the Linear layers (cyan), nothing applied to LayerNorm or biases (yellow):
[Figure: Regularization Inside a Transformer Block. Shows where the three dropout points sit and which sublayers receive weight decay.]
The code below shows the full transformer block matching the diagram, plus the training setup with AdamW weight decay and label smoothing:
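The following PyTorch sketch shows one way such a block and training setup might look. Several choices are assumptions: the pre-LayerNorm layout, GELU activation, the small dimensions, and the simple rule that tensors with fewer than two dimensions (biases, LayerNorm parameters) skip weight decay. The three dropout points follow the list above: attention weights, attention output, and FFN output.

```python
import torch
from torch import nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256, p_drop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Dropout point 1: on the softmax attention weights (inside the module).
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p_drop, batch_first=True)
        # Dropout point 2: on the attention sub-layer output, before the residual add.
        self.resid_drop = nn.Dropout(p_drop)
        # Dropout point 3: after the FFN down-projection, before the residual add.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.resid_drop(attn_out)
        x = x + self.ffn(self.ln2(x))
        return x

vocab_size = 100
model = nn.Sequential(TransformerBlock(), nn.Linear(64, vocab_size))

# Weight decay only on matrices; biases and LayerNorm parameters (ndim < 2) are excluded.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One training step on random data, just to show the pieces fit together.
x = torch.randn(8, 16, 64)
targets = torch.randint(0, vocab_size, (8, 16))
model.train()
logits = model(x)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("loss after one step:", float(loss))
```

Calling `model.eval()` disables all three dropout points at once, which is why the same block gives deterministic outputs at inference time.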
Key Takeaways
- Underfitting means your model is too simple — like drawing a straight line through a winding road. It has high bias: systematically wrong regardless of training data.
- Overfitting means your model memorized the noise — like a map that records every fallen leaf. It has high variance: predictions change wildly with different training data.
- The bias-variance decomposition shows that expected test error = Bias² + Variance + Noise. At a fixed training-set size, changing model complexity typically lowers one of bias and variance while raising the other.
- Learning curves (training loss vs. validation loss) are the primary diagnostic tool. A growing gap signals overfitting; both curves converging high signals underfitting.
- Double descent shows that the classical U-curve is incomplete: in the over-parameterized regime (where modern transformers live), test error decreases again past the interpolation threshold.
- Modern transformers use five layers of regularization: attention dropout, residual dropout, FFN dropout, label smoothing, and weight decay (AdamW) — each targeting a different failure mode.
- At scale, explicit regularization (dropout, label smoothing) often becomes unnecessary or harmful — the implicit bias of SGD toward low-norm interpolators in the over-parameterized regime takes over (Belkin et al., 2019; Hoffmann et al., 2022; Touvron et al., 2023).
Looking Ahead: In the next section, we will dive into the two most impactful regularization techniques in practice — Dropout and Weight Decay — with full mathematical derivations, code implementations, and interactive visualizations that show exactly how they reshape the loss landscape.
References
- Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854.
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B. & Sutskever, I. (2020). Deep Double Descent: Where Bigger Models and More Data Hurt. ICLR 2020.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
- Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.
- Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (the "Chinchilla" paper). arXiv:2203.15556.
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.