Introduction
In the previous section, we saw that overfitting happens when a model memorizes noise instead of learning patterns. Now we turn to two of the most widely used weapons against overfitting: Dropout and Weight Decay. Between them, these techniques appear in nearly every modern neural network, from small classifiers to large language models.
Each technique attacks overfitting from a different angle. Dropout is a stochastic method — it randomly disables parts of the network during training. Weight decay is a deterministic method — it continuously shrinks all weights toward zero. Despite their different mechanisms, both achieve the same goal: forcing the model to learn simpler, more generalizable representations.
Dropout: The Unreliable Team
Imagine you are training a team of 10 people for a critical presentation. Every morning, you randomly send 5 of them home — they cannot attend today's rehearsal. The remaining 5 must figure out how to cover all roles by themselves. Tomorrow, a different random half shows up. The day after, yet another combination.
At first, this seems like chaos. But after weeks of this, something remarkable happens: every team member becomes a generalist. Nobody can afford to hyper-specialize because they might not be there tomorrow. Everyone learns to handle the introduction, the data analysis, and the conclusions — just in case. When the full team finally reunites for the actual presentation, they are far more robust than a team where Alice always does slides and Bob always does analysis. If Alice gets sick on presentation day, the team can still function.
The Biological Parallel: Srivastava et al. (2014) drew an explicit analogy to sexual reproduction in biology. In asexual reproduction, the entire genome passes to offspring unchanged. In sexual reproduction, genes must work well with random combinations of genes from the other parent. This forces genes to be "good team players" — individually useful in many contexts, not dependent on specific gene partners. Dropout creates the same pressure on neurons.
How Dropout Works Mathematically
During training, each neuron's output is independently set to zero with probability $p$ (the drop rate). The surviving activations are then scaled up by $1/(1-p)$ to preserve the expected output magnitude. This is called inverted dropout.
Formally, given a layer's activation vector $\mathbf{h}$, the dropout operation is:
$$\tilde{\mathbf{h}} = \frac{1}{1-p}\,\mathbf{m} \odot \mathbf{h}, \qquad m_i \sim \mathrm{Bernoulli}(1-p)$$
The scaling by $1/(1-p)$ is the clever trick. Since each element of $\mathbf{m}$ is 1 with probability $1-p$ and 0 with probability $p$:
$$\mathbb{E}[\tilde{h}_i] = \frac{(1-p) \cdot h_i + p \cdot 0}{1-p} = h_i$$
The expected value of each output element is exactly the original activation, so no scaling is needed at test time. This property makes inverted dropout the standard implementation in PyTorch, TensorFlow, and most other modern frameworks.
What Gets Dropped Where
| Layer Type | Drop Rate (p) | What Gets Dropped | Why This Rate |
|---|---|---|---|
| Fully-connected hidden | 0.5 | Individual neuron outputs | Original paper default; high capacity layers need strong regularization |
| Convolutional | 0.1–0.3 | Individual feature values (or entire channels) | Conv layers have fewer params per feature; lower rate is sufficient |
| Transformer attention weights | 0.1 | Individual attention connections | Applied after softmax; prevents over-reliance on specific token pairs |
| Transformer FFN | 0.1 | Individual FFN outputs | Applied after the down-projection in the 4× expansion |
| Transformer residual | 0.1 | Sub-layer output before residual add | Sometimes forces the model to rely on the skip connection |
| Input layer | 0.0–0.2 | Individual input features | Rarely used; corrupting inputs can hurt more than help |
Dropout Inside a Neural Network
[Interactive figure: Dropout Visualization. A small neural network with dropout applied, switchable between training mode (random neurons are dropped) and inference mode (all neurons active), with an adjustable dropout rate.]
During training, each hidden neuron is randomly "dropped" (set to zero) with probability $p$ (0.5 in the figure's default setting). This means each forward pass uses a different random subset of the network.
This prevents neurons from co-adapting too much to each other. Each neuron must learn useful features independently, without relying on specific other neurons being present.
Notice how at high dropout rates (0.7+), only a skeleton of the network remains. Each training step uses a different skeleton. A network with $n$ hidden units has $2^n$ possible sub-networks, and dropout implicitly trains this exponential ensemble.
The Ensemble Interpretation
Why does randomly breaking a network make it better? The answer lies in ensemble learning. Averaging the predictions of many different models usually generalizes better than relying on any single one of them. The problem is that training thousands of separate models is expensive.
Dropout gives us an ensemble almost for free. Each dropout mask defines a different sub-network, a different "expert." With $n$ hidden units, there are $2^n$ possible masks, hence $2^n$ sub-networks sharing the same parameters. Each mini-batch trains a different sub-network. At test time, running the full network with no mask (the $1/(1-p)$ training-time scaling takes the place of test-time weight scaling) approximates the geometric mean of all these sub-networks' predictions: an ensemble of exponentially many models, trained with a single set of shared weights.
[Interactive figure: Dropout as Ensemble Training. Each dropout mask creates a different sub-network, so a network with $n$ hidden units has $2^n$ possible sub-networks. With 4 hidden units there are 16 possible dropout masks; dropout samples a different one each mini-batch, effectively training an exponential ensemble.]
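To make the ensemble picture concrete, the sketch below enumerates every dropout mask for a tiny linear layer and averages the sub-network outputs. The activations, weights, and the helper `subnetwork_output` are made up for illustration. For a purely linear unit with $p = 0.5$ the arithmetic average over all masks matches the unmasked network exactly; once nonlinearities are involved, the full network is only an approximation of the ensemble.

```python
from itertools import product

def subnetwork_output(h, w, mask, p):
    """Output of one thinned linear unit under inverted dropout."""
    scale = 1.0 / (1.0 - p)
    return sum(w_i * m_i * h_i * scale for w_i, m_i, h_i in zip(w, mask, h))

# Hypothetical 4-unit hidden layer feeding one linear output unit.
h = [0.5, -1.0, 2.0, 0.25]   # hidden activations
w = [1.0, 0.5, -0.3, 2.0]    # output weights
p = 0.5                      # drop rate; at p = 0.5 every mask is equally likely

masks = list(product([0, 1], repeat=len(h)))   # all 2^n keep/drop patterns
outputs = [subnetwork_output(h, w, m, p) for m in masks]

ensemble_mean = sum(outputs) / len(masks)
full_network = sum(w_i * h_i for w_i, h_i in zip(w, h))

print(len(masks))                    # 16 sub-networks for n = 4
print(ensemble_mean, full_network)   # the two values agree up to rounding
```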
Gal & Ghahramani (2016, §3) made this ensemble view precise. Treat the network's weights $W$ as a random variable with a posterior $p(W \mid \mathcal{D})$ conditioned on the data $\mathcal{D}$. Each dropout mask samples an approximate weight configuration $\hat{W}_t \sim q(W)$. Averaging predictions over many masks at test time (keeping dropout active during inference, repeating the forward pass $T$ times, and averaging the outputs) is approximate Monte-Carlo Bayesian inference, $p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \hat{W}_t)$. This is "MC Dropout", and it gives approximate uncertainty estimates from any network you trained with dropout, which is useful for active learning, anomaly detection, medical diagnosis, and autonomous driving.
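The MC Dropout procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer network, its random (untrained) weights `W1` and `W2`, and the function names are hypothetical stand-ins for a model that was actually trained with dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained 2-layer MLP; in practice these weights come from
# a network that was trained with dropout.
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))

def forward_with_dropout(x, p, rng):
    """One stochastic forward pass with inverted dropout on the hidden layer."""
    hidden = np.maximum(0.0, x @ W1)          # ReLU
    mask = rng.random(hidden.shape) >= p      # keep with probability 1 - p
    hidden = hidden * mask / (1.0 - p)
    return hidden @ W2

def mc_dropout_predict(x, p=0.5, num_samples=200, rng=rng):
    """Average T stochastic passes; the spread is a rough uncertainty signal."""
    samples = np.stack([forward_with_dropout(x, p, rng) for _ in range(num_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

x = np.array([[0.2, -0.4, 1.0]])
mean, std = mc_dropout_predict(x)
print(mean.shape, std.shape)   # one predictive mean and one spread per output
```

The standard deviation across passes is only a heuristic measure of uncertainty here; how well it is calibrated depends on the model and the data.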
Implementing Dropout from Scratch
The implementation of dropout is surprisingly simple: just a binary mask and a scaling factor. The forward pass masks and scales the activations, and the backward pass applies the same mask and the same scaling to the incoming gradients.
The key insight: the scaling by $1/(1-p)$ during training means no modification is needed at test time. This is why it is called "inverted" dropout: the compensation happens during training rather than at inference, simplifying deployment.
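A minimal NumPy sketch of the forward and backward pass follows. The function names `dropout_forward` and `dropout_backward` and the array sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return h, None                               # inference: identity, no mask
    mask = (rng.random(h.shape) >= p) / (1.0 - p)    # entries are 0 or 1/(1-p)
    return h * mask, mask

def dropout_backward(grad_out, mask):
    """Gradient flows only through kept units, with the same scaling."""
    return grad_out if mask is None else grad_out * mask

rng = np.random.default_rng(42)
h = np.ones((1000, 100))
out, mask = dropout_forward(h, p=0.5, rng=rng)
grad_in = dropout_backward(np.ones_like(out), mask)
eval_out, eval_mask = dropout_forward(h, p=0.5, rng=rng, training=False)

print(out.mean())   # close to 1.0: the expected activation is preserved
```

Returning the mask from the forward pass is a design choice that lets the backward pass reuse it without re-sampling, which is what keeps the two passes consistent.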
Dropout in PyTorch with nn.Dropout
The NumPy implementation made the math explicit. In practice you reach for nn.Dropout, which handles the inverted scaling for you and switches itself off in eval mode.
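As a small illustration of the two modes, the snippet below runs the same `nn.Dropout` module in training and in eval mode. The tensor shape and seed are arbitrary; only `nn.Dropout`, `.train()`, and `.eval()` are PyTorch's own API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)

drop.train()                 # training mode: random zeros, survivors scaled by 1/(1-p)
train_out = drop(x)

drop.eval()                  # eval mode: dropout is the identity
eval_out = drop(x)

print(train_out.unique())             # only 0.0 and 2.0 can appear
print(torch.equal(eval_out, x))       # True
```

In a full model you would call `model.train()` and `model.eval()` on the parent module, which switches every nested dropout layer at once.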
Weight Decay: The Tax on Complexity
Imagine every parameter in your model must pay a "tax" proportional to its magnitude. A weight of 5.0 pays more tax than a weight of 0.1. To justify its cost, a large weight must significantly reduce the loss — otherwise the tax pushes it back toward zero. Parameters that don't earn their keep get taxed into irrelevance.
Another way to think about it: every weight has an elastic band attached to the origin. The band pulls the weight toward zero with a force proportional to the weight's magnitude ($F = -\lambda w$). During training, the data gradient pulls the weight away from zero (to fit the data), while the elastic band pulls it back (to keep things simple). The equilibrium is a compromise between fitting and simplicity.
Occam's Razor in Code: Weight decay is the mathematical implementation of "prefer simpler explanations." A model with large weights is making bold, specific claims about the data. Weight decay penalizes boldness, preferring models that make gentle, conservative predictions — which tend to generalize better. Only parameters that genuinely reduce the loss survive the tax.
The Mathematics of Weight Decay
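Weight decay adds a penalty term to the loss function that is proportional to the squared magnitude of all weights:
$$L_{\text{total}}(\mathbf{w}) = L_{\text{data}}(\mathbf{w}) + \frac{\lambda}{2}\,\|\mathbf{w}\|_2^2$$
Taking the gradient with respect to $\mathbf{w}$:
$$\nabla_{\mathbf{w}} L_{\text{total}} = \nabla_{\mathbf{w}} L_{\text{data}} + \lambda \mathbf{w}$$
The SGD update with learning rate $\eta$ becomes:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\left(\nabla_{\mathbf{w}} L_{\text{data}}(\mathbf{w}_t) + \lambda \mathbf{w}_t\right) = (1 - \eta\lambda)\,\mathbf{w}_t - \eta\,\nabla_{\mathbf{w}} L_{\text{data}}(\mathbf{w}_t)$$
The factor $(1 - \eta\lambda)$ is the "decay": each step, every weight is multiplied by this factor before the gradient update. When $\eta\lambda = 10^{-5}$ (for example $\eta = 10^{-3}$ and $\lambda = 10^{-2}$), the multiplicative factor is $0.99999$, meaning each weight decays by 0.001% per step. Over many thousands of steps, this gentle pull has a dramatic cumulative effect.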
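The arithmetic can be checked directly. The sketch below uses illustrative values with $\eta\lambda = 10^{-5}$ (the specific pair is an assumption) and confirms two things: the L2-gradient form and the "shrink, then step" form give the same update, and the tiny per-step decay compounds to a large shrinkage.

```python
import math

eta, lam = 1e-3, 1e-2          # illustrative values with eta * lam = 1e-5
decay = 1.0 - eta * lam        # multiplicative shrink per step

def sgd_step_l2(w, grad_data):
    """SGD on the L2-regularized loss: gradient is grad_data + lam * w."""
    return w - eta * (grad_data + lam * w)

def sgd_step_decay(w, grad_data):
    """Same update written as 'shrink, then take the data-gradient step'."""
    return decay * w - eta * grad_data

w0, g = 2.0, 0.3
same_update = math.isclose(sgd_step_l2(w0, g), sgd_step_decay(w0, g))

# With zero data gradient, only the decay acts: w_t = decay**t * w_0.
steps = 100_000
remaining = decay ** steps     # fraction of the original weight left

print(same_update, remaining)  # remaining is about exp(-1), roughly 0.37
```

So a weight that receives no support from the data loses almost two thirds of its magnitude after $1/(\eta\lambda)$ steps.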
Bayesian Interpretation
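L2 regularization has an elegant Bayesian interpretation: it is equivalent to placing a Gaussian prior on the weights centered at zero:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma^2 I) \quad\Longrightarrow\quad -\log p(\mathbf{w}) = \frac{\|\mathbf{w}\|_2^2}{2\sigma^2} + \text{const}$$
Comparing with the L2 penalty $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$, we get $\lambda = 1/\sigma^2$. A strong regularization ($\lambda$ large) corresponds to a tight prior ($\sigma$ small), that is, a strong belief that weights should be near zero. A weak regularization corresponds to a wide prior: a willingness to let weights grow large if the data demands it.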
The MAP (Maximum A Posteriori) estimate with this Gaussian prior is exactly the L2-regularized loss minimum. This is not coincidence — it is the same optimization problem viewed from two complementary perspectives: frequentist (penalty) and Bayesian (prior).
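For linear regression the equivalence can be verified numerically. The sketch below assumes a Gaussian likelihood with unit noise variance and a prior standard deviation `sigma` chosen purely for illustration; the data is synthetic. It checks that the ridge solution with $\lambda = 1/\sigma^2$ is a stationary point of the negative log posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=50)   # unit noise variance

sigma = 0.5                  # prior std of each weight (illustrative choice)
lam = 1.0 / sigma**2         # matching L2 strength

# Minimizer of 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2 (ridge regression).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of -log p(y | X, w) - log p(w), dropping constants."""
    return X.T @ (X @ w - y) + w / sigma**2

grad_at_ridge = neg_log_posterior_grad(w_ridge)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # unregularized fit, for comparison

print(np.abs(grad_at_ridge).max())                    # numerically zero
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols)) # ridge has the smaller norm
```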
Geometric View: The Constraint Sphere
The L2 penalty is equivalent to constraining the weights to lie within a hypersphere centered at the origin. The regularization strength $\lambda$ controls the radius: larger $\lambda$ means a smaller sphere (a tighter constraint on weight magnitude).
The visualization below shows this geometrically. The elliptical contours represent the data loss (where the model wants the weights to be), and the circle represents the L2 constraint (where regularization forces them). The optimum lies where a loss contour just touches the circle: the best fit within the constraint.
[Interactive figure: Weight Decay (L2 Regularization) Effect. Blue ellipses show contours of constant data loss, and the blue dot marks where the data loss is minimized. Orange circles show contours of constant regularization penalty, centered at the origin. The green dot shows where the total loss (data + regularization) is minimized.]
Notice how increasing λ pulls the optimum closer to the origin, shrinking the weight magnitudes. This prevents weights from growing too large, which helps prevent overfitting.
The contour plot above shows the geometry of the constraint. The animation below shows the same effect in the time domain: a histogram of all model weights compressing toward zero as weight decay does its work, at a speed controlled by the decay strength.
[Interactive figure: Weight magnitude under weight decay. The weight distribution compresses toward zero as training progresses.]
There is a deeper geometric insight from Goodfellow et al. (2016): weight decay rescales parameters along the eigenvectors of the Hessian. Parameters along directions with large eigenvalues (directions that strongly affect the loss) are barely shrunk. Parameters along directions with small eigenvalues (directions the loss is insensitive to) are aggressively shrunk toward zero. Weight decay selectively compresses the model's capacity in exactly the directions that don't matter for fitting the data.
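The eigenvector picture can be checked on a small quadratic loss. In the sketch below the Hessian `H`, the unregularized optimum `w_star`, and the strength `alpha` are hypothetical values; the point is that the regularized optimum equals the unregularized one with each eigen-component scaled by $\lambda_i / (\lambda_i + \alpha)$, where $\lambda_i$ are the Hessian eigenvalues.

```python
import numpy as np

# Hypothetical quadratic data loss: L(w) = 0.5 * (w - w_star)^T H (w - w_star).
H = np.array([[10.0, 2.0],
              [2.0, 0.5]])
w_star = np.array([1.0, 1.0])    # unregularized optimum
alpha = 0.1                      # weight decay strength

# Minimizer of L(w) + 0.5 * alpha * ||w||^2.
w_reg = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Same answer via the eigenbasis: scale each component by eig / (eig + alpha).
eigvals, Q = np.linalg.eigh(H)   # eigenvalues in ascending order
shrink = eigvals / (eigvals + alpha)
w_eig = Q @ (shrink * (Q.T @ w_star))

print(eigvals)   # one small and one large eigenvalue
print(shrink)    # the small-eigenvalue direction is shrunk far more
```

With these numbers the high-curvature direction keeps about 99% of its value while the flat direction keeps roughly half.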
Adam vs. AdamW: Why L2 ≠ Weight Decay
For SGD, L2 regularization and weight decay are mathematically equivalent — they produce identical updates. But for adaptive optimizers like Adam, they are fundamentally different, and getting this wrong can significantly hurt training.
The problem: in Adam with L2 regularization, the regularization gradient gets fed into the adaptive moment estimates. This means it gets divided by $\sqrt{v}$, the square root of the running average of squared gradients. Parameters with a large gradient history (large $v$) receive weaker regularization, while parameters with small gradients receive disproportionately strong regularization. This is an unintended, parameter-dependent distortion.
AdamW (Loshchilov & Hutter, 2019) fixes this by applying weight decay directly to the weights, completely bypassing the adaptive scaling. The visualization below shows both optimizers converging on the same loss surface — notice how their trajectories and final positions differ:
What practitioners actually saw. Before AdamW, training deep networks such as ResNets with Adam plus an L2 penalty often underperformed SGD with momentum on image classification: faster initial convergence, worse final test error. Loshchilov & Hutter (2019, §1) traced this gap to the asymmetry described above: Adam's per-parameter rescaling shrinks the L2 contribution most aggressively for parameters with the largest second-moment estimate, which is the opposite of what regularization should do. AdamW removed the asymmetry by decoupling the decay step, and the gap largely closed.
[Interactive figure: Adam + L2 vs. AdamW (Decoupled Weight Decay), same loss and same hyperparameters but different regularization behavior. Panel "Adam + L2 Regularization": the L2 term gets divided by √v, so decay is uneven. Panel "AdamW (Decoupled Weight Decay)": decay is applied directly, uniformly on all axes.]
Adam + L2 adds the weight penalty to the gradient before adaptive scaling. Because the second moment v differs per parameter, the effective regularization is uneven — parameters with large gradients (high v) get less decay, while parameters with small gradients get more decay. This distorts the intended regularization.
AdamW decouples weight decay from the adaptive gradient step. Every parameter is decayed by the same proportion (1 - lr×λ) regardless of gradient magnitude. This gives uniform, predictable regularization — matching what we actually want from weight decay.
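In code, the two approaches differ in a single place: Adam + L2 adds $\lambda w$ to the gradient before the moment estimates are updated, whereas AdamW leaves the gradient alone and subtracts $\eta\lambda w$ from the weights alongside the adaptive step.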
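The following NumPy sketch implements one step of each optimizer so the difference is visible side by side. The function names, default hyperparameters, and the pre-populated optimizer state in the demo are all assumptions chosen for illustration; the zero data gradient is there only to isolate the decay term.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, lam=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with an L2 penalty: lam * w enters the gradient, so it is rescaled."""
    g = grad + lam * w                       # <-- coupling happens here
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, lam=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: moments see only the data gradient; decay is applied to w directly."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    adaptive = lr * m_hat / (np.sqrt(v_hat) + eps)
    return w - adaptive - lr * lam * w, m, v   # <-- decay bypasses the rescaling

w = np.array([1.0, 1.0])
grad = np.zeros(2)                 # zero data gradient isolates the decay term
m = np.zeros(2)
v = np.array([100.0, 1e-4])        # hypothetical history: large vs. small past gradients
t = 1000

w_l2, _, _ = adam_l2_step(w, grad, m, v, t)
w_aw, _, _ = adamw_step(w, grad, m, v, t)
shrink_l2 = w - w_l2
shrink_aw = w - w_aw

print(shrink_l2)   # uneven: far smaller where v is large
print(shrink_aw)   # identical on both parameters: lr * lam * w
```

Both weights start at the same value, yet under Adam + L2 the parameter with the large gradient history is barely decayed, while AdamW shrinks both by the same amount.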
A practical consequence: with AdamW, the best weight decay value depends far less on the learning rate, so the two can be tuned largely separately, which is a major simplification. With Adam+L2, changing the learning rate changes the effective regularization strength, making hyperparameter tuning a tangled mess.
The no_decay filter pattern
In most modern transformer training loops you will see a small helper that splits parameters into two AdamW groups: one decayed at the standard rate (for example $\lambda = 0.1$, the value used in the LLaMA recipe later in this section) and one with $\lambda = 0$. The no-decay group covers biases and LayerNorm parameters, because decaying these one-dimensional scale/shift parameters tends to harm rather than help. The convention is commonly traced to GPT-3 (Brown et al., 2020, Appendix B) and is now the default in HuggingFace Transformers, fairseq, and reference implementations like nanoGPT.
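One way such a helper might look in PyTorch is sketched below. The helper name `build_param_groups`, the toy model, and the rule "tensors with fewer than two dimensions are not decayed" are assumptions for illustration; real codebases differ in how they select the no-decay set (some match on parameter names instead).

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a transformer block.
model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))

def build_param_groups(model, weight_decay=0.1):
    """Split parameters by dimensionality: matrices decay, 1-D tensors do not."""
    decay, no_decay = [], []
    for _, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (decay if param.ndim >= 2 else no_decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},   # biases, LayerNorm scale/shift
    ]

groups = build_param_groups(model)
optimizer = torch.optim.AdamW(groups, lr=3e-4, betas=(0.9, 0.95))

print(len(groups[0]["params"]), len(groups[1]["params"]))   # 2 decayed, 4 not decayed
```

Note that a pure dimensionality rule also decays embedding matrices; whether that is desirable is a per-project decision.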
Dropout and Weight Decay in Transformers
The original Transformer (Vaswani et al., 2017) uses dropout at three locations with rate $p = 0.1$:
- After attention weights (post-softmax, pre-V multiply): prevents over-reliance on specific token-to-token relationships
- After each sub-layer output (before residual addition): occasionally forces the model to rely purely on the skip connection
- After embedding + positional encoding sum: regularizes the input representation
Plus label smoothing with $\epsilon_{ls} = 0.1$ as an output-level regularizer.
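Label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below uses one common formulation, in which a fraction $\epsilon$ of the probability mass is spread uniformly over all classes; the function name and the 5-class example are illustrative, and the original paper's exact variant may differ in detail.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == target_index else uniform
        for k in range(num_classes)
    ]

targets = smooth_labels(target_index=2, num_classes=5)
print(targets)   # the true class gets 0.92, every other class 0.02
```

Because the target never reaches probability 1, the model is discouraged from producing extremely confident logits.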
Why Modern LLMs Remove Dropout
A striking development: LLaMA, GPT-3, Mistral, PaLM, and most modern large language models use zero dropout. This seems counterintuitive — why remove a regularizer? The reasons are revealing:
| Reason | Explanation |
|---|---|
| Scale provides implicit regularization | With billions of parameters and trillions of tokens, each training example is seen only 1–4 times. There is little opportunity to memorize. |
| Dropout adds noise to gradients | The stochastic masks increase gradient variance, requiring more steps to converge. At scale, this computational cost is substantial. |
| Distributed training complications | Synchronizing dropout masks across tensor/pipeline parallelism adds complexity. |
| Weight decay suffices | AdamW with λ=0.1 provides enough regularization for large-scale training. |
| Dropout returns for fine-tuning | When adapting a large model to a small dataset, dropout (or LoRA dropout) is reintroduced. |
Modern Transformer Training Recipe
Here is the configuration used by LLaMA (Touvron et al., 2023), representative of current best practices for large language models:
| Hyperparameter | Value | Why |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay; standard for all transformer training |
| Learning rate | 3e-4 (7B/13B), 1.5e-4 (33B/65B) | Smaller models tolerate higher LR |
| LR schedule | Cosine decay to 10% of peak | Smooth decay prevents sudden loss spikes |
| Warmup | 2,000 steps | Prevents early-training instability |
| Weight decay | 0.1 | Stronger than vision models (λ=0.0001) because LLMs have more capacity |
| β₁, β₂ | 0.9, 0.95 | Lower β₂ than default (0.999) for faster adaptation |
| Gradient clipping | 1.0 | Prevents exploding gradients from rare batches |
| Dropout | 0.0 | No dropout — weight decay and data scale provide sufficient regularization |
| Label smoothing | 0.0 | Not used in autoregressive LMs (helpful for classification) |
| Batch size | 4M tokens | Large batches reduce gradient noise, acting as implicit regularization |
Notice the careful balance: no dropout, but strong weight decay (λ=0.1). The implicit regularization from massive training data, large batch sizes, and weight decay replaces the explicit stochasticity of dropout. For smaller models or fine-tuning on limited data, dropout is still the go-to technique.
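The learning-rate rows of the table can be turned into a schedule function. This is a sketch under stated assumptions: the warmup is taken to be linear, and `total_steps` is a hypothetical value since the table does not give a step count; the function and dictionary names are likewise illustrative.

```python
import math

def llama_style_lr(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
                   final_fraction=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    min_lr = final_fraction * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine

# Remaining recipe values from the table, collected as a plain dict.
config = {"weight_decay": 0.1, "betas": (0.9, 0.95), "grad_clip": 1.0, "dropout": 0.0}

print(llama_style_lr(0), llama_style_lr(2000), llama_style_lr(100_000))
```

The schedule starts at zero, reaches the peak at the end of warmup, and ends at 10% of the peak.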
The implicit-regularization angle
At the scale of GPT-3 / LLaMA / Chinchilla there is also an implicit regularizer at work that has nothing to do with the techniques in this section: SGD's bias toward low-norm interpolators in the over-parameterized regime (Belkin et al., 2019; Nakkiran et al., 2020) and compute-optimal scaling (Hoffmann et al., 2022). See §12.1, Implicit Regularization at Scale, for the full story.
Key Takeaways
- Dropout randomly zeroes neuron outputs during training (rate $p$), scales survivors by $1/(1-p)$, and does nothing at inference. It implicitly trains an ensemble of up to $2^n$ sub-networks sharing the same parameters.
- Weight decay shrinks all weights toward zero each step by a factor $(1 - \eta\lambda)$. It is equivalent to a Gaussian prior on the weights (Bayesian view) or constraining weights to a hypersphere (geometric view).
- L2 ≠ weight decay for Adam. In Adam+L2, the regularization gradient is distorted by adaptive scaling. AdamW decouples the decay, giving uniform regularization and independent hyperparameter tuning.
- Not all parameters should be regularized. Weight decay is applied to weight matrices but not to biases or LayerNorm parameters.
- Dropout prevents co-adaptation — neurons learn individually useful features rather than developing fragile co-dependencies.
- Modern LLMs use weight decay without dropout. The combination of massive data, large batches, and AdamW (λ=0.1) provides sufficient regularization. Dropout returns for fine-tuning on small datasets.
- At scale, the implicit regularization from SGD's low-norm bias and Chinchilla-style data-rich training replaces explicit dropout (see §12.1, Implicit Regularization at Scale).
Looking Ahead: In the next section, we will explore two more regularization techniques — Early Stopping and Data Augmentation — which control overfitting through training dynamics and data manipulation rather than model modification.
References
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15(56), 1929–1958.
- Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.
- Krogh, A. & Hertz, J. A. (1991). A Simple Weight Decay Can Improve Generalization. NIPS 1991 (NeurIPS 4).
- Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
- Brown, T. B. et al. (2020). Language Models are Few-Shot Learners (GPT-3), Appendix B (parameter-group split convention). NeurIPS 2020.
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.
- Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.