Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

In the previous section, we saw that overfitting happens when a model memorizes noise instead of learning patterns. Now we turn to two of the most widely used defenses against overfitting: Dropout and Weight Decay. One or both appear in most modern neural networks, from small classifiers to large transformer language models.

Each technique attacks overfitting from a different angle. Dropout is a stochastic method — it randomly disables parts of the network during training. Weight decay is a deterministic method — it continuously shrinks all weights toward zero. Despite their different mechanisms, both achieve the same goal: forcing the model to learn simpler, more generalizable representations.

Dropout: The Unreliable Team

Imagine you are training a team of 10 people for a critical presentation. Every morning, you randomly send 5 of them home — they cannot attend today's rehearsal. The remaining 5 must figure out how to cover all roles by themselves. Tomorrow, a different random half shows up. The day after, yet another combination.

At first, this seems like chaos. But after weeks of this, something remarkable happens: every team member becomes a generalist. Nobody can afford to hyper-specialize because they might not be there tomorrow. Everyone learns to handle the introduction, the data analysis, and the conclusions — just in case. When the full team finally reunites for the actual presentation, they are far more robust than a team where Alice always does slides and Bob always does analysis. If Alice gets sick on presentation day, the team can still function.

The Biological Parallel: Srivastava et al. (2014) drew an explicit analogy to sexual reproduction in biology. In asexual reproduction, the entire genome passes to offspring unchanged. In sexual reproduction, genes must work well with random combinations of genes from the other parent. This forces genes to be "good team players" — individually useful in many contexts, not dependent on specific gene partners. Dropout creates the same pressure on neurons.

How Dropout Works Mathematically

During training, each neuron's output is independently set to zero with probability $p$ (the drop rate). The surviving activations are then scaled up by $1/(1-p)$ to preserve the expected output magnitude. This is called inverted dropout.

Formally, given a layer's activation vector $\mathbf{h}$ , the dropout operation is:

$\text{Training: } \quad \mathbf{m} \sim \text{Bernoulli}(1-p), \quad \tilde{\mathbf{h}} = \frac{\mathbf{m} \odot \mathbf{h}}{1-p}$

$\text{Inference: } \quad \tilde{\mathbf{h}} = \mathbf{h} \quad \text{(no modification)}$

The scaling by $1/(1-p)$ is the clever trick. Since each element of $\mathbf{m}$ is 1 with probability $(1-p)$ and 0 with probability $p$ :

$\mathbb{E}[\tilde{h}_i] = \frac{(1-p) \cdot h_i}{1-p} = h_i$

The expected value of each output element is exactly the original activation — no scaling needed at test time. This property makes inverted dropout the standard implementation in PyTorch, TensorFlow, and every modern framework.
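The identity above can be checked numerically. The sketch below is a minimal illustration, not part of the original material: the helper name `inverted_dropout`, the seed, and the sample count are all assumptions chosen for the example.

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Apply inverted dropout with drop rate p to the activation vector h."""
    keep_prob = 1.0 - p
    mask = rng.random(h.shape) < keep_prob   # True with probability 1 - p
    return mask * h / keep_prob              # zero the dropped, rescale the kept

rng = np.random.default_rng(0)               # illustrative seed
h = np.array([1.5, -0.8, 2.1, 0.3])
p = 0.5

# Average many independent dropout samples of the same activations.
samples = np.stack([inverted_dropout(h, p, rng) for _ in range(20_000)])
empirical_mean = samples.mean(axis=0)

print("h              :", h)
print("mean of h_tilde:", empirical_mean.round(3))
```

Each individual sample is either zero or twice the original value, yet the average over many masks settles close to `h`, which is what the expectation formula predicts.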

What Gets Dropped Where

Layer Type	Drop Rate (p)	What Gets Dropped	Why This Rate
Fully-connected hidden	0.5	Individual neuron outputs	Original paper default; high capacity layers need strong regularization
Convolutional	0.1–0.3	Individual feature values (or entire channels)	Conv layers have fewer params per feature; lower rate is sufficient
Transformer attention weights	0.1	Individual attention connections	Applied after softmax; prevents over-reliance on specific token pairs
Transformer FFN	0.1	Individual FFN outputs	Applied after the down-projection in the 4× expansion
Transformer residual	0.1	Sub-layer output before residual add	Regularizes each sub-layer's contribution; the skip connection still carries the signal when the sub-layer output is dropped
Input layer	0.0–0.2	Individual input features	Rarely used; corrupting inputs can hurt more than help
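The convolutional row distinguishes dropping individual feature values from dropping entire channels. The NumPy sketch below shows the difference in mask shape only; the tensor layout, seed, and drop rate are illustrative assumptions. In PyTorch, `nn.Dropout2d` provides the channel-wise variant.

```python
import numpy as np

rng = np.random.default_rng(0)             # illustrative seed
p = 0.2                                    # drop rate
keep_prob = 1.0 - p
x = rng.normal(size=(2, 8, 4, 4))          # (batch, channels, height, width)

# Element-wise dropout: one Bernoulli draw per feature value.
elem_mask = rng.random(x.shape) < keep_prob
x_elem = x * elem_mask / keep_prob

# Channel-wise dropout: one draw per (sample, channel), broadcast over H and W.
chan_mask = rng.random((2, 8, 1, 1)) < keep_prob
x_chan = x * chan_mask / keep_prob

# Fraction of zeroed values in each feature map under channel-wise dropout.
zeroed_per_map = (x_chan == 0).reshape(2, 8, -1).mean(axis=2)
print(zeroed_per_map)
```

Under channel-wise dropout every feature map is either fully kept or fully zeroed, so the printed fractions are all 0 or 1. This is often preferred for convolutional layers because neighboring values within one feature map are strongly correlated.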

Dropout Inside a Neural Network

The visualization below shows a small neural network with dropout applied. Toggle between training mode (random neurons are dropped) and inference mode (all neurons active). Adjust the dropout rate to see how aggressively the network is thinned:

[Interactive figure: Dropout Visualization. A small network with input, hidden, and output layers, a training/inference mode toggle, and a dropout-rate slider from 0 (no dropout) to 0.9 (heavy). Example state at p = 0.50: 4 of 10 neurons dropped, 24 of 50 connections active, keep probability 0.50.]

How Dropout Works

During Training: Each hidden neuron is randomly "dropped" (set to zero) with probability p = 0.50. This means each forward pass uses a different random subset of the network.

This prevents neurons from co-adapting too much to each other. Each neuron must learn useful features independently, without relying on specific other neurons being present.

Notice how at high dropout rates (0.7+), only a skeleton of the network remains. Each training step uses a different skeleton. A network with $n$ hidden units has $2^n$ possible sub-networks — dropout implicitly trains this exponential ensemble.

The Ensemble Interpretation

Why does randomly breaking a network make it better? The answer lies in ensemble learning. Averaging the predictions of many diverse models usually generalizes better than relying on any single one of them. The problem is that training thousands of separate models is expensive.

Dropout gives us an ensemble almost for free. Each dropout mask defines a different sub-network, a different "expert." With $n$ hidden units, there are $2^n$ possible masks, hence $2^n$ sub-networks sharing the same parameters. Every training step samples fresh masks, so only a tiny fraction of these sub-networks is ever trained explicitly, and all of them update the same shared weights. At test time, the full network approximates the geometric mean of the sub-networks' predictions. In the original formulation this requires scaling the weights by the keep probability $1-p$ at test time; with inverted dropout the $1/(1-p)$ scaling has already been applied during training, so the full network is used as is.
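For one simple case the geometric-mean claim can be verified exactly by brute force. The sketch below assumes a single softmax layer sitting on top of the dropped units and a drop rate of 0.5, so that all $2^n$ masks are equally likely; the weights, sizes, and seed are made up for illustration. For deeper networks with nonlinearities after the dropout layer, the full network only approximates the ensemble.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)           # illustrative seed
n_units, n_classes, keep_prob = 4, 3, 0.5
h = rng.normal(size=n_units)             # activations that dropout is applied to
W = rng.normal(size=(n_classes, n_units))
b = rng.normal(size=n_classes)

# Enumerate all 2^4 = 16 masks and record each sub-network's log-probabilities.
log_probs = []
for bits in itertools.product([0, 1], repeat=n_units):
    mask = np.array(bits, dtype=float)
    h_tilde = mask * h / keep_prob       # inverted dropout
    log_probs.append(np.log(softmax(W @ h_tilde + b)))

# Geometric mean of the 16 predictions, renormalized to sum to 1.
geo_mean = np.exp(np.mean(log_probs, axis=0))
geo_mean /= geo_mean.sum()

full = softmax(W @ h + b)                # full network, no dropout, no extra scaling
print("ensemble geometric mean:", geo_mean.round(4))
print("full network           :", full.round(4))
```

The two printed vectors agree up to floating-point error, which is the sense in which the single full network stands in for the whole ensemble.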

[Interactive figure: Dropout as Ensemble Training. The full network (no dropout) is shown beside eight sampled sub-networks, out of the 2⁴ = 16 possible with four hidden units at p = 0.5, with a dropout-rate slider from 0.0 to 0.8.]

A network with n hidden units has 2ⁿ possible sub-networks. With 4 hidden units, there are 16 possible dropout masks. Dropout samples a different sub-network each mini-batch, effectively training an exponential ensemble.

Gal & Ghahramani (2016, §3) made this ensemble view precise. Treat the network's weights as a random variable with a posterior $p(\theta \mid D)$ conditioned on the data $D$. Each dropout mask $m$ samples an approximate weight configuration $\theta_m = \theta \odot m$. Averaging predictions over many masks at test time (keeping dropout active during inference, repeating the forward pass, and averaging the outputs) is approximate Monte-Carlo Bayesian inference, $p(y \mid x, D) \approx \frac{1}{M} \sum_{i=1}^{M} f(x; \theta_{m_i})$. This is "MC Dropout". The spread of the sampled outputs gives an approximate uncertainty estimate from any network trained with dropout, which is useful for active learning, anomaly detection, medical diagnosis, and autonomous driving.
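The MC Dropout procedure can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: the random weights stand in for a trained network, and the layer sizes, the tanh activation, the seed, and the sample count $M = 200$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)                          # illustrative seed
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)    # stand-in "trained" weights
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
x = rng.normal(size=4)
p, M = 0.5, 200                                         # drop rate, number of MC samples

def forward(x, dropout_active):
    """One forward pass; dropout stays on when dropout_active is True."""
    h = np.tanh(W1 @ x + b1)
    if dropout_active:
        mask = rng.random(h.shape) < (1.0 - p)
        h = mask * h / (1.0 - p)
    return W2 @ h + b2

# Repeat the stochastic forward pass M times on the same input.
outputs = np.stack([forward(x, dropout_active=True) for _ in range(M)])
mc_mean = outputs.mean(axis=0)    # Monte-Carlo estimate of the prediction
mc_std = outputs.std(axis=0)      # spread across masks, a rough uncertainty signal

print("MC mean:", mc_mean.round(3))
print("MC std :", mc_std.round(3))
print("eval   :", forward(x, dropout_active=False).round(3))
```

In PyTorch the same idea amounts to leaving the dropout layers in training mode at inference time and averaging repeated forward passes.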

Implementing Dropout from Scratch

The implementation of dropout is surprisingly simple: just a binary mask and a scaling factor. The line-by-line walkthrough below covers both the forward pass (masking and scaling) and the backward pass (same mask, same scaling), with concrete values at each step. The full listing follows the walkthrough.

Inverted Dropout: Forward & Backward (dropout_from_scratch.py)

`import numpy as np`

NumPy provides the random number generation (np.random.binomial) and array operations needed to implement dropout efficiently as element-wise array operations rather than Python loops.

EXECUTION STATE

numpy = Numerical computing library — arrays, random sampling, element-wise ops

`def dropout_forward(x, drop_rate, training=True)`

The forward pass of inverted dropout. During training, randomly zeros out elements and scales survivors by 1/keep_prob. During inference, passes input through unchanged. This ‘inverted’ approach means no weight modification is needed at test time.

EXECUTION STATE

⬇ input: x = Activations from the previous layer. Shape can be any: (batch, features), (batch, seq, d_model), etc. Dropout operates element-wise.

⬇ input: drop_rate = Probability of DROPPING each element. Common values: 0.1 (transformers), 0.5 (fully-connected). drop_rate=0 means no dropout.

⬇ input: training = If True: apply stochastic masking. If False: pass through unchanged. PyTorch’s model.eval() sets this to False automatically.

⬆ returns = (out, mask) — the masked-and-scaled output, plus the binary mask (needed for backprop).

`if not training or drop_rate == 0.0: return x, None`

At inference time (training=False), dropout is a no-op: return the input unchanged. This is the key advantage of inverted dropout — the model produces correct outputs without any weight scaling at test time.

EXECUTION STATE

→ Why no-op? = During training, we scale by 1/keep_prob. This ensures E[out] = x at training time. So at test time, just passing x through gives the correct expected value.

`keep_prob = 1.0 - drop_rate`

Convert drop rate to keep probability. If drop_rate=0.5, then keep_prob=0.5 — each element survives with 50% chance.

EXECUTION STATE

keep_prob = 1.0 - 0.5 = 0.5 = Half the neurons will be kept, half will be zeroed. This is the ‘classic’ dropout rate from Srivastava et al. (2014).

`mask = np.random.binomial(1, keep_prob, size=x.shape)`

Generate a binary mask by sampling from the Bernoulli distribution independently for each element. Each element is 1 (keep) with probability keep_prob or 0 (drop) with probability drop_rate.

EXECUTION STATE

📚 np.random.binomial(n, p, size) = Draws from Binomial(n, p) distribution. With n=1, this is Bernoulli(p): returns 0 or 1. Each element of the output array is sampled independently.

⬇ arg: n = 1 = Number of trials per element. With 1 trial, each output is either 0 or 1 — a coin flip.

⬇ arg: keep_prob = 0.5 = Probability of success (keeping the element). Each element independently survives with this probability.

⬇ arg: size = x.shape = (8,) = Generate one mask value per input element. The mask has the same shape as x.

⬆ mask (example) = [1, 0, 1, 1, 0, 0, 1, 0] — 4 of 8 elements kept (exactly 50% this time, but it’s random)

`out = (x * mask) / keep_prob`

Two operations in one line: (1) zero out dropped elements via element-wise multiply with mask, then (2) scale survivors by 1/keep_prob to maintain the expected sum. This is ‘inverted dropout’ — the scaling happens during training, not at test time.

EXECUTION STATE

x * mask (element-wise) = [1.5×1, -0.8×0, 2.1×1, 0.3×1, -1.2×0, 0.7×0, 1.9×1, -0.5×0] = [1.5, 0, 2.1, 0.3, 0, 0, 1.9, 0]

/ keep_prob = / 0.5 = ×2 = Scale up survivors by 2× to compensate for the 50% that were dropped. This preserves the expected sum: E[∑ out] = ∑ x.

⬆ out = [3.0, 0, 4.2, 0.6, 0, 0, 3.8, 0] — survivors are doubled, dropped are zero

→ Why scale? = Without scaling: E[out_i] = keep_prob × x_i = 0.5 × x_i (half the expected value). With scaling: E[out_i] = keep_prob × x_i / keep_prob = x_i (correct!). The next layer receives inputs with the right expected magnitude.

`return out, mask`

Return the masked output and the mask itself. The mask must be saved for backpropagation — during the backward pass, gradients are masked with the SAME mask (and same scaling).

EXECUTION STATE

⬆ return: out (8,) = [3.0, 0.0, 4.2, 0.6, 0.0, 0.0, 3.8, 0.0]

⬆ return: mask (8,) = [1, 0, 1, 1, 0, 0, 1, 0] — saved for backward pass

`def dropout_backward(grad_output, mask, drop_rate)`

Backpropagation through dropout: apply the SAME mask from the forward pass. Neurons that were dropped in forward get zero gradient in backward. Survivors get their gradient scaled by 1/keep_prob, just like in the forward pass.

EXECUTION STATE

⬇ input: grad_output = Gradients flowing back from the next layer. Same shape as x.

⬇ input: mask = The SAME binary mask from the forward pass. Crucially, we do NOT resample — the same neurons that were dropped in forward are dropped in backward.

⬆ returns = (grad_output * mask) / keep_prob, the same masking and scaling as the forward pass. This is the exact derivative of the forward pass with the mask held fixed.

`return (grad_output * mask) / keep_prob`

Element-wise: grad_output[i] is passed through if mask[i]=1 (and scaled by 1/keep_prob), or zeroed if mask[i]=0. This is the derivative of the forward pass: d(x*mask/keep_prob)/dx = mask/keep_prob. Note that the divisor is the keep probability, not the drop rate p.

EXECUTION STATE

Derivative of dropout = out = x * mask / keep_prob, so d(out)/d(x) = mask / keep_prob. The backward pass is simply grad_output * mask / keep_prob.

`np.random.seed(42)`

Fix the random seed so the mask is reproducible. In real training you seed at most once and let the generator state advance, so every forward pass draws a fresh mask and each batch sees a DIFFERENT sub-network.

`x = np.array([1.5, -0.8, 2.1, 0.3, -1.2, 0.7, 1.9, -0.5])`

A vector of 8 activations from some hidden layer. These are the values that will be randomly masked. Notice both positive and negative values — dropout treats them identically.

EXECUTION STATE

x = [1.5, -0.8, 2.1, 0.3, -1.2, 0.7, 1.9, -0.5]

sum(x) = 1.5 + (-0.8) + 2.1 + 0.3 + (-1.2) + 0.7 + 1.9 + (-0.5) = 4.0

`print("Input x:", x)`

Display the original input vector before dropout.

EXECUTION STATE

Output = Input x: [ 1.5 -0.8 2.1 0.3 -1.2 0.7 1.9 -0.5]

`print("Sum(x):", x.sum().round(4))`

The sum before dropout. After inverted dropout, the EXPECTED sum should be the same.

EXECUTION STATE

Output = Sum(x): 4.0

`out_train, mask = dropout_forward(x, drop_rate=0.5, training=True)`

Apply 50% dropout during training. About half the elements will be zeroed, survivors will be doubled (scaled by 1/0.5 = 2).

EXECUTION STATE

drop_rate = 0.5 = Drop 50% of elements. This is the classic rate from the original dropout paper for hidden layers.

training = True = Activate stochastic masking. Each call generates a DIFFERENT random mask.

`print("Mask:", mask)`

The randomly generated binary mask. 1 = kept, 0 = dropped.

EXECUTION STATE

Output = Mask: [1 0 1 1 0 0 1 0] (illustrative; the mask you get depends on the random state)

→ Kept elements = x[0]=1.5, x[2]=2.1, x[3]=0.3, x[6]=1.9 — 4 of 8 survived

`print("Train out:", out_train)`

The output after masking and scaling: survivors are doubled, dropped are zero.

EXECUTION STATE

Output = Train out: [3.0 0. 4.2 0.6 0. 0. 3.8 0. ]

`print("Sum(out):", ...)`

Sum of dropout output. Due to the random mask, this particular sum won’t exactly equal sum(x)=4.0, but ON AVERAGE across many masks it will.

EXECUTION STATE

Output = Sum(out): 11.6 (this particular mask kept 4 elements and doubled them)

→ Expected sum = E[sum(out)] = sum(x) = 4.0. Any single mask may overshoot or undershoot, but the expectation is preserved.

`out_test, _ = dropout_forward(x, drop_rate=0.5, training=False)`

At inference time: dropout is disabled. The input passes through unchanged. No mask is generated (returns None). This is the beauty of inverted dropout — test time is trivial.

EXECUTION STATE

training = False = No masking, no scaling. The function returns x unmodified. This is why inverted dropout is preferred over standard dropout — no weight scaling needed at inference.

`print("Test out:", out_test)`

At test time, the output equals the input exactly.

EXECUTION STATE

Output = Test out: [ 1.5 -0.8 2.1 0.3 -1.2 0.7 1.9 -0.5]

→ Verify = out_test == x? Yes! No modification at test time.


```python
import numpy as np

def dropout_forward(x, drop_rate, training=True):
    """Inverted dropout: scale during training, no-op at inference."""
    if not training or drop_rate == 0.0:
        return x, None

    keep_prob = 1.0 - drop_rate

    # Bernoulli(keep_prob) mask: 1 = keep, 0 = drop
    mask = np.random.binomial(1, keep_prob, size=x.shape)

    # Zero the dropped elements and rescale the survivors by 1/keep_prob
    out = (x * mask) / keep_prob

    return out, mask

def dropout_backward(grad_output, mask, drop_rate):
    """Backprop through dropout: same mask, same scaling."""
    if mask is None:
        # The forward pass was a no-op, so the gradient passes through unchanged
        return grad_output
    keep_prob = 1.0 - drop_rate
    return (grad_output * mask) / keep_prob

# --- Example with concrete values ---
np.random.seed(42)
x = np.array([1.5, -0.8, 2.1, 0.3, -1.2, 0.7, 1.9, -0.5])

print("Input x:", x)
print("Sum(x):", x.sum().round(4))
print()

# Training: 50% dropout
out_train, mask = dropout_forward(x, drop_rate=0.5, training=True)
print("Mask:       ", mask)
print("Train out:  ", out_train)
print("Sum(out):   ", out_train.sum().round(4))
print("E[sum(out)]:", x.sum().round(4), "(preserved!)")
print()

# Inference: no dropout
out_test, _ = dropout_forward(x, drop_rate=0.5, training=False)
print("Test out:   ", out_test)
print("Sum(test):  ", out_test.sum().round(4))
```

The key insight: the scaling by $1/(1-p)$ during training means no modification is needed at test time. This is why it is called "inverted" dropout — the compensation happens during training rather than at inference, simplifying deployment.
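The backward rule (same mask, same scaling) can be sanity-checked against finite differences. This is a minimal sketch under stated assumptions: the upstream gradient `c`, the seed, and the step size `eps` are arbitrary choices, and the mask is sampled once and then held fixed, exactly as it is within a single training step.

```python
import numpy as np

rng = np.random.default_rng(0)                     # illustrative seed
x = np.array([1.5, -0.8, 2.1, 0.3, -1.2, 0.7, 1.9, -0.5])
keep_prob = 0.5
mask = rng.binomial(1, keep_prob, size=x.shape)    # sampled once, then held fixed
c = rng.normal(size=x.shape)                       # arbitrary upstream gradient

def loss(x):
    """Scalar loss whose gradient with respect to the dropout output is c."""
    out = x * mask / keep_prob
    return np.sum(c * out)

# Analytic gradient from the backward rule: grad_output * mask / keep_prob.
analytic = c * mask / keep_prob

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = eps
    numeric[i] = (loss(x + step) - loss(x - step)) / (2 * eps)

print("analytic:", analytic.round(4))
print("numeric :", numeric.round(4))
```

Dropped positions receive exactly zero gradient, and the surviving positions match the numerical estimate.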

Dropout in PyTorch with `nn.Dropout`

The NumPy implementation made the math explicit. In practice you reach for nn.Dropout, which handles the inverted scaling for you and switches itself off in eval mode.

Dropout with nn.Dropout and model.train() / model.eval() (dropout_pytorch.py)

`import torch`

PyTorch core.

`import torch.nn as nn`

Layer modules including Dropout.

`torch.manual_seed(0)`

Reproducible random masks across runs.

EXECUTION STATE

seed = 0

`model = nn.Sequential(`

Build a small network from four modules: Linear(8→16) → ReLU → Dropout(0.5) → Linear(16→4). Dropout sits between the hidden layer and the output layer.

`nn.Linear(8, 16),`

Maps 8-dim input to 16 hidden units; weight matrix shape (16, 8) plus a 16-dim bias.

EXECUTION STATE

params = 8·16 + 16 = 144

`nn.ReLU(),`

Element-wise max(0, x). No parameters.

`nn.Dropout(p=0.5),`

Bernoulli mask with keep probability 0.5; in training mode every element is independently zeroed with prob 0.5 and surviving elements are multiplied by 1/(1-p) = 2 (inverted-dropout scaling). In eval mode this layer becomes the identity.

EXECUTION STATE

p = 0.5

`nn.Linear(16, 4),`

Final layer to 4 outputs.

EXECUTION STATE

params = 16·4 + 4 = 68

`)`

End of Sequential.

`x = torch.ones(1, 8)`

Fixed input — every forward pass starts from the same x so any output difference is purely from dropout's random mask.

EXECUTION STATE

x.shape = torch.Size([1, 8])

`model.train()`

Sets self.training = True on the model and all submodules. Tells nn.Dropout to actually drop. Required before training; also useful for Monte-Carlo dropout at inference.

`out_train_a = model(x)`

Forward pass with one random mask. Two runs of the SAME input produce DIFFERENT outputs — that randomness is the regularizer.

EXECUTION STATE

out_train_a (typical) = tensor([[ 0.84, -1.21, 0.07, 2.10]])

`out_train_b = model(x)  # different!`

Same input, different mask, different output. Critical to understand: at training time the function the network computes is stochastic.

EXECUTION STATE

out_train_b (typical) = tensor([[ 1.15, -0.83, 0.41, 1.74]])

`print("train pass 1:", …)`

`print("train pass 2:", …)`

Two distinct vectors confirm the mask is sampled fresh on every forward pass.

`model.eval()`

Sets self.training = False on the model and all submodules. nn.Dropout becomes the identity, and batch-norm layers (for example nn.BatchNorm1d) switch to their running statistics. ALWAYS call it before validation or testing.

`out_eval_a = model(x)`

Deterministic forward pass.

EXECUTION STATE

out_eval_a (typical) = tensor([[ 0.97, -1.04, 0.22, 1.91]])

`out_eval_b = model(x)`

Same as out_eval_a — eval mode is deterministic given fixed weights and input.

EXECUTION STATE

out_eval_a == out_eval_b = True

`print("eval pass 1:", …)`

`print("eval pass 2:", …)`

Identical output — eval mode is deterministic.

`n_runs = 1000`

Large enough sample size to verify the expected-value identity E[ỹ] = y.

EXECUTION STATE

n_runs = 1000

`model.train()`

Re-enable dropout for averaging.

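`train_avg = sum(model(x) for _ in range(n_runs)) / n_runs`

Average over 1000 stochastic forward passes. By the law of large numbers this converges to the expected training-mode output. Here that expectation equals the eval-mode output because the only layer after dropout is linear; with a nonlinearity after the dropout layer the match would be approximate rather than exact.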

EXECUTION STATE

train_avg.shape = torch.Size([1, 4])

`model.eval()`

Switch back for the deterministic comparison.

`eval_avg = model(x)`

Single deterministic forward pass.

`print("avg train ≈ eval?", torch.allclose(train_avg, eval_avg, atol=0.1))`

Should print True for this model. It is an empirical check that PyTorch's nn.Dropout applies inverted scaling, so you never need to multiply by 1/(1-p) yourself.

EXECUTION STATE

expected output = avg train ≈ eval? True


```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Define a tiny model with dropout between layers
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 4),
)

x = torch.ones(1, 8)  # fixed input

# TRAINING MODE: dropout is active
model.train()
out_train_a = model(x)
out_train_b = model(x)  # different! same input, different mask
print("train pass 1:", out_train_a.detach().numpy().round(3))
print("train pass 2:", out_train_b.detach().numpy().round(3))

# EVALUATION MODE: dropout becomes the identity
model.eval()
out_eval_a = model(x)
out_eval_b = model(x)
print("eval  pass 1:", out_eval_a.detach().numpy().round(3))
print("eval  pass 2:", out_eval_b.detach().numpy().round(3))

# Verify expected magnitude: average of many train forwards ≈ eval output
# (PyTorch's nn.Dropout uses inverted scaling, so this works automatically)
n_runs = 1000
model.train()
train_avg = sum(model(x) for _ in range(n_runs)) / n_runs
model.eval()
eval_avg = model(x)
print("avg train ≈ eval?", torch.allclose(train_avg, eval_avg, atol=0.1))
```

Weight Decay: The Tax on Complexity

Imagine every parameter in your model must pay a "tax" proportional to its magnitude. A weight of 5.0 pays more tax than a weight of 0.1. To justify its cost, a large weight must significantly reduce the loss — otherwise the tax pushes it back toward zero. Parameters that don't earn their keep get taxed into irrelevance.

Another way to think about it: every weight has an elastic band attached to the origin. The band pulls the weight toward zero with a force proportional to the weight's magnitude ( $-\lambda w$ ). During training, the data gradient pulls the weight away from zero (to fit the data), while the elastic band pulls it back (to keep things simple). The equilibrium is a compromise between fitting and simplicity.

Occam's Razor in Code: Weight decay is the mathematical implementation of "prefer simpler explanations." A model with large weights is making bold, specific claims about the data. Weight decay penalizes boldness, preferring models that make gentle, conservative predictions — which tend to generalize better. Only parameters that genuinely reduce the loss survive the tax.

The Mathematics of Weight Decay

Weight decay adds a penalty term to the loss function that is proportional to the squared magnitude of all weights:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2$

Taking the gradient with respect to $w$ :

$\frac{\partial \mathcal{L}_{\text{total}}}{\partial w} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w} + \lambda w$

The SGD update becomes:

$w_{t+1} = w_t - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \lambda w_t \right) = (1 - \eta \lambda) w_t - \eta \frac{\partial \mathcal{L}}{\partial w}$

The factor $(1 - \eta \lambda)$ is the "decay": at each step, every weight is multiplied by this factor before the gradient update. With $\eta = 0.001$ and $\lambda = 0.01$, the factor is $0.99999$, so each weight decays by 0.001% per step. A single step is negligible, and even 1,000 steps shrink a weight by only about 1%, but the effect compounds: after 100,000 steps a weight with no opposing gradient would fall to $0.99999^{100000} \approx e^{-1} \approx 0.37$ of its starting value.
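The algebra of the SGD update can be checked with plain arithmetic. In this small sketch the weight and gradient values are made up; only $\eta = 0.001$ and $\lambda = 0.01$ come from the text.

```python
import math

eta, lam = 0.001, 0.01
w, grad = 2.0, 0.5            # illustrative weight and data gradient

# The two algebraically equivalent forms of the SGD update.
l2_form = w - eta * (grad + lam * w)
decay_form = (1 - eta * lam) * w - eta * grad
print("L2 form   :", l2_form)
print("decay form:", decay_form)

# Cumulative shrink factor when no data gradient opposes the decay.
factor = 1 - eta * lam
for steps in (1_000, 100_000, 1_000_000):
    print(f"{steps:>9} steps -> weight multiplied by {factor ** steps:.6f}")
```

The last loop makes the time scale visible: the decay only becomes substantial once the number of steps is comparable to $1/(\eta\lambda)$.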

Bayesian Interpretation

L2 regularization has an elegant Bayesian interpretation: it is equivalent to placing a Gaussian prior on the weights centered at zero:

$p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{2\sigma^2} \|\mathbf{w}\|_2^2 + \text{const}$

Comparing with the L2 penalty $\frac{\lambda}{2} \|\mathbf{w}\|^2$ , we get $\lambda = 1/\sigma^2$ . A strong regularization ( $\lambda$ large) corresponds to a tight prior ( $\sigma$ small) — a strong belief that weights should be near zero. A weak regularization corresponds to a wide prior — willingness to let weights grow large if the data demands it.

The MAP (Maximum A Posteriori) estimate with this Gaussian prior is exactly the L2-regularized loss minimum. This is not coincidence — it is the same optimization problem viewed from two complementary perspectives: frequentist (penalty) and Bayesian (prior).
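The equivalence can be illustrated on a problem where both sides have a closed form. The sketch below assumes a linear model with unit-variance Gaussian noise, so the data loss is the negative log-likelihood $\frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2$; the synthetic data, the prior width `sigma`, and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)                       # illustrative seed
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)

sigma = 0.5                  # prior standard deviation (illustrative)
lam = 1.0 / sigma**2         # matching L2 strength, lambda = 1 / sigma^2

# Minimizer of 0.5*||Xw - y||^2 + 0.5*lam*||w||^2 in closed form.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of 0.5*||Xw - y||^2 + ||w||^2 / (2*sigma^2)."""
    return X.T @ (X @ w - y) + w / sigma**2

# The L2-regularized solution is a stationary point of the negative log posterior.
grad_at_ridge = neg_log_posterior_grad(w_ridge)

# Unregularized (maximum-likelihood) solution for comparison.
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

print("gradient of -log posterior at w_ridge:", grad_at_ridge.round(10))
print("||w_ridge|| =", np.linalg.norm(w_ridge).round(4))
print("||w_mle||   =", np.linalg.norm(w_mle).round(4))
```

The gradient vanishes at the ridge solution, confirming that it is the MAP estimate, and its norm is smaller than that of the unregularized fit, as the prior demands.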

Geometric View: The Constraint Sphere

The L2 penalty has an equivalent constrained form: minimize the data loss subject to $\|\mathbf{w}\|_2^2 \leq c$, which confines the weights to a ball centered at the origin. The regularization strength $\lambda$ controls the radius: larger $\lambda$ corresponds to a smaller ball (a tighter constraint on weight magnitude).

The visualization below shows this geometrically. The elliptical contours represent the data loss (where the model wants the weights to be), and the circle represents the L2 constraint (where regularization forces them). The regularized optimum lies where the lowest reachable loss contour just touches the circle, which is the best fit within the constraint:

[Interactive figure: Weight Decay (L2 Regularization) Effect. Data-loss contours and regularization contours, with a slider for λ from 0 (no regularization) to 2 (strong). Example state at λ = 0.50: unregularized optimum θ* = (2.00, 1.50) with ‖θ*‖ = 2.50; regularized optimum θ̃* = (1.00, 1.20) with ‖θ̃*‖ = 1.56; weights shrunk toward the origin by 37.5%. The figure labels the total loss as L = L_data + λ·‖θ‖².]

Understanding Weight Decay

Blue ellipses show contours of constant data loss (where data loss is the same). The blue dot marks where data loss is minimized. Orange circles show contours of constant regularization penalty (centered at origin). The green dot shows where the total loss (data + regularization) is minimized. Notice how increasing λ pulls the optimum closer to the origin, shrinking the weight magnitudes. This prevents weights from growing too large, which helps prevent overfitting.

The contour plot above shows the geometry of the constraint. The animation below shows the same effect in the time domain: a histogram of all model weights compressing toward zero as weight decay does its work. Drag $\lambda$ to feel how decay strength controls the speed of compression.

[Interactive figure: Weight magnitude under weight decay. A histogram of the model's weights compresses toward zero as training progresses, with a slider for λ (shown at 0.010).]

There is a deeper geometric insight from Goodfellow et al. (2016): weight decay rescales parameters along the eigenvectors of the Hessian. Parameters along directions with large eigenvalues (directions that strongly affect the loss) are barely shrunk. Parameters along directions with small eigenvalues (directions the loss is insensitive to) are aggressively shrunk toward zero. Weight decay selectively compresses the model's capacity in exactly the directions that don't matter for fitting the data.
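For a quadratic data loss this eigenvector picture can be verified directly. The sketch below assumes a loss $\frac{1}{2}(\mathbf{w} - \mathbf{w}^*)^\top H (\mathbf{w} - \mathbf{w}^*)$ with a hand-picked Hessian; the eigenvalues, rotation angle, optimum, and $\lambda$ are illustrative. Setting the gradient of the regularized loss to zero gives $(H + \lambda I)\tilde{\mathbf{w}} = H\mathbf{w}^*$, so each eigen-direction is scaled by $\frac{\text{eigenvalue}}{\text{eigenvalue} + \lambda}$.

```python
import numpy as np

eigvals = np.array([10.0, 0.1])           # Hessian eigenvalues (illustrative)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal eigenvectors
H = Q @ np.diag(eigvals) @ Q.T
w_star = np.array([2.0, 1.5])             # unregularized optimum
lam = 0.5

# Minimizer of 0.5*(w - w*)^T H (w - w*) + 0.5*lam*||w||^2
w_reg = np.linalg.solve(H + lam * np.eye(2), H @ w_star)

# Express both optima in the eigenvector basis and compare component-wise.
shrink = (Q.T @ w_reg) / (Q.T @ w_star)
predicted = eigvals / (eigvals + lam)

print("observed shrink factors :", shrink.round(4))
print("eigval / (eigval + lam) :", predicted.round(4))
```

The component along the high-curvature direction keeps about 95% of its value, while the component along the flat direction keeps only about 17%.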

Adam vs. AdamW: Why L2 ≠ Weight Decay

For plain SGD, L2 regularization and weight decay are mathematically equivalent: they produce identical updates. But for adaptive optimizers like Adam, they are fundamentally different, and getting this wrong can significantly hurt training.

The problem: in Adam with L2 regularization, the regularization gradient $\lambda w$ gets fed into the adaptive moment estimates. This means it gets divided by $\sqrt{v_t}$ — the running average of squared gradients. Parameters with large gradient history (large $v_t$ ) receive weaker regularization, while parameters with small gradients receive disproportionately strong regularization. This is an unintended, parameter-dependent distortion.

AdamW (Loshchilov & Hutter, 2019) fixes this by applying weight decay directly to the weights, completely bypassing the adaptive scaling. The visualization below shows both optimizers converging on the same loss surface — notice how their trajectories and final positions differ:

What practitioners actually saw. Before AdamW, training image classifiers such as ResNets with Adam plus an L2 penalty tended to underperform SGD with momentum: faster initial convergence, worse final test error. Loshchilov & Hutter (2019, §1) traced this gap to the asymmetry described above: Adam's per-parameter rescaling shrinks the L2 contribution most aggressively for parameters with the largest second-moment estimate, which is the opposite of what regularization should do. AdamW removed the asymmetry by decoupling the decay step, which substantially narrowed the gap.
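The size of this asymmetry is easy to see on a single step. The sketch below is a simplification under stated assumptions: it looks only at the first Adam step, where bias correction makes the second-moment estimate equal to the squared total gradient, and the weights, gradients, and hyperparameters are made up for illustration.

```python
import numpy as np

lr, lam, eps = 0.1, 0.01, 1e-8
w = np.array([1.0, 1.0])        # two weights of equal magnitude
g = np.array([10.0, 0.1])       # very different data-gradient scales

# Adam + L2: the penalty gradient lam*w is folded into the total gradient,
# so it is divided by sqrt(v_hat) along with everything else.
g_total = g + lam * w
v_hat = g_total ** 2            # bias-corrected second moment after one step
l2_shrink = lr * lam * w / (np.sqrt(v_hat) + eps)

# AdamW: the decay bypasses the adaptive scaling entirely.
adamw_shrink = lr * lam * w

print("Adam + L2 shrink per weight:", l2_shrink)
print("AdamW shrink per weight    :", adamw_shrink)
```

With Adam + L2, the weight with the large gradient is barely regularized while the weight with the small gradient is regularized far more strongly. AdamW shrinks both by the same amount.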

[Interactive figure: Adam + L2 vs. AdamW (Decoupled Weight Decay). Both optimizers run on the same loss surface with the same hyperparameters, with sliders for weight decay λ (0 to 0.5) and learning rate (0.01 to 0.5). Example state after 100 steps with λ = 0.15 and learning rate 0.15: Adam + L2 ends at (1.743, 1.462) with ||w|| = 2.275; AdamW ends at (1.706, 1.322) with ||w|| = 2.158.]

Adam + L2 update rule (as shown in the figure): w ← w - lr × (grad + λw) / √(v + ε). The L2 term is divided by √v, so the decay is uneven.

AdamW update rule (as shown in the figure): w ← (1 - lr×λ)w - lr × grad / √(v + ε). The decay is applied directly, so it is uniform on all axes.

Adam + L2 adds the weight penalty to the gradient before adaptive scaling. Because the second moment v differs per parameter, the effective regularization is uneven — parameters with large gradients (high v) get less decay, while parameters with small gradients get more decay. This distorts the intended regularization.

AdamW decouples weight decay from the adaptive gradient step. Every parameter is decayed by the same proportion (1 - lr×λ) regardless of gradient magnitude. This gives uniform, predictable regularization — matching what we actually want from weight decay.

The code below implements both approaches so you can see exactly where the decoupling happens:

L2 Regularization vs. Weight Decay: SGD and AdamW (weight_decay_comparison.py)

`import torch`

PyTorch is used here for tensor operations. While NumPy could work for pure math, using PyTorch shows how these optimizers connect to real training code.

`def sgd_with_l2(params, grads, lr, lambda_)`

SGD with L2 regularization: the gradient of the L2 penalty (λ·w) is ADDED to the data gradient, and the combined gradient is used for the update. This is mathematically equivalent to weight decay for SGD (but NOT for Adam).

EXECUTION STATE

⬇ input: params = List of weight tensors. Each tensor is a model parameter to be updated.

⬇ input: grads = List of gradients. grads[i] = ∂L_data/∂params[i] (gradient of the data loss only).

⬇ input: lr = Learning rate η. Controls step size.

⬇ input: lambda_ = Regularization strength λ. Larger = more weight shrinkage.

`total_grad = g + lambda_ * w`

The L2 penalty ½λ||w||² has gradient λ·w. This is ADDED to the data gradient. For Adam, this combined gradient would then be divided by √v — which distorts the regularization. For SGD, it’s fine.

EXECUTION STATE

g = Data gradient: [0.5, 0.1] — the signal from the loss function

λ * w = 0.01 × [2.0, -1.5] = [0.02, -0.015] — the regularization gradient, pulling weights toward zero

⬆ total_grad = [0.52, 0.085] — combined gradient = data + regularization

`w_new = w - lr * total_grad`

Standard SGD update with the combined gradient.

EXECUTION STATE

w_new = [2.0, -1.5] - 0.1 × [0.52, 0.085] = [1.948, -1.5085]

`def sgd_with_weight_decay(params, grads, lr, lambda_)`

SGD with weight decay: instead of modifying the gradient, shrink the weights DIRECTLY by the factor (1 - lr·λ) each step. For SGD, this produces IDENTICAL results to L2. The algebraic equivalence: w - lr*(g + λw) = (1-lrλ)*w - lr*g.

EXECUTION STATE

⬆ returns = Updated params. For SGD: mathematically identical to L2 approach.

`w_new = (1 - lr * lambda_) * w - lr * g`

The weight decay formulation: first shrink w by (1 - lr·λ), then subtract the scaled data gradient. This is WHERE Adam+L2 and AdamW diverge: in AdamW, the gradient step goes through adaptive scaling but the decay does not, whereas in Adam+L2 both do.

EXECUTION STATE

(1 - lr * lambda_) = (1 - 0.001) = 0.999 = Multiplicative shrinkage factor. Each step, weights shrink by 0.1%. Over 1000 steps: 0.999¹⁰⁰⁰ ≈ 0.368 (weights decay to ~37% of original).

w_new = 0.999 × [2.0, -1.5] - 0.1 × [0.5, 0.1] = [1.948, -1.5085] — SAME as L2!

`def adamw_step(params, grads, m, v, t, lr, lambda_, ...)`

AdamW: the CORRECT way to add weight decay to Adam. The key insight from Loshchilov & Hutter (2019): weight decay is applied DIRECTLY to weights, completely bypassing Adam’s adaptive moment scaling. This ensures every parameter receives the same proportional decay regardless of its gradient history.

EXECUTION STATE

⬇ input: m, v = First and second moment estimates. m tracks the running mean of the gradient, v tracks the running mean of the squared gradient (the uncentered second moment, often loosely called the variance). Both are running averages initialized to zero.

⬇ input: t = Current timestep (starting from 1). Used for bias correction in the early steps.

⬇ input: beta1=0.9, beta2=0.999 = Exponential decay rates for moments. beta1 controls how quickly the mean adapts, beta2 controls the second-moment estimate. These are the usual Adam defaults; large language model recipes often lower beta2 (for example to 0.95).

`mi_new = beta1 * mi + (1 - beta1) * g`

Update first moment (gradient mean). Exponential moving average of gradients. NOTICE: only the DATA gradient g is used here — NOT g + λ*w. This is the decoupling.

EXECUTION STATE

→ Decoupling = In Adam+L2: mi_new = beta1*mi + (1-beta1)*(g + λw). The L2 term pollutes the moment tracking. In AdamW: only g goes into moments.

`vi_new = beta2 * vi + (1 - beta2) * g ** 2`

Update second moment (running mean of squared gradients). Used for the adaptive learning rate: parameters with consistently large gradients get smaller steps. Again, only the data gradient is used; the weight decay term is excluded.

`m_hat = mi_new / (1 - beta1 ** t)`

Bias correction for the first moment. In the early steps (small t), the running average is biased toward zero (since it’s initialized to 0). Dividing by (1 - 0.9ᵗ) corrects this. At t=1: divide by 0.1 (10× amplification). By t=20: divide by ~0.88 (nearly no correction).

EXECUTION STATE

📚 Bias correction = At t=1: 1 - 0.9¹ = 0.1, so m_hat = m/0.1 = 10m. At t=10: 1 - 0.9¹⁰ = 0.651. At t=∞: 1 - 0 = 1 (no correction needed).

`v_hat = vi_new / (1 - beta2 ** t)`

Bias correction for the second moment. Same logic, but with beta2=0.999, correction is needed for ~1000 steps.
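The correction factors quoted above can be checked with a few lines of plain Python. This is only a small arithmetic sketch; the helper name `correction` is made up for illustration and is not part of the listing.

```python
# Bias-correction denominators 1 - beta**t for Adam's two moment estimates.
beta1, beta2 = 0.9, 0.999

def correction(beta: float, t: int) -> float:
    """Denominator that un-biases a zero-initialized moving average at step t."""
    return 1.0 - beta ** t

for t in (1, 10, 20, 1000, 5000):
    print(f"t={t:>5d}  first moment: {correction(beta1, t):.3f}  "
          f"second moment: {correction(beta2, t):.3f}")
```

The first-moment factor approaches 1 within a few dozen steps, while the second-moment factor is still well below 1 after a thousand steps, which is why the second correction matters for much longer.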

`adam_step = lr * m_hat / (torch.sqrt(v_hat) + eps)`

The Adam gradient step. The adaptive learning rate is lr / (√v_hat + ε) — parameters with large gradient variance get smaller steps. CRUCIALLY: this only involves the data gradient. Weight decay is NOT in this step.

EXECUTION STATE

√v_hat + ε = The per-parameter adaptive denominator. Large gradients → large v → large denominator → smaller effective learning rate. This is what causes the Adam+L2 problem: if λw were in the gradient, it would be divided by this term.

`w_new = (1 - lr * lambda_) * w - adam_step`

THE KEY LINE: weight decay (1 - lr·λ) is applied DIRECTLY to the current weights, then the adaptive gradient step is subtracted. The weight decay is NOT divided by √v. Every parameter shrinks by exactly the same proportion (lr·λ) regardless of gradient history.

EXECUTION STATE

(1 - lr * λ) * w = Direct weight shrinkage. Applied uniformly. NOT modulated by the adaptive learning rate. This is the ‘decoupling’ that makes AdamW work.

adam_step = The adaptive gradient update. Only contains information from the data loss, not the regularization term.

→ vs. Adam+L2 = In Adam+L2: w_new = w - lr*(m_hat_combined)/(√v_hat_combined + ε). The λw term is inside m_hat and v_hat, so it gets divided by √v — params with large gradients get LESS regularization.

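`torch.manual_seed(42)`

Fix the random seed for reproducibility. It has no effect on this particular demo, which uses only fixed tensors, but it is a good habit in training scripts.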

`w = torch.tensor([2.0, -1.5])`

A 2D weight vector. We will apply one SGD step with both L2 and weight decay to show they are equivalent.

EXECUTION STATE

w = [2.0, -1.5] — a simple 2D weight vector for demonstration

`g = torch.tensor([0.5, 0.1])`

The data gradient (from the loss function, not including regularization).

EXECUTION STATE

g = [0.5, 0.1] — w1 has a larger gradient than w2

`lr, lam = 0.1, 0.01`

Learning rate 0.1, regularization strength 0.01.

`w_l2 = sgd_with_l2([w], [g], lr, lam)[0]`

Apply one step of SGD with L2 regularization.

EXECUTION STATE

⬆ w_l2 = [1.948, -1.5085]

`w_wd = sgd_with_weight_decay([w], [g], lr, lam)[0]`

Apply one step of SGD with weight decay.

EXECUTION STATE

⬆ w_wd = [1.948, -1.5085] — identical to L2!

`print(f"SGD + L2:  {w_l2}")`

Print both results to verify the equivalence.

EXECUTION STATE

Output =

SGD + L2:  tensor([ 1.9480, -1.5085])
SGD + WD:  tensor([ 1.9480, -1.5085])
Equal?     True

→ Key insight = For SGD, L2 and weight decay are IDENTICAL. But for Adam, they produce DIFFERENT trajectories because Adam’s adaptive scaling distorts the L2 gradient.

import torch

# --- SGD with L2 regularization (manual) ---
def sgd_with_l2(params, grads, lr, lambda_):
    """SGD + L2: gradient of penalty is added to loss gradient."""
    updated = []
    for w, g in zip(params, grads):
        total_grad = g + lambda_ * w      # data gradient + penalty gradient
        w_new = w - lr * total_grad
        updated.append(w_new)
    return updated

# --- SGD with weight decay (equivalent to L2 for SGD!) ---
def sgd_with_weight_decay(params, grads, lr, lambda_):
    """SGD + WD: decay is applied directly to weights."""
    updated = []
    for w, g in zip(params, grads):
        w_new = (1 - lr * lambda_) * w - lr * g
        updated.append(w_new)
    return updated

# --- AdamW (decoupled weight decay) ---
def adamw_step(params, grads, m, v, t, lr, lambda_,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: weight decay is decoupled from adaptive gradient step."""
    updated_params = []
    updated_m = []
    updated_v = []

    for w, g, mi, vi in zip(params, grads, m, v):
        # Moments are built from the data gradient only (no lambda_ * w term).
        mi_new = beta1 * mi + (1 - beta1) * g
        vi_new = beta2 * vi + (1 - beta2) * g ** 2

        # Bias correction for the zero-initialized moving averages.
        m_hat = mi_new / (1 - beta1 ** t)
        v_hat = vi_new / (1 - beta2 ** t)

        adam_step = lr * m_hat / (torch.sqrt(v_hat) + eps)

        # Decay acts directly on w and bypasses the adaptive denominator.
        w_new = (1 - lr * lambda_) * w - adam_step

        updated_params.append(w_new)
        updated_m.append(mi_new)
        updated_v.append(vi_new)

    return updated_params, updated_m, updated_v

# --- Demonstrate the equivalence for SGD (adamw_step is defined but not run here) ---
torch.manual_seed(42)
w = torch.tensor([2.0, -1.5])
g = torch.tensor([0.5, 0.1])
lr, lam = 0.1, 0.01

# SGD + L2 vs SGD + WD: should be identical
w_l2 = sgd_with_l2([w], [g], lr, lam)[0]
w_wd = sgd_with_weight_decay([w], [g], lr, lam)[0]
print(f"SGD + L2:  {w_l2}")
print(f"SGD + WD:  {w_wd}")
print(f"Equal?     {torch.allclose(w_l2, w_wd)}")
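The listing only runs the SGD comparison. To see the Adam-side difference as well, here is a minimal sketch under toy assumptions: two weights start at the same value, their data gradients alternate in sign (so the data term averages to zero) and differ only in magnitude, and the helper names `adam_l2_step` and `run` as well as the step count and hyperparameters are chosen purely for illustration.

```python
import torch

def adam_l2_step(w, g, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam + L2: lam * w is folded into the gradient BEFORE the moment updates."""
    g = g + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (torch.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the data gradient; decay acts on w directly."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return (1 - lr * lam) * w - lr * m_hat / (torch.sqrt(v_hat) + eps), m, v

def run(step_fn, steps=2000, lr=1e-3, lam=0.1):
    """Apply step_fn repeatedly to two weights with different gradient scales."""
    w = torch.tensor([1.0, 1.0], dtype=torch.float64)
    grad_scale = torch.tensor([1.0, 0.01], dtype=torch.float64)  # large vs. small gradients
    m, v = torch.zeros_like(w), torch.zeros_like(w)
    for t in range(1, steps + 1):
        sign = 1.0 if t % 2 == 1 else -1.0   # alternating sign: zero-mean toy gradient
        g = sign * grad_scale
        w, m, v = step_fn(w, g, m, v, t, lr, lam)
    return w

w_l2 = run(adam_l2_step)
w_adamw = run(adamw_step)
print(f"Adam + L2: {w_l2}")
print(f"AdamW:     {w_adamw}")
```

Under AdamW both weights should end at nearly the same value, close to the pure decay factor $(1 - 10^{-4})^{2000} \approx 0.82$. Under Adam + L2 the small-gradient weight shrinks much further than the large-gradient one, which is the uneven regularization described above.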

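A practical consequence: with AdamW, the best weight decay setting depends much less on the learning rate than it does with Adam+L2 (Loshchilov & Hutter, 2019), so the two can be tuned largely separately. Note that in the update above the per-step shrinkage is still lr·λ, so they are not perfectly independent. With Adam+L2, changing the learning rate changes the effective regularization strength, making hyperparameter tuning a tangled mess.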

The `no_decay` filter pattern

In most modern transformer training loops you will see a small helper that splits parameters into two AdamW groups: one decayed at the standard rate (typically $\lambda = 0.1$) and one with $\lambda = 0$. The no-decay group covers biases and LayerNorm parameters: decaying these one-dimensional scale/shift parameters tends to harm rather than help. The convention is commonly associated with GPT-3-era recipes (Brown et al., 2020, Appendix B) and is widely used in HuggingFace Transformers, fairseq, and reference implementations like nanoGPT.

Splitting parameters for AdamW

🐍no_decay_filter.py

`NO_DECAY = ("bias", "LayerNorm.weight", "ln_f.weight")`

Substring patterns for parameter names that should NOT be decayed. 'bias' covers all biases; 'LayerNorm.weight' covers the γ scale parameter inside any LayerNorm; 'ln_f.weight' is the GPT-style final LayerNorm name.

EXECUTION STATE

NO_DECAY = ("bias", "LayerNorm.weight", "ln_f.weight")

`def split_param_groups(model, weight_decay=0.1):`

Returns two AdamW param groups: one decayed at the given rate, one not decayed. Default 0.1 matches GPT/LLaMA training recipes.

EXECUTION STATE

weight_decay = 0.1

`decay, no_decay = [], []`

Two accumulators.

`for name, param in model.named_parameters():`

Iterate over every (name, tensor) pair in the model. Names look like 'layers.3.attn.query.weight' or 'layers.3.attn.query.bias'.

`if not param.requires_grad: continue`

Skip frozen parameters.

`if any(nd in name for nd in NO_DECAY): no_decay.append(param)`

Substring match: if any forbidden pattern appears in the parameter's name, route to no_decay.

`else: decay.append(param)`

Everything else (weight matrices in Linear, attention projections, embeddings) gets decayed.

`return [`

AdamW accepts a list of dicts; each dict can override hyperparameters for its param subset.

`{"params": decay, "weight_decay": weight_decay},`

Group 1: decayed.

`{"params": no_decay, "weight_decay": 0.0},`

Group 2: explicit zero decay.

`]`

`model = nn.Sequential(OrderedDict([("fc1", nn.Linear(64, 64)), ("LayerNorm", nn.LayerNorm(64)), ("fc2", nn.Linear(64, 64))]))`

Toy stack with the parameter types we care about: two Linear layers (weights+biases) and one LayerNorm (weight γ + bias β). The modules are given explicit names because the filter matches on parameter names: in an unnamed nn.Sequential the LayerNorm scale would be called '1.weight', would not match 'LayerNorm.weight', and would end up in the decay group.

EXECUTION STATE

Linear(64,64) param count each = 64·64 + 64 = 4160

LayerNorm(64) param count = 64 + 64 = 128

`groups = split_param_groups(model, weight_decay=0.1)`

Walks the named parameters and routes them.

EXECUTION STATE

decay group size = 2 weight matrices = 8192 params

no_decay group size = 2 biases + 2 LayerNorm tensors = 256 params

`opt = torch.optim.AdamW(groups, lr=3e-4)`

AdamW receives the two groups. lr=3e-4 is a common transformer learning rate.

EXECUTION STATE

lr = 3e-4

`print(f"decayed params: {sum(...):>5d}")`

`print(f"non-decayed params: {sum(...):>5d}")`

Should print 8192 decayed and 256 non-decayed params. Biases and LayerNorm scales are a tiny fraction of the parameter count but matter for generalization.

from collections import OrderedDict

import torch
import torch.nn as nn

# What practitioners actually do: split params into decay vs no-decay groups.
# Excluding biases and LayerNorm scales from weight decay is commonly associated
# with GPT-3-era recipes (Brown et al., 2020, Appendix B) and is widely used in
# HuggingFace, fairseq, and nanoGPT.

# Substring patterns matched against parameter names.
NO_DECAY = ("bias", "LayerNorm.weight", "ln_f.weight")

def split_param_groups(model, weight_decay=0.1):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters are not optimized
        if any(nd in name for nd in NO_DECAY):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay,    "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Example with a tiny transformer-like block. The modules are named so that the
# LayerNorm scale is called "LayerNorm.weight" and matches the filter; in an
# unnamed nn.Sequential it would be "1.weight" and would be decayed.
model = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(64, 64)),
    ("LayerNorm", nn.LayerNorm(64)),
    ("fc2", nn.Linear(64, 64)),
]))
groups = split_param_groups(model, weight_decay=0.1)
opt = torch.optim.AdamW(groups, lr=3e-4)

print(f"decayed params:     {sum(p.numel() for p in groups[0]['params']):>5d}")
print(f"non-decayed params: {sum(p.numel() for p in groups[1]['params']):>5d}")
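Name-based filters depend on how modules happen to be named. A name-independent alternative, sketched below, routes parameters by tensor dimensionality instead: anything with fewer than two dimensions (biases, LayerNorm scale and shift) is exempt from decay. The helper `split_by_ndim` is hypothetical and written for illustration; it is similar in spirit to the rule used in some reference implementations, not a copy of any of them.

```python
import torch
import torch.nn as nn

def split_by_ndim(model: nn.Module, weight_decay: float = 0.1):
    """Hypothetical helper: decay tensors with >= 2 dims, exempt everything else."""
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue  # frozen parameters are not optimized
        (decay if param.ndim >= 2 else no_decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

# In an unnamed nn.Sequential the parameter names are positional, so a substring
# such as "LayerNorm.weight" would not match the LayerNorm scale ("1.weight").
print([name for name, _ in model.named_parameters()])

groups = split_by_ndim(model)
n_decay = sum(p.numel() for p in groups[0]["params"])
n_no_decay = sum(p.numel() for p in groups[1]["params"])
print(n_decay, n_no_decay)  # 8192 256

opt = torch.optim.AdamW(groups, lr=3e-4)
```

One design consequence: embedding tables are two-dimensional, so this rule decays them. Whether that is desirable varies between training recipes.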

Dropout and Weight Decay in Transformers

The original Transformer (Vaswani et al., 2017) uses dropout at three locations with rate $p = 0.1$ :

After attention weights (post-softmax, pre-V multiply): prevents over-reliance on specific token-to-token relationships
After each sub-layer output (before residual addition): occasionally forces the model to rely purely on the skip connection
After embedding + positional encoding sum: regularizes the input representation

Plus label smoothing with $\varepsilon = 0.1$ as an output-level regularizer.
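The three dropout sites and the label-smoothing loss can be laid out in a compact PyTorch sketch. This is not the original architecture: the class names `TinyEncoderLayer` and `TinyModel` are hypothetical, attention is single-head, the feed-forward sub-layer is omitted, and learned position embeddings stand in for the sinusoidal encodings, all to keep the placement of dropout visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderLayer(nn.Module):
    """Hypothetical single-head attention sub-layer showing where dropout sits."""
    def __init__(self, d_model: int = 64, p: float = 0.1):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.attn_drop = nn.Dropout(p)    # site 1: attention weights, after softmax
        self.resid_drop = nn.Dropout(p)   # site 2: sub-layer output, before the residual add

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = self.attn_drop(F.softmax(scores, dim=-1))
        out = self.proj(attn @ v)
        return self.norm(x + self.resid_drop(out))  # post-norm residual

class TinyModel(nn.Module):
    def __init__(self, vocab: int = 100, d_model: int = 64, max_len: int = 32, p: float = 0.1):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # simplification: learned positions
        self.emb_drop = nn.Dropout(p)               # site 3: embedding + position sum
        self.layer = TinyEncoderLayer(d_model, p)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.emb_drop(self.tok(ids) + self.pos(positions))
        return self.head(self.layer(x))

model = TinyModel()
ids = torch.randint(0, 100, (2, 16))            # (batch, sequence) of token ids
logits = model(ids)                             # shape (2, 16, 100)

# Output-level regularizer: label smoothing with epsilon = 0.1.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = loss_fn(logits.view(-1, 100), ids.view(-1))  # toy targets: the inputs themselves
```

Calling `model.eval()` turns all three dropout modules into identity functions, so inference is deterministic.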

Why Modern LLMs Remove Dropout

A striking development: LLaMA, GPT-3, Mistral, PaLM, and most modern large language models use zero dropout. This seems counterintuitive — why remove a regularizer? The reasons are revealing:

| Reason | Explanation |
| --- | --- |
| Scale provides implicit regularization | With billions of parameters and trillions of tokens, each training example is seen only 1–4 times. There is little opportunity to memorize. |
| Dropout adds noise to gradients | The stochastic masks increase gradient variance, requiring more steps to converge. At scale, this computational cost is substantial. |
| Distributed training complications | Synchronizing dropout masks across tensor/pipeline parallelism adds complexity. |
| Weight decay suffices | AdamW with λ=0.1 provides enough regularization for large-scale training. |
| Dropout returns for fine-tuning | When adapting a large model to a small dataset, dropout (or LoRA dropout) is reintroduced. |

Modern Transformer Training Recipe

Here is the configuration used by LLaMA (Touvron et al., 2023), representative of current best practices for large language models:

| Hyperparameter | Value | Why |
| --- | --- | --- |
| Optimizer | AdamW | Decoupled weight decay; standard for nearly all transformer training |
| Learning rate | 3e-4 (7B/13B), 1.5e-4 (33B/65B) | Smaller models tolerate higher LR |
| LR schedule | Cosine decay to 10% of peak | Smooth decay prevents sudden loss spikes |
| Warmup | 2,000 steps | Prevents early-training instability |
| Weight decay | 0.1 | Larger than the λ=0.0001 common in vision models; those typically use coupled L2 with SGD, so the values are not directly comparable |
| β₁, β₂ | 0.9, 0.95 | Lower β₂ than default (0.999) for faster adaptation |
| Gradient clipping | 1.0 | Prevents exploding gradients from rare batches |
| Dropout | 0.0 | No dropout: weight decay and data scale provide sufficient regularization |
| Label smoothing | 0.0 | Not used in autoregressive LMs (helpful for classification) |
| Batch size | 4M tokens | Large batches reduce gradient noise and use parallel hardware efficiently |

Notice the careful balance: no dropout, but strong weight decay (λ=0.1). Massive training data and weight decay together replace the explicit stochasticity of dropout. For smaller models or fine-tuning on limited data, dropout is still the go-to technique.
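As a rough illustration, the optimizer-side rows of the table map onto PyTorch as follows. This is a sketch, not LLaMA's training code: the stand-in model, the `total_steps` value, and the `lr_factor` helper are assumptions made for the example, and a real run would also split parameters into decay and no-decay groups.

```python
import math
import torch
import torch.nn as nn

peak_lr = 3e-4
warmup_steps = 2_000
total_steps = 100_000          # made-up value; depends on the token budget

model = nn.Linear(64, 64)      # stand-in for a transformer

opt = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warmup, then cosine decay from 1.0 down to 0.1 of the peak LR."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)

# One training step on random data, with gradient clipping at norm 1.0.
x = torch.randn(8, 64)
loss = model(x).pow(2).mean()
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
sched.step()
```

The schedule multiplies the peak learning rate by `lr_factor(step)`, so the factor reaches 1.0 at the end of warmup and settles at 0.1 once `total_steps` is reached.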

The implicit-regularization angle

At the scale of GPT-3 / LLaMA / Chinchilla there is also an implicit regularizer at work that has nothing to do with the techniques in this section: SGD's bias toward low-norm interpolators in the over-parameterized regime (Belkin et al., 2019; Nakkiran et al., 2020) and compute-optimal scaling (Hoffmann et al., 2022). See §12.1, Implicit Regularization at Scale, for the full story.

Key Takeaways

Dropout randomly zeroes neuron outputs during training (rate $p$ ), scales survivors by $1/(1-p)$ , and does nothing at inference. It implicitly trains an ensemble of $2^n$ sub-networks sharing the same parameters.
Weight decay shrinks all weights toward zero each step by a factor $(1 - \eta\lambda)$ . It is equivalent to a Gaussian prior on the weights (Bayesian view) or constraining weights to a hypersphere (geometric view).
L2 ≠ weight decay for Adam. In Adam+L2, the regularization gradient is distorted by adaptive scaling. AdamW decouples the decay, giving uniform regularization and largely independent hyperparameter tuning.
Not all parameters should be regularized. Weight decay is applied to weight matrices but NOT to biases or LayerNorm parameters. Embeddings are handled differently across recipes; the name-based filter shown earlier decays them.
Dropout prevents co-adaptation — neurons learn individually useful features rather than developing fragile co-dependencies.
Modern LLMs use weight decay without dropout. The combination of massive data and AdamW (λ=0.1) provides sufficient regularization. Dropout returns for fine-tuning on small datasets.
At scale, the implicit regularization from SGD's low-norm bias and Chinchilla-style data-rich training replaces explicit dropout (see §12.1, Implicit Regularization at Scale).

Looking Ahead: In the next section, we will explore two more regularization techniques — Early Stopping and Data Augmentation — which control overfitting through training dynamics and data manipulation rather than model modification.

References

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15(56), 1929–1958.
Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.
Krogh, A. & Hertz, J. A. (1991). A Simple Weight Decay Can Improve Generalization. NIPS 1991 (NeurIPS 4).
Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
Brown, T. B. et al. (2020). Language Models are Few-Shot Learners (GPT-3), Appendix B (parameter-group split convention). NeurIPS 2020.
Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press.
Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
