Introduction
Every massive language model you have ever used — GPT-4, Claude, Gemini, Llama, DeepSeek — was trained by repeating one operation about ten million times: compute the gradient of a scalar loss with respect to every parameter, then take a small step downhill. The optimizer choice changes, the loss changes, the architecture changes, the data changes. The gradient step does not. It is the heart muscle of deep learning.
For a 70-billion-parameter model that means computing seventy billion partial derivatives per training step. By hand it is unthinkable. By finite differences it would take more compute than humanity has ever produced. The reason it is even possible — and roughly free, in the sense that the gradient costs only a constant factor more than the forward pass — is a single 1970s idea called reverse-mode automatic differentiation, built on top of one rule from first-year calculus: the chain rule.
Why this matters: Backpropagation is not a separate algorithm bolted onto neural networks. It is reverse-mode autodiff applied to a computational graph. Once you see the graph and the chain rule together, every variation — gradient checkpointing, mixed precision, FSDP, gradient accumulation, even Adam itself — becomes a memory or numerics trick around the same primitive.
1.4.1 The Real Problem: A Hundred Billion Knobs
Training a deep network is, mathematically, an enormous optimisation problem. We pick a loss function that measures how badly the model is doing, where is the vector of all the model's parameters — every weight matrix, every bias, every layer-norm scale, every embedding vector — packed end to end. For a modern frontier model with on the order of .
To improve the model we need to know, for each of those numbers, the answer to one question: if I nudge this one knob up by a tiny amount, does the loss go up or down, and by how much? That answer is the partial derivative . Stack all of them into a vector and you get the gradient .
Here is the brutal arithmetic. If you tried to estimate each partial derivative the naive way — perturb one parameter, re-run the entire forward pass, see how much the loss changed — you would do forward passes per gradient. For a 70B model with a forward pass that takes ~500 ms on an H100 cluster, one gradient step would take .
Reverse-mode autodiff gives us the entire gradient — all components — in one forward pass plus one backward pass, with the backward pass roughly the same cost as the forward. So the total cost is , not . This is the only reason deep learning is computationally possible.
1.4.2 Four Ways to Compute a Derivative
It helps to see why reverse-mode wins by laying it next to its competitors. There are four classical approaches:
| Method | What it does | Cost for N inputs, 1 output | Used in deep learning? |
|---|---|---|---|
| Symbolic | Treat the program as an algebraic expression and apply differentiation rules to produce a closed-form formula. | Formula explodes ("expression swell") — quickly unmanageable. | No. |
| Numerical (finite differences) | Perturb each input slightly, re-evaluate, divide. (L(θ+ε) − L(θ))/ε. | O(N) forward passes per gradient. Numerically noisy. | Only for gradient-checking. |
| Forward-mode autodiff | Carry a 'dual number' (value, derivative) through every op; differentiate as you go. | O(N) — one extra pass per input direction. | Yes, but only for tiny input dimension. |
| Reverse-mode autodiff | Record the computation as a DAG on the forward pass, then walk it in reverse applying the chain rule. | O(1) — one backward pass for the whole gradient. | Yes — this is backpropagation. |
For a function with many inputs and one output (which is exactly the deep-learning shape: many parameters, one loss), reverse-mode is the only viable choice. The rest of this section is dedicated to building it from first principles.
1.4.3 The Idea: Calculus on a Graph
The first idea is to stop thinking of the model as a giant tangled formula and start thinking of it as a directed acyclic graph. Every primitive operation — add, multiply, matmul, exp, log, softmax, GELU — is a node. Edges carry intermediate values. Inputs flow in at one side, the loss pops out at the other.
Take an absurdly small example: . As a graph that is five nodes:
W ── × ── u ── + ── v ── (·)² ── y
x ─────┘ b ──┘
The forward pass evaluates left to right: with , , , we get , , . Each node remembers two things: its output value, and a pointer to the small local rule that knows how to differentiate it.
The backward pass walks the same graph from right to left. At each node we have already received an incoming derivative from downstream (the gradient of the loss w.r.t. the node's output) and we use the local rule to convert it into derivatives w.r.t. the node's inputs. Repeat until we hit the leaves. Done.
1.4.4 The Chain Rule, Symbol by Symbol
The local rule at every node is the chain rule. In its scalar form:
Read it like a relay race. is the baton handed to us by the downstream runner — we never compute it, we just receive it. is the local derivative we do know because we wrote the op (it is just the derivative of this one primitive). Multiplying the two gives the baton we hand to the next runner upstream.
For an op with multiple inputs the rule generalises to a sum: if , then
And for the case where one variable feeds many downstream ops — say is read by both and — the rule becomes a sum over children, the multivariable chain rule:
That single equation is the whole engine. Every other rule — , , , the matrix calculus identities behind matmul — is just a way of specialising it to one primitive at a time.
For vector-valued nodes (which is the normal case in a real network) the same rule holds, with the local derivative now a Jacobian and the upstream baton a vector. We almost never materialise these Jacobians explicitly — instead we use vector-Jacobian products , which is exactly what the local backward closures compute.
1.4.5 Forward Mode vs Reverse Mode
Both modes apply the chain rule; they differ only in the order in which they multiply the Jacobians. Suppose the network is the composition with Jacobians . The total Jacobian is the product . Matrix product is associative, so we can multiply from either end:
- Forward mode computes — picks one input direction and pushes it forward through every layer. Cost: one full forward-style pass per input.
- Reverse mode computes — starts from a cotangent on the output (for a scalar loss this is just the number 1) and pulls it back through every layer. Cost: one full backward-style pass per output.
Deep learning gives us many inputs (parameters) and one output (loss). Forward mode would cost one pass per parameter — back to the nightmare. Reverse mode costs one pass for the entire gradient. That asymmetry is the whole game.
1.4.6 Interactive: Build a Computational Graph
Pick an example below and step through the forward and backward passes one node at a time. Watch how the values fill in on the way forward and the gradients fill in on the way back. The arrows on the edges flip direction between the two phases — that flip is literally the difference between value flow and gradient flow.
1.4.7 Interactive: Watch the Chain Rule Multiply
Pick a composed function such as or and drag the input. The visualiser shows each intermediate function and its local derivative; the final gradient at the bottom is the product of all the local derivatives. That product is the chain rule, made visible.
1.4.8 Manual Numerical Walkthrough
Time to derive a gradient with a pencil. We will use the same toy as the graph above: with , , , and we will treat itself as the loss. The numbers are small enough to fit in your head, but the structure is exactly the same as a billion-parameter run — only the dimensions change.
Manual Numerical Walkthrough — open to see every number
Set up the graph. Three leaves (W, x, b) feed three ops (multiply, add, square) producing three intermediates (u, v, y).
| Node | Formula | Value | Local derivative w.r.t. each input |
|---|---|---|---|
| W | leaf | 2 | — |
| x | leaf | 3 | — |
| b | leaf | 1 | — |
| u | W · x | 6 | ∂u/∂W = x = 3, ∂u/∂x = W = 2 |
| v | u + b | 7 | ∂v/∂u = 1, ∂v/∂b = 1 |
| y | v² | 49 | ∂y/∂v = 2v = 14 |
Forward pass. Plug values left to right: , , . Done. Every intermediate value is now memorised by the graph.
Seed the backward pass. The loss is itself, so . This is the cotangent that we push backward.
Step 1 — through the square. Local rule: . Apply the chain rule:
Step 2 — through the addition. Local rules: , . Two outgoing arrows split the gradient evenly:
Step 3 — through the multiplication. Local rules: , . So
Final gradients (read from the leaves): , , . Three numbers, computed by three local rules and three multiplications. No giant formula, no nested derivatives.
Sanity check by hand. Expand symbolically: . Same number. The graph walk did exactly the symbolic derivative — but it would have been just as easy with nodes instead of three.
1.4.9 Plain Python: A Tiny Autograd Engine
We will now build the entire engine from §1.4.8 in about sixty lines of Python — no PyTorch, no NumPy, just floats and closures. The class is called Value and it is deliberately written in the Karpathy-micrograd style so that every line maps to one idea from the last few subsections.
Read it once top-to-bottom for the shape, then click each line for the connection back to the math.
Three points worth pausing on. First, the engine never stores any formula — only numbers and a list of parent pointers. The chain rule emerges from the order in which we call closures, not from algebra. Second, self.grad uses += rather than =: a node can feed many children, and each child must contribute its term to the multivariable chain rule sum (§1.4.4). Third, the topological sort is non-negotiable — call the closures in the wrong order and some children will fire before their gradient has been finalised, silently losing terms.
1.4.10 PyTorch: The Same Math at Industrial Scale
PyTorch's autograd is, structurally, our Value class with three upgrades: each node holds an N-dimensional tensor instead of one float; each _backward is a hand-tuned C++/CUDA kernel that handles the vector-Jacobian product without ever materialising the Jacobian; and the topological sort runs iteratively so it scales to millions of nodes. The user-facing API is deliberately almost identical to the toy.
The output is the same three numbers you derived by hand: 42, 28, 14. That is the entire promise of automatic differentiation — the mathematics does not change as the model scales from three nodes to three trillion. Only the kernels, the memory plan, and the parallelisation strategy do.
grad_fn: in our toy we stored _backward as a Python closure on every node. PyTorch stores a C++ Node object called the grad_fn — same idea, but it can be moved between processes for distributed training, serialised for checkpointing, and inspected for debugging via torch.autograd.gradcheck.1.4.11 At Massive Scale: Why Autodiff Is the Real Hero
Now scale our toy to a real frontier-model training run and watch what survives. The mathematics — the chain rule, the forward-then-reverse pass, the gradient accumulation — is identical. Everything else has to be redesigned.
Memory: the silent killer
Reverse mode is fast because every intermediate is reused exactly once on the way back. But that means every intermediate must be stored until then. For a 70B-parameter transformer training on sequences of 4096 tokens with batch 1 in bfloat16, the activations alone weigh on the order of 200–400 GB per data-parallel rank — more than the parameters themselves, sometimes by an order of magnitude.
| Memory category | Scales with | Rough share at 70B/4k context |
|---|---|---|
| Parameters (θ) | N | 140 GB in bf16, 70 GB in fp8 |
| Gradients (∂L/∂θ) | N | Same shape as parameters: 140 GB / 70 GB |
| Optimizer state (Adam: m, v) | 2N | Often the biggest single slice — 280 GB+ in fp32 |
| Activations (autodiff bookkeeping) | batch × seq × hidden × layers | Dominates for long contexts — 200–600 GB |
Gradient checkpointing — trade compute for memory
The classical autodiff bargain is "keep every intermediate". Gradient checkpointing breaks that bargain on purpose: drop most activations during forward, and when backward needs one, recompute it from the nearest checkpoint. You roughly double the forward FLOPs but cut activation memory by an order of magnitude. The chain rule is unchanged — you just rebuild the missing intermediates on demand.
Gradient accumulation — virtually larger batches
Look at the += in our toy: it is exactly what lets you train with batch size 4096 on a GPU that fits only batch 4. Run a micro-batch, call backward(), do not zero.grad, run the next micro-batch, repeat 1024 times, then step the optimizer. The accumulated gradient is mathematically identical to the full-batch gradient (modulo non-linearities like batch norm). The autodiff primitive that makes this safe is +=, not the optimizer.
Mixed precision — bf16/fp16 forward, fp32 gradients where it counts
Forward intermediates are stored in bf16 (or fp8 on the newest hardware) to save memory and matmul cost. Gradients are also bf16, but the optimizer keeps a master fp32 copy of the weights and the Adam moments to avoid catastrophic loss-of-precision at update time. The autograd graph routes a cast op between every fp32 leaf and its bf16 use, so the chain rule walks back through the casts like any other op.
FSDP / ZeRO — shard the parameters across ranks
Even after checkpointing and mixed precision, the parameters and optimizer state alone outgrow any single GPU around 7B. ZeRO-3 and PyTorch FSDP shard , the gradients, and the optimizer state across all data-parallel ranks. Before each layer's forward, the rank that owns the shard all-gathers its weights; afterwards it discards them. Backward does the same in reverse, but additionally reduce-scatters the gradients so only the owning rank holds for its shard. From the autograd engine's point of view nothing has changed — the gather/scatter are just more ops in the graph.
1.4.12 Engineering Reality and Pitfalls
The autograd engine is one of the most battle-tested pieces of code in PyTorch, but it can still be made to lie if you push it the wrong way. The list below is the high-frequency failure modes in real training runs.
- Forgetting
zero_grad(). Because gradients accumulate, missing a single zero turns every subsequent step into a runaway. Symptom: loss explodes after step 1 with no obvious cause. - In-place ops on tensors that backward needs. Writing
x.add_(1)mutates the activation the backward pass expected to see. PyTorch raises a runtime error in many cases but not all — when it does not, you silently compute the wrong gradient. - Using
.detach()without meaning to. Anything you copy via.detach(), convert to numpy, or print as a Python float is cut off from the graph. Gradients that should have flowed through it become zero. Useful when you want it (e.g. teacher targets); catastrophic when you do not. - Non-differentiable ops.
argmax, discrete sampling,torch.wherewith a hard threshold: each returns a tensor with no gradient. Either reroute through a differentiable surrogate (Gumbel-softmax, straight-through estimator) or accept that no learning signal flows there. - NaN gradients. The chain rule multiplies. One rogue infinity in a single intermediate corrupts every leaf upstream. Common causes:
log(0),1/0,sqrtof a negative value, fp16 overflow. Mixed precision adds aGradScalerexactly to catch this. - Forgetting
torch.no_grad()at inference. Without it, autograd still records the graph during eval — wasting memory and slowing inference. Production inference always wraps forward inwith torch.no_grad():or its successortorch.inference_mode(). - Higher-order gradients are expensive. Gradients of gradients (for meta-learning, second-order optimisers, or influence functions) require
create_graph=True, which keeps the backward graph itself differentiable. That stacks another autodiff pass on top — memory and compute double.
Function with your own forward and backward, run torch.autograd.gradcheck before trusting it. It compares your analytic backward against finite differences at high precision and is the single best defense against subtle bugs.Summary
Modern training has one inner loop — forward, loss, backward, step — and reverse-mode automatic differentiation is the only thing that makes the backward step cheap enough to repeat ten million times. We derived it from first principles:
- A scalar loss over parameters is hopeless to differentiate by formula or by perturbation; reverse mode does the full gradient in one extra pass.
- A network is a DAG of primitive ops; the forward pass evaluates it, the backward pass walks the same DAG in reverse, applying the chain rule locally at each node.
- The whole engine fits in sixty lines of Python: a
Valueclass, three operator overloads with+=accumulation, a topological sort, and a seed of 1 at the loss. PyTorch is the same algorithm, just with fast kernels and a million times more bookkeeping. - Everything else in a frontier-model training stack — gradient checkpointing, mixed precision, gradient accumulation, ZeRO/FSDP — is a memory or parallelism trick wrapped around this single primitive.
From here on in the book you can treat loss.backward() as one line of code that you fully understand. The next section (§1.5) takes the gradients we just learned to compute and asks the next question: given the gradient, how do we actually move? That is the realm of gradient descent, momentum, Adam, AdamW, and the optimisation landscape.