A 70-billion-parameter language model that has just finished pretraining is a remarkable thing. It can complete code, recite historical dates, translate Latin, finish a half-written sonnet, and explain quantum field theory at three levels of difficulty. It is also, in a very precise sense, useless. Ask it to write a polite email and you might get a polite email, or you might get a paragraph of insults, or a Reddit thread about emails, or twelve emails in a row, or a refusal to engage. The pretrained model imitates everything that exists on the internet, and the internet is not asking it to be helpful.
The thesis of this section. Pretraining minimises cross-entropy against the data distribution. What we want is a policy that maximises a human preference reward without drifting too far from the pretrained model. The mathematical object that arbitrates this trade-off is the KL-regularised optimal policy $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)$. Every RLHF method in production today -- PPO, DPO, GRPO, best-of-N rejection sampling, RLAIF -- is a different way of approximating samples from this distribution.
The Real Problem: Imitation Is Not What We Want
Pretraining a language model is, mathematically, a single objective: minimise the cross-entropy of the next token given the previous tokens, averaged over a corpus that approximates the distribution of natural text on the internet. Formally, with model parameters $\theta$ and corpus $\mathcal{C}$ of token sequences $w$,
$$\mathcal{L}_{\text{pre}}(\theta) = -\,\mathbb{E}_{w \sim \mathcal{C}}\Big[\sum_{t} \log p_\theta(w_t \mid w_{<t})\Big]$$
Minimising this is equivalent to minimising $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, because the two differ only by the entropy of $p_{\text{data}}$, which does not depend on $\theta$: the trained model becomes the best imitation of the corpus that the architecture and optimiser can produce. This is wonderful for what it gives us: a model that has implicitly absorbed grammar, semantics, factual knowledge, code idioms, and the rhetorical structure of every genre on the web. It is catastrophic for what it also gives us: a model that has implicitly absorbed Reddit flame wars, scam emails, propaganda, factual errors, and the particular conversational style of every troll forum it was trained on. The pretrained model is a probabilistic mixture of all of these -- it has no preference between "helpful assistant" and "4chan poster" because, statistically, it is both.
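The equivalence between minimising cross-entropy and minimising KL can be checked on a toy distribution. The sketch below is illustrative only: the three-outcome `p_data` and the two candidate models `q_far` and `q_near` are made-up numbers, not anything from a real corpus.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "data" distribution and two hypothetical models of it.
p_data = [0.5, 0.3, 0.2]
q_far = [0.2, 0.2, 0.6]
q_near = [0.45, 0.35, 0.20]

for name, q in [("far", q_far), ("near", q_near)]:
    ce = cross_entropy(p_data, q)
    # Cross-entropy splits into a model-independent entropy plus the KL term.
    print(f"{name}: CE={ce:.4f}  H+KL={entropy(p_data) + kl(p_data, q):.4f}")
```

Because `entropy(p_data)` is the same for every model, ranking models by cross-entropy and ranking them by KL give the same order.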
The practical consequence is severe. The 2022 InstructGPT paper documented the gap with care: when GPT-3 (175B, pretrained on the web) was prompted to follow user instructions, it followed them about 30% of the time. The remaining 70% of completions were plausible continuations of the prompt that ignored or contradicted the user's intent. After RLHF, the 1.3B InstructGPT model -- more than one hundred times smaller -- was preferred by human raters in ~85% of head-to-head comparisons. The capability gap closed not by scaling, but by alignment.
| Model | Params | Trained on | Pref vs GPT-3-175B |
|---|---|---|---|
| GPT-3 (2020) | 175B | Web pretraining | 50% (baseline) |
| GPT-3 + SFT | 175B | Pretrain + instruction tuning | ~68% |
| InstructGPT-1.3B | 1.3B | Pretrain + SFT + RLHF | ~85% |
| InstructGPT-175B | 175B | Pretrain + SFT + RLHF | ~88% |
Read that table carefully. The 1.3B aligned model beats the 175B unaligned model. The reverse is what the pre-2022 scaling-laws narrative would have predicted -- more parameters, more capability. What it actually says is that alignment is a separate axis from capacity, and that the marginal value of one alignment step is, at frontier scales, larger than the marginal value of 100× more parameters.
Intuition: A Parrot That Has Read the Whole Internet
Picture the pretrained model as a parrot that has read the entire internet and memorised the statistical shape of how text continues. Show it a prompt and it spits back the most probable continuation under that statistical shape. It has no goals, no preferences, no notion of "what the user wants" -- only what comes next in a corpus that contains the user's prompt as a substring.
We want something else: an assistant that, given a prompt, produces a reply that a human would prefer over alternatives. The minimal machinery to express this is:
- A reward function $r(x, y)$ that scores how good response $y$ is for prompt $x$. We get this from human preference comparisons (Sections 14.2-14.3): humans look at two responses, pick the better one, and we fit a model that reproduces those choices.
- A policy $\pi_\theta(y \mid x)$ -- the language model whose parameters $\theta$ we will adjust to maximise expected reward.
- An anchor $\pi_{\text{ref}}(y \mid x)$ -- a copy of the pretrained model that does not change. The aligned policy is allowed to drift from the anchor, but not too far. Drift is measured in KL divergence and penalised.
Why the anchor? Two reasons. First, the reward model is wrong: it was fit on a finite number of human comparisons and has shortcuts and blindspots. Without the anchor, the policy will find those blindspots and produce text that scores high on the reward but humans hate -- the canonical reward hacking failure mode. Second, the policy must retain its general language ability: a model that has drifted so far from the pretrained distribution that it cannot complete code, translate Latin, or summarise news is not useful even if it scores well on the reward.
The whole alignment process is a single tug-of-war. The reward pulls the policy toward high-scoring responses. The KL anchor pulls it back toward the pretrained distribution. A single hyperparameter, $\beta$, decides who wins the rope. Everything in this chapter -- the reward-model architecture, the PPO algorithm, the choice of KL estimator -- is machinery for executing this tug-of-war stably at the scale of billions of parameters.
The Mathematics of Alignment
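Write the alignment objective as a functional of the policy $\pi$:
$$J(\pi) = \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$
Three symbols carry the whole story. $\mathcal{D}$ is the distribution of prompts. For each prompt $x$ the policy generates a response $y \sim \pi(\cdot \mid x)$, the reward model scores it as $r(x, y)$, and we pay a per-prompt KL price proportional to how far $\pi(\cdot \mid x)$ has drifted from $\pi_{\text{ref}}(\cdot \mid x)$. The coefficient $\beta$ sets the exchange rate between reward and KL.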
The closed-form aligned policy
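Hold the prompt $x$ fixed for a moment and ask: what is the policy $\pi(\cdot \mid x)$ that maximises $J$ at this prompt? Form the Lagrangian with the simplex constraint $\sum_y \pi(y \mid x) = 1$:
$$\mathcal{L}(\pi, \lambda) = \sum_y \pi(y \mid x)\, r(x, y) \;-\; \beta \sum_y \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \lambda \Big( \sum_y \pi(y \mid x) - 1 \Big)$$
Differentiate with respect to $\pi(y \mid x)$ and set to zero:
$$r(x, y) \;-\; \beta \Big( \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + 1 \Big) \;+\; \lambda = 0$$
Solve for $\pi(y \mid x)$ and absorb the constants into a normalising partition function $Z(x)$:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\Big( \frac{r(x, y)}{\beta} \Big), \qquad Z(x) = \sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\!\Big( \frac{r(x, y')}{\beta} \Big)$$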
This is the most important equation in this chapter. Read it as a recipe: take the pretrained distribution, multiply every candidate response by an exponential factor that grows with its reward, and renormalise. The temperature of the exponential is $\beta$. Small $\beta$ means the tilt is aggressive -- the policy concentrates on high-reward responses. Large $\beta$ means the tilt is gentle -- the policy stays close to $\pi_{\text{ref}}$.
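One way to build trust in the closed form is to check it numerically: tilt a reference distribution by exp(r/β), then confirm that no randomly drawn policy scores higher on the single-prompt objective E[r] − β·KL. The sketch below assumes a made-up three-response problem; `tilted_policy`, `objective`, and `random_policy` are illustrative helper names, not functions from any library.

```python
import math
import random

def tilted_policy(pi_ref, rewards, beta):
    """Return (pi_star, log_Z) for pi* proportional to pi_ref * exp(r / beta)."""
    log_unnorm = [math.log(p) + r / beta for p, r in zip(pi_ref, rewards)]
    m = max(log_unnorm)
    log_z = m + math.log(sum(math.exp(v - m) for v in log_unnorm))
    return [math.exp(v - log_z) for v in log_unnorm], log_z

def objective(pi, pi_ref, rewards, beta):
    """Single-prompt objective: E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Hypothetical three-response problem.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]
beta = 1.0

pi_star, log_z = tilted_policy(pi_ref, rewards, beta)
j_star = objective(pi_star, pi_ref, rewards, beta)

rng = random.Random(0)

def random_policy(n):
    """Draw an arbitrary point on the probability simplex."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

best_random = max(
    objective(random_policy(len(pi_ref)), pi_ref, rewards, beta)
    for _ in range(10_000)
)
print(f"J(pi*)={j_star:.4f}  beta*log Z={beta * log_z:.4f}  best random={best_random:.4f}")
```

A useful by-product is that the optimal value equals β·log Z(x), which follows from substituting π* back into the objective.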
The two limiting cases
- β → 0⁺: the exponential blows up for any response whose reward is even marginally higher than the rest. $\pi^*$ collapses onto $\arg\max_y r(x, y)$. This is reward hacking: if the reward model has a bug -- rewarding any response that starts with "Certainly!", for example -- the aligned model will produce nothing but responses that start with "Certainly!".
- β → ∞: the exponential is flat, $\pi^* \to \pi_{\text{ref}}$, and the reward has no effect. No alignment happens; the model behaves like the pretrained one.
- Production regime: for InstructGPT, Anthropic HH-RLHF, and most modern PPO recipes. DeepSeek-R1 uses β = 0.001 for GRPO on reasoning tasks. The exact value is tuned against held-out preference data, not derived from first principles.
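The limiting cases above are easy to see on a toy problem with a deliberately buggy reward. In this sketch the three responses, their reference probabilities, and their rewards are all hypothetical; "certainly-spam" stands in for a shortcut that a flawed reward model slightly over-scores.

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """pi* proportional to pi_ref * exp(r / beta), computed in log-space."""
    log_unnorm = [math.log(p) + r / beta for p, r in zip(pi_ref, rewards)]
    m = max(log_unnorm)
    weights = [math.exp(v - m) for v in log_unnorm]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical responses; the last one exploits a reward-model bug.
names = ["helpful", "mediocre", "certainly-spam"]
pi_ref = [0.30, 0.69, 0.01]
rewards = [2.0, 0.0, 2.5]  # the shortcut is scored slightly above the helpful reply

for beta in (0.01, 1.0, 1000.0):
    pi_star = tilted_policy(pi_ref, rewards, beta)
    summary = ", ".join(f"{n}={p:.3f}" for n, p in zip(names, pi_star))
    print(f"beta={beta}: {summary}")
```

At the smallest β the shortcut takes essentially all the mass even though the reference policy almost never produces it; at β = 1 the anchor keeps it rare; at the largest β the policy is indistinguishable from the reference.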
Why we do not just compute π* directly
The closed form is mathematically complete but computationally impossible at the scale of an LLM. The response space $\mathcal{Y}$ is the set of token sequences of length up to a few thousand, with a vocabulary of ~100,000. At length 2,000 that is $100{,}000^{2000} = 10^{10{,}000}$ candidate sequences, against roughly $10^{80}$ particles in the observable universe. We cannot enumerate $\mathcal{Y}$ to compute $Z(x)$. We cannot even sample from $\pi^*$ directly because that would require evaluating the unnormalised density at every candidate sequence.
Everything in this chapter is engineering machinery for approximating samples from $\pi^*$ without ever forming it. PPO does this by sampling from the current policy $\pi_\theta$ and taking clipped, importance-weighted gradient steps on the KL-penalised reward, which moves $\pi_\theta$ toward $\pi^*$. DPO does it by eliminating the reward model and rewriting the optimum in terms of pairwise preferences directly. GRPO does it by replacing the critic with group-relative rewards. Best-of-N rejection sampling does it by drawing many samples from $\pi_{\text{ref}}$ and keeping the highest-reward one (the cheapest approximation; you will see this in Section 14.6). All of these are different solvers for the same equation.
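Best-of-N is the easiest of these solvers to sketch. The example below assumes a made-up three-response reference distribution and reward table; `best_of_n` is an illustrative helper, and in a real system the draws would be full generations scored by a reward model.

```python
import random

def best_of_n(pi_ref, rewards, n, rng):
    """Draw n responses from pi_ref and return the index with the highest reward."""
    candidates = rng.choices(range(len(pi_ref)), weights=pi_ref, k=n)
    return max(candidates, key=lambda i: rewards[i])

# Hypothetical reference distribution and rewards.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]
rng = random.Random(0)

mean_reward = {}
for n in (1, 4, 16):
    picks = [best_of_n(pi_ref, rewards, n, rng) for _ in range(5000)]
    mean_reward[n] = sum(rewards[i] for i in picks) / len(picks)
    print(f"N={n:>2}: mean reward of the kept sample = {mean_reward[n]:.3f}")
```

Larger N plays a role similar to smaller β: the kept sample moves toward the reward argmax, and the policy never has to be retrained, at the price of N forward generations per answer.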
Manual Numerical Walkthrough
Let us compute the KL-tilted policy by hand for a tiny example so every number is auditable. Eight candidate responses, a reward vector, a reference distribution, and three values of β.
π* at three values of β, computed by hand
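Step 1 — The setup. Eight candidate responses to the prompt "Write a polite email declining a meeting." Each response has a scalar reward (polite responses score high, rude/dishonest ones score low; the values run from −6 to 8, with polite-detailed at the top) and a reference logit (slight prior bias toward verbose / off-topic, reflecting the web corpus).
Step 2 — Compute π_ref via softmax. Subtract the maximum logit, exponentiate, sum the exponentials, and divide each one by the sum. Three entries matter in what follows: π_ref(off-topic) ≈ 0.211 (the largest), π_ref(polite-detailed) ≈ 0.157, and π_ref(rude-refusal) ≈ 0.078. Sanity check: the eight probabilities sum to 1.000.
Step 3 — Pick β = 1.0 first. Compute log π_ref(y) + r(y)/β for every response: take the log of each reference probability and add the reward, which enters unscaled at β = 1. For polite-detailed this is ln 0.157 + 8 ≈ 6.15. The resulting vector is the log unnormalised aligned policy.
Step 4 — Normalise via log-softmax. Max is 6.15 (polite-detailed). Subtract it from every entry, exponentiate, and sum; the sum is ≈ 1.42, so log Z ≈ 6.15 + ln 1.42 ≈ 6.50. Dividing each exponential by the sum gives π*; the top entry is 1/1.42 ≈ 0.704.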
Step 5 — Read off the alignment effect at β = 1. The probability of polite-detailed jumped from 0.157 to 0.704 (a 4.5× tilt). Rude-refusal went from 0.078 to essentially zero. The off-topic response, which had the highest mass under π_ref (0.211), got pushed down to 0.000855 -- the reward actively suppresses it. Expected reward E[r] ≈ 7.62 under π* (vs. ≈ 1.1 under π_ref). The KL cost follows from the identity KL(π* ‖ π_ref) = E[r]/β − log Z, with log Z = 8 + ln 0.157 − ln 0.704 ≈ 6.50. We bought 6.5 reward units for 7.62 − 6.50 ≈ 1.1 nats.
Step 6 — Repeat at β = 0.1 (aggressive tilt). Now r/β = 10r, an enormous tilt. The probability of polite-detailed rises above 0.999; polite-concise (reward 7) is suppressed relative to it by a further factor of e⁻¹⁰ ≈ 4.5 × 10⁻⁵ and falls below 10⁻⁴; everything else is numerically zero. Expected reward ≈ 8.00 (essentially the max). KL ≈ −ln 0.157 ≈ 1.85 nats, which is the ceiling for any policy that collapses onto polite-detailed. We are within rounding of pure argmax behavior -- this is the reward-hacking regime.
Step 7 — Repeat at β = 10 (mild tilt). Now r/β = r/10, exponents in [-0.6, 0.8], so the tilt is gentle. Expected reward ≈ 3.6. The KL cost is small: it cannot exceed (E_π*[r] − E_π_ref[r])/β ≈ (3.6 − 1.1)/10 = 0.25 nats. The policy has shifted toward polite responses but has not concentrated on any of them.
Step 8 — Verify the Pareto trade-off. Stack the three points: (β=0.1, KL≈1.85, E[r]≈8.0), (β=1, KL≈1.1, E[r]=7.62), (β=10, KL≤0.25, E[r]≈3.6). Reward rises monotonically as β falls; KL rises with it. The shape of (KL, E[r]) is concave -- the first quarter-nat or less buys ~2.5 reward units (from 1.1 to 3.6), roughly one further nat buys ~4 more, and the final ~0.7 nats buy less than 0.4. This is the alignment Pareto frontier; β is the parameter that chooses your operating point on it.
Step 9 — Cross-check against the explorer. Open the widget below and drag β to 1.0 (logβ = 0). The displayed probabilities should match step 4 to three decimals. Drag to logβ = -1 (β = 0.1) and the bars should collapse onto polite-detailed; drag to logβ = +1 (β = 10) and the bars should look nearly indistinguishable from the gray π_ref bars. This is your acceptance test for the whole derivation.
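The trade-off in the steps above can be audited with a short sweep. This sketch uses a hypothetical four-response problem rather than the walkthrough's eight responses, and it leans on one identity that is handy for hand calculations: KL(π* ‖ π_ref) = E[r]/β − log Z. The helper name `tilt_stats` is an assumption of this example.

```python
import math

def tilt_stats(pi_ref, rewards, beta):
    """Return (E[r], KL(pi* || pi_ref), log Z) for pi* proportional to pi_ref * exp(r / beta)."""
    log_unnorm = [math.log(p) + r / beta for p, r in zip(pi_ref, rewards)]
    m = max(log_unnorm)
    log_z = m + math.log(sum(math.exp(v - m) for v in log_unnorm))
    pi_star = [math.exp(v - log_z) for v in log_unnorm]
    e_r = sum(p * r for p, r in zip(pi_star, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi_star, pi_ref) if p > 0)
    return e_r, kl, log_z

# Hypothetical reference distribution and rewards.
pi_ref = [0.4, 0.3, 0.2, 0.1]
rewards = [-2.0, 0.0, 1.0, 3.0]

frontier = []
for beta in (10.0, 1.0, 0.1):
    e_r, kl, log_z = tilt_stats(pi_ref, rewards, beta)
    frontier.append((beta, kl, e_r))
    # Identity check: the KL cost is recoverable from E[r] and log Z alone.
    assert abs(kl - (e_r / beta - log_z)) < 1e-9
    print(f"beta={beta:<5} KL={kl:.3f}  E[r]={e_r:.3f}")

# KL can never exceed -log pi_ref(argmax reward): the cost of full collapse.
kl_ceiling = -math.log(pi_ref[rewards.index(max(rewards))])
print(f"KL ceiling = {kl_ceiling:.3f} nats")
```

The ceiling is a quick plausibility test for any quoted KL value: a number above −log π_ref(best response) cannot come from a tilted policy on that problem.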
Visualizing the KL-Tilted Policy
The widget below evaluates $\pi^*$ on the eight responses from the walkthrough. The gray bar for each response is its probability under $\pi_{\text{ref}}$; the indigo bar is its probability under the current $\pi^*$. Drag the β slider to watch the tilt strengthen or weaken. The right panel shows the live KL and E[r] and traces the (KL, reward) Pareto frontier as β varies.
Four things to internalise from the sandbox. First, the Pareto curve has a knee: there is a small range of β where you get most of the reward gain for only part of the KL cost. At β below the knee, you are paying KL for diminishing reward; at β above it, you are wasting reward headroom. Modern RLHF runs sweep β to find this knee. Second, the polite-formal and polite-concise responses both rise under tilt but never dominate, because their rewards (6 and 7) are not the maximum. The KL term prevents all-on-argmax behavior unless β is tiny. Third, the off-topic response has the largest mass but is rapidly suppressed as soon as the tilt turns on -- this is what the pretrained model's "helpfully wrong" behavior looks like in distribution. Fourth, the rude and dishonest responses go to numerically zero at any β < 5 -- alignment is essentially zero-shot at suppressing strongly-negative-reward responses, which is why even one round of RLHF dramatically reduces overtly bad outputs.
Plain Python: Solving the Aligned Policy by Hand
The script below implements the full closed form in NumPy. No deep-learning framework, no GPU. We define the reference distribution, the reward vector, and the KL-tilted policy; sweep β; and verify the two limiting cases.
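A minimal sketch of such a script follows, written with the standard library only so it runs anywhere; a NumPy version would vectorise the same steps. The response names follow the walkthrough where they are known, but the eighth name ("curt"), every logit, and every reward are hypothetical stand-ins, so the printed numbers will not reproduce the figures quoted in the text.

```python
import math

# Hypothetical stand-ins: names follow the walkthrough, numbers are assumed.
RESPONSES = ["polite-detailed", "polite-concise", "polite-formal", "verbose",
             "off-topic", "rude-refusal", "dishonest", "curt"]
REF_LOGITS = [0.5, 0.3, 0.4, 0.7, 0.8, -0.2, -0.6, -0.8]
REWARDS = [8.0, 7.0, 6.0, 2.0, 1.0, -4.0, -6.0, 0.0]

def log_softmax(scores):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def aligned_policy(ref_logits, rewards, beta):
    """Return (pi_ref, pi_star) with pi_star proportional to pi_ref * exp(r / beta)."""
    log_ref = log_softmax(ref_logits)
    log_star = log_softmax([lp + r / beta for lp, r in zip(log_ref, rewards)])
    return [math.exp(lp) for lp in log_ref], [math.exp(lp) for lp in log_star]

def summarize(beta):
    """Top response, expected reward, and KL(pi_star || pi_ref) at this beta."""
    pi_ref, pi_star = aligned_policy(REF_LOGITS, REWARDS, beta)
    e_r = sum(p * r for p, r in zip(pi_star, REWARDS))
    kl = sum(p * math.log(p / q) for p, q in zip(pi_star, pi_ref) if p > 0)
    top = RESPONSES[max(range(len(pi_star)), key=pi_star.__getitem__)]
    return top, e_r, kl

for beta in (0.01, 0.1, 1.0, 10.0, 1000.0):
    top, e_r, kl = summarize(beta)
    print(f"beta={beta:<7} top={top:<16} E[r]={e_r:6.3f} KL={kl:6.3f}")

# Why log-space: the naive tilt exp(r / beta) overflows even float64 here.
try:
    math.exp(REWARDS[0] / 0.01)  # exp(800)
except OverflowError:
    print("exp(r/beta) overflowed; the log-space path above did not")
```

The two ends of the sweep exercise the limiting cases: the smallest β should put essentially all mass on the highest-reward response, and the largest β should leave the reference policy's favourite on top.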
Two non-obvious details worth a second look. First, we work in log-space throughout (line 39) rather than computing $\exp(r/\beta)$ directly. With realistic rewards (typically scaled to mean 0, std 1 after the reward model trains) and β = 0.05, each standard deviation of reward adds 20 to the exponent, so a reward 4.5 standard deviations above the mean already gives an exponent of 90, past the point (about 88.7) where exp overflows float32. Every production RLHF codebase computes the policy this way for exactly this reason. Second, the "reward hacked" regime at β = 0.01 produces a policy that is essentially deterministic -- $\pi^*(\text{polite-detailed}) \approx 1$. In an LLM-scale setting this manifests as the post-RLHF model producing identical or near-identical responses to many different prompts ("mode collapse"). A practitioner spotting mode collapse in samples should immediately suspect β is too small or that the reward model has a shortcut that the policy has discovered.
Sanity-check yourself. Run the script with `beta = 1.0`. The top response should be polite-detailed with probability ~0.704. `E[r]` should be ~7.6. `KL` should be ~1.1 nats (E[r]/β − log Z ≈ 7.62 − 6.50, and in any case no more than ln(0.704/0.157) ≈ 1.5). These three numbers also match exactly what the visualizer displays at `logβ = 0` -- a four-way cross-check between the closed-form math, the manual walkthrough, the Python implementation, and the SVG widget.
PyTorch: From Toy Distribution to Token-Level Policy
The 8-response toy is conceptually identical to the real thing, but the real thing has two complications: the response space is a sequence of tokens (so "y" lives in a combinatorial space), and the policy is parameterised by a 70-billion-parameter neural network. The PyTorch snippet below shows the half of the story that scales without changes -- the per-token KL-tilted distribution at one decoding step -- and previews the sampling machinery that the rest of the chapter builds out for full sequences.
Three things to internalise here. First, at one decoding step the math is identical: a softmax over reference logits, plus a per-token reward bonus, gives an aligned next-token distribution. The complication is that during real RLHF, we never directly form this distribution at every position -- we sample from $\pi_\theta$ (a parameterised approximation of $\pi^*$) and use those samples to compute the loss. Second, the frozen reference policy $\pi_{\text{ref}}$ is a full copy of the pretrained model that stays resident in memory throughout RLHF. For a 70B model in bf16 this is 140 GB of weights -- non-trivial. Modern RLHF stacks (TRL, OpenRLHF, verl) either keep the reference on a separate set of GPUs or share weights with the SFT model and detach gradients. Third, the k3 KL estimator on line 55 is the practical workhorse. It is unbiased, non-negative on every single sample (which the naive log-ratio estimator is not), and typically has much lower variance than that naive estimator. You will see this exact line of code reappear in every algorithm in the rest of this chapter.
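The k3 claim can be checked without a deep-learning framework. The sketch below is plain Python rather than PyTorch and assumes two made-up three-token distributions; it compares the naive log-ratio estimator (often called k1) with k3 = (ρ − 1) − log ρ, where ρ = π_ref/π_θ is evaluated on tokens sampled from π_θ.

```python
import math
import random

# Hypothetical next-token distributions over a three-token vocabulary.
pi_theta = [0.5, 0.3, 0.2]  # current policy: samples are drawn from this
pi_ref = [0.4, 0.4, 0.2]    # frozen reference

true_kl = sum(q * math.log(q / p) for q, p in zip(pi_theta, pi_ref))

rng = random.Random(0)
samples = rng.choices(range(len(pi_theta)), weights=pi_theta, k=20_000)

k1, k3 = [], []
for tok in samples:
    log_ratio = math.log(pi_ref[tok]) - math.log(pi_theta[tok])  # log(pi_ref / pi_theta)
    k1.append(-log_ratio)                          # naive estimator, can be negative
    k3.append(math.expm1(log_ratio) - log_ratio)   # (ratio - 1) - log(ratio), never negative

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(f"true KL   = {true_kl:.5f}")
print(f"k1: mean={mean(k1):.5f}  var={variance(k1):.5f}  min={min(k1):.4f}")
print(f"k3: mean={mean(k3):.5f}  var={variance(k3):.5f}  min={min(k3):.4f}")
```

Both estimators target KL(π_θ ‖ π_ref), but only k3 stays non-negative per sample, which matters when the per-token value is used directly as a penalty.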
At Massive Scale: Why Alignment Is the Hard Part of an LLM
At billions of parameters, the closed form does not become wrong -- it becomes uncomputable. The engineering story of RLHF is the story of approximating samples from this distribution under several brutal constraints.
| Bottleneck | Why it appears at scale | How modern stacks address it |
|---|---|---|
| Partition function Z(x) | Sum over all sequences of length up to 2k tokens; combinatorially infeasible | Never form it. Use sample-based estimators (PPO importance weights, DPO pairwise reparameterisation). |
| Sampling π* directly | Requires evaluating unnormalised density at every candidate; impossible for an autoregressive model | Sample from π_θ instead and treat π_θ as an approximation of π* whose KL to π* is bounded by the training loss. |
| Reference-model memory | π_ref is a full frozen copy of a 7-70B model; 14-140 GB of bf16 weights kept resident | Separate GPU group (TRL), parameter sharing with detach (OpenRLHF), or DPO, whose fixed dataset lets the π_ref log-probabilities be precomputed once so that π_ref leaves the training loop |
| Reward-model accuracy | Trained on 100k–1M preference pairs; brittle outside the SFT distribution | Mix in rule-based rewards (Section 14.4), generative LLM-as-judge (Section 14.5), and bound KL to stay near π_ref where r is reliable |
| KL estimator variance | Per-token KL estimated from a single rollout has very high variance; can destabilise gradient | k3 estimator + entropy bonus + advantage normalisation (PPO/GRPO recipe, Section 14.6 and Chapter 15) |
| Mode collapse | If β too small or reward model has a shortcut, π_θ collapses to a few responses across all prompts | Sweep β on held-out preferences, monitor entropy and KL during training, restart from earlier checkpoint if collapse detected |
| Reward hacking | Strong policies exploit reward-model quirks; visible reward keeps rising while human preference falls | Iterative reward-model retraining, periodic human eval, ensemble rewards, conservative β |
| Throughput | Each PPO step: 4 GPU passes (actor, ref, critic, reward); 4× slower than pretraining per parameter update | Vectorised rollouts, mixed precision actor, 4-bit reward model, asynchronous rollout/training (verl, OpenRLHF) |
The scaling story has a striking feature: apart from reference-model memory and raw throughput, none of these bottlenecks are about how big the model is. They are about how big the action space is and how stable the sample-based estimators are. A 70B model and a 7B model have the same fundamental alignment problem; the 70B model just has more weights to be unstable in. This is why the open-source RLHF community converged on 7-13B models for reward modeling and alignment research -- the same algorithmic story unfolds, at one tenth the GPU cost.
The compute split is also striking. For a Llama-3 70B class deployment, pretraining costs roughly $6 \times 10^{24}$ FLOPs (about $6 \times$ 70B parameters $\times$ 15T tokens). SFT costs ~1% of that. RLHF (PPO with ~250k samples) costs another ~1-3%. So alignment is roughly 2-4% of total training compute. And yet, on every downstream evaluation -- helpfulness, harmlessness, instruction following, reasoning -- the gap from pretrained to aligned is larger than the gap from 7B to 70B on the same evaluation. Alignment is the cheap part of the budget and the expensive part of the result.
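The budget arithmetic can be made explicit. This back-of-envelope sketch rests on two assumptions that are not stated in the text: the common 6 × parameters × tokens rule of thumb for training FLOPs, and a pretraining corpus of 15 trillion tokens for a 70B-parameter model. The SFT and RLHF fractions are the ones quoted above.

```python
# Assumptions (not from the text): the 6*N*D rule of thumb for training FLOPs
# and a 15-trillion-token pretraining corpus for a 70B-parameter model.
PARAMS = 70e9
PRETRAIN_TOKENS = 15e12

pretrain_flops = 6 * PARAMS * PRETRAIN_TOKENS

# Fractions quoted in the text, relative to pretraining.
sft_flops = 0.01 * pretrain_flops
rlhf_flops_low = 0.01 * pretrain_flops
rlhf_flops_high = 0.03 * pretrain_flops

align_low = (sft_flops + rlhf_flops_low) / pretrain_flops
align_high = (sft_flops + rlhf_flops_high) / pretrain_flops

print(f"pretraining ~ {pretrain_flops:.1e} FLOPs")
print(f"alignment   ~ {align_low:.0%} to {align_high:.0%} of pretraining compute")
```

Changing either assumption rescales the absolute FLOP count but leaves the ratio untouched, which is the point of the paragraph.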
Engineering Reality: Reward Hacking and the KL Anchor
Three patterns dominate practical RLHF debugging, and all three fall directly out of the math we have built up.
1. Reward keeps rising, human eval flatlines
Symptom: the training-time reward graph climbs steadily; held-out human preference is flat or falling. Cause: the policy has discovered a shortcut in the reward model -- maybe responses starting with "Certainly!", or longer responses, or responses that paraphrase the prompt back. The closed-form math explains why this is inevitable: $\pi^*$ concentrates on $\arg\max_y r(x, y)$ as β shrinks; if the reward model assigns high reward to a shortcut, that shortcut becomes the argmax, and the aligned model produces it. The fix is to retrain the reward model on the failure mode, raise β, or both. Anthropic published a paper showing that periodically retraining the reward model on fresh preference pairs (and especially pairs from the current policy's output) buys substantially longer reward-hacking-free training.
2. KL grows monotonically; the loss never converges
Symptom: the KL term in the PPO loss grows linearly, never plateauing. Cause: β is too small for the reward scale, or the reward model has high variance, or the actor lr is too high. The closed-form picture: the optimum $\pi^*$ is a fixed point of the functional $J(\pi)$; if KL is exploding, the policy is overshooting that fixed point each step. The fix is any combination of: raise β, lower actor lr, use an adaptive KL controller that raises or lowers β to hold the measured KL near a target (the "adaptive KL" penalty described in the PPO paper), or normalise rewards to zero mean and unit variance before passing them to the loss.
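One concrete form of adaptive KL control is the doubling and halving heuristic described for the KL-penalty variant in the PPO paper. The sketch below follows that rule; the function name `update_beta`, the starting β, the target, and the KL trace are hypothetical, and RLHF libraries often use a smoother proportional controller instead.

```python
def update_beta(beta, observed_kl, target_kl):
    """Adjust the KL coefficient after a policy update.

    Doubles beta when the measured KL overshoots the target by more than 1.5x,
    halves it when the KL undershoots by more than 1.5x, otherwise leaves it.
    """
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / 1.5:
        return beta / 2.0
    return beta

beta, target_kl = 0.05, 0.10
kl_trace = [0.08, 0.20, 0.35, 0.12, 0.03]  # hypothetical per-update KL measurements

betas = []
for step, observed in enumerate(kl_trace):
    beta = update_beta(beta, observed, target_kl)
    betas.append(beta)
    print(f"step {step}: KL={observed:.2f} -> beta={beta}")
```

In a real run the KL trace is not fixed in advance: raising β feeds back into the next update and pulls the measured KL down, which is what makes the controller stabilising.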
3. The aligned model has lost its general ability
Symptom: post-RLHF model is great on the preference distribution but worse than the pretrained model on code, math, translation, long-tail factual recall, or anything not represented in the SFT and preference data. Cause: the policy has drifted from $\pi_{\text{ref}}$ in directions the KL anchor does not adequately penalise (the KL is an average, not a worst-case bound). The fix is to raise β, mix pretraining tokens into the RLHF replay (the "PPO-ptx" trick from InstructGPT), or alternate RLHF steps with brief next-token-prediction fine-tuning on pretrain data. All three keep the policy from drifting away from the pretrained distribution in any one direction.
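For reference, the PPO-ptx idea can be written as one extra term on the KL-penalised objective. The form below is a sketch in this section's notation, assuming a pretraining-mix coefficient $\gamma$ and a pretraining token distribution $\mathcal{C}$; both symbols are introduced here for illustration.

$$J_{\text{ptx}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\Big[ r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \Big] \;+\; \gamma\, \mathbb{E}_{w \sim \mathcal{C}}\big[ \log \pi_\theta(w) \big]$$

The first expectation is the RLHF objective from earlier in the section, written per sample; the second is an ordinary log-likelihood on pretraining text, so $\gamma = 0$ recovers plain KL-regularised RLHF.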
The pattern across all three failure modes is the same: the KL-tilted optimal policy is the right target, but it is only as good as the reward model and only as stable as the KL anchor. The rest of this chapter is the toolkit for getting both right -- preference data collection and the Bradley-Terry model in Section 14.2; reward model architecture and training in 14.3; rule-based and generative rewards that bypass the reward-model bottleneck in 14.4 and 14.5; and the PPO algorithm that makes approximate sampling from $\pi^*$ stable in 14.6. Every one of those sections is a different facet of the same problem set up here: how to climb the reward without falling off the pretrained manifold.