Boo-AI — Master Artificial Intelligence by Building from Scratch

A 70-billion-parameter language model that has just finished pretraining is a remarkable thing. It can complete code, recite historical dates, translate Latin, finish a half-written sonnet, and explain quantum field theory at three levels of difficulty. It is also, in a very precise sense, useless. Ask it to write a polite email and you might get a polite email, or you might get a paragraph of insults, or a Reddit thread about emails, or twelve emails in a row, or a refusal to engage. The pretrained model imitates everything that exists on the internet, and the internet is not asking it to be helpful.

The thesis of this section. Pretraining minimises cross-entropy against the data distribution. What we want is a policy that maximises a human preference reward $r(x, y)$ without drifting too far from the pretrained model. The mathematical object that arbitrates this trade-off is the KL-regularised optimal policy $\pi^{*}(y | x) \propto \pi_{\text{ref}}(y | x) \cdot \exp(r(x, y) / \beta)$ . Every RLHF method in production today -- PPO, DPO, GRPO, best-of-N rejection sampling, RLAIF -- is a different way of approximating samples from this distribution.

The Real Problem: Imitation Is Not What We Want

Pretraining a language model is, mathematically, a single objective: minimise the cross-entropy of the next token given the previous tokens, averaged over a corpus that approximates the distribution of natural text on the internet. Formally, with model parameters $\theta$ and corpus $\mathcal{D}$ ,

$\mathcal{L}_{\text{pre}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \log \pi_{\theta}(y \mid x) \big].$

Minimising this is equivalent to minimising $\mathrm{KL}\!\left(\mathcal{D} \,\|\, \pi_{\theta}\right)$ : the trained model becomes the best imitation of the corpus that the architecture and optimiser can produce. This is wonderful for what it gives us: a model that has implicitly absorbed grammar, semantics, factual knowledge, code idioms, and the rhetorical structure of every genre on the web. It is catastrophic for what it also gives us: a model that has implicitly absorbed Reddit flame wars, scam emails, propaganda, factual errors, and the particular conversational style of every troll forum it was trained on. The pretrained model is a probabilistic mixture of all of these -- it has no preference between "helpful assistant" and "4chan poster" because, statistically, it is both.
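A quick numerical check of that equivalence, as a minimal sketch on a made-up three-outcome example (the vectors `data_dist` and `model_dist` are hypothetical and not taken from any corpus). Cross-entropy splits into the entropy of the data plus $\mathrm{KL}(\mathcal{D} \,\|\, \pi_{\theta})$ , and since the entropy term does not depend on $\theta$ , minimising one minimises the other.

```python
import numpy as np

# Hypothetical toy distributions over three next-token outcomes (illustrative only).
data_dist = np.array([0.7, 0.2, 0.1])   # stands in for the corpus distribution D
model_dist = np.array([0.5, 0.3, 0.2])  # stands in for pi_theta

cross_entropy = -np.sum(data_dist * np.log(model_dist))
entropy = -np.sum(data_dist * np.log(data_dist))
kl_data_model = np.sum(data_dist * np.log(data_dist / model_dist))

# Cross-entropy = H(D) + KL(D || pi_theta); H(D) is constant in theta.
print(f"cross-entropy = {cross_entropy:.4f}")
print(f"H(D) + KL     = {entropy + kl_data_model:.4f}")
```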

The practical consequence is severe. The 2022 InstructGPT paper documented the gap with care: when GPT-3 (175B, pretrained on the web) was prompted to follow user instructions, it followed them about 30% of the time. The remaining 70% of completions were plausible continuations of the prompt that ignored or contradicted the user's intent. After RLHF, the 1.3B InstructGPT model -- one hundred and thirty times smaller -- was preferred by human raters in ~85% of head-to-head comparisons. The capability gap closed not by scaling, but by alignment.

Model	Params	Trained on	Pref vs GPT-3-175B
GPT-3 (2020)	175B	Web pretraining	50% (baseline)
GPT-3 + SFT	175B	Pretrain + instruction tuning	~68%
InstructGPT-1.3B	1.3B	Pretrain + SFT + RLHF	~85%
InstructGPT-175B	175B	Pretrain + SFT + RLHF	~88%

Read that table carefully. The 1.3B aligned model beats the 175B unaligned model. The reverse is what the pre-2022 scaling-laws narrative would have predicted -- more parameters, more capability. What it actually says is that alignment is a separate axis from capacity, and that the marginal value of one alignment step is, at frontier scales, larger than the marginal value of 100× more parameters.

Three failure modes that motivate everything that follows. (1) The pretrained model is helpful in the wrong direction: asked "how do I pick a lock?" it cheerfully explains. (2) It is uncalibrated: it produces confident text on topics where it knows nothing. (3) It is distributionally indifferent: a polite reply and a rude reply are both natural continuations of the same prompt, and the model has no preference between them. SFT partially addresses (1) and (2); RLHF is what closes (3).

Intuition: A Parrot That Has Read the Whole Internet

Picture the pretrained model as a parrot that has read the entire internet and memorised the statistical shape of how text continues. Show it a prompt and it spits back the most probable continuation under that statistical shape. It has no goals, no preferences, no notion of "what the user wants" -- only what comes next in a corpus that contains the user's prompt as a substring.

We want something else: an assistant that, given a prompt, produces a reply that a human would prefer over alternatives. The minimal machinery to express this is:

A reward function $r(x, y)$ that scores how good response $y$ is for prompt $x$ . We get this from human preference comparisons (Sections 14.2-14.3): humans look at two responses, pick the better one, and we fit a model that reproduces those choices.
A policy $\pi_{\theta}(y \mid x)$ -- the language model whose parameters we will adjust to maximise expected reward.
An anchor $\pi_{\text{ref}}$ -- a copy of the pretrained model that does not change. The aligned policy is allowed to drift from the anchor, but not too far. Drift is measured in KL divergence and penalised.

Why the anchor? Two reasons. First, the reward model is wrong: it was fit on a finite number of human comparisons and has shortcuts and blindspots. Without the anchor, the policy will find those blindspots and produce text that scores high on the reward but humans hate -- the canonical reward hacking failure mode. Second, the policy must retain its general language ability: a model that has drifted so far from the pretrained distribution that it cannot complete code, translate Latin, or summarise news is not useful even if it scores well on the reward.

The whole alignment process is a single tug-of-war. The reward pulls the policy toward high-scoring responses. The KL anchor pulls it back toward the pretrained distribution. A single hyperparameter, $\beta$ , decides who wins the rope. Everything in this chapter -- the reward-model architecture, the PPO algorithm, the choice of KL estimator -- is machinery for executing this tug-of-war stably at the scale of billions of parameters.

An analogy. Think of the pretrained model as a very well-read person who imitates whoever they last spoke to. Alignment is teaching them to be themselves as an assistant, but without forgetting everything they read. Crank β up too high and they keep imitating the last troll they spoke to. Crank β down too low and they become a sycophant who only says what scores high on the reward and has forgotten how to write a function.

The Mathematics of Alignment

Write the alignment objective as a functional of the policy $\pi$ :

$J(\pi) = \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big] - \beta \cdot \mathbb{E}_{x \sim \rho} \big[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right) \big].$

Three symbols carry the whole story. $\rho$ is the distribution of prompts. For each prompt the policy generates a response $y$ , the reward model scores it as $r(x, y)$ , and we pay a per-prompt KL price proportional to how far $\pi$ has drifted from $\pi_{\text{ref}}$ . The coefficient $\beta > 0$ sets the exchange rate between reward and KL.
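To make the exchange rate concrete, here is a minimal sketch that evaluates $J$ for a single prompt with a finite response set. It borrows the eight-response toy numbers used in this section's walkthrough; the helper name `objective` and the all-on-argmax policy `greedy` are illustrative choices, not part of any library.

```python
import numpy as np

def objective(pi, pi_ref, r, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref) for one prompt, finite response set."""
    support = pi > 0                      # 0 * log 0 is taken as 0
    kl = np.sum(pi[support] * np.log(pi[support] / pi_ref[support]))
    return float(np.sum(pi * r) - beta * kl)

# Toy numbers borrowed from this section's walkthrough.
ref_logits = np.array([1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9])
pi_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()
r = np.array([8, -6, 7, -3, 1, 6, -2, -5], dtype=float)

greedy = np.zeros_like(pi_ref)
greedy[np.argmax(r)] = 1.0                # all mass on the top-reward response

for beta in [0.1, 1.0, 10.0]:
    print(f"beta={beta:5.1f}  J(pi_ref)={objective(pi_ref, pi_ref, r, beta):7.3f}  "
          f"J(greedy)={objective(greedy, pi_ref, r, beta):7.3f}")
```

With a small β the greedy policy scores better than staying at the reference; with a large β the KL price outweighs the reward gain and staying put wins.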

The closed-form aligned policy

Hold the prompt $x$ fixed for a moment and ask: what is the policy that maximises $J$ at this prompt? Form the Lagrangian with the simplex constraint $\sum_{y} \pi(y \mid x) = 1$ :

$\mathcal{L}(\pi, \lambda) = \sum_{y} \pi(y \mid x) \, r(x, y) - \beta \sum_{y} \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \lambda \left( 1 - \sum_{y} \pi(y \mid x) \right).$

Differentiate with respect to $\pi(y \mid x)$ and set to zero:

$r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \beta - \lambda = 0.$

Solve for $\pi(y \mid x)$ and absorb the constants into a normalising partition function $Z(x)$ :

$\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\left( \frac{r(x, y)}{\beta} \right), \qquad Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \, \exp\!\left( \frac{r(x, y)}{\beta} \right).$

This is the most important equation in this chapter. Read it as a recipe: take the pretrained distribution, multiply every candidate response by an exponential factor that grows with its reward, and renormalise. The temperature of the exponential is $\beta$ . Small $\beta$ means the tilt is aggressive -- the policy concentrates on high-reward responses. Large $\beta$ means the tilt is gentle -- the policy stays close to $\pi_{\text{ref}}$ .
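For readers who want the algebra between the stationarity condition and the closed form, here it is spelled out as a sketch. Nothing here is new; it only expands "solve and absorb the constants". The last two lines are consequences obtained by substituting $\pi^{*}$ back into the KL term and into $J$ at the fixed prompt.

```latex
\begin{aligned}
\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
  &= \frac{r(x, y)}{\beta} - 1 - \frac{\lambda}{\beta} \\
\pi(y \mid x)
  &= \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)
     \exp\!\left(-1 - \frac{\lambda}{\beta}\right) \\
\sum_{y} \pi(y \mid x) = 1
  \;&\Longrightarrow\; \exp\!\left(-1 - \frac{\lambda}{\beta}\right) = \frac{1}{Z(x)} \\
\mathrm{KL}\!\left(\pi^{*}(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)
  &= \frac{1}{\beta}\, \mathbb{E}_{y \sim \pi^{*}}\big[ r(x, y) \big] - \log Z(x) \\
\mathbb{E}_{y \sim \pi^{*}}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\!\left(\pi^{*} \,\|\, \pi_{\text{ref}}\right)
  &= \beta \log Z(x)
\end{aligned}
```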

The two limiting cases

β → 0⁺: the exponential blows up for any response whose reward is even marginally higher than the rest. $\pi^{*}$ collapses onto $\arg\max_{y} r(x, y)$ . This is reward hacking: if the reward model has a bug -- rewarding any response that starts with "Certainly!", for example -- the aligned model will produce nothing but responses that start with "Certainly!".
β → ∞: the exponential is flat, $\pi^{*} = \pi_{\text{ref}}$ , and the reward has no effect. No alignment happens; the model behaves like the pretrained one.
Production regime: $\beta \in [0.01, 0.1]$ for InstructGPT, Anthropic HH-RLHF, and most modern PPO recipes. DeepSeek-R1 uses β = 0.001 for GRPO on reasoning tasks. The exact value is tuned against held-out preference data, not derived from first principles.

Why we do not just compute π* directly

The closed-form $\pi^{*}(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \cdot \exp(r(x, y) / \beta)$ is mathematically complete but computationally impossible at the scale of an LLM. The response space $y$ is the set of token sequences of length up to a few thousand, with a vocabulary of ~100,000 -- the number of candidate sequences has thousands of digits, dwarfing the roughly $10^{80}$ particles in the observable universe. We cannot enumerate $y$ to compute $Z(x)$ . We cannot even sample from $\pi^{*}$ directly because that would require evaluating the unnormalised density at every candidate sequence.
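The size of the response space can be checked with a few lines of arithmetic. This is a rough sketch: the sequence length of 2,000 is an assumed stand-in for "a few thousand", and $10^{80}$ is the usual order-of-magnitude estimate for the particle count.

```python
import math

# Illustrative sizes: ~100,000-token vocabulary (from the text) and an
# assumed response length of 2,000 tokens ("a few thousand").
vocab_size = 100_000
seq_len = 2_000

# The number of length-seq_len sequences is vocab_size ** seq_len.
# Work with log10 so the huge integer is never built.
log10_num_sequences = seq_len * math.log10(vocab_size)
log10_particles = 80  # common order-of-magnitude estimate: ~10^80 particles

print(f"sequences ~ 10^{log10_num_sequences:.0f}")
print(f"particles ~ 10^{log10_particles}")
```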

Everything in this chapter is engineering machinery for approximating samples from $\pi^{*}$ without ever forming it. PPO does this by sampling from the current policy $\pi_{\theta}$ and taking clipped policy-gradient steps on the KL-penalised reward, so that $\pi_{\theta}$ moves toward $\pi^{*}$ . DPO does it by eliminating the reward model and rewriting the optimum in terms of pairwise preferences directly. GRPO does it by replacing the critic with group-relative rewards. Best-of-N rejection sampling does it by drawing many samples from $\pi_{\text{ref}}$ and keeping the highest-reward one (the cheapest approximation; you will see this in Section 14.6). All of these are different solvers for the same equation.
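Best-of-N is simple enough to sketch on the eight-response toy from this section's walkthrough. The helper `best_of_n_mean_reward` and the trial count are illustrative assumptions. The sketch only shows that the kept sample's reward rises with N; it makes no claim about how closely best-of-N tracks $\pi^{*}$ .

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy numbers borrowed from this section's walkthrough.
ref_logits = np.array([1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9])
pi_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()
r = np.array([8, -6, 7, -3, 1, 6, -2, -5], dtype=float)

def best_of_n_mean_reward(n, trials=20_000):
    """Draw n responses from pi_ref per trial and keep the highest-reward one."""
    draws = rng.choice(len(r), size=(trials, n), p=pi_ref)
    return float(r[draws].max(axis=1).mean())

mean_rewards = {n: best_of_n_mean_reward(n) for n in (1, 4, 16)}
for n, mean_r in mean_rewards.items():
    print(f"N={n:2d}  mean reward of kept sample = {mean_r:.2f}")
```

Only the reference model and the reward are queried here; no parameters are updated, which is why this is the cheapest approximation.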

Common confusion. The KL term in $J(\pi)$ is $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ , not $\mathrm{KL}(\pi_{\text{ref}} \,\|\, \pi)$ . The order matters: $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ , which takes its expectation under the policy being optimised (the direction usually called reverse KL), is mode-seeking -- it pushes $\pi$ to concentrate on a subset of $\pi_{\text{ref}}$ 's support. The opposite direction, $\mathrm{KL}(\pi_{\text{ref}} \,\|\, \pi)$ , is mode-covering and produces fundamentally different behavior. The InstructGPT / PPO convention uses $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ , and the closed form above is derived for that direction.
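The asymmetry is easy to see numerically. The sketch below reuses the eight-response toy from this section's walkthrough at β = 1 and computes the KL in both argument orders; the helper names are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive finite distributions."""
    return float(np.sum(p * np.log(p / q)))

def aligned_policy(pi_ref, r, beta):
    """Closed-form tilt pi* proportional to pi_ref * exp(r / beta), in log-space."""
    log_unnorm = np.log(pi_ref) + r / beta
    log_unnorm -= log_unnorm.max()
    weights = np.exp(log_unnorm)
    return weights / weights.sum()

# Toy numbers borrowed from this section's walkthrough.
ref_logits = np.array([1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9])
pi_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()
r = np.array([8, -6, 7, -3, 1, 6, -2, -5], dtype=float)

pi_star = aligned_policy(pi_ref, r, beta=1.0)

print("KL(pi* || pi_ref) =", round(kl(pi_star, pi_ref), 3))   # direction used in J
print("KL(pi_ref || pi*) =", round(kl(pi_ref, pi_star), 3))   # arguments swapped
```

The swapped direction comes out several times larger, because it charges $\pi^{*}$ heavily for removing mass from responses that $\pi_{\text{ref}}$ still supports.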

Manual Numerical Walkthrough

Let us compute the KL-tilted policy by hand for a tiny example so every number is auditable. Eight candidate responses, a reward vector, a reference distribution, and three values of β.

π* at three values of β, computed by hand

Step 1 — The setup. Eight candidate responses to the prompt "Write a polite email declining a meeting", labelled in order: polite-detailed, rude-refusal, polite-concise, evasive, off-topic, polite-formal, lazy, dishonest. Reward vector $r = [8, -6, 7, -3, 1, 6, -2, -5]$ (polite responses score high, rude/dishonest ones score low). Reference logits $z_{\text{ref}} = [1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9]$ (slight prior bias toward verbose / off-topic, reflecting the web corpus).

Step 2 — Compute π_ref via softmax. Subtract max and exponentiate: $z - z_{\max} = [-0.3, -1.0, -0.5, -1.2, 0.0, -0.4, -0.7, -0.6]$ . Exponentials: $[0.741, 0.368, 0.607, 0.301, 1.000, 0.670, 0.497, 0.549]$ . Sum: $4.732$ . Divide: $\pi_{\text{ref}} \approx [0.157, 0.078, 0.128, 0.064, 0.211, 0.142, 0.105, 0.116]$ . Sanity check: sums to 1.000 up to rounding.

Step 3 — Pick β = 1.0 first. Compute $\log \pi_{\text{ref}} + r / \beta$ : log π_ref ≈ $[-1.85, -2.55, -2.05, -2.75, -1.55, -1.95, -2.25, -2.15]$ . Add $r/1.0 = r$ to get $[6.15, -8.55, 4.95, -5.75, -0.55, 4.05, -4.25, -7.15]$ . This is the log unnormalised aligned policy.

Step 4 — Normalise via log-softmax. Max is 6.15. Subtract: $[0, -14.7, -1.20, -11.9, -6.70, -2.10, -10.4, -13.3]$ . Exponentiate: $[1.000, 4.1e\text{-}7, 0.301, 6.8e\text{-}6, 1.2e\text{-}3, 0.122, 3.0e\text{-}5, 1.7e\text{-}6]$ . Sum: $1.425$ . Divide: $\pi^{*}_{\beta=1} \approx [0.702, 2.9e\text{-}7, 0.211, 4.8e\text{-}6, 8.6e\text{-}4, 0.086, 2.1e\text{-}5, 1.2e\text{-}6]$ .

Step 5 — Read off the alignment effect at β = 1. The probability of polite-detailed jumped from 0.157 to 0.702 (a 4.5× tilt). Rude-refusal went from 0.078 to essentially zero. The off-topic response, which had the highest mass under π_ref (0.211), got pushed down to 0.00086 -- the reward actively suppresses it. Expected reward $\mathbb{E}[r] = \sum_i \pi^{*}_i r_i \approx 0.702(8) + 0.211(7) + 0.086(6) + \dots \approx 7.61$ (vs. $\mathbb{E}_{\pi_{\text{ref}}}[r] \approx 1.76$ ). We bought about 5.8 reward units for $\mathrm{KL} \approx 1.11$ nats.

Step 6 — Repeat at β = 0.1 (aggressive tilt). Now $r / \beta = 10 \cdot r$ , an enormous tilt. The probability of polite-detailed shoots up to ~0.99996, polite-concise drops to ~0.00004, everything else is numerically zero. Expected reward ≈ 8.00 (essentially the max). KL ≈ 1.85 nats, which is the ceiling $-\log \pi_{\text{ref}}(\text{polite-detailed}) = -\log 0.157$ reached when all mass sits on one response. We are within rounding of pure argmax behavior -- this is the reward-hacking regime.

Step 7 — Repeat at β = 10 (mild tilt). $r/\beta = r/10$ , exponents in [-0.6, 0.8], so the tilt is gentle. Approximate result: $\pi^{*}_{\beta=10} \approx [0.26, 0.03, 0.19, 0.04, 0.17, 0.19, 0.06, 0.05]$ . Expected reward ≈ 4.06. KL ≈ 0.11 nats. The policy has shifted toward polite responses but has not concentrated on any of them.

Step 8 — Verify the Pareto trade-off. Stack the three points: (β=0.1, KL=1.85, E[r]=8.00), (β=1, KL=1.11, E[r]=7.61), (β=10, KL=0.11, E[r]=4.06). Reward rises monotonically as β falls; KL rises with it. The curve of E[r] against KL is concave: starting from π_ref (KL = 0, E[r] ≈ 1.76), the first 0.11 nats buy about 2.3 reward units, the next full nat buys about 3.5, and the final 0.74 nats buy only about 0.4. This is the alignment Pareto frontier; β is the parameter that chooses your operating point on it.

Step 9 — Cross-check against the explorer. Open the widget below and drag β to 1.0 (logβ = 0). The displayed probabilities should match step 4 to three decimals. Drag to logβ = -1 (β = 0.1) and the bars should collapse onto polite-detailed; drag to logβ = +1 (β = 10) and the indigo bars should sit much closer to the gray π_ref bars, with only a mild shift toward the polite responses. This is your acceptance test for the whole derivation.

Visualizing the KL-Tilted Policy

The widget below evaluates $\pi^{*}(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \cdot \exp(r(y) / \beta)$ on the eight responses from the walkthrough. The gray bar for each response is its probability under $\pi_{\text{ref}}$ ; the indigo bar is its probability under the current $\pi^{*}$ . Drag the β slider to watch the tilt strengthen or weaken. The right panel shows the live $\mathbb{E}_{\pi^{*}}[r]$ and $\mathrm{KL}(\pi^{*} \,\|\, \pi_{\text{ref}})$ and traces the (KL, reward) Pareto frontier as β varies.

[Interactive widget: KL-tilted policy explorer]

Four things to internalise from the sandbox. First, the Pareto curve has a knee: there is a small range of β where you get most of the reward gain for only part of the KL cost. At smaller β you are paying KL for diminishing reward; at larger β you are leaving reward on the table. Modern RLHF runs sweep β to find this knee. Second, the polite-formal and polite-concise responses both gain mass under a mild tilt but never dominate, because their rewards (6 and 7) are not the maximum; as β shrinks they lose mass to polite-detailed. The KL term prevents all-on-argmax behavior unless β is small. Third, the off-topic response has the largest $\pi_{\text{ref}}$ mass but is steadily suppressed as the tilt strengthens -- this is what the pretrained model's "helpfully wrong" behavior looks like in distribution. Fourth, the rude and dishonest responses each fall below about 2% at β = 5 and below 0.1% at β = 2 -- even a moderate tilt is very effective at suppressing strongly-negative-reward responses, which is why even one round of RLHF dramatically reduces overtly bad outputs.

Plain Python: Solving the Aligned Policy by Hand

The script below implements the full closed form in NumPy. No deep-learning framework, no GPU. We define the reference distribution, the reward vector, and the KL-tilted policy; sweep β; and verify the two limiting cases.

🐍aligned_policy_closed_form.py

A single prompt with an enumerable response set is the cleanest possible setting

In real RLHF the response space is the set of all token sequences of length up to T -- combinatorially enormous, never enumerable. But the closed-form policy computed in aligned_policy holds identically in both worlds; only the partition function Z changes from a sum over 8 responses to a sum over every possible token sequence. Starting with 8 enumerable responses lets us compute everything by hand and inspect the policy directly.

EXECUTION STATE

responses = list of 8 string labels

len(responses) = 8

π_ref is the pretrained model -- our anchor

The reference policy comes from pretraining on the web corpus. It has no notion of 'polite' or 'helpful'; it just predicts the next token by frequency. The toy logits here reflect that: off-topic and verbose responses get the largest prior mass because the internet is full of preamble. This is the distribution we will tilt away from, and it is also the anchor we must not drift too far from -- otherwise the model loses its general language ability.

EXECUTION STATE

ref_logits = [1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9]

argmax π_ref = off-topic (index 4)

r(y) is a scalar reward -- the only thing the human preference signal becomes

After preference data is collected and a reward model is trained (Sections 14.2-14.3), the entire human preference signal is summarised by one scalar per response. This is a massive compression: 'is this answer better than that one across all dimensions a human cares about' becomes one number. The reward is the entire surface that RLHF will climb -- if the reward model is wrong, the aligned model is wrong, and there is no automatic way to detect this from the training loss alone.

EXECUTION STATE

r = [8, -6, 7, -3, 1, 6, -2, -5]

argmax r = polite-detailed (index 0)

The KL-regularised objective in one line

J(π) = E_y~π[r(y)] - β · KL(π || π_ref). Two terms in tension: the reward term wants the policy to put all mass on the highest-r response; the KL term pulls back toward π_ref. β is the only knob that arbitrates. The same objective shows up in InstructGPT (β=0.02 by default), Anthropic's HH-RLHF (β ~ 0.01), and every PPO/DPO loss in the literature -- they are all approximations of an argmax of this functional.

The closed-form aligned policy

Differentiate J(π) under the simplex constraint via a Lagrangian, set derivative to zero, and you get π*(y) ∝ π_ref(y) · exp(r(y) / β). This is the Boltzmann tilt. It says: start from π_ref and re-weight every candidate by exp(r/β), then renormalise. Operate in log-space (the log_unnorm -= log_unnorm.max() step) to avoid overflow when r/β is large -- a real reward can be in [-20, +20] and β in [0.01, 1], giving exponents up to 2000.

EXECUTION STATE

π*(y) shape = same as π_ref

log-space safety = subtract max before exp

KL is computed exactly because π and π_ref are both finite vectors

KL(p || q) = Σ p(y) · log(p(y) / q(y)). In RLHF we never have this luxury -- we estimate KL from samples, often via the k3 estimator (exp(log_ratio) - 1 - log_ratio, with log_ratio = log π_ref(y) - log π(y) on samples y ~ π; available in TRL and OpenRLHF). The estimator matters a lot at scale: a naive log-ratio mean is unbiased but has high variance. Here we get the exact value for free.

Sweep β to expose the trade-off

Six values of β, six policies. At β=0.1 the policy is sharp: almost all mass on polite-detailed (r=8), and the KL is at its ceiling (~1.85 nats, i.e. -log π_ref(polite-detailed)). At β=20 the policy barely budges from π_ref: small reward gain, tiny KL. The Pareto frontier is concave when reward is plotted against KL -- you can buy reward by spending KL, with diminishing returns.

EXECUTION STATE

β = 0.1 = E[r] ≈ 8.00, KL ≈ 1.85

β = 1.0 = E[r] ≈ 7.61, KL ≈ 1.11

β = 5.0 = E[r] ≈ 5.61, KL ≈ 0.34

β = 20 = E[r] ≈ 2.98, KL ≈ 0.03

The two pathological limits

β -> 0: the exponential tilt dominates π_ref completely, and π* concentrates on argmax_y r(y). This is reward hacking in its purest form -- if the reward model has any bug or shortcut, the aligned model will find it and produce only that. β -> ∞: the tilt is invisible, and π* = π_ref; the model is back to being a pretrained LM with no preference signal. Real RLHF runs live in the narrow band 0.01 < β < 0.1 where reward improves measurably but the KL stays bounded.

EXECUTION STATE

β = 0.01 = P(top response) ≈ 1.0

β = 10⁶ = max|π - π_ref| ≈ 10⁻⁶

```python
import numpy as np

# --- A tiny "world" with a single prompt and N candidate responses ---------
# A real LM has 50k tokens x 2k positions of responses, but the closed-form
# math is identical -- only the cardinality changes.

responses = [
    "polite-detailed",   # r =  8
    "rude-refusal",      # r = -6
    "polite-concise",    # r =  7
    "evasive",           # r = -3
    "off-topic",         # r =  1
    "polite-formal",     # r =  6
    "lazy",              # r = -2
    "dishonest",         # r = -5
]

# Pretrained reference distribution pi_ref(y | x). The pretrained LM has no
# concept of "polite" -- its logits reflect frequency in the web corpus, so
# verbose and off-topic responses get nearly the same prior mass as polite
# ones.
ref_logits = np.array([1.2, 0.5, 1.0, 0.3, 1.5, 1.1, 0.8, 0.9])
pi_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()

# Reward model r(y) reflecting human preference. In a real RLHF pipeline this
# is a 7B reward model trained on pairwise comparisons; here it is just a
# vector of integers a human annotator would assign.
r = np.array([8, -6, 7, -3, 1, 6, -2, -5], dtype=float)

# --- The KL-regularised RLHF objective -------------------------------------
#   J(pi) = E_y~pi[r(y)] - beta * KL( pi || pi_ref )
# Setting dJ/dpi = 0 under the simplex constraint yields the closed form:
#   pi*(y) = (1/Z) * pi_ref(y) * exp( r(y) / beta )
# This is the central object of RLHF -- everything else (PPO, DPO, GRPO,
# best-of-N) is a different way of approximating samples from this pi*.

def aligned_policy(pi_ref, r, beta):
    log_unnorm = np.log(pi_ref) + r / beta
    log_unnorm -= log_unnorm.max()       # numerical safety
    exps = np.exp(log_unnorm)
    return exps / exps.sum()

def kl(p, q):
    mask = p > 1e-12                     # skip terms where p is effectively 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

# --- Sweep beta to see the alignment knob in action ------------------------
print(f"{'beta':>8} | {'E[r]':>7} | {'KL':>6} | top response")
print("-" * 60)
for beta in [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]:
    pi = aligned_policy(pi_ref, r, beta)
    top = responses[int(np.argmax(pi))]
    Er  = float((pi * r).sum())
    print(f"{beta:8.2f} | {Er:7.3f} | {kl(pi, pi_ref):6.3f} | {top}")

# --- Two important limits --------------------------------------------------
# beta -> 0+   : pi*  collapses onto argmax_y r(y)   (reward hacking)
# beta -> inf  : pi* -> pi_ref                       (no alignment effect)
pi_sharp = aligned_policy(pi_ref, r, 0.01)   # near-zero temperature
pi_flat  = aligned_policy(pi_ref, r, 1e6)    # near-infinite temperature
print(f"beta=0.01 -> P(argmax) = {pi_sharp.max():.6f}  (reward hacked)")
print(f"beta=1e6  -> max|pi-pi_ref| = {np.max(np.abs(pi_flat - pi_ref)):.2e}")
```

Two non-obvious details worth a second look. First, we work in log-space throughout (the log_unnorm -= log_unnorm.max() step in aligned_policy) rather than computing $\pi_{\text{ref}}(y) \cdot \exp(r(y)/\beta)$ directly. With realistic rewards (typically scaled to mean 0, std 1 after the reward model trains) and β = 0.05, the exponent $r/\beta$ can easily exceed 60, and a reward five standard deviations out gives 100; exp overflows float16 above about 11 and float32 above about 88. Production RLHF code works with log-probabilities for exactly this reason. Second, the "reward hacked" regime at β = 0.01 produces a policy that is essentially deterministic -- $P(\arg\max) \approx 1$ . In an LLM-scale setting this manifests as the post-RLHF model producing identical or near-identical responses to many different prompts ("mode collapse"). A practitioner spotting mode collapse in samples should immediately suspect β is too small or that the reward model has a shortcut that the policy has discovered.
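The overflow is easy to reproduce. The sketch below uses made-up values (three responses, uniform reference probabilities, rewards of a few standard deviations, β = 0.05) and compares the naive computation with the log-space one in float32.

```python
import numpy as np

# Hypothetical values for illustration only.
pi_ref = np.array([1 / 3, 1 / 3, 1 / 3], dtype=np.float32)
r = np.array([5.0, 4.0, -1.0], dtype=np.float32)
beta = np.float32(0.05)

# Naive: exponentiate r / beta directly. exp(100) exceeds the float32 maximum.
with np.errstate(over="ignore", invalid="ignore"):
    naive = pi_ref * np.exp(r / beta)
    naive = naive / naive.sum()

# Log-space: subtract the max before exponentiating, so the largest exponent is 0.
log_unnorm = np.log(pi_ref) + r / beta
log_unnorm -= log_unnorm.max()
stable = np.exp(log_unnorm)
stable = stable / stable.sum()

print("naive :", naive)
print("stable:", stable)
```

The naive path produces an infinity that turns into NaN on normalisation, while the log-space path returns a valid distribution concentrated on the top-reward response.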

Sanity-check yourself. Run the script with beta = 1.0. The top response should be polite-detailed with probability ~0.702. E[r] should be ~7.61. KL should be ~1.11 nats. These three numbers should also match what the visualizer displays at logβ = 0 -- a four-way cross-check between the closed-form math, the manual walkthrough, the Python implementation, and the SVG widget.

PyTorch: From Toy Distribution to Token-Level Policy

The 8-response toy is conceptually identical to the real thing, but the real thing has two complications: the response space is a sequence of tokens (so "y" lives in a combinatorial space), and the policy is parameterised by a 70-billion-parameter neural network. The PyTorch snippet below shows the half of the story that scales without changes -- the per-token KL-tilted distribution at one decoding step -- and previews the sampling machinery that the rest of the chapter builds out for full sequences.

🐍token_level_aligned_policy.py

Vocab size is the only thing that changes from the toy example

Every conceptual ingredient is identical: a reference distribution, a reward, and a KL-tilted policy. The leap from 8 responses to 32,000 tokens is purely about cardinality. Real production vocabularies are larger still -- Llama-3 has 128k tokens, GPT-4 has ~100k, Claude uses a similar order of magnitude. The closed-form expression scales linearly in V at one decoding step, but enumerating π* over an entire 2048-token generation requires 32000^2048 paths -- which is why we sample instead.

EXECUTION STATE

V (vocab size) = 32,000

β (KL penalty) = 0.1 (typical RLHF)

ref_logits comes from a frozen forward pass

In production, this vector is the output of a forward pass through the frozen pretrained (or SFT) model. It is the same model that produced π_ref(y | x) at this step before any RLHF training began. Critically: π_ref stays frozen throughout RLHF -- if it drifted, the KL anchor would drift with it and the regularisation would be meaningless. Every PPO/GRPO codebase keeps a separate reference model in memory exactly for this reason.

EXECUTION STATE

ref_logits shape = [V] = [32000]

Per-token vs sequence-level reward is the practical wrinkle

Most reward models return one scalar per sequence (placed at the EOS token). A few -- process reward models for chain-of-thought, Constitutional AI critics -- return a reward per step. The aligned-policy formula does not care; it just needs r(y) for whatever 'y' is. In InstructGPT and DeepSeek R1, the per-token reward is zero except at the end, then the per-token KL penalty is added on every step -- this is the standard RLHF reward shaping.

Closed-form tilt in log-space, in 3 lines

Numerically stable computation: form the log unnormalised, subtract logsumexp to normalise, exponentiate. This is the same operation as aligned_policy in the plain Python file -- the only difference is that PyTorch does it in parallel across all 32,000 tokens, returning a vector. If you ever see a PPO codebase computing softmax(logits) and then exp(reward/beta) separately, expect numerical overflow at small β.

EXECUTION STATE

log_pi_star shape = [V]

Σ pi_star = 1.0 (verified)

Top-5 reveals what 'alignment' physically looks like

Under π_ref, the top-5 tokens reflect what the pretrained model thinks is likely next -- often grammar-driven, frequency-driven, with no preference signal. Under π*, the top-5 are re-ordered by the reward: tokens with high per-token reward have been pushed to the top. The amount of re-ordering is controlled by β. At β=0.1 the re-ordering is dramatic; at β=10 the two top-5 lists are nearly identical.

Sampling is how real RLHF actually trains

torch.multinomial draws tokens proportional to π*. In production the sampling happens inside the actor model's generate() call, which draws from the current policy π_θ, with on-policy samples used to compute the PPO loss. The sample size here (1024) stands in for the rollout-batch-size hyperparameter in real PPO; typical values are 256 to 4096 sequences per update. Sampling is what makes alignment tractable at scale -- we never enumerate the distribution over whole sequences.

EXECUTION STATE

rollout size = 1024 samples

The k3 estimator: unbiased, low-variance, widely used

John Schulman's k3 estimator (exp(log_ratio) - 1 - log_ratio, where log_ratio = log π_ref(y) - log π(y) for samples y ~ π) is a standard way to estimate KL(π || π_ref) from samples. It equals the KL in expectation and every per-sample term is non-negative (unlike a naive log-ratio mean, which can dip negative on a single batch). When π is close to π_ref it also has lower variance than k1 (just -log_ratio); k2 (1/2 · log_ratio^2) is low-variance too, but biased. k3 is widely used in modern PPO/GRPO implementations.

EXECUTION STATE

k3 KL = estimator on 1024 samples

agreement with exact = tightens with more samples; can be loose under a strong tilt (small β)

```python
import torch
import torch.nn.functional as F

# --- The same closed form, now at TOKEN scale -----------------------------
# In a real LM, "y" is a token sequence y_1 ... y_T. The KL-regularised
# objective factorises over tokens because both the policy and the reference
# are autoregressive. For a single token y at position t given context x:
#
#     pi*(y | x) propto pi_ref(y | x) * exp( r_t(x, y) / beta )
#
# Below we simulate one decoding step with vocab_size = 32000.

V = 32_000          # vocabulary size (Llama-3 has 128k, GPT-4 has ~100k)
beta = 0.1          # KL penalty -- typical RLHF value

# pi_ref(. | x): logits coming out of the frozen pretrained LM at this step.
torch.manual_seed(0)
ref_logits = torch.randn(V) * 2.0          # shape [V]

# r_t(x, y): per-token reward for each candidate token. In real RLHF this is
# usually zero everywhere except at the end-of-sequence position, where the
# scalar reward from the reward model is placed. Here we use a per-token
# proxy reward to keep the example self-contained.
per_token_reward = torch.randn(V) * 0.5    # shape [V]

# --- Closed-form aligned next-token distribution --------------------------
# log pi*(y | x) = log pi_ref(y | x) + r(y) / beta - log Z
log_pi_ref = F.log_softmax(ref_logits, dim=-1)
log_pi_unnorm = log_pi_ref + per_token_reward / beta
log_pi_star = log_pi_unnorm - torch.logsumexp(log_pi_unnorm, dim=-1)
pi_star = log_pi_star.exp()

# --- Sanity checks --------------------------------------------------------
print("Sum pi*:", pi_star.sum().item())                  # 1.0 by construction
print("KL(pi* || pi_ref):",
      (pi_star * (log_pi_star - log_pi_ref)).sum().item())

# --- Top-5 tokens before and after tilting --------------------------------
pi_ref = log_pi_ref.exp()
top_ref  = torch.topk(pi_ref,  k=5)
top_star = torch.topk(pi_star, k=5)
print("Top-5 under pi_ref :", top_ref.indices.tolist(),
      "p =", [round(p, 4) for p in top_ref.values.tolist()])
print("Top-5 under pi*   :", top_star.indices.tolist(),
      "p =", [round(p, 4) for p in top_star.values.tolist()])

# --- The SAMPLE-BASED estimator used in real PPO --------------------------
# In PPO/GRPO we never form pi_star explicitly -- the space of whole
# sequences is far too large to enumerate. Instead we sample y ~ pi_theta
# (pi_star stands in for pi_theta here), compute the per-token log-ratio
# log pi_ref(y) - log pi_theta(y), and use the k3 KL estimator (Schulman 2020):
#
#     k3_KL = exp(log_ratio) - 1 - log_ratio
#
# which is unbiased and always non-negative.

sampled_tokens = torch.multinomial(pi_star, num_samples=1024, replacement=True)
log_ratio = log_pi_ref[sampled_tokens] - log_pi_star[sampled_tokens]
k3_kl = (log_ratio.exp() - 1.0 - log_ratio).mean()
print("Sampled k3 KL:", k3_kl.item(), "vs exact:",
      (pi_star * (log_pi_star - log_pi_ref)).sum().item())
```

Three things to internalise here. First, at one decoding step the math is identical: a softmax over reference logits, plus a per-token reward bonus, gives an aligned next-token distribution. The complication is that during real RLHF, we never directly form this distribution at every position -- we sample from $\pi_{\theta}$ (a parameterised approximation of $\pi^{*}$ ) and use those samples to compute the loss. Second, the frozen reference policy is a full copy of the pretrained model that stays resident in memory throughout RLHF. For a 70B model in bf16 this is 140 GB of weights -- non-trivial. Modern RLHF stacks (TRL, OpenRLHF, verl) either keep the reference on a separate set of GPUs or share weights with the SFT model and detach gradients. Third, the k3 KL estimator on line 55 is the practical workhorse. It is unbiased, always non-negative on single-sample estimates (which a naive log-ratio mean is not), and has lower variance than the simpler estimators. You will see this exact line of code reappear in every algorithm in the rest of this chapter.
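To check the estimator's properties on something small enough to verify by hand, here is a minimal sketch in plain Python that compares the naive log-ratio estimator (often called k1) with k3 on a three-token vocabulary. The two toy distributions and the function name kl_estimators are illustrative assumptions, not part of any RLHF library. Samples are drawn from the policy and the ratio is taken as reference over policy, which is the convention under which k3 estimates the KL from the policy to the reference.

```python
import math
import random
import statistics

def kl_estimators(p_theta, p_ref, n_samples=20_000, seed=0):
    """Estimate KL(p_theta || p_ref) from samples y ~ p_theta.

    With r = p_ref(y) / p_theta(y):
      k1 = -log r          (naive log-ratio; unbiased, can be negative)
      k3 = r - 1 - log r   (unbiased, never negative)
    """
    rng = random.Random(seed)
    tokens = rng.choices(range(len(p_theta)), weights=p_theta, k=n_samples)
    k1, k3 = [], []
    for y in tokens:
        log_r = math.log(p_ref[y]) - math.log(p_theta[y])
        k1.append(-log_r)
        k3.append(math.exp(log_r) - 1.0 - log_r)
    exact = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))
    return {
        "exact": exact,
        "k1_mean": statistics.fmean(k1),
        "k1_var": statistics.pvariance(k1),
        "k1_min": min(k1),
        "k3_mean": statistics.fmean(k3),
        "k3_var": statistics.pvariance(k3),
        "k3_min": min(k3),
    }

p_ref = [0.5, 0.3, 0.2]        # toy reference distribution (assumed)
p_theta = [0.4, 0.35, 0.25]    # toy policy, close to the reference (assumed)
stats = kl_estimators(p_theta, p_ref)
for name, value in stats.items():
    print(f"{name:8s} {value: .5f}")
```

Both means should land near the exact value, but individual k1 samples go negative whenever the sampled token is less likely under the policy than under the reference, while every k3 sample stays at or above zero.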

At Massive Scale: Why Alignment Is the Hard Part of an LLM

At billions of parameters, the closed form $\pi^{*}(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \cdot \exp(r(x, y) / \beta)$ does not become wrong -- it becomes uncomputable. The engineering story of RLHF is the story of approximating samples from this distribution under several brutal constraints.

Bottleneck	Why it appears at scale	How modern stacks address it
Partition function Z(x)	Sum over all sequences of length up to 2k tokens; combinatorially infeasible	Never form it. Use sample-based estimators (PPO importance weights, DPO pairwise reparameterisation).
Sampling π* directly	Requires evaluating the unnormalised density at every candidate sequence; infeasible for an autoregressive model	Sample from π_θ instead and treat π_θ as an approximation of π*; the objective gap J(π*) − J(π_θ) equals β · KL(π_θ ‖ π*), so improving J shrinks that KL.
Reference-model memory	π_ref is a full frozen copy of a 7-70B model; 14-140 GB of bf16 weights kept resident	Separate GPU group (TRL), parameter sharing with detach (OpenRLHF), or DPO, where the π_ref log-probabilities can be precomputed once so that π_ref leaves the inner loop
Reward-model accuracy	Trained on 100k–1M preference pairs; brittle outside the SFT distribution	Mix in rule-based rewards (Section 14.4), generative LLM-as-judge (Section 14.5), and bound KL to stay near π_ref where r is reliable
KL estimator variance	Per-token KL estimated from a single rollout has very high variance; can destabilise gradient	k3 estimator + entropy bonus + advantage normalisation (PPO/GRPO recipe, Section 14.6 and Chapter 15)
Mode collapse	If β is too small or the reward model has a shortcut, π_θ collapses to a few responses across all prompts	Sweep β on held-out preferences, monitor entropy and KL during training, restart from an earlier checkpoint if collapse is detected
Reward hacking	Strong policies exploit reward-model quirks; visible reward keeps rising while human preference falls	Iterative reward-model retraining, periodic human eval, ensemble rewards, conservative β
Throughput	Each PPO step: 4 GPU passes (actor, ref, critic, reward); 4× slower than pretraining per parameter update	Vectorised rollouts, mixed precision actor, 4-bit reward model, asynchronous rollout/training (verl, OpenRLHF)
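The idea that π_θ is a usable stand-in for π* can be made exact on a toy problem. For the KL-regularised objective J(π) = E_π[r] − β·KL(π ‖ π_ref), the optimum satisfies J(π*) = β·log Z, and any other policy falls short by exactly β·KL(π ‖ π*). The sketch below checks both identities numerically on a four-response action space. The distributions, rewards, and function names are illustrative assumptions chosen so that Z can be enumerated, which is precisely what is impossible at LLM scale.

```python
import math

def tilted_policy(p_ref, rewards, beta):
    """Return pi*(y) proportional to p_ref(y) * exp(r(y) / beta), and Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(p_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def objective(p, p_ref, rewards, beta):
    """J(p) = E_p[r] - beta * KL(p || p_ref)."""
    expected_reward = sum(pi * r for pi, r in zip(p, rewards))
    return expected_reward - beta * kl(p, p_ref)

p_ref = [0.5, 0.3, 0.15, 0.05]     # toy reference policy (assumed)
rewards = [0.0, 1.0, 0.5, 2.0]     # toy rewards (assumed)
beta = 0.5

pi_star, z = tilted_policy(p_ref, rewards, beta)
pi_theta = [0.3, 0.4, 0.2, 0.1]    # an arbitrary sub-optimal policy

j_star = objective(pi_star, p_ref, rewards, beta)
j_theta = objective(pi_theta, p_ref, rewards, beta)
gap = j_star - j_theta

print("J(pi*)            :", j_star)
print("beta * log Z      :", beta * math.log(z))
print("J(pi*) - J(pi)    :", gap)
print("beta * KL(pi||pi*):", beta * kl(pi_theta, pi_star))
```

The first two printed values should agree, and so should the last two, which is why maximising J with samples from π_θ drives π_θ toward π* without ever forming Z.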

The scaling story has a striking feature: apart from reference-model memory, none of these bottlenecks is about how big the model is. They are about how big the action space is and how stable the sample-based estimators are. A 70B model and a 7B model have the same fundamental alignment problem; the 70B model just has more weights to be unstable in. This is why the open-source RLHF community converged on 7-13B models for reward modelling and alignment research -- the same algorithmic story unfolds at roughly one tenth the GPU cost.

The compute split is also striking. For a Llama-3 70B class model, pretraining costs roughly $6 \times 10^{24}$ FLOPs (the standard $6ND$ estimate with $N = 7 \times 10^{10}$ parameters and $D \approx 1.5 \times 10^{13}$ tokens). SFT costs ~1% of that. RLHF (PPO with ~250k samples) costs another ~1-3%. So alignment is roughly 2-4% of total training compute. And yet, on many downstream evaluations -- helpfulness, harmlessness, instruction following, reasoning -- the gap from pretrained to aligned is often larger than the gap from 7B to 70B on the same evaluation. Alignment is the cheap part of the budget and the expensive part of the result.
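The budget arithmetic is easy to reproduce. The sketch below uses the common 6·N·D rule of thumb for dense-transformer training FLOPs and the SFT and RLHF fractions quoted above. The 7B-parameter, 2T-token model is a hypothetical example chosen only to show the calculation, and both function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Common 6 * N * D rule of thumb for dense-transformer training FLOPs."""
    return 6.0 * n_params * n_tokens

def alignment_share(sft_fraction, rlhf_fraction_low, rlhf_fraction_high):
    """Range of (SFT + RLHF) compute as a fraction of pretraining compute."""
    return (sft_fraction + rlhf_fraction_low,
            sft_fraction + rlhf_fraction_high)

# Hypothetical model: 7B parameters trained on 2T tokens.
pretrain = training_flops(7e9, 2e12)

# Fractions from the text: SFT ~1%, RLHF another ~1-3%.
low, high = alignment_share(0.01, 0.01, 0.03)

print(f"pretraining: {pretrain:.2e} FLOPs")
print(f"alignment:   {low:.0%} to {high:.0%} of pretraining "
      f"({low * pretrain:.2e} to {high * pretrain:.2e} FLOPs)")
```

Swapping in a different parameter count or token budget rescales the absolute numbers linearly but leaves the alignment share unchanged.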

Engineering Reality: Reward Hacking and the KL Anchor

Three patterns dominate practical RLHF debugging, and all three fall directly out of the math we have built up.

1. Reward keeps rising, human eval flatlines

Symptom: the training-time reward graph climbs steadily; held-out human preference is flat or falling. Cause: the policy has discovered a shortcut in the reward model -- maybe responses starting with "Certainly!", or longer responses, or responses that paraphrase the prompt back. The closed-form math explains why this is inevitable: $\pi^{*}$ concentrates on $\arg\max_{y} r(x, y)$ as β shrinks; if the reward model assigns high reward to a shortcut, that shortcut becomes the argmax, and the aligned model produces it. The fix is to retrain the reward model on the failure mode, raise β, or both. Anthropic's iterated online RLHF work (Bai et al., 2022) updates the preference model and the policy on a regular cadence with fresh human feedback collected on the current policy's outputs, and reports that this improves both the data and the resulting models.
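The concentration effect can be seen on a three-response toy problem. In the sketch below the reward model slightly over-scores a "Certainly!"-style shortcut that the reference policy rarely produces. All numbers, response labels, and the function name are illustrative assumptions. As β shrinks, the tilted policy moves its mass from the genuinely helpful answer to the shortcut.

```python
import math

def tilted_policy(p_ref, rewards, beta):
    """pi*(y) proportional to p_ref(y) * exp(r(y) / beta), computed stably."""
    logits = [math.log(p) + r / beta for p, r in zip(p_ref, rewards)]
    top = max(logits)                      # subtract the max before exp
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

responses = ["helpful answer", "'Certainly!' plus padding", "off-topic"]
p_ref = [0.6, 0.1, 0.3]          # reference policy rarely uses the shortcut
rm_reward = [1.0, 1.3, 0.0]      # reward model over-scores the shortcut

shortcut_mass = {}
for beta in (1.0, 0.3, 0.1, 0.03):
    pi = tilted_policy(p_ref, rm_reward, beta)
    shortcut_mass[beta] = pi[1]
    row = "  ".join(f"{name}: {p:.3f}" for name, p in zip(responses, pi))
    print(f"beta={beta:<5} {row}")
```

At large β the reference policy's preference for the helpful answer dominates; at small β the 0.3 reward edge is amplified by 1/β and the shortcut takes over, which is the reward-hacking symptom in miniature.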

2. KL grows monotonically; the loss never converges

Symptom: the KL term in the PPO loss grows linearly, never plateauing. Cause: β is too small for the reward scale, or the reward model has high variance, or the actor learning rate is too high. The closed-form picture: $\pi^{*}$ is the maximiser of the functional $J(\pi)$ and sits at a finite KL from $\pi_{\text{ref}}$ ; if KL is exploding, the policy is overshooting that optimum each step. The fix is any combination of: raise β, lower the actor learning rate, use an adaptive KL controller that raises β whenever the measured KL exceeds a target (the adaptive KL penalty from the original PPO paper, adopted for language-model fine-tuning by Ziegler et al., 2019), or normalise rewards to zero mean and unit variance before passing them to the loss.
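Two of these fixes are small enough to sketch directly. The controller below follows the proportional adaptive-KL idea: compare the observed KL to a target, clip the relative error, and nudge β up or down. The class name, the clip range of 0.2, and the target and horizon values are assumptions made for this illustration, not the API of any particular library. The second function is plain batch reward normalisation.

```python
class AdaptiveKLController:
    """Proportional controller that steers beta toward a target KL (sketch)."""

    def __init__(self, init_beta, target_kl, horizon):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon          # larger horizon = slower adaptation

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta wildly.
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

def normalise_rewards(rewards, eps=1e-8):
    """Shift and scale a batch of rewards to zero mean and unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

controller = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=10_000)
beta_after_high_kl = controller.update(observed_kl=12.0, n_steps=256)
beta_after_low_kl = controller.update(observed_kl=3.0, n_steps=256)
scaled = normalise_rewards([2.0, 4.0, 4.0, 6.0])

print("beta after KL above target:", beta_after_high_kl)
print("beta after KL below target:", beta_after_low_kl)
print("normalised rewards:", [round(s, 3) for s in scaled])
```

β rises when the policy is further from the reference than intended and falls when it is closer, so the KL curve is pulled toward a plateau instead of growing without bound.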

3. The aligned model has lost its general ability

Symptom: post-RLHF model is great on the preference distribution but worse than the pretrained model on code, math, translation, long-tail factual recall, or anything not represented in the SFT and preference data. Cause: the policy has drifted from $\pi_{\text{ref}}$ in directions the KL anchor does not adequately penalise (the KL is an average, not a worst-case bound). The fix is to raise β, mix pretraining tokens into the RLHF replay (the "PPO-ptx" trick from InstructGPT), or alternate RLHF steps with brief next-token-prediction fine-tuning on pretrain data. All three keep the policy from drifting away from the pretrained distribution in any one direction.
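Written in this section's notation, the pretraining-mix fix adds a next-token log-likelihood term on the pretraining corpus $\mathcal{D}$ to the KL-regularised objective. The form below is a sketch of that combined objective; the coefficient $\gamma$, which sets how strongly pretraining data is mixed in, is a hyperparameter to be tuned and is not otherwise defined in this chapter.

```latex
J_{\text{ptx}}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_{\theta}(\cdot \mid x)}
      \Big[ r(x, y)
            - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \Big]
  + \gamma \, \mathbb{E}_{(x, y) \sim \mathcal{D}}
      \big[ \log \pi_{\theta}(y \mid x) \big]
```

Setting $\gamma = 0$ recovers the plain KL-regularised objective. The second term is the negative of the pretraining loss $\mathcal{L}_{\text{pre}}(\theta)$ , so a positive $\gamma$ penalises drift on exactly the data the KL average can miss.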

The single most useful diagnostic in RLHF. Plot three curves over training steps: (a) the training reward, (b) the per-step KL $\mathrm{KL}(\pi_{\theta} \,\|\, \pi_{\text{ref}})$ , (c) the held-out reward on a frozen evaluation set sampled from $\pi_{\text{ref}}$ . Healthy training: (a) rises, (b) rises and plateaus, (c) rises. Reward hacking: (a) rises, (b) rises unbounded, (c) is flat or falls. Mode collapse: (a) saturates, (b) is small, (c) is flat. Every PPO codebase that survives contact with reality emits these three curves at every logging step.
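The three signatures above can be turned into a rough automated check. The sketch below fits a least-squares slope to the most recent window of each curve and applies the rules as stated. The function names, the window length, and the thresholds tol and kl_small are hypothetical values picked for illustration and would need tuning against real training logs.

```python
def slope(values):
    """Least-squares slope of a curve against its step index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def diagnose(train_reward, kl, heldout_reward,
             window=50, tol=1e-3, kl_small=0.5):
    """Label the recent trend of the three curves (heuristic sketch)."""
    a_up = slope(train_reward[-window:]) > tol
    kl_up = slope(kl[-window:]) > tol
    c_up = slope(heldout_reward[-window:]) > tol
    if a_up and c_up and not kl_up:
        return "healthy"
    if a_up and kl_up and not c_up:
        return "reward hacking"
    if not a_up and not c_up and kl[-1] < kl_small:
        return "mode collapse"
    return "inconclusive"

steps = range(100)
rising = [0.01 * i for i in steps]

# Synthetic curves, one per regime described in the text.
healthy = diagnose(rising, [min(5.0, 0.1 * i) for i in steps],
                   [0.008 * i for i in steps])
hacking = diagnose(rising, [0.1 * i for i in steps],
                   [0.5 - 0.001 * i for i in steps])
collapse = diagnose([1.0] * 100, [0.05] * 100, [0.3] * 100)

print(healthy, "|", hacking, "|", collapse)
```

Early in a healthy run the KL is still climbing, so this check returns "inconclusive" until the KL curve flattens; it is meant as an alarm on top of the plots, not a replacement for looking at them.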

The pattern across all three failure modes is the same: the KL-tilted optimal policy $\pi^{*}$ is the right target, but it is only as good as the reward model and only as stable as the KL anchor. The rest of this chapter is the toolkit for getting both right -- preference data collection and the Bradley-Terry model in Section 14.2; reward model architecture and training in 14.3; rule-based and generative rewards that bypass the reward-model bottleneck in 14.4 and 14.5; and the PPO algorithm that makes sampling from $\pi^{*}$ stable in 14.6. Every one of those sections is a different facet of the same problem set up here: how to climb the reward without falling off the pretrained manifold.