The Real Problem: RLHF's Four-Model Tax
At the end of § 14.6 we left the standard RLHF pipeline in working order, but expensive. Every PPO step keeps four full transformer models in memory at once — the trainable policy, a value head bolted to the policy backbone, the frozen reference (SFT) model, and the frozen reward model. Each step does a rollout, a reward-model forward, a reference forward, several policy and value forwards, a joint backward, a gradient clip, and a step. The wall-clock split was 85% rollout, 15% optimisation. At 70B, the memory tax was ~2 TB before activations. The procedure works — it gave us InstructGPT, ChatGPT, Claude — but for academic groups and small teams it is brutal.
Now stand back and ask: what does PPO actually need to do? We started in § 14.2 with preference data: triples $(x, y_w, y_l)$ where $y_w$ is the response a human preferred over $y_l$ on prompt $x$. We trained a reward model to score responses. We then used PPO to push the policy in the direction of higher reward, with a KL penalty to the SFT model so the policy stays in the region of human-like text. Three steps: preferences → reward model → RL.
Each of those steps loses information. The reward model compresses a rich preference distribution into a scalar. PPO compresses that scalar back into a stochastic gradient via rollouts. Reward hacking, KL drift, sampling variance — every pathology of RLHF lives somewhere in those translation layers. The natural question is: can we skip the middle steps and train directly from the preferences?
The 2023 DPO paper by Rafailov et al. answered yes — and proved it from the SAME mathematical objective PPO solves. The result is a loss function that is one line of PyTorch, needs no rollout, no reward model, no value head, no critic, and no KL coefficient as a separate term. It is just a supervised classification loss on preference pairs. The same human data, the same KL-regularised objective, but a different parameterisation that makes the whole machinery collapse.
What DPO removes from the RLHF pipeline
Intuition: A Loss Built From Preferences, Not Rewards
Forget the math for a moment. A human looked at two responses and said the first was better. What does that single judgement actually tell us about how the policy should change? Two things:
- The policy should assign more probability to the chosen response than it currently does.
- The policy should assign less probability to the rejected response than it currently does.
That is the entire content of one preference label. PPO uses this signal through three translation layers: train a reward model on many labels, sample responses, score them, use the score as a policy-gradient signal. DPO uses it directly: write a loss whose gradient pushes the policy probability of $y_w$ up and the policy probability of $y_l$ down, relative to a frozen reference. The reference anchor is what stops the model from simply assigning probability 1 to whichever response the human happened to pick; it forces the policy to tilt, not to overwrite.
A useful physical analogy: imagine the probability mass over all possible responses as a pile of sand. The reference model says where the SFT pile currently is. The human preference is a single thumb-tap that says: take some sand from this rejected response and move it onto this chosen response. After thousands of taps, the pile has tilted in the direction the humans collectively prefer, but it never crumbles into a single delta — because every tap is relative to the reference and the loss bottoms out as soon as the tilt is “enough”.
Mental model in one line: DPO is logistic regression on the LOG-RATIO of two response probabilities. The two classes are “the human preferred A” vs “the human preferred B”; the features are the implicit rewards of each response. That is the entire algorithm.
The Bridge from PPO to DPO
DPO is not a heuristic — it is the closed-form solution to the same problem PPO solves, plugged into the Bradley-Terry preference model from § 14.2. Three steps connect them.
Step 1: The KL-regularised reward objective
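Both PPO and DPO maximise the same objective: expected reward minus a KL penalty to the reference.

$$\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big[\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big]$$

Here $r(x, y)$ is some scalar reward (in PPO, the output of the reward model). $\beta$ is the KL temperature. PPO solves this with stochastic gradients and sampling; DPO solves it in closed form.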
Step 2: The closed-form optimum
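Standard variational calculus on the KL-constrained objective gives the optimal policy at any state:

$$\pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right)$$

where $Z(x) = \sum_{y}\pi_{\text{ref}}(y\mid x)\exp\big(r(x, y)/\beta\big)$ is the partition function. The optimal policy is the reference, tilted by an exponential factor proportional to the reward. Now invert this equation: given the optimal policy and the reference, what reward produced it?

$$r(x, y) = \beta\log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$$

This is the central identity: any reward function corresponds to some optimal policy, and vice versa. The reward is recoverable from the optimal policy, modulo a prompt-dependent constant $\beta\log Z(x)$. Call the first term, evaluated at the trainable policy $\pi_\theta$, the implicit reward:

$$\hat r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$$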
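The identity can be checked numerically. The following is a minimal sketch, assuming a toy reference distribution over four candidate responses; the probabilities, rewards, and β = 0.5 are made-up illustration values, not taken from any dataset.

```python
import numpy as np

beta = 0.5                                    # KL temperature (illustrative value)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])       # toy reference over 4 responses
reward = np.array([1.0, 0.0, -1.0, 2.0])      # toy scalar rewards r(x, y)

# Closed-form optimum: tilt the reference by exp(r / beta), then normalise.
unnormalised = pi_ref * np.exp(reward / beta)
Z = unnormalised.sum()                        # partition function Z(x)
pi_star = unnormalised / Z

# Invert: r = beta * log(pi_star / pi_ref) + beta * log Z.
implicit = beta * np.log(pi_star / pi_ref)
recovered = implicit + beta * np.log(Z)

# Reward DIFFERENCES need only the implicit term: beta * log Z cancels.
diff_true = reward[0] - reward[1]
diff_implicit = implicit[0] - implicit[1]

print("recovered rewards:", np.round(recovered, 6))
print("difference (true vs implicit):", diff_true, round(float(diff_implicit), 6))
```

The last two lines are the point of the exercise: the absolute reward needs $Z(x)$, but any difference between two responses to the same prompt does not.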
Step 3: Bradley-Terry on implicit rewards
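The Bradley-Terry model (§ 14.2) says the probability the human prefers $y_w$ over $y_l$ on prompt $x$ is the sigmoid of the reward difference:

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

Plug in the recovered reward. The prompt-dependent constants $\beta\log Z(x)$ appear in BOTH the chosen and rejected term and cancel in the difference. What remains is purely in terms of the policy and the reference:

$$P_\theta(y_w \succ y_l \mid x) = \sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)$$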
This is a probability the policy itself implies. Take the negative log-likelihood of the observed preferences. That negative log-likelihood is the DPO loss.
The DPO Objective
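Putting it together, the DPO loss on a preference dataset $\mathcal{D}$ is:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

Every symbol:
- $\pi_\theta$ is the trainable policy (initialised from the SFT model). Only its parameters get gradients.
- $\pi_{\text{ref}}$ is the frozen reference model, almost always the same SFT model used to initialise the policy. It defines “the prior” the loss measures deviation from.
- $y_w, y_l$ are the chosen and rejected responses; $x$ is the prompt.
- $\pi(y\mid x)$ is the probability the model assigns to the FULL response: the product of per-token probabilities $\prod_t \pi(y_t \mid x, y_{<t})$. In log space this is a sum of per-token log-probabilities, which is what we actually compute.
- $\beta$ is the KL temperature, the single tuneable hyperparameter.
- $\sigma$ is the logistic sigmoid; $-\log\sigma(\cdot)$ is the standard binary cross-entropy loss for a Bernoulli with one observed positive label per pair.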
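The "sum of per-token log-probabilities" is the only model-dependent quantity the loss consumes. Below is a minimal NumPy sketch of that reduction; the function name `sequence_logprob`, the array shapes, and the random logits are assumptions for illustration (a real trainer would take the logits from the transformer and shift them so that each position scores the next token).

```python
import numpy as np

def sequence_logprob(logits, token_ids, mask):
    """Sum of per-token log-probs of `token_ids` under `logits`.

    logits:    (T, V) scores, row t scoring token t of the response
    token_ids: (T,)   response tokens actually observed
    mask:      (T,)   1.0 for response tokens, 0.0 for padding
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-prob of each observed token, then sum the unmasked ones.
    per_token = log_probs[np.arange(len(token_ids)), token_ids]
    return float((per_token * mask).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 11))             # toy sizes: 5 positions, vocab of 11
token_ids = np.array([3, 7, 1, 0, 0])
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])    # last two positions are padding
total = sequence_logprob(logits, token_ids, mask)
print("log pi(y | x) =", round(total, 4))
```

The mask matters: without it, padding tokens would contribute log-probs to the sum and silently change every implicit reward.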
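The gradient with respect to $\theta$, writing $\hat r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ for the implicit reward:

$$\nabla_\theta\mathcal{L}_{\text{DPO}} = -\beta\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\sigma\big(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w)\big)\,\big(\nabla_\theta\log\pi_\theta(y_w\mid x) - \nabla_\theta\log\pi_\theta(y_l\mid x)\big)\Big]$$

Read this carefully. The gradient is a per-pair coefficient times the difference of two log-likelihood gradients. The coefficient is large when the policy currently RANKS THE WRONG RESPONSE HIGHER (rejected reward greater than chosen) and small when it ranks them the right way. The two log-likelihood gradients are exactly the SFT gradients you would compute to teach the model to produce $y_w$ and to AVOID $y_l$, scaled and combined automatically.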
The one sentence to remember: DPO is SFT on the chosen response, minus SFT on the rejected response, weighted by how confidently the model is currently getting the ranking wrong.
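The gradient formula can be sanity-checked against finite differences. The sketch below assumes a deliberately tiny setting: a tabular softmax policy over four candidate responses for a single prompt, so that the gradient of log π with respect to the logits is just "one-hot minus probabilities". The logits, the uniform reference, and the indices are illustrative assumptions.

```python
import numpy as np

def log_softmax(theta):
    shifted = theta - theta.max()
    return shifted - np.log(np.exp(shifted).sum())

def dpo_loss(theta, ref_logp, w, l, beta):
    logp = log_softmax(theta)
    margin = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
    return np.log1p(np.exp(-margin))          # equals -log(sigmoid(margin))

def dpo_grad(theta, ref_logp, w, l, beta):
    logp = log_softmax(theta)
    pi = np.exp(logp)
    margin = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
    coeff = 1.0 / (1.0 + np.exp(margin))      # sigmoid(-margin): large when ranking is wrong
    grad_logp_w = np.eye(len(theta))[w] - pi  # d log pi(w) / d theta for a softmax
    grad_logp_l = np.eye(len(theta))[l] - pi  # d log pi(l) / d theta
    return -beta * coeff * (grad_logp_w - grad_logp_l)

theta = np.array([0.2, -0.5, 1.0, 0.0])       # toy policy logits over 4 responses
ref_logp = log_softmax(np.zeros(4))           # uniform toy reference
w, l, beta = 1, 2, 0.5                        # chosen index, rejected index, temperature

analytic = dpo_grad(theta, ref_logp, w, l, beta)
eps = 1e-6
numeric = np.array([
    (dpo_loss(theta + eps * np.eye(4)[i], ref_logp, w, l, beta)
     - dpo_loss(theta - eps * np.eye(4)[i], ref_logp, w, l, beta)) / (2 * eps)
    for i in range(4)
])
print("analytic:", np.round(analytic, 6))
print("numeric: ", np.round(numeric, 6))
```

In this toy setting the gradient is negative on the chosen logit and positive on the rejected one, so a gradient-descent step raises the former and lowers the latter, which is the "SFT on chosen minus SFT on rejected" reading in miniature.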
The Implicit Reward and the β Knob
The quantity $\hat r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ is the implicit reward DPO assigns to a response. It is not the reward a separate reward model would produce; it is the reward function the policy and reference TOGETHER ENCODE, through the closed-form identity from Step 2 above. Two consequences worth internalising:
Implicit rewards drift; absolute rewards are meaningless
Only DIFFERENCES of implicit rewards matter: for the same prompt the $\beta\log Z(x)$ term cancels. If you log $\hat r_\theta(x, y_w)$ across training, you will see it grow without bound for some prompts and decay for others, and that is fine. What you watch is the MARGIN $\hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)$ averaged across the batch. That is the implicit-reward signal.
β controls how hard the policy is pulled away from the reference
Small β makes the loss aggressive: the margin must be $1/\beta$ times larger in log-ratio space to achieve the same loss reduction, so the policy drifts more. Large β makes the loss conservative: tiny log-ratio shifts suffice, so the policy hugs the reference. The empirical table from real production runs:
| β | Behaviour | Typical use case |
|---|---|---|
| 0.01 – 0.05 | Aggressive drift, large reward margins | When SFT is far from desired behaviour |
| 0.1 (most common) | Balanced drift, ~75–85% pref accuracy | Tulu-3, Llama-3 post-training, Zephyr |
| 0.3 – 0.5 | Conservative, hugs SFT | Safety-focused or small preference datasets |
| 1.0+ | Almost identity (no learning) | Mostly a sanity-check baseline |
β interacts with learning rate: both govern how far the policy ends up from the reference, so their effects are easy to confuse. In practice teams sweep β in {0.05, 0.1, 0.3} at fixed LR, then sweep LR at the winning β.
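To make the role of β concrete, the sketch below computes how large the log-ratio gap between chosen and rejected must be before a pair's loss falls to a given target. The target loss of 0.1 and the helper name `gap_for_loss` are arbitrary choices for illustration.

```python
import math

def gap_for_loss(target_loss, beta):
    """Log-ratio gap (in nats) at which -log(sigmoid(beta * gap)) equals target_loss."""
    margin = -math.log(math.exp(target_loss) - 1.0)   # required implicit-reward margin
    return margin / beta

betas = [0.05, 0.1, 0.3, 1.0]
gaps = {beta: gap_for_loss(0.1, beta) for beta in betas}
for beta, gap in gaps.items():
    print(f"beta={beta:<4}  log-ratio gap needed for loss 0.1: {gap:6.2f} nats")
```

The required gap scales as 1/β: a small β lets the policy travel a long way from the reference before a pair stops contributing, while a large β switches the pair off after a small shift.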
The DPO Loss Landscape
The plot below shows the three coupled quantities for a single preference pair, all on the same x-axis: the implicit reward margin $m = \hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)$. Note that the loss curve $-\log\sigma(m)$, the preference probability $\sigma(m)$, and the gradient strength $\sigma(-m)$ are all the same sigmoid in disguise; what changes is which one you are looking at.
Three observations to anchor:
- At $m = 0$ the loss is $\log 2 \approx 0.693$, the “random preference” baseline. A model with implicit-reward accuracy of 50% sits here.
- As $m \to +\infty$ the loss approaches zero exponentially, AND the gradient strength approaches zero: the pair stops contributing to learning once the model is confidently correct. This is the built-in safety brake.
- As $m \to -\infty$ the loss grows linearly in $|m|$ and the gradient strength saturates at 1. A confidently-wrong pair gets the full gradient signal.
The asymmetry is the punchline: the loss is bounded on the correct side, unbounded on the wrong side. The optimiser never wastes capacity on already-correct preferences and is always sensitive to the wrong ones — for free, no clip needed.
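The three curves are cheap to tabulate. This is a small sketch using only the standard library; the sample margins are arbitrary, and `landscape` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def landscape(margin):
    """Loss, preference probability, and gradient strength at one margin."""
    loss = math.log1p(math.exp(-margin))      # -log(sigmoid(margin))
    pref_prob = sigmoid(margin)               # P(chosen preferred)
    grad_strength = sigmoid(-margin)          # |d loss / d margin|
    return loss, pref_prob, grad_strength

rows = {m: landscape(m) for m in (-5.0, -1.0, 0.0, 1.0, 5.0)}
for m, (loss, p, g) in rows.items():
    print(f"margin={m:+.1f}  loss={loss:.4f}  pref_prob={p:.4f}  grad={g:.4f}")
```

Reading down the printed rows shows the asymmetry directly: on the negative side the loss tracks |margin| and the gradient strength stays near 1, while on the positive side both shrink toward zero.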
Manual Numerical Walkthrough: One DPO Update
Concrete numbers, computed by hand. Imagine one preference pair from an Anthropic HH-style dataset:
| Quantity | Value | Source |
|---|---|---|
| log π_θ(y_w \| x) | −3.50 | policy forward (gradient flows) |
| log π_θ(y_l \| x) | −6.00 | policy forward (gradient flows) |
| log π_ref(y_w \| x) | −4.00 | frozen SFT forward (no grad) |
| log π_ref(y_l \| x) | −5.00 | frozen SFT forward (no grad) |
| β | 0.5 | hyperparameter |
These are SUMS of per-token log-probs across the response: each scalar in the table compresses, say, 60 tokens of response into one number. The chosen response is about $e^{2.5} \approx 12\times$ more likely under the policy than the rejected one; under the reference it's about $e^{1.0} \approx 2.7\times$. So the policy has already started to tilt in the right direction, but not by much.
Log-ratios
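$$\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} = -3.50 - (-4.00) = +0.50, \qquad \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} = -6.00 - (-5.00) = -1.00$$

The policy has already moved the chosen response UP by 0.5 nats over the reference, and the rejected response DOWN by 1.0 nats. Both shifts are in the human-preferred direction, which is encouraging.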
Implicit rewards
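$$\hat r_\theta(x, y_w) = 0.5 \times 0.50 = 0.25, \qquad \hat r_\theta(x, y_l) = 0.5 \times (-1.00) = -0.50$$

Reward margin: $m = 0.25 - (-0.50) = 0.75$.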
Loss
$m = 0.75$; $\sigma(0.75) \approx 0.679$; $\mathcal{L} = -\log 0.679 \approx 0.387$.
For reference, a random initial pair (margin ≈ 0) would have loss ≈ 0.693. So this pair is already “easy” — the model produces the preference correctly with about 68% confidence.
Gradient strength
$$\sigma(-m) = \sigma(-0.75) \approx 0.321$$

Roughly 32% of the maximum possible gradient signal still flows through this pair. By the time the model is at $m = 3$ (highly confident), $\sigma(-3) \approx 0.047$: only 5% of the max signal. The pair is fading from the optimiser's attention.
What one step does
The actual parameter update is the per-pair scalar −0.321 times $\beta$ times the difference $\nabla_\theta\log\pi_\theta(y_w\mid x) - \nabla_\theta\log\pi_\theta(y_l\mid x)$. That gradient is what SFT would produce if you trained on $y_w$ as positive and $y_l$ as “anti-positive” (an unweighted version of unlikelihood training). The signal is weighted by ~0.16 (0.321 × 0.5) and added to whatever the rest of the batch contributes.
The whole algorithm in one paragraph
Plain Python: DPO Loss from Scratch
We re-implement DPO on the same single pair in pure NumPy. No autograd, no transformer — just the loss arithmetic and the scalars a real trainer ends up consuming after the per-token log-prob sums. Swap the hard-coded log-probs for sums over a transformer's output log-softmax and the loss code is byte-for-byte what trl, OpenRLHF, and Tulu use.
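A minimal sketch of that arithmetic follows, using the four log-probs and β = 0.5 from the walkthrough table. The function name `dpo_pair_stats` and the returned dictionary keys are illustrative choices, not any library's API.

```python
import numpy as np

def dpo_pair_stats(policy_w, policy_l, ref_w, ref_l, beta):
    """DPO loss and diagnostics from four summed response log-probs."""
    reward_w = beta * (policy_w - ref_w)      # implicit reward of the chosen response
    reward_l = beta * (policy_l - ref_l)      # implicit reward of the rejected response
    margin = reward_w - reward_l
    loss = np.logaddexp(0.0, -margin)         # stable form of -log(sigmoid(margin))
    grad_strength = np.exp(-np.logaddexp(0.0, margin))   # sigmoid(-margin)
    return {
        "reward_w": reward_w,
        "reward_l": reward_l,
        "margin": margin,
        "loss": loss,
        "grad_strength": grad_strength,
    }

# Scalars from the walkthrough table: log-probs are sums over response tokens.
stats = dpo_pair_stats(policy_w=-3.50, policy_l=-6.00,
                       ref_w=-4.00, ref_l=-5.00, beta=0.5)
for name, value in stats.items():
    print(f"{name:>13}: {value:+.4f}")
# margin is 0.75, loss is about 0.387, grad_strength is about 0.321
```

`np.logaddexp(0, -m)` computes log(1 + e^(−m)) without overflow for large |m|, which is why it is used here instead of a literal `-np.log(sigmoid(m))`. The same function works unchanged on arrays of per-pair log-probs; take `.mean()` of the loss for a batch.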
PyTorch: A Production-Shaped DPO Step
The real thing: one forward pass per response per model (four in total), a masked sum of per-token log-probs, batch-level mean of $-\log\sigma(\cdot)$, gradient clip, and the diagnostics that an RLHF engineer actually watches. Drop in any HF causal LM for policy and a frozen copy of the SFT model for ref_policy and this trains a real DPO run.
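What follows is a sketch of such a step under stated assumptions, not a drop-in trainer. `ToyLM` is a hypothetical stand-in that returns a logits tensor directly; a Hugging Face causal LM returns an output object whose `.logits` field would be used instead. The batch keys, mask layout (4 prompt tokens, 6 response tokens), and hyperparameters are all illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: token ids (B, T) -> logits (B, T, V)."""
    def __init__(self, vocab=32, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.embed(ids))

def response_logprob(model, ids, response_mask):
    """Masked sum of per-token log-probs; position t predicts token t + 1."""
    logp = F.log_softmax(model(ids)[:, :-1], dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)

def dpo_step(policy, ref_policy, optimizer, batch, beta=0.1, max_norm=1.0):
    pol_w = response_logprob(policy, batch["chosen"], batch["chosen_mask"])
    pol_l = response_logprob(policy, batch["rejected"], batch["rejected_mask"])
    with torch.no_grad():                     # the reference never gets gradients
        ref_w = response_logprob(ref_policy, batch["chosen"], batch["chosen_mask"])
        ref_l = response_logprob(ref_policy, batch["rejected"], batch["rejected_mask"])
    reward_w = beta * (pol_w - ref_w)         # implicit rewards
    reward_l = beta * (pol_l - ref_l)
    loss = -F.logsigmoid(reward_w - reward_l).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm)
    optimizer.step()
    return {
        "loss": loss.item(),
        "accuracy": (reward_w > reward_l).float().mean().item(),
        "margin": (reward_w - reward_l).mean().item(),
        "chosen_logp": pol_w.mean().item(),   # watch for collapse in absolute terms
    }

torch.manual_seed(0)
policy = ToyLM()
ref_policy = copy.deepcopy(policy).eval()     # policy starts as an exact copy of the reference
for p in ref_policy.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-2)

prompt = torch.randint(0, 32, (4, 4))         # 4 pairs, 4 prompt tokens each
mask = torch.cat([torch.zeros(4, 4), torch.ones(4, 6)], dim=1)   # score response tokens only
batch = {
    "chosen": torch.cat([prompt, torch.randint(0, 32, (4, 6))], dim=1),
    "rejected": torch.cat([prompt, torch.randint(0, 32, (4, 6))], dim=1),
    "chosen_mask": mask,
    "rejected_mask": mask,
}
history = [dpo_step(policy, ref_policy, optimizer, batch, beta=0.1) for _ in range(20)]
print("first step:", history[0])
print("last step: ", history[-1])
```

Because the policy is initialised as a copy of the reference, the first reported loss sits at log 2; on this single repeated toy batch it then falls as the margin opens up. A real run would iterate over a preference dataset and use a far smaller learning rate.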
At Massive Scale: Why DPO Became the Default
Every line of the PyTorch snippet scales to a 70B-parameter post-training run, with three engineering wins that explain why DPO displaced PPO as the default alignment recipe outside of the largest frontier labs.
The two-model memory budget
At 70B, each model is ~140 GB in bf16. DPO needs two models (policy and reference) for a total of ~280 GB in weights, plus 140 GB for policy gradients, plus ~560 GB for AdamW optimiser state (fp32 first and second moments). Total: ~980 GB before activations. PPO needs the same budget plus a reward model (~140 GB), a value head (~10–140 GB depending on architecture), and a separate KV cache for rollout (~50–200 GB), which with activations easily reaches 2 TB in total. DPO roughly halves the memory footprint of the comparable post-training step. Neither budget fits in the 640 GB of HBM on a single 8×H100 node without offloading; with FSDP sharding, DPO fits across two such nodes (1,280 GB), while PPO at 70B does not.
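The DPO side of that budget is plain arithmetic, reproduced below as a sketch. It assumes decimal gigabytes, bf16 weights and gradients, and two fp32 AdamW moments per parameter, as in the text; activations and framework overhead are ignored.

```python
PARAMS = 70e9                 # 70B parameters
GB = 1e9                      # decimal gigabytes, matching the figures in the text

weights_bf16 = PARAMS * 2 / GB            # 2 bytes per bf16 weight
policy_and_ref = 2 * weights_bf16         # trainable policy + frozen reference
grads_bf16 = PARAMS * 2 / GB              # gradients exist for the policy only
adamw_fp32 = PARAMS * 4 * 2 / GB          # fp32 first and second moments

dpo_total = policy_and_ref + grads_bf16 + adamw_fp32

print(f"one model in bf16:   {weights_bf16:6.0f} GB")
print(f"policy + reference:  {policy_and_ref:6.0f} GB")
print(f"policy gradients:    {grads_bf16:6.0f} GB")
print(f"AdamW state (fp32):  {adamw_fp32:6.0f} GB")
print(f"DPO total:           {dpo_total:6.0f} GB (before activations)")
```

The optimiser state, not the second model, is the largest single item, which is why optimiser-state sharding or offloading is usually the first lever pulled at this scale.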
The reference can be cached
On a fixed preference set, the reference log-probs $\log\pi_{\text{ref}}(y_w\mid x)$ and $\log\pi_{\text{ref}}(y_l\mid x)$ are constants of the dataset; they do not depend on training step. Production DPO pipelines compute them ONCE before training and store them as scalars alongside each preference pair. The reference model then leaves GPU memory entirely. The remaining per-step cost is two forward passes through the POLICY only, half the FLOPs of a PPO step at the same batch size. On a Tulu-3 scale run this turns a 24-hour wall-clock into ~10 hours.
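The caching pattern itself is simple. The sketch below assumes a hypothetical callable `ref_logprob_fn(prompt, response)` standing in for a forward pass through the frozen reference; the dictionary keys and the toy scoring rule are illustrative only.

```python
def cache_reference_logprobs(pairs, ref_logprob_fn):
    """Attach reference log-probs to each preference pair, once, before training.

    pairs:          list of dicts with keys "prompt", "chosen", "rejected"
    ref_logprob_fn: callable (prompt, response) -> summed log-prob under the
                    frozen reference (hypothetical stand-in for a model forward)
    """
    cached = []
    for pair in pairs:
        cached.append({
            **pair,
            "ref_chosen_logp": ref_logprob_fn(pair["prompt"], pair["chosen"]),
            "ref_rejected_logp": ref_logprob_fn(pair["prompt"], pair["rejected"]),
        })
    return cached

# Toy stand-in: pretend every response character costs 0.5 nats.
calls = []
def toy_ref_logprob(prompt, response):
    calls.append((prompt, response))
    return -0.5 * len(response)

pairs = [{"prompt": "2+2?", "chosen": "4", "rejected": "five"}]
dataset = cache_reference_logprobs(pairs, toy_ref_logprob)
print(dataset[0])
print("reference forward passes:", len(calls))
```

After this pass the training loop reads `ref_chosen_logp` and `ref_rejected_logp` from the dataset, so the reference model is called twice per pair in total rather than twice per pair per epoch.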
No rollout = deterministic, replayable runs
PPO rollouts depend on sampling temperature, top-p, the current policy weights, and a random seed; reproducing a PPO failure requires re-rolling the exact same trajectories, which is hard. DPO consumes a fixed offline preference dataset in a fixed order; a failed run is reproducible byte-for-byte by re-running with the same seed. Debugging RLHF stops being an exotic discipline and becomes ordinary supervised-learning debugging. This is the engineering subtext under every DPO paper's claim of “simplicity”.
Where it does NOT scale
DPO has no rollout, which means it CANNOT discover new responses. It can only redistribute probability mass across responses that appear in the preference data. For frontier reasoning models, the chosen response in the dataset is often weaker than what the model itself could now generate — DPO will obediently push toward that weaker response. This is the failure mode that pushed frontier labs back toward on-policy methods (PPO, GRPO) for reasoning training in 2024–2025: only on-policy data lets the model learn from its current best behaviour. DPO is the default for instruction-following and safety alignment; on-policy methods are the default for reasoning. The next chapter picks up this thread with GRPO.
Engineering Reality: The Failure Modes of DPO
DPO is famously stable but it has its own pathologies. The five that bite in real runs:
1. The chosen log-prob collapses too
The most counterintuitive DPO failure: implicit reward of the chosen response goes UP relative to the rejected, but the ABSOLUTE log-prob of the chosen response goes DOWN. The model learned to widen the margin by unlearning the rejected response FASTER than it learned the chosen one. Symptoms: generated responses become shorter, lower quality, more repetitive even though the preference accuracy on held-out pairs looks great. Fix: log both $\log\pi_\theta(y_w\mid x)$ and $\log\pi_\theta(y_l\mid x)$ separately at every step; if the chosen log-prob is falling, raise β or add an SFT regulariser on the chosen response (SLiC-style losses include such a term).
2. Sign bugs
The DPO loss combines FOUR log-prob scalars into one margin. Flipping chosen and rejected in any of the four log-prob arguments can invert the loss and train the model to PREFER the rejected response. Symptoms: loss decreases normally, but accuracy hovers around 50% and stays there (or drops below 50% if the bug is one-sided). Fix: compute the implicit rewards in isolation and assert that r_hat_w.mean() > r_hat_l.mean() within the first few steps. If not, you have a sign bug, period.
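A sketch of that assertion as a reusable check follows. The helper name `check_reward_sign` and the sample log-probs are assumptions for illustration; the check is meant to run after a few optimiser steps, since at step 0 a policy copied from the reference gives both rewards exactly zero.

```python
import numpy as np

def check_reward_sign(policy_w, policy_l, ref_w, ref_l, beta):
    """Return mean implicit rewards (chosen, rejected); raise if they are inverted."""
    r_hat_w = beta * (np.asarray(policy_w) - np.asarray(ref_w))
    r_hat_l = beta * (np.asarray(policy_l) - np.asarray(ref_l))
    if not r_hat_w.mean() > r_hat_l.mean():
        raise AssertionError(
            f"sign bug suspected: chosen {r_hat_w.mean():+.4f} "
            f"<= rejected {r_hat_l.mean():+.4f}"
        )
    return float(r_hat_w.mean()), float(r_hat_l.mean())

# Illustrative summed log-probs for two pairs, a few steps into training.
policy_w, policy_l = [-3.5, -4.0], [-6.0, -5.5]
ref_w, ref_l = [-4.0, -4.2], [-5.0, -5.0]

mean_w, mean_l = check_reward_sign(policy_w, policy_l, ref_w, ref_l, beta=0.1)
print(f"chosen {mean_w:+.4f} > rejected {mean_l:+.4f}: ok")

# Simulate the bug: chosen and rejected arguments swapped at the call site.
try:
    check_reward_sign(policy_l, policy_w, ref_l, ref_w, beta=0.1)
    flipped_detected = False
except AssertionError as err:
    flipped_detected = True
    print("caught:", err)
```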
3. Reference-policy mismatch
If $\pi_\theta \neq \pi_{\text{ref}}$ at step 0, i.e. you initialised the policy from a different model than the reference, the loss starts away from its $\log 2$ baseline and the gradients can be huge on step 0. The policy will leap toward the reference for the first few hundred steps before it starts following preferences. Fix: ALWAYS initialise the policy from the reference. If the reference is the SFT model, the policy must be a fresh copy of the same SFT model.
4. Length bias
DPO computes $\log\pi(y\mid x)$ by summing per-token log-probs across the response. Longer responses accumulate larger NEGATIVE log-probs purely because there are more tokens. If chosen responses are systematically longer than rejected ones (common in human preference data, where humans equate longer with more thorough), DPO will learn that LENGTH is the signal and the policy will emit increasingly verbose responses. Fix: balance length in the preference dataset, or use the length-normalised DPO variants (SimPO normalises by response length; IPO uses a different reward parameterisation).
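The effect of length on a summed log-prob is easy to see in isolation. The sketch below compares a summed and a per-token-averaged log-prob for two responses of identical per-token quality; the value of −0.8 nats per token and the lengths are made up, and the averaging shown is only the normalisation idea, not the full SimPO loss.

```python
import math

def summed_logprob(token_logps):
    """What DPO uses: total log-prob of the response."""
    return sum(token_logps)

def length_normalised_logprob(token_logps):
    """Per-token average, the kind of normalisation length-aware variants apply."""
    return sum(token_logps) / len(token_logps)

short_response = [-0.8] * 20      # 20 tokens, -0.8 nats each (illustrative)
long_response = [-0.8] * 60       # 60 tokens, same per-token quality

sum_short, sum_long = summed_logprob(short_response), summed_logprob(long_response)
avg_short = length_normalised_logprob(short_response)
avg_long = length_normalised_logprob(long_response)

print(f"summed:     short={sum_short:7.2f}  long={sum_long:7.2f}")
print(f"normalised: short={avg_short:7.2f}  long={avg_long:7.2f}")
```

The summed scores differ by 32 nats even though the model is equally confident per token, so any quantity built from sums carries response length along with it; the normalised scores are identical.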
5. β is the LR in disguise
For small batches and short training runs, the effect of lowering β is easy to mistake for the effect of raising LR during the first few hundred steps. Teams sweep both, confusing themselves about which dial mattered. Fix: fix LR at 5e-7 (standard for 70B DPO with AdamW), sweep β in {0.05, 0.1, 0.3}, and only sweep LR if β isn't moving the needle.
What every DPO dashboard plots: implicit reward chosen (should rise), implicit reward rejected (should fall), preference accuracy (should climb from ~0.55 toward 0.75–0.85), chosen log-prob in absolute terms (should NOT collapse), KL proxy to reference (should grow slowly). When something goes wrong, one of these five plots tells you which failure mode you hit before you even look at generations.
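The five dashboard quantities can all be derived from the same four per-pair log-prob arrays. The sketch below is one way to compute them; the text does not define the KL proxy, so the mean absolute log-ratio used here is an assumption, as are the function name and the sample numbers.

```python
import numpy as np

def dpo_dashboard(policy_w, policy_l, ref_w, ref_l, beta):
    """Batch-level DPO diagnostics from summed response log-probs."""
    policy_w, policy_l = np.asarray(policy_w), np.asarray(policy_l)
    ref_w, ref_l = np.asarray(ref_w), np.asarray(ref_l)
    ratio_w = policy_w - ref_w                # log-ratio of the chosen response
    ratio_l = policy_l - ref_l                # log-ratio of the rejected response
    reward_w, reward_l = beta * ratio_w, beta * ratio_l
    return {
        "reward_chosen": float(reward_w.mean()),          # should rise
        "reward_rejected": float(reward_l.mean()),        # should fall
        "pref_accuracy": float((reward_w > reward_l).mean()),
        "chosen_logp": float(policy_w.mean()),            # should NOT collapse
        # Assumed proxy: mean absolute log-ratio over both responses.
        "kl_proxy": float(np.abs(np.concatenate([ratio_w, ratio_l])).mean()),
    }

# Illustrative batch of three pairs; the third is currently ranked the wrong way.
metrics = dpo_dashboard(
    policy_w=[-3.5, -4.0, -5.0], policy_l=[-6.0, -5.5, -4.0],
    ref_w=[-4.0, -4.2, -4.5], ref_l=[-5.0, -5.0, -4.5], beta=0.1,
)
for name, value in metrics.items():
    print(f"{name:>16}: {value:+.4f}")
```

Logging these per step, rather than the loss alone, is what lets the chosen-log-prob collapse and sign-bug failure modes be told apart from healthy training.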
DPO is not the end of the story. The next chapter picks up where the rollout-free assumption breaks down: when reasoning quality starts to matter more than instruction-following or safety, on-policy methods come back. GRPO inherits the elegance of DPO — no critic, no value head — but adds the on-policy rollout that lets the model learn from its own current best behaviour.