Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: RLHF's Four-Model Tax

At the end of § 14.6 we left the standard RLHF pipeline in working order, but expensive. Every PPO step keeps four full transformer models in memory at once — the trainable policy, a value head bolted to the policy backbone, the frozen reference (SFT) model, and the frozen reward model. Each step does a rollout, a reward-model forward, a reference forward, several policy and value forwards, a joint backward, a gradient clip, and a step. The wall-clock split was 85% rollout, 15% optimisation. At 70B, the memory tax was ~2 TB before activations. The procedure works — it gave us InstructGPT, ChatGPT, Claude — but for academic groups and small teams it is brutal.

Now stand back and ask: what does PPO actually need to do? We started in § 14.2 with preference data — triples $(x, y_w, y_l)$ where $y_w$ is the response a human preferred over $y_l$ on prompt $x$ . We trained a reward model to score responses. We then used PPO to push the policy in the direction of higher reward, with a KL penalty to the SFT model so the policy stays in the region of human-like text. Three steps: preferences → reward model → RL.

Each of those steps loses information. The reward model compresses a rich preference distribution into a scalar. PPO compresses that scalar back into a stochastic gradient via rollouts. Reward hacking, KL drift, sampling variance — every pathology of RLHF lives somewhere in those translation layers. The natural question is: can we skip the middle steps and train directly from the preferences?

The 2023 DPO paper by Rafailov et al. answered yes — and proved it from the SAME mathematical objective PPO solves. The result is a loss function that is one line of PyTorch, needs no rollout, no reward model, no value head, no critic, and no KL coefficient as a separate term. It is just a supervised classification loss on preference pairs. The same human data, the same KL-regularised objective, but a different parameterisation that makes the whole machinery collapse.

What DPO removes from the RLHF pipeline

No reward model. The implicit reward is read off the policy itself.
No rollout. The loss is computed from offline preference pairs, just like SFT.
No value head, no critic, no GAE. There is no sequential decision problem to estimate returns for.
No clip, no entropy bonus. The loss is convex in the implicit reward and the gradient self-attenuates at high margin.

What remains is one scalar hyperparameter, β, and a frozen reference model.

Intuition: A Loss Built From Preferences, Not Rewards

Forget the math for a moment. A human looked at two responses and said the first was better. What does that single judgement actually tell us about how the policy should change? Two things:

The policy should assign more probability to the chosen response $y_w$ than it currently does.
The policy should assign less probability to the rejected response $y_l$ than it currently does.

That is the entire content of one preference label. PPO uses this signal through three translation layers: train a reward model on many labels, sample responses, score them, use the score as a policy-gradient signal. DPO uses it directly: write a loss whose gradient pushes the policy probability of $y_w$ up and the policy probability of $y_l$ down, relative to a frozen reference. The reference anchor is what stops the model from simply assigning probability 1 to whichever response the human happened to pick — it forces the policy to tilt, not to overwrite.

A useful physical analogy: imagine the probability mass over all possible responses as a pile of sand. The reference model says where the SFT pile currently is. The human preference is a single thumb-tap that says: take some sand from this rejected response and move it onto this chosen response. After thousands of taps, the pile has tilted in the direction the humans collectively prefer, but it never crumbles into a single delta — because every tap is relative to the reference and the loss bottoms out as soon as the tilt is “enough”.

Mental model in one line: DPO is logistic regression on the LOG-RATIO of two response probabilities. The two classes are “the human preferred A” vs “the human preferred B”; the features are the implicit rewards $\beta \log (\pi_\theta / \pi_{\text{ref}})$ of each response. That is the entire algorithm.

The Bridge from PPO to DPO

DPO is not a heuristic — it is the closed-form solution to the same problem PPO solves, plugged into the Bradley-Terry preference model from § 14.2. Three steps connect them.

Step 1: The KL-regularised reward objective

Both PPO and DPO maximise the same objective: expected reward minus a KL penalty to the reference.

$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, \; y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)$

Here $r(x, y)$ is some scalar reward (in PPO, the output of the reward model). $\beta > 0$ is the KL temperature. PPO solves this with stochastic gradients and sampling; DPO solves it in closed form.

Step 2: The closed-form optimum

Standard variational calculus on the KL-constrained objective gives the optimal policy at any state:

$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\left( \frac{1}{\beta} \, r(x, y) \right)$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)$ is the partition function. The optimal policy is the reference, tilted by an exponential factor proportional to the reward. Now invert this equation: given the optimal policy $\pi^*$ and the reference, what reward produced it?

$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$

This is the central identity: any reward function corresponds to some optimal policy, and vice versa. The reward is recoverable from the optimal policy — modulo a prompt-dependent constant $\beta \log Z(x)$ . Call the first term the implicit reward:

$\hat r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$

Step 3: Bradley-Terry on implicit rewards

The Bradley-Terry model (§ 14.2) says the probability the human prefers $y_w$ over $y_l$ on prompt $x$ is the sigmoid of the reward difference:

$P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)$

Plug in the recovered reward, with the trainable policy $\pi_\theta$ standing in for the optimal policy $\pi^*$. The prompt-dependent constants $\beta \log Z(x)$ appear in BOTH the chosen and rejected terms and cancel in the difference. What remains is purely in terms of the policy and the reference:

$P(y_w \succ y_l \mid x) \;=\; \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)$

This is a probability the policy itself implies. Take the negative log-likelihood of the observed preferences. That negative log-likelihood is the DPO loss.
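A small numeric check can make Steps 2 and 3 concrete. The sketch below assumes a hypothetical prompt with three candidate responses and made-up values for `pi_ref`, `reward`, and `beta`; none of these numbers come from the text. It builds the tilted policy, recovers the reward from it, and confirms that the Bradley-Terry probability comes out the same whether it is computed from the true rewards or from the implicit rewards.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy problem: one prompt, three candidate responses.
pi_ref = [0.5, 0.3, 0.2]     # reference probabilities, sum to 1
reward = [1.0, 0.0, -1.0]    # made-up scalar rewards r(x, y)
beta = 0.5                   # KL temperature

# Step 2: pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Inversion: implicit reward = beta * log(pi*/pi_ref) = r - beta * log Z
implicit = [beta * math.log(ps / p) for ps, p in zip(pi_star, pi_ref)]
recovered = [ir + beta * math.log(Z) for ir in implicit]

# Step 3: Bradley-Terry with chosen = response 0, rejected = response 2
w, l = 0, 2
p_from_reward = sigmoid(reward[w] - reward[l])
p_from_policy = sigmoid(implicit[w] - implicit[l])

print("pi*       :", [round(p, 4) for p in pi_star])
print("recovered :", [round(r, 4) for r in recovered])
print("P(w > l)  :", round(p_from_reward, 6), round(p_from_policy, 6))
```

The two preference probabilities agree because the shared offset `beta * log(Z)` drops out of the difference, which is the cancellation the derivation relies on.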

The DPO Objective

Putting it together, the DPO loss on a preference dataset $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}$ is:

$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) \;=\; - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \; \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)$

Every symbol:

$\pi_\theta$ is the trainable policy (initialised from the SFT model). Only its parameters get gradients.
$\pi_{\text{ref}}$ is the frozen reference model — almost always the same SFT model used to initialise the policy. It defines “the prior” the loss measures deviation from.
$y_w, y_l$ are the chosen and rejected responses; $x$ is the prompt.
$\pi(y \mid x)$ is the probability the model assigns to the FULL response — the product of per-token probabilities $\prod_t \pi(y_t \mid x, y_{<t})$ . In log space this is a sum of per-token log-probabilities, which is what we actually compute.
$\beta > 0$ is the KL temperature — the single tuneable hyperparameter.
$\sigma(\cdot)$ is the logistic sigmoid; $-\log \sigma(\cdot)$ is the standard binary cross-entropy loss for a Bernoulli with one observed positive label per pair.
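
The log-space point in the list above is easy to see numerically. The sketch below uses made-up per-token values (a hypothetical 2000-token response with every token at probability 0.5, plus a short masked example) to show why trainers sum log-probabilities instead of multiplying probabilities.

```python
import math

# Hypothetical long response: 2000 tokens, each with probability 0.5.
token_probs = [0.5] * 2000

# Multiplying raw probabilities underflows to exactly 0.0 in float64.
prob_product = 1.0
for p in token_probs:
    prob_product *= p

# Summing per-token log-probs stays finite and well-scaled.
logprob_sum = sum(math.log(p) for p in token_probs)

# Response-only sum: the mask is 0 on prompt tokens, 1 on response tokens.
token_logps = [-0.7, -1.2, -0.3, -2.0, -0.5]   # made-up values
response_mask = [0, 0, 1, 1, 1]
response_logp = sum(lp * m for lp, m in zip(token_logps, response_mask))

print(prob_product)     # the product has underflowed
print(logprob_sum)      # about -1386.29
print(response_logp)    # only the three response tokens contribute
```

The masked sum is the scalar that enters the loss as $\log \pi(y \mid x)$; the prompt tokens are deliberately excluded.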

The gradient with respect to $\theta$ :

$\nabla_\theta \mathcal{L}_{\text{DPO}} \;=\; - \beta \, \mathbb{E} \!\left[ \sigma\big( \hat r_\theta(x, y_l) - \hat r_\theta(x, y_w) \big) \; \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \right]$

Read this carefully. The gradient is a per-pair coefficient $\sigma(\hat r_l - \hat r_w)$ times the difference of two log-likelihood gradients. The coefficient is large when the policy currently RANKS THE WRONG RESPONSE HIGHER (rejected reward greater than chosen) and small when it ranks them the right way. The two log-likelihood gradients are exactly the SFT gradients you would compute to teach the model to produce $y_w$ and to AVOID $y_l$ , scaled and combined automatically.

The one sentence to remember: DPO is SFT on the chosen response, minus SFT on the rejected response, weighted by how confidently the model is currently getting the ranking wrong.
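
The gradient formula can be checked numerically on a toy problem. The sketch below assumes a hypothetical policy that is just a softmax over three candidate responses with free logits `theta` (a stand-in for a real language model), a uniform reference, and arbitrary values for `theta` and `beta`. It compares the analytic gradient from the formula above against central finite differences.

```python
import math

def log_softmax(theta):
    mx = max(theta)
    lse = mx + math.log(sum(math.exp(t - mx) for t in theta))
    return [t - lse for t in theta]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(theta, log_ref, w, l, beta):
    logp = log_softmax(theta)
    m = beta * ((logp[w] - log_ref[w]) - (logp[l] - log_ref[l]))
    loss = max(-m, 0.0) + math.log1p(math.exp(-abs(m)))   # -log sigmoid(m)
    return loss, m

theta = [0.2, -0.1, 0.4]                 # toy policy logits (assumed)
log_ref = log_softmax([0.0, 0.0, 0.0])   # uniform reference (assumed)
w, l, beta = 0, 2, 0.5                   # chosen index, rejected index, beta

loss, m = dpo_loss(theta, log_ref, w, l, beta)

# Analytic: -beta * sigmoid(-m) * (grad log pi(y_w) - grad log pi(y_l)).
# For a softmax, grad log pi(y) = onehot(y) - pi, so the pi terms cancel.
coeff = -beta * sigmoid(-m)
analytic = [coeff * ((i == w) - (i == l)) for i in range(len(theta))]

# Central finite differences on each logit.
eps = 1e-6
numeric = []
for i in range(len(theta)):
    up = list(theta); up[i] += eps
    dn = list(theta); dn[i] -= eps
    numeric.append((dpo_loss(up, log_ref, w, l, beta)[0]
                    - dpo_loss(dn, log_ref, w, l, beta)[0]) / (2 * eps))

print("margin   :", round(m, 4))
print("analytic :", [round(g, 6) for g in analytic])
print("numeric  :", [round(g, 6) for g in numeric])
```

A gradient-descent step moves against this gradient, so the chosen logit rises and the rejected logit falls by the same amount, each scaled by $\beta \, \sigma(-m)$.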

The Implicit Reward and the β Knob

The quantity $\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward DPO assigns to a response. It is not the reward a separate reward model would produce; it is the reward function the policy and reference TOGETHER ENCODE, through the closed-form identity from Step 2 above. Two consequences worth internalising:

Implicit rewards drift; absolute rewards are meaningless

Only DIFFERENCES of implicit rewards matter — for the same prompt the $\beta \log Z(x)$ cancels. If you log $\hat r_\theta(x, y_w)$ across training, you will see it grow without bound for some prompts and decay for others, and that is fine. What you watch is the MARGIN $\hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)$ averaged across the batch. That is the implicit-reward signal.
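
A quick sketch of why absolute implicit rewards carry no signal for the loss: the two hypothetical training snapshots below (all log-probabilities are illustrative numbers) differ by a 10-nat downward shift of both policy log-probs, yet produce the same margin and the same loss.

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta):
    r_w = beta * (logp_w - ref_w)          # implicit reward, chosen
    r_l = beta * (logp_l - ref_l)          # implicit reward, rejected
    m = r_w - r_l
    loss = max(-m, 0.0) + math.log1p(math.exp(-abs(m)))   # -log sigmoid(m)
    return loss, m, r_w, r_l

beta, ref_w, ref_l = 0.5, -4.0, -5.0       # illustrative values

# Snapshot A, and snapshot B with both policy log-probs 10 nats lower.
loss_a, m_a, rw_a, rl_a = dpo_loss(-3.5, -6.0, ref_w, ref_l, beta)
loss_b, m_b, rw_b, rl_b = dpo_loss(-13.5, -16.0, ref_w, ref_l, beta)

print(f"A: r_w={rw_a:+.2f}  r_l={rl_a:+.2f}  margin={m_a:+.2f}  loss={loss_a:.4f}")
print(f"B: r_w={rw_b:+.2f}  r_l={rl_b:+.2f}  margin={m_b:+.2f}  loss={loss_b:.4f}")
```

Because the loss cannot tell A from B, the individual implicit rewards are worth logging alongside the margin: a run where both drift downward together looks fine in the loss.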

β controls how hard the policy is pulled away from the reference

Small β makes the loss aggressive: the log-ratio gap between chosen and rejected must be $1/\beta$ times larger to produce the same margin, and hence the same loss reduction, so the policy drifts more. Large β makes the loss conservative: tiny log-ratio shifts suffice, so the policy hugs the reference. The empirical table from real production runs:

β	Behaviour	Typical use case
0.01 – 0.05	Aggressive drift, large reward margins	When SFT is far from desired behaviour
0.1 (most common)	Balanced drift, ~75–85% pref accuracy	Tulu-3, Llama-3 post-training, Zephyr
0.3 – 0.5	Conservative, hugs SFT	Safety-focused or small preference datasets
1.0+	Almost identity (no learning)	Mostly a sanity-check baseline

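β interacts with learning rate, but not as a simple rescaling. The gradient carries an explicit factor of β, so near $m = 0$ halving β halves the raw gradient; adaptive optimisers such as Adam largely normalise that factor away. What a smaller β really changes is saturation: the coefficient $\sigma(-m)$ fades more slowly, so each pair keeps pushing for longer and the policy ends up further from the reference. In practice teams sweep β in {0.05, 0.1, 0.3} at fixed LR, then sweep LR at the winning β.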
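
To put numbers on the β table above, the sketch below holds the log-ratio gap fixed at 1.5 nats (an arbitrary illustrative value) and evaluates the margin, the loss, the per-pair weight $\sigma(-m)$, and the full gradient factor $\beta \, \sigma(-m)$ from the gradient formula for several β values.

```python
import math

def softplus(z):
    # Stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

gap = 1.5   # assumed log-ratio gap: logratio_chosen - logratio_rejected

rows = []
for beta in (0.05, 0.1, 0.3, 0.5, 1.0):
    m = beta * gap                  # implicit reward margin
    loss = softplus(-m)             # -log sigmoid(m)
    weight = sigmoid(-m)            # per-pair gradient weight
    grad_factor = beta * weight     # scalar in front of the log-prob gradients
    rows.append((beta, m, loss, weight, grad_factor))
    print(f"beta={beta:<4}  margin={m:.3f}  loss={loss:.4f}  "
          f"sigma(-m)={weight:.4f}  beta*sigma(-m)={grad_factor:.4f}")
```

For the same gap, a smaller β gives a smaller margin, a loss closer to $\log 2$, and a per-pair weight closer to 0.5, so the pair counts as unsolved for longer.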

The DPO Loss Landscape

The visualiser shows three coupled quantities for a single preference pair, all on the same x-axis: the implicit reward margin $m = \hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)$. The loss curve, the preference probability, and the gradient strength are all the same sigmoid in disguise; what changes is which one you are looking at.

[Interactive figure: DPO loss visualiser. Loss, preference probability, and gradient strength plotted against the margin $m$.]

Three observations to anchor:

At $m = 0$ the loss is $-\log \sigma(0) = \log 2 \approx 0.693$ — the “random preference” baseline. A model with implicit-reward accuracy of 50% sits here.
As $m \to +\infty$ the loss approaches zero exponentially, AND the gradient strength $\sigma(-m)$ approaches zero — the pair stops contributing to learning once the model is confidently correct. This is the built-in safety brake.
As $m \to -\infty$ the loss grows linearly and the gradient strength saturates at 1. A confidently-wrong pair gets the full gradient signal.

The asymmetry is the punchline: the loss is bounded on the correct side, unbounded on the wrong side. The optimiser never wastes capacity on already-correct preferences and is always sensitive to the wrong ones — for free, no clip needed.
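
The three observations can be checked directly. The sketch below is a plain-Python evaluation of the loss and the gradient strength at a few margins chosen for illustration; the function names are hypothetical helpers, not part of any library.

```python
import math

def dpo_loss(m):
    # -log sigmoid(m) = softplus(-m), written to avoid overflow
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

def grad_strength(m):
    # sigmoid(-m), computed without overflow for either sign of m
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))

for m in (-8.0, -3.0, 0.0, 3.0, 8.0):
    print(f"m={m:+5.1f}  loss={dpo_loss(m):.6f}  "
          f"P(correct)={1.0 - grad_strength(m):.6f}  grad={grad_strength(m):.6f}")

# Asymptotes: loss ~ exp(-m) on the correct side, loss ~ -m on the wrong side.
print(dpo_loss(8.0), math.exp(-8.0))
print(dpo_loss(-8.0), 8.0)
```

At $m = 8$ the loss and $e^{-8}$ agree to about seven decimal places, while at $m = -8$ the loss sits just above 8 with the gradient strength essentially at 1.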

Manual Numerical Walkthrough: One DPO Update

Concrete numbers, computed by hand. Imagine one preference pair from an Anthropic HH-style dataset:

Quantity	Value	Source
log π_θ(y_w | x)	−3.50	policy forward (gradient flows)
log π_θ(y_l | x)	−6.00	policy forward (gradient flows)
log π_ref(y_w | x)	−4.00	frozen SFT forward (no grad)
log π_ref(y_l | x)	−5.00	frozen SFT forward (no grad)
β	0.5	hyperparameter

These are SUMS of per-token log-probs across the response — each scalar in the table compresses, say, 60 tokens of response into one number. The chosen response is about $e^{2.5} \approx 12.2\times$ more likely under the policy than the rejected one; under the reference it's about $e^{1.0} \approx 2.7\times$ . So the policy has already started to tilt in the right direction, but not by much.

Log-ratios

$\text{logratio}_w = \log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x) = -3.50 - (-4.00) = +0.500$

$\text{logratio}_l = \log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x) = -6.00 - (-5.00) = -1.000$

Policy already moved the chosen response up by 0.5 nats over the reference, and the rejected response DOWN by 1.0 nats. Both shifts are in the human-preferred direction — encouraging.

Implicit rewards

$\hat r_\theta(x, y_w) = \beta \cdot \text{logratio}_w = 0.5 \cdot 0.500 = +0.250$

$\hat r_\theta(x, y_l) = \beta \cdot \text{logratio}_l = 0.5 \cdot (-1.000) = -0.500$

Reward margin: $+0.250 - (-0.500) = +0.750$ .

Loss

$m = 0.750$; $\sigma(0.750) \approx 0.6792$; $\mathcal{L} = -\log \sigma(0.750) \approx 0.3869$.

For reference, a random initial pair (margin ≈ 0) would have loss ≈ 0.693. So this pair is already “easy” — the model produces the preference correctly with about 68% confidence.

Gradient strength

$\frac{d\mathcal{L}}{dm} = -\sigma(-m) = -\sigma(-0.750) \approx -0.321$

Roughly 32% of the maximum possible gradient signal still flows through this pair. By the time the model is at $m = 3.0$ (highly confident), $\sigma(-3) \approx 0.047$ — only 5% of the max signal. The pair is fading from the optimiser's attention.

What one step does

The actual parameter update is the per-pair scalar −0.321 times $\beta = 0.5$ times the difference $\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)$ . That gradient is what SFT would produce if you trained on $y_w$ as positive and $y_l$ as “anti-positive” (an unweighted version of unlikelihood training). The signal is weighted by ~0.16 (0.321 × 0.5) and added to whatever the rest of the batch contributes.

The whole algorithm in one paragraph

Look at the preference pair. Compute the sum of log-probs of the chosen and rejected responses under the live policy and the reference. Take a difference of differences, scaled by β. Apply $-\log \sigma$. Backprop. Step. Repeat. No rollouts. No reward model. No critic. The single dial is β.

Plain Python: DPO Loss from Scratch

We re-implement DPO on the same single pair in pure NumPy. No autograd, no transformer: just the loss arithmetic on the scalars a real trainer ends up consuming after the per-token log-prob sums. Swap the hard-coded log-probs for sums over a transformer's output log-softmax and the loss arithmetic is the same computation that trl, OpenRLHF, and Tulu perform.

DPO loss on one preference pair — pure NumPy

🐍dpo_loss_plain.py

The four scalars, and what they hide

📚 Every DPO loss in the literature is, after summation, this four-scalar computation. log_pi_chosen is the sum over all response tokens of log pi_theta(token | prefix) — a single number per (prompt, response) pair. The same is computed for the rejected response, and twice more under a FROZEN reference policy (almost always the SFT model). The two pi_theta values are the only ones with gradient; the two ref values are constants. The simplicity is the whole pitch of DPO: a transformer forward pass × 4 evaluations, no sampling, no rollout, no reward model.

EXAMPLE

log_pi_chosen = -3.50; log_pi_rejected = -6.00; log_ref_chosen = -4.00; log_ref_rejected = -5.00; beta = 0.5

Beta: the KL temperature, the only knob DPO really has

📚 Beta controls how hard DPO pulls the policy away from the reference. The mathematical derivation makes this exact: DPO is the closed-form optimum of the SAME KL-regularised reward maximisation problem PPO solves, with beta as the KL coefficient. Small beta (0.01–0.1) lets the policy move boldly toward the chosen response; large beta (0.3–1.0) clamps it tightly to the reference. Llama-3 and Tulu-3 use beta around 0.1; the original DPO paper used 0.1–0.5. There is no universal best value — it is the one experimental knob every DPO recipe sweeps.

Log-ratios: what the model has done to each response

📚 log (pi_theta / pi_ref) = log pi_theta − log pi_ref. A POSITIVE log-ratio means the trained policy assigns the response MORE probability than the reference; a NEGATIVE log-ratio means LESS. By itself this is just the per-response shift; the loss only cares about the GAP between the chosen and rejected shifts. Note the sign discipline: if you flip chosen and rejected in either subtraction the loss inverts and the model trains in the wrong direction. That sign bug is the single most common DPO regression.

EXAMPLE

logratio_chosen   = -3.50 − (−4.00) = +0.500
logratio_rejected = -6.00 − (−5.00) = −1.000

The implicit reward margin: the entire signal DPO uses

📚 margin = beta · (chosen log-ratio − rejected log-ratio). This is the scalar version of the implicit reward margin discussed in §6 above. POSITIVE margin = the policy has tilted toward the chosen response relative to the ref; NEGATIVE margin = the policy is going the wrong way and will be heavily penalised. Note that prompt-dependent terms (the log Z(x) partition function from the derivation) CANCEL in this subtraction — that cancellation is the algebraic miracle that lets DPO sidestep the reward model entirely.

EXAMPLE

margin = 0.5 × (0.500 − (−1.000)) = 0.5 × 1.500 = 0.750

Implicit rewards: diagnostic, not used in the loss

📚 r_hat = beta · log(pi_theta / pi_ref) is the implicit reward DPO infers without a reward model. The DPO paper proves that for the KL-regularised optimum, the optimal policy's IMPLICIT reward (beta · log-ratio) is the actual reward up to a prompt-dependent constant. Trainers log these scalars at every step: a healthy DPO run shows r_hat_chosen rising and r_hat_rejected falling. If both fall together, the policy is collapsing — the reward gap is being preserved by the policy simply UNLEARNING both responses.

EXAMPLE

implicit_reward_w = 0.5 × 0.500 = +0.250; implicit_reward_l = 0.5 × (−1.000) = −0.500

log-sigmoid via softplus: the numerically stable form

📚 −log sigmoid(m) = log(1 + exp(−m)) = softplus(−m). Computing log(sigmoid(m)) directly underflows for large negative m (sigmoid → 0, log → −inf). np.logaddexp(0, −m) computes log(1 + exp(−m)) using the LSE trick — stable for both signs. In PyTorch use F.logsigmoid; never roll your own log(sigmoid(...)). This is the same numerical care PPO and reward-model losses also use.

The loss: one scalar per preference pair

📚 dpo_loss = -log sigmoid(margin). As margin → +∞ the loss approaches zero (the model is confidently doing the right thing). At margin = 0 the loss is log(2) ≈ 0.693 (the model is exactly indifferent). As margin → −∞ the loss grows linearly (the model is confidently doing the wrong thing, and the gradient coefficient saturates at its maximum of 1). This unbounded penalty for confidently-wrong pairs is what makes DPO converge: bad predictions cannot escape the gradient.

EXAMPLE

dpo_loss = -log_sigmoid(0.750) ≈ 0.387

Gradient scale: the SAFETY BRAKE built into DPO

📚 The gradient of -log sigmoid(m) with respect to m is -sigmoid(-m). That single factor multiplies every parameter gradient. For a confident-correct pair (large positive m) the factor is near zero — the pair contributes essentially nothing. For a confident-wrong pair (large negative m) it approaches 1. For an uncertain pair (m near 0) it is near 0.5. DPO automatically focuses learning on the pairs that are still wrong. This is the same trick that makes logistic regression robust — the gradient self-attenuates near correct examples.

EXAMPLE

grad_scale = sigmoid(-0.750) ≈ 0.321 — pair still contributes substantial gradient

"""
DPO loss on a SINGLE preference pair, pure NumPy.

Given a triple (x, y_w, y_l) where y_w is the response the human preferred
and y_l is the one they rejected, DPO computes the loss directly from
log-probabilities of both responses under the trained policy and a
frozen reference model. No reward model. No critic. No rollout.

We use scalar log-probs of full responses, exactly the way a real DPO
trainer ends up reducing it after summing per-token log-probs.
"""

import numpy as np

# ---------------------------------------------------------------------------
# 1. The four scalars DPO needs for one preference pair.
#    These are the SUM of per-token log pi(token | context) across the
#    response, i.e. log pi(response | prompt). Each scalar comes from one
#    forward pass: the live policy and the frozen reference model are each
#    evaluated on (chosen) and on (rejected), four passes in total.
# ---------------------------------------------------------------------------

log_pi_chosen     = -3.50    # log pi_theta (y_w | x)   -- grad flows
log_pi_rejected   = -6.00    # log pi_theta (y_l | x)   -- grad flows
log_ref_chosen    = -4.00    # log pi_ref   (y_w | x)   -- frozen
log_ref_rejected  = -5.00    # log pi_ref   (y_l | x)   -- frozen

# Beta: the KL temperature. Smaller beta = more freedom to drift from ref,
# larger beta = stays closer to ref. Typical range 0.05 – 0.5.
beta = 0.5

# ---------------------------------------------------------------------------
# 2. Log-ratios. log (pi_theta / pi_ref) is the per-response signal: it tells
#    us how much MORE (or less) the trained policy likes this response
#    relative to the frozen reference.
# ---------------------------------------------------------------------------

logratio_chosen   = log_pi_chosen   - log_ref_chosen
logratio_rejected = log_pi_rejected - log_ref_rejected

# ---------------------------------------------------------------------------
# 3. The implicit reward margin. This is the single scalar the DPO loss
#    consumes. Positive margin = policy has shifted probability mass FROM
#    rejected TO chosen relative to the reference, which is exactly what
#    the human preference asks for. Negative margin = the policy is
#    moving the WRONG way.
# ---------------------------------------------------------------------------

margin = beta * (logratio_chosen - logratio_rejected)

# Implicit rewards (for diagnostics only, not used in the loss directly).
implicit_reward_chosen   = beta * logratio_chosen
implicit_reward_rejected = beta * logratio_rejected

# ---------------------------------------------------------------------------
# 4. The DPO loss: negative log-sigmoid of the margin.
#    Equivalent to: -log P(human prefers chosen | implicit rewards).
#    softplus(-m) = -log sigmoid(m), used for numerical stability.
# ---------------------------------------------------------------------------

# log sigmoid via softplus: log_sigmoid(m) = -softplus(-m)
def log_sigmoid(m):
    # np.logaddexp(0, -m) evaluates log(1 + exp(-m)) stably for either sign
    return -np.logaddexp(0.0, -m)

dpo_loss = -log_sigmoid(margin)

# ---------------------------------------------------------------------------
# 5. The gradient strength scalar: how much this pair will push the policy.
#    d/d(margin) [ -log sigmoid(margin) ] = -sigmoid(-margin)
#    i.e. confident pairs (large positive margin) contribute almost no
#    gradient; uncertain or wrong pairs (margin <= 0) contribute the most.
# ---------------------------------------------------------------------------

grad_scale = 1.0 / (1.0 + np.exp(margin))   # = sigmoid(-margin)

print(f"logratio_chosen     = {logratio_chosen:+.3f}")
print(f"logratio_rejected   = {logratio_rejected:+.3f}")
print(f"implicit_reward_w   = {implicit_reward_chosen:+.3f}")
print(f"implicit_reward_l   = {implicit_reward_rejected:+.3f}")
print(f"margin              = {margin:+.3f}")
print(f"dpo_loss            = {dpo_loss:.4f}")
print(f"grad_scale (sig(-m))= {grad_scale:.4f}")
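
The stability remark about log-sigmoid can be demonstrated without NumPy. The sketch below is a standard-library illustration with hypothetical helper names: the naive form fails for a very negative margin, while the softplus form returns the expected value. In pure Python the naive version raises `OverflowError`; with NumPy arrays the same expression would typically produce `-inf` plus runtime warnings instead.

```python
import math

def log_sigmoid_naive(m):
    # Direct translation of log(sigmoid(m)); breaks for very negative m.
    return math.log(1.0 / (1.0 + math.exp(-m)))

def log_sigmoid_stable(m):
    # log sigmoid(m) = -softplus(-m) = -(max(-m, 0) + log1p(exp(-|m|)))
    return -(max(-m, 0.0) + math.log1p(math.exp(-abs(m))))

for m in (0.75, -30.0, -800.0):
    try:
        naive = f"{log_sigmoid_naive(m):.6f}"
    except (OverflowError, ValueError) as err:
        naive = f"failed ({type(err).__name__})"
    print(f"m={m:+8.2f}  naive={naive:>24}  stable={log_sigmoid_stable(m):.6f}")
```

A margin of −800 is far outside anything a healthy run produces, but a single corrupted batch is enough to hit it, which is why the stable form is the safer default.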

PyTorch: A Production-Shaped DPO Step

The real thing: one forward pass per response per model (four in total), a masked sum of per-token log-probs, batch-level mean of $-\log \sigma$, gradient clip, and the diagnostics that an RLHF engineer actually watches. Drop in any HF causal LM for policy and a frozen copy of the SFT model for ref_policy and this trains a real DPO run.

DPO step: two models, two forward passes per model, no rollout

🐍dpo_step_torch.py

The three hyperparameters, and how few there are

📚 DPO's whole appeal: BETA is the only setting that materially changes behaviour. LABEL_SM is the cDPO (conservative DPO) extension that accounts for label noise; set it to e.g. 0.1 when human annotators disagree on ~10% of pairs. MAX_GRAD=1.0 is the same clip-norm default as in PPO. Compare to PPO: no clip epsilon, no value-loss coefficient, no entropy bonus, no separate KL coefficient. The lone knob is the appeal AND the trap: many DPO regressions in the wild trace back to a beta misconfiguration.

Two models, one optimizer: the resource savings vs. PPO

📚 PPO needed FOUR models (policy, value head, reference, reward model). DPO needs TWO: the trainable policy and the frozen reference. The reward model is replaced by the implicit reward derived algebraically from the policy and reference; the value head is gone because there is no rollout to estimate returns for. At 70B parameters this is the difference between needing 8×H100 and 16×H100 for a single training node — the primary engineering reason DPO became the default for academic and small-team post-training in 2024.

Per-response log-probs under the LIVE policy

📚 response_logprob (defined below) returns a (B,) tensor — one scalar per pair, equal to the SUM of per-token log-probs over the response tokens only. We call it twice: once on the chosen-completion batch and once on the rejected-completion batch. Each call is a full transformer forward — there is no way to share the forward between chosen and rejected, since they are different token sequences. This 2× cost vs. SFT is the dominant compute factor of DPO; budget accordingly.

EXAMPLE

pol_logp_w.shape = (B,); each scalar ≈ sum of S_response log-probs, typically −20 to −200 nats

The FROZEN reference under torch.no_grad

📚 The reference model is the SFT model (or a previous DPO iterate for iterative DPO). Critically, it is wrapped in torch.no_grad to avoid accumulating activations for backward. Without no_grad, every parameter of the reference would consume gradient memory equal to the policy's — doubling the training memory and gaining nothing. ref_logp_w and ref_logp_l are scalars per pair and can be CACHED across epochs of DPO on a fixed preference set — the reference doesn't change, so its log-probs are constants of the dataset, not of the optimisation step. Production trainers pre-compute these once and skip the reference forward entirely.

EXAMPLE

Caching ref log-probs on a fixed preference dataset removes the reference forward passes (half of all forward passes per step) and lets the reference model leave GPU memory

The four scalars combined into a margin

📚 Element-wise subtraction over the batch. logratio_w − logratio_l = (log pi_theta(yw) − log pi_ref(yw)) − (log pi_theta(yl) − log pi_ref(yl)). Multiply by BETA. The margin tensor has shape (B,) — one scalar per preference pair. Note this is the SAME margin m the visualizer plots; the (B,) tensor just stacks one m per pair so we can take a batch mean of the loss.

EXAMPLE

margin tensor: typical values at start of training [-0.5, 0.2, 0.1, ...]; at convergence [+2.0, +3.5, ...]

logsigmoid: never log(sigmoid(x))

📚 F.logsigmoid is implemented as the numerically stable softplus identity log_sigmoid(x) = -softplus(-x). Computing log(sigmoid(x)) yourself underflows for large negative x: sigmoid(-30) ≈ 9e-14 already flushes to 0 in fp16, so log → -inf, and fp32 fails the same way for more extreme inputs (x below roughly −104). The .mean() at the end is the batch reduction: one scalar loss for backward. cDPO mixes in a label-smoothed term that pulls the loss toward indifference at high margin, useful when 10–20% of pairs are mislabelled.

Backward: only ONE model gets gradients

📚 loss.backward() populates .grad only on policy parameters; ref_policy was wrapped in no_grad and has no autograd tape. clip_grad_norm_ rescales those gradients to L2 norm ≤ MAX_GRAD = 1.0. DPO is famously stable, but a single bad batch with a huge implicit-reward gap can still produce a 100× larger gradient than usual — the clip is the safeguard. Skipping it eventually NaNs on a long run.

Diagnostics: what you actually watch during a DPO run

📚 implicit_reward_w should TREND UP, implicit_reward_l should TREND DOWN. If BOTH trend down, the policy is collapsing (unlearning both responses); raise beta or lower LR. accuracy is the fraction of pairs where r_hat_w > r_hat_l — the implicit-reward classification accuracy on the train set. A healthy run starts near 55% and climbs into the 70–85% range. accuracy stuck at 50% = the model is not learning (often a sign bug or LR too low). kl_proxy is a cheap upper bound on the policy–reference KL — it should grow slowly; runaway growth means beta is too small.

EXAMPLE

Healthy step: rew_chosen=+0.18, rew_rejected=-0.12, accuracy=0.72, kl_proxy=0.35

response_logprob: the helper that defines DPO's compute cost

📚 Standard causal-LM bookkeeping: logits at position t predict the token at position t+1, so we slice logits[:, :-1] vs targets[:, 1:]. log_softmax + gather is the same pattern as in PPO and reward-model code — collect the log-prob of the actually-observed next token at every position. response_mask zeros out PROMPT and PAD positions so the sum is over response tokens only. Forgetting that mask is the third most common DPO bug after sign errors and missing no_grad on the reference — it makes the loss include log-probs of the prompt under the policy, which is degenerate (the model has no agency over the prompt).

EXAMPLE

input_ids.shape=(B,S); logits.shape=(B,S-1,V); response_mask sums to ~50 per row, summing ~50 log-probs into one scalar
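
The cDPO label-smoothing term mentioned in the annotations above (the LABEL_SM setting) changes the shape of the loss in a way that is easy to check by hand. The sketch below is a plain-Python illustration with hypothetical helper names and an assumed flip rate of 0.1: the smoothed loss has its minimum at a finite margin, $m^* = \log\frac{1-\varepsilon}{\varepsilon}$, instead of decreasing forever.

```python
import math

def log_sigmoid(m):
    # Stable log sigmoid(m) = -softplus(-m)
    return -(max(-m, 0.0) + math.log1p(math.exp(-abs(m))))

def cdpo_loss(m, eps):
    # Same form as the label-smoothed branch: eps is the assumed flip rate.
    return -(1.0 - eps) * log_sigmoid(m) - eps * log_sigmoid(-m)

eps = 0.1
# Setting d(loss)/dm = 0 gives sigmoid(m) = 1 - eps.
m_star = math.log((1.0 - eps) / eps)

for m in (0.0, 1.0, m_star, 4.0, 10.0):
    print(f"m={m:6.3f}  dpo={-log_sigmoid(m):.4f}  cdpo={cdpo_loss(m, eps):.4f}")
```

With smoothing switched on, pushing the margin past $m^*$ increases the loss again, so the policy is no longer rewarded for becoming arbitrarily confident on pairs that may be mislabelled.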

"""
PyTorch DPO step: the inner loop of trl, OpenRLHF, Tulu, Llama-3 post-training.

Two models live in memory: the trainable policy and the frozen reference.
At full scale, both are FSDP-wrapped 70B models. The four scalars per
preference pair are computed from four forward passes, two per model:
   policy(x || y_w),  policy(x || y_l)        --- gradient flows
   ref   (x || y_w),  ref   (x || y_l)        --- no_grad
"""

import torch
import torch.nn.functional as F

BETA      = 0.1     # KL temperature
LABEL_SM  = 0.0     # cDPO label smoothing; 0.0 disables it
MAX_GRAD  = 1.0     # global grad-norm clip

def dpo_step(batch, policy, ref_policy, optimizer):
    """
    batch fields (all already on device):
      chosen_ids      : (B, S_c)    prompt || chosen-response tokens
      chosen_mask     : (B, S_c)    1 on RESPONSE tokens of chosen, else 0
      chosen_attn     : (B, S_c)    HF-style attention mask
      rejected_ids    : (B, S_r)    prompt || rejected-response tokens
      rejected_mask   : (B, S_r)    1 on RESPONSE tokens of rejected, else 0
      rejected_attn   : (B, S_r)    HF-style attention mask
    """
    # --- 1. Per-response log-probs under the LIVE policy (gradient flows)
    pol_logp_w = response_logprob(policy,
        batch["chosen_ids"],   batch["chosen_attn"],   batch["chosen_mask"])
    pol_logp_l = response_logprob(policy,
        batch["rejected_ids"], batch["rejected_attn"], batch["rejected_mask"])

    # --- 2. Per-response log-probs under the FROZEN reference (no grad)
    with torch.no_grad():
        ref_logp_w = response_logprob(ref_policy,
            batch["chosen_ids"],   batch["chosen_attn"],   batch["chosen_mask"])
        ref_logp_l = response_logprob(ref_policy,
            batch["rejected_ids"], batch["rejected_attn"], batch["rejected_mask"])

    # --- 3. Log-ratios and the implicit reward margin
    logratio_w = pol_logp_w - ref_logp_w
    logratio_l = pol_logp_l - ref_logp_l
    margin     = BETA * (logratio_w - logratio_l)        # shape (B,)

    # --- 4. DPO loss (with optional cDPO label smoothing)
    if LABEL_SM > 0:
        # cDPO: assume LABEL_SM fraction of labels are flipped (noisy).
        loss = -(1 - LABEL_SM) * F.logsigmoid(margin) \
               -      LABEL_SM  * F.logsigmoid(-margin)
        loss = loss.mean()
    else:
        loss = -F.logsigmoid(margin).mean()

    # --- 5. Backward + clip + step
    loss.backward()
    gn = torch.nn.utils.clip_grad_norm_(policy.parameters(), MAX_GRAD)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # --- 6. Diagnostics: every DPO dashboard plots these per step
    with torch.no_grad():
        implicit_reward_w = BETA * logratio_w
        implicit_reward_l = BETA * logratio_l
        reward_margin     = (implicit_reward_w - implicit_reward_l).mean()
        # "preference accuracy": fraction of pairs the implicit reward
        # ranks correctly. Healthy SFT-warm-started runs start near 0.55
        # and climb to 0.75–0.85.
        accuracy = (implicit_reward_w > implicit_reward_l).float().mean()
        # KL proxy to ref; should drift up slowly. KL >> 1 = drift trouble.
        kl_proxy = ((pol_logp_w - ref_logp_w).abs().mean()
                  + (pol_logp_l - ref_logp_l).abs().mean()) / 2

    return {
        "loss":             loss.detach(),
        "margin":           margin.detach().mean(),
        "rew_chosen":       implicit_reward_w.mean().detach(),
        "rew_rejected":     implicit_reward_l.mean().detach(),
        "reward_margin":    reward_margin.detach(),
        "accuracy":         accuracy.detach(),
        "kl_proxy":         kl_proxy.detach(),
        "grad_norm":        gn.detach(),
    }


def response_logprob(model, input_ids, attn_mask, response_mask):
    """
    Sum of per-token log pi(token | prefix) across response tokens only.

    A causal LM at position t predicts the token at position t+1, so we
    align logits[:, :-1] with targets[:, 1:] and use response_mask[:, 1:]
    to zero out prompt and padding positions BEFORE summing.
    """
    out = model(input_ids=input_ids, attention_mask=attn_mask, use_cache=False)
    logits  = out.logits[:, :-1, :]                  # (B, S-1, V)
    targets = input_ids[:, 1:]                       # (B, S-1)
    rm      = response_mask[:, 1:].float()           # (B, S-1)

    logp_all = F.log_softmax(logits, dim=-1)         # (B, S-1, V)
    logp_tok = logp_all.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (logp_tok * rm).sum(dim=-1)               # (B,)
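To make steps 3 and 4 of the listing concrete, here is a small worked example for a single preference pair using plain floats instead of tensors. It is a sketch under stated assumptions: `dpo_loss_scalar` and `log_sigmoid` are illustrative names, and the four log-prob values are made up.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss_scalar(pol_w, pol_l, ref_w, ref_l, beta=0.1, label_smoothing=0.0):
    """Margin and loss for ONE preference pair from four summed log-probs."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    loss = (-(1.0 - label_smoothing) * log_sigmoid(margin)
            - label_smoothing * log_sigmoid(-margin))
    return margin, loss

# Step 0: the policy is a copy of the reference, so both log-ratios are zero.
margin0, loss0 = dpo_loss_scalar(-50.0, -50.0, -50.0, -50.0)

# Later: chosen gained 2 nats and rejected lost 3 nats relative to the reference.
margin1, loss1 = dpo_loss_scalar(-48.0, -53.0, -50.0, -50.0)

# Same pair with 10% label smoothing (cDPO): the loss is larger because the
# model is also charged for its confidence in a possibly flipped label.
_, loss1_smooth = dpo_loss_scalar(-48.0, -53.0, -50.0, -50.0, label_smoothing=0.1)

print(f"step 0: margin={margin0:.3f} loss={loss0:.4f}")
print(f"later : margin={margin1:.3f} loss={loss1:.4f} smoothed={loss1_smooth:.4f}")
```

At step 0 the margin is zero and the loss equals log 2; as the log-ratio gap opens in the right direction, the margin grows and the loss falls below that starting value.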

At Massive Scale: Why DPO Became the Default

Every line of the PyTorch snippet scales to a 70B-parameter post-training run, with three engineering wins that explain why DPO displaced PPO as the default alignment recipe outside of the largest frontier labs.

The two-model memory budget

At 70B, each model is ~140 GB in bf16. DPO needs two models (policy and reference) for a total of ~280 GB in weights, plus 140 GB for policy gradients, plus ~560 GB for AdamW optimiser state (fp32 first and second moments). Total: ~980 GB before activations. PPO needs the same budget plus a reward model (~140 GB), a value head (~10–140 GB depending on architecture), and a separate KV cache for rollout (~50–200 GB): easily 2 TB total. DPO halves the memory footprint of the comparable post-training step. Note that ~980 GB already exceeds the 640 GB of HBM on a single 8×H100 node, so even DPO at 70B needs FSDP sharding across at least two such nodes (or offloading); PPO at 70B needs roughly twice that again.
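The budget above is simple enough to recompute. The sketch below reproduces the arithmetic in Python, using decimal gigabytes and the byte widths stated in the text; the variable names are illustrative, and activations, buffers, and fragmentation are deliberately left out.

```python
PARAMS = 70e9        # parameter count of one model
BYTES_BF16 = 2       # weights and gradients
BYTES_FP32 = 4       # AdamW moments
GB = 1e9             # decimal gigabytes, as in the text

weights_gb = PARAMS * BYTES_BF16 / GB          # one model's weights
grads_gb = PARAMS * BYTES_BF16 / GB            # policy gradients
adamw_gb = PARAMS * BYTES_FP32 * 2 / GB        # first + second moments

# Policy and reference both resident in memory.
dpo_gb = 2 * weights_gb + grads_gb + adamw_gb

# Reference log-probs precomputed, so the reference weights are not resident.
dpo_cached_ref_gb = dpo_gb - weights_gb

print(f"weights per model : {weights_gb:.0f} GB")
print(f"AdamW state       : {adamw_gb:.0f} GB")
print(f"DPO, two models   : {dpo_gb:.0f} GB")
print(f"DPO, cached ref   : {dpo_cached_ref_gb:.0f} GB")
```

Changing `PARAMS` or the byte widths (for example, to model an 8-bit optimiser) shows quickly which term dominates: at these settings the optimiser state is more than half of the total.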

The reference can be cached

On a fixed preference set, the reference log-probs $\log \pi_{\text{ref}}(y_w \mid x)$ and $\log \pi_{\text{ref}}(y_l \mid x)$ are constants of the dataset; they do not depend on training step. Production DPO pipelines compute them ONCE before training and store them as scalars alongside each preference pair. The reference model then leaves GPU memory entirely. The remaining per-step cost is two forward passes (and their backward pass) through the POLICY only: half the forward passes of the uncached step, and none of the rollout generation that dominates a PPO step. On a Tulu-3 scale run this turns a 24-hour wall-clock into ~10 hours.
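The caching pattern can be sketched without a real model. In the snippet below, `cache_reference_logprobs` is an illustrative helper, `fake_ref_logprob` is a stand-in that returns a made-up score rather than a real log-prob, and the dictionary keys are assumptions; the point is only that the reference is called twice per pair, once, regardless of how many epochs follow.

```python
def cache_reference_logprobs(pairs, ref_logprob_fn):
    """Attach reference log-probs to every pair, once, before training."""
    return [
        {
            **pair,
            "ref_logp_w": ref_logprob_fn(pair["prompt"], pair["chosen"]),
            "ref_logp_l": ref_logprob_fn(pair["prompt"], pair["rejected"]),
        }
        for pair in pairs
    ]

# Stand-in for a real reference forward pass; counts how often it runs.
calls = {"n": 0}

def fake_ref_logprob(prompt, response):
    calls["n"] += 1
    return -0.5 * len(response.split())   # made-up score, NOT a real log-prob

pairs = [
    {"prompt": "Explain KL.",
     "chosen": "KL measures divergence between two distributions.",
     "rejected": "No idea."},
    {"prompt": "2 + 2?", "chosen": "4", "rejected": "It is 5."},
]
cached = cache_reference_logprobs(pairs, fake_ref_logprob)

# Three epochs over the cached data: the loop reads stored scalars and
# never calls the reference again.
for epoch in range(3):
    for pair in cached:
        ref_w, ref_l = pair["ref_logp_w"], pair["ref_logp_l"]

print("reference calls:", calls["n"])
```

The cache is only valid while the tokenisation, prompt template, and reference checkpoint stay fixed; changing any of them means recomputing it.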

No rollout = deterministic, replayable runs

PPO rollouts depend on sampling temperature, top-p, the current policy weights, and a random seed; reproducing a PPO failure requires re-rolling the exact same trajectories, which is hard. DPO consumes a fixed offline preference dataset in a fixed order; a failed run can be replayed by re-running with the same seed and data order, and is byte-for-byte reproducible when deterministic kernels are enabled. Debugging RLHF stops being an exotic discipline and becomes ordinary supervised-learning debugging. This is the engineering subtext under every DPO paper's claim of “simplicity”.
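The data-order half of that reproducibility is easy to pin down. The following sketch, with an illustrative `batch_order` helper, derives the batch order from the seed and epoch alone using a private random generator, so a replayed run sees the same pairs in the same batches. It does not address nondeterminism inside GPU kernels.

```python
import random

def batch_order(num_pairs, batch_size, seed, epoch):
    """Deterministic shuffled batches of dataset indices for one epoch."""
    rng = random.Random(f"{seed}-{epoch}")   # private RNG, no global state
    indices = list(range(num_pairs))
    rng.shuffle(indices)
    return [indices[i:i + batch_size]
            for i in range(0, num_pairs, batch_size)]

run_a = batch_order(10, 4, seed=0, epoch=0)
run_b = batch_order(10, 4, seed=0, epoch=0)   # replay of the same run
print("replay identical:", run_a == run_b)
print("batches:", run_a)
```

Because the order is a pure function of `(seed, epoch)`, a failing step can be revisited by recomputing the batch at that position rather than re-running everything before it.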

Where it does NOT scale

DPO has no rollout, which means it CANNOT discover new responses. It can only redistribute probability mass across responses that appear in the preference data. For frontier reasoning models, the chosen response in the dataset is often weaker than what the model itself could now generate — DPO will obediently push toward that weaker response. This is the failure mode that pushed frontier labs back toward on-policy methods (PPO, GRPO) for reasoning training in 2024–2025: only on-policy data lets the model learn from its current best behaviour. DPO is the default for instruction-following and safety alignment; on-policy methods are the default for reasoning. The next chapter picks up this thread with GRPO.

Engineering Reality: The Failure Modes of DPO

DPO is famously stable but it has its own pathologies. The five that bite in real runs:

1. The chosen log-prob collapses too

The most counterintuitive DPO failure: implicit reward of the chosen response goes UP relative to the rejected, but the ABSOLUTE log-prob of the chosen response goes DOWN. The model learned to widen the margin by unlearning the rejected response FASTER than it learned the chosen one. Symptoms: generated responses become shorter, lower quality, more repetitive even though the preference accuracy on held-out pairs looks great. Fix: log both $\log \pi_\theta(y_w)$ and $\log \pi_\theta(y_l)$ separately at every step; if the chosen log-prob is falling, raise β or add an SFT regulariser on the chosen response (SLiC-HF includes a regulariser of this kind; IPO takes a different route and bounds the margin with a squared loss).
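A monitor for this failure mode only needs two logged series. The sketch below assumes each step's metrics include `reward_margin` and a hypothetical `logp_chosen` entry (the batch mean of the policy's summed log-prob on chosen responses, which the step function would have to log in addition to its existing diagnostics). The window size and the example numbers are made up.

```python
def chosen_logprob_collapsing(history, window=3):
    """True if, over the last `window` steps, the reward margin rose
    while the absolute chosen log-prob fell."""
    if len(history) <= window:
        return False                      # not enough steps to compare
    start, end = history[-window - 1], history[-1]
    margin_up = end["reward_margin"] > start["reward_margin"]
    chosen_down = end["logp_chosen"] < start["logp_chosen"]
    return margin_up and chosen_down

healthy = [
    {"reward_margin": 0.00, "logp_chosen": -50.0},
    {"reward_margin": 0.10, "logp_chosen": -49.5},
    {"reward_margin": 0.20, "logp_chosen": -49.0},
    {"reward_margin": 0.30, "logp_chosen": -48.6},
]
collapsing = [
    {"reward_margin": 0.00, "logp_chosen": -50.0},
    {"reward_margin": 0.15, "logp_chosen": -52.0},
    {"reward_margin": 0.35, "logp_chosen": -55.0},
    {"reward_margin": 0.60, "logp_chosen": -59.0},
]

print("healthy run flagged   :", chosen_logprob_collapsing(healthy))
print("collapsing run flagged:", chosen_logprob_collapsing(collapsing))
```

In practice the comparison would be made on smoothed values, since per-step means are noisy; the endpoint comparison here is kept deliberately simple.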

2. Sign bugs

The DPO loss combines FOUR log-probs into one margin. Flipping chosen and rejected in any of the four log-prob arguments inverts the loss and trains the model to PREFER the rejected response. Symptoms: loss decreases normally, accuracy sits at 50% and stays there (or drops below 50% if the bug is one-sided). Fix: compute the implicit rewards in isolation and assert that r_hat_w.mean() > r_hat_l.mean() within the first few steps. If not, you have a sign bug, period.
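The assertion described above can be written as a standalone check. This is a sketch with plain lists in place of tensors: `implicit_rewards` and `sign_check` are illustrative names, and the log-prob values are invented to resemble a run a few steps in.

```python
def mean(xs):
    return sum(xs) / len(xs)

def implicit_rewards(pol_logp, ref_logp, beta=0.1):
    """Per-pair implicit reward: beta * (policy log-prob - reference log-prob)."""
    return [beta * (p - r) for p, r in zip(pol_logp, ref_logp)]

def sign_check(r_hat_w, r_hat_l):
    """True when chosen responses earn the higher mean implicit reward."""
    return mean(r_hat_w) > mean(r_hat_l)

# Made-up summed log-probs for two pairs, a few steps into training.
pol_w, ref_w = [-48.0, -61.0], [-50.0, -62.0]   # chosen
pol_l, ref_l = [-53.0, -70.0], [-50.0, -68.0]   # rejected

r_hat_w = implicit_rewards(pol_w, ref_w)
r_hat_l = implicit_rewards(pol_l, ref_l)
wired_correctly = sign_check(r_hat_w, r_hat_l)

# Simulated bug: the collate step swapped the chosen and rejected columns,
# so everything labelled "chosen" is really the rejected response.
wired_swapped = sign_check(r_hat_l, r_hat_w)

print("correct wiring passes:", wired_correctly)
print("swapped wiring passes:", wired_swapped)
```

Running the check on a held-out batch whose labels were never touched by the training collate code is what makes it informative; checking only the quantities the buggy loss itself optimises can hide the swap.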

3. Reference-policy mismatch

If $\pi_{\text{ref}} \neq \pi_\theta^{(0)}$, i.e. you initialised the policy from a different model than the reference, the log-ratios are already non-zero at step 0: the loss no longer starts at $\log 2 \approx 0.693$ and the gradients can be huge. The policy will leap toward the reference for the first few hundred steps before it starts following preferences. Fix: ALWAYS initialise the policy from the reference. If the reference is the SFT model, the policy must be a fresh copy of the same SFT model.
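A step-0 sanity check catches this before any optimiser step is taken. When the policy is an exact copy of the reference, every log-ratio is zero, the margin is zero, and the loss is log 2. The sketch below assumes per-response log-probs are available as plain floats; `step0_sanity`, the tolerance, and the example values are all illustrative.

```python
import math

def step0_sanity(pol_logps, ref_logps, loss, tol=1e-3):
    """Compare policy and reference log-probs on the first batch."""
    max_gap = max(abs(p - r) for p, r in zip(pol_logps, ref_logps))
    return {
        "max_logratio": max_gap,
        "policy_matches_ref": max_gap < tol,
        "loss_is_log2": abs(loss - math.log(2.0)) < tol,
    }

ref = [-50.0, -62.0, -50.0, -68.0]

# Policy is a fresh copy of the reference: identical log-probs, loss = log 2.
matched = step0_sanity(ref, ref, loss=math.log(2.0))

# Policy came from a different checkpoint (made-up numbers).
mismatched = step0_sanity([-47.0, -66.0, -55.0, -61.0], ref, loss=0.81)

print("matched   :", matched)
print("mismatched:", mismatched)
```

With dropout active or reduced-precision kernels, a genuine copy may differ from the reference by small amounts, so the tolerance needs to be chosen for the setup at hand.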

4. Length bias

DPO computes $\log \pi(y \mid x)$ by summing per-token log-probs across the response. Longer responses accumulate larger NEGATIVE log-probs purely because there are more tokens. If chosen responses are systematically longer than rejected ones (common in human preference data — humans equate longer with more thorough), DPO will learn that LENGTH is the signal and the policy will emit increasingly verbose responses. Fix: balance length in the preference dataset, or use the length-normalised DPO variants (SimPO normalises by response length; IPO uses a different reward parameterisation).
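A toy calculation shows how length alone can produce a margin. Suppose the policy is uniformly 0.01 nats per token more confident than the reference on BOTH responses. Because log-probs are summed, the longer response collects the larger log-ratio. The sketch below uses made-up lengths and gains; dividing by token count is shown only as the general idea behind length normalisation, not as an implementation of SimPO or any other specific method.

```python
def sequence_logratio(per_token_gain, num_tokens):
    """Summed log-ratio when every token gains the same amount over the ref."""
    return per_token_gain * num_tokens

gain = 0.01                      # nats per token, identical for both responses
chosen_len, rejected_len = 80, 40

logratio_w = sequence_logratio(gain, chosen_len)
logratio_l = sequence_logratio(gain, rejected_len)

# Summed log-ratios: the gap is positive purely because chosen is longer.
summed_gap = logratio_w - logratio_l

# Per-token averages: the length effect cancels.
normalised_gap = logratio_w / chosen_len - logratio_l / rejected_len

beta = 0.1
print(f"summed gap     = {summed_gap:.3f}  (margin {beta * summed_gap:.3f})")
print(f"normalised gap = {normalised_gap:.3f}")
```

The policy here has learned nothing about which response is better, yet the summed version reports a positive margin; that is the signal DPO can latch onto when chosen responses are systematically longer.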

5. β is the LR in disguise

Near initialisation the DPO gradient is proportional to β (it carries a factor β·σ(−margin), and σ(0) = 0.5), so for small batches and short training runs a change in β rescales the first few hundred updates much as a change in LR would. Teams sweep both, confusing themselves about which dial mattered. Fix: fix LR at 5e-7 (a common starting point for 70B DPO with AdamW), sweep β in {0.05, 0.1, 0.3}, and only sweep LR if β isn't moving the needle.
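The coupling between β and the step size can be read off the derivative of the loss. Writing the log-ratio gap as Δ = logratio_w − logratio_l, the per-pair loss is −log σ(βΔ) and its derivative with respect to Δ is −β·σ(−βΔ). The sketch below evaluates that expression for the three β values in the sweep; `dloss_dgap` is an illustrative name, and the analysis ignores how an adaptive optimiser such as AdamW rescales gradients.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(beta, gap):
    """Per-pair DPO loss as a function of the log-ratio gap."""
    return -math.log(sigmoid(beta * gap))

def dloss_dgap(beta, gap):
    """Analytic derivative of pair_loss with respect to gap."""
    return -beta * sigmoid(-beta * gap)

for beta in (0.05, 0.1, 0.3):
    at_init = dloss_dgap(beta, 0.0)      # policy == reference
    later = dloss_dgap(beta, 20.0)       # after the gap has opened up
    print(f"beta={beta:<4}  grad at gap=0: {at_init:+.4f}   at gap=20: {later:+.4f}")
```

At Δ = 0 the derivative is exactly −β/2, so the size of the first updates scales linearly with β. Once the gap opens, larger β also saturates the sigmoid sooner, which is where β starts to behave differently from a plain step-size multiplier.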

What every DPO dashboard plots: implicit reward chosen (should rise), implicit reward rejected (should fall), preference accuracy (should climb from ~0.55 toward 0.75–0.85), chosen log-prob in absolute terms (should NOT collapse), KL proxy to reference (should grow slowly). When something goes wrong, one of these five plots tells you which failure mode you hit before you even look at generations.

DPO is not the end of the story. The next chapter picks up where the rollout-free assumption breaks down: when reasoning quality starts to matter more than instruction-following or safety, on-policy methods come back. GRPO inherits the elegance of DPO — no critic, no value head — but adds the on-policy rollout that lets the model learn from its own current best behaviour.