The Real Problem: A Reward Is Not a Loss
By the time you reach this section you already have a reward model (§ 14.3) that takes a (prompt, response) pair and returns a scalar — a single number that says how good the response was. You also have an SFT model (§ 13) that knows how to follow chat formatting. The question of this chapter is brutally simple: how do we use the reward model to actually train the SFT model into something better?
The naive answer feels obvious. We have a scalar score; just treat it as a loss and run backprop. Maximise reward = minimise negative reward, problem solved. This is wrong, and understanding why is the entire reason PPO exists.
Look at what stands between the model's parameters and the reward. The model produces logits. We sample a token from those logits (or take an argmax, or a top-p sample — every decoding strategy is sampling-shaped). We append the sampled token to the context and sample the next one. We repeat for hundreds of tokens until we have a full response. Only then does the reward model score the response. The reward depends on the response, the response depends on which tokens were sampled, and sampling is not differentiable. There is no chain rule from the reward back to the logits, because the argmax/categorical sampling operation breaks the graph.
This is the central obstacle: we have a scalar feedback signal, but the path from parameters to that signal goes through a non-differentiable step. Cross-entropy training side-steps this by always providing the correct token at each position, so there is no sampling in the loss. RLHF cannot do that; the very thing we want to learn is which tokens the model should pick. The whole field of policy gradients was built to solve this one problem: how do you take a gradient through a non-differentiable sampling step?
And once we have a working policy gradient, a second problem immediately appears: even modest policy updates can drag the model wildly off the manifold of human-likely text. A single bad gradient step can turn a coherent SFT model into one that emits gibberish that happens to score well on the reward model — the famous failure mode called reward hacking. PPO is the algorithm that takes the policy gradient idea and adds two safety brakes — a clipped surrogate objective and a per-token KL penalty — to make it actually usable at scale.
The three jobs PPO has to do at once
Intuition: Do More of What Works
Forget gradients for a moment. Imagine you sample 100 responses to a prompt from your model. The reward model scores each one. The intuition behind policy gradient is the simplest training rule in machine learning:
Take the responses that scored well, and increase their probability. Take the responses that scored badly, and decrease their probability.
That's it. That's the algorithm. Every formula in this section is a way to make that idea precise, low-variance, and safe enough to run for thousands of steps. The math will look intimidating — ratios, clips, advantages, GAE — but every symbol is just protecting that simple rule from one specific failure mode.
A useful physical analogy: imagine the policy as the centre of mass of a swarm of probability. The reward model is a landscape with peaks (good responses) and valleys (bad ones). At each step we sample a few points from the swarm, look up the landscape height under each, and gently nudge the swarm toward the higher points. The catch: if we nudge too hard the swarm fragments and slides off the manifold of sensible English text into a region where the reward model is unreliable. PPO is the leash on the swarm.
Mental model in one line: PPO = the policy gradient of REINFORCE + a baseline (so we know what counts as ‘good’) + a clip (so a single step can't move us too far) + a KL penalty to the SFT model (so we stay in the region of human-like text). Everything else is bookkeeping.
REINFORCE: The Naive Policy Gradient
Start with the simplest version that works in theory. The model is a policy $\pi_\theta(a \mid s)$: a probability distribution over actions $a$ given a state $s$. For an LLM, the action at each timestep is the next token and the state is the context so far. We want to maximise the expected reward of trajectories sampled from $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
Here $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ is a full trajectory and $R(\tau)$ is the scalar reward at the end. The problem we just identified, sampling being non-differentiable, looks like it blocks us from computing $\nabla_\theta J(\theta)$. The famous log-derivative trick rescues us:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$
Read this slowly. The sampling step has vanished into the expectation; the only gradient that appears is the gradient of log probability of the chosen action, which is fully differentiable. The reward is just a scalar weight that says ‘multiply this trajectory's log-probability gradient by $R(\tau)$’. If $R(\tau)$ is large and positive, we strongly increase the log-probability of the actions we took; if negative, we decrease. Exactly the ‘do more of what works’ rule, written as a gradient.
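The proof of the log-derivative identity is two lines and worth knowing: $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)$ is the standard chain rule applied to $\log p_\theta(\tau)$, and substituting that into the gradient of the expectation pulls $p_\theta(\tau)$ back out as the sampling distribution, giving a sample-weighted form. Every modern policy-gradient algorithm (REINFORCE, A2C, TRPO, PPO, GRPO) starts from this same identity.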
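Written out, the two lines might look like the following. This is a sketch under the usual assumptions: $p_\theta(\tau)$ denotes the probability of the whole trajectory under the policy, the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, and the gradient can be moved inside the integral (for an LLM the integral is a sum over token sequences).

$$
\begin{aligned}
\nabla_\theta \,\mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]
  &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big] \\
\nabla_\theta \log p_\theta(\tau)
  &= \nabla_\theta \Big[\log p(s_0) + \sum_{t} \log \pi_\theta(a_t \mid s_t) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t)\Big]
   = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{aligned}
$$

The second line is why only the policy's own log-probabilities survive: every term that does not depend on $\theta$ has zero gradient.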
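In practice we never have the expectation; we have a finite batch of $N$ sampled trajectories. The Monte Carlo estimator is:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
This is REINFORCE. It works. It also has a variance problem so severe that direct REINFORCE training of a 7B LLM with a typical reward signal can produce a useless model in under a hundred steps. Every subsequent idea (baselines, advantages, GAE, clipping) is a way to tame that variance: a baseline does it without biasing the estimator at all, while GAE and clipping accept a small, controlled bias in exchange.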
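To make the estimator concrete, here is a minimal sketch on a one-step problem (a three-action softmax policy), where the exact gradient is available in closed form for comparison. The logits, rewards, and function names are illustrative assumptions, not part of any library.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(logits, rewards):
    """Closed form for a one-step problem: dJ/dtheta_k = pi_k * (R_k - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]

def reinforce_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate: the average of R(a) * grad log pi(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            # For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi_k.
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += rewards[a] * score / n_samples
    return grad

logits = [0.5, 0.0, -0.5]    # hypothetical policy parameters
rewards = [1.0, 0.0, -1.0]   # hypothetical reward for each action
rng = random.Random(0)
print("exact   :", exact_gradient(logits, rewards))
print("estimate:", reinforce_gradient(logits, rewards, 20000, rng))
```

With many samples the estimate settles near the exact gradient; with a handful it scatters widely, which is the variance problem in miniature.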
Baselines, Advantage, and GAE
The first variance reduction is the cleanest. Subtract a baseline $b(s_t)$ from the reward inside the sum:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \big(R(\tau_i) - b(s_t^{(i)})\big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
Subtracting a baseline that depends only on the state does not bias the estimator (the proof rests on $\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$, a clean consequence of the policy summing to one), but a well-chosen baseline dramatically reduces variance. The standard, near-optimal choice is the expected reward under the current policy, which is the value function $V(s_t)$. We approximate $V$ with a small value head $V_\phi$: a second network (often a single linear layer on top of the shared transformer backbone) trained alongside the policy with mean-squared error against the actual returns. The result of subtracting the value baseline is called the advantage:
$$A_t = R_t - V_\phi(s_t)$$
where $R_t$ is the return observed from timestep $t$ onward. Read it as: “how much better than expected this action turned out to be at this state”. A positive advantage says ‘better than average; increase the probability of this action’. A negative advantage says ‘worse than average; decrease it’. The naming is precise: advantage measures the marginal value of taking $a_t$ versus the policy's default.
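The claim that a baseline leaves the mean untouched while shrinking the variance can be checked exactly on a toy problem by enumerating every action instead of sampling. The sketch below assumes a uniform three-action softmax policy and made-up rewards with a large common offset; the helper name is hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_mean_and_variance(probs, rewards, baseline, k):
    """Exact mean and variance of the single-sample estimator
    (R(a) - b) * d log pi(a) / d theta_k, enumerating every action a."""
    samples = []
    for a, r in enumerate(rewards):
        score = (1.0 if a == k else 0.0) - probs[k]
        samples.append((r - baseline) * score)
    mean = sum(p * g for p, g in zip(probs, samples))
    second_moment = sum(p * g * g for p, g in zip(probs, samples))
    return mean, second_moment - mean * mean

probs = softmax([0.0, 0.0, 0.0])
rewards = [11.0, 10.0, 9.0]   # hypothetical: small differences on a large offset
value = sum(p * r for p, r in zip(probs, rewards))   # expected reward

mean_raw, var_raw = grad_mean_and_variance(probs, rewards, 0.0, k=0)
mean_adv, var_adv = grad_mean_and_variance(probs, rewards, value, k=0)
print(f"no baseline   : mean={mean_raw:.4f} var={var_raw:.4f}")
print(f"value baseline: mean={mean_adv:.4f} var={var_adv:.4f}")
```

Both rows report the same mean gradient; only the variance changes, and it changes by orders of magnitude because the baseline removes the shared offset of 10.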
GAE: Generalised Advantage Estimation
For long sequences a single end-of-episode reward gives every timestep the same advantage, which is high-variance and ignores the fact that different tokens contributed differently to the outcome. Generalised Advantage Estimation (GAE) (Schulman 2015) interpolates between two extremes:
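- Pure Monte Carlo: use the full return against the baseline, $R_t - V_\phi(s_t)$ (low bias, high variance).
- One-step TD: use the TD residual $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, where $r_t$ is the per-token reward and $\gamma$ the discount factor (high bias, because it depends on the value head being accurate, but low variance).
GAE blends them via an exponentially weighted average with parameter $\lambda \in [0, 1]$:
$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t} (\gamma\lambda)^l\, \delta_{t+l}$$
With $\lambda = 0$ we recover one-step TD; with $\lambda = 1$ we recover Monte Carlo; intermediate values trade bias against variance. Open-source RLHF libraries commonly default to $\gamma = 1$ and $\lambda = 0.95$ for language tasks, and they treat the per-token reward as zero except at the EOS token, where it equals the reward model's score. We will use those defaults without further comment for the rest of the chapter.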
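In code, GAE is usually computed with a single backward pass over the response rather than the explicit sum. A minimal sketch, assuming the episode terminates after the last token (so the value after it is zero) and using made-up rewards and value estimates:

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalised Advantage Estimation by backward recursion.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_{T-1}); V after the last step is 0
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # the episode ends after the last token
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

rewards = [0.0, 0.0, 0.0, 1.0]   # hypothetical: reward only at the final token
values = [0.2, 0.3, 0.5, 0.8]    # hypothetical value-head outputs

adv_td, _ = gae(rewards, values, lam=0.0)    # one-step TD residuals
adv_mc, _ = gae(rewards, values, lam=1.0)    # Monte Carlo return minus value
adv_mid, ret_mid = gae(rewards, values, lam=0.95)
print(adv_td, adv_mc, adv_mid, sep="\n")
```

Setting `lam=0.0` returns the raw TD residuals and `lam=1.0` returns "final reward minus value" at every step, which are the two extremes described above.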
Why Big Policy Steps Are Catastrophic
Suppose we have a low-variance advantage estimate. Why not just take a big gradient step in its direction? Two reasons, and PPO addresses both.
Reason one: the rollout becomes stale. The advantages we computed were under the OLD policy. If a single gradient step changes the policy a lot, the next minibatch we draw from the same rollout is being graded with advantages that no longer match the current policy's behaviour. The objective starts rewarding actions that the new policy already takes more often, creating a positive feedback loop that diverges. This is the classical off-policy / importance-sampling problem: if you want to take K gradient steps on data sampled by the OLD policy you need to correct each step by the probability ratio $\pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$.
Reason two: the reward model is only locally trustworthy. The reward model was trained on responses from the SFT distribution. Push the policy 50 nats of KL away from that distribution and the reward model has no idea what to do: it returns whatever score it happens to extrapolate for that out-of-distribution text, and the policy will gleefully optimise for that nonsense score. This is the textbook RLHF failure: after a few thousand steps, the model emits responses like “Sure! I'd be happy to help with that great question!!!” on every prompt, because the reward model gave a slight upvote to enthusiasm and that turned into runaway exploitation.
Both problems demand the same medicine: keep each update small in policy space, not just in parameter space. TRPO (the predecessor of PPO) does this with a hard constraint and a conjugate-gradient solver to find the largest safe step. PPO's contribution was the observation that the same objective can be achieved with a much simpler trick: bound the importance ratio directly, then take ordinary SGD steps and let the bound do the work.
The PPO Clipped Surrogate Objective
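Define the per-step probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$
The importance-weighted policy gradient says we should maximise $r_t(\theta)\,A_t$ in expectation: bigger ratio on good actions, smaller on bad. PPO's clipped surrogate is one line:
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,A_t\big)\Big]$$
Read it piece by piece. The first argument of the min is the unclipped importance-weighted advantage. The second is the same thing with $r_t$ first restricted to the interval $[1-\epsilon,\, 1+\epsilon]$ (typically $\epsilon = 0.2$). The outer min takes the smaller of the two, the pessimistic one. The clip is the trust region; the min supplies the asymmetry that keeps the objective a pessimistic bound.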
The asymmetry is the cleverest part. Consider a positive advantage:
- If $1-\epsilon \le r_t \le 1+\epsilon$, both terms equal $r_t A_t$. Min is $r_t A_t$. Normal gradient: increase $\pi_\theta(a_t \mid s_t)$.
- If $r_t > 1+\epsilon$ (we have already increased the probability beyond the clip), unclipped is $r_t A_t$ (large positive), clipped is $(1+\epsilon) A_t$ (capped). Min is the capped value, which is constant in $\theta$. Gradient is zero. The policy stops being pushed further on this action even though the advantage is positive.
Now consider a negative advantage:
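- If $1-\epsilon \le r_t \le 1+\epsilon$, both terms equal $r_t A_t$. Normal gradient: decrease $\pi_\theta(a_t \mid s_t)$.
- If $r_t < 1-\epsilon$ (we have already cut the probability beyond the clip), unclipped is $r_t A_t$ (smaller in magnitude, less negative), clipped is $(1-\epsilon) A_t$ (larger in magnitude, more negative). Min picks the clipped value, which is constant in $\theta$. Gradient is zero: the policy stops being pushed further down on this action.
- If $r_t > 1+\epsilon$ (the probability of a bad action has gone UP), unclipped is $r_t A_t$, which is more negative than the clipped $(1+\epsilon) A_t$. Min picks the unclipped value, so the gradient keeps pushing the probability down at full strength. The clip does NOT shield a bad action whose probability has drifted upward.
This is the asymmetry: the clip only stops the gradient once the policy has already moved past the trust region in the direction the advantage favours (up for a good action, down for a bad one). It never stops the gradient that corrects a move in the wrong direction. That is precisely the safety property we wanted: bound how far one rollout can push the policy in pursuit of its advantages, but let it freely undo mistakes.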
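The case analysis can be checked mechanically. The sketch below evaluates the per-token clipped term for each sign of the advantage and each position of the ratio relative to the band, and reports whether the min selected the unclipped branch (the only branch that carries gradient). The function name and the example ratios are illustrative.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-token PPO objective, plus whether it still has gradient w.r.t. the ratio."""
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * advantage
    objective = min(unclipped, clipped)
    # Gradient flows only when the min selects the unclipped branch.
    has_gradient = unclipped <= clipped
    return objective, has_gradient

cases = [
    ("A>0, r inside band", 1.10, +1.0),
    ("A>0, r above band", 1.50, +1.0),
    ("A>0, r below band", 0.50, +1.0),
    ("A<0, r inside band", 0.90, -1.0),
    ("A<0, r below band", 0.50, -1.0),
    ("A<0, r above band", 1.50, -1.0),
]
for name, r, a in cases:
    obj, grad = clipped_term(r, a)
    print(f"{name:20s} objective={obj:+.2f} gradient_flows={grad}")
```

Only two of the six cases lose their gradient: a good action already pushed above the band, and a bad action already pushed below it.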
Plot the clipped objective against the ratio to see this. With a positive advantage and $\epsilon = 0.2$, the curve rises with the ratio up to 1.2, then goes flat for the rest of the domain; that flat region is the no-gradient zone. Flip the advantage to negative and the flat region jumps to the LEFT side of the band, while the right side of the curve becomes the active gradient zone. Shrink $\epsilon$ to 0.05 and the trust region narrows to almost nothing, so the policy becomes almost frozen. Push it to 0.5 and the clip barely does anything. The choice $\epsilon = 0.2$ is the value the community converged on after extensive ablations; it is the de facto standard in open-source RLHF libraries.
What the clip does NOT do
Manual Numerical Walkthrough: One PPO Update
Take a rollout of four timesteps from a three-action policy. The old policy assigned the following log-probabilities to the actions it actually sampled, and GAE gave us the following advantages:
| t | action a_t | old log π(a_t|s_t) | advantage A_t (raw) |
|---|---|---|---|
| 0 | 0 | −1.20 | +0.80 |
| 1 | 2 | −0.51 | +1.50 |
| 2 | 1 | −1.61 | −0.40 |
| 3 | 0 | −0.92 | −1.20 |
Step 1: normalise the advantages. Mean is $(0.80 + 1.50 - 0.40 - 1.20)/4 = 0.175$. Standard deviation (population form, which is NumPy's default; PyTorch's torch.std defaults to the sample form) is $\approx 1.045$. Normalised: $\tilde{A} = (+0.598, +1.268, -0.550, -1.316)$.
Step 2: compute the new log-probs. After one gradient step, suppose the new policy assigns log-probs $(-0.48, -1.28, -1.42, -0.40)$ to those same actions. (We'll derive these properly in the plain-Python section below.)
Step 3: compute ratios. $r_t = \exp(\log \pi_\text{new} - \log \pi_\text{old})$:
| t | log new | log old | log ratio | ratio r_t |
|---|---|---|---|---|
| 0 | −0.48 | −1.20 | +0.72 | 2.054 |
| 1 | −1.28 | −0.51 | −0.77 | 0.463 |
| 2 | −1.42 | −1.61 | +0.19 | 1.209 |
| 3 | −0.40 | −0.92 | +0.52 | 1.682 |
Look at $r_0$, $r_1$ and $r_3$: already these are well outside the trust region $[0.8, 1.2]$, and $r_2$ sits just past its upper edge. The clip is about to come into play on this whole minibatch. That is the point: we will see, in concrete numbers, exactly which timesteps the clip protects against and which it does not.
Step 4: compute the two surrogates.
| t | Ã_t | r_t | surr1 = r_t · Ã_t | clip(r_t) | surr2 = clip(r_t) · Ã_t | min |
|---|---|---|---|---|---|---|
| 0 | +0.598 | 2.054 | +1.229 | 1.200 | +0.718 | +0.718 |
| 1 | +1.268 | 0.463 | +0.587 | 0.800 | +1.014 | +0.587 |
| 2 | −0.550 | 1.209 | −0.665 | 1.200 | −0.660 | −0.665 |
| 3 | −1.316 | 1.682 | −2.213 | 1.200 | −1.579 | −2.213 |
Read it row by row. Row 0: positive advantage, ratio above the trust region. The clip fires; surr2 (0.718) is smaller than surr1 (1.229), so the min picks surr2 and the gradient through $r_0$ is zero. The clip just stopped us from over-rewarding an action we've already pushed up too much. Row 1: positive advantage, ratio below the trust region. The clip does NOT fire (clip(0.463)=0.8 makes surr2 larger, but min picks the smaller surr1=0.587). Gradient flows, and it pushes the probability of this good action up. Row 2: negative advantage, ratio just above the band (1.209 > 1.2). The clip changes surr2, but min picks surr1=−0.665 because it is more negative than surr2=−0.660. Gradient flows through the unclipped term and pushes the probability down at full strength. Row 3: negative advantage, ratio well above 1. Here surr1=−2.213 and surr2=−1.579. Min picks the more negative surr1; gradient flows fully, ignoring the clip. This is the asymmetry working as advertised: we let the gradient hammer down on a bad action whose probability has drifted upward, because undoing that drift is always safe.
Step 5: the loss. Mean of the min column: $(0.718 + 0.587 - 0.665 - 2.213)/4 = -0.393$. Negate (we maximise the objective, so the loss is the negative): policy loss = $+0.393$. That positive value is what gets backpropped. Without the clip, the loss would have been $+0.266$; with the clip it is higher, which means the gradient magnitude is also different, and the model is being told to make a different, safer update.
The headline number to take from this walkthrough: all four ratios left the trust region, three of them by a wide margin, and on one token the clip zeroed the gradient outright. Measured as the fraction of tokens whose ratio falls outside $[1-\epsilon,\, 1+\epsilon]$, clipfrac is 4 out of 4 here. That is way too high for a production run: a healthy PPO step has clipfrac in [0.10, 0.30]. The high clipfrac here is because we deliberately took a too-large gradient step between ‘old’ and ‘new’ for the walkthrough to be readable. In practice you tune the learning rate and the number of PPO epochs per rollout to keep clipfrac in that safe band.
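The walkthrough is easy to re-run by machine. The sketch below recomputes every column from the raw inputs (old log-probs, new log-probs, raw advantages) using only the standard library, so it can serve as a check on the hand arithmetic; the variable names are illustrative, and the population standard deviation is an assumption carried over from Step 1.

```python
import math

old_logp = [-1.20, -0.51, -1.61, -0.92]
new_logp = [-0.48, -1.28, -1.42, -0.40]
raw_adv = [0.80, 1.50, -0.40, -1.20]
eps = 0.2

# Step 1: normalise advantages with the population standard deviation.
mean = sum(raw_adv) / len(raw_adv)
std = math.sqrt(sum((a - mean) ** 2 for a in raw_adv) / len(raw_adv))
adv = [(a - mean) / std for a in raw_adv]

# Step 3: probability ratios from the log-prob difference.
ratios = [math.exp(n - o) for n, o in zip(new_logp, old_logp)]

# Step 4: unclipped and clipped surrogates, then the pessimistic min.
surr1 = [r * a for r, a in zip(ratios, adv)]
surr2 = [min(max(r, 1 - eps), 1 + eps) * a for r, a in zip(ratios, adv)]
objective = [min(s1, s2) for s1, s2 in zip(surr1, surr2)]

# Step 5: losses and two ways of counting clipped tokens.
policy_loss = -sum(objective) / len(objective)
unclipped_loss = -sum(surr1) / len(surr1)
outside_band = sum(abs(r - 1.0) > eps for r in ratios) / len(ratios)
grad_zeroed = sum(s2 < s1 for s1, s2 in zip(surr1, surr2)) / len(ratios)

for t in range(len(ratios)):
    print(f"t={t} adv={adv[t]:+.3f} r={ratios[t]:.3f} "
          f"surr1={surr1[t]:+.3f} surr2={surr2[t]:+.3f} min={objective[t]:+.3f}")
print(f"policy_loss={policy_loss:.3f} unclipped_loss={unclipped_loss:.3f}")
print(f"ratio outside band: {outside_band:.2f}  gradient zeroed: {grad_zeroed:.2f}")
```

The two counters at the end differ on purpose: one counts tokens whose ratio left the band, the other counts tokens where the min actually selected the clipped branch and removed the gradient.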
The Reference Model and KL Penalty
Everything above is pure PPO. The RLHF twist — the modification that turns ‘PPO for games’ into ‘PPO for LLMs’ — is the addition of a third model: the reference policy.
The reference policy $\pi_\text{ref}$ is a frozen copy of the SFT model (the model from § 13). It never receives gradient. At every step of PPO we compute the per-token KL divergence from the live policy to the reference and add it as a penalty:
$$\max_\theta\ \mathbb{E}_{x,\ y \sim \pi_\theta}\big[R_\psi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\ \|\ \pi_\text{ref}(\cdot \mid x)\big)$$
Why a separate KL term when PPO already has a clip? Because the clip prevents large per-step drift, but cannot prevent the policy from accumulating small drifts over thousands of steps into a model that is far from $\pi_\text{ref}$. The clip is a per-step trust region; the KL penalty is the long-term anchor.
The KL is computed token-wise. In code we use the k3 unbiased estimator $k_3 = (\rho_t - 1) - \log \rho_t$ where $\rho_t = \pi_\text{ref}(a_t \mid s_t) / \pi_\theta(a_t \mid s_t)$ (the ratio against the reference, not the old policy). The k3 estimator is never negative, has lower variance than the naive estimator $-\log \rho_t$, and reduces to true KL in expectation.
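These properties of k3 can be verified exactly on a small categorical distribution by taking the expectation over every action rather than sampling. The two distributions below are made-up next-token distributions over three tokens, and the function names are illustrative.

```python
import math

def kl_exact(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def k1(logp_policy, logp_ref):
    """Naive per-sample estimator: log pi_theta(a) - log pi_ref(a)."""
    return logp_policy - logp_ref

def k3(logp_policy, logp_ref):
    """(rho - 1) - log(rho), with rho = pi_ref(a) / pi_theta(a)."""
    log_rho = logp_ref - logp_policy
    return math.expm1(log_rho) - log_rho

def mean_and_variance(probs, values):
    mean = sum(p * v for p, v in zip(probs, values))
    return mean, sum(p * (v - mean) ** 2 for p, v in zip(probs, values))

policy = [0.6, 0.3, 0.1]   # hypothetical pi_theta over three tokens
ref = [0.4, 0.4, 0.2]      # hypothetical pi_ref over the same tokens

k1_vals = [k1(math.log(p), math.log(q)) for p, q in zip(policy, ref)]
k3_vals = [k3(math.log(p), math.log(q)) for p, q in zip(policy, ref)]

# Expectations are taken under the policy, because that is what we sample from.
k1_mean, k1_var = mean_and_variance(policy, k1_vals)
k3_mean, k3_var = mean_and_variance(policy, k3_vals)
print(f"true KL = {kl_exact(policy, ref):.5f}")
print(f"k1: mean={k1_mean:.5f} var={k1_var:.5f} min={min(k1_vals):+.4f}")
print(f"k3: mean={k3_mean:.5f} var={k3_var:.5f} min={min(k3_vals):+.4f}")
```

Both estimators have the true KL as their mean, but individual k1 samples can be negative while k3 samples cannot, and the k3 variance is far smaller in this example.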
The coefficient $\beta$ is the most-tuned hyperparameter in RLHF. Too small and the policy drifts into reward-hacked nonsense in a few thousand steps. Too large and the policy never moves from the SFT model and you waste the entire RLHF run. Modern open-source recipes use $\beta = 0.02$ to $0.05$, sometimes adapted on the fly to keep the realised KL below a target value (e.g. 6 nats per response).
| KL coefficient β | Behaviour |
|---|---|
| 0 | Pure PPO. Policy will eventually reward-hack. |
| 0.001 - 0.01 | Slow drift allowed. Use only when reward model is very robust. |
| 0.02 - 0.05 | Standard. Many open-source PPO-style recipes sit in this range. |
| 0.1 - 0.5 | Conservative. Policy barely moves; use when reward model is suspect. |
| >1.0 | The policy is frozen. Indistinguishable from skipping RLHF. |
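The "adapted on the fly" variant mentioned above can be sketched as a small proportional controller, in the spirit of the adaptive KL schedule used in early RLHF work. The class name, the gain of 0.1, and the clipping of the error to ±0.2 are assumptions chosen for illustration, not the defaults of any particular library.

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target KL."""

    def __init__(self, init_beta, target_kl, gain=0.1):
        self.beta = init_beta
        self.target_kl = target_kl
        self.gain = gain

    def update(self, observed_kl):
        # Relative error, clipped so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + self.gain * error
        return self.beta

controller = AdaptiveKLController(init_beta=0.05, target_kl=6.0)
for observed in [3.0, 6.0, 12.0, 12.0]:
    beta = controller.update(observed)
    print(f"observed KL={observed:5.1f} -> beta={beta:.5f}")
```

When the realised KL sits above the target, beta grows a little each step; below the target it shrinks; at the target it stays put.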
The Four-Model Dance of RLHF
At this point you have all the ingredients. A single RLHF training step needs four models in memory at once, and tracking what each does is the difference between writing PPO and watching it crash:
| Model | Trainable? | Role |
|---|---|---|
| Policy π_θ | Yes — being trained | The model we are improving. Same architecture as the SFT model; initialised from it. Forward pass produces logits and log-probs. |
| Value head V_ϕ | Yes — being trained | Small head (often one linear layer) on top of the policy's hidden states. Predicts V(s_t) for the advantage baseline. Trained with MSE against returns. |
| Reference π_ref | No — frozen SFT model | Used to compute the KL penalty. Same architecture as the policy, but its weights never change. Forward only. |
| Reward model R_ψ | No — frozen, separately trained | Takes a full (prompt, response) pair and outputs a scalar. Called once per rollout, not per gradient step. Same architecture as policy + scalar head. |
For a 70B RLHF run, holding four 70B models in memory simultaneously is the dominant cost — not the gradients, not the optimizer state. Practical implementations share weights when possible: the value head usually piggybacks on the policy backbone (one extra linear head); the reference and the policy can share parameters with a clever swap (LoRA tricks); the reward model can be smaller (a 7B reward model rates a 70B policy fine).
The orchestration follows a fixed loop, called the PPO outer loop:
1. Rollout phase. Sample N prompts. For each prompt, generate a response with the OLD policy (a snapshot of π_θ before this rollout starts). Cache: input_ids, old_logp, ref_logp, old_values.
2. Reward phase. Run the reward model once per (prompt, response). Combine with the running KL signal to form a per-token reward stream r_t.
3. Advantage phase. Run GAE on the r_t and old_values to produce advantages A_t and returns R_t.
4. Optimisation phase. For K epochs (typically K=4), shuffle the rollout into minibatches and run the PPO step on each minibatch. Each minibatch step is the function we just analysed.
5. Snapshot. Set π_old ← π_θ. Back to step 1.
Sampling throughput in step 1 (the rollout) dominates the wall-clock time, often 70–90% of an RLHF step. This is why production RLHF runs use accelerated inference engines (vLLM, SGLang) for rollout and dedicated training engines (FSDP) for optimisation — splitting them across separate GPU pools is the modern best practice.
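The reward phase of the loop can be sketched in a few lines. This assumes one common shaping choice: every response token pays a KL penalty based on the simple log-ratio against the reference, and the reward model's scalar is added at the final token only. The function name, the β value, and the numbers are illustrative.

```python
def per_token_rewards(policy_logp, ref_logp, rm_score, beta=0.05):
    """Build the reward stream r_t for one response.

    Every token pays -beta * (log pi_theta - log pi_ref); the reward
    model's scalar score is added at the final token only.
    """
    rewards = [-beta * (p - r) for p, r in zip(policy_logp, ref_logp)]
    rewards[-1] += rm_score
    return rewards

policy_logp = [-1.0, -0.5, -2.0]   # hypothetical log-probs cached at rollout
ref_logp = [-1.2, -0.5, -1.5]      # hypothetical reference log-probs
stream = per_token_rewards(policy_logp, ref_logp, rm_score=0.7, beta=0.1)
print(stream)
```

The resulting list is what the advantage phase consumes: mostly small KL terms, with the reward model's score arriving at the last position.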
Plain Python: PPO Loss from Scratch
We re-implement the PPO clipped surrogate in pure NumPy on a toy rollout of four timesteps. No autograd, no transformer — just the loss arithmetic an RLHF trainer runs every minibatch. Swap the hard-coded logits for a real transformer's output and the code is unchanged.
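A minimal sketch of that loss arithmetic follows. It uses only the Python standard library rather than NumPy so that it runs with no dependencies; the logits, actions, and advantages are made-up values for a three-action policy, and the `approx_kl` diagnostic is the simple mean log-ratio estimate, one of several conventions in use.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ppo_policy_loss(old_logits, new_logits, actions, advantages, eps=0.2):
    """Clipped-surrogate loss and two diagnostics for one toy rollout."""
    objective, log_ratios, outside = [], [], 0
    for old, new, a, adv in zip(old_logits, new_logits, actions, advantages):
        # Log-prob of the sampled action under the new and the old policy.
        log_ratio = log_softmax(new)[a] - log_softmax(old)[a]
        ratio = math.exp(log_ratio)
        surr1 = ratio * adv
        surr2 = min(max(ratio, 1.0 - eps), 1.0 + eps) * adv
        objective.append(min(surr1, surr2))
        log_ratios.append(log_ratio)
        outside += abs(ratio - 1.0) > eps
    n = len(actions)
    return {
        "loss": -sum(objective) / n,           # negate: we maximise the objective
        "clipfrac": outside / n,               # share of ratios outside the band
        "approx_kl": -sum(log_ratios) / n,     # mean of log(old / new)
    }

# Hypothetical logits for 4 timesteps of a 3-action policy.
old_logits = [[1.0, 0.5, 0.0], [0.2, 0.1, 0.9], [0.0, 0.3, 0.6], [0.8, 0.2, 0.1]]
new_logits = [[1.6, 0.4, 0.0], [0.2, 0.1, 0.7], [0.0, 0.5, 0.6], [1.2, 0.2, 0.1]]
actions = [0, 2, 1, 0]
advantages = [0.6, 1.3, -0.6, -1.3]   # assumed to be normalised already

print(ppo_policy_loss(old_logits, new_logits, actions, advantages))
```

Passing the same logits as both old and new gives a ratio of exactly 1 everywhere, which is a convenient sanity check before plugging in real model outputs.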
PyTorch: A Production-Shaped PPO Step
Now the full thing — policy loss, clipped value loss, entropy bonus, per-token KL to a reference model, gradient clipping, and diagnostics. This is the inner loop that lives inside trl, OpenRLHF, and DeepSpeed-Chat. Drop in any HF causal LM for ‘policy’ and any scalar-head model for ‘value_head’ and this trains a real RLHF run.
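Two of the ingredients listed above, the clipped value loss and the entropy bonus, are easy to state without any framework. The sketch below uses plain Python instead of PyTorch and follows one widely used formulation of the clipped value loss; the coefficients `vf_coef` and `ent_coef` and all function names are assumptions for illustration, not values taken from any specific trainer.

```python
import math

def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    """0.5 * mean(max((V - R)^2, (V_clipped - R)^2)), where V_clipped stays
    within clip_range of the value recorded at rollout time."""
    total = 0.0
    for v, v_old, ret in zip(values, old_values, returns):
        v_clipped = v_old + min(max(v - v_old, -clip_range), clip_range)
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return 0.5 * total / len(values)

def categorical_entropy(logits):
    """Entropy of softmax(logits); higher means a less peaked policy."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)

def total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine the three terms; the entropy bonus is subtracted from the loss."""
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

v_loss = clipped_value_loss([1.0, 0.5], [0.0, 0.5], [1.0, 1.0])
entropy = categorical_entropy([0.0, 0.0, 0.0])
print(v_loss, entropy, total_loss(0.4, v_loss, entropy))
```

In the first example token the new value prediction has jumped from 0.0 to exactly the return, yet the loss is still positive: the max keeps the penalty from the clipped prediction, which discourages the value head from moving too far in one update.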
At Massive Scale: What Changes for a 70B+ Run
Every line of code above scales to a 70B-parameter RLHF run, but three things change in spirit.
Memory: the four-model problem becomes the dominant cost
At 70B, each model is ~140 GB in bf16. Four of them is 560 GB just in weights, plus 280 GB for fp32 policy gradients, plus ~560 GB for AdamW optimiser state (fp32 m and v at 4 bytes each per parameter), for a total of ~1.4 TB before a single activation. Compare to pretraining a 70B model from scratch, where the four-model factor is replaced by a 1× factor: RLHF is, on a per-step basis, much more memory-hungry than pretraining. This is why production RLHF systems share the policy and value-head backbones, fold the reference into a LoRA adapter, or offload the reference and reward model to CPU/disk between rollouts.
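The memory estimate is plain arithmetic and worth keeping as a small calculator. The sketch below assumes bf16 weights (2 bytes per parameter) for all four models, fp32 gradients (4 bytes) for the one trainable model, and two fp32 AdamW moment buffers (8 bytes) for that same model; it ignores activations, fp32 master weights, and any separately trained value model.

```python
GB = 1e9

def rlhf_memory_gb(n_params, n_models=4, weight_bytes=2, grad_bytes=4, adam_bytes=8):
    """Rough weight + gradient + optimiser footprint for one RLHF setup, in GB."""
    weights = n_models * n_params * weight_bytes / GB
    grads = n_params * grad_bytes / GB          # only the policy is trained
    optimiser = n_params * adam_bytes / GB      # AdamW m and v in fp32
    return {
        "weights": weights,
        "grads": grads,
        "optimiser": optimiser,
        "total": weights + grads + optimiser,
    }

for name, size in rlhf_memory_gb(70e9).items():
    print(f"{name:10s} {size:8.0f} GB")
```

Changing `n_models` to 1 gives the corresponding pretraining-style figure, which makes the cost of the extra frozen models explicit.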
The sampling-vs-training imbalance
A 70B policy emits maybe 30 tokens/second on a single H100 with vanilla HuggingFace generation; with vLLM that rises to 200–500 tokens/second per GPU. A typical PPO step trains on ~256 prompts × ~512 response tokens = 131k tokens. At 500 tok/s/GPU, even an 8-GPU rollout takes ~33 seconds; the optimisation phase that follows is ~5 seconds. So 85% of wall-clock is spent rolling out. This is why production RLHF infrastructure separates the rollout cluster (vLLM, many GPUs, prioritised for inference throughput) from the training cluster (FSDP, fewer GPUs, prioritised for memory bandwidth). The rollout cluster ships generations to the training cluster over the network; weights are synced after each optimisation phase.
Distributed advantage normalisation
Advantage normalisation is now an all-reduce. Each FSDP rank computes a partial sum and partial sum-of-squares of advantages on its shard; an all-reduce produces the global mean and std; every rank then normalises locally with those globals. Forgetting to all-reduce and normalising per-rank is a textbook bug — each rank sees a different scale, the effective KL coefficient varies across ranks, and the run becomes noisy in a way that takes weeks to diagnose. Every production RLHF library has unit tests for this one all-reduce.
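The difference between the correct and the buggy version is easy to demonstrate without a cluster. In the sketch below each inner list stands for one rank's shard of advantages, and `all_reduce_sum` is a stand-in for a real sum all-reduce; all names and numbers are illustrative.

```python
import math

def local_stats(advantages):
    """What each rank computes on its own shard: count, sum, sum of squares."""
    return (len(advantages), sum(advantages), sum(a * a for a in advantages))

def all_reduce_sum(per_rank_stats):
    """Stand-in for a sum all-reduce: element-wise sum across ranks."""
    return tuple(sum(column) for column in zip(*per_rank_stats))

def normalise_globally(shards, eps=1e-8):
    n, s, ss = all_reduce_sum([local_stats(shard) for shard in shards])
    mean = s / n
    std = math.sqrt(max(ss / n - mean * mean, 0.0))
    return [[(a - mean) / (std + eps) for a in shard] for shard in shards]

def normalise_per_rank(shards, eps=1e-8):
    """The buggy version: each rank uses only its own mean and std."""
    return [normalise_globally([shard], eps)[0] for shard in shards]

shards = [[0.8, 1.5], [-0.4, -1.2]]   # hypothetical advantages on two ranks
print("global  :", normalise_globally(shards))
print("per-rank:", normalise_per_rank(shards))
```

In this example the per-rank version flips signs: the token with raw advantage +0.8 ends up with a negative normalised advantage on its rank, so the policy would be pushed away from an action that was better than average.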
Reward-model bottleneck
For a 70B policy with a 7B reward model, scoring is cheap. For a 405B policy with a 70B reward model, scoring is comparable in cost to a single optimisation epoch. Modern recipes batch-score responses asynchronously (the reward model runs on its own GPU pool, with a queue between it and the rollout cluster), and some cutting-edge work replaces the reward model with a verifier (rule-based, see § 14.4) or a generative judge (see § 14.5) — the DeepSeek-R1 lineage avoids a learned reward model entirely for exactly this reason.
Engineering Reality: The Knobs That Break Runs
After enough PPO runs every team converges on the same set of failure modes and the same set of high-leverage fixes.
- The KL coefficient is too small. Symptom: reward climbs for 200 steps, then the policy collapses to a degenerate output (a fixed phrase, all-caps, emoji spam) that the reward model happens to score high. Cause: the per-token KL penalty isn't enough to keep the policy near the SFT manifold. Fix: raise β_KL by 2×, restart from a checkpoint before the collapse. Many teams use adaptive KL — increase β when realised KL exceeds a target, decrease when it's below.
- Clipfrac is too high (> 0.5). Symptom: training loss looks noisy and reward grows slowly. Cause: each gradient step moves the policy too far in ratio space — most tokens hit the clip, gradient becomes erratic. Fix: lower LR by 2×, or reduce PPO epochs per rollout from 4 to 2. The clipfrac and approx_kl diagnostics are the right thing to watch, not the loss itself.
- Clipfrac is too low (< 0.05). Symptom: reward barely moves. Cause: the policy isn't changing enough — the LR is too low or the dataset is too narrow. Fix: raise LR carefully (2×) or widen the prompt distribution.
- Forgot to normalise advantages. Symptom: training loss has huge variance across batches and learning stalls. Cause: per-batch reward scale varies wildly. Fix: the one-line normalisation in the code above. This is the single most common ‘PPO doesn't train’ bug.
- Action mask wrong. Symptom: reward and KL both look normal but downstream evals are unchanged from the SFT model. Cause: the mask covers prompt tokens instead of response tokens, or has an off-by-one against labels. Fix: print one example's mask alongside its input_ids before launch and eyeball that mask=1 lines up with response tokens.
- Value head exploded. Symptom: value_loss climbing instead of falling, then NaN. Cause: value head LR not decoupled from policy LR, or value clip not applied. Fix: the clipped value loss in the code above; some teams use a smaller LR (typically 5×) for the value head than for the policy.
- Reference model accidentally aliased to the policy. Symptom: reported KL is exactly zero at every step, so the penalty never engages and the policy drifts unanchored. Cause: the ‘reference’ is actually the live policy, e.g. because both names point at the same object instead of a deep copy. Fix: assert id(ref_policy) != id(policy) at startup; recompute ref_logp from a separately-loaded checkpoint.
- Rollout uses temperature≠1.0 (e.g. 0.7) but training uses softmax(logits) without temperature. Symptom: ratio distribution is centred far from 1.0 even on step 0. Cause: the rollout sampled from softmax(logits / T) but the trainer computes new_logp from softmax(logits). They are different distributions and the ratio is meaningless. Fix: apply the SAME temperature inside the trainer when computing log-probs of the action.
- Gradient norm explodes after a few hundred steps. Symptom: |g| climbs from 0.5 to 50 in 100 steps. Cause: either KL coefficient too small (policy escaping to extreme outputs) or entropy too low (policy collapsing to a near-delta distribution and getting huge per-token gradient magnitudes). Fix: raise β_KL or raise the entropy bonus.
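The temperature mismatch in the list above is simple to reproduce. The sketch below computes the step-0 ratio for one token twice, once with the trainer ignoring the rollout temperature and once with it applied; the logits and the temperature of 0.7 are made-up values.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

logits = [2.0, 1.0, 0.0, -1.0]   # hypothetical logits; policy weights unchanged
rollout_temperature = 0.7
action = 0

# Log-prob cached by the rollout engine, which sampled at temperature 0.7.
old_logp = log_softmax(logits, rollout_temperature)[action]

# Trainer forgets the temperature: the ratio is off before any update.
mismatched_ratio = math.exp(log_softmax(logits)[action] - old_logp)
# Trainer applies the same temperature: the ratio is exactly 1 at step 0.
matched_ratio = math.exp(log_softmax(logits, rollout_temperature)[action] - old_logp)

print(f"mismatched ratio at step 0: {mismatched_ratio:.3f}")
print(f"matched ratio at step 0   : {matched_ratio:.3f}")
```

Checking that the mean ratio on the very first minibatch is 1.0 (to within numerical noise) catches this class of bug before any training time is spent.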
The mental model that unifies this section: PPO is REINFORCE with three modifications. One: subtract a value baseline so positive/negative judgments are calibrated. Two: clip the importance ratio so a single optimisation step cannot move the policy outside a trust region in behaviour space. Three: add a KL penalty against a frozen reference so the cumulative drift over thousands of steps is bounded. Every line of code in the PyTorch step above is one of those three jobs. Get the masking right and the diagnostics right, and a 70B chat model can be RLHF'd in a few hundred PPO outer steps.