The Real Problem: We Cannot Score, Only Compare
Suppose you want to teach a language model what humans consider a good answer. The obvious plan is to ask thousands of labelers to score model outputs on a 1-to-10 scale and train the model to maximize that score. This plan does not work, and it does not work for a deep human reason rather than a technical one.
Humans are excellent comparators and terrible absolute raters. Ask the same person on Monday and Friday to score an essay and you will get two different numbers. Ask two people in the same room and you will get two more. But ask anyone “which of these two essays is better?” and the agreement rate jumps from ~50 % (on numerical scoring) to ~75 % (on pairwise choice). The signal is in the comparison, not the scale.
So the data we can actually collect cheaply and reliably is a stream of pairwise outcomes:
- For prompt $x$, the policy produces two candidate responses $y_A$ and $y_B$.
- A human picks one as the winner: $y_A \succ y_B$ or $y_B \succ y_A$.
- We never observe a numeric quality value for either response.
This is the only signal we have. The reward model needs to turn this stream of binary choices into a continuous scalar reward $r(x, y)$, a number that RL can later climb. How? The answer comes from a 1952 paper by Ralph Bradley and Milton Terry on tournament rankings, decades before anyone imagined fine-tuning a transformer.
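As a concrete picture of the pairwise data described above, here is a minimal sketch of one preference record. The class name `PreferencePair`, its field names, and the helper `record_choice` are illustrative assumptions, not part of any particular RLHF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One labeled comparison: a winner and a loser, with no numeric score."""
    prompt: str
    chosen: str    # the response the labeler picked
    rejected: str  # the response the labeler passed over


def record_choice(prompt: str, response_a: str, response_b: str, a_wins: bool) -> PreferencePair:
    """Turn a labeler's click into a (chosen, rejected) record."""
    if a_wins:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


pair = record_choice(
    "Explain photosynthesis to a child.",
    "Plants make their own food from sunlight, water, and air.",
    "Photosynthesis is the conversion of photons into chemical energy.",
    a_wins=True,
)
print(pair.chosen)
```

Note that the record stores only which response won. There is no field for a quality score, because none was ever observed.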
Intuition: A Hidden Ladder of Quality
Imagine every possible response to a prompt sits on an invisible ladder. The higher a response is on the ladder, the better humans consider it. The exact height is unobservable — we cannot reach in and measure it — but the ladder exists.
When a human is shown two responses, they are not reading each one's ladder height. They are doing something simpler: looking at both, sensing which one is higher, and clicking it. The bigger the height gap, the more confident the click. When two responses are at almost the same height, the click is nearly a coin flip. This is the entire model in one sentence.
The core assumption: the probability that a labeler picks response A over response B is a smooth, monotone function of the gap between their ladder heights. Bigger gap → more confident preference. Equal heights → 50/50.
Three more constraints fall out of the intuition almost for free:
- Symmetry. $P(A \succ B) + P(B \succ A) = 1$. Someone has to win; ties we will handle separately.
- Only the gap matters. If we shift every response's ladder height by the same amount, the gaps stay the same and so do all probabilities. The scores are identifiable only up to a global constant.
- Saturation, not explosion. If A is wildly better than B, the probability should approach 1 but never exceed it. Whatever function we use must map an unbounded gap into $(0, 1)$.
The simplest function that does all three is the sigmoid. That is the Bradley-Terry model.
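The three constraints can be checked numerically for the sigmoid. The sketch below uses arbitrary example heights; the function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def p_a_beats_b(r_a: float, r_b: float) -> float:
    """Preference probability computed from two ladder heights."""
    return sigmoid(r_a - r_b)


r_a, r_b = 1.2, -0.3  # arbitrary example heights

# Symmetry: the two outcomes share all of the probability mass.
symmetry_total = p_a_beats_b(r_a, r_b) + p_a_beats_b(r_b, r_a)

# Only the gap matters: shifting both heights leaves the probability unchanged.
shifted = p_a_beats_b(r_a + 5.0, r_b + 5.0)

# Saturation: a large gap gets close to 1 while staying below it.
large_gap = p_a_beats_b(10.0, 0.0)

print(symmetry_total, shifted, large_gap)
```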
The Mathematical Idea: Pairwise Probabilities from Scores
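Assign each response a latent score $r$, its height on the ladder. With $r_A$ and $r_B$ the scores of responses A and B, the Bradley-Terry probability that A beats B is
$$P(A \succ B) = \sigma(r_A - r_B) = \frac{1}{1 + e^{-(r_A - r_B)}} = \frac{e^{r_A}}{e^{r_A} + e^{r_B}}$$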
Every symbol matters:
- $r_A, r_B$ are the latent rewards of the two responses: real numbers, possibly negative, unbounded.
- $r_A - r_B$ is the margin: the gap between heights. A positive margin means A is above B on the ladder.
- $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid: monotone, smooth, symmetric around $z = 0$ (so $\sigma(-z) = 1 - \sigma(z)$), saturating at 0 and 1.
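Given a dataset $\mathcal{D} = \{(x_i, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$ of preference pairs, where $y_w$ is the winner and $y_l$ the loser, the log-likelihood under Bradley-Terry is
$$\log \mathcal{L}(r) = \sum_{i=1}^{N} \log \sigma\big(r(x_i, y_w^{(i)}) - r(x_i, y_l^{(i)})\big)$$
We learn the reward function by minimizing the negative of this, averaged over the pairs:
$$L(r) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\big(r(x_i, y_w^{(i)}) - r(x_i, y_l^{(i)})\big)$$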
This is the reward-model loss at the heart of InstructGPT, Llama 2 Chat, and every RLHF stack since. It is binary cross-entropy in disguise: each pair carries a target label of 1 (“winner truly is the winner”), and we predict the probability of that label using the margin as the logit.
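The "binary cross-entropy in disguise" claim is easy to verify numerically. The sketch below compares the per-pair Bradley-Terry loss with textbook binary cross-entropy evaluated at target 1 and logit equal to the margin. The scores are made-up examples.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bt_loss(r_winner, r_loser):
    """Per-pair Bradley-Terry loss: -log sigma(r_w - r_l)."""
    return -np.log(sigmoid(r_winner - r_loser))


def bce_from_logit(logit, target):
    """Textbook binary cross-entropy on a logit."""
    p = sigmoid(logit)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))


# Illustrative scores for three (winner, loser) pairs.
r_winner = np.array([2.0, 0.1, -1.0])
r_loser = np.array([0.5, 0.4, 1.0])

loss_bt = bt_loss(r_winner, r_loser)
loss_bce = bce_from_logit(r_winner - r_loser, target=1.0)
print(loss_bt)
print(loss_bce)
```

The two printed arrays agree, and the third pair, where the model ranks the labeled loser higher, carries the largest loss.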
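The gradient with respect to the two scores is beautifully simple. If we write $p = \sigma(r_w - r_l)$ for the predicted probability that the labeled winner wins, then for a single pair with loss $-\log p$
$$\frac{\partial(-\log p)}{\partial r_w} = -(1 - p), \qquad \frac{\partial(-\log p)}{\partial r_l} = +(1 - p)$$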
Two facts to remember from this:
- The update is the surprise. If $p$ is already close to 1 (the model agrees with the labeler), the gradient vanishes. Confident-correct examples contribute almost nothing. Confident-wrong examples produce the largest updates.
- The push is symmetric. Winner gets pulled up by $(1 - p)$; loser gets pushed down by the same amount. The two gradients sum to zero, so when the scores are free parameters updated by plain gradient descent, the data never moves their mean; it stays wherever initialization put it.
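Both facts can be confirmed with a finite-difference check. This is a minimal sketch: the three example pairs are arbitrary, chosen to cover a confident-correct, an uncertain, and a confident-wrong case.

```python
import numpy as np


def pair_loss(r_w, r_l):
    """-log sigma(r_w - r_l), written as softplus(-(r_w - r_l))."""
    return np.logaddexp(0.0, -(r_w - r_l))


def analytic_grad(r_w, r_l):
    """Closed-form gradient: (-(1 - p), +(1 - p))."""
    p = 1.0 / (1.0 + np.exp(-(r_w - r_l)))
    return -(1.0 - p), +(1.0 - p)


def numeric_grad(r_w, r_l, eps=1e-6):
    """Central finite differences in each score."""
    d_w = (pair_loss(r_w + eps, r_l) - pair_loss(r_w - eps, r_l)) / (2 * eps)
    d_l = (pair_loss(r_w, r_l + eps) - pair_loss(r_w, r_l - eps)) / (2 * eps)
    return d_w, d_l


# Confident-correct, uncertain, and confident-wrong pairs.
cases = [(3.0, -1.0), (0.2, 0.0), (-2.0, 2.0)]
for r_w, r_l in cases:
    print((r_w, r_l), analytic_grad(r_w, r_l), numeric_grad(r_w, r_l))
```

The confident-correct pair produces a gradient near zero, while the confident-wrong pair produces one near the maximum size of 1.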
Manual Numerical Walkthrough
Numbers anchor the math. Let us run Bradley-Terry by hand on three responses to the same prompt: “Explain photosynthesis to a child.”
A full BT update from three preference pairs
Suppose the true (hidden) rewards are $r^*_A = 1.5$, $r^*_B = 0.0$, $r^*_C = -1.0$. We initialize $r = (r_A, r_B, r_C) = (0, 0, 0)$.
Step 1 — what the true BT model would say:
| Pair | Margin r* | P(left wins) = σ(margin) |
|---|---|---|
| A vs B | 1.5 − 0.0 = 1.5 | σ(1.5) ≈ 0.818 |
| A vs C | 1.5 − (−1.0) = 2.5 | σ(2.5) ≈ 0.924 |
| B vs C | 0.0 − (−1.0) = 1.0 | σ(1.0) ≈ 0.731 |
Step 2 — the dataset we collected: 3 annotators each judged each pair. Suppose the votes came in as: A beat B 3/3, A beat C 3/3, B beat C 2/3. (Eight wins for the “truer” side, one upset.) That makes nine rows in total. The A-vs-B and A-vs-C judgments contribute six:
| Winner | Loser |
|---|---|
| A | B |
| A | B |
| A | B |
| A | C |
| A | C |
| A | C |
Plus three more from the B-vs-C judgments:
| Winner | Loser |
|---|---|
| B | C |
| B | C |
| C | B |
Step 3 — first-step gradient at r = (0, 0, 0): Every margin is 0, so every $p = \sigma(0) = 0.5$ and every per-pair gradient size is $1 - p = 0.5$. Accumulate:
| Score | Contributions | Total grad |
|---|---|---|
| r_A | − 0.5 × 6 (six wins as A) | −3.0 |
| r_B | + 0.5 × 3 (lost to A) − 0.5 × 2 (won vs C) + 0.5 × 1 (lost to C) | +1.0 |
| r_C | + 0.5 × 3 (lost to A) + 0.5 × 2 (lost to B) − 0.5 × 1 (won vs B) | +2.0 |
Step 4 — apply the update with learning rate $\eta = 0.1$ (and dividing by N = 9 pairs):
| Score | Update | After step 1 |
|---|---|---|
| r_A | 0 − 0.1 × (−3.0 / 9) | +0.033 |
| r_B | 0 − 0.1 × (+1.0 / 9) | −0.011 |
| r_C | 0 − 0.1 × (+2.0 / 9) | −0.022 |
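The mean is $0$, so the gauge stays anchored automatically (this is the symmetric-push property at work). After a single step the order is already $r_A > r_B > r_C$. One caveat about running the loop to convergence on these nine votes: A never loses, so the likelihood keeps rewarding a larger $r_A$ and the fit never settles on the centered true scores $(1.33, -0.17, -1.17)$. Only the B-C gap converges, to $\ln 2 \approx 0.69$, the log-odds of the 2-to-1 vote. Recovering the true scores takes enough comparisons that every response loses at least occasionally.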
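The hand computation above can be replayed in a few lines of NumPy. This is a sketch of the same single step, using the nine votes from Step 2 and the learning rate 0.1 from Step 4. The variable names are illustrative.

```python
import numpy as np

index = {"A": 0, "B": 1, "C": 2}
# (winner, loser) rows from Step 2: nine votes in total.
pairs = [("A", "B")] * 3 + [("A", "C")] * 3 + [("B", "C")] * 2 + [("C", "B")]
winners = np.array([index[w] for w, _ in pairs])
losers = np.array([index[l] for _, l in pairs])

r = np.zeros(3)  # initial scores (r_A, r_B, r_C)
p = 1.0 / (1.0 + np.exp(-(r[winners] - r[losers])))  # all 0.5 at the start

grad = np.zeros(3)
np.add.at(grad, winners, -(1.0 - p))  # winners are pulled up
np.add.at(grad, losers, +(1.0 - p))   # losers are pushed down

learning_rate = 0.1
r_new = r - learning_rate * grad / len(pairs)

print("grad :", grad)
print("r_new:", np.round(r_new, 3))
```

The gradient comes out as (-3, +1, +2) and the updated scores round to (+0.033, -0.011, -0.022), matching the two tables. `np.add.at` is used instead of fancy-index assignment so that repeated indices accumulate.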
Interactive Visualization
[Interactive figure: two reward sliders for $r_A$ and $r_B$, the sigmoid curve with the current margin marked as a red dot, and an annotator simulation that flips a biased coin $n$ times.] When the two rewards are close, the outcome is nearly even; when A's reward is far higher, almost every annotator picks A, but you can still see the rare upsets that real datasets contain.
Resampling shows that even with the underlying probability fixed, a small $n$ gives a noisy empirical win rate. This is the real obstacle in reward modeling: each preference pair is one Bernoulli trial, and you need either many trials per pair or many distinct pairs to pin down the score field.
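The annotator simulation can be approximated offline. The sketch below draws $n$ Bernoulli votes at a fixed margin and prints how far the empirical win rate sits from the true probability. The function name, the seed, and the example rewards are assumptions made for illustration.

```python
import numpy as np


def simulate_annotators(r_a, r_b, n, rng):
    """Return the true P(A wins) and the fraction of n simulated annotators who picked A."""
    p_true = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
    picked_a = rng.random(n) < p_true  # one Bernoulli trial per annotator
    return p_true, picked_a.mean()


rng = np.random.default_rng(0)
for n in (5, 50, 5000):
    p_true, p_hat = simulate_annotators(r_a=1.0, r_b=0.0, n=n, rng=rng)
    print(f"n={n:5d}  true p={p_true:.3f}  empirical p={p_hat:.3f}")
```

With a handful of annotators the empirical rate can land far from the true value; it tightens only as $n$ grows.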
Plain Python: Fitting Bradley-Terry by Hand
Before we touch PyTorch, let us write Bradley-Terry from scratch in NumPy. The whole algorithm, including the gradient, the optimizer, and the gauge fix, fits in about 30 lines.
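What follows is a minimal sketch of such a fit, not the original listing. It assumes a toy setup of 5 items and 400 noisy comparisons; the true score values, learning rate, step count, and seed are all illustrative choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_pairs(true_r, n_pairs, rng):
    """Sample (winner, loser) index arrays from a Bradley-Terry model."""
    k = len(true_r)
    i = rng.integers(0, k, size=n_pairs)
    j = (i + rng.integers(1, k, size=n_pairs)) % k  # guarantees j != i
    i_wins = rng.random(n_pairs) < sigmoid(true_r[i] - true_r[j])
    return np.where(i_wins, i, j), np.where(i_wins, j, i)


def fit_bradley_terry(winners, losers, k, lr=0.5, steps=2000):
    """Gradient descent on the mean BT loss over a lookup table of k scores."""
    r = np.zeros(k)
    for _ in range(steps):
        p = sigmoid(r[winners] - r[losers])
        grad = np.zeros(k)
        np.add.at(grad, winners, -(1.0 - p))
        np.add.at(grad, losers, +(1.0 - p))
        r -= lr * grad / len(winners)
        r -= r.mean()  # gauge fix: keep the scores centered at zero
    return r


rng = np.random.default_rng(0)
true_r = np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # assumed hidden scores
winners, losers = simulate_pairs(true_r, n_pairs=400, rng=rng)
fitted = fit_bradley_terry(winners, losers, k=len(true_r))

centered_truth = true_r - true_r.mean()
correlation = np.corrcoef(fitted, centered_truth)[0, 1]
print("fitted     :", np.round(fitted, 2))
print("truth      :", centered_truth)
print("correlation:", round(float(correlation), 3))
```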
Run this and the learned scores should track the centered truth closely, with correlation near 1; the exact error depends on the number of comparisons and the random seed. The engine that fits a 7-billion-parameter reward model is exactly this loop, just with the lookup-table scores replaced by a transformer forward pass.
PyTorch: The RLHF Reward-Model Loss
Now the production-shaped version. Two changes from the NumPy code:
- The latent score is produced by a neural network — a transformer with a scalar head — not stored in an array.
- We write the loss as $\mathrm{softplus}(-(r_c - r_r))$, the numerically stable form of $-\log \sigma(r_c - r_r)$.
The shape contract is the entire interface between Bradley-Terry and any transformer:
| Tensor | Shape | What it is |
|---|---|---|
| chosen / rejected token_ids | (batch, seq) | Two responses to the same prompt |
| trunk output h | (batch, seq, hidden) | Transformer hidden states |
| last-token pool | (batch, hidden) | Response summary vector |
| scalar reward r | (batch,) | One number per response |
| margin r_c − r_r | (batch,) | Per-pair preference logit |
| loss | scalar | Mean of softplus(−margin) |
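The shape contract in the table can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not a production recipe: `TinyRewardModel` is a hypothetical toy (a small `nn.TransformerEncoder` with no positional encoding or causal mask), the token ids are random, and all sequences share one length so that "last token" is simply position -1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyRewardModel(nn.Module):
    """Toy stand-in for a transformer trunk with a scalar reward head."""

    def __init__(self, vocab_size=100, hidden=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=4 * hidden,
            dropout=0.0, batch_first=True,
        )
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):
        h = self.trunk(self.embed(token_ids))  # (batch, seq, hidden)
        pooled = h[:, -1, :]                   # last-token pool: (batch, hidden)
        return self.head(pooled).squeeze(-1)   # scalar reward: (batch,)


def bradley_terry_loss(r_chosen, r_rejected):
    """Mean of softplus(-margin), equal to the mean of -log sigmoid(margin)."""
    return F.softplus(-(r_chosen - r_rejected)).mean()


torch.manual_seed(0)
model = TinyRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Random token ids standing in for tokenized (prompt + response) sequences.
chosen_ids = torch.randint(0, 100, (4, 12))    # (batch, seq)
rejected_ids = torch.randint(0, 100, (4, 12))  # (batch, seq)

r_chosen = model(chosen_ids)      # (batch,)
r_rejected = model(rejected_ids)  # (batch,)
loss = bradley_terry_loss(r_chosen, r_rejected)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(r_chosen.shape, loss.item())
```

A real reward model would start from a pretrained causal LM and pool at the last non-padding token of each sequence. The loss function itself would stay the same.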
At Massive Scale: From One LLM Head to a 70B Reward Model
When the toy embedding becomes Llama-3 70B, three things change — but the loss does not.
Compute and memory
Both chosen and rejected are forwarded through the same 70B network. That is two forward passes and two backward passes per example. To handle both in a single call, production trainers concatenate chosen and rejected along the batch dimension so the model processes them as one batch of size $2B$, then split the rewards back into the two halves. Activation checkpointing and ZeRO-3 sharding are mandatory: a 70B reward model uses the same parallelism recipe as its SFT parent (see Chapter 11 — Tensor Parallelism and Chapter 12 — FSDP / ZeRO).
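The batch-concatenation pattern is independent of the framework. Here is a NumPy sketch in which `toy_reward_model` is a hypothetical stand-in for the real network (it just averages a per-token score) and both inputs are assumed to be padded to the same sequence length.

```python
import numpy as np


def toy_reward_model(token_ids, token_scores):
    """Stand-in for a transformer + scalar head: average a per-token score.

    token_ids: (batch, seq) integer array; returns rewards of shape (batch,).
    """
    return token_scores[token_ids].mean(axis=1)


def paired_rewards(chosen_ids, rejected_ids, token_scores):
    """Score chosen and rejected in one call by stacking them along the batch axis."""
    batch = chosen_ids.shape[0]
    both = np.concatenate([chosen_ids, rejected_ids], axis=0)  # (2B, seq)
    rewards = toy_reward_model(both, token_scores)             # (2B,)
    return rewards[:batch], rewards[batch:]                    # two (B,) halves


rng = np.random.default_rng(0)
token_scores = rng.normal(size=50)             # hypothetical vocabulary of 50 tokens
chosen_ids = rng.integers(0, 50, size=(4, 8))  # (B, seq)
rejected_ids = rng.integers(0, 50, size=(4, 8))

r_chosen, r_rejected = paired_rewards(chosen_ids, rejected_ids, token_scores)
print(r_chosen.shape, r_rejected.shape)
```

The result is identical to scoring the two halves separately; only the number of model calls changes.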
Data scale
InstructGPT's reward model was trained on comparisons drawn from ~33 000 prompts. Llama 2's reward model used 1.4 million comparisons. Anthropic's Helpful-Harmless dataset reports ~170 000 pairs. The data efficiency of Bradley-Terry (recall the toy example recovered 5 scores from 400 noisy comparisons) is what makes these comparatively small datasets viable for policies with tens to hundreds of billions of parameters.
Numerical stability
At 70B parameters trained in bf16, the rewards can drift to magnitudes of 20-40 if unconstrained. Two universal tricks:
- Reward regularization: add a small term to keep rewards bounded. Llama 2's recipe uses .
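- Margin loss: if the labeler gave a confidence rating (“strongly prefer” vs “slightly prefer”), require the reward gap to clear a margin $m$ set by the confidence tier, using the loss $-\log \sigma(r_c - r_r - m)$. This is the extension introduced for Llama 2's reward models, and many later reward models use some form of it.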
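Both tricks amount to small changes to the per-pair loss. The sketch below is illustrative only: the squared-reward penalty is one assumed form of the regularizer, and the tier names, margin values, and `reg_coef` are hypothetical numbers rather than any published recipe.

```python
import numpy as np


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)


def margin_bt_loss(r_chosen, r_rejected, margin=0.0, reg_coef=0.0):
    """Bradley-Terry loss with an optional per-pair margin and reward-size penalty."""
    pair_loss = softplus(-(r_chosen - r_rejected - margin))
    penalty = reg_coef * (r_chosen ** 2 + r_rejected ** 2)  # assumed form of the regularizer
    return float(np.mean(pair_loss + penalty))


# Hypothetical margins per confidence tier (values are illustrative only).
TIER_MARGIN = {"slightly": 0.0, "clearly": 0.5, "strongly": 1.0}

r_chosen = np.array([1.2, 0.4, 2.0])
r_rejected = np.array([0.9, 0.1, -0.5])
tiers = ["slightly", "clearly", "strongly"]
margins = np.array([TIER_MARGIN[t] for t in tiers])

plain = margin_bt_loss(r_chosen, r_rejected)
with_margin = margin_bt_loss(r_chosen, r_rejected, margin=margins)
with_penalty = margin_bt_loss(r_chosen, r_rejected, margin=margins, reg_coef=0.01)
print(plain, with_margin, with_penalty)
```

A positive margin raises the loss on pairs whose gap is too small for their confidence tier. The penalty term grows with the reward magnitudes, which discourages drift.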
Engineering Reality: Ties, Noise, and Reward Hacking
What about ties?
Bradley-Terry as written has no concept of a tie: every comparison must produce a winner. In practice labelers can mark a pair as “about equal”, and two standard fixes exist:
- Drop ties from training. Loses ~5-10 % of data but keeps the loss exact.
- Rao-Kupper extension: introduce a tie parameter $\theta \geq 1$ so that $P(\text{tie})$ grows when the margin is small. Mathematically clean, almost never used in production because dropping ties just works.
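For reference, here is a minimal sketch of the Rao-Kupper probabilities, writing $\pi = e^{r}$ for each response and using a single tie parameter `theta`. Setting `theta = 1` recovers plain Bradley-Terry with no tie mass. The example rewards and the value `theta = 1.5` are arbitrary.

```python
import numpy as np


def rao_kupper_probs(r_a, r_b, theta):
    """Return (P(A wins), P(B wins), P(tie)) for tie parameter theta >= 1."""
    pi_a, pi_b = np.exp(r_a), np.exp(r_b)
    p_a = pi_a / (pi_a + theta * pi_b)
    p_b = pi_b / (theta * pi_a + pi_b)
    p_tie = (theta ** 2 - 1.0) * pi_a * pi_b / ((pi_a + theta * pi_b) * (theta * pi_a + pi_b))
    return p_a, p_b, p_tie


close_call = rao_kupper_probs(0.0, 0.0, theta=1.5)
clear_win = rao_kupper_probs(3.0, 0.0, theta=1.5)
print("equal rewards :", np.round(close_call, 3))
print("margin of 3.0 :", np.round(clear_win, 3))
```

The tie probability is largest when the two rewards are equal and shrinks as the margin grows.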
Label noise and the 75 % ceiling
When you measure inter-annotator agreement on real RLHF datasets, you get ~75 %. This is the label-noise ceiling: on roughly one comparison in four, two thoughtful labelers genuinely disagree about which output is better. A reward model that scores well above 75 % on held-out pairs is more likely fitting labeler-specific quirks than getting better, so treat it with suspicion. Watch validation pairwise accuracy carefully; the sweet spot for early stopping is usually 72-74 %.
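A small simulation shows why noisy labels cap pairwise accuracy. It assumes labelers click according to Bradley-Terry over hypothetical true rewards drawn from a standard normal, then scores the true reward function against their clicks. The distribution, sample size, and seed are illustrative assumptions.

```python
import numpy as np


def pairwise_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the chosen response gets the higher reward."""
    return float(np.mean(r_chosen > r_rejected))


rng = np.random.default_rng(0)
n_pairs = 100_000

# Hypothetical true rewards for the two responses in each pair.
r_a = rng.normal(0.0, 1.0, n_pairs)
r_b = rng.normal(0.0, 1.0, n_pairs)

# Labelers click according to Bradley-Terry, so some clicks are upsets.
p_a_wins = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
a_chosen = rng.random(n_pairs) < p_a_wins
r_chosen = np.where(a_chosen, r_a, r_b)
r_rejected = np.where(a_chosen, r_b, r_a)

# Even the true reward function cannot agree with every noisy label.
oracle_accuracy = pairwise_accuracy(r_chosen, r_rejected)
expected_ceiling = float(np.mean(np.maximum(p_a_wins, 1.0 - p_a_wins)))
print(oracle_accuracy, expected_ceiling)
```

The oracle's accuracy lands at the average of $\max(p, 1 - p)$ over the pairs, which stays well below 100 % whenever many pairs are close calls.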
Why this matters for RLHF and DPO
Bradley-Terry is also the assumption that lets Direct Preference Optimization (DPO) exist. DPO rewrites the BT loss in closed form against a reference policy, skipping the reward-model-then-PPO pipeline entirely. The trick only works because the BT family has a tractable closed-form ratio between the optimal policy and the reference — that derivation is the next section's job.
Reward hacking
Bradley-Terry produces a reward function that is reliable only on the distribution of responses the SFT policy generates. Push the RL policy too far and it will discover responses outside that distribution where the reward model is wrong: typically responses that look superficially confident, long-winded, or formatted with bullet points and headers, because those features correlate with “chosen” in the training set. This is reward hacking, and the standard defense is the KL penalty against the reference policy that PPO and DPO both bake in. The reward model itself is not the culprit; Bradley-Terry is doing exactly what it was asked. The problem is that the data did not cover what the policy can become.
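In PPO-style pipelines the KL penalty is commonly applied by subtracting a scaled log-probability ratio from the reward-model score. The sketch below shows that shaping on made-up numbers; the function name, the per-sequence summation, and `beta = 0.1` are assumptions for illustration.

```python
import numpy as np


def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Subtract beta times a per-sequence KL estimate from the reward-model score.

    reward:      (batch,)      reward-model scores
    logp_policy: (batch, seq)  log-probs of the sampled tokens under the RL policy
    logp_ref:    (batch, seq)  log-probs of the same tokens under the frozen reference
    """
    kl_estimate = (logp_policy - logp_ref).sum(axis=-1)  # (batch,)
    return reward - beta * kl_estimate


reward = np.array([2.0, 2.0])
logp_ref = np.log(np.array([[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]))
# First row matches the reference; second row has drifted toward its own samples.
logp_policy = np.log(np.array([[0.5, 0.5, 0.5], [0.9, 0.9, 0.9]]))

shaped = kl_penalized_reward(reward, logp_policy, logp_ref)
print(shaped)
```

Two responses with the same reward-model score end up with different shaped rewards: the one produced by a drifted policy is discounted.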
One sentence to keep: Bradley-Terry takes a stream of binary clicks and turns them into a calibrated scalar reward via the sigmoid of the score gap. Every modern preference-trained LLM is built on this single line of 70-year-old math.