Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: We Cannot Score, Only Compare

Suppose you want to teach a language model what humans consider a good answer. The obvious plan is to ask thousands of labelers to score model outputs on a 1-to-10 scale and train the model to maximize that score. This plan does not work, and it does not work for a deep human reason rather than a technical one.

Humans are excellent comparators and terrible absolute raters. Ask the same person on Monday and Friday to score an essay and you will get two different numbers. Ask two people in the same room and you will get two more. But ask anyone “which of these two essays is better?” and the agreement rate jumps from ~50 % (on numerical scoring) to ~75 % (on pairwise choice). The signal is in the comparison, not the scale.

So the data we can actually collect cheaply and reliably is a stream of pairwise outcomes:

For prompt $x$ , the policy produces two candidate responses $y_A$ and $y_B$ .
A human picks one as the winner: $y_A \succ y_B$ or $y_B \succ y_A$ .
We never observe a numeric quality value for either response.

This is the only signal we have. The reward model needs to turn this stream of binary choices into a continuous $r(x, y) \in \mathbb{R}$ — a number that RL can later climb. How? The answer comes from a 1952 paper by Ralph Bradley and Milton Terry on tournament rankings, decades before anyone imagined fine-tuning a transformer.

Bradley-Terry is the bridge between the data we CAN collect (pairwise preferences) and the signal we NEED (a continuous reward). Most reward-model-based RLHF pipelines, including those described for InstructGPT and Llama 2 Chat, build on this 70-year-old model.

Intuition: A Hidden Ladder of Quality

Imagine every possible response to a prompt sits on an invisible ladder. The higher a response is on the ladder, the better humans consider it. The exact height is unobservable — we cannot reach in and measure it — but the ladder exists.

When a human is shown two responses, they are not reading each one's ladder height. They are doing something simpler: looking at both, sensing which one is higher, and clicking it. The bigger the height gap, the more confident the click. When two responses are at almost the same height, the click is nearly a coin flip. This is the entire model in one sentence.

The core assumption: the probability that a labeler picks response A over response B is a smooth, monotone function of the gap between their ladder heights. Bigger gap → more confident preference. Equal heights → 50/50.

Three constraints fall out of the intuition almost for free:

Symmetry. $P(A \succ B) + P(B \succ A) = 1$ . Someone has to win; ties we will handle separately.
Only the gap matters. If we shift every response's ladder height by the same amount, the gaps stay the same and so do all probabilities. The scores are identifiable only up to a global constant.
Saturation, not explosion. If A is wildly better than B, the probability should approach 1 but never exceed it. Whatever function we use must map an unbounded gap into $(0, 1)$ .

The simplest function that does all three is the sigmoid. That is the Bradley-Terry model.
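The three properties are easy to check numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `p_win` and the example heights are illustrative choices, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def p_win(height_a: float, height_b: float) -> float:
    """Probability that A is preferred over B, given two ladder heights."""
    return sigmoid(height_a - height_b)

# Symmetry: the two outcomes share all of the probability mass.
symmetry_sum = p_win(1.2, -0.3) + p_win(-0.3, 1.2)

# Only the gap matters: shifting both heights by 100 changes nothing.
shift_gap = abs(p_win(1.2, -0.3) - p_win(101.2, 99.7))

# Saturation: a huge gap approaches 1 but remains a valid probability.
huge_gap = p_win(30.0, 0.0)

print(symmetry_sum, shift_gap, huge_gap)
```

Equal heights give `p_win(0.0, 0.0) == 0.5`, the coin-flip case from the ladder picture.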

The Mathematical Idea: Pairwise Probabilities from Scores

Assign each response a latent score $r(y) \in \mathbb{R}$ — its height on the ladder. The Bradley-Terry probability that A beats B is

$P(y_A \succ y_B) = \sigma(r(y_A) - r(y_B)) = \frac{1}{1 + e^{-(r_A - r_B)}}$

Every symbol matters:

$r(y_A), r(y_B)$ are the latent rewards of the two responses — real numbers, possibly negative, unbounded.
$r_A - r_B$ is the margin: the gap between heights. A positive margin means A is above B on the ladder.
$\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid: monotone, smooth, symmetric around $\sigma(0) = 0.5$ , saturating at 0 and 1.

Given a dataset of $N$ preference pairs $\{(y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ — where $y_w$ is the winner and $y_l$ the loser — the log-likelihood under Bradley-Terry is

$\log \mathcal{L}(r) = \sum_{i=1}^N \log \sigma(r(y_w^{(i)}) - r(y_l^{(i)}))$

We learn the reward function by minimizing the negative of this:

$\mathcal{L}_{\text{BT}}(r) = -\sum_{i=1}^N \log \sigma(r_w^{(i)} - r_l^{(i)})$

This is the reward-model loss at the heart of InstructGPT, Llama 2 Chat, and most reward-model-based RLHF stacks since. It is binary cross-entropy in disguise: each pair carries a target label of 1 (“winner truly is the winner”), and we predict the probability of that label using the margin $r_w - r_l$ as the logit.
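To make the "cross-entropy in disguise" claim concrete, here is a small sketch that computes the loss of one pair both ways. The function names and the two example scores are made up for illustration.

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one (winner, loser) pair under Bradley-Terry."""
    margin = r_w - r_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def bce_with_logit(logit: float, target: float) -> float:
    """Binary cross-entropy for a single logit and a 0/1 target."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

r_w, r_l = 0.8, -0.4                      # illustrative scores, margin = 1.2
loss_bt = bt_loss(r_w, r_l)
loss_bce = bce_with_logit(r_w - r_l, target=1.0)
print(loss_bt, loss_bce)                  # the two numbers agree
```

The margin plays the role of the logit and the label is always 1, which is why any cross-entropy implementation can serve as a Bradley-Terry trainer.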

The gradient with respect to the two scores is beautifully simple. If we write $p = \sigma(r_w - r_l)$ , then

$\frac{\partial \mathcal{L}_{\text{BT}}}{\partial r_w} = -(1 - p), \qquad \frac{\partial \mathcal{L}_{\text{BT}}}{\partial r_l} = +(1 - p)$

Two facts to remember from this:

The update is the surprise. If $p$ is already close to 1 (the model agrees with the labeler), the gradient vanishes. Confident-correct examples contribute almost nothing. Confident-wrong examples produce the largest updates.
The push is symmetric. Winner gets pulled up by $(1 - p)$ ; loser gets pushed down by the same amount. The two gradients sum to zero, so plain gradient descent on a table of scores leaves the mean wherever initialization put it; the data never moves it.
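Both facts can be checked with a finite-difference test. This is a sketch under simple assumptions: one pair, scalar scores, and helper names (`pair_loss`, `analytic_grads`, `numeric_grads`) invented for the example.

```python
import math

def pair_loss(r_w: float, r_l: float) -> float:
    """-log sigmoid(r_w - r_l), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-(r_w - r_l)))

def analytic_grads(r_w: float, r_l: float):
    """Closed-form gradients: -(1 - p) for the winner, +(1 - p) for the loser."""
    p = 1.0 / (1.0 + math.exp(-(r_w - r_l)))
    return -(1.0 - p), +(1.0 - p)

def numeric_grads(r_w: float, r_l: float, eps: float = 1e-6):
    """Central finite differences of the pair loss."""
    g_w = (pair_loss(r_w + eps, r_l) - pair_loss(r_w - eps, r_l)) / (2 * eps)
    g_l = (pair_loss(r_w, r_l + eps) - pair_loss(r_w, r_l - eps)) / (2 * eps)
    return g_w, g_l

# Confident-correct, undecided, and confident-wrong starting points.
for r_w, r_l in [(3.0, 0.0), (0.0, 0.0), (0.0, 3.0)]:
    a_w, a_l = analytic_grads(r_w, r_l)
    n_w, n_l = numeric_grads(r_w, r_l)
    print(f"margin={r_w - r_l:+.1f}  analytic=({a_w:+.4f}, {a_l:+.4f})  "
          f"numeric=({n_w:+.4f}, {n_l:+.4f})  sum={a_w + a_l:+.1e}")
```

The confident-wrong pair (margin of -3) should show by far the largest gradient, and the winner and loser gradients should cancel in every row.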

Manual Numerical Walkthrough

Numbers anchor the math. Let us run Bradley-Terry by hand on three responses to the same prompt: “Explain photosynthesis to a child.”

A full BT update from three preference pairs

Suppose the true (hidden) rewards are $r^*_A = 1.5$ , $r^*_B = 0.0$ , $r^*_C = -1.0$ . We initialize $r_A = r_B = r_C = 0$ .

Step 1 — what the true BT model would say:

Pair	Margin r*	P(left wins) = σ(margin)
A vs B	1.5 − 0.0 = 1.5	σ(1.5) ≈ 0.818
A vs C	1.5 − (−1.0) = 2.5	σ(2.5) ≈ 0.924
B vs C	0.0 − (−1.0) = 1.0	σ(1.0) ≈ 0.731

Step 2 — the dataset we collected: 3 annotators each judged each pair. Suppose the votes came in as: A beat B 3/3, A beat C 3/3, B beat C 2/3. (Eight wins for the “truer” side, one upset.) The dataset is six rows:

Winner	Loser
A	B
A	B
A	B
A	C
A	C
A	C

Plus three more from the B-vs-C judgments:

Winner	Loser
B	C
B	C
C	B

Step 3 — first-step gradient at r = (0, 0, 0): Every margin is 0, so every $p = \sigma(0) = 0.5$ and every per-pair gradient size is $1 - p = 0.5$ . Accumulate:

Score	Contributions	Total grad
r_A	− 0.5 × 6 (six wins as A)	−3.0
r_B	+ 0.5 × 3 (lost to A) − 0.5 × 2 (won vs C) + 0.5 × 1 (lost to C)	+1.0
r_C	+ 0.5 × 3 (lost to A) + 0.5 × 2 (lost to B) − 0.5 × 1 (won vs B)	+2.0

Step 4 — apply the update with learning rate $\eta = 0.1$ (and dividing by N = 9 pairs):

Score	Update	After step 1
r_A	0 − 0.1 × (−3.0 / 9)	+0.033
r_B	0 − 0.1 × (+1.0 / 9)	−0.011
r_C	0 − 0.1 × (+2.0 / 9)	−0.022

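The mean is $(0.033 - 0.011 - 0.022) / 3 = 0$ , so the gauge stays anchored automatically (this is the symmetric-push property at work). After a single step the order is already $A > B > C$ . Keep running the loop and the B-versus-C gap settles toward $\ln 2 \approx 0.69$ , the log-odds of the observed 2-to-1 vote. The score of A, however, keeps climbing: A never lost a comparison in this tiny sample, so the likelihood has no finite maximum in $r_A$ . With more votes per pair (and at least one upset against A) the fit would instead approach the centered truth, $r^* - \bar{r}^* \approx (1.33, -0.17, -1.17)$ .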
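The first update of the walkthrough can be reproduced in a few lines. This sketch uses plain dictionaries keyed by response name; the variable names are illustrative.

```python
import math

# The nine collected rows: (winner, loser).
pairs = [("A", "B")] * 3 + [("A", "C")] * 3 + [("B", "C")] * 2 + [("C", "B")]

r = {"A": 0.0, "B": 0.0, "C": 0.0}        # initial scores
grad = {name: 0.0 for name in r}

for winner, loser in pairs:
    p = 1.0 / (1.0 + math.exp(-(r[winner] - r[loser])))
    grad[winner] -= (1.0 - p)             # pull the winner up
    grad[loser] += (1.0 - p)              # push the loser down

eta = 0.1                                 # learning rate from Step 4
r_new = {name: r[name] - eta * grad[name] / len(pairs) for name in r}

print("grad :", grad)
print("r_new:", {name: round(value, 3) for name, value in r_new.items()})
```

The gradient totals should match the table in Step 3 and the updated scores should match the table in Step 4.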

Notice what the math just did: a stream of 9 binary clicks produced a continuous ranking. That information density is part of why RLHF can steer a 70B-parameter model with on the order of 10⁵ comparisons, typically far fewer labels than supervised fine-tuning would need to convey the same preferences.

Interactive Visualization

Drag the two reward sliders below to see how the margin maps to a preference probability through the sigmoid. The red dot marks your current operating point on the curve. The annotator simulation on the left rolls a biased coin $N$ times — when $P(A) = 0.5$ the outcome is nearly even; when $P(A) = 0.95$ almost every annotator picks A but you can still see the rare upsets that real datasets contain.

Play with the “resample” button — even with the underlying probability fixed, small $N$ gives noisy empirical $\hat{P}$ . This is the real obstacle in reward modeling: each preference pair is one Bernoulli trial, and you need either many trials per pair or many distinct pairs to pin down the score field.
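Without the widget, the same experiment can be run as a short simulation. This is a sketch, not the visualizer's actual code: `simulate_votes` is a hypothetical helper, and the rewards 1.0 and 0.0 are arbitrary slider positions.

```python
import math
import random

def simulate_votes(r_a: float, r_b: float, n_annotators: int, rng: random.Random):
    """Return (true P(A wins), empirical fraction of annotators picking A)."""
    p_a = 1.0 / (1.0 + math.exp(-(r_a - r_b)))
    wins_a = sum(rng.random() < p_a for _ in range(n_annotators))
    return p_a, wins_a / n_annotators

rng = random.Random(0)                    # fixed seed so reruns are repeatable
for n in (5, 50, 5000):
    p_a, p_hat = simulate_votes(1.0, 0.0, n, rng)
    print(f"N={n:5d}  true P(A)={p_a:.3f}  empirical={p_hat:.3f}")
```

Changing the seed plays the role of the “resample” button: the small-N rows move around a lot, while the large-N row stays close to the true probability.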

Plain Python: Fitting Bradley-Terry by Hand

Before we touch PyTorch, let us write Bradley-Terry from scratch in NumPy. The whole algorithm, including the gradient, the optimizer, and the gauge fix, fits in under 50 lines.

The annotated walkthrough comes first; the full listing follows it.

We never see the true reward, only comparisons

Bradley-Terry exists because of a single hard truth about preference data: humans cannot reliably emit a number on a scale, but they CAN reliably say 'I like A better than B'. The ground truth `true_r` here is the simulator's secret. The learner only sees a stream of `(winner, loser)` pairs. This is the exact information channel RLHF uses — a labeler reads two LLM outputs and clicks one.

State: n = 5; true_r is a (5,) array of hidden scores.

The sigmoid is the heart of Bradley-Terry

Bradley-Terry says: P(i beats j) = σ(r_i − r_j) = 1 / (1 + exp(r_j − r_i)). The sigmoid maps an unbounded score difference into a (0, 1) probability and is symmetric: P(i beats j) + P(j beats i) = 1. Notice the model has NO interaction terms, no item-specific bias — every comparison is a 1-D problem along a single 'quality axis'. The whole modeling power lives in that scalar gap.

State: x is any real number.

Generate the pairwise dataset

For each unordered pair (i, j) we run K = 40 noisy matches. The expected number of wins for i is K · σ(r_i − r_j). With binomial noise this is the exact statistical analogue of an RLHF dataset where many labelers each see the same prompt and may disagree. The order of items inside a (winner, loser) tuple is what carries the signal.

State: K = 40; pairs is a list of (w, l) tuples.

Bernoulli sampling: this is your label noise model

`wins_i` is the number of times i beat j across the K trials. It is drawn from Binomial(K, p). When p ≈ 0.5 (close skill) the dataset is noisy: even with K = 40 you can easily land anywhere from 18 to 22 wins. When p ≈ 0.97 (the largest gap in this example, 3.5) almost every match goes one way. This shape of noise, wide near 0.5 and narrow near 0 or 1, is what the Bradley-Terry likelihood assumes about preference labels, and the Bradley-Terry loss is simply the negative log-likelihood of these Bernoulli outcomes.

Initialize learned scores to zero

We start with r = 0 for every item. Because only differences r_i − r_j matter, the absolute starting point is irrelevant; what matters is that gradient descent has room to MOVE items apart. In the RLHF reward model the analogue is initializing a fresh scalar head on top of the pretrained transformer.

State: r is a (5,) array of zeros.

Negative log-likelihood per pair

For one observation (w beat l), the likelihood is σ(r_w − r_l). The negative log-likelihood is −log σ(d) where d = r_w − r_l. This is exactly the logistic / cross-entropy loss with a target of 1. Summed over all pairs, it is the Bradley-Terry loss. Every term wants `r_w` larger and `r_l` smaller — by exactly how much depends on how confident the model already is.

State: d = r_w − r_l; p = σ(d) ∈ (0, 1).

Gradient: winner pulled up, loser pulled down

Take the derivative of −log σ(d) by hand: it equals −(1 − p). So the gradient on the winner's score is −(1 − p) (push UP, since we subtract gradient * lr), and on the loser's score is +(1 − p) (push DOWN). Two beautiful properties: (1) the update size is the model's surprise — if it already strongly predicts the right answer (p close to 1), the gradient vanishes; (2) the two scores receive equal and opposite pushes, conserving the mean.

State: grad[w] = −(1 − p); grad[l] = +(1 − p).

Anchor the mean to fix the unidentifiability

Bradley-Terry's scores are only identifiable up to a global additive constant: if we add 100 to every r_i, every preference probability stays the same. In this loop the gradient already sums to zero, so the mean would stay at its initial value of 0 even without the extra line; subtracting the mean each step is a cheap safeguard against floating-point drift and makes the gauge choice explicit. With a neural reward model nothing pins the constant automatically, and some pipelines shift the reward afterwards so that a chosen reference set of responses has mean reward 0.

Compare to the centered ground truth

We center the true scores the same way, then check correlation. With K = 40 matches per pair the recovered scores should track the truth closely, though sampling noise can leave individual scores off by a few tenths, and the correlation should be close to 1. The lesson: 400 noisy binary comparisons (10 pairs × 40 trials) are enough to recover the ordering and approximate spacing of five scores. This data efficiency is part of why RLHF works with ~10⁵ comparisons for billion-parameter models: every comparison carries information about two items at once.

```python
import numpy as np

# Five chess-like "responses" with a hidden true quality.  We will NEVER look
# at the true score during fitting, only at pairwise win/lose outcomes, the
# same signal a human RLHF annotator gives an LLM.
np.random.seed(0)
n = 5
true_r = np.array([2.0, 1.0, 0.5, -0.5, -1.5])   # hidden ground truth

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Build a dataset of pairwise comparisons.  For each pair (i, j) we sample
# K matches; winner i happens with probability sigmoid(true_r[i] - true_r[j]).
K = 40
pairs = []                                       # list of (winner, loser)
for i in range(n):
    for j in range(i + 1, n):
        p = sigmoid(true_r[i] - true_r[j])
        wins_i = np.random.binomial(K, p)
        pairs += [(i, j)] * wins_i               # i beats j
        pairs += [(j, i)] * (K - wins_i)         # j beats i

# Fit Bradley-Terry: learn r in R^n by maximizing log-likelihood of the data.
# A useful trick: scores are only identifiable up to a constant, so we anchor
# the mean at zero after every update.
r = np.zeros(n)
lr = 0.1
for step in range(2000):
    grad = np.zeros(n)
    loss = 0.0
    for w, l in pairs:
        d = r[w] - r[l]
        p = sigmoid(d)
        loss -= np.log(p + 1e-12)
        grad[w] -= (1.0 - p)                     # dL/dr_w = -(1-p)
        grad[l] += (1.0 - p)                     # dL/dr_l = +(1-p)
    r -= lr * grad / len(pairs)
    r -= r.mean()                                # anchor: mean(r) = 0

true_centered = true_r - true_r.mean()
print("learned:", np.round(r, 3))
print("truth  :", np.round(true_centered, 3))
print("corr   :", np.corrcoef(r, true_centered)[0, 1])
```

Run this and you should see the learned scores track the centered truth closely, typically to within a few tenths given the sampling noise in 400 comparisons, with a correlation close to 1. The engine that fits a 7-billion-parameter reward model is this same loop, with the lookup-table scores replaced by a transformer forward pass.
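One cheap speed-up is worth seeing before moving on. Many rows of the dataset are the same (winner, loser) pair repeated, so the gradient can be accumulated from win counts instead of row by row. The sketch below is standard-library only; the function names and the small three-item dataset are invented for the comparison.

```python
import math
from collections import Counter

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def grad_rowwise(r, pairs):
    """Gradient of the summed BT loss, visiting every (winner, loser) row."""
    grad = [0.0] * len(r)
    for w, l in pairs:
        surprise = 1.0 - sigmoid(r[w] - r[l])
        grad[w] -= surprise
        grad[l] += surprise
    return grad

def grad_from_counts(r, pairs):
    """Same gradient, but each distinct ordered pair is evaluated once."""
    grad = [0.0] * len(r)
    for (w, l), count in Counter(pairs).items():
        surprise = count * (1.0 - sigmoid(r[w] - r[l]))
        grad[w] -= surprise
        grad[l] += surprise
    return grad

# Illustrative data: three items, 40 matches per pair.
pairs = ([(0, 1)] * 30 + [(1, 0)] * 10 +
         [(0, 2)] * 37 + [(2, 0)] * 3 +
         [(1, 2)] * 28 + [(2, 1)] * 12)
r = [0.4, 0.1, -0.5]

g_rows = grad_rowwise(r, pairs)
g_counts = grad_from_counts(r, pairs)
print(g_rows)
print(g_counts)
```

Both functions return the same vector; the count-based one does 6 sigmoid evaluations here instead of 120. The trick only helps when pairs repeat, which is the case in the toy tournament but rarely in real preference data.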

PyTorch: The RLHF Reward-Model Loss

Now the production-shaped version. Two changes from the NumPy code:

The latent score is produced by a neural network — a transformer with a scalar head — not stored in an array.
We write the loss as $\text{softplus}(-(r_w - r_l))$ , the numerically stable form of $-\log \sigma$ .
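Why the softplus form matters is easiest to see by breaking the naive one. The sketch below mirrors the idea with the standard library only; it is not PyTorch's implementation, and the function names are illustrative.

```python
import math

def naive_neg_log_sigmoid(margin: float) -> float:
    """-log(sigmoid(margin)) computed literally; fragile for very negative margins."""
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def softplus(x: float) -> float:
    """log(1 + exp(x)) computed without overflow for any real x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stable_bt_loss(margin: float) -> float:
    """Bradley-Terry loss of one pair, written as softplus(-margin)."""
    return softplus(-margin)

for margin in (5.0, -5.0, -800.0):
    try:
        naive = naive_neg_log_sigmoid(margin)
    except OverflowError:
        naive = float("nan")              # math.exp overflowed on the way
    print(f"margin={margin:+7.1f}  naive={naive}  stable={stable_bt_loss(margin)}")
```

For moderate margins the two agree. At a margin of -800 the literal formula fails inside `math.exp`, while the softplus form returns a loss of 800, which is the correct asymptote: for a badly wrong pair the loss grows linearly in the size of the mistake.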

The annotated walkthrough comes first; the full listing follows it.

The reward model is a transformer with a scalar head

Reward-model-based RLHF stacks such as InstructGPT and Llama 2 Chat share one architecture: a copy of the SFT model with the LM head replaced by a single linear layer that outputs ONE number. The pretrained transformer provides 'language understanding'; the scalar head turns it into 'preference judgment'. Everything we learn here carries over when the trunk is a 70B-parameter model.

Embedding stands in for the full LLM trunk

For a runnable demo we replace the multi-billion-parameter transformer with a 10k-vocab embedding table. The shape contract matches the real thing (we produce a (batch, seq, hidden) tensor), and the loss and gradient math are the same as in a production trainer. One caveat: an embedding table has no attention, so each position's vector depends only on that position's token id. To scale up, swap `nn.Embedding` for a pretrained trunk such as `LlamaModel.from_pretrained(...)` and read its final hidden states.

State: trunk weight is (10000, hidden_size); head weight maps hidden_size to 1.

Pool by taking the last token's hidden state

How do you turn a (seq, hidden) sequence into one number? The RLHF convention: take the hidden state at the LAST non-padding token. Why last? Causal attention means the last position can attend to the whole sequence, so it carries a summary of the whole response. Alternatives (mean-pool, CLS-token) work too, but last-token pooling matches how a causal LM is read when it predicts the next token. Note that the demo indexes position -1 directly, which is the last real token only because its inputs carry no padding, and that with the attention-free embedding trunk the toy reward depends on the final token id alone.

State: h is (batch, seq, hidden); last is (batch, hidden).

Scalar reward: one number per response

The linear head collapses the hidden state (32-dimensional in this demo, thousands of dimensions in a real LLM) into one real number r. There is no sigmoid here; the reward is intentionally unbounded. The probability interpretation only appears when we form a DIFFERENCE between two rewards. This is one of the most under-appreciated design choices: by keeping r unbounded we give the model room to express huge confidence gaps (margin of 6 ⇒ p ≈ 0.998) without saturating.

State: the return value has shape (batch,).

The Bradley-Terry loss as softplus(−margin)

Algebra: −log σ(x) = log(1 + exp(−x)) = softplus(−x). So the BT loss on one pair is softplus(−(r_c − r_r)). The reason every framework writes it this way is numerical stability — when the margin is very negative, plain σ → 0 and log(0) = −∞, but softplus handles the full real line without ever producing NaN. This is the SAME loss as binary cross-entropy with target 1 applied to the margin.

State: r_chosen and r_rejected both have shape (batch,).

One prompt, two responses: the data shape of RLHF

Every example is a TRIPLE: (prompt, chosen_response, rejected_response). The model scores both responses (using the same parameters — no separate networks!) and the loss only depends on the DIFFERENCE. This dual-pass structure has a name in the literature: siamese training. It is why RLHF datasets are quoted as 'N preference pairs' rather than 'N labels'.

Forward both branches in one optimizer step

Concretely: we call `model(chosen)` and `model(rejected)` separately, get two scalar rewards, then form the margin. Autograd correctly accumulates the gradient ON THE SHARED PARAMETERS from both forward passes because backprop on (r_c − r_r) flows independently through each. Production trainers often concatenate chosen and rejected into one batch of size 2·B so that a single forward pass covers both.

After 200 steps the model gives chosen a much higher reward than rejected

On real data you would never train to convergence on a single pair, but here you should see the margin grow to several units (roughly +6 to +10), which pushes P(chosen ≻ rejected) above 0.99. In practice training stops when held-out pairwise accuracy plateaus, often around 70-75 % (the data is noisy; even humans disagree ~25 % of the time on subjective prompts). 'Accuracy' here means: how often does the reward model agree with the held-out human label?

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In RLHF, the reward model is a transformer with a scalar head:
#     reward = w · h_last_token   (a single number per (prompt, response)).
# At training time each example provides TWO responses to the same prompt:
#     "chosen" (the human-preferred one) and "rejected".  The Bradley-Terry
# loss says: P(chosen ≻ rejected) = σ(r_chosen − r_rejected),  and we
# minimize −log of that probability.

class RewardModel(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # In practice this is a frozen-or-fine-tuned LLM trunk followed by
        # a single linear "value head".  We replace the trunk by an
        # nn.Embedding to keep the example self-contained and runnable.
        # Caveat: an embedding has no attention, so each position only
        # "sees" its own token id.
        self.trunk = nn.Embedding(10_000, hidden_size)
        self.head  = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq).   We use the LAST-token hidden state as
        # the response summary, the standard RLHF convention.  Indexing -1
        # is only correct here because the inputs are not padded.
        h = self.trunk(token_ids)              # (batch, seq, hidden)
        last = h[:, -1, :]                     # (batch, hidden)
        return self.head(last).squeeze(-1)     # (batch,)  scalar reward

def bradley_terry_loss(r_chosen: torch.Tensor,
                       r_rejected: torch.Tensor) -> torch.Tensor:
    # −log σ(r_chosen − r_rejected)  ==  softplus(−(r_chosen − r_rejected))
    # softplus is the numerically stable form: it never overflows when the
    # margin is hugely negative, and it goes to 0 cleanly when the margin
    # is hugely positive.
    return F.softplus(-(r_chosen - r_rejected)).mean()

# --- training loop on one preference pair ---
torch.manual_seed(0)
model = RewardModel(hidden_size=32)
opt   = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Two fake responses to the same prompt.  In reality these are tokenized
# strings produced by the SFT policy.
chosen   = torch.tensor([[5, 12, 88,  3, 41]])
rejected = torch.tensor([[5, 12, 88, 99, 17]])

for step in range(200):
    r_c = model(chosen)
    r_r = model(rejected)
    loss = bradley_terry_loss(r_c, r_r)

    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate once, without tracking gradients.
with torch.no_grad():
    r_c = model(chosen)
    r_r = model(rejected)

print(f"r_chosen  = {r_c.item():+.3f}")
print(f"r_rejected= {r_r.item():+.3f}")
print(f"margin    = {(r_c - r_r).item():+.3f}")
print(f"P(chosen) = {torch.sigmoid(r_c - r_r).item():.3f}")
```

The shape contract is the entire interface between Bradley-Terry and any transformer:

Tensor	Shape	What it is
chosen / rejected token_ids	(batch, seq)	Two responses to the same prompt
trunk output h	(batch, seq, hidden)	Transformer hidden states
last-token pool	(batch, hidden)	Response summary vector
scalar reward r	(batch,)	One number per response
margin r_c − r_r	(batch,)	Per-pair preference logit
loss	scalar	Mean of softplus(−margin)
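The “last-token pool” row hides one practical detail: with padded batches, the last real token is not at position -1. The following framework-free sketch shows the indexing for a single right-padded sequence. It assumes right padding and a 0/1 attention mask; the function names are hypothetical.

```python
def last_token_index(attention_mask):
    """Index of the last real token in a right-padded sequence (1 = token, 0 = pad)."""
    return sum(attention_mask) - 1

def pool_last_token(hidden_states, attention_mask):
    """Pick the hidden vector at the last non-padding position.

    hidden_states: list of per-position vectors, shape (seq, hidden).
    """
    return hidden_states[last_token_index(attention_mask)]

# One sequence of length 4 with a single padding position at the end.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

summary = pool_last_token(hidden_states, attention_mask)
print(summary)   # the vector at position 2, not the padding vector at position 3
```

In a tensor framework the same idea is usually expressed as a batched gather using the per-row mask sums. With left padding, position -1 is already the last real token and no lookup is needed.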

At Massive Scale: From One LLM Head to a 70B Reward Model

When the toy embedding becomes Llama-3 70B, three things change — but the loss does not.

Compute and memory

Both chosen and rejected are forwarded through the same 70B network. That is two forward passes and two backward passes per example. To cut per-call overhead, production trainers often concatenate chosen and rejected along the batch dimension and process them as one batch of size $2B$ ; this does not shrink activation memory, since both sequences must still be held for the backward pass. Activation checkpointing and ZeRO-3-style sharding are typically required, and a 70B reward model uses the same parallelism recipe as its SFT parent (see Chapter 11 — Tensor Parallelism and Chapter 12 — FSDP / ZeRO).

Data scale

InstructGPT's reward model was trained on roughly 33 000 prompts, each contributing several ranked comparisons. Llama 2 reports about 1.4 million Meta-collected comparisons. Anthropic's Helpful-Harmless dataset contains ~170 000 pairs. The data efficiency of Bradley-Terry (recall the toy example recovered 5 scores from 400 noisy comparisons) is part of what makes these comparatively small datasets viable for multi-billion-parameter policies.

Numerical stability

At 70B parameters trained in bf16, the rewards $r_c, r_r$ can drift to large magnitudes if unconstrained, because the loss only ever sees their difference. Two common tricks:

Reward regularization: add a small $\lambda \cdot (r_c^2 + r_r^2)$ term to keep rewards bounded, with a small coefficient such as $\lambda = 10^{-3}$ .
Margin loss: if the labeler gave a confidence rating (“strongly prefer” vs “slightly prefer”), subtract a tier-dependent margin $m$ inside the sigmoid, giving $-\log \sigma(r_c - r_r - m)$ , so that confidently labeled pairs are pushed further apart. This is the extension described for Llama 2's reward models.
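Both tricks are small changes to the per-pair loss. The sketch below combines them in one function; the name `bt_loss_with_margin`, its defaults, and the example numbers are assumptions made for illustration, not any framework's API.

```python
import math

def softplus(x: float) -> float:
    """log(1 + exp(x)) computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bt_loss_with_margin(r_c: float, r_r: float,
                        margin: float = 0.0,
                        reg_lambda: float = 0.0) -> float:
    """Per-pair BT loss with an optional confidence margin and L2 reward penalty.

    margin     : required extra gap m, so the loss is -log sigmoid(r_c - r_r - m)
    reg_lambda : weight of the penalty lambda * (r_c**2 + r_r**2)
    """
    preference_term = softplus(-(r_c - r_r - margin))
    penalty = reg_lambda * (r_c ** 2 + r_r ** 2)
    return preference_term + penalty

plain = bt_loss_with_margin(1.0, 0.0)
with_margin = bt_loss_with_margin(1.0, 0.0, margin=0.5)
centered = bt_loss_with_margin(0.5, -0.5, reg_lambda=1e-3)
drifted = bt_loss_with_margin(21.0, 20.0, reg_lambda=1e-3)
print(plain, with_margin, centered, drifted)
```

The margin raises the loss for a pair whose gap has not yet reached the required tier, while the penalty separates two reward pairs with the same gap: the centered pair is cheap, the drifted pair is not.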

The single most common production failure of Bradley-Terry training is not bad gradients; it is the dataset. A reward model is exactly as good as the preference data it sees, and pairs on which annotators split close to 50/50 carry very little usable signal. The serious engineering work happens in dataset curation, not in the loss function.

Engineering Reality: Ties, Noise, and Reward Hacking

What about ties?

Bradley-Terry as written has no concept of a tie: every comparison must produce a winner. In practice labelers can mark a pair as “about equal”, and two standard fixes exist:

Drop ties from training. Loses ~5-10 % of data but keeps the loss exact.
Rao-Kupper extension: introduce a tie probability parameter $\theta > 1$ so that $P(\text{tie})$ grows when the margin is small. Mathematically clean, almost never used in production because dropping ties just works.
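For readers who want to see the second option in numbers, here is a sketch of the Rao-Kupper probabilities in the score parameterization used in this chapter. Writing the threshold as a shift of $\ln \theta$ inside the sigmoid follows from the standard Rao-Kupper form with strengths $e^{r}$; the function name and example values are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def rao_kupper(r_a: float, r_b: float, theta: float):
    """Return (P(A wins), P(B wins), P(tie)) for a tie threshold theta >= 1."""
    shift = math.log(theta)
    p_a = sigmoid(r_a - r_b - shift)      # A must win by more than the threshold
    p_b = sigmoid(r_b - r_a - shift)
    p_tie = 1.0 - p_a - p_b
    return p_a, p_b, p_tie

for gap in (0.0, 1.0, 3.0):
    p_a, p_b, p_tie = rao_kupper(gap, 0.0, theta=1.5)
    print(f"gap={gap:.1f}  P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(tie)={p_tie:.3f}")
```

With `theta = 1` the tie probability is zero and the model reduces to plain Bradley-Terry; for larger `theta` ties are most likely when the two scores are equal and fade as the gap grows.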

Label noise and the 75 % ceiling

When you measure inter-annotator agreement on real RLHF datasets, you typically get ~75 %. This acts as a practical ceiling: for roughly one pair in four, two thoughtful labelers genuinely disagree about which output is better. A reward model that scores well above that figure on held-out pairs deserves suspicion rather than celebration; check for over-fitting or for overlap between training and validation pairs. Watch validation pairwise accuracy carefully; the sweet spot for early stopping is usually 72-74 %.
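The ceiling has a simple reading inside the Bradley-Terry model itself: even a scorer that knows the true rewards can only agree with a single noisy label with probability max(p, 1 − p). The sketch below computes that best case for a few made-up sets of true margins; `label_agreement_ceiling` is a hypothetical helper name.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def label_agreement_ceiling(true_margins):
    """Expected accuracy, against one noisy label per pair, of a scorer
    that ranks every pair by its true margin."""
    total = 0.0
    for m in true_margins:
        p = sigmoid(m)
        total += max(p, 1.0 - p)
    return total / len(true_margins)

close_calls = [0.2, -0.3, 0.5, -0.1]      # nearly tied pairs
clear_wins = [3.0, -2.5, 4.0, -3.5]       # obvious pairs
typical = [1.1, -1.1]                     # gap of about 1.1 either way

for name, margins in [("close", close_calls), ("clear", clear_wins), ("typical", typical)]:
    print(f"{name:8s} ceiling = {label_agreement_ceiling(margins):.3f}")
```

A true gap of about 1.1 corresponds to a ceiling of roughly 75 %, so under this model a dataset with ~75 % agreement behaves like one whose typical pair differs by about one unit of reward.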

Why this matters for RLHF and DPO

Bradley-Terry is also the assumption that lets Direct Preference Optimization (DPO) exist. DPO rewrites the BT loss in closed form against a reference policy, skipping the reward-model-then-PPO pipeline entirely. The trick only works because the BT family has a tractable closed-form ratio between the optimal policy and the reference — that derivation is the next section's job.
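As a preview, and without the derivation, the DPO objective for one pair can be sketched as the same softplus loss applied to an implicit reward, $\beta$ times the log-probability ratio between policy and reference. The function name, the default `beta`, and the log-probabilities below are illustrative assumptions.

```python
import math

def softplus(x: float) -> float:
    """log(1 + exp(x)) computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Bradley-Terry loss on implicit rewards beta * (log pi - log pi_ref).

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference policy
    """
    implicit_r_w = beta * (logp_w - ref_logp_w)
    implicit_r_l = beta * (logp_l - ref_logp_l)
    return softplus(-(implicit_r_w - implicit_r_l))

at_reference = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)
after_update = dpo_pair_loss(-8.0, -12.0, -10.0, -12.0)
print(at_reference, after_update)
```

When the policy equals the reference, both implicit rewards are zero and the loss is $\ln 2$, the coin-flip value. Raising the winner's log-probability relative to the reference lowers it.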

Reward hacking

Bradley-Terry produces a reward function that is correct on the DISTRIBUTION of responses the SFT policy generates. Push the RL policy too far and it will discover responses outside that distribution where the reward model is wrong — typically responses that look superficially confident, long-winded, or formatted with bullet points and headers, because those features correlate with “chosen” in the training set. This is reward hacking, and the standard defense is the KL penalty against the reference policy that PPO and DPO both bake in. The reward model itself is not the culprit — Bradley-Terry is doing exactly what it was asked. The problem is that the data did not cover what the policy can become.

One sentence to keep: Bradley-Terry takes a stream of binary clicks and turns them into a calibrated scalar reward via the sigmoid of the score gap. Every modern preference-trained LLM is built on this single line of 70-year-old math.