The Real Problem: A Preference Number Has to Come From Somewhere
The previous section gave us a theory of preferences (Bradley-Terry) and a dataset of triples $(x, y_c, y_r)$, each saying “for this prompt $x$, a human preferred response $y_c$ over response $y_r$.” The theory says these preferences are consistent with an underlying scalar reward function $r(x, y)$. The dataset is finite. The space of possible responses at inference time is essentially infinite: a trillion plausible 200-token replies to any given prompt. We need a function that takes any new pair $(x, y)$ and returns a number that agrees, in expectation, with how a human would have voted.
That function does not write itself. We cannot hard-code it: “a good response is helpful and honest” is not an algorithm. We cannot table-lookup it: the table has $10^{12}$ rows. We have exactly one tool capable of compressing trillions of possible responses into a scalar with reasonable judgment: a large language model. So we take the same transformer that just finished supervised fine-tuning, rip off the head that predicts the next token, bolt on a tiny new head that predicts a single number, and train the resulting Frankenstein on the preference dataset with the Bradley-Terry log-likelihood. That creature is the reward model.
Intuition: A Tasting Judge with a Single Knob
Picture a wine judge whose entire job is to compare two glasses and point at the better one. They never have to score on an absolute scale (“this is an 87” is meaningless); what they can do reliably is say “this is better than that.” Over a few thousand such comparisons their internal scoring drifts into something coherent: a function from glass to a real number, where bigger means better. They cannot tell you the units. The function is only well-defined up to a constant shift. But it ranks every wine they have ever tasted, and it generalises: they can rank a new wine after one sniff.
A reward model is exactly that judge, with the transformer body as its nose, the scalar head as its mouth, and the preference dataset as its training. We never ask it “how good is this response?” because the answer would be unanchored. We only ever ask “which of these two is better?” during training, and the answer's sign is what matters. At inference time inside PPO, we will use the absolute number — but only as a relative signal compared to other responses the policy generates for the same prompt. The constant cancels itself out the moment the policy starts comparing.
Two implications fall out of this intuition immediately. First, the head needs only one output dimension: we never need a probability distribution over “quality classes,” just a real number on a line. Second, the head's bias term is useless: it would shift every reward by the same amount, and the loss only sees differences. Both of these facts will pop up in the math, the code, and the architecture diagram below, and they explain why a production-grade reward model can pair a 70-billion-parameter backbone with an 8,192-parameter head.
The Mathematical Idea: Pair-Level Bradley-Terry on a Scalar Head
Let the transformer body, parameterised by $\phi$, map an input sequence $(x, y)$ of $T$ tokens to a sequence of hidden vectors $h_1, \dots, h_T \in \mathbb{R}^d$. Let the scalar head be a single linear layer with weight vector $w \in \mathbb{R}^d$ (and, often, no bias). A pooling function $\mathrm{pool}(\cdot)$ collapses the per-token states into one vector. The reward is then a single inner product:

$$r_\theta(x, y) = w^\top \, \mathrm{pool}(h_1, \dots, h_T).$$

The full reward-model parameters are $\theta = (\phi, w)$: the backbone weights and the head weights. During training the body is also updated; we are not freezing the SFT checkpoint, we are continuing to fine-tune it for a new task.
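Given a preference dataset $\mathcal{D} = \{(x^{(i)}, y_c^{(i)}, y_r^{(i)})\}_{i=1}^{N}$ of prompts $x$ with a chosen response $y_c$ and a rejected response $y_r$, the Bradley-Terry log-likelihood from §14.2 becomes the training objective. For one pair, the loss is

$$\ell(x, y_c, y_r) = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big),$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid. We sum (or average) this over the dataset to get the full objective $\mathcal{L}(\theta)$. Writing $r_{\text{chosen}} = r_\theta(x, y_c)$ and $r_{\text{rejected}} = r_\theta(x, y_r)$, the gradient with respect to either reward is delightfully clean:

$$\frac{\partial \ell}{\partial r_{\text{chosen}}} = -(1 - p), \qquad \frac{\partial \ell}{\partial r_{\text{rejected}}} = +(1 - p),$$

where $p = \sigma(r_{\text{chosen}} - r_{\text{rejected}})$ is the model's current probability of preferring the chosen response. Three things are worth noticing here. First, the magnitude of every gradient is $1 - p$, a number between 0 and 1 that automatically shrinks for pairs the model is already confident on. Second, the two gradients are equal and opposite, so a bias term in the head would have gradient zero on every pair: the bias is unidentifiable. Third, the chain rule through the head gives

$$\nabla_w \ell = -(1 - p)\,\big(h_{\text{chosen}} - h_{\text{rejected}}\big),$$

where $h_{\text{chosen}}$ and $h_{\text{rejected}}$ are the pooled hidden states for the chosen and rejected responses. Geometrically: at every gradient-descent step, the head weight $w$ drifts in the direction $h_{\text{chosen}} - h_{\text{rejected}}$, scaled by how wrong the model currently is. Over the whole dataset, the direction of $w$ ends up (approximately) aligned with the average chosen-minus-rejected difference vector. That single observation is enough to explain nine out of ten reward-model bugs: if your data has a length bias, $h_{\text{chosen}} - h_{\text{rejected}}$ systematically points in the “longer responses” direction, and your head learns “longer = better.”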
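The two gradient claims above (the head gradient equals $-(1 - p)$ times the chosen-minus-rejected hidden-state difference, and a shared bias receives zero gradient) can be checked numerically. The following is a minimal sketch using only the standard library; the weight vector and hidden states are made-up illustrative values, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bt_loss(w, b, h_chosen, h_rejected):
    """Bradley-Terry loss for one pair, linear head w plus a shared bias b."""
    r_chosen = dot(w, h_chosen) + b
    r_rejected = dot(w, h_rejected) + b
    return -math.log(sigmoid(r_chosen - r_rejected))

def analytic_grad_w(w, h_chosen, h_rejected):
    """Closed form: -(1 - p) * (h_chosen - h_rejected)."""
    p = sigmoid(dot(w, h_chosen) - dot(w, h_rejected))
    return [-(1.0 - p) * (hc - hr) for hc, hr in zip(h_chosen, h_rejected)]

def numeric_grad_w(w, b, h_chosen, h_rejected, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        delta = (bt_loss(w_plus, b, h_chosen, h_rejected)
                 - bt_loss(w_minus, b, h_chosen, h_rejected))
        grad.append(delta / (2 * eps))
    return grad

# Illustrative values only (assumptions, not taken from a real model).
w = [0.3, -0.2, 0.1, 0.5]
h_chosen = [0.5, 1.0, -0.3, 0.2]
h_rejected = [0.1, 0.4, -0.1, -0.2]

g_analytic = analytic_grad_w(w, h_chosen, h_rejected)
g_numeric = numeric_grad_w(w, 0.0, h_chosen, h_rejected)
max_err = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))

# Finite-difference gradient with respect to the bias: the bias cancels
# in r_chosen - r_rejected, so this should be zero up to rounding.
eps = 1e-6
bias_grad = (bt_loss(w, eps, h_chosen, h_rejected)
             - bt_loss(w, -eps, h_chosen, h_rejected)) / (2 * eps)

print(f"max |analytic - numeric| = {max_err:.2e}")
print(f"bias gradient            = {bias_grad:.2e}")
```

Both printed numbers should be at the level of floating-point noise, which is the numerical counterpart of "the bias is unidentifiable."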
Architecture: Same Body, New Tiny Head
Reward-model architecture is almost embarrassingly minimal once you understand the math. You take the SFT checkpoint (the same one whose chat-template behaviour you trusted enough to run RLHF on top of) and replace exactly one layer: the language-model head that predicts the next token gives way to a scalar head that predicts a single number.
Two architectural details deserve their own paragraphs because they are where most of the variance between codebases lives.
Head shape: $\mathbb{R}^d \to \mathbb{R}$ with no bias
Every production reward model uses a single linear layer $\mathbb{R}^d \to \mathbb{R}$. No activation, no MLP, no LayerNorm: those would saturate the gradient for high-confidence pairs and slow training to a crawl. The bias is almost always omitted (see the math above for why it has zero gradient). For an 8B-class backbone (for example $d = 4{,}096$ in Llama-3-8B), the head is $d$ parameters, roughly $5 \times 10^{-7}$ of the model. Some experimental codebases use a two-layer MLP head, but the gain is marginal and the added depth makes the reward landscape harder for PPO to explore.
Initialisation: from the SFT model
The body is initialised from the SFT checkpoint, not from a base (pretrained-only) model. The reason is the chat template: a base model has never seen `<|start_header_id|>assistant<|end_header_id|>` as a high-frequency pattern, so its last-position hidden state is a random direction with respect to “a complete assistant turn ended here.” The SFT model has spent thousands of GPU-hours making that last-token state meaningful; we inherit that work for free. The head is initialised either to small random values or to zero; both put the initial reward at or near zero everywhere, which keeps PPO's downstream KL constraint sane on step 0.
Pooling: Which Hidden State Becomes the Reward?
The transformer body returns $T$ hidden states, one per token. We need exactly one number out of the head. Which token's hidden state should feed the head? Three answers have been tried; only one has survived.
| Pooling rule | How it works | Verdict |
|---|---|---|
| Last-token pooling (default) | Read the hidden state at the LAST non-pad position. With a chat template that ends in <|eot_id|>, this is the position whose hidden state had access (via causal attention) to the whole conversation. | ✅ The universal default — used by InstructGPT, Llama-2, Llama-3, Tülu, Qwen, Mistral, every open-source RM trainer. Cheap, correct, and the hidden state at EOT is the natural ‘summary token’ in a causal LM. |
| Mean pooling | Average the hidden states across all non-pad positions. Borrowed from BERT-style classification. | ⚠️ Worse than last-token by ~3 points of preference accuracy. Causal LMs concentrate semantic information at the last token; averaging dilutes that signal with mostly-irrelevant prompt tokens. |
| Attention pooling (a learned query token) | Add an extra trainable [REWARD] token at the end, let the model attend over the response with one new attention head, read off that token's hidden state. | 🔬 Marginally better on some benchmarks (~0.5 pts) at the cost of changing the tokenizer and re-pretraining the embedding. Not worth it for production. Used in a few research papers. |
The only subtle thing about last-token pooling is the padding side. If the tokenizer right-pads (default for most causal LMs), the last real token is somewhere in the middle of the tensor, not at index $-1$. You must read `hidden[batch_idx, attention_mask.sum(dim=1) - 1]`, not `hidden[:, -1, :]`. The PyTorch snippet later in this section gets this right; nearly every from-scratch RM implementation gets it wrong at least once.
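To make the padding-side point concrete, here is a small NumPy sketch of last-token pooling on a right-padded batch. The array names and the toy batch are assumptions for illustration; the same indexing pattern carries over to PyTorch tensors with `dim=` in place of `axis=`.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state at the last non-pad position of each row.

    hidden:         (batch, seq_len, d) array of per-token states
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    Assumes right padding, so real tokens occupy positions 0..n-1.
    """
    last_idx = attention_mask.sum(axis=1) - 1      # (batch,)
    batch_idx = np.arange(hidden.shape[0])         # (batch,)
    return hidden[batch_idx, last_idx]             # (batch, d)

# Toy batch: 2 sequences, max length 4, hidden size 3.
# Every state is filled with its own position index so the pick is easy to read.
hidden = np.tile(np.arange(4, dtype=float)[None, :, None], (2, 1, 3))
attention_mask = np.array([[1, 1, 1, 1],    # 4 real tokens, no padding
                           [1, 1, 0, 0]])   # 2 real tokens, 2 pads

pooled = last_token_pool(hidden, attention_mask)
naive = hidden[:, -1, :]                    # the buggy read: always the final column

print("positions read by mask-based pooling:", pooled[:, 0])
print("positions read by hidden[:, -1, :]:  ", naive[:, 0])
```

For the second row the mask-based read lands on position 1 (the last real token), while the naive read lands on position 3, which is a pad. With a left-padding tokenizer the situation reverses and `hidden[:, -1, :]` would be the correct read, so it is worth asserting the padding side explicitly in real code.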
Manual Numerical Walkthrough: One Gradient Step on One Pair
We will use a 4-dimensional toy where every number fits in your head. Walk through it with a pencil: once you have computed the gradient by hand, the production code becomes trivial to read.
The setup. Hidden dimension $d = 4$. One preference pair. The transformer body produced these two pooled hidden states (we are pretending the body is done; only the head trains in this walkthrough):
| Vector | h_1 | h_2 | h_3 | h_4 |
|---|---|---|---|---|
| h_chosen | 0.5 | 1.0 | −0.3 | 0.2 |
| h_rejected | 0.1 | 0.4 | −0.1 | −0.2 |
| difference (h_chosen − h_rejected) | 0.4 | 0.6 | −0.2 | 0.4 |
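Step 1: Initial head. $w = (0, 0, 0, 0)$, no bias. Both rewards are zero. The Bradley-Terry probability of preferring the chosen response is $p = \sigma(0) = 0.5$, and the loss is $-\log 0.5 = \ln 2 \approx 0.693$. This is the canonical starting loss for every RM run.
Step 2: Compute the gradient. Using the formula from above, $\nabla_w \ell = -(1 - p)\,(h_{\text{chosen}} - h_{\text{rejected}})$ with $p = 0.5$, giving $\nabla_w \ell = -0.5 \cdot (0.4, 0.6, -0.2, 0.4) = (-0.20, -0.30, 0.10, -0.20)$.
Step 3: One SGD step with learning rate $\eta = 1.0$ (chosen large so the numbers move visibly). The new head is $w \leftarrow w - \eta \nabla_w \ell = (0.20, 0.30, -0.10, 0.20)$.
Step 4: Re-score both responses.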
| Quantity | Calculation | Value |
|---|---|---|
| r_chosen | 0.20·0.5 + 0.30·1.0 + (−0.10)·(−0.3) + 0.20·0.2 | 0.10 + 0.30 + 0.03 + 0.04 = 0.47 |
| r_rejected | 0.20·0.1 + 0.30·0.4 + (−0.10)·(−0.1) + 0.20·(−0.2) | 0.02 + 0.12 + 0.01 − 0.04 = 0.11 |
| margin (r_chosen − r_rejected) | 0.47 − 0.11 | 0.36 |
| P(chosen ≻ rejected) | σ(0.36) | ≈ 0.589 |
| loss | −log 0.589 | ≈ 0.529 |
What happened in one step. The model went from random guessing (P = 0.500) to preferring the right response with 58.9% probability: a 9-percentage-point shift from a SINGLE gradient step on a SINGLE pair. The head weights point in exactly the direction $h_{\text{chosen}} - h_{\text{rejected}}$ (compare row 3 of the first table with the new $w$: they are proportional). Repeat this for tens of thousands of pairs over a real backbone and you have a working reward model.
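The four steps of the walkthrough can be replayed in a few lines of standard-library Python, which is a convenient way to check pencil arithmetic. This is a sketch of the same toy setup (head only, zero init, learning rate 1.0), not production code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Pooled hidden states from the first table.
h_chosen = [0.5, 1.0, -0.3, 0.2]
h_rejected = [0.1, 0.4, -0.1, -0.2]

# Step 1: zero-initialised head, no bias.
w = [0.0, 0.0, 0.0, 0.0]
lr = 1.0
p0 = sigmoid(dot(w, h_chosen) - dot(w, h_rejected))
loss0 = -math.log(p0)

# Step 2: gradient of the pair loss with respect to the head weights.
diff = [hc - hr for hc, hr in zip(h_chosen, h_rejected)]
grad = [-(1.0 - p0) * d for d in diff]

# Step 3: one plain SGD step.
w = [wi - lr * gi for wi, gi in zip(w, grad)]

# Step 4: re-score both responses with the new head.
r_chosen = dot(w, h_chosen)
r_rejected = dot(w, h_rejected)
margin = r_chosen - r_rejected
p1 = sigmoid(margin)
loss1 = -math.log(p1)

print(f"initial: P = {p0:.3f}, loss = {loss0:.3f}")
print(f"new head w = {[round(x, 2) for x in w]}")
print(f"r_chosen = {r_chosen:.2f}, r_rejected = {r_rejected:.2f}, margin = {margin:.2f}")
print(f"updated: P = {p1:.3f}, loss = {loss1:.3f}")
```

The printed values line up with the second table: rewards 0.47 and 0.11, margin 0.36, probability about 0.589, loss about 0.529.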
What does NOT happen. The bias (if there were one) does not move. The body (if we were training it too) would also move, but in a slower, more diffuse way: each hidden state has its own gradient pulling it in opposite directions, making the body slightly more sensitive to features that distinguish good from bad responses.
The big asterisk. Real training uses AdamW (with its second-moment scaling), gradient clipping at norm 1.0, lr ≈ 1e-5 instead of 1.0, and bf16 mixed precision. The math is identical; the magnitudes are smaller per-step but accumulate over tens of thousands of steps. The walkthrough above is exactly what happens inside AdamW's update, just before the second-moment re-scaling.
Visualizing the Bradley-Terry Loss Landscape
[Interactive figure: two sliders set the chosen and rejected rewards. Against the margin $m = r_{\text{chosen}} - r_{\text{rejected}}$, the plot shows the preference probability $\sigma(m)$ (dashed blue line) and the loss $-\log \sigma(m)$ (solid purple line); a moving dot tracks the current margin, and arrows next to the sliders show the gradient pressure on each reward.]
Three behaviours are worth feeling rather than just reading. First, the loss is asymmetric in the margin: at a margin of $-3$ the loss is about 3.05; at $+3$ it is about 0.05. A confidently wrong prediction costs about 60× more than an equally confident right prediction, which is exactly the property you want when you are stamping out mis-rankings. Second, once the margin is large and positive the gradient is numerically zero (at $+10$ it is already below $10^{-4}$), and so is the optimizer's pressure to learn from this pair. Easy pairs self-mute; only the close decisions keep contributing. Third, the gradient on the chosen reward is the exact negation of the gradient on the rejected reward, which is why a bias term in the head has gradient zero on every pair, a fact you can read straight off the arrows next to the sliders.
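Without the sliders, the same behaviours can be read off a short table of numbers. The sketch below evaluates the pair loss $-\log \sigma(m)$ and the gradient magnitude $1 - \sigma(m)$ at a few margins $m$; the chosen margins are illustrative, and the loss is written in a softplus form so that large margins do not overflow.

```python
import math

def bt_loss(margin):
    """-log(sigmoid(margin)), computed as softplus(-margin) for stability."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def grad_magnitude(margin):
    """|d loss / d r_chosen| = 1 - sigmoid(margin) = sigmoid(-margin)."""
    if margin >= 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))

for m in (-3.0, 0.0, 3.0, 10.0, 20.0):
    print(f"margin {m:+5.1f}: loss = {bt_loss(m):.4f}, |grad| = {grad_magnitude(m):.2e}")

# How much more a confidently wrong pair costs than an equally confident right one.
ratio = bt_loss(-3.0) / bt_loss(3.0)
print(f"loss(-3) / loss(+3) = {ratio:.1f}")
```

The ratio comes out a little above 60, and the gradient magnitude falls off exponentially with the margin, which is the "easy pairs self-mute" effect in numeric form.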
Plain Python: A Reward Model in Under 80 Lines
Below is a complete reward-model training step in plain Python, using only the standard library. We use a 4-dimensional “hidden state” in place of the transformer body, but the loss, the gradient computation, and the update are identical to what runs inside HuggingFace TRL on Llama-3-8B. The moment you can read this code, you can read the production code.
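The original listing is not reproduced here, so what follows is a reconstruction sketch rather than the author's script. It trains a bias-free scalar head with plain SGD on synthetic 4-dimensional "pooled hidden states". The data generator (`make_pairs`), the hidden `judge_w` direction, and the hyperparameters are all assumptions chosen so the example runs in a fraction of a second.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, h):
    """Scalar head, no bias: one inner product on the pooled hidden state."""
    return dot(w, h)

def pair_loss_and_grad(w, h_chosen, h_rejected):
    """Bradley-Terry loss for one pair and its gradient with respect to w."""
    p = sigmoid(reward(w, h_chosen) - reward(w, h_rejected))
    loss = -math.log(p)
    grad = [-(1.0 - p) * (hc - hr) for hc, hr in zip(h_chosen, h_rejected)]
    return loss, grad

def train_epoch(w, pairs, lr):
    """One pass of plain SGD over the pairs; returns the new head and mean loss."""
    total = 0.0
    for h_chosen, h_rejected in pairs:
        loss, grad = pair_loss_and_grad(w, h_chosen, h_rejected)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        total += loss
    return w, total / len(pairs)

def accuracy(w, pairs):
    """Fraction of pairs where the chosen response gets the higher reward."""
    hits = sum(reward(w, hc) > reward(w, hr) for hc, hr in pairs)
    return hits / len(pairs)

def make_pairs(n, judge_w, rng):
    """Hypothetical synthetic data: a hidden 'judge' direction labels each pair."""
    pairs = []
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0) for _ in judge_w]
        b = [rng.gauss(0.0, 1.0) for _ in judge_w]
        pairs.append((a, b) if dot(judge_w, a) >= dot(judge_w, b) else (b, a))
    return pairs

rng = random.Random(0)
judge_w = [1.0, -0.5, 0.25, 2.0]      # made-up ground truth, not from the text
train_pairs = make_pairs(200, judge_w, rng)
heldout_pairs = make_pairs(200, judge_w, rng)

w = [0.0] * 4                          # zero init, so the very first pair loss is ln 2
losses = []
for epoch in range(20):
    w, mean_loss = train_epoch(w, train_pairs, lr=0.1)
    losses.append(mean_loss)

heldout_acc = accuracy(w, heldout_pairs)
print(f"mean loss: epoch 1 = {losses[0]:.3f}, epoch 20 = {losses[-1]:.3f}")
print(f"held-out pair accuracy = {heldout_acc:.3f}")
```

Because the synthetic labels are noise-free, the held-out accuracy here climbs far higher than a real preference dataset would allow. The point of the sketch is the shape of the loop: per-pair margin, sigmoid, gradient along the difference vector, update.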
PyTorch: A Production Reward Model on Top of a Transformer
The PyTorch version stacks the chosen and rejected sequences into a single forward pass, finds the last non-pad position with the attention mask, runs the scalar head, and computes the BT loss with F.logsigmoid. Strip the optimizer plumbing and you can fit the whole reward model in 60 lines.
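The listing itself is missing from this copy, so the following is a hedged PyTorch sketch of the steps just described, not the author's code. The `ToyRewardModel` class is hypothetical, and its `nn.Embedding` "body" is a placeholder for a real causal transformer: it does not attend over the sequence, so only the tensor plumbing (stacking, last-non-pad gather, scalar head, `F.logsigmoid` loss, gradient clipping) should be taken as representative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in reward model; an embedding table plays the transformer body."""

    def __init__(self, vocab_size=100, d_model=16):
        super().__init__()
        self.body = nn.Embedding(vocab_size, d_model)   # placeholder for the backbone
        self.head = nn.Linear(d_model, 1, bias=False)   # scalar head, no bias
        nn.init.zeros_(self.head.weight)                # initial reward is exactly zero

    def forward(self, input_ids, attention_mask):
        hidden = self.body(input_ids)                           # (B, T, d)
        last_idx = attention_mask.sum(dim=1) - 1                # (B,), right padding assumed
        rows = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[rows, last_idx]                         # (B, d)
        return self.head(pooled).squeeze(-1)                    # (B,)

def bradley_terry_loss(model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Stack chosen and rejected along the batch axis and run one forward pass."""
    ids = torch.cat([chosen_ids, rejected_ids], dim=0)          # (2B, T)
    mask = torch.cat([chosen_mask, rejected_mask], dim=0)       # (2B, T)
    rewards = model(ids, mask)                                  # (2B,)
    r_chosen, r_rejected = rewards.chunk(2, dim=0)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
model = ToyRewardModel()

# Two made-up pairs, right-padded with token id 0 to a common length of 4.
chosen_ids = torch.tensor([[5, 6, 7, 8], [9, 10, 0, 0]])
chosen_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
rejected_ids = torch.tensor([[5, 6, 11, 0], [9, 12, 13, 14]])
rejected_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])

loss = bradley_terry_loss(model, chosen_ids, chosen_mask, rejected_ids, rejected_mask)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

print(f"initial loss = {loss.item():.4f} (ln 2 = {math.log(2):.4f})")
```

With the zero-initialised head every reward starts at zero, so the first loss equals $\ln 2$. In a real run the placeholder body would be the SFT checkpoint's transformer and the loop would add an optimizer step after the clipping call.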
From toy script to real RM training
The script above is approximately what runs inside `trl.RewardTrainer`, `OpenRLHF.RewardModel`, and `torchtune.training.recipes.reward_modeling`, minus the FSDP wrap, the gradient checkpointing, the bf16 mixed-precision casts, and the distributed data loader. The per-pair forward and the loss function are the same in substance. Once those tensors come out, scaling up is the same engineering problem as scaling up SFT: shard the optimizer state, checkpoint activations every 4 layers, and use a fused AdamW kernel.
At Massive Scale: Why Reward Modeling Is the Bottleneck of RLHF
For a 7B-class model an RM run is a 12-hour, 8-GPU job and looks almost identical to SFT. At the frontier scale the picture changes completely, in three ways that each set their own engineering agenda.
Memory: two forward passes per pair, double the activations
SFT activations for a 70B model on a 4,096-token sequence already push a single H100 to 70 GB at bf16. RM training doubles that because every pair contains TWO sequences (chosen and rejected) and we stack them along the batch dimension to keep dropout / RNG identical. Concretely, a per-GPU batch of $B$ pairs produces a $2B \times T \times d$ activation tensor, the same memory footprint as SFT with batch size $2B$. Almost every RM run uses activation checkpointing every 1-4 layers as a hard requirement, not an optimisation.
Compute: $C \approx 6ND$ with $D$ being pair-tokens
The standard scaling-law FLOPs estimate $C \approx 6ND$ applies, with $N$ the parameter count and $D$ the number of training tokens. For an RM, $D$ is the total number of tokens across BOTH sequences in every pair, typically 1.5–2× larger than an SFT corpus with the same number of examples. A 70B RM trained on 500k preference pairs is roughly $6 \times (7 \times 10^{10}) \times D$ FLOPs, a few days on a 256-H100 island. That is significant compute, and it dwarfs the cost of the actual PPO run that consumes the RM downstream.
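The $C \approx 6ND$ estimate is easy to turn into a back-of-the-envelope calculator. In the sketch below the parameter count and pair count come from the text, while the average sequence length (`tokens_per_sequence = 1_024`) is purely an assumption, since the text does not state one; change it to match your own data.

```python
def pair_tokens(n_pairs, tokens_per_sequence):
    """D for a reward model: each pair contributes a chosen and a rejected sequence."""
    return 2 * n_pairs * tokens_per_sequence

def training_flops(n_params, n_tokens):
    """C ~= 6 * N * D, the usual dense-transformer forward + backward estimate."""
    return 6 * n_params * n_tokens

n_params = 70e9                  # 70B backbone (from the text)
n_pairs = 500_000                # 500k preference pairs (from the text)
tokens_per_sequence = 1_024      # ASSUMPTION: average tokens per prompt+response

d_tokens = pair_tokens(n_pairs, tokens_per_sequence)
flops = training_flops(n_params, d_tokens)

print(f"D = {d_tokens:.3e} pair-tokens")
print(f"C = {flops:.3e} FLOPs")
```

The result scales linearly in the assumed sequence length, so doubling the average length doubles the estimate. Wall-clock time additionally depends on achieved hardware utilisation, which this sketch deliberately leaves out.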
Communication: small batches, low utilisation
Preference data is expensive to collect (a few dollars per pair from good annotators), so corpora are small — 50k to 1M pairs is typical, compared with 5M+ for SFT. Small datasets force small global batches (you cannot afford to step over the data many times), and small batches mean the all-reduce after each step is a larger fraction of total time. On 256 H100s an RM step might be 40% bubble, 60% compute, compared with 15/85 for an SFT step on the same hardware. Frontier teams compensate by overlapping the all-reduce with the backward pass and by tightening gradient_accumulation_steps to keep the optical-link payload large.
The reward-hacking budget
A reward model that scores 78% on held-out preferences sounds good, but PPO will then push the policy through a large number of rollout samples per update, and the policy will quickly discover the 22% of cases where the RM is wrong, especially those where it is wrong in a systematic way. The bigger the RM, the more expressive its preference function, the slower this “reward hacking” sets in. Frontier teams almost always choose an RM at least as large as the policy it scores. Anthropic reported scaling RM and policy roughly together in the Claude papers; OpenAI did the same with GPT-4 RLHF; Llama-3 used a 70B RM for the 70B policy.
Engineering Reality: The Catalogue of Reward-Model Pathologies
Reward models fail silently. The training loss looks fine — it slides from 0.693 down to 0.45 and plateaus — and standard ML hygiene tells you the run is healthy. Then PPO produces a model that writes 800-word responses to “hi,” or hedges every answer with “as an AI assistant,” or refuses anything that smells like a question. The pathology was in the RM all along; PPO just amplified it. The list below is what every RM engineer ends up memorising the hard way.
- Length bias. Chosen responses are systematically longer than rejected ones in the training data (because annotators write more when they like a response). The RM learns “longer = better,” and PPO blows the policy's response length out by 2–3×. Catch: regress reward against length on a held-out set; if length explains a substantial share of the reward variance, you have a length bias. Fix: length-control the training set or add a length penalty to the loss.
- Padding-side bug. The code reads `hidden[:, -1, :]` when the tokenizer right-pads. The reward becomes a function of the pad-token hidden state, which is nearly constant across rows. Loss looks healthy but accuracy hovers at 50%. Fix: always use `attention_mask.sum(dim=1) - 1`.
- Stale chat template. The RM uses Llama-3's template; the PPO rollout code uses Llama-2's. The hidden state at “the last non-pad token” means different things in the two formats, so the reward distribution is shifted by ~1 standard deviation between training and serving. PPO chases the shift.
- Reward exploding to ±∞. Without grad clipping, one mislabeled pair with a 12-token chosen response and a 4000-token rejected response produces a huge difference vector and a huge gradient on $w$. The head NaNs. Fix: `clip_grad_norm_(..., max_norm=1.0)` on every step.
- Distribution shift onto the policy. The RM was trained on responses written by humans; PPO rolls out responses written by the policy. After 1000 PPO steps the policy's style is far enough from the human style that the RM is no longer in distribution. Fix: collect a small batch of fresh preferences on the current policy's outputs every few hundred PPO steps and continue training the RM. This is the “online RM” loop and is part of every frontier post-training pipeline.
- Single-annotator overfitting. 70% of your preference data came from one contractor who has strong opinions about Oxford commas. The RM learns to obsess over Oxford commas. Fix: aggregate at least 3 annotators per pair and use the majority vote; monitor inter-annotator agreement on a calibration set.
- Same-prompt leak. The dev set and train set share some prompts (different responses). The RM looks great on dev (~85% accuracy) but is overfit to specific prompts. PPO discovers this immediately. Fix: split by prompt hash, not by row index.
- Bias term left on. Someone enabled `bias=True` on the head. The bias has gradient zero on every pair, so it stays at its random init value forever, adding a constant ~0.1 to every reward. Harmless in training; sometimes causes weird numerical comparisons at serving time if downstream code looks at the absolute reward. Fix: `bias=False` always.
- Two-tower failure mode. A junior engineer runs chosen and rejected through two separate forward passes (different RNG seeds, different dropout masks). The estimated margin is much noisier than necessary; training converges 30% slower. Fix: stack and run once.
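Two of the checks in the list above are simple enough to sketch in standard-library Python: the length-bias regression and the split by prompt hash. The function names, the toy numbers, and the 10% dev fraction are illustrative assumptions, and the sketch reports $R^2$ without applying any particular threshold.

```python
import hashlib

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def length_bias_r2(lengths, rewards):
    """R^2 of the one-variable linear fit reward ~ a + b * length."""
    return pearson_r(lengths, rewards) ** 2

def split_by_prompt(prompt, dev_fraction=0.1):
    """Route a row to 'dev' or 'train' from a hash of its prompt, so every
    response to the same prompt lands on the same side of the split."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000     # uniform in [0, 1)
    return "dev" if bucket < dev_fraction else "train"

# Made-up held-out scores: response lengths in tokens and two sets of rewards.
lengths = [10, 50, 120, 300, 800]
rewards_length_driven = [0.01 * n for n in lengths]     # reward is pure length
rewards_mixed = [0.4, -0.1, 0.9, 0.2, 0.5]              # no obvious length trend

r2_biased = length_bias_r2(lengths, rewards_length_driven)
r2_mixed = length_bias_r2(lengths, rewards_mixed)
print(f"R^2 (length-driven rewards) = {r2_biased:.3f}")
print(f"R^2 (mixed rewards)         = {r2_mixed:.3f}")

# The same prompt always maps to the same side, whatever the response is.
print(split_by_prompt("How do I reverse a list in Python?"))
```

Hash-based routing means the split is deterministic across runs and machines, which row-index shuffling does not guarantee once the dataset is re-exported or deduplicated.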
The frontier-team practice for catching these is a small, hand-curated battery of RM unit tests: a handful of trivial pairs (“helpful, accurate answer” vs. “refusal”, “direct answer” vs. “800 words of waffling”, “correct math” vs. “subtly wrong math”) on which any working RM should produce a margin of at least +1. Running these every checkpoint catches 80% of the pathologies above before they leak into PPO.
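A battery like this can be a very small harness. The sketch below assumes a hypothetical `score(prompt, response) -> float` callable wrapping whatever reward model is under test; here it is replaced by a deliberately length-biased stub so that the example runs on its own and shows one test passing and one failing.

```python
def run_rm_unit_tests(score, cases, min_margin=1.0):
    """Return the (name, margin) of every case whose margin falls below min_margin.

    score: callable (prompt, response) -> float, a wrapper around the reward model
    cases: iterable of (name, prompt, better_response, worse_response)
    """
    failures = []
    for name, prompt, better, worse in cases:
        margin = score(prompt, better) - score(prompt, worse)
        if margin < min_margin:
            failures.append((name, margin))
    return failures

def length_biased_stub(prompt, response):
    """Stand-in scorer with a built-in length bias: 0.2 reward per word."""
    return 0.2 * len(response.split())

cases = [
    ("helpful vs refusal",
     "What is 2 + 2?",
     "2 + 2 equals 4, because adding two and two gives four in ordinary arithmetic.",
     "I cannot help with that."),
    ("direct vs waffle",
     "hi",
     "Hello! How can I help?",
     "Well, " + "it depends " * 400),
]

failures = run_rm_unit_tests(length_biased_stub, cases)
for name, margin in failures:
    print(f"FAILED {name}: margin = {margin:+.1f}")
```

The stub passes the first case only because the helpful answer happens to be longer, and fails the second for the same reason. That is the pattern the battery is meant to expose before a checkpoint reaches PPO.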
The mental model that unifies this section: a reward model is a tasting judge with a single knob. The transformer body is the palate; the scalar head is the knob; Bradley-Terry is the training protocol. Every detail in this section — pooling, no-bias head, last-token indexing, grad clipping — is a guardrail against the judge being fooled by something other than the response's actual quality. Get the guardrails right and the next section (§14.4, rule-based rewards) tells you how to teach the judge new categories beyond what preference data can reach.