The Real Problem: Subjective Quality at the Scale of a Pretraining Run
A frontier RLHF run needs one million preference labels. Not ten thousand. Not a hundred thousand. A million, because the reward model has to cover every kind of question the chat model will face after release: math, code, summarisation, role-play, harmless refusals, creative writing, legal disclaimers, multilingual edge cases. A trained crowd-worker labels a tricky pairwise comparison in roughly two minutes. A million labels at two minutes each is roughly 33,000 person-hours, or about 16 full-time annotators for a year. That is the dominant cost of post-training a 405B chat model, and it is exactly the cost that scaled the slowest from GPT-3 (a few thousand labels) to GPT-4 (millions).
Section 14.4 covered the easy half: when the answer can be checked by a rule — a unit test, a regex on the final digit of an arithmetic problem, a JSON-schema validator — you do not need humans at all. Verifiable rewards solve about 30% of the post-training budget. The remaining 70% is the hard half: subjective dimensions with no obvious rule. Is this summary faithful? Is this refusal polite? Is this code idiomatic? Is this poem better than that one? There is no regex for “idiomatic.”
The answer to the hard half is the generative reward model, or “LLM-as-judge”: a language model that reads a rubric and returns the verdict a human annotator would otherwise supply. It is the labelling infrastructure that made post-training of frontier chat models economically possible. Every flagship system released after roughly 2023 uses it: Anthropic's Constitutional AI is LLM-as-judge with a constitution-shaped rubric, OpenAI's reward modelling pipeline mixes human labels with judge labels at roughly 1:10, and DeepSeek V3's post-training pipeline ran approximately five million judge verdicts to label the preference dataset that trained its reward model.
Intuition: A Model With a Rubric Is Already a Reward Function
Recall the loop from section 14.1: a reward model takes a prompt and a response and returns a scalar score. The classical reward model is a transformer with a small regression head on top, trained on human pairwise comparisons via Bradley-Terry (§14.2). It outputs one number per response.
The generative reward model replaces the regression head with plain text generation. You write a prompt that ends in “The better response is: ” and read off the logits of the very next token: “A” vs “B”. That is it. No new architecture. No new training data. No regression head. The model you already have, typically a strong instruction-tuned LLM, IS the reward function. You just have to ask it the right question.
The analogy is simple. A classical reward model is a thermometer: purpose-built to measure one thing, calibrated against a fixed scale. A generative reward model is a person holding a thermometer who can read the rubric on the wall, decide what number to report, and read a different rubric next week without being re-manufactured. The generative judge can be re-pointed at correctness, faithfulness, conciseness, safety, or humour just by changing the rubric in the prompt, and you never retrain the model.
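The prompt described above, the one that ends in “The better response is: ”, might be assembled roughly as follows. This is a minimal sketch: the function name `build_judge_prompt` and the exact wording of the template are illustrative assumptions, not a template taken from any particular pipeline.

```python
def build_judge_prompt(rubric: str, question: str, response_a: str, response_b: str) -> str:
    """Assemble a pairwise judging prompt that stops right before the verdict token."""
    return (
        "You are an impartial judge. Apply the rubric below.\n\n"
        f"Rubric: {rubric}\n\n"
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with A or B.\n"
        "The better response is: "
    )


# Same question and responses under two rubrics: only the prompt text changes,
# the judge model itself is untouched.
correctness_prompt = build_judge_prompt("be correct and concise", "What is 17 * 24?", "408", "418")
politeness_prompt = build_judge_prompt("be polite", "What is 17 * 24?", "408", "418")
```

Re-pointing the judge at a different quality dimension is a one-argument change here, which is the practical meaning of "reading a different rubric next week".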
The Mathematical Idea: Verdicts as Probabilities
Let $x$ be the user prompt, $y_A$ and $y_B$ the two candidate responses, and $r$ the rubric (a short text spec like “be correct and concise”). The judge is a language model parameterised by $\theta$; given any input it produces a distribution over the next token. We build the input as a templated prompt:
$$p = \mathrm{template}(r,\, x,\, y_A,\, y_B)$$
and read off the logits at exactly the next position. Let $z_A$, $z_B$, and $z_T$ be the judge's logits at the verdict position for the tokens “A”, “B”, and “Tie”. The pairwise verdict distribution is just softmax over those three:
$$P_\theta(v \mid p) = \frac{\exp(z_v)}{\exp(z_A) + \exp(z_B) + \exp(z_T)}, \qquad v \in \{A, B, T\}$$
Every symbol has a job. $x$ is the user question. $y_A$ and $y_B$ are the two candidate completions the chat model produced (often by sampling at two different temperatures, or one from the current policy and one from a reference). $r$ is the written rubric, the only knob the operator turns to change what “quality” means. $\theta$ is the judge's weights; they stay frozen. $z_v$ is the raw pre-softmax logit at the token id for verdict $v$. The denominator is the standard softmax normaliser restricted to the three verdict tokens.
We can extract a scalar reward from this distribution in three common ways. Writing $P(A)$, $P(B)$, and $P(T)$ for the judge's three verdict probabilities,
$$R_{\text{margin}} = P(A) - P(B)$$
is the “preference margin” reward (in $[-1, 1]$). It is what DPO consumes directly. Alternatively
$$R_{\text{log}} = \log P(A) - \log P(B)$$
is the log-probability reward, which slots cleanly into Bradley-Terry loss for a classical reward-model student (§14.3). And for pointwise scoring,
$$\hat{s} = \sum_{k=1}^{K} k \cdot P(\text{score} = k)$$
is the expected score on a fixed scale (commonly $K = 5$ or $K = 10$), computed by reading the judge's logits at the digit tokens. All three reductions are just different views of the same underlying next-token distribution.
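The three reductions named above (preference margin, log-probability reward, expected pointwise score) can be sketched in a few lines of standard-library Python. Two things here are assumptions rather than facts from the text: the log-probability reward is taken to be the log-odds log P(A) minus log P(B), and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_rewards(z_a, z_b, z_tie):
    """Turn the three verdict-token logits into the two pairwise reward views."""
    p_a, p_b, p_tie = softmax([z_a, z_b, z_tie])
    return {
        "p_a": p_a,
        "p_b": p_b,
        "p_tie": p_tie,
        "margin": p_a - p_b,                        # preference margin, in [-1, 1]
        "log_odds": math.log(p_a) - math.log(p_b),  # assumed form; equals z_a - z_b
    }


def expected_score(score_logits, scores):
    """Pointwise view: expectation of the score under softmax of the score-token logits."""
    probs = softmax(score_logits)
    return sum(s * p for s, p in zip(scores, probs))


rewards = pairwise_rewards(3.0, 0.6, -1.0)               # hypothetical logits
flat_score = expected_score([0.0] * 5, [1, 2, 3, 4, 5])  # a judge with no opinion
```

Note that the log-odds view does not depend on the tie logit at all, while the margin does: a judge that puts a lot of mass on Tie shrinks the margin toward zero.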
Three Judging Protocols: Pointwise, Pairwise, Listwise
There are three template families. They differ in what you ask the judge to do and what you do with the output. Choose by what the downstream consumer wants.
| Protocol | Judge sees | Judge outputs | Use when |
|---|---|---|---|
| Pointwise | one response | score 1–10 (or 1–5, or a soft probability) | you need an ABSOLUTE quality score per response (RL reward, leaderboard, filtering) |
| Pairwise | two responses, A and B | verdict A / B / Tie + optional margin | you need RELATIVE preferences (Bradley-Terry reward model, DPO dataset) |
| Listwise | k responses (k = 4–16) | ordered ranking, or a single best-of-k pick | you need to pick the best of many samples (rejection sampling, best-of-n inference) |
Pairwise is the most common because it is the easiest task for the judge: comparing two things is cognitively much easier than assigning an absolute score on a fixed scale. Pointwise scoring is notoriously noisy — even strong judges disagree with themselves when shown the same response twice, because the scale is fictional. There is no objective definition of a “7-out-of-10” response; the judge has to invent the scale at inference time. Pairwise has no such problem: you just need a partial order.
Listwise is the most efficient when you have many candidates and only need the best. A single 8-way listwise call is cheaper than the 28 pairwise calls you would need to fully rank the same eight responses. The catch is that listwise judges have systematic biases toward the first and last items in the list — the same primacy and recency effects you see in humans.
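The 28 in the comparison above is just the number of unordered pairs among eight responses. A small sketch of that count follows; the helper name `pairwise_calls_to_rank` is hypothetical, and the swap-averaged variant simply doubles the call count.

```python
from math import comb


def pairwise_calls_to_rank(k: int, swap_averaged: bool = False) -> int:
    """Judge calls needed to compare every pair among k responses."""
    calls = comb(k, 2)  # k * (k - 1) / 2 unordered pairs
    return 2 * calls if swap_averaged else calls


calls_for_eight = pairwise_calls_to_rank(8)
calls_for_eight_swapped = pairwise_calls_to_rank(8, swap_averaged=True)
calls_for_sixteen = pairwise_calls_to_rank(16)
```

The count grows quadratically in k, which is why listwise prompts become attractive at the upper end of the 4 to 16 range.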
Visualization: One Pairwise Verdict, End to End
The interactive widget below walks the full pairwise pipeline on a few small examples. Pick an example, slide the rubric weight, and hit the swap toggle. The chain-of-thought block shows the numbers the judge would produce internally, and the bottom row compares the biased and unbiased verdicts. Try the “short explanation” case with the rubric on pure correctness, then flip A and B. The position bias is the swing.
The Four Biases Every Judge Has
Every LLM judge, including frontier ones, has systematic biases that are far weaker in well-trained human annotators. There are four worth memorising, because every production judging pipeline contains a countermeasure for each.
| Bias | What happens | Typical magnitude | Countermeasure |
|---|---|---|---|
| Position bias | favours the first (or last) response shown | 5–15% verdict flip rate on close pairs | swap averaging: run (A→B) and (B→A), average log-probs |
| Verbosity bias | favours longer / more elaborate responses | long response wins ~20% of ties where it shouldn't | length-normalised scoring; explicit “concise” in rubric |
| Self-preference | judge model favours outputs from its own family | 5–10% inflation when judging own-family outputs | use a different model family for judging than for the policy |
| Sycophancy | favours responses that agree with assertions in the prompt | 10–25% on opinion / political prompts | neutral rubric phrasing; jury panels with diverse judges |
These biases are not symmetric. Position bias is the worst on close calls (which are precisely the calls the reward model needs most). Verbosity bias is the worst on instruction-following tasks where a one-line answer is correct (“What is 2+2?”). Self-preference is catastrophic if you let it compound: a Llama-judge labelling a Llama-policy's outputs will systematically push the policy further from the truth, and the damage is invisible until human evaluation.
Anthropic, Bai et al. (2022): “We observed that, even with carefully designed constitutions, single-judge verdicts showed measurable position bias. Mitigation via dual ordering brought judge-human agreement on close pairs from 0.68 to 0.81.” — almost every paper since reports the same effect at roughly the same magnitude.
Visualization: Agreement Under Bias Countermeasures
The heatmap below shows judge-vs-human agreement under three operating modes — raw one-pass, swap averaging, and swap + chain-of-thought — across four task difficulties and four judge sizes. The numbers are illustrative of the trends consistently reported in MT-Bench, Arena-Hard-Auto, and RewardBench: bigger judges help, swap averaging helps everywhere but most on adversarial prompts, and CoT helps on hard tasks but is a wash on easy ones.
Manual Numerical Walkthrough: One Verdict, A Swap, A Jury
Let's do a complete pairwise verdict by hand. The prompt is “What is 17 × 24?”, response A is “408” (correct), response B is “418” (a confident but wrong elaboration). We have one judge that produces verdict-token logits.
Step 1 — the judge in (A → B) order
With A shown first, the judge produces these logits at the verdict position:
Apply softmax to convert logits to probabilities. Subtract the max (3.0) for stability, exponentiate, divide by the sum:
The judge thinks A wins with probability . Confident, mostly right.
Step 2 — the judge in (B → A) order
We flip the order. The judge now sees response B first. Position bias gives a boost to whatever sits in slot 1, which is now B. Internally the judge produces (in slot order: first, second, tie):
Wait — but the underlying logits (before bias) would have been because the judge can still tell that A's answer is the right one. Position bias only adds to whichever response is FIRST. So the slot logits become .
Softmax over these:
Re-label slot → canonical (slot-1 = B, slot-2 = A, slot-3 = T):
Step 3 — swap averaging
Average the two log-probability vectors in canonical (A, B, T) space:
Re-exponentiate and renormalise: , , so .
The swap-averaged verdict says A wins with probability — between the two single-direction passes and . On this easy case the swap is a small correction. On a closer pair (logits 2.0 vs 1.6 instead of 3.0 vs 0.6) the same +0.4 bias would have flipped the verdict in one direction and not the other — and the swap average would have recovered the correct winner.
Step 4 — a three-judge jury
Suppose we run three different judges (a 7B, a 70B, and a 405B) on the same pair and get verdict probabilities:
A “hard” jury (majority vote on argmax verdicts) returns A — all three judges agree. A “soft” jury averages the probability vectors:
The soft jury verdict is — slightly more cautious than the 405B alone (which said 0.92), because the 7B judge pulls the average down. In practice, soft juries weighted by per-judge calibration accuracy (measured against a held-out human-labelled set) outperform both single-judge and equal-weight juries. This is the recipe behind Anthropic's “Multiple Judges” technique and DeepSeek's judge ensemble.
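The hard and soft juries described in Step 4 can be sketched as below. Only the 405B judge's 0.92 for A comes from the text; every other probability, and the calibration weights, are hypothetical values chosen for illustration.

```python
from collections import Counter

LABELS = ("A", "B", "Tie")

# Verdict probabilities in (A, B, Tie) order. Only the 0.92 is from the walkthrough.
JUDGE_VERDICTS = {
    "7B": [0.70, 0.22, 0.08],
    "70B": [0.86, 0.10, 0.04],
    "405B": [0.92, 0.05, 0.03],
}


def hard_jury(verdicts):
    """Majority vote over each judge's argmax label (ties go to the first label seen)."""
    votes = Counter(LABELS[max(range(len(LABELS)), key=lambda i: probs[i])] for probs in verdicts)
    return votes.most_common(1)[0][0]


def soft_jury(verdicts, weights=None):
    """Weighted average of the probability vectors; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(verdicts)
    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, verdicts)) / total
        for i in range(len(LABELS))
    ]


verdict_list = list(JUDGE_VERDICTS.values())
hard_verdict = hard_jury(verdict_list)
equal_weight = soft_jury(verdict_list)
# Hypothetical per-judge calibration accuracies measured on a human-labelled set.
calibrated = soft_jury(verdict_list, weights=[0.6, 0.8, 0.9])
```

With these numbers the equal-weight jury lands below the largest judge's own confidence, and weighting by calibration pulls the verdict back toward the better-calibrated judges.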
Plain Python: A Pairwise Judge From Logits
Now the same pipeline in pure NumPy. We replace the real transformer with a hand-built logit table so every number is visible, but the surrounding code is byte-for-byte what a real judge call does once you swap the forward function for a model forward.
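As a stand-in for that listing, here is a minimal sketch of the walkthrough pipeline. It uses only the standard library's `math` module so it runs anywhere; replacing the lists with NumPy arrays is mechanical. The 3.0 and 0.6 logits and the +0.4 first-slot bias follow the walkthrough, while the tie logit of -1.0 and the names `LOGIT_TABLE`, `fake_forward`, and `judge_pair` are assumptions made for this example.

```python
import math

LABELS = ("A", "B", "Tie")

# Hand-built stand-in for the judge: slot-ordered logits (first, second, tie).
# The reversed row adds the +0.4 position bias to whichever response is shown first.
LOGIT_TABLE = {
    ("408", "418"): [3.0, 0.6, -1.0],  # A shown first
    ("418", "408"): [1.0, 3.0, -1.0],  # B shown first: 0.6 + 0.4 bias
}


def fake_forward(first, second):
    """Replace this with a real model forward pass that returns verdict-token logits."""
    return LOGIT_TABLE[(first, second)]


def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def to_probs(log_probs):
    """Re-exponentiate and renormalise, returning a label -> probability dict."""
    return dict(zip(LABELS, (math.exp(v) for v in log_softmax(log_probs))))


def judge_pair(forward, response_a, response_b):
    """Run both presentation orders and swap-average in canonical (A, B, Tie) space."""
    forward_lp = log_softmax(forward(response_a, response_b))
    lp_first, lp_second, lp_tie = log_softmax(forward(response_b, response_a))
    reverse_lp = [lp_second, lp_first, lp_tie]  # slot 2 held A, slot 1 held B
    averaged_lp = [(f + r) / 2 for f, r in zip(forward_lp, reverse_lp)]
    return to_probs(forward_lp), to_probs(reverse_lp), to_probs(averaged_lp)


forward_pass, reverse_pass, averaged = judge_pair(fake_forward, "408", "418")
```

Averaging in log space and renormalising is equivalent to taking a normalised geometric mean of the two passes, which is why the swap-averaged probability for A falls between the two single-direction values.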
PyTorch: A Production-Shaped Pairwise Judge
The PyTorch version is what a real RLHF pipeline calls a million times to score the preference dataset. There is no new architecture: the judge is the same SFT'd Llama-3 70B you trained in chapter 13, used in inference mode. The whole “reward” is three log-probabilities at the next-token position.
At Massive Scale: A Million Verdicts for a Single Run
Three things change once you scale from “one verdict” to “one million verdicts for a 405B post-training run.”
1. Token volume dwarfs everything else in the pipeline
A swap-averaged pairwise verdict with a ~2000-token prompt and a short CoT trace consumes roughly $2 \times 2000 = 4000$ tokens. At one million verdicts that is $4 \times 10^9$ tokens: 4 billion tokens of judge inference. A 70B judge on a single H100 generates roughly 10 tokens/second per request; with vLLM continuous batching at batch 64 that is roughly 640 tokens/second/H100, or about $5.5 \times 10^7$ tokens/day. Labelling a million verdicts therefore takes ~70 H100-days of inference. That is the same order of magnitude as a 7B SFT training run, so judging is no longer cheap.
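The back-of-envelope estimate above can be reproduced as a short calculation. The per-request rate of 10 tokens/second is an assumption, picked because together with batch 64 it lands near the ~70 H100-days quoted; the remaining figures come from the paragraph.

```python
TOKENS_PER_CALL = 2_000          # prompt plus short CoT trace, per judge call
CALLS_PER_VERDICT = 2            # swap averaging: (A, B) and (B, A)
N_VERDICTS = 1_000_000
TOKENS_PER_SEC_PER_REQUEST = 10  # assumed single-request rate for a 70B judge on one H100
BATCH_SIZE = 64                  # concurrent requests under continuous batching
SECONDS_PER_DAY = 86_400

total_tokens = TOKENS_PER_CALL * CALLS_PER_VERDICT * N_VERDICTS
tokens_per_sec_per_gpu = TOKENS_PER_SEC_PER_REQUEST * BATCH_SIZE
tokens_per_gpu_day = tokens_per_sec_per_gpu * SECONDS_PER_DAY
h100_days = total_tokens / tokens_per_gpu_day

print(f"{total_tokens:.1e} tokens, {tokens_per_gpu_day:.2e} tokens/GPU-day, {h100_days:.0f} H100-days")
```

The estimate treats every token as if it were generated at decode speed; prompt tokens are usually processed faster than that, so this is closer to an upper bound than a measurement.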
2. The judge must be cheaper than the policy
When the policy is 70B+, the judge is usually smaller than the policy or the same size from a different family. A common recipe in 2024–2025 is a 70B policy judged by GPT-4-class through an API for the high-stakes labels, plus an 8B or 13B fine-tuned judge for the bulk of the dataset. The two are cross-validated against a held-out human-labelled set; the bulk judge keeps the labels where it agrees with the strong judge above a threshold and defers the rest. Judge cost typically lands at 5–15% of the total RLHF budget on a well-engineered pipeline.
3. Distributional drift is now your problem
The judge was trained on a fixed distribution of text. The policy is changing every RLHF step. After tens of thousands of steps the policy is producing responses that look nothing like anything the judge saw during its own training: long bulleted Markdown, oddly confident refusals, unusually structured code. The judge starts miscalibrating on these out-of-distribution outputs, and the standard symptom is reward-model hacking: the policy discovers a stylistic tic the judge mistakes for quality. Production runs mitigate this with periodic judge updates (re-distil the judge on recent policy outputs) and judge audits (re-label ~1% of pairs with humans every week and watch agreement drift).
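The weekly audit described above amounts to tracking one number, judge-human agreement on the re-labelled sample, and flagging weeks where it falls. A minimal sketch follows; the function names, the 3-point tolerance, and the weekly figures are all hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of audited pairs where the judge's verdict matches the human's."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def drift_alerts(weekly_agreement, baseline, tolerance=0.03):
    """Indices of weeks whose agreement fell more than `tolerance` below the baseline."""
    return [week for week, rate in enumerate(weekly_agreement) if baseline - rate > tolerance]


sample_rate = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
# Hypothetical audit history: week 2 dips well below the 0.80 baseline.
flagged_weeks = drift_alerts([0.80, 0.79, 0.74, 0.81], baseline=0.80)
```

A flagged week is a prompt to inspect what the policy has started producing, and possibly to re-distil the judge, rather than proof of reward hacking on its own.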
Engineering Reality: The Knobs That Actually Move Reward Quality
Years of post-training in industry have collapsed onto a small set of knobs that, ranked by how much they actually change the final chat model's benchmark score, look roughly like this:
- Rubric writing. The single highest-leverage knob. A 200-word rubric with three or four well-chosen failure modes can lift judge-human agreement by 10–15 absolute points over a one-line rubric. Worth more engineer-hours than almost any architectural decision. Anthropic's Constitutional AI paper is in large part a rubric-writing paper.
- Swap averaging. Two judge calls instead of one. Eliminates the bulk of position-bias errors. 2× cost, 3–8 point accuracy gain. Mandatory.
- Judge model size. A 70B judge consistently beats a 7B judge by 5–10 points; a frontier-class judge beats a 70B by another 3–5. Past that the curve flattens. Match the judge size to the importance of the labels: high for reward-model training data, lower for the rejection-sampling filter.
- Chain-of-thought before the verdict. Worth 2–6 points on hard reasoning tasks. Adds 30–80 tokens per call. Use on math, code, multi-hop QA; skip on simple format / safety checks.
- Few-shot examples in the prompt. 3–5 hand-picked examples of correctly-judged pairs lift agreement by another 2–4 points and stabilise variance across runs. Beware: too many shots and the judge starts pattern-matching the examples rather than reading the rubric.
- Jury panels. Mix two or three judges from different families with calibrated weights. Worth a final 1–3 points; mostly buys robustness against single-judge drift.
- Drop the “uncertain” pairs. If the preference margin is still close to zero after swap averaging (below a threshold you choose), discard the pair. Counter-intuitively this can improve the downstream reward model by removing the highest-label-noise rows: better to have 800k clean labels than 1M noisy ones.
Judge-human agreement matching human-human agreement is the milestone the field crossed somewhere around late 2023. It is what made it economically possible to push a model from “decent” to “frontier” in post-training: you can now afford to look at every single comparison, every single time. The next section moves to the algorithm that consumes those million labels: PPO, the workhorse RLHF objective.
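The last knob in the list, dropping uncertain pairs, can be sketched as a filter on the swap-averaged preference margin. The threshold of 0.2, the record layout, and the example probabilities are hypothetical; a real pipeline would tune the cutoff against a human-labelled set.

```python
def filter_confident_pairs(pairs, min_margin=0.2):
    """Keep pairs whose swap-averaged |P(A) - P(B)| clears the threshold, with a label."""
    kept = []
    for pair in pairs:
        margin = pair["p_a"] - pair["p_b"]
        if abs(margin) >= min_margin:
            kept.append({**pair, "preferred": "A" if margin > 0 else "B"})
    return kept


# Hypothetical swap-averaged verdicts for three pairs.
labelled_pairs = [
    {"id": 1, "p_a": 0.88, "p_b": 0.10},  # clear win for A
    {"id": 2, "p_a": 0.46, "p_b": 0.44},  # too close to call, dropped
    {"id": 3, "p_a": 0.20, "p_b": 0.75},  # clear win for B
]
clean_pairs = filter_confident_pairs(labelled_pairs)
```

Raising `min_margin` trades dataset size for label quality, which is the 800k-clean-versus-1M-noisy trade described in the bullet.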