Chapter 14

Generative Reward Models (LLM-as-Judge)

Reward Modeling and RLHF

The Real Problem: Subjective Quality at the Scale of a Pretraining Run

A frontier RLHF run needs one million preference labels. Not ten thousand. Not a hundred thousand. A million, because the reward model has to cover every kind of question the chat model will face after release: math, code, summarisation, role-play, harmless refusals, creative writing, legal disclaimers, multilingual edge cases. A trained crowd-worker labels a tricky pairwise comparison in roughly two minutes. A million labels at two minutes each is $2 \times 10^6 / 60 \approx 33{,}000$ person-hours, about 16 full-time annotators for a year. That is the dominant cost of post-training a 405B chat model, and it is exactly the cost that scaled the slowest from GPT-3 (a few thousand labels) to GPT-4 (millions).

Section 14.4 covered the easy half: when the answer can be checked by a rule — a unit test, a regex on the final digit of an arithmetic problem, a JSON-schema validator — you do not need humans at all. Verifiable rewards solve about 30% of the post-training budget. The remaining 70% is the hard half: subjective dimensions with no obvious rule. Is this summary faithful? Is this refusal polite? Is this code idiomatic? Is this poem better than that one? There is no regex for “idiomatic.”

The core insight: a sufficiently strong instruction-tuned LLM, given a written rubric, can read two responses and pick the winner with accuracy that approaches a trained human labeller — but at a thousandth of the cost and a hundred thousandth of the latency. The judge is not magic; it is just another model running inference. But that inference happens in milliseconds, not minutes, and is parallelisable across thousands of GPUs.

This is the generative reward model — “LLM-as-judge.” It is the labelling infrastructure that made post-training of frontier chat models economically possible. Every flagship system released after roughly 2023 uses it: Anthropic's Constitutional AI is LLM-as-judge with a constitution-shaped rubric, OpenAI's reward modelling pipeline mixes human labels with judge labels at roughly 1:10, and DeepSeek V3's post-training pipeline ran approximately five million judge verdicts to label the preference dataset that trained its reward model.

Intuition: A Model With a Rubric Is Already a Reward Function

Recall the loop from section 14.1: a reward model takes a prompt and a response and returns a scalar score. The classical reward model is a transformer with a small regression head on top, trained on human pairwise comparisons via Bradley-Terry (§14.2). It outputs one number per response.

The generative reward model replaces the regression head with plain text generation. You write a prompt that ends in “The better response is: ” and read off the logits of the very next token: “A” vs “B”. That is it. No new architecture. No new training data. No regression head. The model you already have, typically a strong instruction-tuned LLM, IS the reward function. You just have to ask it the right question.

Why this works: a large instruction-tuned LLM has absorbed millions of examples of humans evaluating quality during SFT (helpful-vs-unhelpful), and millions more during pretraining (Reddit comments saying “this answer is better because …”). That latent “quality detector” is exactly what a reward model wants to be. The judge prompt is a tiny piece of scaffolding that surfaces it.

The analogy is simple. A classical reward model is a thermometer: purpose-built to measure one thing, calibrated against a fixed scale. A generative reward model is a person holding a thermometer who can read the rubric on the wall, decide what number to report, and read a different rubric next week without being re-manufactured. The generative judge can be re-pointed at correctness, faithfulness, conciseness, safety, or humour just by changing the rubric in the prompt, and you never retrain the model.

The Mathematical Idea: Verdicts as Probabilities

Let $x$ be the user prompt, $y_A$ and $y_B$ the two candidate responses, and $r$ the rubric (a short text spec like “be correct and concise”). The judge is a language model parameterised by $\theta$; given any input it produces a distribution over the next token. We build the input as a templated prompt:

$$\text{prompt} = \text{template}(x, y_A, y_B, r) \,\Vert\, \text{[The better response is]}$$

and read off the logits at exactly the next position. Let $\ell_A,\; \ell_B,\; \ell_T$ be the judge's logits at the verdict position for the tokens $\texttt{A}$, $\texttt{B}$, and $\texttt{Tie}$. The pairwise verdict distribution is just softmax over those three:

$$P_\theta(\text{winner} = k \mid x, y_A, y_B, r) = \frac{\exp(\ell_k)}{\exp(\ell_A) + \exp(\ell_B) + \exp(\ell_T)}, \quad k \in \{A, B, T\}$$

Every symbol has a job. $x$ is the user question. $y_A, y_B$ are the two candidate completions the chat model produced (often by sampling at two different temperatures, or one from the current policy and one from a reference). $r$ is the written rubric, the only knob the operator turns to change what “quality” means. $\theta$ is the judge's weights; they stay frozen. $\ell_k$ is the raw pre-softmax logit at the appropriate token id. The denominator is the standard softmax normaliser restricted to the three verdict tokens.
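The restricted softmax can be sketched in a few lines of plain Python. This is only an illustration of the formula above: the function name `verdict_distribution` and the sample logits are assumptions made for the example, not part of any library.

```python
import math

def verdict_distribution(logit_a: float, logit_b: float, logit_tie: float) -> dict:
    """Softmax restricted to the three verdict tokens A, B, Tie."""
    logits = {"A": logit_a, "B": logit_b, "Tie": logit_tie}
    m = max(logits.values())                       # subtract the max for stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())                         # normaliser over the three tokens only
    return {k: e / z for k, e in exps.items()}

# Illustrative logits: the judge strongly prefers A.
probs = verdict_distribution(3.0, 0.6, 0.4)
print({k: round(p, 3) for k, p in probs.items()})
# {'A': 0.858, 'B': 0.078, 'Tie': 0.064}
```

Every other vocabulary entry is ignored, so the three probabilities sum to 1 by construction even though the full-vocabulary mass on these tokens is usually a little less than 1.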

We can extract a scalar reward from this distribution in three common ways:

$$R_\text{pref}(y_A) = P_\theta(\text{A wins} \mid \ldots) - P_\theta(\text{B wins} \mid \ldots)$$

is the “preference margin” reward (in $[-1, 1]$). It is what DPO consumes directly. Alternatively

$$R_\text{logp}(y_A) = \log P_\theta(\text{A wins} \mid \ldots)$$

is the log-probability reward, which slots cleanly into Bradley-Terry loss for a classical reward-model student (§14.3). And for pointwise scoring,

$$R_\text{score}(y) = \sum_{s=1}^{S} s \cdot P_\theta(\text{score} = s \mid x, y, r)$$

is the expected score on a fixed scale (commonly $S = 5$ or $S = 10$), computed by reading the judge's logits at the digit tokens. All three reductions are just different views of the same underlying next-token distribution.
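As a rough illustration, the three reductions can be computed from a verdict distribution as follows. The helper names and the sample numbers are hypothetical; the pointwise helper assumes the digit-token logits arrive as a list ordered from score 1 to score S.

```python
import math

def preference_margin(p: dict) -> float:
    """R_pref: P(A wins) - P(B wins), in [-1, 1]."""
    return p["A"] - p["B"]

def log_prob_reward(p: dict) -> float:
    """R_logp: log P(A wins)."""
    return math.log(p["A"])

def expected_score(score_logits: list) -> float:
    """R_score: expected value on a 1..S scale from logits at the digit tokens."""
    m = max(score_logits)
    exps = [math.exp(v - m) for v in score_logits]
    z = sum(exps)
    # enumerate starts at 0, so index s corresponds to score s + 1
    return sum((s + 1) * e / z for s, e in enumerate(exps))

pairwise = {"A": 0.858, "B": 0.078, "Tie": 0.064}     # illustrative verdict triple
print(round(preference_margin(pairwise), 3))          # 0.78
print(round(log_prob_reward(pairwise), 3))            # -0.153
print(round(expected_score([0.0, 0.0, 1.0, 2.0, 1.0]), 2))  # mass peaks at score 4
```

The margin and the log-probability read the same pairwise distribution; only the pointwise score needs a different prompt, one that asks for a digit instead of a verdict.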

The whole “judge” pipeline lives in two facts: (1) a strong LLM's next-token distribution already encodes a sensible distribution over verdicts when prompted right; (2) you can extract a calibrated reward from that distribution with a single forward pass — no fine-tuning, no head, no extra training data.

Three Judging Protocols: Pointwise, Pairwise, Listwise

There are three template families. They differ in what you ask the judge to do and what you do with the output. Choose by what the downstream consumer wants.

| Protocol | Judge sees | Judge outputs | Use when |
| --- | --- | --- | --- |
| Pointwise | one response | score 1–10 (or 1–5, or a soft probability) | you need an ABSOLUTE quality score per response (RL reward, leaderboard, filtering) |
| Pairwise | two responses, A and B | verdict A / B / Tie + optional margin | you need RELATIVE preferences (Bradley-Terry reward model, DPO dataset) |
| Listwise | k responses (k = 4 to 16) | ordered ranking, or a single best-of-k pick | you need to pick the best of many samples (rejection sampling, best-of-n inference) |

Pairwise is the most common because it is the easiest task for the judge: comparing two things is cognitively much easier than assigning an absolute score on a fixed scale. Pointwise scoring is notoriously noisy — even strong judges disagree with themselves when shown the same response twice, because the scale is fictional. There is no objective definition of a “7-out-of-10” response; the judge has to invent the scale at inference time. Pairwise has no such problem: you just need a partial order.

Listwise is the most efficient when you have many candidates and only need the best. A single 8-way listwise call is cheaper than the 28 pairwise calls you would need to fully rank the same eight responses. The catch is that listwise judges have systematic biases toward the first and last items in the list — the same primacy and recency effects you see in humans.
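The call counts behind that comparison are quick to check. The sketch below only counts judge calls (a listwise prompt is longer than a pairwise one, so calls are not the same as tokens); `judge_calls` is a hypothetical helper written for this example.

```python
from math import comb

def judge_calls(k: int, swap_average: bool = False) -> dict:
    """Judge calls needed to fully rank k responses under each protocol."""
    # Every unordered pair is judged once, or twice if both orderings are run.
    pairwise = comb(k, 2) * (2 if swap_average else 1)
    return {"listwise": 1, "pairwise": pairwise}

print(judge_calls(8))                       # {'listwise': 1, 'pairwise': 28}
print(judge_calls(8, swap_average=True))    # {'listwise': 1, 'pairwise': 56}
```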

Practical default: use pairwise with swap averaging for reward-model training data; use pointwise (with a strong rubric and few-shot examples) when you need an absolute filter or a leaderboard number; use listwise only when you have already validated that the judge is bias-free on your task.

Visualization: One Pairwise Verdict, End to End

The interactive widget below walks the full pairwise pipeline on a few small examples. Pick an example, slide the rubric weight, and hit the swap toggle. The chain-of-thought block shows the numbers the judge would produce internally, and the bottom row compares the biased and unbiased verdicts. Try the “short explanation” case with the rubric on pure correctness, then flip A and B. The position bias is the swing.


The Four Biases Every Judge Has

Every LLM judge — including frontier ones — has systematic biases that are absent in well-trained humans. There are four worth memorising, because every production judging pipeline contains a countermeasure for each.

| Bias | What happens | Typical magnitude | Countermeasure |
| --- | --- | --- | --- |
| Position bias | favours the first (or last) response shown | 5–15% verdict flip rate on close pairs | swap averaging: run (A→B) and (B→A), average log-probs |
| Verbosity bias | favours longer / more elaborate responses | long response wins ~20% of ties where it shouldn't | length-normalised scoring; explicit “concise” in rubric |
| Self-preference | judge model favours outputs from its own family | 5–10% inflation when judging own-family outputs | use a different model family for judging than for the policy |
| Sycophancy | favours responses that agree with assertions in the prompt | 10–25% on opinion / political prompts | neutral rubric phrasing; jury panels with diverse judges |

These biases are not symmetric. Position bias is the worst on close calls (which are precisely the calls the reward model needs most). Verbosity bias is the worst on instruction-following tasks where a one-line answer is correct (“What is 2+2?”). Self-preference is catastrophic if you let it compound: a Llama-judge labelling a Llama-policy's outputs will systematically push the policy further from the truth, and the damage is invisible until human evaluation.
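Position bias is the easiest of the four to measure directly: judge each pair in both orders and count how often the canonical verdict changes. The sketch below assumes a hypothetical judge callable that returns "first", "second" or "tie" for the slot it prefers; `toy_judge` is a deliberately biased stand-in, not a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (first_shown, second_shown) -> "first" | "second" | "tie"
Judge = Callable[[str, str], str]

def canonical(verdict: str, a_first: bool) -> str:
    """Map a slot verdict back to the canonical response names A / B / Tie."""
    if verdict == "tie":
        return "Tie"
    if a_first:
        return "A" if verdict == "first" else "B"
    return "B" if verdict == "first" else "A"

def flip_rate(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose canonical verdict changes when the order is swapped."""
    flips = 0
    for resp_a, resp_b in pairs:
        forward = canonical(judge(resp_a, resp_b), a_first=True)
        swapped = canonical(judge(resp_b, resp_a), a_first=False)
        flips += forward != swapped
    return flips / len(pairs)

def toy_judge(first: str, second: str) -> str:
    """Toy stand-in: scores by length, with a 2-character head start for slot 1."""
    score_first, score_second = len(first) + 2, len(second)
    if score_first == score_second:
        return "tie"
    return "first" if score_first > score_second else "second"

pairs = [("short", "a much longer answer"), ("abcd", "abcde"), ("same", "size")]
print(round(flip_rate(toy_judge, pairs), 3))   # 0.667: only the lopsided pair is stable
```

Running this kind of check on a held-out set before using a judge gives a per-judge number to compare against the ranges in the table.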

Anthropic, Bai et al. (2022): “We observed that, even with carefully designed constitutions, single-judge verdicts showed measurable position bias. Mitigation via dual ordering brought judge-human agreement on close pairs from 0.68 to 0.81.” — almost every paper since reports the same effect at roughly the same magnitude.

Visualization: Agreement Under Bias Countermeasures

The heatmap below shows judge-vs-human agreement under three operating modes — raw one-pass, swap averaging, and swap + chain-of-thought — across four task difficulties and four judge sizes. The numbers are illustrative of the trends consistently reported in MT-Bench, Arena-Hard-Auto, and RewardBench: bigger judges help, swap averaging helps everywhere but most on adversarial prompts, and CoT helps on hard tasks but is a wash on easy ones.


Manual Numerical Walkthrough: One Verdict, A Swap, A Jury

Let's do a complete pairwise verdict by hand. The prompt is “What is 17 × 24?”, response A is “408” (correct), response B is “418” (a confident but wrong elaboration). We have one judge that produces verdict-token logits.

Step 1 — the judge in (A → B) order

With A shown first, the judge produces these logits at the verdict position:

$$\ell_A = 3.0,\quad \ell_B = 0.6,\quad \ell_T = 0.4$$

Apply softmax to convert logits to probabilities. Subtract the max (3.0) for stability, exponentiate, divide by the sum:

$$\exp(\ell - \max) = (e^{0}, e^{-2.4}, e^{-2.6}) = (1.000, 0.0907, 0.0743)$$

$$Z = 1.000 + 0.0907 + 0.0743 = 1.1650$$

$$P_1 = (P(A), P(B), P(T)) = (0.858, 0.0779, 0.0638)$$

The judge thinks A wins with probability $0.858$. Confident, mostly right.

Step 2 — the judge in (B → A) order

We flip the order. The judge now sees response B first. Position bias gives a $+0.4$ boost to whatever sits in slot 1, which is now B. Internally the judge produces (in slot order: first, second, tie):

$$\ell_\text{slot} = (\ell_B + 0.4,\; \ell_A,\; \ell_T) = (1.0,\; 3.0,\; 0.4)$$

The underlying logits, before bias, are $(\ell_B, \ell_A, \ell_T) = (0.6, 3.0, 0.4)$ in slot order, because the judge can still tell that A's answer is the right one. Position bias only adds $+0.4$ to whichever response is FIRST, so the slot logits become $(1.0, 3.0, 0.4)$. For simplicity this walkthrough treats the Step 1 logits as the bias-free values; a strictly symmetric model would also have added $+0.4$ to $\ell_A$ in Step 1.

Softmax over these:

$$\exp(\ell - \max) = (e^{-2.0}, e^{0}, e^{-2.6}) = (0.1353, 1.000, 0.0743)$$

$$Z = 1.2096,\quad P_\text{slot} = (0.1119, 0.8267, 0.0614)$$

Re-label slot → canonical (slot-1 = B, slot-2 = A, slot-3 = T):

$$P_2 = (P(A), P(B), P(T)) = (0.8267, 0.1119, 0.0614)$$

Step 3 — swap averaging

Average the two log-probability vectors in canonical (A, B, T) space:

$$\log P_1 = (-0.153, -2.553, -2.753)$$

$$\log P_2 = (-0.190, -2.190, -2.790)$$

$$\overline{\log P} = (-0.172, -2.372, -2.772)$$

Re-exponentiate and renormalise: $\exp \overline{\log P} = (0.842, 0.0933, 0.0626)$, $Z = 0.998$, so $P_\text{avg} \approx (0.844, 0.0935, 0.0627)$.

The swap-averaged verdict says A wins with probability $0.844$, between the two single-direction passes $0.858$ and $0.827$. On this easy case the swap is a small correction. On a closer pair (logits 2.0 vs 1.6 instead of 3.0 vs 0.6) the same $+0.4$ bias would have erased A's lead entirely in the B-first pass (2.0 vs 2.0) while the A-first pass still picks A, and the swap average would still have recovered A as the winner.
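Steps 1 to 3 can be replayed in NumPy as a check on the hand arithmetic. This is a minimal sketch that follows the walkthrough's own convention (the A-first logits are taken as bias-free and $+0.4$ is added to slot 1 in the B-first pass); `swap_walkthrough` is a name invented for the example.

```python
import numpy as np

def log_softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                      # stability: largest entry becomes 0
    return z - np.log(np.exp(z).sum())

def swap_walkthrough(l_a, l_b, l_t, bias=0.4):
    """Return (P1, P2, P_avg), all in canonical (A, B, Tie) order."""
    lp1 = log_softmax([l_a, l_b, l_t])              # Step 1: A first, taken as bias-free
    lp_slot = log_softmax([l_b + bias, l_a, l_t])   # Step 2: B first, in slot order
    lp2 = lp_slot[[1, 0, 2]]                        # relabel slots -> canonical
    avg = np.exp(0.5 * (lp1 + lp2))                 # Step 3: average in log space
    return np.exp(lp1), np.exp(lp2), avg / avg.sum()

# The easy pair from the walkthrough.
p1, p2, p_avg = swap_walkthrough(3.0, 0.6, 0.4)
print(np.round(p1, 3), np.round(p2, 3), np.round(p_avg, 3))

# The closer pair: logits 2.0 vs 1.6.
c1, c2, c_avg = swap_walkthrough(2.0, 1.6, 0.4)
print(np.round(c1, 3), np.round(c2, 3), np.round(c_avg, 3))
```

On the easy pair the three values of P(A) come out near 0.858, 0.827 and 0.844. On the closer pair the B-first pass no longer prefers A, and the averaged verdict still does.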

Step 4 — a three-judge jury

Suppose we run three different judges (a 7B, a 70B, and a 405B) on the same pair and get verdict probabilities:

$$P^{(7B)} = (0.70, 0.18, 0.12),\quad P^{(70B)} = (0.84, 0.10, 0.06),\quad P^{(405B)} = (0.92, 0.05, 0.03)$$

A “hard” jury (majority vote on argmax verdicts) returns A: all three judges agree. A “soft” jury averages the probability vectors:

$$P_\text{jury} = \frac{1}{3}\left(P^{(7B)} + P^{(70B)} + P^{(405B)}\right) = (0.820, 0.110, 0.070)$$

The soft jury verdict is $P(\text{A wins}) = 0.820$, slightly more cautious than the 405B alone (which said 0.92), because the 7B judge pulls the average down. In practice, soft juries weighted by per-judge calibration accuracy (measured against a held-out human-labelled set) outperform both single-judge and equal-weight juries. This is the recipe behind Anthropic's “Multiple Judges” technique and DeepSeek's judge ensemble.
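Both jury rules fit in a few lines. The sketch below uses the three probability vectors from Step 4; the calibration weights in the last call are made-up numbers standing in for per-judge accuracy on a held-out human-labelled set.

```python
import numpy as np

def soft_jury(prob_rows, weights=None):
    """Weighted average of per-judge probability vectors (equal weights by default)."""
    probs = np.asarray(prob_rows, dtype=float)
    w = np.ones(len(probs)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise so the result is still a distribution
    return w @ probs

def hard_jury(prob_rows, labels=("A", "B", "Tie")):
    """Majority vote over each judge's argmax verdict (ties between votes not handled)."""
    votes = [labels[int(np.argmax(row))] for row in prob_rows]
    return max(set(votes), key=votes.count)

judges = [(0.70, 0.18, 0.12), (0.84, 0.10, 0.06), (0.92, 0.05, 0.03)]  # 7B, 70B, 405B
print(hard_jury(judges))                   # A
print(np.round(soft_jury(judges), 3))      # [0.82 0.11 0.07]

# Hypothetical calibration weights: the larger judges count for more.
weighted = soft_jury(judges, weights=[0.6, 0.8, 0.9])
print(np.round(weighted, 3))
```

With weights that favour the larger judges, P(A) moves above the equal-weight 0.820 and toward the 405B verdict.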

Plain Python: A Pairwise Judge From Logits

Now the same pipeline in pure NumPy. We replace the real transformer with a hand-built logit table so every number is visible, but the surrounding code is byte-for-byte what a real judge call does once you swap the forward function for a model forward.

The notes below walk through the listing that follows, one piece at a time.

The verdict vocabulary: three tokens, not 128k

Real judges score over the full vocabulary, but only three tokens matter: 'A', 'B', and 'Tie'. We read off the logits at exactly those three indices and ignore the rest. Production pipelines do the same: vLLM exposes a `logprobs` parameter that returns the top-k tokens at the verdict position, and the judging code keeps only the three it cares about.

Example: VOCAB = {'A': 0, 'B': 1, 'Tie': 2}

Toy judge_forward: a hand-built logit table

In production this call is a 32-layer transformer running over a ~2k-token prompt (rubric + question + both responses + verdict cue). The OUTPUT, though, is exactly what we model here: a small logit vector over the verdict tokens. By hand-rolling the logits we get to see every bias we will later have to fix.

Truth-conditional base logits

Three cases. If the response shown first is the true winner, the judge puts the most mass on the first-slot token (logit 2.6). If the response shown second is the winner, mass shifts to the second-slot token (logit 2.4). On a real tie, mass goes to 'Tie' (logit 2.0) with the two slots equal. These are the numbers the judge would produce before any position bias is added.

Example: true winner shown first → base = [2.6, 1.0, 0.4] (first slot, second slot, Tie)

The position bias: +0.4 to the first-shown verdict

Position-bias studies commonly report that a judge LLM gives roughly 5-15% extra probability mass to whichever response it sees first. Here we model that as a +0.4 boost to the first-shown answer's logit. The exact value varies by model: GPT-4-class judges score around 0.2-0.3 logits of bias; smaller open models can hit 0.6+. Real production pipelines measure this per-judge before using one.

Example: the first-slot logit gets += 0.4, whichever response sits in that slot

Numerically stable log-softmax

The same log-sum-exp trick from §1.6. Subtract the max, exponentiate, normalise. Do this and the judge stays stable even at bf16. Skip it and a single very large logit (like a 'Tie' shortcut the judge learned during SFT) can overflow the exponential and turn the whole vector into NaNs.

Example: logits [3.0, 1.4, 0.4] → log-probs [-0.24, -1.84, -2.84]

Run the judge once and recover canonical (A, B, Tie) probabilities

The judge always speaks in 'first-slot / second-slot' language: its 'A'-token always means ‘whichever response is shown first’. We re-label back to the canonical responses A and B using the ordering. This is the most common source of subtle bugs in judge code: forgetting to un-swap the labels when you flipped the prompt.

Swap averaging: run the judge twice, average in log-space

The canonical position-bias fix. We run the judge once with A first, once with B first, and average the log-probabilities in each canonical slot. The position-bias boost lands on opposite responses in the two calls, so the average cancels it out (to first order). Then renormalise so the final triple sums to 1. This costs 2× tokens, but those tokens buy a verdict the human evaluator actually agrees with.

Example (truth = A): forward (A first) P(A) ≈ 0.83, P(B) ≈ 0.11; swapped (B first) P(A) ≈ 0.63, P(B) ≈ 0.28; averaged → P(A) ≈ 0.74, P(B) ≈ 0.18

Renormalise after averaging in log-space

When you average log-probs across orderings, the result is no longer a valid distribution: the three numbers no longer exp-sum to 1. Re-exponentiate, sum, divide. The arithmetic is trivial but easy to forget, and forgetting it produces 'probabilities' that sum to less than 1 and are silently absorbed by downstream code as scores.

Comparing one-pass vs swap-averaged verdicts

Run the full pipeline for each of the three ground truths. With the first-slot boost in this toy table, one-pass scoring in (A → B) order inflates P(A) by roughly 6 to 10 percentage points across all three cases (the bias goes one way), while swap averaging gives a probability that tracks the true winner more closely. This is the single highest-ROI engineering change you can make to a judging pipeline.

Example: truth=B  one-pass={A:0.28, B:0.63, Tie:0.09}  swap-avg={A:0.18, B:0.74, Tie:0.07}
```python
"""
A pairwise LLM-as-judge: pure NumPy, no autograd, no transformer.

We re-implement what a production judging pipeline does:
  1. Build a prompt:  rubric + question + response A + response B + verdict cue.
  2. Run the judge forward to get logits at the verdict position.
  3. Extract the log-prob of just three tokens: "A", "B", "Tie".
  4. Average over (A→B) and (B→A) orderings to remove position bias.
  5. Convert to a probability over the three classes.

We replace the real transformer with a hand-built logit table so every
number is visible. Swap in a real model.forward() and the surrounding
code has the same shape as a real judge call.
"""

import numpy as np

# ---------------------------------------------------------------------------
# 1. The judge vocabulary contains the verdict tokens we care about.
#    Real Llama-3 vocab has 128k entries; we keep three for the walkthrough.
# ---------------------------------------------------------------------------

VOCAB = {"A": 0, "B": 1, "Tie": 2}

# ---------------------------------------------------------------------------
# 2. A toy "judge.forward()". In reality this is a transformer doing
#    32 attention layers over the prompt. Here it is a hand-built table:
#    for each (ordering, true_winner) we return the logits the judge
#    would produce at the verdict position. The numbers reflect a
#    realistic pattern: the judge is mostly right, but biased toward the
#    first-shown answer (~+0.4 logits to whichever appears first).
# ---------------------------------------------------------------------------

def judge_forward(first_response: str, second_response: str, true_winner: str) -> np.ndarray:
    """Return the 3-way logit vector in SLOT order: (first shown, second shown, Tie).

    The judge's "A" token always means "the response shown first", so index
    VOCAB["A"] is the first slot regardless of which canonical response sits there.
    """
    # Base logits if there were no position bias.
    if true_winner == first_response:
        base = np.array([2.6, 1.0, 0.4])  # the first-shown response is better
    elif true_winner == second_response:
        base = np.array([1.2, 2.4, 0.4])  # the second-shown response is better
    else:                                  # actual tie
        base = np.array([1.6, 1.6, 2.0])

    # Position bias: +0.4 on the first-slot token, whichever response is there.
    # (The original branched on the response name and boosted index 1 when B was
    # first, which always favoured canonical A and never cancelled under swapping.)
    base[VOCAB["A"]] += 0.4
    return base

# ---------------------------------------------------------------------------
# 3. Standard log-softmax over the 3-way verdict head.
# ---------------------------------------------------------------------------

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                       # numerical stability
    return z - np.log(np.exp(z).sum())

# ---------------------------------------------------------------------------
# 4. One verdict in one ordering.
# ---------------------------------------------------------------------------

def verdict_logprobs(first: str, second: str, true_winner: str) -> dict:
    logits = judge_forward(first, second, true_winner)
    lp = log_softmax(logits)
    # Re-label from "first/second slot" back to canonical A/B/Tie.
    if first == "A":
        return {"A": lp[VOCAB["A"]], "B": lp[VOCAB["B"]], "Tie": lp[VOCAB["Tie"]]}
    else:
        # Slot "A-token" was earned by response B; flip the mapping.
        return {"B": lp[VOCAB["A"]], "A": lp[VOCAB["B"]], "Tie": lp[VOCAB["Tie"]]}

# ---------------------------------------------------------------------------
# 5. Swap-averaged verdict: the canonical position-bias fix.
# ---------------------------------------------------------------------------

def swap_averaged_verdict(true_winner: str) -> dict:
    forward = verdict_logprobs("A", "B", true_winner)
    swapped = verdict_logprobs("B", "A", true_winner)
    # Average log-probs in each canonical slot, then renormalise.
    keys = ["A", "B", "Tie"]
    avg_lp = {k: 0.5 * (forward[k] + swapped[k]) for k in keys}
    Z = sum(np.exp(v) for v in avg_lp.values())
    return {k: float(np.exp(v) / Z) for k, v in avg_lp.items()}

# ---------------------------------------------------------------------------
# 6. Run the pipeline on a few cases.
# ---------------------------------------------------------------------------

for truth in ("A", "B", "Tie"):
    # A single pass is already normalised by log_softmax; just exponentiate.
    forward_only = {k: float(np.exp(v)) for k, v in verdict_logprobs("A", "B", truth).items()}
    swap_avg     = swap_averaged_verdict(truth)
    print(f"truth={truth}")
    print(f"  one-pass : {forward_only}")
    print(f"  swap-avg : {swap_avg}")
```
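The un-swap step is where judge code most often goes wrong, so it is worth seeing what skipping it does. The following sketch uses two illustrative slot-order logit vectors for a pair where A is truly better (with a +0.4 first-slot boost already included) and compares the correct average against one that forgets to relabel. The numbers are assumptions chosen for the example.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Slot order is (first shown, second shown, Tie).
forward_slot = np.array([3.0, 1.0, 0.4])   # A shown first: slot 1 is A
swapped_slot = np.array([1.6, 2.4, 0.4])   # B shown first: slot 1 is B

lp_forward = np.log(softmax(forward_slot))
lp_swapped = np.log(softmax(swapped_slot))

# Correct: map the swapped call back to canonical (A, B, Tie) before averaging.
correct = softmax(0.5 * (lp_forward + lp_swapped[[1, 0, 2]]))

# Bug: average the slot vectors directly, treating "first slot" as "response A".
buggy = softmax(0.5 * (lp_forward + lp_swapped))

print("correct:", np.round(correct, 3))
print("buggy  :", np.round(buggy, 3))
```

In the buggy version the first call's evidence for A is averaged against the second call's first-slot score, which belongs to B, so P(A) comes out noticeably lower than it should.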

PyTorch: A Production-Shaped Pairwise Judge

The PyTorch version is what a real RLHF pipeline calls a million times to score the preference dataset. There is no new architecture: the judge is the same SFT'd Llama-3 70B you trained in chapter 13, used in inference mode. The whole “reward” is three log-probabilities at the next-token position.

The notes below walk through the listing that follows, one piece at a time.

The judge prompt template: short, explicit, rubric-first

Every word of this template matters. The rubric appears BEFORE the responses so the judge anchors on the criterion before being primed by either response. The template ends with a cue, and the token we score is whatever immediately follows that cue. A single-forward-pass logit read only works when the cue asks for the verdict itself; with a trailing 'Reasoning: ' cue the next token is the first word of the reasoning trace, so a CoT judge has to generate the trace first and read the verdict token after it. Templates that put the rubric AFTER the responses tend to do worse: the judge has already half-decided by the time it sees what to optimise.

Load once, score many: judges live on a long-running server

AutoModelForCausalLM in bf16 on device_map='auto' will shard a 70B judge across whatever GPUs you have. In production the judge actually lives on a vLLM server with continuous batching; a single H100 node serves ~100k verdicts/hour for a 70B judge. Here we use the HF API for clarity; the math is identical.

Example: 70B in bf16 → ~140 GB → 2 × H100 80GB

Resolve the verdict token ids ONCE

We need the integer ids of ' A', ' B', and ' Tie' so we can index into the final logit vector. We resolve them once at import time and cache them. The leading space matters: different tokenisers produce different ids for 'A' vs ' A' (Llama-3 in particular). Use exactly the form that will follow the final cue in your template, and verify with `tok.decode(A_id)`.

Example (illustrative; print the ids for your own tokenizer): A_id = 362, B_id = 426, TIE_id = 22118  (Llama-3 BPE)

_verdict_logprobs: one judge call, three numbers out

Format the template with the question and two responses, tokenise, push to the model. We DO NOT call model.generate: we only want the logits at the next position, not actual sampled text. One forward pass, returning (V,) at the last position, is far cheaper than running 64 tokens of CoT decoding.

Example: ids.shape = (1, 1734)  →  out.logits.shape = (1, 1734, 128256)

Take the last-token logits, log-softmax over the full vocab

out.logits[0, -1, :] is the next-token logit vector, exactly what a sampler would draw from to generate the verdict. We log_softmax in float32 (the .float() cast) because bf16 log-softmax over 128k entries can lose precision for tokens with low logits. Then we gather the three indices we care about and ignore the rest.

Example: log_probs.shape = (128256,)  →  log_probs[[A_id, B_id, TIE_id]] = tensor([-0.69, -0.92, -2.30])

judge_pair: the public function the dataset pipeline calls

Two judge calls, one with A shown first and one with B first. This is the swap-averaging recipe from the plain-Python version, now over real logits. The cost is exactly 2×, but the verdicts agree with human raters around 4-8 percentage points more often than a single forward pass, and the gain is biggest on the close calls that actually decide what the reward model learns from.

Un-swap the second call before averaging

After the swap, the judge's 'A-slot' logit actually corresponds to response B and vice versa. We re-index the slot vector, torch.stack([s_lp[1], s_lp[0], s_lp[2]]), so both calls speak the same canonical (A, B, Tie) language before we average. Get this wrong and the swap step makes things WORSE rather than better: the position biases will add instead of cancel.

Example: slot order [B, A, Tie] (because B is in slot 1) → canonical [A, B, Tie]

Average log-probs, then renormalise

0.5 × (f_lp + s_canon) is the log of the geometric mean of the two distributions. Softmax over the result renormalises to a valid probability triple. Production tip: the temperature here is implicitly 1 because the inputs came out of log-softmax already; if you wanted to sharpen the verdict (force more decisive labels) you would divide avg_lp by a temperature < 1 before softmaxing.

Example: avg_lp = [-0.60, -1.20, -3.00] → exp = [0.55, 0.30, 0.05] (sums to 0.90) → renormalised [0.61, 0.33, 0.06]

label_preference: turn the triple into a hard label

The downstream consumer (reward-model training, DPO, BTRM) wants a hard preference label. The logic here returns a tie when 'Tie' has the largest probability and otherwise picks the larger of P(A) and P(B). Production tip: instead of argmax, many teams threshold on |P(A)-P(B)| > 0.2 and drop pairs where the judge is uncertain; those pairs carry the most label noise and dropping them measurably improves the reward model.

Example: P={A:0.62, B:0.33, Tie:0.05} → label = +1 (A preferred)
1"""
2A production-shaped pairwise LLM judge in PyTorch + HuggingFace.
3
4This is the function a real RLHF pipeline calls a million times to score a
5preference dataset. The model is a vanilla HF causal LM (the same artefact
6your SFT step produced in §13.4). No reward head, no special architecture —
7the 'reward' is just three logprobs over verdict tokens.
8"""
9
10from __future__ import annotations
11import torch
12import torch.nn.functional as F
13from transformers import AutoTokenizer, AutoModelForCausalLM
14
15JUDGE_PROMPT = """\
16You are an impartial judge. Read the user question and the two responses
17below. Decide which response better follows the rubric:
18
19RUBRIC: be correct, concise, and on-topic.
20
21[Question]
22{question}
23
[Response A]
{response_a}

[Response B]
{response_b}

After a short reasoning trace, output exactly one verdict token: A, B, or Tie.
Reasoning: """

# ---------------------------------------------------------------------------
# 1. Load model + tokenizer once. In production this lives on a vLLM server.
# ---------------------------------------------------------------------------
tok   = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
judge = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
judge.eval()

# Resolve the three verdict-token ids ONCE; every judge call reuses them.
# Assumes " A", " B" and " Tie" each start with a single distinguishing token.
A_id   = tok.encode(" A",   add_special_tokens=False)[0]
B_id   = tok.encode(" B",   add_special_tokens=False)[0]
TIE_id = tok.encode(" Tie", add_special_tokens=False)[0]

# Cue appended after the reasoning trace so that the next token is the verdict.
VERDICT_CUE = "\nVerdict:"
MAX_REASONING_TOKENS = 64

@torch.no_grad()
def _verdict_logprobs(question: str, first: str, second: str) -> torch.Tensor:
    """One judge call: short reasoning trace, then log-probs over (A_slot, B_slot, Tie_slot)."""
    prompt = JUDGE_PROMPT.format(question=question, response_a=first, response_b=second)
    enc = tok(prompt, return_tensors="pt").to(judge.device)
    # The prompt ends at "Reasoning:", so let the judge write its trace first (greedy).
    with_trace = judge.generate(**enc, max_new_tokens=MAX_REASONING_TOKENS, do_sample=False)
    # Append the verdict cue, then score the position right after it.
    cue = tok(VERDICT_CUE, add_special_tokens=False, return_tensors="pt").input_ids.to(judge.device)
    out = judge(input_ids=torch.cat([with_trace, cue], dim=-1), use_cache=False)
    # Logits at the LAST token = what the judge would emit as its verdict.
    last = out.logits[0, -1, :]                  # (V,)
    log_probs = F.log_softmax(last.float(), dim=-1)
    return log_probs[[A_id, B_id, TIE_id]]       # (3,)

@torch.no_grad()
def judge_pair(question: str, resp_a: str, resp_b: str) -> dict[str, float]:
    """Swap-averaged pairwise verdict. Returns canonical P(A), P(B), P(Tie)."""
    # Forward: A is shown first.
    f_lp = _verdict_logprobs(question, resp_a, resp_b)        # (3,) in slot order
    # Swapped: B is shown first. The verdict-A token now refers to B, etc.
    s_lp = _verdict_logprobs(question, resp_b, resp_a)        # (3,) in slot order
    # Un-swap the second call so we can average in canonical space.
    s_canon = torch.stack([s_lp[1], s_lp[0], s_lp[2]])        # (B, A, Tie) → (A, B, Tie)
    # Average log-probs and renormalise over the three verdict tokens.
    avg_lp = 0.5 * (f_lp + s_canon)
    probs  = torch.softmax(avg_lp, dim=-1)
    return {
        "A":   probs[0].item(),
        "B":   probs[1].item(),
        "Tie": probs[2].item(),
    }

# ---------------------------------------------------------------------------
# 2. Run on one preference pair.
# ---------------------------------------------------------------------------
q  = "Write a one-line Python expression to reverse a list xs."
ya = "xs[::-1]"
yb = "list(reversed(xs))  # both fine but more verbose"
verdict = judge_pair(q, ya, yb)
print(verdict)            # e.g. {'A': 0.52, 'B': 0.41, 'Tie': 0.07}

# ---------------------------------------------------------------------------
# 3. Bulk scoring: what the preference dataset pipeline actually does.
# ---------------------------------------------------------------------------
def label_preference(question: str, ya: str, yb: str) -> int:
    """Returns +1 if A is preferred, -1 if B, 0 on tie."""
    p = judge_pair(question, ya, yb)
    if p["Tie"] > max(p["A"], p["B"]):
        return 0
    return 1 if p["A"] > p["B"] else -1
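
The effect of the un-swap step is easier to see without loading a 70B model. The following is a minimal, model-free sketch of the same swap-averaging arithmetic on plain Python lists. The log-prob triples are made-up numbers and `swap_average` is a hypothetical helper that mirrors `judge_pair`; treat it as an illustration under those assumptions rather than part of the pipeline.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def swap_average(forward_lp, swapped_lp):
    """Combine two (slot1, slot2, tie) log-prob triples into canonical
    (P(A), P(B), P(Tie)). In the swapped call, slot1 held response B."""
    swapped_canon = [swapped_lp[1], swapped_lp[0], swapped_lp[2]]
    avg = [0.5 * (f + s) for f, s in zip(forward_lp, swapped_canon)]
    return softmax(avg)

# Hypothetical judge with pure position bias: it favours whichever
# response is shown first, whatever the content.
biased = [math.log(0.6), math.log(0.3), math.log(0.1)]
p_a, p_b, p_tie = swap_average(biased, biased)

# Hypothetical judge that prefers A on content but still leans towards slot 1.
forward = [math.log(0.7), math.log(0.2), math.log(0.1)]   # A shown first
swapped = [math.log(0.4), math.log(0.5), math.log(0.1)]   # B shown first
q_a, q_b, q_tie = swap_average(forward, swapped)

print(round(p_a, 3), round(p_b, 3))
print(round(q_a, 3), round(q_b, 3))
```

In the first case the position bias cancels and P(A) equals P(B); in the second, the content preference for A survives the swap.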

At Massive Scale: A Million Verdicts for a Single Run

Three things change once you scale from “one verdict” to “one million verdicts for a 405B post-training run.”

1. Token volume dwarfs everything else in the pipeline

A swap-averaged pairwise verdict with a ~2000-token prompt and a short CoT trace consumes roughly $2 \times (2000 + 64) \approx 4000$ tokens. At one million verdicts that is $4 \times 10^9$ tokens: 4 billion tokens of judge inference. A 70B judge on a single H100 generates roughly $30$ tokens/second per request; with vLLM continuous batching at batch 64 that is about $1{,}900$ tokens/second/H100, or roughly $1.6 \times 10^8$ tokens/day. Labelling a million verdicts therefore takes ~25 H100-days of inference. That is the same order of magnitude as a 7B SFT training run, so judging is no longer cheap.
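
The estimate is easy to recompute for your own prompt length and serving throughput. The sketch below is a back-of-the-envelope calculator, not a benchmark: `judge_gpu_days` is a hypothetical helper, and it counts prompt and generated tokens at the same rate, as the paragraph above does (in practice prefill is usually faster than decode, so this is a simplification).

```python
SECONDS_PER_DAY = 24 * 60 * 60

def judge_gpu_days(n_pairs, prompt_tokens, cot_tokens,
                   tokens_per_sec_per_gpu, calls_per_pair=2):
    """GPU-days needed to label n_pairs, assuming a flat tokens/second rate."""
    tokens_per_pair = calls_per_pair * (prompt_tokens + cot_tokens)
    total_tokens = n_pairs * tokens_per_pair
    tokens_per_gpu_day = tokens_per_sec_per_gpu * SECONDS_PER_DAY
    return total_tokens / tokens_per_gpu_day

# Inputs taken from the text: 1M pairs, ~2000-token prompt, 64-token trace,
# 30 tokens/s per request at batch 64 on one H100.
per_gpu_rate = 30 * 64
days = judge_gpu_days(1_000_000, 2000, 64, per_gpu_rate)
print(f"{days:.1f} H100-days")
```

The result scales inversely with sustained throughput, so halving the tokens/second figure doubles the GPU-days.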

2. The judge must be cheaper than the policy

When the policy is 70B+, the judge is usually smaller than the policy, or the same size but from a different model family. A common recipe in 2024–2025 is a 70B policy judged by a GPT-4-class model through an API for the high-stakes labels, plus an 8B or 13B fine-tuned judge for the bulk of the dataset. The two are cross-validated against a held-out human-labelled set; the bulk judge keeps its labels only where its measured agreement with the strong judge exceeds a threshold, and defers the rest to the strong judge. Judge cost typically lands at 5–15% of the total RLHF budget on a well-engineered pipeline.
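
One way to implement such a cascade is sketched below. It uses the bulk judge's own top probability as the proxy for "likely to agree with the strong judge", which is an assumption on my part; the text does not specify the routing signal. `route_pair`, `pick_threshold`, and the threshold grid are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Verdict = Dict[str, float]  # {"A": p, "B": p, "Tie": p}

def top_label(v: Verdict) -> Tuple[str, float]:
    """Most probable verdict and its probability."""
    label = max(v, key=v.get)
    return label, v[label]

def pick_threshold(bulk: Sequence[Verdict], reference: Sequence[str],
                   target_agreement: float = 0.9,
                   grid: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9)) -> Optional[float]:
    """Smallest confidence threshold at which the bulk judge's kept labels
    agree with the reference labels at least target_agreement of the time."""
    for t in grid:
        kept = [(top_label(v)[0], ref) for v, ref in zip(bulk, reference)
                if top_label(v)[1] >= t]
        if kept and sum(l == r for l, r in kept) / len(kept) >= target_agreement:
            return t
    return None

def route_pair(bulk: Verdict, strong_judge: Callable[[], Verdict],
               min_conf: float) -> Tuple[str, str]:
    """Keep the bulk label when confident; otherwise pay for the strong judge."""
    label, conf = top_label(bulk)
    if conf >= min_conf:
        return label, "bulk"
    return top_label(strong_judge())[0], "strong"

# Toy held-out set: bulk-judge verdicts plus reference labels.
held_out: List[Verdict] = [
    {"A": 0.95, "B": 0.03, "Tie": 0.02},
    {"A": 0.55, "B": 0.40, "Tie": 0.05},
    {"A": 0.10, "B": 0.85, "Tie": 0.05},
    {"A": 0.65, "B": 0.30, "Tie": 0.05},
]
refs = ["A", "B", "B", "A"]
threshold = pick_threshold(held_out, refs)
routed = route_pair(held_out[1], lambda: {"A": 0.1, "B": 0.85, "Tie": 0.05}, threshold)
print(threshold, routed)
```

The strong judge is passed as a callable so that the expensive API call only happens for pairs that are actually deferred.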

3. Distributional drift is now your problem

The judge was trained on a fixed distribution of text, while the policy changes at every RLHF step. After tens of thousands of steps the policy is producing responses that look nothing like anything the judge saw during its own training: long bulleted Markdown, oddly confident refusals, unusually structured code. The judge becomes miscalibrated on these out-of-distribution outputs, and the standard symptom is reward hacking: the policy discovers a stylistic tic the judge mistakes for quality. Production runs mitigate this with periodic judge updates (re-distil the judge on recent policy outputs) and judge audits (re-label ~1% of pairs with humans every week and watch for drift in judge-human agreement).
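
A weekly audit of this kind reduces to tracking one number over time. The sketch below is a minimal monitor under assumed conventions: labels are the +1 / -1 / 0 integers used earlier, the baseline is the mean of the first few weeks, and the alert threshold of 3 points is an arbitrary illustrative value. `agreement_rate` and `drift_alerts` are hypothetical helper names.

```python
from typing import List, Sequence

def agreement_rate(judge_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Fraction of audited pairs where judge and human labels match."""
    if len(judge_labels) != len(human_labels) or len(judge_labels) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def drift_alerts(weekly_agreement: Sequence[float],
                 baseline_weeks: int = 4, max_drop: float = 0.03) -> List[int]:
    """Indices of weeks whose agreement fell more than max_drop below
    the mean agreement of the first baseline_weeks."""
    baseline = sum(weekly_agreement[:baseline_weeks]) / baseline_weeks
    return [i for i, a in enumerate(weekly_agreement)
            if i >= baseline_weeks and baseline - a > max_drop]

# One audited batch, then eight weeks of made-up agreement figures.
batch_rate = agreement_rate([1, -1, 0, 1], [1, -1, 1, 1])
weekly = [0.83, 0.84, 0.82, 0.83, 0.82, 0.81, 0.78, 0.76]
alerts = drift_alerts(weekly)
print(batch_rate, alerts)
```

An alert is a cue to inspect recent policy outputs and consider re-distilling the judge, not proof of reward hacking on its own.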

The economic bottom line: at frontier scale, the judge is no longer a free oracle. It is a piece of compute infrastructure with the same cost discipline as the trainer — own the inference stack (vLLM, FlashAttention, paged KV cache, FP8), run it as a long-lived service, and treat per-verdict token cost as a tracked budget line. The teams that get this right are also the teams that ship.

Engineering Reality: The Knobs That Actually Move Reward Quality

Years of post-training in industry have collapsed onto a small set of knobs that, ranked by how much they actually change the final chat model's benchmark score, look roughly like this:

  1. Rubric writing. The single highest-leverage knob. A 200-word rubric with three or four well-chosen failure modes can lift judge-human agreement by 10–15 absolute points over a one-line rubric. Worth more engineer-hours than almost any architectural decision. Anthropic's Constitutional AI paper is in large part a rubric-writing paper.
  2. Swap averaging. Two judge calls instead of one. Eliminates the bulk of position-bias errors. 2× cost, 3–8 point accuracy gain. Mandatory.
  3. Judge model size. A 70B judge consistently beats a 7B judge by 5–10 points; a frontier-class judge beats a 70B by another 3–5. Past that the curve flattens. Match the judge size to the importance of the labels: high for reward-model training data, lower for the rejection-sampling filter.
  4. Chain-of-thought before the verdict. Worth 2–6 points on hard reasoning tasks. Adds 30–80 tokens per call. Use on math, code, multi-hop QA; skip on simple format / safety checks.
  5. Few-shot examples in the prompt. 3–5 hand-picked examples of correctly-judged pairs lift agreement by another 2–4 points and stabilise variance across runs. Beware: too many shots and the judge starts pattern-matching the examples rather than reading the rubric.
  6. Jury panels. Mix two or three judges from different families with calibrated weights. Worth a final 1–3 points; mostly buys robustness against single-judge drift.
  7. Drop the “uncertain” pairs. If $|P(A) - P(B)| < 0.2$ after swap averaging, discard the pair. Counter-intuitively this can improve the downstream reward model by removing the highest-label-noise rows: better to have 800k clean labels than 1M noisy ones.

A canonical 2025 RLHF labelling stack looks like this: 70B fine-tuned judge served on vLLM with FP8 weights, 2k prompt tokens with a hand-written rubric plus 3 few-shot pairs, chain-of-thought enabled, swap-averaged at 2 calls per pair, with ~10% of pairs cross-checked against GPT-4-class and ~1% against humans. End-to-end agreement with held-out humans: $\sim 0.83$, roughly the same as inter-annotator agreement between two trained humans. At that point the judge has stopped being the bottleneck.
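
Knob 7 in the list above is a one-line filter once the swap-averaged probabilities are available. The sketch below assumes each row carries the `{"A", "B", "Tie"}` dictionary returned by `judge_pair`; `keep_confident` is a hypothetical helper, and the 0.2 margin is the value quoted in the list rather than a tuned constant.

```python
from typing import Dict, Iterable, List

def keep_confident(rows: Iterable[Dict[str, float]],
                   min_margin: float = 0.2) -> List[Dict[str, float]]:
    """Keep only rows where the swap-averaged verdict is decisive,
    i.e. |P(A) - P(B)| is at least min_margin."""
    return [r for r in rows if abs(r["A"] - r["B"]) >= min_margin]

rows = [
    {"A": 0.52, "B": 0.41, "Tie": 0.07},   # margin 0.11: dropped
    {"A": 0.78, "B": 0.18, "Tie": 0.04},   # margin 0.60: kept
    {"A": 0.45, "B": 0.50, "Tie": 0.05},   # margin 0.05: dropped
]
clean = keep_confident(rows)
print(len(clean), "of", len(rows), "pairs kept")
```

Note that this rule also drops pairs where Tie dominates, since A and B are then close; whether ties should be kept as a separate class is a design choice the filter leaves open.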

That number — judge-human agreement matching human-human agreement — is the milestone the field crossed somewhere around late 2023. It is what made it economically possible to push a model from “decent” to “frontier” in post-training: you can now afford to look at every single comparison, every single time. The next section moves to the algorithm that consumes those million labels — PPO, the workhorse RLHF objective.
