Chapter 14

Reward Model Architecture and Training

Reward Modeling and RLHF

The Real Problem: A Preference Number Has to Come From Somewhere

The previous section gave us a theory of preferences, Bradley-Terry, and a dataset of triples $(x, y_c, y_r)$, each saying “for this prompt $x$, a human preferred response $y_c$ over response $y_r$.” The theory says these preferences are consistent with an underlying scalar reward function $r_\theta(x, y) \in \mathbb{R}$. The dataset is finite. The space of possible responses $y$ at inference time is essentially infinite: a trillion plausible 200-token replies to any given prompt. We need a function that takes any new $(x, y)$ pair and returns a number that agrees, in expectation, with how a human would have voted.

That function does not write itself. We cannot hard-code it: “a good response is helpful and honest” is not an algorithm. We cannot table-lookup it: the table has $10^{12}$ rows. We have exactly one tool capable of compressing trillions of possible responses into a scalar with reasonable judgment: a large language model. So we take the same transformer that just finished supervised fine-tuning, rip off the head that predicts the next token, bolt on a tiny new head that predicts a single number, and train the resulting Frankenstein on the preference dataset with the Bradley-Terry log-likelihood. That creature is the reward model.

The one-line takeaway of this section: a reward model is the same transformer body you just SFT-trained, with the LM head replaced by a one-dimensional linear layer, optimised to assign higher scalar outputs to preferred responses than to dispreferred ones. The body learns the geometry; the head reads off the one number you care about.

Intuition: A Tasting Judge with a Single Knob

Picture a wine judge whose entire job is to compare two glasses and point at the better one. They never have to score on an absolute scale (“this is an 87” is meaningless); what they can do reliably is say “this is better than that.” Over a few thousand such comparisons their internal scoring drifts into something coherent: a function from glass to a real number, where bigger means better. They cannot tell you the units. The function is only well-defined up to a constant shift. But it ranks every wine they have ever tasted, and it generalises: they can rank a new wine after one sniff.

A reward model is exactly that judge, with the transformer body as its nose, the scalar head as its mouth, and the preference dataset as its training. We never ask it “how good is this response?” because the answer would be unanchored. We only ever ask “which of these two is better?” during training, and the answer's sign is what matters. At inference time inside PPO, we will use the absolute number — but only as a relative signal compared to other responses the policy generates for the same prompt. The constant cancels itself out the moment the policy starts comparing.

Two implications fall out of this intuition immediately. First, the head needs only one output dimension: we never need a probability distribution over “quality classes,” just a real number on a line. Second, the head's bias term is useless: it would shift every reward by the same amount, and the loss only sees differences. Both of these facts will pop up in the math, the code, and the architecture diagram below, and they explain why even a 70-billion-parameter reward model carries only an 8,192-parameter head.

The Mathematical Idea: Pair-Level Bradley-Terry on a Scalar Head

Let the transformer body, parameterised by $\phi$, map an input sequence to a sequence of hidden vectors $f_\phi(x) = (h_1, h_2, \dots, h_T) \in \mathbb{R}^{T \times d}$. Let the scalar head be a single linear layer with weight vector $w \in \mathbb{R}^d$ (and, often, no bias). A pooling function $\pi : \mathbb{R}^{T \times d} \to \mathbb{R}^d$ collapses the per-token states into one vector. The reward is then a single inner product:

$$r_\theta(x, y) = w^\top \, \pi\bigl(f_\phi([x; y])\bigr).$$

The full reward-model parameters are $\theta = (\phi, w)$: the backbone weights and the head weights. During training the body $\phi$ is also updated; we are not freezing the SFT checkpoint, we are continuing to fine-tune it for a new task.

Given a preference dataset $\mathcal{D} = \{ (x^{(i)}, y_c^{(i)}, y_r^{(i)}) \}_{i=1}^N$, the Bradley-Terry log-likelihood from §14.2 becomes the training objective. For one pair, the loss is

$$\ell_i(\theta) \;=\; -\log \sigma\!\bigl(\, r_\theta(x^{(i)}, y_c^{(i)}) \;-\; r_\theta(x^{(i)}, y_r^{(i)}) \,\bigr),$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid. We sum (or average) this over the dataset to get the full objective $\mathcal{L}(\theta) = \tfrac{1}{N} \sum_i \ell_i(\theta)$. The gradient with respect to either reward is delightfully clean:

$$\frac{\partial \ell_i}{\partial r_\theta(x^{(i)}, y_c^{(i)})} \;=\; -\bigl(1 - p_i\bigr), \qquad \frac{\partial \ell_i}{\partial r_\theta(x^{(i)}, y_r^{(i)})} \;=\; +\bigl(1 - p_i\bigr),$$

where $p_i = \sigma\bigl(r_\theta(x^{(i)}, y_c^{(i)}) - r_\theta(x^{(i)}, y_r^{(i)})\bigr)$ is the model's current probability of preferring the chosen response. Three things are worth noticing here. First, the magnitude of every gradient is $(1 - p_i)$, a number between 0 and 1 that automatically shrinks for pairs the model already ranks correctly with confidence. Second, the two gradients are equal and opposite, so a bias term added to the head would receive gradient zero on every pair: the bias is unidentifiable. Third, the chain rule through the head gives

$$\frac{\partial \ell_i}{\partial w} = -(1 - p_i)\bigl(h_c^{(i)} - h_r^{(i)}\bigr),$$

where $h_c^{(i)}, h_r^{(i)}$ are the pooled hidden states for the chosen and rejected responses. Geometrically: at every gradient-descent step, the head weight $w$ moves in the direction $h_c - h_r$, scaled by how wrong the model currently is. Over the whole dataset, $w$ drifts toward (approximately) the direction of the average chosen-minus-rejected difference vector. That single observation is enough to explain a large share of reward-model bugs: if your data has a length bias, $h_c - h_r$ systematically points in the “longer responses” direction, and your head learns “longer = better.”
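The closed-form head gradient can be sanity-checked numerically. The sketch below is a minimal illustration, not production code: it compares $-(1 - p)(h_c - h_r)$ against central finite differences of the pair loss. The vectors `w`, `h_c`, `h_r` and the helper names are assumptions chosen for illustration only.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(w, h_c, h_r):
    """Bradley-Terry loss -log sigma(w.h_c - w.h_r) for one pair."""
    margin = sum(wi * (c - r) for wi, c, r in zip(w, h_c, h_r))
    return -math.log(sigmoid(margin))

def closed_form_grad(w, h_c, h_r):
    """Gradient from the text: -(1 - p) * (h_c - h_r)."""
    margin = sum(wi * (c - r) for wi, c, r in zip(w, h_c, h_r))
    p = sigmoid(margin)
    return [-(1.0 - p) * (c - r) for c, r in zip(h_c, h_r)]

def numeric_grad(w, h_c, h_r, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_plus[k] += eps
        w_minus = list(w)
        w_minus[k] -= eps
        delta = pair_loss(w_plus, h_c, h_r) - pair_loss(w_minus, h_c, h_r)
        grad.append(delta / (2 * eps))
    return grad

# Toy values, assumed purely for illustration.
w = [0.3, -0.2, 0.1, 0.5]
h_c = [0.5, 1.0, -0.3, 0.2]
h_r = [0.1, 0.4, -0.1, -0.2]

analytic = closed_form_grad(w, h_c, h_r)
numeric = numeric_grad(w, h_c, h_r)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest |analytic - numeric| =", max_gap)
```

If the two gradients ever disagree by more than floating-point noise in your own implementation, the usual suspects are a flipped sign on the margin or a pooling bug upstream of the head.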

Why the Bradley-Terry loss self-saturates: as the margin grows, $p_i \to 1$ and $(1 - p_i) \to 0$. Easy pairs stop contributing once they are learned, without you having to schedule it. This is one of the reasons RM training tolerates noisy preference labels far better than naive regression: noisy labels can't blow up the gradient, only delay convergence.
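To see the self-saturation in numbers, the following sketch repeats plain SGD steps on a single toy pair and records the gradient scale $(1 - p)$ before each step. The hidden states, the learning rate of 1.0, and the 20-step horizon are illustrative assumptions, and only the head is trained.

```python
import math

# Toy pooled hidden states for one preference pair (illustrative values).
h_c = [0.5, 1.0, -0.3, 0.2]
h_r = [0.1, 0.4, -0.1, -0.2]
diff = [c - r for c, r in zip(h_c, h_r)]

w = [0.0] * 4      # zero-initialised head, no bias
lr = 1.0           # deliberately large so the effect is visible
pressures = []     # the value of (1 - p) seen by each step

for step in range(20):
    margin = sum(wi * di for wi, di in zip(w, diff))
    p = 1.0 / (1.0 + math.exp(-margin))
    pressures.append(1.0 - p)
    # Gradient descent on -log sigma(margin): w <- w + lr * (1 - p) * diff
    w = [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]

for step in (0, 1, 5, 19):
    print(f"step {step:2d}: 1 - p = {pressures[step]:.3f}")
```

The recorded values shrink monotonically: each step enlarges the margin, which lowers $(1 - p)$, which in turn makes the next step smaller.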

Architecture: Same Body, New Tiny Head

Reward-model architecture is almost embarrassingly minimal once you understand the math. You take the SFT checkpoint — the same one whose chat-template behaviour you trusted enough to RLHF on top of — and replace exactly one layer. The toggle below shows what changes.

[Interactive diagram: reward-model architecture]

Two architectural details deserve their own paragraphs because they are where most of the variance between codebases lives.

Head shape: $d \to 1$ with no bias

Production reward models almost universally use a single linear layer $w \in \mathbb{R}^{d \times 1}$. No activation, no MLP, no LayerNorm: those would saturate the gradient for high-confidence pairs and slow training to a crawl. The bias is almost always omitted (see the math above for why it has zero gradient). For a 70B-class backbone with $d = 8192$, the head is $8192 \times 1 = 8192$ parameters, about $10^{-7}$ of the model. Some experimental codebases use a two-layer MLP head, but the gain is marginal and the added depth makes the reward landscape harder for PPO to explore.

Initialisation: from the SFT model

The body is initialised from the SFT checkpoint, not from a base (pretrained-only) model. The reason is the chat template: a base model has never seen <|start_header_id|>assistant<|end_header_id|> as a high-frequency pattern, so its last-position hidden state is a random direction with respect to “a complete assistant turn ended here.” The SFT model has spent thousands of GPU-hours making that last-token state meaningful; we inherit that work for free. The head $w$ is initialised either to $\mathcal{N}(0, 1/d)$ or to zero. Zero init puts the initial reward at exactly zero everywhere, and the Gaussian init keeps it small (order one at most); either way the initial rewards stay small, which keeps PPO's downstream KL constraint sane on step 0.

Pooling: Which Hidden State Becomes the Reward?

The transformer body returns $T$ hidden states, one per token. We need exactly one number out of the head. Which token's hidden state should feed the head? Three answers have been tried; only one has survived.

| Pooling rule | How it works | Verdict |
| --- | --- | --- |
| Last-token pooling (default) | Read the hidden state at the LAST non-pad position. With a chat template that ends in <\|eot_id\|>, this is the position whose hidden state had access (via causal attention) to the whole conversation. | ✅ The universal default, used by InstructGPT, Llama-2, Llama-3, Tülu, Qwen, Mistral, every open-source RM trainer. Cheap, correct, and the hidden state at EOT is the natural ‘summary token’ in a causal LM. |
| Mean pooling | Average the hidden states across all non-pad positions. Borrowed from BERT-style classification. | ⚠️ Worse than last-token by ~3 points of preference accuracy. Causal LMs concentrate semantic information at the last token; averaging dilutes that signal with mostly-irrelevant prompt tokens. |
| Attention pooling (a learned query token) | Add an extra trainable [REWARD] token at the end, let the model attend over the response with one new attention head, read off that token's hidden state. | 🔬 Marginally better on some benchmarks (~0.5 pts) at the cost of changing the tokenizer and re-pretraining the embedding. Not worth it for production. Used in a few research papers. |

The only subtle thing about last-token pooling is the padding side. If the tokenizer right-pads (default for most causal LMs), the last real token is somewhere in the middle of the tensor, not at index $T - 1$. You must read hidden[batch_idx, attention_mask.sum(dim=1) - 1], not hidden[:, -1, :]. The PyTorch snippet later in this section gets this right; nearly every from-scratch RM implementation gets it wrong at least once.
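The indexing rule is easy to check on a toy batch. The sketch below uses plain nested lists instead of tensors so it runs without any dependencies; `last_token_index` and `pool_last` are hypothetical helper names, and each fake hidden vector simply stores its own (row, position) so the selected position is visible in the output.

```python
from typing import List

Vector = List[float]

def last_token_index(mask: List[int], padding_side: str = "right") -> int:
    """Index of the last real (mask == 1) token in one row."""
    if padding_side == "right":
        return sum(mask) - 1      # number of real tokens, minus one
    if padding_side == "left":
        return len(mask) - 1      # real tokens sit flush with the end
    raise ValueError(f"unknown padding_side: {padding_side!r}")

def pool_last(hidden: List[List[Vector]], masks: List[List[int]],
              padding_side: str = "right") -> List[Vector]:
    """Pick one hidden vector per row: the one at the last real token."""
    return [row[last_token_index(m, padding_side)]
            for row, m in zip(hidden, masks)]

# Fake hidden states of shape (B=2, T=6, d=2); vector = [row, position].
hidden = [[[float(b), float(t)] for t in range(6)] for b in range(2)]
right_masks = [[1, 1, 1, 1, 0, 0],   # 4 real tokens, 2 pads
               [1, 1, 1, 1, 1, 1]]   # no padding

pooled = pool_last(hidden, right_masks)
naive = [row[-1] for row in hidden]  # the hidden[:, -1, :] mistake

print(pooled)  # [[0.0, 3.0], [1.0, 5.0]]
print(naive)   # [[0.0, 5.0], [1.0, 5.0]]  (row 0 reads a pad position)

# With left padding the same "sum - 1" rule no longer finds the last token.
left_mask = [0, 0, 1, 1, 1, 1]
print(last_token_index(left_mask, "left"))  # 5
print(sum(left_mask) - 1)                   # 3
```

In tensor code the same selection is done in one indexing operation, but the row-by-row version makes it obvious which position each rule reads.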

Manual Numerical Walkthrough: One Gradient Step on One Pair

We will use a 4-dimensional toy where every number fits in your head. Walk through it with a pencil: once you have computed the gradient by hand, the production code becomes trivial to read.

The setup. Hidden dimension $d = 4$. One preference pair. The transformer body produced these two pooled hidden states (we are pretending the body is done; only the head trains in this walkthrough):

| Vector | h_1 | h_2 | h_3 | h_4 |
| --- | --- | --- | --- | --- |
| h_chosen | 0.5 | 1.0 | −0.3 | 0.2 |
| h_rejected | 0.1 | 0.4 | −0.1 | −0.2 |
| difference (h_chosen − h_rejected) | 0.4 | 0.6 | −0.2 | 0.4 |

Step 1: Initial head. $w = (0, 0, 0, 0)$, no bias. Both rewards are zero. The Bradley-Terry probability of preferring the chosen response is $\sigma(0) = 0.5$, and the loss is $-\log 0.5 = \log 2 \approx 0.693$. This is the canonical starting loss for every RM run.

Step 2: Compute the gradient. Using the formula from above, $\nabla_w \ell = -(1 - p)(h_c - h_r) = -0.5 \cdot (0.4, 0.6, -0.2, 0.4)$, giving $\nabla_w \ell = (-0.20, -0.30, +0.10, -0.20)$.

Step 3: One SGD step with learning rate $\eta = 1$ (chosen large so the numbers move visibly). The new head is $w \leftarrow w - \eta \nabla_w \ell = (0.20, 0.30, -0.10, 0.20)$.

Step 4: Re-score both responses.

| Quantity | Calculation | Value |
| --- | --- | --- |
| r_chosen | 0.20·0.5 + 0.30·1.0 + (−0.10)·(−0.3) + 0.20·0.2 | 0.10 + 0.30 + 0.03 + 0.04 = 0.47 |
| r_rejected | 0.20·0.1 + 0.30·0.4 + (−0.10)·(−0.1) + 0.20·(−0.2) | 0.02 + 0.12 + 0.01 − 0.04 = 0.11 |
| margin (r_chosen − r_rejected) | 0.47 − 0.11 | 0.36 |
| P(chosen ≻ rejected) | σ(0.36) | ≈ 0.589 |
| loss | −log 0.589 | ≈ 0.529 |

What happened in one step. The model went from random guessing (P = 0.500) to preferring the right response with 58.9% probability: a 9-percentage-point shift from a SINGLE gradient step on a SINGLE pair. The head weights point in exactly the direction $h_c - h_r$ (compare row 3 of the first table with the new $w$: they are proportional). Repeat this for tens of thousands of pairs over a real backbone and you have a working reward model.

What does NOT happen. The bias (if there were one) does not move. The body (if we were training it too) would also move, but in a slower, more diffuse way: the hidden states $h_c$ and $h_r$ each receive their own gradient, pulling them in opposite directions and making the body slightly more sensitive to features that distinguish good from bad responses.

The big asterisk. Real training uses AdamW (with momentum and second-moment scaling), gradient clipping at norm 1.0, lr ≈ 1e-5 instead of 1.0, and bf16 mixed precision. The gradient math is identical; the per-step magnitudes are smaller but accumulate over tens of thousands of steps. The raw gradient computed in the walkthrough is exactly what AdamW receives as input, before it applies its momentum and second-moment rescaling.

Visualizing the Bradley-Terry Loss Landscape

Move the two sliders and watch the loss curve, the preference probability, and the gradient pressure on each reward update in real time. The dashed blue line is the preference probability $P = \sigma(m)$; the solid purple line is the loss $L = -\log P$; the moving dot tracks your current margin.

[Interactive widget: reward-margin playground]

Three behaviours are worth feeling rather than just reading. First, the loss is asymmetric in the margin: at $m = -3$ the loss is about 3.05; at $m = +3$ it is about 0.05. A confidently wrong prediction costs roughly 60× more than a confidently right prediction of the same magnitude, which is exactly the property you want when you are stamping out mis-rankings. Second, the gradient is one-sided too: beyond roughly $m = +4$ its magnitude $1 - \sigma(m)$ falls below 0.02, so the optimizer's pressure to learn from that pair all but vanishes, while for $m \le -4$ it saturates near its maximum of 1. Easy pairs self-mute; close and mis-ranked pairs keep contributing. Third, the gradient on the chosen reward is the exact negation of the gradient on the rejected reward, which is why a bias term in the head has gradient zero on every pair, a fact you can read straight off the arrows next to the sliders.
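For readers without the interactive widget, the same curve can be tabulated in a few lines. This is a small sketch using only the standard library; `bt_loss` and `grad_magnitude` are illustrative helper names, and the margins printed are arbitrary sample points.

```python
import math

def bt_loss(m: float) -> float:
    """Stable -log sigma(m), i.e. softplus(-m)."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

def grad_magnitude(m: float) -> float:
    """|dL/dm| = 1 - sigma(m) = 1 / (1 + exp(m))."""
    return 1.0 / (1.0 + math.exp(m))

for m in (-4.0, -3.0, 0.0, 3.0, 4.0):
    print(f"m={m:+.0f}  loss={bt_loss(m):.3f}  |grad|={grad_magnitude(m):.3f}")

ratio = bt_loss(-3.0) / bt_loss(3.0)
print(f"loss(-3) / loss(+3) = {ratio:.1f}")
```

The table shows both asymmetries at once: the loss at a margin of -3 is dozens of times the loss at +3, and the gradient magnitude is close to 1 on the mis-ranked side but close to 0 on the correctly ranked side.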

Plain Python: A Reward Model in About 100 Lines of Standard-Library Code

Below is a complete reward-model training step in plain Python with nothing but the standard library and the math module. We use a 4-dimensional “hidden state” in place of the transformer body, but the loss, the gradient computation, and the update are identical to what runs inside HuggingFace TRL on Llama-3-8B. The moment you can read this code, you can read the production code.

Reward model + Bradley-Terry training step — pure Python
🐍reward_model_plain.py
h_chosen and h_rejected: stand-ins for transformer hidden states

In a real reward model these two 4-vectors are NOT hand-picked: they are the last hidden state of the transformer body after consuming the prompt and the chosen / rejected response. The whole point of this toy is that the head and the loss are the same whether d is 4 (here) or 8192 (Llama-3-70B). We pick numbers whose difference, [0.4, 0.6, -0.2, 0.4], is large enough for a single gradient step to separate the two responses visibly.

EXAMPLE
Real model: h_chosen = model(input_ids_chosen)['last_hidden_state'][:, -1, :]  with shape (1, 8192)
w and b: the entire reward head, four numbers and a scalar

The scalar head is the smallest linear layer you can build: d weights and one bias. For Llama-3-8B that is 4097 parameters out of 8B — about 0.00005% of the network. Initialising to zero is intentional: at step 0 every response gets reward 0, which guarantees the policy you initialise PPO with is identical in expectation to the SFT model (no KL surprise at step 0).

EXAMPLE
w starts at zeros(d).  Real RM code: nn.Linear(hidden_size, 1), usually with bias=False, initialised either to zero or with std=1/sqrt(d).
reward(): one dot product, one bias, one scalar out

This is literally what the production code does: take the last hidden state and run it through nn.Linear(d, 1). No softmax, no sigmoid, no tanh — the head returns an unbounded real number. The Bradley-Terry loss handles the squashing internally, and any squash applied here would just collapse the gradient when the margin is large.

EXAMPLE
reward(h_chosen) at step 0 = 0.0 * 0.5 + 0.0 * 1.0 + ... = 0.0
bt_loss(): numerically stable −log σ(margin)

Mathematically L = -log(1/(1+exp(-m))) = log(1+exp(-m)) = softplus(-m). The naive form overflows for large negative m (exp(-m) → inf). The if-branch we use here is the standard stable rewrite of softplus, the same idea that stable library routines such as PyTorch's F.binary_cross_entropy_with_logits rely on. Always reach for the stable form when you write a custom RM head: large reward gaps after a few hundred steps are normal and will break a naive implementation (an OverflowError in plain Python, inf or NaN in tensor code).

EXAMPLE
bt_loss(0, 0) = log(1 + exp(0)) = log 2 ≈ 0.693. Matches the printed step-0 loss.
Step 0 forward pass: both rewards are zero, margin is zero

With w = 0 the dot product is 0 for any hidden state, so r_chosen = r_rejected = 0. The Bradley-Terry probability of preferring chosen is sigmoid(0) = 0.5 (the model is purely guessing), and the loss equals -log 0.5 = log 2 ≈ 0.693. This is the natural baseline every RM run starts at — if your first-batch loss is far from log 2, your head initialisation is wrong.

EXAMPLE
Printed line: step 0: r_c=+0.000  r_r=+0.000  margin=+0.000  P=0.500  L=0.693
one_minus_P: the magnitude of every gradient on this pair

When the model is confidently right (P → 1), 1 − P → 0 and the gradient vanishes. When it is confidently wrong (P → 0), 1 − P → 1 and the gradient is at its maximum. This automatic difficulty-weighting is one reason the Bradley-Terry loss works so well on noisy human data — easy pairs get small gradients and stop pulling the head once they are learned, hard pairs keep pulling.

EXAMPLE
At step 0: P=0.5, so one_minus_P=0.5. Every component of grad_w is scaled by 0.5.
diff = h_chosen − h_rejected: the direction the head moves in

This is the heart of the RM gradient. Subtracting the rejected hidden state from the chosen one gives a vector that points 'from rejected to chosen' in feature space. The optimizer pushes w in exactly that direction. Geometrically: w starts at the origin and over training drifts toward the average difference vector across the whole preference dataset.

EXAMPLE
h_chosen − h_rejected = [0.4, 0.6, -0.2, 0.4]. After one step w becomes lr·(1−P)·this = 0.5·[0.4,0.6,-0.2,0.4] = [0.2, 0.3, -0.1, 0.2].
grad_w: closed form, no autograd needed

Because the head is linear and the loss is softplus(-margin), the gradient w.r.t. w is just (P−1) times the difference of the hidden states. We can write this by hand because there are five lines of math between input and loss. In PyTorch you would call loss.backward() and autograd would compute the same thing; the closed form is here so the reader sees what backward() actually does.

EXAMPLE
grad_w = -(1-P)·(h_c − h_r) = -0.5·[0.4,0.6,-0.2,0.4] = [-0.2, -0.3, 0.1, -0.2]
grad_b = 0: the bias is unidentifiable from preferences alone

Both r_chosen and r_rejected contain the same bias b. The loss only depends on the difference r_chosen − r_rejected, so b cancels exactly and its gradient is zero on every pair. This is not a bug — it is the algebra telling you that preference data only constrains rewards UP TO A CONSTANT. You can shift every reward by +100 and not change a single preference probability. Production RM code often drops the bias entirely (bias=False on nn.Linear).

EXAMPLE
If you naively initialise b = 5.0 and run forever, b stays at 5.0. The loss does not 'see' it.
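The shift invariance is quick to confirm. The sketch below scores the same pair with three different biases and shows that the preference probability does not change; the head weights and hidden states are the toy values from the walkthrough, and the bias values are arbitrary illustrations.

```python
import math

def reward(w, b, h):
    """Linear head with bias: r = w . h + b."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def preference_prob(r_c: float, r_r: float) -> float:
    """Bradley-Terry probability that chosen beats rejected."""
    return 1.0 / (1.0 + math.exp(-(r_c - r_r)))

w = [0.2, 0.3, -0.1, 0.2]
h_c = [0.5, 1.0, -0.3, 0.2]
h_r = [0.1, 0.4, -0.1, -0.2]

probs = []
for b in (0.0, 5.0, 100.0):   # arbitrary bias values
    p = preference_prob(reward(w, b, h_c), reward(w, b, h_r))
    probs.append(p)
    print(f"b={b:6.1f}  P(chosen > rejected) = {p:.6f}")

# dL/db = dL/dr_c + dL/dr_r = -(1 - p) + (1 - p)
one_minus_p = 1.0 - probs[0]
grad_b = -one_minus_p + one_minus_p
print("grad_b =", grad_b)
```

All three probabilities agree up to floating-point rounding, and the bias gradient is exactly zero because the two reward gradients cancel.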
One SGD step: lr = 1.0 here, 1e-5 in production

We use lr=1 so the change is visible in a single step. Production RM training uses lr≈1e-5 with AdamW because the transformer body is fine-tuning alongside the head and a big update on the head would shake the body off its SFT optimum. The head moves more in expectation than the body because its gradient is so much bigger per parameter, so some codebases give the head a higher LR than the body (for example 10x), a detail covered in §14.4.

EXAMPLE
After step: w = [0.0,0.0,0.0,0.0] − 1.0 · [-0.2,-0.3,0.1,-0.2] = [0.2, 0.3, -0.1, 0.2]
Step 1 forward pass: margin is now positive, loss has dropped

With the updated w, r_chosen = 0.2·0.5 + 0.3·1.0 + (-0.1)·(-0.3) + 0.2·0.2 = 0.1 + 0.3 + 0.03 + 0.04 = 0.47. r_rejected = 0.2·0.1 + 0.3·0.4 + (-0.1)·(-0.1) + 0.2·(-0.2) = 0.02 + 0.12 + 0.01 − 0.04 = 0.11. Margin = 0.36, P = σ(0.36) ≈ 0.589, loss ≈ 0.529. The model has learned to prefer the chosen response — by 9 percentage points after a SINGLE gradient step on a SINGLE pair.

EXAMPLE
step 1: r_c≈+0.470  r_r≈+0.110  margin≈+0.360  P≈0.589  L≈0.529
"""
A *tiny* reward model trained on a single preference pair, in pure Python
(standard library only).

The mechanism is identical to the one running inside trl, OpenRLHF, and
torchtune; we just strip away the transformer body and the optimizer
machinery so the math is naked.

The "transformer body" is replaced by two fixed, hand-picked 4-dim feature
vectors, one per response. The "scalar head" is a single weight vector w
plus a bias b. The loss is the Bradley-Terry log-likelihood of the chosen
response over the rejected one. We do ONE manual gradient step so the
reader can watch the numbers move.
"""

import math
from typing import List

# ---------------------------------------------------------------------------
# 1. Fake "hidden states" for two responses to the same prompt.
#    In a real RM these come from the LAST token of a fine-tuned
#    transformer. Here we hand-pick them so the arithmetic is easy.
# ---------------------------------------------------------------------------

# d = 4. The difference h_chosen - h_rejected is [0.4, 0.6, -0.2, 0.4].
h_chosen   = [ 0.5,  1.0, -0.3,  0.2]   # "Water boils at 100°C."
h_rejected = [ 0.1,  0.4, -0.1, -0.2]   # "Water boils when it gets hot."

# ---------------------------------------------------------------------------
# 2. The scalar reward head.
#    A single linear layer:   r = w · h + b
#    Initialised to zeros (the standard, KL-safe init for RM heads).
# ---------------------------------------------------------------------------

w = [0.0, 0.0, 0.0, 0.0]
b = 0.0

def reward(h: List[float]) -> float:
    # Reads the module-level w and b, so it always uses the latest head.
    return sum(wi * hi for wi, hi in zip(w, h)) + b

# ---------------------------------------------------------------------------
# 3. Bradley-Terry loss for one pair.
#       L = -log sigmoid(r_chosen - r_rejected)
#         = log(1 + exp(-(r_chosen - r_rejected)))
# ---------------------------------------------------------------------------

def bt_loss(r_c: float, r_r: float) -> float:
    m = r_c - r_r
    # numerically stable log(1 + exp(-m))
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

# ---------------------------------------------------------------------------
# 4. Forward pass at step 0.
#    Both rewards are zero (w=0), so the margin is zero, P=0.5, loss=log 2.
# ---------------------------------------------------------------------------

r_c = reward(h_chosen)
r_r = reward(h_rejected)
margin = r_c - r_r
prob   = 1.0 / (1.0 + math.exp(-margin))
loss   = bt_loss(r_c, r_r)

print(f"step 0: r_c={r_c:+.3f}  r_r={r_r:+.3f}  margin={margin:+.3f}  P={prob:.3f}  L={loss:.3f}")

# ---------------------------------------------------------------------------
# 5. Backward pass: closed-form gradients.
#
#    dL/dr_c   = -(1 - P)        = P - 1
#    dL/dr_r   = +(1 - P)        = 1 - P
#    dL/dw     = dL/dr_c * h_c + dL/dr_r * h_r
#                = -(1-P) * (h_chosen - h_rejected)
#    dL/db     = dL/dr_c + dL/dr_r = 0   (cancels: the loss only depends on the *difference*)
# ---------------------------------------------------------------------------

one_minus_P = 1.0 - prob
diff = [hc - hr for hc, hr in zip(h_chosen, h_rejected)]
grad_w = [-(one_minus_P) * di for di in diff]
grad_b = 0.0

print(f"          gradient on w: {[round(g, 3) for g in grad_w]}")

# ---------------------------------------------------------------------------
# 6. One SGD step (lr = 1.0 for visibility; in real RM training lr ~ 1e-5).
# ---------------------------------------------------------------------------

lr = 1.0
w  = [wi - lr * gi for wi, gi in zip(w, grad_w)]
b  = b - lr * grad_b

# ---------------------------------------------------------------------------
# 7. Forward pass at step 1: see the margin grow, the loss shrink.
# ---------------------------------------------------------------------------

r_c = reward(h_chosen)
r_r = reward(h_rejected)
margin = r_c - r_r
prob   = 1.0 / (1.0 + math.exp(-margin))
loss   = bt_loss(r_c, r_r)

print(f"step 1: r_c={r_c:+.3f}  r_r={r_r:+.3f}  margin={margin:+.3f}  P={prob:.3f}  L={loss:.3f}")
print(f"          new w: {[round(wi, 3) for wi in w]}")

# ---------------------------------------------------------------------------
# Expected output (you can run this and check):
#   step 0: r_c=+0.000  r_r=+0.000  margin=+0.000  P=0.500  L=0.693
#             gradient on w: [-0.2, -0.3, 0.1, -0.2]
#   step 1: r_c=+0.470  r_r=+0.110  margin=+0.360  P=0.589  L=0.529
#             new w: [0.2, 0.3, -0.1, 0.2]
# Repeating the step keeps growing the margin (there is no regulariser on a
# single pair), with ever smaller updates, while the loss decays toward 0.
# ---------------------------------------------------------------------------

PyTorch: A Production Reward Model on Top of a Transformer

The PyTorch version stacks the chosen and rejected sequences into a single forward pass, finds the last non-pad position with the attention mask, runs the scalar head, and computes the BT loss with F.logsigmoid. Strip the optimizer plumbing and you can fit the whole reward model in 60 lines.

Production reward model — HuggingFace + PyTorch
🐍reward_model_torch.py
AutoModel, not AutoModelForCausalLM: we throw away the LM head

AutoModel returns the transformer body without the (large) LM head, saving hidden_size × vocab_size parameters of GPU memory. For Llama-3-8B that is 4096 × 128256 ≈ 0.53B parameters, or about 1.05 GB at fp16, which is modest but not negligible even on an 80 GB H100. We will not regret throwing it away because the RM task does not need to score every position, only the final hidden state.

EXAMPLE
AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')  → ≈8.0B params
AutoModel.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')             → ≈7.5B params (no lm_head)
Scalar head: Linear(d, 1, bias=False), tiny init

bias=False because preference data is invariant to a constant shift of the reward: the bias is unidentifiable and would just add noise (see the grad_b note in the plain-Python explainer). std=1/sqrt(d) keeps the initial reward magnitudes near 1, which keeps the BT loss near log 2 at step 0. Some codebases zero-initialise the head instead, so every response starts with reward exactly 0; either is fine in practice.

EXAMPLE
For an 8B backbone with d = 4096, v_head has 4096 × 1 = 4096 parameters total, about 0.00005% of the model.
Stacked (chosen, rejected) batch shape (2B, T)

We do NOT run chosen and rejected through the model in two separate forward passes, which would double the kernel-launch and host-device sync overhead. Stacking them along dim 0 sends both members of every pair through one forward call, under the same weights and the same train/eval mode. Note that if dropout is enabled it still draws an independent mask for each row, so stacking alone does not make the two rows share dropout noise.

EXAMPLE
If B=8, batch entering the model has shape (16, T) — chosen rows 0–7, rejected rows 8–15.
attention_mask.sum(dim=1) − 1 → index of the last real token

Right-padding fills the end of short sequences with pad_token_id; the attention_mask is 1 on real tokens and 0 on pads. Summing the mask along the time axis gives the length of the real content, and subtracting 1 converts length → last index. This is the single most-bugged line in every from-scratch RM implementation: under left-padding the same expression points at an earlier position (a pad or a mid-sequence token), because the last real token then sits at index -1. Always check which padding side your tokenizer uses.

EXAMPLE
attention_mask = [1,1,1,1,0,0]  → sum = 4  → last_idx = 3 (the 4th token, 0-indexed)
Fancy-indexing the last hidden state

hidden[batch_idx, last_idx] is the PyTorch idiom for 'pick one position per row'. It is equivalent to a list comprehension over rows but runs on the GPU in one kernel. The output shape is (N, d) — one d-dim vector per response. This is the SOLE input the scalar head sees; everything else the transformer computed is discarded.

EXAMPLE
hidden.shape = (16, 1024, d); last_hidden.shape = (16, d), with d = 4096 for an 8B backbone and d = 8192 for a 70B one
v_head produces (N, 1), squeeze to (N,)

The Linear layer outputs one scalar per row but with a trailing length-1 dim — we squeeze it off so downstream code can call .mean() and friends without thinking about shape. The reward is an unbounded real number; the BT loss handles squashing internally.

EXAMPLE
rewards.shape = (16,);  example values: [2.3, -0.1, 1.8, 0.4, ..., 1.1, -0.7, ...]
Split rewards back into chosen and rejected halves

We stacked them as [chosen_0, ..., chosen_{B-1}, rejected_0, ..., rejected_{B-1}] before the forward pass, so the inverse is the obvious slice. Keep this ordering consistent across the codebase — flipping it silently inverts the loss sign and you end up training a 'worst-response selector'. trl, OpenRLHF, and torchtune all use the chosen-first convention.

EXAMPLE
B=8 → r_chosen = rewards[0:8], r_rejected = rewards[8:16]
margin_shift: Llama-2's per-pair margin labels

When the human annotator labels strength of preference (slight / strong / very strong), Llama-2 added a per-pair real-valued margin to the BT loss: loss = -log σ(r_chosen - r_rejected - m). 'Very strong' pairs require the model to put a bigger margin between the two responses; 'slight' pairs are happy with a tiny margin. The intuition is that disagreement-noisy 'slight' pairs should not pull the head as hard as confident 'very strong' pairs.

EXAMPLE
annotator label 'slight' → m = 0.0; 'very strong' → m = 1.0
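A margin-shifted loss of this form can be sketched in a few lines of plain Python. The label-to-margin mapping below copies only the two illustrative values from the example above; it is an assumption for demonstration, not the mapping used by any particular codebase, and `bt_margin_loss` is a hypothetical helper name.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log sigma(z)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_margin_loss(r_chosen: float, r_rejected: float, m: float = 0.0) -> float:
    """-log sigma(r_chosen - r_rejected - m); m = 0 recovers plain Bradley-Terry."""
    return -log_sigmoid(r_chosen - r_rejected - m)

# Illustrative mapping only (assumed), taken from the example in the text.
MARGINS = {"slight": 0.0, "very strong": 1.0}

r_c, r_r = 0.47, 0.11   # toy rewards from the walkthrough
loss_slight = bt_margin_loss(r_c, r_r, MARGINS["slight"])
loss_strong = bt_margin_loss(r_c, r_r, MARGINS["very strong"])

print(f"slight:      loss = {loss_slight:.3f}")
print(f"very strong: loss = {loss_strong:.3f}")
```

With identical rewards, the pair labelled "very strong" incurs the larger loss: the model is penalised until the reward gap exceeds the requested margin, not merely until it is positive.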
F.logsigmoid: numerically stable, batched, one line

F.logsigmoid(x) computes log σ(x) = -softplus(-x) stably for both large positive and large negative x. It is equivalent in math but not in floats to torch.log(torch.sigmoid(x)): for a large negative margin the sigmoid underflows to 0 and the log returns -inf, which turns the loss into inf and the gradients into NaN. Always use F.logsigmoid (or F.binary_cross_entropy_with_logits) for the RM loss.

EXAMPLE
loss = -F.logsigmoid(margin).mean()  →  scalar tensor with requires_grad=True
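The difference between the naive and the stable form can be reproduced without PyTorch. The sketch below is a plain-Python (float64) analogue, so the failure mode differs in detail: here the naive formula raises an OverflowError for a very negative input, whereas tensor code would typically produce inf or NaN instead. Both function names are illustrative.

```python
import math

def naive_log_sigmoid(x: float) -> float:
    """log(sigmoid(x)) written literally; fragile at both extremes."""
    return math.log(1.0 / (1.0 + math.exp(-x)))

def stable_log_sigmoid(x: float) -> float:
    """log sigma(x) = -softplus(-x), branching on the sign of x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

# Large negative margin: exp(800) does not fit in a float64.
try:
    naive_result = naive_log_sigmoid(-800.0)
except OverflowError:
    naive_result = None
print("naive(-800)  =", naive_result)
print("stable(-800) =", stable_log_sigmoid(-800.0))

# Large positive margin: the naive form rounds to exactly 0.0 and
# loses the tiny remaining signal that the stable form keeps.
print("naive(+40)   =", naive_log_sigmoid(40.0))
print("stable(+40)  =", stable_log_sigmoid(40.0))
```

The stable version returns -800.0 for the first case and a tiny negative number for the second, so neither extreme destroys the loss value.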
accuracy = (margin > 0).mean(): the metric trainers actually watch

Training loss on RMs is famously hard to interpret (it always sits between 0.5 and 0.69, looking flat). The metric that tells you whether the model is learning is the fraction of pairs where margin > 0 — i.e., whether chosen scored higher than rejected. A useful RM hits ≥ 0.70 on a held-out preference set; frontier RMs hit 0.80–0.90 depending on dataset quality. If accuracy is stuck at 0.5, your data is noisy or your head is dead.

EXAMPLE
Healthy curves: step 0 acc=0.51 → step 1000 acc=0.72 → step 5000 acc=0.78 (plateau).
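Pairwise accuracy is simple enough to compute by hand on a held-out set. The sketch below is a dependency-free version of the (margin > 0).mean() idea; the reward values are made up for illustration, and ties are counted as misses because the margin is not strictly positive.

```python
from typing import Sequence

def pairwise_accuracy(r_chosen: Sequence[float], r_rejected: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    if len(r_chosen) != len(r_rejected):
        raise ValueError("chosen and rejected must have the same length")
    if not r_chosen:
        raise ValueError("need at least one pair")
    correct = sum(1 for c, r in zip(r_chosen, r_rejected) if c - r > 0)
    return correct / len(r_chosen)

# Made-up held-out rewards for four preference pairs.
r_chosen = [2.3, -0.1, 1.8, 0.4]
r_rejected = [1.1, 0.5, 1.8, -0.7]

acc = pairwise_accuracy(r_chosen, r_rejected)
print(f"pairwise accuracy = {acc:.2f}")  # 0.50: pairs 1 and 4 are ranked correctly
```

Tracking this number on held-out pairs, rather than on the training batch, is what makes it a usable signal of generalisation.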
apply_chat_template: render the full (prompt + response) into the model's format

The RM scores RESPONSES, but the rendered string includes the prompt because the transformer needs context to evaluate the response. Using the SAME chat template as SFT is mandatory — the RM must see exactly what the policy will see at PPO time. add_generation_prompt=False because we have a complete assistant turn; we are not asking the model to generate.

EXAMPLE
Renders to '<|begin_of_text|><|start_header_id|>user<|end_header_id|>...prompt...<|eot_id|><|start_header_id|>assistant<|end_header_id|>...chosen...<|eot_id|>'
tokenize both halves with padding=True, then split

We tokenize the chosen and rejected lists CONCATENATED so the tokenizer pads them all to the same length. If we tokenized them in two separate calls, chosen would be padded to its own max length and rejected to its own — the two would have different shapes and we could not stack them into a (2B, T) tensor. add_special_tokens=False prevents a double BOS (the chat template already added one).

EXAMPLE
len(chosen_texts) = B, len(rejected_texts) = B, enc['input_ids'].shape = (2B, T_batch_max)
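The shape argument can be checked with a toy right-padder in plain Python. This is only a sketch of the idea, not the tokenizer's code; `PAD`, `right_pad`, and the token ids are hypothetical.

```python
PAD = 0  # hypothetical pad token id


def right_pad(seqs, pad_id=PAD):
    """Right-pad every sequence to the longest one; return ids and mask."""
    T = max(len(s) for s in seqs)
    ids = [s + [pad_id] * (T - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (T - len(s)) for s in seqs]
    return ids, mask


chosen = [[5, 6, 7, 8], [5, 9]]        # B = 2 token-id lists
rejected = [[5, 6, 7, 8, 9, 10], [5]]  # B = 2 token-id lists

# Joint call: all 2B rows share one length T.
ids, mask = right_pad(chosen + rejected)
B = len(chosen)
chosen_ids, rejected_ids = ids[:B], ids[B:]

# Index of the last real token per row, as the reward head needs it.
last_idx = [sum(row) - 1 for row in mask]

# Separate calls: the two halves end up with different lengths.
T_chosen = len(right_pad(chosen)[0][0])
T_rejected = len(right_pad(rejected)[0][0])
print(len(ids[0]), last_idx, T_chosen, T_rejected)
```

The joint call gives every row length 6, while the separate calls give 4 and 6, which cannot be stacked into one tensor.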
clip_grad_norm_(1.0): the safety net of every RM training run

Reward-model training is unusually prone to gradient spikes: a single mislabeled pair can produce a margin of -10 and a per-parameter gradient ~10x normal. Without clipping, one bad batch wipes the head in a single step. max_norm=1.0 is the conservative default (some teams use 0.5 for noisy crowd-sourced data, 2.0 for clean internal data). This is the single most important hyperparameter after the learning rate.

EXAMPLE
Without clip: occasional NaN loss after 2000-3000 steps. With clip=1.0: steady convergence across 100k steps.
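The idea behind global-norm clipping can be sketched in a few lines of plain Python. It mirrors what torch.nn.utils.clip_grad_norm_ does conceptually: every gradient is scaled by one common factor when the global L2 norm exceeds max_norm. It is not the library code; `clip_by_global_norm` and the toy gradients are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradient vectors by one factor so the global L2 norm <= max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total


# A spiky batch: global norm 5.0 gets scaled down to about 1.0.
spiky, spiky_norm = clip_by_global_norm([[3.0, 4.0], [0.0]], max_norm=1.0)
spiky_after = math.sqrt(sum(g * g for vec in spiky for g in vec))

# A normal batch: global norm 0.5 is below the threshold and is left alone.
calm, calm_norm = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(spiky_norm, spiky_after, calm_norm, calm)
```

Because the scale factor is shared, clipping shrinks the step but keeps its direction.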
"""
A production-shape reward model on top of any HuggingFace causal LM.

This follows the same recipe as the reward-model trainers in TRL
(trl.RewardTrainer), OpenRLHF, and torchtune, with all the
parallelism / sharding / mixed-precision plumbing removed so the
mechanism is visible.

Two contracts to remember while reading:

  1. Input shape: (2B, T). For each batch of B preference pairs we
     concatenate the B chosen and the B rejected token sequences
     (chosen first) and run ONE forward pass, so both members of
     every pair go through the same weights in the same mode.

  2. Reward extraction: we read the LAST NON-PAD hidden state, not
     hidden_states[:, -1, :]. With left-padding the last position
     would be the response; with right-padding it would be a pad
     token. Reading at the last *real* token is the only safe rule.
     The index arithmetic below assumes RIGHT-padding, which
     train_step() enforces.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

IGNORE_INDEX = -100  # not used by RM loss directly but kept consistent with SFT


class RewardModel(nn.Module):
    """A transformer body + a linear scalar head."""

    def __init__(self, base_model_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
        )
        d = self.backbone.config.hidden_size
        # No bias: the bias is unidentifiable from preference data.
        self.v_head = nn.Linear(d, 1, bias=False)
        nn.init.normal_(self.v_head.weight, std=1.0 / (d ** 0.5))

    def forward(
        self,
        input_ids: torch.Tensor,       # (N, T)  N = 2B
        attention_mask: torch.Tensor,  # (N, T)
    ) -> torch.Tensor:
        # 1. Run the transformer body. We do not need logits, just hidden_states.
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden = out.last_hidden_state  # (N, T, d)

        # 2. Find the index of the LAST non-pad token in each row.
        #    attention_mask has 1s on real tokens and 0s on pads.
        #    (Valid for right-padded batches only.)
        last_idx = attention_mask.long().sum(dim=1) - 1   # (N,)

        # 3. Gather the hidden state at that index.
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]          # (N, d)

        # 4. Scalar head -> one reward per row. The backbone emits bf16 but
        #    the head is float32, so cast first to avoid a dtype mismatch.
        last_hidden = last_hidden.to(self.v_head.weight.dtype)
        rewards = self.v_head(last_hidden).squeeze(-1)     # (N,)
        return rewards


def reward_model_loss(
    rewards: torch.Tensor,  # (2B,)
    margin_shift: torch.Tensor | None = None,  # optional per-pair label margin
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Bradley-Terry log-likelihood loss for a stacked (chosen, rejected) batch."""
    # Split back into (B,) chosen and (B,) rejected.
    B = rewards.size(0) // 2
    r_chosen   = rewards[:B]
    r_rejected = rewards[B:]
    margin = r_chosen - r_rejected
    shifted = margin
    if margin_shift is not None:
        shifted = margin - margin_shift  # Llama-2-style: stronger preferences -> larger required margin
    # F.logsigmoid is numerically stable; equivalent to -softplus(-x).
    loss = -F.logsigmoid(shifted).mean()
    # Accuracy uses the raw margin: did chosen score higher than rejected?
    accuracy = (margin > 0).float().mean()
    return loss, accuracy, margin


def train_step(
    model: RewardModel,
    tokenizer,
    batch: list[dict],   # each item: {"prompt", "chosen", "rejected"}
    optimizer,
):
    # 1. Render every (prompt, response) into the model's chat template.
    chosen_texts   = [tokenizer.apply_chat_template(
        [{"role": "user", "content": b["prompt"]},
         {"role": "assistant", "content": b["chosen"]}],
        tokenize=False, add_generation_prompt=False)
        for b in batch]
    rejected_texts = [tokenizer.apply_chat_template(
        [{"role": "user", "content": b["prompt"]},
         {"role": "assistant", "content": b["rejected"]}],
        tokenize=False, add_generation_prompt=False)
        for b in batch]

    # 2. Tokenize. Right-pad to the same length within the batch.
    #    (The tokenizer must have a pad token set for padding=True to work.)
    tokenizer.padding_side = "right"  # last_idx in forward() assumes right-padding
    enc = tokenizer(
        chosen_texts + rejected_texts,
        padding=True, truncation=True, max_length=4096,
        return_tensors="pt", add_special_tokens=False,
    )

    # 3. Forward + loss.
    rewards = model(enc["input_ids"].to(model.backbone.device),
                    enc["attention_mask"].to(model.backbone.device))
    loss, acc, margin = reward_model_loss(rewards)

    # 4. Backward + step.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    return {"loss": loss.item(), "acc": acc.item(), "margin": margin.mean().item()}

From toy script to real RM training

The script above is approximately what runs inside the reward-model trainers of TRL (trl.RewardTrainer), OpenRLHF, and torchtune, minus the FSDP wrap, the gradient checkpointing, the bf16 mixed-precision casts, and the distributed data loader. The per-pair forward and the loss function are essentially the same. Once those tensors come out, scaling up is the same engineering problem as scaling up SFT: shard the optimizer state, checkpoint activations every 4 layers, and use a fused AdamW kernel.

At Massive Scale: Why Reward Modeling Is the Bottleneck of RLHF

For a 7B-class model an RM run is a 12-hour, 8-GPU job and looks almost identical to SFT. At the frontier scale the picture changes completely, in three ways that each set their own engineering agenda.

Memory: two forward passes per pair, double the activations

SFT activations for a 70B model on a 4,096-token sequence already push a single H100 to 70 GB at bf16. RM training doubles that because every pair contains TWO sequences (chosen and rejected) and we stack them along the batch dimension so that both go through one forward pass. Concretely, a per-GPU batch of B = 8 pairs produces a (16, T, d) activation tensor, the same memory footprint as SFT with B = 16. Almost every RM run uses activation checkpointing every 1-4 layers as a hard requirement, not an optimisation.

Compute: 6ND with D being pair-tokens

The standard scaling-law FLOPs estimate C ≈ 6ND applies, with N the parameter count and D the number of training tokens. For an RM, D is the total number of tokens across BOTH sequences in every pair, typically 1.5–2× larger than an equivalently-sized SFT corpus measured in pairs. A 70B RM trained on 500k preference pairs (about 1,000 tokens per pair, so D ≈ 5·10^8) is roughly 6 · 7·10^10 · 5·10^8 ≈ 2·10^20 FLOPs per epoch. With each H100 peaking near 10^15 dense bf16 FLOP/s, that is on the order of an hour on a 256-H100 island at modest utilisation; extra epochs, hyperparameter sweeps, and periodic RM refreshes multiply it. How this compares with the cost of the PPO run that consumes the RM downstream depends on the rollout budget.
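The arithmetic can be reproduced with a short back-of-the-envelope script. The tokens-per-pair figure is the one implied by D = 5·10^8 over 500k pairs. The per-GPU peak throughput and the utilisation are assumptions chosen for illustration, and the function names are hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate C ~ 6 * N * D (forward + backward)."""
    return 6.0 * n_params * n_tokens


def wall_clock_hours(flops: float, gpus: int, peak_flops: float, mfu: float) -> float:
    """Hours needed at a given per-GPU peak and model-FLOPs utilisation."""
    return flops / (gpus * peak_flops * mfu) / 3600.0


N = 70e9                 # parameters
pairs = 500_000          # preference pairs
tokens_per_pair = 1_000  # chosen + rejected together (implied by D = 5e8)
D = pairs * tokens_per_pair

C = training_flops(N, D)  # 2.1e20 FLOPs

# Assumptions: ~1e15 dense bf16 FLOP/s per GPU (rounded), 30% utilisation.
hours = wall_clock_hours(C, gpus=256, peak_flops=1e15, mfu=0.30)
print(f"D = {D:.1e} tokens, C = {C:.2e} FLOPs, ~{hours:.2f} h per epoch")
```

Changing `mfu` or `tokens_per_pair` shows how sensitive the wall-clock figure is to those two assumptions.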

Communication: small batches, low utilisation

Preference data is expensive to collect (a few dollars per pair from good annotators), so corpora are small: 50k to 1M pairs is typical, compared with 5M+ for SFT. Small datasets force small global batches (you cannot afford many passes over the data, so each pass has to yield enough optimizer steps), and small batches mean the all-reduce after each step is a larger fraction of total time. On 256 H100s an RM step might be 40% communication stall and 60% compute, compared with 15/85 for an SFT step on the same hardware. Frontier teams compensate by overlapping the all-reduce with the backward pass and by tuning gradient_accumulation_steps so that each all-reduce is amortised over enough compute.

The reward-hacking budget

A reward model that scores 78% on held-out preferences sounds good, but PPO will then push the policy through 10^5 rollout samples per update, and the policy will quickly discover those of the remaining 22% of cases where the RM is wrong in a systematic way. The bigger the RM, the more expressive its preference function, the slower this “reward hacking” sets in. Frontier teams almost always choose an RM at least as large as the policy it scores. Anthropic reported scaling RM and policy roughly together in the Claude papers; OpenAI did the same with GPT-4 RLHF; Llama-3 used a 70B RM for the 70B policy.

Engineering Reality: The Catalogue of Reward-Model Pathologies

Reward models fail silently. The training loss looks fine — it slides from 0.693 down to 0.45 and plateaus — and standard ML hygiene tells you the run is healthy. Then PPO produces a model that writes 800-word responses to “hi,” or hedges every answer with “as an AI assistant,” or refuses anything that smells like a question. The pathology was in the RM all along; PPO just amplified it. The list below is what every RM engineer ends up memorising the hard way.

  • Length bias. Chosen responses are systematically longer than rejected ones in the training data (annotators tend to prefer more detailed answers). The RM learns “longer = better,” and PPO blows the policy's response length out by 2–3×. Catch: regress reward against length on a held-out set; if R^2 > 0.3, you have a length bias. Fix: length-control the training set or add a length penalty to the loss.
  • Padding-side bug. The code reads hidden[:, -1, :] when the tokenizer right-pads. The reward becomes a function of the pad-token hidden state, which is nearly constant across rows. Loss looks healthy but accuracy hovers at 50%. Fix: always use attention_mask.sum(dim=1) - 1.
  • Stale chat template. The RM uses Llama-3's template; the PPO rollout code uses Llama-2's. The hidden state at “the last non-pad token” means different things in the two formats, so the reward distribution is shifted by ~1 standard deviation between training and serving. PPO chases the shift.
  • Reward exploding to ±∞. Without grad clipping, one mislabeled pair with a 12-token chosen response and a 4000-token rejected response produces a huge difference vector h_c - h_r and a huge gradient on w. The head NaNs. Fix: clip_grad_norm_(1.0) on every step.
  • Distribution shift onto the policy. The RM was trained on responses written by humans; PPO rolls out responses written by the policy. After 1000 PPO steps the policy's style is far enough from the human style that the RM is no longer in distribution. Fix: collect a small batch of fresh preferences on the current policy's outputs every few hundred PPO steps and continue training the RM. This is the “online RM” loop and is part of every frontier post-training pipeline.
  • Single-annotator overfitting. 70% of your preference data came from one contractor who has strong opinions about Oxford commas. The RM learns to obsess over Oxford commas. Fix: aggregate at least 3 annotators per pair and use the majority vote; monitor inter-annotator agreement on a calibration set.
  • Same-prompt leak. The dev set and train set share some prompts (different responses). The RM looks great on dev (~85% accuracy) but is overfit to specific prompts. PPO discovers this immediately. Fix: split by prompt hash, not by row index.
  • Bias term left on. Someone enabled bias=True on the head. The bias cancels in r_chosen - r_rejected, so it has gradient zero on every pair and stays at its random init value forever, adding a small constant offset to every reward. Harmless in training; sometimes causes weird numerical comparisons at serving time if downstream code looks at the absolute reward. Fix: bias=False always.
  • Two-tower failure mode. A junior engineer runs chosen and rejected through two separate forward passes (different RNG seeds, different dropout masks). The estimated margin is much noisier than necessary; training converges 30% slower. Fix: stack and run once.
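The length-bias check from the list above can be sketched in plain Python. For a one-variable linear regression, R^2 equals the squared Pearson correlation, which is what the hypothetical helper `r_squared` computes. The lengths and rewards are made-up numbers, not measurements.

```python
def r_squared(xs, ys):
    """R^2 of a one-variable linear fit (= squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, nothing to explain
    return (sxy * sxy) / (sxx * syy)


# Made-up held-out data: response length in tokens vs. RM reward.
lengths = [50, 120, 200, 400, 800]
rewards = [0.1, 0.3, 0.2, 0.9, 1.6]

r2 = r_squared(lengths, rewards)
LENGTH_BIAS_THRESHOLD = 0.3  # threshold quoted in the list above
print(f"R^2 = {r2:.2f}, length-biased: {r2 > LENGTH_BIAS_THRESHOLD}")
```

On a real held-out set the same function would be fed the token counts and the RM scores of the responses; a value above the threshold is the signal to length-control the data.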

The frontier-team practice for catching these is a small, hand-curated battery of RM unit tests: a handful of trivial pairs (“helpful, accurate answer” vs. “refusal”, “direct answer” vs. “800 words of waffling”, “correct math” vs. “subtly wrong math”) on which any working RM should produce a margin of at least +1. Running these every checkpoint catches 80% of the pathologies above before they leak into PPO.
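One way such a battery could be wired up is sketched below. It is a minimal illustration under stated assumptions, not any team's actual harness: `score` stands for whatever callable wraps the RM, `toy_score` is a stand-in scorer invented for the demo, and the pairs are placeholder text.

```python
from typing import Callable, List, Tuple

# (prompt, better response, worse response); contents are illustrative placeholders.
BATTERY: List[Tuple[str, str, str]] = [
    ("What is 2 + 2?", "4", "I cannot help with that."),
    ("hi", "Hello! How can I help?", "Hello! " + "Let me elaborate at length. " * 100),
]


def run_battery(
    score: Callable[[str, str], float],
    pairs: List[Tuple[str, str, str]],
    min_margin: float = 1.0,
) -> List[Tuple[str, float]]:
    """Return the (prompt, margin) of every pair whose margin is below min_margin."""
    failures = []
    for prompt, good, bad in pairs:
        margin = score(prompt, good) - score(prompt, bad)
        if margin < min_margin:
            failures.append((prompt, margin))
    return failures


def toy_score(prompt: str, response: str) -> float:
    """Demo-only scorer: penalises refusals and very long replies."""
    penalty = 2.0 if "cannot" in response else 0.0
    return -penalty - len(response) / 500.0


def length_loving_score(prompt: str, response: str) -> float:
    """Demo-only scorer with a deliberate length bias."""
    return float(len(response))


print(run_battery(toy_score, BATTERY))                 # no failures expected
print(len(run_battery(length_loving_score, BATTERY)))  # both pairs fail
```

In a real run `score` would render the pair with the chat template and call the reward model; any failure returned by `run_battery` would block the checkpoint from going into PPO.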

The mental model that unifies this section: a reward model is a tasting judge with a single knob. The transformer body is the palate; the scalar head is the knob; Bradley-Terry is the training protocol. Every detail in this section — pooling, no-bias head, last-token indexing, grad clipping — is a guardrail against the judge being fooled by something other than the response's actual quality. Get the guardrails right and the next section (§14.4, rule-based rewards) tells you how to teach the judge new categories beyond what preference data can reach.