The Real Problem: Reward Models Lie at Scale
The previous two sections built a neural reward model from human preferences. It works. It also breaks in a very specific, very expensive way: the longer you train against it, the more it lies.
Here is the failure, concretely. You train a 7B reward model on 100k preference pairs. Its validation accuracy is 72%. You plug it into PPO and start optimising the policy against it. For the first few hundred steps everything looks great — reward goes up, eval scores go up. Then around step 500 the curves diverge. Reward keeps climbing. Eval scores plateau, then drop. By step 2000 the reward model is rating the policy's outputs above the human-written gold standard, and the policy is producing fluent-sounding nonsense that hits every pattern the reward model learned to like.
This is reward hacking, and it is the central engineering risk of RLHF at scale. The reward model is a finite neural network. Its decision boundary has cracks. The policy is a much larger neural network whose entire job during RL is to find and exploit those cracks. The bigger and better-trained the policy, the faster it finds them.
The brutal asymmetry: A reward model trained on 100k preference pairs has at most roughly 100k bits of information about what humans like (one bit per binary comparison). A 671B-parameter policy doing 10M rollouts has roughly 10M opportunities per epoch to find a pattern the reward model wrongly labelled as good. Given enough optimisation pressure, the policy wins this race.
Now look at one specific class of tasks (math, code, theorem proving, structured extraction, constrained generation) and notice that we are paying for a reward model we do not need. For these tasks, the human preference is not the ground truth. The ground truth is the ground truth. $x^2 - 5x + 6 = 0$ has solutions 2 and 3 whether or not a labeller agrees. The unit tests pass or they do not. The JSON parses or it does not.
For verifiable tasks, the reward model is a noisy approximation of a cheap, deterministic function. Why train a 7B network to predict what a five-line regex already knows?
Intuition: Use a Judge Made of Code, Not Neurons
Imagine you are training a high-school student. You can grade their algebra homework in two ways. Option A: hire another student to compare answers and tell you which one looks more correct. Option B: solve the problem yourself and check whether the digits match.
Option A is the reward model. It is necessary when the task is subjective — "was this essay persuasive" — but it is wasteful and error-prone when the task is verifiable. Option B is rule-based reward. It is faster, cheaper, more accurate, and impossible to reward-hack in the same way as a neural judge.
Three properties make a task a good candidate for rule-based reward:
- Deterministic ground truth. The right answer is not a matter of opinion: the unit tests pass or they do not, and the XML validates or it does not.
- Cheap verification. Checking the answer is orders of magnitude cheaper than generating it. A regex runs in microseconds. A unit test runs in milliseconds. The policy that produced the output ran for seconds and hundreds of GPU-watts.
- Bounded reward. The verifier outputs a number in a known range (usually $[0, 1]$ or a small integer). This is what makes the group-advantage normalisation in GRPO stable.
When all three hold, you should not be training a reward model. You should be writing 50 lines of Python that the model is graded against millions of times during RL. The Python is the judge.
The Math: A Deterministic, Sparse, Verifier-Defined Reward
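Let $\pi_\theta$ be the policy (the language model with parameters $\theta$), $x$ a prompt, and $y \sim \pi_\theta(\cdot \mid x)$ a rollout sampled from the policy. The classical RLHF objective is:
$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta \, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$
where $r_\phi$ is the neural reward model with parameters $\phi$, and the KL term (weighted by $\beta$) anchors the policy to a frozen reference $\pi_{\mathrm{ref}}$. The whole machinery of the previous section was about learning $r_\phi$ from human preferences.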
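Rule-based reward replaces $r_\phi$ with an explicit function $R(x, y)$ that has no learnable parameters. The verifier is the function. Concretely, $R$ is a weighted sum of rule-level scores:
$$R(x, y) = \sum_{k=1}^{K} w_k \, v_k(x, y)$$
Each $v_k$ is a verifier: a piece of code, not a neural net. The weights $w_k$ are hand-set hyperparameters. For a math task we might have:
$$R(x, y) = 1.0 \cdot v_{\text{answer}}(x, y) + 0.1 \cdot v_{\text{format}}(x, y)$$
where $v_{\text{answer}}$ is 1 if the extracted boxed value matches the ground truth and 0 otherwise, and $v_{\text{format}}$ is 1 if the rollout has the expected `<think>` / `<answer>` structure. The maximum possible reward is $1.0 + 0.1 = 1.1$.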
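Substituting the rule-based reward $R(x, y)$ back into the RL objective and using GRPO's group-relative advantage:
$$A_i = \frac{R(x, y_i) - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{j=1}^{G} R(x, y_j), \qquad \sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G}\big(R(x, y_j) - \mu\big)^2}$$
where $G$ is the group size (the number of rollouts sampled per prompt), and $y_i$ is the $i$-th rollout. The mean and standard deviation are computed inside the group, so the advantages have zero mean and unit variance by construction (whenever $\sigma > 0$). This is what lets GRPO drop the value network, a critical computational saving when the policy is already 671B parameters.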
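The "zero mean and unit variance by construction" claim can be checked in two lines. This is a short derivation, assuming the standard deviation is the population one (dividing by $G$, not $G - 1$) and that $\sigma > 0$; here $R_i$ is shorthand for the reward of the $i$-th rollout, $\mu$ and $\sigma$ are the group mean and standard deviation, and $A_i = (R_i - \mu)/\sigma$.

$$\begin{aligned}
\frac{1}{G}\sum_{i=1}^{G} A_i &= \frac{1}{\sigma}\Big(\frac{1}{G}\sum_{i=1}^{G} R_i - \mu\Big) = \frac{\mu - \mu}{\sigma} = 0 \\
\frac{1}{G}\sum_{i=1}^{G} A_i^2 &= \frac{1}{\sigma^2}\cdot\frac{1}{G}\sum_{i=1}^{G}\big(R_i - \mu\big)^2 = \frac{\sigma^2}{\sigma^2} = 1
\end{aligned}$$

With the sample standard deviation (dividing by $G - 1$) the mean is still zero, but the second line gives $(G-1)/G$ rather than exactly 1.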
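Finally, the PPO-clipped surrogate is computed token by token:
$$\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \min\!\Big(\rho_{i,t}\, A_i,\; \mathrm{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\, A_i\Big) \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$
where $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the per-token importance ratio between the current policy and the sampling policy, and $\epsilon$ is the clip range.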
Read the surrogate as: for every token in every rollout with positive advantage, push the policy's probability of that token up (clipped so the step is bounded); for every token in every rollout with negative advantage, push it down. The KL penalty stops the whole thing from flying off.
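The four clipping cases are easier to see with numbers. The following is a minimal sketch of the clipped term for one token and its mean over one rollout; the function names and the default clip range of 0.2 are assumptions made for illustration, and the KL penalty is left out.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-clipped objective for a single token.

    logp_new / logp_old are the log-probabilities of the sampled token under
    the current policy and under the policy that generated the rollout.
    clip_eps = 0.2 is an assumed default, not a value from the text.
    """
    ratio = math.exp(logp_new - logp_old)                  # importance ratio
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the min makes the bound pessimistic: the objective never rewards
    # moving the ratio further than the clip range in the helpful direction.
    return min(ratio * advantage, clipped_ratio * advantage)


def rollout_objective(logps_new, logps_old, advantage, clip_eps=0.2):
    """Mean clipped objective over the tokens of one rollout.

    Every token in the rollout shares the same rollout-level advantage.
    """
    terms = [
        clipped_token_objective(new, old, advantage, clip_eps)
        for new, old in zip(logps_new, logps_old)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 1.5):
        for advantage in (+1.0, -1.0):
            value = clipped_token_objective(math.log(ratio), 0.0, advantage)
            print(f"ratio={ratio:.1f} advantage={advantage:+.0f} -> {value:+.2f}")
```

With a positive advantage the term stops growing once the ratio passes $1 + \epsilon$; with a negative advantage it stops improving once the ratio drops below $1 - \epsilon$. In both cases the gradient through that token vanishes, which is what bounds the step.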
The mathematical novelty of rule-based RL is not in the surrogate, which is just PPO. It is in what fills the reward slot $R(x, y)$: a deterministic, parameter-free function that returns the same number every time. No noise in the target. No reward-model overfitting. No drift.
The Three Families of Verifiable Tasks
Almost every rule-based reward in production falls into one of three families, distinguished by how the verifier reads the rollout.
| Family | Verifier | Datasets | Reward shape |
|---|---|---|---|
| Exact-match | String compare against ground truth (often after a regex extract) | GSM8K, MATH, AIME, MMLU, TriviaQA | Binary 0 / 1 |
| Regex / format | Pattern match against a target schema | JSON-mode, XML-mode, R1 think+answer tags, function-calling | Binary 0 / 1, occasionally fractional for partial schemas |
| Execution | Run the model's output, observe outcome (return code, test results, simulator state) | HumanEval, MBPP, CodeContests, theorem-prover tactic application | Fractional, fraction of tests passing |
Exact-match
Used for any question with a single canonical answer. The verifier extracts the answer with a regex (typically the contents of `\boxed{...}` for math, a tagged span for QA, a number for arithmetic) and compares against the ground truth. The trickiest engineering is the normalisation step: should equivalent spellings of the same value (a fraction, its decimal form, the same number with trailing zeros) all match? The major math benchmarks (MATH, AIME) ship with reference normalisers; using them is the difference between a 60% and a 75% pass@1 measurement on the same model.
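To make the normalisation problem concrete, here is a minimal sketch of an exact-match verifier that treats numerically equal answers as the same. It is not one of the reference normalisers mentioned above: the names `normalise` and `exact_match` are hypothetical, and the regex deliberately does not handle nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re
from fractions import Fraction

# Captures the contents of \boxed{...}; nested braces are NOT supported.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def normalise(answer: str):
    """Map an answer string to a canonical value.

    Integers, decimals and simple a/b fractions become exact Fractions, so
    "0.5" and "1/2" compare equal. Anything else falls back to a trimmed,
    lower-cased string comparison.
    """
    text = answer.strip().replace("$", "")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return text.lower()


def exact_match(rollout: str, truth: str) -> float:
    """1.0 if the last boxed value in the rollout equals the ground truth."""
    boxed = BOXED_RE.findall(rollout)
    if not boxed:
        return 0.0                      # nothing extractable: no credit
    return 1.0 if normalise(boxed[-1]) == normalise(truth) else 0.0


if __name__ == "__main__":
    print(exact_match(r"so the answer is \boxed{0.5}", "1/2"))
    print(exact_match(r"\boxed{2.0}", "2"))
    print(exact_match(r"\boxed{3}", "2"))
```

Using exact rational arithmetic avoids floating-point tolerance questions for the numeric cases; symbolic answers would need a real normaliser.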
Regex / format
Used for structural priors. DeepSeek-R1's breakthrough was showing that even a TINY format reward (weight 0.1) is enough to bootstrap chain-of-thought from a base model that has never seen an SFT chat dataset. The model learns to wrap its reasoning in `<think>` tags purely because the verifier hands out a small bonus when it does. Format rewards are also how you cheaply enforce JSON-mode, function-calling syntax, and language-consistency (e.g. "every sentence is in English").
Execution
The most expensive verifier. The model emits code; you write the code to disk in a sandbox and run it against hidden unit tests. Reward is the fraction of tests that pass. Two engineering constraints dominate at scale: timeouts (infinite loops are the single most common failure mode) and isolation (the sandbox must prevent the policy's code from corrupting the verifier or eating the network). DeepMind's AlphaCode papers report ~30% of model outputs hit either a timeout or an unhandled exception during early training; the timeout is not a corner case, it's the median rollout.
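The timeout handling described above can be sketched with the standard library alone. This is an illustration under loud assumptions: `run_test` and `execution_reward` are hypothetical names, each test is a snippet of assertions appended to the model's code, and a plain subprocess is NOT a sandbox. It shows the timeout and pass-fraction logic only; real isolation has to be layered around it.

```python
import os
import subprocess
import sys
import tempfile


def run_test(solution_code: str, test_code: str, timeout_s: float = 2.0) -> bool:
    """Run one test against the model's code in a fresh interpreter.

    Returns True only for a clean exit (return code 0) inside the time budget.
    WARNING: this provides no filesystem, network or memory isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(solution_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, path],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False                # infinite loops end up here
        return result.returncode == 0   # exceptions give a non-zero code


def execution_reward(solution_code: str, tests, timeout_s: float = 2.0) -> float:
    """Fraction of tests that pass; 0.0 when there are no tests."""
    if not tests:
        return 0.0
    passed = sum(run_test(solution_code, test, timeout_s) for test in tests)
    return passed / len(tests)


if __name__ == "__main__":
    tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
    print(execution_reward("def add(a, b):\n    return a + b", tests))
    print(execution_reward("def add(a, b):\n    return a - b", tests))
    print(execution_reward("while True:\n    pass", tests, timeout_s=1.0))
```

Running each test in its own process keeps one crash or hang from contaminating the others, at the cost of interpreter start-up time per test.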
Manual Numerical Walkthrough
We will compute one GRPO step on a single math prompt with $G = 4$ rollouts. Tiny enough to verify every number; structurally identical to a 671B-parameter run.
One GRPO step by hand with G = 4 math rollouts
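Setup. Prompt $x$ = "Solve $x^2 - 5x + 6 = 0$. Put the smaller root in `\boxed{}`." Ground truth: $2$. Rule pack: $R = 1.0 \cdot v_{\text{answer}} + 0.1 \cdot v_{\text{format}}$. Group size $G = 4$.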
Rollouts. The policy samples four completions. We tabulate each one's verifier output:
| i | Rollout (abbrev) | format | answer | R |
|---|---|---|---|---|
| 1 | `<think>...factor (x-2)(x-3)...</think><answer>\boxed{2}</answer>` | 1 | 1 | 1.10 |
| 2 | `<think>...quadratic formula...</think><answer>\boxed{3}</answer>` | 1 | 0 | 0.10 |
| 3 | The smaller root is 2. | 0 | 0 | 0.00 |
| 4 | `<think>...</think><answer>\boxed{2}</answer>` | 1 | 1 | 1.10 |
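Group statistics.
$$\mu = \frac{1.10 + 0.10 + 0.00 + 1.10}{4} = 0.575, \qquad \sigma^2 = \frac{0.525^2 + 0.475^2 + 0.575^2 + 0.525^2}{4} \approx 0.2769$$
so $\sigma \approx 0.526$.
Per-rollout advantage $A_i = (R_i - \mu)/\sigma$: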
| i | R_i | R_i - μ | A_i |
|---|---|---|---|
| 1 | 1.10 | +0.525 | +0.998 |
| 2 | 0.10 | -0.475 | -0.903 |
| 3 | 0.00 | -0.575 | -1.093 |
| 4 | 1.10 | +0.525 | +0.998 |
Reading the advantages. Rollouts 1 and 4 (correct AND well-formatted) get strongly positive advantage: the policy gradient will push UP the log-probability of every token in those completions. Rollout 3 (no tags, no extractable answer) gets the most negative advantage: the policy will be pushed AWAY from completions that look like that. Rollout 2 (right format, wrong number) gets an advantage only slightly less negative than rollout 3's. The format reward kept it from scoring zero, but a small format bonus is not enough to outweigh a wrong answer.
What if all four were wrong? If every rollout got $R_i = 0$ then $\mu = 0$, $\sigma = 0$, and the advantages are undefined ($0/0$). In code we add a small $\epsilon$ to the denominator so the division does not blow up, and the net effect is that the gradient is essentially zero. This is the correct behaviour: if the policy cannot solve this prompt at all, there is no useful learning signal from it on this step. Move on to a prompt where the policy has at least some successful rollouts.
What if all four were right? Same answer from a different angle: $\sigma = 0$, so the advantages are zero. The policy has nothing to learn because it already knows how to solve this prompt. Move on. This is actually the desirable training dynamic: GRPO automatically focuses gradient updates on prompts that the policy is still uncertain about.
Cost. Four verifier calls on this prompt cost on the order of microseconds of CPU, since each one is a regex and a string compare. Four rollout generations from a 7B policy at 50 tok/s on one H100 cost on the order of seconds of GPU time. The verifier is orders of magnitude cheaper than the generation. This is the asymmetry that makes rule-based RL economically possible.
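The numbers in the walkthrough can be checked in a few lines. This sketch assumes the population standard deviation (dividing by $G$), which is the convention that reproduces the advantage table; the helper name `standardise` and the value of `eps` are illustrative choices.

```python
from statistics import fmean, pstdev


def standardise(rewards, eps=1e-6):
    """Group-relative advantages: (R_i - mean) / (population std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)             # divides by G, not G - 1
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [1.10, 0.10, 0.00, 1.10]  # R_i from the rollout table
    print(f"mu={fmean(rewards):.3f} sigma={pstdev(rewards):.3f}")
    # prints: mu=0.575 sigma=0.526
    print([f"{a:+.3f}" for a in standardise(rewards)])
    # prints: ['+0.998', '-0.903', '-1.093', '+0.998']

    # Degenerate groups: every rollout wrong, or every rollout right.
    print(standardise([0.0, 0.0, 0.0, 0.0]))
    print(standardise([1.1, 1.1, 1.1, 1.1]))
```

Both degenerate groups come out as all-zero advantages, which is the "no learning signal, move on" behaviour described above.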
Interactive: Score Real Rollouts Against Real Rules
Pick a task family, then pick two rollouts to compare. The widget runs the actual verifier code (regex + exact-match + unit-test accounting) and shows each rule's verdict, the per-rule reward contribution, and the GRPO group-advantage between the two rollouts. Try swapping a correct math rollout for one with a wrong answer but perfect format — watch the answer rule flip from ✓ to ✗ while the format rule stays green, and see the total reward drop from 1.10 to 0.10.
Three things to notice as you play with it:
- The reward is a sum, not a product. A rollout with the right format but wrong answer still gets a small reward, not zero. This is what gives the policy a usable gradient when it knows HOW to look like a reasoner but not yet how to be one.
- Group advantage flips sign at the group mean, not at zero. Two rollouts with rewards 1.10 and 0.10 give advantages +1 and -1, both meaningful. Two rollouts both at 1.10 give advantages 0 and 0 — the gradient is silent because there's nothing to learn from agreement.
- The code verifier is the only fractional one. 5/5 tests passing returns 1.0, 3/5 returns 0.6, 0/5 returns 0.0. This fractional shape is what lets RL on code converge in fewer steps than RL on math, where every signal is binary.
Plain Python: Three Verifiers and an Aggregator
The whole rule-based reward system fits in one file. No model, no GPU, no torch. The verifiers are pure functions; the aggregator is a one-line sum; the GRPO group advantage is five lines of statistics. Read it as a complete training-time component, not a toy.
Read this code with one eye on the structure. The interface between "what counts as right" (the verifier) and "how much it matters" (the rule-spec weight) is what makes the system tunable in production. The R1 team famously lowered the format weight from 0.1 to 0.05 mid-training because the model had already internalised the structural prior: a five-minute config change, with no retraining of any neural component.
PyTorch: Plugging Rule Rewards Into a GRPO Loop
Now wire the rule reward into a real training step. The reward function is just one more callable; the rest is standard GRPO — sample, score, advantage, PPO clip, KL penalty. Reading this loop side by side with a standard PPO loop is the easiest way to see exactly which components the rule-based approach lets you delete.
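The shape of that loop can be shown without any deep-learning dependency. What follows is a deliberately tiny stand-in, not a PyTorch implementation: the "policy" is a softmax over three fixed candidate completions, each completion counts as a single action (so the per-token sum collapses to one term), and the gradients are written out by hand. All names and hyperparameters (`grpo_step`, `group_size=8`, `clip_eps=0.2`, `beta=0.04`, `lr=0.5`) are assumptions chosen for illustration.

```python
import math
import random
import re

TRUTH = "2"
CANDIDATES = [
    "<think>factor (x-2)(x-3)</think><answer>\\boxed{2}</answer>",
    "<think>quadratic formula</think><answer>\\boxed{3}</answer>",
    "The smaller root is 2.",
]


def rule_reward(text: str) -> float:
    """Answer rule (weight 1.0) plus format rule (weight 0.1)."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    answer = 1.0 if boxed and boxed[-1] == TRUTH else 0.0
    tagged = re.fullmatch(r"<think>.+</think><answer>.+</answer>", text, re.DOTALL)
    return answer + 0.1 * (1.0 if tagged else 0.0)


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grpo_step(theta, ref_probs, rng, group_size=8, clip_eps=0.2,
              beta=0.04, lr=0.5, inner_epochs=2):
    old = softmax(theta)                                   # sampling policy
    actions = rng.choices(range(len(theta)), weights=old, k=group_size)  # 1. sample
    rewards = [rule_reward(CANDIDATES[a]) for a in actions]              # 2. score
    mu = sum(rewards) / group_size
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group_size)
    advs = [(r - mu) / (sigma + 1e-6) for r in rewards]                  # 3. advantage
    for _ in range(inner_epochs):
        probs = softmax(theta)
        kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))
        grad = [0.0] * len(theta)
        for a, adv in zip(actions, advs):
            ratio = probs[a] / old[a]
            # 4. PPO clip: once the clipped branch is the active min,
            # the term is constant and contributes no gradient.
            if (adv > 0 and ratio > 1 + clip_eps) or (adv < 0 and ratio < 1 - clip_eps):
                continue
            for k in range(len(theta)):
                dlogp = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d theta_k
                grad[k] += adv * ratio * dlogp / group_size
        for k in range(len(theta)):
            # 5. KL penalty: gradient of KL(pi || ref) for a softmax policy.
            grad[k] -= beta * probs[k] * (math.log(probs[k] / ref_probs[k]) - kl)
        theta = [t + lr * g for t, g in zip(theta, grad)]   # gradient ascent
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0, 0.0, 0.0]
    ref_probs = softmax(theta)          # frozen reference = initial policy
    for _ in range(200):
        theta = grpo_step(theta, ref_probs, rng)
    print([round(p, 3) for p in softmax(theta)])
```

The only learned state is `theta`. The reference is a frozen copy of the starting distribution, the reward is a plain function call, and no value estimate appears anywhere.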
Two things to count when you look at this loop. The forward-pass models: just two (current policy + frozen reference). Compare to standard PPO-RLHF, which needs four (policy + reference + reward model + value model). At 671B parameters that's ~3 TB of GPU memory you don't have to spend. The trained components: just one — the policy. Compare to PPO-RLHF, which trains two (policy + value net) and treats one as fixed (reward model, trained earlier). The training plumbing collapses.
At Massive Scale: How DeepSeek-R1 Used This
The R1 paper is the canonical example of rule-based rewards pushed to the limit. Phase 1 ("R1-Zero") trained a 671B base model with ZERO supervised fine-tuning data and ZERO human preference data. The entire training signal was two rule rewards:
- Format reward, weight 0.1: regex match on the `<think>...</think><answer>...</answer>` structure, in that exact order.
- Answer reward, weight 1.0: exact-match on the `\boxed{}` value against ground truth, for math problems; pass-rate on hidden unit tests, for code problems.
That is the whole reward function. The training corpus was ~150k math problems and ~30k code problems. GRPO, with a group of rollouts sampled per prompt. Eight thousand training steps. Over the course of it, R1-Zero's pass@1 on AIME 2024 rose from 15.6% to 71.0% (86.7% with majority voting), purely from rule-based RL on a model that had never been instruction-tuned.
How does this fit into massive-model training? Three places to watch.
The GPU/CPU split
Policy forward + sampling sits on the GPU cluster (671B model, ~14 H100 nodes worth of inference). Verifier evaluation sits on a CPU worker pool — for math, ~50 cores keep up with the GPU sampler; for code, ~500 cores because each unit-test subprocess is ~100ms. The CPU pool feeds a Redis queue that the GPU-side training loop drains. Drop a verifier worker and the H100s start idling within a few seconds — verifier capacity is a real cluster-planning input, not an afterthought.
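The core counts above follow from a simple throughput identity: busy cores equal verifications per second times seconds per verification. A minimal sketch, in which the function name and the sampler rate of 5,000 rollouts per second are hypothetical; only the ~100 ms per code verification comes from the text.

```python
def verifier_cores_needed(rollouts_per_second: int, verify_ms_per_rollout: int) -> int:
    """Cores required for verification to keep pace with sampling.

    Busy cores = arrival rate x service time. Integer milliseconds keep the
    arithmetic exact, and the result is rounded up to a whole core. This
    ignores queueing headroom, so treat it as a lower bound.
    """
    busy_core_ms_per_second = rollouts_per_second * verify_ms_per_rollout
    return -(-busy_core_ms_per_second // 1000)   # ceiling division


if __name__ == "__main__":
    # Hypothetical sampler rate of 5,000 rollouts/s, ~100 ms per code rollout.
    print(verifier_cores_needed(5000, 100))      # prints: 500
```

Because the estimate has no slack in it, losing even one worker pushes the pool below the arrival rate and the queue starts to grow.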
The sandbox tax
Sandboxing the code-execution verifier is non-trivial at scale. A single malicious-looking rollout (say, the model emits a command that recursively deletes the filesystem root) is harmless if you run it in a per-rollout firejail. It deletes your training cluster if you don't. DeepSeek used a custom microVM (Firecracker-style) per code rollout, with a 2 s wall-clock budget and 256 MB memory cap. Throughput: ~2k code verifications per second per node. Cost: roughly 10% of total training-cluster spend.
Reward determinism gives you reproducibility
Because rule rewards are deterministic, you can re-score any old rollout exactly. This sounds obvious until you compare with RM-based RLHF: the reward model continues to drift as the policy evolves (it gets re-trained periodically on fresh preference data), so a rollout's reward at step 1000 is not comparable to the same rollout's reward at step 5000. With rule rewards, every rollout has a permanent, immutable score. Debugging is suddenly possible.
Engineering Reality: Reward Hacking and the Verifier Tax
Rule-based rewards eliminate one class of reward hacking (gaming a neural RM) and introduce another (gaming a deterministic verifier). The bugs are different. They are mostly cheaper to fix, but you do have to fix them.
Reward hacks we have personally seen, with their fixes
| Hack | Symptom | Fix |
|---|---|---|
| Output `\boxed{}` multiple times with different values | Verifier matches the first one and accepts a guess wrapped in noise | Match the LAST `\boxed{}`, not the first; or require unique match |
| Print the unit tests, then `pass` | All tests trivially 'pass' because the model never ran them | Hash the test file content; reject rollouts that contain test names |
| Open the ground-truth file off disk | Some sandbox setups leak the truth into the runtime | Per-rollout filesystem mount; never put truth on a readable path |
| Emit empty `<think></think>` + correct answer | Format reward fires, but no actual reasoning | Add `len(<think>) >= 8` sub-rule; weight 0.2 of format reward |
| Repeat the answer 1000 times | Length reward (if present) pumps up scores; format/answer still pass | Repetition penalty in reward; cap rollout token budget |
| Emit non-ASCII look-alikes for ground truth digits | String compare fails on visually identical answer | Unicode NFKC normalise both sides before comparison |
Notice that all six fixes are 1-3 line changes to the verifier. That is the silver lining of rule-based rewards: when reward hacking happens, you can fix it on a coffee break. Compare with neural RM reward hacking, where the fix is "collect another 100k preference pairs and retrain the RM" — measured in days and dollars, not minutes.
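Two of the fixes in the table, "match the LAST boxed value" and "NFKC normalise both sides", look like this in practice. This is a sketch with hypothetical names (`canonical`, `hardened_answer_reward`), and the regex again ignores nested braces.

```python
import re
import unicodedata

BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def canonical(text: str) -> str:
    """Fold look-alike characters (e.g. full-width digits) with NFKC, then trim."""
    return unicodedata.normalize("NFKC", text).strip()


def hardened_answer_reward(rollout: str, truth: str,
                           require_unique: bool = False) -> float:
    """Exact-match on the LAST boxed value, after Unicode normalisation.

    With require_unique=True, a rollout that boxes several different values
    is rejected outright instead of being scored on its final guess.
    """
    values = [canonical(v) for v in BOXED_RE.findall(rollout)]
    if not values:
        return 0.0
    if require_unique and len(set(values)) > 1:
        return 0.0
    return 1.0 if values[-1] == canonical(truth) else 0.0


if __name__ == "__main__":
    spray = "\\boxed{7} maybe \\boxed{2}"
    print(hardened_answer_reward(spray, "2"))
    print(hardened_answer_reward(spray, "2", require_unique=True))
    print(hardened_answer_reward("\\boxed{\uff12}", "2"))   # full-width digit two
```

Each hardening step is a single line inside the verifier, which is what makes these fixes quick to ship.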
The verifier tax
Rule-based rewards trade off in two places that are easy to underestimate.
- You need ground truth. Every prompt in your training corpus must come paired with a verifiable answer. For math and code this is easy (existing datasets). For domain tasks (medical reasoning, legal extraction, scientific QA), you have to either curate or synthesise the ground-truth set — and if you do it badly, you train the model to be confidently wrong in exactly the ways your dataset is wrong.
- You need verifier capacity. The CPU/GPU balance is real and unforgiving. A training cluster optimised for pre-training (heavy on GPU, light on CPU) will throughput-starve on rule-based RL because the verifier pool is too small. Plan for ~1 CPU core per concurrent rollout for math, ~10 cores per rollout for code, and provision the cluster accordingly. This is one of the surprise budget items that derails first-time rule-based RL projects.
The take-away for massive-model training is the same as the take-away for any tool in this book: pick the right one for the job. Rule-based rewards are the right tool when ground truth exists, verification is cheap, and the reward range is bounded. For those tasks — and a surprisingly large fraction of the useful work LLMs do is in those tasks — they are faster, cheaper, more interpretable, and more robust than the neural alternative. Use them. The next section adds a second new tool on top: when verification can't be done by regex but CAN be done by another LLM, you reach for generative reward models, the bridge between rule-based and human-preference rewards.