Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: Reward Models Lie at Scale

The previous two sections built a neural reward model from human preferences. It works. It also breaks in a very specific, very expensive way: the longer you train against it, the more it lies.

Here is the failure, concretely. You train a 7B reward model on 100k preference pairs. Its validation accuracy is 72%. You plug it into PPO and start optimising the policy against it. For the first few hundred steps everything looks great — reward goes up, eval scores go up. Then around step 500 the curves diverge. Reward keeps climbing. Eval scores plateau, then drop. By step 2000 the reward model is rating the policy's outputs above the human-written gold standard, and the policy is producing fluent-sounding nonsense that hits every pattern the reward model learned to like.

This is reward hacking, and it is the central engineering risk of RLHF at scale. The reward model is a finite neural network. Its decision boundary has cracks. The policy is a much larger neural network whose entire job during RL is to find and exploit those cracks. The bigger and better-trained the policy, the faster it finds them.

The brutal asymmetry: A reward model trained on 100k preference pairs has at most about 100k bits of information about what humans like, since each binary comparison carries at most one bit. A 671B-parameter policy doing 10M rollouts has roughly 10M opportunities per epoch to find a pattern the reward model wrongly labelled as good. Under sustained optimisation pressure, the policy wins this race.

Now look at one specific class of tasks — math, code, theorem proving, structured extraction, constrained generation — and notice that we are paying for a reward model we do not need. For these tasks, the human preference is not the ground truth. The ground truth is the ground truth. $x^2 - 5x + 6 = 0$ has solutions 2 and 3 whether or not a labeller agrees. The unit tests pass or they do not. The JSON parses or it does not.

For verifiable tasks, the reward model is a noisy approximation of a cheap, deterministic function. Why train a 7B network to predict what a five-line regex already knows?

The shift: Replace the neural reward model with explicit verifier code on any task where you can write the verifier. Keep the reward model only for tasks that genuinely need a human sense of taste (style, helpfulness, harmlessness). This split is the single biggest reason DeepSeek-R1, Qwen-2.5-Math, and the post-2024 generation of reasoning models train faster, generalise further, and reward-hack less than their RLHF-only predecessors.

Intuition: Use a Judge Made of Code, Not Neurons

Imagine you are training a high-school student. You can grade their algebra homework in two ways. Option A: hire another student to compare answers and tell you which one looks more correct. Option B: solve the problem yourself and check whether the digits match.

Option A is the reward model. It is necessary when the task is subjective — "was this essay persuasive" — but it is wasteful and error-prone when the task is verifiable. Option B is rule-based reward. It is faster, cheaper, more accurate, and impossible to reward-hack in the same way as a neural judge.

Three properties make a task a good candidate for rule-based reward:

Deterministic ground truth. The right answer is not a matter of opinion. $2 + 2 = 4$. The unit tests pass. The XML validates.
Cheap verification. Checking the answer is orders of magnitude cheaper than generating it. A regex runs in microseconds. A unit test runs in milliseconds. The policy that produced the output ran for seconds and drew hundreds of watts of GPU power.
Bounded reward. The verifier outputs a number in a known range (usually $[0, 1]$ or a small integer). This is what makes the group-advantage normalisation in GRPO stable.

When all three hold, you should not be training a reward model. You should be writing 50 lines of Python that the model is graded against millions of times during RL. The Python is the judge.

A useful mental model: think of a rule-based reward as a "world model in code". The world you are modelling is the world of correct math answers, or the world of programs that pass tests. Instead of training a neural approximation of that world, you simulate the world directly.

The Math: A Deterministic, Sparse, Verifier-Defined Reward

Let $\pi_\theta$ be the policy (the language model with parameters $\theta$), $s$ a prompt, and $a \sim \pi_\theta(\cdot \mid s)$ a rollout sampled from the policy. The classical RLHF objective is:

$J(\theta) = \mathbb{E}_{s, a}[\, r_\phi(s, a)\,] - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big)$

where $r_\phi$ is the neural reward model with parameters $\phi$, and the KL term anchors the policy to a frozen reference. The whole machinery of the previous section was about learning $r_\phi$ from human preferences.

Rule-based reward replaces $r_\phi$ with an explicit function $R$ that has no learnable parameters. The verifier is the function. Concretely, $R$ is a weighted sum of rule-level scores:

$R(s, a) = \sum_{i=1}^{K} w_i \, r_i(s, a)$

Each $r_i: (s, a) \to [0, 1]$ is a verifier — a piece of code, not a neural net. The weights $w_i \geq 0$ are hand-set hyperparameters. For a math task we might have:

$R_{\mathrm{math}}(s, a) = 1.0 \cdot r_{\mathrm{answer}}(s, a) + 0.1 \cdot r_{\mathrm{format}}(s, a)$

where $r_{\mathrm{answer}}$ is 1 if the extracted boxed value matches the ground truth and 0 otherwise, and $r_{\mathrm{format}}$ is 1 if the rollout has the expected $\langle\mathrm{think}\rangle$ / $\langle\mathrm{answer}\rangle$ structure. The maximum possible reward is $R_{\max} = 1.1$.
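As a quick sanity check on the weighted sum, the sketch below evaluates $R_{\mathrm{math}}$ for every combination of the two binary rule scores. The function name `r_math` and its keyword defaults are illustrative assumptions that simply mirror the weights 1.0 and 0.1 above.

```python
def r_math(answer: float, fmt: float,
           w_answer: float = 1.0, w_format: float = 0.1) -> float:
    """Weighted sum R = w_answer * r_answer + w_format * r_format (illustrative)."""
    return w_answer * answer + w_format * fmt

# Enumerate every (answer, format) combination of the two binary rules.
table = {(a, f): round(r_math(a, f), 2) for a in (0.0, 1.0) for f in (0.0, 1.0)}

# The largest achievable reward is the sum of the weights.
r_max = max(table.values())

for (a, f), r in sorted(table.items()):
    print(f"answer={a:.0f} format={f:.0f} -> R={r:.2f}")
```

Because both weights are non-negative and each rule score lies in $[0, 1]$, the reward is bounded by the sum of the weights, which is the property the group normalisation relies on.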

Substituting back into the RL objective and using GRPO's group-relative advantage:

$A_i = \frac{R(s, a_i) - \mu_G}{\sigma_G + \varepsilon}, \quad \mu_G = \frac{1}{G}\sum_{j=1}^{G} R(s, a_j), \quad \sigma_G^2 = \frac{1}{G}\sum_{j=1}^{G} (R(s, a_j) - \mu_G)^2$

where $G$ is the group size (the number of rollouts sampled per prompt), and $a_i$ is the $i$-th rollout. The mean and standard deviation are computed inside the group, so whenever $\sigma_G > 0$ the advantages $A_i$ have zero mean and, up to the small $\varepsilon$, unit variance by construction. This is what lets GRPO drop the value network, a critical computational saving when the policy is already 671B parameters.

Finally, the PPO-clipped surrogate is computed token by token:

$L(\theta) = -\mathbb{E}_{s, i, t}\Big[\, \min\big(\rho_{i,t}(\theta) A_i, \,\, \mathrm{clip}(\rho_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon) A_i\big) \Big] + \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$

where $\rho_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s, a_{i, < t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s, a_{i, < t})}$ is the per-token importance ratio between the current policy and the sampling policy.

Read the surrogate as: for every token in every rollout with positive advantage, push the policy's probability of that token up (clipped so the step is bounded); for every token in every rollout with negative advantage, push it down. The KL penalty stops the whole thing from flying off.
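To make the clipping behaviour concrete, here is a minimal plain-Python sketch of one token's contribution to the surrogate. The helper name `clipped_term` is an assumption for illustration; a real training loop would compute the same quantity on tensors.

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One token's contribution to the PPO-clipped surrogate (to be maximised)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
up_capped = clipped_term(ratio=1.5, advantage=1.0)      # min(1.5, 1.2)

# Negative advantage: pushing the ratio below 1 - eps earns no extra credit.
down_capped = clipped_term(ratio=0.5, advantage=-1.0)   # min(-0.5, -0.8)

# Inside the clip range the term is just ratio * advantage.
inside = clipped_term(ratio=1.1, advantage=1.0)
```

In both capped cases the selected branch does not depend on the ratio, so the gradient through that token is zero: the step is bounded, as the reading above describes.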

The mathematical novelty of rule-based RL is not in the surrogate, which is just PPO. It is in what fills the slot $R(s, a)$: a deterministic, parameter-free function that returns the same number every time. No noise in the target. No reward-model overfitting. No drift.

The Three Families of Verifiable Tasks

Almost every rule-based reward in production falls into one of three families, distinguished by how the verifier reads the rollout.

Family	Verifier	Datasets	Reward shape
Exact-match	String compare against ground truth (often after a regex extract)	GSM8K, MATH, AIME, MMLU, TriviaQA	Binary 0 / 1
Regex / format	Pattern match against a target schema	JSON-mode, XML-mode, R1 think+answer tags, function-calling	Binary 0 / 1, occasionally fractional for partial schemas
Execution	Run the model's output, observe outcome (return code, test results, simulator state)	HumanEval, MBPP, CodeContests, theorem-prover tactic application	Fractional, fraction of tests passing

Exact-match

Used for any question with a single canonical answer. The verifier extracts the answer with a regex (typically $\boxed{...}$ for math, a tagged span for QA, a number for arithmetic) and compares against the ground truth. The trickiest engineering is the normalisation step: should $0.5$, $1/2$, and $\frac{1}{2}$ all match? Standard evaluation code for math benchmarks such as MATH includes answer normalisers; using one can be the difference between, say, a 60% and a 75% pass@1 measurement on the same model.
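A minimal sketch of such a normalisation step follows. It is not any benchmark's reference normaliser; `normalise_answer` and `answers_match` are hypothetical helpers that only handle plain numbers, simple fractions, and `\frac{a}{b}` with integer parts.

```python
import re
from fractions import Fraction

# Matches \frac{a}{b} or \dfrac{a}{b} with integer numerator and denominator.
_FRAC_RE = re.compile(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}")

def normalise_answer(text: str):
    """Map simple numeric answers to a canonical Fraction; else return stripped text."""
    s = text.strip().strip("$").replace(" ", "")
    s = _FRAC_RE.sub(r"\1/\2", s)        # \frac{1}{2} -> 1/2
    try:
        return Fraction(s)               # accepts "0.5", "1/2", "-3", "2"
    except (ValueError, ZeroDivisionError):
        return s                         # non-numeric answers compare as text

def answers_match(extracted: str, truth: str) -> bool:
    """Exact match after normalisation."""
    return normalise_answer(extracted) == normalise_answer(truth)

print(answers_match("0.5", "1/2"))
print(answers_match("\\frac{1}{2}", "0.5"))
print(answers_match("2", "3"))
```

Using exact rational arithmetic avoids floating-point tolerance decisions for the simple cases; anything the sketch cannot parse falls back to a plain string comparison.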

Regex / format

Used for structural priors. DeepSeek-R1's breakthrough was showing that even a TINY format reward (weight 0.1) is enough to bootstrap chain-of-thought from a base model that has never seen an SFT chat dataset. The model learns to wrap its reasoning in $\langle\mathrm{think}\rangle$ tags purely because the verifier hands out a small bonus when it does. Format rewards are also how you cheaply enforce JSON-mode, function-calling syntax, and language-consistency (e.g. "every sentence is in English").
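As an illustration of a format reward outside the think/answer tags, the sketch below scores JSON-mode output: zero if the text does not parse as a JSON object, otherwise the fraction of required keys present. The function name and the `required_keys` parameter are assumptions for this example, not part of any published recipe.

```python
import json

def json_format_reward(rollout: str, required_keys: tuple[str, ...]) -> float:
    """0.0 if the text is not a JSON object; else the fraction of required keys present."""
    try:
        obj = json.loads(rollout)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0                      # valid JSON, but not an object
    if not required_keys:
        return 1.0                      # nothing further to check
    present = sum(1 for k in required_keys if k in obj)
    return present / len(required_keys)

print(json_format_reward('{"name": "a", "age": 3}', ("name", "age")))
print(json_format_reward('{"name": "a"}', ("name", "age")))
print(json_format_reward("not json", ("name", "age")))
```

The fractional score for a partially filled schema is one way to get the "occasionally fractional" reward shape listed in the table above.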

Execution

The most expensive verifier. The model emits code; you write the code to disk in a sandbox and run it against hidden unit tests. Reward is the fraction of tests that pass. Two engineering constraints dominate at scale: timeouts (infinite loops are the single most common failure mode) and isolation (the sandbox must prevent the policy's code from corrupting the verifier or eating the network). DeepMind's AlphaCode papers report ~30% of model outputs hit either a timeout or an unhandled exception during early training. At that rate the timeout is not a corner case; it is a routine outcome the verifier must handle cleanly.

Verifier asymmetry: Across these three families, verifier cost spans six orders of magnitude. A regex check is ~10 microseconds. A unit-test run is ~100 milliseconds. An end-to-end theorem-prover call can be 10 seconds. If you mix families in one training batch, the slow verifier becomes the bottleneck: your H100s sit idle waiting for sandbox subprocesses to finish. The fix is reward-side asynchrony: a worker pool of CPU verifiers feeding a queue that the GPU drains lazily.
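The worker-pool idea can be sketched with the standard library alone. Everything here is a stand-in: `fast_verifier` and `slow_verifier` are toy functions (the sleep simulates sandbox latency), and `score_async` is a hypothetical helper, not a production scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fast_verifier(rollout: str) -> float:
    """Stand-in for a regex check."""
    return 1.0 if "\\boxed{2}" in rollout else 0.0

def slow_verifier(rollout: str) -> float:
    """Stand-in for a sandboxed test run; the sleep simulates subprocess latency."""
    time.sleep(0.05)
    return 1.0 if "return" in rollout else 0.0

def score_async(jobs, max_workers: int = 8) -> list[float]:
    """Submit (verifier, rollout) jobs to a pool; collect rewards in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(verifier, rollout) for verifier, rollout in jobs]
        return [f.result() for f in futures]

jobs = [(slow_verifier, "def f(): return 1")] * 4 + [(fast_verifier, "\\boxed{2}")] * 4
start = time.perf_counter()
rewards = score_async(jobs)
elapsed = time.perf_counter() - start
print(rewards, f"{elapsed:.3f}s")
```

Threads are enough in this sketch because the slow work happens while waiting (here a sleep, in practice a child process), so the interpreter lock is released. A real pipeline would go further and let the sampler continue while results arrive, rather than blocking on the whole batch.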

Manual Numerical Walkthrough

We will compute one GRPO step on a single math prompt with $G = 4$ rollouts. Tiny enough to verify every number; structurally identical to a 671B-parameter run.

Setup. Prompt $s$ = "Solve $x^2 - 5x + 6 = 0$. Put the smaller root in $\boxed{}$." Ground truth $y = 2$. Rule pack: $R = 1.0 \cdot r_{\mathrm{answer}} + 0.1 \cdot r_{\mathrm{format}}$. Group size $G = 4$.

Rollouts. The policy samples four completions. We tabulate each one's verifier output:

i	Rollout (abbrev)	format	answer	R
1	<think>...factor (x-2)(x-3)...</think><answer>\boxed{2}</answer>	1	1	1.10
2	<think>...quadratic formula...</think><answer>\boxed{3}</answer>	1	0	0.10
3	The smaller root is 2.	0	0	0.00
4	<think>...</think><answer>\boxed{2}</answer>	1	1	1.10

Group statistics.

$\mu_G = \tfrac{1}{4}(1.10 + 0.10 + 0.00 + 1.10) = 0.575$

$\sigma_G^2 = \tfrac{1}{4}\big((1.10 - 0.575)^2 + (0.10 - 0.575)^2 + (0.00 - 0.575)^2 + (1.10 - 0.575)^2\big)$

$\sigma_G^2 = \tfrac{1}{4}(0.2756 + 0.2256 + 0.3306 + 0.2756) = 0.2769$, so $\sigma_G \approx 0.526$.

Per-rollout advantage $A_i = (R_i - \mu_G) / \sigma_G$:

i	R_i	R_i - μ	A_i
1	1.10	+0.525	+0.998
2	0.10	-0.475	-0.903
3	0.00	-0.575	-1.093
4	1.10	+0.525	+0.998
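The hand computation can be checked in a few lines of Python. This is only a verification sketch using the standard library; it uses the population standard deviation (dividing by $G$), matching the definition of $\sigma_G$ above, and omits $\varepsilon$ because $\sigma_G$ is far from zero here.

```python
import statistics

rewards = [1.10, 0.10, 0.00, 1.10]

mu = statistics.mean(rewards)
sigma = statistics.pstdev(rewards)      # population std: divide by G, not G - 1

advantages = [(r - mu) / sigma for r in rewards]
rounded = [round(a, 3) for a in advantages]

print(round(mu, 3), round(sigma, 3), rounded)
```

The standardised advantages sum to zero and have unit population standard deviation, which is the "by construction" property used in the math section.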

Reading the advantages. Rollouts 1 and 4 (correct AND well-formatted) get strongly positive advantage: the policy gradient will push UP the log-probability of every token in those completions. Rollout 3 (no tags, no extractable answer) gets the most negative advantage: the policy will be pushed AWAY from completions that look like that. Rollout 2 (right format, wrong number) gets a slightly less negative advantage: the format bonus lifts it above rollout 3, but a small format bonus is not enough to outweigh a wrong answer.

What if all four were wrong? If every rollout got $R = 0$ then $\mu_G = 0$ and $\sigma_G = 0$, so the advantages are $0/0$, undefined. In code we add $\varepsilon = 10^{-6}$ to the denominator so the division does not blow up; every numerator is zero, so every advantage is zero and the gradient vanishes. This is the correct behaviour: if the policy cannot solve this prompt at all, there is no useful learning signal from it on this step. Move on to a prompt where the policy has at least some successful rollouts.

What if all four were right? Same answer from a different angle: $\sigma_G = 0$, so the advantages are zero. The policy has nothing to learn because it already knows how to solve this prompt. Move on. This is actually the desirable training dynamic: GRPO automatically focuses gradient updates on prompts that the policy is still uncertain about.
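One way to act on "move on" is to drop zero-variance groups before the gradient step so no compute is spent on them. The sketch below is an optional optimisation under that assumption; `has_learning_signal` and `keep_informative` are hypothetical helper names, and the prompt keys are made up.

```python
import statistics

def has_learning_signal(rewards, tol: float = 1e-8) -> bool:
    """True when the group's rewards are not all (nearly) identical."""
    return len(rewards) >= 2 and statistics.pstdev(rewards) > tol

def keep_informative(groups: dict) -> dict:
    """Drop prompts whose reward group would yield all-zero advantages."""
    return {prompt: rs for prompt, rs in groups.items() if has_learning_signal(rs)}

groups = {
    "too_hard":   [0.0, 0.0, 0.0, 0.0],   # all wrong: sigma = 0
    "solved":     [1.1, 1.1, 1.1, 1.1],   # all right: sigma = 0
    "in_between": [1.1, 0.1, 0.0, 1.1],   # mixed: usable signal
}
kept = keep_informative(groups)
print(list(kept))
```

Without the filter the result is the same mathematically, since those groups contribute zero advantage; the filter only saves the forward and backward passes.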

Cost. Four verifier calls on this prompt cost about $4 \times 20\,\mu\mathrm{s} = 80\,\mu\mathrm{s}$ of CPU. Four rollout generations from a 7B policy at 50 tok/s on one H100 cost about $4 \times 2\,\mathrm{s} = 8\,\mathrm{s}$ of GPU time. The verifier is $10^{5}\times$ cheaper than the generation. This is the asymmetry that makes rule-based RL economically possible.

Interactive: Score Real Rollouts Against Real Rules

Pick a task family, then pick two rollouts to compare. The widget runs the actual verifier code (regex + exact-match + unit-test accounting) and shows each rule's verdict, the per-rule reward contribution, and the GRPO group-advantage between the two rollouts. Try swapping a correct math rollout for one with a wrong answer but perfect format — watch the answer rule flip from ✓ to ✗ while the format rule stays green, and see the total reward drop from 1.10 to 0.10.

Three things to notice as you play with it:

The reward is a sum, not a product. A rollout with the right format but wrong answer still gets a small reward, not zero. This is what gives the policy a usable gradient when it knows HOW to look like a reasoner but not yet how to be one.
Group advantage flips sign at the group mean, not at zero. Two rollouts with rewards 1.10 and 0.10 give advantages +1 and -1, both meaningful. Two rollouts both at 1.10 give advantages 0 and 0 — the gradient is silent because there's nothing to learn from agreement.
The code verifier is the only fractional one. 5/5 tests passing returns 1.0, 3/5 returns 0.6, 0/5 returns 0.0. This fractional shape gives partial credit for partially correct programs, a denser signal than math, where every answer reward is binary.

Plain Python: Three Verifiers and an Aggregator

The whole rule-based reward system fits in one file. No model, no GPU, no torch. The verifiers are pure functions; the aggregator is a one-line sum; the GRPO group advantage is five lines of statistics. Read it as a complete training-time component, not a toy.

Rule-based reward stack: three verifiers + aggregator + GRPO advantage

🐍rule_based_rewards.py

Two regexes are the whole rule-based reward stack for math

TAG_RE matches a <think>…</think><answer>…</answer> block in that exact order. BOXED_RE pulls the content out of \boxed{…}. Between them, these two regexes carry the entire structural prior that DeepSeek-R1 used to bootstrap chain-of-thought from a base model. The model is graded on producing the right structure AND the right number — two completely orthogonal signals computed by code that takes microseconds.

EXECUTION STATE

TAG_RE = captures (think_body, answer_body) in order

BOXED_RE = captures the boxed value from the answer block

verify_math: deterministic and sparse

The function returns a dict, not a single number. This is the most important design choice in the file. Each verifier emits component scores that the aggregator combines later. Keeping format and answer separate means you can re-weight them without re-running the model. R1's training curves show that the format weight 0.1 was crucial early — without it the model would drop the <think> block entirely once it could short-cut to the right number.

EXECUTION STATE

rollout = raw text produced by the policy

truth = ground-truth string, e.g. '42'

return = {format: 0 or 1, answer: 0 or 1, extracted: str or None}

verify_format: structural reward without an answer

A common confusion: format reward is NOT just 'is the answer block there'. R1 used a small format reward EVEN ON tasks where the answer reward was the dominant signal, because the structural prior travels across task families. A model that learns the <think>/<answer> habit on math problems will keep using it on code problems, where the reward depends only on test outcomes and offers no signal on intermediate reasoning quality.

EXECUTION STATE

structure = 1 if both tags present in order, else 0

think_length = soft 'don't emit an empty <think>' check

verify_code: the only non-deterministic step

We write the model's code plus a test assertion to a temp file, run it in a subprocess, and count the runs that exit with return code 0. Two engineering essentials: a timeout (infinite loops are the most common failure mode), and a sandbox (real production sandboxes are seccomp/firejail/docker; here we approximate with a subprocess). The subprocess call is the single slowest line in this file: a 2 s timeout × G rollouts × N tests × M prompts adds up fast at training scale.

EXECUTION STATE

code = the model's generated function body

tests = ['assert is_palindrome("racecar")', ...]

timeout = 2.0 s per test — empirically the median 'real solution' runtime + 4 sigma

RuleSpec: making the reward function configurable, not coded

Each rule has a name, a weight, and a callable that maps the verifier dict to a [0, 1] score. This is the seam between 'what counts as right' (verifier) and 'how much each thing matters' (aggregator). When R1's team realised the model was gaming the answer reward by skipping the <think> block, they re-weighted the rules WITHOUT touching the verifiers — a 5-minute change instead of a re-run. This separation is the most under-appreciated piece of RL-with-rule-rewards engineering.

EXECUTION STATE

name = human-readable id for logging

weight = scalar multiplier; appears in R(s) = sum(w_i r_i)

score_fn = verifier_dict -> float in [0, 1]

aggregate: one line of code, decades of subtlety

Linear weighted sum is the standard combiner. It is not the only choice — you could multiply (forcing ALL rules to fire), take a min (rewarding the weakest rule), or pipe through a soft-OR. R1, Qwen2-Math and Mistral-Math all use linear sums; the multiplicative variant has been tried (named 'gated rewards' in some papers) but tends to make the reward landscape too sparse for stable PPO/GRPO.

EXECUTION STATE

return = scalar in [0, sum(weights)]
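To see why the multiplicative and min combiners make the landscape sparser, the sketch below compares the three on a rollout with the right format but the wrong answer. The function names are illustrative assumptions; only the linear sum corresponds to the aggregator in the listing.

```python
import math

def linear_sum(scores, weights):
    """Weighted sum: each rule contributes independently."""
    return sum(w * s for w, s in zip(weights, scores))

def gated_product(scores):
    """Product: any rule scoring 0 zeroes the whole reward."""
    return math.prod(scores)

def weakest_rule(scores):
    """Min: the reward is set by the worst-scoring rule."""
    return min(scores)

scores = [1.0, 0.0]     # [format, answer]: right structure, wrong number
weights = [0.1, 1.0]

linear = linear_sum(scores, weights)
gated = gated_product(scores)
weakest = weakest_rule(scores)
print(linear, gated, weakest)
```

Only the linear sum distinguishes this rollout from one with no structure at all, which is the intermediate signal the format reward is meant to provide.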

group_advantages: GRPO's killer trick in five lines

The standard PPO baseline is a learned value function — another neural net to train, another GPU to host. GRPO replaces it with the group mean: for G rollouts of the SAME prompt, compute the mean and std of their rewards, then standardise each. The result is unit-variance advantages, which fits the PPO clip range without re-tuning. This works ONLY because rule rewards are BOUNDED — if your reward were a raw RM logit ranging over (-inf, +inf), the std would be unreliable.

EXECUTION STATE

G = group size, typically 8–64 rollouts per prompt

advantages = list of (R_i - mean)/std, mean=0, std=1 by construction

score_group: the inner loop of one GRPO step

This is what runs per prompt, per training step: roll out G samples from the policy, score each against the rules, normalise inside the group. The list[float] advantages flow straight into the policy-gradient term. For a 7B model on 1 H100 you can score ~50 prompts/s when math is the only task; once code-verification subprocesses enter the mix that drops to ~5 prompts/s. That is the verifier wall.

EXECUTION STATE

rollouts = G strings sampled from the policy on one prompt

advantages = G floats; signal that drives the policy update

"""
Rule-based rewards for verifiable tasks. Three verifiers, one aggregator,
one advantage computation. The whole file is pure Python: no GPU, no
model. The reward signal is computed by code that the human authored once
and now runs millions of times during RL.

  1. verify_math    : extract \\boxed{...} from a tagged answer and
                      exact-match against ground truth.
  2. verify_format  : regex check that the rollout has the
                      <think>...</think><answer>...</answer> structure
                      in that order.
  3. verify_code    : write the model's code to disk, sandbox-run it
                      against hidden unit tests, count the fraction
                      that pass.

The aggregator combines verifier outputs into a single scalar reward.
The advantage step normalises rewards inside a GRPO group so the
policy gradient is unit-variance.
"""
from __future__ import annotations
import re, subprocess, sys, tempfile, statistics
from dataclasses import dataclass
from typing import Callable, Sequence

# ─── shared regexes ─────────────────────────────────────────────────────
# [\s\S] matches any character including newlines; *? keeps the match minimal.
TAG_RE   = re.compile(r"<think>([\s\S]*?)</think>\s*<answer>([\s\S]*?)</answer>")
# Captures the content of \boxed{...}; nested braces are not supported.
BOXED_RE = re.compile(r"\\boxed\{([^}]+)\}")

# ─── 1. Math verifier: exact match on \boxed{...} ───────────────────────
def verify_math(rollout: str, truth: str) -> dict:
    m = TAG_RE.search(rollout)
    if not m:
        return {"format": 0.0, "answer": 0.0, "extracted": None}
    b = BOXED_RE.search(m.group(2))
    extracted = b.group(1).strip() if b else None
    return {
        "format":  1.0,
        "answer":  1.0 if extracted == truth.strip() else 0.0,
        "extracted": extracted,
    }

# ─── 2. Format verifier: structural reward only ─────────────────────────
def verify_format(rollout: str) -> dict:
    m = TAG_RE.search(rollout)
    if not m:
        return {"structure": 0.0, "think_length": 0.0}
    return {
        "structure":   1.0,
        "think_length": 1.0 if len(m.group(1).strip()) >= 8 else 0.0,
    }

# ─── 3. Code verifier: sandboxed unit-test execution ────────────────────
def verify_code(code: str, tests: list[str], timeout: float = 2.0) -> dict:
    """Run each unit test as 'assert ...'. Return pass count, total, and fraction."""
    passes = 0
    for t in tests:
        prog = code + "\n" + t + "\n"
        # The file is executed while still open: fine on POSIX, not on Windows.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=True) as f:
            f.write(prog)
            f.flush()
            try:
                r = subprocess.run(
                    [sys.executable, f.name],   # same interpreter as the caller
                    capture_output=True, timeout=timeout,
                )
                if r.returncode == 0:
                    passes += 1
            except subprocess.TimeoutExpired:
                pass        # infinite loop = fail, not a crash
    return {"tests_passed": passes, "tests_total": len(tests),
            "frac": passes / max(1, len(tests))}

# ─── 4. Weighted aggregator ─────────────────────────────────────────────
@dataclass
class RuleSpec:
    name: str
    weight: float
    score_fn: Callable[[dict], float]   # rollout-result-dict -> [0, 1]

def aggregate(verifier_out: dict, rules: list[RuleSpec]) -> float:
    """R(s) = sum_i w_i * r_i(s).  r_i is the score for rule i."""
    return sum(rule.weight * rule.score_fn(verifier_out) for rule in rules)

# Example rule packs for the three task families
MATH_RULES   = [
    RuleSpec("format", 0.1, lambda v: v["format"]),
    RuleSpec("answer", 1.0, lambda v: v["answer"]),
]
FORMAT_RULES = [
    RuleSpec("structure",    1.0, lambda v: v["structure"]),
    RuleSpec("think_length", 0.2, lambda v: v["think_length"]),
]
CODE_RULES   = [
    RuleSpec("frac", 1.0, lambda v: v["frac"]),
]

# ─── 5. GRPO group-advantage: unit-variance baseline ────────────────────
def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (R_i - mean(R)) / (std(R) + eps).
    The whole point of GRPO: drop the critic, use the group mean as the
    baseline, the group std as the scale. Works because rule rewards are
    BOUNDED in [0, R_max] so the variance estimate is well-behaved."""
    if len(rewards) < 2:
        return [0.0] * len(rewards)
    mu  = statistics.mean(rewards)
    sd  = statistics.pstdev(rewards)    # population std, as in the formula
    return [(r - mu) / (sd + eps) for r in rewards]

# ─── 6. End-to-end on a single prompt ───────────────────────────────────
def score_group(rollouts: list[str], truth: str) -> dict:
    """Score G rollouts of the same math prompt. Returns rewards + advantages."""
    rs = []
    for r in rollouts:
        v = verify_math(r, truth)
        rs.append(aggregate(v, MATH_RULES))
    return {"rewards": rs, "advantages": group_advantages(rs)}

Read this code with one eye on the structure. The interface between "what counts as right" (the verifier) and "how much it matters" (the rule-spec weight) is what makes the system tunable in production. The R1 team reportedly lowered the format weight from 0.1 to 0.05 mid-training because the model had already internalised the structural prior: a five-minute config change, no retraining of any neural component.

PyTorch: Plugging Rule Rewards Into a GRPO Loop

Now wire the rule reward into a real training step. The reward function is just one more callable; the rest is standard GRPO — sample, score, advantage, PPO clip, KL penalty. Reading this loop side by side with a standard PPO loop is the easiest way to see exactly which components the rule-based approach lets you delete.

GRPO step with rule-based rewards (no reward model, no value net)

🐍grpo_step.py

RewardFn is a one-line abstraction with outsized leverage

By wrapping reward functions in a dataclass with a name, every reward becomes loggable, swappable, and composable. In production you might have 8 rewards: math correctness, code correctness, format, refusal, length penalty, repetition penalty, language-consistency, and a small RM. Each one is one RewardFn instance. The training loop never knows or cares which is which.

EXECUTION STATE

name = string id for logging and tensorboard

fn = (prompt, completion, meta) -> float, deterministic

make_reward: addition is composition

Multiple reward functions are combined by simple addition. If you want non-linear interaction you write a single bigger RewardFn that does the mixing inside. Keeping the combiner linear lets us decompose final rewards into per-rule contributions for debugging — at scale, you WILL spend hours staring at per-rule reward histograms.
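The per-rule decomposition mentioned here can be sketched as a small variant of the combiner that returns the breakdown alongside the total. `score_with_breakdown` and the two toy rules are hypothetical; `RewardFn` is redefined only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewardFn:
    """A named reward function: (prompt, completion, meta) -> float."""
    name: str
    fn: Callable[[str, str, dict], float]

def score_with_breakdown(fns, prompt: str, completion: str, meta: dict):
    """Return (total, per_rule) so each rule's contribution can be logged."""
    per_rule = {rf.name: rf.fn(prompt, completion, meta) for rf in fns}
    return sum(per_rule.values()), per_rule

# Two toy rules, purely illustrative.
has_tags = RewardFn("format", lambda p, c, m: 0.1 if "<answer>" in c else 0.0)
is_right = RewardFn("answer", lambda p, c, m: 1.0 if m["truth"] in c else 0.0)

total, parts = score_with_breakdown(
    [has_tags, is_right], "2+2?", "<answer>5</answer>", {"truth": "4"}
)
print(total, parts)
```

Because the combiner is a plain sum, the per-rule values add up to the total exactly, so a histogram of each entry explains every point of reward the policy receives.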

grpo_step: the actual algorithm

Five steps: sample, score, advantage, re-score under current policy, PPO-clip + KL. The order matters. The 'old' log-probs must be captured at sampling time (no gradient) so we can compute the ratio between current policy and sampling policy at the gradient step. This is what enables multiple gradient passes per sample batch — the heart of PPO's sample efficiency.

EXECUTION STATE

policy = the model whose parameters we are updating

ref = frozen copy of the model from before RL — anchor for KL penalty

G = group size; DeepSeek-R1 used 16, Qwen2 uses 8–32

kl_coef = KL penalty coefficient; too high = no learning, too low = collapse

Sample G rollouts FROM THE SAME PROMPT

This is the structural reason GRPO works without a value function. G rollouts of one prompt see almost identical context distribution, so their reward variance comes mostly from the policy's stochasticity on this prompt. Group-mean subtraction kills the per-prompt baseline; group-std division normalises the scale. The same trick on G rollouts from DIFFERENT prompts would mix unrelated reward scales and break the standardisation.

EXECUTION STATE

gens = (G, T_max) integer token tensor

logp_old = (G, T_max) log-probs from the policy AT SAMPLING TIME, no grad

The reward call: where rule-based meets neural

We detokenize each rollout to a string and pass it through the Python reward function. This is the ONLY non-tensor step in the loop. In production you do this in a worker pool — the CPU-bound reward evaluation should never block the GPU sampler. Rewards are computed in parallel across the G rollouts of one prompt, and across prompts in the batch.

EXECUTION STATE

rewards = (G,) float32 — bounded in [0, R_max] by construction

(rewards - mean) / std: the entire baseline mechanism

PPO's classic baseline subtracts a state-value V(s). GRPO drops V(s) entirely; the per-prompt group mean is a Monte-Carlo estimate of E[R | s], which is exactly what the value function would learn. The std division is what makes GRPO numerically stable across very different prompts. On a math prompt with binary rewards where 7/8 rollouts get the answer right, the std is small (about 0.33), so the one failing rollout receives a large negative advantage (about -2.6) while each success receives a small positive one (about +0.4): the gradient is sharp even though the absolute reward is high. Conversely, a hard prompt where 0/8 succeed has zero std and zero advantage, so there is no gradient, which is correct because there is nothing to learn from.

EXECUTION STATE

adv = (G,) float32, mean=0, std≈1

Re-score under current policy: the gradient pathway

logp_new is computed under the CURRENT parameters with gradient tracking on. logp_ref is under the FROZEN reference. ratio = exp(logp_new - logp_old) is the importance-sampling correction that lets us reuse the same samples for multiple gradient steps. Without this, every parameter update would require fresh rollouts — 100x slower.

EXECUTION STATE

logp_new = (G, T) — gradient flows

logp_ref = (G, T) — for KL penalty, no gradient on ref

PPO clip: the safety rail

If the ratio drifts far from 1 (current policy very different from sampling policy), the clip stops the gradient from following further in that direction. This is what keeps GRPO from collapsing to a single high-reward mode. The 0.2 clip is empirical — too tight (0.1) and learning is slow; too loose (0.5) and the policy can swing wildly and bake in spurious rewards.

EXECUTION STATE

clip = 0.2 — DeepSeek, Llama-2, Qwen all use this

pg = scalar policy-gradient loss; sign convention: minimise -surrogate

KL-to-reference: the leash

Without a KL term against the pre-RL reference, the policy will overfit to whatever shape of output maximises reward — usually long, repetitive, gamed outputs. The KL coefficient is the second-most-important hyperparameter after the clip range. DeepSeek-R1's first attempt without enough KL produced math solutions where the model would say 'the answer is 42' and then loop forever — a perfect format+answer reward score, total nonsense as a useful model.

EXECUTION STATE

kl = scalar; mean log-prob ratio between policy and ref

kl_coef = 0.04 is a common default; tune per task
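For intuition about the `kl` scalar, here is a plain-Python sketch of two common sample-based estimates computed from per-token log-probs of the sampled tokens. The helper name `kl_estimates` is an assumption; the first estimate is the mean log-ratio described above, and the second is a non-negative form used in some GRPO implementations.

```python
import math

def kl_estimates(logp_new, logp_ref):
    """Estimate KL(pi_theta || pi_ref) from log-probs of tokens sampled from pi_theta."""
    diffs = [n - r for n, r in zip(logp_new, logp_ref)]   # log(pi_theta / pi_ref)
    # Mean log-ratio: unbiased, but a single batch can come out negative.
    k1 = sum(diffs) / len(diffs)
    # ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta: never negative.
    k3 = sum(math.exp(-d) + d - 1.0 for d in diffs) / len(diffs)
    return k1, k3

same = kl_estimates([-1.0, -2.0], [-1.0, -2.0])      # identical policies
drift = kl_estimates([-1.0, -2.0], [-1.5, -1.5])     # policies disagree per token
print(same, drift)
```

In the second call the per-token differences cancel in the mean log-ratio, while the non-negative form still reports the disagreement, which is one reason it is popular as a penalty term.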

math_reward: the rule, finally, in production form

A 6-line Python function. No model, no training, no gradient. It returns 0 for malformed output, 0.1 for correct format but missing answer, 0.2 for correct format + wrong answer, and 1.1 for everything right. Notice the asymmetric structure: format reward stays even when answer is wrong, so the model has a non-zero gradient SIGNAL on the path from 'no tags' to 'tags but wrong number' to 'tags and right number'. Without that intermediate signal, RL would have to discover both behaviours simultaneously — which empirically does not work.

EXECUTION STATE

return = 0 | 0.1 | 0.2 | 1.1 — bounded, deterministic, free

Composing rewards is just a list literal

make_reward(RewardFn('math', math_reward)) is the single-reward case. The R1-Zero recipe used exactly this for the first phase of training: math + format only, no human preference data at all. Later phases composed it with code_reward, language_reward, and a small RM-based reward — same training loop, longer list literal. That is the entire architectural advantage of rule-based rewards: a list, not a model.


"""
GRPO training step with rule-based rewards. Two key takeaways:

  - No reward model. The 'reward function' is just one or more Python
    verifiers behind a thin RewardFn interface. They could be Python regex,
    a subprocess sandbox, or in production a separate microservice.
  - No value function. The baseline is the per-prompt reward mean.

We compute the policy-gradient surrogate with PPO clipping (the 'P' part
of GRPO). Run this in a loop over prompts and you have a complete
RL-from-rules training pipeline.

sample_with_logp and policy_logprobs are assumed to be defined elsewhere:
the first returns G sampled completions plus their detached per-token
log-probs, the second returns per-token log-probs of shape (G, T).
"""
import re
from dataclasses import dataclass
from typing import Callable

import torch

# ─── 1. The reward function interface ───────────────────────────────────
@dataclass
class RewardFn:
    """A reward function takes (prompt, completion, meta) -> float."""
    name: str
    fn: Callable[[str, str, dict], float]

def make_reward(*fns: RewardFn):
    """Combine multiple reward functions into one scalar."""
    def combined(prompt: str, completion: str, meta: dict) -> float:
        return sum(rf.fn(prompt, completion, meta) for rf in fns)
    return combined

# ─── 2. The GRPO step ───────────────────────────────────────────────────
def grpo_step(
    policy:    torch.nn.Module,                       # current policy
    ref:       torch.nn.Module,                       # frozen reference
    prompts:   list[str],                             # batch of prompts
    metas:     list[dict],                            # per-prompt info (truths, tests)
    reward_fn: Callable[[str, str, dict], float],     # rule-based reward
    tokenize:  Callable,                              # str  -> tensor
    detokenize:Callable,                              # tensor -> str
    G:         int   = 8,                             # group size
    clip:      float = 0.2,                           # PPO clip
    kl_coef:   float = 0.04,                          # KL-to-ref penalty
):
    """One GRPO update on a batch of prompts."""
    # A plain nn.Module has no .device attribute; read it off the parameters.
    device = next(policy.parameters()).device
    total_loss = 0.0

    for prompt, meta in zip(prompts, metas):
        # 2a. Sample G rollouts from the current policy
        ids = tokenize(prompt).to(device)
        gens, logp_old = sample_with_logp(policy, ids, G)
        completions = [detokenize(g) for g in gens]

        # 2b. Score every rollout with the RULE-BASED reward
        rewards = torch.tensor(
            [reward_fn(prompt, c, meta) for c in completions],
            device=device, dtype=torch.float32,
        )                                              # shape (G,)

        # 2c. Group-relative advantage  (no value network!)
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # 2d. Re-score under current policy (gradient flows here)
        logp_new = policy_logprobs(policy, ids, gens)  # shape (G, T)
        with torch.no_grad():                          # reference stays frozen
            logp_ref = policy_logprobs(ref, ids, gens) # shape (G, T)

        # 2e. PPO clipped surrogate, averaged over generated tokens
        ratio = torch.exp(logp_new - logp_old)         # (G, T)
        surr1 = ratio * adv.unsqueeze(-1)
        surr2 = torch.clamp(ratio, 1 - clip, 1 + clip) * adv.unsqueeze(-1)
        pg    = -torch.min(surr1, surr2).mean()

        # 2f. KL-to-reference penalty stops policy collapse on a high-reward mode
        kl = (logp_new - logp_ref).mean()

        total_loss = total_loss + pg + kl_coef * kl

    total_loss = total_loss / len(prompts)
    return total_loss

# ─── 3. Wiring a rule-based reward into the loop ────────────────────────
def math_reward(prompt: str, completion: str, meta: dict) -> float:
    """Exact-match on \\boxed{...} from a <think>/<answer> formatted rollout."""
    # Raw strings: \s and \S must reach the regex engine with single backslashes.
    tag = re.search(
        r"<think>([\s\S]*?)</think>\s*<answer>([\s\S]*?)</answer>",
        completion,
    )
    if not tag:
        return 0.0                                      # no format reward
    box = re.search(r"\\boxed\{([^}]+)\}", tag.group(2))
    if not box:
        return 0.1                                      # got tags but no boxed
    correct = box.group(1).strip() == meta["truth"]
    return (1.0 + 0.1) if correct else 0.1              # answer + format, or format only

reward_fn = make_reward(
    RewardFn("math", math_reward),
)

# loss = grpo_step(policy, ref, prompts, metas, reward_fn, tokenize, detokenize)
# loss.backward(); optim.step()

Two things to count when you look at this loop.

The forward-pass models: just two (current policy + frozen reference). Compare to standard PPO-RLHF, which needs four (policy + reference + reward model + value model). At 671B parameters that's ~3 TB of GPU memory you don't have to spend.
The trained components: just one, the policy. Compare to PPO-RLHF, which trains two (policy + value net) and treats one as fixed (reward model, trained earlier). The training plumbing collapses.
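To make the "no value model" point concrete, here is a small worked example of the group-relative advantage from step 2c, written with the standard library so it runs without torch. The function name `group_advantages` and the reward values are assumptions for illustration. `statistics.stdev` is the sample standard deviation, which matches the default behaviour of `torch.Tensor.std`.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-6):
    """Normalise rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std (n - 1), like torch.Tensor.std by default
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of G = 8 rollouts: two fully correct, four with
# format-only credit, two malformed.
mixed = [0.0, 0.1, 0.1, 1.1, 0.1, 0.0, 1.1, 0.1]
for r, a in zip(mixed, group_advantages(mixed)):
    print(f"reward={r:.1f} advantage={a:+.2f}")

# Degenerate group: every rollout scored the same, so there is no signal.
flat = [0.1] * 8
print(group_advantages(flat))
```

The flat group is worth noticing: when every rollout for a prompt gets the same reward, all advantages are zero and that prompt contributes nothing to the policy-gradient term. Prompts that are always solved or never solved are effectively wasted samples.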

At Massive Scale: How DeepSeek-R1 Used This

The R1 paper is the canonical example of rule-based rewards pushed to the limit. Phase 1 ("R1-Zero") trained a 671B base model with ZERO supervised fine-tuning data and ZERO human preference data. The entire training signal was two rule rewards:

Format reward, weight 0.1: regex match on <think> ... </think><answer> ... </answer> structure, in that exact order.
Answer reward, weight 1.0: exact match on \boxed{...} against ground truth, for math problems; pass rate on hidden unit tests, for code problems.

That is the whole reward function. The training corpus was ~150k math problems and ~30k code problems. GRPO with group size $G = 16$. Eight thousand training steps. At the end of it, R1-Zero's pass@1 on AIME 2024 had risen from 15.6% to 71.0%, and to 86.7% with majority voting over 64 samples, purely from rule-based RL on a model that had never been instruction-tuned.

How does this fit into massive-model training? Three places to watch.

The GPU/CPU split

Policy forward + sampling sits on the GPU cluster (671B model, ~14 H100 nodes worth of inference). Verifier evaluation sits on a CPU worker pool — for math, ~50 cores keep up with the GPU sampler; for code, ~500 cores because each unit-test subprocess is ~100ms. The CPU pool feeds a Redis queue that the GPU-side training loop drains. Drop a verifier worker and the H100s start idling within a few seconds — verifier capacity is a real cluster-planning input, not an afterthought.
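The shape of that split can be sketched in a few lines. This is a toy stand-in under stated assumptions: a thread pool plays the role of the CPU worker pool, `toy_verifier` is a hypothetical verifier, and there is no Redis queue. A real deployment would use separate processes or machines, as described above.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_verifier(completion: str, truth: str) -> float:
    """Hypothetical rule-based verifier: exact match on the last token."""
    tokens = completion.split()
    return 1.0 if tokens and tokens[-1] == truth else 0.0

def score_batch(completions, truths, workers=4):
    """Score rollouts on a worker pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_verifier, completions, truths))

completions = ["the answer is 5", "I think 7", "", "so x = 5"]
truths = ["5", "5", "5", "5"]
print(score_batch(completions, truths))  # [1.0, 0.0, 0.0, 1.0]
```

Because `pool.map` preserves input order, the rewards line up with the rollouts without extra bookkeeping. With a real queue you would need to carry a rollout id through the verifier instead.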

The sandbox tax

Sandboxing the code-execution verifier is non-trivial at scale. A single malicious-looking rollout (the model emits os.system("rm -rf /")) is harmless if you run it in a per-rollout firejail. It deletes your training cluster if you don't. DeepSeek used a custom microVM (Firecracker-style) per code rollout, with a 2 s wall-clock budget and 256 MB memory cap. Throughput: ~2k code verifications per second per node. Cost: roughly 10% of total training-cluster spend.
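The wall-clock budget is the easiest part of that setup to sketch. The example below runs a snippet in a child Python process with a timeout and a scratch working directory. It is explicitly not a sandbox: there is no memory cap, no filesystem or network isolation, and the function name `run_untrusted` is hypothetical. Real isolation needs something like the per-rollout microVM or firejail described above.

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 2.0) -> bool:
    """Run `code` in a child Python process; True only on a clean exit in time.

    NOT a security boundary: this enforces a wall-clock budget and a
    throwaway working directory, nothing more.
    """
    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=scratch,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # budget exceeded; subprocess.run already killed the child
        return proc.returncode == 0

print(run_untrusted("assert sorted([3, 1, 2]) == [1, 2, 3]"))  # passing test
print(run_untrusted("assert 1 + 1 == 3"))                      # failing test
print(run_untrusted("while True: pass", timeout_s=0.5))        # runaway loop
```

A timed-out rollout and a failing rollout both map to "no reward" here. Whether those two cases should be scored identically is a design choice the verifier owner has to make.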

Reward determinism gives you reproducibility

Because rule rewards are deterministic, you can re-score any old rollout exactly. This sounds obvious until you compare with RM-based RLHF: the reward model continues to drift as the policy evolves (it gets re-trained periodically on fresh preference data), so a rollout's reward at step 1000 is not comparable to the same rollout's reward at step 5000. With rule rewards, every rollout has a permanent, immutable score. Debugging is suddenly possible.

What rule-based rewards cannot do: they cannot judge taste. They cannot score helpfulness, harmlessness, style, creativity, or any subjective dimension. Every frontier lab uses BOTH — rule rewards for verifiable subskills, RM-based rewards for the rest. The mix is typically 60–80% rule-rewarded prompts during the math/code phase, dropping to 20% during the final style-and-safety phase.

Engineering Reality: Reward Hacking and the Verifier Tax

Rule-based rewards eliminate one class of reward hacking (gaming a neural RM) and introduce another (gaming a deterministic verifier). The bugs are different. They are mostly cheaper to fix, but you do have to fix them.

Reward hacks we have personally seen, with their fixes

Hack	Symptom	Fix
Output \boxed{} multiple times with different values	Verifier matches the first one and accepts a guess wrapped in noise	Match the LAST \boxed{}, not the first; or require unique match
Print the unit tests, then `pass`	All tests trivially 'pass' because the model never ran them	Hash the test file content; reject rollouts that contain test names
Open the ground-truth file off disk	Some sandbox setups leak the truth into the runtime	Per-rollout filesystem mount; never put truth on a readable path
Emit empty <think></think> + correct answer	Format reward fires, but no actual reasoning	Add `len(<think>) >= 8` sub-rule; weight 0.2 of format reward
Repeat the answer 1000 times	Length reward (if present) pumps up scores; format/answer still pass	Repetition penalty in reward; cap rollout token budget
Emit non-ASCII look-alikes for ground truth digits	String compare fails on visually identical answer	Unicode NFKC normalise both sides before comparison

Notice that all six fixes are 1-3 line changes to the verifier. That is the silver lining of rule-based rewards: when reward hacking happens, you can fix it on a coffee break. Compare with neural RM reward hacking, where the fix is "collect another 100k preference pairs and retrain the RM" — measured in days and dollars, not minutes.
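Three of the fixes from the table can be sketched directly: match the last \boxed{}, NFKC-normalise before comparing, and require a minimum amount of text inside <think>. The helper names are hypothetical, and the threshold of 8 is read here as characters, which is an assumption (the table does not say whether it means characters or tokens).

```python
import re
import unicodedata

BOXED = re.compile(r"\\boxed\{([^}]+)\}")      # does not handle nested braces
THINK = re.compile(r"<think>([\s\S]*?)</think>")

def normalise(text: str) -> str:
    """NFKC folds full-width and other compatibility digits to ASCII."""
    return unicodedata.normalize("NFKC", text).strip()

def last_boxed(answer: str):
    """Return the normalised content of the LAST boxed answer, or None."""
    matches = BOXED.findall(answer)
    return normalise(matches[-1]) if matches else None

def has_real_reasoning(completion: str, min_chars: int = 8) -> bool:
    """Reject empty or near-empty <think> blocks."""
    m = THINK.search(completion)
    return bool(m) and len(m.group(1).strip()) >= min_chars

rollout = (
    "<think>x^2 - 5x + 6 factors as (x-2)(x-3)</think>"
    "<answer>\\boxed{7} then \\boxed{４２}</answer>"   # full-width digits
)
print(last_boxed(rollout))          # 42
print(has_real_reasoning(rollout))  # True
print(has_real_reasoning("<think></think><answer>\\boxed{42}</answer>"))  # False
```

NFKC handles compatibility forms such as full-width digits. It does not catch every visual look-alike, for example letters borrowed from other scripts, so treat it as one layer rather than a complete defence.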

The verifier tax

Rule-based rewards trade off in two places that are easy to underestimate.

You need ground truth. Every prompt in your training corpus must come paired with a verifiable answer. For math and code this is easy (existing datasets). For domain tasks (medical reasoning, legal extraction, scientific QA), you have to either curate or synthesise the ground-truth set — and if you do it badly, you train the model to be confidently wrong in exactly the ways your dataset is wrong.
You need verifier capacity. The CPU/GPU balance is real and unforgiving. A training cluster optimised for pre-training (heavy on GPU, light on CPU) will throughput-starve on rule-based RL because the verifier pool is too small. Plan for ~1 CPU core per concurrent rollout for math, ~10 cores per rollout for code, and provision the cluster accordingly. This is one of the surprise budget items that derails first-time rule-based RL projects.
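A back-of-the-envelope way to size the verifier pool is Little's law: the number of busy cores is roughly the arrival rate of checks times the time each check takes. The sketch below applies that with a headroom factor. The function name, the headroom default, and every workload number are illustrative assumptions, not measurements from any real cluster.

```python
import math

def cores_needed(rollouts_per_sec: float, checks_per_rollout: int,
                 seconds_per_check: float, headroom: float = 1.5) -> int:
    """Little's law: busy cores = arrival rate x service time, plus headroom."""
    busy = rollouts_per_sec * checks_per_rollout * seconds_per_check
    return math.ceil(busy * headroom)

# Hypothetical math workload: one cheap regex check per rollout.
print(cores_needed(rollouts_per_sec=200, checks_per_rollout=1,
                   seconds_per_check=0.005))

# Hypothetical code workload: 20 unit tests per rollout at ~100 ms each.
print(cores_needed(rollouts_per_sec=200, checks_per_rollout=20,
                   seconds_per_check=0.1))
```

The gap between the two results is the point: at the same sampling rate, a code verifier can need orders of magnitude more CPU than a math verifier, because the per-check time dominates.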

The unfixable failure mode of rule-based rewards is task-coverage: the model gets very good at exactly the verifiable slice of behaviour and ignores everything else. This is why R1 had to be re-mixed with SFT and human-preference data in its later phases — pure rule-based RL on a base model produces a savant that solves AIME but cannot hold a conversation. The verifier is a sharp tool for sharp tasks. It does not replace taste.

The take-away for massive-model training is the same as the take-away for any tool in this book: pick the right one for the job. Rule-based rewards are the right tool when ground truth exists, verification is cheap, and the reward range is bounded. For those tasks — and a surprisingly large fraction of the useful work LLMs do is in those tasks — they are faster, cheaper, more interpretable, and more robust than the neural alternative. Use them. The next section adds a second new tool on top: when verification can't be done by regex but CAN be done by another LLM, you reach for generative reward models, the bridge between rule-based and human-preference rewards.