Chapter 12
28 min read
Section 71 of 117

Evaluation at Long Context

Long-Context Extension

The Real Problem: A 128k Number on a Spec Sheet

Every modern frontier model release is announced with a context-length number, usually a power of two. 128k. 200k. 1M. The number sits on the spec sheet next to the parameter count and the MMLU score, and it is the single most misunderstood metric in large-language-model training. A model that accepts a 128 000-token prompt without raising an error is technically a 128k model. A model that can reliably use the information in that prompt is a completely different beast — and the gap between the two is wide enough to bury an entire product launch in.

The mechanics from sections 12.1–12.3 explain why the gap exists. Position embeddings have to extrapolate or interpolate. Attention has to dilute mass across a longer key-set. The KV cache balloons and forces architectural compromises. Continued pretraining on synthetic long-context data only works for some patterns. By the time the model reaches release, you have a stack of small, mostly-invisible regressions that all show up in the same place: questions whose answer depends on a needle buried deep in a haystack.

The one-line takeaway of this section: a long-context release is only as honest as its evaluation. Without a real long-context eval suite, “128k” means nothing more than “the model will not crash on a 128k prompt.”

Intuition: A 128k Model That Forgets at Token 8k

Picture an open-book exam. The student is handed a 600-page textbook five minutes before the test starts and told the answer to one of the ten questions is on page 312. The other nine questions can be answered from general knowledge. If you score the test only on the average across all ten questions, a student who never even opens the book can get a 90%. The one question that depends on the book is buried in the signal.

That is exactly what happens when you evaluate a long-context model on next-token loss. The vast majority of tokens in a long document are easy: they continue a sentence, they finish a code block, they complete a list. A model that has completely lost track of the start of the document still predicts most tokens just fine, because most tokens are predicted from a window of only a few hundred neighbours. The cross-entropy stays low. Perplexity drops. The headline number looks great.

The questions that actually require the model to remember the whole document are a tiny fraction of the token distribution. Cross-entropy averages over them. A long-context evaluation has to do the opposite: it has to construct prompts whose only possible answer requires reading the whole context, and then score the model on those prompts alone.

What “Long-Context Capability” Actually Means

Let the prompt be a sequence of tokens x1:Nx_{1:N} of length NN, and let yy be the expected answer. A general capability measure for context length NN is the conditional accuracy A(N)=E(x1:N,y)DN ⁣[1[M(x1:N)=y]]A(N) = \mathbb{E}_{(x_{1:N}, y) \sim D_N}\!\left[ \mathbf{1}[ M(x_{1:N}) = y ] \right], where MM is the model and DND_N is a distribution of (prompt, answer) pairs at length NN. The whole game of long-context evaluation is choosing DND_N so that:

  1. Every yy is uniquely determined by the prompt. No general-knowledge shortcuts.
  2. The information required to answer yy depends on a position inside x1:Nx_{1:N} that we control. So we can sweep the “depth” axis.
  3. The grader is deterministic and reproducible. Substring match, exact match, or a frozen LLM-as-judge with a fixed rubric — never a free-form human grader for a regression suite.

The Needle-in-a-Haystack (NIAH) construction is the simplest DND_N that satisfies all three. We embed a single sentence — “the needle” — at position p[0,N]p \in [0, N] inside an otherwise unrelated document of length NN, then ask a question whose only valid answer is the needle. Reporting A(N,p)A(N, p) as a 2-D heatmap is the canonical long-context plot.

Two failure modes shape that heatmap. The first is the extrapolation cliff: for any NN beyond the model's trained window NtrainN_{\text{train}} and any depth, accuracy crashes to near-zero, because positions past NtrainN_{\text{train}} generate RoPE phases the model has never seen. This is what raw Llama-2 looks like at 16k. The second is the lost-in-the-middle dip: even inside the trained window, accuracy is highest near p=0p = 0 and p=Np = N and dips in the middle. Liu et al. (2023) showed this U-shape holds for nearly every model trained with standard data, because attention mass naturally concentrates at the beginning (the system prompt) and the end (the question).

How to read a NIAH heatmap: x is context length, y is needle depth (% of the way through the prompt), colour is accuracy. Green everywhere is what you want; a red right edge means an extrapolation cliff; a red horizontal band in the middle means lost-in-the-middle.

Manual Numerical Walkthrough: A 32-Token NIAH

Let us run NIAH by hand at a scale a human can actually verify. The haystack is six copies of the same boring sentence; the needle is one sentence injected at a chosen depth. We tokenize with whitespace splits to keep the arithmetic readable.

Filler sentence (4 tokens): the cat sat down
Needle (7 tokens): the magic number is forty two today
Question: What is the magic number?

Target length is 32 tokens. Six filler copies give 24 tokens; the needle adds 7 (we drop one filler copy to keep the total close to target). We sweep three depths: 0% (front), 50% (middle), 100% (back). The harness produces three prompts:

depthprompt (32 tokens, needle in bold)needle position
0%**the magic number is forty two today** the cat sat down × 5 + the cat sat down + fillertokens 1–7
50%the cat sat down × 3 + **the magic number is forty two today** + the cat sat down × 2tokens 13–19
100%the cat sat down × 6 + **the magic number is forty two today**tokens 26–32

Now imagine three different models answering all three prompts. Model A is a tiny RNN that only remembers the last 8 tokens; it sees only the question and the surrounding filler, so it answers all three with “I don't know.” Score: 0 / 3.

Model B is a transformer trained on sequences up to length 16. It can see the whole 32-token prompt syntactically, but its position embeddings only ever saw positions 0–15 during training. For the depth = 0% prompt, the needle sits inside the trained position range, so it retrieves and answers “forty two.” For the depth = 50% prompt the needle starts at position 13 (still inside the trained range) and the question is appended past position 16 (so RoPE phases 13–32 are partially extrapolated, attention is brittle, but the retrieval just barely succeeds half the time). For depth = 100%, the needle lives in positions the model never saw any training data for — complete miss. Score: ~1.5 / 3.

Model C is the same transformer with YaRN applied to RoPE so all positions up to 32 behave like they were trained. All three prompts are inside the post-extension trained range. Retrieval succeeds in every case. Score: 3 / 3.

Now repeat this experiment at length N=32,000N = 32{,}000 with 11 depths instead of 3, and you have one column of a real NIAH heatmap. Now repeat for eight different lengths and you have the full heatmap. The principle is unchanged from this hand example — only the scale and the model.

Visualizing NIAH Failure Modes

The interactive heatmap below switches between five archetypal long-context profiles. Click each profile to see the characteristic shape — the extrapolation cliff, the lost-in-the-middle dip, the smooth PI decay, the YaRN-preserved high-frequency floor. These are idealised curves, but every one of them maps onto a real published figure.

Loading needle-in-a-haystack heatmap…

Two things are worth pausing on. First, the cliff profile is the one most release candidates produce when you naively push their max context past the trained window. It is also the profile a careless eval can miss entirely if it never tests past 8k. Second, the lost-in-the-middle profile is real and persistent — every long-context paper since Liu et al. 2023 has had to address it, and it is the main reason agentic frameworks aggressively re-rank and summarise their contexts even when the model technically accepts the full document.

The Perplexity Trap

The most common temptation when extending a model's context is to evaluate using perplexity on long documents. It is fast, deterministic, and produces a single scalar per model. It also lies. The chart below shows what happens if you let perplexity be your only signal: PPL stays in the healthy 3–4 range out to the full 128k window, while retrieval accuracy collapses to zero the moment the prompt exits the trained region.

Loading perplexity / retrieval chart…

The mechanism is simple. Cross-entropy on natural text is dominated by short-range structure: the local n-gram, the local syntax, the ongoing topic. A model that has completely failed to integrate information from 30 000 tokens ago still predicts the next sentence from the previous sentence. The errors that matter for retrieval — sharp, localised, on a single named entity — are washed out by the thousands of cheap next-token predictions around them.

This is not a new observation. It is one of the lessons every long-context team relearns the hard way. The first generation of long-context papers (early 2023) leaned heavily on PPL because no good alternative existed. NIAH (Kamradt, mid-2023) and then RULER (Hsieh et al., 2024) gave the field something measurable, and within a year nobody serious was publishing PPL-only long-context numbers anymore.

Rule of thumb: if a long-context release announces a new max context without showing a NIAH heatmap or RULER table, treat the context number as marketing, not capability.

Beyond NIAH: RULER, LongBench, InfiniteBench

NIAH is the right first eval, but it is not the only eval, and it rewards a very specific capability — single-key retrieval — that not all long-context tasks need. Real applications ask the model to reason across multiple documents, aggregate frequencies, track variables across an argument, and answer questions whose evidence is scattered. The community has converged on three benchmarks that cover these dimensions.

BenchmarkWhat it testsTasksWhen to use
NIAH (Kamradt 2023)Single-key retrieval at depth1 (the needle)First sanity check after any context extension
RULER (Hsieh et al. 2024)13 sub-tasks: multi-needle, variable tracking, aggregation, QA13 across 4 categoriesThe standard regression suite for long-context releases
LongBench (Bai et al. 2023)21 real-world tasks across 6 categories, multilingual21Capability comparison across released models
InfiniteBench (Zhang et al. 2024)100k+ tasks: novel summarisation, code completion, math12Stress-test at the edge of the trained window

The biggest practical upgrade NIAH → RULER buys you is the multi-task degradation profile. NIAH hides the fact that single-needle retrieval is the easiest long-context capability. Variable tracking, multi-key retrieval, and frequency aggregation all degrade much faster with length. The grid below switches between three model archetypes and shows how the cells change task by task.

Loading RULER task grid…

Notice the pattern that holds across all three model classes: as length grows, the bottom rows (aggregation, multi-hop QA) lose accuracy before the top rows (single-needle retrieval). A model can score 95% on NIAH at 128k and 25% on RULER's aggregation tasks at the same length. Both numbers are real. Only one of them gets put on the press release.

Plain Python: NIAH from Scratch

The harness is small — fewer than a hundred lines of Python with no ML dependencies. We build the prompt, we score the answer, we loop over a grid. Every long-context evaluation harness in production (Anthropic's, Google's, the open-source lm-evaluation-harness suite) has this same skeleton, with more polished prompts and a bigger task library bolted on.

NIAH harness — pure Python, framework-agnostic
🐍niah_plain.py
16The needle — one sentence, totally unrelated to the filler

The needle must be semantically unrelated to the surrounding text. If the filler is about Rome and the needle is about Dolores Park, no amount of pattern-matching can let the model hallucinate the answer — the only way to score is to actually retrieve the needle from its own KV cache. This is why NIAH is a retrieval test, not a knowledge test.

EXAMPLE
NEEDLE = 'The best thing to do in San Francisco is eat a sandwich at Dolores Park on a sunny day.'
17The question that only the needle answers

The question is paraphrased — we do not literally copy a span from the needle. A model that has truly read the document will produce the answer; a model that is only relying on the question (and not actually retrieving) cannot.

EXAMPLE
QUESTION = 'What is the best thing to do in San Francisco?'
18The substring we grade against

Substring grading is the original Kamradt protocol: case-insensitive, partial credit not awarded. RULER later adds LLM-as-judge for paraphrases, but for our toy harness the substring rule is robust and reproducible. Tradeoff: a model that says 'eat-a-sandwich at Dolores' would fail — fine, because the test is supposed to be hard.

EXAMPLE
ANSWER_KEY = 'eat a sandwich at dolores park'
23Filler sentence — boring and repetitive on purpose

We deliberately use repetitive filler so we can hit any target context length by repeating a fixed-cost sentence. In production NIAH the filler is a corpus of essays (Greg Kamradt used Paul Graham) for higher fidelity to real reading-comprehension distributions; the qualitative results are the same.

EXAMPLE
~32-token sentence about Rome
30Function: build one prompt at a chosen (length, depth)

Every call returns a fully-formed prompt where the needle sits at exactly depth_pct of the way through the document. With this we can run the full 2-D sweep that produces the heatmap reviewers see in long-context papers. Inputs: context_tokens (target total prompt length, e.g. 32 000), depth_pct (float in [0, 1] — 0 = needle at start, 1 = needle at end), tokens_per_sentence (~24 — assumes a BPE rate for English; this is the unit of length resolution).

EXAMPLE
build_prompt(32_000, 0.5) → 32 000-token prompt with needle in the middle
33Pick how many filler sentences fit

Total length in tokens ÷ tokens per sentence gives how many copies of the filler we need. For context_tokens = 32 000 and tokens_per_sentence = 24, n_sentences = 1333. Each one is a copy of the same Roman-Empire sentence — boring on purpose.

EXAMPLE
context_tokens=8000 → n_sentences=333; context_tokens=128000 → n_sentences=5333
36Compute the insertion index from the depth percentage

depth_pct = 0.0 → insert_idx = 0 (needle goes at the start). depth_pct = 1.0 → insert_idx = n_sentences − 1 (needle at the end). depth_pct = 0.5 puts the needle exactly in the middle. The 'lost in the middle' phenomenon (Liu et al. 2023) is exactly the dip you see at depth_pct ≈ 0.5 on a heatmap.

EXAMPLE
n_sentences=333: depth 0.0 → idx 0; depth 0.5 → idx 166; depth 1.0 → idx 332
37Prepend the needle in place of the chosen filler sentence

We do not replace the filler — we prepend the needle to it. This keeps the total length close to the target and guarantees the needle is exactly one sentence (no surrounding context that could leak hints).

39Join and wrap in an instruction template

The <document> tags are not magic — many evaluators omit them entirely. Their purpose is to make the harness less sensitive to how a given model was instruction-tuned, by giving a clear boundary between context and question. Frontier-lab harnesses (e.g. Anthropic's evals) put very careful prompts here; for a sanity check, the bare-bones version above is enough.

EXAMPLE
haystack = concatenated string ~ context_tokens tokens long
51Score: substring match, no partial credit

Returns 1.0 if the answer contains the canonical answer substring (case-insensitive), 0.0 otherwise. This is a deliberately strict grader — a model that hallucinates a paraphrase will score 0. In real RULER runs you mix substring grading for clear answer tests with LLM-as-judge grading for paraphrase-friendly tasks.

EXAMPLE
score('the answer is eat a sandwich at Dolores Park.') → 1.0
60Result record for one cell of the heatmap

Each cell of the published heatmap is one Result. length and depth_pct identify the cell; accuracy is the mean over samples_per_cell independent samples (typically 3–10 in production; 1 in casual reports — which is why error bars on NIAH heatmaps matter).

EXAMPLE
Result(length=32_000, depth_pct=0.5, accuracy=0.66)
66The sweep: loop over lengths × depths × samples

This is the entire experiment in one nested loop. For Kamradt's original PG haystack heatmap, lengths = 15 points from 1k to 128k and depths = 11 points from 0% to 100%, giving 165 cells. With samples_per_cell = 3 that is 495 generations — the cost of one NIAH plot. For a frontier 128k model at ~1k output tokens this is ~5 GPU-minutes; for a 1M-context model, ~5 GPU-hours.

EXAMPLE
niah_sweep(model, [1_000,4_000,8_000,16_000,32_000], [0,0.25,0.5,0.75,1.0], 5) → 125 generations
73Inner loop: collect samples_per_cell scores per cell

Each iteration: build a fresh prompt (the filler is identical but a real harness would shuffle paragraph order between samples), call the model, score, accumulate. Averaging absorbs single-trial sampling noise. With samples_per_cell = 1 you get the noisy single-shot heatmaps that look great on Twitter but are statistically meaningless.

89Fake model — accepts the needle iff prompt is under its trained window

This is our toy stand-in: it 'sees' anything up to trained_tokens, beyond which it shrugs and says 'I'm not sure.' Run the sweep with trained_tokens = 8000 and you get the canonical Llama-2-shaped heatmap: green up to 8k, then a brick wall of red. Real models are noisier than this — but the structure of the figure is identical.

EXAMPLE
fake_model_factory(8000) — about Llama-2-7B's native ctx
91Inner generator function — closes over trained_tokens

📚 The closure captures trained_tokens at construction time. Calling fake_model_factory(8000) returns a function that quietly enforces 'can only see 8k tokens'. We use the simple split() to count tokens — fine for English; a real harness uses the model's actual tokenizer (e.g. tiktoken for GPT, the Llama BPE for Llama).

EXAMPLE
n_tokens = len(prompt.split())  # rough whitespace token count
99Run the sweep on the fake model

Five lengths × five depths × three samples = 75 calls. With trained_tokens = 8000, every (length ≤ 8k, any depth) cell is 1.0 and every (length > 8k, any depth) cell is 0.0 — the cliff. This is what a base, un-extended Llama-2 would produce. After we apply YaRN (§ 12.3) the cliff is replaced by a gentle slope, and that gentle slope is exactly what the rest of this section is about measuring.

EXAMPLE
lengths=[1k,4k,8k,16k,32k], depths=[0,0.25,0.5,0.75,1.0]
98 lines without explanation
1"""
2Needle-in-a-Haystack (NIAH) from scratch.
3
4We build a haystack of "filler" text, insert a single sentence ("the needle")
5at a chosen depth, then ask the model a question whose only honest answer is
6the needle. Repeating this over many (length, depth) pairs gives the
7characteristic 2-D heatmap that long-context releases live and die by.
8
9The harness is intentionally framework-agnostic. The only thing we need from
10the model is a callable that returns a string answer given a prompt.
11"""
12
13from dataclasses import dataclass
14from typing import Callable, List, Tuple
15
16# ---------------------------------------------------------------------------
17# 1. Build one NIAH prompt.
18# ---------------------------------------------------------------------------
19
20NEEDLE = "The best thing to do in San Francisco is eat a sandwich at Dolores Park on a sunny day."
21QUESTION = "What is the best thing to do in San Francisco?"
22ANSWER_KEY = "eat a sandwich at dolores park"   # substring we will look for
23
24# A long, repetitive haystack. In real evals you use Paul Graham essays
25# (the "PG haystack" from Greg Kamradt's original NIAH), but for a toy run
26# any unrelated text whose total length we can control will do.
27HAYSTACK_SENTENCE = (
28    "The Roman Empire spanned three continents and lasted for over a "
29    "thousand years, shaping the laws, languages, and architecture of "
30    "modern Europe. "
31)
32
33
34def build_prompt(context_tokens: int, depth_pct: float,
35                 tokens_per_sentence: int = 24) -> str:
36    """Construct the (very long) prompt for one NIAH datapoint."""
37    n_sentences = max(1, context_tokens // tokens_per_sentence)
38    filler = [HAYSTACK_SENTENCE] * n_sentences
39
40    insert_idx = int(round(depth_pct * (n_sentences - 1)))
41    filler[insert_idx] = NEEDLE + " " + filler[insert_idx]
42
43    haystack = "".join(filler)
44    return (
45        "Read the following document carefully and then answer the question "
46        "using only information from the document.\n\n"
47        f"<document>\n{haystack}\n</document>\n\n"
48        f"Question: {QUESTION}\nAnswer:"
49    )
50
51
52# ---------------------------------------------------------------------------
53# 2. Score one model answer.
54# ---------------------------------------------------------------------------
55
56def score(answer: str) -> float:
57    """1.0 if the needle appears (case-insensitive substring), else 0.0."""
58    return 1.0 if ANSWER_KEY in answer.lower() else 0.0
59
60
61# ---------------------------------------------------------------------------
62# 3. Sweep over a grid of (length, depth) pairs.
63# ---------------------------------------------------------------------------
64
65@dataclass
66class Result:
67    length: int        # context length in tokens
68    depth_pct: float   # where the needle was placed
69    accuracy: float    # 0.0 or 1.0 for a single sample; mean over k samples
70
71def niah_sweep(
72    generate: Callable[[str], str],
73    lengths: List[int],
74    depths: List[float],
75    samples_per_cell: int = 5,
76) -> List[Result]:
77    """Run a full NIAH grid. Returns one Result per cell."""
78    out: List[Result] = []
79    for L in lengths:
80        for d in depths:
81            scores = []
82            for _ in range(samples_per_cell):
83                prompt = build_prompt(L, d)
84                ans = generate(prompt)
85                scores.append(score(ans))
86            out.append(Result(L, d, sum(scores) / len(scores)))
87    return out
88
89
90# ---------------------------------------------------------------------------
91# 4. A fake "model" so we can run the harness end-to-end without a GPU.
92#    The fake reads the needle at depth d only if the prompt is shorter
93#    than 8k tokens (about the trained window for a stock Llama-2-7B).
94# ---------------------------------------------------------------------------
95
96def fake_model_factory(trained_tokens: int = 8000):
97    def generate(prompt: str) -> str:
98        n_tokens = len(prompt.split())
99        if n_tokens <= trained_tokens and NEEDLE.lower() in prompt.lower():
100            return ANSWER_KEY
101        return "I'm not sure."
102    return generate
103
104
105if __name__ == "__main__":
106    lengths = [1_000, 4_000, 8_000, 16_000, 32_000]
107    depths  = [0.0, 0.25, 0.5, 0.75, 1.0]
108
109    results = niah_sweep(fake_model_factory(8_000), lengths, depths,
110                         samples_per_cell=3)
111
112    for r in results:
113        print(f"len={r.length:>6}  depth={r.depth_pct:>4.0%}  "
114              f"acc={r.accuracy:.2f}")

PyTorch: Running NIAH on a Real Model

Swapping the fake generator for a real model touches three things: the tokenizer (use the model's own), the dtype and attention implementation (bfloat16 + FlashAttention 2 at this scale), and the decoding settings (greedy, low temperature). The harness itself does not move.

NIAH on Llama-3.1-8B — HuggingFace + FlashAttention 2
🐍niah_hf.py
18The model under test

Llama-3.1 8B Instruct ships with a 128k context window. We pick it because it is one of the smallest models with a serious long-context claim — small enough to run a full sweep on a single H100 in tens of minutes, large enough that the failure modes you see are representative of frontier models.

EXAMPLE
MODEL_ID = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
20Tokenizer — must be the model's own

📚 AutoTokenizer.from_pretrained downloads the BPE merges file shipped with the checkpoint. Using a generic tokenizer (e.g. tiktoken) gives the wrong token count, which makes your '128k' rows actually be 110k or 150k — and that off-by-15% is enough to push the cliff off the edge of the heatmap. Returns a PreTrainedTokenizerFast bound to the model's vocabulary.

EXAMPLE
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
21Load the model in bfloat16 with FlashAttention 2

📚 AutoModelForCausalLM.from_pretrained — torch_dtype=torch.bfloat16 halves memory vs fp32 and is the native dtype Llama-3 trained in (so no quality loss); device_map='auto' lets HF accelerate place layers across whatever GPUs are visible; attn_implementation='flash_attention_2' is critical at 128k — it brings attention memory from O(N²) to O(N), without which an 8B model OOMs at ~40k on an 80 GB H100.

EXAMPLE
model loads at ~16 GB; with flash-attn 2, 128k prompt prefill fits in ~62 GB total
28Switch to eval mode

model.eval() turns off dropout and switches batchnorm layers into running-stat mode. For Llama-3 there is no dropout in inference anyway, but this is muscle memory worth keeping — and it lets HF skip allocating the dropout RNG state per layer.

31Factory: build the (prompt → answer) closure

We do not want the model and tokenizer reloaded for every NIAH cell, so we return a closure that captures them. niah_sweep then just sees a Callable[[str], str] — exactly the same interface the fake generator implemented.

EXAMPLE
generator = make_hf_generator(max_new_tokens=64)
36Tokenize with truncation OFF

📚 tokenizer(prompt, return_tensors='pt'): runs the BPE merges and returns a dict of tensors {input_ids: [1, S], attention_mask: [1, S]}. We deliberately set truncation=False — the whole point of NIAH is to feed sequences longer than the model can comfortably handle and measure how it falls over. truncation=True would silently make the test trivial.

EXAMPLE
inputs['input_ids'] → torch.int64 of shape [1, S] where S is the real token count
42Read the actual token count

n_tokens is the source of truth for that row of the heatmap. If you log this you will often discover that asking for a '32 000-token prompt' actually gives 31 740 tokens, because our build_prompt rounds at the sentence boundary. The horizontal axis of every NIAH heatmap is approximate by ±~1%.

EXAMPLE
n_tokens = inputs['input_ids'].shape[1]  # e.g. 31 740
43Detect: are we past the model's trained context?

model.config.max_position_embeddings is what the checkpoint claims its context window is — 131 072 for Llama-3.1. Past that, the model has never seen a RoPE phase from that position, so behaviour is whatever the position-encoding extension (PI / YaRN / NTK) buys us. We do not raise — we want to *measure* this regime, not hide it.

EXAMPLE
Llama-3.1-8B: model.config.max_position_embeddings = 131072
50Greedy decoding — no sampling for NIAH

📚 model.generate — do_sample=False forces argmax decoding; temperature=0.0 makes the answer deterministic given model+prompt, so re-running the sweep gives identical numbers (critical for regression-testing a long-context fine-tune); max_new_tokens=64 caps the answer length so a model that wanders does not cost 1 000 tokens per cell; pad_token_id=eos silences a HF warning on models with no dedicated pad token.

EXAMPLE
out = model.generate(input_ids=[1, 31740], max_new_tokens=64, do_sample=False)
58Slice off the prompt from generate's output

📚 model.generate returns the WHOLE sequence (prompt + answer) by default, not just the answer. We slice from inputs['input_ids'].shape[1] forward to get only the newly-generated tokens. tokenizer.decode(skip_special_tokens=True) converts those back to a string and drops <s>, </s>, etc.

EXAMPLE
answer_ids = out[0, 31740:]  # shape [≤ 64]
65Reuse the harness from the plain-Python file

The plain-Python script's build_prompt, score, and niah_sweep are model-agnostic. By keeping them in their own module and importing them here, we get one harness that works for fake models, OpenAI API models, vLLM endpoints, and direct HF inference — all without code duplication. Frontier-lab eval harnesses are organised the same way.

EXAMPLE
from niah_plain import build_prompt, score, niah_sweep, Result
71The published shape: 8 lengths × 11 depths

This is Kamradt's canonical resolution: 8 length steps geometrically spaced from 1k to 128k, 11 depth percentages from 0% to 100% in steps of 10. With samples_per_cell = 3 that is 264 generations — about 25 GPU-minutes on an 8B model with FlashAttention 2 at 128k.

EXAMPLE
lengths=[1_000,2_000,4_000,8_000,16_000,32_000,64_000,128_000], depths=[0,0.1,...,1.0]
75Persist raw results to JSON

Always dump the raw cell-level data, not just the rendered heatmap. The heatmap PNG is the press release; the JSON is what lets the next engineer compare a fine-tune to baseline, draw error bars, or rerun the analysis with a stricter grader without paying for another 25 GPU-minutes.

EXAMPLE
pathlib.Path('niah_llama31_8b.json').write_text(json.dumps(...))
70 lines without explanation
1"""
2NIAH on a real PyTorch model.
3
4We swap the fake generator from the Python script for a Hugging Face
5transformers call. The harness is unchanged — niah_sweep does not know
6whether it is calling a 7B Llama or a 405B frontier model.
7
8Key engineering details:
9  - load in bfloat16 on a single H100 (80 GB) — fits up to ~70B at 128k
10    with FlashAttention 2 and reasonable KV-cache sharding;
11  - use the model's *own* tokenizer for accurate length control;
12  - cap output to ~64 tokens — the answer is short, longer outputs only
13    add cost and let the model wander.
14"""
15
16from typing import Callable
17import torch
18from transformers import AutoModelForCausalLM, AutoTokenizer
19
20MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
21
22tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
23model = AutoModelForCausalLM.from_pretrained(
24    MODEL_ID,
25    torch_dtype=torch.bfloat16,
26    device_map="auto",
27    attn_implementation="flash_attention_2",
28)
29model.eval()
30
31
32def make_hf_generator(max_new_tokens: int = 64) -> Callable[[str], str]:
33    """Return a generate(prompt) -> str closure suitable for niah_sweep."""
34
35    def generate(prompt: str) -> str:
36        # 1. Tokenize WITH attention to the model's real context limit.
37        inputs = tokenizer(
38            prompt,
39            return_tensors="pt",
40            truncation=False,                # we want NIAH to actually fail
41        ).to(model.device)
42
43        n_tokens = inputs["input_ids"].shape[1]
44        if n_tokens > model.config.max_position_embeddings:
45            # Past the trained ctx — this is the failure regime we are
46            # *measuring*, so we still hand it to the model.
47            pass
48
49        # 2. Greedy decode. NIAH is a retrieval test, sampling would
50        #    add noise without buying any quality signal.
51        with torch.inference_mode():
52            out = model.generate(
53                **inputs,
54                max_new_tokens=max_new_tokens,
55                do_sample=False,
56                temperature=0.0,
57                pad_token_id=tokenizer.eos_token_id,
58            )
59
60        # 3. Slice off the prompt — generate() returns prompt+answer.
61        answer_ids = out[0, inputs["input_ids"].shape[1]:]
62        return tokenizer.decode(answer_ids, skip_special_tokens=True)
63
64    return generate
65
66
67# Reuse build_prompt, score, niah_sweep from the plain-Python file above.
68from niah_plain import build_prompt, score, niah_sweep, Result      # noqa: E402
69
70if __name__ == "__main__":
71    generator = make_hf_generator(max_new_tokens=64)
72
73    # A realistic 11-depth x 8-length sweep — 88 cells, samples_per_cell=3.
74    lengths = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]
75    depths  = [i / 10 for i in range(11)]
76
77    results = niah_sweep(generator, lengths, depths, samples_per_cell=3)
78
79    # Write to disk so the heatmap script can plot it offline.
80    import json, pathlib
81    pathlib.Path("niah_llama31_8b.json").write_text(json.dumps(
82        [r.__dict__ for r in results], indent=2,
83    ))

From toy to production

The harness above runs in tens of GPU-minutes on an 8B model. When the same idea is run inside a frontier lab, three things change:

  1. Scale. The eval is run on every checkpoint of every continued-pretraining run, often with thousands of cells (multiple needles per cell, multiple needle positions, multiple haystack themes). Each run is a non-trivial GPU bill, so the eval cluster is a budgeted resource.
  2. Distribution shift. The haystack is no longer a single domain (Paul Graham essays). It is a sampled mixture of domains the model is expected to handle in production — legal, medical, codebases, technical reports — because long-context performance varies sharply with domain.
  3. Adversarial needles. The needle is no longer a single distinctive sentence. It is paraphrased, embedded in similar-looking distractor sentences, or split across multiple positions. This catches models that learned to pattern-match the NIAH format rather than actually retrieve.

At Massive Scale: Evaluating 1M-Context Models

Once the trained context reaches 1M tokens, every previous bottleneck gets worse by an order of magnitude.

The compute cost of the eval itself

One NIAH cell at 1M tokens needs the full prefill of a 1M-token KV cache. With FlashAttention 2 the cost is roughly O(Ndmodel)O(N \cdot d_{\text{model}}) per layer for attention plus the usual O(Ndmodel2)O(N \cdot d_{\text{model}}^2) for the FFN — at N=106N = 10^6 this is roughly 101210^{12} FLOPs per layer per forward, or 101410^{14} for a 100-layer model. A full 88-cell sweep at 1M is therefore ~101610^{16} FLOPs — about three minutes of an 8-H100 node at fp8 peak, which is fine; but the KV cache alone takes ~10 GB at hidden size 7 168 and head dimension 128 (two cache types) — and that has to fit on the device while the model weights are loaded.

The KV cache becomes the eval bottleneck

For a Llama-3-class 70B model at 1M context, the KV cache is bigger than the model itself. A single eval prompt requires sharding the KV cache across multiple devices, exactly the same tensor-parallel choreography as serving — except for an eval workload that is mostly prefill, not decode. This is why long-context eval infrastructure at a frontier lab is co-designed with the inference stack, not built as a side project.

The grading function has to keep up

At 1M context the model's answer is no longer a 64-token substring match. Tasks like “summarise this novel” require LLM-as-judge with a careful rubric. The judge model is often the same family as the model under test, which raises the question of grading bias — frontier labs typically use a held-out judge of comparable but different lineage (e.g. grading Claude with GPT-4 and vice versa) and report inter-judge agreement.

The training data has to teach the eval

Naively training on long documents does not produce a 1M-context model that scores well on RULER. The training mix needs synthetic long-context tasks — book-length question answering, paraphrased needles at random depths, multi-document aggregation — because those are the capabilities we are about to measure. This is the deepest connection between training and evaluation in long-context: the evaluation set effectively defines what “1M context” means, and the training set has to be engineered to match.

Engineering Reality: What Actually Fails

Two years of long-context releases have produced a small catalogue of recurring failure modes. The good ones come up in code reviews before the release ships; the bad ones get caught in the wild and named after the team that hit them first.

  • Tokenizer mismatch. The eval harness uses a different tokenizer from the model. The horizontal axis of the heatmap is off by ±15%, the cliff is in the wrong place, and the team spends a week debugging a regression that does not exist. Always tokenize with the model's own vocabulary.
  • Truncation silently kicks in. The HuggingFace tokenizer truncates by default at the model's configured max position. Pass truncation=False when running NIAH or the model will pass cells it should fail.
  • The prompt template is the system prompt. Some chat-tuned models prepend a default system prompt unless you explicitly pass one. That extra ~500 tokens at the front shifts every depth percentage and is invisible unless you log token counts.
  • The needle is too memorable. “The magic number is 42” appears in pretraining data. Use a needle the model could not have memorised — a randomly-generated string of words, a date that does not exist, a fictional named entity.
  • Greedy decoding silently goes off the rails. At very long context, the argmax of the first answer token is sometimes a stop token. Setting min_new_tokens=1 forces at least one real token, which is enough to catch the failure.
  • The KV cache is paged out of GPU memory. Tools like vLLM page the KV cache to CPU when it overflows. At 128k this is fine for throughput but is a 50× slowdown per token, which makes the eval take long enough that intermediate fails (NaNs, OOMs) become disproportionately likely.
  • Single-trial heatmaps overfit to one needle. Always run k3k \geq 3 samples per cell with different needles. The published Kamradt heatmaps are k=1k = 1 and have ~10 percentage points of per-cell noise that nobody talks about.

Once these are fixed, what is left is the actual capability of the model. That capability is what the next chapter — on inference and serving — has to deliver to the user at runtime.

The mental model that unifies this section: long-context capability is not a single number. It is a 2-D surface (length × depth) across a panel of tasks (single needle, multi-hop, aggregation, QA). A “128k model” is a model with a particular shape on that surface — and the shape is what long-context training is trying to flatten.
Loading comments...