The Real Problem: A 128k Number on a Spec Sheet
Every modern frontier model release is announced with a context-length number, usually a power of two. 128k. 200k. 1M. The number sits on the spec sheet next to the parameter count and the MMLU score, and it is the single most misunderstood metric in large-language-model training. A model that accepts a 128 000-token prompt without raising an error is technically a 128k model. A model that can reliably use the information in that prompt is a completely different beast — and the gap between the two is wide enough to bury an entire product launch in.
The mechanics from sections 12.1–12.3 explain why the gap exists. Position embeddings have to extrapolate or interpolate. Attention has to dilute mass across a longer key-set. The KV cache balloons and forces architectural compromises. Continued pretraining on synthetic long-context data only works for some patterns. By the time the model reaches release, you have a stack of small, mostly-invisible regressions that all show up in the same place: questions whose answer depends on a needle buried deep in a haystack.
Intuition: A 128k Model That Forgets at Token 8k
Picture an open-book exam. The student is handed a 600-page textbook five minutes before the test starts and told the answer to one of the ten questions is on page 312. The other nine questions can be answered from general knowledge. If you score the test only on the average across all ten questions, a student who never even opens the book can still score 90%. The one question that depends on the book is buried in the average.
That is exactly what happens when you evaluate a long-context model on next-token loss. The vast majority of tokens in a long document are easy: they continue a sentence, they finish a code block, they complete a list. A model that has completely lost track of the start of the document still predicts most tokens just fine, because most tokens are predicted from a window of only a few hundred neighbours. The cross-entropy stays low. Perplexity drops. The headline number looks great.
The questions that actually require the model to remember the whole document are a tiny fraction of the token distribution. Cross-entropy averages over them. A long-context evaluation has to do the opposite: it has to construct prompts whose only possible answer requires reading the whole context, and then score the model on those prompts alone.
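To make the averaging argument concrete, here is a small arithmetic sketch. Every number in it is an assumption chosen for illustration (the token counts and the per-token losses are hypothetical, not measurements): it compares the document-level perplexity of a model that recalls the early context with one that has lost it.

```python
import math

def mean_cross_entropy(n_easy, easy_loss, n_hard, hard_loss):
    """Token-weighted average loss (in nats) over two groups of tokens."""
    total = n_easy * easy_loss + n_hard * hard_loss
    return total / (n_easy + n_hard)

# Hypothetical long document: 100,000 locally predictable tokens and 20
# tokens that can only be predicted by recalling the start of the document.
N_EASY, N_HARD = 100_000, 20
EASY_LOSS = 1.2          # assumed loss on ordinary continuation tokens
HARD_LOSS_RECALL = 0.1   # assumed loss on recall tokens when recall works
HARD_LOSS_FORGOT = 8.0   # assumed loss on recall tokens when it does not

ce_recall = mean_cross_entropy(N_EASY, EASY_LOSS, N_HARD, HARD_LOSS_RECALL)
ce_forgot = mean_cross_entropy(N_EASY, EASY_LOSS, N_HARD, HARD_LOSS_FORGOT)

ppl_recall = math.exp(ce_recall)
ppl_forgot = math.exp(ce_forgot)
print(f"perplexity with recall:    {ppl_recall:.4f}")
print(f"perplexity without recall: {ppl_forgot:.4f}")
print(f"relative change:           {ppl_forgot / ppl_recall - 1:.3%}")
```

Under these assumptions the two perplexities differ by well under one percent, even though one model fails every recall token and the other passes all of them.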
What “Long-Context Capability” Actually Means
Let the prompt be a sequence of tokens $x$ of length $L$, and let $y$ be the expected answer. A general capability measure for context length $L$ is the conditional accuracy $\mathrm{Acc}(L) = \mathbb{E}_{(x, y) \sim \mathcal{D}_L}\,[\mathbb{1}\{f_\theta(x) = y\}]$, where $f_\theta$ is the model and $\mathcal{D}_L$ is a distribution of (prompt, answer) pairs at length $L$. The whole game of long-context evaluation is choosing $\mathcal{D}_L$ so that:
- Every $y$ is uniquely determined by the prompt. No general-knowledge shortcuts.
- The information required to answer depends on a position $d$ inside $x$ that we control, so we can sweep the “depth” axis.
- The grader is deterministic and reproducible. Substring match, exact match, or a frozen LLM-as-judge with a fixed rubric — never a free-form human grader for a regression suite.
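The third requirement is the easiest to pin down in code. The sketch below shows one possible deterministic grader; the normalisation rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the function names are assumptions for illustration, and a real suite would fix its own rules once and keep them frozen.

```python
import re
import string

def normalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, target: str) -> bool:
    """True when the whole normalised prediction equals the normalised target."""
    return normalise(prediction) == normalise(target)

def substring_match(prediction: str, target: str) -> bool:
    """True when the normalised target appears anywhere in the prediction.

    Note: an empty target matches everything, so callers should reject it.
    """
    return normalise(target) in normalise(prediction)

print(substring_match("It is forty two, I think.", "Forty Two"))
print(exact_match("It is forty two, I think.", "Forty Two"))
```

Substring match tolerates a chatty model that wraps the answer in a sentence; exact match does not. Which one is appropriate depends on how the question is phrased, but either way the same input always produces the same grade.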
The Needle-in-a-Haystack (NIAH) construction is the simplest that satisfies all three. We embed a single sentence, “the needle”, at depth $d$ inside an otherwise unrelated document of length $L$, then ask a question whose only valid answer is the needle. Reporting $\mathrm{Acc}(L, d)$ as a 2-D heatmap is the canonical long-context plot.
Two failure modes shape that heatmap. The first is the extrapolation cliff: for any $L$ beyond the model's trained window $L_{\text{train}}$ and any depth, accuracy crashes to near-zero, because positions past $L_{\text{train}}$ generate RoPE phases the model has never seen. This is what raw Llama-2 looks like at 16k. The second is the lost-in-the-middle dip: even inside the trained window, accuracy is highest near $d = 0$ and $d = 1$ and dips in the middle. Liu et al. (2023) showed this U-shape holds for nearly every model trained with standard data, because attention mass naturally concentrates at the beginning (the system prompt) and the end (the question).
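One way to see why positions past the trained window are unfamiliar is to count how many RoPE dimension pairs never complete a full rotation during training. The sketch below assumes the standard RoPE parametrisation with base 10000 and a head dimension of 128; both values and the function names are illustrative assumptions, and this captures only one contributing factor to the cliff, not a full account of it.

```python
import math

def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation rate per dimension pair: theta_i = base^(-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def pairs_with_unseen_phases(head_dim: int, trained_len: int,
                             base: float = 10000.0) -> int:
    """Count dimension pairs whose angle never completes a full turn.

    For these slow pairs the largest angle seen in training is
    (trained_len - 1) * theta_i < 2*pi, so every position past the trained
    window produces an angle that never appeared during training.
    """
    max_pos = trained_len - 1
    return sum(
        1 for theta in rope_frequencies(head_dim, base)
        if max_pos * theta < 2 * math.pi
    )

for trained_len in (4_096, 32_768):
    n = pairs_with_unseen_phases(128, trained_len)
    print(f"trained window {trained_len}: {n} of 64 pairs never wrap around")
```

Training on a longer window shrinks the set of slow pairs, which is one reason context extension methods focus on rescaling the low-frequency dimensions.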
Manual Numerical Walkthrough: A 32-Token NIAH
Let us run NIAH by hand at a scale a human can actually verify. The haystack is six copies of the same boring sentence; the needle is one sentence injected at a chosen depth. We tokenize with whitespace splits to keep the arithmetic readable.
Filler sentence (4 tokens): the cat sat down
Needle (7 tokens): the magic number is forty two today
Question: What is the magic number?
Target length is 32 tokens. Six filler copies give 24 tokens and the needle adds 7, for a 31-token prompt, one short of the target. We sweep three depths: 0% (front), 50% (middle), 100% (back). The harness produces three prompts:
| depth | prompt (31 tokens, needle in bold) | needle position |
|---|---|---|
| 0% | **the magic number is forty two today** + the cat sat down × 6 | tokens 1–7 |
| 50% | the cat sat down × 3 + **the magic number is forty two today** + the cat sat down × 3 | tokens 13–19 |
| 100% | the cat sat down × 6 + **the magic number is forty two today** | tokens 25–31 |
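The token arithmetic is easy to get wrong by hand, so it is worth checking mechanically. The sketch below assumes six filler copies split around the needle in proportion to the depth, whitespace tokenization, and 1-indexed token positions; `build_prompt` is a hypothetical helper name.

```python
FILLER = "the cat sat down".split()                       # 4 tokens
NEEDLE = "the magic number is forty two today".split()    # 7 tokens
N_FILLER = 6                                              # filler copies in total

def build_prompt(depth: float) -> tuple[list[str], int, int]:
    """Place the needle after round(depth * N_FILLER) filler copies.

    Returns the token list plus the 1-indexed, inclusive start and end
    positions of the needle.
    """
    before = round(depth * N_FILLER)
    tokens = FILLER * before + NEEDLE + FILLER * (N_FILLER - before)
    start = before * len(FILLER) + 1
    return tokens, start, start + len(NEEDLE) - 1

for depth in (0.0, 0.5, 1.0):
    tokens, start, end = build_prompt(depth)
    print(f"depth {depth:.0%}: {len(tokens)} tokens, needle at {start}-{end}")
```

With these assumptions every prompt is 31 tokens long, and the needle occupies tokens 1–7, 13–19, and 25–31 at the three depths.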
Now imagine three different models answering all three prompts. Model A is a tiny RNN that only remembers the last 8 tokens; it sees only the question and the surrounding filler, so it answers all three with “I don't know.” Score: 0 / 3.
Model B is a transformer trained on sequences up to length 16. It can see the whole prompt syntactically, but its position embeddings only ever saw positions 0–15 during training. For the depth = 0% prompt, the needle sits inside the trained position range, so it retrieves and answers “forty two.” For the depth = 50% prompt the needle starts at token 13, still inside the trained range, but it ends at token 19 and the question is appended well past position 16. The RoPE phases from there on are extrapolated, attention is brittle, and retrieval succeeds only about half the time. For depth = 100%, the needle lives entirely in positions the model never saw during training: a complete miss. Score: ~1.5 / 3.
Model C is the same transformer with YaRN applied to RoPE so all positions up to 32 behave like they were trained. All three prompts are inside the post-extension trained range. Retrieval succeeds in every case. Score: 3 / 3.
Now repeat this experiment at a realistic length $L$ with 11 depths instead of 3, and you have one column of a real NIAH heatmap. Repeat for eight different lengths and you have the full heatmap. The principle is unchanged from this hand example; only the scale and the model differ.
Visualizing NIAH Failure Modes
The heatmap for this section covers five archetypal long-context profiles, each with a characteristic shape, including the extrapolation cliff, the lost-in-the-middle dip, the smooth PI decay, and the YaRN-preserved high-frequency floor. These are idealised curves, but every one of them maps onto a real published figure.
Two things are worth pausing on. First, the cliff profile is the one most release candidates produce when you naively push their max context past the trained window. It is also the profile a careless eval can miss entirely if it never tests past 8k. Second, the lost-in-the-middle profile is real and persistent — every long-context paper since Liu et al. 2023 has had to address it, and it is the main reason agentic frameworks aggressively re-rank and summarise their contexts even when the model technically accepts the full document.
The Perplexity Trap
The most common temptation when extending a model's context is to evaluate using perplexity on long documents. It is fast, deterministic, and produces a single scalar per model. It also lies. The chart below shows what happens if you let perplexity be your only signal: PPL stays in the healthy 3–4 range out to the full 128k window, while retrieval accuracy collapses to zero the moment the prompt exits the trained region.
The mechanism is simple. Cross-entropy on natural text is dominated by short-range structure: the local n-gram, the local syntax, the ongoing topic. A model that has completely failed to integrate information from 30 000 tokens ago still predicts the next sentence from the previous sentence. The errors that matter for retrieval — sharp, localised, on a single named entity — are washed out by the thousands of cheap next-token predictions around them.
This is not a new observation. It is one of the lessons every long-context team relearns the hard way. The first generation of long-context papers (early 2023) leaned heavily on PPL because no good alternative existed. NIAH (Kamradt, late 2023) and then RULER (Hsieh et al., 2024) gave the field something measurable, and within a year nobody serious was publishing PPL-only long-context numbers anymore.
Rule of thumb: if a long-context release announces a new max context without showing a NIAH heatmap or RULER table, treat the context number as marketing, not capability.
Beyond NIAH: RULER, LongBench, InfiniteBench
NIAH is the right first eval, but it is not the only eval, and it rewards a very specific capability — single-key retrieval — that not all long-context tasks need. Real applications ask the model to reason across multiple documents, aggregate frequencies, track variables across an argument, and answer questions whose evidence is scattered. The community has converged on three benchmarks that cover these dimensions.
| Benchmark | What it tests | Tasks | When to use |
|---|---|---|---|
| NIAH (Kamradt 2023) | Single-key retrieval at depth | 1 (the needle) | First sanity check after any context extension |
| RULER (Hsieh et al. 2024) | 13 sub-tasks: multi-needle, variable tracking, aggregation, QA | 13 across 4 categories | The standard regression suite for long-context releases |
| LongBench (Bai et al. 2023) | 21 real-world tasks across 6 categories, bilingual (English and Chinese) | 21 | Capability comparison across released models |
| InfiniteBench (Zhang et al. 2024) | Tasks with 100k+-token inputs: novel summarisation, code completion, math | 12 | Stress-test at the edge of the trained window |
The biggest practical upgrade NIAH → RULER buys you is the multi-task degradation profile. NIAH hides the fact that single-needle retrieval is the easiest long-context capability. Variable tracking, multi-key retrieval, and frequency aggregation all degrade much faster with length. The accompanying grid covers three model archetypes and shows how the cells change task by task.
Notice the pattern that holds across all three model classes: as length grows, the bottom rows (aggregation, multi-hop QA) lose accuracy before the top rows (single-needle retrieval). A model can score 95% on NIAH at 128k and 25% on RULER's aggregation tasks at the same length. Both numbers are real. Only one of them gets put on the press release.
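To show what a task harder than single-needle retrieval looks like, here is a sketch of a synthetic variable-tracking item. The prompt format is loosely modelled on the idea behind RULER's variable-tracking task but is not the RULER prompt; the filler sentence, the `VAR` syntax, and both function names are assumptions made for illustration.

```python
import random
import re

def make_variable_tracking(chain_len: int, n_filler: int, seed: int = 0):
    """Return (prompt, answer_names) for one synthetic variable-tracking item.

    A chain X1 -> X2 -> ... is scattered through filler text in order; the
    model must follow every link to list all variables holding the value.
    """
    if chain_len > n_filler + 1:
        raise ValueError("not enough filler slots for the chain")
    rng = random.Random(seed)
    value = rng.randint(10000, 99999)
    names = [f"X{i}" for i in range(1, chain_len + 1)]
    links = [f"VAR {names[0]} = {value}."]
    links += [f"VAR {names[i]} = VAR {names[i - 1]}." for i in range(1, chain_len)]
    # Distinct, sorted slots keep the chain in definition order.
    slots = sorted(rng.sample(range(n_filler + 1), chain_len))
    lines = ["The grass is green."] * n_filler
    # Insert from the last slot backwards so earlier indices stay valid.
    for slot, link in reversed(list(zip(slots, links))):
        lines.insert(slot, link)
    question = f"Question: list every variable that resolves to the value {value}."
    return " ".join(lines) + "\n" + question, names

def all_names_found(response: str, names: list[str]) -> bool:
    """Strict scorer: every variable name must appear as a whole word."""
    return all(re.search(rf"\b{re.escape(n)}\b", response) for n in names)

prompt, names = make_variable_tracking(chain_len=3, n_filler=10, seed=1)
print(prompt)
print(names)
```

Unlike a single needle, a miss on any one link breaks the whole answer, which is one intuition for why these tasks degrade faster with length.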
Plain Python: NIAH from Scratch
The harness is small — fewer than a hundred lines of Python with no ML dependencies. We build the prompt, we score the answer, we loop over a grid. Every long-context evaluation harness in production (Anthropic's, Google's, the open-source lm-evaluation-harness suite) has this same skeleton, with more polished prompts and a bigger task library bolted on.
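A minimal sketch of that skeleton follows. It assumes whitespace tokenization, a repeated filler sentence as the haystack, and a stand-in `fake_generate` that can only see the last 64 tokens of the prompt; all function names and the needle format are hypothetical choices, not taken from any production harness.

```python
import random
import re

FILLER = "the cat sat down".split()
QUESTION = "What is the secret code?"

def make_needle(rng: random.Random) -> tuple[str, str]:
    """Random six-digit code, so the answer cannot be memorised."""
    code = str(rng.randint(100000, 999999))
    return f"the secret code is {code}", code

def build_prompt(length: int, depth: float, needle: str) -> str:
    """Haystack of `length` whitespace tokens with the needle at `depth`."""
    needle_tokens = needle.split()
    n_fill = max(length - len(needle_tokens), 0)
    filler = (FILLER * (n_fill // len(FILLER) + 1))[:n_fill]
    cut = round(depth * n_fill)
    tokens = filler[:cut] + needle_tokens + filler[cut:]
    return " ".join(tokens) + "\n\n" + QUESTION

def fake_generate(prompt: str, window: int = 64) -> str:
    """Stand-in model that only 'remembers' the last `window` tokens."""
    visible = " ".join(prompt.split()[-window:])
    match = re.search(r"secret code is (\d+)", visible)
    return match.group(1) if match else "I don't know"

def run_grid(generate, lengths, depths, trials: int = 3, seed: int = 0):
    """Accuracy per (length, depth) cell, scored by substring match."""
    rng = random.Random(seed)
    grid = {}
    for length in lengths:
        for depth in depths:
            hits = 0
            for _ in range(trials):
                needle, answer = make_needle(rng)
                response = generate(build_prompt(length, depth, needle))
                hits += answer in response
            grid[(length, depth)] = hits / trials
    return grid

if __name__ == "__main__":
    lengths, depths = [32, 256], [0.0, 0.5, 1.0]
    grid = run_grid(fake_generate, lengths, depths)
    for length in lengths:
        row = "  ".join(f"{grid[(length, d)]:.2f}" for d in depths)
        print(f"L={length:>4}: {row}")
```

The stand-in model passes every cell at 32 tokens and only the deepest cell at 256 tokens, a toy version of a model whose usable memory is shorter than its accepted prompt length. Swapping in a real model only requires a different `generate` callable.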
PyTorch: Running NIAH on a Real Model
Swapping the fake generator for a real model touches three things: the tokenizer (use the model's own), the dtype and attention implementation (bfloat16 + FlashAttention 2 at this scale), and the decoding settings (greedy decoding, no sampling). The harness itself does not move.
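The sketch below shows one way those three changes might look with the Hugging Face `transformers` API, assuming the harness expects a `generate(prompt) -> str` callable (a hypothetical interface). It also assumes `torch`, `transformers`, `accelerate`, and the `flash-attn` package are installed and that the chosen checkpoint supports FlashAttention 2; chat-tuned models would additionally need their chat template applied, which is omitted here.

```python
def make_hf_generate(model_id: str, max_new_tokens: int = 32):
    """Build a generate(prompt) -> str callable backed by a causal LM.

    Imports are local so this file can be loaded on a machine without a GPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. The model's own tokenizer, so token counts match the model.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # 2. bfloat16 weights and FlashAttention 2 kernels.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    model.eval()

    def generate(prompt: str) -> str:
        # Never truncate silently: a truncated prompt may lose the needle.
        inputs = tokenizer(prompt, return_tensors="pt", truncation=False)
        inputs = inputs.to(model.device)
        with torch.no_grad():
            # 3. Greedy decoding: no sampling, at least one new token.
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                min_new_tokens=1,
                do_sample=False,
            )
        prompt_len = inputs["input_ids"].shape[1]
        return tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)

    return generate
```

Calling `make_hf_generate` with a checkpoint name returns a function that can be passed to the grid loop in place of the fake generator; nothing else in the harness has to know that a real model is behind it.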
From toy to production
The harness above runs in tens of GPU-minutes on an 8B model. When the same idea is run inside a frontier lab, three things change:
- Scale. The eval is run on every checkpoint of every continued-pretraining run, often with thousands of cells (multiple needles per cell, multiple needle positions, multiple haystack themes). Each run is a non-trivial GPU bill, so the eval cluster is a budgeted resource.
- Distribution shift. The haystack is no longer a single domain (Paul Graham essays). It is a sampled mixture of domains the model is expected to handle in production — legal, medical, codebases, technical reports — because long-context performance varies sharply with domain.
- Adversarial needles. The needle is no longer a single distinctive sentence. It is paraphrased, embedded in similar-looking distractor sentences, or split across multiple positions. This catches models that learned to pattern-match the NIAH format rather than actually retrieve.
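The adversarial-needle idea can be sketched in a few lines. The example below generates one target needle plus look-alike distractors that share its sentence template; the city list, the sentence template, and the function name are all made up for illustration.

```python
import random

# Hypothetical entity pool; a real suite would use a much larger one.
CITIES = ["Lisbon", "Osaka", "Quito", "Tallinn", "Accra", "Hobart"]
TEMPLATE = "The access code for the {city} office is {code}."

def make_needle_with_distractors(n_distractors: int, seed: int = 0):
    """Return (needle, distractors, question, answer).

    Distractors reuse the needle's template with different cities and codes,
    so matching the sentence shape alone is not enough to answer.
    """
    if n_distractors + 1 > len(CITIES):
        raise ValueError("not enough distinct cities for that many distractors")
    rng = random.Random(seed)
    cities = rng.sample(CITIES, n_distractors + 1)
    codes = rng.sample(range(100000, 1000000), n_distractors + 1)
    needle = TEMPLATE.format(city=cities[0], code=codes[0])
    distractors = [
        TEMPLATE.format(city=city, code=code)
        for city, code in zip(cities[1:], codes[1:])
    ]
    question = f"What is the access code for the {cities[0]} office?"
    return needle, distractors, question, str(codes[0])

needle, distractors, question, answer = make_needle_with_distractors(3, seed=1)
print(needle)
print(distractors)
print(question)
```

The needle and distractors would then be inserted at independent depths, so a model has to bind the code to the right city instead of returning the first six-digit number it finds.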
At Massive Scale: Evaluating 1M-Context Models
Once the trained context reaches 1M tokens, every previous bottleneck gets worse by an order of magnitude.
The compute cost of the eval itself
One NIAH cell at 1M tokens needs the full prefill of a 1M-token KV cache. With FlashAttention 2 the cost is roughly $O(L^2 d)$ per layer for attention plus the usual $O(L d^2)$ for the FFN, where $L$ is the prompt length and $d$ the hidden size. At $L = 10^6$ and $d = 7\,168$ the attention term is on the order of $10^{16}$ FLOPs per layer per forward, or $10^{18}$ for a 100-layer model. A full 88-cell sweep at 1M is therefore ~$10^{20}$ FLOPs. At the roughly 2 PFLOP/s fp8 peak of a single H100, that is on the order of hours on an 8-H100 node, which is tolerable. The harder constraint is memory: the KV cache alone takes ~10 GB at hidden size 7 168 and head dimension 128 (two cache types), and that has to fit on the device while the model weights are loaded.
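A back-of-the-envelope version of this estimate can be written down directly. The sketch assumes causal attention, 2 FLOPs per multiply-add, a plain two-matmul FFN with 4× expansion, a dense fp8 peak of about 1.979 PFLOP/s per H100, and it ignores the QKV and output projections; all of these are simplifying assumptions, and the function names are hypothetical.

```python
def prefill_attention_flops(seq_len: int, hidden: int, causal: bool = True) -> float:
    """FLOPs for QK^T plus attention-times-V in one layer."""
    full = 4 * seq_len**2 * hidden      # two matmuls, 2 FLOPs per multiply-add
    return full / 2 if causal else full

def prefill_ffn_flops(seq_len: int, hidden: int, expansion: int = 4) -> float:
    """FLOPs for an up-projection and a down-projection in one layer."""
    return 4 * seq_len * hidden * (expansion * hidden)

SEQ_LEN, HIDDEN, LAYERS, CELLS = 1_000_000, 7168, 100, 88
H100_FP8_PEAK = 1.979e15   # assumed dense fp8 peak per GPU, in FLOP/s
N_GPUS = 8

attn = prefill_attention_flops(SEQ_LEN, HIDDEN)
ffn = prefill_ffn_flops(SEQ_LEN, HIDDEN)
sweep_flops = CELLS * LAYERS * attn    # attention term only
hours_at_peak = sweep_flops / (N_GPUS * H100_FP8_PEAK) / 3600

print(f"attention per layer: {attn:.2e} FLOPs")
print(f"FFN per layer:       {ffn:.2e} FLOPs")
print(f"88-cell sweep:       {sweep_flops:.2e} FLOPs")
print(f"time at fp8 peak:    {hours_at_peak:.1f} h on {N_GPUS} GPUs")
```

Under these assumptions the attention term is more than ten times the FFN term, and the sweep takes a little over two hours at peak throughput. Real utilisation is well below peak, so the wall-clock figure would be higher.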
The KV cache becomes the eval bottleneck
For a Llama-3-class 70B model at 1M context, the KV cache is bigger than the model itself. A single eval prompt requires sharding the KV cache across multiple devices, exactly the same tensor-parallel choreography as serving — except for an eval workload that is mostly prefill, not decode. This is why long-context eval infrastructure at a frontier lab is co-designed with the inference stack, not built as a side project.
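The claim is easy to check with arithmetic. The sketch below uses the published Llama-3-70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and assumes a 2-byte cache dtype and 2-byte weights; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys and values across all layers for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-70B shape; bf16 cache and weights are assumptions.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
weights = 70_000_000_000 * 2   # roughly 70B parameters at 2 bytes each

print(f"KV cache at 1M tokens: {cache / 1e9:.0f} GB")
print(f"model weights:         {weights / 1e9:.0f} GB")
```

That is about 328 GB of cache against about 140 GB of weights for a single prompt, which is why one eval example already needs several 80 GB devices.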
The grading function has to keep up
At 1M context the model's answer is no longer a 64-token substring match. Tasks like “summarise this novel” require LLM-as-judge with a careful rubric. The judge model is often the same family as the model under test, which raises the question of grading bias — frontier labs typically use a held-out judge of comparable but different lineage (e.g. grading Claude with GPT-4 and vice versa) and report inter-judge agreement.
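Inter-judge agreement is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement two judges would reach by chance. The sketch below is a plain-Python version for two judges grading the same items; the pass/fail labels in the usage example are invented.

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two judges on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both judges pick the same label independently.
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:      # both judges always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented grades from two judges on eight summaries (1 = pass, 0 = fail).
judge_a = [1, 1, 0, 0, 1, 0, 1, 1]
judge_b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(judge_a, judge_b))
```

Here the judges agree on 7 of 8 items and chance agreement is 0.5, so kappa is 0.75. A value near zero would mean the judges agree no more often than chance, which is a signal to revisit the rubric before trusting either score.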
The training data has to teach the eval
Naively training on long documents does not produce a 1M-context model that scores well on RULER. The training mix needs synthetic long-context tasks — book-length question answering, paraphrased needles at random depths, multi-document aggregation — because those are the capabilities we are about to measure. This is the deepest connection between training and evaluation in long-context: the evaluation set effectively defines what “1M context” means, and the training set has to be engineered to match.
Engineering Reality: What Actually Fails
Two years of long-context releases have produced a small catalogue of recurring failure modes. The good ones come up in code reviews before the release ships; the bad ones get caught in the wild and named after the team that hit them first.
- Tokenizer mismatch. The eval harness uses a different tokenizer from the model. The horizontal axis of the heatmap is off by ±15%, the cliff is in the wrong place, and the team spends a week debugging a regression that does not exist. Always tokenize with the model's own vocabulary.
- Truncation silently kicks in. Hugging Face tokenizers do not truncate unless asked, but some eval wrappers and pipelines enable truncation at the model's configured max position. Pass `truncation=False` explicitly and log the token count when running NIAH, or the model will pass cells it should fail.
- The prompt template is the system prompt. Some chat-tuned models prepend a default system prompt unless you explicitly pass one. That extra ~500 tokens at the front shifts every depth percentage and is invisible unless you log token counts.
- The needle is too memorable. “The magic number is 42” appears in pretraining data. Use a needle the model could not have memorised: a randomly generated string of words, a date that does not exist, a fictional named entity.
- Greedy decoding silently goes off the rails. At very long context, the argmax of the first answer token is sometimes a stop token. Setting `min_new_tokens=1` forces at least one real token, which is enough to catch the failure.
- The KV cache is paged out of GPU memory. Tools like vLLM page the KV cache to CPU when it overflows. At 128k this is fine for throughput but is a 50× slowdown per token, which makes the eval take long enough that intermediate failures (NaNs, OOMs) become disproportionately likely.
- Single-trial heatmaps overfit to one needle. Always run several samples per cell with different needles. The published Kamradt heatmaps are single-trial ($n = 1$) and have ~10 percentage points of per-cell noise that nobody talks about.
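The last item can be quantified with the binomial standard error. The sketch below treats each trial in a cell as an independent pass/fail draw, which is an assumption (trials that share a haystack are correlated), and the 90% true accuracy is an illustrative value.

```python
import math

def cell_standard_error(accuracy: float, n_trials: int) -> float:
    """Standard error of a cell's measured accuracy over n independent trials."""
    if n_trials < 1:
        raise ValueError("need at least one trial")
    return math.sqrt(accuracy * (1 - accuracy) / n_trials)

TRUE_ACCURACY = 0.9   # assumed true per-cell accuracy
for n in (1, 5, 10, 30):
    se = cell_standard_error(TRUE_ACCURACY, n)
    print(f"n={n:>2}: standard error = {100 * se:.1f} percentage points")
```

The error shrinks only with the square root of the trial count, so halving the per-cell noise costs four times as many prompts. That trade-off is what turns the choice of samples per cell into a budgeting decision.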
Once these are fixed, what is left is the actual capability of the model. That capability is what the next chapter — on inference and serving — has to deliver to the user at runtime.
The mental model that unifies this section: long-context capability is not a single number. It is a 2-D surface (length × depth) across a panel of tasks (single needle, multi-hop, aggregation, QA). A “128k model” is a model with a particular shape on that surface — and the shape is what long-context training is trying to flatten.