The Real Problem: Trained Short, Asked Long
Modern frontier models advertise 128K-token context windows. DeepSeek-V3 ships at 128K. Llama-3.1 ships at 128K. Claude 3 reaches 200K, Gemini 1.5 reaches 1M. The open-weight models, whose training recipes are published, were not pretrained at that length: Llama-3 was pretrained at 8K and DeepSeek-V3 at 4K. The 128K window is the result of a short, careful post-training stage that extends the context. The question this chapter answers is why that stage is needed at all, and how it works.
Naively, you might think extending context is free. The transformer architecture does not hard-code any context length: attention is just a softmax over keys and queries, the positional encoding is a function of a position index that can in principle be any integer, and the KV cache is just two tensors that grow as you generate. So why not pretrain at 4K (cheap) and serve at 128K (also cheap)?
The answer is that three different things break when you do that, and one of them, the positional encoding, breaks so spectacularly that perplexity diverges past the training length. The other two, KV cache memory and quadratic attention compute, are bottlenecks the systems community has been chipping away at for years (GQA, MLA, FlashAttention, PagedAttention). This chapter is about the positional encoding and its rotations, because that is the only one of the three where the math itself stops working.
Intuition: A Clock Asked to Read the Year
Imagine you are training a model to read a clock. You show it positions of the hour hand, minute hand, and second hand over four hours of data. The second hand spins fast: in those four hours it completes 240 full revolutions, so the model has seen every possible second-hand position hundreds of times. The minute hand also wraps fully, completing 4 revolutions. The hour hand, by contrast, only gets a third of the way around the clock face.
Now you ask the model to read a clock that has been running for sixteen hours. The second and minute hands are fine: they have wrapped so many times that "wrapped a few more times" looks identical to training. But the hour hand has now swept through positions it never reached in training. Everything from four o'clock around to twelve lies on a part of the face the model never saw, because in training the hand only ever moved between twelve and four. The model confidently outputs nonsense, because the hour hand is the dimension that distinguishes long ranges and it has gone off-distribution.
This is exactly what happens to a transformer when you push past its pretraining length. RoPE assigns each pair of attention dimensions a different rotation speed. Fast dimensions (small pair index $i$) wrap many times even inside the training window; they are the "second hands." Slow dimensions (large $i$) barely move during training; they are the "hour hands." When you extend context, the second hands are unfazed, but the hour hands suddenly point to angles that, in the model's entire learned experience, do not exist.
The geometry, in one sentence. Long-range information lives in the slow RoPE dimensions, which is precisely where context extension breaks the encoding. That is not a coincidence — it is the same geometric fact viewed from two angles.
The Math: Why RoPE Breaks Past L_train
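RoPE encodes position by rotating each pair of channels of the query and key vectors at position $m$ by an angle $m\,\theta_i$, where the frequencies $\theta_i$ form a geometric sequence:

$$\theta_i = b^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1.$$

Here $b$ is the RoPE base (10,000 in the original Llama, 500,000 in Llama-3), $d$ is the head dimension, and $i$ indexes channel pairs. The frequencies span many orders of magnitude: at $b = 10{,}000$, the fastest dimension has $\theta_0 = 1$ rad/token, while the slowest has $\theta_{d/2-1} \approx 10^{-4}$ rad/token. The corresponding wavelengths in tokens are $\lambda_i = 2\pi/\theta_i$: roughly $6$ tokens at the fastest end, and tens of thousands of tokens at the slowest end.

The attention score between a query at position $m$ and a key at position $n$ depends on position only through the relative rotation $(m - n)\,\theta_i$. During pretraining the model only ever observes relative offsets in $[0, L_\text{train} - 1]$. On dimension $i$, this gives a maximum observed angle of

$$\alpha_i^\text{train} = (L_\text{train} - 1)\,\theta_i \approx L_\text{train}\,\theta_i.$$

Two regimes follow immediately. If $\alpha_i^\text{train} \ge 2\pi$, the rotation wraps around completely within the training range: every angle in $[0, 2\pi)$ has been seen and the dimension is saturated. If $\alpha_i^\text{train} < 2\pi$, only an arc of the unit circle has been observed. Solving $L_\text{train}\,\theta_i = 2\pi$ for $i$ gives the saturation boundary:

$$i^* = \frac{d}{2} \cdot \frac{\ln\left(L_\text{train} / 2\pi\right)}{\ln b}.$$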
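For $d = 64$ (32 channel pairs), $b = 10{,}000$, and $L_\text{train} = 4096$, the saturation boundary works out to $i^* \approx 22.5$. Dimensions $i \le 22$ are saturated; dimensions $i \ge 23$ have only seen a partial arc. There are 9 partially-trained dimensions out of 32. That is the population YaRN has to repair.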
Now extend the context to $L_\text{target} > L_\text{train}$. Saturated dimensions remain fine, since they simply wrap a few more times. Partially-trained dimensions see the maximum angle grow from $L_\text{train}\,\theta_i$ to $L_\text{target}\,\theta_i$, painting arcs of the unit circle the model has never observed. The attention computation reads rotation angles it was never trained to handle, and the resulting query-key dot products are arbitrary.
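The saturation boundary described above is easy to check numerically. The sketch below compares the closed-form boundary with a brute-force scan over the channel pairs. The configuration ($d = 64$, base 10,000, 4K training length) and the helper names are illustrative assumptions, not values tied to a specific model.

```python
import math

def rope_frequencies(d, base=10_000.0):
    """Per-pair rotation speeds theta_i = base^(-2i/d), for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def saturation_boundary(d, base, l_train):
    """Closed-form index i* at which theta_i * l_train equals one full turn."""
    return (d / 2) * math.log(l_train / (2 * math.pi)) / math.log(base)

def partial_dims(d, base, l_train):
    """Pairs whose largest training angle stays below 2*pi (partial arc only)."""
    thetas = rope_frequencies(d, base)
    return [i for i, theta in enumerate(thetas) if theta * l_train < 2 * math.pi]

# Illustrative configuration (an assumption for this sketch).
d, base, l_train = 64, 10_000.0, 4096
i_star = saturation_boundary(d, base, l_train)
partial = partial_dims(d, base, l_train)

print(f"closed-form boundary i* = {i_star:.2f}")
print(f"partially trained pairs: {partial[0]}..{partial[-1]} "
      f"({len(partial)} of {d // 2})")
```

The brute-force scan and the closed form should agree: the first partially-trained pair is the first integer above $i^*$.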
Two Costs That Get Worse Alongside RoPE
The rotation problem is the one the next two sections will repair. Two more problems appear at long context and are worth naming, because they shape which extension techniques are even worth attempting.
1. KV cache memory grows linearly in L
Every token in the context has to keep its keys and values around so future tokens can attend to them. For a model with $n_\text{layers}$ layers and $n_\text{kv}$ KV heads of dimension $d_\text{head}$, the KV cache holds

$$2 \cdot n_\text{layers} \cdot n_\text{kv} \cdot d_\text{head} \cdot L \cdot (\text{bytes per value})$$

bytes per sequence, where the leading 2 counts keys and values.
For Llama-2-7B in BF16 at 4K, that is about 2 GB per sequence. At 128K it is about 64 GB per sequence, several times more than the model weights themselves. This is why grouped-query attention (GQA), multi-query attention (MQA), and DeepSeek's multi-head latent attention (MLA) exist: they collapse $n_\text{kv}$ from 32 down to 8 (GQA), 1 (MQA), or replace it with a low-rank latent (MLA), cutting the cache by anywhere from 4× to more than an order of magnitude.
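The arithmetic behind those figures can be sketched in a few lines. The function name `kv_cache_bytes` is made up for this example; the Llama-2-7B shape (32 layers, 32 heads of dimension 128) is the published one, while the GQA and MQA rows are hypothetical variants of that shape, included only to show the scaling.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Keys plus values for one sequence: 2 * layers * kv_heads * d_head * L * bytes."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

GIB = 1024 ** 3

# Llama-2-7B shape with full multi-head attention, BF16 (2 bytes per value).
mha_4k = kv_cache_bytes(32, 32, 128, 4 * 1024)
mha_128k = kv_cache_bytes(32, 32, 128, 128 * 1024)

# Hypothetical GQA / MQA variants of the same shape (not real checkpoints).
gqa_128k = kv_cache_bytes(32, 8, 128, 128 * 1024)
mqa_128k = kv_cache_bytes(32, 1, 128, 128 * 1024)

for name, size in [("MHA @ 4K", mha_4k), ("MHA @ 128K", mha_128k),
                   ("GQA-8 @ 128K", gqa_128k), ("MQA @ 128K", mqa_128k)]:
    print(f"{name:>13}: {size / GIB:.1f} GiB per sequence")
```

Because the formula is linear in both $L$ and $n_\text{kv}$, shrinking the KV head count by a factor buys exactly that factor of extra context at fixed memory.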
2. Attention compute grows quadratically in L
The $QK^\top$ matrix and its softmax are $O(L^2)$. At prefill, computing attention over the whole context costs roughly $4 L^2 d_\text{model}$ FLOPs per layer, so doubling L quadruples the bill. At 128K this is ~1000× more attention compute than at 4K ($32^2 = 1024$), even before any sequence-parallel slicing kicks in. FlashAttention cuts memory traffic and removes the $O(L^2)$ memory blowup, but the FLOP curve is still quadratic.
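A back-of-the-envelope version of this scaling is sketched below. It assumes a rough count of $2L^2 d$ FLOPs for the query-key product plus $2L^2 d$ for the weighted sum over values, per layer, and ignores causal masking (which would roughly halve it), the softmax, and the projection matrices. The function name and the 7B-class shape are illustrative assumptions.

```python
def prefill_attention_flops(seq_len, d_model, n_layers):
    """Rough attention-only FLOPs for one prefill pass.

    Per layer: 2*L^2*d for Q @ K^T plus 2*L^2*d for scores @ V.
    Causal masking, softmax, and linear projections are not counted.
    """
    return 4 * seq_len ** 2 * d_model * n_layers

# Illustrative 7B-class shape: d_model = 4096, 32 layers.
flops_4k = prefill_attention_flops(4 * 1024, 4096, 32)
flops_128k = prefill_attention_flops(128 * 1024, 4096, 32)

print(f"4K prefill:   {flops_4k / 1e12:.1f} TFLOP")
print(f"128K prefill: {flops_128k / 1e15:.1f} PFLOP")
print(f"ratio:        {flops_128k // flops_4k}x")
```

The ratio depends only on the square of the length ratio, not on the model shape, which is why the same 1000× figure applies to any model extended from 4K to 128K.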
The reason these two costs matter for this chapter is that they limit how ambitious the extension target can be. You cannot just "extend to 10M tokens with YaRN": long before the rotation math fails, the KV cache exceeds GPU memory and the prefill takes minutes. Realistic targets are 128K (DeepSeek, Llama-3.1), 200K (Claude 3), 1M (Gemini 1.5, with extensive sparsity), and the techniques in this chapter all aim at the 4×–32× regime.
Manual Numerical Walkthrough
Set $d = 8$ (so we only have 4 dimension pairs to track), $b = 10{,}000$, $L_\text{train} = 1024$, $L_\text{target} = 4096$. We will compute, by hand, which dimensions are saturated, which are on-distribution at the target length, and which have gone OOD.
Step 1: frequencies. With these values, $\theta_i = b^{-2i/d} = 10^{-i}$ and $\lambda_i = 2\pi/\theta_i$:
| i | θ_i (rad/token) | wavelength λ_i (tokens) |
|---|---|---|
| 0 | 1.0 | 6.28 |
| 1 | 0.1 | 62.8 |
| 2 | 0.01 | 628.3 |
| 3 | 0.001 | 6,283.2 |
Step 2: max angle at L_train. $\alpha_i^\text{train} = (L_\text{train} - 1)\,\theta_i = 1023\,\theta_i$:
| i | α_i^train (rad) | as multiples of 2π | regime at L_train |
|---|---|---|---|
| 0 | 1023 | 162.8·2π | fully saturated, over 160 wraps |
| 1 | 102.3 | 16.28·2π | saturated, 16 wraps |
| 2 | 10.23 | 1.63·2π | wraps once, then partial arc |
| 3 | 1.023 | 0.16·2π | only 16% of one rotation seen |
Step 3: max angle at L_target. $\alpha_i^\text{target} = (L_\text{target} - 1)\,\theta_i = 4095\,\theta_i$:
| i | α_i^target (rad) | as multiples of 2π | new arcs lit up? |
|---|---|---|---|
| 0 | 4095 | 651.7·2π | no — already saturated |
| 1 | 409.5 | 65.2·2π | no — already saturated |
| 2 | 40.95 | 6.52·2π | no — wraps several times, still saturated |
| 3 | 4.095 | 0.65·2π | YES — went from 0.16·2π to 0.65·2π |
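Step 4: read off the diagnosis. Only dimension $i = 3$ goes OOD. During pretraining it covered the arc $[0, 1.023]$ rad on the unit circle; at the target length it now sweeps out to $4.095$ rad. The arc from $1.023$ to $4.095$ rad is the "new" angular region the model is being forced to interpret without training data.
Step 5: sanity check against the boundary formula. The boundary formula predicts saturation at $i^* = \frac{d}{2} \cdot \ln(L_\text{train}/2\pi) / \ln b = 4 \cdot \ln(163) / \ln(10^4) \approx 2.2$. So dimensions $i \le 2$ are saturated and dimensions $i \ge 3$ are vulnerable. Dimension 2 sits just inside the boundary: it completes only 1.63 turns during training, which is still enough to cover the full circle, so its extra wraps at the target length stay in-distribution. Dimension 3 was the actual OOD case. The formula and the table agree.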
The lesson. Out of 4 dimension pairs, one broke. In a real Llama with $d = 128$, the same calculation produces roughly 16–20 OOD dimensions out of 64 at an 8× extension. Not many in absolute terms, but they are precisely the dimensions responsible for long-range attention, and breaking them is enough to break the model.
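The hand calculation above can be reproduced with a short script, which is a convenient way to re-run the walkthrough with other values. This is a sketch using the walkthrough's own configuration; the variable names and the dictionary layout are choices made for this example.

```python
import math

# Walkthrough configuration: 4 channel pairs, 4x extension.
d, base, l_train, l_target = 8, 10_000.0, 1024, 4096
two_pi = 2 * math.pi

rows = []
for i in range(d // 2):
    theta = base ** (-2 * i / d)          # rotation speed in rad/token
    a_train = (l_train - 1) * theta       # largest angle seen during training
    a_target = (l_target - 1) * theta     # largest angle at the target length
    # A pair goes OOD when training covered less than one full turn
    # and the target length pushes the angle further along the circle.
    ood = a_train < two_pi and a_target > a_train
    rows.append({"i": i, "theta": theta, "wavelength": two_pi / theta,
                 "train": a_train, "target": a_target, "ood": ood})
    print(f"i={i}  theta={theta:g}  wavelength={two_pi / theta:.1f}  "
          f"train={a_train / two_pi:.2f} turns  "
          f"target={a_target / two_pi:.2f} turns  "
          f"{'OOD' if ood else 'ok'}")

ood_dims = [r["i"] for r in rows if r["ood"]]
print(f"OOD pairs: {ood_dims}")
```

Changing `l_target` or `base` at the top is enough to explore how the set of vulnerable pairs shifts.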
Visualizing Where the Frequencies Go OOD
The interactive chart below makes the geometry concrete. Each row is one RoPE dimension. The dark portion of each bar shows the cumulative rotation angle the model saw during pretraining; the light portion shows the additional angle it is being asked to cover at the target length. Bars that extend past the dashed line are saturated dimensions and are safe. Bars that stop before the dashed line in the dark phase but cross it in the light phase are the OOD dimensions — every additional pixel past the boundary is angular territory the model has never seen.
Three things to try with the controls:
- Drag L_target from 4K to 128K with base = 10,000 and L_train = 4K. The number of red "OOD" dimensions grows from 0 to roughly a third of all dimensions. The 32× extension that DeepSeek-V3 performs sits at the far end of this slider.
- Switch the base from 10,000 to 500,000. Llama-3 uses base 500K specifically to push more dimensions into the "slow" regime where one wavelength spans the whole context. Watch the saturation boundary slide toward the lower-index dimensions: fewer dimensions wrap during training, and the slow dimensions sweep an even smaller arc of the unit circle at 8K. This is one of the few free architectural levers for long context.
- Set L_train = L_target. The red bars disappear: no dimension is OOD, because we are not extending. The whole problem evaporates if you can afford to pretrain at the full target length — which, at trillions of training tokens, you cannot.
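The effect of the base can also be checked without the chart. The sketch below counts, for two bases, how many RoPE pairs complete at least one full turn within the training length, and reports the wavelength of the slowest pair. The head dimension of 128, the 8K training length, and the helper name `saturated_count` are illustrative assumptions.

```python
import math

def saturated_count(d, base, l_train):
    """Number of RoPE pairs that complete at least one full turn in l_train tokens."""
    return sum(1 for i in range(d // 2)
               if base ** (-2 * i / d) * l_train >= 2 * math.pi)

d, l_train = 128, 8192   # illustrative head dimension and training length
counts = {}
for base in (10_000.0, 500_000.0):
    counts[base] = saturated_count(d, base, l_train)
    # Slowest pair: theta = base^(-(d-2)/d), so wavelength = 2*pi * base^((d-2)/d).
    slowest_wavelength = 2 * math.pi * base ** ((d - 2) / d)
    print(f"base={base:>9,.0f}: {counts[base]} of {d // 2} pairs saturated, "
          f"slowest wavelength ~ {slowest_wavelength:,.0f} tokens")
```

A larger base lowers every frequency except the fastest, so fewer pairs wrap during training and the slowest wavelength stretches far beyond the context length.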
Then the cost picture. The chart below shows, for several real models, the KV cache memory and prefill attention FLOPs at context lengths from 1K to 128K. Notice that even at fixed batch = 1, the FLOP bars grow 4× every time L doubles. The blue KV-cache bars grow 2× every time L doubles — manageable for a single sequence, lethal at batch > 8.
These two charts together explain the design space. The rotation problem (chart 1) is fixed by changing how positions map to angles — NTK, PI, YaRN. The memory and compute problem (chart 2) is fixed by changing how attention is computed — GQA, MLA, FlashAttention, PagedAttention. They are orthogonal techniques and every production long-context system uses both stacks together.
Plain Python: Measuring the Distribution Shift
Before invoking PyTorch and a real model, let's reproduce the OOD-dimension count with plain NumPy. The script below computes, for every dimension, what arcs of the unit circle were lit up during pretraining and what arcs are lit up at the target length, then flags every dimension where the target reaches buckets pretraining never touched.
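A minimal sketch of such a script follows. It sticks to the standard library so it runs anywhere (the inner loop vectorizes directly in NumPy), and it uses the toy configuration from the manual walkthrough. The 64-bucket discretization of the unit circle and the helper names `lit_buckets` and `ood_report` are assumptions made for illustration.

```python
import math

def lit_buckets(theta, length, n_buckets=64):
    """Angular buckets touched by relative offsets 0 .. length-1 on one RoPE pair."""
    two_pi = 2 * math.pi
    return {int(((k * theta) % two_pi) / two_pi * n_buckets) % n_buckets
            for k in range(length)}

def ood_report(d, base, l_train, l_target, n_buckets=64):
    """Per-pair coverage at both lengths, flagging buckets never seen in training."""
    report = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        train = lit_buckets(theta, l_train, n_buckets)
        target = lit_buckets(theta, l_target, n_buckets)
        report.append({
            "dim": i,
            "train_cov": len(train) / n_buckets,    # fraction of circle seen in training
            "target_cov": len(target) / n_buckets,  # fraction reached at target length
            "ood": bool(target - train),            # any bucket new at target length?
        })
    return report

# Toy configuration from the manual walkthrough.
report = ood_report(d=8, base=10_000.0, l_train=1024, l_target=4096)
for row in report:
    print(f"dim {row['dim']}: train {row['train_cov']:.0%}, "
          f"target {row['target_cov']:.0%}, {'OOD' if row['ood'] else 'ok'}")

ood_dims = [r["dim"] for r in report if r["ood"]]
print(f"{len(ood_dims)} of {len(report)} pairs go OOD: {ood_dims}")
```

One caveat on the design: bucketing is a coarse proxy. With very fine buckets, a saturated pair whose step size is close to a simple fraction of a full turn can leave a few buckets unlit in training and be flagged spuriously, so the bucket count is worth keeping modest.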
The script gives you a per-dimension breakdown that matches the visual chart exactly. The OOD-dimension count is also a useful pre-flight check before running a context-extension recipe: if zero dimensions go OOD at your target length, your model can already extrapolate and no recipe is needed. If the OOD count is more than ~25% of dimensions, you almost certainly need YaRN-class scaling rather than naive position interpolation, because PI compresses the saturated dimensions you do not want to disturb.
PyTorch: A Tiny Length-Extrapolation Experiment
The cleanest way to see the failure is to take a real 4K-pretrained model and score the same long document at 4K, 8K, 16K, and 32K. Same model, same text, only the positions change. The result is the textbook hockey stick: perplexity is flat up to $L_\text{train}$ and then explodes.
What this experiment proves. The model still has all the information it needs — same weights, same vocabulary, same text. The only thing that changed is the position index passed into RoPE. Perplexity going from 6 to 1850 is therefore entirely a positional-encoding effect. Whatever fixes this has to fix the rotation math — nothing else has changed.
What Changes at 671B Parameters and 128K Context
Three things scale up uncomfortably when you take this story to a frontier model.
| Quantity | Llama-2-7B @ 4K | DeepSeek-V3 @ 128K | What blows up |
|---|---|---|---|
| Params | 7B | 671B (37B active per token) | model weights + optimizer states |
| L_target / L_train | 1× | 32× | OOD RoPE dimensions |
| KV cache (bf16, 1 seq) | 2 GB | ~17 GB (MLA latent) | without MLA it would be ~400 GB |
| Prefill attention FLOPs | ~10 TFLOP | ~13 PFLOP | quadratic in L — 1000× more than at 4K |
| Extension training tokens | n/a | ~25B tokens (long-context stage) | extension is itself a sizable training run |
The bullet points behind these numbers:
1. The extension stage is its own training run
DeepSeek-V3 was pretrained on 14.8T tokens at 4K context. Extending it to 128K cost an additional ~25B tokens of long-context training in two stages (4K → 32K, then 32K → 128K). That is not negligible: 25B tokens at 671B parameters is still tens of thousands of GPU hours. The whole purpose of YaRN is to make this 25B-token stage sufficient, because the alternative is pretraining at 128K from scratch, which would cost on the order of $32^2 \approx 1000\times$ more attention compute than the actual run did.
2. MLA changes the KV-cache shape, not the RoPE story
DeepSeek-V3's multi-head latent attention compresses the KV cache into a low-rank latent of dimension 512, which is why a 128K context fits in 17 GB instead of hundreds of gigabytes. But MLA does not change the rotation math — it still uses RoPE for position encoding. In fact V3 uses a decoupled RoPE where the latent path is purely content-based and a small set of 64-dim heads carry the positional information. That decoupling exists precisely so that the extension stage only has to repair the positional channel, not the whole latent representation. We cover decoupled RoPE in detail in §4.4.
3. Quadratic attention is unavoidable in prefill
Generation is cheap by comparison: each new token attends to the cached KV of all previous tokens at cost $O(L)$. Prefill is what costs $O(L^2)$: you have to fill the cache for the whole prompt. At 128K, a single prefill takes seconds even on H100s. This is why long-context inference systems aggressively use sequence parallelism and FlashAttention-2. Neither is a property of the model, but both are engineering necessities that the model designer has to leave room for.
Engineering Reality: Why You Can't Just Crank L
Walk through what would happen if you naively served a 4K-pretrained model at 32K and ignored everything in this chapter. You would observe, in order:
- The model produces fluent, on-topic text up to the 4K boundary.
- Around token 4000–5000, the text starts repeating, mixing topics, and producing tokens that are statistically wrong for the context (random rare tokens, partial code snippets, fragments of foreign languages). This is the OOD RoPE rotations producing nonsense attention scores: the model is attending to something, but the positional selectivity it learned no longer applies.
- By token 8000, perplexity is 100× the 4K baseline. Generation is functionally broken.
- Even if you only do retrieval-style queries ("what was the third paragraph about?"), the attention map between query and the relevant key has the wrong rotational alignment, so the model fails to attend to the right span. You see this as "needle in haystack" benchmark scores collapsing past $L_\text{train}$.
The fix is the three sections that follow. §12.2 shows how to repair the rotation distribution with NTK-aware scaling — the simplest method that already gets most of the way there. §12.3 shows the production-grade method, YaRN, which is what DeepSeek-V3 actually used to reach 128K. §12.4 covers how you measure whether the extension worked, because perplexity alone is a famously unreliable proxy for long-context capability.