Chapter 12

Why Context Extension Is Hard

Long-Context Extension

The Real Problem: Trained Short, Asked Long

Modern frontier models advertise context windows of 128K tokens or more. DeepSeek-V3 and Llama-3.1 ship at 128K, Claude 3 reaches 200K, and Gemini 1.5 reaches 1M. The open-weight models in that list were not pretrained at those lengths: Llama-3 was pretrained at 8K and DeepSeek-V3 at 4K. Their 128K window is the result of a short, careful post-training stage that extends the context. This chapter explains why that stage is needed at all, and how it works.

Naively, you might think extending context is free. The transformer architecture does not hard-code any context length: attention is just a softmax over keys and queries, the positional encoding is a function of a position index $m$ that can in principle be any integer, and the KV cache is just two tensors that grow as you generate. So why not pretrain at 4K (cheap) and serve at 128K (also cheap)?

The answer is that three different things break when you do that, and one of them, the positional encoding, breaks so badly that perplexity diverges past the training length. The other two, KV cache memory and quadratic attention compute, are bottlenecks the systems community has been chipping away at for years (GQA, MLA, FlashAttention, PagedAttention). This chapter is about the positional (rotational) failure, because it is the only one where the math itself stops working.

The thesis of this chapter. Extending context past $L_{\text{train}}$ fails not because attention runs out of memory (it does, but we can fix that) and not because compute is too slow (it is, but we can fix that too). It fails because RoPE rotates dimensions past angles the model has never seen. NTK-aware scaling, position interpolation, and YaRN are all answers to that one geometric question.

Intuition: A Clock Asked to Read the Year

Imagine you are training a model to read a clock. You show it positions of the hour hand, minute hand, and second hand over four hours of data. The second hand spins fast: in those four hours it completes 240 full revolutions, so the model has seen every possible second-hand position hundreds of times. The minute hand also wraps fully, with 4 revolutions. The hour hand only travels a third of the way around the clock face.

Now you ask the model to read a clock that has been running for sixteen hours. The second and minute hands are fine: they have wrapped so many times that "wrapped a few more times" looks identical to training. But the hour hand has now passed through positions it never reached in training. From hour four to hour twelve it sweeps the two thirds of the face between four o'clock and twelve o'clock, on a face the model only ever saw between twelve and four. The model confidently outputs nonsense, because the hour hand is the dimension that distinguishes long ranges and it has gone off-distribution.

This is exactly what happens to a transformer when you push past its pretraining length. RoPE assigns each pair of attention dimensions a different rotation speed. Fast dimensions (small $i$) wrap many times even inside the training window: they are the "second hands." Slow dimensions (large $i$) barely move during training: they are the "hour hands." When you extend context, the second hands are unfazed, but the hour hands suddenly point to angles that, in the model's entire learned experience, do not exist.

The geometry, in one sentence. Long-range information lives in the slow RoPE dimensions, which is precisely where context extension breaks the encoding. That is not a coincidence — it is the same geometric fact viewed from two angles.

The Math: Why RoPE Breaks Past L_train

RoPE encodes position $m$ by rotating each pair of channels $(2i, 2i+1)$ of the query and key vectors by an angle $m\theta_i$, where the frequencies form a geometric sequence:

$$\theta_i = b^{-2i/d}, \quad i = 0, 1, \ldots, d/2 - 1.$$

Here $b$ is the RoPE base (10,000 in the original Llama, 500,000 in Llama-3), $d$ is the head dimension, and $i$ indexes channel pairs. The frequencies span several orders of magnitude: at $d = 64$, $b = 10{,}000$, the fastest dimension has $\theta_0 = 1$ rad/token, while the slowest has $\theta_{31} = 10^{-3.875} \approx 1.3 \times 10^{-4}$ rad/token. The corresponding wavelengths in tokens are $\lambda_i = 2\pi / \theta_i$: roughly $2\pi \approx 6.3$ tokens at the fastest end, and $\sim 47{,}000$ tokens at the slowest end.
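A minimal sketch of the frequency schedule, assuming only the formula above. The helper names `rope_frequencies` and `wavelengths` are illustrative and not taken from any library; the script prints the fastest, a middle, and the slowest channel pair for $d = 64$, $b = 10{,}000$.

```python
import math

def rope_frequencies(d: int, base: float) -> list[float]:
    """theta_i = base ** (-2 i / d) for i = 0 .. d/2 - 1, in radians per token."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def wavelengths(thetas: list[float]) -> list[float]:
    """Tokens needed for one full 2*pi rotation on each channel pair."""
    return [2 * math.pi / t for t in thetas]

thetas = rope_frequencies(d=64, base=10_000.0)
lams = wavelengths(thetas)

for i in (0, 16, 31):
    print(f"i={i:2d}  theta={thetas[i]:.3e} rad/token  wavelength={lams[i]:,.1f} tokens")
```

Because the exponent stops at $-2(d/2 - 1)/d$ rather than $-1$, the slowest pair rotates slightly faster than $1/b$ per token.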

The attention score between a query at position $m$ and a key at position $n$ depends only on the relative rotation $(m - n)\theta_i$. During pretraining the model only ever observes relative offsets in $[-(L_{\text{train}} - 1),\, L_{\text{train}} - 1]$. On dimension $i$, this gives a maximum observed angle of

$$\alpha_i^{\text{train}} = (L_{\text{train}} - 1)\,\theta_i.$$
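The relative-offset property can be checked numerically on a single channel pair. This is a sketch under simple assumptions: the query and key values are arbitrary illustrative numbers, and `rotate_pair` and `rope_score` are hypothetical helper names, not a library API.

```python
import math

def rotate_pair(x: tuple[float, float], angle: float) -> tuple[float, float]:
    """Rotate a 2-D channel pair counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def rope_score(q, k, m: int, n: int, theta: float) -> float:
    """Dot product of q rotated to position m and k rotated to position n."""
    qm = rotate_pair(q, m * theta)
    kn = rotate_pair(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k, theta = (0.3, -1.2), (0.7, 0.5), 0.01      # arbitrary illustrative values

a = rope_score(q, k, m=100, n=40, theta=theta)      # offset 60
b = rope_score(q, k, m=3000, n=2940, theta=theta)   # same offset, shifted by 2900
c = rope_score(q, k, m=100, n=0, theta=theta)       # offset 100

print(f"offset 60: {a:.6f} and {b:.6f}; offset 100: {c:.6f}")
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it. That is why only the range of offsets seen in training matters.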

Two regimes follow immediately. If $\alpha_i^{\text{train}} \ge 2\pi$, the rotation wraps around completely within the training range: every angle in $[0, 2\pi)$ has been seen and the dimension is saturated. If $\alpha_i^{\text{train}} < 2\pi$, only an arc of the unit circle has been observed. Solving $(L_{\text{train}} - 1)\,\theta_i = 2\pi$ for $i$ gives the saturation boundary:

$$i^* = \frac{d}{2} \log_b\!\left(\frac{L_{\text{train}} - 1}{2\pi}\right) \approx \frac{d}{2} \log_b\!\left(\frac{L_{\text{train}}}{2\pi}\right).$$

For $L_{\text{train}} = 4096$, $b = 10{,}000$, $d = 64$, this works out to $i^* \approx 22.5$. Dimensions $i \le 22$ are saturated; dimensions $i \ge 23$ have only seen a partial arc. That leaves 9 partially-trained dimensions out of 32. That is the population YaRN has to repair.
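The boundary can be evaluated directly. The sketch below assumes the definition of saturation used above (largest training angle at least $2\pi$); the function names are illustrative. For $d = 64$, $b = 10{,}000$, $L_{\text{train}} = 4096$ it gives $i^* \approx 22.5$ and nine partially trained pairs, $i = 23, \ldots, 31$.

```python
import math

def saturation_boundary(d: int, base: float, l_train: int) -> float:
    """Real-valued index i* at which (l_train - 1) * theta_i equals 2*pi."""
    return (d / 2) * math.log((l_train - 1) / (2 * math.pi), base)

def partially_trained_dims(d: int, base: float, l_train: int) -> list[int]:
    """Channel pairs whose largest training angle stays below one full turn."""
    return [
        i for i in range(d // 2)
        if (l_train - 1) * base ** (-2 * i / d) < 2 * math.pi
    ]

i_star = saturation_boundary(d=64, base=10_000.0, l_train=4096)
partial = partially_trained_dims(d=64, base=10_000.0, l_train=4096)

print(f"i* = {i_star:.2f}")
print(f"{len(partial)} partially trained pairs: {partial}")
```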

Now extend the context to $L_{\text{target}} \gg L_{\text{train}}$. Saturated dimensions remain fine: they wrap a few more times. Partially-trained dimensions see the maximum angle grow from $(L_{\text{train}} - 1)\theta_i$ to $(L_{\text{target}} - 1)\theta_i$, painting arcs of the unit circle the model has never observed. The attention computation reads from positions of the rotation it was never trained to handle, and the resulting query-key dot products are arbitrary.
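To see how much new arc each slow dimension is asked to cover, one can compare the fraction of the circle swept at the two lengths. This is a sketch for an assumed 8× extension (4K to 32K) at $d = 64$, $b = 10{,}000$; `arc_fraction` is an illustrative helper.

```python
import math

def arc_fraction(length: int, theta: float) -> float:
    """Fraction of the unit circle swept by offsets 0 .. length-1, capped at 1."""
    return min(1.0, (length - 1) * theta / (2 * math.pi))

d, base = 64, 10_000.0
l_train, l_target = 4096, 32_768

newly_exposed = []
for i in range(d // 2):
    theta = base ** (-2 * i / d)
    seen = arc_fraction(l_train, theta)
    asked = arc_fraction(l_target, theta)
    if asked > seen:  # the target length reaches angles training never produced
        newly_exposed.append(i)
        print(f"i={i:2d}  seen {seen:6.1%} of the circle, asked for {asked:6.1%}")
```

Saturated pairs never appear in the output because both fractions are capped at 100%.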

The numerical signature. A model whose RoPE has gone OOD produces normal text up to $L_{\text{train}}$, then suddenly produces garbage past it. Perplexity is roughly flat up to the boundary and then explodes by 2–4 orders of magnitude over the next few thousand tokens. The transition is sharp, not gradual.

Two Costs That Get Worse Alongside RoPE

The rotation problem is the one the next two sections will repair. Two more problems appear at long context and are worth naming, because they shape which extension techniques are even worth attempting.

1. KV cache memory grows linearly in L

Every token in the context has to keep its keys and values around so future tokens can attend to them. For a model with $n_l$ layers and $n_{kv}$ KV heads of dimension $d_h$, the KV cache holds

$$\text{KV}(L) = 2\, b_{\text{dtype}}\, n_l\, n_{kv}\, d_h\, L \quad \text{bytes/sequence},$$

where $b_{\text{dtype}}$ is the number of bytes per element (2 for BF16) and the leading 2 counts keys and values.

For Llama-2-7B ($n_l = 32$, $n_{kv} = 32$, $d_h = 128$) in BF16 at 4K, that is about 2 GB per sequence. At 128K it is about 64 GB per sequence, more than the model weights themselves. This is why grouped-query attention (GQA), multi-query attention (MQA), and DeepSeek's multi-head latent attention (MLA) exist: they collapse $n_{kv}$ from $n_h$ down to 8 (GQA), 1 (MQA), or replace it with a low-rank latent (MLA), cutting the cache by a factor of 4 to more than an order of magnitude.
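The formula is easy to evaluate. The sketch below plugs in the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per element); the 8-KV-head variant is a hypothetical configuration added only to show the effect of GQA on the same shape.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values of one sequence (the leading 2 is K and V)."""
    return 2 * bytes_per_elem * n_layers * n_kv_heads * head_dim * seq_len

GIB = 1024 ** 3

# Llama-2-7B shape in BF16.
mha_4k = kv_cache_bytes(32, 32, 128, 4096)
mha_128k = kv_cache_bytes(32, 32, 128, 128 * 1024)

# Hypothetical GQA variant of the same shape with 8 KV heads.
gqa_128k = kv_cache_bytes(32, 8, 128, 128 * 1024)

print(f"MHA @ 4K:   {mha_4k / GIB:.0f} GiB")
print(f"MHA @ 128K: {mha_128k / GIB:.0f} GiB")
print(f"GQA @ 128K: {gqa_128k / GIB:.0f} GiB")
```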

2. Attention compute grows quadratically in L

The $QK^\top$ matrix and its softmax are $O(L^2)$. At prefill, computing attention over the whole context costs roughly $4\, n_l\, d_{\text{model}}\, L^2$ FLOPs, where $d_{\text{model}}$ is the hidden size (number of heads times head dimension): doubling $L$ quadruples the bill. At 128K this is ~1000× more attention compute than at 4K, even before any sequence-parallel slicing kicks in. FlashAttention reduces the constants and removes the memory blowup, but the FLOP curve is still quadratic.
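A rough estimate of the quadratic term, assuming the $4 \cdot \text{layers} \cdot \text{hidden size} \cdot L^2$ approximation and ignoring causal masking and the softmax itself. The shape used is Llama-2-7B (32 layers, hidden size 4096); the function name is illustrative.

```python
def prefill_attention_flops(n_layers: int, d_model: int, seq_len: int) -> int:
    """Rough FLOPs for Q K^T plus (attention weights) V over a full prompt.

    Each of the two matrix products costs about 2 * d_model * seq_len**2
    FLOPs per layer; causal masking and the softmax are ignored.
    """
    return 4 * n_layers * d_model * seq_len ** 2

# Llama-2-7B shape: 32 layers, hidden size 4096.
flops_4k = prefill_attention_flops(32, 4096, 4096)
flops_128k = prefill_attention_flops(32, 4096, 128 * 1024)

print(f"4K:   {flops_4k / 1e12:.1f} TFLOP")
print(f"128K: {flops_128k / 1e15:.1f} PFLOP ({flops_128k // flops_4k}x the 4K cost)")
```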

The reason these two costs matter for this chapter is that they limit how ambitious the extension target can be. You cannot just "extend to 10M tokens with YaRN": long before the rotation math fails, the KV cache exceeds GPU memory and the prefill takes minutes. Realistic targets are 128K (DeepSeek, Llama-3.1), 200K (Claude 3), 1M (Gemini 1.5, with extensive sparsity), and the techniques in this chapter all aim at the 4×–32× regime.

Manual Numerical Walkthrough

Set $d = 8$ (so we only have 4 dimension pairs to track), $b = 10{,}000$, $L_{\text{train}} = 1024$, $L_{\text{target}} = 4096$. We will compute, by hand, which dimensions are saturated, which are on-distribution at the target length, and which have gone OOD.


Step 1: frequencies. $\theta_i = 10000^{-2i/8} = 10000^{-i/4}$:

| i | θ_i (rad/token) | wavelength λ_i (tokens) |
|---|---|---|
| 0 | 1.0 | 6.28 |
| 1 | 0.1 | 62.8 |
| 2 | 0.01 | 628.3 |
| 3 | 0.001 | 6,283.2 |

Step 2: max angle at L_train. $\alpha_i^{\text{train}} = 1023 \cdot \theta_i$:

| i | α_i^train (rad) | as multiples of 2π | regime at L_train |
|---|---|---|---|
| 0 | 1023 | 162.8·2π | fully saturated, about 163 wraps |
| 1 | 102.3 | 16.28·2π | saturated, 16 wraps |
| 2 | 10.23 | 1.63·2π | wraps once, then partial arc |
| 3 | 1.023 | 0.16·2π | only 16% of one rotation seen |

Step 3: max angle at L_target. $\alpha_i^{\text{target}} = 4095 \cdot \theta_i$:

| i | α_i^target (rad) | as multiples of 2π | new arcs lit up? |
|---|---|---|---|
| 0 | 4095 | 651.7·2π | no, already saturated |
| 1 | 409.5 | 65.2·2π | no, already saturated |
| 2 | 40.95 | 6.52·2π | no, wraps several times, still saturated |
| 3 | 4.095 | 0.65·2π | YES, went from 0.16·2π to 0.65·2π |

Step 4: read off the diagnosis. Only dimension $i = 3$ goes OOD. During pretraining it covered the arc $[0, 0.16 \cdot 2\pi]$ on the unit circle; at the target length it now sweeps out to $[0, 0.65 \cdot 2\pi]$. The arc from $0.16 \cdot 2\pi$ to $0.65 \cdot 2\pi$ is the "new" angular region the model is being forced to interpret without training data.

Step 5: sanity check against the boundary formula. The boundary formula predicts saturation at $i^* = \frac{8}{2} \log_{10000}\!\frac{1024}{2\pi} = 4 \log_{10000}(163.0) = 4 \cdot 0.553 = 2.21$. So dimensions $i < 2.21$ are saturated and dimensions $i > 2.21$ are vulnerable. Dimension 2 sits just inside the boundary: it completes 1.63 turns during training, enough to have seen every angle. Dimension 3 is the actual OOD case. The formula and the table agree.
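The five steps above can be reproduced in a few lines. This is a sketch using the same toy configuration ($d = 8$, $b = 10{,}000$, $L_{\text{train}} = 1024$, $L_{\text{target}} = 4096$); a pair is flagged OOD here when training saw less than one full turn and the target length asks for more arc, which is the criterion used in Step 4.

```python
import math

d, base, l_train, l_target = 8, 10_000.0, 1024, 4096
two_pi = 2 * math.pi

rows = []
for i in range(d // 2):
    theta = base ** (-2 * i / d)                     # Step 1: frequency
    turns_train = (l_train - 1) * theta / two_pi     # Step 2: turns seen in training
    turns_target = (l_target - 1) * theta / two_pi   # Step 3: turns asked for at target
    ood = turns_train < 1.0 and turns_target > turns_train   # Step 4: diagnosis
    rows.append((i, theta, two_pi / theta, turns_train, turns_target, ood))
    print(f"i={i}  theta={theta:g}  wavelength={two_pi / theta:,.1f}  "
          f"train={turns_train:.2f} turns  target={turns_target:.2f} turns  OOD={ood}")

# Step 5: boundary formula, using the L_train / (2*pi) approximation from the text.
i_star = (d / 2) * math.log(l_train / two_pi, base)
print(f"i* = {i_star:.2f}")
```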

The lesson. Out of $d/2 = 4$ dimensions, one broke. In a real Llama with $d = 128$, $d/2 = 64$, $b = 10{,}000$ and $L_{\text{train}} = 4096$, the same calculation leaves 18 of 64 dimensions partially trained, and those are the ones that go OOD under extension. Not many in absolute terms, but they are precisely the dimensions responsible for long-range attention, and breaking them is enough to break the model.

Visualizing Where the Frequencies Go OOD

The interactive chart below makes the geometry concrete. Each row is one RoPE dimension. The dark portion of each bar shows the cumulative rotation angle the model saw during pretraining; the light portion shows the additional angle it is being asked to cover at the target length. Bars whose dark portion extends past the dashed $2\pi$ line are saturated dimensions and are safe. Bars whose dark portion stops before the dashed line are the OOD dimensions: every pixel of light bar beyond the dark portion is angular territory the model has never seen.

[Interactive chart: RoPE extrapolation, per-dimension rotation angle at L_train versus L_target]

Three things to try with the controls:

  • Drag L_target from 4K to 128K with base = 10,000 and L_train = 4K. The number of red "OOD" dimensions grows from 0 to roughly a third of all dimensions. The 32× extension that DeepSeek-V3 performs is the 128K end of this range.
  • Switch the base from 10,000 to 500,000. Llama-3 uses base 500K specifically to push more dimensions into the "slow" regime where one wavelength spans the whole context. Watch the saturation boundary slide toward lower dimension indices: fewer dimensions wrap during training, and each slow dimension covers a smaller arc of the unit circle at 8K. This is one of the few free architectural levers for long context.
  • Set L_train = L_target. The red bars disappear: no dimension is OOD, because we are not extending. The whole problem evaporates if you can afford to pretrain at the full target length — which, at trillions of training tokens, you cannot.
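The effect of the base in the second experiment can also be checked without the chart. The sketch below assumes a head dimension of 128 and an 8K pretraining window (illustrative values for a Llama-3-style configuration) and counts how many channel pairs complete at least one full turn during training under each base.

```python
import math

def saturated_count(d: int, base: float, l_train: int) -> int:
    """Number of channel pairs that complete at least one full turn in training."""
    return sum(
        1 for i in range(d // 2)
        if (l_train - 1) * base ** (-2 * i / d) >= 2 * math.pi
    )

# Assumed shape for illustration: head dimension 128, 8K pretraining window.
d, l_train = 128, 8192
counts = {base: saturated_count(d, base, l_train) for base in (10_000.0, 500_000.0)}

for base, n in counts.items():
    print(f"base={base:>9,.0f}: {n} of {d // 2} pairs saturated, "
          f"{d // 2 - n} still on a partial arc")
```

Raising the base lowers every frequency, so fewer pairs wrap inside the training window and more of them stay on a short arc.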

Then the cost picture. The chart below shows, for several real models, the KV cache memory and prefill attention FLOPs at context lengths from 1K to 128K. Notice that even at fixed batch = 1, the FLOP bars grow 4× every time L doubles. The blue KV-cache bars grow 2× every time L doubles — manageable for a single sequence, lethal at batch > 8.

[Interactive chart: KV cache memory and prefill attention FLOPs by context length, 1K to 128K]

These two charts together explain the design space. The rotation problem (chart 1) is fixed by changing how positions map to angles — NTK, PI, YaRN. The memory and compute problem (chart 2) is fixed by changing how attention is computed — GQA, MLA, FlashAttention, PagedAttention. They are orthogonal techniques and every production long-context system uses both stacks together.

Plain Python: Measuring the Distribution Shift

Before invoking PyTorch and a real model, let's reproduce the OOD-dimension count with plain NumPy. The script below computes, for every dimension, what arcs of the unit circle were lit up during pretraining and what arcs are lit up at the target length, then flags every dimension where the target reaches buckets pretraining never touched.

rope_ood_diagnosis.py

Pick a real RoPE config

We use head dim 64 and base 10,000. These are the values DeepSeek-V3 uses for its decoupled-RoPE channel, and 10,000 is the base Llama-2 used (with head dimension 128). This is not a toy; the numbers we are about to see are the numbers production teams actually wrestle with.

EXECUTION STATE
d = 64
base = 10000

Pretraining vs target length

The model only ever saw positions 0..4095 during pretraining. We want to use it at 32,768. The whole question of this section reduces to: do the rotations at the new positions look anything like the rotations the model already learned to handle?

EXECUTION STATE
L_train = 4096
L_target = 32768

Per-dimension rotation frequencies

RoPE assigns a different rotation frequency to each pair of channels. Channel pair 0 spins fast (θ = 1 rad/token, wavelength ≈ 6 tokens). The last pair spins slowly (θ ≈ 1.3e-4 rad/token, wavelength ≈ 47,000 tokens). The geometric spread is the whole reason positions are encoded richly, and the whole reason extrapolation is fragile.

EXECUTION STATE
i = [0,1,...,31]
theta = (32,), geometric, 1.0 → 1.33e-4

Wavelength in tokens, not radians

Converting θ to a wavelength makes the geometry concrete: a dimension with wavelength 6 tokens completes ~650 full rotations in 4K tokens. The slowest dimension, with a wavelength of about 47K tokens, covers less than a tenth of a turn in 4K tokens and still falls short of a full turn at 32K. The fast dimensions are saturated by training; the slow ones are nowhere near saturated.

EXECUTION STATE
wavelength = (32,), 6.28 → ≈ 47,100

Position grids

m_train and m_target are just integer position indices. The interesting quantity is what each m does to the rotation angle for each dimension j.

EXECUTION STATE
m_train = (4096,)
m_target = (32768,)

The actual angle every token sees, modulo 2π

np.outer(m, theta) is the central RoPE quantity: position × frequency. We wrap it to [0, 2π) because a rotation by an angle is unchanged by adding full turns. ang_train[m, j] is the angle the model saw at position m on dimension j during pretraining.

EXECUTION STATE
ang_train = (4096, 32), values in [0, 2π)
ang_target = (32768, 32)

Bin the unit circle and look for empty buckets

For each dimension we slice [0, 2π) into 50 buckets. If a bucket is empty at L_train but populated at L_target, then the model is being shown rotation angles on that dimension that it never observed during training. That is the working definition of OOD positional encoding used in this script.

EXECUTION STATE
bins = 50

Per-dimension extrapolation diagnosis

For dimensions with very short wavelengths (early i), 4K tokens already cover the circle densely, so no new buckets light up at 32K. For dimensions with long wavelengths (late i), 4K tokens cover only a small arc; at 32K we light up arcs the network has never seen. Those are the dimensions that will break.

What the print actually shows

When you run this you should get something like: OOD dimensions: 9 / 32, covering dimensions 23 through 31. Dim 25 (wavelength ≈ 8.4K tokens) lights up 25 new buckets out of 50. The early dimensions (0..22) have already completed at least one full turn: they are saturated and safe. The OOD count is the population of dimensions that PI/NTK/YaRN have to fix.

import numpy as np

# RoPE configuration: head dim 64, base 10,000
# (DeepSeek-V3's decoupled-RoPE channel; Llama-2 uses the same base).
d, base = 64, 10_000.0            # head dim, RoPE base
L_train, L_target = 4096, 32_768  # 4K pretraining, 32K target

# Per-dimension rotation frequencies.
# theta_i = base ** (-2 i / d) for i = 0, ..., d/2 - 1
i = np.arange(d // 2)
theta = base ** (-2 * i / d)
wavelength = 2 * np.pi / theta     # tokens to complete one full turn

# Positions seen in pretraining vs at target length.
m_train  = np.arange(L_train)
m_target = np.arange(L_target)

# Angle every position produces on every dimension.
ang_train  = np.outer(m_train,  theta) % (2 * np.pi)   # (L_train, d/2)
ang_target = np.outer(m_target, theta) % (2 * np.pi)   # (L_target, d/2)

# Are the target angles inside the training distribution?
# Bin the unit circle into 50 buckets per dimension; OOD if any bucket
# is empty in training but populated at the target length.
bins = 50
ood_dims, total_dims = [], d // 2

for j in range(total_dims):
    h_train,  _ = np.histogram(ang_train[:, j],  bins=bins, range=(0, 2*np.pi))
    h_target, _ = np.histogram(ang_target[:, j], bins=bins, range=(0, 2*np.pi))
    new_buckets = ((h_target > 0) & (h_train == 0)).sum()
    if new_buckets > 0:
        ood_dims.append((j, theta[j], wavelength[j], new_buckets))

print(f"OOD dimensions: {len(ood_dims)} / {total_dims}")
# Show the first six flagged dimensions.
for j, t, w, nb in ood_dims[:6]:
    print(f"  i={j:2d}  theta={t:.4e}  wavelength={w:,.0f}  new_buckets={nb}")

The script gives you a per-dimension breakdown that matches the chart above. The OOD-dimension count is also a useful pre-flight check before running a context-extension recipe: if zero dimensions go OOD at your target length, the rotation angles stay inside the training distribution and no rescaling recipe is needed. If the OOD count is more than ~25% of dimensions, you almost certainly need YaRN-class scaling rather than naive position interpolation, because PI compresses the saturated dimensions you do not want to disturb.
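The remark about PI can be made concrete with a small sketch. It assumes position interpolation in its simplest form, where every position index is multiplied by $L_{\text{train}} / L_{\text{target}}$ before the rotation is applied; the variable names are illustrative.

```python
d, base = 64, 10_000.0
l_train, l_target = 4096, 32_768
scale = l_train / l_target   # PI squeezes positions back into the trained range

thetas = [base ** (-2 * i / d) for i in range(d // 2)]

# The largest interpolated position stays below l_train, so no channel pair
# is pushed past the angles it produced during pretraining.
largest_interp_pos = (l_target - 1) * scale
print(f"largest interpolated position: {largest_interp_pos}")

# The price: every pair, including the saturated fast ones, now advances
# by a smaller angle per token than it did in training.
step_before, step_after = thetas[0], thetas[0] * scale
print(f"fastest pair, radians per token: {step_before} -> {step_after}")
```

The fastest pair did not need any repair, yet its per-token step shrinks by the same factor of 8 as the slowest pair. That uniform compression is the side effect that frequency-dependent schemes are designed to avoid.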

PyTorch: A Tiny Length-Extrapolation Experiment

The cleanest way to see the failure is to take a real 4K-pretrained model and score growing prefixes of the same long document at 4K, 8K, 16K, and 32K. Same model, same document; only the range of positions changes. The result is the textbook hockey stick: perplexity is flat up to $L_{\text{train}}$ and then explodes.

length_extrapolation_test.py

Pick a model with a known training length

Llama-2-7B was pretrained at 4096 tokens. That is the boundary we want to step past. The same script works for any model, such as Mistral-7B (8K), Llama-3-8B (8K), or your own checkpoint, as long as you know its training context.

fp16 for inference, eval mode

We are measuring perplexity, not training. fp16 keeps memory low, which helps a 32K forward pass fit on a single 80GB GPU. mdl.eval() disables dropout, so the measured perplexity is deterministic.

Long input text

PG-19 is the standard long-context evaluation corpus; any single document longer than 32K tokens works. Use a real book, not a concatenation of short articles: concatenation creates artificial topic breaks that hide the positional-encoding issue.

EXECUTION STATE
ids = (1, N) with N ≥ 32768

Slice-and-score perplexity

Take the first ctx tokens of the same document and ask the model how surprised it is. Same document, same tokenizer, same model; only the range of position indices grows. A jump in perplexity past the training length is therefore a position-encoding effect, not a content effect.

labels=x is the standard CE trick

Hugging Face shifts labels by one inside the model: predicting token t from tokens 0..t-1. out.loss is mean cross-entropy across the whole sequence. exp(loss) is perplexity, the geometric mean of 1/probability per token.

EXECUTION STATE
out.loss = scalar tensor, mean CE

The actual numbers from a real Llama-2 run

Typical output: PPL@4K ≈ 6.2, PPL@8K ≈ 51, PPL@16K ≈ 1850, PPL@32K diverges. Perplexity does not 'gracefully degrade': it falls off a cliff exactly at L_train. That is the failure mode this whole chapter exists to fix. With YaRN, the same model at 32K runs at PPL ≈ 7.1, within 15% of its 4K baseline.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# A 4K-context model. Weights are loaded in fp16 and moved to the GPU,
# so this script assumes a CUDA device with enough memory for 32K tokens.
name = "meta-llama/Llama-2-7b-hf"
tok  = AutoTokenizer.from_pretrained(name)
mdl  = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
mdl.eval().to("cuda")

# A long passage we will measure perplexity on.
with open("pg19_long_passage.txt") as f:
    text = f.read()
ids  = tok(text, return_tensors="pt").input_ids.to("cuda")  # (1, N)

def slice_ppl(ids, ctx):
    # Take the first ctx tokens; compute next-token CE loss.
    x = ids[:, :ctx]
    with torch.no_grad():
        out = mdl(x, labels=x)                              # autoregressive CE
    return torch.exp(out.loss).item()

ppl_train  = slice_ppl(ids, 4096)    # inside pretraining range
ppl_8k     = slice_ppl(ids, 8192)    # 2x pretraining
ppl_16k    = slice_ppl(ids, 16384)   # 4x pretraining
ppl_32k    = slice_ppl(ids, 32768)   # 8x pretraining, far past the trained range

print(f"PPL @ 4K  : {ppl_train:6.2f}")
print(f"PPL @ 8K  : {ppl_8k:6.2f}")
print(f"PPL @ 16K : {ppl_16k:6.2f}")
print(f"PPL @ 32K : {ppl_32k:6.2f}")

What this experiment proves. The model still has all the information it needs: same weights, same vocabulary, same document. The only thing that changed is the range of position indices passed into RoPE. Perplexity going from 6 to 1850 is therefore a positional-encoding effect. Whatever fixes this has to fix the rotation math, because nothing else has changed.
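The perplexity reported by the script is the exponential of the mean next-token cross-entropy. A minimal sketch of that relationship, using made-up token probabilities rather than real model output, shows why it equals the geometric mean of the inverse probabilities. Note that a sequence of N tokens yields N − 1 predictions, because the first token has nothing to be predicted from.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability of the observed next tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def geometric_mean_inverse(token_probs: list[float]) -> float:
    """Geometric mean of 1/p, computed directly for comparison."""
    return math.prod(1.0 / p for p in token_probs) ** (1.0 / len(token_probs))

# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.125, 0.5]

ppl = perplexity(probs)
gm = geometric_mean_inverse(probs)
print(f"perplexity = {ppl:.4f}, geometric mean of 1/p = {gm:.4f}")
```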

What Changes at 671B Parameters and 128K Context

Three things scale up uncomfortably when you take this story to a frontier model.

| Quantity | Llama-2-7B @ 4K | DeepSeek-V3 @ 128K | What blows up |
|---|---|---|---|
| Params | 7B | 671B (37B active per token) | model weights + optimizer states |
| L_target / L_train | 1× | 32× | OOD RoPE dimensions |
| KV cache (bf16, 1 seq) | 2 GB | ~17 GB (MLA latent) | without MLA it would be ~400 GB |
| Prefill attention FLOPs | ~10 TFLOP | ~13 PFLOP | quadratic in L: 1000× more than at 4K |
| Extension training tokens | n/a | ~25B tokens (long-context stage) | extension is itself a sizable training run |

The bullet points behind these numbers:

1. The extension stage is its own training run

DeepSeek-V3 was pretrained on 14.8T tokens at 4K context. Extending it to 128K cost an additional ~25B tokens of long-context training in two stages (4K → 32K, then 32K → 128K). That is not negligible: 25B tokens at 671B parameters is still tens of thousands of GPU hours. The whole purpose of YaRN is to make this 25B-token stage sufficient, because the alternative is pretraining at 128K from scratch, where every sequence costs $32^2 \approx 1000\times$ the attention FLOPs of a 4K sequence, or about $32\times$ more attention compute per token across the entire 14.8T-token run.

2. MLA changes the KV-cache shape, not the RoPE story

DeepSeek-V3's multi-head latent attention compresses the KV cache into a low-rank latent of dimension 512, which is why a 128K context fits in 17 GB instead of hundreds of gigabytes. But MLA does not change the rotation math — it still uses RoPE for position encoding. In fact V3 uses a decoupled RoPE where the latent path is purely content-based and a small set of 64-dim heads carry the positional information. That decoupling exists precisely so that the extension stage only has to repair the positional channel, not the whole latent representation. We cover decoupled RoPE in detail in §4.4.

3. Quadratic attention is unavoidable in prefill

Generation is cheap: each new token attends to the cached KV of all previous tokens at cost $O(L)$. Prefill is what costs $O(L^2)$: you have to fill the cache for the prompt. At 128K, a single prefill takes seconds even on H100s. This is why long-context inference systems aggressively use sequence parallelism and FlashAttention-2. Neither is a property of the model; both are engineering necessities that the model designer has to leave room for.
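The asymmetry between prefill and generation comes down to counting query-key pairs. The sketch below assumes causal masking, so each token attends to itself and everything before it; the helper names are illustrative.

```python
def prefill_pairs(prompt_len: int) -> int:
    """Query-key pairs scored when a prompt is processed with causal masking."""
    return prompt_len * (prompt_len + 1) // 2

def decode_pairs(cache_len: int) -> int:
    """Pairs scored for one generated token: every cached key plus its own."""
    return cache_len + 1

for L in (4096, 131_072):
    print(f"L={L:>7,}: prefill scores {prefill_pairs(L):,} pairs, "
          f"the next generated token scores {decode_pairs(L):,}")
```

Summing the per-token decode cost over a whole sequence recovers the prefill count, which is why the quadratic term cannot be avoided, only scheduled differently.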

Engineering Reality: Why You Can't Just Crank L

Walk through what would happen if you naively served a 4K-pretrained model at 32K and ignored everything in this chapter. You would observe, in order:

  • The model produces fluent, on-topic text up to the 4K boundary.
  • Around token 4000–5000, the text starts repeating, mixing topics, and producing tokens that are statistically wrong for the context (random rare tokens, partial code snippets, fragments of foreign languages). This is the OOD RoPE rotations producing nonsense attention scores: the model is attending to something, but the positional selectivity it learned no longer applies.
  • By token 8000, perplexity is 100× the 4K baseline. Generation is functionally broken.
  • Even if you only do retrieval-style queries ("what was the third paragraph about?"), the attention map between query and the relevant key has the wrong rotational alignment, so the model fails to attend to the right span. You see this as "needle in haystack" benchmark scores collapsing past $L_{\text{train}}$.

The fix is the three sections that follow. §12.2 shows how to repair the rotation distribution with NTK-aware scaling — the simplest method that already gets most of the way there. §12.3 shows the production-grade method, YaRN, which is what DeepSeek-V3 actually used to reach 128K. §12.4 covers how you measure whether the extension worked, because perplexity alone is a famously unreliable proxy for long-context capability.

The single sentence to remember. Context extension is hard because RoPE sends a fraction of attention dimensions to angles outside the training distribution at positions beyond $L_{\text{train}}$, and the rest of the chapter is about which dimensions to leave alone and which to rescale.