The Real Problem: NTK Alone Is Not Enough
The previous section ended with a victory and a warning. NTK-aware scaling fixed the most obvious failure of Position Interpolation: it stopped pretending that all RoPE dimensions live on the same timescale. By scaling the RoPE base instead of compressing every angle uniformly, NTK kept the highest-frequency rotations roughly intact while spreading the slowest ones across the new context window. The result was clean: at $s = 4$, you could push Llama-2 from 4K to 16K tokens with almost no loss of perplexity, no fine-tuning, no retraining.
Then people tried $s = 8$, and $s = 16$, and the cracks reappeared. Around 32K context, NTK-aware Llama-2 started forgetting the start of documents. By 64K, the loss curve looked like a polite version of the original PI catastrophe: not divergent, just quietly broken. Information retrieval at long range worked for the first ~20K tokens, then degraded smoothly into noise. Why?
The diagnosis took a careful look at what NTK actually does to each dim-pair. NTK rescales the base, which means it applies a single multiplicative correction to the whole spectrum. But the spectrum is highly non-uniform. RoPE assigns dimension pair $i$ the angular frequency $\theta_i = b^{-2i/d}$, so the wavelengths $\lambda_i = 2\pi/\theta_i$ span almost four orders of magnitude. For Llama-2 with $d = 128$ and $b = 10{,}000$:
| Dim pair i | θᵢ (rad/position) | λᵢ (positions) | Rotations in 4K context |
|---|---|---|---|
| 0 (fastest) | 1.00 | 6.28 | ~652 |
| 10 | 0.237 | 26.5 | ~155 |
| 20 | 0.0562 | 111.7 | ~36.7 |
| 30 | 0.0133 | 471.2 | ~8.7 |
| 40 | 0.00316 | 1,987 | ~2.06 |
| 50 | 7.50·10⁻⁴ | 8,379 | ~0.49 |
| 60 | 1.78·10⁻⁴ | 35,333 | ~0.12 |
| 63 (slowest) | 1.15·10⁻⁴ | 54,410 | ~0.075 |
Look at that last column. Dim-pair 0 saw 652 full rotations during pretraining: it has been around the unit circle so many times that the model has memorised every angle. Dim-pair 63 saw less than one tenth of a rotation, so the model literally never observed what its sin and cos look like past their first few percent of arc. These two ends of the spectrum need completely different treatment, and any method that applies a single global rule to both is going to shortchange one or the other.
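The three quantities in that table follow directly from the RoPE schedule. A minimal sketch, assuming Llama-2-style values ($d = 128$, $b = 10{,}000$, a 4,096-token training window); the function name `rope_spectrum` is hypothetical and chosen only for this illustration:

```python
import math

def rope_spectrum(d=128, base=10000.0, train_ctx=4096):
    """Return (i, theta_i, wavelength_i, rotations_i) for every dim-pair.

    Defaults are assumed Llama-2-style values, not read from any checkpoint.
    """
    rows = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)      # angular frequency, radians per position
        wavelength = 2.0 * math.pi / theta  # positions per full rotation
        rotations = train_ctx / wavelength  # full rotations inside the training window
        rows.append((i, theta, wavelength, rotations))
    return rows

# Print every tenth dim-pair plus the slowest one.
for i, theta, lam, rot in rope_spectrum():
    if i % 10 == 0 or i == 63:
        print(f"{i:2d}  theta={theta:.3e}  lambda={lam:10.1f}  rotations={rot:8.3f}")
```

The rotation count in the last column is the number that the rest of the section uses to decide how each dim-pair should be treated.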
YaRN — Yet another RoPE extensioN, Peng et al. 2023 — is the paper that finally said: stop trying to find a single scaling rule. Use three rules, one per band. Combined with a small attention-temperature correction, YaRN is the method that powers DeepSeek-V2's and DeepSeek-V3's 128K context windows, Qwen2-72B's 32K extension, and most production long-context Llama variants. It is also the rare technique whose algorithmic description fits on half a page.
Intuition: A Clock With Hands of Many Speeds
Imagine RoPE as a fancy clock with 64 hands instead of three. The fastest hand sweeps the dial in 6.28 positions; the slowest hand takes roughly 50,000 positions to complete one orbit. Every token in the sequence advances every hand by its own fraction of a revolution, and the resulting configuration of hands is what the attention mechanism reads to know "how far apart are these two tokens?".
Now you want to extend the clock's range from 4,000 positions to 32,000. What should you do with each hand?
- The fast hands (high frequency). These hands completed hundreds of full revolutions during the original 4K-token training, so the model has seen every conceivable configuration of them. Compressing them — making them tick slower so they only complete the same number of revolutions in 32K — would destroy precise short-range information. The model knows what a 7-token gap looks like; do not lie to it. Solution: extrapolate — keep these hands ticking at the original rate. The model will see new configurations beyond 4K, but it has learned the geometry well enough to generalise.
- The slow hands (low frequency). These hands completed only a fraction of a revolution during 4K-token training. The model has no idea what their configuration looks like at, say, the 90° mark, because it never observed them past their initial 30° arc. If you let them keep ticking at the original rate, by token 32,000 they will have swept into regions of angle space that have never appeared in training data: pure extrapolation, which language models are famously bad at. Solution: interpolate. Slow these hands down by a factor of $s$ (here $s = 8$) so they sweep the same small arc across the 32K context that they used to sweep across 4K. The model will still be in familiar angle space.
- The medium hands. Some hands completed a few revolutions during training — enough to know the basic shape of sin and cos but not enough to nail every angle. A sharp cutoff between "extrapolate" and "interpolate" would create a discontinuity at the band boundary; the model would suddenly see two adjacent dim-pairs behave according to totally different rules. Solution: smoothly blend — for hands in the transition zone, mix the extrapolation and interpolation strategies in proportion to how many rotations they observed.
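To put numbers on the slow-hand case, the sketch below computes how far the slowest hand sweeps with and without interpolation. It assumes Llama-2-style values ($d = 128$, base 10,000, so the slowest dim-pair is $i = 63$) and an 8× extension from 4,096 to 32,768 positions; `swept_degrees` is a hypothetical helper for this illustration:

```python
import math

# Slowest RoPE frequency for an assumed d = 128, base = 10,000 head (dim-pair 63).
theta_slow = 10000.0 ** (-2 * 63 / 128)

def swept_degrees(theta, positions):
    """Total angle, in degrees, that a hand of frequency theta sweeps over `positions` tokens."""
    return math.degrees(theta * positions)

seen_in_training = swept_degrees(theta_slow, 4096)   # arc observed during 4K pretraining
extrapolated = swept_degrees(theta_slow, 32768)      # same rate, 8x longer context
interpolated = swept_degrees(theta_slow / 8, 32768)  # hand slowed by s = 8

print(f"seen in training: {seen_in_training:.1f} deg")
print(f"extrapolated:     {extrapolated:.1f} deg")
print(f"interpolated:     {interpolated:.1f} deg")
```

Left at its original rate, the hand ends up several times past the arc it covered in training; slowed by $s$, it ends exactly where it did at the end of the 4K window.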
That is the entire YaRN algorithm. Three bands, one ramp, one line of math per dim. Plus one extra ingredient — the attention temperature — which we will get to in a moment, because rescaling frequencies changes the typical magnitude of dot products and the softmax needs a small nudge to compensate.
Why was this not obvious in 2023? Because the community had been thinking of RoPE as a unified positional encoding rather than a bank of independent oscillators. Once you squint at it as a spectrum — many simultaneous frequencies, each with its own training history — the "treat each band by its rotation count" idea becomes the natural answer. YaRN's contribution was not a deep theorem; it was reframing.
The Math: Splitting the Spectrum
Start with the standard RoPE frequency schedule. For a head of dimension $d$, dim-pair $i \in \{0, \dots, d/2 - 1\}$, base frequency $b$ (almost always 10,000), the angular frequency is $\theta_i = b^{-2i/d}$ and the wavelength in token positions is $\lambda_i = 2\pi/\theta_i$.
Given a pretraining context length $L$ and a target scale factor $s = L'/L$ (with $L'$ the target context length), define the per-dim rotation count
$$r_i = \frac{L}{\lambda_i} = \frac{L\,\theta_i}{2\pi},$$
which counts how many full revolutions dim-pair $i$ made during pretraining. This is the scalar that classifies each dim into one of three bands.
The ramp γ(r)
YaRN introduces a linear ramp parameterised by two thresholds $\alpha < \beta$:
$$\gamma(r) = \begin{cases} 0, & r < \alpha \\ \dfrac{r - \alpha}{\beta - \alpha}, & \alpha \le r \le \beta \\ 1, & r > \beta \end{cases}$$
Read this as: $\gamma = 1$ means "the dim is in the high-frequency band, keep its rotation rate unchanged"; $\gamma = 0$ means "the dim is in the low-frequency band, compress it by a factor of $s$"; intermediate values blend continuously. The paper's recommended defaults are $\alpha = 1$ (the rotation count below which a dim never completed a full revolution and is treated as "slow") and $\beta = 32$ (the count above which the model has seen so many revolutions it can safely extrapolate). DeepSeek-V3 uses these defaults unchanged.
The blended frequency
The YaRN-rescaled angular frequency for dim-pair $i$ is
$$\theta_i' = \gamma(r_i)\,\theta_i + \bigl(1 - \gamma(r_i)\bigr)\,\frac{\theta_i}{s}.$$
At the band edges this collapses to the familiar special cases:
- High-freq dim ($\gamma = 1$): $\theta_i' = \theta_i$, which is pure extrapolation, identical to the original RoPE.
- Low-freq dim ($\gamma = 0$): $\theta_i' = \theta_i / s$, which is pure interpolation, identical to Position Interpolation on that dim alone.
- Transition dim: a convex combination of the two strategies, weighted by the position of $r_i$ within $[\alpha, \beta]$.
Notice how compact this is. The whole frequency-domain intervention is a single line of math per dim, and it degenerates to either of its predecessors at the boundary. That is exactly the kind of strict generalisation that lets you adopt YaRN as a drop-in replacement: a no-scale call ($s = 1$) recovers vanilla RoPE exactly.
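The ramp and the blend are small enough to check by hand. A minimal sketch, assuming the thresholds 1 and 32 quoted above as defaults; the names `ramp` and `blend` are hypothetical and not taken from any library:

```python
def ramp(r, alpha=1.0, beta=32.0):
    """Linear ramp on the rotation count r: 0 below alpha, 1 above beta."""
    if r < alpha:
        return 0.0  # slow dim: fully interpolated
    if r > beta:
        return 1.0  # fast dim: left untouched
    return (r - alpha) / (beta - alpha)

def blend(theta, r, s, alpha=1.0, beta=32.0):
    """Rescaled frequency: gamma * theta + (1 - gamma) * theta / s."""
    g = ramp(r, alpha, beta)
    return g * theta + (1.0 - g) * theta / s

# The three regimes, using illustrative (theta, r) pairs and s = 8.
print(blend(1.0, 650.0, 8.0))  # fast dim: frequency unchanged
print(blend(0.01, 0.2, 8.0))   # slow dim: frequency divided by s
print(blend(0.05, 16.5, 8.0))  # midpoint of the ramp: average of the two
```

With $s = 1$ both terms of the blend carry the same frequency, so the result equals the input for every value of the ramp.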
The Temperature Fix: Why Attention Scores Need a Nudge
If you stop here and run inference, you will find that perplexity at long context is better than PI and better than NTK — but still not as good as the original short-context perplexity. The residual gap puzzled the YaRN authors until they noticed something subtle about the magnitude of attention dot-products after rescaling.
RoPE represents position by rotating Q and K vectors. The post-rotation dot product depends on the relative offset between two positions. When you stretch the wavelengths (by lowering some $\theta_i$), the typical rotation angle for a given offset gets smaller. A smaller rotation means the rotated vector is closer to its un-rotated version, which means the typical dot-product magnitude shrinks. The softmax sees lower-amplitude logits and produces a smoother, less selective attention distribution, which is exactly the symptom of "the model is forgetting what to focus on".
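The fix is a single temperature $t$ applied to the pre-softmax attention scores, given by a one-line formula in the scale factor:
$$\operatorname{softmax}\!\left(\frac{q_m^\top k_n}{t\,\sqrt{d}}\right), \qquad \sqrt{\frac{1}{t}} = 0.1\,\ln(s) + 1.$$
This formula was derived empirically, by fitting a one-parameter curve to the post-rescaling perplexity gap across scale factors $s$, and verified to generalise across model sizes and families. The implementation trick is even cleaner than the math: you fold the factor $\sqrt{1/t}$ into the cos/sin tables once at module init. That scales the rotated $q$ and $k$ by $\sqrt{1/t}$ each, and therefore the logits by $1/t$, so the attention layer sees no extra multiply at runtime. The cost of YaRN inference is therefore exactly the same as vanilla RoPE inference.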
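For a feel of the magnitudes, the sketch below evaluates the factor as the YaRN paper states it, $\sqrt{1/t} = 0.1 \ln(s) + 1$, which is applied to both $q$ and $k$ so the logits are scaled by its square. The function name `yarn_mscale` is hypothetical, and returning 1.0 for $s \le 1$ is an assumed convention for the no-scaling case:

```python
import math

def yarn_mscale(s):
    """Per-vector scale sqrt(1/t) = 0.1 * ln(s) + 1 (assumed 1.0 when s <= 1)."""
    if s <= 1.0:
        return 1.0
    return 0.1 * math.log(s) + 1.0

for s in (1, 4, 8, 32):
    m = yarn_mscale(s)
    # q and k are each scaled by m, so the q.k logit is scaled by m * m.
    print(f"s={s:2d}  q/k scale={m:.4f}  logit scale={m * m:.4f}")
```

The factor grows only logarithmically in $s$, which is why even a 32× extension needs a modest correction.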
Manual Numerical Walkthrough
A complete YaRN step on a d=8 toy head with s=4
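Take a tiny head with $d = 8$ (so 4 dim-pairs), base $b = 10{,}000$, pretraining context $L = 16$ tokens, target $L' = 64$ (so $s = 4$). Use the paper defaults $\alpha = 1$, $\beta = 32$.
Step 1: compute θᵢ. $\theta_i = 10000^{-2i/8} = 10^{-i}$ (because $10000^{1/4} = 10$).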
| i | θᵢ | λᵢ = 2π/θᵢ | rᵢ = 16/λᵢ | band |
|---|---|---|---|---|
| 0 | 1.0 | 6.283 | 2.546 | blend |
| 1 | 0.1 | 62.83 | 0.2546 | interpolate |
| 2 | 0.01 | 628.3 | 0.02546 | interpolate |
| 3 | 0.001 | 6,283 | 0.002546 | interpolate |
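Step 2: compute γ(rᵢ). Only $r_0 = 2.546$ falls in the transition band $[\alpha, \beta] = [1, 32]$; the other three are all below $\alpha = 1$.
- $\gamma(r_0) = (2.546 - 1)/(32 - 1) \approx 0.0499$
- $\gamma(r_1) = \gamma(r_2) = \gamma(r_3) = 0$ (below α)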
Step 3: compute θᵢ' via the YaRN blend.
| i | γᵢ | γ·θᵢ | (1-γ)·θᵢ/s | θᵢ' (YaRN) | θᵢ' (PI baseline) |
|---|---|---|---|---|---|
| 0 | 0.0499 | 0.04990 | (0.9501)·(0.25) = 0.2375 | 0.2874 | 0.2500 |
| 1 | 0.0 | 0 | 0.0250 | 0.0250 | 0.0250 |
| 2 | 0.0 | 0 | 0.00250 | 0.00250 | 0.00250 |
| 3 | 0.0 | 0 | 0.000250 | 0.000250 | 0.000250 |
Step 4: notice what happened. For dim-pair $i = 0$ (the fastest, and the only one that completed more than one full rotation in the 16-token training window), YaRN sets $\theta_0' = 0.2874$, while PI compresses it all the way to $0.25$. That extra ~15% in angular speed keeps a little more short-range positional sharpness than PI does. For dim-pairs 1–3 (all in the low-freq band) YaRN and PI agree exactly: divide $\theta_i$ by $s = 4$.
Step 5: the attention temperature. $\sqrt{1/t} = 0.1 \ln 4 + 1 \approx 1.139$. Folded into the cos/sin tables, this scales $q$ and $k$ by ~1.14 each, so every pre-softmax logit is multiplied by $1/t \approx 1.30$. That is a mild sharpening of the attention distribution that compensates for the slight dot-product attenuation caused by the slowed rotations.
Step 6: the cost of all this. Four θ computations, four γ evaluations, four blends, one log. On a real head with $d = 128$ the cost is 64 of each: call it a microsecond on a single CPU thread, and negligible on the GPU. The expensive part of switching to YaRN is not the math; it is the continued pretraining on long documents that follows, which we will discuss in the massive-scale section below.
Visualizing the Three Bands
The math is simple but the geometry is hard to picture without a live diagram. The widget below plots two things at once:
- Left panel: a bar per dim-pair, height equal to $\log(\lambda_i / L)$, which is negative for fast dims (many rotations) and positive for slow dims (less than one rotation). Bars are coloured by YaRN band: green for extrapolate, amber for blend, blue for interpolate.
- Right panel: the ramp $\gamma(r)$ with the active $\alpha$ and $\beta$ marked. Every dim-pair is plotted as a dot at its $(r_i, \gamma(r_i))$ coordinate so you can see which band each dim landed in.
Try moving the $s$ slider from 2 to 32. Notice how the band counts in the legend change: at small $s$, most dims stay green (extrapolate); at large $s$, more dims drift blue (interpolate). The transition band stays narrow at any $s$ because it is gated by $\alpha$ and $\beta$, not by $s$. The attention-temperature value at the bottom of the right panel updates live to match.
Plain Python: YaRN From Scratch
Before we touch PyTorch, here is the algorithm in pure Python. No tensors, no broadcasting tricks — just a loop over dim-pairs and the four equations from the math section. If you understand this snippet, you understand YaRN.
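A minimal sketch of that loop, following the equations from the math section. The defaults are assumptions chosen to match a Llama-2-style head ($d = 128$, base 10,000, 4,096-token pretraining window) with $s = 8$; the function name `yarn_frequencies` is hypothetical, and the second return value is the per-vector factor $\sqrt{1/t} = 0.1 \ln(s) + 1$ from the YaRN paper:

```python
import math

def yarn_frequencies(d=128, base=10000.0, train_ctx=4096, s=8.0,
                     alpha=1.0, beta=32.0):
    """Return (list of d/2 rescaled RoPE frequencies, attention scale sqrt(1/t))."""
    new_thetas = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)            # 1. original RoPE frequency
        r = train_ctx * theta / (2.0 * math.pi)   # 2. rotations seen in pretraining
        if r < alpha:                             # 3. ramp gamma(r)
            gamma = 0.0                           #    slow dim: interpolate
        elif r > beta:
            gamma = 1.0                           #    fast dim: extrapolate
        else:
            gamma = (r - alpha) / (beta - alpha)  #    transition: linear blend
        # 4. blend the extrapolated and interpolated frequencies
        new_thetas.append(gamma * theta + (1.0 - gamma) * theta / s)
    # Temperature factor; assumed to be 1.0 when no scaling is requested.
    mscale = 0.1 * math.log(s) + 1.0 if s > 1.0 else 1.0
    return new_thetas, mscale

thetas, mscale = yarn_frequencies()
print(len(thetas), round(mscale, 4))

# The toy head from the manual walkthrough: d = 8, 16-token window, s = 4.
toy_thetas, toy_mscale = yarn_frequencies(d=8, train_ctx=16, s=4.0)
print([round(t, 6) for t in toy_thetas], round(toy_mscale, 4))
```

Calling it with the toy parameters is a quick sanity check against the hand-computed values in the walkthrough.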
Run this and you get back a list of 64 new RoPE frequencies, plus a single attention temperature scalar. That is the entire output of the YaRN procedure. Every downstream step — building the cos/sin tables, rotating Q and K, computing attention — is identical to vanilla RoPE.
PyTorch: A Drop-In RoPE Replacement
The plain-Python version is for understanding; the PyTorch version below is what actually ships. It is structured as a single `nn.Module` that you can paste into any Llama/Mistral/Qwen/DeepSeek codebase to replace the existing rotary-embedding class, with no attention kernel changes required.
The two key implementation choices to notice: (1) we keep $\theta_i$ and the rotation count $r_i$ in FP32, because the slowest dims carry values around $10^{-4}$ that would round catastrophically in BF16; (2) we fold the YaRN attention temperature directly into the precomputed cos/sin tables, which means the attention kernel does not need to know YaRN exists. FlashAttention, xFormers, PagedAttention, all just see standard rotated Q/K tensors and run unmodified.
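The table-folding idea in point (2) does not depend on the tensor library, so it can be shown without PyTorch. The sketch below is a dependency-free illustration, not the shipped module: `build_tables`, `rotate`, and `dot` are hypothetical helpers, the toy frequencies and scale are made up, and the interleaved pairing `(x[2i], x[2i+1])` is an assumption (some codebases pair the first and second halves of the head instead).

```python
import math

def build_tables(thetas, mscale, max_pos):
    """cos/sin tables indexed [position][dim_pair], pre-multiplied by mscale."""
    cos = [[mscale * math.cos(p * t) for t in thetas] for p in range(max_pos)]
    sin = [[mscale * math.sin(p * t) for t in thetas] for p in range(max_pos)]
    return cos, sin

def rotate(x, cos_row, sin_row):
    """Rotate each pair (x[2i], x[2i+1]) by the table entry for dim-pair i."""
    out = []
    for i in range(len(cos_row)):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * cos_row[i] - b * sin_row[i], a * sin_row[i] + b * cos_row[i]]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy setup: two dim-pairs and a made-up scale of 1.2.
thetas, mscale = [1.0, 0.1], 1.2
cos, sin = build_tables(thetas, mscale, max_pos=8)
plain_cos, plain_sin = build_tables(thetas, 1.0, max_pos=8)
q, k = [1.0, 0.0, 0.5, 0.5], [0.3, 0.7, 1.0, -1.0]

# Same relative offset (3 positions) at two different absolute positions.
score_a = dot(rotate(q, cos[5], sin[5]), rotate(k, cos[2], sin[2]))
score_b = dot(rotate(q, cos[6], sin[6]), rotate(k, cos[3], sin[3]))
# The same score computed from unscaled tables.
plain = dot(rotate(q, plain_cos[5], plain_sin[5]), rotate(k, plain_cos[2], plain_sin[2]))

print(score_a, score_b, plain * mscale ** 2)
```

Because both the query and the key pass through the scaled tables, the score picks up the factor twice, while the downstream attention code only ever sees ordinary rotated vectors.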
At Massive Scale: DeepSeek-V3 from 4K to 128K
Here is how YaRN actually plays out on a 671 B-parameter MoE being trained for a 128K-token context. DeepSeek-V3's published recipe (paper Section 4.3) is a two-phase context extension after the main pretraining run:
| Phase | Context | Scale s | Tokens trained | Notes |
|---|---|---|---|---|
| Pretraining | 4,096 | 1 (no YaRN) | 14.8 T | Standard RoPE on 14.8 T tokens |
| Extension Phase 1 | 32,768 | 8.0 | ~10 B | YaRN turned on; α = 1, β = 32 |
| Extension Phase 2 | 131,072 | 32.0 | ~10 B | Same YaRN params, bigger s |
Three things to notice about this recipe. First, the extension tokens are about 0.1% of the pretraining tokens (roughly 20 B against 14.8 T): YaRN is cheap because the underlying frequency change is small and the model only needs to adapt to a new dot-product distribution, not relearn language. Second, both extension phases use the same $\alpha$ and $\beta$; only $s$ changes. Third, going from 32K to 128K (another 4× scale) requires another ~10 B tokens, not 4× more, because each subsequent stretch is mostly re-fitting the attention temperature rather than learning new long-range patterns.
On a 2,048-GPU H800 cluster, each phase is roughly 36 hours of wallclock. That means a base 4K-context model becomes a production 128K-context model in about three days of training, on top of two months of pretraining. YaRN is in the rare category of techniques where the implementation cost and the training cost are both negligible compared to their benefit — most of what you are paying for is the 128K-context data pipeline, not the algorithmic change.
What changes at scale that did not appear in the toy example
- FP32 cos/sin tables become a memory line item. At 128K context, the FP32 cos/sin cache is about 32 MB, per head, per device. For DeepSeek-V3 with 128 heads per layer and 61 layers, that is 250 GB of cache if you built it naively. The fix: share one cos/sin cache across heads (legal because every head uses the same RoPE frequency schedule) and reuse across layers. The actual memory cost drops to 32 MB, which is trivial.
- The KV cache grows linearly with context. A single 128K-token sequence holds roughly 30 GB of keys and values (FP8 K and V, 64 KV heads after Multi-Head Latent Attention compression, 61 layers). This dwarfs the cos/sin cost but is independent of YaRN: it is the cost of having long context at all. YaRN enables you to use that KV cache by making the model attend correctly at long range; the KV cache itself is what the next chapter is about.
- Long-context training data is the real bottleneck. Pretraining corpora are heavy on web pages and dialogues, both of which are short. To fine-tune YaRN to 128K context you need a curated stream of long documents — books, technical manuals, code repos, multi-document QA pairs — which is expensive to collect at scale. DeepSeek-V3's extension phases use a heavy upsample of long-document data; ~70% of extension tokens are from sequences longer than 16K.
- Attention quality degrades super-linearly with $s$. At $s = 8$ the model recovers near-pretraining perplexity after ~10 B tokens; at $s = 32$, the same 10 B tokens leave a small but measurable gap. The cure is more tokens, but the law looks like $\text{tokens} \propto s^{1.5}$, which means each doubling of context costs ~2.8× more extension training. This is the dominant practical limit on how far you can push a single YaRN run.
Engineering Reality: What YaRN Costs You
YaRN is conceptually clean, but every production team that ships it discovers the same handful of gotchas. Here is the short list before you bring up your own long-context extension run.
Why ship YaRN over the alternatives
| Method | Train tokens for 8× ext. | PPL gap vs base | Implementation |
|---|---|---|---|
| Position Interpolation (PI) | ~50 B | +2.1% | 1 line |
| NTK-aware scaling | ~0 (zero-shot ok) | +0.8% at s=4, +6% at s=16 | 1 line |
| ALiBi (alternative encoding) | Full retrain | +0.4% | different attention |
| YaRN | ~10 B | +0.3% at s=8 | 1 module, 50 lines |
| LongRoPE (DeepSeek-V3+) | ~5 B | +0.2% at s=32 | YaRN + per-dim ramp search |
The table makes the choice obvious: YaRN gives you the best-quality long-context model for the smallest extension budget, with an implementation footprint of one module. Even LongRoPE, the most recent improvement, is just YaRN with a learned ramp instead of the linear default — same algorithm, more tuning. Every production long-context LLM shipped since late 2023 — DeepSeek-V2 (128K), DeepSeek-V3 (128K), Qwen2-72B (32K), Llama-3.1 (128K via a related approach), Yi-200K, Mistral-Large — is sitting on top of either YaRN or a small variant of it.
The takeaway: YaRN is the cheap-and-correct answer to long context. It costs three days of training, ten billion tokens of long-document data, and a fifty-line module change. In return you get a 32× context window with sub-1% perplexity degradation. The reason every modern frontier model uses it (or a near-variant) is not that there is no alternative — there are several — but that none of them are this cheap, this clean, or this empirically robust across model scales.