Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021
Learning Objectives
After completing this section, you will be able to:
Explain why encoding position through rotation is fundamentally different from adding position to embeddings or biasing attention scores, and what mathematical advantage this provides.
Derive and interpret the 2D rotation formula that transforms Query and Key vectors, understanding each symbol and matrix operation involved.
Understand the multi-frequency design that gives RoPE both local and global position sensitivity — and why this allows generalization to unseen sequence lengths.
Implement RoPE from scratch in NumPy and PyTorch, applying it to the shared "the cat sat on the mat" example.
Connect RoPE to modern systems including LLaMA, GPT-NeoX, PaLM, CodeLlama, and understand how it interacts with Flash Attention, KV-cache, and context length extension techniques like YaRN and NTK-aware scaling.
Where RoPE appears: LLaMA 1/2/3 (Meta), GPT-NeoX (EleutherAI), PaLM (Google), CodeLlama, Mistral, Qwen, Falcon, Phi. Most major open-weight LLMs released since 2022 use RoPE, and it has become the default positional encoding for modern transformers.
The Real Problem
Transformers process all tokens in parallel. Unlike RNNs, they have no built-in notion of sequential order. Without positional information, the sentence "the cat sat on the mat" would be indistinguishable from "mat the on sat cat the." The attention mechanism computes $QK^\top$, which depends only on what the tokens are, not where they are. This is the position encoding problem, and it has been solved in several ways, each with different tradeoffs.
Position as Addition — The Limitation
The original Transformer (Vaswani et al., 2017) solved this by adding a position vector to each token embedding: $x_m = \text{token}_m + \text{PE}(m)$. This is simple and elegant, but it has two fundamental problems.
First, the position signal gets diluted as it passes through layers. After multiple rounds of self-attention, layer normalization, and feed-forward transformations, the additive positional component can lose its influence. The model must "remember" the position it was told about many layers ago.
Second, the dot product $(x_m + \text{PE}(m))^\top (x_n + \text{PE}(n))$ expands into four terms: $x_m^\top x_n + x_m^\top \text{PE}(n) + \text{PE}(m)^\top x_n + \text{PE}(m)^\top \text{PE}(n)$. Only the last term captures purely positional interaction. The cross terms $x_m^\top \text{PE}(n)$ and $\text{PE}(m)^\top x_n$ mix content and position in ways the model must learn to disentangle, which costs capacity.
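The four-term expansion can be checked numerically. This is a minimal sketch, assuming random stand-in vectors for the token embeddings and position vectors; the names `x_m`, `pe_m`, and `terms` are illustrative and not part of the chapter's shared example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Random stand-ins (assumption): two token embeddings and two position vectors.
x_m, x_n = rng.normal(size=d), rng.normal(size=d)
pe_m, pe_n = rng.normal(size=d), rng.normal(size=d)

# Dot product of the two position-augmented embeddings.
full = (x_m + pe_m) @ (x_n + pe_n)

# The same quantity, split into its four terms.
terms = {
    "content-content": x_m @ x_n,
    "content-position": x_m @ pe_n,
    "position-content": pe_m @ x_n,
    "position-position": pe_m @ pe_n,
}

for name, value in terms.items():
    print(f"{name:18s} {value:+.4f}")
print(f"{'sum':18s} {sum(terms.values()):+.4f}  (full = {full:+.4f})")
```

Only the `position-position` entry is free of content; the two middle entries are the cross terms the paragraph describes.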
Position as Score Bias — Better but Still External
Relative position biases (Chapter 7) improve on this by adding a learned bias directly to the attention score: $\text{score}(m,n) = Q_m \cdot K_n + b(m-n)$. This cleanly separates content matching from position matching. However, the bias $b(m-n)$ is typically a learned lookup table with a fixed maximum distance. When the model encounters a sequence longer than what it was trained on, the bias has no entry for large $|m-n|$ values, and generalization breaks down.
The core challenge: We need a method that (1) makes the dot product $Q_m \cdot K_n$ inherently depend on relative position $(m-n)$, (2) does not dilute through layers because it is applied directly at the attention computation, and (3) generalizes smoothly to longer sequences without requiring learned parameters for each distance. RoPE achieves all three.
From Intuition to Rotation
The Rotation Insight
Jianlin Su and his coauthors at ZhuiYi Technology asked a beautiful question: is there a transformation $f(x, m)$ that we can apply to both Q and K such that the inner product $\langle f(Q_m, m), f(K_n, n) \rangle$ depends only on $Q_m$, $K_n$, and the relative position $(m-n)$?
The answer comes from a well-known property of 2D rotations. If you rotate a vector $a$ by angle $\alpha$ and another vector $b$ by angle $\beta$, their dot product depends on the angles only through their difference:
$(R(\alpha)a) \cdot (R(\beta)b) = a^\top R(\alpha)^\top R(\beta)\, b = a^\top R(\beta-\alpha)\, b$
This is the key identity. It holds because a rotation matrix satisfies $R(\alpha)^\top = R(-\alpha)$. If we set $\alpha = m\theta$ and $\beta = n\theta$ for some frequency $\theta$, then the dot product after rotation depends on $(m-n)\theta$, which is exactly the relative position scaled by frequency. This is not an approximation. It is an exact mathematical consequence of rotation in 2D.
The idea is simple: rotate Q by position m and rotate K by position n before computing their dot product. The dot product will automatically encode relative position.
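The identity is easy to confirm numerically. This is a minimal sketch for a single 2D pair, assuming arbitrary example vectors `q` and `k` and a single frequency `theta = 1.0`; `rot` and `rotated_score` are hypothetical helper names.

```python
import numpy as np

def rot(angle):
    """Standard 2D counterclockwise rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def rotated_score(q, k, m, n, theta=1.0):
    """Dot product of q rotated by m*theta with k rotated by n*theta."""
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# Arbitrary example vectors (assumption, not from the shared example).
q = np.array([0.3, -1.2])
k = np.array([0.7, 0.5])

# Three position pairs with the same offset n - m = 2.
s_a = rotated_score(q, k, 0, 2)
s_b = rotated_score(q, k, 3, 5)
s_c = rotated_score(q, k, 10, 12)

# A pair with a different offset, for contrast.
s_other = rotated_score(q, k, 0, 3)

print(s_a, s_b, s_c)   # all three agree
print(s_other)         # differs, because the offset differs
```

Shifting both positions by the same amount leaves the score unchanged; only the offset between them matters.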
The Multi-Frequency Design
A single rotation frequency can only encode one scale of position. To encode both "this word is 2 positions away" (local) and "this word is 500 positions away" (global), RoPE uses multiple frequencies, one for each pair of embedding dimensions. This is directly inspired by the sinusoidal positional encoding from the original Transformer, which also uses a geometric sequence of frequencies.
With embedding dimension $d$, RoPE splits the vector into $d/2$ pairs. Each pair $i$ gets its own frequency $\theta_i = 1/10000^{2i/d}$. The first pair ($i=0$) has $\theta_0 = 1.0$, so it rotates by 1 radian per position (fast, local). The last pair has a tiny $\theta$ and barely rotates even at position 1000 (slow, global). Together, these form a multi-scale positional representation, analogous to wavelets or Fourier decomposition.
The Mathematical Definition
Symbol-by-Symbol Breakdown
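RoPE transforms a vector $x$ at position $m$ by applying a 2D rotation to each pair of dimensions. This chapter pairs the $i$-th first-half dimension with the $i$-th second-half dimension (the "rotate-half" layout used by GPT-NeoX-style implementations; the RoFormer paper pairs adjacent dimensions instead, which differs only by a fixed permutation of the dimensions). For dimension pair $i$, the rotation is:
$x_i' = x_i \cos(m\theta_i) - x_{i+d/2} \sin(m\theta_i)$
$x_{i+d/2}' = x_i \sin(m\theta_i) + x_{i+d/2} \cos(m\theta_i)$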
Let us define every symbol precisely:
| Symbol | Meaning | In Our Example |
| --- | --- | --- |
| $m$ | Absolute position of the token in the sequence | The=0, cat=1, sat=2, on=3, mat=4 |
| $d$ | Embedding dimension (total number of dimensions) | 4 |
| $i$ | Pair index, ranging from 0 to $d/2 - 1$ | 0 or 1 (two pairs for $d=4$) |
| $\theta_i$ | Frequency for pair $i$: $1/10000^{2i/d}$ | $\theta_0 = 1.0$, $\theta_1 = 0.01$ |
| $m\theta_i$ | Rotation angle = position × frequency | cat: 1×1.0 = 1.0 rad, 1×0.01 = 0.01 rad |
| $x_i$ | $i$-th element from the first half of dimensions | For Q[cat]: $x_0 = 0.0$, $x_1 = 2.0$ |
| $x_{i+d/2}$ | $i$-th element from the second half (paired with $x_i$) | For Q[cat]: $x_2 = 0.0$, $x_3 = 1.0$ |
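The symbols above map directly onto a few lines of NumPy. This is a minimal sketch for a single vector, assuming the first-half/second-half pairing described in this section; `rope_rotate` is a hypothetical helper name.

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Rotate one vector x (length d) as if it sat at position m."""
    d = x.shape[-1]
    half = d // 2
    theta = 1.0 / base ** (2 * np.arange(half) / d)  # theta_i, one per pair
    angle = m * theta                                # m * theta_i
    x1, x2 = x[:half], x[half:]                      # x_i and x_{i+d/2}
    return np.concatenate([
        x1 * np.cos(angle) - x2 * np.sin(angle),     # x_i'
        x1 * np.sin(angle) + x2 * np.cos(angle),     # x_{i+d/2}'
    ])

q_cat = np.array([0.0, 2.0, 0.0, 1.0])   # Q[cat] from the shared example
q_cat_rot = rope_rotate(q_cat, m=1)      # "cat" sits at position 1
print(q_cat_rot.round(4))
```

Because each pair is only rotated, the vector's length is unchanged; position 0 leaves the vector exactly as it was.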
The Frequency Formula
The frequency for dimension pair $i$ is: $\theta_i = \frac{1}{10000^{2i/d}}$
This creates a geometric sequence from $\theta_0 = 1.0$ down to $\theta_{d/2-1} \approx 0.0001$ for typical $d=128$. The base 10000 was chosen by Vaswani et al. (2017) for the original sinusoidal encoding and reused by Su et al. for RoPE.
For our example with d=4:
Pair 0 ($i=0$): $\theta_0 = 1/10000^{0/4} = 1/10000^{0} = 1.000000$, a fast rotation of 1 radian per position (57.3° per step)
Pair 1 ($i=1$): $\theta_1 = 1/10000^{2/4} = 1/10000^{0.5} = 1/100 = 0.010000$, a slow rotation of 0.01 radian per position (0.57° per step)
Pair 0 rotates 100 times faster than pair 1. This means pair 0 distinguishes nearby positions (local context), while pair 1 distinguishes distant positions (global context). In a real model with $d=128$, you would have 64 pairs spanning roughly four orders of magnitude in frequency (from 1.0 down to about 0.0001).
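The frequency schedule can be computed in one vectorized expression. This is a minimal sketch, assuming the formula $\theta_i = 1/10000^{2i/d}$ given above; `rope_frequencies` is a hypothetical helper name.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Return the d/2 per-pair frequencies theta_i = 1 / base^(2i/d)."""
    i = np.arange(d // 2)
    return 1.0 / base ** (2 * i / d)

theta_small = rope_frequencies(4)     # the chapter's toy example
theta_large = rope_frequencies(128)   # a typical per-head dimension

# In a geometric sequence, the ratio between neighbors is constant.
ratios = theta_large[1:] / theta_large[:-1]

print(theta_small)                        # fast pair, slow pair
print(theta_large[0], theta_large[-1])    # fastest and slowest of 64 pairs
```

For `d=128` the slowest frequency comes out slightly above 0.0001, which is why the range is a little under four orders of magnitude.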
Why the Dot Product Depends Only on Relative Position
Consider a single dimension pair with frequency $\theta$. Write the pair's components as $(q_1, q_2)$ for Q at position $m$ and $(k_1, k_2)$ for K at position $n$. After rotating both, the contribution of this pair to the dot product is:
$(q_1 k_1 + q_2 k_2)\cos((m-n)\theta) + (q_1 k_2 - q_2 k_1)\sin((m-n)\theta)$
Every term depends on $(m-n)\theta$, the relative distance scaled by frequency. The absolute positions $m$ and $n$ have vanished. This is the mathematical guarantee that RoPE provides: the attention score is a function of content and relative position only.
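The per-pair contribution can be evaluated directly from the relative offset, without rotating anything. This is a sketch assuming the closed form $(q_1 k_1 + q_2 k_2)\cos((m-n)\theta) + (q_1 k_2 - q_2 k_1)\sin((m-n)\theta)$, which follows from the rotation formulas in this section; `pair_score` is a hypothetical helper name.

```python
import numpy as np

def pair_score(q1, q2, k1, k2, m, n, theta):
    """Contribution of one rotated dimension pair to the Q.K dot product."""
    delta = (m - n) * theta
    return (q1 * k1 + q2 * k2) * np.cos(delta) + (q1 * k2 - q2 * k1) * np.sin(delta)

theta = np.array([1.0, 0.01])              # frequencies for d = 4
q_cat = np.array([0.0, 2.0, 0.0, 1.0])     # Q[cat], position m = 1
k_the = np.array([0.0, 1.0, 0.0, 1.0])     # K[The], position n = 0

# Pair i couples dimension i with dimension i + d/2 (here i + 2).
score = sum(
    pair_score(q_cat[i], q_cat[i + 2], k_the[i], k_the[i + 2], 1, 0, theta[i])
    for i in range(2)
)
print(round(float(score), 4))
```

The result agrees with the raw score for the (cat, The) entry in the execution trace later in the chapter, about 3.0098.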
Why V is NOT rotated: Values carry the content that gets aggregated by attention. Position should affect which tokens attend to each other (via Q and K), not what information flows between them. Rotating V would corrupt the content representation with positional artifacts.
Interactive: 2D Rotation Visualizer
[Interactive figure: shows how RoPE rotates a single dimension pair as the selected token's position changes. "The" (position 0) has zero rotation, while later tokens rotate progressively more.]
Interactive: Multi-Frequency Explorer
[Interactive figure: shows all dimension pairs simultaneously as a position slider moves. Low-index pairs (red) oscillate rapidly while high-index pairs (purple/pink) barely move. This multi-scale structure is what enables RoPE to encode both local and global position.]
Step-by-Step Calculation
We use the same shared example from every chapter: tokens = ["The", "cat", "sat", "on", "mat"] with d=4 and the standard Q, K, V matrices. The implementation pairs the first half of dimensions with the second half: pair 0 = (col 0, col 2), pair 1 = (col 1, col 3).
Step 1: Compute Frequencies
With $d=4$ and base = 10000, we have 2 dimension pairs:
$\theta_0 = 1/10000^{0/4} = 1.000000$ (pair 0: dims 0 and 2)
$\theta_1 = 1/10000^{2/4} = 0.010000$ (pair 1: dims 1 and 3)
Rotation angles = position × frequency:
| Token | Position | Angle (pair 0) | Angle (pair 1) |
| --- | --- | --- | --- |
| The | 0 | 0 × 1.0 = 0.0000 | 0 × 0.01 = 0.0000 |
| cat | 1 | 1 × 1.0 = 1.0000 | 1 × 0.01 = 0.0100 |
| sat | 2 | 2 × 1.0 = 2.0000 | 2 × 0.01 = 0.0200 |
| on | 3 | 3 × 1.0 = 3.0000 | 3 × 0.01 = 0.0300 |
| mat | 4 | 4 × 1.0 = 4.0000 | 4 × 0.01 = 0.0400 |
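The angle table is an outer product of positions and frequencies. This is a minimal sketch assuming the two frequencies from Step 1; the variable names are illustrative.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]
theta = np.array([1.0, 0.01])            # theta_0, theta_1 for d = 4

positions = np.arange(len(tokens))       # 0, 1, 2, 3, 4
angles = np.outer(positions, theta)      # shape (5, 2): position x frequency

for token, row in zip(tokens, angles):
    print(f"{token:>4s}  pair0={row[0]:.4f}  pair1={row[1]:.4f}")
```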
Step 2: Rotate All Q Vectors
For each token, we split Q into first-half x1=Q[:,:2] and second-half x2=Q[:,2:], then apply the rotation:
"The" (pos=0): angle = 0 for both pairs, so cos=1, sin=0. Qrot[The] = Q[The] = [1.0000, 0.0000, 1.0000, 0.0000] — no change.
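In the scaled score matrix (raw scores divided by $\sqrt{d} = 2$), notice how "on" (pos=3) gives its highest score to itself (1.0000) and its next highest to its neighbor "mat" (pos=4, score 0.8058), while "cat" (pos=1, score −0.6627) and "sat" (pos=2, score −0.4257) receive negative scores. Note that "sat" is exactly as close to "on" as "mat" is, and "The" (pos=0, score 0.4848) is the farthest token yet scores positively. The scores therefore reflect content and relative position together, not a simple preference for nearby tokens.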
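Steps 2 through 4 (rotate Q and K, take dot products, scale by $\sqrt{d}$) can be reproduced as follows. This is a minimal sketch using the shared Q and K matrices from this chapter; `apply_rotation` here is a standalone helper written for illustration.

```python
import numpy as np

def apply_rotation(X, base=10000.0):
    """Rotate row m of X by position m, pairing dim i with dim i + d/2."""
    n_tokens, d = X.shape
    theta = 1.0 / base ** (2 * np.arange(d // 2) / d)
    angles = np.outer(np.arange(n_tokens), theta)
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = X[:, : d // 2], X[:, d // 2 :]
    return np.concatenate(
        [x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a], axis=-1
    )

Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.5]])

Q_rot, K_rot = apply_rotation(Q), apply_rotation(K)
raw_scores = Q_rot @ K_rot.T                       # (5, 5)
scaled_scores = raw_scores / np.sqrt(Q.shape[1])   # divide by sqrt(4) = 2

print(scaled_scores.round(4))
```

The diagonal of `raw_scores` equals the unrotated dot products $Q_m \cdot K_m$, since a query and key at the same position have zero relative offset.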
Step 5: Apply Softmax
Softmax converts scaled scores to probabilities (each row sums to 1.0). Detailed computation for "The" (row 0):
Observation: Every token gives its highest attention weight to itself or to an adjacent token. "on" and "mat" (positions 3 and 4) attend heavily to each other (0.2889 and 0.3260). "cat" attends most to "The" (0.4052), its immediate predecessor. With Q and K this small, content still shapes the pattern (the unrotated dot products already favor several of these pairs), so the result shows RoPE modulating content scores by relative position rather than position acting alone. Compare with the content-only attention in Chapter 1.
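The softmax for the "The" row can be worked through explicitly. This is a minimal sketch that starts from the four-decimal scaled scores printed in the execution trace, so the result matches the chapter's weights only up to that rounding; `softmax` is a local helper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Scaled scores for "The" (row 0), as printed to four decimals.
scaled_row_the = np.array([0.0000, 0.5403, 0.2466, -0.5656, -0.6794])

weights_the = softmax(scaled_row_the)
print(weights_the.round(4))
```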
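Output Matrix: RoPE (5×4)
| Token | d0 | d1 | d2 | d3 |
| --- | --- | --- | --- | --- |
| The | 0.2472 | 0.3885 | 0.3023 | 0.1620 |
| cat | 0.4620 | 0.1468 | 0.3026 | 0.2023 |
| sat | 0.2697 | 0.2762 | 0.4035 | 0.1668 |
| on | 0.3540 | 0.2109 | 0.2287 | 0.4952 |
| mat | 0.3472 | 0.2224 | 0.2418 | 0.4635 |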
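The output matrix is the attention weights multiplied by the unrotated V. This is a minimal sketch that starts from the four-decimal weights printed in the execution trace, so it agrees with the table above only up to that rounding.

```python
import numpy as np

# Attention weights as printed in the trace (rows: The, cat, sat, on, mat).
weights = np.array([[0.1972, 0.3385, 0.2523, 0.1120, 0.1000],
                    [0.4052, 0.0900, 0.2457, 0.1454, 0.1138],
                    [0.2116, 0.2181, 0.3454, 0.1088, 0.1162],
                    [0.2095, 0.0665, 0.0843, 0.3508, 0.2889],
                    [0.2098, 0.0849, 0.1044, 0.3260, 0.2749]])

# Shared V matrix. V is not rotated.
V = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.5, 0.5, 0.5, 0.5]])

output = weights @ V   # (5, 4): each row is a weighted blend of V's rows
print(output.round(4))
```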
Interactive: RoPE vs Standard Attention
[Interactive heatmap: compares RoPE attention weights with standard (position-free) attention. Its "Difference" view highlights where RoPE increases (green) or decreases (red) attention relative to the standard mechanism.]
Applications Across Domains
RoPE was designed for language modeling but its mathematical properties make it valuable across many domains:
| Domain | Application | Why RoPE Helps |
| --- | --- | --- |
| Large Language Models | LLaMA, Mistral, Qwen, Phi | Enables clean relative position without learned parameters. Models can be extended to longer contexts via NTK-aware scaling or YaRN. |
| Code Generation | CodeLlama, DeepSeek-Coder | Code has deeply nested structure with long-range dependencies (function calls hundreds of lines away). RoPE's multi-frequency bands capture both local syntax and global structure. |
| Vision Transformers | EVA-02, InternVL | 2D images are flattened to 1D token sequences. RoPE can be extended to 2D by applying separate rotations for row and column positions. |
| Scientific Computing | ESM-2 (protein language model) | Protein residue interactions depend on sequence distance. RoPE's relative position property fits this chain structure naturally. |
| Long-Context Tasks | 128K+ token models | RoPE extrapolates more gracefully than learned embeddings. YaRN scaling extends 4K models to 128K+ with minimal fine-tuning. |
Connection to Modern Systems
Flash Attention Compatibility
RoPE is applied before the attention computation: you rotate Q and K, then pass them to standard scaled dot-product attention. This means RoPE is fully compatible with Flash Attention (Chapter 13) — the rotation happens outside the memory-efficient kernel. In practice, most implementations apply RoPE as a preprocessing step, then call flash_attn_func(Q_rot, K_rot, V).
KV-Cache and Incremental Rotation
During autoregressive generation, K and V are cached to avoid recomputation. With RoPE, each K vector is rotated by its position once when generated, then stored in the cache already rotated. This means the KV-cache works identically to standard attention — no recomputation of rotations is needed when the cache grows.
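The cache behavior can be illustrated with a toy decoder loop. This is a minimal sketch, not a production cache: `RotatedKeyCache` and `rope_rotate` are hypothetical names, and the keys and query come from the chapter's shared example.

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Rotate one vector as if it sat at position m (first-half/second-half pairing)."""
    d = x.shape[-1]
    theta = 1.0 / base ** (2 * np.arange(d // 2) / d)
    a = m * theta
    x1, x2 = x[: d // 2], x[d // 2 :]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x1 * np.sin(a) + x2 * np.cos(a)])

class RotatedKeyCache:
    """Stores each key already rotated by the position at which it arrived."""

    def __init__(self):
        self.keys = []

    def append(self, k):
        position = len(self.keys)                # next free position
        self.keys.append(rope_rotate(k, position))

    def scores(self, q):
        # q belongs to the newest token, whose key was just appended.
        position = len(self.keys) - 1
        return np.stack(self.keys) @ rope_rotate(q, position)

K = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.5]])
q_mat = np.array([1.0, 0.0, 0.0, 1.0])           # Q[mat], position 4

cache = RotatedKeyCache()
for k in K:                                      # keys arrive one token at a time
    cache.append(k)

last_row = cache.scores(q_mat)                   # raw scores for "mat"
print(last_row.round(4))
```

Each key is rotated exactly once, on arrival. Older entries are never touched again as the cache grows, and the final row matches the raw scores for "mat" in the execution trace.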
Context Length Extension
Because RoPE uses a mathematical formula (not a learned table), it can be extended to longer sequences by modifying the frequency base:
NTK-aware interpolation (bloc97, 2023): scales the base from 10000 to a larger value, effectively stretching the wavelengths so existing RoPE representations cover more positions without retraining.
YaRN (Peng et al., 2023): combines NTK scaling with an attention temperature adjustment. Extends a 4K context model to 128K+ with a small amount of fine-tuning.
Dynamic NTK: adjusts the scaling factor at inference time based on actual sequence length, providing automatic adaptation.
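The effect of changing the base can be seen directly on the frequencies. This is a sketch assuming the commonly cited NTK-aware rule `base' = base * scale^(d / (d - 2))`; treat that exponent as an assumption here rather than a statement of any particular library's behavior. `rope_frequencies` and `ntk_scaled_base` are hypothetical helper names.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Per-pair frequencies theta_i = 1 / base^(2i/d)."""
    return 1.0 / base ** (2 * np.arange(d // 2) / d)

def ntk_scaled_base(base, scale, d):
    """Assumed NTK-aware rule: enlarge the base so the slowest pair slows by `scale`."""
    return base * scale ** (d / (d - 2))

d, scale = 128, 4.0                      # e.g. stretch a context window 4x
theta_orig = rope_frequencies(d)
theta_ntk = rope_frequencies(d, base=ntk_scaled_base(10000.0, scale, d))

print(theta_orig[0], theta_ntk[0])       # fastest pair: unchanged
print(theta_orig[-1], theta_ntk[-1])     # slowest pair: divided by `scale`
```

Under this rule the fastest pair keeps its resolution for local ordering, while the slowest pair's wavelength is stretched by the full scale factor. Pairs in between are stretched by intermediate amounts.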
Multi-Head Latent Attention (MLA)
DeepSeek-V2 and V3 (Chapter 15) combine RoPE with low-rank key-value compression. They split the query and key into a RoPE portion (carrying positional information) and a content portion (compressed via low-rank projection). This decouples position from content at the architectural level, achieving massive KV-cache savings while preserving RoPE's positional properties.
Complexity Analysis
| Operation | Complexity | Notes |
| --- | --- | --- |
| Compute frequencies | O(d/2) | Once per model initialization |
| Compute angles | O(N × d/2) | Outer product of positions and frequencies |
| Apply rotation | O(N × d) | Element-wise multiply + subtract/add per token |
| Dot-product scores | O(N² × d) | Same as standard attention; RoPE adds no cost here |
| Total overhead vs standard | O(N × d) | Rotation is linear in sequence length; negligible vs O(N²d) |
RoPE adds zero parameters to the model: the frequencies are computed from a formula, not learned. The computational overhead is $O(Nd)$ for the rotation step, which is negligible compared to the $O(N^2 d)$ attention computation. In practice, the rotation is often fused with neighboring kernels, and its share of wall-clock time is small.
Python Implementation
Full NumPy implementation with class structure, followed by a line-by-line execution trace with actual matrix values from "the cat sat on the mat."
RoPE Attention: NumPy Implementation (rope_attention.py)
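A minimal NumPy sketch consistent with the execution trace that follows. The class and method names (`RotaryPositionAttention`, `compute_angles`, `apply_rotation`, `_softmax`, `forward`) are taken from the trace, but the method bodies are a reconstruction under that assumption and not the original listing, so the line numbers quoted in the trace do not correspond to lines of this sketch.

```python
import math
import numpy as np


class RotaryPositionAttention:
    """RoPE attention sketch: rotate Q and K by position, leave V unrotated."""

    def __init__(self, d_model, base=10000):
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)
        # One frequency per dimension pair: theta_i = 1 / base^(2i/d)
        self.theta = np.array(
            [1.0 / (base ** (2 * i / d_model)) for i in range(d_model // 2)]
        )

    def compute_angles(self, n_tokens):
        # Outer product: every position times every frequency -> (N, d/2)
        return np.outer(np.arange(n_tokens), self.theta)

    def apply_rotation(self, X):
        N, D = X.shape
        angles = self.compute_angles(N)
        cos_a, sin_a = np.cos(angles), np.sin(angles)
        x1, x2 = X[:, : D // 2], X[:, D // 2 :]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return np.concatenate([rotated_first, rotated_second], axis=-1)

    @staticmethod
    def _softmax(scores):
        shifted = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    def forward(self, Q, K, V):
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        scaled = (Q_rot @ K_rot.T) / self.scale
        weights = self._softmax(scaled)
        return weights @ V, weights   # V is aggregated unrotated


Q = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.5, 0.5]])
V = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]])

rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)
print(weights.round(4))
print(output.round(4))
```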
Line 1: import numpy as np
NumPy provides vectorized matrix operations. Q_rot @ K_rot.T runs as optimized C code, not Python loops.
Line 2: import math
Python standard library. We use math.sqrt() to precompute the scaling factor √d_model.
Line 4: class RotaryPositionAttention
Wraps RoPE in a reusable class. The key method is apply_rotation(), which rotates Q and K by their position before computing attention. V is never rotated.
Line 14: def __init__(self, d_model, base)
Constructor. Takes model dimension and base frequency (10000 in the original paper). Precomputes per-pair frequencies θᵢ = 1/base^(2i/d).
EXECUTION STATE
⬇ input: d_model = 4
⬇ input: base = 10000
Line 15: self.d_model = d_model
Store model dimension (4).
EXECUTION STATE
self.d_model = 4
Line 16: self.base = base
Store base frequency. 10000 is from Vaswani et al. (2017), reused in RoPE.
EXECUTION STATE
self.base = 10000
Line 17: self.scale = math.sqrt(d_model)
Precompute √d_model for score scaling (same as Chapter 1).
EXECUTION STATE
math.sqrt(4) = 2.0
self.scale = 2.0
Line 19: self.theta = np.array([...])
Precompute one frequency per dimension pair. With d=4, we have 2 pairs. θ₀=1.0 (fast rotation), θ₁=0.01 (slow rotation).
compute_angles(N): np.outer() is the outer product, so every position is multiplied by every frequency
⬆ return: angles (5×2) =
θ₀ θ₁
The 0.0000 0.0000
cat 1.0000 0.0100
sat 2.0000 0.0200
on 3.0000 0.0300
mat 4.0000 0.0400
Line 35: def apply_rotation(self, X) -> np.ndarray
THE CORE RoPE OPERATION: rotate each row of X by its position. Splits X into first-half and second-half dimensions, then applies 2D rotation per pair. Position 0 gets zero rotation; higher positions rotate more.
EXECUTION STATE
⬇ input: X (5×4) = Q or K matrix — each row is a token's vector
⬆ returns = np.ndarray (5, 4) — position-rotated version of X
Line 37: N, D = X.shape
Unpack matrix dimensions.
EXECUTION STATE
N = 5 (tokens)
D = 4 (dimensions)
Line 38: angles = self.compute_angles(N)
Get rotation angles for all 5 positions.
EXECUTION STATE
angles (5×2) =
θ₀ θ₁
The 0.0000 0.0000
cat 1.0000 0.0100
sat 2.0000 0.0200
on 3.0000 0.0300
mat 4.0000 0.0400
Line 39: cos_a = np.cos(angles)
Compute cosines for all angles.
EXECUTION STATE
cos_a (5×2) =
pair0 pair1
The 1.000000 1.000000
cat 0.540302 0.999950
sat -0.416147 0.999800
on -0.989992 0.999550
mat -0.653644 0.999200
Line 40: sin_a = np.sin(angles)
Compute sines for all angles.
EXECUTION STATE
sin_a (5×2) =
pair0 pair1
The 0.000000 0.000000
cat 0.841471 0.010000
sat 0.909297 0.019999
on 0.141120 0.029996
mat -0.756802 0.039989
Line 41: x1 = X[:, :D // 2]
First half of dimensions: columns 0 and 1. These form the "first component" of each rotation pair.
EXECUTION STATE
:D // 2 = :2 — columns [0, 1]
x1 (for Q) =
d0 d1
The 1.0 0.0
cat 0.0 2.0
sat 1.0 1.0
on 0.0 0.0
mat 1.0 0.0
Line 42: x2 = X[:, D // 2:]
Second half of dimensions: columns 2 and 3. These are the "second component" paired with x1.
EXECUTION STATE
D // 2: = 2: — columns [2, 3]
x2 (for Q) =
d2 d3
The 1.0 0.0
cat 0.0 1.0
sat 1.0 0.0
on 1.0 1.0
mat 0.0 1.0
Line 43: rotated_first = x1 * cos_a - x2 * sin_a
First rotation component: x1·cos(θ) - x2·sin(θ). This is the standard 2D rotation formula applied element-wise per pair.
EXECUTION STATE
rotated_first (for Q) =
d0 d1
The 1.0000 0.0000
cat 0.0000 1.9899
sat -1.3254 0.9998
on -0.1411 -0.0300
mat -0.6536 -0.0400
Line 44: rotated_second = x1 * sin_a + x2 * cos_a
Second rotation component: x1·sin(θ) + x2·cos(θ). Completes the 2D rotation.
EXECUTION STATE
rotated_second (for Q) =
d2 d3
The 1.0000 0.0000
cat 0.0000 1.0199
sat 0.4932 0.0200
on -0.9900 0.9996
mat -0.7568 0.9992
Concatenated [rotated_first, rotated_second] (for Q) =
d0 d1 d2 d3
The 1.0000 0.0000 1.0000 0.0000
cat 0.0000 1.9899 0.0000 1.0199
sat -1.3254 0.9998 0.4932 0.0200
on -0.1411 -0.0300 -0.9900 0.9996
mat -0.6536 -0.0400 -0.7568 0.9992
Line 47: def compute_scores(self, Q_rot, K_rot)
Compute raw dot products between rotated Q and rotated K. Since both carry positional rotation, the dot product Q_rot[m]·K_rot[n] encodes relative position (m-n).
Matrix multiply Q_rot (5×4) with K_rot transposed (4×5) → (5×5). Entry (i,j) = dot product of rotated query i with rotated key j.
EXECUTION STATE
.T = transpose — K_rot (5×4) becomes (4×5)
⬆ return: raw_scores (5×5) =
The cat sat on mat
The 0.0000 1.0806 0.4932 -1.1311 -1.3589
cat 3.0098 0.0000 2.0099 0.9598 0.4698
sat 1.0198 1.0806 2.0000 -0.3112 -0.1796
on 0.9696 -1.3254 -0.8515 2.0000 1.6116
mat 0.9592 -0.8489 -0.4361 1.8414 1.5000
Line 51: def scale_scores(self, scores) -> np.ndarray
Divide every score by √d_model = √4 = 2.0 to prevent softmax saturation.
EXECUTION STATE
⬇ input: scores = shape (5, 5) — raw dot products
self.scale = 2.0 (√4)
⬆ returns = np.ndarray (5, 5) — scores ÷ 2.0
Line 53: return scores / self.scale
Elementwise division by 2.0.
EXECUTION STATE
⬆ return: scaled (5×5) =
The cat sat on mat
The 0.0000 0.5403 0.2466 -0.5656 -0.6794
cat 1.5049 0.0000 1.0049 0.4799 0.2349
sat 0.5099 0.5403 1.0000 -0.1556 -0.0898
on 0.4848 -0.6627 -0.4257 1.0000 0.8058
mat 0.4796 -0.4244 -0.2181 0.9207 0.7500
Line 55: def compute_weights(self, scaled) -> np.ndarray
Apply softmax row-wise to get attention probabilities.
EXECUTION STATE
⬇ input: scaled = shape (5, 5)
⬆ returns = np.ndarray (5, 5) — each row sums to 1.0
Line 57: return self._softmax(scaled)
Calls _softmax. All 5 rows become probability distributions.
EXECUTION STATE
⬆ return: weights (5×5) =
The cat sat on mat
The 0.1972 0.3385 0.2523 0.1120 0.1000
cat 0.4052 0.0900 0.2457 0.1454 0.1138
sat 0.2116 0.2181 0.3454 0.1088 0.1162
on 0.2095 0.0665 0.0843 0.3508 0.2889
mat 0.2098 0.0849 0.1044 0.3260 0.2749
Line 59: def compute_output(self, weights, V)
Weighted sum of value vectors. V is NOT rotated — only Q and K receive positional information through rotation.
EXECUTION STATE
⬇ input: weights (5×5) = attention probabilities
⬇ input: V (5×4) =
d0 d1 d2 d3
The 1.0 0.0 0.0 0.0
cat 0.0 1.0 0.0 0.0
sat 0.0 0.0 1.0 0.0
on 0.0 0.0 0.0 1.0
mat 0.5 0.5 0.5 0.5
⬆ returns = np.ndarray (5, 4) — context vectors
Line 61: return weights @ V
Matrix multiply weights (5×5) with V (5×4) → (5×4). Each output row blends all 5 value vectors.
EXECUTION STATE
⬆ return: output (5×4) =
d0 d1 d2 d3
The 0.2472 0.3885 0.3023 0.1620
cat 0.4620 0.1468 0.3026 0.2023
sat 0.2697 0.2762 0.4035 0.1668
on 0.3540 0.2109 0.2287 0.4952
mat 0.3472 0.2224 0.2418 0.4635
Line 63: def forward(self, Q, K, V)
Main entry point. Rotates Q and K by position, computes scaled dot-product attention, and aggregates V (unrotated). The rotation makes dot products position-aware without modifying the score computation itself.
EXECUTION STATE
⬇ input: Q (5×4) =
d0 d1 d2 d3
The 1.0 0.0 1.0 0.0
cat 0.0 2.0 0.0 1.0
sat 1.0 1.0 1.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.0 1.0
⬇ input: K (5×4) =
d0 d1 d2 d3
The 0.0 1.0 0.0 1.0
cat 1.0 0.0 1.0 0.0
sat 1.0 1.0 0.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.5 0.5
⬇ input: V (5×4) =
d0 d1 d2 d3
The 1.0 0.0 0.0 0.0
cat 0.0 1.0 0.0 0.0
sat 0.0 0.0 1.0 0.0
on 0.0 0.0 0.0 1.0
mat 0.5 0.5 0.5 0.5
⬆ returns = (output (5,4), weights (5,5))
Line 74: Q_rot = self.apply_rotation(Q)
Rotate all query vectors by their position. 'The' (pos 0) is unchanged; 'mat' (pos 4) gets maximum rotation.
EXECUTION STATE
Q_rot (5×4) =
d0 d1 d2 d3
The 1.0000 0.0000 1.0000 0.0000
cat 0.0000 1.9899 0.0000 1.0199
sat -1.3254 0.9998 0.4932 0.0200
on -0.1411 -0.0300 -0.9900 0.9996
mat -0.6536 -0.0400 -0.7568 0.9992
Line 75: K_rot = self.apply_rotation(K)
Rotate all key vectors by their position. Same rotation angles as Q (position-dependent).
EXECUTION STATE
K_rot (5×4) =
d0 d1 d2 d3
The 0.0000 1.0000 0.0000 1.0000
cat -0.3012 0.0000 1.3818 0.0000
sat -0.4161 0.9998 0.9093 0.0200
on -0.1411 -0.0300 -0.9900 0.9996
mat -0.2752 -0.0200 -1.0836 0.4996
Line 76: raw_scores = self.compute_scores(Q_rot, K_rot)
Compute Q_rot @ K_rot.T → 5×5 raw score matrix.
EXECUTION STATE
raw_scores (5×5) =
The cat sat on mat
The 0.0000 1.0806 0.4932 -1.1311 -1.3589
cat 3.0098 0.0000 2.0099 0.9598 0.4698
sat 1.0198 1.0806 2.0000 -0.3112 -0.1796
on 0.9696 -1.3254 -0.8515 2.0000 1.6116
mat 0.9592 -0.8489 -0.4361 1.8414 1.5000
Line 77: scaled_scores = self.scale_scores(raw_scores)
Divide by √4 = 2.0.
EXECUTION STATE
scaled_scores (5×5) =
The cat sat on mat
The 0.0000 0.5403 0.2466 -0.5656 -0.6794
cat 1.5049 0.0000 1.0049 0.4799 0.2349
sat 0.5099 0.5403 1.0000 -0.1556 -0.0898
on 0.4848 -0.6627 -0.4257 1.0000 0.8058
mat 0.4796 -0.4244 -0.2181 0.9207 0.7500
Line 78: weights = self.compute_weights(scaled_scores)
Apply softmax row-wise.
EXECUTION STATE
weights (5×5) =
The cat sat on mat
The 0.1972 0.3385 0.2523 0.1120 0.1000
cat 0.4052 0.0900 0.2457 0.1454 0.1138
sat 0.2116 0.2181 0.3454 0.1088 0.1162
on 0.2095 0.0665 0.0843 0.3508 0.2889
mat 0.2098 0.0849 0.1044 0.3260 0.2749
Line 79: output = self.compute_output(weights, V)
Weighted sum of UNROTATED V.
EXECUTION STATE
output (5×4) =
d0 d1 d2 d3
The 0.2472 0.3885 0.3023 0.1620
cat 0.4620 0.1468 0.3026 0.2023
sat 0.2697 0.2762 0.4035 0.1668
on 0.3540 0.2109 0.2287 0.4952
mat 0.3472 0.2224 0.2418 0.4635
Line 80: return output, weights
Return the context vectors and attention weights.
EXECUTION STATE
⬆ return: output = shape (5, 4)
⬆ return: weights = shape (5, 5) — each row sums to 1.0
Line 100: tokens = [...]
The 5 tokens used in every chapter.
EXECUTION STATE
tokens = ['The', 'cat', 'sat', 'on', 'mat']
Line 102: Q = np.array([...])
Query matrix — same across all chapters. Each row encodes what that token looks for.
EXECUTION STATE
Q (5×4) =
d0 d1 d2 d3
The 1.0 0.0 1.0 0.0
cat 0.0 2.0 0.0 1.0
sat 1.0 1.0 1.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.0 1.0
Line 110: K = np.array([...])
Key matrix — same across all chapters. Each row encodes what that token offers.
EXECUTION STATE
K (5×4) =
d0 d1 d2 d3
The 0.0 1.0 0.0 1.0
cat 1.0 0.0 1.0 0.0
sat 1.0 1.0 0.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.5 0.5
Line 118: V = np.array([...])
Value matrix. V is NOT rotated in RoPE — only Q and K carry position.
EXECUTION STATE
V (5×4) =
d0 d1 d2 d3
The 1.0 0.0 0.0 0.0
cat 0.0 1.0 0.0 0.0
sat 0.0 0.0 1.0 0.0
on 0.0 0.0 0.0 1.0
mat 0.5 0.5 0.5 0.5
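RoPE Attention: PyTorch Implementation
Line 15: def __init__(self, d_model: int, base: int = 10000)
Constructor. Registers frequency buffer and precomputes scale.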
EXECUTION STATE
⬇ input: d_model = 4
⬇ input: base = 10000
Line 16: super().__init__()
Initialize nn.Module. Required for PyTorch parameter tracking and hooks.
Line 17: self.d_model = d_model
Store model dimension.
EXECUTION STATE
self.d_model = 4
Line 18: self.base = base
Store base frequency.
EXECUTION STATE
self.base = 10000
Line 19: self.scale = math.sqrt(d_model)
Precompute √d_model = 2.0 for score scaling.
EXECUTION STATE
self.scale = 2.0
Line 22: theta = torch.tensor([...])
Create frequency tensor. Same formula as NumPy version: θᵢ = 1/base^(2i/d).
EXECUTION STATE
theta = tensor([1.0000, 0.0100])
Line 26: self.register_buffer("theta", theta)
Register as a buffer (not a parameter). Buffers are saved with the model and move to GPU with .to(device), but are NOT updated by the optimizer. Frequencies are fixed, not learned.
EXECUTION STATE
register_buffer() = saved in state_dict, moves with .to(device), NOT trained by optimizer
Line 34: def apply_rotation(self, X) -> torch.Tensor
Apply RoPE rotation. Identical to NumPy version but uses PyTorch ops for GPU and autograd support.
Line 99: output, weights = rope.forward(Q, K, V)
Run RoPE attention. The output matches the NumPy version to the four decimal places shown.
EXECUTION STATE
output shape = (5, 4)
weights shape = (5, 5)
Line 102: print(weights.detach().numpy().round(4))
.detach() removes from computation graph (no gradient tracking for printing). .numpy() converts to NumPy for display.
EXECUTION STATE
.detach() = detach from autograd graph — required before .numpy()
.numpy() = convert PyTorch tensor to NumPy array
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class RotaryPositionAttention(nn.Module):
    """
    RoPE: Rotary Position Embedding (Su et al., 2021)
    PyTorch implementation with GPU support.

    Rotates Q and K by position before computing attention.
    V is NOT rotated. Supports automatic differentiation.
    """

    def __init__(self, d_model: int, base: int = 10000):
        super().__init__()
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)

        # Register frequencies as a buffer (not a parameter)
        theta = torch.tensor([
            1.0 / (base ** (2 * i / d_model))
            for i in range(d_model // 2)
        ])
        self.register_buffer("theta", theta)

        # In production: learned projection matrices
        # self.W_q = nn.Linear(d_model, d_model)
        # self.W_k = nn.Linear(d_model, d_model)
        # self.W_v = nn.Linear(d_model, d_model)
        # self.W_o = nn.Linear(d_model, d_model)

    def apply_rotation(self, X: torch.Tensor) -> torch.Tensor:
        """Apply RoPE rotation. X shape: (N, d_model)."""
        N, D = X.shape
        positions = torch.arange(N, device=X.device, dtype=X.dtype)
        angles = positions.unsqueeze(1) * self.theta.unsqueeze(0)
        cos_a = torch.cos(angles)
        sin_a = torch.sin(angles)
        x1 = X[:, :D // 2]
        x2 = X[:, D // 2:]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return torch.cat([rotated_first, rotated_second], dim=-1)

    def forward(
        self,
        Q: torch.Tensor,
        K: torch.Tensor,
        V: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            Q: (N, d_model) query matrix
            K: (N, d_model) key matrix
            V: (N, d_model) value matrix (NOT rotated)
        Returns:
            output: (N, d_model) context vectors
            weights: (N, N) attention weights
        """
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        scores = Q_rot @ K_rot.T / self.scale
        weights = F.softmax(scores, dim=-1)
        output = weights @ V
        return output, weights


# ── Shared Example ──
tokens = ["The", "cat", "sat", "on", "mat"]

Q = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

K = torch.tensor([
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.5, 0.5],
])

V = torch.tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
])

# ── Run ──
rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)

print("RoPE Weights (PyTorch):")
print(weights.detach().numpy().round(4))

print("\nRoPE Output (PyTorch):")
print(output.detach().numpy().round(4))
```
Key Takeaways
Rotation encodes position intrinsically. Unlike additive position embeddings or score biases, RoPE makes the dot product $Q_m \cdot K_n$ depend on relative position $(m-n)$ as a mathematical consequence, not an approximation.
Multi-frequency design captures multiple scales. Fast-rotating pairs encode local position (is this word the immediate neighbor?), while slow-rotating pairs encode global position (is it in the same paragraph?). This is why RoPE generalizes to longer sequences.
Zero additional parameters. Frequencies come from a formula, not a learned lookup table. This means no out-of-vocabulary positions and no parameter overhead.
V is NOT rotated. Only Q and K carry positional information through rotation. V provides the content that gets aggregated.
Compatible with all optimizations. RoPE works seamlessly with Flash Attention, KV-cache, multi-head/multi-query/grouped-query attention, and context length extension techniques like YaRN.
De facto standard for modern LLMs. LLaMA, Mistral, PaLM, Qwen, Falcon, Phi, and virtually every major open-weight model since 2022 uses RoPE.
Exercises
Verify the relative position property: Using the code above, create two identical vectors x=[1,1,1,1] and place them at positions (0, 2) and (3, 5). Verify that the dot product of their rotated versions is the same in both cases, confirming that the score depends only on the distance 2.
Frequency analysis: For d=128 and base = 10000, compute all 64 frequencies. What is the wavelength (in positions) of the fastest and slowest pair? How many full rotations does pair 0 complete over a 4096-token sequence?
NTK scaling experiment: Modify the code to use base = 100000 instead of 10000. Recompute the attention weights. How does this affect which tokens "on" attends to? Why would this help with longer sequences?
Compare with Chapter 7: Run both the relative position bias attention (Chapter 7) and RoPE attention on the same input. Which produces stronger local preference? What happens when you increase the sequence length to 20 tokens?
2D RoPE for images: Design a 2D RoPE scheme for a 4×4 image patch grid (16 tokens). Each token has a (row, col) position. Apply separate rotations for row-position and column-position using different frequency bands. What properties does this have compared to 1D RoPE?
References
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
bloc97. (2023). "NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size Without Any Fine-tuning and Minimal Perplexity Degradation." Reddit/r/LocalLLaMA.
Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071.
Black, S., Biderman, S., Hallahan, E., et al. (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745.
Liu, Z., Oguz, B., Zhao, C., et al. (2024). "World Model on Million-Length Video and Language with RingAttention." arXiv:2402.08268.