Chapter 8

RoPE: Rotary Position Embedding

Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021


Learning Objectives

After completing this section, you will be able to:

  1. Explain why encoding position through rotation is fundamentally different from adding position to embeddings or biasing attention scores, and what mathematical advantage this provides.
  2. Derive and interpret the 2D rotation formula that transforms Query and Key vectors, understanding each symbol and matrix operation involved.
  3. Understand the multi-frequency design that gives RoPE both local and global position sensitivity — and why this allows generalization to unseen sequence lengths.
  4. Implement RoPE from scratch in NumPy and PyTorch, applying it to the shared "the cat sat on the mat" example.
  5. Connect RoPE to modern systems including LLaMA, GPT-NeoX, PaLM, CodeLlama, and understand how it interacts with Flash Attention, KV-cache, and context length extension techniques like YaRN and NTK-aware scaling.
Where RoPE appears: LLaMA 1/2/3 (Meta), GPT-NeoX (EleutherAI), PaLM/Gemini (Google), CodeLlama, Mistral, Qwen, Falcon, Phi — virtually every major open-weight LLM since 2022 uses RoPE. It has become the default positional encoding for modern transformers.

The Real Problem

Transformers process all tokens in parallel; unlike RNNs, they have no built-in notion of sequential order. Without positional information, the sentence "the cat sat on the mat" would be indistinguishable from "mat the on sat cat the." The attention mechanism computes $Q \cdot K^T$, which depends only on what the tokens are, not where they are. This is the position encoding problem, and it has been solved in several ways, each with different tradeoffs.
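A small NumPy sketch can make this order-blindness concrete. It assumes random stand-in matrices for Q, K, and V (not the chapter's shared example) and a plain `attention` helper written only for this illustration. Reordering the rows of all three inputs simply reorders the rows of the output, so nothing in the result reflects where a token sat.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-shift for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Plain scaled dot-product attention with no positional information.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))  # random stand-ins
perm = np.array([4, 3, 2, 1, 0])                       # reverse the token order

out = attention(Q, K, V)
out_perm = attention(Q[perm], K[perm], V[perm])

# Each token receives the same output vector regardless of its position.
same_up_to_order = np.allclose(out[perm], out_perm)
print(same_up_to_order)  # True
```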

Position as Addition — The Limitation

The original Transformer (Vaswani et al., 2017) solved this by adding a position vector to each token embedding: $x_m = \text{token}_m + \text{PE}(m)$. This is simple and elegant, but it has two fundamental problems.

First, the position signal gets diluted as it passes through layers. After multiple rounds of self-attention, layer normalization, and feed-forward transformations, the additive positional component can lose its influence. The model must "remember" the position it was told about many layers ago.

Second, the dot product $(x_m + \text{PE}(m))^T (x_n + \text{PE}(n))$ expands into four terms: $x_m^T x_n + x_m^T \text{PE}(n) + \text{PE}(m)^T x_n + \text{PE}(m)^T \text{PE}(n)$. Only the last term captures purely positional interaction. The cross terms $x_m^T \text{PE}(n)$ and $\text{PE}(m)^T x_n$ mix content and position in ways the model must learn to disentangle, which is wasted capacity.
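The four-term expansion can be checked numerically. The sketch below uses random vectors as stand-ins for the token embeddings and position vectors (an assumption for illustration; no projection matrices are modeled) and prints each term separately.

```python
import numpy as np

rng = np.random.default_rng(0)
x_m, x_n = rng.normal(size=4), rng.normal(size=4)      # stand-in token embeddings
pe_m, pe_n = rng.normal(size=4), rng.normal(size=4)    # stand-in position vectors

full = (x_m + pe_m) @ (x_n + pe_n)
terms = {
    "content . content":   x_m @ x_n,
    "content . position":  x_m @ pe_n,
    "position . content":  pe_m @ x_n,
    "position . position": pe_m @ pe_n,
}
for name, value in terms.items():
    print(f"{name:<20} {value:+.4f}")

# The four terms add up to the full dot product.
print(np.isclose(full, sum(terms.values())))  # True
```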

Position as Score Bias — Better but Still External

Relative position biases (Chapter 7) improve on this by adding a learned bias directly to the attention score: $\text{score}(m, n) = Q_m \cdot K_n + b(m - n)$. This cleanly separates content matching from position matching. However, the bias $b(m - n)$ is typically a learned lookup table with a fixed maximum distance. When the model encounters a sequence longer than what it was trained on, the bias has no entry for large $|m - n|$ values, and generalization breaks down.
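To illustrate the fixed-table limitation, here is a minimal sketch of a bias lookup. The table size, the random values standing in for learned biases, and the `relative_bias` helper are all hypothetical; real implementations differ in how they handle out-of-range offsets.

```python
import numpy as np

max_dist = 4                                    # largest offset seen in training (assumed)
rng = np.random.default_rng(0)
bias_table = rng.normal(size=2 * max_dist + 1)  # stand-in for learned b(-4) .. b(+4)

def relative_bias(m, n):
    """Look up b(m - n); fails for offsets the table never learned."""
    offset = m - n
    if abs(offset) > max_dist:
        raise LookupError(f"no learned bias for offset {offset}")
    return bias_table[offset + max_dist]

inside = relative_bias(3, 1)        # offset +2 is covered by the table
try:
    relative_bias(12, 1)            # offset +11 lies outside the table
    out_of_range_failed = False
except LookupError:
    out_of_range_failed = True

print(inside == bias_table[6], out_of_range_failed)  # True True
```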

The core challenge: We need a method that (1) makes the dot product $Q_m \cdot K_n$ inherently depend on relative position $(m - n)$, (2) does not dilute through layers because it is applied directly at the attention computation, and (3) generalizes smoothly to longer sequences without requiring learned parameters for each distance. RoPE achieves all three.

From Intuition to Rotation

The Rotation Insight

Jianlin Su and his coauthors at ZhuiYi Technology asked a beautiful question: is there a transformation $f(x, m)$ that we can apply to both Q and K such that the inner product $\langle f(Q_m, m), f(K_n, n) \rangle$ depends only on $Q_m$, $K_n$, and the relative position $(m - n)$?

The answer comes from a well-known property of 2D rotations. If you rotate a vector $\mathbf{a}$ by angle $\alpha$ and another vector $\mathbf{b}$ by angle $\beta$, their dot product depends only on the difference of angles $(\alpha - \beta)$:

$$(R(\alpha)\mathbf{a}) \cdot (R(\beta)\mathbf{b}) = \mathbf{a}^T R(\alpha)^T R(\beta)\, \mathbf{b} = \mathbf{a}^T R(\beta - \alpha)\, \mathbf{b}$$

This is the key identity. If we set $\alpha = m\theta$ and $\beta = n\theta$ for some frequency $\theta$, then the dot product after rotation depends on $(m - n)\theta$: exactly the relative position, scaled by frequency. This is not an approximation. It is an exact mathematical consequence of rotation in 2D.

The idea is simple: rotate Q by position $m$ and rotate K by position $n$ before computing their dot product. The dot product will automatically encode relative position.
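The property is easy to confirm numerically. The sketch below uses arbitrary 2D vectors and an arbitrary frequency (all illustrative assumptions) and checks that the rotated dot product is the same whenever the offset between the two positions is the same.

```python
import numpy as np

def rot(angle):
    """2D counter-clockwise rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

a = np.array([1.0, 2.0])    # arbitrary stand-in for one Q pair
b = np.array([0.5, -1.0])   # arbitrary stand-in for one K pair
theta = 0.3                 # arbitrary frequency

def score(m, n):
    """Dot product after rotating a to position m and b to position n."""
    return (rot(m * theta) @ a) @ (rot(n * theta) @ b)

same_offset = [score(3, 1), score(10, 8), score(103, 101)]  # all have m - n = 2
other_offset = score(3, 2)                                  # m - n = 1

print(np.allclose(same_offset, same_offset[0]))  # True
print(np.isclose(other_offset, same_offset[0]))  # False
```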

The Multi-Frequency Design

A single rotation frequency can only encode one scale of position. To encode both "this word is 2 positions away" (local) and "this word is 500 positions away" (global), RoPE uses multiple frequencies, one for each pair of embedding dimensions. This is directly inspired by the sinusoidal positional encoding from the original Transformer, which also uses a geometric sequence of frequencies.

With embedding dimension $d$, RoPE splits the vector into $d/2$ pairs. Each pair $i$ gets its own frequency $\theta_i = 1/10000^{2i/d}$. The first pair ($i = 0$) has $\theta_0 = 1.0$: it rotates by 1 radian per position (fast, local). The last pair has a tiny $\theta$ and barely rotates even at position 1000 (slow, global). Together, these form a multi-scale positional representation, analogous to wavelets or Fourier decomposition.


The Mathematical Definition

Symbol-by-Symbol Breakdown

RoPE transforms a vector $\mathbf{x}$ at position $m$ by applying a 2D rotation to each pair of dimensions. In the half-split convention used in this chapter, pair $i$ joins the $i$-th first-half dimension with the $i$-th second-half dimension, and the rotation is:

$$x_i' = x_i \cos(m\theta_i) - x_{i+d/2} \sin(m\theta_i)$$

$$x_{i+d/2}' = x_i \sin(m\theta_i) + x_{i+d/2} \cos(m\theta_i)$$

Let us define every symbol precisely:

| Symbol | Meaning | In Our Example |
| --- | --- | --- |
| m | Absolute position of the token in the sequence | The=0, cat=1, sat=2, on=3, mat=4 |
| d | Embedding dimension (total number of dimensions) | 4 |
| i | Pair index, ranging from 0 to d/2 - 1 | 0 or 1 (two pairs for d=4) |
| θᵢ | Frequency for pair i: 1/10000^(2i/d) | θ₀=1.0, θ₁=0.01 |
| mθᵢ | Rotation angle = position × frequency | cat: 1×1.0=1.0 rad, 1×0.01=0.01 rad |
| xᵢ | i-th element from the first half of dimensions | For Q[cat]: x₀=0.0, x₁=2.0 |
| x_{i+d/2} | i-th element from the second half (paired with xᵢ) | For Q[cat]: x₂=0.0, x₃=1.0 |
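The per-pair formulas can also be read as one sparse d×d rotation matrix. The sketch below builds that matrix explicitly for the half-split pairing; `rope_matrix` is a hypothetical helper name, and a practical implementation would use element-wise operations and never materialize the matrix.

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Explicit d x d RoPE rotation for position m (pair i = dims i and i + d/2)."""
    half = d // 2
    theta = 1.0 / base ** (2 * np.arange(half) / d)
    R = np.zeros((d, d))
    for i in range(half):
        c, s = np.cos(m * theta[i]), np.sin(m * theta[i])
        R[i, i], R[i, i + half] = c, -s          # x_i'       = x_i cos - x_{i+d/2} sin
        R[i + half, i], R[i + half, i + half] = s, c  # x_{i+d/2}' = x_i sin + x_{i+d/2} cos
    return R

q_cat = np.array([0.0, 2.0, 0.0, 1.0])   # Q[cat] from the shared example
R1 = rope_matrix(m=1, d=4)
q_cat_rot = R1 @ q_cat
print(np.round(q_cat_rot, 4))
```

The result should agree with the rotated Q[cat] row worked out by hand in the step-by-step calculation, and `R1.T @ R1` is the identity, so the rotation leaves vector norms unchanged.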

The Frequency Formula

The frequency for dimension pair $i$ is: $\theta_i = \frac{1}{10000^{2i/d}}$

This creates a geometric sequence from $\theta_0 = 1.0$ down to $\theta_{d/2-1} \approx 0.0001$ for typical $d = 128$. The base 10000 was chosen by Vaswani et al. (2017) for the original sinusoidal encoding and reused by Su et al. for RoPE.

For our example with $d = 4$:

  • Pair 0 ($i = 0$): $\theta_0 = 1/10000^{0/4} = 1/10000^0 = 1.000000$, a fast rotation of 1 radian per position (57.3° per step)
  • Pair 1 ($i = 1$): $\theta_1 = 1/10000^{2/4} = 1/10000^{0.5} = 1/100 = 0.010000$, a slow rotation of 0.01 radian per position (0.57° per step)

Pair 0 rotates 100 times faster than pair 1. This means pair 0 distinguishes nearby positions (local context), while pair 1 distinguishes distant positions (global context). In a real model with $d = 128$, you would have 64 pairs spanning roughly four orders of magnitude in frequency (from 1.0 down to about 0.0001).
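As a quick check of the frequency formula, the sketch below computes the frequencies for the toy case and for d = 128. The helper name `rope_frequencies` is illustrative, and the wavelength (positions per full turn, 2π/θ) is added here only as a convenient way to read the numbers.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return 1.0 / base ** (2 * i / d)

theta_4 = rope_frequencies(4)        # the two frequencies of the toy example
theta_128 = rope_frequencies(128)    # 64 frequencies for a typical head size
wavelengths = 2 * np.pi / theta_128  # positions needed for one full rotation

print(theta_4)
print(f"fastest pair: theta={theta_128[0]:.6f}, wavelength={wavelengths[0]:.1f} positions")
print(f"slowest pair: theta={theta_128[-1]:.6f}, wavelength={wavelengths[-1]:.0f} positions")
```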

Why the Dot Product Depends Only on Relative Position

Consider a single dimension pair. After rotating Q at position $m$ and K at position $n$, the contribution of this pair to the dot product is:

$$(q_1 \cos m\theta - q_2 \sin m\theta)(k_1 \cos n\theta - k_2 \sin n\theta) + (q_1 \sin m\theta + q_2 \cos m\theta)(k_1 \sin n\theta + k_2 \cos n\theta)$$

Expanding and using the identities $\cos A \cos B + \sin A \sin B = \cos(A - B)$ and $\sin A \cos B - \cos A \sin B = \sin(A - B)$:

$$= q_1 k_1 \cos((m-n)\theta) + q_1 k_2 \sin((m-n)\theta) - q_2 k_1 \sin((m-n)\theta) + q_2 k_2 \cos((m-n)\theta)$$

Every term depends on $(m - n)\theta$, the relative distance scaled by frequency. The absolute positions $m$ and $n$ have vanished. This is the mathematical guarantee that RoPE provides: the attention score is a function of content and relative position only.
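The per-pair result can be verified numerically. The sketch below takes pair 0 of Q[sat] and K[mat] from the shared example and compares the directly rotated dot product against the grouped closed form $(q \cdot k)\cos(\Delta\theta) + (q_1 k_2 - q_2 k_1)\sin(\Delta\theta)$ with $\Delta = m - n$. The helper names are illustrative.

```python
import numpy as np

def rotate_pair(v, angle):
    """Rotate one 2D pair (v[0], v[1]) by the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([v[0] * c - v[1] * s, v[0] * s + v[1] * c])

q = np.array([1.0, 1.0])   # pair 0 of Q[sat]: columns 0 and 2
k = np.array([1.0, 0.5])   # pair 0 of K[mat]: columns 0 and 2
theta = 1.0                # theta_0 for d = 4

def direct(m, n):
    """Rotate q to position m and k to position n, then take the dot product."""
    return rotate_pair(q, m * theta) @ rotate_pair(k, n * theta)

def closed_form(delta):
    """Same quantity written as a function of delta = m - n only."""
    return (q @ k) * np.cos(delta * theta) + (q[0] * k[1] - q[1] * k[0]) * np.sin(delta * theta)

pairs = [(2, 4), (7, 9), (50, 52)]           # every pair has m - n = -2
direct_vals = [direct(m, n) for m, n in pairs]
print(np.allclose(direct_vals, closed_form(-2)))  # True
```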

Why V is NOT rotated: Values carry the content that gets aggregated by attention. Position should affect which tokens attend to each other (via Q and K), not what information flows between them. Rotating V would corrupt the content representation with positional artifacts.

Step-by-Step Calculation

We use the same shared example from every chapter: tokens = ["The", "cat", "sat", "on", "mat"] with $d = 4$ and the standard Q, K, V matrices. The implementation pairs the first half of dimensions with the second half: pair 0 = (col 0, col 2), pair 1 = (col 1, col 3).

Step 1: Compute Frequencies

With $d = 4$ and base = 10000, we have 2 dimension pairs:

  • $\theta_0 = 1/10000^{0/4} = 1.000000$ (pair 0: dims 0 and 2)
  • $\theta_1 = 1/10000^{2/4} = 0.010000$ (pair 1: dims 1 and 3)

Rotation angles = position × frequency:

| Token | Position | Angle (pair 0) | Angle (pair 1) |
| --- | --- | --- | --- |
| The | 0 | 0 × 1.0 = 0.0000 | 0 × 0.01 = 0.0000 |
| cat | 1 | 1 × 1.0 = 1.0000 | 1 × 0.01 = 0.0100 |
| sat | 2 | 2 × 1.0 = 2.0000 | 2 × 0.01 = 0.0200 |
| on | 3 | 3 × 1.0 = 3.0000 | 3 × 0.01 = 0.0300 |
| mat | 4 | 4 × 1.0 = 4.0000 | 4 × 0.01 = 0.0400 |

Step 2: Rotate All Q Vectors

For each token, we split Q into first-half $x_1 = Q[:, :2]$ and second-half $x_2 = Q[:, 2:]$, then apply the rotation:

"The" (pos=0): angle = 0 for both pairs, so cos=1, sin=0. $Q_{\text{rot}}$[The] = Q[The] = [1.0000, 0.0000, 1.0000, 0.0000], so nothing changes.

"cat" (pos=1): Q[cat] = [0.0, 2.0, 0.0, 1.0], so $x_1$ = [0.0, 2.0], $x_2$ = [0.0, 1.0]

  • Pair 0 (angle=1.0): $0.0 \times \cos(1.0) - 0.0 \times \sin(1.0) = 0.0000$, $0.0 \times \sin(1.0) + 0.0 \times \cos(1.0) = 0.0000$
  • Pair 1 (angle=0.01): $2.0 \times 0.9999 - 1.0 \times 0.0100 = 1.9899$, $2.0 \times 0.0100 + 1.0 \times 0.9999 = 1.0199$

$Q_{\text{rot}}$[cat] = [0.0000, 1.9899, 0.0000, 1.0199]

"sat" (pos=2): Q[sat] = [1.0, 1.0, 1.0, 0.0], angle pair 0 = 2.0 (cos = −0.4161, sin = 0.9093)

  • Pair 0: $1.0 \times (-0.4161) - 1.0 \times 0.9093 = -1.3254$, $1.0 \times 0.9093 + 1.0 \times (-0.4161) = 0.4932$
  • Pair 1 (angle=0.02): $1.0 \times 0.9998 - 0.0 \times 0.0200 = 0.9998$, $1.0 \times 0.0200 + 0.0 \times 0.9998 = 0.0200$

$Q_{\text{rot}}$[sat] = [−1.3254, 0.9998, 0.4932, 0.0200]

Full rotated Q matrix:

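| Token | d0 | d1 | d2 | d3 |
| --- | --- | --- | --- | --- |
| The | 1.0000 | 0.0000 | 1.0000 | 0.0000 |
| cat | 0.0000 | 1.9899 | 0.0000 | 1.0199 |
| sat | -1.3254 | 0.9998 | 0.4932 | 0.0200 |
| on | -0.1411 | -0.0300 | -0.9900 | 0.9996 |
| mat | -0.6536 | -0.0400 | -0.7568 | 0.9992 |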

Step 3: Rotate All K Vectors

Same rotation angles applied to K:

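| Token | d0 | d1 | d2 | d3 |
| --- | --- | --- | --- | --- |
| The | 0.0000 | 1.0000 | 0.0000 | 1.0000 |
| cat | -0.3012 | 0.0000 | 1.3818 | 0.0000 |
| sat | -0.4161 | 0.9998 | 0.9093 | 0.0200 |
| on | -0.1411 | -0.0300 | -0.9900 | 0.9996 |
| mat | -0.2752 | -0.0200 | -1.0836 | 0.4996 |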
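Steps 1 to 3 can be reproduced in a few lines. This is a minimal sketch that assumes the half-split pairing described above and the shared Q and K matrices; `apply_rope` is a hypothetical helper name, not an established API.

```python
import numpy as np

def apply_rope(X, base=10000.0):
    """Rotate row m of X by position m, pairing column i with column i + d/2."""
    N, d = X.shape
    half = d // 2
    theta = 1.0 / base ** (2 * np.arange(half) / d)   # one frequency per pair
    angles = np.outer(np.arange(N), theta)            # angles[m, i] = m * theta_i
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = X[:, :half], X[:, half:]
    return np.hstack([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a])

Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.5]])

Q_rot = apply_rope(Q)
K_rot = apply_rope(K)
print(np.round(Q_rot, 4))
print(np.round(K_rot, 4))
```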

Step 4: Compute Scaled Scores

Raw scores = $Q_{\text{rot}} \times K_{\text{rot}}^T$, then divide by $\sqrt{4} = 2.0$:

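| | The | cat | sat | on | mat |
| --- | --- | --- | --- | --- | --- |
| The | 0.0000 | 0.5403 | 0.2466 | -0.5656 | -0.6794 |
| cat | 1.5049 | 0.0000 | 1.0049 | 0.4799 | 0.2349 |
| sat | 0.5099 | 0.5403 | 1.0000 | -0.1556 | -0.0898 |
| on | 0.4848 | -0.6627 | -0.4257 | 1.0000 | 0.8058 |
| mat | 0.4796 | -0.4244 | -0.2181 | 0.9207 | 0.7500 |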

Notice how "on" (pos=3) gives its highest score to itself (1.0000) and to its neighbor "mat" (pos=4, score 0.8058), while "cat" (pos=1, score −0.6627) and "sat" (pos=2, score −0.4257) receive negative scores. Without rotation, the same scaled scores would be 0.5 for both "mat" and "cat" and 0.0 for "sat", so the ranking now reflects relative position as well as content.
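The scaled scores can be recomputed directly from the rotated matrices. The sketch below copies the 4-decimal values from the two tables in Steps 2 and 3, so its results match the score table only up to rounding.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]

# Rounded values copied from the rotated Q and K tables.
Q_rot = np.array([[ 1.0000,  0.0000,  1.0000, 0.0000],
                  [ 0.0000,  1.9899,  0.0000, 1.0199],
                  [-1.3254,  0.9998,  0.4932, 0.0200],
                  [-0.1411, -0.0300, -0.9900, 0.9996],
                  [-0.6536, -0.0400, -0.7568, 0.9992]])
K_rot = np.array([[ 0.0000,  1.0000,  0.0000, 1.0000],
                  [-0.3012,  0.0000,  1.3818, 0.0000],
                  [-0.4161,  0.9998,  0.9093, 0.0200],
                  [-0.1411, -0.0300, -0.9900, 0.9996],
                  [-0.2752, -0.0200, -1.0836, 0.4996]])

# Raw dot products, then scaling by sqrt(d) = 2.0.
scaled = (Q_rot @ K_rot.T) / np.sqrt(Q_rot.shape[1])

row = scaled[tokens.index("on")]
for tok, s in zip(tokens, row):
    print(f"on -> {tok:<4}{s:8.4f}")
```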

Step 5: Apply Softmax

Softmax converts scaled scores to probabilities (each row sums to 1.0). Detailed computation for "The" (row 0):

  • Scaled scores: [0.0000, 0.5403, 0.2466, −0.5656, −0.6794]
  • Max = 0.5403, shifted = [−0.5403, 0.0000, −0.2937, −1.1059, −1.2197]
  • exp = [0.5826, 1.0000, 0.7455, 0.3309, 0.2953], sum = 2.9543
  • Softmax: [0.1972, 0.3385, 0.2523, 0.1120, 0.1000]
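The four bullets above translate line for line into NumPy. The sketch uses the rounded scaled scores for "The" from the table, and also checks that subtracting the row maximum changes nothing in the result (it only guards against overflow for large scores).

```python
import numpy as np

scaled_row = np.array([0.0000, 0.5403, 0.2466, -0.5656, -0.6794])  # row "The"

shifted = scaled_row - scaled_row.max()   # largest entry becomes 0
exp_row = np.exp(shifted)                 # every value is now in (0, 1]
weights_row = exp_row / exp_row.sum()     # normalize to a probability row

# Same result without the shift; the shift is purely for numerical safety.
unshifted = np.exp(scaled_row) / np.exp(scaled_row).sum()

print(np.round(weights_row, 4))
print(np.allclose(weights_row, unshifted))  # True
```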
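
"The" attends most strongly to "cat" (0.3385), which is the next token. Compared with the unrotated scaled scores (1.0 for "cat", 0.75 for "mat", 0.5 for "sat" and "on"), the rotation has lowered every score, and it has lowered the scores of the distant tokens "on" and "mat" the most.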

Step 6: Compute Output

Output for "The" = weighted sum of V (unrotated):

  • 0.1972 × V[The] + 0.3385 × V[cat] + 0.2523 × V[sat] + 0.1120 × V[on] + 0.1000 × V[mat]
  • = [0.1972, 0.0000, 0.0000, 0.0000] + [0.0000, 0.3385, 0.0000, 0.0000] + [0.0000, 0.0000, 0.2523, 0.0000] + [0.0000, 0.0000, 0.0000, 0.1120] + [0.0500, 0.0500, 0.0500, 0.0500]
  • = [0.2472, 0.3885, 0.3023, 0.1620]
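The same weighted sum, written as a sketch with the rounded attention weights for "The" and the shared, unrotated V matrix. The `contributions` array is added here only to show each token's share before summing.

```python
import numpy as np

weights_row = np.array([0.1972, 0.3385, 0.2523, 0.1120, 0.1000])  # row "The"
V = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.5, 0.5, 0.5, 0.5]])   # V is used as-is, never rotated

contributions = weights_row[:, None] * V   # one scaled value vector per token
output_row = contributions.sum(axis=0)     # identical to weights_row @ V

print(np.round(contributions, 4))
print(np.round(output_row, 4))
```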

Full Attention Weights and Output

Attention Weights, RoPE ($5 \times 5$)

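| | The | cat | sat | on | mat |
| --- | --- | --- | --- | --- | --- |
| The | 0.1972 | 0.3385 | 0.2523 | 0.1120 | 0.1000 |
| cat | 0.4052 | 0.0900 | 0.2457 | 0.1454 | 0.1138 |
| sat | 0.2116 | 0.2181 | 0.3454 | 0.1088 | 0.1162 |
| on | 0.2095 | 0.0665 | 0.0843 | 0.3508 | 0.2889 |
| mat | 0.2098 | 0.0849 | 0.1044 | 0.3260 | 0.2749 |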
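
Observation: Every token gives its highest attention weight to itself or its nearest neighbor. "on" and "mat" (positions 3 and 4) attend heavily to each other (0.2889 and 0.3260). "cat" attends most to "The" (0.4052), its immediate predecessor. Content matching already favors some of these pairs, so the pattern is not due to RoPE alone; comparing with the content-only attention of Chapter 1 shows what the rotation changes. For example, "mat" now attends most to its neighbor "on", whereas on content alone it would attend most to itself.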

Output Matrix, RoPE ($5 \times 4$)

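| | d0 | d1 | d2 | d3 |
| --- | --- | --- | --- | --- |
| The | 0.2472 | 0.3885 | 0.3023 | 0.1620 |
| cat | 0.4620 | 0.1468 | 0.3026 | 0.2023 |
| sat | 0.2697 | 0.2762 | 0.4035 | 0.1668 |
| on | 0.3540 | 0.2109 | 0.2287 | 0.4952 |
| mat | 0.3472 | 0.2224 | 0.2418 | 0.4635 |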

Applications Across Domains

RoPE was designed for language modeling but its mathematical properties make it valuable across many domains:

| Domain | Application | Why RoPE Helps |
| --- | --- | --- |
| Large Language Models | LLaMA, Mistral, Qwen, Phi | Enables clean relative position without learned parameters. Models can be extended to longer contexts via NTK-aware scaling or YaRN. |
| Code Generation | CodeLlama, DeepSeek-Coder | Code has deeply nested structure with long-range dependencies (function calls 100s of lines away). RoPE's multi-frequency bands capture both local syntax and global structure. |
| Vision Transformers | EVA-02, InternVL | 2D images are flattened to 1D token sequences. RoPE can be extended to 2D by applying separate rotations for row and column positions. |
| Scientific Computing | AlphaFold 3 (protein structure) | Protein residue interactions depend on sequence distance. RoPE's relative position property naturally models the chain geometry. |
| Long-Context Tasks | 128K+ token models | RoPE extrapolates more gracefully than learned embeddings. YaRN scaling extends 4K models to 128K+ with minimal fine-tuning. |

Connection to Modern Systems

Flash Attention Compatibility

RoPE is applied before the attention computation: you rotate Q and K, then pass them to standard scaled dot-product attention. This means RoPE is fully compatible with Flash Attention (Chapter 13), because the rotation happens outside the memory-efficient kernel. In practice, most implementations apply RoPE as a preprocessing step, then call flash_attn_func(Q_rot, K_rot, V).

KV-Cache and Incremental Rotation

During autoregressive generation, K and V are cached to avoid recomputation. With RoPE, each K vector is rotated by its position once when generated, then stored in the cache already rotated. This means the KV-cache works identically to standard attention — no recomputation of rotations is needed when the cache grows.
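A minimal sketch of this caching pattern follows. It assumes random stand-in Q and K rows, one token generated per step, and a hypothetical `rope_rotate` helper; values are omitted because they are cached unrotated. Each key is rotated exactly once, and the final-step scores match a from-scratch computation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate one vector x (shape (d,)) to position pos, half-split pairing."""
    d = x.shape[0]
    half = d // 2
    theta = 1.0 / base ** (2 * np.arange(half) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c])

rng = np.random.default_rng(0)
steps, d = 6, 8
Q = rng.normal(size=(steps, d))   # stand-in per-token queries
K = rng.normal(size=(steps, d))   # stand-in per-token keys

k_cache = []                      # holds keys that are ALREADY rotated
incremental_scores = []
for t in range(steps):
    k_cache.append(rope_rotate(K[t], t))   # rotate once, at generation time
    q_t = rope_rotate(Q[t], t)             # only the new query is rotated
    incremental_scores.append(np.array(k_cache) @ q_t)  # scores vs positions 0..t

# Reference: rotate every key from scratch at the last step.
K_full = np.stack([rope_rotate(K[t], t) for t in range(steps)])
reference = K_full @ rope_rotate(Q[steps - 1], steps - 1)

print(np.allclose(incremental_scores[-1], reference))  # True
```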

Context Length Extension

Because RoPE uses a mathematical formula (not a learned table), it can be extended to longer sequences by modifying the frequency base:

  • NTK-aware interpolation (bloc97, 2023): scales the base from 10000 to a larger value, effectively stretching the wavelengths so existing RoPE representations cover more positions without retraining.
  • YaRN (Peng et al., 2023): combines NTK scaling with an attention temperature adjustment. Extends a 4K context model to 128K+ with a small amount of fine-tuning.
  • Dynamic NTK: adjusts the scaling factor at inference time based on actual sequence length, providing automatic adaptation.
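The effect of enlarging the base can be sketched numerically. The rule used below, base' = base × s^(d/(d−2)), is the form commonly quoted for NTK-aware scaling; treat it, the scale factor s = 8, and the helper name as assumptions for illustration, since this section does not specify them. Under that rule the fastest pair is left untouched while the slowest pair is slowed by exactly the factor s.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1."""
    return 1.0 / base ** (2 * np.arange(d // 2) / d)

d, scale = 128, 8.0                             # assumed head size and extension factor
ntk_base = 10000.0 * scale ** (d / (d - 2))     # assumed NTK-aware base adjustment

theta_orig = rope_frequencies(d)
theta_ntk = rope_frequencies(d, base=ntk_base)
ratio = theta_ntk / theta_orig                  # how much each pair is slowed down

print(f"new base: {ntk_base:.1f}")
print(f"fastest pair ratio: {ratio[0]:.4f}")    # high frequencies stay put
print(f"slowest pair ratio: {ratio[-1]:.4f}")   # low frequencies stretch by 1/scale
```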

Multi-Head Latent Attention (MLA)

DeepSeek-V2 and V3 (Chapter 15) combine RoPE with low-rank key-value compression. They split the query and key into a RoPE portion (carrying positional information) and a content portion (compressed via low-rank projection). This decouples position from content at the architectural level, achieving massive KV-cache savings while preserving RoPE's positional properties.


Complexity Analysis

| Operation | Complexity | Notes |
| --- | --- | --- |
| Compute frequencies | O(d/2) | Once per model initialization |
| Compute angles | O(N × d/2) | Outer product of positions and frequencies |
| Apply rotation | O(N × d) | Element-wise multiply + subtract/add per token |
| Dot-product scores | O(N² × d) | Same as standard attention; RoPE adds no cost here |
| Total overhead vs standard | O(N × d) | Rotation is linear in sequence length; negligible vs O(N²d) |

RoPE adds zero parameters to the model: the frequencies are computed from a formula, not learned. The computational overhead is $O(Nd)$ for the rotation step, which is negligible compared to the $O(N^2 d)$ attention computation. In practice, the rotation is often fused into neighboring kernels and typically accounts for only a small fraction of wall-clock time.


Python Implementation

Full NumPy implementation with class structure. The walkthrough below annotates the key lines with the actual matrix values from "the cat sat on the mat."

RoPE Attention: NumPy Implementation
rope_attention.py
import numpy as np

NumPy provides vectorized matrix operations. Q_rot @ K_rot.T runs as optimized C code, not Python loops.

import math

Python standard library. We use math.sqrt() to precompute the scaling factor √d_model.

class RotaryPositionAttention

Wraps RoPE in a reusable class. The key method is apply_rotation() which rotates Q and K by their position before computing attention. V is never rotated.

def __init__(self, d_model, base)

Constructor. Takes model dimension and base frequency (10000 in the original paper). Precomputes per-pair frequencies θᵢ = 1/base^(2i/d).

EXECUTION STATE
⬇ input: d_model = 4
⬇ input: base = 10000
self.d_model = d_model

Store model dimension (4).

EXECUTION STATE
self.d_model = 4
self.base = base

Store base frequency. 10000 is from Vaswani et al. (2017), reused in RoPE.

EXECUTION STATE
self.base = 10000
self.scale = math.sqrt(d_model)

Precompute √d_model for score scaling (same as Chapter 1).

EXECUTION STATE
math.sqrt(4) = 2.0
self.scale = 2.0
self.theta = np.array([...])

Precompute one frequency per dimension pair. With d=4, we have 2 pairs. θ₀=1.0 (fast rotation), θ₁=0.01 (slow rotation).

EXECUTION STATE
range(d_model // 2) = range(2) → [0, 1]
θ₀ = 1/10000^(0/4) = 1/10000⁰ = 1/1 = 1.000000
θ₁ = 1/10000^(2/4) = 1/10000^(1/2) = 1/100 = 0.010000
self.theta = [1.000000, 0.010000]
def _softmax(self, x) -> np.ndarray

Numerically stable softmax. Subtracts row max before exponentiating to prevent overflow.

EXECUTION STATE
⬇ input: x = shape (5, 5) — scaled score matrix
⬆ returns = np.ndarray (5, 5) — softmax probabilities per row
x_shifted = x - np.max(x, axis=-1, keepdims=True)

Subtract row-wise max for numerical stability. exp(x - max) prevents overflow.

EXECUTION STATE
axis=-1 = find max along last axis — each row gets its own max
keepdims=True = result shape (5,1) not (5,) so broadcasting x(5×5) - max(5×1) works
exp_x = np.exp(x_shifted)

Exponentiate. Largest per-row value is exp(0)=1.0 — no overflow.

return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Normalize each row to sum to 1.0.

EXECUTION STATE
axis=-1 = sum each row independently
keepdims=True = sum shape (5,1) for broadcasting
def compute_angles(self, N) -> np.ndarray

Compute rotation angles for all positions 0..N-1. Each position gets one angle per dimension pair: angle = pos × θᵢ.

EXECUTION STATE
⬇ input: N = 5 (number of tokens)
⬆ returns = np.ndarray (5, 2) — angles for each (position, pair)
positions = np.arange(N)

Create position indices [0, 1, 2, 3, 4] for our 5 tokens.

EXECUTION STATE
positions = [0, 1, 2, 3, 4]
return np.outer(positions, self.theta)

Outer product: positions (5,) × theta (2,) → angles (5, 2). Entry [m, i] = m × θᵢ.

EXECUTION STATE
np.outer() = outer product — every position multiplied by every frequency
⬆ return: angles (5×2) =
        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400
def apply_rotation(self, X) -> np.ndarray

THE CORE RoPE OPERATION: rotate each row of X by its position. Splits X into first-half and second-half dimensions, then applies 2D rotation per pair. Position 0 gets zero rotation; higher positions rotate more.

EXECUTION STATE
⬇ input: X (5×4) = Q or K matrix — each row is a token's vector
⬆ returns = np.ndarray (5, 4) — position-rotated version of X
N, D = X.shape

Unpack matrix dimensions.

EXECUTION STATE
N = 5 (tokens)
D = 4 (dimensions)
angles = self.compute_angles(N)

Get rotation angles for all 5 positions.

EXECUTION STATE
angles (5×2) =
        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400
cos_a = np.cos(angles)

Compute cosines for all angles.

EXECUTION STATE
cos_a (5×2) =
          pair0     pair1
The    1.000000  1.000000
cat    0.540302  0.999950
sat   -0.416147  0.999800
on    -0.989992  0.999550
mat   -0.653644  0.999200
sin_a = np.sin(angles)

Compute sines for all angles.

EXECUTION STATE
sin_a (5×2) =
          pair0     pair1
The    0.000000  0.000000
cat    0.841471  0.010000
sat    0.909297  0.019999
on     0.141120  0.029996
mat   -0.756802  0.039989
x1 = X[:, :D // 2]

First half of dimensions: columns 0 and 1. These form the "first component" of each rotation pair.

EXECUTION STATE
:D // 2 = :2 — columns [0, 1]
x1 (for Q) =
     d0   d1
The  1.0  0.0
cat  0.0  2.0
sat  1.0  1.0
on   0.0  0.0
mat  1.0  0.0
x2 = X[:, D // 2:]

Second half of dimensions: columns 2 and 3. These are the "second component" paired with x1.

EXECUTION STATE
D // 2: = 2: — columns [2, 3]
x2 (for Q) =
     d2   d3
The  1.0  0.0
cat  0.0  1.0
sat  1.0  0.0
on   1.0  1.0
mat  0.0  1.0
rotated_first = x1 * cos_a - x2 * sin_a

First rotation component: x1·cos(θ) - x2·sin(θ). This is the standard 2D rotation formula applied element-wise per pair.

EXECUTION STATE
rotated_first (for Q) =
         d0       d1
The   1.0000   0.0000
cat   0.0000   1.9899
sat  -1.3254   0.9998
on   -0.1411  -0.0300
mat  -0.6536  -0.0400
rotated_second = x1 * sin_a + x2 * cos_a

Second rotation component: x1·sin(θ) + x2·cos(θ). Completes the 2D rotation.

EXECUTION STATE
rotated_second (for Q) =
         d2       d3
The   1.0000   0.0000
cat   0.0000   1.0199
sat   0.4932   0.0200
on   -0.9900   0.9996
mat  -0.7568   0.9992
return np.hstack([rotated_first, rotated_second])

Concatenate the two halves back into a (5×4) matrix. This is the position-encoded version of the input.

EXECUTION STATE
np.hstack() = horizontal stack — [(5×2), (5×2)] → (5×4)
⬆ return: Q_rot (5×4) =
         d0       d1       d2       d3
The   1.0000   0.0000   1.0000   0.0000
cat   0.0000   1.9899   0.0000   1.0199
sat  -1.3254   0.9998   0.4932   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.6536  -0.0400  -0.7568   0.9992
def compute_scores(self, Q_rot, K_rot)

Compute raw dot products between rotated Q and rotated K. Since both carry positional rotation, the dot product Q_rot[m]·K_rot[n] encodes relative position (m-n).

EXECUTION STATE
⬇ input: Q_rot (5×4) = position-rotated query matrix
⬇ input: K_rot (5×4) = position-rotated key matrix
⬆ returns = np.ndarray (5, 5) — raw scores
return Q_rot @ K_rot.T

Matrix multiply Q_rot (5×4) with K_rot transposed (4×5) → (5×5). Entry (i,j) = dot product of rotated query i with rotated key j.

EXECUTION STATE
.T = transpose — K_rot (5×4) becomes (4×5)
⬆ return: raw_scores (5×5) =
        The      cat      sat       on      mat
The   0.0000   1.0806   0.4932  -1.1311  -1.3589
cat   3.0098   0.0000   2.0099   0.9598   0.4698
sat   1.0198   1.0806   2.0000  -0.3112  -0.1796
on    0.9696  -1.3254  -0.8515   2.0000   1.6116
mat   0.9592  -0.8489  -0.4361   1.8414   1.5000
def scale_scores(self, scores) -> np.ndarray

Divide every score by √d_model = √4 = 2.0 to prevent softmax saturation.

EXECUTION STATE
⬇ input: scores = shape (5, 5) — raw dot products
self.scale = 2.0 (√4)
⬆ returns = np.ndarray (5, 5) — scores ÷ 2.0
return scores / self.scale

Elementwise division by 2.0.

EXECUTION STATE
⬆ return: scaled (5×5) =
        The      cat      sat       on      mat
The   0.0000   0.5403   0.2466  -0.5656  -0.6794
cat   1.5049   0.0000   1.0049   0.4799   0.2349
sat   0.5099   0.5403   1.0000  -0.1556  -0.0898
on    0.4848  -0.6627  -0.4257   1.0000   0.8058
mat   0.4796  -0.4244  -0.2181   0.9207   0.7500
def compute_weights(self, scaled) -> np.ndarray

Apply softmax row-wise to get attention probabilities.

EXECUTION STATE
⬇ input: scaled = shape (5, 5)
⬆ returns = np.ndarray (5, 5) — each row sums to 1.0
return self._softmax(scaled)

Calls _softmax. All 5 rows become probability distributions.

EXECUTION STATE
⬆ return: weights (5×5) =
        The      cat      sat       on      mat
The   0.1972   0.3385   0.2523   0.1120   0.1000
cat   0.4052   0.0900   0.2457   0.1454   0.1138
sat   0.2116   0.2181   0.3454   0.1088   0.1162
on    0.2095   0.0665   0.0843   0.3508   0.2889
mat   0.2098   0.0849   0.1044   0.3260   0.2749
def compute_output(self, weights, V)

Weighted sum of value vectors. V is NOT rotated — only Q and K receive positional information through rotation.

EXECUTION STATE
⬇ input: weights (5×5) = attention probabilities
⬇ input: V (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5
⬆ returns = np.ndarray (5, 4) — context vectors
return weights @ V

Matrix multiply weights (5×5) with V (5×4) → (5×4). Each output row blends all 5 value vectors.

EXECUTION STATE
⬆ return: output (5×4) =
        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635
def forward(self, Q, K, V)

Main entry point. Rotates Q and K by position, computes scaled dot-product attention, and aggregates V (unrotated). The rotation makes dot products position-aware without modifying the score computation itself.

EXECUTION STATE
⬇ input: Q (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0
⬇ input: K (5×4) =
      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5
⬇ input: V (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5
⬆ returns = (output (5,4), weights (5,5))
Q_rot = self.apply_rotation(Q)

Rotate all query vectors by their position. 'The' (pos 0) is unchanged; 'mat' (pos 4) gets maximum rotation.

EXECUTION STATE
Q_rot (5×4) =
         d0       d1       d2       d3
The   1.0000   0.0000   1.0000   0.0000
cat   0.0000   1.9899   0.0000   1.0199
sat  -1.3254   0.9998   0.4932   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.6536  -0.0400  -0.7568   0.9992
K_rot = self.apply_rotation(K)

Rotate all key vectors by their position. Same rotation angles as Q (position-dependent).

EXECUTION STATE
K_rot (5×4) =
         d0       d1       d2       d3
The   0.0000   1.0000   0.0000   1.0000
cat  -0.3012   0.0000   1.3818   0.0000
sat  -0.4161   0.9998   0.9093   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.2752  -0.0200  -1.0836   0.4996
raw_scores = self.compute_scores(Q_rot, K_rot)

Compute Q_rot @ K_rot.T → 5×5 raw score matrix.

EXECUTION STATE
raw_scores (5×5) =
        The      cat      sat       on      mat
The   0.0000   1.0806   0.4932  -1.1311  -1.3589
cat   3.0098   0.0000   2.0099   0.9598   0.4698
sat   1.0198   1.0806   2.0000  -0.3112  -0.1796
on    0.9696  -1.3254  -0.8515   2.0000   1.6116
mat   0.9592  -0.8489  -0.4361   1.8414   1.5000
scaled_scores = self.scale_scores(raw_scores)

Divide by √4 = 2.0.

EXECUTION STATE
scaled_scores (5×5) =
        The      cat      sat       on      mat
The   0.0000   0.5403   0.2466  -0.5656  -0.6794
cat   1.5049   0.0000   1.0049   0.4799   0.2349
sat   0.5099   0.5403   1.0000  -0.1556  -0.0898
on    0.4848  -0.6627  -0.4257   1.0000   0.8058
mat   0.4796  -0.4244  -0.2181   0.9207   0.7500
weights = self.compute_weights(scaled_scores)

Apply softmax row-wise.

EXECUTION STATE
weights (5×5) =
        The      cat      sat       on      mat
The   0.1972   0.3385   0.2523   0.1120   0.1000
cat   0.4052   0.0900   0.2457   0.1454   0.1138
sat   0.2116   0.2181   0.3454   0.1088   0.1162
on    0.2095   0.0665   0.0843   0.3508   0.2889
mat   0.2098   0.0849   0.1044   0.3260   0.2749
output = self.compute_output(weights, V)

Weighted sum of UNROTATED V.

EXECUTION STATE
output (5×4) =
        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635
return output, weights

Return the context vectors and attention weights.

EXECUTION STATE
⬆ return: output = shape (5, 4)
⬆ return: weights = shape (5, 5) — each row sums to 1.0
tokens = [...]

The 5 tokens used in every chapter.

EXECUTION STATE
tokens = ['The', 'cat', 'sat', 'on', 'mat']
Q = np.array([...])

Query matrix — same across all chapters. Each row encodes what that token looks for.

EXECUTION STATE
Q (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0
K = np.array([...])

Key matrix — same across all chapters. Each row encodes what that token offers.

EXECUTION STATE
K (5×4) =
      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5
V = np.array([...])

Value matrix. V is NOT rotated in RoPE — only Q and K carry position.

EXECUTION STATE
V (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5
rope = RotaryPositionAttention(d_model=4, base=10000)

Instantiate RoPE with d_model=4 and base=10000. Precomputes theta=[1.0, 0.01] and scale=2.0.

EXECUTION STATE
rope.theta = [1.000000, 0.010000]
rope.scale = 2.0
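
The precomputed values can be reproduced by hand. This is a minimal sketch of the frequency formula from the constructor, with one extra quantity (the wavelength, 2π/θ) added for intuition; the name `wavelengths` is illustrative and does not appear in the class.

```python
import math

d_model, base = 4, 10000

# One frequency per dimension pair: theta_i = 1 / base ** (2i / d_model).
theta = [1.0 / (base ** (2 * i / d_model)) for i in range(d_model // 2)]

# Wavelength in positions: how many tokens until a pair completes a full turn.
wavelengths = [2 * math.pi / t for t in theta]

print(theta)                 # approximately [1.0, 0.01]
print(wavelengths)           # approximately [6.28, 628.3]
print(math.sqrt(d_model))    # 2.0, the score scaling factor
```

With only two pairs, the fast pair wraps around after about 6 positions while the slow pair needs several hundred.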
output, weights = rope.forward(Q, K, V)

Run the full RoPE attention pipeline: rotate Q and K, compute scaled dot-product attention, aggregate V.

EXECUTION STATE
output (5×4) =
        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635
rope.explain(Q, K, V, tokens, query_idx=0)

Print detailed trace for 'The' (pos 0). Since pos=0, Q_rot[The] = Q[The] unchanged — zero rotation.

EXECUTION STATE
query_idx = 0 → tracing 'The'
rope.explain(Q, K, V, tokens, query_idx=1)

Print trace for 'cat' (pos 1). Position 1 rotation is significant for pair 0 (θ=1.0 rad ≈ 57.3°) but tiny for pair 1 (θ=0.01 rad ≈ 0.57°).

EXECUTION STATE
query_idx = 1 → tracing 'cat'
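
To make the position-1 rotation concrete, the sketch below rotates the 'cat' query and key by hand. It assumes the same split-half pairing as `apply_rotation` (dimension i is paired with dimension i + d/2); the helper name `rotate` is illustrative.

```python
import numpy as np

theta = np.array([1.0, 0.01])   # frequencies for d_model=4, base=10000
pos = 1                         # 'cat' sits at position 1
angles = pos * theta            # [1.0, 0.01] radians
print(np.degrees(angles))       # roughly [57.3, 0.57] degrees

def rotate(x, angles):
    """Rotate pairs (x[i], x[i + d/2]) by angles[i], as in apply_rotation."""
    half = x.shape[0] // 2
    x1, x2 = x[:half], x[half:]
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a])

q_cat = np.array([0.0, 2.0, 0.0, 1.0])  # pair 0 = (0, 0), pair 1 = (2, 1)
k_cat = np.array([1.0, 0.0, 1.0, 0.0])  # pair 0 = (1, 1), pair 1 = (0, 0)

print(np.round(rotate(q_cat, angles), 4))
print(np.round(rotate(k_cat, angles), 4))
```

In this particular example Q[cat] barely changes, because its fast pair is (0, 0) and only the slow pair carries content. K[cat] has its content in the fast pair, so the 1-radian turn is clearly visible there.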
import numpy as np
import math

class RotaryPositionAttention:
    """
    RoPE: Rotary Position Embedding (Su et al., 2021)

    Encodes position by ROTATING Query and Key vectors
    before computing their dot product. The rotation makes
    Q[m] · K[n] depend only on (m - n), the relative offset.

    Values V are NOT rotated; only Q and K receive position.
    """

    def __init__(self, d_model: int, base: int = 10000):
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)
        # Precompute frequencies: one theta per dimension pair
        self.theta = np.array([
            1.0 / (base ** (2 * i / d_model))
            for i in range(d_model // 2)
        ])

    def _softmax(self, x: np.ndarray) -> np.ndarray:
        """Numerically stable softmax along last axis."""
        x_shifted = x - np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def compute_angles(self, N: int) -> np.ndarray:
        """Compute rotation angles for positions 0..N-1."""
        positions = np.arange(N)
        return np.outer(positions, self.theta)

    def apply_rotation(self, X: np.ndarray) -> np.ndarray:
        """Apply RoPE rotation to a matrix X (N, d_model)."""
        N, D = X.shape
        angles = self.compute_angles(N)
        cos_a = np.cos(angles)
        sin_a = np.sin(angles)
        x1 = X[:, :D // 2]
        x2 = X[:, D // 2:]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return np.hstack([rotated_first, rotated_second])

    def compute_scores(self, Q_rot: np.ndarray, K_rot: np.ndarray):
        """Raw dot-product scores after rotation."""
        return Q_rot @ K_rot.T

    def scale_scores(self, scores: np.ndarray) -> np.ndarray:
        """Divide by sqrt(d_model)."""
        return scores / self.scale

    def compute_weights(self, scaled: np.ndarray) -> np.ndarray:
        """Apply softmax to get attention weights."""
        return self._softmax(scaled)

    def compute_output(self, weights: np.ndarray, V: np.ndarray):
        """Weighted sum of value vectors (V is NOT rotated)."""
        return weights @ V

    def forward(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
        """
        Full RoPE attention forward pass.

        Args:
            Q: Query matrix  (N, d_model)
            K: Key matrix    (N, d_model)
            V: Value matrix  (N, d_model), NOT rotated

        Returns:
            output:  Context vectors  (N, d_model)
            weights: Attention matrix (N, N)
        """
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        raw_scores = self.compute_scores(Q_rot, K_rot)
        scaled_scores = self.scale_scores(raw_scores)
        weights = self.compute_weights(scaled_scores)
        output = self.compute_output(weights, V)
        return output, weights

    def explain(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                tokens: list, query_idx: int = 0):
        """Print a detailed trace for a specific query token."""
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        token = tokens[query_idx]

        print(f"\n=== RoPE trace for '{token}' (pos {query_idx}) ===")
        print(f"Q[{token}] original = {Q[query_idx]}")
        print(f"Q[{token}] rotated  = {np.round(Q_rot[query_idx], 4)}")

        raw = self.compute_scores(Q_rot, K_rot)
        scaled = self.scale_scores(raw)
        w = self.compute_weights(scaled)
        out = self.compute_output(w, V)

        for j, t in enumerate(tokens):
            print(f"  score[{t}] = {raw[query_idx,j]:.4f}"
                  f" -> scaled = {scaled[query_idx,j]:.4f}"
                  f" -> weight = {w[query_idx,j]:.4f}")
        print(f"  output = {np.round(out[query_idx], 4)}")


# ── Shared Example (used in every chapter) ──
tokens = ["The", "cat", "sat", "on", "mat"]

Q = np.array([
    [1.0, 0.0, 1.0, 0.0],   # The
    [0.0, 2.0, 0.0, 1.0],   # cat
    [1.0, 1.0, 1.0, 0.0],   # sat
    [0.0, 0.0, 1.0, 1.0],   # on
    [1.0, 0.0, 0.0, 1.0],   # mat
])

K = np.array([
    [0.0, 1.0, 0.0, 1.0],   # The
    [1.0, 0.0, 1.0, 0.0],   # cat
    [1.0, 1.0, 0.0, 0.0],   # sat
    [0.0, 0.0, 1.0, 1.0],   # on
    [1.0, 0.0, 0.5, 0.5],   # mat
])

V = np.array([
    [1.0, 0.0, 0.0, 0.0],   # The
    [0.0, 1.0, 0.0, 0.0],   # cat
    [0.0, 0.0, 1.0, 0.0],   # sat
    [0.0, 0.0, 0.0, 1.0],   # on
    [0.5, 0.5, 0.5, 0.5],   # mat
])

# ── Run ──
rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)

print("RoPE Attention Weights (5x5):")
print(np.round(weights, 4))

print("\nRoPE Output (5x4):")
print(np.round(output, 4))

# Detailed trace
rope.explain(Q, K, V, tokens, query_idx=0)
rope.explain(Q, K, V, tokens, query_idx=1)

PyTorch Implementation

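A minimal PyTorch version of the same single-head attention, using register_buffer for the frequencies and F.softmax. It runs on a GPU after .to(device); the learned projection matrices that a production layer would have are left commented out.

RoPE Attention: PyTorch Implementation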
🐍rope_attention_torch.py
import torch

PyTorch tensor library. Provides GPU-accelerated tensor operations and automatic differentiation.

import torch.nn as nn

Neural network module. nn.Module is the base class for all PyTorch models.

import torch.nn.functional as F

Functional API. F.softmax() is used instead of a manual implementation because it is numerically stable and CUDA-optimized.

import math

Standard library for math.sqrt().

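class RotaryPositionAttention(nn.Module)

PyTorch module wrapping RoPE attention. Inheriting from nn.Module provides device transfer via .to(device), parameter and buffer tracking, and state_dict serialization. Automatic differentiation comes from the tensor operations themselves.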

def __init__(self, d_model, base)

Constructor. Registers frequency buffer and precomputes scale.

EXECUTION STATE
⬇ input: d_model = 4
⬇ input: base = 10000
super().__init__()

Initialize nn.Module. Required for PyTorch parameter tracking and hooks.

self.d_model = d_model

Store model dimension.

EXECUTION STATE
self.d_model = 4
self.base = base

Store base frequency.

EXECUTION STATE
self.base = 10000
self.scale = math.sqrt(d_model)

Precompute √d_model = 2.0 for score scaling.

EXECUTION STATE
self.scale = 2.0
theta = torch.tensor([...])

Create frequency tensor. Same formula as NumPy version: θᵢ = 1/base^(2i/d).

EXECUTION STATE
theta = tensor([1.0000, 0.0100])
self.register_buffer("theta", theta)

Register as a buffer (not a parameter). Buffers are saved with the model and move to GPU with .to(device), but are NOT updated by the optimizer. Frequencies are fixed, not learned.

EXECUTION STATE
register_buffer() = saved in state_dict, moves with .to(device), NOT trained by optimizer
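
The difference between a buffer and a parameter can be seen directly. This is a minimal sketch using a hypothetical `FreqHolder` module (not part of the chapter's code) that holds one buffer and one parameter for contrast.

```python
import torch
import torch.nn as nn

class FreqHolder(nn.Module):  # hypothetical helper, for illustration only
    def __init__(self):
        super().__init__()
        # Fixed frequencies: stored with the module, never trained.
        self.register_buffer("theta", torch.tensor([1.0, 0.01]))
        # A real trainable parameter, included only for contrast.
        self.weight = nn.Parameter(torch.zeros(2))

m = FreqHolder()
print([name for name, _ in m.named_parameters()])  # ['weight']
print([name for name, _ in m.named_buffers()])     # ['theta']
print(sorted(m.state_dict().keys()))               # ['theta', 'weight']
print(m.theta.requires_grad)                       # False
```

An optimizer built from `m.parameters()` would therefore see `weight` but not `theta`, while saving and loading the state dict keeps both.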
def apply_rotation(self, X) → torch.Tensor

Apply RoPE rotation. Identical to NumPy version but uses PyTorch ops for GPU and autograd support.

EXECUTION STATE
⬇ input: X (5×4) = Q or K tensor
⬆ returns = torch.Tensor (5, 4) — rotated
N, D = X.shape

Unpack dimensions.

EXECUTION STATE
N = 5
D = 4
positions = torch.arange(N, device=X.device, dtype=X.dtype)

Create position indices on the SAME device as X (CPU or GPU). Matching the dtype keeps the positions in the same floating-point precision as X, so the values are floats rather than integers.

EXECUTION STATE
device=X.device = ensures positions tensor is on same device (CPU/GPU) as input
dtype=X.dtype = positions use the same floating-point type as X (float32 here)
positions = tensor([0., 1., 2., 3., 4.])
angles = positions.unsqueeze(1) * self.theta.unsqueeze(0)

Broadcasting outer product. positions (5,1) × theta (1,2) → angles (5,2). Same as np.outer().

EXECUTION STATE
.unsqueeze(1) = positions (5,) → (5,1) — add column dimension for broadcasting
.unsqueeze(0) = theta (2,) → (1,2) — add row dimension for broadcasting
angles (5×2) =
        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400
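
The broadcasting step can be checked against `np.outer` in isolation. The sketch below uses NumPy indexing with `None` as a stand-in for `unsqueeze`; the array names are illustrative.

```python
import numpy as np

positions = np.arange(5, dtype=np.float64)   # shape (5,)
theta = np.array([1.0, 0.01])                # shape (2,)

# Broadcasting: (5, 1) * (1, 2) -> (5, 2), mirroring unsqueeze(1) / unsqueeze(0).
angles_broadcast = positions[:, None] * theta[None, :]
angles_outer = np.outer(positions, theta)

print(angles_broadcast.shape)                        # (5, 2)
print(np.allclose(angles_broadcast, angles_outer))   # True
print(angles_broadcast)
```

Row m of the result holds the angles m·θ₀ and m·θ₁ for the token at position m, which is the table shown in the trace.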
cos_a = torch.cos(angles)

Compute cosines for all angles.

EXECUTION STATE
cos_a shape = (5, 2)
sin_a = torch.sin(angles)

Compute sines for all angles.

EXECUTION STATE
sin_a shape = (5, 2)
x1 = X[:, :D // 2]

First half: columns [0, 1].

EXECUTION STATE
x1 shape = (5, 2)
x2 = X[:, D // 2:]

Second half: columns [2, 3].

EXECUTION STATE
x2 shape = (5, 2)
rotated_first = x1 * cos_a - x2 * sin_a

First rotation component. Element-wise, not matmul.

EXECUTION STATE
rotated_first shape = (5, 2)
rotated_second = x1 * sin_a + x2 * cos_a

Second rotation component.

EXECUTION STATE
rotated_second shape = (5, 2)
return torch.cat([rotated_first, rotated_second], dim=-1)

Concatenate along last dimension: [(5,2), (5,2)] → (5,4).

EXECUTION STATE
dim=-1 = concatenate along last axis (columns)
⬆ return shape = (5, 4) — rotated Q or K
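
The two element-wise lines in `apply_rotation` are a 2D rotation written out component by component. As a small check, the sketch below applies them to one pair and compares the result with an explicit 2×2 rotation matrix; the chosen pair (1, 1) and angle 1.0 correspond to pair 0 of K['cat'] at position 1.

```python
import numpy as np

angle = 1.0          # pair 0 at position 1 (theta_0 = 1.0)
x1, x2 = 1.0, 1.0    # pair 0 of K['cat'] = (1, 1)

# Element-wise form used in apply_rotation.
first = x1 * np.cos(angle) - x2 * np.sin(angle)
second = x1 * np.sin(angle) + x2 * np.cos(angle)

# The same operation written as an explicit 2x2 rotation matrix.
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
via_matrix = R @ np.array([x1, x2])

print(np.allclose([first, second], via_matrix))                # True
print(np.isclose(np.hypot(first, second), np.hypot(x1, x2)))   # True: length preserved
```

Because each pair is only rotated, the vector's length is unchanged; position affects the direction of Q and K, not their magnitude.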
def forward(self, Q, K, V)

Main forward pass. Rotates Q and K, computes attention, returns output and weights.

EXECUTION STATE
⬇ input: Q = shape (5, 4)
⬇ input: K = shape (5, 4)
⬇ input: V = shape (5, 4) — NOT rotated
⬆ returns = (output (5,4), weights (5,5))
Q_rot = self.apply_rotation(Q)

Rotate queries by position.

K_rot = self.apply_rotation(K)

Rotate keys by position.

scores = Q_rot @ K_rot.T / self.scale

Scaled dot-product of rotated Q and K. In PyTorch, @ is the matrix multiply operator, .T is transpose.

EXECUTION STATE
.T = transpose — K_rot (5×4) → (4×5)
/ self.scale = ÷ 2.0 (√4)
weights = F.softmax(scores, dim=-1)

PyTorch’s numerically stable softmax. dim=-1 means softmax over last dimension (each row independently).

EXECUTION STATE
dim=-1 = softmax along last axis — each row becomes a probability distribution
F.softmax vs manual = CUDA-optimized, numerically stable (subtracts max internally), supports autograd
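
Why the max subtraction matters is easiest to see with large scores. The sketch below is a NumPy illustration of the idea, not PyTorch's actual kernel; `softmax_naive` and `softmax_stable` are illustrative names.

```python
import numpy as np

def softmax_naive(x):
    """Direct definition: exp(x) / sum(exp(x)). Overflows for large x."""
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    """Subtract the row maximum first; the result is mathematically identical."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the nan result is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(big))            # exp overflows to inf, inf/inf gives nan

print(np.round(softmax_stable(big), 4))  # same as softmax of [-2, -1, 0]
```

Shifting every score in a row by the same constant does not change the softmax output, which is why the stable form is safe to use everywhere.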
output = weights @ V

Weighted sum of V (unrotated). Same as NumPy version.

return output, weights

Return context vectors and attention matrix.

EXECUTION STATE
⬆ return: output = shape (5, 4)
⬆ return: weights = shape (5, 5)
rope = RotaryPositionAttention(d_model=4, base=10000)

Instantiate model. theta buffer = [1.0, 0.01], scale = 2.0.

EXECUTION STATE
rope.theta = tensor([1.0000, 0.0100])
output, weights = rope.forward(Q, K, V)

Run RoPE attention. The result matches the NumPy version to within float32 precision (PyTorch tensors default to float32, NumPy arrays to float64).

EXECUTION STATE
output shape = (5, 4)
weights shape = (5, 5)
print(weights.detach().numpy().round(4))

.detach() returns a tensor that is cut off from the computation graph, and .numpy() converts it to a NumPy array for display. .detach() is required before .numpy() only when the tensor requires gradients; here it is a harmless safeguard.

EXECUTION STATE
.detach() = detach from autograd graph; needed before .numpy() if requires_grad=True
.numpy() = convert PyTorch tensor to NumPy array
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class RotaryPositionAttention(nn.Module):
    """
    RoPE: Rotary Position Embedding (Su et al., 2021)
    PyTorch implementation with GPU support.

    Rotates Q and K by position before computing attention.
    V is NOT rotated. Supports automatic differentiation.
    """

    def __init__(self, d_model: int, base: int = 10000):
        super().__init__()
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)

        # Register frequencies as a buffer (not a parameter)
        theta = torch.tensor([
            1.0 / (base ** (2 * i / d_model))
            for i in range(d_model // 2)
        ])
        self.register_buffer("theta", theta)

        # In production: learned projection matrices
        # self.W_q = nn.Linear(d_model, d_model)
        # self.W_k = nn.Linear(d_model, d_model)
        # self.W_v = nn.Linear(d_model, d_model)
        # self.W_o = nn.Linear(d_model, d_model)

    def apply_rotation(self, X: torch.Tensor) -> torch.Tensor:
        """Apply RoPE rotation. X shape: (N, d_model)."""
        N, D = X.shape
        positions = torch.arange(N, device=X.device, dtype=X.dtype)
        angles = positions.unsqueeze(1) * self.theta.unsqueeze(0)
        cos_a = torch.cos(angles)
        sin_a = torch.sin(angles)
        x1 = X[:, :D // 2]
        x2 = X[:, D // 2:]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return torch.cat([rotated_first, rotated_second], dim=-1)

    def forward(
        self,
        Q: torch.Tensor,
        K: torch.Tensor,
        V: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            Q: (N, d_model)  query matrix
            K: (N, d_model)  key matrix
            V: (N, d_model)  value matrix (NOT rotated)
        Returns:
            output:  (N, d_model) context vectors
            weights: (N, N) attention weights
        """
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        scores = Q_rot @ K_rot.T / self.scale
        weights = F.softmax(scores, dim=-1)
        output = weights @ V
        return output, weights


# ── Shared Example ──
tokens = ["The", "cat", "sat", "on", "mat"]

Q = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

K = torch.tensor([
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.5, 0.5],
])

V = torch.tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
])

# ── Run ──
rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)

print("RoPE Weights (PyTorch):")
print(weights.detach().numpy().round(4))

print("\nRoPE Output (PyTorch):")
print(output.detach().numpy().round(4))

Key Takeaways

  1. Rotation encodes position intrinsically. Unlike additive position embeddings or score biases, RoPE makes the dot product $Q_m \cdot K_n$ depend on relative position $(m - n)$ as a mathematical consequence, not an approximation.
  2. Multi-frequency design captures multiple scales. Fast-rotating pairs encode local position (is this word the immediate neighbor?), while slow-rotating pairs encode global position (is it in the same paragraph?). This helps RoPE handle longer sequences, although going far beyond the training length usually still calls for scaling methods such as NTK-aware scaling or YaRN.
  3. Zero additional parameters. Frequencies come from a formula, not a learned lookup table. This means no out-of-vocabulary positions and no parameter overhead.
  4. V is NOT rotated. Only Q and K carry positional information through rotation. V provides the content that gets aggregated.
  5. Compatible with common optimizations. RoPE works with Flash Attention, KV-cache, multi-head/multi-query/grouped-query attention, and context length extension techniques like YaRN.
  6. De facto standard for modern LLMs. LLaMA, Mistral, PaLM, Qwen, Falcon, and Phi all use RoPE, as do most major open-weight models released since 2022.
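
The first takeaway can be checked numerically. The sketch below is a small standalone test under stated assumptions: it uses the same split-half pairing as the chapter's code, random vectors with an arbitrary seed, and a hypothetical helper `rope_rotate` that rotates a single vector at a given position.

```python
import numpy as np

def rope_rotate(x, pos, theta):
    """Rotate one vector x (length d) as if it sat at position pos."""
    half = x.shape[0] // 2
    angles = pos * theta
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a])

d_model, base = 8, 10000
theta = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability
q, k = rng.normal(size=d_model), rng.normal(size=d_model)

# Same offset m - n = 3 at two different absolute positions.
score_a = rope_rotate(q, 5, theta) @ rope_rotate(k, 2, theta)
score_b = rope_rotate(q, 105, theta) @ rope_rotate(k, 102, theta)
# A different offset (m - n = 4) generally gives a different score.
score_c = rope_rotate(q, 6, theta) @ rope_rotate(k, 2, theta)

print(np.isclose(score_a, score_b))   # True: only the offset matters
print(score_a, score_c)
```

Shifting both positions by 100 leaves the score unchanged up to floating-point error, while changing the offset changes it.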

Exercises

  1. Verify the relative position property: Using the code above, create two identical vectors $\mathbf{x} = [1, 1, 1, 1]$ and place them at positions (0, 2) and (3, 5). Verify that the dot product of their rotated versions is the same in both cases, confirming that the score depends only on the distance 2.
  2. Frequency analysis: For $d = 128$ and base = 10000, compute all 64 frequencies. What is the wavelength (in positions) of the fastest and slowest pair? How many full rotations does pair 0 complete over a 4096-token sequence?
  3. NTK scaling experiment: Modify the code to use base = 100000 instead of 10000. Recompute the attention weights. How does this affect which tokens "on" attends to? Why would this help with longer sequences?
  4. Compare with Chapter 7: Run both the relative position bias attention (Chapter 7) and RoPE attention on the same input. Which produces stronger local preference? What happens when you increase the sequence length to 20 tokens?
  5. 2D RoPE for images: Design a 2D RoPE scheme for a 4×4 image patch grid (16 tokens). Each token has a (row, col) position. Apply separate rotations for row-position and column-position using different frequency bands. What properties does this have compared to 1D RoPE?

References

  1. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
  2. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
  3. Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
  4. bloc97. (2023). "NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size Without Any Fine-tuning and Minimal Perplexity Degradation." Reddit/r/LocalLLaMA.
  5. Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071.
  6. Black, S., Biderman, S., Hallahan, E., et al. (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745.
  7. Liu, H., Yan, W., Zaharia, M., & Abbeel, P. (2024). "World Model on Million-Length Video and Language with RingAttention." arXiv:2402.08268.