Boo-AI — Master Artificial Intelligence by Building from Scratch

Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021

Learning Objectives

After completing this section, you will be able to:

Explain why encoding position through rotation is fundamentally different from adding position to embeddings or biasing attention scores, and what mathematical advantage this provides.
Derive and interpret the 2D rotation formula that transforms Query and Key vectors, understanding each symbol and matrix operation involved.
Understand the multi-frequency design that gives RoPE both local and global position sensitivity — and why this allows generalization to unseen sequence lengths.
Implement RoPE from scratch in NumPy and PyTorch, applying it to the shared "the cat sat on the mat" example.
Connect RoPE to modern systems including LLaMA, GPT-NeoX, PaLM, CodeLlama, and understand how it interacts with Flash Attention, KV-cache, and context length extension techniques like YaRN and NTK-aware scaling.

Where RoPE appears: LLaMA 1/2/3 (Meta), GPT-NeoX (EleutherAI), PaLM (Google), CodeLlama, Mistral, Qwen, Falcon, Phi. Most major open-weight LLMs released since 2022 use RoPE, and it has become the default positional encoding for modern transformers.

The Real Problem

Transformers process all tokens in parallel — unlike RNNs, they have no built-in notion of sequential order. Without positional information, the sentence "the cat sat on the mat" would be indistinguishable from "mat the on sat cat the." The attention mechanism computes $Q \cdot K^T$ , which depends only on what the tokens are, not where they are. This is the position encoding problem, and it has been solved in several ways, each with different tradeoffs.
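To make the problem concrete, here is a minimal sketch showing that content-only attention scores carry no order information. The embeddings and the simplification Q = K = X are assumptions chosen for illustration, not the chapter's shared example.

```python
import numpy as np

# Toy embeddings for three tokens (arbitrary illustrative values).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

def scores(X):
    """Content-only attention scores with Q = K = X (no position signal)."""
    return X @ X.T

perm = [2, 0, 1]                 # reorder the tokens
S = scores(X)
S_perm = scores(X[perm])

# Reordering the inputs only reorders rows and columns of the score matrix:
# nothing in the scores records where a token sat in the sequence.
same_up_to_reordering = bool(np.allclose(S_perm, S[np.ix_(perm, perm)]))
print(same_up_to_reordering)
```

Each pair of tokens gets the same score regardless of where the two tokens appear, which is exactly why some positional signal has to be injected.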

Position as Addition — The Limitation

The original Transformer (Vaswani et al., 2017) solved this by adding a position vector to each token embedding: $x_m = \text{token}_m + \text{PE}(m)$ . This is simple and elegant, but it has two fundamental problems.

First, the position signal gets diluted as it passes through layers. After multiple rounds of self-attention, layer normalization, and feed-forward transformations, the additive positional component can lose its influence. The model must "remember" the position it was told about many layers ago.

Second, the dot product $(x_m + \text{PE}(m))^T(x_n + \text{PE}(n))$ expands into four terms: $x_m^T x_n + x_m^T \text{PE}(n) + \text{PE}(m)^T x_n + \text{PE}(m)^T \text{PE}(n)$ . Only the last term captures purely positional interaction. The cross terms $x_m^T \text{PE}(n)$ mix content and position in ways the model must learn to disentangle — this is wasted capacity.
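The four-term expansion can be checked numerically. The sketch below uses random vectors as stand-ins for the token embeddings and for PE(m), PE(n); the names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_m, x_n = rng.normal(size=4), rng.normal(size=4)      # token embeddings
pe_m, pe_n = rng.normal(size=4), rng.normal(size=4)    # stand-ins for PE(m), PE(n)

full_score = (x_m + pe_m) @ (x_n + pe_n)

terms = {
    "content x content":   x_m @ x_n,
    "content x position":  x_m @ pe_n,
    "position x content":  pe_m @ x_n,
    "position x position": pe_m @ pe_n,
}
for name, value in terms.items():
    print(f"{name:>20}: {value:+.4f}")

# The four terms add up to the full additive-PE score.
total = sum(terms.values())
```

Only one of the four printed terms is purely positional; the two cross terms are the mixed contributions the paragraph above describes.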

Position as Score Bias — Better but Still External

Relative position biases (Chapter 7) improve on this by adding a learned bias directly to the attention score: $\text{score}(m, n) = Q_m \cdot K_n + b(m - n)$ . This cleanly separates content matching from position matching. However, the bias $b(m-n)$ is typically a learned lookup table with a fixed maximum distance. When the model encounters a sequence longer than what it was trained on, the bias has no entry for large $|m - n|$ values, and generalization breaks down.
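A small sketch of why a lookup-table bias stops at its largest trained offset. The table size, the random values standing in for learned biases, and the clipping workaround are all illustrative assumptions rather than the method of any particular model.

```python
import numpy as np

max_distance = 3   # hypothetical largest offset seen during training
# One bias per offset in [-max_distance, +max_distance]; random stand-ins
# for values a real model would learn.
bias_table = np.random.default_rng(0).normal(size=2 * max_distance + 1)

def bias(m, n, clip=False):
    """Look up b(m - n); optionally reuse the edge entry for far offsets."""
    offset = m - n
    if clip:
        offset = max(-max_distance, min(max_distance, offset))
    if abs(offset) > max_distance:
        raise IndexError(f"no learned bias for offset {offset}")
    return bias_table[offset + max_distance]

print(bias(5, 3))                 # offset 2: covered by the table
print(bias(10, 0, clip=True))     # offset 10: falls back to the entry for offset 3
```

Without clipping, `bias(10, 0)` raises an error because the table has no entry; with clipping, every distant offset shares one value, so the model cannot tell distance 10 from distance 1000.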

The core challenge: We need a method that (1) makes the dot product $Q_m \cdot K_n$ inherently depend on relative position $(m - n)$ , (2) does not dilute through layers because it is applied directly at the attention computation, and (3) generalizes smoothly to longer sequences without requiring learned parameters for each distance. RoPE achieves all three.

From Intuition to Rotation

The Rotation Insight

Jianlin Su and his coauthors at ZhuiYi Technology asked a beautiful question: is there a transformation $f(x, m)$ that we can apply to both Q and K such that the inner product $\langle f(Q_m, m), f(K_n, n) \rangle$ depends only on $Q_m$ , $K_n$ , and the relative position $(m - n)$ ?

The answer comes from a well-known property of 2D rotations. If you rotate a vector $\mathbf{a}$ by angle $\alpha$ and another vector $\mathbf{b}$ by angle $\beta$ , their dot product depends only on the difference of angles $(\alpha - \beta)$ :

$R(\alpha)\mathbf{a} \cdot R(\beta)\mathbf{b} = \mathbf{a}^T R(\alpha)^T R(\beta)\, \mathbf{b} = \mathbf{a}^T R(\beta - \alpha)\, \mathbf{b}$

This is the key identity. If we set $\alpha = m\theta$ and $\beta = n\theta$ for some frequency $\theta$ , then the dot product after rotation depends on $(m - n)\theta$ — exactly the relative position, scaled by frequency. This is not an approximation. It is an exact mathematical consequence of rotation in 2D.

The idea is simple: rotate Q by position $m$ and rotate K by position $n$ before computing their dot product. The dot product will automatically encode relative position.
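The rotation property is easy to confirm numerically. In this sketch the vectors `a`, `b` and the frequency `theta` are arbitrary values picked for the demo.

```python
import numpy as np

def R(angle):
    """2D rotation matrix for an angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s],
                     [s,  c]])

a = np.array([1.0, 2.0])     # stands in for one (q1, q2) pair
b = np.array([0.5, -1.0])    # stands in for one (k1, k2) pair
theta = 0.7                  # arbitrary frequency for the demo

def score(m, n):
    """Dot product after rotating a by m*theta and b by n*theta."""
    return (R(m * theta) @ a) @ (R(n * theta) @ b)

s_near = score(5, 3)         # offset m - n = 2
s_far = score(105, 103)      # same offset, both positions shifted by 100
s_other = score(5, 4)        # a different offset

print(s_near, s_far, s_other)
```

The first two printed values agree (same offset), while the third differs, which is the behavior the identity predicts.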

The Multi-Frequency Design

A single rotation frequency can only encode one scale of position. To encode both "this word is 2 positions away" (local) and "this word is 500 positions away" (global), RoPE uses multiple frequencies, one for each pair of embedding dimensions. This is directly inspired by the sinusoidal positional encoding from the original Transformer, which also uses a geometric sequence of frequencies.

With embedding dimension $d$ , RoPE splits the vector into $d/2$ pairs. Each pair $i$ gets its own frequency $\theta_i = 1/10000^{2i/d}$ . The first pair ( $i=0$ ) has $\theta_0 = 1.0$ — it rotates by 1 radian per position (fast, local). The last pair has a tiny $\theta$ and barely rotates even at position 1000 (slow, global). Together, these form a multi-scale positional representation, analogous to wavelets or Fourier decomposition.

The Mathematical Definition

Symbol-by-Symbol Breakdown

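RoPE transforms a vector $\mathbf{x}$ at position $m$ by applying a 2D rotation to each pair of dimensions. This chapter pairs the $i$ -th first-half dimension with the $i$ -th second-half dimension (the original paper pairs adjacent dimensions $(2i, 2i+1)$ instead; the two layouts differ only by a fixed permutation of dimensions). For dimension pair $i$ , the rotation is: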

$x_i' = x_i \cos(m\theta_i) - x_{i+d/2} \sin(m\theta_i)$

$x_{i+d/2}' = x_i \sin(m\theta_i) + x_{i+d/2} \cos(m\theta_i)$

Let us define every symbol precisely:

Symbol	Meaning	In Our Example
m	Absolute position of the token in the sequence	The=0, cat=1, sat=2, on=3, mat=4
d	Embedding dimension (total number of dimensions)	4
i	Pair index, ranging from 0 to d/2 - 1	0 or 1 (two pairs for d=4)
θᵢ	Frequency for pair i: 1/10000^(2i/d)	θ₀=1.0, θ₁=0.01
mθᵢ	Rotation angle = position × frequency	cat: 1×1.0=1.0 rad, 1×0.01=0.01 rad
xᵢ	i-th element from the first half of dimensions	For Q[cat]: x₀=0.0, x₁=2.0
x_{i+d/2}	i-th element from the second half (paired with xᵢ)	For Q[cat] (d/2 = 2): x₂=0.0, x₃=1.0
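The two rotation equations translate almost line for line into code. This is a minimal pure-Python sketch using the half-split pairing from the table; the function name `rope_rotate` is an illustrative choice.

```python
import math

def rope_rotate(x, m, base=10000.0):
    """Rotate vector x (a list of length d) as if it sat at position m."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta_i = 1.0 / base ** (2 * i / d)        # frequency for pair i
        c, s = math.cos(m * theta_i), math.sin(m * theta_i)
        out[i] = x[i] * c - x[i + half] * s         # x_i'
        out[i + half] = x[i] * s + x[i + half] * c  # x_{i+d/2}'
    return out

q_cat = rope_rotate([0.0, 2.0, 0.0, 1.0], m=1)      # Q[cat] at position 1
print([round(v, 4) for v in q_cat])
```

For Q[cat], pair 0 is (0.0, 0.0) and stays at zero, while pair 1 is (2.0, 1.0) and rotates by the small angle 0.01 rad.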

The Frequency Formula

The frequency for dimension pair $i$ is: $\theta_i = \frac{1}{10000^{2i/d}}$

This creates a geometric sequence from $\theta_0 = 1.0$ down to $\theta_{d/2-1} \approx 0.0001$ for typical $d = 128$ . The base 10000 was chosen by Vaswani et al. (2017) for the original sinusoidal encoding and reused by Su et al. for RoPE.

For our example with $d = 4$ :

Pair 0 ( $i = 0$ ): $\theta_0 = 1/10000^{0/4} = 1/10000^0 = 1.000000$ — fast rotation, 1 radian per position (57.3° per step)
Pair 1 ( $i = 1$ ): $\theta_1 = 1/10000^{2/4} = 1/10000^{0.5} = 1/100 = 0.010000$ — slow rotation, 0.01 radian per position (0.57° per step)

Pair 0 rotates 100 times faster than pair 1. This means pair 0 distinguishes nearby positions (local context), while pair 1 distinguishes distant positions (global context). In a real model with $d = 128$ , you would have 64 pairs spanning about four orders of magnitude in frequency (from 1.0 down to roughly 0.0001).
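A short sketch of the frequency formula, evaluated for the toy case $d = 4$ and for $d = 128$. The helper name `rope_frequencies` is an illustrative choice; the wavelength (positions per full turn) is simply $2\pi/\theta_i$.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return 1.0 / base ** (2 * i / d)

theta_small = rope_frequencies(4)        # the chapter's toy example
theta_128 = rope_frequencies(128)        # a typical head dimension

# Positions needed for one full 2*pi turn of each pair.
wavelengths = 2 * np.pi / theta_128

print(theta_small)
print(theta_128[0], theta_128[-1])
print(wavelengths[0], wavelengths[-1])
```

The fastest pair completes a turn in about 6.28 positions, while the slowest pair needs tens of thousands of positions, which is the multi-scale behavior described above.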

Why the Dot Product Depends Only on Relative Position

Consider a single dimension pair. After rotating Q at position $m$ and K at position $n$ , the contribution of this pair to the dot product is:

$(q_1 \cos m\theta - q_2 \sin m\theta)(k_1 \cos n\theta - k_2 \sin n\theta) + (q_1 \sin m\theta + q_2 \cos m\theta)(k_1 \sin n\theta + k_2 \cos n\theta)$

Expanding and using the identities $\cos(A)\cos(B) + \sin(A)\sin(B) = \cos(A - B)$ and $\sin(A)\cos(B) - \cos(A)\sin(B) = \sin(A - B)$ :

$= q_1 k_1 \cos((m-n)\theta) + q_1 k_2 \sin((m-n)\theta) - q_2 k_1 \sin((m-n)\theta) + q_2 k_2 \cos((m-n)\theta)$

Every term depends on $(m - n)\theta$ , the relative distance scaled by frequency. The absolute positions $m$ and $n$ have vanished. This is the mathematical guarantee that RoPE provides: the attention score is a function of content and relative position only.
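As a sanity check on the algebra, the sketch below compares the rotate-then-dot computation with a closed form written purely in terms of $(m - n)\theta$, obtained by expanding the product with the 2D rotation formula. The test vectors and positions are arbitrary.

```python
import math

def rotate_pair(v1, v2, angle):
    """Standard 2D rotation of the pair (v1, v2)."""
    c, s = math.cos(angle), math.sin(angle)
    return v1 * c - v2 * s, v1 * s + v2 * c

def direct(q, k, m, n, theta):
    """Rotate q by m*theta and k by n*theta, then take the dot product."""
    q1, q2 = rotate_pair(q[0], q[1], m * theta)
    k1, k2 = rotate_pair(k[0], k[1], n * theta)
    return q1 * k1 + q2 * k2

def closed_form(q, k, m, n, theta):
    """Same quantity expressed only through the offset (m - n)."""
    delta = (m - n) * theta
    (q1, q2), (k1, k2) = q, k
    return (q1 * k1 + q2 * k2) * math.cos(delta) + (q1 * k2 - q2 * k1) * math.sin(delta)

q, k, theta = (1.0, 2.0), (0.5, -1.0), 0.01
print(direct(q, k, 7, 3, theta), closed_form(q, k, 7, 3, theta))
print(direct(q, k, 1007, 1003, theta))   # same offset, shifted positions
```

All three printed values coincide up to floating-point error, since only the offset of 4 positions enters the result.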

Why V is NOT rotated: Values carry the content that gets aggregated by attention. Position should affect which tokens attend to each other (via Q and K), not what information flows between them. Rotating V would corrupt the content representation with positional artifacts.

Interactive: 2D Rotation Visualizer

The visualizer below shows how RoPE rotates a single dimension pair. Select a token to see how its position determines the rotation angle. Notice that "The" (position 0) has zero rotation, while later tokens rotate progressively more.


Interactive: Multi-Frequency Explorer

This explorer shows all dimension pairs simultaneously. Move the position slider and observe how low-index pairs (red) oscillate rapidly while high-index pairs (purple/pink) barely move. This multi-scale structure is what enables RoPE to encode both local and global position.


Step-by-Step Calculation

We use the same shared example from every chapter: tokens = ["The", "cat", "sat", "on", "mat"] with $d = 4$ and the standard Q, K, V matrices. The implementation pairs the first half of dimensions with the second half: pair 0 = (col 0, col 2), pair 1 = (col 1, col 3).

Step 1: Compute Frequencies

With $d = 4$ and base = 10000, we have 2 dimension pairs:

$\theta_0 = 1/10000^{0/4} = 1.000000$ (pair 0: dims 0 and 2)
$\theta_1 = 1/10000^{2/4} = 0.010000$ (pair 1: dims 1 and 3)

Rotation angles = position $\times$ frequency:

Token	Position	Angle (pair 0)	Angle (pair 1)
The	0	0 × 1.0 = 0.0000	0 × 0.01 = 0.0000
cat	1	1 × 1.0 = 1.0000	1 × 0.01 = 0.0100
sat	2	2 × 1.0 = 2.0000	2 × 0.01 = 0.0200
on	3	3 × 1.0 = 3.0000	3 × 0.01 = 0.0300
mat	4	4 × 1.0 = 4.0000	4 × 0.01 = 0.0400
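The angle table is just an outer product of positions and frequencies. A minimal NumPy sketch for the five tokens:

```python
import numpy as np

d, base = 4, 10000.0
theta = 1.0 / base ** (2 * np.arange(d // 2) / d)   # [1.0, 0.01]
positions = np.arange(5)                             # The, cat, sat, on, mat

# angles[m, i] = m * theta_i
angles = np.outer(positions, theta)
print(angles)
```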

Step 2: Rotate All Q Vectors

For each token, we split Q into first-half $x_1 = Q[\text{:,\,:2}]$ and second-half $x_2 = Q[\text{:,\,2:}]$ , then apply the rotation:

"The" (pos=0): angle = 0 for both pairs, so cos=1, sin=0. $Q_{\text{rot}}$ [The] = Q[The] = [1.0000, 0.0000, 1.0000, 0.0000] — no change.

"cat" (pos=1): Q[cat] = [0.0, 2.0, 0.0, 1.0], so $x_1$ = [0.0, 2.0], $x_2$ = [0.0, 1.0]

Pair 0 (angle=1.0): $0.0 \times \cos(1.0) - 0.0 \times \sin(1.0) = 0.0000$ , $0.0 \times \sin(1.0) + 0.0 \times \cos(1.0) = 0.0000$
Pair 1 (angle=0.01): $2.0 \times 0.99995 - 1.0 \times 0.01000 \approx 1.9899$ , $2.0 \times 0.01000 + 1.0 \times 0.99995 \approx 1.0199$

$Q_{\text{rot}}$ [cat] = [0.0000, 1.9899, 0.0000, 1.0199]

"sat" (pos=2): Q[sat] = [1.0, 1.0, 1.0, 0.0], angle pair 0 = 2.0 (cos = −0.4161, sin = 0.9093)

Pair 0: $1.0 \times (-0.4161) - 1.0 \times 0.9093 = -1.3254$ , $1.0 \times 0.9093 + 1.0 \times (-0.4161) = 0.4932$
Pair 1 (angle=0.02): $1.0 \times 0.9998 - 0.0 \times 0.0200 = 0.9998$ , $1.0 \times 0.0200 + 0.0 \times 0.9998 = 0.0200$

$Q_{\text{rot}}$ [sat] = [−1.3254, 0.9998, 0.4932, 0.0200]

Full rotated Q matrix:

Token	d0	d1	d2	d3
The	1.0000	0.0000	1.0000	0.0000
cat	0.0000	1.9899	0.0000	1.0199
sat	-1.3254	0.9998	0.4932	0.0200
on	-0.1411	-0.0300	-0.9900	0.9996
mat	-0.6536	-0.0400	-0.7568	0.9992

Step 3: Rotate All K Vectors

Same rotation angles applied to K:

Token	d0	d1	d2	d3
The	0.0000	1.0000	0.0000	1.0000
cat	-0.3012	0.0000	1.3818	0.0000
sat	-0.4161	0.9998	0.9093	0.0200
on	-0.1411	-0.0300	-0.9900	0.9996
mat	-0.2752	-0.0200	-1.0836	0.4996
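Steps 2 and 3 can be reproduced in a few lines. This sketch assumes the half-split pairing used throughout the chapter; the helper name `rope` is an illustrative choice.

```python
import numpy as np

Q = np.array([[1.0, 0.0, 1.0, 0.0],    # The
              [0.0, 2.0, 0.0, 1.0],    # cat
              [1.0, 1.0, 1.0, 0.0],    # sat
              [0.0, 0.0, 1.0, 1.0],    # on
              [1.0, 0.0, 0.0, 1.0]])   # mat
K = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.5]])

def rope(X, base=10000.0):
    """Rotate row m of X by angles m * theta_i (half-split pairing)."""
    N, D = X.shape
    theta = 1.0 / base ** (2 * np.arange(D // 2) / D)
    ang = np.outer(np.arange(N), theta)
    x1, x2 = X[:, :D // 2], X[:, D // 2:]
    return np.hstack([x1 * np.cos(ang) - x2 * np.sin(ang),
                      x1 * np.sin(ang) + x2 * np.cos(ang)])

Q_rot, K_rot = rope(Q), rope(K)
print(np.round(Q_rot, 4))
print(np.round(K_rot, 4))
```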

Step 4: Compute Scaled Scores

Raw scores = $Q_{\text{rot}} \times K_{\text{rot}}^T$ , then divide by $\sqrt{4} = 2.0$ :

	The	cat	sat	on	mat
The	0.0000	0.5403	0.2466	-0.5656	-0.6794
cat	1.5049	0.0000	1.0049	0.4799	0.2349
sat	0.5099	0.5403	1.0000	-0.1556	-0.0898
on	0.4848	-0.6627	-0.4257	1.0000	0.8058
mat	0.4796	-0.4244	-0.2181	0.9207	0.7500
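The scaled score table follows from one matrix product. The sketch repeats the rotation helper so that it runs on its own; `rope` is an illustrative name.

```python
import numpy as np

Q = np.array([[1, 0, 1, 0], [0, 2, 0, 1], [1, 1, 1, 0],
              [0, 0, 1, 1], [1, 0, 0, 1]], dtype=float)
K = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
              [0, 0, 1, 1], [1, 0, 0.5, 0.5]], dtype=float)

def rope(X, base=10000.0):
    """Rotate row m of X by angles m * theta_i (half-split pairing)."""
    N, D = X.shape
    theta = 1.0 / base ** (2 * np.arange(D // 2) / D)
    ang = np.outer(np.arange(N), theta)
    x1, x2 = X[:, :D // 2], X[:, D // 2:]
    return np.hstack([x1 * np.cos(ang) - x2 * np.sin(ang),
                      x1 * np.sin(ang) + x2 * np.cos(ang)])

raw = rope(Q) @ rope(K).T            # (5, 5) rotated dot products
scaled = raw / np.sqrt(Q.shape[1])   # divide by sqrt(d) = 2.0
print(np.round(scaled, 4))
```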

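Notice how "on" (pos=3) gives its highest scores to itself (1.0000) and to "mat" (pos=4, score 0.8058), its nearest neighbor, while "cat" (pos=1, score −0.6627) and "sat" (pos=2, score −0.4257) receive negative scores. Without rotation, the scaled content-only scores for this row would be [0.5, 0.5, 0.0, 1.0, 0.5], so the rotation has raised the score for the adjacent token and lowered it for two of the more distant ones. The effect is not a strict decay with distance, though: "The" (pos=0), the farthest token, still scores 0.4848.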

Step 5: Apply Softmax

Softmax converts scaled scores to probabilities (each row sums to 1.0). Detailed computation for "The" (row 0):

Scaled scores: [0.0000, 0.5403, 0.2466, −0.5656, −0.6794]
Max = 0.5403, shifted = [−0.5403, 0.0000, −0.2937, −1.1059, −1.2197]
exp = [0.5826, 1.0000, 0.7455, 0.3309, 0.2953], sum = 2.9543
Softmax: [0.1972, 0.3385, 0.2523, 0.1120, 0.1000]

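"The" attends most strongly to "cat" (0.3385), which is the next token. Compared with the content-only scaled scores for this row, [0.0, 1.0, 0.5, 0.5, 0.75], the rotation lowers every nonzero score but penalizes the distant tokens "on" and "mat" far more than the nearby "cat" and "sat".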
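The softmax arithmetic for row 0 can be replayed directly from the rounded scaled scores listed above, using only the standard library:

```python
import math

scaled_the = [0.0000, 0.5403, 0.2466, -0.5656, -0.6794]

row_max = max(scaled_the)                           # subtract max for stability
exps = [math.exp(s - row_max) for s in scaled_the]
total = sum(exps)
weights_the = [e / total for e in exps]

print([round(w, 4) for w in weights_the])
```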

Step 6: Compute Output

Output for "The" = weighted sum of V (unrotated):

0.1972 $\times$ V[The] + 0.3385 $\times$ V[cat] + 0.2523 $\times$ V[sat] + 0.1120 $\times$ V[on] + 0.1000 $\times$ V[mat]
= [0.1972, 0.0000, 0.0000, 0.0000] + [0.0000, 0.3385, 0.0000, 0.0000] + [0.0000, 0.0000, 0.2523, 0.0000] + [0.0000, 0.0000, 0.0000, 0.1120] + [0.0500, 0.0500, 0.0500, 0.0500]
= [0.2472, 0.3885, 0.3023, 0.1620]
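The weighted sum for "The" is a single vector-matrix product over the unrotated V. A minimal check using the rounded weights from Step 5:

```python
import numpy as np

weights_the = np.array([0.1972, 0.3385, 0.2523, 0.1120, 0.1000])
V = np.array([[1.0, 0.0, 0.0, 0.0],    # The
              [0.0, 1.0, 0.0, 0.0],    # cat
              [0.0, 0.0, 1.0, 0.0],    # sat
              [0.0, 0.0, 0.0, 1.0],    # on
              [0.5, 0.5, 0.5, 0.5]])   # mat

# V is used as-is: no rotation is applied to values.
output_the = weights_the @ V
print(np.round(output_the, 4))
```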

Full Attention Weights and Output

Attention Weights — RoPE ( $5 \times 5$ )

	The	cat	sat	on	mat
The	0.1972	0.3385	0.2523	0.1120	0.1000
cat	0.4052	0.0900	0.2457	0.1454	0.1138
sat	0.2116	0.2181	0.3454	0.1088	0.1162
on	0.2095	0.0665	0.0843	0.3508	0.2889
mat	0.2098	0.0849	0.1044	0.3260	0.2749

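Observation: Every token gives its highest attention weight to itself or its nearest neighbor. "on" and "mat" (positions 3 and 4) attend heavily to each other (0.2889 and 0.3260). "cat" attends most to "The" (0.4052), its immediate predecessor. This local preference is not due to RoPE alone: some of it is already present in the content (without rotation, "cat" still scores highest against "The"), and the rotation sharpens it by lowering scores for more distant pairs. The content-only attention (Chapter 1) has a different pattern; for example, "mat" attends most to itself there rather than to "on".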
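To separate the effect of content from the effect of rotation, the sketch below computes attention weights with and without RoPE on the shared example and prints each token's most-attended token. The helper names are illustrative.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]
Q = np.array([[1, 0, 1, 0], [0, 2, 0, 1], [1, 1, 1, 0],
              [0, 0, 1, 1], [1, 0, 0, 1]], dtype=float)
K = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
              [0, 0, 1, 1], [1, 0, 0.5, 0.5]], dtype=float)

def rope(X, base=10000.0):
    """Rotate row m of X by angles m * theta_i (half-split pairing)."""
    N, D = X.shape
    theta = 1.0 / base ** (2 * np.arange(D // 2) / D)
    ang = np.outer(np.arange(N), theta)
    x1, x2 = X[:, :D // 2], X[:, D // 2:]
    return np.hstack([x1 * np.cos(ang) - x2 * np.sin(ang),
                      x1 * np.sin(ang) + x2 * np.cos(ang)])

def softmax(S):
    """Row-wise numerically stable softmax."""
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

scale = np.sqrt(Q.shape[1])
w_content = softmax(Q @ K.T / scale)            # no positional signal
w_rope = softmax(rope(Q) @ rope(K).T / scale)   # with RoPE

top_content = w_content.argmax(axis=1).tolist()
top_rope = w_rope.argmax(axis=1).tolist()
for t, c, r in zip(tokens, top_content, top_rope):
    print(f"{t:>3}: content-only -> {tokens[c]:<3}  RoPE -> {tokens[r]}")
```

Note that "sat" has an exact tie between "cat" and "sat" in the content-only scores, so `argmax` reports the first of the two.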

Output Matrix — RoPE ( $5 \times 4$ )

	d0	d1	d2	d3
The	0.2472	0.3885	0.3023	0.1620
cat	0.4620	0.1468	0.3026	0.2023
sat	0.2697	0.2762	0.4035	0.1668
on	0.3540	0.2109	0.2287	0.4952
mat	0.3472	0.2224	0.2418	0.4635

Interactive: RoPE vs Standard Attention

Compare RoPE attention weights with standard (position-free) attention. Toggle between views and hover over cells to see exact values. The "Difference" view highlights where RoPE increases (green) or decreases (red) attention compared to the standard mechanism.


Applications Across Domains

RoPE was designed for language modeling but its mathematical properties make it valuable across many domains:

Domain	Application	Why RoPE Helps
Large Language Models	LLaMA, Mistral, Qwen, Phi	Enables clean relative position without learned parameters. Models can be extended to longer contexts via NTK-aware scaling or YaRN.
Code Generation	CodeLlama, DeepSeek-Coder	Code has deeply nested structure with long-range dependencies (function calls 100s of lines away). RoPE's multi-frequency bands capture both local syntax and global structure.
Vision Transformers	EVA-02, InternVL	2D images are flattened to 1D token sequences. RoPE can be extended to 2D by applying separate rotations for row and column positions.
Scientific Computing	ESM-2 (protein language model)	Protein residue interactions depend on sequence distance. RoPE's relative position property matches this dependence on residue separation.
Long-Context Tasks	128K+ token models	RoPE extrapolates more gracefully than learned embeddings. YaRN scaling extends 4K models to 128K+ with minimal fine-tuning.

Connection to Modern Systems

Flash Attention Compatibility

RoPE is applied before the attention computation: you rotate Q and K, then pass them to standard scaled dot-product attention. This means RoPE is fully compatible with Flash Attention (Chapter 13) — the rotation happens outside the memory-efficient kernel. In practice, most implementations apply RoPE as a preprocessing step, then call $\texttt{flash\_attn\_func(Q\_rot, K\_rot, V)}$ .

KV-Cache and Incremental Rotation

During autoregressive generation, K and V are cached to avoid recomputation. With RoPE, each K vector is rotated by its position once when generated, then stored in the cache already rotated. This means the KV-cache works identically to standard attention — no recomputation of rotations is needed when the cache grows.
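A minimal sketch of decoding with a rotated key cache. Random matrices stand in for real projections, and `rope_one` is a hypothetical helper that rotates a single vector for a given position; a production implementation would batch this and cache V as well.

```python
import numpy as np

def rope_one(x, pos, base=10000.0):
    """Rotate one vector x of shape (D,) as if it sat at position pos."""
    D = x.shape[0]
    theta = 1.0 / base ** (2 * np.arange(D // 2) / D)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[:D // 2], x[D // 2:]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c])

rng = np.random.default_rng(0)
N, D = 6, 8
Q = rng.normal(size=(N, D))    # stand-in queries, one per decoding step
K = rng.normal(size=(N, D))    # stand-in keys, one per decoding step

k_cache = []                   # rotated keys, appended once per step
last_scores = None
for pos in range(N):
    k_cache.append(rope_one(K[pos], pos))     # rotate by own position, store
    q_rot = rope_one(Q[pos], pos)             # only the new query is rotated
    last_scores = np.stack(k_cache) @ q_rot   # scores against all cached keys

# Reference: rotate every key from scratch at the final step.
K_full = np.stack([rope_one(K[p], p) for p in range(N)])
reference = K_full @ rope_one(Q[N - 1], N - 1)
```

Because each key's rotation depends only on its own position, cached entries never need to be touched again as the sequence grows.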

Context Length Extension

Because RoPE uses a mathematical formula (not a learned table), it can be extended to longer sequences by modifying the frequency base:

NTK-aware interpolation (bloc97, 2023): scales the base from 10000 to a larger value, effectively stretching the wavelengths so existing RoPE representations cover more positions without retraining.
YaRN (Peng et al., 2023): combines NTK scaling with an attention temperature adjustment. Extends a 4K context model to 128K+ with a small amount of fine-tuning.
Dynamic NTK: adjusts the scaling factor at inference time based on actual sequence length, providing automatic adaptation.
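The base-scaling idea can be sketched in a few lines. The formula used here, base' = base · s^(d/(d−2)), is the commonly cited form of NTK-aware scaling and is treated as an assumption; the function names and the scale factor `s = 4` are illustrative, and real implementations differ in detail.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1."""
    return 1.0 / base ** (2 * np.arange(d // 2) / d)

def ntk_scaled_base(base, scale, d):
    """Assumed NTK-aware rule: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

d, scale = 128, 4.0    # hypothetical 4x context extension
theta = rope_frequencies(d)
theta_ntk = rope_frequencies(d, ntk_scaled_base(10000.0, scale, d))

# Highest frequency is untouched; lowest frequency is stretched by `scale`.
print(theta[0], theta_ntk[0])
print(theta[-1] / theta_ntk[-1])
```

With this rule the fastest pair keeps its local resolution, while the slowest pair's wavelength grows by the full scale factor, and the pairs in between are stretched by intermediate amounts.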

Multi-Head Latent Attention (MLA)

DeepSeek-V2 and V3 (Chapter 15) combine RoPE with low-rank key-value compression. They split the query and key into a RoPE portion (carrying positional information) and a content portion (compressed via low-rank projection). This decouples position from content at the architectural level, achieving massive KV-cache savings while preserving RoPE's positional properties.

Complexity Analysis

Operation	Complexity	Notes
Compute frequencies	O(d/2)	Once per model initialization
Compute angles	O(N × d/2)	Outer product of positions and frequencies
Apply rotation	O(N × d)	Element-wise multiply + subtract/add per token
Dot-product scores	O(N² × d)	Same as standard attention — RoPE adds no cost here
Total overhead vs standard	O(N × d)	Rotation is linear in sequence length; negligible vs O(N²d)

RoPE adds zero parameters to the model: the frequencies are computed from a formula, not learned. The computational overhead is $O(Nd)$ for the rotation step, which is negligible compared to the $O(N^2 d)$ attention computation. In practice, the rotation is often fused into an adjacent kernel (such as the Q/K projection), so its share of wall-clock time is typically small.

Python Implementation

Full NumPy implementation with class structure. The walkthrough below annotates each line with an execution trace using the actual matrix values from "the cat sat on the mat."

RoPE Attention: NumPy Implementation (rope_attention.py)

import numpy as np

NumPy provides vectorized matrix operations. Q_rot @ K_rot.T runs as optimized C code, not Python loops.

import math

Python standard library. We use math.sqrt() to precompute the scaling factor √d_model.

class RotaryPositionAttention

Wraps RoPE in a reusable class. The key method is apply_rotation() which rotates Q and K by their position before computing attention. V is never rotated.

def __init__(self, d_model, base)

Constructor. Takes model dimension and base frequency (10000 in the original paper). Precomputes per-pair frequencies θᵢ = 1/base^(2i/d).

EXECUTION STATE

⬇ input: d_model = 4

⬇ input: base = 10000

self.d_model = d_model

Store model dimension (4).

EXECUTION STATE

self.d_model = 4

self.base = base

Store base frequency. 10000 is from Vaswani et al. (2017), reused in RoPE.

EXECUTION STATE

self.base = 10000

self.scale = math.sqrt(d_model)

Precompute √d_model for score scaling (same as Chapter 1).

EXECUTION STATE

math.sqrt(4) = 2.0

self.scale = 2.0

self.theta = np.array([...])

Precompute one frequency per dimension pair. With d=4, we have 2 pairs. θ₀=1.0 (fast rotation), θ₁=0.01 (slow rotation).

EXECUTION STATE

range(d_model // 2) = range(2) → [0, 1]

θ₀ = 1/10000^(0/4) = 1/10000⁰ = 1/1 = 1.000000

θ₁ = 1/10000^(2/4) = 1/10000¹⁄₂ = 1/100 = 0.010000

self.theta = [1.000000, 0.010000]

def _softmax(self, x) → np.ndarray

Numerically stable softmax. Subtracts row max before exponentiating to prevent overflow.

EXECUTION STATE

⬇ input: x = shape (5, 5) — scaled score matrix

⬆ returns = np.ndarray (5, 5) — softmax probabilities per row

x_shifted = x - np.max(x, axis=-1, keepdims=True)

Subtract row-wise max for numerical stability. exp(x - max) prevents overflow.

EXECUTION STATE

axis=-1 = find max along last axis — each row gets its own max

keepdims=True = result shape (5,1) not (5,) so broadcasting x(5×5) - max(5×1) works

exp_x = np.exp(x_shifted)

Exponentiate. Largest per-row value is exp(0)=1.0 — no overflow.

return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Normalize each row to sum to 1.0.

EXECUTION STATE

axis=-1 = sum each row independently

keepdims=True = sum shape (5,1) for broadcasting

def compute_angles(self, N) → np.ndarray

Compute rotation angles for all positions 0..N-1. Each position gets one angle per dimension pair: angle = pos × θᵢ.

EXECUTION STATE

⬇ input: N = 5 (number of tokens)

⬆ returns = np.ndarray (5, 2) — angles for each (position, pair)

positions = np.arange(N)

Create position indices [0, 1, 2, 3, 4] for our 5 tokens.

EXECUTION STATE

positions = [0, 1, 2, 3, 4]

return np.outer(positions, self.theta)

Outer product: positions (5,) × theta (2,) → angles (5, 2). Entry [m, i] = m × θᵢ.

EXECUTION STATE

np.outer() = outer product — every position multiplied by every frequency

⬆ return: angles (5×2) =

        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400

def apply_rotation(self, X) → np.ndarray

THE CORE RoPE OPERATION: rotate each row of X by its position. Splits X into first-half and second-half dimensions, then applies 2D rotation per pair. Position 0 gets zero rotation; higher positions rotate more.

EXECUTION STATE

⬇ input: X (5×4) = Q or K matrix — each row is a token's vector

⬆ returns = np.ndarray (5, 4) — position-rotated version of X

N, D = X.shape

Unpack matrix dimensions.

EXECUTION STATE

N = 5 (tokens)

D = 4 (dimensions)

angles = self.compute_angles(N)

Get rotation angles for all 5 positions.

EXECUTION STATE

angles (5×2) =

        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400

cos_a = np.cos(angles)

Compute cosines for all angles.

EXECUTION STATE

cos_a (5×2) =

          pair0     pair1
The    1.000000  1.000000
cat    0.540302  0.999950
sat   -0.416147  0.999800
on    -0.989992  0.999550
mat   -0.653644  0.999200

sin_a = np.sin(angles)

Compute sines for all angles.

EXECUTION STATE

sin_a (5×2) =

          pair0     pair1
The    0.000000  0.000000
cat    0.841471  0.010000
sat    0.909297  0.019999
on     0.141120  0.029996
mat   -0.756802  0.039989

x1 = X[:, :D // 2]

First half of dimensions: columns 0 and 1. These form the "first component" of each rotation pair.

EXECUTION STATE

:D // 2 = :2 — columns [0, 1]

x1 (for Q) =

     d0   d1
The  1.0  0.0
cat  0.0  2.0
sat  1.0  1.0
on   0.0  0.0
mat  1.0  0.0

x2 = X[:, D // 2:]

Second half of dimensions: columns 2 and 3. These are the "second component" paired with x1.

EXECUTION STATE

D // 2: = 2: — columns [2, 3]

x2 (for Q) =

     d2   d3
The  1.0  0.0
cat  0.0  1.0
sat  1.0  0.0
on   1.0  1.0
mat  0.0  1.0

rotated_first = x1 * cos_a - x2 * sin_a

First rotation component: x1·cos(θ) - x2·sin(θ). This is the standard 2D rotation formula applied element-wise per pair.

EXECUTION STATE

rotated_first (for Q) =

         d0       d1
The   1.0000   0.0000
cat   0.0000   1.9899
sat  -1.3254   0.9998
on   -0.1411  -0.0300
mat  -0.6536  -0.0400

rotated_second = x1 * sin_a + x2 * cos_a

Second rotation component: x1·sin(θ) + x2·cos(θ). Completes the 2D rotation.

EXECUTION STATE

rotated_second (for Q) =

         d2       d3
The   1.0000   0.0000
cat   0.0000   1.0199
sat   0.4932   0.0200
on   -0.9900   0.9996
mat  -0.7568   0.9992

return np.hstack([rotated_first, rotated_second])

Concatenate the two halves back into a (5×4) matrix. This is the position-encoded version of the input.

EXECUTION STATE

np.hstack() = horizontal stack — [(5×2), (5×2)] → (5×4)

⬆ return: Q_rot (5×4) =

         d0       d1       d2       d3
The   1.0000   0.0000   1.0000   0.0000
cat   0.0000   1.9899   0.0000   1.0199
sat  -1.3254   0.9998   0.4932   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.6536  -0.0400  -0.7568   0.9992

def compute_scores(self, Q_rot, K_rot)

Compute raw dot products between rotated Q and rotated K. Since both carry positional rotation, the dot product Q_rot[m]·K_rot[n] encodes relative position (m-n).

EXECUTION STATE

⬇ input: Q_rot (5×4) = position-rotated query matrix

⬇ input: K_rot (5×4) = position-rotated key matrix

⬆ returns = np.ndarray (5, 5) — raw scores

return Q_rot @ K_rot.T

Matrix multiply Q_rot (5×4) with K_rot transposed (4×5) → (5×5). Entry (i,j) = dot product of rotated query i with rotated key j.

EXECUTION STATE

.T = transpose — K_rot (5×4) becomes (4×5)

⬆ return: raw_scores (5×5) =

        The      cat      sat       on      mat
The   0.0000   1.0806   0.4932  -1.1311  -1.3589
cat   3.0098   0.0000   2.0099   0.9598   0.4698
sat   1.0198   1.0806   2.0000  -0.3112  -0.1796
on    0.9696  -1.3254  -0.8515   2.0000   1.6116
mat   0.9592  -0.8489  -0.4361   1.8414   1.5000

def scale_scores(self, scores) → np.ndarray

Divide every score by √d_model = √4 = 2.0 to prevent softmax saturation.

EXECUTION STATE

⬇ input: scores = shape (5, 5) — raw dot products

self.scale = 2.0 (√4)

⬆ returns = np.ndarray (5, 5) — scores ÷ 2.0

return scores / self.scale

Elementwise division by 2.0.

EXECUTION STATE

⬆ return: scaled (5×5) =

        The      cat      sat       on      mat
The   0.0000   0.5403   0.2466  -0.5656  -0.6794
cat   1.5049   0.0000   1.0049   0.4799   0.2349
sat   0.5099   0.5403   1.0000  -0.1556  -0.0898
on    0.4848  -0.6627  -0.4257   1.0000   0.8058
mat   0.4796  -0.4244  -0.2181   0.9207   0.7500

def compute_weights(self, scaled) → np.ndarray

Apply softmax row-wise to get attention probabilities.

EXECUTION STATE

⬇ input: scaled = shape (5, 5)

⬆ returns = np.ndarray (5, 5) — each row sums to 1.0

return self._softmax(scaled)

Calls _softmax. All 5 rows become probability distributions.

EXECUTION STATE

⬆ return: weights (5×5) =

        The      cat      sat       on      mat
The   0.1972   0.3385   0.2523   0.1120   0.1000
cat   0.4052   0.0900   0.2457   0.1454   0.1138
sat   0.2116   0.2181   0.3454   0.1088   0.1162
on    0.2095   0.0665   0.0843   0.3508   0.2889
mat   0.2098   0.0849   0.1044   0.3260   0.2749

def compute_output(self, weights, V)

Weighted sum of value vectors. V is NOT rotated — only Q and K receive positional information through rotation.

EXECUTION STATE

⬇ input: weights (5×5) = attention probabilities

⬇ input: V (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5

⬆ returns = np.ndarray (5, 4) — context vectors

return weights @ V

Matrix multiply weights (5×5) with V (5×4) → (5×4). Each output row blends all 5 value vectors.

EXECUTION STATE

⬆ return: output (5×4) =

        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635

def forward(self, Q, K, V)

Main entry point. Rotates Q and K by position, computes scaled dot-product attention, and aggregates V (unrotated). The rotation makes dot products position-aware without modifying the score computation itself.

EXECUTION STATE

⬇ input: Q (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0

⬇ input: K (5×4) =

      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5

⬇ input: V (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5

⬆ returns = (output (5,4), weights (5,5))

Q_rot = self.apply_rotation(Q)

Rotate all query vectors by their position. 'The' (pos 0) is unchanged; 'mat' (pos 4) gets maximum rotation.

EXECUTION STATE

Q_rot (5×4) =

         d0       d1       d2       d3
The   1.0000   0.0000   1.0000   0.0000
cat   0.0000   1.9899   0.0000   1.0199
sat  -1.3254   0.9998   0.4932   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.6536  -0.0400  -0.7568   0.9992

K_rot = self.apply_rotation(K)

Rotate all key vectors by their position. Same rotation angles as Q (position-dependent).

EXECUTION STATE

K_rot (5×4) =

         d0       d1       d2       d3
The   0.0000   1.0000   0.0000   1.0000
cat  -0.3012   0.0000   1.3818   0.0000
sat  -0.4161   0.9998   0.9093   0.0200
on   -0.1411  -0.0300  -0.9900   0.9996
mat  -0.2752  -0.0200  -1.0836   0.4996

raw_scores = self.compute_scores(Q_rot, K_rot)

Compute Q_rot @ K_rot.T → 5×5 raw score matrix.

EXECUTION STATE

raw_scores (5×5) =

        The      cat      sat       on      mat
The   0.0000   1.0806   0.4932  -1.1311  -1.3589
cat   3.0098   0.0000   2.0099   0.9598   0.4698
sat   1.0198   1.0806   2.0000  -0.3112  -0.1796
on    0.9696  -1.3254  -0.8515   2.0000   1.6116
mat   0.9592  -0.8489  -0.4361   1.8414   1.5000

scaled_scores = self.scale_scores(raw_scores)

Divide by √4 = 2.0.

EXECUTION STATE

scaled_scores (5×5) =

        The      cat      sat       on      mat
The   0.0000   0.5403   0.2466  -0.5656  -0.6794
cat   1.5049   0.0000   1.0049   0.4799   0.2349
sat   0.5099   0.5403   1.0000  -0.1556  -0.0898
on    0.4848  -0.6627  -0.4257   1.0000   0.8058
mat   0.4796  -0.4244  -0.2181   0.9207   0.7500

weights = self.compute_weights(scaled_scores)

Apply softmax row-wise.

EXECUTION STATE

weights (5×5) =

        The      cat      sat       on      mat
The   0.1972   0.3385   0.2523   0.1120   0.1000
cat   0.4052   0.0900   0.2457   0.1454   0.1138
sat   0.2116   0.2181   0.3454   0.1088   0.1162
on    0.2095   0.0665   0.0843   0.3508   0.2889
mat   0.2098   0.0849   0.1044   0.3260   0.2749

output = self.compute_output(weights, V)

Weighted sum of UNROTATED V.

EXECUTION STATE

output (5×4) =

        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635

return output, weights

Return the context vectors and attention weights.

EXECUTION STATE

⬆ return: output = shape (5, 4)

⬆ return: weights = shape (5, 5) — each row sums to 1.0

tokens = [...]

The 5 tokens used in every chapter.

EXECUTION STATE

tokens = ['The', 'cat', 'sat', 'on', 'mat']

Q = np.array([...])

Query matrix — same across all chapters. Each row encodes what that token looks for.

EXECUTION STATE

Q (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0

K = np.array([...])

Key matrix — same across all chapters. Each row encodes what that token offers.

EXECUTION STATE

K (5×4) =

      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5

V = np.array([...])

Value matrix. V is NOT rotated in RoPE — only Q and K carry position.

EXECUTION STATE

V (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5

rope = RotaryPositionAttention(d_model=4, base=10000)

Instantiate RoPE with d_model=4 and base=10000. Precomputes theta=[1.0, 0.01] and scale=2.0.

EXECUTION STATE

rope.theta = [1.000000, 0.010000]

rope.scale = 2.0

output, weights = rope.forward(Q, K, V)

Run the full RoPE attention pipeline: rotate Q and K, compute scaled dot-product attention, aggregate V.

EXECUTION STATE

output (5×4) =

        d0       d1       d2       d3
The  0.2472   0.3885   0.3023   0.1620
cat  0.4620   0.1468   0.3026   0.2023
sat  0.2697   0.2762   0.4035   0.1668
on   0.3540   0.2109   0.2287   0.4952
mat  0.3472   0.2224   0.2418   0.4635

rope.explain(Q, K, V, tokens, query_idx=0)

Print detailed trace for 'The' (pos 0). Since pos=0, Q_rot[The] = Q[The] unchanged — zero rotation.

EXECUTION STATE

query_idx = 0 → tracing 'The'

rope.explain(Q, K, V, tokens, query_idx=1)

Print trace for 'cat' (pos 1). Position 1 rotation is significant for pair 0 (θ=1.0 rad ≈ 57.3°) but tiny for pair 1 (θ=0.01 rad ≈ 0.57°).

EXECUTION STATE

query_idx = 1 → tracing 'cat'
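
The two traces can be checked by hand on single vectors. The sketch below assumes the same half-split pairing as `apply_rotation` (pair 0 = columns 0 and 2, pair 1 = columns 1 and 3); the standalone `rotate` helper is hypothetical and exists only for this check.

```python
import numpy as np

theta = np.array([1.0, 0.01])  # frequencies for d_model=4, base=10000

def rotate(x, pos):
    """Rotate one length-4 vector by its position (hypothetical helper).

    Half-split pairing: pair 0 = (x[0], x[2]), pair 1 = (x[1], x[3]).
    """
    angles = pos * theta
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = x[:2], x[2:]
    return np.concatenate([x1 * cos_a - x2 * sin_a,
                           x1 * sin_a + x2 * cos_a])

q_the = np.array([1.0, 0.0, 1.0, 0.0])  # Q['The'], position 0
q_cat = np.array([0.0, 2.0, 0.0, 1.0])  # Q['cat'], position 1

q_the_rot = rotate(q_the, 0)
q_cat_rot = rotate(q_cat, 1)

print(q_the_rot)               # same as q_the: every angle is 0 at position 0
print(np.round(q_cat_rot, 4))  # only pair 1 moves, and only slightly

# Rotation changes direction, never length
print(np.linalg.norm(q_cat), np.linalg.norm(q_cat_rot))
```

For this particular query, pair 0 of Q[cat] is (0, 0), so the large 1-radian rotation has nothing to turn. The only visible change comes from pair 1, where (2, 1) moves by 0.01 rad to roughly (1.9899, 1.0199).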


import numpy as np
import math

class RotaryPositionAttention:
    """
    RoPE: Rotary Position Embedding (Su et al., 2021)

    Encodes position by ROTATING Query and Key vectors
    before computing their dot product. The rotation makes
    Q[m] · K[n] depend only on (m - n), the relative offset.

    Values V are NOT rotated; only Q and K receive position.
    """

    def __init__(self, d_model: int, base: int = 10000):
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)
        # Precompute frequencies: one theta per dimension pair
        self.theta = np.array([
            1.0 / (base ** (2 * i / d_model))
            for i in range(d_model // 2)
        ])

    def _softmax(self, x: np.ndarray) -> np.ndarray:
        """Numerically stable softmax along last axis."""
        x_shifted = x - np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def compute_angles(self, N: int) -> np.ndarray:
        """Compute rotation angles for positions 0..N-1."""
        positions = np.arange(N)
        return np.outer(positions, self.theta)

    def apply_rotation(self, X: np.ndarray) -> np.ndarray:
        """Apply RoPE rotation to a matrix X (N, d_model)."""
        N, D = X.shape
        angles = self.compute_angles(N)
        cos_a = np.cos(angles)
        sin_a = np.sin(angles)
        x1 = X[:, :D // 2]
        x2 = X[:, D // 2:]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return np.hstack([rotated_first, rotated_second])

    def compute_scores(self, Q_rot: np.ndarray, K_rot: np.ndarray):
        """Raw dot-product scores after rotation."""
        return Q_rot @ K_rot.T

    def scale_scores(self, scores: np.ndarray) -> np.ndarray:
        """Divide by sqrt(d_model)."""
        return scores / self.scale

    def compute_weights(self, scaled: np.ndarray) -> np.ndarray:
        """Apply softmax to get attention weights."""
        return self._softmax(scaled)

    def compute_output(self, weights: np.ndarray, V: np.ndarray):
        """Weighted sum of value vectors (V is NOT rotated)."""
        return weights @ V

    def forward(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
        """
        Full RoPE attention forward pass.

        Args:
            Q: Query matrix  (N, d_model)
            K: Key matrix    (N, d_model)
            V: Value matrix  (N, d_model), NOT rotated

        Returns:
            output:  Context vectors  (N, d_model)
            weights: Attention matrix (N, N)
        """
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        raw_scores = self.compute_scores(Q_rot, K_rot)
        scaled_scores = self.scale_scores(raw_scores)
        weights = self.compute_weights(scaled_scores)
        output = self.compute_output(weights, V)
        return output, weights

    def explain(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                tokens: list, query_idx: int = 0):
        """Print a detailed trace for a specific query token."""
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        token = tokens[query_idx]

        print(f"\n=== RoPE trace for '{token}' (pos {query_idx}) ===")
        print(f"Q[{token}] original = {Q[query_idx]}")
        print(f"Q[{token}] rotated  = {np.round(Q_rot[query_idx], 4)}")

        raw = self.compute_scores(Q_rot, K_rot)
        scaled = self.scale_scores(raw)
        w = self.compute_weights(scaled)
        out = self.compute_output(w, V)

        for j, t in enumerate(tokens):
            print(f"  score[{t}] = {raw[query_idx,j]:.4f}"
                  f" -> scaled = {scaled[query_idx,j]:.4f}"
                  f" -> weight = {w[query_idx,j]:.4f}")
        print(f"  output = {np.round(out[query_idx], 4)}")


# ── Shared Example (used in every chapter) ──
tokens = ["The", "cat", "sat", "on", "mat"]

Q = np.array([
    [1.0, 0.0, 1.0, 0.0],   # The
    [0.0, 2.0, 0.0, 1.0],   # cat
    [1.0, 1.0, 1.0, 0.0],   # sat
    [0.0, 0.0, 1.0, 1.0],   # on
    [1.0, 0.0, 0.0, 1.0],   # mat
])

K = np.array([
    [0.0, 1.0, 0.0, 1.0],   # The
    [1.0, 0.0, 1.0, 0.0],   # cat
    [1.0, 1.0, 0.0, 0.0],   # sat
    [0.0, 0.0, 1.0, 1.0],   # on
    [1.0, 0.0, 0.5, 0.5],   # mat
])

V = np.array([
    [1.0, 0.0, 0.0, 0.0],   # The
    [0.0, 1.0, 0.0, 0.0],   # cat
    [0.0, 0.0, 1.0, 0.0],   # sat
    [0.0, 0.0, 0.0, 1.0],   # on
    [0.5, 0.5, 0.5, 0.5],   # mat
])

# ── Run ──
rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)

print("RoPE Attention Weights (5x5):")
print(np.round(weights, 4))

print("\nRoPE Output (5x4):")
print(np.round(output, 4))

# Detailed trace
rope.explain(Q, K, V, tokens, query_idx=0)
rope.explain(Q, K, V, tokens, query_idx=1)
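
A quick way to see what rotation contributes is to run the same Q and K through attention with and without it. The following is a compact sketch rather than a replacement for the class: `rope_rotate` and `attention_weights` are hypothetical helper names, and they assume the same half-split pairing and √d scaling as the listing.

```python
import numpy as np

def rope_rotate(X, base=10000):
    """Half-split RoPE rotation of X with shape (N, d); row index = position."""
    N, d = X.shape
    theta = 1.0 / (base ** (2 * np.arange(d // 2) / d))
    angles = np.outer(np.arange(N), theta)
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = X[:, :d // 2], X[:, d // 2:]
    return np.hstack([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a])

def attention_weights(Q, K):
    """Scaled dot-product attention weights with a max-shifted softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1.0, 0.0, 1.0, 0.0],    # The
              [0.0, 2.0, 0.0, 1.0],    # cat
              [1.0, 1.0, 1.0, 0.0],    # sat
              [0.0, 0.0, 1.0, 1.0],    # on
              [1.0, 0.0, 0.0, 1.0]])   # mat

K = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.5]])

w_rope = attention_weights(rope_rotate(Q), rope_rotate(K))
w_plain = attention_weights(Q, K)      # same Q and K, no position signal

print(np.round(w_rope, 4))
print(np.round(w_rope - w_plain, 4))   # how much each weight moved due to rotation
print(w_rope.sum(axis=1))              # every row should sum to 1
```

The 'cat' row of `w_rope` should agree with the trace values shown earlier (0.4052, 0.0900, 0.2457, 0.1454, 0.1138) to four decimals, which makes this a convenient regression check when modifying the rotation code.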

PyTorch Implementation

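PyTorch version of the same module, with device-aware tensor creation, register_buffer for the frequencies, and F.softmax. It handles a single unbatched (N, d_model) input; the learned projection matrices a full model would need are left as comments.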

RoPE Attention: PyTorch Implementation

rope_attention_torch.py

import torch

PyTorch tensor library. Provides GPU-accelerated tensor operations and automatic differentiation.

import torch.nn as nn

Neural network module. nn.Module is the base class for all PyTorch models.

import torch.nn.functional as F

Functional API. F.softmax() is used instead of manual implementation — it is numerically stable and CUDA-optimized.

import math

Standard library for math.sqrt().

class RotaryPositionAttention(nn.Module)

PyTorch module wrapping RoPE attention. Inheriting nn.Module enables GPU transfer, parameter tracking, and automatic differentiation.

def __init__(self, d_model, base)

Constructor. Registers frequency buffer and precomputes scale.

EXECUTION STATE

⬇ input: d_model = 4

⬇ input: base = 10000

super().__init__()

Initialize nn.Module. Required for PyTorch parameter tracking and hooks.

self.d_model = d_model

Store model dimension.

EXECUTION STATE

self.d_model = 4

self.base = base

Store base frequency.

EXECUTION STATE

self.base = 10000

self.scale = math.sqrt(d_model)

Precompute √d_model = 2.0 for score scaling.

EXECUTION STATE

self.scale = 2.0

theta = torch.tensor([...])

Create frequency tensor. Same formula as NumPy version: θᵢ = 1/base^(2i/d).

EXECUTION STATE

theta = tensor([1.0000, 0.0100])

self.register_buffer("theta", theta)

Register as a buffer (not a parameter). Buffers are saved with the model and move to GPU with .to(device), but are NOT updated by the optimizer. Frequencies are fixed, not learned.

EXECUTION STATE

register_buffer() = saved in state_dict, moves with .to(device), NOT trained by optimizer
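
The buffer behaviour described above can be observed directly. This is a minimal sketch using a stripped-down module; `FreqHolder` is a hypothetical class written only for this demonstration and holds nothing but the frequencies.

```python
import torch
import torch.nn as nn

class FreqHolder(nn.Module):
    """Hypothetical minimal module: stores RoPE frequencies as a buffer."""

    def __init__(self, d_model: int = 4, base: int = 10000):
        super().__init__()
        theta = torch.tensor([1.0 / (base ** (2 * i / d_model))
                              for i in range(d_model // 2)])
        self.register_buffer("theta", theta)

m = FreqHolder()

print(list(m.state_dict().keys()))  # the buffer is saved with the module
print(len(list(m.parameters())))    # nothing for an optimizer to update
print(m.theta.requires_grad)        # buffers do not track gradients
```

Because `theta` appears in `state_dict()` but not in `parameters()`, it is checkpointed and moved by `.to(device)` while staying invisible to the optimizer.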

def apply_rotation(self, X) → torch.Tensor

Apply RoPE rotation. Identical to NumPy version but uses PyTorch ops for GPU and autograd support.

EXECUTION STATE

⬇ input: X (5×4) = Q or K tensor

⬆ returns = torch.Tensor (5, 4) — rotated

N, D = X.shape

Unpack dimensions.

EXECUTION STATE

N = 5

D = 4

positions = torch.arange(N, device=X.device, dtype=X.dtype)

Create position indices on the SAME device as X (CPU or GPU). dtype matching prevents float32/float64 mismatches.

EXECUTION STATE

device=X.device = ensures positions tensor is on same device (CPU/GPU) as input

dtype=X.dtype = match precision (float32/float64) to avoid cast errors

positions = tensor([0., 1., 2., 3., 4.])

angles = positions.unsqueeze(1) * self.theta.unsqueeze(0)

Broadcasting outer product. positions (5,1) × theta (1,2) → angles (5,2). Same as np.outer().

EXECUTION STATE

.unsqueeze(1) = positions (5,) → (5,1) — add column dimension for broadcasting

.unsqueeze(0) = theta (2,) → (1,2) — add row dimension for broadcasting

angles (5×2) =

        θ₀       θ₁
The  0.0000   0.0000
cat  1.0000   0.0100
sat  2.0000   0.0200
on   3.0000   0.0300
mat  4.0000   0.0400
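
The equivalence between the broadcast product and `np.outer()` can be confirmed in a few lines. The sketch below uses NumPy, where `positions[:, None]` plays the role of `.unsqueeze(1)` and `theta[None, :]` the role of `.unsqueeze(0)`; the variable names are illustrative.

```python
import numpy as np

positions = np.arange(5, dtype=np.float64)  # shape (5,)
theta = np.array([1.0, 0.01])               # shape (2,)

# Add singleton axes so the shapes line up as (5, 1) * (1, 2) -> (5, 2)
angles_broadcast = positions[:, None] * theta[None, :]

# The explicit outer product used in the NumPy implementation
angles_outer = np.outer(positions, theta)

print(angles_broadcast.shape)
print(np.allclose(angles_broadcast, angles_outer))
print(angles_broadcast)
```

Both forms produce the same (5, 2) table of angles. The broadcast form has the practical advantage of extending naturally to inputs with extra leading dimensions, such as batch and head axes.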

cos_a = torch.cos(angles)

Compute cosines for all angles.

EXECUTION STATE

cos_a shape = (5, 2)

sin_a = torch.sin(angles)

Compute sines for all angles.

EXECUTION STATE

sin_a shape = (5, 2)

x1 = X[:, :D // 2]

First half: columns [0, 1].

EXECUTION STATE

x1 shape = (5, 2)

x2 = X[:, D // 2:]

Second half: columns [2, 3].

EXECUTION STATE

x2 shape = (5, 2)

rotated_first = x1 * cos_a - x2 * sin_a

First rotation component. Element-wise, not matmul.

EXECUTION STATE

rotated_first shape = (5, 2)

rotated_second = x1 * sin_a + x2 * cos_a

Second rotation component.

EXECUTION STATE

rotated_second shape = (5, 2)

return torch.cat([rotated_first, rotated_second], dim=-1)

Concatenate along last dimension: [(5,2), (5,2)] → (5,4).

EXECUTION STATE

dim=-1 = concatenate along last axis (columns)

⬆ return shape = (5, 4) — rotated Q or K

def forward(self, Q, K, V)

Main forward pass. Rotates Q and K, computes attention, returns output and weights.

EXECUTION STATE

⬇ input: Q = shape (5, 4)

⬇ input: K = shape (5, 4)

⬇ input: V = shape (5, 4) — NOT rotated

⬆ returns = (output (5,4), weights (5,5))

Q_rot = self.apply_rotation(Q)

Rotate queries by position.

K_rot = self.apply_rotation(K)

Rotate keys by position.

scores = Q_rot @ K_rot.T / self.scale

Scaled dot-product of rotated Q and K. In PyTorch, @ is the matrix multiply operator, .T is transpose.

EXECUTION STATE

.T = transpose — K_rot (5×4) → (4×5)

/ self.scale = ÷ 2.0 (√4)

weights = F.softmax(scores, dim=-1)

PyTorch’s numerically stable softmax. dim=-1 means softmax over last dimension (each row independently).

EXECUTION STATE

dim=-1 = softmax along last axis — each row becomes a probability distribution

F.softmax vs manual = CUDA-optimized, numerically stable (subtracts max internally), supports autograd
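
The value of subtracting the row maximum is easiest to see on inputs large enough to overflow. As a small NumPy sketch (the two function names are illustrative and not part of either implementation), compare a naive softmax with the max-shifted form:

```python
import numpy as np

def softmax_naive(x):
    """Direct definition; overflows once exp(x) exceeds the float range."""
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    """Shift by the row maximum first; the result is mathematically identical."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

small = np.array([1.5, 0.0, 1.0, 0.5, 0.2])
large = small + 1000.0  # same differences between entries, huge magnitude

# Silence the expected overflow warnings from the naive version
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

print(np.allclose(softmax_naive(small), softmax_stable(small)))   # True
print(np.isnan(naive_large).any())                                # True
print(np.allclose(softmax_stable(large), softmax_stable(small)))  # True
```

Softmax depends only on the differences between scores, so shifting every entry by the same constant leaves the weights unchanged while keeping every exponent at or below zero.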

output = weights @ V

Weighted sum of V (unrotated). Same as NumPy version.

return output, weights

Return context vectors and attention matrix.

EXECUTION STATE

⬆ return: output = shape (5, 4)

⬆ return: weights = shape (5, 5)

rope = RotaryPositionAttention(d_model=4, base=10000)

Instantiate model. theta buffer = [1.0, 0.01], scale = 2.0.

EXECUTION STATE

rope.theta = tensor([1.0000, 0.0100])

output, weights = rope.forward(Q, K, V)

Run RoPE attention. Output matches the NumPy version up to float32 rounding (the tensors here are float32, while the NumPy arrays are float64).

EXECUTION STATE

output shape = (5, 4)

weights shape = (5, 5)

print(weights.detach().numpy().round(4))

.detach() returns a tensor that is cut off from the computation graph, so printing tracks no gradients. .numpy() converts it to NumPy for display.

EXECUTION STATE

.detach() = detach from autograd graph; required before .numpy() only when the tensor requires grad

.numpy() = convert PyTorch tensor to NumPy array


import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class RotaryPositionAttention(nn.Module):
    """
    RoPE: Rotary Position Embedding (Su et al., 2021)
    PyTorch implementation with GPU support.

    Rotates Q and K by position before computing attention.
    V is NOT rotated. Supports automatic differentiation.
    """

    def __init__(self, d_model: int, base: int = 10000):
        super().__init__()
        self.d_model = d_model
        self.base = base
        self.scale = math.sqrt(d_model)

        # Register frequencies as a buffer (not a parameter)
        theta = torch.tensor([
            1.0 / (base ** (2 * i / d_model))
            for i in range(d_model // 2)
        ])
        self.register_buffer("theta", theta)

        # In production: learned projection matrices
        # self.W_q = nn.Linear(d_model, d_model)
        # self.W_k = nn.Linear(d_model, d_model)
        # self.W_v = nn.Linear(d_model, d_model)
        # self.W_o = nn.Linear(d_model, d_model)

    def apply_rotation(self, X: torch.Tensor) -> torch.Tensor:
        """Apply RoPE rotation. X shape: (N, d_model)."""
        N, D = X.shape
        positions = torch.arange(N, device=X.device, dtype=X.dtype)
        angles = positions.unsqueeze(1) * self.theta.unsqueeze(0)
        cos_a = torch.cos(angles)
        sin_a = torch.sin(angles)
        x1 = X[:, :D // 2]
        x2 = X[:, D // 2:]
        rotated_first = x1 * cos_a - x2 * sin_a
        rotated_second = x1 * sin_a + x2 * cos_a
        return torch.cat([rotated_first, rotated_second], dim=-1)

    def forward(
        self,
        Q: torch.Tensor,
        K: torch.Tensor,
        V: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            Q: (N, d_model)  query matrix
            K: (N, d_model)  key matrix
            V: (N, d_model)  value matrix (NOT rotated)
        Returns:
            output:  (N, d_model) context vectors
            weights: (N, N) attention weights
        """
        Q_rot = self.apply_rotation(Q)
        K_rot = self.apply_rotation(K)
        scores = Q_rot @ K_rot.T / self.scale
        weights = F.softmax(scores, dim=-1)
        output = weights @ V
        return output, weights


# ── Shared Example ──
tokens = ["The", "cat", "sat", "on", "mat"]

Q = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

K = torch.tensor([
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.5, 0.5],
])

V = torch.tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
])

# ── Run ──
rope = RotaryPositionAttention(d_model=4, base=10000)
output, weights = rope.forward(Q, K, V)

print("RoPE Weights (PyTorch):")
print(weights.detach().numpy().round(4))

print("\nRoPE Output (PyTorch):")
print(output.detach().numpy().round(4))

Key Takeaways

Rotation encodes position intrinsically. Unlike additive position embeddings or score biases, RoPE makes the dot product $Q_m \cdot K_n$ depend only on the relative position $(m - n)$ as a mathematical consequence, not an approximation.
Multi-frequency design captures multiple scales. Fast-rotating pairs encode local position (is this word the immediate neighbor?), while slow-rotating pairs encode global position (is it in the same paragraph?). Because every position maps to a valid rotation, RoPE can be evaluated at sequence lengths not seen in training, although quality at those lengths usually depends on extension techniques such as NTK-aware scaling or YaRN.
Zero additional parameters. Frequencies come from a formula, not a learned lookup table. This means no out-of-vocabulary positions and no parameter overhead.
V is NOT rotated. Only Q and K carry positional information through rotation. V provides the content that gets aggregated.
Compatible with common optimizations. Because RoPE only changes Q and K before the dot product, it works with Flash Attention, KV-cache, multi-head/multi-query/grouped-query attention, and context length extension techniques like YaRN.
De facto standard for modern LLMs. LLaMA, Mistral, PaLM, Qwen, Falcon, Phi, and virtually every major open-weight model since 2022 use RoPE.
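
The first takeaway can be checked numerically. The sketch below assumes the same half-split pairing used in this chapter's code; the `rotate` helper, the random vectors, and the chosen positions are illustrative and differ from the setup in the exercises.

```python
import numpy as np

def rotate(x, pos, base=10000):
    """Half-split RoPE rotation of a single vector x at integer position pos."""
    d = x.shape[0]
    theta = 1.0 / (base ** (2 * np.arange(d // 2) / d))
    cos_a, sin_a = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[:d // 2], x[d // 2:]
    return np.concatenate([x1 * cos_a - x2 * sin_a,
                           x1 * sin_a + x2 * cos_a])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Three (query position, key position) pairs sharing the offset m - n = -3
pairs = [(1, 4), (10, 13), (100, 103)]
scores = [rotate(q, m) @ rotate(k, n) for m, n in pairs]
print(scores)  # all three agree up to floating-point error

# A different offset (m - n = -5) generally gives a different score
other = rotate(q, 1) @ rotate(k, 6)
print(other)
```

Shifting both positions by the same amount leaves the score unchanged, whereas changing the gap between them does not. The same `rotate` call also illustrates why a KV-cache works: each key is rotated once by its own absolute position and never needs to be revisited.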

Exercises

Verify the relative position property: Using the code above, create two identical vectors $\mathbf{x} = [1, 1, 1, 1]$ and place them at positions (0, 2) and (3, 5). Verify that the dot product of their rotated versions is the same in both cases, confirming that the score depends only on the distance 2.
Frequency analysis: For $d = 128$ and base = 10000, compute all 64 frequencies. What is the wavelength (in positions) of the fastest and slowest pair? How many full rotations does pair 0 complete over a 4096-token sequence?
NTK scaling experiment: Modify the code to use base = 100000 instead of 10000. Recompute the attention weights. How does this affect which tokens "on" attends to? Why would this help with longer sequences?
Compare with Chapter 7: Run both the relative position bias attention (Chapter 7) and RoPE attention on the same input. Which produces stronger local preference? What happens when you increase the sequence length to 20 tokens?
2D RoPE for images: Design a 2D RoPE scheme for a 4×4 image patch grid (16 tokens). Each token has a (row, col) position. Apply separate rotations for row-position and column-position using different frequency bands. What properties does this have compared to 1D RoPE?

References

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
bloc97. (2023). "NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size Without Any Fine-tuning and Minimal Perplexity Degradation." Reddit/r/LocalLLaMA.
Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071.
Black, S., Biderman, S., Hallahan, E., et al. (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745.
Liu, H., Yan, W., Zaharia, M., & Abbeel, P. (2024). "World Model on Million-Length Video and Language with RingAttention." arXiv:2402.08268.
