Introduction
This section covers modern position encoding techniques that have replaced sinusoidal embeddings in state-of-the-art models: Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi).
3.1 Limitations of Traditional Position Encodings
Why New Approaches Were Needed
ORIGINAL SINUSOIDAL (Vaswani et al., 2017):
═══════════════════════════════════════════

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Properties:
✓ No learned parameters
✓ Theoretically extrapolates to longer sequences
✓ Relative position information encoded

Limitations:
✗ Extrapolation doesn't work well in practice
✗ Position info mixed with content
✗ Limited flexibility

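The two sinusoidal formulas above translate almost directly into code. The following is a minimal pure-Python sketch rather than a reference implementation; the function name `sinusoidal_encoding` and the toy size `d_model=8` are illustrative assumptions.

```python
import math
from typing import List


def sinusoidal_encoding(pos: int, d_model: int, base: float = 10000.0) -> List[float]:
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Dimensions i (even) and i + 1 (odd) share the frequency 1 / base^(i / d_model);
        # the loop variable i plays the role of 2i in the formula.
        angle = pos / (base ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe


pe0 = sinusoidal_encoding(pos=0, d_model=8)
pe5 = sinusoidal_encoding(pos=5, d_model=8)
print(pe0)      # position 0: every sine entry is 0.0, every cosine entry is 1.0
print(pe5[:2])  # first pair at position 5: sin(5), cos(5)
```

Because the result depends only on `pos`, it can be evaluated for any position, which is the sense in which this scheme "theoretically extrapolates".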
LEARNED ABSOLUTE (BERT, GPT-2):
═══════════════════════════════

PE = Embedding(position, d_model)

Properties:
✓ Simple and flexible
✓ Can learn complex patterns
✓ Works well within trained length

Limitations:
✗ Cannot extrapolate beyond max_position
✗ Fixed maximum sequence length
✗ More parameters

Modern Approaches:
RoPE (Rotary Position Embeddings)
Used in: LLaMA, Mistral, Qwen, PaLM
• Encodes position through rotation
• Relative positions emerge naturally
• Good length extrapolation with extensions
ALiBi (Attention with Linear Biases)
Used in: BLOOM, MPT
• No position in embeddings
• Linear bias in attention based on distance
• Excellent extrapolation
Length Extrapolation Comparison:
Model trained on 2048 tokens, tested on 4096:
| Method | Perplexity (2k) | Perplexity (4k) | Extrapolation |
|---|---|---|---|
| Sinusoidal | 15.2 | 150+ | Very Poor |
| Learned Absolute | 14.8 | 500+ | Terrible |
| T5 Relative | 15.0 | 45.2 | Poor |
| RoPE (base) | 14.5 | 35.8 | Decent |
| RoPE (extended) | 14.5 | 18.2 | Good |
| ALiBi | 15.1 | 17.5 | Excellent |
3.2 Rotary Position Embeddings (RoPE)
The Core Idea
KEY INSIGHT:
════════════

Instead of ADDING position to embeddings,
ROTATE embeddings based on position.

Traditional: x' = x + PE(pos)
RoPE: x' = R(pos) · x (rotation matrix)


WHY ROTATION?
═════════════

Consider 2D rotation:

R(θ) = [cos(θ)  -sin(θ)]
       [sin(θ)   cos(θ)]

Key property: R(θ₁) · R(θ₂) = R(θ₁ + θ₂)

This means:
• Position m rotates by angle m·θ
• Position n rotates by angle n·θ
• Their difference is (m-n)·θ

The RELATIVE position is naturally encoded!

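The composition property of 2D rotations can be checked numerically. This is a small sketch using only the standard library; the helper names `rotation` and `matmul2`, the angle `theta = 0.1`, and the positions `m = 7`, `n = 3` are arbitrary illustrative choices.

```python
import math


def rotation(theta: float) -> list:
    """2x2 rotation matrix R(theta) as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul2(a: list, b: list) -> list:
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


theta = 0.1   # per-position angle (toy value)
m, n = 7, 3   # two token positions

# Rotating by m*theta and then undoing n*theta ...
composed = matmul2(rotation(m * theta), rotation(-n * theta))
# ... should equal a single rotation by the difference (m - n)*theta.
direct = rotation((m - n) * theta)

max_diff = max(abs(composed[i][j] - direct[i][j])
               for i in range(2) for j in range(2))
print(max_diff)  # only floating-point rounding error remains
```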
HOW IT WORKS IN ATTENTION:
══════════════════════════

Standard attention at positions m and n:

Attention ∝ q_m · k_n

With RoPE:

Attention ∝ R(m)q · R(n)k
          = q^T · R(m)^T · R(n) · k
          = q^T · R(n-m) · k (rotation property, using R(m)^T = R(-m))

The attention score depends only on relative position (n-m)!

RoPE Implementation
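Before the full module, the relative-position property derived above can be checked on a toy example. The sketch below is pure Python and rests on a few assumptions: the helper names `rope` and `dot` and the 4-dimensional vectors are made up for illustration, and the helper rotates adjacent pairs of dimensions, whereas the module in this section pairs dimension i with i + dim/2. Both pairings give the same relative-position behaviour.

```python
import math


def rope(x: list, pos: int, base: float = 10000.0) -> list:
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by the angle pos * theta_i."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # frequency for this pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a: list, b: list) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset n - m = 3 at two very different absolute positions.
score_near = dot(rope(q, 2), rope(k, 5))
score_far = dot(rope(q, 102), rope(k, 105))
# A different offset (n - m = 4) for contrast.
score_other = dot(rope(q, 2), rope(k, 6))

print(abs(score_near - score_far))    # close to zero
print(abs(score_near - score_other))  # clearly non-zero
```

The first difference vanishes up to rounding because shifting both positions by 100 leaves n - m unchanged; the second does not, because the offset itself changed.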
```python
import math
from typing import Optional, Tuple

import torch
import torch.nn as nn


class RotaryPositionEmbedding(nn.Module):
    """
    Rotary Position Embeddings (RoPE).

    Used in LLaMA, Mistral, and many modern LLMs.
    """

    def __init__(
        self,
        dim: int,
        max_seq_len: int = 2048,
        base: float = 10000.0
    ):
        """
        Initialize RoPE.

        Args:
            dim: Dimension to apply RoPE (usually head_dim)
            max_seq_len: Maximum sequence length for precomputation
            base: Base for frequency computation
        """
        super().__init__()

        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Precompute frequencies
        # theta_i = 1 / (base^(2i/dim)) for i = 0, 1, ..., dim/2-1
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

        # Precompute sin and cos for positions
        self._precompute_cache(max_seq_len)

    def _precompute_cache(self, seq_len: int):
        """Precompute sin/cos values for efficiency."""
        positions = torch.arange(seq_len).float()

        # freqs: [seq_len, dim/2]
        freqs = torch.outer(positions, self.inv_freq)

        # cos and sin: [seq_len, dim/2]
        cos_cache = torch.cos(freqs)
        sin_cache = torch.sin(freqs)

        # Expand to full dimension by repeating
        # [seq_len, dim]
        cos_cache = torch.cat([cos_cache, cos_cache], dim=-1)
        sin_cache = torch.cat([sin_cache, sin_cache], dim=-1)

        self.register_buffer('cos_cache', cos_cache)
        self.register_buffer('sin_cache', sin_cache)

    def _rotate_half(self, x: torch.Tensor) -> torch.Tensor:
        """
        Rotate half the hidden dims.

        The last dimension is split into a first and a second half, so
        dimension i is paired with dimension i + dim/2:

        [x1, x2, x3, x4] -> [-x3, -x4, x1, x2]
        """
        x1 = x[..., :x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2:]
        return torch.cat([-x2, x1], dim=-1)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        positions: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Apply rotary embeddings to queries and keys.

        Args:
            q: Query tensor [batch, heads, seq_len, head_dim]
            k: Key tensor [batch, heads, seq_len, head_dim]
            positions: Optional position indices [seq_len]

        Returns:
            Rotated q and k tensors
        """
        seq_len = q.shape[2]

        # Get sin/cos for current sequence
        if positions is None:
            cos = self.cos_cache[:seq_len]
            sin = self.sin_cache[:seq_len]
        else:
            cos = self.cos_cache[positions]
            sin = self.sin_cache[positions]

        # Reshape for broadcasting: [1, 1, seq_len, dim]
        cos = cos.unsqueeze(0).unsqueeze(0)
        sin = sin.unsqueeze(0).unsqueeze(0)

        # Apply rotation
        # x * cos + rotate_half(x) * sin
        q_rotated = q * cos + self._rotate_half(q) * sin
        k_rotated = k * cos + self._rotate_half(k) * sin

        return q_rotated, k_rotated
```

```python
class RoPEAttention(nn.Module):
    """
    Multi-head attention with RoPE.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        # Each head needs an equal, even-sized slice for the pairwise rotation
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

        # RoPE
        self.rope = RotaryPositionEmbedding(self.head_dim, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with RoPE.

        Args:
            x: Input [batch, seq_len, d_model]
            attention_mask: Optional additive mask, broadcastable to
                [batch, heads, seq_len, seq_len]
            is_causal: Whether to use causal masking

        Returns:
            Output [batch, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape: [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE to Q and K (not V!)
        q, k = self.rope(q, k)

        # Standard attention computation
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            # True above the diagonal marks future positions to hide
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)

        # Reshape back
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
```

3.3 RoPE Extensions for Long Sequences
NTK-aware Scaling and YaRN
POSITION INTERPOLATION (PI):
════════════════════════════

Simply scale positions down:
pos' = pos × (trained_len / target_len)

Example: Trained on 2048, extend to 4096
Position 2048 becomes: 2048 × (2048/4096) = 1024

✓ Simple
✓ Works reasonably well
✗ May lose precision at high frequencies

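The position rescaling in PI is a one-line operation. The sketch below uses a hypothetical helper name, `interpolate_position`, and reuses the 2048 to 4096 example from the text.

```python
def interpolate_position(pos: float, trained_len: int, target_len: int) -> float:
    """Position Interpolation: squeeze [0, target_len) into [0, trained_len)."""
    return pos * (trained_len / target_len)


trained_len, target_len = 2048, 4096
for pos in (0, 1, 2048, 4095):
    print(pos, "->", interpolate_position(pos, trained_len, target_len))

# Spacing between neighbouring tokens after interpolation.
step = interpolate_position(1, trained_len, target_len) - interpolate_position(0, trained_len, target_len)
print(step)  # 0.5
```

Every rescaled position stays inside the trained range, but neighbouring tokens are now only 0.5 apart instead of 1. That compression is what can blur the high-frequency dimensions, which rely on distinguishing adjacent positions.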
NTK-AWARE SCALING:
══════════════════

Increase the base so that low frequencies are stretched the most while the highest frequency stays unchanged:
base' = base × (scale ^ (dim / (dim-2)))

Example: scale=2, dim=64
base' = 10000 × (2 ^ (64/62)) ≈ 20450

✓ Preserves high-frequency information
✓ Better for very long sequences
✗ May affect short-range patterns

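The effect of the NTK-aware base change on individual frequencies can be inspected directly. This is a minimal sketch with assumed helper names (`ntk_scaled_base`, `inv_freqs`); it uses the same frequency definition as the RoPE module, theta_i = base^(-2i/dim).

```python
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware base adjustment: base' = base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))


def inv_freqs(base: float, dim: int) -> list:
    """RoPE frequencies base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-i / dim) for i in range(0, dim, 2)]


base, scale, dim = 10000.0, 2.0, 64
new_base = ntk_scaled_base(base, scale, dim)
old, new = inv_freqs(base, dim), inv_freqs(new_base, dim)

print(new_base)           # a little above 20450
print(new[0] / old[0])    # 1.0: the highest frequency is untouched
print(new[-1] / old[-1])  # about 0.5: the lowest frequency is divided by `scale`
```

The exponent dim / (dim - 2) is chosen so that the lowest frequency ends up scaled by exactly 1 / scale, matching what PI would do to it, while the frequencies in between are scaled progressively less.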
YaRN (Yet another RoPE extensioN):
══════════════════════════════════

Frequency-dependent interpolation:
- High frequencies: left unscaled (preserves local detail)
- Low frequencies: linear interpolation, as in PI (enables long-range)
- In between: a smooth blend of the two, plus a temperature adjustment to the attention logits

✓ Best of both worlds
✓ State-of-the-art extrapolation
✗ More complex

Practical Results:
Model trained on 4096, tested on 16384:
| Method | PPL (4k) | PPL (8k) | PPL (16k) | Quality |
|---|---|---|---|---|
| No scaling | 12.5 | 1500+ | OOM | Broken |
| Linear (PI) | 12.5 | 18.2 | 28.5 | Good |
| NTK | 12.5 | 15.8 | 22.1 | Better |
| YaRN | 12.5 | 13.8 | 16.2 | Best |
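The per-dimension blending behind YaRN, often described as "NTK-by-parts", can be sketched in a few lines: dimensions that complete many rotations over the trained context keep their frequency, dimensions that complete less than one rotation are divided by the scale, and a linear ramp connects the two. Everything named here is an assumption for illustration: the function `yarn_inv_freqs`, the ramp thresholds `alpha=1` and `beta=32`, and the `0.1 * ln(scale) + 1` form of the attention factor should be checked against the implementation you actually use.

```python
import math


def yarn_inv_freqs(dim, base=10000.0, scale=4.0, trained_len=4096,
                   alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies per dimension."""
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)
        # Full rotations this dimension completes across the trained context.
        rotations = trained_len * theta / (2 * math.pi)
        # gamma = 1 keeps the original frequency, gamma = 0 interpolates fully.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - gamma) * theta / scale + gamma * theta)
    return out


dim, scale = 64, 4.0
original = [10000.0 ** (-i / dim) for i in range(0, dim, 2)]
scaled = yarn_inv_freqs(dim, scale=scale)

print(scaled[0] / original[0])    # 1.0: highest frequency unchanged
print(scaled[-1] / original[-1])  # 0.25: lowest frequency fully interpolated

# Factor applied to queries and keys alongside the frequency change (assumed form).
attn_factor = 0.1 * math.log(scale) + 1.0
print(attn_factor)
```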
3.4 ALiBi: Attention with Linear Biases
Implementation
```python
import math
from typing import Optional

import torch
import torch.nn as nn


class ALiBiPositionBias(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    No position encodings in embeddings!
    Instead, add bias to attention scores based on distance.
    """

    def __init__(
        self,
        num_heads: int,
        max_seq_len: int = 2048
    ):
        """
        Initialize ALiBi.

        Args:
            num_heads: Number of attention heads
            max_seq_len: Maximum sequence length for bias precomputation
        """
        super().__init__()

        self.num_heads = num_heads
        self.max_seq_len = max_seq_len

        # Compute slopes for each head
        # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
        slopes = self._compute_slopes(num_heads)
        self.register_buffer('slopes', slopes)

        # Precompute bias matrix
        bias = self._compute_bias(max_seq_len)
        self.register_buffer('bias', bias)

    def _compute_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute slopes for each head.

        Geometric sequence with ratio 2^(-8/num_heads).
        """
        def get_slopes_power_of_2(n):
            # start = 2^(-8/n); each later head multiplies by the same ratio
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * ratio ** i for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # For non-power-of-2 head counts, start from the closest lower
            # power of 2 and fill the remainder from the next power of 2
            closest_power = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power)

            extra_slopes = get_slopes_power_of_2(2 * closest_power)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes).view(num_heads, 1, 1)

    def _compute_bias(self, seq_len: int) -> torch.Tensor:
        """
        Compute position bias matrix.

        bias[i,j] = -|i-j| (relative distance)
        """
        # Create distance matrix
        positions = torch.arange(seq_len)
        distance = positions.unsqueeze(1) - positions.unsqueeze(0)  # [S, S]
        distance = -torch.abs(distance).float()  # Negative distances

        return distance

    def forward(self, seq_len: int) -> torch.Tensor:
        """
        Get position bias for given sequence length.

        Args:
            seq_len: Current sequence length

        Returns:
            Bias tensor [num_heads, seq_len, seq_len]
        """
        if seq_len > self.max_seq_len:
            # Longer than the precomputed matrix: build the bias on the fly
            # so extrapolation beyond max_seq_len does not fail
            bias = self._compute_bias(seq_len).to(self.slopes.device)
        else:
            bias = self.bias[:seq_len, :seq_len]

        # Scale by head-specific slopes
        # [num_heads, 1, 1] * [seq_len, seq_len] -> [num_heads, seq_len, seq_len]
        return self.slopes * bias
```

```python
class ALiBiAttention(nn.Module):
    """
    Multi-head attention with ALiBi position biases.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # ALiBi position bias
        self.alibi = ALiBiPositionBias(num_heads, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with ALiBi.

        Note: No position encoding added to input!
        Position is only in attention bias.
        """
        batch_size, seq_len, _ = x.shape

        # Project (no position encoding in embeddings!)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute attention scores
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Add ALiBi bias (this is where position information comes in!)
        alibi_bias = self.alibi(seq_len)
        scores = scores + alibi_bias.unsqueeze(0)  # [1, H, S, S]

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
```

Key Insight:
Each head has a different "attention span":
• Head 0: Largest slope (1/2 with 8 heads) → Short-range attention
• Head 7: Smallest slope (1/256 with 8 heads) → Long-range attention
The bias makes distant tokens less attended to, with the decay rate controlled by the slope.
Token at distance d has bias: -slope × |d|
This creates RELATIVE position encoding without modifying the embeddings at all!
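To make the per-head "attention span" concrete, the slopes for 8 heads and the resulting bias at one distance can be computed by hand. This is a small sketch for power-of-two head counts only; `alibi_slopes` is a hypothetical helper that follows the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8) used in the implementation above.

```python
def alibi_slopes(num_heads: int) -> list:
    """Slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for a power-of-two head count n."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


slopes = alibi_slopes(8)
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

distance = 16
for head in (0, 7):
    # Bias added to the attention score for a token `distance` steps away.
    print(head, -slopes[head] * distance)  # head 0: -8.0, head 7: -0.0625
```

At a distance of 16 tokens, the steepest head already subtracts 8 from the score before softmax, while the flattest head subtracts almost nothing, so it can still attend far back.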
Extrapolation:
Because position is encoded as a linear bias on distance, ALiBi extrapolates naturally to longer sequences:
• Train on 1024: bias at distance 500 = -slope × 500
• Test on 4096: bias at distance 2000 = -slope × 2000
The relationship stays linear, so distances never seen in training still receive a well-defined, steadily decreasing bias.
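The same point can be seen by building the bias matrix for one head at two sequence lengths. The sketch below is pure Python with an assumed helper name, `alibi_bias`, and a toy slope of 0.5.

```python
def alibi_bias(seq_len: int, slope: float) -> list:
    """bias[i][j] = -slope * |i - j| for a single head."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slope = 0.5
short = alibi_bias(4, slope)
long = alibi_bias(8, slope)

for row in short:
    print(row)

# The top-left corner of the longer matrix is identical to the short one ...
same_corner = all(long[i][:4] == short[i] for i in range(4))
# ... and larger distances simply continue the same straight line.
print(same_corner, long[0][7])  # True -3.5
```

Nothing in the bias depends on the total sequence length, only on the distance |i - j|, which is why no retraining or rescaling step is needed just to compute it for longer inputs.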
3.5 Comparison and When to Use Each
Summary
SINUSOIDAL/LEARNED ABSOLUTE:
════════════════════════════
✓ Use when: Fixed sequence length, simplicity needed
✗ Avoid when: Need length extrapolation
Examples: BERT, GPT-2, original Transformer


ROTARY POSITION EMBEDDINGS (RoPE):
══════════════════════════════════
✓ Use when:
  • Building modern LLMs
  • Need good length extension
  • Want relative position benefits
  • Using grouped-query attention

✗ Consider alternatives when:
  • Need absolute position info
  • Very long sequences (>32k) without extension

Examples: LLaMA, Mistral, Qwen, PaLM, Falcon
Extensions: PI, NTK, YaRN for 100k+ context


ALiBi:
══════
✓ Use when:
  • Need excellent extrapolation out-of-box
  • Simplicity is important
  • Training long-context models

✗ Consider alternatives when:
  • Need position info in embeddings
  • Doing retrieval with embeddings
  • Pre-trained model uses RoPE

Examples: BLOOM, MPT

Recommendation for New Projects:
Short context (≤4k):
• RoPE (most widely used, good ecosystem)
• ALiBi (simpler, good extrapolation)
Long context (4k-32k):
• RoPE + YaRN scaling
• ALiBi
Very long context (32k+):
• RoPE + YaRN or Dynamic NTK
• ALiBi
• Consider sparse attention too
Implementation Checklist:
RoPE:
☐ Apply to Q and K only (not V)
☐ Cache sin/cos for efficiency
☐ Consider grouped-query attention
☐ Plan for length extension if needed
ALiBi:
☐ Remove position embeddings from input
☐ Compute slopes correctly for head count
☐ Add bias BEFORE softmax
☐ Works with causal and bidirectional
Quick Comparison Table:
| Aspect | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Parameters | None | None | None |
| Position in | Embedding | Q,K rotation | Attn bias |
| Relative pos | Implicit | Yes | Yes |
| Extrapolation | Poor | Good* | Excellent |
| Complexity | Low | Medium | Low |
| Modern usage | Legacy | Dominant | Popular |
* With extensions (PI, NTK, YaRN)
Summary:
| Method | How It Works | Best For |
|---|---|---|
| Sinusoidal | Add fixed patterns to embeddings | Legacy, simple models |
| RoPE | Rotate Q,K by position angle | Modern LLMs, good balance |
| ALiBi | Linear bias in attention | Long context, extrapolation |
Implementation Notes:
RoPE:
• Apply to Q and K, not V
• Use half-rotation trick for efficiency
• Cache sin/cos values
• Consider YaRN for extension
ALiBi:
• No position in embeddings
• Geometric slopes per head
• Add bias before softmax
• Works out-of-box for long sequences
Exercises:
1. Implement RoPE and verify the relative position property.
2. Compare attention patterns with sinusoidal, RoPE, and ALiBi.
3. Test extrapolation: train on 512, evaluate on 1024, 2048.
4. Implement YaRN scaling and compare with linear interpolation.
5. Build a transformer that uses RoPE for encoder and ALiBi for decoder.
Next Chapter: In Chapter 17, we'll cover production deployment including model optimization, quantization, and serving infrastructure.