Introduction
While sinusoidal positional encoding from the original Transformer paper works well, researchers have developed numerous alternatives that address its limitations—particularly for long context lengths and length extrapolation.
This section explores the landscape of modern positional encodings, from the widely-adopted Rotary Position Embedding (RoPE) used in LLaMA and Gemma, to ALiBi used in BLOOM, and beyond.
Key Insight: Modern positional encodings focus on encoding relative positions rather than absolute positions, enabling better generalization to sequence lengths not seen during training.
6.1 Taxonomy of Positional Encodings
Positional encodings can be categorized along several dimensions:
By Position Type
| Type | Description | Examples |
|---|---|---|
| Absolute | Each position gets a unique encoding | Sinusoidal, Learned |
| Relative | Encodes distance between tokens | T5 RPE, RoPE, ALiBi |
By Learning Method
| Type | Description | Examples |
|---|---|---|
| Fixed/Deterministic | Computed using mathematical functions | Sinusoidal, ALiBi |
| Learned | Trained as model parameters | BERT, GPT-2, T5 bias |
| Hybrid | Fixed structure with learnable components | RoPE (fixed), FIRE (learned) |
By Integration Point
| Type | Description | Examples |
|---|---|---|
| Input Addition | Added to token embeddings | Sinusoidal, Learned |
| Attention Modification | Applied within attention computation | RoPE, ALiBi, T5 RPE |
| None (Implicit) | No explicit PE, relies on causal mask | NoPE |
6.2 Rotary Position Embedding (RoPE)
RoPE is one of the most widely adopted positional encodings in modern LLMs, used in LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, and many others.
Core Idea
Instead of adding positional information to embeddings, RoPE rotates the query and key vectors based on their position. The rotation angle depends on both the position and the dimension.
Key Property: When computing attention between positions $m$ and $n$, the dot product depends only on the relative distance $m - n$, not the absolute positions.
Mathematical Formulation
For the 2D case, the rotation is applied as:

$$f(x, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

For higher dimensions, RoPE applies rotations to pairs of dimensions, each pair with its own frequency:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

The rotated query at position $m$ becomes:

$$\tilde{q}_m = R^d_{\Theta,m}\, q_m$$

Where $R^d_{\Theta,m}$ is a block-diagonal rotation matrix whose $i$-th $2 \times 2$ block rotates by angle $m\theta_i$.
Why Rotation Works
The key insight is that the dot product of two rotated vectors gives:

$$(R^d_{\Theta,m}\, q)^\top (R^d_{\Theta,n}\, k) = q^\top R^d_{\Theta,n-m}\, k$$

This means the attention score depends only on the relative position $n - m$!
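This property is easy to check numerically. The sketch below (plain PyTorch, 2D case; the helper `rot2d` is mine, not from any library) rotates a query by its position angle and a key by its own, then shifts both positions by the same offset and confirms the attention logit is unchanged:

```python
import math
import torch

def rot2d(x: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2-D vector by the given angle (the 2-D RoPE case)."""
    c, s = math.cos(angle), math.sin(angle)
    R = torch.tensor([[c, -s], [s, c]])
    return R @ x

theta = 0.1
q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])

# Attention logit between query position m and key position n
m, n = 7, 3
score = rot2d(q, m * theta) @ rot2d(k, n * theta)

# Shifting both positions by the same offset preserves n - m,
# so the logit is identical
score_shifted = rot2d(q, (m + 100) * theta) @ rot2d(k, (n + 100) * theta)

print(torch.allclose(score, score_shifted, atol=1e-4))  # True
```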
PyTorch Implementation
```python
import torch
import torch.nn as nn


class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (RoPE) from the RoFormer paper.

    Used in: LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, Qwen, etc.
    """

    def __init__(self, dim: int, max_seq_len: int = 4096, base: int = 10000):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Compute inverse frequencies: theta_i = base^(-2i/dim)
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Precompute cos and sin
        self._build_cache(max_seq_len)

    def _build_cache(self, seq_len: int):
        """Precompute cos and sin for positions."""
        positions = torch.arange(seq_len, dtype=self.inv_freq.dtype)
        freqs = torch.outer(positions, self.inv_freq)  # [seq_len, dim/2]

        # Create [seq_len, dim]; each frequency appears in both halves,
        # matching the rotate_half convention below
        emb = torch.cat([freqs, freqs], dim=-1)

        # Plain attribute assignment (not register_buffer) so the cache
        # can be rebuilt when a longer sequence arrives
        self.cos_cached = emb.cos()
        self.sin_cached = emb.sin()
        self.max_seq_len = max(self.max_seq_len, seq_len)

    def forward(self, x: torch.Tensor, seq_len: int) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Returns cos and sin for the given sequence length.

        Args:
            x: Input tensor (used only for device/dtype)
            seq_len: Current sequence length

        Returns:
            (cos, sin) each of shape [seq_len, dim]
        """
        if seq_len > self.max_seq_len:
            self._build_cache(seq_len)

        return (
            self.cos_cached[:seq_len].to(device=x.device, dtype=x.dtype),
            self.sin_cached[:seq_len].to(device=x.device, dtype=x.dtype),
        )


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate half the hidden dims of the input: (x1, x2) -> (-x2, x1)."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Apply rotary position embedding to query and key tensors.

    Args:
        q: Query tensor [batch, heads, seq_len, head_dim]
        k: Key tensor [batch, heads, seq_len, head_dim]
        cos: Cosine values [seq_len, head_dim]
        sin: Sine values [seq_len, head_dim]

    Returns:
        Rotated (q, k) tensors
    """
    # Reshape cos/sin for broadcasting
    cos = cos.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, dim]
    sin = sin.unsqueeze(0).unsqueeze(0)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)

    return q_embed, k_embed


# Example usage
def test_rope():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    rope = RotaryPositionalEmbedding(dim=head_dim)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)

    cos, sin = rope(q, seq_len)
    q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)

    print(f"Original Q shape: {q.shape}")
    print(f"Rotated Q shape: {q_rot.shape}")
    print(f"Cos cache shape: {cos.shape}")

test_rope()
```

RoPE Variants for Long Context
Several extensions improve RoPE's length extrapolation:
- Position Interpolation (PI): Divides position indices by the context-extension ratio so longer sequences map back into the position range seen during training.
- NTK-Aware Scaling: Enlarges the RoPE base so low-frequency dimensions are stretched more than high-frequency ones.
- YaRN: Combines multiple scaling strategies for optimal long-context performance.
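The first two ideas are small changes to how the RoPE angles are computed. A minimal sketch (the function name and parameters are mine, not from any library): PI divides the position indices by the scaling factor, while NTK-aware scaling enlarges the base using the commonly cited exponent $d/(d-2)$:

```python
import torch

def rope_angles(dim: int, seq_len: int, base: float = 10000.0,
                pi_scale: float = 1.0, ntk_scale: float = 1.0) -> torch.Tensor:
    """Return the [seq_len, dim/2] matrix of RoPE rotation angles.

    pi_scale  > 1: Position Interpolation (positions are squeezed).
    ntk_scale > 1: NTK-aware scaling (base is enlarged, stretching
    low-frequency dimensions more than high-frequency ones).
    """
    base = base * ntk_scale ** (dim / (dim - 2))          # NTK-aware base change
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() / pi_scale  # PI position squeeze
    return torch.outer(positions, inv_freq)

# With pi_scale=2, a model trained on 2048 positions sees 4096 positions
# mapped back into its trained angle range.
angles = rope_angles(dim=64, seq_len=4096, pi_scale=2.0)
print(angles.shape)  # torch.Size([4096, 32])
```

YaRN combines these per-dimension: high-frequency dimensions are left alone, low-frequency ones are interpolated, with a temperature correction on attention.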
6.3 ALiBi (Attention with Linear Biases)
ALiBi takes a radically different approach: instead of encoding positions in representations, it directly biases attention scores based on key-query distance.
Core Idea
ALiBi adds a linear penalty to attention scores based on the distance between tokens. Tokens that are farther apart receive lower attention scores.
Key Insight: ALiBi doesn't modify embeddings at all. It only adds a position-dependent bias to the attention scores before softmax.
Mathematical Formulation
The attention score for query position $i$ and key position $j$ becomes:

$$a_{ij} = q_i \cdot k_j - m \cdot (i - j)$$

Where $m$ is a head-specific slope and $i - j$ is the distance from the query to the (earlier) key, so the bias $-m(i - j)$ grows more negative with distance.
Multi-Head Slopes
Different attention heads use different slopes, creating diverse attention patterns. ALiBi assigns slopes using a geometric sequence:

$$m_h = 2^{-8h/n}, \qquad h = 1, \ldots, n$$

Where $n$ is the number of heads and $h$ is the head index.
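For a concrete example with $n = 8$ heads, the slopes are $2^{-1}, 2^{-2}, \ldots, 2^{-8}$ — each head halves the distance penalty of the previous one (the helper name is mine; it handles only power-of-two head counts):

```python
def alibi_slopes(n: int) -> list[float]:
    """ALiBi slopes for n heads (n a power of two): m_h = 2^(-8h/n)."""
    return [2 ** (-8 * (h + 1) / n) for h in range(n)]

print(alibi_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```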
PyTorch Implementation
```python
import torch
import torch.nn as nn
import math


class ALiBiAttention(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    Used in: BLOOM, MPT, and other models optimized for long contexts.
    """

    def __init__(self, num_heads: int, max_seq_len: int = 4096):
        super().__init__()
        self.num_heads = num_heads

        # Compute slopes for each head (geometric sequence)
        slopes = self._get_alibi_slopes(num_heads)
        self.register_buffer("slopes", slopes)

        # Precompute bias matrix
        bias = self._build_alibi_bias(max_seq_len)
        self.register_buffer("alibi_bias", bias)

    def _get_alibi_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute ALiBi slopes for each attention head.

        The slopes follow a geometric sequence.
        """
        def get_slopes_power_of_2(n: int) -> list:
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * (ratio ** i) for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # Handle non-power-of-2 heads
            closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power_of_2)

            extra_slopes = get_slopes_power_of_2(2 * closest_power_of_2)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power_of_2]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes, dtype=torch.float32)

    def _build_alibi_bias(self, seq_len: int) -> torch.Tensor:
        """
        Build the ALiBi bias matrix.

        bias[i, j] = -|i - j| * slope
        """
        # Create distance matrix
        positions = torch.arange(seq_len)
        distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
        distance = -torch.abs(distance).float()  # Negative distance

        # Apply slopes for each head: [num_heads, seq, seq]
        bias = distance.unsqueeze(0) * self.slopes.view(-1, 1, 1)

        return bias

    def get_bias(self, seq_len: int) -> torch.Tensor:
        """Get ALiBi bias for the given sequence length."""
        if seq_len > self.alibi_bias.size(1):
            # Plain assignment updates the registered buffer in place
            self.alibi_bias = self._build_alibi_bias(seq_len)

        return self.alibi_bias[:, :seq_len, :seq_len]

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: torch.Tensor | None = None
    ) -> torch.Tensor:
        """
        Compute attention with ALiBi bias.

        Args:
            query: [batch, heads, seq_len, head_dim]
            key: [batch, heads, seq_len, head_dim]
            value: [batch, heads, seq_len, head_dim]
            attention_mask: Optional mask

        Returns:
            Output tensor [batch, heads, seq_len, head_dim]
        """
        batch_size, num_heads, seq_len, head_dim = query.shape

        # Compute attention scores
        scale = 1.0 / math.sqrt(head_dim)
        attn_scores = torch.matmul(query, key.transpose(-2, -1)) * scale

        # Add ALiBi bias
        alibi_bias = self.get_bias(seq_len)
        attn_scores = attn_scores + alibi_bias.unsqueeze(0)  # [batch, heads, seq, seq]

        # Apply attention mask if provided
        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask

        # Softmax and weighted sum
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, value)

        return output


# Example usage
def test_alibi():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    alibi_attn = ALiBiAttention(num_heads=num_heads)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)
    v = torch.randn(batch_size, num_heads, seq_len, head_dim)

    output = alibi_attn(q, k, v)

    print(f"ALiBi slopes: {alibi_attn.slopes}")
    print(f"Output shape: {output.shape}")
    print(f"ALiBi bias shape: {alibi_attn.alibi_bias.shape}")

test_alibi()
```

ALiBi Advantages
- Excellent Extrapolation: Models trained on 512 tokens can handle 3072+ tokens.
- No Additional Parameters: Bias is computed, not learned.
- Simple Implementation: Just add a bias matrix to attention scores.
- Computational Efficiency: Trains slightly faster than sinusoidal baselines, per the ALiBi paper.
6.4 Relative Positional Encodings
Relative positional encodings focus on the distance between tokens rather than their absolute positions.
T5 Relative Position Bias
T5 simplified relative position encoding by using bucketed biases. Distances are grouped into buckets, and each bucket has a learnable bias.
```python
import math

import torch
import torch.nn as nn


class T5RelativePositionBias(nn.Module):
    """
    T5-style relative position bias with bucketing.

    Used in: T5, mT5, FLAN-T5
    """

    def __init__(
        self,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        bidirectional: bool = True
    ):
        super().__init__()
        self.num_heads = num_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bidirectional = bidirectional

        # Learnable bias for each bucket and head
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    def _relative_position_bucket(
        self,
        relative_position: torch.Tensor
    ) -> torch.Tensor:
        """
        Map relative positions to bucket indices.

        Uses logarithmic bucketing for larger distances.
        """
        num_buckets = self.num_buckets
        max_distance = self.max_distance

        relative_buckets = 0

        if self.bidirectional:
            num_buckets //= 2
            # Separate buckets for positive and negative positions
            relative_buckets += (relative_position > 0).long() * num_buckets
            relative_position = torch.abs(relative_position)
        else:
            relative_position = -torch.min(
                relative_position,
                torch.zeros_like(relative_position)
            )

        # Half buckets for exact positions, half for log-spaced
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # Log bucketing for larger distances
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()

        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += torch.where(
            is_small,
            relative_position,
            relative_position_if_large
        )

        return relative_buckets

    def forward(self, query_length: int, key_length: int) -> torch.Tensor:
        """
        Compute relative position bias matrix.

        Returns:
            Bias tensor [1, num_heads, query_length, key_length]
        """
        device = self.relative_attention_bias.weight.device

        # Create position indices
        query_positions = torch.arange(query_length, device=device)
        key_positions = torch.arange(key_length, device=device)

        # Compute relative positions
        relative_positions = key_positions.unsqueeze(0) - query_positions.unsqueeze(1)

        # Map to buckets
        relative_buckets = self._relative_position_bucket(relative_positions)

        # Look up biases
        values = self.relative_attention_bias(relative_buckets)  # [q, k, heads]
        values = values.permute(2, 0, 1).unsqueeze(0)  # [1, heads, q, k]

        return values
```

Transformer-XL Relative Encoding
Transformer-XL introduced relative position encoding with a recurrence mechanism, enabling very long context through segment-level recurrence.
6.5 Learned Positional Embeddings
Many models simply learn positional embeddings as trainable parameters, just like word embeddings.
```python
import torch
import torch.nn as nn


class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings (used in BERT, GPT-2).

    Each position has a learnable embedding vector.
    """

    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_seq_len, d_model)
        self.max_seq_len = max_seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Tensor with positional embeddings added
        """
        seq_len = x.size(1)

        if seq_len > self.max_seq_len:
            raise ValueError(
                f"Sequence length {seq_len} exceeds maximum {self.max_seq_len}"
            )

        positions = torch.arange(seq_len, device=x.device)
        pos_embeddings = self.embedding(positions)  # [seq_len, d_model]

        return x + pos_embeddings.unsqueeze(0)  # Broadcast over batch
```

| Model | Position Embedding Type | Max Length |
|---|---|---|
| BERT | Learned absolute | 512 |
| GPT-2 | Learned absolute | 1024 |
| GPT-3 | Learned absolute | 2048 |
| RoBERTa | Learned absolute | 512 |
6.6 Advanced Methods
xPos (Extrapolatable Position Embeddings)
xPos extends RoPE with a decay factor similar to ALiBi, fixing length extrapolation issues.
```python
# xPos multiplies the RoPE-rotated pairs by a position-dependent scale:
# queries are scaled by scale^m and keys by scale^(-n), so the attention
# score decays with the relative distance m - n.

class XPosEmbedding(RotaryPositionalEmbedding):
    """xPos: RoPE with an exponential decay for better extrapolation."""

    def __init__(self, dim: int, base: int = 10000, gamma: float = 0.4):
        super().__init__(dim, base=base)
        self.gamma = gamma

        # Per-pair decay factors from the xPos paper:
        # zeta_i = (i / (d/2) + gamma) / (1 + gamma)
        scale = (torch.arange(0, dim, 2).float() / dim + gamma) / (1 + gamma)
        self.register_buffer("scale", scale)
```

FIRE (Functional Interpolation for Relative Position Encoding)
FIRE uses progressive interpolation by normalizing distance by query position, ensuring the normalized distance is always bounded between [0, 1].
- Can theoretically represent T5's RPE, ALiBi, and KERPLE
- Better length extrapolation through position normalization
- Learnable interpolation function
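A minimal sketch of the idea (my own simplification, not the paper's exact parameterization — FIRE additionally applies a log transform to distances): normalize the distance $i - j$ by the query position $i$ so it is bounded, then feed the result through a small learnable MLP to produce a per-head bias:

```python
import torch
import torch.nn as nn

class FIREBiasSketch(nn.Module):
    """Hypothetical minimal FIRE-style bias: MLP over (i - j) / i."""

    def __init__(self, num_heads: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_heads)
        )

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len).float()
        rel = pos.unsqueeze(1) - pos.unsqueeze(0)     # i - j
        norm = rel / pos.clamp(min=1.0).unsqueeze(1)  # in [0, 1] for j <= i
        bias = self.mlp(norm.unsqueeze(-1))           # [seq, seq, heads]
        return bias.permute(2, 0, 1).unsqueeze(0)     # [1, heads, seq, seq]

fire = FIREBiasSketch(num_heads=8)
print(fire(16).shape)  # torch.Size([1, 8, 16, 16])
```

Because the MLP takes a normalized scalar rather than a bucket index, the same learned function applies to any sequence length.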
NoPE (No Positional Encoding)
Surprisingly, some research shows that decoder-only Transformers can work without explicit positional encoding, relying on causal attention masks to implicitly encode position.
Research Finding: NoPE can outperform explicit positional encoding methods while requiring no additional computation. When trained with SGD, NoPE mostly resembles T5's relative PE attention patterns.
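The intuition can be illustrated with a toy example (not the paper's experiment): under uniform causal attention with no positional signal, token $i$ averages over exactly $i + 1$ prefix tokens, so identical inputs still yield position-dependent outputs:

```python
import torch

# Token embeddings with no positional encoding: a distinguished first
# token, zeros elsewhere
seq_len, d = 4, 8
x = torch.zeros(seq_len, d)
x[0] = 1.0

# Uniform causal attention: token i averages positions 0..i
mask = torch.tril(torch.ones(seq_len, seq_len))
attn = mask / mask.sum(dim=-1, keepdim=True)
out = attn @ x

# Each row's weight on the first token is 1/(i+1): position leaks in
# through the causal mask alone
print(out[:, 0])  # tensor([1.0000, 0.5000, 0.3333, 0.2500])
```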
6.7 Comparison and Trade-offs
Feature Comparison
| Method | Type | Params | Extrapolation | Compute Cost |
|---|---|---|---|---|
| Sinusoidal | Absolute | None | Poor | Low |
| Learned | Absolute | O(L×d) | None | Low |
| RoPE | Relative | None | Good* | Medium |
| ALiBi | Relative | None | Excellent | Low |
| T5 RPE | Relative | O(B×H) | Good | Low |
| xPos | Relative | None | Excellent | Medium |
| FIRE | Relative | Learned | Excellent | Medium |
*RoPE extrapolation improves with PI/NTK/YaRN extensions
Extrapolation Performance
When models need to handle sequences longer than their training length:
- Best: ALiBi, FIRE, xPos
- Good: T5 RPE, RoPE (with extensions)
- Limited: Sinusoidal, Vanilla RoPE
- None: Learned embeddings (hard cutoff)
Key Similarities: RoPE vs ALiBi
- Both focus on relative positions, not absolute
- Both modify attention weights rather than adding to embeddings
- Both have recency bias (farther tokens receive less attention)
- Both are deterministic (no learned parameters for positions)
Key Differences: RoPE vs ALiBi
| Aspect | RoPE | ALiBi |
|---|---|---|
| Mechanism | Rotates Q/K vectors | Adds bias to attention scores |
| Math | Trigonometric (rotation) | Linear penalty |
| Where Applied | Before dot product | After dot product |
| Head Variation | Same rotation for all heads | Different slope per head |
| Extensions | PI, NTK, YaRN | Generally not needed |
6.8 Which Models Use What?
| Model | Positional Encoding | Notes |
|---|---|---|
| GPT-2/3 | Learned Absolute | Limited to training length |
| BERT | Learned Absolute | 512 token limit |
| T5 | Relative (Bucketed) | Good extrapolation |
| LLaMA 1/2/3 | RoPE | With extensions for long context |
| Mistral | RoPE | Sliding window attention |
| Gemma | RoPE | Google's open model |
| Qwen | RoPE | With NTK-aware scaling |
| BLOOM | ALiBi | 176B multilingual model |
| MPT | ALiBi | MosaicML's model |
| Falcon | RoPE + ALiBi | Hybrid approach |
| Claude | Unknown | Anthropic's models |
| GPT-4 | Unknown | OpenAI's latest |
Trend: Modern LLMs (2023-2024) predominantly use RoPE, especially for models that need long context support. ALiBi remains popular for its simplicity and excellent extrapolation without extensions.
6.9 Implementation Guide
Choosing the Right Encoding
Use RoPE when:
- Building a general-purpose LLM
- You need good relative position understanding
- Long context is important (with extensions)
- Following LLaMA-style architectures
Use ALiBi when:
- Length extrapolation is critical
- You want simplicity (no extensions needed)
- Training on shorter sequences, deploying on longer
- Computational efficiency is important
Use Learned embeddings when:
- Sequence length is fixed and known
- Simplicity is preferred over flexibility
- Following BERT-style architectures
Use Sinusoidal when:
- Following the original Transformer paper
- Building encoder-decoder models
- Simplicity and determinism are needed
Summary
Key Takeaways
- Sinusoidal (Original): Deterministic, simple, but poor extrapolation
- Learned: Flexible but limited to training length
- RoPE: Rotates embeddings, widely adopted in modern LLMs (LLaMA, Mistral)
- ALiBi: Linear bias on attention, excellent extrapolation (BLOOM, MPT)
- T5 RPE: Bucketed relative biases, good balance of simplicity and performance
- xPos/FIRE: Advanced extensions for even better extrapolation
The Trend
Modern LLMs are moving toward relative positional encodings that modify attention rather than embeddings. RoPE has become the de facto standard for most open-source LLMs, while ALiBi offers a simpler alternative with excellent extrapolation properties.
Remember: The best positional encoding depends on your use case. For most modern applications, RoPE or ALiBi will outperform sinusoidal and learned embeddings, especially for long-context scenarios.
References
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022)
- Exploring Length Generalization in Large Language Models (Anil et al., 2022)
- FIRE: Functional Interpolation for Relative Position Encoding (2023)
Next Steps
Now that you understand the landscape of positional encodings, you're ready to move on to Subword Tokenization in the next chapter, where we'll learn how to build vocabularies using Byte Pair Encoding (BPE).