Chapter 16

Modern Position Encodings

Advanced Architectures

Introduction

This section covers modern position encoding techniques that have replaced sinusoidal embeddings in state-of-the-art models: Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi).


3.1 Limitations of Traditional Position Encodings

Why New Approaches Were Needed

📝text
ORIGINAL SINUSOIDAL (Vaswani et al., 2017):
───────────────────────────────────────────

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Properties:
✓ No learned parameters
✓ Theoretically extrapolates to longer sequences
✓ Relative position information encoded

Limitations:
✗ Extrapolation doesn't work well in practice
✗ Position info mixed with content
✗ Limited flexibility


LEARNED ABSOLUTE (BERT, GPT-2):
────────────────────────────────

PE = Embedding(position, d_model)

Properties:
✓ Simple and flexible
✓ Can learn complex patterns
✓ Works well within trained length

Limitations:
✗ Cannot extrapolate beyond max_position
✗ Fixed maximum sequence length
✗ More parameters
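
The contrast between the two schemes can be sketched in a few lines of PyTorch. This is a minimal illustration, not code taken from any of the models named above: `sinusoidal_pe` is a hypothetical helper, and the sizes (`d_model = 16`, `max_position = 8`) are arbitrary choices for the example.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_pe(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """Return the fixed sinusoidal encoding for arbitrary integer positions."""
    # div_term[i] = 1 / 10000^(2i/d_model), one frequency per sin/cos pair
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    angles = positions.float().unsqueeze(1) * div_term  # [num_pos, d_model/2]
    pe = torch.zeros(positions.shape[0], d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions
    return pe


d_model, max_position = 16, 8  # illustrative sizes only

# Sinusoidal: any position can be evaluated, including ones never trained on
far_positions = torch.tensor([0, 7, 5000])
sin_pe = sinusoidal_pe(far_positions, d_model)

# Learned absolute: a lookup table with exactly max_position rows
learned = nn.Embedding(max_position, d_model)
inside = learned(torch.tensor([0, 7]))  # both indices are inside the table
try:
    learned(torch.tensor([max_position]))  # one past the last row
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

Being able to evaluate the sinusoidal formula at position 5000 does not mean a model trained on short sequences handles that position well, which is the practical limitation listed above. The learned table fails more abruptly: there is simply no row to look up.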

Modern Approaches:

RoPE (Rotary Position Embeddings)
Used in: LLaMA, Mistral, Qwen, PaLM

• Encodes position through rotation
• Relative positions emerge naturally
• Good length extrapolation with extensions

ALiBi (Attention with Linear Biases)
Used in: BLOOM, MPT

• No position in embeddings
• Linear bias in attention based on distance
• Excellent extrapolation

Length Extrapolation Comparison:

Model trained on 2048 tokens, tested on 4096:

Method             Perplexity (2k)   Perplexity (4k)   Extrapolation
─────────────────────────────────────────────────────────────────────
Sinusoidal         15.2              150+              Very Poor
Learned Absolute   14.8              500+              Terrible
T5 Relative        15.0              45.2              Poor
RoPE (base)        14.5              35.8              Decent
RoPE (extended)    14.5              18.2              Good
ALiBi              15.1              17.5              Excellent

3.2 Rotary Position Embeddings (RoPE)

The Core Idea

📝text
KEY INSIGHT:
────────────

Instead of ADDING position to embeddings,
ROTATE embeddings based on position.

Traditional: x' = x + PE(pos)
RoPE: x' = R(pos) · x   (rotation matrix)


WHY ROTATION?
─────────────

Consider 2D rotation:

R(θ) = [cos(θ)  -sin(θ)]
       [sin(θ)   cos(θ)]

Key property: R(θ₁) · R(θ₂) = R(θ₁ + θ₂)

This means:
• Position m rotates by angle m·θ
• Position n rotates by angle n·θ
• Their difference is (m-n)·θ

The RELATIVE position is naturally encoded!


HOW IT WORKS IN ATTENTION:
──────────────────────────

Standard attention at positions m and n:

Attention ∝ q_m · k_n

With RoPE:

Attention ∝ R(m)q · R(n)k
         = q^T · R(m)^T · R(n) · k
         = q^T · R(n-m) · k        (rotation property!)

The attention score depends only on relative position (n-m)!
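
The rotation argument above can be checked numerically. The following is a small sketch under stated assumptions: the per-position angle `theta = 0.3` and the vectors `q` and `k` are arbitrary, and `rot` and `score` are hypothetical helper names introduced only for this example.

```python
import math

import torch


def rot(angle: float) -> torch.Tensor:
    """2D rotation matrix R(angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]], dtype=torch.float64)


theta = 0.3  # arbitrary per-position angle for this sketch
q = torch.tensor([1.0, 2.0], dtype=torch.float64)
k = torch.tensor([-0.5, 0.7], dtype=torch.float64)


def score(m: int, n: int) -> torch.Tensor:
    """Dot product of q rotated to position m and k rotated to position n."""
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)


# Composition property: R(a) R(b) = R(a + b)
composed = rot(0.4) @ rot(1.1)

# Same offset n - m = 3 at very different absolute positions
s_near = score(2, 5)
s_far = score(100, 103)

# Equivalent closed form: q^T R((n - m) * theta) k
s_closed = q @ (rot(3 * theta) @ k)
```

Here `s_near`, `s_far`, and `s_closed` agree up to floating-point error, while changing the offset (for example `score(2, 6)`) gives a different value.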

RoPE Implementation

🐍python
import math
from typing import Optional, Tuple

import torch
import torch.nn as nn


class RotaryPositionEmbedding(nn.Module):
    """
    Rotary Position Embeddings (RoPE).

    Used in LLaMA, Mistral, and many modern LLMs.
    """

    def __init__(
        self,
        dim: int,
        max_seq_len: int = 2048,
        base: float = 10000.0
    ):
        """
        Initialize RoPE.

        Args:
            dim: Dimension to apply RoPE (usually head_dim); must be even
            max_seq_len: Maximum sequence length for precomputation
            base: Base for frequency computation
        """
        super().__init__()

        if dim % 2 != 0:
            raise ValueError("RoPE rotates pairs of dimensions, so dim must be even")

        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Precompute frequencies
        # theta_i = 1 / (base^(2i/dim)) for i = 0, 1, ..., dim/2-1
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

        # Precompute sin and cos for positions
        self._precompute_cache(max_seq_len)

    def _precompute_cache(self, seq_len: int):
        """Precompute sin/cos values for efficiency."""
        # Build positions on the same device as inv_freq so the cache can be
        # rebuilt after the module has been moved to another device
        positions = torch.arange(seq_len, device=self.inv_freq.device).float()

        # freqs: [seq_len, dim/2]
        freqs = torch.outer(positions, self.inv_freq)

        # cos and sin: [seq_len, dim/2]
        cos_cache = torch.cos(freqs)
        sin_cache = torch.sin(freqs)

        # Expand to full dimension by repeating
        # [seq_len, dim]
        cos_cache = torch.cat([cos_cache, cos_cache], dim=-1)
        sin_cache = torch.cat([sin_cache, sin_cache], dim=-1)

        # Caches are derived values, so keep them out of the state dict
        self.register_buffer('cos_cache', cos_cache, persistent=False)
        self.register_buffer('sin_cache', sin_cache, persistent=False)

    def _rotate_half(self, x: torch.Tensor) -> torch.Tensor:
        """
        Rotate half the hidden dims.

        Splits the last dimension into two halves and pairs element i of the
        first half with element i of the second half:

        [x1, x2, x3, x4] -> [-x3, -x4, x1, x2]
        """
        x1 = x[..., :x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2:]
        return torch.cat([-x2, x1], dim=-1)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        positions: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Apply rotary embeddings to queries and keys.

        Args:
            q: Query tensor [batch, heads, seq_len, head_dim]
            k: Key tensor [batch, heads, seq_len, head_dim]
            positions: Optional position indices [seq_len]

        Returns:
            Rotated q and k tensors
        """
        seq_len = q.shape[2]

        # Grow the cache if this call needs positions beyond what is stored
        needed = seq_len if positions is None else int(positions.max()) + 1
        if needed > self.cos_cache.shape[0]:
            self._precompute_cache(needed)

        # Get sin/cos for current sequence
        if positions is None:
            cos = self.cos_cache[:seq_len]
            sin = self.sin_cache[:seq_len]
        else:
            cos = self.cos_cache[positions]
            sin = self.sin_cache[positions]

        # Reshape for broadcasting: [1, 1, seq_len, dim], matching q's dtype
        cos = cos.unsqueeze(0).unsqueeze(0).to(q.dtype)
        sin = sin.unsqueeze(0).unsqueeze(0).to(q.dtype)

        # Apply rotation
        # x * cos + rotate_half(x) * sin
        q_rotated = q * cos + self._rotate_half(q) * sin
        k_rotated = k * cos + self._rotate_half(k) * sin

        return q_rotated, k_rotated


class RoPEAttention(nn.Module):
    """
    Multi-head attention with RoPE.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

        # RoPE
        self.rope = RotaryPositionEmbedding(self.head_dim, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with RoPE.

        Args:
            x: Input [batch, seq_len, d_model]
            attention_mask: Optional additive mask (0 to keep, -inf to block),
                broadcastable to [batch, heads, seq_len, seq_len]
            is_causal: Whether to use causal masking

        Returns:
            Output [batch, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape: [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE to Q and K (not V!)
        q, k = self.rope(q, k)

        # Standard attention computation
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            # True above the diagonal marks future positions to hide
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)

        # Reshape back
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
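
One way to sanity-check a RoPE implementation is to confirm that attention scores do not change when every position is shifted by the same amount. The sketch below is self-contained and does not reuse the classes above: `apply_rope` is a hypothetical stand-alone function that follows the same half-split convention, and the sizes and the shift of 1000 are arbitrary.

```python
import torch


def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate x of shape [seq_len, dim] by the given integer positions."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=x.dtype) / dim))
    angles = torch.outer(positions.to(x.dtype), inv_freq)   # [seq_len, dim/2]
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)   # [seq_len, dim]
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., : dim // 2], x[..., dim // 2 :]
    rotated = torch.cat([-x2, x1], dim=-1)                  # half-split rotation
    return x * cos + rotated * sin


torch.manual_seed(0)
seq_len, dim = 6, 8
q = torch.randn(seq_len, dim, dtype=torch.float64)
k = torch.randn(seq_len, dim, dtype=torch.float64)

positions = torch.arange(seq_len)
shift = 1000  # move the whole window far from the origin

scores = apply_rope(q, positions) @ apply_rope(k, positions).T
scores_shifted = apply_rope(q, positions + shift) @ apply_rope(k, positions + shift).T

# Rotation also preserves vector norms, so magnitudes are untouched
norms_before = q.norm(dim=-1)
norms_after = apply_rope(q, positions).norm(dim=-1)
```

`scores` and `scores_shifted` match up to floating-point error, which is the relative-position property in more than two dimensions. Double precision is used here only to keep the comparison tight.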

3.3 RoPE Extensions for Long Sequences

NTK-aware Scaling and YaRN

📝text
POSITION INTERPOLATION (PI):
────────────────────────────

Simply scale positions down:
pos' = pos × (trained_len / target_len)

Example: Trained on 2048, extend to 4096
Position 2048 becomes: 2048 × (2048/4096) = 1024

✓ Simple
✓ Works reasonably
✗ May lose precision at high frequencies


NTK-AWARE SCALING:
──────────────────

Increase the base so that low frequencies are stretched
while the highest frequency stays unchanged:
base' = base × (scale ^ (dim / (dim-2)))

Example: scale=2, dim=64
base' = 10000 × (2 ^ (64/62)) ≈ 20452

✓ Preserves high-frequency information
✓ Better for very long sequences
✗ May affect short-range patterns


YaRN (Yet another RoPE extensioN):
──────────────────────────────────

Frequency-dependent ("NTK-by-parts") interpolation:
- High frequencies: left unchanged (preserves local detail)
- Low frequencies: linear interpolation as in PI (enables long-range)
- In between: a smooth blend of the two
- Plus a temperature adjustment on the attention logits

✓ Best of both worlds
✓ State-of-the-art extrapolation
✗ More complex
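
The three schemes differ only in how they modify the per-pair rotation frequencies, which the following sketch makes concrete. It is an illustration under assumptions, not a reference implementation: all function names are hypothetical, the by-parts blend keeps high frequencies unchanged and interpolates only low ones, the thresholds `alpha = 1` and `beta = 32` are assumed defaults that should be tuned per model, and the attention temperature used by YaRN is left out.

```python
import math

import torch


def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-pair rotation frequencies theta_i = base^(-2i/dim)."""
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float64) / dim))


def pi_inv_freq(dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    """Position Interpolation: shrinking positions by scale divides every frequency."""
    return rope_inv_freq(dim, base) / scale


def ntk_base(dim: int, scale: float, base: float = 10000.0) -> float:
    """NTK-aware scaling: enlarge the base instead of shrinking positions."""
    return base * scale ** (dim / (dim - 2))


def by_parts_inv_freq(dim, scale, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Simplified NTK-by-parts blend (frequency part of YaRN only)."""
    freq = rope_inv_freq(dim, base)
    # Number of full rotations each pair completes over the training length
    rotations = train_len * freq / (2 * math.pi)
    # gamma = 1 keeps the frequency as is, gamma = 0 interpolates it fully
    gamma = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return gamma * freq + (1.0 - gamma) * freq / scale


dim, scale, train_len = 64, 2.0, 2048
plain = rope_inv_freq(dim)
pi = pi_inv_freq(dim, scale)
ntk = rope_inv_freq(dim, ntk_base(dim, scale))
parts = by_parts_inv_freq(dim, scale, train_len)
```

With these settings, PI halves every frequency, while NTK-aware scaling leaves the fastest pair (`ntk[0]`) untouched and halves only the slowest (`ntk[-1]`). The by-parts blend behaves the same way at the two extremes but switches between "keep" and "interpolate" according to how many rotations a pair completes within the training length.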

Practical Results:

Model trained on 4096, tested on 16384:

Method        PPL (4k)   PPL (8k)   PPL (16k)   Quality
────────────────────────────────────────────────────────
No scaling    12.5       1500+      OOM         Broken
Linear (PI)   12.5       18.2       28.5        Good
NTK           12.5       15.8       22.1        Better
YaRN          12.5       13.8       16.2        Best

3.4 ALiBi: Attention with Linear Biases

Implementation

🐍python
import math
from typing import Optional

import torch
import torch.nn as nn


class ALiBiPositionBias(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    No position encodings in embeddings!
    Instead, add bias to attention scores based on distance.
    """

    def __init__(
        self,
        num_heads: int,
        max_seq_len: int = 2048
    ):
        """
        Initialize ALiBi.

        Args:
            num_heads: Number of attention heads
            max_seq_len: Maximum sequence length for bias precomputation
        """
        super().__init__()

        self.num_heads = num_heads
        self.max_seq_len = max_seq_len

        # Compute slopes for each head
        # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
        slopes = self._compute_slopes(num_heads)
        self.register_buffer('slopes', slopes)

        # Precompute bias matrix
        bias = self._compute_bias(max_seq_len)
        self.register_buffer('bias', bias)

    def _compute_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute slopes for each head.

        Geometric sequence with ratio 2^(-8/num_heads).
        """
        def get_slopes_power_of_2(n):
            # start = 2^(-8/n); the first slope and the ratio are the same value
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * ratio ** i for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # For non-power-of-2 heads, start from the closest lower power
            # of 2 and fill the remaining heads with every other slope from
            # the next power of 2
            closest_power = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power)

            extra_slopes = get_slopes_power_of_2(2 * closest_power)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes).view(num_heads, 1, 1)

    def _compute_bias(self, seq_len: int, device=None) -> torch.Tensor:
        """
        Compute position bias matrix.

        bias[i,j] = -|i-j| (relative distance)
        """
        # Create distance matrix
        positions = torch.arange(seq_len, device=device)
        distance = positions.unsqueeze(1) - positions.unsqueeze(0)  # [S, S]
        distance = -torch.abs(distance).float()  # Negative distances

        return distance

    def forward(self, seq_len: int) -> torch.Tensor:
        """
        Get position bias for given sequence length.

        Args:
            seq_len: Current sequence length

        Returns:
            Bias tensor [num_heads, seq_len, seq_len]
        """
        if seq_len > self.bias.shape[0]:
            # Longer than the precomputed matrix: build the bias on the fly
            # so that extrapolation beyond max_seq_len still works
            bias = self._compute_bias(seq_len, device=self.slopes.device)
        else:
            # Get bias for current length
            bias = self.bias[:seq_len, :seq_len]

        # Scale by head-specific slopes
        # [num_heads, 1, 1] * [seq_len, seq_len] -> [num_heads, seq_len, seq_len]
        return self.slopes * bias


class ALiBiAttention(nn.Module):
    """
    Multi-head attention with ALiBi position biases.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # ALiBi position bias
        self.alibi = ALiBiPositionBias(num_heads, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with ALiBi.

        Note: No position encoding added to input!
        Position is only in attention bias.
        """
        batch_size, seq_len, _ = x.shape

        # Project (no position encoding in embeddings!)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape: [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute attention scores
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Add ALiBi bias (this is where position information comes in!)
        alibi_bias = self.alibi(seq_len)
        scores = scores + alibi_bias.unsqueeze(0)  # [1, H, S, S]

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            # True above the diagonal marks future positions to hide
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
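
To see what the bias actually looks like, it helps to compute it for a tiny case. The sketch below is a stand-alone simplification: `alibi_slopes` and `alibi_bias` are hypothetical helper names, only power-of-two head counts are handled, and the symmetric distance `|i - j|` is used.

```python
import torch


def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for a power-of-two head count n."""
    if num_heads < 1 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only covers power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Bias of shape [num_heads, seq_len, seq_len] with entries -slope * |i - j|."""
    pos = torch.arange(seq_len)
    distance = (pos.unsqueeze(1) - pos.unsqueeze(0)).abs().float()
    return -alibi_slopes(num_heads).view(num_heads, 1, 1) * distance


slopes = alibi_slopes(8)                   # 1/2, 1/4, ..., 1/256
bias = alibi_bias(num_heads=8, seq_len=4)  # one 4x4 matrix per head
first_row_head0 = bias[0, 0]               # penalties seen by token 0 in head 0
```

For eight heads the slopes are the powers 1/2 through 1/256, so the first row of head 0 is 0, -0.5, -1.0, -1.5, while the same row in head 7 is roughly 128 times flatter.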

Key Insight:

Each head has a different "attention span":

• Head 0: Largest slope (2^-1 with 8 heads) → Short-range attention
• Head 7: Smallest slope (2^-8 with 8 heads) → Long-range attention

The bias makes distant tokens less attended to, with the decay rate controlled by the slope.

Token at distance d has bias: -slope × |d|

This creates RELATIVE position encoding without modifying the embeddings at all!
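
The effect of the slope on a head's attention span can be isolated by setting all content scores to zero, so that the softmax sees only the bias. This is a toy sketch under that assumption, with eight heads whose slopes run from 2^-1 (head 0) down to 2^-8 (head 7), and an arbitrary sequence length of 64.

```python
import torch

seq_len = 64
# Slopes for 8 heads: 2^-1 (head 0) down to 2^-8 (head 7)
slopes = torch.tensor([2.0 ** -(h + 1) for h in range(8)])

# Distance from the last token back to every token: 63, 62, ..., 0
distance = torch.arange(seq_len - 1, -1, -1).float()

# With identical content scores, attention is decided by the bias alone
logits = -slopes.unsqueeze(1) * distance.unsqueeze(0)   # [heads, seq_len]
weights = torch.softmax(logits, dim=-1)

# Share of attention each head spends on the 4 nearest tokens
near_mass = weights[:, -4:].sum(dim=-1)
for head, mass in enumerate(near_mass.tolist()):
    print(f"head {head}: slope={slopes[head].item():.4f}  mass on 4 nearest={mass:.3f}")
```

The head with the steepest slope puts most of its weight on the nearest few tokens, and the share drops steadily as the slope gets smaller, leaving the flattest head spread almost evenly over the whole window.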

Extrapolation:

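Because the bias is a linear function of distance rather than a lookup over absolute positions, the same rule applies at any sequence length:

• Train on 1024: bias at distance 500 = -slope × 500
• Test on 4096: bias at distance 2000 = -slope × 2000

Distances beyond the training length simply receive a larger penalty from the same formula, so no new parameters or unseen embeddings are involved.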
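
A short check makes this concrete: the bias for a longer sequence contains the bias for a shorter one as its top-left corner. The sketch assumes a single head with an arbitrary slope of 2^-4, uses a hypothetical helper `single_head_bias`, and uses lengths 256 and 1024 as small stand-ins for the training and test lengths to keep the matrices cheap to build.

```python
import torch


def single_head_bias(slope: float, seq_len: int) -> torch.Tensor:
    """Single-head ALiBi bias: -slope * |i - j| for a sequence of any length."""
    pos = torch.arange(seq_len)
    return -slope * (pos.unsqueeze(1) - pos.unsqueeze(0)).abs().float()


slope = 2.0 ** -4
train_len, test_len = 256, 1024  # stand-ins for the training and test lengths

short = single_head_bias(slope, train_len)
long = single_head_bias(slope, test_len)

# The top-left corner of the long matrix is exactly the short matrix
same_prefix = torch.equal(long[:train_len, :train_len], short)

# A distance never seen in training still gets a well-defined penalty
penalty_at_1000 = long[1000, 0].item()  # -slope * 1000
```

Nothing about the computation depends on the training length, which is why no retraining or resizing step is needed to run at the longer length. Whether the model's quality holds up there is an empirical question that this check does not answer.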


3.5 Comparison and When to Use Each

Summary

📝text
SINUSOIDAL/LEARNED ABSOLUTE:
────────────────────────────
✓ Use when: Fixed sequence length, simplicity needed
✗ Avoid when: Need length extrapolation
Examples: BERT, GPT-2, original Transformer


ROTARY POSITION EMBEDDINGS (RoPE):
──────────────────────────────────
✓ Use when:
  • Building modern LLMs
  • Need good length extension
  • Want relative position benefits
  • Using grouped-query attention

✗ Consider alternatives when:
  • Need absolute position info
  • Very long sequences (>32k) without extension

Examples: LLaMA, Mistral, Qwen, PaLM, Falcon
Extensions: PI, NTK, YaRN for 100k+ context


ALiBi:
──────
✓ Use when:
  • Need excellent extrapolation out-of-box
  • Simplicity is important
  • Training long-context models

✗ Consider alternatives when:
  • Need position info in embeddings
  • Doing retrieval with embeddings
  • Pre-trained model uses RoPE

Examples: BLOOM, MPT

Recommendation for New Projects:

Short context (≤4k):
• RoPE (most widely used, good ecosystem)
• ALiBi (simpler, good extrapolation)

Long context (4k-32k):
• RoPE + YaRN scaling
• ALiBi

Very long context (32k+):
• RoPE + YaRN or Dynamic NTK
• ALiBi
• Consider sparse attention too

Implementation Checklist:

RoPE:
☐ Apply to Q and K only (not V)
☐ Cache sin/cos for efficiency
☐ Consider grouped-query attention
☐ Plan for length extension if needed

ALiBi:
☐ Remove position embeddings from input
☐ Compute slopes correctly for head count
☐ Add bias BEFORE softmax
☐ Works with causal and bidirectional

Quick Comparison Table:

Aspect          Sinusoidal   RoPE           ALiBi
───────────────────────────────────────────────────
Parameters      None         None           None
Position in     Embedding    Q,K rotation   Attn bias
Relative pos    Implicit     Yes            Yes
Extrapolation   Poor         Good*          Excellent
Complexity      Low          Medium         Low
Modern usage    Legacy       Dominant       Popular

* With extensions (PI, NTK, YaRN)


Summary:

Method       How It Works                       Best For
──────────────────────────────────────────────────────────────────────────
Sinusoidal   Add fixed patterns to embeddings   Legacy, simple models
RoPE         Rotate Q,K by position angle       Modern LLMs, good balance
ALiBi        Linear bias in attention           Long context, extrapolation

Implementation Notes:

RoPE:
• Apply to Q and K, not V
• Use half-rotation trick for efficiency
• Cache sin/cos values
• Consider YaRN for extension

ALiBi:
• No position in embeddings
• Geometric slopes per head
• Add bias before softmax
• Works out-of-box for long sequences

Exercises:

1. Implement RoPE and verify the relative position property.
2. Compare attention patterns with sinusoidal, RoPE, and ALiBi.
3. Test extrapolation: train on 512, evaluate on 1024, 2048.
4. Implement YaRN scaling and compare with linear interpolation.
5. Build a transformer that uses RoPE for encoder and ALiBi for decoder.

Next Chapter: In Chapter 17, we'll cover production deployment including model optimization, quantization, and serving infrastructure.
