Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

This section covers two modern position encoding techniques that have largely replaced sinusoidal embeddings in state-of-the-art models: Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi).

3.1 Limitations of Traditional Position Encodings

Why New Approaches Were Needed

📝text

ORIGINAL SINUSOIDAL (Vaswani et al., 2017):
───────────────────────────────────────────

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Properties:
✓ No learned parameters
✓ Theoretically extrapolates to longer sequences
✓ Relative position information encoded

Limitations:
✗ Extrapolation doesn't work well in practice
✗ Position info mixed with content
✗ Limited flexibility


LEARNED ABSOLUTE (BERT, GPT-2):
────────────────────────────────

PE = Embedding(position, d_model)

Properties:
✓ Simple and flexible
✓ Can learn complex patterns
✓ Works well within trained length

Limitations:
✗ Cannot extrapolate beyond max_position
✗ Fixed maximum sequence length
✗ More parameters
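
As a concrete reference point for the sinusoidal formula, here is a minimal pure-Python sketch. The function name `sinusoidal_pe` and its `base` parameter are illustrative assumptions, and the sketch assumes an even `d_model`.

```python
import math
from typing import List


def sinusoidal_pe(pos: int, d_model: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding for a single position (assumes d_model is even)."""
    pe = []
    for j in range(0, d_model, 2):          # j plays the role of 2i in the formula
        angle = pos / (base ** (j / d_model))
        pe.append(math.sin(angle))           # even index 2i
        pe.append(math.cos(angle))           # odd index 2i+1
    return pe


print(sinusoidal_pe(0, 4))   # [0.0, 1.0, 0.0, 1.0]
print(len(sinusoidal_pe(7, 8)))  # 8
```

Every value depends only on `pos` and the dimension index, which is why this scheme needs no learned parameters.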

Modern Approaches:

RoPE (Rotary Position Embeddings)
Used in: LLaMA, Mistral, Qwen, PaLM

• Encodes position through rotation
• Relative positions emerge naturally
• Good length extrapolation with extensions

ALiBi (Attention with Linear Biases)
Used in: BLOOM, MPT

• No position in embeddings
• Linear bias in attention based on distance
• Excellent extrapolation

Length Extrapolation Comparison:

Model trained on 2048 tokens, tested on 4096:

Method	Perplexity (2k)	Perplexity (4k)	Extrapolation
Sinusoidal	15.2	150+	Very Poor
Learned Absolute	14.8	500+	Terrible
T5 Relative	15.0	45.2	Poor
RoPE (base)	14.5	35.8	Decent
RoPE (extended)	14.5	18.2	Good
ALiBi	15.1	17.5	Excellent

3.2 Rotary Position Embeddings (RoPE)

The Core Idea

📝text

KEY INSIGHT:
────────────

Instead of ADDING position to embeddings,
ROTATE embeddings based on position.

Traditional: x' = x + PE(pos)
RoPE: x' = R(pos) · x   (rotation matrix)


WHY ROTATION?
─────────────

Consider 2D rotation:

R(θ) = [cos(θ)  -sin(θ)]
       [sin(θ)   cos(θ)]

Key property: R(θ₁) · R(θ₂) = R(θ₁ + θ₂)

This means:
• Position m rotates by angle m·θ
• Position n rotates by angle n·θ
• Their difference is (m-n)·θ

The RELATIVE position is naturally encoded!


HOW IT WORKS IN ATTENTION:
──────────────────────────

Standard attention at positions m and n:

Attention ∝ q_m · k_n

With RoPE:

Attention ∝ R(m)q · R(n)k
         = q^T · R(m)^T · R(n) · k
         = q^T · R(n-m) · k        (rotation property!)

The attention score depends only on relative position (n-m)!
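
The relative-position property can be checked numerically in two dimensions. The sketch below is a toy under stated assumptions: `rotate`, `rope_score`, and the angle `theta=0.1` are hypothetical names and values chosen for illustration only.

```python
import math


def rotate(v, angle):
    """Apply the 2D rotation matrix R(angle) to the vector v = (x, y)."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def rope_score(q, k, m, n, theta=0.1):
    """Unnormalized score between a query at position m and a key at position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))


q, k = (1.0, 2.0), (0.5, -1.5)
same_offset_a = rope_score(q, k, m=3, n=7)       # offset n - m = 4
same_offset_b = rope_score(q, k, m=103, n=107)   # offset 4 again, shifted by 100
other_offset = rope_score(q, k, m=3, n=8)        # offset 5

print(abs(same_offset_a - same_offset_b) < 1e-9)  # True
print(abs(same_offset_a - other_offset) > 1e-3)   # True
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it.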

RoPE Implementation

🐍python

import torch
import torch.nn as nn
import math
from typing import Optional, Tuple


class RotaryPositionEmbedding(nn.Module):
    """
    Rotary Position Embeddings (RoPE).

    Used in LLaMA, Mistral, and many modern LLMs.
    """

    def __init__(
        self,
        dim: int,
        max_seq_len: int = 2048,
        base: float = 10000.0
    ):
        """
        Initialize RoPE.

        Args:
            dim: Dimension to apply RoPE (usually head_dim)
            max_seq_len: Maximum sequence length for precomputation
            base: Base for frequency computation
        """
        super().__init__()

        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Precompute frequencies
        # theta_i = 1 / (base^(2i/dim)) for i = 0, 1, ..., dim/2-1
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

        # Precompute sin and cos for positions
        self._precompute_cache(max_seq_len)

    def _precompute_cache(self, seq_len: int):
        """Precompute sin/cos values for efficiency."""
        positions = torch.arange(seq_len).float()

        # freqs: [seq_len, dim/2]
        freqs = torch.outer(positions, self.inv_freq)

        # cos and sin: [seq_len, dim/2]
        cos_cache = torch.cos(freqs)
        sin_cache = torch.sin(freqs)

        # Expand to full dimension by repeating
        # [seq_len, dim]
        cos_cache = torch.cat([cos_cache, cos_cache], dim=-1)
        sin_cache = torch.cat([sin_cache, sin_cache], dim=-1)

        self.register_buffer('cos_cache', cos_cache)
        self.register_buffer('sin_cache', sin_cache)

    def _rotate_half(self, x: torch.Tensor) -> torch.Tensor:
        """
        Rotate half the hidden dims.

        Splits the last dimension into two halves and swaps them with a sign flip:
        [x1, x2, x3, x4] -> [-x3, -x4, x1, x2]

        Element i is therefore paired with element i + dim/2.
        """
        x1 = x[..., :x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2:]
        return torch.cat([-x2, x1], dim=-1)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        positions: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Apply rotary embeddings to queries and keys.

        Args:
            q: Query tensor [batch, heads, seq_len, head_dim]
            k: Key tensor [batch, heads, seq_len, head_dim]
            positions: Optional position indices [seq_len]

        Returns:
            Rotated q and k tensors
        """
        seq_len = q.shape[2]

        # Get sin/cos for current sequence
        # (seq_len and positions must stay within max_seq_len)
        if positions is None:
            cos = self.cos_cache[:seq_len]
            sin = self.sin_cache[:seq_len]
        else:
            cos = self.cos_cache[positions]
            sin = self.sin_cache[positions]

        # Reshape for broadcasting: [1, 1, seq_len, dim]
        cos = cos.unsqueeze(0).unsqueeze(0)
        sin = sin.unsqueeze(0).unsqueeze(0)

        # Apply rotation
        # x * cos + rotate_half(x) * sin
        q_rotated = q * cos + self._rotate_half(q) * sin
        k_rotated = k * cos + self._rotate_half(k) * sin

        return q_rotated, k_rotated


class RoPEAttention(nn.Module):
    """
    Multi-head attention with RoPE.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

        # RoPE
        self.rope = RotaryPositionEmbedding(self.head_dim, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with RoPE.

        Args:
            x: Input [batch, seq_len, d_model]
            attention_mask: Optional additive mask (added to the scores)
            is_causal: Whether to use causal masking

        Returns:
            Output [batch, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape: [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE to Q and K (not V!)
        q, k = self.rope(q, k)

        # Standard attention computation
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)

        # Reshape back
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
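
To make the half-split pairing used by `_rotate_half` concrete, the following pure-Python sketch mirrors the same arithmetic on plain lists. It is an illustration, not a replacement for the module: `rotate_half` and `apply_rope` here are standalone helper names assumed for this example.

```python
import math
from typing import List


def rotate_half(x: List[float]) -> List[float]:
    """Swap the two halves with a sign flip: [a, b | c, d] -> [-c, -d, a, b]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """x * cos + rotate_half(x) * sin, with angles repeated across both halves."""
    dim = len(x)
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    angles = [pos * f for f in inv_freq] * 2     # mirrors torch.cat([freqs, freqs])
    rot = rotate_half(x)
    return [x[j] * math.cos(angles[j]) + rot[j] * math.sin(angles[j])
            for j in range(dim)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


print(rotate_half([1.0, 2.0, 3.0, 4.0]))   # [-3.0, -4.0, 1.0, 2.0]

x = [0.3, -1.2, 0.8, 2.0]
rotated = apply_rope(x, pos=5)
# Each pair (x[i], x[i + dim/2]) undergoes a pure rotation, so the norm is unchanged.
print(abs(dot(x, x) - dot(rotated, rotated)) < 1e-9)   # True
```

Because every pair is rotated rather than scaled, RoPE changes the direction of `q` and `k` but not their length.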

3.3 RoPE Extensions for Long Sequences

NTK-aware Scaling and YaRN

📝text

POSITION INTERPOLATION (PI):
────────────────────────────

Simply scale positions down:
pos' = pos × (trained_len / target_len)

Example: Trained on 2048, extend to 4096
Position 2048 becomes: 2048 × (2048/4096) = 1024

✓ Simple
✓ Works reasonably
✗ May lose precision at high frequencies


NTK-AWARE SCALING:
──────────────────

Increase the base so that low frequencies drop the most
while the highest frequency stays unchanged:
base' = base × (scale ^ (dim / (dim-2)))

Example: scale=2, dim=64
base' = 10000 × (2 ^ (64/62)) ≈ 20450

✓ Preserves high-frequency information
✓ Better for very long sequences
✗ May affect short-range patterns


YaRN (Yet another RoPE extensioN):
──────────────────────────────────

Frequency-dependent interpolation:
- High frequencies: left unchanged (preserves local detail)
- Low frequencies: linear interpolation as in PI (enables long-range)
- In between: smooth blend of the two
- Plus a temperature adjustment on the attention logits

✓ Best of both worlds
✓ State-of-the-art extrapolation
✗ More complex
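
The PI and NTK formulas are easy to evaluate directly. The sketch below uses hypothetical helper names (`rope_inv_freq`, `pi_position`, `ntk_base`) and only the formulas stated in this section.

```python
from typing import List


def rope_inv_freq(dim: int, base: float = 10000.0) -> List[float]:
    """theta_i = 1 / base^(2i/dim) for i = 0 .. dim/2 - 1."""
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]


def pi_position(pos: float, trained_len: int, target_len: int) -> float:
    """Position Interpolation: squeeze positions back into the trained range."""
    return pos * trained_len / target_len


def ntk_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware scaling: base' = base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))


dim, scale = 64, 2.0
new_base = ntk_base(10000.0, scale, dim)
original = rope_inv_freq(dim)
scaled = rope_inv_freq(dim, new_base)

print(pi_position(2048, 2048, 4096))   # 1024.0
print(round(new_base))                 # roughly 20452
print(scaled[0] / original[0])         # 1.0: highest frequency untouched
print(scaled[-1] / original[-1])       # about 0.5: lowest frequency divided by scale
```

The exponent `dim / (dim - 2)` is what makes the lowest frequency shrink by exactly the scale factor, matching what PI would do to that dimension, while the highest frequency is left alone.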

Practical Results:

Model trained on 4096, tested on 16384:

Method	PPL (4k)	PPL (8k)	PPL (16k)	Quality
No scaling	12.5	1500+	OOM	Broken
Linear (PI)	12.5	18.2	28.5	Good
NTK	12.5	15.8	22.1	Better
YaRN	12.5	13.8	16.2	Best
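
YaRN's frequency-dependent treatment is commonly described as a ramp: dimensions that complete many rotations within the trained context are left alone, dimensions that complete less than one rotation are interpolated as in PI, and the rest are blended. The sketch below illustrates only that blending step. The function name and the thresholds `alpha=1.0` and `beta=32.0` are assumptions for illustration, and the attention temperature adjustment is omitted.

```python
import math
from typing import List


def blended_inv_freq(
    dim: int,
    scale: float,
    trained_len: int,
    base: float = 10000.0,
    alpha: float = 1.0,    # assumed lower threshold (rotations)
    beta: float = 32.0,    # assumed upper threshold (rotations)
) -> List[float]:
    """Blend 'no change' and 'PI-style division by scale' per frequency."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        wavelength = 2 * math.pi / freq
        rotations = trained_len / wavelength      # full turns inside the trained context
        # gamma = 1 keeps the frequency, gamma = 0 fully interpolates it
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - gamma) * freq / scale + gamma * freq)
    return out


dim, scale, trained_len = 64, 4.0, 4096
plain = [1.0 / (10000.0 ** (2 * i / dim)) for i in range(dim // 2)]
blended = blended_inv_freq(dim, scale, trained_len)

print(blended[0] / plain[0])     # 1.0: fastest dimension unchanged
print(blended[-1] / plain[-1])   # 0.25: slowest dimension divided by scale
```

Intermediate dimensions land between these two extremes, so the ratio moves from 1 down to 1/scale as the frequency decreases.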

3.4 ALiBi: Attention with Linear Biases

Implementation

🐍python

import math
from typing import Optional

import torch
import torch.nn as nn


class ALiBiPositionBias(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    No position encodings in embeddings!
    Instead, add bias to attention scores based on distance.
    """

    def __init__(
        self,
        num_heads: int,
        max_seq_len: int = 2048
    ):
        """
        Initialize ALiBi.

        Args:
            num_heads: Number of attention heads
            max_seq_len: Maximum sequence length for bias precomputation
        """
        super().__init__()

        self.num_heads = num_heads
        self.max_seq_len = max_seq_len

        # Compute slopes for each head
        # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
        slopes = self._compute_slopes(num_heads)
        self.register_buffer('slopes', slopes)

        # Precompute bias matrix
        bias = self._compute_bias(max_seq_len)
        self.register_buffer('bias', bias)

    def _compute_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute slopes for each head.

        Geometric sequence with ratio 2^(-8/num_heads).
        """
        # Slopes for a power-of-two head count n
        def get_slopes_power_of_2(n):
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * ratio ** i for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # For non-power-of-2 heads, start from the closest lower power of 2
            # and fill the rest with every other slope of the next power of 2
            closest_power = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power)

            extra_slopes = get_slopes_power_of_2(2 * closest_power)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes).view(num_heads, 1, 1)

    def _compute_bias(self, seq_len: int) -> torch.Tensor:
        """
        Compute position bias matrix.

        bias[i,j] = -|i-j| (relative distance)
        """
        # Create distance matrix
        positions = torch.arange(seq_len)
        distance = positions.unsqueeze(1) - positions.unsqueeze(0)  # [S, S]
        distance = -torch.abs(distance).float()  # Negative distances

        return distance

    def forward(self, seq_len: int) -> torch.Tensor:
        """
        Get position bias for given sequence length.

        Args:
            seq_len: Current sequence length

        Returns:
            Bias tensor [num_heads, seq_len, seq_len]
        """
        if seq_len > self.max_seq_len:
            # Beyond the precomputed range: build the distance matrix on the fly
            bias = self._compute_bias(seq_len).to(self.slopes.device)
        else:
            # Get bias for current length
            bias = self.bias[:seq_len, :seq_len]

        # Scale by head-specific slopes
        # [num_heads, 1, 1] * [seq_len, seq_len] -> [num_heads, seq_len, seq_len]
        return self.slopes * bias


class ALiBiAttention(nn.Module):
    """
    Multi-head attention with ALiBi position biases.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_seq_len: int = 2048,
        dropout: float = 0.0
    ):
        super().__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # ALiBi position bias
        self.alibi = ALiBiPositionBias(num_heads, max_seq_len)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        is_causal: bool = False
    ) -> torch.Tensor:
        """
        Forward with ALiBi.

        Note: No position encoding added to input!
        Position is only in attention bias.
        """
        batch_size, seq_len, _ = x.shape

        # Project (no position encoding in embeddings!)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape: [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute attention scores
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Add ALiBi bias (this is where position information comes in!)
        alibi_bias = self.alibi(seq_len)
        scores = scores + alibi_bias.unsqueeze(0)  # [1, H, S, S]

        if attention_mask is not None:
            scores = scores + attention_mask

        if is_causal:
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device),
                diagonal=1
            ).bool()
            scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.d_model)

        return self.out_proj(output)
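
The nested exponent in `get_slopes_power_of_2` is hard to read at a glance. As a sanity check, this stdlib-only sketch compares that expression with the closed form 2^(-8h/n) for head h = 1..n, and prints one row of the resulting bias. The names `helper_slopes`, `closed_form_slopes`, and `bias_row` are assumptions made for this example.

```python
import math
from typing import List


def helper_slopes(n: int) -> List[float]:
    """Same expression as get_slopes_power_of_2; n must be a power of two."""
    start = 2 ** (-(2 ** -(math.log2(n) - 3)))
    return [start * start ** i for i in range(n)]


def closed_form_slopes(n: int) -> List[float]:
    """Head h (1-based) gets slope 2^(-8h/n)."""
    return [2 ** (-8 * h / n) for h in range(1, n + 1)]


def bias_row(slope: float, query_pos: int, seq_len: int) -> List[float]:
    """Bias added to one query's scores: -slope * |i - j| for each key j."""
    return [-slope * abs(query_pos - j) for j in range(seq_len)]


for n in (4, 8, 16):
    pairs = zip(helper_slopes(n), closed_form_slopes(n))
    assert all(math.isclose(a, b) for a, b in pairs)

print(closed_form_slopes(8)[:3])   # [0.5, 0.25, 0.125]
print(bias_row(0.5, 3, 5))         # [-1.5, -1.0, -0.5, -0.0, -0.5]
```

With 8 heads the slopes run from 1/2 down to 1/256, so the first head penalizes distance most strongly and the last head least.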

Key Insight:

Each head has a different "attention span":

• Head 0: Largest slope (1/2 with 8 heads) → Short-range attention
• Head 7: Smallest slope (1/256 with 8 heads) → Long-range attention

The bias makes distant tokens less attended to, with the decay rate controlled by the slope.

Token at distance d has bias: -slope × |d|

This creates RELATIVE position encoding without modifying the embeddings at all!
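
To see how the slope sets a head's attention span, consider a toy case where every key has the same content score, so the softmax is driven by the bias alone. The helper names below are hypothetical, and the setup is a simplification rather than a measurement on a trained model.

```python
import math
from typing import List


def attention_over_distance(slope: float, context_len: int) -> List[float]:
    """Softmax weights over distances 0..context_len-1 with equal content scores."""
    logits = [-slope * d for d in range(context_len)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mass_within(weights: List[float], window: int) -> float:
    """Share of attention that falls on the `window` nearest tokens."""
    return sum(weights[:window])


steep = attention_over_distance(0.5, 1024)        # large slope
shallow = attention_over_distance(2 ** -8, 1024)  # small slope

print(mass_within(steep, 16) > 0.99)     # True: almost everything is local
print(mass_within(shallow, 16) < 0.10)   # True: attention is spread widely
```

A large slope concentrates nearly all attention on the closest few tokens, while a small slope lets distant tokens keep a meaningful share.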

Extrapolation:

Because position is encoded as linear bias, ALiBi naturally extrapolates to longer sequences:

• Train on 1024: bias at distance 500 = -slope × 500
• Test on 4096: bias at distance 2000 = -slope × 2000

The relationship remains linear, so it just works!
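
A rough way to see why longer inputs disturb ALiBi so little is to ask how much of a head's attention stays inside a fixed local window when the context grows. The sketch below again assumes equal content scores, and `local_mass` is a hypothetical helper. It is a toy illustration, not evidence about model quality.

```python
import math


def local_mass(slope: float, context_len: int, window: int) -> float:
    """Share of softmax mass on the `window` nearest tokens, equal content scores."""
    weights = [math.exp(-slope * d) for d in range(context_len)]
    return sum(weights[:window]) / sum(weights)


for slope in (0.5, 2 ** -4, 2 ** -8):
    at_train = local_mass(slope, 1024, 256)   # length seen in training
    at_test = local_mass(slope, 4096, 256)    # four times longer
    print(slope, round(at_train, 4), round(at_test, 4))
```

Heads with larger slopes see essentially the same distribution at both lengths, because tokens beyond the trained range receive a bias so negative that they contribute almost nothing. Only the heads with the smallest slopes shift part of their attention to the newly available distant tokens.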

3.5 Comparison and When to Use Each

Summary

📝text

SINUSOIDAL/LEARNED ABSOLUTE:
────────────────────────────
✓ Use when: Fixed sequence length, simplicity needed
✗ Avoid when: Need length extrapolation
Examples: BERT, GPT-2, original Transformer


ROTARY POSITION EMBEDDINGS (RoPE):
──────────────────────────────────
✓ Use when:
  • Building modern LLMs
  • Need good length extension
  • Want relative position benefits
  • Using grouped-query attention

✗ Consider alternatives when:
  • Need absolute position info
  • Very long sequences (>32k) without extension

Examples: LLaMA, Mistral, Qwen, PaLM, Falcon
Extensions: PI, NTK, YaRN for 100k+ context


ALiBi:
──────
✓ Use when:
  • Need excellent extrapolation out-of-box
  • Simplicity is important
  • Training long-context models

✗ Consider alternatives when:
  • Need position info in embeddings
  • Doing retrieval with embeddings
  • Pre-trained model uses RoPE

Examples: BLOOM, MPT

Recommendation for New Projects:

Short context (≤4k):
• RoPE (most widely used, good ecosystem)
• ALiBi (simpler, good extrapolation)

Long context (4k-32k):
• RoPE + YaRN scaling
• ALiBi

Very long context (32k+):
• RoPE + YaRN or Dynamic NTK
• ALiBi
• Consider sparse attention too

Implementation Checklist:

RoPE:
☐ Apply to Q and K only (not V)
☐ Cache sin/cos for efficiency
☐ Consider grouped-query attention
☐ Plan for length extension if needed

ALiBi:
☐ Remove position embeddings from input
☐ Compute slopes correctly for head count
☐ Add bias BEFORE softmax
☐ Works with causal and bidirectional

Quick Comparison Table:

Aspect	Sinusoidal	RoPE	ALiBi
Parameters	None	None	None
Position in	Embedding	Q,K rotation	Attn bias
Relative pos	Implicit	Yes	Yes
Extrapolation	Poor	Good*	Excellent
Complexity	Low	Medium	Low
Modern usage	Legacy	Dominant	Popular

* With extensions (PI, NTK, YaRN)

Summary:

Method	How It Works	Best For
Sinusoidal	Add fixed patterns to embeddings	Legacy, simple models
RoPE	Rotate Q,K by position angle	Modern LLMs, good balance
ALiBi	Linear bias in attention	Long context, extrapolation

Implementation Notes:

RoPE:
• Apply to Q and K, not V
• Use half-rotation trick for efficiency
• Cache sin/cos values
• Consider YaRN for extension

ALiBi:
• No position in embeddings
• Geometric slopes per head
• Add bias before softmax
• Works out-of-box for long sequences

Exercises:

1. Implement RoPE and verify the relative position property.
2. Compare attention patterns with sinusoidal, RoPE, and ALiBi.
3. Test extrapolation: train on 512, evaluate on 1024, 2048.
4. Implement YaRN scaling and compare with linear interpolation.
5. Build a transformer that uses RoPE for encoder and ALiBi for decoder.

Next Chapter: In Chapter 17, we'll cover production deployment including model optimization, quantization, and serving infrastructure.