Chapter 4

Modern Positional Encodings

Positional Encoding and Embeddings

Introduction

While sinusoidal positional encoding from the original Transformer paper works well, researchers have developed numerous alternatives that address its limitations—particularly for long context lengths and length extrapolation.

This section explores the landscape of modern positional encodings, from the widely-adopted Rotary Position Embedding (RoPE) used in LLaMA and Gemma, to ALiBi used in BLOOM, and beyond.

Key Insight: Modern positional encodings focus on encoding relative positions rather than absolute positions, enabling better generalization to sequence lengths not seen during training.

6.1 Taxonomy of Positional Encodings

Positional encodings can be categorized along several dimensions:

By Position Type

| Type | Description | Examples |
|---|---|---|
| Absolute | Each position gets a unique encoding | Sinusoidal, Learned |
| Relative | Encodes distance between tokens | T5 RPE, RoPE, ALiBi |

By Learning Method

| Type | Description | Examples |
|---|---|---|
| Fixed/Deterministic | Computed using mathematical functions | Sinusoidal, ALiBi |
| Learned | Trained as model parameters | BERT, GPT-2, T5 bias |
| Hybrid | Fixed structure with learnable components | RoPE (fixed), FIRE (learned) |

By Integration Point

| Type | Description | Examples |
|---|---|---|
| Input Addition | Added to token embeddings | Sinusoidal, Learned |
| Attention Modification | Applied within attention computation | RoPE, ALiBi, T5 RPE |
| None (Implicit) | No explicit PE, relies on causal mask | NoPE |

6.2 Rotary Position Embedding (RoPE)

RoPE is one of the most widely adopted positional encodings in modern LLMs, used in LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, and many others.

Core Idea

Instead of adding positional information to embeddings, RoPE rotates the query and key vectors based on their position. The rotation angle depends on both the position and the dimension.

Key Property: When computing attention between positions $m$ and $n$, the dot product $q_m^T k_n$ depends only on the relative distance $m - n$, not the absolute positions.

Mathematical Formulation

For a 2D case, the rotation is applied as:

$$R_\theta(x, m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

For higher dimensions, RoPE applies rotations to pairs of dimensions:

$$\theta_i = 10000^{-2i/d}$$

The rotated query at position $m$ becomes:

$$\tilde{q}_m = R_{\Theta, m} \, q_m$$

Where $R_{\Theta, m}$ is a block-diagonal rotation matrix.

Why Rotation Works

The key insight is that the dot product of two rotated vectors satisfies:

$$\tilde{q}_m^T \tilde{k}_n = q_m^T R_{\Theta, n-m} \, k_n$$

This means the attention score depends only on the relative position $m - n$!
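This property is easy to check numerically in the 2-D case. The snippet below (a self-contained sketch, independent of the PyTorch code later in this section) rotates a toy query and key and shows that the dot product is unchanged when both positions shift by the same offset:

```python
import math

def rotate2d(x, m, theta):
    # Rotate a 2-D vector x by angle m * theta (the 2-D RoPE case)
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k, theta = (1.0, 2.0), (3.0, -1.0), 0.1

# Positions (7, 3) and (12, 8) have the same relative distance m - n = 4,
# so the rotated dot products agree up to floating-point rounding.
s1 = dot(rotate2d(q, 7, theta), rotate2d(k, 3, theta))
s2 = dot(rotate2d(q, 12, theta), rotate2d(k, 8, theta))
print(abs(s1 - s2) < 1e-9)  # True
```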

PyTorch Implementation

```python
import torch
import torch.nn as nn


class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (RoPE) from the RoFormer paper.

    Used in: LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, Qwen, etc.
    """

    def __init__(self, dim: int, max_seq_len: int = 4096, base: int = 10000):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Compute inverse frequencies: theta_i = base^(-2i/d)
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Precompute cos and sin
        self._build_cache(max_seq_len)

    def _build_cache(self, seq_len: int):
        """Precompute cos and sin for positions."""
        positions = torch.arange(seq_len, dtype=self.inv_freq.dtype)
        freqs = torch.outer(positions, self.inv_freq)  # [seq_len, dim/2]

        # Create [seq_len, dim] by concatenating the two halves
        emb = torch.cat([freqs, freqs], dim=-1)

        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())
        self.max_seq_len = seq_len  # remember the cached length

    def forward(self, x: torch.Tensor, seq_len: int) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Returns cos and sin for the given sequence length.

        Args:
            x: Input tensor (for device/dtype)
            seq_len: Current sequence length

        Returns:
            (cos, sin) each of shape [seq_len, dim]
        """
        if seq_len > self.max_seq_len:
            self._build_cache(seq_len)

        return (
            self.cos_cached[:seq_len].to(x.dtype),
            self.sin_cached[:seq_len].to(x.dtype)
        )


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Apply rotary position embedding to query and key tensors.

    Args:
        q: Query tensor [batch, heads, seq_len, head_dim]
        k: Key tensor [batch, heads, seq_len, head_dim]
        cos: Cosine values [seq_len, head_dim]
        sin: Sine values [seq_len, head_dim]

    Returns:
        Rotated (q, k) tensors
    """
    # Reshape cos/sin for broadcasting
    cos = cos.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, dim]
    sin = sin.unsqueeze(0).unsqueeze(0)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)

    return q_embed, k_embed


# Example usage
def test_rope():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    rope = RotaryPositionalEmbedding(dim=head_dim)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)

    cos, sin = rope(q, seq_len)
    q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)

    print(f"Original Q shape: {q.shape}")
    print(f"Rotated Q shape: {q_rot.shape}")
    print(f"Cos cache shape: {cos.shape}")

test_rope()
```

RoPE Variants for Long Context

Several extensions improve RoPE's length extrapolation:

  • Position Interpolation (PI): Scales positions by a factor when handling longer sequences than training length.
  • NTK-Aware Scaling: Adjusts the base frequency for better extrapolation.
  • YaRN: Combines multiple scaling strategies for optimal long-context performance.
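The first two ideas fit in a few lines. This sketch uses the commonly cited closed forms (the exact scaling rules are assumptions here; production implementations such as YaRN add further per-frequency corrections):

```python
def interpolated_position(pos: float, scale: float) -> float:
    # Position Interpolation (PI): squeeze positions back into the
    # trained range, e.g. scale = target_len / train_len.
    return pos / scale

def ntk_scaled_base(base: float, dim: int, scale: float) -> float:
    # NTK-aware scaling: instead of shrinking positions, enlarge the
    # RoPE base so low frequencies slow down while high frequencies
    # stay nearly intact. Commonly written as base * scale**(d/(d-2)).
    return base * scale ** (dim / (dim - 2))

print(interpolated_position(8192, 4.0))  # 2048.0
print(ntk_scaled_base(10000.0, 128, 4.0))
```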

6.3 ALiBi (Attention with Linear Biases)

ALiBi takes a radically different approach: instead of encoding positions in representations, it directly biases attention scores based on key-query distance.

Core Idea

ALiBi adds a linear penalty to attention scores based on the distance between tokens. Tokens that are farther apart receive lower attention scores.

Key Insight: ALiBi doesn't modify embeddings at all. It only adds a position-dependent bias to the attention scores before softmax.

Mathematical Formulation

The attention score becomes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} + m \cdot \text{dist} \right) V$$

Where $m$ is a head-specific slope and the distance term is:

$$\text{dist}[i, j] = -|i - j|$$

Multi-Head Slopes

Different attention heads use different slopes, creating diverse attention patterns. ALiBi assigns slopes using a geometric sequence:

$$m_h = 2^{-8/n} \cdot 2^{-8h/n}$$

Where $n$ is the number of heads and $h \in \{0, \dots, n-1\}$ is the head index.
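For example, with $n = 8$ heads the formula reduces to $m_h = 2^{-(h+1)}$, a geometric sequence from 1/2 down to 1/256:

```python
# ALiBi slopes from the closed form m_h = 2^(-8(h+1)/n), for n = 8 heads
n = 8
slopes = [2 ** (-8 * (h + 1) / n) for h in range(n)]
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```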

PyTorch Implementation

```python
import torch
import torch.nn as nn
import math


class ALiBiAttention(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    Used in: BLOOM, MPT, and other models optimized for long contexts.
    """

    def __init__(self, num_heads: int, max_seq_len: int = 4096):
        super().__init__()
        self.num_heads = num_heads

        # Compute slopes for each head (geometric sequence)
        slopes = self._get_alibi_slopes(num_heads)
        self.register_buffer("slopes", slopes)

        # Precompute bias matrix
        bias = self._build_alibi_bias(max_seq_len)
        self.register_buffer("alibi_bias", bias)

    def _get_alibi_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute ALiBi slopes for each attention head.

        The slopes follow a geometric sequence.
        """
        def get_slopes_power_of_2(n: int) -> list:
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * (ratio ** i) for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # Handle non-power-of-2 head counts by interleaving slopes
            # from the next power of 2
            closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power_of_2)

            extra_slopes = get_slopes_power_of_2(2 * closest_power_of_2)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power_of_2]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes, dtype=torch.float32)

    def _build_alibi_bias(self, seq_len: int) -> torch.Tensor:
        """
        Build the ALiBi bias matrix.

        bias[h, i, j] = -|i - j| * slope[h]
        """
        # Create distance matrix
        positions = torch.arange(seq_len)
        distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
        distance = -torch.abs(distance).float()  # Negative distance

        # Apply slopes for each head: [num_heads, seq, seq]
        bias = distance.unsqueeze(0) * self.slopes.view(-1, 1, 1)

        return bias

    def get_bias(self, seq_len: int) -> torch.Tensor:
        """Get ALiBi bias for the given sequence length."""
        if seq_len > self.alibi_bias.size(1):
            self.alibi_bias = self._build_alibi_bias(seq_len)

        return self.alibi_bias[:, :seq_len, :seq_len]

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: torch.Tensor | None = None
    ) -> torch.Tensor:
        """
        Compute attention with ALiBi bias.

        Args:
            query: [batch, heads, seq_len, head_dim]
            key: [batch, heads, seq_len, head_dim]
            value: [batch, heads, seq_len, head_dim]
            attention_mask: Optional additive mask

        Returns:
            Output tensor [batch, heads, seq_len, head_dim]
        """
        batch_size, num_heads, seq_len, head_dim = query.shape

        # Compute scaled attention scores
        scale = 1.0 / math.sqrt(head_dim)
        attn_scores = torch.matmul(query, key.transpose(-2, -1)) * scale

        # Add ALiBi bias (broadcast over batch): [batch, heads, seq, seq]
        alibi_bias = self.get_bias(seq_len)
        attn_scores = attn_scores + alibi_bias.unsqueeze(0)

        # Apply attention mask if provided
        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask

        # Softmax and weighted sum
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, value)

        return output


# Example usage
def test_alibi():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    alibi_attn = ALiBiAttention(num_heads=num_heads)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)
    v = torch.randn(batch_size, num_heads, seq_len, head_dim)

    output = alibi_attn(q, k, v)

    print(f"ALiBi slopes: {alibi_attn.slopes}")
    print(f"Output shape: {output.shape}")
    print(f"ALiBi bias shape: {alibi_attn.alibi_bias.shape}")

test_alibi()
```

ALiBi Advantages

  • Excellent Extrapolation: Models trained on 512 tokens can handle 3072+ tokens.
  • No Additional Parameters: Bias is computed, not learned.
  • Simple Implementation: Just add a bias matrix to attention scores.
  • Computational Efficiency: Reduces training time compared to learned embeddings.

6.4 Relative Positional Encodings

Relative positional encodings focus on the distance between tokens rather than their absolute positions.

T5 Relative Position Bias

T5 simplified relative position encoding by using bucketed biases. Distances are grouped into buckets, and each bucket has a learnable bias.

```python
import math

import torch
import torch.nn as nn


class T5RelativePositionBias(nn.Module):
    """
    T5-style relative position bias with bucketing.

    Used in: T5, mT5, FLAN-T5
    """

    def __init__(
        self,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        bidirectional: bool = True
    ):
        super().__init__()
        self.num_heads = num_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bidirectional = bidirectional

        # Learnable bias for each bucket and head
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    def _relative_position_bucket(
        self,
        relative_position: torch.Tensor
    ) -> torch.Tensor:
        """
        Map relative positions to bucket indices.

        Uses logarithmic bucketing for larger distances.
        """
        num_buckets = self.num_buckets
        max_distance = self.max_distance

        relative_buckets = 0

        if self.bidirectional:
            num_buckets //= 2
            # Separate buckets for positive and negative positions
            relative_buckets += (relative_position > 0).long() * num_buckets
            relative_position = torch.abs(relative_position)
        else:
            relative_position = -torch.min(
                relative_position,
                torch.zeros_like(relative_position)
            )

        # Half the buckets for exact positions, half for log-spaced ones
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # Log bucketing for larger distances
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()

        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += torch.where(
            is_small,
            relative_position,
            relative_position_if_large
        )

        return relative_buckets

    def forward(self, query_length: int, key_length: int) -> torch.Tensor:
        """
        Compute relative position bias matrix.

        Returns:
            Bias tensor [1, num_heads, query_length, key_length]
        """
        device = self.relative_attention_bias.weight.device

        # Create position indices
        query_positions = torch.arange(query_length, device=device)
        key_positions = torch.arange(key_length, device=device)

        # Compute relative positions
        relative_positions = key_positions.unsqueeze(0) - query_positions.unsqueeze(1)

        # Map to buckets
        relative_buckets = self._relative_position_bucket(relative_positions)

        # Look up biases
        values = self.relative_attention_bias(relative_buckets)  # [q, k, heads]
        values = values.permute(2, 0, 1).unsqueeze(0)  # [1, heads, q, k]

        return values
```
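To see what the bucketing does, here is a scalar re-implementation of the unidirectional case with 16 buckets (a simplified sketch mirroring the logic above, not the T5 source): exact buckets for small distances, log-spaced buckets beyond `max_exact`, and a clamp at the last bucket.

```python
import math

def bucket(distance: int, num_buckets: int = 16, max_distance: int = 128) -> int:
    max_exact = num_buckets // 2
    d = abs(distance)
    if d < max_exact:
        return d  # one exact bucket per small distance
    # Logarithmically spaced buckets between max_exact and max_distance
    b = max_exact + int(
        math.log(d / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(b, num_buckets - 1)  # clamp distances beyond max_distance

print([bucket(d) for d in [0, 1, 7, 8, 12, 40, 200]])
# [0, 1, 7, 8, 9, 12, 15]
```

Note how resolution degrades gracefully: nearby tokens get distinct buckets, while distances past `max_distance` all share the final bucket.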

Transformer-XL Relative Encoding

Transformer-XL introduced relative position encoding with a recurrence mechanism, enabling very long context through segment-level recurrence.


6.5 Learned Positional Embeddings

Many models simply learn positional embeddings as trainable parameters, just like word embeddings.

```python
import torch
import torch.nn as nn


class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings (used in BERT, GPT-2).

    Each position has a learnable embedding vector.
    """

    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_seq_len, d_model)
        self.max_seq_len = max_seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Tensor with positional embeddings added
        """
        seq_len = x.size(1)

        if seq_len > self.max_seq_len:
            raise ValueError(
                f"Sequence length {seq_len} exceeds maximum {self.max_seq_len}"
            )

        positions = torch.arange(seq_len, device=x.device)
        pos_embeddings = self.embedding(positions)  # [seq_len, d_model]

        return x + pos_embeddings.unsqueeze(0)  # Broadcast over batch
```

| Model | Position Embedding Type | Max Length |
|---|---|---|
| BERT | Learned absolute | 512 |
| GPT-2 | Learned absolute | 1024 |
| GPT-3 | Learned absolute | 2048 |
| RoBERTa | Learned absolute | 512 |

6.6 Advanced Methods

xPos (Extrapolatable Position Embeddings)

xPos extends RoPE with an ALiBi-like decay factor, improving length extrapolation.

```python
import torch

# xPos applies a per-frequency exponential decay on top of RoPE's rotation:
# queries are scaled by scale**m and keys by scale**(-n), so attention
# scores pick up a factor scale**(m - n) that decays with relative distance.

class XPosEmbedding(RotaryPositionalEmbedding):
    """xPos: RoPE with exponential decay for better extrapolation (sketch)."""

    def __init__(self, dim: int, base: int = 10000, gamma: float = 0.4):
        super().__init__(dim, base=base)
        self.gamma = gamma

        # Per-frequency decay factors in (0, 1]:
        # scale_i = (2i/d + gamma) / (1 + gamma)
        scale = (torch.arange(0, dim, 2).float() / dim + gamma) / (1 + gamma)
        self.register_buffer("scale", scale)
```
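A quick look at the decay factors, assuming the parameterization $\zeta_i = (2i/d + \gamma)/(1 + \gamma)$ with the paper's default $\gamma = 0.4$ (this closed form is an assumption of the sketch; implementations differ in detail):

```python
# Per-frequency xPos decay factors (sketch):
# scale_i = (2i/d + gamma) / (1 + gamma), gamma = 0.4
gamma, dim = 0.4, 64
scales = [(2 * i / dim + gamma) / (1 + gamma) for i in range(dim // 2)]

# High-frequency pairs (small i) decay fastest across distance;
# the lowest-frequency pair barely decays at all.
print(round(scales[0], 4), round(scales[-1], 4))  # 0.2857 0.9777
```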

FIRE (Functional Interpolation for Relative Position Encoding)

FIRE uses progressive interpolation: the distance between tokens is normalized by the query position, so the normalized distance is always bounded in $[0, 1]$.

  • Can theoretically represent T5's RPE, ALiBi, and KERPLE as special cases
  • Better length extrapolation through position normalization
  • Learnable interpolation function
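The normalization can be sketched in a few lines. Here $\psi$ is a log transform, $\psi(x) = \log(cx + 1)$; the exact form and $c = 1$ are assumptions of this sketch, and in FIRE the normalized value feeds a small learned MLP that produces the attention bias:

```python
import math

def fire_input(i: int, j: int, c: float = 1.0) -> float:
    # Normalized distance psi(i - j) / psi(i): bounded in [0, 1]
    # for any causal pair j <= i, regardless of sequence length.
    psi = lambda x: math.log(c * x + 1.0)
    return psi(i - j) / psi(i)

print([round(fire_input(64, j), 3) for j in (0, 32, 63)])  # [1.0, 0.838, 0.166]
```

Because the input stays bounded, the learned bias function never sees out-of-range values at inference time, which is what drives FIRE's extrapolation.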

NoPE (No Positional Encoding)

Surprisingly, some research shows that decoder-only Transformers can work without explicit positional encoding, relying on causal attention masks to implicitly encode position.

Research Finding: NoPE can match or outperform explicit positional encoding methods on some length-generalization benchmarks while requiring no additional computation. When trained with SGD, NoPE's attention patterns mostly resemble those of T5's relative PE.

6.7 Comparison and Trade-offs

Feature Comparison

| Method | Type | Params | Extrapolation | Compute Cost |
|---|---|---|---|---|
| Sinusoidal | Absolute | None | Poor | Low |
| Learned | Absolute | O(L×d) | None | Low |
| RoPE | Relative | None | Good* | Medium |
| ALiBi | Relative | None | Excellent | Low |
| T5 RPE | Relative | O(B×H) | Good | Low |
| xPos | Relative | None | Excellent | Medium |
| FIRE | Relative | Learned | Excellent | Medium |

*RoPE extrapolation improves with PI/NTK/YaRN extensions

Extrapolation Performance

When models need to handle sequences longer than their training length:

  • Best: ALiBi, FIRE, xPos
  • Good: T5 RPE, RoPE (with extensions)
  • Limited: Sinusoidal, Vanilla RoPE
  • None: Learned embeddings (hard cutoff)

Key Similarities: RoPE vs ALiBi

  • Both focus on relative positions, not absolute
  • Both modify attention weights rather than adding to embeddings
  • Both have recency bias (farther tokens receive less attention)
  • Both are deterministic (no learned parameters for positions)

Key Differences: RoPE vs ALiBi

| Aspect | RoPE | ALiBi |
|---|---|---|
| Mechanism | Rotates Q/K vectors | Adds bias to attention scores |
| Math | Trigonometric (rotation) | Linear penalty |
| Where Applied | Before dot product | After dot product |
| Head Variation | Same rotation for all heads | Different slopes per head |
| Extensions | PI, NTK, YaRN | Generally not needed |

6.8 Which Models Use What?

| Model | Positional Encoding | Notes |
|---|---|---|
| GPT-2/3 | Learned Absolute | Limited to training length |
| BERT | Learned Absolute | 512 token limit |
| T5 | Relative (Bucketed) | Good extrapolation |
| LLaMA 1/2/3 | RoPE | With extensions for long context |
| Mistral | RoPE | Sliding window attention |
| Gemma | RoPE | Google's open model |
| Qwen | RoPE | With NTK-aware scaling |
| BLOOM | ALiBi | 176B multilingual model |
| MPT | ALiBi | MosaicML's model |
| Falcon | RoPE + ALiBi | Hybrid approach |
| Claude | Unknown | Anthropic's models |
| GPT-4 | Unknown | OpenAI's latest |

Trend: Modern LLMs (2023-2024) predominantly use RoPE, especially for models that need long context support. ALiBi remains popular for its simplicity and excellent extrapolation without extensions.

6.9 Implementation Guide

Choosing the Right Encoding

Use RoPE when:

  • Building a general-purpose LLM
  • You need good relative position understanding
  • Long context is important (with extensions)
  • Following LLaMA-style architectures

Use ALiBi when:

  • Length extrapolation is critical
  • You want simplicity (no extensions needed)
  • Training on shorter sequences, deploying on longer
  • Computational efficiency is important

Use Learned embeddings when:

  • Sequence length is fixed and known
  • Simplicity is preferred over flexibility
  • Following BERT-style architectures

Use Sinusoidal when:

  • Following the original Transformer paper
  • Building encoder-decoder models
  • Simplicity and determinism are needed

Summary

Key Takeaways

  • Sinusoidal (Original): Deterministic, simple, but poor extrapolation
  • Learned: Flexible but limited to training length
  • RoPE: Rotates embeddings, widely adopted in modern LLMs (LLaMA, Mistral)
  • ALiBi: Linear bias on attention, excellent extrapolation (BLOOM, MPT)
  • T5 RPE: Bucketed relative biases, good balance of simplicity and performance
  • xPos/FIRE: Advanced extensions for even better extrapolation

The Trend

Modern LLMs are moving toward relative positional encodings that modify attention rather than embeddings. RoPE has become the de facto standard for most open-source LLMs, while ALiBi offers a simpler alternative with excellent extrapolation properties.

Remember: The best positional encoding depends on your use case. For most modern applications, RoPE or ALiBi will outperform sinusoidal and learned embeddings, especially for long-context scenarios.

References

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
  • Train Short, Test Long: Attention with Linear Biases (Press et al., 2022)
  • Exploring Length Generalization in Large Language Models (2023)
  • FIRE: Functional Interpolation for Relative Position Encoding (2023)

Next Steps

Now that you understand the landscape of positional encodings, you're ready to move on to Subword Tokenization in the next chapter, where we'll learn how to build vocabularies using Byte Pair Encoding (BPE).