Chapter 4

Modern Positional Encodings

Positional Encoding and Embeddings

Introduction

While sinusoidal positional encoding from the original Transformer paper works well, researchers have developed numerous alternatives that address its limitations—particularly for long context lengths and length extrapolation.

This section explores the landscape of modern positional encodings, from the widely-adopted Rotary Position Embedding (RoPE) used in LLaMA and Gemma, to ALiBi used in BLOOM, and beyond.

Key Insight: Modern positional encodings focus on encoding relative positions rather than absolute positions, enabling better generalization to sequence lengths not seen during training.

6.1 Taxonomy of Positional Encodings

Positional encodings can be categorized along several dimensions:

By Position Type

| Type | Description | Examples |
|---|---|---|
| Absolute | Each position gets a unique encoding | Sinusoidal, Learned |
| Relative | Encodes distance between tokens | T5 RPE, RoPE, ALiBi |

By Learning Method

| Type | Description | Examples |
|---|---|---|
| Fixed/Deterministic | Computed using mathematical functions | Sinusoidal, ALiBi |
| Learned | Trained as model parameters | BERT, GPT-2, T5 bias |
| Hybrid | Fixed structure with learnable components | RoPE (fixed), FIRE (learned) |

By Integration Point

| Type | Description | Examples |
|---|---|---|
| Input Addition | Added to token embeddings | Sinusoidal, Learned |
| Attention Modification | Applied within attention computation | RoPE, ALiBi, T5 RPE |
| None (Implicit) | No explicit PE, relies on causal mask | NoPE |

6.2 Rotary Position Embedding (RoPE)

RoPE is one of the most widely adopted positional encodings in modern LLMs, used in LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, and many others.

Core Idea

Instead of adding positional information to embeddings, RoPE rotates the query and key vectors based on their position. The rotation angle depends on both the position and the dimension.

Key Property: When computing attention between positions $m$ and $n$, the dot product $q_m^T k_n$ depends only on the relative distance $m - n$, not the absolute positions.

Mathematical Formulation

For a 2D case, the rotation is applied as:

$$R_\theta(x, m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

For higher dimensions, RoPE applies rotations to pairs of dimensions:

$$\theta_i = 10000^{-2i/d}$$

The rotated query at position $m$ becomes:

$$\tilde{q}_m = R_{\Theta, m} \, q_m$$

Where $R_{\Theta, m}$ is a block-diagonal rotation matrix.

Why Rotation Works

The key insight is that the dot product of two rotated vectors satisfies:

$$\tilde{q}_m^T \tilde{k}_n = q_m^T R_{\Theta, n-m} \, k_n$$

This means the attention score depends only on the relative position $m - n$!
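This property is easy to check numerically in the 2-D case. The snippet below (a self-contained sketch, independent of the PyTorch code later in this section) rotates a toy query and key and shows that the dot product is unchanged when both positions shift by the same offset:

```python
import math

def rotate2d(x, m, theta):
    # Rotate a 2-D vector x by angle m * theta (the 2-D RoPE case)
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k, theta = (1.0, 2.0), (3.0, -1.0), 0.1

# Positions (7, 3) and (12, 8) have the same relative distance m - n = 4,
# so the rotated dot products agree up to floating-point rounding.
s1 = dot(rotate2d(q, 7, theta), rotate2d(k, 3, theta))
s2 = dot(rotate2d(q, 12, theta), rotate2d(k, 8, theta))
print(abs(s1 - s2) < 1e-9)  # True
```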

PyTorch Implementation

```python
import torch
import torch.nn as nn


class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (RoPE) from the RoFormer paper.

    Used in: LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, Qwen, etc.
    """

    def __init__(self, dim: int, max_seq_len: int = 4096, base: int = 10000):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base

        # Compute inverse frequencies: theta_i = base^(-2i/d)
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Precompute cos and sin
        self._build_cache(max_seq_len)

    def _build_cache(self, seq_len: int):
        """Precompute cos and sin for positions."""
        positions = torch.arange(seq_len, dtype=self.inv_freq.dtype)
        freqs = torch.outer(positions, self.inv_freq)  # [seq_len, dim/2]

        # Create [seq_len, dim] by concatenating the two halves
        emb = torch.cat([freqs, freqs], dim=-1)

        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())
        self.max_seq_len = seq_len  # remember the cached length

    def forward(self, x: torch.Tensor, seq_len: int) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Returns cos and sin for the given sequence length.

        Args:
            x: Input tensor (for device/dtype)
            seq_len: Current sequence length

        Returns:
            (cos, sin) each of shape [seq_len, dim]
        """
        if seq_len > self.max_seq_len:
            self._build_cache(seq_len)

        return (
            self.cos_cached[:seq_len].to(x.dtype),
            self.sin_cached[:seq_len].to(x.dtype)
        )


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)


def apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Apply rotary position embedding to query and key tensors.

    Args:
        q: Query tensor [batch, heads, seq_len, head_dim]
        k: Key tensor [batch, heads, seq_len, head_dim]
        cos: Cosine values [seq_len, head_dim]
        sin: Sine values [seq_len, head_dim]

    Returns:
        Rotated (q, k) tensors
    """
    # Reshape cos/sin for broadcasting
    cos = cos.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, dim]
    sin = sin.unsqueeze(0).unsqueeze(0)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)

    return q_embed, k_embed


# Example usage
def test_rope():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    rope = RotaryPositionalEmbedding(dim=head_dim)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)

    cos, sin = rope(q, seq_len)
    q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)

    print(f"Original Q shape: {q.shape}")
    print(f"Rotated Q shape: {q_rot.shape}")
    print(f"Cos cache shape: {cos.shape}")

test_rope()
```

RoPE Variants for Long Context

Several extensions improve RoPE's length extrapolation:

  • Position Interpolation (PI): Scales positions by a factor when handling longer sequences than training length.
  • NTK-Aware Scaling: Adjusts the base frequency for better extrapolation.
  • YaRN: Combines multiple scaling strategies for optimal long-context performance.
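The first two ideas fit in a few lines. This sketch uses the commonly cited closed forms (the exact scaling rules are assumptions here; production implementations such as YaRN add further per-frequency corrections):

```python
def interpolated_position(pos: float, scale: float) -> float:
    # Position Interpolation (PI): squeeze positions back into the
    # trained range, e.g. scale = target_len / train_len.
    return pos / scale

def ntk_scaled_base(base: float, dim: int, scale: float) -> float:
    # NTK-aware scaling: instead of shrinking positions, enlarge the
    # RoPE base so low frequencies slow down while high frequencies
    # stay nearly intact. Commonly written as base * scale**(d/(d-2)).
    return base * scale ** (dim / (dim - 2))

print(interpolated_position(8192, 4.0))  # 2048.0
print(ntk_scaled_base(10000.0, 128, 4.0))
```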

6.3 ALiBi (Attention with Linear Biases)

ALiBi takes a radically different approach: instead of encoding positions in representations, it directly biases attention scores based on key-query distance.

Core Idea

ALiBi adds a linear penalty to attention scores based on the distance between tokens. Tokens that are farther apart receive lower attention scores.

Key Insight: ALiBi doesn't modify embeddings at all. It only adds a position-dependent bias to the attention scores before softmax.

Mathematical Formulation

The attention score becomes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} + m \cdot \text{dist} \right) V$$

Where $m$ is a head-specific slope and the distance term is:

$$\text{dist}[i, j] = -|i - j|$$

Multi-Head Slopes

Different attention heads use different slopes, creating diverse attention patterns. ALiBi assigns slopes using a geometric sequence:

$$m_h = 2^{-8/n} \cdot 2^{-8h/n}$$

Where $n$ is the number of heads and $h \in \{0, \dots, n-1\}$ is the head index.
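For example, with $n = 8$ heads the formula reduces to $m_h = 2^{-(h+1)}$, a geometric sequence from 1/2 down to 1/256:

```python
# ALiBi slopes from the closed form m_h = 2^(-8(h+1)/n), for n = 8 heads
n = 8
slopes = [2 ** (-8 * (h + 1) / n) for h in range(n)]
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```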

PyTorch Implementation

```python
import torch
import torch.nn as nn
import math


class ALiBiAttention(nn.Module):
    """
    Attention with Linear Biases (ALiBi).

    Used in: BLOOM, MPT, and other models optimized for long contexts.
    """

    def __init__(self, num_heads: int, max_seq_len: int = 4096):
        super().__init__()
        self.num_heads = num_heads

        # Compute slopes for each head (geometric sequence)
        slopes = self._get_alibi_slopes(num_heads)
        self.register_buffer("slopes", slopes)

        # Precompute bias matrix
        bias = self._build_alibi_bias(max_seq_len)
        self.register_buffer("alibi_bias", bias)

    def _get_alibi_slopes(self, num_heads: int) -> torch.Tensor:
        """
        Compute ALiBi slopes for each attention head.

        The slopes follow a geometric sequence.
        """
        def get_slopes_power_of_2(n: int) -> list:
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * (ratio ** i) for i in range(n)]

        if math.log2(num_heads).is_integer():
            slopes = get_slopes_power_of_2(num_heads)
        else:
            # Handle non-power-of-2 head counts by interleaving slopes
            # from the next power of 2
            closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
            slopes = get_slopes_power_of_2(closest_power_of_2)

            extra_slopes = get_slopes_power_of_2(2 * closest_power_of_2)
            extra_slopes = extra_slopes[0::2][:num_heads - closest_power_of_2]
            slopes = slopes + extra_slopes

        return torch.tensor(slopes, dtype=torch.float32)

    def _build_alibi_bias(self, seq_len: int) -> torch.Tensor:
        """
        Build the ALiBi bias matrix.

        bias[h, i, j] = -|i - j| * slope[h]
        """
        # Create distance matrix
        positions = torch.arange(seq_len)
        distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
        distance = -torch.abs(distance).float()  # Negative distance

        # Apply slopes for each head: [num_heads, seq, seq]
        bias = distance.unsqueeze(0) * self.slopes.view(-1, 1, 1)

        return bias

    def get_bias(self, seq_len: int) -> torch.Tensor:
        """Get ALiBi bias for the given sequence length."""
        if seq_len > self.alibi_bias.size(1):
            self.alibi_bias = self._build_alibi_bias(seq_len)

        return self.alibi_bias[:, :seq_len, :seq_len]

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: torch.Tensor | None = None
    ) -> torch.Tensor:
        """
        Compute attention with ALiBi bias.

        Args:
            query: [batch, heads, seq_len, head_dim]
            key: [batch, heads, seq_len, head_dim]
            value: [batch, heads, seq_len, head_dim]
            attention_mask: Optional additive mask

        Returns:
            Output tensor [batch, heads, seq_len, head_dim]
        """
        batch_size, num_heads, seq_len, head_dim = query.shape

        # Compute scaled attention scores
        scale = 1.0 / math.sqrt(head_dim)
        attn_scores = torch.matmul(query, key.transpose(-2, -1)) * scale

        # Add ALiBi bias (broadcast over batch): [batch, heads, seq, seq]
        alibi_bias = self.get_bias(seq_len)
        attn_scores = attn_scores + alibi_bias.unsqueeze(0)

        # Apply attention mask if provided
        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask

        # Softmax and weighted sum
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, value)

        return output


# Example usage
def test_alibi():
    batch_size = 2
    num_heads = 8
    seq_len = 64
    head_dim = 64

    alibi_attn = ALiBiAttention(num_heads=num_heads)

    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)
    v = torch.randn(batch_size, num_heads, seq_len, head_dim)

    output = alibi_attn(q, k, v)

    print(f"ALiBi slopes: {alibi_attn.slopes}")
    print(f"Output shape: {output.shape}")
    print(f"ALiBi bias shape: {alibi_attn.alibi_bias.shape}")

test_alibi()
```

ALiBi Advantages

  • Excellent Extrapolation: Models trained on 512 tokens can handle 3072+ tokens.
  • No Additional Parameters: Bias is computed, not learned.
  • Simple Implementation: Just add a bias matrix to attention scores.
  • Computational Efficiency: Reduces training time compared to learned embeddings.

6.4 Relative Positional Encodings

Relative positional encodings focus on the distance between tokens rather than their absolute positions.

T5 Relative Position Bias

T5 simplified relative position encoding by using bucketed biases. Distances are grouped into buckets, and each bucket has a learnable bias.

```python
import math

import torch
import torch.nn as nn


class T5RelativePositionBias(nn.Module):
    """
    T5-style relative position bias with bucketing.

    Used in: T5, mT5, FLAN-T5
    """

    def __init__(
        self,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        bidirectional: bool = True
    ):
        super().__init__()
        self.num_heads = num_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bidirectional = bidirectional

        # Learnable bias for each bucket and head
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    def _relative_position_bucket(
        self,
        relative_position: torch.Tensor
    ) -> torch.Tensor:
        """
        Map relative positions to bucket indices.

        Uses logarithmic bucketing for larger distances.
        """
        num_buckets = self.num_buckets
        max_distance = self.max_distance

        relative_buckets = 0

        if self.bidirectional:
            num_buckets //= 2
            # Separate buckets for positive and negative positions
            relative_buckets += (relative_position > 0).long() * num_buckets
            relative_position = torch.abs(relative_position)
        else:
            relative_position = -torch.min(
                relative_position,
                torch.zeros_like(relative_position)
            )

        # Half the buckets for exact positions, half for log-spaced ones
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # Log bucketing for larger distances
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()

        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += torch.where(
            is_small,
            relative_position,
            relative_position_if_large
        )

        return relative_buckets

    def forward(self, query_length: int, key_length: int) -> torch.Tensor:
        """
        Compute relative position bias matrix.

        Returns:
            Bias tensor [1, num_heads, query_length, key_length]
        """
        device = self.relative_attention_bias.weight.device

        # Create position indices
        query_positions = torch.arange(query_length, device=device)
        key_positions = torch.arange(key_length, device=device)

        # Compute relative positions
        relative_positions = key_positions.unsqueeze(0) - query_positions.unsqueeze(1)

        # Map to buckets
        relative_buckets = self._relative_position_bucket(relative_positions)

        # Look up biases
        values = self.relative_attention_bias(relative_buckets)  # [q, k, heads]
        values = values.permute(2, 0, 1).unsqueeze(0)  # [1, heads, q, k]

        return values
```
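To see what the bucketing does, here is a scalar re-implementation of the unidirectional case with 16 buckets (a simplified sketch mirroring the logic above, not the T5 source): exact buckets for small distances, log-spaced buckets beyond `max_exact`, and a clamp at the last bucket.

```python
import math

def bucket(distance: int, num_buckets: int = 16, max_distance: int = 128) -> int:
    max_exact = num_buckets // 2
    d = abs(distance)
    if d < max_exact:
        return d  # one exact bucket per small distance
    # Logarithmically spaced buckets between max_exact and max_distance
    b = max_exact + int(
        math.log(d / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(b, num_buckets - 1)  # clamp distances beyond max_distance

print([bucket(d) for d in [0, 1, 7, 8, 12, 40, 200]])
# [0, 1, 7, 8, 9, 12, 15]
```

Note how resolution degrades gracefully: nearby tokens get distinct buckets, while distances past `max_distance` all share the final bucket.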

Transformer-XL Relative Encoding

Transformer-XL introduced relative position encoding with a recurrence mechanism, enabling very long context through segment-level recurrence.


6.5 Learned Positional Embeddings

Many models simply learn positional embeddings as trainable parameters, just like word embeddings.

```python
import torch
import torch.nn as nn


class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings (used in BERT, GPT-2).

    Each position has a learnable embedding vector.
    """

    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_seq_len, d_model)
        self.max_seq_len = max_seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Tensor with positional embeddings added
        """
        seq_len = x.size(1)

        if seq_len > self.max_seq_len:
            raise ValueError(
                f"Sequence length {seq_len} exceeds maximum {self.max_seq_len}"
            )

        positions = torch.arange(seq_len, device=x.device)
        pos_embeddings = self.embedding(positions)  # [seq_len, d_model]

        return x + pos_embeddings.unsqueeze(0)  # Broadcast over batch
```

| Model | Position Embedding Type | Max Length |
|---|---|---|
| BERT | Learned absolute | 512 |
| GPT-2 | Learned absolute | 1024 |
| GPT-3 | Learned absolute | 2048 |
| RoBERTa | Learned absolute | 512 |

6.6 Advanced Methods

xPos (Extrapolatable Position Embeddings)

xPos extends RoPE with an ALiBi-like decay factor, improving length extrapolation.

```python
import torch

# xPos applies a per-frequency exponential decay on top of RoPE's rotation:
# queries are scaled by scale**m and keys by scale**(-n), so attention
# scores pick up a factor scale**(m - n) that decays with relative distance.

class XPosEmbedding(RotaryPositionalEmbedding):
    """xPos: RoPE with exponential decay for better extrapolation (sketch)."""

    def __init__(self, dim: int, base: int = 10000, gamma: float = 0.4):
        super().__init__(dim, base=base)
        self.gamma = gamma

        # Per-frequency decay factors in (0, 1]:
        # scale_i = (2i/d + gamma) / (1 + gamma)
        scale = (torch.arange(0, dim, 2).float() / dim + gamma) / (1 + gamma)
        self.register_buffer("scale", scale)
```
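A quick look at the decay factors, assuming the parameterization $\zeta_i = (2i/d + \gamma)/(1 + \gamma)$ with the paper's default $\gamma = 0.4$ (this closed form is an assumption of the sketch; implementations differ in detail):

```python
# Per-frequency xPos decay factors (sketch):
# scale_i = (2i/d + gamma) / (1 + gamma), gamma = 0.4
gamma, dim = 0.4, 64
scales = [(2 * i / dim + gamma) / (1 + gamma) for i in range(dim // 2)]

# High-frequency pairs (small i) decay fastest across distance;
# the lowest-frequency pair barely decays at all.
print(round(scales[0], 4), round(scales[-1], 4))  # 0.2857 0.9777
```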

FIRE (Functional Interpolation for Relative Position Encoding)

FIRE uses progressive interpolation: the distance between tokens is normalized by the query position, so the normalized distance is always bounded in $[0, 1]$.

  • Can theoretically represent T5's RPE, ALiBi, and KERPLE as special cases
  • Better length extrapolation through position normalization
  • Learnable interpolation function
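The normalization can be sketched in a few lines. Here $\psi$ is a log transform, $\psi(x) = \log(cx + 1)$; the exact form and $c = 1$ are assumptions of this sketch, and in FIRE the normalized value feeds a small learned MLP that produces the attention bias:

```python
import math

def fire_input(i: int, j: int, c: float = 1.0) -> float:
    # Normalized distance psi(i - j) / psi(i): bounded in [0, 1]
    # for any causal pair j <= i, regardless of sequence length.
    psi = lambda x: math.log(c * x + 1.0)
    return psi(i - j) / psi(i)

print([round(fire_input(64, j), 3) for j in (0, 32, 63)])  # [1.0, 0.838, 0.166]
```

Because the input stays bounded, the learned bias function never sees out-of-range values at inference time, which is what drives FIRE's extrapolation.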

NoPE (No Positional Encoding)

Surprisingly, some research shows that decoder-only Transformers can work without explicit positional encoding, relying on causal attention masks to implicitly encode position.

Research Finding: NoPE can match or outperform explicit positional encoding methods on some length-generalization benchmarks while requiring no additional computation. When trained with SGD, NoPE's attention patterns mostly resemble those of T5's relative PE.

6.7 Comparison and Trade-offs

Feature Comparison

| Method | Type | Params | Extrapolation | Compute Cost |
|---|---|---|---|---|
| Sinusoidal | Absolute | None | Poor | Low |
| Learned | Absolute | O(L×d) | None | Low |
| RoPE | Relative | None | Good* | Medium |
| ALiBi | Relative | None | Excellent | Low |
| T5 RPE | Relative | O(B×H) | Good | Low |
| xPos | Relative | None | Excellent | Medium |
| FIRE | Relative | Learned | Excellent | Medium |

*RoPE extrapolation improves with PI/NTK/YaRN extensions

Extrapolation Performance

When models need to handle sequences longer than their training length:

  • Best: ALiBi, FIRE, xPos
  • Good: T5 RPE, RoPE (with extensions)
  • Limited: Sinusoidal, Vanilla RoPE
  • None: Learned embeddings (hard cutoff)

Key Similarities: RoPE vs ALiBi

  • Both focus on relative positions, not absolute
  • Both modify attention weights rather than adding to embeddings
  • Both have recency bias (farther tokens receive less attention)
  • Both are deterministic (no learned parameters for positions)

Key Differences: RoPE vs ALiBi

| Aspect | RoPE | ALiBi |
|---|---|---|
| Mechanism | Rotates Q/K vectors | Adds bias to attention scores |
| Math | Trigonometric (rotation) | Linear penalty |
| Where Applied | Before dot product | After dot product |
| Head Variation | Same rotation for all heads | Different slopes per head |
| Extensions | PI, NTK, YaRN | Generally not needed |

6.8 Which Models Use What?

| Model | Positional Encoding | Notes |
|---|---|---|
| GPT-2/3 | Learned Absolute | Limited to training length |
| BERT | Learned Absolute | 512 token limit |
| T5 | Relative (Bucketed) | Good extrapolation |
| LLaMA 1/2/3 | RoPE | With extensions for long context |
| Mistral | RoPE | Sliding window attention |
| Gemma | RoPE | Google's open model |
| Qwen | RoPE | With NTK-aware scaling |
| BLOOM | ALiBi | 176B multilingual model |
| MPT | ALiBi | MosaicML's model |
| Falcon | RoPE + ALiBi | Hybrid approach |
| Claude | Unknown | Anthropic's models |
| GPT-4 | Unknown | OpenAI's latest |

Trend: Modern LLMs (2023-2024) predominantly use RoPE, especially for models that need long context support. ALiBi remains popular for its simplicity and excellent extrapolation without extensions.

6.9 Implementation Guide

Choosing the Right Encoding

Use RoPE when:

  • Building a general-purpose LLM
  • You need good relative position understanding
  • Long context is important (with extensions)
  • Following LLaMA-style architectures

Use ALiBi when:

  • Length extrapolation is critical
  • You want simplicity (no extensions needed)
  • Training on shorter sequences, deploying on longer
  • Computational efficiency is important

Use Learned embeddings when:

  • Sequence length is fixed and known
  • Simplicity is preferred over flexibility
  • Following BERT-style architectures

Use Sinusoidal when:

  • Following the original Transformer paper
  • Building encoder-decoder models
  • Simplicity and determinism are needed

Summary

Key Takeaways

  • Sinusoidal (Original): Deterministic, simple, but poor extrapolation
  • Learned: Flexible but limited to training length
  • RoPE: Rotates embeddings, widely adopted in modern LLMs (LLaMA, Mistral)
  • ALiBi: Linear bias on attention, excellent extrapolation (BLOOM, MPT)
  • T5 RPE: Bucketed relative biases, good balance of simplicity and performance
  • xPos/FIRE: Advanced extensions for even better extrapolation

The Trend

Modern LLMs are moving toward relative positional encodings that modify attention rather than embeddings. RoPE has become the de facto standard for most open-source LLMs, while ALiBi offers a simpler alternative with excellent extrapolation properties.

Remember: The best positional encoding depends on your use case. For most modern applications, RoPE or ALiBi will outperform sinusoidal and learned embeddings, especially for long-context scenarios.

References

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
  • Train Short, Test Long: Attention with Linear Biases (Press et al., 2022)
  • Exploring Length Generalization in Large Language Models (2023)
  • FIRE: Functional Interpolation for Relative Position Encoding (2023)

Next Steps

Now that you understand the landscape of positional encodings, you're ready to move on to Subword Tokenization in the next chapter, where we'll learn how to build vocabularies using Byte Pair Encoding (BPE).