Chapter 4: Positional Encoding and Embeddings

Sinusoidal Positional Encoding

Introduction

The original Transformer paper introduced a clever solution to the position problem: sinusoidal positional encoding. Using sine and cosine functions at different frequencies, this method creates unique position representations that can theoretically generalize to any sequence length.

This section dives deep into the formula, explains why it works, and provides a complete PyTorch implementation.


2.1 The Formula

Complete Definition

For position $\text{pos}$ and dimension index $i$:

$$
\begin{aligned}
PE_{(\text{pos},\,2i)} &= \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \\[0.5em]
PE_{(\text{pos},\,2i+1)} &= \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
$$

Where:

- $\text{pos}$: Position in the sequence (0, 1, 2, ...)

- $i$: Dimension-pair index (0, 1, 2, ..., $d_{\text{model}}/2 - 1$)

- $d_{\text{model}}$: Embedding dimension

- Even dimensions use sine, odd dimensions use cosine

Equivalent Formulation

$$
PE_{(\text{pos},\,\text{dim})} =
\begin{cases}
\sin\!\left(\dfrac{\text{pos}}{10000^{\text{dim}/d_{\text{model}}}}\right) & \text{if dim is even} \\[1em]
\cos\!\left(\dfrac{\text{pos}}{10000^{(\text{dim}-1)/d_{\text{model}}}}\right) & \text{if dim is odd}
\end{cases}
$$

Numerical Example

For $d_{\text{model}} = 4$, position 0:

$$
\begin{aligned}
\text{dim } 0\ (i=0):\quad & \sin\!\left(\frac{0}{10000^{0/4}}\right) = \sin(0) = 0 \\
\text{dim } 1\ (i=0):\quad & \cos\!\left(\frac{0}{10000^{0/4}}\right) = \cos(0) = 1 \\
\text{dim } 2\ (i=1):\quad & \sin\!\left(\frac{0}{10000^{2/4}}\right) = \sin(0) = 0 \\
\text{dim } 3\ (i=1):\quad & \cos\!\left(\frac{0}{10000^{2/4}}\right) = \cos(0) = 1 \\[0.5em]
& PE(0) = [0, 1, 0, 1]
\end{aligned}
$$

For $d_{\text{model}} = 4$, position 1:

$$
\begin{aligned}
\text{dim } 0\ (i=0):\quad & \sin\!\left(\frac{1}{10000^{0/4}}\right) = \sin(1) \approx 0.841 \\
\text{dim } 1\ (i=0):\quad & \cos\!\left(\frac{1}{10000^{0/4}}\right) = \cos(1) \approx 0.540 \\
\text{dim } 2\ (i=1):\quad & \sin\!\left(\frac{1}{10000^{2/4}}\right) = \sin(0.01) \approx 0.010 \\
\text{dim } 3\ (i=1):\quad & \cos\!\left(\frac{1}{10000^{2/4}}\right) = \cos(0.01) \approx 1.000 \\[0.5em]
& PE(1) \approx [0.841, 0.540, 0.010, 1.000]
\end{aligned}
$$
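These worked values can be checked mechanically. A minimal sketch using only the standard `math` module (the variable names are ours, purely illustrative):

```python
import math

d_model = 4
for pos in (0, 1):
    pe = []
    for dim in range(d_model):
        i = dim // 2  # dimension-pair index
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    print(pos, [round(v, 3) for v in pe])
```

Running this reproduces the two vectors above: `[0.0, 1.0, 0.0, 1.0]` for position 0 and `[0.841, 0.54, 0.01, 1.0]` for position 1.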

2.2 Understanding the Frequency Pattern

Different Frequencies for Different Dimensions

The denominator $10000^{2i/d_{\text{model}}}$ gives each dimension pair its own "wavelength":

$$
\begin{aligned}
\text{Dimension pair } i=0:\quad & 10000^{0/d} = 1 \quad\rightarrow\quad \text{wavelength} = 2\pi \\
\text{Dimension pair } i=1:\quad & 10000^{2/d} \quad\rightarrow\quad \text{longer wavelength} \\
\text{Dimension pair } i=2:\quad & 10000^{4/d} \quad\rightarrow\quad \text{even longer} \\
& \quad\vdots \\
\text{Dimension pair } i=d/2-1:\quad & 10000^{(d-2)/d} \approx 10000 \quad\rightarrow\quad \text{wavelength} \approx 62{,}832
\end{aligned}
$$
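To make these wavelengths concrete, here is a small sketch that evaluates $2\pi \cdot 10000^{2i/d_{\text{model}}}$ for a few pair indices (using $d_{\text{model}} = 128$ as an arbitrary example):

```python
import math

d_model = 128  # example embedding size (any even value works)
for i in [0, 1, 16, 32, 63]:
    # wavelength of the sinusoid used by dimension pair i
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"i={i:3d}  wavelength = {wavelength:,.1f} positions")
```

The first pair repeats every ~6.3 positions, while the last pair's wavelength is in the tens of thousands, so it is effectively monotone over typical sequences.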

Visual Intuition

```text
Position:    0    1    2    3    4    5    6    7    ...

Dim 0,1:     ╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮    (fast oscillation)
             ╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯

Dim 2,3:     ╭────────╮╭────────╮╭────────╮      (medium oscillation)
             ╰────────╯╰────────╯╰────────╯

Dim d-2,d-1: ╭───────────────────────────────   (very slow oscillation)
             (nearly constant over typical sequences)
```

Why Multiple Frequencies?

Like binary representation but continuous:

- Low dimensions: fine-grained position (nearby positions differ)

- High dimensions: coarse-grained position (global structure)

This allows the model to:

- Distinguish adjacent positions (high-frequency, low-index dimensions)

- Recognize large-scale position (low-frequency, high-index dimensions)

- Learn combinations for relative positions
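The multi-scale behavior is easy to observe directly: compare how much the first (fast) and last (slow) dimension pairs change between adjacent positions. This is a standalone sketch of the standard formula; nothing here is framework-specific.

```python
import math

d_model = 128

def pe(pos: int) -> list:
    """Full sinusoidal PE vector for one position."""
    out = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        out.extend([math.sin(angle), math.cos(angle)])
    return out

a, b = pe(10), pe(11)
fast = abs(a[0] - b[0]) + abs(a[1] - b[1])      # first pair: highest frequency
slow = abs(a[-2] - b[-2]) + abs(a[-1] - b[-1])  # last pair: lowest frequency
print(f"adjacent-position change, fast dims: {fast:.4f}, slow dims: {slow:.6f}")
```

The fast dimensions change substantially from one position to the next, while the slow dimensions barely move, encoding coarse position instead.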


2.3 Why Sine and Cosine?

The Key Property: Linear Relative Position

For any fixed offset $k$, we can express:

$$PE(\text{pos} + k) = f(PE(\text{pos}))$$

where $f$ is a linear transformation!

Mathematical Proof

Using the angle addition formulas:

$$
\begin{aligned}
\sin(a + b) &= \sin(a)\cos(b) + \cos(a)\sin(b) \\
\cos(a + b) &= \cos(a)\cos(b) - \sin(a)\sin(b)
\end{aligned}
$$

For position $\text{pos} + k$ at dimension pair $(2i, 2i+1)$, let $\omega = \dfrac{1}{10000^{2i/d_{\text{model}}}}$. Then:

$$
\begin{aligned}
PE_{(\text{pos}+k,\,2i)} &= \sin(\omega(\text{pos}+k)) \\
&= \sin(\omega \cdot \text{pos})\cos(\omega k) + \cos(\omega \cdot \text{pos})\sin(\omega k) \\[0.5em]
PE_{(\text{pos}+k,\,2i+1)} &= \cos(\omega(\text{pos}+k)) \\
&= \cos(\omega \cdot \text{pos})\cos(\omega k) - \sin(\omega \cdot \text{pos})\sin(\omega k)
\end{aligned}
$$

In matrix form:

$$
\begin{bmatrix} PE_{(\text{pos}+k,\,2i)} \\ PE_{(\text{pos}+k,\,2i+1)} \end{bmatrix}
=
\begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix}
\begin{bmatrix} PE_{(\text{pos},\,2i)} \\ PE_{(\text{pos},\,2i+1)} \end{bmatrix}
$$

This is a rotation matrix! $PE(\text{pos}+k)$ is a rotation of $PE(\text{pos})$.

Why This Matters

The attention mechanism can learn to identify:

- "Token 3 positions before" → consistent transformation

- "Token 5 positions after" → different but consistent transformation

The linear relationship enables the model to learn relative positions even though we encode absolute positions!
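One concrete consequence of the rotation property: the dot product $PE(\text{pos}) \cdot PE(\text{pos}+k)$ depends only on the offset $k$, never on the base position. A quick stdlib-only check (toy $d_{\text{model}}$, names are ours):

```python
import math

d_model = 64

def pe(pos: int) -> list:
    v = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        v.extend([math.sin(angle), math.cos(angle)])
    return v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

k = 3
d1 = dot(pe(5), pe(5 + k))    # offset k from base position 5
d2 = dot(pe(40), pe(40 + k))  # same offset k from base position 40
print(f"{d1:.6f} vs {d2:.6f}")  # identical up to float error
```

This follows from $\sin(a)\sin(b) + \cos(a)\cos(b) = \cos(a-b)$: each dimension pair contributes $\cos(\omega k)$, which contains no trace of the base position.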


2.4 PyTorch Implementation

Complete Implementation

```python
import torch
import torch.nn as nn
import math


class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding from "Attention Is All You Need".

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    Args:
        d_model: Embedding dimension
        max_len: Maximum sequence length (for precomputation)
        dropout: Dropout probability

    Example:
        >>> pe = SinusoidalPositionalEncoding(d_model=512, max_len=1000)
        >>> x = torch.randn(2, 50, 512)  # [batch, seq_len, d_model]
        >>> output = pe(x)  # Same shape, with positions added
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super().__init__()

        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = self._create_pe_matrix(max_len, d_model)

        # Register as buffer (not a parameter, but saved with model)
        self.register_buffer('pe', pe)

    def _create_pe_matrix(self, max_len: int, d_model: int) -> torch.Tensor:
        """
        Create the positional encoding matrix.

        Returns:
            pe: [1, max_len, d_model]
        """
        # Position indices: [max_len, 1]
        position = torch.arange(max_len).unsqueeze(1).float()

        # Dimension indices: [d_model/2]
        # div_term = 1 / 10000^(2i/d_model), computed via exp/log for stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Compute PE matrix
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

        # Add batch dimension: [1, max_len, d_model]
        pe = pe.unsqueeze(0)

        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input.

        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        seq_len = x.size(1)

        # Add positional encoding (broadcasts over batch)
        x = x + self.pe[:, :seq_len, :]

        return self.dropout(x)

    def extra_repr(self) -> str:
        return f'd_model={self.d_model}, max_len={self.pe.size(1)}'


# Test the implementation
def test_sinusoidal_pe():
    d_model = 512
    max_len = 100
    batch_size = 2
    seq_len = 50

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)

    x = torch.zeros(batch_size, seq_len, d_model)
    output = pe(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"PE buffer shape: {pe.pe.shape}")

    # Verify values at position 0
    print(f"\nPE at position 0, first 8 dims:")
    print(pe.pe[0, 0, :8])

    # Verify values are bounded
    print(f"\nPE min: {pe.pe.min():.4f}, max: {pe.pe.max():.4f}")

    # Verify each position is unique
    pe_matrix = pe.pe[0]  # [max_len, d_model]
    for i in range(min(5, seq_len)):
        for j in range(i + 1, min(5, seq_len)):
            diff = (pe_matrix[i] - pe_matrix[j]).abs().sum()
            assert diff > 0.1, f"Positions {i} and {j} too similar!"
    print("\n✓ All positions are unique")

    print("\n✓ Sinusoidal PE test passed!")


test_sinusoidal_pe()
```

Understanding the div_term Calculation

```python
# Equivalent but clearer:
i = torch.arange(0, d_model, 2)  # [0, 2, 4, ..., d_model-2]
denominator = 10000 ** (i / d_model)  # [1, 10000^(2/d), 10000^(4/d), ...]
div_term = 1 / denominator

# Used as:
pe[:, 0::2] = torch.sin(position / denominator)  # = sin(position * div_term)
```

The log-exp trick avoids numerical issues:

```python
div_term = torch.exp(i * (-math.log(10000.0) / d_model))
#        = exp(log(10000^(-i/d_model)))
#        = 10000^(-i/d_model)
#        = 1 / 10000^(i/d_model)
```
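To confirm the identity numerically, a quick stdlib-only check comparing the exp/log route against the direct power (no PyTorch required):

```python
import math

d_model = 512
max_rel_diff = 0.0
for i in range(0, d_model, 2):
    via_exp = math.exp(i * (-math.log(10000.0) / d_model))  # exp/log route
    direct = 1.0 / (10000 ** (i / d_model))                 # direct power
    max_rel_diff = max(max_rel_diff, abs(via_exp - direct) / direct)

print(f"max relative difference: {max_rel_diff:.2e}")
```

The two agree to within floating-point rounding across all dimension indices.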

2.5 Visualization

Heatmap of Positional Encodings

```python
import matplotlib.pyplot as plt
import numpy as np


def visualize_positional_encoding(d_model: int = 128, max_len: int = 100):
    """Visualize positional encodings as a heatmap."""

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [max_len, d_model]

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Full heatmap
    ax1 = axes[0]
    im1 = ax1.imshow(pe_matrix, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
    ax1.set_xlabel('Dimension')
    ax1.set_ylabel('Position')
    ax1.set_title('Sinusoidal Positional Encoding')
    plt.colorbar(im1, ax=ax1)

    # First few dimensions over positions
    ax2 = axes[1]
    positions = np.arange(max_len)
    for dim in [0, 1, 10, 11, 50, 51]:
        ax2.plot(positions, pe_matrix[:, dim], label=f'dim {dim}')
    ax2.set_xlabel('Position')
    ax2.set_ylabel('Encoding Value')
    ax2.set_title('Encoding Values Across Positions')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('sinusoidal_pe_visualization.png', dpi=150)
    plt.close()
    print("Saved sinusoidal_pe_visualization.png")


visualize_positional_encoding()
```

Position Similarity Matrix

```python
def visualize_position_similarity(d_model: int = 128, num_positions: int = 50):
    """Visualize similarity between position encodings."""

    pe = SinusoidalPositionalEncoding(d_model, num_positions, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [num_positions, d_model]

    # Compute cosine similarity between all position pairs:
    # normalize each row, then take pairwise dot products
    norms = np.linalg.norm(pe_matrix, axis=1, keepdims=True)
    pe_normalized = pe_matrix / norms

    # Similarity matrix
    similarity = pe_normalized @ pe_normalized.T

    plt.figure(figsize=(10, 8))
    plt.imshow(similarity, cmap='viridis')
    plt.colorbar(label='Cosine Similarity')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.title('Position Encoding Similarity Matrix')
    plt.savefig('position_similarity.png', dpi=150)
    plt.close()
    print("Saved position_similarity.png")


visualize_position_similarity()
```

2.6 Interactive Demo

Now that we've seen the theory and code, let's explore positional encodings interactively. Use the visualizer below to see how different dimensions create unique wave patterns, how positions are encoded, and how the encodings relate to each other.

Interactive Positional Encoding Visualizer

Explore how sinusoidal positional encodings work with animated waves and live computation. Click tokens to select positions for comparison.

[Interactive widget: animated sinusoidal waves by dimension, a PE matrix heatmap (positions 0-19 by dimensions 0-31, color scale -1 to +1; sin in even dims, cos in odd dims), and a position-comparison panel reporting the dot product, cosine similarity, and Euclidean distance between two selected positions.]

Insight: Adjacent positions have high similarity but are still distinguishable. The dot product between PE(pos) and PE(pos+k) follows a predictable pattern due to the rotation property.

Formula reminder:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

The wave frequency decreases exponentially with dimension index. This creates a unique "fingerprint" for each position that the model can learn to interpret.

Step-by-Step Pipeline

Now let's see the complete flow of how a token gets its position-aware embedding. Press Play to watch the animation, or step through manually to understand each stage.

Step-by-Step: How Positional Encoding is Applied

Watch the complete pipeline from token to position-aware embedding

[Animated pipeline: the token "Hello" at position 0 → embedding lookup → Token Embedding [32d] → add PE(pos=0) [32d] → Final Embedding [32d].]
+24
The Complete Process:
Output = TokenEmbedding(token) + PositionalEncoding(position)

Each token's final representation contains both its semantic meaning (from the embedding) and its position in the sequence (from the sinusoidal encoding).
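The pipeline above maps directly onto a few lines of PyTorch. This is a minimal sketch with arbitrary toy sizes (a randomly initialized `nn.Embedding` standing in for a trained one):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 32, 100, 5  # toy sizes

# Token embedding: semantic meaning (randomly initialized here)
embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal PE for the first seq_len positions
position = torch.arange(seq_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

tokens = torch.tensor([[3, 14, 15, 9, 2]])  # [batch=1, seq_len], toy token ids
x = embedding(tokens) + pe.unsqueeze(0)     # semantic + positional
print(x.shape)  # torch.Size([1, 5, 32])
```

Every row of `x` now carries both what the token means and where it sits in the sequence.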


2.7 Deep Dive: Implementation Details

Let's examine exactly how the positional encoding is computed and applied in PyTorch. This section breaks down two critical operations:

1. Sin/Cos Index Assignment: How pe[:, 0::2] and pe[:, 1::2] assign sine values to even dimensions and cosine values to odd dimensions.

2. Broadcasting and Addition: How x = x + self.pe[:, :seq_len, :] broadcasts the positional encoding across batches and adds it element-wise to token embeddings.


Step 1: Understanding Index Slicing

For a PE tensor with d_model = 6, the dimension indices are [0, 1, 2, 3, 4, 5]:

- pe[:, 0::2] → selects even indices [0, 2, 4] → apply sin()

- pe[:, 1::2] → selects odd indices [1, 3, 5] → apply cos()
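The interleaving can be checked directly on a small tensor (d_model = 6 and a handful of positions, as in the example above):

```python
import math
import torch

d_model, seq_len = 6, 4
position = torch.arange(seq_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * (-math.log(10000.0) / d_model))

pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # fills columns 0, 2, 4
pe[:, 1::2] = torch.cos(position * div_term)  # fills columns 1, 3, 5

print(pe[0])  # position 0: sin(0)=0 in even columns, cos(0)=1 in odd columns
```

Row 0 comes out as [0, 1, 0, 1, 0, 1], confirming the sin/cos alternation.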

Key Code Lines

```python
# Computing PE:
pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

# Applying PE:
x = x + self.pe[:, :seq_len, :]  # Broadcast + add
```
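The broadcasting in the second operation can be seen in isolation with dummy tensors (shapes chosen arbitrarily): the [1, max_len, d_model] buffer is sliced to the current sequence length and implicitly expanded across the batch dimension.

```python
import torch

batch, seq_len, max_len, d_model = 4, 10, 5000, 8  # arbitrary toy shapes

x = torch.randn(batch, seq_len, d_model)  # token embeddings
pe = torch.randn(1, max_len, d_model)     # stand-in for the PE buffer

out = x + pe[:, :seq_len, :]  # [4, 10, 8] + [1, 10, 8]: broadcasts over batch
print(out.shape)  # torch.Size([4, 10, 8])

# Every batch element received the same positional offsets
assert torch.allclose(out[0] - x[0], out[3] - x[3])
```

Because the size-1 batch dimension of the buffer stretches to match `x`, every sequence in the batch gets the identical position signal added element-wise.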

2.8 Properties Verification

Uniqueness

```python
def verify_uniqueness(pe_module, num_positions=100):
    """Verify each position has a unique encoding."""
    pe_matrix = pe_module.pe[0, :num_positions]  # [num_positions, d_model]

    for i in range(num_positions):
        for j in range(i + 1, num_positions):
            if torch.allclose(pe_matrix[i], pe_matrix[j], atol=1e-6):
                print(f"WARNING: Positions {i} and {j} have same encoding!")
                return False

    print(f"✓ All {num_positions} positions have unique encodings")
    return True
```

Boundedness

```python
def verify_boundedness(pe_module):
    """Verify values are bounded in [-1, 1]."""
    pe_matrix = pe_module.pe

    min_val = pe_matrix.min().item()
    max_val = pe_matrix.max().item()

    print(f"PE range: [{min_val:.4f}, {max_val:.4f}]")

    if min_val >= -1.0 and max_val <= 1.0:
        print("✓ Values are bounded in [-1, 1]")
        return True
    else:
        print("WARNING: Values exceed [-1, 1]")
        return False
```

Relative Position Property

```python
def verify_relative_position_property(pe_module):
    """Verify PE(pos+k) is a linear function of PE(pos)."""

    pe_matrix = pe_module.pe[0]  # [max_len, d_model]

    # Check for dimension pair (0, 1), i.e. i = 0
    pos = 10
    k = 5

    pe_pos = pe_matrix[pos, 0:2]        # [sin, cos] at pos
    pe_pos_k = pe_matrix[pos + k, 0:2]  # [sin, cos] at pos+k

    # Compute what the rotation matrix would give
    omega = 1.0  # For i=0, div_term = 1
    rotation = torch.tensor([
        [math.cos(omega * k), math.sin(omega * k)],
        [-math.sin(omega * k), math.cos(omega * k)]
    ])

    pe_pos_k_computed = rotation @ pe_pos

    error = (pe_pos_k - pe_pos_k_computed).abs().max().item()
    print(f"Rotation matrix error: {error:.6f}")

    if error < 1e-5:
        print("✓ Relative position property verified")
        return True
    else:
        print("WARNING: Relative position property not satisfied")
        return False


# Run verifications
pe = SinusoidalPositionalEncoding(d_model=128, max_len=200, dropout=0.0)
verify_uniqueness(pe)
verify_boundedness(pe)
verify_relative_position_property(pe)
```

2.9 Alternative Implementation Styles

Compute on the Fly

```python
class SinusoidalPEOnTheFly(nn.Module):
    """Compute PE during forward pass (no precomputation)."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

        # Store div_term as buffer
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        self.register_buffer('div_term', div_term)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        device = x.device

        position = torch.arange(seq_len, device=device).unsqueeze(1)

        pe = torch.zeros(seq_len, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(position * self.div_term)
        pe[:, 1::2] = torch.cos(position * self.div_term)

        return self.dropout(x + pe.unsqueeze(0))
```

Trade-offs:

- Precomputed: Faster forward pass, uses more memory

- On-the-fly: Slower forward pass, less memory, handles any length
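For a sense of scale on the memory side of this trade-off, a back-of-the-envelope estimate for the precomputed buffer at the defaults used earlier (max_len = 5000, d_model = 512, float32):

```python
max_len, d_model = 5000, 512  # defaults from Section 2.4
bytes_needed = max_len * d_model * 4  # float32 = 4 bytes per value
print(f"precomputed PE buffer: {bytes_needed / 1e6:.2f} MB")  # 10.24 MB
```

Roughly 10 MB: negligible next to model weights, which is why precomputation is the common default.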


Summary

The Formula

$$
\begin{aligned}
PE_{(\text{pos},\,2i)} &= \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \\[0.5em]
PE_{(\text{pos},\,2i+1)} &= \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
$$

Key Properties

| Property | How It's Achieved |
|---|---|
| Unique positions | Different wavelengths per dimension |
| Bounded values | sin/cos outputs lie in [-1, 1] |
| Relative positions | Linear (rotation) transformation |
| Generalization | No max length inherent in the formula |
| Multi-scale | Low dims = local, high dims = global |

Implementation Notes

1. Precompute PE matrix for efficiency

2. Register as buffer (saved with model, not trained)

3. Add dropout after adding PE to embeddings

4. Handle variable sequence lengths by slicing


Next Section Preview

In the next section, we'll implement learned positional embeddings—an alternative approach used by models like BERT and GPT. We'll compare the trade-offs between sinusoidal and learned positions.