Introduction
The original Transformer paper introduced a clever solution to the position problem: sinusoidal positional encoding. Using sine and cosine functions at different frequencies, this method creates unique position representations that can theoretically generalize to any sequence length.
This section dives deep into the formula, explains why it works, and provides a complete PyTorch implementation.
2.1 The Formula
Complete Definition
For position pos and dimension pair index i:

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos: Position in sequence (0, 1, 2, ...)
- i: Dimension pair index (0, 1, 2, ..., d_model/2 - 1)
- d_model: Embedding dimension
- Even dimensions use sine, odd dimensions use cosine

Equivalent Formulation

Defining the angular frequency omega_i = 1 / 10000^(2i/d_model), the encoding becomes:

    PE(pos, 2i)   = sin(omega_i · pos)
    PE(pos, 2i+1) = cos(omega_i · pos)

Numerical Example

For d_model = 4, position 0 (all angles are zero):

    PE(0) = [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]

For d_model = 4, position 1 (omega_0 = 1, omega_1 = 1/100):

    PE(1) = [sin(1), cos(1), sin(0.01), cos(0.01)] ≈ [0.8415, 0.5403, 0.0100, 1.0000]
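The example values above can be reproduced with a few lines of Python (a standalone sanity check; sinusoidal_pe is a throwaway helper, not part of the later PyTorch implementation):

```python
import math

def sinusoidal_pe(pos, d_model=4):
    """Encode a single position with the sinusoidal formula above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print([round(v, 4) for v in sinusoidal_pe(0)])  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_pe(1)])  # [0.8415, 0.5403, 0.01, 1.0]
```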
2.2 Understanding the Frequency Pattern
Different Frequencies for Different Dimensions
The denominator 10000^(2i/d_model) creates different "wavelengths": dimension pair i completes one full cycle every 2π · 10000^(2i/d_model) positions, ranging from 2π ≈ 6.3 positions at i = 0 up to nearly 10000 · 2π positions at the highest dimensions.
Visual Intuition
```
Position:     0  1  2  3  4  5  6  7 ...

Dim 0,1:      ╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮  (fast oscillation)
              ╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯

Dim 2,3:      ╭────────╮╭────────╮╭────────╮     (medium oscillation)
              ╰────────╯╰────────╯╰────────╯

Dim d-2,d-1:  ╭───────────────────────────────   (very slow oscillation)
              (nearly constant over typical sequences)
```

Why Multiple Frequencies?
Like binary representation but continuous:
- Low dimensions: fine-grained position (nearby positions differ)
- High dimensions: coarse-grained position (global structure)
This allows the model to:
- Distinguish adjacent positions (low freq dimensions)
- Recognize large-scale position (high freq dimensions)
- Learn combinations for relative positions
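To put numbers on the scales involved, the short script below (illustrative only) prints the wavelength 2π · 10000^(2i/d_model) for a few dimension pairs at d_model = 512:

```python
import math

d_model = 512
pairs = [0, 1, 64, 128, 255]

# Wavelength of dimension pair i: one full sin/cos cycle spans 2π · 10000^(2i/d_model) positions
wavelengths = [2 * math.pi * (10000 ** (2 * i / d_model)) for i in pairs]

for i, w in zip(pairs, wavelengths):
    print(f"dim pair ({2 * i}, {2 * i + 1}): wavelength ≈ {w:,.1f} positions")

# Wavelengths grow geometrically from 2π ≈ 6.3 toward nearly 10000 · 2π positions,
# so the highest dimension pair is almost constant over typical sequence lengths.
```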
2.3 Why Sine and Cosine?
The Key Property: Linear Relative Position
For any fixed offset k, we can express:

    PE(pos + k) = T_k · PE(pos)

Where T_k is a linear transformation that depends only on k!

Mathematical Proof

Using the angle addition formulas:

    sin(a + b) = sin(a)·cos(b) + cos(a)·sin(b)
    cos(a + b) = cos(a)·cos(b) - sin(a)·sin(b)

For position pos + k at dimension pair (2i, 2i+1), with omega_i = 1/10000^(2i/d_model):

    sin(omega_i·(pos + k)) = sin(omega_i·pos)·cos(omega_i·k) + cos(omega_i·pos)·sin(omega_i·k)
    cos(omega_i·(pos + k)) = cos(omega_i·pos)·cos(omega_i·k) - sin(omega_i·pos)·sin(omega_i·k)

In matrix form:

    | sin(omega_i·(pos+k)) |   |  cos(omega_i·k)  sin(omega_i·k) | | sin(omega_i·pos) |
    | cos(omega_i·(pos+k)) | = | -sin(omega_i·k)  cos(omega_i·k) | | cos(omega_i·pos) |

This is a rotation matrix! PE(pos + k) is a rotation of PE(pos).
Why This Matters
The attention mechanism can learn to identify:
- "Token 3 positions before" → consistent transformation
- "Token 5 positions after" → different but consistent transformation
The linear relationship enables the model to learn relative positions even though we encode absolute positions!
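This property is easy to check numerically. The sketch below (standalone, plain math) picks the frequency of one dimension pair and confirms that rotating the encoding at pos by angle omega·k yields the encoding at pos + k:

```python
import math

omega = 1 / 10000 ** (4 / 128)  # frequency of dimension pair (4, 5) for d_model = 128
pos, k = 10, 5

# Encoding components at pos and at pos + k
s0, c0 = math.sin(omega * pos), math.cos(omega * pos)
s1, c1 = math.sin(omega * (pos + k)), math.cos(omega * (pos + k))

# Apply the rotation by omega * k — it depends only on k, not on pos
rot_s = math.cos(omega * k) * s0 + math.sin(omega * k) * c0
rot_c = -math.sin(omega * k) * s0 + math.cos(omega * k) * c0

assert math.isclose(rot_s, s1, abs_tol=1e-12)
assert math.isclose(rot_c, c1, abs_tol=1e-12)
print("Rotation by omega·k maps PE(pos) to PE(pos + k)")
```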
2.4 PyTorch Implementation
Complete Implementation
```python
import torch
import torch.nn as nn
import math


class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding from "Attention Is All You Need".

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    Args:
        d_model: Embedding dimension
        max_len: Maximum sequence length (for precomputation)
        dropout: Dropout probability

    Example:
        >>> pe = SinusoidalPositionalEncoding(d_model=512, max_len=1000)
        >>> x = torch.randn(2, 50, 512)  # [batch, seq_len, d_model]
        >>> output = pe(x)  # Same shape, with positions added
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super().__init__()

        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = self._create_pe_matrix(max_len, d_model)

        # Register as buffer (not a parameter, but saved with model)
        self.register_buffer('pe', pe)

    def _create_pe_matrix(self, max_len: int, d_model: int) -> torch.Tensor:
        """
        Create the positional encoding matrix.

        Returns:
            pe: [1, max_len, d_model]
        """
        # Position indices: [max_len, 1]
        position = torch.arange(max_len).unsqueeze(1).float()

        # Inverse frequencies: [d_model/2]
        # div_term = 1 / 10000^(2i/d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Compute PE matrix
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

        # Add batch dimension: [1, max_len, d_model]
        pe = pe.unsqueeze(0)

        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input.

        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        seq_len = x.size(1)

        # Add positional encoding (broadcasts over batch)
        x = x + self.pe[:, :seq_len, :]

        return self.dropout(x)

    def extra_repr(self) -> str:
        return f'd_model={self.d_model}, max_len={self.pe.size(1)}'


# Test the implementation
def test_sinusoidal_pe():
    d_model = 512
    max_len = 100
    batch_size = 2
    seq_len = 50

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)

    x = torch.zeros(batch_size, seq_len, d_model)
    output = pe(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"PE buffer shape: {pe.pe.shape}")

    # Verify values at position 0
    print(f"\nPE at position 0, first 8 dims:")
    print(pe.pe[0, 0, :8])

    # Verify values are bounded
    print(f"\nPE min: {pe.pe.min():.4f}, max: {pe.pe.max():.4f}")

    # Verify each position is unique
    pe_matrix = pe.pe[0]  # [max_len, d_model]
    for i in range(min(5, seq_len)):
        for j in range(i + 1, min(5, seq_len)):
            diff = (pe_matrix[i] - pe_matrix[j]).abs().sum()
            assert diff > 0.1, f"Positions {i} and {j} too similar!"
    print("\n✓ All positions are unique")

    print("\n✓ Sinusoidal PE test passed!")


test_sinusoidal_pe()
```

Understanding the div_term Calculation
```python
# Equivalent but clearer:
i = torch.arange(0, d_model, 2)       # [0, 2, 4, ..., d_model-2]
denominator = 10000 ** (i / d_model)  # [1, 10000^(2/d), 10000^(4/d), ...]
div_term = 1 / denominator

# Used as:
pe[:, 0::2] = sin(position / denominator)  # = sin(position * div_term)
```

The log-exp trick avoids numerical issues:

```
div_term = torch.exp(i * (-log(10000) / d_model))
         = exp(log(10000^(-i/d_model)))
         = 10000^(-i/d_model)
         = 1 / 10000^(i/d_model)
```

2.5 Visualization
Heatmap of Positional Encodings
```python
import matplotlib.pyplot as plt
import numpy as np


def visualize_positional_encoding(d_model: int = 128, max_len: int = 100):
    """Visualize positional encodings as a heatmap."""

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [max_len, d_model]

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Full heatmap
    ax1 = axes[0]
    im1 = ax1.imshow(pe_matrix, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
    ax1.set_xlabel('Dimension')
    ax1.set_ylabel('Position')
    ax1.set_title('Sinusoidal Positional Encoding')
    plt.colorbar(im1, ax=ax1)

    # First few dimensions over positions
    ax2 = axes[1]
    positions = np.arange(max_len)
    for dim in [0, 1, 10, 11, 50, 51]:
        ax2.plot(positions, pe_matrix[:, dim], label=f'dim {dim}')
    ax2.set_xlabel('Position')
    ax2.set_ylabel('Encoding Value')
    ax2.set_title('Encoding Values Across Positions')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('sinusoidal_pe_visualization.png', dpi=150)
    plt.close()
    print("Saved sinusoidal_pe_visualization.png")


visualize_positional_encoding()
```

Position Similarity Matrix
```python
def visualize_position_similarity(d_model: int = 128, num_positions: int = 50):
    """Visualize similarity between position encodings."""

    pe = SinusoidalPositionalEncoding(d_model, num_positions, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [num_positions, d_model]

    # Compute cosine similarity between all position pairs
    # Normalize each row
    norms = np.linalg.norm(pe_matrix, axis=1, keepdims=True)
    pe_normalized = pe_matrix / norms

    # Similarity matrix
    similarity = pe_normalized @ pe_normalized.T

    plt.figure(figsize=(10, 8))
    plt.imshow(similarity, cmap='viridis')
    plt.colorbar(label='Cosine Similarity')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.title('Position Encoding Similarity Matrix')
    plt.savefig('position_similarity.png', dpi=150)
    plt.close()
    print("Saved position_similarity.png")


visualize_position_similarity()
```

2.6 Interactive Demo
Now that we've seen the theory and code, let's explore positional encodings interactively. Use the visualizer below to see how different dimensions create unique wave patterns, how positions are encoded, and how the encodings relate to each other.
[Interactive visualizer: sinusoidal waves by dimension, the PE matrix heatmap, and a side-by-side comparison of nearby positions]
Insight: Adjacent positions have high similarity but are still distinguishable. The dot product between PE(pos) and PE(pos+k) follows a predictable pattern due to the rotation property.
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

The wave frequency decreases exponentially with the dimension index. This creates a unique "fingerprint" for each position that the model can learn to interpret.
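The "predictable pattern" is in fact exact: expanding PE(pos)·PE(pos + k) with the angle-addition formula collapses it to Σ_i cos(omega_i·k), which depends only on the offset k and not on pos. A standalone check (pe here is a throwaway pure-Python helper):

```python
import math

def pe(pos, d_model=64):
    """Sinusoidal encoding of a single position as a plain list."""
    out = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

k = 7
# Same offset k from different starting positions -> identical dot product
d1 = dot(pe(3), pe(3 + k))
d2 = dot(pe(40), pe(40 + k))
assert math.isclose(d1, d2, rel_tol=1e-9)

# And it equals the closed form sum_i cos(omega_i * k)
expected = sum(math.cos(k / (10000 ** (2 * i / 64))) for i in range(32))
assert math.isclose(d1, expected, rel_tol=1e-9)
```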
Step-by-Step Pipeline
Now let's trace the complete flow of how a token becomes a position-aware embedding: token → embedding lookup → add sinusoidal positional encoding → position-aware embedding.

[Interactive animation: step-by-step pipeline from token to position-aware embedding]
Each token's final representation contains both its semantic meaning (from the embedding) and its position in the sequence (from the sinusoidal encoding).
2.7 Deep Dive: Implementation Details
Let's examine exactly how the positional encoding is computed and applied in PyTorch. This section breaks down two critical operations:
1. Sin/Cos Index Assignment: How pe[:, 0::2] and pe[:, 1::2] assign sine values to even dimensions and cosine values to odd dimensions.
2. Broadcasting and Addition: How x = x + self.pe[:, :seq_len, :] broadcasts the positional encoding across batches and adds it element-wise to token embeddings.
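The broadcasting step can be illustrated with toy shapes (a standalone sketch; the dimensions are arbitrary, not the real d_model):

```python
import torch

batch, seq_len, d_model = 2, 4, 6

x = torch.randn(batch, seq_len, d_model)  # token embeddings
pe = torch.randn(1, 10, d_model)          # stand-in for a precomputed PE buffer, max_len = 10

# Slice to the current sequence length, then broadcast over the batch dim
out = x + pe[:, :seq_len, :]              # [2, 4, 6] + [1, 4, 6] -> [2, 4, 6]

assert out.shape == (batch, seq_len, d_model)
# Every batch element receives the same positional offset
assert torch.allclose(out[0] - x[0], out[1] - x[1])
```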
Step 1: Understanding Index Slicing
For a PE tensor with d_model = 6, we have dimensions [0, 1, 2, 3, 4, 5].
- Even indices (0::2): dimensions [0, 2, 4] receive sine values
- Odd indices (1::2): dimensions [1, 3, 5] receive cosine values
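The interleaved assignment can be seen in a standalone toy example for d_model = 6, using placeholder values instead of actual sin/cos results:

```python
import torch

max_len, d_model = 3, 6
pe = torch.zeros(max_len, d_model)

# Toy values standing in for sin/cos results (d_model // 2 = 3 values per row)
pe[:, 0::2] = torch.tensor([[1., 2., 3.]] * max_len)  # -> columns 0, 2, 4
pe[:, 1::2] = torch.tensor([[7., 8., 9.]] * max_len)  # -> columns 1, 3, 5

print(pe[0])  # tensor([1., 7., 2., 8., 3., 9.])
```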
Key Code Lines
```python
# Computing PE:
pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

# Applying PE:
x = x + self.pe[:, :seq_len, :]  # Broadcast + add
```

2.8 Properties Verification
Uniqueness
```python
def verify_uniqueness(pe_module, num_positions=100):
    """Verify each position has a unique encoding."""
    pe_matrix = pe_module.pe[0, :num_positions]  # [num_positions, d_model]

    for i in range(num_positions):
        for j in range(i + 1, num_positions):
            if torch.allclose(pe_matrix[i], pe_matrix[j], atol=1e-6):
                print(f"WARNING: Positions {i} and {j} have same encoding!")
                return False

    print(f"✓ All {num_positions} positions have unique encodings")
    return True
```

Boundedness
```python
def verify_boundedness(pe_module):
    """Verify values are bounded in [-1, 1]."""
    pe_matrix = pe_module.pe

    min_val = pe_matrix.min().item()
    max_val = pe_matrix.max().item()

    print(f"PE range: [{min_val:.4f}, {max_val:.4f}]")

    if min_val >= -1.0 and max_val <= 1.0:
        print("✓ Values are bounded in [-1, 1]")
        return True
    else:
        print("WARNING: Values exceed [-1, 1]")
        return False
```

Relative Position Property
```python
def verify_relative_position_property(pe_module, d_model=128):
    """Verify PE(pos+k) is a linear function of PE(pos)."""

    pe_matrix = pe_module.pe[0]  # [max_len, d_model]

    # Check for dimension pair 0,1 (i=0)
    pos = 10
    k = 5

    pe_pos = pe_matrix[pos, 0:2]        # [sin, cos] at pos
    pe_pos_k = pe_matrix[pos + k, 0:2]  # [sin, cos] at pos+k

    # Compute what the rotation matrix would give
    omega = 1.0  # For i=0, div_term = 1
    rotation = torch.tensor([
        [math.cos(omega * k), math.sin(omega * k)],
        [-math.sin(omega * k), math.cos(omega * k)]
    ])

    pe_pos_k_computed = rotation @ pe_pos

    error = (pe_pos_k - pe_pos_k_computed).abs().max().item()
    print(f"Rotation matrix error: {error:.6f}")

    if error < 1e-5:
        print("✓ Relative position property verified")
        return True
    else:
        print("WARNING: Relative position property not satisfied")
        return False


# Run verifications
pe = SinusoidalPositionalEncoding(d_model=128, max_len=200, dropout=0.0)
verify_uniqueness(pe)
verify_boundedness(pe)
verify_relative_position_property(pe)
```

2.9 Alternative Implementation Styles
Compute on the Fly
```python
class SinusoidalPEOnTheFly(nn.Module):
    """Compute PE during forward pass (no precomputation)."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

        # Store div_term as buffer
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        self.register_buffer('div_term', div_term)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        device = x.device

        position = torch.arange(seq_len, device=device).unsqueeze(1)

        pe = torch.zeros(seq_len, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(position * self.div_term)
        pe[:, 1::2] = torch.cos(position * self.div_term)

        return self.dropout(x + pe.unsqueeze(0))
```

Trade-offs:
- Precomputed: Faster forward pass, uses more memory
- On-the-fly: Slower forward pass, less memory, handles any length
Summary
The Formula

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Key Properties
| Property | How It's Achieved |
|---|---|
| Unique positions | Different wavelengths per dimension |
| Bounded values | sin/cos output in [-1, 1] |
| Relative positions | Linear (rotation) transformation |
| Generalization | No max length inherent in formula |
| Multi-scale | Low dims = local, high dims = global |
Implementation Notes
1. Precompute PE matrix for efficiency
2. Register as buffer (saved with model, not trained)
3. Add dropout after adding PE to embeddings
4. Handle variable sequence lengths by slicing
Next Section Preview
In the next section, we'll implement learned positional embeddings—an alternative approach used by models like BERT and GPT. We'll compare the trade-offs between sinusoidal and learned positions.