Introduction
The original Transformer paper introduced a clever solution to the position problem: sinusoidal positional encoding. Using sine and cosine functions at different frequencies, this method creates unique position representations that can theoretically generalize to any sequence length.
This section dives deep into the formula, explains why it works, and provides a complete PyTorch implementation.
2.1 The Formula
Complete Definition
For position pos and dimension pair index i:

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos: Position in sequence (0, 1, 2, ...)
- i: Dimension pair index (0, 1, 2, ..., d_model/2 - 1)
- d_model: Embedding dimension
- Even dimensions use sine, odd dimensions use cosine

Equivalent Formulation

Defining the angular frequency omega_i = 1 / 10000^(2i/d_model), the encoding becomes:

    PE(pos, 2i)   = sin(omega_i · pos)
    PE(pos, 2i+1) = cos(omega_i · pos)

Numerical Example

For d_model = 4, position 0 (all angles are zero):

    PE(0) = [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]

For d_model = 4, position 1 (omega_0 = 1, omega_1 = 1/100):

    PE(1) = [sin(1), cos(1), sin(0.01), cos(0.01)] ≈ [0.8415, 0.5403, 0.0100, 1.0000]
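The example values above can be reproduced with a few lines of Python (a standalone sanity check; sinusoidal_pe is a throwaway helper, not part of the later PyTorch implementation):

```python
import math

def sinusoidal_pe(pos, d_model=4):
    """Encode a single position with the sinusoidal formula above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print([round(v, 4) for v in sinusoidal_pe(0)])  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_pe(1)])  # [0.8415, 0.5403, 0.01, 1.0]
```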
2.2 Understanding the Frequency Pattern
Different Frequencies for Different Dimensions
The denominator 10000^(2i/d_model) creates different "wavelengths": dimension pair i completes one full cycle every 2π · 10000^(2i/d_model) positions, ranging from 2π ≈ 6.3 positions at i = 0 up to nearly 10000 · 2π positions at the highest dimensions.
Visual Intuition
```
Position:     0  1  2  3  4  5  6  7 ...

Dim 0,1:      ╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮╭──╮  (fast oscillation)
              ╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯╰──╯

Dim 2,3:      ╭────────╮╭────────╮╭────────╮     (medium oscillation)
              ╰────────╯╰────────╯╰────────╯

Dim d-2,d-1:  ╭───────────────────────────────   (very slow oscillation)
              (nearly constant over typical sequences)
```

Why Multiple Frequencies?
Like binary representation but continuous:
- Low dimensions: fine-grained position (nearby positions differ)
- High dimensions: coarse-grained position (global structure)
This allows the model to:
- Distinguish adjacent positions (low freq dimensions)
- Recognize large-scale position (high freq dimensions)
- Learn combinations for relative positions
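To put numbers on the scales involved, the short script below (illustrative only) prints the wavelength 2π · 10000^(2i/d_model) for a few dimension pairs at d_model = 512:

```python
import math

d_model = 512
pairs = [0, 1, 64, 128, 255]

# Wavelength of dimension pair i: one full sin/cos cycle spans 2π · 10000^(2i/d_model) positions
wavelengths = [2 * math.pi * (10000 ** (2 * i / d_model)) for i in pairs]

for i, w in zip(pairs, wavelengths):
    print(f"dim pair ({2 * i}, {2 * i + 1}): wavelength ≈ {w:,.1f} positions")

# Wavelengths grow geometrically from 2π ≈ 6.3 toward nearly 10000 · 2π positions,
# so the highest dimension pair is almost constant over typical sequence lengths.
```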
2.3 Why Sine and Cosine?
The Key Property: Linear Relative Position
For any fixed offset k, we can express:

    PE(pos + k) = T_k · PE(pos)

Where T_k is a linear transformation that depends only on k!

Mathematical Proof

Using the angle addition formulas:

    sin(a + b) = sin(a)·cos(b) + cos(a)·sin(b)
    cos(a + b) = cos(a)·cos(b) - sin(a)·sin(b)

For position pos + k at dimension pair (2i, 2i+1), with omega_i = 1/10000^(2i/d_model):

    sin(omega_i·(pos + k)) = sin(omega_i·pos)·cos(omega_i·k) + cos(omega_i·pos)·sin(omega_i·k)
    cos(omega_i·(pos + k)) = cos(omega_i·pos)·cos(omega_i·k) - sin(omega_i·pos)·sin(omega_i·k)

In matrix form:

    | sin(omega_i·(pos+k)) |   |  cos(omega_i·k)  sin(omega_i·k) | | sin(omega_i·pos) |
    | cos(omega_i·(pos+k)) | = | -sin(omega_i·k)  cos(omega_i·k) | | cos(omega_i·pos) |

This is a rotation matrix! PE(pos + k) is a rotation of PE(pos).
Why This Matters
The attention mechanism can learn to identify:
- "Token 3 positions before" → consistent transformation
- "Token 5 positions after" → different but consistent transformation
The linear relationship enables the model to learn relative positions even though we encode absolute positions!
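This property is easy to check numerically. The sketch below (standalone, plain math) picks the frequency of one dimension pair and confirms that rotating the encoding at pos by angle omega·k yields the encoding at pos + k:

```python
import math

omega = 1 / 10000 ** (4 / 128)  # frequency of dimension pair (4, 5) for d_model = 128
pos, k = 10, 5

# Encoding components at pos and at pos + k
s0, c0 = math.sin(omega * pos), math.cos(omega * pos)
s1, c1 = math.sin(omega * (pos + k)), math.cos(omega * (pos + k))

# Apply the rotation by omega * k — it depends only on k, not on pos
rot_s = math.cos(omega * k) * s0 + math.sin(omega * k) * c0
rot_c = -math.sin(omega * k) * s0 + math.cos(omega * k) * c0

assert math.isclose(rot_s, s1, abs_tol=1e-12)
assert math.isclose(rot_c, c1, abs_tol=1e-12)
print("Rotation by omega·k maps PE(pos) to PE(pos + k)")
```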
2.4 PyTorch Implementation
Complete Implementation
```python
import torch
import torch.nn as nn
import math


class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding from "Attention Is All You Need".

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    Args:
        d_model: Embedding dimension
        max_len: Maximum sequence length (for precomputation)
        dropout: Dropout probability

    Example:
        >>> pe = SinusoidalPositionalEncoding(d_model=512, max_len=1000)
        >>> x = torch.randn(2, 50, 512)  # [batch, seq_len, d_model]
        >>> output = pe(x)  # Same shape, with positions added
    """

    def __init__(
        self,
        d_model: int,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super().__init__()

        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = self._create_pe_matrix(max_len, d_model)

        # Register as buffer (not a parameter, but saved with model)
        self.register_buffer('pe', pe)

    def _create_pe_matrix(self, max_len: int, d_model: int) -> torch.Tensor:
        """
        Create the positional encoding matrix.

        Returns:
            pe: [1, max_len, d_model]
        """
        # Position indices: [max_len, 1]
        position = torch.arange(max_len).unsqueeze(1).float()

        # Inverse frequencies: [d_model/2]
        # div_term = 1 / 10000^(2i/d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Compute PE matrix
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

        # Add batch dimension: [1, max_len, d_model]
        pe = pe.unsqueeze(0)

        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input.

        Args:
            x: Input tensor [batch, seq_len, d_model]

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        seq_len = x.size(1)

        # Add positional encoding (broadcasts over batch)
        x = x + self.pe[:, :seq_len, :]

        return self.dropout(x)

    def extra_repr(self) -> str:
        return f'd_model={self.d_model}, max_len={self.pe.size(1)}'


# Test the implementation
def test_sinusoidal_pe():
    d_model = 512
    max_len = 100
    batch_size = 2
    seq_len = 50

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)

    x = torch.zeros(batch_size, seq_len, d_model)
    output = pe(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"PE buffer shape: {pe.pe.shape}")

    # Verify values at position 0
    print(f"\nPE at position 0, first 8 dims:")
    print(pe.pe[0, 0, :8])

    # Verify values are bounded
    print(f"\nPE min: {pe.pe.min():.4f}, max: {pe.pe.max():.4f}")

    # Verify each position is unique
    pe_matrix = pe.pe[0]  # [max_len, d_model]
    for i in range(min(5, seq_len)):
        for j in range(i + 1, min(5, seq_len)):
            diff = (pe_matrix[i] - pe_matrix[j]).abs().sum()
            assert diff > 0.1, f"Positions {i} and {j} too similar!"
    print("\n✓ All positions are unique")

    print("\n✓ Sinusoidal PE test passed!")


test_sinusoidal_pe()
```

Understanding the div_term Calculation
```python
# Equivalent but clearer:
i = torch.arange(0, d_model, 2)       # [0, 2, 4, ..., d_model-2]
denominator = 10000 ** (i / d_model)  # [1, 10000^(2/d), 10000^(4/d), ...]
div_term = 1 / denominator

# Used as:
pe[:, 0::2] = sin(position / denominator)  # = sin(position * div_term)
```

The log-exp trick avoids numerical issues:

```
div_term = torch.exp(i * (-log(10000) / d_model))
         = exp(log(10000^(-i/d_model)))
         = 10000^(-i/d_model)
         = 1 / 10000^(i/d_model)
```

2.5 Visualization
Heatmap of Positional Encodings
```python
import matplotlib.pyplot as plt
import numpy as np


def visualize_positional_encoding(d_model: int = 128, max_len: int = 100):
    """Visualize positional encodings as a heatmap."""

    pe = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [max_len, d_model]

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Full heatmap
    ax1 = axes[0]
    im1 = ax1.imshow(pe_matrix, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
    ax1.set_xlabel('Dimension')
    ax1.set_ylabel('Position')
    ax1.set_title('Sinusoidal Positional Encoding')
    plt.colorbar(im1, ax=ax1)

    # First few dimensions over positions
    ax2 = axes[1]
    positions = np.arange(max_len)
    for dim in [0, 1, 10, 11, 50, 51]:
        ax2.plot(positions, pe_matrix[:, dim], label=f'dim {dim}')
    ax2.set_xlabel('Position')
    ax2.set_ylabel('Encoding Value')
    ax2.set_title('Encoding Values Across Positions')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('sinusoidal_pe_visualization.png', dpi=150)
    plt.close()
    print("Saved sinusoidal_pe_visualization.png")


visualize_positional_encoding()
```

Position Similarity Matrix
```python
def visualize_position_similarity(d_model: int = 128, num_positions: int = 50):
    """Visualize similarity between position encodings."""

    pe = SinusoidalPositionalEncoding(d_model, num_positions, dropout=0.0)
    pe_matrix = pe.pe[0].numpy()  # [num_positions, d_model]

    # Compute cosine similarity between all position pairs
    # Normalize each row
    norms = np.linalg.norm(pe_matrix, axis=1, keepdims=True)
    pe_normalized = pe_matrix / norms

    # Similarity matrix
    similarity = pe_normalized @ pe_normalized.T

    plt.figure(figsize=(10, 8))
    plt.imshow(similarity, cmap='viridis')
    plt.colorbar(label='Cosine Similarity')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.title('Position Encoding Similarity Matrix')
    plt.savefig('position_similarity.png', dpi=150)
    plt.close()
    print("Saved position_similarity.png")


visualize_position_similarity()
```

2.6 Interactive Demo
Now that we've seen the theory and code, let's explore positional encodings interactively. Use the visualizer below to see how different dimensions create unique wave patterns, how positions are encoded, and how the encodings relate to each other.
[Interactive visualizer: sinusoidal waves by dimension, the PE matrix heatmap, and a side-by-side comparison of nearby positions]
Insight: Adjacent positions have high similarity but are still distinguishable. The dot product between PE(pos) and PE(pos+k) follows a predictable pattern due to the rotation property.
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

The wave frequency decreases exponentially with the dimension index. This creates a unique "fingerprint" for each position that the model can learn to interpret.
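The "predictable pattern" is in fact exact: expanding PE(pos)·PE(pos + k) with the angle-addition formula collapses it to Σ_i cos(omega_i·k), which depends only on the offset k and not on pos. A standalone check (pe here is a throwaway pure-Python helper):

```python
import math

def pe(pos, d_model=64):
    """Sinusoidal encoding of a single position as a plain list."""
    out = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

k = 7
# Same offset k from different starting positions -> identical dot product
d1 = dot(pe(3), pe(3 + k))
d2 = dot(pe(40), pe(40 + k))
assert math.isclose(d1, d2, rel_tol=1e-9)

# And it equals the closed form sum_i cos(omega_i * k)
expected = sum(math.cos(k / (10000 ** (2 * i / 64))) for i in range(32))
assert math.isclose(d1, expected, rel_tol=1e-9)
```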
Step-by-Step Pipeline
Now let's trace the complete flow of how a token becomes a position-aware embedding: token → embedding lookup → add sinusoidal positional encoding → position-aware embedding.

[Interactive animation: step-by-step pipeline from token to position-aware embedding]
Each token's final representation contains both its semantic meaning (from the embedding) and its position in the sequence (from the sinusoidal encoding).
2.7 Deep Dive: Implementation Details
Let's examine exactly how the positional encoding is computed and applied in PyTorch. This section breaks down two critical operations:
1. Sin/Cos Index Assignment: How pe[:, 0::2] and pe[:, 1::2] assign sine values to even dimensions and cosine values to odd dimensions.
2. Broadcasting and Addition: How x = x + self.pe[:, :seq_len, :] broadcasts the positional encoding across batches and adds it element-wise to token embeddings.
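The broadcasting step can be illustrated with toy shapes (a standalone sketch; the dimensions are arbitrary, not the real d_model):

```python
import torch

batch, seq_len, d_model = 2, 4, 6

x = torch.randn(batch, seq_len, d_model)  # token embeddings
pe = torch.randn(1, 10, d_model)          # stand-in for a precomputed PE buffer, max_len = 10

# Slice to the current sequence length, then broadcast over the batch dim
out = x + pe[:, :seq_len, :]              # [2, 4, 6] + [1, 4, 6] -> [2, 4, 6]

assert out.shape == (batch, seq_len, d_model)
# Every batch element receives the same positional offset
assert torch.allclose(out[0] - x[0], out[1] - x[1])
```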
Step 1: Understanding Index Slicing
For a PE tensor with d_model = 6, we have dimensions [0, 1, 2, 3, 4, 5].
- Even indices (0::2): dimensions [0, 2, 4] receive sine values
- Odd indices (1::2): dimensions [1, 3, 5] receive cosine values
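The interleaved assignment can be seen in a standalone toy example for d_model = 6, using placeholder values instead of actual sin/cos results:

```python
import torch

max_len, d_model = 3, 6
pe = torch.zeros(max_len, d_model)

# Toy values standing in for sin/cos results (d_model // 2 = 3 values per row)
pe[:, 0::2] = torch.tensor([[1., 2., 3.]] * max_len)  # -> columns 0, 2, 4
pe[:, 1::2] = torch.tensor([[7., 8., 9.]] * max_len)  # -> columns 1, 3, 5

print(pe[0])  # tensor([1., 7., 2., 8., 3., 9.])
```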
Key Code Lines
```python
# Computing PE:
pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

# Applying PE:
x = x + self.pe[:, :seq_len, :]  # Broadcast + add
```

2.8 Properties Verification
Uniqueness
```python
def verify_uniqueness(pe_module, num_positions=100):
    """Verify each position has a unique encoding."""
    pe_matrix = pe_module.pe[0, :num_positions]  # [num_positions, d_model]

    for i in range(num_positions):
        for j in range(i + 1, num_positions):
            if torch.allclose(pe_matrix[i], pe_matrix[j], atol=1e-6):
                print(f"WARNING: Positions {i} and {j} have same encoding!")
                return False

    print(f"✓ All {num_positions} positions have unique encodings")
    return True
```

Boundedness
```python
def verify_boundedness(pe_module):
    """Verify values are bounded in [-1, 1]."""
    pe_matrix = pe_module.pe

    min_val = pe_matrix.min().item()
    max_val = pe_matrix.max().item()

    print(f"PE range: [{min_val:.4f}, {max_val:.4f}]")

    if min_val >= -1.0 and max_val <= 1.0:
        print("✓ Values are bounded in [-1, 1]")
        return True
    else:
        print("WARNING: Values exceed [-1, 1]")
        return False
```

Relative Position Property
```python
def verify_relative_position_property(pe_module, d_model=128):
    """Verify PE(pos+k) is a linear function of PE(pos)."""

    pe_matrix = pe_module.pe[0]  # [max_len, d_model]

    # Check for dimension pair 0,1 (i=0)
    pos = 10
    k = 5

    pe_pos = pe_matrix[pos, 0:2]        # [sin, cos] at pos
    pe_pos_k = pe_matrix[pos + k, 0:2]  # [sin, cos] at pos+k

    # Compute what the rotation matrix would give
    omega = 1.0  # For i=0, div_term = 1
    rotation = torch.tensor([
        [math.cos(omega * k), math.sin(omega * k)],
        [-math.sin(omega * k), math.cos(omega * k)]
    ])

    pe_pos_k_computed = rotation @ pe_pos

    error = (pe_pos_k - pe_pos_k_computed).abs().max().item()
    print(f"Rotation matrix error: {error:.6f}")

    if error < 1e-5:
        print("✓ Relative position property verified")
        return True
    else:
        print("WARNING: Relative position property not satisfied")
        return False


# Run verifications
pe = SinusoidalPositionalEncoding(d_model=128, max_len=200, dropout=0.0)
verify_uniqueness(pe)
verify_boundedness(pe)
verify_relative_position_property(pe)
```

2.9 Alternative Implementation Styles
Compute on the Fly
```python
class SinusoidalPEOnTheFly(nn.Module):
    """Compute PE during forward pass (no precomputation)."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

        # Store div_term as buffer
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        self.register_buffer('div_term', div_term)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        device = x.device

        position = torch.arange(seq_len, device=device).unsqueeze(1)

        pe = torch.zeros(seq_len, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(position * self.div_term)
        pe[:, 1::2] = torch.cos(position * self.div_term)

        return self.dropout(x + pe.unsqueeze(0))
```

Trade-offs:
- Precomputed: Faster forward pass, uses more memory
- On-the-fly: Slower forward pass, less memory, handles any length
Summary
The Formula

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Key Properties
| Property | How It's Achieved |
|---|---|
| Unique positions | Different wavelengths per dimension |
| Bounded values | sin/cos output in [-1, 1] |
| Relative positions | Linear (rotation) transformation |
| Generalization | No max length inherent in formula |
| Multi-scale | Low dims = local, high dims = global |
Implementation Notes
1. Precompute PE matrix for efficiency
2. Register as buffer (saved with model, not trained)
3. Add dropout after adding PE to embeddings
4. Handle variable sequence lengths by slicing
Next Section Preview
In the next section, we'll implement learned positional embeddings—an alternative approach used by models like BERT and GPT. We'll compare the trade-offs between sinusoidal and learned positions.