Introduction
We've built attention mechanisms that can relate any two tokens directly. But there's a fundamental problem: attention doesn't know about position. The sentence "dog bites man" and "man bites dog" would produce identical attention patterns if we don't add positional information.
This section explains why position matters, why attention is "position-blind," and what properties we need from a good positional encoding.
1.1 The Permutation Invariance Problem
What is Permutation Invariance?
A function is permutation invariant if reordering its inputs doesn't change the output at all. A function is permutation equivariant if reordering its inputs reorders the output in exactly the same way:

```
invariant:    f([a, b, c]) = f([c, a, b])
equivariant:  f([a, b, c]) = [a', b', c']  implies  f([c, a, b]) = [c', a', b']
```

Self-Attention is Permutation Equivariant
Consider self-attention on tokens [A, B, C]:
```
Original: Attention([A, B, C]) = [A', B', C']

Permuted: Attention([C, A, B]) = [C', A', B']
```

The output for token A is the same regardless of where A appears in the input: it's just a weighted sum of all token values.
Mathematical Proof
For self-attention:
```
output[i] = Σ_j softmax(Q[i] · K[j]) × V[j]
```

This sum is over all positions j. The position i only determines which query we use, not which keys/values are available.
If we swap positions of tokens, the output swaps accordingly: the function is equivariant to permutation.
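The argument above can be checked numerically. Below is a minimal sketch of single-head self-attention with identity projections (Q = K = V = X, no masking, no scaling); the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def self_attention(X):
    """Softmax attention with Q = K = V = X (no projections, no masking)."""
    scores = X @ X.T                                        # pairwise dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ X                                      # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))          # tokens [A, B, C], d_model = 4
perm = [2, 0, 1]                     # reorder to [C, A, B]

out = self_attention(X)
out_perm = self_attention(X[perm])

# Output rows permute exactly the same way as the input rows: equivariance.
assert np.allclose(out_perm, out[perm])
```

Each token's output vector is unchanged; only its row index moves, which is precisely permutation equivariance.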
Why This Is a Problem
Language is order-dependent:
```
"The dog bit the man"  →  Dog is the attacker
"The man bit the dog"  →  Man is the attacker
```

Without positional information, attention treats these as equivalent!
1.2 Why Order Matters in Language
Syntactic Dependency
Word order determines grammatical roles:
| Sentence | Subject | Object |
|---|---|---|
| "John loves Mary" | John | Mary |
| "Mary loves John" | Mary | John |
The same words mean different things based on position.
Semantic Interpretation
Position affects meaning even without changing grammar:
```
"Only I love you"   →  I love you, and no one else does
"I only love you"   →  Loving is all I do where you're concerned
"I love only you"   →  You're the only one I love
```

Information Flow
In coherent text, information flows positionally:
```
"The scientist discovered a new particle. She published her findings."
```

"She" refers to "scientist" → position helps resolve coreference.

Fixed Expressions and Idioms
Many expressions depend on exact word order:
```
"Kick the bucket"  =  die (idiom)
"Bucket the kick"  =  nonsense
```

1.3 How RNNs Handled Position
Implicit Positional Information
RNNs process sequences step by step:
```
h_0 → h_1 → h_2 → h_3 → ...
 ↑     ↑     ↑     ↑
x_0   x_1   x_2   x_3
```

The hidden state h_t "knows" it came after h_{t-1}:
- Position is implicit in the recurrence
- Earlier tokens influence later hidden states
- No explicit position encoding needed
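The recurrence above can be sketched in a few lines. This is a hypothetical minimal RNN cell (names `rnn_forward`, `W_x`, `W_h` are illustrative) showing that order is baked into the computation itself:

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, h0):
    """Process tokens strictly left to right; each step depends on the last."""
    h = h0
    states = []
    for x in xs:                          # sequential: no parallelism over time
        h = np.tanh(W_x @ x + W_h @ h)    # h_t is a function of h_{t-1}
        states.append(h)
    return states

rng = np.random.default_rng(1)
d = 4
xs = [rng.normal(size=d) for _ in range(3)]
W_x, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

states = rnn_forward(xs, W_x, W_h, np.zeros(d))
# Reversing the inputs changes the hidden states: the RNN is order-sensitive.
states_rev = rnn_forward(xs[::-1], W_x, W_h, np.zeros(d))
assert not np.allclose(states[-1], states_rev[-1])
```

Because each `h` depends on the previous one, the loop cannot be parallelized over time steps, which is exactly the limitation discussed next.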
The Price of Implicit Position
This sequential processing caused RNN limitations:
- Can't parallelize
- Long-range dependencies are hard
- Gradient flow issues
Transformers solve these problems, but lose implicit position.
1.4 Requirements for Good Positional Encoding
Desirable Properties
A good positional encoding should be:
| Property | Description | Why It Matters |
|---|---|---|
| Unique | Different positions → different encodings | Distinguishes positions |
| Bounded | Values don't grow unboundedly | Numerical stability |
| Deterministic | Same position → same encoding | Reproducibility |
| Generalizable | Works for unseen lengths | Handles long sequences |
| Smooth | Nearby positions → similar encodings | Captures locality |
| Relative-aware | Enables learning relative positions | "3 positions apart" |
What Doesn't Work
Simple integer positions:
```
PE(0) = 0, PE(1) = 1, PE(2) = 2, ...
```

- Not bounded: values grow without limit as positions increase
- Magnitude varies with position (bad for learning)
Normalized positions:
```
PE(i) = i / seq_len  →  [0, 0.1, 0.2, ..., 1.0]
```

- Depends on sequence length
- Same position has different values for different lengths
One-hot positions:
```
PE(0) = [1, 0, 0, ...], PE(1) = [0, 1, 0, ...], ...
```

- Doesn't capture relationships between positions
- Requires knowing max length in advance
- Very high dimensional
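Each of these failure modes can be verified directly. A quick NumPy sketch (values and names are illustrative):

```python
import numpy as np

# Integer positions: magnitude is unbounded.
int_pe = np.arange(10_000)
assert int_pe.max() == 9_999              # keeps growing with sequence length

# Normalized positions: the same absolute position gets a different value
# depending on the sequence length it sits in.
pe_short = 5 / 10                         # position 5 in a 10-token sequence
pe_long = 5 / 100                         # position 5 in a 100-token sequence
assert pe_short != pe_long

# One-hot positions: every pair of distinct positions is equally far apart,
# so no notion of "nearby positions" survives.
one_hot = np.eye(8)
d01 = np.linalg.norm(one_hot[0] - one_hot[1])   # adjacent positions
d07 = np.linalg.norm(one_hot[0] - one_hot[7])   # distant positions
assert d01 == d07
```

The one-hot check is the most telling: a good encoding should make positions 0 and 1 look more similar than positions 0 and 7, and one-hot vectors cannot.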
1.5 Absolute vs Relative Position
Absolute Position
What it encodes: "This token is at position 5"
Implementation: Add position-specific vector to each token
```
input[i] = token_embedding[i] + position_encoding[i]
```

Pros:
- Simple to implement
- Each position has unique representation
Cons:
- "Position 5" is arbitrary; what often matters is relative position
- May not generalize to longer sequences
Relative Position
What it encodes: "Token A is 3 positions before token B"
Implementation: Modify attention to include relative position
```
# In attention scores
scores[i, j] = Q[i] · K[j] + relative_bias[i - j]
```

Pros:
- Captures "distance" between tokens
- Generalizes better to long sequences
Cons:
- More complex to implement
- Requires attention mechanism changes
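To make the score modification concrete, here is a hypothetical sketch of a T5-style relative position bias: a learned scalar is added to each attention score based only on the offset i − j (the function and variable names are illustrative):

```python
import numpy as np

def scores_with_relative_bias(Q, K, bias):
    """bias[d] holds one learned scalar per offset, indexed -(n-1)..(n-1)."""
    n = Q.shape[0]
    scores = Q @ K.T
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # offset i - j
    return scores + bias[offsets + (n - 1)]                   # shift to >= 0

rng = np.random.default_rng(2)
n, d = 4, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
bias = rng.normal(size=2 * n - 1)     # one scalar for each possible offset

scores = scores_with_relative_bias(Q, K, bias)

# Every query/key pair at the same offset shares the same bias term,
# which is what lets the model learn "3 positions apart" as a concept.
assert np.isclose(scores[1, 0] - Q[1] @ K[0], scores[3, 2] - Q[3] @ K[2])
```

Because the bias depends only on the offset, the same table generalizes to any pair of positions at that distance, regardless of where they sit in the sequence.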
Modern Approaches
| Model | Position Type | Method |
|---|---|---|
| Original Transformer | Absolute | Sinusoidal encoding |
| BERT | Absolute | Learned embeddings |
| T5 | Relative | Relative position bias |
| GPT-3 | Absolute | Learned embeddings |
| RoPE (LLaMA) | Both | Rotary embeddings |
| ALiBi | Relative | Attention bias |
1.6 The Solution: Positional Encoding
The Core Idea
Add positional information to token embeddings:
```
input = token_embedding + positional_encoding
```

Where:
- token_embedding: What the token means (semantic)
- positional_encoding: Where the token is (structural)
Visual Representation
```
Token:      "The"    "cat"    "sat"
Position:     0        1        2

Token emb: [0.2,    [0.8,    [0.4,
            0.8,     0.3,     0.6,
            0.1]     0.7]     0.5]

Pos enc:   [0.0,    [0.84,   [0.91,
            1.0,     0.54,   -0.42,
            0.0]     0.84]    0.91]

Combined:  [0.2,    [1.64,   [1.31,
            1.8,     0.84,    0.18,
            0.1]     1.54]    1.41]
```

Why Addition?
We add (not concatenate) positional encoding:
1. Keeps dimension constant (d_model)
2. Allows position to modulate meaning
3. More parameter efficient
4. Works well empirically
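The addition step is a one-liner in practice. Below is a minimal NumPy sketch reproducing the worked example above (the embedding values are the illustrative numbers from the diagram, not from a trained model):

```python
import numpy as np

# Illustrative 3-dimensional token embeddings (from the diagram above).
token_emb = {
    "The": np.array([0.2, 0.8, 0.1]),
    "cat": np.array([0.8, 0.3, 0.7]),
    "sat": np.array([0.4, 0.6, 0.5]),
}
# Illustrative positional encodings for positions 0, 1, 2.
pos_enc = np.array([
    [0.0,  1.0,  0.0],
    [0.84, 0.54, 0.84],
    [0.91, -0.42, 0.91],
])

tokens = ["The", "cat", "sat"]
combined = np.stack([token_emb[t] for t in tokens]) + pos_enc

# Dimension is unchanged (addition, not concatenation).
assert combined.shape == (3, 3)
assert np.allclose(combined[0], [0.2, 1.8, 0.1])
```

Note that `combined` has the same shape as the token embeddings alone; concatenation would instead double the width and force every downstream weight matrix to grow.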
1.7 Preview: Two Main Approaches
Sinusoidal Positional Encoding
From the original Transformer paper:
```
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
```

Properties:
- Deterministic (no learned parameters)
- Can extrapolate to longer sequences
- Relative positions are linear combinations
- Fixed once computed
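The formulas above translate directly into a vectorized implementation. A sketch (function name `sinusoidal_pe` is illustrative), with quick checks of two desirable properties, bounded values and uniqueness per position:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding; d_model is assumed even here."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sin
    pe[:, 1::2] = np.cos(angles)               # odd dims get cos
    return pe

pe = sinusoidal_pe(128, 64)

assert np.abs(pe).max() <= 1.0                 # bounded in [-1, 1]
assert len(np.unique(pe.round(8), axis=0)) == 128   # every position distinct
```

No parameters are learned: the table is computed once from the formula, which is also why it can be extended to sequence lengths never seen in training.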
Learned Positional Embeddings
Used by BERT, GPT:
```
position_embedding = nn.Embedding(max_seq_len, d_model)
```

Properties:
- Learned from data
- May capture task-specific patterns
- Limited to max_seq_len positions
- More parameters
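Conceptually, `nn.Embedding` here is just a trainable lookup table with one row per position. A hypothetical NumPy stand-in (names `position_table` and `lookup_positions` are illustrative; in a real model the table would be updated by gradient descent rather than fixed at random initialization):

```python
import numpy as np

max_seq_len, d_model = 512, 64
rng = np.random.default_rng(3)

# Trainable table: one d_model-dimensional row per position.
position_table = rng.normal(scale=0.02, size=(max_seq_len, d_model))

def lookup_positions(seq_len):
    """Return the position vectors for positions 0..seq_len-1."""
    if seq_len > max_seq_len:
        # Unlike sinusoidal encoding, there is simply no row to look up.
        raise ValueError("cannot extrapolate past max_seq_len")
    return position_table[:seq_len]

assert lookup_positions(10).shape == (10, 64)
```

The hard `max_seq_len` limit in the lookup is the key trade-off versus the sinusoidal scheme: flexibility to fit the data, at the cost of a fixed maximum length.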
1.8 Impact on Attention
Without Position: Ambiguous Attention
```
"The cat sat on the mat"

Q("sat") looking for subject:
- "The" at position 0: similarity S1
- "cat" at position 1: similarity S2
- "mat" at position 5: similarity S3
```

Without position, we can't distinguish "The" at position 0 from "the" at position 4: both have identical embeddings.

With Position: Disambiguated
After adding positional encoding:

```
"The" (pos 0) embedding = token_emb + pos_enc(0) = unique!
"the" (pos 4) embedding = token_emb + pos_enc(4) = different!
```

Now attention can learn:
- The first "The" is likely part of the subject
- The second "the" is likely part of a prepositional phrase

Summary
The Problem
| Without Position | With Position |
|---|---|
| Attention is permutation equivariant | Order is encoded |
| "dog bites man" = "man bites dog" | Distinct representations |
| Can't distinguish repeated tokens | Each occurrence is unique |
| Loses syntactic information | Captures word order |
Key Concepts
1. Permutation Equivariance: Self-attention treats input as a set, not sequence
2. Order Matters: Language meaning depends heavily on position
3. Positional Encoding: Add position-specific vectors to embeddings
4. Two Approaches: Sinusoidal (fixed) vs Learned (trained)
5. Desirable Properties: Unique, bounded, generalizable, smooth
Next Section Preview
In the next section, we'll implement sinusoidal positional encoding from the original Transformer paper. We'll understand why the authors chose this specific formula and implement it in PyTorch.