Chapter 4

The Position Problem in Transformers

Positional Encoding and Embeddings

Introduction

We've built attention mechanisms that can relate any two tokens directly. But there's a fundamental problem: attention doesn't know about position. The sentence "dog bites man" and "man bites dog" would produce identical attention patterns if we don't add positional information.

This section explains why position matters, why attention is "position-blind," and what properties we need from a good positional encoding.


1.1 The Permutation Invariance Problem

What is Permutation Invariance?

A function is permutation invariant if reordering its inputs leaves the output completely unchanged:

```text
f([a, b, c]) = f([c, a, b])
```

A function is permutation equivariant if reordering the inputs reorders the outputs in exactly the same way, without changing their values. Self-attention, as we'll see next, is the latter.

Self-Attention is Permutation Equivariant

Consider self-attention on tokens [A, B, C]:

πŸ“text
1Original: Attention([A, B, C]) = [A', B', C']
2
3Permuted: Attention([C, A, B]) = [C', A', B']

The output for token A is the same regardless of where A appears in the input: it's just a weighted sum of all token values.

Mathematical Proof

For self-attention:

πŸ“text
1output[i] = Ξ£_j softmax(Q[i] Β· K[j]) Γ— V[j]

This sum is over all positions j. The position i only determines which query we use, not which keys/values are available.

If we swap positions of tokens, the output swaps accordingly: the function is equivariant to permutation.
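
This is easy to check numerically. Below is a minimal NumPy sketch (one head, random weights, no positional information; `self_attention` is an illustrative helper, not code from this chapter): permuting the input rows permutes the output rows identically.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over the key axis j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))              # tokens [A, B, C]
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 0, 1]                         # reorder to [C, A, B]
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: the outputs are the same rows, just reordered
print(np.allclose(out[perm], out_perm))  # True
```

No matter how the tokens are shuffled, each token's output vector is unchanged, which is exactly the problem this chapter sets out to fix.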

Why This Is a Problem

Language is order-dependent:

πŸ“text
1"The dog bit the man" β†’ Dog is the attacker
2"The man bit the dog" β†’ Man is the attacker

Without positional information, attention sees these as the same bag of tokens: the per-token outputs are identical up to reordering!


1.2 Why Order Matters in Language

Syntactic Dependency

Word order determines grammatical roles:

| Sentence | Subject | Object |
| --- | --- | --- |
| "John loves Mary" | John | Mary |
| "Mary loves John" | Mary | John |

The same words mean different things based on position.

Semantic Interpretation

Position affects meaning even without changing grammar:

πŸ“text
1"Only I love you"     β†’ I love you, no one else does
2"I only love you"     β†’ I love you, no one else
3"I love only you"     β†’ You're the only one I love

Information Flow

In coherent text, information flows positionally:

πŸ“text
1"The scientist discovered a new particle. She published her findings."
2
3"She" refers to "scientist" β†’ Position helps resolve coreference

Fixed Expressions and Idioms

Many expressions depend on exact word order:

πŸ“text
1"Kick the bucket" = die (idiom)
2"Bucket the kick" = nonsense

1.3 How RNNs Handled Position

Implicit Positional Information

RNNs process sequences step by step:

πŸ“text
1h_0 β†’ h_1 β†’ h_2 β†’ h_3 β†’ ...
2 ↑     ↑     ↑     ↑
3x_0   x_1   x_2   x_3

The hidden state h_t "knows" it came after h_{t-1}:

- Position is implicit in the recurrence

- Earlier tokens influence later hidden states

- No explicit position encoding needed
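
A minimal sketch of that recurrence makes the point concrete (assuming a vanilla RNN cell `h_t = tanh(W h_{t-1} + U x_t)` with illustrative random weights): reversing the input sequence changes the final hidden state, so order is baked into the computation.

```python
import numpy as np

def rnn_final_state(xs, W, U):
    """Vanilla RNN: h_t = tanh(W @ h_{t-1} + U @ x_t), return last h."""
    h = np.zeros(W.shape[0])
    for x in xs:                 # strictly sequential: position is implicit
        h = np.tanh(W @ h + U @ x)
    return h

rng = np.random.default_rng(1)
d = 3
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = rng.normal(size=(4, d))     # a sequence of 4 input vectors

h_orig = rnn_final_state(xs, W, U)
h_rev = rnn_final_state(xs[::-1], W, U)   # same vectors, reversed order
print(np.allclose(h_orig, h_rev))         # False: the RNN is order-sensitive
```

Contrast this with the attention example earlier, where reordering the input only reordered the output.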

The Price of Implicit Position

This sequential processing caused RNN limitations:

- Can't parallelize

- Long-range dependencies are hard

- Gradient flow issues

Transformers solve these problems, but lose implicit position.


1.4 Requirements for Good Positional Encoding

Desirable Properties

A good positional encoding should be:

| Property | Description | Why It Matters |
| --- | --- | --- |
| Unique | Different positions → different encodings | Distinguishes positions |
| Bounded | Values don't grow unboundedly | Numerical stability |
| Deterministic | Same position → same encoding | Reproducibility |
| Generalizable | Works for unseen lengths | Handles long sequences |
| Smooth | Nearby positions → similar encodings | Captures locality |
| Relative-aware | Enables learning relative positions | "3 positions apart" |

What Doesn't Work

Simple integer positions:

πŸ“text
1PE(0) = 0, PE(1) = 1, PE(2) = 2, ...

- Not bounded (grows unboundedly)

- Magnitude varies with position (bad for learning)

Normalized positions:

πŸ“text
1PE(i) = i / seq_len  β†’  [0, 0.1, 0.2, ..., 1.0]

- Depends on sequence length

- Same position has different values for different lengths

One-hot positions:

πŸ“text
1PE(0) = [1,0,0,...], PE(1) = [0,1,0,...], ...

- Doesn't capture relationships between positions

- Requires knowing max length in advance

- Very high dimensional
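
These failure modes take only a few lines of arithmetic to see (all values illustrative):

```python
import numpy as np

# Integer positions: unbounded magnitude
pe_int = list(range(1000))
print(max(pe_int))            # 999, and it keeps growing with length

# Normalized positions: the SAME position gets DIFFERENT values
pos = 5
print(pos / 10)               # 0.5  in a length-10 sequence
print(pos / 20)               # 0.25 in a length-20 sequence

# One-hot: neighboring positions are orthogonal, so the encoding
# carries no notion of "position 3 is close to position 4"
onehot = np.eye(10)
print(onehot[3] @ onehot[4])  # 0.0: positions 3 and 4 look unrelated
```

Each naive scheme violates at least one property from the table above: integers break boundedness, normalization breaks determinism across lengths, and one-hot breaks smoothness.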


1.5 Absolute vs Relative Position

Absolute Position

What it encodes: "This token is at position 5"

Implementation: Add position-specific vector to each token

```python
input[i] = token_embedding[i] + position_encoding[i]
```

Pros:

- Simple to implement

- Each position has unique representation

Cons:

- "Position 5" is arbitrary; what matters is often relative position

- May not generalize to longer sequences

Relative Position

What it encodes: "Token A is 3 positions before token B"

Implementation: Modify attention to include relative position

```python
# In attention scores
scores[i,j] = Q[i] · K[j] + relative_bias[i-j]
```

Pros:

- Captures "distance" between tokens

- Generalizes better to long sequences

Cons:

- More complex to implement

- Requires attention mechanism changes
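
To make the `relative_bias[i-j]` idea concrete, here is a simplified sketch (a T5-style additive scalar bias per offset, with random values standing in for learned ones; this is not any specific model's implementation):

```python
import numpy as np

def attention_with_relative_bias(Q, K, V, bias):
    """Attention where scores get an additive term indexed by offset i - j.

    `bias` holds one scalar per offset in [-(n-1), n-1], mirroring
    scores[i,j] = Q[i] · K[j] + relative_bias[i-j] from the text.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]  # matrix of i - j
    scores = scores + bias[offsets + (n - 1)]                # shift offsets to >= 0
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax over j
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
n, d = 5, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
bias = rng.normal(size=2 * n - 1)   # one (would-be learned) scalar per offset
out = attention_with_relative_bias(Q, K, V, bias)
print(out.shape)  # (5, 4)
```

Notice that the bias depends only on the distance `i - j`, so the same bias applies to "3 positions apart" anywhere in the sequence, which is exactly why relative schemes generalize to longer inputs.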

Modern Approaches

| Model | Position Type | Method |
| --- | --- | --- |
| Original Transformer | Absolute | Sinusoidal encoding |
| BERT | Absolute | Learned embeddings |
| T5 | Relative | Relative position bias |
| GPT-3 | Absolute | Learned embeddings |
| RoPE (LLaMA) | Both | Rotary embeddings |
| ALiBi | Relative | Attention bias |

1.6 The Solution: Positional Encoding

The Core Idea

Add positional information to token embeddings:

πŸ“text
1input = token_embedding + positional_encoding

Where:

- token_embedding: What the token means (semantic)

- positional_encoding: Where the token is (structural)

Visual Representation

πŸ“text
1Token:     "The"   "cat"   "sat"
2Position:    0       1       2
3
4Token emb: [0.2,   [0.8,   [0.4,
5            0.8,    0.3,    0.6,
6            0.1]    0.7]    0.5]
7
8Pos enc:   [0.0,   [0.84,  [0.91,
9            1.0,    0.54,   -0.42,
10            0.0]    0.84]   0.91]
11
12Combined:  [0.2,   [1.64,  [1.31,
13            1.8,    0.84,   0.18,
14            0.1]    1.54]   1.41]

Why Addition?

We add (not concatenate) positional encoding:

1. Keeps dimension constant (d_model)

2. Allows position to modulate meaning

3. More parameter efficient

4. Works well empirically
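
Point 1 is visible directly in the shapes (illustrative NumPy only):

```python
import numpy as np

d_model, seq_len = 8, 6
rng = np.random.default_rng(3)
tok = rng.normal(size=(seq_len, d_model))   # token embeddings
pos = rng.normal(size=(seq_len, d_model))   # positional encodings

added = tok + pos                            # element-wise sum
concat = np.concatenate([tok, pos], axis=-1) # the alternative we rejected

print(added.shape)   # (6, 8):  d_model unchanged
print(concat.shape)  # (6, 16): dimension doubles, so every downstream
                     # weight matrix would have to grow as well
```

Concatenation would force every subsequent projection to handle `2 * d_model` inputs, which is the parameter cost point 3 refers to.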


1.7 Preview: Two Main Approaches

Sinusoidal Positional Encoding

From the original Transformer paper:

πŸ“text
1PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
2PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Properties:

- Deterministic (no learned parameters)

- Can extrapolate to longer sequences

- PE(pos + k) is a linear function of PE(pos), which makes relative offsets learnable

- Fixed once computed
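
A minimal NumPy sketch of the formula, handy for checking the bounded and deterministic properties directly (the full PyTorch version comes in the next section):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cos on odd dims."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 16)
print(pe.shape)                                 # (100, 16)
print(np.abs(pe).max() <= 1.0)                  # True: bounded in [-1, 1]
print(np.allclose(pe, sinusoidal_pe(100, 16)))  # True: deterministic
```

Because nothing here is learned, the same function can be evaluated at positions beyond any length seen in training.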

Learned Positional Embeddings

Used by BERT, GPT:

```python
position_embedding = nn.Embedding(max_seq_len, d_model)
```

Properties:

- Learned from data

- May capture task-specific patterns

- Limited to max_seq_len positions

- More parameters
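
Conceptually, `nn.Embedding` is just a trainable lookup table indexed by position. A NumPy stand-in (a random matrix in place of learned weights) shows the mechanics and the hard length limit:

```python
import numpy as np

max_seq_len, d_model = 512, 8
rng = np.random.default_rng(5)

# Stand-in for nn.Embedding's weight matrix: one learnable row per position
position_table = rng.normal(size=(max_seq_len, d_model))

seq_len = 6
tok_emb = rng.normal(size=(seq_len, d_model))
pos_emb = position_table[np.arange(seq_len)]  # look up rows 0..seq_len-1
x = tok_emb + pos_emb                         # same addition as before
print(x.shape)  # (6, 8)

# There is no row for position >= max_seq_len, so a length-513 sequence
# cannot be encoded without resizing (and retraining) the table.
```

This is the "limited to max_seq_len positions" trade-off: the table rows are free parameters shaped by the training data, but only for positions the table actually contains.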


1.8 Impact on Attention

Without Position: Ambiguous Attention

πŸ“text
1"The cat sat on the mat"
2
3Q("sat") looking for subject:
4- "The" at position 0: similarity S1
5- "cat" at position 1: similarity S2
6- "mat" at position 5: similarity S3
7
8Without position: Can't distinguish "The" at position 0 from "the" at position 4!
9Both have identical embeddings.

With Position: Disambiguated

πŸ“text
1After adding positional encoding:
2
3"The" (pos 0) embedding = token_emb + pos_enc(0) = unique!
4"the" (pos 4) embedding = token_emb + pos_enc(4) = different!
5
6Now attention can learn:
7- First "The" is likely part of subject
8- Second "the" is likely part of prepositional phrase
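
A quick check using the sinusoidal formula from section 1.7 (the `sinusoidal_pe` helper and the embedding values are illustrative): the single shared token embedding for "the" becomes two distinct input vectors once position is added.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# "the" has ONE token embedding, reused at positions 0 and 4
the_emb = np.array([0.2, 0.8, 0.1, 0.5])
pe = sinusoidal_pe(8, 4)

first_the = the_emb + pe[0]    # "The" at position 0
second_the = the_emb + pe[4]   # "the" at position 4
print(np.allclose(first_the, second_the))  # False: now distinguishable
```

Since the two occurrences now produce different queries and keys, attention can treat them differently, which is impossible with the bare token embeddings.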

Summary

The Problem

| Without Position | With Position |
| --- | --- |
| Attention is permutation equivariant | Order is encoded |
| "dog bites man" = "man bites dog" | Distinct representations |
| Can't distinguish repeated tokens | Each occurrence is unique |
| Loses syntactic information | Captures word order |

Key Concepts

1. Permutation Equivariance: Self-attention treats its input as a set, not a sequence

2. Order Matters: Language meaning depends heavily on position

3. Positional Encoding: Add position-specific vectors to embeddings

4. Two Approaches: Sinusoidal (fixed) vs Learned (trained)

5. Desirable Properties: Unique, bounded, generalizable, smooth


Next Section Preview

In the next section, we'll implement sinusoidal positional encoding from the original Transformer paper. We'll understand why the authors chose this specific formula and implement it in PyTorch.