Chapter 4

The Position Problem in Transformers

Positional Encoding and Embeddings

Introduction

We've built attention mechanisms that can relate any two tokens directly. But there's a fundamental problem: attention doesn't know about position. The sentence "dog bites man" and "man bites dog" would produce identical attention patterns if we don't add positional information.

This section explains why position matters, why attention is "position-blind," and what properties we need from a good positional encoding.


1.1 The Permutation Invariance Problem

What is Permutation Invariance?

A function is permutation invariant if reordering its inputs leaves the output completely unchanged:

```text
f([a, b, c]) = f([c, a, b])
```

A function is permutation equivariant if reordering the inputs reorders the outputs in exactly the same way, without changing their values. Self-attention, as we'll see next, is the latter.

Self-Attention is Permutation Equivariant

Consider self-attention on tokens [A, B, C]:

πŸ“text
1Original: Attention([A, B, C]) = [A', B', C']
2
3Permuted: Attention([C, A, B]) = [C', A', B']

The output for token A is the same regardless of where A appears in the input: it's just a weighted sum of all token values.

Mathematical Proof

For self-attention:

πŸ“text
1output[i] = Ξ£_j softmax(Q[i] Β· K[j]) Γ— V[j]

This sum is over all positions j. The position i only determines which query we use, not which keys/values are available.

If we swap positions of tokens, the output swaps accordingly: the function is equivariant to permutation.
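
This is easy to check numerically. Below is a minimal NumPy sketch (one head, random weights, no positional information; `self_attention` is an illustrative helper, not code from this chapter): permuting the input rows permutes the output rows identically.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over the key axis j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))              # tokens [A, B, C]
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 0, 1]                         # reorder to [C, A, B]
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: the outputs are the same rows, just reordered
print(np.allclose(out[perm], out_perm))  # True
```

No matter how the tokens are shuffled, each token's output vector is unchanged, which is exactly the problem this chapter sets out to fix.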

Why This Is a Problem

Language is order-dependent:

πŸ“text
1"The dog bit the man" β†’ Dog is the attacker
2"The man bit the dog" β†’ Man is the attacker

Without positional information, attention sees these as the same bag of tokens: the per-token outputs are identical up to reordering!


1.2 Why Order Matters in Language

Syntactic Dependency

Word order determines grammatical roles:

| Sentence | Subject | Object |
| --- | --- | --- |
| "John loves Mary" | John | Mary |
| "Mary loves John" | Mary | John |

The same words mean different things based on position.

Semantic Interpretation

Position affects meaning even without changing grammar:

πŸ“text
1"Only I love you"     β†’ I love you, no one else does
2"I only love you"     β†’ I love you, no one else
3"I love only you"     β†’ You're the only one I love

Information Flow

In coherent text, information flows positionally:

πŸ“text
1"The scientist discovered a new particle. She published her findings."
2
3"She" refers to "scientist" β†’ Position helps resolve coreference

Fixed Expressions and Idioms

Many expressions depend on exact word order:

πŸ“text
1"Kick the bucket" = die (idiom)
2"Bucket the kick" = nonsense

1.3 How RNNs Handled Position

Implicit Positional Information

RNNs process sequences step by step:

πŸ“text
1h_0 β†’ h_1 β†’ h_2 β†’ h_3 β†’ ...
2 ↑     ↑     ↑     ↑
3x_0   x_1   x_2   x_3

The hidden state h_t "knows" it came after h_{t-1}:

- Position is implicit in the recurrence

- Earlier tokens influence later hidden states

- No explicit position encoding needed
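
A minimal sketch of that recurrence makes the point concrete (assuming a vanilla RNN cell `h_t = tanh(W h_{t-1} + U x_t)` with illustrative random weights): reversing the input sequence changes the final hidden state, so order is baked into the computation.

```python
import numpy as np

def rnn_final_state(xs, W, U):
    """Vanilla RNN: h_t = tanh(W @ h_{t-1} + U @ x_t), return last h."""
    h = np.zeros(W.shape[0])
    for x in xs:                 # strictly sequential: position is implicit
        h = np.tanh(W @ h + U @ x)
    return h

rng = np.random.default_rng(1)
d = 3
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = rng.normal(size=(4, d))     # a sequence of 4 input vectors

h_orig = rnn_final_state(xs, W, U)
h_rev = rnn_final_state(xs[::-1], W, U)   # same vectors, reversed order
print(np.allclose(h_orig, h_rev))         # False: the RNN is order-sensitive
```

Contrast this with the attention example earlier, where reordering the input only reordered the output.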

The Price of Implicit Position

This sequential processing caused RNN limitations:

- Can't parallelize

- Long-range dependencies are hard

- Gradient flow issues

Transformers solve these problems, but lose implicit position.


1.4 Requirements for Good Positional Encoding

Desirable Properties

A good positional encoding should be:

| Property | Description | Why It Matters |
| --- | --- | --- |
| Unique | Different positions → different encodings | Distinguishes positions |
| Bounded | Values don't grow unboundedly | Numerical stability |
| Deterministic | Same position → same encoding | Reproducibility |
| Generalizable | Works for unseen lengths | Handles long sequences |
| Smooth | Nearby positions → similar encodings | Captures locality |
| Relative-aware | Enables learning relative positions | "3 positions apart" |

What Doesn't Work

Simple integer positions:

πŸ“text
1PE(0) = 0, PE(1) = 1, PE(2) = 2, ...

- Not bounded (grows unboundedly)

- Magnitude varies with position (bad for learning)

Normalized positions:

πŸ“text
1PE(i) = i / seq_len  β†’  [0, 0.1, 0.2, ..., 1.0]

- Depends on sequence length

- Same position has different values for different lengths

One-hot positions:

πŸ“text
1PE(0) = [1,0,0,...], PE(1) = [0,1,0,...], ...

- Doesn't capture relationships between positions

- Requires knowing max length in advance

- Very high dimensional
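
These failure modes take only a few lines of arithmetic to see (all values illustrative):

```python
import numpy as np

# Integer positions: unbounded magnitude
pe_int = list(range(1000))
print(max(pe_int))            # 999, and it keeps growing with length

# Normalized positions: the SAME position gets DIFFERENT values
pos = 5
print(pos / 10)               # 0.5  in a length-10 sequence
print(pos / 20)               # 0.25 in a length-20 sequence

# One-hot: neighboring positions are orthogonal, so the encoding
# carries no notion of "position 3 is close to position 4"
onehot = np.eye(10)
print(onehot[3] @ onehot[4])  # 0.0: positions 3 and 4 look unrelated
```

Each naive scheme violates at least one property from the table above: integers break boundedness, normalization breaks determinism across lengths, and one-hot breaks smoothness.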


1.5 Absolute vs Relative Position

Absolute Position

What it encodes: "This token is at position 5"

Implementation: Add position-specific vector to each token

```python
input[i] = token_embedding[i] + position_encoding[i]
```

Pros:

- Simple to implement

- Each position has unique representation

Cons:

- "Position 5" is arbitrary; what matters is often relative position

- May not generalize to longer sequences

Relative Position

What it encodes: "Token A is 3 positions before token B"

Implementation: Modify attention to include relative position

```python
# In attention scores
scores[i,j] = Q[i] · K[j] + relative_bias[i-j]
```

Pros:

- Captures "distance" between tokens

- Generalizes better to long sequences

Cons:

- More complex to implement

- Requires attention mechanism changes
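
To make the `relative_bias[i-j]` idea concrete, here is a simplified sketch (a T5-style additive scalar bias per offset, with random values standing in for learned ones; this is not any specific model's implementation):

```python
import numpy as np

def attention_with_relative_bias(Q, K, V, bias):
    """Attention where scores get an additive term indexed by offset i - j.

    `bias` holds one scalar per offset in [-(n-1), n-1], mirroring
    scores[i,j] = Q[i] · K[j] + relative_bias[i-j] from the text.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]  # matrix of i - j
    scores = scores + bias[offsets + (n - 1)]                # shift offsets to >= 0
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax over j
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
n, d = 5, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
bias = rng.normal(size=2 * n - 1)   # one (would-be learned) scalar per offset
out = attention_with_relative_bias(Q, K, V, bias)
print(out.shape)  # (5, 4)
```

Notice that the bias depends only on the distance `i - j`, so the same bias applies to "3 positions apart" anywhere in the sequence, which is exactly why relative schemes generalize to longer inputs.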

Modern Approaches

| Model | Position Type | Method |
| --- | --- | --- |
| Original Transformer | Absolute | Sinusoidal encoding |
| BERT | Absolute | Learned embeddings |
| T5 | Relative | Relative position bias |
| GPT-3 | Absolute | Learned embeddings |
| RoPE (LLaMA) | Both | Rotary embeddings |
| ALiBi | Relative | Attention bias |

1.6 The Solution: Positional Encoding

The Core Idea

Add positional information to token embeddings:

πŸ“text
1input = token_embedding + positional_encoding

Where:

- token_embedding: What the token means (semantic)

- positional_encoding: Where the token is (structural)

Visual Representation

πŸ“text
1Token:     "The"   "cat"   "sat"
2Position:    0       1       2
3
4Token emb: [0.2,   [0.8,   [0.4,
5            0.8,    0.3,    0.6,
6            0.1]    0.7]    0.5]
7
8Pos enc:   [0.0,   [0.84,  [0.91,
9            1.0,    0.54,   -0.42,
10            0.0]    0.84]   0.91]
11
12Combined:  [0.2,   [1.64,  [1.31,
13            1.8,    0.84,   0.18,
14            0.1]    1.54]   1.41]

Why Addition?

We add (not concatenate) positional encoding:

1. Keeps dimension constant (d_model)

2. Allows position to modulate meaning

3. More parameter efficient

4. Works well empirically
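
Point 1 is visible directly in the shapes (illustrative NumPy only):

```python
import numpy as np

d_model, seq_len = 8, 6
rng = np.random.default_rng(3)
tok = rng.normal(size=(seq_len, d_model))   # token embeddings
pos = rng.normal(size=(seq_len, d_model))   # positional encodings

added = tok + pos                            # element-wise sum
concat = np.concatenate([tok, pos], axis=-1) # the alternative we rejected

print(added.shape)   # (6, 8):  d_model unchanged
print(concat.shape)  # (6, 16): dimension doubles, so every downstream
                     # weight matrix would have to grow as well
```

Concatenation would force every subsequent projection to handle `2 * d_model` inputs, which is the parameter cost point 3 refers to.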


1.7 Preview: Two Main Approaches

Sinusoidal Positional Encoding

From the original Transformer paper:

πŸ“text
1PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
2PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Properties:

- Deterministic (no learned parameters)

- Can extrapolate to longer sequences

- PE(pos + k) is a linear function of PE(pos), which makes relative offsets learnable

- Fixed once computed
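
A minimal NumPy sketch of the formula, handy for checking the bounded and deterministic properties directly (the full PyTorch version comes in the next section):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cos on odd dims."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 16)
print(pe.shape)                                 # (100, 16)
print(np.abs(pe).max() <= 1.0)                  # True: bounded in [-1, 1]
print(np.allclose(pe, sinusoidal_pe(100, 16)))  # True: deterministic
```

Because nothing here is learned, the same function can be evaluated at positions beyond any length seen in training.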

Learned Positional Embeddings

Used by BERT, GPT:

```python
position_embedding = nn.Embedding(max_seq_len, d_model)
```

Properties:

- Learned from data

- May capture task-specific patterns

- Limited to max_seq_len positions

- More parameters
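
Conceptually, `nn.Embedding` is just a trainable lookup table indexed by position. A NumPy stand-in (a random matrix in place of learned weights) shows the mechanics and the hard length limit:

```python
import numpy as np

max_seq_len, d_model = 512, 8
rng = np.random.default_rng(5)

# Stand-in for nn.Embedding's weight matrix: one learnable row per position
position_table = rng.normal(size=(max_seq_len, d_model))

seq_len = 6
tok_emb = rng.normal(size=(seq_len, d_model))
pos_emb = position_table[np.arange(seq_len)]  # look up rows 0..seq_len-1
x = tok_emb + pos_emb                         # same addition as before
print(x.shape)  # (6, 8)

# There is no row for position >= max_seq_len, so a length-513 sequence
# cannot be encoded without resizing (and retraining) the table.
```

This is the "limited to max_seq_len positions" trade-off: the table rows are free parameters shaped by the training data, but only for positions the table actually contains.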


1.8 Impact on Attention

Without Position: Ambiguous Attention

πŸ“text
1"The cat sat on the mat"
2
3Q("sat") looking for subject:
4- "The" at position 0: similarity S1
5- "cat" at position 1: similarity S2
6- "mat" at position 5: similarity S3
7
8Without position: Can't distinguish "The" at position 0 from "the" at position 4!
9Both have identical embeddings.

With Position: Disambiguated

πŸ“text
1After adding positional encoding:
2
3"The" (pos 0) embedding = token_emb + pos_enc(0) = unique!
4"the" (pos 4) embedding = token_emb + pos_enc(4) = different!
5
6Now attention can learn:
7- First "The" is likely part of subject
8- Second "the" is likely part of prepositional phrase
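
A quick check using the sinusoidal formula from section 1.7 (the `sinusoidal_pe` helper and the embedding values are illustrative): the single shared token embedding for "the" becomes two distinct input vectors once position is added.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# "the" has ONE token embedding, reused at positions 0 and 4
the_emb = np.array([0.2, 0.8, 0.1, 0.5])
pe = sinusoidal_pe(8, 4)

first_the = the_emb + pe[0]    # "The" at position 0
second_the = the_emb + pe[4]   # "the" at position 4
print(np.allclose(first_the, second_the))  # False: now distinguishable
```

Since the two occurrences now produce different queries and keys, attention can treat them differently, which is impossible with the bare token embeddings.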

Summary

The Problem

| Without Position | With Position |
| --- | --- |
| Attention is permutation equivariant | Order is encoded |
| "dog bites man" = "man bites dog" | Distinct representations |
| Can't distinguish repeated tokens | Each occurrence is unique |
| Loses syntactic information | Captures word order |

Key Concepts

1. Permutation Equivariance: Self-attention treats its input as a set, not a sequence

2. Order Matters: Language meaning depends heavily on position

3. Positional Encoding: Add position-specific vectors to embeddings

4. Two Approaches: Sinusoidal (fixed) vs Learned (trained)

5. Desirable Properties: Unique, bounded, generalizable, smooth


Next Section Preview

In the next section, we'll implement sinusoidal positional encoding from the original Transformer paper. We'll understand why the authors chose this specific formula and implement it in PyTorch.