Chapter 3

Multi-Head Attention: Why Multiple Heads?

Introduction

In the previous chapter, we implemented single-head attention. It works well, but the original Transformer paper uses multi-head attention instead. Why?

This section explores the theoretical motivation for using multiple attention heads: how they allow the model to capture different types of relationships simultaneously and why this matters for language understanding.


Limitations of Single-Head Attention

The Problem: One Perspective Only

Single-head attention computes one set of attention weights:

πŸ“text
1For "The cat sat on the mat":
2
3Query "sat" β†’ Attention weights β†’ [0.05, 0.35, 0.15, 0.10, 0.05, 0.30]
4                                   The  cat   sat   on   the   mat

But language has multiple types of relationships:

  1. Syntactic: "sat" needs its subject ("cat") and object ("mat")
  2. Semantic: "cat" and "mat" might be related (domestic context)
  3. Positional: Adjacent words often matter for grammar
  4. Long-range: Pronouns need to find their referents

Can one attention pattern capture all of this? Not effectively.
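To make the "one pattern" limitation concrete, here is a compact NumPy sketch of the single-head attention from the previous chapter. The dimensions and random weights are purely illustrative toys, not trained values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention: one attention pattern per query."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq, seq): a single pattern
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                 # 6 tokens: "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = rng.normal(size=(3, d_model, d_model))
out, weights = single_head_attention(X, W_Q, W_K, W_V)
print(weights[2].round(2))              # the one attention row for "sat"
```

Whatever the projection matrices learn, each query token gets exactly one row of attention weights; there is no second pattern to fall back on.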

Mathematical Limitation

Single-head attention projects into one subspace:

$$
\begin{aligned}
Q &= X \cdot W_Q \quad \rightarrow \quad \text{Single query representation} \\
K &= X \cdot W_K \quad \rightarrow \quad \text{Single key representation}
\end{aligned}
$$

This single projection must balance:

  • Finding syntactic dependencies
  • Capturing semantic similarity
  • Maintaining positional relationships

One projection can't optimize for all of these simultaneously.

The "Averaging" Problem

When a single head tries to capture multiple relationship types, it compromises:

πŸ“text
1Ideal for syntax:     [0.0, 0.8, 0.0, 0.0, 0.0, 0.2]  (subject + object)
2Ideal for position:   [0.3, 0.3, 0.2, 0.2, 0.0, 0.0]  (nearby words)
3Ideal for semantics:  [0.0, 0.3, 0.0, 0.0, 0.0, 0.7]  (related concepts)
4
5Compromise (averaged): [0.1, 0.5, 0.1, 0.1, 0.0, 0.3]  (muddy signal)

The compromise is worse than any specialized pattern.
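A quick numeric check of the averaging effect, using the three hypothetical "ideal" patterns above (the averaged result is close to the compromise row shown, and is visibly less peaked than any specialist):

```python
import numpy as np

syntax    = np.array([0.0, 0.8, 0.0, 0.0, 0.0, 0.2])
position  = np.array([0.3, 0.3, 0.2, 0.2, 0.0, 0.0])
semantics = np.array([0.0, 0.3, 0.0, 0.0, 0.0, 0.7])

compromise = (syntax + position + semantics) / 3
print(compromise.round(2))
# Still a valid distribution (sums to 1), but its strongest weight is far
# weaker than the 0.8 the syntax specialist could put on the subject.
```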


The Multi-Head Solution

Core Idea: Multiple Perspectives

Instead of one attention pattern, compute multiple patterns in parallel:

πŸ“text
1Head 1: Focuses on syntactic relationships
2Head 2: Focuses on semantic similarity
3Head 3: Focuses on adjacent positions
4Head 4: Focuses on long-range dependencies
5...

Each head can specialize for different relationship types.

Visual Representation

πŸ“text
1Input: "The cat sat on the mat"
2
3Head 1 (Syntactic):     sat β†’ [Β·, β– , Β·, Β·, Β·, β– ]  (subject, object)
4Head 2 (Positional):    sat β†’ [Β·, β– , β– , β– , Β·, Β·]  (nearby words)
5Head 3 (Semantic):      cat β†’ [Β·, β– , Β·, Β·, Β·, β– ]  (cat ↔ mat, domestic)
6Head 4 (Article):       cat β†’ [β– , β– , Β·, Β·, Β·, Β·]  (article-noun pair)
7
8Concatenate all heads β†’ Rich, multi-faceted representation

The Mathematical Framework

Multi-head attention:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W^O
$$

Where each head is computed as:

$$
\text{head}_i = \text{Attention}(Q \cdot W^Q_i,\; K \cdot W^K_i,\; V \cdot W^V_i)
$$

Each head has its own projection matrices:

  • W^Q_i: What each head looks for
  • W^K_i: What each position advertises
  • W^V_i: What information to extract
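Putting the two formulas together: a common implementation projects once with large matrices and slices the result into h heads of width d_k. The following NumPy sketch is illustrative (toy scaling and names, not the chapter's final implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    seq_len, d_model = X.shape
    d_k = d_model // n_heads

    def split(W):  # project, then reshape (seq, d_model) -> (h, seq, d_k)
        return (X @ W).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(W_Q), split(W_K), split(W_V)
    # head_i = Attention(Q_i, K_i, V_i), computed for all heads in parallel
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, seq, seq)
    heads = weights @ V                                          # (h, seq, d_k)
    # Concat(head_1, ..., head_h) · W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))
W_Q, W_K, W_V, W_O = rng.normal(size=(4, 512, 512)) * 0.02
out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=8)
print(out.shape, weights.shape)  # (6, 512) (8, 6, 6)
```

Note that `weights` now holds eight independent attention patterns over the same six tokens, one per head, instead of the single pattern of the previous section.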

Specialization of Attention Heads

Empirical Observations

Research analyzing trained transformers reveals that heads do specialize:

| Head Type | What It Learns | Example Pattern |
|---|---|---|
| Positional | Attend to adjacent tokens | Diagonal or band pattern |
| Syntactic | Subject-verb, verb-object | Sparse, specific connections |
| Semantic | Similar/related concepts | Cluster-like patterns |
| Delimiter | Attend to [CLS], [SEP], periods | Column pattern |
| Previous Token | Always attend to position i-1 | Shifted diagonal |
| Rare Token | Attend to unusual/important words | Sparse, content-specific |

Visualization from Research

BERT attention analysis shows distinct patterns across heads:

πŸ“text
1Layer 1, Head 1:           Layer 1, Head 2:           Layer 1, Head 3:
2  [β–  Β· Β· Β· Β·]                [β–  β–  Β· Β· Β·]                [Β· Β· Β· Β· β– ]
3  [β–  β–  Β· Β· Β·]                [Β· β–  β–  Β· Β·]                [Β· Β· Β· Β· β– ]
4  [Β· β–  β–  Β· Β·]                [Β· Β· β–  β–  Β·]                [Β· Β· Β· Β· β– ]
5  [Β· Β· β–  β–  Β·]                [Β· Β· Β· β–  β– ]                [Β· Β· Β· Β· β– ]
6  [Β· Β· Β· β–  β– ]                [Β· Β· Β· Β· β– ]                [Β· Β· Β· Β· β– ]
7
8  Previous token            Next token               Last position

The Redundancy-Expressiveness Tradeoff

Not all heads are equally useful:

  • Some heads learn similar patterns (redundancy)
  • Some heads learn seemingly random patterns (noise)
  • Some heads are critical (pruning them hurts performance)

This observation has led to:

  • Head pruning: Removing less important heads
  • Adaptive attention: Learning which heads to use
  • Mixture of experts: Routing to relevant heads

Subspace Projection Intuition

High-Dimensional Embedding Space

Word embeddings live in high-dimensional space (e.g., 512 dimensions).

Different dimensions encode different information:

  • Dimensions 1-50: Maybe gender/animacy
  • Dimensions 51-100: Maybe part-of-speech
  • Dimensions 101-200: Maybe semantic category
  • ...

Projecting into Subspaces

Each attention head projects into a subspace:

$$
\begin{aligned}
d_{\text{model}} &= 512 \quad \text{(full embedding dimension)} \\
n_{\text{heads}} &= 8 \quad \text{(number of attention heads)} \\
d_k &= \frac{d_{\text{model}}}{n_{\text{heads}}} = \frac{512}{8} = 64 \quad \text{(per-head dimension)}
\end{aligned}
$$

Head 1: Projects into a 64-dim subspace → Captures one aspect
Head 2: Projects into a different 64-dim subspace → Captures another aspect

The projection matrices W^Q_i and W^K_i learn which aspects to focus on.
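The divisibility requirement implied by this arithmetic can be checked in two lines (a minimal sketch, not specific to any framework):

```python
# The per-head width must divide the model width evenly, so that the
# concatenated heads reassemble to exactly d_model dimensions.
d_model, n_heads = 512, 8
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_k = d_model // n_heads
print(d_k)  # 64
```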

Geometric Intuition

Imagine embeddings in 3D (simplified):

πŸ“text
1Full 3D space:
2  Words live here with multiple properties
3
4Head 1 projects onto XY plane:
5  Captures relationships in X-Y dimensions
6  (e.g., syntactic features)
7
8Head 2 projects onto XZ plane:
9  Captures relationships in X-Z dimensions
10  (e.g., semantic features)
11
12Combining both:
13  Richer understanding than either alone
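The plane analogy translates directly into matrix projections. Here is a toy version with hypothetical 3D "embeddings", where each head's projection matrix keeps a different pair of axes:

```python
import numpy as np

# Two toy 3D "embeddings" (hypothetical values)
points = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

P_xy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # Head 1: keep X and Y
P_xz = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # Head 2: keep X and Z

head1_view = points @ P_xy  # sees only (x, y)
head2_view = points @ P_xz  # sees only (x, z)
print(head1_view.tolist())  # [[1.0, 2.0], [4.0, 5.0]]
print(head2_view.tolist())  # [[1.0, 3.0], [4.0, 6.0]]
```

Real heads learn dense projection matrices rather than axis-aligned ones, so their subspaces are arbitrary learned directions, but the principle is the same: each head sees a lower-dimensional view of the full embedding.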

One Large Head vs Multiple Small Heads

Parameter Comparison

One large head (d_k = d_model = 512):

$$
\begin{aligned}
W^Q &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters} \\
W^K &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters} \\
W^V &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters} \\
\textbf{Total} &: 786{,}432 \text{ parameters}
\end{aligned}
$$

Eight small heads (d_k = 64 each):

$$
\begin{aligned}
W^Q &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters (projects to all heads)} \\
W^K &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters} \\
W^V &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters} \\
W^O &: [512 \times 512] \rightarrow 262{,}144 \text{ parameters (output projection)} \\
\textbf{Total} &: 1{,}048{,}576 \text{ parameters}
\end{aligned}
$$

Multi-head uses ~33% more parameters (for W^O), but gains:

  • Multiple attention patterns
  • Specialization capability
  • Better gradient flow (parallel heads)
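The parameter arithmetic above checks out directly:

```python
d_model = 512

# One large head: W^Q, W^K, W^V, each d_model x d_model
single_head_params = 3 * d_model * d_model
print(single_head_params)  # 786432

# Eight small heads: the same three projections (sliced across heads) plus W^O
multi_head_params = 4 * d_model * d_model
print(multi_head_params)   # 1048576

print(multi_head_params / single_head_params)  # ~1.33: the extra third is W^O
```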

Empirical Results

The original paper compared configurations:

| Config | BLEU Score | Note |
|---|---|---|
| 1 head, d_k = 512 | 24.9 | Baseline |
| 8 heads, d_k = 64 | 25.8 | +0.9 improvement |
| 16 heads, d_k = 32 | 25.5 | Diminishing returns |
| 32 heads, d_k = 16 | 25.0 | Too many heads |

Sweet spot: 8-16 heads for typical models.


How Heads Learn to Specialize

Random Initialization Breaks Symmetry

At initialization:

  • All heads have random weights
  • Random differences lead to different gradients
  • Heads diverge during training

Specialization Emerges from Data

The data contains different types of patterns:

  • Syntactic patterns β†’ Some heads specialize here
  • Semantic patterns β†’ Other heads specialize here
  • The loss function rewards useful specialization

Not Explicitly Designed

Heads are NOT told what to specialize in:

  • No "syntax head" label
  • No "semantic head" constraint
  • Specialization emerges from training

This is learned, not engineered.


Attention Head Redundancy

The Dropout Perspective

If we apply dropout to attention heads:

  • Randomly zero out entire heads during training
  • Model must learn redundant representations
  • No single head becomes critical
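One way to sketch this head-level dropout (a hypothetical training-time mask written in NumPy; the function name and rescaling choice are illustrative, not from the original paper):

```python
import numpy as np

def drop_heads(head_outputs, p=0.25, rng=None):
    """Zero out entire attention heads with probability p (training only).

    head_outputs: array of shape (n_heads, seq_len, d_k)
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(head_outputs.shape[0]) >= p        # one coin flip per head
    mask = keep[:, None, None].astype(head_outputs.dtype)
    # Inverted-dropout rescaling keeps the expected output magnitude unchanged
    return head_outputs * mask / (1.0 - p)

heads = np.ones((8, 6, 64))
dropped = drop_heads(heads, p=0.25, rng=np.random.default_rng(0))
# Dropped heads are all-zero; surviving heads are scaled up by 1/(1-p)
```

Because any head can vanish on a given step, downstream layers cannot rely on one head alone, which encourages the redundancy described above.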

Head Pruning Research

Studies show:

  • 30-40% of heads can be removed with minimal loss
  • Some heads are redundant (similar patterns)
  • Some heads are critical (removing them hurts a lot)

Implications

  1. Robustness: Multiple heads provide backup
  2. Efficiency opportunity: Can prune for inference
  3. Interpretability challenge: Hard to assign meaning to heads

Multi-Head Attention in Different Layers

Layer-Dependent Patterns

Different layers learn different patterns:

Early Layers (1-4):

  • Local patterns (adjacent tokens)
  • Syntactic patterns (POS, grammar)
  • Surface-level features

Middle Layers (5-8):

  • Longer-range dependencies
  • Semantic relationships
  • Entity tracking

Late Layers (9-12):

  • Task-specific patterns
  • Abstract representations
  • Output preparation

Why This Matters

Understanding layer roles helps with:

  • Transfer learning: Which layers to freeze
  • Interpretability: Where to look for specific patterns
  • Efficiency: Which layers are most important

Summary

Why Multi-Head Attention?

| Single Head | Multiple Heads |
|---|---|
| One attention pattern | Multiple patterns |
| One subspace projection | Multiple subspaces |
| Must compromise | Can specialize |
| Less expressive | More expressive |
| Lower parameter count | Slightly more parameters |

Key Takeaways

  1. Specialization: Heads learn to capture different relationship types
  2. Parallel Subspaces: Each head projects to a different subspace
  3. Empirical Benefit: Multi-head consistently outperforms single-head
  4. Emergent Behavior: Specialization emerges from training, not design
  5. Redundancy: Some heads are redundant, others are critical

The Formula Preview

Multi-Head Attention:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W^O
$$

Each head:

$$
\text{head}_i = \text{Attention}(Q \cdot W^Q_i,\; K \cdot W^K_i,\; V \cdot W^V_i)
$$

Exercises

Conceptual Questions

  1. Explain in your own words why a single attention head can't simultaneously optimize for syntax and semantics.
  2. If you had 1024-dimensional embeddings and 16 heads, what would be the dimension per head (d_k)? Why must d_model be divisible by n_heads?
  3. Why might increasing heads beyond 16 show diminishing returns or even hurt performance?
  4. If all heads learned identical patterns, what would be lost compared to diverse specialization?

Thought Experiments

  1. Design an experiment to test whether attention heads specialize. What would you measure?
  2. If you could manually assign head functions (Head 1 = syntax, Head 2 = semantics, etc.), how would you do it? Why might learned specialization be better?
  3. How might the optimal number of heads change for:
    • A very small model (d_model = 128)?
    • A very large model (d_model = 4096)?
    • A model for very long sequences?