Chapter 3: Multi-Head Attention

Self-Attention vs Cross-Attention

Introduction

Multi-head attention can operate in two modes: self-attention (where query, key, and value come from the same source) and cross-attention (where query comes from one source and key/value from another). Understanding this distinction is crucial for building encoder-decoder transformers.


The Two Modes

Self-Attention

Definition: Query, Key, and Value are all derived from the same input sequence.

Use cases:

  • Encoder self-attention
  • Decoder self-attention (masked)

Self-Attention Example

# Self-attention: Q, K, V from same source
x = encoder_input  # [batch, src_len, d_model]
output = attention(query=x, key=x, value=x)

What it captures: Relationships within a sequence

  • How tokens relate to each other
  • Syntactic dependencies (subject-verb)
  • Semantic relationships (coreference)

Cross-Attention

Definition: Query comes from one sequence, Key and Value from another.

Use cases:

  • Encoder-decoder attention (decoder queries, encoder provides K/V)
  • Vision-language models (text queries, image provides K/V)

Cross-Attention Example

# Cross-attention: Q from decoder, K/V from encoder
decoder_state = ...    # [batch, tgt_len, d_model]
encoder_output = ...   # [batch, src_len, d_model]
output = attention(
    query=decoder_state,
    key=encoder_output,
    value=encoder_output
)

What it captures: Relationships between sequences

  • How decoder tokens relate to encoder tokens
  • Alignment in translation (which source word to translate)
  • Grounding in multimodal tasks

Visual Comparison

Self-Attention Pattern

Input: "The cat sat on the mat"

Attention matrix (6×6):
          The  cat  sat   on  the  mat
The      [■    ·    ·    ·    ·    ·  ]
cat      [·    ■    ·    ·    ·    ·  ]
sat      [·    ■    ■    ·    ·    ■  ]  ← "sat" attends to "cat" and "mat"
on       [·    ·    ·    ■    ·    ·  ]
the      [·    ·    ·    ·    ■    ·  ]
mat      [·    ·    ·    ·    ·    ■  ]

Shape: [seq_len_q=6, seq_len_k=6]
Query and Key from same sequence → square matrix

Cross-Attention Pattern

Source (German): "Der Hund ist schwarz"
Target (English): "The dog is black"

Attention matrix (4×4):
          Der  Hund  ist  schwarz
The      [■    ·     ·    ·      ]  ← "The" attends to "Der"
dog      [·    ■     ·    ·      ]  ← "dog" attends to "Hund"
is       [·    ·     ■    ·      ]  ← "is" attends to "ist"
black    [·    ·     ·    ■      ]  ← "black" attends to "schwarz"

Shape: [seq_len_q=4, seq_len_k=4]
Query from target, Key from source
(Could be different lengths!)

Cross-Attention with Different Lengths

Source: "Der schwarze Hund" (3 tokens)
Target: "The black dog runs" (4 tokens)

Attention matrix (4×3):
          Der  schwarze  Hund
The      [■       ·       ·   ]
black    [·      ■        ·   ]
dog      [·      ·       ■    ]
runs     [·      ·       ■    ]  ← No source word for "runs"; attends to "Hund"

Shape: [seq_len_q=4, seq_len_k=3]
Rectangular matrix!
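
The rectangular shape falls straight out of the attention math. A standalone PyTorch sketch (tensor sizes are illustrative) makes it concrete:

```python
import math

import torch
import torch.nn.functional as F

d_model = 8
Q = torch.randn(1, 4, d_model)  # query from a 4-token target
K = torch.randn(1, 3, d_model)  # key from a 3-token source
V = torch.randn(1, 3, d_model)  # value from the same source

# Scaled dot-product attention
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)

print(weights.shape)  # torch.Size([1, 4, 3]) -- rectangular
print(output.shape)   # torch.Size([1, 4, 8]) -- one vector per query token
```

Each row of the weight matrix is a distribution over the 3 source tokens, so every row still sums to 1 even though the matrix is not square.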

Architecture Usage

In the Original Transformer

Transformer Architecture Overview

ENCODER (N layers):
├── Self-Attention: Q=K=V=encoder_input
└── Feed-Forward

DECODER (N layers):
├── Masked Self-Attention: Q=K=V=decoder_input (causal mask)
├── Cross-Attention: Q=decoder_state, K=V=encoder_output
└── Feed-Forward

Data Flow Diagram

                           ENCODER
                              │
Source tokens ──→ Embedding ──→ Self-Attn ──→ FFN ──→ encoder_output
                              │                            │
                              └────────────────────────────┤
                                                           │
                           DECODER                         │
                              │                            │
Target tokens ──→ Embedding ──→ Masked      ──→ Cross   ──┴──→ FFN ──→ Output
                              │ Self-Attn      │ Attn
                              │                │
                           (Q=K=V)          (Q from decoder,
                                            K=V from encoder)

Implementation Comparison

Minimal Code Difference

class MultiHeadAttention(nn.Module):
    def forward(self, query, key, value, mask=None):
        # Projections
        Q = self.W_Q(query)  # From query input
        K = self.W_K(key)    # From key input
        V = self.W_V(value)  # From value input

        # Rest is identical:
        # split heads, attention, combine heads, output projection

        return output, weights


# Self-attention: same input for all
output = mha(x, x, x)  # Q=K=V=x

# Cross-attention: different inputs
output = mha(decoder_state, encoder_output, encoder_output)  # Q≠K=V

The same module handles both modes: the only difference is what inputs you pass!
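
PyTorch's built-in `nn.MultiheadAttention` works the same way: one module serves both modes depending on what you pass in. A quick shape check (dimensions are illustrative; the returned weights are averaged over heads by default):

```python
import torch
import torch.nn as nn

# One module, two modes
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(2, 10, 64)   # a single sequence
dec = torch.randn(2, 8, 64)  # decoder state
enc = torch.randn(2, 12, 64) # encoder output

# Self-attention: same tensor three times
self_out, self_w = mha(x, x, x)

# Cross-attention: query from one sequence, key/value from another
cross_out, cross_w = mha(dec, enc, enc)

print(self_w.shape)   # torch.Size([2, 10, 10]) -- square
print(cross_w.shape)  # torch.Size([2, 8, 12])  -- rectangular
```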

Shape Implications

# Self-attention
x = torch.randn(batch, seq_len, d_model)
output, weights = mha(x, x, x)
# output: [batch, seq_len, d_model]
# weights: [batch, heads, seq_len, seq_len]  ← square!

# Cross-attention
query = torch.randn(batch, tgt_len, d_model)   # e.g., 20 tokens
kv = torch.randn(batch, src_len, d_model)      # e.g., 30 tokens
output, weights = mha(query, kv, kv)
# output: [batch, tgt_len, d_model]            # 20 tokens
# weights: [batch, heads, tgt_len, src_len]    # 20×30, rectangular!

Masking Differences

Self-Attention Masks

Encoder self-attention: Only padding mask

Encoder Padding Mask

# Mask padding tokens in source
padding_mask = (src != PAD_TOKEN).unsqueeze(1).unsqueeze(2)
# Shape: [batch, 1, 1, src_len]

Decoder self-attention: Causal + padding mask

Decoder Causal Mask

# Prevent attending to future tokens (bool dtype so `&` can combine masks)
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
padding_mask = (tgt != PAD_TOKEN).unsqueeze(1).unsqueeze(2)
combined_mask = causal_mask & padding_mask  # broadcasts over batch
# Shape: [batch, 1, tgt_len, tgt_len]

Cross-Attention Masks

Only mask padding in the source (encoder output):

Cross-Attention Mask

# Decoder queries can see all non-padding encoder positions
cross_mask = (src != PAD_TOKEN).unsqueeze(1).unsqueeze(2)
# Shape: [batch, 1, 1, src_len]
# Broadcasts to [batch, heads, tgt_len, src_len]

No causal mask needed: the decoder can look at the entire source!
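
The broadcast behavior claimed in the comment can be verified with a toy example (a standalone sketch; the PAD id and tensor sizes are illustrative):

```python
import torch
import torch.nn.functional as F

PAD = 0
batch, heads, tgt_len, src_len = 2, 4, 5, 6

# Second sequence in the batch has two padding tokens at the end
src = torch.tensor([[4, 7, 2, 9, 3, 5],
                    [6, 1, 8, 2, PAD, PAD]])

cross_mask = (src != PAD).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, src_len]

scores = torch.randn(batch, heads, tgt_len, src_len)
scores = scores.masked_fill(cross_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)  # mask broadcast over heads and tgt_len

print(weights.shape)                     # torch.Size([2, 4, 5, 6])
print(weights[1, :, :, 4:].max().item()) # 0.0 -- padded columns get no weight
```

Every decoder position, in every head, assigns zero weight to the padded source columns, while each row still sums to 1 over the remaining positions.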


Flexible Attention Module

Here's a module that explicitly handles both modes:

FlexibleMultiHeadAttention

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexibleMultiHeadAttention(nn.Module):
    """
    Multi-head attention supporting both self-attention and cross-attention.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        dropout: float = 0.0,
        bias: bool = True
    ):
        super().__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.scale = math.sqrt(self.d_k)

        # Query projection (always from query input)
        self.W_Q = nn.Linear(d_model, d_model, bias=bias)

        # Key/Value projections (from key/value inputs)
        self.W_K = nn.Linear(d_model, d_model, bias=bias)
        self.W_V = nn.Linear(d_model, d_model, bias=bias)

        # Output projection
        self.W_O = nn.Linear(d_model, d_model, bias=bias)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor = None,
        value: torch.Tensor = None,
        mask: torch.Tensor = None,
        is_self_attention: bool = None
    ):
        """
        Forward pass.

        Args:
            query: [batch, seq_len_q, d_model]
            key: [batch, seq_len_k, d_model] (optional, defaults to query)
            value: [batch, seq_len_k, d_model] (optional, defaults to key)
            mask: Attention mask
            is_self_attention: Explicit flag (auto-detected if None)

        Returns:
            output: [batch, seq_len_q, d_model]
            weights: [batch, num_heads, seq_len_q, seq_len_k]
        """
        # Auto-detect mode if not specified (informational only:
        # the computation below is identical in both modes)
        if is_self_attention is None:
            is_self_attention = (key is None) or (key is query)

        # Default key/value to query for self-attention
        if key is None:
            key = query
        if value is None:
            value = key

        batch_size, seq_len_q, _ = query.shape
        seq_len_k = key.size(1)

        # Project
        Q = self.W_Q(query)
        K = self.W_K(key)
        V = self.W_V(value)

        # Split heads
        Q = self._split_heads(Q, batch_size)
        K = self._split_heads(K, batch_size)
        V = self._split_heads(V, batch_size)

        # Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        weights = F.softmax(scores, dim=-1)
        weights = torch.nan_to_num(weights, nan=0.0)  # zero out fully-masked rows
        weights = self.dropout(weights)

        # Output
        out = torch.matmul(weights, V)
        out = self._combine_heads(out, batch_size)
        out = self.W_O(out)

        return out, weights

    def _split_heads(self, x, batch_size):
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def _combine_heads(self, x, batch_size):
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)


# Usage examples
mha = FlexibleMultiHeadAttention(d_model=512, num_heads=8)

# Self-attention (three equivalent ways)
x = torch.randn(2, 10, 512)
out1, _ = mha(x)                          # Auto-detect: key=None
out2, _ = mha(x, x, x)                    # Explicit: Q=K=V
out3, _ = mha(x, is_self_attention=True)  # Flag

# Cross-attention
dec = torch.randn(2, 8, 512)   # Decoder state
enc = torch.randn(2, 12, 512)  # Encoder output
out4, weights = mha(dec, enc, enc)  # Q from dec, K/V from enc
print(f"Cross-attention weights: {weights.shape}")  # [2, 8, 8, 12]

When to Use Each

Use Self-Attention When:

Scenario                  Example
Encoding a sequence       Understanding source sentence
Language modeling         Predicting next word (causal)
Bidirectional context     BERT-style encoding
Single-sequence tasks     Classification, NER

Use Cross-Attention When:

Scenario                  Example
Sequence-to-sequence      Translation, summarization
Multimodal fusion         Image + text
Retrieval augmentation    Query + retrieved docs
Encoder-decoder models    T5, BART

Practical Considerations

Memory Efficiency

Self-attention: O(n²), where n is the sequence length

Cross-attention: O(n × m), where n is the query length and m is the key length

For cross-attention with long encoder output:

  • Consider sparse attention patterns
  • Use chunked processing
  • KV caching for generation
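
A quick back-of-envelope count makes the asymptotics above concrete (lengths are illustrative; fp32, per head, per batch element):

```python
# Number of entries in the attention weight matrix
def attn_entries(q_len: int, k_len: int) -> int:
    return q_len * k_len

# Self-attention over a 4096-token sequence
self_entries = attn_entries(4096, 4096)   # 16,777,216 entries

# Cross-attention: 128 decoder tokens over the same 4096-token source
cross_entries = attn_entries(128, 4096)   # 524,288 entries

bytes_per_float = 4  # fp32
print(f"self:  {self_entries * bytes_per_float / 2**20:.0f} MiB")   # 64 MiB
print(f"cross: {cross_entries * bytes_per_float / 2**20:.0f} MiB")  # 2 MiB
```

The cross-attention matrix here is 32× smaller because only the query side grows with the target length; the source side is fixed.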

Caching for Generation

During autoregressive generation, encoder output doesn't change:

CachedCrossAttention

# Wraps a module exposing W_Q/W_K/W_V/W_O, scale, and the
# _split_heads/_combine_heads helpers (e.g., FlexibleMultiHeadAttention above)
class CachedCrossAttention(nn.Module):
    def __init__(self, base_attention):
        super().__init__()
        self.attention = base_attention
        self.cached_K = None
        self.cached_V = None

    def forward(self, query, encoder_output=None, use_cache=True):
        attn = self.attention
        if encoder_output is not None:
            # First call: compute and cache the projected K, V
            batch_size = encoder_output.size(0)
            self.cached_K = attn._split_heads(attn.W_K(encoder_output), batch_size)
            self.cached_V = attn._split_heads(attn.W_V(encoder_output), batch_size)

        if use_cache and self.cached_K is not None:
            # Reuse cached K, V; only the query is projected each step
            batch_size = query.size(0)
            Q = attn._split_heads(attn.W_Q(query), batch_size)
            scores = torch.matmul(Q, self.cached_K.transpose(-2, -1)) / attn.scale
            weights = F.softmax(scores, dim=-1)
            out = attn._combine_heads(torch.matmul(weights, self.cached_V), batch_size)
            return attn.W_O(out), weights

        # No cache available: normal forward pass
        return attn(query, encoder_output, encoder_output)
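
As a sanity check on the caching idea, here is a standalone single-head miniature (all names and sizes are illustrative): caching the projected K/V once gives exactly the same output as recomputing them at every step.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
W_Q, W_K, W_V = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

enc = torch.randn(1, 6, d)  # encoder output (fixed during generation)
q1 = torch.randn(1, 1, d)   # decoder query at one generation step

def cross_attn(query, K, V):
    scores = torch.matmul(W_Q(query), K.transpose(-2, -1)) / math.sqrt(d)
    return torch.matmul(F.softmax(scores, dim=-1), V)

# Cache the encoder projections once...
K_cache, V_cache = W_K(enc), W_V(enc)

# ...then every generation step reuses them
out_cached = cross_attn(q1, K_cache, V_cache)

# Recomputing the projections from scratch gives the same result
out_full = cross_attn(q1, W_K(enc), W_V(enc))
print(torch.allclose(out_cached, out_full))  # True
```

The saving grows with the number of generated tokens: the two O(src_len × d²) projections are paid once instead of once per decoding step.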

Summary

Key Differences

Aspect           Self-Attention               Cross-Attention
Q source         Same as K, V                 Different from K, V
Matrix shape     Square (n×n)                 Rectangular (n×m)
Typical use      Encoder, decoder self-attn   Encoder-decoder bridge
Masking          Causal for decoder           Source padding only
What it learns   Intra-sequence relations     Inter-sequence relations

Implementation Insight

The same module handles both; the distinction is purely in what inputs you provide:

Self vs Cross Attention Usage

# Self-attention
output = mha(x, x, x)

# Cross-attention
output = mha(decoder_state, encoder_output, encoder_output)

Exercises

Conceptual Questions

  1. Why doesn't cross-attention need a causal mask?
  2. In a translation model, what does high cross-attention weight between "dog" (English) and "Hund" (German) indicate?
  3. Could you use cross-attention between two unrelated sequences? What would the model learn?

Implementation Exercises

  1. Implement a KV-caching wrapper for cross-attention that avoids recomputing encoder projections.
  2. Create a visualization that shows self-attention vs cross-attention patterns for a translation example.
  3. Implement "bi-directional cross-attention" where both sequences query each other.

Chapter Summary

In this chapter on Multi-Head Attention, you learned:

  1. Why multiple heads: Specialization for different relationship types
  2. Linear projections: W_Q, W_K, W_V transform inputs to Q, K, V
  3. Shape transformations: split_heads and combine_heads for parallel computation
  4. Complete implementation: Production-ready MultiHeadAttention module
  5. Self vs cross attention: Same mechanism, different input patterns

You now have a complete, reusable multi-head attention module that forms the core of every transformer layer.


Next Chapter Preview

In Chapter 4: Positional Encoding and Embeddings, we'll solve the "position problem": transformers are permutation invariant, but language is order-dependent. We'll implement:

  • Token embeddings
  • Sinusoidal positional encoding
  • Learned positional embeddings
  • Combined embedding layers

This will complete the input processing pipeline before we move on to building full encoder and decoder layers.