Introduction
Before diving into formulas and code, let's build a solid intuition for what attention actually does. Understanding attention at a conceptual level will make the mathematics feel natural rather than arbitrary.
The attention mechanism is arguably the single most important innovation in modern deep learning. Once you truly understand it, you'll see it everywhere—and you'll understand why transformers work so well.
1.1 Attention as Selective Focus
The Human Analogy
When you read this sentence, you don't process every word with equal importance. Your brain automatically:
- Focuses on key words that carry meaning
- Relates words to each other based on context
- Ignores or downweights less relevant information
Consider: "The cat that was sitting on the mat was hungry."
To understand what "hungry" refers to, your brain:
- Connects "hungry" to "cat" (not "mat")
- Uses grammatical structure (subject-verb agreement)
- Ignores intervening words as less relevant
Attention mechanisms do exactly this—computationally.
The Cocktail Party Effect
Imagine being at a noisy party. You can:
- Focus on one conversation while ignoring others
- Suddenly shift attention when someone says your name
- Dynamically adjust focus based on relevance
Neural attention works similarly:
- Each position in a sequence can "focus" on different parts
- Focus is determined by relevance (similarity)
- Attention is dynamic—different inputs produce different patterns
1.2 Attention as Weighted Averaging
The Core Operation
At its heart, attention is just a weighted average:
```
output = Σ (weight_i × value_i)
```
Where:
- value_i is the information at position i
- weight_i is how much to focus on position i
- Σ weight_i = 1 (the weights form a probability distribution)
A Simple Example
Suppose we want to compute a "context-aware" representation for the word "it" in:
"The cat sat on the mat. It was happy."
What does "it" refer to? Let's say we have embeddings:
```
cat = [0.8, 0.2, 0.1]  (animal-like)
mat = [0.1, 0.9, 0.3]  (object-like)
it  = [0.7, 0.3, 0.2]  (pronoun, somewhat animal-like)
```
If attention assigns weights:
- cat: 0.7 (high attention - "it" likely refers to cat)
- mat: 0.2 (low attention)
- other words: 0.1
Then the context-enriched representation of "it":
```
new_it = 0.7 × cat + 0.2 × mat + 0.1 × others
       = 0.7 × [0.8, 0.2, 0.1] + 0.2 × [0.1, 0.9, 0.3] + ...
```
Now "it" carries information about what it refers to!
Tiny numeric sanity check
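The arithmetic above fits in a few lines of Python. Since the example elides the "others" contribution, the `others` vector below is a hypothetical placeholder (all zeros):

```python
# Sanity-check the weighted average for "it", using the vectors from the example.
cat = [0.8, 0.2, 0.1]
mat = [0.1, 0.9, 0.3]
others = [0.0, 0.0, 0.0]  # hypothetical placeholder for the remaining words

weights = {"cat": 0.7, "mat": 0.2, "others": 0.1}
vectors = {"cat": cat, "mat": mat, "others": others}

# new_it[d] = Σ weight(name) × vector(name)[d]
new_it = [sum(weights[name] * vectors[name][d] for name in weights)
          for d in range(3)]
print([round(x, 2) for x in new_it])  # → [0.58, 0.32, 0.13]
```

The result sits between the `cat` and `mat` vectors but much closer to `cat`, reflecting the 0.7 weight.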
1.3 The Query-Key-Value Framework
The Database Analogy
Think of attention like searching a database:
| Concept | Database | Attention |
|---|---|---|
| Query | Search term | "What am I looking for?" |
| Key | Index entries | "What does each item contain?" |
| Value | Actual data | "What information should I return?" |
Process:
- Compare Query to all Keys (compute similarity)
- Use similarities to create attention weights
- Return weighted combination of Values
Example: Finding a Book
Query: "machine learning Python"
Database entries:
| Key (Title/Tags) | Value (Content) |
|---|---|
| "ML algorithms Python" | Chapter on scikit-learn... |
| "Web development JavaScript" | React tutorial... |
| "Deep learning PyTorch" | Neural networks... |
Process:
- Match query to keys: ML→high, Web→low, Deep→medium
- Weights: [0.6, 0.05, 0.35]
- Return: Mostly scikit-learn chapter, some PyTorch
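This "soft lookup" — returning a similarity-weighted blend of all values instead of the single best match — can be sketched directly. The query, keys, and values below are illustrative toy vectors, not from the book example:

```python
import math

def softmax(scores):
    """Convert raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    # 1. Compare the query to all keys (dot-product similarity).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Turn similarities into attention weights.
    weights = softmax(scores)
    # 3. Return the weighted combination of values.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy lookup: the query is most similar to the first key.
query = [1.0, 0.0]
keys = [[0.9, 0.1], [0.1, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = soft_lookup(query, keys, values)
print([round(x, 2) for x in result])  # ≈ [0.69, 0.31]
```

Unlike a hard database lookup, every value contributes; the weights just make the best match dominate.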
In Neural Networks
For each position in a sequence:
```
Query:  "What information do I need?"   (from current position)
Keys:   "What information do I have?"   (from all positions)
Values: "What information can I give?"  (from all positions)
```
The key insight: Q, K, V are all learned transformations of the same input!
```
Q = input @ W_Q  # Learn what to look for
K = input @ W_K  # Learn what to advertise
V = input @ W_V  # Learn what to provide
```
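A minimal sketch of those three projections in plain Python. The weight matrices here are tiny hand-picked stand-ins for learned parameters, chosen only to make the arithmetic easy to follow:

```python
def matmul(x, w):
    # (n × d) @ (d × m): project every input row with the same weight matrix.
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]

# 3 tokens with 2-dim embeddings; the SAME input feeds all three projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

W_Q = [[1.0, 0.0], [0.0, 1.0]]   # "what to look for"   (toy values)
W_K = [[0.0, 1.0], [1.0, 0.0]]   # "what to advertise"  (toy values)
W_V = [[0.5, 0.0], [0.0, 0.5]]   # "what to provide"    (toy values)

Q, K, V = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)
print(Q[0], K[0], V[0])  # → [1.0, 0.0] [0.0, 1.0] [0.5, 0.0]
```

The same token gets three different representations, one per role — that separation is what the projections buy us.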
1.4 Self-Attention vs Cross-Attention
Self-Attention
All of Q, K, V come from the same sequence.
Used in: Encoder self-attention, decoder self-attention
```
Input: "The cat sat"
         ↓    ↓   ↓
       [e1,  e2, e3]   (embeddings)
         ↓    ↓   ↓
Q: [q1, q2, q3]  (queries from same input)
K: [k1, k2, k3]  (keys from same input)
V: [v1, v2, v3]  (values from same input)
```
Each token attends to all tokens in the same sequence (including itself).
Purpose: Learn relationships within a sequence
- "cat" attends to "The" to understand it's a specific cat
- "sat" attends to "cat" to understand who is sitting
Cross-Attention
Q comes from one sequence, K and V from another.
Used in: Decoder cross-attention (attending to encoder output)
```
Decoder input: "The"
                ↓
Q: [q1]  (query from decoder)

Encoder output: [e1, e2, e3, e4]   (from "Der Hund ist schwarz")
                 ↓   ↓   ↓   ↓
K: [k1, k2, k3, k4]  (keys from encoder)
V: [v1, v2, v3, v4]  (values from encoder)
```
Purpose: Let the decoder "look at" the source sequence
- When generating "The", attend to "Der" (German article)
- When generating "dog", attend to "Hund" (German noun)
Shapes to remember
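In cross-attention with n_dec decoder positions and n_enc encoder positions, Q has shape (n_dec × d), K and V have shape (n_enc × d), and the attention-weight matrix is (n_dec × n_enc). A quick check with the 1-query, 4-position setup from the diagram (the numbers are arbitrary toy values):

```python
import math

# 1 decoder query, 4 encoder keys/values, model dim d = 3 (toy numbers).
Q = [[0.5, 0.1, 0.2]]                      # shape (1, 3): from the decoder
K = [[0.4, 0.0, 0.1], [0.9, 0.2, 0.3],
     [0.1, 0.8, 0.0], [0.2, 0.2, 0.7]]     # shape (4, 3): from the encoder
V = K                                      # shape (4, 3): from the encoder

# One score per encoder position → a (1 × 4) weight row after softmax.
scores = [sum(q * k for q, k in zip(Q[0], row)) for row in K]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Output: one (1 × 3) vector, a blend of the encoder values.
output = [sum(w * row[d] for w, row in zip(weights, V)) for d in range(3)]
print(len(weights), len(output))  # → 4 3
```

Note the asymmetry: the number of weights tracks the encoder length, while the number of outputs tracks the decoder length.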
1.5 What High Attention Weight Means
Interpreting Attention Patterns
When position i has high attention weight on position j:
```
Position i "pays attention to" / "looks at" / "focuses on" position j
```
This means:
- Token i needs information from token j
- Token j is relevant for understanding token i
- Output at position i will incorporate information from position j
Common Attention Patterns
1. Diagonal Pattern
```
     a b c d
a  [ ■ · · · ]   Each token attends mostly to itself
b  [ · ■ · · ]   (common in lower layers)
c  [ · · ■ · ]
d  [ · · · ■ ]
```
2. Column Pattern
```
     a b c d
a  [ · ■ · · ]   All tokens attend to one important token
b  [ · ■ · · ]   (e.g., subject of sentence)
c  [ · ■ · · ]
d  [ · ■ · · ]
```
3. Syntactic Pattern
```
      The cat sat on the mat
sat  [  ·   ■   ·  ·   ·  ■ ]   "sat" attends to "cat" (subject)
                                and "mat" (object)
```
4. Previous Token Pattern
```
     a b c d
a  [ ■ · · · ]   Each token attends to itself and the previous token
b  [ ■ ■ · · ]   (common in causal/decoder attention)
c  [ · ■ ■ · ]
d  [ · · ■ ■ ]
```
1.6 Attention as Dynamic Routing
Static vs Dynamic
Traditional neural networks: Fixed weights, same computation for every input
```
output = W × input   (W is fixed after training)
```
Attention: Weights depend on input, different computation for different inputs
```
weights = f(input)   (computed fresh for each input)
output  = weights × values
```
Why This Matters
Consider processing two sentences:
- "The animal didn't cross the road because it was too tired."
- "The animal didn't cross the road because it was too wide."
In sentence 1, "it" = animal
In sentence 2, "it" = road
A static network would process "it" the same way in both cases.
Attention computes different weights:
- Sentence 1: "it" attends strongly to "animal"
- Sentence 2: "it" attends strongly to "road"
The routing of information adapts to the input.
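This input dependence is easy to demonstrate: with fixed keys and values, two slightly different queries produce different weights and therefore route information differently. The vectors below are illustrative toys standing in for the "animal"/"road" contexts, not real embeddings:

```python
import math

def attend(query, keys, values):
    """Dot-product attention for a single query: returns (weights, output)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

keys = [[2.0, 0.0], [0.0, 2.0]]      # fixed: "animal"-like and "road"-like (toy)
values = [[1.0, 0.0], [0.0, 1.0]]

# Two slightly different contexts for "it" route information differently:
w1, _ = attend([1.0, 0.2], keys, values)  # context leans toward key 0
w2, _ = attend([0.2, 1.0], keys, values)  # context leans toward key 1
print([round(w, 2) for w in w1], [round(w, 2) for w in w2])
```

Nothing in the network changed between the two calls — only the input — yet the attention distribution flipped. That is the sense in which attention "routes" dynamically.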
1.7 Building Intuition: A Visual Walkthrough
Step-by-Step Example
Input: "The cat sat"
Step 1: Create embeddings
```
The → [0.2, 0.8, 0.1]
cat → [0.9, 0.3, 0.7]
sat → [0.4, 0.6, 0.5]
```
Step 2: Project to Q, K, V (simplified: identity projections, so they equal the input)
```
Q = K = V = [[0.2, 0.8, 0.1],
             [0.9, 0.3, 0.7],
             [0.4, 0.6, 0.5]]
```
Step 3: Compute attention scores (Q × K^T)
For query "sat" (row 3):
```
score(sat, The) = sat · The = 0.4×0.2 + 0.6×0.8 + 0.5×0.1 = 0.61
score(sat, cat) = sat · cat = 0.4×0.9 + 0.6×0.3 + 0.5×0.7 = 0.89
score(sat, sat) = sat · sat = 0.4×0.4 + 0.6×0.6 + 0.5×0.5 = 0.77
```
"sat" is most similar to "cat" (highest score)!
Step 4: Apply softmax (convert to probabilities)
```
weights = softmax([0.61, 0.89, 0.77]) ≈ [0.29, 0.38, 0.34]
```
Step 5: Weighted sum of values
```
output_sat = 0.29 × V_The + 0.38 × V_cat + 0.34 × V_sat
           = 0.29 × [0.2, 0.8, 0.1] + 0.38 × [0.9, 0.3, 0.7] + 0.34 × [0.4, 0.6, 0.5]
           ≈ [0.53, 0.54, 0.46]   (using unrounded weights)
```
The output for "sat" now carries information from all tokens, weighted by relevance!
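The whole walkthrough fits in a short script; running it reproduces each step's numbers (shown rounded to two decimals):

```python
import math

# Step 1–2: embeddings double as Q, K, V (identity projections).
emb = {"The": [0.2, 0.8, 0.1], "cat": [0.9, 0.3, 0.7], "sat": [0.4, 0.6, 0.5]}
tokens = ["The", "cat", "sat"]

# Step 3: dot-product scores for the query "sat".
scores = [sum(a * b for a, b in zip(emb["sat"], emb[t])) for t in tokens]
print([round(s, 2) for s in scores])   # → [0.61, 0.89, 0.77]

# Step 4: softmax turns scores into a probability distribution.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
print([round(w, 2) for w in weights])  # → [0.29, 0.38, 0.34]

# Step 5: weighted sum of the value vectors.
out = [sum(w * emb[t][d] for w, t in zip(weights, tokens)) for d in range(3)]
print([round(x, 2) for x in out])      # → [0.53, 0.54, 0.46]
```

Repeating this with the queries for "The" and "cat" fills in the other two rows of the full attention matrix.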
1.8 Why Attention Weights Sum to 1
The Softmax Normalization
Attention weights are computed using softmax:
```
attention_weight_i = exp(score_i) / Σ_j exp(score_j)
```
This guarantees:
- All weights are positive (exp is always > 0)
- Weights sum to 1 (by construction of softmax)
- Weights form a probability distribution
Why Normalization Matters
Without normalization:
- Weights could be arbitrarily large
- Output magnitude would depend on input magnitude
- Training would be unstable
With normalization:
- Output is a convex combination of values
- Magnitude is controlled
- Interpretable as "probability of attending to each position"
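These guarantees are easy to verify numerically. One implementation detail worth knowing (assumed here, not stated above): practical softmax code subtracts the maximum score before exponentiating to avoid overflow, which leaves the result mathematically unchanged.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

w = softmax([2.0, -1.0, 0.5])
print(all(x > 0 for x in w))   # → True  (all weights positive)
print(round(sum(w), 10))       # → 1.0   (weights sum to 1)
```

Because every weight is positive and they sum to 1, the attention output is always a convex combination of the values, which is exactly the "controlled magnitude" property described above.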
Summary
Key Intuitions
- Attention = Weighted Averaging: Each output is a weighted combination of inputs, where weights indicate relevance.
- Query-Key-Value:
- Query: What am I looking for?
- Key: What do I contain?
- Value: What can I provide?
- Self vs Cross:
- Self-attention: Q, K, V from same sequence
- Cross-attention: Q from one sequence, K/V from another
- Dynamic Routing: Attention weights are input-dependent, enabling context-sensitive processing.
- Interpretability: High attention weight from i to j means position i "focuses on" position j.
Mental Model
```
For each position i:
  1. Create a query: "What do I need?"
  2. Compare the query to all keys: "How relevant is each position?"
  3. Softmax: Convert relevance scores to probabilities
  4. Weighted sum: Gather information from all positions based on relevance
```
Exercises
Conceptual Questions
- In your own words, explain why attention is called a "soft lookup" mechanism.
- Given the sentence "The dog chased the cat and then it ran away", what tokens would you expect "it" to attend to? What would determine which interpretation is chosen?
- Why do we need separate Q, K, V projections instead of using the input directly for all three?
- Explain the difference between self-attention and cross-attention. Give an example of when each would be used.
Hands-on Drills
- Compute a 3-token attention by hand: pick small Q/K/V vectors, form QK^T, apply softmax, and produce the weighted sum.
- Change one query vector slightly and recompute weights—notice how routing shifts. This builds intuition for dynamic routing.
Thought Experiments
- If all attention weights were equal (uniform attention), what would the output represent? When might this be useful or problematic?
- Consider a very long document where the answer to a question is in the first paragraph, but the question is asked in the last paragraph. How does attention help solve this compared to an RNN?
- If we removed the softmax normalization and just used raw dot products, what problems might arise during training?
Next Section Preview
In the next section, we'll translate these intuitions into mathematics. We'll derive the scaled dot-product attention formula:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```
And understand exactly why each component is necessary.