Introduction
Now that we have intuition for what attention does, let's derive the mathematical formula precisely. By the end of this section, you'll understand every component of:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

This formula is deceptively simple—just a few matrix operations—but each component is carefully chosen. Understanding why each part exists will help you implement attention correctly and debug issues when they arise.
What Does This Formula Actually Do?
Think of attention as: Each token decides how much it should "look at" every other token.
| Symbol | Role | Intuition |
|---|---|---|
| Q (Query) | What the current token wants | "I'm looking for information about..." |
| K (Key) | What each token offers | "Here's what I can provide..." |
| V (Value) | The actual information | "Here's my content if you need it" |
The flow is elegantly simple:

Attention Flow: how information flows through the attention mechanism

1. Q compares with K → similarity scores
2. Softmax normalizes → weights sum to 1
3. Multiply by V → weighted values
4. Output → enriched representation

There is NO magic: just these four steps.
Why Is V Outside the Softmax?
This is a common point of confusion. Let's be crystal clear:
| Component | What It Contains | Should We Normalize It? |
|---|---|---|
| QKᵀ | Raw similarity scores (can be negative, large, small) | ✅ YES — convert to probabilities |
| V | Actual content/information | ❌ NO — it's not a score! |
The softmax must operate on scores, not content.
- QKᵀ gives raw compatibility scores: "How similar am I to each token?"
- Softmax converts these to attention weights: A_ij = how much token i should pay attention to token j
- V contains the actual information we want to retrieve

If V were inside the softmax, we would be squashing the content itself into a probability distribution, erasing the signs and magnitudes that encode meaning, instead of just normalizing the scores.
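To see the distinction concretely, here's a minimal sketch (toy numbers invented for illustration) contrasting softmax over scores with what would happen if we normalized V itself:

```python
import torch

scores = torch.tensor([[2.0, -1.0, 0.5]])   # raw similarities: any sign, any scale
V = torch.tensor([[1.0, -3.0],
                  [0.0,  2.0],
                  [4.0,  0.5]])              # content: signs and magnitudes carry meaning

weights = torch.softmax(scores, dim=-1)      # valid probabilities over the 3 keys
output = weights @ V                         # weighted average of the rows of V

print(weights)                  # non-negative, sums to 1
print(output)                   # V's signs and scales survive the averaging

# Normalizing V itself would force every row into a positive distribution,
# destroying the information the values carry:
print(torch.softmax(V, dim=-1))
```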
What Context Does the Attention Matrix Encode?
The attention matrix A = softmax(QKᵀ / √d_k) is an n × n matrix (for n tokens).
Each row tells you: for token i, how much it attends to every other token.
Example: Consider the sentence "The cat sat on the mat."
At the token "cat", the attention row might look like:
| Token | Attention Weight | Interpretation |
|---|---|---|
| The | 0.05 | Article, less important |
| cat | 0.60 | Self-attention (I am the subject) |
| sat | 0.25 | Verb related to me |
| on | 0.05 | Preposition, less relevant |
| the | 0.03 | Another article |
| mat | 0.02 | Object, not directly related to cat |
This row vector = context distribution. It tells us "cat" mostly cares about itself (0.60) and the verb "sat" (0.25).
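Pairing that attention row with hypothetical 2-d value vectors (invented here purely for illustration, not from the text), the context vector for "cat" is just a weighted average:

```python
import numpy as np

# Attention row for "cat" from the table above
weights = np.array([0.05, 0.60, 0.25, 0.05, 0.03, 0.02])
assert abs(weights.sum() - 1.0) < 1e-9   # a valid distribution

# Hypothetical value vectors for: The, cat, sat, on, the, mat
V = np.array([[0.1, 0.0],
              [1.0, 0.2],
              [0.3, 0.9],
              [0.0, 0.1],
              [0.1, 0.0],
              [0.2, 0.4]])

context = weights @ V   # dominated by "cat", with a contribution from "sat"
print(context)
```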
What information does attention provide?
- Who depends on whom — relational structure in the sequence
- Pronoun resolution — "He" attends strongly to "John" in "John went to the store because he was hungry"
- Grammar structure — subject-verb relations, modifiers
- Semantic relevance — which words matter for predicting the next token
The Google Search Analogy
Here's the most intuitive way to understand attention:
| Attention | Google Search |
|---|---|
| Q (Query) | Your search query |
| K (Key) | Webpage SEO tags/titles |
| V (Value) | Actual webpage content |
| QKᵀ | Compare query to all page tags |
| softmax | Rank pages by relevance |
| A × V | Return weighted combination of relevant content |
Google Search as Attention: understanding Q, K, V through a familiar analogy
Ultra-simple summary
Q: What I need | K: What others offer | V: The actual information
Compare → Weight → Combine → Context-aware representation
2.1 The Attention Formula Breakdown
The Complete Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Let's identify each component:
| Component | Name | Purpose |
|---|---|---|
| Q | Query matrix | What we're looking for |
| K | Key matrix | What each position offers |
| V | Value matrix | Information to retrieve |
| QKᵀ | Attention scores | Raw similarity between queries and keys |
| √d_k | Scaling factor | Prevents extreme softmax outputs |
| softmax | Normalization | Converts scores to probabilities |
Dimensions
For a single sequence:

- Q: [n_q, d_k]
- K: [n_k, d_k]
- V: [n_k, d_v]
- Output: [n_q, d_v]

Key constraint: Q and K must share the dimension d_k (so QKᵀ is defined), and K and V must share the length n_k (one value per key).

For batched operations with batch size B, every matrix gains a leading batch dimension: Q is [B, n_q, d_k], K is [B, n_k, d_k], V is [B, n_k, d_v].
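These shape rules can be checked directly; a minimal sketch with arbitrary sizes (all names illustrative):

```python
import torch

B, n_q, n_k, d_k, d_v = 2, 5, 7, 64, 32   # arbitrary sizes for illustration

Q = torch.randn(B, n_q, d_k)
K = torch.randn(B, n_k, d_k)
V = torch.randn(B, n_k, d_v)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [B, n_q, n_k]
A = torch.softmax(scores, dim=-1)               # [B, n_q, n_k]
out = A @ V                                     # [B, n_q, d_v]

print(scores.shape, out.shape)
```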
2.2 Dot Product as Similarity Measure
Why Dot Product?
The dot product between two vectors measures their similarity:

q · k = |q| |k| cos θ

where θ is the angle between the vectors.

Properties:
- If vectors point in the same direction (θ = 0°): dot product is high (positive)
- If vectors are orthogonal (θ = 90°): dot product is zero
- If vectors point in opposite directions (θ = 180°): dot product is negative
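A quick NumPy check of the three cases (vectors chosen for illustration):

```python
import numpy as np

a = np.array([1.0, 0.0])
same = np.array([2.0, 0.0])      # θ = 0°
orth = np.array([0.0, 3.0])      # θ = 90°
opp  = np.array([-1.0, 0.0])     # θ = 180°

print(a @ same)  # 2.0  (positive: aligned)
print(a @ orth)  # 0.0  (orthogonal)
print(a @ opp)   # -1.0 (negative: opposed)
```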
Computing QKᵀ
When we compute QKᵀ:

The result is a matrix where:
- (QKᵀ)ᵢⱼ = dot product of query i and key j
- Shape: [n_q, n_k]

Example:

```python
Q = [[1, 0],   # Query 1
     [0, 1]]   # Query 2

K = [[1, 0],   # Key 1
     [1, 1],   # Key 2
     [0, 1]]   # Key 3
```

Reading the result:
- Query 1 is similar to Key 1 and Key 2 (score = 1), dissimilar to Key 3 (score = 0)
- Query 2 is similar to Key 2 and Key 3 (score = 1), dissimilar to Key 1 (score = 0)
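Running this example confirms the reading above:

```python
import numpy as np

Q = np.array([[1, 0],
              [0, 1]])
K = np.array([[1, 0],
              [1, 1],
              [0, 1]])

scores = Q @ K.T   # entry (i, j) = query i · key j
print(scores)
# [[1 1 0]
#  [0 1 1]]
```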
2.3 The Critical Scaling Factor: √d_k
The Problem Without Scaling
Consider what happens as d_k (the dimension of keys/queries) increases:

If q and k have elements drawn from a distribution with mean 0 and variance 1:
- Each element of QKᵀ is a sum of d_k terms
- By the Central Limit Theorem: E[q · k] = 0, Var[q · k] = d_k

So the standard deviation of the scores grows like √d_k, pushing softmax inputs into extreme ranges.
Why Extreme Values Are Bad
Softmax with extreme values produces nearly one-hot outputs: softmax([30, 10, 5]) ≈ [1, 0, 0].

Problems:
- Vanishing gradients: softmax saturates, so gradients ≈ 0
- No learning: the model can't adjust attention patterns
- Numerical instability: overflow in exp(x) for large x
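Both effects are easy to observe; a small sketch (scale factor chosen for illustration):

```python
import torch

small = torch.tensor([1.0, 0.5, -0.5])
large = small * 30                      # what unscaled scores look like at high d_k

print(torch.softmax(small, dim=-1))     # spread-out weights
print(torch.softmax(large, dim=-1))     # nearly one-hot

# Gradients vanish once softmax saturates:
x = large.clone().requires_grad_(True)
torch.softmax(x, dim=-1)[0].backward()
print(x.grad)                           # entries near zero
```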
The Solution: Scale by √d_k

Dividing by √d_k normalizes the variance of the scores.

Mathematical Derivation

Let q, k be vectors with d_k independent elements of mean 0 and variance 1. Each score q · k is a sum of d_k such products, so Var[q · k] = d_k.

After scaling: Var[q · k / √d_k] = d_k / d_k = 1

This is crucial: variance 1 regardless of d_k keeps softmax in its well-behaved range.
Empirical Demonstration
```python
import torch

d_k = 512
q = torch.randn(1000, d_k)  # 1000 queries
k = torch.randn(1000, d_k)  # 1000 keys

# Without scaling
scores_unscaled = q @ k.T
print(f"Unscaled std: {scores_unscaled.std():.2f}")  # ≈ 22.6 (≈ √512)

# With scaling
scores_scaled = (q @ k.T) / (d_k ** 0.5)
print(f"Scaled std: {scores_scaled.std():.2f}")  # ≈ 1.0
```

2.4 Softmax Normalization
Purpose of Softmax
Softmax converts raw scores into a probability distribution:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

Properties:
- All outputs positive: softmax(x)ᵢ > 0 for all i
- Outputs sum to 1: Σᵢ softmax(x)ᵢ = 1
- Preserves ordering: if xᵢ > xⱼ, then softmax(x)ᵢ > softmax(x)ⱼ
- Differentiable: enables gradient-based learning
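All four properties can be verified in a few lines (input vector chosen for illustration):

```python
import torch

x = torch.tensor([2.0, -1.0, 0.5, 0.5])
s = torch.softmax(x, dim=-1)

assert (s > 0).all()                               # all outputs positive
assert torch.isclose(s.sum(), torch.tensor(1.0))   # outputs sum to 1
assert s[0] > s[2]                                 # ordering preserved (x[0] > x[2])
print(s)
```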
Applying Softmax to Attention Scores
In A = softmax(QKᵀ / √d_k), each row sums to 1:
- Row i contains the attention weights for query i
- A_ij = how much query i attends to key/value j
Softmax Temperature
The scaling factor acts like an inverse temperature T in softmax(x / T):
- Lower T (higher scaling): sharper, more focused attention
- Higher T (lower scaling): softer, more uniform attention

Temperature intuition: T divides the gaps between scores, so small T exaggerates differences and large T washes them out.
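The effect is easy to see by sweeping T over the same scores (values chosen for illustration):

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 10.0]:
    print(T, torch.softmax(scores / T, dim=-1))
# T = 0.1  → nearly one-hot (sharp)
# T = 10.0 → nearly uniform (soft)
```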
2.5 Weighted Sum with Values
The Final Computation
After computing attention weights, we use them to aggregate values:

Output = A V

Shape transformation: [n_q, n_k] × [n_k, d_v] → [n_q, d_v]

What This Computes

For each query position i:

outputᵢ = Σⱼ A_ij · vⱼ

This is a weighted average of all value vectors, where the weights indicate relevance.
Example
```python
attention_weights = [[0.7, 0.2, 0.1],  # Query 1: mostly attends to Value 1
                     [0.1, 0.8, 0.1]]  # Query 2: mostly attends to Value 2

V = [[1, 0],   # Value 1
     [0, 1],   # Value 2
     [1, 1]]   # Value 3

# Output = attention_weights @ V:
# Query 1: 0.7*[1,0] + 0.2*[0,1] + 0.1*[1,1] = [0.8, 0.3]
# Query 2: 0.1*[1,0] + 0.8*[0,1] + 0.1*[1,1] = [0.2, 0.9]
```

2.6 Complete Algorithm
Pseudocode
```python
def attention(Q, K, V, mask=None):
    # Step 1: Compute raw attention scores
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1)  # [n_q, n_k]

    # Step 2: Scale scores
    scores = scores / sqrt(d_k)

    # Step 3: Apply mask (optional)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -infinity)

    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, dim=-1)

    # Step 5: Weighted sum of values
    output = attention_weights @ V  # [n_q, d_v]

    return output, attention_weights
```

The Algorithm in Equations

S = QKᵀ / √d_k    A = softmax(S)    Output = A V
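As a runnable sketch (PyTorch assumed; names are illustrative), the same steps:

```python
import math
import torch

def attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [n_q, n_k]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # rows sum to 1
    return weights @ V, weights                         # [n_q, d_v]

Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 16)
out, w = attention(Q, K, V)
print(out.shape, w.shape)
```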
Computational Complexity
| Operation | Complexity |
|---|---|
| QKᵀ | O(n_q × n_k × d_k) |
| Softmax | O(n_q × n_k) |
| Weights × V | O(n_q × n_k × d_v) |
| Total | O(n_q × n_k × d) |
For self-attention where n_q = n_k = n, the total cost is O(n² · d).

The quadratic bottleneck: the n × n attention matrix makes compute and memory grow quadratically with sequence length, which is the main obstacle to very long contexts.
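A back-of-the-envelope sketch of how the attention matrix alone grows (float32; sequence lengths chosen for illustration):

```python
# Doubling the sequence length quadruples the attention matrix:
for n in [1024, 2048, 4096]:
    entries = n * n                              # one weight per (query, key) pair
    mib = entries * 4 / 2**20                    # 4 bytes per float32 entry
    print(f"n={n}: {entries} entries, {mib:.0f} MiB")
```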
2.7 Gradient Flow Through Attention
Why Gradients Flow Well
The attention mechanism has favorable gradient properties:
- Softmax gradients: Non-zero gradients as long as attention isn't saturated (thanks to scaling!)
- Direct paths: Every output position has a direct connection to every input position
- No vanishing through time: Unlike RNNs, no multiplicative chains over sequence length
Gradient of Softmax
For softmax output s = softmax(x):

∂sᵢ/∂xⱼ = sᵢ (δᵢⱼ − sⱼ)

Where δᵢⱼ is the Kronecker delta (1 if i = j, else 0).

Key insight: the gradient shrinks toward zero as s approaches one-hot, which is exactly why we scale the scores to keep softmax away from saturation.
Backpropagation Through Attention
All operations are differentiable with well-behaved gradients!
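The analytic softmax Jacobian above can be checked against autograd (input vector chosen for illustration):

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([1.0, 0.5, -0.2])
s = torch.softmax(x, dim=-1)

# Analytic Jacobian: ∂s_i/∂x_j = s_i (δ_ij − s_j)
analytic = torch.diag(s) - torch.outer(s, s)

# Jacobian computed by autograd
auto = jacobian(lambda t: torch.softmax(t, dim=-1), x)

print(torch.allclose(analytic, auto, atol=1e-6))
```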
2.8 Mathematical Notation Summary
Standard Notation
| Symbol | Meaning | Shape |
|---|---|---|
| Q | Query matrix | [n_q, d_k] |
| K | Key matrix | [n_k, d_k] |
| V | Value matrix | [n_k, d_v] |
| d_k | Key/query dimension | scalar |
| d_v | Value dimension | scalar |
| n_q | Number of queries | scalar |
| n_k | Number of keys/values | scalar |
| A | Attention weights | [n_q, n_k] |
The Formula in Matrix Form

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Batched Form

With batch dimension B, the same formula applies per batch element: Q is [B, n_q, d_k], K is [B, n_k, d_k], V is [B, n_k, d_v], and the output is [B, n_q, d_v].

With Multi-Head Attention (Preview)

Multi-head attention runs several independent copies of this formula in parallel on lower-dimensional projections and concatenates the results; we'll cover it in detail later.
Summary
The Complete Picture

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Component purposes:
- QKᵀ: Compute similarity between each query and all keys
- √d_k: Normalize variance to prevent softmax saturation
- softmax: Convert scores to a probability distribution
- × V: Aggregate values weighted by attention

Key Equations at a Glance

S = QKᵀ / √d_k    A = softmax(S)    Output = A V

Critical Insights
- Scaling is essential: without √d_k, training fails for large d_k
- Softmax creates focus: converts scores to a probability distribution
- Output is a weighted average: each output is a combination of all values
- O(n²) complexity: quadratic in sequence length, limiting long sequences
Exercises
Mathematical Derivations
- Variance calculation: Prove that if q and k are vectors with d_k independent elements of mean 0 and variance 1, then Var[q · k] = d_k.
- Softmax property: Prove that Σᵢ softmax(x)ᵢ = 1 for any input vector x.
- Temperature effect: If we use softmax(x / T), what happens to the output distribution as T → 0? As T → ∞?
Calculation Practice
4. Given:

```python
Q = [[1, 0]]
K = [[1, 0], [0, 1], [1, 1]]
V = [[1, 2], [3, 4], [5, 6]]
d_k = 2
```

Compute the attention output step by step:
- Compute QKᵀ
- Apply scaling: divide by √d_k
- Apply softmax
- Compute the final output: A V

5. What happens to attention weights if one score is much larger than the others? Compute a softmax example to check your intuition.
Conceptual Questions
- Why can't we just normalize QKᵀ by dividing by its maximum value instead of √d_k?
- If we wanted attention to be "sharper" (more focused on one position), what could we change in the formula?
- Why do K and V need to have the same sequence length, while Q can have a different length?
Next Section Preview
In the next section, we'll work through a complete numerical example with actual numbers. You'll compute attention by hand (or calculator) on a tiny 3-token, 4-dimensional example, building confidence through concrete verification.