Chapter 2
Section 9 of 75

Scaled Dot-Product Attention Formula

Attention Mechanism From Scratch

Introduction

Now that we have intuition for what attention does, let's derive the mathematical formula precisely. By the end of this section, you'll understand every component of:

\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V

This formula is deceptively simple—just a few operations—but each component is carefully chosen. Understanding why each part exists will help you implement attention correctly and debug issues when they arise.

What Does This Formula Actually Do?

Think of attention as: Each token decides how much it should "look at" every other token.

| Symbol | Role | Intuition |
| --- | --- | --- |
| Q (Query) | What the current token wants | "I'm looking for information about..." |
| K (Key) | What each token offers | "Here's what I can provide..." |
| V (Value) | The actual information | "Here's my content if you need it" |

The flow is elegantly simple:

Attention Flow

How information flows through the attention mechanism:

  1. Compare: Q compares with K. QKᵀ produces similarity scores.
  2. Weight: softmax(·) normalizes the scores so the weights sum to 1.
  3. Combine: multiplying by V (A × V) yields the weighted values.
  4. Context: the output is an enriched, context-aware representation.

There is NO magic

Attention is just: compare → weight → combine. That's it. The formula compares what you need with what others offer, converts scores to weights, then combines information accordingly.
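The compare → weight → combine loop fits in a few lines of NumPy. Here's a minimal sketch using three made-up 2-D token vectors, with Q = K = V for simplicity:

```python
import numpy as np

# Three made-up 2-D token vectors; here Q = K = V = X for simplicity.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

scores = X @ X.T                                # compare: similarity of every pair
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # weight: each row sums to 1
output = weights @ X                            # combine: blend of all token vectors

print(weights.sum(axis=-1))  # [1. 1. 1.]
print(output.shape)          # (3, 2)
```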

Why Is V Outside the Softmax?

This is a common point of confusion. Let's be crystal clear:

| Component | What It Contains | Should We Normalize It? |
| --- | --- | --- |
| QKᵀ | Raw similarity scores (can be negative, large, small) | ✅ YES — convert to probabilities |
| V | Actual content/information | ❌ NO — it's not a score! |

The softmax must operate on scores, not content.

  • $QK^T$ gives raw compatibility scores: "How similar am I to each token?"
  • Softmax converts these to attention weights $\alpha_{ij}$ = how much token $i$ should pay attention to token $j$
  • $V$ contains the actual information we want to retrieve

If V were inside softmax...

You would be turning actual content into a probability distribution — that makes no sense! We want to normalize the attention distribution, not the values.

What Context Does the Attention Matrix Encode?

The attention matrix $A = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)$ is an $n \times n$ matrix (for $n$ tokens).

Each row $A_i$ tells you: for token $i$, how much it attends to every other token.

Example: Consider the sentence "The cat sat on the mat."

At the token "cat", the attention row might look like:

| Token | Attention Weight | Interpretation |
| --- | --- | --- |
| The | 0.05 | Article, less important |
| cat | 0.60 | Self-attention (I am the subject) |
| sat | 0.25 | Verb related to me |
| on | 0.05 | Preposition, less relevant |
| the | 0.03 | Another article |
| mat | 0.02 | Object, not directly related to cat |

This row vector = context distribution. It tells us "cat" mostly cares about itself (0.60) and the verb "sat" (0.25).

What information does attention provide?

  • Who depends on whom — relational structure in the sequence
  • Pronoun resolution — "He" attends strongly to "John" in "John went to the store because he was hungry"
  • Grammar structure — subject-verb relations, modifiers
  • Semantic relevance — which words matter for predicting the next token

The Google Search Analogy

Here's the most intuitive way to understand attention:

| Attention | Google Search |
| --- | --- |
| Q (Query) | Your search query |
| K (Key) | Webpage SEO tags/titles |
| V (Value) | Actual webpage content |
| QKᵀ | Compare query to all page tags |
| softmax | Rank pages by relevance |
| A × V | Return weighted combination of relevant content |

Google Search as Attention

Understanding Q, K, V through a familiar analogy:

  1. Your query (Q): "best pizza recipe"
  2. Compare with page tags (K) → QKᵀ:
     • "Italian pizza recipe homemade": 0.92 (high match!)
     • "Best burger recipes 2024": 0.15 (low match)
     • "Pizza dough techniques": 0.78 (good match)
     • "History of pizza in Italy": 0.45 (partial)
  3. Softmax → attention weights: 52% (pizza recipe), 3% (burger), 35% (dough techniques), 10% (history). Sum = 1.00 (a probability distribution!)
  4. Multiply by values (V) → contextual blend: 52% of the pizza recipe content, 3% of the burger content, 35% of the dough techniques content, 10% of the history content.

Output: a contextual blend of the most relevant information!

Ultra-simple summary

Attention = Weighted Information Lookup
Q: What I need | K: What others offer | V: The actual information
Compare → Weight → Combine → Context-aware representation

2.1 The Attention Formula Breakdown

The Complete Formula

\text{Attention}(Q, K, V) = \underbrace{\text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)}_{\text{attention weights } A} \cdot V

Breaking down each component:

QK^T \quad \overset{\text{scale}}{\longrightarrow} \quad \frac{QK^T}{\sqrt{d_k}} \quad \overset{\text{softmax}}{\longrightarrow} \quad A \quad \overset{\times V}{\longrightarrow} \quad \text{Output}

Let's identify each component:

| Component | Name | Purpose |
| --- | --- | --- |
| Q | Query matrix | What we're looking for |
| K | Key matrix | What each position offers |
| V | Value matrix | Information to retrieve |
| QKᵀ | Attention scores | Raw similarity between queries and keys |
| √d_k | Scaling factor | Prevents extreme softmax outputs |
| softmax | Normalization | Converts scores to probabilities |

Dimensions

For a single sequence:

\begin{aligned} Q &\in \mathbb{R}^{n_q \times d_k} && (n_q \text{ queries, each of dimension } d_k) \\ K &\in \mathbb{R}^{n_k \times d_k} && (n_k \text{ keys, each of dimension } d_k) \\ V &\in \mathbb{R}^{n_k \times d_v} && (n_k \text{ values, each of dimension } d_v) \end{aligned}

Key constraint

$K$ and $V$ must have the same sequence length $n_k$ because they come from the same source. The dimensions $d_k$ and $d_v$ can differ.

For batched operations with batch size $b$:

Q \in \mathbb{R}^{b \times n_q \times d_k}, \quad K \in \mathbb{R}^{b \times n_k \times d_k}, \quad V \in \mathbb{R}^{b \times n_k \times d_v}
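A quick shape check (with arbitrary example sizes) confirms how these dimensions interact: $n_q$ and $n_k$ differ freely, while $K$ and $V$ share $n_k$.

```python
import torch

# Arbitrary example sizes: batch of 2, 5 queries attending over 7 keys/values.
b, n_q, n_k, d_k, d_v = 2, 5, 7, 16, 32

Q = torch.randn(b, n_q, d_k)
K = torch.randn(b, n_k, d_k)  # K and V share the sequence length n_k,
V = torch.randn(b, n_k, d_v)  # but d_v may differ from d_k

A = torch.softmax(Q @ K.transpose(-2, -1) / d_k**0.5, dim=-1)
output = A @ V

print(A.shape)       # torch.Size([2, 5, 7])
print(output.shape)  # torch.Size([2, 5, 32])
```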

2.2 Dot Product as Similarity Measure

Why Dot Product?

The dot product between two vectors measures their similarity:

\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \, \cos(\theta)

Where $\theta$ is the angle between the vectors.

Properties:

  • If the vectors point in the same direction ($\theta = 0$): the dot product is large and positive
  • If the vectors are orthogonal ($\theta = 90°$): the dot product is zero
  • If the vectors point in opposite directions ($\theta = 180°$): the dot product is negative

Computing QKᵀ

When we compute $QK^T$:

\text{scores} = Q \cdot K^T \quad \Rightarrow \quad \text{scores}_{ij} = \vec{q}_i \cdot \vec{k}_j

The result is a matrix where:

  • $\text{scores}[i][j]$ = dot product of query $i$ and key $j$
  • Shape: $[n_q, n_k]$

Example:

```python
Q = [[1, 0],    # Query 1
     [0, 1]]    # Query 2

K = [[1, 0],    # Key 1
     [1, 1],    # Key 2
     [0, 1]]    # Key 3
```

QK^T = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}

Reading the result:

  • Query 1 is similar to Key 1 and Key 2 (score = 1), dissimilar to Key 3 (score = 0)
  • Query 2 is similar to Key 2 and Key 3 (score = 1), dissimilar to Key 1 (score = 0)
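The same product can be checked with NumPy:

```python
import numpy as np

Q = np.array([[1, 0],   # Query 1
              [0, 1]])  # Query 2
K = np.array([[1, 0],   # Key 1
              [1, 1],   # Key 2
              [0, 1]])  # Key 3

print(Q @ K.T)
# [[1 1 0]
#  [0 1 1]]
```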

2.3 The Critical Scaling Factor: √d_k

The Problem Without Scaling

Consider what happens as $d_k$ (the dimension of keys/queries) increases:

If $Q$ and $K$ have elements drawn from a distribution with mean 0 and variance 1:

  • Each element of $QK^T$ is a sum of $d_k$ independent products
  • By independence, the sum has $\text{mean} \approx 0$ and $\text{variance} \approx d_k$

\text{Large } d_k \;\rightarrow\; \text{large variance} \;\rightarrow\; \text{extreme values}

Why Extreme Values Are Bad

Softmax with extreme values produces nearly one-hot outputs:

\begin{aligned} \text{softmax}([1, 2, 3]) &\approx [0.09, 0.24, 0.67] && \text{(smooth)} \\ \text{softmax}([10, 20, 30]) &\approx [0.00, 0.00, 1.00] && \text{(nearly one-hot)} \\ \text{softmax}([100, 200, 300]) &\rightarrow \text{overflow!} && \text{(numerical issues)} \end{aligned}

Problems:

  1. Vanishing gradients: softmax saturates, so $\nabla \rightarrow 0$
  2. No learning: the model can't adjust its attention patterns
  3. Numerical instability: overflow in $\exp(\cdot)$
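These failure modes are easy to reproduce. The sketch below uses made-up scores with the same ordering at three magnitudes; the weights collapse toward one-hot and the gradient of a non-winning weight vanishes:

```python
import torch

# Same relative ordering of scores, increasingly extreme magnitudes.
for scale in (1.0, 10.0, 100.0):
    x = (torch.tensor([1.0, 2.0, 3.0]) * scale).requires_grad_(True)
    s = torch.softmax(x, dim=-1)
    s[0].backward()  # gradient of the smallest attention weight
    print(f"scale={scale:5.1f}  weights={s.detach().numpy().round(4)}  "
          f"grad norm={x.grad.norm():.2e}")
```

At scale 100 the weights are effectively one-hot and the gradient norm is essentially zero: the model can no longer learn through this softmax.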

The Solution: Scale by √d_k

Dividing by $\sqrt{d_k}$ normalizes the variance:

\text{If } \text{Var}[QK^T] \approx d_k, \quad \text{then } \text{Var}\left[ \frac{QK^T}{\sqrt{d_k}} \right] \approx 1

Mathematical Derivation

Let $\vec{q}, \vec{k}$ be vectors with independent elements $\sim \mathcal{N}(0, 1)$:

\begin{aligned} \mathbb{E}[\vec{q} \cdot \vec{k}] &= \mathbb{E}\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} \mathbb{E}[q_i] \cdot \mathbb{E}[k_i] = 0 \\[1em] \text{Var}[\vec{q} \cdot \vec{k}] &= \text{Var}\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} \text{Var}[q_i k_i] = d_k \end{aligned}

After scaling:

\text{Var}\left[ \frac{\vec{q} \cdot \vec{k}}{\sqrt{d_k}} \right] = \frac{\text{Var}[\vec{q} \cdot \vec{k}]}{d_k} = \frac{d_k}{d_k} = 1

This is crucial!

The scaling factor $\sqrt{d_k}$ ensures stable gradients regardless of the embedding dimension. Without it, larger models would be harder to train.

Empirical Demonstration

```python
import torch

d_k = 512
q = torch.randn(1000, d_k)  # 1000 queries
k = torch.randn(1000, d_k)  # 1000 keys

# Without scaling
scores_unscaled = q @ k.T
print(f"Unscaled std: {scores_unscaled.std():.2f}")  # ≈ 22.6 (≈ √512)

# With scaling
scores_scaled = (q @ k.T) / (d_k ** 0.5)
print(f"Scaled std: {scores_scaled.std():.2f}")  # ≈ 1.0
```

2.4 Softmax Normalization

Purpose of Softmax

Softmax converts raw scores into a probability distribution:

\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

Properties:

  1. All outputs positive: $e^x > 0$ for all $x$
  2. Outputs sum to 1: $\sum_i \text{softmax}(x)_i = 1$
  3. Preserves ordering: if $x_i > x_j$, then $\text{softmax}(x)_i > \text{softmax}(x)_j$
  4. Differentiable: enables gradient-based learning
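The first three properties are easy to verify numerically; here is a quick check on an arbitrary made-up score vector:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0, 3.1])  # arbitrary made-up scores
s = torch.softmax(x, dim=-1)

print((s > 0).all().item())                             # True: all positive
print(round(s.sum().item(), 6))                         # 1.0: sums to one
print(torch.equal(torch.argsort(s), torch.argsort(x)))  # True: order preserved
```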

Applying Softmax to Attention Scores

A = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) \in \mathbb{R}^{n_q \times n_k}

Each row sums to 1:

  • Row $i$ contains the attention weights for query $i$
  • $A_{ij}$ = how much query $i$ attends to key/value $j$

Softmax Temperature

The scaling factor acts like inverse temperature:

\text{softmax}\left( \frac{\text{scores}}{T} \right) \quad \text{where } T = \sqrt{d_k}

  • Lower $T$ (less scaling): sharper, more focused attention
  • Higher $T$ (more scaling): softer, more uniform attention

Temperature intuition

Think of temperature like adjusting focus on a camera. Low temperature = sharp focus on one thing. High temperature = everything equally blurry.
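A small sweep over a made-up score vector shows the effect:

```python
import torch

scores = torch.tensor([1.0, 2.0, 3.0])
for T in (0.1, 1.0, 10.0):
    # Low T: nearly one-hot. High T: nearly uniform.
    print(f"T={T:4.1f}  weights={torch.softmax(scores / T, dim=-1).numpy().round(3)}")
```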

2.5 Weighted Sum with Values

The Final Computation

After computing attention weights, we use them to aggregate values:

\text{Output} = A \cdot V = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) \cdot V

Shape transformation:

\underbrace{A}_{n_q \times n_k} \times \underbrace{V}_{n_k \times d_v} = \underbrace{\text{Output}}_{n_q \times d_v}

What This Computes

For each query position $i$:

\text{output}_i = \sum_{j=1}^{n_k} A_{ij} \cdot \vec{v}_j

This is a weighted average of all value vectors, where weights indicate relevance.

Example

```python
attention_weights = [[0.7, 0.2, 0.1],    # Query 1: mostly attends to Value 1
                     [0.1, 0.8, 0.1]]    # Query 2: mostly attends to Value 2

V = [[1, 0],    # Value 1
     [0, 1],    # Value 2
     [1, 1]]    # Value 3
```

\begin{aligned} \text{output}_1 &= 0.7 \cdot [1, 0] + 0.2 \cdot [0, 1] + 0.1 \cdot [1, 1] = [0.8, 0.3] \\ \text{output}_2 &= 0.1 \cdot [1, 0] + 0.8 \cdot [0, 1] + 0.1 \cdot [1, 1] = [0.2, 0.9] \end{aligned}
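The same arithmetic, verified with NumPy:

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

print(A @ V)
# [[0.8 0.3]
#  [0.2 0.9]]
```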

2.6 Complete Algorithm

Implementation

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Step 1: Compute raw attention scores
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1)  # [n_q, n_k]

    # Step 2: Scale scores
    scores = scores / math.sqrt(d_k)

    # Step 3: Apply mask (optional)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))

    # Step 4: Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Step 5: Weighted sum of values
    output = attention_weights @ V  # [n_q, d_v]

    return output, attention_weights
```
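As a sanity check, the five-step computation can be compared against PyTorch's built-in fused implementation, `F.scaled_dot_product_attention` (available since PyTorch 2.0); the two should agree up to floating-point tolerance:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(1, 4, 8)    # batch 1, 4 queries, d_k = 8
K = torch.randn(1, 6, 8)    # batch 1, 6 keys,    d_k = 8
V = torch.randn(1, 6, 16)   # batch 1, 6 values,  d_v = 16

# Manual computation following the five steps above
weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1]), dim=-1)
manual = weights @ V

# PyTorch's built-in fused version
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-5))
```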

The Algorithm in Equations

\boxed{ \begin{aligned} & \textbf{Step 1: } \text{scores} = Q \cdot K^T \\[0.5em] & \textbf{Step 2: } \text{scores} = \frac{\text{scores}}{\sqrt{d_k}} \\[0.5em] & \textbf{Step 3: } \text{scores} = \text{scores} + \text{mask} \quad \text{(optional)} \\[0.5em] & \textbf{Step 4: } A = \text{softmax}(\text{scores}) \\[0.5em] & \textbf{Step 5: } \text{output} = A \cdot V \end{aligned} }

Computational Complexity

| Operation | Complexity |
| --- | --- |
| QKᵀ | O(n_q × n_k × d_k) |
| Softmax | O(n_q × n_k) |
| Weights × V | O(n_q × n_k × d_v) |
| Total | O(n_q × n_k × d) |

For self-attention, where $n_q = n_k = n$:

\mathcal{O}(n^2 \cdot d) \quad \text{(quadratic in sequence length!)}

The quadratic bottleneck

This $O(n^2)$ complexity is why transformers struggle with very long sequences. For a 4096-token context, you need roughly 16 million attention computations per layer!
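A back-of-the-envelope sketch makes the growth concrete. The score matrix alone has $n^2$ entries per head, per layer:

```python
# Entries in the n x n score matrix, and their float32 size, per head per layer.
for n in (512, 1024, 2048, 4096):
    entries = n * n
    mib = entries * 4 / 2**20  # 4 bytes per float32
    print(f"n={n:5d}  entries={entries:>12,}  ~{mib:6.0f} MiB")
```

Doubling the context length quadruples both the compute and the memory for the score matrix.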

2.7 Gradient Flow Through Attention

Why Gradients Flow Well

The attention mechanism has favorable gradient properties:

  1. Softmax gradients: Non-zero gradients as long as attention isn't saturated (thanks to scaling!)
  2. Direct paths: Every output position has a direct connection to every input position
  3. No vanishing through time: Unlike RNNs, no multiplicative chains over sequence length

Gradient of Softmax

For the softmax output $s = \text{softmax}(x)$:

\frac{\partial s_i}{\partial x_j} = s_i (\delta_{ij} - s_j) = \begin{cases} s_i(1 - s_i) & \text{if } i = j \\ -s_i s_j & \text{if } i \neq j \end{cases}

Where $\delta_{ij}$ is the Kronecker delta (1 if $i = j$, else 0).

Key insight

Gradients remain nonzero as long as $s_i$ is not too close to 0 or 1 (i.e., not saturated). The scaling factor prevents saturation!
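The closed-form Jacobian above can be checked against autograd on an arbitrary made-up score vector:

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0])
s = torch.softmax(x, dim=-1)

# Closed form: ds_i/dx_j = s_i * (delta_ij - s_j) = diag(s) - s s^T
closed_form = torch.diag(s) - torch.outer(s, s)

# Autograd's Jacobian of softmax should match.
auto = torch.autograd.functional.jacobian(
    lambda t: torch.softmax(t, dim=-1), x)

print(torch.allclose(closed_form, auto, atol=1e-6))
```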

Backpropagation Through Attention

\begin{aligned} \frac{\partial \mathcal{L}}{\partial V} &= A^T \cdot \frac{\partial \mathcal{L}}{\partial \text{output}} \\[0.5em] \frac{\partial \mathcal{L}}{\partial A} &= \frac{\partial \mathcal{L}}{\partial \text{output}} \cdot V^T \\[0.5em] \frac{\partial \mathcal{L}}{\partial Q} &= \frac{\partial \mathcal{L}}{\partial \text{scores}} \cdot K \cdot \frac{1}{\sqrt{d_k}} \\[0.5em] \frac{\partial \mathcal{L}}{\partial K} &= \left( \frac{\partial \mathcal{L}}{\partial \text{scores}} \right)^T \cdot Q \cdot \frac{1}{\sqrt{d_k}} \end{aligned}

All operations are differentiable with well-behaved gradients!


2.8 Mathematical Notation Summary

Standard Notation

| Symbol | Meaning | Shape |
| --- | --- | --- |
| Q | Query matrix | [n_q, d_k] |
| K | Key matrix | [n_k, d_k] |
| V | Value matrix | [n_k, d_v] |
| d_k | Key/query dimension | scalar |
| d_v | Value dimension | scalar |
| n_q | Number of queries | scalar |
| n_k | Number of keys/values | scalar |
| A | Attention weights | [n_q, n_k] |

The Formula in Matrix Form

\begin{aligned} A &= \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) \in \mathbb{R}^{n_q \times n_k} \\[1em] \text{Output} &= A \cdot V \in \mathbb{R}^{n_q \times d_v} \end{aligned}

Batched Form

With batch dimension $b$:

\begin{aligned} Q &\in \mathbb{R}^{b \times n_q \times d_k} \\ K &\in \mathbb{R}^{b \times n_k \times d_k} \\ V &\in \mathbb{R}^{b \times n_k \times d_v} \\ \text{Output} &\in \mathbb{R}^{b \times n_q \times d_v} \end{aligned}

With Multi-Head Attention (Preview)

\begin{aligned} Q, K, V &\in \mathbb{R}^{b \times h \times n \times d_k} \quad \text{where } h = \text{num\_heads} \\ \text{Output} &\in \mathbb{R}^{b \times h \times n \times d_k} \end{aligned}

Summary

The Complete Picture

\boxed{ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V }

Component purposes:

  1. $QK^T$: compute similarity between each query and all keys
  2. $/\sqrt{d_k}$: normalize variance to prevent softmax saturation
  3. $\text{softmax}$: convert to a probability distribution
  4. $\times V$: aggregate values weighted by attention

Key Equations at a Glance

\begin{aligned} \text{Similarity:} \quad & \text{scores} = Q \cdot K^T & [n_q, n_k] \\ \text{Scaling:} \quad & \text{scores} = \frac{\text{scores}}{\sqrt{d_k}} & \\ \text{Normalization:} \quad & A = \text{softmax}(\text{scores}) & [n_q, n_k] \\ \text{Aggregation:} \quad & \text{output} = A \cdot V & [n_q, d_v] \end{aligned}

Critical Insights

  1. Scaling is essential: without $\sqrt{d_k}$, training fails for large $d_k$
  2. Softmax creates focus: converts scores to a probability distribution
  3. Output is a weighted average: each output is a combination of all values
  4. $O(n^2)$ complexity: quadratic in sequence length, limiting long sequences

Exercises

Mathematical Derivations

  1. Variance calculation: prove that if $\vec{q}$ and $\vec{k}$ are vectors with independent $\mathcal{N}(0,1)$ elements, then $\text{Var}[\vec{q} \cdot \vec{k}] = d_k$.
  2. Softmax property: prove that $\sum_i \text{softmax}(x)_i = 1$ for any input vector $x$.
  3. Temperature effect: if we use $\text{softmax}(x/T)$, what happens to the output distribution as $T \rightarrow 0$? As $T \rightarrow \infty$?

Calculation Practice

4. Given:

```python
Q = [[1, 0]]
K = [[1, 0], [0, 1], [1, 1]]
V = [[1, 2], [3, 4], [5, 6]]
d_k = 2
```

Compute the attention output step by step:

  • Compute $QK^T$
  • Apply scaling: $\frac{QK^T}{\sqrt{d_k}}$
  • Apply softmax
  • Compute the final output: $A \cdot V$

5. What happens to the attention weights if one score is much larger than the others? Compute $\text{softmax}([0, 0, 10])$.

Conceptual Questions

  1. Why can't we just normalize $QK^T$ by dividing by its maximum value instead of by $\sqrt{d_k}$?
  2. If we wanted attention to be "sharper" (more focused on one position), what could we change in the formula?
  3. Why do $K$ and $V$ need to have the same sequence length, while $Q$ can have a different length?

Next Section Preview

In the next section, we'll work through a complete numerical example with actual numbers. You'll compute attention by hand (or calculator) on a tiny 3-token, 4-dimensional example, building confidence through concrete verification.

\text{Coming up: } Q = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix} \rightarrow \; ?