Introduction
Now that we have intuition for what attention does, let's derive the mathematical formula precisely. By the end of this section, you'll understand every component of:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

This formula is deceptively simple—just a few matrix operations—but each component is carefully chosen. Understanding why each part exists will help you implement attention correctly and debug issues when they arise.
What Does This Formula Actually Do?
Think of attention as: Each token decides how much it should "look at" every other token.
| Symbol | Role | Intuition |
|---|---|---|
| Q (Query) | What the current token wants | "I'm looking for information about..." |
| K (Key) | What each token offers | "Here's what I can provide..." |
| V (Value) | The actual information | "Here's my content if you need it" |
The flow is elegantly simple:

Attention Flow: how information flows through the attention mechanism

1. Q compares with K → similarity scores
2. Softmax normalizes → weights sum to 1
3. Multiply by V → weighted values
4. Output → enriched representation

There is NO magic: just these four steps.
Why Is V Outside the Softmax?
This is a common point of confusion. Let's be crystal clear:
| Component | What It Contains | Should We Normalize It? |
|---|---|---|
| QKᵀ | Raw similarity scores (can be negative, large, small) | ✅ YES — convert to probabilities |
| V | Actual content/information | ❌ NO — it's not a score! |
The softmax must operate on scores, not content.
- QKᵀ gives raw compatibility scores: "How similar am I to each token?"
- Softmax converts these to attention weights: A_ij = how much token i should pay attention to token j
- V contains the actual information we want to retrieve

If V were inside the softmax, we would be squashing the content itself into a probability distribution, erasing the signs and magnitudes that encode meaning, instead of just normalizing the scores.
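To see the distinction concretely, here's a minimal sketch (toy numbers invented for illustration) contrasting softmax over scores with what would happen if we normalized V itself:

```python
import torch

scores = torch.tensor([[2.0, -1.0, 0.5]])   # raw similarities: any sign, any scale
V = torch.tensor([[1.0, -3.0],
                  [0.0,  2.0],
                  [4.0,  0.5]])              # content: signs and magnitudes carry meaning

weights = torch.softmax(scores, dim=-1)      # valid probabilities over the 3 keys
output = weights @ V                         # weighted average of the rows of V

print(weights)                  # non-negative, sums to 1
print(output)                   # V's signs and scales survive the averaging

# Normalizing V itself would force every row into a positive distribution,
# destroying the information the values carry:
print(torch.softmax(V, dim=-1))
```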
What Context Does the Attention Matrix Encode?
The attention matrix A = softmax(QKᵀ / √d_k) is an n × n matrix (for n tokens).
Each row tells you: for token i, how much it attends to every other token.
Example: Consider the sentence "The cat sat on the mat."
At the token "cat", the attention row might look like:
| Token | Attention Weight | Interpretation |
|---|---|---|
| The | 0.05 | Article, less important |
| cat | 0.60 | Self-attention (I am the subject) |
| sat | 0.25 | Verb related to me |
| on | 0.05 | Preposition, less relevant |
| the | 0.03 | Another article |
| mat | 0.02 | Object, not directly related to cat |
This row vector = context distribution. It tells us "cat" mostly cares about itself (0.60) and the verb "sat" (0.25).
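Pairing that attention row with hypothetical 2-d value vectors (invented here purely for illustration, not from the text), the context vector for "cat" is just a weighted average:

```python
import numpy as np

# Attention row for "cat" from the table above
weights = np.array([0.05, 0.60, 0.25, 0.05, 0.03, 0.02])
assert abs(weights.sum() - 1.0) < 1e-9   # a valid distribution

# Hypothetical value vectors for: The, cat, sat, on, the, mat
V = np.array([[0.1, 0.0],
              [1.0, 0.2],
              [0.3, 0.9],
              [0.0, 0.1],
              [0.1, 0.0],
              [0.2, 0.4]])

context = weights @ V   # dominated by "cat", with a contribution from "sat"
print(context)
```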
What information does attention provide?
- Who depends on whom — relational structure in the sequence
- Pronoun resolution — "He" attends strongly to "John" in "John went to the store because he was hungry"
- Grammar structure — subject-verb relations, modifiers
- Semantic relevance — which words matter for predicting the next token
The Google Search Analogy
Here's the most intuitive way to understand attention:
| Attention | Google Search |
|---|---|
| Q (Query) | Your search query |
| K (Key) | Webpage SEO tags/titles |
| V (Value) | Actual webpage content |
| QKᵀ | Compare query to all page tags |
| softmax | Rank pages by relevance |
| A × V | Return weighted combination of relevant content |
Google Search as Attention: understanding Q, K, V through a familiar analogy
Ultra-simple summary
Q: What I need | K: What others offer | V: The actual information
Compare → Weight → Combine → Context-aware representation
2.1 The Attention Formula Breakdown
The Complete Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Let's identify each component:
| Component | Name | Purpose |
|---|---|---|
| Q | Query matrix | What we're looking for |
| K | Key matrix | What each position offers |
| V | Value matrix | Information to retrieve |
| QKᵀ | Attention scores | Raw similarity between queries and keys |
| √d_k | Scaling factor | Prevents extreme softmax outputs |
| softmax | Normalization | Converts scores to probabilities |
Dimensions
For a single sequence:

- Q: [n_q, d_k]
- K: [n_k, d_k]
- V: [n_k, d_v]
- Output: [n_q, d_v]

Key constraint: Q and K must share the dimension d_k (so QKᵀ is defined), and K and V must share the length n_k (one value per key).

For batched operations with batch size B, every matrix gains a leading batch dimension: Q is [B, n_q, d_k], K is [B, n_k, d_k], V is [B, n_k, d_v].
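These shape rules can be checked directly; a minimal sketch with arbitrary sizes (all names illustrative):

```python
import torch

B, n_q, n_k, d_k, d_v = 2, 5, 7, 64, 32   # arbitrary sizes for illustration

Q = torch.randn(B, n_q, d_k)
K = torch.randn(B, n_k, d_k)
V = torch.randn(B, n_k, d_v)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [B, n_q, n_k]
A = torch.softmax(scores, dim=-1)               # [B, n_q, n_k]
out = A @ V                                     # [B, n_q, d_v]

print(scores.shape, out.shape)
```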
2.2 Dot Product as Similarity Measure
Why Dot Product?
The dot product between two vectors measures their similarity:

q · k = |q| |k| cos θ

where θ is the angle between the vectors.

Properties:
- If vectors point in the same direction (θ = 0°): dot product is high (positive)
- If vectors are orthogonal (θ = 90°): dot product is zero
- If vectors point in opposite directions (θ = 180°): dot product is negative
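A quick NumPy check of the three cases (vectors chosen for illustration):

```python
import numpy as np

a = np.array([1.0, 0.0])
same = np.array([2.0, 0.0])      # θ = 0°
orth = np.array([0.0, 3.0])      # θ = 90°
opp  = np.array([-1.0, 0.0])     # θ = 180°

print(a @ same)  # 2.0  (positive: aligned)
print(a @ orth)  # 0.0  (orthogonal)
print(a @ opp)   # -1.0 (negative: opposed)
```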
Computing QKᵀ
When we compute QKᵀ:

The result is a matrix where:
- (QKᵀ)ᵢⱼ = dot product of query i and key j
- Shape: [n_q, n_k]

Example:

```python
Q = [[1, 0],   # Query 1
     [0, 1]]   # Query 2

K = [[1, 0],   # Key 1
     [1, 1],   # Key 2
     [0, 1]]   # Key 3
```

Reading the result:
- Query 1 is similar to Key 1 and Key 2 (score = 1), dissimilar to Key 3 (score = 0)
- Query 2 is similar to Key 2 and Key 3 (score = 1), dissimilar to Key 1 (score = 0)
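Running this example confirms the reading above:

```python
import numpy as np

Q = np.array([[1, 0],
              [0, 1]])
K = np.array([[1, 0],
              [1, 1],
              [0, 1]])

scores = Q @ K.T   # entry (i, j) = query i · key j
print(scores)
# [[1 1 0]
#  [0 1 1]]
```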
2.3 The Critical Scaling Factor: √d_k
The Problem Without Scaling
Consider what happens as d_k (the dimension of keys/queries) increases:

If q and k have elements drawn from a distribution with mean 0 and variance 1:
- Each element of QKᵀ is a sum of d_k terms
- By the Central Limit Theorem: E[q · k] = 0, Var[q · k] = d_k

So the standard deviation of the scores grows like √d_k, pushing softmax inputs into extreme ranges.
Why Extreme Values Are Bad
Softmax with extreme values produces nearly one-hot outputs: softmax([30, 10, 5]) ≈ [1, 0, 0].

Problems:
- Vanishing gradients: softmax saturates, so gradients ≈ 0
- No learning: the model can't adjust attention patterns
- Numerical instability: overflow in exp(x) for large x
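Both effects are easy to observe; a small sketch (scale factor chosen for illustration):

```python
import torch

small = torch.tensor([1.0, 0.5, -0.5])
large = small * 30                      # what unscaled scores look like at high d_k

print(torch.softmax(small, dim=-1))     # spread-out weights
print(torch.softmax(large, dim=-1))     # nearly one-hot

# Gradients vanish once softmax saturates:
x = large.clone().requires_grad_(True)
torch.softmax(x, dim=-1)[0].backward()
print(x.grad)                           # entries near zero
```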
The Solution: Scale by √d_k

Dividing by √d_k normalizes the variance of the scores.

Mathematical Derivation

Let q, k be vectors with d_k independent elements of mean 0 and variance 1. Each score q · k is a sum of d_k such products, so Var[q · k] = d_k.

After scaling: Var[q · k / √d_k] = d_k / d_k = 1

This is crucial: variance 1 regardless of d_k keeps softmax in its well-behaved range.
Empirical Demonstration
```python
import torch

d_k = 512
q = torch.randn(1000, d_k)  # 1000 queries
k = torch.randn(1000, d_k)  # 1000 keys

# Without scaling
scores_unscaled = q @ k.T
print(f"Unscaled std: {scores_unscaled.std():.2f}")  # ≈ 22.6 (≈ √512)

# With scaling
scores_scaled = (q @ k.T) / (d_k ** 0.5)
print(f"Scaled std: {scores_scaled.std():.2f}")  # ≈ 1.0
```

2.4 Softmax Normalization
Purpose of Softmax
Softmax converts raw scores into a probability distribution:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

Properties:
- All outputs positive: softmax(x)ᵢ > 0 for all i
- Outputs sum to 1: Σᵢ softmax(x)ᵢ = 1
- Preserves ordering: if xᵢ > xⱼ, then softmax(x)ᵢ > softmax(x)ⱼ
- Differentiable: enables gradient-based learning
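All four properties can be verified in a few lines (input vector chosen for illustration):

```python
import torch

x = torch.tensor([2.0, -1.0, 0.5, 0.5])
s = torch.softmax(x, dim=-1)

assert (s > 0).all()                               # all outputs positive
assert torch.isclose(s.sum(), torch.tensor(1.0))   # outputs sum to 1
assert s[0] > s[2]                                 # ordering preserved (x[0] > x[2])
print(s)
```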
Applying Softmax to Attention Scores
In A = softmax(QKᵀ / √d_k), each row sums to 1:
- Row i contains the attention weights for query i
- A_ij = how much query i attends to key/value j
Softmax Temperature
The scaling factor acts like an inverse temperature T in softmax(x / T):
- Lower T (higher scaling): sharper, more focused attention
- Higher T (lower scaling): softer, more uniform attention

Temperature intuition: T divides the gaps between scores, so small T exaggerates differences and large T washes them out.
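The effect is easy to see by sweeping T over the same scores (values chosen for illustration):

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 10.0]:
    print(T, torch.softmax(scores / T, dim=-1))
# T = 0.1  → nearly one-hot (sharp)
# T = 10.0 → nearly uniform (soft)
```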
2.5 Weighted Sum with Values
The Final Computation
After computing attention weights, we use them to aggregate values:

Output = A V

Shape transformation: [n_q, n_k] × [n_k, d_v] → [n_q, d_v]

What This Computes

For each query position i:

outputᵢ = Σⱼ A_ij · vⱼ

This is a weighted average of all value vectors, where the weights indicate relevance.
Example
```python
attention_weights = [[0.7, 0.2, 0.1],  # Query 1: mostly attends to Value 1
                     [0.1, 0.8, 0.1]]  # Query 2: mostly attends to Value 2

V = [[1, 0],   # Value 1
     [0, 1],   # Value 2
     [1, 1]]   # Value 3

# Output = attention_weights @ V:
# Query 1: 0.7*[1,0] + 0.2*[0,1] + 0.1*[1,1] = [0.8, 0.3]
# Query 2: 0.1*[1,0] + 0.8*[0,1] + 0.1*[1,1] = [0.2, 0.9]
```

2.6 Complete Algorithm
Pseudocode
```python
def attention(Q, K, V, mask=None):
    # Step 1: Compute raw attention scores
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1)  # [n_q, n_k]

    # Step 2: Scale scores
    scores = scores / sqrt(d_k)

    # Step 3: Apply mask (optional)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -infinity)

    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, dim=-1)

    # Step 5: Weighted sum of values
    output = attention_weights @ V  # [n_q, d_v]

    return output, attention_weights
```

The Algorithm in Equations

S = QKᵀ / √d_k    A = softmax(S)    Output = A V
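As a runnable sketch (PyTorch assumed; names are illustrative), the same steps:

```python
import math
import torch

def attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [n_q, n_k]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # rows sum to 1
    return weights @ V, weights                         # [n_q, d_v]

Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 16)
out, w = attention(Q, K, V)
print(out.shape, w.shape)
```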
Computational Complexity
| Operation | Complexity |
|---|---|
| QKᵀ | O(n_q × n_k × d_k) |
| Softmax | O(n_q × n_k) |
| Weights × V | O(n_q × n_k × d_v) |
| Total | O(n_q × n_k × d) |
For self-attention where n_q = n_k = n, the total cost is O(n² · d).

The quadratic bottleneck: the n × n attention matrix makes compute and memory grow quadratically with sequence length, which is the main obstacle to very long contexts.
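A back-of-the-envelope sketch of how the attention matrix alone grows (float32; sequence lengths chosen for illustration):

```python
# Doubling the sequence length quadruples the attention matrix:
for n in [1024, 2048, 4096]:
    entries = n * n                              # one weight per (query, key) pair
    mib = entries * 4 / 2**20                    # 4 bytes per float32 entry
    print(f"n={n}: {entries} entries, {mib:.0f} MiB")
```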
2.7 Gradient Flow Through Attention
Why Gradients Flow Well
The attention mechanism has favorable gradient properties:
- Softmax gradients: Non-zero gradients as long as attention isn't saturated (thanks to scaling!)
- Direct paths: Every output position has a direct connection to every input position
- No vanishing through time: Unlike RNNs, no multiplicative chains over sequence length
Gradient of Softmax
For softmax output s = softmax(x):

∂sᵢ/∂xⱼ = sᵢ (δᵢⱼ − sⱼ)

Where δᵢⱼ is the Kronecker delta (1 if i = j, else 0).

Key insight: the gradient shrinks toward zero as s approaches one-hot, which is exactly why we scale the scores to keep softmax away from saturation.
Backpropagation Through Attention
All operations are differentiable with well-behaved gradients!
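The analytic softmax Jacobian above can be checked against autograd (input vector chosen for illustration):

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([1.0, 0.5, -0.2])
s = torch.softmax(x, dim=-1)

# Analytic Jacobian: ∂s_i/∂x_j = s_i (δ_ij − s_j)
analytic = torch.diag(s) - torch.outer(s, s)

# Jacobian computed by autograd
auto = jacobian(lambda t: torch.softmax(t, dim=-1), x)

print(torch.allclose(analytic, auto, atol=1e-6))
```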
2.8 Mathematical Notation Summary
Standard Notation
| Symbol | Meaning | Shape |
|---|---|---|
| Q | Query matrix | [n_q, d_k] |
| K | Key matrix | [n_k, d_k] |
| V | Value matrix | [n_k, d_v] |
| d_k | Key/query dimension | scalar |
| d_v | Value dimension | scalar |
| n_q | Number of queries | scalar |
| n_k | Number of keys/values | scalar |
| A | Attention weights | [n_q, n_k] |
The Formula in Matrix Form

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Batched Form

With batch dimension B, the same formula applies per batch element: Q is [B, n_q, d_k], K is [B, n_k, d_k], V is [B, n_k, d_v], and the output is [B, n_q, d_v].

With Multi-Head Attention (Preview)

Multi-head attention runs several independent copies of this formula in parallel on lower-dimensional projections and concatenates the results; we'll cover it in detail later.
Summary
The Complete Picture

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Component purposes:
- QKᵀ: Compute similarity between each query and all keys
- √d_k: Normalize variance to prevent softmax saturation
- softmax: Convert scores to a probability distribution
- × V: Aggregate values weighted by attention

Key Equations at a Glance

S = QKᵀ / √d_k    A = softmax(S)    Output = A V

Critical Insights
- Scaling is essential: without √d_k, training fails for large d_k
- Softmax creates focus: converts scores to a probability distribution
- Output is a weighted average: each output is a combination of all values
- O(n²) complexity: quadratic in sequence length, limiting long sequences
Exercises
Mathematical Derivations
- Variance calculation: Prove that if q and k are vectors with d_k independent elements of mean 0 and variance 1, then Var[q · k] = d_k.
- Softmax property: Prove that Σᵢ softmax(x)ᵢ = 1 for any input vector x.
- Temperature effect: If we use softmax(x / T), what happens to the output distribution as T → 0? As T → ∞?
Calculation Practice
4. Given:

```python
Q = [[1, 0]]
K = [[1, 0], [0, 1], [1, 1]]
V = [[1, 2], [3, 4], [5, 6]]
d_k = 2
```

Compute the attention output step by step:
- Compute QKᵀ
- Apply scaling: divide by √d_k
- Apply softmax
- Compute the final output: A V

5. What happens to attention weights if one score is much larger than the others? Compute a softmax example to check your intuition.
Conceptual Questions
- Why can't we just normalize QKᵀ by dividing by its maximum value instead of √d_k?
- If we wanted attention to be "sharper" (more focused on one position), what could we change in the formula?
- Why do K and V need to have the same sequence length, while Q can have a different length?
Next Section Preview
In the next section, we'll work through a complete numerical example with actual numbers. You'll compute attention by hand (or calculator) on a tiny 3-token, 4-dimensional example, building confidence through concrete verification.