Chapter 7

Multi-Head Self-Attention

Scaled Dot-Product Attention Formula

Learning Objectives

By the end of this section, you will:

  1. Derive the dot-product attention formula step by step
  2. Understand why scaling by √d is necessary for stable training
  3. Apply softmax normalization to obtain attention weights
  4. Master the complete attention formula
  5. Trace through a worked numerical example
Why This Matters: Scaled dot-product attention is the core building block of modern attention mechanisms, from Transformers to our AMNL model. Understanding its mathematical formulation reveals why it works and how to debug issues like vanishing gradients or attention collapse.

Dot-Product Attention

The simplest way to measure query-key similarity is the dot product.

Similarity via Dot Product

For a query $q$ and key $k$, both vectors in $\mathbb{R}^{d_k}$:

$$\text{similarity}(q, k) = q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Properties of the dot product:

  • Positive: Similar directions → high value
  • Negative: Opposite directions → low value
  • Zero: Orthogonal → no similarity
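These three cases are easy to see in a small NumPy sketch (the vectors below are my own illustrative choices):

```python
import numpy as np

q = np.array([1.0, 2.0])              # a query vector

k_same = np.array([2.0, 4.0])         # same direction as q
k_opposite = np.array([-1.0, -2.0])   # opposite direction
k_orthogonal = np.array([2.0, -1.0])  # perpendicular to q

print(np.dot(q, k_same))        # 10.0 -> high similarity
print(np.dot(q, k_opposite))    # -5.0 -> dissimilar
print(np.dot(q, k_orthogonal))  # 0.0  -> no similarity
```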

All Pairs at Once

For all queries and keys, compute similarities via matrix multiplication:

$$\text{Scores} = QK^T \in \mathbb{R}^{T \times T}$$

where $Q \in \mathbb{R}^{T \times d_k}$ and $K \in \mathbb{R}^{T \times d_k}$. Element $(i, j)$ gives the similarity between query $i$ and key $j$.

```text
Score matrix (T = 30 timesteps):

              Key positions (j)
              1     2     3    ...    30
            ┌──────────────────────────────┐
Q        1  │ s₁,₁  s₁,₂  s₁,₃  ...  s₁,₃₀ │  ← query 1's similarity to all keys
u        2  │ s₂,₁  s₂,₂  s₂,₃  ...  s₂,₃₀ │
e        3  │ s₃,₁  s₃,₂  s₃,₃  ...  s₃,₃₀ │
r        ⋮  │  ⋮     ⋮     ⋮          ⋮    │
y       30  │ s₃₀,₁ s₃₀,₂ s₃₀,₃ ... s₃₀,₃₀ │
            └──────────────────────────────┘

Each row becomes attention weights after softmax
```
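In code, the whole score matrix is a single matrix multiplication; a sketch using the T and dₖ quoted later in this section (random data stands in for real queries and keys):

```python
import numpy as np

T, d_k = 30, 32                      # timesteps and key dimension, as in the text
rng = np.random.default_rng(0)

Q = rng.standard_normal((T, d_k))    # one query row per timestep
K = rng.standard_normal((T, d_k))    # one key row per timestep

scores = Q @ K.T                     # (T, T); scores[i, j] = q_i · k_j
print(scores.shape)                  # (30, 30)

# Element (i, j) really is the dot product of query i and key j:
print(np.allclose(scores[4, 7], Q[4] @ K[7]))   # True
```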

Why Scale by √d?

Raw dot products can become very large, causing problems with softmax. Scaling fixes this.

The Variance Problem

If $q$ and $k$ have independent components with mean 0 and variance 1:

$$\text{Var}(q \cdot k) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k$$

The variance of the dot product grows with dimension!

The Scaling Solution

Divide by $\sqrt{d_k}$ to restore unit variance:

$$\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$

Now scores have consistent magnitude regardless of dimension.
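A quick Monte-Carlo check of both variance claims (a sketch; the dimension and sample count are arbitrary choices of mine):

```python
import numpy as np

d_k, n = 64, 200_000
rng = np.random.default_rng(42)

# n independent query/key pairs with zero-mean, unit-variance components
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)           # one dot product per pair

print(dots.var())                    # ≈ d_k = 64: variance grows with dimension
print((dots / np.sqrt(d_k)).var())   # ≈ 1: scaling restores unit variance
```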

Impact on Softmax

| Scores | Softmax output | Problem |
| --- | --- | --- |
| [3, 1, -1, 2] | [0.66, 0.09, 0.01, 0.24] | Good distribution |
| [30, 10, -10, 20] | [0.99995, 0.00000, 0.00000, 0.00005] | Near one-hot, vanishing gradients |

Always Scale

Forgetting to scale is a common bug. Symptoms: attention weights become very peaky (near one-hot), gradients vanish, model doesn't learn meaningful attention patterns.
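The symptom is easy to reproduce; a sketch contrasting the same random scores with and without scaling (a small softmax helper is included, since the symptom shows up in its output):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # max-subtraction for numerical stability
    return e / e.sum()

d_k = 256
rng = np.random.default_rng(1)
q = rng.standard_normal((8, d_k))
k = rng.standard_normal((8, d_k))
scores = (q @ k.T)[0]                # query 0's raw scores; std is about sqrt(d_k) = 16

peaky = softmax(scores)              # the "forgot to scale" bug
healthy = softmax(scores / np.sqrt(d_k))

print(peaky.max())                   # typically very close to 1 (near one-hot)
print(healthy.max())                 # noticeably flatter distribution
```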


Softmax Normalization

Softmax converts raw scores into a probability distribution over keys.

Softmax Formula

For a score vector $\mathbf{s} = [s_1, \ldots, s_T]$:

$$\alpha_t = \text{softmax}(\mathbf{s})_t = \frac{\exp(s_t)}{\sum_{j=1}^{T} \exp(s_j)}$$

Properties of softmax outputs:

  • $\alpha_t > 0$ for all $t$ (positive)
  • $\sum_t \alpha_t = 1$ (normalized)
  • Higher scores → higher weights (monotonic)
  • Smooth and differentiable (trainable)
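A minimal softmax sketch that checks these properties directly (the max-subtraction step is a standard trick to avoid overflow in exp; it does not change the result):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())              # shift by max: same output, no overflow
    return e / e.sum()

s = np.array([3.0, 1.0, -1.0, 2.0])
alpha = softmax(s)

print(np.round(alpha, 2))                # [0.66 0.09 0.01 0.24]
print((alpha > 0).all())                 # True: all weights positive
print(alpha.sum())                       # 1.0 (up to float rounding): normalized
print(alpha.argmax() == s.argmax())      # True: highest score, highest weight
```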

Row-wise Application

Softmax is applied independently to each row of the score matrix:

$$A_{i,:} = \text{softmax}\left(\frac{Q_i K^T}{\sqrt{d_k}}\right)$$

Row i contains query i's attention distribution over all keys.
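In NumPy the row-wise application is one `keepdims` reduction; a sketch:

```python
import numpy as np

def softmax_rows(S):
    # Softmax applied independently to each row of S
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d_k = 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))

A = softmax_rows(Q @ K.T / np.sqrt(d_k))
print(A.shape)           # (5, 5)
print(A.sum(axis=1))     # every row sums to 1: one distribution per query
```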


Complete Attention Formula

Combining all components, the scaled dot-product attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Breakdown

  1. Compute scores: $S = QK^T$ — all pairwise similarities
  2. Scale: $S' = S / \sqrt{d_k}$ — stabilize variance
  3. Normalize: $A = \text{softmax}(S')$ — row-wise probabilities
  4. Aggregate: $\text{Output} = AV$ — weighted sum of values

Dimensions

| Tensor | Shape | Description |
| --- | --- | --- |
| Q | (T, dₖ) | Queries |
| K | (T, dₖ) | Keys |
| V | (T, dᵥ) | Values |
| QKᵀ | (T, T) | Score matrix |
| A = softmax(·) | (T, T) | Attention weights |
| AV | (T, dᵥ) | Output |

For our model: T = 30, dₖ = dᵥ = 32 (per head).
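Putting the four steps together in one function, a sketch (not the AMNL model's actual implementation; shapes match the table above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (T, T) scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # attention weights
    return A @ V                                          # (T, d_v) output

T, d_k, d_v = 30, 32, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)    # (30, 32)
```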


Worked Example

Let us trace through attention with concrete numbers.
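As one illustrative trace (the numbers below are my own, chosen so every step can be checked by hand): two timesteps with dₖ = dᵥ = 2.

```python
import numpy as np

Q = np.array([[1.0, 0.0],       # query 1
              [0.0, 1.0]])      # query 2
K = np.array([[1.0, 0.0],       # key 1
              [1.0, 1.0]])      # key 2
V = np.array([[1.0, 2.0],       # value 1
              [3.0, 4.0]])      # value 2

# Step 1 -- scores S = Q K^T:
S = Q @ K.T                     # [[1, 1], [0, 1]]

# Step 2 -- scale by sqrt(d_k) = sqrt(2) ≈ 1.414:
S_scaled = S / np.sqrt(2)       # [[0.707, 0.707], [0.0, 0.707]]

# Step 3 -- row-wise softmax:
e = np.exp(S_scaled)
A = e / e.sum(axis=1, keepdims=True)
# Row 1: equal scores        -> [0.5, 0.5]
# Row 2: second score larger -> [0.33, 0.67]

# Step 4 -- weighted sum of values:
out = A @ V
print(np.round(out, 2))
# Row 1: 0.5*[1,2]  + 0.5*[3,4]  = [2.   3.  ]
# Row 2: 0.33*[1,2] + 0.67*[3,4] ≈ [2.34 3.34]
```

Note how query 2's output leans toward value 2: that query aligned more strongly with key 2, and softmax turned the score gap into roughly a 2:1 weight ratio.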


Summary

In this section, we derived scaled dot-product attention:

  1. Dot-product similarity: $QK^T$ measures query-key alignment
  2. Scaling: divide by $\sqrt{d_k}$ to prevent variance explosion
  3. Softmax: converts scores to a probability distribution
  4. Complete formula: $\text{softmax}(QK^T/\sqrt{d_k})\,V$
  5. Output: Weighted combination of values based on query-key similarity

| Property | Value |
| --- | --- |
| Scaling factor | √dₖ = √32 ≈ 5.66 |
| Score matrix shape | (30, 30) |
| Attention weights | Each row sums to 1 |
| Output shape | (30, dᵥ) |
Looking Ahead: A single attention head learns one type of query-key relationship. To capture multiple types of relationships (different aspects of degradation), we use multi-head attention—running multiple attention mechanisms in parallel and combining their outputs.

With single-head attention understood, we now extend to multi-head attention with 8 parallel heads.