Learning Objectives
By the end of this section, you will:
- Derive the dot-product attention formula step by step
- Understand why scaling by √dₖ is necessary for stable training
- Apply softmax normalization to obtain attention weights
- Master the complete attention formula
- Trace through a worked numerical example
Why This Matters: Scaled dot-product attention is the core building block of modern attention mechanisms, from Transformers to our AMNL model. Understanding its mathematical formulation reveals why it works and how to debug issues like vanishing gradients or attention collapse.
Dot-Product Attention
The simplest way to measure query-key similarity is the dot product.
Similarity via Dot Product
For a query q and key k, both dₖ-dimensional vectors, the similarity score is their dot product:

    score(q, k) = q · k = Σᵢ qᵢ kᵢ
Properties of the dot product:
- Positive: Similar directions → high value
- Negative: Opposite directions → low value
- Zero: Orthogonal → no similarity
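These three cases are easy to check directly. A minimal sketch with NumPy (the vectors are chosen purely for illustration):

```python
import numpy as np

# Three keys illustrating the three cases relative to one query.
q = np.array([1.0, 0.0])
k_same = np.array([1.0, 0.0])        # same direction
k_opposite = np.array([-1.0, 0.0])   # opposite direction
k_orthogonal = np.array([0.0, 1.0])  # orthogonal

print(np.dot(q, k_same))        # 1.0  -> high similarity
print(np.dot(q, k_opposite))    # -1.0 -> low similarity
print(np.dot(q, k_orthogonal))  # 0.0  -> no similarity
```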
All Pairs at Once
For all queries and keys, compute every similarity at once via a single matrix multiplication:

    S = QKᵀ

where Q ∈ ℝ^(T × dₖ) and K ∈ ℝ^(T × dₖ). Each element sᵢⱼ = qᵢ · kⱼ gives the similarity between query i and key j.
Score matrix (T = 30 timesteps):

                Key positions (j)
               1     2     3    ...   30
            ┌──────────────────────────────┐
    Q   1   │ s₁₁   s₁₂   s₁₃   ...  s₁,₃₀  │ ← Query 1's similarity to all keys
    u   2   │ s₂₁   s₂₂   s₂₃   ...  s₂,₃₀  │
    e   3   │ s₃₁   s₃₂   s₃₃   ...  s₃,₃₀  │
    r   :   │  :     :     :          :    │
    y  30   │ s₃₀,₁ s₃₀,₂ s₃₀,₃  ... s₃₀,₃₀ │
            └──────────────────────────────┘

Each row becomes attention weights after softmax.

Why Scale by √dₖ?
Raw dot products can become very large, causing problems with softmax. Scaling fixes this.
The Variance Problem
If q and k have independent components with mean 0 and variance 1, each term qᵢkᵢ also has mean 0 and variance 1, so the dₖ terms add:

    Var(q · k) = Var(Σᵢ qᵢ kᵢ) = dₖ

The variance of the dot product grows linearly with dimension!
The Scaling Solution
Divide by √dₖ to restore unit variance:

    sᵢⱼ = (qᵢ · kⱼ) / √dₖ
Now scores have consistent magnitude regardless of dimension.
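The variance claim is easy to verify empirically. A sketch that samples random query/key pairs at several dimensions (sample size chosen only to make the estimates stable):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                              # number of sampled (q, k) pairs
for d in (4, 32, 256):
    q = rng.normal(size=(n, d))          # components: mean 0, variance 1
    k = rng.normal(size=(n, d))
    raw = (q * k).sum(axis=1)            # unscaled dot products
    scaled = raw / np.sqrt(d)            # scaled dot products
    # Raw variance grows like d; scaled variance stays near 1.
    print(d, round(raw.var(), 1), round(scaled.var(), 2))
```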
Impact on Softmax
| Scores | Softmax Output | Problem |
|---|---|---|
| [3, 1, -1, 2] | [0.66, 0.09, 0.01, 0.24] | Good distribution |
| [30, 10, -10, 20] | [0.99995, 0.00000, 0.00000, 0.00005] | Near one-hot, vanishing gradients |
Always Scale
Forgetting to scale is a common bug. Symptoms: attention weights become very peaky (near one-hot), gradients vanish, model doesn't learn meaningful attention patterns.
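The symptom is easy to reproduce. A sketch contrasting moderate scores with the 10× larger scores you get when scaling is forgotten:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

moderate = np.array([3.0, 1.0, -1.0, 2.0])
large = 10 * moderate                # what unscaled scores look like at large d_k

print(softmax(moderate).round(2))    # spread-out weights
print(softmax(large).round(4))       # near one-hot -> vanishing gradients
```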
Softmax Normalization
Softmax converts raw scores into a probability distribution over keys.
Softmax Formula
For score vector s = (s₁, …, s_T), the weight on position t is:

    aₜ = exp(sₜ) / Σⱼ exp(sⱼ)
Properties of softmax outputs:
- aₜ > 0 for all t (positive)
- Σₜ aₜ = 1 (normalized)
- Higher scores → higher weights (monotonic)
- Smooth and differentiable (trainable)
Row-wise Application
Softmax is applied independently to each row of the score matrix:

    A = softmax(QKᵀ / √dₖ)   (row-wise)
Row i contains query i's attention distribution over all keys.
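A sketch of row-wise softmax over a score matrix (toy values chosen for illustration):

```python
import numpy as np

def softmax_rows(S):
    # Each row is normalized independently into a probability distribution.
    e = np.exp(S - S.max(axis=1, keepdims=True))   # per-row stabilization
    return e / e.sum(axis=1, keepdims=True)

S = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
A = softmax_rows(S)
print(A.round(3))
print(A.sum(axis=1))    # each row sums to 1
```

Note the second row: uniform scores produce uniform weights, which is why an untrained model starts out attending roughly everywhere.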
Complete Attention Formula
Combining all components, the scaled dot-product attention formula is:

    Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Breakdown
- Compute scores: S = QKᵀ (all pairwise similarities)
- Scale: S / √dₖ (stabilize variance)
- Normalize: A = softmax(S / √dₖ) (row-wise probabilities)
- Aggregate: AV (weighted sum of values)
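The four steps fit in a few lines. A minimal NumPy sketch, not the model's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                     # scaled scores, (T, T)
    E = np.exp(S - S.max(axis=1, keepdims=True))   # stable row-wise softmax
    A = E / E.sum(axis=1, keepdims=True)           # attention weights
    return A @ V, A                                # output and weights

T, d_k, d_v = 30, 32, 32                           # dimensions from the text
rng = np.random.default_rng(0)
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
out, A = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (30, 32)
```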
Dimensions
| Tensor | Shape | Description |
|---|---|---|
| Q | (T, dₖ) | Queries |
| K | (T, dₖ) | Keys |
| V | (T, dᵥ) | Values |
| QKᵀ | (T, T) | Score matrix |
| A = softmax(·) | (T, T) | Attention weights |
| AV | (T, dᵥ) | Output |
For our model: T = 30, dₖ = dᵥ = 32 (per head).
Worked Example
Let us trace through attention with concrete numbers.
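A hand-checkable trace; this sketch uses tiny dimensions (T = 2, dₖ = dᵥ = 2) chosen for illustration rather than the model's actual sizes:

```python
import numpy as np

Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [1.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

S = Q @ K.T / np.sqrt(2)     # scaled scores
# Row 1: both scores equal (1/√2)  -> uniform weights [0.5, 0.5]
# Row 2: scores [0, 1/√2]          -> key 2 gets more weight
A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
out = A @ V

print(A.round(3))     # [[0.5, 0.5], [0.33, 0.67]]
print(out.round(2))   # row 1 is the midpoint of the two values: [2.0, 3.0]
```

Query 1 attends equally to both keys, so its output is the average of the two value rows; query 2 leans toward key 2, so its output is pulled toward V's second row.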
Summary
In this section, we derived scaled dot-product attention:
- Dot-product similarity: q · k measures query-key alignment
- Scaling: Divide by √dₖ to prevent variance explosion
- Softmax: Converts scores to a probability distribution
- Complete formula: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
- Output: Weighted combination of values based on query-key similarity
| Property | Value |
|---|---|
| Scaling factor | √dₖ = √32 ≈ 5.66 |
| Score matrix shape | (30, 30) |
| Attention weights | Each row sums to 1 |
| Output shape | (30, dᵥ) |
Looking Ahead: A single attention head learns one type of query-key relationship. To capture multiple types of relationships (different aspects of degradation), we use multi-head attention—running multiple attention mechanisms in parallel and combining their outputs.
With single-head attention understood, we now extend to multi-head attention with 8 parallel heads.