Chapter 7

Multi-Head Self-Attention

Scaled Dot-Product Attention Formula

Learning Objectives

By the end of this section, you will:

  1. Derive the dot-product attention formula step by step
  2. Understand why scaling by √d is necessary for stable training
  3. Apply softmax normalization to obtain attention weights
  4. Master the complete attention formula
  5. Trace through a worked numerical example
Why This Matters: Scaled dot-product attention is the core building block of modern attention mechanisms, from Transformers to our AMNL model. Understanding its mathematical formulation reveals why it works and how to debug issues like vanishing gradients or attention collapse.

Dot-Product Attention

The simplest way to measure query-key similarity is the dot product.

Similarity via Dot Product

For a query $q$ and key $k$, both vectors in $\mathbb{R}^{d_k}$:

$$\text{similarity}(q, k) = q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Properties of the dot product:

  • Positive: Similar directions → high value
  • Negative: Opposite directions → low value
  • Zero: Orthogonal → no similarity
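These three cases are easy to see in a small NumPy sketch (the vectors below are my own illustrative choices):

```python
import numpy as np

q = np.array([1.0, 2.0])              # a query vector

k_same = np.array([2.0, 4.0])         # same direction as q
k_opposite = np.array([-1.0, -2.0])   # opposite direction
k_orthogonal = np.array([2.0, -1.0])  # perpendicular to q

print(np.dot(q, k_same))        # 10.0 -> high similarity
print(np.dot(q, k_opposite))    # -5.0 -> dissimilar
print(np.dot(q, k_orthogonal))  # 0.0  -> no similarity
```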

All Pairs at Once

For all queries and keys, compute similarities via matrix multiplication:

$$\text{Scores} = QK^T \in \mathbb{R}^{T \times T}$$

where $Q \in \mathbb{R}^{T \times d_k}$ and $K \in \mathbb{R}^{T \times d_k}$. Element $(i, j)$ gives the similarity between query $i$ and key $j$.

```text
Score matrix (T = 30 timesteps):

              Key positions (j)
              1     2     3    ...    30
            ┌──────────────────────────────┐
Q        1  │ s₁,₁  s₁,₂  s₁,₃  ...  s₁,₃₀ │  ← query 1's similarity to all keys
u        2  │ s₂,₁  s₂,₂  s₂,₃  ...  s₂,₃₀ │
e        3  │ s₃,₁  s₃,₂  s₃,₃  ...  s₃,₃₀ │
r        ⋮  │  ⋮     ⋮     ⋮          ⋮    │
y       30  │ s₃₀,₁ s₃₀,₂ s₃₀,₃ ... s₃₀,₃₀ │
            └──────────────────────────────┘

Each row becomes attention weights after softmax
```
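In code, the whole score matrix is a single matrix multiplication; a sketch using the T and dₖ quoted later in this section (random data stands in for real queries and keys):

```python
import numpy as np

T, d_k = 30, 32                      # timesteps and key dimension, as in the text
rng = np.random.default_rng(0)

Q = rng.standard_normal((T, d_k))    # one query row per timestep
K = rng.standard_normal((T, d_k))    # one key row per timestep

scores = Q @ K.T                     # (T, T); scores[i, j] = q_i · k_j
print(scores.shape)                  # (30, 30)

# Element (i, j) really is the dot product of query i and key j:
print(np.allclose(scores[4, 7], Q[4] @ K[7]))   # True
```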

Why Scale by √d?

Raw dot products can become very large, causing problems with softmax. Scaling fixes this.

The Variance Problem

If $q$ and $k$ have independent components with mean 0 and variance 1:

$$\text{Var}(q \cdot k) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k$$

The variance of the dot product grows with dimension!

The Scaling Solution

Divide by $\sqrt{d_k}$ to restore unit variance:

$$\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$

Now scores have consistent magnitude regardless of dimension.
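A quick Monte-Carlo check of both variance claims (a sketch; the dimension and sample count are arbitrary choices of mine):

```python
import numpy as np

d_k, n = 64, 200_000
rng = np.random.default_rng(42)

# n independent query/key pairs with zero-mean, unit-variance components
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)           # one dot product per pair

print(dots.var())                    # ≈ d_k = 64: variance grows with dimension
print((dots / np.sqrt(d_k)).var())   # ≈ 1: scaling restores unit variance
```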

Impact on Softmax

| Scores | Softmax output | Problem |
| --- | --- | --- |
| [3, 1, -1, 2] | [0.66, 0.09, 0.01, 0.24] | Good distribution |
| [30, 10, -10, 20] | [0.99995, 0.00000, 0.00000, 0.00005] | Near one-hot, vanishing gradients |

Always Scale

Forgetting to scale is a common bug. Symptoms: attention weights become very peaky (near one-hot), gradients vanish, model doesn't learn meaningful attention patterns.
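The symptom is easy to reproduce; a sketch contrasting the same random scores with and without scaling (a small softmax helper is included, since the symptom shows up in its output):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # max-subtraction for numerical stability
    return e / e.sum()

d_k = 256
rng = np.random.default_rng(1)
q = rng.standard_normal((8, d_k))
k = rng.standard_normal((8, d_k))
scores = (q @ k.T)[0]                # query 0's raw scores; std is about sqrt(d_k) = 16

peaky = softmax(scores)              # the "forgot to scale" bug
healthy = softmax(scores / np.sqrt(d_k))

print(peaky.max())                   # typically very close to 1 (near one-hot)
print(healthy.max())                 # noticeably flatter distribution
```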


Softmax Normalization

Softmax converts raw scores into a probability distribution over keys.

Softmax Formula

For a score vector $\mathbf{s} = [s_1, \ldots, s_T]$:

$$\alpha_t = \text{softmax}(\mathbf{s})_t = \frac{\exp(s_t)}{\sum_{j=1}^{T} \exp(s_j)}$$

Properties of softmax outputs:

  • $\alpha_t > 0$ for all $t$ (positive)
  • $\sum_t \alpha_t = 1$ (normalized)
  • Higher scores → higher weights (monotonic)
  • Smooth and differentiable (trainable)
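A minimal softmax sketch that checks these properties directly (the max-subtraction step is a standard trick to avoid overflow in exp; it does not change the result):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())              # shift by max: same output, no overflow
    return e / e.sum()

s = np.array([3.0, 1.0, -1.0, 2.0])
alpha = softmax(s)

print(np.round(alpha, 2))                # [0.66 0.09 0.01 0.24]
print((alpha > 0).all())                 # True: all weights positive
print(alpha.sum())                       # 1.0 (up to float rounding): normalized
print(alpha.argmax() == s.argmax())      # True: highest score, highest weight
```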

Row-wise Application

Softmax is applied independently to each row of the score matrix:

$$A_{i,:} = \text{softmax}\left(\frac{Q_i K^T}{\sqrt{d_k}}\right)$$

Row i contains query i's attention distribution over all keys.
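In NumPy the row-wise application is one `keepdims` reduction; a sketch:

```python
import numpy as np

def softmax_rows(S):
    # Softmax applied independently to each row of S
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d_k = 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))

A = softmax_rows(Q @ K.T / np.sqrt(d_k))
print(A.shape)           # (5, 5)
print(A.sum(axis=1))     # every row sums to 1: one distribution per query
```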


Complete Attention Formula

Combining all components, the scaled dot-product attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Breakdown

  1. Compute scores: $S = QK^T$ — all pairwise similarities
  2. Scale: $S' = S / \sqrt{d_k}$ — stabilize variance
  3. Normalize: $A = \text{softmax}(S')$ — row-wise probabilities
  4. Aggregate: $\text{Output} = AV$ — weighted sum of values

Dimensions

| Tensor | Shape | Description |
| --- | --- | --- |
| Q | (T, dₖ) | Queries |
| K | (T, dₖ) | Keys |
| V | (T, dᵥ) | Values |
| QKᵀ | (T, T) | Score matrix |
| A = softmax(·) | (T, T) | Attention weights |
| AV | (T, dᵥ) | Output |

For our model: T = 30, dₖ = dᵥ = 32 (per head).
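Putting the four steps together in one function, a sketch (not the AMNL model's actual implementation; shapes match the table above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (T, T) scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # attention weights
    return A @ V                                          # (T, d_v) output

T, d_k, d_v = 30, 32, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)    # (30, 32)
```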


Worked Example

Let us trace through attention with concrete numbers.
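As one illustrative trace (the numbers below are my own, chosen so every step can be checked by hand): two timesteps with dₖ = dᵥ = 2.

```python
import numpy as np

Q = np.array([[1.0, 0.0],       # query 1
              [0.0, 1.0]])      # query 2
K = np.array([[1.0, 0.0],       # key 1
              [1.0, 1.0]])      # key 2
V = np.array([[1.0, 2.0],       # value 1
              [3.0, 4.0]])      # value 2

# Step 1 -- scores S = Q K^T:
S = Q @ K.T                     # [[1, 1], [0, 1]]

# Step 2 -- scale by sqrt(d_k) = sqrt(2) ≈ 1.414:
S_scaled = S / np.sqrt(2)       # [[0.707, 0.707], [0.0, 0.707]]

# Step 3 -- row-wise softmax:
e = np.exp(S_scaled)
A = e / e.sum(axis=1, keepdims=True)
# Row 1: equal scores        -> [0.5, 0.5]
# Row 2: second score larger -> [0.33, 0.67]

# Step 4 -- weighted sum of values:
out = A @ V
print(np.round(out, 2))
# Row 1: 0.5*[1,2]  + 0.5*[3,4]  = [2.   3.  ]
# Row 2: 0.33*[1,2] + 0.67*[3,4] ≈ [2.34 3.34]
```

Note how query 2's output leans toward value 2: that query aligned more strongly with key 2, and softmax turned the score gap into roughly a 2:1 weight ratio.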


Summary

In this section, we derived scaled dot-product attention:

  1. Dot-product similarity: $QK^T$ measures query-key alignment
  2. Scaling: divide by $\sqrt{d_k}$ to prevent variance explosion
  3. Softmax: converts scores to a probability distribution
  4. Complete formula: $\text{softmax}(QK^T/\sqrt{d_k})\,V$
  5. Output: Weighted combination of values based on query-key similarity

| Property | Value |
| --- | --- |
| Scaling factor | √dₖ = √32 ≈ 5.66 |
| Score matrix shape | (30, 30) |
| Attention weights | Each row sums to 1 |
| Output shape | (30, dᵥ) |
Looking Ahead: A single attention head learns one type of query-key relationship. To capture multiple types of relationships (different aspects of degradation), we use multi-head attention—running multiple attention mechanisms in parallel and combining their outputs.

With single-head attention understood, we now extend to multi-head attention with 8 parallel heads.