Introduction
There's no better way to truly understand attention than to compute it by hand. In this section, we'll work through a complete example with actual numbers: small enough to verify manually, but complete enough to show all the concepts.
By the end, you'll be able to verify your implementation against these hand calculations, building confidence that your code is correct.
3.1 Problem Setup
The Scenario
We have a tiny sequence of 3 tokens with 4-dimensional embeddings.
Think of it as the sentence "cat sat mat" (simplified).
Input Data
```
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

As a matrix X (input embeddings):

```
X = [[1.0, 0.0, 1.0, 0.0],  # Token 0
     [0.0, 1.0, 0.0, 1.0],  # Token 1
     [1.0, 1.0, 0.0, 0.0]]  # Token 2

Shape: [3, 4] = [seq_len, d_model]
```

Simplification: Q = K = V = X
For this walkthrough, we'll use the input directly as Q, K, and V (no learned projections). This is called "vanilla" attention and helps isolate the core computation.
```
Q = X  [3, 4]
K = X  [3, 4]
V = X  [3, 4]

d_k = 4
```

3.2 Step 1: Compute QK^T (Attention Scores)
The Matrix Multiplication
```
QK^T = Q @ K^T
     = [3, 4] @ [4, 3]
     = [3, 3]
```

Computing Element by Element

```
scores[i][j] = Q[i] · K[j]  (dot product)
```

Row 0 (Query = Token 0 = "cat"):

```
scores[0][0] = [1,0,1,0] · [1,0,1,0] = 1×1 + 0×0 + 1×1 + 0×0 = 2.0
scores[0][1] = [1,0,1,0] · [0,1,0,1] = 1×0 + 0×1 + 1×0 + 0×1 = 0.0
scores[0][2] = [1,0,1,0] · [1,1,0,0] = 1×1 + 0×1 + 1×0 + 0×0 = 1.0
```

Row 1 (Query = Token 1 = "sat"):

```
scores[1][0] = [0,1,0,1] · [1,0,1,0] = 0×1 + 1×0 + 0×1 + 1×0 = 0.0
scores[1][1] = [0,1,0,1] · [0,1,0,1] = 0×0 + 1×1 + 0×0 + 1×1 = 2.0
scores[1][2] = [0,1,0,1] · [1,1,0,0] = 0×1 + 1×1 + 0×0 + 1×0 = 1.0
```

Row 2 (Query = Token 2 = "mat"):

```
scores[2][0] = [1,1,0,0] · [1,0,1,0] = 1×1 + 1×0 + 0×1 + 0×0 = 1.0
scores[2][1] = [1,1,0,0] · [0,1,0,1] = 1×0 + 1×1 + 0×0 + 0×1 = 1.0
scores[2][2] = [1,1,0,0] · [1,1,0,0] = 1×1 + 1×1 + 0×0 + 0×0 = 2.0
```

Result: Attention Score Matrix
```
scores = [[2.0, 0.0, 1.0],
          [0.0, 2.0, 1.0],
          [1.0, 1.0, 2.0]]
```

Interpretation:
- Token 0 ("cat") is most similar to itself (score 2.0)
- Token 1 ("sat") is most similar to itself (score 2.0)
- Token 2 ("mat") is most similar to itself (score 2.0)
- "cat" and "sat" are completely dissimilar (score 0.0)
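If you want to check this score matrix in code before reaching the full implementation in section 3.8, a minimal NumPy sketch reproduces it:

```python
import numpy as np

# Input embeddings from 3.1; with Q = K = X, the score matrix is QK^T
X = np.array([[1.0, 0.0, 1.0, 0.0],   # Token 0 "cat"
              [0.0, 1.0, 0.0, 1.0],   # Token 1 "sat"
              [1.0, 1.0, 0.0, 0.0]])  # Token 2 "mat"

scores = X @ X.T  # scores[i][j] = dot product of token i and token j
print(scores)
```

Because the rows of X are 0/1 vectors, each score simply counts how many dimensions two tokens share.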
3.3 Step 2: Scale by √d_k
The Scaling Factor
```
d_k = 4
√d_k = √4 = 2.0
```

Apply Scaling

```
scaled_scores = scores / √d_k
              = scores / 2.0

scaled_scores = [[2.0/2, 0.0/2, 1.0/2],
                 [0.0/2, 2.0/2, 1.0/2],
                 [1.0/2, 1.0/2, 2.0/2]]

              = [[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.5],
                 [0.5, 0.5, 1.0]]
```

Why we scale: These values (max 1.0) will produce softer softmax outputs than the unscaled values (max 2.0).
3.4 Step 3: Apply Softmax
Softmax Formula
For each row independently:
```
softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
```

Row 0: [1.0, 0.0, 0.5]

```
exp(1.0) = 2.7183
exp(0.0) = 1.0000
exp(0.5) = 1.6487

sum = 2.7183 + 1.0000 + 1.6487 = 5.3670

attention[0] = [2.7183/5.3670, 1.0000/5.3670, 1.6487/5.3670]
             = [0.5066, 0.1863, 0.3071]
```

Row 1: [0.0, 1.0, 0.5]
```
exp(0.0) = 1.0000
exp(1.0) = 2.7183
exp(0.5) = 1.6487

sum = 1.0000 + 2.7183 + 1.6487 = 5.3670

attention[1] = [1.0000/5.3670, 2.7183/5.3670, 1.6487/5.3670]
             = [0.1863, 0.5066, 0.3071]
```

Row 2: [0.5, 0.5, 1.0]
```
exp(0.5) = 1.6487
exp(0.5) = 1.6487
exp(1.0) = 2.7183

sum = 1.6487 + 1.6487 + 2.7183 = 6.0157

attention[2] = [1.6487/6.0157, 1.6487/6.0157, 2.7183/6.0157]
             = [0.2741, 0.2741, 0.4518]
```

Result: Attention Weight Matrix
```
attention_weights = [[0.5066, 0.1863, 0.3071],
                     [0.1863, 0.5066, 0.3071],
                     [0.2741, 0.2741, 0.4518]]
```

Verify rows sum to 1:

```
Row 0: 0.5066 + 0.1863 + 0.3071 = 1.0000 ✓
Row 1: 0.1863 + 0.5066 + 0.3071 = 1.0000 ✓
Row 2: 0.2741 + 0.2741 + 0.4518 = 1.0000 ✓
```

Interpretation:
- Token 0 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 1 (18.6%)
- Token 1 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 0 (18.6%)
- Token 2 attends most to itself (45.2%), and equally to Token 0 and 1 (27.4% each)
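The row-wise softmax above takes only two NumPy lines to verify (a quick sketch; the results agree with the hand-rounded values to about three decimal places):

```python
import numpy as np

# Scaled scores from Step 2
scaled = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])

# Softmax each row: exponentiate, then divide by the row sum
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)
print(np.round(weights, 4))
```

Each row of `weights` is a probability distribution over the three tokens.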
3.5 Step 4: Compute Weighted Sum (Output)
Matrix Multiplication
```
output = attention_weights @ V
       = [3, 3] @ [3, 4]
       = [3, 4]
```

Computing Row by Row

Recall V = X:

```
V = [[1.0, 0.0, 1.0, 0.0],  # V[0]
     [0.0, 1.0, 0.0, 1.0],  # V[1]
     [1.0, 1.0, 0.0, 0.0]]  # V[2]
```

Output Row 0:
```
output[0] = 0.5066 × V[0] + 0.1863 × V[1] + 0.3071 × V[2]
          = 0.5066 × [1,0,1,0] + 0.1863 × [0,1,0,1] + 0.3071 × [1,1,0,0]

Element by element:
output[0][0] = 0.5066×1 + 0.1863×0 + 0.3071×1 = 0.8137
output[0][1] = 0.5066×0 + 0.1863×1 + 0.3071×1 = 0.4934
output[0][2] = 0.5066×1 + 0.1863×0 + 0.3071×0 = 0.5066
output[0][3] = 0.5066×0 + 0.1863×1 + 0.3071×0 = 0.1863

output[0] = [0.8137, 0.4934, 0.5066, 0.1863]
```

Output Row 1:
```
output[1] = 0.1863 × V[0] + 0.5066 × V[1] + 0.3071 × V[2]

output[1][0] = 0.1863×1 + 0.5066×0 + 0.3071×1 = 0.4934
output[1][1] = 0.1863×0 + 0.5066×1 + 0.3071×1 = 0.8137
output[1][2] = 0.1863×1 + 0.5066×0 + 0.3071×0 = 0.1863
output[1][3] = 0.1863×0 + 0.5066×1 + 0.3071×0 = 0.5066

output[1] = [0.4934, 0.8137, 0.1863, 0.5066]
```

Output Row 2:
```
output[2] = 0.2741 × V[0] + 0.2741 × V[1] + 0.4518 × V[2]

output[2][0] = 0.2741×1 + 0.2741×0 + 0.4518×1 = 0.7259
output[2][1] = 0.2741×0 + 0.2741×1 + 0.4518×1 = 0.7259
output[2][2] = 0.2741×1 + 0.2741×0 + 0.4518×0 = 0.2741
output[2][3] = 0.2741×0 + 0.2741×1 + 0.4518×0 = 0.2741

output[2] = [0.7259, 0.7259, 0.2741, 0.2741]
```

Final Output Matrix
```
output = [[0.8137, 0.4934, 0.5066, 0.1863],
          [0.4934, 0.8137, 0.1863, 0.5066],
          [0.7259, 0.7259, 0.2741, 0.2741]]

Shape: [3, 4] = [seq_len, d_v]
```

3.6 Interpretation of Results
What Did Attention Do?
Original embeddings:
```
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

After attention:

```
Token 0 ("cat"): [0.81, 0.49, 0.51, 0.19]  (gained info from others)
Token 1 ("sat"): [0.49, 0.81, 0.19, 0.51]  (gained info from others)
Token 2 ("mat"): [0.73, 0.73, 0.27, 0.27]  (blended, balanced attention)
```

Key observations:
- Information mixing: Each output contains information from all inputs
- Self-dominance: Each token still most resembles its original (due to self-attention)
- Symmetry: Token 0 and Token 1 outputs are "mirror images" (orthogonal inputs)
- Balanced blending: Token 2 got more uniform blending (it was equally similar to 0 and 1)
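These observations are easy to confirm numerically. The sketch below recomputes the output from the rounded attention weights, so it matches the hand values to about three decimals; the last assertion checks the "mirror image" symmetry between Token 0 and Token 1:

```python
import numpy as np

# Values (V = X) and the rounded attention weights from Step 3
V = np.array([[1.0, 0.0, 1.0, 0.0],   # V[0] "cat"
              [0.0, 1.0, 0.0, 1.0],   # V[1] "sat"
              [1.0, 1.0, 0.0, 0.0]])  # V[2] "mat"
weights = np.array([[0.5066, 0.1863, 0.3071],
                    [0.1863, 0.5066, 0.3071],
                    [0.2741, 0.2741, 0.4518]])

# Each output row is a weighted average of the value rows
output = weights @ V
print(output)
```

Swapping dimensions (0↔1, 2↔3) of output row 0 yields output row 1 exactly, because "cat" and "sat" have orthogonal embeddings and symmetric attention patterns.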
3.7 Complete Computation Summary
The Journey
```
Input X: [3, 4]
    ↓
Q = K = V = X
    ↓
QK^T: [3, 3]  (similarity scores)
[[2.0, 0.0, 1.0],
 [0.0, 2.0, 1.0],
 [1.0, 1.0, 2.0]]
    ↓
Scale by √4 = 2: [3, 3]
[[1.0, 0.0, 0.5],
 [0.0, 1.0, 0.5],
 [0.5, 0.5, 1.0]]
    ↓
Softmax: [3, 3]  (attention weights)
[[0.507, 0.186, 0.307],
 [0.186, 0.507, 0.307],
 [0.274, 0.274, 0.452]]
    ↓
@ V: [3, 4]  (output)
[[0.814, 0.493, 0.507, 0.186],
 [0.493, 0.814, 0.186, 0.507],
 [0.726, 0.726, 0.274, 0.274]]
```

3.8 Verification Code
Python Implementation
```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute attention exactly as derived."""
    d_k = K.shape[-1]

    # Step 1: QK^T
    scores = Q @ K.T
    print("Step 1 - QK^T:")
    print(scores)
    print()

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)
    print(f"Step 2 - Scaled by √{d_k} = {np.sqrt(d_k)}:")
    print(scaled_scores)
    print()

    # Step 3: Softmax
    attention_weights = softmax(scaled_scores, axis=-1)
    print("Step 3 - Softmax (attention weights):")
    print(attention_weights)
    print("Row sums:", attention_weights.sum(axis=-1))
    print()

    # Step 4: Weighted sum
    output = attention_weights @ V
    print("Step 4 - Output (attention_weights @ V):")
    print(output)

    return output, attention_weights

# Print to 4 decimal places so the output matches the hand calculations
np.set_printoptions(precision=4, suppress=True)

# Define input
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # Token 0
    [0.0, 1.0, 0.0, 1.0],  # Token 1
    [1.0, 1.0, 0.0, 0.0],  # Token 2
])

print("Input X:")
print(X)
print()

# Run attention
Q = K = V = X
output, weights = scaled_dot_product_attention(Q, K, V)
```

Expected Output
```
Input X:
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [1. 1. 0. 0.]]

Step 1 - QK^T:
[[2. 0. 1.]
 [0. 2. 1.]
 [1. 1. 2.]]

Step 2 - Scaled by √4 = 2.0:
[[1.  0.  0.5]
 [0.  1.  0.5]
 [0.5 0.5 1. ]]

Step 3 - Softmax (attention weights):
[[0.5066 0.1863 0.3071]
 [0.1863 0.5066 0.3071]
 [0.2741 0.2741 0.4518]]
Row sums: [1. 1. 1.]

Step 4 - Output (attention_weights @ V):
[[0.8137 0.4934 0.5066 0.1863]
 [0.4934 0.8137 0.1863 0.5066]
 [0.7259 0.7259 0.2741 0.2741]]
```

3.9 What If We Skip Scaling?
Without βd_k
Let's see what happens without scaling:
```python
# Softmax directly on QK^T (without scaling)
scores = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 1.0, 2.0]])

weights_unscaled = softmax(scores)
print("Without scaling:")
print(weights_unscaled)
```

Result:

```
Without scaling:
[[0.6652, 0.0900, 0.2447]
 [0.0900, 0.6652, 0.2447]
 [0.2119, 0.2119, 0.5762]]
```

Comparison:

```
With scaling:    [0.507, 0.186, 0.307]  (softer)
Without scaling: [0.665, 0.090, 0.245]  (sharper)
```

The unscaled version:
- More weight on highest-scoring position (0.665 vs 0.507)
- Less weight on lowest-scoring position (0.090 vs 0.186)
- More "focused" but potentially too aggressive
With larger d_k, this effect becomes more extreme!
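To see that effect empirically, here is a small sketch using assumed random inputs (not data from this walkthrough): dot products of random vectors grow with d_k, so the unscaled softmax concentrates nearly all its mass on one key, while the √d_k-scaled version stays moderate:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)  # assumed seed, for reproducibility

for d_k in (4, 16, 64):
    q = rng.normal(size=d_k)        # random query
    K = rng.normal(size=(3, d_k))   # three random keys
    raw = K @ q                     # unscaled scores: spread grows with d_k
    scaled = raw / np.sqrt(d_k)     # scaled scores: spread stays O(1)
    print(f"d_k={d_k:3d}  unscaled max weight: {softmax(raw).max():.3f}  "
          f"scaled max weight: {softmax(scaled).max():.3f}")
```

Dividing scores by √d_k shrinks the gaps between them, so the softmax distributes weight less aggressively; without it, larger d_k pushes the maximum weight toward 1.0 and the gradients toward 0.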
Summary
The Numbers
We computed attention for 3 tokens with 4-dimensional embeddings:
| Step | Operation | Output Shape | Key Values |
|---|---|---|---|
| 1 | QK^T | [3, 3] | max=2.0 |
| 2 | / βd_k | [3, 3] | max=1.0 |
| 3 | softmax | [3, 3] | rows sum to 1 |
| 4 | @ V | [3, 4] | context-enriched |
Key Takeaways
- Hand computation builds intuition: Now you know exactly what each step does
- Scaling matters: βd_k keeps softmax from being too sharp
- Attention is weighted averaging: Output combines all inputs, weighted by similarity
- Self-attention is often dominant: Tokens attend most to themselves
Exercises
Verification Exercises
- Implement the attention computation in PyTorch and verify your output matches the hand calculations above.
- Try a different input where tokens are more similar. What do you expect to happen to the attention weights?

```python
X = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 1, 0]]
```

- What happens if all tokens are identical? Compute attention for:

```python
X = [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [1, 0, 1, 0]]
```
Extension Exercises
- Add a padding mask that ignores Token 2. Set scores[:,2] = -inf before softmax. How do the attention weights change?
- Create a causal mask that prevents Token 1 from attending to Token 2, and Token 0 from attending to Token 1 or 2. How does the output change?
- Increase d_k to 16 (pad your vectors with zeros) and observe how the attention weights change without scaling vs with scaling.
Next Section Preview
Now that we've verified our understanding with hand calculations, we're ready to implement scaled dot-product attention in PyTorch! The next section will create a clean, reusable implementation with proper shape annotations.