Chapter 2

Numerical Walkthrough with Tiny Vectors

Attention Mechanism From Scratch

Introduction

There's no better way to truly understand attention than to compute it by hand. In this section, we'll work through a complete example with actual numbers: small enough to verify manually, but complete enough to show all the concepts.

By the end, you'll be able to verify your implementation against these hand calculations, building confidence that your code is correct.


3.1 Problem Setup

The Scenario

We have a tiny sequence of 3 tokens with 4-dimensional embeddings.

Think of it as the sentence "cat sat mat" (simplified).

Input Data

```text
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

As a matrix X (input embeddings):

```text
X = [[1.0, 0.0, 1.0, 0.0],   # Token 0
     [0.0, 1.0, 0.0, 1.0],   # Token 1
     [1.0, 1.0, 0.0, 0.0]]   # Token 2

Shape: [3, 4] = [seq_len, d_model]
```

Simplification: Q = K = V = X

For this walkthrough, we'll use the input directly as Q, K, and V, with no learned projection matrices. Stripping out the learned W_Q, W_K, and W_V projections isolates the core attention computation.

```text
Q = X   # [3, 4]
K = X   # [3, 4]
V = X   # [3, 4]

d_k = 4
```
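The same setup in NumPy, if you want to follow along in a REPL (values copied from the matrices above):

```python
import numpy as np

# Three 4-dimensional token embeddings ("cat", "sat", "mat")
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # Token 0
    [0.0, 1.0, 0.0, 1.0],  # Token 1
    [1.0, 1.0, 0.0, 0.0],  # Token 2
])

# No learned projections in this walkthrough: Q = K = V = X
Q = K = V = X
d_k = K.shape[-1]
print(X.shape, d_k)  # (3, 4) 4
```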

3.2 Step 1: Compute QK^T (Attention Scores)

The Matrix Multiplication

```text
QK^T = Q @ K^T
     = [3, 4] @ [4, 3]
     = [3, 3]
```

Computing Element by Element

```text
scores[i][j] = Q[i] · K[j]   (dot product)
```
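In NumPy, the nine individual dot products and the single matrix multiply give identical results:

```python
import numpy as np

X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
Q = K = X

# Element by element: scores[i][j] = Q[i] · K[j]
scores_loop = np.array([[Q[i] @ K[j] for j in range(3)] for i in range(3)])

# One matrix multiply computes all nine dot products at once
scores = Q @ K.T
assert np.array_equal(scores, scores_loop)
print(scores)  # the 3x3 score matrix derived by hand below
```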

Row 0 (Query = Token 0 = "cat"):

```text
scores[0][0] = [1,0,1,0] · [1,0,1,0] = 1×1 + 0×0 + 1×1 + 0×0 = 2.0
scores[0][1] = [1,0,1,0] · [0,1,0,1] = 1×0 + 0×1 + 1×0 + 0×1 = 0.0
scores[0][2] = [1,0,1,0] · [1,1,0,0] = 1×1 + 0×1 + 1×0 + 0×0 = 1.0
```

Row 1 (Query = Token 1 = "sat"):

```text
scores[1][0] = [0,1,0,1] · [1,0,1,0] = 0×1 + 1×0 + 0×1 + 1×0 = 0.0
scores[1][1] = [0,1,0,1] · [0,1,0,1] = 0×0 + 1×1 + 0×0 + 1×1 = 2.0
scores[1][2] = [0,1,0,1] · [1,1,0,0] = 0×1 + 1×1 + 0×0 + 1×0 = 1.0
```

Row 2 (Query = Token 2 = "mat"):

```text
scores[2][0] = [1,1,0,0] · [1,0,1,0] = 1×1 + 1×0 + 0×1 + 0×0 = 1.0
scores[2][1] = [1,1,0,0] · [0,1,0,1] = 1×0 + 1×1 + 0×0 + 0×1 = 1.0
scores[2][2] = [1,1,0,0] · [1,1,0,0] = 1×1 + 1×1 + 0×0 + 0×0 = 2.0
```

Result: Attention Score Matrix

```text
scores = [[2.0, 0.0, 1.0],
          [0.0, 2.0, 1.0],
          [1.0, 1.0, 2.0]]
```

Interpretation:

  • Token 0 ("cat") is most similar to itself (score 2.0)
  • Token 1 ("sat") is most similar to itself (score 2.0)
  • Token 2 ("mat") is most similar to itself (score 2.0)
  • "cat" and "sat" have orthogonal embeddings, so their score is 0.0

3.3 Step 2: Scale by √d_k

The Scaling Factor

```text
d_k = 4
√d_k = √4 = 2.0
```

Apply Scaling

```text
scaled_scores = scores / √d_k
              = scores / 2.0

scaled_scores = [[2.0/2, 0.0/2, 1.0/2],
                 [0.0/2, 2.0/2, 1.0/2],
                 [1.0/2, 1.0/2, 2.0/2]]

              = [[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.5],
                 [0.5, 0.5, 1.0]]
```
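A quick NumPy check of the scaling step:

```python
import numpy as np

scores = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 1.0, 2.0]])
d_k = 4

scaled_scores = scores / np.sqrt(d_k)
print(scaled_scores)  # matches the hand-scaled matrix above
```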

Why we scale: dot products grow with the dimension d_k, so without scaling, larger models would feed increasingly extreme values into softmax. Here the scaled values (max 1.0) will produce softer softmax outputs than the unscaled values (max 2.0) would; Section 3.9 shows the difference.


3.4 Step 3: Apply Softmax

Softmax Formula

For each row independently:

```text
softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
```

Row 0: [1.0, 0.0, 0.5]

```text
exp(1.0) = 2.7183
exp(0.0) = 1.0000
exp(0.5) = 1.6487

sum = 2.7183 + 1.0000 + 1.6487 = 5.3670

attention[0] = [2.7183/5.3670, 1.0000/5.3670, 1.6487/5.3670]
             = [0.5066, 0.1863, 0.3071]
```

Row 1: [0.0, 1.0, 0.5]

```text
exp(0.0) = 1.0000
exp(1.0) = 2.7183
exp(0.5) = 1.6487

sum = 1.0000 + 2.7183 + 1.6487 = 5.3670

attention[1] = [1.0000/5.3670, 2.7183/5.3670, 1.6487/5.3670]
             = [0.1863, 0.5066, 0.3071]
```

Row 2: [0.5, 0.5, 1.0]

```text
exp(0.5) = 1.6487
exp(0.5) = 1.6487
exp(1.0) = 2.7183

sum = 1.6487 + 1.6487 + 2.7183 = 6.0157

attention[2] = [1.6487/6.0157, 1.6487/6.0157, 2.7183/6.0157]
             = [0.2741, 0.2741, 0.4518]
```

Result: Attention Weight Matrix

```text
attention_weights = [[0.5066, 0.1863, 0.3071],
                     [0.1863, 0.5066, 0.3071],
                     [0.2741, 0.2741, 0.4518]]
```

Verify rows sum to 1:

```text
Row 0: 0.5066 + 0.1863 + 0.3071 = 1.0000 ✓
Row 1: 0.1863 + 0.5066 + 0.3071 = 1.0000 ✓
Row 2: 0.2741 + 0.2741 + 0.4518 = 1.0000 ✓
```
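The softmax step in NumPy confirms the hand-computed weights (to about three decimal places) and the row sums:

```python
import numpy as np

scaled_scores = np.array([[1.0, 0.0, 0.5],
                          [0.0, 1.0, 0.5],
                          [0.5, 0.5, 1.0]])

# Row-wise softmax: exponentiate, then normalize each row
exp_s = np.exp(scaled_scores)
attention_weights = exp_s / exp_s.sum(axis=-1, keepdims=True)

print(attention_weights)               # close to the hand-computed matrix
print(attention_weights.sum(axis=-1))  # each row sums to 1 (up to float rounding)
```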

Interpretation:

  • Token 0 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 1 (18.6%)
  • Token 1 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 0 (18.6%)
  • Token 2 attends most to itself (45.2%), and equally to Token 0 and 1 (27.4% each)

3.5 Step 4: Compute Weighted Sum (Output)

Matrix Multiplication

```text
output = attention_weights @ V
       = [3, 3] @ [3, 4]
       = [3, 4]
```

Computing Row by Row

Recall V = X:

```text
V = [[1.0, 0.0, 1.0, 0.0],   # V[0]
     [0.0, 1.0, 0.0, 1.0],   # V[1]
     [1.0, 1.0, 0.0, 0.0]]   # V[2]
```

Output Row 0:

```text
output[0] = 0.5066 × V[0] + 0.1863 × V[1] + 0.3071 × V[2]
          = 0.5066 × [1,0,1,0] + 0.1863 × [0,1,0,1] + 0.3071 × [1,1,0,0]

Element by element:
output[0][0] = 0.5066×1 + 0.1863×0 + 0.3071×1 = 0.8137
output[0][1] = 0.5066×0 + 0.1863×1 + 0.3071×1 = 0.4934
output[0][2] = 0.5066×1 + 0.1863×0 + 0.3071×0 = 0.5066
output[0][3] = 0.5066×0 + 0.1863×1 + 0.3071×0 = 0.1863

output[0] = [0.8137, 0.4934, 0.5066, 0.1863]
```

Output Row 1:

```text
output[1] = 0.1863 × V[0] + 0.5066 × V[1] + 0.3071 × V[2]

output[1][0] = 0.1863×1 + 0.5066×0 + 0.3071×1 = 0.4934
output[1][1] = 0.1863×0 + 0.5066×1 + 0.3071×1 = 0.8137
output[1][2] = 0.1863×1 + 0.5066×0 + 0.3071×0 = 0.1863
output[1][3] = 0.1863×0 + 0.5066×1 + 0.3071×0 = 0.5066

output[1] = [0.4934, 0.8137, 0.1863, 0.5066]
```

Output Row 2:

```text
output[2] = 0.2741 × V[0] + 0.2741 × V[1] + 0.4518 × V[2]

output[2][0] = 0.2741×1 + 0.2741×0 + 0.4518×1 = 0.7259
output[2][1] = 0.2741×0 + 0.2741×1 + 0.4518×1 = 0.7259
output[2][2] = 0.2741×1 + 0.2741×0 + 0.4518×0 = 0.2741
output[2][3] = 0.2741×0 + 0.2741×1 + 0.4518×0 = 0.2741

output[2] = [0.7259, 0.7259, 0.2741, 0.2741]
```

Final Output Matrix

```text
output = [[0.8137, 0.4934, 0.5066, 0.1863],
          [0.4934, 0.8137, 0.1863, 0.5066],
          [0.7259, 0.7259, 0.2741, 0.2741]]

Shape: [3, 4] = [seq_len, d_v]
```
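All four steps compress into a few lines of NumPy; the result matches the hand-computed matrix to within rounding:

```python
import numpy as np

X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])

scores = X @ X.T / np.sqrt(X.shape[-1])                                # steps 1-2
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # step 3
output = weights @ X                                                   # step 4

print(output)  # close to the hand-computed output matrix above
```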

3.6 Interpretation of Results

What Did Attention Do?

Original embeddings:

```text
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

After attention:

```text
Token 0 ("cat"): [0.81, 0.49, 0.51, 0.19]  (gained info from others)
Token 1 ("sat"): [0.49, 0.81, 0.19, 0.51]  (gained info from others)
Token 2 ("mat"): [0.73, 0.73, 0.27, 0.27]  (blended, balanced attention)
```

Key observations:

  1. Information mixing: Each output contains information from all inputs
  2. Self-dominance: Each token still most resembles its original (due to self-attention)
  3. Symmetry: Token 0 and Token 1 outputs are "mirror images" (orthogonal inputs)
  4. Balanced blending: Token 2 got more uniform blending (it was equally similar to 0 and 1)
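The symmetry observation can be checked directly: swapping dimensions 0↔1 and 2↔3 of Token 0's output reproduces Token 1's output exactly (using the hand-computed values from above):

```python
import numpy as np

output = np.array([[0.8137, 0.4934, 0.5066, 0.1863],
                   [0.4934, 0.8137, 0.1863, 0.5066],
                   [0.7259, 0.7259, 0.2741, 0.2741]])

# Tokens 0 and 1 have orthogonal, mirror-image embeddings,
# so their outputs are permutations of each other
mirrored = output[0][[1, 0, 3, 2]]  # swap dims (0,1) and (2,3)
print(np.array_equal(mirrored, output[1]))  # True
```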

3.7 Complete Computation Summary

The Journey

```text
Input X:           [3, 4]
        ↓
Q = K = V = X
        ↓
QK^T:              [3, 3]  (similarity scores)
[[2.0, 0.0, 1.0],
 [0.0, 2.0, 1.0],
 [1.0, 1.0, 2.0]]
        ↓
Scale by √4 = 2:   [3, 3]
[[1.0, 0.0, 0.5],
 [0.0, 1.0, 0.5],
 [0.5, 0.5, 1.0]]
        ↓
Softmax:           [3, 3]  (attention weights)
[[0.507, 0.186, 0.307],
 [0.186, 0.507, 0.307],
 [0.274, 0.274, 0.452]]
        ↓
@ V:               [3, 4]  (output)
[[0.814, 0.493, 0.507, 0.186],
 [0.493, 0.814, 0.186, 0.507],
 [0.726, 0.726, 0.274, 0.274]]
```

3.8 Verification Code

Python Implementation

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute attention exactly as derived."""
    d_k = K.shape[-1]

    # Step 1: QK^T
    scores = Q @ K.T
    print("Step 1 - QK^T:")
    print(scores)
    print()

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)
    print(f"Step 2 - Scaled by √{d_k} = {np.sqrt(d_k)}:")
    print(scaled_scores)
    print()

    # Step 3: Softmax
    attention_weights = softmax(scaled_scores, axis=-1)
    print("Step 3 - Softmax (attention weights):")
    print(attention_weights)
    print("Row sums:", attention_weights.sum(axis=-1))
    print()

    # Step 4: Weighted sum
    output = attention_weights @ V
    print("Step 4 - Output (attention_weights @ V):")
    print(output)

    return output, attention_weights

# Define input
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # Token 0
    [0.0, 1.0, 0.0, 1.0],  # Token 1
    [1.0, 1.0, 0.0, 0.0],  # Token 2
])

print("Input X:")
print(X)
print()

# Run attention
Q = K = V = X
output, weights = scaled_dot_product_attention(Q, K, V)
```

Expected Output (array values shown rounded to four decimal places; NumPy prints more digits by default)

πŸ“text
1Input X:
2[[1. 0. 1. 0.]
3 [0. 1. 0. 1.]
4 [1. 1. 0. 0.]]
5
6Step 1 - QK^T:
7[[2. 0. 1.]
8 [0. 2. 1.]
9 [1. 1. 2.]]
10
11Step 2 - Scaled by √4 = 2.0:
12[[1.  0.  0.5]
13 [0.  1.  0.5]
14 [0.5 0.5 1. ]]
15
16Step 3 - Softmax (attention weights):
17[[0.5066 0.1863 0.3071]
18 [0.1863 0.5066 0.3071]
19 [0.2741 0.2741 0.4518]]
20Row sums: [1. 1. 1.]
21
22Step 4 - Output (attention_weights @ V):
23[[0.8137 0.4934 0.5066 0.1863]
24 [0.4934 0.8137 0.1863 0.5066]
25 [0.7259 0.7259 0.2741 0.2741]]

3.9 What If We Skip Scaling?

Without √d_k

Let's see what happens without scaling:

```python
# Softmax directly on QK^T (without scaling)
scores = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 1.0, 2.0]])

weights_unscaled = softmax(scores)
print("Without scaling:")
print(weights_unscaled)
```

Result:

```text
Without scaling:
[[0.6652 0.0900 0.2447]
 [0.0900 0.6652 0.2447]
 [0.2119 0.2119 0.5762]]
```

Comparison:

```text
With scaling:    [0.507, 0.186, 0.307]  (softer)
Without scaling: [0.665, 0.090, 0.245]  (sharper)
```

The unscaled version:

  • More weight on highest-scoring position (0.665 vs 0.507)
  • Less weight on lowest-scoring position (0.090 vs 0.186)
  • More "focused" but potentially too aggressive

With larger d_k, this effect becomes more extreme!
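The larger-d_k claim is easy to demonstrate with random vectors. A minimal sketch (the dimension 64, the 5 keys, and the fixed seed are arbitrary choices for illustration): dot products of random unit-variance vectors have standard deviation about √d_k, so the unscaled softmax concentrates almost all its mass on one key.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)          # one random query
keys = rng.standard_normal((5, d_k))  # five random keys

raw_scores = keys @ q  # scores have std ≈ √d_k = 8

# Dividing by √d_k always softens the distribution:
# the unscaled softmax puts more mass on its top-scoring key
print("unscaled max weight:", softmax(raw_scores).max())
print("scaled max weight:  ", softmax(raw_scores / np.sqrt(d_k)).max())
```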


Summary

The Numbers

We computed attention for 3 tokens with 4-dimensional embeddings:

```text
Step   Operation   Output Shape   Key Values
  1    QK^T        [3, 3]         max = 2.0
  2    / √d_k      [3, 3]         max = 1.0
  3    softmax     [3, 3]         rows sum to 1
  4    @ V         [3, 4]         context-enriched
```

Key Takeaways

  1. Hand computation builds intuition: Now you know exactly what each step does
  2. Scaling matters: √d_k keeps softmax from being too sharp
  3. Attention is weighted averaging: Output combines all inputs, weighted by similarity
  4. Self-attention is often dominant: Tokens attend most to themselves

Exercises

Verification Exercises

  1. Implement the attention computation in PyTorch and verify your output matches the hand calculations above.
  2. Try a different input where tokens are more similar:
    ```text
    X = [[1, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 1, 0]]
    ```
    What do you expect to happen to the attention weights?
  3. What happens if all tokens are identical? Compute attention for:
    ```text
    X = [[1, 0, 1, 0],
         [1, 0, 1, 0],
         [1, 0, 1, 0]]
    ```

Extension Exercises

  1. Add a padding mask that ignores Token 2. Set scores[:,2] = -inf before softmax. How do the attention weights change?
  2. Create a causal mask that prevents Token 1 from attending to Token 2, and Token 0 from attending to Token 1 or 2. How does the output change?
  3. Increase d_k to 16 (pad your vectors with zeros) and observe how the attention weights change without scaling vs with scaling.

Next Section Preview

Now that we've verified our understanding with hand calculations, we're ready to implement scaled dot-product attention in PyTorch! The next section will create a clean, reusable implementation with proper shape annotations.