Introduction
There's no better way to truly understand attention than to compute it by hand. In this section, we'll work through a complete example with actual numbers: small enough to verify manually, but complete enough to show all the concepts.
By the end, you'll be able to verify your implementation against these hand calculations, building confidence that your code is correct.
3.1 Problem Setup
The Scenario
We have a tiny sequence of 3 tokens with 4-dimensional embeddings.
Think of it as the sentence "cat sat mat" (simplified).
Input Data
```
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

As a matrix X (input embeddings):

```
X = [[1.0, 0.0, 1.0, 0.0],  # Token 0
     [0.0, 1.0, 0.0, 1.0],  # Token 1
     [1.0, 1.0, 0.0, 0.0]]  # Token 2

Shape: [3, 4] = [seq_len, d_model]
```

Simplification: Q = K = V = X
For this walkthrough, we'll use the input directly as Q, K, and V (no learned projections). This is called "vanilla" attention and helps isolate the core computation.
```
Q = X  [3, 4]
K = X  [3, 4]
V = X  [3, 4]

d_k = 4
```

3.2 Step 1: Compute QK^T (Attention Scores)
The Matrix Multiplication
```
QK^T = Q @ K^T
     = [3, 4] @ [4, 3]
     = [3, 3]
```

Computing Element by Element

```
scores[i][j] = Q[i] · K[j]  (dot product)
```

Row 0 (Query = Token 0 = "cat"):

```
scores[0][0] = [1,0,1,0] · [1,0,1,0] = 1×1 + 0×0 + 1×1 + 0×0 = 2.0
scores[0][1] = [1,0,1,0] · [0,1,0,1] = 1×0 + 0×1 + 1×0 + 0×1 = 0.0
scores[0][2] = [1,0,1,0] · [1,1,0,0] = 1×1 + 0×1 + 1×0 + 0×0 = 1.0
```

Row 1 (Query = Token 1 = "sat"):

```
scores[1][0] = [0,1,0,1] · [1,0,1,0] = 0×1 + 1×0 + 0×1 + 1×0 = 0.0
scores[1][1] = [0,1,0,1] · [0,1,0,1] = 0×0 + 1×1 + 0×0 + 1×1 = 2.0
scores[1][2] = [0,1,0,1] · [1,1,0,0] = 0×1 + 1×1 + 0×0 + 1×0 = 1.0
```

Row 2 (Query = Token 2 = "mat"):

```
scores[2][0] = [1,1,0,0] · [1,0,1,0] = 1×1 + 1×0 + 0×1 + 0×0 = 1.0
scores[2][1] = [1,1,0,0] · [0,1,0,1] = 1×0 + 1×1 + 0×0 + 0×1 = 1.0
scores[2][2] = [1,1,0,0] · [1,1,0,0] = 1×1 + 1×1 + 0×0 + 0×0 = 2.0
```

Result: Attention Score Matrix
```
scores = [[2.0, 0.0, 1.0],
          [0.0, 2.0, 1.0],
          [1.0, 1.0, 2.0]]
```

Interpretation:
- Token 0 ("cat") is most similar to itself (score 2.0)
- Token 1 ("sat") is most similar to itself (score 2.0)
- Token 2 ("mat") is most similar to itself (score 2.0)
- "cat" and "sat" are completely dissimilar (score 0.0)
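If you want to check this score matrix in code before reaching the full implementation in section 3.8, a minimal NumPy sketch reproduces it:

```python
import numpy as np

# Input embeddings from 3.1; with Q = K = X, the score matrix is QK^T
X = np.array([[1.0, 0.0, 1.0, 0.0],   # Token 0 "cat"
              [0.0, 1.0, 0.0, 1.0],   # Token 1 "sat"
              [1.0, 1.0, 0.0, 0.0]])  # Token 2 "mat"

scores = X @ X.T  # scores[i][j] = dot product of token i and token j
print(scores)
```

Because the rows of X are 0/1 vectors, each score simply counts how many dimensions two tokens share.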
3.3 Step 2: Scale by √d_k
The Scaling Factor
```
d_k = 4
√d_k = √4 = 2.0
```

Apply Scaling

```
scaled_scores = scores / √d_k
              = scores / 2.0

scaled_scores = [[2.0/2, 0.0/2, 1.0/2],
                 [0.0/2, 2.0/2, 1.0/2],
                 [1.0/2, 1.0/2, 2.0/2]]

              = [[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.5],
                 [0.5, 0.5, 1.0]]
```

Why we scale: These values (max 1.0) will produce softer softmax outputs than the unscaled values (max 2.0).
3.4 Step 3: Apply Softmax
Softmax Formula
For each row independently:
```
softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
```

Row 0: [1.0, 0.0, 0.5]

```
exp(1.0) = 2.7183
exp(0.0) = 1.0000
exp(0.5) = 1.6487

sum = 2.7183 + 1.0000 + 1.6487 = 5.3670

attention[0] = [2.7183/5.3670, 1.0000/5.3670, 1.6487/5.3670]
             = [0.5066, 0.1863, 0.3071]
```

Row 1: [0.0, 1.0, 0.5]
```
exp(0.0) = 1.0000
exp(1.0) = 2.7183
exp(0.5) = 1.6487

sum = 1.0000 + 2.7183 + 1.6487 = 5.3670

attention[1] = [1.0000/5.3670, 2.7183/5.3670, 1.6487/5.3670]
             = [0.1863, 0.5066, 0.3071]
```

Row 2: [0.5, 0.5, 1.0]
```
exp(0.5) = 1.6487
exp(0.5) = 1.6487
exp(1.0) = 2.7183

sum = 1.6487 + 1.6487 + 2.7183 = 6.0157

attention[2] = [1.6487/6.0157, 1.6487/6.0157, 2.7183/6.0157]
             = [0.2741, 0.2741, 0.4518]
```

Result: Attention Weight Matrix
```
attention_weights = [[0.5066, 0.1863, 0.3071],
                     [0.1863, 0.5066, 0.3071],
                     [0.2741, 0.2741, 0.4518]]
```

Verify rows sum to 1:

```
Row 0: 0.5066 + 0.1863 + 0.3071 = 1.0000 ✓
Row 1: 0.1863 + 0.5066 + 0.3071 = 1.0000 ✓
Row 2: 0.2741 + 0.2741 + 0.4518 = 1.0000 ✓
```

Interpretation:
- Token 0 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 1 (18.6%)
- Token 1 attends mostly to itself (50.7%), then Token 2 (30.7%), then Token 0 (18.6%)
- Token 2 attends most to itself (45.2%), and equally to Token 0 and 1 (27.4% each)
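The row-wise softmax above takes only two NumPy lines to verify (a quick sketch; the results agree with the hand-rounded values to about three decimal places):

```python
import numpy as np

# Scaled scores from Step 2
scaled = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])

# Softmax each row: exponentiate, then divide by the row sum
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)
print(np.round(weights, 4))
```

Each row of `weights` is a probability distribution over the three tokens.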
3.5 Step 4: Compute Weighted Sum (Output)
Matrix Multiplication
```
output = attention_weights @ V
       = [3, 3] @ [3, 4]
       = [3, 4]
```

Computing Row by Row

Recall V = X:

```
V = [[1.0, 0.0, 1.0, 0.0],  # V[0]
     [0.0, 1.0, 0.0, 1.0],  # V[1]
     [1.0, 1.0, 0.0, 0.0]]  # V[2]
```

Output Row 0:
```
output[0] = 0.5066 × V[0] + 0.1863 × V[1] + 0.3071 × V[2]
          = 0.5066 × [1,0,1,0] + 0.1863 × [0,1,0,1] + 0.3071 × [1,1,0,0]

Element by element:
output[0][0] = 0.5066×1 + 0.1863×0 + 0.3071×1 = 0.8137
output[0][1] = 0.5066×0 + 0.1863×1 + 0.3071×1 = 0.4934
output[0][2] = 0.5066×1 + 0.1863×0 + 0.3071×0 = 0.5066
output[0][3] = 0.5066×0 + 0.1863×1 + 0.3071×0 = 0.1863

output[0] = [0.8137, 0.4934, 0.5066, 0.1863]
```

Output Row 1:
```
output[1] = 0.1863 × V[0] + 0.5066 × V[1] + 0.3071 × V[2]

output[1][0] = 0.1863×1 + 0.5066×0 + 0.3071×1 = 0.4934
output[1][1] = 0.1863×0 + 0.5066×1 + 0.3071×1 = 0.8137
output[1][2] = 0.1863×1 + 0.5066×0 + 0.3071×0 = 0.1863
output[1][3] = 0.1863×0 + 0.5066×1 + 0.3071×0 = 0.5066

output[1] = [0.4934, 0.8137, 0.1863, 0.5066]
```

Output Row 2:
```
output[2] = 0.2741 × V[0] + 0.2741 × V[1] + 0.4518 × V[2]

output[2][0] = 0.2741×1 + 0.2741×0 + 0.4518×1 = 0.7259
output[2][1] = 0.2741×0 + 0.2741×1 + 0.4518×1 = 0.7259
output[2][2] = 0.2741×1 + 0.2741×0 + 0.4518×0 = 0.2741
output[2][3] = 0.2741×0 + 0.2741×1 + 0.4518×0 = 0.2741

output[2] = [0.7259, 0.7259, 0.2741, 0.2741]
```

Final Output Matrix
```
output = [[0.8137, 0.4934, 0.5066, 0.1863],
          [0.4934, 0.8137, 0.1863, 0.5066],
          [0.7259, 0.7259, 0.2741, 0.2741]]

Shape: [3, 4] = [seq_len, d_v]
```

3.6 Interpretation of Results
What Did Attention Do?
Original embeddings:
```
Token 0 ("cat"): [1.0, 0.0, 1.0, 0.0]
Token 1 ("sat"): [0.0, 1.0, 0.0, 1.0]
Token 2 ("mat"): [1.0, 1.0, 0.0, 0.0]
```

After attention:

```
Token 0 ("cat"): [0.81, 0.49, 0.51, 0.19]  (gained info from others)
Token 1 ("sat"): [0.49, 0.81, 0.19, 0.51]  (gained info from others)
Token 2 ("mat"): [0.73, 0.73, 0.27, 0.27]  (blended, balanced attention)
```

Key observations:
- Information mixing: Each output contains information from all inputs
- Self-dominance: Each token still most resembles its original (due to self-attention)
- Symmetry: Token 0 and Token 1 outputs are "mirror images" (orthogonal inputs)
- Balanced blending: Token 2 got more uniform blending (it was equally similar to 0 and 1)
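These observations are easy to confirm numerically. The sketch below recomputes the output from the rounded attention weights, so it matches the hand values to about three decimals; the last assertion checks the "mirror image" symmetry between Token 0 and Token 1:

```python
import numpy as np

# Values (V = X) and the rounded attention weights from Step 3
V = np.array([[1.0, 0.0, 1.0, 0.0],   # V[0] "cat"
              [0.0, 1.0, 0.0, 1.0],   # V[1] "sat"
              [1.0, 1.0, 0.0, 0.0]])  # V[2] "mat"
weights = np.array([[0.5066, 0.1863, 0.3071],
                    [0.1863, 0.5066, 0.3071],
                    [0.2741, 0.2741, 0.4518]])

# Each output row is a weighted average of the value rows
output = weights @ V
print(output)
```

Swapping dimensions (0↔1, 2↔3) of output row 0 yields output row 1 exactly, because "cat" and "sat" have orthogonal embeddings and symmetric attention patterns.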
3.7 Complete Computation Summary
The Journey
```
Input X: [3, 4]
    ↓
Q = K = V = X
    ↓
QK^T: [3, 3]  (similarity scores)
[[2.0, 0.0, 1.0],
 [0.0, 2.0, 1.0],
 [1.0, 1.0, 2.0]]
    ↓
Scale by √4 = 2: [3, 3]
[[1.0, 0.0, 0.5],
 [0.0, 1.0, 0.5],
 [0.5, 0.5, 1.0]]
    ↓
Softmax: [3, 3]  (attention weights)
[[0.507, 0.186, 0.307],
 [0.186, 0.507, 0.307],
 [0.274, 0.274, 0.452]]
    ↓
@ V: [3, 4]  (output)
[[0.814, 0.493, 0.507, 0.186],
 [0.493, 0.814, 0.186, 0.507],
 [0.726, 0.726, 0.274, 0.274]]
```

3.8 Verification Code
Python Implementation
```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute attention exactly as derived."""
    d_k = K.shape[-1]

    # Step 1: QK^T
    scores = Q @ K.T
    print("Step 1 - QK^T:")
    print(scores)
    print()

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)
    print(f"Step 2 - Scaled by √{d_k} = {np.sqrt(d_k)}:")
    print(scaled_scores)
    print()

    # Step 3: Softmax
    attention_weights = softmax(scaled_scores, axis=-1)
    print("Step 3 - Softmax (attention weights):")
    print(attention_weights)
    print("Row sums:", attention_weights.sum(axis=-1))
    print()

    # Step 4: Weighted sum
    output = attention_weights @ V
    print("Step 4 - Output (attention_weights @ V):")
    print(output)

    return output, attention_weights

# Print to 4 decimal places so the output matches the hand calculations
np.set_printoptions(precision=4, suppress=True)

# Define input
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # Token 0
    [0.0, 1.0, 0.0, 1.0],  # Token 1
    [1.0, 1.0, 0.0, 0.0],  # Token 2
])

print("Input X:")
print(X)
print()

# Run attention
Q = K = V = X
output, weights = scaled_dot_product_attention(Q, K, V)
```

Expected Output
```
Input X:
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [1. 1. 0. 0.]]

Step 1 - QK^T:
[[2. 0. 1.]
 [0. 2. 1.]
 [1. 1. 2.]]

Step 2 - Scaled by √4 = 2.0:
[[1.  0.  0.5]
 [0.  1.  0.5]
 [0.5 0.5 1. ]]

Step 3 - Softmax (attention weights):
[[0.5066 0.1863 0.3071]
 [0.1863 0.5066 0.3071]
 [0.2741 0.2741 0.4518]]
Row sums: [1. 1. 1.]

Step 4 - Output (attention_weights @ V):
[[0.8137 0.4934 0.5066 0.1863]
 [0.4934 0.8137 0.1863 0.5066]
 [0.7259 0.7259 0.2741 0.2741]]
```

3.9 What If We Skip Scaling?
Without βd_k
Let's see what happens without scaling:
```python
# Softmax directly on QK^T (without scaling)
scores = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 1.0, 2.0]])

weights_unscaled = softmax(scores)
print("Without scaling:")
print(weights_unscaled)
```

Result:

```
Without scaling:
[[0.6652, 0.0900, 0.2447]
 [0.0900, 0.6652, 0.2447]
 [0.2119, 0.2119, 0.5762]]
```

Comparison:

```
With scaling:    [0.507, 0.186, 0.307]  (softer)
Without scaling: [0.665, 0.090, 0.245]  (sharper)
```

The unscaled version:
- More weight on highest-scoring position (0.665 vs 0.507)
- Less weight on lowest-scoring position (0.090 vs 0.186)
- More "focused" but potentially too aggressive
With larger d_k, this effect becomes more extreme!
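To see that effect empirically, here is a small sketch using assumed random inputs (not data from this walkthrough): dot products of random vectors grow with d_k, so the unscaled softmax concentrates nearly all its mass on one key, while the √d_k-scaled version stays moderate:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)  # assumed seed, for reproducibility

for d_k in (4, 16, 64):
    q = rng.normal(size=d_k)        # random query
    K = rng.normal(size=(3, d_k))   # three random keys
    raw = K @ q                     # unscaled scores: spread grows with d_k
    scaled = raw / np.sqrt(d_k)     # scaled scores: spread stays O(1)
    print(f"d_k={d_k:3d}  unscaled max weight: {softmax(raw).max():.3f}  "
          f"scaled max weight: {softmax(scaled).max():.3f}")
```

Dividing scores by √d_k shrinks the gaps between them, so the softmax distributes weight less aggressively; without it, larger d_k pushes the maximum weight toward 1.0 and the gradients toward 0.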
Summary
The Numbers
We computed attention for 3 tokens with 4-dimensional embeddings:
| Step | Operation | Output Shape | Key Values |
|---|---|---|---|
| 1 | QK^T | [3, 3] | max=2.0 |
| 2 | / βd_k | [3, 3] | max=1.0 |
| 3 | softmax | [3, 3] | rows sum to 1 |
| 4 | @ V | [3, 4] | context-enriched |
Key Takeaways
- Hand computation builds intuition: Now you know exactly what each step does
- Scaling matters: βd_k keeps softmax from being too sharp
- Attention is weighted averaging: Output combines all inputs, weighted by similarity
- Self-attention is often dominant: Tokens attend most to themselves
Exercises
Verification Exercises
- Implement the attention computation in PyTorch and verify your output matches the hand calculations above.
- Try a different input where tokens are more similar. What do you expect to happen to the attention weights?

```python
X = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 1, 0]]
```

- What happens if all tokens are identical? Compute attention for:

```python
X = [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [1, 0, 1, 0]]
```
Extension Exercises
- Add a padding mask that ignores Token 2. Set scores[:,2] = -inf before softmax. How do the attention weights change?
- Create a causal mask that prevents Token 1 from attending to Token 2, and Token 0 from attending to Token 1 or 2. How does the output change?
- Increase d_k to 16 (pad your vectors with zeros) and observe how the attention weights change without scaling vs with scaling.
Next Section Preview
Now that we've verified our understanding with hand calculations, we're ready to implement scaled dot-product attention in PyTorch! The next section will create a clean, reusable implementation with proper shape annotations.