Introduction
Before diving into formulas and code, let's build a solid intuition for what attention actually does. Understanding attention at a conceptual level will make the mathematics feel natural rather than arbitrary.
The attention mechanism is arguably the single most important innovation in modern deep learning. Once you truly understand it, you'll see it everywhere—and you'll understand why transformers work so well.
1.1 Attention as Selective Focus
The Human Analogy
When you read this sentence, you don't process every word with equal importance. Your brain automatically:
- Focuses on key words that carry meaning
- Relates words to each other based on context
- Ignores or downweights less relevant information
Consider: "The cat that was sitting on the mat was hungry."
To understand what "hungry" refers to, your brain:
- Connects "hungry" to "cat" (not "mat")
- Uses grammatical structure (subject-verb agreement)
- Ignores intervening words as less relevant
Attention mechanisms do exactly this—computationally.
The Cocktail Party Effect
Imagine being at a noisy party. You can:
- Focus on one conversation while ignoring others
- Suddenly shift attention when someone says your name
- Dynamically adjust focus based on relevance
Neural attention works similarly:
- Each position in a sequence can "focus" on different parts
- Focus is determined by relevance (similarity)
- Attention is dynamic—different inputs produce different patterns
1.2 Attention as Weighted Averaging
The Core Operation
At its heart, attention is just a weighted average:
```
output = Σ (weight_i × value_i)
```
Where:
- value_i is the information at position i
- weight_i is how much to focus on position i
- Σ weight_i = 1 (the weights form a probability distribution)
A Simple Example
Suppose we want to compute a "context-aware" representation for the word "it" in:
"The cat sat on the mat. It was happy."
What does "it" refer to? Let's say we have embeddings:
```
cat = [0.8, 0.2, 0.1]  (animal-like)
mat = [0.1, 0.9, 0.3]  (object-like)
it  = [0.7, 0.3, 0.2]  (pronoun, somewhat animal-like)
```
If attention assigns weights:
- cat: 0.7 (high attention - "it" likely refers to cat)
- mat: 0.2 (low attention)
- other words: 0.1
Then the context-enriched representation of "it":
```
new_it = 0.7 × cat + 0.2 × mat + 0.1 × others
       = 0.7 × [0.8, 0.2, 0.1] + 0.2 × [0.1, 0.9, 0.3] + ...
```
Now "it" carries information about what it refers to!
Tiny numeric sanity check
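The arithmetic above fits in a few lines of Python. Since the example elides the "others" contribution, the `others` vector below is a hypothetical placeholder (all zeros):

```python
# Sanity-check the weighted average for "it", using the vectors from the example.
cat = [0.8, 0.2, 0.1]
mat = [0.1, 0.9, 0.3]
others = [0.0, 0.0, 0.0]  # hypothetical placeholder for the remaining words

weights = {"cat": 0.7, "mat": 0.2, "others": 0.1}
vectors = {"cat": cat, "mat": mat, "others": others}

# new_it[d] = Σ weight(name) × vector(name)[d]
new_it = [sum(weights[name] * vectors[name][d] for name in weights)
          for d in range(3)]
print([round(x, 2) for x in new_it])  # → [0.58, 0.32, 0.13]
```

The result sits between the `cat` and `mat` vectors but much closer to `cat`, reflecting the 0.7 weight.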
1.3 The Query-Key-Value Framework
The Database Analogy
Think of attention like searching a database:
| Concept | Database | Attention |
|---|---|---|
| Query | Search term | "What am I looking for?" |
| Key | Index entries | "What does each item contain?" |
| Value | Actual data | "What information should I return?" |
Process:
- Compare Query to all Keys (compute similarity)
- Use similarities to create attention weights
- Return weighted combination of Values
Example: Finding a Book
Query: "machine learning Python"
Database entries:
| Key (Title/Tags) | Value (Content) |
|---|---|
| "ML algorithms Python" | Chapter on scikit-learn... |
| "Web development JavaScript" | React tutorial... |
| "Deep learning PyTorch" | Neural networks... |
Process:
- Match query to keys: ML→high, Web→low, Deep→medium
- Weights: [0.6, 0.05, 0.35]
- Return: Mostly scikit-learn chapter, some PyTorch
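This "soft lookup" — returning a similarity-weighted blend of all values instead of the single best match — can be sketched directly. The query, keys, and values below are illustrative toy vectors, not from the book example:

```python
import math

def softmax(scores):
    """Convert raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    # 1. Compare the query to all keys (dot-product similarity).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Turn similarities into attention weights.
    weights = softmax(scores)
    # 3. Return the weighted combination of values.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy lookup: the query is most similar to the first key.
query = [1.0, 0.0]
keys = [[0.9, 0.1], [0.1, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = soft_lookup(query, keys, values)
print([round(x, 2) for x in result])  # ≈ [0.69, 0.31]
```

Unlike a hard database lookup, every value contributes; the weights just make the best match dominate.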
In Neural Networks
For each position in a sequence:
```
Query:  "What information do I need?"   (from current position)
Keys:   "What information do I have?"   (from all positions)
Values: "What information can I give?"  (from all positions)
```
The key insight: Q, K, V are all learned transformations of the same input!
```
Q = input @ W_Q  # Learn what to look for
K = input @ W_K  # Learn what to advertise
V = input @ W_V  # Learn what to provide
```
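A minimal sketch of those three projections in plain Python. The weight matrices here are tiny hand-picked stand-ins for learned parameters, chosen only to make the arithmetic easy to follow:

```python
def matmul(x, w):
    # (n × d) @ (d × m): project every input row with the same weight matrix.
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]

# 3 tokens with 2-dim embeddings; the SAME input feeds all three projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

W_Q = [[1.0, 0.0], [0.0, 1.0]]   # "what to look for"   (toy values)
W_K = [[0.0, 1.0], [1.0, 0.0]]   # "what to advertise"  (toy values)
W_V = [[0.5, 0.0], [0.0, 0.5]]   # "what to provide"    (toy values)

Q, K, V = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)
print(Q[0], K[0], V[0])  # → [1.0, 0.0] [0.0, 1.0] [0.5, 0.0]
```

The same token gets three different representations, one per role — that separation is what the projections buy us.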
1.4 Self-Attention vs Cross-Attention
Self-Attention
All of Q, K, V come from the same sequence.
Used in: Encoder self-attention, decoder self-attention
```
Input: "The cat sat"
         ↓    ↓   ↓
       [e1,  e2, e3]   (embeddings)
         ↓    ↓   ↓
Q: [q1, q2, q3]  (queries from same input)
K: [k1, k2, k3]  (keys from same input)
V: [v1, v2, v3]  (values from same input)
```
Each token attends to all tokens in the same sequence (including itself).
Purpose: Learn relationships within a sequence
- "cat" attends to "The" to understand it's a specific cat
- "sat" attends to "cat" to understand who is sitting
Cross-Attention
Q comes from one sequence, K and V from another.
Used in: Decoder cross-attention (attending to encoder output)
```
Decoder input: "The"
                ↓
Q: [q1]  (query from decoder)

Encoder output: [e1, e2, e3, e4]   (from "Der Hund ist schwarz")
                 ↓   ↓   ↓   ↓
K: [k1, k2, k3, k4]  (keys from encoder)
V: [v1, v2, v3, v4]  (values from encoder)
```
Purpose: Let the decoder "look at" the source sequence
- When generating "The", attend to "Der" (German article)
- When generating "dog", attend to "Hund" (German noun)
Shapes to remember
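In cross-attention with n_dec decoder positions and n_enc encoder positions, Q has shape (n_dec × d), K and V have shape (n_enc × d), and the attention-weight matrix is (n_dec × n_enc). A quick check with the 1-query, 4-position setup from the diagram (the numbers are arbitrary toy values):

```python
import math

# 1 decoder query, 4 encoder keys/values, model dim d = 3 (toy numbers).
Q = [[0.5, 0.1, 0.2]]                      # shape (1, 3): from the decoder
K = [[0.4, 0.0, 0.1], [0.9, 0.2, 0.3],
     [0.1, 0.8, 0.0], [0.2, 0.2, 0.7]]     # shape (4, 3): from the encoder
V = K                                      # shape (4, 3): from the encoder

# One score per encoder position → a (1 × 4) weight row after softmax.
scores = [sum(q * k for q, k in zip(Q[0], row)) for row in K]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Output: one (1 × 3) vector, a blend of the encoder values.
output = [sum(w * row[d] for w, row in zip(weights, V)) for d in range(3)]
print(len(weights), len(output))  # → 4 3
```

Note the asymmetry: the number of weights tracks the encoder length, while the number of outputs tracks the decoder length.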
1.5 What High Attention Weight Means
Interpreting Attention Patterns
When position i has high attention weight on position j:
```
Position i "pays attention to" / "looks at" / "focuses on" position j
```
This means:
- Token i needs information from token j
- Token j is relevant for understanding token i
- Output at position i will incorporate information from position j
Common Attention Patterns
1. Diagonal Pattern
```
     a b c d
a  [ ■ · · · ]   Each token attends mostly to itself
b  [ · ■ · · ]   (common in lower layers)
c  [ · · ■ · ]
d  [ · · · ■ ]
```
2. Column Pattern
```
     a b c d
a  [ · ■ · · ]   All tokens attend to one important token
b  [ · ■ · · ]   (e.g., subject of sentence)
c  [ · ■ · · ]
d  [ · ■ · · ]
```
3. Syntactic Pattern
```
      The cat sat on the mat
sat  [  ·   ■   ·  ·   ·  ■ ]   "sat" attends to "cat" (subject)
                                and "mat" (object)
```
4. Previous Token Pattern
```
     a b c d
a  [ ■ · · · ]   Each token attends to itself and the previous token
b  [ ■ ■ · · ]   (common in causal/decoder attention)
c  [ · ■ ■ · ]
d  [ · · ■ ■ ]
```
1.6 Attention as Dynamic Routing
Static vs Dynamic
Traditional neural networks: Fixed weights, same computation for every input
```
output = W × input   (W is fixed after training)
```
Attention: Weights depend on input, different computation for different inputs
```
weights = f(input)   (computed fresh for each input)
output  = weights × values
```
Why This Matters
Consider processing two sentences:
- "The animal didn't cross the road because it was too tired."
- "The animal didn't cross the road because it was too wide."
In sentence 1, "it" = animal
In sentence 2, "it" = road
A static network would process "it" the same way in both cases.
Attention computes different weights:
- Sentence 1: "it" attends strongly to "animal"
- Sentence 2: "it" attends strongly to "road"
The routing of information adapts to the input.
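This input dependence is easy to demonstrate: with fixed keys and values, two slightly different queries produce different weights and therefore route information differently. The vectors below are illustrative toys standing in for the "animal"/"road" contexts, not real embeddings:

```python
import math

def attend(query, keys, values):
    """Dot-product attention for a single query: returns (weights, output)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

keys = [[2.0, 0.0], [0.0, 2.0]]      # fixed: "animal"-like and "road"-like (toy)
values = [[1.0, 0.0], [0.0, 1.0]]

# Two slightly different contexts for "it" route information differently:
w1, _ = attend([1.0, 0.2], keys, values)  # context leans toward key 0
w2, _ = attend([0.2, 1.0], keys, values)  # context leans toward key 1
print([round(w, 2) for w in w1], [round(w, 2) for w in w2])
```

Nothing in the network changed between the two calls — only the input — yet the attention distribution flipped. That is the sense in which attention "routes" dynamically.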
1.7 Building Intuition: A Visual Walkthrough
Step-by-Step Example
Input: "The cat sat"
Step 1: Create embeddings
```
The → [0.2, 0.8, 0.1]
cat → [0.9, 0.3, 0.7]
sat → [0.4, 0.6, 0.5]
```
Step 2: Project to Q, K, V (simplified: identity projections, so they equal the input)
```
Q = K = V = [[0.2, 0.8, 0.1],
             [0.9, 0.3, 0.7],
             [0.4, 0.6, 0.5]]
```
Step 3: Compute attention scores (Q × K^T)
For query "sat" (row 3):
```
score(sat, The) = sat · The = 0.4×0.2 + 0.6×0.8 + 0.5×0.1 = 0.61
score(sat, cat) = sat · cat = 0.4×0.9 + 0.6×0.3 + 0.5×0.7 = 0.89
score(sat, sat) = sat · sat = 0.4×0.4 + 0.6×0.6 + 0.5×0.5 = 0.77
```
"sat" is most similar to "cat" (highest score)!
Step 4: Apply softmax (convert to probabilities)
```
weights = softmax([0.61, 0.89, 0.77]) ≈ [0.29, 0.38, 0.34]
```
Step 5: Weighted sum of values
```
output_sat = 0.29 × V_The + 0.38 × V_cat + 0.34 × V_sat
           = 0.29 × [0.2, 0.8, 0.1] + 0.38 × [0.9, 0.3, 0.7] + 0.34 × [0.4, 0.6, 0.5]
           ≈ [0.53, 0.54, 0.46]   (using unrounded weights)
```
The output for "sat" now carries information from all tokens, weighted by relevance!
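The whole walkthrough fits in a short script; running it reproduces each step's numbers (shown rounded to two decimals):

```python
import math

# Step 1–2: embeddings double as Q, K, V (identity projections).
emb = {"The": [0.2, 0.8, 0.1], "cat": [0.9, 0.3, 0.7], "sat": [0.4, 0.6, 0.5]}
tokens = ["The", "cat", "sat"]

# Step 3: dot-product scores for the query "sat".
scores = [sum(a * b for a, b in zip(emb["sat"], emb[t])) for t in tokens]
print([round(s, 2) for s in scores])   # → [0.61, 0.89, 0.77]

# Step 4: softmax turns scores into a probability distribution.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
print([round(w, 2) for w in weights])  # → [0.29, 0.38, 0.34]

# Step 5: weighted sum of the value vectors.
out = [sum(w * emb[t][d] for w, t in zip(weights, tokens)) for d in range(3)]
print([round(x, 2) for x in out])      # → [0.53, 0.54, 0.46]
```

Repeating this with the queries for "The" and "cat" fills in the other two rows of the full attention matrix.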
1.8 Why Attention Weights Sum to 1
The Softmax Normalization
Attention weights are computed using softmax:
```
attention_weight_i = exp(score_i) / Σ_j exp(score_j)
```
This guarantees:
- All weights are positive (exp is always > 0)
- Weights sum to 1 (by construction of softmax)
- Weights form a probability distribution
Why Normalization Matters
Without normalization:
- Weights could be arbitrarily large
- Output magnitude would depend on input magnitude
- Training would be unstable
With normalization:
- Output is a convex combination of values
- Magnitude is controlled
- Interpretable as "probability of attending to each position"
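These guarantees are easy to verify numerically. One implementation detail worth knowing (assumed here, not stated above): practical softmax code subtracts the maximum score before exponentiating to avoid overflow, which leaves the result mathematically unchanged.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

w = softmax([2.0, -1.0, 0.5])
print(all(x > 0 for x in w))   # → True  (all weights positive)
print(round(sum(w), 10))       # → 1.0   (weights sum to 1)
```

Because every weight is positive and they sum to 1, the attention output is always a convex combination of the values, which is exactly the "controlled magnitude" property described above.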
Summary
Key Intuitions
- Attention = Weighted Averaging: Each output is a weighted combination of inputs, where weights indicate relevance.
- Query-Key-Value:
- Query: What am I looking for?
- Key: What do I contain?
- Value: What can I provide?
- Self vs Cross:
- Self-attention: Q, K, V from same sequence
- Cross-attention: Q from one sequence, K/V from another
- Dynamic Routing: Attention weights are input-dependent, enabling context-sensitive processing.
- Interpretability: High attention weight from i to j means position i "focuses on" position j.
Mental Model
```
For each position i:
  1. Create a query: "What do I need?"
  2. Compare the query to all keys: "How relevant is each position?"
  3. Softmax: Convert relevance scores to probabilities
  4. Weighted sum: Gather information from all positions based on relevance
```
Exercises
Conceptual Questions
- In your own words, explain why attention is called a "soft lookup" mechanism.
- Given the sentence "The dog chased the cat and then it ran away", what tokens would you expect "it" to attend to? What would determine which interpretation is chosen?
- Why do we need separate Q, K, V projections instead of using the input directly for all three?
- Explain the difference between self-attention and cross-attention. Give an example of when each would be used.
Hands-on Drills
- Compute a 3-token attention by hand: pick small Q/K/V vectors, form QK^T, apply softmax, and produce the weighted sum.
- Change one query vector slightly and recompute weights—notice how routing shifts. This builds intuition for dynamic routing.
Thought Experiments
- If all attention weights were equal (uniform attention), what would the output represent? When might this be useful or problematic?
- Consider a very long document where the answer to a question is in the first paragraph, but the question is asked in the last paragraph. How does attention help solve this compared to an RNN?
- If we removed the softmax normalization and just used raw dot products, what problems might arise during training?
Next Section Preview
In the next section, we'll translate these intuitions into mathematics. We'll derive the scaled dot-product attention formula:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```
And understand exactly why each component is necessary.