Chapter 2

Intuition Behind Attention

Attention Mechanism From Scratch

Introduction

Before diving into formulas and code, let's build a solid intuition for what attention actually does. Understanding attention at a conceptual level will make the mathematics feel natural rather than arbitrary.

The attention mechanism is arguably the single most important innovation in modern deep learning. Once you truly understand it, you'll see it everywhere—and you'll understand why transformers work so well.

How to read this section

Pause after each analogy and try to map it to the Q/K/V computation you'll code next. If you can explain how each weight is produced and why the weights sum to 1, you're ready for the math.

1.1 Attention as Selective Focus

The Human Analogy

When you read this sentence, you don't process every word with equal importance. Your brain automatically:

  1. Focuses on key words that carry meaning
  2. Relates words to each other based on context
  3. Ignores or downweights less relevant information

Consider: "The cat that was sitting on the mat was hungry."

To understand what "hungry" refers to, your brain:

  • Connects "hungry" to "cat" (not "mat")
  • Uses grammatical structure (subject-verb agreement)
  • Ignores intervening words as less relevant

Attention mechanisms do exactly this—computationally.

The Cocktail Party Effect

Imagine being at a noisy party. You can:

  • Focus on one conversation while ignoring others
  • Suddenly shift attention when someone says your name
  • Dynamically adjust focus based on relevance

Neural attention works similarly:

  • Each position in a sequence can "focus" on different parts
  • Focus is determined by relevance (similarity)
  • Attention is dynamic—different inputs produce different patterns

1.2 Attention as Weighted Averaging

The Core Operation

At its heart, attention is just a weighted average:

📝text
output = Σ (weight_i × value_i)

Shapes (keep them in mind for code)

Values: [batch, seq, d_model]; Weights: [batch, seq, seq] after softmax; Output: [batch, seq, d_model] — same length, enriched by context.

Where:

  • value_i is the information at position i
  • weight_i is how much to focus on position i
  • Σ weight_i = 1 (weights form a probability distribution)
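This weighted average is easy to verify in NumPy (a minimal sketch; the vectors and weights are made up for illustration):

```python
import numpy as np

# Three value vectors, one per position (illustrative numbers)
values = np.array([[0.8, 0.2, 0.1],
                   [0.1, 0.9, 0.3],
                   [0.4, 0.6, 0.5]])

# Attention weights for one query position; they sum to 1
weights = np.array([0.7, 0.2, 0.1])

# output = Σ (weight_i × value_i)
output = weights @ values
print(output)          # a single context vector
print(weights.sum())   # 1.0
```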

A Simple Example

Suppose we want to compute a "context-aware" representation for the word "it" in:

"The cat sat on the mat. It was happy."

What does "it" refer to? Let's say we have embeddings:

🐍python
cat = [0.8, 0.2, 0.1]  # animal-like
mat = [0.1, 0.9, 0.3]  # object-like
it  = [0.7, 0.3, 0.2]  # pronoun, somewhat animal-like

If attention assigns weights:

  • cat: 0.7 (high attention - "it" likely refers to cat)
  • mat: 0.2 (low attention)
  • other words: 0.1

Then the context-enriched representation of "it":

🐍python
new_it = 0.7 × cat + 0.2 × mat + 0.1 × others
       = 0.7 × [0.8, 0.2, 0.1] + 0.2 × [0.1, 0.9, 0.3] + ...

Now "it" carries information about what it refers to!

Tiny numeric sanity check

If weights = [0.7, 0.2, 0.1] and values are 3D vectors, the weighted sum must stay within the convex hull of those vectors. If it doesn't, your softmax or masking is wrong.
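That check can be automated, reusing the illustrative vectors from the example above (the per-dimension bounding box is a necessary, not sufficient, condition for staying in the convex hull, but it catches most mistakes):

```python
import numpy as np

values = np.array([[0.8, 0.2, 0.1],   # cat
                   [0.1, 0.9, 0.3],   # mat
                   [0.4, 0.6, 0.5]])  # illustrative stand-in for "others"
weights = np.array([0.7, 0.2, 0.1])

new_it = weights @ values

# A convex combination can never leave the per-dimension min/max of the values
assert np.all(new_it >= values.min(axis=0) - 1e-9)
assert np.all(new_it <= values.max(axis=0) + 1e-9)
print(new_it)
```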

1.3 The Query-Key-Value Framework

The Database Analogy

Think of attention like searching a database:

| Concept | Database | Attention |
| --- | --- | --- |
| Query | Search term | "What am I looking for?" |
| Key | Index entries | "What does each item contain?" |
| Value | Actual data | "What information should I return?" |

Process:

  1. Compare Query to all Keys (compute similarity)
  2. Use similarities to create attention weights
  3. Return weighted combination of Values
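The three steps can be sketched as a "soft" lookup in a few lines (the keys, values, and query here are made-up numbers for illustration):

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft database lookup: similarity -> softmax weights -> blended values."""
    scores = keys @ query                            # 1. compare query to all keys
    weights = np.exp(scores) / np.exp(scores).sum()  # 2. similarities -> weights
    return weights @ values, weights                 # 3. weighted combination of values

keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query  = np.array([1.0, 0.0])

result, weights = soft_lookup(query, keys, values)
print(weights)  # highest weight lands on the best-matching key
print(result)
```

Unlike a hard database lookup, every entry contributes a little; the best match just dominates.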

Example: Finding a Book

Query: "machine learning Python"

Database entries:

| Key (Title/Tags) | Value (Content) |
| --- | --- |
| "ML algorithms Python" | Chapter on scikit-learn... |
| "Web development JavaScript" | React tutorial... |
| "Deep learning PyTorch" | Neural networks... |

Process:

  1. Match query to keys: ML→high, Web→low, Deep→medium
  2. Weights: [0.6, 0.05, 0.35]
  3. Return: Mostly scikit-learn chapter, some PyTorch

In Neural Networks

For each position in a sequence:

📝text
Query:  "What information do I need?"   (from current position)
Keys:   "What information do I have?"   (from all positions)
Values: "What information can I give?"  (from all positions)

The key insight: Q, K, V are all learned transformations of the same input!

🐍python
Q = input @ W_Q  # Learn what to look for
K = input @ W_K  # Learn what to advertise
V = input @ W_V  # Learn what to provide
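A runnable version of the snippet above, with random matrices standing in for trained weights (the sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 4

x = rng.normal(size=(seq_len, d_model))   # input embeddings

# Learned projection matrices (random stand-ins for trained weights)
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = x @ W_Q  # what each position looks for
K = x @ W_K  # what each position advertises
V = x @ W_V  # what each position provides

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```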

Avoid over-interpreting attention

Attention weights show where information flows, not strict causality or “explanations.” Different heads/layers can disagree; use attention patterns as clues, not proof.

1.4 Self-Attention vs Cross-Attention

Self-Attention

All of Q, K, V come from the same sequence.

Used in: Encoder self-attention, decoder self-attention

📝text
Input:  "The cat sat"
        ↓   ↓   ↓
       [e1, e2, e3]  (embeddings)
        ↓   ↓   ↓
Q:     [q1, q2, q3]  (queries from same input)
K:     [k1, k2, k3]  (keys from same input)
V:     [v1, v2, v3]  (values from same input)

Each token attends to all tokens in the same sequence (including itself).

Purpose: Learn relationships within a sequence

  • "cat" attends to "The" to understand it's a specific cat
  • "sat" attends to "cat" to understand who is sitting

Cross-Attention

Q comes from one sequence, K and V from another.

Used in: Decoder cross-attention (attending to encoder output)

📝text
Decoder input: "The"

Q:              [q1]  (query from decoder)

Encoder output: [e1, e2, e3, e4]  (from "Der Hund ist schwarz")
                 ↓   ↓   ↓   ↓
K:              [k1, k2, k3, k4]  (keys from encoder)
V:              [v1, v2, v3, v4]  (values from encoder)

Purpose: Let decoder "look at" the source sequence

  • When generating "The", attend to "Der" (German article)
  • When generating "dog", attend to "Hund" (German noun)

Shapes to remember

Self-attention mask: [batch, 1, seq, seq]; cross-attention mask: [batch, 1, 1, src_seq], so every target position sees the same (padding-masked) set of source tokens.
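A quick NumPy check that a cross-attention padding mask of that shape broadcasts against scores of shape [batch, heads, tgt_seq, src_seq] (all sizes here are arbitrary):

```python
import numpy as np

batch, heads, tgt, src = 2, 4, 5, 7

scores = np.zeros((batch, heads, tgt, src))

# Cross-attention padding mask: [batch, 1, 1, src] broadcasts over heads and tgt
cross_mask = np.ones((batch, 1, 1, src), dtype=bool)
cross_mask[:, :, :, -2:] = False   # pretend the last two source tokens are padding

# Masked positions get -inf so softmax assigns them zero weight
masked = np.where(cross_mask, scores, -np.inf)
print(masked.shape)  # (2, 4, 5, 7)
```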

1.5 What High Attention Weight Means

Interpreting Attention Patterns

When position i has high attention weight on position j:

📝text
Position i "pays attention to" / "looks at" / "focuses on" position j

Practical reading of a heatmap

In early layers, expect local/diagonal focus; in middle layers, syntactic patterns; in later layers, task-specific focus (e.g., answer tokens, entities). If everything is uniform, something is wrong with scaling or masking.

This means:

  • Token i needs information from token j
  • Token j is relevant for understanding token i
  • Output at position i will incorporate information from position j

Common Attention Patterns

1. Diagonal Pattern

📝text
   a b c d
a [■ · · ·]    Each token attends mostly to itself
b [· ■ · ·]    (common in lower layers)
c [· · ■ ·]
d [· · · ■]

2. Column Pattern

📝text
   a b c d
a [· ■ · ·]    All tokens attend to one important token
b [· ■ · ·]    (e.g., subject of sentence)
c [· ■ · ·]
d [· ■ · ·]

3. Syntactic Pattern

📝text
     The cat sat on the mat
sat [·   ■   ·   ·   ·   ■]   "sat" attends to "cat" (subject)
                              and "mat" (object)

4. Previous Token Pattern

📝text
   a b c d
a [■ · · ·]    Each token attends to itself and the previous token
b [■ ■ · ·]    (common in causal/decoder attention)
c [· ■ ■ ·]
d [· · ■ ■]
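The causal pattern comes from a lower-triangular mask applied before softmax; a minimal sketch:

```python
import numpy as np

seq = 4
# Lower-triangular causal mask: position i may attend to positions j <= i
causal = np.tril(np.ones((seq, seq), dtype=bool))
print(causal.astype(int))

scores = np.zeros((seq, seq))                 # dummy scores for illustration
masked = np.where(causal, scores, -np.inf)    # future positions get -inf
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)  # row i is uniform over the first i+1 positions
```

With real (non-zero) scores the rows would not be uniform, but the zeros above the diagonal would remain: softmax turns -inf into exactly zero weight.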

1.6 Attention as Dynamic Routing

Static vs Dynamic

Traditional neural networks: Fixed weights, same computation for every input

📝text
output = W × input  (W is fixed after training)

Attention: Weights depend on input, different computation for different inputs

📝text
weights = f(input)  (computed fresh for each input)
output = weights × values

Why This Matters

Consider processing two sentences:

  1. "The animal didn't cross the road because it was too tired."
  2. "The animal didn't cross the road because it was too wide."

In sentence 1, "it" = animal
In sentence 2, "it" = road

A static network would process "it" the same way in both cases.

Attention computes different weights:

  • Sentence 1: "it" attends strongly to "animal"
  • Sentence 2: "it" attends strongly to "road"

The routing of information adapts to the input.

Link to next section

The only trainable part here is the Q/K/V projections; the routing (softmax of QK^T) is computed on the fly per input. That’s why we normalize and scale (next section) instead of learning a static lookup.

1.7 Building Intuition: A Visual Walkthrough

Step-by-Step Example

Input: "The cat sat"

Step 1: Create embeddings

🐍python
The → [0.2, 0.8, 0.1]
cat → [0.9, 0.3, 0.7]
sat → [0.4, 0.6, 0.5]

Step 2: Project to Q, K, V (simplified, same as input here)

🐍python
Q = K = V = [[0.2, 0.8, 0.1],
             [0.9, 0.3, 0.7],
             [0.4, 0.6, 0.5]]

Step 3: Compute attention scores (Q × K^T)

For query "sat" (row 3):

🐍python
score(sat, The) = sat · The = 0.4×0.2 + 0.6×0.8 + 0.5×0.1 = 0.61
score(sat, cat) = sat · cat = 0.4×0.9 + 0.6×0.3 + 0.5×0.7 = 0.89
score(sat, sat) = sat · sat = 0.4×0.4 + 0.6×0.6 + 0.5×0.5 = 0.77

"sat" is most similar to "cat" (highest score)!

Step 4: Apply softmax (convert to probabilities)

🐍python
weights = softmax([0.61, 0.89, 0.77]) ≈ [0.29, 0.38, 0.34]

Step 5: Weighted sum of values

🐍python
output_sat = 0.29 × V_The + 0.38 × V_cat + 0.34 × V_sat
           = 0.29 × [0.2, 0.8, 0.1] + 0.38 × [0.9, 0.3, 0.7] + 0.34 × [0.4, 0.6, 0.5]
           ≈ [0.53, 0.54, 0.46]

The output for "sat" now carries information from all tokens, weighted by relevance!
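The five steps can be verified end to end in NumPy; any small differences from a hand calculation come from rounding the weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # max-subtraction for numerical stability
    return e / e.sum()

E = np.array([[0.2, 0.8, 0.1],   # The
              [0.9, 0.3, 0.7],   # cat
              [0.4, 0.6, 0.5]])  # sat

Q = K = V = E                    # simplified: projections are the identity

scores = Q @ K.T                 # Step 3: dot-product similarities
weights = softmax(scores[2])     # Step 4: row for the query "sat"
output_sat = weights @ V         # Step 5: weighted sum of values

print(scores[2].round(2))        # [0.61 0.89 0.77]
print(weights.round(2))
print(output_sat.round(2))
```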


1.8 Why Attention Weights Sum to 1

The Softmax Normalization

Attention weights are computed using softmax:

📝text
attention_weight_i = exp(score_i) / Σ exp(score_j)

This guarantees:

  1. All weights are positive (exp is always > 0)
  2. Weights sum to 1 (by construction of softmax)
  3. Weights form a probability distribution
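Both guarantees can be checked directly with a numerically stable softmax (the max-subtraction trick is standard implementation practice; it leaves the result unchanged):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max prevents overflow without changing the output
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

w = softmax(np.array([0.61, 0.89, 0.77]))
assert np.all(w > 0)              # 1. all weights positive
assert np.isclose(w.sum(), 1.0)   # 2. weights sum to 1
print(w)                          # 3. a valid probability distribution
```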

Why Normalization Matters

Without normalization:

  • Weights could be arbitrarily large
  • Output magnitude would depend on input magnitude
  • Training would be unstable

With normalization:

  • Output is a convex combination of values
  • Magnitude is controlled
  • Interpretable as "probability of attending to each position"

Summary

Key Intuitions

  1. Attention = Weighted Averaging: Each output is a weighted combination of inputs, where weights indicate relevance.
  2. Query-Key-Value:
    • Query: What am I looking for?
    • Key: What do I contain?
    • Value: What can I provide?
  3. Self vs Cross:
    • Self-attention: Q, K, V from same sequence
    • Cross-attention: Q from one sequence, K/V from another
  4. Dynamic Routing: Attention weights are input-dependent, enabling context-sensitive processing.
  5. Interpretability: High attention weight from i to j means position i "focuses on" position j.

Mental Model

📝text
For each position i:
    1. Create a query: "What do I need?"
    2. Compare query to all keys: "How relevant is each position?"
    3. Softmax: Convert relevance to probabilities
    4. Weighted sum: Gather information from all positions based on relevance
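The four steps map almost line for line onto a complete single-head attention function (unscaled, like the formulas in this section; scaling arrives in the next one):

```python
import numpy as np

def attention(Q, K, V):
    """Unscaled dot-product attention.
    Q, K, V: [seq, d] arrays. Returns [seq, d] outputs and [seq, seq] weights."""
    scores = Q @ K.T                                      # 2. query vs. all keys
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)           # 3. softmax per query
    return weights @ V, weights                           # 4. weighted sum

x = np.array([[0.2, 0.8, 0.1],
              [0.9, 0.3, 0.7],
              [0.4, 0.6, 0.5]])
out, w = attention(x, x, x)   # 1. each row of Q is one position's query
print(w.round(2))             # each row sums to 1
```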

Exercises

Conceptual Questions

  1. In your own words, explain why attention is called a "soft lookup" mechanism.
  2. Given the sentence "The dog chased the cat and then it ran away", what tokens would you expect "it" to attend to? What would determine which interpretation is chosen?
  3. Why do we need separate Q, K, V projections instead of using the input directly for all three?
  4. Explain the difference between self-attention and cross-attention. Give an example of when each would be used.

Hands-on Drills

  1. Compute a 3-token attention by hand: pick small Q/K/V vectors, form QK^T, apply softmax, and produce the weighted sum.
  2. Change one query vector slightly and recompute weights—notice how routing shifts. This builds intuition for dynamic routing.

Thought Experiments

  1. If all attention weights were equal (uniform attention), what would the output represent? When might this be useful or problematic?
  2. Consider a very long document where the answer to a question is in the first paragraph, but the question is asked in the last paragraph. How does attention help solve this compared to an RNN?
  3. If we removed the softmax normalization and just used raw dot products, what problems might arise during training?

Next Section Preview

In the next section, we'll translate these intuitions into mathematics. We'll derive the scaled dot-product attention formula:

📝text
Attention(Q, K, V) = softmax(QK^T / √d_k) × V

And understand exactly why each component is necessary.