Book · Intermediate · 20+ hours

The Attention Atlas: Mechanisms That Power Modern AI

A Visual, Mathematical, and Practical Guide

Master all 15 attention mechanisms using the same example sentence throughout. From Scaled Dot-Product to Flash Attention, MLA, and Differential Attention — with step-by-step math, matrices, and Python code for each.

17 chapters · 17 sections · 7 h reading · 7 parts

Part I · 2 chapters · 2 sections

Foundation: Shared example and core attention.

The Origin of Attention

From biological neurons to the transformer — how evolution's solution to resource allocation became the most important mechanism in modern AI

1 section · 35 min read

Scaled Dot-Product Attention

Vaswani et al. 2017 — the foundational attention mechanism

1 section · 25 min read

Part II · 3 chapters · 3 sections

Multi-Head Variants: MHA, Causal, Cross-Attention.

Multi-Head Attention

Vaswani et al. 2017 — H independent attention heads in parallel

1 section · 14 min read

Causal (Masked) Self-Attention

Radford et al. 2018 — autoregressive masking for language generation

1 section · 40 min read

Cross-Attention

Vaswani et al. 2017 — the bridge between encoder and decoder

1 section · 10 min read

Part III · 2 chapters · 2 sections

KV-Cache Optimization: MQA and GQA for efficient inference.

Multi-Query Attention (MQA)

Shazeer 2019 — shared K/V across all heads for fast inference

1 section · 12 min read

Grouped-Query Attention (GQA)

Ainslie et al. 2023 — the sweet spot between MHA and MQA

1 section · 30 min read

Part IV · 3 chapters · 3 sections

Positional Encoding: Relative bias, RoPE, ALiBi.

Relative Position Bias Attention

Shaw et al. 2018 / T5 2020 — distance-aware attention scoring

1 section · 35 min read

RoPE — Rotary Position Embedding

Su et al. 2021 — position via rotation of Q and K vectors

1 section · 14 min read

ALiBi — Attention with Linear Biases

Press et al. 2022 — linear distance penalty, no positional embeddings

1 section · 12 min read

Part V · 3 chapters · 3 sections

Efficient Attention: Linear, Sliding Window, Sparse.

Linear Attention

Katharopoulos et al. 2020 — O(Nd²) instead of O(N²d)

1 section · 14 min read

Sliding Window Attention

Beltagy et al. 2020 — local window for O(N) complexity

1 section · 12 min read

Sparse Attention — BigBird

Zaheer et al. 2020 — local + global + random for long documents

1 section · 14 min read

Part VI · 3 chapters · 3 sections

Modern Innovations: Flash, Differential, MLA.

Flash Attention

Dao et al. 2022 — IO-aware tiling, same math, 3-8x faster

1 section · 45 min read

Differential Attention

Microsoft Research 2024 — noise cancellation via dual softmax

1 section · 12 min read

Multi-Head Latent Attention (MLA)

DeepSeek-V2 2024 — compressed KV-cache via learned bottleneck

1 section · 35 min read

Part VII · 1 chapter · 1 section

Comparison: All 15 mechanisms side-by-side.

Final Comparison

All 15 mechanisms side-by-side — choosing the right one

1 section · 45 min read

01 · All 15 Mechanisms Compared · 45 min

The capstone

Where the book lands in practice.

Chapter 13 · 1 section

Flash Attention

Dao et al. 2022 — IO-aware tiling, same math, 3-8x faster

Chapter 14 · 1 section

Differential Attention

Microsoft Research 2024 — noise cancellation via dual softmax

Chapter 15 · 1 section

Multi-Head Latent Attention (MLA)

DeepSeek-V2 2024 — compressed KV-cache via learned bottleneck

17 sections. Begin with one.

Chapter 0 — The Origin of Attention — is where every reader starts.
