Book · Intermediate · 20+ hours

The Attention Atlas: Mechanisms That Power Modern AI

A Visual, Mathematical, and Practical Guide

Master all 15 attention mechanisms using the same example sentence throughout. From Scaled Dot-Product to Flash Attention, MLA, and Differential Attention — with step-by-step math, matrices, and Python code for each.

17 chapters · 17 sections · 7 h reading · 7 parts

Part I · 2 chapters · 2 sections

Foundation

Shared example and core attention.

The Origin of Attention

From biological neurons to the transformer — how evolution's solution to resource allocation became the most important mechanism in modern AI

1 section · 35 min read
  1. The Origin of Attention · 35 min

Scaled Dot-Product Attention

Vaswani et al. 2017 — the foundational attention mechanism

1 section · 25 min read
  1. Scaled Dot-Product Attention · 25 min

Part II · 3 chapters · 3 sections

Multi-Head Variants

MHA, Causal, Cross-Attention.

Multi-Head Attention

Vaswani et al. 2017 — H independent attention heads in parallel

1 section · 14 min read
  1. Multi-Head Attention · 14 min

Causal (Masked) Self-Attention

Radford et al. 2018 — autoregressive masking for language generation

1 section · 40 min read
  1. Causal (Masked) Self-Attention · 40 min

Cross-Attention

Vaswani et al. 2017 — the bridge between encoder and decoder

1 section · 10 min read
  1. Cross-Attention · 10 min

Part III · 2 chapters · 2 sections

KV-Cache Optimization

MQA, GQA for efficient inference.

Multi-Query Attention (MQA)

Shazeer 2019 — shared K/V across all heads for fast inference

1 section · 12 min read
  1. Multi-Query Attention (MQA) · 12 min

Grouped-Query Attention (GQA)

Ainslie et al. 2023 — the sweet spot between MHA and MQA

1 section · 30 min read
  1. Grouped-Query Attention (GQA) · 30 min

Part IV · 3 chapters · 3 sections

Positional Encoding

Relative bias, RoPE, ALiBi.

Relative Position Bias Attention

Shaw et al. 2018 / T5 2020 — distance-aware attention scoring

1 section · 35 min read
  1. Relative Position Bias Attention · 35 min

RoPE — Rotary Position Embedding

Su et al. 2021 — position via rotation of Q and K vectors

1 section · 14 min read
  1. RoPE — Rotary Position Embedding · 14 min

ALiBi — Attention with Linear Biases

Press et al. 2022 — linear distance penalty, no positional embeddings

1 section · 12 min read
  1. ALiBi — Attention with Linear Biases · 12 min

Part V · 3 chapters · 3 sections

Efficient Attention

Linear, Sliding Window, Sparse.

Linear Attention

Katharopoulos et al. 2020 — O(Nd²) instead of O(N²d)

1 section · 14 min read
  1. Linear Attention · 14 min

Sliding Window Attention

Beltagy et al. 2020 — local window for O(N) complexity

1 section · 12 min read
  1. Sliding Window Attention · 12 min

Sparse Attention — BigBird

Zaheer et al. 2020 — local + global + random for long documents

1 section · 14 min read
  1. Sparse Attention — BigBird · 14 min

Part VI · 3 chapters · 3 sections

Modern Innovations

Flash, Differential, MLA.

Flash Attention

Dao et al. 2022 — IO-aware tiling, same math, 3-8x faster

1 section · 45 min read
  1. Flash Attention · 45 min

Differential Attention

Microsoft Research 2024 — noise cancellation via dual softmax

1 section · 12 min read
  1. Differential Attention · 12 min

Multi-Head Latent Attention (MLA)

DeepSeek-V2 2024 — compressed KV-cache via learned bottleneck

1 section · 35 min read
  1. Multi-Head Latent Attention (MLA) · 35 min

Part VII · 1 chapter · 1 section

Comparison

All 15 mechanisms side-by-side.

Final Comparison

All 15 mechanisms side-by-side — choosing the right one

1 section · 45 min read
  1. All 15 Mechanisms Compared · 45 min

The capstone

Where the book lands in practice.

Chapter 13 · 1 section

Flash Attention

Dao et al. 2022 — IO-aware tiling, same math, 3-8x faster

Chapter 14 · 1 section

Differential Attention

Microsoft Research 2024 — noise cancellation via dual softmax

Chapter 15 · 1 section

Multi-Head Latent Attention (MLA)

DeepSeek-V2 2024 — compressed KV-cache via learned bottleneck

17 sections. Begin with one.

Chapter 0 — The Origin of Attention — is where every reader starts.