The Real Problem: Generation Is Sequential
A transformer is famously parallel during training. Feed a sequence of length S, and every token attends to every other token in one big matrix multiply. The whole forward pass happens at once on the GPU.
At inference, the picture inverts. To answer a user, the model must produce one token, then read what it just produced, then produce the next, and so on. Generation is autoregressive: token depends on tokens . The model cannot leap forward, because it does not yet know what it is going to say.
Now ask the naive question: at step , what does attention actually need? For every past token , attention multiplies the new token's query against the past token's key , then weights the past token's value by that score. The keys and values and were already computed on earlier steps. They do not change. Yet a naive implementation, asked to process a fresh forward call for the new token, recomputes all of them from scratch.
The waste: a naive decoder doing S tokens of generation performs key/value projections, when would have been enough. At S = 32k, that is the difference between billions of redundant matmuls and none.
The fix is brutally simple in one sentence: store the K and V vectors of every past token in a buffer, and reuse them at every future step. That buffer is the KV cache. It buys back generation. It costs memory. And at the scale of frontier models, that memory cost becomes the dominant constraint on inference systems.
Why this matters: the entire Multi-Head Latent Attention architecture you will study in this chapter exists to compress this buffer. Before MLA can make sense, the bottleneck has to feel real. That is the job of this section.
The Intuition: A Notebook the Model Refuses to Reread
Picture a person taking an exam where each question depends on every previous answer. Two strategies:
- The naive student. Before answering question 100, they re-read questions 1 through 99 and re-solve them in their head, just to recompute the context they need. Question 1000 takes ten times longer than question 100.
- The student with a notebook. They write down their key and value for every past question as they go. Before question 100 they glance at the notebook. Question 1000 still takes one glance per past answer to attend over them, but they never re-solve.
The KV cache is the notebook. Two things are obvious from the analogy:
- The notebook saves enormous amounts of redundant work.
- The notebook grows with every question. By question 10,000 it is a small book.
For a 70B model with 80 layers and 64 heads of dimension 128, the "notebook" for a single 32k-token conversation is about 20 GB. Eight users in a batch turn that into 160 GB — already more memory than a single GPU has, before the model weights even load. That is the bottleneck.
The Mathematics of the Cache
Recall single-head causal self-attention. With input embedding for token , we form , , , where each is a vector in . The output for the new token is the familiar
.
Read it slowly. The sum is over past indices . The pairs for were produced on earlier steps and the model has no reason to recompute them. Caching them is not a hack — the math literally asks for them.
With multiple heads () and multiple layers (), every token writes vectors of dimension into the cache: K for each (layer, head) and V for each (layer, head). For a whole batch of sequences of length , the cache stores
scalars in total. Where:
- — one tensor for K, one for V.
- — number of transformer layers. Each layer has its own attention, so each has its own cache.
- — number of attention heads per layer.
- — dimension of each head, typically .
- — sequence length (prompt + generated so far).
- — number of concurrent sequences in the batch.
The Memory Equation
Multiply by the byte size of one scalar ( bytes — 4 for fp32, 2 for fp16/bf16, 1 for fp8) to get bytes:
Every factor matters. Make any of them bigger and the cache grows linearly with it. Make all of them bigger — which is exactly what the trend in frontier models does — and the cache explodes.
Manual Numerical Walkthrough
Pick numbers up before you trust the formula. Use a Llama-2 7B style configuration and one long-context conversation, then scale to a real serving batch.
Interactive: KV Cache Memory Explorer
Drive the equation yourself. Move the sliders and watch where each axis bites: layers and heads are the multiplicative constants, sequence length is the linear ramp, batch size and precision are the brutal multipliers that determine whether you fit on one GPU.
Three observations worth pausing on:
- fp16 → fp8 halves the cache. This is why quantising the KV cache is one of the highest-leverage inference optimisations. Cache memory is the bottleneck; weights have already been compressed.
- Doubling batch size doubles the cache. Throughput-vs-latency tradeoffs in production serving are largely cache-memory tradeoffs.
- Long context is the dominant axis. 4k → 32k means 8× the cache. 32k → 128k means another 4×. The march toward million-token contexts is a direct collision course with this equation.
Plain Python Implementation
Before PyTorch, write the cache in raw numpy so the mechanism is naked. The naive version is deliberately ugly — that ugliness is the cost the cached version eliminates.
Run both on the same input and the outputs match to machine precision. The difference is entirely in the FLOPs spent to get there: the naive version does projections, the cached version does . For S = 1024 that is roughly a 500× speedup on the projections, before even counting the attention matmul itself.
PyTorch Implementation
Real systems do this inside an with shape tensors. The cache lives across forward calls and gets concatenated along the sequence axis. This is exactly how Hugging Face Transformers, vLLM, and TensorRT-LLM all structure it (give or take a memory layout trick).
A single-step generation loop then looks like this in spirit:
.
The cache object grows by one slice along axis 2 each step. Across all layers, that growth is exactly the memory we have been computing.
Practical detail: production systems pre-allocate the cache to its maximum length and write into a slot, rather than concatenating. vLLM's PagedAttention goes further: it stores the cache in fixed-size pages, the same way an operating system pages virtual memory, so that GPU memory is not fragmented as sequences finish at different times. The algebra is identical; only the allocator changes.
Why It Becomes Catastrophic at Scale
Five things go wrong together as models and contexts grow.
| Pressure | Direction | Consequence |
|---|---|---|
| More layers L | deeper models | Linear blow-up of cache; 80 layers ≈ 2.5× the cache of a 32-layer model at the same width. |
| More heads n_h | more parallel attention paths | Linear blow-up; each head writes its own K, V. |
| Larger context S | 32k → 128k → 1M | Cache is linear in S, but S is also the axis users push hardest. The headline driver. |
| Higher batch B | throughput / cost | Cache is linear in B. Doubling concurrent users doubles cache. Throughput stalls when cache hits the wall. |
| Lower precision b | fp16 → fp8 → int4 | The only factor that goes down. Quantising the cache is the high-leverage lever — but quality must not regress. |
The model parameters themselves do not grow with sequence length. The activations during a single forward pass also do not, beyond an O(S) factor. The KV cache is the one quantity that is multiplied by every architectural choice you have made and by the user's usage pattern at the same time. That is why it dominates.
Reality check from production. When a serving cluster falls over, it is almost always because the KV cache exceeded the GPU's HBM and the scheduler had to start evicting sessions. The model weights are static; activations are transient; the cache is the one resource that grows in real time as users converse.
Engineering Reality: What Production Systems Do
There are four families of solutions, listed in roughly the order they were invented. The rest of this chapter is the story of the fourth — DeepSeek's MLA — but every team is using some combination of these:
- Shrink B or S (give up). Smaller batches, shorter contexts. This works but throws away product capability. Used as a last resort.
- Share K and V across heads — GQA / MQA. The next section. Group several query heads to share a single K, V pair. Cuts the cache by the grouping factor, but caps how aggressively you can compress.
- Quantise the cache. fp16 → fp8 → int4. Halves or quarters the cache for essentially free, as long as the model tolerates the precision loss. Now standard.
- Compress what is stored — MLA. Project K and V into a small shared latent space, cache only that latent, and reconstruct K and V on the fly. The remaining sections of this chapter derive this from scratch and show why DeepSeek V3 can hold a 128k context with a cache that fits where a 70B's 8k cache used to live.
Adjacent to the cache itself, two systems-level techniques squeeze more out of whatever cache you have:
- PagedAttention (vLLM). Manages the cache as pages, eliminating internal fragmentation so you can pack more concurrent sequences into the same memory.
- Prefix sharing. When many users share the same system prompt, store one copy of its KV cache and reuse it for all of them. Saves billions of bytes in chatbot workloads.
Summary and Bridge to MLA
Three things to carry into the rest of the chapter:
- The KV cache is a direct mathematical consequence of causal attention. It is not a hack — the attention sum literally calls for the past tokens' K and V vectors, and storing them once is the only sane option.
- Its size is . Every factor is something the architecture or the workload pushes upward. There is no axis along which the cache shrinks on its own.
- At frontier scale (70B+ models, 32k+ context, real batches) the cache is the dominant memory cost of inference — bigger than the weights, bigger than the activations, bigger than anything else on the GPU.
The next section examines the first serious architectural fix — Grouped-Query Attention — which shares K and V across groups of heads and was adopted by Llama 2, Mistral, and most open models. Then we will see why GQA, while a big win, is not enough; and finally we will derive Multi-Head Latent Attention, the technique that lets DeepSeek V3 carry massive contexts at batch sizes its competitors cannot match.