Learning Objectives
By the end of this chapter, you will:
- Understand why sliding window attention alone fails for tasks requiring long-range dependencies, and how BigBird solves this with a three-pillar sparse attention pattern.
- Learn the graph-theoretic result that makes BigBird a universal approximator equivalent to full attention, yet with complexity instead of .
- Master the mathematical formulation of the BigBird sparse mask, including local window, global tokens, and random edges.
- Compute a complete worked example by hand using “The cat sat on mat”, verifying every matrix value.
- Implement BigBird sparse attention from scratch in both NumPy and PyTorch, ready to plug into real transformer architectures.
Why This Matters: BigBird was the first attention variant to prove a theoretical equivalence to full attention while operating in linear time. It powers Google's long-document NLP systems and is the direct ancestor of LongT5, Pegasus-X, and the sparse attention patterns in modern models like Gemini. Understanding BigBird teaches you how to think about attention as a graph problem — a perspective that unlocks all subsequent efficient attention research.
The Real Problem: When Local Windows Are Not Enough
In Chapter 11, we saw that Sliding Window Attention reduces complexity from to by restricting each token to its nearest neighbors. This works beautifully for local syntax — a verb is usually near its subject, an adjective near its noun. But real language is not purely local.
Consider a 10,000-token legal document. The opening paragraph states: “The Seller agrees to indemnify the Buyer.” Nine thousand tokens later, a clause reads: “In breach of the foregoing obligation, penalties shall apply.” The word “foregoing” at position 9,500 must attend to “indemnify” at position 42. With a window of , information would need to propagate through layers before these tokens can interact — by which point the signal has decayed beyond recovery.
Why Sliding Window Fails
Think of the attention mask as a graph where tokens are nodes and allowed attention connections are edges. In sliding window attention with , the graph looks like a chain:
The ↔ cat ↔ sat ↔ on ↔ mat
The diameter of this graph (the longest shortest path between any two nodes) is . For our 5-token example, that means “The” and “mat” need 4 hops (layers) to exchange information. For tokens with , the diameter is 4,095 — requiring thousands of transformer layers for end-to-end communication.
| Attention Pattern | Graph Diameter | Hops for N=4096 |
|---|---|---|
| Full attention | 1 | 1 |
| Sliding window (W=1) | N − 1 | 4,095 |
| Sliding window (W=256) | ⌈(N−1)/W⌉ | 16 |
| BigBird (local + global) | 2 | 2 |
The Graph-Theory Insight
Zaheer et al. (NeurIPS 2020) made a crucial observation: if you add even a single global token that connects to every other token, the graph diameter drops to 2, regardless of sequence length. Any token can reach any other token in at most 2 hops: go to the global token, then from the global token to the target.
This is the same principle behind hub-and-spoke networks in airline routing. You don't need direct flights between every pair of cities. A few major hubs (Chicago, Atlanta, Dallas) let you reach any destination in at most 2 flights.
The Analogy: Full attention is like having a direct flight between every pair of 4,096 cities — routes. BigBird is like keeping local bus routes (window), adding a few airport hubs (global tokens), and adding occasional charter flights (random edges) — only routes but you can still reach anywhere in 2 hops.
BigBird's Three-Pillar Solution
BigBird constructs its sparse attention mask by combining three independent patterns, each serving a distinct purpose:
Pillar 1: Local Window Attention
Identical to Chapter 11's sliding window. Token attends to all tokens where . This captures the syntactic dependencies that are overwhelmingly local: subject-verb agreement, adjective-noun modification, preposition attachment.
In our example with , “sat” (position 2) attends to “cat” (1), “sat” (2), and “on” (3). This captures the core phrase structure “cat sat on”.
Pillar 2: Global Token Attention
A small set of designated tokens (typically the [CLS] token or the first few tokens) are made “global”: they attend to every token, and every token attends to them. These global tokens serve as information aggregators — they absorb information from the entire sequence and make it available to all other tokens.
Formally, if is a global token index:
- Row is all active: token attends to every position.
- Column is all active: every token attends to position .
In our example, “The” (index 0) is the global token. This means “mat” (index 4) can access information about “The” directly — even though they are 4 positions apart, well beyond the window.
Pillar 3: Random Attention
Each token additionally attends to randomly selected positions that are not already covered by local or global patterns. This is motivated by the Erd\u0151s–R\u00e9nyi random graph theory: a random graph with random edges per node is connected with high probability.
Random edges serve as “shortcuts” through the graph, reducing the effective diameter even further and ensuring that information can flow between any two tokens through unexpected paths. Even with just random edges per token, the attention graph becomes a small-world network where multi-layer propagation reaches everywhere efficiently.
The Theoretical Guarantee
Theorem (Zaheer et al., 2020): The BigBird sparse attention pattern with local window, global tokens, and random connections is a universal approximator of sequence-to-sequence functions. Any function computable by a full-attention transformer can also be computed by a BigBird transformer — with the same depth and width, using only attention operations per layer instead of .
This is a remarkable theoretical result. It means that the quadratic cost of full attention is not an inherent requirement of the transformer architecture — it is an artifact of the dense attention pattern. The expressivity of transformers comes from the multi-layer composition of attention and feedforward operations, not from having every token attend to every other token in a single layer.
Mathematical Formulation
Mask Construction
The BigBird attention mask is defined as the union of three binary masks:
where:
- Local mask:
- Global mask: , where is the set of global token indices
- Random mask: For each row , select positions uniformly at random from the inactive set
The Sparse Attention Equation
Given the combined mask , the BigBird attention output is:
— raw scaled dot-product score
— masked score
— attention weight (softmax over active positions only)
— weighted sum of value vectors
The key insight: setting blocked scores to before softmax causes , which means blocked positions contribute zero weight. The softmax denominator only sums over active positions, so the remaining weights still sum to 1.
Complexity Analysis
| Component | Connections per token | Total connections | Asymptotic |
|---|---|---|---|
| Local window | 2W + 1 | N \times (2W + 1) | O(NW) |
| Global tokens | |\mathcal{G}| | 2 \times N \times |\mathcal{G}| | O(NG) |
| Random edges | R | N \times R | O(NR) |
| Total BigBird | \leq 2W + 1 + G + R | \leq N(2W + 1 + 2G + R) | O(N) |
| Full attention | N | N^2 | O(N^2) |
For fixed , , and , the total number of active connections is — linear in sequence length. In practice, BigBird typically uses or , global tokens, and random edges. For :
- Full attention: connections
- BigBird (): connections — an 87% reduction
Interactive: BigBird Mask Explorer
Experiment with BigBird's three attention pillars below. Adjust the window size, change the global token, and toggle random edges to see how the sparse mask, attention weights, and output change in real time.
Step-by-Step Calculation
Let us trace the complete BigBird sparse attention computation on our shared example “The cat sat on mat” with , global token = “The” (index 0), and (no random edges).
Step 1: Compute Raw Scores
First, compute all 25 pairwise scores exactly as in full attention. Each score measures similarity between a query token and a key token:
| 0.0000 | 1.0000 | 0.5000 | 0.5000 | 0.7500 | |
| 1.5000 | 0.0000 | 1.0000 | 0.5000 | 0.2500 | |
| 0.5000 | 1.0000 | 1.0000 | 0.5000 | 0.7500 | |
| 0.5000 | 0.5000 | 0.0000 | 1.0000 | 0.5000 | |
| 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.7500 |
For example,
Step 2: Build the BigBird Mask
For each position pair , check if ANY condition holds:
| Query ↓ / Key → | The (0) | cat (1) | sat (2) | on (3) | mat (4) | Active |
|---|---|---|---|---|---|---|
| The (global) | G+L | G+L | G | G | G | 5/5 |
| cat | G+L | L | L | ✗ | ✗ | 3/5 |
| sat | G | L | L | L | ✗ | 4/5 |
| on | G | ✗ | L | L | L | 4/5 |
| mat | G | ✗ | ✗ | L | L | 3/5 |
G = global (row 0 or column 0), L = local (), × = blocked. Total: 19/25 connections active (76%), versus 25/25 for full attention.
Step 3: Apply Mask to Scores
Blocked positions are set to :
| The | cat | sat | on | mat | |
|---|---|---|---|---|---|
| The | 0.0000 | 1.0000 | 0.5000 | 0.5000 | 0.7500 |
| cat | 1.5000 | 0.0000 | 1.0000 | −∞ | −∞ |
| sat | 0.5000 | 1.0000 | 1.0000 | 0.5000 | −∞ |
| on | 0.5000 | −∞ | 0.0000 | 1.0000 | 0.5000 |
| mat | 0.5000 | −∞ | −∞ | 0.5000 | 0.7500 |
Step 4: Row-Wise Softmax
Now we apply softmax to each row independently. The positions produce , so only active positions contribute. Let us trace the most interesting row: “mat” (row 4).
Row 4 (“mat”) — Why this row is interesting: Without the global token, “mat” would only see “on” and itself (window ). With the global hub “The”, it gains access to a document-level summary, letting it borrow context from the very start of the sentence.
Active positions for “mat”: The (j=0, global), on (j=3, local), mat (j=4, local)
Active scores:
Subtract row max ():
Exponentiate: , sum =
Normalize:
Notice that “mat” distributes its attention across three sources: 30.45% to the global hub “The”, 30.45% to its neighbor “on”, and 39.10% to itself. The global connection lets “mat” access global context that would be invisible with only a local window.
Step 5: Compute Output
Full Attention Weight and Output Matrices
Attention Weights — BigBird Sparse ()
| The | cat | sat | on | mat | |
|---|---|---|---|---|---|
| The | 0.1095 | 0.2976 | 0.1805 | 0.1805 | 0.2318 |
| cat | 0.5465 | 0.1220 | 0.3315 | 0.0000 | 0.0000 |
| sat | 0.1888 | 0.3112 | 0.3112 | 0.1888 | 0.0000 |
| on | 0.2350 | 0.0000 | 0.1425 | 0.3875 | 0.2350 |
| mat | 0.3045 | 0.0000 | 0.0000 | 0.3045 | 0.3910 |
Key observations: Row 0 (“The”, global) has nonzero weights for all 5 tokens. Rows 1 and 4 (“cat” and “mat”) have only 3 nonzero weights each. The zeros correspond to blocked positions.
Output Matrix — Sparse Attention ()
| dim-0 | dim-1 | dim-2 | dim-3 | |
|---|---|---|---|---|
| The | 0.2254 | 0.4135 | 0.2964 | 0.2964 |
| cat | 0.5465 | 0.1220 | 0.3315 | 0.0000 |
| sat | 0.1888 | 0.3112 | 0.3112 | 0.1888 |
| on | 0.3525 | 0.1175 | 0.2600 | 0.5050 |
| mat | 0.5000 | 0.1955 | 0.1955 | 0.5000 |
Comparison: Sparse vs Full Attention
How much does the sparse mask change the output compared to full quadratic attention?
| Token | BigBird Output | Full Attention Output | L2 Error |
|---|---|---|---|
| The | [0.2254, 0.4135, 0.2964, 0.2964] | [0.2254, 0.4135, 0.2964, 0.2964] | 0.0000 |
| cat | [0.5465, 0.1220, 0.3315, 0.0000] | [0.4602, 0.1475, 0.3018, 0.2058] | 0.2363 |
| sat | [0.1888, 0.3112, 0.3112, 0.1888] | [0.2495, 0.3481, 0.3481, 0.2495] | 0.0859 |
| on | [0.3525, 0.1175, 0.2600, 0.5050] | [0.2854, 0.2854, 0.2106, 0.4089] | 0.2003 |
| mat | [0.5000, 0.1955, 0.1955, 0.5000] | [0.3108, 0.3108, 0.3108, 0.3108] | 0.2678 |
“The” has zero error because it is the global token and sees all 5 positions in both sparse and full attention — identical weights, identical output. Other tokens show small L2 errors because the sparse mask redistributes their attention weights among fewer positions. In practice, with learned Q/K/V projections and multiple layers, the model compensates for this redistribution during training.
Applications Across Domains
| Domain | Long Sequence Problem | BigBird Application | Result |
|---|---|---|---|
| Genomics | DNA sequences of 32K+ bases | Predicting promoter-gene interactions across distant loci | SOTA on chromatin profiling benchmarks |
| Legal NLP | Contracts with 10K+ tokens | Cross-referencing clauses separated by thousands of words | Better than full attention with 87% less memory |
| Scientific Papers | Full-text papers (8K+ tokens) | Abstract-to-conclusion citation analysis | Powers Google’s LongT5 and Pegasus-X |
| Code Understanding | Large codebases (16K+ tokens) | Import statements referencing distant function definitions | Used in Code Llama’s long-context mode |
| Medical Records | Patient histories over years | Connecting diagnosis notes to treatment outcomes months apart | Improved clinical prediction accuracy |
Connection to Modern Systems
| System | BigBird’s Influence | Key Difference |
|---|---|---|
| Longformer (Beltagy 2020) | Same local + global pattern, independent parallel work | Longformer uses dilated sliding window; BigBird adds random edges |
| LongT5 (Guo et al. 2022) | Direct descendant of BigBird | Adds TGlobal (Transient Global) with learned global projections |
| Flash Attention (Dao 2022) | Orthogonal optimization: same math, faster IO | Flash Attention is about hardware tiling, not sparsity pattern |
| Mistral / Sliding Window | Adopted BigBird’s local window pattern | Mistral uses pure sliding window without global/random |
| Gemini (Google 2024) | Uses sparse attention patterns for long context | Combines multiple sparsity strategies including BigBird-like patterns |
Key distinction: BigBird defines which positions attend to each other (the sparsity pattern). Flash Attention optimizes how the computation is executed on GPU hardware. They are complementary: you can use Flash Attention to efficiently compute BigBird's sparse attention pattern.
Python Implementation
The full BigBird sparse attention implementation with the shared example. Click any line to see the execution state, variable values, and row-by-row softmax computation.
PyTorch Implementation
The equivalent PyTorch implementation using and . This version is GPU-ready and integrates with PyTorch's autograd for training.
Key Takeaways
- Three pillars: BigBird combines local window (syntax), global tokens (information hubs), and random edges (shortcuts) into one sparse mask.
- Linear complexity: Total connections are , which is for fixed . An 87% reduction at compared to full attention.
- Universal approximation: BigBird is provably as expressive as full attention — the quadratic cost was never necessary for the transformer's representational power.
- Graph diameter = 2: Global tokens reduce the shortest path between any two tokens to 2 hops, regardless of sequence length.
- Practical impact: Enables processing of 4K–16K+ token sequences (documents, genomes, codebases) where full attention would OOM.
- Complementary to Flash Attention: BigBird defines the pattern (which positions interact), Flash Attention optimizes the computation (how to compute it fast on GPUs). Use both together.
Exercises
- Window size exploration: Recompute the BigBird attention weights with (no local window, only global). How does the output for “mat” change? What happens to tokens that are not adjacent to the global token?
- Multiple global tokens: Make both “The” (index 0) and “mat” (index 4) global tokens. How many connections are active now? Compute the new attention weights for “sat”.
- Random edges: Add random edge per token. Suppose the random connections are: cat→mat, sat→mat, on→cat, mat→cat. Recompute the weights for “cat” now that it can see “mat”.
- Scaling analysis: For a sequence of length with , , , how many attention connections does BigBird use? What percentage of full attention is that?
- Implementation challenge: Modify the PyTorch implementation to support batched inputs with shape and multi-head attention with shape .
References
- Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems (NeurIPS) 33, pp. 17283–17297.
- Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
- Guo, M., Ainslie, J., Uthus, D., Ontanon, S., Ni, J., Sung, Y.-H., & Yang, Y. (2022). LongT5: Efficient Text-To-Text Transformer for Long Sequences. Findings of NAACL 2022.
- Erd\u0151s, P. & R\u00e9nyi, A. (1959). On Random Graphs I. Publicationes Mathematicae Debrecen, 6, 290–297.