Press, Smith, & Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", ICLR 2022
Learning Objectives
By the end of this chapter, you will be able to:
Explain why learned and sinusoidal positional embeddings fail to extrapolate beyond training sequence lengths, and why this is a critical problem for deployment.
Describe the core insight of ALiBi: replace position embeddings with a simple linear distance penalty added directly to attention scores — requiring zero additional parameters.
Derive the ALiBi formula and explain the geometric slope schedule that allows different heads to attend at different distance scales simultaneously.
Compute ALiBi attention weights and outputs by hand for “The cat sat on mat” and compare them to the standard attention results from Chapter 1.
Implement a complete ALiBi class in both NumPy and PyTorch that you can run on any input.
Connect ALiBi to RoPE (Chapter 8), Flash Attention (Chapter 13), and modern inference systems, understanding the tradeoffs between rotation-based and bias-based position encoding.
Where this appears: ALiBi was introduced at Facebook AI Research and adopted by the BLOOM model (176B parameters, BigScience), MPT-7B/30B (MosaicML), and other production LLMs. It offers the simplest possible approach to positional encoding — no learned parameters, no trigonometric functions, just subtraction. Understanding ALiBi illuminates a deep question: how much positional information does attention truly need?
The Real Problem
Chapters 7 and 8 established two approaches to injecting position information into attention: relative position biases (Chapter 7) and rotary position embeddings (Chapter 8, RoPE). Both solve the problem of making attention aware of token order. But there is a deeper problem that neither the original sinusoidal embeddings nor learned position embeddings solve reliably: what happens when the model encounters sequences longer than those seen during training?
Position Embedding Failure Modes
Consider a language model trained on sequences of up to 2048 tokens. During training, the model learns (or is given) position representations for positions 0 through 2047. At inference time, a user provides a 4096-token document. What happens?
Learned position embeddings (GPT-2, BERT): The model has an embedding lookup table with exactly 2048 rows. Position 2049 simply does not exist. The model either crashes or truncates the input. There is no extrapolation at all.
Sinusoidal position embeddings (original Transformer): The sine/cosine functions can produce values at any position. But the model has never seen the specific activation patterns produced by positions beyond 2048. The learned attention weights are calibrated for a specific range of positional signals, and positions outside that range produce unpredictable behavior. Perplexity collapses.
RoPE (Chapter 8): Rotation-based encoding can generate rotations for any position. However, without techniques like NTK-aware scaling or YaRN, the effective frequency distribution at unseen positions differs from training. RoPE degrades more gracefully than learned embeddings but still requires extension methods for reliable extrapolation.
The Extrapolation Challenge
The root cause is architectural: all three methods above encode position into the token representations themselves. The model must then learn to interpret these positional signals during attention computation. When the positional signals at inference time differ from those seen during training, the learned interpretation breaks down.
Press, Smith, and Lewis (2022) asked a radical question: what if we bypass token representations entirely and inject position information directly into the attention scores? If the position signal is a simple, predictable function of distance — one that the model never needs to “learn to interpret” — then extrapolation becomes trivial. A token 3 positions away receives the same penalty regardless of whether position 3 appeared during training at index 100 or index 10,000.
The Key Insight
ALiBi does not encode position into token vectors at all. Instead, it subtracts a linear distance penalty directly from attention scores before softmax. Because the penalty is a fixed algebraic function of distance (not a learned representation), models trained at 1024 tokens achieve near-perfect perplexity at 2048 tokens and degrade gracefully beyond.
From Intuition to Mathematics
The Distance Penalty Insight
Imagine you are sitting in a library, and someone asks you a question (your query). You can look at every book on the shelf (the keys and values), but you have a natural preference: the closer a book is to you, the more likely it is to be relevant. A book one shelf away is easy to reach; a book across the room requires more effort. You can still access distant books, but you need a stronger reason (a higher content-based score) to bother.
ALiBi formalizes this intuition. Given the standard scaled dot-product score between query i and key j, ALiBi subtracts a penalty proportional to their distance:
ALiBi Score[i,j]=dkQ[i]⋅K[j]⊤−mh×∣i−j∣
The penalty mh×∣i−j∣ grows linearly with distance. Token j that is 1 position away from i receives a penalty of mh; 2 positions away receives 2mh; and so on. After this subtraction, the scores enter softmax normally. The effect is that distant tokens must have proportionally higher content scores to achieve the same attention weight as nearby tokens.
Head-Specific Slopes
The slope mh is different for each attention head, creating a multi-scale view of the sequence:
Steep slopes (large mh like 0.5): The penalty grows quickly with distance. These heads become “local experts” that focus strongly on nearby tokens — useful for syntax, morphology, and local coherence.
Gentle slopes (small mh like 0.0625): The penalty grows slowly. These heads can attend broadly across the sequence — useful for coreference, topic tracking, and long-range dependencies.
Together, the set of heads covers multiple distance scales simultaneously, much like wavelet analysis decomposes a signal into components at different frequencies. The model does not learn these slopes; they are fixed by a geometric schedule, eliminating any risk of position overfitting.
The Mathematical Definition
Symbol-by-Symbol Breakdown
For head h with slope mh, the ALiBi attention score between query position i and key position j is:
The attention weights are then obtained by applying softmax row-wise:
Ah[i,j]=∑k=0N−1exp(Sh[i,k])exp(Sh[i,j])
And the output is the weighted sum of values:
Outputh[i]=∑j=0N−1Ah[i,j]⋅Vh[j]
The Geometric Slope Schedule
Press et al. define the slopes using a geometric sequence:
mh=2h1for h=1,2,…,H
For H=8 heads, this gives:
Head
Slope
Penalty at distance 10
Character
1
1/2=0.5
−5.0
Ultra-local
2
1/4=0.25
−2.5
Local
3
1/8=0.125
−1.25
Medium-range
4
1/16=0.0625
−0.625
Broad
...
...
...
...
8
1/256≈0.0039
−0.039
Near-global
Head 1 penalizes a token 10 positions away by −5.0 — essentially killing that attention score. Head 8 penalizes the same distance by only −0.039 — barely noticeable. Together, the heads form a multi-resolution view of the sequence, from hyper-local to near-global.
Why Not Learn the Slopes?
Press et al. experimented with learnable slopes but found no benefit. The geometric schedule works well because it provides exponentially spaced coverage of distance scales. Learning the slopes risks overfitting to training-length distances, which would defeat the purpose of ALiBi (extrapolation). Fixed slopes guarantee that the position signal generalizes.
Interactive: ALiBi Bias Explorer
Use the controls below to explore how ALiBi reshapes attention. Adjust the slope to see how stronger penalties create more local attention patterns. Switch between heads to compare their different distance sensitivities. Click any row to see its attention distribution as a bar chart.
Loading bias visualizer...
Interactive: Extrapolation Comparison
The visualization below compares how different positional encoding methods degrade when the input sequence exceeds the training length. ALiBi degrades most gracefully because its bias is a deterministic function of distance, not a learned representation that must generalize.
Loading extrapolation demo...
Step-by-Step Calculation
We use the same shared example as every chapter: tokens = ["The", "cat", "sat", "on", "mat"], with dmodel=4, 2 heads, and dk=2. Head 1 uses slope m1=0.5 and Head 2 uses slope m2=0.25.
Step-by-Step for "The" (Row 0), Head 1 (m=0.5)
Step 1: Extract head slices. Head 1 uses columns 0–1:
Compare to standard attention (no ALiBi): [0.1237,0.2509,0.2509,0.1237,0.2509]. ALiBi dramatically shifts weight toward nearby tokens: "The" goes from 12.4% to 28.8%, while distant "on" drops from 12.4% to 6.4%.
Step-by-Step for "The" (Row 0), Head 2 (m=0.25)
Head 2 uses columns 2–3: Qh2[The]=[1.0,0.0]. The scaled scores are [0.000,0.707,0.000,0.707,0.354].
With the gentler slope of 0.25, the penalties are halved compared to Head 1:
Key token
Penalty (m=0.25)
ALiBi score
The
0.000
0.000
cat
-0.250
0.457
sat
-0.500
-0.500
on
-0.750
-0.043
mat
-1.000
-0.646
Result: Ah2[The]=[0.2142,0.3384,0.1299,0.2052,0.1122]. Compare to Head 1: "on" receives 20.5% in Head 2 vs 6.4% in Head 1. The gentler slope lets Head 2 maintain broader context while Head 1 focuses locally.
Scaled: [1.414,0.000,1.414,0.000,0.000]. ALiBi bias (distances from position 1): [−0.5,0,−0.5,−1.0,−1.5].
ALiBi[cat]=[0.914,0.000,0.914,−1.000,−1.500]
Weights: [0.3791,0.1520,0.3791,0.0559,0.0339]. "The" and "sat" (both distance 1) get identical penalties, so they tie at 37.9%. Distant "mat" (distance 3) drops to just 3.4%.
Full Attention Weights and Output
ALiBi vs Standard: Side-by-Side
Head 1 Weights (m=0.5):
The
cat
sat
on
mat
The
0.2879
0.3541
0.2148
0.0642
0.0790
cat
0.3791
0.1520
0.3791
0.0559
0.0339
sat
0.1003
0.1653
0.5527
0.0815
0.1003
on
0.0796
0.1312
0.2163
0.3566
0.2163
mat
0.0341
0.1140
0.1880
0.1528
0.5110
Head 2 Weights (m=0.25):
The
cat
sat
on
mat
The
0.2142
0.3384
0.1299
0.2052
0.1122
cat
0.3002
0.1901
0.1480
0.2338
0.1279
sat
0.1077
0.2806
0.1776
0.2806
0.1534
on
0.1106
0.1421
0.0899
0.4750
0.1824
mat
0.1545
0.0978
0.1256
0.3271
0.2949
Averaged Weights (Head 1 + Head 2) / 2:
The
cat
sat
on
mat
The
0.2510
0.3462
0.1724
0.1347
0.0956
cat
0.3397
0.1710
0.2636
0.1449
0.0809
sat
0.1040
0.2229
0.3652
0.1810
0.1268
on
0.0951
0.1366
0.1531
0.4158
0.1994
mat
0.0943
0.1059
0.1568
0.2400
0.4030
Final Output (5×4):
dim-0
dim-1
dim-2
dim-3
The
0.3274
0.3936
0.1861
0.2613
cat
0.3961
0.1689
0.2120
0.2977
sat
0.1504
0.2154
0.2544
0.3573
on
0.1877
0.2393
0.1811
0.5662
mat
0.2896
0.3695
0.2731
0.4746
Interpreting the Differences
The averaged weight matrix reveals ALiBi's defining characteristic: diagonal dominance. Every token gives itself or its immediate neighbor the highest weight. The diagonal values (self-attention) are: The=0.251, cat=0.171, sat=0.365, on=0.416, mat=0.403. Contrast with standard attention where "on" distributes weight uniformly (all 0.200 in Head 1) because its query vector is zero.
Even when a token has zero content-based preference (like "on" in Head 1), ALiBi still creates meaningful attention by position alone: self-attention (distance 0) receives the highest weight, and the weight decays with distance. Position becomes the tiebreaker when content scores are equal.
Key Observation
In standard attention, "on" (Head 1) has all scores = 0, producing perfectly uniform weights [0.20, 0.20, 0.20, 0.20, 0.20]. With ALiBi, those same zero scores become [-1.5, -1.0, -0.5, 0.0, -0.5] after bias, producing [0.08, 0.13, 0.22, 0.36, 0.22]. Even a content-agnostic token gains positional structure from ALiBi.
Applications Across Domains
Long-Document Language Modeling
The original motivation for ALiBi was enabling language models to process documents longer than their training context. BLOOM (176B, BigScience) and MPT-7B/30B (MosaicML) adopted ALiBi for this reason. A model trained on 2048-token sequences can be deployed on 4096+ token documents without fine-tuning. This is critical for summarization, document QA, and legal/medical text processing where documents routinely exceed training lengths.
Code Generation
Source code files are often much longer than natural language training samples. A model trained on 2048-token code snippets may need to process 10,000+ token files at inference. ALiBi's extrapolation property allows the model to maintain coherent attention across entire files, keeping track of function definitions, variable scope, and import relationships that span thousands of tokens.
Scientific Sequence Modeling
Protein sequences (up to 30,000+ amino acids), genomic sequences, and time series data all benefit from length extrapolation. A model trained on 1024-residue protein fragments can process full-length proteins without retraining. The linear distance penalty maps naturally to the biological intuition that nearby residues in a sequence interact more strongly.
Connection to Modern Systems
RoPE vs ALiBi
Property
RoPE (Chapter 8)
ALiBi (this chapter)
Where applied
Token embeddings (Q, K)
Attention scores
Position signal
2D rotation per dim pair
Linear distance penalty
Parameters
0 (frequencies are fixed)
0 (slopes are fixed)
Relative position
Built into dot product via rotation
Explicit distance subtraction
Extrapolation
Moderate (needs NTK/YaRN)
Strong (linear penalty generalizes)
Modern adoption
LLaMA, Mistral, Qwen, Gemma
BLOOM, MPT, Falcon
KV-cache impact
Q and K must be rotated
Bias can be precomputed or fused
In practice, RoPE has become more widely adopted (LLaMA 2/3, Mistral, Gemma) partly because extension methods like NTK-aware scaling and YaRN have largely solved its extrapolation weakness. ALiBi remains attractive for its simplicity and when extrapolation without any fine-tuning is paramount.
Flash Attention Compatibility
ALiBi is naturally compatible with Flash Attention (Chapter 13). The bias matrix can be computed on-the-fly within each tile of the tiled attention computation, without materializing the full N×N bias matrix in GPU HBM. Since the bias at position (i,j) is just −mh×∣i−j∣, it can be computed from the tile indices alone. Flash Attention 2 natively supports ALiBi-style additive biases.
KV-Cache and Inference
ALiBi does not modify Q, K, or V vectors, so it has zero impact on KV-cache size. Compare with RoPE, where rotated K vectors must be stored in the cache (though this has the same size as unrotated vectors). ALiBi's bias is applied after the Q·K dot product, meaning the cache stores only the original unmodified K and V vectors. The bias is recomputed at each decoding step from the position indices — a negligible cost.
ALiBi composes cleanly with MQA (Chapter 5) and GQA (Chapter 6). The bias is applied per-head to the attention scores regardless of whether keys and values are shared.
Complexity Analysis
Resource
Standard Attention
ALiBi
Difference
Added parameters
O(Nmax) or 0
0
Learned PE has parameters; ALiBi has zero
Score computation
O(N2⋅dk)
O(N2⋅dk+N2)
O(N2) bias addition (negligible)
Memory overhead
0 (standard scores only)
O(N2) per head (bias matrix)
Can be computed on-the-fly to avoid storing
Extrapolation
None (learned) or weak (sinusoidal)
Strong (linear penalty generalizes)
Primary advantage of ALiBi
The computational overhead of ALiBi is negligible: adding a scalar to each attention score costs O(N2) per head, which is dominated by the O(N2⋅dk) cost of the dot products themselves. In practice, the bias can be fused into the attention kernel with zero additional memory passes.
Python Implementation
The class below implements ALiBi from scratch. The critical method is apply_alibi_bias(), which subtracts the linear distance penalty from scaled scores. Compare this with the standard attention class from Chapter 1 — only one additional step is needed.
ALiBi Attention — NumPy Implementation
🐍alibi_attention.py
Explanation(54)
Code(163)
1import numpy as np
NumPy provides vectorized matrix operations. Qh @ Kh.T runs as optimized C code, not Python loops.
2import math
Python standard library. We use math.sqrt() to precompute the scaling factor √d_k.
4class ALiBiAttention
Wraps ALiBi in a reusable class. The key difference from standard attention: apply_alibi_bias() subtracts a linear distance penalty from scores before softmax. Zero additional learned parameters — slopes are computed by formula.
17def __init__(self, d_model, n_heads)
Constructor. Takes the full model dimension and number of heads. Computes per-head dimension d_k, scaling factor √d_k, and the geometric slope schedule.
EXECUTION STATE
⬇ input: d_model = 4
⬇ input: n_heads = 2
18self.d_model = d_model
Store full model dimension (4 in our example).
EXECUTION STATE
self.d_model = 4
19self.n_heads = n_heads
Store number of attention heads. Each head gets a different slope for its distance penalty.
EXECUTION STATE
self.n_heads = 2
20self.d_k = d_model // n_heads
Per-head dimension. Each head sees d_k=2 dimensions of Q, K, and V.
EXECUTION STATE
d_model // n_heads = 4 // 2 = 2
self.d_k = 2
21self.scale = math.sqrt(self.d_k)
Precompute √d_k once. Divides every dot product to control variance (same as Chapter 1).
EXECUTION STATE
math.sqrt(2) = 1.4142
self.scale = 1.4142
22self.slopes = self._compute_slopes(n_heads)
Compute the geometric slope schedule: m_h = 1/2^h. Head 1 gets slope=0.5 (local), Head 2 gets slope=0.25 (broader).
EXECUTION STATE
self.slopes = [0.5, 0.25]
24def _compute_slopes(self, n_heads) → list
Geometric slope schedule. Each head gets a different penalty strength. Lower-indexed heads have steeper slopes (attend more locally). Higher-indexed heads have gentler slopes (attend more broadly).
EXECUTION STATE
⬇ input: n_heads = 2
⬆ returns = [0.5, 0.25] — list of slopes
26return [1.0 / (2 ** (h + 1)) for h in range(n_heads)]
List comprehension. For h=0: 1/2^1 = 0.5. For h=1: 1/2^2 = 0.25. No learning required.
EXECUTION STATE
h=0 = 1.0 / 2^1 = 0.5
h=1 = 1.0 / 2^2 = 0.25
⬆ return = [0.5, 0.25]
28def _build_alibi_bias(self, N, slope) → np.ndarray
Build the N×N distance penalty matrix. Entry (i,j) = -slope × |i - j|. Tokens far apart receive larger penalties (more negative values), suppressing their attention scores before softmax.
The cat sat on mat
The 0.00 -0.50 -1.00 -1.50 -2.00
cat -0.50 0.00 -0.50 -1.00 -1.50
sat -1.00 -0.50 0.00 -0.50 -1.00
on -1.50 -1.00 -0.50 0.00 -0.50
mat -2.00 -1.50 -1.00 -0.50 0.00
⬆ return (slope=0.25) =
The cat sat on mat
The 0.00 -0.25 -0.50 -0.75 -1.00
cat -0.25 0.00 -0.25 -0.50 -0.75
sat -0.50 -0.25 0.00 -0.25 -0.50
on -0.75 -0.50 -0.25 0.00 -0.25
mat -1.00 -0.75 -0.50 -0.25 0.00
34def _softmax(self, x) → np.ndarray
Numerically stable softmax. Takes a score matrix (5×5) and returns probabilities per row. Each row sums to 1.0.
EXECUTION STATE
⬇ input: x = shape (5, 5) — ALiBi score matrix for one head
Divide each element by its row sum to normalize into probabilities. Each row sums to exactly 1.0.
EXECUTION STATE
axis=-1 = sum along last axis — sum each row independently
keepdims=True = sum returns shape (5,1) so broadcasting works
40def compute_scores(self, Qh, Kh)
Raw dot-product scores: Qh (5×2) @ Kh.T (2×5) → (5×5). Same as standard attention — ALiBi does not change this step.
EXECUTION STATE
⬇ input: Qh = shape (5, 2) — one head’s queries
⬇ input: Kh = shape (5, 2) — one head’s keys
⬆ returns = np.ndarray (5, 5) — raw scores
42return Qh @ Kh.T
Matrix multiply Qh (5×2) with Kh transposed (2×5). Entry (i,j) = dot product of query_i with key_j.
EXECUTION STATE
.T = transpose — Kh (5×2) becomes (2×5)
⬆ return (Head 1) =
The cat sat on mat
The 0.0 1.0 1.0 0.0 1.0
cat 2.0 0.0 2.0 0.0 0.0
sat 1.0 1.0 2.0 0.0 1.0
on 0.0 0.0 0.0 0.0 0.0
mat 0.0 1.0 1.0 0.0 1.0
44def scale_scores(self, scores) → np.ndarray
Divide every score by √d_k = √2 ≈ 1.4142 to prevent softmax saturation.
EXECUTION STATE
⬇ input: scores = shape (5, 5) — raw dot products
self.scale = 1.4142 (√2)
⬆ returns = np.ndarray (5, 5) — scores ÷ 1.4142
46return scores / self.scale
Elementwise division. Every score is divided by 1.4142.
EXECUTION STATE
⬆ return (Head 1) =
The cat sat on mat
The 0.0000 0.7071 0.7071 0.0000 0.7071
cat 1.4142 0.0000 1.4142 0.0000 0.0000
sat 0.7071 0.7071 1.4142 0.0000 0.7071
on 0.0000 0.0000 0.0000 0.0000 0.0000
mat 0.0000 0.7071 0.7071 0.0000 0.7071
48def apply_alibi_bias(self, scaled, slope)
THE KEY ALiBi OPERATION: add the linear distance penalty to the scaled scores. This is the ONLY difference from standard attention. Tokens far apart get large negative penalties, suppressing their weights after softmax.
The cat sat on mat
The 0.00 -0.50 -1.00 -1.50 -2.00
cat -0.50 0.00 -0.50 -1.00 -1.50
sat -1.00 -0.50 0.00 -0.50 -1.00
on -1.50 -1.00 -0.50 0.00 -0.50
mat -2.00 -1.50 -1.00 -0.50 0.00
52return scaled + bias
Elementwise addition. The bias is always ≤ 0, so every off-diagonal score is reduced. Nearby tokens lose little; distant tokens lose a lot.
EXECUTION STATE
⬆ return (Head 1) =
The cat sat on mat
The 0.000 0.207 -0.293 -1.500 -1.293
cat 0.914 0.000 0.914 -1.000 -1.500
sat -0.293 0.207 1.414 -0.500 -0.293
on -1.500 -1.000 -0.500 0.000 -0.500
mat -2.000 -0.793 -0.293 -0.500 0.707
54def compute_weights(self, alibi_scores)
Apply softmax row-wise to get attention probabilities. The distance penalties have already been applied, so nearby tokens naturally receive higher weights.
⬆ returns = np.ndarray (5, 5) — each row sums to 1.0
56return self._softmax(alibi_scores)
Calls _softmax. All 5 rows become probability distributions that favor nearby tokens.
EXECUTION STATE
⬆ return (Head 1) =
The cat sat on mat
The 0.2879 0.3541 0.2148 0.0642 0.0790
cat 0.3791 0.1520 0.3791 0.0559 0.0339
sat 0.1003 0.1653 0.5527 0.0815 0.1003
on 0.0796 0.1312 0.2163 0.3566 0.2163
mat 0.0341 0.1140 0.1880 0.1528 0.5110
58def compute_output(self, weights, Vh)
Weighted sum of value vectors. weights (5×5) @ Vh (5×2) → (5×2). Each output row blends all value vectors, weighted by ALiBi-adjusted attention.
THE ALIBI STEP: subtract distance penalty. This is the ONLY line that differs from standard attention. After this, nearby tokens have higher effective scores.
87weights = self.compute_weights(alibi_scores)
Softmax over ALiBi-adjusted scores → probabilities that favor nearby tokens.
88output = self.compute_output(weights, Vh)
Weighted sum of value vectors using ALiBi-adjusted weights.
90head_outputs.append(output)
Store this head’s (5×2) output.
91all_weights.append(weights)
Store this head’s (5×5) weight matrix.
93return np.hstack(head_outputs), all_weights
Concatenate all head outputs horizontally: [(5×2), (5×2)] → (5×4).
EXECUTION STATE
np.hstack() = horizontal stack — concatenates along columns
⬆ return: output (5×4) =
d0 d1 d2 d3
The 0.3274 0.3936 0.1861 0.2613
cat 0.3961 0.1689 0.2120 0.2977
sat 0.1504 0.2154 0.2544 0.3573
on 0.1877 0.2393 0.1811 0.5662
mat 0.2896 0.3695 0.2731 0.4746
115tokens = [...]
The 5 tokens used in every chapter of this book.
EXECUTION STATE
tokens = ['The', 'cat', 'sat', 'on', 'mat']
117Q = np.array([...])
Query matrix — same as Chapter 1. Each row encodes what that token looks for.
EXECUTION STATE
Q (5×4) =
d0 d1 d2 d3
The 1.0 0.0 1.0 0.0
cat 0.0 2.0 0.0 1.0
sat 1.0 1.0 1.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.0 1.0
125K = np.array([...])
Key matrix — same as Chapter 1. Each row encodes what that token advertises.
EXECUTION STATE
K (5×4) =
d0 d1 d2 d3
The 0.0 1.0 0.0 1.0
cat 1.0 0.0 1.0 0.0
sat 1.0 1.0 0.0 0.0
on 0.0 0.0 1.0 1.0
mat 1.0 0.0 0.5 0.5
133V = np.array([...])
Value matrix — same as Chapter 1. Each row is the content that token provides when attended to.
EXECUTION STATE
V (5×4) =
d0 d1 d2 d3
The 1.0 0.0 0.0 0.0
cat 0.0 1.0 0.0 0.0
sat 0.0 0.0 1.0 0.0
on 0.0 0.0 0.0 1.0
mat 0.5 0.5 0.5 0.5
143alibi = ALiBiAttention(d_model=4, n_heads=2)
Instantiate ALiBi with 4 model dims and 2 heads. Sets d_k=2, scale=√2, slopes=[0.5, 0.25].
EXECUTION STATE
alibi.d_k = 2
alibi.scale = 1.4142
alibi.slopes = [0.5, 0.25]
144output, all_weights = alibi.forward(Q, K, V)
Run the full ALiBi pipeline. For each head: standard scores + ALiBi bias → softmax → weighted sum.
EXECUTION STATE
output (5×4) =
d0 d1 d2 d3
The 0.3274 0.3936 0.1861 0.2613
cat 0.3961 0.1689 0.2120 0.2977
sat 0.1504 0.2154 0.2544 0.3573
on 0.1877 0.2393 0.1811 0.5662
mat 0.2896 0.3695 0.2731 0.4746
len(all_weights) = 2 (one 5×5 matrix per head)
153alibi.explain(Q, K, V, tokens, query_idx=0)
Print detailed trace for ‘The’ (token 0). Shows per-head scores, distance penalties, and weights.
EXECUTION STATE
query_idx = 0 → tracing ‘The’
154alibi.explain(Q, K, V, tokens, query_idx=1)
Print detailed trace for ‘cat’ (token 1). Compare how Head 1 (steep slope) vs Head 2 (gentle slope) treat nearby vs distant tokens.
EXECUTION STATE
query_idx = 1 → tracing ‘cat’
109 lines without explanation
1import numpy as np
2import math
34classALiBiAttention:5"""
6 ALiBi — Attention with Linear Biases (Press et al., ICLR 2022)
78 Eliminates positional embeddings entirely.
9 Instead, subtracts a head-specific linear distance penalty
10 from attention scores before softmax.
1112 slope_h = 1 / 2^h (geometric schedule, no learned params)
13 """1415def__init__(self, d_model:int, n_heads:int):16 self.d_model = d_model
17 self.n_heads = n_heads
18 self.d_k = d_model // n_heads
19 self.scale = math.sqrt(self.d_k)20 self.slopes = self._compute_slopes(n_heads)2122def_compute_slopes(self, n_heads:int)->list:23"""Geometric slope schedule: m_h = 1/2^h."""24return[1.0/(2**(h +1))for h inrange(n_heads)]2526def_build_alibi_bias(self, N:int, slope:float)-> np.ndarray:27"""Build the N×N distance penalty matrix."""28 i_idx = np.arange(N)[:,None]29 j_idx = np.arange(N)[None,:]30return-slope * np.abs(i_idx - j_idx)3132def_softmax(self, x: np.ndarray)-> np.ndarray:33"""Numerically stable softmax along last axis."""34 x_shifted = x - np.max(x, axis=-1, keepdims=True)35 exp_x = np.exp(x_shifted)36return exp_x / np.sum(exp_x, axis=-1, keepdims=True)3738defcompute_scores(self, Qh: np.ndarray, Kh: np.ndarray):39"""Raw dot-product scores: Qh @ Kh^T."""40return Qh @ Kh.T
4142defscale_scores(self, scores: np.ndarray)-> np.ndarray:43"""Divide by sqrt(d_k) to control variance."""44return scores / self.scale
4546defapply_alibi_bias(self, scaled: np.ndarray, slope:float):47"""Add the linear distance penalty to scores."""48 N = scaled.shape[0]49 bias = self._build_alibi_bias(N, slope)50return scaled + bias
5152defcompute_weights(self, alibi_scores: np.ndarray):53"""Apply softmax to get attention weights."""54return self._softmax(alibi_scores)5556defcompute_output(self, weights: np.ndarray, Vh: np.ndarray):57"""Weighted sum of value vectors."""58return weights @ Vh
5960defforward(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray):61"""
62 Full forward pass.
6364 Args:
65 Q: Query matrix (N, d_model)
66 K: Key matrix (N, d_model)
67 V: Value matrix (N, d_model)
6869 Returns:
70 output: Concatenated head outputs (N, d_model)
71 all_weights: List of weight matrices per head
72 """73 head_outputs =[]74 all_weights =[]7576for h inrange(self.n_heads):77 s = h * self.d_k
78 e = s + self.d_k
79 Qh, Kh, Vh = Q[:, s:e], K[:, s:e], V[:, s:e]8081 raw_scores = self.compute_scores(Qh, Kh)82 scaled = self.scale_scores(raw_scores)83 alibi_scores = self.apply_alibi_bias(scaled, self.slopes[h])84 weights = self.compute_weights(alibi_scores)85 output = self.compute_output(weights, Vh)8687 head_outputs.append(output)88 all_weights.append(weights)8990return np.hstack(head_outputs), all_weights
9192defexplain(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,93 tokens:list, query_idx:int=0):94"""Print a detailed trace for a specific query token."""95 token = tokens[query_idx]96print(f"\n=== ALiBi trace for '{token}' (row {query_idx}) ===")9798for h inrange(self.n_heads):99 s = h * self.d_k
100 e = s + self.d_k
101 Qh, Kh, Vh = Q[:, s:e], K[:, s:e], V[:, s:e]102103 raw = self.compute_scores(Qh, Kh)104 scaled = self.scale_scores(raw)105 alibi = self.apply_alibi_bias(scaled, self.slopes[h])106 w = self.compute_weights(alibi)107 out = self.compute_output(w, Vh)108109print(f"\n--- Head {h+1} (slope={self.slopes[h]}) ---")110print(f"Q_h{h+1}[{token}] = {Qh[query_idx]}")111for j, t inenumerate(tokens):112 dist =abs(query_idx - j)113 penalty = self.slopes[h]* dist
114print(f" score[{t}] = {scaled[query_idx,j]:.4f}"115f" penalty = -{penalty:.4f}"116f" alibi = {alibi[query_idx,j]:.4f}")117print(f" softmax = {np.round(w[query_idx],4)}")118print(f" output = {np.round(out[query_idx],4)}")119120121# ── Shared Example (used in every chapter) ──122tokens =["The","cat","sat","on","mat"]123124Q = np.array([125[1.0,0.0,1.0,0.0],# The126[0.0,2.0,0.0,1.0],# cat127[1.0,1.0,1.0,0.0],# sat128[0.0,0.0,1.0,1.0],# on129[1.0,0.0,0.0,1.0],# mat130])131132K = np.array([133[0.0,1.0,0.0,1.0],# The134[1.0,0.0,1.0,0.0],# cat135[1.0,1.0,0.0,0.0],# sat136[0.0,0.0,1.0,1.0],# on137[1.0,0.0,0.5,0.5],# mat138])139140V = np.array([141[1.0,0.0,0.0,0.0],# The142[0.0,1.0,0.0,0.0],# cat143[0.0,0.0,1.0,0.0],# sat144[0.0,0.0,0.0,1.0],# on145[0.5,0.5,0.5,0.5],# mat146])147148# ── Run ──149alibi = ALiBiAttention(d_model=4, n_heads=2)150output, all_weights = alibi.forward(Q, K, V)151152print("ALiBi Output (5x4):")153print(np.round(output,4))154155print("\nHead 1 Weights (5x5):")156print(np.round(all_weights[0],4))157158print("\nHead 2 Weights (5x5):")159print(np.round(all_weights[1],4))160161# Detailed trace for "The" and "cat"162alibi.explain(Q, K, V, tokens, query_idx=0)163alibi.explain(Q, K, V, tokens, query_idx=1)
PyTorch Implementation
The PyTorch version supports GPU acceleration and automatic differentiation. The key difference from the NumPy version: slopes are registered as a buffer (not a parameter), ensuring they move to GPU with the model but are never updated by the optimizer. The bias tensor is built for all heads at once using broadcasting: slopes (H,)→(H,1,1) times distances (N,N)→(H,N,N).
ALiBi Attention — PyTorch Implementation
🐍alibi_attention_torch.py
Explanation(16)
Code(139)
1Import PyTorch
torch is the core tensor library. torch.nn provides neural network building blocks (nn.Module). torch.nn.functional provides stateless operations like softmax. math is used for sqrt.
6class ALiBiAttention(nn.Module)
nn.Module subclass. Unlike the NumPy version, this supports GPU acceleration via .cuda(), automatic gradient computation via autograd, and integration with PyTorch training loops. The slopes are registered as a buffer (non-learnable).
16def __init__(self, d_model, n_heads)
Constructor. Computes d_k, scale, and registers the geometric slope schedule as a buffer. Buffers move with the model to GPU/CPU but are not updated by the optimizer.
EXECUTION STATE
⬇ input: d_model = 4
⬇ input: n_heads = 2
self.d_k = 4 // 2 = 2
self.scale = √2 = 1.4142
23slopes = torch.tensor([...])
Compute slopes as a tensor: [1/2, 1/4] = [0.5, 0.25]. These are constants, not learned parameters.
EXECUTION STATE
slopes = tensor([0.5000, 0.2500])
26self.register_buffer("slopes", slopes)
register_buffer stores a tensor that: (1) moves to GPU with model.cuda(), (2) is saved in state_dict, (3) is NOT updated by optimizer. Perfect for fixed constants like ALiBi slopes.
EXECUTION STATE
register_buffer = makes slopes part of model state but not a parameter
28def _build_alibi_bias(self, N, device) → Tensor
Build the (n_heads, N, N) bias tensor for all heads at once. Uses broadcasting to create different penalty strengths per head.
EXECUTION STATE
⬇ input: N = 5 (sequence length)
⬇ input: device = cpu or cuda
⬆ returns = Tensor shape (2, 5, 5) — one 5×5 penalty matrix per head
31positions = torch.arange(N, device=device)
Create position indices [0, 1, 2, 3, 4] on the same device as the inputs.
Broadcast slopes (2,) → (2,1,1) against dist (5,5) → (2,5,5). Each head gets its own penalty matrix.
EXECUTION STATE
.unsqueeze(1).unsqueeze(2) = slopes shape (2,) → (2,1,1) for broadcasting with (5,5)
bias[0] (slope=0.5) = Head 1: -0.5 * dist
bias[1] (slope=0.25) = Head 2: -0.25 * dist
37return bias
Return the (2, 5, 5) ALiBi bias tensor. bias[h] is the 5×5 penalty matrix for head h.
39def forward(self, Q, K, V, mask) → tuple
Full forward pass. Builds ALiBi bias once for all heads, then loops over heads applying standard attention + ALiBi penalty.
EXECUTION STATE
⬇ input: Q = shape (5, 4)
⬇ input: K = shape (5, 4)
⬇ input: V = shape (5, 4)
⬇ input: mask = None or (5, 5) boolean
57alibi_bias = self._build_alibi_bias(N, device)
Build the ALiBi bias tensor once. Shape: (2, 5, 5). This is the key precomputation.
EXECUTION STATE
alibi_bias.shape = (2, 5, 5)
68scores = scores + alibi_bias[h]
THE ALIBI STEP: add head h’s distance penalty matrix to the scaled scores. This single line is the entire difference from standard attention.
EXECUTION STATE
alibi_bias[h] = 5×5 penalty matrix for head h
73weights = F.softmax(scores, dim=-1)
Softmax along last dimension. The ALiBi penalties have already been applied, so nearby tokens naturally receive more weight.
EXECUTION STATE
dim=-1 = softmax along last axis — each row becomes a probability distribution
85alibi = ALiBiAttention(d_model=4, n_heads=2)
Instantiate. slopes buffer = [0.5, 0.25]. No learned parameters.
EXECUTION STATE
alibi.slopes = tensor([0.5000, 0.2500])
86output, all_weights = alibi(Q, K, V)
Call forward(). Calling the module directly invokes forward() plus any hooks.
EXECUTION STATE
output.shape = (5, 4)
len(all_weights) = 2
123 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4import math
56classALiBiAttention(nn.Module):7"""
8 ALiBi — Attention with Linear Biases (Press et al., ICLR 2022)
9 PyTorch implementation with GPU support.
1011 Subtracts head-specific linear distance penalty from scores.
12 Zero additional learned parameters.
13 """1415def__init__(self, d_model:int, n_heads:int):16super().__init__()17 self.d_model = d_model
18 self.n_heads = n_heads
19 self.d_k = d_model // n_heads
20 self.scale = math.sqrt(self.d_k)2122# Register slopes as a non-learnable buffer23 slopes = torch.tensor(24[1.0/(2**(h +1))for h inrange(n_heads)]25)26 self.register_buffer("slopes", slopes)2728def_build_alibi_bias(29 self, N:int, device: torch.device
30)-> torch.Tensor:31"""Build (n_heads, N, N) ALiBi bias tensor."""32 positions = torch.arange(N, device=device)33 dist = torch.abs(34 positions.unsqueeze(1)- positions.unsqueeze(0)35).float()36# slopes: (n_heads,) -> (n_heads, 1, 1) for broadcast37 bias =-self.slopes.unsqueeze(1).unsqueeze(2)* dist
38return bias
3940defforward(41 self,42 Q: torch.Tensor,43 K: torch.Tensor,44 V: torch.Tensor,45 mask: torch.Tensor |None=None,46)->tuple[torch.Tensor,list[torch.Tensor]]:47"""
48 Args:
49 Q: (N, d_model) — query matrix
50 K: (N, d_model) — key matrix
51 V: (N, d_model) — value matrix
52 mask: Optional (N, N) boolean mask
5354 Returns:
55 output: (N, d_model) — concatenated heads
56 all_weights: list of (N, N) per-head weights
57 """58 N = Q.size(0)59 device = Q.device
6061# Build ALiBi bias for all heads at once62 alibi_bias = self._build_alibi_bias(N, device)6364 head_outputs =[]65 all_weights =[]6667for h inrange(self.n_heads):68 s = h * self.d_k
69 e = s + self.d_k
70 Qh = Q[:, s:e]71 Kh = K[:, s:e]72 Vh = V[:, s:e]7374# Scaled dot-product75 scores = torch.matmul(Qh, Kh.transpose(-2,-1))76 scores = scores / self.scale
7778# Add ALiBi bias (the only difference from standard)79 scores = scores + alibi_bias[h]8081if mask isnotNone:82 scores = scores.masked_fill(mask,float("-inf"))8384 weights = F.softmax(scores, dim=-1)85 output = torch.matmul(weights, Vh)8687 head_outputs.append(output)88 all_weights.append(weights)8990return torch.cat(head_outputs, dim=-1), all_weights
919293# ── Shared Example ──94tokens =["The","cat","sat","on","mat"]9596Q = torch.tensor([97[1.0,0.0,1.0,0.0],98[0.0,2.0,0.0,1.0],99[1.0,1.0,1.0,0.0],100[0.0,0.0,1.0,1.0],101[1.0,0.0,0.0,1.0],102])103104K = torch.tensor([105[0.0,1.0,0.0,1.0],106[1.0,0.0,1.0,0.0],107[1.0,1.0,0.0,0.0],108[0.0,0.0,1.0,1.0],109[1.0,0.0,0.5,0.5],110])111112V = torch.tensor([113[1.0,0.0,0.0,0.0],114[0.0,1.0,0.0,0.0],115[0.0,0.0,1.0,0.0],116[0.0,0.0,0.0,1.0],117[0.5,0.5,0.5,0.5],118])119120# ── Run ──121alibi = ALiBiAttention(d_model=4, n_heads=2)122output, all_weights = alibi(Q, K, V)123124print("ALiBi Output (5x4):")125print(output.round(decimals=4))126127print("\nHead 1 Weights:")128print(all_weights[0].round(decimals=4))129130print("\nHead 2 Weights:")131print(all_weights[1].round(decimals=4))132133# ── GPU: just move tensors ──134if torch.cuda.is_available():135 alibi_gpu = alibi.cuda()136 Q_gpu, K_gpu, V_gpu = Q.cuda(), K.cuda(), V.cuda()137 out_gpu, _ = alibi_gpu(Q_gpu, K_gpu, V_gpu)138print("GPU matches CPU?",139 torch.allclose(output, out_gpu.cpu(), atol=1e-4))
Key Takeaways
ALiBi replaces positional embeddings with a linear distance penalty subtracted from attention scores before softmax. This is the simplest possible position encoding: zero parameters, zero trigonometric functions, just −mh×∣i−j∣.
Different heads use different slopes via a geometric schedule mh=1/2h. This creates a multi-resolution view: steep slopes for local syntax, gentle slopes for long-range dependencies.
Extrapolation is the primary advantage. Because the bias is a deterministic function of distance (not a learned representation), models trained at length L generalize to 2L or beyond without fine-tuning.
Diagonal dominance emerges naturally. Even when content scores are uniform or zero, ALiBi creates meaningful attention patterns where self-attention and nearby tokens dominate.
ALiBi composes with all other optimizations. It works with MQA/GQA (KV-sharing), Flash Attention (tiled computation), and causal masking. The bias adds negligible overhead.
Exercises
Exercise 1: Compute for "sat"
Using the shared Q, K, V matrices, compute the ALiBi attention weights for “sat” (row 2) in both heads. Verify that Head 1 gives [0.1003,0.1653,0.5527,0.0815,0.1003]. Why does "sat" strongly self-attend (55.3%) in Head 1? Consider both the content score and the distance penalty.
Exercise 2: Slope = 0 Reduces to Standard Attention
Prove mathematically that when mh=0 for all heads, ALiBi produces identical results to standard scaled dot-product attention (Chapter 1). Then verify numerically using the interactive visualizer above (set slope to 0).
Exercise 3: ALiBi + Causal Mask
In autoregressive models (Chapter 3), causal masking sets future positions to −∞. How does this interact with ALiBi? Write the combined score formula and explain whether the ALiBi penalty is redundant for future positions. Hint: −∞+(−mh×d)=−∞.
Exercise 4: Comparing Slope Schedules
Press et al. used mh=1/2h. Propose two alternative schedules: (a) linear: mh=(H−h+1)/H and (b) constant: mh=0.25 for all heads. For each, compute the attention weights for "The" across all 5 tokens using Head 1. What is lost when all heads have the same slope?
Exercise 5: Extrapolation Experiment
Extend the shared example to 10 tokens: ["The", "cat", "sat", "on", "mat", "and", "the", "dog", "ran", "home"]. Invent reasonable Q, K, V vectors for the new tokens. Compute ALiBi attention for the original 5 tokens and verify that their weights are nearly unchanged despite the longer sequence. This demonstrates ALiBi's extrapolation: adding more tokens barely affects existing attention patterns.
References
Press, O., Smith, N. A., & Lewis, M. (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” International Conference on Learning Representations (ICLR).
Scao, T. L., et al. (2022). “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.” arXiv:2211.05100.
MosaicML (2023). “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs.” MosaicML Blog.
Su, J., et al. (2021). “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864.
Dao, T., et al. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv:2205.14135.
Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS), 30.