Boo-AI — Master Artificial Intelligence by Building from Scratch

Press, Smith, & Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", ICLR 2022

Learning Objectives

By the end of this chapter, you will be able to:

Explain why learned and sinusoidal positional embeddings fail to extrapolate beyond training sequence lengths, and why this is a critical problem for deployment.
Describe the core insight of ALiBi: replace position embeddings with a simple linear distance penalty added directly to attention scores — requiring zero additional parameters.
Derive the ALiBi formula and explain the geometric slope schedule that allows different heads to attend at different distance scales simultaneously.
Compute ALiBi attention weights and outputs by hand for “The cat sat on mat” and compare them to the standard attention results from Chapter 1.
Implement a complete ALiBi class in both NumPy and PyTorch that you can run on any input.
Connect ALiBi to RoPE (Chapter 8), Flash Attention (Chapter 13), and modern inference systems, understanding the tradeoffs between rotation-based and bias-based position encoding.

Where this appears: ALiBi was introduced at Facebook AI Research and adopted by the BLOOM model (176B parameters, BigScience), MPT-7B/30B (MosaicML), and other production LLMs. It offers the simplest possible approach to positional encoding — no learned parameters, no trigonometric functions, just subtraction. Understanding ALiBi illuminates a deep question: how much positional information does attention truly need?

The Real Problem

Chapters 7 and 8 established two approaches to injecting position information into attention: relative position biases (Chapter 7) and rotary position embeddings (Chapter 8, RoPE). Both solve the problem of making attention aware of token order. But there is a deeper problem that neither the original sinusoidal embeddings nor learned position embeddings solve reliably: what happens when the model encounters sequences longer than those seen during training?

Position Embedding Failure Modes

Consider a language model trained on sequences of up to 2048 tokens. During training, the model learns (or is given) position representations for positions 0 through 2047. At inference time, a user provides a 4096-token document. What happens?

Learned position embeddings (GPT-2, BERT): The model has an embedding lookup table with exactly 2048 rows. Position 2049 simply does not exist. The model either crashes or truncates the input. There is no extrapolation at all.
Sinusoidal position embeddings (original Transformer): The sine/cosine functions can produce values at any position. But the model has never seen the specific activation patterns produced by positions beyond 2048. The learned attention weights are calibrated for a specific range of positional signals, and positions outside that range produce unpredictable behavior. Perplexity collapses.
RoPE (Chapter 8): Rotation-based encoding can generate rotations for any position. However, without techniques like NTK-aware scaling or YaRN, the effective frequency distribution at unseen positions differs from training. RoPE degrades more gracefully than learned embeddings but still requires extension methods for reliable extrapolation.

The Extrapolation Challenge

The root cause is architectural: all three methods above encode position into the token representations themselves. The model must then learn to interpret these positional signals during attention computation. When the positional signals at inference time differ from those seen during training, the learned interpretation breaks down.

Press, Smith, and Lewis (2022) asked a radical question: what if we bypass token representations entirely and inject position information directly into the attention scores? If the position signal is a simple, predictable function of distance — one that the model never needs to “learn to interpret” — then extrapolation becomes trivial. A token 3 positions away receives the same penalty regardless of whether position 3 appeared during training at index 100 or index 10,000.

The Key Insight

ALiBi does not encode position into token vectors at all. Instead, it subtracts a linear distance penalty directly from attention scores before softmax. Because the penalty is a fixed algebraic function of distance (not a learned representation), models trained at 1024 tokens achieve near-perfect perplexity at 2048 tokens and degrade gracefully beyond.

From Intuition to Mathematics

The Distance Penalty Insight

Imagine you are sitting in a library, and someone asks you a question (your query). You can look at every book on the shelf (the keys and values), but you have a natural preference: the closer a book is to you, the more likely it is to be relevant. A book one shelf away is easy to reach; a book across the room requires more effort. You can still access distant books, but you need a stronger reason (a higher content-based score) to bother.

ALiBi formalizes this intuition. Given the standard scaled dot-product score between query $i$ and key $j$ , ALiBi subtracts a penalty proportional to their distance:

\text{ALiBi Score}[i,j] = \frac{Q[i] \cdot K[j]^\top}{\sqrt{d_k}} - m_h \times |i - j|

The penalty $m_h \times |i - j|$ grows linearly with distance. Token $j$ that is 1 position away from $i$ receives a penalty of $m_h$ ; 2 positions away receives $2 m_h$ ; and so on. After this subtraction, the scores enter softmax normally. The effect is that distant tokens must have proportionally higher content scores to achieve the same attention weight as nearby tokens.

Head-Specific Slopes

The slope $m_h$ is different for each attention head, creating a multi-scale view of the sequence:

Steep slopes (large $m_h$ like 0.5): The penalty grows quickly with distance. These heads become “local experts” that focus strongly on nearby tokens — useful for syntax, morphology, and local coherence.
Gentle slopes (small $m_h$ like 0.0625): The penalty grows slowly. These heads can attend broadly across the sequence — useful for coreference, topic tracking, and long-range dependencies.

Together, the set of heads covers multiple distance scales simultaneously, much like wavelet analysis decomposes a signal into components at different frequencies. The model does not learn these slopes; they are fixed by a geometric schedule, eliminating any risk of position overfitting.

The Mathematical Definition

Symbol-by-Symbol Breakdown

For head $h$ with slope $m_h$ , the ALiBi attention score between query position $i$ and key position $j$ is:

S_h[i,j] = \underbrace{\frac{Q_h[i] \cdot K_h[j]^\top}{\sqrt{d_k}}}_{\text{content score}} \;-\; \underbrace{m_h \cdot |i - j|}_{\text{distance penalty}}

Symbol	Meaning	Our Example
$Q_h[i]$	Query vector for token i in head h	$Q_{h1}[\text{The}] = [1.0,\; 0.0]$
$K_h[j]$	Key vector for token j in head h	$K_{h1}[\text{cat}] = [1.0,\; 0.0]$
$d_k$	Per-head dimension	$2$
$m_h$	Head-specific slope (fixed, not learned)	$m_1 = 0.5,\; m_2 = 0.25$
$\|i - j\|$	Absolute distance between positions	$\|\text{The} - \text{mat}\| = \|0 - 4\| = 4$

The attention weights are then obtained by applying softmax row-wise:

A_h[i,j] = \frac{\exp(S_h[i,j])}{\sum_{k=0}^{N-1} \exp(S_h[i,k])}

And the output is the weighted sum of values:

\text{Output}_h[i] = \sum_{j=0}^{N-1} A_h[i,j] \cdot V_h[j]

The Geometric Slope Schedule

Press et al. define the slopes using a geometric sequence:

m_h = \frac{1}{2^h} \quad \text{for } h = 1, 2, \ldots, H

For $H = 8$ heads, this gives:

Head	Slope	Penalty at distance 10	Character
1	$1/2 = 0.5$	$-5.0$	Ultra-local
2	$1/4 = 0.25$	$-2.5$	Local
3	$1/8 = 0.125$	$-1.25$	Medium-range
4	$1/16 = 0.0625$	$-0.625$	Broad
...	...	...	...
8	$1/256 \approx 0.0039$	$-0.039$	Near-global

Head 1 penalizes a token 10 positions away by $-5.0$ — essentially killing that attention score. Head 8 penalizes the same distance by only $-0.039$ — barely noticeable. Together, the heads form a multi-resolution view of the sequence, from hyper-local to near-global.

Why Not Learn the Slopes?

Press et al. experimented with learnable slopes but found no benefit. The geometric schedule works well because it provides exponentially spaced coverage of distance scales. Learning the slopes risks overfitting to training-length distances, which would defeat the purpose of ALiBi (extrapolation). Fixed slopes guarantee that the position signal generalizes.

Interactive: ALiBi Bias Explorer

Use the controls below to explore how ALiBi reshapes attention. Adjust the slope to see how stronger penalties create more local attention patterns. Switch between heads to compare their different distance sensitivities. Click any row to see its attention distribution as a bar chart.

Loading bias visualizer...

Interactive: Extrapolation Comparison

The visualization below compares how different positional encoding methods degrade when the input sequence exceeds the training length. ALiBi degrades most gracefully because its bias is a deterministic function of distance, not a learned representation that must generalize.

Loading extrapolation demo...

Step-by-Step Calculation

We use the same shared example as every chapter: tokens = ["The", "cat", "sat", "on", "mat"], with $d_{\text{model}} = 4$ , 2 heads, and $d_k = 2$ . Head 1 uses slope $m_1 = 0.5$ and Head 2 uses slope $m_2 = 0.25$ .

Step-by-Step for "The" (Row 0), Head 1 ( $m = 0.5$ )

Step 1: Extract head slices. Head 1 uses columns 0–1:

Q_{h1}[\text{The}] = [1.0,\; 0.0], \quad K_{h1} = \begin{bmatrix} 0.0 & 1.0 \\ 1.0 & 0.0 \\ 1.0 & 1.0 \\ 0.0 & 0.0 \\ 1.0 & 0.0 \end{bmatrix}

Step 2: Compute raw dot products.

$Q_{h1}[\text{The}] \cdot K_{h1}[\text{The}] = 1 \times 0 + 0 \times 1 = 0.0$
$Q_{h1}[\text{The}] \cdot K_{h1}[\text{cat}] = 1 \times 1 + 0 \times 0 = 1.0$
$Q_{h1}[\text{The}] \cdot K_{h1}[\text{sat}] = 1 \times 1 + 0 \times 1 = 1.0$
$Q_{h1}[\text{The}] \cdot K_{h1}[\text{on}] = 1 \times 0 + 0 \times 0 = 0.0$
$Q_{h1}[\text{The}] \cdot K_{h1}[\text{mat}] = 1 \times 1 + 0 \times 0 = 1.0$

Step 3: Scale by $\sqrt{d_k} = \sqrt{2} \approx 1.4142$ .

\text{Scaled}[\text{The}] = [0.0000,\; 0.7071,\; 0.7071,\; 0.0000,\; 0.7071]

Step 4: Compute ALiBi penalties. For slope $m = 0.5$ and query position $i = 0$ :

Key token	Position j	\|0 - j\|	Penalty = -0.5 × \|0 - j\|
The	0	0	0.000
cat	1	1	-0.500
sat	2	2	-1.000
on	3	3	-1.500
mat	4	4	-2.000

Step 5: Add bias to scaled scores.

\text{ALiBi}[\text{The}] = [0.000,\; 0.207,\; -0.293,\; -1.500,\; -1.293]

Notice: "cat" (score 0.707, penalty -0.500 = 0.207) beats "mat" (score 0.707, penalty -2.000 = -1.293) despite having identical content scores. Distance matters.

Step 6: Softmax.

Shift by max (0.207): $[-0.207,\; 0.000,\; -0.500,\; -1.707,\; -1.500]$
Exponentiate: $[0.813,\; 1.000,\; 0.607,\; 0.181,\; 0.223]$
Sum: $2.824$
Normalize: $A_{h1}[\text{The}] = [0.2879,\; 0.3541,\; 0.2148,\; 0.0642,\; 0.0790]$

Compare to standard attention (no ALiBi): $[0.1237,\; 0.2509,\; 0.2509,\; 0.1237,\; 0.2509]$ . ALiBi dramatically shifts weight toward nearby tokens: "The" goes from 12.4% to 28.8%, while distant "on" drops from 12.4% to 6.4%.

Step-by-Step for "The" (Row 0), Head 2 ( $m = 0.25$ )

Head 2 uses columns 2–3: $Q_{h2}[\text{The}] = [1.0,\; 0.0]$ . The scaled scores are $[0.000,\; 0.707,\; 0.000,\; 0.707,\; 0.354]$ .

With the gentler slope of 0.25, the penalties are halved compared to Head 1:

Key token	Penalty (m=0.25)	ALiBi score
The	0.000	0.000
cat	-0.250	0.457
sat	-0.500	-0.500
on	-0.750	-0.043
mat	-1.000	-0.646

Result: $A_{h2}[\text{The}] = [0.2142,\; 0.3384,\; 0.1299,\; 0.2052,\; 0.1122]$ . Compare to Head 1: "on" receives 20.5% in Head 2 vs 6.4% in Head 1. The gentler slope lets Head 2 maintain broader context while Head 1 focuses locally.

Step-by-Step for "cat" (Row 1), Head 1

$Q_{h1}[\text{cat}] = [0.0,\; 2.0]$ . The raw dot products:

$[0,2] \cdot [0,1] = 2.0, \quad [0,2] \cdot [1,0] = 0.0, \quad [0,2] \cdot [1,1] = 2.0, \quad [0,2] \cdot [0,0] = 0.0, \quad [0,2] \cdot [1,0] = 0.0$

Scaled: $[1.414,\; 0.000,\; 1.414,\; 0.000,\; 0.000]$ . ALiBi bias (distances from position 1): $[-0.5,\; 0,\; -0.5,\; -1.0,\; -1.5]$ .

\text{ALiBi}[\text{cat}] = [0.914,\; 0.000,\; 0.914,\; -1.000,\; -1.500]

Weights: $[0.3791,\; 0.1520,\; 0.3791,\; 0.0559,\; 0.0339]$ . "The" and "sat" (both distance 1) get identical penalties, so they tie at 37.9%. Distant "mat" (distance 3) drops to just 3.4%.

Full Attention Weights and Output

ALiBi vs Standard: Side-by-Side

Head 1 Weights ( $m = 0.5$ ):

	The	cat	sat	on	mat
The	0.2879	0.3541	0.2148	0.0642	0.0790
cat	0.3791	0.1520	0.3791	0.0559	0.0339
sat	0.1003	0.1653	0.5527	0.0815	0.1003
on	0.0796	0.1312	0.2163	0.3566	0.2163
mat	0.0341	0.1140	0.1880	0.1528	0.5110

Head 2 Weights ( $m = 0.25$ ):

	The	cat	sat	on	mat
The	0.2142	0.3384	0.1299	0.2052	0.1122
cat	0.3002	0.1901	0.1480	0.2338	0.1279
sat	0.1077	0.2806	0.1776	0.2806	0.1534
on	0.1106	0.1421	0.0899	0.4750	0.1824
mat	0.1545	0.0978	0.1256	0.3271	0.2949

Averaged Weights (Head 1 + Head 2) / 2:

	The	cat	sat	on	mat
The	0.2510	0.3462	0.1724	0.1347	0.0956
cat	0.3397	0.1710	0.2636	0.1449	0.0809
sat	0.1040	0.2229	0.3652	0.1810	0.1268
on	0.0951	0.1366	0.1531	0.4158	0.1994
mat	0.0943	0.1059	0.1568	0.2400	0.4030

Final Output ( $5 \times 4$ ):

	dim-0	dim-1	dim-2	dim-3
The	0.3274	0.3936	0.1861	0.2613
cat	0.3961	0.1689	0.2120	0.2977
sat	0.1504	0.2154	0.2544	0.3573
on	0.1877	0.2393	0.1811	0.5662
mat	0.2896	0.3695	0.2731	0.4746

Interpreting the Differences

The averaged weight matrix reveals ALiBi's defining characteristic: diagonal dominance. Every token gives itself or its immediate neighbor the highest weight. The diagonal values (self-attention) are: The=0.251, cat=0.171, sat=0.365, on=0.416, mat=0.403. Contrast with standard attention where "on" distributes weight uniformly (all 0.200 in Head 1) because its query vector is zero.

Even when a token has zero content-based preference (like "on" in Head 1), ALiBi still creates meaningful attention by position alone: self-attention (distance 0) receives the highest weight, and the weight decays with distance. Position becomes the tiebreaker when content scores are equal.

Key Observation

In standard attention, "on" (Head 1) has all scores = 0, producing perfectly uniform weights [0.20, 0.20, 0.20, 0.20, 0.20]. With ALiBi, those same zero scores become [-1.5, -1.0, -0.5, 0.0, -0.5] after bias, producing [0.08, 0.13, 0.22, 0.36, 0.22]. Even a content-agnostic token gains positional structure from ALiBi.

Applications Across Domains

Long-Document Language Modeling

The original motivation for ALiBi was enabling language models to process documents longer than their training context. BLOOM (176B, BigScience) and MPT-7B/30B (MosaicML) adopted ALiBi for this reason. A model trained on 2048-token sequences can be deployed on 4096+ token documents without fine-tuning. This is critical for summarization, document QA, and legal/medical text processing where documents routinely exceed training lengths.

Code Generation

Source code files are often much longer than natural language training samples. A model trained on 2048-token code snippets may need to process 10,000+ token files at inference. ALiBi's extrapolation property allows the model to maintain coherent attention across entire files, keeping track of function definitions, variable scope, and import relationships that span thousands of tokens.

Scientific Sequence Modeling

Protein sequences (up to 30,000+ amino acids), genomic sequences, and time series data all benefit from length extrapolation. A model trained on 1024-residue protein fragments can process full-length proteins without retraining. The linear distance penalty maps naturally to the biological intuition that nearby residues in a sequence interact more strongly.

Connection to Modern Systems

RoPE vs ALiBi

Property	RoPE (Chapter 8)	ALiBi (this chapter)
Where applied	Token embeddings (Q, K)	Attention scores
Position signal	2D rotation per dim pair	Linear distance penalty
Parameters	0 (frequencies are fixed)	0 (slopes are fixed)
Relative position	Built into dot product via rotation	Explicit distance subtraction
Extrapolation	Moderate (needs NTK/YaRN)	Strong (linear penalty generalizes)
Modern adoption	LLaMA, Mistral, Qwen, Gemma	BLOOM, MPT, Falcon
KV-cache impact	Q and K must be rotated	Bias can be precomputed or fused

In practice, RoPE has become more widely adopted (LLaMA 2/3, Mistral, Gemma) partly because extension methods like NTK-aware scaling and YaRN have largely solved its extrapolation weakness. ALiBi remains attractive for its simplicity and when extrapolation without any fine-tuning is paramount.

Flash Attention Compatibility

ALiBi is naturally compatible with Flash Attention (Chapter 13). The bias matrix can be computed on-the-fly within each tile of the tiled attention computation, without materializing the full $N \times N$ bias matrix in GPU HBM. Since the bias at position $(i,j)$ is just $-m_h \times |i-j|$ , it can be computed from the tile indices alone. Flash Attention 2 natively supports ALiBi-style additive biases.

KV-Cache and Inference

ALiBi does not modify Q, K, or V vectors, so it has zero impact on KV-cache size. Compare with RoPE, where rotated K vectors must be stored in the cache (though this has the same size as unrotated vectors). ALiBi's bias is applied after the Q·K dot product, meaning the cache stores only the original unmodified K and V vectors. The bias is recomputed at each decoding step from the position indices — a negligible cost.

ALiBi composes cleanly with MQA (Chapter 5) and GQA (Chapter 6). The bias is applied per-head to the attention scores regardless of whether keys and values are shared.

Complexity Analysis

Resource	Standard Attention	ALiBi	Difference
Added parameters	$O(N_{\max})$ or $0$	$0$	Learned PE has parameters; ALiBi has zero
Score computation	$O(N^2 \cdot d_k)$	$O(N^2 \cdot d_k + N^2)$	$O(N^2)$ bias addition (negligible)
Memory overhead	0 (standard scores only)	$O(N^2)$ per head (bias matrix)	Can be computed on-the-fly to avoid storing
Extrapolation	None (learned) or weak (sinusoidal)	Strong (linear penalty generalizes)	Primary advantage of ALiBi

The computational overhead of ALiBi is negligible: adding a scalar to each attention score costs $O(N^2)$ per head, which is dominated by the $O(N^2 \cdot d_k)$ cost of the dot products themselves. In practice, the bias can be fused into the attention kernel with zero additional memory passes.

Python Implementation

The class below implements ALiBi from scratch. The critical method is apply_alibi_bias(), which subtracts the linear distance penalty from scaled scores. Compare this with the standard attention class from Chapter 1 — only one additional step is needed.

ALiBi Attention — NumPy Implementation

🐍alibi_attention.py

Explanation(54)

Code(163)

1import numpy as np

NumPy provides vectorized matrix operations. Qh @ Kh.T runs as optimized C code, not Python loops.

2import math

Python standard library. We use math.sqrt() to precompute the scaling factor √d_k.

4class ALiBiAttention

Wraps ALiBi in a reusable class. The key difference from standard attention: apply_alibi_bias() subtracts a linear distance penalty from scores before softmax. Zero additional learned parameters — slopes are computed by formula.

17def __init__(self, d_model, n_heads)

Constructor. Takes the full model dimension and number of heads. Computes per-head dimension d_k, scaling factor √d_k, and the geometric slope schedule.

EXECUTION STATE

⬇ input: d_model = 4

⬇ input: n_heads = 2

18self.d_model = d_model

Store full model dimension (4 in our example).

EXECUTION STATE

self.d_model = 4

19self.n_heads = n_heads

Store number of attention heads. Each head gets a different slope for its distance penalty.

EXECUTION STATE

self.n_heads = 2

20self.d_k = d_model // n_heads

Per-head dimension. Each head sees d_k=2 dimensions of Q, K, and V.

EXECUTION STATE

d_model // n_heads = 4 // 2 = 2

self.d_k = 2

21self.scale = math.sqrt(self.d_k)

Precompute √d_k once. Divides every dot product to control variance (same as Chapter 1).

EXECUTION STATE

math.sqrt(2) = 1.4142

self.scale = 1.4142

22self.slopes = self._compute_slopes(n_heads)

Compute the geometric slope schedule: m_h = 1/2^h. Head 1 gets slope=0.5 (local), Head 2 gets slope=0.25 (broader).

EXECUTION STATE

self.slopes = [0.5, 0.25]

24def _compute_slopes(self, n_heads) → list

Geometric slope schedule. Each head gets a different penalty strength. Lower-indexed heads have steeper slopes (attend more locally). Higher-indexed heads have gentler slopes (attend more broadly).

EXECUTION STATE

⬇ input: n_heads = 2

⬆ returns = [0.5, 0.25] — list of slopes

26return [1.0 / (2 ** (h + 1)) for h in range(n_heads)]

List comprehension. For h=0: 1/2^1 = 0.5. For h=1: 1/2^2 = 0.25. No learning required.

EXECUTION STATE

h=0 = 1.0 / 2^1 = 0.5

h=1 = 1.0 / 2^2 = 0.25

⬆ return = [0.5, 0.25]

28def _build_alibi_bias(self, N, slope) → np.ndarray

Build the N×N distance penalty matrix. Entry (i,j) = -slope × |i - j|. Tokens far apart receive larger penalties (more negative values), suppressing their attention scores before softmax.

EXECUTION STATE

⬇ input: N = 5 (number of tokens)

⬇ input: slope = 0.5 (Head 1) or 0.25 (Head 2)

⬆ returns = np.ndarray (5, 5) — distance penalties

30i_idx = np.arange(N)[:, None]

Create column vector [0,1,2,3,4]^T. The [:, None] adds a second axis — shape becomes (5,1) for broadcasting.

EXECUTION STATE

np.arange(5) = [0, 1, 2, 3, 4]

[:, None] = reshape (5,) → (5,1) for broadcasting

i_idx = [[0],[1],[2],[3],[4]] shape (5,1)

31j_idx = np.arange(N)[None, :]

Create row vector [0,1,2,3,4]. The [None, :] adds a first axis — shape becomes (1,5) for broadcasting.

EXECUTION STATE

[None, :] = reshape (5,) → (1,5) for broadcasting

j_idx = [[0, 1, 2, 3, 4]] shape (1,5)

32return -slope * np.abs(i_idx - j_idx)

Broadcasting: (5,1) - (1,5) = (5,5) distance matrix. Multiply by -slope to create penalties. Diagonal = 0 (self-attention has no penalty). Off-diagonal grows linearly with distance.

EXECUTION STATE

np.abs(i_idx - j_idx) =

     0  1  2  3  4
0 [  0  1  2  3  4 ]
1 [  1  0  1  2  3 ]
2 [  2  1  0  1  2 ]
3 [  3  2  1  0  1 ]
4 [  4  3  2  1  0 ]

⬆ return (slope=0.5) =

        The    cat    sat     on    mat
The   0.00  -0.50  -1.00  -1.50  -2.00
cat  -0.50   0.00  -0.50  -1.00  -1.50
sat  -1.00  -0.50   0.00  -0.50  -1.00
on   -1.50  -1.00  -0.50   0.00  -0.50
mat  -2.00  -1.50  -1.00  -0.50   0.00

⬆ return (slope=0.25) =

        The    cat    sat     on    mat
The   0.00  -0.25  -0.50  -0.75  -1.00
cat  -0.25   0.00  -0.25  -0.50  -0.75
sat  -0.50  -0.25   0.00  -0.25  -0.50
on   -0.75  -0.50  -0.25   0.00  -0.25
mat  -1.00  -0.75  -0.50  -0.25   0.00

34def _softmax(self, x) → np.ndarray

Numerically stable softmax. Takes a score matrix (5×5) and returns probabilities per row. Each row sums to 1.0.

EXECUTION STATE

⬇ input: x = shape (5, 5) — ALiBi score matrix for one head

⬆ returns = np.ndarray (5, 5) — softmax probabilities per row

36x_shifted = x - np.max(x, axis=-1, keepdims=True)

Subtract row-wise max for numerical stability. exp(x - max) prevents overflow while preserving the softmax result.

EXECUTION STATE

axis=-1 = find max along last axis — each row gets its own max

keepdims=True = result shape (5,1) not (5,) — broadcasting x(5×5) - max(5×1) works correctly

37exp_x = np.exp(x_shifted)

Exponentiate every element. The largest value per row is exp(0)=1.0 — no overflow possible.

38return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Divide each element by its row sum to normalize into probabilities. Each row sums to exactly 1.0.

EXECUTION STATE

axis=-1 = sum along last axis — sum each row independently

keepdims=True = sum returns shape (5,1) so broadcasting works

40def compute_scores(self, Qh, Kh)

Raw dot-product scores: Qh (5×2) @ Kh.T (2×5) → (5×5). Same as standard attention — ALiBi does not change this step.

EXECUTION STATE

⬇ input: Qh = shape (5, 2) — one head’s queries

⬇ input: Kh = shape (5, 2) — one head’s keys

⬆ returns = np.ndarray (5, 5) — raw scores

42return Qh @ Kh.T

Matrix multiply Qh (5×2) with Kh transposed (2×5). Entry (i,j) = dot product of query_i with key_j.

EXECUTION STATE

.T = transpose — Kh (5×2) becomes (2×5)

⬆ return (Head 1) =

     The  cat  sat   on  mat
The  0.0  1.0  1.0  0.0  1.0
cat  2.0  0.0  2.0  0.0  0.0
sat  1.0  1.0  2.0  0.0  1.0
on   0.0  0.0  0.0  0.0  0.0
mat  0.0  1.0  1.0  0.0  1.0

44def scale_scores(self, scores) → np.ndarray

Divide every score by √d_k = √2 ≈ 1.4142 to prevent softmax saturation.

EXECUTION STATE

⬇ input: scores = shape (5, 5) — raw dot products

self.scale = 1.4142 (√2)

⬆ returns = np.ndarray (5, 5) — scores ÷ 1.4142

46return scores / self.scale

Elementwise division. Every score is divided by 1.4142.

EXECUTION STATE

⬆ return (Head 1) =

       The     cat     sat      on     mat
The  0.0000  0.7071  0.7071  0.0000  0.7071
cat  1.4142  0.0000  1.4142  0.0000  0.0000
sat  0.7071  0.7071  1.4142  0.0000  0.7071
on   0.0000  0.0000  0.0000  0.0000  0.0000
mat  0.0000  0.7071  0.7071  0.0000  0.7071

48def apply_alibi_bias(self, scaled, slope)

THE KEY ALiBi OPERATION: add the linear distance penalty to the scaled scores. This is the ONLY difference from standard attention. Tokens far apart get large negative penalties, suppressing their weights after softmax.

EXECUTION STATE

⬇ input: scaled (5×5) = scaled dot-product scores

⬇ input: slope = 0.5 (Head 1) or 0.25 (Head 2)

⬆ returns = np.ndarray (5, 5) — scores with distance penalties

50N = scaled.shape[0]

Get sequence length from matrix dimensions.

EXECUTION STATE

N = 5

51bias = self._build_alibi_bias(N, slope)

Build the distance penalty matrix. Entry (i,j) = -slope × |i-j|.

EXECUTION STATE

bias (slope=0.5) =

        The    cat    sat     on    mat
The   0.00  -0.50  -1.00  -1.50  -2.00
cat  -0.50   0.00  -0.50  -1.00  -1.50
sat  -1.00  -0.50   0.00  -0.50  -1.00
on   -1.50  -1.00  -0.50   0.00  -0.50
mat  -2.00  -1.50  -1.00  -0.50   0.00

52return scaled + bias

Elementwise addition. The bias is always ≤ 0, so every off-diagonal score is reduced. Nearby tokens lose little; distant tokens lose a lot.

EXECUTION STATE

⬆ return (Head 1) =

        The     cat     sat      on     mat
The   0.000   0.207  -0.293  -1.500  -1.293
cat   0.914   0.000   0.914  -1.000  -1.500
sat  -0.293   0.207   1.414  -0.500  -0.293
on   -1.500  -1.000  -0.500   0.000  -0.500
mat  -2.000  -0.793  -0.293  -0.500   0.707

54def compute_weights(self, alibi_scores)

Apply softmax row-wise to get attention probabilities. The distance penalties have already been applied, so nearby tokens naturally receive higher weights.

EXECUTION STATE

⬇ input: alibi_scores = shape (5, 5) — scores with ALiBi bias

⬆ returns = np.ndarray (5, 5) — each row sums to 1.0

56return self._softmax(alibi_scores)

Calls _softmax. All 5 rows become probability distributions that favor nearby tokens.

EXECUTION STATE

⬆ return (Head 1) =

        The      cat      sat       on      mat
The  0.2879   0.3541   0.2148   0.0642   0.0790
cat  0.3791   0.1520   0.3791   0.0559   0.0339
sat  0.1003   0.1653   0.5527   0.0815   0.1003
on   0.0796   0.1312   0.2163   0.3566   0.2163
mat  0.0341   0.1140   0.1880   0.1528   0.5110

58def compute_output(self, weights, Vh)

Weighted sum of value vectors. weights (5×5) @ Vh (5×2) → (5×2). Each output row blends all value vectors, weighted by ALiBi-adjusted attention.

EXECUTION STATE

⬇ input: weights = shape (5, 5) — attention probabilities

⬇ input: Vh (5×2) = value vectors for this head

⬆ returns = np.ndarray (5, 2) — weighted output

60return weights @ Vh

Matrix multiply weights (5×5) with Vh (5×2). Each output row is a blend of all 5 value vectors.

EXECUTION STATE

⬆ return (Head 1) =

       d0       d1
The  0.3274   0.3936
cat  0.3961   0.1689
sat  0.1504   0.2154
on   0.1877   0.2393
mat  0.2896   0.3695

62def forward(self, Q, K, V)

Main entry point. For each head: (1) slice Q,K,V, (2) compute scaled dot-product scores, (3) ADD ALiBi bias, (4) softmax, (5) weighted sum. Returns concatenated outputs.

EXECUTION STATE

⬇ input: Q (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0

⬇ input: K (5×4) =

      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5

⬇ input: V (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5

⬆ returns = (output (5,4), all_weights list of 2 matrices)

76head_outputs = []

Will collect each head’s (5×2) output to concatenate at the end.

EXECUTION STATE

head_outputs = [] (empty, will hold 2 matrices)

77all_weights = []

Will collect each head’s (5×5) attention weight matrix.

EXECUTION STATE

all_weights = [] (empty, will hold 2 matrices)

79for h in range(self.n_heads):

Loop over all heads (h=0 and h=1). Each head uses a DIFFERENT slope for its distance penalty. Head 0: slope=0.5 (local). Head 1: slope=0.25 (broader).

LOOP TRACE · 2 iterations

h=0 (Head 1, slope=0.5)

Qh1 = Q[:, 0:2] = d0 d1 The 1.0 0.0 cat 0.0 2.0 sat 1.0 1.0 on 0.0 0.0 mat 1.0 0.0

slope = 0.5 — strong local bias

output_h1 = d0 d1 The 0.3274 0.3936 cat 0.3961 0.1689 sat 0.1504 0.2154 on 0.1877 0.2393 mat 0.2896 0.3695

h=1 (Head 2, slope=0.25)

Qh2 = Q[:, 2:4] = d2 d3 The 1.0 0.0 cat 0.0 1.0 sat 1.0 0.0 on 1.0 1.0 mat 0.0 1.0

slope = 0.25 — gentler, broader attention

output_h2 = d0 d1 The 0.1861 0.2613 cat 0.2120 0.2977 sat 0.2544 0.3573 on 0.1811 0.5662 mat 0.2731 0.4746

80s = h * self.d_k

Start column index for this head.

EXECUTION STATE

h=0 → s = 0 × 2 = 0

h=1 → s = 1 × 2 = 2

81e = s + self.d_k

End column index.

EXECUTION STATE

h=0 → e = 0 + 2 = 2

h=1 → e = 2 + 2 = 4

82Qh, Kh, Vh = Q[:, s:e], K[:, s:e], V[:, s:e]

Slice all three matrices for this head. Each head gets its own 2-column view of Q, K, and V.

84raw_scores = self.compute_scores(Qh, Kh)

Dot products of head-specific Q with head-specific K. Same as standard attention.

85scaled = self.scale_scores(raw_scores)

Divide by √2 ≈ 1.4142. Same as standard attention.

86alibi_scores = self.apply_alibi_bias(scaled, self.slopes[h])

THE ALIBI STEP: subtract distance penalty. This is the ONLY line that differs from standard attention. After this, nearby tokens have higher effective scores.

87weights = self.compute_weights(alibi_scores)

Softmax over ALiBi-adjusted scores → probabilities that favor nearby tokens.

88output = self.compute_output(weights, Vh)

Weighted sum of value vectors using ALiBi-adjusted weights.

90head_outputs.append(output)

Store this head’s (5×2) output.

91all_weights.append(weights)

Store this head’s (5×5) weight matrix.

93return np.hstack(head_outputs), all_weights

Concatenate all head outputs horizontally: [(5×2), (5×2)] → (5×4).

EXECUTION STATE

np.hstack() = horizontal stack — concatenates along columns

⬆ return: output (5×4) =

        d0       d1       d2       d3
The  0.3274   0.3936   0.1861   0.2613
cat  0.3961   0.1689   0.2120   0.2977
sat  0.1504   0.2154   0.2544   0.3573
on   0.1877   0.2393   0.1811   0.5662
mat  0.2896   0.3695   0.2731   0.4746

115tokens = [...]

The 5 tokens used in every chapter of this book.

EXECUTION STATE

tokens = ['The', 'cat', 'sat', 'on', 'mat']

117Q = np.array([...])

Query matrix — same as Chapter 1. Each row encodes what that token looks for.

EXECUTION STATE

Q (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0

125K = np.array([...])

Key matrix — same as Chapter 1. Each row encodes what that token advertises.

EXECUTION STATE

K (5×4) =

      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5

133V = np.array([...])

Value matrix — same as Chapter 1. Each row is the content that token provides when attended to.

EXECUTION STATE

V (5×4) =

      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5

143alibi = ALiBiAttention(d_model=4, n_heads=2)

Instantiate ALiBi with 4 model dims and 2 heads. Sets d_k=2, scale=√2, slopes=[0.5, 0.25].

EXECUTION STATE

alibi.d_k = 2

alibi.scale = 1.4142

alibi.slopes = [0.5, 0.25]

144output, all_weights = alibi.forward(Q, K, V)

Run the full ALiBi pipeline. For each head: standard scores + ALiBi bias → softmax → weighted sum.

EXECUTION STATE

output (5×4) =

        d0       d1       d2       d3
The  0.3274   0.3936   0.1861   0.2613
cat  0.3961   0.1689   0.2120   0.2977
sat  0.1504   0.2154   0.2544   0.3573
on   0.1877   0.2393   0.1811   0.5662
mat  0.2896   0.3695   0.2731   0.4746

len(all_weights) = 2 (one 5×5 matrix per head)

153alibi.explain(Q, K, V, tokens, query_idx=0)

Print detailed trace for ‘The’ (token 0). Shows per-head scores, distance penalties, and weights.

EXECUTION STATE

query_idx = 0 → tracing ‘The’

154alibi.explain(Q, K, V, tokens, query_idx=1)

Print detailed trace for ‘cat’ (token 1). Compare how Head 1 (steep slope) vs Head 2 (gentle slope) treat nearby vs distant tokens.

EXECUTION STATE

query_idx = 1 → tracing ‘cat’

109 lines without explanation

1import numpy as np
2import math
3
4class ALiBiAttention:
5    """
6    ALiBi — Attention with Linear Biases (Press et al., ICLR 2022)
7
8    Eliminates positional embeddings entirely.
9    Instead, subtracts a head-specific linear distance penalty
10    from attention scores before softmax.
11
12    slope_h = 1 / 2^h  (geometric schedule, no learned params)
13    """
14
15    def __init__(self, d_model: int, n_heads: int):
16        self.d_model = d_model
17        self.n_heads = n_heads
18        self.d_k = d_model // n_heads
19        self.scale = math.sqrt(self.d_k)
20        self.slopes = self._compute_slopes(n_heads)
21
22    def _compute_slopes(self, n_heads: int) -> list:
23        """Geometric slope schedule: m_h = 1/2^h."""
24        return [1.0 / (2 ** (h + 1)) for h in range(n_heads)]
25
26    def _build_alibi_bias(self, N: int, slope: float) -> np.ndarray:
27        """Build the N×N distance penalty matrix."""
28        i_idx = np.arange(N)[:, None]
29        j_idx = np.arange(N)[None, :]
30        return -slope * np.abs(i_idx - j_idx)
31
32    def _softmax(self, x: np.ndarray) -> np.ndarray:
33        """Numerically stable softmax along last axis."""
34        x_shifted = x - np.max(x, axis=-1, keepdims=True)
35        exp_x = np.exp(x_shifted)
36        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
37
38    def compute_scores(self, Qh: np.ndarray, Kh: np.ndarray):
39        """Raw dot-product scores: Qh @ Kh^T."""
40        return Qh @ Kh.T
41
42    def scale_scores(self, scores: np.ndarray) -> np.ndarray:
43        """Divide by sqrt(d_k) to control variance."""
44        return scores / self.scale
45
46    def apply_alibi_bias(self, scaled: np.ndarray, slope: float):
47        """Add the linear distance penalty to scores."""
48        N = scaled.shape[0]
49        bias = self._build_alibi_bias(N, slope)
50        return scaled + bias
51
52    def compute_weights(self, alibi_scores: np.ndarray):
53        """Apply softmax to get attention weights."""
54        return self._softmax(alibi_scores)
55
56    def compute_output(self, weights: np.ndarray, Vh: np.ndarray):
57        """Weighted sum of value vectors."""
58        return weights @ Vh
59
60    def forward(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
61        """
62        Full forward pass.
63
64        Args:
65            Q: Query matrix  (N, d_model)
66            K: Key matrix    (N, d_model)
67            V: Value matrix  (N, d_model)
68
69        Returns:
70            output:      Concatenated head outputs  (N, d_model)
71            all_weights: List of weight matrices per head
72        """
73        head_outputs = []
74        all_weights = []
75
76        for h in range(self.n_heads):
77            s = h * self.d_k
78            e = s + self.d_k
79            Qh, Kh, Vh = Q[:, s:e], K[:, s:e], V[:, s:e]
80
81            raw_scores = self.compute_scores(Qh, Kh)
82            scaled = self.scale_scores(raw_scores)
83            alibi_scores = self.apply_alibi_bias(scaled, self.slopes[h])
84            weights = self.compute_weights(alibi_scores)
85            output = self.compute_output(weights, Vh)
86
87            head_outputs.append(output)
88            all_weights.append(weights)
89
90        return np.hstack(head_outputs), all_weights
91
92    def explain(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,
93                tokens: list, query_idx: int = 0):
94        """Print a detailed trace for a specific query token."""
95        token = tokens[query_idx]
96        print(f"\n=== ALiBi trace for '{token}' (row {query_idx}) ===")
97
98        for h in range(self.n_heads):
99            s = h * self.d_k
100            e = s + self.d_k
101            Qh, Kh, Vh = Q[:, s:e], K[:, s:e], V[:, s:e]
102
103            raw = self.compute_scores(Qh, Kh)
104            scaled = self.scale_scores(raw)
105            alibi = self.apply_alibi_bias(scaled, self.slopes[h])
106            w = self.compute_weights(alibi)
107            out = self.compute_output(w, Vh)
108
109            print(f"\n--- Head {h+1} (slope={self.slopes[h]}) ---")
110            print(f"Q_h{h+1}[{token}] = {Qh[query_idx]}")
111            for j, t in enumerate(tokens):
112                dist = abs(query_idx - j)
113                penalty = self.slopes[h] * dist
114                print(f"  score[{t}] = {scaled[query_idx,j]:.4f}"
115                      f"  penalty = -{penalty:.4f}"
116                      f"  alibi = {alibi[query_idx,j]:.4f}")
117            print(f"  softmax = {np.round(w[query_idx], 4)}")
118            print(f"  output  = {np.round(out[query_idx], 4)}")
119
120
121# ── Shared Example (used in every chapter) ──
122tokens = ["The", "cat", "sat", "on", "mat"]
123
124Q = np.array([
125    [1.0, 0.0, 1.0, 0.0],   # The
126    [0.0, 2.0, 0.0, 1.0],   # cat
127    [1.0, 1.0, 1.0, 0.0],   # sat
128    [0.0, 0.0, 1.0, 1.0],   # on
129    [1.0, 0.0, 0.0, 1.0],   # mat
130])
131
132K = np.array([
133    [0.0, 1.0, 0.0, 1.0],   # The
134    [1.0, 0.0, 1.0, 0.0],   # cat
135    [1.0, 1.0, 0.0, 0.0],   # sat
136    [0.0, 0.0, 1.0, 1.0],   # on
137    [1.0, 0.0, 0.5, 0.5],   # mat
138])
139
140V = np.array([
141    [1.0, 0.0, 0.0, 0.0],   # The
142    [0.0, 1.0, 0.0, 0.0],   # cat
143    [0.0, 0.0, 1.0, 0.0],   # sat
144    [0.0, 0.0, 0.0, 1.0],   # on
145    [0.5, 0.5, 0.5, 0.5],   # mat
146])
147
148# ── Run ──
149alibi = ALiBiAttention(d_model=4, n_heads=2)
150output, all_weights = alibi.forward(Q, K, V)
151
152print("ALiBi Output (5x4):")
153print(np.round(output, 4))
154
155print("\nHead 1 Weights (5x5):")
156print(np.round(all_weights[0], 4))
157
158print("\nHead 2 Weights (5x5):")
159print(np.round(all_weights[1], 4))
160
161# Detailed trace for "The" and "cat"
162alibi.explain(Q, K, V, tokens, query_idx=0)
163alibi.explain(Q, K, V, tokens, query_idx=1)

PyTorch Implementation

The PyTorch version supports GPU acceleration and automatic differentiation. The key difference from the NumPy version: slopes are registered as a buffer (not a parameter), ensuring they move to GPU with the model but are never updated by the optimizer. The bias tensor is built for all heads at once using broadcasting: slopes $(H,) \rightarrow (H,1,1)$ times distances $(N,N) \rightarrow (H,N,N)$ .

ALiBi Attention — PyTorch Implementation

🐍alibi_attention_torch.py

Explanation(16)

Code(139)

1Import PyTorch

torch is the core tensor library. torch.nn provides neural network building blocks (nn.Module). torch.nn.functional provides stateless operations like softmax. math is used for sqrt.

6class ALiBiAttention(nn.Module)

nn.Module subclass. Unlike the NumPy version, this supports GPU acceleration via .cuda(), automatic gradient computation via autograd, and integration with PyTorch training loops. The slopes are registered as a buffer (non-learnable).

16def __init__(self, d_model, n_heads)

Constructor. Computes d_k, scale, and registers the geometric slope schedule as a buffer. Buffers move with the model to GPU/CPU but are not updated by the optimizer.

EXECUTION STATE

⬇ input: d_model = 4

⬇ input: n_heads = 2

self.d_k = 4 // 2 = 2

self.scale = √2 = 1.4142

23slopes = torch.tensor([...])

Compute slopes as a tensor: [1/2, 1/4] = [0.5, 0.25]. These are constants, not learned parameters.

EXECUTION STATE

slopes = tensor([0.5000, 0.2500])

26self.register_buffer("slopes", slopes)

register_buffer stores a tensor that: (1) moves to GPU with model.cuda(), (2) is saved in state_dict, (3) is NOT updated by optimizer. Perfect for fixed constants like ALiBi slopes.

EXECUTION STATE

register_buffer = makes slopes part of model state but not a parameter

28def _build_alibi_bias(self, N, device) → Tensor

Build the (n_heads, N, N) bias tensor for all heads at once. Uses broadcasting to create different penalty strengths per head.

EXECUTION STATE

⬇ input: N = 5 (sequence length)

⬇ input: device = cpu or cuda

⬆ returns = Tensor shape (2, 5, 5) — one 5×5 penalty matrix per head

31positions = torch.arange(N, device=device)

Create position indices [0, 1, 2, 3, 4] on the same device as the inputs.

EXECUTION STATE

positions = tensor([0, 1, 2, 3, 4])

32dist = torch.abs(positions.unsqueeze(1) - positions.unsqueeze(0)).float()

Compute pairwise absolute distances. unsqueeze(1) makes (5,1), unsqueeze(0) makes (1,5). Broadcasting gives (5,5).

EXECUTION STATE

.unsqueeze(1) = shape (5,) → (5,1) — column vector

.unsqueeze(0) = shape (5,) → (1,5) — row vector

dist (5×5) =

     0  1  2  3  4
0 [  0  1  2  3  4 ]
1 [  1  0  1  2  3 ]
2 [  2  1  0  1  2 ]
3 [  3  2  1  0  1 ]
4 [  4  3  2  1  0 ]

36bias = -self.slopes.unsqueeze(1).unsqueeze(2) * dist

Broadcast slopes (2,) → (2,1,1) against dist (5,5) → (2,5,5). Each head gets its own penalty matrix.

EXECUTION STATE

.unsqueeze(1).unsqueeze(2) = slopes shape (2,) → (2,1,1) for broadcasting with (5,5)

bias[0] (slope=0.5) = Head 1: -0.5 * dist

bias[1] (slope=0.25) = Head 2: -0.25 * dist

37return bias

Return the (2, 5, 5) ALiBi bias tensor. bias[h] is the 5×5 penalty matrix for head h.

39def forward(self, Q, K, V, mask) → tuple

Full forward pass. Builds ALiBi bias once for all heads, then loops over heads applying standard attention + ALiBi penalty.

EXECUTION STATE

⬇ input: Q = shape (5, 4)

⬇ input: K = shape (5, 4)

⬇ input: V = shape (5, 4)

⬇ input: mask = None or (5, 5) boolean

57alibi_bias = self._build_alibi_bias(N, device)

Build the ALiBi bias tensor once. Shape: (2, 5, 5). This is the key precomputation.

EXECUTION STATE

alibi_bias.shape = (2, 5, 5)

68scores = scores + alibi_bias[h]

THE ALIBI STEP: add head h’s distance penalty matrix to the scaled scores. This single line is the entire difference from standard attention.

EXECUTION STATE

alibi_bias[h] = 5×5 penalty matrix for head h

73weights = F.softmax(scores, dim=-1)

Softmax along last dimension. The ALiBi penalties have already been applied, so nearby tokens naturally receive more weight.

EXECUTION STATE

dim=-1 = softmax along last axis — each row becomes a probability distribution

85alibi = ALiBiAttention(d_model=4, n_heads=2)

Instantiate. slopes buffer = [0.5, 0.25]. No learned parameters.

EXECUTION STATE

alibi.slopes = tensor([0.5000, 0.2500])

86output, all_weights = alibi(Q, K, V)

Call forward(). Calling the module directly invokes forward() plus any hooks.

EXECUTION STATE

output.shape = (5, 4)

len(all_weights) = 2

123 lines without explanation

1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4import math
5
6class ALiBiAttention(nn.Module):
7    """
8    ALiBi — Attention with Linear Biases (Press et al., ICLR 2022)
9    PyTorch implementation with GPU support.
10
11    Subtracts head-specific linear distance penalty from scores.
12    Zero additional learned parameters.
13    """
14
15    def __init__(self, d_model: int, n_heads: int):
16        super().__init__()
17        self.d_model = d_model
18        self.n_heads = n_heads
19        self.d_k = d_model // n_heads
20        self.scale = math.sqrt(self.d_k)
21
22        # Register slopes as a non-learnable buffer
23        slopes = torch.tensor(
24            [1.0 / (2 ** (h + 1)) for h in range(n_heads)]
25        )
26        self.register_buffer("slopes", slopes)
27
28    def _build_alibi_bias(
29        self, N: int, device: torch.device
30    ) -> torch.Tensor:
31        """Build (n_heads, N, N) ALiBi bias tensor."""
32        positions = torch.arange(N, device=device)
33        dist = torch.abs(
34            positions.unsqueeze(1) - positions.unsqueeze(0)
35        ).float()
36        # slopes: (n_heads,) -> (n_heads, 1, 1) for broadcast
37        bias = -self.slopes.unsqueeze(1).unsqueeze(2) * dist
38        return bias
39
40    def forward(
41        self,
42        Q: torch.Tensor,
43        K: torch.Tensor,
44        V: torch.Tensor,
45        mask: torch.Tensor | None = None,
46    ) -> tuple[torch.Tensor, list[torch.Tensor]]:
47        """
48        Args:
49            Q: (N, d_model)  — query matrix
50            K: (N, d_model)  — key matrix
51            V: (N, d_model)  — value matrix
52            mask: Optional (N, N) boolean mask
53
54        Returns:
55            output:      (N, d_model) — concatenated heads
56            all_weights: list of (N, N) per-head weights
57        """
58        N = Q.size(0)
59        device = Q.device
60
61        # Build ALiBi bias for all heads at once
62        alibi_bias = self._build_alibi_bias(N, device)
63
64        head_outputs = []
65        all_weights = []
66
67        for h in range(self.n_heads):
68            s = h * self.d_k
69            e = s + self.d_k
70            Qh = Q[:, s:e]
71            Kh = K[:, s:e]
72            Vh = V[:, s:e]
73
74            # Scaled dot-product
75            scores = torch.matmul(Qh, Kh.transpose(-2, -1))
76            scores = scores / self.scale
77
78            # Add ALiBi bias (the only difference from standard)
79            scores = scores + alibi_bias[h]
80
81            if mask is not None:
82                scores = scores.masked_fill(mask, float("-inf"))
83
84            weights = F.softmax(scores, dim=-1)
85            output = torch.matmul(weights, Vh)
86
87            head_outputs.append(output)
88            all_weights.append(weights)
89
90        return torch.cat(head_outputs, dim=-1), all_weights
91
92
93# ── Shared Example ──
94tokens = ["The", "cat", "sat", "on", "mat"]
95
96Q = torch.tensor([
97    [1.0, 0.0, 1.0, 0.0],
98    [0.0, 2.0, 0.0, 1.0],
99    [1.0, 1.0, 1.0, 0.0],
100    [0.0, 0.0, 1.0, 1.0],
101    [1.0, 0.0, 0.0, 1.0],
102])
103
104K = torch.tensor([
105    [0.0, 1.0, 0.0, 1.0],
106    [1.0, 0.0, 1.0, 0.0],
107    [1.0, 1.0, 0.0, 0.0],
108    [0.0, 0.0, 1.0, 1.0],
109    [1.0, 0.0, 0.5, 0.5],
110])
111
112V = torch.tensor([
113    [1.0, 0.0, 0.0, 0.0],
114    [0.0, 1.0, 0.0, 0.0],
115    [0.0, 0.0, 1.0, 0.0],
116    [0.0, 0.0, 0.0, 1.0],
117    [0.5, 0.5, 0.5, 0.5],
118])
119
120# ── Run ──
121alibi = ALiBiAttention(d_model=4, n_heads=2)
122output, all_weights = alibi(Q, K, V)
123
124print("ALiBi Output (5x4):")
125print(output.round(decimals=4))
126
127print("\nHead 1 Weights:")
128print(all_weights[0].round(decimals=4))
129
130print("\nHead 2 Weights:")
131print(all_weights[1].round(decimals=4))
132
133# ── GPU: just move tensors ──
134if torch.cuda.is_available():
135    alibi_gpu = alibi.cuda()
136    Q_gpu, K_gpu, V_gpu = Q.cuda(), K.cuda(), V.cuda()
137    out_gpu, _ = alibi_gpu(Q_gpu, K_gpu, V_gpu)
138    print("GPU matches CPU?",
139          torch.allclose(output, out_gpu.cpu(), atol=1e-4))

Key Takeaways

ALiBi replaces positional embeddings with a linear distance penalty subtracted from attention scores before softmax. This is the simplest possible position encoding: zero parameters, zero trigonometric functions, just $-m_h \times |i - j|$ .
Different heads use different slopes via a geometric schedule $m_h = 1/2^h$ . This creates a multi-resolution view: steep slopes for local syntax, gentle slopes for long-range dependencies.
Extrapolation is the primary advantage. Because the bias is a deterministic function of distance (not a learned representation), models trained at length $L$ generalize to $2L$ or beyond without fine-tuning.
Diagonal dominance emerges naturally. Even when content scores are uniform or zero, ALiBi creates meaningful attention patterns where self-attention and nearby tokens dominate.
ALiBi composes with all other optimizations. It works with MQA/GQA (KV-sharing), Flash Attention (tiled computation), and causal masking. The bias adds negligible overhead.

Exercises

Exercise 1: Compute for "sat"

Using the shared Q, K, V matrices, compute the ALiBi attention weights for “sat” (row 2) in both heads. Verify that Head 1 gives $[0.1003,\; 0.1653,\; 0.5527,\; 0.0815,\; 0.1003]$ . Why does "sat" strongly self-attend (55.3%) in Head 1? Consider both the content score and the distance penalty.

Exercise 2: Slope = 0 Reduces to Standard Attention

Prove mathematically that when $m_h = 0$ for all heads, ALiBi produces identical results to standard scaled dot-product attention (Chapter 1). Then verify numerically using the interactive visualizer above (set slope to 0).

Exercise 3: ALiBi + Causal Mask

In autoregressive models (Chapter 3), causal masking sets future positions to $-\infty$ . How does this interact with ALiBi? Write the combined score formula and explain whether the ALiBi penalty is redundant for future positions. Hint: $-\infty + (-m_h \times d) = -\infty$ .

Exercise 4: Comparing Slope Schedules

Press et al. used $m_h = 1/2^h$ . Propose two alternative schedules: (a) linear: $m_h = (H - h + 1) / H$ and (b) constant: $m_h = 0.25$ for all heads. For each, compute the attention weights for "The" across all 5 tokens using Head 1. What is lost when all heads have the same slope?

Exercise 5: Extrapolation Experiment

Extend the shared example to 10 tokens: ["The", "cat", "sat", "on", "mat", "and", "the", "dog", "ran", "home"]. Invent reasonable Q, K, V vectors for the new tokens. Compute ALiBi attention for the original 5 tokens and verify that their weights are nearly unchanged despite the longer sequence. This demonstrates ALiBi's extrapolation: adding more tokens barely affects existing attention patterns.

References

Press, O., Smith, N. A., & Lewis, M. (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” International Conference on Learning Representations (ICLR).
Scao, T. L., et al. (2022). “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.” arXiv:2211.05100.
MosaicML (2023). “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs.” MosaicML Blog.
Su, J., et al. (2021). “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864.
Dao, T., et al. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv:2205.14135.
Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS), 30.

Learning Objectives

The Real Problem

Position Embedding Failure Modes

The Extrapolation Challenge

The Key Insight

From Intuition to Mathematics

The Distance Penalty Insight

Head-Specific Slopes

The Mathematical Definition

Symbol-by-Symbol Breakdown

The Geometric Slope Schedule

Why Not Learn the Slopes?

Interactive: ALiBi Bias Explorer

Interactive: Extrapolation Comparison

Step-by-Step Calculation

Step-by-Step for "The" (Row 0), Head 1 (m=0.5m = 0.5m=0.5)

Step-by-Step for "The" (Row 0), Head 2 (m=0.25m = 0.25m=0.25)

Step-by-Step for "cat" (Row 1), Head 1

Full Attention Weights and Output

ALiBi vs Standard: Side-by-Side

Interpreting the Differences

Key Observation

Applications Across Domains

Long-Document Language Modeling

Code Generation

Scientific Sequence Modeling

Connection to Modern Systems

RoPE vs ALiBi

Flash Attention Compatibility

KV-Cache and Inference

Complexity Analysis

Python Implementation

PyTorch Implementation

Key Takeaways

Exercises

Exercise 1: Compute for "sat"

Exercise 2: Slope = 0 Reduces to Standard Attention

Exercise 3: ALiBi + Causal Mask

Exercise 4: Comparing Slope Schedules

Exercise 5: Extrapolation Experiment

References

Step-by-Step for "The" (Row 0), Head 1 ( $m = 0.5$ )

Step-by-Step for "The" (Row 0), Head 2 ( $m = 0.25$ )