Boo-AI — Master Artificial Intelligence by Building from Scratch

The Black-Box Problem

A modern neural network is, in one sense, a perfectly ordinary function: a sequence of matrix multiplications, element-wise nonlinearities, and additions. Every number it produces can, in principle, be traced back to the inputs. And yet — after training a 100-million-parameter model on a few billion tokens — we often find ourselves staring at the result and asking the same question every scientist asks of a complicated instrument: what is it actually doing?

The difficulty is not that the computation is secret. The difficulty is that it is too detailed to hold in mind at once. A single forward pass through a small CNN may execute a billion floating-point operations; a single transformer layer multiplies tens of millions of numbers. No one reads those numbers one by one. We need pictures: compact, low-dimensional summaries that project the enormous internal state into something a human eye can process in a second.

The point of visualization is not decoration. It is compression. A good picture takes a state that is too big to think about and projects it onto a surface the brain already knows how to read: a grid, a curve, a heatmap. Every technique in this section is a compression scheme for a different aspect of the network.

Visualization is also the first defence against the failure modes of Chapter 20 §1. A loss curve alone will tell you that training has stalled; a saliency map will often tell you why. A histogram of activations will show you dead ReLUs before a metric on the validation set ever notices. A 3-D loss surface will explain why a carefully chosen learning rate still oscillates. The relationship is simple: debugging without pictures is debugging with a blindfold on.

Four Windows Into a Network

There are essentially four quantities inside a trained network that are worth plotting. Each opens a different window onto the model, and each answers a different question.

Quantity	What you plot	Question it answers
Activations (h)	histograms, feature maps, t-SNE of hidden states	What has each layer learned to detect?
Weights (W)	filters as images, weight histograms	What recurring pattern is each neuron tuned to?
Gradients (∂L/∂x)	saliency maps, Grad-CAM, integrated gradients	Which parts of the input is the model using?
Loss (L(θ))	training curves, 1-D line searches, 2-D/3-D loss landscapes	Is optimization healthy? Where are the minima?

Attention weights — which we cover last in this section — are really a special kind of activation, but they deserve their own treatment because they have become the most used and most misused picture in modern deep learning.

1. Activation Visualization: What Each Layer Sees

An activation is the output of a neuron on a particular input. Collect the activations of every unit in a layer on one example and you have the layer's representation of that example. Collect them across a batch and you have a snapshot of how the whole layer is behaving.

What to plot and why

Per-layer histograms. For every layer, plot a histogram of its activations over a validation batch. A healthy hidden layer with ReLU typically has a mean somewhere between $0.2$ and $1$, with a spike at exactly 0 from units that are inactive on those inputs. If the mean drifts toward zero or the spike dominates, ReLUs are dying (see Chapter 20 §1). If the distribution explodes toward huge values, you have an activation-scale problem that residual connections, normalization, or smaller initialisations need to fix.
Feature maps (CNNs). For a conv layer producing shape $(C, H, W)$, each of the $C$ channels is an $H \times W$ image. Tile them. Early layers look like edge and colour detectors; middle layers look like texture and part detectors; late layers look like abstract composites. Zeiler & Fergus (2014) used a closely related visualization, projecting feature-map activations back to pixel space with a deconvolutional network, to show picture by picture that CNN features form a hierarchy.
Per-token activation vectors (transformers). For a transformer hidden state of shape $(T, d_{\text{model}})$, each token $t$ has a $d_{\text{model}}$-dimensional vector. Project that pile of vectors to 2-D with t-SNE or UMAP and tokens that play similar roles tend to show up as clusters in the plane (see §5).

How you collect them: forward hooks

In PyTorch the standard tool is register_forward_hook. It is a callback the framework invokes every time a particular module finishes its forward call, handing you the module, its input, and its output. You stash the output in a dictionary and move on:

Collecting activations with forward hooks

🐍register_hook.py

activations = {}

A plain Python dictionary that hooks will write into. Keys are layer names; values are the captured tensors. Using .detach().cpu() later ensures we don't leak GPU memory by holding on to the autograd graph.

def save_hook(name)

A factory that returns a hook closure. We need a factory (not just a bare hook) because each hook must know WHICH layer it is capturing — the closure captures `name`.

EXECUTION STATE

⬇ input: name = String label for this layer, e.g. 'fc1'. Becomes the dict key.

⬆ returns = A function suitable to pass to register_forward_hook.

def hook(module, inputs, output)

The actual callback. PyTorch gives us three arguments: the Module instance, a tuple of input tensors, and the output tensor. We only save the output here, but you can inspect inputs too — sometimes useful.

EXECUTION STATE

⬇ module = The nn.Module that just ran forward. Handy for logging its class name or parameter stats.

⬇ inputs = Tuple of positional inputs to forward. For nn.Linear this is (x,).

⬇ output = The tensor returned by forward. For a Linear layer with batch=1 and out=32, shape (1, 32).

activations[name] = output.detach().cpu()

Record the output. .detach() strips the autograd graph so we can hold the tensor without keeping the whole backward history alive. .cpu() moves it off GPU so the hook is safe for large models on small GPUs.

EXECUTION STATE

📚 .detach() = Returns a tensor that shares storage but has requires_grad=False and no grad_fn. Essential inside hooks — otherwise every captured activation pins the entire computational graph in memory.

📚 .cpu() = Move tensor data to CPU memory. No-op if already on CPU.

return hook

The factory returns the closure so we can register it on different layers, each with its own name.

model.fc1.register_forward_hook(save_hook('fc1'))

Attach our hook to the fc1 submodule. PyTorch will call it every time fc1 runs forward. The method returns a handle object whose .remove() detaches the hook — important in long-running code to avoid memory leaks.

EXECUTION STATE

📚 .register_forward_hook(fn) = Method of every nn.Module. Adds `fn` to the list of callbacks invoked after forward(). Returns a RemovableHandle — store it and call .remove() when done.

⬇ arg: save_hook('fc1') = The closure — NOT save_hook (the factory). Passing the factory by accident is a common bug.

model.fc2.register_forward_hook(save_hook('fc2'))

Same idea, but attached to fc2 and labelled 'fc2'. You can register as many hooks as you want on as many layers as you want.

_ = model(x)

Run a single forward pass. The underscore discards the returned logits — we only care about the side effect (activations being filled). After this line, activations['fc1'] and activations['fc2'] contain the tensors we captured.

EXECUTION STATE

→ what's now in activations = activations['fc1']: torch.Tensor of shape (1, 32); activations['fc2']: torch.Tensor of shape (1, 10)

→ how to visualise = Matplotlib: plt.hist(activations['fc1'].flatten().numpy(), bins=50). The histogram tells you at a glance whether the layer is well-conditioned or dying.

activations = {}

def save_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

model.fc1.register_forward_hook(save_hook("fc1"))
model.fc2.register_forward_hook(save_hook("fc2"))

_ = model(x)          # activations["fc1"], activations["fc2"] are filled
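The annotations recommend keeping the handle that register_forward_hook returns and calling .remove() on it when you are done. A minimal sketch of that pattern follows. The class `TinyMLP` is a hypothetical stand-in for the chapter's model, with layer sizes chosen to match the shapes quoted above, and the all-zeros input is only a placeholder.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):  # hypothetical stand-in for the chapter's model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(256, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP().eval()
activations = {}

def save_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# Keep every handle so the hooks can be removed afterwards.
handles = [
    layer.register_forward_hook(save_hook(name))
    for name, layer in [("fc1", model.fc1), ("fc2", model.fc2)]
]

x = torch.zeros(1, 256)        # placeholder input
try:
    with torch.no_grad():      # no graph needed when we only inspect activations
        _ = model(x)
finally:
    for h in handles:
        h.remove()             # detach the hooks even if the forward pass fails

shapes = {name: tuple(t.shape) for name, t in activations.items()}
print(shapes)

# After removal, further forward passes no longer write into the dictionary.
activations.clear()
with torch.no_grad():
    _ = model(x)
hooks_still_active = len(activations) > 0
print(hooks_still_active)
```

Because the ReLU in this stand-in is applied functionally inside forward, the hook on fc1 captures the pre-activation output of the Linear layer. To capture post-ReLU values, one option is to make the ReLU its own nn.ReLU submodule and hook that instead.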

Interactive Activation Histograms

The hooks above collect activations; the visualizer below shows what you would actually see when you plot them. Toggle the activation function and the network's health condition. Notice how the signature shifts: a healthy ReLU layer has a tall spike at exactly zero (legitimate sparsity) plus a positive tail; a dying ReLU layer pushes that spike past 50% — those neurons have stopped learning. Saturated tanh/sigmoid layers pile up at the extremes, which is where their gradient is ≈ 0.

[Interactive visualizer: activation histograms]

Read the signature, not the mean. A layer with mean 0.3 and a 60% spike at zero is sick — its "mean" is being propped up by a few overactive units. A layer with mean 0.3 and a smooth distribution is healthy. The shape of the histogram carries more information than any single summary statistic.
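To make "read the signature, not the mean" concrete, here is a small NumPy sketch that builds two synthetic layers with the same mean but very different zero spikes. Both distributions are invented for illustration (they are not measurements from a real network), and the helper name `activation_signature` is hypothetical.

```python
import numpy as np

def activation_signature(acts):
    """Summarise a layer's activations: mean, fraction of exact zeros, 99th percentile."""
    a = np.asarray(acts, dtype=np.float64).ravel()
    return {
        "mean": float(a.mean()),
        "zero_frac": float(np.mean(a == 0.0)),
        "p99": float(np.percentile(a, 99)),
    }

rng = np.random.default_rng(0)

# Synthetic "healthy" layer: ReLU of a shifted Gaussian, most units are active.
healthy = np.maximum(0.0, rng.normal(loc=0.2, scale=0.4, size=10_000))

# Synthetic "sick" layer: about 60% exact zeros, with the active units
# rescaled so that the overall mean matches the healthy layer.
sick = np.zeros(10_000)
active = rng.random(10_000) > 0.6
sick[active] = rng.exponential(scale=1.0, size=int(active.sum()))
sick *= healthy.mean() / sick.mean()

sig_healthy = activation_signature(healthy)
sig_sick = activation_signature(sick)
print(sig_healthy)
print(sig_sick)
```

The two means agree by construction, yet the zero fraction and the upper percentile differ sharply, which is the information a single summary statistic hides.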

2. Weight and Filter Visualization

Weights are static — they do not depend on the input. Plotting them tells you what recurring pattern a neuron is tuned to, independent of any particular example. This is cheap and surprisingly informative.

The canonical example is the first convolutional layer of a vision CNN. Its weight tensor has shape $(C_{\text{out}}, C_{\text{in}}, k_h, k_w)$ — e.g. $(64, 3, 7, 7)$ for an ImageNet ResNet. Each output channel is therefore a $3 \times 7 \times 7$ RGB image. Display the 64 of them in an $8 \times 8$ grid and you get one of the most famous pictures in deep learning: oriented edges, colour opponents, Gabor-like bars — learned, not designed.
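The tiling step can be sketched in a few lines of NumPy. The function name `tile_filters` and the per-filter min-max rescaling are choices made for this sketch, and random weights stand in for a trained layer, so the output here will look like noise rather than edges.

```python
import numpy as np

def tile_filters(weights, grid_rows, grid_cols, pad=1):
    """Arrange conv filters of shape (C_out, 3, k_h, k_w) into one RGB image grid."""
    c_out, c_in, k_h, k_w = weights.shape
    assert c_in == 3, "this sketch only handles RGB first-layer filters"
    assert c_out <= grid_rows * grid_cols, "grid too small for the number of filters"

    # Rescale each filter to [0, 1] independently so its pattern is visible.
    flat = weights.reshape(c_out, -1)
    lo = flat.min(axis=1).reshape(-1, 1, 1, 1)
    hi = flat.max(axis=1).reshape(-1, 1, 1, 1)
    scaled = (weights - lo) / np.maximum(hi - lo, 1e-12)

    # White canvas with a `pad`-pixel border around every tile.
    canvas = np.ones((grid_rows * (k_h + pad) + pad,
                      grid_cols * (k_w + pad) + pad, 3))
    for i in range(c_out):
        r, c = divmod(i, grid_cols)
        top = pad + r * (k_h + pad)
        left = pad + c * (k_w + pad)
        # (3, k_h, k_w) -> (k_h, k_w, 3), the channel-last layout image viewers expect
        canvas[top:top + k_h, left:left + k_w] = scaled[i].transpose(1, 2, 0)
    return canvas

# Random weights stand in for a trained first conv layer (an assumption of this sketch).
rng = np.random.default_rng(0)
fake_conv1 = rng.normal(size=(64, 3, 7, 7))
grid = tile_filters(fake_conv1, grid_rows=8, grid_cols=8)
print(grid.shape)
```

With a trained PyTorch model you would pass the first conv layer's weight tensor converted to NumPy (in torchvision ResNets that layer is named conv1) and display the result with matplotlib's plt.imshow(grid).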

Why is this so informative? Because the operation deep-learning frameworks call convolution is really cross-correlation: each output value is the dot product of the filter weights $w$ with an input patch. Among patches of a given norm, that dot product is largest when the patch is proportional to $w$. So the filter image is, up to scale, the pattern the neuron is searching for. You are looking at the neuron's ideal stimulus.
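A quick numerical check of the "ideal stimulus" idea, as a sketch. The vertical-edge filter is hand-made for illustration rather than learned, and all patches are scaled to unit norm so that the comparison is about pattern, not brightness.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hand-made 7x7 vertical-edge filter (dark left, bright right), scaled to unit norm.
s = np.array([-1.0, -1.0, -1.0, 0.0, 1.0, 1.0, 1.0])
w = np.tile(s, (7, 1))
w /= np.linalg.norm(w)

def response(patch, filt):
    """One conv output value at one location: the sum of element-wise products."""
    return float(np.sum(patch * filt))

def unit(patch):
    return patch / np.linalg.norm(patch)

matching = unit(w.copy())                                  # the filter's own pattern
random_patches = [unit(rng.normal(size=(7, 7))) for _ in range(1000)]
horizontal = unit(w.T.copy())                              # same edge rotated by 90 degrees

r_match = response(matching, w)
r_random_max = max(response(p, w) for p in random_patches)
r_horizontal = response(horizontal, w)
print(r_match, r_random_max, r_horizontal)
```

By the Cauchy-Schwarz inequality no unit-norm patch can beat the matching one, and the rotated edge gives a response of zero because it is orthogonal to the filter.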

For fully-connected layers the analogous move is to reshape each row of $W_1 \in \mathbb{R}^{h \times d}$ back into an image (if $d$ is a flattened image dimension). Each row of the first-layer weight matrix is a template — plotting them recovers recognisable strokes for MNIST digits, face parts for face datasets, and so on.

For deeper layers, the raw weights stop being interpretable: they live in the abstract space of earlier feature maps, not in pixel space. For those you need either activation maximisation (optimise the input to maximally excite a chosen neuron — the basis of DeepDream) or the gradient-based techniques we turn to now.

3. Saliency Maps: The Gradient as a Spotlight

Saliency maps answer a different question from activations or weights. Given this specific prediction, which input features did the model actually use? The cleanest formulation comes from Simonyan, Vedaldi & Zisserman (2014), and it is astonishingly simple.

Let $f_c(x)$ be the score the network assigns to class $c$ on input $x$. A first-order Taylor expansion around $x$ gives $f_c(x + \delta) \approx f_c(x) + \nabla_x f_c(x)^\top \delta$. So the pixel-wise gradient $g = \nabla_x f_c(x)$ tells us, to first order, how much the class score changes when each pixel is nudged. Pixels with large $|g_i|$ are the ones the model is "looking at".

For a multi-layer network we get $g$ by the chain rule. For the one-hidden-layer MLP of this chapter, $f_c(x) = W_2^{(c)} \, \sigma(W_1 x + b_1) + b_2^{(c)}$, where $W_2^{(c)}$ is row $c$ of $W_2$, so

$\nabla_x f_c = \left(W_2^{(c)} \odot \sigma'(W_1 x + b_1)\right)^\top W_1$, where $\sigma'$ is the ReLU derivative, a binary mask. That is three tensor ops: pick a row of $W_2$, multiply it element-wise by a 0/1 mask, and right-multiply the resulting row vector by $W_1$.

Known limitations (worth saying once, clearly)

Vanilla gradients are noisy: tiny local fluctuations of the score function around the input produce speckled heatmaps. SmoothGrad (Smilkov et al. 2017) averages the saliency over $n$ noisy copies of the input and produces far cleaner maps. The visualizer below lets you slide $n$ up and see the cleanup happen.
They show local sensitivity, not causal importance. A pixel can have a huge gradient yet be unnecessary for the prediction globally. Integrated Gradients (Sundararajan, Taly & Yan 2017) addresses this by integrating gradients along a straight-line path from a baseline to the input.
For CNNs, Grad-CAM (Selvaraju et al. 2017) is usually preferred: it weights each channel of the last conv feature map by the spatially averaged gradient of the target logit with respect to that channel, producing a smooth, class-specific heatmap at the resolution of the feature map.
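As an illustration of the path-integration idea behind Integrated Gradients, here is a NumPy sketch on a one-hidden-layer ReLU MLP with the same shapes as this chapter's model. The weights are random rather than trained, the all-zeros baseline is one common choice among several, and the midpoint Riemann sum with 1000 steps is an arbitrary approximation setting. All function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def class_score(x, c, W1, b1, W2, b2):
    """Logit of class c for a one-hidden-layer ReLU MLP."""
    return float(W2[c] @ relu(W1 @ x + b1) + b2[c])

def input_gradient(x, c, W1, b1, W2, b2):
    """Vanilla saliency: gradient of the class-c logit with respect to x."""
    mask = (W1 @ x + b1 > 0).astype(x.dtype)
    return (W2[c] * mask) @ W1

def integrated_gradients(x, baseline, c, params, steps=1000):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += input_gradient(baseline + a * (x - baseline), c, *params)
    avg_grad /= steps
    return (x - baseline) * avg_grad

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.1, (32, 256)), rng.normal(0, 0.1, 32),
          rng.normal(0, 0.1, (10, 32)), rng.normal(0, 0.1, 10))
x = rng.random(256)
baseline = np.zeros(256)     # all-black image: a common but not unique choice

attr = integrated_gradients(x, baseline, 3, params)
gap = class_score(x, 3, *params) - class_score(baseline, 3, *params)
print(attr.sum(), gap)
```

The two printed numbers should nearly coincide. That is the completeness property of the method: the attributions sum to the difference between the score at the input and the score at the baseline, which a plain gradient map does not guarantee.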

Computing Saliency From Scratch (NumPy)

Before we reach for autograd, let us compute the exact same gradient by hand. It is only three tensor operations, and seeing them without framework magic is the fastest way to understand what autograd is doing for us.

Saliency by explicit chain rule

🐍saliency_numpy.py

import numpy as np

NumPy is the numerical-computing backbone of scientific Python. We use it here for matrix math (the @ operator), element-wise operations, and reshaping. Every array is an ndarray stored contiguously in memory and operated on by fast C routines — so a 32×256 matrix-vector product happens as a single BLAS call, not a Python loop.

EXECUTION STATE

numpy = N-dimensional array library. Provides ndarray, linear algebra (np.linalg), broadcasting, reductions.

as np = Universal Python convention — lets us write np.maximum(...) instead of numpy.maximum(...).

def relu(z) → ndarray

Rectified Linear Unit — the most common activation function. It keeps positive values unchanged and sets negative values to zero. Its derivative is piecewise: 1 for z>0, 0 for z<0. This derivative is what makes backpropagation through ReLU so cheap — it is just a multiplicative mask.

EXECUTION STATE

⬇ input: z = Pre-activation vector, any shape. Example: [-1.3, 0.7, 2.1, -0.2]

⬆ returns = Same shape as z, with negatives clipped to 0. Example: [0.0, 0.7, 2.1, 0.0]

return np.maximum(0.0, z)

Element-wise max between the scalar 0.0 and each element of z. NumPy broadcasts 0.0 across z so no Python loop is needed.

EXECUTION STATE

📚 np.maximum(a, b) = Element-wise maximum. Broadcasts a and b to a common shape. Unlike np.max() (which reduces), np.maximum() returns an array the same shape as its inputs.

⬇ arg 1: 0.0 = Scalar threshold. Broadcasts to every element of z.

⬇ arg 2: z = The pre-activation vector we want to rectify.

→ worked example = np.maximum(0.0, [-1.3, 0.7, 2.1, -0.2]) → [0.0, 0.7, 2.1, 0.0]

def forward(x, W1, b1, W2, b2) → (logits, z1, h)

One forward pass through a tiny MLP: affine → ReLU → affine. We return not only the logits but also the intermediates z1 and h because the saliency map needs them for the backward pass (the ReLU mask is computed from z1).

EXECUTION STATE

⬇ input: x = Flattened 16×16 image, shape (256,). Example first 8 values: [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]

⬇ input: W1 = First-layer weight matrix, shape (32, 256). Each of 32 rows is a feature detector over the 256 pixels.

⬇ input: b1 = First-layer bias, shape (32,). One bias per hidden unit.

⬇ input: W2 = Second-layer weight matrix, shape (10, 32). Row c is the 'how strongly does each hidden feature vote for class c' vector.

⬇ input: b2 = Second-layer bias, shape (10,). One per output class.

⬆ returns = Tuple (logits, z1, h): logits (10,) are raw class scores; z1 (32,) are pre-activations; h (32,) are post-ReLU activations. We keep z1 so we can compute ReLU's derivative during backprop.

z1 = W1 @ x + b1

Affine transform. @ is NumPy's matrix multiplication operator. Shapes: (32,256) @ (256,) → (32,). Then we add the bias vector (32,) element-wise.

EXECUTION STATE

📚 @ (matmul) = PEP 465 matrix-multiply operator. For 2-D @ 1-D, it contracts the last axis of the matrix with the vector, producing a 1-D output. Equivalent to np.dot(W1, x) here, and clearer to read.

→ shape math = W1 (32, 256) @ x (256,) → (32,) then + b1 (32,) → (32,)

⬆ z1 (first 4 of 32) = [-0.41, 0.82, 1.57, -0.09, ...]

h = relu(z1)

Apply ReLU to every hidden pre-activation. Negative units are zeroed; positive units pass through unchanged. This is where the network becomes nonlinear — without ReLU (or some nonlinearity), the whole MLP would collapse to a single linear map.

EXECUTION STATE

⬇ input: z1 (first 4 of 32) = [-0.41, 0.82, 1.57, -0.09, ...]

⬆ h (first 4 of 32) = [ 0.00, 0.82, 1.57, 0.00, ...]

→ intuition = Any unit that 'isn't excited enough' is silenced. Those silent units will also have zero gradient — this is why the ReLU mask appears in the backward pass below.

logits = W2 @ h + b2

Second affine layer. Turns the 32 hidden features into 10 class scores. Row c of W2 is literally 'how much does each hidden feature vote for class c'.

EXECUTION STATE

→ shape math = W2 (10, 32) @ h (32,) → (10,) then + b2 (10,) → (10,)

⬆ logits (example for digit '3') = [-1.2, -0.8, 0.4, 3.6, 0.2, -0.5, 0.0, -1.1, 0.3, -0.7] class: 0 1 2 3 4 5 6 7 8 9

→ note = The highest logit (3.6) is at index 3 → predicted class is '3'. No softmax needed for saliency — we differentiate the raw logit.

return logits, z1, h

Return the three values the caller needs. Notice we return the pre-activation z1 explicitly — the backward pass needs it to compute the ReLU mask.

EXECUTION STATE

⬆ return = (logits, z1, h) — three ndarrays. Python packs them as a tuple; the caller unpacks with `logits, z1, h = forward(...)`.

def saliency_map(x, target, W1, b1, W2, b2) → (16, 16) ndarray

Computes ∂logits[target]/∂x — a heatmap the same size as the input image that answers, pixel-by-pixel: 'if I nudge this pixel up by a tiny amount, does the target-class score go up (positive) or down (negative)?'. This is the raw 'vanilla gradient' saliency of Simonyan, Vedaldi & Zisserman (2014).

EXECUTION STATE

⬇ input: x = Flattened image, shape (256,). Values typically in [0, 1].

⬇ input: target = Integer 0..9 — the class whose score we want to explain. If we pick the predicted class, we get 'what makes this prediction what it is'. If we pick a different class, we get 'why wasn't it this class instead?'.

⬆ returns = ndarray of shape (16, 16) — gradient map ready to overlay on the input image.

logits, z1, h = forward(x, W1, b1, W2, b2)

Run the forward pass and capture the intermediates. In a deep framework like PyTorch, autograd stores these automatically; here we do it manually so the chain rule is visible.

EXECUTION STATE

logits = shape (10,) — class scores after the second affine.

z1 = shape (32,) — pre-activations, needed for the ReLU mask.

h = shape (32,) — post-ReLU activations. Not strictly needed for this gradient, but returned for completeness.

dL_dh = W2[target]

The logit for class c is logits[c] = W2[c] · h + b2[c]. Differentiating w.r.t. h is just W2[c] — row c of W2. So the first gradient on the backward path is simply this row.

EXECUTION STATE

📚 W2[target] = NumPy basic integer indexing: picks row `target` of W2 and returns a view, not a copy. If target=3 and W2 is (10, 32), this is a 1-D array of length 32.

⬆ dL_dh (first 4 of 32) = [ 0.15, -0.22, 0.34, 0.08, ...]

→ interpretation = Each entry says: 'if hidden unit j fires 1 unit stronger, logit[target] moves by W2[target, j]'.

dh_dz1 = (z1 > 0).astype(np.float32)

The ReLU mask. ReLU's derivative is 1 where the pre-activation was positive and 0 where it was negative (undefined at 0 — we follow convention and call it 0). This mask is what kills the gradient flowing through 'silent' units.

EXECUTION STATE

📚 z1 > 0 = NumPy comparison — returns a boolean array of the same shape, True where z1 > 0. Example: [False, True, True, False, ...]

📚 .astype(np.float32) = Converts bool → float: True→1.0, False→0.0. This is exactly the derivative of ReLU at each position.

⬆ dh_dz1 (first 4 of 32) = [ 0.0, 1.0, 1.0, 0.0, ...]

→ why this matters for debugging = If a unit's dh_dz1 is 0, no gradient flows through it. A layer full of zeros here is the symptom of the 'dying ReLU' problem that Chapter 20 §1 diagnosed.

grad = (dL_dh * dh_dz1) @ W1

Chain rule in one line. Start at the top: multiply dL_dh by the ReLU mask (element-wise) to get ∂logit/∂z1. Then multiply by W1 to propagate back to the input pixels: ∂logit/∂x = ∂logit/∂z1 · W1.

EXECUTION STATE

📚 element-wise * = NumPy broadcasts — dL_dh (32,) * dh_dz1 (32,) multiplies corresponding entries. Units whose mask is 0 contribute nothing.

→ dL_dh * dh_dz1 (shape 32) = [ 0.00, -0.22, 0.34, 0.00, ...] — the gradient at the hidden layer.

📚 @ W1 = Matrix multiply a (32,) row vector by W1 (32, 256). Result has shape (256,) — one number per input pixel.

→ why 'row × matrix'? = Because we're computing a vector-Jacobian product (∂logit/∂z1)ᵀ · (∂z1/∂x). With z1 = W1 x + b1, ∂z1/∂x = W1, so we're doing (grad at z1)ᵀ · W1, which NumPy expresses as the row @ W1.

⬆ grad (256,) = One scalar per input pixel. Example values: [ 0.02, -0.01, 0.00, ..., 0.05, 0.03, -0.02]

return grad.reshape(16, 16)

Reshape the 256-length gradient back into the 16×16 grid of the original image so we can overlay it as a heatmap. This 2-D array IS the saliency map.

EXECUTION STATE

📚 .reshape(16, 16) = NumPy method. Requires that 16 × 16 = 256, the current total size. For a contiguous array like this one it returns a view: no data is copied, the same buffer is just re-interpreted as 2-D.

⬆ return (16×16) =

A gradient image. Red (positive) pixels, if brightened, push the target logit up; blue (negative) pixels push it down. Row-by-row the values look like:
 [[ 0.01,  0.03,  0.05, ...,  0.02],
  [ 0.04,  0.12,  0.22, ..., -0.01],
  ...]

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: x (256,) -> h (32,) -> logits (10,)."""
    z1 = W1 @ x + b1            # pre-activation, shape (32,)
    h  = relu(z1)               # hidden activations, shape (32,)
    logits = W2 @ h + b2        # class scores, shape (10,)
    return logits, z1, h

def saliency_map(x, target, W1, b1, W2, b2):
    """Gradient of logits[target] w.r.t. the input pixels x."""
    # Forward pass, keep intermediates for backward
    logits, z1, h = forward(x, W1, b1, W2, b2)

    # d logits[c] / d h  =  row c of W2  -> shape (32,)
    dL_dh = W2[target]

    # d h / d z1 = 1 where z1 > 0 else 0 (ReLU derivative), shape (32,)
    dh_dz1 = (z1 > 0).astype(np.float32)

    # Chain: d logits[c] / d x = (dL_dh * dh_dz1) @ W1
    grad = (dL_dh * dh_dz1) @ W1  # shape (256,)

    return grad.reshape(16, 16)   # reshape to image grid

Three lines — pick row, mask, multiply — reproduce what a deep-learning framework does. For networks with many layers this bookkeeping obviously gets painful, which is precisely the problem autograd was invented to solve.
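A hand-written gradient is easy to get subtly wrong, so it is worth checking against finite differences. The sketch below restates the three-operation gradient compactly and compares it with a central-difference estimate. The random weights, the step size eps=1e-5, and the function names are assumptions of this sketch, not part of the chapter's code.

```python
import numpy as np

def forward_logit(x, c, W1, b1, W2, b2):
    """Class-c logit of the one-hidden-layer ReLU MLP."""
    return W2[c] @ np.maximum(0.0, W1 @ x + b1) + b2[c]

def analytic_grad(x, c, W1, b1, W2, b2):
    """Pick a row of W2, apply the ReLU mask, multiply by W1."""
    z1 = W1 @ x + b1
    return (W2[c] * (z1 > 0)) @ W1

def numeric_grad(x, c, W1, b1, W2, b2, eps=1e-5):
    """Central differences, one input pixel at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (forward_logit(x + step, c, W1, b1, W2, b2)
                - forward_logit(x - step, c, W1, b1, W2, b2)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (32, 256)), rng.normal(0, 0.1, 32)
W2, b2 = rng.normal(0, 0.1, (10, 32)), rng.normal(0, 0.1, 10)
x = rng.random(256)

g_analytic = analytic_grad(x, 3, W1, b1, W2, b2)
g_numeric = numeric_grad(x, 3, W1, b1, W2, b2)
max_err = float(np.max(np.abs(g_analytic - g_numeric)))
print(max_err)
```

The agreement is tight because the network is piecewise linear: away from the points where a pre-activation crosses zero, the central difference is exact up to rounding.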

The Same Thing in PyTorch (autograd.grad)

PyTorch turns the explicit chain rule into a two-line call. You set requires_grad=True on the input, run the forward pass, pick a scalar output, and call torch.autograd.grad. Autograd walks the stored computation graph backwards and hands you the gradient.

Saliency via autograd

🐍saliency_pytorch.py

import torch

PyTorch is a tensor library with automatic differentiation. Every operation on a Tensor with requires_grad=True is recorded in a computational graph so we can later call .backward() or torch.autograd.grad() to get gradients — no manual chain rule needed.

EXECUTION STATE

torch = Core PyTorch namespace: tensors, autograd, linear algebra, CUDA, JIT.

import torch.nn as nn

torch.nn is PyTorch's neural-network building-block namespace. It contains Module (base class), Linear, Conv2d, activation functions, loss functions, and containers like Sequential.

EXECUTION STATE

nn = Shorthand for torch.nn. Lets us write nn.Linear instead of torch.nn.Linear.

class TinyClassifier(nn.Module)

Defines a neural network by subclassing nn.Module. Any attribute that is itself an nn.Module (like nn.Linear) is automatically registered as a submodule, and its weights and biases become visible through .parameters(); this is how the optimizer finds them.

EXECUTION STATE

📚 nn.Module = Base class for all PyTorch networks. Provides: .parameters() (iterate learnables), .to(device), .eval()/.train(), .state_dict() (save/load), and integration with autograd.

def __init__(self)

Constructor. Called when we write TinyClassifier(). We must call super().__init__() so nn.Module can set up its internal bookkeeping before we assign any submodules.

super().__init__()

Initialise the nn.Module parent. Without this line, attempts to assign self.fc1 = nn.Linear(...) would raise 'cannot assign module before Module.__init__() call'.

self.fc1 = nn.Linear(256, 32)

Fully-connected (affine) layer: y = x W^T + b. PyTorch stores W as shape (out_features, in_features) and b as shape (out_features,). It is automatically initialised (Kaiming-uniform by default).

EXECUTION STATE

📚 nn.Linear(in_features, out_features, bias=True) = PyTorch module. On forward: output = x @ weight.T + bias. The weight tensor has shape (out_features, in_features).

⬇ arg 1: in_features = 256 = Input dimension — our flattened 16×16 image. This determines the number of columns in weight.

⬇ arg 2: out_features = 32 = Output dimension — 32 hidden units. This determines the number of rows in weight.

→ learnable params = weight: (32, 256) = 8,192 params + bias: (32,) = 32 params → 8,224 total.

self.fc2 = nn.Linear(32, 10)

Second fully-connected layer. Projects 32 hidden features to 10 class logits.

EXECUTION STATE

⬇ arg 1: in_features = 32 = Matches fc1's output width — must agree so that fc2 can consume fc1's output.

⬇ arg 2: out_features = 10 = One logit per digit class 0–9.

→ learnable params = weight: (10, 32) = 320 + bias: (10,) = 10 → 330 total.

def forward(self, x)

Defines the computation the module performs. Writing model(x) invokes nn.Module.__call__, which runs any forward pre-hooks, then forward, then any forward hooks. Calling model.forward(x) directly skips those hooks (including any registered with register_forward_hook), so always use model(x).

EXECUTION STATE

⬇ input: x = Tensor of shape (batch, 256). nn.Linear accepts any leading dimensions, including none, but we keep an explicit batch axis and use batch=1 at inference.

return self.fc2(torch.relu(self.fc1(x)))

The whole forward pass as one nested expression: fc1 → ReLU → fc2. torch.relu is the functional ReLU — identical to F.relu or nn.ReLU()(x) but without constructing a module.

EXECUTION STATE

📚 torch.relu(z) = Element-wise max(0, z). Autograd knows its derivative (1 where z>0 else 0), so gradients automatically flow correctly during backprop.

→ evaluation order = 1) fc1(x): (1,256) → (1,32) 2) torch.relu(...): (1,32) → (1,32), negatives zeroed 3) fc2(...): (1,32) → (1,10)

model = TinyClassifier().eval()

Instantiate the model and flip it to eval mode. .eval() turns off dropout and switches BatchNorm to its running statistics. Our model has neither, but it is a good habit to always call .eval() before inference so that dropout noise or per-batch statistics never leak into your results.

EXECUTION STATE

📚 .eval() = Sets self.training = False on every submodule. Returns self, so it can be chained.

→ .eval() vs .train() = .train() enables dropout and uses batch statistics for BN. .eval() disables dropout and uses running statistics. For a model with dropout, compute saliency in .eval() mode; otherwise the map changes on every call.

x = torch.tensor(digit_flat, dtype=torch.float32).unsqueeze(0)

Build the input tensor. torch.tensor copies the Python list/NumPy array into a new PyTorch tensor with the requested dtype. .unsqueeze(0) prepends a batch dimension of size 1, turning (256,) into (1, 256). nn.Linear would also accept the unbatched (256,) vector, but the indexing logits[0, target_class] further down assumes the batch axis is there.

EXECUTION STATE

📚 torch.tensor(data, dtype) = Creates a new tensor from the given data. Unlike torch.as_tensor, it always copies. dtype=torch.float32 is the standard for forward/backward math on GPU.

📚 .unsqueeze(0) = Insert a new axis of size 1 at position 0. (256,) → (1, 256). This lets us treat a single sample as a mini-batch of one.

→ concrete example = digit_flat (list length 256) → torch.tensor([...]) shape (256,) → .unsqueeze(0) → shape (1, 256)

x.requires_grad_(True)

Turn on gradient tracking for x IN-PLACE (the trailing underscore). From now on, every op on x is recorded in the autograd graph. This is how we'll later ask 'what is ∂logit/∂x?'. By default, inputs have requires_grad=False; model parameters have it True — here we flip it on the input because saliency differentiates w.r.t. the input, not the parameters.

EXECUTION STATE

📚 .requires_grad_(True) = In-place method (note the trailing _). Equivalent to x.requires_grad = True but returns self so it can be chained.

→ intuition = We are about to treat x as a 'learnable' — not to update it, but to differentiate through it. Standard trick for saliency, adversarial examples, and input optimisation.

logits = model(x)

Run the forward pass. Because x requires grad, PyTorch builds a computational graph: x → fc1 → relu → fc2 → logits. Every intermediate tensor knows how it was produced and can be differentiated.

EXECUTION STATE

⬆ logits = Tensor of shape (1, 10). Example: [[-1.2, -0.8, 0.4, 3.6, 0.2, -0.5, 0.0, -1.1, 0.3, -0.7]]

→ autograd fact = logits.grad_fn is an AddmmBackward node (printed as <AddmmBackward0> in recent PyTorch versions). Walking grad_fn.next_functions traverses the whole computation graph backward.

target_logit = logits[0, target_class]

Select a single scalar — the logit we want to explain. Autograd requires a scalar output to backpropagate from. If we had multiple outputs we'd need a grad_outputs argument; a single scalar makes the call clean.

EXECUTION STATE

📚 logits[0, target_class] = Standard tensor indexing. [0] selects batch item 0; [target_class] selects the class column. Returns a 0-D tensor (scalar).

→ worked example = logits[0, 3] → tensor(3.6000, grad_fn=<SelectBackward0>) in recent PyTorch versions: a scalar that still knows how to be differentiated.

(grad,) = torch.autograd.grad(target_logit, x)

Computes ∂ target_logit / ∂ x by autograd, WITHOUT accumulating into x.grad. This is the clean, side-effect-free way to get a gradient. The result is a tuple; we unpack it with the trailing comma.

EXECUTION STATE

📚 torch.autograd.grad(outputs, inputs) = Functional interface to the backward engine. Returns a tuple of gradients, one per input tensor. Does NOT modify any .grad attribute. Preferred over .backward() when you want a one-shot gradient without polluting state.

⬇ arg 1: outputs = target_logit = The scalar we differentiate. Must be a scalar unless we also pass grad_outputs.

⬇ arg 2: inputs = x = The tensor(s) we differentiate with respect to. Must have requires_grad=True.

⬆ grad (tuple) = (tensor of shape (1, 256),). Element [0, i] is ∂ target_logit / ∂ x[0, i] — exactly the same quantity we computed by hand in the NumPy version.

→ why the trailing comma? = Tuple unpacking: (grad,) = (...) binds the single element to `grad`. Without the comma, `grad = torch.autograd.grad(...)` leaves grad as a tuple — easy to forget and awkward later.

saliency = grad.view(16, 16).abs()

Reshape (1, 256) → (16, 16) to recover the image grid, then take absolute value for an unsigned saliency map. .abs() is the convention from Simonyan et al. (2014) — magnitude of the gradient, ignoring direction, because users usually want 'where did the model look?' rather than 'which way should this pixel move?'.

EXECUTION STATE

📚 .view(shape) = Reinterpret the same memory as a different shape. Cheap — no copy. Requires the tensor to be contiguous; if not, use .reshape() which copies when needed.

→ what happens to the batch axis? = .view(16, 16) only needs the element count to match: (1, 256) holds 256 values and 16×16 = 256, so the size-1 batch axis disappears in the same call.

📚 .abs() = Element-wise absolute value. Produces a non-negative heatmap. For signed saliency (red = push up, blue = push down), SKIP .abs() and colour by sign, which is what the interactive visualizer below does.

⬆ saliency (16, 16) =

A non-negative heatmap ready to overlay on the input digit. Peaks at strokes that the model uses to identify the target class.

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(256, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Assumes trained weights have been loaded into the model, and that digit_flat
# (a flattened 16×16 digit image) and target_class (an int in 0..9) are defined.
model = TinyClassifier().eval()
x = torch.tensor(digit_flat, dtype=torch.float32).unsqueeze(0)  # (1, 256)
x.requires_grad_(True)                                          # turn on autograd

logits = model(x)                                               # (1, 10)
target_logit = logits[0, target_class]                          # scalar

(grad,) = torch.autograd.grad(target_logit, x)                  # (1, 256)
saliency = grad.view(16, 16).abs()                              # unsigned heatmap

Why these two views are really the same. Autograd does not do anything we could not do by hand; it just keeps the bookkeeping. Every grad_fn attribute on a tensor points to the same local chain-rule step (multiply by the derivative of this op, then pass the result along) that we wrote out explicitly in the NumPy version.

Interactive Saliency Heatmap

Pick a digit, pick a target class, and watch the map. When input and target agree (e.g. the digit is a 3 and we explain class 3), the red pixels trace the strokes that carry the class. Ask for a different class and watch how the map changes — blue pixels are the evidence that is currently missing, red pixels are the evidence that is (wrongly) present.

Crank up the SmoothGrad samples slider to see how averaging the gradient over noise-jittered copies of the input removes the speckle characteristic of vanilla gradients.
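As a rough sketch of what the SmoothGrad slider computes (Smilkov et al. 2017), the NumPy snippet below averages the input gradient over Gaussian-jittered copies of one input. The two-layer network with random weights, the helper names, and the values of `n_samples` and `sigma` are illustrative assumptions, not the visualizer's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained 256 -> 32 -> 10 ReLU classifier.
W1, b1 = rng.normal(0, 0.1, (32, 256)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (10, 32)), np.zeros(10)

def input_gradient(x, target_class):
    """Gradient of one logit with respect to the input, via the chain rule."""
    pre = W1 @ x + b1                          # (32,) hidden pre-activations
    relu_mask = (pre > 0).astype(x.dtype)      # derivative of ReLU
    # d logit / d x = W1^T (relu'(pre) * W2[target_class])
    return W1.T @ (relu_mask * W2[target_class])

def smoothgrad(x, target_class, n_samples=50, sigma=0.1):
    """Average the input gradient over noise-jittered copies of x."""
    total = np.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += input_gradient(noisy, target_class)
    return total / n_samples

x = rng.random(256)                            # placeholder for a flattened 16x16 digit
vanilla = np.abs(input_gradient(x, 3)).reshape(16, 16)
smooth = np.abs(smoothgrad(x, 3)).reshape(16, 16)
```

With `sigma=0` the average collapses back to the vanilla gradient, which is a convenient sanity check before tuning the noise level.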


4. Loss-Landscape Visualization

A neural network's loss function is a scalar-valued function of tens of millions of parameters: $L : \mathbb{R}^d \to \mathbb{R}$ . We cannot see $d$ -dimensional surfaces. But we can project. Pick two directions $\boldsymbol{u}, \boldsymbol{v}$ in parameter space and plot $L(\theta^\star + a \boldsymbol{u} + b \boldsymbol{v})$ as a function of the scalars $a, b$ . The choice of $\boldsymbol{u}, \boldsymbol{v}$ matters; the clean modern choice is filter normalisation (Li et al. 2018), which rescales random directions so that the resulting surface is comparable across architectures.
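To make the projection concrete, here is a minimal NumPy sketch that evaluates $L(\theta^\star + a \boldsymbol{u} + b \boldsymbol{v})$ on a grid. The least-squares toy problem, the grid range, and the helper `random_direction` are assumptions chosen for illustration; the whole-vector rescaling used here is only a crude stand-in for the per-filter normalisation of Li et al. (2018).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares regression with d = 20 parameters.
X = rng.normal(size=(200, 20))
theta_true = rng.normal(size=20)
y = X @ theta_true

def loss(theta):
    return float(np.mean((X @ theta - y) ** 2))

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]    # the minimiser

def random_direction(like):
    """Random direction rescaled to the norm of `like` (simplified normalisation)."""
    d = rng.normal(size=like.shape)
    return d * (np.linalg.norm(like) / np.linalg.norm(d))

u, v = random_direction(theta_star), random_direction(theta_star)

# Evaluate L(theta* + a u + b v) on a 2-D grid of (a, b).
coords = np.linspace(-1.0, 1.0, 41)
surface = np.array([[loss(theta_star + a * u + b * v) for b in coords]
                    for a in coords])
```

The resulting `surface` array is what a contour or 3-D surface plot would display, with `coords` on both horizontal axes.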

Even with only two parameters (as in our toy below) you can already see the four qualitative behaviours that dominate real training:

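Convex bowl. Gradient descent glides to the minimum. Small learning rates are painfully slow; moderately large ones overshoot but still converge, and past a threshold set by the curvature the iterates diverge.
Ill-conditioned ravine. One direction is steep, another is gentle. Plain SGD zig-zags across the steep walls, making tiny progress along the valley floor. Momentum accumulates velocity down the floor and damps the zig-zag, which is why SGD with momentum and Adam cope with such surfaces; RMSProp and Adam also rescale each coordinate by a running estimate of its gradient magnitude, which attacks the same ill-conditioning from another side.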
Saddle points. Gradient is zero but it is not a minimum — one direction curves up, another down. In very high dimensions most critical points are saddles, and this is a big part of why training can converge despite the non-convex landscape.
Non-convex, multi-modal. Multiple local minima of different quality. The initialisation determines which basin you fall into — which is one reason random seeds matter.

Interactive 3D Loss Landscape

Pick a surface. Set the learning rate and momentum. Drop the ball at an initial $(\theta_1, \theta_2)$ and watch SGD find its way. Drag to rotate; the axes are the two parameters and the height is the loss.

On the Convex Bowl, any reasonable learning rate converges. Push $\eta$ above about $0.2$ and you'll see the ball overshoot and oscillate before settling.
On the Ill-conditioned Ravine, set momentum to 0 — SGD zig-zags down the walls. Push momentum up to $0.9$ and the ball rolls smoothly along the floor. This is literally what momentum does to real optimisation.
On the Saddle Point, start near the origin. The gradient is small so progress is slow — but once you tip off the saddle in the negative-curvature direction the ball accelerates away. Momentum helps here too: it remembers the earlier direction and won't be fooled by a flat patch.
On the Non-convex surface, change the starting point and watch different runs fall into different basins.
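The ravine experiment can be reproduced offline with a few lines of plain Python. This is a sketch under stated assumptions: the quadratic $L = \tfrac{1}{2}(x^2 + 50y^2)$, the starting point, the learning rate, and the step count are made-up values, not the surfaces used by the visualizer.

```python
def loss(theta):
    """Ill-conditioned quadratic: gentle along x, steep along y."""
    x, y = theta
    return 0.5 * (x ** 2 + 50.0 * y ** 2)

def grad(theta):
    x, y = theta
    return (x, 50.0 * y)

def sgd(theta, lr, momentum, steps):
    """Heavy-ball update: v <- momentum * v - lr * grad; theta <- theta + v."""
    x, y = theta
    vx, vy = 0.0, 0.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad((x, y))
        vx = momentum * vx - lr * gx
        vy = momentum * vy - lr * gy
        x, y = x + vx, y + vy
        path.append((x, y))
    return path

start = (10.0, 1.0)
plain = sgd(start, lr=0.03, momentum=0.0, steps=100)
heavy = sgd(start, lr=0.03, momentum=0.9, steps=100)
print("final loss without momentum:", loss(plain[-1]))
print("final loss with momentum:   ", loss(heavy[-1]))
```

Without momentum the $y$ coordinate flips sign on every step (the zig-zag) while $x$ shrinks by only 3% per step; with momentum both coordinates contract at a similar rate, so the final loss is much lower for the same number of steps.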


5. Embedding Projections: t-SNE, UMAP, PCA

A hidden-layer representation is a set of points in $\mathbb{R}^d$ . To see them, we have to project to $\mathbb{R}^2$ or $\mathbb{R}^3$ . Three methods dominate:

Method	Preserves	When to use	Gotcha
PCA	Global variance (linear)	Fast first look; baselines; when clusters are well-separated	Misses curved / nonlinear structure
t-SNE	Local neighbourhoods (probabilistic)	Beautiful 2-D cluster plots; small-to-medium datasets	Distances and cluster SIZES are not meaningful; stochastic
UMAP	Local + some global (topological)	Large datasets; sharper clusters than t-SNE; faster	Same caveats on distance; sensitive to n_neighbors

The standard sanity check during training is: take the penultimate-layer activations on a labelled validation set, colour them by class, and project. A network that has actually learned its task produces clean, class-coloured clusters. A network with a representation problem produces a single blob. This picture is much more sensitive than a loss curve — a model can have an excellent training loss and still have a badly tangled representation.
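A minimal version of this sanity check, using PCA only so that it needs nothing beyond NumPy, might look like the sketch below. The simulated features (three classes with well-separated means in 64 dimensions) are an assumption standing in for real penultimate-layer activations, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(features):
    """Project (n, d) features onto their top two principal components."""
    centred = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

rng = np.random.default_rng(0)

# Simulated "trained" features: 3 classes, 64 dims, well-separated class means.
labels = np.repeat(np.arange(3), 100)
means = rng.normal(scale=5.0, size=(3, 64))
feats = means[labels] + rng.normal(size=(300, 64))

proj = pca_2d(feats)                    # (300, 2), ready for a scatter plot
centroids = np.array([proj[labels == c].mean(axis=0) for c in range(3)])
spread = np.mean([proj[labels == c].std(axis=0).mean() for c in range(3)])
```

Scatter-plotting `proj` coloured by `labels` gives the picture described above; comparing the distances between `centroids` with `spread` is a crude numeric proxy for "clean clusters versus one blob".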

For transformers, the same trick on word embeddings shows linear analogies (king − man + woman ≈ queen) and cleanly separated part-of-speech clusters — direct visual evidence that the embedding geometry has meaning. For sentence-level $[\text{CLS}]$ vectors it shows document topics. For image encoders it shows object categories as tight clusters arranged by visual similarity.

Interactive Embedding Projection

Compare the three projection methods on the same simulated penultimate-layer features. Toggle the training state to see how clusters separate as the network learns: at initialisation, projections of all three methods look like a single hairy blob; at convergence, well-defined class clusters appear. Notice the visual differences between methods — PCA produces elongated, axis-aligned clusters; t-SNE produces tighter blobs but meaningless inter-cluster distances; UMAP sits between them.


Distances in t-SNE and UMAP plots are not metric. Two clusters that look close in a t-SNE plot are not necessarily close in the original feature space — the optimisation only preserves local neighbour structure. Use these plots to argue "the network learned to separate these classes", never "class A is closer to class B than to class C".

6. Attention Visualization — A Window Into Transformers

Attention weights are, on paper, just a $T \times T$ matrix of non-negative entries whose rows sum to 1. They are computed inside the layer by $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$ where $Q, K \in \mathbb{R}^{T \times d_k}$ .

Because each row of $A$ is a probability distribution over tokens, plotting $A$ as a heatmap is a natural move: the cell $A_{ij}$ is the fraction of query $i$ 's output that comes from token $j$ . A bright cell at $(i, j)$ says "when computing the representation of token $i$ , the model mostly attended to token $j$ ".
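The formula is short enough to check numerically. The sketch below computes $A$ for random $Q$ and $K$; the sizes `T = 5` and `d_k = 8` and the function name are arbitrary choices for illustration.

```python
import numpy as np

def attention_weights(Q, K):
    """A = softmax(Q K^T / sqrt(d_k)), with the softmax taken over each row."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 5, 8
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
A = attention_weights(Q, K)        # (T, T); row i is the distribution for query i
```

Passing `A` to any heatmap routine (for example matplotlib's `imshow`) yields the picture discussed here, with queries on the rows and keys on the columns.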

Interpretability studies of BERT and GPT consistently find that different heads specialise in strikingly interpretable patterns:

Previous / next token heads. The bright cell is always one off-diagonal — the head is a positional pointer.
Coreference / dependency heads. Pronouns and verbs attend back to the noun they refer to; adjectives attend to the noun they modify (Clark et al. 2019, Voita et al. 2019).
Broadcast / sink heads. Almost every token attends back to the first token, which acts as a slot that absorbs attention mass the head does not need elsewhere (the attention sinks of Xiao et al. 2023).

Caveat. Attention weights are not, by themselves, explanations. Jain & Wallace (2019) showed that many different attention patterns can produce the same output — so an attention heatmap tells you where the model routed information from, not necessarily which tokens were causally responsible for the prediction. Use attention visualisations the way a doctor uses an X-ray: a genuine window, not a final verdict.

Interactive Multi-Head Attention Heatmap

Hover to read the exact attention weight for any $(\text{query}, \text{key})$ pair. Swap between heads to see how different heads implement different routing patterns. The temperature slider divides raw scores by a scale factor — this is a direct visualisation of why attention is divided by $\sqrt{d_k}$ : large scores collapse the softmax into a hard one-hot and kill gradient flow. Turn on the causal mask and watch the upper triangle vanish — this is what makes a GPT-style decoder autoregressive.
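Both controls are easy to imitate in NumPy. In this sketch the raw scores are random and deliberately large, and the function names and temperature values are illustrative assumptions rather than the visualizer's implementation.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def masked_attention(scores, temperature=1.0, causal=False):
    """Row-wise softmax of scores / temperature, optionally with a causal mask."""
    scaled = scores / temperature
    if causal:
        T = scaled.shape[0]
        future = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly above the diagonal
        scaled = np.where(future, -np.inf, scaled)           # future keys get zero weight
    return softmax_rows(scaled)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4)) * 4.0                # deliberately large raw scores

sharp = masked_attention(scores, temperature=0.1)     # rows close to one-hot
soft = masked_attention(scores, temperature=10.0)     # rows close to uniform
causal = masked_attention(scores, causal=True)        # upper triangle is exactly zero
```

Comparing `sharp.max(axis=1)` with `soft.max(axis=1)` shows the collapse towards one-hot rows as the effective scores grow, which is the saturation that the $\sqrt{d_k}$ scaling is meant to avoid.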


Visualization in Modern Transformer Systems

Every visualization technique in this section has a direct, working place in large modern systems. Below is a brief tour of how the pictures you just built connect to the production concerns of billion-parameter models.

Multi-head attention — as many heatmaps as there are heads

A single transformer block has $h$ attention heads running in parallel (typically $h \in \{8, 16, 32, 64\}$ ), each with its own $Q, K, V$ projections and therefore its own attention matrix. The picture that unlocked interpretability of transformers is: plot one heatmap per head, tile them, and look at the specialisation. That is literally the picture the visualizer above makes, one head at a time.

The computational problem multi-head solves is representation diversity: splitting $d_{\text{model}}$ across $h$ heads lets each head attend to a different linguistic relation at the same token position. The tradeoff is $h$ separate softmaxes (and $h$ separate $T \times T$ matrices) — which is exactly what Flash Attention had to optimise.

Flash Attention — a picture the memory hierarchy never shows you

Naïve attention materialises the full $T \times T$ matrix $QK^\top$ in GPU HBM before softmax and before multiplying by $V$ . At long context the memory bill is $\mathcal{O}(T^2)$ and the HBM traffic is the bottleneck (Dao et al. 2022). Flash Attention avoids this by tiling the computation into blocks that fit in on-chip SRAM, never materialising the big matrix.

The visualisation consequence: Flash Attention produces the same $A$ as naive attention, up to floating-point rounding, so the heatmap you just interacted with is unchanged. Flash Attention is a memory and IO optimisation, not a behavioural one. You can still plot $A$, but you now have to opt in to materialising it: in PyTorch, F.scaled_dot_product_attention returns only the attention output, never $A$, so to visualise you must either recompute the weights with a plain implementation or use a module that can return them (for example nn.MultiheadAttention with need_weights=True). Problem solved: $\mathcal{O}(T^2)$ HBM traffic. Tradeoff: a more complex kernel; attention weights are not free to inspect.
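The claim that tiling changes nothing about the result can be checked on the CPU. The sketch below is a simplified NumPy illustration of the running-softmax idea only: it tiles over keys and values but not queries, and it says nothing about the fused GPU kernel of Dao et al. (2022). The block size and tensor shapes are arbitrary assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Materialises the full (T, T) weight matrix A."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

def tiled_attention(Q, K, V, block=4):
    """Visits keys/values block by block, keeping a running softmax.
    Only a (T, block) slice of scores exists at any time."""
    T, d_k = Q.shape
    out = np.zeros((T, V.shape[-1]))
    running_max = np.full((T, 1), -np.inf)
    running_sum = np.zeros((T, 1))
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                        # (T, block)
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)            # correct earlier partial sums
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ Vb
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
ref_out, A = naive_attention(Q, K, V)
tiled_out = tiled_attention(Q, K, V)
```

`tiled_out` matches `ref_out` to floating-point precision, yet `tiled_attention` never holds `A`, which is why a separate plain pass is needed when you want the heatmap.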

Positional encodings — visualise the coordinate system

Plot the positional embedding matrix as a $T \times d_{\text{model}}$ heatmap and you can see the structure of the coordinate system the model uses for order. Sinusoidal encodings (Vaswani et al. 2017) show horizontal bands of different frequencies; RoPE (Su et al. 2021) shows pairwise rotations whose angles scale with position; ALiBi (Press et al. 2021) is a linear bias added straight onto $QK^\top$ and looks like a soft lower-triangular ramp on the attention heatmap itself.

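Problem solved: self-attention without positional information is permutation-equivariant (shuffle the input tokens and the outputs shuffle the same way), so positional encodings break the symmetry and let the model tell "cat bites dog" from "dog bites cat". Tradeoff: learned absolute encodings are undefined beyond the training length; RoPE and ALiBi are defined at every position, and ALiBi in particular was designed for length extrapolation. You can see this in the visualisation: learned absolute bands stop at $T_{\text{train}}$, while RoPE's rotations and ALiBi's linear bias continue to any length.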
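For the sinusoidal case, the matrix to plot can be generated directly from the formula in Vaswani et al. (2017). The sketch below assumes an even `d_model`; the sizes `T=64` and `d_model=32` are arbitrary.

```python
import numpy as np

def sinusoidal_encoding(T, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(T)[:, None]                      # (T, 1)
    two_i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even columns
    pe[:, 1::2] = np.cos(angles)                     # odd columns
    return pe

pe = sinusoidal_encoding(T=64, d_model=32)   # rows = positions, columns = dimensions
```

Displayed as a heatmap, the low-index columns oscillate quickly with position and the high-index columns change slowly, which is the banded structure referred to above.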

KV-cache — visualising memory growth at inference

Autoregressive generation reuses previous $K, V$ tensors for every new token. This is the KV-cache, and its size per sequence is $2 \cdot L \cdot h \cdot T \cdot d_k \cdot \text{bytes}$, where $L$ is the number of layers, $h$ the number of heads, and the factor 2 counts $K$ and $V$. At $T = 32{,}000$ it can dwarf the model weights. The natural visualisation is a simple bar chart: memory vs. context length for full-head MHA, multi-query attention (MQA; Shazeer 2019), and grouped-query attention (GQA; Ainslie et al. 2023). MQA shares one $K, V$ across all heads ($h \times$ reduction); GQA splits the heads into $g$ groups, each sharing one $K, V$ ($h/g \times$ reduction), and recovers most of the quality.

Problem solved: KV memory pressure at long context and high batch. Tradeoff: MQA and GQA slightly reduce representational diversity — the attention heatmaps you can plot for shared-K heads are correlated in a way full MHA's are not.
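The numbers behind such a bar chart follow directly from the size formula. The configuration below (32 layers, 32 query heads, $d_k = 128$, 2-byte values, 8 GQA groups) is a hypothetical example, not a description of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, seq_len, d_k, bytes_per_value=2, batch=1):
    """2 (K and V) * layers * KV heads * tokens * head dim * bytes per value."""
    return 2 * n_layers * n_kv_heads * seq_len * d_k * bytes_per_value * batch

# Hypothetical configuration shared by all three variants.
cfg = dict(n_layers=32, seq_len=32_000, d_k=128)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)   # one K/V per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)    # 8 groups, each sharing one K/V
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)    # a single K/V for all heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
```

Only the number of key/value heads changes between the variants, so the savings are exactly $h/g$ for GQA and $h$ for MQA.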

Transformer scaling — loss-curve visualisation is the scaling law

The Chinchilla (Hoffmann et al. 2022) and Kaplan et al. (2020) scaling-law papers are, at heart, massive collections of training-loss curves plotted against compute and model size. For a power law such as $L \approx L_\infty + A \cdot N^{-\alpha}$ (with $N$ the parameter count), taking logs of both axes turns the reducible loss $L - L_\infty$ into a straight line whose slope $-\alpha$ is directly legible. The whole field of "scaling the transformer family" is built on the fact that the right picture (log loss vs log compute) turns empirical noise into a straight-edge prediction.

Problem solved: deciding how to spend a fixed training budget — tokens vs parameters. Tradeoff: scaling laws are extrapolations; they assume the data distribution and architecture stay in-family, which is why a new architecture (Mamba, Mixture-of-Experts, etc.) always deserves its own re-measured curve.
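The straight-line reading of a power law can be illustrated with a small fit. All numbers below are synthetic and chosen only for the example; in particular, the sketch assumes $L_\infty$ is already known, whereas in practice it has to be estimated together with $A$ and $\alpha$.

```python
import numpy as np

# Synthetic losses following L = L_inf + A * N**(-alpha); every value is made up.
L_inf, A, alpha = 1.7, 400.0, 0.34
N = np.logspace(6, 10, 9)                  # model sizes from 1e6 to 1e10
loss = L_inf + A * N ** (-alpha)

# The reducible loss is a straight line in log-log space:
#   log(L - L_inf) = log A - alpha * log N
slope, intercept = np.polyfit(np.log(N), np.log(loss - L_inf), deg=1)
alpha_hat, A_hat = -slope, np.exp(intercept)
```

On noiseless data the fit recovers `alpha` and `A`; with real runs the scatter around the line is what tells you how far the extrapolation can be trusted.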

Logging Workflow: TensorBoard and Weights & Biases

Visualisations are most useful when they are produced automatically on every run, not hand-cranked after a bug bites. Two ecosystems dominate. TensorBoard is offline, free, and runs locally — best for solo development. Weights & Biases (W&B) is hosted, syncs across machines, and adds dataset / model versioning — best for teams.

🐍python

import torch
from torch.utils.tensorboard import SummaryWriter

# model, loader, val_loader, opt, loss_fn, num_epochs and the helpers
# train_one_epoch, evaluate, collect_penultimate are assumed to be defined elsewhere.
writer = SummaryWriter("runs/exp_001")

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, loader, opt, loss_fn)
    val_loss   = evaluate(model, val_loader, loss_fn)

    # Scalars (loss curves)
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/val",   val_loss,   epoch)

    # Histograms (weights, gradients)
    for name, p in model.named_parameters():
        writer.add_histogram(f"weights/{name}", p, epoch)
        if p.grad is not None:
            writer.add_histogram(f"grads/{name}", p.grad, epoch)

    # Images (filter visualisation)
    if epoch % 5 == 0:
        first_conv = model.conv1.weight.detach().cpu()       # (C_out, 3, k, k)
        # add_images expects float values in [0, 1], so min-max normalise the filters
        lo, hi = first_conv.min(), first_conv.max()
        first_conv = (first_conv - lo) / (hi - lo + 1e-8)
        writer.add_images("filters/conv1", first_conv, epoch)

    # Embedding projector (penultimate-layer features → 3D)
    feats, labels = collect_penultimate(model, val_loader, n=512)
    writer.add_embedding(feats, metadata=labels, global_step=epoch, tag="penult")

writer.close()
# Then: tensorboard --logdir runs/

The W&B equivalent. Replace writer.add_scalar(name, v, step) with wandb.log({name: v}, step=step); replace add_histogram with wandb.log({...: wandb.Histogram(p.detach().cpu().numpy())}). W&B's killer feature is parallel coordinates plots across runs, which are invaluable for hyperparameter sweeps.

Beyond Visualization: Mechanistic Interpretability

The techniques above answer where a model attends and what activations it produces. Mechanistic interpretability goes one step further and asks which algorithm the network is actually implementing — by reverse-engineering the weights of trained circuits.

Induction heads (Olsson et al., 2022) — pairs of attention heads in transformers that implement copy-and-complete on token sequences. Identifying them explains the sudden in-context-learning capability that emerges at modest model scale.
Sparse autoencoders (SAEs) (Bricken et al., 2023; Templeton et al., 2024) — a separate, much wider autoencoder trained on a layer's activations to recover human-meaningful features (e.g. "the Golden Gate Bridge", "deceptive reasoning"). SAEs convert the diffuse, superposed activation space into a sparse code where each direction maps to one concept.
Activation patching / causal tracing — replace a single tensor inside the forward pass with one from a different input and measure the change in output. Pinpoints which intermediate activation is causally responsible for a given behaviour.
Neuron-level circuits (Olah et al., 2020 and subsequent work in Distill & the Anthropic Transformer Circuits thread) — manually decompose what each unit detects and how they combine.
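Of the methods listed, activation patching is the easiest to sketch in a few lines. The NumPy example below uses a toy 4 → 8 → 3 MLP with random weights and two random inputs; the network, the `patch` argument, and the choice of output logit 0 are all illustrative assumptions. On a real model the same idea is usually implemented with forward hooks.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))   # toy 4 -> 8 -> 3 MLP

def forward(x, patch=None):
    """Run the MLP; `patch` maps hidden-unit index -> value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)
    if patch:
        hidden = hidden.copy()
        for unit, value in patch.items():
            hidden[unit] = value
    return W2 @ hidden, hidden

clean_x, corrupt_x = rng.normal(size=4), rng.normal(size=4)
clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run
# and record how far output logit 0 moves as a result.
effects = []
for unit in range(8):
    patched_out, _ = forward(corrupt_x, patch={unit: clean_hidden[unit]})
    effects.append(patched_out[0] - corrupt_out[0])

# Patching every unit at once must reproduce the clean output.
full_patch = {u: clean_hidden[u] for u in range(8)}
restored_out, _ = forward(corrupt_x, patch=full_patch)
```

The units with the largest entries in `effects` are the ones whose activations carry most of the difference between the two inputs for that logit. In this toy the output is linear in the hidden layer, so the per-unit effects add up exactly; in a deep network they generally do not.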

These methods are open research, not yet a settled toolkit, but they extend the visualisation philosophy of this section: every question about model behaviour ought to be answerable by looking at something — and the right thing to look at is rarely the loss alone.

Quick Check

Q1. A teammate's training loss looks healthy but they suspect the network has not actually learned the task. What single visualisation would settle the question fastest?
Answer: A 2-D projection (UMAP or t-SNE) of the penultimate-layer features on a labelled validation batch, coloured by class. A network that has learned the task produces tight, well-separated clusters; one that has not produces a single tangled blob — even if the loss is "low".

Q2. You have an attention heatmap that shows your BERT model attending heavily to the [CLS] token from every other token. Two papers say this is fine, two say it's a problem. How do you decide?
Answer: Look at where in the network it happens. Heavy CLS attention in a middle/late layer often is a useful "summary" head that aggregates the sequence into one slot. The same pattern in EVERY layer for EVERY query is the attention sink failure mode (Xiao et al., 2023) — a sign the model has nowhere else useful to put attention. Causally test by ablating the head and measuring downstream metrics.

Q3. A saliency map is bright on a watermark in the corner of every photo of class "dog". What does this tell you, and what is the next step?
Answer: The model has learned a spurious correlation — the watermark is a near-perfect predictor in your dataset, so the gradient flows through it. Next step: rebuild a balanced dataset without the artefact, OR add data augmentation that hides the corner, OR explicitly mask that region during training. Saliency maps are most valuable when they reveal these data bugs, not just "model attention".

Recap

Four core quantities to plot: activations, weights, gradients, loss. Each answers a distinct question, and together they catch almost every failure mode from Chapter 20 §1.
Saliency maps are three tensor ops in NumPy and two lines in PyTorch. Understand them by hand first; then use autograd.
Loss-landscape 3-D plots turn abstract optimiser theory into something you can see: ravines need momentum, saddles are slow, non-convex means sensitive to init.
Embedding projections (t-SNE / UMAP / PCA) are more sensitive than loss curves for spotting representation problems.
Attention heatmaps reveal head specialisation and connect directly to the engineering of modern systems: multi-head produces $h$ such heatmaps, Flash Attention computes the same $A$ without materialising it, positional encodings structure it, KV-cache stores the intermediates needed to keep computing it, and scaling laws measure how the whole stack gets better.

References

Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps". ICLR Workshop.
Zeiler, M. D. & Fergus, R. (2014). "Visualizing and Understanding Convolutional Networks". ECCV.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). "SmoothGrad: removing noise by adding noise". arXiv:1706.03825.
Sundararajan, M., Taly, A., & Yan, Q. (2017). "Axiomatic Attribution for Deep Networks" (Integrated Gradients). ICML.
Selvaraju, R. R. et al. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization". ICCV.
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). "Visualizing the Loss Landscape of Neural Nets". NeurIPS.
van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE". JMLR.
McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection". arXiv:1802.03426.
Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention". BlackboxNLP.
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). "Analyzing Multi-Head Self-Attention". ACL.
Jain, S. & Wallace, B. C. (2019). "Attention is not Explanation". NAACL.
Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS.
Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
Press, O., Smith, N., & Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi). ICLR 2022.
Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need" (MQA). arXiv:1911.02150.
Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints". EMNLP.
Xiao, G. et al. (2023). "Efficient Streaming Language Models with Attention Sinks". arXiv:2309.17453.
Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.
Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models" (Chinchilla). NeurIPS.
Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits". Distill.
Olsson, C. et al. (2022). "In-context Learning and Induction Heads". Anthropic Transformer Circuits Thread.
Bricken, T. et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning". Anthropic.
Templeton, A. et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet". Anthropic.