The Black-Box Problem
A modern neural network is, in one sense, a perfectly ordinary function: a sequence of matrix multiplications, element-wise nonlinearities, and additions. Every number it produces can, in principle, be traced back to the inputs. And yet — after training a 100-million-parameter model on a few billion tokens — we often find ourselves staring at the result and asking the same question every scientist asks of a complicated instrument: what is it actually doing?
The difficulty is not that the computation is secret. The difficulty is that it is too detailed to hold in mind at once. A single forward pass through a small CNN may execute a billion floating-point operations; a single transformer layer multiplies tens of millions of numbers. No one reads those numbers one by one. We need pictures: compact, low-dimensional summaries that project the enormous internal state into something a human eye can process in a second.
The point of visualization is not decoration. It is compression. A good picture takes a state that is too big to think about and projects it onto a surface the brain already knows how to read: a grid, a curve, a heatmap. Every technique in this section is a compression scheme for a different aspect of the network.
Visualization is also the first defence against the failure modes of Chapter 20 §1. A loss curve alone will tell you that training has stalled; a saliency map will often tell you why. A histogram of activations will show you dead ReLUs before a metric on the validation set ever notices. A 3-D loss surface will explain why a carefully chosen learning rate still oscillates. The relationship is simple: debugging without pictures is debugging with a blindfold on.
Four Windows Into a Network
There are essentially four quantities inside a trained network that are worth plotting. Each opens a different window onto the model, and each answers a different question.
| Quantity | What you plot | Question it answers |
|---|---|---|
| Activations (h) | histograms, feature maps, t-SNE of hidden states | What has each layer learned to detect? |
| Weights (W) | filters as images, weight histograms | What recurring pattern is each neuron tuned to? |
| Gradients (∂L/∂x) | saliency maps, Grad-CAM, integrated gradients | Which parts of the input is the model using? |
| Loss (L(θ)) | training curves, 1-D line searches, 2-D/3-D loss landscapes | Is optimization healthy? Where are the minima? |
Attention weights — which we cover last in this section — are really a special kind of activation, but they deserve their own treatment because they have become the most used and most misused picture in modern deep learning.
1. Activation Visualization: What Each Layer Sees
An activation is the output of a neuron on a particular input. Collect the activations of every unit in a layer on one example and you have the layer's representation of that example. Collect them across a batch and you have a snapshot of how the whole layer is behaving.
What to plot and why
- Per-layer histograms. For every layer, plot a histogram of its activations over a validation batch. A healthy hidden layer with ReLU typically has a small positive mean, with a spike at exactly 0 (units that are inactive on those inputs). If the mean drifts toward zero or the spike dominates, ReLUs are dying (see §1). If the distribution explodes toward huge values, you have an activation-scale problem that residual connections, normalization, or smaller initialisations need to fix.
- Feature maps (CNNs). For a conv layer producing an output of shape $(C, H, W)$, each of the $C$ channels is an $H \times W$ image. Tile them. Early layers look like edge and colour detectors; middle layers look like texture and part detectors; late layers look like abstract composites. Zeiler & Fergus (2014) used pictures of this kind to show, layer by layer, that CNN features form a hierarchy.
- Per-token activation vectors (transformers). For a transformer hidden state of shape $(T, d)$, each of the $T$ tokens has a $d$-dim vector. Project that pile of vectors to 2-D with t-SNE or UMAP and you can often see structure emerge as clusters in the plane (see §5).
How you collect them: forward hooks
In PyTorch the standard tool is register_forward_hook. It is a callback the framework invokes every time a particular module finishes its forward call, handing you the module, its input, and its output. You stash the output in a dictionary and move on:
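A minimal sketch of that pattern follows. The toy model, the `activations` dictionary, and the `make_hook` helper are illustrative names chosen here, not part of any library; only `register_forward_hook` and the handle's `remove()` are PyTorch API.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; any nn.Module is hooked the same way.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

activations = {}  # layer name -> output tensor from the latest forward pass


def make_hook(name):
    # PyTorch calls hook(module, inputs, output) after module.forward() returns.
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook


# Attach one hook per ReLU and keep the handles so the hooks can be removed.
handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if isinstance(module, nn.ReLU)
]

torch.manual_seed(0)
x = torch.randn(32, 8)  # a batch of 32 random inputs
with torch.no_grad():
    model(x)

for name, act in activations.items():
    zero_frac = (act == 0).float().mean().item()
    print(f"layer {name}: shape={tuple(act.shape)}, "
          f"mean={act.mean().item():.3f}, zeros={zero_frac:.1%}")

for handle in handles:
    handle.remove()  # detach the hooks once the data is collected
```

Storing a detached CPU copy keeps the dictionary from holding on to the autograd graph or GPU memory; removing the handles afterwards avoids paying the hook cost on every later forward pass.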
Interactive Activation Histograms
The hooks above collect activations; the visualizer below shows what you would actually see when you plot them. Toggle the activation function and the network's health condition. Notice how the signature shifts: a healthy ReLU layer has a tall spike at exactly zero (legitimate sparsity) plus a positive tail; a dying ReLU layer pushes that spike past 50% — those neurons have stopped learning. Saturated tanh/sigmoid layers pile up at the extremes, which is where their gradient is ≈ 0.
Read the signature, not the mean. A layer with mean 0.3 and a 60% spike at zero is sick — its "mean" is being propped up by a few overactive units. A layer with mean 0.3 and a smooth distribution is healthy. The shape of the histogram carries more information than any single summary statistic.
2. Weight and Filter Visualization
Weights are static — they do not depend on the input. Plotting them tells you what recurring pattern a neuron is tuned to, independent of any particular example. This is cheap and surprisingly informative.
The canonical example is the first convolutional layer of a vision CNN. Its weight tensor has shape $(C_{out}, C_{in}, k, k)$, e.g. $(64, 3, 7, 7)$ for an ImageNet ResNet. Each output channel is therefore a $7 \times 7$ RGB image. Display the 64 of them in an $8 \times 8$ grid and you get one of the most famous pictures in deep learning: oriented edges, colour opponents, Gabor-like bars, all learned rather than designed.
Why is this so informative? Because the convolution operation, as deep-learning frameworks implement it, is a cross-correlation. A conv filter with weights $w$ gives a large response precisely when the input patch matches $w$. So the filter image literally is the pattern the neuron is searching for. You are looking at the neuron's ideal stimulus.
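Tiling first-layer filters into a grid can be sketched as below. The random `weights` array only stands in for a trained first-layer tensor of shape (64, 3, 7, 7), and `tile_filters` is a hypothetical helper written for this illustration.

```python
import numpy as np


def tile_filters(weights, n_cols=8, pad=1):
    """Arrange (C_out, C_in, k, k) filters into one (H, W, C_in) image grid."""
    c_out, c_in, k, _ = weights.shape
    # Min-max rescale each filter to [0, 1] so it can be shown as an image.
    lo = weights.min(axis=(1, 2, 3), keepdims=True)
    hi = weights.max(axis=(1, 2, 3), keepdims=True)
    imgs = (weights - lo) / (hi - lo + 1e-12)
    imgs = imgs.transpose(0, 2, 3, 1)  # channels last: (C_out, k, k, C_in)

    n_rows = -(-c_out // n_cols)  # ceiling division
    cell = k + pad
    grid = np.ones((n_rows * cell + pad, n_cols * cell + pad, c_in))  # white border
    for idx, img in enumerate(imgs):
        r, c = divmod(idx, n_cols)
        top, left = pad + r * cell, pad + c * cell
        grid[top:top + k, left:left + k] = img
    return grid


rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 3, 7, 7))  # placeholder for trained conv1 weights
grid = tile_filters(weights)
print(grid.shape)
```

The resulting array can be passed to any image viewer that accepts height × width × RGB data in [0, 1]. Rescaling each filter separately makes every tile use the full colour range, at the cost of hiding differences in overall filter magnitude.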
For fully-connected layers the analogous move is to reshape each row of $W$ back into an image (if the input dimension $d_{in}$ is a flattened image dimension). Each row of the first-layer weight matrix is a template: plotting them recovers recognisable strokes for MNIST digits, face parts for face datasets, and so on.
For deeper layers, the raw weights stop being interpretable: they live in the abstract space of earlier feature maps, not in pixel space. For those you need either activation maximisation (optimise the input to maximally excite a chosen neuron — the basis of DeepDream) or the gradient-based techniques we turn to now.
3. Saliency Maps: The Gradient as a Spotlight
Saliency maps answer a different question from activations or weights. Given this specific prediction, which input features did the model actually use? The cleanest formulation comes from Simonyan, Vedaldi & Zisserman (2014), and it is astonishingly simple.
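Let $S_c(x)$ be the score the network assigns to class $c$ on input $x$. A first-order Taylor expansion around $x_0$ gives $S_c(x) \approx S_c(x_0) + \nabla_x S_c(x_0)^\top (x - x_0)$. So the pixel-wise gradient $\partial S_c / \partial x$ tells us, to first order, how much each pixel contributes to the class score. Pixels with large $|\partial S_c / \partial x_i|$ are the ones the model is "looking at".
For a multi-layer network we get $\partial S_c / \partial x$ by the chain rule. For the one-hidden-layer MLP of this chapter, $z = W_1 x + b_1$, $h = \mathrm{ReLU}(z)$, $S = W_2 h + b_2$, so
$\partial S_c / \partial x = W_1^\top \left( W_2[c, :] \odot \mathbb{1}[z > 0] \right)$, where $\mathbb{1}[z > 0]$ is the ReLU derivative, a binary mask. That is three tensor ops: pick a row of $W_2$, multiply by a 0/1 mask, left-multiply by $W_1^\top$.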
Known limitations (worth saying once, clearly)
- Vanilla gradients are noisy: tiny local fluctuations in the class score produce speckled heatmaps. SmoothGrad (Smilkov et al. 2017) averages the saliency over noisy copies of the input and produces far cleaner maps. The visualizer below lets you slide the number of noisy samples up and see the cleanup happen.
- They show local sensitivity, not causal importance. A pixel can have a huge gradient yet be unnecessary for the prediction globally. Integrated Gradients (Sundararajan, Taly & Yan 2017) fixes this by integrating gradients along a path from a baseline to the input.
- For CNNs, Grad-CAM (Selvaraju et al. 2017) is usually preferred: it weights the last conv feature map by the gradient of the target logit, producing a smooth, class-specific heatmap at the resolution of the feature map.
Computing Saliency From Scratch (NumPy)
Before we reach for autograd, let us compute the exact same gradient by hand. It is only three tensor operations, and seeing them without framework magic is the fastest way to understand what autograd is doing for us.
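The sketch below assumes a one-hidden-layer ReLU MLP with randomly initialised weights. The sizes, the names `W1`, `b1`, `W2`, `b2`, and the helper functions are illustrative choices, not the chapter's trained model. A finite-difference check on one pixel confirms the hand-derived gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 784 input pixels, 32 hidden units, 10 classes.
n_in, n_hidden, n_classes = 784, 32, 10
W1 = rng.normal(0, 0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, size=(n_classes, n_hidden))
b2 = np.zeros(n_classes)


def forward(x):
    z = W1 @ x + b1           # hidden pre-activations
    h = np.maximum(z, 0.0)    # ReLU
    s = W2 @ h + b2           # class scores
    return z, h, s


def saliency(x, c):
    z, _, _ = forward(x)
    row = W2[c]               # 1. pick row c of W2:        d s_c / d h
    masked = row * (z > 0)    # 2. multiply by the 0/1 mask: d s_c / d z
    return W1.T @ masked      # 3. left-multiply by W1^T:    d s_c / d x


x = rng.random(n_in)          # stand-in for a flattened 28x28 image
c = 3
grad = saliency(x, c)

# Central finite difference on a single pixel as a sanity check.
eps, i = 1e-6, 100
x_plus, x_minus = x.copy(), x.copy()
x_plus[i] += eps
x_minus[i] -= eps
numeric = (forward(x_plus)[2][c] - forward(x_minus)[2][c]) / (2 * eps)
print(f"analytic={grad[i]:.6f}  numeric={numeric:.6f}")
```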
Three lines — pick row, mask, multiply — reproduce what a deep-learning framework does. For networks with many layers this bookkeeping obviously gets painful, which is precisely the problem autograd was invented to solve.
The Same Thing in PyTorch (autograd.grad)
PyTorch turns the explicit chain rule into a two-line call. You set requires_grad=True on the input, run the forward pass, pick a scalar output, and call torch.autograd.grad. Autograd walks the stored computation graph backwards and hands you the gradient.
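A minimal sketch of that call, assuming an untrained `nn.Sequential` MLP as a stand-in for the chapter's network (the layer sizes and variable names are illustrative). The last few lines recompute the gradient with the explicit chain rule to show that both routes agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model: 784 -> 32 -> 10 with one ReLU.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
model.eval()

x = torch.rand(1, 784, requires_grad=True)  # track gradients w.r.t. the input
target_class = 3

scores = model(x)                            # shape (1, 10)
score_c = scores[0, target_class]            # scalar to differentiate
(grad,) = torch.autograd.grad(score_c, x)    # d score_c / d x, shape (1, 784)
saliency_map = grad.abs().reshape(28, 28)    # magnitude, laid out as an image

# Cross-check against the explicit chain rule for this architecture.
with torch.no_grad():
    W1, W2 = model[0].weight, model[2].weight
    z = model[0](x)                          # hidden pre-activations, (1, 32)
    manual = (W2[target_class] * (z[0] > 0)) @ W1

print(torch.allclose(grad[0], manual, atol=1e-6))
```

`torch.autograd.grad` returns the gradient directly instead of accumulating it into `x.grad`, which is convenient when you only want the map and do not plan an optimiser step.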
Why these two views are really the same. Autograd does not do anything we could not do by hand; it just keeps the bookkeeping. Every grad_fn attribute on a tensor is a pointer to the same local chain-rule step (multiply by the derivative of this op, then pass along) that we wrote out explicitly in the NumPy version.
Interactive Saliency Heatmap
Pick a digit, pick a target class, and watch the map. When input and target agree (e.g. the digit is a 3 and we explain class 3), the red pixels trace the strokes that carry the class. Ask for a different class and watch how the map changes — blue pixels are the evidence that is currently missing, red pixels are the evidence that is (wrongly) present.
Crank up the SmoothGrad samples slider to see how averaging the gradient over noise-jittered copies of the input removes the speckle characteristic of vanilla gradients.
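The averaging itself is a few lines. This sketch assumes a bias-free one-hidden-layer ReLU network with random weights; `grad_score`, `smoothgrad`, and the default noise level of 15% of the input range are choices made for this example rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bias-free network: 784 -> 32 -> 10 with one ReLU.
W1 = rng.normal(0, 0.1, size=(32, 784))
W2 = rng.normal(0, 0.1, size=(10, 32))


def grad_score(x, c):
    # Vanilla saliency: gradient of class score c with respect to the input.
    z = W1 @ x
    return W1.T @ (W2[c] * (z > 0))


def smoothgrad(x, c, n_samples=50, noise_frac=0.15):
    # Average the gradient over Gaussian-jittered copies of the input.
    sigma = noise_frac * (x.max() - x.min())
    total = np.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_score(noisy, c)
    return total / n_samples


x = rng.random(784)
vanilla = grad_score(x, 3)
smooth = smoothgrad(x, 3)
print(np.abs(vanilla).mean(), np.abs(smooth).mean())
```

Each noisy copy flips a slightly different set of ReLU masks, so averaging keeps the input directions that matter across the neighbourhood and cancels those that depend on one particular mask pattern.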
4. Loss-Landscape Visualization
A neural network's loss function is a scalar-valued function of tens of millions of parameters: $L(\theta)$ with $\theta \in \mathbb{R}^n$. We cannot see $n$-dimensional surfaces. But we can project. Pick two directions $d_1, d_2$ in parameter space and plot $L(\theta^* + \alpha d_1 + \beta d_2)$ as a function of the scalars $(\alpha, \beta)$, where $\theta^*$ is the trained parameter vector. The choice of $d_1, d_2$ matters; the clean modern choice is filter normalisation (Li et al. 2018), which rescales random directions so that the resulting surface is comparable across architectures.
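A 2-D slice of this kind can be sketched as follows. The tiny regression network, its random (untrained) parameters `theta`, and the per-row rescaling used as a stand-in for per-filter normalisation on fully-connected layers are all assumptions of this example; in practice the slice would be centred on trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression task and one-hidden-layer network.
X = rng.normal(size=(200, 5))
y = np.sin(X.sum(axis=1))
theta = {
    "W1": rng.normal(0, 0.5, size=(16, 5)),
    "W2": rng.normal(0, 0.5, size=(1, 16)),
}


def loss(params):
    h = np.maximum(X @ params["W1"].T, 0.0)
    pred = (h @ params["W2"].T).ravel()
    return float(np.mean((pred - y) ** 2))


def random_direction(params):
    # Gaussian direction, rescaled row by row so each row ("filter") has the
    # same norm as the matching row of the weights.
    direction = {}
    for key, W in params.items():
        g = rng.normal(size=W.shape)
        scale = np.linalg.norm(W, axis=1, keepdims=True)
        g *= scale / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
        direction[key] = g
    return direction


d1, d2 = random_direction(theta), random_direction(theta)
alphas = betas = np.linspace(-1.0, 1.0, 21)
surface = np.array([
    [loss({k: theta[k] + a * d1[k] + b * d2[k] for k in theta}) for b in betas]
    for a in alphas
])
print(surface.shape, surface.min(), surface.max())
```

`surface[i, j]` is the loss at $(\alpha_i, \beta_j)$, so the array can be handed to a contour or 3-D surface plot directly; the centre cell is the loss at the unperturbed parameters.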
Even with only two parameters (as in our toy below) you can already see the four qualitative behaviours that dominate real training:
- Convex bowl. Gradient descent glides to the minimum. Small learning rates are painfully slow; large ones overshoot but still converge.
- Ill-conditioned ravine. One direction is steep, another is gentle. Plain SGD zig-zags across the steep walls, making tiny progress along the valley floor. Momentum accumulates velocity down the floor and damps the zig-zag, which is a large part of why momentum-based optimisers such as SGD with momentum and Adam work well on real models.
- Saddle points. Gradient is zero but it is not a minimum — one direction curves up, another down. In very high dimensions most critical points are saddles, and this is a big part of why training can converge despite the non-convex landscape.
- Non-convex, multi-modal. Multiple local minima of different quality. The initialisation determines which basin you fall into — which is one reason random seeds matter.
Interactive 3D Loss Landscape
Pick a surface. Set the learning rate and momentum. Drop the ball at an initial point and watch SGD find its way. Drag to rotate; the horizontal axes are the two parameters and the height is the loss.
- On the Convex Bowl, any reasonable learning rate converges. Push the learning rate high enough and you'll see the ball overshoot and oscillate before settling.
- On the Ill-conditioned Ravine, set momentum to 0 and SGD zig-zags down the walls. Raise the momentum and the ball rolls smoothly along the floor. This is literally what momentum does to real optimisation.
- On the Saddle Point, start near the origin. The gradient is small so progress is slow — but once you tip off the saddle in the negative-curvature direction the ball accelerates away. Momentum helps here too: it remembers the earlier direction and won't be fooled by a flat patch.
- On the Non-convex surface, change the starting point and watch different runs fall into different basins.
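The ravine behaviour is easy to reproduce without a visualizer. The sketch below runs full-batch gradient descent with heavy-ball momentum on a quadratic whose two curvatures differ by a factor of 50; the surface, the starting point, the learning rate of 0.035, and the momentum of 0.9 are all values picked for this illustration.

```python
import numpy as np


def grad(theta, curvatures=np.array([1.0, 50.0])):
    # Gradient of f(theta) = 0.5 * sum(curvatures * theta**2): a ravine that is
    # gentle along theta[0] and steep along theta[1].
    return curvatures * theta


def descend(theta0, lr, momentum, steps):
    theta = np.array(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(theta)  # heavy-ball update
        theta = theta + velocity
        path.append(theta.copy())
    return np.array(path)


plain = descend([10.0, 1.0], lr=0.035, momentum=0.0, steps=200)
heavy = descend([10.0, 1.0], lr=0.035, momentum=0.9, steps=200)

print("steep coordinate, first steps (no momentum):", plain[:4, 1])
print("final distance to minimum, no momentum:", np.linalg.norm(plain[-1]))
print("final distance to minimum, momentum 0.9:", np.linalg.norm(heavy[-1]))
```

Without momentum the steep coordinate flips sign on every step (the zig-zag) while the gentle coordinate shrinks by only 3.5% per step. With momentum both coordinates contract at the same, faster rate, so the run ends much closer to the minimum.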
5. Embedding Projections: t-SNE, UMAP, PCA
A hidden-layer representation is a set of points in $\mathbb{R}^d$. To see them, we have to project to $\mathbb{R}^2$ or $\mathbb{R}^3$. Three methods dominate:
| Method | Preserves | When to use | Gotcha |
|---|---|---|---|
| PCA | Global variance (linear) | Fast first look; baselines; when clusters are well-separated | Misses curved / nonlinear structure |
| t-SNE | Local neighbourhoods (probabilistic) | Beautiful 2-D cluster plots; small-to-medium datasets | Distances and cluster SIZES are not meaningful; stochastic |
| UMAP | Local + some global (topological) | Large datasets; sharper clusters than t-SNE; faster | Same caveats on distance; sensitive to n_neighbors |
The standard sanity check during training is: take the penultimate-layer activations on a labelled validation set, colour them by class, and project. A network that has actually learned its task produces clean, class-coloured clusters. A network with a representation problem produces a single blob. This picture is much more sensitive than a loss curve — a model can have an excellent training loss and still have a badly tangled representation.
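As a dependency-light sketch of that sanity check, the code below projects simulated features with PCA computed from an SVD. The three-class Gaussian data stands in for real penultimate-layer activations, and `pca_2d` is a hypothetical helper; t-SNE or UMAP would replace only the projection step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated penultimate-layer features: 3 classes, 100 points each, 64-D.
n_per_class, d = 100, 64
centres = rng.normal(0, 4.0, size=(3, d))
feats = np.concatenate([c + rng.normal(size=(n_per_class, d)) for c in centres])
labels = np.repeat(np.arange(3), n_per_class)


def pca_2d(X):
    # Centre the data, then project onto the top two right singular vectors.
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S[:2] ** 2) / np.sum(S ** 2)  # variance ratio per component
    return Xc @ Vt[:2].T, explained


proj, explained = pca_2d(feats)
print(proj.shape, explained)
```

Scatter-plotting `proj[:, 0]` against `proj[:, 1]` coloured by `labels` gives the picture described above: well-separated features show up as distinct clusters, while a tangled representation shows up as one overlapping blob.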
For transformers, the same trick on word embeddings shows linear analogies (king − man + woman ≈ queen) and cleanly separated part-of-speech clusters — direct visual evidence that the embedding geometry has meaning. For sentence-level vectors it shows document topics. For image encoders it shows object categories as tight clusters arranged by visual similarity.
Interactive Embedding Projection
Compare the three projection methods on the same simulated penultimate-layer features. Toggle the training state to see how clusters separate as the network learns: at initialisation, projections of all three methods look like a single hairy blob; at convergence, well-defined class clusters appear. Notice the visual differences between methods — PCA produces elongated, axis-aligned clusters; t-SNE produces tighter blobs but meaningless inter-cluster distances; UMAP sits between them.
6. Attention Visualization — A Window Into Transformers
Attention weights are, on paper, just a matrix $A$ of non-negative entries whose rows sum to 1. They are computed inside the layer by $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$ where $Q = XW_Q$ and $K = XW_K$.
Because each row of $A$ is a probability distribution over tokens, plotting $A$ as a heatmap is a natural move: the cell $A_{ij}$ is the fraction of query $i$'s output that comes from token $j$. A bright cell at $(i, j)$ says "when computing the representation of token $i$, the model mostly attended to token $j$".
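The attention matrix for a single head can be computed in a few lines of NumPy. In this sketch the random `X`, `W_q`, `W_k` and the `temperature` and `causal` options are illustrative; a real heatmap would use the projections of a trained head.

```python
import numpy as np


def attention_weights(X, W_q, W_k, causal=False, temperature=1.0):
    """Row-stochastic attention matrix for one head (no value projection)."""
    Q, K = X @ W_q, X @ W_k
    d_k = Q.shape[-1]
    scores = Q @ K.T / (np.sqrt(d_k) * temperature)
    if causal:
        # Mask out future positions (strict upper triangle) before the softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))     # 5 tokens, model width 16
W_q = rng.normal(size=(16, 8))
W_k = rng.normal(size=(16, 8))

A = attention_weights(X, W_q, W_k)
A_causal = attention_weights(X, W_q, W_k, causal=True)
print(np.round(A_causal, 2))
```

Row $i$ of the printed matrix is the distribution that query $i$ places over tokens; with the causal mask the first row is forced to put all of its weight on the first token.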
Interpretability studies of BERT and GPT consistently find that different heads specialise in strikingly interpretable patterns:
- Previous / next token heads. The bright cell is always one off-diagonal — the head is a positional pointer.
- Coreference / dependency heads. Pronouns and verbs attend back to the noun they refer to; adjectives attend to the noun they modify (Clark et al. 2019, Voita et al. 2019).
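- Broadcast / sink heads. Almost every token attends back to the first token. Such a head is sometimes read as a storage slot that summarises the whole sequence; Xiao et al. (2023) instead describe these positions as attention sinks that absorb attention mass the head has no better use for.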
Caveat. Attention weights are not, by themselves, explanations. Jain & Wallace (2019) showed that many different attention patterns can produce the same output — so an attention heatmap tells you where the model routed information from, not necessarily which tokens were causally responsible for the prediction. Use attention visualisations the way a doctor uses an X-ray: a genuine window, not a final verdict.
Interactive Multi-Head Attention Heatmap
Hover to read the exact attention weight for any $(i, j)$ pair. Swap between heads to see how different heads implement different routing patterns. The temperature slider divides raw scores by a scale factor, which is a direct visualisation of why attention scores are divided by $\sqrt{d_k}$: large scores collapse the softmax into a hard one-hot and kill gradient flow. Turn on the causal mask and watch the upper triangle vanish; this is what makes a GPT-style decoder autoregressive.
Visualization in Modern Transformer Systems
Every visualization technique in this section has a direct, working place in large modern systems. Below is a brief tour of how the pictures you just built connect to the production concerns of billion-parameter models.
Multi-head attention — as many heatmaps as there are heads
A single transformer block has $h$ attention heads running in parallel, each with its own $W_Q, W_K, W_V$ projections and therefore its own $T \times T$ attention matrix. The picture that unlocked interpretability of transformers is: plot one heatmap per head, tile them, and look at the specialisation. That is the picture the visualizer above makes, one head at a time.
The computational problem multi-head solves is representation diversity: splitting $d_{model}$ across $h$ heads lets each head attend to a different linguistic relation at the same token position. The tradeoff is $h$ separate softmaxes (and $h$ separate $T \times T$ matrices), which is part of the memory cost Flash Attention had to optimise.
Flash Attention — a picture the memory hierarchy never shows you
Naïve attention materialises the full $T \times T$ matrix $QK^\top$ in GPU HBM before softmax and before multiplying by $V$. At long context the memory bill is $O(T^2)$ and the HBM traffic is the bottleneck (Dao et al. 2022). Flash Attention avoids this by tiling the computation into blocks that fit in on-chip SRAM, never materialising the big matrix.
The visualisation consequence: Flash Attention computes the same $A$, and hence the same output, as naive attention (up to floating-point rounding), so the heatmap you just interacted with is unchanged. Flash Attention is a memory and IO optimisation, not a behavioural one. You can still plot $A$, but you now have to explicitly opt in to materialising it (e.g. in PyTorch, F.scaled_dot_product_attention does not return $A$; to visualise you must either use a plain implementation or a return_attention_probs branch). Problem solved: HBM traffic. Tradeoff: a more complex kernel; attention weights are not free to inspect.
Positional encodings — visualise the coordinate system
Plot the positional embedding matrix as a heatmap and you can see the structure of the coordinate system the model uses for order. Sinusoidal encodings (Vaswani et al. 2017) show horizontal bands of different frequencies; RoPE (Su et al. 2021) shows pairwise rotations whose angles scale with position; ALiBi (Press et al. 2021) is a linear bias added straight onto the pre-softmax attention scores and looks like a soft lower-triangular ramp on the attention heatmap itself.
Problem solved: without positional information, self-attention is permutation-equivariant and treats its input as an unordered set; positional encodings break the symmetry so the model can tell "cat bites dog" from "dog bites cat". Tradeoff: learned absolute encodings are only defined up to the maximum training length $T_{max}$, so they cannot extrapolate to longer contexts; RoPE and ALiBi are defined at every position, although quality beyond the training length is not guaranteed. You can see this in the visualisation: absolute bands stop at $T_{max}$, while RoPE's rotations and ALiBi's linear bias continue to any position.
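The sinusoidal case is simple enough to generate directly. The function below is a sketch that follows the sine and cosine formulas of Vaswani et al. (2017); the name `sinusoidal_encoding` and the requirement that `d_model` be even are assumptions of this example.

```python
import numpy as np


def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]           # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angles = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe


pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)
```

Showing `pe` as a heatmap gives the banded picture: the low-index dimensions oscillate quickly with position and the high-index dimensions vary slowly. Because the formula has no length limit, calling it with a larger `max_len` simply extends the same pattern.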
KV-cache — visualising memory growth at inference
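Autoregressive generation reuses the previous $K$ and $V$ tensors for every new token. This is the KV-cache, and its size is $2 \cdot B \cdot n_{layers} \cdot h \cdot T \cdot d_{head}$ elements (batch size $B$, context length $T$, $h$ heads of dimension $d_{head}$; the factor 2 counts $K$ and $V$). At long context lengths it can dwarf the model weights. The natural visualisation is a simple bar chart: memory vs. context length for full-head MHA, multi-query attention (MQA; Shazeer 2019), and grouped-query attention (GQA; Ainslie et al. 2023). MQA shares one $K, V$ pair across all heads (an $h\times$ reduction); GQA shares one pair across each group of $h/g$ heads, for $g$ groups (an $h/g\times$ reduction), and recovers most of the quality.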
Problem solved: KV memory pressure at long context and high batch. Tradeoff: MQA and GQA slightly reduce representational diversity — the attention heatmaps you can plot for shared-K heads are correlated in a way full MHA's are not.
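The numbers behind such a bar chart come from one multiplication. The configuration below (32 layers, 32 query heads, head dimension 128, 32,768-token context, 2-byte elements) is a hypothetical model chosen for round figures, and `kv_cache_bytes` is a helper written for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


GIB = 2 ** 30
config = dict(n_layers=32, d_head=128, seq_len=32_768)  # hypothetical model

mha = kv_cache_bytes(n_kv_heads=32, **config)  # every query head has its own K, V
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 8 K/V groups shared by 32 heads
mqa = kv_cache_bytes(n_kv_heads=1, **config)   # one K/V pair for all heads

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name:6s} {size / GIB:6.2f} GiB")
```

Only the number of K/V heads changes between the three variants, so the cache shrinks by exactly the sharing factor while the query-side computation is untouched.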
Transformer scaling — loss-curve visualisation is the scaling law
The Chinchilla (Hoffmann et al. 2022) and Kaplan et al. (2020) scaling-law papers are, at heart, massive collections of training-loss curves plotted against compute. Taking logs of both axes turns the power law into a straight line, and the slope is directly legible. The whole field of "scaling the transformer family" is built on the fact that the right picture, log loss vs log compute, turns empirical noise into a straight-edge prediction.
Problem solved: deciding how to spend a fixed training budget — tokens vs parameters. Tradeoff: scaling laws are extrapolations; they assume the data distribution and architecture stay in-family, which is why a new architecture (Mamba, Mixture-of-Experts, etc.) always deserves its own re-measured curve.
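The log-log trick can be illustrated on synthetic data. The sketch below assumes a pure power law $L = a\,C^{-b}$ with no irreducible-loss offset and invents the constants `a_true` and `b_true` and the noise level; none of these numbers come from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "final loss vs. training compute" points from an assumed power law.
a_true, b_true = 20.0, 0.05
compute = np.logspace(18, 22, 9)                  # FLOPs, made up
noise = np.exp(rng.normal(0.0, 0.01, size=compute.shape))
final_loss = a_true * compute ** (-b_true) * noise

# In log-log space the power law is a line: log L = log a - b * log C.
slope, intercept = np.polyfit(np.log10(compute), np.log10(final_loss), deg=1)
b_hat, a_hat = -slope, 10 ** intercept
print(f"fitted exponent b = {b_hat:.4f}, prefactor a = {a_hat:.2f}")

# Extrapolate one decade beyond the largest synthetic run.
predicted = a_hat * (1e23) ** (-b_hat)
print(f"predicted loss at 1e23 FLOPs: {predicted:.3f}")
```

A straight-line fit in log space is what makes the extrapolation a one-liner; with real curves an additive irreducible-loss term bends the line, which is one reason a fitted slope should be re-measured rather than reused across architectures.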
Logging Workflow: TensorBoard and Weights & Biases
Visualisations are most useful when they are produced automatically on every run, not hand-cranked after a bug bites. Two ecosystems dominate. TensorBoard is offline, free, and runs locally — best for solo development. Weights & Biases (W&B) is hosted, syncs across machines, and adds dataset / model versioning — best for teams.
    import torch
    from torch.utils.tensorboard import SummaryWriter

    # model, loader, val_loader, opt, loss_fn, num_epochs, train_one_epoch,
    # evaluate and collect_penultimate are assumed to be defined elsewhere.
    writer = SummaryWriter("runs/exp_001")

    for epoch in range(num_epochs):
        train_loss = train_one_epoch(model, loader, opt, loss_fn)
        val_loss = evaluate(model, val_loader, loss_fn)

        # Scalars (loss curves)
        writer.add_scalar("loss/train", train_loss, epoch)
        writer.add_scalar("loss/val", val_loss, epoch)

        # Histograms (weights and their gradients)
        for name, p in model.named_parameters():
            writer.add_histogram(f"weights/{name}", p, epoch)
            if p.grad is not None:
                writer.add_histogram(f"grads/{name}", p.grad, epoch)

        # Images (filter visualisation)
        if epoch % 5 == 0:
            first_conv = model.conv1.weight.detach().cpu()  # (C_out, 3, k, k)
            # add_images expects values in [0, 1], so min-max rescale the weights
            lo, hi = first_conv.min(), first_conv.max()
            first_conv = (first_conv - lo) / (hi - lo + 1e-12)
            writer.add_images("filters/conv1", first_conv, epoch)

        # Embedding projector (penultimate-layer features → 3D)
        feats, labels = collect_penultimate(model, val_loader, n=512)
        writer.add_embedding(feats, metadata=labels, global_step=epoch, tag="penult")

    writer.close()
    # Then: tensorboard --logdir runs/

To log the same quantities to W&B, replace writer.add_scalar(name, v, step) with wandb.log({name: v}, step=step); replace add_histogram with wandb.log({...: wandb.Histogram(p)}). W&B's killer feature is parallel coordinates plots across runs, which are invaluable for hyperparameter sweeps.
Beyond Visualization: Mechanistic Interpretability
The techniques above answer where a model attends and what activations it produces. Mechanistic interpretability goes one step further and asks which algorithm the network is actually implementing — by reverse-engineering the weights of trained circuits.
- Induction heads (Olsson et al., 2022) — pairs of attention heads in transformers that implement copy-and-complete on token sequences. Identifying them explains the sudden in-context-learning capability that emerges at modest model scale.
- Sparse autoencoders (SAEs) (Bricken et al., 2023; Templeton et al., 2024): a separate, much wider autoencoder trained on a layer's activations to recover human-meaningful features (e.g. "the Golden Gate Bridge", "deceptive reasoning"). SAEs aim to convert the diffuse, superposed activation space into a sparse code where each direction ideally maps to one concept.
- Activation patching / causal tracing — replace a single tensor inside the forward pass with one from a different input and measure the change in output. Pinpoints which intermediate activation is causally responsible for a given behaviour.
- Neuron-level circuits (Olah et al., 2020 and subsequent work in Distill & the Anthropic Transformer Circuits thread) — manually decompose what each unit detects and how they combine.
These methods are open research, not yet a settled toolkit, but they extend the visualisation philosophy of this section: every question about model behaviour ought to be answerable by looking at something — and the right thing to look at is rarely the loss alone.
Quick Check
Q1. A teammate's training loss looks healthy but they suspect the network has not actually learned the task. What single visualisation would settle the question fastest?
Answer: A 2-D projection (UMAP or t-SNE) of the penultimate-layer features on a labelled validation batch, coloured by class. A network that has learned the task produces tight, well-separated clusters; one that has not produces a single tangled blob — even if the loss is "low".
Q2. You have an attention heatmap that shows your BERT model attending heavily to the [CLS] token from every other token. Two papers say this is fine, two say it's a problem. How do you decide?
Answer: Look at where in the network it happens. Heavy CLS attention in a middle/late layer often is a useful "summary" head that aggregates the sequence into one slot. The same pattern in EVERY layer for EVERY query is the attention sink failure mode (Xiao et al., 2023) — a sign the model has nowhere else useful to put attention. Causally test by ablating the head and measuring downstream metrics.
Q3. A saliency map is bright on a watermark in the corner of every photo of class "dog". What does this tell you, and what is the next step?
Answer: The model has learned a spurious correlation — the watermark is a near-perfect predictor in your dataset, so the gradient flows through it. Next step: rebuild a balanced dataset without the artefact, OR add data augmentation that hides the corner, OR explicitly mask that region during training. Saliency maps are most valuable when they reveal these data bugs, not just "model attention".
Recap
- Four core quantities to plot: activations, weights, gradients, loss. Each answers a distinct question, and together they catch almost every failure mode from Chapter 20 §1.
- Saliency maps are three tensor ops in NumPy and two lines in PyTorch. Understand them by hand first; then use autograd.
- Loss-landscape 3-D plots turn abstract optimiser theory into something you can see: ravines need momentum, saddles are slow, non-convex means sensitive to init.
- Embedding projections (t-SNE / UMAP / PCA) are more sensitive than loss curves for spotting representation problems.
- Attention heatmaps reveal head specialisation and connect directly to the engineering of modern systems: multi-head produces $h$ such heatmaps, Flash Attention computes the same result without materialising the matrix, positional encodings structure it, KV-cache stores the intermediates needed to keep computing it, and scaling laws measure how the whole stack gets better.
References
- Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps". ICLR Workshop.
- Zeiler, M. D. & Fergus, R. (2014). "Visualizing and Understanding Convolutional Networks". ECCV.
- Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). "SmoothGrad: removing noise by adding noise". arXiv:1706.03825.
- Sundararajan, M., Taly, A., & Yan, Q. (2017). "Axiomatic Attribution for Deep Networks" (Integrated Gradients). ICML.
- Selvaraju, R. R. et al. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization". ICCV.
- Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). "Visualizing the Loss Landscape of Neural Nets". NeurIPS.
- van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE". JMLR.
- McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection". arXiv:1802.03426.
- Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
- Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention". BlackboxNLP.
- Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). "Analyzing Multi-Head Self-Attention". ACL.
- Jain, S. & Wallace, B. C. (2019). "Attention is not Explanation". NAACL.
- Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS.
- Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
- Press, O., Smith, N., & Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi). ICLR 2022.
- Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need" (MQA). arXiv:1911.02150.
- Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints". EMNLP.
- Xiao, G. et al. (2023). "Efficient Streaming Language Models with Attention Sinks". arXiv:2309.17453.
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models" (Chinchilla). NeurIPS.
- Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits". Distill.
- Olsson, C. et al. (2022). "In-context Learning and Induction Heads". Anthropic Transformer Circuits Thread.
- Bricken, T. et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning". Anthropic.
- Templeton, A. et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet". Anthropic.