The Highway Around the Layer
Picture a city expressway with on-and-off ramps. Cars on the expressway can keep moving even if any one ramp is congested. Take away the expressway and every car must crawl through every intersection.
Residual connections do the same thing for gradients. The attention layer's output is ADDED to its input - the original signal travels through both an “expressway” (the residual) and the “intersection” (the attention computation). Gradients during backprop have a direct path; the layer can be ignored if it isn't helping.
Residual Connections
Residual connections were introduced by ResNet (He et al. 2015) for very deep CNNs. The recipe:

y = x + Layer(x)
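Three immediate benefits. (1) For y = x + Layer(x), the gradient with respect to x is the identity plus the layer's own Jacobian, so a direct gradient path exists even when the layer's contribution is tiny; this makes vanishing gradients far less likely. (2) The layer learns to predict a RESIDUAL adjustment to the input rather than a full output, which is usually easier. (3) The sub-layer can degenerate to a no-op (Layer(x) = 0, so y = x) without catastrophic loss.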
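Benefit (1) can be checked numerically. The sketch below is a toy illustration, not the book's model: each "layer" is assumed to be a scalar function f(x) = w * x with the same small weight (w = 0.1 and depth = 20 are hypothetical values), and the function name `stack_gradient` is made up for this example.

```python
# Toy illustration: end-to-end derivative through a stack of scalar layers.
# Assumption: every layer is f(x) = w * x with the same small weight w.

def stack_gradient(w: float, depth: int, residual: bool) -> float:
    """Derivative of the stack output with respect to its input."""
    grad = 1.0
    for _ in range(depth):
        local = w                  # d/dx of f(x) = w * x
        if residual:
            local = 1.0 + w        # d/dx of x + f(x): identity term plus f'(x)
        grad *= local              # chain rule across layers
    return grad

plain = stack_gradient(w=0.1, depth=20, residual=False)
skip = stack_gradient(w=0.1, depth=20, residual=True)
print(f"without residual: {plain:.3e}")
print(f"with residual:    {skip:.3e}")
```

Without the residual the derivative is 0.1 multiplied by itself 20 times, which is vanishingly small. With the residual each factor is 1.1, so the product stays well away from zero.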
LayerNorm vs BatchNorm
| Property | BatchNorm (CNN) | LayerNorm (attention) |
|---|---|---|
| Normalises over | Batch + spatial axes (B, T) | Feature axis only |
| Per-channel statistics? | Yes | No |
| Per-sample statistics? | No | Yes |
| Sensitive to batch size? | Yes (fails at batch_size=1) | No |
| Sensitive to sequence length? | Yes (mixes across timesteps) | No |
| Standard with | CNNs | Attention / Transformers |
For attention, LayerNorm is the right choice because each timestep's feature vector should live on a comparable scale, independent of what the OTHER timesteps look like. BatchNorm would mix statistics across timesteps, defeating the purpose of attention's per-position computation.
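The difference in normalisation axes can be sketched in NumPy. This is a minimal illustration under assumed shapes (batch B=2, timesteps T=3, features D=4); the learnable scale and shift are left out, and BatchNorm is shown in its training-mode form only.

```python
import numpy as np

# Hypothetical shapes: batch B=2, timesteps T=3, features D=4.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(2, 3, 4))
eps = 1e-5

# LayerNorm: statistics over the feature axis, one (mean, var) per (sample, timestep).
ln_mean = x.mean(axis=-1, keepdims=True)          # shape (2, 3, 1)
ln_var = x.var(axis=-1, keepdims=True)
ln_out = (x - ln_mean) / np.sqrt(ln_var + eps)

# BatchNorm (training mode): statistics over batch and time, one (mean, var) per feature.
bn_mean = x.mean(axis=(0, 1), keepdims=True)      # shape (1, 1, 4)
bn_var = x.var(axis=(0, 1), keepdims=True)
bn_out = (x - bn_mean) / np.sqrt(bn_var + eps)

# Perturb every timestep except t=0 and normalise again.
x2 = x.copy()
x2[:, 1:, :] += 100.0
ln_out2 = (x2 - x2.mean(-1, keepdims=True)) / np.sqrt(x2.var(-1, keepdims=True) + eps)
bn_out2 = (x2 - x2.mean((0, 1), keepdims=True)) / np.sqrt(x2.var((0, 1), keepdims=True) + eps)

print(np.allclose(ln_out[:, 0], ln_out2[:, 0]))   # is LayerNorm at t=0 unchanged?
print(np.allclose(bn_out[:, 0], bn_out2[:, 0]))   # is BatchNorm at t=0 unchanged?
```

The LayerNorm output at timestep 0 does not depend on the other timesteps, while the BatchNorm output there shifts because its statistics are shared across the batch and time axes.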
The Sub-Layer Equation
Two variants exist. Post-norm (original, Vaswani et al. 2017):

y = LayerNorm(x + drop(MHA(x)))

Pre-norm (more stable for very deep models):

y = x + drop(MHA(LayerNorm(x)))
We use post-norm in this book because the backbone is shallow (one attention sub-layer) and post-norm is the original choice. For 12+ stacked transformer layers pre-norm tends to train more reliably.
Python: Residual + LayerNorm Wrapper
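A minimal NumPy sketch of the wrapper, covering both the post-norm and pre-norm variants. The names `layer_norm`, `sublayer`, and `toy_layer` are illustrative rather than taken from the book's code, dropout is omitted, and a plain linear map stands in for the attention computation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each feature vector (last axis), then apply scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sublayer(x, layer_fn, gamma, beta, pre_norm=False):
    """Wrap layer_fn with a residual connection and LayerNorm (dropout omitted)."""
    if pre_norm:
        return x + layer_fn(layer_norm(x, gamma, beta))   # y = x + Layer(LayerNorm(x))
    return layer_norm(x + layer_fn(x), gamma, beta)        # y = LayerNorm(x + Layer(x))

# Stand-in for attention: a hypothetical linear map that preserves shape (B, T, D).
rng = np.random.default_rng(0)
B, T, D = 2, 5, 8
W = rng.normal(scale=0.1, size=(D, D))

def toy_layer(h):
    return h @ W

x = rng.normal(size=(B, T, D))
gamma, beta = np.ones(D), np.zeros(D)

y_post = sublayer(x, toy_layer, gamma, beta)
y_pre = sublayer(x, toy_layer, gamma, beta, pre_norm=True)
print(y_post.shape, y_pre.shape)   # both match the input shape
```

Because the wrapper preserves the input shape, sub-layers built this way can be stacked without any extra reshaping.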
PyTorch: nn.LayerNorm
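In PyTorch the same post-norm sub-layer can be sketched with `nn.LayerNorm` and `nn.MultiheadAttention`. The class name `ResidualNorm` and the sizes used here (d_model=64, 4 heads, dropout 0.1) are assumptions for illustration, not the book's configuration.

```python
import torch
import torch.nn as nn

class ResidualNorm(nn.Module):
    """Post-norm sub-layer: y = LayerNorm(x + drop(MHA(x))). Illustrative only."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)   # normalises the last (feature) axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)   # self-attention
        return self.norm(x + self.drop(attn_out))

block = ResidualNorm(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)                 # (batch, timesteps, features)
y = block(x)
n_norm_params = sum(p.numel() for p in block.norm.parameters())
print(y.shape)                             # same shape as the input
print(n_norm_params)                       # 2 * d_model: one scale and one shift per feature
```

`nn.LayerNorm(d_model)` holds one learnable scale and one learnable shift per feature, so it contributes 2 * d_model parameters regardless of batch size or sequence length.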
Sub-Layer Patterns Across Architectures
| Architecture | Sub-layer recipe | Notes |
|---|---|---|
| This book (post-norm) | y = LayerNorm(x + drop(MHA(x))) | Vaswani 2017 default |
| GPT (pre-norm) | y = x + drop(MHA(LayerNorm(x))) | Pre-norm for stability |
| ResNet (CNN) | y = x + Conv(Conv(x)) | Two stacked convs in the residual branch (BatchNorm and ReLU omitted) |
| U-Net | y = concat(upsample(decoder), skip(encoder)) | Skip across the U; the original U-Net concatenates |
| DenseNet | y = concat(x, Layer(x)) | Concat instead of add |
| RWKV | y = x + TimeMix(LayerNorm(x)) | Pre-norm; recurrent time-mixing in place of attention |
Two Sub-Layer Pitfalls
Without the x + in the forward pass, gradients have to flow through every attention computation. Training becomes unstable, often diverging in the first few epochs.
The point. Residual + LayerNorm is the four-line wrapper that turns a raw attention computation into a composable sub-layer. Stable training; clean shape preservation; modest extra cost.
Takeaway
- Residual = identity highway for gradients. Without it, deep stacks won't train.
- LayerNorm normalises across features per-sample. Different from BatchNorm; right choice for attention.
- Post-norm vs pre-norm. We use post-norm; pre-norm is more stable for very deep stacks.
- Sub-layer adds ~1k LayerNorm params. The attention block ends at ~1.05M total.
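The parameter figures in the last bullet can be reproduced with a short calculation. The width is an assumption: d_model = 512 is not stated in this section, but it is a value for which the arithmetic lands near the quoted numbers, taking the attention block to be four d_model x d_model projections (Q, K, V, output) with biases.

```python
# Assumption: d_model = 512 and four projections (Q, K, V, output), each with a bias.
# These values are inferred for illustration, not taken from the text.
d_model = 512

layernorm_params = 2 * d_model                         # scale and shift vectors
attention_params = 4 * (d_model * d_model + d_model)   # weights plus biases
total = attention_params + layernorm_params

print(layernorm_params, attention_params, total)
```

Under these assumptions LayerNorm adds 1,024 parameters on top of 1,050,624 for attention, roughly 0.1% extra.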