Chapter 10

Residual Connection and LayerNorm

Multi-Head Self-Attention

The Highway Around the Layer

Picture a city expressway with on-and-off ramps. Cars on the expressway can keep moving even if any one ramp is congested. Take away the expressway and every car must crawl through every intersection.

Residual connections do the same thing for gradients. The attention layer's output is ADDED to its input, so the original signal travels through both an “expressway” (the residual) and the “intersection” (the attention computation). During backprop, gradients have a direct path around the layer, and the network can learn to suppress the layer if it isn't helping.

Residual Connections

Originally introduced by ResNet (He et al. 2015) for very deep CNNs. The recipe:

$$y = x + \text{Layer}(x).$$

Three immediate benefits. (1) The Jacobian $\partial y / \partial x = I + \partial \text{Layer}(x) / \partial x$ always includes the identity term, so the gradient reaching x never depends on the layer's Jacobian alone and is far less likely to vanish. (2) The layer learns to predict a RESIDUAL adjustment to the input rather than a full output, which is usually easier. (3) The network can degenerate to a no-op (Layer(x) = 0) without catastrophic loss.
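Benefit (1) can be checked numerically. The sketch below is only an illustration: the tanh layer, its weight matrix W and the helper numerical_jacobian are hypothetical stand-ins, not part of the book's backbone. It compares a finite-difference Jacobian of the residual block with the identity plus the layer's own Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical stand-in for Layer(x): a small linear map followed by tanh.
W = 0.01 * rng.standard_normal((d, d))

def layer(x):
    return np.tanh(W @ x)

def residual_block(x):
    return x + layer(x)

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: J[i, j] = d f_i / d x_j."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

x = rng.standard_normal(d)
J_layer = numerical_jacobian(layer, x)
J_block = numerical_jacobian(residual_block, x)

# The block's Jacobian should equal the identity plus the layer's Jacobian.
identity_gap = np.abs(J_block - (np.eye(d) + J_layer)).max()
print("max |J_block - (I + J_layer)|:", identity_gap)
print("largest entry of J_layer     :", np.abs(J_layer).max())
```

Even though the layer's own Jacobian is tiny here, the block's Jacobian stays close to the identity, which is the direct path the gradient takes.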

LayerNorm vs BatchNorm

| Property | BatchNorm (CNN) | LayerNorm (attention) |
|---|---|---|
| Normalises over | Batch + spatial axes (B, T) | Feature axis only |
| Per-channel statistics? | Yes | No |
| Per-sample statistics? | No | Yes |
| Sensitive to batch size? | Yes (fails at batch_size=1) | No |
| Sensitive to sequence length? | Yes (mixes across timesteps) | No |
| Standard with | CNNs | Attention / Transformers |

For attention, LayerNorm is the right choice because each timestep's feature vector should live on a comparable scale, independent of what the OTHER timesteps and samples look like. BatchNorm would mix statistics across timesteps, defeating the purpose of attention's per-position computation.
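A small experiment makes the difference concrete. This is a minimal sketch, assuming training-mode BatchNorm statistics taken over the (B, T) axes and omitting gamma / beta; the function names are made up for the illustration. It perturbs one sample and checks whether a cell in a different sample changes.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d = 4, 5, 8
x = rng.standard_normal((B, T, d))

def layer_norm_plain(x, eps=1e-5):
    """LayerNorm without gamma/beta: statistics over the feature axis only."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    """Training-mode BatchNorm without gamma/beta: per-feature stats over (B, T)."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Perturb a DIFFERENT sample (b=1) and watch what happens to cell (0, 0).
x_perturbed = x.copy()
x_perturbed[1] += 10.0

ln_change = np.abs(layer_norm_plain(x)[0, 0] - layer_norm_plain(x_perturbed)[0, 0]).max()
bn_change = np.abs(batch_norm_train(x)[0, 0] - batch_norm_train(x_perturbed)[0, 0]).max()

print("LayerNorm change in cell (0, 0):", ln_change)   # unaffected by other samples
print("BatchNorm change in cell (0, 0):", bn_change)   # shifts with the batch statistics
```

The LayerNorm output for cell (0, 0) depends only on that cell's own features, while the BatchNorm output moves whenever any other sample or timestep in the batch changes.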

The Sub-Layer Equation

Two variants exist. Post-norm (original Vaswani 2017):

$$y = \text{LayerNorm}\bigl(x + \text{Sublayer}(x)\bigr).$$

Pre-norm (more stable for very deep models):

$$y = x + \text{Sublayer}\bigl(\text{LayerNorm}(x)\bigr).$$

We use post-norm in this book because the backbone is shallow (one attention sub-layer) and post-norm is the original choice. For 12+ stacked transformer layers pre-norm tends to train more reliably.

Output statistics. After post-norm every (b, t) cell has features with mean 0 and std 1 (modulo the learnable gamma / beta). This is what keeps the deep stack's activations on a stable scale.
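The two variants differ only in where LayerNorm sits, which a few lines of NumPy can show. This is a sketch under stated assumptions: gamma = 1 and beta = 0 are fixed for brevity, and toy_sublayer is a hypothetical stand-in for attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis with gamma=1, beta=0 (assumed for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))      # y = LayerNorm(x + Sublayer(x))

def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))      # y = x + Sublayer(LayerNorm(x))

def toy_sublayer(h):
    # Hypothetical stand-in for attention: a fixed elementwise rescale.
    return 0.1 * h

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((2, 30, 512)) + 1.0   # deliberately off-scale input

y_post = post_norm(x, toy_sublayer)
y_pre = pre_norm(x, toy_sublayer)

print("post-norm cell (0, 0) mean/std:", y_post[0, 0].mean(), y_post[0, 0].std())
print("pre-norm  cell (0, 0) mean/std:", y_pre[0, 0].mean(), y_pre[0, 0].std())
```

With post-norm the output cell is normalised whatever the input scale was. With pre-norm the residual stream keeps the input's scale and only the sub-layer's input is normalised.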

Python: Residual + LayerNorm Wrapper

LayerNorm + residual; verify per-feature normalisation
🐍attention_sublayer_numpy.py
import numpy as np

Standard alias.

def layer_norm(x, gamma, beta, eps=1e-5):

LayerNorm normalises across the FEATURE axis (last axis), per sample. BatchNorm, by contrast, normalises across the BATCH axis.

Inputs:
  • x (..., d): any shape; the last dimension is normalised
  • gamma (d,): learnable per-feature scale
  • beta (d,): learnable per-feature shift

mu = x.mean(axis=-1, keepdims=True)

Per-sample mean over the FEATURE dim. For (B, T, d) input this has shape (B, T, 1).

var = x.var(axis=-1, keepdims=True)

Per-sample variance over features.

return (x - mu) / np.sqrt(var + eps) * gamma + beta

Element-wise normalise, then scale and shift. Each (b, t) cell ends up with mean 0 and std 1 across its 512 features.

Why normalise over the features of one cell rather than per channel across the batch? Attention models care about RELATIVE feature magnitudes within one timestep, and that is what LayerNorm normalises. BatchNorm would mix statistics across timesteps, which makes less sense for attention.

def attention_sublayer(x, attention_fn, gamma, beta):

The residual + LayerNorm wrapper that surrounds the attention computation. Two variants exist: pre-norm and post-norm. We show post-norm here (the original Vaswani 2017 choice).

a = attention_fn(x)

Run attention. Output has the same shape as x.

return layer_norm(x + a, gamma, beta)

RESIDUAL connection: add the input to the attention output, then apply LayerNorm. The 'add' is what gives gradients a direct path from output to input, which is critical for stable deep training.

B, T, d_model = 2, 30, 512

Backbone shapes.

x = np.random.randn(B, T, d_model).astype(np.float32)

Fake BiLSTM output.

gamma = np.ones(d_model, dtype=np.float32)

LayerNorm gamma init.

beta = np.zeros(d_model, dtype=np.float32)

LayerNorm beta init.

def fake_attention(x):

A stub: we don't need real attention for the LayerNorm demo.

return x * 0.1 + np.random.randn(*x.shape).astype(np.float32) * 0.05

A small perturbation of the input, standing in for a small-magnitude attention output.

out = attention_sublayer(x, fake_attention, gamma, beta)

Apply the sub-layer.

print("x mean / std (global) :", x.mean().round(3), x.std().round(3))

Pre-sublayer: random Gaussian, mean ~0, std ~1. Note that x.mean() and x.std() are taken over the whole tensor, not per cell.

Output: x mean / std : 0.001 1.000

print("out mean / std (global) :", out.mean().round(3), out.std().round(3))

Post-sublayer: still mean ~0, std ~1 over the whole tensor, because every cell has been layer-normalised.

Output: out mean / std : 0.000 1.000

print("out per-feature mean (cell 0,0) :", out[0, 0].mean().round(4))

Pick ONE (b, t) cell: its 512 features should average to almost 0 by definition of LayerNorm.

Output: out per-feature mean (cell 0,0) : -0.0000

print("out per-feature std (cell 0,0) :", out[0, 0].std().round(4))

Same cell: its 512 features have std almost exactly 1.

Output: out per-feature std (cell 0,0) : 1.0000
```python
import numpy as np


def layer_norm(x: np.ndarray, gamma, beta, eps=1e-5):
    """Per-FEATURE normalisation (last axis). Different from BatchNorm."""
    mu = x.mean(axis=-1, keepdims=True)    # per-cell mean over features
    var = x.var(axis=-1, keepdims=True)    # per-cell variance over features
    return (x - mu) / np.sqrt(var + eps) * gamma + beta


def attention_sublayer(x: np.ndarray,
                       attention_fn,
                       gamma, beta) -> np.ndarray:
    """Residual + LayerNorm wrapper around an attention block.
    Implements: y = LayerNorm(x + attention(x)), the 'post-norm' variant.
    """
    a = attention_fn(x)            # (B, T, d_model)
    return layer_norm(x + a, gamma, beta)


# ----- Demo -----
np.random.seed(0)
B, T, d_model = 2, 30, 512

x = np.random.randn(B, T, d_model).astype(np.float32)
gamma = np.ones(d_model, dtype=np.float32)
beta = np.zeros(d_model, dtype=np.float32)

# Pretend attention(x) returns x perturbed slightly
def fake_attention(x):
    return x * 0.1 + np.random.randn(*x.shape).astype(np.float32) * 0.05

out = attention_sublayer(x, fake_attention, gamma, beta)

# The first two prints use statistics over the WHOLE tensor; the last two use one cell.
print("x   mean / std (global)          :", x.mean().round(3), x.std().round(3))
print("out mean / std (global)          :", out.mean().round(3), out.std().round(3))
print("out per-feature mean (cell 0,0)  :", out[0, 0].mean().round(4))      # ≈ 0
print("out per-feature std  (cell 0,0)  :", out[0, 0].std().round(4))       # ≈ 1
```
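The demo initialises gamma to ones and beta to zeros, so the "mean 0, std 1" result holds exactly. To see what "modulo the learnable gamma / beta" means in practice, here is a minimal sketch with non-default values; the values 2.0 and 0.5 are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_model = 512
x = rng.standard_normal((2, 30, d_model))

# Assumed non-default values, the same for every feature.
gamma = np.full(d_model, 2.0)
beta = np.full(d_model, 0.5)

out = layer_norm(x, gamma, beta)
cell_mean = float(out[0, 0].mean())
cell_std = float(out[0, 0].std())
print("cell (0, 0) mean:", round(cell_mean, 4))   # close to beta = 0.5
print("cell (0, 0) std :", round(cell_std, 4))    # close to gamma = 2.0
```

When gamma and beta are constant across features, as here, they set the cell's standard deviation and mean directly. Once training makes them vary per feature, the cell statistics are no longer exactly 0 and 1.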

PyTorch: nn.LayerNorm

MHA + Dropout + Residual + LayerNorm in one Module
🐍attention_sublayer_torch.py
import torch

Top-level PyTorch.

import torch.nn as nn

Layers.

import torch.nn.functional as F

Imported out of habit; not used in this snippet.

class AttentionSubLayer(nn.Module):

Wraps multi-head attention with a residual connection + LayerNorm. The atomic 'attention block' used in every transformer.

def __init__(self, d_model=512, num_heads=8, dropout_p=0.1):

Three hyperparameters.

super().__init__()

Initialise nn.Module.

self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True, dropout=dropout_p)

From §10.2.

self.norm = nn.LayerNorm(d_model)

LayerNorm normalises the last axis (features) per sample. It holds gamma and beta, each of shape (d_model,), for 1024 parameters in total.

Why d_model only? LayerNorm only needs the size of the dimension it normalises over. For (B, T, d_model) input it normalises over the d_model axis.

self.drop = nn.Dropout(dropout_p)

Dropout applied AFTER the attention output, before the residual add. Standard practice.

def forward(self, x):

Standard forward.

a, _ = self.attn(x, x, x, need_weights=False)

Self-attention. need_weights=False skips returning the attention weights, which saves compute.

return self.norm(x + self.drop(a))

Apply dropout to the attention output; add it to the input (residual); apply LayerNorm. Three operations in one line: the canonical sub-layer recipe.

torch.manual_seed(0)

Determinism.

sublayer = AttentionSubLayer(d_model=512, num_heads=8, dropout_p=0.1)

Instantiate.

x = torch.randn(2, 30, 512)

Fake BiLSTM output.

y = sublayer(x)

Run.

print("input :", tuple(x.shape))

Verify input.

Output: input : (2, 30, 512)

print("output :", tuple(y.shape))

Output shape preserved.

Output: output : (2, 30, 512)

print("# params:", sum(p.numel() for p in sublayer.parameters()))

MHA + LayerNorm. The count is 1,051,648; print shows it without thousands separators.

Output: # params: 1051648
```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # not used below


class AttentionSubLayer(nn.Module):
    """Multi-head attention with residual + LayerNorm wrapping."""
    def __init__(self, d_model: int = 512, num_heads: int = 8,
                 dropout_p: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          batch_first=True,
                                          dropout=dropout_p)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention with residual + LayerNorm (post-norm)
        a, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + self.drop(a))


# Use it
torch.manual_seed(0)
sublayer = AttentionSubLayer(d_model=512, num_heads=8, dropout_p=0.1)
x = torch.randn(2, 30, 512)
y = sublayer(x)

print("input  :", tuple(x.shape))                     # (2, 30, 512)
print("output :", tuple(y.shape))                      # (2, 30, 512)
print("# params:", sum(p.numel() for p in sublayer.parameters()))
# prints "# params: 1051648", i.e. 1,051,648 (= MHA + LayerNorm)
```
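The parameter count can be reproduced by hand. The breakdown below is a sketch that assumes nn.MultiheadAttention's default configuration (biases enabled, key and value dimensions equal to d_model); the variable names are only labels for this arithmetic.

```python
d_model = 512

# nn.MultiheadAttention, default settings assumed (bias=True, kdim = vdim = d_model)
in_proj_weight = 3 * d_model * d_model   # stacked Q, K, V projections
in_proj_bias = 3 * d_model
out_proj_weight = d_model * d_model
out_proj_bias = d_model
mha_params = in_proj_weight + in_proj_bias + out_proj_weight + out_proj_bias

# nn.LayerNorm(d_model): gamma and beta, each of shape (d_model,)
layer_norm_params = 2 * d_model

# nn.Dropout has no parameters.
total = mha_params + layer_norm_params

print(f"MHA       : {mha_params:,}")          # 1,050,624
print(f"LayerNorm : {layer_norm_params:,}")   # 1,024
print(f"total     : {total:,}")               # 1,051,648
```

Note that num_heads does not appear: splitting d_model across heads changes how the projections are reshaped, not how many weights they hold.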

Sub-Layer Patterns Across Architectures

| Architecture | Sub-layer recipe | Notes |
|---|---|---|
| This book (post-norm) | y = LayerNorm(x + drop(MHA(x))) | Vaswani 2017 default |
| GPT-2 and later (pre-norm) | y = x + drop(MHA(LayerNorm(x))) | Pre-norm for stability |
| ResNet (CNN) | y = x + Conv(Conv(x)) | Two stacked convs in the residual branch |
| U-Net | y = concat(upsample(decoder), skip(encoder)) | Skip across the U; concat instead of add |
| DenseNet | y = concat(x, Layer(x)) | Concat instead of add |
| RWKV | y = x + TimeMix(LayerNorm(x)) | Pre-norm residual; attention replaced by a recurrent time-mixing block |
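The practical difference between an additive skip and a concatenating one shows up in the shapes. A minimal sketch, with toy_layer as a hypothetical stand-in for any shape-preserving layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 30, 512))

def toy_layer(h):
    # Hypothetical stand-in for any shape-preserving layer.
    return 0.1 * h

y_add = x + toy_layer(x)                              # residual add
y_cat = np.concatenate([x, toy_layer(x)], axis=-1)    # DenseNet-style concat

print("residual add:", y_add.shape)   # (2, 30, 512)
print("concat      :", y_cat.shape)   # (2, 30, 1024)
```

Addition keeps the feature width fixed, so sub-layers can be stacked without any bookkeeping. Concatenation keeps the input untouched but grows the feature axis with every layer.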

Two Sub-Layer Pitfalls

Pitfall 1: Forgetting the residual. Without the x + in the forward pass, gradients have to flow through every attention computation. In deeper stacks training becomes unstable, often diverging in the first few epochs.
Pitfall 2: BatchNorm instead of LayerNorm. BatchNorm in attention sub-layers mixes statistics across timesteps. Models may still train but tend to generalise poorly, because timestep-relative magnitudes are lost.
The point. Residual + LayerNorm is the four-line wrapper that turns a raw attention computation into a composable sub-layer. Stable training; clean shape preservation; modest extra cost.
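Pitfall 1 can be illustrated with a toy stack. The sketch below makes a strong simplifying assumption: each layer is a random linear map, so its Jacobian is just its weight matrix. It is not an attention stack, and the depth, width and weight scale are arbitrary choices. It backpropagates a unit-norm gradient through 20 layers with and without the residual add.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 20

# Assumption: purely linear layers, so the Jacobian of layer l is W_l.
weights = [0.5 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

grad_plain = np.ones(d) / np.sqrt(d)   # unit-norm gradient arriving at the top
grad_resid = grad_plain.copy()

for W in reversed(weights):
    grad_plain = W.T @ grad_plain                # y = Layer(x)      -> J = W
    grad_resid = grad_resid + W.T @ grad_resid   # y = x + Layer(x)  -> J = I + W

norm_plain = np.linalg.norm(grad_plain)
norm_resid = np.linalg.norm(grad_resid)
print("gradient norm without residual:", norm_plain)
print("gradient norm with residual   :", norm_resid)
```

Without the identity term the gradient norm shrinks by roughly a constant factor per layer and is negligible by the bottom of the stack. With it, the gradient arrives at a usable magnitude.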

Takeaway

  • Residual = identity highway for gradients. Without it, deep stacks won't train.
  • LayerNorm normalises across features per-sample. Different from BatchNorm; right choice for attention.
  • Post-norm vs pre-norm. We use post-norm; pre-norm is more stable for very deep stacks.
  • Sub-layer adds ~1k LayerNorm params. The attention block ends at ~1.05M total.