Introduction
Transformers combine residual connections and layer normalization into a standard pattern called Add & Norm. This pattern appears around every sublayer (attention and FFN) in the architecture.
This section covers both the original Post-LN and modern Pre-LN variants, with complete implementations.
4.1 The Standard Pattern
Add & Norm Explained
Every transformer sublayer follows this pattern:
```
Input → Sublayer → Residual Add → Layer Norm → Output
```

Or in code:

```python
output = LayerNorm(x + Sublayer(x))
```

Where It Appears
In each transformer layer:
```
┌───────────────────────────────────┐
│               Input               │
└─────────────────┬─────────────────┘
                  │
                  ▼
      ┌────────────────────────┐
      │  Multi-Head Attention  │
      └───────────┬────────────┘
                  │
                  ▼
          ┌─────────────┐
          │ Add & Norm  │  ← Pattern #1
          └──────┬──────┘
                 │
                 ▼
      ┌────────────────────────┐
      │  Feed-Forward Network  │
      └───────────┬────────────┘
                  │
                  ▼
          ┌─────────────┐
          │ Add & Norm  │  ← Pattern #2
          └──────┬──────┘
                 │
                 ▼
┌───────────────────────────────────┐
│              Output               │
└───────────────────────────────────┘
```

4.2 Post-LN vs Pre-LN
Post-LN (Original Transformer)
LayerNorm after the residual addition:
```python
output = LayerNorm(x + Sublayer(x))
```

Diagram:

```
x ───────────────┐
    │            │
    ▼            │
Sublayer         │
    │            │
    ▼            │
   Add ◄─────────┘
    │
    ▼
LayerNorm
    │
    ▼
 output
```

Pre-LN (Modern Variant)
LayerNorm before the sublayer:
```python
output = x + Sublayer(LayerNorm(x))
```

Diagram:

```
x ───────────────┐
    │            │
    ▼            │
LayerNorm        │
    │            │
    ▼            │
Sublayer         │
    │            │
    ▼            │
   Add ◄─────────┘
    │
    ▼
 output
```

Comparison Table
| Aspect | Post-LN | Pre-LN |
|---|---|---|
| Paper | Original Transformer | Later variants |
| Gradient flow | Through LayerNorm | Direct through residual |
| Training stability | Needs warmup | More stable |
| Final LayerNorm | Not needed | Needed at end |
| Used by | Original BERT | GPT-2, GPT-3, LLaMA |
4.3 Why Pre-LN is Often Better
Gradient Flow Analysis
Post-LN gradient path:
```
Loss → LayerNorm → Add → Sublayer → ... → Input
           ↑
   Must pass through LayerNorm!
```

Pre-LN gradient path:

```
Loss → Add → ... → Input
        ↑
  Direct path through addition!
```

Pre-LN provides a cleaner gradient highway.
Training Stability
```python
import torch
import torch.nn as nn


def compare_training_stability():
    """Compare Post-LN vs Pre-LN training behavior."""

    class PostLNBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.ffn(x))

    class PreLNBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return x + self.ffn(self.norm(x))

    d_model = 256
    depth = 24

    # Create deep networks
    post_ln = nn.Sequential(*[PostLNBlock(d_model) for _ in range(depth)])
    pre_ln = nn.Sequential(*[PreLNBlock(d_model) for _ in range(depth)])

    # Forward pass
    x = torch.randn(2, 10, d_model, requires_grad=True)

    # Post-LN
    y_post = post_ln(x.clone())
    loss_post = y_post.sum()
    loss_post.backward()
    grad_post = x.grad.norm().item()

    x.grad = None

    # Pre-LN
    y_pre = pre_ln(x.clone())
    loss_pre = y_pre.sum()
    loss_pre.backward()
    grad_pre = x.grad.norm().item()

    print(f"Network depth: {depth} blocks")
    print(f"\nPost-LN gradient norm: {grad_post:.4f}")
    print(f"Pre-LN gradient norm:  {grad_pre:.4f}")
    print(f"\nOutput variance (Post-LN): {y_post.var():.4f}")
    print(f"Output variance (Pre-LN):  {y_pre.var():.4f}")


compare_training_stability()
```

Warmup Requirements
Post-LN typically needs:
- Learning rate warmup (1000-4000 steps)
- Careful initialization
- Lower initial learning rate
Pre-LN often works without:
- Can skip warmup entirely
- More robust to initialization
- Higher learning rates possible
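These warmup requirements can be sketched concretely. Below is a minimal linear-warmup schedule using PyTorch's `LambdaLR`; the 4000-step horizon and the stand-in one-layer model are illustrative assumptions, not a full training recipe:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for a Post-LN transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 4000  # typical range for Post-LN: 1000-4000


def warmup_factor(step: int) -> float:
    # Scale the LR linearly from near 0 up to its full value over
    # warmup_steps, then hold it (real schedules usually decay afterwards).
    return min(1.0, (step + 1) / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(5):
    optimizer.step()  # parameter update would happen here
    scheduler.step()
    print(f"step {step}: lr = {optimizer.param_groups[0]['lr']:.2e}")
```

Pre-LN models, by contrast, can usually start at the target learning rate and skip this schedule entirely.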
4.4 Implementation: Post-LN
SublayerConnection (Post-LN)
```python
import torch
import torch.nn as nn
from typing import Callable


class PostLNSublayerConnection(nn.Module):
    """
    Post-LayerNorm sublayer connection.

    Applies: LayerNorm(x + Dropout(Sublayer(x)))

    This is the original Transformer architecture.

    Args:
        d_model: Model dimension
        dropout: Dropout probability
    """

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()

        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual and normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function that transforms x

        Returns:
            Normalized output [batch, seq_len, d_model]
        """
        # Residual + LayerNorm
        return self.norm(x + self.dropout(sublayer(x)))


# Test
def test_post_ln():
    d_model = 512
    batch_size = 2
    seq_len = 10

    sublayer_conn = PostLNSublayerConnection(d_model, dropout=0.1)

    x = torch.randn(batch_size, seq_len, d_model)

    # Define a simple sublayer (could be attention or FFN)
    sublayer = lambda t: t * 0.5 + 0.1

    output = sublayer_conn(x, sublayer)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Output mean: {output.mean(dim=-1).mean():.4f} (≈ 0 from LayerNorm)")
    print(f"Output std: {output.std(dim=-1).mean():.4f} (≈ 1 from LayerNorm)")

    print("\n✓ Post-LN test passed!")


test_post_ln()
```

4.5 Implementation: Pre-LN
SublayerConnection (Pre-LN)
```python
import torch
import torch.nn as nn
from typing import Callable


class PreLNSublayerConnection(nn.Module):
    """
    Pre-LayerNorm sublayer connection.

    Applies: x + Dropout(Sublayer(LayerNorm(x)))

    This is the modern variant used by GPT-2, GPT-3, etc.

    Args:
        d_model: Model dimension
        dropout: Dropout probability
    """

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()

        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual and pre-normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function that transforms normalized x

        Returns:
            Output [batch, seq_len, d_model]
        """
        # Pre-norm, then residual
        return x + self.dropout(sublayer(self.norm(x)))


# Test
def test_pre_ln():
    d_model = 512
    batch_size = 2
    seq_len = 10

    sublayer_conn = PreLNSublayerConnection(d_model, dropout=0.1)

    x = torch.randn(batch_size, seq_len, d_model)
    sublayer = lambda t: t * 0.5

    output = sublayer_conn(x, sublayer)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")

    # Note: Pre-LN output is NOT normalized (residual added after)
    print(f"Input std: {x.std(dim=-1).mean():.4f}")
    print(f"Output std: {output.std(dim=-1).mean():.4f}")

    print("\n✓ Pre-LN test passed!")


test_pre_ln()
```

4.6 Flexible Implementation
Configurable SublayerConnection
```python
import torch
import torch.nn as nn
from typing import Callable


class SublayerConnection(nn.Module):
    """
    Flexible sublayer connection supporting both Post-LN and Pre-LN.

    Post-LN: LayerNorm(x + Sublayer(x))
    Pre-LN:  x + Sublayer(LayerNorm(x))

    Args:
        d_model: Model dimension
        dropout: Dropout probability
        pre_norm: Use Pre-LN if True, Post-LN if False

    Example:
        >>> conn = SublayerConnection(512, dropout=0.1, pre_norm=True)
        >>> output = conn(x, attention_sublayer)
    """

    def __init__(
        self,
        d_model: int,
        dropout: float = 0.1,
        pre_norm: bool = True
    ):
        super().__init__()

        self.pre_norm = pre_norm
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual connection and normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function/module to apply

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        if self.pre_norm:
            # Pre-LN: x + Sublayer(LayerNorm(x))
            return x + self.dropout(sublayer(self.norm(x)))
        else:
            # Post-LN: LayerNorm(x + Sublayer(x))
            return self.norm(x + self.dropout(sublayer(x)))


# Test both modes
def test_flexible_sublayer():
    d_model = 512
    x = torch.randn(2, 10, d_model)
    sublayer = nn.Linear(d_model, d_model)

    # Post-LN
    post_ln = SublayerConnection(d_model, pre_norm=False)
    post_output = post_ln(x, sublayer)

    # Pre-LN
    pre_ln = SublayerConnection(d_model, pre_norm=True)
    pre_output = pre_ln(x, sublayer)

    print("Post-LN output stats:")
    print(f"  Mean: {post_output.mean(dim=-1).mean():.4f}")
    print(f"  Std:  {post_output.std(dim=-1).mean():.4f}")

    print("\nPre-LN output stats:")
    print(f"  Mean: {pre_output.mean(dim=-1).mean():.4f}")
    print(f"  Std:  {pre_output.std(dim=-1).mean():.4f}")

    # Note the difference!
    print("\nObservation: Post-LN output is normalized, Pre-LN is not")

    print("\n✓ Flexible sublayer test passed!")


test_flexible_sublayer()
```

4.7 Pre-LN Requires Final Norm
The Issue
With Pre-LN, the final output isn't normalized:
```
Pre-LN Layer 1: x + Sublayer(LayerNorm(x))  ← Not normalized
Pre-LN Layer 2: x + Sublayer(LayerNorm(x))  ← Not normalized
...
Pre-LN Layer N: x + Sublayer(LayerNorm(x))  ← STILL not normalized!
```

The Solution
Add a final LayerNorm after all layers:
```python
class PreLNTransformer(nn.Module):
    """
    Pre-LN transformer with final normalization.
    """

    def __init__(self, d_model, num_layers):
        super().__init__()

        self.layers = nn.ModuleList([
            # Any Pre-LN layer works here, e.g. TransformerLayerBase below
            TransformerLayer(d_model, pre_norm=True)
            for _ in range(num_layers)
        ])

        # CRITICAL: Final LayerNorm for Pre-LN
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)

        # Apply final normalization
        x = self.final_norm(x)

        return x
```

Post-LN doesn't need this because each layer's output is already normalized.
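The need for that final norm is easy to verify numerically. The sketch below stacks toy FFN-only Pre-LN blocks (the sizes and depth are arbitrary choices for this demo, not taken from any real model) and shows the residual stream's standard deviation drifting upward with depth until a final `LayerNorm` restores it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, depth = 64, 16


class ToyPreLNBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Pre-LN: the residual path bypasses the norm entirely,
        # so each block adds un-normalized output to the stream
        return x + self.ffn(self.norm(x))


x = torch.randn(2, 10, d_model)
print(f"input std: {x.std().item():.3f}")

with torch.no_grad():
    for block in [ToyPreLNBlock() for _ in range(depth)]:
        x = block(x)
    print(f"after {depth} Pre-LN blocks: std = {x.std().item():.3f}")  # grows with depth

    final_norm = nn.LayerNorm(d_model)
    y = final_norm(x)
    print(f"after final LayerNorm: std = {y.std().item():.3f}")  # back to ~1
```

Each residual addition contributes extra variance that nothing downstream removes, which is exactly why the final norm is part of the Pre-LN recipe.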
4.8 Complete Transformer Layer
Putting It All Together
```python
import torch
import torch.nn as nn
from typing import Optional


class TransformerLayerBase(nn.Module):
    """
    Base transformer layer with configurable normalization.

    Contains:
    - Multi-head self-attention with Add & Norm
    - Feed-forward network with Add & Norm

    Args:
        d_model: Model dimension
        num_heads: Number of attention heads
        d_ff: Feed-forward inner dimension
        dropout: Dropout probability
        pre_norm: Use Pre-LN if True, Post-LN if False
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
        pre_norm: bool = True
    ):
        super().__init__()

        self.pre_norm = pre_norm

        # Attention sublayer (placeholder - will be implemented properly later)
        self.self_attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )

        # Feed-forward sublayer
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

        # Layer norms
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Forward pass through transformer layer.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            mask: Optional attention mask

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        if self.pre_norm:
            # Pre-LN variant
            # Attention block
            normed = self.norm1(x)
            attn_out, _ = self.self_attention(normed, normed, normed, attn_mask=mask)
            x = x + self.dropout(attn_out)

            # FFN block
            normed = self.norm2(x)
            ff_out = self.feed_forward(normed)
            x = x + self.dropout(ff_out)
        else:
            # Post-LN variant
            # Attention block
            attn_out, _ = self.self_attention(x, x, x, attn_mask=mask)
            x = self.norm1(x + self.dropout(attn_out))

            # FFN block
            ff_out = self.feed_forward(x)
            x = self.norm2(x + self.dropout(ff_out))

        return x


# Test
def test_transformer_layer():
    d_model = 512
    num_heads = 8
    d_ff = 2048
    batch_size = 2
    seq_len = 10

    # Test Pre-LN
    pre_ln_layer = TransformerLayerBase(
        d_model, num_heads, d_ff, pre_norm=True
    )
    x = torch.randn(batch_size, seq_len, d_model)
    pre_output = pre_ln_layer(x)

    print("Pre-LN Transformer Layer:")
    print(f"  Input shape: {x.shape}")
    print(f"  Output shape: {pre_output.shape}")
    print(f"  Parameters: {sum(p.numel() for p in pre_ln_layer.parameters()):,}")

    # Test Post-LN
    post_ln_layer = TransformerLayerBase(
        d_model, num_heads, d_ff, pre_norm=False
    )
    post_output = post_ln_layer(x)

    print("\nPost-LN Transformer Layer:")
    print(f"  Output shape: {post_output.shape}")

    print("\n✓ Transformer layer test passed!")


test_transformer_layer()
```

Output:
```
Pre-LN Transformer Layer:
  Input shape: torch.Size([2, 10, 512])
  Output shape: torch.Size([2, 10, 512])
  Parameters: 3,152,384

Post-LN Transformer Layer:
  Output shape: torch.Size([2, 10, 512])

✓ Transformer layer test passed!
```

Summary
The Add & Norm Pattern
Post-LN (Original):
```python
output = LayerNorm(x + Sublayer(x))
```

Pre-LN (Modern):

```python
output = x + Sublayer(LayerNorm(x))
```

When to Use Each
| Scenario | Recommendation |
|---|---|
| Reproducing original BERT | Post-LN |
| New projects | Pre-LN |
| Deep networks (>24 layers) | Pre-LN |
| Training without warmup | Pre-LN |
| Research comparison | Try both |
Implementation Checklist
- Choose Pre-LN or Post-LN
- Apply pattern to both attention and FFN
- Add dropout before residual addition
- For Pre-LN: add final LayerNorm after all layers
- For Post-LN: consider learning rate warmup
Chapter Summary
Components Covered
| Component | Purpose |
|---|---|
| FFN | Per-position non-linear transformation |
| LayerNorm | Normalize across features |
| Residual | Enable gradient flow |
| Add & Norm | Combine all three |
Ready for Encoder/Decoder
With these components, we now have everything needed to build:
- Complete encoder layers
- Complete decoder layers
- Full transformer architecture
Exercises
Implementation Exercises
1. Implement a transformer layer that can switch between Pre-LN and Post-LN at runtime.
2. Create a "sandwich" norm variant: x + LayerNorm(Sublayer(LayerNorm(x))).
3. Add support for RMSNorm instead of LayerNorm.
Analysis Exercises
4. Train identical models with Pre-LN vs Post-LN. Compare:
- Training curves
- Final performance
- Sensitivity to learning rate
5. Visualize the gradient norms through layers for both variants.
6. Experiment with different dropout placements in the Add & Norm pattern.
Conceptual Questions
7. Why does Pre-LN need a final LayerNorm but Post-LN doesn't?
8. How does the Add & Norm pattern affect the effective depth of the network?
Next Chapter Preview
In Chapter 7, we'll build the Transformer Encoder by stacking multiple encoder layers. We'll implement the complete encoder architecture that processes source sequences in our translation model.