Chapter 6

The Add and Norm Pattern

Feed Forward and Normalization

Introduction

Transformers combine residual connections and layer normalization into a standard pattern called Add & Norm. This pattern appears around every sublayer (attention and FFN) in the architecture.

This section covers both the original Post-LN and modern Pre-LN variants, with complete implementations.


4.1 The Standard Pattern

Add & Norm Explained

Every transformer sublayer follows this pattern:

šŸ“text
1Input → Sublayer → Residual Add → Layer Norm → Output

Or in code:

šŸpython
1output = LayerNorm(x + Sublayer(x))

Where It Appears

In each transformer layer:

šŸ“text
1ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
2│                     Input                        │
3ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
4                     │
5                     ā–¼
6         ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
7         │  Multi-Head Attention │
8         ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
9                     │
10                     ā–¼
11              ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
12              │ Add & Norm  │  ← Pattern #1
13              ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
14                     │
15                     ā–¼
16         ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
17         │  Feed-Forward Network │
18         ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
19                     │
20                     ā–¼
21              ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
22              │ Add & Norm  │  ← Pattern #2
23              ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
24                     │
25                     ā–¼
26ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
27│                    Output                        │
28ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

4.2 Post-LN vs Pre-LN

Post-LN (Original Transformer)

LayerNorm after the residual addition:

šŸ“text
1output = LayerNorm(x + Sublayer(x))

Diagram:

šŸ“text
1x ──────────────┐
2    │               │
3    ā–¼               │
4 Sublayer          │
5    │               │
6    ā–¼               │
7   Add ā—„ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
8    │
9    ā–¼
10LayerNorm
11    │
12    ā–¼
13  output

Pre-LN (Modern Variant)

LayerNorm before the sublayer:

šŸ“text
1output = x + Sublayer(LayerNorm(x))

Diagram:

šŸ“text
1x ──────────────┐
2    │               │
3    ā–¼               │
4LayerNorm          │
5    │               │
6    ā–¼               │
7 Sublayer          │
8    │               │
9    ā–¼               │
10   Add ā—„ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
11    │
12    ā–¼
13  output

Comparison Table

| Aspect | Post-LN | Pre-LN |
|---|---|---|
| Paper | Original Transformer | Later variants |
| Gradient flow | Through LayerNorm | Direct through residual |
| Training stability | Needs warmup | More stable |
| Final LayerNorm | Not needed | Needed at end |
| Used by | Original BERT | GPT-2, GPT-3, LLaMA |

4.3 Why Pre-LN is Often Better

Gradient Flow Analysis

Post-LN gradient path:

šŸ“text
1Loss → LayerNorm → Add → Sublayer → ... → Input
2       ↑
3       Must pass through LayerNorm!

Pre-LN gradient path:

šŸ“text
1Loss → Add → ... → Input
2       ↑
3       Direct path through addition!

Pre-LN provides a cleaner gradient highway.
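This identity path can be checked numerically. In the toy sketch below (an illustration, not part of the architecture), the sublayer's weights are zeroed out, so the Pre-LN block reduces to `y = x` and the gradient reaching the input is exactly the upstream gradient, with no LayerNorm rescaling in the way:

```python
import torch
import torch.nn as nn

# Sketch: verify the identity gradient path of a Pre-LN block.
# With the sublayer zeroed out, y = x + 0, so dLoss/dx equals the
# upstream gradient exactly -- no LayerNorm rescales it.
torch.manual_seed(0)
d_model = 16

norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 5, d_model, requires_grad=True)
y = x + sublayer(norm(x))   # Pre-LN block
y.sum().backward()          # upstream gradient is all ones

print(torch.allclose(x.grad, torch.ones_like(x)))  # identity path intact
```

In a Post-LN block, by contrast, every gradient must pass through the LayerNorm's Jacobian, which rescales it at every layer.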

Training Stability

šŸpython
1import torch
2import torch.nn as nn
3
4
5def compare_training_stability():
6    """Compare Post-LN vs Pre-LN training behavior."""
7
8    class PostLNBlock(nn.Module):
9        def __init__(self, d_model):
10            super().__init__()
11            self.ffn = nn.Sequential(
12                nn.Linear(d_model, d_model * 4),
13                nn.ReLU(),
14                nn.Linear(d_model * 4, d_model),
15            )
16            self.norm = nn.LayerNorm(d_model)
17
18        def forward(self, x):
19            return self.norm(x + self.ffn(x))
20
21    class PreLNBlock(nn.Module):
22        def __init__(self, d_model):
23            super().__init__()
24            self.ffn = nn.Sequential(
25                nn.Linear(d_model, d_model * 4),
26                nn.ReLU(),
27                nn.Linear(d_model * 4, d_model),
28            )
29            self.norm = nn.LayerNorm(d_model)
30
31        def forward(self, x):
32            return x + self.ffn(self.norm(x))
33
34    d_model = 256
35    depth = 24
36
37    # Create deep networks
38    post_ln = nn.Sequential(*[PostLNBlock(d_model) for _ in range(depth)])
39    pre_ln = nn.Sequential(*[PreLNBlock(d_model) for _ in range(depth)])
40
41    # Forward pass
42    x = torch.randn(2, 10, d_model, requires_grad=True)
43
44    # Post-LN
45    y_post = post_ln(x.clone())
46    loss_post = y_post.sum()
47    loss_post.backward()
48    grad_post = x.grad.norm().item()
49
50    x.grad = None
51
52    # Pre-LN
53    y_pre = pre_ln(x.clone())
54    loss_pre = y_pre.sum()
55    loss_pre.backward()
56    grad_pre = x.grad.norm().item()
57
58    print(f"Network depth: {depth} blocks")
59    print(f"\nPost-LN gradient norm: {grad_post:.4f}")
60    print(f"Pre-LN gradient norm:  {grad_pre:.4f}")
61    print(f"\nOutput variance (Post-LN): {y_post.var():.4f}")
62    print(f"Output variance (Pre-LN):  {y_pre.var():.4f}")
63
64
65compare_training_stability()

Warmup Requirements

Post-LN typically needs:

  • Learning rate warmup (1000-4000 steps)
  • Careful initialization
  • Lower initial learning rate

Pre-LN often works without:

  • Can skip warmup entirely
  • More robust to initialization
  • Higher learning rates possible
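As a concrete sketch of what warmup looks like in practice, here is a linear warmup schedule built with `torch.optim.lr_scheduler.LambdaLR`. The optimizer choice, `base_lr`, and `warmup_steps` values are illustrative, not recommendations:

```python
import torch
import torch.nn as nn

# Sketch: linear learning-rate warmup, as typically paired with Post-LN.
# `base_lr` and `warmup_steps` are illustrative values.
model = nn.Linear(512, 512)
base_lr = 1e-3
warmup_steps = 4000

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Ramp the LR multiplier linearly from ~0 to 1, then hold.
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

lrs = []
for _ in range(10):
    # (a real training step would run forward + loss.backward() here)
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0] < lrs[-1])  # learning rate is still ramping up
```

During the first `warmup_steps` updates the effective learning rate stays well below `base_lr`, which is exactly the regime in which Post-LN models avoid early divergence.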

4.4 Implementation: Post-LN

SublayerConnection (Post-LN)

šŸpython
1import torch
2import torch.nn as nn
3from typing import Callable, Optional
4
5
6class PostLNSublayerConnection(nn.Module):
7    """
8    Post-LayerNorm sublayer connection.
9
10    Applies: LayerNorm(x + Dropout(Sublayer(x)))
11
12    This is the original Transformer architecture.
13
14    Args:
15        d_model: Model dimension
16        dropout: Dropout probability
17    """
18
19    def __init__(self, d_model: int, dropout: float = 0.1):
20        super().__init__()
21
22        self.norm = nn.LayerNorm(d_model)
23        self.dropout = nn.Dropout(dropout)
24
25    def forward(
26        self,
27        x: torch.Tensor,
28        sublayer: Callable[[torch.Tensor], torch.Tensor]
29    ) -> torch.Tensor:
30        """
31        Apply sublayer with residual and normalization.
32
33        Args:
34            x: Input tensor [batch, seq_len, d_model]
35            sublayer: Function that transforms x
36
37        Returns:
38            Normalized output [batch, seq_len, d_model]
39        """
40        # Residual + LayerNorm
41        return self.norm(x + self.dropout(sublayer(x)))
42
43
44# Test
45def test_post_ln():
46    d_model = 512
47    batch_size = 2
48    seq_len = 10
49
50    sublayer_conn = PostLNSublayerConnection(d_model, dropout=0.1)
51
52    x = torch.randn(batch_size, seq_len, d_model)
53
54    # Define a simple sublayer (could be attention or FFN)
55    sublayer = lambda t: t * 0.5 + 0.1
56
57    output = sublayer_conn(x, sublayer)
58
59    print(f"Input shape: {x.shape}")
60    print(f"Output shape: {output.shape}")
61    print(f"Output mean: {output.mean(dim=-1).mean():.4f} (ā‰ˆ 0 from LayerNorm)")
62    print(f"Output std: {output.std(dim=-1).mean():.4f} (ā‰ˆ 1 from LayerNorm)")
63
64    print("\nāœ“ Post-LN test passed!")
65
66
67test_post_ln()

4.5 Implementation: Pre-LN

SublayerConnection (Pre-LN)

šŸpython
1class PreLNSublayerConnection(nn.Module):
2    """
3    Pre-LayerNorm sublayer connection.
4
5    Applies: x + Dropout(Sublayer(LayerNorm(x)))
6
7    This is the modern variant used by GPT-2, GPT-3, etc.
8
9    Args:
10        d_model: Model dimension
11        dropout: Dropout probability
12    """
13
14    def __init__(self, d_model: int, dropout: float = 0.1):
15        super().__init__()
16
17        self.norm = nn.LayerNorm(d_model)
18        self.dropout = nn.Dropout(dropout)
19
20    def forward(
21        self,
22        x: torch.Tensor,
23        sublayer: Callable[[torch.Tensor], torch.Tensor]
24    ) -> torch.Tensor:
25        """
26        Apply sublayer with residual and pre-normalization.
27
28        Args:
29            x: Input tensor [batch, seq_len, d_model]
30            sublayer: Function that transforms normalized x
31
32        Returns:
33            Output [batch, seq_len, d_model]
34        """
35        # Pre-norm, then residual
36        return x + self.dropout(sublayer(self.norm(x)))
37
38
39# Test
40def test_pre_ln():
41    d_model = 512
42    batch_size = 2
43    seq_len = 10
44
45    sublayer_conn = PreLNSublayerConnection(d_model, dropout=0.1)
46
47    x = torch.randn(batch_size, seq_len, d_model)
48    sublayer = lambda t: t * 0.5
49
50    output = sublayer_conn(x, sublayer)
51
52    print(f"Input shape: {x.shape}")
53    print(f"Output shape: {output.shape}")
54
55    # Note: Pre-LN output is NOT normalized (residual added after)
56    print(f"Input std: {x.std(dim=-1).mean():.4f}")
57    print(f"Output std: {output.std(dim=-1).mean():.4f}")
58
59    print("\nāœ“ Pre-LN test passed!")
60
61
62test_pre_ln()

4.6 Flexible Implementation

Configurable SublayerConnection

šŸpython
1class SublayerConnection(nn.Module):
2    """
3    Flexible sublayer connection supporting both Post-LN and Pre-LN.
4
5    Post-LN: LayerNorm(x + Sublayer(x))
6    Pre-LN:  x + Sublayer(LayerNorm(x))
7
8    Args:
9        d_model: Model dimension
10        dropout: Dropout probability
11        pre_norm: Use Pre-LN if True, Post-LN if False
12
13    Example:
14        >>> conn = SublayerConnection(512, dropout=0.1, pre_norm=True)
15        >>> output = conn(x, attention_sublayer)
16    """
17
18    def __init__(
19        self,
20        d_model: int,
21        dropout: float = 0.1,
22        pre_norm: bool = True
23    ):
24        super().__init__()
25
26        self.pre_norm = pre_norm
27        self.norm = nn.LayerNorm(d_model)
28        self.dropout = nn.Dropout(dropout)
29
30    def forward(
31        self,
32        x: torch.Tensor,
33        sublayer: Callable[[torch.Tensor], torch.Tensor]
34    ) -> torch.Tensor:
35        """
36        Apply sublayer with residual connection and normalization.
37
38        Args:
39            x: Input tensor [batch, seq_len, d_model]
40            sublayer: Function/module to apply
41
42        Returns:
43            Output tensor [batch, seq_len, d_model]
44        """
45        if self.pre_norm:
46            # Pre-LN: x + Sublayer(LayerNorm(x))
47            return x + self.dropout(sublayer(self.norm(x)))
48        else:
49            # Post-LN: LayerNorm(x + Sublayer(x))
50            return self.norm(x + self.dropout(sublayer(x)))
51
52
53# Test both modes
54def test_flexible_sublayer():
55    d_model = 512
56    x = torch.randn(2, 10, d_model)
57    sublayer = nn.Linear(d_model, d_model)
58
59    # Post-LN
60    post_ln = SublayerConnection(d_model, pre_norm=False)
61    post_output = post_ln(x, sublayer)
62
63    # Pre-LN
64    pre_ln = SublayerConnection(d_model, pre_norm=True)
65    pre_output = pre_ln(x, sublayer)
66
67    print("Post-LN output stats:")
68    print(f"  Mean: {post_output.mean(dim=-1).mean():.4f}")
69    print(f"  Std:  {post_output.std(dim=-1).mean():.4f}")
70
71    print("\nPre-LN output stats:")
72    print(f"  Mean: {pre_output.mean(dim=-1).mean():.4f}")
73    print(f"  Std:  {pre_output.std(dim=-1).mean():.4f}")
74
75    # Note the difference!
76    print("\nObservation: Post-LN output is normalized, Pre-LN is not")
77
78    print("\nāœ“ Flexible sublayer test passed!")
79
80
81test_flexible_sublayer()

4.7 Pre-LN Requires Final Norm

The Issue

With Pre-LN, the final output isn't normalized:

šŸ“text
1Pre-LN Layer 1: x + Sublayer(LayerNorm(x))  → Not normalized
2Pre-LN Layer 2: x + Sublayer(LayerNorm(x))  → Not normalized
3...
4Pre-LN Layer N: x + Sublayer(LayerNorm(x))  → STILL not normalized!

The Solution

Add a final LayerNorm after all layers:

šŸpython
1class PreLNTransformer(nn.Module):
2    """
3    Pre-LN transformer with final normalization.
4    """
5
6    def __init__(self, d_model, num_layers):
7        super().__init__()
8
9        self.layers = nn.ModuleList([
10            TransformerLayer(d_model, pre_norm=True)
11            for _ in range(num_layers)
12        ])
13
14        # CRITICAL: Final LayerNorm for Pre-LN
15        self.final_norm = nn.LayerNorm(d_model)
16
17    def forward(self, x):
18        for layer in self.layers:
19            x = layer(x)
20
21        # Apply final normalization
22        x = self.final_norm(x)
23
24        return x

Post-LN doesn't need this because each layer's output is already normalized.
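A quick sketch makes the issue concrete: stacking Pre-LN residual updates lets the hidden state's scale drift upward with depth, and a single final LayerNorm restores unit scale. The one-Linear "sublayer" below is a stand-in for brevity, not a full transformer sublayer:

```python
import torch
import torch.nn as nn

# Sketch: the residual stream's scale grows with depth under Pre-LN;
# one final LayerNorm brings it back to unit scale.
torch.manual_seed(0)
d_model, depth = 64, 12

norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
ffns = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(depth))
final_norm = nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)
with torch.no_grad():
    for norm, ffn in zip(norms, ffns):
        x = x + ffn(norm(x))          # Pre-LN residual update
    before = x.std(dim=-1).mean().item()
    after = final_norm(x).std(dim=-1).mean().item()

print(f"std before final norm: {before:.3f}")   # noticeably above 1
print(f"std after final norm:  {after:.3f}")    # back to ~1
```

Each block adds a fresh contribution on top of the residual stream, so the stream's variance accumulates layer by layer; only the final norm undoes this before the output projection.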


4.8 Complete Transformer Layer

Putting It All Together

šŸpython
1import torch
2import torch.nn as nn
3from typing import Optional, Callable
4
5
6class TransformerLayerBase(nn.Module):
7    """
8    Base transformer layer with configurable normalization.
9
10    Contains:
11    - Multi-head self-attention with Add & Norm
12    - Feed-forward network with Add & Norm
13
14    Args:
15        d_model: Model dimension
16        num_heads: Number of attention heads
17        d_ff: Feed-forward inner dimension
18        dropout: Dropout probability
19        pre_norm: Use Pre-LN if True, Post-LN if False
20    """
21
22    def __init__(
23        self,
24        d_model: int,
25        num_heads: int,
26        d_ff: int,
27        dropout: float = 0.1,
28        pre_norm: bool = True
29    ):
30        super().__init__()
31
32        self.pre_norm = pre_norm
33
34        # Attention sublayer (placeholder - will be implemented properly later)
35        self.self_attention = nn.MultiheadAttention(
36            d_model, num_heads, dropout=dropout, batch_first=True
37        )
38
39        # Feed-forward sublayer
40        self.feed_forward = nn.Sequential(
41            nn.Linear(d_model, d_ff),
42            nn.GELU(),
43            nn.Dropout(dropout),
44            nn.Linear(d_ff, d_model),
45        )
46
47        # Layer norms
48        self.norm1 = nn.LayerNorm(d_model)
49        self.norm2 = nn.LayerNorm(d_model)
50
51        # Dropout
52        self.dropout = nn.Dropout(dropout)
53
54    def forward(
55        self,
56        x: torch.Tensor,
57        mask: Optional[torch.Tensor] = None
58    ) -> torch.Tensor:
59        """
60        Forward pass through transformer layer.
61
62        Args:
63            x: Input tensor [batch, seq_len, d_model]
64            mask: Optional attention mask
65
66        Returns:
67            Output tensor [batch, seq_len, d_model]
68        """
69        if self.pre_norm:
70            # Pre-LN variant
71            # Attention block
72            normed = self.norm1(x)
73            attn_out, _ = self.self_attention(normed, normed, normed, attn_mask=mask)
74            x = x + self.dropout(attn_out)
75
76            # FFN block
77            normed = self.norm2(x)
78            ff_out = self.feed_forward(normed)
79            x = x + self.dropout(ff_out)
80        else:
81            # Post-LN variant
82            # Attention block
83            attn_out, _ = self.self_attention(x, x, x, attn_mask=mask)
84            x = self.norm1(x + self.dropout(attn_out))
85
86            # FFN block
87            ff_out = self.feed_forward(x)
88            x = self.norm2(x + self.dropout(ff_out))
89
90        return x
91
92
93# Test
94def test_transformer_layer():
95    d_model = 512
96    num_heads = 8
97    d_ff = 2048
98    batch_size = 2
99    seq_len = 10
100
101    # Test Pre-LN
102    pre_ln_layer = TransformerLayerBase(
103        d_model, num_heads, d_ff, pre_norm=True
104    )
105    x = torch.randn(batch_size, seq_len, d_model)
106    pre_output = pre_ln_layer(x)
107
108    print("Pre-LN Transformer Layer:")
109    print(f"  Input shape: {x.shape}")
110    print(f"  Output shape: {pre_output.shape}")
111    print(f"  Parameters: {sum(p.numel() for p in pre_ln_layer.parameters()):,}")
112
113    # Test Post-LN
114    post_ln_layer = TransformerLayerBase(
115        d_model, num_heads, d_ff, pre_norm=False
116    )
117    post_output = post_ln_layer(x)
118
119    print("\nPost-LN Transformer Layer:")
120    print(f"  Output shape: {post_output.shape}")
121
122    print("\nāœ“ Transformer layer test passed!")
123
124
125test_transformer_layer()

Output:

šŸ“text
1Pre-LN Transformer Layer:
2  Input shape: torch.Size([2, 10, 512])
3  Output shape: torch.Size([2, 10, 512])
4  Parameters: 5,248,512
5
6Post-LN Transformer Layer:
7  Output shape: torch.Size([2, 10, 512])
8
9āœ“ Transformer layer test passed!

Summary

The Add & Norm Pattern

Post-LN (Original):

šŸpython
1output = LayerNorm(x + Sublayer(x))

Pre-LN (Modern):

šŸpython
1output = x + Sublayer(LayerNorm(x))

When to Use Each

| Scenario | Recommendation |
|---|---|
| Reproducing original BERT | Post-LN |
| New projects | Pre-LN |
| Deep networks (>24 layers) | Pre-LN |
| Training without warmup | Pre-LN |
| Research comparison | Try both |

Implementation Checklist

  • Choose Pre-LN or Post-LN
  • Apply pattern to both attention and FFN
  • Add dropout before residual addition
  • For Pre-LN: add final LayerNorm after all layers
  • For Post-LN: consider learning rate warmup

Chapter Summary

Components Covered

| Component | Purpose |
|---|---|
| FFN | Per-position non-linear transformation |
| LayerNorm | Normalize across features |
| Residual | Enable gradient flow |
| Add & Norm | Combine all three |

Ready for Encoder/Decoder

With these components, we now have everything needed to build:

  • Complete encoder layers
  • Complete decoder layers
  • Full transformer architecture

Exercises

Implementation Exercises

1. Implement a transformer layer that can switch between Pre-LN and Post-LN at runtime.

2. Create a "sandwich" norm variant: x + LayerNorm(Sublayer(LayerNorm(x))).

3. Add support for RMSNorm instead of LayerNorm.

Analysis Exercises

4. Train identical models with Pre-LN vs Post-LN. Compare:

  • Training curves
  • Final performance
  • Sensitivity to learning rate

5. Visualize the gradient norms through layers for both variants.

6. Experiment with different dropout placements in the Add & Norm pattern.

Conceptual Questions

7. Why does Pre-LN need a final LayerNorm but Post-LN doesn't?

8. How does the Add & Norm pattern affect the effective depth of the network?


Next Chapter Preview

In Chapter 7, we'll build the Transformer Encoder by stacking multiple encoder layers. We'll implement the complete encoder architecture that processes source sequences in our translation model.