Introduction
Transformers combine residual connections and layer normalization into a standard pattern called Add & Norm. This pattern appears around every sublayer (attention and FFN) in the architecture.
This section covers both the original Post-LN and modern Pre-LN variants, with complete implementations.
4.1 The Standard Pattern
Add & Norm Explained
Every transformer sublayer follows this pattern:
```
Input → Sublayer → Residual Add → Layer Norm → Output
```

Or in code:

```python
output = LayerNorm(x + Sublayer(x))
```

Where It Appears
In each transformer layer:
```
┌───────────────────────────────────┐
│               Input               │
└─────────────────┬─────────────────┘
                  │
                  ▼
      ┌────────────────────────┐
      │  Multi-Head Attention  │
      └───────────┬────────────┘
                  │
                  ▼
          ┌─────────────┐
          │ Add & Norm  │  ← Pattern #1
          └──────┬──────┘
                 │
                 ▼
      ┌────────────────────────┐
      │  Feed-Forward Network  │
      └───────────┬────────────┘
                  │
                  ▼
          ┌─────────────┐
          │ Add & Norm  │  ← Pattern #2
          └──────┬──────┘
                 │
                 ▼
┌───────────────────────────────────┐
│              Output               │
└───────────────────────────────────┘
```

4.2 Post-LN vs Pre-LN
Post-LN (Original Transformer)
LayerNorm after the residual addition:
```python
output = LayerNorm(x + Sublayer(x))
```

Diagram:

```
x ───────────────┐
    │            │
    ▼            │
Sublayer         │
    │            │
    ▼            │
   Add ◄─────────┘
    │
    ▼
LayerNorm
    │
    ▼
 output
```

Pre-LN (Modern Variant)
LayerNorm before the sublayer:
```python
output = x + Sublayer(LayerNorm(x))
```

Diagram:

```
x ───────────────┐
    │            │
    ▼            │
LayerNorm        │
    │            │
    ▼            │
Sublayer         │
    │            │
    ▼            │
   Add ◄─────────┘
    │
    ▼
 output
```

Comparison Table
| Aspect | Post-LN | Pre-LN |
|---|---|---|
| Paper | Original Transformer | Later variants |
| Gradient flow | Through LayerNorm | Direct through residual |
| Training stability | Needs warmup | More stable |
| Final LayerNorm | Not needed | Needed at end |
| Used by | Original BERT | GPT-2, GPT-3, LLaMA |
4.3 Why Pre-LN is Often Better
Gradient Flow Analysis
Post-LN gradient path:
```
Loss → LayerNorm → Add → Sublayer → ... → Input
           ↑
   Must pass through LayerNorm!
```

Pre-LN gradient path:

```
Loss → Add → ... → Input
        ↑
  Direct path through addition!
```

Pre-LN provides a cleaner gradient highway.
Training Stability
```python
import torch
import torch.nn as nn


def compare_training_stability():
    """Compare Post-LN vs Pre-LN training behavior."""

    class PostLNBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.ffn(x))

    class PreLNBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return x + self.ffn(self.norm(x))

    d_model = 256
    depth = 24

    # Create deep networks
    post_ln = nn.Sequential(*[PostLNBlock(d_model) for _ in range(depth)])
    pre_ln = nn.Sequential(*[PreLNBlock(d_model) for _ in range(depth)])

    # Forward pass
    x = torch.randn(2, 10, d_model, requires_grad=True)

    # Post-LN
    y_post = post_ln(x.clone())
    loss_post = y_post.sum()
    loss_post.backward()
    grad_post = x.grad.norm().item()

    x.grad = None

    # Pre-LN
    y_pre = pre_ln(x.clone())
    loss_pre = y_pre.sum()
    loss_pre.backward()
    grad_pre = x.grad.norm().item()

    print(f"Network depth: {depth} blocks")
    print(f"\nPost-LN gradient norm: {grad_post:.4f}")
    print(f"Pre-LN gradient norm:  {grad_pre:.4f}")
    print(f"\nOutput variance (Post-LN): {y_post.var():.4f}")
    print(f"Output variance (Pre-LN):  {y_pre.var():.4f}")


compare_training_stability()
```

Warmup Requirements
Post-LN typically needs:
- Learning rate warmup (1000-4000 steps)
- Careful initialization
- Lower initial learning rate
Pre-LN often works without:
- Can skip warmup entirely
- More robust to initialization
- Higher learning rates possible
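These warmup requirements can be sketched concretely. Below is a minimal linear-warmup schedule using PyTorch's `LambdaLR`; the 4000-step horizon and the stand-in one-layer model are illustrative assumptions, not a full training recipe:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for a Post-LN transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 4000  # typical range for Post-LN: 1000-4000


def warmup_factor(step: int) -> float:
    # Scale the LR linearly from near 0 up to its full value over
    # warmup_steps, then hold it (real schedules usually decay afterwards).
    return min(1.0, (step + 1) / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(5):
    optimizer.step()  # parameter update would happen here
    scheduler.step()
    print(f"step {step}: lr = {optimizer.param_groups[0]['lr']:.2e}")
```

Pre-LN models, by contrast, can usually start at the target learning rate and skip this schedule entirely.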
4.4 Implementation: Post-LN
SublayerConnection (Post-LN)
```python
import torch
import torch.nn as nn
from typing import Callable


class PostLNSublayerConnection(nn.Module):
    """
    Post-LayerNorm sublayer connection.

    Applies: LayerNorm(x + Dropout(Sublayer(x)))

    This is the original Transformer architecture.

    Args:
        d_model: Model dimension
        dropout: Dropout probability
    """

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()

        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual and normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function that transforms x

        Returns:
            Normalized output [batch, seq_len, d_model]
        """
        # Residual + LayerNorm
        return self.norm(x + self.dropout(sublayer(x)))


# Test
def test_post_ln():
    d_model = 512
    batch_size = 2
    seq_len = 10

    sublayer_conn = PostLNSublayerConnection(d_model, dropout=0.1)

    x = torch.randn(batch_size, seq_len, d_model)

    # Define a simple sublayer (could be attention or FFN)
    sublayer = lambda t: t * 0.5 + 0.1

    output = sublayer_conn(x, sublayer)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Output mean: {output.mean(dim=-1).mean():.4f} (≈ 0 from LayerNorm)")
    print(f"Output std: {output.std(dim=-1).mean():.4f} (≈ 1 from LayerNorm)")

    print("\n✓ Post-LN test passed!")


test_post_ln()
```

4.5 Implementation: Pre-LN
SublayerConnection (Pre-LN)
```python
import torch
import torch.nn as nn
from typing import Callable


class PreLNSublayerConnection(nn.Module):
    """
    Pre-LayerNorm sublayer connection.

    Applies: x + Dropout(Sublayer(LayerNorm(x)))

    This is the modern variant used by GPT-2, GPT-3, etc.

    Args:
        d_model: Model dimension
        dropout: Dropout probability
    """

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()

        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual and pre-normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function that transforms normalized x

        Returns:
            Output [batch, seq_len, d_model]
        """
        # Pre-norm, then residual
        return x + self.dropout(sublayer(self.norm(x)))


# Test
def test_pre_ln():
    d_model = 512
    batch_size = 2
    seq_len = 10

    sublayer_conn = PreLNSublayerConnection(d_model, dropout=0.1)

    x = torch.randn(batch_size, seq_len, d_model)
    sublayer = lambda t: t * 0.5

    output = sublayer_conn(x, sublayer)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")

    # Note: Pre-LN output is NOT normalized (residual added after)
    print(f"Input std: {x.std(dim=-1).mean():.4f}")
    print(f"Output std: {output.std(dim=-1).mean():.4f}")

    print("\n✓ Pre-LN test passed!")


test_pre_ln()
```

4.6 Flexible Implementation
Configurable SublayerConnection
```python
import torch
import torch.nn as nn
from typing import Callable


class SublayerConnection(nn.Module):
    """
    Flexible sublayer connection supporting both Post-LN and Pre-LN.

    Post-LN: LayerNorm(x + Sublayer(x))
    Pre-LN:  x + Sublayer(LayerNorm(x))

    Args:
        d_model: Model dimension
        dropout: Dropout probability
        pre_norm: Use Pre-LN if True, Post-LN if False

    Example:
        >>> conn = SublayerConnection(512, dropout=0.1, pre_norm=True)
        >>> output = conn(x, attention_sublayer)
    """

    def __init__(
        self,
        d_model: int,
        dropout: float = 0.1,
        pre_norm: bool = True
    ):
        super().__init__()

        self.pre_norm = pre_norm
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """
        Apply sublayer with residual connection and normalization.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            sublayer: Function/module to apply

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        if self.pre_norm:
            # Pre-LN: x + Sublayer(LayerNorm(x))
            return x + self.dropout(sublayer(self.norm(x)))
        else:
            # Post-LN: LayerNorm(x + Sublayer(x))
            return self.norm(x + self.dropout(sublayer(x)))


# Test both modes
def test_flexible_sublayer():
    d_model = 512
    x = torch.randn(2, 10, d_model)
    sublayer = nn.Linear(d_model, d_model)

    # Post-LN
    post_ln = SublayerConnection(d_model, pre_norm=False)
    post_output = post_ln(x, sublayer)

    # Pre-LN
    pre_ln = SublayerConnection(d_model, pre_norm=True)
    pre_output = pre_ln(x, sublayer)

    print("Post-LN output stats:")
    print(f"  Mean: {post_output.mean(dim=-1).mean():.4f}")
    print(f"  Std:  {post_output.std(dim=-1).mean():.4f}")

    print("\nPre-LN output stats:")
    print(f"  Mean: {pre_output.mean(dim=-1).mean():.4f}")
    print(f"  Std:  {pre_output.std(dim=-1).mean():.4f}")

    # Note the difference!
    print("\nObservation: Post-LN output is normalized, Pre-LN is not")

    print("\n✓ Flexible sublayer test passed!")


test_flexible_sublayer()
```

4.7 Pre-LN Requires Final Norm
The Issue
With Pre-LN, the final output isn't normalized:
```
Pre-LN Layer 1: x + Sublayer(LayerNorm(x))  ← Not normalized
Pre-LN Layer 2: x + Sublayer(LayerNorm(x))  ← Not normalized
...
Pre-LN Layer N: x + Sublayer(LayerNorm(x))  ← STILL not normalized!
```

The Solution
Add a final LayerNorm after all layers:
```python
class PreLNTransformer(nn.Module):
    """
    Pre-LN transformer with final normalization.
    """

    def __init__(self, d_model, num_layers):
        super().__init__()

        self.layers = nn.ModuleList([
            # Any Pre-LN layer works here, e.g. TransformerLayerBase below
            TransformerLayer(d_model, pre_norm=True)
            for _ in range(num_layers)
        ])

        # CRITICAL: Final LayerNorm for Pre-LN
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)

        # Apply final normalization
        x = self.final_norm(x)

        return x
```

Post-LN doesn't need this because each layer's output is already normalized.
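The need for that final norm is easy to verify numerically. The sketch below stacks toy FFN-only Pre-LN blocks (the sizes and depth are arbitrary choices for this demo, not taken from any real model) and shows the residual stream's standard deviation drifting upward with depth until a final `LayerNorm` restores it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, depth = 64, 16


class ToyPreLNBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Pre-LN: the residual path bypasses the norm entirely,
        # so each block adds un-normalized output to the stream
        return x + self.ffn(self.norm(x))


x = torch.randn(2, 10, d_model)
print(f"input std: {x.std().item():.3f}")

with torch.no_grad():
    for block in [ToyPreLNBlock() for _ in range(depth)]:
        x = block(x)
    print(f"after {depth} Pre-LN blocks: std = {x.std().item():.3f}")  # grows with depth

    final_norm = nn.LayerNorm(d_model)
    y = final_norm(x)
    print(f"after final LayerNorm: std = {y.std().item():.3f}")  # back to ~1
```

Each residual addition contributes extra variance that nothing downstream removes, which is exactly why the final norm is part of the Pre-LN recipe.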
4.8 Complete Transformer Layer
Putting It All Together
```python
import torch
import torch.nn as nn
from typing import Optional


class TransformerLayerBase(nn.Module):
    """
    Base transformer layer with configurable normalization.

    Contains:
    - Multi-head self-attention with Add & Norm
    - Feed-forward network with Add & Norm

    Args:
        d_model: Model dimension
        num_heads: Number of attention heads
        d_ff: Feed-forward inner dimension
        dropout: Dropout probability
        pre_norm: Use Pre-LN if True, Post-LN if False
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
        pre_norm: bool = True
    ):
        super().__init__()

        self.pre_norm = pre_norm

        # Attention sublayer (placeholder - will be implemented properly later)
        self.self_attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )

        # Feed-forward sublayer
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

        # Layer norms
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Forward pass through transformer layer.

        Args:
            x: Input tensor [batch, seq_len, d_model]
            mask: Optional attention mask

        Returns:
            Output tensor [batch, seq_len, d_model]
        """
        if self.pre_norm:
            # Pre-LN variant
            # Attention block
            normed = self.norm1(x)
            attn_out, _ = self.self_attention(normed, normed, normed, attn_mask=mask)
            x = x + self.dropout(attn_out)

            # FFN block
            normed = self.norm2(x)
            ff_out = self.feed_forward(normed)
            x = x + self.dropout(ff_out)
        else:
            # Post-LN variant
            # Attention block
            attn_out, _ = self.self_attention(x, x, x, attn_mask=mask)
            x = self.norm1(x + self.dropout(attn_out))

            # FFN block
            ff_out = self.feed_forward(x)
            x = self.norm2(x + self.dropout(ff_out))

        return x


# Test
def test_transformer_layer():
    d_model = 512
    num_heads = 8
    d_ff = 2048
    batch_size = 2
    seq_len = 10

    # Test Pre-LN
    pre_ln_layer = TransformerLayerBase(
        d_model, num_heads, d_ff, pre_norm=True
    )
    x = torch.randn(batch_size, seq_len, d_model)
    pre_output = pre_ln_layer(x)

    print("Pre-LN Transformer Layer:")
    print(f"  Input shape: {x.shape}")
    print(f"  Output shape: {pre_output.shape}")
    print(f"  Parameters: {sum(p.numel() for p in pre_ln_layer.parameters()):,}")

    # Test Post-LN
    post_ln_layer = TransformerLayerBase(
        d_model, num_heads, d_ff, pre_norm=False
    )
    post_output = post_ln_layer(x)

    print("\nPost-LN Transformer Layer:")
    print(f"  Output shape: {post_output.shape}")

    print("\n✓ Transformer layer test passed!")


test_transformer_layer()
```

Output:
```
Pre-LN Transformer Layer:
  Input shape: torch.Size([2, 10, 512])
  Output shape: torch.Size([2, 10, 512])
  Parameters: 3,152,384

Post-LN Transformer Layer:
  Output shape: torch.Size([2, 10, 512])

✓ Transformer layer test passed!
```

Summary
The Add & Norm Pattern
Post-LN (Original):
```python
output = LayerNorm(x + Sublayer(x))
```

Pre-LN (Modern):

```python
output = x + Sublayer(LayerNorm(x))
```

When to Use Each
| Scenario | Recommendation |
|---|---|
| Reproducing original BERT | Post-LN |
| New projects | Pre-LN |
| Deep networks (>24 layers) | Pre-LN |
| Training without warmup | Pre-LN |
| Research comparison | Try both |
Implementation Checklist
- Choose Pre-LN or Post-LN
- Apply pattern to both attention and FFN
- Add dropout before residual addition
- For Pre-LN: add final LayerNorm after all layers
- For Post-LN: consider learning rate warmup
Chapter Summary
Components Covered
| Component | Purpose |
|---|---|
| FFN | Per-position non-linear transformation |
| LayerNorm | Normalize across features |
| Residual | Enable gradient flow |
| Add & Norm | Combine all three |
Ready for Encoder/Decoder
With these components, we now have everything needed to build:
- Complete encoder layers
- Complete decoder layers
- Full transformer architecture
Exercises
Implementation Exercises
1. Implement a transformer layer that can switch between Pre-LN and Post-LN at runtime.
2. Create a "sandwich" norm variant: x + LayerNorm(Sublayer(LayerNorm(x))).
3. Add support for RMSNorm instead of LayerNorm.
Analysis Exercises
4. Train identical models with Pre-LN vs Post-LN. Compare:
- Training curves
- Final performance
- Sensitivity to learning rate
5. Visualize the gradient norms through layers for both variants.
6. Experiment with different dropout placements in the Add & Norm pattern.
Conceptual Questions
7. Why does Pre-LN need a final LayerNorm but Post-LN doesn't?
8. How does the Add & Norm pattern affect the effective depth of the network?
Next Chapter Preview
In Chapter 7, we'll build the Transformer Encoder by stacking multiple encoder layers. We'll implement the complete encoder architecture that processes source sequences in our translation model.