Chapter 7

Encoder Layer Architecture

Transformer Encoder

Introduction

The encoder is the first half of the transformer architecture. It processes the input sequence (e.g., a German sentence) and produces contextualized representations that capture the meaning of, and relationships between, all tokens.

This section provides a detailed walkthrough of the encoder layer structure before we implement it in code.


The Big Picture

Encoder's Role in Translation

πŸ“text
German: "Der schnelle braune Fuchs springt ΓΌber den faulen Hund."
                   ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         ENCODER (6 layers)          β”‚
β”‚                                     β”‚
β”‚  Input β†’ Rich contextual            β”‚
β”‚          representations            β”‚
β”‚                                     β”‚
β”‚  Each token "knows about" all       β”‚
β”‚  other tokens in the sentence       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   ↓
          Encoder Output (Memory)
                   β”‚
                   ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         DECODER (6 layers)          β”‚
β”‚                                     β”‚
β”‚  Generates: "The quick brown fox    β”‚
β”‚  jumps over the lazy dog."          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

What Each Encoder Layer Does

1. Self-Attention: Each token attends to all tokens (including itself)

2. Feed-Forward: Each token is transformed independently

3. Residual + Norm: Maintain gradient flow and stability

After 6 layers, each token's representation contains information from the entire sequence!
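The three operations above can be sketched with PyTorch built-ins. This is only a rough sketch (the batch size, sequence length, and use of `nn.MultiheadAttention` here are illustrative); the course's own modules replace the built-ins later.

🐍python
```python
import torch
import torch.nn as nn

d_model, d_ff, num_heads = 512, 2048, 8
x = torch.randn(2, 10, d_model)  # [batch, seq_len, d_model]

attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

# 1. Self-attention: each token attends to all tokens (Q = K = V = x)
h = norm1(x)                 # Pre-LN: normalize before the sublayer
attn_out, _ = attn(h, h, h)
x = x + attn_out             # Residual #1: maintains gradient flow

# 2. Feed-forward: each position transformed independently
x = x + ffn(norm2(x))        # Residual #2

print(x.shape)  # torch.Size([2, 10, 512])
```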


Single Encoder Layer Structure

Architecture Diagram

πŸ“text
                   Input x
                      β”‚
            [batch, seq_len, d_model]
                      β”‚
                      β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                      β”‚                     β”‚
                      β–Ό                     β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
         β”‚  Layer Normalization   β”‚        β”‚ (Pre-LN)
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
                    β”‚                      β”‚
                    β–Ό                      β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
         β”‚  Multi-Head            β”‚        β”‚
         β”‚  Self-Attention        β”‚        β”‚
         β”‚                        β”‚        β”‚
         β”‚  Q = K = V = x         β”‚        β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
                    β”‚                      β”‚
                    β–Ό                      β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
              β”‚ Dropout  β”‚                 β”‚
              β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                 β”‚
                   β”‚                       β”‚
                   β–Ό                       β”‚
                 (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Residual #1
                   β”‚
         [batch, seq_len, d_model]
                   β”‚
                   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                     β”‚
                   β–Ό                     β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
         β”‚  Layer Normalization   β”‚      β”‚ (Pre-LN)
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
                    β”‚                    β”‚
                    β–Ό                    β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
         β”‚  Feed-Forward Network  β”‚      β”‚
         β”‚                        β”‚      β”‚
         β”‚  d_model→d_ff→d_model  β”‚      β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
                    β”‚                    β”‚
                    β–Ό                    β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
              β”‚ Dropout  β”‚               β”‚
              β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜               β”‚
                   β”‚                     β”‚
                   β–Ό                     β”‚
                 (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Residual #2
                   β”‚
                   β–Ό
                Output
            [batch, seq_len, d_model]

Component Breakdown

Layer Components

Component           Input Shape  Output Shape  Parameters
LayerNorm 1         [B, S, D]    [B, S, D]     2D
MultiHeadAttention  [B, S, D]    [B, S, D]     4DΒ²
Dropout 1           [B, S, D]    [B, S, D]     0
LayerNorm 2         [B, S, D]    [B, S, D]     2D
FeedForward         [B, S, D]    [B, S, D]     2DΓ—d_ff + D + d_ff
Dropout 2           [B, S, D]    [B, S, D]     0

Where: B = batch_size, S = seq_len, D = d_model

Typical Dimensions

πŸ“text
d_model = 512
d_ff = 2048 (4 Γ— d_model)
num_heads = 8
d_k = d_model / num_heads = 64

Per encoder layer parameters:
- Attention: 4 Γ— 512 Γ— 512 = 1,048,576
- LayerNorms: 2 Γ— 2 Γ— 512 = 2,048
- FFN: 512 Γ— 2048 + 2048 + 2048 Γ— 512 + 512 = 2,099,712
- Total per layer: ~3.15M parameters
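The arithmetic above is easy to verify in a few lines (attention counted as weights only, as in the original Transformer; LayerNorm and FFN include biases):

🐍python
```python
# Sanity-check the per-layer parameter counts
d_model, d_ff = 512, 2048

attention = 4 * d_model * d_model                       # W_Q, W_K, W_V, W_O
layernorms = 2 * 2 * d_model                            # two norms, gain + bias each
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model  # two Linears with biases

total = attention + layernorms + ffn
print(f"{attention:,} + {layernorms:,} + {ffn:,} = {total:,}")
# 1,048,576 + 2,048 + 2,099,712 = 3,150,336  (~3.15M)
```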

Information Flow

Step-by-Step Processing

Step 1: Input Embedding

πŸ“text
Input tokens: [Der, schnelle, braune, Fuchs, springt, ...]
After embedding + positional encoding:
β†’ x: [batch, seq_len, d_model]

Step 2: Self-Attention

πŸ“text
Each token attends to all tokens (including itself)

"springt" attends to:
- "Der" (subject marker?)
- "Fuchs" (what jumps?)
- "ΓΌber" (jump where?)
- itself (verb properties)

Output: Contextualized representation for each token

Step 3: Feed-Forward

πŸ“text
Each token processed independently
Non-linear transformation
Capacity for complex per-token computations
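The position-wise transformation can be sketched as a small module. The class and attribute names below are assumptions for illustration, not necessarily the course's exact code:

🐍python
```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN sketch: d_model -> d_ff -> d_model."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same weights are applied at every position;
        # no information mixes across the seq_len dimension.
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))

ffn = FeedForward(512, 2048)
y = ffn(torch.randn(2, 10, 512))
print(y.shape)  # torch.Size([2, 10, 512])
```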

Step 4: Repeat for N Layers

πŸ“text
Layer 1: Basic patterns (word types, simple relations)
Layer 2: Higher-order patterns (phrases, modifiers)
...
Layer 6: Complex semantic relationships
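Stacking N identical layers can be sketched with PyTorch's built-in modules; `norm_first=True` selects the Pre-LN variant. This is only a stand-in to show the stacking pattern; the course builds its own layer class in the next section.

🐍python
```python
import torch
import torch.nn as nn

# One layer, then 6 independent copies stacked into an encoder
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True, norm_first=True,  # Pre-LN
)
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model]
memory = encoder(x)           # shape is preserved through all 6 layers
print(memory.shape)           # torch.Size([2, 10, 512])
```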

Self-Attention in Encoder

Bidirectional Attention

Unlike the decoder's, the encoder's attention is bidirectional:

πŸ“text
"Der Fuchs springt ΓΌber den Hund"

Token "springt" can attend to:
- Past tokens: "Der", "Fuchs" ← Subject
- Future tokens: "ΓΌber", "den", "Hund" ← Where it jumps

ALL positions are visible!

Attention Mask in Encoder

Padding Mask Only:

πŸ“text
Sentence:  [Der, Fuchs, springt, <pad>, <pad>]
Mask:      [ 1,    1,      1,      0,      0 ]

Each token attends to all NON-padded positions

No Causal Mask: Unlike in the decoder, no token is hidden from any other.
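Building the padding mask shown above takes one comparison. The pad id and the other token ids here are made up for illustration:

🐍python
```python
import torch

PAD_ID = 0  # hypothetical id assigned to the <pad> token

def make_padding_mask(token_ids: torch.Tensor) -> torch.Tensor:
    """1 where a real token may be attended to, 0 at <pad> positions."""
    return (token_ids != PAD_ID).long()

# [Der, Fuchs, springt, <pad>, <pad>] with made-up ids
ids = torch.tensor([[5, 17, 42, 0, 0]])
mask = make_padding_mask(ids)
print(mask)  # tensor([[1, 1, 1, 0, 0]])
```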


Shape Consistency

Why Shapes Must Match

Residual connections require input and output shapes to be identical:

🐍python
output = x + sublayer(x)
#        ↑       ↑
#     Same shape required!
Shapes Throughout Layer

πŸ“text
Input:           [batch, seq_len, d_model]
                         β”‚
After LayerNorm:         β”‚ same
                         β”‚
After Attention:         β”‚ same (Q, K, V all project to d_model)
                         β”‚
After Residual 1:        β”‚ same
                         β”‚
After LayerNorm:         β”‚ same
                         β”‚
After FFN:               β”‚ same (output projects back to d_model)
                         β”‚
After Residual 2:        β”‚ same
                         β”‚
Output:          [batch, seq_len, d_model]  ← SAME as input!

Dropout Points

Where Dropout is Applied

πŸ“text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Multi-Head Attention                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Attention weights: dropout(softmax)   β”‚   β”‚ ← Attention dropout
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚ Dropout  β”‚  ← Output dropout (before residual)
              β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                   β”‚
               ──(+)── Residual
                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Feed-Forward Network                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Hidden layer: dropout after activation β”‚   β”‚ ← FFN internal dropout
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚ Dropout  β”‚  ← Output dropout (before residual)
              β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                   β”‚
               ──(+)── Residual

Dropout Values

Location         Typical Value  Purpose
Attention        0.1            Prevent over-reliance on specific patterns
FFN hidden       0.1            Regularize transformation
Sublayer output  0.1            Regularize residual
Embedding        0.1            Regularize input

Pre-LN vs Post-LN for Encoder

Pre-LN (Modern)

🐍python
# Each sublayer:
output = x + dropout(sublayer(layer_norm(x)))

# After all layers:
final = final_layer_norm(output)

Benefits:

- More stable training

- Better gradient flow

- Works well for deep encoders

Post-LN (Original)

🐍python
# Each sublayer:
output = layer_norm(x + dropout(sublayer(x)))

# After all layers:
final = output  # Already normalized

Considerations:

- Requires careful initialization

- May need learning rate warmup

- Original Transformer paper used this


Summary

Encoder Layer Formula

Pre-LN Style:

πŸ“text
# Attention sublayer
a = x + Dropout(MultiHeadAttention(LayerNorm(x)))

# FFN sublayer
output = a + Dropout(FFN(LayerNorm(a)))
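As a preview, the formula maps directly onto a layer class. This is a hedged sketch using PyTorch built-ins; the class name and internals are assumptions, and the next section's implementation substitutes the course's own MultiHeadAttention and FFN modules:

🐍python
```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Pre-LN encoder layer following the formula above (illustrative)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # a = x + Dropout(MultiHeadAttention(LayerNorm(x)))
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, key_padding_mask=padding_mask)
        x = x + self.drop(a)
        # output = a + Dropout(FFN(LayerNorm(a)))
        return x + self.drop(self.ffn(self.norm2(x)))

layer = EncoderLayerSketch()
out = layer(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Note the PyTorch convention for `key_padding_mask`: a boolean tensor where True marks positions to ignore, the inverse of the 1/0 padding mask drawn earlier.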

Key Properties

Property              Value
Self-attention        Bidirectional (all positions visible)
Masking               Padding mask only
Shape preservation    Input shape = Output shape
Residual connections  2 per layer
Layer normalizations  2 per layer
Dropout points        4+ per layer

Component Dependencies

πŸ“text
MultiHeadAttention ←─ From Chapter 3
LayerNorm ←────────── From Chapter 6
FFN ←──────────────── From Chapter 6
Residual+Dropout ←─── From Chapter 6

Exercises

Conceptual Questions

1. Why is encoder attention bidirectional while decoder attention is causal?

2. What would happen if we removed the residual connections from the encoder?

3. Why doesn't the encoder need a causal mask?

Diagram Exercises

4. Draw the data flow for a 2-layer encoder, showing shapes at each stage.

5. Mark all locations where dropout could be applied in an encoder layer.

Analysis Questions

6. For batch_size=32, seq_len=100, d_model=512, calculate the memory needed for attention scores in one head.

7. How many total parameters are in a 6-layer encoder with d_model=512, d_ff=2048, num_heads=8?


In the next section, we'll implement the TransformerEncoderLayer class in PyTorch, bringing together all the components we've built throughout the course.