Introduction
The encoder is the first half of the transformer architecture. It processes the input sequence (e.g., a German sentence) and produces contextualized representations that capture the meaning of each token and its relationships to all other tokens.
This section provides a detailed walkthrough of the encoder layer structure before we implement it in code.
The Big Picture
Encoder's Role in Translation
1German: "Der schnelle braune Fuchs springt ΓΌber den faulen Hund."
2 β
3βββββββββββββββββββββββββββββββββββββββ
4β ENCODER (6 layers) β
5β β
6β Input β Rich contextual representations
7β β
8β Each token "knows about" all β
9β other tokens in the sentence β
10ββββββββββββββββββββ¬βββββββββββββββββββ
11 β
12 β
13 Encoder Output (Memory)
14 β
15 β
16βββββββββββββββββββββββββββββββββββββββ
17β DECODER (6 layers) β
18β β
19β Generates: "The quick brown fox β
20β jumps over the lazy dog." β
21βββββββββββββββββββββββββββββββββββββββWhat Each Encoder Layer Does
1. Self-Attention: Each token attends to all tokens (including itself)
2. Feed-Forward: Each token is transformed independently
3. Residual + Norm: Maintain gradient flow and stability
After 6 layers, each token's representation contains information from the entire sequence!
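This stacking behavior can be sketched with PyTorch's built-in encoder modules (the course builds its own components later; this is only an illustration of the data flow, not the course's implementation):

```python
import torch
import torch.nn as nn

# One encoder layer: self-attention + feed-forward, Pre-LN variant
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True, norm_first=True,
)
# Stack 6 identical layers, as in the diagram above
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model]
memory = encoder(x)           # contextualized representations
print(memory.shape)           # torch.Size([2, 10, 512]) -- shape preserved
```

Note that the output shape matches the input shape exactly; this is what lets the decoder consume the encoder output as "memory".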
Single Encoder Layer Structure
Architecture Diagram
Input x
   │
   [batch, seq_len, d_model]
   │
   ├──────────────────────────┐
   ▼                          │
┌─────────────────────────┐   │
│   Layer Normalization   │   │   (Pre-LN)
└───────────┬─────────────┘   │
            ▼                 │
┌─────────────────────────┐   │
│       Multi-Head        │   │
│     Self-Attention      │   │
│                         │   │
│      Q = K = V = x      │   │
└───────────┬─────────────┘   │
            ▼                 │
       ┌─────────┐            │
       │ Dropout │            │
       └────┬────┘            │
            ▼                 │
           (+)◄───────────────┘   Residual #1
            │
   [batch, seq_len, d_model]
            │
            ├─────────────────┐
            ▼                 │
┌─────────────────────────┐   │
│   Layer Normalization   │   │   (Pre-LN)
└───────────┬─────────────┘   │
            ▼                 │
┌─────────────────────────┐   │
│  Feed-Forward Network   │   │
│                         │   │
│ d_model → d_ff → d_model│   │
└───────────┬─────────────┘   │
            ▼                 │
       ┌─────────┐            │
       │ Dropout │            │
       └────┬────┘            │
            ▼                 │
           (+)◄───────────────┘   Residual #2
            │
            ▼
         Output
   [batch, seq_len, d_model]

Component Breakdown
Layer Components
| Component | Input Shape | Output Shape | Parameters |
|---|---|---|---|
| LayerNorm 1 | [B, S, D] | [B, S, D] | 2D |
| MultiHeadAttention | [B, S, D] | [B, S, D] | 4D² (weights; biases add 4D) |
| Dropout 1 | [B, S, D] | [B, S, D] | 0 |
| LayerNorm 2 | [B, S, D] | [B, S, D] | 2D |
| FeedForward | [B, S, D] | [B, S, D] | 2D×d_ff + D + d_ff |
| Dropout 2 | [B, S, D] | [B, S, D] | 0 |
Where: B = batch_size, S = seq_len, D = d_model
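The parameter formulas in the table can be checked with a few lines of arithmetic for the standard dimensions (D = 512, d_ff = 2048):

```python
# Parameter counts per encoder layer, following the formulas in the table.
D, d_ff = 512, 2048

attention = 4 * D * D                  # W_q, W_k, W_v, W_o weight matrices (biases excluded)
layernorms = 2 * (2 * D)               # two LayerNorms, each with a gain and a bias vector
ffn = D * d_ff + d_ff + d_ff * D + D   # two linear layers, including biases

total = attention + layernorms + ffn
print(attention)   # 1048576
print(ffn)         # 2099712
print(total)       # 3150336  (~3.15M per layer)
```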
Typical Dimensions
d_model = 512
d_ff = 2048 (4 × d_model)
num_heads = 8
d_k = d_model / num_heads = 64

Per encoder layer parameters:
- Attention: 4 × 512 × 512 = 1,048,576
- LayerNorms: 2 × 2 × 512 = 2,048
- FFN: 512 × 2048 + 2048 + 2048 × 512 + 512 = 2,099,712
- Total per layer: ~3.15M parameters

Information Flow
Step-by-Step Processing
Step 1: Input Embedding
Input tokens: [Der, schnelle, braune, Fuchs, springt, ...]
After embedding + positional encoding:
→ x: [batch, seq_len, d_model]

Step 2: Self-Attention
Each token attends to all tokens (including itself)

"springt" attends to:
- "Der" (subject marker?)
- "Fuchs" (what jumps?)
- "über" (jump where?)
- itself (verb properties)

Output: Contextualized representation for each token

Step 3: Feed-Forward
Each token processed independently
Non-linear transformation
Capacity for complex per-token computations

Step 4: Repeat for N Layers
Layer 1: Basic patterns (word types, simple relations)
Layer 2: Higher-order patterns (phrases, modifiers)
...
Layer 6: Complex semantic relationships

Self-Attention in Encoder
Bidirectional Attention
Unlike the decoder, the encoder's attention is bidirectional:
1"Der Fuchs springt ΓΌber den Hund"
2
3Token "springt" can attend to:
4- Past tokens: "Der", "Fuchs" β Subject
5- Future tokens: "ΓΌber", "den", "Hund" β Where it jumps
6
7ALL positions are visible!Attention Mask in Encoder
Padding Mask Only:
Sentence: [Der, Fuchs, springt, <pad>, <pad>]
Mask:     [  1,     1,       1,     0,     0 ]

Each token attends to all NON-padded positions

No Causal Mask: Unlike the decoder, no tokens are hidden from each other.
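A padding mask like the one above can be built directly from the token ids (the ids below are made up for illustration; only the padding id matters):

```python
import torch

PAD = 0
# One sentence of 3 real tokens, padded to length 5: [batch=1, seq_len=5]
tokens = torch.tensor([[11, 42, 7, PAD, PAD]])

# Convention used in the diagram: 1/True at real tokens, 0/False at padding
mask = tokens != PAD
print(mask)   # tensor([[ True,  True,  True, False, False]])

# Note: PyTorch's nn.MultiheadAttention uses the OPPOSITE convention --
# src_key_padding_mask is True at positions to be IGNORED.
key_padding_mask = tokens == PAD
```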
Shape Consistency
Why Shapes Must Match
Residual connections require input and output shapes to be identical:
output = x + sublayer(x)
#        ↑       ↑
#     Same shape required!

Shapes Throughout Layer
Input:            [batch, seq_len, d_model]
                     │
After LayerNorm:   → same
                     │
After Attention:   → same (Q, K, V all project to d_model)
                     │
After Residual 1:  → same
                     │
After LayerNorm:   → same
                     │
After FFN:         → same (output projects back to d_model)
                     │
After Residual 2:  → same
                     │
Output:           [batch, seq_len, d_model] ← SAME as input!

Dropout Points
Where Dropout is Applied
┌─────────────────────────────────────────────┐
│            Multi-Head Attention             │
│  ┌───────────────────────────────────────┐  │
│  │ Attention weights: dropout(softmax)   │◄─┼── Attention dropout
│  └───────────────────────────────────────┘  │
└─────────────────────┬───────────────────────┘
                      ▼
                ┌─────────┐
                │ Dropout │ ◄── Output dropout (before residual)
                └────┬────┘
                     ▼
                ──►(+)──  Residual
                     │
┌────────────────────▼────────────────────────┐
│            Feed-Forward Network             │
│  ┌───────────────────────────────────────┐  │
│  │ Hidden layer: dropout after activation│◄─┼── FFN internal dropout
│  └───────────────────────────────────────┘  │
└─────────────────────┬───────────────────────┘
                      ▼
                ┌─────────┐
                │ Dropout │ ◄── Output dropout (before residual)
                └────┬────┘
                     ▼
                ──►(+)──  Residual

Dropout Values
| Location | Typical Value | Purpose |
|---|---|---|
| Attention | 0.1 | Prevent over-reliance on specific patterns |
| FFN hidden | 0.1 | Regularize transformation |
| Sublayer output | 0.1 | Regularize residual |
| Embedding | 0.1 | Regularize input |
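All of these dropout modules behave the same way: active during training, a no-op during evaluation. A quick sketch with p = 0.1 (the typical value from the table):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)
x = torch.ones(1000)

drop.train()               # training mode: ~10% of activations zeroed,
y = drop(x)                # survivors scaled by 1/(1-p) to keep the expectation
print((y == 0).float().mean())   # roughly 0.1

drop.eval()                # evaluation mode: dropout does nothing
print(torch.equal(drop(x), x))   # True
```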
Pre-LN vs Post-LN for Encoder
Pre-LN (Recommended)
# Each sublayer:
output = x + dropout(sublayer(layer_norm(x)))

# After all layers:
final = final_layer_norm(output)

Benefits:
- More stable training
- Better gradient flow
- Works well for deep encoders
Post-LN (Original)
# Each sublayer:
output = layer_norm(x + dropout(sublayer(x)))

# After all layers:
final = output  # Already normalized

Considerations:
- Requires careful initialization
- May need learning rate warmup
- Original Transformer paper used this
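The two variants differ only in where the LayerNorm sits relative to the residual addition. A minimal runnable comparison, using a single Linear layer as a stand-in for attention or the FFN (the names `pre_ln` and `post_ln` are mine, not the course's):

```python
import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
dropout = nn.Dropout(0.0)                # disabled here for determinism

def pre_ln(x):
    # Normalize first; the residual path carries x through untouched.
    return x + dropout(sublayer(norm(x)))

def post_ln(x):
    # Apply the sublayer first; normalize after the residual addition.
    return norm(x + dropout(sublayer(x)))

x = torch.randn(2, 4, d_model)
print(pre_ln(x).shape, post_ln(x).shape)   # both [2, 4, 8]
```

The Pre-LN residual path is an identity from input to output, which is why gradients flow more easily through deep stacks.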
Summary
Encoder Layer Formula
Pre-LN Style:
# Attention sublayer
a = x + Dropout(MultiHeadAttention(LayerNorm(x)))

# FFN sublayer
output = a + Dropout(FFN(LayerNorm(a)))

Key Properties
| Property | Value |
|---|---|
| Self-attention | Bidirectional (all positions visible) |
| Masking | Padding mask only |
| Shape preservation | Input shape = Output shape |
| Residual connections | 2 per layer |
| Layer normalizations | 2 per layer |
| Dropout points | 4+ per layer |
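These properties can be seen together in a compact sketch of a Pre-LN encoder layer. This is not the implementation from the next section; it substitutes PyTorch's built-in nn.MultiheadAttention for the course's own MultiHeadAttention module, and the class name is mine:

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Minimal Pre-LN encoder layer (illustrative sketch)."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                  # LayerNorm 1
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)                  # LayerNorm 2
        self.ffn = nn.Sequential(                           # d_model -> d_ff -> d_model
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop = nn.Dropout(dropout)                     # sublayer-output dropout

    def forward(self, x, key_padding_mask=None):
        # Attention sublayer: a = x + Dropout(Attn(LayerNorm(x)))
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(attn_out)                         # Residual #1
        # FFN sublayer: out = a + Dropout(FFN(LayerNorm(a)))
        x = x + self.drop(self.ffn(self.norm2(x)))          # Residual #2
        return x

enc_layer = EncoderLayerSketch()
x = torch.randn(2, 10, 512)
print(enc_layer(x).shape)   # torch.Size([2, 10, 512])
```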
Component Dependencies
MultiHeadAttention ── From Chapter 3
LayerNorm ────────── From Chapter 6
FFN ──────────────── From Chapter 6
Residual+Dropout ─── From Chapter 6

Exercises
Conceptual Questions
1. Why is encoder attention bidirectional while decoder attention is causal?
2. What would happen if we removed the residual connections from the encoder?
3. Why doesn't the encoder need a causal mask?
Diagram Exercises
4. Draw the data flow for a 2-layer encoder, showing shapes at each stage.
5. Mark all locations where dropout could be applied in an encoder layer.
Analysis Questions
6. For batch_size=32, seq_len=100, d_model=512, calculate the memory needed for attention scores in one head.
7. How many total parameters are in a 6-layer encoder with d_model=512, d_ff=2048, num_heads=8?
In the next section, we'll implement the TransformerEncoderLayer class in PyTorch, bringing together all the components we've built throughout the course.