Introduction
The encoder is the first half of the transformer architecture. It processes the input sequence (e.g., a German sentence) and produces contextualized representations that capture the meaning of each token and its relationships to all other tokens.
This section provides a detailed walkthrough of the encoder layer structure before we implement it in code.
The Big Picture
Encoder's Role in Translation
1German: "Der schnelle braune Fuchs springt ΓΌber den faulen Hund."
2 β
3βββββββββββββββββββββββββββββββββββββββ
4β ENCODER (6 layers) β
5β β
6β Input β Rich contextual representations
7β β
8β Each token "knows about" all β
9β other tokens in the sentence β
10ββββββββββββββββββββ¬βββββββββββββββββββ
11 β
12 β
13 Encoder Output (Memory)
14 β
15 β
16βββββββββββββββββββββββββββββββββββββββ
17β DECODER (6 layers) β
18β β
19β Generates: "The quick brown fox β
20β jumps over the lazy dog." β
21βββββββββββββββββββββββββββββββββββββββWhat Each Encoder Layer Does
1. Self-Attention: Each token attends to all tokens (including itself)
2. Feed-Forward: Each token is transformed independently
3. Residual + Norm: Maintain gradient flow and stability
After 6 layers, each token's representation contains information from the entire sequence!
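This stacking behavior can be sketched with PyTorch's built-in encoder modules (the course builds its own components later; this is only an illustration of the data flow, not the course's implementation):

```python
import torch
import torch.nn as nn

# One encoder layer: self-attention + feed-forward, Pre-LN variant
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True, norm_first=True,
)
# Stack 6 identical layers, as in the diagram above
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # [batch, seq_len, d_model]
memory = encoder(x)           # contextualized representations
print(memory.shape)           # torch.Size([2, 10, 512]) -- shape preserved
```

Note that the output shape matches the input shape exactly; this is what lets the decoder consume the encoder output as "memory".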
Single Encoder Layer Structure
Architecture Diagram
Input x
   │
   [batch, seq_len, d_model]
   │
   ├──────────────────────────┐
   ▼                          │
┌─────────────────────────┐   │
│   Layer Normalization   │   │   (Pre-LN)
└───────────┬─────────────┘   │
            ▼                 │
┌─────────────────────────┐   │
│       Multi-Head        │   │
│     Self-Attention      │   │
│                         │   │
│      Q = K = V = x      │   │
└───────────┬─────────────┘   │
            ▼                 │
       ┌─────────┐            │
       │ Dropout │            │
       └────┬────┘            │
            ▼                 │
           (+)◄───────────────┘   Residual #1
            │
   [batch, seq_len, d_model]
            │
            ├─────────────────┐
            ▼                 │
┌─────────────────────────┐   │
│   Layer Normalization   │   │   (Pre-LN)
└───────────┬─────────────┘   │
            ▼                 │
┌─────────────────────────┐   │
│  Feed-Forward Network   │   │
│                         │   │
│ d_model → d_ff → d_model│   │
└───────────┬─────────────┘   │
            ▼                 │
       ┌─────────┐            │
       │ Dropout │            │
       └────┬────┘            │
            ▼                 │
           (+)◄───────────────┘   Residual #2
            │
            ▼
         Output
   [batch, seq_len, d_model]

Component Breakdown
Layer Components
| Component | Input Shape | Output Shape | Parameters |
|---|---|---|---|
| LayerNorm 1 | [B, S, D] | [B, S, D] | 2D |
| MultiHeadAttention | [B, S, D] | [B, S, D] | 4D² (weights; biases add 4D) |
| Dropout 1 | [B, S, D] | [B, S, D] | 0 |
| LayerNorm 2 | [B, S, D] | [B, S, D] | 2D |
| FeedForward | [B, S, D] | [B, S, D] | 2D×d_ff + D + d_ff |
| Dropout 2 | [B, S, D] | [B, S, D] | 0 |
Where: B = batch_size, S = seq_len, D = d_model
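The parameter formulas in the table can be checked with a few lines of arithmetic for the standard dimensions (D = 512, d_ff = 2048):

```python
# Parameter counts per encoder layer, following the formulas in the table.
D, d_ff = 512, 2048

attention = 4 * D * D                  # W_q, W_k, W_v, W_o weight matrices (biases excluded)
layernorms = 2 * (2 * D)               # two LayerNorms, each with a gain and a bias vector
ffn = D * d_ff + d_ff + d_ff * D + D   # two linear layers, including biases

total = attention + layernorms + ffn
print(attention)   # 1048576
print(ffn)         # 2099712
print(total)       # 3150336  (~3.15M per layer)
```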
Typical Dimensions
d_model = 512
d_ff = 2048 (4 × d_model)
num_heads = 8
d_k = d_model / num_heads = 64

Per encoder layer parameters:
- Attention: 4 × 512 × 512 = 1,048,576
- LayerNorms: 2 × 2 × 512 = 2,048
- FFN: 512 × 2048 + 2048 + 2048 × 512 + 512 = 2,099,712
- Total per layer: ~3.15M parameters

Information Flow
Step-by-Step Processing
Step 1: Input Embedding
Input tokens: [Der, schnelle, braune, Fuchs, springt, ...]
After embedding + positional encoding:
→ x: [batch, seq_len, d_model]

Step 2: Self-Attention
Each token attends to all tokens (including itself)

"springt" attends to:
- "Der" (subject marker?)
- "Fuchs" (what jumps?)
- "über" (jump where?)
- itself (verb properties)

Output: Contextualized representation for each token

Step 3: Feed-Forward
Each token processed independently
Non-linear transformation
Capacity for complex per-token computations

Step 4: Repeat for N Layers
Layer 1: Basic patterns (word types, simple relations)
Layer 2: Higher-order patterns (phrases, modifiers)
...
Layer 6: Complex semantic relationships

Self-Attention in Encoder
Bidirectional Attention
Unlike the decoder, the encoder's attention is bidirectional:
1"Der Fuchs springt ΓΌber den Hund"
2
3Token "springt" can attend to:
4- Past tokens: "Der", "Fuchs" β Subject
5- Future tokens: "ΓΌber", "den", "Hund" β Where it jumps
6
7ALL positions are visible!Attention Mask in Encoder
Padding Mask Only:
Sentence: [Der, Fuchs, springt, <pad>, <pad>]
Mask:     [  1,     1,       1,     0,     0 ]

Each token attends to all NON-padded positions

No Causal Mask: Unlike the decoder, no tokens are hidden from each other.
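A padding mask like the one above can be built directly from the token ids (the ids below are made up for illustration; only the padding id matters):

```python
import torch

PAD = 0
# One sentence of 3 real tokens, padded to length 5: [batch=1, seq_len=5]
tokens = torch.tensor([[11, 42, 7, PAD, PAD]])

# Convention used in the diagram: 1/True at real tokens, 0/False at padding
mask = tokens != PAD
print(mask)   # tensor([[ True,  True,  True, False, False]])

# Note: PyTorch's nn.MultiheadAttention uses the OPPOSITE convention --
# src_key_padding_mask is True at positions to be IGNORED.
key_padding_mask = tokens == PAD
```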
Shape Consistency
Why Shapes Must Match
Residual connections require input and output shapes to be identical:
output = x + sublayer(x)
#        ↑       ↑
#     Same shape required!

Shapes Throughout Layer
Input:            [batch, seq_len, d_model]
                     │
After LayerNorm:   → same
                     │
After Attention:   → same (Q, K, V all project to d_model)
                     │
After Residual 1:  → same
                     │
After LayerNorm:   → same
                     │
After FFN:         → same (output projects back to d_model)
                     │
After Residual 2:  → same
                     │
Output:           [batch, seq_len, d_model] ← SAME as input!

Dropout Points
Where Dropout is Applied
┌─────────────────────────────────────────────┐
│            Multi-Head Attention             │
│  ┌───────────────────────────────────────┐  │
│  │ Attention weights: dropout(softmax)   │◄─┼── Attention dropout
│  └───────────────────────────────────────┘  │
└─────────────────────┬───────────────────────┘
                      ▼
                ┌─────────┐
                │ Dropout │ ◄── Output dropout (before residual)
                └────┬────┘
                     ▼
                ──►(+)──  Residual
                     │
┌────────────────────▼────────────────────────┐
│            Feed-Forward Network             │
│  ┌───────────────────────────────────────┐  │
│  │ Hidden layer: dropout after activation│◄─┼── FFN internal dropout
│  └───────────────────────────────────────┘  │
└─────────────────────┬───────────────────────┘
                      ▼
                ┌─────────┐
                │ Dropout │ ◄── Output dropout (before residual)
                └────┬────┘
                     ▼
                ──►(+)──  Residual

Dropout Values
| Location | Typical Value | Purpose |
|---|---|---|
| Attention | 0.1 | Prevent over-reliance on specific patterns |
| FFN hidden | 0.1 | Regularize transformation |
| Sublayer output | 0.1 | Regularize residual |
| Embedding | 0.1 | Regularize input |
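All of these dropout modules behave the same way: active during training, a no-op during evaluation. A quick sketch with p = 0.1 (the typical value from the table):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)
x = torch.ones(1000)

drop.train()               # training mode: ~10% of activations zeroed,
y = drop(x)                # survivors scaled by 1/(1-p) to keep the expectation
print((y == 0).float().mean())   # roughly 0.1

drop.eval()                # evaluation mode: dropout does nothing
print(torch.equal(drop(x), x))   # True
```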
Pre-LN vs Post-LN for Encoder
Pre-LN (Recommended)
# Each sublayer:
output = x + dropout(sublayer(layer_norm(x)))

# After all layers:
final = final_layer_norm(output)

Benefits:
- More stable training
- Better gradient flow
- Works well for deep encoders
Post-LN (Original)
# Each sublayer:
output = layer_norm(x + dropout(sublayer(x)))

# After all layers:
final = output  # Already normalized

Considerations:
- Requires careful initialization
- May need learning rate warmup
- Original Transformer paper used this
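The two variants differ only in where the LayerNorm sits relative to the residual addition. A minimal runnable comparison, using a single Linear layer as a stand-in for attention or the FFN (the names `pre_ln` and `post_ln` are mine, not the course's):

```python
import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
dropout = nn.Dropout(0.0)                # disabled here for determinism

def pre_ln(x):
    # Normalize first; the residual path carries x through untouched.
    return x + dropout(sublayer(norm(x)))

def post_ln(x):
    # Apply the sublayer first; normalize after the residual addition.
    return norm(x + dropout(sublayer(x)))

x = torch.randn(2, 4, d_model)
print(pre_ln(x).shape, post_ln(x).shape)   # both [2, 4, 8]
```

The Pre-LN residual path is an identity from input to output, which is why gradients flow more easily through deep stacks.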
Summary
Encoder Layer Formula
Pre-LN Style:
# Attention sublayer
a = x + Dropout(MultiHeadAttention(LayerNorm(x)))

# FFN sublayer
output = a + Dropout(FFN(LayerNorm(a)))

Key Properties
| Property | Value |
|---|---|
| Self-attention | Bidirectional (all positions visible) |
| Masking | Padding mask only |
| Shape preservation | Input shape = Output shape |
| Residual connections | 2 per layer |
| Layer normalizations | 2 per layer |
| Dropout points | 4+ per layer |
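These properties can be seen together in a compact sketch of a Pre-LN encoder layer. This is not the implementation from the next section; it substitutes PyTorch's built-in nn.MultiheadAttention for the course's own MultiHeadAttention module, and the class name is mine:

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Minimal Pre-LN encoder layer (illustrative sketch)."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                  # LayerNorm 1
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)                  # LayerNorm 2
        self.ffn = nn.Sequential(                           # d_model -> d_ff -> d_model
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop = nn.Dropout(dropout)                     # sublayer-output dropout

    def forward(self, x, key_padding_mask=None):
        # Attention sublayer: a = x + Dropout(Attn(LayerNorm(x)))
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(attn_out)                         # Residual #1
        # FFN sublayer: out = a + Dropout(FFN(LayerNorm(a)))
        x = x + self.drop(self.ffn(self.norm2(x)))          # Residual #2
        return x

enc_layer = EncoderLayerSketch()
x = torch.randn(2, 10, 512)
print(enc_layer(x).shape)   # torch.Size([2, 10, 512])
```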
Component Dependencies
MultiHeadAttention ── From Chapter 3
LayerNorm ────────── From Chapter 6
FFN ──────────────── From Chapter 6
Residual+Dropout ─── From Chapter 6

Exercises
Conceptual Questions
1. Why is encoder attention bidirectional while decoder attention is causal?
2. What would happen if we removed the residual connections from the encoder?
3. Why doesn't the encoder need a causal mask?
Diagram Exercises
4. Draw the data flow for a 2-layer encoder, showing shapes at each stage.
5. Mark all locations where dropout could be applied in an encoder layer.
Analysis Questions
6. For batch_size=32, seq_len=100, d_model=512, calculate the memory needed for attention scores in one head.
7. How many total parameters are in a 6-layer encoder with d_model=512, d_ff=2048, num_heads=8?
In the next section, we'll implement the TransformerEncoderLayer class in PyTorch, bringing together all the components we've built throughout the course.