Introduction
In machine translation, the encoder's job is to process the source language (German) and produce rich contextual representations. The decoder then uses these representations to generate the target language (English).
This section covers how to properly prepare and encode source sentences for our translation project.
The Encoder's Role in Translation
Translation Pipeline Overview
```
Source (German): "Der Hund läuft im Park"
        │
        ▼
TOKENIZATION
  → [Der, Hund, läuft, im, Park]
  → Token IDs: [45, 892, 1203, 28, 567]
        │
        ▼
EMBEDDING
  → [batch, 5, 512] + positional encoding
        │
        ▼
ENCODER
  6 layers of self-attention + FFN
  Each German token now has context from all other tokens
        │
        ▼
ENCODER OUTPUT
  "Memory" for decoder: [batch, 5, 512]
  Rich representations of the source sentence
        │
        ▼
DECODER
  Cross-attention to encoder output
  Generates: "The dog runs in the park"
```
Key Insight: Encoder Runs Once
For each source sentence:
- Encoder runs once to produce memory
- Decoder runs multiple times (one per output token)
- Encoder output is cached and reused
```
Source: "Der Hund läuft"

Encoder: runs 1 time → Memory

Decoder:
  Step 1: <bos>              + Memory → "The"
  Step 2: <bos> The          + Memory → "dog"
  Step 3: <bos> The dog      + Memory → "runs"
  Step 4: <bos> The dog runs + Memory → <eos>

Memory is computed once, used 4 times!
```
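To see the same pattern in code, here is a minimal sketch of "encode once, decode step by step" built from `nn.Transformer` components with random weights. The token IDs, toy vocabulary size, `<bos>` ID, and the greedy loop are illustrative assumptions, not the chapter's actual model (which we build below and in Chapter 8).

```python
import torch
import torch.nn as nn

# Minimal sketch: the encoder runs once, the decoder runs once per output token,
# and the same memory is reused at every step. Random weights, made-up IDs,
# no positional encoding -- illustration only.
model = nn.Transformer(d_model=512, batch_first=True)
embed = nn.Embedding(1000, 512)    # assumed toy vocabulary of 1000 tokens
proj = nn.Linear(512, 1000)        # hidden state -> vocabulary logits

src_ids = torch.tensor([[45, 892, 1203]])       # "Der Hund läuft"
memory = model.encoder(embed(src_ids))          # encoder runs ONCE

generated = [2]                                 # assume 2 = <bos>
for _ in range(4):                              # decoder runs once per token
    tgt = embed(torch.tensor([generated]))
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(len(generated))
    out = model.decoder(tgt, memory, tgt_mask=tgt_mask)   # same memory each step
    generated.append(proj(out[0, -1]).argmax().item())

print(generated)   # <bos> followed by 4 (untrained, hence arbitrary) token IDs
```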
Source Padding Mask
Why Source Needs Masking
Batched sentences have different lengths:
```
Batch:
  Sentence 1: "Der Hund läuft im Park"           (5 tokens)
  Sentence 2: "Die Katze schläft"                (3 tokens)
  Sentence 3: "Das Wetter ist heute sehr schön"  (6 tokens)

After padding (to length 6):
  Sentence 1: [Der, Hund, läuft, im, Park, <pad>]
  Sentence 2: [Die, Katze, schläft, <pad>, <pad>, <pad>]
  Sentence 3: [Das, Wetter, ist, heute, sehr, schön]
```
Tokens should NOT attend to <pad> positions!
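In code, this padding step is usually done with `torch.nn.utils.rnn.pad_sequence`. The sketch below uses made-up token IDs and assumes `pad_id = 0`, as elsewhere in this chapter.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sentences of different lengths, as token-ID tensors (IDs are made up)
sentences = [
    torch.tensor([45, 892, 1203, 28, 567]),   # 5 tokens
    torch.tensor([101, 774, 3301]),           # 3 tokens
    torch.tensor([12, 34, 56, 78, 90, 11]),   # 6 tokens
]

# Pad every sentence to the length of the longest one with pad_id = 0
batch = pad_sequence(sentences, batch_first=True, padding_value=0)
print(batch.shape)   # torch.Size([3, 6])
print(batch)
```

In practice this usually lives in a DataLoader collate function, so each batch is only padded to its own longest sentence.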
Creating Source Padding Mask
```python
import torch


def create_source_mask(
    source_ids: torch.Tensor,
    pad_id: int = 0
) -> torch.Tensor:
    """
    Create padding mask for encoder.

    Args:
        source_ids: [batch, src_len] source token IDs
        pad_id: Padding token ID (usually 0)

    Returns:
        mask: [batch, 1, 1, src_len] attention mask
              True = attend, False = ignore

    Example:
        >>> src = torch.tensor([[1, 2, 3, 0, 0], [1, 2, 0, 0, 0]])
        >>> mask = create_source_mask(src)
        >>> mask.squeeze()
        tensor([[ True,  True,  True, False, False],
                [ True,  True, False, False, False]])
    """
    # Shape: [batch, src_len]
    mask = (source_ids != pad_id)

    # Expand for attention: [batch, 1, 1, src_len]
    mask = mask.unsqueeze(1).unsqueeze(2)

    return mask


# Test
source = torch.tensor([
    [45, 892, 1203, 28, 567, 0],   # 5 real tokens
    [23, 456, 789, 0, 0, 0],       # 3 real tokens
    [12, 34, 56, 78, 90, 11],      # 6 real tokens (no padding)
])

mask = create_source_mask(source, pad_id=0)
print(f"Source shape: {source.shape}")
print(f"Mask shape: {mask.shape}")
print(f"Mask:\n{mask.squeeze()}")
```
Output:
```
Source shape: torch.Size([3, 6])
Mask shape: torch.Size([3, 1, 1, 6])
Mask:
tensor([[ True,  True,  True,  True,  True, False],
        [ True,  True,  True, False, False, False],
        [ True,  True,  True,  True,  True,  True]])
```
For nn.MultiheadAttention
PyTorch's nn.MultiheadAttention uses the key_padding_mask convention, where True marks positions to ignore:
```python
def create_key_padding_mask(
    source_ids: torch.Tensor,
    pad_id: int = 0
) -> torch.Tensor:
    """
    Create key_padding_mask for nn.MultiheadAttention.

    Args:
        source_ids: [batch, src_len]
        pad_id: Padding token ID

    Returns:
        mask: [batch, src_len] where True = IGNORE
    """
    # True where padding (opposite of the attention mask above!)
    return source_ids == pad_id


# For nn.MultiheadAttention
key_padding_mask = create_key_padding_mask(source, pad_id=0)
print(f"Key padding mask:\n{key_padding_mask}")
```
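As a quick sanity check that the convention works end to end, the sketch below feeds this mask into nn.MultiheadAttention as self-attention over the padded source, reusing the `source` tensor and `create_key_padding_mask` from above. The random embeddings stand in for the real embedding layer and are purely illustrative.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(3, 6, d_model)                                 # stand-in embeddings for `source`
key_padding_mask = create_key_padding_mask(source, pad_id=0)   # [3, 6], True = ignore

out, attn_weights = mha(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)           # torch.Size([3, 6, 512])
print(attn_weights[0, 0])  # weights for the first query: the padded key position gets 0
```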
Complete Source Encoder
Full Implementation
```python
import math

import torch
import torch.nn as nn
from typing import Tuple


class SourceEncoder(nn.Module):
    """
    Complete source encoder for translation.

    Combines embedding, positional encoding, and transformer encoder.

    Args:
        vocab_size: Source vocabulary size
        d_model: Model dimension
        num_heads: Number of attention heads
        num_layers: Number of encoder layers
        d_ff: Feed-forward dimension
        max_seq_len: Maximum sequence length
        dropout: Dropout probability
        pad_id: Padding token ID

    Example:
        >>> encoder = SourceEncoder(vocab_size=32000, d_model=512)
        >>> source_ids = torch.randint(0, 32000, (2, 10))
        >>> memory, mask = encoder(source_ids)
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_seq_len: int = 5000,
        dropout: float = 0.1,
        pad_id: int = 0
    ):
        super().__init__()

        self.d_model = d_model
        self.pad_id = pad_id

        # Token embedding
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)

        # Positional encoding (sinusoidal)
        self.register_buffer(
            'pos_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Embedding dropout
        self.dropout = nn.Dropout(dropout)

        # Scaling factor
        self.scale = d_model ** 0.5

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,
            norm_first=True  # Pre-LN
        )
        self.encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
            norm=nn.LayerNorm(d_model)
        )

    def _create_positional_encoding(
        self,
        max_len: int,
        d_model: int
    ) -> torch.Tensor:
        """Create sinusoidal positional encoding."""
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )

        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)

        return pe

    def forward(
        self,
        source_ids: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Encode source sequence.

        Args:
            source_ids: [batch, src_len] source token IDs

        Returns:
            memory: [batch, src_len, d_model] encoder output
            src_key_padding_mask: [batch, src_len] padding mask
        """
        # Create padding mask
        src_key_padding_mask = (source_ids == self.pad_id)

        # Embed tokens
        seq_len = source_ids.size(1)
        x = self.embedding(source_ids) * self.scale

        # Add positional encoding
        x = x + self.pos_encoding[:, :seq_len]

        # Apply dropout
        x = self.dropout(x)

        # Encode
        memory = self.encoder(x, src_key_padding_mask=src_key_padding_mask)

        return memory, src_key_padding_mask


# Test
def test_source_encoder():
    """Test the complete source encoder."""

    vocab_size = 32000
    d_model = 512
    batch_size = 4
    src_len = 20

    encoder = SourceEncoder(
        vocab_size=vocab_size,
        d_model=d_model,
        num_layers=6
    )

    # Create source with varying lengths
    source_ids = torch.randint(1, vocab_size, (batch_size, src_len))
    # Add padding
    source_ids[0, 15:] = 0
    source_ids[1, 10:] = 0
    source_ids[2, 18:] = 0
    # source_ids[3] has no padding

    # Encode
    memory, mask = encoder(source_ids)

    print("Source Encoder Test")
    print("=" * 50)
    print(f"Input shape: {source_ids.shape}")
    print(f"Memory shape: {memory.shape}")
    print(f"Mask shape: {mask.shape}")
    print("\nPadding mask (True=padding):")
    print(mask)

    # Verify memory at padding positions
    # (should still exist but will be masked in decoder)
    print(f"\nMemory norm at position 0: {memory[:, 0].norm():.2f}")
    print(f"Memory norm at position -1: {memory[:, -1].norm():.2f}")

    print("\n✓ Source encoder test passed!")


test_source_encoder()
```
Encoder Output Interpretation
What Does the Encoder Produce?
After 6 layers of self-attention, each position contains:
```
Original token representation:
  "Hund" → initial embedding (isolated word meaning)

After encoder:
  "Hund" → contextual representation
  Contains information about:
    - its role as subject of "läuft"
    - its relationship to "Der" (article)
    - being part of the phrase "im Park" (location context)
```
Visualizing Encoder Output
```python
import torch.nn.functional as F


def analyze_encoder_output(memory, source_ids, tokenizer):
    """Analyze what the encoder has learned."""

    batch_idx = 0  # First sentence

    # Get tokens (whitespace splitting is a rough stand-in for the
    # tokenizer's actual subword boundaries)
    tokens = tokenizer.decode(source_ids[batch_idx].tolist())

    print("Token-wise analysis:")
    print("-" * 40)

    for pos, token in enumerate(tokens.split()):
        vec = memory[batch_idx, pos]
        norm = vec.norm().item()
        mean = vec.mean().item()
        std = vec.std().item()

        print(f"Position {pos}: '{token}'")
        print(f"  Norm: {norm:.2f}, Mean: {mean:.4f}, Std: {std:.4f}")

    # Token similarity analysis
    print("\nToken similarities (cosine):")
    for i in range(min(3, len(tokens.split()))):
        for j in range(i + 1, min(4, len(tokens.split()))):
            vec_i = memory[batch_idx, i]
            vec_j = memory[batch_idx, j]
            sim = F.cosine_similarity(vec_i.unsqueeze(0), vec_j.unsqueeze(0))
            print(f"  pos {i} vs pos {j}: {sim.item():.4f}")
```
Memory for Cross-Attention
How Decoder Uses Encoder Output
The encoder output serves as keys and values for the decoder's cross-attention:
```
Encoder Output (Memory): [batch, src_len, d_model]
        │
        ▼
Decoder Layer N
  Self-Attention
        │
        ▼
  Cross-Attention
    Q from decoder
    K, V from memory  ◄── encoder output!
        │
        ▼
  Feed-Forward
```
Cross-Attention Computation
```python
# In decoder cross-attention:

# Query from decoder's current state
Q = decoder_state @ W_q   # [batch, tgt_len, d_model]

# Key and Value from encoder memory
K = memory @ W_k          # [batch, src_len, d_model]
V = memory @ W_v          # [batch, src_len, d_model]

# Attention: decoder tokens attend to encoder tokens
# scores: [batch, tgt_len, src_len]
# Each decoder position can look at all source positions
```
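To make those shapes concrete, here is a small runnable sketch of the cross-attention step using nn.MultiheadAttention with random tensors. The batch size, sequence lengths, and padding pattern are made up for illustration; in the real model the decoder supplies the queries and the encoder memory supplies the keys and values.

```python
import torch
import torch.nn as nn

batch, src_len, tgt_len, d_model = 2, 6, 4, 512

memory = torch.randn(batch, src_len, d_model)         # encoder output
decoder_state = torch.randn(batch, tgt_len, d_model)  # decoder hidden states
src_key_padding_mask = torch.tensor([                 # True = padding
    [False, False, False, False, False, True],
    [False, False, False, True,  True,  True],
])

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, weights = cross_attn(
    query=decoder_state,              # Q from decoder
    key=memory,                       # K from encoder memory
    value=memory,                     # V from encoder memory
    key_padding_mask=src_key_padding_mask,
)
print(out.shape)       # torch.Size([2, 4, 512]) -- one vector per target position
print(weights.shape)   # torch.Size([2, 4, 6])   -- [batch, tgt_len, src_len]
```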
Caching for Efficient Inference
Why Cache Encoder Output?
During inference, we generate one token at a time:
```
Without caching:
  Step 1: Encode source → generate token 1
  Step 2: Encode source → generate token 2   (redundant!)
  Step 3: Encode source → generate token 3   (redundant!)

With caching:
  Step 0: Encode source → cache memory
  Step 1: Use cached memory → generate token 1
  Step 2: Use cached memory → generate token 2
  Step 3: Use cached memory → generate token 3
```
Encoder Caching Implementation
```python
import time


class CachedSourceEncoder(SourceEncoder):
    """Source encoder with caching for efficient inference."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cached_memory = None
        self._cached_mask = None
        self._cached_source_ids = None

    def encode_with_cache(
        self,
        source_ids: torch.Tensor,
        use_cache: bool = True
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Encode with optional caching.

        Args:
            source_ids: [batch, src_len]
            use_cache: Whether to use/store cache

        Returns:
            memory, mask
        """
        if use_cache and self._cached_source_ids is not None:
            # Check if same source
            if torch.equal(source_ids, self._cached_source_ids):
                return self._cached_memory, self._cached_mask

        # Compute fresh
        memory, mask = self.forward(source_ids)

        # Cache
        if use_cache:
            self._cached_memory = memory
            self._cached_mask = mask
            self._cached_source_ids = source_ids.clone()

        return memory, mask

    def clear_cache(self):
        """Clear the cached encoder output."""
        self._cached_memory = None
        self._cached_mask = None
        self._cached_source_ids = None


# Demonstration
def demo_caching():
    """Show caching benefit."""
    encoder = CachedSourceEncoder(vocab_size=32000, d_model=512, num_layers=6)
    encoder.eval()
    source_ids = torch.randint(1, 32000, (8, 100))

    with torch.no_grad():
        # First call (computes)
        start = time.time()
        memory1, mask1 = encoder.encode_with_cache(source_ids)
        first_time = time.time() - start

        # Second call (cached)
        start = time.time()
        memory2, mask2 = encoder.encode_with_cache(source_ids)
        cached_time = time.time() - start

    print(f"First call: {first_time*1000:.2f} ms")
    print(f"Cached call: {cached_time*1000:.2f} ms")
    # Guard against a ~0 ms cached call when computing the speedup
    print(f"Speedup: {first_time / max(cached_time, 1e-9):.1f}x")

    # Verify same output
    print(f"Same output: {torch.equal(memory1, memory2)}")


demo_caching()
```
Full Integration Example
Complete Encoding Pipeline
```python
def encode_german_sentences(
    sentences: list,
    tokenizer,
    encoder: SourceEncoder,
    max_length: int = 128,
    device: str = 'cpu'
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Complete pipeline to encode German sentences.

    Args:
        sentences: List of German text strings
        tokenizer: Trained tokenizer
        encoder: Source encoder model
        max_length: Maximum sequence length
        device: Device to use

    Returns:
        memory: [batch, src_len, d_model] encoder output
        src_mask: [batch, src_len] padding mask
    """
    # Tokenize all sentences
    batch_ids = []
    for sentence in sentences:
        ids = tokenizer.encode(sentence)[:max_length]
        batch_ids.append(ids)

    # Pad to same length
    max_len = max(len(ids) for ids in batch_ids)
    padded_ids = []
    for ids in batch_ids:
        padded = ids + [tokenizer.pad_id] * (max_len - len(ids))
        padded_ids.append(padded)

    # Convert to tensor
    source_ids = torch.tensor(padded_ids, dtype=torch.long, device=device)

    # Encode
    encoder = encoder.to(device)
    encoder.eval()

    with torch.no_grad():
        memory, src_mask = encoder(source_ids)

    return memory, src_mask


# Example usage
german_sentences = [
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Die Katze sitzt auf der Matte.",
    "Guten Morgen, wie geht es Ihnen?",
]

# memory, mask = encode_german_sentences(
#     german_sentences, tokenizer, encoder
# )
# print(f"Memory shape: {memory.shape}")
# print("Ready for decoder cross-attention!")
```
Summary
Source Encoding Pipeline
```
German Text
     │
     ▼
Tokenization (BPE)
     │
     ▼
Token IDs [batch, src_len]
     │
     ├────────────────────────────┐
     │                            │
     ▼                            ▼
Embedding + Pos Enc          Padding Mask
     │                            │
     ▼                            │
Encoder (6 layers)                │
     │                            │
     ▼                            │
Memory [batch, src_len, d_model]  │
     │                            │
     └──────────────┬─────────────┘
                    │
                    ▼
      To Decoder Cross-Attention
```
Key Points
| Aspect | Value |
|---|---|
| Encoder runs | Once per source sentence |
| Output name | "Memory" for decoder |
| Output shape | [batch, src_len, d_model] |
| Mask type | Padding mask only |
| Caching | Essential for efficient inference |
Chapter Summary
What We Built
1. Encoder Layer: Self-attention + FFN with Add&Norm
2. Full Encoder: Stack of N=6 layers with final norm
3. Source Encoder: Complete with embedding and positional encoding
4. Masking: Proper padding mask for source sentences
5. Caching: Efficient inference by reusing encoder output
Ready for Decoder
With the encoder complete, we're ready to:
- Build the decoder with masked self-attention
- Implement cross-attention to encoder memory
- Complete the full transformer for translation
Exercises
Implementation
1. Add beam search support to the encoder output caching.
2. Implement gradient checkpointing for the source encoder.
3. Create an encoder that outputs attention weights for visualization.
Analysis
4. Compare encoder output similarity for semantically similar sentences.
5. Visualize how attention patterns change across encoder layers.
In Chapter 8, we'll build the Transformer Decoder: the half of the architecture that generates output. We'll implement masked self-attention, cross-attention to encoder memory, and the complete decoder stack.