Introduction
In machine translation, the encoder's job is to process the source language (German) and produce rich contextual representations. The decoder then uses these representations to generate the target language (English).
This section covers how to properly prepare and encode source sentences for our translation project.
The Encoder's Role in Translation
Translation Pipeline Overview
```
Source (German): "Der Hund läuft im Park"
        │
        ▼
TOKENIZATION
  → [Der, Hund, läuft, im, Park]
  → Token IDs: [45, 892, 1203, 28, 567]
        │
        ▼
EMBEDDING
  → [batch, 5, 512] + positional encoding
        │
        ▼
ENCODER
  6 layers of self-attention + FFN
  Each German token now has context from all other tokens
        │
        ▼
ENCODER OUTPUT
  "Memory" for decoder: [batch, 5, 512]
  Rich representations of the source sentence
        │
        ▼
DECODER
  Cross-attention to encoder output
  Generates: "The dog runs in the park"
```
Key Insight: Encoder Runs Once
For each source sentence:
- Encoder runs once to produce memory
- Decoder runs multiple times (one per output token)
- Encoder output is cached and reused
```
Source: "Der Hund läuft"

Encoder: runs 1 time → Memory

Decoder:
  Step 1: <bos>              + Memory → "The"
  Step 2: <bos> The          + Memory → "dog"
  Step 3: <bos> The dog      + Memory → "runs"
  Step 4: <bos> The dog runs + Memory → <eos>

Memory is computed once, used 4 times!
```
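To see the same pattern in code, here is a minimal sketch of "encode once, decode step by step" built from `nn.Transformer` components with random weights. The token IDs, toy vocabulary size, `<bos>` ID, and the greedy loop are illustrative assumptions, not the chapter's actual model (which we build below and in Chapter 8).

```python
import torch
import torch.nn as nn

# Minimal sketch: the encoder runs once, the decoder runs once per output token,
# and the same memory is reused at every step. Random weights, made-up IDs,
# no positional encoding -- illustration only.
model = nn.Transformer(d_model=512, batch_first=True)
embed = nn.Embedding(1000, 512)    # assumed toy vocabulary of 1000 tokens
proj = nn.Linear(512, 1000)        # hidden state -> vocabulary logits

src_ids = torch.tensor([[45, 892, 1203]])       # "Der Hund läuft"
memory = model.encoder(embed(src_ids))          # encoder runs ONCE

generated = [2]                                 # assume 2 = <bos>
for _ in range(4):                              # decoder runs once per token
    tgt = embed(torch.tensor([generated]))
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(len(generated))
    out = model.decoder(tgt, memory, tgt_mask=tgt_mask)   # same memory each step
    generated.append(proj(out[0, -1]).argmax().item())

print(generated)   # <bos> followed by 4 (untrained, hence arbitrary) token IDs
```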
Source Padding Mask
Why Source Needs Masking
Batched sentences have different lengths:
```
Batch:
  Sentence 1: "Der Hund läuft im Park"           (5 tokens)
  Sentence 2: "Die Katze schläft"                (3 tokens)
  Sentence 3: "Das Wetter ist heute sehr schön"  (6 tokens)

After padding (to length 6):
  Sentence 1: [Der, Hund, läuft, im, Park, <pad>]
  Sentence 2: [Die, Katze, schläft, <pad>, <pad>, <pad>]
  Sentence 3: [Das, Wetter, ist, heute, sehr, schön]
```
Tokens should NOT attend to <pad> positions!
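In code, this padding step is usually done with `torch.nn.utils.rnn.pad_sequence`. The sketch below uses made-up token IDs and assumes `pad_id = 0`, as elsewhere in this chapter.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sentences of different lengths, as token-ID tensors (IDs are made up)
sentences = [
    torch.tensor([45, 892, 1203, 28, 567]),   # 5 tokens
    torch.tensor([101, 774, 3301]),           # 3 tokens
    torch.tensor([12, 34, 56, 78, 90, 11]),   # 6 tokens
]

# Pad every sentence to the length of the longest one with pad_id = 0
batch = pad_sequence(sentences, batch_first=True, padding_value=0)
print(batch.shape)   # torch.Size([3, 6])
print(batch)
```

In practice this usually lives in a DataLoader collate function, so each batch is only padded to its own longest sentence.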
Creating Source Padding Mask
```python
import torch


def create_source_mask(
    source_ids: torch.Tensor,
    pad_id: int = 0
) -> torch.Tensor:
    """
    Create padding mask for encoder.

    Args:
        source_ids: [batch, src_len] source token IDs
        pad_id: Padding token ID (usually 0)

    Returns:
        mask: [batch, 1, 1, src_len] attention mask
              True = attend, False = ignore

    Example:
        >>> src = torch.tensor([[1, 2, 3, 0, 0], [1, 2, 0, 0, 0]])
        >>> mask = create_source_mask(src)
        >>> mask.squeeze()
        tensor([[ True,  True,  True, False, False],
                [ True,  True, False, False, False]])
    """
    # Shape: [batch, src_len]
    mask = (source_ids != pad_id)

    # Expand for attention: [batch, 1, 1, src_len]
    mask = mask.unsqueeze(1).unsqueeze(2)

    return mask


# Test
source = torch.tensor([
    [45, 892, 1203, 28, 567, 0],   # 5 real tokens
    [23, 456, 789, 0, 0, 0],       # 3 real tokens
    [12, 34, 56, 78, 90, 11],      # 6 real tokens (no padding)
])

mask = create_source_mask(source, pad_id=0)
print(f"Source shape: {source.shape}")
print(f"Mask shape: {mask.shape}")
print(f"Mask:\n{mask.squeeze()}")
```
Output:
```
Source shape: torch.Size([3, 6])
Mask shape: torch.Size([3, 1, 1, 6])
Mask:
tensor([[ True,  True,  True,  True,  True, False],
        [ True,  True,  True, False, False, False],
        [ True,  True,  True,  True,  True,  True]])
```
For nn.MultiheadAttention
PyTorch's nn.MultiheadAttention uses the key_padding_mask convention, where True marks positions to ignore:
```python
def create_key_padding_mask(
    source_ids: torch.Tensor,
    pad_id: int = 0
) -> torch.Tensor:
    """
    Create key_padding_mask for nn.MultiheadAttention.

    Args:
        source_ids: [batch, src_len]
        pad_id: Padding token ID

    Returns:
        mask: [batch, src_len] where True = IGNORE
    """
    # True where padding (opposite of the attention mask above!)
    return source_ids == pad_id


# For nn.MultiheadAttention
key_padding_mask = create_key_padding_mask(source, pad_id=0)
print(f"Key padding mask:\n{key_padding_mask}")
```
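As a quick sanity check that the convention works end to end, the sketch below feeds this mask into nn.MultiheadAttention as self-attention over the padded source, reusing the `source` tensor and `create_key_padding_mask` from above. The random embeddings stand in for the real embedding layer and are purely illustrative.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(3, 6, d_model)                                 # stand-in embeddings for `source`
key_padding_mask = create_key_padding_mask(source, pad_id=0)   # [3, 6], True = ignore

out, attn_weights = mha(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)           # torch.Size([3, 6, 512])
print(attn_weights[0, 0])  # weights for the first query: the padded key position gets 0
```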
Complete Source Encoder
Full Implementation
```python
import math

import torch
import torch.nn as nn
from typing import Tuple


class SourceEncoder(nn.Module):
    """
    Complete source encoder for translation.

    Combines embedding, positional encoding, and transformer encoder.

    Args:
        vocab_size: Source vocabulary size
        d_model: Model dimension
        num_heads: Number of attention heads
        num_layers: Number of encoder layers
        d_ff: Feed-forward dimension
        max_seq_len: Maximum sequence length
        dropout: Dropout probability
        pad_id: Padding token ID

    Example:
        >>> encoder = SourceEncoder(vocab_size=32000, d_model=512)
        >>> source_ids = torch.randint(0, 32000, (2, 10))
        >>> memory, mask = encoder(source_ids)
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_seq_len: int = 5000,
        dropout: float = 0.1,
        pad_id: int = 0
    ):
        super().__init__()

        self.d_model = d_model
        self.pad_id = pad_id

        # Token embedding
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)

        # Positional encoding (sinusoidal)
        self.register_buffer(
            'pos_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Embedding dropout
        self.dropout = nn.Dropout(dropout)

        # Scaling factor
        self.scale = d_model ** 0.5

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,
            norm_first=True  # Pre-LN
        )
        self.encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
            norm=nn.LayerNorm(d_model)
        )

    def _create_positional_encoding(
        self,
        max_len: int,
        d_model: int
    ) -> torch.Tensor:
        """Create sinusoidal positional encoding."""
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )

        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)

        return pe

    def forward(
        self,
        source_ids: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Encode source sequence.

        Args:
            source_ids: [batch, src_len] source token IDs

        Returns:
            memory: [batch, src_len, d_model] encoder output
            src_key_padding_mask: [batch, src_len] padding mask
        """
        # Create padding mask
        src_key_padding_mask = (source_ids == self.pad_id)

        # Embed tokens
        seq_len = source_ids.size(1)
        x = self.embedding(source_ids) * self.scale

        # Add positional encoding
        x = x + self.pos_encoding[:, :seq_len]

        # Apply dropout
        x = self.dropout(x)

        # Encode
        memory = self.encoder(x, src_key_padding_mask=src_key_padding_mask)

        return memory, src_key_padding_mask


# Test
def test_source_encoder():
    """Test the complete source encoder."""

    vocab_size = 32000
    d_model = 512
    batch_size = 4
    src_len = 20

    encoder = SourceEncoder(
        vocab_size=vocab_size,
        d_model=d_model,
        num_layers=6
    )

    # Create source with varying lengths
    source_ids = torch.randint(1, vocab_size, (batch_size, src_len))
    # Add padding
    source_ids[0, 15:] = 0
    source_ids[1, 10:] = 0
    source_ids[2, 18:] = 0
    # source_ids[3] has no padding

    # Encode
    memory, mask = encoder(source_ids)

    print("Source Encoder Test")
    print("=" * 50)
    print(f"Input shape: {source_ids.shape}")
    print(f"Memory shape: {memory.shape}")
    print(f"Mask shape: {mask.shape}")
    print("\nPadding mask (True=padding):")
    print(mask)

    # Verify memory at padding positions
    # (should still exist but will be masked in decoder)
    print(f"\nMemory norm at position 0: {memory[:, 0].norm():.2f}")
    print(f"Memory norm at position -1: {memory[:, -1].norm():.2f}")

    print("\n✓ Source encoder test passed!")


test_source_encoder()
```
Encoder Output Interpretation
What Does the Encoder Produce?
After 6 layers of self-attention, each position contains:
```
Original token representation:
  "Hund" → initial embedding (isolated word meaning)

After encoder:
  "Hund" → contextual representation
  Contains information about:
    - its role as subject of "läuft"
    - its relationship to "Der" (article)
    - being part of the phrase "im Park" (location context)
```
Visualizing Encoder Output
```python
import torch.nn.functional as F


def analyze_encoder_output(memory, source_ids, tokenizer):
    """Analyze what the encoder has learned."""

    batch_idx = 0  # First sentence

    # Get tokens (whitespace splitting is a rough stand-in for the
    # tokenizer's actual subword boundaries)
    tokens = tokenizer.decode(source_ids[batch_idx].tolist())

    print("Token-wise analysis:")
    print("-" * 40)

    for pos, token in enumerate(tokens.split()):
        vec = memory[batch_idx, pos]
        norm = vec.norm().item()
        mean = vec.mean().item()
        std = vec.std().item()

        print(f"Position {pos}: '{token}'")
        print(f"  Norm: {norm:.2f}, Mean: {mean:.4f}, Std: {std:.4f}")

    # Token similarity analysis
    print("\nToken similarities (cosine):")
    for i in range(min(3, len(tokens.split()))):
        for j in range(i + 1, min(4, len(tokens.split()))):
            vec_i = memory[batch_idx, i]
            vec_j = memory[batch_idx, j]
            sim = F.cosine_similarity(vec_i.unsqueeze(0), vec_j.unsqueeze(0))
            print(f"  pos {i} vs pos {j}: {sim.item():.4f}")
```
Memory for Cross-Attention
How Decoder Uses Encoder Output
The encoder output serves as keys and values for the decoder's cross-attention:
```
Encoder Output (Memory): [batch, src_len, d_model]
        │
        ▼
Decoder Layer N
  Self-Attention
        │
        ▼
  Cross-Attention
    Q from decoder
    K, V from memory  ◄── encoder output!
        │
        ▼
  Feed-Forward
```
Cross-Attention Computation
```python
# In decoder cross-attention:

# Query from decoder's current state
Q = decoder_state @ W_q   # [batch, tgt_len, d_model]

# Key and Value from encoder memory
K = memory @ W_k          # [batch, src_len, d_model]
V = memory @ W_v          # [batch, src_len, d_model]

# Attention: decoder tokens attend to encoder tokens
# scores: [batch, tgt_len, src_len]
# Each decoder position can look at all source positions
```
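To make those shapes concrete, here is a small runnable sketch of the cross-attention step using nn.MultiheadAttention with random tensors. The batch size, sequence lengths, and padding pattern are made up for illustration; in the real model the decoder supplies the queries and the encoder memory supplies the keys and values.

```python
import torch
import torch.nn as nn

batch, src_len, tgt_len, d_model = 2, 6, 4, 512

memory = torch.randn(batch, src_len, d_model)         # encoder output
decoder_state = torch.randn(batch, tgt_len, d_model)  # decoder hidden states
src_key_padding_mask = torch.tensor([                 # True = padding
    [False, False, False, False, False, True],
    [False, False, False, True,  True,  True],
])

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, weights = cross_attn(
    query=decoder_state,              # Q from decoder
    key=memory,                       # K from encoder memory
    value=memory,                     # V from encoder memory
    key_padding_mask=src_key_padding_mask,
)
print(out.shape)       # torch.Size([2, 4, 512]) -- one vector per target position
print(weights.shape)   # torch.Size([2, 4, 6])   -- [batch, tgt_len, src_len]
```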
Caching for Efficient Inference
Why Cache Encoder Output?
During inference, we generate one token at a time:
```
Without caching:
  Step 1: Encode source → generate token 1
  Step 2: Encode source → generate token 2   (redundant!)
  Step 3: Encode source → generate token 3   (redundant!)

With caching:
  Step 0: Encode source → cache memory
  Step 1: Use cached memory → generate token 1
  Step 2: Use cached memory → generate token 2
  Step 3: Use cached memory → generate token 3
```
Encoder Caching Implementation
```python
import time


class CachedSourceEncoder(SourceEncoder):
    """Source encoder with caching for efficient inference."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cached_memory = None
        self._cached_mask = None
        self._cached_source_ids = None

    def encode_with_cache(
        self,
        source_ids: torch.Tensor,
        use_cache: bool = True
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Encode with optional caching.

        Args:
            source_ids: [batch, src_len]
            use_cache: Whether to use/store cache

        Returns:
            memory, mask
        """
        if use_cache and self._cached_source_ids is not None:
            # Check if same source
            if torch.equal(source_ids, self._cached_source_ids):
                return self._cached_memory, self._cached_mask

        # Compute fresh
        memory, mask = self.forward(source_ids)

        # Cache
        if use_cache:
            self._cached_memory = memory
            self._cached_mask = mask
            self._cached_source_ids = source_ids.clone()

        return memory, mask

    def clear_cache(self):
        """Clear the cached encoder output."""
        self._cached_memory = None
        self._cached_mask = None
        self._cached_source_ids = None


# Demonstration
def demo_caching():
    """Show caching benefit."""
    encoder = CachedSourceEncoder(vocab_size=32000, d_model=512, num_layers=6)
    encoder.eval()
    source_ids = torch.randint(1, 32000, (8, 100))

    with torch.no_grad():
        # First call (computes)
        start = time.time()
        memory1, mask1 = encoder.encode_with_cache(source_ids)
        first_time = time.time() - start

        # Second call (cached)
        start = time.time()
        memory2, mask2 = encoder.encode_with_cache(source_ids)
        cached_time = time.time() - start

    print(f"First call: {first_time*1000:.2f} ms")
    print(f"Cached call: {cached_time*1000:.2f} ms")
    # Guard against a ~0 ms cached call when computing the speedup
    print(f"Speedup: {first_time / max(cached_time, 1e-9):.1f}x")

    # Verify same output
    print(f"Same output: {torch.equal(memory1, memory2)}")


demo_caching()
```
Full Integration Example
Complete Encoding Pipeline
```python
def encode_german_sentences(
    sentences: list,
    tokenizer,
    encoder: SourceEncoder,
    max_length: int = 128,
    device: str = 'cpu'
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Complete pipeline to encode German sentences.

    Args:
        sentences: List of German text strings
        tokenizer: Trained tokenizer
        encoder: Source encoder model
        max_length: Maximum sequence length
        device: Device to use

    Returns:
        memory: [batch, src_len, d_model] encoder output
        src_mask: [batch, src_len] padding mask
    """
    # Tokenize all sentences
    batch_ids = []
    for sentence in sentences:
        ids = tokenizer.encode(sentence)[:max_length]
        batch_ids.append(ids)

    # Pad to same length
    max_len = max(len(ids) for ids in batch_ids)
    padded_ids = []
    for ids in batch_ids:
        padded = ids + [tokenizer.pad_id] * (max_len - len(ids))
        padded_ids.append(padded)

    # Convert to tensor
    source_ids = torch.tensor(padded_ids, dtype=torch.long, device=device)

    # Encode
    encoder = encoder.to(device)
    encoder.eval()

    with torch.no_grad():
        memory, src_mask = encoder(source_ids)

    return memory, src_mask


# Example usage
german_sentences = [
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Die Katze sitzt auf der Matte.",
    "Guten Morgen, wie geht es Ihnen?",
]

# memory, mask = encode_german_sentences(
#     german_sentences, tokenizer, encoder
# )
# print(f"Memory shape: {memory.shape}")
# print("Ready for decoder cross-attention!")
```
Summary
Source Encoding Pipeline
```
German Text
     │
     ▼
Tokenization (BPE)
     │
     ▼
Token IDs [batch, src_len]
     │
     ├────────────────────────────┐
     │                            │
     ▼                            ▼
Embedding + Pos Enc          Padding Mask
     │                            │
     ▼                            │
Encoder (6 layers)                │
     │                            │
     ▼                            │
Memory [batch, src_len, d_model]  │
     │                            │
     └──────────────┬─────────────┘
                    │
                    ▼
      To Decoder Cross-Attention
```
Key Points
| Aspect | Value |
|---|---|
| Encoder runs | Once per source sentence |
| Output name | "Memory" for decoder |
| Output shape | [batch, src_len, d_model] |
| Mask type | Padding mask only |
| Caching | Essential for efficient inference |
Chapter Summary
What We Built
1. Encoder Layer: Self-attention + FFN with Add&Norm
2. Full Encoder: Stack of N=6 layers with final norm
3. Source Encoder: Complete with embedding and positional encoding
4. Masking: Proper padding mask for source sentences
5. Caching: Efficient inference by reusing encoder output
Ready for Decoder
With the encoder complete, we're ready to:
- Build the decoder with masked self-attention
- Implement cross-attention to encoder memory
- Complete the full transformer for translation
Exercises
Implementation
1. Add beam search support to the encoder output caching.
2. Implement gradient checkpointing for the source encoder.
3. Create an encoder that outputs attention weights for visualization.
Analysis
4. Compare encoder output similarity for semantically similar sentences.
5. Visualize how attention patterns change across encoder layers.
In Chapter 8, we'll build the Transformer Decoder: the half of the architecture that generates output. We'll implement masked self-attention, cross-attention to encoder memory, and the complete decoder stack.