Introduction
The decoder is the generative half of the transformer. While the encoder processes the input sequence, the decoder generates the output sequence one token at a time, attending both to its own previous outputs and to the encoded source.
This section provides an architectural overview of the decoder before we implement it.
Encoder vs Decoder
Key Differences
| Aspect | Encoder | Decoder |
|---|---|---|
| Sublayers | 2 (self-attention, FFN) | 3 (self-attention, cross-attention, FFN) |
| Self-attention | Bidirectional | Causal (masked) |
| Cross-attention | None | Attends to encoder output |
| Purpose | Encode source | Generate target |
| Input | Source sequence | Target sequence (shifted) |
Visual Comparison
ENCODER LAYER:              DECODER LAYER:

     Input                       Input
       │                           │
       ▼                           ▼
┌─────────────────┐       ┌─────────────────┐
│ Self-Attention  │       │  Masked Self-   │ ◄ Causal!
│ (Bidirectional) │       │   Attention     │
└────────┬────────┘       └────────┬────────┘
         │                         │
    Add & Norm                Add & Norm
         │                         │
         │                         ▼
         │                ┌─────────────────┐
         │                │ Cross-Attention │ ◄ NEW!
         │                │  (to encoder)   │
         │                └────────┬────────┘
         │                         │
         │                    Add & Norm
         │                         │
         ▼                         ▼
┌─────────────────┐       ┌─────────────────┐
│  Feed-Forward   │       │  Feed-Forward   │
└────────┬────────┘       └────────┬────────┘
         │                         │
    Add & Norm                Add & Norm
         │                         │
         ▼                         ▼
      Output                    Output

The Three Sublayers
Sublayer 1: Masked Self-Attention
Purpose: Allow each decoder token to attend to itself and earlier tokens, but never to future tokens.
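This "no future tokens" constraint is enforced with a causal mask. As a preview of the next section, a minimal sketch in PyTorch (here `True` marks an allowed position; the convention is our choice for illustration):

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may attend to j <= i.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

mask = causal_mask(4)  # target: "<bos> The dog runs"
# Row 2 ("dog") allows positions 0..2 (<bos>, The, dog) but not 3 (runs).
```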
Target: "<bos> The dog runs"

At position 2 ("dog"):
  Can attend to: "<bos>", "The", "dog"
  Cannot attend to: "runs" (future)

This enables autoregressive generation!

Why Masked?
During training, we have the full target sequence. Without masking:
Predicting "runs" at position 3:
  Model could cheat by looking at "runs" in the input!

With causal mask:
  Model can only see "<bos>", "The", "dog"
  Must predict "runs" from context only

Sublayer 2: Cross-Attention
Purpose: Allow decoder to access encoder's representation of the source.
Source (German): "Der Hund läuft"
Target (English): "<bos> The dog ..."

Cross-Attention:
  Q: from decoder ("The", "dog", ...)
  K, V: from encoder ("Der", "Hund", "läuft")

  "dog" can attend to "Hund" (its translation!)

Why Cross-Attention Needs No Causal Mask:
The encoder output represents the full source sentence. The decoder should be able to see the entire source at every step; there's nothing to "hide" in the source.
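Because no mask is needed, cross-attention is just standard multi-head attention with mixed inputs. A sketch using `torch.nn.MultiheadAttention` (the shapes and sizes here are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

memory = torch.randn(1, 3, d_model)  # encoder output: "Der Hund läuft"
tgt = torch.randn(1, 4, d_model)     # decoder states: "<bos> The dog ..."

# Queries come from the decoder; keys and values from the encoder memory.
# No causal mask: every target position may see the whole source.
out, weights = cross_attn(query=tgt, key=memory, value=memory)
# out.shape     -> [1, 4, 512]  (one vector per target position)
# weights.shape -> [1, 4, 3]    (target positions x source positions)
```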
Sublayer 3: Feed-Forward Network
Purpose: Same as in the encoder: a position-wise non-linear transformation.
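A sketch of this sublayer (d_ff = 2048 is the conventional 4x expansion; both sizes are illustrative, not fixed by the architecture):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: d_model -> d_ff -> d_model, applied identically
    at every position. Identical in encoder and decoder (a sketch)."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

x = torch.randn(2, 5, 512)
assert FeedForward()(x).shape == x.shape  # shape preserved per position
```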
Identical to encoder FFN:
  d_model → d_ff → d_model
  With GELU activation and dropout

Complete Decoder Layer Diagram
Target Input
     │
[batch, tgt_len, d_model]
     │
     ├──────────────────────┐
     │                      │
     ▼                      │
┌────────────────┐          │
│  Layer Norm 1  │          │
└───────┬────────┘          │
        │                   │
        ▼                   │
┌────────────────┐          │
│  Masked Self-  │          │
│   Attention    │          │
│ (causal mask)  │          │
└───────┬────────┘          │
        │                   │
     Dropout                │
        │                   │
        ▼                   │
       (+)◄─────────────────┘  Residual 1
        │
        ├──────────────────────┐
        │                      │
        ▼                      │
┌────────────────┐             │
│  Layer Norm 2  │             │
└───────┬────────┘             │
        │                      │
        ▼                      │
┌────────────────┐             │
│ Cross-Attention│◄── Encoder Memory
│   Q=decoder    │    [batch, src_len, d_model]
│  K,V=encoder   │             │
└───────┬────────┘             │
        │                      │
     Dropout                   │
        │                      │
        ▼                      │
       (+)◄────────────────────┘  Residual 2
        │
        ├──────────────────────┐
        │                      │
        ▼                      │
┌────────────────┐             │
│  Layer Norm 3  │             │
└───────┬────────┘             │
        │                      │
        ▼                      │
┌────────────────┐             │
│  Feed-Forward  │             │
│    Network     │             │
└───────┬────────┘             │
        │                      │
     Dropout                   │
        │                      │
        ▼                      │
       (+)◄────────────────────┘  Residual 3
        │
        ▼
 Decoder Output
[batch, tgt_len, d_model]

Information Flow in Decoder
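The flow through a complete decoder layer can be sketched as a pre-norm PyTorch module. This is a sketch, not our final implementation; the hyperparameters and the use of `nn.MultiheadAttention` (where a `True` entry in `attn_mask` means "blocked") are illustrative:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Pre-norm decoder layer: masked self-attention, cross-attention, FFN,
    each wrapped in LayerNorm -> sublayer -> dropout -> residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory):
        T = x.size(1)
        # True above the diagonal = future positions are blocked.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        # Residual 1: masked self-attention over the target.
        h = self.norm1(x)
        h, _ = self.self_attn(h, h, h, attn_mask=causal)
        x = x + self.dropout(h)
        # Residual 2: cross-attention, Q from decoder, K/V from encoder memory.
        h = self.norm2(x)
        h, _ = self.cross_attn(h, memory, memory)
        x = x + self.dropout(h)
        # Residual 3: position-wise feed-forward.
        x = x + self.dropout(self.ffn(self.norm3(x)))
        return x  # [batch, tgt_len, d_model] -- shape preserved

layer = DecoderLayer()
out = layer(torch.randn(1, 4, 512), torch.randn(1, 3, 512))
```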
Training Mode (Teacher Forcing)
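Teacher forcing feeds the whole shifted target in one pass, rather than generating token by token. The shift can be expressed directly on token ids (the ids below are hypothetical, purely for illustration):

```python
import torch

# Hypothetical vocabulary ids: <bos>=1, The=5, dog=6, runs=7, <eos>=2
target = torch.tensor([[1, 5, 6, 7, 2]])  # "<bos> The dog runs <eos>"

decoder_input = target[:, :-1]  # "<bos> The dog runs" -- fed to the decoder
labels = target[:, 1:]          # "The dog runs <eos>" -- compared to logits

# One forward pass scores every position at once:
# the logits at position i are trained to predict labels[:, i].
assert decoder_input.shape == labels.shape == (1, 4)
```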
Source: "Der Hund läuft"
Target: "<bos> The dog runs <eos>"

Encoder:
  "Der Hund läuft" → Memory [1, 3, 512]

Decoder Input (shifted right):
  "<bos> The dog runs" → [1, 4, 512]

Decoder Output:
  Logits for: "The dog runs <eos>"

Loss computed on:
  Predicted: "The dog runs <eos>"
  Target:    "The dog runs <eos>"

Inference Mode (Autoregressive)
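Autoregressive decoding can be sketched as a greedy loop. Here `model` is a hypothetical object exposing `encode(src)` and `decode(tgt, memory)` methods (names assumed for illustration, not from our implementation):

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Sketch of greedy autoregressive decoding."""
    memory = model.encode(src)  # run the encoder ONCE
    tgt = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(tgt, memory)   # [1, cur_len, vocab]
        next_id = logits[:, -1].argmax(-1, keepdim=True)  # last position only
        tgt = torch.cat([tgt, next_id], dim=1)            # append prediction
        if next_id.item() == eos_id:
            break
    return tgt
```

Note the encoder runs once, while the decoder is re-run as the target grows. This simple sketch omits optimizations such as KV caching.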
Step 1:
  Input: "<bos>"
  Predict: "The"

Step 2:
  Input: "<bos> The"
  Predict: "dog"

Step 3:
  Input: "<bos> The dog"
  Predict: "runs"

Step 4:
  Input: "<bos> The dog runs"
  Predict: "<eos>"

Stop: EOS predicted!

Architecture Variants
Encoder-Decoder (Original Transformer)
For: Machine Translation, Summarization, Q&A
Source → [Encoder] → Memory
                        │
Target → [Decoder] ← Cross-Attention
              │
              ▼
           Output

Decoder-Only (GPT Style)
For: Language Modeling, Text Generation
No encoder needed!
Just: Input → [Decoder Layers] → Next Token Prediction

Cross-attention layers removed or repurposed

Encoder-Only (BERT Style)
For: Classification, Named Entity Recognition
Input → [Encoder] → Contextualized Representations
                               │
                               ▼
                      Classification Head

Our Focus
We're building an encoder-decoder model for translation:
- Encoder: Processes German source
- Decoder: Generates English target
- Cross-attention: Connects them
Why Three Sublayers?
Without Cross-Attention?
Decoder with only self-attention:
  - Can only see target tokens
  - No access to source information
  - Like generating English without seeing German!

Result: Nonsensical output, unrelated to source

Why Self-Attention + Cross-Attention?
Self-Attention: Model target language patterns
1"The dog" β likely followed by verb
2"The" β likely followed by noun/adjectiveCross-Attention: Connect to source meaning
1"Der Hund" in source β "The dog" in target
2"lΓ€uft" in source β "runs" should come nextBoth are essential for good translation!
Decoder in the Full Model
Complete Translation Pipeline
┌──────────────────────────────────────────────────────────────┐
│                       FULL TRANSFORMER                       │
│                                                              │
│  ┌────────────────────┐      ┌──────────────────────────┐   │
│  │      ENCODER       │      │         DECODER          │   │
│  │                    │      │                          │   │
│  │  Source Embedding  │      │   Target Embedding       │   │
│  │         │          │      │          │               │   │
│  │  + Positional Enc  │      │   + Positional Enc       │   │
│  │         │          │      │          │               │   │
│  │  Encoder Layer ×N  │─────▶│   Decoder Layer ×N       │   │
│  │         │          │      │  (uses encoder memory)   │   │
│  │  Final LayerNorm   │      │          │               │   │
│  │         │          │      │   Final LayerNorm        │   │
│  │      Memory        │      │          │               │   │
│  │                    │      │   Output Projection      │   │
│  │                    │      │   (to vocabulary)        │   │
│  └────────────────────┘      └──────────────────────────┘   │
│                                          │                   │
│                                       Logits                 │
│                             [batch, tgt_len, vocab]          │
└──────────────────────────────────────────────────────────────┘

Summary
Decoder Layer Components
| Sublayer | Input | Purpose |
|---|---|---|
| Masked Self-Attention | Target | Model target patterns, prevent future peeking |
| Cross-Attention | Decoder + Encoder | Access source information |
| Feed-Forward | Position-wise | Non-linear transformation |
Key Properties
- Autoregressive: Generates one token at a time
- Causal Masking: Can't see future during training
- Cross-Attention: Connects to encoder output
- Same dimension: Input/output shape preserved
Exercises
Conceptual Questions
1. Why does the decoder need both self-attention AND cross-attention?
2. What would happen if we removed the causal mask during training?
3. How does teacher forcing speed up training compared to autoregressive training?
Design Questions
4. For a language model (decoder-only), which sublayers are kept/removed?
5. Could we use bidirectional attention in the decoder? Why or why not?
6. How does the decoder know when to stop generating?
Next Section Preview
In the next section, we'll implement causal masking: the mechanism that prevents the decoder from cheating by looking at future tokens during training.