Chapter 8

Decoder Architecture Overview

Transformer Decoder

Introduction

The decoder is the generative half of the transformer. While the encoder processes the input sequence, the decoder generates the output sequence one token at a time, attending both to its own previous outputs and to the encoded source.

This section provides an architectural overview of the decoder before we implement it.


Encoder vs Decoder

Key Differences

| Aspect          | Encoder                  | Decoder                                  |
|-----------------|--------------------------|------------------------------------------|
| Sublayers       | 2 (self-attention, FFN)  | 3 (self-attention, cross-attention, FFN) |
| Self-attention  | Bidirectional            | Causal (masked)                          |
| Cross-attention | None                     | Attends to encoder output                |
| Purpose         | Encode source            | Generate target                          |
| Input           | Source sequence          | Target sequence (shifted)                |

Visual Comparison

πŸ“text
ENCODER LAYER:                    DECODER LAYER:

Input                             Input
  β”‚                                 β”‚
  β–Ό                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Self-Attention  β”‚              β”‚ Masked Self-    β”‚ ← Causal!
β”‚ (Bidirectional) β”‚              β”‚ Attention       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
    Add & Norm                       Add & Norm
         β”‚                                β”‚
         β”‚                                β–Ό
         β”‚                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                       β”‚ Cross-Attention β”‚ ← NEW!
         β”‚                       β”‚ (to encoder)    β”‚
         β”‚                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
         β”‚                           Add & Norm
         β”‚                                β”‚
         β–Ό                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Feed-Forward   β”‚              β”‚  Feed-Forward   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
    Add & Norm                       Add & Norm
         β”‚                                β”‚
         β–Ό                                β–Ό
      Output                          Output

The Three Sublayers

Sublayer 1: Masked Self-Attention

Purpose: Allow decoder tokens to attend to each other, but only to the current and earlier positions, never to future tokens.

πŸ“text
Target: "<bos> The dog runs"

At position 2 ("dog"):
  Can attend to: "<bos>", "The", "dog"
  Cannot attend to: "runs" (future)

This enables autoregressive generation!

Why Masked?

During training, we have the full target sequence. Without masking:

πŸ“text
Predicting "runs" at position 3:
  Model could cheat by looking at "runs" in the input!

With causal mask:
  Model can only see "<bos>", "The", "dog"
  Must predict "runs" from context only
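The causal mask itself is simple to construct. Below is a minimal NumPy sketch; the function name `causal_mask` is illustrative, and a real implementation would build the equivalent boolean (or additive `-inf`) tensor inside the attention layer:

```python
import numpy as np

def causal_mask(size: int) -> np.ndarray:
    """Boolean mask where True marks positions a token may attend to."""
    # Lower-triangular matrix: position i attends to positions 0..i only.
    return np.tril(np.ones((size, size), dtype=bool))

mask = causal_mask(4)  # for "<bos> The dog runs"
# Row 2 ("dog") may see columns 0..2 but not column 3 ("runs").
print(mask[2])  # [ True  True  True False]
```

In practice the `False` entries are applied by setting the corresponding attention scores to a large negative value before the softmax, so they receive (near) zero weight.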

Sublayer 2: Cross-Attention

Purpose: Allow decoder to access encoder's representation of the source.

πŸ“text
Source (German): "Der Hund lΓ€uft"
Target (English): "<bos> The dog ..."

Cross-Attention:
  Q: from decoder ("The", "dog", ...)
  K, V: from encoder ("Der", "Hund", "lΓ€uft")

  "dog" can attend to "Hund" (its translation!)

Why Cross-Attention Needs No Causal Mask:

The encoder output represents the full source sentence. The decoder should be able to see the entire source at every stepβ€”there's nothing to "hide" in the source.
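To make the Q/K/V split concrete, here is a toy single-head cross-attention in NumPy. All names and sizes are illustrative, and the learned Q/K/V projections, multiple heads, and dropout are deliberately omitted; the point is only that queries come from the decoder while keys and values come from the encoder memory:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_memory):
    """Single-head cross-attention sketch: Q from decoder, K/V from encoder."""
    q, k, v = decoder_states, encoder_memory, encoder_memory  # projections omitted
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [batch, tgt, src]
    weights = softmax(scores)                                 # no causal mask needed
    return weights @ v                                        # [batch, tgt, d_model]

rng = np.random.default_rng(0)
memory = rng.normal(size=(1, 3, 8))  # "Der Hund lΓ€uft" encoded
target = rng.normal(size=(1, 4, 8))  # "<bos> The dog ..." decoder states
out = cross_attention(target, memory)
print(out.shape)  # (1, 4, 8)
```

Note the output has the target length, not the source length: each target position produces a weighted summary of the whole source.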

Sublayer 3: Feed-Forward Network

Purpose: Same as encoderβ€”non-linear transformation of each position.

πŸ“text
Identical to encoder FFN:
  d_model β†’ d_ff β†’ d_model
  With GELU activation and dropout
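As a shape-level sketch, the position-wise FFN is just two matrix multiplies with a GELU in between (here the common tanh approximation; dropout and biases initialized to zero for brevity, all variable names illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: d_model -> d_ff -> d_model (dropout omitted)."""
    return gelu(x @ w1 + b1) @ w2 + b2

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, d_model))
w1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
print(feed_forward(x, w1, b1, w2, b2).shape)  # (1, 4, 8)
```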

Complete Decoder Layer Diagram

πŸ“text
                    Target Input
                          β”‚
               [batch, tgt_len, d_model]
                          β”‚
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚                     β”‚
               β–Ό                     β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 1   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Masked Self-   β”‚             β”‚
      β”‚ Attention      β”‚             β”‚
      β”‚ (causal mask)  β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 1
              β”‚
              β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 2   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Cross-Attention│◄──── Encoder Memory
      β”‚ Q=decoder      β”‚      [batch, src_len, d_model]
      β”‚ K,V=encoder    β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 2
              β”‚
              β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 3   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Feed-Forward   β”‚             β”‚
      β”‚ Network        β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 3
              β”‚
              β–Ό
        Decoder Output
     [batch, tgt_len, d_model]
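The diagram above can be condensed into a shape-level NumPy sketch of a pre-norm decoder layer. This toy version omits learned weights, multi-head splitting, and dropout, and stands in a simple elementwise nonlinearity for the FFN, so it demonstrates only the wiring (norm β†’ sublayer β†’ residual, three times) and the shape contract:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(q, k, v, causal=False):
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if causal:
        t = q.shape[1]
        # mask out future positions with a large negative value
        scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -1e9)
    return softmax(scores) @ v

def decoder_layer(x, memory):
    """Pre-norm decoder layer sketch (projections, heads, dropout omitted)."""
    h = layer_norm(x)
    x = x + attention(h, h, h, causal=True)  # 1: masked self-attention + residual
    h = layer_norm(x)
    x = x + attention(h, memory, memory)     # 2: cross-attention + residual
    h = layer_norm(x)
    x = x + np.maximum(h, 0)                 # 3: FFN stand-in (ReLU here) + residual
    return x

rng = np.random.default_rng(0)
tgt = rng.normal(size=(1, 4, 8))  # target states, tgt_len=4
mem = rng.normal(size=(1, 3, 8))  # encoder memory, src_len=3
print(decoder_layer(tgt, mem).shape)  # (1, 4, 8): shape preserved
```

Note the output keeps the decoder's `[batch, tgt_len, d_model]` shape regardless of the source length, which is what allows decoder layers to be stacked.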

Information Flow in Decoder

Training Mode (Teacher Forcing)

πŸ“text
Source: "Der Hund lΓ€uft"
Target: "<bos> The dog runs <eos>"

Encoder:
  "Der Hund lΓ€uft" β†’ Memory [1, 3, 512]

Decoder Input (shifted right):
  "<bos> The dog runs" β†’ [1, 4, 512]

Decoder Output:
  Logits for: "The dog runs <eos>"

Loss computed on:
  Predicted: "The dog runs <eos>"
  Target:    "The dog runs <eos>"
Inference Mode (Autoregressive)

πŸ“text
Step 1:
  Input: "<bos>"
  Predict: "The"

Step 2:
  Input: "<bos> The"
  Predict: "dog"

Step 3:
  Input: "<bos> The dog"
  Predict: "runs"

Step 4:
  Input: "<bos> The dog runs"
  Predict: "<eos>"

Stop: EOS predicted!
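The loop above is just a while-loop around the model. Here is a sketch of greedy decoding where a toy lookup table stands in for the real model's argmax over vocabulary logits (the `NEXT` table and `greedy_decode` name are purely illustrative):

```python
# Toy "model": maps the current prefix to the next token. A real decoder
# would run a forward pass and take argmax over the vocabulary logits.
NEXT = {
    ("<bos>",): "The",
    ("<bos>", "The"): "dog",
    ("<bos>", "The", "dog"): "runs",
    ("<bos>", "The", "dog", "runs"): "<eos>",
}

def greedy_decode(max_len=10):
    tokens = ["<bos>"]
    for _ in range(max_len):          # cap length in case EOS never appears
        nxt = NEXT[tuple(tokens)]     # real model: argmax over next-token logits
        tokens.append(nxt)
        if nxt == "<eos>":            # stop as soon as EOS is generated
            break
    return tokens

print(greedy_decode())  # ['<bos>', 'The', 'dog', 'runs', '<eos>']
```

Note the two stopping conditions: an EOS token, or a maximum length as a safety net.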

Architecture Variants

Encoder-Decoder (Original Transformer)

For: Machine Translation, Summarization, Q&A

πŸ“text
Source β†’ [Encoder] β†’ Memory
                    ↓
Target β†’ [Decoder] ← Cross-Attention
        ↓
     Output

Decoder-Only (GPT Style)

For: Language Modeling, Text Generation

πŸ“text
No encoder needed!
Just: Input β†’ [Decoder Layers] β†’ Next Token Prediction

Cross-attention layers removed (no encoder memory to attend to)

Encoder-Only (BERT Style)

For: Classification, Named Entity Recognition

πŸ“text
Input β†’ [Encoder] β†’ Contextualized Representations
                      ↓
                Classification Head

Our Focus

We're building encoder-decoder for translation:

- Encoder: Processes German source

- Decoder: Generates English target

- Cross-attention: Connects them


Why Three Sublayers?

Without Cross-Attention?

πŸ“text
Decoder with only self-attention:
  - Can only see target tokens
  - No access to source information
  - Like generating English without seeing German!

Result: Nonsensical output, unrelated to source

Why Self-Attention + Cross-Attention?

Self-Attention: Model target language patterns

πŸ“text
"The dog" β†’ likely followed by verb
"The" β†’ likely followed by noun/adjective

Cross-Attention: Connect to source meaning

πŸ“text
"Der Hund" in source β†’ "The dog" in target
"lΓ€uft" in source β†’ "runs" should come next

Both are essential for good translation!


Decoder in the Full Model

Complete Translation Pipeline

πŸ“text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    FULL TRANSFORMER                          β”‚
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚      ENCODER         β”‚     β”‚        DECODER            β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚                           β”‚ β”‚
β”‚  β”‚  Source Embedding    β”‚     β”‚  Target Embedding         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚  + Positional Enc    β”‚     β”‚  + Positional Enc         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚  Encoder Layer Γ—N    │────▢│  Decoder Layer Γ—N         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚  (uses encoder memory)    β”‚ β”‚
β”‚  β”‚  Final LayerNorm     β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚  Final LayerNorm          β”‚ β”‚
β”‚  β”‚      Memory          β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚  Output Projection        β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚  (to vocabulary)          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                          ↓                   β”‚
β”‚                                      Logits                  β”‚
β”‚                                 [batch, tgt_len, vocab]      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Summary

Decoder Layer Components

| Sublayer              | Input             | Purpose                                        |
|-----------------------|-------------------|------------------------------------------------|
| Masked Self-Attention | Target            | Model target patterns, prevent future peeking  |
| Cross-Attention       | Decoder + Encoder | Access source information                      |
| Feed-Forward          | Position-wise     | Non-linear transformation                      |

Key Properties

- Autoregressive: Generates one token at a time

- Causal Masking: Can't see future during training

- Cross-Attention: Connects to encoder output

- Same dimension: Input/output shape preserved


Exercises

Conceptual Questions

1. Why does the decoder need both self-attention AND cross-attention?

2. What would happen if we removed the causal mask during training?

3. How does teacher forcing speed up training compared to autoregressive training?

Design Questions

4. For a language model (decoder-only), which sublayers are kept/removed?

5. Could we use bidirectional attention in the decoder? Why or why not?

6. How does the decoder know when to stop generating?


Next Section Preview

In the next section, we'll implement causal maskingβ€”the mechanism that prevents the decoder from cheating by looking at future tokens during training.