Chapter 8

Decoder Architecture Overview

Transformer Decoder

Introduction

The decoder is the generative half of the transformer. While the encoder processes the input sequence, the decoder generates the output sequence one token at a time, attending both to its own previous outputs and to the encoded source.

This section provides an architectural overview of the decoder before we implement it.


Encoder vs Decoder

Key Differences

| Aspect          | Encoder                  | Decoder                                  |
|-----------------|--------------------------|------------------------------------------|
| Sublayers       | 2 (self-attention, FFN)  | 3 (self-attention, cross-attention, FFN) |
| Self-attention  | Bidirectional            | Causal (masked)                          |
| Cross-attention | None                     | Attends to encoder output                |
| Purpose         | Encode source            | Generate target                          |
| Input           | Source sequence          | Target sequence (shifted)                |

Visual Comparison

πŸ“text
ENCODER LAYER:                    DECODER LAYER:

Input                             Input
  β”‚                                 β”‚
  β–Ό                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Self-Attention  β”‚              β”‚ Masked Self-    β”‚ ← Causal!
β”‚ (Bidirectional) β”‚              β”‚ Attention       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
    Add & Norm                       Add & Norm
         β”‚                                β”‚
         β”‚                                β–Ό
         β”‚                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                       β”‚ Cross-Attention β”‚ ← NEW!
         β”‚                       β”‚ (to encoder)    β”‚
         β”‚                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
         β”‚                           Add & Norm
         β”‚                                β”‚
         β–Ό                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Feed-Forward   β”‚              β”‚  Feed-Forward   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                β”‚
    Add & Norm                       Add & Norm
         β”‚                                β”‚
         β–Ό                                β–Ό
      Output                          Output

The Three Sublayers

Sublayer 1: Masked Self-Attention

Purpose: Allow decoder tokens to attend to each other, but only to the current and earlier positions, never to future tokens.

πŸ“text
Target: "<bos> The dog runs"

At position 2 ("dog"):
  Can attend to: "<bos>", "The", "dog"
  Cannot attend to: "runs" (future)

This enables autoregressive generation!

Why Masked?

During training, we have the full target sequence. Without masking:

πŸ“text
Predicting "runs" at position 3:
  Model could cheat by looking at "runs" in the input!

With causal mask:
  Model can only see "<bos>", "The", "dog"
  Must predict "runs" from context only
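The causal mask itself is simple to construct. Below is a minimal NumPy sketch; the function name `causal_mask` is illustrative, and a real implementation would build the equivalent boolean (or additive `-inf`) tensor inside the attention layer:

```python
import numpy as np

def causal_mask(size: int) -> np.ndarray:
    """Boolean mask where True marks positions a token may attend to."""
    # Lower-triangular matrix: position i attends to positions 0..i only.
    return np.tril(np.ones((size, size), dtype=bool))

mask = causal_mask(4)  # for "<bos> The dog runs"
# Row 2 ("dog") may see columns 0..2 but not column 3 ("runs").
print(mask[2])  # [ True  True  True False]
```

In practice the `False` entries are applied by setting the corresponding attention scores to a large negative value before the softmax, so they receive (near) zero weight.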

Sublayer 2: Cross-Attention

Purpose: Allow decoder to access encoder's representation of the source.

πŸ“text
Source (German): "Der Hund lΓ€uft"
Target (English): "<bos> The dog ..."

Cross-Attention:
  Q: from decoder ("The", "dog", ...)
  K, V: from encoder ("Der", "Hund", "lΓ€uft")

  "dog" can attend to "Hund" (its translation!)

Why Cross-Attention Needs No Causal Mask:

The encoder output represents the full source sentence. The decoder should be able to see the entire source at every stepβ€”there's nothing to "hide" in the source.
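To make the Q/K/V split concrete, here is a toy single-head cross-attention in NumPy. All names and sizes are illustrative, and the learned Q/K/V projections, multiple heads, and dropout are deliberately omitted; the point is only that queries come from the decoder while keys and values come from the encoder memory:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_memory):
    """Single-head cross-attention sketch: Q from decoder, K/V from encoder."""
    q, k, v = decoder_states, encoder_memory, encoder_memory  # projections omitted
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [batch, tgt, src]
    weights = softmax(scores)                                 # no causal mask needed
    return weights @ v                                        # [batch, tgt, d_model]

rng = np.random.default_rng(0)
memory = rng.normal(size=(1, 3, 8))  # "Der Hund lΓ€uft" encoded
target = rng.normal(size=(1, 4, 8))  # "<bos> The dog ..." decoder states
out = cross_attention(target, memory)
print(out.shape)  # (1, 4, 8)
```

Note the output has the target length, not the source length: each target position produces a weighted summary of the whole source.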

Sublayer 3: Feed-Forward Network

Purpose: Same as encoderβ€”non-linear transformation of each position.

πŸ“text
Identical to encoder FFN:
  d_model β†’ d_ff β†’ d_model
  With GELU activation and dropout
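As a shape-level sketch, the position-wise FFN is just two matrix multiplies with a GELU in between (here the common tanh approximation; dropout and biases initialized to zero for brevity, all variable names illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: d_model -> d_ff -> d_model (dropout omitted)."""
    return gelu(x @ w1 + b1) @ w2 + b2

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, d_model))
w1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
print(feed_forward(x, w1, b1, w2, b2).shape)  # (1, 4, 8)
```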

Complete Decoder Layer Diagram

πŸ“text
                    Target Input
                          β”‚
               [batch, tgt_len, d_model]
                          β”‚
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚                     β”‚
               β–Ό                     β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 1   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Masked Self-   β”‚             β”‚
      β”‚ Attention      β”‚             β”‚
      β”‚ (causal mask)  β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 1
              β”‚
              β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 2   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Cross-Attention│◄──── Encoder Memory
      β”‚ Q=decoder      β”‚      [batch, src_len, d_model]
      β”‚ K,V=encoder    β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 2
              β”‚
              β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Layer Norm 3   β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
      β”‚ Feed-Forward   β”‚             β”‚
      β”‚ Network        β”‚             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
              β”‚                      β”‚
          Dropout                    β”‚
              β”‚                      β”‚
              β–Ό                      β”‚
            (+)β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Residual 3
              β”‚
              β–Ό
        Decoder Output
     [batch, tgt_len, d_model]
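The diagram above can be condensed into a shape-level NumPy sketch of a pre-norm decoder layer. This toy version omits learned weights, multi-head splitting, and dropout, and stands in a simple elementwise nonlinearity for the FFN, so it demonstrates only the wiring (norm β†’ sublayer β†’ residual, three times) and the shape contract:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(q, k, v, causal=False):
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if causal:
        t = q.shape[1]
        # mask out future positions with a large negative value
        scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -1e9)
    return softmax(scores) @ v

def decoder_layer(x, memory):
    """Pre-norm decoder layer sketch (projections, heads, dropout omitted)."""
    h = layer_norm(x)
    x = x + attention(h, h, h, causal=True)  # 1: masked self-attention + residual
    h = layer_norm(x)
    x = x + attention(h, memory, memory)     # 2: cross-attention + residual
    h = layer_norm(x)
    x = x + np.maximum(h, 0)                 # 3: FFN stand-in (ReLU here) + residual
    return x

rng = np.random.default_rng(0)
tgt = rng.normal(size=(1, 4, 8))  # target states, tgt_len=4
mem = rng.normal(size=(1, 3, 8))  # encoder memory, src_len=3
print(decoder_layer(tgt, mem).shape)  # (1, 4, 8): shape preserved
```

Note the output keeps the decoder's `[batch, tgt_len, d_model]` shape regardless of the source length, which is what allows decoder layers to be stacked.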

Information Flow in Decoder

Training Mode (Teacher Forcing)

πŸ“text
Source: "Der Hund lΓ€uft"
Target: "<bos> The dog runs <eos>"

Encoder:
  "Der Hund lΓ€uft" β†’ Memory [1, 3, 512]

Decoder Input (shifted right):
  "<bos> The dog runs" β†’ [1, 4, 512]

Decoder Output:
  Logits for: "The dog runs <eos>"

Loss computed on:
  Predicted: "The dog runs <eos>"
  Target:    "The dog runs <eos>"
Inference Mode (Autoregressive)

πŸ“text
Step 1:
  Input: "<bos>"
  Predict: "The"

Step 2:
  Input: "<bos> The"
  Predict: "dog"

Step 3:
  Input: "<bos> The dog"
  Predict: "runs"

Step 4:
  Input: "<bos> The dog runs"
  Predict: "<eos>"

Stop: EOS predicted!
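The loop above is just a while-loop around the model. Here is a sketch of greedy decoding where a toy lookup table stands in for the real model's argmax over vocabulary logits (the `NEXT` table and `greedy_decode` name are purely illustrative):

```python
# Toy "model": maps the current prefix to the next token. A real decoder
# would run a forward pass and take argmax over the vocabulary logits.
NEXT = {
    ("<bos>",): "The",
    ("<bos>", "The"): "dog",
    ("<bos>", "The", "dog"): "runs",
    ("<bos>", "The", "dog", "runs"): "<eos>",
}

def greedy_decode(max_len=10):
    tokens = ["<bos>"]
    for _ in range(max_len):          # cap length in case EOS never appears
        nxt = NEXT[tuple(tokens)]     # real model: argmax over next-token logits
        tokens.append(nxt)
        if nxt == "<eos>":            # stop as soon as EOS is generated
            break
    return tokens

print(greedy_decode())  # ['<bos>', 'The', 'dog', 'runs', '<eos>']
```

Note the two stopping conditions: an EOS token, or a maximum length as a safety net.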

Architecture Variants

Encoder-Decoder (Original Transformer)

For: Machine Translation, Summarization, Q&A

πŸ“text
Source β†’ [Encoder] β†’ Memory
                    ↓
Target β†’ [Decoder] ← Cross-Attention
        ↓
     Output

Decoder-Only (GPT Style)

For: Language Modeling, Text Generation

πŸ“text
No encoder needed!
Just: Input β†’ [Decoder Layers] β†’ Next Token Prediction

Cross-attention layers removed (no encoder memory to attend to)

Encoder-Only (BERT Style)

For: Classification, Named Entity Recognition

πŸ“text
Input β†’ [Encoder] β†’ Contextualized Representations
                      ↓
                Classification Head

Our Focus

We're building encoder-decoder for translation:

- Encoder: Processes German source

- Decoder: Generates English target

- Cross-attention: Connects them


Why Three Sublayers?

Without Cross-Attention?

πŸ“text
Decoder with only self-attention:
  - Can only see target tokens
  - No access to source information
  - Like generating English without seeing German!

Result: Nonsensical output, unrelated to source

Why Self-Attention + Cross-Attention?

Self-Attention: Model target language patterns

πŸ“text
"The dog" β†’ likely followed by verb
"The" β†’ likely followed by noun/adjective

Cross-Attention: Connect to source meaning

πŸ“text
"Der Hund" in source β†’ "The dog" in target
"lΓ€uft" in source β†’ "runs" should come next

Both are essential for good translation!


Decoder in the Full Model

Complete Translation Pipeline

πŸ“text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    FULL TRANSFORMER                          β”‚
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚      ENCODER         β”‚     β”‚        DECODER            β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚                           β”‚ β”‚
β”‚  β”‚  Source Embedding    β”‚     β”‚  Target Embedding         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚  + Positional Enc    β”‚     β”‚  + Positional Enc         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚  Encoder Layer Γ—N    │────▢│  Decoder Layer Γ—N         β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚  (uses encoder memory)    β”‚ β”‚
β”‚  β”‚  Final LayerNorm     β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚         ↓            β”‚     β”‚  Final LayerNorm          β”‚ β”‚
β”‚  β”‚      Memory          β”‚     β”‚         ↓                 β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚  Output Projection        β”‚ β”‚
β”‚  β”‚                      β”‚     β”‚  (to vocabulary)          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                          ↓                   β”‚
β”‚                                      Logits                  β”‚
β”‚                                 [batch, tgt_len, vocab]      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Summary

Decoder Layer Components

| Sublayer              | Input             | Purpose                                        |
|-----------------------|-------------------|------------------------------------------------|
| Masked Self-Attention | Target            | Model target patterns, prevent future peeking  |
| Cross-Attention       | Decoder + Encoder | Access source information                      |
| Feed-Forward          | Position-wise     | Non-linear transformation                      |

Key Properties

- Autoregressive: Generates one token at a time

- Causal Masking: Can't see future during training

- Cross-Attention: Connects to encoder output

- Same dimension: Input/output shape preserved


Exercises

Conceptual Questions

1. Why does the decoder need both self-attention AND cross-attention?

2. What would happen if we removed the causal mask during training?

3. How does teacher forcing speed up training compared to autoregressive training?

Design Questions

4. For a language model (decoder-only), which sublayers are kept/removed?

5. Could we use bidirectional attention in the decoder? Why or why not?

6. How does the decoder know when to stop generating?


Next Section Preview

In the next section, we'll implement causal maskingβ€”the mechanism that prevents the decoder from cheating by looking at future tokens during training.