Sections 1–3 introduced the three sublayers a decoder needs—masked self-attention, encoder-decoder cross-attention, and the position-wise feed-forward network—but nobody has yet assembled them into a single reusable module with a clean forward() signature. That is what this section does. We build TransformerDecoderLayer: one block that takes a target tensor and an encoder memory tensor, threads them through the three sublayers with residual Add & Norm wrappers, and returns a tensor of the exact same shape as its target input.
Same-shape-in / same-shape-out is the crucial property. It is what lets us stack N copies of this layer in Section 5. If the output shape ever differed from the input shape, stacking would require a separate adapter between every pair of layers—and the Transformer recipe would fall apart.
Why this matters: every large-language-model you have heard of (GPT, Llama, Mistral, T5, PaLM) is, at its core, a stack of identical Transformer layers very close to this one. Understanding this single layer means understanding 95% of modern LLM architecture; the remaining 5% is vocabulary embedding, positional encoding, and a final projection head.
Recap: The Three Sublayers
A decoder layer has three sublayers, applied in order. Each one is wrapped in a residual add and a LayerNorm (jointly called Add & Norm).
Masked self-attention — the decoder looks back at its own earlier target tokens. The causal mask Mij=−∞ for j>i blocks attention to the future (section 2).
Cross-attention (encoder-decoder attention) — queries come from the decoder stream, keys and values come from the encoder output memory (section 3). This is the only place where source-side information enters the decoder.
Position-wise feed-forward network — FFN(x)=max(0,xW1+b1)W2+b2, applied independently to every target position. Provides per-token nonlinear capacity.
Notation reminder (shared across all of chapter 8): B is the batch size, Tsrc and Ttgt are source / target lengths, dmodel is the residual-stream width, H is the number of heads, dk=dmodel/H, and dff is the FFN inner width.
DecoderLayer Forward Pass Math
In the original post-norm formulation (Vaswani et al., 2017), one decoder layer executes three residual-plus-norm updates:
Read each line as: "take the current residual stream, compute a sublayer update, dropout some of it, add it back, re-normalize". The x‑prefixed vectors form a residual stream that flows through the layer with additive contributions from each sublayer. This is the same idea as ResNet (He et al., 2015): sublayers do not replace the representation, they add a correction to it. The identity path is always available, which is what keeps gradients healthy at depth.
Every sublayer function returns a tensor the same shape as its input—[B,Ttgt,dmodel]—so the residual add is always well-defined. LayerNorm re-centers each token vector independently across its dmodel features: LayerNorm(z)i=γ⋅σz2+εzi−μz+β.
Pre-Norm vs Post-Norm
Where you place LayerNorm inside the sublayer wrapper is not a detail—it materially changes training stability at depth. Two families exist:
Post-norm (Vaswani 2017): y=LayerNorm(x+Sublayer(x)). Norm is applied after the residual add.
Pre-norm (modern LLMs): y=x+Sublayer(LayerNorm(x)). Norm is applied inside the sublayer branch; the residual path is never normalized.
Why does anyone prefer pre-norm? Xiong et al. (2020) "On Layer Normalization in the Transformer Architecture" showed that post-norm networks are hard to train past ~12 layers without a learning-rate warmup: the gradient magnitude at each sublayer depends multiplicatively on the sublayers above it, producing expected gradient norms that grow (or shrink) with depth. Pre-norm fixes this by keeping the residual path un-normalized—gradients flow straight through the addition, giving a clean gradient highway from the final layer back to the input embedding. The tradeoff: pre-norm networks are sometimes slightly weaker at small depth and can produce larger-magnitude activations deep in the stack, so many modern systems pair it with RMSNorm (Zhang & Sennrich, 2019) and careful initialization.
The practical consequence: GPT-2, GPT-3, Llama 1/2/3, Mistral, PaLM, Falcon, Qwen, and essentially every production LLM since ~2019 uses pre-norm. The original Vaswani post-norm survives in some translation systems and in educational code. We implement post-norm in this section because it matches the canonical equations you will see in the 2017 paper; switching to pre-norm is a two-line change (move each LayerNorm from LN(x+sub) to x+sub(LN(x))).
Dropout Placement
There are three places dropout enters a Transformer decoder layer. Each one serves a different regularization purpose:
Location
Applied to
Rationale
Typical rate
Attention dropout
Softmax weights (inside MultiHeadAttention)
Prevents over-reliance on any single key-query pair; smooths the attention distribution
0.1 (0.0 in large LLMs)
Sublayer output dropout
Output of each sublayer, before the residual add
Stochastic residuals — the layer output becomes a noisy estimate of the clean residual path, which regularizes the whole stack
0.1
FFN internal dropout
Between the two linears of the FFN
Regularizes the wide d_ff hidden representation, which has more capacity than d_model
0.1
The original paper used p=0.1 in all three positions. Modern large models trained on billions of tokens often set dropout to p=0.0—at that data scale, regularization is less needed and noise interferes with fitting. During inference ( .eval()), dropout is always a no-op.
Plain-Python Walkthrough
Before we wrap everything in nn.Module, let us run one decoder layer in pure NumPy. Every value is visible, every shape is explicit, and there is no autograd machinery hiding the logic. This is the same configuration B=1,Tsrc=4,Ttgt=3,dmodel=8,H=2,dff=16 that §5 and §6 will reuse.
One DecoderLayer forward — pure NumPy, seeded
🐍decoder_layer_numpy.py
Explanation(37)
Code(52)
1import numpy as np
NumPy gives us fast N-dimensional arrays and matrix multiplication (@). We deliberately use plain NumPy (not PyTorch) so every value is visible and there is no hidden autograd machinery — this is the from-scratch forward pass.
EXECUTION STATE
numpy = Numerical computing library. Provides np.ndarray, @ for matmul, broadcasting, and elementwise math used below.
as np = Standard alias so we write np.random, np.sqrt, np.exp instead of numpy.random, etc.
3np.random.seed(0)
Fixes the pseudo-random stream so every weight matrix and input tensor is reproducible. Same seed across sections 4, 5, 6 keeps numerics comparable.
EXECUTION STATE
📚 np.random.seed(s) = Initializes NumPy's global RNG. Every subsequent randn(...) call draws from the same deterministic sequence. Example: after seed(0), np.random.randn(2) always yields [1.7641, 0.4002].
5Shared demo config
Fixed across every ch08 plain-Python trace. Small enough to print every tensor; large enough that each concept (batching, heads, FFN expansion) exists.
EXECUTION STATE
B = 1 = Batch size. One sequence at a time.
T_src = 4 = Source length (number of encoder tokens in memory).
T_tgt = 3 = Target length (number of decoder tokens we are currently processing).
d_ff = 16 = FFN inner width. Typical ratio is 4 x d_model (here 4 x 8 = 32 would be canonical, but 16 keeps the printout readable).
6d_k = d_model // H
Per-head key / value / query width. The d_model residual stream is split evenly across H heads so total compute stays identical to single-head.
EXECUTION STATE
d_k = 4 = 8 // 2 = 4. Each head projects into a 4-dim subspace. d_v = d_k here (standard setup).
9x = np.random.randn(B, T_tgt, d_model) * 0.3
Simulated decoder input after token embedding + positional encoding + shift-right. 0.3 scaling gives roughly unit variance after LayerNorm and matches the ballpark of real embedding magnitudes.
EXECUTION STATE
📚 np.random.randn(*shape) = Draws from standard normal N(0, 1). Shape args here are (1, 3, 8) = 24 scalars.
Upper-triangular boolean matrix. True entries mark (query_i, key_j) pairs where j > i — tokens the self-attention must NOT see. We combine this with scores via np.where(mask, -1e9, scores) so softmax kills the future.
EXECUTION STATE
📚 np.triu(A, k) = Keeps elements on or above the k-th diagonal and zeros everything below. k=1 means strictly above the main diagonal.
Sample a standard-normal matrix and scale by 1/sqrt(inp). This is a simplified LeCun/Xavier uniform-variance init.
EXECUTION STATE
1.0 / np.sqrt(inp) = For inp=8: 1/sqrt(8) ≈ 0.3536. Each entry's std is 0.3536 instead of 1.0.
19Wq1, Wk1, Wv1, Wo1 — self-attention projections
Four learned matrices for the masked self-attention sublayer. Each is (d_model, d_model) = (8, 8). In a real PyTorch module these are nn.Linear(d_model, d_model) layers.
EXECUTION STATE
Wq1 shape = (8, 8) -> projects x into queries, then heads split it as (H, d_k) = (2, 4)
Wo1 shape = (8, 8) -> merges heads back to d_model
Separate set for cross-attention. Q comes from the decoder stream x1; K and V come from memory. Sharing weights with self-attn would couple two very different subspaces — so they are disjoint.
EXECUTION STATE
why separate = Self-attn queries target the target language; cross-attn queries target the source. Different tasks deserve different projections.
One multi-head attention call. Used both for masked self-attention (Q=K=V=x) and for cross-attention (Q=x1, K=V=memory). The mask_bool argument turns it into masked or unmasked behavior.
EXECUTION STATE
⬇ input: Q_in = Source of queries. Shape [B, Tq, d_model]. For self-attn Tq = T_tgt. For cross-attn Tq = T_tgt but Q_in is x1, not memory.
⬇ input: K_in, V_in = Source of keys/values. Shape [B, Tk, d_model]. For self-attn Tk = T_tgt. For cross-attn Tk = T_src = 4.
⬇ input: Wq/Wk/Wv/Wo = Projection matrices, each (d_model, d_model). Kept separate so self-attn and cross-attn can learn different mappings.
⬇ input: mask_bool = Optional boolean mask. True entries get score = -1e9 before softmax, which makes softmax ≈ 0. Shape broadcasts to [B, H, Tq, Tk].
⬆ return = np.ndarray of shape [B, Tq, d_model] — one attended context vector per query position, merged across heads.
24Bsz, Tq, _ = Q_in.shape
Unpack query shape. Underscore discards the trailing d_model because it's redundant with the module constant.
EXECUTION STATE
Bsz = 1
Tq = 3 for self-attn, 3 for cross-attn (queries come from x1)
25_, Tk, _ = K_in.shape
Unpack key sequence length. Tk differs from Tq in cross-attention.
EXECUTION STATE
Tk = 3 for self-attn (K = x), 4 for cross-attn (K = memory)
Project to queries, split feature dim across heads, move the head axis forward. Result shape: [B, H, Tq, d_k].
EXECUTION STATE
📚 @ (matmul) = NumPy matrix multiply. Last two axes are matmul'd; leading batch dims broadcast. (1, 3, 8) @ (8, 8) -> (1, 3, 8).
📚 .reshape(Bsz, Tq, H, d_k) = Same 24 numbers, interpreted as (1, 3, 2, 4). The last axis d_model = 8 is split into H = 2 chunks of d_k = 4.
📚 .transpose(0, 2, 1, 3) = Permute axes from (B, Tq, H, d_k) to (B, H, Tq, d_k). Heads become the batch-like outer dim so each head's attention can be computed independently.
Q final shape = (1, 2, 3, 4) -> [B, H, Tq, d_k]
27K = (K_in @ Wk).reshape(...).transpose(...)
Same pattern as Q, but Tk may differ from Tq. In self-attn Tk = 3; in cross-attn Tk = 4.
EXECUTION STATE
K final shape = self-attn: (1, 2, 3, 4); cross-attn: (1, 2, 4, 4)
28V = (V_in @ Wv).reshape(...).transpose(...)
Values. Shares sequence length with K (both come from the same source). Different projection Wv means V is a different subspace than K.
Raw attention scores, per head. Swap the last two dims of K so matmul compatibility holds: Q(1,2,3,4) @ K^T(1,2,4,Tk) -> (1,2,3,Tk). Divide by sqrt(d_k)=2 to stabilize the softmax.
EXECUTION STATE
📚 K.transpose(0, 1, 3, 2) = Swap axes 3 and 2 of K. Shape (1, 2, Tk, 4) becomes (1, 2, 4, Tk).
np.sqrt(d_k) = sqrt(4) = 2.0. Scaling by 1/sqrt(d_k) keeps dot-product variance ~1 regardless of d_k (see ch02).
Branch only executes for masked self-attention. Cross-attention here uses no mask; in real training a memory (source-padding) mask would also live on this branch.
31scores = np.where(mask_bool, -1e9, scores)
Replace disallowed positions with a huge negative number so exp(score) ≈ 0 after softmax. We use -1e9 instead of -inf to avoid NaN when an entire row is masked.
EXECUTION STATE
📚 np.where(cond, a, b) = Elementwise select. Where cond is True pick a, else pick b. Broadcasts mask_bool (1, 1, 3, 3) against scores (1, 2, 3, 3).
row 0 effect = Entries (0, 1) and (0, 2) are set to -1e9. After softmax they become ~0 so the first query only attends to key 0.
32scores -= scores.max(axis=-1, keepdims=True)
Numerical-stability trick: subtract the per-row max before exp. softmax(x) = softmax(x - c) for any constant c, so the distribution is unchanged but exp() can no longer overflow.
EXECUTION STATE
axis=-1 = Reduce along the last axis (keys). Each (batch, head, query) row gets its own max.
keepdims=True = Keep axis as size 1 so scores (., ., ., Tk) - max (., ., ., 1) broadcasts cleanly.
33w = np.exp(scores); w /= w.sum(axis=-1, keepdims=True)
Standard softmax over the key axis. Result rows sum to 1. For self-attn row 0 is a one-hot on key 0 thanks to the mask.
34ctx = (w @ V).transpose(0, 2, 1, 3).reshape(Bsz, Tq, d_model)
Weighted sum of values, then collapse heads back into d_model. The transpose puts (B, Tq, H, d_k) order so the final reshape merges H and d_k contiguously.
EXECUTION STATE
📚 w @ V = Shape (1, 2, Tq, Tk) @ (1, 2, Tk, d_k) -> (1, 2, Tq, d_k). Each head produces one context vector per query.
📚 .reshape(Bsz, Tq, d_model) = (1, Tq, 2, 4) -> (1, Tq, 8). Heads concatenated along the feature axis.
ctx shape = (1, 3, 8) -> [B, Tq, d_model]
35return ctx @ Wo
Final linear mixing across heads. Wo is a learned (d_model, d_model) that blends the concatenated head outputs. This is what makes multi-head non-trivially different from H independent attentions.
EXECUTION STATE
⬆ return = np.ndarray (B, Tq, d_model). For our run, the first row of self-attn output:
[-0.2127 -0.0391 0.2221 -0.0998 0.1540 -0.0465 -0.0045 -0.1185]
37def layernorm(z, eps=1e-5)
Per-token standardization across the feature axis. gamma and beta are left as 1 and 0 here so we can focus on the residual structure — a real implementation adds those learnable affine parameters.
EXECUTION STATE
⬇ input: z = Tensor of shape (B, T, d_model). Each (b, t) vector across d_model will be re-centered and re-scaled independently.
⬇ input: eps = Small positive constant (1e-5) added inside the sqrt to avoid division by zero when var = 0.
⬆ return = z with mean 0 and variance 1 along the last axis.
38mu = z.mean(axis=-1, keepdims=True)
Per-token mean across features. Shape collapses the feature dim to 1 so we can subtract with broadcasting.
EXECUTION STATE
axis=-1 = Average the 8 feature values for each (B, T) position independently.
mu shape = (B, T, 1)
39var = z.var(axis=-1, keepdims=True)
Per-token variance across features. NumPy uses the biased estimator (divide by N), matching PyTorch's default LayerNorm.
40return (z - mu) / np.sqrt(var + eps)
Standardize each token vector. Without eps a constant-input row would produce 0/0 = NaN.
43mask1 = tgt_mask.reshape(1, 1, T_tgt, T_tgt)
Add batch and head axes so the (3, 3) bool mask broadcasts against (B, H, 3, 3) scores.
Residual add + LayerNorm (POST-norm). x1 is the input to the next sublayer. Notice how LayerNorm re-centers each 8-dim vector so values are roughly in [-2, 2].
Cross-attention. Queries from x1 (decoder stream); keys and values from encoder memory. No causal mask because decoder positions are allowed to look at all source positions.
Causal mask — blocks attention to future target tokens
memory_mask
[B, T_src]
Optional source-padding mask
out
[B, T_tgt, d_model]
Layer output — same shape as tgt so layers can stack
PyTorch Implementation
The PyTorch version is a straight transcription of the NumPy trace above, with three upgrades: (1) parameters live in nn.Linear layers that autograd tracks, (2) LayerNorm and Dropout are proper nn.Module submodules so .eval() disables dropout automatically, and (3) we inherit from nn.Module so the layer can be composed, saved, moved to GPU, and wrapped in torch.compile like any other PyTorch module.
We include MultiHeadAttention and PositionwiseFeedForward inline for completeness—in a real project these would be imported from chapter03 and chapter06 respectively.
TransformerDecoderLayer — PyTorch class
🐍decoder_layer.py
Explanation(61)
Code(98)
1import math
Python's standard math module. We use math.sqrt(d_k) for the attention scaling factor — a scalar constant, so math is slightly cheaper than torch.sqrt here.
2from typing import Optional
Lets us annotate parameters that may be None, like masks. Clearer than writing tgt_mask: torch.Tensor = None — Optional[T] means 'a T or None'.
EXECUTION STATE
Optional[torch.Tensor] = Equivalent to Union[torch.Tensor, None]. Tools like mypy/pyright check that callers can pass either a tensor or None.
4import torch
PyTorch. Gives us torch.Tensor, matmul, softmax, LayerNorm, autograd, GPU support.
5import torch.nn as nn
nn module — contains Module, Linear, LayerNorm, Dropout, MultiheadAttention. We use nn.Linear and nn.LayerNorm directly so our layer uses standard learned parameters.
6import torch.nn.functional as F
Stateless ops: F.softmax, F.relu, F.dropout. Use F.* when there's no learnable state; use nn.* when we need stored parameters (like nn.Dropout holding its rate).
EXECUTION STATE
F vs nn = nn.Dropout(p) is a Module — trainable state, toggles with .train()/.eval(). F.dropout() is a plain function — you must pass training=True manually. We use nn.Dropout so .eval() disables it automatically.
9class MultiHeadAttention(nn.Module):
Multi-head attention as a reusable Module. Both self-attn and cross-attn use the same class; the caller decides whether Q, K, V come from the same or different tensors.
⬇ input: num_heads = Number of parallel attention heads. Must evenly divide d_model. Typical 8 or 12.
⬇ input: dropout = Probability applied to softmax weights (attention dropout). Original paper used 0.1; modern LLMs often use 0.0 in large-scale pretraining.
13super().__init__()
Required for nn.Module subclasses. Initializes internal state (parameter dicts, module dicts). Forgetting this is the #1 PyTorch bug.
14assert d_model % num_heads == 0, ...
Hard-fail if d_model can't be evenly split across heads. An integer check beats silently computing wrong shapes at runtime.
15self.d_model = d_model
Store for use in forward() (needed for the final reshape).
16self.num_heads = num_heads
Stored for reshape in forward().
17self.d_k = d_model // num_heads
Per-head width. For d_model=512, num_heads=8 this gives d_k=64.
EXECUTION STATE
// operator = Floor division in Python. Returns an int when both operands are ints. d_model // num_heads guarantees an integer d_k.
18self.W_q = nn.Linear(d_model, d_model)
Query projection. Goes from (B, T, d_model) to (B, T, d_model). Learnable weight + bias inside.
EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias=True) = Affine map y = x @ W^T + b. Stores W of shape (out_features, in_features) and b of shape (out_features,). Forward uses the transpose automatically.
shape(W) = (d_model, d_model)
19self.W_k = nn.Linear(d_model, d_model)
Key projection. Separate parameters from W_q so K encodes 'what I contain', distinct from Q's 'what I'm looking for'.
20self.W_v = nn.Linear(d_model, d_model)
Value projection. Holds the content returned when a query matches.
21self.W_o = nn.Linear(d_model, d_model)
Output projection. Mixes the concatenated head outputs back into d_model. Without W_o, multi-head attention collapses to H independent attentions glued together with no cross-head interaction.
22self.attn_dropout = nn.Dropout(dropout)
Dropout on the attention weights (after softmax, before multiplying V). Randomly zeros entries of the attention matrix during training.
24def forward(self, query, key, value, mask=None)
Public call. query is the source of Q; key and value are the source of K and V. For self-attention the caller passes the same tensor three times; for cross-attention query=decoder, key=value=encoder memory.
EXECUTION STATE
⬇ input: query = Shape (B, T_q, d_model). For the decoder's self-attn: the decoder's own previous state. For cross-attn: x1 (post first Add&Norm).
⬇ input: key = Shape (B, T_k, d_model). Same sequence as value. For cross-attn this is the encoder output memory.
⬇ input: value = Shape (B, T_k, d_model). Content to return. Typically key and value come from the same source.
Replace blocked positions with -inf so softmax sends them to zero. The convention used here: mask == 1 allows, mask == 0 blocks.
EXECUTION STATE
📚 .masked_fill(mask_bool, value) = Wherever mask_bool is True, replace the entry in scores with value. Broadcasts across heads.
-inf vs -1e9 = float('-inf') is exact — softmax maps it to 0 with no numerical residue. Works fine in FP32; in FP16 it can cause NaN if an entire row is masked. Many production models use -1e4 in FP16.
38attn = F.softmax(scores, dim=-1)
Normalize each query's scores over keys into a probability distribution.
dim=-1 = Normalize across the KEY axis. Each (batch, head, query) row sums to 1.
39attn = self.attn_dropout(attn)
Zero out a fraction p of attention weights during training. During eval() this is a no-op. Renormalization is done by PyTorch (scale remaining weights by 1/(1-p)) so the expectation is preserved.
40ctx = torch.matmul(attn, V)
Weighted sum of values. Shape: (B, H, Tq, Tk) x (B, H, Tk, d_k) -> (B, H, Tq, d_k). Each head outputs d_k features per query.
📚 .contiguous() = After transpose, the memory layout is non-contiguous. .view() requires contiguous memory, so .contiguous() materializes a fresh tensor with the standard row-major layout.
Inside-out execution order: linear1(x) -> ReLU -> Dropout -> linear2. Read it like function composition.
EXECUTION STATE
📚 F.relu(z) = max(0, z) elementwise. Creates the nonlinearity the transformer relies on.
58class TransformerDecoderLayer(nn.Module):
The module we are building in this section. Wraps the three sublayers with residual Add + LayerNorm + sublayer-output dropout, following the Vaswani 2017 (post-norm) convention.
Top-level constructor. Signature mirrors nn.TransformerDecoderLayer from torch.nn.
EXECUTION STATE
⬇ input: d_model = Residual width — must match the encoder so cross-attn shapes line up.
⬇ input: num_heads = H — number of heads. Must divide d_model.
⬇ input: d_ff = FFN inner width.
⬇ input: dropout = Applied in three places: inside each attention (on softmax weights), inside the FFN, and on each sublayer output before the residual add.
Cross-attention instance. Same class, separate weights — lets the layer learn different Q/K/V subspaces for 'attend to past target tokens' vs 'attend to source tokens'.
LayerNorm after the first residual add. Per-token normalization across features.
EXECUTION STATE
📚 nn.LayerNorm(d_model) = Normalizes over the last dim of size d_model. Maintains two learnable buffers: gamma (scale, init 1) and beta (shift, init 0), each of shape (d_model,).
75self.norm2 = nn.LayerNorm(d_model)
LayerNorm after cross-attn's residual.
76self.norm3 = nn.LayerNorm(d_model)
LayerNorm after the FFN's residual.
77self.drop1 = nn.Dropout(dropout)
Dropout on the self-attn sublayer's output, applied before the residual add.
One layer's forward pass. Matches the signature of torch.nn.TransformerDecoderLayer so you can drop this class in as a replacement.
EXECUTION STATE
⬇ input: tgt = (B, T_tgt, d_model) — decoder input for this layer. For the first layer this is embed(y) + PE; for deeper layers it is the previous layer's output.
⬇ input: memory = (B, T_src, d_model) — encoder output. Same tensor is passed to every decoder layer.
⬇ input: tgt_mask = Optional mask over target-target attention. Usually a causal lower-triangular mask of shape (T_tgt, T_tgt).
⬇ input: memory_mask = Optional mask over target-source attention. Typically a source-padding mask of shape (B, T_src) broadcast as (B, 1, 1, T_src).
⬆ return = (B, T_tgt, d_model) — same shape as tgt.
Comment marking the first of three sublayers. Order matters: self-attn must come before cross-attn because cross-attn uses the refined decoder state (x1), not the raw tgt.
Cross-attention. Note the asymmetry: Q source (x1) has length T_tgt, K/V source (memory) has length T_src.
EXECUTION STATE
ca shape = (B, T_tgt, d_model)
93x2 = self.norm2(x1 + self.drop2(ca))
Second residual + LayerNorm. x2 carries both target history and source context.
94# Sublayer 3: FFN + Add & Norm
Third sublayer marker. The FFN adds per-token nonlinear capacity — it does not mix across positions.
95ff = self.ffn(x2)
Apply position-wise FFN. Same shape in, same shape out. Inside: Linear -> ReLU -> Dropout -> Linear.
96return self.norm3(x2 + self.drop3(ff))
Final Add & Norm. The return value is the decoder-layer output — you can either stack another copy of this layer on top of it (§5) or send it to the output projection (§6).
EXECUTION STATE
⬆ return = (B, T_tgt, d_model) — identical shape to tgt, ready for the next layer.
Instantiate, set to eval mode, build a causal mask, and call forward. Note the mask convention used by our module: 1 = keep, 0 = block. The standard torch.triu helper returns a "block future" mask, so we invert it with ~ before passing it in.
Running one DecoderLayer end-to-end
🐍run_decoder_layer.py
Explanation(12)
Code(18)
1import torch
Loads PyTorch.
3torch.manual_seed(0)
Deterministic parameter init and deterministic torch.randn. Same seed as §5 and §6 demos.
5Shared demo config
Same constants as the plain-Python trace so values are comparable.
Instantiate one layer. dropout=0.0 for the demo so the output is reproducible — in real training set it to 0.1.
8layer.eval()
Switches all contained nn.Dropout modules to pass-through. Also affects BatchNorm / LayerNorm behavior (not LayerNorm though — it's identical in train and eval). Always call eval() before a deterministic forward.
EXECUTION STATE
📚 Module.eval() = Sets self.training = False recursively on all submodules. Dropout becomes identity; BatchNorm uses running stats; LayerNorm is unchanged.
10tgt = torch.randn(B, T_tgt, d_model)
Sample a decoder input. Shape (1, 3, 8).
EXECUTION STATE
📚 torch.randn(*shape) = Standard normal samples. After manual_seed(0), tgt[0][0] ≈ [-1.1258, -1.1524, -0.2506, -0.4339, 0.8487, 0.6920, -0.3160, -2.1152].
Convention in our module: mask == 1 (True) ALLOWS attention, mask == 0 (False) BLOCKS. So we invert causal and add batch + head axes for broadcasting.
EXECUTION STATE
📚 ~ (logical not) = Bitwise NOT on a bool tensor. Flips True<->False elementwise.
📚 .unsqueeze(dim) = Insert a size-1 axis at dim. Two unsqueezes take shape (3,3) -> (1,3,3) -> (1,1,3,3) so the mask broadcasts against scores of shape (B, H, T, T).
Sanity check: if you replace our TransformerDecoderLayer with PyTorch's built-in nn.TransformerDecoderLayer(d_model=8, nhead=2, dim_feedforward=16, dropout=0.0, batch_first=True), the forward signature is nearly identical. The differences: PyTorch uses nhead and dim_feedforward spellings, and its attn_mask convention is "True = block" (the opposite of ours). Always check your library's mask polarity before wiring things together.
Interactive Visualization
Click any of the three sublayer boxes to highlight it and see its governing equation. Hover an arrow to read the tensor shape it carries. Toggle between Post-Norm (Vaswani 2017) and Pre-Norm (Llama, GPT-3, Mistral) to watch the LayerNorm badge move from after the residual add to inside the sublayer branch.
DecoderLayer Flow
Click a sublayer or tab through it. Hover arrows to see tensor shapes.
Active sublayer
Masked Self-Attention
x_1 = x + Dropout(MaskedSelfAttn(LN(x), tgt_mask))
Each target token looks back at earlier target tokens only. The causal tgt_mask forbids attending to the future.
Norm placement: LayerNorm is applied after the residual Add.
In Modern Systems
The layer we just built is the Vaswani-2017 blueprint. Every production Transformer since then is a small set of edits on top of it:
Llama 1/2/3, Mistral, Qwen, Falcon use pre-norm with RMSNorm (Zhang & Sennrich, 2019) instead of LayerNorm. RMSNorm drops the mean-subtraction step and the β shift, keeping only RMSNorm(z)=γ⋅z/d1∑izi2+ε. It is cheaper and empirically matches or beats LayerNorm quality.
T5, BART, mBART, NLLB keep cross-attention (they are encoder-decoder models). T5 simplifies further: its LayerNorm has no bias term, matching the mean-free pattern Llama uses.
Decoder-only models (GPT-2, GPT-3, GPT-4, Llama family, Mistral, Qwen) drop cross-attention entirely. Their DecoderLayer has only two sublayers—masked self-attention and FFN—because there is no encoder to attend to. Section 6 will show how "encoder-decoder" and "decoder-only" architectures differ at the top level.
FFN variants: Llama uses SwiGLU (Shazeer, 2020), which replaces ReLU(xW1)W2 with (Swish(xWg)⊙(xWu))Wd—two parallel projections, a Swish gate, an elementwise product, and a down projection. It uses ~1.5× the parameters of vanilla ReLU-FFN but consistently improves perplexity.
The point is not to memorize every variant; it is to see that the three-sublayer residual structure is the stable core of the Transformer, and the choices we made above (post-norm, ReLU-FFN, LayerNorm) are the textbook starting point. Swap any of those three for its modern cousin and you have a current-generation LLM block.
Summary
A TransformerDecoderLayer composes three sublayers—masked self-attention, cross-attention, feed-forward—each wrapped in Dropout + residual Add + LayerNorm.
Input and output shapes are both [B,Ttgt,dmodel], which is what lets us stack N copies in Section 5.
Post-norm (original paper) is easier to reason about; pre-norm (modern LLMs) is more stable at depth. Switching is a one-line code change.
Dropout appears three times: on softmax weights, on each sublayer output, and between the two FFN linears. Typical rate p=0.1; often p=0 in large pretraining runs.
The plain-Python forward pass and the PyTorch nn.Module implement exactly the same math—the PyTorch version just adds parameter tracking, GPU support, and autograd.
Exercises
Pre-norm conversion (easy). Rewrite TransformerDecoderLayer.forward in pre-norm form. For each sublayer, apply LayerNorm to the input before the sublayer function and leave the residual path un-normalized: e.g. replace x1=LN(x+sa) with x1=x+sa(LN(x)). Run the same usage snippet and verify the output shape is still (1,3,8).
Mask polarity bug (medium). PyTorch's built-in nn.MultiheadAttention uses the opposite mask convention from our module (True = block). Write a small adapter that takes our "True = keep" mask and converts it so the forward produces identical numerical results when you swap in the built-in module. Test on the same tgt / memory / causal mask as the usage snippet.
Parameter count (medium). Compute the total number of learnable parameters in one TransformerDecoderLayer for dmodel=512,H=8,dff=2048. Break the count down by sublayer (self-attn, cross-attn, FFN) and by projection (Q, K, V, O, W_1, W_2). Verify against sum(p.numel() for p in layer.parameters()).
Cross-attention is the only source path (hard). Replace the cross-attention sublayer with an identity (skip it entirely) and run the layer on a translation dataset of your choice for 5 epochs. Compare BLEU against the full decoder. You should see catastrophic degradation—explain why in one paragraph by reference to the residual-stream picture from the math section.
Next Section Preview
One layer is not a decoder. In Section 5 — Complete Transformer Decoder we stack N identical copies of the module we just built, add target-side token embeddings and positional encoding on top, and close the stack with a final output projection that turns the residual stream into vocabulary logits. We will also cover weight tying (reusing the embedding matrix for the output projection) and walk through how representational depth produces the feature hierarchy observed in BERTology-style probes.