Chapter 17

Implementing Sequence Models in PyTorch

LSTM and GRU

From Math to Machine

You now understand the math of the LSTM and the GRU: four gate blocks, one cell state, and one hidden state for the LSTM; a single hidden state whose update gate splits each channel between old and new content for the GRU. That math has been the same since 1997 for the LSTM and 2014 for the GRU. What changes when we leave pencil-and-paper and land on a GPU is not the equations. It is the shape of the data, the layout in memory, the fusion of many small ops into one kernel, and the engineering required to feed a real hardware pipeline without wasting cycles.

This section is where intuition meets implementation. We will build an LSTM cell from scratch in NumPy — so every multiplication is visible — and then rebuild it in PyTorch with the exact same semantics and orders-of-magnitude more speed. Along the way you will meet every practical concern of production recurrent models: tensor shapes, packed variable-length batches, bidirectional stacking, and the full end-to-end sentiment classifier pattern.

The central promise: the math does not change between $h_t = o_t \odot \tanh(c_t)$ on paper and the nn.LSTM call in code. Only the plumbing changes. Once you see the plumbing clearly, the gap between a paper and a production model is small.

This section stays focused on the implementation of recurrent models in PyTorch. By the end you will have a working sentiment classifier and a production-grade mental model of shapes, gates, packed sequences, and bidirectional stacking. The next section, §17.4, zooms out — it traces why the field eventually moved on from recurrence and how attention, the KV-cache, and Flash Attention are the direct engineering descendants of the ideas you just learned.


nn.RNN, nn.LSTM, nn.GRU: One API, Three Cells

PyTorch exposes three recurrent modules. They share almost exactly the same signature — the differences reduce to the number of gates inside the cell and the shape of the returned state tuple.

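| Module | Gate blocks | Return value | Parameters (one layer, one direction) |
| --- | --- | --- | --- |
| nn.RNN | 1 (tanh) | (output, h_n) | H·(H + I) + 2H |
| nn.GRU | 3 (reset, update, candidate) | (output, h_n) | 3H·(H + I) + 6H |
| nn.LSTM | 4 (input, forget, candidate, output) | (output, (h_n, c_n)) | 4H·(H + I) + 8H |

Here H is hidden_size and I is input_size. The bias terms count both bias_ih and bias_hh, which PyTorch stores separately. The weights are shared across timesteps, so the count does not grow with sequence length.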

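The return-value difference is the most common source of confusion. nn.LSTM is the only one of the three that returns a tuple of states, because it is the only cell with a separate cell state. Change nn.LSTM to nn.GRU in otherwise identical code and the destructuring line output, (h_n, c_n) = rnn(x) breaks: the second return value is now a single tensor, so Python tries to unpack it along its first axis. With one unidirectional layer that axis has size 1 and you get a ValueError (not enough values to unpack). With num_layers · num_dirs = 2 the unpacking succeeds silently and binds two slices of h_n to the wrong names.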
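The return-value difference can be checked in a few lines. This is a minimal sketch on a random toy batch; the sizes and variable names are illustrative assumptions, not part of the original text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 7, 3)  # illustrative batch: N=4 sequences, L=7 steps, 3 features

lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
gru = nn.GRU(input_size=3, hidden_size=5, batch_first=True)

# LSTM: the second return value is a tuple (h_n, c_n).
out_lstm, (h_n, c_n) = lstm(x)

# GRU: the second return value is a single tensor h_n.
out_gru, h_gru = gru(x)

# Reusing the LSTM-style destructuring on a 1-layer GRU fails loudly:
# h_n has a leading axis of size 1, so it cannot be unpacked into two names.
try:
    _, (a, b) = gru(x)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With two layers the leading axis has size 2, so the same line "works"
# and silently binds the layer-0 and layer-1 hidden states to the wrong names.
gru2 = nn.GRU(input_size=3, hidden_size=5, num_layers=2, batch_first=True)
_, (fake_h, fake_c) = gru2(x)

print("LSTM states:", h_n.shape, c_n.shape)
print("GRU state  :", h_gru.shape)
print("1-layer GRU unpack error:", error_message)
print("2-layer GRU bogus 'h' and 'c':", fake_h.shape, fake_c.shape)
```

The second case is the dangerous one: nothing fails, but neither name holds what the rest of the code expects.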

Fused gate matrix. Internally PyTorch stacks all gate weights into two big matrices, one applied to the input and one to the hidden state. For an LSTM with hidden size $H$ and input size $I$, weight_ih_l0 has shape $(4H, I)$ and weight_hh_l0 has shape $(4H, H)$. All four gate pre-activations come out of one matmul per matrix; the summed result is split into four chunks of size $H$ (ordered input, forget, candidate, output) before the non-linearities. Two large GEMMs are much faster on a GPU than eight small ones.
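To make the stacked layout concrete, the sketch below inspects the parameter shapes, counts parameters for all three modules, and recomputes one LSTM step by hand from the stacked matrices. The sizes H = 16 and I = 8 and the helper n_params are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H, I = 16, 8  # illustrative hidden and input sizes
lstm = nn.LSTM(input_size=I, hidden_size=H, batch_first=True)

def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

counts = {
    "rnn": n_params(nn.RNN(I, H)),
    "gru": n_params(nn.GRU(I, H)),
    "lstm": n_params(lstm),
}
print(lstm.weight_ih_l0.shape, lstm.weight_hh_l0.shape, counts)

# One step by hand: one matmul per stacked matrix, then split into four gates.
x_t = torch.randn(1, I)
h_prev = torch.zeros(1, H)
c_prev = torch.zeros(1, H)
with torch.no_grad():
    pre = (x_t @ lstm.weight_ih_l0.T + lstm.bias_ih_l0
           + h_prev @ lstm.weight_hh_l0.T + lstm.bias_hh_l0)   # shape (1, 4H)
    i, f, g, o = pre.chunk(4, dim=1)   # PyTorch order: input, forget, candidate, output
    c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
    h_t = torch.sigmoid(o) * torch.tanh(c_t)

    # Reference: the library run on a length-1 sequence, input shape (N=1, L=1, I).
    _, (h_n, c_n) = lstm(x_t.unsqueeze(0))

print("manual step matches nn.LSTM:", torch.allclose(h_t, h_n[0], atol=1e-6))
```

The counts include both bias vectors (bias_ih and bias_hh), which is why each module has H more parameters per gate block than a single-bias formula would suggest.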

The Shape Contract You Must Honor

Every PyTorch recurrent layer has a strict tensor shape contract. Get the shape right and everything works; get it wrong and you will see cryptic errors or, worse, a model that trains on the wrong axis and never converges. Memorize these four tensors:

| Tensor | Shape (batch_first=True) | Shape (batch_first=False) | Meaning |
| --- | --- | --- | --- |
| input | (N, L, H_in) | (L, N, H_in) | N sequences, each of length L, with H_in features per token |
| h_0 / c_0 (initial state) | (num_layers · num_dirs, N, H_hidden) | same | Initial hidden (and cell) states. Defaults to zeros if omitted. |
| output | (N, L, num_dirs · H_hidden) | (L, N, num_dirs · H_hidden) | h_t for every real timestep of every sequence |
| h_n / c_n (final state) | (num_layers · num_dirs, N, H_hidden) | same | Final hidden (and cell) state per direction per layer |
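Keep batch_first=True as a habit. Most of what you wire an RNN to (nn.Embedding, nn.Linear, the convolutions, and the batches a DataLoader produces) puts the batch axis first, so a batch-first RNN rarely needs a transpose. Be aware that the flag is opt-in wherever it exists: nn.LSTM, nn.GRU, nn.MultiheadAttention, and the nn.Transformer layers all default to batch_first=False. Any speed advantage of the time-major layout is usually negligible on current GPUs, so consistency is the better reason to choose.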
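The shape contract is easy to verify directly. The sketch below uses a two-layer bidirectional LSTM with illustrative sizes (N, L, H_in, and H are assumptions) and also shows where the final states sit inside output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, H_in, H = 4, 7, 3, 5          # illustrative sizes
x = torch.randn(N, L, H_in)

# Two stacked layers, both directions: num_layers * num_dirs = 4.
lstm = nn.LSTM(H_in, H, num_layers=2, bidirectional=True, batch_first=True)
output, (h_n, c_n) = lstm(x)

print("output:", output.shape)      # (N, L, num_dirs * H)
print("h_n   :", h_n.shape)         # (num_layers * num_dirs, N, H), never batch-first
print("c_n   :", c_n.shape)

# For the last layer, the forward direction finishes at t = L-1 and the
# backward direction finishes at t = 0. h_n[-2] is last-layer forward,
# h_n[-1] is last-layer backward.
fwd_ok = torch.allclose(output[:, -1, :H], h_n[-2])
bwd_ok = torch.allclose(output[:, 0, H:], h_n[-1])
print("forward final state found at output[:, -1, :H]:", fwd_ok)
print("backward final state found at output[:, 0, H:]:", bwd_ok)
```

Note that h_n and c_n keep the (num_layers · num_dirs, N, H_hidden) layout regardless of batch_first; the flag only affects input and output.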

Interactive: Tensor Shape Explorer

A 3-D tensor of shape $(N, L, H)$ is hard to picture in the abstract, so sliders help. Drag the knobs on the right to change N, L, and H, and toggle batch_first to watch the layout flip. Each small cube is one scalar. Hovering an axis label highlights a slice.

The visualizer makes an often-overlooked point obvious: changing batch_first does not re-order any data. It only changes which axis PyTorch reads as batch and which as time. A wrong flag turns "one sequence of length 3" into "three sequences of length 1" with no error message.
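The silent failure mode can be reproduced without the visualizer. In this sketch two LSTMs share identical weights and differ only in the batch_first flag; the hidden size of 4 and the names right and wrong are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# One sequence of three 2-feature tokens, laid out batch-first: shape (1, 3, 2).
x = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])

right = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
wrong = nn.LSTM(input_size=2, hidden_size=4)        # batch_first defaults to False
wrong.load_state_dict(right.state_dict())           # identical weights

with torch.no_grad():
    out_r, (h_r, _) = right(x)   # read as N=1, L=3
    out_w, (h_w, _) = wrong(x)   # read as L=1, N=3

# The output shapes are identical, so nothing downstream complains.
print("output shapes:", out_r.shape, out_w.shape)
# The final-state shapes give the mistake away: batch of 1 versus batch of 3.
print("h_n shapes   :", h_r.shape, h_w.shape)

# The first token agrees (both start from a zero state); later tokens do not,
# because the wrong layout never carries state from one token to the next.
first_same = torch.allclose(out_r[0, 0], out_w[0, 0], atol=1e-6)
last_same = torch.allclose(out_r[0, 2], out_w[0, 2], atol=1e-6)
print("token 0 agrees:", first_same, "| token 2 agrees:", last_same)
```

Checking h_n.shape right after the first forward pass is a cheap guard against this mistake.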

An LSTM from Scratch in NumPy

Before we touch PyTorch, let's run one LSTM cell, by hand, over three timesteps. The code below is a direct translation of the LSTM equations. Every number in the line-by-line annotations is the actual computed value, not a placeholder.

The formulas, one last time, inline for reference: the three sigmoid gates $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$, $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, the tanh candidate $g_t = \tanh(W_g [h_{t-1}, x_t] + b_g)$, the cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, and finally $h_t = o_t \odot \tanh(c_t)$.

LSTM cell, built from pure NumPy
🐍lstm_scratch.py
Line 1: import numpy as np

NumPy is the foundation for every scientific Python project. We need it for fast N-dimensional arrays (ndarray), matrix multiply (@), element-wise ops (*, +), and element-wise math (np.tanh, np.exp).

EXECUTION STATE
numpy = Library for numerical computing — ndarray, linear algebra, broadcasting, vectorized math.
as np = Universal alias so we write np.tanh(x) instead of numpy.tanh(x).
Line 3: def sigmoid(x)

The gating nonlinearity. σ(x) = 1 / (1 + e^{-x}) squashes any real number into (0, 1). An LSTM has THREE sigmoid gates — input, forget, output — each behaving like a per-channel soft switch.

EXECUTION STATE
⬇ input: x = A NumPy array of pre-activations (any real numbers).
⬆ returns = Array of same shape, every element squashed into (0, 1).
→ why sigmoid for gates? = Gating is multiplication by a number in [0, 1]. σ fits exactly: 0 = block, 1 = pass. tanh would flip signs; ReLU would not cap the upper end.
Line 4: return 1.0 / (1.0 + np.exp(-x))

Element-wise sigmoid. NumPy broadcasts the scalar 1.0 over the array returned by np.exp(-x).

EXECUTION STATE
📚 np.exp = Element-wise exponential e^x. np.exp(0)=1, np.exp(1)=2.718, np.exp(-1)=0.368.
→ example: x = [0.07, -0.09] = -x = [-0.07, 0.09] → exp = [0.9324, 1.0942] → 1/(1+exp) = [0.5175, 0.4775]
Line 7: def lstm_cell(x_t, h_prev, c_prev, p)

One LSTM cell processing a SINGLE timestep. Unlike the GRU (section 17.2) which carries one state, the LSTM carries TWO: the hidden state h (external output) and the cell state c (internal memory highway). We return both because the next step needs both.

EXECUTION STATE
⬇ input: x_t (shape (2,)) = The current token's feature vector. At step 1: x_t = [0.5, -0.2] — our toy embedding for 'I'.
→ x_t purpose = The external signal driving this timestep. In a language model this is a word embedding; in time series it would be a sensor reading.
⬇ input: h_prev (shape (2,)) = Hidden state from t-1. At step 1: [0, 0] (a blank slate).
⬇ input: c_prev (shape (2,)) = Cell state from t-1. At step 1: [0, 0]. This is the long-term memory carrier.
⬇ input: p (dict) = Eight arrays: four weight matrices W_i, W_f, W_g, W_o (each 2×4) and four bias vectors b_i, b_f, b_g, b_o (each shape (2,)).
→ H and I = H = 2 (hidden size), I = 2 (input size). Each W is (H, H+I) because gates look at [h_prev, x_t] concatenated.
⬆ returns: (h_t, c_t) = Tuple of two (2,) vectors. h_t goes to the next layer; c_t is private state.
Line 9: z = np.concatenate([h_prev, x_t])

Stack the previous hidden state and the current input end-to-end. All four gates take this same vector as input, so computing it once avoids repeating the concatenation four times.

EXECUTION STATE
📚 np.concatenate = Joins a list of arrays along an existing axis. For 1-D arrays it appends them. Example: concatenate([[1,2],[3,4,5]]) → [1,2,3,4,5].
⬇ arg: [h_prev, x_t] = Python list of two 1-D arrays of shape (2,) each.
→ step 1 values = h_prev = [0.00, 0.00], x_t = [0.50, -0.20]
⬆ z (shape (H+I,) = (4,)) = [0.00, 0.00, 0.50, -0.20]
Line 12: i_t = sigmoid(W_i @ z + b_i) (input gate)

The input gate decides, per channel, how much of the candidate g_t we should WRITE into the cell state. i ≈ 1 means let this channel update; i ≈ 0 means block the write.

EXECUTION STATE
📚 @ (matrix multiply) = NumPy matmul. W_i is (H, H+I) = (2, 4); z is (4,); W_i @ z is (2,).
W_i (2×4) =
    h0     h1     x0     x1
c0  0.30   0.20  -0.10   0.40
c1 -0.20   0.10   0.50   0.20
b_i (shape (2,)) = [-0.10, 0.00]
→ step 1 compute = z = [0, 0, 0.5, -0.2] row c0: 0.30·0 + 0.20·0 + (-0.10)·0.5 + 0.40·(-0.2) + (-0.10) = -0.23 row c1: -0.20·0 + 0.10·0 + 0.50·0.5 + 0.20·(-0.2) + 0.00 = 0.21 σ([-0.23, 0.21]) = [0.4428, 0.5523]
⬆ i_t (shape (2,)) = [0.4428, 0.5523] — let in ~44% of the new info in channel 0, ~55% in channel 1.
Line 13: f_t = sigmoid(W_f @ z + b_f) (forget gate)

The forget gate decides, per channel, how much of the previous cell state c_prev to KEEP. A value near 1 preserves memory; near 0 erases it. NOTE: we use the same hand-picked weights as §17.1 so you can compare the two sections line for line. In a real LSTM you'd initialize b_f near +1 (Jozefowicz et al., 2015) — see §17.1's forget-bias Important box for why.

EXECUTION STATE
W_f (2×4) =
    h0     h1     x0     x1
c0  0.20  -0.30   0.40   0.10
c1  0.10   0.50  -0.20   0.30
b_f (shape (2,)) = [0.10, 0.20]
→ step 1 compute = row c0: 0.20·0 + (-0.30)·0 + 0.40·0.5 + 0.10·(-0.2) + 0.10 = 0.28 row c1: 0.10·0 + 0.50·0 + (-0.20)·0.5 + 0.30·(-0.2) + 0.20 = 0.04 σ([0.28, 0.04]) = [0.5695, 0.5100]
⬆ f_t = [0.5695, 0.5100] — channel 0 keeps ~57% of old memory, channel 1 keeps ~51%.
Line 14: g_t = tanh(W_g @ z + b_g) (candidate content)

The candidate is what we COULD add to the cell state. Unlike the three gates it uses tanh, not sigmoid — we want signed content in (-1, 1), not a [0, 1] opening fraction.

EXECUTION STATE
📚 np.tanh = Element-wise hyperbolic tangent. Outputs in (-1, 1). tanh(0)=0, tanh(1)=0.7616, tanh(-1)=-0.7616. Smooth, centered at 0.
W_g (2×4) =
    h0     h1     x0     x1
c0  0.10  -0.40   0.30   0.20
c1  0.40   0.20  -0.10   0.50
b_g (shape (2,)) = [0.00, 0.10]
→ step 1 compute = row c0: 0.10·0 + (-0.40)·0 + 0.30·0.5 + 0.20·(-0.2) + 0.00 = 0.11 row c1: 0.40·0 + 0.20·0 + (-0.10)·0.5 + 0.50·(-0.2) + 0.10 = -0.05 tanh([0.11, -0.05]) = [0.1096, -0.0500]
⬆ g_t = [0.1096, -0.0500] — signed, bounded proposal. Channel 0 wants to write a small positive value; channel 1 a small negative one.
Line 15: o_t = sigmoid(W_o @ z + b_o) (output gate)

The output gate decides, per channel, how much of the cell state will be EXPOSED as the hidden state h_t. This is the LSTM's separation between internal memory (c_t) and external output (h_t) — the cell can remember something without emitting it.

EXECUTION STATE
W_o (2×4) =
    h0     h1     x0     x1
c0  0.50   0.10  -0.20   0.30
c1  0.20  -0.30   0.40  -0.10
b_o (shape (2,)) = [0.00, -0.10]
→ step 1 compute = row c0: 0.50·0 + 0.10·0 + (-0.20)·0.5 + 0.30·(-0.2) + 0.00 = -0.16 row c1: 0.20·0 + (-0.30)·0 + 0.40·0.5 + (-0.10)·(-0.2) + (-0.10) = 0.12 σ([-0.16, 0.12]) = [0.4601, 0.5300]
⬆ o_t = [0.4601, 0.5300]
Line 18: c_t = f_t * c_prev + i_t * g_t (cell state update)

The central LSTM equation. Two additive paths: (1) carry forward a fraction f of the old cell state, (2) write in a fraction i of the new candidate. The '+' here is the memory highway — gradients can flow through this path almost unchanged when f is near 1.

EXECUTION STATE
* (element-wise multiply) = On equal-shape arrays NumPy's * is the element-wise (Hadamard) product, not a matrix product. [0.5695, 0.5100] * [0, 0] = [0, 0].
→ step 1 arithmetic = f_t * c_prev = [0.5695, 0.5100] * [0, 0] = [0, 0] i_t * g_t = [0.4428, 0.5523] * [0.1096, -0.0500] = [0.0485, -0.0276] c_t = [0, 0] + [0.0485, -0.0276] = [0.0485, -0.0276]
⬆ c_t (shape (2,)) = [0.0485, -0.0276]
→ why this is the key equation = Along the direct additive path, ∂c_t/∂c_{t-1} = diag(f_t) (ignoring the indirect dependence of the gates on h_{t-1}). As long as f stays away from 0, gradients survive. This is the architectural fix for vanishing gradients.
Line 21: h_t = o_t * np.tanh(c_t)

The hidden state is the cell state squashed to (-1, 1) and then gated by o_t. This produces what the outside world sees.

EXECUTION STATE
→ step 1 arithmetic = tanh(c_t) = tanh([0.0485, -0.0276]) = [0.0485, -0.0276] (tanh(x) ≈ x for small x) h_t = o_t * tanh(c_t) = [0.4601, 0.5300] * [0.0485, -0.0276] = [0.0223, -0.0146]
⬆ h_t = [0.0223, -0.0146]
→ h vs c = h_t is bounded in (-1, 1). c_t is unbounded. The tanh + output gate combo ensures the hidden state stays well-scaled no matter how large c grows.
Line 22: return h_t, c_t

Return both states as a Python tuple. The caller will pass them back in as h_prev, c_prev on the next step.

EXECUTION STATE
⬆ return: (h_t, c_t) = Step 1: ([0.0223, -0.0146], [0.0485, -0.0276])
Lines 25-38: params = { ... }

A Python dictionary of eight hand-picked arrays. In a trained model these come from gradient descent; here we set them manually so every computed number below is reproducible.

EXECUTION STATE
total weights = 4 matrices × (2 × 4) = 32 + 4 bias vectors × 2 = 8 → 40 parameters total. These are the EXACT same weights used in §17.1 — so this trace reproduces §17.1's numbers line for line.
→ PyTorch comparison = nn.LSTM stacks all four gates into one big matrix weight_ih_l0 of shape (4H, I) and weight_hh_l0 of shape (4H, H). Same math, fused layout for a single GEMM call.
Lines 40-42: sequence = [np.array(...), np.array(...), np.array(...)]

Three timesteps: a toy 'sentence' with three tokens. Each token is a 2-dimensional embedding vector chosen by hand so the trace is easy to follow.

EXECUTION STATE
sequence[0] = 'I' = [0.5, -0.2]
sequence[1] = 'love' = [0.8, 0.3]
sequence[2] = 'it' = [0.1, 0.9]
Lines 44-45: h = np.zeros(2); c = np.zeros(2)

Initialize both hidden and cell states to zero. This is the standard starting point when no prior context exists.

EXECUTION STATE
📚 np.zeros(n) = Creates a 1-D array of n zeros, dtype float64. np.zeros(2) = [0.0, 0.0].
h = [0.0, 0.0]
c = [0.0, 0.0]
Line 46: for t, x_t in enumerate(sequence, 1):

Iterate over timesteps. Calling lstm_cell once per token is exactly what a recurrent network does — and it is exactly why RNNs cannot be parallelized across time.

LOOP TRACE · 3 iterations
step 1 — token 'I'
x_t = [0.5, -0.2]
i_t = [0.4428, 0.5523]
f_t = [0.5695, 0.5100]
g_t = [0.1096, -0.0500]
o_t = [0.4601, 0.5300]
c_1 = [0.0485, -0.0276]
h_1 = [0.0223, -0.0146]
step 2 — token 'love'
x_t = [0.8, 0.3]
i_t = [0.4859, 0.6116]
f_t = [0.6127, 0.5312]
g_t = [0.2987, 0.1742]
o_t = [0.4849, 0.5495]
c_2 = [0.1749, 0.0919]
h_2 = [0.0839, 0.0504]
step 3 — token 'it'
x_t = [0.1, 0.9]
i_t = [0.5708, 0.5543]
f_t = [0.5577, 0.6186]
g_t = [0.1957, 0.5253]
o_t = [0.5737, 0.4630]
c_3 = [0.2092, 0.3480]
h_3 = [0.1183, 0.1549]
Line 47: h, c = lstm_cell(x_t, h, c, params)

Unpack the tuple returned by lstm_cell. Python tuple unpacking binds h to h_t and c to c_t in one line, ready for the next iteration.

EXECUTION STATE
→ after step 1 = h = [0.0223, -0.0146], c = [0.0485, -0.0276]
→ after step 3 = h = [0.1183, 0.1549], c = [0.2092, 0.3480]
Line 48: print(f"step {t}: h={h.round(4)} c={c.round(4)}")

Readable per-step output using Python f-strings and .round(4) to truncate noise.

EXECUTION STATE
→ printed output = step 1: h=[ 0.0223 -0.0146] c=[ 0.0485 -0.0276] step 2: h=[0.0839 0.0504] c=[0.1749 0.0919] step 3: h=[0.1183 0.1549] c=[0.2092 0.3480]
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# ---- One LSTM cell: processes a single timestep ----
def lstm_cell(x_t, h_prev, c_prev, p):
    # Concatenate previous hidden state with current input: shape (H + I,)
    z = np.concatenate([h_prev, x_t])

    # Four gates: each a linear layer followed by sigmoid or tanh
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])      # forget gate
    g_t = np.tanh(p["W_g"] @ z + p["b_g"])      # candidate content
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])      # output gate

    # Cell state: the memory highway
    c_t = f_t * c_prev + i_t * g_t

    # Hidden state: squashed cell state, gated by o_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# ---- Run over a sequence (same weights as §17.1, renamed for ifgo layout) ----
params = {
    "W_i": np.array([[ 0.3,  0.2, -0.1,  0.4],
                     [-0.2,  0.1,  0.5,  0.2]]),
    "b_i": np.array([-0.1,  0.0]),
    "W_f": np.array([[ 0.2, -0.3,  0.4,  0.1],
                     [ 0.1,  0.5, -0.2,  0.3]]),
    "b_f": np.array([ 0.1,  0.2]),
    "W_g": np.array([[ 0.1, -0.4,  0.3,  0.2],
                     [ 0.4,  0.2, -0.1,  0.5]]),
    "b_g": np.array([ 0.0,  0.1]),
    "W_o": np.array([[ 0.5,  0.1, -0.2,  0.3],
                     [ 0.2, -0.3,  0.4, -0.1]]),
    "b_o": np.array([ 0.0, -0.1]),
}

sequence = [np.array([0.5, -0.2]),   # "I"
            np.array([0.8,  0.3]),   # "love"
            np.array([0.1,  0.9])]   # "it"

h = np.zeros(2)
c = np.zeros(2)
for t, x_t in enumerate(sequence, 1):
    h, c = lstm_cell(x_t, h, c, params)
    print(f"step {t}: h={h.round(4)}  c={c.round(4)}")
```
The expensive part of this code is not the math but the Python for-loop. With hidden size 2 and three timesteps the loop cost is invisible, but at hidden size 512 and sequence length 1024, stepping through time in Python pays interpreter and kernel-launch overhead on every timestep and can be orders of magnitude slower than letting cuDNN run the whole sequence in a fused kernel. That is the main practical reason to prefer nn.LSTM over writing your own.
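One way to see the per-step overhead for yourself is to time a Python loop over nn.LSTMCell (PyTorch's single-step module) against a single nn.LSTM call. This is a rough CPU sketch with illustrative sizes; N, L, I, H and the helpers run_cell_loop and timed are assumptions, absolute numbers depend heavily on hardware, and the gap is usually larger on a GPU.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, I, H = 8, 128, 32, 64             # illustrative sizes, small enough for a CPU
x = torch.randn(N, L, I)

cell = nn.LSTMCell(I, H)                # one timestep per call
lstm = nn.LSTM(I, H, batch_first=True)  # whole sequence per call

def run_cell_loop(inputs):
    """Step through time in Python, one LSTMCell call per timestep."""
    h = torch.zeros(inputs.size(0), H)
    c = torch.zeros(inputs.size(0), H)
    for t in range(inputs.size(1)):
        h, c = cell(inputs[:, t, :], (h, c))
    return h

def timed(fn, repeats=5):
    """Average wall-clock seconds per call, after one warm-up call."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

with torch.no_grad():
    loop_s = timed(lambda: run_cell_loop(x))
    fused_s = timed(lambda: lstm(x))

print(f"Python loop over LSTMCell: {loop_s * 1e3:.2f} ms per pass")
print(f"single nn.LSTM call      : {fused_s * 1e3:.2f} ms per pass")
```

Treat the ratio as a local measurement rather than a universal constant; it changes with batch size, sequence length, and device.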

The Same LSTM, Now in PyTorch

Two APIs cover almost every use case: nn.LSTMCell runs one step at a time (like our NumPy loop), while nn.LSTM runs the whole sequence in one call, which dispatches to a fused cuDNN kernel on a GPU. Given the same weights, both produce the same results up to floating-point rounding; the fused version is typically much faster because it eliminates the Python round-trip per timestep.
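The claim that the two modules compute the same thing can be tested by giving them the same weights. The sketch below copies the layer-0 parameters of an nn.LSTM into an nn.LSTMCell and compares the outputs; the variable names are illustrative, and the input is the three-token toy sequence used throughout this section.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, H = 2, 2
lstm = nn.LSTM(input_size=I, hidden_size=H, batch_first=True)
cell = nn.LSTMCell(input_size=I, hidden_size=H)

# Copy the layer-0 parameters of nn.LSTM into the cell.
# The names differ only by the _l0 suffix that nn.LSTM adds.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])   # (N=1, L=3, I=2)

with torch.no_grad():
    # Whole sequence in one call.
    output, (h_n, c_n) = lstm(x)

    # Same sequence, one step at a time.
    h = torch.zeros(1, H)
    c = torch.zeros(1, H)
    steps = []
    for t in range(x.size(1)):
        h, c = cell(x[:, t, :], (h, c))
        steps.append(h)
    stepwise = torch.stack(steps, dim=1)                     # (1, 3, 2)

outputs_match = torch.allclose(output, stepwise, atol=1e-6)
final_h_match = torch.allclose(h_n[0], h, atol=1e-6)
final_c_match = torch.allclose(c_n[0], c, atol=1e-6)
print(outputs_match, final_h_match, final_c_match)
```

Without the copy step the two modules draw different random weights, which is why two freshly constructed modules never agree.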

nn.LSTMCell and nn.LSTM side by side
🐍lstm_pytorch.py
Line 1: import torch

PyTorch's root package. Everything in this file flows through torch — tensors, autograd, CUDA kernels. Unlike NumPy, torch tensors carry a .requires_grad flag and a .device attribute, letting you run the exact same code on CPU or GPU without changing logic.

EXECUTION STATE
torch = Core tensor library: torch.tensor, torch.zeros, torch.matmul, autograd, device transfer.
Line 2: import torch.nn as nn

torch.nn is the neural-network module system. It holds every pre-built layer — including LSTMCell, LSTM, RNN, GRU — plus the Module base class you subclass for custom models.

EXECUTION STATE
torch.nn = Layers as PyTorch Modules. Each Module owns its learnable parameters and defines forward().
Line 4: torch.manual_seed(42)

Seed the global RNG so that the random weight initialization inside nn.LSTMCell is reproducible. Without this, the numbers below would change every run.

EXECUTION STATE
📚 torch.manual_seed(n) = Seeds the default random number generator on all devices, CPU and CUDA. Bit-exact GPU runs additionally need deterministic kernels, for example torch.use_deterministic_algorithms(True) and the cuDNN backend flags.
⬇ arg: 42 = Any integer works; 42 matches the seed used in §17.1 so values are directly comparable. Same seed + same PyTorch build = same random weights.
Line 7: cell = nn.LSTMCell(input_size=2, hidden_size=2)

Constructs an LSTM cell that processes ONE timestep. Internally PyTorch stacks the four gate matrices into two big ones (weight_ih and weight_hh; the _l0 suffix exists only on nn.LSTM), so one matmul per matrix computes all four gate pre-activations at once.

EXECUTION STATE
📚 nn.LSTMCell(input_size, hidden_size, bias=True) = Single-step LSTM module. Stores weight_ih of shape (4·H, I) and weight_hh of shape (4·H, H) — the 4 stacks input/forget/candidate/output gates together.
⬇ arg: input_size = 2 = Feature dimension of each input vector. Sets the 'I' column count in weight_ih.
⬇ arg: hidden_size = 2 = Dimension of h and c vectors. Sets both the row count (times 4 for gates) and the column count of weight_hh.
→ parameter count = weight_ih: 4·2·2 = 16 | weight_hh: 4·2·2 = 16 | bias_ih: 4·2 = 8 | bias_hh: 4·2 = 8 → 48 total. (Our NumPy version used 40 — PyTorch uses two biases per gate, which is mathematically equivalent to one but matches the original cuDNN kernel.)
Lines 9-11: sequence = torch.tensor([[...], [...], [...]], dtype=torch.float32)

Create a (3, 2) tensor holding three 2-dimensional tokens. Unlike a NumPy array, a tensor records which device it lives on and can be moved to a GPU with .to("cuda").

EXECUTION STATE
📚 torch.tensor(data, dtype) = Creates a new tensor from Python data. Copies values. For an existing array use torch.as_tensor() to avoid the copy.
⬇ arg: dtype = torch.float32 = 32-bit float. Modern GPUs also support float16 and bfloat16 which use half the memory and are much faster on Tensor Cores.
⬆ sequence (3, 2) =
     f0    f1
t0  0.50 -0.20
t1  0.80  0.30
t2  0.10  0.90
Lines 13-14: h = torch.zeros(1, 2); c = torch.zeros(1, 2)

Initial h and c for LSTMCell. Shape is (batch, hidden) — PyTorch always puts batch first for the cell-level modules. We use batch=1 because we have a single sequence.

EXECUTION STATE
📚 torch.zeros(*size) = Tensor full of 0.0 with the given shape. torch.zeros(1, 2) → tensor([[0., 0.]]).
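→ why a batch dim? = nn.LSTMCell expects (batch, features) here because h and c are batched. Recent PyTorch releases also accept fully unbatched 1-D inputs, but mixing a 1-D x_t with 2-D states raises a shape error. Our NumPy loop skipped the batch axis because we ran a single un-batched sequence.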
Line 15: for t in range(sequence.size(0)):

Iterate over the time dimension. Calling .size(0) returns the number of rows (seq_len = 3). This loop is the Python version of what nn.LSTM does in one fused C++/CUDA call.

EXECUTION STATE
📚 tensor.size(dim) = Returns the length of the given dimension. Equivalent to tensor.shape[dim]. sequence.size(0) = 3.
Line 16: x_t = sequence[t].unsqueeze(0)

Grab token t and add a batch dimension so the shape goes from (2,) to (1, 2), matching the batched h and c.

EXECUTION STATE
📚 tensor.unsqueeze(dim) = Inserts a size-1 dimension at the given axis. .unsqueeze(0) adds a leading dim: (2,) → (1, 2). .unsqueeze(-1) adds a trailing dim: (2,) → (2, 1).
⬇ arg: 0 = Insert at position 0 (the new batch dim). The original features move to position 1.
→ shape journey = sequence[0]: tensor([0.5000, -0.2000]) shape (2,) .unsqueeze(0): tensor([[0.5000, -0.2000]]) shape (1, 2)
Line 17: h, c = cell(x_t, (h, c))

Calls LSTMCell.forward(). Under the hood this does one fused GEMM across all four gates, applies the non-linearities, computes c_t = f*c_prev + i*g, h_t = o*tanh(c_t), and returns the new (h, c).

EXECUTION STATE
⬇ arg 1: x_t (1, 2) = The current token batched. Shape must be (batch, input_size).
⬇ arg 2: (h, c) = Python tuple of previous hidden and cell states, each (1, 2). Passing the tuple is why the PyTorch API differs from GRUCell (which takes only h).
⬆ returns (h_new, c_new) = Tuple of two (1, 2) tensors. We assign back into h and c so the next iteration sees the update.
→ values differ from NumPy = PyTorch uses U(−√(1/H), √(1/H)) init — our NumPy run used hand-picked weights. Mechanism is identical; only the numeric fillings differ.
Line 18: print(f"step {t+1}: h={h.detach().numpy().round(4)}")

Print the hidden state after each step. .detach() removes the tensor from autograd's computation graph so .numpy() can export it safely.

EXECUTION STATE
📚 .detach() = Returns a new tensor sharing storage but without gradient tracking. Essential before calling .numpy() on a tensor that was created inside a differentiable computation.
📚 .numpy() = Zero-copy view into the tensor as a NumPy array. Only works on CPU tensors without requires_grad.
→ representative output (seed=42) = step 1: h=[[0.3217 0.1095]] step 2: h=[[0.4319 0.2324]] step 3: h=[[0.1188 0.2279]] (same seed as §17.1, so these numbers match §17.1's PyTorch run)
Line 21: lstm = nn.LSTM(input_size=2, hidden_size=2, batch_first=True)

The idiomatic PyTorch LSTM. Give it an entire (batch, seq_len, features) tensor and it unrolls across time in a single optimized kernel — no Python loop in the hot path.

EXECUTION STATE
📚 nn.LSTM = Multi-step, multi-layer LSTM. Key flags: num_layers (stacked depth), bidirectional (forward+backward), dropout (between layers), batch_first (input layout).
⬇ arg: batch_first = True = Input is (N, L, H). Default False would mean (L, N, H) — PyTorch's historical time-major default, which survives because cuDNN preferred it on older GPUs.
→ batch_first pitfall = If you forget this flag but pass a tensor of shape (1, 3, 2) thinking 'batch=1, seq=3', PyTorch reads it as 'seq=1, batch=3'. No error — just three length-1 sequences and wrong semantics.
Line 23: x = sequence.unsqueeze(0)

Add the batch dimension: (3, 2) → (1, 3, 2). This is the (N, L, H) layout nn.LSTM expects when batch_first=True.

EXECUTION STATE
→ before = sequence.shape = (3, 2) — three timesteps, 2 features each
→ after = x.shape = (1, 3, 2) — one sequence in a batch, 3 timesteps, 2 features
Line 24: output, (h_n, c_n) = lstm(x)

One call — the whole sequence runs through the LSTM. PyTorch returns two things: the full output (h_t for every timestep) and the tuple of final states (h_n, c_n) for feeding into the next layer or the next segment.

EXECUTION STATE
⬆ output (N, L, H) = (1, 3, 2) = h_t for every timestep. Use this when downstream layers need a vector per token (e.g., per-word classification, token tagging).
⬆ h_n (num_layers·num_dirs, N, H) = (1, 1, 2) = Hidden state at the FINAL timestep only. Use this when you need a single summary vector (e.g., sentiment classification from a whole sentence).
⬆ c_n (same shape as h_n) = Final cell state. Usually ignored for downstream tasks but critical if you plan to continue the sequence later.
→ consistency check = For a unidirectional LSTM on unpadded input, output[:, -1, :] equals h_n[-1] element-wise. Same final hidden state, different tensor shapes. Useful when you want both per-step and final outputs from one pass.
Line 26: print("output shape:", output.shape)

Inspect output; expect (1, 3, 2). Printing tensor.shape is one of the fastest ways to debug a model, because shape mismatches are among the most common beginner bugs.

EXECUTION STATE
→ printed = output shape: torch.Size([1, 3, 2])
Line 27: print("h_n shape :", h_n.shape)

h_n shape is (num_layers × num_directions, batch, hidden). With a single unidirectional layer this is (1, 1, 2).

EXECUTION STATE
→ printed = h_n shape : torch.Size([1, 1, 2])
→ why the leading 1? = First axis indexes (layer × direction). If you use num_layers=2 bidirectional, this becomes 4 (2 layers × 2 directions). The leading axis survives even for the trivial case to keep the API uniform.
Line 28: print("c_n shape :", c_n.shape)

Same shape rule as h_n. For nn.GRU there is no c_n — the return is (output, h_n) with just two elements. That one-tuple-element difference is the main API change between GRU and LSTM in PyTorch.

EXECUTION STATE
→ printed = c_n shape : torch.Size([1, 1, 2])
```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# ---- nn.LSTMCell: one step at a time (mirrors our NumPy loop) ----
cell = nn.LSTMCell(input_size=2, hidden_size=2)

sequence = torch.tensor([[0.5, -0.2],
                         [0.8,  0.3],
                         [0.1,  0.9]], dtype=torch.float32)

h = torch.zeros(1, 2)                # (batch=1, hidden=2)
c = torch.zeros(1, 2)
for t in range(sequence.size(0)):
    x_t = sequence[t].unsqueeze(0)   # (1, 2): add batch dim
    h, c = cell(x_t, (h, c))
    print(f"step {t+1}: h={h.detach().numpy().round(4)}")

# ---- nn.LSTM: whole sequence in one fused kernel ----
lstm = nn.LSTM(input_size=2, hidden_size=2, batch_first=True)

x = sequence.unsqueeze(0)            # (batch=1, seq_len=3, features=2)
output, (h_n, c_n) = lstm(x)

print("output shape:", output.shape)     # (1, 3, 2): h_t for every step
print("h_n shape   :", h_n.shape)        # (1, 1, 2): final h only
print("c_n shape   :", c_n.shape)        # (1, 1, 2): final c only
```
Numbers differ, mechanism does not. PyTorch initializes weights from $\mathcal{U}(-\sqrt{1/H}, \sqrt{1/H})$, so the hidden states you see in the PyTorch run do not match the hand-picked NumPy run. The equations (the four gates, the cell update, the output gating) are the same. If you copied the NumPy weights into the PyTorch parameters, the two would agree to within float32 rounding. The interactive panel below demonstrates exactly that.

Interactive: NumPy and PyTorch Agree

The two columns below are the same three-step LSTM — once computed with the hand-written NumPy path, once with the nn.LSTMCell fused kernel — on the EXACT SAME weights. Every hidden and cell state matches to floating-point precision. That is the concrete version of the claim above.

The reference snippet that loads §17.1's NumPy weights into cell.weight_ih, cell.weight_hh, cell.bias_ih, and cell.bias_hh (splitting each NumPy matrix into its hidden-state and input columns, and splitting the single NumPy bias across PyTorch's bias_ih and bias_hh) and then runs torch.allclose is the exercise at the end of §17.1. Any Python environment with torch installed reproduces the numbers shown here.
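As a hedged illustration of that weight-loading step (a sketch, not the §17.1 exercise solution), the block below stacks the hand-picked NumPy weights in PyTorch's input, forget, candidate, output order and loads them into an nn.LSTMCell. It assumes the NumPy matrices have columns ordered [h0, h1, x0, x1], as in the trace above, and puts the whole bias in bias_ih with bias_hh set to zero, which is one of several equivalent ways to split it.

```python
import numpy as np
import torch
import torch.nn as nn

H, I = 2, 2
# Hand-picked weights from the NumPy listing; columns are [h0, h1, x0, x1].
W = {
    "i": np.array([[0.3, 0.2, -0.1, 0.4], [-0.2, 0.1, 0.5, 0.2]]),
    "f": np.array([[0.2, -0.3, 0.4, 0.1], [0.1, 0.5, -0.2, 0.3]]),
    "g": np.array([[0.1, -0.4, 0.3, 0.2], [0.4, 0.2, -0.1, 0.5]]),
    "o": np.array([[0.5, 0.1, -0.2, 0.3], [0.2, -0.3, 0.4, -0.1]]),
}
b = {
    "i": np.array([-0.1, 0.0]),
    "f": np.array([0.1, 0.2]),
    "g": np.array([0.0, 0.1]),
    "o": np.array([0.0, -0.1]),
}

order = "ifgo"  # PyTorch stacks gates as input, forget, candidate, output
W_all = np.concatenate([W[k] for k in order], axis=0)    # (4H, H + I)
b_all = np.concatenate([b[k] for k in order])            # (4H,)

cell = nn.LSTMCell(input_size=I, hidden_size=H)
with torch.no_grad():
    # First H columns act on h_prev, the remaining I columns act on x_t.
    cell.weight_hh.copy_(torch.tensor(W_all[:, :H], dtype=torch.float32))
    cell.weight_ih.copy_(torch.tensor(W_all[:, H:], dtype=torch.float32))
    cell.bias_ih.copy_(torch.tensor(b_all, dtype=torch.float32))
    cell.bias_hh.zero_()                                  # assumption: all bias in bias_ih

sequence = torch.tensor([[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]])
h = torch.zeros(1, H)
c = torch.zeros(1, H)
with torch.no_grad():
    for x_t in sequence:
        h, c = cell(x_t.unsqueeze(0), (h, c))

# Step-3 values from the NumPy trace, rounded to four decimals.
expected_h = torch.tensor([[0.1183, 0.1549]])
expected_c = torch.tensor([[0.2092, 0.3480]])
h_ok = torch.allclose(h, expected_h, atol=5e-4)
c_ok = torch.allclose(c, expected_c, atol=5e-4)
print("h:", h, "matches trace:", h_ok)
print("c:", c, "matches trace:", c_ok)
```

The tolerance is loose only because the reference values are rounded to four decimals; comparing against a float64 NumPy run directly would allow a much tighter check.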

Packed Sequences: Variable Lengths Without Waste

Real sentences have different lengths. A batch of three inputs, "The cat sat on mat" (5 tokens), "I love it" (3 tokens), and "Go home" (2 tokens), cannot be stacked into a tensor unless we pad to a common length. Padding is the easy part; the hard part is telling the LSTM not to run its gates over the padded positions. The canonical solution is pack_padded_sequence.

The workflow is a four-step pipeline: pad the batch to the max length, pack it into a compact representation that carries per-sequence lengths, run the LSTM, and unpack back to a regular tensor for downstream layers. During the LSTM's internal time loop, a packed sequence spends no compute on padded positions, and the outputs and gradients match running each sequence separately up to floating-point rounding.
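The equivalence with running each sequence separately can be checked directly. This sketch uses random stand-in embeddings with the same lengths as the three example sentences; the variable names are illustrative assumptions. It also shows what goes wrong when the padded tensor is fed to the LSTM without packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]  # stand-in embeddings
lengths = torch.tensor([len(s) for s in seqs])                     # CPU tensor([5, 3, 2])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

with torch.no_grad():
    # pad -> pack -> run -> unpack
    padded = pad_sequence(seqs, batch_first=True)                  # (3, 5, 8)
    packed = pack_padded_sequence(padded, lengths,
                                  batch_first=True, enforce_sorted=False)
    packed_out, (h_n, c_n) = lstm(packed)
    out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 5, 16)

    # Reference: run every sequence alone as a batch of one.
    matches = []
    for idx, s in enumerate(seqs):
        _, (h_single, _) = lstm(s.unsqueeze(0))
        matches.append(torch.allclose(h_n[0, idx], h_single[0, 0], atol=1e-5))

    # Counter-example: no packing, so the LSTM also steps over the zero padding.
    _, (h_unpacked, _) = lstm(padded)
    short_seq_polluted = not torch.allclose(h_unpacked[0, 1], h_n[0, 1], atol=1e-5)

print("batch_sizes:", packed.batch_sizes.tolist())   # sequences still alive per step
print("packed final states match single runs:", matches)
print("unpacked run changes the 3-token sequence's state:", short_seq_polluted)
```

With enforce_sorted=False the final states come back in the original batch order, so h_n[0, idx] lines up with seqs[idx] without any manual re-sorting.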

pad → pack → run → unpack
🐍packed_sequences.py
Line 1: import torch

Standard PyTorch import — needed for tensors and random data generation below.

EXECUTION STATE
torch = Root PyTorch package.
Line 2: import torch.nn as nn

We need nn.LSTM for the recurrent layer.

EXECUTION STATE
nn = Neural-network submodule.
Line 3: from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

The three utility functions every real-world RNN pipeline uses. Batches contain sequences of different lengths; padding + packing is how you handle them efficiently.

EXECUTION STATE
📚 pad_sequence = Stacks a list of variable-length tensors into one (batch, max_len, features) tensor, padding the rest with zeros.
📚 pack_padded_sequence = Wraps a padded tensor into a PackedSequence object that carries the true lengths. The LSTM kernel uses this to skip padded positions and return exact gradients.
📚 pad_packed_sequence = Inverse — unpacks back to a regular padded tensor for downstream layers.
Line 6: s1 = torch.randn(5, 8) ('The cat sat on mat')

Five tokens, each an 8-dimensional random embedding. In a real pipeline these would come from an embedding layer.

EXECUTION STATE
📚 torch.randn(*shape) = Tensor filled with samples from N(0, 1). Used here to simulate learned embeddings.
→ s1.shape = (5, 8) — five tokens × 8 features
Line 7: s2 = torch.randn(3, 8) ('I love it')

Three tokens of 8 features.

EXECUTION STATE
→ s2.shape = (3, 8)
Line 8: s3 = torch.randn(2, 8) ('Go home')

Two tokens. Different length again — this is the whole point.

EXECUTION STATE
→ s3.shape = (2, 8)
Line 11: padded = pad_sequence([s1, s2, s3], batch_first=True)

Stack the three sequences into a single (batch, max_len, features) tensor, padding the shorter ones with zeros. max_len = 5 (from s1).

EXECUTION STATE
⬇ arg 1: [s1, s2, s3] = Python list of three 2-D tensors of shapes (5, 8), (3, 8), (2, 8).
⬇ arg 2: batch_first=True = Result shape is (batch, max_len, features). With False it would be (max_len, batch, features).
⬆ padded.shape = (3, 5, 8) — the bottom rows of s2 and s3 are filled with 0.
lengths = torch.tensor([5, 3, 2])

A (batch,) tensor of the original lengths. This is the only way the LSTM will know which rows of 'padded' are real and which are zero-padding.

EXECUTION STATE
lengths = tensor([5, 3, 2])
→ why pass lengths explicitly? = Keeping lengths as separate metadata lets PyTorch skip padding exactly, without scanning the data for zero rows (which could be legitimate embeddings).
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

Packs the padded tensor into a PackedSequence — a compact representation that skips padding positions during the LSTM's time-major scan.

EXECUTION STATE
⬇ arg: padded = The (3, 5, 8) padded batch.
⬇ arg: lengths = The true length per sequence, tensor([5, 3, 2]).
⬇ arg: batch_first=True = Matches the layout of 'padded'.
⬇ arg: enforce_sorted=False = Historically pack required sequences pre-sorted by length (longest first). Setting False tells PyTorch to sort internally and unsort on return — almost always what you want.
⬆ packed = A PackedSequence namedtuple: (.data → concatenated real elements, .batch_sizes → how many seqs are still alive at each step).
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

Construct the recurrent layer. nn.LSTM accepts a PackedSequence directly — no code change.

EXECUTION STATE
→ weight shapes = weight_ih_l0: (64, 8), weight_hh_l0: (64, 16), biases: (64,) each. 64 = 4 gates × 16 hidden.
packed_out, (h_n, c_n) = lstm(packed)

Run the LSTM over the packed batch. The kernel processes only real positions — padded positions contribute zero compute and zero gradient.

EXECUTION STATE
⬆ packed_out = A PackedSequence holding h_t for every real position. To feed the next layer you usually unpack first.
⬆ h_n = Final hidden state per sequence — (1, 3, 16). For the 3-length sequence, this is h_3, not h_5 (which would be garbage).
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

Convert back to a regular padded tensor so downstream layers (attention, pooling, classifiers) can work with it.

EXECUTION STATE
⬆ out = (3, 5, 16) — padded with zeros beyond the real length.
⬆ out_lengths = tensor([5, 3, 2]) — same as the input lengths, returned for convenience.
print("unpacked shape:", out.shape)

Confirm the unpack preserved the expected layout.

EXECUTION STATE
→ printed = unpacked shape: torch.Size([3, 5, 16])
print("lengths :", out_lengths)

Confirm the per-sequence lengths come back unchanged.

EXECUTION STATE
→ printed = lengths : tensor([5, 3, 2])
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three sentences of DIFFERENT lengths (simulate a real batch)
s1 = torch.randn(5, 8)    # "The cat sat on mat"
s2 = torch.randn(3, 8)    # "I love it"
s3 = torch.randn(2, 8)    # "Go home"

# 1) Pad to max length: creates a (3, 5, 8) tensor
padded = pad_sequence([s1, s2, s3], batch_first=True)
lengths = torch.tensor([5, 3, 2])

# 2) Pack: tells the LSTM to skip padded positions
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# 3) Unpack: back to a padded (batch, seq, hidden) tensor
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print("unpacked shape:", out.shape)   # torch.Size([3, 5, 16])
print("lengths        :", out_lengths)
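To make the packed representation concrete, the sketch below inspects the PackedSequence built from the same three lengths as the listing above. It is an illustration added here, not part of the original listing, and the variable name `seqs` is chosen only for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Same lengths as the listing above: 5, 3 and 2 tokens, 8 features each.
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
padded = pad_sequence(seqs, batch_first=True)             # (3, 5, 8)
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)

# .data keeps only the real timesteps: 5 + 3 + 2 = 10 rows of 8 features.
print(packed.data.shape)     # torch.Size([10, 8])
# .batch_sizes[t] is the number of sequences still active at step t.
print(packed.batch_sizes)    # tensor([3, 3, 2, 1, 1])
```

The padded tensor holds 3 × 5 = 15 timesteps, while the packed one holds 10, so a third of the time-loop work disappears for this batch.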
Why bother when attention masks exist? Transformer models use attention masks, which is simpler. RNNs use packing because their time dimension is strictly sequential — you cannot "mask out" a timestep in the middle of a cuDNN call without breaking the forward chain. Packing is not an ugly workaround; it is the exact right abstraction for a serial computation over variable length.
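One way to convince yourself that packing changes the result, and not only the speed, is to compare three runs of the same LSTM: the packed batch, each sequence on its own, and the padded batch without packing. The sketch below is a minimal check under the same toy shapes as above; the tolerance `atol=1e-5` is an assumption chosen to absorb floating-point rounding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)              # (3, 5, 8)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

with torch.no_grad():
    # Packed batch: padding is skipped inside the time loop.
    packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                  enforce_sorted=False)
    packed_out, (h_packed, _) = lstm(packed)
    out, _ = pad_packed_sequence(packed_out, batch_first=True)

    # Reference: run every sequence alone as a batch of one.
    h_single = torch.cat([lstm(s.unsqueeze(0))[1][0] for s in seqs], dim=1)

    # Naive alternative: feed the padded tensor without packing.
    _, (h_padded, _) = lstm(padded)

packed_matches = torch.allclose(h_packed, h_single, atol=1e-5)
padded_matches = torch.allclose(h_padded, h_single, atol=1e-5)
print(packed_matches, padded_matches)
```

The packed run should agree with the per-sequence reference, while the unpacked run should not: without packing, the short sequences keep stepping over zero vectors, and the biases and recurrent weights still move the state.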

Bidirectional and Stacked Layers

Two flags transform a plain LSTM into the workhorse of 2014–2017 NLP.

  1. bidirectional=True: run one LSTM left-to-right and another right-to-left, then concatenate their hidden states at each timestep. Output width doubles. Each token now sees both past context (from the forward pass) and future context (from the backward pass). Essential for tagging tasks where the meaning of a word depends on what comes after it.
  2. num_layers=k: stack $k$ LSTMs. Layer 1 consumes the embeddings. Layer 2 consumes layer 1's per-step hidden states. Dropout can be applied between layers during training to regularize. Two layers is common; four or more rarely helps without careful architectural tweaks.

With $k$ layers and two directions the returned $h_n$ has shape $(2k, N, H)$. The stacking order is $[L_1^\rightarrow, L_1^\leftarrow, L_2^\rightarrow, L_2^\leftarrow, \ldots]$. So $h_n[-2]$ is the last layer's forward final state and $h_n[-1]$ is its backward final state. You almost always want to concatenate those two for downstream classification.
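The layout described above can be checked against the per-step output. The sketch below assumes unpadded, full-length sequences, so the forward direction finishes at the last timestep and the backward direction finishes at the first. The sizes and the name `h_by_layer` are illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, E, H, k = 3, 5, 8, 16, 2
lstm = nn.LSTM(E, H, num_layers=k, bidirectional=True, batch_first=True)

x = torch.randn(N, L, E)
with torch.no_grad():
    out, (h_n, c_n) = lstm(x)        # out: (N, L, 2H), h_n: (2k, N, H)

# Split the leading axis into (layer, direction).
h_by_layer = h_n.reshape(k, 2, N, H)
fwd_last = h_by_layer[-1, 0]         # same tensor as h_n[-2]
bwd_last = h_by_layer[-1, 1]         # same tensor as h_n[-1]

# Forward final state is the first H features of the LAST timestep.
assert torch.allclose(fwd_last, out[:, -1, :H], atol=1e-6)
# Backward final state is the last H features of the FIRST timestep.
assert torch.allclose(bwd_last, out[:, 0, H:], atol=1e-6)

summary = torch.cat([h_n[-2], h_n[-1]], dim=-1)
print(summary.shape)                 # torch.Size([3, 32])
```

Note that `out[:, -1]` alone is not a good sentence summary for a bidirectional model: its backward half has seen only one token.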

Bidirectional + causal generation is illegal. A bidirectional LSTM needs the full sequence up front to compute the backward pass. It cannot be used for language modelling or any task where you must produce one token at a time based only on the past. This is the same reason BERT (bidirectional) is an encoder while GPT (unidirectional) is a decoder.
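The dependence on future tokens is easy to observe directly. In the sketch below, only the last timestep of a toy input is perturbed, and we measure how much the outputs at the earlier timesteps move. The layer sizes and the perturbation of 1.0 are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 6, 4)             # one sequence, 6 steps, 4 features
x_future = x.clone()
x_future[0, -1] += 1.0               # perturb only the LAST timestep

uni = nn.LSTM(4, 8, batch_first=True)
bi = nn.LSTM(4, 8, batch_first=True, bidirectional=True)

with torch.no_grad():
    u1, _ = uni(x)
    u2, _ = uni(x_future)
    b1, _ = bi(x)
    b2, _ = bi(x_future)

# Largest change in the outputs at steps 0..4, all before the edited step.
uni_change = (u1[:, :-1] - u2[:, :-1]).abs().max().item()
bi_change = (b1[:, :-1] - b2[:, :-1]).abs().max().item()
print(f"unidirectional: {uni_change:.2e}  bidirectional: {bi_change:.2e}")
```

The unidirectional outputs before the edit should be unchanged, while the bidirectional ones shift because the backward pass carries information from the final token to every earlier position. A model that generates token by token does not have that final token yet.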

A Full Sentiment Classifier, End to End

Bringing everything together: a two-layer bidirectional LSTM sentiment classifier. Embedding lookup → packed sequence → bi-LSTM → concatenate final hidden states → linear → logits. This is the template thousands of production RNN models used in 2016.

Embedding → Packed → BiLSTM → Linear
🐍sentiment_lstm.py
import torch

Core PyTorch package.

EXECUTION STATE
torch = Tensors and autograd.
import torch.nn as nn

We will subclass nn.Module and compose nn.Embedding, nn.LSTM, nn.Linear.

EXECUTION STATE
nn = Neural-net layers.
class SentimentLSTM(nn.Module):

Custom model class. Subclassing nn.Module gives us automatic parameter registration (so optimizer.parameters() returns everything), GPU transfer (.to(device) cascades), and the callable forward interface (model(x) == model.forward(x)).

EXECUTION STATE
📚 nn.Module = Base class for all PyTorch models. Tracks submodules and parameters; provides .train() / .eval() / .to() / .state_dict().
def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):

Constructor. Declares the layers — PyTorch records each one assigned to self as a child module.

EXECUTION STATE
⬇ vocab_size = Number of distinct tokens (words or subwords). Sets the rows of the embedding matrix.
⬇ embed_dim = Dimension of each token vector. Sets the columns of the embedding matrix.
⬇ hidden_dim = Dimension of the LSTM hidden state (per direction).
⬇ num_classes = Output classes (3 for negative / neutral / positive).
super().__init__()

Mandatory. Calls nn.Module's __init__, which sets up the internal dicts that track parameters and submodules. Forgetting this line is a classic PyTorch bug: the first assignment of a submodule to self raises an AttributeError, because the bookkeeping dicts do not exist yet.

EXECUTION STATE
→ what it initializes = self._parameters, self._modules, self._buffers — all the bookkeeping that makes the rest of the framework work.
self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

Learnable lookup table. Given an integer token id, returns the corresponding embedding row. padding_idx=0 zeros out the gradient for token 0 so the padding embedding stays a zero vector.

EXECUTION STATE
📚 nn.Embedding(num_embeddings, embedding_dim, padding_idx=None) = A (num_embeddings, embedding_dim) weight matrix. Forward pass: W[token_ids] — a plain indexing operation, not a matmul.
⬇ padding_idx=0 = The id reserved for padding. Its row is forcibly zeroed at init and receives no gradient, so padding never pollutes the learned embeddings.
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=0.3, batch_first=True)

Two-layer bidirectional LSTM. Each layer processes the sequence in both directions; the second layer stacks on top of the first. Dropout is applied between layers during training.

EXECUTION STATE
⬇ num_layers=2 = Stacked depth. Layer 1 sees the embeddings; layer 2 sees layer 1's hidden states. More layers = more representational depth, but harder to train.
⬇ bidirectional=True = Runs TWO LSTMs: one left-to-right, one right-to-left. Their hidden states are concatenated. Doubles the effective hidden dim but lets each token see future context.
⬇ dropout=0.3 = Drops 30% of activations between stacked layers during .train() mode. Does NOT apply to the final layer's output or between timesteps of the same layer.
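→ parameter count = Layer 1 fwd + bwd + Layer 2 fwd + bwd = 4 LSTM blocks. A layer-1 block has 4·H·(E+H) + 8·H parameters; a layer-2 block has 4·H·(2H+H) + 8·H, because its input is the 2H-wide concatenation of both directions. With E=128, H=64 each block has 49,664 parameters, so the total is 198,656 ≈ 200k params just in the LSTM.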
self.head = nn.Linear(hidden_dim * 2, num_classes)

Final classifier. Input is 2·H because we concatenate forward and backward final hidden states.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Affine transform y = x·Wᵀ + b. Weight shape: (out, in). For hidden_dim=64 → 128 in, 3 out → (3, 128) weight.
def forward(self, token_ids, lengths):

The forward pass. PyTorch calls this when you invoke model(token_ids, lengths). Gradient tracking is automatic — you write forward; autograd records the graph; .backward() replays it.

EXECUTION STATE
⬇ token_ids (N, L) = Batch of token id sequences, padded with 0 to a common length L.
⬇ lengths (N,) = True length per sequence so the LSTM can skip padding.
x = self.embed(token_ids)

Embedding lookup. Each integer token id becomes an embed_dim vector.

EXECUTION STATE
→ shape journey = token_ids: (N, L) integers → x: (N, L, E) floats
packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)

Pack the padded embeddings so the LSTM skips padding. lengths must live on CPU for this API — a rare case where .cpu() is required.

EXECUTION STATE
📚 pack_padded_sequence = Turns (N, L, E) + lengths into a PackedSequence that the LSTM kernel processes in time-major order, skipping padding steps and producing exact gradients.
⬇ lengths.cpu() = Move lengths to CPU. PyTorch uses these lengths to build Python-level control flow; CUDA tensors would require a device sync.
⬇ enforce_sorted=False = Tells PyTorch to sort internally by length (longest first) and unsort on return. Without this flag you must pre-sort your batch.
_, (h_n, _) = self.lstm(packed)

Run the LSTM. We discard the per-step output and the final cell state — for classification we only need h_n (summary) and of those only the final layer's forward and backward slots.

EXECUTION STATE
→ h_n shape = (num_layers · num_directions, N, H) = (2·2, N, H) = (4, N, H)
→ underscore convention = _ is Python for 'I don't care about this value'. Here we throw away the packed output tensor and the final cell state.
last = torch.cat([h_n[-2], h_n[-1]], dim=-1)

Select the last layer's forward and backward final hidden states and concatenate along the feature axis. This gives one (N, 2H) vector per sequence.

EXECUTION STATE
📚 torch.cat(tensors, dim) = Concatenates along an existing dim. h_n[-2] is the forward direction of layer 2; h_n[-1] is the backward direction of layer 2.
⬇ dim=-1 = Concatenate along the last dim (features). (N, H) + (N, H) → (N, 2H). With dim=0 we'd get (2N, H) — a common bug.
→ why -2 and -1? = h_n is stacked as [L1_fwd, L1_bwd, L2_fwd, L2_bwd]. Negative indexing is a safe idiom that works for any num_layers, as long as bidirectional=True.
return self.head(last)

Linear projection to class logits. No softmax here — nn.CrossEntropyLoss expects raw logits and applies log-softmax internally.

EXECUTION STATE
⬆ logits shape = (N, num_classes) = (4, 3)
model = SentimentLSTM(vocab_size=20_000, embed_dim=128, hidden_dim=64, num_classes=3)

Instantiate. Python's _ thousands separator is cosmetic: 20_000 == 20000.

EXECUTION STATE
→ total parameter count = Embedding: 20,000 · 128 = 2.56M | LSTM ≈ 200k | Linear: 128·3 + 3 = 387 → ~2.76M params
token_ids = torch.randint(1, 20_000, (4, 12))

Random integer token ids for demo. We start at 1 (not 0) because 0 is reserved for padding.

EXECUTION STATE
📚 torch.randint(low, high, size) = Uniform integer samples in [low, high). torch.randint(1, 20000, (4, 12)) → (4, 12) ints in [1, 19999].
⬇ arg: (4, 12) = Output shape: batch of 4 sequences, each padded to 12.
lengths = torch.tensor([12, 9, 7, 4])

True lengths. Positions beyond each length are padding even though the demo filled them with random ids.

EXECUTION STATE
→ lengths = tensor([12, 9, 7, 4])
logits = model(token_ids, lengths)

Forward pass. model(...) is sugar for model.__call__ which wraps model.forward() with hooks. Autograd records the graph; you could now call .backward() on a loss to compute gradients.

EXECUTION STATE
→ logits = (4, 3) tensor of raw unnormalized class scores. Feed into F.cross_entropy with true labels to compute the loss.
print("logits:", logits.shape)

Shape check.

EXECUTION STATE
→ printed = logits: torch.Size([4, 3])
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm  = nn.LSTM(embed_dim, hidden_dim,
                             num_layers=2, bidirectional=True,
                             dropout=0.3, batch_first=True)
        self.head  = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids, lengths):
        x = self.embed(token_ids)                         # (N, L, E)
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n: (num_layers*num_dirs, N, H). Take last layer, both dirs.
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)       # (N, 2H)
        return self.head(last)                             # (N, num_classes)

model = SentimentLSTM(vocab_size=20_000, embed_dim=128,
                      hidden_dim=64, num_classes=3)
token_ids = torch.randint(1, 20_000, (4, 12))             # batch of 4 sents
lengths   = torch.tensor([12, 9, 7, 4])
logits = model(token_ids, lengths)
print("logits:", logits.shape)                             # torch.Size([4, 3])
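The parameter counts quoted in the walkthrough can be checked by building the three layers with the same sizes and counting. The helper `n_params` and the hand-derived formula below are illustrative additions, not part of the original listing.

```python
import torch.nn as nn

def n_params(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

E, H = 128, 64
embed = nn.Embedding(20_000, E, padding_idx=0)
lstm = nn.LSTM(E, H, num_layers=2, bidirectional=True,
               dropout=0.3, batch_first=True)
head = nn.Linear(H * 2, 3)

# Per direction: input weights + recurrent weights + two bias vectors.
layer1_block = 4 * H * (E + H) + 8 * H        # input width is E
layer2_block = 4 * H * (2 * H + H) + 8 * H    # input width is 2H
by_hand = 2 * layer1_block + 2 * layer2_block

print(n_params(embed), n_params(lstm), n_params(head))
# 2560000 198656 387
assert n_params(lstm) == by_hand
```

The embedding table dominates: roughly 93% of the 2.76M parameters sit in the lookup matrix, which is typical for models with a word-level vocabulary.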

At this point you have the complete practical toolkit: every shape, every flag, every PyTorch helper you need to train an LSTM on real data. The remaining sections of this chapter zoom out to answer a different question: why did the field move on from this?
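As a bridge from the forward pass to training, here is a minimal sketch of a single optimization step. Everything in it is an assumption made for illustration: `TinyBiLSTM` is a smaller stand-in with the same structure as the classifier above, the batch and labels are random, and the choice of Adam, the learning rate, and gradient clipping at norm 1.0 are common defaults, not settings taken from this section.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class TinyBiLSTM(nn.Module):
    """Compact stand-in with the same structure as SentimentLSTM."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=16, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, dropout=0.3, batch_first=True)
        self.head = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids, lengths):
        packed = pack_padded_sequence(self.embed(token_ids), lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        return self.head(torch.cat([h_n[-2], h_n[-1]], dim=-1))

torch.manual_seed(0)
model = TinyBiLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()          # expects raw logits

# Hypothetical toy batch: 4 sequences padded with id 0 to length 12.
lengths = torch.tensor([12, 9, 7, 4])
token_ids = torch.randint(1, 1000, (4, 12))
for i, n in enumerate(lengths.tolist()):
    token_ids[i, n:] = 0                   # real padding instead of random ids
labels = torch.tensor([0, 2, 1, 2])

model.train()                              # enables inter-layer dropout
optimizer.zero_grad()
loss = criterion(model(token_ids, lengths), labels)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

Gradient clipping is included because recurrent models are prone to occasional gradient spikes; whether it is needed depends on the data and sequence lengths. For evaluation, call `model.eval()` first so the dropout between layers is switched off.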


Looking Ahead

The sentiment classifier above is the 2016 production pattern: embedding, packed BiLSTM, concatenated final states, linear head. Beyond 2016 the field moved in a different direction. Transformers replaced the recurrent time-loop with parallel attention, the hidden-state bottleneck with a growing KV-cache, and the $O(L \cdot H^2)$ per-sequence cost with $O(L^2 \cdot H)$. None of this obsoleted the LSTM, which remains the right tool for very long strictly-sequential streams and for tight-latency edge inference, but it changed what “the default sequence model” is.
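The two cost expressions cross where the sequence length equals the hidden size. The sketch below is back-of-the-envelope arithmetic only: it ignores constant factors, gate counts, and the fact that attention parallelizes across timesteps while recurrence does not. The function names are illustrative.

```python
def recurrent_cost(L, H):
    """Rough multiply count for one recurrent layer: L steps, each about H*H."""
    return L * H * H

def attention_cost(L, H):
    """Rough multiply count for self-attention scores: L*L pairs, each about H."""
    return L * L * H

H = 512
for L in (128, 512, 4096):
    ratio = attention_cost(L, H) / recurrent_cost(L, H)
    print(L, ratio)      # the ratio simplifies to L / H
# 128 0.25
# 512 1.0
# 4096 8.0
```

For sequences shorter than the hidden size, attention does less arithmetic; for much longer ones, the quadratic term dominates.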

The next section, §17.4 From Recurrence to Attention, is a conceptual tour of the bridge. You will see why serial computation became a bottleneck, how a KV-cache re-invents the hidden state in a lossless form, how Flash Attention tiles the softmax to get around memory bandwidth, and how multi-head attention and positional encodings complete the picture. It is a primer, not a tutorial — the full build-it-in-PyTorch treatment lives in the dedicated Transformer chapter.

A mental model to carry forward: LSTM/GRU for $O(H)$ memory; attention for $O(1)$ path length; pick whichever property matters more for your problem.

Summary

  • PyTorch exposes nn.RNN, nn.GRU, and nn.LSTM with a near-identical API. The LSTM is the only one returning a two-tuple state (h, c).
  • The shape contract is $(N, L, H)$ with $\text{batch\_first=True}$; the state tensors carry a leading $\text{num\_layers} \cdot \text{num\_dirs}$ axis.
  • For variable-length batches, always pad → pack → run → unpack. Packing guarantees exact gradients and zero wasted compute on padding.
  • Bidirectional layers double the output width and let each token see future context — essential for tagging, forbidden for autoregressive generation.
  • The full production template is Embedding → Pack → BiLSTM → concat final states → Linear. This is what thousands of 2016-era NLP systems actually shipped.
  • Hand-written NumPy and nn.LSTMCell produce matching outputs (to within floating-point tolerance) when given the same weights; the fused kernel is a faster implementation, not a different algorithm.

References

  • Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780.
  • Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Forget-bias initialization.]
  • PyTorch Contributors. torch.nn.LSTM — PyTorch documentation. [Canonical source for the fused weight_ih_l0 / weight_hh_l0 layout and the i, f, g, o gate packing order referenced throughout this section.]
  • PyTorch Contributors. torch.nn.utils.rnn.pack_padded_sequence — documentation. [Reference for the pack → run → unpack workflow used in the classifier.]