From Math to Machine
You now understand the math of the LSTM and the GRU — four gates, one cell state, one hidden state (or in the GRU's case, a single update-gate budget). That math has been the same since 1997 for the LSTM and 2014 for the GRU. What changes when we leave pencil-and-paper and land on a GPU is not the equations — it is the shape of the data, the layout in memory, the fusion of many small ops into one kernel, and the engineering required to feed a real hardware pipeline without wasting cycles.
This section is where intuition meets implementation. We will build an LSTM cell from scratch in NumPy — so every multiplication is visible — and then rebuild it in PyTorch with the exact same semantics and orders-of-magnitude more speed. Along the way you will meet every practical concern of production recurrent models: tensor shapes, packed variable-length batches, bidirectional stacking, and the full end-to-end sentiment classifier pattern.
The central promise: the math does not change between the equations on paper and the nn.LSTM call in code. Only the plumbing changes. Once you see the plumbing clearly, the gap between a paper and a production model is small.
This section stays focused on the implementation of recurrent models in PyTorch. By the end you will have a working sentiment classifier and a production-grade mental model of shapes, gates, packed sequences, and bidirectional stacking. The next section, §17.4, zooms out — it traces why the field eventually moved on from recurrence and how attention, the KV-cache, and Flash Attention are the direct engineering descendants of the ideas you just learned.
nn.RNN, nn.LSTM, nn.GRU: One API, Three Cells
PyTorch exposes three recurrent modules. They share almost exactly the same signature — the differences reduce to the number of gates inside the cell and the shape of the returned state tuple.
| Module | Gates | Return value | Parameters per layer (single bias) |
|---|---|---|---|
| nn.RNN | 1 (tanh) | (output, h_n) | H·(H + I) + H |
| nn.GRU | 3 (reset, update, candidate) | (output, h_n) | 3H·(H + I) + 3H |
| nn.LSTM | 4 (input, forget, candidate, output) | (output, (h_n, c_n)) | 4H·(H + I) + 4H |
The return-value difference is the most common source of confusion. nn.LSTM is the only one that returns a tuple of states because it is the only cell with a separate cell state. Change nn.LSTM to nn.GRU in otherwise identical code and the destructuring line breaks: with a single unidirectional layer, h_n unpacks into one element instead of two and Python raises a ValueError.
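To make the return-value difference concrete, here is a minimal sketch that calls all three modules on the same input. The sizes (N=2, L=5, H_in=8, H=16) and the helper n_params are illustrative choices, not values from this chapter. The parameter counts also show that PyTorch stores two bias vectors per layer (bias_ih and bias_hh), so its totals carry twice the bias term of the single-bias formulas in the table.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, H = 8, 16                      # illustrative sizes (assumption)
x = torch.randn(2, 5, I)          # (N=2, L=5, H_in=8) with batch_first=True

rnn = nn.RNN(I, H, batch_first=True)
gru = nn.GRU(I, H, batch_first=True)
lstm = nn.LSTM(I, H, batch_first=True)

out_rnn, h_rnn = rnn(x)                    # state is a single tensor
out_gru, h_gru = gru(x)                    # state is a single tensor
out_lstm, (h_lstm, c_lstm) = lstm(x)       # state is a (h_n, c_n) tuple

# Destructuring a GRU state as if it were (h, c) fails for one layer, one direction:
# h_n has a leading axis of size 1, so there is only one element to unpack.
try:
    _, (h_bad, c_bad) = gru(x)
    unpack_failed = False
except ValueError:
    unpack_failed = True

def n_params(module):
    """Hypothetical helper: total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

# PyTorch keeps two biases per layer, hence 2 * gates * H bias terms.
print(n_params(rnn), n_params(gru), n_params(lstm))
```

If you need code that works for all three cell types, one option is to keep the state as an opaque object (`out, state = module(x)`) and only destructure it where the cell type is known.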
The Shape Contract You Must Honor
Every PyTorch recurrent layer has a strict tensor shape contract. Get the shape right and everything works; get it wrong and you will see cryptic errors or, worse, a model that trains on the wrong axis and never converges. Memorize these four tensors:
| Tensor | Shape (batch_first=True) | Shape (batch_first=False) | Meaning |
|---|---|---|---|
| input | (N, L, H_in) | (L, N, H_in) | N sequences, each of length L, with H_in features per token |
| h_0 / c_0 (initial state) | (num_layers · num_dirs, N, H_hidden) | same | Initial hidden (and cell) states. Defaults to zeros if omitted. |
| output | (N, L, num_dirs · H_hidden) | (L, N, num_dirs · H_hidden) | h_t for every real timestep of every sequence |
| h_n / c_n (final state) | (num_layers · num_dirs, N, H_hidden) | same | Final hidden (and cell) state per direction per layer |
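The table can be checked directly. The sketch below builds the same two-layer bidirectional LSTM in both layouts and prints the resulting shapes; the concrete sizes are arbitrary assumptions chosen so that every axis has a different length and a mix-up is easy to spot.

```python
import torch
import torch.nn as nn

N, L, H_in, H_hidden = 3, 7, 10, 20     # arbitrary, mutually distinct sizes
num_layers, num_dirs = 2, 2             # two stacked layers, bidirectional

lstm_bf = nn.LSTM(H_in, H_hidden, num_layers=num_layers,
                  bidirectional=True, batch_first=True)
lstm_tf = nn.LSTM(H_in, H_hidden, num_layers=num_layers,
                  bidirectional=True, batch_first=False)

x_bf = torch.randn(N, L, H_in)                  # (N, L, H_in)
x_tf = x_bf.transpose(0, 1).contiguous()        # (L, N, H_in)

# Initial states have the same shape in both layouts.
h0 = torch.zeros(num_layers * num_dirs, N, H_hidden)
c0 = torch.zeros_like(h0)

out_bf, (h_bf, c_bf) = lstm_bf(x_bf, (h0, c0))
out_tf, (h_tf, c_tf) = lstm_tf(x_tf, (h0, c0))

print(out_bf.shape)   # batch-first output
print(out_tf.shape)   # time-first output
print(h_bf.shape, h_tf.shape)
```

Note that batch_first only affects the input and output tensors; the state tensors keep the (num_layers · num_dirs, N, H_hidden) layout either way.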
Interactive: Tensor Shape Explorer
A 3-D tensor of shape (N, L, H) is hard to picture in the abstract, so sliders help. Drag the knobs on the right to change N, L, and H, and toggle batch_first to watch the layout flip. Each small cube is one scalar. Hovering an axis label highlights a slice.
An LSTM from Scratch in NumPy
Before we touch PyTorch, let's run one LSTM cell, by hand, over three timesteps. The code below is a direct translation of the four equations of the LSTM. Every number in the explanation pane is the actual computed value — not a placeholder. Click any line to see the arithmetic.
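The formulas, one last time, inline for reference: the three sigmoid gates $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$, $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$, $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$, the tanh candidate $g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$, the cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, and finally $h_t = o_t \odot \tanh(c_t)$.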
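As a minimal sketch of those equations in NumPy, the block below runs one LSTM cell over three timesteps. The names W, U, b, the sizes, and the random initialization are assumptions made for illustration; the four gate blocks are stacked row-wise in the order i, f, g, o, which is the packing order PyTorch uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4H, I) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    Rows are stacked in the order input, forget, candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # all four pre-activations at once
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    g = np.tanh(z[2 * H:3 * H])           # candidate values
    o = sigmoid(z[3 * H:4 * H])           # output gate
    c_t = f * c_prev + i * g              # cell update
    h_t = o * np.tanh(c_t)                # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
I, H, T = 2, 3, 3                         # illustrative sizes (assumption)
W = rng.normal(scale=0.5, size=(4 * H, I))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)
xs = rng.normal(size=(T, I))              # three timesteps of input

h, c = np.zeros(H), np.zeros(H)
states = []
for t in range(T):
    h, c = lstm_step(xs[t], h, c, W, U, b)
    states.append((h.copy(), c.copy()))
    print(f"t={t}  h={np.round(h, 4)}  c={np.round(c, 4)}")
```

Computing all four pre-activations with one matrix product mirrors how fused implementations lay out their weights, so the same W and U can later be copied into a PyTorch cell without reordering.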
The Same LSTM, Now in PyTorch
Two APIs cover almost every use case: nn.LSTMCell runs one step at a time (like our NumPy loop), while nn.LSTM runs the whole sequence in a single call, which on a GPU dispatches to a fused cuDNN kernel. Both produce the same results up to floating-point rounding; the fused version is an order of magnitude faster because it eliminates the Python round-trip per timestep.
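The equivalence of the two APIs can be sketched as follows: copy the parameters of a single-layer nn.LSTM into an nn.LSTMCell, then compare a manual time loop against the whole-sequence call. The sizes are arbitrary assumptions, and this runs on CPU, so it demonstrates numerical agreement rather than the speed difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, H, T, N = 4, 6, 5, 2           # illustrative sizes (assumption)

lstm = nn.LSTM(I, H, batch_first=True)
cell = nn.LSTMCell(I, H)

# Share parameters: nn.LSTM names carry an _l0 suffix, nn.LSTMCell names do not.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(N, T, I)

# Whole sequence in one call.
out_fused, (h_n, c_n) = lstm(x)

# Same computation, one timestep at a time.
h = torch.zeros(N, H)
c = torch.zeros(N, H)
steps = []
for t in range(T):
    h, c = cell(x[:, t, :], (h, c))
    steps.append(h)
out_loop = torch.stack(steps, dim=1)      # (N, T, H)

print(torch.max(torch.abs(out_fused - out_loop)).item())
```

The step-wise cell is still useful when each step depends on something computed outside the recurrence, such as sampling the next token during generation.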
Interactive: NumPy and PyTorch Agree
The two columns below are the same three-step LSTM — once computed with the hand-written NumPy path, once with the nn.LSTMCell fused kernel — on the EXACT SAME weights. Every hidden and cell state matches to floating-point precision. That is the concrete version of the claim above.
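A minimal sketch of that comparison, assuming small illustrative sizes and float64 on both sides so that rounding differences stay tiny. Here the weights flow from PyTorch to NumPy: the cell's two bias vectors are summed into the single bias b that the hand-written equations use.

```python
import numpy as np
import torch
import torch.nn as nn

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

torch.manual_seed(0)
I, H, T = 2, 3, 3                              # illustrative sizes (assumption)
cell = nn.LSTMCell(I, H).double()              # float64 to match NumPy's default

# Export the PyTorch parameters; rows are packed in the order i, f, g, o.
W = cell.weight_ih.detach().numpy()            # (4H, I)
U = cell.weight_hh.detach().numpy()            # (4H, H)
b = (cell.bias_ih + cell.bias_hh).detach().numpy()

xs = np.random.default_rng(0).normal(size=(T, I))

h_np, c_np = np.zeros(H), np.zeros(H)
h_pt = torch.zeros(1, H, dtype=torch.float64)
c_pt = torch.zeros(1, H, dtype=torch.float64)

max_diff = 0.0
for t in range(T):
    # Hand-written NumPy path.
    z = W @ xs[t] + U @ h_np + b
    i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
    g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
    c_np = f * c_np + i * g
    h_np = o * np.tanh(c_np)

    # PyTorch path on the same input and weights.
    with torch.no_grad():
        h_pt, c_pt = cell(torch.from_numpy(xs[t]).unsqueeze(0), (h_pt, c_pt))

    max_diff = max(max_diff,
                   float(np.abs(h_pt.numpy()[0] - h_np).max()),
                   float(np.abs(c_pt.numpy()[0] - c_np).max()))

print("largest absolute difference:", max_diff)
```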
Loading the NumPy weights into cell.weight_ih, cell.weight_hh, cell.bias_ih, and cell.bias_hh (on nn.LSTM the same tensors carry an _l0 suffix), splitting the NumPy cell's single fused bias across bias_ih and bias_hh, and then running the same three steps is the exercise at the end of §17.1. Any Python environment with torch installed reproduces the numbers shown here.
Packed Sequences: Variable Lengths Without Waste
Real sentences have different lengths. A batch of three inputs, "The cat sat on mat" (5 tokens), "I love it" (3 tokens), and "Go home" (2 tokens), cannot be stacked into a tensor unless we pad to a common length. Padding is the easy part; the hard part is telling the LSTM not to run its gates over the padded positions. The canonical solution is torch.nn.utils.rnn.pack_padded_sequence.
The workflow is a four-step pipeline: pad the batch to the max length, pack it into a compact representation that carries per-sequence lengths, run the LSTM, and unpack back to a regular tensor for downstream layers. During the LSTM's internal time loop, a packed sequence spends no compute on padded positions, and the outputs and gradients match those from running each sequence separately, up to floating-point rounding.
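The pad, pack, run, unpack pipeline can be sketched with the helpers in torch.nn.utils.rnn. The three random sequences below stand in for embedded sentences of lengths 5, 3, and 2; the feature and hidden sizes are assumptions. The last lines check the claim that a packed sequence gives the same result as running that sequence on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pad_sequence, pack_padded_sequence,
                                pad_packed_sequence)

torch.manual_seed(0)
H_in, H = 4, 6                                   # illustrative sizes (assumption)
lstm = nn.LSTM(H_in, H, batch_first=True)

# Stand-ins for three embedded sentences of 5, 3, and 2 tokens.
seqs = [torch.randn(5, H_in), torch.randn(3, H_in), torch.randn(2, H_in)]
lengths = torch.tensor([5, 3, 2])                # must live on the CPU

# 1. Pad to the longest sequence: (3, 5, H_in), zeros in the padded slots.
padded = pad_sequence(seqs, batch_first=True)

# 2. Pack: only the 5 + 3 + 2 = 10 real timesteps are kept.
packed = pack_padded_sequence(padded, lengths, batch_first=True)

# 3. Run the LSTM on the packed batch.
packed_out, (h_n, c_n) = lstm(packed)

# 4. Unpack back to a regular (3, 5, H) tensor; padded slots are zero-filled.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Sanity check: the 3-token sequence, run alone, gives the same states.
solo_out, (solo_h, _) = lstm(seqs[1].unsqueeze(0))
print(torch.allclose(out[1, :3], solo_out[0], atol=1e-5))
print(torch.allclose(h_n[0, 1], solo_h[0, 0], atol=1e-5))
```

The lengths here are already sorted in decreasing order, which is what pack_padded_sequence expects by default; for unsorted batches, pass enforce_sorted=False.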
Bidirectional and Stacked Layers
Two flags transform a plain LSTM into the workhorse of 2014–2017 NLP.
- bidirectional=True — run one LSTM left-to-right and another right-to-left, then concatenate their hidden states at each timestep. Output width doubles. Each token now sees both past context (from the forward pass) and future context (from the backward pass). Essential for tagging tasks where the meaning of a word depends on what comes after it.
- num_layers=k — stack LSTMs. Layer 1 consumes the embeddings. Layer 2 consumes layer 1's per-step hidden states. Dropout can be applied between layers during training to regularize. Two layers is common; four or more rarely helps without careful architectural tweaks.
With k layers and two directions the returned h_n has shape (2k, N, H_hidden). The stacking order is layer 1 forward, layer 1 backward, ..., layer k forward, layer k backward. So h_n[-2] is the last layer's forward final state and h_n[-1] is its backward final state. You almost always want to concatenate those two for downstream classification.
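The indexing of the final states is easy to verify against the per-step output. In this sketch (sizes are arbitrary assumptions, and all sequences are full length), the last layer's forward final state should equal the forward half of the output at the last timestep, and its backward final state should equal the backward half of the output at the first timestep.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, H_in, H, k = 2, 6, 5, 8, 2          # illustrative sizes (assumption)
lstm = nn.LSTM(H_in, H, num_layers=k, bidirectional=True, batch_first=True)

x = torch.randn(N, L, H_in)
out, (h_n, c_n) = lstm(x)                 # out: (N, L, 2H), h_n: (2k, N, H)

fwd_last = h_n[-2]                        # last layer, forward direction
bwd_last = h_n[-1]                        # last layer, backward direction

# The forward pass finishes at t = L-1; the backward pass finishes at t = 0.
print(torch.allclose(fwd_last, out[:, -1, :H]))
print(torch.allclose(bwd_last, out[:, 0, H:]))

# A per-sequence summary vector for a downstream classifier: (N, 2H).
summary = torch.cat([fwd_last, bwd_last], dim=1)

# An explicit (layers, directions, N, H) view can make the ordering easier to read.
h_view = h_n.reshape(k, 2, N, H)
```

This also shows why out[:, -1, :] alone is a poor summary for a bidirectional model: its backward half has seen only the final token.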
A Full Sentiment Classifier, End to End
Bringing everything together: a two-layer bidirectional LSTM sentiment classifier. Embedding lookup → packed sequence → bi-LSTM → concatenate final hidden states → linear → logits. This is the template thousands of production RNN models used in 2016.
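One way to sketch that pipeline is shown below. The class name SentimentBiLSTM, the hyperparameters, the toy token ids, and the labels are all hypothetical and chosen only to make the example run; a real model would add a tokenizer, a data loader, and a training loop.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SentimentBiLSTM(nn.Module):
    """Hypothetical two-layer bidirectional LSTM classifier (illustrative only)."""

    def __init__(self, vocab_size, embed_dim=32, hidden=64, num_layers=2,
                 num_classes=2, pad_idx=0, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=num_layers,
                            bidirectional=True, batch_first=True,
                            dropout=dropout)          # dropout between layers
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, tokens, lengths):
        emb = self.embed(tokens)                       # (N, L, embed_dim)
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)                # h_n: (2 * layers, N, hidden)
        feats = torch.cat([h_n[-2], h_n[-1]], dim=1)   # last layer, both directions
        return self.head(feats)                        # (N, num_classes) logits

torch.manual_seed(0)
model = SentimentBiLSTM(vocab_size=100)

# Toy batch: token id 0 is padding; true lengths are 5, 3, and 2.
tokens = torch.tensor([[5, 8, 13, 21, 34],
                       [7, 9, 11, 0, 0],
                       [3, 4, 0, 0, 0]])
lengths = torch.tensor([5, 3, 2])
labels = torch.tensor([1, 0, 1])                       # made-up labels

logits = model(tokens, lengths)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                        # one backward pass
print(logits.shape, loss.item())
```

Using the final hidden states from the packed run, rather than the output at the last padded position, means the classifier reads each sentence's state at its true last token.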
At this point you have the complete practical toolkit: every shape, every flag, every PyTorch helper you need to train an LSTM on real data. The remaining sections of this chapter zoom out to answer a different question: why did the field move on from this?
Looking Ahead
The sentiment classifier above is the 2016 production pattern: embedding, packed BiLSTM, concatenated final states, linear head. Beyond 2016 the field moved in a different direction. Transformers replaced the recurrent time-loop with parallel attention, the hidden-state bottleneck with a growing KV-cache, and the $O(L)$ per-sequence cost with $O(L^2)$. None of this obsoleted the LSTM, which remains the right tool for very long strictly-sequential streams and for tight-latency edge inference, but it changed what “the default sequence model” is.
The next section, §17.4 From Recurrence to Attention, is a conceptual tour of the bridge. You will see why serial computation became a bottleneck, how a KV-cache re-invents the hidden state in a lossless form, how Flash Attention tiles the softmax to get around memory bandwidth, and how multi-head attention and positional encodings complete the picture. It is a primer, not a tutorial — the full build-it-in-PyTorch treatment lives in the dedicated Transformer chapter.
Summary
- PyTorch exposes nn.RNN, nn.GRU, and nn.LSTM with a near-identical API. The LSTM is the only one returning a two-tuple state (h, c).
- The shape contract is (N, L, H_in) with batch_first=True; the state tensors carry a leading num_layers · num_dirs axis.
- For variable-length batches, always pad → pack → run → unpack. Packing guarantees exact gradients and zero wasted compute on padding.
- Bidirectional layers double the output width and let each token see future context — essential for tagging, forbidden for autoregressive generation.
- The full production template is Embedding → Pack → BiLSTM → concat final states → Linear. This is what thousands of 2016-era NLP systems actually shipped.
- Hand-written NumPy and nn.LSTMCell produce outputs that match to floating-point precision when given the same weights: the fused kernel is a faster implementation, not a different algorithm.
References
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780.
- Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Forget-bias initialization.]
- PyTorch Contributors. torch.nn.LSTM: PyTorch documentation. [Canonical source for the fused weight_ih_l0/weight_hh_l0 layout and the i, f, g, o gate packing order referenced throughout this section.]
- PyTorch Contributors. torch.nn.utils.rnn.pack_padded_sequence: documentation. [Reference for the pack → run → unpack workflow used in the classifier.]