Why Vanilla RNNs Forget
A vanilla RNN has exactly one knob for memory: the hidden state $h_t$. Every timestep it is completely overwritten by $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$. The word “completely” is the problem. To remember something from ten steps ago, that information must survive ten consecutive multiplications by $W_h$ and ten applications of the tanh nonlinearity.
Mathematically, the gradient that flows back from step $t$ to step $k$ picks up a factor $\prod_{j=k+1}^{t} \operatorname{diag}(\tanh'(a_j))\, W_h$, where $W_h$ is the recurrent weight matrix and $a_j$ is the pre-activation at step $j$. If the norm of each factor is below one, this product shrinks to zero: the vanishing gradient. If it is bigger than one, the product can explode. Either way, learning long-range dependencies is brittle.
The core problem: a vanilla RNN has no way to selectively preserve information. Every step is a full rewrite of memory through a squashing nonlinearity. There is no highway for gradients.
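The shrinking product can be watched directly. The following is a minimal NumPy sketch, not a benchmark: the hidden size, the random inputs, and the rescaling of the recurrent matrix `W_h` so that its largest singular value is 0.9 are all assumptions chosen to make the vanishing case visible.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                    # hidden size (assumed)
W_h = rng.normal(size=(H, H))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # rescale: largest singular value = 0.9 (assumed)
W_x = rng.normal(0.0, 0.3, size=(H, H))  # input weights; input size = H for brevity

def jacobian_product_norms(steps):
    """Run a vanilla RNN forward and track ||d h_t / d h_0||_2 at every step."""
    h = np.zeros(H)
    J = np.eye(H)                        # accumulated Jacobian d h_t / d h_0
    norms = []
    for _ in range(steps):
        x = rng.normal(size=H)
        h = np.tanh(W_h @ h + W_x @ x)
        # One-step Jacobian: diag(tanh'(a_t)) @ W_h, using tanh'(a) = 1 - tanh(a)^2
        J = (np.diag(1.0 - h**2) @ W_h) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

norms = jacobian_product_norms(50)
for t in (1, 10, 25, 50):
    print(f"t={t:2d}  ||dh_t/dh_0|| = {norms[t - 1]:.3e}")
```

Because every one-step Jacobian here has norm at most 0.9, the accumulated norm can only fall as the sequence gets longer. With a larger recurrent matrix the same loop can grow instead of shrink.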
The Long Short-Term Memory cell, introduced by Hochreiter and Schmidhuber in 1997 and given its forget gate by Gers, Schmidhuber and Cummins in 2000, is the classic fix. It adds a second state variable, the cell state, whose update is additive rather than multiplicative, and three learned gates that decide how that memory is read, written, and exposed. Once you understand those four pieces, the entire architecture falls into place.
The Cell State: A Memory Highway
Picture a conveyor belt running along the top of each LSTM cell. On that belt rides the cell state $C_t$, a vector the same size as the hidden state. The entire mechanism of the LSTM is about carefully editing what is on that belt:
- The forget gate $f_t$ scales $C_{t-1}$ element-wise, erasing parts of memory the cell no longer needs.
- The input gate $i_t$ and the candidate $\tilde{C}_t$ together decide what new content to write onto the belt.
- The output gate $o_t$ decides what part of the belt's current contents to expose through the hidden state $h_t$.
The whole update is captured in one short, additive line: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. That plus sign is the entire reason LSTMs work. When we backpropagate through it, the derivative $\partial C_t / \partial C_{t-1} = f_t$ (element-wise, along the direct path) has no squashing nonlinearity wrapped around it, so if the network learns to keep $f_t \approx 1$, gradients flow nearly unchanged across hundreds of steps.
The Four Sub-Networks Inside a Cell
Every LSTM cell contains four tiny feed-forward layers, and each of them takes exactly the same input: the concatenation $[h_{t-1}, x_t]$. Three of them produce sigmoid-gated vectors in $(0, 1)$; one produces a tanh-bounded candidate in $(-1, 1)$.
| Symbol | What it computes | Activation | Role |
|---|---|---|---|
| f_t | σ(W_f · [h_{t-1}, x_t] + b_f) | sigmoid | erase fraction per channel of old memory |
| i_t | σ(W_i · [h_{t-1}, x_t] + b_i) | sigmoid | write fraction per channel of new memory |
| C̃_t | tanh(W_C · [h_{t-1}, x_t] + b_C) | tanh | content of the new memory proposal |
| o_t | σ(W_o · [h_{t-1}, x_t] + b_o) | sigmoid | expose fraction per channel to h_t |
The choice of activation is not arbitrary. Gates are switches, and sigmoid's range of $(0, 1)$ is exactly the vocabulary of switches: “let 90% through” is a meaningful thing to say. The candidate is content, and content needs a sign: the negative half of tanh lets a memory slot move in the opposite direction (e.g. a word like “not” subtracting from an accumulated sentiment).
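The four rows of the table above translate almost line for line into NumPy. This is only a sketch: the sizes `H` and `D`, the random weights, and the dictionary layout are placeholders chosen for illustration, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
H, D = 3, 2                                  # hidden and input sizes (assumed)
h_prev = rng.normal(size=H)
x_t = rng.normal(size=D)
z = np.concatenate([h_prev, x_t])            # shared input [h_{t-1}, x_t]

# One (H, H + D) weight matrix and one bias per sub-network (random placeholders).
W = {name: rng.normal(0.0, 0.5, size=(H, H + D)) for name in ("f", "i", "C", "o")}
b = {name: np.zeros(H) for name in ("f", "i", "C", "o")}

f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate, values in (0, 1)
i_t = sigmoid(W["i"] @ z + b["i"])           # input gate, values in (0, 1)
C_tilde = np.tanh(W["C"] @ z + b["C"])       # candidate, values in (-1, 1)
o_t = sigmoid(W["o"] @ z + b["o"])           # output gate, values in (0, 1)

for name, v in [("f_t", f_t), ("i_t", i_t), ("C~_t", C_tilde), ("o_t", o_t)]:
    print(f"{name:5s}", np.round(v, 3))
```

All four sub-networks read the same vector `z`; only the weights and the final activation differ.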
The Forget Gate
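$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. For each dimension $j$ of the cell state, $f_t$ returns a number between 0 and 1 that will multiply $C_{t-1}$:
- $f_t[j] \approx 1$: channel $j$ of memory passes through essentially unchanged.
- $f_t[j] \approx 0$: channel $j$ is erased.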
- Intermediate values fade memory gradually.
A classic intuition: suppose one channel of $C_t$ encodes the subject's gender (for subject-verb agreement). When a new sentence begins, the forget gate on that channel should go to zero: “flush the gender, a new subject is coming.” The network learns to do this from data.
Input Gate and Candidate Memory
Writing to memory is split into how much and what — the input gate and the candidate.
- Input gate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$. A sigmoid-valued vector. Close to 1 means “yes, write heavily into this channel this step”; close to 0 means “skip this channel.”
- Candidate $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$. A tanh-bounded vector of proposed new content. Unlike the gates, this one is signed.
The two combine as $i_t \odot \tilde{C}_t$. Think of it as (how loud) × (what to say). Splitting these two concerns is crucial: a single tanh-of-linear cannot express “I have strong content but I don't want to write it hard right now.” With the gate, the network can.
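A tiny numeric illustration of the “how loud × what to say” split, using hand-picked values rather than anything a trained network produced:

```python
import numpy as np

C_tilde = np.array([0.95, -0.90, 0.10])   # strong content in channels 0 and 1
i_closed = np.array([0.02, 0.02, 0.02])   # gate nearly shut: "not now"
i_open = np.array([0.98, 0.98, 0.98])     # gate nearly open: "write it"

write_closed = i_closed * C_tilde         # element-wise product i_t * C~_t
write_open = i_open * C_tilde

print("gate closed:", write_closed)
print("gate open  :", write_open)
```

The candidate is identical in both cases; only the gate decides whether that content reaches the cell state.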
Updating the Cell State
Everything so far has been preparation for one line, the line that defines the LSTM:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
That's it. Two element-wise products and an element-wise sum. The first term erases; the second term writes. There is no activation wrapped around the result. That is what makes the highway.
Take the derivative along the highway: $\partial C_t / \partial C_{t-1} = \operatorname{diag}(f_t)$. No tanh derivative (which can be arbitrarily close to 0), no repeated weight matrix (which magnifies or shrinks across steps). Just a direct element-wise multiplier the network can learn to keep near 1 when it needs to remember, so the memory of the sequence survives.
Constant Error Carousel. Hochreiter and Schmidhuber's original name for this mechanism. When $f_t = 1$ and $i_t = 0$, the cell state literally just circulates: $C_t = C_{t-1}$, and gradients pass through with factor 1. Carousels go around forever; so do these gradients.
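Both claims are easy to check numerically. In this sketch the gate values are set by hand (an assumption made for illustration; in a real cell they come from the sigmoid layers), first to the carousel setting and then to arbitrary values for a finite-difference check of the derivative.

```python
import numpy as np

def cell_update(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t, all element-wise."""
    return f_t * C_prev + i_t * C_tilde

C0 = np.array([0.7, -1.3, 2.0])
C_tilde = np.array([0.5, 0.5, -0.5])

# Carousel setting: forget gate fully open, input gate fully closed.
C = C0.copy()
for _ in range(500):
    C = cell_update(C, np.ones(3), np.zeros(3), C_tilde)
print("after 500 steps:", C)                  # identical to C0

# With the gates held fixed, the update is linear in C_prev,
# so a finite difference recovers f_t channel by channel.
f_t = np.array([0.9, 0.5, 0.1])
i_t = np.array([0.3, 0.3, 0.3])
eps = 1e-6
numeric = (cell_update(C0 + eps, f_t, i_t, C_tilde)
           - cell_update(C0, f_t, i_t, C_tilde)) / eps
print("numeric dC_t/dC_{t-1}:", np.round(numeric, 6), " f_t:", f_t)
```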
The Output Gate and Hidden State
The cell state can hold values much larger than 1 if the network wants (it is bounded only by what the gates write over time). But downstream layers expect a bounded, stable signal. So the LSTM exposes a second, tamer view of memory:

$$h_t = o_t \odot \tanh(C_t)$$

Two things happen here. First, $\tanh$ squashes memory into $(-1, 1)$, a “readable” form. Second, $o_t$ masks it: channels the network chooses not to reveal this step are zeroed out in $h_t$. The cell state itself is untouched.
Because $h_t$, not $C_t$, is what the next timestep's gates see, the output gate is also the LSTM's way of saying “this part of memory is not relevant to the next prediction” while still keeping it on the belt for later.
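With all four gates and both states defined, one full step of the cell can be sketched in a few lines of NumPy. The helper names `lstm_step` and `init_params`, the sizes, and the random weights and inputs are illustrative assumptions, so the printed numbers are not meant to match any table in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step: four sub-networks, additive cell update, gated readout."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate memory
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    C_t = f_t * C_prev + i_t * C_tilde                     # erase, then write
    h_t = o_t * np.tanh(C_t)                               # bounded, masked view
    return h_t, C_t

def init_params(hidden, inputs, seed=0):
    """Random placeholder weights; biases start at zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for g in ("f", "i", "C", "o"):
        params[f"W_{g}"] = rng.normal(0.0, 0.5, size=(hidden, hidden + inputs))
        params[f"b_{g}"] = np.zeros(hidden)
    return params

H, D = 2, 3                                         # toy sizes (assumed)
params = init_params(H, D)
xs = np.random.default_rng(1).normal(size=(3, D))   # three placeholder embeddings
h, C = np.zeros(H), np.zeros(H)
for t, x_t in enumerate(xs, start=1):
    h, C = lstm_step(x_t, h, C, params)
    print(f"step {t}: h = {np.round(h, 4)}  C = {np.round(C, 4)}")
```

Keeping the four weight matrices separate mirrors the equations; production implementations usually fuse them into one matrix multiply.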
Interactive: LSTM Architecture
Click any gate in the diagram below to see its equation, description, and the intuition behind it. Press Animate Flow to watch information move through the cell in the canonical order: forget → write → update → expose.
Interactive: The Gates at Work
Numbers bring intuition. Drag the sliders to set the previous hidden state $h_{t-1}$, the current input $x_t$, and the previous cell state $C_{t-1}$. Watch how each gate responds and how the two final outputs, $C_t$ and $h_t$, are assembled from them. Try the perfect remembering configuration in the hint below the visualization: push the forget gate toward 1 and the input gate toward 0. You will see $C_t$ essentially equal to $C_{t-1}$: memory preserved.
A Full Forward Pass with Real Numbers
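Before we write code, let's run the cell by hand for one step. Take hidden size 2 and a small input size. Our first input is the token “I” with embedding $x_1$, and both states start at zero: $h_0 = C_0 = 0$.
- Concatenate: $z_1 = [h_0, x_1]$.
- Forget gate: $f_1 = \sigma(W_f z_1 + b_f)$, so $f_1 \odot C_0 = 0$ because the memory starts empty.
- Input gate: $i_1 = \sigma(W_i z_1 + b_i)$.
- Candidate: $\tilde{C}_1 = \tanh(W_C z_1 + b_C)$.
- Output gate: $o_1 = \sigma(W_o z_1 + b_o)$.
- Cell state: $C_1 = f_1 \odot C_0 + i_1 \odot \tilde{C}_1 = [0.0485, -0.0276]$.
- Hidden state: $h_1 = o_1 \odot \tanh(C_1) = [0.0223, -0.0146]$.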
These are not made-up numbers; they are what you will see printed when you run the NumPy code below. The point is that once the formulas are written out, running the cell is just algebra: four affine layers and a few element-wise products.
LSTM From Scratch in Python
Here is the complete cell, implemented with nothing but NumPy. Click any line to see the exact tensor values at that point — including the actual matrix entries of each gate's weight matrix, the pre-activations, and the gate outputs for all three timesteps of our toy sequence “I love it”.
Run the code. You should see:
| Step | Token | h_t | C_t |
|---|---|---|---|
| 1 | I | [ 0.0223, -0.0146] | [ 0.0485, -0.0276] |
| 2 | love | [ 0.0839, 0.0504] | [ 0.1749, 0.0919] |
| 3 | it | [ 0.1183, 0.1549] | [ 0.2092, 0.3480] |
Notice how the magnitudes of $C_t$ are accumulating across steps (memory building up) while the magnitudes of $h_t$ stay small, bounded by $\tanh$ and masked by $o_t$. That is the highway-vs-readout distinction playing out in actual numbers.
The Same LSTM in PyTorch
Now watch how the same mechanism becomes two lines in PyTorch. The NumPy version makes the math concrete; the PyTorch version is what you will actually use. We show both ways: nn.LSTMCell (one step at a time, mirroring our NumPy loop) and nn.LSTM (the whole sequence in a single call).
The output values differ from the NumPy version because PyTorch uses its own random initialization for the weight matrices. The mechanism is identical, line for line — the same four gates, the same additive cell update, the same gated hidden state.
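A minimal sketch of the two PyTorch entry points side by side. The toy sizes and the random input tensor are assumptions for illustration, and the two modules are initialized independently, so their numbers will differ from each other as well as from the NumPy run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch = 3, 2, 3, 1   # toy sizes (assumed)
x = torch.randn(seq_len, batch, input_size)            # (time, batch, features)

# Way 1: nn.LSTMCell, one step at a time.
cell = nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    print(f"step {t + 1}: h = {h.detach().numpy().round(4)}")

# Way 2: nn.LSTM, the whole sequence in a single call.
lstm = nn.LSTM(input_size, hidden_size)
out, (h_n, c_n) = lstm(x)          # initial states default to zeros
print(out.shape, h_n.shape, c_n.shape)
```

`out` holds the hidden state at every timestep, while `h_n` and `c_n` hold only the final hidden and cell states, with a leading dimension for the number of layers.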
| Aspect | NumPy from scratch | nn.LSTMCell | nn.LSTM |
|---|---|---|---|
| Weights | You provide | Random init | Random init |
| Gates | Four separate matmuls | One fused matmul | One fused matmul, batched over time |
| Loop over time | Python for-loop | Python for-loop | Optimized C++/CUDA kernel |
| Batching | One sequence | Batch-aware | Batch + sequence in one tensor |
| GPU | None | .to('cuda') | .to('cuda') |
| When to use | Learning / debugging | Custom step logic | Standard training |
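The two PyTorch columns describe the same computation, which can be checked by copying one module's parameters into the other. This sketch assumes a single-layer, unidirectional `nn.LSTM`, whose layer-0 parameters are exposed as `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, and `bias_hh_l0`; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch = 3, 2, 5, 4   # toy sizes (assumed)
lstm = nn.LSTM(input_size, hidden_size)                # single layer, unidirectional
cell = nn.LSTMCell(input_size, hidden_size)

# Copy the layer-0 parameters of nn.LSTM into the cell so both share weights.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(seq_len, batch, input_size)
with torch.no_grad():
    out_fast, (h_n, c_n) = lstm(x)                     # whole sequence at once
    h = torch.zeros(batch, hidden_size)
    c = torch.zeros(batch, hidden_size)
    steps = []
    for t in range(seq_len):                           # same sequence, step by step
        h, c = cell(x[t], (h, c))
        steps.append(h)
    out_loop = torch.stack(steps)

max_diff = (out_fast - out_loop).abs().max().item()
print(f"max |nn.LSTM - looped nn.LSTMCell| = {max_diff:.2e}")
```

Any difference should be at the level of floating-point rounding. The same pattern works as a regression test for a hand-written cell.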
Stepping nn.LSTMCell through a sequence must reproduce the output of a single nn.LSTM call with the same weights; the second is just a faster implementation of the first. PyTorch's tests enforce this. Your own custom RNN variants should do the same.
Why LSTMs Actually Learn Long Memory
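We have talked about the gradient highway abstractly; the visualization below makes it quantitative. It compares the magnitude of the gradient $\prod_t \operatorname{diag}(\tanh'(a_t))\, W_h$ (for a vanilla RNN, with $W_h$ the recurrent weight matrix and $a_t$ the pre-activation at step $t$) against $\prod_t f_t$ (for an LSTM) as a function of sequence length. Drag the sliders and watch the RNN gradients collapse while the LSTM gradients survive.
Read off the math from the visualization. In a vanilla RNN the gradient reaching timestep 1 is $\partial h_T / \partial h_1 = \prod_{t=2}^{T} \operatorname{diag}(\tanh'(a_t))\, W_h$, a product of many potentially small terms. In the LSTM the dominant term is $\partial C_T / \partial C_1 \approx \prod_{t=2}^{T} \operatorname{diag}(f_t)$, and the network is free to learn $f_t \approx 1$ whenever it needs memory to survive.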
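The comparison can be reduced to scalars for a back-of-the-envelope check. In this sketch the per-step factors are constants picked by hand (`w = 0.9`, `tanh_prime = 0.6`, `f = 0.99`); they stand in for quantities that vary from step to step in a real network.

```python
def rnn_factor(T, w=0.9, tanh_prime=0.6):
    """Scalar stand-in for the vanilla-RNN product: (w * tanh')^T."""
    return (w * tanh_prime) ** T

def lstm_factor(T, f=0.99):
    """Highway term for the LSTM: f^T with a constant forget-gate value f."""
    return f ** T

print(f"{'T':>4} {'vanilla RNN':>14} {'LSTM f=0.99':>14} {'LSTM f=1.0':>12}")
for T in (1, 10, 50, 100):
    print(f"{T:>4} {rnn_factor(T):>14.3e} {lstm_factor(T):>14.3e} "
          f"{lstm_factor(T, 1.0):>12.3e}")
```

The point is the difference in rate: one product decays geometrically with a factor the network cannot easily control, while the other decays with a factor the forget gate can hold near 1.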
Interactive: BPTT Gradient Unroller
Unroll the cell in 3D and watch a gradient particle travel backward along the memory highway. Drag the slider, flip between LSTM and vanilla RNN, and compare the final gradient magnitude displayed above the canvas. The LSTM's particle stays bright while the RNN's fades: that is the exponential shrink from the repeated tanh-and-weight product playing out in pixels.
BPTT by Hand vs Autograd
The highway argument says $\partial C_t / \partial C_{t-1} \approx \operatorname{diag}(f_t)$, but that is only the dominant backward path. The full Jacobian also includes smaller contributions that flow through $h_{t-1}$ into the next step's gates. We check both against PyTorch's autograd.
The autograd result is a few percent larger than the highway approximation; those extra percent are the h-path contributions the hand derivation explicitly ignored. The key point stands: the dominant term is the diagonal $\operatorname{diag}(f_t)$, and it is the one the network can keep near 1 by learning appropriate gate weights.
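One way such a check might be set up is sketched below, using `torch.autograd.functional.jacobian` on a hand-unrolled cell. The weights, inputs, forget-gate bias of 2, and sequence length are arbitrary assumptions, so the size of the gap between the two matrices will differ from the figure quoted above.

```python
import torch

torch.manual_seed(0)
H, D, T = 3, 2, 6                                     # toy sizes (assumed)
W = {g: 0.3 * torch.randn(H, H + D) for g in "fiCo"}  # random placeholder weights
b = {g: torch.zeros(H) for g in "fiCo"}
b["f"] += 2.0                                         # bias the forget gate toward "keep"
xs = torch.randn(T, D)
h0 = torch.zeros(H)

def run(C0, steps=T, return_gates=False):
    """Unroll the cell from C0 with h_0 held fixed; return C_T (and the f_t's)."""
    h, C, fs = h0, C0, []
    for t in range(steps):
        z = torch.cat([h, xs[t]])
        f = torch.sigmoid(W["f"] @ z + b["f"])
        i = torch.sigmoid(W["i"] @ z + b["i"])
        g = torch.tanh(W["C"] @ z + b["C"])
        o = torch.sigmoid(W["o"] @ z + b["o"])
        C = f * C + i * g
        h = o * torch.tanh(C)
        fs.append(f)
    return (C, fs) if return_gates else C

C0 = torch.randn(H)
full = torch.autograd.functional.jacobian(run, C0)    # full dC_T/dC_0, shape (H, H)
_, fs = run(C0, return_gates=True)
highway = torch.diag(torch.stack(fs).prod(dim=0))     # diag(prod_t f_t)

print("autograd Jacobian:\n", full)
print("highway approximation:\n", highway)
```

For a single step with $h_0$ held fixed the two agree exactly, because the gates at that step do not depend on $C_0$. The off-diagonal entries and the diagonal mismatch only appear once the cell state has had a chance to influence later gates through the hidden state.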
The Forget-Bias Trick, Ablated
The forget-bias-to-+1 heuristic (Jozefowicz, Zaremba & Sutskever, 2015) initializes $b_f = 1$ so that at the start of training the forget gate is not halving memory every step. The following zero-input experiment isolates the effect: we seed one unit of memory and watch it decay over 20 steps with $b_f = 0$ vs $b_f = 1$.
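A stripped-down version of that experiment fits in a few lines. The simplifying assumption, stated here and in the code, is that the gate weights are zero, so the forget gate equals $\sigma(b_f)$ at every step and nothing new is written to memory.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_decay(b_f, steps=20):
    """Decay of one seeded memory unit under zero input.

    Simplifying assumption: the gate weights are zero, so f_t = sigmoid(b_f)
    at every step and the write term i_t * C~_t is zero.
    """
    C = 1.0                      # one unit of seeded memory
    trace = [C]
    for _ in range(steps):
        f_t = sigmoid(b_f)
        C = f_t * C
        trace.append(C)
    return trace

for b_f in (0.0, 1.0):
    trace = memory_decay(b_f)
    print(f"b_f={b_f:.0f}: f={sigmoid(b_f):.3f}  C_20={trace[-1]:.3e}")
```

With $b_f = 0$ the memory halves every step and is down to roughly one millionth after 20 steps; with $b_f = 1$ the gate sits near 0.73 and about three orders of magnitude more of the signal survives.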
The Copy Task: An Experimental Proof
Hochreiter and Schmidhuber introduced the copy task in the original 1997 paper as a stress test for long-range memory. The network sees $K$ content tokens, then $L$ distractor steps, then a cue telling it to reproduce the original tokens. A vanilla RNN cannot solve this when $L$ exceeds ~10; an LSTM solves it at $L = 100$ and beyond (Arjovsky, Shah & Bengio, 2016; Bai, Kolter & Koltun, 2018).
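For concreteness, here is a sketch of a data generator for one common formulation of the task. The token encoding (0 for blank, 1 to 8 for content, 9 for the recall cue), the function name `make_copy_batch`, and the default sizes are assumptions made for this example; published benchmarks differ in such details.

```python
import numpy as np

def make_copy_batch(batch, K=10, L=100, n_symbols=8, seed=0):
    """Build one batch of the copy task (one common formulation).

    Token ids: 0 = blank, 1..n_symbols = content, n_symbols + 1 = recall cue.
    Input : K content tokens, L blanks, one cue, then K blanks.
    Target: blanks everywhere except the last K positions, which repeat the content.
    """
    rng = np.random.default_rng(seed)
    cue = n_symbols + 1
    total = K + L + 1 + K
    content = rng.integers(1, n_symbols + 1, size=(batch, K))
    x = np.zeros((batch, total), dtype=np.int64)
    y = np.zeros((batch, total), dtype=np.int64)
    x[:, :K] = content           # content tokens at the start of the input
    x[:, K + L] = cue            # cue arrives after L distractor steps
    y[:, -K:] = content          # the answer is expected only at the very end
    return x, y

x, y = make_copy_batch(batch=2, K=5, L=12)
print("input :", x[0])
print("target:", y[0])
```

The difficulty is controlled entirely by `L`: the content must be held in memory across that many uninformative steps before it is needed.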
LSTM Variants — Peephole, CIFG, ConvLSTM
The LSTM you have just learned is the 1997 / 2000 canonical cell. Over the next two decades researchers proposed dozens of mutations. Greff et al. (2017), in “LSTM: A Search Space Odyssey”, summarized about 5400 training runs sweeping eight of the most plausible variants on three sequence-modelling tasks and reported a striking result: none of them consistently beat the vanilla cell by a meaningful margin. The ones worth knowing about are below, not because you should usually reach for them, but because you will see them in the wild and the names should mean something.
| Variant | What changes | When it is used |
|---|---|---|
| Peephole LSTM (Gers, Schraudolph & Schmidhuber, 2002) | Gates see the cell state directly: f_t, i_t, o_t also read C_{t-1} (or C_t for o_t), not just [h_{t-1}, x_t] | Timing-sensitive tasks (rhythm, precise interval detection) — lets the gate condition on the memory itself, not just the gated output. |
| CIFG / Coupled Input-Forget Gate (Greff et al., 2017) | i_t = 1 − f_t, forcing keep and write to sum to 1 (exactly the GRU's 1−z / z coupling, grafted onto an LSTM) | Reduces parameter count by ~25% with near-identical task performance — the explicit bridge between LSTM and GRU. |
| ConvLSTM (Shi et al., 2015) | Replace every matrix multiply inside the cell with a 2-D convolution; gates and states are feature maps, not vectors | Spatio-temporal problems — precipitation nowcasting (the original paper), video prediction, weather modelling. |
| Layer-Norm LSTM (Ba, Kiros & Hinton, 2016) | Apply LayerNorm to the pre-activations of each gate | Helps training stability on long sequences; standard in many speech models. |
| Bidirectional LSTM (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005) | Run one LSTM forward, one backward; concatenate hidden states per step | Tagging, NER, acoustic modelling — any task where both past and future context are available at inference. |
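Of the variants in the table, CIFG is the smallest change to write down. The sketch below assumes the same concatenated-input layout as the standard cell and simply ties the input gate to the forget gate; the function name `cifg_step`, the sizes, and the random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cifg_step(x_t, h_prev, C_prev, W_f, b_f, W_C, b_C, W_o, b_o):
    """One CIFG step: the LSTM update with i_t tied to 1 - f_t."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)
    i_t = 1.0 - f_t                       # coupled gate: no W_i, b_i needed
    C_tilde = np.tanh(W_C @ z + b_C)
    o_t = sigmoid(W_o @ z + b_o)
    C_t = f_t * C_prev + i_t * C_tilde    # convex combination of old and new
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
H, D = 4, 3                               # toy sizes (assumed)
Ws = [rng.normal(0.0, 0.5, size=(H, H + D)) for _ in range(3)]
bs = [np.zeros(H) for _ in range(3)]
h, C = np.zeros(H), np.zeros(H)
h, C = cifg_step(rng.normal(size=D), h, C, Ws[0], bs[0], Ws[1], bs[1], Ws[2], bs[2])

n_lstm = 4 * (H * (H + D) + H)            # four sub-networks
n_cifg = 3 * (H * (H + D) + H)            # three sub-networks
print("h:", np.round(h, 4), " params:", n_cifg, "vs", n_lstm)
```

Dropping one of four equally sized sub-networks is where the roughly 25% parameter saving comes from.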
Check Your Understanding
Four short questions to make sure the pieces are connected. The answers lead with a one-sentence justification, not just “A”.
Summary
- Two states, not one. $C_t$ is long-term memory (no activation on its update), $h_t$ is the gated, bounded output.
- Four sub-networks, one input. All of $f_t, i_t, \tilde{C}_t, o_t$ are a sigmoid or tanh applied to an affine function of the same vector $[h_{t-1}, x_t]$.
- Gates use sigmoid; content uses tanh. Switches want $(0, 1)$; signed content wants $(-1, 1)$.
- The whole point is one equation: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. Additive updates make the gradient highway possible.
- Hidden state reads memory, doesn't own it: $h_t = o_t \odot \tanh(C_t)$.
With the LSTM cell in hand, the next section unpacks the GRU — a later simplification that merges the forget and input gates into a single update gate and drops the cell state entirely. You will see how many LSTM ideas survive in a smaller package, and where the two architectures diverge.
References
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780. [Introduces the LSTM cell and the Constant Error Carousel.]
- Gers, F. A., Schmidhuber, J. & Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation 12(10), 2451–2471. [Adds the forget gate to the original 1997 architecture.]
- Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Shows empirically that initializing the forget-gate bias to +1 consistently improves training.]
- Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013 / arXiv:1211.5063. [The canonical analysis of the vanishing/exploding gradient problem in RNNs.]
- Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10), 2222–2232 / arXiv:1503.04069. [Large-scale ablation of LSTM variants — the vanilla cell holds up.]
- Gers, F. A., Schraudolph, N. N. & Schmidhuber, J. (2002). Learning Precise Timing with LSTM Recurrent Networks. Journal of Machine Learning Research 3, 115–143. [Introduces peephole connections.]
- Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K. & Woo, W. (2015). Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. NeurIPS 2015 / arXiv:1506.04214. [ConvLSTM.]
- Ba, J. L., Kiros, J. R. & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450. [Includes the Layer-Norm LSTM variant.]
- Schuster, M. & Paliwal, K. K. (1997). Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45(11), 2673–2681. [Introduces the bidirectional recurrent architecture.]
- Graves, A. & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610. [Bidirectional LSTM applied to speech.]
- Arjovsky, M., Shah, A. & Bengio, Y. (2016). Unitary Evolution Recurrent Neural Networks. ICML 2016 / arXiv:1511.06464. [Benchmarks the copy task at long distractor lengths — LSTM solves L=100+, vanilla RNN fails at L>10.]
- Bai, S., Kolter, J. Z. & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271. [Additional copy-task benchmarks cited by the Copy-Task demo.]