Learning Objectives
By the end of this section, you will be able to:
- Understand the LSTM architecture including its cell state and gating mechanisms
- Explain the purpose of each gate (forget, input, output) and how they control information flow
- Write the complete LSTM equations and understand what each component computes
- Explain why LSTM solves the vanishing gradient problem through additive cell state updates
- Visualize gradient flow through the "constant error carousel"
- Develop intuition for when and how LSTM learns to remember or forget information
Why This Matters: LSTM was a breakthrough that made sequence modeling practical. Before LSTM, training RNNs on sequences longer than 10-20 steps was nearly impossible. After LSTM, neural networks could learn dependencies spanning hundreds of timesteps. This enabled transformative applications in machine translation, speech recognition, and language modeling. Understanding LSTM deeply is essential because: (1) it remains widely used in production systems, (2) its gating mechanisms inspired later architectures including Transformers, and (3) it beautifully illustrates how architectural choices can solve fundamental training problems.
The Story Behind LSTM
In 1997, Sepp Hochreiter and Jürgen Schmidhuber published a paper that would change the course of deep learning: "Long Short-Term Memory." The title itself captures the key insight: they wanted to enable neural networks to have both long-term memory (information that persists over many timesteps) and short-term flexibility (the ability to update and use that information appropriately).
The Problem They Were Solving
Recall from the previous section that vanilla RNNs suffer from the vanishing gradient problem. The gradient from time $t$ back to time $k$ involves a product of Jacobians:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Each factor in this product typically has norm less than 1, so the gradient decays exponentially in $t - k$. Hochreiter and Schmidhuber's key insight was: what if we could create a pathway where the gradient multiplication factor is exactly 1?
The Breakthrough Idea
The solution came from a simple but profound observation: addition doesn't shrink gradients the way multiplication does. If we update a quantity by adding to it rather than multiplying:

$$C_t = C_{t-1} + \Delta C_t$$

then the gradient $\partial C_t / \partial C_{t-1} = 1$. The gradient flows through unchanged! This is the foundation of the LSTM cell state.
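This contrast is easy to verify numerically. The sketch below (plain Python, illustrative factor 0.8 and chain length 50) estimates the end-to-end derivative of a multiplicative chain and an additive chain by finite differences:

```python
# Multiplicative chain: h_t = w * h_{t-1}  ->  dh_T/dh_0 = w ** T
# Additive chain:       c_t = c_{t-1} + u_t  ->  dc_T/dc_0 = 1
w, T = 0.8, 50

def mult_chain(h0):
    h = h0
    for _ in range(T):
        h = w * h          # repeated multiplication shrinks sensitivity to h0
    return h

def add_chain(c0):
    c = c0
    for t in range(T):
        c = c + 0.1 * t    # the added terms do not depend on c0
    return c

eps = 1e-6
grad_mult = (mult_chain(1.0 + eps) - mult_chain(1.0)) / eps
grad_add = (add_chain(1.0 + eps) - add_chain(1.0)) / eps
print(f"multiplicative gradient after {T} steps: {grad_mult:.2e}")  # ~ 0.8**50
print(f"additive gradient after {T} steps: {grad_add:.2f}")         # ~ 1.00
```

After 50 steps the multiplicative path's gradient has collapsed to about $10^{-5}$, while the additive path's gradient is still 1.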
The Constant Error Carousel
Hochreiter and Schmidhuber called this gradient-preserving additive pathway the "constant error carousel": because the update is additive, error signals can circulate through it across many timesteps without shrinking.
The Key Innovation: The Cell State
The LSTM introduces a new quantity called the cell state, denoted $C_t$. This cell state runs through the entire sequence like a "memory highway," carrying information with minimal modification.
Two Parallel Paths
Unlike vanilla RNNs, which have only the hidden state $h_t$, LSTM maintains two parallel quantities:
| Quantity | Symbol | Role | Update Mechanism |
|---|---|---|---|
| Cell State | Cₜ | Long-term memory storage | Additive updates (preserves gradients) |
| Hidden State | hₜ | Short-term output/representation | Multiplicative gating of cell state |
The cell state is the key to LSTM's success. It can remain unchanged across many timesteps if needed, or it can be selectively updated. The hidden state is derived from the cell state but filtered through an "output gate"—so the LSTM can remember things internally that it doesn't expose in its outputs.
Quick Check
What is the key difference between the cell state update in LSTM compared to the hidden state update in vanilla RNNs?
Interactive: LSTM Architecture
Explore the LSTM cell architecture interactively. Click on each gate to learn about its function, or use the "Animate Flow" button to see how information moves through the cell.
The LSTM Key Insight: The cell state (top horizontal line) acts as a "memory highway" that allows information to flow unchanged through time. The gates control what gets added to or removed from this highway. Because the cell state update uses addition rather than multiplication, gradients can flow backward through many timesteps without vanishing.
The Four Components of LSTM
An LSTM cell has four key components: three gates (forget, input, output) and a candidate cell state generator. Each plays a distinct role in controlling information flow.
1. The Forget Gate (ft)
The forget gate decides what information from the previous cell state should be discarded. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$, and outputs a value between 0 and 1 for each element of $C_{t-1}$:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
| Gate Value | Meaning | Effect on Cₜ₋₁ |
|---|---|---|
| fₜ ≈ 0 | "Forget this" | Information is discarded |
| fₜ ≈ 1 | "Remember this" | Information is preserved |
| fₜ = 0.5 | "Partially remember" | Information is attenuated by 50% |
Why 'Forget' Gate?
The name is slightly counterintuitive: $f_t \approx 1$ means remember and $f_t \approx 0$ means forget. The gate is named for what it can erase: since it multiplies $C_{t-1}$ element-wise, low values wipe out old memory.
2. The Input Gate (it)
The input gate decides which new information to add to the cell state. Like the forget gate, it uses a sigmoid to output values between 0 and 1:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
3. The Candidate Cell State (C̃t)
While the input gate decides how much new information to add, the candidate cell state determines what new information to potentially add. It uses tanh to produce values between -1 and 1:

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
Why tanh for the candidate?
Unlike the sigmoid gates, the candidate must be able to both increase and decrease cell state values, so it needs a zero-centered activation. tanh also keeps candidate values bounded in $[-1, 1]$, preventing the cell state from growing too quickly.
4. The Cell State Update
Now we can update the cell state. This is where the magic happens:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This equation says: take the old cell state, forget some parts (multiply by $f_t$), and add new information ($\tilde{C}_t$ scaled by $i_t$). The symbol $\odot$ represents element-wise (Hadamard) multiplication.
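The element-wise nature of this update is easy to see with concrete numbers. A short NumPy sketch (all values chosen purely for illustration) shows one dimension being kept, one being overwritten, and one being blended:

```python
import numpy as np

C_prev  = np.array([ 2.0, -1.0, 0.5])   # old cell state
f_t     = np.array([ 1.0,  0.0, 0.5])   # forget gate: keep / erase / halve
i_t     = np.array([ 0.0,  1.0, 0.5])   # input gate: ignore / write / blend
C_tilde = np.array([ 0.9, -0.8, 0.4])   # candidate values in [-1, 1]

# C_t = f_t * C_prev + i_t * C_tilde, applied independently per dimension
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # dimension-wise: 2.0 (kept), -0.8 (overwritten), 0.45 (blended)
```

Because each dimension has its own gate values, the LSTM can preserve some memories while rewriting others in the same step.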
Critical Observation
Notice that $C_{t-1}$ enters this update only through element-wise multiplication by $f_t$ and addition. There is no repeated multiplication by a weight matrix and no squashing nonlinearity on the direct path: this is exactly the additive pathway that preserves gradients.
5. The Output Gate (ot)
Finally, the output gate controls what part of the cell state is exposed as the hidden state output:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

The hidden state is computed by applying tanh to the cell state (to push values into $[-1, 1]$) and then filtering through the output gate:

$$h_t = o_t \odot \tanh(C_t)$$
Why Filter the Output?
The cell state may carry information that is useful later but irrelevant right now (for example, a subject's number held until the verb appears). The output gate lets the LSTM keep such information internal without it interfering with the current prediction.
Interactive: Gate Explorer
Use the sliders below to adjust the inputs and observe how each gate responds. Watch how the gates work together to update the cell state and produce the hidden state output.
Try this: Set the forget gate input to favor keeping (f > 0.9) and the input gate to favor ignoring new info (i < 0.1). Watch how the cell state Ct stays close to Ct-1. This is how LSTM "remembers" information across many timesteps!
Complete Mathematical Formulation
Let's bring all the equations together. Given input $x_t$ at time $t$, previous hidden state $h_{t-1}$, and previous cell state $C_{t-1}$, the LSTM computes:

Gate Computations

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

State Updates

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$
Notation Summary
| Symbol | Description | Dimensions |
|---|---|---|
| xₜ | Input at timestep t | d (input dimension) |
| hₜ | Hidden state (output) | n (hidden dimension) |
| Cₜ | Cell state (internal memory) | n (hidden dimension) |
| fₜ, iₜ, oₜ | Gate activations | n (hidden dimension) |
| C̃ₜ | Candidate cell state | n (hidden dimension) |
| Wf, Wᵢ, Wc, Wₒ | Weight matrices | n × (n + d) |
| bf, bᵢ, bc, bₒ | Bias vectors | n |
| σ | Sigmoid function (outputs 0-1) | — |
| tanh | Hyperbolic tangent (outputs -1 to 1) | — |
| ⊙ | Element-wise multiplication | — |
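The full set of equations maps directly onto a few lines of NumPy. This is a single-timestep sketch with random weights and hypothetical dimensions (d = 3, n = 4), not a trainable implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                                    # input / hidden dimensions (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate/candidate, each of shape n x (n + d)
W = {k: rng.normal(scale=0.1, size=(n, n + d)) for k in "fioc"}
b = {k: np.zeros(n) for k in "fioc"}

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde         # additive cell state update
    h_t = o_t * np.tanh(C_t)                   # gated output
    return h_t, C_t

h, C = np.zeros(n), np.zeros(n)
for _ in range(5):                             # run a few timesteps
    h, C = lstm_step(rng.normal(size=d), h, C)
print(h.shape, C.shape)  # (4,) (4,)
```

Note how the shapes match the notation table: the gates, candidate, and both states are all n-dimensional, and each weight matrix acts on the concatenated (n + d)-dimensional vector.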
Why LSTM Solves the Vanishing Gradient Problem
Now let's analyze why LSTM's architecture solves the vanishing gradient problem. The key is understanding the gradient of the cell state.
Gradient Through the Cell State
Consider the cell state update equation:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Taking the partial derivative with respect to $C_{t-1}$ (treating the gates as constants for this analysis):

$$\frac{\partial C_t}{\partial C_{t-1}} = f_t$$

This is remarkable! The gradient from $C_t$ to $C_{t-1}$ is simply the forget gate value $f_t$. If the network learns to set $f_t \approx 1$, the gradient passes through unchanged.
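With the gates held fixed, this derivative can be confirmed by a quick finite difference (scalar values chosen arbitrarily for illustration):

```python
f_t, i_t, c_tilde = 0.9, 0.3, 0.5    # frozen gate and candidate values (illustrative)

def cell_update(c_prev):
    # C_t = f_t * C_{t-1} + i_t * C~_t, one scalar dimension
    return f_t * c_prev + i_t * c_tilde

eps = 1e-6
grad = (cell_update(1.0 + eps) - cell_update(1.0)) / eps
print(round(grad, 4))  # 0.9, i.e. exactly the forget gate value
```

The candidate term drops out of the derivative because it does not depend on the previous cell state.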
Gradient Over Many Timesteps
For a loss $\mathcal{L}$ at time $T$, the gradient with respect to an early cell state $C_k$ is:

$$\frac{\partial \mathcal{L}}{\partial C_k} = \frac{\partial \mathcal{L}}{\partial C_T} \prod_{t=k+1}^{T} f_t$$

Compare this to the vanilla RNN, where the product involves $W_{hh}$ and $\tanh'$ terms. In LSTM:
| Network | Gradient Product | Typical Value | After 50 Steps |
|---|---|---|---|
| Vanilla RNN | ∏(Wₕₕ · tanh') | ≈ 0.8 per step | ≈ 10⁻⁵ |
| LSTM | ∏(fₜ) | ≈ 0.95 per step | ≈ 0.08 |
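The "After 50 Steps" column can be reproduced directly:

```python
# Per-step gradient factors from the table above (typical, not exact, values)
rnn_factor, lstm_factor, steps = 0.8, 0.95, 50

rnn_grad = rnn_factor ** steps    # vanilla RNN: product of ~0.8 factors
lstm_grad = lstm_factor ** steps  # LSTM: product of forget gate values ~0.95
print(f"vanilla RNN: {rnn_grad:.1e}")   # ~ 1.4e-05
print(f"LSTM:        {lstm_grad:.3f}")  # ~ 0.077
print(f"ratio:       {lstm_grad / rnn_grad:.0f}x")
```

A per-step factor only modestly closer to 1 leaves the LSTM with thousands of times more gradient signal after 50 steps.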
The Power of Learning to Remember
Crucially, $f_t$ is not a fixed hyperparameter: it is computed from the data at every timestep. The network learns when to hold $f_t$ near 1 (preserving memory and gradients) and when to drop it toward 0 (clearing stale information).
Quick Check
If the forget gate outputs fₜ = 0.98 at every timestep, what fraction of the original gradient remains after 100 timesteps?
Interactive: The Constant Error Carousel
Compare gradient flow through vanilla RNN versus LSTM. Adjust the forget gate value and sequence length to see how LSTM preserves gradients for long-term dependencies.
Why LSTM Gradients Survive
Vanilla RNN gradient:

$$\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} W_{hh}^\top \, \mathrm{diag}(\tanh'(a_t))$$

Problem: if $\|W_{hh}\| < 1$, gradients shrink exponentially.

LSTM cell state gradient:

$$\frac{\partial C_T}{\partial C_1} = \prod_{t=2}^{T} f_t$$

Solution: when $f_t \approx 1$, gradients flow unchanged!
The Constant Error Carousel: Hochreiter and Schmidhuber (1997) called the cell state a "constant error carousel" because when the forget gate f ≈ 1 and input gate i ≈ 0, the cell state update becomes Ct ≈ Ct-1. This means the gradient ∂Ct/∂Ct-1 ≈ 1, allowing errors (gradients) to flow unchanged through many timesteps. The network learns when to preserve and when to update memory.
Building LSTM Intuition
Let's develop intuition for how LSTM processes information through several examples.
Example 1: Remembering a Subject for Verb Agreement
Consider the sentence: "The cat, which was chasing the mice in the garden, is hungry."
The LSTM needs to remember that the subject is "cat" (singular) to correctly predict "is" instead of "are":
- When processing "cat": The input gate opens ($i_t \approx 1$) to store the singular-subject information in the cell state.
- During "which was chasing the mice in the garden": The forget gate stays high ($f_t \approx 1$) to preserve the subject information, while the input gate stays low ($i_t \approx 0$) to avoid overwriting it.
- When predicting the verb: The output gate opens ($o_t \approx 1$) to access the stored subject information.
Example 2: Language Modeling with Context
"I grew up in France. ... ... ... I speak fluent French."
Even with many sentences in between, the LSTM can remember "France" and use it to predict "French":
- Storing context: The word "France" triggers high input gate activation, storing location information.
- Preserving over time: High forget gate values maintain this information across the intervening text.
- Using context: When predicting the language, the network accesses the stored location to generate "French".
Example 3: Closing Brackets in Code
Matching opening and closing brackets requires counting: "((())" contains three opening brackets but only two closing ones, so one more ")" is needed.
- Opening bracket "(": Increment the cell state (input gate open, candidate $\tilde{C}_t \approx +1$).
- Closing bracket ")": Decrement the cell state (input gate open, candidate $\tilde{C}_t \approx -1$).
- Predicting next token: If cell state is positive, more closing brackets are needed.
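This counting behavior can be emulated by hand-setting the gate outputs: keep f = 1 and i = 1 at every step, and let the candidate be +1 for "(" and -1 for ")". The sketch below drives a single cell-state dimension directly (a hypothetical hand-wired cell, not a learned LSTM):

```python
def bracket_depth(sequence):
    """Track the net open-bracket count in one cell-state dimension."""
    c = 0.0                           # cell state starts empty
    for ch in sequence:
        f_t, i_t = 1.0, 1.0           # always remember, always write
        c_tilde = {"(": 1.0, ")": -1.0}[ch]
        c = f_t * c + i_t * c_tilde   # C_t = f * C_{t-1} + i * C~_t
    return c

print(bracket_depth("((())"))  # 1.0: one bracket still open
```

A trained LSTM can discover gate settings like these on its own; the point here is that the cell state update is expressive enough to implement a counter at all.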
Why These Examples Work
In each case, the task decomposes into the three operations the gates provide: write ($i_t$), hold ($f_t$), and read ($o_t$). Gradient flow along the additive cell state path is what allows these behaviors to be learned from data.
Summary
LSTM is a carefully designed architecture that solves the vanishing gradient problem through several key innovations:
Key Concepts
| Concept | Innovation | Purpose |
|---|---|---|
| Cell State | Additive updates (Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ) | Enables gradient flow without decay |
| Forget Gate | Learned forgetting (fₜ = σ(...)) | Selective memory erasure |
| Input Gate | Learned writing (iₜ = σ(...)) | Selective memory writing |
| Output Gate | Learned reading (oₜ = σ(...)) | Selective memory exposure |
| Dual State | Cell state + Hidden state | Internal memory vs. external output |
Key Equations
- Gates: $f_t, i_t, o_t = \sigma(W_{\{f,i,o\}} [h_{t-1}, x_t] + b_{\{f,i,o\}})$
- Candidate: $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
- Cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
- Output: $h_t = o_t \odot \tanh(C_t)$
- Gradient: $\partial C_t / \partial C_{t-1} = f_t$ (the key insight!)
Looking Forward
In the next section, we'll implement a complete LSTM from scratch in PyTorch and train it on real sequence tasks. We'll see how the theory translates to practice and explore training techniques specific to LSTMs.
Knowledge Check
Test your understanding of LSTM architecture and mechanics:
LSTM Knowledge Check
What is the primary purpose of the forget gate in an LSTM?
Exercises
Conceptual Questions
- Explain why the forget gate is crucial for learning long-term dependencies. What would happen if we removed the forget gate and always had $f_t = 1$?
- The output gate allows the LSTM to "hide" information in the cell state without exposing it in the hidden state. Give an example scenario where this capability would be useful.
- Compare the number of learnable parameters in a vanilla RNN cell vs. an LSTM cell with the same hidden size. Why is LSTM more parameter-efficient than simply stacking multiple RNN layers?
- Explain the role of tanh in the cell state update. Why is tanh applied before the output gate but not before the forget/input operations on the cell state?
Mathematical Exercises
- Gradient Computation: Derive the gradient $\partial C_T / \partial C_1$ for a sequence of length $T$ (treating the gates as constants). Show that it equals $\prod_{t=2}^{T} f_t$.
- Full Gradient: The gradient through LSTM also flows through the hidden state path. Write out the complete gradient including both the cell state and hidden state paths.
- Parameter Count: For an LSTM with input dimension $d$ and hidden dimension $n$, calculate the total number of learnable parameters.
Coding Exercises
- Gate Visualization: Implement a function that processes a sequence through an LSTM and records the forget gate values at each timestep. Visualize these as a heatmap for a sentence processing task.
- Memory Persistence Test: Create a synthetic task where the network must remember a binary value for $T$ timesteps before using it. Compare vanilla RNN and LSTM performance as $T$ increases.
- Gradient Analysis: Modify the LSTM implementation to track gradient magnitudes at each timestep during backpropagation. Compare to vanilla RNN and verify that LSTM maintains better gradient flow.
Solution Hints
- Exercise 1: Register forward hooks on the LSTM layer to capture intermediate activations.
- Exercise 2: The "copy memory" task: input a bit, then zeros, then a signal to output the original bit.
- Exercise 3: Use retain_graph=True in backward() and register backward hooks on the cell state tensor.
Challenge Project
Build an LSTM Debugger: Create an interactive tool that visualizes LSTM internals during sequence processing. Include:
- Real-time visualization of all gate activations
- Cell state evolution over time
- Attention-style visualization showing which inputs most strongly affected each output
- Gradient magnitude tracking through both cell state and hidden state paths
- Comparison mode to contrast vanilla RNN behavior
Now that you understand the LSTM architecture and why it works, you're ready to implement it from scratch. In the next section, we'll build a complete LSTM in PyTorch, train it on a real task, and explore practical considerations for getting LSTMs to work well in production.