Working Memory: a Running Summary
When you read this sentence, you maintain a running summary of the words that came before. By the time you reach the period you have compressed a couple of dozen tokens into a small mental state — enough to disambiguate this, they, that on the next line. Your hippocampus does it for navigation; your auditory cortex does it for music; your frontal cortex does it for plans that span minutes. Recurrent neural networks are the engineering caricature of that capacity.
For RUL prediction the same idea applies to a different kind of memory: as a turbofan's sensor stream rolls past, the model needs to integrate slow degradation patterns over many cycles. A single spike at cycle 23 means little; a slow upward drift over cycles 5-30 means the engine is dying. Convolutions (Section 3.2) only see one kernel width of cycles at a time. RNNs carry a hidden state across the entire window.
Vanilla RNN and Its Failure Mode
The simplest recurrent cell, the “vanilla RNN”, has one update equation:
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$
The hidden state $h_t$ at step $t$ is a learned non-linear function of the current input $x_t$ and the previous hidden state $h_{t-1}$. Repeat for $T$ steps and you have read the whole sequence. Beautiful, simple, and broken on long sequences.
The problem is the gradient. Backpropagating through a length-$T$ RNN produces a product of $T$ Jacobians, each containing a $\tanh$ derivative bounded above by 1. With $T = 30$, the gradient reaching the first cycle has passed through 30 such factors; if their typical magnitude is some $\gamma < 1$, it is attenuated by roughly $\gamma^{T}$ in the worst case. The model effectively cannot learn dependencies more than ~10 cycles long. This is the famous vanishing gradient problem.
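The shrinking product can be checked numerically. The sketch below assumes a scalar RNN $h_t = \tanh(w_h h_{t-1} + x_t)$ with a hypothetical recurrent weight of 0.9 and a constant input of 0.5; both values are illustrative choices, not taken from the book's model.

```python
import math

def gradient_through_time(w_h, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w_h * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w_h * h + x)
        # One Jacobian factor per step: d h_t / d h_{t-1} = w_h * (1 - h_t**2).
        grad *= w_h * (1.0 - h * h)
    return abs(grad)

# Hypothetical setup: recurrent weight 0.9, constant input 0.5.
for T in (5, 10, 30):
    print(T, gradient_through_time(w_h=0.9, inputs=[0.5] * T))
```

Every factor in the product is below 1 here, so the printed magnitude drops by many orders of magnitude between 5 and 30 steps.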
The LSTM Cell: Four Gates
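An LSTM cell maintains two state vectors that propagate through time: $c_t$, the cell state, a long-term memory; and $h_t$, the hidden output, what downstream layers see. Three gates and one candidate regulate the update:
| Symbol | Equation | Meaning |
|---|---|---|
| $i_t$ | $\sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ | Input gate: how much new info to admit |
| $f_t$ | $\sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ | Forget gate: how much old $c$ to keep |
| $\tilde{c}_t$ | $\tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$ | Candidate update: what new content |
| $o_t$ | $\sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ | Output gate: how much $c$ to expose |
Together they update the cell and hidden state via
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$
The element-wise product $\odot$ is critical: the cell state is updated by simple addition and pointwise scaling, with no matrix multiplication on the cell-state path. That is why gradients survive across long sequences: along that path the partial derivative is $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$, which is element-wise close to 1 when the forget gate is open.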
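A short worked calculation makes the contrast with the vanilla RNN concrete. Writing $c_t$ for the cell state and $f_t$ for the forget gate, and holding the gates fixed (a simplifying assumption that ignores their own dependence on earlier states), the gradient along the direct cell-state path is a product of forget-gate values. The constant value 0.99 below is purely illustrative.

```latex
\frac{\partial c_T}{\partial c_k}
  \;\approx\; \prod_{t=k+1}^{T} \operatorname{diag}(f_t),
\qquad
f_t \equiv 0.99,\; T - k = 29
  \;\Longrightarrow\; 0.99^{29} \approx 0.75 .
```

With the forget gate nearly open, about three quarters of the gradient still reaches the first cycle of a 30-cycle window.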
Interactive: Step Through 8 Timesteps
The trace below runs a scalar LSTM cell on an 8-step input pulse (off for three steps, on for three, then off for two). Press play or step one timestep at a time. Watch the cell state ramp up when the pulse arrives, approach 1 while the pulse is active, then decay slowly back toward zero after the pulse ends: the LSTM is remembering the recent input.
The forget gate stays at about 0.73 throughout because the input does not affect it (its input weight is hard-coded to zero). So 73% of the cell state survives each step, which is what gives the post-pulse decay its characteristic half-life. In a real LSTM the forget gate is learned and can range from 0 (forget everything) to nearly 1 (remember forever).
Interactive: The Unrolled Diagram
Another way to see an RNN is to unroll it across time — draw one cell per timestep, with arrows showing how the hidden state flows from left to right. The animation below unrolls a generic recurrent cell over six timesteps; the same picture applies to any RNN variant including LSTMs.
[Figure: RNN unrolled through time. Each hidden state $h_t$ depends on the previous state $h_{t-1}$.]
Python: An LSTM Cell From Scratch
Twenty-five lines of NumPy and the entire algorithm is exposed. We write a scalar LSTM cell (one input dim, one hidden dim) so the gates are visible as actual numbers. Generalising to vector cells is trivial — replace each scalar weight with a matrix, each scalar multiplication with a matmul, and the algebra of the four gates is unchanged.
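A minimal sketch of such a scalar cell follows. The weights and the 0/1 pulse are assumptions, picked so that the printed cell states line up with the values quoted in this section; they should not be read as the author's exact listing. The recurrent weights `u` are set to zero for readability, so in this sketch only the cell state carries memory, and `g` denotes the candidate update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scalar parameters (assumptions): w acts on the input x_t,
# u acts on the previous hidden output h_{t-1}, b is a bias.
PARAMS = {
    "i": dict(w=0.5, u=0.0, b=0.0),  # input gate
    "f": dict(w=0.0, u=0.0, b=1.0),  # forget gate: constant sigmoid(1), about 0.73
    "g": dict(w=0.8, u=0.0, b=0.0),  # candidate update
    "o": dict(w=0.0, u=0.0, b=0.0),  # output gate: constant 0.5
}

def lstm_step(x, h_prev, c_prev, p=PARAMS):
    """One timestep of a scalar LSTM cell; returns (h, c, gates)."""
    pre = {k: v["w"] * x + v["u"] * h_prev + v["b"] for k, v in p.items()}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])
    c = f * c_prev + i * g   # additive cell-state update
    h = o * np.tanh(c)       # hidden output exposed to later layers
    return h, c, dict(i=i, f=f, g=g, o=o)

def run(xs):
    """Run the cell over a sequence, starting from h = c = 0."""
    h, c, trace = 0.0, 0.0, []
    for t, x in enumerate(xs, start=1):
        h, c, gates = lstm_step(x, h, c)
        trace.append((t, x, gates["f"], c, h))
    return trace

pulse = [0, 0, 0, 1, 1, 1, 0, 0]  # assumed 8-step pulse
for t, x, f, c, h in run(pulse):
    print(f"t={t}  x={x}  f={f:.3f}  c={c:.3f}  h={h:.3f}")
```

Giving any `u` a non-zero value lets the hidden output feed back into the gates, which is what a trained LSTM does.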
The numbers tell the story
Read down the cell-state column: 0 → 0 → 0 → 0.413 → 0.716 → 0.936 → 0.685 → 0.500. The cell state builds up while the pulse is active and decays gracefully after it ends. That is what working memory looks like in arithmetic.
PyTorch: nn.LSTM in Six Lines
Production code never writes the cell from scratch. PyTorch's nn.LSTM wraps a CUDA-optimised batched implementation with multi-layer and bidirectional support built in.
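As a rough sketch of what such a call looks like: the layer arguments are the ones quoted in this section, while the batch size of 4 and the random input are assumptions made only to have something to feed through.

```python
import torch
import torch.nn as nn

# Layer arguments as quoted in this section; batch size 4 is an arbitrary choice.
lstm = nn.LSTM(input_size=17, hidden_size=256, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(4, 30, 17)        # (batch, cycles, sensors)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # per-timestep features, forward and backward concatenated
print(h_n.shape)     # final hidden state for each layer and direction
n_params = sum(p.numel() for p in lstm.parameters())
print(n_params)      # total learnable weights in the LSTM stack
```

Here `output` has shape (4, 30, 512) because the two directions are concatenated, `h_n` has shape (4, 4, 256) (layers times directions, batch, hidden), and the parameter count comes to 2,140,160.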
nn.LSTM(input_size=17, hidden_size=256, num_layers=2, bidirectional=True, batch_first=True) is verbatim what the backbone in Chapter 9 uses. That is ~2.1M parameters, the largest single component of the network.
Recurrent Networks Beyond RUL
| Domain | Sequence | Hidden state captures | Famous architecture |
|---|---|---|---|
| RUL (this book) | 30 cycles of 17 sensors | Cumulative degradation | CNN-BiLSTM-Attention |
| Language models | Subword tokens | Sentence-level meaning | ELMo |
| Speech recognition | Audio frames | Phoneme context | DeepSpeech 2 |
| Machine translation | Source tokens | Sentence representation | seq2seq + attention |
| Music generation | Audio samples | Tonal / rhythmic motif | SampleRNN |
| Reinforcement learning | Game frames | Belief over hidden state | DRQN, R2D2 |
| Time-series forecasting | Hourly observations | Trend + seasonality | DeepAR, encoder-decoder LSTM |
| Medical event prediction | EHR codes / vitals | Patient trajectory | Doctor AI, RETAIN |
Every row shipped a state-of-the-art result at some point in the last decade. Transformers (Section 3.4) have since taken over large-scale text and vision, but LSTMs remain the right tool for small-batch, low-latency, low-data settings — including most prognostic problems with under a million sensor samples.
The Three Pitfalls
- batch_first=False by default. The single most common LSTM bug. PyTorch's historical default expects (seq_len, batch, features), but every other layer in the book uses (batch, seq_len, features). Always pass batch_first=True.
- Hidden state carried across batches. If you pass h_n from one batch into the next without .detach(), autograd will try to backpropagate through the entire chain of earlier batches, which typically ends in a runtime error or an out-of-memory crash (the “BPTT-through-batches” bug).
- bidirectional=True requires the full future to compute the backward direction. You cannot use a bidirectional LSTM in a streaming setting where new cycles arrive one at a time. For end-of-window RUL prediction (our setting) it is fine.
The point. An LSTM is a learnable, gated working memory. The gates are the engineering trick that solves vanishing gradients; the cell state is the long-term memory; the hidden output is what the rest of the model sees.
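The first two pitfalls can be sketched in a few lines. The layer sizes and random tensors here are illustrative assumptions, not the Chapter 9 configuration.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 30, 17)  # intended as (batch=4, seq_len=30, features=17)

# Pitfall 1: the default layout is (seq_len, batch, features).
seq_first = nn.LSTM(input_size=17, hidden_size=8)            # batch_first=False
out_wrong, (h_wrong, _) = seq_first(x)   # runs, but reads 4 as time and 30 as batch
batch_first = nn.LSTM(input_size=17, hidden_size=8, batch_first=True)
out_right, (h_n, c_n) = batch_first(x)
print(out_wrong.shape, out_right.shape)  # same output shape, so nothing crashes
print(h_wrong.shape, h_n.shape)          # the batch axis of h_n reveals the mix-up

# Pitfall 2: detach the carried state so the graph stops at the batch boundary.
carried = (h_n.detach(), c_n.detach())
out_next, new_state = batch_first(torch.randn(4, 30, 17), carried)
out_next.sum().backward()                # backpropagates through this batch only
```

The output tensors have identical shapes in both layouts, which is why the batch_first mistake tends to go unnoticed until the model fails to learn.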
Takeaway
- RNNs read sequences and update a hidden state. They add temporal modelling on top of whatever frontend you put in place (Conv1D in this book).
- Vanilla RNNs vanish. Length-30 sequences are too long for tanh-based recurrences; gradients shrink to nothing.
- LSTMs fix it with three gates and a candidate. $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ and $h_t = o_t \odot \tanh(c_t)$. Gradients flow through $c_t$ almost unattenuated.
- PyTorch's nn.LSTM is six arguments. input_size, hidden_size, num_layers, bidirectional, batch_first, dropout: the entire backbone of Chapter 9 fits on one line.
- Always set batch_first=True. The default is (seq_len, batch, features) and breaks every other layer's shape contract.