Live Closed Captions vs Subtitles
A live closed-captioner working off a microphone has no idea what word is coming next - she has to commit to her transcription word-by-word, with only past context. A film subtitler working from a finished movie can rewind, fast-forward, look at what comes both before AND after a confusing line, then write a subtitle that reads naturally in context. Same task; very different output quality, because one of them has access to the future.
That is exactly what a bidirectional LSTM gives a model. A unidirectional LSTM at cycle $t$ only sees $x_1, \dots, x_t$. A bidirectional LSTM sees the entire window from BOTH directions and lets the model decide what each cycle means in light of everything that surrounds it.
From CNN Features to Temporal Modelling
The CNN frontend (Chapter 8) produced a (B, 30, 64) tensor - 30 timesteps, 64 learned local-pattern channels. The BiLSTM's job is to integrate those LOCAL features into a sequence-aware representation that carries TEMPORAL structure: are the local spikes accelerating? Are oscillations growing in amplitude? Is a slow upward drift consistent across the window?
| Limitation of CNN alone | Example | Why BiLSTM helps |
|---|---|---|
| No long-range dependencies | Pattern at cycle 5 relates to cycle 25 | Hidden state spans the entire window |
| No temporal ordering | Whether degradation is accelerating | Cell state tracks evolution |
| Fixed receptive field | Sudden change after long stability | LSTM adapts via gates |
What a Unidirectional LSTM Cannot See
A unidirectional LSTM at cycle $t$ has only seen $x_1, \dots, x_t$. For a spike at cycle 23 the model cannot tell whether the spike is the START of a failure cascade or just a transient anomaly that resolves at cycle 25 - because cycle 24 and cycle 25 do not exist yet from the unidirectional point of view.
A bidirectional LSTM solves this by running a SECOND LSTM right-to-left over the same input, then concatenating the two hidden states. At cycle 23, the model now also knows what cycles 24-30 look like - decisive context for distinguishing transient from terminal events.
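The cycle-23 argument can be checked with a toy experiment. The sketch below is an illustration under stated assumptions, not the book's model: `run_pass` is a hypothetical one-line recurrence standing in for an LSTM, and the two windows (`transient`, `cascade`) are made-up data that agree through cycle 23 and diverge afterwards.

```python
import math

def run_pass(xs, reverse=False):
    """Toy recurrent summary h = tanh(0.5 * h_prev + x); a stand-in for an LSTM, not an LSTM."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, states = 0.0, [0.0] * len(xs)
    for t in order:
        h = math.tanh(0.5 * h + xs[t])
        states[t] = h          # state stored at the cycle it was computed for
    return states

T, SPIKE = 30, 22              # 30-cycle window; index 22 is cycle 23 (0-based)

transient = [0.0] * T
transient[SPIKE] = 1.0         # spike at cycle 23, then back to normal

cascade = list(transient)
for t in range(SPIKE + 1, T):  # cycles 24-30 keep climbing (made-up values)
    cascade[t] = 1.0 + 0.2 * (t - SPIKE)

fwd_a, fwd_b = run_pass(transient), run_pass(cascade)
bwd_a, bwd_b = run_pass(transient, reverse=True), run_pass(cascade, reverse=True)

# Left-to-right: both windows look the same up to cycle 23, so the states match.
print("forward states equal at cycle 23: ", fwd_a[SPIKE] == fwd_b[SPIKE])
# Right-to-left: the state at cycle 23 has already absorbed cycles 24-30.
print("backward states equal at cycle 23:", bwd_a[SPIKE] == bwd_b[SPIKE])
```

Whatever recurrence is used, the left-to-right state at cycle 23 is a function of cycles 1-23 only, so it cannot separate the two windows; the right-to-left state can.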
The BiLSTM Mechanism
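Two independent LSTM passes run over the same sequence, one in each direction:

$$\overrightarrow{h}_t = \mathrm{LSTM}_{\text{fwd}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_{\text{bwd}}\big(x_t, \overleftarrow{h}_{t+1}\big)$$

The output at each timestep is the concatenation:

$$h_t = \big[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\big] \in \mathbb{R}^{2H}$$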
Two separate parameter sets - one for each direction - learn different things. The forward LSTM tends to learn “what came before” features; the backward LSTM tends to learn “what comes after” features. The concatenation lets the downstream layer use both.
| Aspect | Unidirectional LSTM | Bidirectional LSTM |
|---|---|---|
| Context at cycle t | x_1..x_t (past only) | x_1..x_T (full window) |
| Output dimension | H = 256 | 2H = 512 (concatenated) |
| Parameter count | P | 2P (two LSTMs) |
| Computation | 1 pass | 2 parallel passes |
| Use case | Real-time / streaming | Offline / fixed window |
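The "P versus 2P" and "H versus 2H" rows can be worked through numerically. The sketch below assumes a single-layer LSTM fed by the 64 CNN channels, with PyTorch's parameter layout (input weights, recurrent weights, and two bias vectors for each of the four gates); the layer count is an assumption, and `lstm_param_count` is a hypothetical helper, not a library function.

```python
def lstm_param_count(input_size: int, hidden_size: int, bidirectional: bool = False) -> int:
    """Parameter count of a single-layer LSTM in PyTorch's layout (assumed here)."""
    # Four gates (input, forget, cell candidate, output), each with:
    #   hidden_size * input_size   input-to-hidden weights
    #   hidden_size * hidden_size  hidden-to-hidden weights
    #   2 * hidden_size            two bias vectors (b_ih and b_hh)
    per_direction = 4 * (hidden_size * input_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)

INPUT_SIZE = 64    # channels produced by the CNN frontend
HIDDEN_SIZE = 256  # H from the table

p_uni = lstm_param_count(INPUT_SIZE, HIDDEN_SIZE)
p_bi = lstm_param_count(INPUT_SIZE, HIDDEN_SIZE, bidirectional=True)
out_dim_bi = 2 * HIDDEN_SIZE

print(f"unidirectional parameters P : {p_uni}")
print(f"bidirectional parameters 2P : {p_bi}")
print(f"bidirectional output dim 2H : {out_dim_bi}")
```

Under these assumptions P = 4 × (256 × 64 + 256 × 256 + 512) = 329,728, so the bidirectional layer holds 659,456 parameters and emits 512 features per cycle.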
Interactive: Watch the Two Passes Combine
Interactive animation (not reproduced here): a 6-cycle BiLSTM run step by step in three modes - “forward only”, “backward only”, and “both”. It shows that the concatenated output at a timestep is ready only once BOTH directions have processed that timestep.
Python: A Bidirectional Pass From Scratch
Twenty lines that take the scalar LSTM cell from §3.3 and run it twice: forward, then backward, on the SAME input sequence. The concatenation glues the two hidden-state lists into one (T, 2H) output.
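A minimal sketch of the pass described above. The cell from §3.3 is not reproduced here, so `lstm_cell`, its parameter layout, and the hand-picked weights in `P_FWD` / `P_BWD` are illustrative assumptions rather than the book's listing. With a scalar cell H = 1, so the (T, 2H) output is a list of T pairs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h, c, p):
    """One step of a scalar LSTM cell (hidden size 1); p maps gate name -> (w_x, w_h, b)."""
    def gate(name, act):
        w_x, w_h, b = p[name]
        return act(w_x * x + w_h * h + b)
    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    g = gate("g", math.tanh)    # candidate cell value
    o = gate("o", sigmoid)      # output gate
    c = f * c + i * g           # new cell state
    h = o * math.tanh(c)        # new hidden state
    return h, c

def bilstm(xs, p_fwd, p_bwd):
    T = len(xs)
    h_fwd, h_bwd = [0.0] * T, [0.0] * T

    h, c = 0.0, 0.0                   # fresh state for the forward pass
    for t in range(T):
        h, c = lstm_cell(xs[t], h, c, p_fwd)
        h_fwd[t] = h

    h, c = 0.0, 0.0                   # reset: the backward pass starts from its own zero state
    for t in reversed(range(T)):
        h, c = lstm_cell(xs[t], h, c, p_bwd)
        h_bwd[t] = h

    # Concatenate per timestep: T rows of [forward, backward], i.e. (T, 2H) with H = 1.
    return [[h_fwd[t], h_bwd[t]] for t in range(T)]

# Hypothetical, hand-picked weights (not trained values).
P_FWD = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0), "g": (1.0, 0.5, 0.0), "o": (0.4, 0.1, 0.0)}
P_BWD = {"i": (0.6, -0.1, 0.0), "f": (0.2, 0.3, 1.0), "g": (0.8, 0.4, 0.0), "o": (0.5, 0.2, 0.0)}

xs = [0.1, 0.0, 0.9, 0.2, 0.1, 0.0]   # a 6-cycle toy window
out = bilstm(xs, P_FWD, P_BWD)
for t, (fwd, bwd) in enumerate(out, start=1):
    print(f"cycle {t}: forward={fwd:+.4f}  backward={bwd:+.4f}")
```

The forward value at cycle 1 depends only on the first input, while the backward value at cycle 1 has already absorbed all six; at cycle 6 the roles are swapped.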
Resetting h, c before the backward loop is essential - sharing state would couple the two passes in ways that defeat the purpose.
PyTorch: nn.LSTM(bidirectional=True)
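A short sketch of the PyTorch equivalent, using the shapes quoted in this chapter (30 cycles, 64 CNN channels, H = 256). The batch size, the single layer, and the random input standing in for the CNN output are assumptions for illustration; the book's actual backbone configuration may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, C, H = 4, 30, 64, 256   # batch (arbitrary), window length, CNN channels, hidden size

lstm = nn.LSTM(input_size=C, hidden_size=H, num_layers=1,
               batch_first=True, bidirectional=True)

x = torch.randn(B, T, C)      # random stand-in for the CNN frontend output
out, (h_n, c_n) = lstm(x)

# out has shape (B, T, 2H): forward half in [..., :H], backward half in [..., H:].
print("out:", tuple(out.shape))
# h_n has shape (num_layers * 2, B, H): one final hidden state per direction.
print("h_n:", tuple(h_n.shape))

fwd_final = h_n[-2]           # forward direction, after reading the last cycle
bwd_final = h_n[-1]           # backward direction, after reading the first cycle

# The forward final state sits at the END of out, the backward one at the START.
same_fwd = torch.allclose(fwd_final, out[:, -1, :H])
same_bwd = torch.allclose(bwd_final, out[:, 0, H:])
print("h_n[-2] == out[:, -1, :H]:", same_fwd)
print("h_n[-1] == out[:, 0, H:] :", same_bwd)
```

The last two checks make the indexing explicit: each entry of `h_n` covers a single direction with H features, whereas every row of `out` carries both directions with 2H features.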
Bidirectional vs Streaming in Other Domains
| Domain | Sequence | Bidirectional? | Reason |
|---|---|---|---|
| RUL on fixed window (this book) | 30 cycles | Yes | Window is complete at prediction time |
| Live speech recognition | Audio frames as captured | No - unidirectional / streaming | Latency budget; future not available |
| Offline ASR / dictation | Recorded audio | Yes | Full audio available |
| Machine translation | Source sentence | Yes (encoder) | Whole sentence is provided |
| Live language modelling | Token-by-token | No - causal mask | Generation requires causality |
| Medical EHR risk scoring | Patient history | Depends | Bidirectional context leaks future records when the score must be evaluated causally |
| Anomaly detection (online) | Telemetry stream | No | Streaming inference |
Three Bidirectional Pitfalls
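h_n[-1] gives the final hidden state of the LAST direction (backward) of the LAST layer, with shape (B, H), not a (B, 2H) full final state. To get a (B, 2H) vector, use out[:, -1] (the last timestep's concatenated output) instead. Note that the backward half of out[:, -1] has processed only the final timestep; torch.cat([h_n[-2], h_n[-1]], dim=-1) gives the end-of-pass state of both directions.
The point. Two LSTM passes on the same input, concatenated. Doubles the hidden dimension; doubles the parameters; gives the model both past and future context at every cycle. Required for windowed RUL; unsuitable for streaming.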
Takeaway
- Bidirectional = forward LSTM + backward LSTM, concatenated.
- Output dim = 2 × hidden_size. 256 × 2 = 512 for our backbone.
- Two independent parameter sets. They learn different things during training.
- Cannot stream. Needs the full window.
- Used by every model in this book. AMNL / GABA / GRACE all use the same BiLSTM block; only the loss changes.