Chapter 9

Why Bidirectional Beats Unidirectional

Bidirectional LSTM Encoder

Live Closed Captions vs Subtitles

A live closed-captioner working off a microphone has no idea what word is coming next - she has to commit to her transcription word-by-word, with only past context. A film subtitler working from a finished movie can rewind, fast-forward, look at what comes both before AND after a confusing line, then write a subtitle that reads naturally in context. Same task; very different output quality, because one of them has access to the future.

That is exactly what a bidirectional LSTM gives a model. A unidirectional LSTM at cycle $t$ only sees $x_1, \ldots, x_t$. A bidirectional LSTM sees the entire window $x_1, \ldots, x_T$ from BOTH directions and lets the model decide what each cycle means in light of everything that surrounds it.

The streaming caveat. A bidirectional LSTM cannot do streaming inference - it needs the full window upfront. For end-of-window RUL on a 30-cycle sliding window (our setting) this is fine. For one-cycle-at-a-time online prediction it is fatal.
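
The window requirement can be sketched as a small buffer that stays silent until a full window has arrived. This is a minimal illustration, not the book's deployment code: `WindowedPredictor` and `predict_rul` are hypothetical names, and `predict_rul` is a placeholder standing in for the trained model.

```python
from collections import deque

WINDOW = 30  # cycles per window, as in this chapter's setting


def predict_rul(window):
    """Hypothetical stand-in for the trained BiLSTM model."""
    return float(len(window))  # placeholder value, not a real prediction


class WindowedPredictor:
    """Buffers incoming cycles and predicts only on a complete window."""

    def __init__(self, window=WINDOW):
        self.buffer = deque(maxlen=window)  # oldest cycle drops out automatically

    def push(self, cycle_features):
        self.buffer.append(cycle_features)
        if len(self.buffer) < self.buffer.maxlen:
            return None  # window incomplete: a bidirectional pass is not possible yet
        return predict_rul(list(self.buffer))


predictor = WindowedPredictor()
outputs = [predictor.push([0.0] * 64) for _ in range(35)]
n_missing = sum(o is None for o in outputs)
print("cycles without a prediction:", n_missing)  # 29
```

With a 30-cycle window the first 29 cycles yield no prediction; after that, each new cycle re-runs the model on the latest full window.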

From CNN Features to Temporal Modelling

The CNN frontend (Chapter 8) produced a (B, 30, 64) tensor - 30 timesteps, 64 learned local-pattern channels. The BiLSTM's job is to integrate those LOCAL features into a sequence-aware representation that carries TEMPORAL structure: are the local spikes accelerating? Are oscillations growing in amplitude? Is a slow upward drift consistent across the window?

| Limitation of CNN alone | Example | Why BiLSTM helps |
|---|---|---|
| No long-range dependencies | Pattern at cycle 5 relates to cycle 25 | Hidden state spans the entire window |
| No temporal ordering | Whether degradation is accelerating | Cell state tracks evolution |
| Fixed receptive field | Sudden change after long stability | LSTM adapts via gates |

What a Unidirectional LSTM Cannot See

A unidirectional LSTM at cycle $t$ has only seen $x_1, \ldots, x_t$. For a spike at cycle 23 the model cannot tell whether the spike is the START of a failure cascade or just a transient anomaly that resolves at cycle 25 - because cycle 24 and cycle 25 do not exist yet from the unidirectional point of view.

A bidirectional LSTM solves this by running a SECOND LSTM right-to-left over the same input, then concatenating the two hidden states. At cycle 23, the model now also knows what cycles 24-30 look like - decisive context for distinguishing transient from terminal events.

The BiLSTM Mechanism

Two independent LSTM passes on the same sequence, concatenated:

$$\overrightarrow{h}_t = \text{LSTM}_{\rightarrow}(x_t,\, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \text{LSTM}_{\leftarrow}(x_t,\, \overleftarrow{h}_{t+1}).$$

The output at each timestep is the concatenation:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t] \in \mathbb{R}^{2H}.$$

Two separate parameter sets - one for each direction - learn different things. The forward LSTM tends to learn “what came before” features; the backward LSTM learns “what comes after”. The concat lets the downstream layer use both.
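
The two equations can be sketched with vector-valued hidden states in NumPy. This is an illustration under stated assumptions, not the book's code: `init_params`, `lstm_pass` and `bilstm` are hypothetical names, the weights are random and untrained, and the four gates are packed into one matrix for brevity. The sketch checks the shape claim and the context claim: changing a later cycle leaves the forward half at an earlier cycle untouched but changes the backward half.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, F, H):
    """Random weights for ONE direction: W maps [x_t; h_prev] to the 4 gates."""
    return {"W": rng.normal(0.0, 0.5, size=(4 * H, F + H)), "b": np.zeros(4 * H)}


def lstm_pass(xs, p, H, reverse=False):
    """Run one LSTM over xs (T, F); returns hidden states (T, H) in time order."""
    T = xs.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = p["W"] @ np.concatenate([xs[t], h]) + p["b"]
        i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h                            # stored at index t in both directions
    return out


def bilstm(xs, p_fwd, p_bwd, H):
    """Concatenate forward and backward hidden states: (T, 2H)."""
    fwd = lstm_pass(xs, p_fwd, H)
    bwd = lstm_pass(xs, p_bwd, H, reverse=True)
    return np.concatenate([fwd, bwd], axis=1)


rng = np.random.default_rng(0)
T, F, H = 6, 3, 4
p_fwd, p_bwd = init_params(rng, F, H), init_params(rng, F, H)  # separate weights
xs = rng.normal(size=(T, F))
out = bilstm(xs, p_fwd, p_bwd, H)
print(out.shape)  # (6, 8)

xs_future = xs.copy()
xs_future[4] += 1.0                           # change only a LATER timestep
out2 = bilstm(xs_future, p_fwd, p_bwd, H)

fwd_same = np.allclose(out[2, :H], out2[2, :H])   # forward half at t=2
bwd_same = np.allclose(out[2, H:], out2[2, H:])   # backward half at t=2
print(fwd_same, bwd_same)  # True False
```

The forward half at t=2 cannot react to a change at t=4; the backward half can, which is the extra context the concatenation exposes to downstream layers.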

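| Aspect | Unidirectional LSTM | Bidirectional LSTM |
|---|---|---|
| Context at cycle t | x_1..x_t (past only) | x_1..x_T (full window) |
| Output dimension | H = 256 | 2H = 512 (concatenated) |
| Parameter count (single layer) | P | 2P (two LSTMs); more than 2P when layers are stacked |
| Computation | 1 pass | 2 independent passes (can run in parallel) |
| Use case | Real-time / streaming | Offline / fixed window |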
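
The parameter-count row can be checked by hand. The sketch below assumes PyTorch's nn.LSTM layout (per layer and direction: an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors); `lstm_param_count` is a hypothetical helper, not a library function.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Count LSTM parameters, assuming PyTorch's nn.LSTM layout.

    Per layer and direction: weight_ih (4H x in), weight_hh (4H x H),
    bias_ih (4H) and bias_hh (4H).
    """
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the previous layer's (dirs * H) output.
        layer_in = input_size if layer == 0 else dirs * hidden_size
        per_dir = 4 * hidden_size * (layer_in + hidden_size) + 8 * hidden_size
        total += dirs * per_dir
    return total


uni_1 = lstm_param_count(64, 256, num_layers=1, bidirectional=False)
bi_1 = lstm_param_count(64, 256, num_layers=1, bidirectional=True)
uni_2 = lstm_param_count(64, 256, num_layers=2, bidirectional=False)
bi_2 = lstm_param_count(64, 256, num_layers=2, bidirectional=True)

print(f"1 layer : {uni_1:,} vs {bi_1:,}")   # 1 layer : 329,728 vs 659,456
print(f"2 layers: {uni_2:,} vs {bi_2:,}")   # 2 layers: 856,064 vs 2,236,416
print(round(bi_2 / uni_2, 2))               # 2.61
```

For one layer the ratio is exactly 2. With two stacked layers it rises to about 2.6, because the second bidirectional layer reads a 2H-wide input.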

Interactive: Watch the Two Passes Combine

The animation below runs a 6-cycle BiLSTM step by step. Toggle between “forward only”, “backward only”, and “both” to see how the concatenated output is only ready when BOTH directions have processed the relevant timestep.

[Interactive animation: BiLSTM forward and backward passes over a 6-cycle sequence]

Python: A Bidirectional Pass From Scratch

Two short functions take the scalar LSTM cell from §3.3 and run it twice: forward, then backward, on the SAME input sequence. The concatenation glues the two hidden-state lists into one (T, 2H) output; with scalar hidden states H = 1, so the result here is (T, 2).

Two passes, concatenated - the entire bidirectional algorithm
🐍bidirectional_lstm_numpy.py
`import numpy as np`

Standard alias.

`def lstm_step(x, h_prev, c_prev, params):`

Same scalar LSTM step from §3.3. Reused here as the building block for both directions.

EXECUTION STATE
input: x = Current timestep input scalar
input: h_prev, c_prev = Previous hidden / cell state
input: params = Dict of (W_x, W_h, b) tuples for each gate
returns = (h_new, c_new) - new hidden and cell state
`sig = lambda z: 1.0 / (1.0 + np.exp(-z))`

Sigmoid - same as §3.3.

`W_ix, W_ih, b_i = params["i"]`

Unpack the 4 gate parameters from the dict.

`i = sig(W_ix * x + W_ih * h_prev + b_i)`

Input gate.

`f = sig(W_fx * x + W_fh * h_prev + b_f)`

Forget gate.

`g = np.tanh(W_gx * x + W_gh * h_prev + b_g)`

Candidate state.

`o = sig(W_ox * x + W_oh * h_prev + b_o)`

Output gate.

`c = f * c_prev + i * g`

Cell state update.

`h = o * np.tanh(c)`

Hidden output.

`def bidirectional_lstm(seq, params_fwd, params_bwd):`

The bidirectional wrapper. Runs the same lstm_step in BOTH directions on the SAME input sequence using SEPARATE parameter sets.

EXECUTION STATE
input: seq = Length-T list of scalar inputs
input: params_fwd = Forward LSTM's parameter dict
input: params_bwd = Backward LSTM's parameter dict (separate from forward)
returns = (T, 2) ndarray - column 0 forward, column 1 backward
`T = len(seq)`

Sequence length.

`h, c = 0.0, 0.0`

Zero-init for the FORWARD pass.

`h_fwd = []`

Forward hidden states accumulator.

`for t in range(T):`

Forward pass: left to right.

LOOP TRACE · 3 iterations
t = 0
h_fwd[0] = depends on seq[0] only
t = 1
h_fwd[1] = depends on seq[0], seq[1]
t = T-1
h_fwd[T-1] = depends on seq[0..T-1]
`h, c = lstm_step(seq[t], h, c, params_fwd)`

Forward step using FORWARD parameters.

`h_fwd.append(h)`

Stash forward hidden state.

`h, c = 0.0, 0.0`

RESET hidden / cell state for the backward pass. The two passes have INDEPENDENT state - they do NOT share intermediate results.

`h_bwd = [None] * T`

Pre-allocate backward output list. We will fill from the right.

`for t in range(T - 1, -1, -1):`

Backward pass: right to left. Note the SAME input seq is used.

LOOP TRACE · 3 iterations
t = T-1
h_bwd[T-1] = depends on seq[T-1] only
t = T-2
h_bwd[T-2] = depends on seq[T-1], seq[T-2]
t = 0
h_bwd[0] = depends on seq[0..T-1]
`h, c = lstm_step(seq[t], h, c, params_bwd)`

Backward step using BACKWARD parameters. params_fwd and params_bwd are independent; they will (in real training) learn different things.

`h_bwd[t] = h`

Stash at index t (not appended) - this is why we pre-allocated.

`return np.stack([np.stack([f, b]) for f, b in zip(h_fwd, h_bwd)])`

Pair the forward and backward hidden states at every timestep and stack the pairs into a (T, 2) array.

EXECUTION STATE
→ real LSTM concat = In nn.LSTM, h_fwd and h_bwd are each H-dimensional, so the concat has 2H entries per timestep. Here they are scalars, so each timestep has 2 entries.
`seq = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]`

Square pulse - same shape as §3.3 but length 6 for visibility.

`shared = {...}`

Same parameter set for both directions for this demo. Real BiLSTM has independent forward / backward weights.

`out = bidirectional_lstm(seq, shared, shared)`

Run.

EXECUTION STATE
out.shape = (6, 2)
`print("seq :", seq)`

Original input.

EXECUTION STATE
Output = seq : [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
`print("h_fwd :", out[:, 0].round(3).tolist())`

Forward hidden states. h_fwd[2] sees only seq[0..2]; h_fwd[5] sees the whole pulse.

EXECUTION STATE
Output = h_fwd : [0.0, 0.0, 0.286, 0.449, 0.24, 0.182]
`print("h_bwd :", out[:, 1].round(3).tolist())`

Backward hidden states. h_bwd[2] now sees seq[2..5]; h_bwd[0] sees the whole sequence in reverse. Symmetric to forward but starting from the OTHER end: because the pulse is symmetric and this demo shares one parameter set, h_bwd is exactly h_fwd reversed.

EXECUTION STATE
Output = h_bwd : [0.182, 0.24, 0.449, 0.286, 0.0, 0.0]
`print("concat shape :", out.shape)`

(T, 2) - one (forward, backward) pair per timestep.

EXECUTION STATE
Output = concat shape : (6, 2)
```python
import numpy as np

# A scalar bidirectional pass: same machinery as nn.LSTM(bidirectional=True)
# but written so every line is visible.

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. Returns (h_new, c_new)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    W_ix, W_ih, b_i = params["i"]
    W_fx, W_fh, b_f = params["f"]
    W_gx, W_gh, b_g = params["g"]
    W_ox, W_oh, b_o = params["o"]
    i = sig(W_ix * x + W_ih * h_prev + b_i)        # input gate
    f = sig(W_fx * x + W_fh * h_prev + b_f)        # forget gate
    g = np.tanh(W_gx * x + W_gh * h_prev + b_g)    # candidate state
    o = sig(W_ox * x + W_oh * h_prev + b_o)        # output gate
    c = f * c_prev + i * g                         # cell state update
    h = o * np.tanh(c)                             # hidden output
    return h, c


def bidirectional_lstm(seq, params_fwd, params_bwd):
    """seq: list of scalars (length T). Returns (T, 2): columns [fwd, bwd]."""
    T = len(seq)
    h, c = 0.0, 0.0
    h_fwd = []
    for t in range(T):                              # left → right
        h, c = lstm_step(seq[t], h, c, params_fwd)
        h_fwd.append(h)

    h, c = 0.0, 0.0                                 # reset state for the backward pass
    h_bwd = [None] * T
    for t in range(T - 1, -1, -1):                  # right → left
        h, c = lstm_step(seq[t], h, c, params_bwd)
        h_bwd[t] = h

    return np.stack([np.stack([f, b]) for f, b in zip(h_fwd, h_bwd)])


# ----- Run on a 6-step input -----
seq = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
shared = {
    "i": (0.5, 0.0, 0.0),
    "f": (0.0, 0.0, 1.0),       # forget bias = 1, like §3.3
    "g": (0.8, 0.0, 0.0),
    "o": (1.0, 0.0, 0.0),
}
out = bidirectional_lstm(seq, shared, shared)

print("seq           :", seq)
print("h_fwd         :", out[:, 0].round(3).tolist())
print("h_bwd         :", out[:, 1].round(3).tolist())
print("concat shape  :", out.shape)             # (6, 2)
```

The state RESET between passes. Forward and backward have INDEPENDENT (h, c). Resetting h, c before the backward loop is essential - sharing state would couple the two passes in ways that defeat the purpose.

PyTorch: nn.LSTM(bidirectional=True)

Production BiLSTM in 6 arguments
🐍bilstm_torch.py
`import torch`

Top-level PyTorch.

`import torch.nn as nn`

nn.LSTM lives here.

`torch.manual_seed(0)`

Determinism.

`B, T, F = 2, 30, 64`

Input shape from §8.4's CNN frontend: 2 engines, 30 cycles, 64 channels. The BiLSTM consumes this directly.

`bilstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=2, bidirectional=True, batch_first=True, dropout=0.3)`

Six arguments fully specify the BiLSTM.

EXECUTION STATE
input_size=64 = Channels per cycle from CNN frontend
hidden_size=256 = Per-direction H. Output dim is 2*256 = 512
num_layers=2 = Two stacked BiLSTMs - output of layer 1 feeds layer 2
bidirectional=True = Doubles the output dim; doubles layer 1's parameters and more than doubles layer 2's, whose input becomes 2H wide
batch_first=True = (B, T, F) order, matching the rest of the book
dropout=0.3 = Only applied BETWEEN stacked layers - 0 if num_layers=1
`x = torch.randn(B, T, F)`

Fake CNN-frontend output. Real data comes from CNNFrontend(...) (Chapter 8).

`out, (h_n, c_n) = bilstm(x)`

BiLSTM forward pass. Returns both the per-timestep hidden output AND the final (h, c) tuple.

`print("input shape :", tuple(x.shape))`

Verify input.

EXECUTION STATE
Output = input shape : (2, 30, 64)
`print("output shape :", tuple(out.shape))`

Time preserved at 30; channel dim doubles to 512 (= 2 * hidden_size from the bidirectional concat).

EXECUTION STATE
Output = output shape : (2, 30, 512)
`print("h_n shape :", tuple(h_n.shape))`

Final hidden state - 4 entries because (2 layers × 2 directions).

EXECUTION STATE
Output = h_n shape : (4, 2, 256)
`print("# params :", sum(p.numel() for p in bilstm.parameters()))`

About 2.24M parameters with input_size=64 - the BiLSTM is the largest single component of the backbone.

EXECUTION STATE
Output = # params : 2236416
```python
import torch
import torch.nn as nn

# ----- Production-grade bidirectional LSTM -----
torch.manual_seed(0)
B, T, F = 2, 30, 64                 # Comes from §8.4: CNN frontend output

bilstm = nn.LSTM(
    input_size=64,
    hidden_size=256,
    num_layers=2,
    bidirectional=True,             # ← the magic word
    batch_first=True,
    dropout=0.3,                    # only between stacked layers
)

x = torch.randn(B, T, F)
out, (h_n, c_n) = bilstm(x)

print("input shape  :", tuple(x.shape))
print("output shape :", tuple(out.shape))           # (2, 30, 512): 2 * hidden_size
print("h_n shape    :", tuple(h_n.shape))           # (4, 2, 256): 2 layers * 2 dirs
print("# params     :", sum(p.numel() for p in bilstm.parameters()))
# prints: # params     : 2236416
```

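The cuDNN speed-up. When the module and its input live on a CUDA device, PyTorch dispatches nn.LSTM to NVIDIA's cuDNN kernels, which handle both directions and all gates in highly-optimised fused calls. Hand-written Python for loops are typically several times slower; the exact factor depends on hardware, batch size and sequence length.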

Bidirectional vs Streaming in Other Domains

| Domain | Sequence | Bidirectional? | Reason |
|---|---|---|---|
| RUL on fixed window (this book) | 30 cycles | Yes | Window is complete at prediction time |
| Live speech recognition | Audio frames as captured | No - unidirectional / streaming | Latency budget; future not available |
| Offline ASR / dictation | Recorded audio | Yes | Full audio available |
| Machine translation | Source sentence | Yes (encoder) | Whole sentence is provided |
| Live language modelling | Token-by-token | No - causal mask | Generation requires causality |
| Medical EHR risk scoring | Patient history | Depends | Look-ahead leak in causal evaluation |
| Anomaly detection (online) | Telemetry stream | No | Streaming inference |

Three Bidirectional Pitfalls

Pitfall 1: Streaming inference. If you train bidirectional and want to deploy in a streaming setting, you cannot. Either accept latency equal to one full window, or retrain unidirectional.
Pitfall 2: Doubled parameter count. A single-layer BiLSTM has 2× the parameters of a unidirectional LSTM; stacking costs more, because layer 2 reads the 2H-wide concatenated output of layer 1. With H=256, L=2 layers and input_size=64, that is ~2.2M params (versus ~0.86M unidirectional) - the biggest part of the backbone. Watch GPU memory.
Pitfall 3: Confusing h_n shape. The final hidden state has shape (num_layers × num_directions, B, hidden) = (4, B, 256) for our config. Indexing h_n[-1] gives only the BACKWARD direction of the LAST layer, not a (B, 2H) full final state. To build one, concatenate the last layer's two directions: torch.cat([h_n[-2], h_n[-1]], dim=1). Note that out[:, -1] is also (B, 2H) but is not the same thing: its forward half equals h_n[-2], while its backward half has processed only the final cycle.
The point. Two LSTM passes on the same input, concatenated. Doubles the hidden dimension; at least doubles the parameters; gives the model both past and future context at every cycle. A good fit for windowed RUL; unsuitable for streaming.
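
The h_n layout from Pitfall 3 can be checked directly. A minimal sketch, assuming the same layer sizes as above with untrained weights and dropout left at its default of 0; `fwd_final`, `bwd_final` and `final_state` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 256
bilstm = nn.LSTM(input_size=64, hidden_size=H, num_layers=2,
                 bidirectional=True, batch_first=True)
bilstm.eval()

x = torch.randn(2, 30, 64)
with torch.no_grad():
    out, (h_n, c_n) = bilstm(x)

# h_n is ordered layer by layer, forward then backward within each layer.
fwd_final = h_n[-2]   # last layer, forward: state after reading cycle 30
bwd_final = h_n[-1]   # last layer, backward: state after reading back to cycle 1
final_state = torch.cat([fwd_final, bwd_final], dim=1)
print(tuple(final_state.shape))  # (2, 512)

# Where the same vectors appear inside `out`:
fwd_matches = torch.allclose(out[:, -1, :H], fwd_final, atol=1e-6)
bwd_matches_first = torch.allclose(out[:, 0, H:], bwd_final, atol=1e-6)
bwd_matches_last = torch.allclose(out[:, -1, H:], bwd_final, atol=1e-6)
print(fwd_matches, bwd_matches_first, bwd_matches_last)  # True True False
```

The forward final state sits at the last timestep of out, but the backward final state sits at the first timestep. The backward half of out[:, -1] has processed only the last cycle.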

Takeaway

  • Bidirectional = forward LSTM + backward LSTM, concatenated.
  • Output dim = 2 × hidden_size. 256 × 2 = 512 for our backbone.
  • Two independent parameter sets. They learn different things during training.
  • Cannot stream. Needs the full window.
  • Used by every model in this book. AMNL / GABA / GRACE all use the same BiLSTM block; only the loss changes.