AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the transition from CNN to LSTM in the AMNL architecture
Explain why bidirectional processing improves temporal understanding
Describe the BiLSTM mechanism with forward and backward passes
Justify bidirectionality for RUL prediction despite the causal nature of time
Appreciate the information flow through the BiLSTM encoder

Why This Matters: The LSTM processes the sequence of CNN features to capture long-range temporal dependencies. A bidirectional architecture gives every timestep access to both the earlier and the later observations in the window, so the model can judge each degradation pattern by what preceded it and by what followed it, which is critical for accurate RUL estimation.

From CNN to LSTM

The CNN feature extractor produces a sequence of 64-dimensional feature vectors, one for each of the 30 timesteps. Now we need to model how these features evolve over time.

CNN Output Recap

📝text

CNN Output: (batch, 30, 64)
              ↓
    Sequence of feature vectors:
    [f₁, f₂, f₃, ..., f₃₀]

    Each fₜ ∈ ℝ⁶⁴ captures local patterns at timestep t

What CNN Cannot Capture

While the CNN excels at local pattern detection (receptive field of 7 timesteps), it has limitations:

Limitation	Example	Why LSTM Helps
No long-range dependencies	Pattern at t=5 relates to t=25	LSTM memory spans entire sequence
No memory of sequence-level trends	Whether degradation is accelerating	LSTM tracks state evolution
Fixed receptive field	Sudden changes after long stability	LSTM gates decide what to keep or forget at each step

Division of Labor

The CNN and LSTM have complementary roles:

CNN: "What local patterns exist?" — Detects spikes, trends, oscillations within 7-timestep windows
LSTM: "How do patterns evolve over the full sequence?" — Models the trajectory of degradation across all 30 timesteps

Why Bidirectional?

A unidirectional LSTM processes the sequence from left to right, accumulating information forward in time. A bidirectional LSTM adds a second pass from right to left.

Unidirectional Limitation

📝text

Unidirectional (forward only):

Input:     [x₁] → [x₂] → [x₃] → [x₄] → [x₅]
            ↓      ↓      ↓      ↓      ↓
Hidden:    [h₁] → [h₂] → [h₃] → [h₄] → [h₅]

At position t, hidden state h_t only knows about x₁...x_t
h₃ has NO information about x₄ or x₅!

At each timestep, the unidirectional LSTM only has access to past context. This is limiting because interpreting the current state often requires knowing what happens next. Is this spike the beginning of failure or a transient anomaly? From the past alone the two cases can look identical; the answer depends on the readings that follow.
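The forward-only limitation can be checked numerically. The sketch below uses a toy scalar tanh recurrence as a stand-in for the LSTM cell; the function name `forward_states` and the weights 0.5 and 0.8 are illustrative assumptions, not part of the model. It shows that changing x₄ and x₅ leaves h₁ through h₃ untouched.

```python
import math

def forward_states(xs, w_in=0.5, w_rec=0.8):
    """Toy forward recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    A stand-in for an LSTM cell: the causal structure is the same,
    the gating is omitted. Weights are arbitrary illustrative values.
    """
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

original = forward_states([0.1, 0.2, 0.3, 0.4, 0.5])
# Same first three inputs, very different "future" at positions 4 and 5.
altered = forward_states([0.1, 0.2, 0.3, 9.0, -9.0])

# h_1..h_3 are identical: the forward pass never saw x_4 or x_5 when computing them.
same_prefix = original[:3] == altered[:3]
# h_4 and h_5 differ, because they do depend on the changed inputs.
differs_later = original[3:] != altered[3:]
print(same_prefix, differs_later)  # True True
```

Whatever follows position 3, the forward state there is the same, so a forward-only model cannot use later readings to reinterpret an earlier one.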

Bidirectional Solution

📝text

Bidirectional (forward + backward):

Forward:   [x₁] → [x₂] → [x₃] → [x₄] → [x₅]
            ↓      ↓      ↓      ↓      ↓
           [h→₁]  [h→₂]  [h→₃]  [h→₄]  [h→₅]

Backward:  [x₁] ← [x₂] ← [x₃] ← [x₄] ← [x₅]
            ↓      ↓      ↓      ↓      ↓
           [h←₁]  [h←₂]  [h←₃]  [h←₄]  [h←₅]

Combined:  [h→₁;h←₁] [h→₂;h←₂] [h→₃;h←₃] [h→₄;h←₄] [h→₅;h←₅]

At position t, combined hidden state knows BOTH past and future!

Information Flow Comparison

Aspect	Unidirectional	Bidirectional
Context at timestep t	x₁...x_t (past only)	x₁...x_T (full window)
Hidden dimension	H	2H (concatenated)
Parameters	P	2P (two LSTMs)
Computation	1 pass	2 independent passes (can run in parallel)
Use case	Real-time streaming	Offline or fixed-window analysis
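The "Parameters" row can be made concrete with a short count. This sketch assumes a standard LSTM cell with four gates, each having an input weight matrix, a recurrent weight matrix, and one bias vector; the helper name `lstm_param_count` is hypothetical. The sizes (input 64, hidden 128) are the ones used in this section. Some frameworks, PyTorch's `nn.LSTM` among them, store two bias vectors per gate, which the `biases_per_gate` argument accounts for.

```python
def lstm_param_count(input_size, hidden_size, biases_per_gate=1):
    """Parameters of one LSTM direction.

    4 gates, each with input weights (H x D), recurrent weights (H x H),
    and `biases_per_gate` bias vectors of size H.
    """
    per_gate = (hidden_size * input_size
                + hidden_size * hidden_size
                + biases_per_gate * hidden_size)
    return 4 * per_gate

D, H = 64, 128
uni = lstm_param_count(D, H)   # P: one direction
bi = 2 * uni                   # 2P: forward and backward LSTMs have separate weights
print(uni, bi)  # 98816 197632
```

The factor of two applies to the recurrent layer itself. Any layer that consumes the 2H-dimensional output also has a wider input than it would after a unidirectional LSTM.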

BiLSTM Mechanism

The BiLSTM runs two separate LSTM networks, each with its own weights, over the same sequence in opposite directions. The input $x_t$ at each timestep is the CNN feature vector $f_t$.

Forward LSTM

Processes the sequence from $t = 1$ to $t = T$:

$$\overrightarrow{h}_t = \text{LSTM}_{\rightarrow}(x_t, \overrightarrow{h}_{t-1})$$

At each timestep, the forward hidden state $\overrightarrow{h}_t$ captures information from $x_1, x_2, \ldots, x_t$.

Backward LSTM

Processes the sequence from $t = T$ to $t = 1$:

$$\overleftarrow{h}_t = \text{LSTM}_{\leftarrow}(x_t, \overleftarrow{h}_{t+1})$$

At each timestep, the backward hidden state $\overleftarrow{h}_t$ captures information from $x_T, x_{T-1}, \ldots, x_t$.

Concatenation

The final output at each timestep concatenates both directions:

$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t] \in \mathbb{R}^{2H}$$

Where:

$H$: Hidden dimension of each LSTM (128 in our model)
$2H$: Combined dimension (256)
$[\,\cdot\,;\,\cdot\,]$: Concatenation operator
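The forward pass, backward pass, and concatenation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: the weights are random and untrained, the gate ordering (input, forget, candidate, output) is one common convention, the helper names are hypothetical, and the sizes are shrunk to T=5, D=4, H=3 for readability (the model in this section uses T=30, D=64, H=128).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_params(input_size, hidden_size, rng):
    """Random, untrained weights for one direction.

    W has 4*H rows (input, forget, candidate, output gates stacked)
    and D+H columns, so it acts on the concatenation [x_t ; h_prev].
    """
    rows, cols = 4 * hidden_size, input_size + hidden_size
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    b = [0.0] * rows
    return W, b

def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM cell update; returns (h_t, c_t)."""
    W, b = params
    H = len(h_prev)
    xh = x + h_prev  # list concatenation: [x_t ; h_prev]
    z = [sum(w * v for w, v in zip(row, xh)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in z[0:H]]
    f = [sigmoid(v) for v in z[H:2 * H]]
    g = [math.tanh(v) for v in z[2 * H:3 * H]]
    o = [sigmoid(v) for v in z[3 * H:4 * H]]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_direction(params, xs, hidden_size):
    """Run one LSTM over xs in the order given; return the hidden state at every step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    states = []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states

def bilstm(fwd_params, bwd_params, xs, hidden_size):
    """h_t = [forward h_t ; backward h_t] for every position t."""
    fwd = run_direction(fwd_params, xs, hidden_size)
    # Feed the reversed sequence, then flip the results back so index t lines up.
    bwd = run_direction(bwd_params, xs[::-1], hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(fwd, bwd)]

rng = random.Random(0)
T, D, H = 5, 4, 3  # toy sizes, see the note above
fwd_params = make_params(D, H, rng)
bwd_params = make_params(D, H, rng)  # separate weights for the second LSTM
xs = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]

outputs = bilstm(fwd_params, bwd_params, xs, H)
print(len(outputs), len(outputs[0]))  # 5 6, i.e. T positions of size 2H
```

Note the re-reversal in `bilstm`: without it, the backward state computed from $x_T$ alone would be paired with the forward state at $t = 1$ instead of the one at $t = T$.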

Bidirectionality in RUL Prediction

A natural question arises: if time flows forward, why process the sequence backward?

The Key Insight

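In RUL prediction, the model does not have to produce an output at every timestep as the data arrives. It receives a fixed window of observations (30 cycles), all of which have already been recorded, and must estimate the remaining life from it. Within this window there is no causal constraint: every observation is available, so the model can read them in either direction.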

Analogy: A doctor examining a patient's week-long vital signs doesn't read them strictly left-to-right. They look at the full picture: "The spike on Day 3 is concerning because it wasn't followed by recovery on Days 4-5."

What Bidirectionality Captures

Pattern Type	Forward LSTM Sees	Backward LSTM Sees
Gradual degradation	Values increasing over time	How high values got
Sudden spike	Normal → spike transition	Recovery (or not) after spike
Oscillation	Increasing amplitude	Where oscillation ends up
Plateau then drop	Stability before drop	Drop is coming

Example: Spike Interpretation

📝text

Sensor reading: [..., 50, 52, 95, 54, 51, ...]
                              ↑ spike

Forward LSTM at spike position:
  "Values jumped from ~50 to 95"
  Context: Only knows stable past

Backward LSTM at spike position:
  "Values returned to ~50 after reaching 95"
  Context: Knows the spike resolved

Combined interpretation:
  "Transient anomaly, not sustained damage"

Without backward:
  Could mistake transient spike for onset of failure
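The contrast in the example above can be illustrated without any learned model. The sketch below uses two hand-written summaries as stand-ins for what each direction could encode at the spike position: the jump relative to the preceding readings (forward view) and the gap relative to the following readings (backward view). The function name `spike_views` and the second, non-recovering sequence are hypothetical additions for contrast; a trained BiLSTM learns its own features rather than these.

```python
def spike_views(readings, t):
    """Return (forward_view, backward_view) at position t.

    forward_view : reading minus the mean of everything before t (size of the jump)
    backward_view: reading minus the mean of everything after t (how far values fell back)
    Hand-written stand-ins for learned features, for illustration only.
    """
    before = readings[:t]
    after = readings[t + 1:]
    forward_view = readings[t] - sum(before) / len(before)
    backward_view = readings[t] - sum(after) / len(after)
    return forward_view, backward_view

transient = [50, 52, 95, 54, 51]   # readings from the example above
sustained = [50, 52, 95, 96, 97]   # hypothetical contrast: no recovery

print(spike_views(transient, 2))  # (44.0, 42.5)
print(spike_views(sustained, 2))  # (44.0, -1.5)
```

Both sequences produce the same forward view (a jump of 44.0); only the backward view separates the spike that resolved from the level that stayed high.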

Offline vs Online Processing

Training vs Deployment Context

Bidirectionality is valid because our model operates on fixed 30-timestep windows. During both training and inference, the complete window is available before the model runs. A truly real-time streaming application, which must produce an output at each timestep before later observations exist, would need a unidirectional LSTM, but that is not our use case.
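A small sketch of the windowing assumption: the window ends at the most recent observed cycle, so everything the backward pass reads has already happened by the time a prediction is made. The helper `latest_window` and the placeholder history are hypothetical; the window length of 30 comes from this section.

```python
WINDOW = 30

def latest_window(history, window=WINDOW):
    """Return the most recent `window` observations, or None if too few exist yet."""
    if len(history) < window:
        return None
    return history[-window:]

# Hypothetical engine history: one placeholder reading per cycle, cycles numbered 1..45.
history = list(range(1, 46))

win = latest_window(history)
# The window covers cycles 16..45, all already observed at prediction time (cycle 45).
print(win[0], win[-1], len(win))  # 16 45 30
```

Here "future context" means later positions inside the window, never observations beyond the current cycle.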

Summary

In this section, we motivated the use of bidirectional LSTMs:

CNN to LSTM transition: CNN captures local patterns; LSTM models temporal evolution
Bidirectional advantage: Each timestep has context from both past and future
BiLSTM mechanism: Forward + backward passes, concatenated outputs
Output dimension: 2H = 256 (128 from each direction)
RUL justification: Fixed windows allow looking both ways

Property	Value
Input from CNN	(B, 30, 64)
Hidden size (H)	128
BiLSTM output	(B, 30, 256)
Processing	Forward + Backward passes
Context per timestep	Entire 30-step window

Looking Ahead: With the motivation for the BiLSTM clear, the next section examines the mathematics of the LSTM cell: the gates, cell state, and hidden state updates that let LSTMs model long-range dependencies.