Chapter 6
Section 29 of 104

Two-Layer BiLSTM Design

Bidirectional LSTM Encoder

Learning Objectives

By the end of this section, you will:

  1. Understand why we stack two BiLSTM layers instead of one
  2. Trace data flow through the two-layer architecture
  3. Explain hierarchical temporal feature learning
  4. Calculate the parameter count for the BiLSTM encoder
  5. Apply dropout between layers for regularization
Why This Matters: Stacking LSTM layers creates a hierarchy of temporal abstractions—the first layer learns basic patterns, the second layer learns patterns of patterns. This mirrors the CNN hierarchy and provides richer representations for RUL prediction.

Why Two Layers?

A single BiLSTM layer already captures temporal dependencies. Why add a second?

Analogy: CNN Depth

Just as stacking CNN layers builds increasingly abstract spatial features, stacking LSTM layers builds increasingly abstract temporal features:

| Layer | CNN (Spatial) | LSTM (Temporal) |
|---|---|---|
| Layer 1 | Edges, gradients | Short-term patterns |
| Layer 2 | Textures, shapes | Medium-term dynamics |
| Layer 3+ | Objects, parts | Long-term trends |

Expressiveness Gains

A second layer provides:

  • Compositional learning: Layer 2 can learn functions of Layer 1's hidden states
  • Non-linear transformations: More depth = more complex mappings
  • Hierarchical abstraction: Higher layers see "summaries" of lower layers

Diminishing Returns

Why not three or more layers?

| Layers | Benefit | Cost |
|---|---|---|
| 1 | Baseline temporal modeling | ~400K params |
| 2 | Hierarchical features, +5-10% accuracy | ~1.2M params |
| 3 | Marginal improvement | +800K params, overfitting risk |
| 4+ | Often no improvement | High overfitting, slow training |

For C-MAPSS with ~20K training windows, two layers provide the best capacity-data balance.


Interactive: LSTM Cell Explorer

Before understanding how we stack BiLSTM layers, let's visualize how a single LSTM cell processes data. The visualizer below uses actual CNN output (8 features × 6 timesteps) and shows the step-by-step gate computations.

LSTM Cell Visualizer

Explore how LSTM processes the CNN output (8 features × 6 timesteps) step by step.


Example Input Data: CNN Output

We're using the output from Conv2 layer (8 features × 6 timesteps) as input to the LSTM. Each column represents a timestep, and each row is a feature channel extracted by the CNN.

After Conv2: 8 feature channels (rows) × 6 timesteps (columns)

| Feature | t₀ | t₁ | t₂ | t₃ | t₄ | t₅ |
|---|---|---|---|---|---|---|
| Out₀ | 0.58 | 0.42 | 0.52 | 0.54 | 0.21 | -0.19 |
| Out₁ | -0.36 | -0.83 | -0.57 | -0.67 | -0.64 | -0.41 |
| Out₂ | -1.01 | -0.90 | -0.87 | -0.88 | -0.91 | -0.06 |
| Out₃ | -0.12 | -0.15 | 0.04 | 0.01 | -0.08 | 0.11 |
| Out₄ | 0.15 | -0.02 | 0.17 | 0.05 | -0.06 | -0.17 |
| Out₅ | -1.06 | -0.94 | -0.80 | -0.67 | -0.16 | 0.26 |
| Out₆ | -0.47 | -0.24 | -0.22 | -0.27 | -0.27 | 0.04 |
| Out₇ | 0.50 | 0.58 | 0.40 | 0.42 | 0.36 | -0.24 |
[Interactive diagram: LSTM cell at t₀, showing the forget, input, candidate, and output gate computations, the cell-state update Cₜ, and the hidden state hₜ passed to the next timestep.]

Click step buttons or hover over diagram elements to trace the data flow

At t₀ the previous hidden and cell states are all zeros, so the new cell state is set entirely by the input gate and candidate:

| Quantity | Values |
|---|---|
| Input x₀ (from CNN) | [0.58, -0.36, -1.01, -0.12, 0.15, -1.06, -0.47, 0.50] |
| Previous hidden h | [0.00, 0.00, 0.00, 0.00] |
| Previous cell C | [0.00, 0.00, 0.00, 0.00] |
| Forget gate f | [0.63, 0.72, 0.55, 0.38] |
| Input gate i | [0.33, 0.61, 0.51, 0.64] |
| Candidate g | [-0.22, 0.27, 0.70, -0.33] |
| Output gate o | [0.36, 0.71, 0.32, 0.65] |
| New cell state C₀ = f⊙C + i⊙g | [-0.07, 0.17, 0.36, -0.21] |
| New hidden state h₀ = o⊙tanh(C₀) | [-0.03, 0.12, 0.11, -0.13] |

Key Insight: Why LSTM Solves Vanishing Gradients

The cell state C acts as a "highway" for gradients. When f ≈ 1, gradients flow unchanged through time: $\frac{\partial C_t}{\partial C_{t-1}} = f_t \approx 1$. This allows LSTMs to learn long-range dependencies that simple RNNs cannot capture.
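This highway behavior is easy to verify with autograd. In the minimal sketch below (scalar gate values are made up for illustration), the gradient of Cₜ with respect to Cₜ₋₁ comes out exactly equal to the forget gate:

```python
import torch

# Cell-state update: C_t = f * C_{t-1} + i * g
C_prev = torch.tensor(0.5, requires_grad=True)
f = torch.tensor(0.9)    # forget gate near 1 → gradient passes almost unchanged
i = torch.tensor(0.3)    # input gate
g = torch.tensor(-0.2)   # candidate value

C_t = f * C_prev + i * g
C_t.backward()
print(C_prev.grad)  # tensor(0.9000) — exactly f
```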

Try This: Click through different gates (Forget, Input, Candidate, etc.) to see exactly how LSTM processes information. Notice how the forget gate decides what to keep from the previous cell state, while the input gate controls what new information to add.

Two-Layer Architecture

Our BiLSTM encoder stacks two bidirectional LSTM layers, each with hidden size 128.

Data Flow

```text
Input from CNN: (batch, 30, 64)

┌────────────────────────────────────────────────────┐
│                  BiLSTM Layer 1                    │
│  Forward:  LSTM(64 → 128)  →  H→₁ ∈ (B, 30, 128)  │
│  Backward: LSTM(64 → 128)  →  H←₁ ∈ (B, 30, 128)  │
│  Concat:   H₁ = [H→₁; H←₁] ∈ (B, 30, 256)         │
└────────────────────────────────────────────────────┘

                 Dropout(p=0.3)

┌────────────────────────────────────────────────────┐
│                  BiLSTM Layer 2                    │
│  Forward:  LSTM(256 → 128) →  H→₂ ∈ (B, 30, 128)  │
│  Backward: LSTM(256 → 128) →  H←₂ ∈ (B, 30, 128)  │
│  Concat:   H₂ = [H→₂; H←₂] ∈ (B, 30, 256)         │
└────────────────────────────────────────────────────┘

Output: (batch, 30, 256)
```

Dimension Summary

| Stage | Shape | Description |
|---|---|---|
| CNN output | (B, 30, 64) | 64 local features per timestep |
| Layer 1 input | (B, 30, 64) | Same as CNN output |
| Layer 1 output | (B, 30, 256) | 128×2 bidirectional |
| After dropout | (B, 30, 256) | Regularized |
| Layer 2 input | (B, 30, 256) | Previous layer output |
| Layer 2 output | (B, 30, 256) | 128×2 bidirectional |

Constant Output Dimension

Both layers output 256 dimensions (128 per direction). This is a design choice—the hidden size doesn't need to increase with depth. Keeping it constant simplifies the architecture and residual connections.
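In PyTorch, this whole stack fits in a single `nn.LSTM` call. The sketch below is illustrative (the class name `BiLSTMEncoder` is ours, not from the book's codebase), but the layer hyperparameters match the design above:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the two-layer BiLSTM encoder (hypothetical class name)."""
    def __init__(self, input_size=64, hidden_size=128, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=2,           # two stacked BiLSTM layers
            batch_first=True,       # tensors are (B, T, D)
            bidirectional=True,     # forward + backward → 2 × hidden_size
            dropout=dropout,        # applied between the two layers only
        )

    def forward(self, x):           # x: (B, 30, 64) from the CNN
        out, _ = self.lstm(x)
        return out                  # (B, 30, 256)

encoder = BiLSTMEncoder()
x = torch.randn(4, 30, 64)
print(encoder(x).shape)  # torch.Size([4, 30, 256])
```

Note that `num_layers=2` with `bidirectional=True` gives exactly the data flow diagrammed above: layer 2 receives layer 1's 256-dim concatenated output as its input.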


Interactive: BiLSTM Explorer

Now let's see how bidirectional processing works. The BiLSTM runs two separate LSTM networks—one processing left-to-right (forward), the other right-to-left (backward)—and concatenates their outputs.

BiLSTM Visualizer

See how BiLSTM processes the sequence in both directions and combines the results.

BiLSTM Architecture

[Interactive diagram: a forward LSTM chain processing x₀…x₅ left to right, a backward chain processing right to left, and the concatenated output [h→; h←] at each timestep.]
Forward LSTM

Processes the sequence from left to right (t₀ → t₅). Captures past context - what happened before each position.

$$h^{\rightarrow}_t = \text{LSTM}^{\rightarrow}(x_t,\ h^{\rightarrow}_{t-1})$$
Backward LSTM

Processes the sequence from right to left (t₅ → t₀). Captures future context - what happens after each position.

$$h^{\leftarrow}_t = \text{LSTM}^{\leftarrow}(x_t,\ h^{\leftarrow}_{t+1})$$
Output Concatenation

At each timestep, the forward and backward hidden states are concatenated:

$$h_t = [h^{\rightarrow}_t ;\ h^{\leftarrow}_t]$$

Output dimension: 4 (forward) + 4 (backward) = 8
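The concatenation can be checked with a toy `nn.LSTM` matching the visualizer's sizes (8 input features, hidden size 4 per direction); the sizes here are illustrative only:

```python
import torch
import torch.nn as nn

# Toy BiLSTM: 8 input features, hidden size 4 per direction → 8-dim output
bilstm = nn.LSTM(input_size=8, hidden_size=4,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 6, 8)          # 6 timesteps of 8 CNN features
out, (h_n, c_n) = bilstm(x)

print(out.shape)                  # torch.Size([1, 6, 8])
forward_part = out[..., :4]       # h→ at every timestep
backward_part = out[..., 4:]      # h← at every timestep
```

A useful sanity check: the forward half at the last timestep equals the forward direction's final hidden state `h_n[0]`, while the backward half at the first timestep equals `h_n[1]` (the backward LSTM "finishes" at t₀).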

Why Use BiLSTM for Predictive Maintenance?

In predictive maintenance, sensor readings form a time series. A BiLSTM can detect patterns that depend on context from both directions:

  • Forward context: What events led to this sensor reading?
  • Backward context: What happened after this reading?
  • Combined: together, these help identify gradual degradation patterns for RUL prediction
Key Insight: At each timestep, the forward LSTM captures "what came before" while the backward LSTM captures "what comes after." The concatenated output gives each position full context from both directions—crucial for understanding degradation patterns in sensor data.

Hierarchical Temporal Features

Each layer learns different levels of temporal abstraction.

Layer 1: Low-Level Temporal Patterns

The first layer directly processes CNN features, learning:

  • Immediate transitions: How features change from one timestep to the next
  • Short-term memory: Recent history relevant to current state
  • Local temporal context: Patterns spanning a few timesteps

Layer 2: High-Level Temporal Patterns

The second layer processes Layer 1's hidden states, learning:

  • Patterns of patterns: How low-level dynamics combine
  • Long-term trends: Overall trajectory of degradation
  • Abstract dynamics: "Accelerating degradation" vs "steady decline"

Parameter Count Analysis

Let us calculate the number of parameters in each BiLSTM layer.

LSTM Parameter Formula

For an LSTM with input size $D$ and hidden size $H$:

$$\text{Params} = 4 \times [(D + H) \times H + H]$$

The factor of 4 accounts for the four gates (forget, input, candidate, output). Each gate has:

  • Weight matrix: $(D + H) \times H$
  • Bias: $H$
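The formula is easy to turn into a quick sanity check. The helper below is ours, not library code; note that PyTorch's `nn.LSTM` actually stores two bias vectors per gate (`bias_ih` and `bias_hh`), so its reported parameter count is slightly higher than this single-bias formula:

```python
def lstm_params(input_size: int, hidden_size: int, bidirectional: bool = True) -> int:
    # 4 gates, each with a weight matrix over [x_t; h_{t-1}] and one bias vector
    per_direction = 4 * ((input_size + hidden_size) * hidden_size + hidden_size)
    return per_direction * (2 if bidirectional else 1)

layer1 = lstm_params(64, 128)    # BiLSTM Layer 1: D=64,  H=128
layer2 = lstm_params(256, 128)   # BiLSTM Layer 2: D=256, H=128
print(layer1, layer2, layer1 + layer2)  # 197632 394240 591872
```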

BiLSTM Layer 1

Input size $D = 64$ (CNN features), hidden size $H = 128$. Per direction:

$$4 \times [(64 + 128) \times 128 + 128] = 4 \times 24{,}704 = 98{,}816$$

Doubling for the two directions gives $2 \times 98{,}816 = 197{,}632$ parameters.

BiLSTM Layer 2

Input size $D = 256$ (Layer 1's bidirectional output), hidden size $H = 128$. Per direction:

$$4 \times [(256 + 128) \times 128 + 128] = 4 \times 49{,}280 = 197{,}120$$

Doubling for the two directions gives $2 \times 197{,}120 = 394{,}240$ parameters.

Total BiLSTM Parameters

| Component | Parameters |
|---|---|
| BiLSTM Layer 1 | 197,632 |
| BiLSTM Layer 2 | 394,240 |
| Total | 591,872 |

The two-layer BiLSTM accounts for approximately 592K of the model's total 3.5M parameters.


Dropout Between Layers

We apply dropout between the two BiLSTM layers to prevent overfitting.

Placement

```text
Layer 1 output: (B, 30, 256)

   Dropout(p=0.3)  ← Applied here

Layer 2 input: (B, 30, 256)
```

Why Between Layers?

  • Prevents co-adaptation: Layer 2 cannot rely on specific Layer 1 neurons
  • Regularizes hierarchy: Each layer must be independently useful
  • Standard practice: PyTorch's nn.LSTM has a built-in dropout parameter for this

Rate Selection

| Rate | Effect | Recommendation |
|---|---|---|
| p = 0.1 | Light regularization | Risk of overfitting |
| p = 0.3 | Moderate regularization | Our choice |
| p = 0.5 | Strong regularization | May hurt performance |

No Dropout on Last Layer Output

PyTorch's dropout parameter applies between layers, not after the last layer. If you need dropout on the final output, add it separately. This prevents accidentally regularizing the representation that feeds attention.
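A short sketch of this distinction (the extra `Dropout` module at the end is shown only to illustrate what would be needed; our encoder deliberately omits it so the attention input stays unregularized):

```python
import torch
import torch.nn as nn

# nn.LSTM's `dropout` is applied between stacked layers only —
# the final layer's output leaves the module un-dropped.
encoder = nn.LSTM(64, 128, num_layers=2, batch_first=True,
                  bidirectional=True, dropout=0.3)

out, _ = encoder(torch.randn(2, 30, 64))
print(out.shape)  # torch.Size([2, 30, 256]) — no dropout on this tensor

# If dropout on the final output were ever wanted, it must be added explicitly:
output_dropout = nn.Dropout(p=0.3)
```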


Summary

In this section, we designed the two-layer BiLSTM encoder:

  1. Two layers: Hierarchical temporal abstraction without overfitting
  2. Hidden size 128: 256-dim output per layer (bidirectional)
  3. Hierarchical features: Layer 1 captures low-level dynamics, Layer 2 captures high-level trends
  4. ~592K parameters: 198K (Layer 1) + 394K (Layer 2)
  5. Dropout 0.3: Between layers for regularization

| Property | Layer 1 | Layer 2 |
|---|---|---|
| Input dimension | 64 | 256 |
| Hidden dimension | 128 | 128 |
| Output dimension | 256 | 256 |
| Parameters | ~198K | ~394K |
| Dropout after | 0.3 | None (end of encoder) |
Looking Ahead: PyTorch's built-in LSTM applies no normalization internally; its training stability comes from careful weight initialization. Adding explicit layer normalization can further stabilize training, especially for deep networks. The next section covers layer normalization integration.

With the two-layer design established, we now examine layer normalization for enhanced training stability.