Chapter 6
Section 29 of 104

Two-Layer BiLSTM Design

Bidirectional LSTM Encoder

Learning Objectives

By the end of this section, you will:

  1. Understand why we stack two BiLSTM layers instead of one
  2. Trace data flow through the two-layer architecture
  3. Explain hierarchical temporal feature learning
  4. Calculate the parameter count for the BiLSTM encoder
  5. Apply dropout between layers for regularization
Why This Matters: Stacking LSTM layers creates a hierarchy of temporal abstractions—the first layer learns basic patterns, the second layer learns patterns of patterns. This mirrors the CNN hierarchy and provides richer representations for RUL prediction.

Why Two Layers?

A single BiLSTM layer already captures temporal dependencies. Why add a second?

Analogy: CNN Depth

Just as stacking CNN layers builds increasingly abstract spatial features, stacking LSTM layers builds increasingly abstract temporal features:

| Layer | CNN (Spatial) | LSTM (Temporal) |
|---|---|---|
| Layer 1 | Edges, gradients | Short-term patterns |
| Layer 2 | Textures, shapes | Medium-term dynamics |
| Layer 3+ | Objects, parts | Long-term trends |

Expressiveness Gains

A second layer provides:

  • Compositional learning: Layer 2 can learn functions of Layer 1's hidden states
  • Non-linear transformations: More depth = more complex mappings
  • Hierarchical abstraction: Higher layers see "summaries" of lower layers

Diminishing Returns

Why not three or more layers?

| Layers | Benefit | Cost |
|---|---|---|
| 1 | Baseline temporal modeling | ~400K params |
| 2 | Hierarchical features, +5-10% accuracy | ~1.2M params |
| 3 | Marginal improvement | +800K params, overfitting risk |
| 4+ | Often no improvement | High overfitting, slow training |

For C-MAPSS with ~20K training windows, two layers provide the best capacity-data balance.


Interactive: LSTM Cell Explorer

Before understanding how we stack BiLSTM layers, let's visualize how a single LSTM cell processes data. The visualizer below uses actual CNN output (8 features × 6 timesteps) and shows the step-by-step gate computations.

LSTM Cell Visualizer

Explore how LSTM processes the CNN output (8 features × 6 timesteps) step by step.


Example Input Data: CNN Output

We're using the output from Conv2 layer (8 features × 6 timesteps) as input to the LSTM. Each column represents a timestep, and each row is a feature channel extracted by the CNN.

After Conv2: 8 feature channels (rows) × 6 timesteps (columns)

| Feature | t₀ | t₁ | t₂ | t₃ | t₄ | t₅ |
|---|---|---|---|---|---|---|
| Out₀ | 0.58 | 0.42 | 0.52 | 0.54 | 0.21 | -0.19 |
| Out₁ | -0.36 | -0.83 | -0.57 | -0.67 | -0.64 | -0.41 |
| Out₂ | -1.01 | -0.90 | -0.87 | -0.88 | -0.91 | -0.06 |
| Out₃ | -0.12 | -0.15 | 0.04 | 0.01 | -0.08 | 0.11 |
| Out₄ | 0.15 | -0.02 | 0.17 | 0.05 | -0.06 | -0.17 |
| Out₅ | -1.06 | -0.94 | -0.80 | -0.67 | -0.16 | 0.26 |
| Out₆ | -0.47 | -0.24 | -0.22 | -0.27 | -0.27 | 0.04 |
| Out₇ | 0.50 | 0.58 | 0.40 | 0.42 | 0.36 | -0.24 |
[Interactive diagram: LSTM cell at t₀, showing the forget, input, candidate, and output gate computations, the cell-state update Cₜ, and the hidden state hₜ passed to the next timestep.]

Click step buttons or hover over diagram elements to trace the data flow

At t₀ the previous hidden and cell states are all zeros, so the new cell state is set entirely by the input gate and candidate:

| Quantity | Values |
|---|---|
| Input x₀ (from CNN) | [0.58, -0.36, -1.01, -0.12, 0.15, -1.06, -0.47, 0.50] |
| Previous hidden h | [0.00, 0.00, 0.00, 0.00] |
| Previous cell C | [0.00, 0.00, 0.00, 0.00] |
| Forget gate f | [0.63, 0.72, 0.55, 0.38] |
| Input gate i | [0.33, 0.61, 0.51, 0.64] |
| Candidate g | [-0.22, 0.27, 0.70, -0.33] |
| Output gate o | [0.36, 0.71, 0.32, 0.65] |
| New cell state C₀ = f⊙C + i⊙g | [-0.07, 0.17, 0.36, -0.21] |
| New hidden state h₀ = o⊙tanh(C₀) | [-0.03, 0.12, 0.11, -0.13] |

Key Insight: Why LSTM Solves Vanishing Gradients

The cell state C acts as a "highway" for gradients. When f ≈ 1, gradients flow unchanged through time: $\frac{\partial C_t}{\partial C_{t-1}} = f_t \approx 1$. This allows LSTMs to learn long-range dependencies that simple RNNs cannot capture.
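This highway behavior is easy to verify with autograd. In the minimal sketch below (scalar gate values are made up for illustration), the gradient of Cₜ with respect to Cₜ₋₁ comes out exactly equal to the forget gate:

```python
import torch

# Cell-state update: C_t = f * C_{t-1} + i * g
C_prev = torch.tensor(0.5, requires_grad=True)
f = torch.tensor(0.9)    # forget gate near 1 → gradient passes almost unchanged
i = torch.tensor(0.3)    # input gate
g = torch.tensor(-0.2)   # candidate value

C_t = f * C_prev + i * g
C_t.backward()
print(C_prev.grad)  # tensor(0.9000) — exactly f
```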

Try This: Click through different gates (Forget, Input, Candidate, etc.) to see exactly how LSTM processes information. Notice how the forget gate decides what to keep from the previous cell state, while the input gate controls what new information to add.

Two-Layer Architecture

Our BiLSTM encoder stacks two bidirectional LSTM layers, each with hidden size 128.

Data Flow

```text
Input from CNN: (batch, 30, 64)

┌────────────────────────────────────────────────────┐
│                  BiLSTM Layer 1                    │
│  Forward:  LSTM(64 → 128)  →  H→₁ ∈ (B, 30, 128)  │
│  Backward: LSTM(64 → 128)  →  H←₁ ∈ (B, 30, 128)  │
│  Concat:   H₁ = [H→₁; H←₁] ∈ (B, 30, 256)         │
└────────────────────────────────────────────────────┘

                 Dropout(p=0.3)

┌────────────────────────────────────────────────────┐
│                  BiLSTM Layer 2                    │
│  Forward:  LSTM(256 → 128) →  H→₂ ∈ (B, 30, 128)  │
│  Backward: LSTM(256 → 128) →  H←₂ ∈ (B, 30, 128)  │
│  Concat:   H₂ = [H→₂; H←₂] ∈ (B, 30, 256)         │
└────────────────────────────────────────────────────┘

Output: (batch, 30, 256)
```

Dimension Summary

| Stage | Shape | Description |
|---|---|---|
| CNN output | (B, 30, 64) | 64 local features per timestep |
| Layer 1 input | (B, 30, 64) | Same as CNN output |
| Layer 1 output | (B, 30, 256) | 128×2 bidirectional |
| After dropout | (B, 30, 256) | Regularized |
| Layer 2 input | (B, 30, 256) | Previous layer output |
| Layer 2 output | (B, 30, 256) | 128×2 bidirectional |

Constant Output Dimension

Both layers output 256 dimensions (128 per direction). This is a design choice—the hidden size doesn't need to increase with depth. Keeping it constant simplifies the architecture and residual connections.
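In PyTorch, this whole stack fits in a single `nn.LSTM` call. The sketch below is illustrative (the class name `BiLSTMEncoder` is ours, not from the book's codebase), but the layer hyperparameters match the design above:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the two-layer BiLSTM encoder (hypothetical class name)."""
    def __init__(self, input_size=64, hidden_size=128, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=2,           # two stacked BiLSTM layers
            batch_first=True,       # tensors are (B, T, D)
            bidirectional=True,     # forward + backward → 2 × hidden_size
            dropout=dropout,        # applied between the two layers only
        )

    def forward(self, x):           # x: (B, 30, 64) from the CNN
        out, _ = self.lstm(x)
        return out                  # (B, 30, 256)

encoder = BiLSTMEncoder()
x = torch.randn(4, 30, 64)
print(encoder(x).shape)  # torch.Size([4, 30, 256])
```

Note that `num_layers=2` with `bidirectional=True` gives exactly the data flow diagrammed above: layer 2 receives layer 1's 256-dim concatenated output as its input.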


Interactive: BiLSTM Explorer

Now let's see how bidirectional processing works. The BiLSTM runs two separate LSTM networks—one processing left-to-right (forward), the other right-to-left (backward)—and concatenates their outputs.

BiLSTM Visualizer

See how BiLSTM processes the sequence in both directions and combines the results.

BiLSTM Architecture

[Interactive diagram: a forward LSTM chain processing x₀…x₅ left to right, a backward chain processing right to left, and the concatenated output [h→; h←] at each timestep.]
Forward LSTM

Processes the sequence from left to right (t₀ → t₅). Captures past context - what happened before each position.

$$h^{\rightarrow}_t = \text{LSTM}^{\rightarrow}(x_t,\ h^{\rightarrow}_{t-1})$$
Backward LSTM

Processes the sequence from right to left (t₅ → t₀). Captures future context - what happens after each position.

$$h^{\leftarrow}_t = \text{LSTM}^{\leftarrow}(x_t,\ h^{\leftarrow}_{t+1})$$
Output Concatenation

At each timestep, the forward and backward hidden states are concatenated:

$$h_t = [h^{\rightarrow}_t ;\ h^{\leftarrow}_t]$$

Output dimension: 4 (forward) + 4 (backward) = 8
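The concatenation can be checked with a toy `nn.LSTM` matching the visualizer's sizes (8 input features, hidden size 4 per direction); the sizes here are illustrative only:

```python
import torch
import torch.nn as nn

# Toy BiLSTM: 8 input features, hidden size 4 per direction → 8-dim output
bilstm = nn.LSTM(input_size=8, hidden_size=4,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 6, 8)          # 6 timesteps of 8 CNN features
out, (h_n, c_n) = bilstm(x)

print(out.shape)                  # torch.Size([1, 6, 8])
forward_part = out[..., :4]       # h→ at every timestep
backward_part = out[..., 4:]      # h← at every timestep
```

A useful sanity check: the forward half at the last timestep equals the forward direction's final hidden state `h_n[0]`, while the backward half at the first timestep equals `h_n[1]` (the backward LSTM "finishes" at t₀).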

Why Use BiLSTM for Predictive Maintenance?

In predictive maintenance, sensor readings form a time series. A BiLSTM can detect patterns that depend on context from both directions:

  • Forward context: What events led to this sensor reading?
  • Backward context: What happened after this reading?
  • Combined: together, these help identify gradual degradation patterns for RUL prediction
Key Insight: At each timestep, the forward LSTM captures "what came before" while the backward LSTM captures "what comes after." The concatenated output gives each position full context from both directions—crucial for understanding degradation patterns in sensor data.

Hierarchical Temporal Features

Each layer learns different levels of temporal abstraction.

Layer 1: Low-Level Temporal Patterns

The first layer directly processes CNN features, learning:

  • Immediate transitions: How features change from one timestep to the next
  • Short-term memory: Recent history relevant to current state
  • Local temporal context: Patterns spanning a few timesteps

Layer 2: High-Level Temporal Patterns

The second layer processes Layer 1's hidden states, learning:

  • Patterns of patterns: How low-level dynamics combine
  • Long-term trends: Overall trajectory of degradation
  • Abstract dynamics: "Accelerating degradation" vs "steady decline"

Parameter Count Analysis

Let us calculate the number of parameters in each BiLSTM layer.

LSTM Parameter Formula

For an LSTM with input size $D$ and hidden size $H$:

$$\text{Params} = 4 \times [(D + H) \times H + H]$$

The factor of 4 accounts for the four gates (forget, input, candidate, output). Each gate has:

  • Weight matrix: $(D + H) \times H$
  • Bias: $H$
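The formula is easy to turn into a quick sanity check. The helper below is ours, not library code; note that PyTorch's `nn.LSTM` actually stores two bias vectors per gate (`bias_ih` and `bias_hh`), so its reported parameter count is slightly higher than this single-bias formula:

```python
def lstm_params(input_size: int, hidden_size: int, bidirectional: bool = True) -> int:
    # 4 gates, each with a weight matrix over [x_t; h_{t-1}] and one bias vector
    per_direction = 4 * ((input_size + hidden_size) * hidden_size + hidden_size)
    return per_direction * (2 if bidirectional else 1)

layer1 = lstm_params(64, 128)    # BiLSTM Layer 1: D=64,  H=128
layer2 = lstm_params(256, 128)   # BiLSTM Layer 2: D=256, H=128
print(layer1, layer2, layer1 + layer2)  # 197632 394240 591872
```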

BiLSTM Layer 1

Input size $D = 64$ (CNN features), hidden size $H = 128$. Per direction:

$$4 \times [(64 + 128) \times 128 + 128] = 4 \times 24{,}704 = 98{,}816$$

Doubling for the two directions gives $2 \times 98{,}816 = 197{,}632$ parameters.

BiLSTM Layer 2

Input size $D = 256$ (Layer 1's bidirectional output), hidden size $H = 128$. Per direction:

$$4 \times [(256 + 128) \times 128 + 128] = 4 \times 49{,}280 = 197{,}120$$

Doubling for the two directions gives $2 \times 197{,}120 = 394{,}240$ parameters.

Total BiLSTM Parameters

| Component | Parameters |
|---|---|
| BiLSTM Layer 1 | 197,632 |
| BiLSTM Layer 2 | 394,240 |
| Total | 591,872 |

The two-layer BiLSTM accounts for approximately 592K of the model's total 3.5M parameters.


Dropout Between Layers

We apply dropout between the two BiLSTM layers to prevent overfitting.

Placement

```text
Layer 1 output: (B, 30, 256)

   Dropout(p=0.3)  ← Applied here

Layer 2 input: (B, 30, 256)
```

Why Between Layers?

  • Prevents co-adaptation: Layer 2 cannot rely on specific Layer 1 neurons
  • Regularizes hierarchy: Each layer must be independently useful
  • Standard practice: PyTorch's nn.LSTM has a built-in dropout parameter for this

Rate Selection

| Rate | Effect | Recommendation |
|---|---|---|
| p = 0.1 | Light regularization | Risk of overfitting |
| p = 0.3 | Moderate regularization | Our choice |
| p = 0.5 | Strong regularization | May hurt performance |

No Dropout on Last Layer Output

PyTorch's dropout parameter applies between layers, not after the last layer. If you need dropout on the final output, add it separately. This prevents accidentally regularizing the representation that feeds attention.
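A short sketch of this distinction (the extra `Dropout` module at the end is shown only to illustrate what would be needed; our encoder deliberately omits it so the attention input stays unregularized):

```python
import torch
import torch.nn as nn

# nn.LSTM's `dropout` is applied between stacked layers only —
# the final layer's output leaves the module un-dropped.
encoder = nn.LSTM(64, 128, num_layers=2, batch_first=True,
                  bidirectional=True, dropout=0.3)

out, _ = encoder(torch.randn(2, 30, 64))
print(out.shape)  # torch.Size([2, 30, 256]) — no dropout on this tensor

# If dropout on the final output were ever wanted, it must be added explicitly:
output_dropout = nn.Dropout(p=0.3)
```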


Summary

In this section, we designed the two-layer BiLSTM encoder:

  1. Two layers: Hierarchical temporal abstraction without overfitting
  2. Hidden size 128: 256-dim output per layer (bidirectional)
  3. Hierarchical features: Layer 1 captures low-level dynamics, Layer 2 captures high-level trends
  4. ~592K parameters: 198K (Layer 1) + 394K (Layer 2)
  5. Dropout 0.3: Between layers for regularization

| Property | Layer 1 | Layer 2 |
|---|---|---|
| Input dimension | 64 | 256 |
| Hidden dimension | 128 | 128 |
| Output dimension | 256 | 256 |
| Parameters | ~198K | ~394K |
| Dropout after | 0.3 | None (end of encoder) |
Looking Ahead: PyTorch's built-in LSTM applies no normalization internally; its training stability comes from careful weight initialization. Adding explicit layer normalization can further stabilize training, especially for deep networks. The next section covers layer normalization integration.

With the two-layer design established, we now examine layer normalization for enhanced training stability.