Chapter 9

Two-Layer BiLSTM Design (h = 256)

Bidirectional LSTM Encoder

Why Two Layers

A single BiLSTM layer is enough for many sequence problems, but stacking two layers buys two things. (1) The first layer learns shorter-range temporal patterns; the second integrates them into longer-range context. (2) The bidirectional concat at layer 1 produces 2H-dimensional per-cycle features, which layer 2's gates can recombine and re-route in ways a single layer cannot. The paper's reference architecture uses two layers; the ablations in Chapter 27 show that a third layer adds cost without a meaningful accuracy gain.

The defaults. hidden_size=256, num_layers=2, bidirectional=True, batch_first=True, dropout=0.3.

The Layer Geometry

| Component | Input dim | Output dim | Note |
|---|---|---|---|
| CNN output | | 64 (= F') | From §8.4 |
| BiLSTM layer 1 forward | 64 | 256 | One direction |
| BiLSTM layer 1 backward | 64 | 256 | Other direction |
| Layer 1 concat | | 512 | 2 × 256 |
| BiLSTM layer 2 forward | 512 | 256 | Reads layer 1 output |
| BiLSTM layer 2 backward | 512 | 256 | |
| Layer 2 concat | | 512 | Final BiLSTM output |
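
The bookkeeping in this table follows one rule: every layer after the first reads the previous layer's concatenated output. A minimal sketch in plain Python, assuming nothing beyond the sizes listed above (the helper name `bilstm_stack_dims` is hypothetical and not part of the book's code):

```python
def bilstm_stack_dims(input_size, hidden_size, num_layers, bidirectional=True):
    """Return (layer, per-direction input dim, layer output dim) for each layer."""
    num_dirs = 2 if bidirectional else 1
    dims = []
    in_dim = input_size
    for layer in range(1, num_layers + 1):
        out_dim = num_dirs * hidden_size   # concat over directions
        dims.append((layer, in_dim, out_dim))
        in_dim = out_dim                   # next layer reads this layer's concat
    return dims


for layer, d_in, d_out in bilstm_stack_dims(64, 256, 2):
    print(f"layer {layer}: each direction reads {d_in}, layer emits {d_out}")
```

With the backbone's sizes this gives 64 → 512 for layer 1 and 512 → 512 for layer 2, matching the rows above.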

Parameter Accounting

Per LSTM direction the count is $P = 4H(D + H + 1)$, where D is the input dimension and H the hidden size. Sum across both layers and both directions:

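| Layer | Direction | D | H | P = 4H(D+H+1) | Subtotal |
|---|---|---|---|---|---|
| 1 | forward | 64 | 256 | 4·256·(64+256+1) = 328,704 | |
| 1 | backward | 64 | 256 | 328,704 | 657,408 |
| 2 | forward | 512 | 256 | 4·256·(512+256+1) = 787,456 | |
| 2 | backward | 512 | 256 | 787,456 | 1,574,912 |
| Layers 1 + 2 grand total | | | | | 2,232,320 |
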
Note on PyTorch's actual count. The formula above assumes one bias vector per direction, but nn.LSTM stores two (bias_ih and bias_hh), so each direction holds 4H(D + H + 2) parameters. Between-layer dropout adds no parameters. The reported sum(p.numel()) on a real nn.LSTM with these args is therefore 2,236,416, which is 4,096 more than the grand total above (2,232,320). Either way: the BiLSTM is the largest component of the backbone.
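
The arithmetic can be checked without instantiating a model. The sketch below is plain Python; the function names and the `num_bias_vectors` switch are illustrative assumptions. Setting it to 1 reproduces the textbook formula 4H(D + H + 1); setting it to 2 mirrors nn.LSTM's storage of separate bias_ih and bias_hh vectors.

```python
def lstm_direction_params(D, H, num_bias_vectors=1):
    """One LSTM direction: 4 gates, each with (D + H) weights per hidden unit
    plus `num_bias_vectors` bias terms per hidden unit."""
    return 4 * H * (D + H + num_bias_vectors)


def bilstm_stack_params(input_size, H, num_layers, num_bias_vectors=1):
    """Total over all layers and both directions of a stacked BiLSTM."""
    total = 0
    D = input_size
    for _ in range(num_layers):
        total += 2 * lstm_direction_params(D, H, num_bias_vectors)  # fwd + bwd
        D = 2 * H  # the next layer reads the bidirectional concat
    return total


textbook = bilstm_stack_params(64, 256, 2, num_bias_vectors=1)
pytorch_layout = bilstm_stack_params(64, 256, 2, num_bias_vectors=2)
print(textbook)                   # 2232320
print(pytorch_layout)             # 2236416
print(pytorch_layout - textbook)  # 4096 = 4 LSTMs x 4H extra bias terms
```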

Python: Manual Two-Layer BiLSTM

Two stacked bidirectional layers, manually composed
🐍two_layer_bilstm_numpy.py
Header comment (schematic two-layer BiLSTM)
Comment block that frames the layout we are building.

import numpy as np
Standard alias.

def bilstm_layer(seq, fwd, bwd):
ONE bidirectional layer = forward pass + backward pass + concat. fwd / bwd are separate LSTMCellVector instances.
EXECUTION STATE
input: seq = (T, D) sequence
input: fwd = Forward-pass LSTM cell
input: bwd = Backward-pass LSTM cell (separate weights from fwd)
returns = (T, 2H) concat of fwd + bwd hidden states

T = len(seq)
Sequence length.

H = fwd.H
Hidden size (256 in our backbone).

h, c = np.zeros(H), np.zeros(H)
Forward zero-init.

h_fwd = np.zeros((T, H))
Pre-allocate forward hidden states.

for t in range(T):
Forward pass left-to-right.

h, c, _ = fwd.step(seq[t], h, c)
Step forward LSTM. Discard gate values.

h_fwd[t] = h
Stash forward hidden state.

h, c = np.zeros(H), np.zeros(H)
Backward zero-init, INDEPENDENT of forward state.

h_bwd = np.zeros((T, H))
Pre-allocate backward hidden states.

for t in range(T - 1, -1, -1):
Backward pass right-to-left.

h, c, _ = bwd.step(seq[t], h, c)
Step backward LSTM with backward weights.

h_bwd[t] = h
Stash at index t (not appended), so h_bwd stays aligned with h_fwd in time.

return np.concatenate([h_fwd, h_bwd], axis=-1)
Concat along feature axis. (T, H) + (T, H) → (T, 2H). nn.LSTM does the same internally.
EXECUTION STATE
Output shape = (30, 512) for our backbone

def two_layer_bilstm(seq, layer1_fwd, layer1_bwd, layer2_fwd, layer2_bwd):
TWO bidirectional layers stacked. Layer 1 input dim is 64 (from CNN); layer 2 input dim is 512 (the bidirectional concat from layer 1).

h1 = bilstm_layer(seq, layer1_fwd, layer1_bwd)
First bidirectional layer.
EXECUTION STATE
h1.shape = (T, 512)

h2 = bilstm_layer(h1, layer2_fwd, layer2_bwd)
Second bidirectional layer takes the FIRST layer's output as its input. Same H=256 per direction; final output again (T, 512).
EXECUTION STATE
h2.shape = (T, 512)

return h2
Output goes to attention (Chapter 10).

print("Layer 1 in/out:", "(64,) → (512,)")
Layer 1 expands the channel dim from 64 to 512 via the bidirectional concat.
EXECUTION STATE
Output = Layer 1 in/out: (64,) → (512,)

print("Layer 2 in/out:", "(512,) → (512,)")
Layer 2 keeps the dim, since its input is already 512.
EXECUTION STATE
Output = Layer 2 in/out: (512,) → (512,)

print("Stack output :", "(30, 512)")
Final output shape.
EXECUTION STATE
Output = Stack output : (30, 512)
```python
# Schematic two-layer bidirectional LSTM (vector, one batch sample).
# Layer 1 input dim = 64 (CNN output). Layer 2 input dim = 2H = 512 (bidirectional concat).
# Each layer has 2 directions; total 4 LSTMs per stack.

import numpy as np

def bilstm_layer(seq, fwd, bwd):
    """seq: (T, D). Returns (T, 2H): concat of forward & backward."""
    T = len(seq)
    H = fwd.H

    # Forward
    h, c = np.zeros(H), np.zeros(H)
    h_fwd = np.zeros((T, H))
    for t in range(T):
        h, c, _ = fwd.step(seq[t], h, c)
        h_fwd[t] = h

    # Backward
    h, c = np.zeros(H), np.zeros(H)
    h_bwd = np.zeros((T, H))
    for t in range(T - 1, -1, -1):
        h, c, _ = bwd.step(seq[t], h, c)
        h_bwd[t] = h

    return np.concatenate([h_fwd, h_bwd], axis=-1)


def two_layer_bilstm(seq, layer1_fwd, layer1_bwd, layer2_fwd, layer2_bwd):
    """seq: (T, 64). Returns (T, 512)."""
    h1 = bilstm_layer(seq, layer1_fwd, layer1_bwd)   # (T, 512)
    h2 = bilstm_layer(h1,  layer2_fwd, layer2_bwd)   # (T, 512)
    return h2


# (Each cell is an LSTMCellVector instance; the class is not shown here for brevity, see §9.2)
# Layer 1: input_size=64,  hidden_size=256
# Layer 2: input_size=512, hidden_size=256
print("Layer 1 in/out:", "(64,)  →  (512,)")
print("Layer 2 in/out:", "(512,) →  (512,)")
print("Stack output  :", "(30, 512)")
```
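
The listing above only prints the expected shapes because the cell class lives in §9.2. To run the stack end to end, the sketch below substitutes a hypothetical `TinyLSTMCell` that exposes the same `.H` attribute and `.step(x, h, c)` signature. Its random weights and gate ordering are assumptions for illustration, not the book's LSTMCellVector. The layer function is a compact restatement of `bilstm_layer` that writes both directions into one pre-allocated array.

```python
import numpy as np


class TinyLSTMCell:
    """Hypothetical stand-in for LSTMCellVector: exposes .H and .step(x, h, c)."""

    def __init__(self, D, H, rng):
        self.H = H
        # One stacked weight matrix for the four gates (i, f, g, o).
        self.W = rng.normal(0.0, 0.1, size=(4 * H, D + H))
        self.b = np.zeros(4 * H)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = (1.0 / (1.0 + np.exp(-v)) for v in (i, f, o))  # sigmoid gates
        g = np.tanh(g)                                            # candidate state
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new, (i, f, g, o)


def bilstm_layer(seq, fwd, bwd):
    """seq: (T, D). Returns (T, 2H) with forward in [:H] and backward in [H:]."""
    T, H = len(seq), fwd.H
    out = np.zeros((T, 2 * H))
    h, c = np.zeros(H), np.zeros(H)
    for t in range(T):                    # left to right
        h, c, _ = fwd.step(seq[t], h, c)
        out[t, :H] = h
    h, c = np.zeros(H), np.zeros(H)
    for t in range(T - 1, -1, -1):        # right to left
        h, c, _ = bwd.step(seq[t], h, c)
        out[t, H:] = h
    return out


rng = np.random.default_rng(0)
T, D, H = 30, 64, 256
layer1 = (TinyLSTMCell(D, H, rng), TinyLSTMCell(D, H, rng))
layer2 = (TinyLSTMCell(2 * H, H, rng), TinyLSTMCell(2 * H, H, rng))  # reads 2H = 512

seq = rng.normal(size=(T, D))             # fake CNN output for one sample
h1 = bilstm_layer(seq, *layer1)
h2 = bilstm_layer(h1, *layer2)
print(h1.shape, h2.shape)                 # (30, 512) (30, 512)
```

Note that the layer-2 cells are constructed with input dimension 2H; building them with H instead fails at the first matrix product.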

PyTorch: One-Liner via num_layers=2

Six-argument nn.LSTM replaces the entire NumPy stack
🐍bilstm_two_layers_torch.py
import torch
Top-level PyTorch.

import torch.nn as nn
Layer container.

torch.manual_seed(0)
Fixes the random seed so weights and inputs are reproducible.

bilstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=2, bidirectional=True, batch_first=True, dropout=0.3)
One nn.LSTM call replaces the entire two-layer NumPy stack.

input_size=64
Matches the CNN frontend's output channel count.

hidden_size=256
Per-direction H. Output is 2*256 = 512 per timestep.

num_layers=2
PyTorch automatically wires layer 1's bidirectional output (2H = 512) into layer 2's input. You don't have to specify the layer-2 input dim; it is inferred.

bidirectional=True
Runs a second, backward LSTM in every layer; doubles the output dim.

batch_first=True
(B, T, F) instead of (T, B, F). The book's convention.

dropout=0.3
Dropout BETWEEN stacked layers: applied to layer 1's output before it enters layer 2. Has NO effect with num_layers=1.

x = torch.randn(2, 30, 64)
Fake CNN-frontend output.

out, (h_n, c_n) = bilstm(x)
One forward call processes both layers and both directions.

print("input :", tuple(x.shape))
Verify input.
EXECUTION STATE
Output = input : (2, 30, 64)

print("output :", tuple(out.shape))
30 timesteps preserved; channels expand to 512.
EXECUTION STATE
Output = output : (2, 30, 512)

print("h_n :", tuple(h_n.shape))
Final hidden across all layers and directions: (num_layers × num_dirs, B, H) = (4, 2, 256).
EXECUTION STATE
Output = h_n : (4, 2, 256)

print("c_n :", tuple(c_n.shape))
Same shape for c.
EXECUTION STATE
Output = c_n : (4, 2, 256)

print("# params:", sum(p.numel() for p in bilstm.parameters()))
About 2.24M params (nn.LSTM stores two bias vectors per direction), the largest single component of the backbone.
EXECUTION STATE
Output = # params: 2236416
```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same as Section 3.3's torch block, repeated for context
bilstm = nn.LSTM(
    input_size=64,                  # CNN output channel count
    hidden_size=256,                # per direction; output is 2 * 256 = 512
    num_layers=2,
    bidirectional=True,
    batch_first=True,               # (B, T, F) layout
    dropout=0.3,                    # applied between layer 1 and layer 2 only
)

x = torch.randn(2, 30, 64)          # (B, T, F)
out, (h_n, c_n) = bilstm(x)

print("input  :", tuple(x.shape))                        # (2, 30, 64)
print("output :", tuple(out.shape))                       # (2, 30, 512)
print("h_n    :", tuple(h_n.shape))                       # (4, 2, 256)
print("c_n    :", tuple(c_n.shape))                       # (4, 2, 256)
print("# params:", sum(p.numel() for p in bilstm.parameters()))
# # params: 2236416
```
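
Because `out` concatenates the two directions along the feature axis, it can be split back into forward and backward halves, and those halves can be related to `h_n`. The sketch below assumes the same constructor arguments as above. It relies on PyTorch's documented layout, in which `h_n` is ordered layer by layer with the forward direction before the backward one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 256
bilstm = nn.LSTM(input_size=64, hidden_size=H, num_layers=2,
                 bidirectional=True, batch_first=True, dropout=0.3)
bilstm.eval()  # switch off between-layer dropout so outputs are deterministic

x = torch.randn(2, 30, 64)                       # (B, T, F)
with torch.no_grad():
    out, (h_n, c_n) = bilstm(x)

fwd_out, bwd_out = out[..., :H], out[..., H:]    # each (2, 30, 256)

# h_n order: [layer1-fwd, layer1-bwd, layer2-fwd, layer2-bwd]
last_fwd = h_n[-2]   # top layer, forward direction
last_bwd = h_n[-1]   # top layer, backward direction

print(torch.allclose(fwd_out[:, -1], last_fwd))  # forward finishes at t = T-1
print(torch.allclose(bwd_out[:, 0], last_bwd))   # backward finishes at t = 0
```

Both checks should print True. The practical consequence is that `out[:, -1]` is not a full summary of the sequence: its backward half has seen only the last timestep.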

Stack-Depth Choices Across ML

| Architecture | Depth | Why |
|---|---|---|
| RUL backbone (this book) | 2 BiLSTM layers | Diminishing returns at 3+; cost grows fast |
| seq2seq translation (LSTM era) | 4-8 LSTM layers | Long sentences benefit from depth |
| DeepSpeech 2 | 5-7 BiLSTM layers | Audio is much longer than 30 cycles |
| GPT-2 | 12-48 attention layers | Transformers need depth more than LSTMs |
| Music generation (LSTM-PixelRNN) | 5+ LSTM layers | Hierarchical musical structure |
| Anomaly detection on telemetry | 1-2 BiLSTM layers | Short sequences; depth just adds variance |

Three Stacking Pitfalls

Pitfall 1: Wrong layer-2 input dim. If you wire layer 2 manually with input_size=H instead of 2H, you get a shape mismatch at the first forward call. nn.LSTM(num_layers=2, bidirectional=True) handles this internally, so prefer it over manually composing two single-layer modules.
Pitfall 2: Dropout > 0 with num_layers = 1. PyTorch warns when you set dropout > 0 with num_layers=1, because the dropout argument only takes effect BETWEEN stacked layers. Dropout on the recurrent connections (variational dropout, for example) is not provided by nn.LSTM and needs a custom cell or wrapper.
Pitfall 3: Increasing depth without ablation. Three or four layers produce more parameters but rarely better RMSE on C-MAPSS. Always run a depth ablation before committing; on FD002, num_layers=2 hits the sweet spot.
The point. Two stacked bidirectional LSTM layers with hidden=256 produce ~2.24M parameters and a (B, 30, 512) output. That output is the input to the attention layer in Chapter 10.
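
For cases where the layers do have to be composed by hand (for instance, to insert something between them), Pitfalls 1 and 2 can be made concrete. The sketch below uses a hypothetical `ManualTwoLayerBiLSTM` module with an explicit nn.Dropout standing in for the `dropout=0.3` argument. It is an illustration under those assumptions, not the book's backbone code.

```python
import torch
import torch.nn as nn

H = 256


class ManualTwoLayerBiLSTM(nn.Module):
    """Hypothetical manual stack: two single-layer BiLSTMs with explicit dropout."""

    def __init__(self, input_size=64, hidden_size=H, p_drop=0.3):
        super().__init__()
        self.layer1 = nn.LSTM(input_size, hidden_size,
                              bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p_drop)       # between-layer dropout, made explicit
        # Layer 2 must read 2 * hidden_size features, not hidden_size.
        self.layer2 = nn.LSTM(2 * hidden_size, hidden_size,
                              bidirectional=True, batch_first=True)

    def forward(self, x):
        h1, _ = self.layer1(x)               # (B, T, 2H)
        h2, _ = self.layer2(self.drop(h1))   # (B, T, 2H)
        return h2


model = ManualTwoLayerBiLSTM()
x = torch.randn(2, 30, 64)
print(tuple(model(x).shape))                 # (2, 30, 512)

# The mistake from Pitfall 1: layer 2 declared with input_size=H.
wrong_layer2 = nn.LSTM(H, H, bidirectional=True, batch_first=True)
try:
    wrong_layer2(torch.randn(2, 30, 2 * H))
except RuntimeError as err:
    print("shape mismatch:", type(err).__name__)
```

The manual version holds the same number of parameters as the single nn.LSTM call, so the only reason to prefer it is the extra control over what happens between the layers.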

Takeaway

  • Default architecture: 2 layers, H=256, bidirectional, batch_first.
  • Layer 1 takes 64-D input from CNN; layer 2 takes 512-D from layer 1's bidirectional concat.
  • ~2.24M parameters as counted by PyTorch, the biggest single backbone component.
  • 3+ layers do NOT help on C-MAPSS. Two is the sweet spot per the paper's ablation.