Chapter 9

PyTorch Implementation

Bidirectional LSTM Encoder

The Production BiLSTM Encoder

The contract:

| Spec | Value |
| --- | --- |
| Input shape | (B, 30, 64) |
| Output shape | (B, 30, 512) |
| Layers | 2 |
| Hidden size per direction | 256 |
| Bidirectional | Yes |
| batch_first | True |
| Dropout (between layers) | 0.3 |
| Total parameters | 2,236,416 (~2.24M) |
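The parameter total can be checked by hand. The sketch below assumes PyTorch's standard nn.LSTM layout without projection: per layer and direction, an input-to-hidden matrix of shape (4H, in), a hidden-to-hidden matrix of shape (4H, H), and two bias vectors of length 4H. The helper name lstm_param_count is illustrative and not part of the book's code.

```python
import torch.nn as nn


def lstm_param_count(input_size: int, hidden_size: int,
                     num_layers: int, bidirectional: bool = True) -> int:
    """Closed-form parameter count for nn.LSTM (no projection)."""
    num_dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layer 0 sees the raw features; deeper layers see both directions.
        in_dim = input_size if layer == 0 else num_dirs * hidden_size
        weights = 4 * hidden_size * (in_dim + hidden_size)
        biases = 2 * 4 * hidden_size          # bias_ih + bias_hh
        total += num_dirs * (weights + biases)
    return total


expected = lstm_param_count(64, 256, 2)
lstm = nn.LSTM(64, 256, num_layers=2, bidirectional=True, batch_first=True)
actual = sum(p.numel() for p in lstm.parameters())
print(expected, actual)  # 2236416 2236416
```

The first layer contributes 659,456 parameters and the second 1,576,960, because the second layer's input is the 512-wide concatenation of both directions. The count depends on input_size: with input_size=17 the same formula gives 2,140,160.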

The Full PyTorch Module

BiLSTMEncoder + smoke test + composition with CNNFrontend
bilstm_encoder_full.py
import torch

Top-level PyTorch.

import torch.nn as nn

Layer container.

class BiLSTMEncoder(nn.Module):

Production wrapper around nn.LSTM. Sets the defaults the paper uses.

def __init__(self, input_size=64, hidden_size=256, num_layers=2, dropout_p=0.3):

Four hyperparameters; the defaults match the paper's reference architecture.

super().__init__()

Initialise nn.Module.

self.lstm = nn.LSTM(...)

One nn.LSTM call.

input_size=input_size

Channel count from CNN frontend (default 64).

hidden_size=hidden_size

Per-direction hidden dim (default 256).

num_layers=num_layers

Number of stacked bidirectional layers (default 2).

bidirectional=True

Adds a backward-direction set of weights to every layer and doubles the output dim to 2 × hidden_size.

batch_first=True

(B, T, F) order - book convention.

dropout=dropout_p if num_layers > 1 else 0.0

Guard against PyTorch's warning when num_layers=1. Dropout is only applied between stacked layers, so it has no effect with a single layer.

def forward(self, x):

Standard forward.

out, _ = self.lstm(x)

We discard the (h_n, c_n) tuple - downstream uses only out (the per-timestep sequence).

Execution state: out.shape = (2, 30, 512) for B=2, T=30, hidden_size=256, bidirectional.

return out

Hand off to attention (Chapter 10).

torch.manual_seed(0)

Fixes the random seed so the run is reproducible.

encoder = BiLSTMEncoder(input_size=64, hidden_size=256, num_layers=2, dropout_p=0.3)

Instantiate.

x = torch.randn(2, 30, 64)

Fake CNN output. Real data: x = cnn(raw_input).

y = encoder(x)

Forward.

print("input :", tuple(x.shape))

Verify input.

Output: input : (2, 30, 64)

print("output :", tuple(y.shape))

Output channel dim 512 = 2 × 256.

Output: output : (2, 30, 512)

print("# params:", sum(p.numel() for p in encoder.parameters()))

About 2.24M with input_size=64.

Output: # params: 2236416

loss = y.sum()

Placeholder loss for the smoke test.

loss.backward()

Autograd through both layers and both directions. On a GPU, cuDNN's backward pass is fused just like its forward pass.

opt = torch.optim.AdamW(encoder.parameters(), lr=1e-3)

Standard optimiser.

opt.step()

One gradient update.

print("loss :", float(loss))

Prints the loss computed before the optimiser step.

class CNNBiLSTMStack(nn.Module):

End-to-end stack: CNN frontend feeds the BiLSTM encoder.

def __init__(self):

No hyperparameters - uses the book defaults.

from cnn_frontend_full import CNNFrontend

Imports the CNN module from §8.4.

self.cnn = CNNFrontend(c_in=17)

17 → 64 channels via the three Conv blocks.

self.lstm = BiLSTMEncoder(input_size=64, hidden_size=256, num_layers=2)

BiLSTM encoder consumes 64-D input.

def forward(self, x):

Composed forward.

return self.lstm(self.cnn(x))

Sequential composition. (B, T, 17) → CNN → (B, T, 64) → BiLSTM → (B, T, 512). Two lines of model code.

Execution state: end-to-end shape = (B, 30, 17) → (B, 30, 64) → (B, 30, 512).
```python
import torch
import torch.nn as nn


class BiLSTMEncoder(nn.Module):
    """Two-layer bidirectional LSTM. Consumes (B, T, 64), emits (B, T, 512)."""

    def __init__(self,
                 input_size: int = 64,
                 hidden_size: int = 256,
                 num_layers: int = 2,
                 dropout_p: float = 0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout_p if num_layers > 1 else 0.0,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F)  →  out: (B, T, 2H)
        out, _ = self.lstm(x)
        return out


# ---------- Smoke test ----------
torch.manual_seed(0)
encoder = BiLSTMEncoder(input_size=64, hidden_size=256, num_layers=2, dropout_p=0.3)

x = torch.randn(2, 30, 64)        # CNN frontend output (Chapter 8)
y = encoder(x)

print("input  :", tuple(x.shape))                   # (2, 30, 64)
print("output :", tuple(y.shape))                   # (2, 30, 512)
print("# params:", sum(p.numel() for p in encoder.parameters()))

# Backward + optimiser
loss = y.sum()
loss.backward()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
opt.step()
print("loss   :", float(loss))


# ---------- Composing with CNNFrontend ----------
class CNNBiLSTMStack(nn.Module):
    """CNN frontend + BiLSTM encoder, end to end."""
    def __init__(self):
        super().__init__()
        # See §8.4 for CNNFrontend definition
        from cnn_frontend_full import CNNFrontend
        self.cnn  = CNNFrontend(c_in=17)
        self.lstm = BiLSTMEncoder(input_size=64, hidden_size=256, num_layers=2)

    def forward(self, x):
        return self.lstm(self.cnn(x))     # (B, T, F=17) -> (B, T, 512)
```

Shape Trace

| Step | Shape | What happened |
| --- | --- | --- |
| Raw input from CMAPSSFullDataset | (B, 30, 17) | Output of §7.4 |
| After CNNFrontend (§8.4) | (B, 30, 64) | Local features |
| After BiLSTMEncoder (this section) | (B, 30, 512) | Bidirectional context |
| Ready for attention (§10) | (B, 30, 512) | Same shape |
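The trace can be reproduced without the real CNNFrontend. In the sketch below, StandInFrontend is a hypothetical placeholder (a single Conv1d with kernel size 3 and padding 1), not the §8.4 module. It only mimics the (B, T, 17) → (B, T, 64) contract so that the BiLSTM stage can be exercised on its own; the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn


class StandInFrontend(nn.Module):
    """Hypothetical placeholder for CNNFrontend: (B, T, 17) -> (B, T, 64)."""

    def __init__(self, c_in: int = 17, c_out: int = 64):
        super().__init__()
        # kernel_size=3 with padding=1 keeps the time dimension unchanged
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (B, C, T), so swap the last two axes around the conv
        return torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)


frontend = StandInFrontend()
bilstm = nn.LSTM(64, 256, num_layers=2, bidirectional=True,
                 batch_first=True, dropout=0.3)

raw = torch.randn(4, 30, 17)      # stand-in for a batch of sensor windows
feats = frontend(raw)             # local features
context, _ = bilstm(feats)        # bidirectional context

shapes = [tuple(t.shape) for t in (raw, feats, context)]
for name, shape in zip(("raw", "cnn", "bilstm"), shapes):
    print(f"{name:7s}", shape)
```

Only the last axis changes from stage to stage; the 30-step time axis is preserved throughout.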

Smoke Test: Backward + Optimiser

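Same six-line smoke test pattern as §8.4: forward, sum, backward, optimiser step. A shape mismatch or a broken autograd graph makes this test fail immediately. A parameter that autograd never reaches does not raise an error by itself (its .grad stays None and the optimiser skips it), and dropout behaving the same in train and eval mode is equally silent, so check both explicitly if they matter. Always run the test before integrating the module into a full training loop.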
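A slightly stricter version of the smoke test makes the silent failure modes visible. This is a sketch under the section's default hyperparameters, using nn.LSTM directly rather than the BiLSTMEncoder wrapper. The variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(64, 256, num_layers=2, bidirectional=True,
               batch_first=True, dropout=0.3)
x = torch.randn(2, 30, 64)

# 1) Every parameter should receive a gradient, and it should be finite.
out, _ = lstm(x)
out.sum().backward()
missing = [n for n, p in lstm.named_parameters() if p.grad is None]
non_finite = [n for n, p in lstm.named_parameters()
              if p.grad is not None and not torch.isfinite(p.grad).all()]

# 2) Inter-layer dropout should be active in train mode, inactive in eval mode.
lstm.train()
with torch.no_grad():
    train_differs = not torch.allclose(lstm(x)[0], lstm(x)[0])
lstm.eval()
with torch.no_grad():
    eval_matches = torch.allclose(lstm(x)[0], lstm(x)[0])

print("missing grads :", missing)
print("non-finite    :", non_finite)
print("train differs :", train_differs)
print("eval matches  :", eval_matches)
```

Two forward passes in train mode differ because a fresh dropout mask is drawn each time; in eval mode they agree.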

Composing With the CNN Frontend

The end of the code block defines CNNBiLSTMStack - the CNN frontend and the BiLSTM encoder composed into one Module. Two submodules; one forward call. The output is (B, T, 512), ready for the attention layer that comes next.

The backbone is half-built. CNNFrontend + BiLSTMEncoder = ~2.2M parameters. Add the attention block (Chapter 10, ~100k more) and the FC stack + dual-task heads (Chapter 11, ~1.2M) and you arrive at the ~3.5M-parameter DualTaskModel.

Three Implementation Pitfalls

Pitfall 1: Forgetting batch_first. The default is False; the book convention is True. Without it, nn.LSTM reads a (B, T, F) tensor as (T, B, F). The code runs without errors and, for the shapes used here, the output shape looks the same, so only typed-shape (named-axis) checks would catch it.
Pitfall 2: Dropout warning at num_layers=1. PyTorch warns; the dropout has no effect. Use the conditional in __init__: dropout_p if num_layers > 1 else 0.0.
Pitfall 3: Discarding (h_n, c_n) when you need them. If you want a SINGLE per-engine context vector (instead of the per-timestep sequence), use h_n: move the batch axis first, from (num_layers × num_dirs, B, H) to (B, num_layers × num_dirs, H), then flatten to (B, num_layers × num_dirs × H) and feed that into a head. A bare reshape without the transpose mixes values from different samples. We do NOT do this in the backbone - attention (Chapter 10) consumes the full per-timestep sequence.
The point. Two-layer bidirectional LSTM with H=256 wraps into a 25-line nn.Module. Composes with CNNFrontend in two lines. Total backbone-so-far: ~2.2M parameters; output ready for attention.

Takeaway

  • BiLSTMEncoder is a thin wrapper over nn.LSTM. Sets the paper's defaults and discards the unused (h_n, c_n).
  • (B, 30, 64) → (B, 30, 512). Time preserved; channels expand 8×.
  • ~2.24M parameters (2,236,416 with input_size=64). Largest single backbone component.
  • CNNBiLSTMStack composes the two halves of the backbone. Two-line forward; ~2.2M params total.