The Production BiLSTM Encoder
The contract:
| Spec | Value |
|---|---|
| Input shape | (B, 30, 64) |
| Output shape | (B, 30, 512) |
| Layers | 2 |
| Hidden size per direction | 256 |
| Bidirectional | Yes |
| batch_first | True |
| Dropout (between layers) | 0.3 |
| Total parameters | ~2,140,160 |
The Full PyTorch Module
BiLSTMEncoder + smoke test + composition with CNNFrontend
🐍bilstm_encoder_full.py
Shape Trace
| Step | Shape | What happened |
|---|---|---|
| Raw input from CMAPSSFullDataset | (B, 30, 17) | Output of §7.4 |
| After CNNFrontend (§8.4) | (B, 30, 64) | Local features |
| After BiLSTMEncoder (this section) | (B, 30, 512) | Bidirectional context |
| Ready for attention (§10) | (B, 30, 512) | Same shape |
Smoke Test: Backward + Optimiser
Same six-line smoke test pattern as §8.4: forward, sum, backward, optimiser step. If autograd cannot reach every parameter or BN/dropout misbehave, this test crashes immediately. Always run before integrating into a full training loop.
Composing With the CNN Frontend
The end of the code block defines CNNBiLSTMStack - the CNN frontend and the BiLSTM encoder composed into one Module. Two layers; one forward call. The output is , ready for the attention layer that comes next.
The backbone is half-built. CNNFrontend + BiLSTMEncoder = ~2.2M parameters. Add the attention block (Chapter 10, ~100k more) and the FC stack + dual-task heads (Chapter 11, ~1.2M) and you arrive at the ~3.5M-parameter DualTaskModel.
Three Implementation Pitfalls
Pitfall 1: Forgetting batch_first. Default is False; book convention is True. Forgetting silently flips the input axes - typed-shape checks would catch it; the code runs without errors otherwise.
Pitfall 2: Dropout warning at num_layers=1. PyTorch warns; the dropout has no effect. Use the conditional in __init__:
dropout_p if num_layers > 1 else 0.0.Pitfall 3: Discarding (h_n, c_n) when you need them. If you want a SINGLE per-engine context vector (instead of the per-timestep sequence), use h_n - reshape it from (num_layers × num_dirs, B, H) to (B, num_layers × num_dirs × H) and feed that into a head. We do NOT do this in the backbone - attention (Chapter 10) consumes the full per-timestep sequence.
The point. Two-layer bidirectional LSTM with H=256 wraps into a 25-line nn.Module. Composes with CNNFrontend in two lines. Total backbone-so-far: ~2.2M parameters; output ready for attention.
Takeaway
- BiLSTMEncoder is a thin wrapper over nn.LSTM. Sets the paper's defaults and discards the unused (h_n, c_n).
- (B, 30, 64) → (B, 30, 512). Time preserved; channels expand 8×.
- ~2.14M parameters. Largest single backbone component.
- CNNBiLSTMStack composes the two halves of the backbone. Two-line forward; ~2.2M params total.