The Production Attention Block
The contract:
| Spec | Value |
|---|---|
| Input shape | (B, 30, 512) - from BiLSTM |
| Output shape | (B, 30, 512) - same |
| Heads | 8 |
| Per-head dim (d_k) | 64 |
| Wrapping | Residual + LayerNorm + Dropout |
| Dropout p | 0.1 |
| Total parameters | 1,051,648 (MHA 1,050,624 + LayerNorm 1,024) |
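The parameter total in the contract can be reproduced by hand. The arithmetic below is a sketch that assumes the PyTorch defaults: biases enabled in nn.MultiheadAttention and an affine nn.LayerNorm. The variable names are illustrative only.

```python
# Parameter accounting for the attention block, assuming default biases in
# nn.MultiheadAttention and an affine (scale + shift) nn.LayerNorm.
d_model = 512
num_heads = 8
d_k = d_model // num_heads  # per-head dimension

in_proj = 3 * d_model * d_model + 3 * d_model  # fused Q, K, V projections
out_proj = d_model * d_model + d_model         # output projection
layer_norm = 2 * d_model                       # scale and shift vectors

mha_params = in_proj + out_proj
total_params = mha_params + layer_norm

print(d_k, mha_params, layer_norm, total_params)
```

Dropout and the residual add contribute no parameters, so the four projection matrices account for almost the entire total.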
The Full PyTorch Module
AttentionBlock + BackboneSoFar (CNN + BiLSTM + Attention) + smoke test
🐍attention_block_full.py
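A minimal sketch of what an AttentionBlock satisfying the contract could look like is given here. It is not necessarily the exact listing from attention_block_full.py: the post-norm ordering (LayerNorm applied after the residual add), the attribute names, and the batch size of 4 are assumptions.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Self-attention wrapped in residual + dropout + LayerNorm (post-norm sketch)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        # batch_first=True so inputs are (B, T, d_model), matching the BiLSTM output.
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: query, key, and value are all x.
        a, _ = self.mha(x, x, x, need_weights=False)
        # Residual around attention, then LayerNorm (ordering is an assumption).
        return self.norm(x + self.drop(a))


block = AttentionBlock()
x = torch.randn(4, 30, 512)  # (B, T, d_model) with an arbitrary batch size
y = block(x)
n_params = sum(p.numel() for p in block.parameters())
print(y.shape, n_params)
```

The output shape equals the input shape, so the block can be dropped between the BiLSTM and any later stage without changing the interfaces around it.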
Composing the Backbone So Far
With CNNFrontend (Chapter 8), BiLSTMEncoder (Chapter 9), and AttentionBlock (this section), three Modules compose into a 3.2M-parameter feature extractor. The next chapter adds the FC stack and the two task-specific heads (RUL regression + health classification) to complete the DualTaskModel.
| Stage | Output shape | Params (approx) |
|---|---|---|
| Input | (B, 30, 17) | — |
| After CNN (§8.4) | (B, 30, 64) | 53k |
| After BiLSTM (§9.4) | (B, 30, 512) | 2.14M |
| After Attention (this §) | (B, 30, 512) | 1.05M |
| BackboneSoFar total | (B, 30, 512) | ~3.2M |
End-to-End Smoke Test
The bottom of the code block runs a six-line smoke test on the composed BackboneSoFar: instantiate, forward, scalar loss, backward. If the shape contract or autograd flow is broken in any of the three nested Modules, the test fails immediately. Always run it before integrating the backbone into the full DualTaskModel of Chapter 11.
What the smoke test confirms. (1) Shapes flow through three nested Modules. (2) Gradients reach the deepest layer. (3) BatchNorm and LayerNorm both interoperate. (4) No hidden device transfers. (5) The parameter count matches the per-block accounting.
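The smoke-test pattern can be sketched as follows. Because CNNFrontend and BiLSTMEncoder are defined in earlier chapters, this sketch uses a hypothetical StandInBackbone with a single Conv1d + BatchNorm layer and a one-layer bidirectional LSTM; its parameter count will not match the ~3.2M of the real backbone. The random regression target is also an assumption, chosen because a loss such as out.sum() is constant after LayerNorm and would give zero gradients upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, p_drop=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.mha(x, x, x, need_weights=False)
        return self.norm(x + self.drop(a))


class StandInBackbone(nn.Module):
    """Hypothetical stand-in, NOT the CNNFrontend / BiLSTMEncoder of Chapters 8-9."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(17, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(64)
        self.lstm = nn.LSTM(64, 256, batch_first=True, bidirectional=True)
        self.attn = AttentionBlock()

    def forward(self, x):                            # x: (B, 30, 17)
        h = self.conv(x.transpose(1, 2))             # (B, 64, 30)
        h = torch.relu(self.bn(h)).transpose(1, 2)   # (B, 30, 64)
        h, _ = self.lstm(h)                          # (B, 30, 512)
        return self.attn(h)                          # (B, 30, 512)


# Smoke test: instantiate, forward, scalar loss, backward.
model = StandInBackbone()
x = torch.randn(4, 30, 17)
out = model(x)
loss = F.mse_loss(out, torch.randn_like(out))  # random target, see lead-in
loss.backward()

# Every parameter, down to the first conv layer, should have received a gradient.
missing = [name for name, p in model.named_parameters() if p.grad is None]
print(out.shape, missing)
```

An empty `missing` list indicates that autograd reached every parameter; a non-empty list names the layers that were cut off from the loss.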
Three Implementation Pitfalls
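Pitfall 1: Forgetting need_weights=False. need_weights is an argument of nn.MultiheadAttention.forward() and defaults to True, in which case the call also returns the (B, T, T) attention matrix, averaged over heads. If you never inspect the weights you do not need it; passing need_weights=False saves memory and a small amount of compute, and the call returns None in place of the matrix.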
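The difference between the two settings can be seen by calling the same layer both ways. This is a small sketch with an arbitrary batch size of 2; dropout is left at its default of 0 and the layer is put in eval mode so both calls are deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True).eval()
x = torch.randn(2, 30, 512)  # (B, T, d_model)

with torch.no_grad():
    # need_weights defaults to True: the head-averaged weights are returned too.
    out_w, weights = mha(x, x, x)
    # need_weights=False: the second return value is None.
    out_nw, no_weights = mha(x, x, x, need_weights=False)

print(out_w.shape, weights.shape)
print(out_nw.shape, no_weights)
```

Both calls return an attention output of the same shape; only the second element of the returned tuple differs.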
Pitfall 2: Mixing post-norm and pre-norm in the same stack. If you ever add a second attention block to the backbone, ensure both blocks use the same normalisation scheme. Mixing destabilises training.
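One way to guard against mixed schemes is to build every block in a stack from a single flag. The sketch below is hypothetical: NormSchemeBlock, its pre_norm switch, and make_stack are illustrative names, not part of the chapter's code.

```python
import torch
import torch.nn as nn


class NormSchemeBlock(nn.Module):
    """Hypothetical attention block that can run post-norm or pre-norm."""

    def __init__(self, d_model=512, num_heads=8, p_drop=0.1, pre_norm=False):
        super().__init__()
        self.pre_norm = pre_norm
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.pre_norm:
            # Pre-norm: normalise the input, attend, then add the residual.
            h = self.norm(x)
            a, _ = self.mha(h, h, h, need_weights=False)
            return x + self.drop(a)
        # Post-norm: attend, add the residual, then normalise.
        a, _ = self.mha(x, x, x, need_weights=False)
        return self.norm(x + self.drop(a))


def make_stack(n_blocks: int, pre_norm: bool) -> nn.Sequential:
    # A single flag for the whole stack, so the schemes cannot be mixed by accident.
    return nn.Sequential(*[NormSchemeBlock(pre_norm=pre_norm) for _ in range(n_blocks)])


stack = make_stack(2, pre_norm=False)
y = stack(torch.randn(4, 30, 512))
schemes = {blk.pre_norm for blk in stack}
print(y.shape, schemes)
```

If `schemes` ever contains both True and False, the stack mixes the two normalisation orders.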
Pitfall 3: Dropping the residual. Without x + drop(a), the output is just the attention result, and gradients must pass through the full attention computation to reach the BiLSTM. Always keep the residual.
The point. The attention block is a 25-line nn.Module that composes with CNN + BiLSTM in one expression. Total backbone so far: ~3.2M parameters; the (B, 30, 512) output is ready for the FC stack and dual-task heads in Chapter 11.
Takeaway
- AttentionBlock = MHA + Residual + LayerNorm + Dropout. Wrapped in one Module.
- (B, 30, 512) → (B, 30, 512). Shape preserved end-to-end.
- BackboneSoFar composes the three components. ~3.2M params; one forward call.
- Smoke test before integration. Six lines; catches shape and gradient bugs immediately.