Chapter 10

PyTorch Implementation

Multi-Head Self-Attention

The Production Attention Block

The contract:

| Spec | Value |
| --- | --- |
| Input shape | (B, 30, 512), from the BiLSTM |
| Output shape | (B, 30, 512), unchanged |
| Heads | 8 |
| Per-head dim (d_k) | 64 |
| Wrapping | Residual + LayerNorm + Dropout |
| Dropout p | 0.1 |
| Total parameters | 1,051,648 |
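
The parameter total in the contract can be reproduced by hand. The sketch below assumes the usual nn.MultiheadAttention layout when query, key and value share one width: a packed input projection (3d × d weight plus 3d bias) and an output projection (d × d weight plus d bias), followed by LayerNorm's gain and shift vectors. The helper name `attention_block_params` is illustrative and not part of the chapter's code.

```python
def attention_block_params(d_model: int = 512) -> dict:
    """Count learnable parameters of multi-head attention + LayerNorm."""
    # Packed Q, K, V projection: one (3d x d) weight and one 3d bias.
    in_proj = 3 * d_model * d_model + 3 * d_model
    # Output projection: one (d x d) weight and one d bias.
    out_proj = d_model * d_model + d_model
    # LayerNorm: gamma and beta, each of length d.
    layer_norm = 2 * d_model
    return {
        "in_proj": in_proj,
        "out_proj": out_proj,
        "layer_norm": layer_norm,
        "total": in_proj + out_proj + layer_norm,
    }


counts = attention_block_params(512)
print(counts)           # per-component breakdown
print(counts["total"])  # 1051648
```

The head count does not appear in the arithmetic: splitting 512 features into 8 heads of 64 reshapes the same projection matrices and adds no parameters. Dropout has no parameters either.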

The Full PyTorch Module

AttentionBlock + BackboneSoFar (CNN + BiLSTM + Attention) + smoke test
attention_block_full.py
import torch

The top-level PyTorch package.

import torch.nn as nn

Neural-network layers (nn.Module, nn.MultiheadAttention, nn.LayerNorm, nn.Dropout).

class AttentionBlock(nn.Module):

Production wrapper for the attention sub-layer.

def __init__(self, d_model=512, num_heads=8, dropout_p=0.1):

Three knobs - all default to the paper's reference values.

super().__init__()

Initialise nn.Module so that sub-modules and their parameters are registered.

self.attn = nn.MultiheadAttention(...)

From §10.2. Built with batch_first=True, so inputs and outputs are (B, T, d_model).

self.norm = nn.LayerNorm(d_model)

Normalises each time step across its 512 features - 1024 params (gamma + beta of length 512).

self.drop = nn.Dropout(dropout_p)

Inverted dropout applied to the attention output.

def forward(self, x):

Standard forward.

a, _ = self.attn(x, x, x, need_weights=False)

Self-attention - same tensor for Q, K, V. The second return value (the attention weights) is None because need_weights=False.

return self.norm(x + self.drop(a))

Residual + LayerNorm wrap. Three operations in one line: dropout, residual add, LayerNorm.

class BackboneSoFar(nn.Module):

Composes CNN + BiLSTM + Attention into one Module, demonstrating how the building blocks plug together.

def __init__(self):

No constructor arguments - each block's configuration is hard-coded in the body.

from cnn_frontend_full import CNNFrontend

Import from §8.4.

from bilstm_encoder_full import BiLSTMEncoder

Import from §9.4.

self.cnn = CNNFrontend(c_in=17, dropout_p=0.15)

17 → 64 channel encoder.

self.lstm = BiLSTMEncoder(input_size=64, hidden_size=256, num_layers=2, dropout_p=0.3)

64 → 512 channel encoder (2 directions × 256 hidden units).

self.attn = AttentionBlock(d_model=512, num_heads=8, dropout_p=0.1)

512 → 512 self-attention block (this section).

def forward(self, x):

Composed forward.

return self.attn(self.lstm(self.cnn(x)))

Three function calls. (B, T, 17) → CNN → (B, T, 64) → BiLSTM → (B, T, 512) → Attention → (B, T, 512). The whole backbone-so-far in one expression.

torch.manual_seed(0)

Fixes the random seed so the run is reproducible.

backbone = BackboneSoFar()

Instantiate the whole stack.

x = torch.randn(2, 30, 17)

Random input with the shape of a batch from CMAPSSFullDataset (§7.4): 2 windows, 30 time steps, 17 features.

y = backbone(x)

Full forward pass.

loss = y.sum()

Placeholder loss for the smoke test.

loss.backward()

Autograd through three nested Modules. If the output no longer requires grad, backward() raises an error. A partial break (for example a stray detach() between blocks) does not raise; it shows up as missing .grad values on the upstream parameters.

print("input :", tuple(x.shape))

Verify input.

Output: input : (2, 30, 17)

print("output :", tuple(y.shape))

Verify backbone-so-far output.

Output: output : (2, 30, 512)

print("# params (CNN + BiLSTM + Attn):", sum(p.numel() for p in backbone.parameters()))

Total parameter count across the three components. ~53k (CNN) + ~2.14M (BiLSTM) + ~1.05M (Attention) ≈ 3.2M.

Output: # params (CNN + BiLSTM + Attn): ~3,200,000
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Multi-head self-attention + residual + LayerNorm + dropout.

    Consumes (B, T, 512) from the BiLSTM encoder; emits (B, T, 512)
    ready for the FC stack and dual-task heads.
    """

    def __init__(self,
                 d_model: int = 512,
                 num_heads: int = 8,
                 dropout_p: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            batch_first=True,
            dropout=dropout_p,
        )
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + self.drop(a))


# ---------- Compose: CNN → BiLSTM → Attention ----------
class BackboneSoFar(nn.Module):
    """The first three components of the backbone composed together."""

    def __init__(self):
        super().__init__()
        # Imports from previous chapters' implementations
        from cnn_frontend_full import CNNFrontend           # §8.4
        from bilstm_encoder_full import BiLSTMEncoder       # §9.4

        self.cnn = CNNFrontend(c_in=17, dropout_p=0.15)
        self.lstm = BiLSTMEncoder(input_size=64, hidden_size=256,
                                  num_layers=2, dropout_p=0.3)
        self.attn = AttentionBlock(d_model=512, num_heads=8, dropout_p=0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T=30, F=17)  →  (B, T=30, 512)
        return self.attn(self.lstm(self.cnn(x)))


# ---------- Smoke test ----------
torch.manual_seed(0)
backbone = BackboneSoFar()

x = torch.randn(2, 30, 17)
y = backbone(x)
loss = y.sum()
loss.backward()

print("input  :", tuple(x.shape))                      # (2, 30, 17)
print("output :", tuple(y.shape))                      # (2, 30, 512)
print("# params (CNN + BiLSTM + Attn):",
      sum(p.numel() for p in backbone.parameters()))   # ~3,200,000

Composing the Backbone So Far

With CNNFrontend (Chapter 8), BiLSTMEncoder (Chapter 9), and AttentionBlock (this section) in place, the three Modules compose into a feature extractor of roughly 3.2M parameters. The next chapter adds the FC stack and the two task-specific heads (RUL regression + health classification) to complete the DualTaskModel.

| Stage | Output shape | Params (approx) |
| --- | --- | --- |
| Input | (B, 30, 17) | - |
| After CNN (§8.4) | (B, 30, 64) | 53k |
| After BiLSTM (§9.4) | (B, 30, 512) | 2.14M |
| After Attention (this §) | (B, 30, 512) | 1.05M |
| BackboneSoFar total | (B, 30, 512) | ~3.2M |
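
The shape column can be traced without the earlier chapters' files. The sketch below swaps in stand-ins: `StandInCNN`, `StandInBiLSTM` and `SelfAttentionStage` are hypothetical placeholders that only reproduce the feature widths (17 → 64 → 512 → 512). They are not the CNNFrontend or BiLSTMEncoder from §8.4 and §9.4, so their parameter counts will not match the table.

```python
import torch
import torch.nn as nn


class StandInCNN(nn.Module):
    """Hypothetical placeholder for CNNFrontend: 17 -> 64 features per step."""

    def __init__(self, c_in: int = 17, c_out: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (B, C, T), so transpose in and back out of (B, T, C).
        return torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)


class StandInBiLSTM(nn.Module):
    """Hypothetical placeholder for BiLSTMEncoder: 64 -> 2 * 256 = 512."""

    def __init__(self, input_size: int = 64, hidden_size: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)  # discard the final (h_n, c_n) states
        return out


class SelfAttentionStage(nn.Module):
    """Post-norm self-attention, 512 -> 512 (dropout omitted for brevity)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + a)


stages = [("cnn", StandInCNN()),
          ("bilstm", StandInBiLSTM()),
          ("attention", SelfAttentionStage())]

x = torch.randn(2, 30, 17)
shapes = [("input", tuple(x.shape))]
with torch.no_grad():
    for name, stage in stages:
        x = stage(x)
        shapes.append((name, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:10s} {shape}")
```

Recording the shape after every stage, rather than only at the end, points at the offending block when a width is wired incorrectly.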

End-to-End Smoke Test

The bottom of the code block runs a six-line smoke test on the composed BackboneSoFar: seed, instantiate, build a random input, forward, scalar loss, backward. The print calls that follow report the shapes and the parameter count. If the shape contract is broken in any of the three nested Modules, or the output has been cut off from the autograd graph, the test fails immediately. Always run it before integrating into the full DualTaskModel of Chapter 11.

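What the smoke test confirms. (1) Shapes flow through three nested Modules. (2) backward() runs through all three Modules; to confirm that gradients reach the deepest layer, check that every parameter's .grad is populated afterwards. (3) BatchNorm and LayerNorm both interoperate. (4) No hidden device transfers. (5) The printed parameter count can be compared against the per-block accounting.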
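
The gradient claim can be checked rather than assumed. The sketch below is one way to do it for AttentionBlock alone (the class is repeated so the snippet runs by itself); `params_without_grad` is an illustrative helper, not part of the chapter's code. It weights the output with random numbers instead of calling y.sum(). The reason: with a freshly initialised LayerNorm as the last layer (gain 1, shift 0), the normalised features of each time step sum to zero, so y.sum() sends gradients upstream that are zero up to rounding.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Same structure as the chapter's AttentionBlock, repeated for a standalone run."""

    def __init__(self, d_model: int = 512, num_heads: int = 8,
                 dropout_p: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                          batch_first=True, dropout=dropout_p)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + self.drop(a))


def params_without_grad(module: nn.Module, x: torch.Tensor) -> list:
    """Hypothetical helper: names of parameters whose grad is missing or all zero."""
    y = module(x)
    # Random weights make the loss depend on the direction of y,
    # not only on its per-step sum (which LayerNorm pins to a constant).
    loss = (y * torch.randn_like(y)).sum()
    loss.backward()
    return [name for name, p in module.named_parameters()
            if p.grad is None or p.grad.abs().sum().item() == 0.0]


torch.manual_seed(0)
block = AttentionBlock()
x = torch.randn(2, 30, 512)
missing = params_without_grad(block, x)
print("parameters without gradient:", missing)
```

An empty list means every parameter received a non-zero gradient. The same helper can be pointed at the composed backbone once the CNN and BiLSTM modules are importable.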

Three Implementation Pitfalls

Pitfall 1: Forgetting need_weights=False. The default in nn.MultiheadAttention's forward is need_weights=True, which materialises and returns the (B, T, T) attention matrix, averaged over heads. Unless you want to inspect the attention maps, you do not need it; passing False saves memory and a small amount of compute.
Pitfall 2: Mixing post-norm and pre-norm in the same stack. AttentionBlock is post-norm: LayerNorm is applied after the residual add. If you ever add a second attention block to the backbone, ensure both blocks use the same normalisation scheme. Mixing the two can destabilise training.
Pitfall 3: Dropping the residual. Without x + drop(a), the output is just the normalised attention result, and gradients must reach the BiLSTM through the full attention computation with no identity shortcut. Always keep the residual.
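
Pitfall 1 is easy to see in isolation. The sketch below calls a bare nn.MultiheadAttention both ways on the same input; the tensor sizes mirror this section's (B, 30, 512) contract with B = 2, and the module is built without dropout so the two outputs are comparable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 30, 512)

with torch.no_grad():
    # Default behaviour: the attention weights are built and returned.
    out_with, weights = attn(x, x, x, need_weights=True)
    # Weights skipped: the second return value is None.
    out_without, no_weights = attn(x, x, x, need_weights=False)

print(tuple(out_with.shape))  # (2, 30, 512)
print(tuple(weights.shape))   # (2, 30, 30), averaged over the 8 heads
print(no_weights)             # None
print(torch.allclose(out_with, out_without, atol=1e-4))
```

The attended output is the same either way, up to floating-point differences between the two code paths; only the extra (B, T, T) tensor changes.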
The point. The attention block is a roughly 25-line nn.Module that composes with CNN + BiLSTM in one expression. Total backbone-so-far: ~3.2M parameters; output (B, 30, 512), ready for the FC stack and dual-task heads in Chapter 11.

Takeaway

  • AttentionBlock = MHA + Residual + LayerNorm + Dropout. Wrapped in one Module.
  • (B, 30, 512) → (B, 30, 512). Shape preserved end-to-end.
  • BackboneSoFar composes the three components. ~3.2M params; one forward call.
  • Smoke test before integration. Six lines; catches shape and gradient bugs immediately.