Assembling the DualTaskModel
Three chapters of components, four sections of heads. This section glues them into one nn.Module that you can instantiate with one line and train end-to-end with one optimizer. Everything you have built fits together like this:
| Stage | Source | Shape |
|---|---|---|
| Window of normalised sensors | §7 | (B, 30, 14) |
| CNN frontend | §8 | (B, 30, 64) |
| BiLSTM encoder | §9 | (B, 30, 512) |
| Multi-head self-attention | §10 | (B, 30, 512) |
| FC funnel + last-cycle pool | §11.1 | (B, 32) ← shared |
| RUL regression head | §11.2 | (B,) |
| Health classification head | §11.3 | (B, 3) logits |
End-to-End Shape Trace
Walk a single batch (B=4) through every stage. The only non-trivial transitions are CNN's channel-axis bridge (transpose-in, transpose-out) and FCFunnel's last-cycle pool that collapses the time axis.
| Step | Operation | Shape |
|---|---|---|
| 0 | input batch | (4, 30, 14) |
| 1 | CNNFrontend - transpose to (B,F,T), conv stack, transpose | (4, 30, 64) |
| 2 | BiLSTMEncoder - 2-layer bidir LSTM | (4, 30, 512) |
| 3 | AttentionBlock - 8-head self-attn + residual + LN | (4, 30, 512) |
| 4 | FCFunnel.last-cycle pool | (4, 512) |
| 5 | FCFunnel - dense 512 → 256 → 64 → 32 | (4, 32) ← z |
| 6a | RULHead - 32 → 64 → 32 → 16 → 1, final ReLU | (4,) |
| 6b | HealthHead - 32 → 32 → 16 → 3 (logits) | (4, 3) |
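The trace above can be replayed as a pure shape check. The sketch below is not the Chapter 8-11 code: every block is replaced by a hypothetical random linear map (`dense`), so only the shapes carry meaning, in particular the last-cycle pool `h[:, -1, :]` that removes the time axis.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(h, out_dim):
    """Stand-in for a learned block: random linear map over the last axis."""
    w = rng.standard_normal((h.shape[-1], out_dim)).astype(np.float32)
    return h @ w


x = rng.standard_normal((4, 30, 14)).astype(np.float32)  # step 0: (B, T, c_in)

h = dense(x, 64)           # step 1: CNN frontend stand-in   -> (4, 30, 64)
h = dense(h, 512)          # step 2: BiLSTM stand-in (2*256) -> (4, 30, 512)
h = h + dense(h, 512)      # step 3: residual block stand-in -> (4, 30, 512)
pooled = h[:, -1, :]       # step 4: last-cycle pool         -> (4, 512)

z = pooled
for width in (256, 64, 32):                      # step 5: funnel 512 -> 256 -> 64 -> 32
    z = np.maximum(dense(z, width), 0.0)         # dense + ReLU

rul = np.maximum(dense(z, 1), 0.0).squeeze(-1)   # step 6a: final ReLU, (4,)
logits = dense(z, 3)                             # step 6b: raw logits, (4, 3)

for name, arr in [("pooled", pooled), ("z", z), ("rul", rul), ("logits", logits)]:
    print(f"{name:<7}", arr.shape)
```

If any printed shape differs from the table, the wiring (not the weights) is the first place to look.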
Read this before continuing. Notice that the BiLSTM owns ~63% of the parameters while the two task heads together own less than 0.2%. Yet Chapter 12 will show the classification head's gradient pulling on the shared backbone with up to 500× the magnitude of the regression head's. Parameter share and gradient share are not the same thing - that is the central insight of GABA / GRACE.
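One way to look at gradient share, as opposed to parameter share, is to measure each task's gradient norm on the shared parameters. The sketch below uses toy stand-in modules with made-up sizes (and no final ReLU on the toy regression head), so its numbers say nothing about the 500× figure; it only shows the measurement itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (hypothetical sizes): one shared trunk, two small heads.
trunk = nn.Sequential(nn.Linear(14, 32), nn.ReLU())
rul_head = nn.Linear(32, 1)
health_head = nn.Linear(32, 3)

x = torch.randn(8, 14)
y_rul = torch.randint(0, 126, (8,)).float()
y_hs = torch.randint(0, 3, (8,))

z = trunk(x)                                         # shared representation
loss_rul = F.mse_loss(rul_head(z).squeeze(-1), y_rul)
loss_hs = F.cross_entropy(health_head(z), y_hs)


def shared_grad_norm(loss):
    """L2 norm of d(loss)/d(trunk parameters), computed without touching .grad."""
    grads = torch.autograd.grad(loss, list(trunk.parameters()), retain_graph=True)
    return torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()


norm_rul = shared_grad_norm(loss_rul)
norm_hs = shared_grad_norm(loss_hs)
print(f"regression gradient norm on trunk    : {norm_rul:.4f}")
print(f"classification gradient norm on trunk: {norm_hs:.4f}")
```

The two norms are computed on exactly the same parameters, which is what makes them comparable across tasks.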
The Parameter Budget
| Block | Params | Share | Notes |
|---|---|---|---|
| CNN frontend | 53,184 | 1.6% | Three Conv1d → BN → ReLU → Dropout |
| BiLSTM encoder | 2,140,160 | 63.0% | 2 layers, hidden=256, bidirectional |
| Multi-head attention | 1,052,672 | 31.0% | 8 heads, d_model=512 |
| FC funnel | 148,000 | 4.4% | 512 → 256 → 64 → 32 |
| RUL head | 4,737 | 0.14% | Regression |
| Health head | 1,635 | 0.05% | Classification (auxiliary) |
| Total | ~3.40M | 100% | Reproducible on a single 16 GB GPU |
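The two head rows can be recomputed by hand from the layer widths in the shape trace. This is a minimal sketch, assuming every dense layer carries a bias and the heads contain no other learnable parameters; the remaining rows are copied from the table rather than derived.

```python
def mlp_params(widths):
    """Weights plus biases of a fully connected stack with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


rul_head = mlp_params([32, 64, 32, 16, 1])   # 32 -> 64 -> 32 -> 16 -> 1
health_head = mlp_params([32, 32, 16, 3])    # 32 -> 32 -> 16 -> 3

budget = {
    "CNN frontend": 53_184,            # copied from the table
    "BiLSTM encoder": 2_140_160,       # copied from the table
    "Multi-head attention": 1_052_672, # copied from the table
    "FC funnel": 148_000,              # copied from the table
    "RUL head": rul_head,
    "Health head": health_head,
}
total = sum(budget.values())

for name, n in budget.items():
    print(f"{name:<22}{n:>12,}{n / total:>9.2%}")
print(f"{'Total':<22}{total:>12,}")
```

The head counts come out as 4,737 and 1,635, matching the table, and together they stay under 0.2% of the total.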
Python: Pseudo-Assembly
This file is a NumPy-only pseudo-assembly of the dual-task model. Every block (CNN, BiLSTM, attention, funnel, two heads) was implemented in earlier sections; here we glue them together so you can SEE the end-to-end shape flow without any PyTorch magic. The comment tells the reader: do not implement these blocks again - import the per-section reference impls.
Lists the six imports the file is about to make. Reading this before the imports below tells you the architecture in one breath: cnn_frontend (Ch 8), bilstm_encoder (Ch 9), attention_block (Ch 10), then three Ch 11 modules - fc_funnel (the trunk-to-z bottleneck), rul_head (regression), health_head (classification).
Closes the chapter mapping. Together lines 1-3 are the contract: 'this file does NOT redefine the components, it only wires them in the order the architecture requires.'
NumPy is the numerical computing library for Python. It supplies ndarray (the fast N-D array used to hold the (B, 30, 14) sensor windows), the @ matrix-multiply operator that every block relies on under the hood, and broadcasting that makes per-row normalisation a one-liner. We alias it as np by convention so we can write np.random.randn(...) and np.float32 below.
Pulls in the function defined back in §8.4. cnn_frontend is the first stage of the trunk: it takes the raw (B, T, 14) window, transposes to (B, 14, T) so Conv1d sees channels first, runs three Conv1d -> BatchNorm -> ReLU -> Dropout blocks, then transposes back. It mixes information across the 14 sensors at every cycle.
Pulls in the recurrent encoder from §9.4. Two stacked LSTMs run in BOTH directions over the 30-cycle window: one reads cycles 1->30 (causal), the other reads 30->1 (anti-causal). Their outputs concatenate along the channel axis, doubling 256 hidden units to 512.
Pulls in the multi-head self-attention block from §10.4. Lets every cycle attend to every other cycle in the window directly (not through the LSTM's sequential bottleneck). Wrapped in a residual connection + LayerNorm so it is shape-preserving.
Pulls in the bottleneck from §11.1. Two responsibilities: (1) collapse the time axis by taking only the LAST cycle's 512-D vector (the 'right-now' embedding), then (2) funnel 512 → 256 → 64 → 32 through dense layers. The output is the 32-D shared representation z that both task heads read.
Pulls in the regression head from §11.2. A 4-layer MLP that maps the 32-D shared vector z to a SINGLE non-negative scalar - the predicted remaining useful life in cycles. Final activation is ReLU so the prediction is always ≥ 0 (no engine has 'negative life left').
Pulls in the classification head from §11.3. A 3-layer MLP that maps z to RAW logits over {healthy, degrading, critical}. No softmax here - softmax is fused inside cross_entropy in PyTorch for numerical stability, and this NumPy version mirrors that contract.
The end-to-end forward pass. One input batch in, two outputs out - a regression scalar per engine AND a 3-vector of class logits per engine. Both come from the same shared trunk; that shared computation is what lets one task help the other.
Docstring start. Numpy-style convention: first line is a one-sentence summary (here: 'End-to-end forward pass.'), then a blank line, then the parameter/return descriptions. Docstrings are pulled by help(dual_task_model) and by Sphinx for the API docs.
Documents the input contract. (B, 30, 14) means batch first, then 30 cycles per window, then 14 sensors. 'Normalised' is critical - the window must already have z-score normalisation applied (mean ≈ 0, std ≈ 1 per sensor); the model itself does NOT include a normalisation layer.
Documents the output contract. rul shape (B,) is a 1-D vector of one scalar per engine. logits shape (B, 3) is a matrix of one row per engine, three raw scores per row (one per health class). This is the contract every downstream training loop relies on.
Closes the triple-quoted docstring block. Below this point is executable code.
Stage 1 of the trunk. Calls the CNN frontend with the raw input AND its parameters unpacked via **. h becomes the per-cycle feature tensor with 64 learned channels. The trailing comment # (B, 30, 64) is the SHAPE of h after this line - useful when reading the code top-to-bottom.
Stage 2 of the trunk. Sweeps the 30-cycle sequence with TWO stacked LSTMs in both directions. The variable name 'h' is REASSIGNED - this is intentional; we never need the raw CNN output again, and reusing the name keeps memory usage low.
Stage 3 of the trunk. Adds direct cycle-to-cycle communication on top of the BiLSTM's sequential summary. The block is shape-preserving thanks to its residual connection: output = LayerNorm(h + MultiHead(h, h, h)).
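The shape-preserving pattern output = LayerNorm(h + Attention(h)) can be sketched in a few lines of NumPy. This is a simplified illustration, not the §10.4 block: it uses a single head, random projection matrices, no output projection, and a LayerNorm without learned gain or bias.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(h, eps=1e-5):
    """Normalise each cycle's feature vector to zero mean, unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v                        # (B, T, d) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (B, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)       # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                         # (B, T, d)


B, T, d = 2, 30, 512
h = rng.standard_normal((B, T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out = layer_norm(h + self_attention(h, w_q, w_k, w_v))   # residual, then LayerNorm
print(h.shape, "->", out.shape)
```

Because the attention output has the same shape as h, the residual sum is well defined and the block can be dropped between any two (B, T, 512) stages.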
Stage 4 - the bottleneck. The variable name CHANGES from h to z because the meaning changes: h was per-cycle, z is per-engine. The time axis collapses here (last-cycle pool keeps only the final cycle's 512-D vector), then dense layers funnel 512 → 256 → 64 → 32. The 32-D z is the SHARED representation both heads consume.
Task head 1 - regression. Maps the 32-D shared z into a single non-negative scalar per engine. The trailing ReLU on the rul_head's final layer guarantees rul ≥ 0 - we cannot predict 'negative life left'.
Task head 2 - classification. Reads THE SAME z that rul_head just consumed and emits raw logits over the 3 health classes. No softmax here - the classifier's loss (cross_entropy in the PyTorch version below) applies log-softmax internally for numerical stability.
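Why the head stops at logits can be seen in a small NumPy sketch of cross-entropy computed directly from logits. The function name `cross_entropy_from_logits` is illustrative, not part of the book's code; it mirrors the log-softmax-then-NLL idea, while the naive softmax-first route overflows on large logits.

```python
import numpy as np


def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy from raw logits (B, C) and integer class indices (B,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # largest entry becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


logits = np.array([[1000.0, 0.0, -1000.0],   # extreme but confident and correct
                   [0.0, 0.0, 0.0]])         # uniform over 3 classes
targets = np.array([0, 2])

loss = cross_entropy_from_logits(logits, targets)
print("stable loss:", loss)

# Naive route: softmax first. exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print("naive softmax, row 0:", naive_probs[0])
```

The first row contributes zero loss and the uniform row contributes ln(3), so the stable result is ln(3) / 2, while the naive probabilities for the first row are already nan.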
Returns BOTH outputs as a 2-tuple. Python will pack them into a tuple even without explicit parentheses (it sees the comma). This is the canonical multi-task return signature; the training loop in Ch 15 will unpack it as 'rul_pred, hs_pred = model(x)' and feed each into its own loss.
Section divider. Everything below is a tiny dry-run that proves the wiring is correct - it does NOT train the model. Smoke tests are the cheapest way to catch a shape mismatch BEFORE you waste GPU time on a real training run.
Sets NumPy's global RNG seed to 0 so the random input on line 32 is REPRODUCIBLE - rerun the script and you get the exact same x. Critical for debugging: a shape error today will reproduce tomorrow.
Tuple unpack: B = 2 (batch size), T = 30 (cycles per window), F = 14 (sensors). Naming convention echoes the docstring shape (B, T, F). Tiny B keeps the smoke test fast - real training uses 64 or 128.
Builds a fake (2, 30, 14) input drawn from a standard normal distribution N(0, 1). Used purely to verify shapes flow correctly. Real data would be pre-normalised sensor readings; standard-normal noise mimics the post-normalisation statistics closely enough for a smoke test.
Stand-in helper - assumed to live in §8-11 supporting code - that returns a dict-of-dicts with random tensors of the right shape for every block. In a real training run, params would come from torch.load(...) or from the optimiser; here we just need SOMETHING the right shape so the forward pass executes.
The single forward call. Drives x through CNN → BiLSTM → Attention → FCFunnel → both heads in one shot. Tuple unpacking on the LHS splits the (rul, logits) return into two named arrays.
Print the input shape so the operator can eyeball that the smoke test received what it expected. .shape on an ndarray returns a tuple of dimension sizes.
Confirms the regression head returned ONE scalar per engine (B,) - not (B, 1). If you ever see (B, 1), squeeze the trailing dim before computing MSE; otherwise broadcasting will silently produce (B, B) per-element losses.
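The (B, 1) versus (B,) trap is easy to reproduce with three hand-picked values. In this sketch the predictions are perfect, yet the un-squeezed version still reports a large error because the subtraction broadcasts to a (B, B) grid.

```python
import numpy as np

pred_col = np.array([[10.0], [20.0], [30.0]])   # (B, 1): head forgot to squeeze
target = np.array([10.0, 20.0, 30.0])           # (B,)

wrong = (pred_col - target) ** 2                # broadcasts to (3, 3)
right = (pred_col.squeeze(-1) - target) ** 2    # stays (3,)

print("wrong shape:", wrong.shape, "mean:", wrong.mean())
print("right shape:", right.shape, "mean:", right.mean())
```

The wrong mean compares every prediction with every target, so it is non-zero even though each engine's own prediction is exact.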
Confirms the classification head returned 3 raw scores per engine. (2, 3) means 2 engines × 3 classes; rows are engines, columns are classes.
Prints the actual RUL predictions, rounded to 2 decimals and converted to a Python list (so the print is human-readable, not 'array([...], dtype=float32)'). Every value should be ≥ 0 thanks to the final ReLU on rul_head.
Picks the predicted class for each engine: argmax along the LAST axis returns the index of the largest logit per row. axis=-1 on a (B, 3) array yields a (B,) vector of class indices ∈ {0, 1, 2}.
1   # Pseudo-NumPy assembly. Each call wraps the per-component implementation
2   # from Chapters 8-11 (cnn_frontend, bilstm_encoder, attention_block,
3   # fc_funnel, rul_head, health_head).
4
5   import numpy as np
6   from cnn_frontend import cnn_frontend  # §8.4
7   from bilstm_encoder import bilstm_encoder  # §9.4
8   from attention_block import attention_block  # §10.4
9   from fc_funnel import fc_funnel  # §11.1
10  from rul_head import rul_head  # §11.2
11  from health_head import health_head  # §11.3
12
13
14  def dual_task_model(x: np.ndarray, params: dict) -> tuple:
15      """End-to-end forward pass.
16
17      x: (B, 30, 14) - normalised sliding window of 14 sensors
18      returns: (rul, logits) where rul: (B,), logits: (B, 3)
19      """
20      h = cnn_frontend(x, **params["cnn"])  # (B, 30, 64)
21      h = bilstm_encoder(h, **params["lstm"])  # (B, 30, 512)
22      h = attention_block(h, **params["attn"])  # (B, 30, 512)
23      z = fc_funnel(h, **params["funnel"])  # (B, 32) <-- shared
24      rul = rul_head(z, **params["rul"])  # (B,)
25      logits = health_head(z, **params["hs"])  # (B, 3)
26      return rul, logits
27
28
29  # ---------- Smoke test ----------
30  np.random.seed(0)
31  B, T, F = 2, 30, 14
32  x = np.random.randn(B, T, F).astype(np.float32)
33
34  # params dict assembled elsewhere (one entry per block)
35  params = build_random_params()  # see §8-11
36
37  rul, logits = dual_task_model(x, params)
38
39  print("input shape :", x.shape)  # (2, 30, 14)
40  print("rul shape :", rul.shape)  # (2,)
41  print("logits shape :", logits.shape)  # (2, 3)
42  print("rul values :", rul.round(2).tolist())  # all >= 0
43  print("argmax(hs) :", logits.argmax(-1).tolist())  # predicted class

PyTorch: One nn.Module
PyTorch's core package. Provides torch.Tensor (the GPU-aware n-D array that replaces NumPy's ndarray for deep learning), the autograd engine that records operations and computes gradients on .backward(), torch.manual_seed for reproducibility, and torch.randn / torch.tensor / torch.randint constructors used below.
Module / layer namespace. nn.Module is the base class every model and sub-module inherits from; nn.Linear, nn.Conv1d, nn.LSTM, nn.MultiheadAttention, nn.LayerNorm all live here. Aliased as 'nn' by universal convention - 'nn.Linear' reads better than 'torch.nn.Linear'.
The FUNCTIONAL counterpart to nn. Same operations but as plain functions that take inputs (and weights) explicitly, with no learnable state of their own. We use F.mse_loss and F.cross_entropy below - losses are stateless, so the functional form is the standard.
Section divider comment. Tells the reader the next six lines pull in PyTorch CLASS implementations of the same blocks the NumPy version above used as functions. Naming convention: function_name -> CamelCaseClass (e.g. cnn_frontend -> CNNFrontend).
Pulls in the §8.4 CNN frontend as an nn.Module subclass. CNNFrontend wraps three Conv1d → BatchNorm1d → ReLU → Dropout blocks plus the (B,T,F) ↔ (B,F,T) transposes Conv1d needs. Calling CNNFrontend(...) returns an INSTANCE (line 24); calling that instance later (line 37's self.cnn(x)) runs the forward pass.
Pulls in the §9.4 two-layer bidirectional LSTM. Wraps PyTorch's nn.LSTM with bidirectional=True and num_layers=2. Doubles the channel axis: forward and backward hidden states are concatenated.
Pulls in the §10.4 multi-head self-attention block. Internally uses nn.MultiheadAttention(batch_first=True), then wraps it in residual + LayerNorm. Shape-preserving.
Pulls in the §11.1 bottleneck. Two phases: last-cycle pool (h[:, -1, :]) collapses time, then dense layers funnel down to shared_dim. Output is the SHARED vector both heads consume.
Pulls in the §11.2 regression head. 4-layer MLP 32→64→32→16→1 with a final ReLU so the output is always ≥ 0.
Pulls in the §11.3 classification head. 3-layer MLP 32→32→16→num_classes that emits raw logits (no softmax - cross_entropy applies log_softmax internally).
Defines the end-to-end model as a subclass of nn.Module. Inheriting from nn.Module gives us automatic parameter registration (every sub-module assigned to self.something is tracked), .parameters() iteration for the optimiser, .to(device) for GPU placement, and .state_dict() for checkpointing.
One-line docstring summarising the full architecture. Visible via help(DualTaskModel). The arrow chain is the trunk; the parenthesised tuple is the two task heads that branch off the shared 32-D vector.
Constructor with five hyperparameters and all-default values matching the paper. Defaults are SAFE - instantiating DualTaskModel() with no args reproduces the published model.
MUST be the FIRST line of any nn.Module subclass __init__. Initialises the parent class's bookkeeping dicts (_modules, _parameters, _buffers). Skipping this raises 'cannot assign module before Module.__init__() call' the first time you try self.cnn = ....
Stage 1 of the trunk. Constructs the CNN frontend and assigns it to self.cnn. Because CNNFrontend is an nn.Module, this assignment AUTO-REGISTERS it - its parameters now appear in model.parameters() and its buffers in model.state_dict().
Stage 2 - construction of the BiLSTM. The line continues onto line 26. input_size=64 must match what the CNN frontend OUTPUTS (its last conv has 64 channels). hidden_size is forwarded from the constructor (default 256 → 2*256 = 512 after bidir concat).
Continuation of line 25. num_layers=2 stacks two LSTM layers (output of layer 1 feeds layer 2 in both directions). dropout_p=0.3 applies dropout BETWEEN layers (PyTorch nn.LSTM only applies dropout between layers, never on the final output).
Stage 3 - constructs the multi-head attention. d_model=2*lstm_hidden = 512. The expression '2 * lstm_hidden' is critical: forgetting the 2 (Pitfall 1 below) still constructs without complaint, and the shape mismatch only surfaces once data flows through the model.
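The factor of 2 can be checked with stock PyTorch modules. The sketch below uses plain nn.LSTM and nn.MultiheadAttention as stand-ins for BiLSTMEncoder and AttentionBlock (an assumption; the book's wrappers may add more). Construction succeeds with either width, and the wrong one only shows up when a tensor is passed through.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 256

# Stand-in for the BiLSTM encoder: forward and backward states are concatenated.
lstm = nn.LSTM(input_size=64, hidden_size=hidden, num_layers=2,
               batch_first=True, bidirectional=True)
h, _ = lstm(torch.randn(4, 30, 64))
print("BiLSTM output:", tuple(h.shape))            # last dim is 2 * hidden

# Correct width: embed_dim matches the concatenated channel axis.
good = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=8, batch_first=True)
out, _ = good(h, h, h)
print("attention output:", tuple(out.shape))

# Forgetting the factor of 2: the constructor is happy, the forward pass is not.
bad = nn.MultiheadAttention(embed_dim=hidden, num_heads=8, batch_first=True)
try:
    bad(h, h, h)
    mismatch_raised = False
except Exception as err:                           # exact exception type may vary
    mismatch_raised = True
    print("forward failed with", type(err).__name__)
```

A one-batch forward pass right after construction is therefore the cheapest guard against this mistake.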
Continuation of line 27. num_heads forwards the constructor arg (default 8). dropout_p=0.1 is the lightest dropout in the model - attention is already a strong regulariser via averaging across positions.
Stage 4 - constructs the bottleneck. d_model=512 again because the funnel reads the attention block's output. Continues onto lines 30-31.
Continuation. hidden=(256, 64) defines the two intermediate widths in the funnel: 512→256→64. out_dim=shared_dim=32 is the FINAL output width - the size of the shared z vector both heads will read.
Continuation. dropout_p=0.3 - same heavy rate as the LSTM. The funnel is right before the heads, so regularising hard here prevents the heads from memorising training-set quirks of z.
Task head 1 - regression. Reads the 32-D shared z and emits a single non-negative scalar per engine. in_dim must match the funnel's out_dim - hence both reference the same shared_dim constructor arg.
Task head 2 - classification. Reads THE SAME 32-D z and emits raw logits over num_classes. No softmax inside - cross_entropy loss applies log_softmax internally for numerical stability.
Defines what happens when you call model(x). PyTorch routes the call through nn.Module.__call__, which sets up hooks then calls forward. The type hint 'tuple[torch.Tensor, torch.Tensor]' tells static analysers that two tensors come back; it has no runtime effect.
Inline reminder of the input shape contract. (B, T, c_in) = (batch, time, input channels). Reading this BEFORE running the forward in your head saves a debugging session.
Stage 1 forward. self.cnn(x) invokes CNNFrontend.__call__(x) which wires forward + autograd hooks. The trailing comment '# (B, T, 64)' is the SHAPE of h after this call - kept inline so reading the forward top-to-bottom shows the shape evolution.
Stage 2. h is reassigned (memory of previous h is reclaimed once autograd is done with it). Channel axis goes 64 → 512 (= 2 * 256, bidirectional concat).
Stage 3. SHAPE-PRESERVING - input and output are both (B, 30, 512). The block adds direct cycle-to-cycle communication on top of the BiLSTM's sequential summary, but does not change the tensor shape (residual + LayerNorm guarantee that).
Stage 4. Variable name CHANGES from h to z because semantics change: h was per-cycle, z is per-engine. The funnel internally pools the time axis (last cycle only) then funnels 512 → 256 → 64 → 32. The arrow '<-- shared' in the comment flags the contract for both heads.
Task head 1 forward. (B, 32) z in, (B,) non-negative scalars out. The trailing ReLU on rul_head's last layer guarantees rul ≥ 0.
Task head 2 forward. Reads the EXACT SAME z that line 41 just consumed. Outputs raw logits - no softmax. The variable name 'logits' (not 'probs') is a convention reminder that softmax has not been applied.
Returns BOTH outputs as a tuple. The order is part of the API contract - downstream code unpacks 'rul, logits = model(x)' positionally. Pitfall 3 below explains why we return a tuple, not a dict.
Section divider. Below is a tiny end-to-end exercise: build the model, fake some data, run forward + backward, print shapes and parameter count. Catches wiring errors in milliseconds.
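Seeds PyTorch's default random number generator so torch.randn / torch.randint / Linear weight init are reproducible. torch.manual_seed covers CUDA devices as well; bit-exact GPU runs additionally depend on deterministic kernels (for example torch.use_deterministic_algorithms(True)).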
Instantiates the model with the paper's reference hyperparameters. The line continues onto line 49. After this call, model.parameters() iterates over all ~3.4M learnable tensors.
Continuation of line 48. shared_dim=32 = the z width. num_classes=3 = healthy / degrading / critical. After this line model is fully constructed.
Builds a fake input batch of 4 engines × 30 cycles × 14 sensors, sampled from N(0, 1). requires_grad=True attaches it to the autograd graph - .backward() will populate x.grad. (Real training data does NOT need requires_grad - we only set it here so backward has a leaf to flow gradients into for inspection.)
Fake RUL targets. torch.randint(low, high, size) returns int64; the .float() cast makes them float32 because F.mse_loss expects matching dtypes for input and target. The 126 upper bound matches the paper's piecewise-linear RUL cap at 125.
Fake classification targets - one int per engine giving the true class. F.cross_entropy expects CLASS INDICES as int64, NOT one-hot vectors. Values must be in [0, num_classes).
ONE forward call returns BOTH outputs. model(x) is sugar for model.__call__(x) which calls model.forward(x) plus hook handling. Tuple unpacking on the LHS splits the (rul, logits) return into two named tensors.
Computes mean squared error between predicted and target RUL. F.mse_loss is the functional form - stateless, takes (input, target). Default reduction is 'mean' so the result is a single scalar.
Standard classification loss. Internally: log_softmax(logits, dim=-1) then NLL with y_hs as class indices. Fused for numerical stability - never compute softmax then log yourself, you'll lose precision.
Naive sum of the two task losses with a fixed weight of 0.1 on the auxiliary classification loss. This is the BASELINE every multi-task paper improves on; Chapter 12 shows that 0.1 is only a guess, with the appropriate weight varying by 100× across runs - the central problem AMNL/GABA/GRACE solve.
Triggers reverse-mode automatic differentiation. PyTorch traverses the autograd graph backwards from 'loss', accumulating gradients into every leaf tensor with requires_grad=True - that means EVERY model.parameters() tensor AND x. After this call you can read .grad on each one.
Prints the input shape. tuple(x.shape) converts torch.Size (which prints as 'torch.Size([4, 30, 14])') into a plain Python tuple ('(4, 30, 14)') for cleaner output.
Confirms regression head returned ONE scalar per engine. Crucial sanity check - if you ever see (4, 1), squeeze before MSE or broadcasting will silently produce per-element losses of shape (4, 4).
Confirms classification head returned 3 raw scores per engine. Rows are engines, columns are classes.
.item() extracts a 0-D tensor's value as a plain Python float (only works on 0-D / scalar tensors). round(x, 2) keeps two decimals for readable output.
Same pattern. Cross-entropy on 3 classes with near-uniform logits (what a freshly initialised head produces) is ~ln(3) ≈ 1.10 - if your initial loss is far from that, your initialisation is broken.
Start of the parameter-count print (continues onto line 68). Splitting across two lines keeps the f-string under PEP-8's 79-char limit.
Counts total learnable parameters. model.parameters() yields every nn.Parameter tracked by the model; .numel() gives each tensor's element count; sum() adds them up. The ',' format spec inserts thousands separators.
1   import torch
2   import torch.nn as nn
3   import torch.nn.functional as F
4
5   # Components from previous sections
6   from cnn_frontend import CNNFrontend  # 8.4
7   from bilstm_encoder import BiLSTMEncoder  # 9.4
8   from attention_block import AttentionBlock  # 10.4
9   from fc_funnel import FCFunnel  # 11.1
10  from rul_head import RULHead  # 11.2
11  from health_head import HealthHead  # 11.3
12
13
14  class DualTaskModel(nn.Module):
15      """End-to-end model: CNN -> BiLSTM -> Attention -> FC funnel -> (RUL, Health)."""
16
17      def __init__(self,
18                   c_in: int = 14,
19                   lstm_hidden: int = 256,
20                   num_heads: int = 8,
21                   shared_dim: int = 32,
22                   num_classes: int = 3):
23          super().__init__()
24          self.cnn = CNNFrontend(c_in=c_in, dropout_p=0.15)
25          self.lstm = BiLSTMEncoder(input_size=64, hidden_size=lstm_hidden,
26                                    num_layers=2, dropout_p=0.3)
27          self.attn = AttentionBlock(d_model=2 * lstm_hidden,
28                                     num_heads=num_heads, dropout_p=0.1)
29          self.funnel = FCFunnel(d_model=2 * lstm_hidden,
30                                 hidden=(256, 64), out_dim=shared_dim,
31                                 dropout_p=0.3)
32          self.rul_head = RULHead(in_dim=shared_dim)
33          self.health_head = HealthHead(in_dim=shared_dim, num_classes=num_classes)
34
35      def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
36          # x: (B, T, c_in)
37          h = self.cnn(x)  # (B, T, 64)
38          h = self.lstm(h)  # (B, T, 2*hidden)
39          h = self.attn(h)  # (B, T, 2*hidden)
40          z = self.funnel(h)  # (B, shared_dim) <-- shared
41          rul = self.rul_head(z)  # (B,)
42          logits = self.health_head(z)  # (B, num_classes)
43          return rul, logits
44
45
46  # ---------- Smoke test ----------
47  torch.manual_seed(0)
48  model = DualTaskModel(c_in=14, lstm_hidden=256, num_heads=8,
49                        shared_dim=32, num_classes=3)
50
51  x = torch.randn(4, 30, 14, requires_grad=True)
52  y_rul = torch.randint(0, 126, (4,)).float()  # capped RUL
53  y_hs = torch.tensor([0, 1, 2, 1])  # class indices
54
55  rul, logits = model(x)
56
57  loss_rul = F.mse_loss(rul, y_rul)  # regression
58  loss_hs = F.cross_entropy(logits, y_hs)  # classification
59  loss = loss_rul + 0.1 * loss_hs  # placeholder weights
60  loss.backward()
61
62  print("input :", tuple(x.shape))  # (4, 30, 14)
63  print("rul shape :", tuple(rul.shape))  # (4,)
64  print("logits shape :", tuple(logits.shape))  # (4, 3)
65  print("loss_rul :", round(loss_rul.item(), 2))
66  print("loss_hs :", round(loss_hs.item(), 2))
67  print("# params :",
68        f"{sum(p.numel() for p in model.parameters()):,}")  # ~3.4M

Same Backbone, Different Sensors
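Only c_in and the class boundaries change in the model; the window length T is a data-preparation choice that the (B, T, c_in) contract already accommodates. The shared trunk and heads are domain-agnostic.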
| Domain | c_in | Window T | Health classes |
|---|---|---|---|
| C-MAPSS turbofan | 14 | 30 | healthy / degrading / critical |
| N-CMAPSS DS02 | 20 | 30 | same |
| PRONOSTIA bearings | 2 | 60 | OK / inner-race / outer-race |
| Wind-turbine SCADA | 12 | 144 (24 h) | OK / wear / failure |
| Battery cycling (Severson) | 5 | 100 | fresh / mid-life / EOL |
| Disk SMART (Backblaze) | 16 | 30 days | OK / pre-fail / fail-soon |
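The table's claim can be illustrated on the first layer alone. In this sketch a single nn.Conv1d stands in for the CNN frontend (kernel_size=3 and padding=1 are illustrative choices, not taken from §8.4): only its input channel count depends on the domain, and the output is 64 channels per cycle whatever the window length.

```python
import torch
import torch.nn as nn


def first_stage(c_in: int) -> nn.Conv1d:
    """Hypothetical first conv layer; only c_in depends on the domain."""
    return nn.Conv1d(c_in, 64, kernel_size=3, padding=1)


# (c_in, window T) pairs taken from three rows of the table.
domains = {"C-MAPSS": (14, 30), "PRONOSTIA": (2, 60), "Wind SCADA": (12, 144)}

out_shapes = {}
for name, (c_in, T) in domains.items():
    x = torch.randn(4, T, c_in)                               # (B, T, c_in)
    h = first_stage(c_in)(x.transpose(1, 2)).transpose(1, 2)  # (B, T, 64)
    out_shapes[name] = tuple(h.shape)
    print(f"{name:<11} in {tuple(x.shape)} -> out {out_shapes[name]}")
```

Everything downstream of this layer sees 64 channels per cycle, which is why the rest of the trunk can stay unchanged.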
Three Assembly Pitfalls
- Pitfall 1. The BiLSTM concatenates its forward and backward states, so the attention block and the funnel must both be built with d_model = 2 * lstm_hidden.
- Pitfall 2. z.detach() before the health head accidentally cuts the classification gradient out of the backbone. The auxiliary signal then does nothing - and you will spend a week wondering why classification accuracy does not help RUL like the paper claims. Never detach inside the model.
- Pitfall 3. return {"rul": rul, "hs": logits} looks cleaner but can cause friction on nn.DataParallel / torch.compile paths. Stick with the return rul, logits tuple - it is what the rest of the book assumes.
The point. 3.4M parameters, six sub-modules, one forward call, two outputs. The architecture is now complete. Parts IV-VI will leave the architecture frozen and focus entirely on what loss to put on top of it.
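The detach pitfall is easy to verify on a toy model. The sketch below uses a hypothetical one-layer trunk and a linear health head (not the book's modules) and compares the classification gradient that reaches the trunk with and without z.detach().

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def trunk_grad_norm(detach: bool) -> float:
    """Norm of the classification gradient that reaches the shared trunk."""
    torch.manual_seed(0)
    trunk = nn.Linear(14, 32)          # toy stand-in for the shared backbone
    health_head = nn.Linear(32, 3)     # toy stand-in for the health head

    x = torch.randn(8, 14)
    y_hs = torch.randint(0, 3, (8,))

    z = trunk(x)
    if detach:
        z = z.detach()                 # the pitfall: cuts the graph at z

    F.cross_entropy(health_head(z), y_hs).backward()

    grad = trunk.weight.grad           # stays None if nothing flowed back
    return 0.0 if grad is None else grad.norm().item()


print("with detach   :", trunk_grad_norm(detach=True))
print("without detach:", trunk_grad_norm(detach=False))
```

With the detach in place the head still trains, which is why the bug is easy to miss: only the trunk stops receiving the auxiliary signal.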
Takeaway
- One trunk, two heads. CNN → BiLSTM → Attention → FC funnel emit the 32-D shared z. RUL and health heads both read z.
- ~3.4M parameters total. BiLSTM 63%, Attention 31%, everything else 6%. Heads are 0.2%.
- Shape contract. Always (B, T, c_in) in, (B,) and (B, num_classes) out. Never break this.
- Domain-agnostic trunk. Only c_in and class boundaries change between turbofans, bearings, batteries, disks, turbines.
- End of Part III. Architecture done. Chapters 12-13 will diagnose the gradient imbalance that motivates AMNL, GABA, and GRACE.