Chapter 11

Complete DualTaskModel and Parameter Count

Dual-Task Heads & Model Assembly

Assembling the DualTaskModel

Three chapters of components, four sections of heads. This section glues them into one nn.Module that you can instantiate with one line and train end-to-end with one optimizer. Everything you have built fits together like this:

Stage | Source | Shape
Window of normalised sensors | §7 | (B, 30, 14)
CNN frontend | §8 | (B, 30, 64)
BiLSTM encoder | §9 | (B, 30, 512)
Multi-head self-attention | §10 | (B, 30, 512)
FC funnel + last-cycle pool | §11.1 | (B, 32) ← shared
RUL regression head | §11.2 | (B,)
Health classification head | §11.3 | (B, 3) logits
Two outputs from one trunk. The 32-D shared feature vector z is the contract between trunk and heads. Both heads read the SAME z. That is what makes this multi-task and what makes the gradient imbalance of Chapter 12 unavoidable.

End-to-End Shape Trace

Walk a single batch (B=4) through every stage. The only non-trivial transitions are CNN's channel-axis bridge (transpose-in, transpose-out) and FCFunnel's last-cycle pool that collapses the time axis.

Step | Operation | Shape
0 | input batch | (4, 30, 14)
1 | CNNFrontend - transpose to (B,F,T), conv stack, transpose back | (4, 30, 64)
2 | BiLSTMEncoder - 2-layer bidir LSTM | (4, 30, 512)
3 | AttentionBlock - 8-head self-attn + residual + LN | (4, 30, 512)
4 | FCFunnel - last-cycle pool | (4, 512)
5 | FCFunnel - dense 512 → 256 → 64 → 32 | (4, 32) ← z
6a | RULHead - 32 → 64 → 32 → 16 → 1, final ReLU | (4,)
6b | HealthHead - 32 → 32 → 16 → 3 (logits) | (4, 3)
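
The same trace can be checked mechanically. The sketch below is a shape-only stand-in, not the real model: every learned block is replaced by a hypothetical random linear map (stand_in, an illustrative helper that is not part of the book's code) that reproduces only the output width. What it demonstrates is the bookkeeping around the two transposes and the last-cycle pool.

```python
import numpy as np

rng = np.random.default_rng(0)

def stand_in(h, out_dim):
    """Hypothetical placeholder for a learned block: random linear map on the last axis."""
    w = rng.standard_normal((h.shape[-1], out_dim)).astype(np.float32)
    return h @ w

x = rng.standard_normal((4, 30, 14)).astype(np.float32)  # step 0: (B, T, F)

# Step 1: channel-axis bridge around the conv stack.
h = x.transpose(0, 2, 1)                   # (4, 14, 30), channels first
w = rng.standard_normal((64, 14)).astype(np.float32)
h = np.einsum("oc,bct->bot", w, h)         # (4, 64, 30), kernel-size-1 conv stand-in
h = h.transpose(0, 2, 1)                   # (4, 30, 64), back to time-major

h = stand_in(h, 512)                       # step 2: BiLSTM stand-in   -> (4, 30, 512)
h = h + stand_in(h, 512)                   # step 3: residual stand-in -> (4, 30, 512)
pooled = h[:, -1, :]                       # step 4: last-cycle pool   -> (4, 512)
z = stand_in(pooled, 32)                   # step 5: funnel stand-in   -> (4, 32)
rul = np.maximum(stand_in(z, 1), 0.0).squeeze(-1)  # step 6a: ReLU + squeeze -> (4,)
logits = stand_in(z, 3)                    # step 6b: raw scores       -> (4, 3)

for name, arr in [("x", x), ("h", h), ("pooled", pooled),
                  ("z", z), ("rul", rul), ("logits", logits)]:
    print(f"{name:7s}", arr.shape)
```

Only the shapes carry over to the real model; the numeric values here are meaningless.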

Read this before continuing. Notice that the BiLSTM owns ~63% of the parameters (see the budget below) while the two task heads together own less than 0.2%. Yet Chapter 12 will show the classification head's gradient pulling on the shared backbone with up to 500x the magnitude of the regression head's. Parameter share and gradient share are not the same thing - that is the central insight of GABA / GRACE.

The Parameter Budget

Block | Params | Share | Notes
CNN frontend | 53,184 | 1.6% | Three Conv1d → BN → ReLU → Dropout
BiLSTM encoder | 2,140,160 | 63.0% | 2 layers, hidden=256, bidirectional
Multi-head attention | 1,052,672 | 31.0% | 8 heads, d_model=512
FC funnel | 148,000 | 4.4% | 512 → 256 → 64 → 32
RUL head | 4,737 | 0.14% | Regression
Health head | 1,635 | 0.05% | Classification (auxiliary)
Total | ~3.40M | 100% | Reproducible on a single 16 GB GPU
Why “~3.4M” and not exactly 3.40M. The exact figure depends on three knobs: (a) c_in (14 vs 17 sensors), (b) the FC funnel's exact hidden tuple ((256, 64) is the paper default), and (c) whether you use bias=True on the heads. The book's reference config prints 3,400,388 when c_in=14.
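
The two head counts are easy to verify by hand. The sketch below recomputes them under the assumption that each head is a plain stack of fully connected layers with biases. dense_params is an illustrative helper, not part of the book's code. The trunk counts are copied from the table, not recomputed.

```python
def dense_params(widths, bias=True):
    """Parameter count of a stack of fully connected layers with the given widths."""
    return sum(i * o + (o if bias else 0) for i, o in zip(widths, widths[1:]))

rul_head = dense_params([32, 64, 32, 16, 1])     # 32 -> 64 -> 32 -> 16 -> 1
health_head = dense_params([32, 32, 16, 3])      # 32 -> 32 -> 16 -> 3

budget = {
    "cnn": 53_184,            # copied from the table
    "bilstm": 2_140_160,      # copied from the table
    "attention": 1_052_672,   # copied from the table
    "funnel": 148_000,        # copied from the table
    "rul_head": rul_head,
    "health_head": health_head,
}
total = sum(budget.values())

for name, n in budget.items():
    print(f"{name:12s} {n:>10,d}  {100 * n / total:6.2f}%")
print(f"{'total':12s} {total:>10,d}")
```

Under those assumptions the heads come out at 4,737 and 1,635 parameters, and the column sums to 3,400,388, matching the reference figure quoted above.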

Python: Pseudo-Assembly

End-to-end forward pass, NumPy pseudo-assembly
🐍dual_task_model_numpy.py
1Header comment - what this file is

This file is a NumPy-only pseudo-assembly of the dual-task model. Every block (CNN, BiLSTM, attention, funnel, two heads) was implemented in earlier sections; here we glue them together so you can SEE the end-to-end shape flow without any PyTorch magic. The comment tells the reader: do not implement these blocks again - import the per-section reference impls.

2Comment - which chapters supply which blocks

Lists the six imports the file is about to make. Reading this before the imports below tells you the architecture in one breath: cnn_frontend (Ch 8), bilstm_encoder (Ch 9), attention_block (Ch 10), then three Ch 11 modules - fc_funnel (the trunk-to-z bottleneck), rul_head (regression), health_head (classification).

3Comment - continuation

Closes the chapter mapping. Together lines 1-3 are the contract: 'this file does NOT redefine the components, it only wires them in the order the architecture requires.'

5import numpy as np

NumPy is the numerical computing library for Python. It supplies ndarray (the fast N-D array used to hold the (B, 30, 14) sensor windows), the @ matrix-multiply operator that every block relies on under the hood, and broadcasting that makes per-row normalisation a one-liner. We alias it as np by convention so we can write np.random.randn(...) and np.float32 below.

EXECUTION STATE
numpy = Library for numerical computing. Provides ndarray (N-dimensional float arrays), random number generators, dtype control (float32 vs float64), and the @ operator.
as np = Alias - lets us write np.random.randn instead of numpy.random.randn. Universal Python convention; every NumPy tutorial uses 'np'.
6from cnn_frontend import cnn_frontend

Pulls in the function defined back in section 8.4. cnn_frontend is the first stage of the trunk: it takes the raw (B, T, 14) window, transposes to (B, 14, T) so Conv1d sees channels first, runs three Conv1d -> BatchNorm -> ReLU -> Dropout blocks, then transposes back. It mixes information across the 14 sensors at every cycle.

EXECUTION STATE
📚 cnn_frontend() = Function from §8.4. Signature: cnn_frontend(x, **params) -> ndarray. Input shape (B, T, F) -> output shape (B, T, 64). 'F' here is the input sensor count (14 for C-MAPSS).
→ why we need it = Sensors are not independent - oil temperature and exhaust temperature carry overlapping information. Conv1d learns 64 mixed channels per cycle so downstream layers see a richer per-cycle representation than the raw 14 numbers.
7from bilstm_encoder import bilstm_encoder

Pulls in the recurrent encoder from §9.4. Two stacked LSTMs run in BOTH directions over the 30-cycle window: one reads cycles 1->30 (causal), the other reads 30->1 (anti-causal). Their outputs concatenate along the channel axis, doubling 256 hidden units to 512.

EXECUTION STATE
📚 bilstm_encoder() = Function from §9.4. Signature: bilstm_encoder(x, **params) -> ndarray. Input (B, T, 64) -> output (B, T, 2*hidden) = (B, T, 512). The factor of 2 is the bidirectional concatenation.
→ why bidirectional = RUL at cycle t depends on the WHOLE window, not just the past. The forward LSTM gives 'how did we get here?', the backward LSTM gives 'how will the next cycles play out?'. Concat = both views.
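
The doubling from 256 to 512 channels is just a concatenation on the last axis. A minimal sketch, with random arrays standing in for the two directions' hidden states (the real ones come from the §9.4 encoder):

```python
import numpy as np

B, T, H = 2, 30, 256
rng = np.random.default_rng(0)

# Stand-ins for the per-cycle hidden states of each direction.
# The backward states are assumed to be re-aligned to cycle order already.
fwd = rng.standard_normal((B, T, H)).astype(np.float32)
bwd = rng.standard_normal((B, T, H)).astype(np.float32)

out = np.concatenate([fwd, bwd], axis=-1)   # (B, T, 2 * H)
print(out.shape)
```

The first 256 channels of each cycle carry the forward view and the last 256 carry the backward view.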
8from attention_block import attention_block

Pulls in the multi-head self-attention block from §10.4. Lets every cycle attend to every other cycle in the window directly (not through the LSTM's sequential bottleneck). Wrapped in a residual connection + LayerNorm so it is shape-preserving.

EXECUTION STATE
📚 attention_block() = Function from §10.4. Signature: attention_block(x, **params) -> ndarray. Input (B, T, 512) -> output (B, T, 512). 8 heads × 64 d_k = 512.
→ why on top of BiLSTM = The BiLSTM compresses long-range dependencies through a chain of recurrences (cycle 1 reaches cycle 30 via 29 hops). Attention lets cycle 1 talk to cycle 30 in ONE hop - critical for catching abrupt sensor anomalies.
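
The "one hop" claim is visible in the attention weights, which form a full T × T matrix per head. The sketch below shows only the scaled dot-product core with 8 heads of width 64. The random projections are hypothetical stand-ins for learned weights, and the output projection, residual and LayerNorm of the real block are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

B, T, D, H = 2, 30, 512, 8
d_k = D // H                                  # 64 channels per head
rng = np.random.default_rng(0)
h = rng.standard_normal((B, T, D)).astype(np.float32)

# Hypothetical random projections in place of learned W_q, W_k, W_v.
wq, wk, wv = (rng.standard_normal((D, D)).astype(np.float32) * D ** -0.5
              for _ in range(3))

def split_heads(t):
    return t.reshape(B, T, H, d_k).transpose(0, 2, 1, 3)   # (B, H, T, d_k)

q, k, v = split_heads(h @ wq), split_heads(h @ wk), split_heads(h @ wv)
scores = q @ k.transpose(0, 1, 3, 2) * d_k ** -0.5          # (B, H, T, T)
weights = softmax(scores)                                   # each row sums to 1
context = (weights @ v).transpose(0, 2, 1, 3).reshape(B, T, D)

print(weights.shape, context.shape)
print("weight of cycle 30 as seen from cycle 1, head 0:", float(weights[0, 0, 0, -1]))
```

The entry weights[b, head, 0, -1] links the first and last cycle directly, with no recurrence in between.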
9from fc_funnel import fc_funnel

Pulls in the bottleneck from §11.1. Two responsibilities: (1) collapse the time axis by taking only the LAST cycle's 512-D vector (the 'right-now' embedding), then (2) funnel 512 → 256 → 64 → 32 through dense layers. The output is the 32-D shared representation z that both task heads read.

EXECUTION STATE
📚 fc_funnel() = Function from §11.1. Signature: fc_funnel(x, **params) -> ndarray. Input (B, T, 512) -> output (B, 32). The time axis is GONE after this call.
→ why 'last-cycle' pool = We are predicting RUL and health AT THE CURRENT cycle. The most recent cycle's representation is the most informative; mean-pool over T would dilute it. (Ch 12 will revisit attention pooling as an alternative.)
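
The difference between the two pooling choices is one line each. A small sketch on a toy tensor (4 channels instead of 512 so the numbers stay readable):

```python
import numpy as np

# Toy (B, T, C) tensor: 2 engines, 30 cycles, 4 channels.
h = np.arange(2 * 30 * 4, dtype=np.float32).reshape(2, 30, 4)

last = h[:, -1, :]        # last-cycle pool: keep only the most recent cycle
mean = h.mean(axis=1)     # mean pool: average over all 30 cycles

print(last.shape, mean.shape)   # both drop the time axis
print(last[0], mean[0])
```

Both collapse the time axis; last-cycle pooling keeps the most recent representation untouched, while mean pooling blends it with the 29 earlier cycles.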
10from rul_head import rul_head

Pulls in the regression head from §11.2. A 4-layer MLP that maps the 32-D shared vector z to a SINGLE non-negative scalar - the predicted remaining useful life in cycles. Final activation is ReLU so the prediction is always ≥ 0 (no engine has 'negative life left').

EXECUTION STATE
📚 rul_head() = Function from §11.2. Signature: rul_head(z, **params) -> ndarray. Input (B, 32) -> output (B,). Architecture: 32 → 64 → 32 → 16 → 1, ReLU on output.
11from health_head import health_head

Pulls in the classification head from §11.3. A 3-layer MLP that maps z to RAW logits over {healthy, degrading, critical}. No softmax here - softmax is fused inside cross_entropy in PyTorch for numerical stability, and this NumPy version mirrors that contract.

EXECUTION STATE
📚 health_head() = Function from §11.3. Signature: health_head(z, **params) -> ndarray. Input (B, 32) -> output (B, 3). Architecture: 32 → 32 → 16 → 3, no final activation (raw logits).
→ why share z = Both heads read the SAME z vector. If we trained two separate models, the regression model would never benefit from the classification signal. Sharing z is the entire reason this is a multi-task problem.
14def dual_task_model(x, params) → (rul, logits)

The end-to-end forward pass. One input batch in, two outputs out - a regression scalar per engine AND a 3-vector of class logits per engine. Both come from the same shared trunk; that shared computation is what lets one task help the other.

EXECUTION STATE
⬇ input: x (B, 30, 14) = Normalised sliding window. B = batch size (often 64 in training, 2 in this smoke test). 30 = T (cycles per window, the standard C-MAPSS choice). 14 = sensors after the §6.4 pruning that dropped 7 constant-valued sensors.
→ x dtype = float32 (set on line 32 with .astype(np.float32)). Saves memory vs float64 with no measurable accuracy loss for this model.
⬇ input: params = Dict-of-dicts. params['cnn'] holds the CNN frontend's weights & biases, params['lstm'] holds the BiLSTM tensors, etc. Each sub-dict is splatted with **params[...] into the matching block below.
→ why dict-of-dicts = Each block has different parameter NAMES (Conv1d wants 'kernel_w', Linear wants 'W'). A flat dict would clash. Nested dict keeps each block's namespace clean.
→ tuple type hint = Python's built-in 'tuple' annotation. Tells readers (and type checkers) that this function returns a 2-element tuple, not a list or a dict. The annotation has no runtime effect; it is the tuple return convention itself that matters for nn.DataParallel compatibility in the PyTorch version (see Pitfall 3 below).
⬆ returns = (rul, logits) where rul: (B,) non-negative scalars (cycles to failure), logits: (B, 3) raw classification scores.
15Docstring opens: """End-to-end forward pass.

Docstring start. Standard Python docstring convention (PEP 257): the first line is a one-sentence summary (here: 'End-to-end forward pass.'), then a blank line, then the parameter/return descriptions. Docstrings are shown by help(dual_task_model) and can be collected by Sphinx for the API docs.

17Docstring: x: (B, 30, 14) - normalised sliding window of 14 sensors

Documents the input contract. (B, 30, 14) means batch first, then 30 cycles per window, then 14 sensors. 'Normalised' is critical - the window must already have z-score normalisation applied (mean ≈ 0, std ≈ 1 per sensor); the model itself does NOT include a normalisation layer.
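
Since normalisation happens outside the model, the caller has to apply it before building windows. A minimal sketch of per-sensor z-scoring on fake data, assuming the statistics are computed on the training split only (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake raw readings, shape (cycles, sensors). Purely illustrative values.
raw = rng.normal(loc=500.0, scale=20.0, size=(200, 14))

# Assumption: in practice these statistics come from the training split.
mean = raw.mean(axis=0)
std = raw.std(axis=0) + 1e-8          # guard against division by zero

normed = (raw - mean) / std           # mean ~ 0, std ~ 1 per sensor

# Last 30 cycles as one window with a leading batch axis.
window = normed[-30:][np.newaxis].astype(np.float32)   # (1, 30, 14)
print(window.shape, window.dtype)
```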

18Docstring: returns: (rul, logits) where rul: (B,), logits: (B, 3)

Documents the output contract. rul shape (B,) is a 1-D vector of one scalar per engine. logits shape (B, 3) is a matrix of one row per engine, three raw scores per row (one per health class). This is the contract every downstream training loop relies on.

19Docstring closes: """

Closes the triple-quoted docstring block. Below this point is executable code.

20h = cnn_frontend(x, **params["cnn"])

Stage 1 of the trunk. Calls the CNN frontend with the raw input AND its parameters splatted in. h becomes the per-cycle feature tensor with 64 learned channels. The trailing comment # (B, 30, 64) is the SHAPE of h after this line - useful when reading the code top-to-bottom.

EXECUTION STATE
📚 cnn_frontend(x, **params) = Function from §8.4. Wraps three Conv1d → BatchNorm → ReLU → Dropout blocks plus the in-and-out transposes that bridge between (B,T,F) - the convention this file uses - and (B,F,T) which Conv1d wants internally.
⬇ arg 1: x (B, 30, 14) = The raw normalised window passed in by the caller.
⬇ arg 2: **params["cnn"] = Dict splat. params['cnn'] is something like {'kernel_w1': ..., 'bias_b1': ..., 'bn1_gamma': ..., ...}. The ** unpacks each key into a keyword argument: kernel_w1=..., bias_b1=..., etc.
→ ** splat example = p = {'a': 1, 'b': 2}; f(**p) is equivalent to f(a=1, b=2). Saves writing every keyword by hand AND keeps the call site short when a block has 20+ parameter tensors.
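
The dict-of-dicts plus ** pattern can be tried in isolation. In this sketch scale_shift and clip are hypothetical toy blocks, not the book's functions; the point is that each block receives only its own namespace.

```python
def scale_shift(x, gamma, beta):
    """Toy block with its own parameter names."""
    return [gamma * v + beta for v in x]

def clip(x, lo, hi):
    """Second toy block with different parameter names."""
    return [min(max(v, lo), hi) for v in x]

# One sub-dict per block; keys match that block's keyword arguments.
params = {
    "norm": {"gamma": 2.0, "beta": 1.0},
    "clip": {"lo": 0.0, "hi": 4.0},
}

h = scale_shift([0.0, 1.0, 2.0], **params["norm"])  # same as gamma=2.0, beta=1.0
h = clip(h, **params["clip"])                       # same as lo=0.0, hi=4.0
print(h)

# A key the block does not accept fails loudly instead of being ignored.
try:
    clip(h, **params["norm"])
    mismatch_raised = False
except TypeError:
    mismatch_raised = True
print("mismatch raised TypeError:", mismatch_raised)
```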
⬆ result: h (B, 30, 64) = Per-cycle 64-D mixed-sensor representation. Time axis (T=30) is preserved; the channel axis grew from 14 to 64.
21h = bilstm_encoder(h, **params["lstm"])

Stage 2 of the trunk. Sweeps the 30-cycle sequence with TWO stacked LSTMs in both directions. The variable name 'h' is REASSIGNED - this is intentional; we never need the raw CNN output again, and reusing the name keeps memory usage low.

EXECUTION STATE
📚 bilstm_encoder(h, **params) = Function from §9.4. Internally: layer-1 forward LSTM, layer-1 backward LSTM, concat → layer-2 forward, layer-2 backward, concat. Final concatenation gives 256 forward + 256 backward = 512 channels per cycle.
⬇ arg 1: h (B, 30, 64) = The CNN output from line 20. h is reassigned, not appended.
⬇ arg 2: **params["lstm"] = Dict of LSTM weights: input-to-hidden (W_ih), hidden-to-hidden (W_hh), and biases for forget/input/cell/output gates × 2 layers × 2 directions. ~2.14M parameters - the dominant cost in the whole model.
⬆ result: h (B, 30, 512) = Now each cycle's vector is 512-D and encodes both past AND future context within the 30-cycle window.
22h = attention_block(h, **params["attn"])

Stage 3 of the trunk. Adds direct cycle-to-cycle communication on top of the BiLSTM's sequential summary. The block is shape-preserving thanks to its residual connection: output = LayerNorm(h + MultiHead(h, h, h)).

EXECUTION STATE
📚 attention_block(h, **params) = Function from §10.4. Wraps multi-head self-attention with 8 heads, plus the residual + LayerNorm 'add & norm' wrapper that makes attention blocks composable.
⬇ arg 1: h (B, 30, 512) = BiLSTM output. Notice we PASS THE SAME h THREE TIMES inside attention_block (as Q, K, V) - that is what makes it 'self-attention' as opposed to encoder-decoder cross-attention.
⬇ arg 2: **params["attn"] = Q/K/V projection matrices (3 × 512 × 512), output projection (512 × 512), and the 4 LayerNorm parameters. ~1.05M parameters.
⬆ result: h (B, 30, 512) = Same shape, but every cycle now has a direct (attention-weighted) view of every other cycle in the window.
23z = fc_funnel(h, **params["funnel"])

Stage 4 - the bottleneck. The variable name CHANGES from h to z because the meaning changes: h was per-cycle, z is per-engine. The time axis collapses here (last-cycle pool keeps only the final cycle's 512-D vector), then dense layers funnel 512 → 256 → 64 → 32. The 32-D z is the SHARED representation both heads consume.

EXECUTION STATE
📚 fc_funnel(h, **params) = Function from §11.1. Two phases: (1) z = h[:, -1, :] - last-cycle pool, shape becomes (B, 512). (2) Two Dense → ReLU → Dropout layers that funnel 512 → 256 → 64, then a final Dense 64 → 32 with no activation (three dense layers in total).
⬇ arg 1: h (B, 30, 512) = Attention output. After the last-cycle pool only h[:, -1, :] (shape (B, 512)) survives.
⬇ arg 2: **params["funnel"] = Three weight matrices (512×256, 256×64, 64×32), three bias vectors, ~148k params total.
⬆ result: z (B, 32) ← SHARED = 32 numbers per engine. This vector is the contract between trunk and heads. CHANGE its size and BOTH head dimensionalities must change too.
→ why 32? = Empirically tuned in §11.1. Larger z gives more head capacity but lets each head OVERFIT its own task and starve the other. 32 is the sweet spot the paper landed on.
24rul = rul_head(z, **params["rul"])

Task head 1 - regression. Maps the 32-D shared z into a single non-negative scalar per engine. The trailing ReLU on the rul_head's final layer guarantees rul ≥ 0 - we cannot predict 'negative life left'.

EXECUTION STATE
📚 rul_head(z, **params) = Function from §11.2. 4-layer MLP: 32 → 64 → 32 → 16 → 1 with ReLU between every pair of layers AND on the output. Total: 4,737 params - tiny next to the trunk.
⬇ arg 1: z (B, 32) = The shared representation. Identical to the z that health_head will see on the next line.
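⬆ result: rul (B,) = 1-D array of B non-negative floats. The unit is 'cycles' (e.g. rul[0] = 87.3 means engine 0 has ~87 cycles left). NumPy does NOT drop a trailing dimension of size 1 on its own, so the head has to squeeze (B, 1) → (B,) explicitly (for example with .squeeze(-1)) to honour this contract.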
→ why ReLU on output = Without it, the head could output negative numbers and waste capacity learning to bound them away from zero. ReLU enforces the bound architecturally.
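
What the last step of such a head might look like in NumPy, as a sketch: the pre-activation values are made up, and the real rul_head lives in §11.2. The ReLU clamps negatives to zero and an explicit squeeze yields the (B,) shape.

```python
import numpy as np

# Made-up output of the final dense layer, shape (B, 1).
pre = np.array([[-3.2], [87.3]], dtype=np.float32)

rul = np.maximum(pre, 0.0)   # ReLU: negative predictions are clamped to 0
rul = rul.squeeze(-1)        # explicit (B, 1) -> (B,)

print(rul.shape, rul)
```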
25logits = health_head(z, **params["hs"])

Task head 2 - classification. Reads THE SAME z that rul_head just consumed and emits raw logits over the 3 health classes. No softmax here - the classifier's loss (cross_entropy in the PyTorch version below) applies log-softmax internally for numerical stability.

EXECUTION STATE
📚 health_head(z, **params) = Function from §11.3. 3-layer MLP: 32 → 32 → 16 → 3 with ReLU in between, no activation on output. Total: 1,635 params - the smallest block in the model.
⬇ arg 1: z (B, 32) = EXACT SAME object reference as the z passed to rul_head on line 24. This is the multi-task 'shared trunk' in concrete terms.
⬆ result: logits (B, 3) = Raw scores - any real number. Apply np.argmax(logits, axis=-1) to get the predicted class index ∈ {0, 1, 2} = {healthy, degrading, critical}.
→ why raw logits, not softmax = Cross-entropy loss = log_softmax + NLL fused. Computing softmax here and then log inside the loss would cost numerical precision (log of small softmax values is unstable). PyTorch enforces this convention; we mirror it.
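
The stability argument can be reproduced with extreme logits. This is a NumPy sketch of the idea (log-sum-exp with a max shift), not PyTorch's actual implementation; log_softmax and cross_entropy here are local illustrative functions.

```python
import numpy as np

def log_softmax(logits):
    # Subtracting the row max keeps exp() in a safe range.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    lp = log_softmax(logits)
    return -lp[np.arange(len(targets)), targets].mean()

logits = np.array([[1000.0, 0.0, -1000.0],
                   [2.0, 1.0, 0.1]])
targets = np.array([1, 0])

loss = cross_entropy(logits, targets)          # finite

# Naive route: softmax first, log afterwards. exp(1000) overflows.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    naive_loss = -np.log(probs[np.arange(2), targets]).mean()

print("fused :", loss)
print("naive :", naive_loss)
```

The fused version stays finite, while the naive version ends up taking the log of a probability that has collapsed to zero.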
26return rul, logits

Returns BOTH outputs as a 2-tuple. Python will pack them into a tuple even without explicit parentheses (it sees the comma). This is the canonical multi-task return signature; the training loop in Ch 15 will unpack it as 'rul_pred, hs_pred = model(x)' and feed each into its own loss.

EXECUTION STATE
⬆ return: (rul, logits) = tuple( ndarray (B,), ndarray (B, 3) ). The order matters - downstream code unpacks positionally.
→ why tuple, not dict = See Pitfall 3 lower on this page. Tuples are stable across nn.DataParallel and torch.compile; dicts can break those.
29# ---------- Smoke test ----------

Section divider. Everything below is a tiny dry-run that proves the wiring is correct - it does NOT train the model. Smoke tests are the cheapest way to catch a shape mismatch BEFORE you waste GPU time on a real training run.

30np.random.seed(0)

Sets NumPy's global RNG seed to 0 so the random input on line 32 is REPRODUCIBLE - rerun the script and you get the exact same x. Critical for debugging: a shape error today will reproduce tomorrow.

EXECUTION STATE
📚 np.random.seed(s) = Initialises the Mersenne-Twister random state to a deterministic value. Affects every np.random.* call thereafter (randn, randint, choice, ...).
⬇ arg: 0 = Any non-negative integer works; '0' is convention for 'first run'. Different seeds give different x; same seed gives identical x.
31B, T, F = 2, 30, 14

Tuple unpack: B = 2 (batch size), T = 30 (cycles per window), F = 14 (sensors). Naming convention echoes the docstring shape (B, T, F). Tiny B keeps the smoke test fast - real training uses 64 or 128.

EXECUTION STATE
B = 2 - batch size (engines per forward pass)
T = 30 - cycles per sliding window (paper standard for C-MAPSS)
F = 14 - input sensor count after dropping 7 constants in §6.4
32x = np.random.randn(B, T, F).astype(np.float32)

Builds a fake (2, 30, 14) input drawn from a standard normal distribution N(0, 1). Used purely to verify shapes flow correctly. Real data would be pre-normalised sensor readings; standard-normal noise mimics the post-normalisation statistics closely enough for a smoke test.

EXECUTION STATE
📚 np.random.randn(*shape) = Returns an ndarray of the given shape filled with samples from N(0, 1) - mean 0, std 1. randn(2, 30, 14) → shape (2, 30, 14).
⬇ args: B, T, F = 2, 30, 14 = Each int becomes one dimension. randn takes positional args, NOT a tuple - randn((2,30,14)) would raise.
📚 .astype(np.float32) = ndarray method that returns a NEW ndarray with the requested dtype. Default randn returns float64; we cast to float32 to halve memory and match what GPUs prefer.
→ np.float32 = 32-bit IEEE-754 floating point. ~7 decimal digits of precision - more than enough for sensor data. float64 would double memory for no measurable benefit.
⬆ result: x = shape (2, 30, 14), dtype float32, values ~ N(0, 1)
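
The three claims about seeding, argument style and dtype can be checked in a few lines:

```python
import numpy as np

np.random.seed(0)
a = np.random.randn(2, 30, 14)
np.random.seed(0)
b = np.random.randn(2, 30, 14)        # same seed, same draw

x = a.astype(np.float32)              # new array, half the bytes

try:
    np.random.randn((2, 30, 14))      # tuple instead of positional ints
    tuple_raised = False
except TypeError:
    tuple_raised = True

print(np.array_equal(a, b), a.dtype, x.dtype, a.nbytes, x.nbytes, tuple_raised)
```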
35params = build_random_params()

Stand-in helper - assumed to live in §8-11 supporting code - that returns a dict-of-dicts with random tensors of the right shape for every block. In a real run, params would hold trained weights (for example exported from a checkpoint read with torch.load(...)); here we just need SOMETHING the right shape so the forward pass executes.

EXECUTION STATE
📚 build_random_params() = Helper function (not shown). Returns: {'cnn': {...}, 'lstm': {...}, 'attn': {...}, 'funnel': {...}, 'rul': {...}, 'hs': {...}}. Each sub-dict matches the kwargs of its corresponding block.
37rul, logits = dual_task_model(x, params)

The single forward call. Drives x through CNN → BiLSTM → Attention → FCFunnel → both heads in one shot. Tuple unpacking on the LHS splits the (rul, logits) return into two named arrays.

EXECUTION STATE
⬇ arg 1: x = (2, 30, 14) float32 - the smoke-test batch from line 32
⬇ arg 2: params = the dict-of-dicts from line 35
⬆ rul = shape (2,) - one non-negative scalar per engine
⬆ logits = shape (2, 3) - 3 raw class scores per engine
39print("input shape :", x.shape)

Print the input shape so the operator can eyeball that the smoke test received what it expected. .shape on an ndarray returns a tuple of dimension sizes.

EXECUTION STATE
📚 ndarray.shape = Attribute (NOT a method - no parens). Returns a tuple of ints, one per axis. For x: (2, 30, 14).
⬆ Output = input shape : (2, 30, 14)
40print("rul shape :", rul.shape)

Confirms the regression head returned ONE scalar per engine (B,) - not (B, 1). If you ever see (B, 1), squeeze the trailing dim before computing MSE; otherwise broadcasting will silently produce (B, B) per-element losses.

EXECUTION STATE
⬆ Output = rul shape : (2,)
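
The (B, 1) versus (B,) pitfall is easy to reproduce. A small sketch with made-up predictions and targets:

```python
import numpy as np

pred = np.array([[10.0], [20.0], [30.0]])   # (B, 1): head forgot to squeeze
target = np.array([12.0, 18.0, 33.0])       # (B,)

wrong = (pred - target) ** 2                # broadcasts to (B, B)
right = (pred.squeeze(-1) - target) ** 2    # (B,), one error per engine

print(wrong.shape, wrong.mean())
print(right.shape, right.mean())
```

No exception is raised in the first case; the loss is simply computed over every prediction-target pair, which is why checking the shape in the smoke test is worthwhile.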
41print("logits shape :", logits.shape)

Confirms the classification head returned 3 raw scores per engine. (2, 3) means 2 engines × 3 classes; rows are engines, columns are classes.

EXECUTION STATE
⬆ Output = logits shape : (2, 3)
42print("rul values :", rul.round(2).tolist())

Prints the actual RUL predictions, rounded to 2 decimals and converted to a Python list (so the print is human-readable, not 'array([...], dtype=float32)'). Every value should be ≥ 0 thanks to the final ReLU on rul_head.

EXECUTION STATE
📚 ndarray.round(decimals) = Returns a NEW ndarray rounded element-wise to the given number of decimal places. round(2) → 2 decimals.
📚 ndarray.tolist() = Returns a nested Python list. Useful for printing; standard list repr is friendlier than ndarray repr.
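⬆ Output (example) = rul values : [12.34, 87.91] (illustrative values; because tolist() converts float32 to Python floats, the printed numbers can show extra trailing digits)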
43print("argmax(hs) :", logits.argmax(-1).tolist())

Picks the predicted class for each engine: argmax along the LAST axis returns the index of the largest logit per row. axis=-1 on a (B, 3) array yields a (B,) vector of class indices ∈ {0, 1, 2}.

EXECUTION STATE
📚 ndarray.argmax(axis) = Returns the INDEX of the maximum value along the given axis. Element-wise max would lose which class won; argmax keeps the index.
⬇ arg: -1 = The LAST axis. For shape (B, 3), axis=-1 is axis=1 (the class dimension). axis=0 would compare engines AGAINST each other within each class - wrong.
→ axis=-1 vs axis=0 example = logits = [[1, 5, 3], [8, 2, 7]]; axis=-1 → [1, 0] (per-engine winner); axis=0 → [1, 0, 1] (per-class winner across engines)
⬆ Output (example) = argmax(hs) : [1, 0] # engine 0 → degrading, engine 1 → healthy
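
The axis choice and the index-to-label mapping can be confirmed directly. The class-name list below simply follows the {healthy, degrading, critical} ordering used in this section.

```python
import numpy as np

logits = np.array([[1, 5, 3],
                   [8, 2, 7]])

per_engine = logits.argmax(axis=-1)   # winner within each row (engine)
per_class = logits.argmax(axis=0)     # winner within each column (class)

names = ["healthy", "degrading", "critical"]
labels = [names[i] for i in per_engine]

print(per_engine.tolist(), per_class.tolist(), labels)
```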
 1  # Pseudo-NumPy assembly. Each call wraps the per-component implementation
 2  # from Chapters 8-11 (cnn_frontend, bilstm_encoder, attention_block,
 3  # fc_funnel, rul_head, health_head).
 4
 5  import numpy as np
 6  from cnn_frontend  import cnn_frontend     # §8.4
 7  from bilstm_encoder import bilstm_encoder  # §9.4
 8  from attention_block import attention_block # §10.4
 9  from fc_funnel     import fc_funnel        # §11.1
10  from rul_head      import rul_head         # §11.2
11  from health_head   import health_head      # §11.3
12
13
14  def dual_task_model(x: np.ndarray, params: dict) -> tuple:
15      """End-to-end forward pass.
16
17      x:      (B, 30, 14)  - normalised sliding window of 14 sensors
18      returns: (rul, logits) where rul: (B,), logits: (B, 3)
19      """
20      h = cnn_frontend(x,        **params["cnn"])      # (B, 30, 64)
21      h = bilstm_encoder(h,      **params["lstm"])     # (B, 30, 512)
22      h = attention_block(h,     **params["attn"])     # (B, 30, 512)
23      z = fc_funnel(h,           **params["funnel"])   # (B, 32)   <-- shared
24      rul    = rul_head(z,       **params["rul"])      # (B,)
25      logits = health_head(z,    **params["hs"])       # (B, 3)
26      return rul, logits
27
28
29  # ---------- Smoke test ----------
30  np.random.seed(0)
31  B, T, F = 2, 30, 14
32  x = np.random.randn(B, T, F).astype(np.float32)
33
34  # params dict assembled elsewhere (one entry per block)
35  params = build_random_params()                       # helper not shown; see §8-11
36
37  rul, logits = dual_task_model(x, params)
38
39  print("input shape  :", x.shape)                     # (2, 30, 14)
40  print("rul shape    :", rul.shape)                   # (2,)
41  print("logits shape :", logits.shape)                # (2, 3)
42  print("rul values   :", rul.round(2).tolist())       # all >= 0
43  print("argmax(hs)   :", logits.argmax(-1).tolist())  # predicted class

PyTorch: One nn.Module

DualTaskModel + smoke test (forward + backward)
🐍dual_task_model.py
1import torch

PyTorch's core package. Provides torch.Tensor (the GPU-aware n-D array that replaces NumPy's ndarray for deep learning), the autograd engine that records operations and computes gradients on .backward(), torch.manual_seed for reproducibility, and torch.randn / torch.tensor / torch.randint constructors used below.

EXECUTION STATE
torch = Top-level package. Holds Tensor, autograd, RNG, dtype constants (torch.float32), device handles (torch.device('cuda')), and serialisation (torch.load/save).
2import torch.nn as nn

Module / layer namespace. nn.Module is the base class every model and sub-module inherits from; nn.Linear, nn.Conv1d, nn.LSTM, nn.MultiheadAttention, nn.LayerNorm all live here. Aliased as 'nn' by universal convention - 'nn.Linear' reads better than 'torch.nn.Linear'.

EXECUTION STATE
nn.Module = Base class. Subclasses must implement __init__ (register sub-modules and parameters) and forward (define the computation). __call__ wires forward + hooks for you.
3import torch.nn.functional as F

The FUNCTIONAL counterpart to nn. Same operations but as plain functions that take inputs (and weights) explicitly, with no learnable state of their own. We use F.mse_loss and F.cross_entropy below - losses are stateless, so the functional form is the standard.

EXECUTION STATE
F.mse_loss(input, target) = Mean-squared-error loss. Equivalent to ((input - target) ** 2).mean(). Used for the RUL regression.
F.cross_entropy(logits, target) = Combines log_softmax + NLL loss in one numerically stable call. Used for the health classification.
5# Components from previous sections

Section divider comment. Tells the reader the next six lines pull in PyTorch CLASS implementations of the same blocks the NumPy version above used as functions. Naming convention: function_name -> CamelCaseClass (e.g. cnn_frontend -> CNNFrontend).

6from cnn_frontend import CNNFrontend

Pulls in the §8.4 CNN frontend as an nn.Module subclass. CNNFrontend wraps three Conv1d → BatchNorm1d → ReLU → Dropout blocks plus the (B,T,F) ↔ (B,F,T) transposes Conv1d needs. Calling CNNFrontend(...) returns an INSTANCE (stored as self.cnn on line 24); calling that instance later, as self.cnn(x) inside forward, runs the forward pass.

EXECUTION STATE
📚 CNNFrontend = Class from §8.4. Constructor: CNNFrontend(c_in: int, dropout_p: float). Forward: (B, T, c_in) → (B, T, 64). ~53k learnable params.
7from bilstm_encoder import BiLSTMEncoder

Pulls in the §9.4 two-layer bidirectional LSTM. Wraps PyTorch's nn.LSTM with bidirectional=True and num_layers=2. Doubles the channel axis: forward and backward hidden states are concatenated.

EXECUTION STATE
📚 BiLSTMEncoder = Class from §9.4. Constructor: BiLSTMEncoder(input_size, hidden_size, num_layers, dropout_p). Forward: (B, T, in) → (B, T, 2*hidden). ~2.14M params.
8from attention_block import AttentionBlock

Pulls in the §10.4 multi-head self-attention block. Internally uses nn.MultiheadAttention(batch_first=True), then wraps it in residual + LayerNorm. Shape-preserving.

EXECUTION STATE
📚 AttentionBlock = Class from §10.4. Constructor: AttentionBlock(d_model, num_heads, dropout_p). Forward: (B, T, d_model) → (B, T, d_model). ~1.05M params.
9from fc_funnel import FCFunnel

Pulls in the §11.1 bottleneck. Two phases: last-cycle pool (h[:, -1, :]) collapses time, then dense layers funnel down to shared_dim. Output is the SHARED vector both heads consume.

EXECUTION STATE
📚 FCFunnel = Class from §11.1. Constructor: FCFunnel(d_model, hidden: tuple[int, ...], out_dim, dropout_p). Forward: (B, T, d_model) → (B, out_dim). ~148k params.
10from rul_head import RULHead

Pulls in the §11.2 regression head. 4-layer MLP 32→64→32→16→1 with a final ReLU so the output is always ≥ 0.

EXECUTION STATE
📚 RULHead = Class from §11.2. Constructor: RULHead(in_dim). Forward: (B, in_dim) → (B,). 4,737 params.
11from health_head import HealthHead

Pulls in the §11.3 classification head. 3-layer MLP 32→32→16→num_classes that emits raw logits (no softmax - cross_entropy applies log_softmax internally).

EXECUTION STATE
📚 HealthHead = Class from §11.3. Constructor: HealthHead(in_dim, num_classes). Forward: (B, in_dim) → (B, num_classes). 1,635 params.
14class DualTaskModel(nn.Module):

Defines the end-to-end model as a subclass of nn.Module. Inheriting from nn.Module gives us automatic parameter registration (every sub-module assigned to self.something is tracked), .parameters() iteration for the optimiser, .to(device) for GPU placement, and .state_dict() for checkpointing.

EXECUTION STATE
📚 nn.Module = PyTorch base class. Tracks sub-modules registered via 'self.x = SomeModule(...)' and learnable parameters via 'self.x = nn.Parameter(...)'. Forward pass is invoked by calling the instance: model(x) calls model.forward(x).
⬇ input: x (B, 30, c_in) = Sliding window of normalised sensor readings. B = batch size, 30 = T (cycles per window), c_in = number of input sensors (default 14 for C-MAPSS).
⬆ returns = Tuple (rul, logits) where rul has shape (B,) and logits has shape (B, num_classes).
15Docstring: """End-to-end model: CNN -> BiLSTM -> Attention -> FC funnel -> (RUL, Health)."""

One-line docstring summarising the full architecture. Visible via help(DualTaskModel). The arrow chain is the trunk; the parenthesised tuple is the two task heads that branch off the shared 32-D vector.

17def __init__(self, c_in=14, lstm_hidden=256, num_heads=8, shared_dim=32, num_classes=3):

Constructor with five hyperparameters and all-default values matching the paper. Defaults are SAFE - instantiating DualTaskModel() with no args reproduces the published model.

EXECUTION STATE
⬇ self = The instance being constructed. PyTorch will populate self._modules and self._parameters as we assign submodules.
⬇ c_in: int = 14 = Number of input sensors. 14 = C-MAPSS after dropping 7 constants in §6.4. Other domains override: PRONOSTIA bearings = 2, wind turbine SCADA = 12.
⬇ lstm_hidden: int = 256 = Hidden size of EACH LSTM direction. After bidirectional concat the channel axis becomes 2*256 = 512. Lowering it gives a lighter model but weaker temporal modelling.
⬇ num_heads: int = 8 = Multi-head attention heads. d_model (512) must be divisible by num_heads → d_k = 512/8 = 64 per head. Common alternatives: 4 (faster, less expressive) or 16 (richer but more compute).
⬇ shared_dim: int = 32 = Width of the shared z vector consumed by BOTH heads. Larger = more capacity per head but tasks compete more; 32 is the §11.1 sweet spot.
⬇ num_classes: int = 3 = Health classes: {healthy, degrading, critical}. Other domains override: bearings = 3 (OK / inner / outer race), disks = 3 (OK / pre-fail / fail-soon).
23super().__init__()

Must run before any sub-module or parameter is assigned, which in practice means making it the FIRST line of any nn.Module subclass __init__. Initialises the parent class's bookkeeping dicts (_modules, _parameters, _buffers). Skipping this raises 'cannot assign module before Module.__init__() call' the first time you try self.cnn = ....

EXECUTION STATE
📚 super() = Returns a proxy to the parent class (nn.Module here). super().__init__() calls nn.Module.__init__(self) - the parent's constructor.
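The failure mode described above is easy to reproduce. The sketch below is a minimal illustration only; the class name `Broken` is hypothetical and not part of the chapter's code.

```python
import torch.nn as nn


class Broken(nn.Module):
    """Hypothetical module that forgets to call super().__init__()."""

    def __init__(self):
        # super().__init__() is intentionally missing here.
        self.fc = nn.Linear(4, 2)


try:
    Broken()
    raised = False
except AttributeError as err:
    raised = True
    print(err)   # the message mentions Module.__init__()
```

The assignment fails because the bookkeeping dict that would hold `fc` does not exist yet.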
24self.cnn = CNNFrontend(c_in=c_in, dropout_p=0.15)

Stage 1 of the trunk. Constructs the CNN frontend and assigns it to self.cnn. Because CNNFrontend is an nn.Module, this assignment AUTO-REGISTERS it - its parameters now appear in model.parameters() and its buffers in model.state_dict().

EXECUTION STATE
📚 CNNFrontend(c_in, dropout_p) = Class constructor. Builds three Conv1d → BatchNorm1d → ReLU → Dropout blocks plus the in/out transposes between (B,T,F) and (B,F,T).
⬇ arg: c_in=c_in = Forwards the constructor arg (default 14) into the CNN. The Conv1d's in_channels comes from this.
⬇ arg: dropout_p=0.15 = Mild dropout. Sensors are correlated, so heavy dropout would zero useful signal. 0.15 is the §8.4 tuned value.
⬆ self.cnn = CNNFrontend instance. Forward shape: (B, T, c_in) → (B, T, 64). ~53k learnable params.
25self.lstm = BiLSTMEncoder(input_size=64, hidden_size=lstm_hidden,

Stage 2 - construction of the BiLSTM. The line continues onto line 26. input_size=64 must match what the CNN frontend OUTPUTS (its last conv has 64 channels). hidden_size is forwarded from the constructor (default 256 → 2*256 = 512 after bidir concat).

EXECUTION STATE
📚 BiLSTMEncoder(input_size, hidden_size, num_layers, dropout_p) = Wraps nn.LSTM(bidirectional=True, batch_first=True). The forward pass returns only the per-cycle output sequence (drops the (h_n, c_n) tuple).
⬇ arg: input_size=64 = Channel count of the LSTM's input. MUST equal the CNN frontend's output channels (64) - this is where shape mismatches between blocks usually break.
⬇ arg: hidden_size=lstm_hidden = Hidden state width PER DIRECTION. Default 256. After bidirectional concat the per-cycle vector becomes 2*256 = 512.
26 num_layers=2, dropout_p=0.3)

Continuation of line 25. num_layers=2 stacks two LSTM layers (output of layer 1 feeds layer 2 in both directions). dropout_p=0.3 applies dropout BETWEEN layers (PyTorch nn.LSTM only applies dropout between layers, never on the final output).

EXECUTION STATE
⬇ arg: num_layers=2 = Stacked LSTM depth. Layer 1's output is the input to layer 2. Going from 1 to 2 layers more than doubles the parameter count (layer 2 reads the 512-D bidirectional output rather than the 64-D CNN features) and adds capacity for temporal modelling; 3+ layers gain little on C-MAPSS-sized data.
⬇ arg: dropout_p=0.3 = Heavier dropout than the CNN. LSTM hidden states leak information through the chain, so heavier regularisation helps generalisation. PyTorch caveat: with num_layers=1 this dropout has no effect, and nn.LSTM emits a UserWarning saying so.
⬆ self.lstm = BiLSTMEncoder instance. Forward shape: (B, T, 64) → (B, T, 512). ~2.14M params - the dominant cost in the model.
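To see the channel doubling concretely, here is a small shape check. It uses a plain `nn.LSTM` as a stand-in for the chapter's BiLSTMEncoder, which is an assumption; the real wrapper from §9 may differ in detail and returns only the output sequence.

```python
import torch
import torch.nn as nn

# Plain nn.LSTM stand-in for BiLSTMEncoder (an assumption, see lead-in).
lstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=2,
               dropout=0.3, bidirectional=True, batch_first=True)

h = torch.randn(4, 30, 64)      # (B, T, 64) as produced by the CNN frontend
out, (h_n, c_n) = lstm(h)       # out holds the per-cycle features

out_shape = tuple(out.shape)    # (4, 30, 512): forward + backward concatenated
h_n_shape = tuple(h_n.shape)    # (4, 4, 256): num_layers * 2 directions come first
print(out_shape, h_n_shape)
```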
27self.attn = AttentionBlock(d_model=2 * lstm_hidden,

Stage 3 - constructs the multi-head attention. d_model=2*lstm_hidden = 512. The expression '2 * lstm_hidden' is critical: forgetting the 2 (Pitfall 1 below) makes the block expect 256 channels while the BiLSTM delivers 512, and the first forward pass fails with a shape error.

EXECUTION STATE
📚 AttentionBlock(d_model, num_heads, dropout_p) = Wraps nn.MultiheadAttention(batch_first=True) with a residual + LayerNorm 'add & norm' wrapper.
⬇ arg: d_model = 2 * lstm_hidden = 512 = Expected channel dimension of the input. MUST equal the BiLSTM's output (2 * 256 = 512). Passing half of that (256) makes nn.MultiheadAttention reject the 512-channel input at runtime.
→ why 2 * lstm_hidden = Bidirectional LSTM concatenates forward and backward hidden states; the channel axis doubles. Hardcoding 256 here is the #1 model-assembly bug - see Pitfall 1 below.
28 num_heads=num_heads, dropout_p=0.1)

Continuation of line 27. num_heads forwards the constructor arg (default 8). dropout_p=0.1 is the lightest dropout in the model - attention is already a strong regulariser via averaging across positions.

EXECUTION STATE
⬇ arg: num_heads=num_heads = Default 8. Constraint: d_model % num_heads == 0. With d_model=512: 512/8 = 64 = d_k per head.
⬇ arg: dropout_p=0.1 = Applied INSIDE attention (on the post-softmax weights). Lighter than other blocks because attention's averaging already regularises.
⬆ self.attn = AttentionBlock instance. Forward shape: (B, T, 512) → (B, T, 512). ~1.05M params.
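A quick way to check the d_model constraint is to feed a 512-channel tensor into attention layers built with the right and the wrong width. This is a sketch using bare `nn.MultiheadAttention` rather than the chapter's AttentionBlock; the helper `try_attention` is hypothetical.

```python
import torch
import torch.nn as nn

h = torch.randn(4, 30, 512)  # BiLSTM output: 2 * 256 channels per cycle


def try_attention(d_model: int) -> bool:
    """Return True if self-attention built with this d_model accepts h."""
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
    try:
        out, _ = attn(h, h, h)
    except Exception as err:  # the exact exception type may vary by version
        print(f"d_model={d_model}: {type(err).__name__}")
        return False
    print(f"d_model={d_model}: ok, output {tuple(out.shape)}")
    return True


ok_512 = try_attention(512)   # matches the BiLSTM output
ok_256 = try_attention(256)   # the "forgot the 2" bug
```

In current PyTorch the mismatch surfaces as an error on the first forward call, so a one-batch smoke test is enough to catch it.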
29self.funnel = FCFunnel(d_model=2 * lstm_hidden,

Stage 4 - constructs the bottleneck. d_model=512 again because the funnel reads the attention block's output. Continues onto lines 30-31.

EXECUTION STATE
📚 FCFunnel(d_model, hidden, out_dim, dropout_p) = Class constructor. Internal architecture: last-cycle pool → Linear(d_model, hidden[0]) → ReLU → Dropout → Linear(hidden[0], hidden[1]) → ReLU → Dropout → Linear(hidden[1], out_dim).
⬇ arg: d_model = 2 * lstm_hidden = 512 = Same constant as AttentionBlock. Both consume the BiLSTM's 512-D per-cycle output.
30 hidden=(256, 64), out_dim=shared_dim,

Continuation. hidden=(256, 64) defines the two intermediate widths in the funnel: 512→256→64. out_dim=shared_dim=32 is the FINAL output width - the size of the shared z vector both heads will read.

EXECUTION STATE
⬇ arg: hidden=(256, 64) = Tuple of intermediate Linear widths. Two ints → two intermediate layers. Each step halves-or-more the dimension; this 'funnel' shape compresses representation gradually.
→ why a tuple, not two args = Lets you change funnel depth without changing the constructor signature. hidden=(256, 64, 32) would add a third layer; hidden=(256,) would have only one.
⬇ arg: out_dim=shared_dim = Forwards the constructor arg (default 32). This is THE shared-representation width - changing it forces matching changes in both heads' in_dim.
31 dropout_p=0.3)

Continuation. dropout_p=0.3 - same heavy rate as the LSTM. The funnel is right before the heads, so regularising hard here prevents the heads from memorising training-set quirks of z.

EXECUTION STATE
⬇ arg: dropout_p=0.3 = Applied inside the funnel between Linear+ReLU pairs. Together with the LSTM's 0.3 these are the model's two main regularisation knobs.
⬆ self.funnel = FCFunnel instance. Forward shape: (B, T, 512) → (B, 32). The time axis is GONE after this call. ~148k params.
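The last-cycle pool is just an index on the time axis. A minimal sketch, assuming the pool is implemented as `h[:, -1, :]` as described for FCFunnel:

```python
import torch

h = torch.randn(4, 30, 512)        # attention output: (B, T, d_model)
last = h[:, -1, :]                 # keep only the final cycle of each window
pooled_shape = tuple(last.shape)   # (4, 512): the time axis is gone
same_values = torch.equal(last, h[:, 29, :])  # index -1 is cycle 29 when T = 30
print(pooled_shape, same_values)
```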
32self.rul_head = RULHead(in_dim=shared_dim)

Task head 1 - regression. Reads the 32-D shared z and emits a single non-negative scalar per engine. in_dim must match the funnel's out_dim - hence both reference the same shared_dim constructor arg.

EXECUTION STATE
📚 RULHead(in_dim) = 4-layer MLP 32→64→32→16→1 with ReLU on every hidden layer AND on the output (so rul ≥ 0).
⬇ arg: in_dim=shared_dim = Forwards 32. Couples the head's input to the funnel's output - changing one without the other raises a shape error in forward.
⬆ self.rul_head = RULHead instance. Forward shape: (B, 32) → (B,). 4,737 params.
33self.health_head = HealthHead(in_dim=shared_dim, num_classes=num_classes)

Task head 2 - classification. Reads THE SAME 32-D z and emits raw logits over num_classes. No softmax inside - cross_entropy loss applies log_softmax internally for numerical stability.

EXECUTION STATE
📚 HealthHead(in_dim, num_classes) = 3-layer MLP 32→32→16→num_classes. ReLU between hidden layers, NO activation on output.
⬇ arg: in_dim=shared_dim = 32 - couples to the funnel's out_dim, just like RULHead.
⬇ arg: num_classes=num_classes = Forwards 3. Becomes the output dimension of the final Linear layer.
⬆ self.health_head = HealthHead instance. Forward shape: (B, 32) → (B, 3). 1,635 params.
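The two head parameter counts can be checked by hand. The sketch below assumes each head is a plain stack of Linear layers with biases and no other learnable layers; the helper `mlp_param_count` is hypothetical.

```python
def mlp_param_count(widths):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


rul_params = mlp_param_count([32, 64, 32, 16, 1])     # RULHead widths
health_params = mlp_param_count([32, 32, 16, 3])      # HealthHead widths
print(rul_params, health_params)   # 4737 1635
```

Both results match the counts quoted for the heads above.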
35def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:

Defines what happens when you call model(x). PyTorch routes the call through nn.Module.__call__, which sets up hooks then calls forward. The type hint 'tuple[torch.Tensor, torch.Tensor]' tells static analysers that two tensors come back; it has no runtime effect.

EXECUTION STATE
⬇ input: self = The instance. Lets us reach the sub-modules registered in __init__: self.cnn, self.lstm, etc.
⬇ input: x (B, 30, c_in) torch.Tensor = Mini-batch of normalised windows on whichever device the model lives on (CPU or CUDA). dtype usually float32; requires_grad set by the caller.
→ tuple[torch.Tensor, torch.Tensor] return type = Python 3.9+ generic tuple syntax. Annotates that the return is a 2-tuple of Tensors (not a list, not a dict). Pure documentation - doesn't enforce at runtime.
⬆ returns = (rul, logits) where rul: (B,) and logits: (B, num_classes). Order matches the NumPy version above.
36# x: (B, T, c_in)

Inline reminder of the input shape contract. (B, T, c_in) = (batch, time, input channels). Reading this BEFORE running the forward in your head saves a debugging session.

37h = self.cnn(x) # (B, T, 64)

Stage 1 forward. self.cnn(x) invokes CNNFrontend.__call__(x) which wires forward + autograd hooks. The trailing comment '# (B, T, 64)' is the SHAPE of h after this call - kept inline so reading the forward top-to-bottom shows the shape evolution.

EXECUTION STATE
📚 self.cnn(x) = Equivalent to self.cnn.forward(x), but routed via nn.Module.__call__ so PyTorch can register hooks (forward_pre, forward, backward).
⬇ x (B, 30, 14) = The forward arg. Tracked by autograd because the smoke test below sets requires_grad=True on x.
⬆ h (B, 30, 64) = Per-cycle 64-D mixed-sensor representation. Autograd graph now has CNN ops appended.
38h = self.lstm(h) # (B, T, 2*hidden)

Stage 2. h is reassigned (memory of previous h is reclaimed once autograd is done with it). Channel axis goes 64 → 512 (= 2 * 256, bidirectional concat).

EXECUTION STATE
⬇ h (B, 30, 64) = CNN output from line 37.
⬆ h (B, 30, 512) = BiLSTM output. 2*hidden = 2*256 = 512 channels per cycle. Each cycle now encodes both past and future context within the window.
39h = self.attn(h) # (B, T, 2*hidden)

Stage 3. SHAPE-PRESERVING - input and output are both (B, 30, 512). The block adds direct cycle-to-cycle communication on top of the BiLSTM's sequential summary, but does not change the tensor shape (residual + LayerNorm guarantee that).

EXECUTION STATE
⬇ h (B, 30, 512) = BiLSTM output from line 38. Used as Q, K, AND V inside the attention block (self-attention).
⬆ h (B, 30, 512) = Attention output. Same shape, but every cycle now has a direct view of every other cycle.
40z = self.funnel(h) # (B, shared_dim) <-- shared

Stage 4. Variable name CHANGES from h to z because semantics change: h was per-cycle, z is per-engine. The funnel internally pools the time axis (last cycle only) then funnels 512 → 256 → 64 → 32. The arrow '<-- shared' in the comment flags the contract for both heads.

EXECUTION STATE
⬇ h (B, 30, 512) = Attention output. The funnel's last-cycle pool keeps only h[:, -1, :].
⬆ z (B, 32) ← SHARED = 32-D representation per engine. Same Python tensor object will be passed to BOTH heads on the next two lines - that is the multi-task sharing in concrete terms.
→ autograd consequence = Because z is consumed by both heads, backward through (loss_rul + loss_hs) sums BOTH heads' gradients at z before propagating through the funnel/attn/lstm/cnn. (z is a non-leaf tensor, so z.grad is only stored if you call z.retain_grad() first.) This is why both losses 'pull' on the trunk.
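The claim that both heads' gradients meet at z can be verified on a toy model. The layers below are hypothetical stand-ins (one Linear for the whole trunk, one per head) and the two losses are arbitrary scalars chosen only to produce gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trunk = nn.Linear(8, 4)        # stand-in for CNN -> BiLSTM -> attention -> funnel
head_a = nn.Linear(4, 1)       # stand-in for the RUL head
head_b = nn.Linear(4, 3)       # stand-in for the health head

z = trunk(torch.randn(5, 8))   # shared representation, a non-leaf tensor
loss_a = head_a(z).pow(2).mean()
loss_b = head_b(z).logsumexp(dim=-1).mean()

# Gradient of each loss with respect to z, computed separately.
(g_a,) = torch.autograd.grad(loss_a, z, retain_graph=True)
(g_b,) = torch.autograd.grad(loss_b, z, retain_graph=True)
# Gradient of the summed loss with respect to z.
(g_sum,) = torch.autograd.grad(loss_a + loss_b, z)

grads_add_up = torch.allclose(g_sum, g_a + g_b, atol=1e-6)
print(grads_add_up)
```

Because the gradient at z is a plain sum, whichever head produces the larger gradient dominates what the trunk sees.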
41rul = self.rul_head(z) # (B,)

Task head 1 forward. (B, 32) z in, (B,) non-negative scalars out. The trailing ReLU on rul_head's last layer guarantees rul ≥ 0.

EXECUTION STATE
⬇ z (B, 32) = The shared representation.
⬆ rul (B,) = Per-engine RUL prediction in cycles. Always ≥ 0. The head's final Linear(.., 1) yields shape (B, 1); PyTorch does not drop that trailing dimension automatically, so the head has to squeeze or flatten it itself to return (B,).
42logits = self.health_head(z) # (B, num_classes)

Task head 2 forward. Reads the EXACT SAME z that line 41 just consumed. Outputs raw logits - no softmax. The variable name 'logits' (not 'probs') is a convention reminder that softmax has not been applied.

EXECUTION STATE
⬇ z (B, 32) = Same Python object as the z passed to rul_head. PyTorch tracks this so backward sums gradients from both heads through z.
⬆ logits (B, 3) = Raw class scores ∈ ℝ. Apply F.softmax(logits, dim=-1) for probabilities or F.cross_entropy(logits, targets) for the loss directly.
43return rul, logits

Returns BOTH outputs as a tuple. The order is part of the API contract - downstream code unpacks 'rul, logits = model(x)' positionally. Pitfall 3 below explains why we return a tuple, not a dict.

EXECUTION STATE
⬆ return: (rul, logits) = tuple( Tensor (B,), Tensor (B, num_classes) ). Both are still attached to the autograd graph - calling .backward() on a loss derived from either will flow gradients all the way to x.
46# ---------- Smoke test ----------

Section divider. Below is a tiny end-to-end exercise: build the model, fake some data, run forward + backward, print shapes and parameter count. Catches wiring errors in milliseconds.

47torch.manual_seed(0)

Sets PyTorch's global RNG seed so torch.randn / torch.randint / Linear weight init are reproducible. torch.manual_seed seeds the CUDA generators as well as the CPU one; bit-for-bit reproducibility on GPU additionally depends on deterministic kernels.

EXECUTION STATE
📚 torch.manual_seed(seed) = Initialises the global RNG state. Affects torch.randn, torch.rand, torch.randint AND every Module's parameter init below this line.
⬇ arg: 0 = Any int seed; '0' is convention. Same seed → same x → same model weights → same losses. Critical for debugging.
48model = DualTaskModel(c_in=14, lstm_hidden=256, num_heads=8,

Instantiates the model with the paper's reference hyperparameters. The line continues onto line 49. After this call, model.parameters() iterates over the parameter tensors that together hold ~3.4M learnable scalars.

EXECUTION STATE
⬇ arg: c_in=14 = C-MAPSS sensor count after pruning.
⬇ arg: lstm_hidden=256 = Per-direction LSTM hidden size (paper default).
⬇ arg: num_heads=8 = Multi-head attention heads. 512 / 8 = 64 d_k.
49 shared_dim=32, num_classes=3)

Continuation of line 48. shared_dim=32 = the z width. num_classes=3 = healthy / degrading / critical. After this line model is fully constructed.

EXECUTION STATE
⬆ model = DualTaskModel instance. ~3,400,388 learnable params total. .train() / .eval() / .to(device) / .state_dict() all available.
51x = torch.randn(4, 30, 14, requires_grad=True)

Builds a fake input batch of 4 engines × 30 cycles × 14 sensors, sampled from N(0, 1). requires_grad=True attaches it to the autograd graph - .backward() will populate x.grad. (Real training data does NOT need requires_grad - we only set it here so backward has a leaf to flow gradients into for inspection.)

EXECUTION STATE
📚 torch.randn(*shape, requires_grad=False) = Returns a Tensor of the given shape filled with N(0, 1) samples. By default requires_grad=False; we override to True for this smoke test.
⬇ args: 4, 30, 14 = Three positional dims → shape (4, 30, 14). 4 engines, 30 cycles, 14 sensors.
⬇ arg: requires_grad=True = Tells autograd to track operations on x and accumulate gradients into x.grad on backward(). Without this, x is a leaf with no gradient.
⬆ x = shape (4, 30, 14), dtype float32, device CPU, requires_grad=True
52y_rul = torch.randint(0, 126, (4,)).float() # capped RUL

Fake RUL targets. torch.randint(low, high, size) returns int64; the .float() cast makes them float32 because F.mse_loss expects matching dtypes for input and target. The 126 upper bound matches the paper's piecewise-linear RUL cap at 125.

EXECUTION STATE
📚 torch.randint(low, high, size) = Returns a Tensor of int64 sampled uniformly from [low, high). Note: high is EXCLUSIVE - randint(0, 126, ...) gives 0..125.
⬇ args: 0, 126, (4,) = low=0, high=126 (exclusive), shape=(4,). One scalar per engine.
📚 .float() = Cast to torch.float32. F.mse_loss requires input.dtype == target.dtype; rul_head outputs float32, so y_rul must too.
⬆ y_rul = shape (4,), dtype float32, values ∈ [0.0, 125.0]
53y_hs = torch.tensor([0, 1, 2, 1]) # class indices

Fake classification targets - one int per engine giving the true class. F.cross_entropy expects CLASS INDICES as int64, NOT one-hot vectors. Values must be in [0, num_classes).

EXECUTION STATE
📚 torch.tensor(data) = Constructs a Tensor from a Python list/tuple. Infers dtype from the data: ints → int64, floats → float32.
⬇ data: [0, 1, 2, 1] = Class indices for the 4 engines: healthy, degrading, critical, degrading.
⬆ y_hs = shape (4,), dtype int64, values ∈ {0, 1, 2}
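→ why class indices, not one-hot = Indices are memory-efficient (no need to materialise an N×C matrix). Since PyTorch 1.10, F.cross_entropy also accepts a FLOAT (N, C) target and treats it as class probabilities, so a float one-hot matrix works; an int64 one-hot matrix of shape (N, C) is rejected.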
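The two target formats can be compared directly. This sketch assumes PyTorch 1.10 or newer, where a float target with the same shape as the logits is interpreted as class probabilities.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)
y_idx = torch.tensor([0, 1, 2, 1])                 # class indices, int64

loss_idx = F.cross_entropy(logits, y_idx)

# A float one-hot matrix is treated as class probabilities (PyTorch >= 1.10).
y_onehot = F.one_hot(y_idx, num_classes=3).float()
loss_onehot = F.cross_entropy(logits, y_onehot)

same_loss = torch.allclose(loss_idx, loss_onehot)
print(loss_idx.item(), loss_onehot.item(), same_loss)
```

Both calls give the same loss, so index targets are the cheaper way to express hard labels.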
55rul, logits = model(x)

ONE forward call returns BOTH outputs. model(x) is sugar for model.__call__(x) which calls model.forward(x) plus hook handling. Tuple unpacking on the LHS splits the (rul, logits) return into two named tensors.

EXECUTION STATE
⬇ x (4, 30, 14) = The smoke-test batch, requires_grad=True.
⬆ rul (4,) = Predicted RULs. Will be near 0 because weights are random and the head's ReLU clips negatives to 0.
⬆ logits (4, 3) = Random logits ~ N(0, small variance). argmax gives near-uniform class predictions.
57loss_rul = F.mse_loss(rul, y_rul) # regression

Computes mean squared error between predicted and target RUL. F.mse_loss is the functional form - stateless, takes (input, target). Default reduction is 'mean' so the result is a single scalar.

EXECUTION STATE
📚 F.mse_loss(input, target, reduction='mean') = Computes ((input - target) ** 2). Then averages (reduction='mean'), sums (reduction='sum'), or returns the per-element tensor (reduction='none').
⬇ arg 1: rul (4,) = Model predictions. Must have the same shape as target (broadcasting works but is rarely intended).
⬇ arg 2: y_rul (4,) = Ground truth from line 52. Same dtype (float32) as input - mismatched dtypes raise here.
⬆ loss_rul = 0-D tensor (scalar) holding the mean squared error. Attached to the autograd graph.
→ AMNL/GRACE preview = Plain MSE punishes early-life and end-of-life errors equally. Ch 14 swaps this for a weighted MSE that emphasises the dangerous near-failure cycles.
58loss_hs = F.cross_entropy(logits, y_hs) # classification

Standard classification loss. Internally: log_softmax(logits, dim=-1) then NLL with y_hs as class indices. Fused for numerical stability - never compute softmax then log yourself, you'll lose precision.

EXECUTION STATE
📚 F.cross_entropy(input, target, weight=None, reduction='mean') = Combines log_softmax + NLL loss. input shape (B, C) raw logits; target shape (B,) class indices ∈ [0, C).
⬇ arg 1: logits (4, 3) = Raw scores from the head - NOT softmax probabilities. Cross-entropy applies softmax internally.
⬇ arg 2: y_hs (4,) = Class indices ∈ {0, 1, 2}, dtype int64. NOT one-hot.
⬆ loss_hs = 0-D tensor. Attached to the autograd graph through logits → health_head → z → trunk.
59loss = loss_rul + 0.1 * loss_hs # placeholder weights

Naive sum of the two task losses with a fixed weight 0.1 on the auxiliary classification loss. This is the BASELINE every multi-task paper improves on; Chapter 12 shows that the weight which actually balances the tasks can vary by 100× across runs, so a fixed 0.1 is only a guess - the central problem AMNL/GABA/GRACE solve.

EXECUTION STATE
⬇ loss_rul = Regression loss scalar. Typical magnitude on C-MAPSS: 1000s (squared cycle errors).
⬇ 0.1 * loss_hs = Auxiliary classification loss. Cross-entropy is typically O(1), so 0.1× scales it to ~0.1 - much smaller than loss_rul. This unbalanced scale is exactly the problem.
⬆ loss = 0-D tensor = loss_rul + 0.1 * loss_hs. Single scalar that .backward() can be called on.
→ why 0.1 is the wrong answer = Ch 12 demos: optimal aux weight depends on training step, gradient norms, and batch composition. A static 0.1 is almost always either too small (aux signal vanishes) or too large (aux dominates and degrades RUL).
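To get a feel for the scale gap, plug in the example loss values that the smoke test prints later in this walkthrough (4123.7 and 1.13). The numbers are illustrative, not measurements from a trained model.

```python
loss_rul = 4123.7     # example MSE from the smoke-test output
loss_hs = 1.13        # example cross-entropy from the smoke-test output
aux_weight = 0.1

weighted_hs = aux_weight * loss_hs
total = loss_rul + weighted_hs
hs_share = weighted_hs / total
print(f"classification share of total loss: {hs_share:.4%}")
```

The classification term is a few thousandths of a percent of the scalar loss. Note that this is a share of the loss VALUE only; the size of each task's gradient on the trunk is a separate quantity and is what Chapter 12 measures.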
60loss.backward()

Triggers reverse-mode automatic differentiation. PyTorch traverses the autograd graph backwards from 'loss', accumulating gradients into every leaf tensor with requires_grad=True - that means EVERY model.parameters() tensor AND x. After this call you can read .grad on each one.

EXECUTION STATE
📚 Tensor.backward(gradient=None) = Walks the autograd graph from the calling tensor back to its leaves, applying the chain rule. For 0-D scalars (like loss), gradient defaults to 1.0.
→ side effect on parameters = Every p in model.parameters() now has p.grad populated (or accumulated, if .grad already had a value). The optimiser would call optimizer.step() next to apply these gradients.
→ side effect on x = Because x.requires_grad=True, x.grad is now a (4, 30, 14) tensor of ∂loss/∂x. Useful for adversarial robustness analysis or input attribution.
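The "populated or accumulated" behaviour is worth seeing once. A minimal sketch on a single hypothetical Linear layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
x = torch.randn(2, 3)

layer(x).sum().backward()
first = layer.weight.grad.clone()

layer(x).sum().backward()              # no zero_grad in between
accumulated = torch.allclose(layer.weight.grad, 2 * first)

layer.zero_grad()                      # reset before the next step
# Depending on the PyTorch version, zero_grad sets .grad to None or to zeros.
cleared = layer.weight.grad is None or not bool(layer.weight.grad.any())
print(accumulated, cleared)
```

This is why a training loop calls optimizer.zero_grad() (or model.zero_grad()) once per step.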
62print("input :", tuple(x.shape)) # (4, 30, 14)

Prints the input shape. tuple(x.shape) converts torch.Size (which prints as 'torch.Size([4, 30, 14])') into a plain Python tuple ('(4, 30, 14)') for cleaner output.

EXECUTION STATE
📚 Tensor.shape = Attribute returning a torch.Size object (subclass of tuple). Repr is 'torch.Size([...])'; tuple(...) gives plain '(...)'.
⬆ Output = input : (4, 30, 14)
63print("rul shape :", tuple(rul.shape)) # (4,)

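Confirms regression head returned ONE scalar per engine. Crucial sanity check - if you ever see (4, 1), squeeze before MSE: against a (4,) target the difference broadcasts to (4, 4), so the loss averages 16 mismatched pairs instead of 4 matched ones. PyTorch only emits a UserWarning about the size mismatch.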

EXECUTION STATE
⬆ Output = rul shape : (4,)
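The broadcasting trap can be shown with predictions that are exactly right. The values below are made up for the illustration; the squared error is computed by hand so the shapes stay visible.

```python
import torch

target = torch.tensor([10., 20., 30., 40.])   # (B,)
pred = target.unsqueeze(1)                    # perfect predictions, but shape (B, 1)

diff_bad = pred - target                      # broadcasts to (4, 4)
diff_ok = pred.squeeze(-1) - target           # stays (4,)
print(tuple(diff_bad.shape), tuple(diff_ok.shape))   # (4, 4) (4,)

mse_bad = diff_bad.pow(2).mean().item()       # 250.0: every prediction vs every target
mse_ok = diff_ok.pow(2).mean().item()         # 0.0: matched pairs only
print(mse_bad, mse_ok)
```

A perfect model reports a loss of 250 in the unsqueezed case, which is why the (4,) shape check matters.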
64print("logits shape :", tuple(logits.shape)) # (4, 3)

Confirms classification head returned 3 raw scores per engine. Rows are engines, columns are classes.

EXECUTION STATE
⬆ Output = logits shape : (4, 3)
65print("loss_rul :", round(loss_rul.item(), 2))

.item() extracts a 0-D tensor's value as a plain Python float (only works on 0-D / scalar tensors). round(x, 2) keeps two decimals for readable output.

EXECUTION STATE
📚 Tensor.item() = Returns the underlying Python scalar. Valid only for tensors with exactly one element (0-D, or shapes such as (1,) and (1, 1)); multi-element tensors raise an error.
⬆ Output (example) = loss_rul : 4123.7
66print("loss_hs :", round(loss_hs.item(), 2))

Same pattern. Cross-entropy on 3 classes with random logits is ~ln(3) ≈ 1.10 - if your initial loss is far from that, your initialisation is broken.

EXECUTION STATE
→ sanity: ln(num_classes) = Random untrained classifier achieves loss ≈ ln(C). For C=3: ln(3) ≈ 1.0986. Initial loss far from this signals init / scaling bugs.
⬆ Output (example) = loss_hs : 1.13
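The ln(C) baseline follows directly from the definition of cross-entropy: with equal logits every class gets probability 1/C. A small pure-Python check, with a hypothetical single-sample helper:

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)
print(round(uniform_loss, 4), round(math.log(3), 4))   # 1.0986 1.0986
```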
67print("# params :",

Start of the parameter-count print (continues onto line 68). Splitting across two lines keeps the f-string under PEP-8's 79-char limit.

68 f"{sum(p.numel() for p in model.parameters()):,}") # ~3.4M

Counts total learnable parameters. model.parameters() yields every nn.Parameter tracked by the model; .numel() on each gives the total element count; sum() folds them. The ',' format spec inserts thousands separators.

EXECUTION STATE
📚 model.parameters() = Iterator over every nn.Parameter tracked by the model and its sub-modules. Used both for printing AND for passing into the optimiser: optim.Adam(model.parameters(), lr=1e-3).
📚 Tensor.numel() = 'Number of elements'. For a Tensor of shape (256, 64) → 16384. For shape (3,) → 3.
📚 sum(generator) = Python builtin. Iterates the generator (one .numel() per parameter) and adds them up.
→ f"{n:,}" format spec = Python format mini-language. ':,' inserts thousand separators: 3400388 → '3,400,388'. No need for locale or commas-by-hand.
⬆ Output = # params : 3,400,388
→ breakdown = CNN 53,184 (1.6%) + BiLSTM 2,140,160 (63%) + Attention 1,052,672 (31%) + Funnel 148,000 (4.4%) + RULHead 4,737 (0.14%) + HealthHead 1,635 (0.05%) ≈ 3.40M
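The breakdown can be re-added and converted to percentages in a few lines. The per-block counts are copied from the breakdown above; only the dict layout is new.

```python
breakdown = {
    "CNN": 53_184, "BiLSTM": 2_140_160, "Attention": 1_052_672,
    "Funnel": 148_000, "RULHead": 4_737, "HealthHead": 1_635,
}
total = sum(breakdown.values())
for name, n in breakdown.items():
    print(f"{name:<11}{n:>10,}  {n / total:6.2%}")
print(f"{'total':<11}{total:>10,}")

heads_share = (breakdown["RULHead"] + breakdown["HealthHead"]) / total
```

On a real model instance, the same table can be produced by summing `p.numel()` over `module.parameters()` for each `(name, module)` pair in `model.named_children()`.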
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5# Components from previous sections
6from cnn_frontend     import CNNFrontend       # 8.4
7from bilstm_encoder   import BiLSTMEncoder     # 9.4
8from attention_block  import AttentionBlock    # 10.4
9from fc_funnel        import FCFunnel          # 11.1
10from rul_head         import RULHead           # 11.2
11from health_head      import HealthHead        # 11.3
12
13
14class DualTaskModel(nn.Module):
15    """End-to-end model: CNN -> BiLSTM -> Attention -> FC funnel -> (RUL, Health)."""
16
17    def __init__(self,
18                 c_in:        int = 14,
19                 lstm_hidden: int = 256,
20                 num_heads:   int = 8,
21                 shared_dim:  int = 32,
22                 num_classes: int = 3):
23        super().__init__()
24        self.cnn      = CNNFrontend(c_in=c_in, dropout_p=0.15)
25        self.lstm     = BiLSTMEncoder(input_size=64, hidden_size=lstm_hidden,
26                                       num_layers=2, dropout_p=0.3)
27        self.attn     = AttentionBlock(d_model=2 * lstm_hidden,
28                                        num_heads=num_heads, dropout_p=0.1)
29        self.funnel   = FCFunnel(d_model=2 * lstm_hidden,
30                                  hidden=(256, 64), out_dim=shared_dim,
31                                  dropout_p=0.3)
32        self.rul_head    = RULHead(in_dim=shared_dim)
33        self.health_head = HealthHead(in_dim=shared_dim, num_classes=num_classes)
34
35    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
36        # x: (B, T, c_in)
37        h = self.cnn(x)                         # (B, T, 64)
38        h = self.lstm(h)                        # (B, T, 2*hidden)
39        h = self.attn(h)                        # (B, T, 2*hidden)
40        z = self.funnel(h)                      # (B, shared_dim)   <-- shared
41        rul    = self.rul_head(z)               # (B,)
42        logits = self.health_head(z)            # (B, num_classes)
43        return rul, logits
44
45
46# ---------- Smoke test ----------
47torch.manual_seed(0)
48model = DualTaskModel(c_in=14, lstm_hidden=256, num_heads=8,
49                      shared_dim=32, num_classes=3)
50
51x      = torch.randn(4, 30, 14, requires_grad=True)
52y_rul  = torch.randint(0, 126, (4,)).float()           # capped RUL
53y_hs   = torch.tensor([0, 1, 2, 1])                    # class indices
54
55rul, logits = model(x)
56
57loss_rul = F.mse_loss(rul, y_rul)                      # regression
58loss_hs  = F.cross_entropy(logits, y_hs)               # classification
59loss     = loss_rul + 0.1 * loss_hs                    # placeholder weights
60loss.backward()
61
62print("input        :", tuple(x.shape))                # (4, 30, 14)
63print("rul shape    :", tuple(rul.shape))              # (4,)
64print("logits shape :", tuple(logits.shape))           # (4, 3)
65print("loss_rul     :", round(loss_rul.item(), 2))
66print("loss_hs      :", round(loss_hs.item(), 2))
67print("# params     :",
68      f"{sum(p.numel() for p in model.parameters()):,}")  # ~3.4M

Same Backbone, Different Sensors

Only $c_{in}$ and the class boundaries change. The shared trunk and heads are domain-agnostic.

| Domain | c_in | Window T | Health classes |
|---|---|---|---|
| C-MAPSS turbofan | 14 | 30 | healthy / degrading / critical |
| N-CMAPSS DS02 | 20 | 30 | same |
| PRONOSTIA bearings | 2 | 60 | OK / inner-race / outer-race |
| Wind-turbine SCADA | 12 | 144 (24 h) | OK / wear / failure |
| Battery cycling (Severson) | 5 | 100 | fresh / mid-life / EOL |
| Disk SMART (Backblaze) | 16 | 30 days | OK / pre-fail / fail-soon |
Re-using checkpoints. The trunk weights transfer surprisingly well across domains. The book's Chapter 24 ablation shows ~30% faster convergence on N-CMAPSS DS02 when initialised from C-MAPSS pre-training - even though sensor counts differ.
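One way to keep the per-domain overrides in one place is a small lookup table. The sketch below copies values from the domain table above; the `DOMAINS` dict and the `model_kwargs` helper are hypothetical and not part of the book's code.

```python
# Values copied from the domain table; the dict layout itself is hypothetical.
DOMAINS = {
    "cmapss":    {"c_in": 14, "classes": ("healthy", "degrading", "critical")},
    "pronostia": {"c_in": 2,  "classes": ("OK", "inner-race", "outer-race")},
    "scada":     {"c_in": 12, "classes": ("OK", "wear", "failure")},
    "battery":   {"c_in": 5,  "classes": ("fresh", "mid-life", "EOL")},
}


def model_kwargs(domain: str) -> dict:
    """Constructor overrides for DualTaskModel; other defaults stay untouched."""
    cfg = DOMAINS[domain]
    return {"c_in": cfg["c_in"], "num_classes": len(cfg["classes"])}


print(model_kwargs("pronostia"))   # {'c_in': 2, 'num_classes': 3}
```

The result would be passed as `DualTaskModel(**model_kwargs("pronostia"))`; the window length T is a property of the data pipeline and does not appear in the constructor.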

Three Assembly Pitfalls

Pitfall 1: Mismatched d_model after BiLSTM. BiLSTM doubles the channel axis (forward + backward). If you wire $d_{model} = 256$ (the unidirectional size) into the attention block instead of $2 \times 256 = 512$, the first forward pass fails with a shape error, because nn.MultiheadAttention checks the input's last dimension against its embed_dim. Always use 2 * lstm_hidden.
Pitfall 2: Detaching the shared vector. Calling z.detach() before the health head accidentally cuts the classification gradient out of the backbone. The auxiliary signal then does nothing - and you will spend a week wondering why classification accuracy does not help RUL like the paper claims. Never detach inside the model.
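Pitfall 3: Returning a dict instead of a tuple. return {"rul": rul, "hs": logits} looks cleaner, but it changes the positional API that the rest of the book unpacks, and some tooling needs extra care with dict outputs (torch.jit.trace, for example, only accepts them with strict=False). Stick with the return rul, logits tuple - it is what the rest of the book assumes.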
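Pitfall 2 can be reproduced on a toy model in a few lines. The layers are hypothetical stand-ins for the trunk and the health head, and the loss is an arbitrary scalar chosen only to produce a gradient.

```python
import torch
import torch.nn as nn


def trunk_gets_gradient(detach: bool) -> bool:
    """Run one backward pass and report whether the trunk received a gradient."""
    torch.manual_seed(0)
    trunk = nn.Linear(8, 4)            # stand-in for the shared trunk
    head = nn.Linear(4, 3)             # stand-in for the health head
    z = trunk(torch.randn(5, 8))
    if detach:
        z = z.detach()                 # the bug: cuts the graph at z
    head(z).logsumexp(dim=-1).mean().backward()
    return trunk.weight.grad is not None


with_grad = trunk_gets_gradient(detach=False)
without_grad = trunk_gets_gradient(detach=True)
print(with_grad, without_grad)   # True False
```

The detached run raises no error, which is what makes the bug hard to spot: the head still trains, but the trunk never hears from it.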
The point. 3.4M parameters, six sub-modules, one forward call, two outputs. The architecture is now complete. Parts IV-VI will leave the architecture frozen and focus entirely on what loss to put on top of it.

Takeaway

  • One trunk, two heads. CNN → BiLSTM → Attention → FC funnel emit the 32-D shared $z$. RUL and health heads both read $z$.
  • ~3.4M parameters total. BiLSTM 63%, Attention 31%, everything else 6%. Heads are 0.2%.
  • Shape contract. Always (B, T, c_in) in, (B,) and (B, num_classes) out. Never break this.
  • Domain-agnostic trunk. Only $c_{in}$ and class boundaries change between turbofans, bearings, batteries, disks, turbines.
  • End of Part III. Architecture done. Chapters 12-13 will diagnose the gradient imbalance that motivates AMNL, GABA, and GRACE.