The Fuel Gauge Output
The RUL head is the model's “fuel gauge” - Section 1.2's framing made literal. It takes the 32-D shared feature vector and emits a single non-negative scalar: the predicted cycles-to-failure for the engine at the end of the current 30-cycle window.
32 → 64 → 32 → 16 → 1
| Layer | Shape in | Shape out | Params |
|---|---|---|---|
| Linear → ReLU | (B, 32) | (B, 64) | 2,112 |
| Linear → ReLU | (B, 64) | (B, 32) | 2,080 |
| Linear → ReLU | (B, 32) | (B, 16) | 528 |
| Linear → ReLU (non-neg) | (B, 16) | (B, 1) | 17 |
| Total RUL head | — | — | 4,737 |
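The per-layer counts in the table follow from one rule: a Linear layer with n_in inputs and n_out outputs holds n_in × n_out weights plus n_out biases. A minimal sketch of that arithmetic (the variable names are illustrative, not taken from the book's code):

```python
# Layer widths of the RUL head: 32 -> 64 -> 32 -> 16 -> 1
widths = [32, 64, 32, 16, 1]

# Each Linear layer contributes n_in * n_out weights plus n_out biases.
per_layer = [n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:])]
total = sum(per_layer)

print(per_layer)  # [2112, 2080, 528, 17]
print(total)      # 4737
```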
Note the expand-then-contract shape (32 → 64 → 32 → 16 → 1). Going up to 64 first lets the head re-mix the shared features before squeezing them down to a single scalar. Pure contraction (32 → 16 → 8 → 1) trains slightly worse on C-MAPSS in the paper's ablation.
The Non-Negative Output
RUL is “cycles to failure”. Negative RUL is physically meaningless - the engine cannot fail in “negative six cycles”. We enforce this by applying ReLU to the FINAL layer's output:
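$$\hat{y}_{\mathrm{rul}} = \max\left(0,\ \mathbf{w}_{\mathrm{out}}^{\top}\mathbf{h}_3 + b_{\mathrm{out}}\right).$$
The final ReLU and its two main alternatives compare as follows: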
| Method | Trade-off |
|---|---|
| Final ReLU (this book) | Simple; clips to [0, ∞); cheap |
| Softplus | Smoother; never exactly 0; slightly costlier |
| No constraint + clip at eval | Trains faster, but the unclipped output may be negative |
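To make the trade-offs in the table concrete, the sketch below applies each option to a handful of made-up pre-activation values. The array `z` is purely illustrative, and `np.logaddexp(0, z)` is used here as a numerically stable way to compute softplus, log(1 + exp(z)).

```python
import numpy as np

# Hypothetical outputs of the final Linear layer (illustrative values only).
z = np.array([-6.0, -0.5, 0.0, 0.5, 120.0])

relu_out = np.maximum(z, 0.0)               # final ReLU: exactly 0 whenever z <= 0
softplus_out = np.logaddexp(0.0, z)         # softplus: smooth, strictly positive
raw_out = z                                 # unconstrained output, as seen in training
clipped_eval = np.clip(raw_out, 0.0, None)  # clip applied only at evaluation time

print("relu     :", relu_out)
print("softplus :", softplus_out)
print("raw      :", raw_out)
print("clipped  :", clipped_eval)
```

ReLU and clip-at-eval give the same numbers at inference time. The difference is that the unconstrained model is trained on `raw_out`, which can be negative.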
Python: The Regression Head
NumPy is Python's numerical computing library. It provides ndarray (N-dimensional arrays) and a fast, vectorised math runtime in C. We need it here for the matrix multiply (@), element-wise ReLU (np.maximum), Kaiming init (np.random.randn, np.sqrt) and bias zeros (np.zeros). The 'np' alias is the universal convention.
Pure-NumPy reference implementation of the RUL regression head. Maps the 32-D shared feature vector through a four-layer MLP (32 → 64 → 32 → 16 → 1) with ReLU between every hidden layer and a final ReLU that clamps the output to [0, ∞). The non-negativity guarantee is the whole point of this function — RUL is 'cycles to failure' and cannot be negative.
States the input/output contract: takes a batch of 32-D shared features, returns one non-negative scalar per row.
Aliases the running activation tensor h to the input. This is the standard MLP idiom — h gets overwritten layer by layer in the loop below. No copy is made; h and shared point to the same ndarray until the first reassignment.
Iterates the four layers. zip(weights, biases) pairs each weight matrix with its bias vector; enumerate adds the index i so we can detect the last layer (i == 3 for a 4-layer net) and skip the in-loop ReLU there. The last layer's ReLU is applied after the loop instead.
Affine layer: matrix-multiply the activations by the weight matrix, then add the bias. This is one fully-connected layer's forward pass. Broadcasting handles the bias automatically — b has shape (out,) and is added to every row of h @ W which has shape (B, out).
Apply the per-layer ReLU to hidden layers only. For a 4-layer net (len(weights) = 4), the condition i < len(weights) - 1 is True for i ∈ {0, 1, 2} and False for i = 3, so the final layer skips this branch. Its ReLU is applied separately on line 11, after the trailing size-1 dim has been squeezed, and that is the step that enforces non-negativity.
Element-wise ReLU: each element becomes max(value, 0). This is THE non-linearity that makes the MLP an MLP — without it, four stacked Linears collapse mathematically into a single Linear.
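The collapse claim can be checked numerically. The sketch below uses two stacked affine layers with arbitrary random weights (the shapes and names are illustrative) and shows that, without an activation in between, they equal one affine layer with W = W1 W2 and b = b1 W2 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 32))

# Two stacked Linear layers with no activation in between.
W1, b1 = rng.standard_normal((32, 64)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 16)), rng.standard_normal(16)
stacked = (x @ W1 + b1) @ W2 + b2

# The same map written as a single Linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b
print(np.allclose(stacked, single))    # True

# With a ReLU in between, the result no longer matches that single Linear map.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, single))  # False
```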
After the loop, h has shape (B, 1) — the linear output of the final layer (i=3) WITHOUT a ReLU. We squeeze the trailing size-1 dim to get (B,), then apply the final ReLU. This is the line that guarantees the head's contract: one non-negative float per engine.
Seed NumPy's global random number generator for reproducibility. Every call to np.random.* (randn, choice, ...) below will produce the same values across runs. Critical for the smoke test to print stable numbers.
Four layer shapes for the 32 → 64 → 32 → 16 → 1 funnel. Each tuple is (in_features, out_features). Note the expand-then-contract pattern: bump up to 64 first, then squeeze down through 32 → 16 → 1.
List comprehension that builds 4 weight matrices using Kaiming (He) initialisation — the standard init for ReLU networks. Without proper init, deep ReLU nets either blow up (variance grows) or die (variance shrinks to 0).
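The blow-up versus die-out behaviour can be illustrated with a small experiment. This is a sketch under assumptions, not the book's code: it uses wider layers (512 units) and a larger batch than the RUL head so the statistics are stable, and the helper `second_moment_after` is hypothetical. It pushes N(0, 1) inputs through four Linear + ReLU layers whose weights have variance scale / fan_in, then reports the mean squared activation (1.0 at the input).

```python
import numpy as np

rng = np.random.default_rng(0)


def second_moment_after(depth: int, width: int, scale: float) -> float:
    """Return mean(h**2) after `depth` Linear+ReLU layers with Var(W) = scale / width."""
    h = rng.standard_normal((1024, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        h = np.maximum(h @ W, 0)
    return float(np.mean(h ** 2))


he = second_moment_after(depth=4, width=512, scale=2.0)     # Kaiming (He) init
small = second_moment_after(depth=4, width=512, scale=0.5)  # variance too small
large = second_moment_after(depth=4, width=512, scale=8.0)  # variance too large

print(f"He: {he:.3f}   too small: {small:.5f}   too large: {large:.1f}")
```

With scale = 2 the signal magnitude stays close to its input value, while the smaller and larger scales shrink or grow it by roughly a constant factor per layer.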
Build 4 bias vectors, all initialised to zero. Standard practice for ReLU networks: with zero biases the initial pre-activations are centred on zero, whereas large negative biases could leave units dead (outputting 0 for every input) before training begins.
Build a fake batch of 2 shared feature vectors (each 32-D) for the smoke test. In production this comes from the shared backbone in §11.1; here we synthesise random N(0,1) values just to verify shapes and the non-negativity invariant.
Run the full forward pass: 32-D input → 64 → 32 → 16 → 1 → squeeze → final ReLU. End-to-end shape transformation: (2, 32) → (2,).
Confirms the input shape. Two engines, 32-D feature vectors each.
Confirms the output shape. Note (2,) NOT (2, 1) — the .squeeze(-1) on line 11 removed the trailing dim. This shape matches what MSE loss expects against a (B,) RUL target.
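Why the (B,) shape matters can be seen with three made-up predictions that match their targets exactly. In this sketch (values and names are illustrative), leaving the trailing dim in place makes NumPy broadcast (B, 1) against (B,) into a (B, B) matrix, so the "MSE" compares every prediction with every target.

```python
import numpy as np

target = np.array([10.0, 20.0, 30.0])          # (B,) RUL targets
pred_col = np.array([[10.0], [20.0], [30.0]])  # (B, 1) un-squeezed predictions

# (B, 1) - (B,) broadcasts to (B, B): a silent shape bug, not an error.
bad_diff = pred_col - target
good_diff = pred_col.squeeze(-1) - target

print(bad_diff.shape, np.mean(bad_diff ** 2))    # (3, 3) 133.33...
print(good_diff.shape, np.mean(good_diff ** 2))  # (3,) 0.0
```

PyTorch tensors follow the same broadcasting rules, so the same mistake applies there.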
Print the actual scalar predictions. Both ≥ 0 by construction (final ReLU). For this seed both happen to be strictly positive, so the ReLU is not active — but the guarantee always holds.
Count total parameters in the head. Confirms the 4,737 figure from the table above.
```python
import numpy as np


def rul_head(shared: np.ndarray, weights: list, biases: list) -> np.ndarray:
    """shared: (B, 32). Returns (B,) - non-negative RUL prediction."""
    h = shared
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0)         # ReLU on hidden layers only
    return np.maximum(h.squeeze(-1), 0)  # Final ReLU - non-negative RUL


# 32 → 64 → 32 → 16 → 1
np.random.seed(0)
shapes = [(32, 64), (64, 32), (32, 16), (16, 1)]
weights = [np.random.randn(*s).astype(np.float32) * np.float32(np.sqrt(2 / s[0])) for s in shapes]
biases = [np.zeros(s[1], dtype=np.float32) for s in shapes]

# Fake shared features
shared = np.random.randn(2, 32).astype(np.float32)
rul = rul_head(shared, weights, biases)

print("shared.shape :", shared.shape)   # (2, 32)
print("rul.shape    :", rul.shape)      # (2,)
print("rul values   :", rul.tolist())   # all >= 0
print("# params     :",
      sum(W.size + b.size for W, b in zip(weights, biases)))  # 4737
```

PyTorch: nn.Sequential
PyTorch's top-level package. Provides torch.Tensor (the GPU-aware n-dim array), autograd, the random-number generator (torch.manual_seed, torch.randn), and the dtype/device system. Required even when the heavy lifting lives in torch.nn.
PyTorch's neural network sub-package. Provides nn.Module (the base class for all learnable models), nn.Linear (fully-connected layer), nn.ReLU (the non-linearity), and nn.Sequential (the container we use to wire layers in order). The 'nn' alias is convention.
Production wrapper for the regression head. Subclassing nn.Module makes RULHead recognisable to the optimizer, serialisable via state_dict, and movable to GPU with .to('cuda'). Same logic as the NumPy version above, but with proper parameter management.
States the input/output contract for callers and docs.
Constructor. Builds the layer stack and registers all parameters. Called once when you do head = RULHead(in_dim=32).
Call the nn.Module base-class constructor. This sets up internal bookkeeping dicts (_parameters, _modules, _buffers) that nn.Module needs before we can assign any layers to self. Forgetting this line raises 'cannot assign module before Module.__init__() call'.
Wrap the entire MLP in a single nn.Sequential container. Sequential stores child modules in order and runs them sequentially in its forward(): output of module k is input to module k+1. Equivalent to writing forward() manually with four lines, but cleaner and auto-registers all sub-modules' parameters.
Layer 1: project 32 → 64 with a fully-connected layer, then apply ReLU.
Layer 2: contract 64 → 32. The descent down the funnel begins. Same pattern as line 9 with new shapes.
Layer 3: 32 → 16. Halving again — funneling toward the scalar output.
Final layer: collapse 16 → 1, then ReLU to enforce non-negativity. THIS is the line that turns a generic MLP into a 'cycles to failure' regressor — the ReLU is a structural prior on the physics (RUL ≥ 0).
Close the nn.Sequential. self.net is now ready and all 4,737 parameters are registered.
The forward pass. nn.Module.__call__ dispatches to this method when you call head(shared). Never call .forward() directly — head(x) is preferred because it runs the registered hooks.
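The difference between head(x) and head.forward(x) can be observed with a forward hook. The sketch below uses a single nn.Linear as a stand-in module (an assumption for brevity; any nn.Module behaves the same way) and records each hook invocation in a list.

```python
import torch
import torch.nn as nn

layer = nn.Linear(32, 1)  # stand-in module for the demonstration
calls = []

# A forward hook receives (module, inputs, output) after forward() has run.
handle = layer.register_forward_hook(
    lambda mod, inp, out: calls.append(tuple(out.shape))
)

x = torch.randn(2, 32)
layer(x)           # goes through __call__, so the hook fires
layer.forward(x)   # bypasses __call__, so the hook does not fire
handle.remove()

print(calls)  # [(2, 1)]
```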
Run the input through the Sequential stack — this internally calls each child in order: Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → ReLU. The result has shape (B, 1); we squeeze the trailing size-1 dim to get (B,).
Seed PyTorch's random number generators for reproducibility. This affects torch.randn, the default-init weights inside nn.Linear, dropout masks, etc. In current PyTorch releases torch.manual_seed also seeds the CUDA generators, so a separate torch.cuda.manual_seed_all call is not needed; the demo runs on CPU in any case.
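A quick way to see what the seed controls is to build the same layer twice. In this sketch (layer sizes and variable names are illustrative), re-seeding before each construction yields identical default-initialised weights, while a layer built without re-seeding draws different ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = nn.Linear(32, 64)

torch.manual_seed(0)  # same seed again, so the same random draws follow
b = nn.Linear(32, 64)

c = nn.Linear(32, 64)  # no re-seed: continues the random stream

same_ab = torch.equal(a.weight, b.weight) and torch.equal(a.bias, b.bias)
same_ac = torch.equal(a.weight, c.weight)
print(same_ab, same_ac)  # True False
```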
Instantiate the head. Triggers __init__: builds nn.Sequential, allocates all 4 Linear layers' weight + bias tensors, registers them as parameters. After this line, head.parameters() yields 4,737 trainable scalars.
Build a fake batch of 2 shared feature vectors. Same role as the NumPy version above — verify shapes and the non-negativity invariant without needing the real backbone.
Forward pass. head(shared) is sugar for head.__call__(shared) which runs hooks then forward(). Returns a (2,) tensor of non-negative RUL predictions.
Print the input shape. tuple(...) converts torch.Size into a plain Python tuple for cleaner printing.
Confirms the output is (2,) — squeezed, ready for MSE against a (B,) target.
The crucial invariant check. Builds a boolean tensor with element-wise (rul >= 0), reduces with .all() to a single 0-D tensor, then .item() pulls the Python bool out.
Count every trainable scalar in the head. Confirms 4,737 parameters — tiny compared with the shared backbone (typically tens of thousands).
```python
import torch
import torch.nn as nn

class RULHead(nn.Module):
    """32-D shared features → non-negative scalar RUL prediction."""
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 16), nn.ReLU(inplace=True),
            nn.Linear(16, 1), nn.ReLU(inplace=True),  # final non-neg
        )

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        return self.net(shared).squeeze(-1)  # (B,)


# Use it
torch.manual_seed(0)
head = RULHead(in_dim=32)
shared = torch.randn(2, 32)
rul = head(shared)

print("shared.shape :", tuple(shared.shape))       # (2, 32)
print("rul.shape    :", tuple(rul.shape))          # (2,)
print("non-neg?     :", (rul >= 0).all().item())   # True
print("# params     :", sum(p.numel() for p in head.parameters()))  # 4737
```

Two Head Pitfalls
- Apply .squeeze(-1) at the head, so predictions are (B,) before MSE.
- Set dropout_p = 0 on the regression head once the model is past the warmup phase.

The point. A four-layer MLP with non-negative output. Tiny: just 4.7k parameters. The full DualTaskModel gets a similar-size head for classification next.
Takeaway
- 32 → 64 → 32 → 16 → 1. Expand-then-contract.
- Final ReLU keeps RUL non-negative. Physically required.
- 4,737 parameters. Tiny relative to the backbone.
- Always squeeze. (B, 1) → (B,) before MSE.