Chapter 11

RUL Regression Head

Dual-Task Heads & Model Assembly

The Fuel Gauge Output

The RUL head is the model's “fuel gauge” - Section 1.2's framing made literal. It takes the 32-D shared feature vector and emits a single non-negative scalar: the predicted cycles-to-failure for the engine at the end of the current 30-cycle window.

The output contract. One non-negative float per engine. The training loop in Chapter 15 uses MSE (or, for AMNL / GRACE, weighted MSE) on this against the capped RUL target from §7.2.
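The contract can be sketched as a plain MSE against a capped target. This is a minimal illustration, not the Chapter 15 training loop: the cap value `RUL_CAP` and the helper name `capped_mse` are placeholders introduced here, and the real cap should come from §7.2.

```python
import numpy as np

RUL_CAP = 125.0  # ASSUMPTION: placeholder cap, not taken from §7.2


def capped_mse(pred: np.ndarray, raw_rul: np.ndarray, cap: float = RUL_CAP) -> float:
    """Plain MSE between predictions and the capped RUL target (hypothetical helper)."""
    target = np.minimum(raw_rul, cap)          # cap the raw cycles-to-failure
    return float(np.mean((pred - target) ** 2))


pred = np.array([120.0, 30.0])      # one non-negative float per engine
raw_rul = np.array([200.0, 28.0])   # first engine is far from failure, so it is capped

print(capped_mse(pred, raw_rul))    # ((120-125)^2 + (30-28)^2) / 2 = 14.5
```

The weighted variants mentioned for AMNL / GRACE are not shown, since their weighting scheme is defined elsewhere in the book.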

32 → 64 → 32 → 16 → 1

| Layer | Shape in | Shape out | Params |
|---|---|---|---|
| Linear → ReLU | (B, 32) | (B, 64) | 2,112 |
| Linear → ReLU | (B, 64) | (B, 32) | 2,080 |
| Linear → ReLU | (B, 32) | (B, 16) | 528 |
| Linear → ReLU (non-neg) | (B, 16) | (B, 1) | 17 |
| Total RUL head | | | 4,737 |

Note the expand-then-contract shape (32 → 64 → 32 → 16 → 1). Going up to 64 first lets the head re-mix the shared features before squeezing them down to a single scalar. Pure contraction (32 → 16 → 8 → 1) trains slightly worse on C-MAPSS in the paper's ablation.
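The parameter counts follow directly from the layer widths. A short sketch that recomputes them (the helper name `linear_params` is introduced here for illustration):

```python
# Layer widths from the text: 32 -> 64 -> 32 -> 16 -> 1
dims = [32, 64, 32, 16, 1]


def linear_params(fan_in: int, fan_out: int) -> int:
    """Weights (fan_in * fan_out) plus one bias per output unit."""
    return fan_in * fan_out + fan_out


per_layer = [linear_params(i, o) for i, o in zip(dims[:-1], dims[1:])]
total = sum(per_layer)

print(per_layer)  # [2112, 2080, 528, 17]
print(total)      # 4737
```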

The Non-Negative Output

RUL is “cycles to failure”. Negative RUL is physically meaningless - the engine cannot fail in “negative six cycles”. We enforce this by applying ReLU to the FINAL layer's output:

$$\hat{y}_{\text{rul}} = \max(0,\, w_{\text{out}}^{\top} h_3 + b_{\text{out}}).$$

Two alternatives:

| Method | Trade-off |
|---|---|
| Final ReLU (this book) | Simple; clips to [0, ∞); cheap |
| Softplus | Smoother; never exactly 0; slightly costlier |
| No constraint + clip at eval | Trains faster but may predict negative RUL |

The final ReLU also keeps the training signal physical. An unconstrained head that predicts -2 when the truth is 5 is penalised on an error of 7 against a value that cannot occur; with the clamp the prediction is 0 and the error is 5. One caveat: while the pre-activation is negative, the ReLU passes zero gradient for that sample. ReLU at the head removes the non-physical regime entirely.
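The three options can be compared on a few hypothetical pre-activation values. This is an illustrative sketch; the helper names `relu` and `softplus` and the sample values in `z` are introduced here.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    """Hard clamp to [0, inf)."""
    return np.maximum(z, 0.0)


def softplus(z: np.ndarray) -> np.ndarray:
    """log(1 + exp(z)), written in a numerically stable form."""
    return np.logaddexp(0.0, z)


z = np.array([-3.5, 0.0, 2.0])   # hypothetical outputs of the last Linear layer

print(relu(z))      # [0. 0. 2.]
print(softplus(z))  # strictly positive everywhere, never exactly 0
print(z)            # unconstrained head: -3.5 survives until clipped at eval time
```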

Python: The Regression Head

Four-layer MLP with non-negative output
🐍rul_head_numpy.py
1import numpy as np

NumPy is Python's numerical computing library. It provides ndarray (N-dimensional arrays) and a fast, vectorised math runtime in C. We need it here for the matrix multiply (@), element-wise ReLU (np.maximum), Kaiming init (np.random.randn, np.sqrt) and bias zeros (np.zeros). The 'np' alias is the universal convention.

EXECUTION STATE
numpy = Library for numerical computing — ndarray, broadcasting, linear algebra, RNG. Backs almost every Python ML library under the hood.
as np = Short alias so we write np.maximum() instead of numpy.maximum().
4def rul_head(shared, weights, biases) → np.ndarray

Pure-NumPy reference implementation of the RUL regression head. Maps the 32-D shared feature vector through a four-layer MLP (32 → 64 → 32 → 16 → 1) with ReLU between every hidden layer and a final ReLU that clamps the output to [0, ∞). The non-negativity guarantee is the whole point of this function — RUL is 'cycles to failure' and cannot be negative.

EXECUTION STATE
⬇ input: shared (B, 32) = The shared backbone features from §11.1. B is the batch size (here B=2 in the smoke test). Each row is one engine's 32-D feature vector built from a 30-cycle sensor window.
→ shared purpose = These are the shared representations both heads (RUL + failure-class) read from. The RUL head squeezes them down to a single scalar; the classifier head squeezes them down to class logits.
⬇ input: weights = Python list of 4 ndarray weight matrices with shapes (32,64), (64,32), (32,16), (16,1). Total params: 32·64 + 64·32 + 32·16 + 16·1 = 2048 + 2048 + 512 + 16 = 4,624 weight params (plus 113 bias params = 4,737 total).
⬇ input: biases = Python list of 4 bias vectors with shapes (64,), (32,), (16,), (1,). All initialised to zero (standard for ReLU networks).
→ np.ndarray type hint = Tells type checkers that shared and the return value are NumPy arrays (weights and biases are annotated as plain lists). Pure documentation: Python does not enforce it at runtime.
⬆ returns = np.ndarray of shape (B,) — one non-negative float per engine in the batch. For the smoke test below, returns [3.913, 1.900].
5Docstring: "shared: (B, 32). Returns (B,) - non-negative RUL prediction."

States the input/output contract: takes a batch of 32-D shared features, returns one non-negative scalar per row.

6h = shared

Aliases the running activation tensor h to the input. This is the standard MLP idiom — h gets overwritten layer by layer in the loop below. No copy is made; h and shared point to the same ndarray until the first reassignment.

EXECUTION STATE
h (initial) = (B, 32) — same memory as shared. Will be replaced in line 8.
7for i, (W, b) in enumerate(zip(weights, biases)):

Iterates the four layers. zip(weights, biases) pairs each weight matrix with its bias vector; enumerate adds the index i so we can detect the last layer (i == 3 for a 4-layer net) and skip the ReLU there.

LOOP TRACE · 4 iterations
i=0 (32 → 64 hidden)
W shape = (32, 64)
b shape = (64,)
h before = (2, 32)
h after Linear+ReLU = (2, 64)
i=1 (64 → 32 hidden)
W shape = (64, 32)
b shape = (32,)
h before = (2, 64)
h after Linear+ReLU = (2, 32)
i=2 (32 → 16 bottleneck)
W shape = (32, 16)
b shape = (16,)
h before = (2, 32)
h after Linear+ReLU = (2, 16)
i=3 (16 → 1 output, NO ReLU here)
W shape = (16, 1)
b shape = (1,)
h before = (2, 16)
h after Linear only = (2, 1) — pre-ReLU; final ReLU runs in line 11
8h = h @ W + b

Affine layer: matrix-multiply the activations by the weight matrix, then add the bias. This is one fully-connected layer's forward pass. Broadcasting handles the bias automatically — b has shape (out,) and is added to every row of h @ W which has shape (B, out).

EXECUTION STATE
📚 @ (matrix multiply) = Python's matmul operator (PEP 465). For 2-D arrays equivalent to np.matmul(h, W). h(B, in) @ W(in, out) → (B, out). Inner dims must match.
📚 + (broadcast add) = NumPy element-wise addition with broadcasting. Adding b(out,) to a (B, out) array stretches b across all B rows. Example: [[1,2],[3,4]] + [10,20] → [[11,22],[13,24]].
→ Layer 0 example = h(2, 32) @ W(32, 64) → (2, 64), then + b(64,) → (2, 64). 2 engines, 64 hidden units each.
⬆ result: h = (B, out) — pre-activation. Will be passed through ReLU on the next line for hidden layers.
9if i < len(weights) - 1:

Skip the per-layer ReLU on the FINAL layer. For a 4-layer net (len(weights) = 4), this is True for i ∈ {0, 1, 2} and False for i = 3. The final ReLU is applied separately on line 11 — first to enforce non-negativity, second so we can squeeze the trailing size-1 dim cleanly.

EXECUTION STATE
len(weights) = 4 (number of layers)
len(weights) - 1 = 3 (index of the LAST layer)
→ why this guard? = If we applied ReLU to the last layer here, the final np.maximum on line 11 would be redundant. Splitting them keeps the 'final non-negativity' step explicit and easy to swap (e.g. for softplus).
10h = np.maximum(h, 0) # ReLU on hidden layers

Element-wise ReLU: each element becomes max(value, 0). This is THE non-linearity that makes the MLP an MLP — without it, four stacked Linears collapse mathematically into a single Linear.

EXECUTION STATE
📚 np.maximum(x1, x2) = Element-wise maximum of two arrays (or array and scalar). Broadcasts. Different from np.max() which reduces along an axis. Example: np.maximum([-1, 2, -3], 0) → [0, 2, 0].
⬇ arg 1: h = (B, out) pre-activation from line 8 — can contain negative values.
⬇ arg 2: 0 = Python int — broadcast to the full shape of h. Every element of h is compared against 0 and the larger is kept.
⬆ result: h = Same shape as input, but all negative entries replaced by 0. Now non-negative.
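The claim that stacked Linears collapse without a non-linearity can be checked numerically. A small sketch using random matrices (all names here are illustrative, not part of the head's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32))
W1, b1 = rng.standard_normal((32, 64)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 32)), rng.standard_normal(32)

# Two affine layers with no activation in between...
stacked = (x @ W1 + b1) @ W2 + b2

# ...equal one affine layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
single = x @ W_merged + b_merged
print(np.allclose(stacked, single))    # True

# With a ReLU between the layers the equivalence no longer holds.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, single))  # False
```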
11return np.maximum(h.squeeze(-1), 0)

After the loop, h has shape (B, 1) — the linear output of the final layer (i=3) WITHOUT a ReLU. We squeeze the trailing size-1 dim to get (B,), then apply the final ReLU. This is the line that guarantees the head's contract: one non-negative float per engine.

EXECUTION STATE
📚 .squeeze(-1) = ndarray method: removes a size-1 dimension. The argument is the axis to squeeze. -1 means the LAST axis. Example: arr of shape (2,1).squeeze(-1) → shape (2,). If the axis is not size-1, it raises ValueError.
⬇ arg: -1 = Squeeze the last axis (the size-1 trailing dim from the (B, 1) → (B,) collapse). Using -1 instead of 1 makes the code robust to extra leading batch dims.
📚 np.maximum(., 0) = Element-wise ReLU again — the FINAL non-negativity clamp.
→ why the final ReLU? = Without it the model could predict RUL = -3.5 cycles, which is physically impossible (cycles-to-failure cannot be negative). The ReLU clamps the output to [0, ∞) and removes the non-physical regime entirely. Cheaper than softplus, equivalent in expressivity for this task.
⬆ return: np.ndarray (B,) = Smoke-test value: [3.913, 1.900] — both engines have positive predicted RUL, so the final ReLU is a no-op here. But the guarantee holds for any input.
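The strictness of NumPy's squeeze is easy to see in isolation. A minimal sketch with zero-filled arrays standing in for the head's output:

```python
import numpy as np

out = np.zeros((2, 1))          # shape of the final layer's output for B=2
squeezed = out.squeeze(-1)
print(squeezed.shape)           # (2,)

wide = np.zeros((2, 3))         # last axis is NOT size 1
try:
    wide.squeeze(-1)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

The error acts as a free shape check: if the last layer ever emits more than one value per engine, the head fails loudly instead of returning a wrongly shaped array.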
15np.random.seed(0)

Seed NumPy's global random number generator for reproducibility. Every call to np.random.* (randn, choice, ...) below will produce the same values across runs. Critical for the smoke test to print stable numbers.

EXECUTION STATE
📚 np.random.seed(seed) = Sets the seed of the legacy global RNG. Modern code prefers np.random.default_rng(seed) for an explicit Generator, but seed(0) is fine for a one-shot demo.
⬇ arg: 0 = The seed value — any integer works. 0 is the conventional 'demo seed'.
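For comparison, here is what the explicit Generator style looks like. This is a sketch of the alternative, not a drop-in replacement: default_rng uses a different bit generator from the legacy global RNG, so it will not reproduce the smoke-test numbers printed in this section.

```python
import numpy as np

# Two independent generators built from the same seed
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

w_a = rng_a.standard_normal((32, 64))
w_b = rng_b.standard_normal((32, 64))

print(np.array_equal(w_a, w_b))  # True: same seed, same stream
print(w_a.shape)                 # (32, 64)
```

Passing the generator around explicitly avoids hidden coupling through global state, which matters once several components draw random numbers.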
16shapes = [(32, 64), (64, 32), (32, 16), (16, 1)]

Four layer shapes for the 32 → 64 → 32 → 16 → 1 funnel. Each tuple is (in_features, out_features). Note the expand-then-contract pattern: bump up to 64 first, then squeeze down through 32 → 16 → 1.

EXECUTION STATE
shapes[0] = (32, 64) — Layer 1: 32-D shared input → 64-D hidden
shapes[1] = (64, 32) — Layer 2: 64 → 32
shapes[2] = (32, 16) — Layer 3: 32 → 16 (bottleneck begins)
shapes[3] = (16, 1) — Layer 4: 16 → 1 (final scalar)
→ why expand-then-contract? = Going up to 64 first lets the head re-mix the shared features before squeezing them to a single scalar. Pure contraction (32→16→8→1) trains slightly worse on C-MAPSS in the paper's ablation.
17weights = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]

List comprehension that builds 4 weight matrices using Kaiming (He) initialisation — the standard init for ReLU networks. Without proper init, deep ReLU nets either blow up (variance grows) or die (variance shrinks to 0).

EXECUTION STATE
📚 np.random.randn(*s) = Draws samples from the standard normal distribution N(0, 1). The *s syntax unpacks the shape tuple as positional args. Example: np.random.randn(*((32, 64))) → np.random.randn(32, 64) → array of shape (32, 64) filled with N(0,1) samples.
📚 .astype(np.float32) = Cast the array to 32-bit floats. Default randn returns float64 (8 bytes/element). float32 (4 bytes/element) is the standard for deep learning: half the memory, typically higher throughput on GPU, and usually no measurable accuracy loss for a model of this size.
📚 np.sqrt(2 / s[0]) = The Kaiming/He scale factor. s[0] is the fan-in (number of input features). Multiplying N(0,1) samples by √(2/fan_in) gives weights with variance 2/fan_in — the value derived in He et al. 2015 to keep activation variance ≈ 1 through ReLU layers.
→ per-layer scale = Layer 0 (fan_in=32): √(2/32) = 0.250 Layer 1 (fan_in=64): √(2/64) = 0.177 Layer 2 (fan_in=32): √(2/32) = 0.250 Layer 3 (fan_in=16): √(2/16) = 0.354
⬆ result: weights = List of 4 float32 ndarrays with shapes (32,64), (64,32), (32,16), (16,1). 4,624 weight params total.
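The per-layer scales, and the variance-preserving effect they are meant to have, can be checked empirically. A rough sketch under the assumption of unit-variance Gaussian inputs; the sample size and variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(32, 64), (64, 32), (32, 16), (16, 1)]

# Per-layer He scale sqrt(2 / fan_in)
scales = [round(float(np.sqrt(2 / fan_in)), 3) for fan_in, _ in shapes]
print(scales)  # [0.25, 0.177, 0.25, 0.354]

# Empirical check on the first layer: unit-variance inputs in,
# mean squared activation close to 1 out.
fan_in, fan_out = shapes[0]
x = rng.standard_normal((10_000, fan_in))
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2 / fan_in)
h = np.maximum(x @ W, 0)
second_moment = float(np.mean(h ** 2))
print(round(second_moment, 2))  # close to 1; the exact value depends on the draw
```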
18biases = [np.zeros(s[1], dtype=np.float32) for s in shapes]

Build 4 bias vectors, all initialised to zero. Standard practice for ReLU networks: the random weights already break symmetry between units, so the biases can start at zero.

EXECUTION STATE
📚 np.zeros(shape, dtype) = Creates an array of the given shape filled with 0.0. Example: np.zeros(64, dtype=np.float32) → array of 64 zeros, dtype float32.
⬇ arg: s[1] = The OUTPUT dim of each layer — the bias vector matches the layer's output, not the input.
⬇ arg: dtype=np.float32 = Match the float32 dtype of the weights. Mixing float32 and float64 in matmul triggers a slow upcast.
⬆ result: biases = List of 4 float32 ndarrays with shapes (64,), (32,), (16,), (1,). 113 bias params total.
21shared = np.random.randn(2, 32).astype(np.float32)

Build a fake batch of 2 shared feature vectors (each 32-D) for the smoke test. In production this comes from the shared backbone in §11.1; here we synthesise random N(0,1) values just to verify shapes and the non-negativity invariant.

EXECUTION STATE
⬇ arg 1, 2: 2, 32 = Shape: 2 rows (batch), 32 cols (features per engine).
⬆ result: shared = (2, 32) float32 array — 64 random N(0,1) values. Distribution is centred at 0 with unit variance.
22rul = rul_head(shared, weights, biases)

Run the full forward pass: 32-D input → 64 → 32 → 16 → 1 → squeeze → final ReLU. End-to-end shape transformation: (2, 32) → (2,).

EXECUTION STATE
⬇ shared = (2, 32) float32
⬇ weights = List of 4 weight matrices
⬇ biases = List of 4 zero bias vectors
⬆ rul = (2,) float32 — [3.913, 1.900]. Both positive, so the final ReLU is a no-op here.
24print("shared.shape :", shared.shape)

Confirms the input shape. Two engines, 32-D feature vectors each.

EXECUTION STATE
Output = shared.shape : (2, 32)
25print("rul.shape :", rul.shape)

Confirms the output shape. Note (2,) NOT (2, 1) — the .squeeze(-1) on line 11 removed the trailing dim. This shape matches what MSE loss expects against a (B,) RUL target.

EXECUTION STATE
Output = rul.shape : (2,)
26print("rul values :", rul.tolist())

Print the actual scalar predictions. Both ≥ 0 by construction (final ReLU). For this seed both happen to be strictly positive, so the ReLU is not active — but the guarantee always holds.

EXECUTION STATE
📚 .tolist() = Convert ndarray to a plain Python list. Useful for printing — avoids NumPy's array() wrapper in the output. Recursive: a (2, 3) array becomes a nested list [[...], [...]].
Output = rul values : [3.9130710412522607, 1.8996311091779403]
27print("# params :", sum(W.size + b.size for W, b in zip(weights, biases)))

Count total parameters in the head. Confirms the 4,737 figure from the table above.

EXECUTION STATE
📚 W.size = ndarray attribute: total number of elements = product of shape. For W(32, 64): size = 32×64 = 2048.
→ per-layer count = Layer 0: 32·64 + 64 = 2,112 Layer 1: 64·32 + 32 = 2,080 Layer 2: 32·16 + 16 = 528 Layer 3: 16·1 + 1 = 17 Total : 4,737
Output = # params : 4737
1import numpy as np
2
3
4def rul_head(shared: np.ndarray, weights: list, biases: list) -> np.ndarray:
5    """shared: (B, 32). Returns (B,) - non-negative RUL prediction."""
6    h = shared
7    for i, (W, b) in enumerate(zip(weights, biases)):
8        h = h @ W + b
9        if i < len(weights) - 1:
10            h = np.maximum(h, 0)                     # ReLU on hidden layers only
11    return np.maximum(h.squeeze(-1), 0)              # Final ReLU - non-negative RUL
12
13
14# 32 → 64 → 32 → 16 → 1
15np.random.seed(0)
16shapes  = [(32, 64), (64, 32), (32, 16), (16, 1)]
17weights = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]
18biases  = [np.zeros(s[1], dtype=np.float32) for s in shapes]
19
20# Fake shared features
21shared = np.random.randn(2, 32).astype(np.float32)
22rul    = rul_head(shared, weights, biases)
23
24print("shared.shape :", shared.shape)               # (2, 32)
25print("rul.shape    :", rul.shape)                  # (2,)
26print("rul values   :", rul.tolist())               # all >= 0
27print("# params     :",
28      sum(W.size + b.size for W, b in zip(weights, biases)))    # 4737

PyTorch: nn.Sequential

Production RULHead - tiny nn.Module
🐍rul_head_torch.py
1import torch

PyTorch's top-level package. Provides torch.Tensor (the GPU-aware n-dim array), autograd, the random-number generator (torch.manual_seed, torch.randn), and the dtype/device system. Required even when the heavy lifting lives in torch.nn.

2import torch.nn as nn

PyTorch's neural network sub-package. Provides nn.Module (the base class for all learnable models), nn.Linear (fully-connected layer), nn.ReLU (the non-linearity), and nn.Sequential (the container we use to wire layers in order). The 'nn' alias is convention.

EXECUTION STATE
→ why nn.Module? = Subclassing nn.Module gives us automatic parameter registration (.parameters()), device moves (.to('cuda')), train/eval mode toggling, and serialisation (state_dict). Wrapping the head in a class instead of a bare function unlocks all of that.
4class RULHead(nn.Module):

Production wrapper for the regression head. Subclassing nn.Module makes RULHead recognisable to the optimizer, serialisable via state_dict, and movable to GPU with .to('cuda'). Same logic as the NumPy version above, but with proper parameter management.

5Docstring: "32-D shared features → non-negative scalar RUL prediction."

States the input/output contract for callers and docs.

6def __init__(self, in_dim: int = 32):

Constructor. Builds the layer stack and registers all parameters. Called once when you do head = RULHead(in_dim=32).

EXECUTION STATE
⬇ input: self = The instance being constructed. Attribute writes (self.net = ...) trigger nn.Module's parameter-registration hook.
⬇ input: in_dim = 32 = Dimensionality of the shared backbone features. Defaults to 32 to match §11.1's design. Pass a different value if the backbone is reconfigured.
→ int type hint, default = 32 = in_dim must be an integer; if omitted it defaults to 32. Type hints are documentation — Python doesn't enforce them at runtime.
7super().__init__()

Call the nn.Module base-class constructor. This sets up internal bookkeeping dicts (_parameters, _modules, _buffers) that nn.Module needs before we can assign any layers to self. Forgetting this line raises 'cannot assign module before Module.__init__() call'.

EXECUTION STATE
📚 super().__init__() = Python's mechanism for calling a parent class's method. Without args this means 'call the next __init__ in the MRO' — here that is nn.Module's constructor.
8self.net = nn.Sequential(

Wrap the entire MLP in a single nn.Sequential container. Sequential stores child modules in order and runs them sequentially in its forward(): output of module k is input to module k+1. Equivalent to writing forward() manually with four lines, but cleaner and auto-registers all sub-modules' parameters.

EXECUTION STATE
📚 nn.Sequential(*modules) = Container that holds an ordered list of modules. forward(x) = m_n( ... m_2( m_1( x ) ) ... ). Each child module's parameters are auto-registered for optimization.
→ why self.net? = Assigning to self.net triggers nn.Module.__setattr__, which registers Sequential (and recursively all its children) so head.parameters() finds them, and head.to('cuda') moves them.
9nn.Linear(in_dim, 64), nn.ReLU(inplace=True),

Layer 1: project 32 → 64 with a fully-connected layer, then apply ReLU.

EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias=True) = Fully-connected layer: y = x @ W.T + b. Stores W of shape (out, in) and b of shape (out,). Weights are Kaiming-uniform-init by default; biases are uniform-init.
⬇ arg 1: in_dim = 32 = Input feature count. Sets the number of columns of W (32) — must match the input tensor's last dim.
⬇ arg 2: 64 = Output feature count. Bumps the dim up from 32 to 64 — the 'expand' phase of the expand-then-contract funnel.
📚 nn.ReLU(inplace=True) = Element-wise ReLU activation: max(x, 0). inplace=True overwrites the input tensor instead of allocating a new one — saves memory, and is safe inside Sequential because the pre-activation tensor is not needed downstream.
→ params added = Linear: 32×64 weights + 64 biases = 2,112
10nn.Linear(64, 32), nn.ReLU(inplace=True),

Layer 2: contract 64 → 32. The descent down the funnel begins. Same pattern as line 9 with new shapes.

EXECUTION STATE
⬇ in_features = 64 = Matches the previous layer's output (64).
⬇ out_features = 32 = Halve the width — the bottleneck pattern.
→ params added = 64×32 + 32 = 2,080
11nn.Linear(32, 16), nn.ReLU(inplace=True),

Layer 3: 32 → 16. Halving again — funneling toward the scalar output.

EXECUTION STATE
⬇ in_features = 32 = From layer 2's output.
⬇ out_features = 16 = Halve once more.
→ params added = 32×16 + 16 = 528
12nn.Linear(16, 1), nn.ReLU(inplace=True), # final non-neg

Final layer: collapse 16 → 1, then ReLU to enforce non-negativity. THIS is the line that turns a generic MLP into a 'cycles to failure' regressor — the ReLU is a structural prior on the physics (RUL ≥ 0).

EXECUTION STATE
⬇ in_features = 16 = From layer 3's output.
⬇ out_features = 1 = Single scalar — the predicted RUL (still has a trailing size-1 dim; squeezed in forward()).
→ params added = 16×1 + 1 = 17
→ alternative = Some papers omit the final ReLU and use softplus (smoother, never exactly 0) or just clip at eval time. ReLU is cheaper, gives a hard guarantee, and works fine on C-MAPSS.
13)

Close the nn.Sequential. self.net is now ready and all 4,737 parameters are registered.

15def forward(self, shared: torch.Tensor) -> torch.Tensor:

The forward pass. nn.Module.__call__ dispatches to this method when you call head(shared). Never call .forward() directly — head(x) is preferred because it runs the registered hooks.

EXECUTION STATE
⬇ input: self = The RULHead instance.
⬇ input: shared (B, 32) torch.Tensor = Shared backbone features. B is batch size. dtype is float32, device matches the head's device.
⬆ returns = torch.Tensor of shape (B,) — non-negative RUL predictions.
16return self.net(shared).squeeze(-1)

Run the input through the Sequential stack — this internally calls each child in order: Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → ReLU. The result has shape (B, 1); we squeeze the trailing size-1 dim to get (B,).

EXECUTION STATE
📚 self.net(shared) = Calls Sequential.forward(shared). Output shape: (B, 1). All four layers and the final non-neg ReLU run before we return.
📚 .squeeze(-1) = Tensor method: removes a size-1 dim along the given axis. -1 means the LAST axis. (B, 1) → (B,). If the axis is not size-1, .squeeze(axis) returns the tensor unchanged (unlike NumPy's strict squeeze).
⬇ arg: -1 = Squeeze the last dim. Using -1 instead of 1 makes this robust to extra leading batch dims (e.g. (B, T, 1) → (B, T)).
⬆ return: torch.Tensor (B,) = Smoke-test value: tensor([0.0, 0.0]) — both happen to clamp to 0 with this seed under PyTorch's default Linear init.
20torch.manual_seed(0)

Seed PyTorch's CPU random number generator for reproducibility. Affects torch.randn, default-init weights inside nn.Linear, dropout masks, etc. The CUDA RNG is separate (torch.cuda.manual_seed_all) but not needed here since the demo runs on CPU.

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets the seed of torch's CPU RNG and returns the Generator. Idempotent — calling it again resets the stream.
⬇ arg: 0 = Seed value. Same convention as np.random.seed(0).
21head = RULHead(in_dim=32)

Instantiate the head. Triggers __init__: builds nn.Sequential, allocates all 4 Linear layers' weight + bias tensors, registers them as parameters. After this line, head.parameters() yields 4,737 trainable scalars.

EXECUTION STATE
⬇ in_dim=32 = Match the shared backbone's output dim.
⬆ head = RULHead instance with .net (Sequential), .training=True (default mode), and 4,737 parameters.
22shared = torch.randn(2, 32)

Build a fake batch of 2 shared feature vectors. Same role as the NumPy version above — verify shapes and the non-negativity invariant without needing the real backbone.

EXECUTION STATE
📚 torch.randn(*size) = Returns a tensor filled with samples from N(0, 1). Default dtype is float32, default device is CPU. Example: torch.randn(2, 32) → tensor of shape (2, 32).
⬇ args: 2, 32 = Shape: 2 rows (batch), 32 cols (features).
⬆ shared = torch.Tensor (2, 32) float32 on CPU.
23rul = head(shared)

Forward pass. head(shared) is sugar for head.__call__(shared) which runs hooks then forward(). Returns a (2,) tensor of non-negative RUL predictions.

EXECUTION STATE
⬇ shared = (2, 32) float32
⬆ rul = torch.Tensor([0.0, 0.0]) — both clamped to 0 by the final ReLU under PyTorch's default Linear init with this seed.
→ why both 0? = PyTorch's default nn.Linear init is Kaiming-uniform with sqrt(5) — slightly different from our NumPy Kaiming-normal. With this particular seed, the pre-final-ReLU output happens to land negative for both rows, and the ReLU clamps them to 0. The non-negativity guarantee still holds.
25print("shared.shape :", tuple(shared.shape))

Print the input shape. tuple(...) converts torch.Size into a plain Python tuple for cleaner printing.

EXECUTION STATE
📚 tuple(shared.shape) = shared.shape is a torch.Size (subclass of tuple). Wrapping in tuple() prints (2, 32) instead of torch.Size([2, 32]).
Output = shared.shape : (2, 32)
26print("rul.shape :", tuple(rul.shape))

Confirms the output is (2,) — squeezed, ready for MSE against a (B,) target.

EXECUTION STATE
Output = rul.shape : (2,)
27print("non-neg? :", (rul >= 0).all().item())

The crucial invariant check. Builds a boolean tensor with element-wise (rul >= 0), reduces with .all() to a single 0-D tensor, then .item() pulls the Python bool out.

EXECUTION STATE
📚 rul >= 0 = Element-wise comparison. Returns a bool tensor of the same shape as rul: tensor([True, True]).
📚 .all() = Reduces a bool tensor with logical AND. Returns a scalar tensor: True iff every element is True. Without args, reduces over ALL dims.
📚 .item() = Extract the single scalar value from a 0-D (or 1-element) tensor as a native Python int / float / bool. Errors if the tensor has more than one element.
Output = non-neg? : True
28print("# params :", sum(p.numel() for p in head.parameters()))

Count every trainable scalar in the head. Confirms 4,737 parameters — tiny compared with the shared backbone (typically tens of thousands).

EXECUTION STATE
📚 head.parameters() = nn.Module method: yields every registered Parameter recursively. Used by optimizers (optim.Adam(head.parameters(), ...)).
📚 p.numel() = Tensor method: number of elements = product of shape. For W(64, 32): numel = 2048.
→ per-layer count = Layer 0 (Linear 32→64): 2,112 Layer 1 (Linear 64→32): 2,080 Layer 2 (Linear 32→16): 528 Layer 3 (Linear 16→1) : 17 Total : 4,737
Output = # params : 4737
1import torch
2import torch.nn as nn
3
4class RULHead(nn.Module):
5    """32-D shared features → non-negative scalar RUL prediction."""
6    def __init__(self, in_dim: int = 32):
7        super().__init__()
8        self.net = nn.Sequential(
9            nn.Linear(in_dim, 64), nn.ReLU(inplace=True),
10            nn.Linear(64,    32), nn.ReLU(inplace=True),
11            nn.Linear(32,    16), nn.ReLU(inplace=True),
12            nn.Linear(16,    1),  nn.ReLU(inplace=True),    # final non-neg
13        )
14
15    def forward(self, shared: torch.Tensor) -> torch.Tensor:
16        return self.net(shared).squeeze(-1)              # (B,)
17
18
19# Use it
20torch.manual_seed(0)
21head = RULHead(in_dim=32)
22shared = torch.randn(2, 32)
23rul    = head(shared)
24
25print("shared.shape :", tuple(shared.shape))     # (2, 32)
26print("rul.shape    :", tuple(rul.shape))         # (2,)
27print("non-neg?     :", (rul >= 0).all().item())  # True
28print("# params     :", sum(p.numel() for p in head.parameters()))
29# prints "# params     : 4737"

Two Head Pitfalls

Pitfall 1: Forgetting to squeeze. The Sequential stack outputs (B, 1). MSE against a (B,) target typically does not crash: the two shapes broadcast to (B, B), and the loss silently averages over every prediction-target pair (PyTorch emits only a shape-mismatch warning). Always .squeeze(-1) at the head.
Pitfall 2: Aggressive dropout. Dropping 30% of activations on a 32-D shared feature or a 16-D head bottleneck is too much. Section 12's default is dropout_p = 0 on the regression head once the model is past the warmup phase.
The point. A four-layer MLP with non-negative output. Tiny - just 4.7k parameters. The full DualTaskModel gets a similar-size head for classification next.
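The squeeze pitfall can be reproduced in a few lines of NumPy, which follows the same broadcasting rules. In this sketch the predictions are deliberately perfect, so any non-zero loss comes purely from the shape mismatch (the array names are illustrative):

```python
import numpy as np

pred_unsqueezed = np.array([[1.0], [2.0], [3.0], [4.0]])  # (B, 1), head output before squeeze
target = np.array([1.0, 2.0, 3.0, 4.0])                   # (B,), RUL targets

# Wrong: (B, 1) - (B,) broadcasts to (B, B)
diff_wrong = pred_unsqueezed - target
mse_wrong = float(np.mean(diff_wrong ** 2))
print(diff_wrong.shape)  # (4, 4)
print(mse_wrong)         # 2.5, although every prediction is exact

# Right: squeeze first, then subtract element-wise
diff_right = pred_unsqueezed.squeeze(-1) - target
mse_right = float(np.mean(diff_right ** 2))
print(diff_right.shape)  # (4,)
print(mse_right)         # 0.0
```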

Takeaway

  • 32 → 64 → 32 → 16 → 1. Expand-then-contract.
  • Final ReLU keeps RUL non-negative. Physically required.
  • 4,737 parameters. Tiny relative to the backbone.
  • Always squeeze. (B, 1) → (B,) before MSE.