The Traffic-Light Output
If the RUL head is the “fuel gauge”, the health head is the dashboard's traffic light: green (healthy), amber (degrading), red (critical). Same shared 32-D feature vector goes in; three numbers come out - one raw score per class.
Class boundaries match the legacy AMNL paper:
| Class | Name | RUL range (cycles) | Operator action |
|---|---|---|---|
| 0 | healthy | RUL > 125 | Continue normal operation |
| 1 | degrading | 50 < RUL ≤ 125 | Schedule maintenance window |
| 2 | critical | RUL ≤ 50 | Pull for inspection now |
These thresholds are derived directly from the capped RUL target (Rmax=125) and the empirical danger zone (~50 cycles ≈ one prediction-error standard deviation).
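The label rule in the table can be sketched in a few lines of NumPy. The helper name `rul_to_health_class` and its threshold arguments are illustrative, not taken from the original codebase; only the boundaries (125 and 50 cycles) come from the table. The sketch assumes the rule is applied to the uncapped RUL: with a target capped at 125, the condition RUL > 125 would never fire.

```python
import numpy as np


def rul_to_health_class(rul, degrading_max=125, critical_max=50):
    """Map RUL (cycles) to class ids: 0 healthy, 1 degrading, 2 critical.

    Hypothetical helper; boundaries follow the table above.
    """
    rul = np.asarray(rul)
    # RUL > 125 -> 0, 50 < RUL <= 125 -> 1, RUL <= 50 -> 2
    return np.where(rul > degrading_max, 0, np.where(rul > critical_max, 1, 2))


print(rul_to_health_class([200, 126, 125, 51, 50, 0]).tolist())  # [0, 0, 1, 1, 2, 2]
```

The boundary values themselves (125 and 50) land in the lower-health class, matching the ≤ signs in the table.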
32 → 32 → 16 → 3
| Layer | Shape in | Shape out | Params |
|---|---|---|---|
| Linear → ReLU | (B, 32) | (B, 32) | 1,056 |
| Linear → ReLU | (B, 32) | (B, 16) | 528 |
| Linear (no activation) | (B, 16) | (B, 3) | 51 |
| Total health head | — | — | 1,635 |
Three layers, ReLU between, no activation on the output. The head is roughly a third the size of the RUL head (1,635 vs 4,737) - it is the auxiliary task and we keep it lean.
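The parameter counts in the table are plain arithmetic: each Linear layer contributes `in × out` weights plus `out` biases. A small sketch that reproduces the total (the helper name `mlp_param_count` is illustrative):

```python
def mlp_param_count(dims):
    """Weights plus biases for a stack of Linear layers with the given widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


health_params = mlp_param_count([32, 32, 16, 3])
print(health_params)                    # 1635
print(round(health_params / 4737, 3))   # 0.345, i.e. roughly a third of the RUL head
```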
Logits, Not Probabilities
The head emits raw logits $z = (z_0, z_1, z_2) \in \mathbb{R}^3$, not probabilities. The probability vector is recovered by softmax,

$$p_k = \frac{\exp(z_k)}{\sum_{j=0}^{2} \exp(z_j)},$$

and the predicted class is $\hat{y}_{hs} = \arg\max_k z_k = \arg\max_k p_k$. The argmax of logits and the argmax of probabilities are the same - softmax is monotonic - so for inference you can skip softmax entirely.
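A quick numerical check of the argmax claim, as a sketch on random logits (the data is synthetic and only there to exercise the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 3))                 # random logits, illustrative only

# Stable softmax along the class axis.
e = np.exp(z - z.max(axis=-1, keepdims=True))
p = e / e.sum(axis=-1, keepdims=True)

# The predicted class is identical whether we rank logits or probabilities.
same = np.array_equal(z.argmax(axis=-1), p.argmax(axis=-1))
print(same)
```

Because exp is strictly increasing and every entry in a row is divided by the same positive sum, the ordering within each row cannot change.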
Why no softmax inside the head?
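- Numerical stability. F.cross_entropy computes $\log \sum_j \exp(z_j)$ with a max-subtraction trick. Applying softmax and then taking the log as a separate step lets small probabilities underflow to 0, and log(0) = -inf; that is exactly the failure the fused computation avoids.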
- Efficiency. One log-softmax instead of softmax-then-log.
- Convention. Every PyTorch classification recipe assumes the head returns logits.
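The stability argument can be made concrete with a deliberately extreme logit vector. This is a NumPy sketch with illustrative function names; it contrasts a naive softmax-then-log with the max-subtracted log-sum-exp form, and does not claim to reproduce PyTorch's internal implementation.

```python
import numpy as np


def log_softmax_naive(z):
    """Softmax first, log second: breaks for large-magnitude logits."""
    e = np.exp(z)
    return np.log(e / e.sum(axis=-1, keepdims=True))


def log_softmax_stable(z):
    """Log-sum-exp with the row max subtracted first."""
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


z = np.array([[1000.0, 0.0, -1000.0]])
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = log_softmax_naive(z)     # contains nan / -inf
stable = log_softmax_stable(z)       # finite everywhere

print(naive)
print(stable)
```

The naive version overflows in `exp(1000)` and then divides infinity by infinity; the stable version only ever exponentiates non-positive numbers.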
Interactive: Three Logits → Softmax
Drag the three logit sliders and watch the predicted class flip when the gap between the top two scores closes. The temperature slider controls how sharply softmax separates them - the same trick we use in distillation.
Try this. Set all three logits equal and notice the probabilities become 33.3% each while the entropy peaks at $\ln 3 \approx 1.099$ nats. That is the maximum-uncertainty state. Push $z_2$ up to +3 and watch the “critical” class dominate - this is what an engine three cycles from failure looks like through the head.
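Without the sliders, the same experiment can be run in a few lines. This sketch assumes the temperature control divides the logits by T before softmax (the usual convention; the widget's exact behaviour is not shown in the text), and the helper names are illustrative.

```python
import numpy as np


def softmax_t(z, temperature=1.0):
    """Softmax of z / temperature for a single logit vector."""
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()


def entropy_nats(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


uniform = softmax_t([0.0, 0.0, 0.0])
print(uniform, entropy_nats(uniform))       # 1/3 each, entropy = ln 3

pushed = softmax_t([0.0, 0.0, 3.0])         # "critical" logit raised to +3
softened = softmax_t([0.0, 0.0, 3.0], temperature=4.0)
print(pushed.round(3), softened.round(3))
```

Raising the temperature pulls the distribution back toward uniform; lowering it sharpens the winner.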
Python: The Classification Head
NumPy is the foundational numerical library for Python. It provides ndarray (N-dimensional array) — a fast, memory-efficient array type backed by C — plus matrix multiplication (@), broadcasting, and a huge library of math functions (np.exp, np.maximum, np.random, etc.). We alias it as 'np' by convention so we can write np.exp() instead of numpy.exp().
The classification head: a tiny three-layer MLP that turns the shared 32-D feature vector into three raw class scores ('logits') for {healthy, degrading, critical}. Crucially, this returns logits — NOT probabilities. Softmax is applied later, only when computing loss or formatting output for a dashboard.
Triple-quoted string on the line after `def` becomes accessible as `health_head.__doc__`. This one tells the caller exactly what shape goes in, what shape comes out, and that the output is logits — not probabilities.
Initialize the running activation `h` with the input. We'll overwrite `h` inside the loop, layer by layer. This is a reference assignment, not a copy — but since each layer rebinds h to a NEW array (h = h @ W + b creates a new array), `shared` is never mutated.
Iterate the three layers in order. `zip(weights, biases)` pairs each weight matrix with its bias vector. `enumerate(...)` adds the index `i` so we know whether we're on a hidden layer (apply ReLU) or the output layer (skip ReLU).
The Linear layer: matrix-multiply h by W, then add bias. NumPy's broadcasting handles the bias automatically — `b` is shape (out,) and gets added to every row of the (B, out) result.
Branch on the layer index. `len(weights)` is 3, so `len(weights) - 1` is 2 — the index of the OUTPUT layer. The body runs for i=0 and i=1 (hidden layers), and is skipped for i=2 (output layer). This is how we apply ReLU to hidden layers but leave the output as raw logits.
ReLU activation: element-wise max with 0. Negative entries become 0; non-negative entries pass through unchanged. Standard non-linearity for hidden MLP layers — cheap, gradient-friendly, and pairs well with He initialisation (line 24).
Return the (B, 3) logits exactly as they came out of the final Linear layer. Applying softmax here would be a BUG: F.cross_entropy in PyTorch applies log-softmax internally with a numerically stable trick, so feeding it probabilities squashes its inputs into [0, 1], flattens the loss, and shrinks the gradients.
Standalone softmax for INFERENCE / dashboard display only. We don't use this for training (that goes through cross-entropy directly on logits). Uses the log-sum-exp trick: subtract the row max before exp() to prevent overflow.
z = [[-0.47, 1.12, -0.66],
     [ 0.85, 0.10, 1.40]]
Reminds the reader this softmax subtracts the row max first. Mathematically softmax(z) = softmax(z - c) for any constant c, so choosing c = max(z) keeps every exp() argument ≤ 0, hence every exp() ≤ 1, and never overflows.
Subtract the per-row max from each row. After this, the largest entry in every row is 0; all other entries are ≤ 0. Mathematically softmax is unchanged; numerically it's safe.
z = [[-0.47, 1.12, -0.66],
     [ 0.85, 0.10, 1.40]]
row max = [[1.12], [1.40]]
z - row max = [[-1.59, 0.00, -1.78], [-0.55, -1.30, 0.00]]
Element-wise exponential — turns every entry into e^entry. Because we already subtracted the row max, every entry is ≤ 0, so every exponential is in (0, 1]. No overflow possible.
e = [[0.204, 1.000, 0.169], [0.577, 0.273, 1.000]]
Normalize each row to sum to 1 — the softmax formula. e / sum(e) along the class axis turns the unnormalized exponentials into a proper probability distribution.
row sums = [[1.373], [1.850]]
probs = [[0.149, 0.728, 0.123],   ← row sums to 1.0 ✓
         [0.312, 0.148, 0.540]]   ← row sums to 1.0 ✓
Comment marking the start of the smoke test. Funnel shape: 32 input features → 32 hidden → 16 hidden → 3 output classes. Notice we do NOT expand first like the RUL head (which went 32 → 64 → 32). Classification is the auxiliary task — keep it lean.
Seed NumPy's global random number generator with 0. Every np.random.* call after this is fully deterministic — reproducible across runs, machines, and Python versions. Essential for reproducible smoke tests.
List of (in_dim, out_dim) tuples for the three Linear layers. Each tuple defines the shape of a weight matrix W: rows = in_dim (fan-in), cols = out_dim (fan-out).
He initialisation (He et al., 2015; also called Kaiming initialisation): for each layer, sample weights from N(0, 2/fan_in), i.e. standard deviation sqrt(2/fan_in). This keeps the scale of activations roughly constant across layers when paired with ReLU. Without it, activations tend to grow or shrink with depth.
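The variance claim can be checked empirically. The sketch below uses a deliberately wide layer (2,048 outputs, not the head's real width) so the estimates are tight; all sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 32, 2048                   # wide layer so the estimate is tight
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

x = rng.standard_normal((512, fan_in))       # unit-variance inputs
h = np.maximum(x @ W, 0.0)                   # Linear (zero bias) followed by ReLU

print(W.var(), 2.0 / fan_in)                 # empirical vs target weight variance
print((x ** 2).mean(), (h ** 2).mean())      # second moment before and after the layer
```

ReLU discards roughly half of the pre-activation energy, and the factor 2 in the weight variance compensates for it, so the mean square of the activations stays close to that of the input.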
Initialize all biases to zero. With He init on weights, zero biases let the layer start neutral — no preference for any output unit. Adam will quickly move them away from zero during training.
Fake shared features for two engines (B=2). In production this comes out of the shared backbone (TCN/LSTM, §11.1); here we just sample from N(0, 1) to exercise the head.
Forward pass through the head. Internally: h = shared → @W0+b0 → ReLU → @W1+b1 → ReLU → @W2+b2 → return. Result is (2, 3) raw logits.
Example (one realisation): [[-0.47, 1.12, -0.66], [ 0.85, 0.10, 1.40]]
Convert logits → probabilities for inspection. Each row is a probability distribution over {healthy, degrading, critical}, summing to 1.0 up to floating-point rounding. We DON'T use these for training; that goes through cross-entropy directly on logits.
[[0.149, 0.728, 0.123],   ← engine 0: most likely degrading
 [0.312, 0.148, 0.540]]   ← engine 1: most likely critical
Confirm shape is (B, K) = (2, 3) — three logits per engine.
Print engine 0's three raw logits, rounded for readability. Logits live in (-∞, +∞): they are unnormalized log-probabilities (scores defined up to an additive constant), not probabilities.
Print engine 0's class probabilities. After softmax these are non-negative and sum to 1 — the natural format for a dashboard.
Reduce probabilities to the single predicted class per engine. argmax along axis=-1 returns the INDEX of the largest entry in each row.
Continuation of a multi-line print. The closing args come on line 36.
Generator-expression sum over all weight + bias elements. Tells you exactly how big the head is.
```python
import numpy as np


def health_head(shared: np.ndarray, weights: list, biases: list) -> np.ndarray:
    """shared: (B, 32). Returns (B, 3) - raw logits for {healthy, degrading, critical}."""
    h = shared
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0)  # ReLU on hidden layers only
    return h  # NO softmax - return raw logits


def softmax(z: np.ndarray) -> np.ndarray:
    """Stable softmax (subtract row-wise max first)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# 32 → 32 → 16 → 3
np.random.seed(0)
shapes = [(32, 32), (32, 16), (16, 3)]
weights = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]
biases = [np.zeros(s[1], dtype=np.float32) for s in shapes]

shared = np.random.randn(2, 32).astype(np.float32)
logits = health_head(shared, weights, biases)
probs = softmax(logits)

print("logits.shape :", logits.shape)  # (2, 3)
print("logits[0]    :", logits[0].round(2).tolist())
print("probs[0]     :", probs[0].round(3).tolist())
print("argmax       :", probs.argmax(-1).tolist())  # predicted class per engine
print("# params     :",
      sum(W.size + b.size for W, b in zip(weights, biases)))  # 1,635
```

PyTorch: nn.Sequential
Imports PyTorch's core tensor library. Provides torch.Tensor (the GPU/CPU autograd-tracked array type), torch.randn/zeros/tensor (constructors), torch.optim (optimizers), and torch.manual_seed (RNG control). Everything else (nn, F) is a sub-module.
Imports the neural-network sub-package and aliases it as `nn`. Provides nn.Module (base class for all networks), nn.Linear (fully-connected layer), nn.Sequential (ordered container), nn.ReLU, nn.LayerNorm, nn.Dropout, etc.
STATELESS functional ops — no learnable parameters, no module registration. We use F.cross_entropy (training loss) and F.softmax (inference / dashboard view). Convention: alias as `F` so it reads like F.softmax(...).
Define the classification head as a subclass of nn.Module. Subclassing gives us .parameters() (for the optimizer), .to(device), .train()/.eval() mode toggles, and automatic parameter registration when we assign sub-modules to self.
Opens the class-level docstring. Triple-quoted string on the line after `class` becomes HealthHead.__doc__.
Docstring line — short description of the class.
Documents the input shape contract: batch of 32-D shared features.
Documents the output shape and class ordering. Class 0 = healthy, class 1 = degrading, class 2 = critical.
Reminds the reader why the output layer has no activation: the loss function takes care of softmax internally for numerical stability.
Continuation of the previous docstring line.
Closes the class-level docstring.
Constructor. Defaults match the book-wide settings (32-D shared vector, 3 classes), so most callers can write `HealthHead()` and get the right thing. Both args are typed `int` for IDE hints.
MANDATORY first line of any nn.Module subclass __init__. Calls nn.Module's constructor, which sets up the internal dicts that track parameters, buffers, and child modules. Forget this and PyTorch raises a confusing AttributeError later.
Compose the three Linear layers and two ReLUs as an ordered pipeline. nn.Sequential stores its arguments in order and calls them one after the other in forward() — perfect for a straight-line MLP.
```python
# Layer-by-layer flow in net:
#   shared (B, 32)
#     → Linear(32, 32) + ReLU → (B, 32)
#     → Linear(32, 16) + ReLU → (B, 16)
#     → Linear(16, 3)         → (B, 3)   ← logits, no activation
#
# Param count:
#   32×32 + 32 = 1,056
#   32×16 + 16 =   528
#   16× 3 +  3 =    51
#   total      = 1,635
```
First hidden layer (32 → 32) followed by a ReLU. The Linear holds W(32, 32) and b(32) as learnable parameters; the ReLU is parameter-free.
Second hidden layer (32 → 16) — the bottleneck. Forces the head to compress the 32-D representation into 16-D before deciding the class.
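Output layer. NO activation follows. The output is RAW LOGITS in (-∞, +∞), exactly what F.cross_entropy expects. Adding a Softmax here is a classic bug: the loss then applies log-softmax to values already squashed into [0, 1], which flattens the loss and weakens the gradients without raising any error. Adding a LogSoftmax is redundant rather than harmful (log-softmax applied twice gives the same result), but that layer belongs with F.nll_loss, not F.cross_entropy.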
Closing paren of nn.Sequential.
The forward pass. PyTorch calls this when you do `head(shared)` — the nn.Module __call__ implementation invokes forward() and also runs hooks (gradient hooks, profiling, etc.). NEVER call forward() directly; always use the module instance.
Single Sequential pass. Chains Linear → ReLU → Linear → ReLU → Linear and returns the (B, 3) logits. No squeeze, no reshape — the head's output is already the desired shape.
Comment marking the start of the runnable smoke test.
Seed PyTorch's global RNG with 0 (torch.manual_seed sets the seed on all devices, CPU and CUDA). Makes torch.randn, weight init inside nn.Linear, dropout, etc. repeatable for this run.
Instantiate with default args: in_dim=32, num_classes=3. Constructor runs __init__, which builds nn.Sequential, which constructs three nn.Linear modules. By default each one initializes its W with kaiming_uniform_ (a=√5) and draws its b from U(-1/√fan_in, 1/√fan_in); the biases are NOT zero-initialised, unlike the NumPy version above.
Fake shared features for a batch of 4 engines. requires_grad=True so we can call .backward() and inspect the gradient flowing back into the shared backbone — useful for the GABA chapter.
Forward pass. PyTorch's nn.Module overrides __call__ to invoke forward() plus run hooks. NEVER call head.forward(shared) directly — you'd skip the hooks.
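Ground-truth class INDICES, not one-hot vectors. F.cross_entropy expects integer class IDs of shape (B,) where each value is in [0, num_classes). A one-hot integer tensor of shape (B, 3) raises an error; a floating-point tensor of shape (B, 3) is accepted by recent PyTorch versions but interpreted as class probabilities (soft labels), which is a different code path.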
Compute the multi-class cross-entropy loss in ONE numerically-stable call. Internally: log_softmax over the class axis, then negative-log-likelihood against the integer labels, then mean over the batch.
```python
# What F.cross_entropy actually computes:
#   logp = log_softmax(logits, dim=-1)      # (B, 3)
#   loss = -logp[range(B), labels].mean()   # scalar
```
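The two-line recipe above can be mirrored in NumPy to check the arithmetic without PyTorch. This is a sketch, not PyTorch's implementation: the function name is illustrative, the logits reuse the example values from the softmax walkthrough, and the labels are made up for the demonstration.

```python
import numpy as np


def cross_entropy_from_logits(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    return float(-logp[np.arange(len(labels)), labels].mean())


logits = np.array([[-0.47, 1.12, -0.66],
                   [ 0.85, 0.10,  1.40]])
labels = np.array([1, 2])                    # class indices, not one-hot (made-up labels)

loss = cross_entropy_from_logits(logits, labels)
print(round(loss, 3))

# Sanity check: an uninformed head (all logits equal) scores exactly ln 3.
baseline = cross_entropy_from_logits(np.zeros((4, 3)), np.array([0, 1, 2, 1]))
print(round(baseline, 4), round(float(np.log(3)), 4))
```

The ln 3 ≈ 1.0986 baseline is a useful reference during training: a three-class loss that stays near this value means the head has not learned anything yet.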
Walk the autograd graph backward from the scalar loss, accumulating ∂loss/∂param into every Parameter's .grad attribute (and into shared.grad because we set requires_grad=True there). One full reverse-mode autodiff pass.
Construct an Adam optimizer holding references to all of the head's learnable parameters. Adam tracks per-parameter first and second moment estimates of the gradient — adaptive step sizes.
Apply one Adam update, then clear the .grad attributes for the next iteration. Two statements on one line for compactness — semicolon = statement separator in Python.
Convert logits → probabilities for inspection only. .detach() strips autograd tracking (we don't need gradients through the dashboard view). dim=-1 normalises along the class axis so each engine's row sums to 1.
Reduce probabilities to a single predicted class per engine. argmax over the class axis returns the integer index of the largest probability.
Confirm the input shape. tuple(...) prints (4, 32) instead of torch.Size([4, 32]).
Confirm the output shape — (B, num_classes).
Print the scalar loss as a Python float, rounded for readability. .item() unwraps the 0-D tensor into a plain float (otherwise print would show 'tensor(1.1732)').
Print predicted class indices for all 4 engines. .tolist() converts the tensor to a plain Python list for clean printing.
Count every learnable scalar in the head. Generator expression sums .numel() across all Parameters returned by head.parameters().
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HealthHead(nn.Module):
    """
    Health classification head.
      Input : shared features (B, 32)
      Output: raw logits      (B, 3)   ← {healthy, degrading, critical}

    No softmax inside the head - F.cross_entropy applies log-softmax
    internally for numerical stability.
    """

    def __init__(self, in_dim: int = 32, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 16), nn.ReLU(inplace=True),
            nn.Linear(16, num_classes),  # ← no activation
        )

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        return self.net(shared)  # (B, 3)


# ---------- smoke test ----------
torch.manual_seed(0)
head = HealthHead()
shared = torch.randn(4, 32, requires_grad=True)  # fake shared features

logits = head(shared)                  # (4, 3)
labels = torch.tensor([0, 1, 2, 1])    # ground truth class
loss = F.cross_entropy(logits, labels)

loss.backward()
optim = torch.optim.Adam(head.parameters(), lr=1e-3)
optim.step(); optim.zero_grad()

probs = F.softmax(logits.detach(), dim=-1)
preds = probs.argmax(dim=-1)

print("shared.shape :", tuple(shared.shape))  # (4, 32)
print("logits.shape :", tuple(logits.shape))  # (4, 3)
print("loss         :", round(loss.item(), 4))
print("preds        :", preds.tolist())       # predicted class per engine
print("# params     :", sum(p.numel() for p in head.parameters()))  # 1,635
```

Same Head, Different Domains
The 32-D shared features are domain-agnostic; only the class boundaries change. Drop in a different label rule and the same head trains:
| Domain | Three classes | Boundary rule |
|---|---|---|
| Bearings (PRONOSTIA) | healthy / inner-race fault / outer-race fault | Vibration spectrum bands |
| Lithium-ion cells | healthy / mid-life / end-of-life | SoH > 90% / 80–90% / ≤ 80% |
| Wind-turbine gearbox | OK / wear / failure imminent | SCADA temp residual + age |
| Hard disk SMART | OK / pre-fail / fail-soon | SMART 5/197/198 thresholds |
| HVAC chillers | nominal / fouled / leaking | ΔT and refrigerant pressure |
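Swapping domains only changes the function that produces the integer labels. As one example, the lithium-ion row can be sketched as follows; the helper name is hypothetical, and the handling of the exact boundaries (90% counted as mid-life, 80% as end-of-life) is an assumption, since the table gives only the ranges.

```python
import numpy as np


def soh_to_class(soh_percent):
    """Hypothetical label rule: 0 healthy (>90), 1 mid-life (80, 90], 2 end-of-life (<=80)."""
    soh = np.asarray(soh_percent, dtype=float)
    # With right=True, np.digitize returns i such that bins[i-1] < x <= bins[i]:
    #   0 for x <= 80, 1 for 80 < x <= 90, 2 for x > 90.
    band = np.digitize(soh, bins=[80.0, 90.0], right=True)
    return 2 - band                      # flip so that class 0 is the healthy band


print(soh_to_class([95, 90, 85, 80, 70]).tolist())  # [0, 1, 1, 2, 2]
```

The head itself is untouched: it still sees a (B, 32) feature batch and is trained against these class indices.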
Three Classification Pitfalls
The point. A 1,635-parameter MLP that turns the shared 32-D vector into three raw logits. No softmax inside. Same parameter budget across every domain in the table above - only the labels change.
Takeaway
- 32 → 32 → 16 → 3. Three layers, lean by design - the auxiliary task should not crowd the backbone.
- Raw logits out, no softmax inside. F.cross_entropy handles the numerically-stable log-softmax for you.
- Integer labels, not one-hot. Class indices in {0,1,2}.
- 1,635 parameters. About a third of the RUL head, < 0.05% of the full DualTaskModel.