Chapter 11

Health Classification Head (3 Classes)

Dual-Task Heads & Model Assembly

The Traffic-Light Output

If the RUL head is the “fuel gauge”, the health head is the dashboard's traffic light: green (healthy), amber (degrading), red (critical). Same shared 32-D feature vector goes in; three numbers come out - one raw score per class.

Why classify if you already regress? Discrete states are easier to learn early and give the shared backbone a complementary signal. They also map directly onto the way operators actually act: an engine in the red band gets pulled for inspection regardless of whether RUL is 12 or 18.

Class boundaries match the legacy AMNL paper:

| Class | Name | RUL range (cycles) | Operator action |
| --- | --- | --- | --- |
| 0 | healthy | RUL > 125 | Continue normal operation |
| 1 | degrading | 50 < RUL ≤ 125 | Schedule maintenance window |
| 2 | critical | RUL ≤ 50 | Pull for inspection now |

These thresholds are derived directly from the capped RUL target ($R_{\max} = 125$) and the empirical danger zone (~50 cycles ≈ one prediction-error standard deviation).
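The class boundaries can be written as a small labelling function. This is a minimal sketch, not the book's own code: the name `rul_to_health_class` is hypothetical, and it assumes the thresholds are applied to the raw (uncapped) RUL value.

```python
def rul_to_health_class(rul: float) -> int:
    """Map a remaining-useful-life value (cycles) to a health class index.

    0 = healthy (RUL > 125), 1 = degrading (50 < RUL <= 125), 2 = critical (RUL <= 50).
    Assumption: `rul` is the raw, uncapped value.
    """
    if rul > 125:
        return 0
    if rul > 50:
        return 1
    return 2


CLASS_NAMES = ("healthy", "degrading", "critical")

for rul in (200, 125, 51, 50, 12):
    print(rul, CLASS_NAMES[rul_to_health_class(rul)])
```

Note that both boundary values (125 and 50) fall into the more severe class, matching the `≤` signs in the table.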

32 → 32 → 16 → 3

| Layer | Shape in | Shape out | Params |
| --- | --- | --- | --- |
| Linear → ReLU | (B, 32) | (B, 32) | 1,056 |
| Linear → ReLU | (B, 32) | (B, 16) | 528 |
| Linear (no activation) | (B, 16) | (B, 3) | 51 |
| Total health head | | | 1,635 |

Three layers, ReLU between, no activation on the output. The head is roughly a third the size of the RUL head (1,635 vs 4,737) - it is the auxiliary task and we keep it lean.
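The parameter counts in the table follow from `in × out + out` per Linear layer. A short sketch that reproduces them; the helper name `mlp_param_count` is made up for illustration:

```python
def mlp_param_count(widths):
    """Count weights + biases for a stack of Linear layers with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


health_widths = [32, 32, 16, 3]
per_layer = [n_in * n_out + n_out
             for n_in, n_out in zip(health_widths, health_widths[1:])]

print(per_layer)                       # [1056, 528, 51]
print(mlp_param_count(health_widths))  # 1635
```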

No expand-first. The RUL head used 32 → 64 → 32 to re-mix features. The health head goes straight 32 → 32 → 16 - classification across three classes does not need that extra capacity, and a smaller head reduces gradient pressure on the shared backbone (which Chapter 12 will show is the whole point of GABA / GRACE).

Logits, Not Probabilities

The head emits raw logits $z = (z_0, z_1, z_2) \in \mathbb{R}^3$, not probabilities. The probability vector is recovered by softmax,

$$p_k = \dfrac{\exp(z_k)}{\sum_{j=0}^{2} \exp(z_j)},$$

and the predicted class is $\hat{y}_{\text{hs}} = \arg\max_k z_k = \arg\max_k p_k$. The argmax of logits and the argmax of probabilities are the same - softmax is monotonic - so for inference you can skip softmax entirely.

Why no softmax inside the head?

  • Numerical stability. `F.cross_entropy` evaluates $\log\sum_j \exp(z_j)$ with a max-subtraction trick. Applying softmax and then taking the log re-introduces the overflow and underflow that trick avoids.
  • Efficiency. One log-softmax instead of softmax-then-log.
  • Convention. Every PyTorch classification recipe assumes the head returns logits.
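The stability argument can be seen with plain Python floats. The sketch below is illustrative only (both function names are made up, and it does not show PyTorch's actual kernel): it compares a literal softmax-then-log with the max-subtraction form on deliberately extreme logits.

```python
import math


def naive_log_softmax(z):
    """log(softmax(z)) computed literally: exponentiate, normalise, then log."""
    exps = [math.exp(v) for v in z]      # math.exp raises OverflowError for large v
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(z):
    """z - max(z) - log(sum(exp(z - max(z)))): the max-subtraction form."""
    m = max(z)
    log_sum = math.log(sum(math.exp(v - m) for v in z))
    return [v - m - log_sum for v in z]


logits = [1000.0, 0.0, -1000.0]

try:
    naive_log_softmax(logits)
    naive_ok = True
except OverflowError:
    naive_ok = False

print("naive succeeded:", naive_ok)                # naive succeeded: False
print("stable:", stable_log_softmax(logits))       # stable: [0.0, -1000.0, -2000.0]
```

For moderate logits the two functions agree; they only diverge once `exp` leaves the representable range.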

Interactive: Three Logits → Softmax

Drag the three logit sliders and watch the predicted class flip when the gap between the top two scores closes. The temperature slider controls how sharply softmax separates them - the same trick we use in distillation.

Try this. Set all three logits equal - notice the probabilities become 33.3% each and the entropy peaks at $\ln 3 \approx 1.099$ nats. That is the maximum-uncertainty state. Push $z_2$ up to +3 and watch the “critical” class dominate - this is what an engine three cycles from failure looks like through the head.
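The same experiment can be run offline. The sketch below assumes the temperature slider divides the logits by `temperature` before softmax (the usual convention, but an assumption about the widget), and uses hypothetical helper names.

```python
import math


def softmax(z, temperature=1.0):
    """Softmax of z / temperature, with max-subtraction for stability."""
    scaled = [v / temperature for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; terms with p = 0 contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


uniform = softmax([0.0, 0.0, 0.0])                    # all logits equal
critical = softmax([0.0, 0.0, 3.0])                   # z_2 pushed to +3
softened = softmax([0.0, 0.0, 3.0], temperature=4.0)  # same logits, higher temperature

for name, p in (("uniform", uniform), ("critical", critical), ("softened", softened)):
    print(name, [round(q, 3) for q in p], "entropy =", round(entropy(p), 3))
```

Raising the temperature pulls the distribution back towards uniform, so the entropy of `softened` sits between the other two cases.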

Python: The Classification Head

Three-layer MLP returning raw logits
🐍health_head_numpy.py
1  import numpy as np

NumPy is the foundational numerical library for Python. It provides ndarray (N-dimensional array) — a fast, memory-efficient array type backed by C — plus matrix multiplication (@), broadcasting, and a huge library of math functions (np.exp, np.maximum, np.random, etc.). We alias it as 'np' by convention so we can write np.exp() instead of numpy.exp().

EXECUTION STATE
numpy = Numerical computing library — ndarray, linear algebra, random numbers, elementwise math. All operations run in optimized C, not Python loops.
as np = Standard alias used in 99% of NumPy code. Saves 3 characters per call site and is the universal convention.
4  def health_head(shared, weights, biases) → np.ndarray

The classification head: a tiny three-layer MLP that turns the shared 32-D feature vector into three raw class scores ('logits') for {healthy, degrading, critical}. Crucially, this returns logits — NOT probabilities. Softmax is applied later, only when computing loss or formatting output for a dashboard.

EXECUTION STATE
⬇ input: shared (B, 32) = The shared 32-D feature vector from §11.1. B is the batch size — for a single engine B=1, for our smoke test B=2. This is the SAME tensor the RUL head reads.
→ shared example = shared[0] = [-0.13, 0.42, -1.05, ..., 0.27] (32 numbers) shared[1] = [ 0.88, -0.31, 0.50, ..., -0.62]
⬇ input: weights = Python list of 3 weight matrices: W0 shape (32, 32), W1 shape (32, 16), W2 shape (16, 3). These are the LEARNABLE parameters of the funnel 32 → 32 → 16 → 3.
⬇ input: biases = Python list of 3 bias vectors: b0 shape (32,), b1 shape (16,), b2 shape (3,). Added after each Linear layer to shift activations.
→ np.ndarray type hint = Tells static type-checkers the return is a NumPy array. Doesn't enforce anything at runtime — it's documentation for humans + IDEs.
⬆ returns = (B, 3) np.ndarray — raw logits in (-∞, +∞). NOT probabilities. To get probs, run softmax. To get the predicted class, run argmax.
5  Docstring: shape contract

Triple-quoted string on the line after `def` becomes accessible as `health_head.__doc__`. This one tells the caller exactly what shape goes in, what shape comes out, and that the output is logits — not probabilities.

6  h = shared

Initialize the running activation `h` with the input. We'll overwrite `h` inside the loop, layer by layer. This is a reference assignment, not a copy — but since each layer rebinds h to a NEW array (h = h @ W + b creates a new array), `shared` is never mutated.

EXECUTION STATE
h (initial) = (B, 32) — same as input `shared`. Will be overwritten three times: → (B, 32) → (B, 16) → (B, 3).
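The difference between rebinding and mutating is easy to check directly. A small sketch with made-up toy shapes, contrasting the pattern used in `health_head` with an in-place update that would modify the caller's array:

```python
import numpy as np

shared = np.ones((2, 3), dtype=np.float32)
b = np.float32(1.0)

# Pattern used in health_head: the right-hand side builds a new array.
h = shared
h = h + b
rebind_shares_memory = h is shared          # False
value_after_rebind = float(shared[0, 0])    # 1.0, input untouched

# Contrast: an in-place update writes into the caller's array.
h = shared
h += b
inplace_shares_memory = h is shared         # True
value_after_inplace = float(shared[0, 0])   # 2.0, input modified

print(rebind_shares_memory, value_after_rebind)
print(inplace_shares_memory, value_after_inplace)
```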
7  for i, (W, b) in enumerate(zip(weights, biases)):

Iterate the three layers in order. `zip(weights, biases)` pairs each weight matrix with its bias vector. `enumerate(...)` adds the index `i` so we know whether we're on a hidden layer (apply ReLU) or the output layer (skip ReLU).

EXECUTION STATE
📚 zip(a, b) = Built-in: pairs items from two iterables. zip([W0,W1,W2], [b0,b1,b2]) → [(W0,b0), (W1,b1), (W2,b2)]. Stops at the shortest iterable.
📚 enumerate(it) = Built-in: yields (index, item) tuples. enumerate([(W0,b0),(W1,b1),(W2,b2)]) → (0,(W0,b0)), (1,(W1,b1)), (2,(W2,b2)).
→ why enumerate? = We need `i` to know which layer we're on. If i < 2 (i.e. layer 0 or 1) → apply ReLU. If i == 2 (output layer) → skip ReLU.
LOOP TRACE · 3 iterations
i=0 — first hidden
W shape = (32, 32)
b shape = (32,)
h before = (B, 32)
h after = (B, 32) — ReLU applied (i < 2)
i=1 — second hidden
W shape = (32, 16)
b shape = (16,)
h before = (B, 32)
h after = (B, 16) — ReLU applied (i < 2)
i=2 — output (logits)
W shape = (16, 3)
b shape = (3,)
h before = (B, 16)
h after = (B, 3) — NO ReLU (i == len(weights)-1)
8  h = h @ W + b

The Linear layer: matrix-multiply h by W, then add bias. NumPy's broadcasting handles the bias automatically — `b` is shape (out,) and gets added to every row of the (B, out) result.

EXECUTION STATE
📚 @ operator = Python's matrix multiplication operator (PEP 465). Equivalent to np.matmul(h, W). For 2D arrays, standard matrix multiply: (B, in) @ (in, out) → (B, out).
→ @ example (i=0) = h(B, 32) @ W(32, 32) → (B, 32)
→ @ example (i=1) = h(B, 32) @ W(32, 16) → (B, 16)
→ @ example (i=2) = h(B, 16) @ W(16, 3) → (B, 3)
+ b (broadcasting) = b is shape (out,). NumPy broadcasts it across the batch dimension: (B, out) + (out,) → (B, out). Each row gets the same bias added.
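A quick shape check of the matmul-plus-bias step, using constant toy values (chosen only so the result is easy to read) rather than trained weights:

```python
import numpy as np

B, n_in, n_out = 2, 32, 16
h = np.ones((B, n_in), dtype=np.float32)
W = np.full((n_in, n_out), 0.5, dtype=np.float32)
b = np.arange(n_out, dtype=np.float32)   # a distinct bias per output unit

out = h @ W + b                          # (B, n_in) @ (n_in, n_out) + (n_out,)

print(out.shape)                         # (2, 16)
print(np.array_equal(out[0], out[1]))    # True: identical inputs, same bias on every row
print(out[0, :3])                        # 16 + bias index for each unit
```

Each output entry is `32 × 0.5 = 16` from the matmul plus the bias for that column, which confirms the bias is applied per output unit and repeated across the batch.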
9  if i < len(weights) - 1:

Branch on the layer index. `len(weights)` is 3, so `len(weights) - 1` is 2 — the index of the OUTPUT layer. The body runs for i=0 and i=1 (hidden layers), and is skipped for i=2 (output layer). This is how we apply ReLU to hidden layers but leave the output as raw logits.

EXECUTION STATE
len(weights) = 3 — there are 3 weight matrices in the funnel.
len(weights) - 1 = 2 — the index of the LAST (output) layer.
→ branch table = i=0: 0 < 2 → True → apply ReLU i=1: 1 < 2 → True → apply ReLU i=2: 2 < 2 → False → SKIP ReLU (output stays as logits)
10  h = np.maximum(h, 0) # ReLU

ReLU activation: element-wise max with 0. Negative entries become 0; non-negative entries pass through unchanged. Standard non-linearity for hidden MLP layers — cheap, gradient-friendly, and pairs well with He initialisation (line 24).

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise max of two arrays (or array + scalar). np.maximum([-1, 0.5, -2, 3], 0) → [0, 0.5, 0, 3]. Different from np.max which REDUCES along an axis.
⬇ arg 1: h = Pre-activation tensor (B, out_dim) — output of `h @ W + b`. Can contain any real number.
⬇ arg 2: 0 = Scalar threshold. NumPy broadcasts the scalar to every element of h, so each entry is compared against 0 individually.
→ ReLU(x) example = ReLU([-1.2, 0.0, 0.7, -3.1, 2.5]) → [0.0, 0.0, 0.7, 0.0, 2.5]
→ why ReLU? = Non-saturating gradient (1 for x>0, 0 for x<0), no exponentials, kills negatives. Pairs perfectly with He init: variance stays ≈ 1 across layers.
11  return h # raw logits, NO softmax

Return the (B, 3) logits exactly as they came out of the final Linear layer. Applying softmax here would be a bug: F.cross_entropy in PyTorch applies log-softmax internally with a numerically stable trick, so a softmax in the head would be normalised a second time. The loss would then see values squashed into [0, 1], which weakens the gradients and keeps the loss from approaching zero.

EXECUTION STATE
⬆ return: h (B, 3) = Raw logits. Each row is three real numbers, e.g. [-0.47, 1.12, -0.66]. Negative, positive, anything goes.
→ why raw, not softmax? = 1. Numerical stability — log_softmax uses a max-subtraction trick. 2. Efficiency — one pass instead of softmax-then-log. 3. Convention — every PyTorch loss expects logits.
14  def softmax(z) → np.ndarray

Standalone softmax for INFERENCE / dashboard display only. We don't use this for training (that goes through cross-entropy directly on logits). Uses the log-sum-exp trick: subtract the row max before exp() to prevent overflow.

EXECUTION STATE
⬇ input: z (B, K) = Logits — any real numbers. K is the number of classes (here K=3).
→ z example (B=2, K=3) =
z = [[-0.47,  1.12, -0.66],
     [ 0.85,  0.10,  1.40]]
⬆ returns = (B, K) probabilities — each row is non-negative and sums to exactly 1.0.
15  Docstring: stable softmax

Reminds the reader this softmax subtracts the row max first. Mathematically softmax(z) = softmax(z - c) for any constant c, so choosing c = max(z) keeps every exp() argument ≤ 0 — exp() ≤ 1 — and never overflows.

16  z = z - z.max(axis=-1, keepdims=True)

Subtract the per-row max from each row. After this, the largest entry in every row is 0; all other entries are ≤ 0. Mathematically softmax is unchanged; numerically it's safe.

EXECUTION STATE
📚 z.max(axis, keepdims) = NumPy ndarray method: returns the maximum value along a given axis. Without keepdims it COLLAPSES that axis; with keepdims=True it keeps it as size 1.
⬇ arg: axis=-1 = Reduce along the LAST axis (the class axis). For shape (B, 3), this finds the max within each row of 3 logits independently. axis=0 would reduce across the batch — wrong here.
⬇ arg: keepdims=True = Keep the reduced axis as size 1. Without it, max returns shape (B,); with it, shape (B, 1). The (B, 1) shape is needed so broadcasting `z(B, 3) - max(B, 1) → (B, 3)` works correctly.
→ keepdims example = z.max(axis=-1, keepdims=False) → shape (B,) — broadcasts incorrectly z.max(axis=-1, keepdims=True) → shape (B, 1) — broadcasts cleanly across columns
→ before subtraction =
z = [[-0.47,  1.12, -0.66],
     [ 0.85,  0.10,  1.40]]
→ row maxes (B, 1) =
[[1.12],
 [1.40]]
→ after subtraction (z) =
[[-1.59,  0.00, -1.78],
 [-0.55, -1.30,  0.00]]
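The effect of `keepdims` can be checked on the example logits above. A minimal sketch; only the variable names are new:

```python
import numpy as np

z = np.array([[-0.47, 1.12, -0.66],
              [ 0.85, 0.10,  1.40]])

row_max_keep = z.max(axis=-1, keepdims=True)   # shape (2, 1)
row_max_flat = z.max(axis=-1)                  # shape (2,)

shifted = z - row_max_keep                     # (2, 3) - (2, 1) -> (2, 3)
print(row_max_keep.shape, row_max_flat.shape)
print(shifted.round(2))

try:
    z - row_max_flat                           # (2, 3) - (2,) cannot broadcast
    flat_broadcasts = True
except ValueError:
    flat_broadcasts = False
print("flat version broadcasts:", flat_broadcasts)
```

With this shape the flat version fails loudly. When the batch size happens to equal the number of classes, it would instead broadcast along the wrong axis without an error, which is the more dangerous case.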
17  e = np.exp(z)

Element-wise exponential — turns every entry into e^entry. Because we already subtracted the row max, every entry is ≤ 0, so every exponential is in (0, 1]. No overflow possible.

EXECUTION STATE
📚 np.exp(x) = Element-wise exponential: returns e^x for each element. e ≈ 2.71828. exp(0)=1, exp(-1)≈0.368, exp(-5)≈0.0067, exp(1)≈2.718, exp(10)≈22026.
⬇ arg: z = The shifted logits — every element ≤ 0 after the row-max subtraction.
→ e (B, 3) =
[[0.204, 1.000, 0.169],
 [0.577, 0.273, 1.000]]
→ safety = Without the max-shift, exp(1000) would overflow to +∞. With the shift, max is exp(0)=1, smallest is exp(very_negative) ≈ 0.
18  return e / e.sum(axis=-1, keepdims=True)

Normalize each row to sum to 1 — the softmax formula. e / sum(e) along the class axis turns the unnormalized exponentials into a proper probability distribution.

EXECUTION STATE
📚 e.sum(axis, keepdims) = ndarray method: sums along a given axis. axis=-1 sums each row's columns; keepdims=True keeps shape (B, 1) for broadcasting.
⬇ arg: axis=-1 = Sum along the LAST axis (the class axis) — i.e. sum the 3 exponentials in each row.
⬇ arg: keepdims=True = Output shape (B, 1) so e(B, 3) / sum(B, 1) → (B, 3). Without keepdims, dividing (B, 3) by (B,) broadcasts incorrectly.
→ row sums (B, 1) =
[[1.373],
 [1.850]]
⬆ return: probs (B, 3) =
[[0.149, 0.729, 0.123],   ← row sums to 1.0 (up to rounding) ✓
 [0.312, 0.148, 0.540]]   ← row sums to 1.0 ✓
21  # 32 → 32 → 16 → 3

Comment marking the start of the smoke test. Funnel shape: 32 input features → 32 hidden → 16 hidden → 3 output classes. Notice we do NOT expand first like the RUL head (which went 32 → 64 → 32). Classification is the auxiliary task — keep it lean.

22  np.random.seed(0)

Seed NumPy's global random number generator with 0. Every np.random.* call after this is fully deterministic — reproducible across runs, machines, and Python versions. Essential for reproducible smoke tests.

EXECUTION STATE
📚 np.random.seed(s) = Sets the seed of the global RandomState. Subsequent np.random.randn, np.random.rand, etc. produce the SAME sequence every time you call seed(s) with the same s.
⬇ arg: 0 = The seed value. Any integer works; 0 and 42 are conventional. Using a fixed seed makes this script's output identical on every run.
23  shapes = [(32, 32), (32, 16), (16, 3)]

List of (in_dim, out_dim) tuples for the three Linear layers. Each tuple defines the shape of a weight matrix W: rows = in_dim (fan-in), cols = out_dim (fan-out).

EXECUTION STATE
shapes[0] = (32, 32) — first hidden layer: 32 in, 32 out
shapes[1] = (32, 16) — second hidden layer: 32 in, 16 out (compression)
shapes[2] = (16, 3) — output layer: 16 in, 3 classes out
24  weights = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]

He initialisation (He et al., 2015): for each layer, sample weights from N(0, 2/fan_in). This keeps the variance of activations roughly constant across layers when paired with ReLU; without it, activations tend to explode or vanish with depth.

EXECUTION STATE
📚 np.random.randn(*shape) = Samples from the standard normal distribution N(0, 1). randn(32, 32) returns a (32, 32) array of values drawn from N(0, 1).
→ *s unpacking = *s unpacks the tuple into positional args. *(32, 32) → randn(32, 32). Without the star, randn((32, 32)) raises a TypeError, because randn expects each dimension as a separate integer argument.
📚 .astype(np.float32) = Converts dtype from float64 (NumPy default) to float32. Saves half the memory and matches PyTorch's default tensor dtype — essential for moving to GPU later.
📚 np.sqrt(2 / s[0]) = Scalar — the He scaling factor. s[0] is fan_in. For shape (32, 32): sqrt(2/32) ≈ 0.25. For (16, 3): sqrt(2/16) ≈ 0.354.
→ why 2/fan_in? = ReLU zeroes roughly half the activations (the negatives). To keep the signal scale roughly constant from layer to layer, you need Var(W) = 2/fan_in. Glorot/Xavier scaling, 2/(fan_in + fan_out), is designed for tanh/sigmoid.
→ weights[0] = (32, 32) — 1024 numbers, scaled by 0.25
→ weights[1] = (32, 16) — 512 numbers, scaled by 0.25
→ weights[2] = (16, 3) — 48 numbers, scaled by 0.354
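The factor 2 in He scaling can be checked empirically. The sketch below is an illustration, not part of the head: it uses `np.random.default_rng` rather than the global seed, a made-up helper name, and measures the mean square of the post-ReLU activations for unit-variance inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, n_samples = 32, 32, 10_000

x = rng.standard_normal((n_samples, fan_in))   # unit-variance inputs


def mean_square_after_relu(weight_var):
    """Draw W ~ N(0, weight_var) and return the mean of relu(x @ W) ** 2."""
    W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(weight_var)
    h = np.maximum(x @ W, 0.0)
    return float(np.mean(h ** 2))


he = mean_square_after_relu(2.0 / fan_in)      # He scaling
half = mean_square_after_relu(1.0 / fan_in)    # 1/fan_in scaling, for contrast

print(f"2/fan_in : {he:.3f}")    # close to 1: signal scale preserved
print(f"1/fan_in : {half:.3f}")  # close to 0.5: signal halves per layer
```

With only a 32 × 32 weight matrix the estimates fluctuate by a few percent between draws, so the printed values are approximate.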
25  biases = [np.zeros(s[1], dtype=np.float32) for s in shapes]

Initialize all biases to zero. With He init on weights, zero biases let the layer start neutral — no preference for any output unit. Adam will quickly move them away from zero during training.

EXECUTION STATE
📚 np.zeros(shape, dtype) = Creates an array of all zeros with the given shape and dtype. np.zeros(16, dtype=np.float32) → array of 16 zeros, each a 32-bit float.
⬇ arg: s[1] = The OUT dimension of each layer. For shape (32, 16), s[1]=16, so bias has 16 entries — one per output unit.
⬇ arg: dtype=np.float32 = Match the weight dtype. Mixed dtypes in the same operation force NumPy to cast — slower and surprising.
→ biases shapes = biases[0]: (32,) all zeros biases[1]: (16,) all zeros biases[2]: (3,) all zeros
27  shared = np.random.randn(2, 32).astype(np.float32)

Fake shared features for two engines (B=2). In production this comes out of the shared backbone (TCN/LSTM, §11.1); here we just sample from N(0, 1) to exercise the head.

EXECUTION STATE
📚 np.random.randn(2, 32) = Sample (2, 32) array from N(0, 1). Two rows = two engines; 32 columns = the 32-D feature vector.
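→ shared example = Illustrative values only: shared[0] ≈ [1.76, 0.40, 0.98, 2.24, 1.87, ...] shared[1] ≈ [0.40, -1.69, 0.32, -0.20, ...]. The actual numbers differ, because the first draws after seed(0) are consumed by the weights on line 24 before `shared` is sampled.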
→ shape = (2, 32) — B=2 engines, 32 features each
28  logits = health_head(shared, weights, biases)

Forward pass through the head. Internally: h = shared → @W0+b0 → ReLU → @W1+b1 → ReLU → @W2+b2 → return. Result is (2, 3) raw logits.

EXECUTION STATE
⬇ arg 1: shared = (2, 32) input features
⬇ arg 2: weights = [W0(32,32), W1(32,16), W2(16,3)] from line 24
⬇ arg 3: biases = [b0(32), b1(16), b2(3)] all zeros from line 25
⬆ result: logits (2, 3) =
Example one realisation:
[[-0.47,  1.12, -0.66],
 [ 0.85,  0.10,  1.40]]
29  probs = softmax(logits)

Convert logits → probabilities for inspection. Each row is a probability distribution over {healthy, degrading, critical}, summing to exactly 1.0. We DON'T use these for training — that goes through cross-entropy directly on logits.

EXECUTION STATE
⬇ arg: logits (2, 3) = Raw logits from line 28.
⬆ result: probs (2, 3) =
[[0.149, 0.729, 0.123],   ← engine 0: most likely degrading
 [0.312, 0.148, 0.540]]   ← engine 1: most likely critical
→ row sums = [1.000, 1.000] — both rows sum to 1.0 ✓
31  print("logits.shape :", logits.shape)

Confirm shape is (B, K) = (2, 3) — three logits per engine.

EXECUTION STATE
logits.shape = (2, 3)
⬆ stdout = logits.shape : (2, 3)
32  print("logits[0] :", logits[0].round(2).tolist())

Print engine 0's three raw logits, rounded for readability. Logits live in (-∞, +∞): they are unnormalized log-probabilities, defined only up to an additive constant, not probabilities.

EXECUTION STATE
📚 ndarray.round(decimals) = Round every element to N decimals. logits[0].round(2) → array of 3 numbers each rounded to 2 decimals.
📚 ndarray.tolist() = Convert ndarray to a plain Python list. Cleaner print() output (no 'array(...)' wrapper).
⬆ stdout = logits[0] : [-0.47, 1.12, -0.66]
→ reading it = z_healthy=-0.47, z_degrading=1.12, z_critical=-0.66 → degrading wins (largest z).
33  print("probs[0] :", probs[0].round(3).tolist())

Print engine 0's class probabilities. After softmax these are non-negative and sum to 1 — the natural format for a dashboard.

EXECUTION STATE
⬆ stdout = probs[0] : [0.149, 0.729, 0.123]
→ sanity check = 0.149 + 0.729 + 0.123 = 1.001 ≈ 1.000 (the excess is display rounding) ✓
→ reading it = P(healthy)=14.9%, P(degrading)=72.9%, P(critical)=12.3% → confidently degrading.
34  print("argmax :", probs.argmax(-1).tolist())

Reduce probabilities to the single predicted class per engine. argmax along axis=-1 returns the INDEX of the largest entry in each row.

EXECUTION STATE
📚 ndarray.argmax(axis) = Returns the index of the maximum element along the given axis. Different from .max which returns the value.
⬇ arg: -1 = Operate along the LAST axis (the class axis). For shape (2, 3) → output shape (2,), one int per engine.
→ why argmax of probs == argmax of logits = Softmax is monotonic: if z_i > z_j then exp(z_i) > exp(z_j), so the same index wins. You can skip softmax entirely for inference.
⬆ stdout = argmax : [1, 2]
→ reading it = Engine 0 predicted class 1 (degrading); engine 1 predicted class 2 (critical), matching the largest entry in each row of probs. Random init, so this is essentially noise - training will fix it.
35  print("# params :",

Continuation of a multi-line print. The closing args come on line 36.

36  sum(W.size + b.size for W, b in zip(weights, biases))

Generator-expression sum over all weight + bias elements. Tells you exactly how big the head is.

EXECUTION STATE
📚 ndarray.size = Total number of elements in the array (product of all shape dims). For (32, 32): size=1024.
→ per-layer breakdown = Layer 0: 32×32 + 32 = 1056 Layer 1: 32×16 + 16 = 528 Layer 2: 16×3 + 3 = 51
⬆ stdout = # params : 1635
→ context = About 1/3 the size of the RUL head (4737 params). The auxiliary task should not dominate the parameter budget.
 1  import numpy as np
 2
 3
 4  def health_head(shared: np.ndarray, weights: list, biases: list) -> np.ndarray:
 5      """shared: (B, 32). Returns (B, 3) - raw logits for {healthy, degrading, critical}."""
 6      h = shared
 7      for i, (W, b) in enumerate(zip(weights, biases)):
 8          h = h @ W + b
 9          if i < len(weights) - 1:
10              h = np.maximum(h, 0)                     # ReLU on hidden layers only
11      return h                                          # NO softmax - return raw logits
12
13
14  def softmax(z: np.ndarray) -> np.ndarray:
15      """Stable softmax (subtract row-wise max first)."""
16      z = z - z.max(axis=-1, keepdims=True)
17      e = np.exp(z)
18      return e / e.sum(axis=-1, keepdims=True)
19
20
21  # 32 → 32 → 16 → 3
22  np.random.seed(0)
23  shapes  = [(32, 32), (32, 16), (16, 3)]
24  weights = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]
25  biases  = [np.zeros(s[1], dtype=np.float32) for s in shapes]
26
27  shared = np.random.randn(2, 32).astype(np.float32)
28  logits = health_head(shared, weights, biases)
29  probs  = softmax(logits)
30
31  print("logits.shape :", logits.shape)               # (2, 3)
32  print("logits[0]    :", logits[0].round(2).tolist())
33  print("probs[0]     :", probs[0].round(3).tolist())
34  print("argmax       :", probs.argmax(-1).tolist())  # predicted class per engine
35  print("# params     :",
36        sum(W.size + b.size for W, b in zip(weights, biases)))    # 1,635

PyTorch: nn.Sequential

Production HealthHead with smoke test
🐍health_head_torch.py
1  import torch

Imports PyTorch's core tensor library. Provides torch.Tensor (the GPU/CPU autograd-tracked array type), torch.randn/zeros/tensor (constructors), torch.optim (optimizers), and torch.manual_seed (RNG control). Everything else (nn, F) is a sub-module.

EXECUTION STATE
torch = PyTorch root package — tensors, autograd, RNG, dtypes, devices.
2  import torch.nn as nn

Imports the neural-network sub-package and aliases it as `nn`. Provides nn.Module (base class for all networks), nn.Linear (fully-connected layer), nn.Sequential (ordered container), nn.ReLU, nn.LayerNorm, nn.Dropout, etc.

EXECUTION STATE
nn.Module = Base class. Anything you subclass from this gets parameter registration, .to(device), .train()/.eval(), and .parameters() for free.
nn.Linear = Stateful fully-connected layer with learnable W and b. Used 3× in this head.
nn.Sequential = Ordered container — chains modules so forward calls them in order. We use it to compose the 3 Linear layers + 2 ReLUs.
3  import torch.nn.functional as F

STATELESS functional ops — no learnable parameters, no module registration. We use F.cross_entropy (training loss) and F.softmax (inference / dashboard view). Convention: alias as `F` so it reads like F.softmax(...).

EXECUTION STATE
F.cross_entropy = Logits → scalar loss. Internally does log_softmax + nll_loss in one numerically-stable call.
F.softmax = Logits → probabilities. Used for inspection only; training goes through F.cross_entropy.
→ why functional vs nn? = Use nn.Module for layers WITH state (weights, running stats). Use F.* for pure functions (softmax, relu in forward(), loss). Same math, different ergonomics.
6  class HealthHead(nn.Module):

Define the classification head as a subclass of nn.Module. Subclassing gives us .parameters() (for the optimizer), .to(device), .train()/.eval() mode toggles, and automatic parameter registration when we assign sub-modules to self.

EXECUTION STATE
📚 nn.Module = PyTorch's base class for all neural network components. Tracks learnable parameters, manages device placement, and dispatches forward() via __call__.
⬇ input: shared (B, 32) = Forward input — shared 32-D feature vector from §11.1, batched.
⬆ returns = logits (B, 3) — raw scores for {healthy, degrading, critical}
7  Docstring """

Opens the class-level docstring. Triple-quoted string on the line after `class` becomes HealthHead.__doc__.

8  Health classification head.

Docstring line: short description of the class.

9  Input : shared features (B, 32)

Documents the input shape contract: batch of 32-D shared features.

10  Output: raw logits (B, 3) ← {healthy, degrading, critical}

Documents the output shape and class ordering. Class 0 = healthy, class 1 = degrading, class 2 = critical.

12  No softmax inside the head - F.cross_entropy applies log-softmax

Reminds the reader why the output layer has no activation: the loss function takes care of softmax internally for numerical stability.

13  internally for numerical stability.

Continuation of the previous docstring line.

14  Docstring """

Closes the class-level docstring.

16  def __init__(self, in_dim: int = 32, num_classes: int = 3):

Constructor. Defaults match the book-wide settings (32-D shared vector, 3 classes), so most callers can write `HealthHead()` and get the right thing. Both args are typed `int` for IDE hints.

EXECUTION STATE
⬇ input: self = The instance being constructed. PyTorch needs you to register all sub-modules on `self` (e.g. self.net = ...) so .parameters() can find them.
⬇ input: in_dim = 32 = Dimensionality of the shared feature vector coming from the backbone. Default 32 matches Chapter 11.
⬇ input: num_classes = 3 = Output dimensionality. Default 3 matches the book-wide healthy/degrading/critical scheme. Override for other domains (e.g. 5 SoH bins for batteries).
17  super().__init__()

MANDATORY first line of any nn.Module subclass __init__. Calls nn.Module's constructor, which sets up the internal dicts that track parameters, buffers, and child modules. Forget this and PyTorch raises a confusing AttributeError later.

EXECUTION STATE
📚 super() = Returns a proxy object that delegates method calls to the parent class (nn.Module here). super().__init__() calls nn.Module.__init__(self).
→ why required? = nn.Module.__init__ initializes the self._parameters, self._buffers, and self._modules dicts. Without them, the assignment self.net = nn.Sequential(...) raises AttributeError ("cannot assign module before Module.__init__() call") instead of registering the layers.
18  self.net = nn.Sequential(

Compose the three Linear layers and two ReLUs as an ordered pipeline. nn.Sequential stores its arguments in order and calls them one after the other in forward() — perfect for a straight-line MLP.

EXAMPLE
# Layer-by-layer flow in net:
#   shared (B, 32)
#   → Linear(32, 32) + ReLU  → (B, 32)
#   → Linear(32, 16) + ReLU  → (B, 16)
#   → Linear(16,  3)         → (B,  3)   ← logits, no activation
#
# Param count:
#   32×32 + 32 = 1,056
#   32×16 + 16 =   528
#   16× 3 +  3 =    51
#   total      = 1,635
EXECUTION STATE
📚 nn.Sequential(*modules) = Container that chains modules. forward(x) is equivalent to: for m in modules: x = m(x); return x. Auto-registers every module so .parameters() walks them all.
→ assigned to self.net = Storing on self registers Sequential as a child module. PyTorch then knows about every Linear inside it for .parameters(), .to(device), state_dict, etc.
19  nn.Linear(in_dim, 32), nn.ReLU(inplace=True),

First hidden layer (32 → 32) followed by a ReLU. The Linear holds W(32, 32) and b(32) as learnable parameters; the ReLU is parameter-free.

EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias=True) = Fully-connected layer: y = x @ W.T + b. Stores W shape (out, in) and b shape (out,) as learnable Parameters.
⬇ arg 1: in_dim = 32 = in_features — must match the last dim of the input tensor. Sets the columns of W.
⬇ arg 2: 32 = out_features — sets the rows of W. After this layer the activation is (B, 32).
📚 nn.ReLU(inplace=True) = Element-wise max(0, x). inplace=True overwrites the input tensor instead of allocating a new one — same arithmetic, lower memory.
⬇ arg: inplace=True = Mutate the input tensor in place. Safe inside nn.Sequential because the previous layer's output is consumed exactly once.
→ params = W: (32, 32) = 1024 + b: (32,) = 32 → 1056 params
20  nn.Linear(32, 16), nn.ReLU(inplace=True),

Second hidden layer (32 → 16) — the bottleneck. Forces the head to compress the 32-D representation into 16-D before deciding the class.

EXECUTION STATE
⬇ arg 1: 32 = in_features — must match the previous layer's out (32).
⬇ arg 2: 16 = out_features — bottleneck dim. Halving from 32→16 forces compression.
→ params = W: (16, 32) = 512 + b: (16,) = 16 → 528 params
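The weight shape quoted here, W: (16, 32), follows nn.Linear's (out, in) storage, whereas the NumPy head stored W as (in, out). A small NumPy sketch with made-up random values, illustrating that the two layouts compute the same output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))          # batch of 4 feature vectors

W_numpy = rng.standard_normal((32, 16))   # (in, out), as in the NumPy head
b = rng.standard_normal(16)

W_torch_layout = W_numpy.T                # (out, in), the layout nn.Linear stores

y_numpy = x @ W_numpy + b                 # NumPy head: h @ W + b
y_torch = x @ W_torch_layout.T + b        # nn.Linear formula: x @ W.T + b

print(W_torch_layout.shape)               # (16, 32)
print(np.allclose(y_numpy, y_torch))      # True
```

This matters when copying weights between the two implementations: a NumPy matrix has to be transposed before it is loaded into an nn.Linear.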
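21  nn.Linear(16, num_classes), # ← no activation

Output layer. NO activation follows. The output is RAW LOGITS in (-∞, +∞), exactly what F.cross_entropy expects. Adding a Softmax here is a classic bug: the loss applies log-softmax on top of the already-normalised probabilities, which flattens the gradients without raising any error. Adding a LogSoftmax is merely redundant (log-softmax of log-probabilities returns them unchanged), but that layer belongs with F.nll_loss, not F.cross_entropy.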

EXECUTION STATE
⬇ arg 1: 16 = in_features — matches the bottleneck out (16).
⬇ arg 2: num_classes = 3 = out_features — one logit per class. With default num_classes=3: healthy / degrading / critical.
→ no ReLU here! = If you wrap this with nn.ReLU, negative logits get clipped to 0, so the head can no longer push a class score below zero. If you wrap it with nn.Softmax, F.cross_entropy applies log-softmax to probabilities instead of logits. Always: raw logits out.
→ params = W: (3, 16) = 48 + b: (3,) = 3 → 51 params
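The cost of a stray softmax in the head can be quantified without PyTorch. The sketch below re-implements the per-sample cross-entropy formula in plain Python (the function name is made up) and feeds it either raw logits or probabilities mistaken for logits.

```python
import math


def cross_entropy_from_logits(z, label):
    """-log softmax(z)[label], computed with the max-subtraction form."""
    m = max(z)
    log_sum = math.log(sum(math.exp(v - m) for v in z))
    return -(z[label] - m - log_sum)


# Correct usage: confident raw logits drive the loss towards 0.
good = cross_entropy_from_logits([20.0, 0.0, 0.0], label=0)

# Buggy usage: the head already applied softmax, so the "logits" are
# probabilities in [0, 1]. Even a perfect one-hot prediction leaves a floor.
floor = cross_entropy_from_logits([1.0, 0.0, 0.0], label=0)

print(f"raw logits      : {good:.6f}")    # 0.000000
print(f"softmax-in-head : {floor:.4f}")   # 0.5514
```

The floor is `log(1 + 2/e)`: with three classes, probabilities fed in as logits can never separate by more than 1, so the loss cannot fall below about 0.55 however well the model fits.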
22  )

Closing paren of nn.Sequential.

24  def forward(self, shared: torch.Tensor) -> torch.Tensor:

The forward pass. PyTorch calls this when you do `head(shared)` — the nn.Module __call__ implementation invokes forward() and also runs hooks (gradient hooks, profiling, etc.). NEVER call forward() directly; always use the module instance.

EXECUTION STATE
⬇ input: self = The HealthHead instance. Provides access to self.net.
⬇ input: shared = torch.Tensor of shape (B, 32). Must be on the same device as the module's parameters.
⬆ returns = torch.Tensor of shape (B, 3) — raw logits.
25  return self.net(shared)

Single Sequential pass. Chains Linear → ReLU → Linear → ReLU → Linear and returns the (B, 3) logits. No squeeze, no reshape — the head's output is already the desired shape.

EXECUTION STATE
self.net(shared) = Calls nn.Sequential.__call__ → runs every contained module in order.
→ step-by-step shape = (B, 32) → Linear → (B, 32) → ReLU → (B, 32) → Linear → (B, 16) → ReLU → (B, 16) → Linear → (B, 3) → return
⬆ return: (B, 3) = Raw logits.
28  # ---------- smoke test ----------

Comment marking the start of the runnable smoke test.

29  torch.manual_seed(0)

Seed PyTorch's default RNG with 0. Makes torch.randn, weight init inside nn.Linear, dropout, etc. repeatable for this run.

EXECUTION STATE
📚 torch.manual_seed(s) = Seeds the default PyTorch generator on all devices (CPU and CUDA). Bit-exact GPU reproducibility can additionally require deterministic kernels, e.g. torch.use_deterministic_algorithms(True).
⬇ arg: 0 = The seed value. Any int works; 0 is conventional in tutorial code.
30  head = HealthHead()

Instantiate with default args: in_dim=32, num_classes=3. The constructor runs __init__, which builds nn.Sequential, which constructs three nn.Linear modules. Each one initialises W with kaiming_uniform_ and b from a small uniform range, U(-1/√fan_in, 1/√fan_in) (PyTorch's defaults for Linear).

EXECUTION STATE
head = HealthHead instance with 1635 trainable parameters.
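→ default init = nn.Linear's default calls kaiming_uniform_(W, a=sqrt(5)), which works out to W ~ U(-1/√fan_in, 1/√fan_in); the bias is drawn from the same range, not set to zero. The resulting weight variance, 1/(3·fan_in), is smaller than He init's 2/fan_in.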
31  shared = torch.randn(4, 32, requires_grad=True)

Fake shared features for a batch of 4 engines. requires_grad=True so we can call .backward() and inspect the gradient flowing back into the shared backbone — useful for the GABA chapter.

EXECUTION STATE
📚 torch.randn(*size, requires_grad) = Sample from N(0, 1). Shape is positional args: torch.randn(4, 32) → tensor of shape (4, 32).
⬇ arg: 4 = Batch size — pretend we're processing 4 engines simultaneously.
⬇ arg: 32 = Feature dim — matches in_dim of HealthHead.
⬇ arg: requires_grad=True = Track this tensor in autograd. After loss.backward(), shared.grad will hold ∂loss/∂shared — what the head's gradient signal tells the backbone to change.
→ shape = (4, 32)
logits = head(shared)

Forward pass. PyTorch's nn.Module overrides __call__ to invoke forward() plus run hooks. NEVER call head.forward(shared) directly — you'd skip the hooks.

EXECUTION STATE
⬇ arg: shared = (4, 32) input tensor.
⬆ result: logits = (4, 3) torch.Tensor of raw logits. Tracked by autograd.
labels = torch.tensor([0, 1, 2, 1])

Ground-truth class INDICES, not one-hot vectors. F.cross_entropy expects integer class IDs of shape (B,) where each value is in [0, num_classes). An int64 one-hot tensor of shape (B, 3) raises an error; a float one-hot tensor is read as class probabilities, which is a different code path.

EXECUTION STATE
📚 torch.tensor(data) = Constructs a tensor from a Python list/tuple/scalar. Infers dtype from the data: for ints it defaults to int64 (long), which is what F.cross_entropy requires for labels.
⬇ arg: [0, 1, 2, 1] = Class indices for the 4 engines: engine 0 is healthy, engine 1 is degrading, engine 2 is critical, engine 3 is degrading.
→ shape & dtype = shape (4,), dtype torch.int64. NOT one-hot.
→ one-hot would be wrong = torch.tensor([[1,0,0],[0,1,0],[0,0,1],[0,1,0]]) has shape (4, 3) and dtype int64, so F.cross_entropy rejects it. Cast to float and it is accepted as soft labels (class probabilities) instead. Always pass integer indices.
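If labels arrive one-hot encoded from an upstream pipeline, converting them back to indices sidesteps the issue. The sketch below assumes a recent PyTorch release (1.10 or later, where float targets are read as class probabilities); `one_hot`, `idx`, `recovered`, and the random `logits` are illustrative names, not part of the chapter's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                       # stand-in for the head's output
idx = torch.tensor([0, 1, 2, 1])                 # integer class indices, shape (4,)
one_hot = F.one_hot(idx, num_classes=3)          # int64, shape (4, 3)

# An integer target must have shape (B,), so int64 one-hot is rejected.
try:
    F.cross_entropy(logits, one_hot)
    int_one_hot_rejected = False
except (RuntimeError, ValueError):
    int_one_hot_rejected = True

# A float target of shape (B, C) goes down the class-probabilities path.
loss_soft = F.cross_entropy(logits, one_hot.float())
loss_idx = F.cross_entropy(logits, idx)

# Recover indices from one-hot rows before computing the loss.
recovered = one_hot.argmax(dim=-1)

print(int_one_hot_rejected, recovered.tolist())
```

For hard one-hot rows the float path should return the same value as the index path, but the index form is what the rest of this chapter assumes.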
loss = F.cross_entropy(logits, labels)

Compute the multi-class cross-entropy loss in ONE numerically-stable call. Internally: log_softmax over the class axis, then negative-log-likelihood against the integer labels, then mean over the batch.

EXAMPLE
# What F.cross_entropy actually computes:
#   logp = log_softmax(logits, dim=-1)          # (B, 3)
#   loss = -logp[range(B), labels].mean()        # scalar
EXECUTION STATE
📚 F.cross_entropy(input, target, reduction='mean') = PyTorch's combined log-softmax + NLL loss. Numerically stable (log-softmax uses the max-subtraction trick) and equivalent to chaining F.log_softmax + F.nll_loss yourself, in a single call.
⬇ arg 1: logits (4, 3) = Raw scores from the head — (B, num_classes). Must be unnormalized.
⬇ arg 2: labels (4,) = Integer class indices. Each value selects which logit's negative-log-prob to read.
→ math = logp = log_softmax(logits, dim=-1), shape (4, 3); loss = -logp[range(4), labels].mean()
⬆ result: loss (scalar) = 0-dim torch.Tensor, e.g. tensor(1.1732). At init this is roughly ln(num_classes) = ln(3) ≈ 1.099, because a freshly initialised head produces small logits and softmax over them is close to uniform.
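The two-step formula can be checked against the built-in call directly. This is a small sketch on random stand-in logits (the variable names `manual`, `builtin`, and `uniform_loss` are illustrative); it also confirms the ln(3) value for a perfectly uniform prediction.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # stand-in for the head's output
labels = torch.tensor([0, 1, 2, 1])

# Manual version: log-softmax, pick the true-class entry, average over the batch.
logp = F.log_softmax(logits, dim=-1)                      # (4, 3)
manual = -logp[torch.arange(4), labels].mean()            # scalar
builtin = F.cross_entropy(logits, labels)

# All-equal logits give a uniform softmax, so the loss is ln(3).
uniform_loss = F.cross_entropy(torch.zeros(4, 3), labels)

print(torch.allclose(manual, builtin))                        # True
print(round(uniform_loss.item(), 4), round(math.log(3), 4))   # 1.0986 1.0986
```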
loss.backward()

Walk the autograd graph backward from the scalar loss, accumulating ∂loss/∂param into every Parameter's .grad attribute (and into shared.grad because we set requires_grad=True there). One full reverse-mode autodiff pass.

EXECUTION STATE
📚 Tensor.backward() = Triggers backprop. Only valid on a scalar tensor (or you'd need a gradient argument). Accumulates gradients — does not zero them.
→ side effects = After this call: every Linear's W.grad and b.grad is filled, AND shared.grad is filled.
→ shared.grad meaning = shared.grad is what the HEAD wants the BACKBONE to change to lower the classification loss. Chapter 12's GABA balances this against the RUL head's grad.
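The side effects of backward() are easy to inspect. The sketch below uses a stand-in `net` with the same layer sizes as the head (an assumption for self-containment) and shows both that shared.grad is filled and that a second backward() without clearing adds onto the first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 16), nn.ReLU(),
                    nn.Linear(16, 3))
shared = torch.randn(4, 32, requires_grad=True)   # fake backbone output
labels = torch.tensor([0, 1, 2, 1])

F.cross_entropy(net(shared), labels).backward()
first = shared.grad.clone()                       # d(loss)/d(shared), shape (4, 32)
all_params_have_grad = all(p.grad is not None for p in net.parameters())

# Second backward WITHOUT clearing: gradients are added to the existing ones.
F.cross_entropy(net(shared), labels).backward()
doubled = torch.allclose(shared.grad, 2 * first)

print(tuple(first.shape), all_params_have_grad, doubled)
# (4, 32) True True
```

In the full model, `shared` would be the backbone's output rather than a leaf tensor, and this gradient would continue flowing into the backbone's parameters.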
optim = torch.optim.Adam(head.parameters(), lr=1e-3)

Construct an Adam optimizer holding references to all of the head's learnable parameters. Adam tracks per-parameter first and second moment estimates of the gradient — adaptive step sizes.

EXECUTION STATE
📚 torch.optim.Adam(params, lr, betas, eps, weight_decay) = Adaptive Moment Estimation optimizer. Maintains m (1st moment), v (2nd moment) per parameter, applies bias-corrected updates.
⬇ arg 1: head.parameters() = Iterable of all learnable Parameters in the head — discovered by walking self._parameters and self._modules. 1635 numbers across 6 tensors (3 Ws + 3 bs).
⬇ arg 2: lr=1e-3 = Learning rate — Adam's default. 1e-3 = 0.001. Large enough to make progress, small enough to avoid divergence on most tasks.
optim.step(); optim.zero_grad()

Apply one Adam update, then clear the .grad attributes for the next iteration. Two statements on one line for compactness; the semicolon is a statement separator in Python.

EXECUTION STATE
📚 optim.step() = Reads each Parameter.grad, applies the Adam update rule, and updates the parameter in place. After this call, the head's weights have shifted slightly toward lowering the loss.
📚 optim.zero_grad() = Clears every tracked .grad (since PyTorch 2.0 the default resets it to None rather than filling it with zeros). PyTorch ACCUMULATES gradients on each .backward(); without clearing, gradients from earlier iterations would be added into the current one.
→ order matters = Never clear between backward() and step(), or there is nothing to step on. Clearing right after step() (as here) or right before the next backward() are both fine.
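Repeating the backward / step / zero_grad cycle gives a complete, if tiny, training loop. The sketch below overfits a single fake batch and is only meant to show the call order; the stand-in `net`, the 200-step budget, and the `losses` list are assumptions for illustration, not the book's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 16), nn.ReLU(),
                    nn.Linear(16, 3))
shared = torch.randn(4, 32)                 # one fixed fake batch
labels = torch.tensor([0, 1, 2, 1])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)

losses = []
for _ in range(200):
    loss = F.cross_entropy(net(shared), labels)   # fresh forward pass each step
    loss.backward()                               # fill .grad
    optim.step()                                  # consume .grad
    optim.zero_grad()                             # clear for the next iteration
    losses.append(loss.item())

print("first loss:", round(losses[0], 4))
print("last loss :", round(losses[-1], 4))
```

Note that the forward pass sits inside the loop: logits computed before a step do not change when the weights do.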
probs = F.softmax(logits.detach(), dim=-1)

Convert logits → probabilities for inspection only. .detach() strips autograd tracking (we don't need gradients through the dashboard view). dim=-1 normalises along the class axis so each engine's row sums to 1. Note that logits was computed before optim.step(), so probs reflects the pre-update weights.

EXECUTION STATE
📚 F.softmax(input, dim) = Element-wise exp() then divide by the sum along `dim`. Numerically equivalent to the manual softmax in the NumPy version.
📚 Tensor.detach() = Returns a new tensor that shares storage with the original but is detached from the autograd graph. No gradient flows back through it, which avoids accidental grad flow into the head.
⬇ arg 1: logits.detach() = (4, 3) raw scores, autograd-detached.
⬇ arg 2: dim=-1 = Softmax along the LAST axis (classes). For a (4, 3) tensor, each of the 4 rows independently sums to 1.0.
→ dim=-1 vs dim=0 = dim=-1: each ROW sums to 1 (one engine, distribution over classes) ✓; dim=0: each COLUMN sums to 1 (one class, distribution over engines) ✗
⬆ result: probs (4, 3) = Each row is a valid probability distribution.
preds = probs.argmax(dim=-1)

Reduce probabilities to a single predicted class per engine. argmax over the class axis returns the integer index of the largest probability.

EXECUTION STATE
📚 Tensor.argmax(dim) = Returns indices of the maximum values along `dim`. For (4, 3) with dim=-1 → shape (4,), one int per engine.
⬇ arg: dim=-1 = Reduce along the class axis.
⬆ result: preds = torch.Tensor of dtype int64, shape (4,). One predicted class index per engine.
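Because softmax preserves ordering within a row, the same predictions fall out of the raw logits. A short sketch on random stand-in logits (names are illustrative) that checks this, along with the row-sum property:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # stand-in for the head's output
probs = F.softmax(logits, dim=-1)

# Each row is a distribution over the three classes.
rows_sum_to_one = torch.allclose(probs.sum(dim=-1), torch.ones(4))

# argmax over logits and over probabilities pick the same class.
same_prediction = torch.equal(logits.argmax(dim=-1), probs.argmax(dim=-1))

# Softmax is still needed when you want a confidence value per engine.
confidence = probs.max(dim=-1).values       # shape (4,)

print(rows_sum_to_one, same_prediction, tuple(confidence.shape))
# True True (4,)
```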
print("shared.shape :", tuple(shared.shape))

Confirm the input shape. tuple(...) prints (4, 32) instead of torch.Size([4, 32]).

EXECUTION STATE
📚 tuple(shared.shape) = Tensor.shape returns torch.Size, a subclass of tuple. Casting to plain tuple gives a cleaner repr in print().
⬆ stdout = shared.shape : (4, 32)
print("logits.shape :", tuple(logits.shape))

Confirm the output shape — (B, num_classes).

EXECUTION STATE
⬆ stdout = logits.shape : (4, 3)
print("loss :", round(loss.item(), 4))

Print the scalar loss as a Python float, rounded for readability. .item() unwraps the 0-D tensor into a plain float (otherwise print would show the tensor repr, including its grad_fn).

EXECUTION STATE
📚 Tensor.item() = Extracts the value of a single-element tensor as a Python scalar (float for float tensors, int for int tensors). Errors if the tensor has more than one element.
📚 round(x, n) = Built-in: round to n decimal places. Pure formatting; the tensor itself is unchanged.
⬆ stdout = loss : 1.1732
→ expected at init = Roughly -ln(1/3) = ln(3) ≈ 1.099, because the untrained head emits small logits and softmax over them is close to uniform. After training, this should drop toward 0.
print("preds :", preds.tolist())

Print predicted class indices for all 4 engines. .tolist() converts the tensor to a plain Python list for clean printing.

EXECUTION STATE
📚 Tensor.tolist() = Convert tensor to nested Python list. For a 1-D tensor of length 4 → [a, b, c, d].
⬆ stdout = preds : [2, 1, 1, 0]
→ reading it = engine 0 → class 2 (critical); engine 1 → class 1 (degrading); engine 2 → class 1 (degrading); engine 3 → class 0 (healthy)
→ don't expect labels match = labels were [0,1,2,1]. preds comes from logits computed BEFORE the Adam step, i.e. from the randomly initialised head, so [2,1,1,0] agreeing with only one label is unsurprising. Real training takes thousands of steps.
print("# params :", sum(p.numel() for p in head.parameters()))

Count every learnable scalar in the head. Generator expression sums .numel() across all Parameters returned by head.parameters().

EXECUTION STATE
📚 head.parameters() = Generator that yields every learnable Parameter in the module tree. For HealthHead: 6 tensors (W and b for each of 3 Linears).
📚 Parameter.numel() = 'Number of elements' — total scalars in the tensor. For W shape (32, 32) → 1024; for b shape (32,) → 32.
→ breakdown = Linear(32, 32): 1024 + 32 = 1056; Linear(32, 16): 512 + 16 = 528; Linear(16, 3): 48 + 3 = 51; total 1056 + 528 + 51 = 1635
⬆ stdout = # params : 1635
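The same count follows from the layer widths alone, with no PyTorch needed. A minimal sketch (the `dims` list and variable names are illustrative):

```python
# Layer widths of the health head: 32 -> 32 -> 16 -> 3.
dims = [32, 32, 16, 3]

# A Linear(i, o) layer holds i * o weights plus o biases.
per_layer = [i * o + o for i, o in zip(dims[:-1], dims[1:])]
total = sum(per_layer)

print(per_layer, total)   # [1056, 528, 51] 1635
```

Changing `dims` is a quick way to see how any alternative head size would affect the parameter budget.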
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HealthHead(nn.Module):
    """
    Health classification head.
    Input : shared features (B, 32)
    Output: raw logits        (B, 3)   ← {healthy, degrading, critical}

    No softmax inside the head - F.cross_entropy applies log-softmax
    internally for numerical stability.
    """

    def __init__(self, in_dim: int = 32, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(inplace=True),
            nn.Linear(32,     16), nn.ReLU(inplace=True),
            nn.Linear(16, num_classes),                # ← no activation
        )

    def forward(self, shared: torch.Tensor) -> torch.Tensor:
        return self.net(shared)                        # (B, 3)


# ---------- smoke test ----------
torch.manual_seed(0)
head = HealthHead()
shared = torch.randn(4, 32, requires_grad=True)        # fake shared features

logits  = head(shared)                                  # (4, 3)
labels  = torch.tensor([0, 1, 2, 1])                    # ground truth class
loss    = F.cross_entropy(logits, labels)

loss.backward()
optim   = torch.optim.Adam(head.parameters(), lr=1e-3)
optim.step(); optim.zero_grad()

probs    = F.softmax(logits.detach(), dim=-1)
preds    = probs.argmax(dim=-1)

print("shared.shape :", tuple(shared.shape))           # (4, 32)
print("logits.shape :", tuple(logits.shape))           # (4, 3)
print("loss         :", round(loss.item(), 4))
print("preds        :", preds.tolist())                # predicted class per engine
print("# params     :", sum(p.numel() for p in head.parameters()))   # 1,635
```

Same Head, Different Domains

The 32-D shared features are domain-agnostic; only the class boundaries change. Drop in a different label rule and the same head trains:

| Domain | Three classes | Boundary rule |
|---|---|---|
| Bearings (PRONOSTIA) | healthy / inner-race fault / outer-race fault | Vibration spectrum bands |
| Lithium-ion cells | healthy / mid-life / end-of-life | SoH > 90% / 80–90% / ≤ 80% |
| Wind-turbine gearbox | OK / wear / failure imminent | SCADA temp residual + age |
| Hard disk SMART | OK / pre-fail / fail-soon | SMART 5/197/198 thresholds |
| HVAC chillers | nominal / fouled / leaking | ΔT and refrigerant pressure |

Three is a sweet spot. Two classes are too coarse to give a meaningful auxiliary signal; five or more leak so much capacity into class boundaries that they hurt the regression head. The book sticks with three across all domains.
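A label rule is just a function from a scalar health indicator to a class index. The sketch below writes out two of them: the chapter's RUL thresholds and the lithium-ion SoH row. The function names are hypothetical, and sending exactly 90% SoH to the middle class is an assumption, since the table does not say which side that boundary falls on.

```python
def rul_to_class(rul: float) -> int:
    """Chapter thresholds: RUL > 125 healthy, 50 < RUL <= 125 degrading, else critical."""
    if rul > 125:
        return 0          # healthy
    if rul > 50:
        return 1          # degrading
    return 2              # critical


def soh_to_class(soh_percent: float) -> int:
    """Battery rule: SoH > 90% / 80-90% / <= 80% (90% itself assumed mid-life)."""
    if soh_percent > 90:
        return 0          # healthy
    if soh_percent > 80:
        return 1          # mid-life
    return 2              # end-of-life


print([rul_to_class(r) for r in (200, 125, 51, 50, 0)])    # [0, 1, 1, 2, 2]
print([soh_to_class(s) for s in (95, 90, 85, 80, 60)])     # [0, 1, 1, 2, 2]
```

Either function yields integer class indices in {0, 1, 2}, which is the label format the head's loss expects.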

Three Classification Pitfalls

Pitfall 1: Applying softmax in the head. Calling F.cross_entropy on the result then applies softmax a second time (inside its log-softmax). The loss only ever sees values in [0, 1], so it cannot approach zero and the gradients shrink. Always return raw logits.
Pitfall 2: One-hot labels. F.cross_entropy expects integer class indices of shape (B,), NOT one-hot vectors of shape (B, 3). An int64 one-hot tensor raises an error; a float one-hot tensor is accepted but treated as class probabilities, a different code path from the one this chapter uses. Convert to indices first.
Pitfall 3: Class imbalance ignored. Without the cap, ~45% of windows are healthy and only ~25% critical. Plain CE then under-trains the rare-but-important “critical” class. Chapter 14 introduces focal loss + class weights to rebalance this; for now just be aware that vanilla CE is the baseline, not the answer.
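Pitfall 1 is easy to reproduce numerically. The sketch below builds deliberately perfect, highly confident logits (an artificial setup chosen for illustration) and compares the loss with and without an extra softmax in front of F.cross_entropy.

```python
import math
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 1, 2, 1])

# Artificial best case: very confident and fully correct logits.
logits = torch.full((4, 3), -20.0)
logits[torch.arange(4), labels] = 20.0

good = F.cross_entropy(logits, labels)                       # essentially zero
bad = F.cross_entropy(F.softmax(logits, dim=-1), labels)     # softmax applied twice

# With inputs confined to [0, 1], the best 3-class case is the input (1, 0, 0),
# whose loss is -ln(e / (e + 2)), about 0.5514.
floor = -math.log(math.e / (math.e + 2))

print(round(good.item(), 4), round(bad.item(), 4), round(floor, 4))
```

Even with a perfect prediction, the double-softmax loss is stuck near 0.55, so the optimizer never sees the signal that the model is already right.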
The point. A 1,635-parameter MLP that turns the shared 32-D vector into three raw logits. No softmax inside. Same parameter budget across every domain in the table above; only the labels change.

Takeaway

  • 32 → 32 → 16 → 3. Three layers, lean by design - the auxiliary task should not crowd the backbone.
  • Raw logits out, no softmax inside. F.cross_entropy handles the numerically-stable log-softmax for you.
  • Integer labels, not one-hot. Class indices in {0, 1, 2}.
  • 1,635 parameters. About a third of the RUL head, < 0.05% of the full DualTaskModel.