Chapter 11

Shared 32-Dimensional Feature Vector

Dual-Task Heads & Model Assembly

From a Wide River to a Single Tap

Up to this point the backbone has widened the feature dimension: 17 → 64 (CNN) → 512 (BiLSTM) → 512 (Attention, which preserves the width). All that capacity has gone into turning the 30-cycle window into a rich representation. Now we go the other way and funnel down to a 32-dimensional summary that BOTH task heads will read from.

The FC funnel 512 → 256 → 64 → 32 is the LAST shared layer in the backbone. Everything after it is task-specific: the RUL regression head reads from these 32 dims, and the health classification head reads from the SAME 32 dims. That shared point is what makes the model a multi-task network in any meaningful sense.

The architectural meaning of 32. Small enough that gradients from both tasks have to compete for capacity - which is the whole point of MTL. Big enough to encode the information both heads need.
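To make the shared read point concrete, here is a minimal NumPy sketch of two heads consuming the same 32-D vector. The head shapes are illustrative assumptions, not the chapter's actual heads: a single linear unit for RUL and a 3-class linear classifier for health state. The class count and the names `w_rul` and `W_cls` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

B, shared_dim, n_classes = 4, 32, 3   # n_classes is an assumed value
# Stand-in for the funnel output: one 32-D vector per engine
shared = rng.standard_normal((B, shared_dim)).astype(np.float32)

# Hypothetical head parameters (illustration only)
w_rul = (rng.standard_normal((shared_dim, 1)) * 0.1).astype(np.float32)
W_cls = (rng.standard_normal((shared_dim, n_classes)) * 0.1).astype(np.float32)

rul_pred = shared @ w_rul       # (B, 1) regression output
cls_logits = shared @ W_cls     # (B, n_classes) classification logits

print(rul_pred.shape, cls_logits.shape)
```

Both matrix products take `shared` as input, so during training the gradients of both losses would flow back into the same 32 units.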

The 512 → 256 → 64 → 32 Funnel

| Layer | Shape | Params (W + b) |
|---|---|---|
| Input (last cycle) | (B, 512) | |
| Linear → ReLU → Drop | (B, 256) | 131,328 |
| Linear → ReLU → Drop | (B, 64) | 16,448 |
| Linear → ReLU → Drop | (B, 32) | 2,080 |
| Total funnel | | 149,856 |
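The parameter counts follow directly from the layer widths. A short check in plain Python (the helper name `linear_params` is ours, introduced only for this illustration):

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights (in_dim * out_dim) plus one bias per output unit."""
    return in_dim * out_dim + out_dim

widths = [512, 256, 64, 32]
per_layer = [linear_params(a, b) for a, b in zip(widths[:-1], widths[1:])]
total = sum(per_layer)

print(per_layer, total)   # [131328, 16448, 2080] 149856
```

The first layer holds almost 88% of the funnel's parameters, which is typical when a funnel narrows geometrically.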

Time Pooling: Last Cycle vs Mean

We have a (B, T=30, 512) tensor and need a (B, 512) summary per engine. Four options:

| Pooling | Formula | Notes |
|---|---|---|
| Last cycle | h[:, -1, :] | Default. Bidirectional + attention already integrate the window |
| Mean over time | h.mean(dim=1) | Common alternative; works well on FD001 |
| Attention pooling | softmax(W·h) · h | Adds parameters; sometimes helps multi-condition data |
| Last + mean concat | concat(h[:,-1], h.mean(dim=1)) | Doubles input dim of next layer |

We use last-cycle pooling because the BiLSTM's last timestep contains forward context from the entire window, and the attention layer further mixes information across all cycles. Mean-pooling is a reasonable alternative that matters more on shorter sequences.
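The four pooling options can be compared side by side in NumPy. This is a sketch on random data. The scoring vector `w_score` used for attention pooling is a hypothetical stand-in for learned parameters, and NumPy spells the axis argument `axis=1` where the PyTorch-style formulas in the table use `dim=1`.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D = 2, 30, 512
h = rng.standard_normal((B, T, D)).astype(np.float32)

last = h[:, -1, :]                      # (B, D) last-cycle pooling
mean = h.mean(axis=1)                   # (B, D) mean over time

# Attention pooling: one score per cycle, softmax over time, weighted sum.
w_score = (rng.standard_normal(D) / np.sqrt(D)).astype(np.float32)  # hypothetical
scores = h @ w_score                                   # (B, T)
scores = scores - scores.max(axis=1, keepdims=True)    # stabilise the softmax
alpha = np.exp(scores)
alpha = alpha / alpha.sum(axis=1, keepdims=True)       # (B, T), rows sum to 1
attn = (alpha[:, :, None] * h).sum(axis=1)             # (B, D)

last_mean = np.concatenate([last, mean], axis=-1)      # (B, 2 * D)

print(last.shape, mean.shape, attn.shape, last_mean.shape)
```

Only the concat variant changes the width seen by the first funnel layer (1024 instead of 512), which is why the table flags it.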

Python: The FC Funnel From Scratch

Three FC blocks; last-cycle pooling; 512 → 256 → 64 → 32
🐍fc_funnel_numpy.py
1import numpy as np

NumPy is the numerical-computing backbone of Python. It provides ndarray (N-dimensional array) — a contiguous, typed buffer with vectorised math implemented in C/BLAS. Every operation in this file (matrix multiply via @, element-wise ReLU via np.maximum, broadcasting in dropout) runs as compiled C, not Python loops. We alias it 'np' by universal convention.

EXECUTION STATE
numpy = Library for fast tensor math: ndarray, linear algebra (BLAS), random numbers, reductions, broadcasting
as np = Alias so we write np.maximum(...), np.random.randn(...), np.sqrt(...) instead of numpy.<...> — saves typing and is the canonical name
4def fc_block(x, W, b, dropout_p=0.3, training=True) → np.ndarray

The atomic fully-connected block: Linear → ReLU → (inverted) Dropout. Called three times by shared_features() to build the 512 → 256 → 64 → 32 funnel.

EXECUTION STATE
⬇ input: x = Activations from the previous layer. Shape (..., in_dim) — leading axes are passed through unchanged so this works for (B, in) or (B, T, in).
→ x example = Layer 1 receives x with shape (2, 512) — the last-cycle vector for each of the 2 engines in the batch.
⬇ input: W = Weight matrix of shape (in_dim, out_dim). Multiplying x @ W projects each row of x into out_dim space.
→ W example (Layer 1) = shape (512, 256) — 131,072 learnable weights. Initialised with Kaiming: std = √(2/512) ≈ 0.0625.
⬇ input: b = Bias vector of shape (out_dim,). Added to every row of x @ W via broadcasting. Initialised to zeros.
⬇ input: dropout_p = 0.3 = Probability of zeroing each activation. 0.3 is heavier than the 0.1–0.2 used in the conv blocks because we are deep in the net where overfitting risk is highest. Default value bound at function definition.
⬇ input: training = True = Toggle. True during fit() — dropout is applied. False during eval()/inference — dropout is skipped (the layer becomes identity).
⬆ returns = np.ndarray with shape (..., out_dim). Same leading axes as x, last dim swapped from in_dim to out_dim. Values are non-negative (ReLU output) and possibly zeroed/scaled (dropout).
5Docstring: """Fully-connected block: Linear → ReLU → Dropout."""

One-line summary of the three operations the block performs, in order. This pattern (affine → activation → regularisation) is the canonical FC unit in modern deep nets.

6z = x @ W + b

The Linear (affine) step. Compute z = xW + b, the foundation of every neural network layer. Each output neuron is a weighted sum of all input features plus a bias.

EXECUTION STATE
📚 @ (matmul operator) = Python's PEP-465 matrix-multiplication operator. For 2-D arrays x(M,K) @ W(K,N) → z(M,N). For higher-rank x, NumPy treats all but the last two axes as batch dimensions, so x(B,T,K) @ W(K,N) → z(B,T,N) with no reshape needed.
+ b broadcasting = b has shape (N,). NumPy broadcasts it across every row of x@W: each of the M output rows gets the same bias added element-wise.
⬇ x (Layer 1) = shape (2, 512) — last-cycle activations from the attention block, 2 engines × 512 features
⬇ W (Layer 1) = shape (512, 256) — projects 512-dim input into 256-dim hidden space
⬇ b (Layer 1) = shape (256,) — all zeros at init
⬆ z (Layer 1) = shape (2, 256) — pre-activation. Values can be positive or negative.
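The batch-axis behaviour of `@` described above can be checked directly. This small sketch applies the same weight matrix to a 3-D input and compares it with the explicit flatten-then-reshape route; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, K, N = 2, 30, 512, 256
x3 = rng.standard_normal((B, T, K))
W = rng.standard_normal((K, N))
b = np.zeros(N)

# W is reused for every (batch, time) row; b broadcasts over the last axis.
z3 = x3 @ W + b                                              # (B, T, N)

# Same computation with an explicit flatten / un-flatten.
z_flat = (x3.reshape(B * T, K) @ W + b).reshape(B, T, N)

print(z3.shape, np.allclose(z3, z_flat))
```

This is why `fc_block` needs no reshape logic: the same code serves a (B, 512) input and a (B, T, 512) input.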
7z = np.maximum(z, 0)

ReLU activation: replace every negative element with 0, leave positives untouched. The element-wise non-linearity is what lets stacked Linear layers learn non-linear functions.

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum of two inputs. When b is a scalar (here 0), each element of a is compared to that scalar individually. NOT the same as np.max() — np.max reduces along an axis; np.maximum takes pairwise max with no reduction.
⬇ arg 1: z = Pre-activation, shape (2, 256). Roughly half its values are negative on average for symmetric random weights.
⬇ arg 2: 0 = Scalar threshold. Broadcasts to every element of z. After this line, all negatives become 0.
→ why ReLU? = Cheap (one comparison), no vanishing gradient for positive values, induces sparsity (dead units = 0). Default activation in modern feed-forward nets.
⬆ z after = shape (2, 256), all entries ≥ 0. Example row: [0.0, 1.04, 0.0, 2.81, 0.0, ...]
8if training and dropout_p > 0:

Guard so dropout only runs during training. At inference (training=False) we want the deterministic activations the network produced, with no random masking.

EXECUTION STATE
training = True here — fc_block is called from shared_features which is called from the top-level run, no flag override
dropout_p > 0 = 0.3 > 0 → True. Skipping when p == 0 avoids the cost of generating a useless all-ones mask.
→ why guard? = Inverted dropout (next line) divides by (1 - p). If p == 0 that's a no-op; if training is False we want the network to behave deterministically for evaluation.
9mask = (np.random.random_sample(z.shape) > dropout_p).astype(np.float32)

Generate a Bernoulli(1 − p) mask: each entry is 1 with probability 0.7 and 0 with probability 0.3. Multiplying z by this mask drops 30% of activations to zero.

EXECUTION STATE
📚 np.random.random_sample(size) = Returns float64 samples uniformly drawn from [0, 1). Shape is given by `size`. Equivalent to np.random.rand but takes a tuple instead of *args.
⬇ arg: z.shape = (2, 256) for Layer 1 — produces a uniform-random matrix of the same shape as z.
→ > dropout_p = Element-wise comparison vs 0.3. Returns a bool ndarray: True where the uniform sample exceeds 0.3 (≈70% of cells), False otherwise.
→ .astype(np.float32) = Cast bool → float32: True becomes 1.0, False becomes 0.0. Needed so we can multiply z (numeric) by the mask later.
⬆ mask = shape (2, 256), dtype float32, ~70% ones / ~30% zeros, drawn fresh on every call
10z = z * mask / (1 - dropout_p)

Inverted dropout: zero the dropped units AND divide the survivors by (1 − p) so that the expected value of z is unchanged. The win is that at inference we just skip dropout entirely — no rescaling needed.

EXECUTION STATE
z * mask = Element-wise multiply. About 30% of entries become 0; the rest are unchanged.
/ (1 - dropout_p) = Divide by 0.7. Surviving activations are amplified by ~1.43× so E[z] = z (the dropped 30% of mass is redistributed to the surviving 70%).
→ why invert? = Standard ('non-inverted') dropout zeros at train time and rescales at eval. Inverted dropout pushes the rescale into training so eval-mode forward is just `return ReLU(xW + b)` — simpler and faster.
⬆ z (Layer 1) = shape (2, 256), ≥0, ~30% zeros from dropout + zeros from ReLU
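A quick numerical check of the expectation argument, as a sketch: with constant activations of 2.0 and p = 0.3, roughly 70% of entries should survive, each scaled to 2.0 / 0.7, and the mean should stay close to 2.0. The array size and the constant value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
z = np.full(200_000, 2.0, dtype=np.float32)   # constant input makes the mean easy to read

mask = (rng.random(z.shape) > p).astype(np.float32)   # 1 with prob 0.7, 0 with prob 0.3
z_drop = z * mask / (1 - p)                           # survivors scaled by 1 / 0.7

keep_rate = float(mask.mean())
mean_after = float(z_drop.mean())
print(round(keep_rate, 3), round(mean_after, 3), np.unique(z_drop))
```

The individual values change (0 or about 2.857), but the average activation seen by the next layer does not, which is the property that lets inference skip dropout without any rescaling.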
11return z

Hand the regularised activation back to the caller (shared_features), which feeds it into the next FC block.

EXECUTION STATE
⬆ return: z = shape (..., out_dim). For Layer 1 of the funnel: (2, 256).
16def shared_features(h_attn, Wfc, bfc, dropout_p=0.3) → np.ndarray

The funnel itself: pool over time, then run three FC blocks that compress 512 → 256 → 64 → 32. The output is the 32-D shared representation that BOTH task heads (RUL regression + health classification) consume.

EXECUTION STATE
⬇ input: h_attn = Output of the BiLSTM + Attention backbone. Shape (B, T, 512): batch of B engines × T=30 cycles × 512 features per cycle.
→ h_attn example = shape (2, 30, 512), float32 — 2 engines, 30 historical cycles each, 512-D contextualised features per cycle
⬇ input: Wfc = Python list of 3 weight matrices: shapes (512,256), (256,64), (64,32). One per layer in the funnel.
⬇ input: bfc = Python list of 3 bias vectors: shapes (256,), (64,), (32,). Zip-paired with Wfc inside the loop.
⬇ input: dropout_p = 0.3 = Same dropout probability for every funnel layer. Forwarded to fc_block() unchanged.
⬆ returns = np.ndarray with shape (B, 32) — one 32-D shared feature vector per engine in the batch.
17 Wfc: list, bfc: list,

Continuation of the def signature. Type hints document that these two arguments are lists (of weights and biases) — not NumPy arrays. Hints are advisory; Python doesn't enforce them at runtime.

18 dropout_p: float = 0.3) -> np.ndarray:

Continuation of the signature. Default dropout_p=0.3, return type annotated as np.ndarray. The trailing colon ends the def header — body starts on the next line.

19Docstring: """h_attn: (B, T, 512). Returns (B, 32)."""

Records the input shape contract and output shape contract — the two facts you need to plug this function into the upstream and downstream pipeline.

20h = h_attn[:, -1, :] # last-cycle pooling

Time pooling: keep only the LAST cycle from every engine. The 3-axis input collapses to 2 axes — one summary vector per engine.

EXECUTION STATE
📚 NumPy basic slicing = x[a, b, c] indexes axes 0/1/2 simultaneously. ':' means 'all of this axis'; an int picks one position. Negative indices count from the end (-1 = last).
→ axis 0 = ':' = Keep ALL engines in the batch (size B = 2)
→ axis 1 = -1 = Pick ONLY the last cycle (T=30, so index 29). This axis is reduced away — the result is 2-D.
→ axis 2 = ':' = Keep ALL 512 feature dimensions
h.shape = (2, 512) — last-cycle features per engine
h[0, :5] = [-0.438, 1.586, -0.090, 0.245, 1.667] ← first 5 features of engine 0
→ why last cycle? = Bidirectional LSTM + self-attention have already mixed information across all 30 cycles into every position. The last position therefore carries a full-window summary, so little is given up relative to h_attn.mean(axis=1), and the summary stays anchored to the most recent cycle.
21for W, b in zip(Wfc, bfc):

Iterate the three (W, b) pairs in lock-step. zip stops when the shorter iterable is exhausted — both lists have length 3, so we get exactly three loop iterations: Layer 1, Layer 2, Layer 3.

LOOP TRACE · 3 iterations
Iter 1 — Layer 1: 512 → 256
W.shape = (512, 256)
b.shape = (256,)
h.shape before = (2, 512)
h.shape after fc_block = (2, 256)
h[0, :5] after (eval-mode) = [0.000, 1.036, 0.000, 2.805, 0.000]
Iter 2 — Layer 2: 256 → 64
W.shape = (256, 64)
b.shape = (64,)
h.shape before = (2, 256)
h.shape after fc_block = (2, 64)
h[0, :5] after (eval-mode) = [0.837, 0.000, 0.000, 1.840, 0.000]
Iter 3 — Layer 3: 64 → 32
W.shape = (64, 32)
b.shape = (32,)
h.shape before = (2, 64)
h.shape after fc_block = (2, 32)
h[0, :5] after (eval-mode) = [2.846, 0.446, 0.000, 0.148, 0.123]
22h = fc_block(h, W, b, dropout_p=dropout_p)

Apply one Linear → ReLU → Dropout block. Reassigns h so the next iteration sees the lower-dimensional output of this iteration — the funnel narrows step by step.

EXECUTION STATE
⬇ arg 1: h = Current activation. Starts (2, 512) on iter 1, becomes (2, 256) on iter 2, (2, 64) on iter 3.
⬇ arg 2: W = Current weight matrix from the zip loop
⬇ arg 3: b = Current bias vector from the zip loop
⬇ kwarg: dropout_p=dropout_p = Forward the outer dropout_p (0.3) into fc_block. Using a keyword arg avoids positional ambiguity if fc_block's signature ever changes.
⬆ result: h (next iter) = Linear → ReLU → Dropout applied. Last-axis size drops to W.shape[1].
23return h

After three iterations, h has shape (B, 32) — the 32-D shared feature vector per engine.

EXECUTION STATE
⬆ return: h = shape (2, 32) — the per-engine summary read by both downstream task heads
27np.random.seed(0)

Seed NumPy's global PRNG so every run produces identical pseudo-random numbers. Without this, np.random.randn (line 30) and np.random.random_sample inside dropout would yield different values every run.

EXECUTION STATE
📚 np.random.seed(seed) = Initialises the legacy global Mersenne-Twister state. Affects ALL np.random.* calls until reseeded. Modern code prefers np.random.default_rng(seed) for an isolated Generator object, but seed() is fine for tutorials.
⬇ arg: 0 = Any non-negative integer works. 0 is conventional for examples.
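For comparison, here is a sketch of the Generator-based alternative mentioned above. The helper `dropout_mask` is hypothetical and simply mirrors the mask line from `fc_block`, taking an explicit `rng` instead of relying on global state.

```python
import numpy as np

def dropout_mask(shape, p, rng):
    """Bernoulli(1 - p) keep-mask drawn from an explicit Generator."""
    return (rng.random(shape) > p).astype(np.float32)

mask_a = dropout_mask((2, 256), 0.3, np.random.default_rng(0))
mask_b = dropout_mask((2, 256), 0.3, np.random.default_rng(0))   # same seed, fresh Generator
mask_c = dropout_mask((2, 256), 0.3, np.random.default_rng(1))   # different seed

print(np.array_equal(mask_a, mask_b), np.array_equal(mask_a, mask_c))
```

Passing the Generator explicitly keeps the randomness of one component from being disturbed by unrelated `np.random.*` calls elsewhere in a program.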
28B, T, d_model = 2, 30, 512

Tuple unpacking — assign three names in one line. Matches the upstream backbone's output dims.

EXECUTION STATE
B = 2 = Batch size — number of engines processed together. Tiny here for didactic clarity; production might use 32–256.
T = 30 = Window length — 30 historical cycles per engine. The architecture-wide constant chosen in §3.
d_model = 512 = Feature dimension produced by the BiLSTM (§9) and preserved by the attention block (§10). The funnel's input width.
30h_attn = np.random.randn(B, T, d_model).astype(np.float32)

Synthesise a fake attention output so this snippet can run standalone. In the real pipeline, h_attn is produced by self-attention.forward(bilstm_output) — not random.

EXECUTION STATE
📚 np.random.randn(*shape) = Standard-normal samples (mean 0, std 1). Variadic shape: pass dimensions as separate ints, NOT a tuple. randn(2,30,512) ≠ randn((2,30,512)) — the second form raises.
⬇ args: B, T, d_model = 2, 30, 512 = Output shape (2, 30, 512) — one Gaussian sample per cell, 30,720 cells total
📚 .astype(np.float32) = Cast from default float64 to float32. Halves memory, matches GPU's preferred dtype, and aligns with the rest of the project's tensors. Returns a NEW array (does not mutate in place).
⬆ h_attn = shape (2, 30, 512), dtype float32. Stand-in for backbone_so_far(input).
33shapes = [(512, 256), (256, 64), (64, 32)]

List of (in_dim, out_dim) tuples — one per funnel layer. Encodes the full architecture in one readable line: 512 → 256 → 64 → 32.

EXECUTION STATE
(512, 256) = Layer 1: 131,072 weights + 256 biases = 131,328 params
(256, 64) = Layer 2: 16,384 weights + 64 biases = 16,448 params
(64, 32) = Layer 3: 2,048 weights + 32 biases = 2,080 params
→ total funnel params = 131,328 + 16,448 + 2,080 = 149,856
34W_list = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]

List-comprehension that builds three weight matrices with Kaiming (He) initialisation — the correct scheme for ReLU networks.

EXECUTION STATE
📚 list comprehension = [expr for item in iterable]. Equivalent to a for-loop that appends to a list, written as a single expression. It is still an ordinary Python-level loop (three iterations here); the vectorised work happens inside the NumPy calls it makes.
📚 np.random.randn(*s) = Standard-normal samples. The * unpacks the tuple s into positional args: randn(*(512,256)) → randn(512, 256).
📚 np.sqrt(x) = Element-wise square root. Here applied to a scalar (2 / s[0]).
→ Kaiming formula = std = √(2 / fan_in). For Layer 1: √(2/512) ≈ 0.0625. Designed so Var(z) ≈ Var(x) under ReLU; without it activations vanish or explode through depth.
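→ why ÷ s[0] and not s[1]? = fan_in = number of inputs each output neuron sees = in_dim = s[0]. Scaling by fan_in keeps the forward activation variance stable. Using s[1] (fan_out) instead gives Kaiming's fan_out mode, which stabilises the backward gradient variance. Glorot/Xavier, tuned for tanh, is a different scheme again: std = √(2 / (fan_in + fan_out)).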
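⬆ W_list = Python list of 3 ndarrays with shapes (512,256), (256,64), (64,32). The cast to float32 happens before the multiply by np.sqrt(...), which returns a NumPy float64 scalar, so under NumPy 2 promotion rules the products come out float64; NumPy 1.x keeps them float32.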
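The variance-preservation claim can be checked empirically. The sketch below pushes unit-variance random inputs through the three Linear → ReLU layers (no dropout) and records the mean squared activation after each one, once with Kaiming scaling and once with a fixed small std of 0.01 chosen only for contrast. The batch size of 4096 and the helper name `second_moments` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(512, 256), (256, 64), (64, 32)]

def second_moments(weight_std):
    """Mean squared activation after each Linear -> ReLU layer (no dropout)."""
    h = rng.standard_normal((4096, 512))          # unit-variance stand-in input
    out = []
    for fan_in, fan_out in shapes:
        W = rng.standard_normal((fan_in, fan_out)) * weight_std(fan_in)
        h = np.maximum(h @ W, 0)
        out.append(float((h ** 2).mean()))
    return out

kaiming = second_moments(lambda fan_in: np.sqrt(2 / fan_in))
tiny = second_moments(lambda fan_in: 0.01)        # fixed small std, for contrast

print([round(v, 3) for v in kaiming])
print([f"{v:.1e}" for v in tiny])
```

With Kaiming scaling the mean squared activation stays near 1 through all three layers, while the fixed small std shrinks it by orders of magnitude per layer.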
35b_list = [np.zeros(s[1], dtype=np.float32) for s in shapes]

Build the matching list of bias vectors, all zero-initialised. Standard for FC layers — non-zero bias init rarely helps and can slow convergence.

EXECUTION STATE
📚 np.zeros(shape, dtype) = Allocate a new array filled with 0.0. shape can be int (1-D) or tuple (n-D); dtype defaults to float64 — we override to float32 to match the weights.
⬇ arg 1: s[1] = out_dim of the layer — 256, then 64, then 32. Bias has one entry per output neuron.
⬇ arg 2: dtype=np.float32 = Match the dtype of W and h_attn. Mixed-precision (float32 + float64) would silently up-cast every op to float64.
⬆ b_list = Python list of 3 float32 zero-vectors, shapes (256,), (64,), (32,).
37shared = shared_features(h_attn, W_list, b_list, dropout_p=0.3)

Run the funnel end-to-end on h_attn with the three weight/bias pairs. dropout_p is forwarded through to every fc_block call inside the loop.

EXECUTION STATE
⬇ h_attn = shape (2, 30, 512)
⬇ W_list, b_list = lists of length 3
⬇ dropout_p=0.3 = 30% drop probability per layer
⬆ shared = shape (2, 32) — final per-engine 32-D summary read by both task heads
38print("h_attn.shape :", h_attn.shape)

Sanity-check the input shape. Catches off-by-one in B/T/d_model or accidental transposes early.

EXECUTION STATE
Output = h_attn.shape : (2, 30, 512)
39print("shared.shape :", shared.shape)

Confirm the funnel collapses time and projects to 32 dims. (B, T, 512) → (B, 32).

EXECUTION STATE
Output = shared.shape : (2, 32)
40print("shared dtype :", shared.dtype)

Inspect the dtype. h_attn and b are float32, the dropout mask is cast to float32 by .astype(np.float32), and dividing a float32 array by the Python float (1 - dropout_p) keeps float32. The float64 comes from the weight initialisation on line 34: np.sqrt(2 / s[0]) returns a NumPy float64 scalar, and under NumPy 2 promotion rules multiplying a float32 array by it yields float64 weights, which then promote every later op. NumPy 1.x used value-based casting for scalars and would print float32 here.

EXECUTION STATE
Output = shared dtype : float64
→ fix to keep float32 = On line 34, scale first and cast last: W_list = [(np.random.randn(*s) * np.sqrt(2 / s[0])).astype(np.float32) for s in shapes]. No change to fc_block is needed.
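A hedged sketch of a float32-preserving variant. It assumes the widening to guard against is a NumPy float64 scalar (such as the value returned by `np.sqrt`) meeting float32 weights, and avoids it by casting after the scaling. The function name `fc_block_f32` and the use of a module-level Generator are our choices for illustration, not part of the original listing.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_block_f32(x, W, b, dropout_p=0.3, training=True):
    """Linear -> ReLU -> inverted dropout, keeping float32 throughout."""
    z = np.maximum(x @ W + b, 0)
    if training and dropout_p > 0:
        mask = (rng.random(z.shape) > dropout_p).astype(np.float32)
        z = z * mask / (1 - dropout_p)   # a Python float divisor does not widen float32
    return z

x = rng.standard_normal((2, 512)).astype(np.float32)
# Scale first, cast last: the float64 scalar from np.sqrt never reaches the stored weights.
W = (rng.standard_normal((512, 256)) * np.sqrt(2 / 512)).astype(np.float32)
b = np.zeros(256, dtype=np.float32)

out = fc_block_f32(x, W, b)
print(W.dtype, out.dtype, out.shape)
```

Checking `dtype` right after initialisation, rather than at the end of the pipeline, makes this kind of silent promotion much easier to locate.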
41print("shared[0,:5] :", shared[0, :5].round(3).tolist())

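Print the first 5 features of engine 0. With ReLU + dropout, several entries are exactly 0. Because np.random.seed(0) fixes the whole global random stream, including the random_sample calls inside dropout, these values are identical on every run; they differ from the eval-mode reference only because the dropout masks and the 1/(1 − p) rescale are applied.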

EXECUTION STATE
📚 .round(3) = Round each element to 3 decimal places. Returns a new ndarray; doesn't mutate the original.
📚 .tolist() = Convert ndarray → nested Python list. Useful for printing because it shows full precision without NumPy's truncating formatter.
Output (this run, dropout active) = shared[0,:5] : [6.518, 0.881, 0.0, 0.0, 2.837]
Eval-mode (no dropout) reference = shared[0,:5] : [2.846, 0.446, 0.0, 0.148, 0.123]
1import numpy as np
2
3
4def fc_block(x, W, b, dropout_p=0.3, training=True):
5    """Fully-connected block: Linear → ReLU → Dropout."""
6    z = x @ W + b                                  # (..., out)
7    z = np.maximum(z, 0)                            # ReLU
8    if training and dropout_p > 0:
9        mask = (np.random.random_sample(z.shape) > dropout_p).astype(np.float32)
10        z = z * mask / (1 - dropout_p)
11    return z
12
13
14# Backbone-so-far output is (B, T, 512). Take the LAST cycle's features
15# as the shared per-engine summary, then funnel down to 32 dims.
16def shared_features(h_attn: np.ndarray,
17                    Wfc: list, bfc: list,
18                    dropout_p: float = 0.3) -> np.ndarray:
19    """h_attn: (B, T, 512). Returns (B, 32)."""
20    h = h_attn[:, -1, :]                           # (B, 512) - last cycle pooling
21    for W, b in zip(Wfc, bfc):
22        h = fc_block(h, W, b, dropout_p=dropout_p)
23    return h
24
25
26# ----- Run on (B, T, 512) input -----
27np.random.seed(0)
28B, T, d_model = 2, 30, 512
29
30h_attn = np.random.randn(B, T, d_model).astype(np.float32)
31
32# 512 → 256 → 64 → 32
33shapes  = [(512, 256), (256, 64), (64, 32)]
34W_list  = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]
35b_list  = [np.zeros(s[1], dtype=np.float32) for s in shapes]
36
37shared = shared_features(h_attn, W_list, b_list, dropout_p=0.3)
38print("h_attn.shape :", h_attn.shape)               # (2, 30, 512)
39print("shared.shape :", shared.shape)               # (2, 32)
40print("shared dtype :", shared.dtype)
41print("shared[0,:5] :", shared[0, :5].round(3).tolist())

PyTorch: nn.Sequential FC Stack

A few-line nn.Module wrapping the funnel
🐍fc_funnel_torch.py
1import torch

PyTorch's top-level module. Provides torch.Tensor (the GPU-accelerated ndarray), automatic differentiation (autograd), and the random-number generator we seed below. Everything else (nn, optim, utils) is namespaced under it.

EXECUTION STATE
torch = Library: tensors with autograd, CUDA/MPS device support, optimisers, distributed training, JIT compiler
2import torch.nn as nn

torch.nn is the layer/module library — Linear, ReLU, Dropout, Sequential, Conv2d, LSTM, etc. All inherit from nn.Module which provides parameter tracking, state_dict, .to(device), .train()/.eval() mode flags. Aliased as 'nn' by convention.

EXECUTION STATE
nn.Module = Base class — every layer or model subclasses this. Auto-registers nn.Parameter and nn.Module attributes for gradient tracking.
nn.Linear / nn.ReLU / nn.Dropout = The three building blocks used below
nn.Sequential = Container that runs a list of sub-modules in order — used on line 20 to chain the funnel layers
5class FCFunnel(nn.Module):

Define the funnel as a subclass of nn.Module. This makes it a first-class PyTorch model — its parameters are auto-discovered by .parameters(), it can be moved to GPU with .to('cuda'), saved with state_dict(), and toggled between train/eval mode.

EXECUTION STATE
→ why subclass nn.Module? = Autograd needs to know which tensors are learnable. Storing nn.Linear/ReLU/Dropout as attributes of an nn.Module subclass auto-registers them so .parameters(), .train(), .eval(), .to(device) all 'just work'.
11def __init__(self, d_model=512, hidden=(256, 64), out_dim=32, dropout_p=0.3):

Constructor. Four knobs let you reuse this class for any funnel shape — only defaults are wired for the paper's 512 → 256 → 64 → 32 design.

EXECUTION STATE
⬇ input: self = The instance being constructed. PyTorch hooks (parameter registration, hook lists, _modules dict) live on self.
⬇ input: d_model = 512 = Width of the input — must equal the output width of the upstream attention block. Sets fan_in of the first Linear.
⬇ input: hidden = (256, 64) = Tuple of intermediate widths. Loop on line 16 turns each one into a Linear→ReLU→Dropout block. Tuple (not list) signals 'fixed architecture, don't mutate'.
⬇ input: out_dim = 32 = Width of the final shared feature vector. Both task heads consume this dimensionality.
⬇ input: dropout_p = 0.3 = Drop probability for every Dropout layer in the stack. Same value as the NumPy version above.
13super().__init__()

Call nn.Module.__init__() FIRST. This sets up the internal _parameters, _modules and _buffers dicts. nn.Module intercepts attribute assignment to register parameters and sub-modules in those dicts, so assigning a layer before this call (or skipping the call) raises AttributeError: cannot assign module before Module.__init__() call.

EXECUTION STATE
📚 super() = Returns a proxy for the parent class — here nn.Module. super().__init__() == nn.Module.__init__(self).
14layers = []

Plain Python list to accumulate sub-modules. We'll wrap it in nn.Sequential on line 20. Building a list first (instead of appending directly to nn.Sequential) is the idiomatic, mistake-free pattern.

EXECUTION STATE
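layers = [] (empty Python list, NOT a PyTorch module yet). Modules held only in a plain list are not registered as children of the parent module, so .parameters(), .to(device) and .train()/.eval() would miss them; registration happens once the list is wrapped in nn.Sequential / nn.ModuleList / nn.ModuleDict and assigned to self.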
15prev = d_model

Track the output width of the most-recently-added Linear so the NEXT Linear knows its in_features. Starts at d_model (=512) because the first Linear receives the raw input.

EXECUTION STATE
prev = 512 initially — gets overwritten to 256, then 64 inside the loop
16for h in hidden:

Iterate the (256, 64) tuple — two iterations build the first two FC blocks. The third (64 → 32) is added separately on line 19 because its out_dim is the user-specified out_dim, not a hidden size.

LOOP TRACE · 2 iterations
Iter 1 — h = 256
prev (before) = 512
block added = [Linear(512,256), ReLU(inplace=True), Dropout(0.3)]
len(layers) after = 3
Iter 2 — h = 64
prev (before) = 256
block added = [Linear(256,64), ReLU(inplace=True), Dropout(0.3)]
len(layers) after = 6
17layers += [nn.Linear(prev, h), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]

Append three sub-modules in one go. += on a list is in-place extend, so this avoids creating a new list every iteration.

EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias=True) = Affine layer: y = x @ W.T + b. Stores W of shape (out, in) and b of shape (out,) as nn.Parameters. bias=True is the default, matching the explicit b in the NumPy snippet; the only difference is the weight layout, (out, in) here versus (in, out) there.
→ arg 1: prev = in_features — width of the previous layer. 512 on iter 1, 256 on iter 2.
→ arg 2: h = out_features — current hidden width. 256 on iter 1, 64 on iter 2.
📚 nn.ReLU(inplace=True) = ReLU activation. inplace=True overwrites the input tensor instead of allocating a new one. This saves memory but is illegal if an earlier op needs the overwritten values for its backward pass. Linear's backward needs its own input and W, not its output, and ReLU's gradient can be recovered from its output, so overwriting the Linear output is safe here.
📚 nn.Dropout(p) = Inverted dropout. During .train(): mask drops fraction p, survivors scaled by 1/(1-p). During .eval(): identity. Toggle with model.train() / model.eval().
→ arg: p = dropout_p = 0.3 — same probability for every block in the funnel
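The train/eval behaviour of nn.Dropout is easy to observe in isolation. A minimal sketch on a vector of ones (the vector length is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
x = torch.ones(1000)

drop.train()
y_train = drop(x)     # each entry is either 0 or 1 / 0.7

drop.eval()
y_eval = drop(x)      # identity in eval mode

print(sorted(set(round(v, 4) for v in y_train.tolist())))
print(torch.equal(y_eval, x))
```

Forgetting to call `.eval()` before validation is a common source of noisy metrics, since every forward pass then uses a different random mask.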
18prev = h

Advance prev so the NEXT iteration's nn.Linear(prev, …) gets the correct in_features. Without this update, every Linear would have in_features=d_model and the shapes would mismatch.

19layers += [nn.Linear(prev, out_dim), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]

Add the final block separately so its out_dim is the user-controlled output width (32), not a hidden size. After this line, layers contains 9 sub-modules total: 3 blocks × (Linear + ReLU + Dropout).

EXECUTION STATE
→ nn.Linear(prev, out_dim) = prev = 64 (last hidden size), out_dim = 32. Shape (32, 64) for the weight matrix.
→ why ReLU+Dropout on the FINAL layer too? = The 32-D output is consumed by downstream task heads (RUL regression, health classification). Keeping ReLU + Dropout here regularises the SHARED representation. Some designs drop the last activation to allow negative shared features — paper ablation chose to keep it.
len(layers) after = 9 — three Linear, three ReLU, three Dropout
20self.fc = nn.Sequential(*layers)

Wrap the 9 sub-modules in an nn.Sequential. Assigning to self.fc registers all of them as children of FCFunnel — they appear in .parameters(), .state_dict(), .to(device), and .train()/.eval() now propagates to them.

EXECUTION STATE
📚 nn.Sequential(*modules) = Container that calls each child in order: forward(x) = layer_n(...(layer_1(layer_0(x)))). Children are accessible by integer index (self.fc[0]) or via .named_children().
📚 *layers (unpack) = Python's argument-unpacking. nn.Sequential expects positional module arguments — the * spreads the 9-element list into 9 separate args. Sequential([list]) (no star) raises a TypeError.
self.fc structure =
fc.0  Linear(512→256)
fc.1  ReLU
fc.2  Dropout(0.3)
fc.3  Linear(256→64)
fc.4  ReLU
fc.5  Dropout(0.3)
fc.6  Linear(64→32)
fc.7  ReLU
fc.8  Dropout(0.3)
22def forward(self, x: torch.Tensor) -> torch.Tensor:

The forward pass. nn.Module.__call__ invokes this AND runs the registered hooks (pre-forward, forward, etc.). Always call the model as model(x), NEVER model.forward(x) directly — direct calls bypass hooks.

EXECUTION STATE
⬇ input: self = The FCFunnel instance — gives access to self.fc
⬇ input: x = torch.Tensor of shape (B, T, 512). Output of the upstream BiLSTM + Attention backbone.
→ x example = shape torch.Size([2, 30, 512]) — batch of 2 engines × 30 cycles × 512 features
⬆ returns = torch.Tensor of shape (B, 32) — the 32-D shared feature vector per engine
23# x: (B, T, 512)

Shape comment so a future reader doesn't have to trace back to the docstring on line 6 to confirm the input contract.

24last = x[:, -1, :]

Last-cycle pooling. Identical semantics to the NumPy version: keep all engines, pick the LAST of the 30 time-steps, keep all 512 features.

EXECUTION STATE
📚 PyTorch tensor slicing = Same syntax as NumPy. x[a, b, c] indexes axes 0/1/2; ':' = whole axis; int = pick one index (axis is reduced).
→ axis 0 = ':' = Keep batch dim B (=2)
→ axis 1 = -1 = Pick index 29, the last of the T=30 cycles. This axis is reduced away.
→ axis 2 = ':' = Keep all 512 features
last.shape = torch.Size([2, 512])
→ why last cycle? = Same as NumPy: BiLSTM + self-attention have already mixed information across all 30 cycles. The last position contains a full-window summary; alternatives are x.mean(dim=1), attention pooling, or torch.cat([last, mean], dim=-1).
25return self.fc(last)

Push the (B, 512) last-cycle tensor through the 9-layer Sequential. nn.Sequential's forward iterates its children: Linear → ReLU → Dropout → Linear → ReLU → Dropout → Linear → ReLU → Dropout, returning the final (B, 32) tensor.

EXECUTION STATE
intermediate shapes (eval mode) =
fc[0] Linear  → (2, 256)
fc[1] ReLU    → (2, 256)
fc[2] Dropout → (2, 256)
fc[3] Linear  → (2,  64)
fc[4] ReLU    → (2,  64)
fc[5] Dropout → (2,  64)
fc[6] Linear  → (2,  32)
fc[7] ReLU    → (2,  32)
fc[8] Dropout → (2,  32)
⬆ return: shape = torch.Size([2, 32])
29torch.manual_seed(0)

Seed PyTorch's default RNG so weight init (inside nn.Linear) and dropout masks are reproducible. torch.manual_seed covers the CPU and all CUDA devices; bit-for-bit deterministic CUDA kernels additionally need torch.use_deterministic_algorithms(True).

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets the seed of the default PyTorch CPU generator (and CUDA generators if available). Returns the generator object so you can chain ops if needed.
⬇ arg: 0 = Any non-negative integer; 0 is conventional for examples
30funnel = FCFunnel(d_model=512, hidden=(256, 64), out_dim=32, dropout_p=0.3)

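Instantiate the model with the paper's defaults. By default nn.Linear draws both weights and biases from U(−1/√fan_in, 1/√fan_in) (the weights via kaiming_uniform_ with a=√5). That is narrower than the NumPy snippet's Kaiming-normal std of √(2/fan_in), and the biases are not zero, so the two versions match in architecture but not in initial values.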
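The default initialisation can be inspected directly. The sketch below assumes the behaviour of current PyTorch releases, where nn.Linear samples weights and biases uniformly within ±1/√fan_in, and then shows one way to overwrite a layer with the NumPy snippet's scheme (Kaiming-normal weights, zero biases) if closer parity is wanted.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 256)
bound = 1 / math.sqrt(layer.in_features)      # 1 / sqrt(512), about 0.0442

w_max = layer.weight.abs().max().item()
b_max = layer.bias.abs().max().item()
print(f"bound={bound:.4f}  max|W|={w_max:.4f}  max|b|={b_max:.4f}")

# Optional: re-initialise to mirror the NumPy version.
with torch.no_grad():
    nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
    layer.bias.zero_()

w_std = layer.weight.std().item()
print(f"std(W) after re-init = {w_std:.4f}")   # target is sqrt(2 / 512) = 0.0625
```

Whether to override the default is a design choice; the funnel trains with either, and the override mainly matters when comparing the NumPy and PyTorch versions number for number.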

EXECUTION STATE
⬇ d_model=512 = Input width — matches upstream attention output
⬇ hidden=(256, 64) = Two intermediate widths — produces the 512→256→64 funnel
⬇ out_dim=32 = Final shared dim
⬇ dropout_p=0.3 = Drop probability per layer
→ uses keyword args = Always pass these by name, not by position. Future-proof against signature reordering and self-documenting.
31x = torch.randn(2, 30, 512)

Synthesise a fake backbone output. In production this is the tensor returned by the BiLSTM + Attention modules.

EXECUTION STATE
📚 torch.randn(*size) = Standard-normal samples (mean 0, std 1). Variadic size like NumPy's randn. Returns a torch.float32 tensor on CPU by default; pass device='cuda' / dtype=torch.float16 to override.
⬇ args: 2, 30, 512 = Output shape torch.Size([2, 30, 512])
32y = funnel(x)

Run the model. funnel(x) is sugar for funnel.__call__(x), which invokes pre-forward hooks → forward(x) → forward hooks. Autograd wires up the computation graph because x and the model's parameters are tensors.

EXECUTION STATE
→ why funnel(x), not funnel.forward(x)? = Hooks. Calling .forward directly skips PyTorch's hook system. Always use the call syntax.
⬆ y = torch.Tensor of shape torch.Size([2, 32]); requires_grad=True (because funnel's parameters do).
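The difference between the call syntax and a direct .forward call can be demonstrated with a forward hook. A minimal sketch, assuming a single nn.Linear as a stand-in for the funnel and a hypothetical hook named `record_shape`:

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 32)
calls = []


def record_shape(module, inputs, output):
    """Forward hook: remember the shape of every output it sees."""
    calls.append(tuple(output.shape))


handle = layer.register_forward_hook(record_shape)
x = torch.randn(2, 512)

layer(x)                       # goes through __call__, so the hook fires
n_after_call = len(calls)

layer.forward(x)               # bypasses __call__, so the hook is skipped
n_after_forward = len(calls)

handle.remove()                # detach the hook when done
print(n_after_call, n_after_forward, calls)
```

The hook count stays the same after the direct .forward call, which is exactly the behaviour the annotation warns about.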
Line 34: print("input  :", tuple(x.shape))

Sanity-print the input shape. Wrapping in tuple(...) prints (2, 30, 512) instead of the more verbose torch.Size([2, 30, 512]).

EXECUTION STATE
📚 tuple(tensor.shape) = torch.Size is a subclass of tuple; explicit tuple() conversion strips the 'torch.Size' label for cleaner logs.
Output = input : (2, 30, 512)
Line 35: print("output :", tuple(y.shape))

Confirm the funnel reduces (2, 30, 512) to (2, 32). The two-axis collapse: time pooling kills axis 1, the FC stack projects axis 2 from 512 → 32.

EXECUTION STATE
Output = output : (2, 32)
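One thing the shape check does not reveal: a freshly constructed module is in training mode, so the dropout layers are active during this call. The sketch below assumes a single Linear → ReLU → Dropout block as a stand-in for the full funnel and a batch of 8 rather than 2, and compares repeated calls in train and eval mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(512, 32), nn.ReLU(), nn.Dropout(0.3))

x = torch.randn(8, 30, 512)
last = x[:, -1, :]                       # last-cycle pooling: (8, 512)

block.train()                            # the default mode after construction
y1, y2 = block(last), block(last)
train_outputs_match = torch.equal(y1, y2)   # fresh dropout mask on every call

block.eval()                             # dropout becomes the identity
with torch.no_grad():
    e1, e2 = block(last), block(last)
eval_outputs_match = torch.equal(e1, e2)

print(tuple(y1.shape), train_outputs_match, eval_outputs_match)
```

For inference or for comparing outputs across runs, call .eval() first; otherwise the 32-D features change from call to call.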
Line 36: print("# params:", sum(p.numel() for p in funnel.parameters()))

Count learnable parameters. nn.Linear has bias=True by default, so each layer contributes (in×out) weights + (out) biases. PyTorch stores W with shape (out, in), transposed relative to the NumPy version, but the param count is identical.

EXECUTION STATE
📚 model.parameters() = Generator yielding every nn.Parameter registered on the module (and recursively on its children). Used by optimisers: optim.Adam(model.parameters(), lr=1e-3).
📚 p.numel() = Tensor method: total number of elements = product of shape. For a (256, 512) weight matrix → 131,072.
→ generator-expression sum = sum(... for ... in ...) avoids materialising an intermediate list — efficient even for huge models.
Per-layer breakdown =
fc.0 Linear(512→256): 256×512 + 256 = 131,328
fc.3 Linear(256→64):  64×256 + 64  =  16,448
fc.6 Linear(64→32):   32×64 + 32   =   2,080
ReLU + Dropout layers: 0 params each
Output = # params: 149856
→ vs NumPy = Identical count (149,856). PyTorch's W is (out, in); NumPy uses (in, out). Same parameter, transposed convention.
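The per-layer breakdown can be reproduced without PyTorch. A small sketch in plain Python, where `linear_params` is a hypothetical helper name:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


widths = [512, 256, 64, 32]                      # the funnel: 512 -> 256 -> 64 -> 32
per_layer = [linear_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer, total)   # [131328, 16448, 2080] 149856
```

Almost 88% of the funnel's parameters sit in the first layer, so the first hidden width is the one that matters for model size.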
import torch
import torch.nn as nn


class FCFunnel(nn.Module):
    """Funnel from (B, T, 512) attention output to (B, 32) shared features.

    Uses LAST-cycle pooling (the bidirectional + attention output at the
    last cycle already integrates the whole window).
    """
    def __init__(self, d_model: int = 512, hidden: tuple = (256, 64),
                 out_dim: int = 32, dropout_p: float = 0.3):
        super().__init__()
        layers = []
        prev = d_model
        # One Linear -> ReLU -> Dropout block per hidden width
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]
            prev = h
        # Final block down to the shared feature dimension
        layers += [nn.Linear(prev, out_dim), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]
        self.fc = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, 512)
        last = x[:, -1, :]                          # (B, 512)
        return self.fc(last)                         # (B, 32)


# Use it
torch.manual_seed(0)
funnel = FCFunnel(d_model=512, hidden=(256, 64), out_dim=32, dropout_p=0.3)
x = torch.randn(2, 30, 512)
y = funnel(x)

print("input  :", tuple(x.shape))             # (2, 30, 512)
print("output :", tuple(y.shape))              # (2, 32)
print("# params:", sum(p.numel() for p in funnel.parameters()))
# # params: 149856
The 32-D shared feature is the LAST point both tasks see together. Everything in §11.2 (RUL head) and §11.3 (health head) starts from this 32-vector.
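As a rough illustration of what "both tasks see together" means, the sketch below attaches two placeholder heads to one 32-D tensor. The heads here are hypothetical single Linear layers, and the class count of 3 is an assumption made only for this example; the actual heads are defined in §11.2 and §11.3.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N_HEALTH_CLASSES = 3   # hypothetical; the real class count comes from the health head

# Stand-in for the funnel output: a batch of two 32-D shared feature vectors.
shared = torch.randn(2, 32, requires_grad=True)

rul_head = nn.Linear(32, 1)                      # placeholder regression head
health_head = nn.Linear(32, N_HEALTH_CLASSES)    # placeholder classification head

rul_pred = rul_head(shared).squeeze(-1)          # (2,)
health_logits = health_head(shared)              # (2, 3)

# Toy scalar objectives, used only to show that both heads send
# gradient back into the same 32 numbers.
(rul_pred.sum() + health_logits.sum()).backward()

print(tuple(rul_pred.shape), tuple(health_logits.shape), tuple(shared.grad.shape))
```

The gradient on the shared tensor is the sum of the contributions from both heads, which is where the two tasks compete for the 32 dimensions.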

Two Funnel Pitfalls

Pitfall 1: Skipping last-cycle pooling. To get one vector per engine without pooling, you would have to flatten (B, T, 512) to (B, T*512) before the first FC layer. With T = 30 that layer grows from 131,328 to 3,932,416 parameters, roughly 30 times larger. Always pool first.
Pitfall 2: An overly aggressive bottleneck. A 32-D shared vector is already small; shrinking it to 16 or 8 hurts both tasks. The paper's ablation found 32 to be the sweet spot.
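The cost in Pitfall 1 can be worked out with plain arithmetic. A short sketch, assuming the chapter's window length T = 30 and a first hidden width of 256:

```python
T, d_model, first_hidden = 30, 512, 256

# First FC layer after last-cycle pooling: input width is d_model.
pooled = d_model * first_hidden + first_hidden

# First FC layer after flattening the window: input width is T * d_model.
flattened = (T * d_model) * first_hidden + first_hidden

ratio = flattened / pooled
print(pooled, flattened, round(ratio, 1))
```

The flattened variant needs 3,932,416 parameters in its first layer alone, against 131,328 with pooling, and that single layer would outweigh the entire pooled funnel many times over.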
The point. Three FC blocks compress the (B, 30, 512) backbone output into a 32-D shared feature vector that both task heads consume. ~150k params.

Takeaway

  • FC funnel: 512 → 256 → 64 → 32. Three Linear + ReLU + Dropout blocks.
  • Last-cycle pooling. Bidirectional + attention mean the last cycle has full-window context.
  • 32-D shared vector. The architectural point where multi-task learning happens.
  • ~150k params. Small relative to the BiLSTM / attention upstream.