From a Wide River to a Single Tap
Up to this point the backbone has only widened or held the feature dimension: 17 → 64 (CNN) → 512 (BiLSTM) → 512 (Attention). All that capacity has gone into encoding the 30-cycle window as a rich representation. Now we go the other way and funnel down to a 32-dimensional summary that BOTH task heads will read from.
The FC funnel 512 → 256 → 64 → 32 is the LAST shared stage of the backbone. Everything after it is task-specific: the RUL regression head reads from these 32 dims, and the health classification head reads from the SAME 32 dims. That shared point is what makes the model a multi-task network in any meaningful sense.
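To make the sharing concrete, here is a minimal NumPy sketch of two heads reading the same 32-D vector. The head weights (`W_rul`, `W_cls`), the class count `n_classes = 3`, and the random stand-in for the funnel output are all assumptions for illustration; the text does not specify the heads' sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 4
# Stand-in for the funnel output; in the real model this comes from the FC funnel.
shared = rng.standard_normal((B, 32)).astype(np.float32)

# Hypothetical head parameters (sizes are assumptions, not taken from the text).
n_classes = 3
W_rul = rng.standard_normal((32, 1)).astype(np.float32) * 0.1
b_rul = np.zeros(1, dtype=np.float32)
W_cls = rng.standard_normal((32, n_classes)).astype(np.float32) * 0.1
b_cls = np.zeros(n_classes, dtype=np.float32)

# Both heads consume the SAME (B, 32) tensor.
rul_pred = (shared @ W_rul + b_rul).squeeze(-1)      # (B,)   one RUL value per engine
logits = shared @ W_cls + b_cls                      # (B, n_classes)

# Numerically stable softmax over the class axis.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(rul_pred.shape, probs.shape)  # (4,) (4, 3)
```

Because both heads depend on `shared`, gradients from both losses would flow back into the same funnel weights during training.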
The 512 → 256 → 64 → 32 Funnel
| Layer | Shape | Params (W + b) |
|---|---|---|
| Input (last cycle) | (B, 512) | — |
| Linear → ReLU → Drop | (B, 256) | 131,328 |
| Linear → ReLU → Drop | (B, 64) | 16,448 |
| Linear → ReLU → Drop | (B, 32) | 2,080 |
| Total funnel | — | ~149,856 |
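The parameter column can be reproduced with a few lines of arithmetic. This is only a check of the table above; the helper name `linear_params` is made up for the example.

```python
# (in_features, out_features) for each Linear layer in the funnel
layer_dims = [(512, 256), (256, 64), (64, 32)]


def linear_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_dims]
total = sum(per_layer)

print(per_layer)  # [131328, 16448, 2080]
print(total)      # 149856
```

The first layer holds almost 88% of the funnel's parameters, which is typical when the widest projection comes first.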
Time Pooling: Last Cycle vs Mean
We have a (B, T=30, 512) tensor and need a (B, 512) summary per engine. Four options:
| Pooling | Formula | Notes |
|---|---|---|
| Last cycle | h[:, -1, :] | Default. Bidirectional + attention already integrate the window |
| Mean over time | h.mean(dim=1) | Common alternative; works well on FD001 |
| Attention pooling | softmax(W·h) · h | Adds parameters; sometimes helps multi-condition data |
| Last + mean concat | concat(h[:,-1], h.mean(dim=1)) | Doubles input dim of next layer |
We use last-cycle pooling because the BiLSTM's last timestep contains forward context from the entire window, and the attention layer further mixes information across all cycles. Mean-pooling is a reasonable alternative that matters more on shorter sequences.
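The four pooling options from the table can be compared side by side in NumPy. This is a sketch on random data: the scoring vector `w` for attention pooling is a hypothetical stand-in for a learned parameter, and NumPy's `axis=1` plays the role of the `dim=1` used in the table's PyTorch-style formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D = 2, 30, 512
h = rng.standard_normal((B, T, D)).astype(np.float32)   # stand-in for attention output

# 1) Last-cycle pooling
last = h[:, -1, :]                                       # (B, D)

# 2) Mean over time
mean = h.mean(axis=1)                                    # (B, D)

# 3) Attention pooling with a hypothetical scoring vector w
w = (rng.standard_normal(D) / np.sqrt(D)).astype(np.float32)
scores = h @ w                                           # (B, T)
scores -= scores.max(axis=1, keepdims=True)              # stabilise the softmax
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attn = (alpha[:, :, None] * h).sum(axis=1)               # (B, D)

# 4) Last + mean concatenation doubles the width
concat = np.concatenate([last, mean], axis=1)            # (B, 2 * D)

for name, v in [("last", last), ("mean", mean), ("attn", attn), ("concat", concat)]:
    print(name, v.shape)
```

Only the concatenation changes the input width of the next layer (1024 instead of 512), so it is the one option that would require resizing the first Linear in the funnel.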
Python: The FC Funnel From Scratch
NumPy is the numerical-computing backbone of Python. It provides ndarray (N-dimensional array) — a contiguous, typed buffer with vectorised math implemented in C/BLAS. Every operation in this file (matrix multiply via @, element-wise ReLU via np.maximum, broadcasting in dropout) runs as compiled C, not Python loops. We alias it 'np' by universal convention.
The atomic fully-connected block: Linear → ReLU → (inverted) Dropout. Called three times by shared_features() to build the 512 → 256 → 64 → 32 funnel.
One-line summary of the three operations the block performs, in order. This pattern (affine → activation → regularisation) is the canonical FC unit in modern deep nets.
The Linear (affine) step. Compute z = xW + b, the foundation of every neural network layer. Each output neuron is a weighted sum of all input features plus a bias.
ReLU activation: replace every negative element with 0, leave positives untouched. The element-wise non-linearity is what lets stacked Linear layers learn non-linear functions.
Guard so dropout only runs during training. At inference (training=False) we want the deterministic activations the network produced, with no random masking.
Generate a Bernoulli(1 − p) mask: each entry is 1 with probability 0.7 and 0 with probability 0.3. Multiplying z by this mask drops 30% of activations to zero.
Inverted dropout: zero the dropped units AND divide the survivors by (1 − p) so that the expected value of z is unchanged. The win is that at inference we just skip dropout entirely — no rescaling needed.
Hand the regularised activation back to the caller (shared_features), which feeds it into the next FC block.
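The claim that inverted dropout leaves the expected activation unchanged can be checked numerically. The sketch below uses a constant input so the mean is easy to read, and NumPy's `default_rng` generator rather than the global `np.random` functions used in the listing; both choices are just for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
z = np.ones(200_000, dtype=np.float32)        # constant activations, mean exactly 1

mask = (rng.random(z.shape) > p).astype(np.float32)   # 1 with prob 1 - p, else 0
z_drop = z * mask / (1 - p)                            # inverted dropout

keep_fraction = float(mask.mean())
mean_after = float(z_drop.mean())

print("kept fraction :", round(keep_fraction, 3))   # close to 1 - p = 0.7
print("mean after    :", round(mean_after, 3))      # close to the original mean of 1.0
```

Survivors are scaled to 1 / 0.7 ≈ 1.43, which is exactly what compensates for the 30% of units set to zero.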
The funnel itself: pool over time, then run three FC blocks that compress 512 → 256 → 64 → 32. The output is the 32-D shared representation that BOTH task heads (RUL regression + health classification) consume.
Continuation of the def signature. Type hints document that these two arguments are lists (of weights and biases) — not NumPy arrays. Hints are advisory; Python doesn't enforce them at runtime.
Continuation of the signature. Default dropout_p=0.3, return type annotated as np.ndarray. The trailing colon ends the def header — body starts on the next line.
Records the input shape contract and output shape contract — the two facts you need to plug this function into the upstream and downstream pipeline.
Time pooling: keep only the LAST cycle from every engine. The 3-axis input collapses to 2 axes — one summary vector per engine.
Iterate the three (W, b) pairs in lock-step. zip stops when the shorter iterable is exhausted — both lists have length 3, so we get exactly three loop iterations: Layer 1, Layer 2, Layer 3.
Apply one Linear → ReLU → Dropout block. Reassigns h so the next iteration sees the lower-dimensional output of this iteration — the funnel narrows step by step.
After three iterations, h has shape (B, 32) — the 32-D shared feature vector per engine.
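Note that `shared_features` as written always calls `fc_block` with its default `training=True`, so dropout stays active. One possible way to get deterministic inference is to forward a flag, sketched below. `shared_features_with_mode` is a hypothetical variant, not part of the original listing, and the weights are cast to float32 after scaling for convenience.

```python
import numpy as np


def fc_block(x, W, b, dropout_p=0.3, training=True):
    """Linear → ReLU → inverted dropout (same logic as the block in the text)."""
    z = np.maximum(x @ W + b, 0)
    if training and dropout_p > 0:
        mask = (np.random.random_sample(z.shape) > dropout_p).astype(z.dtype)
        z = z * mask / (1 - dropout_p)
    return z


def shared_features_with_mode(h_attn, Wfc, bfc, dropout_p=0.3, training=True):
    """Hypothetical variant: forwards `training` so dropout can be disabled."""
    h = h_attn[:, -1, :]                       # last-cycle pooling, (B, 512)
    for W, b in zip(Wfc, bfc):
        h = fc_block(h, W, b, dropout_p=dropout_p, training=training)
    return h


np.random.seed(0)
h_attn = np.random.randn(2, 30, 512).astype(np.float32)
shapes = [(512, 256), (256, 64), (64, 32)]
W_list = [(np.random.randn(*s) * np.sqrt(2 / s[0])).astype(np.float32) for s in shapes]
b_list = [np.zeros(s[1], dtype=np.float32) for s in shapes]

eval_a = shared_features_with_mode(h_attn, W_list, b_list, training=False)
eval_b = shared_features_with_mode(h_attn, W_list, b_list, training=False)

print(eval_a.shape)                    # (2, 32)
print(np.array_equal(eval_a, eval_b))  # True: no random masking at inference
```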
Seed NumPy's global PRNG so every run produces identical pseudo-random numbers. Without this, np.random.randn (line 30) and np.random.random_sample inside dropout would yield different values every run.
Tuple unpacking — assign three names in one line. Matches the upstream backbone's output dims.
Synthesise a fake attention output so this snippet can run standalone. In the real pipeline, h_attn is produced by self-attention.forward(bilstm_output) — not random.
List of (in_dim, out_dim) tuples — one per funnel layer. Encodes the full architecture in one readable line: 512 → 256 → 64 → 32.
List-comprehension that builds three weight matrices with Kaiming (He) initialisation — the correct scheme for ReLU networks.
Build the matching list of bias vectors, all zero-initialised. Standard for FC layers — non-zero bias init rarely helps and can slow convergence.
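Why scale by sqrt(2 / fan_in)? With unit-variance inputs, the pre-activation variance becomes about 2, and ReLU discards roughly half of that, so the post-ReLU second moment stays near 1. A quick numerical sketch on random data (the sample size of 2,000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256

# He (Kaiming) normal init: std = sqrt(2 / fan_in)
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2 / fan_in)
x = rng.standard_normal((2_000, fan_in))      # unit-variance inputs

z = x @ W                                     # pre-activations, variance ≈ 2
a = np.maximum(z, 0)                          # ReLU zeroes the negative half

w_var = float(W.var())
z_var = float(z.var())
a_second_moment = float((a ** 2).mean())

print("Var(W)      :", round(w_var, 5), "target", round(2 / fan_in, 5))
print("Var(z)      :", round(z_var, 2))            # roughly 2
print("E[relu(z)^2]:", round(a_second_moment, 2))  # roughly 1, matching E[x^2]
```

Keeping the second moment near 1 from layer to layer is what prevents activations from shrinking or blowing up as more ReLU blocks are stacked.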
Run the funnel end-to-end on h_attn with the three weight/bias pairs. dropout_p is forwarded through to every fc_block call inside the loop.
Sanity-check the input shape. Catches off-by-one in B/T/d_model or accidental transposes early.
Confirm the funnel collapses time and projects to 32 dims. (B, T, 512) → (B, 32).
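Inspect the dtype. Dropout is not what decides it: np.random.random_sample does return float64, but fc_block casts the mask with .astype(np.float32), and dividing by (1 - dropout_p), a Python float, never upcasts a float32 array. The version-dependent step is the weight init on line 34: np.sqrt(2 / s[0]) returns a np.float64 scalar, which leaves a float32 array at float32 under NumPy 1.x value-based casting but promotes it to float64 under NumPy 2 (NEP 50). So the print shows float32 on NumPy 1.x and float64 on NumPy 2. To keep the entire pipeline float32 on every version, cast W after scaling or multiply by a Python float such as (2 / s[0]) ** 0.5.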
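The promotion rules involved are easy to probe in isolation. The sketch below only asserts behaviour that is the same on NumPy 1.x and 2.x; the one case that differs between versions is printed but flagged as version-dependent.

```python
import numpy as np

dropout_p = 0.3
z = np.ones((2, 4), dtype=np.float32)

raw = np.random.random_sample(z.shape)            # float64 by construction
mask = (raw > dropout_p).astype(np.float32)       # explicit cast, as in fc_block

out = z * mask / (1 - dropout_p)                  # Python float scalar: no upcast
out_f64 = z * (raw > dropout_p).astype(np.float64) / (1 - dropout_p)  # float64 ARRAY: upcast

w32 = np.ones((4, 3), dtype=np.float32)
scaled_py = w32 * (2 / 512) ** 0.5                # Python float scale keeps float32
scaled_np = w32 * np.sqrt(2 / 512)                # np.float64 scalar: version-dependent

print(raw.dtype, mask.dtype)      # float64 float32
print(out.dtype, out_f64.dtype)   # float32 float64
print(scaled_py.dtype)            # float32
print(scaled_np.dtype)            # float32 on NumPy 1.x, float64 on NumPy 2
```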
Print the first 5 features of engine 0. With ReLU + dropout, several entries are typically exactly 0. Because np.random.seed(0) fixes the whole global random stream, including every random_sample call inside dropout, the printed values are identical from run to run; the masks differ only between successive fc_block calls within one run.
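A single seed call pins down every later draw from the global generator, which is simple to confirm. In this sketch `sample_masks` is a hypothetical helper that mimics three consecutive dropout layers drawing masks after one `np.random.seed` call.

```python
import numpy as np


def sample_masks(seed, p=0.3, n_layers=3, shape=(2, 32)):
    """Seed once, then draw one dropout mask per layer from the global stream."""
    np.random.seed(seed)
    return [np.random.random_sample(shape) > p for _ in range(n_layers)]


run_a = sample_masks(0)
run_b = sample_masks(0)

same_across_runs = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
differs_between_layers = not np.array_equal(run_a[0], run_a[1])

print("same masks across runs     :", same_across_runs)        # True
print("different masks per layer  :", differs_between_layers)  # True
```

Reproducibility therefore does not require re-seeding before each dropout call; re-seeding there would instead make every layer reuse the same mask pattern.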
 1  import numpy as np
 2
 3
 4  def fc_block(x, W, b, dropout_p=0.3, training=True):
 5      """Fully-connected block: Linear → ReLU → Dropout."""
 6      z = x @ W + b                                # (..., out)
 7      z = np.maximum(z, 0)                         # ReLU
 8      if training and dropout_p > 0:
 9          mask = (np.random.random_sample(z.shape) > dropout_p).astype(np.float32)
10          z = z * mask / (1 - dropout_p)
11      return z
12
13
14  # Backbone-so-far output is (B, T, 512). Take the LAST cycle's features
15  # as the shared per-engine summary, then funnel down to 32 dims.
16  def shared_features(h_attn: np.ndarray,
17                      Wfc: list, bfc: list,
18                      dropout_p: float = 0.3) -> np.ndarray:
19      """h_attn: (B, T, 512). Returns (B, 32)."""
20      h = h_attn[:, -1, :]                         # (B, 512) - last cycle pooling
21      for W, b in zip(Wfc, bfc):
22          h = fc_block(h, W, b, dropout_p=dropout_p)
23      return h
24
25
26  # ----- Run on (B, T, 512) input -----
27  np.random.seed(0)
28  B, T, d_model = 2, 30, 512
29
30  h_attn = np.random.randn(B, T, d_model).astype(np.float32)
31
32  # 512 → 256 → 64 → 32
33  shapes = [(512, 256), (256, 64), (64, 32)]
34  W_list = [np.random.randn(*s).astype(np.float32) * np.sqrt(2 / s[0]) for s in shapes]
35  b_list = [np.zeros(s[1], dtype=np.float32) for s in shapes]
36
37  shared = shared_features(h_attn, W_list, b_list, dropout_p=0.3)
38  print("h_attn.shape :", h_attn.shape)   # (2, 30, 512)
39  print("shared.shape :", shared.shape)   # (2, 32)
40  print("shared dtype :", shared.dtype)
41  print("shared[0,:5] :", shared[0, :5].round(3).tolist())
PyTorch: nn.Sequential FC Stack
PyTorch's top-level module. Provides torch.Tensor (the GPU-accelerated ndarray), automatic differentiation (autograd), and the random-number generator we seed below. Everything else (nn, optim, utils) is namespaced under it.
torch.nn is the layer/module library — Linear, ReLU, Dropout, Sequential, Conv2d, LSTM, etc. All inherit from nn.Module which provides parameter tracking, state_dict, .to(device), .train()/.eval() mode flags. Aliased as 'nn' by convention.
Define the funnel as a subclass of nn.Module. This makes it a first-class PyTorch model — its parameters are auto-discovered by .parameters(), it can be moved to GPU with .to('cuda'), saved with state_dict(), and toggled between train/eval mode.
Constructor. Four knobs let you reuse this class for any funnel shape — only defaults are wired for the paper's 512 → 256 → 64 → 32 design.
Call nn.Module.__init__() FIRST. This sets up the internal _parameters, _modules and _buffers dicts. nn.Module intercepts attribute assignment to register parameters and sub-modules, which only works once those dicts exist; skip the call (or assign a module before it) and PyTorch raises an AttributeError such as 'cannot assign module before Module.__init__() call'.
Plain Python list to accumulate sub-modules. We'll wrap it in nn.Sequential on line 20. Building a list first (instead of appending directly to nn.Sequential) is the idiomatic, mistake-free pattern.
Track the output width of the most-recently-added Linear so the NEXT Linear knows its in_features. Starts at d_model (=512) because the first Linear receives the raw input.
Iterate the (256, 64) tuple — two iterations build the first two FC blocks. The third (64 → 32) is added separately on line 19 because its out_dim is the user-specified out_dim, not a hidden size.
Append three sub-modules in one go. += on a list is in-place extend, so this avoids creating a new list every iteration.
Advance prev so the NEXT iteration's nn.Linear(prev, …) gets the correct in_features. Without this update, every Linear would have in_features=d_model and the shapes would mismatch.
Add the final block separately so its out_dim is the user-controlled output width (32), not a hidden size. After this line, layers contains 9 sub-modules total: 3 blocks × (Linear + ReLU + Dropout).
Wrap the 9 sub-modules in an nn.Sequential. Assigning to self.fc registers all of them as children of FCFunnel — they appear in .parameters(), .state_dict(), .to(device), and .train()/.eval() now propagates to them.
| Index | Module |
|---|---|
| fc.0 | Linear(512→256) |
| fc.1 | ReLU |
| fc.2 | Dropout(0.3) |
| fc.3 | Linear(256→64) |
| fc.4 | ReLU |
| fc.5 | Dropout(0.3) |
| fc.6 | Linear(64→32) |
| fc.7 | ReLU |
| fc.8 | Dropout(0.3) |
The forward pass. nn.Module.__call__ invokes this AND runs the registered hooks (pre-forward, forward, etc.). Always call the model as model(x), NEVER model.forward(x) directly — direct calls bypass hooks.
Shape comment so a future reader doesn't have to trace back to the docstring on line 6 to confirm the input contract.
Last-cycle pooling. Identical semantics to the NumPy version: keep all engines, pick the LAST of the 30 time-steps, keep all 512 features.
Push the (B, 512) last-cycle tensor through the 9-layer Sequential. nn.Sequential's forward iterates its children: Linear → ReLU → Dropout → Linear → ReLU → Dropout → Linear → ReLU → Dropout, returning the final (B, 32) tensor.
| Step | Module | Output shape |
|---|---|---|
| fc[0] | Linear | (2, 256) |
| fc[1] | ReLU | (2, 256) |
| fc[2] | Dropout | (2, 256) |
| fc[3] | Linear | (2, 64) |
| fc[4] | ReLU | (2, 64) |
| fc[5] | Dropout | (2, 64) |
| fc[6] | Linear | (2, 32) |
| fc[7] | ReLU | (2, 32) |
| fc[8] | Dropout | (2, 32) |
Seed PyTorch's default random number generators so weight init (inside nn.Linear) and dropout masks are reproducible. torch.manual_seed covers CPU and CUDA devices alike; for deterministic CUDA kernels you would additionally need torch.use_deterministic_algorithms(True).
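Instantiate the model with the paper's defaults. Each nn.Linear initialises its weights with Kaiming-uniform using a=√5, which works out to U(−1/√fan_in, 1/√fan_in), and draws its biases from the same uniform range rather than zeroing them. This is narrower than the He-normal init with zero biases used in the NumPy version, so the two funnels match in architecture but not in initial scale.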
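The default initialisation of `nn.Linear` can be inspected directly, and overridden if you want the PyTorch funnel to start from the same scheme as the NumPy version. The sketch below checks one layer; re-initialising with `nn.init.kaiming_normal_` and `nn.init.zeros_` is an optional choice shown for illustration, not something the original model is stated to do.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 256)

# Default init: weights and biases both lie within ±1/sqrt(fan_in).
bound = 1 / math.sqrt(layer.in_features)
default_w_max = layer.weight.abs().max().item()
default_b_max = layer.bias.abs().max().item()
default_bias_nonzero = bool((layer.bias != 0).any())

print("bound          :", round(bound, 4))
print("max |weight|   :", round(default_w_max, 4))
print("bias all zero? :", not default_bias_nonzero)   # False with the default init

# Optional: He-normal weights and zero biases, mirroring the NumPy listing.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
he_std = layer.weight.std().item()
print("std after He   :", round(he_std, 4), "target", round(math.sqrt(2 / 512), 4))
```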
Synthesise a fake backbone output. In production this is the tensor returned by the BiLSTM + Attention modules.
Run the model. funnel(x) is sugar for funnel.__call__(x), which invokes pre-forward hooks → forward(x) → forward hooks. Autograd records the computation graph because the model's parameters have requires_grad=True; x itself does not need gradients.
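A freshly constructed module is in training mode, so the call above applies dropout. For inference you would typically switch to eval mode and disable gradient tracking. The sketch below rebuilds the same nine-layer stack as a bare `nn.Sequential` (rather than the `FCFunnel` class) so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
)

x = torch.randn(2, 30, 512)
last = x[:, -1, :]                       # last-cycle pooling, (2, 512)

fc.train()                               # dropout active: outputs vary call to call
y1, y2 = fc(last), fc(last)
train_differs = not torch.equal(y1, y2)

fc.eval()                                # dropout becomes the identity
with torch.no_grad():                    # no graph is recorded for inference
    e1, e2 = fc(last), fc(last)
eval_same = torch.equal(e1, e2)

print("train outputs differ:", train_differs)
print("eval outputs equal  :", eval_same)         # True
print("eval requires_grad  :", e1.requires_grad)  # False
```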
Sanity-print the input shape. Wrapping in tuple(...) prints (2, 30, 512) instead of the more verbose torch.Size([2, 30, 512]).
Confirm the funnel reduces (2, 30, 512) to (2, 32). The two-axis collapse: time pooling kills axis 1, the FC stack projects axis 2 from 512 → 32.
Count learnable parameters. nn.Linear has bias=True by default, so each layer contributes (in×out) weights + (out) biases. PyTorch stores each weight with shape (out, in), transposed relative to the NumPy version, but the parameter count is identical.
- fc.0 Linear(512→256): 256×512 + 256 = 131,328
- fc.3 Linear(256→64): 64×256 + 64 = 16,448
- fc.6 Linear(64→32): 32×64 + 32 = 2,080
- ReLU + Dropout layers: 0 params each
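The same breakdown can be read off the module itself with `named_parameters()`. As a sketch, the stack is rebuilt here as a bare `nn.Sequential` with the same layer order, so the parameter names start with the layer index (`0`, `3`, `6`) instead of the `fc.` prefix used inside `FCFunnel`.

```python
import torch.nn as nn

fc = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
)

# Group parameter counts by the index of the layer that owns them.
per_layer = {}
for name, p in fc.named_parameters():     # names look like "0.weight", "0.bias", ...
    idx = name.split(".")[0]
    per_layer[idx] = per_layer.get(idx, 0) + p.numel()

total = sum(per_layer.values())

print(per_layer)                   # {'0': 131328, '3': 16448, '6': 2080}
print(total)                       # 149856
print(tuple(fc[0].weight.shape))   # (256, 512): stored as (out, in)
```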
 1  import torch
 2  import torch.nn as nn
 3
 4
 5  class FCFunnel(nn.Module):
 6      """Funnel from (B, T, 512) attention output to (B, 32) shared features.
 7
 8      Uses LAST-cycle pooling (the bidirectional + attention output at the
 9      last cycle already integrates the whole window).
10      """
11      def __init__(self, d_model: int = 512, hidden: tuple = (256, 64),
12                   out_dim: int = 32, dropout_p: float = 0.3):
13          super().__init__()
14          layers = []
15          prev = d_model
16          for h in hidden:
17              layers += [nn.Linear(prev, h), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]
18              prev = h
19          layers += [nn.Linear(prev, out_dim), nn.ReLU(inplace=True), nn.Dropout(dropout_p)]
20          self.fc = nn.Sequential(*layers)
21
22      def forward(self, x: torch.Tensor) -> torch.Tensor:
23          # x: (B, T, 512)
24          last = x[:, -1, :]                       # (B, 512)
25          return self.fc(last)                     # (B, 32)
26
27
28  # Use it
29  torch.manual_seed(0)
30  funnel = FCFunnel(d_model=512, hidden=(256, 64), out_dim=32, dropout_p=0.3)
31  x = torch.randn(2, 30, 512)
32  y = funnel(x)
33
34  print("input   :", tuple(x.shape))         # (2, 30, 512)
35  print("output  :", tuple(y.shape))         # (2, 32)
36  print("# params:", sum(p.numel() for p in funnel.parameters()))
37  # # params: 149856
Two Funnel Pitfalls
The point. Three FC blocks compress the (B, 30, 512) backbone output into a 32-D shared feature vector that both task heads consume. ~150k params.
Takeaway
- FC funnel: 512 → 256 → 64 → 32. Three Linear + ReLU + Dropout blocks.
- Last-cycle pooling. Bidirectional + attention mean the last cycle has full-window context.
- 32-D shared vector. The architectural point where multi-task learning happens.
- ~150k params. Small relative to the BiLSTM / attention upstream.