Section 3.2 walked through the math of 1-D convolution. This chapter applies it. The first thing the backbone does to a (B, T, 17) sensor batch is run it through THREE stacked Conv1D blocks that extract local degradation patterns BEFORE handing off to the BiLSTM (Chapter 9) and attention (Chapter 10).
Three reasons the convolutional frontend pays off:
Property | What it gives the model
Translation invariance | A spike at cycle 5 and a spike at cycle 25 produce the same kernel response, just at a different output position (strictly speaking, this is translation equivariance) - the model doesn't have to relearn the pattern detector at every position.
Local feature extraction | Conv1D weights specialise to 3-cycle motifs (edges, gradients, oscillations). Stacked layers compose them into longer-range patterns.
Channel mixing | Each output channel takes a weighted sum across ALL 17 input sensors. The learned weights decide which sensor combinations matter, and the same weights are reused at every position.
What to remember. Conv1D = local-pattern detector + channel mixer. Stacking three layers builds a hierarchy: simple patterns → compound patterns → degradation signatures.
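The first and third rows of the table can be checked numerically. The sketch below is illustrative only: the random weights, the helper names `response` and `spike_at`, and the choice of sensor 3 are assumptions, not part of the book's model. It builds one output channel that mixes all 17 sensors and confirms that moving a spike from cycle 5 to cycle 25 simply moves the response by 20 positions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights for ONE output channel: 17 sensors x 3 cycles.
w = rng.normal(size=(17, 3))


def response(x, w):
    """One output channel: sum over sensors of a 3-tap sliding dot product.

    mode="same" zero-pads the ends so the output has the same length as the input.
    """
    return sum(np.correlate(x[c], w[c], mode="same") for c in range(x.shape[0]))


def spike_at(cycle, sensor=3, n_sensors=17, T=30):
    """All-zero window with a single unit spike in one sensor at one cycle."""
    x = np.zeros((n_sensors, T))
    x[sensor, cycle] = 1.0
    return x


r5 = response(spike_at(5), w)
r25 = response(spike_at(25), w)

# Same response values, 20 cycles later.
shifted_match = np.allclose(np.roll(r5, 20), r25)
print(shifted_match)
```

The match is exact here because both spikes sit away from the window edges; near the boundaries the zero padding changes the response.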
Conv1D vs Direct-to-LSTM
A natural question: why not feed raw sensors straight into the BiLSTM? Two reasons. First, the BiLSTM is sequential - it cannot parallelise across timesteps. The conv frontend processes all timesteps simultaneously, so local pattern extraction happens cheaply and in parallel, and the expensive recurrent compute is spent on a (B, T, 64) tensor of learned features rather than on the raw (B, T, 17) readings. The (B, T, 64) tensor is wider than the input, not smaller; the gain is in feature quality, not tensor size. Second, the conv frontend hands the BiLSTM already-cleaned features - it can suppress high-frequency noise and emphasise local structure that the LSTM would otherwise have to learn on its own.
The division of labour. CNN handles “what local patterns exist?” - parallel, fast, translation-invariant. BiLSTM handles “how do those patterns evolve over the window?” - sequential, position-aware. Attention handles “which timesteps matter most?” - global, content-addressed.
Interactive: The Sliding Kernel (Recap)
The same animation from §3.2 - reproduced here so the math has somewhere to land in this chapter's context.
kernel_size=3: the window looks at 3 consecutive timesteps
padding=1: zeros are added at the boundaries to preserve length
Data: NASA C-MAPSS FD001 - T30 (total temperature at HPC outlet)
1D Convolution Equation
$$y_t = \sum_{k=0}^{K-1} w_k \cdot x_{t+k} + b$$
• K = kernel size (3 in our case)
• w = learned weights
• b = bias term
• t = output position
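As a minimal sketch of the equation, the function below applies it to a single channel with zero padding. The function name `conv1d_single` and the eight input values are made up for illustration (they are not C-MAPSS readings); the kernel is the edge detector [1, 0, -1].

```python
import numpy as np


def conv1d_single(x, w, b=0.0, pad=1):
    """y_t = sum_k w_k * x_pad[t + k] + b for one input channel."""
    K = len(w)
    x_pad = np.pad(x, (pad, pad))        # zeros on both ends of the time axis
    T_out = len(x_pad) - K + 1           # stride 1
    y = np.empty(T_out)
    for t in range(T_out):
        y[t] = np.dot(w, x_pad[t:t + K]) + b
    return y


# Synthetic 8-cycle series: rises, plateaus, then falls.
x = np.array([1.0, 2.0, 4.0, 7.0, 7.0, 6.0, 3.0, 1.0])
w = np.array([1.0, 0.0, -1.0])

y = conv1d_single(x, w, b=0.0, pad=1)
print(y)   # [-2. -3. -5. -3.  1.  4.  5.  3.]
```

With this kernel the response is negative while the series rises and positive while it falls, and the output keeps all 8 positions because of the padding.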
Output Dimension Formula
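$$T_{\text{out}} = \left\lfloor \frac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$
With $T_{\text{in}}=8$, $P=1$, $K=3$, $S=1$:
$$T_{\text{out}} = \left\lfloor \frac{8 + 2 - 3}{1} \right\rfloor + 1 = 8$$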
Padding preserves sequence length!
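The formula is easy to turn into a helper for sanity-checking shapes before building a model. The name `conv_out_len` is an assumption for this sketch; the second half applies the formula three times with no padding to show how a 30-cycle window would shrink.

```python
def conv_out_len(t_in, k, p=0, s=1):
    """Output length of a 1-D convolution: floor((T_in + 2P - K) / S) + 1."""
    return (t_in + 2 * p - k) // s + 1


# The worked example above: T_in=8, P=1, K=3, S=1.
same = conv_out_len(8, k=3, p=1, s=1)
print(same)      # 8

# Three stacked K=3 layers WITHOUT padding on a 30-cycle window.
t = 30
lengths = [t]
for _ in range(3):
    t = conv_out_len(t, k=3, p=0)
    lengths.append(t)
print(lengths)   # [30, 28, 26, 24]
```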
Parameter Count
For Conv1d(17, 64, kernel_size=3):
Weights = 64 × 17 × 3 = 3,264
Biases = 64
Total = 3,328 parameters
What the Kernel Learns
The kernel weights are learned during training. Different patterns emerge:
[1, 0, -1] → Detects rising/falling edges
[0.33, 0.33, 0.33] → Smoothing/averaging
[−1, 2, −1] → Detects spikes
64 different kernels learn 64 different patterns!
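To see what each of the three example kernels responds to, the sketch below runs them over a small synthetic signal containing one step and one spike. The signal values and dictionary keys are invented for illustration; `np.correlate` with `mode="same"` is used as a stand-in for a single-channel Conv1D with zero padding.

```python
import numpy as np

# Synthetic signal: a step up at index 4 and a spike at index 8.
signal = np.array([0, 0, 0, 0, 1, 1, 1, 1, 5, 1, 1, 1], dtype=float)

kernels = {
    "edge": np.array([1.0, 0.0, -1.0]),
    "smooth": np.array([1 / 3, 1 / 3, 1 / 3]),
    "spike": np.array([-1.0, 2.0, -1.0]),
}

# Sliding dot product of each kernel with the zero-padded signal.
responses = {name: np.correlate(signal, k, mode="same") for name, k in kernels.items()}

for name, r in responses.items():
    print(f"{name:>6}: {np.round(r, 2)}")
```

The spike kernel peaks at the spike itself (value 8 at index 8), the edge kernel is zero at the spike centre but large on either side of it, and the smoothing kernel flattens the spike from 5 down to about 2.33.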
Python: A Single Conv Block
In production we don't just apply Conv1D - we wrap it with BatchNorm, ReLU, and Dropout to form a stable, fast-training block. Below is the entire block in NumPy with explicit normalisation and dropout maths.
input: w (C_out, C_in, K) = Conv kernel - all (C_out * C_in * K) weights
input: b (C_out,) = Conv bias - one per output channel
input: gamma (C_out,) = BN scale - learnable; one per output channel
input: beta (C_out,) = BN shift - learnable
input: dropout_p = 0.15 - paper default (small dropout in shallow layers)
input: training = Toggles dropout (and BN training-mode statistics)
B, C_in, T = x.shape
Unpack input shape.
Example: (B=2, C_in=17, T=30)
C_out, _, K = w.shape
Unpack kernel shape.
Example: C_out=64, K=3
pad = K // 2
'Same' padding for odd K. K=3 → pad=1 → output length = input length.
x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
Pad only the time axis. Tuple syntax: one (before, after) pair per dimension. Default mode='constant' with value 0.
x_pad.shape = (2, 17, 32) - pad 1 on each side of T
out = np.zeros((B, C_out, T), dtype=np.float32)
Pre-allocate output. Output length T = input length T (because of same padding).
for t in range(T): for j in range(C_out): for c in range(C_in): for k in range(K):
Four nested loops - the explicit definition of multi-channel conv. Slow in Python; PyTorch hands all four loops to a single optimised kernel call.
out[:, j, t] += w[j, c, k] * x_pad[:, c, t + k]
Each output cell is the SUM over (c, k) of (kernel weight × input value). Identical to §3.2's math, broadcast across the batch.
out[:, j, t] += b[j]
Add the per-output-channel bias once per (batch, t) cell, after the sums over c and k are complete.
mu = out.mean(axis=(0, 2), keepdims=True)
BatchNorm: per-channel mean over the BATCH axis AND the TIME axis. Shape (1, C_out, 1) thanks to keepdims so broadcasting works.
→ why these axes? Each output channel has its own learnable scale and shift. Normalising over (B, T) gives one statistic per channel - that's what BatchNorm1d does.
Re-scale and re-shift, applied after the mean has been subtracted and the result divided by the per-channel standard deviation. gamma/beta are LEARNABLE per-channel parameters, which lets the network undo the normalisation if it needs to. reshape(1, -1, 1) makes them broadcast-compatible.
→ why learn gamma / beta? Strict normalisation can erase useful signal. The network may want to keep some channels biased; gamma=1, beta=0 means no rescale. The network learns these from data.
out = np.maximum(out, 0)
ReLU. Negatives clipped to 0; positives pass through.
Bernoulli mask: 1 with probability (1 - p), 0 with probability p. Each cell independently.
out = out * mask / (1 - dropout_p)
Apply mask AND divide by (1 - p). The division is the 'inverted dropout' trick - keeps expected output magnitude unchanged so we don't need a different scale at eval time.
inverted dropout: the default in PyTorch, TensorFlow and Keras. The alternative is to scale at EVAL time instead. Inverted is now standard.
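Assembled into one function, the walkthrough above might look like the sketch below. It is a reconstruction, not the book's listing: the function name `conv_block`, the `eps` and `rng` arguments, and the variance, normalisation and mask lines are assumptions filled in from the descriptions. As a simplification, it always normalises with batch statistics, whereas a full BatchNorm would switch to running averages when `training` is False.

```python
import numpy as np


def conv_block(x, w, b, gamma, beta, dropout_p=0.15, training=True, eps=1e-5, rng=None):
    """Conv1D -> BatchNorm -> ReLU -> Dropout on x of shape (B, C_in, T)."""
    B, C_in, T = x.shape
    C_out, _, K = w.shape

    # 'Same' padding on the time axis only (odd K).
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))

    # Multi-channel convolution, written out as explicit loops.
    out = np.zeros((B, C_out, T), dtype=np.float32)
    for t in range(T):
        for j in range(C_out):
            for c in range(C_in):
                for k in range(K):
                    out[:, j, t] += w[j, c, k] * x_pad[:, c, t + k]
            out[:, j, t] += b[j]

    # BatchNorm: one mean/variance per channel, taken over batch and time.
    mu = out.mean(axis=(0, 2), keepdims=True)
    var = out.var(axis=(0, 2), keepdims=True)
    out = (out - mu) / np.sqrt(var + eps)          # eps is an assumed stabiliser
    out = gamma.reshape(1, -1, 1) * out + beta.reshape(1, -1, 1)

    # ReLU.
    out = np.maximum(out, 0)

    # Inverted dropout: only active in training mode.
    if training and dropout_p > 0:
        if rng is None:
            rng = np.random.default_rng()
        mask = (rng.random(out.shape) >= dropout_p).astype(out.dtype)
        out = out * mask / (1 - dropout_p)
    return out


# Usage with the example shapes from the walkthrough (random placeholder data).
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 17, 30))
w = rng.normal(size=(64, 17, 3)) * 0.1
y = conv_block(x, w, b=np.zeros(64), gamma=np.ones(64), beta=np.zeros(64), training=False)
print(y.shape)   # (2, 64, 30)
```

Because the mean is subtracted right after the convolution, changing `b` has no effect on the output, which is the reason the conv bias can be dropped when BatchNorm follows.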
In the PyTorch version of the block, the conv layer uses padding = K // 2, which gives 'same' padding for odd K, and bias=False because BatchNorm has its own bias (beta) - keeping the conv bias would be redundant.
→ bias=False with BN: BatchNorm subtracts the per-channel mean. Any conv bias gets absorbed into that subtraction - so bias=False saves c_out parameters and a tiny bit of compute.
self.bn = nn.BatchNorm1d(c_out)
BatchNorm1d normalises per channel. Expects input shape (B, C, T). Produces (B, C, T) with per-channel mean ≈ 0 and std ≈ 1 during training, before the learnable scale and shift (initialised to gamma=1, beta=0) move it.
self.relu = nn.ReLU(inplace=True)
inplace=True overwrites the input tensor in memory - saves a small amount of GPU memory. Safe here because the pre-activation BatchNorm output is not needed anywhere else after ReLU.
self.drop = nn.Dropout(dropout_p)
PyTorch's standard inverted dropout. Inactive in .eval() mode.
bias=False because BN. BatchNorm subtracts the per-channel mean, which absorbs any conv bias. Setting bias=False on the conv saves c_out parameters and a tiny bit of compute - standard practice when conv is followed directly by BN.
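Put together, the PyTorch block could be sketched as follows. The class name `ConvBlock`, the `self.conv` attribute name and the constructor signature are assumptions; `self.bn`, `self.relu`, `self.drop`, `c_out` and `dropout_p` follow the lines discussed above.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv1d -> BatchNorm1d -> ReLU -> Dropout (hypothetical assembly)."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, dropout_p: float = 0.15):
        super().__init__()
        # 'Same' padding for odd kernel sizes; no conv bias because BN follows.
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm1d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, T) -> (B, C_out, T)
        return self.drop(self.relu(self.bn(self.conv(x))))


block = ConvBlock(17, 64)
x = torch.randn(2, 30, 17).transpose(1, 2)    # (B, T, F) -> (B, C, T)
y = block(x)
n_params = sum(p.numel() for p in block.parameters())
print(tuple(y.shape), n_params)   # (2, 64, 30) 3392
```

The 3,392 trainable parameters are the 3,264 conv weights plus 64 BatchNorm scales and 64 BatchNorm shifts; BatchNorm's running mean and variance are buffers and are not counted.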
Conv-Frontend Architectures Elsewhere
Architecture | Conv frontend | Purpose
RUL CNN-BiLSTM-Attention (this book) | 3 stacked Conv1D blocks | Local pattern extraction
WaveNet (audio) | Dilated Conv1D | Long-range without recurrence
DeepSpeech 2 | Conv2D over (time, freq) | Speech feature extraction
Vision Transformer (ViT) | Single 16×16 Conv2D ('patch embed') | Tokenise the image
AlphaFold 2 | EvoFormer with attention + conv mixers | Coevolution feature extraction
Conformer (speech) | ConvModule between attention layers | Local pattern boost on top of attention
Two Frontend Pitfalls
Pitfall 1: Forgetting padding. Without same padding, every conv layer shrinks the time axis by K-1. Three K=3 layers without padding lose 6 cycles - a fifth of a 30-cycle window. Always set padding=K//2.
Pitfall 2: Wrong axis order. Conv1d expects (B, C, T). Our data flows as (B, T, F). Bridge with x.transpose(1, 2) - same trap as §3.2.
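Pitfall 2 is easy to get subtly wrong, because a reshape produces the right shape with the wrong contents. The sketch below uses NumPy's `np.transpose(x, (0, 2, 1))` as an analogue of the tensor call `x.transpose(1, 2)`; the array contents and variable names are illustrative assumptions.

```python
import numpy as np

B, T, F = 2, 30, 17
# Synthetic (B, T, F) batch where every entry is unique, so mix-ups are visible.
x = np.arange(B * T * F, dtype=np.float32).reshape(B, T, F)

x_ct = np.transpose(x, (0, 2, 1))   # (B, F, T): swaps the time and feature axes
x_bad = x.reshape(B, F, T)          # also (B, F, T), but sensors and cycles are scrambled

# Sensor 4's time series for sample 0 must survive the bridge unchanged.
series = x[0, :, 4]
transpose_ok = np.array_equal(x_ct[0, 4, :], series)
reshape_ok = np.array_equal(x_bad[0, 4, :], series)
print(x_ct.shape, transpose_ok, reshape_ok)   # (2, 17, 30) True False
```

Both results have shape (2, 17, 30), so a shape assertion alone would not catch the reshape mistake.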
The point. Conv1D + BN + ReLU + Dropout is the building block. Stacked three times with growing then shrinking channels (17 → 64 → 128 → 64), it forms the entire CNN frontend. Section 8.2 builds that stack.
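The size of that stack can be estimated with a few lines of arithmetic. The helper `block_params` is hypothetical, and the sketch assumes K=3 and bias=False in every block, with BatchNorm contributing one scale and one shift per output channel.

```python
def block_params(c_in, c_out, k=3, conv_bias=False):
    """Trainable parameters in one conv + BatchNorm block."""
    conv = c_out * c_in * k + (c_out if conv_bias else 0)
    bn = 2 * c_out                      # gamma and beta
    return conv + bn


channels = [17, 64, 128, 64]
per_block = [block_params(a, b) for a, b in zip(channels[:-1], channels[1:])]
total = sum(per_block)

print(per_block)   # [3392, 24832, 24704]
print(total)       # 52928
```

Most of the parameters sit in the two wider blocks; the first block is by far the smallest.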
Takeaway
Conv1D is the model's frontend. Local pattern extraction; channel mixing; translation invariance.
The block is conv + BN + ReLU + dropout. Four ops in a fixed order. Same pattern across most CNN architectures.
bias=False on the conv. BN's beta absorbs it. Standard idiom.
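3,392 parameters in the first block (3,264 conv weights plus 128 BatchNorm scale and shift values). Tiny. With K=3 and bias=False throughout, the full three-layer stack (17 → 64 → 128 → 64) adds up to ~53k parameters - small relative to the BiLSTM's 2.1M.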