Chapter 8

1D Convolution for Sensor Series

CNN Feature Extractor

Why Conv1D Is the Right Frontend

Section 3.2 walked through the math of 1-D convolution. This chapter applies it. The first thing the backbone does to a (B, T, 17) sensor batch is run it through THREE stacked Conv1D blocks that extract local degradation patterns BEFORE handing off to the BiLSTM (Chapter 9) and attention (Chapter 10).

Three reasons the convolutional frontend pays off:

| Property | What it gives the model |
| --- | --- |
| Translation invariance | A spike at cycle 5 and a spike at cycle 25 produce the same kernel response - the model doesn't have to learn pattern detection at every position. |
| Local feature extraction | Conv1D weights specialise to 3-cycle motifs (edges, gradients, oscillations). Stacked layers compose them into longer-range patterns. |
| Channel mixing | Each output channel takes a weighted sum across ALL 17 input sensors. The conv decides which sensor combinations matter at each position. |

What to remember. Conv1D = local-pattern detector + channel mixer. Stacking three layers builds a hierarchy: simple patterns → compound patterns → degradation signatures.
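
The first row of the table can be checked numerically. This is a minimal sketch, assuming a hand-picked spike kernel `[-1, 2, -1]` instead of learned weights; `conv_same` is a hypothetical helper, not part of the model code. Strictly speaking the response is not unchanged but shifts along with the input, which is why the same kernel works at every position.

```python
import numpy as np

T = 30
kernel = np.array([-1.0, 2.0, -1.0])  # illustrative spike detector, not learned


def conv_same(x, w):
    """Slide w over x with zero 'same' padding (odd kernel length assumed)."""
    pad = len(w) // 2
    x_pad = np.pad(x, pad)
    return np.array([np.dot(w, x_pad[t:t + len(w)]) for t in range(len(x))])


early = np.zeros(T)
early[5] = 1.0    # spike at cycle 5
late = np.zeros(T)
late[25] = 1.0    # the same spike at cycle 25

y_early = conv_same(early, kernel)
y_late = conv_same(late, kernel)

print(y_early[4:7])    # response around cycle 5
print(y_late[24:27])   # response around cycle 25
# The response pattern is identical; it has only moved by 20 cycles.
print(np.allclose(np.roll(y_early, 20), y_late))
```

The two printed windows match, so one set of three weights detects the motif wherever it occurs in the window.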

Conv1D vs Direct-to-LSTM

A natural question: why not feed raw sensors straight into the BiLSTM? Two reasons. First, the BiLSTM is sequential - it cannot parallelise across timesteps. The conv frontend processes all timesteps simultaneously, so local pattern detection happens cheaply and in parallel, and the expensive recurrent compute runs on a (B, T, 64) tensor of learned features rather than on the raw (B, T, 17) sensors. That tensor is wider, not smaller: the gain is in feature quality, not in size. Second, the conv frontend hands the BiLSTM cleaner features - it can suppress high-frequency noise and emphasise local structure that the LSTM would otherwise have to learn itself.

The division of labour. CNN handles “what local patterns exist?” - parallel, fast, translation-invariant. BiLSTM handles “how do those patterns evolve over the window?” - sequential, position-aware. Attention handles “which timesteps matter most?” - global, content-addressed.

Interactive: The Sliding Kernel (Recap)

The same animation from §3.2 - reproduced here so the math has somewhere to land in this chapter's context.

Interactive 1D Convolution Visualizer

Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)

What happens when we declare this line?

input_size = Number of input channels (17 sensors in C-MAPSS)
64 = Output channels (64 learned feature detectors)
kernel_size=3 = Window looks at 3 consecutive timesteps
padding=1 = Add zeros at boundaries to preserve length
Data: NASA C-MAPSS FD001 - T30 (total temperature at HPC outlet)
Input (padded): [0.00, 0.82, 0.91, 0.76, 0.88, 0.95, 0.71, 0.84, 0.93, 0.00]
Kernel (size 3): w0 = 0.33, w1 = 0.34, w2 = 0.33
Step 1/8, position 0: y0 = 0.33 × 0.00 + 0.34 × 0.82 + 0.33 × 0.91 = 0.000 + 0.279 + 0.300 = 0.579
Output: y0 = 0.58; y1 to y7 are filled in as the kernel slides across the remaining positions.
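
The remaining positions of the visualizer can be reproduced in a few lines. This is a sketch that reuses the input values and kernel weights shown in the visualizer and assumes a bias of zero, as the step-1 calculation does.

```python
import numpy as np

# T30 readings and kernel weights copied from the visualizer
x = np.array([0.82, 0.91, 0.76, 0.88, 0.95, 0.71, 0.84, 0.93])
w = np.array([0.33, 0.34, 0.33])

x_pad = np.pad(x, 1)  # padding=1: one zero on each side

# Slide the 3-wide window over the padded input, one output per position
y = np.array([np.dot(w, x_pad[t:t + 3]) for t in range(len(x))])

print(round(float(y[0]), 3))   # 0.579, matching step 1/8
print(np.round(y, 3))          # all 8 outputs
print(len(y) == len(x))        # padding preserved the length

# np.correlate in 'valid' mode performs the same sliding dot product
assert np.allclose(y, np.correlate(x_pad, w, mode="valid"))
```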

1D Convolution Equation

$$y_t = \sum_{k=0}^{K-1} w_k \, x_{t+k} + b$$

  • K = kernel size (3 in our case)
  • w = learned weights
  • b = bias term
  • t = output position

Output Dimension Formula

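$$T_{\text{out}} = \left\lfloor \frac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$

With $T_{\text{in}} = 8$, $P = 1$, $K = 3$, $S = 1$:

$$T_{\text{out}} = \left\lfloor \frac{8 + 2 - 3}{1} \right\rfloor + 1 = 8$$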

Padding preserves sequence length!
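
The formula is easy to wrap in a helper for sanity-checking layer configurations. This sketch assumes dilation 1, as in the formula above; `conv1d_out_len` is a hypothetical name.

```python
def conv1d_out_len(t_in: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Output length of a 1-D convolution (dilation assumed to be 1)."""
    return (t_in + 2 * padding - kernel_size) // stride + 1


print(conv1d_out_len(8, 3, padding=1))    # 8  (the worked example)
print(conv1d_out_len(30, 3, padding=1))   # 30 (a 30-cycle window keeps its length)
print(conv1d_out_len(30, 3, padding=0))   # 28 (without padding, K - 1 cycles are lost)
```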

Parameter Count

For Conv1d(17, 64, kernel_size=3):

Weights = 64 × 17 × 3 = 3,264

Biases = 64

Total = 3,328 parameters
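
The same arithmetic as a small function, which also shows what dropping the bias saves. A sketch only; `conv1d_param_count` is a hypothetical helper name.

```python
def conv1d_param_count(c_in: int, c_out: int, kernel_size: int, bias: bool = True) -> int:
    """Learnable parameters in a Conv1d layer: one K-wide filter per (out, in) pair."""
    weights = c_out * c_in * kernel_size
    biases = c_out if bias else 0
    return weights + biases


print(conv1d_param_count(17, 64, 3))              # 3328
print(conv1d_param_count(17, 64, 3, bias=False))  # 3264
```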

What the Kernel Learns

The kernel weights are learned during training. Different patterns emerge:

  • [1, 0, -1] → Detects rising/falling edges
  • [0.33, 0.33, 0.33] → Smoothing/averaging
  • [−1, 2, −1] → Detects spikes

64 different kernels learn 64 different patterns!
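
To see what these three kernels respond to, they can be applied by hand to a toy signal. This is a sketch under assumptions: the signal (a step up at index 4 and a spike at index 8) is invented for illustration, and the kernels are the hand-written ones listed above, not trained weights.

```python
import numpy as np

# Toy signal: flat at 0, step up to 1 at index 4, isolated spike at index 8
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 4, 1, 1, 1], dtype=float)

kernels = {
    "edge":   np.array([1.0, 0.0, -1.0]),
    "smooth": np.array([0.33, 0.33, 0.33]),
    "spike":  np.array([-1.0, 2.0, -1.0]),
}

# mode="same" zero-pads so each response has the same length as x
responses = {name: np.correlate(x, w, mode="same") for name, w in kernels.items()}

for name, y in responses.items():
    print(f"{name:6s}", np.round(y, 2))

# The spike kernel peaks exactly where the spike sits
print(int(np.argmax(responses["spike"])))
```

The edge kernel is non-zero only where neighbouring values differ, the smoothing kernel flattens the spike, and the spike kernel gives its largest response at index 8.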

Python: A Single Conv Block

In production we don't just apply Conv1D - we wrap it with BatchNorm, ReLU, and Dropout to form a stable, fast-training block. Below is the entire block in NumPy with explicit normalisation and dropout maths.

Conv1D + BatchNorm + ReLU + Dropout in pure NumPy
🐍conv_block_numpy.py
import numpy as np

Standard alias.

def conv_block_numpy(x, w, b, gamma, beta, dropout_p=0.15, training=True):

One CNN block - the unit of repetition. Stack three of these (with growing channel counts) to get the full CNN frontend in §8.2.

EXECUTION STATE
input: x (B, C_in, T) = Channel-first tensor - PyTorch nn.Conv1d convention
input: w (C_out, C_in, K) = Conv kernel - all (C_out * C_in * K) weights
input: b (C_out,) = Conv bias - one per output channel
input: gamma (C_out,) = BN scale - learnable; one per output channel
input: beta (C_out,) = BN shift - learnable
input: dropout_p = 0.15 - paper default (small dropout in shallow layers)
input: training = Toggles dropout only; this NumPy sketch always normalises with batch statistics and keeps no running averages
B, C_in, T = x.shape

Unpack input shape.

EXECUTION STATE
Example = (B=2, C_in=17, T=30)
C_out, _, K = w.shape

Unpack kernel shape.

EXECUTION STATE
Example = C_out=64, K=3
pad = K // 2

'Same' padding for odd K. K=3 → pad=1 → output length = input length.

x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))

Pad only the time axis. Tuple syntax: ((before, after) for each dimension). Default mode='constant' with value 0.

EXECUTION STATE
x_pad.shape = (2, 17, 32) - pad 1 on each side of T
out = np.zeros((B, C_out, T), dtype=np.float32)

Pre-allocate output. Output length T = input length T (because of same padding).

for t in range(T):
    for j in range(C_out):
        for c in range(C_in):
            for k in range(K):

Four nested loops - the explicit definition of multi-channel conv. Slow in Python; PyTorch hands the whole computation to an optimised library kernel instead of looping in Python.

out[:, j, t] += w[j, c, k] * x_pad[:, c, t + k]

Each output cell is the SUM over (c, k) of (kernel weight × input value). Identical to §3.2's math, broadcast across the batch.

out[:, j, t] += b[j]

Add the per-output-channel bias once per (b, t) cell.

mu = out.mean(axis=(0, 2), keepdims=True)

BatchNorm: per-channel mean over the BATCH axis AND the TIME axis. Shape (1, C_out, 1) thanks to keepdims so broadcasting works.

EXECUTION STATE
→ why these axes? = Each output channel has its own learnable scale and shift. Normalising over (B, T) gives one statistic per channel - that's what BN1d does.
var = out.var(axis=(0, 2), keepdims=True)

Per-channel variance.

out = (out - mu) / np.sqrt(var + 1e-5)

Whiten: zero mean, unit variance per channel.

out = out * gamma.reshape(1, -1, 1) + beta.reshape(1, -1, 1)

Re-scale and re-shift. gamma/beta are LEARNABLE per-channel parameters - lets the network undo BN if it needs to. reshape (1, -1, 1) makes them broadcast-compatible.

EXECUTION STATE
→ why learn gamma / beta? = Strict whitening can erase useful signal. The network may want to keep some channels biased; gamma=1, beta=0 means no rescale. The network learns these from data.
out = np.maximum(out, 0)

ReLU. Negatives clipped to 0; positives pass through.

if training and dropout_p > 0:

Dropout active only at training time.

mask = (np.random.random_sample(out.shape) > dropout_p).astype(np.float32)

Bernoulli mask: 1 with probability (1 - p), 0 with probability p. Each cell independently.

out = out * mask / (1 - dropout_p)

Apply mask AND divide by (1 - p). The division is the 'inverted dropout' trick - keeps expected output magnitude unchanged so we don't need a different scale at eval time.

EXECUTION STATE
inverted dropout = PyTorch / TF / Keras default. Alternative: scale at EVAL time instead. Inverted is now standard.
return out

Block output.

B, C_in, T, C_out, K = 2, 17, 30, 64, 3

First-block dimensions: 17 input channels (sensors), 64 output channels.

x = np.random.randn(B, C_in, T).astype(np.float32) * 0.1

Fake input - small std to keep numbers manageable.

w = np.random.randn(C_out, C_in, K).astype(np.float32) * 0.1

Fake weights - small std (poor man's Xavier). Real PyTorch uses Kaiming init by default.

EXECUTION STATE
w.size = 64 * 17 * 3 = 3,264 weights
b = np.zeros(C_out, dtype=np.float32)

Bias init to zero.

gamma = np.ones(C_out, dtype=np.float32)

BN gamma init to 1 (PyTorch default).

beta = np.zeros(C_out, dtype=np.float32)

BN beta init to 0.

out = conv_block_numpy(x, w, b, gamma, beta, dropout_p=0.15, training=True)

Run the block.

print("input :", x.shape)

Verify input.

EXECUTION STATE
Output = input : (2, 17, 30)
print("output :", out.shape)

Same time length, channels grew 17 → 64.

EXECUTION STATE
Output = output : (2, 64, 30)
print("non-neg?", (out >= 0).all())

ReLU guarantees non-negative output.

EXECUTION STATE
Output = non-neg? True
import numpy as np

# A Conv1D BLOCK = (conv -> BN -> ReLU -> dropout): the building unit
# of the backbone we will stack three times in Section 8.2.

def conv_block_numpy(x: np.ndarray,         # (B, C_in, T)
                     w: np.ndarray,         # (C_out, C_in, K)
                     b: np.ndarray,         # (C_out,)
                     gamma: np.ndarray,     # (C_out,)  BatchNorm scale
                     beta:  np.ndarray,     # (C_out,)  BatchNorm shift
                     dropout_p: float = 0.15,
                     training: bool = True) -> np.ndarray:
    """One CNN block: Conv1D + BatchNorm + ReLU + Dropout."""
    B, C_in, T = x.shape
    C_out, _, K = w.shape

    # ----- Conv1D (same padding) -----
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))             # (B, C_in, T+2P)
    out = np.zeros((B, C_out, T), dtype=np.float32)
    for t in range(T):
        for j in range(C_out):
            for c in range(C_in):
                for k in range(K):
                    out[:, j, t] += w[j, c, k] * x_pad[:, c, t + k]
            out[:, j, t] += b[j]

    # ----- BatchNorm (per channel) -----
    mu  = out.mean(axis=(0, 2), keepdims=True)
    var = out.var (axis=(0, 2), keepdims=True)
    out = (out - mu) / np.sqrt(var + 1e-5)
    out = out * gamma.reshape(1, -1, 1) + beta.reshape(1, -1, 1)

    # ----- ReLU -----
    out = np.maximum(out, 0)

    # ----- Dropout (training only) -----
    if training and dropout_p > 0:
        mask = (np.random.random_sample(out.shape) > dropout_p).astype(np.float32)
        out = out * mask / (1 - dropout_p)

    return out


# ----- Verify shape -----
np.random.seed(0)
B, C_in, T, C_out, K = 2, 17, 30, 64, 3
x     = np.random.randn(B, C_in, T).astype(np.float32) * 0.1
w     = np.random.randn(C_out, C_in, K).astype(np.float32) * 0.1
b     = np.zeros(C_out, dtype=np.float32)
gamma = np.ones(C_out,  dtype=np.float32)
beta  = np.zeros(C_out, dtype=np.float32)

out = conv_block_numpy(x, w, b, gamma, beta, dropout_p=0.15, training=True)
print("input  :", x.shape)        # (2, 17, 30)
print("output :", out.shape)      # (2, 64, 30)
print("non-neg?", (out >= 0).all())  # True (after ReLU)
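
The four nested loops are the definition, not the way anyone would run it. As a sketch of how the same convolution collapses into one array expression, the block below compares the loop version against a windowed `einsum`. It assumes NumPy 1.20 or newer for `sliding_window_view`; `conv1d_loops` and `conv1d_vectorised` are illustrative names, not functions from the book's code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv1d_loops(x, w, b):
    """Reference: the explicit four-loop convolution with 'same' padding."""
    B, C_in, T = x.shape
    C_out, _, K = w.shape
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros((B, C_out, T), dtype=x.dtype)
    for t in range(T):
        for j in range(C_out):
            for c in range(C_in):
                for k in range(K):
                    out[:, j, t] += w[j, c, k] * x_pad[:, c, t + k]
            out[:, j, t] += b[j]
    return out


def conv1d_vectorised(x, w, b):
    """Same result: gather all K-wide windows, then contract over (c, k)."""
    K = w.shape[-1]
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    windows = sliding_window_view(x_pad, K, axis=2)   # (B, C_in, T, K) for odd K
    return np.einsum("bctk,jck->bjt", windows, w) + b[None, :, None]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 17, 30))
w = rng.standard_normal((64, 17, 3)) * 0.1
b = rng.standard_normal(64) * 0.1

y_ref = conv1d_loops(x, w, b)
y_vec = conv1d_vectorised(x, w, b)
print(y_vec.shape, np.allclose(y_ref, y_vec))
```

The einsum subscripts mirror the loop indices: `c` and `k` are summed away, while `b`, `j` and `t` survive as output axes.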

PyTorch: Conv → BatchNorm → ReLU → Dropout

Same block as a 4-line nn.Module
🐍conv_block_torch.py
import torch

Top-level PyTorch.

import torch.nn as nn

Layers.

class ConvBlock(nn.Module):

The composed conv block. Used three times in Section 8.2 to build the full backbone frontend.

def __init__(self, c_in, c_out, kernel_size=3, dropout_p=0.15):

Four hyperparameters - everything the block needs.

EXECUTION STATE
input: c_in = Input channel count - 17 for first block, 64/128 for stacked
input: c_out = Output channel count - 64/128/64 for the three-layer stack
input: kernel_size = 3 - paper default
input: dropout_p = 0.15 - paper default for shallow layers
super().__init__()

Initialise nn.Module.

self.conv = nn.Conv1d(c_in, c_out, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)

padding = K // 2 gives 'same' padding for odd K. bias=False because BatchNorm has its own bias (beta) - keeping the conv bias would be redundant.

EXECUTION STATE
→ bias=False with BN = BatchNorm subtracts the per-channel mean. Any conv bias gets absorbed into that subtraction - so bias=False saves c_out parameters and a tiny bit of compute.
self.bn = nn.BatchNorm1d(c_out)

BatchNorm1d normalises per channel. Expects input shape (B, C, T). Produces (B, C, T) with mean ≈ 0 and std ≈ 1 per channel during training.

self.relu = nn.ReLU(inplace=True)

inplace=True overwrites the input tensor in memory - saves a small amount of GPU memory. Safe here because the pre-activation BatchNorm output is not needed again after ReLU.

self.drop = nn.Dropout(dropout_p)

PyTorch's standard inverted dropout. Inactive at .eval() time.

def forward(self, x):

Standard PyTorch forward.

return self.drop(self.relu(self.bn(self.conv(x))))

Function composition: conv → BN → ReLU → drop. Reads inside-out. In eager mode PyTorch runs each module as its own op; fusing them takes a compiler or an explicit fusion pass.

EXECUTION STATE
→ why this order? = Standard 'CNN block' from the BatchNorm paper (Ioffe & Szegedy 2015). BN before ReLU is the original recommendation.
torch.manual_seed(0)

Determinism.

block = ConvBlock(c_in=17, c_out=64, kernel_size=3, dropout_p=0.15)

First block of the stack.

x = torch.randn(2, 17, 30)

Channel-first input (B=2, C=17, T=30). Don't forget the .transpose(1, 2) at the boundary if your data flows as (B, T, F) - see §3.2's PyTorch block.

y = block(x)

Forward pass.

print("input :", tuple(x.shape))

Verify input.

EXECUTION STATE
Output = input : (2, 17, 30)
print("output :", tuple(y.shape))

Channels grew 17 → 64; time preserved.

EXECUTION STATE
Output = output : (2, 64, 30)
print("# params:", sum(p.numel() for p in block.parameters()))

3264 conv weights + 64 BN gamma + 64 BN beta = 3392. The BN's running mean / running var are buffers (not parameters) so don't appear in this count.

EXECUTION STATE
Output = # params: 3392
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1D → BatchNorm1d → ReLU → Dropout."""
    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, dropout_p: float = 0.15):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn   = nn.BatchNorm1d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, T)
        return self.drop(self.relu(self.bn(self.conv(x))))


# Use it
torch.manual_seed(0)
block = ConvBlock(c_in=17, c_out=64, kernel_size=3, dropout_p=0.15)
x = torch.randn(2, 17, 30)
y = block(x)

print("input  :", tuple(x.shape))                  # (2, 17, 30)
print("output :", tuple(y.shape))                  # (2, 64, 30)
print("# params:", sum(p.numel() for p in block.parameters()))
# prints "# params: 3392"  (= 64*17*3 weights + 64 BN gamma + 64 BN beta)
bias=False because BN. BatchNorm subtracts the per-channel mean, which absorbs any conv bias. Setting bias=False on the conv saves c_out parameters and a tiny bit of compute - standard practice when conv is followed directly by BN.
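
The absorption claim can be checked numerically without PyTorch. This is a minimal sketch, assuming training-mode BatchNorm (batch statistics) with gamma = 1 and beta = 0; `batchnorm_train` is a hypothetical helper and the tensors are random stand-ins for a conv output.

```python
import numpy as np


def batchnorm_train(z, eps=1e-5):
    """Per-channel standardisation over the batch and time axes."""
    mu = z.mean(axis=(0, 2), keepdims=True)
    var = z.var(axis=(0, 2), keepdims=True)
    return (z - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
conv_out = rng.standard_normal((2, 64, 30))   # stand-in for a bias-free conv output
bias = rng.standard_normal(64)                # an arbitrary per-channel conv bias

with_bias = batchnorm_train(conv_out + bias[None, :, None])
without_bias = batchnorm_train(conv_out)

# Adding a per-channel constant shifts the channel mean by the same constant,
# so the subtraction removes it and the normalised outputs coincide.
print(np.allclose(with_bias, without_bias))
```

Any constant offset the network needs per channel is still available through BN's learnable beta, which is why nothing is lost by dropping the conv bias.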

Conv-Frontend Architectures Elsewhere

| Architecture | Conv frontend | Purpose |
| --- | --- | --- |
| RUL CNN-BiLSTM-Attention (this book) | 3 stacked Conv1D blocks | Local pattern extraction |
| WaveNet (audio) | Dilated Conv1D | Long-range without recurrence |
| DeepSpeech 2 | Conv2D over (time, freq) | Speech feature extraction |
| Vision Transformer (ViT) | Single 16×16 Conv2D ('patch embed') | Tokenise the image |
| AlphaFold (first version) | Deep dilated 2D residual conv network over pairwise features | Inter-residue distance prediction |
| Conformer (speech) | ConvModule between attention layers | Local pattern boost on top of attention |

Two Frontend Pitfalls

Pitfall 1: Forgetting padding. Without same padding, every stride-1 conv layer shrinks the time axis by K-1. Three K=3 layers without padding lose 6 cycles - a fifth of a 30-cycle window. For odd K and stride 1, always set padding=K//2.
Pitfall 2: Wrong axis order. Conv1d expects (B, C, T). Our data flows as (B, T, F). Bridge with x.transpose(1, 2) - same trap as §3.2.
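
Both pitfalls are visible in array shapes alone. The sketch below uses NumPy instead of PyTorch so that it runs without a GPU stack; `np.correlate` stands in for a single-channel conv layer, the averaging kernel is arbitrary, and `np.swapaxes(x, 1, 2)` plays the role of `x.transpose(1, 2)`.

```python
import numpy as np

T, K = 30, 3
signal = np.random.default_rng(0).standard_normal(T)
kernel = np.ones(K) / K   # arbitrary kernel; only the lengths matter here

# Pitfall 1: three stacked K=3 layers with and without 'same' padding
no_pad, same_pad = signal, signal
for _ in range(3):
    no_pad = np.correlate(no_pad, kernel, mode="valid")    # loses K - 1 = 2 cycles
    same_pad = np.correlate(same_pad, kernel, mode="same")  # length preserved
print(len(no_pad), len(same_pad))   # 24 30

# Pitfall 2: data arrives as (B, T, F) but the conv expects (B, C, T)
batch_btf = np.zeros((2, 30, 17))
batch_bct = np.swapaxes(batch_btf, 1, 2)
print(batch_btf.shape, "->", batch_bct.shape)   # (2, 30, 17) -> (2, 17, 30)
```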
The point. Conv1D + BN + ReLU + Dropout is the building block. Stacked three times with growing then shrinking channels (17 → 64 → 128 → 64), it forms the entire CNN frontend. Section 8.2 builds that stack.

Takeaway

  • Conv1D is the model's frontend. Local pattern extraction; channel mixing; translation invariance.
  • The block is conv + BN + ReLU + dropout. Four ops in a fixed order. Same pattern across most CNN architectures.
  • bias=False on the conv. BN's mean subtraction cancels any conv bias, and BN's beta supplies the learnable shift. Standard idiom.
  • 3,392 parameters in the first block. Tiny. The full three-layer stack (17 → 64 → 128 → 64) adds up to ~53k parameters (3,392 + 24,832 + 24,704 = 52,928) - small relative to the BiLSTM's 2.1M.
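
The parameter budget of the stack can be tallied directly from the channel plan. This sketch assumes every block follows the pattern used in this section (K = 3, conv with bias=False, BatchNorm gamma and beta) and the 17 → 64 → 128 → 64 channel sequence; `conv_block_params` is a hypothetical helper.

```python
def conv_block_params(c_in: int, c_out: int, kernel_size: int = 3) -> int:
    """Learnable parameters in one Conv1d(bias=False) + BatchNorm1d block."""
    conv_weights = c_out * c_in * kernel_size
    bn_params = 2 * c_out   # gamma and beta
    return conv_weights + bn_params


channels = [17, 64, 128, 64]
per_block = [conv_block_params(a, b) for a, b in zip(channels[:-1], channels[1:])]

print(per_block)        # [3392, 24832, 24704]
print(sum(per_block))   # 52928
```

Only the first block is as small as 3,392; the two wider blocks account for most of the total.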