Chapter 3

1D Convolution for Sensor Streams

Mathematical Preliminaries

From Edge Detectors to Sensor Streams

Open any introductory image-processing book and the first algorithm you meet is the Sobel edge detector: a tiny 3 × 3 weight matrix that, when slid across an image, lights up wherever brightness changes sharply. Gravitational-wave astronomers do the same with matched filters: a known waveform template is correlated against a noisy receiver stream, with peaks marking detections. Speech-recognition front-ends use 1-D filters to find vowel onsets in a microphone signal. These are the same operation: take a small set of weights, slide it along a signal, sum the products at every position.

Applied to a turbofan engine's sensor stream, the operation is called a 1D convolution — the workhorse of the first layer of every CNN-based prognostic model in this book. The kernel learns to detect local degradation patterns: sudden spikes, gradual trends, oscillations. The same architectural primitive that detects edges in photographs detects bearing wear in vibration signals.

The mental model. A 1D conv kernel is a learned question, repeated at every cycle. “Is there a sharp upward deflection in the pressure trace within these three cycles?” The answer becomes one number per output position; the kernel weights encode what question the model has learned to ask.

The 1D Convolution Operation

Given an input signal $\mathbf{x} \in \mathbb{R}^{T}$ and a kernel $\mathbf{w} \in \mathbb{R}^{K}$, the (cross-)correlation that PyTorch and most deep-learning libraries call “convolution” is

$$y_t \;=\; \sum_{k=0}^{K-1} w_k \, x_{t+k} \;+\; b.$$

Three knobs control the geometry of the operation: kernel size $K$, stride $S$, and padding $P$. The output length is

$$T_{\text{out}} \;=\; \left\lfloor \frac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1.$$

| Symbol | Meaning | Common choice |
|---|---|---|
| $T_{\text{in}}$ | Input sequence length | 30 (C-MAPSS window) |
| $K$ | Kernel size (receptive field) | 3 or 5 |
| $P$ | Zero padding on each side | 1 (for $K = 3$, 'same') |
| $S$ | Stride | 1 (no downsampling) |
| $T_{\text{out}}$ | Output length | 30 with 'same' padding |

With our default $K = 3$, $P = 1$, $S = 1$: $T_{\text{out}} = \lfloor (30 + 2 - 3)/1 \rfloor + 1 = 30$. The time axis is preserved layer to layer, which is what we want when we stack three conv layers and then hand off to the BiLSTM.
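The output-length formula is easy to sanity-check in code. The helper below is a minimal sketch; the name `conv1d_out_len` is made up for this illustration and is not part of any library.

```python
def conv1d_out_len(t_in: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1D convolution: floor((T_in + 2P - K) / S) + 1."""
    return (t_in + 2 * padding - kernel) // stride + 1


# Default configuration used in this section: K=3, P=1, S=1 on a 30-cycle window.
print(conv1d_out_len(30, kernel=3, stride=1, padding=1))  # 30

# No padding: each layer trims K - 1 = 2 cycles.
print(conv1d_out_len(30, kernel=3))                       # 28

# Stride 2 roughly halves the time axis.
print(conv1d_out_len(30, kernel=3, stride=2, padding=1))  # 15
```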

A worked example. Let $\mathbf{x} = [1, 3, 2, 4, 1]$ and $\mathbf{w} = [0.5, 1.0, 0.5]$. With no padding the kernel sits at three positions:
$y_0 = 0.5 \cdot 1 + 1.0 \cdot 3 + 0.5 \cdot 2 = 4.5$
$y_1 = 0.5 \cdot 3 + 1.0 \cdot 2 + 0.5 \cdot 4 = 5.5$
$y_2 = 0.5 \cdot 2 + 1.0 \cdot 4 + 0.5 \cdot 1 = 5.5$
Output length $= 5 - 3 + 1 = 3$. The kernel computes a centre-weighted local sum: its weights add up to 2, so each output is twice a weighted average of the window, with the centre cycle counting double.
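The hand computation can be cross-checked with NumPy's built-in `np.correlate`, which also makes the "cross-correlation, not convolution" point concrete. This is only a sketch: the asymmetric kernel `v` is an assumption introduced here to make the difference visible, since the symmetric `w` gives the same result under both operations.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 4.0, 1.0])
w = np.array([0.5, 1.0, 0.5])

# Cross-correlation (what deep-learning libraries call "convolution").
y_corr = np.correlate(x, w, mode="valid")
print(y_corr.tolist())   # [4.5, 5.5, 5.5]

# True convolution flips the kernel first. w is symmetric, so nothing changes.
y_conv = np.convolve(x, w, mode="valid")
print(y_conv.tolist())   # [4.5, 5.5, 5.5]

# With an asymmetric kernel the two operations differ.
v = np.array([1.0, 0.0, -1.0])
print(np.correlate(x, v, mode="valid").tolist())  # [-1.0, -1.0, 1.0]
print(np.convolve(x, v, mode="valid").tolist())   # [1.0, 1.0, -1.0]
```

Because the weights are learned, the flip makes no practical difference to a trained network; it only matters when comparing against textbook signal-processing formulas.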

Interactive: Watch the Kernel Slide

The visualization below uses a real C-MAPSS sensor (T30, total temperature at HPC outlet) and a 3-tap kernel. Press play and watch the kernel walk across the input; pause to inspect the weighted sum at any position. The padding toggle shows you exactly what the zero-pad cycles look like at the edges.

Interactive 1D Convolution Visualizer

Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)

What happens when we declare this line?

input_size = Number of input channels (17 sensors in C-MAPSS)
64 = Output channels (64 learned feature detectors)
kernel_size=3 = Window looks at 3 consecutive timesteps
padding=1 = Add zeros at boundaries to preserve length
Data: NASA C-MAPSS FD001 - T30 (Total temperature at HPC outlet)
Input (padded): [pad 0.00, t0 0.82, t1 0.91, t2 0.76, t3 0.88, t4 0.95, t5 0.71, t6 0.84, t7 0.93, pad 0.00]
Kernel (size=3): w0 = 0.33, w1 = 0.34, w2 = 0.33
Step 1/8, calculation at position 0: y0 = 0.33 × 0.00 + 0.34 × 0.82 + 0.33 × 0.91 = 0.000 + 0.279 + 0.300 = 0.579
Output so far: y0 = 0.58

1D Convolution Equation

$y_t = \sum_{k=0}^{K-1} w_k \cdot x_{t+k} + b$

  • K = kernel size (3 in our case)
  • w = learned weights
  • b = bias term
  • t = output position

Output Dimension Formula

$T_{\text{out}} = \lfloor (T_{\text{in}} + 2P - K) / S \rfloor + 1$

With $T_{\text{in}} = 8$, $P = 1$, $K = 3$, $S = 1$:

$T_{\text{out}} = \lfloor (8 + 2 - 3) / 1 \rfloor + 1 = 8$

Padding preserves sequence length!
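The visualizer's first step can be reproduced in a few lines. The sketch below assumes the eight T30 values and the kernel weights displayed in the widget, and uses `np.correlate` as a stand-in for the sliding weighted sum (no bias term).

```python
import numpy as np

# Eight T30 readings and the 3-tap kernel shown in the visualizer.
x = np.array([0.82, 0.91, 0.76, 0.88, 0.95, 0.71, 0.84, 0.93])
w = np.array([0.33, 0.34, 0.33])

# padding=1: one zero on each side, so the output keeps all 8 positions.
x_padded = np.pad(x, 1)
y = np.correlate(x_padded, w, mode="valid")

print(len(y))                 # 8
print(round(float(y[0]), 3))  # 0.579
```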

Parameter Count

For Conv1d(17, 64, kernel_size=3):

Weights = 64 × 17 × 3 = 3,264

Biases = 64

Total = 3,328 parameters

What the Kernel Learns

The kernel weights are learned during training. Different patterns emerge:

  • [1, 0, -1] → Detects rising/falling edges
  • [0.33, 0.33, 0.33] → Smoothing/averaging
  • [−1, 2, −1] → Detects spikes

64 different kernels learn 64 different patterns!

Two things to take away. First, the output value at any position $t$ depends only on cycles $t$, $t+1$, $t+2$: the receptive field. Cycles outside that window cannot affect $y_t$; this is why Section 9 will follow the conv with a BiLSTM that integrates over the whole window. Second, the same kernel applies at every position, so the conv layer is translation-equivariant (often loosely called translation-invariant): shift the input and the output shifts with it. A spike at cycle 5 and a spike at cycle 25 produce the same output magnitude, just at different positions, provided neither sits next to the zero-padded edges; the model doesn't need to learn to detect spikes twice.
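The second point can be checked numerically. The sketch below is illustrative only: the 30-cycle signals and the spike kernel are made-up inputs. It places the same unit spike at cycle 5 and at cycle 25 and confirms that the response has the same peak value, shifted by 20 positions.

```python
import numpy as np

kernel = np.array([-1.0, 2.0, -1.0])   # spike detector


def respond(spike_at: int, length: int = 30) -> np.ndarray:
    """Flat signal with one unit spike, filtered with 'same' zero padding."""
    x = np.zeros(length)
    x[spike_at] = 1.0
    return np.correlate(np.pad(x, 1), kernel, mode="valid")


y_early = respond(5)
y_late = respond(25)

print(float(y_early.max()), float(y_late.max()))     # 2.0 2.0
print(int(y_early.argmax()), int(y_late.argmax()))   # 5 25

# Shifting the input by 20 cycles shifts the output by 20 cycles.
print(bool(np.allclose(np.roll(y_early, 20), y_late)))  # True
```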

Multi-Channel: 17 Sensors at Once

Real sensor data has $F = 17$ channels (informative C-MAPSS sensors). The 1D convolution generalises trivially: each output channel $j$ uses a separate kernel $\mathbf{W}^{(j)} \in \mathbb{R}^{C_{\text{in}} \times K}$ that spans all input channels, then sums the contributions:

$$y_t^{(j)} \;=\; \sum_{c=1}^{C_{\text{in}}} \sum_{k=0}^{K-1} W^{(j)}_{c,k} \, x_{t+k}^{(c)} \;+\; b^{(j)}.$$

Each output channel is the weighted sum of all input channels across the kernel's temporal window. With $C_{\text{in}} = 17$, $C_{\text{out}} = 64$, $K = 3$ the layer holds $64 \times 17 \times 3 = 3{,}264$ weights plus 64 biases: 3,328 learnable parameters in one layer.
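The double sum maps directly onto a NumPy `einsum` over sliding windows. The function below is a sketch for intuition, not how PyTorch implements the layer; `conv1d_multichannel` and the random inputs are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv1d_multichannel(x, W, b, padding=0):
    """x: (C_in, T), W: (C_out, C_in, K), b: (C_out,) -> (C_out, T_out)."""
    if padding > 0:
        x = np.pad(x, ((0, 0), (padding, padding)))   # pad the time axis only
    K = W.shape[-1]
    windows = sliding_window_view(x, K, axis=1)       # (C_in, T_out, K)
    # Sum over input channels c and kernel taps k for every output channel o.
    return np.einsum("ock,ctk->ot", W, windows) + b[:, None]


rng = np.random.default_rng(0)
C_in, C_out, K, T = 17, 64, 3, 30
x = rng.standard_normal((C_in, T))
W = rng.standard_normal((C_out, C_in, K))
b = rng.standard_normal(C_out)

y = conv1d_multichannel(x, W, b, padding=1)
print(y.shape)           # (64, 30)
print(W.size + b.size)   # 3328
```

Each entry of `y` is one evaluation of the formula: 51 products (17 channels times 3 taps) summed, plus the bias for that output channel.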

| Layer | Input shape (B, T, C) | Output shape | Params |
|---|---|---|---|
| Conv1D #1 | (B, 30, 17) | (B, 30, 64) | 64*17*3 + 64 = 3,328 |
| Conv1D #2 | (B, 30, 64) | (B, 30, 128) | 128*64*3 + 128 = 24,704 |
| Conv1D #3 | (B, 30, 128) | (B, 30, 64) | 64*128*3 + 64 = 24,640 |
| Total | | | 52,672 |
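The parameter column can be recomputed with a two-line helper. This is a sketch; `conv1d_params` is a name chosen for this example, and it counts only the convolution weights and biases listed in the table.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights (c_out * c_in * kernel) plus one bias per output channel."""
    return c_out * c_in * kernel + c_out


layers = [(17, 64), (64, 128), (128, 64)]   # (C_in, C_out) for the three conv layers
counts = [conv1d_params(c_in, c_out, kernel=3) for c_in, c_out in layers]

print(counts)       # [3328, 24704, 24640]
print(sum(counts))  # 52672
```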

Interactive: Multi-Channel in Detail

The next visualization steps through a stacked 8 → 16 → 8 architecture. Click any output cell at any layer and the diagram shows you which input cells contributed to it — you can literally see the receptive field grow as you move up the stack.
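The receptive-field growth mentioned here follows a simple rule. The sketch below assumes stride 1 and no dilation, in which case each additional layer with kernel size K widens the receptive field by K - 1 cycles; the function name is illustrative.

```python
def receptive_field(kernel_sizes: list[int]) -> int:
    """Receptive field of stacked stride-1, undilated conv layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


print(receptive_field([3]))        # 3
print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7
```

Under these assumptions, three stacked K = 3 layers let each output position see 7 of the 30 cycles in a window.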

Multi-Channel 1D Convolution + ReLU

Understanding how Conv1d processes multiple input channels to produce multiple output channels, followed by a ReLU activation

Two-Layer CNN Architecture: 8 → 16 → 8 channels (with ReLU)

Input (8, 6) → Conv1 + ReLU → Hidden (16, 6) → Conv2 + ReLU → Output (8, 6), with ReLU(x) = max(0, x)

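Input: 8 Sensors × 6 Timesteps

| Sensor | t0 | t1 | t2 | t3 | t4 | t5 |
|---|---|---|---|---|---|---|
| T₃₀ | 0.82 | 0.91 | 0.76 | 0.88 | 0.95 | 0.71 |
| P₃₀ | 0.45 | 0.52 | 0.48 | 0.55 | 0.61 | 0.58 |
| Vib | 0.33 | 0.29 | 0.35 | 0.31 | 0.28 | 0.32 |
| RPM | 0.67 | 0.72 | 0.69 | 0.74 | 0.78 | 0.75 |
| Flow | 0.21 | 0.25 | 0.23 | 0.27 | 0.24 | 0.22 |
| Fuel | 0.89 | 0.85 | 0.92 | 0.88 | 0.84 | 0.90 |
| Exh | 0.56 | 0.59 | 0.54 | 0.62 | 0.58 | 0.55 |
| Oil | 0.41 | 0.38 | 0.44 | 0.40 | 0.43 | 0.39 |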
After Conv1 + ReLU: 16 Features × 6 Timesteps

| Feature | t0 | t1 | t2 | t3 | t4 | t5 |
|---|---|---|---|---|---|---|
| F0 | 0.42 | 0.33 | 0.38 | 0.41 | 0.41 | 0.00 |
| F1 | 0.62 | 0.74 | 0.78 | 0.74 | 0.75 | 0.28 |
| F2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.16 |
| F3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F4 | 0.00 | 0.06 | 0.10 | 0.02 | 0.11 | 0.20 |
| F5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 |
| F6 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F7 | 0.34 | 0.45 | 0.54 | 0.49 | 0.43 | 0.26 |
| F8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F11 | 0.30 | 0.09 | 0.10 | 0.08 | 0.11 | 0.00 |
| F12 | 0.14 | 0.13 | 0.20 | 0.16 | 0.10 | 0.22 |
| F13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| F14 | 0.00 | 0.21 | 0.18 | 0.15 | 0.20 | 0.48 |
| F15 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
After Conv2 + ReLU: 8 Features × 6 Timesteps

| Output | t0 | t1 | t2 | t3 | t4 | t5 |
|---|---|---|---|---|---|---|
| Out0 | 0.47 | 0.50 | 0.61 | 0.63 | 0.48 | 0.20 |
| Out1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Out2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Out3 | 0.31 | 0.13 | 0.16 | 0.17 | 0.10 | 0.00 |
| Out4 | 0.08 | 0.14 | 0.10 | 0.04 | 0.04 | 0.00 |
| Out5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Out6 | 0.00 | 0.09 | 0.04 | 0.00 | 0.00 | 0.00 |
| Out7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

Key Insight: Multi-Channel Convolution

Each output channel is computed by summing contributions from ALL input channels:

$y[\text{out\_ch}, t] = \sum_{\text{in\_ch}} \sum_{k} W[\text{out\_ch}, \text{in\_ch}, k] \times x[\text{in\_ch}, t+k] + \text{bias}$

Why the channel count grows then shrinks. Early layers expand 17 raw sensors into a richer 64-channel local feature space. The middle layer pushes it further to 128 to allow combinations of local patterns. The final conv layer compresses back to 64 to keep the BiLSTM input compact. The shape $17 \to 64 \to 128 \to 64$ is an expand-then-compress pattern: widest in the middle, the mirror image of the narrow waist in an encoder-decoder.
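The channel progression can be sketched as a small `nn.Sequential`. This is an illustrative stack only: the ReLU placement follows the visualizer above, and the book's full model may include normalisation, dropout, or other layers that are omitted here.

```python
import torch
import torch.nn as nn

# Illustrative 17 -> 64 -> 128 -> 64 stack with 'same' padding throughout.
cnn = nn.Sequential(
    nn.Conv1d(17, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
)

x = torch.randn(2, 17, 30)        # (B, C_in, T), already channels-first
y = cnn(x)

print(tuple(y.shape))                             # (2, 64, 30)
print(sum(p.numel() for p in cnn.parameters()))   # 52672
```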

Python: 1D Convolution From Scratch

A short NumPy function makes the operation fully transparent. We define conv1d_naive, run it on the 5-sample toy from above to verify the hand-computed numbers, then apply two well-known hand-crafted kernels, an edge detector and a three-tap smoother, to the same signal.

A from-scratch 1-D cross-correlation
🐍conv1d_naive.py
import numpy as np

Same NumPy alias used throughout the book.

def conv1d_naive(x, w, stride=1, padding=0, bias=0.0) -> np.ndarray:

A single-channel, unbatched implementation of the 1-D cross-correlation. PyTorch's nn.Conv1d does the same operation under the hood, just batched and multi-channel.

EXECUTION STATE
input: x (np.ndarray) = 1-D input signal of length L_in
input: w (np.ndarray) = 1-D kernel of length K (the learnable weights in a real network)
input: stride = 1 by default - move the kernel one position at a time
input: padding = 0 by default - 'valid' convolution shrinks the output. padding=1 with K=3 keeps L_out = L_in.
input: bias = Scalar added to every output position (one bias per output channel in a real layer)
returns np.ndarray = 1-D output of length L_out = (L_in + 2P - K) // stride + 1
if padding > 0: x = np.pad(x, padding)

Pads the input with zeros on both sides. After this, the kernel can sit centred on the original first / last cycle without falling off the edge. Without padding, every conv layer shrinks the time axis by K - 1 cycles - stack three layers and you have lost 6 cycles.

EXECUTION STATE
np.pad(arr, n) = Default mode is 'constant' with constant_values=0. Pads BOTH sides by n elements. So np.pad([1,3,2,4,1], 1) = [0, 1, 3, 2, 4, 1, 0].
Example = padding=1 turns [1, 3, 2, 4, 1] into [0, 1, 3, 2, 4, 1, 0]
K = len(w)

Kernel length. For our toy demo K = 3.

EXECUTION STATE
K = 3
L_in = len(x)

Length of the input AFTER padding (5 + 0 = 5 valid; 5 + 2 = 7 with padding=1).

EXECUTION STATE
L_in (no padding) = 5
L_in (padding=1) = 7
L_out = (L_in - K) // stride + 1

Standard output-length formula. Floor division ensures we drop fractional positions where the kernel would not fit. For our toy run with no padding: (5 - 3) // 1 + 1 = 3.

EXECUTION STATE
Example: no padding, K=3 = (5 - 3) // 1 + 1 = 3
Example: padding=1, K=3 = (7 - 3) // 1 + 1 = 5 - same as input length
Example: stride=2, K=3 = (5 - 3) // 2 + 1 = 2 - downsampling
y = np.zeros(L_out, dtype=np.float32)

Pre-allocate the output array. float32 to match what PyTorch will use later.

EXECUTION STATE
y.shape = (3,) for the valid case; (5,) when padding=1
for t in range(L_out):

Iterate over every valid output position.

LOOP TRACE · 3 iterations
t = 0
kernel position = covers x[0:3] = [1, 3, 2]
t = 1
kernel position = covers x[1:4] = [3, 2, 4]
t = 2
kernel position = covers x[2:5] = [2, 4, 1] - last valid window
s = t * stride

Start index inside x. With stride 1, s = t. With stride 2, s = 0, 2, 4, ... - the kernel skips every other input position, halving the output length.

EXECUTION STATE
Example: t=0, stride=1 = s = 0
Example: t=1, stride=2 = s = 2
y[t] = float(np.dot(x[s:s + K], w)) + bias

The whole convolution operation in one line: dot product between the K-length window and the kernel, plus bias. np.dot of two 1-D arrays of equal length returns a NumPy scalar. The float(...) cast converts it to a plain Python float.

EXECUTION STATE
np.dot(a, b) = For 1-D arrays: sum over element-wise products. For 2-D arrays: matrix multiply. We use the 1-D form here.
Example: t=0 = np.dot([1, 3, 2], [0.5, 1.0, 0.5]) = 0.5 + 3 + 1 = 4.5
Example: t=1 = np.dot([3, 2, 4], [0.5, 1.0, 0.5]) = 1.5 + 2 + 2 = 5.5
Example: t=2 = np.dot([2, 4, 1], [0.5, 1.0, 0.5]) = 1 + 4 + 0.5 = 5.5
return y

Hand the output back. Same dtype as we allocated (float32).

x = np.array([1.0, 3.0, 2.0, 4.0, 1.0])

A toy input - five samples from some imagined sensor. The values zigzag up to a peak of 4 and then drop at the end - mimicking a brief excursion.

EXECUTION STATE
x.shape = (5,)
x.dtype = float64 by default
w = np.array([0.5, 1.0, 0.5])

A 'weighted average' kernel - centre weight 1.0, neighbours 0.5. Sums to 2 (NOT a unit-sum smoother), so the output is roughly 2x the local mean. We'll see it act like a smoother that emphasises the centre cycle.

EXECUTION STATE
w = [0.5, 1.0, 0.5]
centre weight = 1.0 (the 'now')
neighbour weights = 0.5 each (past + future)
y_valid = conv1d_naive(x, w)

'Valid' convolution: no padding. The kernel only sits at positions where it fully fits. Output length 3.

EXECUTION STATE
y_valid = [4.5, 5.5, 5.5]
len(y_valid) = 3 - lost 2 cycles to the edge effect
y_same = conv1d_naive(x, w, padding=1)

'Same' convolution: pad by floor(K/2) on each side so the output length equals the input length. Standard choice when stacking many conv layers.

EXECUTION STATE
y_same = [2.5, 4.5, 5.5, 5.5, 3.0]
len(y_same) = 5 - matches the input length
edge effect = y_same[0] = 0.5*0 + 1.0*1 + 0.5*3 = 2.5 - the '0' is the left zero-pad. Because the pad contributes nothing to the sum, the first and last outputs are pulled toward zero compared with interior positions.
print("input :", x.tolist())

Print the original signal.

EXECUTION STATE
Output = input : [1.0, 3.0, 2.0, 4.0, 1.0]
print("kernel :", w.tolist())

Print the kernel.

EXECUTION STATE
Output = kernel : [0.5, 1.0, 0.5]
print("y (no padding) :", y_valid.tolist())

Print the valid-conv output.

EXECUTION STATE
Output = y (no padding) : [4.5, 5.5, 5.5]
print("y (padding = 1) :", y_same.tolist())

Print the same-conv output.

EXECUTION STATE
Output = y (padding = 1) : [2.5, 4.5, 5.5, 5.5, 3.0]
edge_kernel = np.array([-1.0, 2.0, -1.0])

Negated discrete approximation to the second derivative (the standard stencil is [1, -2, 1]). High response when the centre cycle is much higher than its neighbours - i.e., a SPIKE.

EXECUTION STATE
edge_kernel = [-1, +2, -1]
→ when does it fire? = If x = [a, b, a] (equal neighbours): output = -a + 2b - a = 2(b - a). Big when b ≠ a. If x = [a, a, a] flat: output = 0.
smooth_kernel = np.array([1/3, 1/3, 1/3])

Equal-weight three-tap moving average. Removes high-frequency noise; output equals the local mean.

EXECUTION STATE
smooth_kernel = [0.333, 0.333, 0.333]
sums to = 1.0 - preserves the signal scale
print("edge detector out :", conv1d_naive(x, edge_kernel).tolist())

Apply the edge detector. The middle output is -3.0 because x[2] = 2 is LOWER than its neighbours (3 and 4) - so the centre is a 'dip', not a 'spike', and the kernel returns a negative value.

EXECUTION STATE
Output = edge detector out : [3.0, -3.0, 5.0]
Position 0 = -1*1 + 2*3 + -1*2 = 3.0 (3 is a peak)
Position 1 = -1*3 + 2*2 + -1*4 = -3.0 (2 is a valley)
Position 2 = -1*2 + 2*4 + -1*1 = 5.0 (4 is a peak)
print("smoother out :", conv1d_naive(x, smooth_kernel).tolist())

Apply the moving-average smoother. Output = local mean. (1+3+2)/3 = 2.0; (3+2+4)/3 = 3.0; (2+4+1)/3 ≈ 2.33.

EXECUTION STATE
Output = smoother out : [2.0, 3.0, 2.333]
import numpy as np

# ----- A from-scratch 1D convolution -----
# y[t] = sum_{k=0}^{K-1} w[k] * x[t + k] + b
# Inputs may be padded with zeros to control output length.

def conv1d_naive(x: np.ndarray,
                 w: np.ndarray,
                 stride: int = 1,
                 padding: int = 0,
                 bias: float = 0.0) -> np.ndarray:
    """1-D cross-correlation. No batch / no channels yet."""
    if padding > 0:
        x = np.pad(x, padding)
    K     = len(w)
    L_in  = len(x)
    L_out = (L_in - K) // stride + 1
    y     = np.zeros(L_out, dtype=np.float32)
    for t in range(L_out):
        s = t * stride
        y[t] = float(np.dot(x[s:s + K], w)) + bias
    return y


# ----- Run on a tiny signal with a 3-tap "smoother" kernel -----
x = np.array([1.0, 3.0, 2.0, 4.0, 1.0])     # length 5
w = np.array([0.5, 1.0, 0.5])               # length 3 - weighted average

y_valid = conv1d_naive(x, w)
y_same  = conv1d_naive(x, w, padding=1)

print("input             :", x.tolist())
print("kernel            :", w.tolist())
print("y (no padding)    :", y_valid.tolist())     # [4.5, 5.5, 5.5]
print("y (padding = 1)   :", y_same.tolist())      # [2.5, 4.5, 5.5, 5.5, 3.0]


# ----- Edge detector and smoother on the same signal -----
edge_kernel = np.array([-1.0,  2.0, -1.0])
smooth_kernel = np.array([1/3, 1/3, 1/3])

print("edge detector out :", conv1d_naive(x, edge_kernel).tolist())
# edge detector out : [3.0, -3.0, 5.0]
print("smoother out      :", conv1d_naive(x, smooth_kernel).tolist())
# smoother out      : [2.0, 3.0, 2.333]

Verifying the hand-computation

The valid output $[4.5, 5.5, 5.5]$ matches the three-line worked example earlier in the section to the digit. The same kernel with padding $= 1$ emits five values instead of three: the input-length-preserving choice that lets us stack many conv layers without losing time-axis cycles.

PyTorch: nn.Conv1d (and the Axis Trap)

The single most common Conv1d bug. PyTorch's nn.Conv1d expects input shape $(B,\, C_{\text{in}},\, T)$: channels SECOND, time LAST. Our CMAPSSDataset emits $(B,\, T,\, F)$. If you forget the .transpose(1, 2) bridge, one of two things happens. When $T \neq F$ (our case, 30 vs 17) PyTorch raises a channel-mismatch RuntimeError, which is the lucky outcome. When $T$ happens to equal $F$, or in_channels was mistakenly set to the window length, the layer silently treats your time axis as channels and your sensors as a tiny temporal window: the loss may still go down, the accuracy will not, and you can spend a week debugging. Always transpose.
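Whether the missing transpose is loud or silent depends on the shapes. The sketch below uses random tensors purely for illustration and shows both outcomes: with T = 30 and F = 17 PyTorch raises a RuntimeError, while with T equal to F the layer runs and returns plausible-looking but wrong values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)

# Case 1: T = 30, F = 17. Skipping the transpose fails loudly.
x_btf = torch.randn(2, 30, 17)
try:
    conv(x_btf)
    raised = False
except RuntimeError as err:
    raised = True
    print("shape error:", str(err)[:60])

# Case 2: T happens to equal F. The same mistake now runs without complaint.
x_square = torch.randn(2, 17, 17)          # (B, T=17, F=17)
y_wrong = conv(x_square)                   # time axis treated as channels
y_right = conv(x_square.transpose(1, 2))   # correct (B, F, T) layout

print(raised)                              # True
print(tuple(y_wrong.shape))                # (2, 64, 17)
print(torch.allclose(y_wrong, y_right))    # False: same shape, different numbers
```

A cheap guard is an assertion such as `assert x.shape[1] == conv.in_channels` right before the call, although it cannot catch the square case; only a consistent layout convention does.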

With that out of the way, the entire idiomatic PyTorch implementation is one nn.Conv1d instantiation plus the two transposes:

nn.Conv1d on a (B, T, F) batch — bridge with transpose
🐍conv1d_torch.py
import numpy as np

We do not strictly need NumPy here, but it stays imported for downstream interop.

import torch

Top-level PyTorch.

import torch.nn as nn

Container of every learnable layer. nn.Conv1d lives here.

torch.manual_seed(0)

Lock PyTorch's RNG so the random weights and inputs are deterministic across runs.

B, T, F = 2, 30, 17

Same B, T, F triple from Section 3.1. F = 17 because that is the count of informative C-MAPSS sensors after dropping constant ones (Section 5.3 will explain).

EXECUTION STATE
B = 2 - tiny batch for the demo
T = 30 - the standard sliding-window length
F = 17 - sensor count
x_btf = torch.randn(B, T, F) * 5 + 100

Build a fake batch in (B, T, F) order - the layout our CMAPSSDataset emits in Chapter 7. randn fills with standard Gaussian; we scale to std 5 and shift to mean 100 so the values look like raw, un-normalised sensor readings.

EXECUTION STATE
x_btf.shape = torch.Size([2, 30, 17])
x_btf.dtype = torch.float32
x_bft = x_btf.transpose(1, 2)

THE CRITICAL LINE. PyTorch's nn.Conv1d expects (B, C_in, T) - channels SECOND, time LAST. Our data flows as (B, T, F). transpose(1, 2) swaps axes 1 and 2 to convert. Forgetting this is the most common Conv1D bug; with T = 30 and F = 17 PyTorch raises a channel-mismatch error, and if the two sizes ever coincide the layer runs silently on garbage.

EXECUTION STATE
.transpose(dim0, dim1) = Swaps two axes. Returns a VIEW (no copy). Use .contiguous() if a downstream op insists on contiguous memory.
Before = (B=2, T=30, F=17)
After = (B=2, F=17, T=30)
→ why does Conv1d expect this? = Conv1d slides along the LAST axis. By convention in PyTorch (and Keras / TF channel-first mode) the time axis is last and the channel axis is second.
conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)

Construct the layer. Four arguments fully specify it. in_channels MUST match the channel dim of the input - 17 sensors here. out_channels = 64 means the layer will learn 64 different 3-tap kernels (each spanning all 17 input channels). padding=1 is 'same' padding for K=3 - output time-axis matches input.

EXECUTION STATE
nn.Conv1d(in, out, kernel_size, ...) = Wraps a learnable weight tensor of shape (out, in, K) and bias of shape (out,). On forward, slides the kernel along the LAST axis.
arg: in_channels=17 = Number of input channels per timestep. Must equal F.
arg: out_channels=64 = Number of distinct learned kernels. Each produces its own output channel.
arg: kernel_size=3 = How many time steps the kernel spans. 3 is the standard small choice; larger K = wider receptive field but more parameters.
arg: padding=1 = Zero-pads each side by 1. With K=3 this preserves time-axis length: T_out = T_in.
print("conv.weight.shape :", tuple(conv.weight.shape))

PyTorch lays out Conv1d weights as (out_channels, in_channels, kernel_size) - 64 distinct (17, 3) kernels stacked.

EXECUTION STATE
Output = conv.weight.shape : (64, 17, 3)
→ mental picture = Each of the 64 output channels has a (17, 3) kernel: 17 input channels × 3 time steps. All 51 weight-input products are summed, plus the bias, to produce one output value.
print("conv.bias.shape :", tuple(conv.bias.shape))

One bias scalar per output channel.

EXECUTION STATE
Output = conv.bias.shape : (64,)
print("# params :", sum(p.numel() for p in conv.parameters()))

Parameter accounting. 64 × 17 × 3 weights + 64 biases = 3,328 trainable parameters - tiny by deep-learning standards. The three conv layers tabulated earlier in this section total 52,672 parameters.

EXECUTION STATE
.parameters() = Iterator over every learnable Tensor in the module. .numel() returns the total element count of a tensor.
Output = # params : 3328
Breakdown = weights: 64 * 17 * 3 = 3,264, plus biases: 64 = 3,328
y_bft = conv(x_bft)

Forward pass. PyTorch's __call__ routes to the layer's forward(); under the hood it calls torch.conv1d with the layer's weight and bias.

EXECUTION STATE
y_bft.shape = torch.Size([2, 64, 30])
y_bft.dtype = torch.float32
y_bft.requires_grad = True - autograd is tracking gradients to conv.weight / conv.bias
y_btf = y_bft.transpose(1, 2)

Permute back to (B, T, F') so downstream layers in the book - which all assume (B, T, F) order - work without further surgery. F' = 64 (the new feature dimension after the conv).

EXECUTION STATE
Before = (2, 64, 30)
After = (2, 30, 64)
→ invariant = Time axis stays length 30 thanks to padding=1. Channels grew from 17 to 64.
print("x_btf.shape :", tuple(x_btf.shape))

Confirm the input shape.

EXECUTION STATE
Output = x_btf.shape : (2, 30, 17)
print("x_bft.shape :", tuple(x_bft.shape))

Post-transpose - what nn.Conv1d actually saw.

EXECUTION STATE
Output = x_bft.shape : (2, 17, 30)
print("y_bft.shape :", tuple(y_bft.shape))

Conv output - 64 feature maps, time axis unchanged.

EXECUTION STATE
Output = y_bft.shape : (2, 64, 30)
print("y_btf.shape :", tuple(y_btf.shape))

Back to the book's (B, T, F') convention. From here, BiLSTM / attention can consume it directly.

EXECUTION STATE
Output = y_btf.shape : (2, 30, 64)
print("y_btf[0, 0, :5] :", y_btf[0, 0, :5].tolist())

Peek at the first five output channels for engine 0, cycle 0. With random weights the values are arbitrary - what matters is the SHAPE (B, T, F') and the fact that gradients can flow through to update conv.weight.

EXECUTION STATE
Output (representative) = y_btf[0, 0, :5] : [-2.18, 0.97, -3.31, 4.62, 1.40]
→ after training = These same numbers become meaningful pattern detectors - some channels respond to upward trends, some to spikes, some to oscillations. Section 8 visualises this.
import numpy as np
import torch
import torch.nn as nn

# ----- nn.Conv1d on a multi-sensor batch -----
# IMPORTANT: nn.Conv1d expects input shape (B, C_in, T) - NOT (B, T, C_in)!
# Our CMAPSSDataset emits (B, T, F). We MUST .transpose(1, 2) before the conv.

torch.manual_seed(0)
B, T, F = 2, 30, 17                         # batch=2 engines, 30 cycles, 17 sensors

# Step 1 - the dataset gives us (B, T, F)
x_btf = torch.randn(B, T, F) * 5 + 100

# Step 2 - permute to (B, F, T) for nn.Conv1d
x_bft = x_btf.transpose(1, 2)               # (B, T, F) -> (B, F, T)

# Step 3 - build the conv layer
conv = nn.Conv1d(in_channels=17,
                 out_channels=64,
                 kernel_size=3,
                 padding=1)                 # 'same' padding for K=3

print("conv.weight.shape :", tuple(conv.weight.shape))   # (64, 17, 3)
print("conv.bias.shape   :", tuple(conv.bias.shape))     # (64,)
print("# params         :", sum(p.numel() for p in conv.parameters()))
# # params         : 3328  (= 64*17*3 + 64)

# Step 4 - forward pass
y_bft = conv(x_bft)                         # (B, F=17, T) -> (B, 64, T)

# Step 5 - permute BACK to (B, T, F') for the next layer / model section
y_btf = y_bft.transpose(1, 2)               # (B, 64, T) -> (B, T, 64)

print("x_btf.shape       :", tuple(x_btf.shape))    # (2, 30, 17)
print("x_bft.shape       :", tuple(x_bft.shape))    # (2, 17, 30)
print("y_bft.shape       :", tuple(y_bft.shape))    # (2, 64, 30)
print("y_btf.shape       :", tuple(y_btf.shape))    # (2, 30, 64)
print("y_btf[0, 0, :5]   :", y_btf[0, 0, :5].tolist())

An alternative that avoids the transpose. Some teams build their entire pipeline in $(B, F, T)$ order and only transpose at the very end (before the regression head). That is also valid; pick a convention and stick to it. This book chooses $(B, T, F)$ for the pipeline because it matches NLP transformers, makes the time axis prominent in tensor printouts, and aligns with what most papers report.

What Kernels Actually Learn

Hand-coding kernels (Sobel, smoothing, Gaussian) is the classical approach. Modern deep learning learns kernels from data through back-propagation. Once trained, the learned weights tend to look like recognisable pattern detectors:

| Pattern | Kernel approximation | What it fires on |
|---|---|---|
| Edge / spike | [-1, +2, -1] | Centre cycle higher than its neighbours |
| Gradient / trend | [-1, 0, +1] | Increasing values from left to right |
| Smoothing | [1/3, 1/3, 1/3] | Local mean - removes high-frequency noise |
| Difference | [0, +1, -1] | Cycle-to-cycle delta |
| Wide trend | [-1, -1, 0, +1, +1] | Slow upward drift over 5 cycles |

Three layers of these stacked become a hierarchy — layer 1 detects edges and gradients, layer 2 combines them into compound features (“spike followed by decay”), layer 3 produces high-level degradation signatures. Section 8 visualises exactly this hierarchy on a trained C-MAPSS model.
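To see the hand-crafted versions of these detectors respond, the sketch below runs three of the 3-tap kernels over two synthetic 8-cycle signals. Both signals are made up for this illustration, and `np.correlate` stands in for an unpadded, bias-free conv layer.

```python
import numpy as np

kernels = {
    "spike":    np.array([-1.0, 2.0, -1.0]),
    "trend":    np.array([-1.0, 0.0, 1.0]),
    "smoother": np.array([1 / 3, 1 / 3, 1 / 3]),
}

# Two synthetic 8-cycle signals (hypothetical data, not C-MAPSS).
spike_signal = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
ramp_signal = np.arange(8, dtype=float)          # steady upward drift

responses = {}
for name, k in kernels.items():
    on_spike = np.correlate(spike_signal, k, mode="valid")
    on_ramp = np.correlate(ramp_signal, k, mode="valid")
    responses[name] = (on_spike, on_ramp)
    print(f"{name:8s} spike -> {np.round(on_spike, 2).tolist()}")
    print(f"{name:8s} ramp  -> {np.round(on_ramp, 2).tolist()}")
```

The spike kernel peaks at 2 on the window centred on the spike and returns zero everywhere on the ramp; the trend kernel returns a constant 2 on the ramp; the smoother reproduces the local mean of each window.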

1D Convolution Beyond RUL

The same few lines of code show up everywhere a model needs to detect local patterns in a 1-D signal. Each row in the table below is solved with an architecture that, modulo the loader, is built on the same 1D-convolution front end as ours.

| Domain | Signal | What conv1d detects | Famous architecture |
|---|---|---|---|
| RUL (this book) | 17 engine sensors | Local degradation patterns | CNN-BiLSTM-Attention |
| Audio recognition | Mel-spectrogram | Phoneme onsets, formants | WaveNet / DeepSpeech |
| Music generation | Raw audio waveform | Pitched events, percussion | WaveNet / SampleRNN |
| Genomics | DNA bases (one-hot) | Motifs, regulatory elements | DeepBind / DeepSEA |
| ECG analysis | 12-lead voltage trace | QRS complexes, arrhythmia | ResNet-1D for cardiology |
| Gravitational waves | LIGO strain data | Compact-binary inspiral chirps | Matched filter / 1D ResNet |
| Network traffic | Bytes-per-second | Anomalous spikes, DDoS onsets | 1D CNN intrusion detection |
| Industrial vibration | Accelerometer trace | Bearing fault frequencies | 1D CNN + envelope spectrum |

The mathematical machinery in this book transfers to every row of that table by changing only the loader and the input dimensions. The attention mechanism in Section 3.4, the loss function in Section 14, the gradient balancer in Section 18 — all applicable wherever a 1D conv frontend is the right entry point.

The Three Pitfalls

Pitfall 1: The (B, T, F) vs (B, C, T) trap. Already flagged above. The PyTorch error message you get if your shapes are wrong is opaque (“Given groups=1, weight of size [64, 17, 3], expected input [2, 30, 17] to have 17 channels...”). Always verify the input shape with print(x.shape) immediately before any nn.Conv1d.
Pitfall 2: Forgetting padding shrinks T. A conv layer with K=3, no padding, removes 2 cycles. Stack three such layers and you have lost 6 cycles, a fifth of a 30-cycle window. Use $P = \lfloor K/2 \rfloor$ (with odd $K$) for “same” padding when stacking.
Pitfall 3: Treating in_channels as 1. Newcomers often confuse in_channels with batch size or kernel size. It is the feature dimension — the number of sensors per timestep. For multi-sensor data it is never 1.
The pattern. A 1D convolution slides a learned kernel along the time axis and asks the same local question at every cycle. Stacking layers builds a hierarchy of questions. Coupling it with the BiLSTM in Section 3.3 gives the model both local pattern detection and long-range temporal dynamics — the architecture every model in this book uses.

Takeaway

  • 1D convolution is one equation. $y_t = \sum_k w_k \, x_{t+k} + b$: a weighted sum over a sliding window.
  • Output length is mechanical. $T_{\text{out}} = \lfloor (T_{\text{in}} + 2P - K)/S \rfloor + 1$. Use $P = \lfloor K/2 \rfloor$ for same padding with odd $K$.
  • Multi-channel is the same with one more sum. Each output channel sums contributions from all input channels and all kernel positions: $64 \times 17 \times 3$ weights in the first layer.
  • PyTorch wants (B, C, T). Bridge from $(B, T, F)$ with x.transpose(1, 2) before nn.Conv1d; transpose back after.
  • Translation equivariance is free. The same kernel applies at every cycle, so a pattern is detected wherever it occurs and there is no need to teach the model about position. The BiLSTM in Section 3.3 will add position-aware temporal modelling on top.