Chapter 5

1D Convolutions for Time Series

CNN Feature Extractor

Learning Objectives

By the end of this section, you will:

  1. Understand the CNN-first architecture and why we extract local features before temporal modeling
  2. Master 1D convolution mathematics for time series data
  3. Apply convolution to multivariate sequences with multiple input channels
  4. Identify local patterns that CNNs learn to detect in sensor data
  5. Appreciate the CNN-LSTM synergy that makes our architecture powerful
Why This Matters: Our AMNL model uses a CNN as its first processing stage, before the BiLSTM. The CNN extracts local patterns from sensor readings—sudden spikes, gradual trends, noise artifacts—that the LSTM then integrates over time. Understanding this division of labor is key to understanding the architecture.

Why CNN Before LSTM?

A natural question: why not feed raw sensor data directly to the LSTM? The CNN-first approach offers several advantages.

Advantage 1: Local Feature Extraction

Sensor readings contain local patterns spanning a few timesteps:

  • Spikes: Sudden temperature increases (2-5 cycles)
  • Gradients: Rate of pressure change (3-7 cycles)
  • Oscillations: Cyclic patterns in vibration (5-10 cycles)

CNNs excel at detecting these patterns through learned filters.

Advantage 2: Dimensionality Reduction

The CNN compresses 17 raw features into richer representations:

$$(T, 17) \xrightarrow{\text{CNN}} (T', 64)$$

where $T' \leq T$ (the sequence may shorten with strided convolutions). The LSTM then processes fewer timesteps of richer features.

Advantage 3: Translation Invariance

A spike at cycle 5 and a spike at cycle 25 should both be recognized as "spike patterns." Convolution provides this naturally—the same filter slides across all positions.
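This invariance is easy to see in a few lines of NumPy. The sketch below uses an illustrative spike-detector kernel (not learned weights) on a synthetic signal: the response to a spike is identical whether it occurs at cycle 5 or cycle 25; only its position shifts.

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid 1D cross-correlation: slide kernel w over signal x."""
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) for t in range(len(x) - K + 1)])

# An illustrative spike-detector kernel (hand-picked, not learned)
kernel = np.array([-1.0, 2.0, -1.0])

# The same spike shape at two different positions
a = np.zeros(30); a[5] = 1.0    # spike at cycle 5
b = np.zeros(30); b[25] = 1.0   # spike at cycle 25

ya, yb = conv1d_valid(a, kernel), conv1d_valid(b, kernel)

# The peak response is identical; only its location shifts
print(ya.max(), yb.max())          # both 2.0
print(ya.argmax(), yb.argmax())    # 4 and 24 (shifted by 20 positions)
```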

Architecture Overview

```text
Input: (batch, 30, 17)   ← Window of 30 timesteps, 17 sensors
         ↓ CNN Block 1
         ↓ CNN Block 2
         ↓ CNN Block 3
Output: (batch, 30, 64)  ← Same timesteps, 64 learned features
```

1D Convolution Operation

1D convolution slides a kernel across the temporal dimension, computing weighted sums at each position.

Mathematical Definition

For input $\mathbf{x} \in \mathbb{R}^T$ and kernel $\mathbf{w} \in \mathbb{R}^K$:

$$y_t = \sum_{k=0}^{K-1} w_k \cdot x_{t+k} + b$$

Where:

  • $K$: Kernel size (receptive field)
  • $w_k$: Learned kernel weights
  • $b$: Bias term
  • $t$: Output position
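The definition translates almost line-for-line into code. A minimal NumPy sketch with illustrative sensor values and a hand-picked smoothing kernel:

```python
import numpy as np

# Direct transcription of y_t = sum_k w_k * x_{t+k} + b  (no padding, stride 1)
def conv1d(x, w, b=0.0):
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) + b for t in range(len(x) - K + 1)])

x = np.array([0.82, 0.91, 0.76, 0.88, 0.95])   # illustrative sensor readings
w = np.array([0.33, 0.34, 0.33])               # illustrative smoothing kernel

y = conv1d(x, w)
print(y)   # y[0] = 0.33*0.82 + 0.34*0.91 + 0.33*0.76, about 0.831

# Deep-learning "convolution" does not flip the kernel, so it matches
# NumPy's cross-correlation rather than np.convolve:
assert np.allclose(y, np.correlate(x, w, mode="valid"))
```

Note that without padding the output has only $T - K + 1$ positions, which is why padded convolutions are used when the sequence length must be preserved.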

Interactive Visualization

Use the interactive visualization below to see how 1D convolution works step by step with real NASA C-MAPSS sensor data. Watch the kernel slide across the input, observe the calculations at each position, and see how the output is built up.

Interactive 1D Convolution Visualizer

Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)

What does each argument in this declaration mean?

  • input_size: number of input channels (17 sensors in C-MAPSS)
  • 64: number of output channels (64 learned feature detectors)
  • kernel_size=3: each window spans 3 consecutive timesteps
  • padding=1: zeros added at the boundaries to preserve sequence length
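A quick standalone shape check of this layer (the batch size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

# The layer discussed above, with input_size = 17 for C-MAPSS
conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)

# nn.Conv1d expects (batch, channels, time): sensors are channels,
# so a 30-cycle window of 17 sensors is shaped (batch, 17, 30)
x = torch.randn(8, 17, 30)
y = conv(x)
print(y.shape)   # torch.Size([8, 64, 30]); padding=1 preserves the length
```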
Data: NASA C-MAPSS FD001 - T30 (total temperature at HPC outlet). The visualizer pads the 8-value input with zeros and slides a size-3 kernel (w = [0.33, 0.34, 0.33]) across it one position at a time. For example, at position 0:

y₀ = 0.33 × 0.00 + 0.34 × 0.82 + 0.33 × 0.91 = 0.000 + 0.279 + 0.300 = 0.579


Parameter Count

For Conv1d(17, 64, kernel_size=3):

Weights = 64 × 17 × 3 = 3,264

Biases = 64

Total = 3,328 parameters
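The arithmetic above can be confirmed directly from the layer's tensors:

```python
import torch.nn as nn

conv = nn.Conv1d(17, 64, kernel_size=3)

weights = conv.weight.numel()   # weight tensor shape (64, 17, 3)
biases = conv.bias.numel()      # one bias per output channel
print(weights, biases, weights + biases)   # 3264 64 3328
```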

What the Kernel Learns

The kernel weights are learned during training. Different patterns emerge:

  • [1, 0, -1] → Detects rising/falling edges
  • [0.33, 0.33, 0.33] → Smoothing/averaging
  • [−1, 2, −1] → Detects spikes

64 different kernels learn 64 different patterns!
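These hand-written kernels can be tried directly. This is a sketch on a synthetic spike; kernels learned by training are rarely this clean:

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid 1D cross-correlation (no padding, stride 1)."""
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) for t in range(len(x) - K + 1)])

x = np.zeros(11)
x[5] = 1.0   # a single spike on a flat baseline

edge   = conv1d_valid(x, np.array([1.0, 0.0, -1.0]))    # edge detector
spike  = conv1d_valid(x, np.array([-1.0, 2.0, -1.0]))   # spike detector
smooth = conv1d_valid(x, np.array([1/3, 1/3, 1/3]))     # averaging

print(spike)    # strongest response (2.0) exactly where the spike sits
print(edge)     # -1 just before the spike, +1 just after: an edge pair
print(smooth)   # the spike's energy spread evenly over 3 positions
```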

Output Dimension Formula

$$T_{\text{out}} = \left\lfloor \frac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$

Where:

| Symbol | Meaning | Our Setting |
|---|---|---|
| $T_{\text{in}}$ | Input sequence length | 30 |
| $K$ | Kernel size | 3 |
| $P$ | Padding | 1 (same padding) |
| $S$ | Stride | 1 |
| $T_{\text{out}}$ | Output sequence length | 30 |

With padding $P = 1$ and kernel $K = 3$, we get $T_{\text{out}} = \lfloor(30 + 2 - 3)/1\rfloor + 1 = 30$—the sequence length is preserved.
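The formula is handy as a small helper when experimenting with other settings (a sketch, not code from the model):

```python
def conv_out_len(t_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """T_out = floor((T_in + 2P - K) / S) + 1"""
    return (t_in + 2 * p - k) // s + 1

print(conv_out_len(30, k=3, p=1, s=1))   # 30: "same" padding keeps the length
print(conv_out_len(30, k=3, p=0, s=1))   # 28: no padding loses K - 1 steps
print(conv_out_len(30, k=3, p=1, s=2))   # 15: stride 2 roughly halves it
```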


Convolution for Multivariate Series

Our sensor data has 17 channels. The convolution extends naturally to multiple input channels.

Multi-Channel Input

For input $\mathbf{X} \in \mathbb{R}^{T \times C_{\text{in}}}$, each output channel uses a separate 2D kernel:

$$y_t^{(j)} = \sum_{c=1}^{C_{\text{in}}} \sum_{k=0}^{K-1} w_{c,k}^{(j)} \cdot x_{t+k}^{(c)} + b^{(j)}$$

Where:

  • $j$: Output channel index (1 to $C_{\text{out}}$)
  • $c$: Input channel index (1 to 17)
  • $w_{c,k}^{(j)}$: Weight for input channel $c$, position $k$, output channel $j$

Dimension Transformation

For our first CNN layer:

$$(30, 17) \xrightarrow{\text{Conv1D}} (30, 64)$$

Each of the 64 output channels is computed by a kernel of shape $(17, 3)$ that spans all 17 input channels and 3 timesteps.
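The double sum transcribes directly into code. The NumPy sketch below uses the $(T, C_{\text{in}})$ layout from the equation (PyTorch stores channels first) with random illustrative weights:

```python
import numpy as np

def multi_channel_conv1d(x, w, b):
    """x: (T, C_in), w: (C_out, C_in, K), b: (C_out,) -> (T - K + 1, C_out).
    Each output channel j sums over all input channels c and offsets k."""
    T, C_in = x.shape
    C_out, _, K = w.shape
    y = np.zeros((T - K + 1, C_out))
    for j in range(C_out):
        for t in range(T - K + 1):
            y[t, j] = sum(w[j, c, k] * x[t + k, c]
                          for c in range(C_in) for k in range(K)) + b[j]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 17))       # 30 timesteps, 17 sensors
w = rng.standard_normal((64, 17, 3))    # 64 kernels, each of shape (17, 3)
b = rng.standard_normal(64)

y = multi_channel_conv1d(x, w, b)
print(y.shape)   # (28, 64): no padding, so 30 - 3 + 1 = 28 positions
```

The explicit loops are slow but make the summation order obvious; a real layer fuses all of this into one tensor operation.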

Interactive Multi-Channel Visualization

The visualization below demonstrates how multi-channel convolution works with a stacked architecture (8 → 16 → 8 channels). Click on Conv1 or Conv2 to see the detailed step-by-step computation showing how each output channel sums contributions from all input channels.

Multi-Channel 1D Convolution + ReLU

Two-layer demo architecture: 8 → 16 → 8 channels (with ReLU)

The visualizer feeds 8 sensor channels (T₃₀, P₃₀, Vib, RPM, Flow, Fuel, Exh, Oil) over 6 timesteps through Conv1 + ReLU (producing 16 features × 6 timesteps) and then Conv2 + ReLU (producing 8 features × 6 timesteps). Many entries in the intermediate and output grids are exactly 0.00; these are positions where ReLU clipped a negative pre-activation.

Key Insight: Multi-Channel Convolution

Each output channel is computed by summing contributions from ALL input channels:

y[out_ch, t] = Σ_{in_ch} Σ_{k} W[out_ch, in_ch, k] · x[in_ch, t+k] + bias

📝 Important Clarifications About This Visualization

1. Kernel Weights Are Randomly Initialized

The kernel weights shown (e.g., 0.15, -0.23, 0.08) are randomly generated for demonstration. In practice:

```python
# PyTorch initializes randomly:
self.conv1 = nn.Conv1d(8, 16, kernel_size=3)
# Weight tensor shape: (16, 8, 3) = 384 parameters
```
  • Before training: Random values (Kaiming or Xavier initialization)
  • During training: Backpropagation adjusts weights to minimize loss
  • After training: Weights become meaningful pattern detectors (edges, trends, spikes)

2. Sequence Length is Simplified for Visualization

We show only 6 timesteps (t₀ to t₅) for visual clarity. The actual C-MAPSS model uses different dimensions:

| Aspect | This Demo | Real C-MAPSS |
|---|---|---|
| Sequence length | 6 | 30 |
| Input channels | 8 | 17 |
| Architecture | 8 → 16 → 8 | 17 → 64 → 128 → 64 |

The convolution operation works identically regardless of size—we just use smaller numbers so you can see every calculation clearly.
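The demo stack (8 → 16 → 8) takes only a few lines of PyTorch. Padding of 1 is an assumption here so that the 6-step sequence keeps its length; the visualization does not state its padding:

```python
import torch
import torch.nn as nn

# Assumed padding=1 so each conv preserves the 6-step length
demo = nn.Sequential(
    nn.Conv1d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(16, 8, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 8, 6)   # (batch, 8 sensors, 6 timesteps)
y = demo(x)
print(y.shape)                   # torch.Size([1, 8, 6])
print((y == 0).float().mean())   # fraction of activations clipped to 0 by ReLU
```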


What CNNs Learn

Through training, CNN kernels learn to detect patterns relevant for RUL prediction.

Examples of Learned Patterns

| Pattern Type | Kernel Shape | Detection Mechanism |
|---|---|---|
| Edge (spike) | [−1, 2, −1] | High response at sudden changes |
| Gradient (trend) | [−1, 0, 1] | High response for increasing values |
| Smoothing | [1/3, 1/3, 1/3] | Reduces noise, highlights trends |
| Difference | [1, −1, 0] | Detects changes from previous step |

Hierarchical Feature Extraction

Multiple CNN layers build increasingly abstract features:

  1. Layer 1: Detects simple patterns (edges, gradients) in raw sensor data
  2. Layer 2: Combines simple patterns into compound features (e.g., "spike followed by decay")
  3. Layer 3: Creates high-level representations (degradation signatures)

This hierarchical structure mirrors how degradation manifests: low-level sensor anomalies combine into recognizable degradation patterns.
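A concrete consequence: with kernel size 3 and stride 1, each stacked layer widens the receptive field by $K - 1 = 2$ timesteps, so deeper layers literally see longer patterns (a quick sketch):

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """Timesteps visible to one output of a stack of stride-1 convolutions."""
    return 1 + num_layers * (k - 1)

for n in (1, 2, 3):
    print(f"layer {n}: each feature sees {receptive_field(n, k=3)} timesteps")
# layer 1: 3, layer 2: 5, layer 3: 7
```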


CNN-LSTM Synergy

The CNN and LSTM have complementary strengths:

| Aspect | CNN | LSTM |
|---|---|---|
| Temporal scope | Local (kernel size) | Global (entire sequence) |
| Computation | Parallel (all positions) | Sequential (step by step) |
| What it learns | Pattern detection | Temporal dynamics |
| Invariance | Translation invariant | Position aware |

Division of Labor

In our architecture:

  • CNN: "What patterns exist?" — Detects local features independent of when they occur
  • LSTM: "How do patterns evolve?" — Models the temporal sequence of detected patterns
  • Attention: "Which timesteps matter?" — Focuses on relevant moments for prediction
Design Principle: Let each component do what it does best. CNNs excel at pattern detection; LSTMs excel at sequence modeling. Combining them gives the best of both worlds.
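This division of labor can be sketched as a minimal module. Layer sizes here are illustrative placeholders and attention is omitted; the actual AMNL architecture is specified in later sections:

```python
import torch
import torch.nn as nn

class CNNThenLSTM(nn.Module):
    """Sketch: the CNN detects local patterns, the BiLSTM models their evolution."""

    def __init__(self, n_sensors=17, cnn_ch=64, lstm_hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_sensors, cnn_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(cnn_ch, lstm_hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, sensors)
        z = self.cnn(x.transpose(1, 2))        # -> (batch, cnn_ch, time)
        out, _ = self.lstm(z.transpose(1, 2))  # -> (batch, time, 2 * hidden)
        return out

model = CNNThenLSTM()
y = model(torch.randn(4, 30, 17))
print(y.shape)   # torch.Size([4, 30, 64])
```

The transposes exist because Conv1d wants channels before time while the LSTM (with batch_first=True) wants time before features.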

Summary

In this section, we introduced 1D convolutions for time series:

  1. CNN-first architecture: Extract local features before temporal modeling
  2. 1D convolution: $y_t = \sum_k w_k \cdot x_{t+k} + b$
  3. Multi-channel: Each output channel combines all input channels
  4. Learned patterns: Edges, gradients, complex degradation signatures
  5. CNN-LSTM synergy: Pattern detection + temporal dynamics
| Property | Value |
|---|---|
| Input shape | (30, 17) |
| Output shape | (30, 64) |
| Kernel size | 3 |
| Padding | 1 (same) |
| Stride | 1 |
Stride1
Looking Ahead: A single convolution layer has limited capacity. Our model uses three stacked CNN layers with increasing channel counts (17 → 64 → 128 → 64). The next section details this architecture and explains the channel progression.

With 1D convolution understood, we are ready to design the complete three-layer CNN architecture.