Learning Objectives
By the end of this section, you will:
- Understand the CNN-first architecture and why we extract local features before temporal modeling
- Master 1D convolution mathematics for time series data
- Apply convolution to multivariate sequences with multiple input channels
- Identify local patterns that CNNs learn to detect in sensor data
- Appreciate the CNN-LSTM synergy that makes our architecture powerful
Why This Matters: Our AMNL model uses a CNN as its first processing stage, before the BiLSTM. The CNN extracts local patterns from sensor readings—sudden spikes, gradual trends, noise artifacts—that the LSTM then integrates over time. Understanding this division of labor is key to understanding the architecture.
Why CNN Before LSTM?
A natural question: why not feed raw sensor data directly to the LSTM? The CNN-first approach offers several advantages.
Advantage 1: Local Feature Extraction
Sensor readings contain local patterns spanning a few timesteps:
- Spikes: Sudden temperature increases (2-5 cycles)
- Gradients: Rate of pressure change (3-7 cycles)
- Oscillations: Cyclic patterns in vibration (5-10 cycles)
CNNs excel at detecting these patterns through learned filters.
Advantage 2: Dimensionality Reduction
The CNN transforms the 17 raw sensor channels into richer learned representations:

(batch, T, 17) → CNN → (batch, T′, 64), where T′ ≤ T (the sequence may shorten with strided convolutions). The LSTM then processes these more informative features.
Advantage 3: Translation Invariance
A spike at cycle 5 and a spike at cycle 25 should both be recognized as "spike patterns." Convolution provides this naturally—the same filter slides across all positions.
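This invariance is easy to check numerically. The sketch below is plain NumPy with a hand-picked spike-detector kernel (an assumption for the demo, not a trained filter); it applies the same kernel to a spike at cycle 5 and at cycle 25:

```python
import numpy as np

def conv1d(x, w, b=0.0):
    """Valid 1D convolution (cross-correlation, as in nn.Conv1d)."""
    K = len(w)
    return np.array([np.dot(w, x[t:t+K]) + b for t in range(len(x) - K + 1)])

spike_kernel = np.array([-1.0, 2.0, -1.0])  # hand-crafted spike detector

early = np.zeros(30); early[5] = 1.0    # spike at cycle 5
late = np.zeros(30); late[25] = 1.0     # spike at cycle 25

r1, r2 = conv1d(early, spike_kernel), conv1d(late, spike_kernel)
# The same filter fires with the same strength wherever the spike occurs
print(r1.max(), r2.max())  # 2.0 2.0
```

Only the position of the peak response differs; its magnitude is identical, which is exactly what "translation invariance" means here.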
Architecture Overview
```
Input: (batch, 30, 17)  ← Window of 30 timesteps, 17 sensors
         ↓
    CNN Block 1
         ↓
    CNN Block 2
         ↓
    CNN Block 3
         ↓
Output: (batch, 30, 64) ← Same timesteps, 64 learned features
```
1D Convolution Operation
1D convolution slides a kernel across the temporal dimension, computing weighted sums at each position.
Mathematical Definition
For input x ∈ ℝ^T and kernel w ∈ ℝ^K:

y_t = Σ_{k=0}^{K−1} w_k · x_{t+k} + b

Where:
- K: Kernel size (receptive field)
- w_k: Learned kernel weights
- b: Bias term
- t: Output position
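As a concrete illustration, here is a minimal NumPy implementation of this equation with zero padding P = 1 so the output keeps the input length (the toy sensor values are invented for the example):

```python
import numpy as np

def conv1d_same(x, w, b=0.0):
    """1D convolution with zero padding P=(K-1)//2 so output length equals input length (S=1)."""
    K = len(w)
    P = (K - 1) // 2
    xp = np.pad(x, P)  # zeros at both boundaries
    return np.array([np.dot(w, xp[t:t+K]) + b for t in range(len(x))])

x = np.array([20.0, 21.0, 23.0, 26.0, 26.5, 27.0])  # toy sensor trace
y = conv1d_same(x, np.array([-1.0, 0.0, 1.0]))       # gradient (trend) kernel
print(len(x), len(y))  # 6 6: padding preserves sequence length
```

Note that each interior output is the difference x_{t+1} − x_{t−1}, i.e., a local rate-of-change estimate, which is the "gradient" pattern discussed later in this section.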
Interactive Visualization
[Interactive 1D Convolution Visualizer: steps through 1D convolution on real NASA C-MAPSS sensor data, showing the kernel sliding across the input, the calculation at each position, and the output being built up.]
Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)
What happens when we declare this line?
- input_size = Number of input channels (17 sensors in C-MAPSS)
- 64 = Output channels (64 learned feature detectors)
- kernel_size=3 = Each window looks at 3 consecutive timesteps
- padding=1 = Zeros added at the boundaries to preserve sequence length

1D Convolution Equation
y_t = Σ_{k=0}^{K−1} w_k · x_{t+k} + b

- K = kernel size (3 in our case)
- w_k = learned weights
- b = bias term
- t = output position
Output Dimension Formula
T_out = ⌊(T_in + 2P − K) / S⌋ + 1

With T_in = 8, P = 1, K = 3, S = 1:

T_out = ⌊(8 + 2 − 3) / 1⌋ + 1 = 8
Padding preserves sequence length!
Parameter Count
For Conv1d(17, 64, kernel_size=3):
Weights = 64 × 17 × 3 = 3,264
Biases = 64
Total = 3,328 parameters
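A quick sanity check of this arithmetic:

```python
# Parameter count for Conv1d(in_channels=17, out_channels=64, kernel_size=3)
in_ch, out_ch, K = 17, 64, 3
weights = out_ch * in_ch * K   # one (in_ch x K) kernel per output channel
biases = out_ch                # one bias per output channel
total = weights + biases
print(weights, biases, total)  # 3264 64 3328
```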
What the Kernel Learns
The kernel weights are learned during training. Different patterns emerge:
- [1, 0, -1] → Detects rising/falling edges
- [0.33, 0.33, 0.33] → Smoothing/averaging
- [−1, 2, −1] → Detects spikes
64 different kernels learn 64 different patterns!
Output Dimension Formula

T_out = ⌊(T_in + 2P − K) / S⌋ + 1

Where:
| Symbol | Meaning | Our Setting |
|---|---|---|
| T_in | Input sequence length | 30 |
| K | Kernel size | 3 |
| P | Padding | 1 (same padding) |
| S | Stride | 1 |
| T_out | Output sequence length | 30 |
With padding P = 1 and kernel size K = 3, T_out = ⌊(30 + 2 − 3) / 1⌋ + 1 = 30, so the sequence length is preserved.
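The formula is easy to wrap in a small helper and probe at other settings (the strided case below is hypothetical, not part of our architecture):

```python
def conv_out_len(T_in, K, P=0, S=1):
    """T_out = floor((T_in + 2P - K) / S) + 1"""
    return (T_in + 2 * P - K) // S + 1

print(conv_out_len(30, 3, P=1, S=1))  # 30: "same" padding preserves length
print(conv_out_len(30, 3, P=0, S=1))  # 28: no padding trims K-1 positions
print(conv_out_len(30, 3, P=1, S=2))  # 15: stride 2 roughly halves the length
```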
Convolution for Multivariate Series
Our sensor data has 17 channels. The convolution extends naturally to multiple input channels.
Multi-Channel Input
For input X ∈ ℝ^{C_in × T}, each output channel uses a separate 2D kernel:

y_{j,t} = Σ_{c=1}^{C_in} Σ_{k=0}^{K−1} w_{j,c,k} · x_{c,t+k} + b_j

Where:
- j: Output channel index (1 to C_out)
- c: Input channel index (1 to 17)
- w_{j,c,k}: Weight for input channel c, position k, output channel j
Dimension Transformation
For our first CNN layer:

(batch, 30, 17) → Conv1d(17, 64, kernel_size=3, padding=1) → (batch, 30, 64)

Each of the 64 output channels is computed by a kernel of shape (17, 3) that spans all 17 input channels and 3 timesteps.
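A direct NumPy sketch of the multi-channel equation, used here only to confirm the shapes (the weights are random, not trained ones):

```python
import numpy as np

def conv1d_multi(X, W, b):
    """Multi-channel 1D conv with same padding.
    X: (C_in, T), W: (C_out, C_in, K), b: (C_out,) -> Y: (C_out, T)"""
    C_out, C_in, K = W.shape
    P = (K - 1) // 2
    Xp = np.pad(X, ((0, 0), (P, P)))  # pad only the time axis
    T = X.shape[1]
    Y = np.zeros((C_out, T))
    for j in range(C_out):            # each output channel...
        for t in range(T):            # ...sums over ALL input channels and K timesteps
            Y[j, t] = np.sum(W[j] * Xp[:, t:t+K]) + b[j]
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((17, 30))     # 17 sensors, 30 timesteps
W = rng.standard_normal((64, 17, 3))  # 64 kernels, each of shape (17, 3)
Y = conv1d_multi(X, W, np.zeros(64))
print(Y.shape)  # (64, 30)
```

This is the same computation `nn.Conv1d(17, 64, kernel_size=3, padding=1)` performs, just written out with explicit loops so the channel summation is visible.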
Interactive Multi-Channel Visualization
The visualization below demonstrates how multi-channel convolution works with a stacked architecture (8 → 16 → 8 channels). Click on Conv1 or Conv2 to see the detailed step-by-step computation showing how each output channel sums contributions from all input channels.
Multi-Channel 1D Convolution + ReLU
Understanding how Conv1d processes multiple input channels to produce multiple output channels, with ReLU activation.

Two-Layer CNN Architecture: 8 → 16 → 8 channels (with ReLU)
The demo pipeline: Input (8 sensors × 6 timesteps) → Conv1 + ReLU (16 features × 6 timesteps) → Conv2 + ReLU (8 features × 6 timesteps).
Key Insight: Multi-Channel Convolution
Each output channel is computed by summing contributions from ALL input channels:

y_{j,t} = Σ_{c=1}^{C_in} Σ_{k=0}^{K−1} w_{j,c,k} · x_{c,t+k} + b_j
📝 Important Clarifications About This Visualization
1. Kernel Weights Are Randomly Initialized
The kernel weights shown (e.g., 0.15, -0.23, 0.08) are randomly generated for demonstration. In practice:
```python
self.conv1 = nn.Conv1d(8, 16, kernel_size=3)
# Weight shape: (16, 8, 3) = 384 weights, plus 16 biases
```
- Before training: Random values (Kaiming or Xavier initialization)
- During training: Backpropagation adjusts weights to minimize loss
- After training: Weights become meaningful pattern detectors (edges, trends, spikes)
2. Sequence Length is Simplified for Visualization
We show only 6 timesteps (t₀ to t₅) for visual clarity. The actual C-MAPSS model uses different dimensions:
| Aspect | This Demo | Real C-MAPSS |
|---|---|---|
| Sequence length | 6 | 30 |
| Input channels | 8 | 17 |
| Architecture | 8→16→8 | 17→64→128→64 |
The convolution operation works identically regardless of size—we just use smaller numbers so you can see every calculation clearly.
What CNNs Learn
Through training, CNN kernels learn to detect patterns relevant for RUL prediction.
Examples of Learned Patterns
| Pattern Type | Kernel Shape | Detection Mechanism |
|---|---|---|
| Edge (spike) | [−1, 2, −1] | High response at sudden changes |
| Gradient (trend) | [−1, 0, 1] | High response for increasing values |
| Smoothing | [1/3, 1/3, 1/3] | Reduces noise, highlights trends |
| Difference | [1, −1, 0] | Detects changes from previous step |
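These hand-crafted kernels can be applied to a toy signal (an isolated spike followed by a ramp, values invented for the demo) to see each detection mechanism fire:

```python
import numpy as np

def conv(x, w):
    """Valid 1D convolution with a hand-crafted kernel."""
    K = len(w)
    return np.array([np.dot(w, x[t:t+K]) for t in range(len(x) - K + 1)])

signal = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0])

spike = conv(signal, np.array([-1.0, 2.0, -1.0]))   # peaks at the isolated spike
trend = conv(signal, np.array([-1.0, 0.0, 1.0]))    # large on the rising ramp
smooth = conv(signal, np.full(3, 1/3))              # averages the spike away
print(int(np.argmax(spike)))  # 2 -> output centered on the spike at t=3
```

A trained `Conv1d` layer discovers filters like these on its own; the table above just names the archetypes its 64 kernels tend to converge toward.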
Hierarchical Feature Extraction
Multiple CNN layers build increasingly abstract features:
- Layer 1: Detects simple patterns (edges, gradients) in raw sensor data
- Layer 2: Combines simple patterns into compound features (e.g., "spike followed by decay")
- Layer 3: Creates high-level representations (degradation signatures)
This hierarchical structure mirrors how degradation manifests: low-level sensor anomalies combine into recognizable degradation patterns.
CNN-LSTM Synergy
The CNN and LSTM have complementary strengths:
| Aspect | CNN | LSTM |
|---|---|---|
| Temporal scope | Local (kernel size) | Global (entire sequence) |
| Computation | Parallel (all positions) | Sequential (step by step) |
| What it learns | Pattern detection | Temporal dynamics |
| Invariance | Translation invariant | Position aware |
Division of Labor
In our architecture:
- CNN: "What patterns exist?" — Detects local features independent of when they occur
- LSTM: "How do patterns evolve?" — Models the temporal sequence of detected patterns
- Attention: "Which timesteps matter?" — Focuses on relevant moments for prediction
Design Principle: Let each component do what it does best. CNNs excel at pattern detection; LSTMs excel at sequence modeling. Combining them gives the best of both worlds.
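A minimal PyTorch sketch of this division of labor, with illustrative layer sizes (the hidden size of 50 is an assumption for the demo; the real AMNL model uses three CNN layers, a BiLSTM, and attention):

```python
import torch
import torch.nn as nn

class CNNThenLSTM(nn.Module):
    """Sketch of the CNN-first idea: local pattern detection, then temporal modeling."""
    def __init__(self, n_sensors=17, cnn_out=64, lstm_hidden=50):
        super().__init__()
        self.cnn = nn.Conv1d(n_sensors, cnn_out, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(cnn_out, lstm_hidden, batch_first=True)

    def forward(self, x):                              # x: (batch, T, 17)
        h = torch.relu(self.cnn(x.transpose(1, 2)))    # (batch, 64, T): local patterns
        out, _ = self.lstm(h.transpose(1, 2))          # (batch, T, hidden): temporal dynamics
        return out

x = torch.randn(4, 30, 17)        # batch of 4 windows, 30 timesteps, 17 sensors
y = CNNThenLSTM()(x)
print(y.shape)  # torch.Size([4, 30, 50])
```

Note the transposes: `Conv1d` expects channels-first `(batch, C, T)`, while `LSTM` with `batch_first=True` expects time-first `(batch, T, C)`.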
Summary
In this section, we introduced 1D convolutions for time series:
- CNN-first architecture: Extract local features before temporal modeling
- 1D convolution: y_t = Σ_{k=0}^{K−1} w_k · x_{t+k} + b
- Multi-channel: Each output channel combines all input channels
- Learned patterns: Edges, gradients, complex degradation signatures
- CNN-LSTM synergy: Pattern detection + temporal dynamics
| Property | Value |
|---|---|
| Input shape | (30, 17) |
| Output shape | (30, 64) |
| Kernel size | 3 |
| Padding | 1 (same) |
| Stride | 1 |
Looking Ahead: A single convolution layer has limited capacity. Our model uses three stacked CNN layers with increasing channel counts (17 → 64 → 128 → 64). The next section details this architecture and explains the channel progression.
With 1D convolution understood, we are ready to design the complete three-layer CNN architecture.