Chapter 5

1D Convolutions for Time Series

CNN Feature Extractor

Learning Objectives

By the end of this section, you will:

  1. Understand the CNN-first architecture and why we extract local features before temporal modeling
  2. Master 1D convolution mathematics for time series data
  3. Apply convolution to multivariate sequences with multiple input channels
  4. Identify local patterns that CNNs learn to detect in sensor data
  5. Appreciate the CNN-LSTM synergy that makes our architecture powerful
Why This Matters: Our AMNL model uses a CNN as its first processing stage, before the BiLSTM. The CNN extracts local patterns from sensor readings—sudden spikes, gradual trends, noise artifacts—that the LSTM then integrates over time. Understanding this division of labor is key to understanding the architecture.

Why CNN Before LSTM?

A natural question: why not feed raw sensor data directly to the LSTM? The CNN-first approach offers several advantages.

Advantage 1: Local Feature Extraction

Sensor readings contain local patterns spanning a few timesteps:

  • Spikes: Sudden temperature increases (2-5 cycles)
  • Gradients: Rate of pressure change (3-7 cycles)
  • Oscillations: Cyclic patterns in vibration (5-10 cycles)

CNNs excel at detecting these patterns through learned filters.

Advantage 2: Dimensionality Reduction

The CNN compresses 17 raw features into richer representations:

$$(T, 17) \xrightarrow{\text{CNN}} (T', 64)$$

where $T' \leq T$ (the sequence may shorten with strided convolutions). The LSTM then processes fewer timesteps of richer features.

Advantage 3: Translation Invariance

A spike at cycle 5 and a spike at cycle 25 should both be recognized as "spike patterns." Convolution provides this naturally—the same filter slides across all positions.
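This invariance is easy to see in a few lines of NumPy. The sketch below uses an illustrative spike-detector kernel (not learned weights) on a synthetic signal: the response to a spike is identical whether it occurs at cycle 5 or cycle 25; only its position shifts.

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid 1D cross-correlation: slide kernel w over signal x."""
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) for t in range(len(x) - K + 1)])

# An illustrative spike-detector kernel (hand-picked, not learned)
kernel = np.array([-1.0, 2.0, -1.0])

# The same spike shape at two different positions
a = np.zeros(30); a[5] = 1.0    # spike at cycle 5
b = np.zeros(30); b[25] = 1.0   # spike at cycle 25

ya, yb = conv1d_valid(a, kernel), conv1d_valid(b, kernel)

# The peak response is identical; only its location shifts
print(ya.max(), yb.max())          # both 2.0
print(ya.argmax(), yb.argmax())    # 4 and 24 (shifted by 20 positions)
```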

Architecture Overview

```text
Input: (batch, 30, 17)   ← Window of 30 timesteps, 17 sensors
         ↓ CNN Block 1
         ↓ CNN Block 2
         ↓ CNN Block 3
Output: (batch, 30, 64)  ← Same timesteps, 64 learned features
```

1D Convolution Operation

1D convolution slides a kernel across the temporal dimension, computing weighted sums at each position.

Mathematical Definition

For input $\mathbf{x} \in \mathbb{R}^T$ and kernel $\mathbf{w} \in \mathbb{R}^K$:

$$y_t = \sum_{k=0}^{K-1} w_k \cdot x_{t+k} + b$$

Where:

  • $K$: Kernel size (receptive field)
  • $w_k$: Learned kernel weights
  • $b$: Bias term
  • $t$: Output position
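The definition translates almost line-for-line into code. A minimal NumPy sketch with illustrative sensor values and a hand-picked smoothing kernel:

```python
import numpy as np

# Direct transcription of y_t = sum_k w_k * x_{t+k} + b  (no padding, stride 1)
def conv1d(x, w, b=0.0):
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) + b for t in range(len(x) - K + 1)])

x = np.array([0.82, 0.91, 0.76, 0.88, 0.95])   # illustrative sensor readings
w = np.array([0.33, 0.34, 0.33])               # illustrative smoothing kernel

y = conv1d(x, w)
print(y)   # y[0] = 0.33*0.82 + 0.34*0.91 + 0.33*0.76, about 0.831

# Deep-learning "convolution" does not flip the kernel, so it matches
# NumPy's cross-correlation rather than np.convolve:
assert np.allclose(y, np.correlate(x, w, mode="valid"))
```

Note that without padding the output has only $T - K + 1$ positions, which is why padded convolutions are used when the sequence length must be preserved.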

Interactive Visualization

Use the interactive visualization below to see how 1D convolution works step by step with real NASA C-MAPSS sensor data. Watch the kernel slide across the input, observe the calculations at each position, and see how the output is built up.

Interactive 1D Convolution Visualizer

Understanding nn.Conv1d(input_size, 64, kernel_size=3, padding=1)

What does each argument in this declaration mean?

  • input_size: number of input channels (17 sensors in C-MAPSS)
  • 64: number of output channels (64 learned feature detectors)
  • kernel_size=3: each window spans 3 consecutive timesteps
  • padding=1: zeros added at the boundaries to preserve sequence length
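A quick standalone shape check of this layer (the batch size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

# The layer discussed above, with input_size = 17 for C-MAPSS
conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)

# nn.Conv1d expects (batch, channels, time): sensors are channels,
# so a 30-cycle window of 17 sensors is shaped (batch, 17, 30)
x = torch.randn(8, 17, 30)
y = conv(x)
print(y.shape)   # torch.Size([8, 64, 30]); padding=1 preserves the length
```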
Data: NASA C-MAPSS FD001 - T30 (total temperature at HPC outlet). The visualizer pads the 8-value input with zeros and slides a size-3 kernel (w = [0.33, 0.34, 0.33]) across it one position at a time. For example, at position 0:

y₀ = 0.33 × 0.00 + 0.34 × 0.82 + 0.33 × 0.91 = 0.000 + 0.279 + 0.300 = 0.579


Parameter Count

For Conv1d(17, 64, kernel_size=3):

Weights = 64 × 17 × 3 = 3,264

Biases = 64

Total = 3,328 parameters
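The arithmetic above can be confirmed directly from the layer's tensors:

```python
import torch.nn as nn

conv = nn.Conv1d(17, 64, kernel_size=3)

weights = conv.weight.numel()   # weight tensor shape (64, 17, 3)
biases = conv.bias.numel()      # one bias per output channel
print(weights, biases, weights + biases)   # 3264 64 3328
```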

What the Kernel Learns

The kernel weights are learned during training. Different patterns emerge:

  • [1, 0, -1] → Detects rising/falling edges
  • [0.33, 0.33, 0.33] → Smoothing/averaging
  • [−1, 2, −1] → Detects spikes

64 different kernels learn 64 different patterns!
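These hand-written kernels can be tried directly. This is a sketch on a synthetic spike; kernels learned by training are rarely this clean:

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid 1D cross-correlation (no padding, stride 1)."""
    K = len(w)
    return np.array([np.dot(w, x[t:t + K]) for t in range(len(x) - K + 1)])

x = np.zeros(11)
x[5] = 1.0   # a single spike on a flat baseline

edge   = conv1d_valid(x, np.array([1.0, 0.0, -1.0]))    # edge detector
spike  = conv1d_valid(x, np.array([-1.0, 2.0, -1.0]))   # spike detector
smooth = conv1d_valid(x, np.array([1/3, 1/3, 1/3]))     # averaging

print(spike)    # strongest response (2.0) exactly where the spike sits
print(edge)     # -1 just before the spike, +1 just after: an edge pair
print(smooth)   # the spike's energy spread evenly over 3 positions
```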

Output Dimension Formula

$$T_{\text{out}} = \left\lfloor \frac{T_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$

Where:

| Symbol | Meaning | Our Setting |
|---|---|---|
| $T_{\text{in}}$ | Input sequence length | 30 |
| $K$ | Kernel size | 3 |
| $P$ | Padding | 1 (same padding) |
| $S$ | Stride | 1 |
| $T_{\text{out}}$ | Output sequence length | 30 |

With padding $P = 1$ and kernel $K = 3$, we get $T_{\text{out}} = \lfloor(30 + 2 - 3)/1\rfloor + 1 = 30$—the sequence length is preserved.
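The formula is handy as a small helper when experimenting with other settings (a sketch, not code from the model):

```python
def conv_out_len(t_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """T_out = floor((T_in + 2P - K) / S) + 1"""
    return (t_in + 2 * p - k) // s + 1

print(conv_out_len(30, k=3, p=1, s=1))   # 30: "same" padding keeps the length
print(conv_out_len(30, k=3, p=0, s=1))   # 28: no padding loses K - 1 steps
print(conv_out_len(30, k=3, p=1, s=2))   # 15: stride 2 roughly halves it
```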


Convolution for Multivariate Series

Our sensor data has 17 channels. The convolution extends naturally to multiple input channels.

Multi-Channel Input

For input $\mathbf{X} \in \mathbb{R}^{T \times C_{\text{in}}}$, each output channel uses a separate 2D kernel:

$$y_t^{(j)} = \sum_{c=1}^{C_{\text{in}}} \sum_{k=0}^{K-1} w_{c,k}^{(j)} \cdot x_{t+k}^{(c)} + b^{(j)}$$

Where:

  • $j$: Output channel index (1 to $C_{\text{out}}$)
  • $c$: Input channel index (1 to 17)
  • $w_{c,k}^{(j)}$: Weight for input channel $c$, position $k$, output channel $j$

Dimension Transformation

For our first CNN layer:

$$(30, 17) \xrightarrow{\text{Conv1D}} (30, 64)$$

Each of the 64 output channels is computed by a kernel of shape $(17, 3)$ that spans all 17 input channels and 3 timesteps.
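The double sum transcribes directly into code. The NumPy sketch below uses the $(T, C_{\text{in}})$ layout from the equation (PyTorch stores channels first) with random illustrative weights:

```python
import numpy as np

def multi_channel_conv1d(x, w, b):
    """x: (T, C_in), w: (C_out, C_in, K), b: (C_out,) -> (T - K + 1, C_out).
    Each output channel j sums over all input channels c and offsets k."""
    T, C_in = x.shape
    C_out, _, K = w.shape
    y = np.zeros((T - K + 1, C_out))
    for j in range(C_out):
        for t in range(T - K + 1):
            y[t, j] = sum(w[j, c, k] * x[t + k, c]
                          for c in range(C_in) for k in range(K)) + b[j]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 17))       # 30 timesteps, 17 sensors
w = rng.standard_normal((64, 17, 3))    # 64 kernels, each of shape (17, 3)
b = rng.standard_normal(64)

y = multi_channel_conv1d(x, w, b)
print(y.shape)   # (28, 64): no padding, so 30 - 3 + 1 = 28 positions
```

The explicit loops are slow but make the summation order obvious; a real layer fuses all of this into one tensor operation.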

Interactive Multi-Channel Visualization

The visualization below demonstrates how multi-channel convolution works with a stacked architecture (8 → 16 → 8 channels). Click on Conv1 or Conv2 to see the detailed step-by-step computation showing how each output channel sums contributions from all input channels.

Multi-Channel 1D Convolution + ReLU

Two-layer demo architecture: 8 → 16 → 8 channels (with ReLU)

The visualizer feeds 8 sensor channels (T₃₀, P₃₀, Vib, RPM, Flow, Fuel, Exh, Oil) over 6 timesteps through Conv1 + ReLU (producing 16 features × 6 timesteps) and then Conv2 + ReLU (producing 8 features × 6 timesteps). Many entries in the intermediate and output grids are exactly 0.00; these are positions where ReLU clipped a negative pre-activation.

Key Insight: Multi-Channel Convolution

Each output channel is computed by summing contributions from ALL input channels:

y[out_ch, t] = Σ_{in_ch} Σ_{k} W[out_ch, in_ch, k] · x[in_ch, t+k] + bias

📝 Important Clarifications About This Visualization

1. Kernel Weights Are Randomly Initialized

The kernel weights shown (e.g., 0.15, -0.23, 0.08) are randomly generated for demonstration. In practice:

```python
# PyTorch initializes randomly:
self.conv1 = nn.Conv1d(8, 16, kernel_size=3)
# Weight tensor shape: (16, 8, 3) = 384 parameters
```
  • Before training: Random values (Kaiming or Xavier initialization)
  • During training: Backpropagation adjusts weights to minimize loss
  • After training: Weights become meaningful pattern detectors (edges, trends, spikes)

2. Sequence Length is Simplified for Visualization

We show only 6 timesteps (t₀ to t₅) for visual clarity. The actual C-MAPSS model uses different dimensions:

| Aspect | This Demo | Real C-MAPSS |
|---|---|---|
| Sequence length | 6 | 30 |
| Input channels | 8 | 17 |
| Architecture | 8 → 16 → 8 | 17 → 64 → 128 → 64 |

The convolution operation works identically regardless of size—we just use smaller numbers so you can see every calculation clearly.
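The demo stack (8 → 16 → 8) takes only a few lines of PyTorch. Padding of 1 is an assumption here so that the 6-step sequence keeps its length; the visualization does not state its padding:

```python
import torch
import torch.nn as nn

# Assumed padding=1 so each conv preserves the 6-step length
demo = nn.Sequential(
    nn.Conv1d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(16, 8, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 8, 6)   # (batch, 8 sensors, 6 timesteps)
y = demo(x)
print(y.shape)                   # torch.Size([1, 8, 6])
print((y == 0).float().mean())   # fraction of activations clipped to 0 by ReLU
```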


What CNNs Learn

Through training, CNN kernels learn to detect patterns relevant for RUL prediction.

Examples of Learned Patterns

| Pattern Type | Kernel Shape | Detection Mechanism |
|---|---|---|
| Edge (spike) | [−1, 2, −1] | High response at sudden changes |
| Gradient (trend) | [−1, 0, 1] | High response for increasing values |
| Smoothing | [1/3, 1/3, 1/3] | Reduces noise, highlights trends |
| Difference | [1, −1, 0] | Detects changes from previous step |

Hierarchical Feature Extraction

Multiple CNN layers build increasingly abstract features:

  1. Layer 1: Detects simple patterns (edges, gradients) in raw sensor data
  2. Layer 2: Combines simple patterns into compound features (e.g., "spike followed by decay")
  3. Layer 3: Creates high-level representations (degradation signatures)

This hierarchical structure mirrors how degradation manifests: low-level sensor anomalies combine into recognizable degradation patterns.
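A concrete consequence: with kernel size 3 and stride 1, each stacked layer widens the receptive field by $K - 1 = 2$ timesteps, so deeper layers literally see longer patterns (a quick sketch):

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """Timesteps visible to one output of a stack of stride-1 convolutions."""
    return 1 + num_layers * (k - 1)

for n in (1, 2, 3):
    print(f"layer {n}: each feature sees {receptive_field(n, k=3)} timesteps")
# layer 1: 3, layer 2: 5, layer 3: 7
```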


CNN-LSTM Synergy

The CNN and LSTM have complementary strengths:

| Aspect | CNN | LSTM |
|---|---|---|
| Temporal scope | Local (kernel size) | Global (entire sequence) |
| Computation | Parallel (all positions) | Sequential (step by step) |
| What it learns | Pattern detection | Temporal dynamics |
| Invariance | Translation invariant | Position aware |

Division of Labor

In our architecture:

  • CNN: "What patterns exist?" — Detects local features independent of when they occur
  • LSTM: "How do patterns evolve?" — Models the temporal sequence of detected patterns
  • Attention: "Which timesteps matter?" — Focuses on relevant moments for prediction
Design Principle: Let each component do what it does best. CNNs excel at pattern detection; LSTMs excel at sequence modeling. Combining them gives the best of both worlds.
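This division of labor can be sketched as a minimal module. Layer sizes here are illustrative placeholders and attention is omitted; the actual AMNL architecture is specified in later sections:

```python
import torch
import torch.nn as nn

class CNNThenLSTM(nn.Module):
    """Sketch: the CNN detects local patterns, the BiLSTM models their evolution."""

    def __init__(self, n_sensors=17, cnn_ch=64, lstm_hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_sensors, cnn_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(cnn_ch, lstm_hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, sensors)
        z = self.cnn(x.transpose(1, 2))        # -> (batch, cnn_ch, time)
        out, _ = self.lstm(z.transpose(1, 2))  # -> (batch, time, 2 * hidden)
        return out

model = CNNThenLSTM()
y = model(torch.randn(4, 30, 17))
print(y.shape)   # torch.Size([4, 30, 64])
```

The transposes exist because Conv1d wants channels before time while the LSTM (with batch_first=True) wants time before features.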

Summary

In this section, we introduced 1D convolutions for time series:

  1. CNN-first architecture: Extract local features before temporal modeling
  2. 1D convolution: $y_t = \sum_k w_k \cdot x_{t+k} + b$
  3. Multi-channel: Each output channel combines all input channels
  4. Learned patterns: Edges, gradients, complex degradation signatures
  5. CNN-LSTM synergy: Pattern detection + temporal dynamics
| Property | Value |
|---|---|
| Input shape | (30, 17) |
| Output shape | (30, 64) |
| Kernel size | 3 |
| Padding | 1 (same) |
| Stride | 1 |
Stride1
Looking Ahead: A single convolution layer has limited capacity. Our model uses three stacked CNN layers with increasing channel counts (17 → 64 → 128 → 64). The next section details this architecture and explains the channel progression.

With 1D convolution understood, we are ready to design the complete three-layer CNN architecture.