Learning Objectives
By the end of this section, you will:
- Understand convolution as a mathematical operation that extracts local patterns from sequences
- Derive the 1D convolution formula and know how it applies to time series data
- Grasp the role of kernels as learned pattern detectors that slide across the input
- Calculate output dimensions given input size, kernel size, padding, and stride
- Extend to multi-channel convolution for processing multivariate time series
- Connect to our architecture: the three-layer CNN that transforms 17 features to 64 learned representations
Why This Matters: Convolution is the workhorse of modern deep learning. In our architecture, CNN layers are the first processing stage—they transform raw sensor readings into learned features that capture local temporal patterns. Understanding convolution mathematically is essential for designing and debugging these networks.
What is Convolution?
Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other.
Historical Origins
Convolution originated in 18th century mathematics and became central to signal processing in the 20th century. The key insight is that many physical systems can be characterized by their impulse response—convolving this response with an input signal predicts the output.
Continuous Convolution
For continuous functions $f$ and $g$, their convolution is:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

This integral "slides" a flipped copy of $g$ across $f$, multiplying and summing at each position.
Discrete Convolution
For digital signals and neural networks, we use the discrete version:

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$

In practice, our sequences and kernels are finite, so the sum has finite bounds.
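As a minimal sketch, NumPy's `np.convolve` implements exactly this finite sum (it flips the kernel, matching the textbook definition):

```python
import numpy as np

# Discrete convolution of two finite signals:
# (f * g)[n] = sum over m of f[m] * g[n - m]
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

full = np.convolve(f, g, mode="full")  # every position with any overlap
print(full)  # values: [0, 1, 2.5, 4, 1.5]
```

`mode="same"` and `mode="valid"` trim the result to the input length or to fully overlapping positions, respectively, mirroring the padding choices discussed later.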
1D Convolution for Time Series
For time series data, we use 1D convolution where a kernel slides along the temporal dimension.
The 1D Convolution Formula
Given an input sequence $x \in \mathbb{R}^{T}$ and a kernel $w$ of size $K$, the convolution output at position $t$ is:

$$y[t] = \sum_{k=0}^{K-1} w[k]\, x[t + k] + b$$

Where:
- $y[t]$ is the output at position $t$
- $w[k]$ are the kernel weights (learned parameters)
- $x[t + k]$ is the input at the corresponding position
- $b$ is the bias term
- $K$ is the kernel size

Note that deep learning frameworks compute this sum without flipping the kernel (technically cross-correlation), but the term "convolution" is standard; since the kernel is learned, the distinction has no practical effect.
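The formula translates directly into code. A minimal NumPy sketch (no padding, stride 1):

```python
import numpy as np

def conv1d_valid(x, w, b=0.0):
    """1D convolution as used in CNNs (no kernel flip), no padding, stride 1:
    y[t] = sum over k of w[k] * x[t + k], plus bias b."""
    T, K = len(x), len(w)
    return np.array([np.dot(w, x[t:t + K]) + b for t in range(T - K + 1)])

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
w = np.array([1.0, 0.0, -1.0])     # a derivative-style kernel
print(conv1d_valid(x, w))          # [-3. -5. -7.]
```

Each output is a dot product between the kernel and one window of the input; with $T = 5$ and $K = 3$ there are $T - K + 1 = 3$ valid window positions.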
Kernels and Learned Filters
In traditional signal processing, kernels are hand-designed for specific purposes. In deep learning, kernels are learned from data to extract task-relevant features.
Common Hand-Designed Kernels
| Kernel | Values | Purpose |
|---|---|---|
| Identity | [0, 1, 0] | Pass-through (no effect) |
| Blur/Average | [1/3, 1/3, 1/3] | Smooth signal, reduce noise |
| Derivative | [1, 0, -1] | Detect rate of change |
| Laplacian | [1, -2, 1] | Detect acceleration/curvature |
| Edge detection | [-1, 2, -1] | Highlight sharp transitions |
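To see these kernels in action, we can slide them across a simple step signal (cross-correlation, as in CNNs); the signal here is illustrative:

```python
import numpy as np

signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # a step at t = 3

kernels = {
    "blur":       np.array([1/3, 1/3, 1/3]),
    "derivative": np.array([1.0, 0.0, -1.0]),
    "laplacian":  np.array([1.0, -2.0, 1.0]),
}

for name, k in kernels.items():
    # Slide the 3-tap kernel over every valid window position.
    out = np.array([np.dot(k, signal[t:t + 3])
                    for t in range(len(signal) - 2)])
    print(f"{name:10s} {out}")
```

The blur kernel ramps smoothly through the step, the derivative kernel responds only where the signal changes, and the Laplacian spikes with opposite signs around the transition.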
Learned Kernels in Neural Networks
In our CNN, each kernel starts with random values and is optimized via backpropagation to minimize the loss function. After training, kernels evolve into task-specific feature detectors.
For RUL prediction, learned kernels might detect:
- Temperature spikes indicating overheating
- Pressure drops suggesting seal wear
- Vibration patterns indicating mechanical loosening
- Multi-sensor correlation changes as degradation progresses
The Power of Learning
Rather than committing to hand-designed filters like those above, the network learns whichever kernels best reduce its prediction error, often discovering detectors that blend smoothing, differencing, and edge detection in ways no engineer would specify by hand.
Padding, Stride, and Output Dimensions
Three hyperparameters control how convolution is applied:
Padding
Padding adds values (typically zeros) to the boundaries of the input sequence. With padding $P$ on each side, the output length at stride 1 is:

$$L_{out} = L_{in} + 2P - K + 1$$

"Same" padding preserves input length:

$$P = \frac{K - 1}{2} \quad \text{(for odd } K\text{)}$$

For $K = 3$, $P = 1$: add one zero on each end.
Stride
Stride is how many positions the kernel moves between each computation. Stride 1 means slide by one position; stride 2 means skip every other position.
Output Dimension Formula
Given input length $L_{in}$, kernel size $K$, padding $P$, and stride $S$:

$$L_{out} = \left\lfloor \frac{L_{in} + 2P - K}{S} \right\rfloor + 1$$
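The formula is easy to check in code; the example values below use our configuration ($L_{in} = 30$, $K = 3$):

```python
def conv_out_len(L_in, K, P=0, S=1):
    """Output length of a 1D convolution: floor((L_in + 2P - K) / S) + 1."""
    return (L_in + 2 * P - K) // S + 1

print(conv_out_len(30, 3, P=1, S=1))  # 30: "same" padding preserves length
print(conv_out_len(30, 3, P=0, S=1))  # 28: no padding shrinks by K - 1
print(conv_out_len(30, 3, P=1, S=2))  # 15: stride 2 roughly halves the length
```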
Multi-Channel Convolution
Our input has 17 channels (features). Multi-channel convolution processes all channels simultaneously.
Multi-Input, Single-Output
For an input with $C_{in}$ channels, each output position aggregates across all input channels:

$$y[t] = \sum_{c=1}^{C_{in}} \sum_{k=0}^{K-1} w_c[k]\, x_c[t + k] + b$$

The kernel now has shape $(C_{in}, K)$: it has $K$ weights for each input channel.
Multi-Input, Multi-Output
To produce $C_{out}$ output channels, we use $C_{out}$ separate kernels:

$$y_j[t] = \sum_{c=1}^{C_{in}} \sum_{k=0}^{K-1} w_{j,c}[k]\, x_c[t + k] + b_j, \qquad j = 1, \dots, C_{out}$$

The full kernel tensor has shape $(C_{out}, C_{in}, K)$.
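A direct NumPy sketch of multi-input, multi-output convolution with "same" padding, using our layer sizes (random data stands in for real sensor readings):

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, K, T = 17, 64, 3, 30

x = rng.standard_normal((C_in, T))         # input: (channels, time)
W = rng.standard_normal((C_out, C_in, K))  # full kernel tensor
b = np.zeros(C_out)

P = 1                                      # "same" padding for K = 3
x_pad = np.pad(x, ((0, 0), (P, P)))        # zero-pad the time axis only

# y[j, t] = sum over c, k of W[j, c, k] * x_pad[c, t + k] + b[j]
T_out = T + 2 * P - K + 1
y = np.empty((C_out, T_out))
for t in range(T_out):
    y[:, t] = np.einsum("jck,ck->j", W, x_pad[:, t:t + K]) + b

print(y.shape)  # (64, 30): same length, 64 output channels
```

The `einsum` contracts over both the channel and kernel axes at once, which is exactly the double sum in the formula above.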
Our CNN Configuration
The CNN in our AMNL model consists of three convolutional layers with batch normalization and ReLU activation.
Layer-by-Layer Specification
| Layer | C_in | C_out | Kernel | Padding | Parameters |
|---|---|---|---|---|---|
| Conv1 | 17 | 64 | 3 | 1 | 3,328 |
| BatchNorm1 | - | 64 | - | - | 128 |
| Conv2 | 64 | 128 | 3 | 1 | 24,704 |
| BatchNorm2 | - | 128 | - | - | 256 |
| Conv3 | 128 | 64 | 3 | 1 | 24,640 |
| BatchNorm3 | - | 64 | - | - | 128 |
| Total | - | - | - | - | ~53K |
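The table's parameter counts follow from two small formulas: a Conv1d layer has $C_{in} \cdot C_{out} \cdot K$ weights plus $C_{out}$ biases, and a BatchNorm layer has a learnable scale and shift per channel:

```python
def conv1d_params(c_in, c_out, k):
    return c_in * c_out * k + c_out  # weights + biases

def batchnorm_params(c):
    return 2 * c                     # scale (gamma) + shift (beta) per channel

layers = [
    ("Conv1",      conv1d_params(17, 64, 3)),
    ("BatchNorm1", batchnorm_params(64)),
    ("Conv2",      conv1d_params(64, 128, 3)),
    ("BatchNorm2", batchnorm_params(128)),
    ("Conv3",      conv1d_params(128, 64, 3)),
    ("BatchNorm3", batchnorm_params(64)),
]
for name, p in layers:
    print(f"{name:12s} {p:6d}")
print("Total:", sum(p for _, p in layers))  # 53184, i.e. ~53K
```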
Tensor Flow Through CNN
```
Input:  X ∈ ℝ^(30 × 17)     # (time, features)
        ↓ transpose
        X' ∈ ℝ^(17 × 30)    # (channels, time) for Conv1d

Conv1:  H₁ ∈ ℝ^(64 × 30)    # 64 feature maps, same length
        + BatchNorm + ReLU + Dropout(0.15)

Conv2:  H₂ ∈ ℝ^(128 × 30)   # expand to 128 channels
        + BatchNorm + ReLU + Dropout(0.15)

Conv3:  H₃ ∈ ℝ^(64 × 30)    # compress back to 64 channels
        + BatchNorm + ReLU
        ↓ transpose
Output: H ∈ ℝ^(30 × 64)     # ready for BiLSTM
```

Why This Channel Progression?
- 17 → 64: Expand from raw sensors to richer learned representation
- 64 → 128: Further expand to capture complex combinations
- 128 → 64: Compress to reduce parameters for BiLSTM while retaining key features
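Assuming a PyTorch implementation with `nn.Conv1d` (the layer arrangement here is a sketch of the specification above, not the definitive AMNL code), the whole block can be written as:

```python
import torch
import torch.nn as nn

# Three-layer 1D CNN: 17 -> 64 -> 128 -> 64 channels, kernel 3, padding 1.
cnn = nn.Sequential(
    nn.Conv1d(17, 64, kernel_size=3, padding=1),
    nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.15),
    nn.Conv1d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.15),
    nn.Conv1d(128, 64, kernel_size=3, padding=1),
    nn.BatchNorm1d(64), nn.ReLU(),
)

x = torch.randn(8, 30, 17)       # (batch, time, features)
h = cnn(x.transpose(1, 2))       # Conv1d expects (batch, channels, time)
h = h.transpose(1, 2)            # back to (batch, time, channels)
print(h.shape)                   # torch.Size([8, 30, 64])
```

With `padding=1` every layer preserves the 30-step length, so only the channel dimension changes through the stack.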
The Bottleneck Pattern
The 64 → 128 → 64 pattern is a bottleneck design. The middle layer has high capacity to learn complex patterns, while the output layer compresses to a more efficient representation. This is analogous to autoencoder bottlenecks.
Why Convolution Works for Local Patterns
Convolution has several properties that make it ideal for extracting local patterns from time series:
1. Local Connectivity
Each output depends only on a local window of the input (kernel size $K = 3$ means 3 consecutive timesteps). This matches the physical reality that short-term sensor dynamics carry diagnostic information.
2. Parameter Sharing
The same kernel is applied at every position. A pattern learned at the beginning of the sequence can be detected at the end. This drastically reduces parameters compared to fully connected layers.
Compared with a comparable fully connected layer, convolution uses roughly 10× fewer parameters while capturing the same local relationships.
3. Translation Equivariance
If a pattern shifts in time, its detection shifts correspondingly:

$$\mathrm{Conv}(\mathrm{shift}_{\Delta}(x)) = \mathrm{shift}_{\Delta}(\mathrm{Conv}(x))$$

A temperature spike at cycle 50 produces the same activation pattern as one at cycle 150, just shifted. This is crucial for generalization.
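Translation equivariance is easy to verify numerically. In this sketch, a spike well away from the sequence boundaries is shifted, and the convolution output shifts by exactly the same amount:

```python
import numpy as np

def conv_same(x, w):
    """Stride-1 'same' convolution (no kernel flip) with zero padding."""
    K = len(w)
    x_pad = np.pad(x, (K // 2, K // 2))
    return np.array([np.dot(w, x_pad[t:t + K]) for t in range(len(x))])

w = np.array([1.0, -2.0, 1.0])     # curvature-detecting kernel
x = np.zeros(20)
x[5] = 1.0                         # a spike at t = 5
x_shift = np.roll(x, 7)            # the same spike at t = 12

y, y_shift = conv_same(x, w), conv_same(x_shift, w)
print(np.allclose(np.roll(y, 7), y_shift))  # True: detection shifted too
```

Near the sequence boundaries zero padding breaks exact equivariance, which is one reason boundary effects matter when windows are short.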
4. Hierarchical Feature Learning
Stacked convolutional layers create a hierarchy:
- Layer 1: Detects simple patterns (edges, trends) across 3 timesteps
- Layer 2: Combines simple patterns into complex ones (effective receptive field: 5 timesteps)
- Layer 3: Detects high-level features (effective receptive field: 7 timesteps)
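The receptive-field numbers above follow a simple rule: each stacked stride-1 layer with kernel size $K$ widens the window one output can "see" by $K - 1$ timesteps. A small check:

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1   # each layer adds (K - 1) timesteps
    return rf

for n in range(1, 4):
    print(f"after layer {n}: {receptive_field([3] * n)} timesteps")
# after layer 1: 3, after layer 2: 5, after layer 3: 7
```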
Summary
In this section, we explored the mathematics of convolution for sequence processing:
- Convolution slides a kernel across the input, computing weighted sums: $y[t] = \sum_{k=0}^{K-1} w[k]\, x[t+k] + b$
- 1D convolution for time series extracts local temporal patterns with learned kernels
- Padding and stride control output dimensions: $L_{out} = \lfloor (L_{in} + 2P - K)/S \rfloor + 1$
- Multi-channel convolution processes all 17 sensor channels simultaneously
- Our CNN uses three layers: 17 → 64 → 128 → 64 channels with ~53K parameters
- Key properties: local connectivity, parameter sharing, translation equivariance, hierarchical learning
| Property | Value | Significance |
|---|---|---|
| Kernel size | 3 | Captures 3-timestep local patterns |
| Padding | 1 (same) | Preserves sequence length at 30 |
| Receptive field | 7 | After 3 layers, sees 7 timesteps |
| Parameter reduction | 10× | vs fully connected layers |
| Output shape | (30, 64) | Ready for BiLSTM input |
Looking Ahead: Convolution captures local patterns, but cannot model dependencies across the full 30-timestep window. In the next section, we introduce Recurrent Neural Networks—the mathematical framework for processing sequences where each output depends on the entire preceding history.
With convolution understood, we can now tackle the more complex mathematics of recurrent processing.