Chapter 2

Convolution Operations for Sequences

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand convolution as a mathematical operation that extracts local patterns from sequences
  2. Derive the 1D convolution formula and know how it applies to time series data
  3. Grasp the role of kernels as learned pattern detectors that slide across the input
  4. Calculate output dimensions given input size, kernel size, padding, and stride
  5. Extend to multi-channel convolution for processing multivariate time series
  6. Connect to our architecture: the three-layer CNN that transforms 17 features to 64 learned representations
Why This Matters: Convolution is the workhorse of modern deep learning. In our architecture, CNN layers are the first processing stage—they transform raw sensor readings into learned features that capture local temporal patterns. Understanding convolution mathematically is essential for designing and debugging these networks.

What is Convolution?

Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other.

Historical Origins

Convolution originated in 18th century mathematics and became central to signal processing in the 20th century. The key insight is that many physical systems can be characterized by their impulse response—convolving this response with an input signal predicts the output.

Continuous Convolution

For continuous functions f and g, their convolution is:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \cdot g(t - \tau) \, d\tau

This integral "slides" function g across f, multiplying and summing at each position.

Discrete Convolution

For digital signals and neural networks, we use the discrete version:

(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \cdot g[n - m]

In practice, our sequences and kernels are finite, so the sum has finite bounds.


1D Convolution for Time Series

For time series data, we use 1D convolution where a kernel slides along the temporal dimension.

The 1D Convolution Formula

Given an input sequence \mathbf{x} = [x_1, x_2, \ldots, x_T] and a kernel \mathbf{w} = [w_1, w_2, \ldots, w_K] of size K, the convolution output at position t is:

y_t = \sum_{k=1}^{K} w_k \cdot x_{t+k-1} + b

Where:

  • y_t is the output at position t
  • w_k are the kernel weights (learned parameters)
  • x_{t+k-1} is the input at the corresponding position
  • b is the bias term
  • K is the kernel size

Strictly speaking, this is cross-correlation (the kernel is not flipped), but deep learning frameworks follow this convention and still call it convolution. Since the kernel weights are learned, the distinction makes no practical difference.
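To make the formula concrete, here is a minimal plain-Python sketch of a valid (no-padding) 1D convolution. The helper name `conv1d` and the toy sequence and kernel are illustrative, not part of our model:

```python
def conv1d(x, w, b=0.0):
    """Valid 1D convolution (deep-learning convention, no kernel flip):
    y_t = sum_k w_k * x[t+k-1] + b, for t = 1 .. T-K+1."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K)) + b
            for t in range(len(x) - K + 1)]

# A 5-step toy sequence and a 3-tap kernel that averages neighbors
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.0, 0.5]
print(conv1d(x, w))  # [2.0, 3.0, 4.0]
```

Note the output is shorter than the input by K - 1 positions; the padding discussion below addresses exactly this.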

Kernels and Learned Filters

In traditional signal processing, kernels are hand-designed for specific purposes. In deep learning, kernels are learned from data to extract task-relevant features.

Common Hand-Designed Kernels

| Kernel | Values | Purpose |
| --- | --- | --- |
| Identity | [0, 1, 0] | Pass-through (no effect) |
| Blur/Average | [1/3, 1/3, 1/3] | Smooth signal, reduce noise |
| Derivative | [1, 0, -1] | Detect rate of change |
| Laplacian | [1, -2, 1] | Detect acceleration/curvature |
| Edge detection | [-1, 2, -1] | Highlight sharp transitions |
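These hand-designed kernels can be tried directly. A small sketch (the signals and the `conv1d` helper are made up for illustration; note that with this indexing, [1, 0, -1] computes x[t] - x[t+2], i.e. the slope up to sign and scale):

```python
def conv1d(x, w):
    """Valid 1D convolution, deep-learning convention (no kernel flip)."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K))
            for t in range(len(x) - K + 1)]

ramp = [0, 1, 2, 3, 4, 5]   # constant slope
step = [0, 0, 0, 1, 1, 1]   # one sharp transition

# Derivative kernel: constant response on a ramp (constant slope)
print(conv1d(ramp, [1, 0, -1]))  # [-2, -2, -2, -2]
# Laplacian kernel: zero on a ramp (no curvature), nonzero at the step
print(conv1d(ramp, [1, -2, 1]))  # [0, 0, 0, 0]
print(conv1d(step, [1, -2, 1]))  # [0, 1, -1, 0]
```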

Learned Kernels in Neural Networks

In our CNN, each kernel starts with random values and is optimized via backpropagation to minimize the loss function. After training, kernels evolve into task-specific feature detectors.

For RUL prediction, learned kernels might detect:

  • Temperature spikes indicating overheating
  • Pressure drops suggesting seal wear
  • Vibration patterns indicating mechanical loosening
  • Multi-sensor correlation changes as degradation progresses

The Power of Learning

We do not need to know what patterns indicate degradation—the network discovers this automatically by learning kernels that minimize prediction error on training data.

Padding, Stride, and Output Dimensions

Three hyperparameters control how convolution is applied:

Padding

Padding adds values (typically zeros) to the boundaries of the input sequence. With padding P on each side:

\text{Padded input length} = T + 2P

"Same" padding preserves the input length (for odd K with stride 1):

P_{\text{same}} = \left\lfloor \frac{K - 1}{2} \right\rfloor

For K = 3, P_{\text{same}} = 1: add one zero on each end.

Stride

Stride S is how many positions the kernel moves between each computation. Stride 1 means slide by one position; stride 2 means skip every other position.

Output Dimension Formula

Given input length T, kernel size K, padding P, and stride S:

T_{\text{out}} = \left\lfloor \frac{T + 2P - K}{S} \right\rfloor + 1
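A one-line helper makes this formula easy to sanity-check (the function name `conv_out_len` is illustrative):

```python
def conv_out_len(T, K, P=0, S=1):
    """Output length of a 1D convolution: floor((T + 2P - K) / S) + 1."""
    return (T + 2 * P - K) // S + 1

# "Same" padding for K=3 keeps our length-30 window at 30
print(conv_out_len(T=30, K=3, P=1, S=1))  # 30
# No padding shrinks the sequence by K-1
print(conv_out_len(T=30, K=3, P=0, S=1))  # 28
# Stride 2 roughly halves the length
print(conv_out_len(T=30, K=3, P=1, S=2))  # 15
```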

Multi-Channel Convolution

Our input has 17 channels (features). Multi-channel convolution processes all channels simultaneously.

Multi-Input, Single-Output

For an input with C_{\text{in}} channels, each output position aggregates across all input channels:

y_t = \sum_{c=1}^{C_{\text{in}}} \sum_{k=1}^{K} w_{c,k} \cdot x_{c,\,t+k-1} + b

The kernel now has shape (C_{\text{in}}, K): it has weights for each input channel.

Multi-Input, Multi-Output

To produce C_{\text{out}} output channels, we use C_{\text{out}} separate kernels:

y_{c',t} = \sum_{c=1}^{C_{\text{in}}} \sum_{k=1}^{K} w_{c',c,k} \cdot x_{c,\,t+k-1} + b_{c'}

The full kernel tensor has shape (C_{\text{out}}, C_{\text{in}}, K).
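A naive sketch of multi-input, multi-output convolution with nested Python lists (the sizes and values are toy examples, not our 17-channel input; real frameworks vectorize this):

```python
def conv1d_multi(x, w, b):
    """x: [C_in][T], w: [C_out][C_in][K], b: [C_out] -> y: [C_out][T-K+1]."""
    C_in, T = len(x), len(x[0])
    C_out, K = len(w), len(w[0][0])
    return [[sum(w[co][c][k] * x[c][t + k]
                 for c in range(C_in) for k in range(K)) + b[co]
             for t in range(T - K + 1)]
            for co in range(C_out)]

# Toy example: 2 input channels, 1 output channel, kernel size 2
x = [[1, 2, 3],      # channel 0
     [10, 20, 30]]   # channel 1
w = [[[1, 0],        # weights applied to channel 0
      [0, 1]]]       # weights applied to channel 1
b = [0.0]
print(conv1d_multi(x, w, b))  # [[21.0, 32.0]]
```

Each output value mixes information from every input channel, which is how sensor correlations enter the learned features.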


Our CNN Configuration

The CNN in our AMNL model consists of three convolutional layers with batch normalization and ReLU activation.

Layer-by-Layer Specification

| Layer | C_in | C_out | Kernel | Padding | Parameters |
| --- | --- | --- | --- | --- | --- |
| Conv1 | 17 | 64 | 3 | 1 | 3,328 |
| BatchNorm1 | - | 64 | - | - | 128 |
| Conv2 | 64 | 128 | 3 | 1 | 24,704 |
| BatchNorm2 | - | 128 | - | - | 256 |
| Conv3 | 128 | 64 | 3 | 1 | 24,640 |
| BatchNorm3 | - | 64 | - | - | 128 |
| Total | - | - | - | - | ~53K |
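The parameter counts in the table can be verified in a few lines, assuming one bias per output channel for each conv layer and two affine parameters (scale and shift) per channel for each batch norm, as in standard frameworks:

```python
def conv1d_params(c_in, c_out, k):
    # weight tensor of shape (c_out, c_in, k) plus one bias per output channel
    return c_out * c_in * k + c_out

def batchnorm_params(c):
    # one scale (gamma) and one shift (beta) per channel
    return 2 * c

layers = [conv1d_params(17, 64, 3), batchnorm_params(64),
          conv1d_params(64, 128, 3), batchnorm_params(128),
          conv1d_params(128, 64, 3), batchnorm_params(64)]
print(layers)       # [3328, 128, 24704, 256, 24640, 128]
print(sum(layers))  # 53184, i.e. ~53K
```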

Tensor Flow Through CNN

```text
Input:    X ∈ ℝ^(30 × 17)     # (time, features)
              ↓ transpose
          X' ∈ ℝ^(17 × 30)    # (channels, time) for Conv1d

Conv1:    H₁ ∈ ℝ^(64 × 30)    # 64 feature maps, same length
          + BatchNorm + ReLU + Dropout(0.15)

Conv2:    H₂ ∈ ℝ^(128 × 30)   # expand to 128 channels
          + BatchNorm + ReLU + Dropout(0.15)

Conv3:    H₃ ∈ ℝ^(64 × 30)    # compress back to 64 channels
          + BatchNorm + ReLU
              ↓ transpose
Output:   H ∈ ℝ^(30 × 64)     # ready for BiLSTM
```
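The shape bookkeeping above can be reproduced with a small tracker (a sketch; the helper name `conv1d_shape` is illustrative, and real frameworks compute these shapes internally):

```python
def conv1d_shape(c_in_t, c_out, k=3, p=1, s=1):
    """(channels, time) -> (channels, time) after one Conv1d layer."""
    c_in, t = c_in_t
    return (c_out, (t + 2 * p - k) // s + 1)

shape = (17, 30)                 # (channels, time) after the first transpose
for c_out in (64, 128, 64):      # the three conv layers
    shape = conv1d_shape(shape, c_out)
    print(shape)
# Prints (64, 30), (128, 30), (64, 30); transposing back gives (30, 64) for the BiLSTM
```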

Why This Channel Progression?

  • 17 → 64: Expand from raw sensors to richer learned representation
  • 64 → 128: Further expand to capture complex combinations
  • 128 → 64: Compress to reduce parameters for BiLSTM while retaining key features

The Bottleneck Pattern

The 64 → 128 → 64 pattern is a bottleneck design. The middle layer has high capacity to learn complex patterns, while the output layer compresses to a more efficient representation. This is analogous to autoencoder bottlenecks.


Why Convolution Works for Local Patterns

Convolution has several properties that make it ideal for extracting local patterns from time series:

1. Local Connectivity

Each output depends only on a local window of the input (kernel size K = 3 means 3 consecutive timesteps). This matches the physical reality that short-term sensor dynamics carry information.

2. Parameter Sharing

The same kernel is applied at every position. A pattern learned at the beginning of the sequence can be detected at the end. This drastically reduces parameters compared to fully connected layers.

\text{FC parameters} = T \times D \times H = 30 \times 17 \times 64 = 32{,}640
\text{Conv parameters} = C_{\text{in}} \times C_{\text{out}} \times K = 17 \times 64 \times 3 = 3{,}264

Convolution uses 10× fewer parameters while capturing the same local relationships.
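The comparison can be checked directly (bias terms are ignored in both counts, matching the formulas above):

```python
T, D, H, K = 30, 17, 64, 3   # window length, input features, hidden units, kernel size

fc_params = T * D * H        # one weight per (timestep, feature, unit) pair
conv_params = D * H * K      # one shared kernel per output channel, reused across time

print(fc_params, conv_params, fc_params // conv_params)  # 32640 3264 10
```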

3. Translation Equivariance

If a pattern shifts in time, its detection shifts correspondingly:

f(\text{shift}(\mathbf{x})) = \text{shift}(f(\mathbf{x}))

A temperature spike at cycle 50 produces the same activation pattern as one at cycle 150, just shifted. This is crucial for generalization.
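Equivariance can be demonstrated numerically with a toy spike signal (the helpers are illustrative; the equality is exact here because the spike stays away from the sequence boundary, where zero padding would otherwise break it):

```python
def conv1d(x, w):
    """Valid 1D convolution, deep-learning convention (no kernel flip)."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K))
            for t in range(len(x) - K + 1)]

def shift(seq, n):
    """Shift right by n positions, padding with zeros at the front."""
    return [0] * n + seq[:-n]

x = [0, 0, 1, 5, 1, 0, 0, 0, 0]   # a "spike" pattern
w = [1, -2, 1]

# Convolving a shifted input equals shifting the convolved output
assert conv1d(shift(x, 2), w) == shift(conv1d(x, w), 2)
print("equivariance holds")
```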

4. Hierarchical Feature Learning

Stacked convolutional layers create a hierarchy:

  • Layer 1: Detects simple patterns (edges, trends) across 3 timesteps
  • Layer 2: Combines simple patterns into complex ones (effective receptive field: 5 timesteps)
  • Layer 3: Detects high-level features (effective receptive field: 7 timesteps)
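The receptive-field growth can be computed for any stack of convolutions (a sketch; `receptive_field` is an illustrative helper). With stride 1 throughout, each layer with kernel K adds K - 1 timesteps:

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked 1D convolutions.
    Each layer adds (K - 1) * jump, where jump is the product of earlier strides."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3]))        # 3  after one layer
print(receptive_field([3, 3]))     # 5  after two layers
print(receptive_field([3, 3, 3]))  # 7  after three layers
```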

Summary

In this section, we explored the mathematics of convolution for sequence processing:

  1. Convolution slides a kernel across the input, computing weighted sums: y_t = \sum_k w_k \cdot x_{t+k-1} + b
  2. 1D convolution for time series extracts local temporal patterns with learned kernels
  3. Padding and stride control output dimensions: T_{\text{out}} = \lfloor (T + 2P - K)/S \rfloor + 1
  4. Multi-channel convolution processes all 17 sensor channels simultaneously
  5. Our CNN uses three layers: 17 → 64 → 128 → 64 channels with ~53K parameters
  6. Key properties: local connectivity, parameter sharing, translation equivariance, hierarchical learning
| Property | Value | Significance |
| --- | --- | --- |
| Kernel size | 3 | Captures 3-timestep local patterns |
| Padding | 1 (same) | Preserves sequence length at 30 |
| Receptive field | 7 | After 3 layers, sees 7 timesteps |
| Parameter reduction | 10× | vs fully connected layers |
| Output shape | (30, 64) | Ready for BiLSTM input |
Looking Ahead: Convolution captures local patterns, but cannot model dependencies across the full 30-timestep window. In the next section, we introduce Recurrent Neural Networks—the mathematical framework for processing sequences where each output depends on the entire preceding history.

With convolution understood, we can now tackle the more complex mathematics of recurrent processing.