Chapter 2

Convolution Operations for Sequences

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand convolution as a mathematical operation that extracts local patterns from sequences
  2. Derive the 1D convolution formula and know how it applies to time series data
  3. Grasp the role of kernels as learned pattern detectors that slide across the input
  4. Calculate output dimensions given input size, kernel size, padding, and stride
  5. Extend to multi-channel convolution for processing multivariate time series
  6. Connect to our architecture: the three-layer CNN that transforms 17 features to 64 learned representations
Why This Matters: Convolution is the workhorse of modern deep learning. In our architecture, CNN layers are the first processing stage—they transform raw sensor readings into learned features that capture local temporal patterns. Understanding convolution mathematically is essential for designing and debugging these networks.

What is Convolution?

Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other.

Historical Origins

Convolution originated in 18th century mathematics and became central to signal processing in the 20th century. The key insight is that many physical systems can be characterized by their impulse response—convolving this response with an input signal predicts the output.

Continuous Convolution

For continuous functions f and g, their convolution is:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \cdot g(t - \tau) \, d\tau

This integral "slides" function g across f, multiplying and summing at each position.

Discrete Convolution

For digital signals and neural networks, we use the discrete version:

(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \cdot g[n - m]

In practice, our sequences and kernels are finite, so the sum has finite bounds.


1D Convolution for Time Series

For time series data, we use 1D convolution where a kernel slides along the temporal dimension.

The 1D Convolution Formula

Given an input sequence \mathbf{x} = [x_1, x_2, \ldots, x_T] and a kernel \mathbf{w} = [w_1, w_2, \ldots, w_K] of size K, the convolution output at position t is:

y_t = \sum_{k=1}^{K} w_k \cdot x_{t+k-1} + b

Where:

  • y_t is the output at position t
  • w_k are the kernel weights (learned parameters)
  • x_{t+k-1} is the input at the corresponding position
  • b is the bias term
  • K is the kernel size

Strictly speaking, this is cross-correlation (the kernel is not flipped), but deep learning frameworks follow this convention and still call it convolution. Since the kernel weights are learned, the distinction makes no practical difference.
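To make the formula concrete, here is a minimal plain-Python sketch of a valid (no-padding) 1D convolution. The helper name `conv1d` and the toy sequence and kernel are illustrative, not part of our model:

```python
def conv1d(x, w, b=0.0):
    """Valid 1D convolution (deep-learning convention, no kernel flip):
    y_t = sum_k w_k * x[t+k-1] + b, for t = 1 .. T-K+1."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K)) + b
            for t in range(len(x) - K + 1)]

# A 5-step toy sequence and a 3-tap kernel that averages neighbors
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.0, 0.5]
print(conv1d(x, w))  # [2.0, 3.0, 4.0]
```

Note the output is shorter than the input by K - 1 positions; the padding discussion below addresses exactly this.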

Kernels and Learned Filters

In traditional signal processing, kernels are hand-designed for specific purposes. In deep learning, kernels are learned from data to extract task-relevant features.

Common Hand-Designed Kernels

| Kernel | Values | Purpose |
| --- | --- | --- |
| Identity | [0, 1, 0] | Pass-through (no effect) |
| Blur/Average | [1/3, 1/3, 1/3] | Smooth signal, reduce noise |
| Derivative | [1, 0, -1] | Detect rate of change |
| Laplacian | [1, -2, 1] | Detect acceleration/curvature |
| Edge detection | [-1, 2, -1] | Highlight sharp transitions |
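These hand-designed kernels can be tried directly. A small sketch (the signals and the `conv1d` helper are made up for illustration; note that with this indexing, [1, 0, -1] computes x[t] - x[t+2], i.e. the slope up to sign and scale):

```python
def conv1d(x, w):
    """Valid 1D convolution, deep-learning convention (no kernel flip)."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K))
            for t in range(len(x) - K + 1)]

ramp = [0, 1, 2, 3, 4, 5]   # constant slope
step = [0, 0, 0, 1, 1, 1]   # one sharp transition

# Derivative kernel: constant response on a ramp (constant slope)
print(conv1d(ramp, [1, 0, -1]))  # [-2, -2, -2, -2]
# Laplacian kernel: zero on a ramp (no curvature), nonzero at the step
print(conv1d(ramp, [1, -2, 1]))  # [0, 0, 0, 0]
print(conv1d(step, [1, -2, 1]))  # [0, 1, -1, 0]
```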

Learned Kernels in Neural Networks

In our CNN, each kernel starts with random values and is optimized via backpropagation to minimize the loss function. After training, kernels evolve into task-specific feature detectors.

For RUL prediction, learned kernels might detect:

  • Temperature spikes indicating overheating
  • Pressure drops suggesting seal wear
  • Vibration patterns indicating mechanical loosening
  • Multi-sensor correlation changes as degradation progresses

The Power of Learning

We do not need to know what patterns indicate degradation—the network discovers this automatically by learning kernels that minimize prediction error on training data.

Padding, Stride, and Output Dimensions

Three hyperparameters control how convolution is applied:

Padding

Padding adds values (typically zeros) to the boundaries of the input sequence. With padding P on each side:

\text{Padded input length} = T + 2P

"Same" padding preserves the input length (for odd K with stride 1):

P_{\text{same}} = \left\lfloor \frac{K - 1}{2} \right\rfloor

For K = 3, P_{\text{same}} = 1: add one zero on each end.

Stride

Stride S is how many positions the kernel moves between each computation. Stride 1 means slide by one position; stride 2 means skip every other position.

Output Dimension Formula

Given input length T, kernel size K, padding P, and stride S:

T_{\text{out}} = \left\lfloor \frac{T + 2P - K}{S} \right\rfloor + 1
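A one-line helper makes this formula easy to sanity-check (the function name `conv_out_len` is illustrative):

```python
def conv_out_len(T, K, P=0, S=1):
    """Output length of a 1D convolution: floor((T + 2P - K) / S) + 1."""
    return (T + 2 * P - K) // S + 1

# "Same" padding for K=3 keeps our length-30 window at 30
print(conv_out_len(T=30, K=3, P=1, S=1))  # 30
# No padding shrinks the sequence by K-1
print(conv_out_len(T=30, K=3, P=0, S=1))  # 28
# Stride 2 roughly halves the length
print(conv_out_len(T=30, K=3, P=1, S=2))  # 15
```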

Multi-Channel Convolution

Our input has 17 channels (features). Multi-channel convolution processes all channels simultaneously.

Multi-Input, Single-Output

For an input with C_{\text{in}} channels, each output position aggregates across all input channels:

y_t = \sum_{c=1}^{C_{\text{in}}} \sum_{k=1}^{K} w_{c,k} \cdot x_{c,\,t+k-1} + b

The kernel now has shape (C_{\text{in}}, K): it has weights for each input channel.

Multi-Input, Multi-Output

To produce C_{\text{out}} output channels, we use C_{\text{out}} separate kernels:

y_{c',t} = \sum_{c=1}^{C_{\text{in}}} \sum_{k=1}^{K} w_{c',c,k} \cdot x_{c,\,t+k-1} + b_{c'}

The full kernel tensor has shape (C_{\text{out}}, C_{\text{in}}, K).
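A naive sketch of multi-input, multi-output convolution with nested Python lists (the sizes and values are toy examples, not our 17-channel input; real frameworks vectorize this):

```python
def conv1d_multi(x, w, b):
    """x: [C_in][T], w: [C_out][C_in][K], b: [C_out] -> y: [C_out][T-K+1]."""
    C_in, T = len(x), len(x[0])
    C_out, K = len(w), len(w[0][0])
    return [[sum(w[co][c][k] * x[c][t + k]
                 for c in range(C_in) for k in range(K)) + b[co]
             for t in range(T - K + 1)]
            for co in range(C_out)]

# Toy example: 2 input channels, 1 output channel, kernel size 2
x = [[1, 2, 3],      # channel 0
     [10, 20, 30]]   # channel 1
w = [[[1, 0],        # weights applied to channel 0
      [0, 1]]]       # weights applied to channel 1
b = [0.0]
print(conv1d_multi(x, w, b))  # [[21.0, 32.0]]
```

Each output value mixes information from every input channel, which is how sensor correlations enter the learned features.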


Our CNN Configuration

The CNN in our AMNL model consists of three convolutional layers with batch normalization and ReLU activation.

Layer-by-Layer Specification

| Layer | C_in | C_out | Kernel | Padding | Parameters |
| --- | --- | --- | --- | --- | --- |
| Conv1 | 17 | 64 | 3 | 1 | 3,328 |
| BatchNorm1 | - | 64 | - | - | 128 |
| Conv2 | 64 | 128 | 3 | 1 | 24,704 |
| BatchNorm2 | - | 128 | - | - | 256 |
| Conv3 | 128 | 64 | 3 | 1 | 24,640 |
| BatchNorm3 | - | 64 | - | - | 128 |
| Total | - | - | - | - | ~53K |
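The parameter counts in the table can be verified in a few lines, assuming one bias per output channel for each conv layer and two affine parameters (scale and shift) per channel for each batch norm, as in standard frameworks:

```python
def conv1d_params(c_in, c_out, k):
    # weight tensor of shape (c_out, c_in, k) plus one bias per output channel
    return c_out * c_in * k + c_out

def batchnorm_params(c):
    # one scale (gamma) and one shift (beta) per channel
    return 2 * c

layers = [conv1d_params(17, 64, 3), batchnorm_params(64),
          conv1d_params(64, 128, 3), batchnorm_params(128),
          conv1d_params(128, 64, 3), batchnorm_params(64)]
print(layers)       # [3328, 128, 24704, 256, 24640, 128]
print(sum(layers))  # 53184, i.e. ~53K
```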

Tensor Flow Through CNN

```text
Input:    X ∈ ℝ^(30 × 17)     # (time, features)
              ↓ transpose
          X' ∈ ℝ^(17 × 30)    # (channels, time) for Conv1d

Conv1:    H₁ ∈ ℝ^(64 × 30)    # 64 feature maps, same length
          + BatchNorm + ReLU + Dropout(0.15)

Conv2:    H₂ ∈ ℝ^(128 × 30)   # expand to 128 channels
          + BatchNorm + ReLU + Dropout(0.15)

Conv3:    H₃ ∈ ℝ^(64 × 30)    # compress back to 64 channels
          + BatchNorm + ReLU
              ↓ transpose
Output:   H ∈ ℝ^(30 × 64)     # ready for BiLSTM
```
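The shape bookkeeping above can be reproduced with a small tracker (a sketch; the helper name `conv1d_shape` is illustrative, and real frameworks compute these shapes internally):

```python
def conv1d_shape(c_in_t, c_out, k=3, p=1, s=1):
    """(channels, time) -> (channels, time) after one Conv1d layer."""
    c_in, t = c_in_t
    return (c_out, (t + 2 * p - k) // s + 1)

shape = (17, 30)                 # (channels, time) after the first transpose
for c_out in (64, 128, 64):      # the three conv layers
    shape = conv1d_shape(shape, c_out)
    print(shape)
# Prints (64, 30), (128, 30), (64, 30); transposing back gives (30, 64) for the BiLSTM
```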

Why This Channel Progression?

  • 17 → 64: Expand from raw sensors to richer learned representation
  • 64 → 128: Further expand to capture complex combinations
  • 128 → 64: Compress to reduce parameters for BiLSTM while retaining key features

The Bottleneck Pattern

The 64 → 128 → 64 pattern is a bottleneck design. The middle layer has high capacity to learn complex patterns, while the output layer compresses to a more efficient representation. This is analogous to autoencoder bottlenecks.


Why Convolution Works for Local Patterns

Convolution has several properties that make it ideal for extracting local patterns from time series:

1. Local Connectivity

Each output depends only on a local window of the input (kernel size K = 3 means 3 consecutive timesteps). This matches the physical reality that short-term sensor dynamics carry information.

2. Parameter Sharing

The same kernel is applied at every position. A pattern learned at the beginning of the sequence can be detected at the end. This drastically reduces parameters compared to fully connected layers.

\text{FC parameters} = T \times D \times H = 30 \times 17 \times 64 = 32{,}640
\text{Conv parameters} = C_{\text{in}} \times C_{\text{out}} \times K = 17 \times 64 \times 3 = 3{,}264

Convolution uses 10× fewer parameters while capturing the same local relationships.
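The comparison can be checked directly (bias terms are ignored in both counts, matching the formulas above):

```python
T, D, H, K = 30, 17, 64, 3   # window length, input features, hidden units, kernel size

fc_params = T * D * H        # one weight per (timestep, feature, unit) pair
conv_params = D * H * K      # one shared kernel per output channel, reused across time

print(fc_params, conv_params, fc_params // conv_params)  # 32640 3264 10
```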

3. Translation Equivariance

If a pattern shifts in time, its detection shifts correspondingly:

f(\text{shift}(\mathbf{x})) = \text{shift}(f(\mathbf{x}))

A temperature spike at cycle 50 produces the same activation pattern as one at cycle 150, just shifted. This is crucial for generalization.
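Equivariance can be demonstrated numerically with a toy spike signal (the helpers are illustrative; the equality is exact here because the spike stays away from the sequence boundary, where zero padding would otherwise break it):

```python
def conv1d(x, w):
    """Valid 1D convolution, deep-learning convention (no kernel flip)."""
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K))
            for t in range(len(x) - K + 1)]

def shift(seq, n):
    """Shift right by n positions, padding with zeros at the front."""
    return [0] * n + seq[:-n]

x = [0, 0, 1, 5, 1, 0, 0, 0, 0]   # a "spike" pattern
w = [1, -2, 1]

# Convolving a shifted input equals shifting the convolved output
assert conv1d(shift(x, 2), w) == shift(conv1d(x, w), 2)
print("equivariance holds")
```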

4. Hierarchical Feature Learning

Stacked convolutional layers create a hierarchy:

  • Layer 1: Detects simple patterns (edges, trends) across 3 timesteps
  • Layer 2: Combines simple patterns into complex ones (effective receptive field: 5 timesteps)
  • Layer 3: Detects high-level features (effective receptive field: 7 timesteps)
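The receptive-field growth can be computed for any stack of convolutions (a sketch; `receptive_field` is an illustrative helper). With stride 1 throughout, each layer with kernel K adds K - 1 timesteps:

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked 1D convolutions.
    Each layer adds (K - 1) * jump, where jump is the product of earlier strides."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3]))        # 3  after one layer
print(receptive_field([3, 3]))     # 5  after two layers
print(receptive_field([3, 3, 3]))  # 7  after three layers
```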

Summary

In this section, we explored the mathematics of convolution for sequence processing:

  1. Convolution slides a kernel across the input, computing weighted sums: y_t = \sum_k w_k \cdot x_{t+k-1} + b
  2. 1D convolution for time series extracts local temporal patterns with learned kernels
  3. Padding and stride control output dimensions: T_{\text{out}} = \lfloor (T + 2P - K)/S \rfloor + 1
  4. Multi-channel convolution processes all 17 sensor channels simultaneously
  5. Our CNN uses three layers: 17 → 64 → 128 → 64 channels with ~53K parameters
  6. Key properties: local connectivity, parameter sharing, translation equivariance, hierarchical learning
| Property | Value | Significance |
| --- | --- | --- |
| Kernel size | 3 | Captures 3-timestep local patterns |
| Padding | 1 (same) | Preserves sequence length at 30 |
| Receptive field | 7 | After 3 layers, sees 7 timesteps |
| Parameter reduction | 10× | vs fully connected layers |
| Output shape | (30, 64) | Ready for BiLSTM input |
Looking Ahead: Convolution captures local patterns, but cannot model dependencies across the full 30-timestep window. In the next section, we introduce Recurrent Neural Networks—the mathematical framework for processing sequences where each output depends on the entire preceding history.

With convolution understood, we can now tackle the more complex mathematics of recurrent processing.