Chapter 2

Time Series Fundamentals

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Define time series mathematically and understand their role in sequential data analysis
  2. Master the notation for univariate and multivariate time series used throughout this book
  3. Understand stationarity and why sensor degradation data is inherently non-stationary
  4. Understand autocorrelation and why it creates temporal dependencies that deep learning can exploit
  5. Formalize the sliding window approach for converting variable-length sequences to fixed-length inputs
  6. Connect these concepts to real sensor data from turbofan engines

Why This Matters: Every deep learning architecture for time series—from LSTMs to Transformers—is designed to capture temporal structure. Understanding what time series are, why they exhibit autocorrelation, and how we represent them mathematically provides the foundation for understanding why certain architectures work better than others.

What is a Time Series?

A time series is a sequence of observations recorded at successive points in time. Unlike independent data points, time series exhibit temporal dependencies—what happens at time t depends on what happened at times t-1, t-2, \ldots

Historical Context

The mathematical study of time series began in earnest in the 1920s with Yule's work on sunspot cycles and was formalized by Norbert Wiener and Andrey Kolmogorov in the 1940s. The key insight was that random processes could have predictable structure when viewed sequentially.

Time Series in RUL Prediction

In predictive maintenance, our time series are sensor measurements recorded at each operational cycle of equipment. As the equipment degrades, these measurements change in systematic ways that can be learned by neural networks.

| Time Series Type | Example | Characteristic |
| --- | --- | --- |
| Stock prices | Daily closing prices | Volatile, trends, cycles |
| Weather data | Hourly temperature | Seasonal patterns |
| ECG signals | Heart electrical activity | Periodic with anomalies |
| Sensor data (ours) | Turbofan engine readings | Degradation trends, multivariate |

Mathematical Notation

We establish precise notation that will be used consistently throughout this book.

Univariate Time Series

A univariate time series is a sequence of scalar observations:

\{x_t\}_{t=1}^{T} = \{x_1, x_2, x_3, \ldots, x_T\}

Where:

  • x_t \in \mathbb{R} is the observation at time t
  • T is the total length of the series
  • t \in \{1, 2, \ldots, T\} is the time index (discrete time)

Stochastic Process View

From a probabilistic perspective, each observation x_t is a realization of a random variable X_t:

\{X_t\}_{t=1}^{T} \text{ is a stochastic process}

The observed sequence \{x_t\} is one realization of this process. In our context, each engine provides one realization—different engines under identical conditions would produce different (but statistically similar) sequences.

Why This Matters for Deep Learning

Neural networks learn to predict X_{t+1} given X_1, X_2, \ldots, X_t. They are essentially learning the conditional distribution P(X_{t+1} \mid X_{1:t}) from many training realizations.

Multivariate Time Series

In predictive maintenance, we observe multiple sensors simultaneously. This gives us a multivariate time series.

Definition

A multivariate time series of dimension D is:

\{\mathbf{x}_t\}_{t=1}^{T} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}, \quad \mathbf{x}_t \in \mathbb{R}^D

Where:

  • \mathbf{x}_t = [x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(D)}]^T is the feature vector at time t
  • D is the number of features (sensors + settings)
  • x_t^{(d)} is the value of feature d at time t

Matrix Representation

The entire multivariate sequence can be represented as a matrix:

\mathbf{X} = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(D)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(D)} \\ \vdots & \vdots & \ddots & \vdots \\ x_T^{(1)} & x_T^{(2)} & \cdots & x_T^{(D)} \end{bmatrix} \in \mathbb{R}^{T \times D}

Each row is a timestep; each column is a feature (sensor).
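This row/column convention maps directly onto array indexing. A minimal NumPy sketch (with made-up values; T = 5 timesteps, D = 3 features):

```python
import numpy as np

T, D = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))  # rows = timesteps, columns = features

x_t = X[2]          # feature vector at timestep t = 3 (row index 2), shape (D,)
sensor_d = X[:, 1]  # full history of feature d = 2 (column index 1), shape (T,)

print(X.shape)      # (5, 3)
print(x_t.shape)    # (3,)
```

Slicing a row recovers the feature vector \mathbf{x}_t; slicing a column recovers one sensor's univariate series.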


Stationarity and Non-Stationarity

Stationarity is a fundamental concept describing whether a time series's statistical properties remain constant over time.

Strict Stationarity

A process is strictly stationary if its joint distribution is invariant to time shifts:

P(X_{t_1}, X_{t_2}, \ldots, X_{t_k}) = P(X_{t_1+\tau}, X_{t_2+\tau}, \ldots, X_{t_k+\tau})

for all time indices t_1, \ldots, t_k and all shifts \tau.

Weak (Second-Order) Stationarity

More practically, a process is weakly stationary if:

  1. Constant mean: \mathbb{E}[X_t] = \mu for all t
  2. Constant variance: \text{Var}(X_t) = \sigma^2 for all t
  3. Covariance depends only on lag: \text{Cov}(X_t, X_{t+h}) = \gamma(h) (depends only on h, not t)
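A quick way to build intuition for these conditions is to compare summary statistics across segments of a series. A sketch with synthetic data (not C-MAPSS): white noise satisfies the constant-mean condition, while a series with a linear drift—a crude stand-in for degradation—violates it.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 10_000
stationary = rng.normal(loc=0.0, scale=1.0, size=T)       # constant mean and variance
trending = rng.normal(size=T) + np.linspace(0.0, 5.0, T)  # mean drifts upward over time

def half_means(x):
    """Mean of the first and second halves of a series."""
    h = len(x) // 2
    return x[:h].mean(), x[h:].mean()

m1, m2 = half_means(stationary)
print(abs(m1 - m2))   # small: consistent with a constant mean
m1, m2 = half_means(trending)
print(abs(m1 - m2))   # ~2.5: the mean shifts, violating condition 1
```

Formal stationarity tests (e.g. augmented Dickey–Fuller) exist, but for degradation data the violation is usually visible directly in the drifting segment means.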

Degradation Data is Non-Stationary

Equipment degradation data violates stationarity by design. As the equipment wears:

  • Mean changes: Sensor readings drift systematically (e.g., temperature increases)
  • Variance may change: Readings become more erratic as failure approaches
  • Distribution shifts: The generating process changes fundamentally over time

\mathbb{E}[X_t \mid \text{RUL}=100] \neq \mathbb{E}[X_t \mid \text{RUL}=20]

Implications for Modeling

Non-stationarity means we cannot use simple stationary models (like AR, MA, ARIMA) directly. Instead, we need models that can track changing dynamics—which is exactly what LSTMs and attention mechanisms do.

| Property | Stationary Series | Degradation Data |
| --- | --- | --- |
| Mean | Constant over time | Drifts with degradation |
| Variance | Constant over time | May increase near failure |
| Distribution | Same at all times | Changes from healthy to critical |
| Traditional models | AR, MA, ARIMA work well | Require non-linear, adaptive models |

Autocorrelation and Temporal Structure

Autocorrelation measures how correlated a time series is with itself at different time lags. This is the key property that makes sequential modeling necessary.

Autocovariance Function

For a stationary process, the autocovariance at lag h is:

\gamma(h) = \text{Cov}(X_t, X_{t+h}) = \mathbb{E}[(X_t - \mu)(X_{t+h} - \mu)]

Autocorrelation Function (ACF)

The normalized version is the autocorrelation function:

\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\text{Cov}(X_t, X_{t+h})}{\text{Var}(X_t)}

Note that \rho(0) = 1 and -1 \leq \rho(h) \leq 1.
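The sample ACF can be estimated directly from these definitions. A sketch using the standard biased estimator (divide by T rather than T - h), checked against an AR(1) process, whose theoretical ACF is \rho(h) = \phi^h:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho(h) for h = 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    gamma0 = np.dot(x, x) / T  # sample variance, gamma(0)
    return np.array([np.dot(x[:T - h], x[h:]) / T / gamma0
                     for h in range(max_lag + 1)])

# AR(1): x_t = 0.8 * x_{t-1} + eps_t, so rho(h) = 0.8**h in theory.
rng = np.random.default_rng(0)
x = np.zeros(5000)
eps = rng.normal(size=5000)
for t in range(1, 5000):
    x[t] = 0.8 * x[t - 1] + eps[t]

rho = acf(x, max_lag=3)
print(rho[1])   # approximately 0.8
```

Libraries such as statsmodels provide equivalent ACF estimators; the hand-rolled version above just makes the formula explicit.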

Why Autocorrelation Matters

  • High autocorrelation means nearby values are similar → sequence models can exploit this
  • Slow decay means long-range dependencies → LSTMs needed for long memory
  • Periodic peaks indicate seasonal patterns → attention can learn to focus on relevant periods

Sliding Window Representation

Real equipment trajectories have variable lengths—some engines run for 128 cycles, others for 362. Neural networks need fixed-size inputs. The sliding window approach bridges this gap.

Formal Definition

Given a multivariate time series \mathbf{X} \in \mathbb{R}^{T \times D} and window size W, we construct windowed samples:

\mathbf{X}^{(i)} = [\mathbf{x}_{i}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+W-1}] \in \mathbb{R}^{W \times D}

for i = 1, 2, \ldots, T - W + 1.

Labels for Each Window

Each window gets the label corresponding to its final timestep:

y^{(i)} = \text{RUL}_{i+W-1}

This reflects the prediction task: given the last W cycles, predict the RUL now.

Number of Windows

From a single trajectory of length TT:

N_{\text{windows}} = T - W + 1
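The window-and-label construction above can be sketched in a few lines of NumPy (toy trajectory; `make_windows` is a hypothetical helper name, and the RUL vector simply counts down to failure):

```python
import numpy as np

def make_windows(X, rul, W):
    """Build (T-W+1, W, D) windows, each labeled with the RUL at its last timestep."""
    T = X.shape[0]
    windows = np.stack([X[i:i + W] for i in range(T - W + 1)])
    labels = rul[W - 1:]  # y^{(i)} = RUL at index i + W - 1
    return windows, labels

T, D, W = 100, 17, 30
X = np.random.default_rng(0).normal(size=(T, D))
rul = np.arange(T - 1, -1, -1)  # RUL counts down from T-1 to 0 at failure

windows, labels = make_windows(X, rul, W)
print(windows.shape)   # (71, 30, 17): T - W + 1 = 71 windows
print(labels[0])       # RUL at cycle index W-1 = 29, i.e. 70
```

Note that windows from the same engine overlap heavily and are therefore not independent samples—something to keep in mind when splitting data for validation.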

Why Window Size 30?

The choice of W = 30 balances several factors:

  • Sufficient context: 30 cycles capture meaningful degradation trends
  • Not too long: Avoids including irrelevant ancient history
  • Computational efficiency: Keeps sequence length manageable for LSTMs
  • Empirical performance: Validated through ablation studies

Sensor Data as Time Series

Let's connect these abstract concepts to the concrete sensor data from NASA C-MAPSS.

The 17-Dimensional Feature Vector

At each cycle t, we observe:

\mathbf{x}_t = \begin{bmatrix} s_t^{(1)} & s_t^{(2)} & s_t^{(3)} & s_t^{(4)} & \cdots & s_t^{(17)} \end{bmatrix}^T

Where the 17 components are:

| Index | Feature | Physical Meaning |
| --- | --- | --- |
| 1-3 | setting₁, setting₂, setting₃ | Altitude, Mach, Throttle (operating condition) |
| 4-7 | sensor₂, sensor₃, sensor₄, sensor₆ | Temperatures at various engine stages |
| 8-10 | sensor₇, sensor₈, sensor₉ | Speeds and pressure ratios |
| 11-14 | sensor₁₁, sensor₁₂, sensor₁₃, sensor₁₄ | Corrected speeds, bleed measurements |
| 15-17 | sensor₁₇, sensor₂₀, (reserved) | Additional pressure and temperature |

Cross-Sensor Correlations

Beyond temporal autocorrelation, sensors exhibit cross-correlations:

\rho_{d_1, d_2}(h) = \text{Corr}(X_t^{(d_1)}, X_{t+h}^{(d_2)})

For example, HPC outlet temperature correlates with HPT coolant bleed because they share physical dependencies. CNNs in our architecture learn to exploit these cross-correlations.
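A sketch of zero-lag cross-correlation between two synthetic "sensor" channels that share a common degradation trend (hypothetical data, not actual C-MAPSS readings—the temperature/bleed naming is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
trend = np.linspace(0.0, 3.0, T)                      # shared degradation signal
temp = trend + rng.normal(scale=0.5, size=T)          # channel 1: trend + noise
bleed = 2.0 * trend + rng.normal(scale=0.5, size=T)   # channel 2: scaled trend + noise

# Zero-lag cross-correlation rho_{d1,d2}(0) via the sample correlation matrix.
rho = np.corrcoef(temp, bleed)[0, 1]
print(rho)   # strongly positive: the shared trend couples the channels
```

The correlation is large even though the two noise processes are independent, because both channels track the same underlying degradation—exactly the redundancy that convolutional feature extractors can fuse.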

Degradation Signatures

Different sensors show degradation in different ways:

| Sensor Type | Healthy Behavior | Degradation Signal |
| --- | --- | --- |
| Temperature | Stable around setpoint | Gradual increase (less efficient cooling) |
| Speed | Stable at rated value | Small oscillations, slight drift |
| Pressure | Consistent ratio | Decreased efficiency → changed ratios |
| Vibration | Low amplitude | Increasing amplitude (mechanical wear) |

The Deep Learning Opportunity: Each sensor tells part of the story. By processing all sensors together through CNN, BiLSTM, and Attention layers, our model learns to fuse these partial signals into a holistic degradation assessment that no single sensor could provide.

Summary

In this section, we established the mathematical foundations for time series analysis:

  1. Time series are sequences \{x_t\}_{t=1}^T where observations depend on their temporal position
  2. Multivariate time series \mathbf{X} \in \mathbb{R}^{T \times D} record D features at each timestep
  3. Stationarity means stable statistical properties—but degradation data is inherently non-stationary
  4. Autocorrelation \rho(h) measures temporal dependencies that sequence models exploit
  5. Sliding windows convert variable-length trajectories to fixed-size inputs \mathbf{X}^{(i)} \in \mathbb{R}^{W \times D}
  6. Sensor data exhibits both temporal autocorrelation and cross-sensor correlations
| Concept | Notation | In C-MAPSS |
| --- | --- | --- |
| Feature dimension | D | 17 (settings + sensors) |
| Sequence length | T | 128-362 cycles per engine |
| Window size | W | 30 cycles |
| Window shape | ℝ^{W×D} | ℝ^{30×17} |
| Windows per engine | T - W + 1 | ~100-330 |

Looking Ahead: In the next section, we will explore convolution operations for sequences—the mathematical foundation for the CNN feature extractor in our architecture. You will learn how sliding kernels extract local patterns from time series.

With time series fundamentals established, we can now build up the mathematical machinery for each component of our deep learning model.