Chapter 2

Time Series Fundamentals

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Define time series mathematically and understand their role in sequential data analysis
  2. Master the notation for univariate and multivariate time series used throughout this book
  3. Understand stationarity and why sensor degradation data is inherently non-stationary
  4. Understand autocorrelation and why it creates temporal dependencies that deep learning can exploit
  5. Formalize the sliding window approach for converting variable-length sequences to fixed-length inputs
  6. Connect these concepts to real sensor data from turbofan engines

Why This Matters: Every deep learning architecture for time series—from LSTMs to Transformers—is designed to capture temporal structure. Understanding what time series are, why they exhibit autocorrelation, and how we represent them mathematically provides the foundation for understanding why certain architectures work better than others.

What is a Time Series?

A time series is a sequence of observations recorded at successive points in time. Unlike independent data points, time series exhibit temporal dependencies—what happens at time t depends on what happened at times t-1, t-2, \ldots

Historical Context

The mathematical study of time series began in earnest in the 1920s with Yule's work on sunspot cycles and was formalized by Norbert Wiener and Andrey Kolmogorov in the 1940s. The key insight was that random processes could have predictable structure when viewed sequentially.

Time Series in RUL Prediction

In predictive maintenance, our time series are sensor measurements recorded at each operational cycle of equipment. As the equipment degrades, these measurements change in systematic ways that can be learned by neural networks.

| Time Series Type | Example | Characteristic |
| --- | --- | --- |
| Stock prices | Daily closing prices | Volatile, trends, cycles |
| Weather data | Hourly temperature | Seasonal patterns |
| ECG signals | Heart electrical activity | Periodic with anomalies |
| Sensor data (ours) | Turbofan engine readings | Degradation trends, multivariate |

Mathematical Notation

We establish precise notation that will be used consistently throughout this book.

Univariate Time Series

A univariate time series is a sequence of scalar observations:

\{x_t\}_{t=1}^{T} = \{x_1, x_2, x_3, \ldots, x_T\}

Where:

  • x_t \in \mathbb{R} is the observation at time t
  • T is the total length of the series
  • t \in \{1, 2, \ldots, T\} is the time index (discrete time)

Stochastic Process View

From a probabilistic perspective, each observation x_t is a realization of a random variable X_t:

\{X_t\}_{t=1}^{T} \text{ is a stochastic process}

The observed sequence \{x_t\} is one realization of this process. In our context, each engine provides one realization—different engines under identical conditions would produce different (but statistically similar) sequences.

Why This Matters for Deep Learning

Neural networks learn to predict X_{t+1} given X_1, X_2, \ldots, X_t. They are essentially learning the conditional distribution P(X_{t+1} \mid X_{1:t}) from many training realizations.

Multivariate Time Series

In predictive maintenance, we observe multiple sensors simultaneously. This gives us a multivariate time series.

Definition

A multivariate time series of dimension D is:

\{\mathbf{x}_t\}_{t=1}^{T} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}, \quad \mathbf{x}_t \in \mathbb{R}^D

Where:

  • \mathbf{x}_t = [x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(D)}]^T is the feature vector at time t
  • D is the number of features (sensors + settings)
  • x_t^{(d)} is the value of feature d at time t

Matrix Representation

The entire multivariate sequence can be represented as a matrix:

\mathbf{X} = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(D)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(D)} \\ \vdots & \vdots & \ddots & \vdots \\ x_T^{(1)} & x_T^{(2)} & \cdots & x_T^{(D)} \end{bmatrix} \in \mathbb{R}^{T \times D}

Each row is a timestep; each column is a feature (sensor).
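This row/column convention maps directly onto array indexing. A minimal NumPy sketch (with made-up values; T = 5 timesteps, D = 3 features):

```python
import numpy as np

T, D = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))  # rows = timesteps, columns = features

x_t = X[2]          # feature vector at timestep t = 3 (row index 2), shape (D,)
sensor_d = X[:, 1]  # full history of feature d = 2 (column index 1), shape (T,)

print(X.shape)      # (5, 3)
print(x_t.shape)    # (3,)
```

Slicing a row recovers the feature vector \mathbf{x}_t; slicing a column recovers one sensor's univariate series.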


Stationarity and Non-Stationarity

Stationarity is a fundamental concept describing whether a time series's statistical properties remain constant over time.

Strict Stationarity

A process is strictly stationary if its joint distribution is invariant to time shifts:

P(X_{t_1}, X_{t_2}, \ldots, X_{t_k}) = P(X_{t_1+\tau}, X_{t_2+\tau}, \ldots, X_{t_k+\tau})

for all time indices t_1, \ldots, t_k and all shifts \tau.

Weak (Second-Order) Stationarity

More practically, a process is weakly stationary if:

  1. Constant mean: \mathbb{E}[X_t] = \mu for all t
  2. Constant variance: \text{Var}(X_t) = \sigma^2 for all t
  3. Covariance depends only on lag: \text{Cov}(X_t, X_{t+h}) = \gamma(h) (depends only on h, not t)
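A quick way to build intuition for these conditions is to compare summary statistics across segments of a series. A sketch with synthetic data (not C-MAPSS): white noise satisfies the constant-mean condition, while a series with a linear drift—a crude stand-in for degradation—violates it.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 10_000
stationary = rng.normal(loc=0.0, scale=1.0, size=T)       # constant mean and variance
trending = rng.normal(size=T) + np.linspace(0.0, 5.0, T)  # mean drifts upward over time

def half_means(x):
    """Mean of the first and second halves of a series."""
    h = len(x) // 2
    return x[:h].mean(), x[h:].mean()

m1, m2 = half_means(stationary)
print(abs(m1 - m2))   # small: consistent with a constant mean
m1, m2 = half_means(trending)
print(abs(m1 - m2))   # ~2.5: the mean shifts, violating condition 1
```

Formal stationarity tests (e.g. augmented Dickey–Fuller) exist, but for degradation data the violation is usually visible directly in the drifting segment means.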

Degradation Data is Non-Stationary

Equipment degradation data violates stationarity by design. As the equipment wears:

  • Mean changes: Sensor readings drift systematically (e.g., temperature increases)
  • Variance may change: Readings become more erratic as failure approaches
  • Distribution shifts: The generating process changes fundamentally over time

\mathbb{E}[X_t \mid \text{RUL}=100] \neq \mathbb{E}[X_t \mid \text{RUL}=20]

Implications for Modeling

Non-stationarity means we cannot use simple stationary models (like AR, MA, ARIMA) directly. Instead, we need models that can track changing dynamics—which is exactly what LSTMs and attention mechanisms do.

| Property | Stationary Series | Degradation Data |
| --- | --- | --- |
| Mean | Constant over time | Drifts with degradation |
| Variance | Constant over time | May increase near failure |
| Distribution | Same at all times | Changes from healthy to critical |
| Traditional models | AR, MA, ARIMA work well | Require non-linear, adaptive models |

Autocorrelation and Temporal Structure

Autocorrelation measures how correlated a time series is with itself at different time lags. This is the key property that makes sequential modeling necessary.

Autocovariance Function

For a stationary process, the autocovariance at lag h is:

\gamma(h) = \text{Cov}(X_t, X_{t+h}) = \mathbb{E}[(X_t - \mu)(X_{t+h} - \mu)]

Autocorrelation Function (ACF)

The normalized version is the autocorrelation function:

\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\text{Cov}(X_t, X_{t+h})}{\text{Var}(X_t)}

Note that \rho(0) = 1 and -1 \leq \rho(h) \leq 1.
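The sample ACF can be estimated directly from these definitions. A sketch using the standard biased estimator (divide by T rather than T - h), checked against an AR(1) process, whose theoretical ACF is \rho(h) = \phi^h:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho(h) for h = 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    gamma0 = np.dot(x, x) / T  # sample variance, gamma(0)
    return np.array([np.dot(x[:T - h], x[h:]) / T / gamma0
                     for h in range(max_lag + 1)])

# AR(1): x_t = 0.8 * x_{t-1} + eps_t, so rho(h) = 0.8**h in theory.
rng = np.random.default_rng(0)
x = np.zeros(5000)
eps = rng.normal(size=5000)
for t in range(1, 5000):
    x[t] = 0.8 * x[t - 1] + eps[t]

rho = acf(x, max_lag=3)
print(rho[1])   # approximately 0.8
```

Libraries such as statsmodels provide equivalent ACF estimators; the hand-rolled version above just makes the formula explicit.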

Why Autocorrelation Matters

  • High autocorrelation means nearby values are similar → sequence models can exploit this
  • Slow decay means long-range dependencies → LSTMs needed for long memory
  • Periodic peaks indicate seasonal patterns → attention can learn to focus on relevant periods

Sliding Window Representation

Real equipment trajectories have variable lengths—some engines run for 128 cycles, others for 362. Neural networks need fixed-size inputs. The sliding window approach bridges this gap.

Formal Definition

Given a multivariate time series \mathbf{X} \in \mathbb{R}^{T \times D} and window size W, we construct windowed samples:

\mathbf{X}^{(i)} = [\mathbf{x}_{i}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+W-1}] \in \mathbb{R}^{W \times D}

for i = 1, 2, \ldots, T - W + 1.

Labels for Each Window

Each window gets the label corresponding to its final timestep:

y^{(i)} = \text{RUL}_{i+W-1}

This reflects the prediction task: given the last W cycles, predict the RUL now.

Number of Windows

From a single trajectory of length TT:

N_{\text{windows}} = T - W + 1
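The window-and-label construction above can be sketched in a few lines of NumPy (toy trajectory; `make_windows` is a hypothetical helper name, and the RUL vector simply counts down to failure):

```python
import numpy as np

def make_windows(X, rul, W):
    """Build (T-W+1, W, D) windows, each labeled with the RUL at its last timestep."""
    T = X.shape[0]
    windows = np.stack([X[i:i + W] for i in range(T - W + 1)])
    labels = rul[W - 1:]  # y^{(i)} = RUL at index i + W - 1
    return windows, labels

T, D, W = 100, 17, 30
X = np.random.default_rng(0).normal(size=(T, D))
rul = np.arange(T - 1, -1, -1)  # RUL counts down from T-1 to 0 at failure

windows, labels = make_windows(X, rul, W)
print(windows.shape)   # (71, 30, 17): T - W + 1 = 71 windows
print(labels[0])       # RUL at cycle index W-1 = 29, i.e. 70
```

Note that windows from the same engine overlap heavily and are therefore not independent samples—something to keep in mind when splitting data for validation.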

Why Window Size 30?

The choice of W = 30 balances several factors:

  • Sufficient context: 30 cycles capture meaningful degradation trends
  • Not too long: Avoids including irrelevant ancient history
  • Computational efficiency: Keeps sequence length manageable for LSTMs
  • Empirical performance: Validated through ablation studies

Sensor Data as Time Series

Let's connect these abstract concepts to the concrete sensor data from NASA C-MAPSS.

The 17-Dimensional Feature Vector

At each cycle t, we observe:

\mathbf{x}_t = \begin{bmatrix} s_t^{(1)} & s_t^{(2)} & s_t^{(3)} & s_t^{(4)} & \cdots & s_t^{(17)} \end{bmatrix}^T

Where the 17 components are:

| Index | Feature | Physical Meaning |
| --- | --- | --- |
| 1-3 | setting₁, setting₂, setting₃ | Altitude, Mach, Throttle (operating condition) |
| 4-7 | sensor₂, sensor₃, sensor₄, sensor₆ | Temperatures at various engine stages |
| 8-10 | sensor₇, sensor₈, sensor₉ | Speeds and pressure ratios |
| 11-14 | sensor₁₁, sensor₁₂, sensor₁₃, sensor₁₄ | Corrected speeds, bleed measurements |
| 15-17 | sensor₁₇, sensor₂₀, (reserved) | Additional pressure and temperature |

Cross-Sensor Correlations

Beyond temporal autocorrelation, sensors exhibit cross-correlations:

\rho_{d_1, d_2}(h) = \text{Corr}(X_t^{(d_1)}, X_{t+h}^{(d_2)})

For example, HPC outlet temperature correlates with HPT coolant bleed because they share physical dependencies. CNNs in our architecture learn to exploit these cross-correlations.
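A sketch of zero-lag cross-correlation between two synthetic "sensor" channels that share a common degradation trend (hypothetical data, not actual C-MAPSS readings—the temperature/bleed naming is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
trend = np.linspace(0.0, 3.0, T)                      # shared degradation signal
temp = trend + rng.normal(scale=0.5, size=T)          # channel 1: trend + noise
bleed = 2.0 * trend + rng.normal(scale=0.5, size=T)   # channel 2: scaled trend + noise

# Zero-lag cross-correlation rho_{d1,d2}(0) via the sample correlation matrix.
rho = np.corrcoef(temp, bleed)[0, 1]
print(rho)   # strongly positive: the shared trend couples the channels
```

The correlation is large even though the two noise processes are independent, because both channels track the same underlying degradation—exactly the redundancy that convolutional feature extractors can fuse.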

Degradation Signatures

Different sensors show degradation in different ways:

| Sensor Type | Healthy Behavior | Degradation Signal |
| --- | --- | --- |
| Temperature | Stable around setpoint | Gradual increase (less efficient cooling) |
| Speed | Stable at rated value | Small oscillations, slight drift |
| Pressure | Consistent ratio | Decreased efficiency → changed ratios |
| Vibration | Low amplitude | Increasing amplitude (mechanical wear) |

The Deep Learning Opportunity: Each sensor tells part of the story. By processing all sensors together through CNN, BiLSTM, and Attention layers, our model learns to fuse these partial signals into a holistic degradation assessment that no single sensor could provide.

Summary

In this section, we established the mathematical foundations for time series analysis:

  1. Time series are sequences \{x_t\}_{t=1}^T where observations depend on their temporal position
  2. Multivariate time series \mathbf{X} \in \mathbb{R}^{T \times D} record D features at each timestep
  3. Stationarity means stable statistical properties—but degradation data is inherently non-stationary
  4. Autocorrelation \rho(h) measures temporal dependencies that sequence models exploit
  5. Sliding windows convert variable-length trajectories to fixed-size inputs \mathbf{X}^{(i)} \in \mathbb{R}^{W \times D}
  6. Sensor data exhibits both temporal autocorrelation and cross-sensor correlations
| Concept | Notation | In C-MAPSS |
| --- | --- | --- |
| Feature dimension | D | 17 (settings + sensors) |
| Sequence length | T | 128-362 cycles per engine |
| Window size | W | 30 cycles |
| Window shape | ℝ^{W×D} | ℝ^{30×17} |
| Windows per engine | T - W + 1 | ~100-330 |

Looking Ahead: In the next section, we will explore convolution operations for sequences—the mathematical foundation for the CNN feature extractor in our architecture. You will learn how sliding kernels extract local patterns from time series.

With time series fundamentals established, we can now build up the mathematical machinery for each component of our deep learning model.