Chapter 1

The RUL Prediction Problem

Introduction to Predictive Maintenance

Learning Objectives

By the end of this section, you will:

  1. Formalize RUL prediction as a supervised regression problem with precise mathematical notation
  2. Understand input representation: multivariate time series from sensors and operational settings
  3. Master the piecewise linear degradation model and why it reflects engineering reality
  4. Identify the fundamental challenges that make RUL prediction difficult for machine learning
  5. Frame RUL as a multi-task problem with both regression and classification objectives
  6. Know the evaluation metrics: RMSE, MAE, and the asymmetric NASA scoring function
Why This Matters: Before building any machine learning model, we must precisely define what we are trying to predict, what data we have, and how we measure success. This section establishes the formal framework that all subsequent chapters build upon.

Formal Problem Formulation

We formulate Remaining Useful Life (RUL) prediction as a supervised regression problem with an auxiliary classification task. Given a multivariate time series of sensor measurements from operating equipment, the goal is to predict how many operational cycles remain before failure.

The Core Prediction Task

Let us define the problem mathematically. Consider a piece of equipment (e.g., a turbofan engine) that operates in discrete cycles. At each cycle t, we observe:

  • Sensor measurements: temperature, pressure, vibration, speed, etc.
  • Operational settings: altitude, throttle position, Mach number, etc.

Our task is to use the history of these observations to predict how many cycles remain until the equipment fails.

f: \mathbf{X}_{1:T} \rightarrow \hat{y}_{\text{RUL}}

Where:

  • f is the prediction function (neural network) we want to learn
  • \mathbf{X}_{1:T} is the sequence of observations from cycle 1 to the current cycle T
  • \hat{y}_{\text{RUL}} is the predicted remaining useful life (in cycles)

Input Representation

The input to our model is a multivariate time series—a sequence of feature vectors recorded at each operational cycle.

Mathematical Definition

\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T] \in \mathbb{R}^{T \times D}

Where:

  • T is the sequence length (number of timesteps/cycles)
  • D is the feature dimension (number of sensors + operational settings)
  • \mathbf{x}_t \in \mathbb{R}^D is the feature vector at timestep t
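As a concrete illustration, a single input sequence is just a T × D array; the values below are random placeholders, not real sensor data:

```python
import numpy as np

# A single input sequence is a T x D array: T = 30 cycles,
# D = 17 features (3 operational settings + 14 sensors).
T, D = 30, 17
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))

x_t = X[-1]  # feature vector at the current timestep t
```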

Feature Composition

In the NASA C-MAPSS benchmark we use throughout this book, each feature vector contains:

| Category             | Count | Examples                                       |
|----------------------|-------|------------------------------------------------|
| Operational Settings | 3     | Altitude, Mach number, throttle resolver angle |
| Sensor Measurements  | 14    | Temperature, pressure, speed, vibration        |
| Total Features       | 17    | D = 17-dimensional feature vector              |

Feature Selection

The original C-MAPSS dataset contains 21 sensor measurements, but 7 are constant or near-constant and provide no information about degradation. Following established practice, we use 14 informative sensors plus 3 operational settings, yielding D = 17 features.
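One simple way to find such uninformative channels is a standard-deviation filter. This is a generic sketch with an illustrative helper name, not the exact selection procedure used for C-MAPSS:

```python
import numpy as np

def informative_columns(X, eps=1e-6):
    """Return indices of columns whose standard deviation exceeds eps.

    Constant or near-constant sensor channels carry no degradation
    signal and can be dropped before modeling.
    """
    return np.flatnonzero(X.std(axis=0) > eps)

# Toy example: column 0 is constant, column 1 varies.
X = np.column_stack([np.ones(5), np.arange(5.0)])
keep = informative_columns(X)  # -> array([1])
```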

Sliding Window Approach

Rather than processing entire engine trajectories (which vary in length), we use a sliding window to create fixed-length sequences:

\mathbf{X}_{\text{window}} = [\mathbf{x}_{t-W+1}, \mathbf{x}_{t-W+2}, \ldots, \mathbf{x}_t] \in \mathbb{R}^{W \times D}

Where:

  • W is the window size (we use W = 30 cycles)
  • t is the current timestep
  • The label for each window is the RUL at the final timestep t
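A minimal NumPy sketch of this windowing scheme (the function name and toy data are illustrative, not from the actual C-MAPSS loader):

```python
import numpy as np

def sliding_windows(trajectory, rul, W=30):
    """Split one trajectory into overlapping length-W windows.

    trajectory: (T, D) array of per-cycle feature vectors.
    rul:        (T,) array of RUL labels, one per cycle.
    Returns (num_windows, W, D) windows; each window's label is
    the RUL at its final timestep.
    """
    T = trajectory.shape[0]
    windows = np.stack([trajectory[t - W + 1:t + 1] for t in range(W - 1, T)])
    labels = rul[W - 1:]
    return windows, labels

# Toy trajectory: 100 cycles, 17 features, RUL counting down 99 -> 0.
traj = np.zeros((100, 17))
rul = np.arange(99, -1, -1, dtype=float)
Xw, y = sliding_windows(traj, rul, W=30)  # 100 - 30 + 1 = 71 windows
```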

Output Targets

Our model produces two outputs for each input sequence:

Primary Output: RUL Prediction

\hat{y}_{\text{RUL}} \in \mathbb{R}^+

A non-negative real number representing the predicted remaining cycles until failure. This is a regression target.

Auxiliary Output: Health State Classification

\hat{y}_{\text{health}} \in \{0, 1, 2\}

A discrete category representing the equipment's degradation stage. This is a classification target with three classes:

| Class | Health State      | RUL Range     | Interpretation                                 |
|-------|-------------------|---------------|------------------------------------------------|
| 0     | Normal            | RUL > 80      | Equipment operating normally, no action needed |
| 1     | Early Degradation | 30 < RUL ≤ 80 | Degradation detected, schedule maintenance     |
| 2     | Critical          | RUL ≤ 30      | Failure imminent, immediate action required    |
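These thresholds translate directly into a small labeling helper (illustrative naming):

```python
def health_state(rul):
    """Map an RUL value to the 3-class health state defined above."""
    if rul > 80:
        return 0  # Normal
    elif rul > 30:
        return 1  # Early degradation
    else:
        return 2  # Critical
```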

Why Two Outputs?

The auxiliary health classification task is not just for interpretability—it is central to our AMNL innovation. As we will show in Chapter 10, treating this auxiliary task as equally important as RUL prediction provides crucial regularization that enables state-of-the-art performance.

The Piecewise Linear Degradation Model

A critical preprocessing step is how we define the ground-truth RUL labels. Real equipment does not begin degrading from the very first cycle; there is typically a healthy period during which wear is negligible.

The Problem with Linear RUL

Naively, we might define RUL as a simple countdown:

\text{RUL}_{\text{naive}}(t) = T_{\text{failure}} - t

But this creates a problem: early in the equipment's life, sensor readings show no degradation signature. Asking a model to predict RUL=250 vs RUL=300 when both correspond to healthy equipment is impossible—there is no signal in the data to distinguish them.

Piecewise Linear Solution

The standard solution is to cap the RUL at a maximum value R_{\max}:

\text{RUL}(t) = \min(R_{\max}, T_{\text{failure}} - t)

Or equivalently:

\text{RUL}(t) = \begin{cases} R_{\max} & \text{if } T_{\text{failure}} - t > R_{\max} \\ T_{\text{failure}} - t & \text{otherwise} \end{cases}

In the NASA C-MAPSS benchmark, R_{\max} = 125 cycles is the standard choice.
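The capped labels can be generated in a few lines of NumPy (a sketch; exact indexing conventions may differ between implementations):

```python
import numpy as np

def piecewise_rul(T_failure, R_max=125):
    """Piecewise linear RUL labels for cycles t = 1 .. T_failure.

    RUL(t) = min(R_max, T_failure - t): flat at R_max during the
    healthy phase, then a linear countdown to 0 at failure.
    """
    t = np.arange(1, T_failure + 1)
    return np.minimum(R_max, T_failure - t)

rul = piecewise_rul(200)  # flat at 125 for 75 cycles, then 124, 123, ..., 0
```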

Physical Interpretation

The capping threshold R_{\max} = 125 corresponds roughly to the point where degradation becomes detectable in sensor readings. Before this point, the equipment is in its "infant mortality" or "useful life" phase, where failures are random rather than wear-related.

Why RUL Prediction is Hard

RUL prediction is not a simple regression problem. Several fundamental challenges make it difficult for machine learning:

1. Non-Stationarity

The statistical properties of sensor data change over time as equipment degrades. A model trained on healthy data may fail on degraded data, and vice versa.

P(\mathbf{x}_t \mid \text{RUL} = 100) \neq P(\mathbf{x}_t \mid \text{RUL} = 20)

2. Multi-Modal Degradation

Equipment can fail in different ways. A turbofan engine might experience:

  • High-pressure compressor (HPC) degradation
  • Fan degradation
  • Combustor issues
  • Turbine blade erosion

Each failure mode produces different sensor signatures. A model must learn to recognize all failure modes, not just one.

3. Operating Condition Variability

Sensor readings depend heavily on operating conditions, not just degradation state:

  • Temperature readings at sea level ≠ temperature readings at 35,000 ft
  • Vibration at full throttle ≠ vibration at idle
  • Pressure ratios depend on ambient conditions

The model must learn to disentangle condition effects from degradation effects—a challenging feature engineering problem that deep learning can potentially solve.

4. Label Noise

The ground-truth failure time T_{\text{failure}} is determined by a threshold crossing in simulation, or by physical inspection in real data. This introduces label uncertainty:

  • When exactly did the degradation start?
  • Is the failure point precisely defined?
  • Could the equipment have operated longer?

5. Imbalanced Data

Most of an engine's operational life is spent in the healthy phase. The critical RUL range (0-30 cycles) represents only a small fraction of training data:

| RUL Range                   | Approximate % of Data | Importance                     |
|-----------------------------|-----------------------|--------------------------------|
| RUL > 80 (Normal)           | ~60%                  | Low (easy to predict)          |
| 30 < RUL ≤ 80 (Degradation) | ~25%                  | Medium                         |
| RUL ≤ 30 (Critical)         | ~15%                  | High (crucial for maintenance) |

The Accuracy Paradox

A naive model that always predicts RUL = 125 would achieve reasonable RMSE on average, but would be completely useless for the critical predictions that matter most. This is why we need specialized loss functions that emphasize the critical phase.

Multi-Task Learning Formulation

To address these challenges, we formulate RUL prediction as a multi-task learning problem with two objectives:

Task 1: RUL Regression (Primary)

\mathcal{L}_{\text{RUL}} = \frac{1}{N} \sum_{i=1}^{N} w_i \cdot (\hat{y}_i - y_i)^2

Where w_i is a sample weight that emphasizes critical-phase predictions (more on this in Chapter 11).

Task 2: Health Classification (Auxiliary)

\mathcal{L}_{\text{health}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=0}^{2} y_{i,c} \log(\hat{p}_{i,c})

Standard cross-entropy loss for 3-class classification.

Combined AMNL Loss

Our key innovation is combining these with equal weights:

\mathcal{L}_{\text{AMNL}} = 0.5 \times \mathcal{L}_{\text{RUL}} + 0.5 \times \mathcal{L}_{\text{health}}
The Counterintuitive Discovery: Conventional wisdom says to weight the primary task (RUL) higher than the auxiliary task (health classification). But our experiments show that equal weighting provides superior regularization, especially for complex multi-condition scenarios.
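The combined loss can be sketched in plain NumPy. This is a minimal illustration of the equal-weighted combination, not the actual training implementation (which is framework-specific and covered in later chapters):

```python
import numpy as np

def amnl_loss(y_pred, y_true, p_pred, y_class, w=None):
    """Equal-weighted AMNL loss: 0.5 * (weighted MSE) + 0.5 * cross-entropy.

    y_pred, y_true: (N,) RUL predictions and targets.
    p_pred:         (N, 3) predicted class probabilities (rows sum to 1).
    y_class:        (N,) integer health-state labels in {0, 1, 2}.
    w:              optional (N,) sample weights for the regression term.
    """
    if w is None:
        w = np.ones_like(y_true)
    l_rul = np.mean(w * (y_pred - y_true) ** 2)
    # Cross-entropy: negative log-probability of the true class.
    l_health = -np.mean(np.log(p_pred[np.arange(len(y_class)), y_class] + 1e-12))
    return 0.5 * l_rul + 0.5 * l_health
```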

Evaluation Metrics

We evaluate RUL prediction using several complementary metrics:

Root Mean Square Error (RMSE)

The primary metric for comparing methods:

\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}

Lower is better. RMSE penalizes large errors more heavily than small errors due to the squaring operation.

Mean Absolute Error (MAE)

A more robust metric less sensitive to outliers:

\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|

NASA Asymmetric Scoring Function

The NASA score reflects the real-world cost asymmetry: predicting failure too late (overestimating RUL) is more dangerous than predicting too early (underestimating RUL).

S = \frac{1}{N} \sum_{i=1}^{N} s_i, \quad \text{where } s_i = \begin{cases} e^{-d_i/13} - 1 & \text{if } d_i < 0 \text{ (early)} \\ e^{d_i/10} - 1 & \text{if } d_i \geq 0 \text{ (late)} \end{cases}

Where d_i = \hat{y}_i - y_i is the prediction error.
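The scoring function transcribes directly into code (the helper name is illustrative):

```python
import numpy as np

def nasa_score(y_pred, y_true):
    """Mean asymmetric NASA score (lower is better).

    Late predictions (d >= 0) are penalized more steeply than
    early ones, matching the /10 vs /13 exponents.
    """
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    s = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(s.mean())

# Being 10 cycles late costs e^{10/10} - 1 ~ 1.72,
# while 10 cycles early costs only e^{10/13} - 1 ~ 1.16.
```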

Coefficient of Determination (R²)

Measures how well predictions explain variance in true RUL:

R^2 = 1 - \frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}

R^2 = 1.0 means perfect prediction; R^2 = 0 means no better than predicting the mean.
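The RMSE, MAE, and R² formulas are direct NumPy transcriptions:

```python
import numpy as np

def rmse(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def r2(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```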


Summary

In this section, we have formally defined the RUL prediction problem:

  1. Input: Multivariate time series \mathbf{X} \in \mathbb{R}^{T \times D} with D = 17 features over T = 30 timesteps
  2. Primary output: Continuous RUL prediction \hat{y}_{\text{RUL}} \in \mathbb{R}^+
  3. Auxiliary output: Discrete health state \hat{y}_{\text{health}} \in \{0, 1, 2\}
  4. Degradation model: Piecewise linear with R_{\max} = 125
  5. Key challenges: Non-stationarity, multi-modal degradation, operating condition variability, label noise, data imbalance
  6. Evaluation: RMSE (primary), MAE, NASA Score (asymmetric), R²
Looking Ahead: In the next section, we will explore why deep learning is particularly well-suited for RUL prediction, and trace the evolution of neural network approaches for time series analysis.

With the problem formally defined, we are ready to understand the solution approach.