Chapter 3

Dataset Structure and Sensor Readings

NASA C-MAPSS Dataset Deep Dive

Learning Objectives

By the end of this section, you will:

  1. Understand the C-MAPSS simulation and why it serves as the gold standard benchmark for RUL prediction
  2. Know the physical meaning of each sensor and operating condition in the dataset
  3. Navigate the dataset files and understand the train/test split structure
  4. Interpret the 26-column data format and identify which columns contain predictive information
  5. Appreciate the run-to-failure paradigm that makes this dataset unique for predictive maintenance research
Why This Matters: The NASA C-MAPSS dataset is the most widely used benchmark for RUL prediction research. Understanding its structure, physics, and quirks is essential for reproducing results, comparing methods fairly, and knowing what our model is actually learning. This section turns raw data files into deep understanding.

C-MAPSS Overview

C-MAPSS stands for Commercial Modular Aero-Propulsion System Simulation. It was developed by NASA's Glenn Research Center to simulate realistic turbofan engine degradation.

Why Simulated Data?

Real engine failure data is extremely rare and expensive to obtain:

  • Safety: We cannot run engines to failure in commercial aircraft
  • Cost: Each engine costs millions of dollars
  • Time: Engine lifetimes span thousands of flight hours
  • Variability: Real failures have complex, uncontrolled conditions

C-MAPSS provides controlled degradation simulations where we know the exact failure point, enabling supervised learning with ground-truth RUL labels.

The Turbofan Engine Model

C-MAPSS simulates a large commercial turbofan engine with the following components:

| Component | Abbreviation | Function |
| --- | --- | --- |
| Fan | Fan | Draws air into engine, provides thrust |
| Low Pressure Compressor | LPC | First stage compression |
| High Pressure Compressor | HPC | Second stage compression |
| Combustor | Combustor | Fuel combustion, energy release |
| High Pressure Turbine | HPT | Drives HPC via shaft |
| Low Pressure Turbine | LPT | Drives Fan and LPC via shaft |
| Nozzle | Nozzle | Exhaust, thrust generation |

The simulation models thermodynamic cycles, mechanical wear, and performance degradation as the engine accumulates operational cycles.

Dataset History

The C-MAPSS dataset was released by Saxena et al. in 2008 at the International Conference on Prognostics and Health Management. It has since become the de facto standard for benchmarking RUL prediction algorithms, with hundreds of papers using it for evaluation.


The Simulation Physics

Understanding the physics helps interpret what our model learns. The simulation introduces degradation through two mechanisms:

1. Component Efficiency Degradation

As the engine operates, component efficiencies decrease:

$$\eta_{\text{component}}(t) = \eta_0 - \Delta\eta(t)$$

where $\eta_0$ is the initial efficiency and $\Delta\eta(t)$ is the degradation accumulated over time.

2. Flow Capacity Degradation

The ability to pass air through components decreases:

$$\dot{m}_{\text{component}}(t) = \dot{m}_0 - \Delta\dot{m}(t)$$
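As a purely illustrative sketch, the two decay laws can be pictured as a healthy baseline minus accumulated wear. The baseline value and wear rate below are made-up constants for demonstration, not parameters of the actual C-MAPSS thermodynamic simulation:

```python
import numpy as np

def degraded(baseline, wear_per_cycle, cycle):
    """Toy decay law: quantity(t) = baseline - accumulated wear.

    Applies equally to efficiency (eta) and flow capacity (m_dot).
    The constants passed in are illustrative, not C-MAPSS values.
    """
    return baseline - wear_per_cycle * np.asarray(cycle, dtype=float)

# Efficiency starts at an assumed 0.90 and erodes slowly each cycle.
cycles = np.arange(0, 200)
eta = degraded(0.90, 5e-4, cycles)  # monotonically decreasing
```

The model learns to recognize the sensor-level fingerprints of exactly this kind of slow, monotone drift.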

Degradation Modes

C-MAPSS simulates degradation in specific components. The fault modes affect:

| Fault Mode | Affected Components | Physical Effect |
| --- | --- | --- |
| HPC Degradation | High Pressure Compressor | Blade erosion, tip clearance |
| Fan Degradation | Fan | Foreign object damage, wear |

These degradation modes manifest as measurable changes in sensor readings—exactly what our model learns to detect.

Run-to-Failure Paradigm

Each simulated engine starts healthy and runs until failure:

$$\text{Cycle 1} \xrightarrow{\text{degradation}} \text{Cycle 2} \xrightarrow{\text{degradation}} \cdots \xrightarrow{\text{failure}} \text{Cycle } T$$

At the final cycle $T$, the engine has failed. The RUL at any cycle $t$ is simply:

$$\text{RUL}(t) = T - t$$

Ground Truth Labels

This is the key advantage of simulated data: we know $T$ exactly, giving us perfect RUL labels for supervised learning. In real-world applications, we never know when failure will occur—that's the prediction problem!


Dataset File Structure

The C-MAPSS dataset consists of four sub-datasets (FD001-FD004), each with training and test files.

File Types

| File | Description | Use |
| --- | --- | --- |
| train_FD00X.txt | Complete run-to-failure trajectories | Training data |
| test_FD00X.txt | Truncated trajectories (unknown failure point) | Test inputs |
| RUL_FD00X.txt | Ground truth RUL for test set | Evaluation only |
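A minimal loading sketch with pandas, assuming the raw files sit in a local directory. The files ship without a header row, so the column names below are our own convention, not part of the dataset:

```python
import pandas as pd

# Our own column-name convention: unit id, cycle, 3 operating
# settings, then sensors s1..s21 (26 columns total, no header).
COLS = ["unit", "cycle", "op1", "op2", "op3"] + [f"s{i}" for i in range(1, 22)]

def load_fd(subset="FD001", path="."):
    """Load one sub-dataset's train/test trajectories and test RUL labels."""
    train = pd.read_csv(f"{path}/train_{subset}.txt", sep=r"\s+",
                        header=None, names=COLS)
    test = pd.read_csv(f"{path}/test_{subset}.txt", sep=r"\s+",
                       header=None, names=COLS)
    rul = pd.read_csv(f"{path}/RUL_{subset}.txt", sep=r"\s+",
                      header=None, names=["RUL"])
    return train, test, rul
```

Note `sep=r"\s+"`: the files are space-delimited with (in some copies) trailing spaces, which a plain single-space separator would mishandle.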

Training Data Structure

Training files contain complete trajectories. Each engine runs from cycle 1 until failure:

```text
Engine 1: Cycle 1, 2, 3, ..., 192 (failed at 192)
Engine 2: Cycle 1, 2, 3, ..., 287 (failed at 287)
Engine 3: Cycle 1, 2, 3, ..., 145 (failed at 145)
...
```

The last cycle of each engine is the failure point. RUL labels are computed by counting backwards from failure.

Test Data Structure

Test files contain truncated trajectories. We observe some cycles but not the failure:

```text
Engine 1: Cycle 1, 2, 3, ..., 31 (truncated, failure unknown)
Engine 2: Cycle 1, 2, 3, ..., 49 (truncated, failure unknown)
Engine 3: Cycle 1, 2, 3, ..., 126 (truncated, failure unknown)
...
```

The task is to predict the RUL at the final observed cycle. Ground truth is provided separately in RUL_FD00X.txt.
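Since only the final observed cycle of each test engine is scored, a small helper can extract exactly those rows. This assumes a test DataFrame with `unit` and `cycle` columns, names of our own choosing:

```python
import pandas as pd

def last_cycle_rows(test: pd.DataFrame) -> pd.DataFrame:
    """Return one row per engine: the last observed cycle.

    These are the rows whose predicted RUL is compared against
    the ground truth in RUL_FD00X.txt (one value per engine).
    """
    idx = test.groupby("unit")["cycle"].idxmax()  # index of final cycle per unit
    return test.loc[idx].reset_index(drop=True)
```

The ground-truth file has one RUL value per engine, in engine order, so it aligns row-for-row with this helper's output.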

Dataset Statistics

| Dataset | Train Engines | Test Engines | Operating Conditions | Fault Modes |
| --- | --- | --- | --- | --- |
| FD001 | 100 | 100 | 1 (Sea Level) | 1 (HPC) |
| FD002 | 260 | 259 | 6 (Various) | 1 (HPC) |
| FD003 | 100 | 100 | 1 (Sea Level) | 2 (HPC + Fan) |
| FD004 | 249 | 248 | 6 (Various) | 2 (HPC + Fan) |

Increasing Complexity

FD001 is the simplest (single condition, single fault). FD004 is the most challenging (multiple conditions, multiple faults). Methods that work on FD001 may fail on FD004—our AMNL model achieves state-of-the-art on ALL four datasets.


Sensor Readings Explained

Each cycle produces readings from 21 sensors placed throughout the engine. Understanding these sensors helps interpret model behavior.

Sensor Categories

Sensors measure different physical quantities:

| Category | Sensors | Physical Meaning |
| --- | --- | --- |
| Temperature | T2, T24, T30, T50 | Gas temperatures at various stages |
| Pressure | P2, P15, P30, Ps30 | Static and total pressures |
| Speed | Nf, Nc, NRf, NRc | Fan and core shaft speeds |
| Flow | phi, BPR, htBleed, W31, W32 | Air flow rates and ratios |
| Other | farB, Nf_dmd, PCNfR_dmd, epr | Fuel ratio, demanded values |

Complete Sensor List

| Index | Symbol | Description | Unit |
| --- | --- | --- | --- |
| 1 | T2 | Total temperature at fan inlet | °R |
| 2 | T24 | Total temperature at LPC outlet | °R |
| 3 | T30 | Total temperature at HPC outlet | °R |
| 4 | T50 | Total temperature at LPT outlet | °R |
| 5 | P2 | Pressure at fan inlet | psia |
| 6 | P15 | Total pressure in bypass duct | psia |
| 7 | P30 | Total pressure at HPC outlet | psia |
| 8 | Nf | Physical fan speed | rpm |
| 9 | Nc | Physical core speed | rpm |
| 10 | epr | Engine pressure ratio (P50/P2) | (dimensionless) |
| 11 | Ps30 | Static pressure at HPC outlet | psia |
| 12 | phi | Ratio of fuel flow to Ps30 | pps/psi |
| 13 | NRf | Corrected fan speed | rpm |
| 14 | NRc | Corrected core speed | rpm |
| 15 | BPR | Bypass ratio | (dimensionless) |
| 16 | farB | Burner fuel-air ratio | (dimensionless) |
| 17 | htBleed | Bleed enthalpy | (dimensionless) |
| 18 | Nf_dmd | Demanded fan speed | rpm |
| 19 | PCNfR_dmd | Demanded corrected fan speed | % |
| 20 | W31 | HPT coolant bleed | lbm/s |
| 21 | W32 | LPT coolant bleed | lbm/s |

Informative vs Non-Informative Sensors

Not all sensors are equally useful for RUL prediction:

  • Constant sensors: Some sensors remain nearly constant throughout the engine life (e.g., T2, P2 at sea level)
  • Noisy sensors: Some have high noise with little degradation signal
  • Informative sensors: Show clear degradation trends (e.g., T24, T30, T50, P30)

We will analyze which sensors to keep in Section 3 (Feature Selection).
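One simple screen, shown here as an illustration rather than the final feature-selection procedure: drop any sensor whose standard deviation is essentially zero, since a flat signal carries no degradation information. The `s1`-style column names are our own convention:

```python
import pandas as pd

def drop_constant_sensors(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Remove sensor columns with (near-)zero variance over the whole dataset.

    Assumes sensor columns are named s1..s21; `tol` is an illustrative
    threshold, not a tuned value.
    """
    sensor_cols = [c for c in df.columns if c.startswith("s")]
    flat = [c for c in sensor_cols if df[c].std() < tol]
    return df.drop(columns=flat)
```

On FD001, a screen like this typically flags the sea-level constants (e.g., T2, P2) noted above; noisy-but-varying sensors require the more careful analysis deferred to Section 3.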


Data Format and Dimensions

Each data file is a space-delimited text file with 26 columns and no header row.

Column Layout

| Columns | Content | Count |
| --- | --- | --- |
| 1 | Engine unit number | 1 |
| 2 | Time (in cycles) | 1 |
| 3-5 | Operating conditions (altitude, Mach, TRA) | 3 |
| 6-26 | Sensor measurements | 21 |

Sample Data Row

```text
1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.02 9046.19 1.30 47.47 521.66 2388.01 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
```

Parsing this row:

  • Engine 1, Cycle 1: First two values
  • Operating conditions: altitude ≈ 0, Mach ≈ 0, TRA = 100 (sea level, full throttle)
  • 21 sensor values: Remaining columns
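The same split can be expressed as array slices. The row below is the sample row from the text, and the slice boundaries follow the column layout table:

```python
import numpy as np

# The sample data row, parsed into a float array (26 values).
row = np.array([1, 1, -0.0007, -0.0004, 100.0, 518.67, 641.82, 1589.70,
                1400.60, 14.62, 21.61, 554.36, 2388.02, 9046.19, 1.30,
                47.47, 521.66, 2388.01, 8138.62, 8.4195, 0.03, 392, 2388,
                100.0, 39.06, 23.4190])

unit, cycle = int(row[0]), int(row[1])  # columns 1-2
op_settings = row[2:5]                  # columns 3-5: altitude, Mach, TRA
sensors = row[5:]                       # columns 6-26: the 21 sensor readings
```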

Tensor Representation

For a single engine with $T$ cycles, the data forms a matrix:

$$\mathbf{X}_{\text{engine}} \in \mathbb{R}^{T \times 26}$$

For the entire training set, the trajectories of all engines are stacked row-wise:

$$\mathbf{X}_{\text{train}} \in \mathbb{R}^{N \times 26}, \qquad N = \sum_{i} T_i$$

where $T_i$ is the lifetime (in cycles) of engine $i$.

RUL Label Computation

For training data, RUL at each cycle is computed as:

$$\text{RUL}(t) = T_{\text{max}} - t$$

where $T_{\text{max}}$ is the final cycle (failure point) for that engine.

```python
# Example: engine with 192 cycles
# Cycle 1:   RUL = 192 - 1   = 191
# Cycle 2:   RUL = 192 - 2   = 190
# ...
# Cycle 191: RUL = 192 - 191 = 1
# Cycle 192: RUL = 192 - 192 = 0  (failure)
```

Piecewise Linear Clipping

In practice, we clip RUL values at a maximum (e.g., 125 cycles). Early-life degradation is imperceptible, so predicting RUL = 191 vs RUL = 200 is meaningless. We will formalize this in Section 4.
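A sketch of this labeling in pandas, assuming `unit`/`cycle` column names of our own choosing; the 125-cycle ceiling matches the example in the text:

```python
import pandas as pd

def add_rul(train: pd.DataFrame, clip_at: int = 125) -> pd.DataFrame:
    """Attach clipped RUL labels to a training DataFrame.

    For each engine (grouped by "unit"), RUL = T_max - cycle, then
    clipped at `clip_at` to implement the piecewise-linear target.
    """
    t_max = train.groupby("unit")["cycle"].transform("max")  # failure cycle per engine
    return train.assign(RUL=(t_max - train["cycle"]).clip(upper=clip_at))
```

`transform("max")` broadcasts each engine's failure cycle back onto its own rows, so the subtraction and clip are simple vectorized operations.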


Summary

In this section, we introduced the NASA C-MAPSS dataset—the benchmark for RUL prediction research:

  1. C-MAPSS is a turbofan engine simulation developed by NASA Glenn Research Center
  2. Run-to-failure: Each engine starts healthy and runs until failure, providing ground-truth RUL labels
  3. Four sub-datasets: FD001-FD004 with varying operating conditions (1 or 6) and fault modes (1 or 2)
  4. 26 columns: Engine ID, cycle, 3 operating conditions, 21 sensor readings
  5. 21 sensors: Measure temperatures, pressures, speeds, and flow rates throughout the engine
  6. Train/Test split: Training has complete trajectories; test has truncated trajectories with separate ground truth

| Aspect | Value | Note |
| --- | --- | --- |
| Total sensors | 21 | Not all informative |
| Operating conditions | 3 | Altitude, Mach, TRA |
| FD001 engines | 100 train / 100 test | Simplest case |
| FD004 engines | 249 train / 248 test | Most challenging |
| Data format | Space-delimited text | No headers |
Looking Ahead: Now that we understand the dataset structure, we need to understand the differences between FD001-FD004. Why is FD004 harder than FD001? What are the 6 operating conditions? What are the 2 fault modes? The next section answers these questions, revealing why dataset-specific strategies are essential for state-of-the-art performance.

With the dataset structure understood, we are ready to explore the specific challenges of each sub-dataset.