Chapter 1

NASA C-MAPSS Benchmark Overview

Introduction to Predictive Maintenance

Learning Objectives

By the end of this section, you will:

  1. Understand the C-MAPSS simulation and why it became the standard benchmark for RUL prediction
  2. Know the four sub-datasets (FD001-FD004) and their complexity characteristics
  3. Learn the data format: operational settings, sensor measurements, and their physical meanings
  4. Understand feature selection: which of the 21 sensors provide useful information
  5. Master the evaluation protocol: train/test splits and ground-truth RUL files
  6. See the performance benchmark: where AMNL stands relative to state-of-the-art methods
Why This Matters: The NASA C-MAPSS dataset is the de facto benchmark for evaluating predictive maintenance algorithms. Understanding its structure, challenges, and evaluation protocol is essential for interpreting research results and designing your own experiments.

What is C-MAPSS?

C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) is a sophisticated simulation tool developed by NASA for modeling turbofan engine degradation. The dataset generated from this simulator has become the gold standard for benchmarking RUL prediction algorithms.

The Turbofan Engine

A turbofan engine is the most common type of jet engine used in commercial aircraft. It consists of several key components:

  • Fan: Large front fan that draws in air and provides most of the thrust
  • Low-Pressure Compressor (LPC): Compresses air before the core
  • High-Pressure Compressor (HPC): Further compresses air to high pressure
  • Combustor: Burns fuel with compressed air
  • High-Pressure Turbine (HPT): Extracts energy to drive the HPC
  • Low-Pressure Turbine (LPT): Extracts energy to drive the fan and LPC

Simulated Degradation

The C-MAPSS simulation injects fault modes that cause progressive degradation over time:

| Fault Mode | Affected Components | Effect |
|---|---|---|
| HPC Degradation | High-Pressure Compressor | Reduced efficiency, increased temperature |
| Fan Degradation | Fan assembly | Reduced thrust, vibration increase |

These faults develop gradually over hundreds of operational cycles, mimicking real-world wear patterns. The simulator records sensor measurements throughout the degradation process until a failure threshold is reached.

Why Simulation Data?

Real run-to-failure data is extremely rare and expensive to obtain—no airline intentionally runs engines until they fail! Simulation provides controlled experiments with known failure times, enabling proper evaluation of prediction algorithms.

The Four Sub-Datasets

C-MAPSS provides four sub-datasets with varying levels of complexity, enabling systematic evaluation of algorithm robustness:

| Dataset | Operating Conditions | Fault Modes | Train Units | Test Units | Difficulty |
|---|---|---|---|---|---|
| FD001 | 1 | 1 (HPC) | 100 | 100 | Easy |
| FD002 | 6 | 1 (HPC) | 260 | 259 | Hard |
| FD003 | 1 | 2 (HPC + Fan) | 100 | 100 | Medium |
| FD004 | 6 | 2 (HPC + Fan) | 249 | 248 | Very Hard |

Understanding Dataset Complexity

Operating Conditions

Operating conditions represent different flight regimes—combinations of altitude, Mach number, and throttle settings. Each condition produces different baseline sensor readings:

  • FD001/FD003: Single operating condition—all engines operate under identical flight regime
  • FD002/FD004: Six operating conditions—engines experience varying altitudes, speeds, and power settings

The Multi-Condition Challenge

With 6 operating conditions, sensor readings vary significantly based on flight regime, not just degradation. A model must learn to disentangle condition effects from degradation signatures—this is why FD002 and FD004 are much harder than FD001 and FD003.
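
One common remedy in the C-MAPSS literature (a general preprocessing idea, not necessarily what AMNL does) is condition-wise normalization: recover the discrete regime by rounding the three operational settings, then z-score each sensor within its regime. A minimal pandas sketch, assuming the standard `setting_*`/`sensor_*` column names:

```python
import pandas as pd

def normalize_per_condition(df, setting_cols, sensor_cols, decimals=2):
    """Z-score each sensor within its operating condition.

    The discrete flight regimes (6 for FD002/FD004) can be recovered by
    rounding the operational settings and grouping on the rounded tuple.
    """
    out = df.copy()
    # Build a regime key like "10.0_0.25_60.0" for each row
    cond = out[setting_cols].round(decimals).astype(str).apply("_".join, axis=1)
    mean = out.groupby(cond)[sensor_cols].transform("mean")
    std = out.groupby(cond)[sensor_cols].transform("std").fillna(1.0) + 1e-8
    out[sensor_cols] = (out[sensor_cols] - mean) / std
    return out
```

After this step, a sensor's value reflects its deviation from the regime baseline, so degradation trends are no longer masked by regime switches.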

Fault Modes

  • FD001/FD002: Single fault mode (HPC degradation)—all engines fail the same way
  • FD003/FD004: Two fault modes (HPC and Fan degradation)—engines can fail in different ways with different signatures

Data Format and Features

Each data file is a space-separated text file with one row per engine-cycle observation:

Column Structure

| Column | Name | Description |
|---|---|---|
| 1 | unit | Engine unit ID (integer) |
| 2 | cycle | Operational cycle number (integer) |
| 3 | setting_1 | Altitude (operational setting) |
| 4 | setting_2 | Mach number (operational setting) |
| 5 | setting_3 | Throttle resolver angle (operational setting) |
| 6-26 | sensor_1 to sensor_21 | 21 sensor measurements |

Sample Data Row

```text
unit  cycle  set1      set2      set3    s1      s2      s3       ...  s21
1     1     -0.0007  -0.0004   100.0   518.67  641.82  1589.70   ...  8138.62
1     2      0.0019  -0.0003   100.0   518.67  642.15  1591.82   ...  8131.49
1     3     -0.0043   0.0003   100.0   518.67  642.35  1587.99   ...  8133.23
...
1     192   -0.0019  -0.0002   100.0   518.67  641.71  1588.45   ...  8129.23
2     1      0.0007   0.0000   100.0   518.67  642.42  1592.14   ...  8132.45
```
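
A file in this format can be read with pandas; a minimal sketch (the path is illustrative):

```python
import pandas as pd

# unit, cycle, 3 operational settings, 21 sensors -> 26 columns
COLUMNS = (["unit", "cycle"]
           + [f"setting_{i}" for i in range(1, 4)]
           + [f"sensor_{i}" for i in range(1, 22)])

def load_cmapss(path_or_buffer):
    """Read a space-separated C-MAPSS trajectory file into a named DataFrame."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMNS)

# Usage (path illustrative):
# df = load_cmapss("data/raw/train_FD001.txt")
```

The regex separator `r"\s+"` absorbs the trailing whitespace present in the original files, which would otherwise produce spurious empty columns.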

Sensor Descriptions

The 21 sensors measure various physical quantities throughout the engine:

| Sensor | Description | Unit | Typical Range (FD001) |
|---|---|---|---|
| sensor_1 | Total temperature at fan inlet | °R | 518.67 (constant) |
| sensor_2 | Total temperature at LPC outlet | °R | ~642 |
| sensor_3 | Total temperature at HPC outlet | °R | ~1580-1605 |
| sensor_4 | Total temperature at LPT outlet | °R | ~1380-1440 |
| sensor_5 | Pressure at fan inlet | psia | 14.62 (constant) |
| sensor_6 | Total pressure in bypass duct | psia | ~21.6 |
| sensor_7 | Total pressure at HPC outlet | psia | ~550-556 |
| sensor_8 | Physical fan speed | rpm | ~2388 |
| sensor_9 | Physical core speed | rpm | ~9050-9230 |
| sensor_10 | Engine pressure ratio | - | 1.30 (constant) |
| sensor_11 | Static pressure at HPC outlet | psia | ~47-48 |
| sensor_12 | Ratio of fuel flow to Ps30 | pps/psi | ~519-523 |
| sensor_13 | Corrected fan speed | rpm | ~2388 |
| sensor_14 | Corrected core speed | rpm | ~8110-8150 |
| sensor_15 | Bypass ratio | - | ~8.3-8.6 |
| sensor_16 | Burner fuel-air ratio | - | 0.03 (constant) |
| sensor_17 | Bleed enthalpy | - | ~390-400 |
| sensor_18 | Demanded fan speed | rpm | 2388 (constant) |
| sensor_19 | Demanded corrected fan speed | rpm | 100 (constant) |
| sensor_20 | HPT coolant bleed | lbm/s | ~38.5-39.1 |
| sensor_21 | LPT coolant bleed | lbm/s | ~23.1-23.6 |

Not All Sensors Are Informative

Several sensors (1, 5, 6, 10, 16, 18, 19) are constant or nearly constant throughout engine operation and provide no information about degradation. Using all 21 sensors without selection can actually hurt model performance.
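
These flat channels are easy to detect programmatically. A minimal sketch that keeps only sensors with non-negligible variance (the threshold is an illustrative assumption, not taken from AMNL's code):

```python
import pandas as pd

def informative_sensors(df, sensor_cols, min_std=1e-4):
    """Return the sensors whose standard deviation exceeds a small threshold.

    Constant channels (e.g. sensor_1, sensor_18, sensor_19) fall below it
    and are dropped from the feature set.
    """
    stds = df[sensor_cols].std()
    return [c for c in sensor_cols if stds[c] > min_std]
```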

Feature Selection

Following established practice in the literature, we select 14 informative sensors plus the 3 operational settings, yielding 17 features total:

Selected Features

| Category | Features | Rationale |
|---|---|---|
| Operational Settings | setting_1, setting_2, setting_3 | Define the operating condition for normalization |
| Temperature Sensors | sensor_2, sensor_3, sensor_4 | Temperature rises indicate degradation |
| Speed Sensors | sensor_8, sensor_9, sensor_13, sensor_14 | Speed variations indicate efficiency loss |
| Pressure Sensors | sensor_7, sensor_11, sensor_15 | Pressures and ratios reflect compressor health |
| Flow and Bleed | sensor_12, sensor_17, sensor_20, sensor_21 | Fuel flow and bleed signals indicate thermal stress |

Excluded Sensors

Seven sensors are excluded due to constant or near-constant values:

  • sensor_1: Total temperature at fan inlet (constant at 518.67 °R)
  • sensor_5: Pressure at fan inlet (constant at 14.62 psia)
  • sensor_6: Total pressure in bypass duct (nearly constant)
  • sensor_10: Engine pressure ratio (constant at 1.30)
  • sensor_16: Burner fuel-air ratio (constant at 0.03)
  • sensor_18, sensor_19: Demanded fan speeds (constant)

Implementation Detail

In our code, feature selection is implemented in the EnhancedNASACMAPSSDataset class with the enforce_feature_set=True parameter, which automatically selects the 17 informative features.

Train-Test Split Protocol

The C-MAPSS benchmark uses a specific evaluation protocol that differs from typical machine learning train-test splits:

Training Data

  • Engines run from initial operation until failure
  • Complete degradation trajectories available
  • RUL can be calculated as $\text{RUL}(t) = \min(125, T_{\text{max}} - t)$, where $T_{\text{max}}$ is the unit's failure cycle
  • Files: train_FD001.txt, train_FD002.txt, etc.
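
The capped target above can be attached per engine with a groupby; a sketch assuming the standard `unit`/`cycle` column names:

```python
import pandas as pd

def add_rul(df, cap=125):
    """Attach the piece-wise linear RUL target min(cap, T_max - t) per unit."""
    t_max = df.groupby("unit")["cycle"].transform("max")  # failure cycle per engine
    out = df.copy()
    out["RUL"] = (t_max - out["cycle"]).clip(upper=cap)
    return out
```

The cap reflects the usual assumption that degradation is unobservable early in life, so RUL is held constant at 125 until the engine nears failure.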

Test Data

  • Engines run for some time but do NOT reach failure
  • Trajectories are truncated at random points
  • Ground-truth RUL provided in separate files
  • Files: test_FD001.txt + RUL_FD001.txt

File Structure

```text
data/raw/
├── train_FD001.txt     # Training trajectories (100 engines)
├── test_FD001.txt      # Test trajectories (100 engines, truncated)
├── RUL_FD001.txt       # Ground truth RUL for test set (100 values)
├── train_FD002.txt     # Training (260 engines)
├── test_FD002.txt      # Test (259 engines, truncated)
├── RUL_FD002.txt       # Ground truth (259 values)
├── train_FD003.txt     # Training (100 engines)
├── test_FD003.txt      # Test (100 engines, truncated)
├── RUL_FD003.txt       # Ground truth (100 values)
├── train_FD004.txt     # Training (249 engines)
├── test_FD004.txt      # Test (248 engines, truncated)
└── RUL_FD004.txt       # Ground truth (248 values)
```
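
A minimal sketch for reading a ground-truth file, which holds one RUL value per line in test-unit order (the 1-based unit indexing mirrors the trajectory files):

```python
import pandas as pd

def load_ground_truth(path_or_buffer):
    """Read RUL_FDxxx.txt: one RUL value per test engine, in unit order."""
    rul = pd.read_csv(path_or_buffer, header=None, names=["RUL"])
    rul.index = rul.index + 1          # unit IDs are 1-based
    rul.index.name = "unit"
    return rul["RUL"]
```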

Evaluation Protocol

The standard evaluation protocol computes metrics on the last cycle of each test engine:

  1. For each test engine, extract all cycles
  2. Run the model on each cycle (or sliding windows)
  3. Compare the prediction at the last cycle to the ground-truth RUL
  4. Compute RMSE and NASA Score across all test engines
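
Both metrics can be sketched directly; the asymmetric NASA scoring function (from the PHM08 challenge that introduced the dataset) penalizes late predictions, which are operationally dangerous, more heavily than early ones:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over the test engines."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sqrt(np.mean(d ** 2)))

def nasa_score(y_true, y_pred):
    """Asymmetric exponential penalty: early (d < 0) uses a1 = 13,
    late (d > 0) uses a2 = 10, so overestimating RUL costs more."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0), np.exp(d / 10.0)) - 1.0))
```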

Why Last-Cycle Evaluation?

The ground-truth RUL_FDxxx.txt files provide the RUL at the final cycle of each test trajectory. This simulates the real-world scenario: "Given sensor data up to now, how much life remains?" The model must predict accurately at the truncation point.
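
Extracting the truncation-point row of each engine is a small groupby; a sketch assuming the `unit`/`cycle` naming:

```python
import pandas as pd

def last_cycle_rows(df):
    """Keep only the final observed cycle of each test engine; predictions at
    these rows are the ones compared against RUL_FDxxx.txt."""
    idx = df.groupby("unit")["cycle"].idxmax()
    return df.loc[idx].sort_values("unit").reset_index(drop=True)
```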

Why C-MAPSS Matters for Research

C-MAPSS has become the standard benchmark for several important reasons:

1. Controlled Complexity Progression

The four sub-datasets provide a natural difficulty progression:

  • Test on FD001 first to verify basic algorithm correctness
  • Move to FD003 to test multi-modal learning
  • Challenge with FD002 for multi-condition robustness
  • Ultimate test with FD004 for full complexity

2. Known Ground Truth

Unlike real-world data where failure time is often uncertain, C-MAPSS provides exact failure times, enabling precise evaluation of prediction accuracy.

3. Extensive Benchmarking

Hundreds of papers have reported results on C-MAPSS, enabling direct comparison of methods. Published RMSE values range from ~11 (SOTA transformer methods) to ~20+ (classical approaches).

4. Realistic Challenges

C-MAPSS captures key real-world difficulties:

  • Multiple operating conditions (flight regimes)
  • Multiple failure modes
  • Noisy sensor measurements
  • Varying trajectory lengths
  • Class imbalance (most data is healthy operation)

AMNL Performance on C-MAPSS

Our AMNL model achieves state-of-the-art performance on 3 of 4 datasets with statistical significance, including dramatic improvements on the challenging multi-condition datasets:

Main Results (RMSE, 5-seed average)

| Dataset | AMNL (Ours) | DKAMFormer | Previous SOTA | Improvement |
|---|---|---|---|---|
| FD001 | 10.43 ± 1.94 | 10.68 | 11.49 | +9.2% |
| FD002 | 6.74 ± 0.91 | 10.70 | 19.77 | +65.9% |
| FD003 | 9.51 ± 1.74 | 10.52 | 11.71 | +18.8% |
| FD004 | 8.16 ± 2.17 | 12.89 | 20.67 | +60.5% |

Best Individual Results

| Dataset | Best RMSE | Seed | vs SOTA |
|---|---|---|---|
| FD001 | 8.69 | 123 | +24.4% |
| FD002 | 6.19 | 123 | +68.7% |
| FD003 | 8.05 | 42 | +31.2% |
| FD004 | 6.17 | 123 | +70.2% |

Key Finding: AMNL achieves 65.9% improvement on FD002 and 60.5% improvement on FD004—the complex multi-condition datasets where previous methods struggled. This demonstrates exceptional generalization across diverse operating conditions.

Statistical Significance

| Dataset | p-value | Significance |
|---|---|---|
| FD001 | 0.1439 | Not significant |
| FD002 | <0.0001 | Highly significant (p < 0.001) |
| FD003 | 0.0234 | Significant (p < 0.05) |
| FD004 | 0.0001 | Highly significant (p < 0.001) |

The improvements on FD002, FD003, and FD004 are statistically significant, meaning they are unlikely to be due to random chance.

Why AMNL Excels on Multi-Condition Data

The dramatic improvements on FD002 and FD004 are directly attributable to the equal weighting in AMNL:

  1. The health classification task forces the model to learn condition-invariant representations
  2. Health states (Normal, Early Degradation, Critical) are defined by RUL thresholds, not operating conditions
  3. Equal task weighting (0.5/0.5) ensures the model cannot "cheat" by ignoring the classification objective
  4. This regularization effect is strongest when conditions are heterogeneous—exactly the FD002/FD004 setting
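
The multi-task recipe above can be sketched as follows; the RUL cut-offs and the loss composition are illustrative assumptions, since the section states only that classes come from RUL thresholds and that tasks are weighted 0.5/0.5:

```python
import numpy as np

def health_state(rul, t_normal=100, t_critical=30):
    """3-class label from RUL: 0 = Normal, 1 = Early Degradation, 2 = Critical.

    Threshold values here are illustrative, not AMNL's published cut-offs.
    """
    rul = np.asarray(rul)
    return np.where(rul > t_normal, 0, np.where(rul > t_critical, 1, 2))

def combined_loss(regression_loss, classification_loss, w=0.5):
    """Equal 0.5/0.5 weighting of the RUL regression and health
    classification objectives, as described in the text."""
    return w * regression_loss + (1.0 - w) * classification_loss
```

Because the labels depend only on RUL, an engine at the same degradation stage gets the same class regardless of its flight regime, which is what pushes the shared representation toward condition invariance.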

Summary

In this section, we have covered the NASA C-MAPSS benchmark dataset:

  1. C-MAPSS simulates turbofan engine degradation with realistic sensor noise and operating condition variability
  2. Four sub-datasets provide controlled complexity: FD001 (easy), FD002 (hard), FD003 (medium), FD004 (very hard)
  3. Multi-condition datasets (FD002, FD004) are particularly challenging due to 6 different operating regimes
  4. 17 features are selected from the original 24 (3 settings + 14 informative sensors)
  5. Evaluation protocol uses last-cycle predictions compared against ground-truth RUL files
  6. AMNL achieves SOTA with 65.9% improvement on FD002 and 60.5% on FD004

| Dataset Statistics | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|
| Operating Conditions | 1 | 6 | 1 | 6 |
| Fault Modes | 1 | 1 | 2 | 2 |
| Train Engines | 100 | 260 | 100 | 249 |
| Test Engines | 100 | 259 | 100 | 248 |
| AMNL RMSE | 10.43 | 6.74 | 9.51 | 8.16 |
| Improvement vs SOTA | +9.2% | +65.9% | +18.8% | +60.5% |

Looking Ahead: In the next section, we will survey existing state-of-the-art methods for RUL prediction and understand their limitations—setting the stage for why AMNL's novel approach was necessary.

With a solid understanding of the benchmark dataset, we are ready to explore what existing methods have achieved and where they fall short.