Chapter 1

NASA C-MAPSS Benchmark Overview

Introduction to Predictive Maintenance

Learning Objectives

By the end of this section, you will:

  1. Understand the C-MAPSS simulation and why it became the standard benchmark for RUL prediction
  2. Know the four sub-datasets (FD001-FD004) and their complexity characteristics
  3. Learn the data format: operational settings, sensor measurements, and their physical meanings
  4. Understand feature selection: which of the 21 sensors provide useful information
  5. Master the evaluation protocol: train/test splits and ground-truth RUL files
  6. See the performance benchmark: where AMNL stands relative to state-of-the-art methods
Why This Matters: The NASA C-MAPSS dataset is the de facto benchmark for evaluating predictive maintenance algorithms. Understanding its structure, challenges, and evaluation protocol is essential for interpreting research results and designing your own experiments.

What is C-MAPSS?

C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) is a sophisticated simulation tool developed by NASA for modeling turbofan engine degradation. The dataset generated from this simulator has become the gold standard for benchmarking RUL prediction algorithms.

The Turbofan Engine

A turbofan engine is the most common type of jet engine used in commercial aircraft. It consists of several key components:

  • Fan: Large front fan that draws in air and provides most of the thrust
  • Low-Pressure Compressor (LPC): Compresses air before the core
  • High-Pressure Compressor (HPC): Further compresses air to high pressure
  • Combustor: Burns fuel with compressed air
  • High-Pressure Turbine (HPT): Extracts energy to drive the HPC
  • Low-Pressure Turbine (LPT): Extracts energy to drive the fan and LPC

Simulated Degradation

The C-MAPSS simulation injects fault modes that cause progressive degradation over time:

| Fault Mode | Affected Components | Effect |
|---|---|---|
| HPC Degradation | High-Pressure Compressor | Reduced efficiency, increased temperature |
| Fan Degradation | Fan assembly | Reduced thrust, vibration increase |

These faults develop gradually over hundreds of operational cycles, mimicking real-world wear patterns. The simulator records sensor measurements throughout the degradation process until a failure threshold is reached.

Why Simulation Data?

Real run-to-failure data is extremely rare and expensive to obtain—no airline intentionally runs engines until they fail! Simulation provides controlled experiments with known failure times, enabling proper evaluation of prediction algorithms.

The Four Sub-Datasets

C-MAPSS provides four sub-datasets with varying levels of complexity, enabling systematic evaluation of algorithm robustness:

| Dataset | Operating Conditions | Fault Modes | Train Units | Test Units | Difficulty |
|---|---|---|---|---|---|
| FD001 | 1 | 1 (HPC) | 100 | 100 | Easy |
| FD002 | 6 | 1 (HPC) | 260 | 259 | Hard |
| FD003 | 1 | 2 (HPC + Fan) | 100 | 100 | Medium |
| FD004 | 6 | 2 (HPC + Fan) | 249 | 248 | Very Hard |

Understanding Dataset Complexity

Operating Conditions

Operating conditions represent different flight regimes—combinations of altitude, Mach number, and throttle settings. Each condition produces different baseline sensor readings:

  • FD001/FD003: Single operating condition—all engines operate under identical flight regime
  • FD002/FD004: Six operating conditions—engines experience varying altitudes, speeds, and power settings

The Multi-Condition Challenge

With 6 operating conditions, sensor readings vary significantly based on flight regime, not just degradation. A model must learn to disentangle condition effects from degradation signatures—this is why FD002 and FD004 are much harder than FD001 and FD003.
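
One common remedy in the C-MAPSS literature (a general preprocessing idea, not necessarily what AMNL does) is condition-wise normalization: recover the discrete regime by rounding the three operational settings, then z-score each sensor within its regime. A minimal pandas sketch, assuming the standard `setting_*`/`sensor_*` column names:

```python
import pandas as pd

def normalize_per_condition(df, setting_cols, sensor_cols, decimals=2):
    """Z-score each sensor within its operating condition.

    The discrete flight regimes (6 for FD002/FD004) can be recovered by
    rounding the operational settings and grouping on the rounded tuple.
    """
    out = df.copy()
    # Build a regime key like "10.0_0.25_60.0" for each row
    cond = out[setting_cols].round(decimals).astype(str).apply("_".join, axis=1)
    mean = out.groupby(cond)[sensor_cols].transform("mean")
    std = out.groupby(cond)[sensor_cols].transform("std").fillna(1.0) + 1e-8
    out[sensor_cols] = (out[sensor_cols] - mean) / std
    return out
```

After this step, a sensor's value reflects its deviation from the regime baseline, so degradation trends are no longer masked by regime switches.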

Fault Modes

  • FD001/FD002: Single fault mode (HPC degradation)—all engines fail the same way
  • FD003/FD004: Two fault modes (HPC and Fan degradation)—engines can fail in different ways with different signatures

Data Format and Features

Each data file is a space-separated text file with one row per engine-cycle observation:

Column Structure

| Column | Name | Description |
|---|---|---|
| 1 | unit | Engine unit ID (integer) |
| 2 | cycle | Operational cycle number (integer) |
| 3 | setting_1 | Altitude (operational setting) |
| 4 | setting_2 | Mach number (operational setting) |
| 5 | setting_3 | Throttle resolver angle (operational setting) |
| 6-26 | sensor_1 to sensor_21 | 21 sensor measurements |

Sample Data Row

```text
unit  cycle  set1      set2      set3    s1      s2      s3       ...  s21
1     1     -0.0007  -0.0004   100.0   518.67  641.82  1589.70   ...  8138.62
1     2      0.0019  -0.0003   100.0   518.67  642.15  1591.82   ...  8131.49
1     3     -0.0043   0.0003   100.0   518.67  642.35  1587.99   ...  8133.23
...
1     192   -0.0019  -0.0002   100.0   518.67  641.71  1588.45   ...  8129.23
2     1      0.0007   0.0000   100.0   518.67  642.42  1592.14   ...  8132.45
```
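
A file in this format can be read with pandas; a minimal sketch (the path is illustrative):

```python
import pandas as pd

# unit, cycle, 3 operational settings, 21 sensors -> 26 columns
COLUMNS = (["unit", "cycle"]
           + [f"setting_{i}" for i in range(1, 4)]
           + [f"sensor_{i}" for i in range(1, 22)])

def load_cmapss(path_or_buffer):
    """Read a space-separated C-MAPSS trajectory file into a named DataFrame."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMNS)

# Usage (path illustrative):
# df = load_cmapss("data/raw/train_FD001.txt")
```

The regex separator `r"\s+"` absorbs the trailing whitespace present in the original files, which would otherwise produce spurious empty columns.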

Sensor Descriptions

The 21 sensors measure various physical quantities throughout the engine:

| Sensor | Description | Unit | Typical Range (FD001) |
|---|---|---|---|
| sensor_1 | Total temperature at fan inlet | °R | 518.67 (constant) |
| sensor_2 | Total temperature at LPC outlet | °R | ~642 |
| sensor_3 | Total temperature at HPC outlet | °R | ~1580-1605 |
| sensor_4 | Total temperature at LPT outlet | °R | ~1380-1440 |
| sensor_5 | Pressure at fan inlet | psia | 14.62 (constant) |
| sensor_6 | Total pressure in bypass duct | psia | ~21.6 |
| sensor_7 | Total pressure at HPC outlet | psia | ~550-556 |
| sensor_8 | Physical fan speed | rpm | ~2388 |
| sensor_9 | Physical core speed | rpm | ~9050-9230 |
| sensor_10 | Engine pressure ratio | - | 1.30 (constant) |
| sensor_11 | Static pressure at HPC outlet | psia | ~47-48 |
| sensor_12 | Ratio of fuel flow to Ps30 | pps/psi | ~519-523 |
| sensor_13 | Corrected fan speed | rpm | ~2388 |
| sensor_14 | Corrected core speed | rpm | ~8110-8150 |
| sensor_15 | Bypass ratio | - | ~8.3-8.6 |
| sensor_16 | Burner fuel-air ratio | - | 0.03 (constant) |
| sensor_17 | Bleed enthalpy | - | ~390-400 |
| sensor_18 | Demanded fan speed | rpm | 2388 (constant) |
| sensor_19 | Demanded corrected fan speed | rpm | 100 (constant) |
| sensor_20 | HPT coolant bleed | lbm/s | ~38.5-39.1 |
| sensor_21 | LPT coolant bleed | lbm/s | ~23.1-23.6 |

Not All Sensors Are Informative

Several sensors (1, 5, 6, 10, 16, 18, 19) are constant or nearly constant throughout engine operation and provide no information about degradation. Using all 21 sensors without selection can actually hurt model performance.
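
These flat channels are easy to detect programmatically. A minimal sketch that keeps only sensors with non-negligible variance (the threshold is an illustrative assumption, not taken from AMNL's code):

```python
import pandas as pd

def informative_sensors(df, sensor_cols, min_std=1e-4):
    """Return the sensors whose standard deviation exceeds a small threshold.

    Constant channels (e.g. sensor_1, sensor_18, sensor_19) fall below it
    and are dropped from the feature set.
    """
    stds = df[sensor_cols].std()
    return [c for c in sensor_cols if stds[c] > min_std]
```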

Feature Selection

Following established practice in the literature, we select 14 informative sensors plus the 3 operational settings, yielding 17 features total:

Selected Features

| Category | Features | Rationale |
|---|---|---|
| Operational Settings | setting_1, setting_2, setting_3 | Define the operating condition for normalization |
| Temperature Sensors | sensor_2, sensor_3, sensor_4 | Temperature rises indicate degradation |
| Speed Sensors | sensor_8, sensor_9, sensor_13, sensor_14 | Speed variations indicate efficiency loss |
| Pressure Sensors | sensor_7, sensor_11, sensor_15 | Pressures and ratios reflect compressor health |
| Flow and Bleed | sensor_12, sensor_17, sensor_20, sensor_21 | Fuel flow and bleed signals indicate thermal stress |

Excluded Sensors

Seven sensors are excluded due to constant or near-constant values:

  • sensor_1: Total temperature at fan inlet (constant at 518.67 °R)
  • sensor_5: Pressure at fan inlet (constant at 14.62 psia)
  • sensor_6: Total pressure in bypass duct (nearly constant)
  • sensor_10: Engine pressure ratio (constant at 1.30)
  • sensor_16: Burner fuel-air ratio (constant at 0.03)
  • sensor_18, sensor_19: Demanded fan speeds (constant)

Implementation Detail

In our code, feature selection is implemented in the EnhancedNASACMAPSSDataset class with the enforce_feature_set=True parameter, which automatically selects the 17 informative features.

Train-Test Split Protocol

The C-MAPSS benchmark uses a specific evaluation protocol that differs from typical machine learning train-test splits:

Training Data

  • Engines run from initial operation until failure
  • Complete degradation trajectories available
  • RUL can be calculated as $\text{RUL}(t) = \min(125, T_{\text{max}} - t)$, where $T_{\text{max}}$ is the unit's failure cycle
  • Files: train_FD001.txt, train_FD002.txt, etc.
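
The capped target above can be attached per engine with a groupby; a sketch assuming the standard `unit`/`cycle` column names:

```python
import pandas as pd

def add_rul(df, cap=125):
    """Attach the piece-wise linear RUL target min(cap, T_max - t) per unit."""
    t_max = df.groupby("unit")["cycle"].transform("max")  # failure cycle per engine
    out = df.copy()
    out["RUL"] = (t_max - out["cycle"]).clip(upper=cap)
    return out
```

The cap reflects the usual assumption that degradation is unobservable early in life, so RUL is held constant at 125 until the engine nears failure.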

Test Data

  • Engines run for some time but do NOT reach failure
  • Trajectories are truncated at random points
  • Ground-truth RUL provided in separate files
  • Files: test_FD001.txt + RUL_FD001.txt

File Structure

```text
data/raw/
├── train_FD001.txt     # Training trajectories (100 engines)
├── test_FD001.txt      # Test trajectories (100 engines, truncated)
├── RUL_FD001.txt       # Ground truth RUL for test set (100 values)
├── train_FD002.txt     # Training (260 engines)
├── test_FD002.txt      # Test (259 engines, truncated)
├── RUL_FD002.txt       # Ground truth (259 values)
├── train_FD003.txt     # Training (100 engines)
├── test_FD003.txt      # Test (100 engines, truncated)
├── RUL_FD003.txt       # Ground truth (100 values)
├── train_FD004.txt     # Training (249 engines)
├── test_FD004.txt      # Test (248 engines, truncated)
└── RUL_FD004.txt       # Ground truth (248 values)
```
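
A minimal sketch for reading a ground-truth file, which holds one RUL value per line in test-unit order (the 1-based unit indexing mirrors the trajectory files):

```python
import pandas as pd

def load_ground_truth(path_or_buffer):
    """Read RUL_FDxxx.txt: one RUL value per test engine, in unit order."""
    rul = pd.read_csv(path_or_buffer, header=None, names=["RUL"])
    rul.index = rul.index + 1          # unit IDs are 1-based
    rul.index.name = "unit"
    return rul["RUL"]
```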

Evaluation Protocol

The standard evaluation protocol computes metrics on the last cycle of each test engine:

  1. For each test engine, extract all cycles
  2. Run the model on each cycle (or sliding windows)
  3. Compare the prediction at the last cycle to the ground-truth RUL
  4. Compute RMSE and NASA Score across all test engines
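
Both metrics can be sketched directly; the asymmetric NASA scoring function (from the PHM08 challenge that introduced the dataset) penalizes late predictions, which are operationally dangerous, more heavily than early ones:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over the test engines."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sqrt(np.mean(d ** 2)))

def nasa_score(y_true, y_pred):
    """Asymmetric exponential penalty: early (d < 0) uses a1 = 13,
    late (d > 0) uses a2 = 10, so overestimating RUL costs more."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0), np.exp(d / 10.0)) - 1.0))
```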

Why Last-Cycle Evaluation?

The ground-truth RUL_FDxxx.txt files provide the RUL at the final cycle of each test trajectory. This simulates the real-world scenario: "Given sensor data up to now, how much life remains?" The model must predict accurately at the truncation point.
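
Extracting the truncation-point row of each engine is a small groupby; a sketch assuming the `unit`/`cycle` naming:

```python
import pandas as pd

def last_cycle_rows(df):
    """Keep only the final observed cycle of each test engine; predictions at
    these rows are the ones compared against RUL_FDxxx.txt."""
    idx = df.groupby("unit")["cycle"].idxmax()
    return df.loc[idx].sort_values("unit").reset_index(drop=True)
```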

Why C-MAPSS Matters for Research

C-MAPSS has become the standard benchmark for several important reasons:

1. Controlled Complexity Progression

The four sub-datasets provide a natural difficulty progression:

  • Test on FD001 first to verify basic algorithm correctness
  • Move to FD003 to test multi-modal learning
  • Challenge with FD002 for multi-condition robustness
  • Ultimate test with FD004 for full complexity

2. Known Ground Truth

Unlike real-world data where failure time is often uncertain, C-MAPSS provides exact failure times, enabling precise evaluation of prediction accuracy.

3. Extensive Benchmarking

Hundreds of papers have reported results on C-MAPSS, enabling direct comparison of methods. Published RMSE values range from ~11 (SOTA transformer methods) to ~20+ (classical approaches).

4. Realistic Challenges

C-MAPSS captures key real-world difficulties:

  • Multiple operating conditions (flight regimes)
  • Multiple failure modes
  • Noisy sensor measurements
  • Varying trajectory lengths
  • Class imbalance (most data is healthy operation)

AMNL Performance on C-MAPSS

Our AMNL model achieves state-of-the-art performance on 3 of 4 datasets with statistical significance, including dramatic improvements on the challenging multi-condition datasets:

Main Results (RMSE, 5-seed average)

| Dataset | AMNL (Ours) | DKAMFormer | Previous SOTA | Improvement |
|---|---|---|---|---|
| FD001 | 10.43 ± 1.94 | 10.68 | 11.49 | +9.2% |
| FD002 | 6.74 ± 0.91 | 10.70 | 19.77 | +65.9% |
| FD003 | 9.51 ± 1.74 | 10.52 | 11.71 | +18.8% |
| FD004 | 8.16 ± 2.17 | 12.89 | 20.67 | +60.5% |

Best Individual Results

| Dataset | Best RMSE | Seed | vs SOTA |
|---|---|---|---|
| FD001 | 8.69 | 123 | +24.4% |
| FD002 | 6.19 | 123 | +68.7% |
| FD003 | 8.05 | 42 | +31.2% |
| FD004 | 6.17 | 123 | +70.2% |

Key Finding: AMNL achieves 65.9% improvement on FD002 and 60.5% improvement on FD004—the complex multi-condition datasets where previous methods struggled. This demonstrates exceptional generalization across diverse operating conditions.

Statistical Significance

| Dataset | p-value | Significance |
|---|---|---|
| FD001 | 0.1439 | Not significant |
| FD002 | <0.0001 | Highly significant (p < 0.001) |
| FD003 | 0.0234 | Significant (p < 0.05) |
| FD004 | 0.0001 | Highly significant (p < 0.001) |

The improvements on FD002, FD003, and FD004 are statistically significant, meaning they are unlikely to be due to random chance.

Why AMNL Excels on Multi-Condition Data

The dramatic improvements on FD002 and FD004 are directly attributable to the equal weighting in AMNL:

  1. The health classification task forces the model to learn condition-invariant representations
  2. Health states (Normal, Early Degradation, Critical) are defined by RUL thresholds, not operating conditions
  3. Equal task weighting (0.5/0.5) ensures the model cannot "cheat" by ignoring the classification objective
  4. This regularization effect is strongest when conditions are heterogeneous—exactly the FD002/FD004 setting
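
The multi-task recipe above can be sketched as follows; the RUL cut-offs and the loss composition are illustrative assumptions, since the section states only that classes come from RUL thresholds and that tasks are weighted 0.5/0.5:

```python
import numpy as np

def health_state(rul, t_normal=100, t_critical=30):
    """3-class label from RUL: 0 = Normal, 1 = Early Degradation, 2 = Critical.

    Threshold values here are illustrative, not AMNL's published cut-offs.
    """
    rul = np.asarray(rul)
    return np.where(rul > t_normal, 0, np.where(rul > t_critical, 1, 2))

def combined_loss(regression_loss, classification_loss, w=0.5):
    """Equal 0.5/0.5 weighting of the RUL regression and health
    classification objectives, as described in the text."""
    return w * regression_loss + (1.0 - w) * classification_loss
```

Because the labels depend only on RUL, an engine at the same degradation stage gets the same class regardless of its flight regime, which is what pushes the shared representation toward condition invariance.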

Summary

In this section, we have covered the NASA C-MAPSS benchmark dataset:

  1. C-MAPSS simulates turbofan engine degradation with realistic sensor noise and operating condition variability
  2. Four sub-datasets provide controlled complexity: FD001 (easy), FD002 (hard), FD003 (medium), FD004 (very hard)
  3. Multi-condition datasets (FD002, FD004) are particularly challenging due to 6 different operating regimes
  4. 17 features are selected from the original 24 (3 settings + 14 informative sensors)
  5. Evaluation protocol uses last-cycle predictions compared against ground-truth RUL files
  6. AMNL achieves SOTA with 65.9% improvement on FD002 and 60.5% on FD004

| Dataset Statistics | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|
| Operating Conditions | 1 | 6 | 1 | 6 |
| Fault Modes | 1 | 1 | 2 | 2 |
| Train Engines | 100 | 260 | 100 | 249 |
| Test Engines | 100 | 259 | 100 | 248 |
| AMNL RMSE | 10.43 | 6.74 | 9.51 | 8.16 |
| Improvement vs SOTA | +9.2% | +65.9% | +18.8% | +60.5% |

Looking Ahead: In the next section, we will survey existing state-of-the-art methods for RUL prediction and understand their limitations—setting the stage for why AMNL's novel approach was necessary.

With a solid understanding of the benchmark dataset, we are ready to explore what existing methods have achieved and where they fall short.