Learning Objectives
By the end of this section, you will:
- Understand the C-MAPSS simulation and why it serves as the gold standard benchmark for RUL prediction
- Know the physical meaning of each sensor and operating condition in the dataset
- Navigate the dataset files and understand the train/test split structure
- Interpret the 26-column data format and identify which columns contain predictive information
- Appreciate the run-to-failure paradigm that makes this dataset unique for predictive maintenance research
Why This Matters: The NASA C-MAPSS dataset is the most widely used benchmark for RUL prediction research. Understanding its structure, physics, and quirks is essential for reproducing results, comparing methods fairly, and knowing what our model is actually learning. This section turns raw data files into deep understanding.
C-MAPSS Overview
C-MAPSS stands for Commercial Modular Aero-Propulsion System Simulation. It was developed by NASA's Glenn Research Center to simulate realistic turbofan engine degradation.
Why Simulated Data?
Real engine failure data is extremely rare and expensive to obtain:
- Safety: We cannot run engines to failure in commercial aircraft
- Cost: Each engine costs millions of dollars
- Time: Engine lifetimes span thousands of flight hours
- Variability: Real failures have complex, uncontrolled conditions
C-MAPSS provides controlled degradation simulations where we know the exact failure point, enabling supervised learning with ground-truth RUL labels.
The Turbofan Engine Model
C-MAPSS simulates a large commercial turbofan engine with the following components:
| Component | Abbreviation | Function |
|---|---|---|
| Fan | Fan | Draws air into engine, provides thrust |
| Low Pressure Compressor | LPC | First stage compression |
| High Pressure Compressor | HPC | Second stage compression |
| Combustor | Combustor | Fuel combustion, energy release |
| High Pressure Turbine | HPT | Drives HPC via shaft |
| Low Pressure Turbine | LPT | Drives Fan and LPC via shaft |
| Nozzle | Nozzle | Exhaust, thrust generation |
The simulation models thermodynamic cycles, mechanical wear, and performance degradation as the engine accumulates operational cycles.
Dataset History
The C-MAPSS dataset was released by Saxena et al. in 2008 at the International Conference on Prognostics and Health Management. It has since become the de facto standard for benchmarking RUL prediction algorithms, with hundreds of papers using it for evaluation.
The Simulation Physics
Understanding the physics helps interpret what our model learns. The simulation introduces degradation through two mechanisms:
1. Component Efficiency Degradation
As the engine operates, component efficiencies decrease:
$$\eta(t) = \eta_0 \cdot \big(1 - d_\eta(t)\big)$$

where $\eta_0$ is the initial efficiency and $d_\eta(t)$ is the degradation accumulated over time.
2. Flow Capacity Degradation
The ability to pass air through components decreases analogously:

$$\dot{m}(t) = \dot{m}_0 \cdot \big(1 - d_m(t)\big)$$

where $\dot{m}_0$ is the initial flow capacity and $d_m(t)$ is the accumulated flow-capacity degradation.
Degradation Modes
C-MAPSS simulates degradation in specific components. The fault modes affect:
| Fault Mode | Affected Components | Physical Effect |
|---|---|---|
| HPC Degradation | High Pressure Compressor | Blade erosion, tip clearance |
| Fan Degradation | Fan | Foreign object damage, wear |
These degradation modes manifest as measurable changes in sensor readings—exactly what our model learns to detect.
Run-to-Failure Paradigm
Each simulated engine starts healthy and runs until failure:
At the final cycle $T_{\text{EOL}}$ (end of life), the engine has failed. The RUL at any cycle $t$ is simply:

$$\text{RUL}(t) = T_{\text{EOL}} - t$$
Ground Truth Labels
This is the key advantage of simulated data: we know the failure cycle exactly, giving us perfect RUL labels for supervised learning. In real-world applications, we never know in advance when failure will occur; that is the prediction problem!
Dataset File Structure
The C-MAPSS dataset consists of four sub-datasets (FD001-FD004), each with training and test files.
File Types
| File | Description | Use |
|---|---|---|
| train_FD00X.txt | Complete run-to-failure trajectories | Training data |
| test_FD00X.txt | Truncated trajectories (unknown failure point) | Test inputs |
| RUL_FD00X.txt | Ground truth RUL for test set | Evaluation only |
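Because all three file types are headerless and space-delimited, they can be read with one small helper. A minimal sketch with pandas; the column names and the `CMAPSSData/` directory path are conventions assumed here, not part of the dataset itself:

```python
import io
import pandas as pd

# Column names: unit id, cycle, 3 operating settings, 21 sensors (s1..s21).
# These names are a convention used in this chapter; the raw files have no header.
COLS = ["unit", "cycle", "op1", "op2", "op3"] + [f"s{i}" for i in range(1, 22)]

def load_cmapss(path_or_buffer):
    """Read one headerless, space-delimited C-MAPSS file into a labeled DataFrame."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLS)

# Real usage would point at the downloaded files (paths are assumptions):
# train = load_cmapss("CMAPSSData/train_FD001.txt")
# test  = load_cmapss("CMAPSSData/test_FD001.txt")
# rul   = pd.read_csv("CMAPSSData/RUL_FD001.txt", header=None, names=["RUL"])

# Demo on an in-memory row with the correct 26-column shape:
demo = load_cmapss(io.StringIO("1 1 -0.0007 -0.0004 100.0 " + " ".join(["0.5"] * 21)))
print(demo.shape)  # (1, 26)
```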
Training Data Structure
Training files contain complete trajectories. Each engine runs from cycle 1 until failure:
```
Engine 1: Cycle 1, 2, 3, ..., 192 (failed at 192)
Engine 2: Cycle 1, 2, 3, ..., 287 (failed at 287)
Engine 3: Cycle 1, 2, 3, ..., 145 (failed at 145)
...
```

The last cycle of each engine is the failure point. RUL labels are computed by counting backwards from failure.
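Counting backwards from each engine's final cycle takes only a few lines. A sketch assuming the `unit`/`cycle` column naming convention used in this section:

```python
import pandas as pd

def add_rul_labels(train: pd.DataFrame) -> pd.DataFrame:
    """Label each row with RUL = (engine's final cycle) - (current cycle)."""
    last = train.groupby("unit")["cycle"].transform("max")
    out = train.copy()
    out["RUL"] = last - out["cycle"]
    return out

# Toy trajectory: one engine that fails at cycle 3
toy = pd.DataFrame({"unit": [1, 1, 1], "cycle": [1, 2, 3]})
print(add_rul_labels(toy)["RUL"].tolist())  # [2, 1, 0]
```

The `transform("max")` broadcasts each engine's failure cycle back onto every one of its rows, so the subtraction works engine-by-engine without an explicit loop.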
Test Data Structure
Test files contain truncated trajectories. We observe some cycles but not the failure:
```
Engine 1: Cycle 1, 2, 3, ..., 31 (truncated, failure unknown)
Engine 2: Cycle 1, 2, 3, ..., 49 (truncated, failure unknown)
Engine 3: Cycle 1, 2, 3, ..., 126 (truncated, failure unknown)
...
```

The task is to predict the RUL at the final observed cycle. Ground truth is provided separately in RUL_FD00X.txt.
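Since predictions are scored only at each test engine's last observed cycle, evaluation starts by selecting exactly those rows. A minimal sketch, again assuming the `unit`/`cycle` column convention (the raw files have no header):

```python
import pandas as pd

def last_cycle_rows(test: pd.DataFrame) -> pd.DataFrame:
    """Select each engine's final observed cycle -- the row at which RUL
    must be predicted and compared against RUL_FD00X.txt."""
    idx = test.groupby("unit")["cycle"].idxmax()
    return test.loc[idx].sort_values("unit").reset_index(drop=True)

# Toy example: two truncated engines
toy = pd.DataFrame({"unit": [1, 1, 2, 2, 2], "cycle": [1, 2, 1, 2, 3]})
print(last_cycle_rows(toy)["cycle"].tolist())  # [2, 3]
```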
Dataset Statistics
| Dataset | Train Engines | Test Engines | Operating Conditions | Fault Modes |
|---|---|---|---|---|
| FD001 | 100 | 100 | 1 (Sea Level) | 1 (HPC) |
| FD002 | 260 | 259 | 6 (Various) | 1 (HPC) |
| FD003 | 100 | 100 | 1 (Sea Level) | 2 (HPC + Fan) |
| FD004 | 249 | 248 | 6 (Various) | 2 (HPC + Fan) |
Increasing Complexity
FD001 is the simplest (single condition, single fault). FD004 is the most challenging (multiple conditions, multiple faults). Methods that work on FD001 may fail on FD004—our AMNL model achieves state-of-the-art on ALL four datasets.
Sensor Readings Explained
Each cycle produces readings from 21 sensors placed throughout the engine. Understanding these sensors helps interpret model behavior.
Sensor Categories
Sensors measure different physical quantities:
| Category | Sensors | Physical Meaning |
|---|---|---|
| Temperature | T2, T24, T30, T50 | Gas temperatures at various stages |
| Pressure | P2, P15, P30, Ps30 | Static and total pressures |
| Speed | Nf, Nc, NRf, NRc | Fan and core shaft speeds |
| Flow | phi, BPR, htBleed, W31, W32 | Air flow rates and ratios |
| Other | farB, Nf_dmd, PCNfR_dmd, epr | Fuel ratio, demanded values |
Complete Sensor List
| Index | Symbol | Description | Unit |
|---|---|---|---|
| 1 | T2 | Total temperature at fan inlet | °R |
| 2 | T24 | Total temperature at LPC outlet | °R |
| 3 | T30 | Total temperature at HPC outlet | °R |
| 4 | T50 | Total temperature at LPT outlet | °R |
| 5 | P2 | Pressure at fan inlet | psia |
| 6 | P15 | Total pressure in bypass duct | psia |
| 7 | P30 | Total pressure at HPC outlet | psia |
| 8 | Nf | Physical fan speed | rpm |
| 9 | Nc | Physical core speed | rpm |
| 10 | epr | Engine pressure ratio (P50/P2) | — |
| 11 | Ps30 | Static pressure at HPC outlet | psia |
| 12 | phi | Ratio of fuel flow to Ps30 | pps/psi |
| 13 | NRf | Corrected fan speed | rpm |
| 14 | NRc | Corrected core speed | rpm |
| 15 | BPR | Bypass ratio | — |
| 16 | farB | Burner fuel-air ratio | — |
| 17 | htBleed | Bleed enthalpy | — |
| 18 | Nf_dmd | Demanded fan speed | rpm |
| 19 | PCNfR_dmd | Demanded corrected fan speed | % |
| 20 | W31 | HPT coolant bleed | lbm/s |
| 21 | W32 | LPT coolant bleed | lbm/s |
Informative vs Non-Informative Sensors
Not all sensors are equally useful for RUL prediction:
- Constant sensors: Some sensors remain nearly constant throughout the engine life (e.g., T2, P2 at sea level)
- Noisy sensors: Some have high noise with little degradation signal
- Informative sensors: Show clear degradation trends (e.g., T24, T30, T50, P30)
We will analyze which sensors to keep in Section 3 (Feature Selection).
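As a quick screening pass before that analysis, the near-constant sensors described above can be flagged with a standard-deviation check. This is only a heuristic sketch (the threshold `tol` is an assumption), not the feature-selection procedure itself:

```python
import pandas as pd

def near_constant_sensors(df: pd.DataFrame, sensor_cols, tol: float = 1e-6):
    """Return the sensor columns whose standard deviation over the whole
    dataset is (near) zero -- candidates for removal."""
    stds = df[sensor_cols].std()
    return stds[stds <= tol].index.tolist()

# Toy example: s1 is flat, s2 trends
df = pd.DataFrame({"s1": [1.0, 1.0, 1.0], "s2": [1.0, 2.0, 3.0]})
print(near_constant_sensors(df, ["s1", "s2"]))  # ['s1']
```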
Data Format and Dimensions
Each data file is a space-delimited text file with 26 columns and no header row.
Column Layout
| Columns | Content | Count |
|---|---|---|
| 1 | Engine unit number | 1 |
| 2 | Time (in cycles) | 1 |
| 3-5 | Operating conditions (altitude, Mach, TRA) | 3 |
| 6-26 | Sensor measurements | 21 |
Sample Data Row
```
1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.02 9046.19 1.30 47.47 521.66 2388.01 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
```

Parsing this row:
- Engine 1, Cycle 1: First two values
- Operating conditions: altitude ≈ 0, Mach ≈ 0, TRA = 100 (sea level, full throttle)
- 21 sensor values: Remaining columns
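As a sanity check, the sample row can be split with plain Python; the slice indices below follow the column layout table directly:

```python
# Sample row shown above (engine 1, cycle 1)
row = ("1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 "
       "554.36 2388.02 9046.19 1.30 47.47 521.66 2388.01 8138.62 8.4195 0.03 "
       "392 2388 100.0 39.06 23.4190")

values = [float(v) for v in row.split()]
unit, cycle = int(values[0]), int(values[1])  # columns 1-2: engine id, cycle
op_settings = values[2:5]                     # columns 3-5: operating conditions
sensors = values[5:]                          # columns 6-26: 21 sensor readings

print(unit, cycle, len(sensors))  # 1 1 21
```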
Tensor Representation
For a single engine with $T$ cycles, the data forms a matrix:

$$X \in \mathbb{R}^{T \times 26}$$

For the entire training set with $N$ engines, the data is a collection of variable-length matrices:

$$\mathcal{D} = \{X^{(i)}\}_{i=1}^{N}, \quad X^{(i)} \in \mathbb{R}^{T_i \times 26}$$

where $T_i$ is the number of cycles for engine $i$.
RUL Label Computation
For training data, RUL at each cycle $t$ is computed as:

$$\text{RUL}(t) = T_{\text{EOL}} - t$$

where $T_{\text{EOL}}$ is the final cycle (failure point) for that engine.
```
# Example: Engine with 192 cycles
# Cycle 1:   RUL = 192 - 1   = 191
# Cycle 2:   RUL = 192 - 2   = 190
# ...
# Cycle 191: RUL = 192 - 191 = 1
# Cycle 192: RUL = 192 - 192 = 0 (failure)
```

Piecewise Linear Clipping
In practice, we clip RUL values at a maximum (e.g., 125 cycles). Early-life degradation is imperceptible, so predicting RUL = 191 vs RUL = 200 is meaningless. We will formalize this in Section 4.
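The clipping itself is one line with NumPy; the cap of 125 cycles is the common choice quoted above, and Section 4 discusses how to pick it:

```python
import numpy as np

def clip_rul(rul, max_rul: int = 125):
    """Piecewise-linear RUL target: cap early-life values at max_rul."""
    return np.minimum(np.asarray(rul), max_rul)

print(clip_rul([191, 130, 125, 60, 0]).tolist())  # [125, 125, 125, 60, 0]
```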
Summary
In this section, we introduced the NASA C-MAPSS dataset—the benchmark for RUL prediction research:
- C-MAPSS is a turbofan engine simulation developed by NASA Glenn Research Center
- Run-to-failure: Each engine starts healthy and runs until failure, providing ground-truth RUL labels
- Four sub-datasets: FD001-FD004 with varying operating conditions (1 or 6) and fault modes (1 or 2)
- 26 columns: Engine ID, cycle, 3 operating conditions, 21 sensor readings
- 21 sensors: Measure temperatures, pressures, speeds, and flow rates throughout the engine
- Train/Test split: Training has complete trajectories; test has truncated trajectories with separate ground truth
| Aspect | Value | Note |
|---|---|---|
| Total sensors | 21 | Not all informative |
| Operating conditions | 3 | Altitude, Mach, TRA |
| FD001 engines | 100 train / 100 test | Simplest case |
| FD004 engines | 249 train / 248 test | Most challenging |
| Data format | Space-delimited text | No headers |
Looking Ahead: Now that we understand the dataset structure, we need to understand the differences between FD001-FD004. Why is FD004 harder than FD001? What are the 6 operating conditions? What are the 2 fault modes? The next section answers these questions, revealing why dataset-specific strategies are essential for state-of-the-art performance.
With the dataset structure understood, we are ready to explore the specific challenges of each sub-dataset.