Chapter 5

C-MAPSS File Structure

NASA Datasets Deep Dive

What's in the Shipping Crate

Open the C-MAPSS download from NASA's Prognostics Center of Excellence. Inside: a 50 MB zip with a README, a Matlab script nobody runs anymore, and twelve plain-text files. That is the entire benchmark that has anchored prognostic research for fifteen years — smaller than a single high-res photograph, simpler than a CSV, and yet rich enough that papers still squeeze new percentage points out of it.

Section 2.1 gave the macro view: four sub-datasets (FD001-4), 26 columns per row, 100-260 engines per subset. This section is the micro view — how to actually load the files, what one engine looks like up close, and the train/test split asymmetry that catches every newcomer.

Section 2.1 vs Section 5.1. §2.1 was the what-it-is overview. §5.1 is the practical “here is the file, here is the loader, here is one engine's full life” deep-dive. Both refer to the same data; the angle differs.

Three Files Per Sub-Dataset

| File | Contents | Format | Used by |
|---|---|---|---|
| train_FD00x.txt | Run-to-failure trajectories for all training engines | 26 cols, no header, space-separated | Training set |
| test_FD00x.txt | Truncated trajectories - engine pulled at random cycle | 26 cols, no header, space-separated | Test inputs |
| RUL_FD00x.txt | Ground-truth RUL at the last test cycle of each test engine | 1 col, integers | Test labels |

For each FD subset you get three files. Train is straightforward. Test is the interesting one — the engine's sensor stream is cut off at a random cycle before failure, and the model has to extrapolate. RUL_FD00x.txt tells you how many cycles the engine actually had left at that cut-off; that is the held-out label. The shape mismatch (test has truncated trajectories; train does not) is what makes evaluation non-trivial.
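The relationship between the test file and the label file can be sketched on a tiny in-memory stand-in. The rows below are made up, the three-column layout is a simplification of the real 26 columns, and `attach_test_labels` is a hypothetical helper name. The sketch also assumes that line k of RUL_FD00x.txt belongs to engine_id k, which is worth confirming against the README in your download.

```python
import io

import pandas as pd

# Made-up stand-in for test_FD00x.txt: engine_id, cycle, one sensor.
TEST_TXT = """\
1 1 641.8
1 2 642.1
1 3 642.0
2 1 641.9
2 2 642.4
"""

# Made-up stand-in for RUL_FD00x.txt: one integer per test engine.
RUL_TXT = """\
112
98
"""


def attach_test_labels(test_df: pd.DataFrame, rul: pd.Series) -> pd.DataFrame:
    """Return one row per test engine: its last observed cycle plus its label.

    Assumes line k of the RUL file belongs to engine_id k.
    """
    last = (
        test_df.groupby("engine_id")["cycle"].max().rename("last_cycle").reset_index()
    )
    labels = pd.DataFrame(
        {"engine_id": range(1, len(rul) + 1), "RUL": rul.to_numpy()}
    )
    return last.merge(labels, on="engine_id", how="left", validate="one_to_one")


test_df = pd.read_csv(
    io.StringIO(TEST_TXT), sep=r"\s+", header=None,
    names=["engine_id", "cycle", "sensor_2"],
)
rul = pd.read_csv(io.StringIO(RUL_TXT), sep=r"\s+", header=None, names=["RUL"])["RUL"]

labelled = attach_test_labels(test_df, rul)
print(labelled)
```

The result has one row per engine, not one per cycle, which mirrors how the label file is meant to be used: the label applies to the cut-off cycle only.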

Why Train and Test Look Different

Run-to-failure data is the cleanest possible setup for learning — the model sees the entire trajectory. For evaluation it would be useless: you cannot judge a prognostic model by checking if it can label a fully-labelled engine. So the test split simulates the real-world situation: you get an engine that has been running for some unknown number of cycles, with no future visible. Predict its remaining useful life.
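The same truncation idea can be applied to training engines, for example to build a validation set that looks like the test split. The sketch below is an assumption-laden illustration: the engine is synthetic, `truncate_engine` and `min_cycles` are hypothetical names, and the cut is drawn uniformly, which is not necessarily how the official test files were produced.

```python
import numpy as np
import pandas as pd


def truncate_engine(engine_df: pd.DataFrame, rng: np.random.Generator,
                    min_cycles: int = 5) -> tuple[pd.DataFrame, int]:
    """Cut one run-to-failure trajectory at a random cycle.

    Returns the visible prefix and the RUL at the cut, i.e. the number
    of cycles that are hidden from the model.
    """
    failure_cycle = int(engine_df["cycle"].max())
    # Keep at least min_cycles rows and hide at least one cycle
    # (the upper bound of rng.integers is exclusive).
    cut = int(rng.integers(min_cycles, failure_cycle))
    visible = engine_df[engine_df["cycle"] <= cut]
    return visible, failure_cycle - cut


# Made-up engine with a 40-cycle life and one drifting sensor.
engine = pd.DataFrame({
    "engine_id": 1,
    "cycle": np.arange(1, 41),
    "sensor_2": np.linspace(641.8, 643.5, 40),
})

visible, rul_at_cut = truncate_engine(engine, np.random.default_rng(0))
print(len(visible), "cycles visible,", rul_at_cut, "cycles hidden")
```

Visible cycles plus hidden cycles always add up to the engine's full life, which is exactly the quantity the model never gets to see at test time.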

Reporting convention. Almost every C-MAPSS paper reports two numbers per FD subset, RMSE and NASA score, computed only at the LAST cycle of each test engine. The model emits one prediction per test engine; RMSE is averaged over those engines (100 on FD001) and the NASA score is summed over them. Do not confuse this with per-cycle evaluation, which is sometimes used in ablations.
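A minimal sketch of the two metrics, using the scoring function as it is commonly stated in the C-MAPSS literature: with d = predicted RUL minus true RUL, early predictions (d < 0) cost exp(-d/13) - 1 and late predictions cost exp(d/10) - 1, summed over engines. The function names and the two-engine example are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root-mean-square error over one prediction per test engine."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))


def nasa_score(y_true, y_pred) -> float:
    """Asymmetric score: late predictions are penalised more than early ones."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    penalties = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(penalties.sum())


# Made-up last-cycle labels and predictions for two test engines.
y_true = [100, 50]
y_pred = [110, 37]   # engine 1 predicted 10 cycles late, engine 2 13 cycles early

print("RMSE :", round(rmse(y_true, y_pred), 3))
print("score:", round(nasa_score(y_true, y_pred), 3))
```

In this example a 10-cycle late error and a 13-cycle early error cost the same (e - 1 each), which is the asymmetry in action: overestimating remaining life is treated as the more dangerous mistake.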

Interactive: Browse One Engine's Life

The viewer below walks one engine's full sensor trajectory cycle by cycle. Switch between “all 21”, “14 informative”, and “single sensor” modes. Notice in all 21 mode that several sensors are flat lines: constants like $T_2 = 518.67$ that carry zero predictive signal. Section 5.3 drops them.


Python: Inspect One Engine End-to-End

Twenty lines. Load FD001 train, slice to engine 1, look at its first/last cycle, then identify which sensors are constant. The same diagnostic you would run before any modelling work.

Load FD001 and inspect engine 1
🐍engine_inspection.py
import numpy as np
    Standard alias.

import pandas as pd
    We will use DataFrame.groupby for per-engine slicing.

COLUMNS = ...
    Same 26-column layout from §2.1: 2 IDs + 3 op settings + 21 sensors. No header row in the .txt files.

def load_train(path):
    Loader from §2.1, repeated here for self-containment.
    Execution state:
        input: path = filesystem path to a train_FD00x.txt
        returns: DataFrame = 26 raw columns + 1 computed RUL column

df = pd.read_csv(path, sep=r'\s+', header=None, names=COLUMNS)
    Whitespace-separated; no header. names assigns our column labels.

df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]
    Per-engine RUL: max(cycle) is the failure cycle; subtract the current cycle to get RUL.

df = load_train("data/raw/train_FD001.txt")
    FD001 train file.

eng1 = df[df["engine_id"] == 1]
    Boolean mask: keep only rows where engine_id is 1.
    Execution state:
        len(eng1) = 192 cycles, engine 1's full life

print("engine 1 ran for :", len(eng1), "cycles")
    Total run length for engine 1 on FD001.
    Output: engine 1 ran for : 192 cycles

print("engine 1 first cycle :", ...)
    Selected sensor values (sensor_2, sensor_4, sensor_11) at cycle 1.
    Output: engine 1 first cycle : [641.82, 1400.6, 47.47]

print("engine 1 last cycle :", ...)
    Same sensors at the very last cycle (RUL = 0).
    Output: engine 1 last cycle : [654.32, 1437.2, 46.05]
    Drift visible: sensor_2 rises by ~12.5 in its own measurement units (not cycles), a degradation signal; sensor_4 also rises; sensor_11 falls.

print("engine 1 sensor_2 drift :", ...)
    Quantify the drift on the sensor that grows.
    Output: engine 1 sensor_2 drift : +12.50
sensor_cols = [f"sensor_{i}" for i in range(1, 22)]
    List comprehension producing the 21 sensor column names.

ranges = (df[sensor_cols].max() - df[sensor_cols].min()).round(3)
    Per-sensor range across the entire dataset. Sensors with range ~ 0 are constant.

print("zero-range sensors (drop these):")
    Diagnostic header.

print(ranges[ranges < 1e-6].to_dict())
    Show the ones that didn't change at all over the entire training file. These contribute zero predictive signal and are dropped in feature selection (Section 5.3).
    Output: {'sensor_1': 0.0, 'sensor_5': 0.0, 'sensor_6': 0.0, 'sensor_10': 0.0, 'sensor_16': 0.0, 'sensor_18': 0.0, 'sensor_19': 0.0}
    7 sensors are constant on FD001, so drop them: 21 raw - 7 constant = 14 useful sensors. Section 5.3 formalises this.
import numpy as np
import pandas as pd

COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_set_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)


def load_train(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, sep=r"\s+", header=None, names=COLUMNS)
    df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]
    return df


# ----- Inspect engine 1 of FD001 -----
df = load_train("data/raw/train_FD001.txt")

eng1 = df[df["engine_id"] == 1]
print("engine 1 ran for       :", len(eng1), "cycles")
print("engine 1 first cycle    :", eng1.iloc[0][["sensor_2", "sensor_4", "sensor_11"]].tolist())
print("engine 1 last cycle     :", eng1.iloc[-1][["sensor_2", "sensor_4", "sensor_11"]].tolist())
print("engine 1 sensor_2 drift :",
      f"{eng1.iloc[-1]['sensor_2'] - eng1.iloc[0]['sensor_2']:+.2f}")

# Constant-sensor check: which sensors barely move on FD001?
sensor_cols = [f"sensor_{i}" for i in range(1, 22)]
ranges = (df[sensor_cols].max() - df[sensor_cols].min()).round(3)
print("\nzero-range sensors (drop these):")
print(ranges[ranges < 1e-6].to_dict())

# engine 1 ran for       : 192 cycles
# engine 1 sensor_2 drift : +12.50    (real C-MAPSS values vary by engine)
# zero-range sensors (drop these): {'sensor_1': 0.0, 'sensor_5': 0.0,
#   'sensor_6': 0.0, 'sensor_10': 0.0, 'sensor_16': 0.0, 'sensor_18': 0.0,
#   'sensor_19': 0.0}
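One detail of the range check deserves a hedge: a cutoff of 1e-6 only flags channels that never change at all. A channel that is quantised to two adjacent values has a small but nonzero range and would slip past it. The sketch below shows the difference on made-up readings (not real C-MAPSS values); the `max_unique` threshold is a hypothetical choice, and it is worth running both checks on your own copy of FD001 and comparing the lists.

```python
import pandas as pd

# Made-up readings, not real C-MAPSS values.
df = pd.DataFrame({
    "sensor_a": [518.67, 518.67, 518.67, 518.67],   # exactly constant
    "sensor_b": [21.61, 21.60, 21.61, 21.61],       # flips between two adjacent values
    "sensor_c": [641.8, 642.3, 642.9, 643.4],       # genuine trend
})

# Check 1: exact-constant test, same idea as the range check above.
ranges = df.max() - df.min()
exactly_constant = ranges[ranges < 1e-6].index.tolist()

# Check 2: near-constant test based on the number of distinct values.
max_unique = 2   # hypothetical threshold
n_unique = df.nunique()
near_constant = n_unique[n_unique <= max_unique].index.tolist()

print("exactly constant:", exactly_constant)
print("near constant   :", near_constant)
```

Counting distinct values avoids picking a tolerance that depends on each sensor's scale, at the cost of flagging any genuinely binary signal, so the flagged list still needs a look before dropping columns.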

PyTorch: From Files to Tensors in 10 Lines

With the CMAPSSDataset class from Section 2.1 already available, the path from disk to GPU-ready tensor batches is two more lines.

DataLoader on top of CMAPSSDataset (FD001)
🐍cmapss_loader_runtime.py
from torch.utils.data import DataLoader
    Just DataLoader - the Dataset class is imported from a separate file we built in Section 2.1.

from cmapss_dataset import CMAPSSDataset
    Re-import the reusable Dataset class from Section 2.1's PyTorch block.

ds = CMAPSSDataset("data/raw/train_FD001.txt", window=30)
    Construct on FD001 train. The Dataset reads the file once and pre-computes 17,731 valid sliding-window starts.
    Execution state:
        len(ds) = 17,731 windows on FD001 train (100 engines × ~177 windows each)

loader = DataLoader(ds, batch_size=256, shuffle=True, num_workers=4, pin_memory=True)
    Production-ready DataLoader. shuffle=True randomises order each epoch; num_workers=4 spawns 4 background processes for parallel I/O; pin_memory=True speeds GPU transfer.
    Execution state:
        batch_size=256: default training batch in this book
        shuffle=True: critical for SGD - without it the model sees all of engine 1 first, then engine 2, etc.
        num_workers=4: parallel data loading. CPU-bound preprocessing happens off the GPU thread.
        pin_memory=True: allocates input tensors in pinned memory for faster .to('cuda') transfer
print("dataset size:", len(ds))
    Sanity check.

for X, y in loader:
    Iterate one batch.

print("X.shape:", tuple(X.shape))
    Verify (B, T, F) = (256, 30, 21).
    Output: X.shape: (256, 30, 21)

print("y.shape:", tuple(y.shape))
    Verify scalar RUL targets per sample.
    Output: y.shape: (256,)

print("y[:5] :", y[:5].tolist())
    First 5 RUL values - mix of newly-shuffled engines at varying lifetime stages.
    Output: y[:5] : [42.0, 119.0, 0.0, 84.0, 7.0]
    The 0.0 in there: RUL = 0 means the window ends at the engine's LAST cycle before failure. Assuming each window is labelled with the RUL at its final cycle, exactly one window per engine ends at RUL = 0, so about 100 of the 17,731 windows.

break
    We just want one batch for inspection.

from torch.utils.data import DataLoader
from cmapss_dataset import CMAPSSDataset    # the class from Section 2.1

# ----- Build the train loader for FD001 -----
ds = CMAPSSDataset("data/raw/train_FD001.txt", window=30)
loader = DataLoader(ds, batch_size=256, shuffle=True, num_workers=4, pin_memory=True)

print("dataset size:", len(ds))                # 17,731 sliding windows on FD001 train

# ----- Inspect one batch -----
for X, y in loader:
    print("X.shape:", tuple(X.shape))           # (256, 30, 21)
    print("y.shape:", tuple(y.shape))           # (256,)
    print("y[:5] :", y[:5].tolist())
    break

# X.shape: (256, 30, 21)
# y.shape: (256,)
# y[:5] : [42.0, 119.0, 0.0, 84.0, 7.0]
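The window count can be checked with simple arithmetic. If windows never cross engine boundaries, an engine with n cycles yields n - window + 1 windows, so 100 engines with window=30 lose 100 × 29 = 2,900 rows to warm-up; working backwards, 17,731 windows implies 20,631 rows in the file, provided every engine runs at least 30 cycles. The NumPy sketch below illustrates the slicing on a synthetic engine. It is not the CMAPSSDataset implementation from Section 2.1: `count_windows` and `make_windows` are hypothetical helpers, and labelling each window with the RUL at its final cycle is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def count_windows(engine_lengths, window: int) -> int:
    """Number of full windows when windows never cross engine boundaries."""
    return sum(max(n - window + 1, 0) for n in engine_lengths)


def make_windows(features: np.ndarray, rul: np.ndarray, window: int):
    """Slice one engine's (cycles, F) array into (N, window, F) windows.

    Each window is labelled with the RUL at its final cycle (an assumption).
    """
    # sliding_window_view appends the window axis last: (N, F, window).
    X = sliding_window_view(features, window, axis=0).transpose(0, 2, 1)
    y = rul[window - 1:]
    return X, y


# Synthetic engine: 192 cycles, 21 features.
n_cycles, n_features, window = 192, 21, 30
features = np.random.default_rng(0).normal(size=(n_cycles, n_features))
rul = np.arange(n_cycles - 1, -1, -1)   # 191, 190, ..., 0

X, y = make_windows(features, rul, window)
print("X.shape:", X.shape)
print("y.shape:", y.shape)
print("windows :", count_windows([n_cycles], window))
print("windows ending at RUL = 0:", int((y == 0).sum()))
```

Under this labelling only the final window of an engine carries RUL = 0, which is why zeros are rare in a shuffled batch.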

A Note on the Simulation Physics

C-MAPSS is short for Commercial Modular Aero-Propulsion System Simulation and was developed at NASA Glenn. The simulator is a thermo-mechanical model of a high-bypass turbofan with five rotating components (fan, LPC, HPC, HPT, LPT) and a combustor. Each engine has health-margin parameters that drift over time according to a stochastic schedule; when any margin drops below a threshold the engine is marked failed. The 21 sensors are projections of the underlying state vector through the simulator's output equations.

Two consequences. First, the data is physically consistent — correlations between sensors reflect real engine thermodynamics, not noise. Second, the data is not as varied as real fleet data; the simulator has a finite parameter space and the failures fall into a small set of canonical modes (HPC degradation, fan degradation, both combined). The N-CMAPSS DS02 dataset (Section 2.3) closes much of this gap.

Two File-Format Pitfalls

Pitfall 1: Treating space-separated as comma-separated. Default pd.read_csv splits on commas. C-MAPSS separates fields with one or more spaces. Use sep=r"\s+" to collapse them. Forgetting this leaves every row as a single unparsed string instead of 26 numeric fields.
Pitfall 2: Computing RUL on the test file. max(cycle) - cycle works ONLY for run-to-failure trajectories - that is, the train file. The test file is truncated; the max cycle there is NOT the failure cycle. For test-time RUL you have to consult RUL_FD00x.txt and only at the LAST cycle of each test engine.
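Both pitfalls can be reproduced on a two-row in-memory sample. The values below are made up and only mimic the layout (26 space-separated fields, irregular spacing, no header); the sample is treated as a truncated test trajectory for the second check.

```python
import io

import pandas as pd

COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_set_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)

# Two made-up rows: engine_id, cycle, then 24 values separated by double spaces.
row1 = "1 1 " + "  ".join(f"{0.5 * i:.2f}" for i in range(24))
row2 = "1 2 " + "  ".join(f"{0.5 * i + 0.1:.2f}" for i in range(24))
SAMPLE = row1 + "\n" + row2 + "\n"

# Pitfall 1: the default separator is a comma, so nothing gets split.
wrong = pd.read_csv(io.StringIO(SAMPLE), header=None)
right = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, names=COLUMNS)
print("default sep :", wrong.shape)
print(r"sep=r'\s+'  :", right.shape)

# Pitfall 2: on a truncated trajectory, max(cycle) - cycle puts 0 at the
# cut-off row, which claims the engine failed there. The real label for
# that row has to come from RUL_FD00x.txt.
naive_rul = right.groupby("engine_id")["cycle"].transform("max") - right["cycle"]
print("naive RUL   :", naive_rul.tolist())
```

A quick shape check right after loading catches the first pitfall; the second one produces no error at all, which is what makes it dangerous.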
The bottom line. C-MAPSS is small, simple, and beautifully reproducible. The simulation physics is enough to produce realistic degradation shapes; the file format is plain enough that any reasonable scientific stack handles it in five lines.

Takeaway

  • Three files per FD subset. Train (full life), test (truncated), RUL (test labels at the cutoff cycle).
  • 26 columns per row, no header. 2 IDs + 3 op settings + 21 sensors. sep=r"\s+", header=None or you get one column.
  • RUL on train is one line of pandas. groupby("engine_id")["cycle"].transform("max") - df["cycle"]. RUL on test needs the separate label file.
  • Seven sensors are constant on FD001. 21 - 7 = 14 informative sensors. Section 5.3 formalises this.
  • The CMAPSSDataset class plus a DataLoader is the loader. Two lines beyond Section 2.1 to get batched, shuffled (B, T, F) tensors in pinned memory, ready to move to the GPU.