What's in the Shipping Crate
Open the C-MAPSS download from NASA's Prognostics Center of Excellence. Inside: a 50 MB zip with a README, a Matlab script nobody runs anymore, and twelve plain-text files. That is the entire benchmark that has anchored prognostic research for fifteen years — smaller than a single high-res photograph, simpler than a CSV, and yet rich enough that papers still squeeze new percentage points out of it.
Section 2.1 gave the macro view: four sub-datasets (FD001-4), 26 columns per row, 100-260 engines per subset. This section is the micro view — how to actually load the files, what one engine looks like up close, and the train/test split asymmetry that catches every newcomer.
Three Files Per Sub-Dataset
| File | Contents | Format | Used by |
|---|---|---|---|
| train_FD00x.txt | Run-to-failure trajectories for all training engines | 26 cols, no header, space-separated | Training set |
| test_FD00x.txt | Truncated trajectories - engine pulled at random cycle | 26 cols, no header, space-separated | Test inputs |
| RUL_FD00x.txt | Ground-truth RUL at the last test cycle of each test engine | 1 col, integers | Test labels |
For each FD subset you get three files. Train is straightforward. Test is the interesting one — the engine's sensor stream is cut off at a random cycle before failure, and the model has to extrapolate. RUL_FD00x.txt tells you how many cycles the engine actually had left at that cut-off; that is the held-out label. The shape mismatch (test has truncated trajectories; train does not) is what makes evaluation non-trivial.
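A minimal sketch of loading the three files for one subset with pandas. The column names (`engine_id`, `cycle`, `op_1`..`op_3`, `s_1`..`s_21`) and the helper name `load_subset` are assumptions made for illustration, since the files carry no header row; the 2 + 3 + 21 layout follows the column description in Section 2.1.

```python
from pathlib import Path

import pandas as pd

# Assumed column names: the files themselves have no header row.
COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_{i}" for i in range(1, 4)]    # 3 operating settings
    + [f"s_{i}" for i in range(1, 22)]    # 21 sensors
)


def load_subset(data_dir, subset="FD001"):
    """Return (train, test, rul) DataFrames for one sub-dataset."""
    data_dir = Path(data_dir)
    read = dict(sep=r"\s+", header=None)  # whitespace-separated, no header
    train = pd.read_csv(data_dir / f"train_{subset}.txt", names=COLUMNS, **read)
    test = pd.read_csv(data_dir / f"test_{subset}.txt", names=COLUMNS, **read)
    rul = pd.read_csv(data_dir / f"RUL_{subset}.txt", names=["rul"], **read)

    # The label file should hold exactly one RUL value per test engine.
    n_test_engines = test["engine_id"].nunique()
    if len(rul) != n_test_engines:
        raise ValueError(
            f"{len(rul)} RUL labels for {n_test_engines} test engines"
        )
    return train, test, rul
```

Assuming the archive was unpacked into a directory called `CMAPSSData` (a hypothetical path), `train, test, rul = load_subset("CMAPSSData", "FD001")` returns the three frames. The length check is a cheap guard against pairing a test file with the label file of a different subset.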
Why Train and Test Look Different
Run-to-failure data is the cleanest possible setup for learning — the model sees the entire trajectory. For evaluation it would be useless: you cannot judge a prognostic model by checking if it can label a fully-labelled engine. So the test split simulates the real-world situation: you get an engine that has been running for some unknown number of cycles, with no future visible. Predict its remaining useful life.
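To make the asymmetry concrete, the sketch below turns one run-to-failure trajectory into a test-style sample by cutting it at a random cycle. The uniform choice of cut point, the function name `truncate_run_to_failure`, and the toy sensor values are all assumptions for illustration; the text above says only that the cut happens at a random cycle before failure.

```python
import numpy as np
import pandas as pd


def truncate_run_to_failure(engine_df, rng):
    """Cut one full trajectory at a random cycle before failure.

    Returns the visible part and the true RUL at the cut-off,
    i.e. the number the model would be asked to predict.
    """
    last_cycle = int(engine_df["cycle"].max())
    # rng.integers excludes the upper bound, so the cut lies in
    # [1, last_cycle - 1] and at least one cycle of life remains.
    cut = int(rng.integers(1, last_cycle))
    visible = engine_df[engine_df["cycle"] <= cut]
    return visible, last_cycle - cut


# Toy engine that runs for 192 cycles (values are made up).
engine = pd.DataFrame({
    "engine_id": 1,
    "cycle": np.arange(1, 193),
    "s_2": np.linspace(641.8, 643.5, 192),
})
visible, true_rul = truncate_run_to_failure(engine, np.random.default_rng(0))
print(len(visible), "cycles visible,", true_rul, "cycles of life remaining")
```

The visible rows play the role of one engine in test_FD00x.txt, and `true_rul` plays the role of its row in RUL_FD00x.txt.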
Interactive: Browse One Engine's Life
The viewer below walks one engine's full sensor trajectory cycle by cycle. Switch between “all 21”, “14 informative”, and “single sensor” modes. Notice in all 21 mode that several sensors are flat lines — constants like that carry zero predictive signal. Section 5.3 drops them.
Python: Inspect One Engine End-to-End
Twenty lines. Load FD001 train, slice to engine 1, look at its first/last cycle, then identify which sensors are constant. The same diagnostic you would run before any modelling work.
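The diagnostic described above might look like the following sketch. The column names and the helpers `load_train` and `inspect_engine` are assumptions, not names from the dataset or from Section 2.1. Flatness is checked across the whole training file by the max-minus-min spread; a small positive `tol` also catches sensors that only wiggle by a rounding step.

```python
import pandas as pd

# Assumed column names: the files ship without a header row.
SENSORS = [f"s_{i}" for i in range(1, 22)]
COLUMNS = ["engine_id", "cycle", "op_1", "op_2", "op_3"] + SENSORS


def load_train(path_or_buffer):
    """Read a train_FD00x.txt file (or any file-like object)."""
    return pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMNS)


def inspect_engine(train, engine_id=1, tol=0.0):
    """Summarise one engine and list sensors that never move."""
    engine = train[train["engine_id"] == engine_id].sort_values("cycle")
    # Spread of each sensor over the whole file; zero means a flat line.
    spread = train[SENSORS].max() - train[SENSORS].min()
    flat = spread[spread <= tol].index.tolist()
    return {
        "n_cycles": len(engine),
        "first_cycle": engine.iloc[0],
        "last_cycle": engine.iloc[-1],
        "flat_sensors": flat,
    }


# Usage (the path is an assumption about where the download was unpacked):
# summary = inspect_engine(load_train("train_FD001.txt"), engine_id=1)
# print(summary["n_cycles"], summary["flat_sensors"])
```

Comparing `first_cycle` and `last_cycle` side by side shows which sensors drifted over the engine's life, while `flat_sensors` lists the candidates to drop before modelling.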
PyTorch: From Files to Tensors in 10 Lines
With the CMAPSSDataset class from Section 2.1 already available, the path from disk to GPU-ready tensor batches is two more lines.
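The CMAPSSDataset class itself lives in Section 2.1 and is not reproduced here. The sketch below uses a hypothetical stand-in, `WindowedEngineDataset`, whose constructor arguments and sliding-window behaviour are assumptions, so that the last two steps (wrap in a DataLoader, move a batch to the device) can be run end to end on toy data.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class WindowedEngineDataset(Dataset):
    """Hypothetical stand-in for the CMAPSSDataset of Section 2.1.

    Serves (window, F) slices with the RUL at the window's last cycle.
    Rows must be ordered by engine, then by cycle.
    """

    def __init__(self, features, engine_ids, rul, window=30):
        self.x = torch.as_tensor(features, dtype=torch.float32)
        self.y = torch.as_tensor(rul, dtype=torch.float32)
        self.window = window
        engine_ids = np.asarray(engine_ids)
        self.ends = []  # row index of the last cycle in each window
        for eid in np.unique(engine_ids):
            rows = np.flatnonzero(engine_ids == eid)
            # Skip the first window-1 rows so no window straddles two engines.
            self.ends.extend(int(r) for r in rows[window - 1:])

    def __len__(self):
        return len(self.ends)

    def __getitem__(self, i):
        end = self.ends[i]
        return self.x[end - self.window + 1 : end + 1], self.y[end]


# Toy data: two engines with 40 and 50 cycles, 14 features each.
rng = np.random.default_rng(0)
lengths = {1: 40, 2: 50}
engine_ids = np.concatenate([np.full(n, eid) for eid, n in lengths.items()])
features = rng.normal(size=(len(engine_ids), 14)).astype(np.float32)
rul = np.concatenate([np.arange(n - 1, -1, -1) for n in lengths.values()])

dataset = WindowedEngineDataset(features, engine_ids, rul, window=30)

# The two lines the text refers to: batch and shuffle, then move to the device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = DataLoader(dataset, batch_size=16, shuffle=True)
xb, yb = next(iter(loader))
xb, yb = xb.to(device), yb.to(device)
print(xb.shape, yb.shape)  # (B, T, F) inputs and (B,) targets
```

Shuffling at the window level is fine for training because each window already carries its own label; for evaluation on the test split, only the final window of each engine has a ground-truth RUL.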
A Note on the Simulation Physics
C-MAPSS is short for Commercial Modular Aero-Propulsion System Simulation and was developed at NASA Glenn. The simulator is a thermo-mechanical model of a high-bypass turbofan with five rotating components (fan, LPC, HPC, HPT, LPT) and a combustor. Each engine has health-margin parameters that drift over time according to a stochastic schedule; when any margin drops below a threshold the engine is marked failed. The 21 sensors are projections of the underlying state vector through the simulator's output equations.
Two consequences. First, the data is physically consistent — correlations between sensors reflect real engine thermodynamics, not noise. Second, the data is not as varied as real fleet data; the simulator has a finite parameter space and the failures fall into a small set of canonical modes (HPC degradation, fan degradation, both combined). The N-CMAPSS DS02 dataset (Section 2.3) closes much of this gap.
Two File-Format Pitfalls
pd.read_csv uses comma. C-MAPSS uses whitespace with multiple spaces. Use sep=r"\s+" to collapse them. Forgetting this gives you one giant column with all 26 numbers concatenated.
max(cycle) - cycle works ONLY for run-to-failure trajectories - that is, the train file. The test file is truncated; the max cycle there is NOT the failure cycle. For test-time RUL you have to consult RUL_FD00x.txt and only at the LAST cycle of each test engine.
The bottom line. C-MAPSS is small, simple, and beautifully reproducible. The simulation physics is enough to produce realistic degradation shapes; the file format is plain enough that any reasonable scientific stack handles it in five lines.
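Both pitfalls can be reproduced on a few toy rows. In this sketch the column names, the helper names `add_train_rul` and `last_cycle_test_labels`, and the sample values are illustrative assumptions; it also assumes that row i of the RUL file belongs to the i-th test engine in ascending engine-id order.

```python
import io

import pandas as pd

# Five toy rows in C-MAPSS style: runs of spaces, no header (values made up).
raw = (
    "1 1 0.5  641.8\n"
    "1 2 0.5  642.1\n"
    "1 3 0.5  642.4\n"
    "2 1 0.5  641.9\n"
    "2 2 0.5  642.6\n"
)
names = ["engine_id", "cycle", "op_1", "s_2"]

# Pitfall 1: the default comma separator leaves each row as one string column.
wrong = pd.read_csv(io.StringIO(raw), header=None)
right = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, names=names)
print(wrong.shape, right.shape)


def add_train_rul(train):
    """RUL for run-to-failure data: cycles left until the engine's last row."""
    out = train.copy()
    out["rul"] = out.groupby("engine_id")["cycle"].transform("max") - out["cycle"]
    return out


def last_cycle_test_labels(test, rul):
    """Pitfall 2: test labels come from the RUL file, last cycle only."""
    last_rows = test.loc[test.groupby("engine_id")["cycle"].idxmax()]
    last_rows = last_rows.reset_index(drop=True)
    if len(last_rows) != len(rul):
        raise ValueError("RUL file length does not match number of test engines")
    # Assumes the label rows are ordered by ascending engine id.
    last_rows["rul"] = rul["rul"].to_numpy()
    return last_rows


train_labelled = add_train_rul(right)                     # treat rows as train
toy_labels = pd.DataFrame({"rul": [112, 98]})             # stand-in RUL file
test_labelled = last_cycle_test_labels(right, toy_labels)  # treat rows as test
```

Treating the same rows once as train and once as test makes the contrast visible: `train_labelled` gets a label on every row that ends at zero, while `test_labelled` has one row per engine whose label comes entirely from the external file.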
Takeaway
- Three files per FD subset. Train (full life), test (truncated), RUL (test labels at the cutoff cycle).
- 26 columns per row, no header. 2 IDs + 3 op settings + 21 sensors. Pass sep=r"\s+", header=None or you get one column.
- RUL on train is one line of pandas: df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]. RUL on test needs the separate label file.
- Seven sensors are constant on FD001. 21 - 7 = 14 informative sensors. Section 5.3 formalises this.
- The CMAPSSDataset class plus a DataLoader is the loader. Two lines beyond Section 2.1 to get batched, shuffled (B, T, F) tensors on the GPU.