Learning Objectives
By the end of this section, you will:
- Understand the EnhancedNASACMAPSSDataset architecture and how it implements research-grade preprocessing
- Configure feature selection with the 17 informative features identified in literature
- Load and parse C-MAPSS data files with proper column naming and type handling
- Implement operating condition clustering for per-condition normalization on FD002/FD004
- Prevent data leakage through proper scaler parameter management
Why This Matters: The Dataset class is the foundation of the entire training pipeline. Our EnhancedNASACMAPSSDataset implements critical preprocessing steps that many published methods overlook, including per-condition normalization, proper train/test separation, and reproducible splits. These details significantly impact model performance and evaluation validity.
PyTorch Dataset Basics
PyTorch's Dataset class provides a standard interface for data loading. Custom datasets must implement __len__ (total samples) and __getitem__ (sample access). Our enhanced dataset goes far beyond this minimum, implementing comprehensive preprocessing within the class itself.
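As a minimal sketch of the two-method interface described above, the class below serves fixed-length windows from a single series. The name WindowedSeriesDataset, the window length of 30, and the choice to label each window with the target at its last step are illustrative assumptions, not the EnhancedNASACMAPSSDataset implementation. The sketch uses NumPy only; a real dataset would typically subclass torch.utils.data.Dataset and return tensors.

```python
import numpy as np


class WindowedSeriesDataset:
    """Map-style dataset: one sample is one fixed-length window of rows."""

    def __init__(self, features, targets, window=30):
        # features: (n_rows, n_features); targets: (n_rows,), e.g. RUL per row
        self.features = np.asarray(features, dtype=np.float32)
        self.targets = np.asarray(targets, dtype=np.float32)
        if len(self.features) != len(self.targets):
            raise ValueError("features and targets must have the same length")
        if len(self.features) < window:
            raise ValueError("series is shorter than one window")
        self.window = window

    def __len__(self):
        # Number of full windows that fit in the series
        return len(self.features) - self.window + 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        stop = idx + self.window
        # Label each window with the target at its last time step (assumption)
        return self.features[idx:stop], self.targets[stop - 1]


# Toy data: 50 cycles of one engine, 17 features, linearly decreasing target
demo = WindowedSeriesDataset(np.random.rand(50, 17), np.arange(49, -1, -1))
x, y = demo[0]
```

Because the object only needs __len__ and __getitem__, everything else (loading, normalization, windowing) can be done once in the constructor, which is the approach the rest of this section follows.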
Dataset Responsibilities
| Responsibility | Implementation | Purpose |
|---|---|---|
| Data loading | _load_split() method | Parse raw C-MAPSS files |
| Feature selection | enforce_feature_set parameter | Use 17 informative features |
| Condition clustering | _compute_condition_id() method | Enable per-condition normalization |
| Normalization | _compute_scalers(), _apply_scalers() | Prevent data leakage |
| Sequence building | _build_sequences_and_labels() | Create sliding windows |
EnhancedNASACMAPSSDataset Class
This class is taken directly from our research code. It handles all preprocessing steps in a single, cohesive unit, which keeps training and inference consistent.
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Feature set | 17 features (enforced) | Literature consensus on informative sensors |
| Normalization | Per-condition for FD002/FD004 | Preserves degradation signals across regimes |
| Scaler handling | Pass from training to test | Test data is scaled with training statistics only, so no test information leaks into preprocessing |
| Random seed | Fixed at 42 | Reproducible experiments |
| Data types | float32 for sequences | Memory efficient, GPU compatible |
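One way the fixed seed could translate into reproducible splits is sketched below. It assumes the split is made at the level of engine units rather than individual rows, so that no engine contributes windows to both sides; the function name split_units and the 20% validation fraction are hypothetical and not taken from the research code.

```python
import numpy as np


def split_units(unit_ids, val_fraction=0.2, seed=42):
    """Split engine units (not rows) into sorted train and validation sets."""
    units = np.unique(unit_ids)
    # A seeded generator makes the shuffle identical on every run
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(units)
    n_val = max(1, int(round(len(units) * val_fraction)))
    return np.sort(shuffled[n_val:]), np.sort(shuffled[:n_val])


# Toy example: 100 engines with 3 rows each
unit_ids = np.repeat(np.arange(1, 101), 3)
train_units, val_units = split_units(unit_ids)
```

Calling split_units again with the same seed returns the same partition, which is what makes results comparable across experiments.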
Feature Configuration
The dataset uses exactly 17 features: 3 operating settings + 14 selected sensors. This selection is based on extensive literature analysis showing which sensors contain meaningful degradation information.
Selected Sensors
| Category | Features | Count |
|---|---|---|
| Operating settings | setting_1, setting_2, setting_3 | 3 |
| Temperature sensors | sensor_2, sensor_3, sensor_4, sensor_7, sensor_8, sensor_9 | 6 |
| Pressure sensors | sensor_11, sensor_12, sensor_13, sensor_14 | 4 |
| Speed/efficiency | sensor_6, sensor_16, sensor_17, sensor_20 | 4 |
Why 17 Features?
The original C-MAPSS data has 24 features (3 settings + 21 sensors). However, sensors 1, 5, 10, 15, 18, 19, and 21 show near-constant values across all engines and provide no degradation information. Including them adds noise without signal.
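The selection above can be written down as a small configuration. This is a sketch only: the constant names and the helper check_feature_set are hypothetical, while the column names and the dropped sensor numbers come from the tables and text above.

```python
# Sensors the text identifies as near-constant and therefore dropped
DROPPED_SENSORS = [1, 5, 10, 15, 18, 19, 21]

SETTING_COLUMNS = [f"setting_{i}" for i in range(1, 4)]
SENSOR_COLUMNS = [
    f"sensor_{i}" for i in range(1, 22) if i not in DROPPED_SENSORS
]
FEATURE_COLUMNS = SETTING_COLUMNS + SENSOR_COLUMNS


def check_feature_set(columns, expected=FEATURE_COLUMNS):
    """Raise if any expected feature is missing; return the ordered subset."""
    missing = [c for c in expected if c not in columns]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return list(expected)
```

Deriving the kept sensors from the dropped list keeps the two in sync: 21 sensors minus 7 dropped leaves 14, plus 3 settings gives the 17 features.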
Data Loading Implementation
The _load_split method handles parsing the original NASA C-MAPSS text files, which use space-separated values without headers.
C-MAPSS File Format
| Column Index | Name | Description |
|---|---|---|
| 0 | unit | Engine unit number (1-100 for FD001) |
| 1 | cycle | Operating cycle number within unit |
| 2-4 | setting_1/2/3 | Altitude, Mach number, throttle resolver angle (TRA) |
| 5-25 | sensor_1...21 | 21 sensor measurements |
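The layout in the table can be parsed with pandas roughly as follows. This is a sketch under assumptions, not the _load_split method itself: the function name load_cmapss and the dtype choices (int64 identifiers, float32 values) are illustrative, and the two sample rows are synthetic placeholders rather than real C-MAPSS readings.

```python
import io

import pandas as pd

COLUMN_NAMES = (
    ["unit", "cycle"]
    + [f"setting_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)


def load_cmapss(source):
    """Parse a headerless, whitespace-separated C-MAPSS file."""
    df = pd.read_csv(source, sep=r"\s+", header=None, names=COLUMN_NAMES)
    # Identifiers stay integer; settings and sensors become float32
    dtypes = {"unit": "int64", "cycle": "int64"}
    dtypes.update({name: "float32" for name in COLUMN_NAMES[2:]})
    return df.astype(dtypes)


def fake_row(unit, cycle):
    # Synthetic placeholder: 2 identifiers followed by 24 dummy values
    return " ".join([str(unit), str(cycle)] + ["0.5"] * 24)


sample = io.StringIO(fake_row(1, 1) + "\n" + fake_row(1, 2) + "\n")
df = load_cmapss(sample)
```

In practice, source would be a path to one of the dataset's text files instead of an in-memory buffer.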
Operating Condition Clustering
For FD002 and FD004, engines operate across 6 different regimes. The _compute_condition_id method assigns each data row to its operating condition, enabling per-condition normalization.
The Six Operating Conditions
| Condition ID | Altitude (ft) | Mach | TRA | Flight Phase |
|---|---|---|---|---|
| 0 | 0 | 0 | 100 | Ground idle / takeoff |
| 1 | 10,000 | 0.25 | 100 | Low altitude climb |
| 2 | 20,000 | 0.70 | 100 | Mid altitude cruise |
| 3 | 25,000 | 0.62 | 60 | Reduced thrust cruise |
| 4 | 35,000 | 0.84 | 100 | High altitude cruise |
| 5 | 42,000 | 0.84 | 100 | Max altitude cruise |
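One plausible way to map a row's settings to these IDs is a nearest-reference lookup, sketched below. It is not necessarily how _compute_condition_id works. The sketch assumes setting_1 is stored in thousands of feet, so the table's 42,000 ft appears as roughly 42.0, and the per-axis scaling is an illustrative choice.

```python
import numpy as np

# Reference regimes from the table: (altitude in kft, Mach, TRA)
CONDITIONS = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])
# Per-axis scale so altitude, Mach and TRA contribute comparably (assumption)
SCALE = np.array([42.0, 0.84, 100.0])


def compute_condition_id(settings):
    """Return the index of the nearest reference regime for each row."""
    settings = np.atleast_2d(np.asarray(settings, dtype=np.float64))
    # Scaled differences to every regime, shape (n_rows, 6, 3)
    diff = (settings[:, None, :] - CONDITIONS[None, :, :]) / SCALE
    return np.linalg.norm(diff, axis=2).argmin(axis=1)


# Slightly noisy settings, as recorded values are not exactly on the grid
ids = compute_condition_id([
    [41.998, 0.8401, 100.0],
    [0.0019, 0.0002, 100.0],
    [25.003, 0.6200, 60.0],
])
```

Matching to the nearest regime tolerates the small jitter in recorded settings, which exact equality checks would not.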
Why Per-Condition Normalization?
Without per-condition normalization, a sensor reading of 600 °R might be "normal" at sea level but indicate severe degradation at high altitude. Normalizing each reading against the statistics of its own operating condition removes these regime effects, allowing the model to focus on degradation signals.
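The idea can be sketched in a few lines, together with the rule that statistics come from training data only. The function names echo _compute_scalers and _apply_scalers from the responsibilities table, but the bodies, the dictionary layout of the scalers, and the toy data are assumptions rather than the research implementation.

```python
import numpy as np


def compute_scalers(features, condition_ids):
    """Per-condition mean and std, computed from training rows only."""
    features = np.asarray(features, dtype=np.float64)
    scalers = {}
    for cond in np.unique(condition_ids):
        rows = features[condition_ids == cond]
        std = rows.std(axis=0)
        # Replace near-zero std with 1.0 to avoid division by zero
        scalers[int(cond)] = (rows.mean(axis=0), np.where(std < 1e-8, 1.0, std))
    return scalers


def apply_scalers(features, condition_ids, scalers):
    """Standardize each row with the statistics of its own condition."""
    features = np.asarray(features, dtype=np.float64)
    unseen = set(np.unique(condition_ids).tolist()) - set(scalers)
    if unseen:
        raise KeyError(f"no training statistics for conditions {sorted(unseen)}")
    out = np.empty(features.shape, dtype=np.float32)
    for cond, (mean, std) in scalers.items():
        mask = condition_ids == cond
        out[mask] = (features[mask] - mean) / std
    return out


# Toy data: two regimes whose healthy baselines differ by 100 units
rng = np.random.default_rng(0)
train_cond = np.repeat([0, 1], 100)
baseline = np.where(train_cond[:, None] == 0, 600.0, 500.0)
train_x = baseline + rng.normal(size=(200, 2))
scalers = compute_scalers(train_x, train_cond)  # fit on training data only

test_cond = np.array([0, 1])
test_x = np.array([[600.0, 600.0], [500.0, 500.0]])
test_z = apply_scalers(test_x, test_cond, scalers)  # reuse training statistics
```

After scaling, a baseline reading in either regime lands near zero, so the 100-unit gap between regimes no longer masks smaller degradation trends. The test rows never influence the means or standard deviations.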
Summary
In this section, we explored the EnhancedNASACMAPSSDataset class from our research implementation:
- Comprehensive initialization: 10+ parameters for full control over preprocessing
- 17-feature selection: Literature-validated sensor subset for optimal signal
- Proper data loading: Parses original NASA text files with correct column naming
- Condition clustering: Maps operating settings to integer IDs for per-condition normalization
- Leakage prevention: Scaler parameters passed from training to test data
| Component | Purpose | Impact |
|---|---|---|
| enforce_feature_set | Use 17 informative features | Reduces noise, improves signal |
| per_condition_norm | Normalize within each regime | Critical for FD002/FD004 performance |
| scaler_params | Pass training statistics | Prevents data leakage |
| random_seed | Reproducible splits | Fair experimental comparison |
Looking Ahead: The Dataset class loads and normalizes data. The next section covers the DataLoader configuration (batch size, shuffling, and parallel loading) that determines training efficiency.
With the enhanced dataset implementation understood, we are ready to optimize the DataLoader for efficient training.