Chapter 4
Section 20 of 104

PyTorch Dataset Implementation

Data Preprocessing Pipeline

Learning Objectives

By the end of this section, you will:

  1. Understand the EnhancedNASACMAPSSDataset architecture and how it implements research-grade preprocessing
  2. Configure feature selection with the 17 informative features identified in literature
  3. Load and parse C-MAPSS data files with proper column naming and type handling
  4. Implement operating condition clustering for per-condition normalization on FD002/FD004
  5. Prevent data leakage through proper scaler parameter management
Why This Matters: The Dataset class is the foundation of the entire training pipeline. Our EnhancedNASACMAPSSDataset implements critical preprocessing steps that many published methods overlook—including per-condition normalization, proper train/test separation, and reproducible splits. These details significantly impact model performance and evaluation validity.

PyTorch Dataset Basics

PyTorch's Dataset class provides a standard interface for data loading. Custom datasets must implement __len__ (total samples) and __getitem__ (sample access). Our enhanced dataset goes far beyond this minimum, implementing comprehensive preprocessing within the class itself.
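Neither __len__ nor __getitem__ appears in the excerpt below, but the protocol is small enough to show with a stand-in class. This sketch uses plain numpy so it is dependency-light; the real class inherits from torch.utils.data.Dataset and returns tensors rather than arrays.

```python
import numpy as np

class MinimalSequenceDataset:
    """Bare-bones illustration of the Dataset protocol:
    __len__ reports the sample count, __getitem__ returns one sample."""

    def __init__(self, sequences: np.ndarray, targets: np.ndarray):
        assert len(sequences) == len(targets)
        self.sequences = sequences.astype(np.float32)
        self.targets = targets.astype(np.float32)

    def __len__(self) -> int:
        return len(self.sequences)

    def __getitem__(self, idx: int):
        # The torch version would return these as tensors
        return self.sequences[idx], self.targets[idx]

# 5 windows of length 30 with 17 features each
ds = MinimalSequenceDataset(np.zeros((5, 30, 17)), np.zeros(5))
print(len(ds))           # 5
print(ds[0][0].shape)    # (30, 17)
```

Because DataLoader only relies on these two methods (plus optional hooks), everything else in the enhanced class is free to do preprocessing however it likes.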

Dataset Responsibilities

| Responsibility | Implementation | Importance |
| --- | --- | --- |
| Data loading | _load_split() method | Parse raw C-MAPSS files |
| Feature selection | enforce_feature_set parameter | Use 17 informative features |
| Condition clustering | _compute_condition_id() method | Enable per-condition normalization |
| Normalization | _compute_scalers(), _apply_scalers() | Prevent data leakage |
| Sequence building | _build_sequences_and_labels() | Create sliding windows |

EnhancedNASACMAPSSDataset Class

This is the actual implementation from our research code. The class handles all preprocessing steps in a single, cohesive unit, ensuring consistency between training and inference.

EnhancedNASACMAPSSDataset Initialization
File: models/enhanced_sota_rul_predictor.py

Line 1: Class Definition

EnhancedNASACMAPSSDataset inherits from PyTorch's Dataset class, enabling seamless integration with DataLoader for batching, shuffling, and parallel loading.

Line 10: Constructor Parameters

The __init__ method accepts comprehensive configuration parameters that control dataset loading, preprocessing, and sequence construction. These parameters are carefully designed to prevent data leakage.

Line 12: Dataset Selection

The dataset_name parameter selects which C-MAPSS sub-dataset to load. FD001 is simplest (1 condition, 1 fault), while FD004 is most complex (6 conditions, 2 faults).

Example: dataset_name='FD001' for training on the baseline dataset

Line 16: Scaler Parameters

The scaler_params parameter allows passing pre-computed normalization statistics from training data to test data. This is CRITICAL for preventing data leakage - test data must be normalized using training statistics only.

Example: train_dataset.scaler_params passed to the test dataset
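The pattern this parameter enables can be sketched with plain arrays. The helper names below are hypothetical, not the class's actual `_compute_scalers`/`_apply_scalers` code; the point is only that statistics are fit on the training split and reused, unchanged, on the test split.

```python
import numpy as np

def compute_scaler_params(train_features: np.ndarray) -> dict:
    """Fit normalization statistics on TRAINING data only."""
    return {"mean": train_features.mean(axis=0),
            "std": train_features.std(axis=0) + 1e-8}  # epsilon guards constant columns

def apply_scaler_params(features: np.ndarray, params: dict) -> np.ndarray:
    """Normalize any split with the stored training statistics."""
    return (features - params["mean"]) / params["std"]

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(100, 17))  # 17 features, as in the class
test = rng.normal(5.0, 2.0, size=(40, 17))

params = compute_scaler_params(train)         # analogous to train_dataset.scaler_params
train_n = apply_scaler_params(train, params)
test_n = apply_scaler_params(test, params)    # test never sees its own statistics
```

If the test split were normalized with its own mean and std, information about the test distribution would leak into evaluation; reusing the training parameters avoids that entirely.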
Line 20: Per-Condition Normalization Flag

When True, normalization is computed separately for each operating condition (6 regimes in FD002/FD004). This preserves degradation signals that would otherwise be masked by regime changes.

Line 35: Random Seed

Setting the random seed ensures reproducible train/validation splits across experiments. This is essential for fair comparison between model variants.
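The split logic itself is not shown in this excerpt. A hedged sketch of a seeded, unit-level split (the helper name and the use of numpy's Generator API are assumptions): splitting by engine unit, rather than by individual window, keeps all windows from one engine in the same set.

```python
import numpy as np

def split_units(unit_ids, validation_split: float, random_seed: int):
    """Split engine unit IDs (not individual windows) into train/val sets.
    Splitting by unit avoids leaking windows from one engine into both sets."""
    units = np.unique(unit_ids)
    rng = np.random.default_rng(random_seed)   # seeded, so the split is reproducible
    shuffled = rng.permutation(units)
    n_val = int(round(len(units) * validation_split))
    return set(shuffled[n_val:]), set(shuffled[:n_val])

# FD001 has 100 training engines; a 20% validation split holds out 20 of them
train_u, val_u = split_units(np.arange(1, 101), validation_split=0.2, random_seed=42)
train_u2, val_u2 = split_units(np.arange(1, 101), validation_split=0.2, random_seed=42)
# same seed, same split on every run
```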

Line 38: Operating Settings Columns

The three operating settings (altitude, Mach number, throttle resolver angle) define the engine's operating regime and are used for per-condition normalization.

Line 39: Selected Sensor Features

These 14 sensors were selected based on literature analysis - they show meaningful variance and degradation correlation. Constant sensors (like sensor_1, sensor_5) are excluded.

Example: sensor_7 = Total pressure at HPC outlet (P30)

Line 48: 17-Feature Enforcement

By default, we use exactly 17 features (3 settings + 14 sensors) for consistency with prior work. Setting enforce_feature_set=False uses all 24 original features.

Line 54: Data Loading

The _load_split method reads the raw C-MAPSS text files and returns a pandas DataFrame with proper column names.

Line 57: Condition ID Computation

Each data row is assigned a condition ID based on its operating settings. This enables per-condition normalization for FD002/FD004.

Line 60: Scaler Computation (Training Only)

For training data, we compute normalization statistics (mean, std). For test data, we MUST use the pre-computed training statistics to prevent leakage.

Line 67: Normalization Application

Apply z-score normalization using the computed (or passed) scaler parameters. Per-condition normalization uses separate statistics for each operating regime.
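The `_compute_scalers`/`_apply_scalers` pair is not reproduced in this excerpt. A sketch of the per-condition variant using pandas groupby (the function names and the 1e-8 epsilon are assumptions, not the source code): statistics are keyed by the `_cond_id` column computed earlier, and each row is z-scored with the statistics of its own regime.

```python
import numpy as np
import pandas as pd

def compute_per_condition_scalers(df: pd.DataFrame, feature_cols):
    """Fit one (mean, std) pair per operating condition and feature."""
    grouped = df.groupby('_cond_id')[feature_cols]
    return {'mean': grouped.mean(), 'std': grouped.std(ddof=0) + 1e-8}

def apply_per_condition_scalers(df: pd.DataFrame, feature_cols, params):
    """Z-score each row with the statistics of its own condition."""
    out = df.copy()
    mean = params['mean'].loc[df['_cond_id']].to_numpy()
    std = params['std'].loc[df['_cond_id']].to_numpy()
    out[feature_cols] = (df[feature_cols].to_numpy() - mean) / std
    return out

# Two regimes whose raw sensor levels differ far more than any degradation signal
rng = np.random.default_rng(0)
df = pd.DataFrame({
    '_cond_id': [0] * 50 + [1] * 50,
    'sensor_7': np.concatenate([rng.normal(550.0, 5.0, 50),
                                rng.normal(610.0, 5.0, 50)]),
})
params = compute_per_condition_scalers(df, ['sensor_7'])
norm = apply_per_condition_scalers(df, ['sensor_7'], params)
# each condition now has approximately zero mean and unit variance
```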

Line 71: Sequence Building

The _build_sequences_and_labels method converts the normalized DataFrame into sliding window sequences with corresponding RUL labels.
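The method body is summarized rather than listed here. Its mechanics can be sketched in a few lines (a simplified stand-in for `_build_sequences_and_labels`, assuming the `unit` and `cycle` column names from the loader): slide a window of `sequence_length` cycles over each engine, and label each window with the piecewise-linear RUL at its last cycle, capped at `max_rul`.

```python
import numpy as np
import pandas as pd

def build_sequences(df, feature_cols, seq_len, max_rul):
    """Per engine: slide a window of seq_len cycles; the label is the
    remaining cycles at the window's last cycle, capped at max_rul."""
    seqs, labels, unit_ids = [], [], []
    for unit, g in df.groupby('unit'):
        g = g.sort_values('cycle')
        feats = g[feature_cols].to_numpy()
        last_cycle = g['cycle'].max()
        for end in range(seq_len, len(g) + 1):
            seqs.append(feats[end - seq_len:end])
            rul = last_cycle - g['cycle'].iloc[end - 1]
            labels.append(min(rul, max_rul))   # piecewise-linear RUL cap
            unit_ids.append(unit)
    return (np.asarray(seqs, np.float32),
            np.asarray(labels, np.float32),
            np.asarray(unit_ids))

# One toy engine: 40 cycles, 2 features -> 11 windows of length 30
toy = pd.DataFrame({'unit': 1, 'cycle': np.arange(1, 41), 'f1': 0.0, 'f2': 1.0})
X, y, u = build_sequences(toy, ['f1', 'f2'], seq_len=30, max_rul=125)
# X.shape == (11, 30, 2); first label is 10, last (at failure) is 0
```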

Line 76: Data Type Conversion

Convert to float32 for memory efficiency. PyTorch expects float32 tensors for efficient GPU computation.

```python
class EnhancedNASACMAPSSDataset(Dataset):
    """
    Enhanced NASA C-MAPSS Dataset with comprehensive validation protocol
    Implements all checklist improvements for research-grade benchmarking
    """

    def __init__(self,
                 dataset_name: str = 'FD001',
                 data_dir: str = 'data/raw',
                 sequence_length: int = 30,
                 max_rul: int = 125,
                 train: bool = True,
                 scaler_params: Optional[Dict] = None,
                 validation_split: float = 0.0,
                 random_seed: int = 42,
                 enforce_feature_set: bool = True,
                 per_condition_norm: bool = True):
        """
        Initialize enhanced NASA C-MAPSS dataset

        Args:
            dataset_name: One of FD001, FD002, FD003, FD004
            data_dir: Directory containing the data files
            sequence_length: Length of input sequences
            max_rul: Maximum RUL value (piecewise linear RUL)
            train: Whether to load training or test data
            scaler_params: Pre-computed scaler parameters (prevents data leakage)
            validation_split: Fraction of training data for validation (0.0-0.3)
            random_seed: Random seed for reproducible splits
            enforce_feature_set: Whether to enforce 17 features (3 settings + 14 sensors)
            per_condition_norm: Whether to use per-condition normalization for FD002/FD004
        """
        assert dataset_name in {'FD001', 'FD002', 'FD003', 'FD004'}
        self.dataset_name = dataset_name
        self.data_dir = Path(data_dir)
        self.sequence_length = sequence_length
        self.max_rul = max_rul
        self.train = train
        self.validation_split = validation_split
        self.random_seed = random_seed
        self.enforce_feature_set = enforce_feature_set
        self.per_condition_norm = per_condition_norm

        # Set random seed for reproducible splits
        np.random.seed(random_seed)

        # Define feature columns based on literature-recommended sensors
        self.setting_cols = ['setting_1', 'setting_2', 'setting_3']
        sensor_keep = [
            'sensor_7', 'sensor_8', 'sensor_9', 'sensor_12', 'sensor_16',
            'sensor_17', 'sensor_20', 'sensor_2', 'sensor_3', 'sensor_4',
            'sensor_14', 'sensor_11', 'sensor_13', 'sensor_6'
        ]

        if self.enforce_feature_set:
            self.feature_columns = self.setting_cols + sensor_keep  # 17 features total
        else:
            # Use all 24 features (3 settings + 21 sensors)
            self.feature_columns = self.setting_cols + [f'sensor_{i}' for i in range(1, 22)]

        # Load data
        df = self._load_split(dataset_name, train)

        # Compute condition IDs for per-condition normalization
        df['_cond_id'] = self._compute_condition_id(df, self.setting_cols)

        # Store scaler parameters for consistent preprocessing (CRITICAL for preventing leakage)
        if train or scaler_params is None:
            self.scaler_params = self._compute_scalers(df, self.feature_columns,
                                                       per_condition=self.per_condition_norm)
        else:
            self.scaler_params = scaler_params

        # Apply normalization
        df_norm = self._apply_scalers(df, self.feature_columns, self.scaler_params,
                                      per_condition=self.per_condition_norm)

        # Build sequences and labels with unit IDs
        sequences, labels, unit_ids = self._build_sequences_and_labels(
            df_norm, self.feature_columns, self.sequence_length
        )

        self.sequences = sequences.astype(np.float32)
        self.targets = labels.astype(np.float32)
        self.sequence_unit_ids = unit_ids
```

Key Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Feature set | 17 features (enforced) | Literature consensus on informative sensors |
| Normalization | Per-condition for FD002/FD004 | Preserves degradation signals across regimes |
| Scaler handling | Pass from training to test | Prevents data leakage completely |
| Random seed | Fixed at 42 | Reproducible experiments |
| Data types | float32 for sequences | Memory efficient, GPU compatible |

Feature Configuration

The dataset uses exactly 17 features: 3 operating settings + 14 selected sensors. This selection is based on extensive literature analysis showing which sensors contain meaningful degradation information.

Selected Sensors

| Category | Features | Count |
| --- | --- | --- |
| Operating settings | setting_1, setting_2, setting_3 | 3 |
| Temperature sensors | sensor_2, sensor_3, sensor_4 | 3 |
| Pressure sensors | sensor_6, sensor_7, sensor_11 | 3 |
| Speed sensors | sensor_8, sensor_9, sensor_13, sensor_14 | 4 |
| Flow/other sensors | sensor_12, sensor_16, sensor_17, sensor_20 | 4 |

Why 17 Features?

The original C-MAPSS data has 24 features (3 settings + 21 sensors). However, sensors 1, 5, 10, 15, 18, 19, and 21 show near-constant values across all engines and provide no degradation information. Including them adds noise without signal.
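That screening is easy to reproduce: a sensor whose standard deviation is (near) zero across all engines carries no signal. A sketch on synthetic data (the function name, threshold, and sensor values are illustrative, not the paper's):

```python
import numpy as np
import pandas as pd

def near_constant_sensors(df: pd.DataFrame, threshold: float = 1e-6):
    """Return sensor columns whose std is below threshold (no usable signal)."""
    sensor_cols = [c for c in df.columns if c.startswith('sensor_')]
    stds = df[sensor_cols].std(ddof=0)
    return sorted(stds[stds < threshold].index)

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'sensor_1': np.full(200, 518.67),         # constant -> flagged
    'sensor_2': rng.normal(642.0, 0.5, 200),  # varies  -> kept
    'sensor_5': np.full(200, 14.62),          # constant -> flagged
})
print(near_constant_sensors(df))   # ['sensor_1', 'sensor_5']
```

On the real training files the same check flags exactly the seven sensors listed above.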


Data Loading Implementation

The _load_split method handles parsing the original NASA C-MAPSS text files, which use space-separated values without headers.

Loading C-MAPSS Data Files
File: models/enhanced_sota_rul_predictor.py

Line 1: Load Split Method

This method handles loading raw C-MAPSS data from the original NASA text files. It properly parses the space-separated format and assigns meaningful column names.

Line 3: Train/Test Selection

The split variable determines which file to load: 'train_FD001.txt' or 'test_FD001.txt'. The training and test sets are pre-defined by NASA.

Line 4: Path Construction

Constructs the full path to the data file using pathlib for cross-platform compatibility.

Example: data/raw/train_FD001.txt

Line 6: File Existence Check

Raises a clear error if the expected data file is missing. This helps debugging when data hasn't been downloaded or is in the wrong directory.

Line 10: Column Name Definition

C-MAPSS files have 26 columns: unit ID, cycle number, 3 operating settings, and 21 sensor readings. The original files have no headers.

Line 15: CSV Parsing

Uses pandas read_csv with whitespace separator (\s+) since C-MAPSS files are space-delimited. The header=None indicates no header row exists.

Line 18: Type Enforcement

Ensures unit and cycle columns are integers for proper grouping operations later. This prevents subtle bugs from float-based unit IDs.

```python
def _load_split(self, dataset_name, train):
    """Load train or test split from txt files"""
    split = 'train' if train else 'test'
    path = self.data_dir / f'{split}_{dataset_name}.txt'

    if not path.exists():
        raise FileNotFoundError(f"Expected C-MAPSS file not found: {path}")

    # Column names for NASA C-MAPSS data
    columns = ['unit', 'cycle'] + \
              [f'setting_{i}' for i in range(1, 4)] + \
              [f'sensor_{i}' for i in range(1, 22)]

    # Read the data with proper column names
    df = pd.read_csv(path, sep=r'\s+', header=None, names=columns)

    # Ensure integer types for unit and cycle columns
    df['unit'] = df['unit'].astype(int)
    df['cycle'] = df['cycle'].astype(int)

    return df
```

C-MAPSS File Format

| Column Index | Name | Description |
| --- | --- | --- |
| 0 | unit | Engine unit number (1-100 for FD001) |
| 1 | cycle | Operating cycle number within unit |
| 2-4 | setting_1/2/3 | Altitude, Mach, Throttle |
| 5-25 | sensor_1...21 | 21 sensor measurements |
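The loading path can be exercised end to end on a tiny synthetic file. Everything below except the 26-column layout is made up for the demonstration; the parsing calls mirror those in _load_split.

```python
import tempfile
from pathlib import Path
import pandas as pd

columns = ['unit', 'cycle'] + \
          [f'setting_{i}' for i in range(1, 4)] + \
          [f'sensor_{i}' for i in range(1, 22)]

# Two fake rows: unit 1, cycles 1-2, three settings, then 21 sensor values
row = ' '.join(['1', '{cycle}', '0.0', '0.0', '100.0'] + ['0.5'] * 21)
text = row.format(cycle=1) + '\n' + row.format(cycle=2) + '\n'

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / 'train_FD001.txt'
    path.write_text(text)
    # Space-delimited, headerless, names assigned at read time
    df = pd.read_csv(path, sep=r'\s+', header=None, names=columns)
    df['unit'] = df['unit'].astype(int)
    df['cycle'] = df['cycle'].astype(int)
# df now has 2 rows and all 26 named columns
```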

Operating Condition Clustering

For FD002 and FD004, engines operate across 6 different regimes. The _compute_condition_id method assigns each data row to its operating condition, enabling per-condition normalization.

Computing Operating Condition IDs
File: models/enhanced_sota_rul_predictor.py

Line 1: Condition ID Method

This method clusters data rows into operating conditions based on the three operating settings. Each unique combination of (altitude, Mach, throttle) gets a unique condition ID.

Line 3: Copy Settings Columns

Creates a copy of the setting columns to avoid modifying the original DataFrame during processing.

Line 4: Round Settings Values

Rounds settings to 3 decimal places to handle floating-point precision issues. Without rounding, identical conditions might be treated as different due to tiny numerical differences.

Example: 0.84 vs 0.8400001 would be treated as different conditions without rounding

Line 6: Create Condition Keys

Concatenates the three settings into a single string key using '|' as separator. This creates a unique identifier for each operating condition combination.

Example: '0.0|0.0|100.0' for sea-level takeoff

Line 7: Map to Integer IDs

Creates a dictionary mapping unique condition strings to integer IDs (0, 1, 2, ...). Sorting ensures consistent mapping across runs.

Example: FD002 has 6 unique conditions, so IDs are 0-5

Line 8: Return Integer Array

Maps each row's condition key to its integer ID and returns as a numpy array. This array is used for per-condition normalization.

```python
def _compute_condition_id(self, df, setting_cols):
    """Compute condition ID for each row based on operating settings"""
    s = df[setting_cols].copy()
    for c in setting_cols:
        s[c] = s[c].astype(float).round(3)
    keys = s.astype(str).agg('|'.join, axis=1)
    uniques = {k: i for i, k in enumerate(sorted(keys.unique()))}
    return keys.map(uniques).astype(int).values
```
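Lifting the same logic into a standalone function makes its behavior easy to verify on a toy frame (the setting values are invented; note how the floating-point jitter in the last row is absorbed by the rounding step):

```python
import pandas as pd

def compute_condition_id(df: pd.DataFrame, setting_cols):
    """Round settings, join them into a key, map sorted unique keys to 0..K-1."""
    s = df[setting_cols].copy()
    for c in setting_cols:
        s[c] = s[c].astype(float).round(3)
    keys = s.astype(str).agg('|'.join, axis=1)
    uniques = {k: i for i, k in enumerate(sorted(keys.unique()))}
    return keys.map(uniques).astype(int).values

toy = pd.DataFrame({
    'setting_1': [0.0, 0.0, 35.0, 35.0000004],   # last two collapse after rounding
    'setting_2': [0.0, 0.0, 0.84, 0.84],
    'setting_3': [100.0, 100.0, 100.0, 100.0],
})
ids = compute_condition_id(toy, ['setting_1', 'setting_2', 'setting_3'])
print(ids)   # [0 0 1 1]
```

Two distinct regimes produce two IDs, and rows 3 and 4 share an ID despite the tiny numerical difference in setting_1.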

The Six Operating Conditions

| Condition ID | Altitude (ft) | Mach | TRA | Flight Phase |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 100 | Ground idle / takeoff |
| 1 | 10,000 | 0.25 | 100 | Low altitude climb |
| 2 | 20,000 | 0.70 | 100 | Mid altitude cruise |
| 3 | 25,000 | 0.62 | 60 | Reduced thrust cruise |
| 4 | 35,000 | 0.84 | 100 | High altitude cruise |
| 5 | 42,000 | 0.84 | 100 | Max altitude cruise |

Why Per-Condition Normalization?

Without per-condition normalization, a sensor reading of 600 R might be "normal" at sea level but indicate severe degradation at high altitude. Per-condition normalization removes regime effects, allowing the model to focus on degradation signals.


Summary

In this section, we explored the EnhancedNASACMAPSSDataset class from our research implementation:

  1. Comprehensive initialization: 10+ parameters for full control over preprocessing
  2. 17-feature selection: Literature-validated sensor subset for optimal signal
  3. Proper data loading: Parses original NASA text files with correct column naming
  4. Condition clustering: Maps operating settings to integer IDs for per-condition normalization
  5. Leakage prevention: Scaler parameters passed from training to test data

| Component | Purpose | Impact |
| --- | --- | --- |
| enforce_feature_set | Use 17 informative features | Reduces noise, improves signal |
| per_condition_norm | Normalize within each regime | Critical for FD002/FD004 performance |
| scaler_params | Pass training statistics | Prevents data leakage |
| random_seed | Reproducible splits | Fair experimental comparison |
Looking Ahead: The Dataset class loads and normalizes data. The next section covers the DataLoader configuration—batch size, shuffling, and parallel loading—that determines training efficiency.

With the enhanced dataset implementation understood, we are ready to optimize the DataLoader for efficient training.