AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the EnhancedNASACMAPSSDataset architecture and how it implements research-grade preprocessing
Configure feature selection with the 17 informative features identified in literature
Load and parse C-MAPSS data files with proper column naming and type handling
Implement operating condition clustering for per-condition normalization on FD002/FD004
Prevent data leakage through proper scaler parameter management

Why This Matters: The Dataset class is the foundation of the entire training pipeline. Our EnhancedNASACMAPSSDataset implements critical preprocessing steps that many published methods overlook—including per-condition normalization, proper train/test separation, and reproducible splits. These details significantly impact model performance and evaluation validity.

PyTorch Dataset Basics

PyTorch's Dataset class provides a standard interface for data loading. Custom datasets must implement __len__ (total samples) and __getitem__ (sample access). Our enhanced dataset goes far beyond this minimum, implementing comprehensive preprocessing within the class itself.
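As a point of reference, the two-method interface can be sketched in a few lines. This is a minimal illustration, not the research class: `ToyWindowDataset` and its arguments are hypothetical names, and the import falls back to a plain `object` base so the sketch also runs where PyTorch is not installed.

```python
import numpy as np

try:
    from torch.utils.data import Dataset  # real base class when PyTorch is installed
except ImportError:
    Dataset = object  # fallback so the sketch still runs without PyTorch


class ToyWindowDataset(Dataset):
    """Hypothetical example: serves (window, target) pairs from in-memory arrays."""

    def __init__(self, sequences: np.ndarray, targets: np.ndarray):
        assert len(sequences) == len(targets)
        self.sequences = sequences.astype(np.float32)
        self.targets = targets.astype(np.float32)

    def __len__(self) -> int:
        # Total number of samples the loader can request.
        return len(self.sequences)

    def __getitem__(self, idx: int):
        # One (input window, RUL target) pair.
        return self.sequences[idx], self.targets[idx]


toy = ToyWindowDataset(np.zeros((5, 30, 17)), np.arange(5))
window, target = toy[2]
```

Each window here has shape (30, 17), matching the default sequence length and the 17-feature set used later in this section.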

Dataset Responsibilities

Responsibility	Implementation	Importance
Data loading	_load_split() method	Parse raw C-MAPSS files
Feature selection	enforce_feature_set parameter	Use 17 informative features
Condition clustering	_compute_condition_id() method	Enable per-condition norm
Normalization	_compute_scalers(), _apply_scalers()	Prevent data leakage
Sequence building	_build_sequences_and_labels()	Create sliding windows

EnhancedNASACMAPSSDataset Class

This is the actual implementation from our research code. The class handles all preprocessing steps in a single, cohesive unit, ensuring consistency between training and inference.

EnhancedNASACMAPSSDataset Initialization

🐍models/enhanced_sota_rul_predictor.py

Class Definition

EnhancedNASACMAPSSDataset inherits from PyTorch's Dataset class, enabling seamless integration with DataLoader for batching, shuffling, and parallel loading.

Constructor Parameters

The __init__ method accepts comprehensive configuration parameters that control dataset loading, preprocessing, and sequence construction. These parameters are carefully designed to prevent data leakage.

Dataset Selection

The dataset_name parameter selects which C-MAPSS sub-dataset to load. FD001 is simplest (1 condition, 1 fault), while FD004 is most complex (6 conditions, 2 faults).

EXAMPLE

dataset_name='FD001' for training on the baseline dataset

Scaler Parameters

The scaler_params parameter allows passing pre-computed normalization statistics from training data to test data. This is critical for preventing data leakage: test data must be normalized using training statistics only.

EXAMPLE

train_dataset.scaler_params passed to test dataset

Per-Condition Normalization Flag

When True, normalization is computed separately for each operating condition (6 regimes in FD002/FD004). This preserves degradation signals that would otherwise be masked by regime changes.

Random Seed

Setting the random seed ensures reproducible train/validation splits across experiments. This is essential for fair comparison between model variants.

Operating Settings Columns

The three operating settings (altitude, Mach number, throttle resolver angle) define the engine's operating regime and are used for per-condition normalization.

Selected Sensor Features

These 14 sensors were selected based on literature analysis for their variance and correlation with degradation. Constant sensors (like sensor_1, sensor_5) are excluded.

EXAMPLE

sensor_7 = Total pressure at HPC outlet (P30)

17-Feature Enforcement

By default, we use exactly 17 features (3 settings + 14 sensors) for consistency with prior work. Setting enforce_feature_set=False uses all 24 original features.

Data Loading

The _load_split method reads the raw C-MAPSS text files and returns a pandas DataFrame with proper column names.

Condition ID Computation

Each data row is assigned a condition ID based on its operating settings. This enables per-condition normalization for FD002/FD004.

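Scaler Computation (Training Only)

For training data, we compute normalization statistics (mean, std). For test data, we must use the pre-computed training statistics to prevent leakage. Note that the condition also computes fresh statistics when scaler_params is None, so a test dataset created without scaler_params is normalized with its own statistics.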

Normalization Application

Apply z-score normalization using the computed (or passed) scaler parameters. Per-condition normalization uses separate statistics for each operating regime.

Sequence Building

The _build_sequences_and_labels method converts the normalized DataFrame into sliding window sequences with corresponding RUL labels.

Data Type Conversion

Convert to float32 for memory efficiency. float32 is also PyTorch's default floating-point dtype, so the arrays convert to tensors without an extra cast.

from pathlib import Path
from typing import Dict, Optional

import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class EnhancedNASACMAPSSDataset(Dataset):
    """
    Enhanced NASA C-MAPSS Dataset with comprehensive validation protocol
    Implements all checklist improvements for research-grade benchmarking
    """

    def __init__(self,
                 dataset_name: str = 'FD001',
                 data_dir: str = 'data/raw',
                 sequence_length: int = 30,
                 max_rul: int = 125,
                 train: bool = True,
                 scaler_params: Optional[Dict] = None,
                 validation_split: float = 0.0,
                 random_seed: int = 42,
                 enforce_feature_set: bool = True,
                 per_condition_norm: bool = True):
        """
        Initialize enhanced NASA C-MAPSS dataset

        Args:
            dataset_name: One of FD001, FD002, FD003, FD004
            data_dir: Directory containing the data files
            sequence_length: Length of input sequences
            max_rul: Maximum RUL value (piecewise linear RUL)
            train: Whether to load training or test data
            scaler_params: Pre-computed scaler parameters (prevents data leakage)
            validation_split: Fraction of training data for validation (0.0-0.3)
            random_seed: Random seed for reproducible splits
            enforce_feature_set: Whether to enforce 17 features (3 settings + 14 sensors)
            per_condition_norm: Whether to use per-condition normalization for FD002/FD004
        """
        assert dataset_name in {'FD001', 'FD002', 'FD003', 'FD004'}
        self.dataset_name = dataset_name
        self.data_dir = Path(data_dir)
        self.sequence_length = sequence_length
        self.max_rul = max_rul
        self.train = train
        self.validation_split = validation_split
        self.random_seed = random_seed
        self.enforce_feature_set = enforce_feature_set
        self.per_condition_norm = per_condition_norm

        # Set random seed for reproducible splits (seeds NumPy's global RNG)
        np.random.seed(random_seed)

        # Define feature columns based on literature-recommended sensors
        self.setting_cols = ['setting_1', 'setting_2', 'setting_3']
        sensor_keep = [
            'sensor_7', 'sensor_8', 'sensor_9', 'sensor_12', 'sensor_16',
            'sensor_17', 'sensor_20', 'sensor_2', 'sensor_3', 'sensor_4',
            'sensor_14', 'sensor_11', 'sensor_13', 'sensor_6'
        ]

        if self.enforce_feature_set:
            self.feature_columns = self.setting_cols + sensor_keep  # 17 features total
        else:
            # Use all 24 features (3 settings + 21 sensors)
            self.feature_columns = self.setting_cols + [f'sensor_{i}' for i in range(1, 22)]

        # Load data
        df = self._load_split(dataset_name, train)

        # Compute condition IDs for per-condition normalization
        df['_cond_id'] = self._compute_condition_id(df, self.setting_cols)

        # Store scaler parameters for consistent preprocessing (CRITICAL for preventing leakage)
        # Note: with train=False and scaler_params=None, statistics are computed
        # from the test data itself, so always pass the training scaler_params.
        if train or scaler_params is None:
            self.scaler_params = self._compute_scalers(df, self.feature_columns,
                                                        per_condition=self.per_condition_norm)
        else:
            self.scaler_params = scaler_params

        # Apply normalization
        df_norm = self._apply_scalers(df, self.feature_columns, self.scaler_params,
                                       per_condition=self.per_condition_norm)

        # Build sequences and labels with unit IDs
        sequences, labels, unit_ids = self._build_sequences_and_labels(
            df_norm, self.feature_columns, self.sequence_length
        )

        self.sequences = sequences.astype(np.float32)
        self.targets = labels.astype(np.float32)
        self.sequence_unit_ids = unit_ids
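The listing calls _build_sequences_and_labels without showing its body. The following is a rough sketch of what such a step typically does, under stated assumptions: `build_windows` is a hypothetical stand-in, it handles training-style run-to-failure data only (the NASA test split ships its true RUL values in separate files), and it labels each window with the capped RUL at the window's last cycle.

```python
import numpy as np
import pandas as pd


def build_windows(df, feature_cols, sequence_length=30, max_rul=125):
    """Sketch: one window per valid end cycle, labelled with the capped RUL there."""
    sequences, labels, unit_ids = [], [], []
    for unit, group in df.groupby('unit'):
        group = group.sort_values('cycle')
        values = group[feature_cols].to_numpy()
        # Training-style RUL: cycles remaining until the last recorded cycle.
        rul = group['cycle'].max() - group['cycle'].to_numpy()
        rul = np.minimum(rul, max_rul)  # piecewise-linear cap
        for end in range(sequence_length, len(group) + 1):
            sequences.append(values[end - sequence_length:end])
            labels.append(rul[end - 1])
            unit_ids.append(unit)
    return np.array(sequences), np.array(labels), np.array(unit_ids)


# Tiny synthetic run-to-failure example: one engine, 6 cycles, 2 features.
demo = pd.DataFrame({
    'unit': [1] * 6,
    'cycle': range(1, 7),
    'f1': np.arange(6, dtype=float),
    'f2': np.arange(6, dtype=float) * 10,
})
X, y, units = build_windows(demo, ['f1', 'f2'], sequence_length=3, max_rul=2)
```

With 6 cycles and a window of 3, the engine yields 4 windows. Units shorter than the window length produce no samples in this sketch; the research code may pad them instead.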

Key Design Decisions

Decision	Choice	Rationale
Feature set	17 features (enforced)	Literature consensus on informative sensors
Normalization	Per-condition for FD002/FD004	Preserves degradation signals across regimes
Scaler handling	Pass from training to test	Keeps test statistics out of normalization
Random seed	Fixed at 42	Reproducible experiments
Data types	float32 for sequences	Memory efficient, GPU compatible

Feature Configuration

The dataset uses exactly 17 features: 3 operating settings + 14 selected sensors. This selection is based on extensive literature analysis showing which sensors contain meaningful degradation information.

Selected Sensors

Category	Features	Count
Operating settings	setting_1, setting_2, setting_3	3
Temperature sensors	sensor_2, sensor_3, sensor_4	3
Pressure sensors	sensor_6, sensor_7, sensor_11	3
Speed sensors	sensor_8, sensor_9, sensor_13, sensor_14	4
Fuel and bleed	sensor_12, sensor_16, sensor_17, sensor_20	4

Why 17 Features?

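The original C-MAPSS data has 24 features (3 settings + 21 sensors). This implementation drops sensors 1, 5, 10, 15, 18, 19, and 21. Several of these (for example sensors 1, 5, 10, 18, and 19) are constant in FD001 and carry no degradation information, so including them adds noise without signal. Note that many published FD001 pipelines keep sensors 15 and 21 and drop sensors 6 and 16 instead, so verify the list against a variance analysis of your own data before reusing it.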
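A quick variance check is one way to confirm which columns are effectively constant. The sketch below uses synthetic data; `near_constant_columns` and the tolerance are illustrative choices, not part of the research code.

```python
import numpy as np
import pandas as pd


def near_constant_columns(df: pd.DataFrame, cols, tol: float = 1e-6):
    """Return the columns whose standard deviation is at most `tol`."""
    stds = df[cols].std()
    return [c for c in cols if stds[c] <= tol]


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sensor_a': np.full(100, 518.67),                    # constant reading
    'sensor_b': 640.0 + rng.normal(0.0, 0.5, size=100),  # varies with noise
})
flat = near_constant_columns(demo, ['sensor_a', 'sensor_b'])
```

On the multi-condition subsets it is probably more informative to run the same check within each operating condition, since regime changes alone can make a sensor look variable.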

Data Loading Implementation

The _load_split method handles parsing the original NASA C-MAPSS text files, which use space-separated values without headers.

Loading C-MAPSS Data Files

🐍models/enhanced_sota_rul_predictor.py

Load Split Method

This method handles loading raw C-MAPSS data from the original NASA text files. It properly parses the space-separated format and assigns meaningful column names.

Train/Test Selection

The split variable determines which file to load: 'train_FD001.txt' or 'test_FD001.txt'. The training and test sets are pre-defined by NASA.

Path Construction

Constructs the full path to the data file using pathlib for cross-platform compatibility.

EXAMPLE

data/raw/train_FD001.txt

File Existence Check

Raises a clear error if the expected data file is missing. This helps debugging when data hasn't been downloaded or is in the wrong directory.

Column Name Definition

C-MAPSS files have 26 columns: unit ID, cycle number, 3 operating settings, and 21 sensor readings. The original files have no headers.

CSV Parsing

Uses pandas read_csv with whitespace separator (\s+) since C-MAPSS files are space-delimited. The header=None indicates no header row exists.

Type Enforcement

Ensures unit and cycle columns are integers for proper grouping operations later. This prevents subtle bugs from float-based unit IDs.

def _load_split(self, dataset_name, train):
    """Load train or test split from txt files"""
    split = 'train' if train else 'test'
    path = self.data_dir / f'{split}_{dataset_name}.txt'

    if not path.exists():
        raise FileNotFoundError(f"Expected C-MAPSS file not found: {path}")

    # Column names for NASA C-MAPSS data
    columns = ['unit', 'cycle'] + \
              [f'setting_{i}' for i in range(1, 4)] + \
              [f'sensor_{i}' for i in range(1, 22)]

    # Read the data with proper column names
    df = pd.read_csv(path, sep=r'\s+', header=None, names=columns)

    # Ensure integer types for unit and cycle columns
    df['unit'] = df['unit'].astype(int)
    df['cycle'] = df['cycle'].astype(int)

    return df

C-MAPSS File Format

Column Index	Name	Description
0	unit	Engine unit number (1-100 for FD001)
1	cycle	Operating cycle number within unit
2-4	setting_1/2/3	Altitude, Mach, Throttle
5-25	sensor_1...21	21 sensor measurements
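To try the parsing step without the NASA files, the same read_csv call can be pointed at an in-memory buffer. The two rows below are synthetic placeholders in the 26-column layout, not real C-MAPSS values.

```python
import io

import pandas as pd

columns = (['unit', 'cycle']
           + [f'setting_{i}' for i in range(1, 4)]
           + [f'sensor_{i}' for i in range(1, 22)])

# Two synthetic rows in the same layout (26 whitespace-separated numbers each).
row1 = ' '.join(['1', '1'] + ['0.5'] * 24)
row2 = ' '.join(['1', '2'] + ['0.7'] * 24)
raw = io.StringIO(row1 + '\n' + row2 + '\n')

# The regex separator collapses runs of spaces into a single delimiter.
df = pd.read_csv(raw, sep=r'\s+', header=None, names=columns)
```

The resulting frame has one row per engine cycle and 26 named columns, which is the shape the rest of the pipeline expects.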

Operating Condition Clustering

For FD002 and FD004, engines operate across 6 different regimes. The _compute_condition_id method assigns each data row to its operating condition, enabling per-condition normalization.

Computing Operating Condition IDs

🐍models/enhanced_sota_rul_predictor.py

Condition ID Method

This method clusters data rows into operating conditions based on the three operating settings. Each unique combination of (altitude, Mach, throttle) gets a unique condition ID.

Copy Settings Columns

Creates a copy of the setting columns to avoid modifying the original DataFrame during processing.

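Round Settings Values

Rounds settings to 3 decimal places to handle floating-point precision issues. Without rounding, identical conditions might be treated as different due to tiny numerical differences. Variation larger than the rounding precision still produces separate keys, so it is worth checking the number of distinct IDs against the expected number of regimes.

EXAMPLE

0.84 vs 0.8400001 would be different without rounding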

Create Condition Keys

Concatenates the three settings into a single string key using '|' as separator. This creates a unique identifier for each operating condition combination.

EXAMPLE

'0.0|0.0|100.0' for sea-level takeoff

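Map to Integer IDs

Creates a dictionary mapping unique condition strings to integer IDs (0, 1, 2, ...). Sorting makes the mapping deterministic for a given set of keys. Because the keys are strings, the order is lexicographic, and two splits receive the same IDs only when they contain the same set of keys.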

EXAMPLE

FD002 has 6 unique conditions, so IDs are 0-5

Return Integer Array

Maps each row's condition key to its integer ID and returns as a numpy array. This array is used for per-condition normalization.

def _compute_condition_id(self, df, setting_cols):
    """Compute condition ID for each row based on operating settings"""
    s = df[setting_cols].copy()
    # Round each setting so tiny float differences do not create new conditions
    for c in setting_cols:
        s[c] = s[c].astype(float).round(3)
    # One string key per row, e.g. '0.0|0.0|100.0'
    keys = s.astype(str).agg('|'.join, axis=1)
    # Sorted unique keys -> integer IDs 0, 1, 2, ...
    uniques = {k: i for i, k in enumerate(sorted(keys.unique()))}
    return keys.map(uniques).astype(int).values
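Because the method builds its key-to-ID dictionary from whichever DataFrame it receives, a split that lacks one of the conditions would number the remaining ones differently. One possible safeguard, sketched here with hypothetical helper names, is to fit the mapping on the training split and reuse it for the test split.

```python
import pandas as pd


def condition_keys(df, setting_cols, decimals=3):
    """Build one string key per row from the rounded operating settings."""
    rounded = df[setting_cols].astype(float).round(decimals)
    return rounded.astype(str).agg('|'.join, axis=1)


def fit_condition_map(train_df, setting_cols):
    """Assign integer IDs to the condition keys seen in the training split."""
    keys = condition_keys(train_df, setting_cols)
    return {k: i for i, k in enumerate(sorted(keys.unique()))}


def apply_condition_map(df, setting_cols, mapping):
    """Look up IDs with a fixed mapping; unseen conditions become -1."""
    keys = condition_keys(df, setting_cols)
    return keys.map(mapping).fillna(-1).astype(int).to_numpy()


cols = ['setting_1', 'setting_2', 'setting_3']
train = pd.DataFrame([[0.0, 0.0, 100.0], [25.0, 0.62, 60.0], [42.0, 0.84, 100.0]],
                     columns=cols)
test = pd.DataFrame([[42.0, 0.84, 100.0], [0.0, 0.0, 100.0]], columns=cols)

mapping = fit_condition_map(train, cols)
train_ids = apply_condition_map(train, cols, mapping)
test_ids = apply_condition_map(test, cols, mapping)
```

In this toy case the test split contains only two of the three conditions. Numbering it independently would give the 42.0 row ID 1, whereas the shared mapping keeps it at ID 2, matching the training split.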

The Six Operating Conditions

Condition ID	Altitude (ft)	Mach	TRA	Flight Phase
0	0	0	100	Ground idle / takeoff
1	10,000	0.25	100	Low altitude climb
2	20,000	0.70	100	Mid altitude cruise
3	25,000	0.62	60	Reduced thrust cruise
4	35,000	0.84	100	High altitude cruise
5	42,000	0.84	100	Max altitude cruise
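Whether rounding to three decimals recovers exactly these six regimes depends on how much the recorded settings vary around their nominal values. The sketch below uses synthetic settings with small added noise to show how the count of distinct conditions changes with the rounding precision; the noise level and the coarser per-column precisions are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

cols = ['setting_1', 'setting_2', 'setting_3']
nominal = np.array([[10.0, 0.25, 100.0], [25.0, 0.62, 60.0], [42.0, 0.84, 100.0]])

# Synthetic rows: each nominal condition repeated with small measurement noise.
rng = np.random.default_rng(0)
rows = np.repeat(nominal, 50, axis=0)
rows[:, 0] += rng.uniform(-0.004, 0.004, size=len(rows))  # noise on setting_1
df = pd.DataFrame(rows, columns=cols)


def count_conditions(frame, decimals):
    """Number of distinct setting triples after rounding each column."""
    rounded = frame[cols].round(decimals)
    return len(rounded.drop_duplicates())


fine = count_conditions(df, {'setting_1': 3, 'setting_2': 3, 'setting_3': 3})
coarse = count_conditions(df, {'setting_1': 0, 'setting_2': 2, 'setting_3': 0})
```

Here the coarse rounding recovers the three nominal conditions, while three-decimal rounding splits them into more groups. Running the same count on the real settings columns is a cheap sanity check before trusting the condition IDs.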

Why Per-Condition Normalization?

Without per-condition normalization, a sensor reading of 600 R might be "normal" at sea level but indicate severe degradation at high altitude. Per-condition normalization removes regime effects, allowing the model to focus on degradation signals.
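The effect can be seen on a toy sensor whose baseline shifts with the operating condition. This is a sketch of per-condition z-scoring in general; `fit_condition_stats` and `apply_condition_stats` are hypothetical helpers and are not claimed to mirror _compute_scalers or _apply_scalers.

```python
import numpy as np
import pandas as pd


def fit_condition_stats(df, feature_cols, cond_col='_cond_id'):
    """Per-condition mean and std, fitted on training rows."""
    grouped = df.groupby(cond_col)[feature_cols]
    return {'mean': grouped.mean(), 'std': grouped.std(ddof=0).replace(0.0, 1.0)}


def apply_condition_stats(df, feature_cols, stats, cond_col='_cond_id'):
    """Z-score each row with the statistics of its own operating condition."""
    out = df.copy()
    cond = df[cond_col].to_numpy()
    mean = stats['mean'].loc[cond].to_numpy()  # one row of statistics per data row
    std = stats['std'].loc[cond].to_numpy()
    out[feature_cols] = (df[feature_cols].to_numpy() - mean) / std
    return out


# Synthetic sensor whose baseline shifts with the operating condition.
demo = pd.DataFrame({
    '_cond_id': [0, 0, 0, 1, 1, 1],
    'sensor_x': [600.0, 601.0, 602.0, 400.0, 401.0, 402.0],
})
stats = fit_condition_stats(demo, ['sensor_x'])
normed = apply_condition_stats(demo, ['sensor_x'], stats)
```

After normalization both conditions map onto the same scale, so the 200-unit regime offset disappears and only the within-condition trend remains. A condition ID that was never seen when fitting raises a KeyError in this sketch, which is one more reason to keep the ID mapping consistent between splits.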

Summary

In this section, we explored the EnhancedNASACMAPSSDataset class from our research implementation:

Comprehensive initialization: 10 parameters for full control over preprocessing
17-feature selection: Literature-validated sensor subset for optimal signal
Proper data loading: Parses original NASA text files with correct column naming
Condition clustering: Maps operating settings to integer IDs for per-condition normalization
Leakage prevention: Scaler parameters passed from training to test data

Component	Purpose	Impact
enforce_feature_set	Use 17 informative features	Reduces noise, improves signal
per_condition_norm	Normalize within each regime	Critical for FD002/FD004 performance
scaler_params	Pass training statistics	Prevents data leakage
random_seed	Reproducible splits	Fair experimental comparison

Looking Ahead: The Dataset class loads and normalizes data. The next section covers the DataLoader configuration—batch size, shuffling, and parallel loading—that determines training efficiency.

With the enhanced dataset implementation understood, we are ready to optimize the DataLoader for efficient training.