AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Define data leakage and understand why it invalidates model evaluation
Identify leakage sources specific to time series and RUL prediction
Prevent normalization leakage by fitting statistics on training data only
Avoid temporal leakage through proper train/test separation
Implement a leakage-free pipeline that ensures valid evaluation

Why This Matters: Data leakage leads to overly optimistic results that don't generalize to deployment. A model that "cheats" by seeing test information during training will fail on new data. Rigorous leakage prevention is essential for trustworthy RUL predictions.

What is Data Leakage?

Data leakage occurs when information from the test set (or future data) influences the training process, leading to artificially good performance metrics.

The Core Problem

In a proper machine learning workflow:

$$\text{Training} \perp \text{Test} \quad \text{(independence)}$$

With leakage, this independence is violated:

$$\text{Training} \not\perp \text{Test} \quad \text{(information flow)}$$

Why Leakage is Dangerous

Aspect	With Leakage	Without Leakage
Reported RMSE	12 cycles	18 cycles
Real-world performance	18+ cycles	~18 cycles
Model trustworthiness	Misleading	Accurate
Deployment risk	High	Low

Leakage makes your model look better than it is. When deployed, it performs worse than expected—potentially with serious consequences in safety-critical applications like predictive maintenance.

Sources of Leakage in RUL Prediction

Several leakage sources are common in RUL prediction pipelines:

1. Normalization Leakage

Computing normalization statistics using all data (including test set).

2. Temporal Leakage

Using future information within a time series to predict present state.

3. Cross-Engine Leakage

Mixing samples from the same engine between train and validation sets.

4. Feature Engineering Leakage

Creating features (e.g., rolling statistics) using test data.

Leakage Type	How It Happens	Prevention
Normalization	Fit on all data	Fit on train only
Temporal	Future data in windows	Causal windowing
Cross-engine	Same engine in train/val	Engine-level splits
Feature engineering	Rolling stats include test	Compute on train only
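
To make "causal windowing" concrete, here is a minimal sketch that compares a causal rolling mean with a centered one. The function names and the toy signal are illustrative assumptions, not part of any library or of the C-MAPSS data.

```python
import numpy as np


def rolling_mean_causal(x, window):
    """Mean over the current step and up to window - 1 earlier steps."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for t in range(len(x)):
        start = max(0, t - window + 1)
        out[t] = x[start:t + 1].mean()
    return out


def rolling_mean_centered(x, window):
    """Mean over a window centered on step t: it reads future steps (leaky)."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    out = np.empty_like(x)
    for t in range(len(x)):
        out[t] = x[max(0, t - half):t + half + 1].mean()
    return out


# Toy signal with a spike at the last step
signal = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
causal = rolling_mean_causal(signal, window=3)
centered = rolling_mean_centered(signal, window=3)

print("causal  :", causal)
print("centered:", centered)
```

At step index 3 the causal feature averages only the values 2, 3 and 4, while the centered feature already includes the spike that occurs one step later. That early view of the future is what temporal leakage looks like at the feature level.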

Normalization Leakage

This is among the most common forms of leakage in preprocessing pipelines, because it only takes one misplaced fit() call.

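The Leak

When the mean and standard deviation are computed over the training and test data together, the test distribution shapes the scaling applied to the training inputs, so the model indirectly sees test information before evaluation.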

The Fix

Fit normalization statistics on training data only:

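$$\mu_{\text{train}} = \frac{1}{N_{\text{train}}} \sum_{i \in \text{train}} x_i, \qquad \sigma_{\text{train}} = \sqrt{\frac{1}{N_{\text{train}}} \sum_{i \in \text{train}} \left(x_i - \mu_{\text{train}}\right)^2}$$

Apply these statistics to both training and test data:

$$x_{\text{norm}} = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}} \quad \forall x \in \{\text{train}, \text{test}\}$$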

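```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder arrays of shape (n_samples, n_features); substitute real sensor data
rng = np.random.default_rng(0)
train_data = rng.normal(size=(100, 3))
test_data = rng.normal(size=(20, 3))

# WRONG: statistics are fitted on train and test together
leaky_scaler = StandardScaler()
leaky_scaler.fit(np.vstack([train_data, test_data]))  # LEAKAGE!

# CORRECT: statistics are fitted on the training data only
scaler = StandardScaler()
scaler.fit(train_data)
train_normalized = scaler.transform(train_data)
test_normalized = scaler.transform(test_data)  # reuses the training statistics
```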
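
The same computation can be written out by hand to mirror the formulas for the training mean and standard deviation. This is a small sketch on synthetic data: the arrays are random stand-ins, and the test set is shifted on purpose so that the effect of including it in the statistics is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for sensor readings; the test set is deliberately shifted
train_data = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
test_data = rng.normal(loc=2.0, scale=1.0, size=(50, 3))

# Statistics from the training rows only
mu_train = train_data.mean(axis=0)
sigma_train = train_data.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same statistics are applied to both splits
train_norm = (train_data - mu_train) / sigma_train
test_norm = (test_data - mu_train) / sigma_train

# Leaky alternative, computed for comparison only
mu_all = np.vstack([train_data, test_data]).mean(axis=0)

print("train-only mean:", mu_train)
print("all-data mean:  ", mu_all)
```

The normalized training data has zero mean and unit variance by construction. The normalized test data generally does not, and that is expected: it has been measured against the training distribution rather than its own.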

Temporal Leakage

Time series data introduces unique leakage opportunities through temporal dependencies.

Leakage in Sliding Windows

Consider creating overlapping windows from a time series:

```text
Engine trajectory: [x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x₉, x₁₀]

Window 1: [x₁, x₂, x₃]  → RUL = 7
Window 2: [x₂, x₃, x₄]  → RUL = 6
Window 3: [x₃, x₄, x₅]  → RUL = 5
...
```

If windows are shuffled and split randomly:

Temporal Leakage Scenario

Window 2 goes to training, Window 3 goes to validation. But Window 3 contains $x_3, x_4$ which appeared in Window 2! The model has effectively seen part of the validation data.
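
The overlap can be counted directly. The sketch below rebuilds the stride-1, length-3 windows over a 10-step trajectory, splits them at random, and lists which timesteps each validation window shares with the training windows. The helper name `make_windows` and the split sizes are assumptions made for illustration.

```python
import numpy as np


def make_windows(n_steps, window):
    """Return the timestep indices covered by each stride-1 window."""
    return [np.arange(s, s + window) for s in range(n_steps - window + 1)]


# 8 windows over a 10-step trajectory (0-based indices stand in for x1..x10)
windows = make_windows(n_steps=10, window=3)

# Random split over windows, ignoring which trajectory they came from
rng = np.random.default_rng(0)
order = rng.permutation(len(windows))
val_ids, train_ids = order[:2], order[2:]

# Every timestep that appears in at least one training window
train_steps = set(np.concatenate([windows[i] for i in train_ids]).tolist())

for i in val_ids:
    shared = sorted(set(windows[i].tolist()) & train_steps)
    print(f"validation window {i} shares timesteps {shared} with training windows")
```

With only two of eight overlapping windows held out, each validation window always has a neighbouring window in the training set, so some of its timesteps have already been seen during training regardless of the random seed.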

C-MAPSS Temporal Structure

For C-MAPSS, the dataset already handles this correctly:

Training set: Complete trajectories of engines 1-100 (or 1-260, etc.)
Test set: Different engines with truncated trajectories

There is no overlap between train and test engines—temporal leakage at the dataset level is prevented by design.

Validation Split Considerations

When creating a validation set from training data, we must maintain engine-level separation:

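```python
from sklearn.model_selection import train_test_split

# all_windows and windows_from_engines are assumed to be defined elsewhere:
# all_windows holds every window; windows_from_engines returns the windows
# that belong to the listed engine IDs.

# WRONG: random split over windows (leakage risk)
train_windows, val_windows = train_test_split(all_windows, test_size=0.2)

# CORRECT: engine-level split
train_engines = list(range(1, 81))    # engines 1-80 for training
val_engines = list(range(81, 101))    # engines 81-100 for validation

train_windows = windows_from_engines(train_engines)
val_windows = windows_from_engines(val_engines)
```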

Engine-Level Splitting

All windows from the same engine must go to the same split (train or validation). This prevents any temporal leakage through overlapping windows from the same trajectory.
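
A runnable sketch of an engine-level split, using NumPy only, might look like the following. The function name `split_by_engine` and the toy engine IDs are assumptions; scikit-learn's `GroupShuffleSplit` offers the same kind of grouped split if you prefer a library routine.

```python
import numpy as np


def split_by_engine(engine_ids, val_fraction=0.2, seed=0):
    """Return boolean masks (train, val) such that no engine appears in both."""
    engine_ids = np.asarray(engine_ids)
    engines = np.unique(engine_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(engines)  # shuffle engines, not individual windows
    n_val = max(1, int(round(val_fraction * len(engines))))
    val_engines = engines[:n_val]
    val_mask = np.isin(engine_ids, val_engines)
    return ~val_mask, val_mask


# Toy example: 10 engines with 5 windows each (hypothetical data)
window_engine_ids = np.repeat(np.arange(1, 11), 5)
train_mask, val_mask = split_by_engine(window_engine_ids, val_fraction=0.2)

print("validation engines:", np.unique(window_engine_ids[val_mask]))
print("training windows:", train_mask.sum(), "validation windows:", val_mask.sum())
```

The masks index the window array directly, so `windows[train_mask]` and `windows[val_mask]` never contain windows from the same engine.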

Leakage Prevention Protocol

We formalize a leakage-free preprocessing protocol:

Protocol Steps

1. Split first: Separate training and test data before any preprocessing
2. Fit on train: Compute all statistics (normalization, etc.) on training only
3. Transform both: Apply fitted transforms to train and test
4. Engine-level validation: Split by engine, not by window
5. Verify independence: Check that no information flows from test to train
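
The steps above can be sketched end to end on a toy array. This is one possible arrangement under stated assumptions: the function name `leakage_free_prepare`, the held-out engine list, and the small epsilon added to the standard deviation are all illustrative choices rather than a prescribed implementation.

```python
import numpy as np


def leakage_free_prepare(features, engine_ids, held_out_engines):
    """Split by engine, fit statistics on the training rows, transform both splits."""
    features = np.asarray(features, dtype=float)
    engine_ids = np.asarray(engine_ids)

    # Steps 1 and 4: split first, at the engine level
    held_out_mask = np.isin(engine_ids, held_out_engines)
    train_x, held_out_x = features[~held_out_mask], features[held_out_mask]

    # Step 2: fit statistics on the training rows only
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8  # epsilon avoids division by zero

    # Step 3: transform both splits with the training statistics
    train_z = (train_x - mean) / std
    held_out_z = (held_out_x - mean) / std

    # Step 5: verify that the two splits share no engine
    train_ids = set(engine_ids[~held_out_mask].tolist())
    held_out_ids = set(engine_ids[held_out_mask].tolist())
    if not train_ids.isdisjoint(held_out_ids):
        raise ValueError("Engine overlap between splits")

    return train_z, held_out_z, {"mean": mean, "std": std}


# Toy data: 5 engines with 20 rows each and 4 features (hypothetical)
rng = np.random.default_rng(0)
engine_ids = np.repeat(np.arange(1, 6), 20)
features = rng.normal(size=(100, 4))

train_z, held_out_z, stats = leakage_free_prepare(features, engine_ids, held_out_engines=[5])
print(train_z.shape, held_out_z.shape)
```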

Implementation Pattern

```python
import numpy as np


class LeakageFreePipeline:
    """Per-condition normalization whose statistics come from training data only."""

    def __init__(self, eps=1e-8):
        self.fitted = False
        self.condition_stats = None
        self.eps = eps  # guards against division by zero for constant sensors

    def fit(self, train_data, train_conditions):
        """Fit normalization on training data only."""
        self.condition_stats = {}
        for cond in np.unique(train_conditions):
            mask = train_conditions == cond
            self.condition_stats[cond] = {
                'mean': np.mean(train_data[mask], axis=0),
                'std': np.std(train_data[mask], axis=0),
            }
        self.fitted = True
        return self

    def _find_closest(self, cond):
        """Return the fitted condition nearest to an unseen one.

        Assumes conditions are numeric labels, so "closest" means the
        smallest absolute difference.
        """
        return min(self.condition_stats, key=lambda known: abs(known - cond))

    def transform(self, data, conditions):
        """Transform data using pre-fitted statistics."""
        if not self.fitted:
            raise ValueError("Pipeline must be fitted before transform!")

        # Float output so integer inputs are not truncated
        normalized = np.zeros_like(data, dtype=float)
        for cond in np.unique(conditions):
            # Build the mask from the original label before any fallback lookup
            mask = conditions == cond
            key = cond
            if key not in self.condition_stats:
                # Use closest condition for unseen conditions
                key = self._find_closest(cond)

            stats = self.condition_stats[key]
            normalized[mask] = (data[mask] - stats['mean']) / (stats['std'] + self.eps)

        return normalized

    def fit_transform(self, train_data, train_conditions):
        """Fit and transform training data."""
        self.fit(train_data, train_conditions)
        return self.transform(train_data, train_conditions)
```

Verification Checklist

Check	How to Verify	Expected Result
No test data in fitting	Inspect fit() input	Only train_data
Split before statistics	Check code order	Split happens first
Engine-level splits	Check split function	Groups by engine_id
Transform uses fitted statistics	No fit() call inside transform()	Reuses self.condition_stats
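
The first check in the table can also be automated with a perturbation test: shift the test data and see whether the fitted statistics move. The sketch below is one way to do this; the function names are hypothetical, and `fit_stats_leaky` is deliberately wrong so the check has something to catch.

```python
import numpy as np


def fit_stats(train_x):
    """Training-only statistics (the behaviour we want to verify)."""
    return train_x.mean(axis=0), train_x.std(axis=0)


def fit_stats_leaky(train_x, test_x):
    """Deliberately wrong: statistics over train and test together."""
    both = np.vstack([train_x, test_x])
    return both.mean(axis=0), both.std(axis=0)


def stats_depend_on_test(fit_fn, train_x, test_x, shift=10.0):
    """Return True if shifting the test data changes the fitted statistics."""
    before = fit_fn(train_x, test_x)
    after = fit_fn(train_x, test_x + shift)
    return not all(np.allclose(b, a) for b, a in zip(before, after))


rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 3))
test_x = rng.normal(size=(30, 3))

# The clean fit ignores its second argument; the leaky fit does not
clean = stats_depend_on_test(lambda tr, te: fit_stats(tr), train_x, test_x)
leaky = stats_depend_on_test(fit_stats_leaky, train_x, test_x)
print(clean, leaky)  # False True
```

A test like this fits naturally into a unit-test suite, where it guards against a later refactor quietly reintroducing the leak.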

Defensive Coding

Add assertions to catch leakage attempts:

```python
assert len(set(train_engines) & set(test_engines)) == 0, "Engine overlap!"
```
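
A slightly more defensive variant raises an exception instead of using `assert`, since assertions are skipped when Python runs with the `-O` flag. The helper name `check_disjoint_engines` is an assumption made for this sketch.

```python
def check_disjoint_engines(train_engines, other_engines, name="test"):
    """Raise ValueError if any engine ID appears in both collections."""
    overlap = set(train_engines) & set(other_engines)
    if overlap:
        raise ValueError(
            f"Engine overlap between train and {name}: {sorted(overlap)}"
        )


# Passes silently: engines 1-80 versus engines 81-100
check_disjoint_engines(range(1, 81), range(81, 101), name="validation")
```

This check compares raw IDs. If two files number their engines independently, the IDs need to be made unique first (for example by pairing each ID with a file label) before the comparison is meaningful.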

Summary

In this section, we addressed the critical issue of data leakage:

Data leakage: Test information influencing training, leading to invalid evaluation
Normalization leakage: Fitting statistics on all data; fix by fitting on train only
Temporal leakage: Overlapping windows across splits; fix with engine-level separation
Prevention protocol: Split first, fit on train, transform both
Verification: Check independence, add assertions, audit code order

Leakage Type	Risk	Prevention
Normalization	High (very common)	fit() on train only
Temporal	Medium (with overlapping windows)	Engine-level splits
Feature engineering	Medium	Compute on train only, after the split
Label leakage	Low (for C-MAPSS)	RUL computed from cycle count

Looking Ahead: With leakage-free normalization in place, we need to create the input sequences for our model. The next section covers the sliding window approach, which extracts fixed-length windows from variable-length trajectories.