Learning Objectives
By the end of this section, you will:
- Define data leakage and understand why it invalidates model evaluation
- Identify leakage sources specific to time series and RUL prediction
- Prevent normalization leakage by fitting statistics on training data only
- Avoid temporal leakage through proper train/test separation
- Implement a leakage-free pipeline that ensures valid evaluation
Why This Matters: Data leakage leads to overly optimistic results that don't generalize to deployment. A model that "cheats" by seeing test information during training will fail on new data. Rigorous leakage prevention is essential for trustworthy RUL predictions.
What is Data Leakage?
Data leakage occurs when information from the test set (or future data) influences the training process, leading to artificially good performance metrics.
The Core Problem
In a proper machine learning workflow, the test set remains completely independent of training: the model is fit on training data alone, and the test set is touched only once, for final evaluation.
With leakage, this independence is violated: information from the test set (or from the future) influences model fitting, so the evaluation no longer measures generalization.
Why Leakage is Dangerous
| Aspect | With Leakage | Without Leakage |
|---|---|---|
| Reported RMSE | 12 cycles | 18 cycles |
| Real-world performance | 18+ cycles | ~18 cycles |
| Model trustworthiness | Misleading | Accurate |
| Deployment risk | High | Low |
Leakage makes your model look better than it is. When deployed, it performs worse than expected—potentially with serious consequences in safety-critical applications like predictive maintenance.
Sources of Leakage in RUL Prediction
Several leakage sources are common in RUL prediction pipelines:
1. Normalization Leakage
Computing normalization statistics using all data (including test set).
2. Temporal Leakage
Using future information within a time series to predict present state.
3. Cross-Engine Leakage
Mixing samples from the same engine between train and validation sets.
4. Feature Engineering Leakage
Creating features (e.g., rolling statistics) using test data.
| Leakage Type | How It Happens | Prevention |
|---|---|---|
| Normalization | Fit on all data | Fit on train only |
| Temporal | Future data in windows | Causal windowing |
| Cross-engine | Same engine in train/val | Engine-level splits |
| Feature engineering | Rolling stats include test | Compute on train only |
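As a sketch of leakage-free feature engineering, rolling statistics can be computed per engine within each split, using only past values (a causal window). The function name, `engine_id` column, and window length here are illustrative, not part of any standard API:

```python
import pandas as pd

def add_causal_rolling_mean(df, sensor_cols, window=5):
    """Add rolling-mean features using only past values within each engine.

    Called separately on the train and test splits, so no statistics
    flow across the train/test boundary.
    """
    out = df.copy()
    for col in sensor_cols:
        # groupby('engine_id') keeps engines separate; min_periods=1 makes
        # the feature defined from the first cycle onward.
        out[f'{col}_rmean'] = (
            df.groupby('engine_id')[col]
              .transform(lambda s: s.rolling(window, min_periods=1).mean())
        )
    return out

# Illustrative usage on a toy two-engine table
toy = pd.DataFrame({'engine_id': [1, 1, 1, 2, 2],
                    's1': [1.0, 2.0, 3.0, 10.0, 20.0]})
features = add_causal_rolling_mean(toy, ['s1'], window=2)
```

Because the rolling window only looks backward within one engine, the same call is safe on either split.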
Normalization Leakage
This is the most common leakage in preprocessing pipelines.
The Leak
Computing normalization statistics on the combined training and test data:

```python
# WRONG: Fit on all data
scaler = StandardScaler()
scaler.fit(np.vstack([train_data, test_data]))  # LEAKAGE!
```

The Fix
Fit normalization statistics on training data only, then apply those statistics to both training and test data:

```python
# CORRECT: Fit on training only
scaler = StandardScaler()
scaler.fit(train_data)  # Training only
train_normalized = scaler.transform(train_data)
test_normalized = scaler.transform(test_data)  # Use train statistics
```

Temporal Leakage
Time series data introduces unique leakage opportunities through temporal dependencies.
Leakage in Sliding Windows
Consider creating overlapping windows from a time series:
```
Engine trajectory: [x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x₉, x₁₀]

Window 1: [x₁, x₂, x₃] → RUL = 7
Window 2: [x₂, x₃, x₄] → RUL = 6
Window 3: [x₃, x₄, x₅] → RUL = 5
...
```

If windows are shuffled and split randomly:
Temporal Leakage Scenario
Window 2 goes to training, Window 3 goes to validation. But Window 3 contains x₃ and x₄, which appeared in Window 2! The model has effectively seen part of the validation data.
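The windowing scheme above can be sketched as follows; the function name and window length are illustrative, and each RUL label is the number of cycles remaining after the window's last point:

```python
import numpy as np

def make_windows(trajectory, window_len):
    """Slice one engine's trajectory into causal, overlapping windows.

    Each window ends at some cycle t and uses only points up to t;
    its label is the number of cycles remaining after t.
    """
    n = len(trajectory)
    windows, ruls = [], []
    for end in range(window_len, n + 1):
        windows.append(trajectory[end - window_len:end])
        ruls.append(n - end)  # cycles remaining after this window
    return np.array(windows), np.array(ruls)

# 10-cycle toy trajectory, matching the diagram above
traj = np.arange(1, 11)
X, y = make_windows(traj, window_len=3)
# First window [1, 2, 3] gets RUL = 7, exactly as in the diagram
```

Note how consecutive windows share all but one point; this overlap is precisely why a random window-level split leaks.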
C-MAPSS Temporal Structure
For C-MAPSS, the dataset already handles this correctly:
- Training set: Complete trajectories of engines 1-100 (or 1-260, etc.)
- Test set: Different engines with truncated trajectories
There is no overlap between train and test engines—temporal leakage at the dataset level is prevented by design.
Validation Split Considerations
When creating a validation set from training data, we must maintain engine-level separation:
```python
# WRONG: Random sample split (leakage risk)
train_windows, val_windows = train_test_split(all_windows, test_size=0.2)

# CORRECT: Engine-level split
train_engines = list(range(1, 81))    # engines 1-80 for training
val_engines = list(range(81, 101))    # engines 81-100 for validation

train_windows = windows_from_engines(train_engines)
val_windows = windows_from_engines(val_engines)
```

Engine-Level Splitting
All windows from the same engine must go to the same split (train or validation). This prevents any temporal leakage through overlapping windows from the same trajectory.
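One way to enforce this, assuming each window carries the ID of the engine that produced it, is scikit-learn's `GroupShuffleSplit`; the array shapes and variable names below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One row per window; engine_ids[i] is the engine that produced window i
windows = np.random.rand(500, 30, 14)         # (n_windows, window_len, n_features)
engine_ids = np.repeat(np.arange(1, 101), 5)  # 100 engines, 5 windows each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(windows, groups=engine_ids))

# Every window from a given engine lands entirely in train or in val
assert set(engine_ids[train_idx]).isdisjoint(engine_ids[val_idx])
```

`GroupShuffleSplit` shuffles and splits the *groups* (engines), so `test_size=0.2` holds out 20% of engines rather than 20% of windows.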
Leakage Prevention Protocol
We formalize a leakage-free preprocessing protocol:
Protocol Steps
- Split first: Separate training and test data before any preprocessing
- Fit on train: Compute all statistics (normalization, etc.) on training only
- Transform both: Apply fitted transforms to train and test
- Engine-level validation: Split by engine, not by window
- Verify independence: Check that no information flows from test to train
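The protocol steps above can be sketched end-to-end on toy data; this is a minimal illustration assuming a plain `StandardScaler` and an engine-ID array alongside the sensor matrix, not a full pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
engine_id = np.repeat(np.arange(1, 11), 20)  # 10 engines, 20 cycles each
data = rng.normal(size=(200, 4))             # toy sensor matrix

# 1 & 4. Split first, at the engine level: engines 1-8 train, 9-10 held out
train_mask = engine_id <= 8
train_data, test_data = data[train_mask], data[~train_mask]

# 2. Fit all statistics on the training split only
scaler = StandardScaler().fit(train_data)

# 3. Transform both splits with the train-fitted statistics
train_norm = scaler.transform(train_data)
test_norm = scaler.transform(test_data)

# 5. Verify independence: refitting on train alone gives identical statistics
assert np.allclose(scaler.mean_, train_data.mean(axis=0))
```

The key property is ordering: the split happens before any statistic is computed, so nothing computed downstream can see the held-out engines.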
Implementation Pattern
```python
import numpy as np

class LeakageFreePipeline:
    def __init__(self):
        self.fitted = False
        self.condition_stats = None

    def fit(self, train_data, train_conditions):
        """Fit per-condition normalization statistics on training data only."""
        self.condition_stats = {}
        for cond in np.unique(train_conditions):
            mask = train_conditions == cond
            self.condition_stats[cond] = {
                'mean': np.mean(train_data[mask], axis=0),
                'std': np.std(train_data[mask], axis=0)
            }
        self.fitted = True
        return self

    def transform(self, data, conditions):
        """Transform data using the pre-fitted statistics."""
        if not self.fitted:
            raise ValueError("Pipeline must be fitted before transform!")

        normalized = np.zeros_like(data, dtype=float)
        for cond in np.unique(conditions):
            mask = conditions == cond  # select rows by the original condition
            # Fall back to the closest fitted condition for unseen conditions
            key = cond if cond in self.condition_stats else self._find_closest(cond)
            stats = self.condition_stats[key]
            normalized[mask] = (data[mask] - stats['mean']) / stats['std']

        return normalized

    def fit_transform(self, train_data, train_conditions):
        """Fit on training data, then transform it."""
        self.fit(train_data, train_conditions)
        return self.transform(train_data, train_conditions)

    def _find_closest(self, cond):
        """Return the nearest fitted condition (by absolute distance)."""
        known = np.array(list(self.condition_stats.keys()))
        return known[np.argmin(np.abs(known - cond))]
```

Verification Checklist
| Check | How to Verify | Expected Result |
|---|---|---|
| No test data in fitting | Inspect fit() input | Only train_data |
| Statistics before split | Check code order | Split happens first |
| Engine-level splits | Check split function | Groups by engine_id |
| Transform uses fitted | No fit() in transform | Reuses self.stats |
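The first check in the table can be automated. The function below is a heuristic sketch of ours, assuming an sklearn-style scaler that exposes a `mean_` attribute; it passes only when the fitted mean matches the train-only mean and not the combined train+test mean (which presumes the two differ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def check_fit_on_train_only(scaler, train_data, test_data, atol=1e-8):
    """Heuristic leakage check for a fitted sklearn-style scaler."""
    train_mean = train_data.mean(axis=0)
    combined_mean = np.vstack([train_data, test_data]).mean(axis=0)
    fit_on_train = np.allclose(scaler.mean_, train_mean, atol=atol)
    fit_on_all = np.allclose(scaler.mean_, combined_mean, atol=atol)
    return fit_on_train and not fit_on_all

# Toy data where test is distribution-shifted, so the two means differ
rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, (100, 3))
test = rng.normal(2.0, 1.0, (50, 3))

good = StandardScaler().fit(train)                       # correct
bad = StandardScaler().fit(np.vstack([train, test]))     # leaky
```

Running the check on `good` returns True and on `bad` returns False, turning a code-review item into an executable test.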
Defensive Coding
Add assertions to catch leakage attempts:
```python
# No engine may appear in both splits
assert len(set(train_engines) & set(test_engines)) == 0, "Engine overlap!"
```

Summary
In this section, we addressed the critical issue of data leakage:
- Data leakage: Test information influencing training, leading to invalid evaluation
- Normalization leakage: Fitting statistics on all data; fix by fitting on train only
- Temporal leakage: Overlapping windows across splits; fix with engine-level separation
- Prevention protocol: Split first, fit on train, transform both
- Verification: Check independence, add assertions, audit code order
| Leakage Type | Risk | Prevention |
|---|---|---|
| Normalization | High (very common) | fit() on train only |
| Temporal | Medium (with overlapping windows) | Engine-level splits |
| Feature engineering | Medium | Compute within each split, on train only |
| Label leakage | Low (for C-MAPSS) | RUL computed from cycle count |
Looking Ahead: With leakage-free normalization in place, we need to create the input sequences for our model. The sliding window approach extracts fixed-length windows from variable-length trajectories—the topic of our next section.
With data leakage prevented, we are ready to construct the sliding window sequences that feed our model.