The Doping Stadium
An athlete who reads the test list before competing has an unfair advantage. Even if his actual race was honest, the knowledge that a particular drug is being screened changes how he prepares. That is leakage in its purest form: information from the evaluation set polluting the training process.
In ML pipelines the equivalent is computing statistics (means, standard deviations, scaling factors, k-means centroids) from data the model will later be evaluated on. The model trains and tests well. Then you deploy and watch the metrics collapse, because production data never arrives with its own statistics already folded into your preprocessing.
What Counts as Test-Set Leakage Here
| Pipeline step | Leakage version | Correct version |
|---|---|---|
| k-means cluster discovery | fit_predict on train+test combined | fit on train, predict on test |
| Per-condition mean/std | Compute on full dataset | Compute on train slice only |
| Feature selection | Select using test correlations with RUL | Select using train data; apply to test |
| Hyperparameter tuning | Pick lambda from test loss | Pick from a held-out validation slice |
| RUL cap | Set R_max from test population | Set R_max from train (R_max=125 is conventional) |
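The hyperparameter row of the table can be sketched as follows. This is a minimal illustration, not the author's pipeline: it assumes lambda is a ridge penalty, uses synthetic data, and the helper names `ridge_fit` and `mse` are hypothetical. The test slice is touched exactly once, after lambda has been chosen on the validation slice.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights for penalty lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


# Synthetic regression data (assumption: stands in for the real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=300)

# Split once into train / validation / test.
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

# Select lambda using validation loss only.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_loss = {lam: mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)) for lam in lambdas}
best_lam = min(val_loss, key=val_loss.get)

# Test loss is computed once and never feeds back into selection.
w = ridge_fit(X_tr, y_tr, best_lam)
test_loss = mse(X_te, y_te, w)
print(best_lam, test_loss)
```

If the test loss were used inside the `min(...)` call instead, the reported number would be optimistically biased, because lambda would have been tuned to that exact slice.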
The Train-Then-Apply Discipline
A clean pipeline has two distinct phases. Phase 1 reads only train data, computes all statistics, and persists them. Phase 2 reads test data and the persisted statistics, and never goes back to train. Encoding this as code-level separation (different functions, different files, different processes) is the most reliable way to make it survive code review and team rotation.
Python: One Bundle, Two Apply Calls
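A minimal sketch of the two-phase discipline, assuming per-condition mean/std normalisation on NumPy arrays. The function names `fit_bundle` and `apply_bundle`, the file name, and the toy data are hypothetical. The standard-library `pickle` is used here for persistence; `joblib.dump` and `joblib.load` would play the same role.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def fit_bundle(X_train, cond_train, eps=1e-8):
    """Phase 1: compute per-condition mean/std from train rows only."""
    bundle = {}
    for c in np.unique(cond_train):
        rows = X_train[cond_train == c]
        bundle[int(c)] = {"mean": rows.mean(axis=0), "std": rows.std(axis=0) + eps}
    return bundle


def apply_bundle(bundle, X, cond):
    """Phase 2: normalise any split with the persisted train statistics."""
    out = np.empty_like(X, dtype=float)
    for c in np.unique(cond):
        stats = bundle[int(c)]  # KeyError if the condition was never seen in train
        mask = cond == c
        out[mask] = (X[mask] - stats["mean"]) / stats["std"]
    return out


# Toy data (assumption): 2 operating conditions, 3 sensors.
rng = np.random.default_rng(0)
cond_train = rng.integers(0, 2, size=200)
cond_test = rng.integers(0, 2, size=50)
X_train = rng.normal(loc=cond_train[:, None] * 10.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=cond_test[:, None] * 10.0, scale=2.0, size=(50, 3))

# Fit once, on train, and persist.
bundle = fit_bundle(X_train, cond_train)
path = Path(tempfile.mkdtemp()) / "norm_bundle.pkl"
path.write_bytes(pickle.dumps(bundle))

# Test side: load the bundle, never fit.
loaded = pickle.loads(path.read_bytes())
Z_train = apply_bundle(loaded, X_train, cond_train)  # apply call 1
Z_test = apply_bundle(loaded, X_test, cond_test)     # apply call 2
```

Note that `apply_bundle` has no code path that computes a mean or std, so it cannot re-fit on test even by accident.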
PyTorch: state_dict Carries the Statistics
For PyTorch deployment, a single Module wraps the entire pipeline, including the per-condition normaliser's buffers and the downstream model's parameters. state_dict captures both; load_state_dict restores everything.
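One way this could look, as a sketch rather than the original implementation: the class name `NormalisedRegressor`, the method `fit_normaliser`, the buffer names, and the single linear head are all assumptions made for illustration. The statistics live in buffers registered with `register_buffer`, so they travel inside `state_dict` alongside the learned weights.

```python
import io

import torch
from torch import nn


class NormalisedRegressor(nn.Module):
    """Per-condition normaliser plus a linear head in one Module (hypothetical)."""

    def __init__(self, n_conditions, n_features):
        super().__init__()
        # Buffers are saved in state_dict but are not trained by the optimiser.
        self.register_buffer("cond_mean", torch.zeros(n_conditions, n_features))
        self.register_buffer("cond_std", torch.ones(n_conditions, n_features))
        self.head = nn.Linear(n_features, 1)

    @torch.no_grad()
    def fit_normaliser(self, x_train, cond_train, eps=1e-8):
        """Fill the buffers from train rows only."""
        for c in range(self.cond_mean.shape[0]):
            rows = x_train[cond_train == c]
            self.cond_mean[c] = rows.mean(dim=0)
            self.cond_std[c] = rows.std(dim=0, unbiased=False) + eps

    def forward(self, x, cond):
        # Look up each row's condition statistics, then normalise and predict.
        z = (x - self.cond_mean[cond]) / self.cond_std[cond]
        return self.head(z)


torch.manual_seed(0)
x_train = torch.randn(120, 4) * 3.0 + 5.0
cond_train = torch.randint(0, 3, (120,))
x_test = torch.randn(30, 4) * 3.0 + 5.0
cond_test = torch.randint(0, 3, (30,))

model = NormalisedRegressor(n_conditions=3, n_features=4)
model.fit_normaliser(x_train, cond_train)

# Persist buffers and parameters together (in-memory file for the example).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# Deployment side: rebuild the Module and load; nothing is re-fitted.
restored = NormalisedRegressor(n_conditions=3, n_features=4)
restored.load_state_dict(torch.load(buf))
restored.eval()
```

Because the deployed object only calls `load_state_dict` and `forward`, the test path never invokes `fit_normaliser`.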
Common Leakage Patterns Across ML
| Mistake | Why it leaks | Fix |
|---|---|---|
| Tokeniser fit on full corpus | Vocabulary contains test-only tokens | Fit on train only |
| Image normalisation on full dataset | Mean/std encode test colour distribution | Fit on train only |
| StandardScaler() in a sklearn Pipeline that is fitted on data including test | Same as above | Fit the Pipeline on train only; cross_val_score re-fits it inside each fold |
| Stratified split on label distribution | Test labels affect train sampling | Split BEFORE looking at labels |
| Time-series random split | Future leaks into past | Time-aware split |
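The time-aware split in the last row can be sketched as below. The helper name `time_aware_split` and the toy timestamps are hypothetical. The sketch assumes each row has a single sortable timestamp; if timestamps tie across the cut-off, the boundary would need extra handling.

```python
import numpy as np


def time_aware_split(timestamps, test_fraction=0.2):
    """Return (train_idx, test_idx) so that every test row is later than train."""
    order = np.argsort(timestamps, kind="stable")
    n_test = max(1, int(round(len(order) * test_fraction)))
    return order[:-n_test], order[-n_test:]


# Toy, unordered timestamps (assumption: one per row, no ties).
timestamps = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 0])
train_idx, test_idx = time_aware_split(timestamps, test_fraction=0.3)

# The latest train timestamp precedes the earliest test timestamp.
print(timestamps[train_idx].max(), timestamps[test_idx].min())
```

A random split of the same rows would scatter late timestamps into train, which is exactly the "future leaks into past" failure in the table.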
Three Subtle Leakage Failures
Pipeline([('scaler', StandardScaler()), ('km', KMeans()), ('lr', LR())]) followed by .fit(X_full, y_full) leaks because all three steps see the entire dataset. Fit on train only with .fit(X_train, y_train), then call .predict(X_test), which applies the train-fitted scaler and centroids to the test rows.

The point. Per-condition normalisation can either be the cleanest preprocessing step in your pipeline or the sneakiest source of false confidence. The difference is one sentence: fit on train, persist, never re-fit on test.
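A sketch of the leak-free version of the scaler, k-means, and LR pipeline discussed in this section, on synthetic data. It assumes LR stands for `LinearRegression`; the cluster count and the split sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Split first; everything below sees X_test only at predict time.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("km", KMeans(n_clusters=3, n_init=10, random_state=0)),  # transform = distances to centroids
    ("lr", LinearRegression()),
])

# Model selection on train: each fold re-fits scaler, centroids, and regressor.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on train only, then apply to test.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
```

After fitting, `pipe.named_steps["scaler"].mean_` equals the train mean, which is a quick way to confirm that no test rows reached the scaler.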
Takeaway
- Statistics computed on test data leak. Even something as innocent as the global mean.
- Persist the bundle. joblib for sklearn artefacts; torch.save for PyTorch state_dict. Test loads, never fits.
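- Diagnose with post-normalisation test stats. Mean ~ 0 and std ~ 1 on test means the train-fitted bundle still matches the data. Big deviations mean either bundle mismatch or distribution shift. The check cannot prove the absence of leakage: re-fitting on test would produce exactly 0 and 1.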
- Two artefacts in production. One torch.save for tensors, one joblib for the sklearn estimator. Load both.
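The post-normalisation diagnostic from the takeaways can be sketched like this. The function name `normalisation_report` and the tolerances of 0.25 are hypothetical choices, not values from the text; suitable thresholds depend on sample size and the data.

```python
import numpy as np


def normalisation_report(Z, mean_tol=0.25, std_tol=0.25):
    """Flag features whose normalised mean/std drift away from 0 and 1."""
    mean = Z.mean(axis=0)
    std = Z.std(axis=0)
    ok = (np.abs(mean) <= mean_tol) & (np.abs(std - 1.0) <= std_tol)
    return {"mean": mean, "std": std, "ok": ok}


rng = np.random.default_rng(0)

# Statistics fitted on train only.
train = rng.normal(loc=50.0, scale=5.0, size=(1000, 2))
mu, sigma = train.mean(axis=0), train.std(axis=0)

# One test set from the same distribution, one with a shifted mean.
test_same = rng.normal(loc=50.0, scale=5.0, size=(500, 2))
test_shifted = rng.normal(loc=60.0, scale=5.0, size=(500, 2))

report_same = normalisation_report((test_same - mu) / sigma)
report_shifted = normalisation_report((test_shifted - mu) / sigma)
print(report_same["ok"], report_shifted["ok"])
```

The shifted set fails the check because its normalised mean sits about two train standard deviations from zero, which signals distribution shift or a mismatched bundle rather than a bug in the arithmetic.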