What We Need From the Clusterer
Before we can normalise per condition, we need a way to assign a condition label to every cycle. Section 5.4 walked through the discovery process; this section is a short, practical recap focused on the production-pipeline aspects: persistence, train/test reuse, and integration with the PyTorch model.
k-Means in Three Lines
The recipe from §5.4 is k-means with K = 6, fitted on the matrix of operational settings (one row per cycle, one column per setting). Output: a (6, 3) centroid matrix and an integer label per row. On C-MAPSS this routinely recovers the canonical centroids with silhouette > 0.85, and any reasonable hyper-parameters work.
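As a minimal sketch of that recipe, the snippet below runs scikit-learn's KMeans on a synthetic stand-in for the op-settings matrix. The array name op_settings, the blob centres, and the noise level are illustrative assumptions, not values read from the C-MAPSS files.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the op-settings matrix: six tight blobs in 3-D.
# The centre values are illustrative assumptions, not loaded from C-MAPSS.
rng = np.random.default_rng(0)
centres = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])
op_settings = np.repeat(centres, 200, axis=0) + rng.normal(scale=0.01, size=(1200, 3))

# The three lines: construct, fit, read out.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(op_settings)   # one integer label per row
centroids = kmeans.cluster_centers_        # shape (6, 3)
```

On real data, op_settings would be the three operational-setting columns of the training set.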
Python: Fit, Predict, Persist
The production-grade variant also computes the per-condition mean and standard deviation of each sensor at fit time, so the downstream normaliser has everything it needs in one bundle.
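One way this fit / predict / persist cycle could look is sketched below. The name fit_condition_clusterer is taken from this section, but its signature, the dict layout of the bundle, the file name, and the synthetic data are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans


def fit_condition_clusterer(op_settings, sensors, n_conditions=6, seed=0):
    """Fit k-means on op-settings and compute per-condition sensor statistics."""
    kmeans = KMeans(n_clusters=n_conditions, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(op_settings)
    # Row c of means / stds belongs to cluster ID c of this particular fit.
    means = np.stack([sensors[labels == c].mean(axis=0) for c in range(n_conditions)])
    stds = np.stack([sensors[labels == c].std(axis=0) for c in range(n_conditions)])
    stds = np.where(stds < 1e-8, 1.0, stds)  # guard against constant sensors
    return {"kmeans": kmeans, "means": means, "stds": stds}


# Synthetic stand-in data (assumption): 3 conditions, 4 sensors.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 100.0], [20.0, 0.7, 100.0], [42.0, 0.84, 60.0]])


def make_split(n_per_condition):
    n = 3 * n_per_condition
    ops = np.repeat(centres, n_per_condition, axis=0) + rng.normal(scale=0.01, size=(n, 3))
    sensors = rng.normal(loc=ops[:, :1], scale=1.0, size=(n, 4))
    return ops, sensors


train_ops, train_sensors = make_split(300)
test_ops, test_sensors = make_split(100)

# Fit once on train and persist the whole bundle in one file.
bundle = fit_condition_clusterer(train_ops, train_sensors, n_conditions=3)
path = os.path.join(tempfile.mkdtemp(), "condition_clusterer.joblib")
joblib.dump(bundle, path)

# At test time: load and predict only, then normalise with the paired statistics.
loaded = joblib.load(path)
test_labels = loaded["kmeans"].predict(test_ops)
test_norm = (test_sensors - loaded["means"][test_labels]) / loaded["stds"][test_labels]
```

Keeping the clusterer and its statistics in a single file means they cannot drift apart between training and inference.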
PyTorch: Embedding the Clusterer in the Pipeline
The pipeline has to bridge sklearn (k-means) and PyTorch (device-aware means / stds buffers). A single class can hold both and expose one .assign() method that takes raw op-settings and returns condition labels on whatever device the model lives on.
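A minimal sketch of such a bridge class follows. The class name ConditionClusterer, its constructor arguments, and the synthetic data are hypothetical; only the .assign() method name comes from the text. The k-means call runs on CPU through NumPy, while means and stds are registered as buffers so they move with the module.

```python
import numpy as np
import torch
from torch import nn
from sklearn.cluster import KMeans


class ConditionClusterer(nn.Module):
    """Hypothetical wrapper: sklearn k-means on CPU, statistics as torch buffers."""

    def __init__(self, kmeans, means, stds):
        super().__init__()
        self.kmeans = kmeans  # plain attribute: stays on CPU, not in state_dict
        self.register_buffer("means", torch.as_tensor(means, dtype=torch.float32))
        self.register_buffer("stds", torch.as_tensor(stds, dtype=torch.float32))

    @torch.no_grad()
    def assign(self, op_settings):
        # Round-trip through NumPy for sklearn, matching the float64 fit dtype.
        ops_np = op_settings.detach().cpu().numpy().astype(np.float64)
        labels = self.kmeans.predict(ops_np)
        # Return labels on the same device as the buffers (and hence the model).
        return torch.from_numpy(labels).long().to(self.means.device)


# Synthetic stand-in data (assumption): 3 conditions, 4 sensors.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 100.0], [20.0, 0.7, 100.0], [42.0, 0.84, 60.0]])
ops = np.repeat(centres, 100, axis=0) + rng.normal(scale=0.01, size=(300, 3))
sensors = rng.normal(loc=ops[:, :1], scale=1.0, size=(300, 4))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ops)
fit_labels = kmeans.labels_
means = np.stack([sensors[fit_labels == c].mean(axis=0) for c in range(3)])
stds = np.stack([sensors[fit_labels == c].std(axis=0) for c in range(3)])

device = "cuda" if torch.cuda.is_available() else "cpu"
clusterer = ConditionClusterer(kmeans, means, stds).to(device)

ops_t = torch.tensor(ops, dtype=torch.float32, device=device)
sensors_t = torch.tensor(sensors, dtype=torch.float32, device=device)
labels = clusterer.assign(ops_t)
normalised = (sensors_t - clusterer.means[labels]) / clusterer.stds[labels]
```

In this sketch the buffers travel with state_dict, but the sklearn object does not, so the k-means model would still need to be persisted separately (for example with joblib).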
When You Need Something Other Than k-Means
| Situation | Use | Why |
|---|---|---|
| Unknown number of conditions | GMM with BIC selection | Likelihood-based K selection |
| Conditions with non-spherical shape | DBSCAN / HDBSCAN | Density-based, no K to set |
| Streaming / online setting | MiniBatchKMeans | Constant memory; updates per batch |
| High-dimensional ops (e.g., 100D) | PCA + k-means | Curse of dimensionality otherwise |
| Hard, non-overlapping conditions | Decision tree on op-settings | Interpretable, deterministic |
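To illustrate the first row of the table, here is a hedged sketch of BIC-based selection of the number of conditions with scikit-learn's GaussianMixture. The candidate range, the diagonal covariance choice, and the synthetic data are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the op-settings matrix (assumption): six blobs in 3-D.
rng = np.random.default_rng(0)
centres = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])
X = np.repeat(centres, 200, axis=0) + rng.normal(scale=0.05, size=(1200, 3))

# Fit one mixture per candidate K and record its BIC (lower is better).
bics = {}
for k in range(2, 10):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", n_init=3, random_state=0)
    gmm.fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
print("K selected by BIC:", best_k)
```

Once K is chosen, the fitted mixture's predict method plays the same role as the k-means predict call in the rest of the pipeline.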
Two Discovery Pitfalls
- Re-fitting on test data. Use .predict, never re-fit.
- Unseeded fits. Without a fixed random_state, two calls to fit_condition_clusterer on the same data can produce different cluster IDs. The means / stds bundle is then mismatched with later .predict outputs. Lock the seed.

The point. Cluster discovery is a one-time setup cost: it happens once on training data, gets serialised, and is reused at every training step and every inference call.
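The seeding pitfall can be checked directly. The sketch below, on synthetic data (an assumption), fits k-means with the same seed twice and with a different seed once. The same seed reproduces the labels exactly; a different seed typically finds the same partition (adjusted Rand index near 1) but may number the clusters differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the op-settings matrix (assumption): three blobs.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 100.0], [20.0, 0.7, 100.0], [42.0, 0.84, 60.0]])
X = np.repeat(centres, 200, axis=0) + rng.normal(scale=0.01, size=(600, 3))


def fit_labels(seed):
    return KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)


labels_a = fit_labels(0)
labels_a_again = fit_labels(0)
labels_b = fit_labels(1)

# Same seed, same data: identical cluster IDs.
same_seed_identical = np.array_equal(labels_a, labels_a_again)

# Different seed: compare the partition (ARI) separately from the raw IDs.
partition_agreement = adjusted_rand_score(labels_a, labels_b)
ids_identical = np.array_equal(labels_a, labels_b)

print("same seed reproduces IDs:", same_seed_identical)
print("ARI across seeds:", partition_agreement)
print("raw IDs identical across seeds:", ids_identical)
```

An ARI near 1 alongside non-identical raw IDs is exactly the silent failure mode: the clusters are right, but row c of the saved means / stds no longer refers to cluster c.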
Takeaway
- Fit once on train, persist with joblib, predict on test. Never re-fit.
- Save means and stds alongside the clusterer. They are paired statistics; loading one without the other is a bug.
- The hybrid sklearn + PyTorch pattern works. k-means stays on CPU; means/stds are GPU-aware buffers.
- Always seed the random_state. Different seeds give different cluster IDs; mismatch with the saved means/stds breaks the pipeline silently.
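A small load-time guard can catch the "one without the other" bug early. The sketch below assumes the bundle is a dict with keys "kmeans", "means", and "stds"; that layout and the helper name validate_bundle are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

REQUIRED_KEYS = ("kmeans", "means", "stds")  # assumed bundle layout


def validate_bundle(bundle):
    """Raise if the clusterer and its paired statistics are inconsistent."""
    missing = [key for key in REQUIRED_KEYS if key not in bundle]
    if missing:
        raise KeyError(f"bundle is missing: {missing}")
    n_conditions = bundle["kmeans"].n_clusters
    if bundle["means"].shape != bundle["stds"].shape:
        raise ValueError("means and stds have different shapes")
    if bundle["means"].shape[0] != n_conditions:
        raise ValueError("statistics do not have one row per cluster")
    return bundle


# Tiny synthetic example (assumption): 2 conditions, 3 sensors.
rng = np.random.default_rng(0)
ops = np.vstack([rng.normal(0.0, 0.01, size=(50, 3)), rng.normal(10.0, 0.01, size=(50, 3))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ops)
means = np.zeros((2, 3))
stds = np.ones((2, 3))

good = validate_bundle({"kmeans": kmeans, "means": means, "stds": stds})
```

Calling validate_bundle right after loading the serialised file turns a silent mismatch into an immediate error.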