Chapter 6

Discovering Operating Conditions (k-Means)

Per-Condition Normalization

What We Need From the Clusterer

Before we can normalise per condition, we need a way to assign a condition label to every cycle. Section 5.4 walked through the discovery process; this section is a short, practical recap focused on the production-pipeline aspects: persistence, train/test reuse, and integration with the PyTorch model.

The contract. Fit on TRAIN, persist, .predict on TEST. Never re-fit on test data.

k-Means in Three Lines

The recipe from §5.4:

$$\text{KMeans}(n=6,\, n_{\text{init}}=10,\, \text{seed}=0)\,.\text{fit}(\mathbf{O})$$

with $\mathbf{O} \in \mathbb{R}^{N \times 3}$ the matrix of operational settings. Output: a (6, 3) centroid matrix and an integer label per row. On C-MAPSS this routinely recovers the canonical centroids with silhouette > 0.85 - any reasonable hyper-parameters work.
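To make the silhouette claim concrete, here is a minimal sketch on synthetic data. The six operating points in `CENTROIDS` and the noise scale are illustrative assumptions, not values read from the dataset; on real C-MAPSS data you would pass the three op-setting columns instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative operating points (assumed for this sketch, not loaded from C-MAPSS).
CENTROIDS = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])

rng = np.random.default_rng(0)
# 200 noisy rows around each operating point -> O has shape (1200, 3).
O = np.vstack([c + rng.normal(scale=0.01, size=(200, 3)) for c in CENTROIDS])

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(O)
score = silhouette_score(O, km.labels_)

print("centroid matrix shape:", km.cluster_centers_.shape)
print("silhouette:", round(float(score), 3))
```

Because the synthetic clusters are far apart relative to their spread, the score lands close to 1; the real op-settings behave similarly, which is why the hyper-parameters matter so little here.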


Python: Fit, Predict, Persist

The production-grade variant ALSO computes the per-condition mean and std over each sensor at fit time, so the normaliser downstream has everything it needs in one bundle.
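In symbols, the statistic bundle supports a per-condition z-score. The notation below is introduced here only for illustration: $c_i$ is the condition label of row $i$, and $\mu_{c,j}$, $\sigma_{c,j}$ are the train-set mean and standard deviation of sensor $j$ within condition $c$.

```latex
z_{i,j} = \frac{x_{i,j} - \mu_{c_i,\,j}}{\sigma_{c_i,\,j}},
\qquad c_i \in \{0, \dots, 5\},\quad j \in \{1, \dots, 21\}
```

The (6, 21) `means` and `stds` arrays computed at fit time are exactly the tables of $\mu_{c,j}$ and $\sigma_{c,j}$ that this expression looks up.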

Fit k-means + per-condition stats; serialise with joblib
🐍fit_condition_clusterer.py
import joblib

Persistence library. Saves sklearn estimators and ndarrays to disk in a binary format.

import numpy as np

For ndarray ops.

import pandas as pd

DataFrame loader.

from sklearn.cluster import KMeans

Same clusterer from §2.2 / §5.4.

OP_COLS = [...]

Three op-setting columns.

def fit_condition_clusterer(df_train, n_conditions=6, seed=0):

One function that does all four things: fit k-means, assign labels, compute per-condition mean and std for the sensors. Returns everything you need for inference.

EXECUTION STATE
input: df_train = TRAIN DataFrame only - never test data
input: n_conditions = 6 for FD002/FD004; 1 for FD001/FD003
input: seed = Reproducibility
returns = (km, means, stds) - fitted estimator + (n_conds, 21) means + stds

km = KMeans(n_clusters=n_conditions, n_init=10, random_state=seed).fit(df_train[OP_COLS])

Fit on op-settings only. .fit returns self - we save the fitted km for later .predict() calls on test data.

df_train = df_train.copy()

Copy so we don't mutate the caller's DataFrame when adding the 'cond' column.

df_train["cond"] = km.labels_

Attach per-row condition labels.

sensor_cols = [c for c in df_train.columns if c.startswith("sensor_")]

Pull the 21 sensor column names from the DataFrame's columns - works regardless of how many sensors are in the data.

means = np.stack([df_train.loc[df_train['cond'] == c, sensor_cols].mean().to_numpy() for c in range(n_conditions)])

For each of the 6 conditions, compute the per-sensor mean over the train rows in that cluster. np.stack glues 6 vectors of shape (21,) into one matrix of shape (6, 21).

EXECUTION STATE
means.shape = (6, 21) - one vector of means per condition

stds = np.stack([df_train.loc[df_train['cond'] == c, sensor_cols].std().to_numpy() for c in range(n_conditions)])

Same trick for standard deviations.

EXECUTION STATE
stds.shape = (6, 21)

return km, means, stds

Three-tuple. The km is needed at test time to assign new rows to a cluster; means/stds are what the per-condition normaliser uses.

df_train = pd.read_csv(...)

Load FD002 train.

km, means, stds = fit_condition_clusterer(df_train)

One call.

joblib.dump({"km": km, "means": means, "stds": stds}, "fd002_clusterer.joblib")

Persist all three to disk. At inference time we will load this file and apply the same clusterer + same per-condition stats to new data.

EXECUTION STATE
joblib.dump = Pickle-based serialisation that handles large ndarrays well. .joblib is the conventional extension.

print("means.shape :", means.shape)

Verify shape.

EXECUTION STATE
Output = means.shape : (6, 21)

print("stds.shape :", stds.shape)

Verify shape.

EXECUTION STATE
Output = stds.shape : (6, 21)

print("centroids :")

Header.

for i, c in enumerate(km.cluster_centers_):

Iterate centroids.

LOOP TRACE · 2 of 6 iterations shown
i = 0
centroid = ( 0.0, 0.000, 100.0) - sea-level idle
i = 5
centroid = ( 42.0, 0.840, 100.0) - top of climb

print(f" cond {i}: {c.round(2).tolist()}")

Pretty-print each centroid.

EXECUTION STATE
Output = cond 0: [0.0, 0.0, 100.0] ... cond 5: [42.0, 0.84, 100.0]
import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

OP_COLS = ["op_set_1", "op_set_2", "op_set_3"]


def fit_condition_clusterer(df_train: pd.DataFrame, n_conditions: int = 6, seed: int = 0):
    """Fit k-means on the train op-settings; return the fitted estimator
    AND the per-condition mean / std arrays for each sensor."""
    km = KMeans(n_clusters=n_conditions, n_init=10, random_state=seed).fit(df_train[OP_COLS])
    df_train = df_train.copy()
    df_train["cond"] = km.labels_

    sensor_cols = [c for c in df_train.columns if c.startswith("sensor_")]

    means = np.stack([df_train.loc[df_train["cond"] == c, sensor_cols].mean().to_numpy()
                      for c in range(n_conditions)])
    stds  = np.stack([df_train.loc[df_train["cond"] == c, sensor_cols].std().to_numpy()
                      for c in range(n_conditions)])

    return km, means, stds


# ----- Persist for later use -----
df_train = pd.read_csv("data/raw/train_FD002.txt", sep=r"\s+", header=None)
df_train.columns = (["engine_id", "cycle"] + OP_COLS + [f"sensor_{i}" for i in range(1, 22)])
km, means, stds = fit_condition_clusterer(df_train)

joblib.dump({"km": km, "means": means, "stds": stds}, "fd002_clusterer.joblib")

print("means.shape :", means.shape)        # (6, 21)
print("stds.shape  :", stds.shape)         # (6, 21)
print("centroids   :")
for i, c in enumerate(km.cluster_centers_):
    print(f"  cond {i}: {c.round(2).tolist()}")
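The other half of the contract, test-time reuse, might look like the sketch below. It runs on a tiny synthetic stand-in (two conditions, four sensors) so that it is self-contained; `make_split`, the file name `toy_clusterer.joblib`, and all numeric values are assumptions made for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)


def make_split(n_rows):
    """Synthetic stand-in: 2 operating conditions, 3 op-settings, 4 sensors."""
    cond = rng.integers(0, 2, size=n_rows)
    ops = np.where(cond[:, None] == 0, [0.0, 0.0, 100.0], [42.0, 0.84, 100.0])
    ops = ops + rng.normal(scale=0.01, size=ops.shape)
    # Sensor level shifts with the condition, which is what normalisation removes.
    sensors = 500.0 + 50.0 * cond[:, None] + rng.normal(size=(n_rows, 4))
    return ops, sensors


ops_train, sens_train = make_split(400)
ops_test, sens_test = make_split(100)

# TRAIN: fit once, compute the paired statistics, persist as one bundle.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ops_train)
means = np.stack([sens_train[km.labels_ == c].mean(axis=0) for c in range(2)])
stds = np.stack([sens_train[km.labels_ == c].std(axis=0, ddof=1) for c in range(2)])

path = os.path.join(tempfile.mkdtemp(), "toy_clusterer.joblib")
joblib.dump({"km": km, "means": means, "stds": stds}, path)

# TEST: load, predict, normalise. There is no .fit() call on test data.
bundle = joblib.load(path)
labels_test = bundle["km"].predict(ops_test)
z_test = (sens_test - bundle["means"][labels_test]) / bundle["stds"][labels_test]

print("z_test shape:", z_test.shape)
print("largest |column mean| after normalisation:", float(np.abs(z_test.mean(axis=0)).max()))
```

Indexing `means` and `stds` with the predicted labels picks the right row of statistics for every test sample, which only works because the estimator and its statistics came out of the same bundle.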

PyTorch: Embedding the Clusterer in the Pipeline

Bridge sklearn (k-means) and PyTorch (means / stds buffers, device-aware). The class below holds both and exposes a single .assign() method that takes raw op-settings and returns condition labels on whatever device the model lives on.

Hybrid sklearn + PyTorch wrapper
🐍condition_gate.py
import torch, torch.nn as nn

Tensors and Modules.

import numpy as np, joblib

ndarray + persistence.

class ConditionGate(nn.Module):

Wraps the saved k-means + means/stds in a PyTorch Module. The k-means runs on CPU (numpy); the means / stds become buffers so they live on whatever device the rest of the model uses.

EXECUTION STATE
design = Hybrid - sklearn for clustering, PyTorch for the means/stds tensors

def __init__(self, joblib_path):

Constructor takes the path to the saved bundle.

super().__init__()

Initialise nn.Module.

bundle = joblib.load(joblib_path)

Load the dict we saved in the NumPy block.

self.km = bundle["km"]

Stash the sklearn estimator as a regular attribute. NOT a parameter or buffer - sklearn objects don't fit those abstractions. PyTorch ignores it.

EXECUTION STATE
→ caveat = PyTorch's .to(device) will NOT move self.km. The clusterer always runs on CPU; that is fine because op-settings are tiny (B, 3).

self.register_buffer("means", torch.from_numpy(bundle["means"]).float())

Means as a (6, 21) float32 buffer. Will be moved to GPU by .to(device) and saved in state_dict.

self.register_buffer("stds", torch.from_numpy(bundle["stds"]).float())

Same for stds.

def assign(self, op_settings: np.ndarray) -> torch.Tensor:

Assign a batch of op-settings to condition labels.

EXECUTION STATE
input: op_settings (np.ndarray) = (B, 3) - one row per sample
returns = (B,) int64 cluster IDs on the means buffer's device

labels = self.km.predict(op_settings)

sklearn estimator's predict method. Returns a NumPy ndarray of cluster IDs.

return torch.from_numpy(labels).to(self.means.device).long()

Convert to torch.Tensor, move to the same device as the means buffer (GPU if model is on GPU), cast to long (int64) so the labels can be used for integer (advanced) indexing such as self.means[labels].

gate = ConditionGate("fd002_clusterer.joblib")

Instantiate. Loads the saved bundle.

print("means buffer shape:", tuple(gate.means.shape))

Verify shape.

EXECUTION STATE
Output = means buffer shape: (6, 21)

print("stds buffer shape:", tuple(gate.stds.shape))

Verify shape.

EXECUTION STATE
Output = stds buffer shape: (6, 21)
import torch, torch.nn as nn
import numpy as np, joblib

class ConditionGate(nn.Module):
    """Holds a fitted k-means and per-condition mean / std as buffers."""

    def __init__(self, joblib_path: str):
        super().__init__()
        bundle = joblib.load(joblib_path)
        self.km = bundle["km"]                # sklearn estimator stored on self - NOT a buffer
        self.register_buffer("means", torch.from_numpy(bundle["means"]).float())
        self.register_buffer("stds",  torch.from_numpy(bundle["stds"]).float())

    def assign(self, op_settings: np.ndarray) -> torch.Tensor:
        """op_settings: (B, 3) numpy. Returns (B,) int64 condition labels on whatever device the buffers live on."""
        # Note: km was fitted on a DataFrame, so recent scikit-learn versions emit a
        # feature-names UserWarning when predict() receives a bare ndarray. The labels are unaffected.
        labels = self.km.predict(op_settings)
        return torch.from_numpy(labels).to(self.means.device).long()


gate = ConditionGate("fd002_clusterer.joblib")
print("means buffer shape:", tuple(gate.means.shape))    # (6, 21)
print("stds  buffer shape:", tuple(gate.stds.shape))     # (6, 21)
The hybrid pattern. sklearn for the clustering (CPU, deterministic, mature). PyTorch for the buffers (GPU-aware, differentiable surface). joblib bridges the persistence boundary. Same pattern used by every production ML pipeline in this book.
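To show what the buffers are for once the labels are on the right device, here is a minimal sketch of the lookup-and-normalise step. `PerConditionNorm` and its `eps` guard are hypothetical names introduced for this example; they are not part of `ConditionGate` above, and the tiny tensors stand in for the real (6, 21) statistics.

```python
import torch
import torch.nn as nn


class PerConditionNorm(nn.Module):
    """Hypothetical companion module: z-scores sensors with per-condition statistics."""

    def __init__(self, means: torch.Tensor, stds: torch.Tensor, eps: float = 1e-8):
        super().__init__()
        # Buffers follow the module through .to(device) and appear in state_dict.
        self.register_buffer("means", means.float())
        self.register_buffer("stds", stds.float())
        self.eps = eps  # assumed guard against a zero std for a constant sensor

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # x: (B, S) sensor readings; labels: (B,) int64 condition IDs.
        mu = self.means[labels]     # (B, S): one row of means per sample
        sigma = self.stds[labels]   # (B, S): one row of stds per sample
        return (x - mu) / (sigma + self.eps)


# Toy statistics: 2 conditions, 2 sensors.
means = torch.tensor([[0.0, 10.0], [100.0, 200.0]])
stds = torch.tensor([[1.0, 2.0], [10.0, 20.0]])
norm = PerConditionNorm(means, stds)

x = torch.tensor([[1.0, 14.0], [110.0, 180.0]])
labels = torch.tensor([0, 1])
z = norm(x, labels)
print(z)  # approximately [[1, 2], [1, -1]]
```

The integer indexing `self.means[labels]` is the reason `assign()` returns int64 labels on the same device as the buffers.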

When You Need Something Other Than k-Means

| Situation | Use | Why |
|---|---|---|
| Unknown number of conditions | GMM with BIC selection | Likelihood-based K selection |
| Conditions with non-spherical shape | DBSCAN / HDBSCAN | Density-based, no K to set |
| Streaming / online setting | MiniBatchKMeans | Constant memory; updates per batch |
| High-dimensional ops (e.g., 100D) | PCA + k-means | Curse of dimensionality otherwise |
| Hard, non-overlapping conditions | Decision tree on op-settings | Interpretable, deterministic |
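As an illustration of the first row, the sketch below picks the number of components by minimising BIC over a range of candidates. The synthetic data (four 2-D blobs), the candidate range 1 to 8, and the covariance type are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Four well-separated synthetic "conditions" in two dimensions.
true_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(150, 2)) for c in true_centres])

candidates = list(range(1, 9))
bics = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower BIC = better fit after the complexity penalty

best_k = candidates[int(np.argmin(bics))]
print("BIC per K:", [round(b, 1) for b in bics])
print("selected K:", best_k)
```

The selected model can then play the same role as the k-means estimator: fit on train, persist, and call `.predict` on test.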

Two Discovery Pitfalls

Pitfall 1: Re-fitting at test time. Every fit renumbers clusters: condition 3 today might be condition 0 tomorrow. The trained model expects condition IDs in the SAME order as the bundle - re-fitting breaks this contract silently. Always .predict, never re-fit.
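Pitfall 2: Forgetting to fix the random_state. Without a fixed seed, two runs of fit_condition_clusterer on the same data can assign different integer IDs to the same clusters. Means / stds computed in one run are then mismatched with .predict outputs from an estimator fitted in another. Lock the seed, and keep the estimator and its statistics in the same bundle.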
The point. Cluster discovery is a one-time setup cost. The discovery happens once on training data, gets serialised, and is reused at every training step + every inference call.
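Both pitfalls come down to cluster IDs being arbitrary. The sketch below fits the same synthetic data with two seeds and checks that the partition of rows is identical even though the integer IDs are only guaranteed to agree up to a permutation. The helpers `canonical_order` and `relabel` are hypothetical, and sorting centroids is an optional extra safeguard assumed here, not a replacement for persisting the fitted estimator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 100.0], [20.0, 0.7, 100.0], [42.0, 0.84, 100.0]])
O = np.vstack([c + rng.normal(scale=0.01, size=(100, 3)) for c in centres])

km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(O)
km_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit(O)

# Same grouping of rows (adjusted Rand index of 1)...
same_partition = adjusted_rand_score(km_a.labels_, km_b.labels_) > 0.999
# ...but the raw integer IDs may or may not coincide.
ids_match = np.array_equal(km_a.labels_, km_b.labels_)
print("same partition:", same_partition, "| identical raw IDs:", ids_match)


def canonical_order(km):
    """Hypothetical helper: cluster indices sorted by centroid (col 0, then 1, then 2)."""
    c = km.cluster_centers_
    return np.lexsort((c[:, 2], c[:, 1], c[:, 0]))  # last key is the primary key


def relabel(km, labels):
    """Map raw cluster IDs to their rank in the canonical centroid order."""
    order = canonical_order(km)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[labels]


stable_a = relabel(km_a, km_a.labels_)
stable_b = relabel(km_b, km_b.labels_)
print("identical canonical IDs:", np.array_equal(stable_a, stable_b))
```

If a canonical order like this is used, the rows of the means / stds arrays would need the same reordering; keeping everything in one persisted bundle avoids that bookkeeping altogether.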

Takeaway

  • Fit once on train, persist with joblib, predict on test. Never re-fit.
  • Save means and stds alongside the clusterer. They are paired statistics; loading one without the other is a bug.
  • The hybrid sklearn + PyTorch pattern works. k-means stays on CPU; means/stds are GPU-aware buffers.
  • Always seed the random_state. Different seeds give different cluster IDs; mismatch with the saved means/stds breaks the pipeline silently.