Section 2.2 introduced the operating-conditions space. This section formalises the recovery: given the three op-setting columns we want a single integer label per cycle that names which of the 6 canonical regimes the engine is in at that moment. That label becomes the input to the per-condition normaliser (Chapter 6).
Why k-Means Works So Well Here
On C-MAPSS the 6 canonical operating points are:
| Cond. | Altitude (k ft) | Mach | TRA (%) |
|---|---|---|---|
| 0 | 0 | 0.00 | 100 |
| 1 | 10 | 0.25 | 100 |
| 2 | 20 | 0.70 | 100 |
| 3 | 25 | 0.62 | 60 |
| 4 | 35 | 0.84 | 100 |
| 5 | 42 | 0.84 | 100 |
Three observations make k-means the right tool. (1) The centroids are well-separated: altitude alone spans 0 to 42 kft. (2) The clusters are spherical (small jitter around each centroid; the simulator does not introduce per-condition variance scaling). (3) We know k=6 upfront from the dataset documentation.
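Observation (1) can be checked directly from the table. The snippet below is a minimal sketch that computes pairwise Euclidean distances between the six canonical points in raw, unscaled units; the array name CANONICAL is introduced here for illustration only.

```python
import numpy as np

# Canonical operating points from the table above: (altitude kft, Mach, TRA %).
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])

# Pairwise Euclidean distances between the 6 centroids, in raw units.
diff = CANONICAL[:, None, :] - CANONICAL[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # ignore self-distances

# Locate the closest pair of conditions.
i, j = np.unravel_index(np.argmin(dist), dist.shape)
min_dist = dist[i, j]
print(f"closest pair: conditions {i} and {j}, distance {min_dist:.2f}")
# prints: closest pair: conditions 4 and 5, distance 7.00
```

Conditions 4 and 5 share Mach and TRA and differ only by 7 kft of altitude, so they are the tightest pair. In raw units altitude and TRA dominate the distance while Mach contributes very little, which is worth keeping in mind if the op-settings are ever rescaled before clustering.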
Interactive: The Conditions Space (Reprise)
Same viewer from Section 2.2 - reproduced here because the physics matter for both selection (Section 5.4) and normalisation (Chapter 6).
Validating the Clustering
Three diagnostics confirm the recovery worked.
| Diagnostic | Computation | C-MAPSS FD002 value |
|---|---|---|
| Silhouette score | Average (b - a) / max(a, b) | 0.879 (>0.7 = excellent) |
| Cluster size balance | max(size_k) / min(size_k) | 1.03x (essentially uniform) |
| Centroid match | Distance to canonical NASA centroids | <1% on each axis |
If your silhouette score is below ~0.6, k-means probably picked a suboptimal local minimum. Re-run with a different random_state or increase n_init. On C-MAPSS we have never seen a failure with n_init=10.
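Written out per point, the silhouette computation looks as follows. The symbols s(i), S and N are notation introduced here for illustration; a and b carry the same meaning as in the diagnostics table.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

Here a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points of the nearest other cluster. When clusters are tight and far apart, a(i) is much smaller than b(i), so s(i) approaches 1.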
Python: Discover, Validate, Persist
k-means on op-settings + silhouette + size-balance check
n_init=10: run k-means 10 times with different random initial centroids and keep the lowest-inertia result. Defends against bad inits.

labels = km.labels_: cluster assignment per row, shape (N,).

sil = silhouette_score(ops, labels): silhouette score in [-1, 1], measuring how tightly each point fits its cluster vs neighbouring clusters. > 0.7 = excellent. C-MAPSS gets 0.88 because the 6 centroids are well-separated.
    silhouette_score = average over all points of (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean nearest-cluster distance.
    FD002 result = 0.879 (clusters are tight and well-separated)

sizes = np.bincount(labels): count assignments per cluster. np.bincount(integers) returns an array of counts indexed by the integer.
    FD002 sizes = [8896, 8869, 8989, 8900, 8990, 9115] (within 3% of perfectly balanced)
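The annotated steps above (fit, labels, silhouette, size counts) can be put together as a standalone check. This is a minimal sketch, not the course's own listing: it runs on a synthetic stand-in for the op-setting columns (small Gaussian jitter around the canonical points, an assumption made here), and the names CANONICAL, ops, balance and gaps are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Canonical operating points: (altitude kft, Mach, TRA %).
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])

# Synthetic stand-in for the three op-setting columns (assumption: small
# Gaussian jitter around each canonical point; this is NOT real C-MAPSS data).
rng = np.random.default_rng(0)
true_cond = rng.integers(0, 6, size=3000)
ops = CANONICAL[true_cond] + rng.normal(scale=[0.003, 0.0003, 0.01], size=(3000, 3))

# Discover: 10 restarts, fixed seed, keep the lowest-inertia fit.
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(ops)
labels = km.labels_                      # shape (N,)

# Validate 1: silhouette score in [-1, 1].
sil = silhouette_score(ops, labels)

# Validate 2: cluster size balance, max(size_k) / min(size_k).
sizes = np.bincount(labels, minlength=6)
balance = sizes.max() / sizes.min()

# Validate 3: distance from each fitted centre to its nearest canonical point.
gaps = np.linalg.norm(
    km.cluster_centers_[:, None, :] - CANONICAL[None, :, :], axis=-1
).min(axis=1)

print(f"silhouette={sil:.3f}  balance={balance:.2f}x  max centroid gap={gaps.max():.4f}")
```

On the real FD002 file, ops would be the three op-setting columns of the training DataFrame. silhouette_score works on pairwise distances, so with tens of thousands of rows its sample_size argument can be used to score a random subset instead.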
```python
import numpy as np, pandas as pd, torch
from torch.utils.data import Dataset
from sklearn.cluster import KMeans

INFORMATIVE_IDX = [1, 2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 16, 19, 20]


class CMAPSSConditionAware(Dataset):
    def __init__(self, csv_path, n_conditions=6, window=30):
        cols = (["engine_id", "cycle"]
                + [f"op_set_{i}" for i in range(1, 4)]
                + [f"sensor_{i}" for i in range(1, 22)])
        df = pd.read_csv(csv_path, sep=r"\s+", header=None, names=cols)
        df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]

        # Discover conditions ONCE at construction
        if n_conditions > 1:
            km = KMeans(n_clusters=n_conditions, n_init=10, random_state=0)
            df["cond"] = km.fit_predict(df[["op_set_1", "op_set_2", "op_set_3"]]).astype(np.int64)
        else:
            df["cond"] = 0

        sensor_cols = [f"sensor_{i+1}" for i in INFORMATIVE_IDX]
        self.window = window
        self.samples, self.engines = [], {}
        for eid, sub in df.groupby("engine_id"):
            arr = sub[sensor_cols].to_numpy(dtype=np.float32)
            ruls = sub["RUL"].to_numpy(dtype=np.float32)
            cond = sub["cond"].to_numpy(dtype=np.int64)
            self.engines[eid] = (arr, ruls, cond)
            for end in range(window, len(sub) + 1):
                self.samples.append((eid, end))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        eid, end = self.samples[idx]
        arr, ruls, cond = self.engines[eid]
        return (
            torch.from_numpy(arr[end - self.window:end]),   # X    (W, 14)
            torch.from_numpy(cond[end - self.window:end]),  # cond (W,)
            torch.tensor(ruls[end - 1]),                    # y    scalar
        )


ds = CMAPSSConditionAware("data/raw/train_FD002.txt", n_conditions=6)
X, c, y = ds[0]
print("shapes:", tuple(X.shape), tuple(c.shape), tuple(y.shape))  # (30, 14) (30,) ()
```
When k-Means Is the Wrong Tool
| Scenario | Better choice | Why |
|---|---|---|
| Unknown number of conditions | Gaussian Mixture + BIC | Selects K automatically |
| Non-spherical clusters | DBSCAN, HDBSCAN | Handles arbitrary shapes |
| Highly imbalanced cluster sizes | Spectral clustering | Less sensitive to size |
| Streaming / online setting | Minibatch k-means | Constant memory; updates per batch |
| High-dimensional ops (e.g., 100D) | k-means on PCA-reduced ops | Curse of dimensionality |
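For the first row of the table, one possible shape of the "Gaussian Mixture + BIC" approach is sketched below: fit a mixture for each candidate K and keep the one with the lowest BIC. This is an illustrative sketch on synthetic data (the jitter scales and the candidate range 2 to 9 are assumptions made here), using scikit-learn's GaussianMixture and its bic method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Canonical operating points: (altitude kft, Mach, TRA %).
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])

# Synthetic stand-in for op-setting rows (assumption, not real C-MAPSS data).
rng = np.random.default_rng(0)
true_cond = rng.integers(0, 6, size=1200)
ops = CANONICAL[true_cond] + rng.normal(scale=[0.5, 0.01, 0.5], size=(1200, 3))

# Fit one mixture per candidate K and record its BIC (lower is better).
bics = {}
for k in range(2, 10):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(ops)
    bics[k] = gmm.bic(ops)

best_k = min(bics, key=bics.get)
print("BIC by K:", {k: round(v) for k, v in bics.items()})
print("lowest-BIC K:", best_k)
```

BIC trades goodness of fit against the number of parameters, so values of K that merge distinct regimes score much worse, while values that split a regime pay a complexity penalty. The selected K should still be checked against any available documentation.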
Three Discovery Pitfalls
Pitfall 1: Wrong K. If you accidentally pass n_clusters=4 instead of 6 on FD002, k-means happily produces 4 clusters and silhouette drops to ~0.6. Always validate cluster count against the dataset documentation.
Pitfall 2: Not setting random_state. Without a seed, k-means assignments are non-deterministic across runs - your condition ID 3 today might be condition ID 5 tomorrow. Lock the seed in production.
Pitfall 3: Re-fitting on test data. Fit k-means on TRAIN only; apply with km.predict(test_ops). Re-fitting on test data leaks information.
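Pitfalls 2 and 3 can be addressed together, as the following minimal sketch shows: fit once on the training op-settings with a fixed seed, then only call predict on the test op-settings. The helper make_ops and the synthetic data are stand-ins invented for this example. The altitude-based renumbering is an optional hardening step suggested here, not something the text prescribes; it makes condition IDs independent of the seed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Canonical operating points: (altitude kft, Mach, TRA %), ordered by altitude.
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])


def make_ops(n, seed):
    """Hypothetical helper: synthetic op-setting rows plus their true condition."""
    rng = np.random.default_rng(seed)
    cond = rng.integers(0, 6, size=n)
    ops = CANONICAL[cond] + rng.normal(scale=[0.003, 0.0003, 0.01], size=(n, 3))
    return ops, cond


train_ops, _ = make_ops(3000, seed=1)
test_ops, test_true = make_ops(1000, seed=2)

# Fit on TRAIN only, with a locked seed.
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(train_ops)

# Optional: renumber clusters by centroid altitude so IDs follow the
# table's 0..5 ordering regardless of how k-means happened to number them.
order = np.argsort(km.cluster_centers_[:, 0])  # raw cluster IDs sorted by altitude
remap = np.empty(6, dtype=np.int64)
remap[order] = np.arange(6)                    # raw cluster ID -> stable ID

# Apply to TEST with predict; never re-fit here.
test_cond = remap[km.predict(test_ops)]
print("test labels match table numbering:", np.array_equal(test_cond, test_true))
```

Because the six canonical altitudes are all distinct, sorting centroids by altitude gives an unambiguous ordering for this dataset; other datasets may need a different tie-breaking rule.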
The whole pattern. Three op-settings cluster cleanly into 6 regimes. k-means with n_init=10 and a fixed seed recovers them with silhouette ~ 0.88 and 1.03x size balance. Plumb the labels into the Dataset and the per-condition normaliser (Chapter 6) does the rest.
Takeaway
k-means with k=6 nails C-MAPSS conditions. Silhouette 0.88, near-balanced clusters.
Validation matters. Silhouette score and cluster balance catch bad fits in seconds.
Discovery happens once, in __init__. No per-batch k-means at training time.
The condition label is now part of every sample. Every (X, cond, y) tuple lets downstream layers normalise per regime.
k-means is not always the answer. Unknown K, non-spherical clusters, or streaming data require alternatives.