Chapter 5

Operating-Condition Discovery

NASA Datasets Deep Dive

Stamping Each Cycle With a Regime Tag

Section 2.2 introduced the operating-conditions space. This section formalises the recovery: given the three op-setting columns we want a single integer label per cycle that names which of the 6 canonical regimes the engine is in at that moment. That label becomes the input to the per-condition normaliser (Chapter 6).

Why k-Means Works So Well Here

On C-MAPSS the 6 canonical operating points are:

Cond.   Altitude (kft)   Mach   TRA (%)
0       0                0.00   100
1       10               0.25   100
2       20               0.70   100
3       25               0.62   60
4       35               0.84   100
5       42               0.84   100

Three observations make k-means the right tool. (1) The centroids are well-separated: altitude alone spans 0 to 42 kft. (2) The clusters are spherical (small jitter around each centroid; the simulator does not introduce per-condition variance scaling). (3) We know k = 6 upfront from the dataset documentation.
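The three observations can be checked on synthetic data. The sketch below adds small Gaussian jitter to the six canonical points from the table and lets k-means recover them. The jitter scales, the sample count and the names CANONICAL and make_synthetic_ops are assumptions made for illustration; nothing here is read from the C-MAPSS files.

```python
import numpy as np
from sklearn.cluster import KMeans

# The six canonical operating points: altitude (kft), Mach, TRA (%)
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])


def make_synthetic_ops(n_per_condition=200, seed=0):
    """Hypothetical stand-in for FD002: canonical points plus small jitter."""
    rng = np.random.default_rng(seed)
    jitter_scale = np.array([0.003, 0.0003, 0.0])  # assumed; TRA left exact
    blocks = [
        centre + rng.normal(size=(n_per_condition, 3)) * jitter_scale
        for centre in CANONICAL
    ]
    return np.vstack(blocks)


ops = make_synthetic_ops()
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(ops)

# Cluster IDs are arbitrary, so sort the recovered centroids by altitude
# before comparing them with the canonical table.
recovered = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
max_error = np.abs(recovered - CANONICAL).max()

print(np.round(recovered, 2))
print(f"largest centroid error: {max_error:.4f}")
```

Because the gaps between centroids are orders of magnitude larger than the jitter, every restart should land on the same partition; the sort by altitude only exists to undo k-means' arbitrary numbering.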

Interactive: The Conditions Space (Reprise)

Same viewer from Section 2.2, reproduced here because the physics matter for both selection (Section 5.3) and normalisation (Chapter 6).


Validating the Clustering

Three diagnostics confirm the recovery worked.

Diagnostic             Computation                              C-MAPSS FD002 value
Silhouette score       Average (b - a) / max(a, b)              0.879 (>0.7 = excellent)
Cluster size balance   max(size_k) / min(size_k)                1.03x (essentially uniform)
Centroid match         Distance to canonical NASA centroids     <1% on each axis

If your silhouette score is below ~0.6, k-means probably picked a suboptimal local minimum. Re-run with a different random_state or increase n_init. On C-MAPSS we have never seen a failure with n_init=10.
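To make the silhouette formula concrete, here is a brute-force version checked against scikit-learn on a four-point toy set. The helper name manual_silhouette and the toy data are assumptions for illustration; a is taken as the mean distance to the other members of a point's own cluster and b as the smallest mean distance to any other cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def manual_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points, computed by brute force."""
    # Full pairwise Euclidean distance matrix, shape (N, N)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score defined as 0
        # a: mean distance to the OTHER members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Toy data (an assumption for illustration): two tight groups on a line
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

manual = manual_silhouette(points, labels)
reference = silhouette_score(points, labels)
print(f"manual={manual:.4f}  sklearn={reference:.4f}")  # both 0.8997
```

For the point at 0, a = 1 and b = 10.5, giving 9.5 / 10.5; for the point at 1, a = 1 and b = 9.5, giving 8.5 / 9.5. The other two points mirror these, so the average is about 0.90.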

Python: Discover, Validate, Persist

k-means on op-settings + silhouette + size-balance check
discover_validate.py

import numpy as np
    Standard alias.

import pandas as pd
    DataFrame loader.

from sklearn.cluster import KMeans
    Clustering.

from sklearn.metrics import silhouette_score
    Cluster validity metric.

COLUMNS = ...
    26-column layout.

OP_COLS = [...]
    Just the three op-setting columns.

def discover_and_validate(df, n_conditions, seed=0):
    Wrap discovery + validation in one helper that returns everything you need.
    EXECUTION STATE
      input: df = DataFrame with op-setting columns
      input: n_conditions = 1 (FD001/FD003) or 6 (FD002/FD004)
      input: seed = reproducibility; 0 by convention
      returns = (labels, centroids, silhouette, cluster sizes)

ops = df[OP_COLS].to_numpy()
    Materialise op-settings as a (N, 3) ndarray.

km = KMeans(n_clusters=n_conditions, n_init=10, random_state=seed).fit(ops)
    Fit k-means. Same args as §2.2's example.
    EXECUTION STATE
      n_init=10 = Run k-means 10 times with different random initial centroids; keep the lowest-inertia result. Defends against bad inits.

labels = km.labels_
    Cluster assignment per row, shape (N,).

sil = silhouette_score(ops, labels)
    Silhouette score in [-1, 1]: how tightly each point fits its own cluster versus the nearest neighbouring cluster. > 0.7 = excellent. C-MAPSS gets 0.88 because the 6 centroids are well-separated. The score is only defined for two or more clusters, so scikit-learn raises a ValueError on this call when n_conditions = 1.
    EXECUTION STATE
      silhouette_score = Average over all points of (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster.
      FD002 result = 0.879 (clusters are tight and well-separated)

sizes = np.bincount(labels)
    Count assignments per cluster. np.bincount(integers) returns an array of counts indexed by the integer.
    EXECUTION STATE
      FD002 sizes = [8896, 8869, 8989, 8900, 8990, 9115] (within 3% of perfectly balanced)

return labels, km.cluster_centers_, sil, sizes
    All four diagnostics in one tuple.

df = pd.read_csv("data/raw/train_FD002.txt", sep=r"\s+", header=None, names=COLUMNS)
    FD002 train.

labels, centroids, sil, sizes = discover_and_validate(df, n_conditions=6)
    Run discovery.
    EXECUTION STATE
      labels.shape = (53759,), one cluster ID per training cycle

print(f"silhouette score : {sil:.3f} # > 0.7 = excellent separation")
    Validate quality.
    EXECUTION STATE
      Output = silhouette score : 0.879

print(f"cluster sizes : {sizes.tolist()}")
    Sanity-check balance.
    EXECUTION STATE
      Output = cluster sizes : [8896, 8869, 8989, 8900, 8990, 9115]

print(f"size imbalance : {sizes.max() / sizes.min():.2f}x")
    How lopsided? Real C-MAPSS gives 1.03x because the simulator samples conditions uniformly.
    EXECUTION STATE
      Output = size imbalance : 1.03x

print(f"centroids:")
    Header for the centroid printout.

for i, c in enumerate(centroids):
    Iterate the 6 centroids.

print(f" cond {i}: alt={c[0]:5.1f} Mach={c[1]:.3f} TRA={c[2]:5.1f}")
    Pretty-print each centroid; the values match the canonical NASA flight-envelope corners.
    EXECUTION STATE
      Output cond 0 = alt= 0.0 Mach=0.000 TRA=100.0 (sea-level static, full throttle)
      Output cond 5 = alt= 42.0 Mach=0.840 TRA=100.0 (top-of-climb cruise)
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# C-MAPSS layout: 2 id columns + 3 op-settings + 21 sensors = 26 columns
COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_set_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)
OP_COLS = [f"op_set_{i}" for i in range(1, 4)]


def discover_and_validate(df, n_conditions, seed=0):
    ops = df[OP_COLS].to_numpy()

    km = KMeans(n_clusters=n_conditions, n_init=10, random_state=seed).fit(ops)
    labels = km.labels_

    # Silhouette score: how well-separated are the clusters?
    # It is undefined for a single cluster (scikit-learn raises ValueError),
    # so report NaN for the single-condition subsets FD001/FD003.
    sil = silhouette_score(ops, labels) if n_conditions > 1 else float("nan")

    # Cluster sizes: ideally roughly balanced on C-MAPSS
    sizes = np.bincount(labels)

    return labels, km.cluster_centers_, sil, sizes


# ----- FD002: 6 conditions -----
df = pd.read_csv("data/raw/train_FD002.txt", sep=r"\s+", header=None, names=COLUMNS)

labels, centroids, sil, sizes = discover_and_validate(df, n_conditions=6)

print(f"silhouette score   : {sil:.3f}    # > 0.7 = excellent separation")
print(f"cluster sizes      : {sizes.tolist()}")
print(f"size imbalance     : {sizes.max() / sizes.min():.2f}x")
print(f"centroids:")
for i, c in enumerate(centroids):
    print(f"  cond {i}: alt={c[0]:5.1f}  Mach={c[1]:.3f}  TRA={c[2]:5.1f}")

# silhouette score   : 0.879    # > 0.7 = excellent separation
# cluster sizes      : [8896, 8869, 8989, 8900, 8990, 9115]
# size imbalance     : 1.03x
# centroids:
#   cond 0: alt=  0.0  Mach=0.000  TRA=100.0
#   cond 5: alt= 42.0  Mach=0.840  TRA=100.0
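The heading promises persistence, but the listing stops at validation. One possible way to persist the fitted model is joblib, which ships as a scikit-learn dependency. This is a minimal sketch under stated assumptions: the file name condition_kmeans.joblib is hypothetical, and a tiny synthetic op-settings array stands in for FD002 so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Tiny synthetic op-settings (assumed values) so the sketch runs without FD002
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.00, 100.0], [25.0, 0.62, 60.0], [42.0, 0.84, 100.0]])
ops = np.vstack([c + rng.normal(scale=0.001, size=(50, 3)) for c in centres])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ops)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "condition_kmeans.joblib"  # hypothetical file name
    joblib.dump(km, model_path)          # persist the fitted model
    restored = joblib.load(model_path)   # later: reload instead of re-fitting

# The restored model assigns exactly the same condition IDs
same_labels = np.array_equal(km.predict(ops), restored.predict(ops))
print("labels identical after reload:", same_labels)
```

Reloading the stored model, rather than re-fitting, keeps condition IDs identical between training and inference.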

PyTorch: Plumbing the Labels into the Dataset

The Dataset class below combines feature selection (Section 5.3) with condition discovery (Section 5.4). __getitem__ returns a 3-tuple: (X, condition_seq, y).

Condition-aware Dataset with k-means built in
cmapss_condition_aware.py

import numpy as np, pandas as pd, torch
    Compact triple-import.

from torch.utils.data import Dataset
    Base class.

from sklearn.cluster import KMeans
    Clustering.

INFORMATIVE_IDX = [...]
    14-sensor index list from §5.3.

class CMAPSSConditionAware(Dataset):
    Combines §5.3's FilteredCMAPSSDataset with §2.2's per-cycle condition tagging. __getitem__ now returns a 3-tuple: (X, cond_seq, y).

def __init__(self, csv_path, n_conditions=6, window=30):
    Same constructor signature you have seen before.

df = pd.read_csv(csv_path, sep=r"\s+", header=None, names=cols)
    Same loader.

df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]
    RUL labels.

if n_conditions > 1:
    Branch on whether to run k-means.

km = KMeans(n_clusters=n_conditions, n_init=10, random_state=0)
    Same KMeans config from §2.2.

df["cond"] = km.fit_predict(df[["op_set_1", "op_set_2", "op_set_3"]]).astype(np.int64)
    fit_predict combines fit and predict in one call. Cast to int64 for PyTorch index tensors.

else:
    Single-condition case.

df["cond"] = 0
    All-zero column.

sensor_cols = [f"sensor_{i+1}" for i in INFORMATIVE_IDX]
    14 informative sensor names.

self.window = window
    Stash window length.

self.samples, self.engines = [], {}
    Indexing storage.

for eid, sub in df.groupby("engine_id"):
    Iterate engines.

arr = sub[sensor_cols].to_numpy(dtype=np.float32)
    (N_e, 14) sensor matrix.

ruls = sub["RUL"].to_numpy(dtype=np.float32)
    RUL per cycle.

cond = sub["cond"].to_numpy(dtype=np.int64)
    Condition labels per cycle. int64 for PyTorch index compatibility.

self.engines[eid] = (arr, ruls, cond)
    Three-tuple per engine.

for end in range(window, len(sub) + 1):
    Sliding-window end positions. end is the exclusive slice bound, so each window covers rows end - window to end - 1.

self.samples.append((eid, end))
    Index storage.

def __len__(self): return len(self.samples)
    Total samples.

def __getitem__(self, idx):
    DataLoader contract.

eid, end = self.samples[idx]
    Lookup.

arr, ruls, cond = self.engines[eid]
    Pull all three pre-converted arrays.

return (torch.from_numpy(arr[end - self.window:end]), ...)
    Three-tuple returned: sensor window, condition sequence, scalar RUL.
    EXECUTION STATE
      X.shape = torch.Size([30, 14])
      c.shape = torch.Size([30])
      y.shape = torch.Size([])

ds = CMAPSSConditionAware("data/raw/train_FD002.txt", n_conditions=6)
    Construct on FD002. k-means runs once during __init__.

X, c, y = ds[0]
    Pull the first sample.

print("shapes:", tuple(X.shape), tuple(c.shape), tuple(y.shape))
    Sanity check.
    EXECUTION STATE
      Output = shapes: (30, 14) (30,) ()
import numpy as np, pandas as pd, torch
from torch.utils.data import Dataset
from sklearn.cluster import KMeans

# Zero-based indices of the 14 informative sensors (sensor_{i+1} in the file)
INFORMATIVE_IDX = [1, 2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 16, 19, 20]

class CMAPSSConditionAware(Dataset):
    def __init__(self, csv_path, n_conditions=6, window=30):
        cols = (["engine_id", "cycle"]
                + [f"op_set_{i}" for i in range(1, 4)]
                + [f"sensor_{i}" for i in range(1, 22)])
        df = pd.read_csv(csv_path, sep=r"\s+", header=None, names=cols)
        df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]

        # Discover conditions ONCE at construction.
        # Note: this fits k-means on whichever file is passed in; for a test
        # split, reuse the train-fitted model instead (see Pitfall 3).
        if n_conditions > 1:
            km = KMeans(n_clusters=n_conditions, n_init=10, random_state=0)
            df["cond"] = km.fit_predict(df[["op_set_1", "op_set_2", "op_set_3"]]).astype(np.int64)
        else:
            df["cond"] = 0

        sensor_cols = [f"sensor_{i+1}" for i in INFORMATIVE_IDX]
        self.window = window
        self.samples, self.engines = [], {}
        for eid, sub in df.groupby("engine_id"):
            arr  = sub[sensor_cols].to_numpy(dtype=np.float32)
            ruls = sub["RUL"].to_numpy(dtype=np.float32)
            cond = sub["cond"].to_numpy(dtype=np.int64)
            self.engines[eid] = (arr, ruls, cond)
            # `end` is an exclusive slice bound; one sample per full window
            for end in range(window, len(sub) + 1):
                self.samples.append((eid, end))

    def __len__(self): return len(self.samples)

    def __getitem__(self, idx):
        eid, end = self.samples[idx]
        arr, ruls, cond = self.engines[eid]
        return (
            torch.from_numpy(arr[end - self.window:end]),     # X (W, 14)
            torch.from_numpy(cond[end - self.window:end]),    # cond (W,)
            torch.tensor(ruls[end - 1]),                      # y scalar: RUL at the window's last cycle
        )


ds = CMAPSSConditionAware("data/raw/train_FD002.txt", n_conditions=6)
X, c, y = ds[0]
print("shapes:", tuple(X.shape), tuple(c.shape), tuple(y.shape))   # (30, 14) (30,) ()
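The index arithmetic in __init__ and __getitem__ is easy to get off by one, so here is a NumPy-only check of the same scheme. The 35-cycle engine is a hypothetical example, not a real C-MAPSS unit.

```python
import numpy as np

window = 30
n_cycles = 35  # hypothetical engine that fails after 35 cycles

cycles = np.arange(1, n_cycles + 1)
ruls = (cycles.max() - cycles).astype(np.float32)   # 34, 33, ..., 0

# Same index scheme as the Dataset: `end` is an exclusive slice bound
ends = list(range(window, n_cycles + 1))
first_window = cycles[ends[0] - window:ends[0]]
last_window = cycles[ends[-1] - window:ends[-1]]

print("windows per engine:", len(ends))                                    # 6
print("first window cycles:", first_window[0], "to", first_window[-1])     # 1 to 30
print("last window cycles:", last_window[0], "to", last_window[-1])        # 6 to 35
print("first / last target RUL:", ruls[ends[0] - 1], ruls[ends[-1] - 1])   # 5.0 0.0
```

An engine with N cycles therefore contributes N - window + 1 samples, and an engine shorter than the window contributes none because the range is empty.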

When k-Means Is the Wrong Tool

Scenario                             Better choice                Why
Unknown number of conditions         Gaussian Mixture + BIC       Selects K automatically
Non-spherical clusters               DBSCAN, HDBSCAN              Handles arbitrary shapes
Highly imbalanced cluster sizes      Spectral clustering          Less sensitive to size
Streaming / online setting           Minibatch k-means            Constant memory; updates per batch
High-dimensional ops (e.g., 100D)    k-means on PCA-reduced ops   Curse of dimensionality
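As an illustration of the first row, "selects K automatically" in practice means fitting one mixture per candidate K and keeping the one with the lowest BIC. The sketch below uses scikit-learn's GaussianMixture on synthetic 2-D data; the data, the candidate range 1 to 6 and n_init=3 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three well-separated groups (assumed for illustration)
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(200, 2)) for c in centres])

# Fit one mixture per candidate K and record its BIC (lower is better)
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
for k, bic in bics.items():
    print(f"K={k}: BIC={bic:.1f}")
print("K with lowest BIC:", best_k)
```

BIC trades goodness of fit against parameter count, so underfitting (too few components) is penalised through the likelihood and overfitting through the complexity term.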

Three Discovery Pitfalls

Pitfall 1: Wrong K. If you accidentally pass n_clusters=4 instead of 6 on FD002, k-means happily produces 4 clusters and silhouette drops to ~0.6. Always validate cluster count against the dataset documentation.
Pitfall 2: Not setting random_state. Without a seed, k-means assignments are non-deterministic across runs - your condition ID 3 today might be condition ID 5 tomorrow. Lock the seed in production.
Pitfall 3: Re-fitting on test data. Fit k-means on TRAIN only; apply with km.predict(test_ops). Re-fitting on test data leaks information.
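Pitfalls 2 and 3 can be handled together. The sketch below fits on a training split only, reuses the fitted model on a test split via predict, and renumbers clusters by centroid altitude so the IDs stay stable even if the raw numbering changes. The synthetic splits and the names synthetic_ops and remap are assumptions for illustration; ranking by altitude works here only because the six canonical altitudes are all distinct.

```python
import numpy as np
from sklearn.cluster import KMeans

# Canonical operating points: altitude (kft), Mach, TRA (%)
CANONICAL = np.array([
    [0.0, 0.00, 100.0],
    [10.0, 0.25, 100.0],
    [20.0, 0.70, 100.0],
    [25.0, 0.62, 60.0],
    [35.0, 0.84, 100.0],
    [42.0, 0.84, 100.0],
])


def synthetic_ops(n_per_condition, seed):
    """Hypothetical stand-in for a train or test split."""
    rng = np.random.default_rng(seed)
    rows = np.repeat(CANONICAL, n_per_condition, axis=0)
    return rows + rng.normal(scale=0.001, size=rows.shape)


train_ops = synthetic_ops(100, seed=1)
test_ops = synthetic_ops(20, seed=2)

# Pitfall 3: fit on TRAIN only, then reuse the fitted model on test
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(train_ops)

# Pitfall 2: make IDs stable by ranking centroids by altitude (column 0)
order = np.argsort(km.cluster_centers_[:, 0])
remap = np.empty(6, dtype=np.int64)
remap[order] = np.arange(6)          # raw cluster ID -> altitude rank

train_cond = remap[km.labels_]
test_cond = remap[km.predict(test_ops)]

print(np.bincount(train_cond).tolist())  # [100, 100, 100, 100, 100, 100]
print(np.bincount(test_cond).tolist())   # [20, 20, 20, 20, 20, 20]
```

With the remap in place, condition 0 always means the lowest-altitude regime and condition 5 the highest, regardless of which seed produced the fit.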
The whole pattern. Three op-settings cluster cleanly into 6 regimes. k-means with n_init=10 and a fixed seed recovers them with silhouette ~ 0.88 and 1.03x size balance. Plumb the labels into the Dataset and the per-condition normaliser (Chapter 6) does the rest.

Takeaway

  • k-means with k=6 nails C-MAPSS conditions. Silhouette 0.88, near-balanced clusters.
  • Validation matters. Silhouette score and cluster balance catch bad fits in seconds.
  • Discovery happens once, in __init__. No per-batch k-means at training time.
  • The condition label is now part of every sample. Every (X, cond, y) tuple lets downstream layers normalise per regime.
  • k-means is not always the answer. Unknown K, non-spherical clusters, or streaming data require alternatives.