Chapter 5

Selecting 14 Informative Sensors

NASA Datasets Deep Dive

Wheat From Chaff

Of the 21 raw sensors, 7 are constant on FD001 (for example, control commands such as “demanded fan speed”) and contribute no signal. Section 5.2 listed them; this section formalises the selection criteria and writes the code that drops them automatically. The result is the canonical $F = 14$ input dimension that many published C-MAPSS papers use.

The principle. A feature is informative for RUL if (a) it varies, AND (b) its variation correlates with degradation. Either filter alone misses cases the other catches.

Three Selection Criteria

| Criterion | Test (keep if) | Drops |
| --- | --- | --- |
| Variance | $\sigma_{\text{sensor}} > 10^{-6}$ | Constants (sensors 1, 5, 6, 10, 16, 18, 19 on FD001) |
| RUL correlation | $\lvert\rho(\text{sensor},\, \text{RUL})\rvert > 0.05$ | Sensors that vary but are uncorrelated with RUL |
| Cross-FD consistency | Survives in all four FD subsets | FD-specific noise variables |
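
For reference, the two statistics in the table can be written out explicitly. This is a sketch that assumes the sample standard deviation (the $N-1$ denominator, which is the pandas `.std()` default) and the Pearson coefficient (the `.corrwith` default). The symbols $x_{s,i}$ (reading of sensor $s$ on row $i$) and $r_i$ (RUL label on row $i$) are introduced here for illustration only.

```latex
\sigma_s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_{s,i}-\bar{x}_s\right)^2},
\qquad
\rho(s,\text{RUL}) =
\frac{\sum_{i=1}^{N}\left(x_{s,i}-\bar{x}_s\right)\left(r_i-\bar{r}\right)}
     {\sqrt{\sum_{i=1}^{N}\left(x_{s,i}-\bar{x}_s\right)^2}\;
      \sqrt{\sum_{i=1}^{N}\left(r_i-\bar{r}\right)^2}}
```

Note that $\rho$ is undefined when $\sigma_s = 0$ (the denominator vanishes), which is one reason to run the variance filter before the correlation filter.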

On FD001 the variance filter alone recovers the canonical 14 sensors; the correlation filter is a defensive safety net that catches cases like FD002 where some op-setting-coupled sensors accidentally vary across regimes without being meaningful for RUL.
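
Before relying on either filter, it can help to see which one is responsible for each dropped sensor. The sketch below builds a small per-sensor report; `selection_report` and the toy column names are hypothetical, and the data is synthetic rather than C-MAPSS. It also illustrates a general caveat: a sensor that toggles between two adjacent quantised values has a standard deviation well above `1e-6`, so it passes the variance filter and is only removed if the correlation filter catches it.

```python
import numpy as np
import pandas as pd


def selection_report(df, sensor_cols, var_threshold=1e-6, corr_threshold=0.05):
    """Return one row per sensor: its std, |corr with RUL|, and a verdict."""
    std = df[sensor_cols].std()
    # Pearson correlation is undefined for zero-variance columns (NaN);
    # treat those as zero correlation so the comparison below is well defined.
    abs_corr = df[sensor_cols].corrwith(df["RUL"]).abs().fillna(0.0)
    report = pd.DataFrame({"std": std, "abs_corr": abs_corr})
    report["verdict"] = "keep"
    report.loc[report["abs_corr"] <= corr_threshold, "verdict"] = "drop: correlation"
    # The variance verdict is written last so it takes precedence.
    report.loc[report["std"] <= var_threshold, "verdict"] = "drop: variance"
    return report


# Synthetic stand-in for one engine's history (NOT C-MAPSS data).
rng = np.random.default_rng(0)
n = 2000
rul = np.arange(n, 0, -1, dtype=float)
toy = pd.DataFrame({
    "RUL": rul,
    "sensor_a": np.full(n, 518.67),                              # constant
    "sensor_b": -0.01 * rul + rng.normal(0.0, 1.0, n),           # trends with RUL
    "sensor_c": np.where(np.arange(n) % 2 == 0, 21.60, 21.61),   # toggles, no trend
})

report = selection_report(toy, ["sensor_a", "sensor_b", "sensor_c"])
print(report)
```

On real data, the same report shows at a glance whether the variance filter alone accounts for every dropped sensor or whether the correlation filter is doing part of the work.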

Python: A Reusable Selector

Two-stage variance + correlation selector
select_features.py

import numpy as np

Standard alias.

import pandas as pd

DataFrame statistics.

COLUMNS = ...

26-column layout.

SENSOR_COLS = [...]

21 sensor names.

def select_informative(df, var_threshold=1e-6, corr_threshold=0.05):

Two-stage feature selector. Drops constants (standard deviation at or below the threshold) and sensors that barely correlate with RUL.

EXECUTION STATE
input: df = DataFrame with the 21 sensor columns plus a 'RUL' column
input: var_threshold = 1e-6; any std at or below this is treated as constant
input: corr_threshold = 0.05; sensors whose |corr(sensor, RUL)| is at or below this are dropped
returns = list of surviving column names

keep = df[SENSOR_COLS].std() > var_threshold

Boolean Series: True where the std is above the threshold. A sensor that did not move is False.

EXECUTION STATE
Why threshold > 0? = Floating-point noise can leave a 'constant' sensor with a std of 1e-10 instead of exactly 0. A threshold of 1e-6 is robust to that.

survivors = keep[keep].index.tolist()

Boolean-indexing idiom: keep[keep] selects only the True entries; .index.tolist() yields the column names.

EXECUTION STATE
survivors after filter 1 = 14 sensors on FD001 (after dropping the 7 constants)

if "RUL" in df.columns and survivors:

Filter 2 only runs if a RUL column exists (i.e., we are on the train set, not test) and at least one sensor survived filter 1.

corrs = df[survivors].corrwith(df["RUL"]).abs()

.corrwith computes the Pearson correlation coefficient between each column of df[survivors] and the RUL column. We take the absolute value because either a positive or a negative correlation is informative.

EXECUTION STATE
.corrwith(other) = pandas method: pairwise correlation with another Series. Returns a Series indexed by column name.
Example: corrs['sensor_2'] = 0.69 (strong correlation in absolute value; the raw coefficient is negative because sensor_2 grows as RUL shrinks)

survivors = corrs[corrs > corr_threshold].index.tolist()

Keep only sensors whose absolute correlation with RUL exceeds the threshold. On FD001 this filter removes 0 sensors: all 14 variance-survivors are also correlated.

EXECUTION STATE
survivors after filter 2 = 14 sensors (on FD001 the variance filter already did the heavy lifting)

return survivors

Hand back the list of surviving column names.

df = pd.read_csv("data/raw/train_FD001.txt", sep=r"\s+", header=None, names=COLUMNS)

Load FD001 train.

df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]

Compute RUL labels for the correlation filter: each engine's last cycle minus the current cycle.

selected = select_informative(df)

Run the selector.

EXECUTION STATE
len(selected) = 14

print(f"selected ({len(selected)}):", selected)

Print the surviving names. They should match the table from §5.2.

EXECUTION STATE
Output = ['sensor_2', 'sensor_3', 'sensor_4', 'sensor_7', 'sensor_8', 'sensor_9', 'sensor_11', 'sensor_12', 'sensor_13', 'sensor_14', 'sensor_15', 'sensor_17', 'sensor_20', 'sensor_21']

expected = [...]

The same 14 sensors hand-labelled by NASA / community convention.

print("matches manual?", set(selected) == set(expected))

Sanity check: the data-driven selector recovers the published informative set.

EXECUTION STATE
Output = matches manual? True

The full listing:
```python
import numpy as np
import pandas as pd

COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_set_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)
SENSOR_COLS = [f"sensor_{i}" for i in range(1, 22)]


def select_informative(df: pd.DataFrame,
                       var_threshold: float = 1e-6,
                       corr_threshold: float = 0.05) -> list[str]:
    """Drop constant sensors AND sensors weakly correlated with RUL.

    Returns the list of column names that survive both filters.
    """
    # Filter 1: variance
    keep = df[SENSOR_COLS].std() > var_threshold
    survivors = keep[keep].index.tolist()

    # Filter 2: |Pearson correlation with RUL| above threshold
    if "RUL" in df.columns and survivors:
        corrs = df[survivors].corrwith(df["RUL"]).abs()
        survivors = corrs[corrs > corr_threshold].index.tolist()

    return survivors


# ----- Run on FD001 train -----
df = pd.read_csv("data/raw/train_FD001.txt", sep=r"\s+", header=None, names=COLUMNS)
df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]

selected = select_informative(df)
print(f"selected ({len(selected)}):", selected)

# Compare to manual labelling
expected = ["sensor_2", "sensor_3", "sensor_4", "sensor_7", "sensor_8",
            "sensor_9", "sensor_11", "sensor_12", "sensor_13", "sensor_14",
            "sensor_15", "sensor_17", "sensor_20", "sensor_21"]
print("matches manual?", set(selected) == set(expected))

# selected (14): ['sensor_2', 'sensor_3', 'sensor_4', 'sensor_7', ...]
# matches manual? True
```

The data-driven and the manual answers agree. The 14 sensors hand-labelled as informative (by NASA / community convention) match what an automatic variance + correlation filter recovers. That agreement is valuable: it suggests the filter is a reasonable starting point on a new dataset (say, N-CMAPSS), provided you inspect what it keeps before trusting it.
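
Carrying the filter to another dataset usually means deciding the column list once, on training data, and then reusing that frozen list everywhere else. A minimal sketch of that split follows. The helper names `fit_selection` and `apply_selection` are hypothetical, the frames are synthetic stand-ins rather than C-MAPSS files, and the test frame deliberately has no `RUL` column.

```python
import numpy as np
import pandas as pd


def fit_selection(train_df, sensor_cols, var_threshold=1e-6, corr_threshold=0.05):
    """Decide the surviving columns from the TRAIN frame only."""
    std = train_df[sensor_cols].std()
    varying = std[std > var_threshold].index.tolist()
    if not varying:
        return []
    corrs = train_df[varying].corrwith(train_df["RUL"]).abs()
    return corrs[corrs > corr_threshold].index.tolist()


def apply_selection(df, selected):
    """Subset any frame to the frozen list; fail loudly if a column is missing."""
    missing = [c for c in selected if c not in df.columns]
    if missing:
        raise KeyError(f"columns missing from frame: {missing}")
    return df[selected]


# Synthetic stand-ins (NOT C-MAPSS data).
rng = np.random.default_rng(0)
n = 500
rul = np.arange(n, 0, -1, dtype=float)
train = pd.DataFrame({
    "RUL": rul,
    "s_const": np.ones(n),                              # never moves
    "s_trend": -0.02 * rul + rng.normal(0.0, 1.0, n),   # rises as RUL falls
    "s_toggle": np.tile([0.0, 1.0], n // 2),            # moves, but no trend
})
test = pd.DataFrame({
    "s_const": np.ones(50),
    "s_trend": rng.normal(0.0, 1.0, 50),
    "s_toggle": np.tile([0.0, 1.0], 25),
})

selected = fit_selection(train, ["s_const", "s_trend", "s_toggle"])
X_test = apply_selection(test, selected)   # no test statistics are computed
print(selected, X_test.shape)
```

Keeping the two steps separate makes it hard to recompute statistics on held-out data by accident, because `apply_selection` never looks at values, only at column names.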

PyTorch: Applied at the Dataset Boundary

The cleanest place to apply feature selection is in Dataset.__init__, so the model never sees the dropped columns. The class below is identical to Section 2.1's CMAPSSDataset except that it keeps only the 14-sensor subset.

Drop constants at Dataset construction
filtered_cmapss_dataset.py

from torch.utils.data import Dataset

Base class.

import numpy as np, pandas as pd, torch

Compact one-line import.

INFORMATIVE_IDX = [...]

Same 14-sensor index list from §5.2's PyTorch block. 0-based indices into the 21-sensor catalog.

class FilteredCMAPSSDataset(Dataset):

Identical to Section 2.1's CMAPSSDataset except that it consumes only the 14 informative sensor columns. Output X has 14 channels instead of 21.

def __init__(self, csv_path, window=30):

Same constructor signature as §2.1; the only change is the sensor subset used internally.

df = pd.read_csv(...)

Same loader.

df["RUL"] = ...

Per-engine RUL.

sensor_cols = [f"sensor_{i+1}" for i in INFORMATIVE_IDX]

Translate the 0-based INFORMATIVE_IDX into the 1-based sensor_<n> column names. The +1 bridges the two conventions.

EXECUTION STATE
sensor_cols = ['sensor_2', 'sensor_3', ..., 'sensor_21'] (14 strings)

self.window = window

Stash the window length.

self.samples, self.engines = [], {}

Per-engine state.

for eid, sub in df.groupby('engine_id'):

Iterate over engines.

arr = sub[sensor_cols].to_numpy(dtype=np.float32)

Materialise the 14-channel array as float32.

EXECUTION STATE
arr.shape (per engine) = (N_e, 14), i.e. 14 channels instead of 21

ruls = sub['RUL'].to_numpy(dtype=np.float32)

Per-engine RUL.

self.engines[eid] = (arr, ruls)

Store both arrays.

for end in range(window, len(sub) + 1):

Sliding-window end positions (exclusive). Engines shorter than the window contribute no samples.

self.samples.append((eid, end))

Index storage: only (engine, end) pairs are kept, not the windows themselves.

def __len__(self): return len(self.samples)

Sample count.

def __getitem__(self, idx):

DataLoader contract.

X = torch.from_numpy(arr[end - self.window:end])

30-cycle slice; 14 channels.

EXECUTION STATE
X.shape = torch.Size([30, 14]), reduced from (30, 21)

y = torch.tensor(ruls[end - 1])

RUL at the last cycle of the window.

return X, y

(X, y) tuple.

ds = FilteredCMAPSSDataset(...)

Construct on FD001.

X, y = ds[0]

Pull sample 0.

print("X.shape:", tuple(X.shape))

Verify the reduced shape.

EXECUTION STATE
Output = X.shape: (30, 14)

The full listing:
```python
from torch.utils.data import Dataset
import numpy as np, pandas as pd, torch

INFORMATIVE_IDX = [1, 2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 16, 19, 20]   # 0-based; 14 sensors


class FilteredCMAPSSDataset(Dataset):
    """Same windowing as Section 2.1's CMAPSSDataset, but selects 14 sensors."""

    def __init__(self, csv_path: str, window: int = 30):
        cols = (["engine_id", "cycle"]
                + [f"op_set_{i}" for i in range(1, 4)]
                + [f"sensor_{i}" for i in range(1, 22)])
        df = pd.read_csv(csv_path, sep=r"\s+", header=None, names=cols)
        df["RUL"] = df.groupby("engine_id")["cycle"].transform("max") - df["cycle"]

        sensor_cols = [f"sensor_{i+1}" for i in INFORMATIVE_IDX]
        self.window = window
        self.samples, self.engines = [], {}
        for eid, sub in df.groupby("engine_id"):
            arr = sub[sensor_cols].to_numpy(dtype=np.float32)   # (N_e, 14)
            ruls = sub["RUL"].to_numpy(dtype=np.float32)
            self.engines[eid] = (arr, ruls)
            for end in range(window, len(sub) + 1):
                self.samples.append((eid, end))

    def __len__(self): return len(self.samples)

    def __getitem__(self, idx):
        eid, end = self.samples[idx]
        arr, ruls = self.engines[eid]
        X = torch.from_numpy(arr[end - self.window:end])    # (W, 14)
        y = torch.tensor(ruls[end - 1])
        return X, y


ds = FilteredCMAPSSDataset("data/raw/train_FD001.txt", window=30)
X, y = ds[0]
print("X.shape:", tuple(X.shape))    # (30, 14)
```

Why this matters for the backbone. The CNN (Chapter 8) declares in_channels=14 instead of 21 once you adopt the filtered dataset. That saves a small number of parameters and a little compute, but more importantly, every sensor the model sees is one that actually moves.
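
To put a rough number on the parameter saving, the sketch below counts the weights of a single first-layer 1-D convolution for the full 21-sensor input and the 14-sensor input. The layer width (64 output channels) and kernel size (3) are hypothetical values chosen for illustration; they are not taken from the Chapter 8 architecture.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a standard 1-D convolution: out * in * k weights,
    plus one bias per output channel."""
    weights = out_channels * in_channels * kernel_size
    return weights + (out_channels if bias else 0)


OUT_CHANNELS = 64   # hypothetical first-layer width
KERNEL_SIZE = 3     # hypothetical kernel size

full = conv1d_param_count(21, OUT_CHANNELS, KERNEL_SIZE)
filtered = conv1d_param_count(14, OUT_CHANNELS, KERNEL_SIZE)
print(full, filtered, full - filtered)   # 4096 2752 1344
```

Only the first layer depends on the input channel count, so under these assumptions the saving is a fixed 1,344 parameters regardless of how deep the rest of the network is.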

Feature Selection in Other Pipelines

| Domain | Selection criterion | Typical method |
| --- | --- | --- |
| RUL (this book) | Variance + RUL correlation | Two-stage filter (this section) |
| Tabular classification | Mutual information w/ target | sklearn SelectKBest |
| Genomics | Differential expression | DESeq2, edgeR |
| NLP | TF-IDF threshold | scikit-learn TfidfVectorizer |
| Vision | Layer-wise principal components | PCA, ICA |
| Time-series anomaly | Variance + autocorrelation | STL decomposition |

Three Selection Pitfalls

Pitfall 1: Selecting on test data. Compute std/correlation on the TRAIN file only, then apply the resulting column list to test. Leaking test statistics into selection biases evaluation.
Pitfall 2: Per-FD selection mismatch. Different FD subsets have different constant sets. If you want one model that runs on all four, keep the UNION of the per-subset informative sets, i.e. drop only sensors that are constant on EVERY FD subset.
Pitfall 3: Aggressive correlation thresholds. A threshold of 0.5 might drop sensors that contribute non-linearly (the model can still learn from a sensor that has 0.1 linear correlation). 0.05 is conservative and matches the published 14-sensor convention.
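
The two ways of reconciling per-subset selections can be sketched with plain set operations. The union corresponds to the rule in Pitfall 2 (drop a sensor only if no subset keeps it), while the intersection corresponds to the stricter cross-FD consistency criterion (keep a sensor only if every subset keeps it). The function name `combine_selections` is hypothetical, and the per-subset lists are toy values, not measured results for any FD subset.

```python
def combine_selections(per_subset, mode="union"):
    """Merge per-subset lists of kept sensors.

    mode="union":        keep a sensor if ANY subset keeps it.
    mode="intersection": keep a sensor only if EVERY subset keeps it.
    """
    if mode not in ("union", "intersection"):
        raise ValueError(f"unknown mode: {mode!r}")
    sets = [set(cols) for cols in per_subset.values()]
    if not sets:
        return []
    combined = set.union(*sets) if mode == "union" else set.intersection(*sets)
    # Sort by sensor number so the channel order is stable across runs.
    return sorted(combined, key=lambda name: int(name.rsplit("_", 1)[1]))


# Toy selections for two hypothetical subsets (NOT real FD results).
per_subset = {
    "subset_A": ["sensor_2", "sensor_3", "sensor_11"],
    "subset_B": ["sensor_2", "sensor_11", "sensor_16"],
}

print(combine_selections(per_subset, "union"))
print(combine_selections(per_subset, "intersection"))
```

Whichever mode you choose, fixing the sort order matters as much as the membership: the model's input channels are positional, so the same list must be used in the same order for every subset.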
The takeaway in one sentence. Drop the seven constants, keep the fourteen movers, ship.

Takeaway

  • 14 of 21 sensors carry signal on FD001. Variance + RUL-correlation filters recover the canonical set automatically.
  • Selection happens once, at the Dataset boundary. The model just declares in_channels=14 and never knows the dropped sensors existed.
  • Always select on train data only. Test statistics must not influence which features the model sees.
  • Correlation threshold is conservative. 0.05 is the convention; tighter thresholds risk dropping non-linear contributors the network would have used.