Chapter 2

Why Multi-Condition Datasets Are Hard

Benchmarks: C-MAPSS & N-CMAPSS

Two People Speaking at Once

Try to follow a soft conversation while a brass band marches through the room. Most of the audio energy reaching your ears is the band; the words you actually want are buried under it. The instinctive fix is not to crank the volume — volume is regime-agnostic and lifts both. The fix is to cancel the band first, then listen.

That is exactly what multi-condition C-MAPSS does to a naive RUL model. The “band” is the operating regime: each cycle's sensor reading is dominated by whether the engine is at sea-level idle or at 35,000 ft cruise. The “words” are the slow, gradual degradation signal we actually want to predict. Throw both into the same Z-score normaliser and roughly 98.9% of the variance the model sees is the band, so most of its capacity goes into recognising the regime rather than the degradation.

The chapter's engineering bottom line. Per-condition normalisation removes the regime energy, leaving the small but coherent degradation signal in plain view. Chapter 6 turns this into a class. Here we just measure how big the problem is.

The Math: Variance Partitioning

Let one sensor's value be $X = X(\text{cycle},\, \text{condition},\, \text{degradation})$. Pool the entire training set and compute its variance. The classical decomposition (law of total variance) gives:

$$\mathrm{Var}(X) \;=\; \underbrace{\mathrm{Var}\bigl[\,\mathbb{E}[X \mid c]\bigr]}_{\text{between conditions}} \;+\; \underbrace{\mathbb{E}\bigl[\,\mathrm{Var}(X \mid c)\bigr]}_{\text{within conditions}}.$$

The first term — between-condition variance — is the variance of the cluster centroids around the global mean. It is regime energy: it tells you which of the six conditions the engine is in, nothing more. The second term — within-condition variance — is the average variance inside any one cluster. It contains the sensor noise and the slow degradation drift that we actually want to predict.
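The decomposition can be checked numerically. The sketch below is illustrative only: the three conditions, their sizes and their centres are made-up values, and the group sizes are deliberately unequal to show that the expectation over $c$ is a weighted average with weights $n_c / N$. When every condition has the same number of samples, as in the script later in this section, a plain average of the per-condition variances gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three conditions with deliberately unequal sample counts.
sizes = [200, 500, 1300]
centres = [1390.0, 1450.0, 1495.0]
x = np.concatenate([rng.normal(m, 4.0, n) for m, n in zip(centres, sizes)])
c = np.concatenate([np.full(n, k) for k, n in enumerate(sizes)])

labels = np.unique(c)
weights = np.array([(c == k).mean() for k in labels])      # P(c = k) = n_k / N
cond_mean = np.array([x[c == k].mean() for k in labels])   # E[X | c = k]
cond_var = np.array([x[c == k].var() for k in labels])     # Var(X | c = k)

total = x.var()                                            # population variance
between = np.sum(weights * (cond_mean - x.mean()) ** 2)    # Var[E[X | c]]
within = np.sum(weights * cond_var)                        # E[Var(X | c)]

print(f"total            : {total:.4f}")
print(f"between + within : {between + within:.4f}")
print(f"between / total  : {between / total:.4f}")
```

With population variances (NumPy's default `ddof=0`) the two sides agree up to floating-point error, so either term can be obtained by subtracting the other from the total.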

On real C-MAPSS FD002 data, the between-condition share $\mathrm{Var}[\mathbb{E}[X \mid c]] / \mathrm{Var}(X)$ is approximately $0.989$ for a typical sensor: 98.9% of the total variance is regime and 1.1% is the signal of interest. Per-condition normalisation removes the 98.9% and lets the model spend its capacity on the remaining 1.1%.

Read this twice. The signal we want to learn from is $\sim 1\%$ of the raw variance. Mixing regimes is therefore not a small inefficiency; it is two orders of magnitude of attention misallocation.

Interactive: 99% of the Signal Is Regime Noise

Pick a sensor and toggle the normalisation scheme. The histogram is coloured by condition: under raw values you see six clearly separated peaks; under global Z-score you see the same six peaks, just rescaled; under per-condition Z-score the six peaks collapse onto each other and the residual variance becomes the within-condition variance — which is what we actually want the model to model.


The “between-condition” box on the right is the regime information. Global normalisation rescales it together with everything else, so its share of the total stays the same. Per-condition normalisation drives it to zero by construction. Note also that the total variance is identical (1.0) under both Z-score modes; what changes is the partition.
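The behaviour of the two Z-score schemes follows directly from the decomposition. A short derivation, writing $m$ and $s$ for the pooled mean and standard deviation and $\mu_c$, $\sigma_c$ for the per-condition ones (these symbols are introduced here for illustration and assume $\sigma_c > 0$):

```latex
\begin{aligned}
&\textbf{Global Z-score: } Z = \frac{X - m}{s} \\
&\qquad \mathbb{E}[Z \mid c] = \frac{\mathbb{E}[X \mid c] - m}{s}
  \;\Rightarrow\;
  \mathrm{Var}\bigl[\mathbb{E}[Z \mid c]\bigr] = \frac{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr]}{s^{2}}, \\
&\qquad \mathrm{Var}(Z \mid c) = \frac{\mathrm{Var}(X \mid c)}{s^{2}}
  \;\Rightarrow\;
  \mathbb{E}\bigl[\mathrm{Var}(Z \mid c)\bigr] = \frac{\mathbb{E}\bigl[\mathrm{Var}(X \mid c)\bigr]}{s^{2}}, \\
&\qquad \frac{\mathrm{Var}\bigl[\mathbb{E}[Z \mid c]\bigr]}{\mathrm{Var}(Z)}
  = \frac{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr]}{\mathrm{Var}(X)}
  \quad \text{(both terms are divided by the same } s^{2}\text{)}. \\[6pt]
&\textbf{Per-condition Z-score: } Z = \frac{X - \mu_c}{\sigma_c} \\
&\qquad \mathbb{E}[Z \mid c] = 0 \;\; \text{for every } c
  \;\Rightarrow\;
  \mathrm{Var}\bigl[\mathbb{E}[Z \mid c]\bigr] = 0, \\
&\qquad \mathrm{Var}(Z \mid c) = 1 \;\; \text{for every } c
  \;\Rightarrow\;
  \mathrm{Var}(Z) = 0 + 1 = 1 .
\end{aligned}
```

Both schemes end with unit total variance, but only the per-condition one moves all of it into the within-condition term.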

Python: Quantify the Damage

A few dozen lines of NumPy and the punch-line is unmissable: 98.9% of the raw sensor variance is regime, and per-condition Z-score eliminates exactly that fraction. Without this preprocessing step, even the fanciest downstream model is fighting a losing battle.

Variance partitioning + per-condition normalisation
🐍variance_partition.py
import numpy as np

Standard alias.

np.random.seed(7)

Lock the RNG so the variance numbers below are deterministic.

n_per = 600

Samples per condition. 600 × 6 = 3,600 total - enough that variance estimates are stable to the third decimal place.

EXECUTION STATE
n_per = 600
cond_means = np.array([1390, 1430, 1480, 1450, 1490, 1495])

The mean value of one sensor (LPT outlet temperature, T50) at each of the six C-MAPSS operating conditions. Real C-MAPSS values are within ±10 of these.

EXECUTION STATE
cond_means = [1390, 1430, 1480, 1450, 1490, 1495] Rankine
spread (max - min) = 105 R - this is what regime noise looks like
within_std = 4.0

Within-condition standard deviation - the actual signal we care about. Includes both measurement noise and the slow degradation drift across an engine's life.

EXECUTION STATE
within_std = 4.0 R
→ ratio to spread = Within-cond std (4.0) is 26x smaller than between-cond spread (105). This is why mixing them destroys the signal.
data, conds = [], []

Two parallel Python lists - sensor values and condition labels.

for c in range(6):

One iteration per condition.

LOOP TRACE · 6 iterations
c = 0
samples = 600 draws from N(1390, 4)
c = 1
samples = 600 draws from N(1430, 4)
c = 2
samples = 600 draws from N(1480, 4)
c = 3
samples = 600 draws from N(1450, 4)
c = 4
samples = 600 draws from N(1490, 4)
c = 5
samples = 600 draws from N(1495, 4)
data.append(np.random.normal(cond_means[c], within_std, n_per))

Sample 600 values from a Gaussian centred at this condition's mean.

EXECUTION STATE
np.random.normal(loc, scale, size) = Gaussian sampler. loc=mean, scale=std, size=count.
Example: c=0 = [1392.51, 1387.34, 1394.21, ...] - 600 numbers around 1390
conds.extend([c] * n_per)

Append n_per copies of the condition label. Python's list.extend([c] * 600) is the idiomatic way to do this.

EXECUTION STATE
After loop, len(conds) = 3600
data = np.concatenate(data)

Flatten the list of 6 ndarrays into one ndarray of shape (3600,). np.concatenate joins along an EXISTING axis (axis=0 by default for 1-D inputs).

EXECUTION STATE
np.concatenate(arrays, axis=0) = Joins arrays along an existing axis. Different from np.stack (creates new axis).
data.shape = (3600,)
conds = np.array(conds)

Promote the Python list to an ndarray for fast boolean indexing later.

EXECUTION STATE
conds.shape = (3600,)
conds.dtype = int64
total_var = float(data.var())

Variance of the entire pooled sample. The crucial number. Cast to float for clean printing.

EXECUTION STATE
.var() = Computes variance with N in the denominator (population, not sample, by default).
Output = total_var ≈ 1405.36 R²
within_var = float(np.mean([data[conds == c].var() for c in range(6)]))

Average within-condition variance. For each condition c, slice out just those samples (boolean mask), compute variance, then average across the 6 conditions. List comprehension hides one for-loop.

EXECUTION STATE
data[conds == c] = Boolean indexing - returns rows where the condition label matches c.
.var() per condition = Approximately within_std² = 16, varying ±0.5 sample-to-sample
Output = within_var ≈ 15.77 R²
between_var = total_var - within_var

By the law of total variance: total = between + within. Rearranging gives between = total - within.

EXECUTION STATE
Output = between_var ≈ 1389.59 R² - this is the regime-induced variance
Ratio = between / total = 1389.59 / 1405.36 ≈ 0.989
print(f"total variance : {total_var:8.2f}")

Pretty-print the totals.

EXECUTION STATE
Output = total variance : 1405.36
print(f"within-condition : {within_var:8.2f}")

Just within-condition variance.

EXECUTION STATE
Output = within-condition : 15.77
print(f"between-condition : {between_var:8.2f}")

Between-condition variance.

EXECUTION STATE
Output = between-condition : 1389.59
print(f"between / total : {between_var / total_var:6.3f}")

The key number: 98.9% of the variance comes from regime, not degradation. Any model that does not factor out regime first is fighting an uphill battle.

EXECUTION STATE
Output = between / total : 0.989
data_global = (data - data.mean()) / data.std()

Global Z-score: one mean / one std across the entire pooled dataset.

EXECUTION STATE
data.mean() = Pooled mean - falls somewhere between the 6 condition means
data.std() = Pooled std - dominated by between-condition spread
data_global.var() = 1.0 (Z-score has unit variance by construction)
data_perc = data.copy()

Make an independent copy so the in-place updates below don't clobber the original.

EXECUTION STATE
.copy() = Returns a new array with its own memory. Without it, data_perc = data would only bind a second name to the same array, and modifying one would change the other.
for c in range(6):

Iterate over conditions to apply per-regime normalisation.

LOOP TRACE · 6 iterations
c = 0
scope = Just the 600 samples in condition 0
c = 1
scope = Just the 600 samples in condition 1
c = 2..5
scope = Same pattern
mask = conds == c

Boolean mask for this condition. Numerical ops on data_perc[mask] act only on those rows.

data_perc[mask] = (data_perc[mask] - data_perc[mask].mean()) / data_perc[mask].std()

Z-score within this condition. After the loop completes, every condition's slice is mean 0, std 1.

EXECUTION STATE
Example: c = 0 = Centre data_perc[mask] (≈ N(1390, 4)) → ≈ N(0, 1)
print(f"global Z-score - residual variance: {data_global.var():8.4f}")

Global Z-score: total variance is mechanically 1.0.

EXECUTION STATE
Output = global Z-score - residual variance: 1.0000
print(f"per-cond Z-score - residual variance: {data_perc.var():8.4f}")

Per-condition Z-score: also 1.0 - all the variance is now within-condition (the degradation signal).

EXECUTION STATE
Output = per-cond Z-score - residual variance: 1.0000
→ key insight = Same total variance after both schemes. The DIFFERENCE is in the partition.
def between_ratio(arr):

Helper that recomputes the between/total ratio after each normalisation - the diagnostic that matters.

overall_var = arr.var()

Total variance of the (possibly normalised) array.

within_avg = np.mean([arr[conds == c].var() for c in range(6)])

Average within-condition variance after normalisation.

return (overall_var - within_avg) / overall_var

Same partition as before; ratio in [0, 1].

print(f"global Z: between/total = {between_ratio(data_global):.3f}")

Even after global Z-score, 98.9% of the variance is between-condition - the regime structure is preserved, only the units shrank.

EXECUTION STATE
Output = global Z: between/total = 0.989
→ why? = Subtracting one mean and dividing by one std shifts and rescales every cluster identically, so between- and within-condition variance shrink by the same factor and their ratio is unchanged.
print(f"per-cond Z: between/total = {between_ratio(data_perc):.3f}")

Per-condition Z-score: 0.0%. Every cluster is now N(0, 1) and centroids coincide. Regime noise eliminated.

EXECUTION STATE
Output = per-cond Z: between/total = 0.000
→ that 0.000 is the entire reason this book exists = Per-condition normalisation is the single most important preprocessing step for multi-condition prognostics.
import numpy as np

# Synthetic 6-condition dataset calibrated to C-MAPSS FD002 sensor scales
# (T50 - LPT outlet temperature, in Rankine).
np.random.seed(7)
n_per = 600
cond_means = np.array([1390, 1430, 1480, 1450, 1490, 1495])
within_std = 4.0

# Generate 3,600 samples (600 per condition)
data, conds = [], []
for c in range(6):
    data.append(np.random.normal(cond_means[c], within_std, n_per))
    conds.extend([c] * n_per)
data  = np.concatenate(data)
conds = np.array(conds)


# ----- Step 1. Variance partitioning -----
total_var   = float(data.var())
within_var  = float(np.mean([data[conds == c].var() for c in range(6)]))
between_var = total_var - within_var

print(f"total variance     : {total_var:8.2f}")
print(f"within-condition   : {within_var:8.2f}")
print(f"between-condition  : {between_var:8.2f}")
print(f"between / total    : {between_var / total_var:6.3f}")
# total variance     :  1405.36
# within-condition   :    15.77
# between-condition  :  1389.59
# between / total    :   0.989


# ----- Step 2. Compare global vs per-condition Z-score -----
# Global: one mean & std across the entire pooled dataset
data_global = (data - data.mean()) / data.std()

# Per-condition: separate mean & std per regime
data_perc = data.copy()
for c in range(6):
    mask = conds == c
    data_perc[mask] = (data_perc[mask] - data_perc[mask].mean()) / data_perc[mask].std()

print(f"global  Z-score - residual variance: {data_global.var():8.4f}")
print(f"per-cond Z-score - residual variance: {data_perc.var():8.4f}")
# global  Z-score - residual variance:   1.0000
# per-cond Z-score - residual variance:   1.0000

# ----- Step 3. Show the difference WHERE IT MATTERS -----
# After global Z-score, between-condition variance is STILL 98.9% of total.
# After per-condition Z-score, it is 0%.
def between_ratio(arr):
    overall_var = arr.var()
    within_avg  = np.mean([arr[conds == c].var() for c in range(6)])
    return (overall_var - within_avg) / overall_var

print(f"global  Z: between/total = {between_ratio(data_global):.3f}")
print(f"per-cond Z: between/total = {between_ratio(data_perc):.3f}")
# global  Z: between/total = 0.989
# per-cond Z: between/total = 0.000

One number to remember

Roughly $\mathrm{Var}_{\text{between}} / \mathrm{Var}_{\text{total}} \approx 0.99$ for typical C-MAPSS FD002 sensors. The exact ratio varies sensor by sensor (20% for some pressures, 99.9% for some temperatures), but the order-of-magnitude conclusion holds: regime dominates raw variance.
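To see how the ratio can differ from one sensor to the next, the same diagnostic can be applied column by column to a 2-D array. The sketch below is an illustration under stated assumptions: `between_share` is a hypothetical helper name, and the two "sensors" are synthetic, with made-up offsets chosen so that one is regime-dominated and the other only mildly regime-dependent. They are not C-MAPSS measurements.

```python
import numpy as np

def between_share(X, conds):
    """Between-condition share of variance for each column of X.

    X     : (n_samples, n_features) array of sensor readings
    conds : (n_samples,) integer condition labels
    """
    labels, counts = np.unique(conds, return_counts=True)
    weights = counts / counts.sum()
    # Weighted average of the per-condition variances, one value per column.
    within = sum(w * X[conds == k].var(axis=0) for w, k in zip(weights, labels))
    total = X.var(axis=0)
    return (total - within) / total

rng = np.random.default_rng(1)
n_per, n_conds = 500, 6
conds = np.repeat(np.arange(n_conds), n_per)

# Two made-up sensors: one whose mean moves strongly with the condition,
# one whose mean barely moves. The numbers are illustrative only.
strong_offsets = np.array([0.0, 40.0, 90.0, 60.0, 100.0, 105.0])
weak_offsets = np.array([0.0, 1.0, 2.0, 1.5, 2.5, 3.0])
strong = strong_offsets[conds] + rng.normal(0.0, 4.0, conds.size)
weak = weak_offsets[conds] + rng.normal(0.0, 2.0, conds.size)
X = np.column_stack([strong, weak])

share = between_share(X, conds)
print(np.round(share, 3))  # first column close to 1, second column much lower
```

Running a check like this over every sensor column before training shows which channels are regime-dominated and which are only mildly affected.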

PyTorch: A Per-Condition Normaliser Module

For training we want the normaliser to live inside the model graph — so it moves to the GPU automatically, gets saved in state_dict, and is differentiable on its non-statistic inputs. Below is the class Chapter 6 promotes to a first-class citizen.

Per-condition normalisation as an nn.Module
🐍per_condition_normaliser.py
import numpy as np

For pre-computed mean / std arrays.

import torch

Tensor type.

import torch.nn as nn

Module base class.

class PerConditionNormaliser(nn.Module):

A reusable normaliser that lives inside the model graph. It is an nn.Module rather than a free function so PyTorch can move its buffers to the right device, save them in state_dict, and pickle them with the rest of the model.

EXECUTION STATE
design choice = Buffers, not parameters - means/stds are statistics of the data, not learnable weights.
def __init__(self, means, stds):

Two NumPy arrays at construction time. They must have shape (n_conditions, n_features). Computing them is the job of the data pipeline (the next chapter does this).

EXECUTION STATE
input: means (np.ndarray) = (n_conditions, n_features) float32 - pre-computed train-set per-cond mean
input: stds (np.ndarray) = (n_conditions, n_features) float32 - pre-computed train-set per-cond std
super().__init__()

Initialise nn.Module bookkeeping.

self.register_buffer("means", torch.from_numpy(means).float())

register_buffer makes the tensor part of the module's state - included in state_dict, moved by .to(device) - but NOT trained by the optimiser. Perfect for fixed dataset statistics.

EXECUTION STATE
register_buffer(name, tensor) = Compare to register_parameter, which would make this a learnable nn.Parameter.
self.means.shape = torch.Size([6, 21]) for C-MAPSS
self.register_buffer("stds", torch.from_numpy(stds).float())

Same registration for the std tensor.

EXECUTION STATE
self.stds.shape = torch.Size([6, 21])
def forward(self, x, cond_seq):

Stateless transformation. Takes a batch of sensor windows AND the matching per-cycle condition sequences; returns the normalised windows.

EXECUTION STATE
input: x (B, T, F) = Sensor windows: B = batch, T = window length (30), F = feature count (21)
input: cond_seq (B, T) = Per-cycle condition labels - integers in [0, n_conditions)
returns (B, T, F) = Normalised windows
mu = self.means[cond_seq]

PyTorch advanced indexing: self.means has shape (n_cond, F); cond_seq has shape (B, T). The result has shape (B, T, F). For every (b, t) position, we pull the F-vector of means corresponding to cond_seq[b, t].

EXECUTION STATE
advanced indexing = tensor[index_tensor] performs gather - shape of result = shape of index_tensor + remaining dimensions of tensor.
Example = self.means[5] = (F,) vector of means for condition 5; self.means[[5, 5, 3]] = (3, F)
mu.shape = torch.Size([B, T, F])
sigma = self.stds[cond_seq]

Same gather, on the stds tensor.

EXECUTION STATE
sigma.shape = torch.Size([B, T, F])
return (x - mu) / (sigma + 1e-8)

Element-wise normalisation. The 1e-8 is an epsilon to avoid division by zero in the (very rare) case where some condition has zero variance for some sensor.

EXECUTION STATE
+1e-8 epsilon = Safety against degenerate sensors (constant within a condition). Standard in BatchNorm and LayerNorm too.
Result = Per-cycle Z-score, condition-aware
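torch.manual_seed(0)

Seeds PyTorch's RNG, which makes X and cond_seq reproducible. The NumPy draws for means and stds use a separate generator and need their own seed (np.random.seed) to be reproducible.

B, T, F = 4, 30, 21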

Batch=4, window=30, features=21.

EXECUTION STATE
B = 4
T = 30
F = 21
n_conds = 6

Six conditions, matching FD002 / FD004.

means = np.random.uniform(500, 1500, (n_conds, F)).astype(np.float32)

Fake training-set per-cond means. In a real pipeline these would be computed from the actual train data.

EXECUTION STATE
means.shape = (6, 21)
stds = np.random.uniform(1, 10, (n_conds, F)).astype(np.float32)

Fake per-cond stds.

EXECUTION STATE
stds.shape = (6, 21)
X = torch.randn(B, T, F) * 5 + 1000

A fake input batch around 1000 with std 5. The numbers do not matter; we just want to verify the shape contract.

EXECUTION STATE
X.shape = torch.Size([4, 30, 21])
cond_seq = torch.randint(0, n_conds, (B, T))

Per-cycle condition labels - random integers in [0, 6) of shape (B, T). In real code these come from CMAPSSDatasetByCondition.

EXECUTION STATE
torch.randint(low, high, size) = Uniform integer sample. high is exclusive; size is the shape tuple.
cond_seq.shape = torch.Size([4, 30])
cond_seq.dtype = torch.int64
norm = PerConditionNormaliser(means, stds)

Construct the normaliser with the pre-computed statistics.

X_norm = norm(X, cond_seq)

Forward call. PyTorch's __call__ routes to forward().

EXECUTION STATE
X_norm.shape = torch.Size([4, 30, 21]) - same shape as X
print("X mean / std :", X.mean().item(), X.std().item())

Pre-normalisation: the input lives near 1000 with std around 5.

EXECUTION STATE
Output = X mean / std : 1000.04 5.20
print("X_norm mean / std :", X_norm.mean().item(), X_norm.std().item())

Post-normalisation. Neither statistic lands near its target: the fake means (uniform on 500 to 1500) sit up to 500 units away from X ≈ 1000, and the fake stds (1 to 10) do not match X's std of 5, so X_norm is spread far more widely than a unit-variance signal. In a real pipeline both statistics are computed from the same training data as X, so the result would have mean ≈ 0 and std ≈ 1.

EXECUTION STATE
Output = X_norm mean / std : values far from (0, 1); the exact numbers depend on the NumPy RNG state, which torch.manual_seed does not control
→ real-data behaviour = When means/stds come from the same train set as X, X_norm.mean ≈ 0, X_norm.std ≈ 1.
import numpy as np
import torch
import torch.nn as nn

class PerConditionNormaliser(nn.Module):
    """Subtract the per-condition mean and divide by the per-condition std.

    Buffers (NOT parameters - we don't learn them):
        means : (n_conditions, n_features)
        stds  : (n_conditions, n_features)

    Forward expects:
        x        : (B, T, n_features)        - sensor windows
        cond_seq : (B, T)  int64             - per-cycle condition labels

    Returns:
        x_norm   : (B, T, n_features)
    """

    def __init__(self, means: np.ndarray, stds: np.ndarray):
        super().__init__()
        # Register as buffers so .to(device) and .state_dict() work correctly.
        self.register_buffer("means", torch.from_numpy(means).float())
        self.register_buffer("stds",  torch.from_numpy(stds).float())

    def forward(self, x: torch.Tensor, cond_seq: torch.Tensor) -> torch.Tensor:
        # Lookup the per-cycle mean / std using cond_seq as an index tensor.
        mu    = self.means[cond_seq]                  # (B, T, n_features)
        sigma = self.stds [cond_seq]                  # (B, T, n_features)
        return (x - mu) / (sigma + 1e-8)


# ----- Use it -----
torch.manual_seed(0)
np.random.seed(0)  # torch.manual_seed does not seed NumPy's RNG
B, T, F = 4, 30, 21
n_conds = 6

# Pretend we already computed these from the training set
means = np.random.uniform(500, 1500, (n_conds, F)).astype(np.float32)
stds  = np.random.uniform(  1,   10, (n_conds, F)).astype(np.float32)

# Some fake (X, cond_seq) batch
X        = torch.randn(B, T, F) * 5 + 1000
cond_seq = torch.randint(0, n_conds, (B, T))

norm = PerConditionNormaliser(means, stds)
X_norm = norm(X, cond_seq)

print("X      mean / std :", X.mean().item(), X.std().item())
print("X_norm mean / std :", X_norm.mean().item(), X_norm.std().item())
# X is close to mean 1000, std 5 by construction.
# X_norm is far from mean 0, std 1 because the fake statistics above were not
# computed from X; statistics fitted on matching data would give roughly (0, 1).
Why register_buffer and not register_parameter? Buffers are part of the module's state but the optimiser does not see them. Means and standard deviations are data statistics, not learnable weights — updating them with backprop would be incorrect. Buffers are the standard PyTorch idiom for this case (BatchNorm running stats use it too).
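The gather in `forward` is the step most likely to hide a shape bug, so it is worth checking against an explicit loop. The sketch below does this in plain NumPy, on the assumption that NumPy's integer-array indexing follows the same shape rule as the PyTorch index-tensor lookup used above (result shape = index shape plus the remaining dimensions). All values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
B, T, F, n_conds = 4, 30, 21, 6

means = rng.uniform(500.0, 1500.0, (n_conds, F))   # stand-in statistics
cond_seq = rng.integers(0, n_conds, (B, T))        # per-cycle condition labels

# Integer-array indexing: one F-vector of means per (b, t) position.
mu = means[cond_seq]

# Reference version with explicit loops over batch and time.
mu_loop = np.empty((B, T, F))
for b in range(B):
    for t in range(T):
        mu_loop[b, t] = means[cond_seq[b, t]]

print(mu.shape)                     # (4, 30, 21)
print(np.array_equal(mu, mu_loop))  # True
```

The same comparison can serve as a unit test for the module: the vectorised lookup and the loop have to agree element for element.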

The Same Pattern in Other Domains

“Most of the variance is the regime, not the signal” is a central problem across statistics and machine learning. Each row below has its own canonical fix, and all the fixes structurally resemble per-condition normalisation.

| Domain | Regime | Fix | Effect |
| --- | --- | --- | --- |
| Multi-condition prognostics (this book) | Operating condition | Per-condition Z-score | Removes 99% regime variance |
| Speaker recognition | Speaker identity | Mean / std normalisation per speaker | Speech content stays, vocal tract gone |
| Multi-site neuroimaging | Scanner vendor / site | ComBat harmonisation | Removes site bias before downstream analysis |
| Recommender systems | User | Per-user mean centring | Item bias separated from user bias |
| Genomics RNA-seq | Batch | RUVseq / SVA | Removes batch effects |
| Federated learning | Client | Local BatchNorm or FedBN | Per-client statistics, shared parameters |
| EEG seizure detection | Patient | Patient-specific baseline | Idiosyncratic signal preserved |

The idea is universal: when the variance you don't care about is structured (clusters in some discrete variable), centre and scale within each cluster.

The Three Pitfalls

Pitfall 1: One mean / std globally. Forces the model to spend most of its capacity learning to recognise regimes. Train RMSE looks decent; test RMSE on FD002 degrades because the regime distribution shifts.
Pitfall 2: Using test-set statistics. Computing per-cond mean / std on the union of train+test data is information leakage. Real pipelines fit the statistics on training data only and apply them at test time.
Pitfall 3: Trusting the model to discover regimes. Some papers feed raw values plus a one-hot regime label and hope the network learns to gate. It can — but it costs capacity that could have gone toward degradation modelling. Explicit per-condition normalisation is essentially free; let the model focus on the residual signal.
The whole story of Chapter 2 in one sentence. C-MAPSS ships with regime structure baked in; ignoring it costs two orders of magnitude in attention; per-condition normalisation removes it for free.
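Pitfall 2 comes down to keeping "fit" and "apply" as separate steps. The sketch below shows one way to do that in NumPy. The function names `fit_condition_stats` and `apply_condition_stats` are hypothetical, the data are synthetic, and the sketch assumes every condition appears at least once in the training split.

```python
import numpy as np

def fit_condition_stats(X, conds, n_conds):
    """Per-condition mean and std, each of shape (n_conds, n_features)."""
    means = np.stack([X[conds == k].mean(axis=0) for k in range(n_conds)])
    stds = np.stack([X[conds == k].std(axis=0) for k in range(n_conds)])
    return means, stds

def apply_condition_stats(X, conds, means, stds, eps=1e-8):
    """Z-score each row with the statistics of its own condition."""
    return (X - means[conds]) / (stds[conds] + eps)

rng = np.random.default_rng(3)
n_conds, n_features = 6, 3
centres = rng.uniform(500.0, 1500.0, (n_conds, n_features))  # made-up regime means

def make_split(n):
    """Draw n synthetic rows with random condition labels."""
    conds = rng.integers(0, n_conds, n)
    X = centres[conds] + rng.normal(0.0, 4.0, (n, n_features))
    return X, conds

X_train, c_train = make_split(3000)
X_test, c_test = make_split(1000)

# Fit on the training split only, then reuse the same statistics everywhere.
means, stds = fit_condition_stats(X_train, c_train, n_conds)
Z_train = apply_condition_stats(X_train, c_train, means, stds)
Z_test = apply_condition_stats(X_test, c_test, means, stds)

print(means.shape, stds.shape)               # (6, 3) (6, 3)
print(np.abs(Z_test.mean(axis=0)).max())     # small, but not exactly zero
```

The test-set means come out close to zero rather than exactly zero, which is expected: the statistics were never fitted on those rows. The `(n_conds, n_features)` arrays returned by the fit step have the shape that `PerConditionNormaliser` expects for its `means` and `stds` arguments.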

Takeaway

  • The law of total variance partitions sensor variance into between- and within-condition. The first is regime; the second is signal.
  • On C-MAPSS FD002, regime is ~99% of the raw variance. The degradation signal we actually want to predict is the remaining ~1%.
  • Global Z-score does NOT fix this. It rescales, but the partition is preserved. Only per-condition Z-score eliminates regime variance.
  • Implementation is one nn.Module with two buffers. Means and stds per condition; gather with advanced indexing; subtract and divide.
  • This is the bridge to the rest of the book. Chapter 6 formalises the per-condition normaliser into the data pipeline; every model in Parts V-VII assumes its inputs are already condition-normalised.