Chapter 6

Why Global Z-Score Fails

Per-Condition Normalization

The Broadcast-Microphone Analogy

A radio engineer normalises the audio level so loud and quiet speakers sound “equally loud” to the listener. If two speakers occupy the same broadcast - one whispering close to the mic, one shouting from across the room - applying a single gain knob to the whole show does not equalise them; it just scales the weird mix. To actually equalise, you need to apply a different gain per speaker. That is the heart of per-condition normalisation: one gain knob per regime, not one for the whole signal.

Section 2.4 demonstrated this empirically with a multi-condition sensor histogram. This section formalises the math, walks through the Python and PyTorch implementations, and explains why the bug survives code review on so many real production pipelines.

The bug. Compute one mean and one std across the entire training set; subtract; divide; ship. The model trains on a signal that is 99% regime noise.

Why Global Z-Score Cannot Erase Regime Variance

Recall the law of total variance from §2.4: the total variance of a pooled sample decomposes into between- and within-cluster components,

$$
\mathrm{Var}(X) \;=\; \underbrace{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr]}_{\text{between conditions}} \;+\; \underbrace{\mathbb{E}\bigl[\mathrm{Var}(X \mid c)\bigr]}_{\text{within conditions}}.
$$

Global Z-score is a linear transformation: $Z = (X - \mu)/\sigma$. Linear transformations rescale variance, but they do NOT change how it is partitioned between clusters and within clusters: subtracting $\mu$ leaves both components untouched, and dividing by $\sigma$ divides both by the same $\sigma^2$. After Z-score the total variance is 1.0 and the between/within ratio is unchanged from the raw data. If the raw data was 99% regime, the Z-scored data is also 99% regime - just on a different scale.

The mathematical fact. Linear (mean / std) scaling cannot erase a between-cluster structure. The only way to actually remove regime variance is to apply a different mean / std per regime - which is precisely per-condition normalisation (Sections 6.2 and 6.3).
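
To see why the ratio cannot move, one can push the transformation through both terms of the decomposition. The derivation below is a sketch using the same symbols as above, where $\mu$ and $\sigma$ are the global mean and standard deviation and are treated as constants:

```latex
\begin{aligned}
\mathbb{E}[Z \mid c] &= \frac{\mathbb{E}[X \mid c] - \mu}{\sigma}
  &&\Rightarrow\quad
  \mathrm{Var}\bigl[\mathbb{E}[Z \mid c]\bigr]
  = \frac{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr]}{\sigma^{2}}, \\
\mathrm{Var}(Z \mid c) &= \frac{\mathrm{Var}(X \mid c)}{\sigma^{2}}
  &&\Rightarrow\quad
  \mathbb{E}\bigl[\mathrm{Var}(Z \mid c)\bigr]
  = \frac{\mathbb{E}\bigl[\mathrm{Var}(X \mid c)\bigr]}{\sigma^{2}}, \\
\frac{\mathrm{Var}\bigl[\mathbb{E}[Z \mid c]\bigr]}{\mathrm{Var}(Z)}
  &= \frac{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr] / \sigma^{2}}{\mathrm{Var}(X) / \sigma^{2}}
  = \frac{\mathrm{Var}\bigl[\mathbb{E}[X \mid c]\bigr]}{\mathrm{Var}(X)}.
\end{aligned}
```

The common factor $1/\sigma^{2}$ cancels, so the between/total share of $Z$ equals that of $X$ for any choice of $\mu$ and any nonzero $\sigma$.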

Interactive: Mixing Hides the Signal

The visualization from §2.4, reproduced here so the math has somewhere to land. Toggle between “raw”, “global Z”, and “per-condition Z” and watch the variance partition update.

[Interactive visualization: variance partition under raw, global Z, and per-condition Z]

Python: Quantify the Damage on Real Data

A few dozen lines of NumPy. Compute the variance partition before and after global Z-score. The between/total ratio stays at 98.9% no matter what scale you apply.

Z-score does not remove regime variance
🐍global_zscore_fails.py
import numpy as np

Standard alias.

np.random.seed(0)

Seeds the generator for reproducibility.

N_PER_COND = 600

Samples per condition. 600 x 6 = 3,600 total, matching the calibration used in §2.4.

COND_MEANS = np.array([1390, 1430, 1480, 1450, 1490, 1495])

Real C-MAPSS T48 (LPT outlet temp) means at each of the 6 operating conditions. Roughly 100 °R separates the extremes.

WITHIN_STD = 4.0

Within-condition standard deviation - the degradation signal we actually want.

data, conds = [], []

Two parallel lists: samples and their condition labels.

for c in range(6):

Generate 600 samples per condition.

data.append(np.random.normal(COND_MEANS[c], WITHIN_STD, N_PER_COND))

Gaussian sample around this condition's mean.

conds.extend([c] * N_PER_COND)

Tag every sample with its condition.

data, conds = np.concatenate(data), np.array(conds)

Flatten into ndarrays.

Execution state: data.shape = (3600,)
total_var = float(data.var())

Variance of the entire pooled sample.

Output: total_var ≈ 1405

within_var = float(np.mean([data[conds == c].var() for c in range(6)]))

Average within-condition variance. Boolean masking pulls the per-condition slices. The unweighted mean is valid here because every condition has the same number of samples.

Output: within_var ≈ 15.8 (close to WITHIN_STD^2 = 16; the gap is sampling noise)

between_var = total_var - within_var

Law of total variance: total = between + within.

Output: between_var ≈ 1389.6 - the regime energy

print(f"total var : {total_var:8.2f}")

Total.

Output: total var : 1405.36

print(f"within-condition : {within_var:8.2f} (...)")

Within. 1.1% of total - the degradation signal we want to model.

Output: within-condition : 15.77 (1.1% — degradation signal)

print(f"between-condition : {between_var:8.2f} (...)")

Between. 98.9% - the regime noise that hides the signal.

Output: between-condition : 1389.59 (98.9% — regime noise)
data_z = (data - data.mean()) / data.std()

Global Z-score: subtract the global mean, divide by the global std. This is what scikit-learn's StandardScaler does to each feature when it is fitted on the pooled training set.

new_total = float(data_z.var())

After Z-score, total variance is 1.0 by construction.

Output: new_total ≈ 1.0

new_within = float(np.mean([data_z[conds == c].var() for c in range(6)]))

Within-condition variance after Z-score - still small relative to between.

new_between = new_total - new_within

Between-condition variance after Z-score.

print(f"\nafter global Z-score:")

Header.

print(f" total var : {new_total:.4f}")

Total = 1.0 by construction.

Output: total var : 1.0000

print(f" between/total : {new_between/new_total:.3f}")

98.9% - UNCHANGED from before normalisation. Z-score rescaled the units; it did not erase the regime structure.

Output: between/total : 0.989

This is the bug: global Z-score makes the histogram fit on the screen but does NOT remove the 99% regime variance.
```python
import numpy as np

# Synthetic FD002-style sensor that lives at 6 different baselines per regime
np.random.seed(0)
N_PER_COND = 600
COND_MEANS = np.array([1390, 1430, 1480, 1450, 1490, 1495])
WITHIN_STD = 4.0

data, conds = [], []
for c in range(6):
    data.append(np.random.normal(COND_MEANS[c], WITHIN_STD, N_PER_COND))
    conds.extend([c] * N_PER_COND)
data, conds = np.concatenate(data), np.array(conds)


# ----- Variance partition (law of total variance) -----
total_var   = float(data.var())
within_var  = float(np.mean([data[conds == c].var() for c in range(6)]))
between_var = total_var - within_var

print(f"total var          : {total_var:8.2f}")
print(f"within-condition   : {within_var:8.2f}  ({100*within_var/total_var:.1f}% — degradation signal)")
print(f"between-condition  : {between_var:8.2f}  ({100*between_var/total_var:.1f}% — regime noise)")


# ----- Apply global Z-score -----
data_z = (data - data.mean()) / data.std()

new_total   = float(data_z.var())
new_within  = float(np.mean([data_z[conds == c].var() for c in range(6)]))
new_between = new_total - new_within

print(f"\nafter global Z-score:")
print(f"  total var        : {new_total:.4f}")
print(f"  between/total    : {new_between/new_total:.3f}")
# Scale changed; partition did NOT.
# total var          :  1405.36
# within-condition   :    15.77  (1.1% — degradation signal)
# between-condition  :  1389.59  (98.9% — regime noise)
# after global Z-score:
#   total var        : 1.0000
#   between/total    : 0.989  ← still 98.9%
```
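
The claim that the ratio holds "no matter what scale you apply" can be checked for arbitrary affine maps, not only the Z-score. The sketch below is an illustration under stated assumptions: the helper name `between_ratio`, the particular `(a, b)` pairs, and the use of `np.random.default_rng` are choices made here, not part of the original script. The helper weights each condition by its share of samples, which reduces to the plain mean used above when all conditions are the same size.

```python
import numpy as np

def between_ratio(x: np.ndarray, conds: np.ndarray) -> float:
    """Share of total variance explained by condition means (population variance)."""
    labels = np.unique(conds)
    # Weight each condition by its fraction of the samples so the
    # law of total variance also holds for unequal group sizes.
    within = sum((conds == c).mean() * x[conds == c].var() for c in labels)
    total = x.var()
    return float((total - within) / total)

# Same synthetic setup as above: 6 conditions, 600 samples each, within-std 4.0.
rng = np.random.default_rng(0)
means = np.array([1390.0, 1430.0, 1480.0, 1450.0, 1490.0, 1495.0])
conds = np.repeat(np.arange(6), 600)
x = rng.normal(means[conds], 4.0)

raw = between_ratio(x, conds)

# Affine maps a*x + b; the second pair is exactly the global Z-score.
transforms = [
    (1.0, 0.0),
    (1 / x.std(), -x.mean() / x.std()),
    (1e-3, 7.0),
    (-250.0, 1e6),
]
for a, b in transforms:
    ratio = between_ratio(a * x + b, conds)
    print(f"a={a:>12.5g}  b={b:>12.5g}  between/total={ratio:.6f}")
    assert np.isclose(ratio, raw)
```

Every row prints the same ratio as the raw data, including the negative scale factor, because the variance ratio only sees $a^2$ in both numerator and denominator.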

PyTorch: A Global vs Per-Condition Comparison

Tensor implementation of both schemes; same data, very different partition
🐍zscore_compare_torch.py
import numpy as np

For data generation.

import torch

Tensor type.

np.random.seed(0); torch.manual_seed(0)

Determinism on both stacks.

N_PER_COND = 600

Same 600-per-condition setup.

COND_MEANS = ...

Same regime means.

data = np.concatenate([np.random.normal(...) for c in range(6)]).astype(np.float32)

Build the synthetic data. Cast to float32 for tensor consumption.

conds = np.repeat(np.arange(6), N_PER_COND).astype(np.int64)

Per-sample condition labels via np.repeat. int64 is PyTorch's default integer dtype; the boolean mask itself comes later from the comparison cond_t == c.

data_t = torch.from_numpy(data).unsqueeze(-1)

Bridge to torch. unsqueeze(-1) adds a feature axis: (3600,) -> (3600, 1). Real data would be (B, T, F); here F=1 for the clean demo.

Execution state: data_t.shape = torch.Size([3600, 1])

cond_t = torch.from_numpy(conds)

Condition labels as a tensor.

mean_g = data_t.mean()

Single scalar - the global mean.

std_g = data_t.std()

Single scalar - the global std. Both ignore the regime structure.

norm_global = (data_t - mean_g) / std_g

Apply globally. PyTorch broadcasts the scalars across all 3600 elements.

Execution state: norm_global.shape = torch.Size([3600, 1])

norm_perc = torch.empty_like(data_t)

Pre-allocate the per-condition output. empty_like matches dtype and shape.

for c in range(6):

Iterate conditions.

mask = (cond_t == c)

Boolean tensor selecting just this condition's rows.

chunk = data_t[mask]

PyTorch boolean indexing - returns a copy of selected rows.

Execution state: chunk.shape = torch.Size([600, 1])

norm_perc[mask] = (chunk - chunk.mean()) / chunk.std()

Z-score within this condition only. Write back into the pre-allocated output via the same boolean mask.

def between_ratio(x: torch.Tensor) -> float:

Helper that recomputes the between/total partition for any tensor.

total = x.var(unbiased=False).item()

Total variance, scalar. unbiased=False requests the population variance. With torch's default Bessel correction the total and within terms use different denominators, so total = between + within no longer holds exactly and the per-condition ratio comes out slightly negative.

within = float(np.mean([x[cond_t == c].var(unbiased=False).item() for c in range(6)]))

Average within-condition variance, using the same estimator.

return max(total - within, 0.0) / total

Ratio in [0, 1]. The clamp absorbs float32 round-off when the between term is zero.

print(f"raw : {between_ratio(data_t):.3f}")

Pre-normalisation: 98.9% regime variance.

Output: raw : 0.989

print(f"after global Z-score : {between_ratio(norm_global):.3f}")

Global Z-score: still 98.9% regime variance. Z-score did not remove the structure.

Output: after global Z-score : 0.989

print(f"after per-cond Z-score: {between_ratio(norm_perc):.3f}")

Per-condition Z-score: 0.000. Regime variance eliminated. ALL remaining variance is degradation signal.

Output: after per-cond Z-score: 0.000

Punchline: same data, same Z-score formula. Global Z preserves regime; per-cond Z erases it.

```python
import numpy as np
import torch

# ----- Compare global vs per-condition normalisation in PyTorch -----
np.random.seed(0); torch.manual_seed(0)

# Real batches would be (B, T, F); here a flat (N, 1) tensor with F=1 for clarity.
N_PER_COND = 600
COND_MEANS = np.array([1390, 1430, 1480, 1450, 1490, 1495])
data = np.concatenate([
    np.random.normal(COND_MEANS[c], 4.0, N_PER_COND) for c in range(6)
]).astype(np.float32)
conds = np.repeat(np.arange(6), N_PER_COND).astype(np.int64)

data_t = torch.from_numpy(data).unsqueeze(-1)        # (3600, 1), F=1
cond_t = torch.from_numpy(conds)


# ----- Strategy 1: global -----
mean_g = data_t.mean()
std_g  = data_t.std()
norm_global = (data_t - mean_g) / std_g

# ----- Strategy 2: per-condition -----
norm_perc = torch.empty_like(data_t)
for c in range(6):
    mask = (cond_t == c)
    chunk = data_t[mask]
    norm_perc[mask] = (chunk - chunk.mean()) / chunk.std()


# ----- Diagnostic: between/total ratio after each scheme -----
def between_ratio(x: torch.Tensor) -> float:
    # Population variance (unbiased=False): torch's default Bessel correction
    # would break the exact identity total = between + within.
    total  = x.var(unbiased=False).item()
    within = float(np.mean([x[cond_t == c].var(unbiased=False).item() for c in range(6)]))
    return max(total - within, 0.0) / total           # clamp float32 round-off

print(f"raw                   : {between_ratio(data_t):.3f}")
print(f"after global Z-score  : {between_ratio(norm_global):.3f}")
print(f"after per-cond Z-score: {between_ratio(norm_perc):.3f}")
# raw                   : 0.989
# after global Z-score  : 0.989
# after per-cond Z-score: 0.000
```

The diagnostic. Compute between/total on the normalised data. If it is >0.5, you have a regime-leakage bug. If it is <0.05, your normaliser is doing its job.
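
The two thresholds can be wrapped into a small check that runs on any normalised feature. This is a minimal sketch, not the book's implementation: the function name `regime_leakage` and the "inconclusive" label for the band between 0.05 and 0.5 are assumptions added here, since the text only names the two outer cases.

```python
import numpy as np

def regime_leakage(x: np.ndarray, conds: np.ndarray) -> tuple[float, str]:
    """Return (between/total ratio, verdict) for one normalised feature."""
    total = x.var()  # population variance (ddof=0)
    # Size-weighted average of the within-condition variances.
    within = sum((conds == c).mean() * x[conds == c].var() for c in np.unique(conds))
    ratio = float(max(total - within, 0.0) / total)  # clamp tiny negative round-off
    if ratio > 0.5:
        verdict = "regime leakage"
    elif ratio < 0.05:
        verdict = "ok"
    else:
        verdict = "inconclusive"  # assumed label; the text does not name this band
    return ratio, verdict

# Synthetic data in the same shape as the examples above.
rng = np.random.default_rng(0)
conds = np.repeat(np.arange(6), 600)
x = rng.normal(np.array([1390, 1430, 1480, 1450, 1490, 1495])[conds], 4.0)

global_z = (x - x.mean()) / x.std()

per_cond_z = np.empty_like(x)
for c in np.unique(conds):
    m = conds == c
    per_cond_z[m] = (x[m] - x[m].mean()) / x[m].std()

print("global Z   :", regime_leakage(global_z, conds))
print("per-cond Z :", regime_leakage(per_cond_z, conds))
```

A check like this is cheap enough to run as an assertion in a preprocessing test, which catches the bug before any model is trained.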

The Same Failure Mode Elsewhere

| Domain | Regime | What global Z-score loses |
| --- | --- | --- |
| RUL (this book) | Operating condition | Degradation signal under regime variance |
| Speech | Speaker | Phonemes under speaker pitch/timbre |
| Multi-site MRI | Scanner vendor | Pathology under site bias |
| A/B testing | User segment | Treatment effect under cohort means |
| Federated learning | Client | Model updates under client drift |
| Genomics | Sequencing batch | Differential expression under batch effect |

Three Reasons This Bug Survives Code Review

Pitfall 1: The histogram looks fine. Global Z-score makes the data fit on screen with mean 0, std 1. The regime structure is still there but flattened to within ~3 standard units. Only the variance-partition diagnostic catches it.

Pitfall 2: The model trains. Loss decreases. RMSE on the training set looks fine because the model has learned the regime mapping plus a noise term. Test performance on FD002 collapses (RMSE climbs); only then does anyone investigate.

Pitfall 3: Default sklearn is wrong here. StandardScaler().fit_transform(X) on the pooled data is the wrong tool. You need either a custom transformer or a separate StandardScaler fitted on each condition's rows. sklearn's ColumnTransformer does not do this on its own, because it splits the data by column, not by row.

The lesson. Global mean / std is mathematically unable to remove between-cluster variance. The fix is one nn.Module away (Section 6.4) and accounts for ~1 percentage point of NASA score on FD002 - small per-section but compounded across all of Parts V-VII.
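
For Pitfall 3, the custom-transformer route might look like the NumPy sketch below. `PerConditionScaler` is a hypothetical helper written for illustration; it is not an sklearn class and not the nn.Module from Section 6.4. It assumes a condition label is available for every row, and that every condition seen at transform time was also present in the training data.

```python
import numpy as np

class PerConditionScaler:
    """Hypothetical helper: one (mean, std) pair per condition, fitted on training rows only."""

    def fit(self, X: np.ndarray, conds: np.ndarray) -> "PerConditionScaler":
        self.stats_ = {}
        for c in np.unique(conds):
            rows = X[conds == c]
            std = rows.std(axis=0)
            std = np.where(std > 0, std, 1.0)  # guard against constant features
            self.stats_[int(c)] = (rows.mean(axis=0), std)
        return self

    def transform(self, X: np.ndarray, conds: np.ndarray) -> np.ndarray:
        out = np.empty_like(X, dtype=float)
        for c in np.unique(conds):
            mean, std = self.stats_[int(c)]  # KeyError if the condition was never fitted
            mask = conds == c
            out[mask] = (X[mask] - mean) / std
        return out

# Synthetic train/test split with the same six condition baselines.
rng = np.random.default_rng(0)
means = np.array([1390.0, 1430.0, 1480.0, 1450.0, 1490.0, 1495.0])
train_c = np.repeat(np.arange(6), 600)
test_c = np.repeat(np.arange(6), 100)
X_train = rng.normal(means[train_c], 4.0)[:, None]  # (3600, 1)
X_test = rng.normal(means[test_c], 4.0)[:, None]    # (600, 1)

scaler = PerConditionScaler().fit(X_train, train_c)
Z_train = scaler.transform(X_train, train_c)
Z_test = scaler.transform(X_test, test_c)  # reuses the training statistics
print(Z_train.shape, Z_test.shape)
```

Keeping `fit` and `transform` separate matters here: the test rows are scaled with the training statistics for their condition, so no test information leaks into the normaliser.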

Takeaway

  • Linear scaling cannot erase between-cluster variance. Global Z-score is linear; it preserves the between/total ratio.
  • The diagnostic is between/total after normalisation. Should be near 0 if the normaliser is correct; near 1 means the regime is still present.
  • The bug is silent. Train loss looks fine; only test loss reveals it. Diagnose with the partition ratio, not the training loss curve.
  • The fix is per-condition Z-score. Sections 6.2-6.4 implement it.