Chapter 1

Why Safety Matters: NASA Score vs. RMSE

Predictive Maintenance & RUL

Two Doctors, Same RMSE, Different Outcomes

Imagine two diagnosticians estimating how many days a patient has left before a cardiac event. Both are wrong by the same average margin — say, six days. Doctor A's mistakes lean early: she usually says “next week” when the real event is two weeks out. Doctor B's mistakes lean late: he typically says “two weeks” when the patient codes in seven days. On any symmetric error metric the two doctors look identical. In the ICU they are not remotely the same person.

That story is the entire reason this section exists. The dominant accuracy metric in regression is RMSE, and RMSE cannot tell A from B. The maintenance community noticed that almost twenty years ago — the result is the NASA scoring function, an exponential, asymmetric cost that bakes into the metric the asymmetry that exists in reality.

One sentence to remember. RMSE measures how wrong you are. NASA score measures how badly that wrongness hurts.

RMSE: The Comfortable but Symmetric Metric

RMSE is what every regression paper reports because it has every property a metric should have: same units as the target, differentiable everywhere, cheap to compute, and well-understood. Given predictions $\hat{y}_i$ and ground-truth labels $y_i$ over a test set of size $N$,

$$\text{RMSE} \;=\; \sqrt{\,\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2\,}.$$

The single property that ruins RMSE for safety-critical RUL is the square. Squaring $(\hat{y}_i - y_i)$ erases the sign of the error: an error of $+10$ cycles and an error of $-10$ cycles contribute identically.

RMSE is not wrong. It is just answering a different question than “will this engine kill someone if my model is late?”

The NASA Score: Asymmetry Made Quantitative

Saxena, Goebel and colleagues introduced the asymmetric scoring function alongside the original C-MAPSS dataset in 2008. For a single test sample, with $d_i = \hat{y}_i - y_i$,

$$s_i \;=\; \begin{cases} e^{-d_i / 13} - 1 & d_i < 0 \quad\text{(early prediction)}\\[4pt] e^{d_i / 10} - 1 & d_i \ge 0 \quad\text{(late prediction)} \end{cases}$$

The total score over the test set is the sum $S = \sum_{i=1}^{N} s_i$. Lower is safer, zero is perfect, and there is no upper bound: one really late prediction can dominate the entire sum.
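To make the "one late prediction can dominate" point concrete, here is a minimal sketch in plain Python. The error values are invented for illustration (99 predictions that are each 5 cycles early, plus a single prediction that is 60 cycles late); only the scoring formula is taken from the text.

```python
import math


def nasa_score(d: float) -> float:
    """Per-sample score for a signed error d = predicted - actual (in cycles)."""
    if d < 0:
        return math.exp(-d / 13.0) - 1.0   # early branch
    return math.exp(d / 10.0) - 1.0        # late branch


# Hypothetical test set: 99 mildly early errors and one very late error.
errors = [-5.0] * 99 + [60.0]

scores = [nasa_score(d) for d in errors]
total = sum(scores)
late_share = scores[-1] / total   # fraction of S owed to the single late sample

print(f"total S              = {total:.1f}")
print(f"score of the +60 one = {scores[-1]:.1f}")
print(f"share of the total   = {late_share:.0%}")
```

Under these made-up errors, the single late sample accounts for roughly 90% of the total score, even though it is only 1% of the test set.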

Two design decisions encode the safety priorities. First, the denominators differ: 13 on the early side, 10 on the late side. Every 13 cycles of earliness multiplies the exponential term by $e \approx 2.72$; every 10 cycles of lateness does the same, so the late exponent grows 30% faster per cycle. Second, the function is exponential, not linear or quadratic: doubling the lateness more than doubles the penalty. A single 30-cycle late prediction (score 19.09) costs about as much as 180 predictions that are each 1 cycle late (score $e^{0.1} - 1 \approx 0.105$ apiece).

| Error d (cycles) | NASA score s(d) | Interpretation |
|---|---|---|
| -30 | 9.05 | Very early - 30 cycles of wasted life, modest penalty |
| -10 | 1.16 | Early - small penalty |
| -5 | 0.47 | Slightly early - almost free |
| 0 | 0.00 | Perfect |
| +5 | 0.65 | Slightly late - already comparable to 5 cycles early |
| +10 | 1.72 | Late - 50% worse than the same-magnitude early error |
| +30 | 19.09 | Very late - over 2x the cost of the same-magnitude early error |

The math, decoded

The piecewise formula in the original paper is given in shorthand, with no sign condition written next to it. Make it explicit with $d = \hat{\text{RUL}} - \text{RUL}$:

  • $d < 0$, early prediction: the model says fewer cycles remain than the truth. Conservative. Maintenance is scheduled too soon and life is wasted.
  • $d > 0$, late prediction: the model says more cycles remain than the truth. Dangerous. The engine may be near failure while the model claims plenty of life left.

Worked example: same |d|, two penalties

Take a single engine with true $\text{RUL} = 20$ cycles and consider two predictions that are wrong by the same 20 cycles in opposite directions.

| Prediction | d = predicted − actual | Branch | s(d) = | Value |
|---|---|---|---|---|
| predicted = 0 | 0 − 20 = −20 | early | $e^{20/13} - 1$ | ≈ 3.66 |
| predicted = 40 | 40 − 20 = +20 | late | $e^{20/10} - 1$ | ≈ 6.39 |

Same absolute error of 20 cycles, but the late penalty is ~75% larger than the early one. RMSE would call these equally bad: both contribute $20^2 = 400$ to the squared error. The NASA score refuses to.

Engineering picture. A model saying “this engine still has 40 cycles left” when it actually has only 20 is dangerous — maintenance arrives too late. A model saying “0 cycles left” when the truth is 20 is conservative and costly, but safer. The asymmetric score encodes that asymmetry of consequences.

Why 10 and 13 — the role of the denominators

Rewrite both branches in terms of $|d|$:

$$s_{\text{early}}(d) \;=\; e^{\,|d|/13} - 1, \qquad s_{\text{late}}(d) \;=\; e^{\,|d|/10} - 1.$$

For any non-zero error, the late branch divides by the smaller number, which produces a larger exponent, and exponentials magnify any advantage in the exponent very quickly:

$$\frac{|d|}{10} \;>\; \frac{|d|}{13} \;\;\Longrightarrow\;\; e^{|d|/10} \;>\; e^{|d|/13}.$$

With $|d| = 20$: $20/10 = 2$ vs $20/13 \approx 1.54$, so $e^{2} - 1 \approx 6.39$ vs $e^{1.54} - 1 \approx 3.66$. The denominators are tuning knobs: 13 gives softer growth on the safe side, 10 gives sharper growth on the dangerous side. The paper does not derive these constants from first principles (they are the convention the C-MAPSS leaderboard has used since 2008), but their role is precise: the exponent grows 30% faster per cycle of lateness than per cycle of earliness, so the penalty gap between late and early widens as $|d|$ increases.
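The "tuning knob" reading can be checked numerically. The sketch below exposes the two denominators as parameters and reports how many times more a late error costs than an early error of the same size. The function names and the alternative scale pairs (13, 13) and (13, 8) are illustrative assumptions, not values from the C-MAPSS paper.

```python
import math


def asymmetric_score(d: float, early_scale: float = 13.0, late_scale: float = 10.0) -> float:
    """NASA-style score with the two denominators exposed as parameters."""
    if d < 0:
        return math.exp(-d / early_scale) - 1.0
    return math.exp(d / late_scale) - 1.0


def late_to_early_ratio(abs_d: float, early_scale: float, late_scale: float) -> float:
    """Cost of a late error divided by the cost of an equally large early error."""
    late = asymmetric_score(+abs_d, early_scale, late_scale)
    early = asymmetric_score(-abs_d, early_scale, late_scale)
    return late / early


# (13, 10) is the convention from the text; the other pairs are hypothetical.
for early_scale, late_scale in [(13.0, 13.0), (13.0, 10.0), (13.0, 8.0)]:
    ratios = [late_to_early_ratio(a, early_scale, late_scale) for a in (5.0, 20.0, 30.0)]
    print(early_scale, late_scale, [round(r, 2) for r in ratios])
```

With equal scales the ratio is exactly 1 (a symmetric metric). With the (13, 10) pair it grows from about 1.4 at $|d| = 5$ to about 2.1 at $|d| = 30$, and shrinking the late scale further steepens the asymmetry.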

RMSE-blindness in two engines. Suppose model A errs by $-20$ on two engines, and model B errs by $+20$ on two engines. Both have $\text{RMSE} = 20$. But $S_A = 2 \cdot 3.66 = 7.32$ while $S_B = 2 \cdot 6.39 = 12.78$, so B is 75% worse on NASA score. Your reported metric decided whether you saw the danger.
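The two-engine comparison is small enough to verify in a few lines. This is a plain-Python sketch; the variable names are illustrative and the errors are the $\pm 20$ cycles from the paragraph above.

```python
import math


def nasa_score(d: float) -> float:
    """Per-sample score for a signed error d = predicted - actual."""
    if d < 0:
        return math.exp(-d / 13.0) - 1.0
    return math.exp(d / 10.0) - 1.0


def rmse(errors: list[float]) -> float:
    """Root-mean-square of a list of signed errors."""
    return math.sqrt(sum(d * d for d in errors) / len(errors))


errors_a = [-20.0, -20.0]   # model A: both engines predicted 20 cycles early
errors_b = [+20.0, +20.0]   # model B: both engines predicted 20 cycles late

rmse_a, rmse_b = rmse(errors_a), rmse(errors_b)
s_a = sum(nasa_score(d) for d in errors_a)
s_b = sum(nasa_score(d) for d in errors_b)

print(f"RMSE: A = {rmse_a:.1f}, B = {rmse_b:.1f}")
print(f"NASA: A = {s_a:.1f}, B = {s_b:.1f}, ratio B/A = {s_b / s_a:.2f}")
```

RMSE comes out identical for both models, while the NASA total for B is about 1.75 times that of A.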

Interactive: Watch RMSE Lie

Below is the asymmetric scoring curve, plus a tiny synthetic model controlled by two sliders — bias and noise std. Slide the bias to +10 and the bottom-right scatter goes red. Slide it to -10 and the same magnitude of error suddenly looks safe. Watch RMSE move slightly while NASA score moves a lot.


The asymmetry is not subtle. With $\sigma = 8$ noise, a +6 mean bias raises the NASA score to about 1.8x its value at 0 mean bias, while RMSE only worsens by 28%. Build a loss function out of RMSE and your gradient never sees that near-doubling.

Python: RMSE and NASA, Side by Side

Two implementations of NASA score and RMSE; three synthetic models with the same precision but different bias. The NumPy run below produces the exact values quoted in tables throughout this book.

RMSE and NASA score on three synthetic models
🐍rmse_vs_nasa.py
`import numpy as np`

Same NumPy alias. We need np.exp for the asymmetric exponentials, np.random for synthetic predictions, and np.sqrt / np.mean for RMSE.

`def nasa_score(predicted, actual) -> float:`

Computes the per-sample NASA score. Lower is safer. Exactly the formula from Saxena & Goebel's 2008 paper that has been the standard prognostic safety metric ever since.

EXECUTION STATE
input: predicted (float) = Model's RUL estimate, in cycles
input: actual (float) = Ground-truth RUL, in cycles
returns float = Non-negative penalty. Zero only at d = 0.
`d = predicted - actual`

Signed prediction error. d < 0 means we predicted less RUL than the truth - pessimistic, safe. d > 0 means we predicted more - optimistic, dangerous.

EXECUTION STATE
Example: predicted=80, actual=90 = d = 80 - 90 = -10 -> early prediction
Example: predicted=100, actual=90 = d = 100 - 90 = +10 -> late prediction
`if d < 0:`

Branch on the sign of d. Two completely different exponentials apply on each side.

`return np.exp(-d / 13.0) - 1.0`

Early-prediction branch. Note the -d in the numerator: since d < 0, -d is positive. The denominator 13 is the e-folding scale: every 13 cycles of earliness multiplies the exponential term by e ~ 2.72.

EXECUTION STATE
np.exp(x) = Element-wise e^x. exp(0) = 1, exp(1) ~ 2.72, exp(2) ~ 7.39.
Example: d = -5 = exp(5/13) - 1 = exp(0.385) - 1 ~ 0.469
Example: d = -10 = exp(10/13) - 1 ~ 1.158
Example: d = -20 = exp(20/13) - 1 ~ 3.657
Example: d = -30 = exp(30/13) - 1 ~ 9.051
`return np.exp(d / 10.0) - 1.0`

Late-prediction branch. The denominator is 10 - strictly less than the 13 on the early branch, which is what makes the curve asymmetric. Every 10 cycles of lateness multiplies the exponential term by e.

EXECUTION STATE
Example: d = +5 = exp(5/10) - 1 ~ 0.649
Example: d = +10 = exp(10/10) - 1 ~ 1.718
Example: d = +20 = exp(20/10) - 1 ~ 6.389
Example: d = +30 = exp(30/10) - 1 ~ 19.086 - over twice the cost of being 30 cycles early
`def rmse(pred, actual) -> float:`

Standard root-mean-square error. Symmetric in d - punishes early and late equally. This is what every regression paper reports without thinking.

EXECUTION STATE
input: pred = (N,)-shaped ndarray of predicted RULs
input: actual = (N,)-shaped ndarray of true RULs
returns float = sqrt of mean squared error - same units as RUL (cycles)
`return float(np.sqrt(np.mean((pred - actual) ** 2)))`

Vectorised computation. Subtract, square, average, square-root. The float(...) cast converts the NumPy float64 scalar that np.sqrt returns into a plain Python float.

EXECUTION STATE
np.mean(arr) = Returns the arithmetic mean. Without `axis`, averages over the entire flat array.
np.sqrt(x) = Element-wise square root. Returns NaN for negative inputs - squaring inside means we never see one here.
Example = If errors are [-5, 0, +3]: mean of squares = (25+0+9)/3 = 11.33, sqrt -> 3.37
`def total_nasa(pred, actual) -> float:`

Sum of per-sample NASA scores across the test set. This is a TOTAL, not a mean - the 2008 leaderboard uses sums, not averages.

EXECUTION STATE
input: pred = (N,)-shaped ndarray
input: actual = (N,)-shaped ndarray
returns float = Sum of all per-sample scores
`return float(sum(nasa_score(p, a) for p, a in zip(pred, actual)))`

Generator expression streamed into Python's sum. Could be vectorised but staying scalar-by-scalar makes the asymmetric branch readable.

EXECUTION STATE
zip(a, b) = Pairs corresponding elements: zip([1,2], [10,20]) -> [(1,10), (2,20)]. Stops at the shorter iterable.
sum(iterable) = Adds up an iterable of numbers. Built-in Python - no import needed.
`np.random.seed(0)`

Lock the random state so the same actual RULs come out every run.

EXECUTION STATE
seed = 0 - different from the engine simulation in section 1.2
`n = 100`

Test-set size. A hundred samples is enough for the law-of-large-numbers behaviour of the NASA score to be visible.

EXECUTION STATE
n = 100
`actual = np.random.uniform(20, 120, n)`

Ground-truth RULs uniform between 20 and 120 cycles. The lower bound 20 keeps us out of the very-near-failure regime where the asymmetry is even more dramatic.

EXECUTION STATE
np.random.uniform(low, high, size) = Sample n values from Uniform(low, high). Returns an ndarray.
actual[:5] = [ 74.88, 91.52, 80.28, 74.49, 62.37]
actual.mean() = ~ 67.3 cycles
`np.random.seed(0)`

Re-seed before each model so all three models receive the SAME 100 noise draws - the only thing that differs between them is the bias term we add.

`pred_unbiased = actual + np.random.normal(0, 8, n)`

Model A - symmetric Gaussian noise around the true RUL. Mean error 0, std 8 cycles. Realistic for a well-trained model that does not care which side of zero it sits on.

EXECUTION STATE
arg: loc=0 = Zero mean - the model is unbiased
arg: scale=8 = 8-cycle std-dev - moderately accurate
Example: pred_unbiased[0] = actual[0] (74.88) + noise (~ +14.12) = 89.00 - late by 14 cycles
`pred_late = actual + np.random.normal(+6, 8, n)`

Model B - same noise std, but mean shifted by +6. Every prediction is on average 6 cycles late. The signature of a model trained without any safety penalty.

EXECUTION STATE
arg: loc=+6 = Late bias - predictions average 6 cycles too high
arg: scale=8 = Same noise as model A
`pred_early = actual + np.random.normal(-6, 8, n)`

Model C - same noise, mean shifted to -6. Every prediction averages 6 cycles early. Paradoxically less accurate by RMSE, but much safer by NASA.

EXECUTION STATE
arg: loc=-6 = Early bias - predictions average 6 cycles too low
`for name, p in [...]:`

Iterate over the three models and print both metrics.

LOOP TRACE · 3 iterations
model A (unbiased)
RMSE = 8.08
NASA = 99.3
model B (late-biased)
RMSE = 10.34
NASA = 180.1 - 1.8x worse than A
model C (early-biased)
RMSE = 9.77
NASA = 111.4 - only 12% worse than A
`print(...)`

Output the three rows. The key observation: B and C have similar RMSE, but B's NASA is 60% higher than C's.

EXECUTION STATE
Output line 1 = unbiased N(0,8) RMSE = 8.08 NASA = 99.3
Output line 2 = late-biased N(+6,8) RMSE = 10.34 NASA = 180.1
Output line 3 = early-biased N(-6,8) RMSE = 9.77 NASA = 111.4
```python
import numpy as np

# ----- The NASA C-MAPSS scoring function -----
# Lower is safer. Late predictions (d > 0) cost much more than early ones.
def nasa_score(predicted: float, actual: float) -> float:
    d = predicted - actual
    if d < 0:
        return np.exp(-d / 13.0) - 1.0   # early - gentle penalty
    return np.exp( d / 10.0) - 1.0       # late  - exponential penalty


def rmse(pred: np.ndarray, actual: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def total_nasa(pred: np.ndarray, actual: np.ndarray) -> float:
    return float(sum(nasa_score(p, a) for p, a in zip(pred, actual)))


# ----- Three "models" with the same precision but different bias -----
np.random.seed(0)
n = 100
actual = np.random.uniform(20, 120, n)

np.random.seed(0)
pred_unbiased = actual + np.random.normal(0,  8, n)   # symmetric noise

np.random.seed(0)
pred_late     = actual + np.random.normal(+6, 8, n)   # biased late

np.random.seed(0)
pred_early    = actual + np.random.normal(-6, 8, n)   # biased early


for name, p in [("unbiased N(0,8) ", pred_unbiased),
                ("late-biased N(+6,8)", pred_late),
                ("early-biased N(-6,8)", pred_early)]:
    print(f"{name:22s}  RMSE = {rmse(p, actual):6.2f}"
          f"   NASA = {total_nasa(p, actual):8.1f}")

# unbiased N(0,8)         RMSE =   8.08   NASA =     99.3
# late-biased N(+6,8)     RMSE =  10.34   NASA =    180.1
# early-biased N(-6,8)    RMSE =   9.77   NASA =    111.4
```

What the numbers say

The unbiased model A has the lowest RMSE (8.08) and the lowest NASA (99.3). Add a +6-cycle bias and you get model B, with worse RMSE (10.34) and much-worse NASA (180.1). Now flip the sign: the early-biased model C ends up with similar RMSE to B (9.77) but its NASA score is 38% lower (111.4) — because all of C's errors fell on the “safe” side of the curve.

That gap between B and C is the entire research thesis of this book in one number. Two models with comparable RMSE; one is dangerous, the other is safe. The difference is invisible to the metric most papers report.
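The scalar loop in total_nasa is kept for readability, but the same sum can be computed without a Python loop. The following is one possible vectorised sketch; the name total_nasa_vectorised is an illustrative choice and not part of the listing above. It evaluates both branches for every sample and lets np.where select by the sign of d.

```python
import numpy as np


def total_nasa_vectorised(pred: np.ndarray, actual: np.ndarray) -> float:
    """Sum of per-sample NASA scores, computed without a Python loop."""
    d = np.asarray(pred, dtype=float) - np.asarray(actual, dtype=float)
    # Evaluate both branches everywhere, then pick one per element.
    early = np.exp(-d / 13.0) - 1.0   # only valid where d < 0
    late = np.exp(d / 10.0) - 1.0     # only valid where d >= 0
    return float(np.sum(np.where(d < 0, early, late)))


# Two samples with true RUL 90: one predicted 10 cycles early, one 10 late.
actual = np.array([90.0, 90.0])
pred = np.array([80.0, 100.0])
print(round(total_nasa_vectorised(pred, actual), 3))
```

For this pair the result is the sum of the two per-sample scores worked out earlier (about 1.158 for the early error plus about 1.718 for the late one). The branch that is discarded for a given sample always has a non-positive exponent, so evaluating both sides does not introduce overflow.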

PyTorch: A Differentiable NASA-Style Loss

For training we need the asymmetric cost as a differentiable loss $\mathcal{L}$. The shape stays the same; the implementation must use torch.exp and torch.where so autograd can backpropagate through both branches.

A differentiable, GPU-ready NASA-style loss
🐍asymmetric_rul_loss.py
`import torch`

Top-level PyTorch - provides Tensor, autograd, exp, where, etc.

`import torch.nn as nn`

torch.nn contains every learnable building block in PyTorch: layers, loss functions, normalisations. We subclass nn.Module so this loss can be put on a GPU, saved with state_dict, and used like any other module.

EXECUTION STATE
torch.nn = Submodule of PyTorch with all neural-network-related classes. By convention aliased as nn.
`class AsymmetricRULLoss(nn.Module):`

We make the loss a Module rather than a free function so it can hold hyper-parameters as attributes, be moved with .to('cuda'), and be used like the built-in loss modules in torch.nn.

EXECUTION STATE
Parent: nn.Module = Base class for everything in torch.nn - provides parameters(), state_dict(), .to(device), forward hooks, etc.
`def __init__(self, early_scale=13.0, late_scale=10.0):`

Constructor - sets the two scale parameters. Defaults are the NASA-paper values; you can pass different ones to soften or sharpen the asymmetry.

EXECUTION STATE
input: early_scale (float) = 13.0 - denominator on the d < 0 branch
input: late_scale (float) = 10.0 - denominator on the d > 0 branch
`super().__init__()`

Calls nn.Module's constructor, which initialises internal bookkeeping (parameters dict, buffers dict, etc.). Always call super().__init__() before doing anything else.

`self.early_scale = early_scale`

Store as a plain attribute. We do NOT register it as a buffer or parameter because it is a fixed constant.

EXECUTION STATE
self.early_scale = 13.0 - Python float, lives on CPU
`self.late_scale = late_scale`

Same idea. 10.0.

EXECUTION STATE
self.late_scale = 10.0
`def forward(self, pred, actual) -> torch.Tensor:`

Module.forward is invoked when you call the instance like a function: loss_fn(pred, actual). PyTorch's __call__ does some bookkeeping (forward hooks) and routes here.

EXECUTION STATE
input: pred (B,) = Batch of B predicted RULs from the network
input: actual (B,) = Matching ground-truth RULs
returns Tensor = Scalar tensor - the mean loss over the batch
`d = pred - actual`

Element-wise subtraction. PyTorch broadcasting is identical to NumPy. d carries grad_fn so autograd can backpropagate through it.

EXECUTION STATE
d.shape = (B,) - same shape as pred / actual
Example: pred = [85., 110.], actual = [80., 100.] = d = [+5., +10.] -> both samples are late
`early = torch.exp(-d / self.early_scale) - 1.0`

Computes the early-branch value for every sample, even the ones that are actually late. We will mask later with where().

EXECUTION STATE
torch.exp(x) = Element-wise e^x for tensors. Differentiable. Same semantics as np.exp.
Example = If d = [+5, -10]: early = [exp(-5/13)-1, exp(10/13)-1] = [-0.32, +1.16]
why a negative value? = When d > 0 the early branch produces a NEGATIVE quantity. That's fine - torch.where will discard it.
`late = torch.exp(d / self.late_scale) - 1.0`

Same trick on the other branch. Computes the late-branch value for every sample.

EXECUTION STATE
Example = If d = [+5, -10]: late = [exp(5/10)-1, exp(-10/10)-1] = [+0.65, -0.63]
`per_sample = torch.where(d < 0, early, late)`

Sample-wise mux: pick the early value where d < 0, the late value where d >= 0. Crucially, autograd sees BOTH branches and picks the right one - there is no Python conditional to break the gradient graph.

EXECUTION STATE
torch.where(condition, x, y) = Element-wise select. Differentiable in x and y. The condition is treated as a non-differentiable mask.
Example = d = [+5, -10] -> per_sample = [+0.65, +1.16] - both positive, both correctly the late/early branch
`return per_sample.mean()`

Mean over the batch. We use mean (not sum) so the optimisation does not change scale when batch size changes. The returned tensor is 0-D.

EXECUTION STATE
.mean() = Reduces all dimensions by default. .mean(dim=0) would reduce only the batch axis.
returns shape = torch.Size([]) - a true scalar
`torch.manual_seed(0)`

Lock PyTorch's RNG so the next torch.rand and torch.randn calls are reproducible.

EXECUTION STATE
torch.manual_seed(seed) = Sets the seed for CPU + all CUDA devices in one call. Equivalent to numpy's seed.
`B = 8`

Tiny batch for the demo.

EXECUTION STATE
B = 8
`actual = torch.rand(B) * 100 + 20`

Uniform sample in [20, 120). torch.rand draws from [0, 1); we scale and shift.

EXECUTION STATE
torch.rand(*sizes) = Uniform [0, 1). Use torch.rand for U(0,1); torch.randn for N(0,1).
actual = torch.Size([8]), values around 70 cycles
`pred = actual + torch.randn(B) * 8`

Predictions = truth + 8-cycle Gaussian noise. Identical to the unbiased NumPy model above.

EXECUTION STATE
torch.randn(*sizes) = Standard Gaussian. To shift mean: + bias; to change std: * sigma.
`loss_fn = AsymmetricRULLoss()`

Instantiate with default NASA scales (13, 10).

EXECUTION STATE
loss_fn.early_scale = 13.0
loss_fn.late_scale = 10.0
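`loss = loss_fn(pred, actual)`

Forward call. PyTorch's nn.Module.__call__ routes to forward() and returns a 0-D tensor. In a real training loop pred comes from the network, so the loss carries a grad_fn and loss.backward() works. In this standalone demo pred is built from plain tensors, so no autograd graph is recorded.

EXECUTION STATE
loss.shape = torch.Size([])
loss.requires_grad = False in this demo; True once pred is produced by a model with trainable parameters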
`print('d =', (pred - actual).round(decimals=2).tolist())`

Look at the per-sample errors. Some early, some late.

EXECUTION STATE
Output = d = [12.40, -2.05, 9.77, -1.24, 6.16, -10.43, -2.04, 0.55]
`print('loss =', round(loss.item(), 4))`

.item() pulls a 0-D tensor to a Python float. The mean of the per-sample asymmetric scores.

EXECUTION STATE
.item() = Tensor -> Python scalar. Only works on 0-D tensors. Synchronises with the GPU if the tensor lives on cuda.
Output = loss = 0.7913
`print('RMSE =', ...)`

Compute RMSE on the same predictions for comparison. .sqrt() and .mean() are tensor methods - same semantics as the np.* free functions.

EXECUTION STATE
Output = RMSE = 7.0561
```python
import torch
import torch.nn as nn

# ----- A differentiable NASA-style loss for training -----
class AsymmetricRULLoss(nn.Module):
    """NASA-style asymmetric loss, averaged over the batch. Lower is better.

    Args:
        early_scale: divisor in the d < 0 branch (NASA default 13).
        late_scale:  divisor in the d >= 0 branch (NASA default 10).
    """

    def __init__(self, early_scale: float = 13.0, late_scale: float = 10.0):
        super().__init__()
        self.early_scale = early_scale
        self.late_scale  = late_scale

    def forward(self, pred: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
        d = pred - actual                                   # (B,)
        early = torch.exp(-d / self.early_scale) - 1.0
        late  = torch.exp( d / self.late_scale ) - 1.0
        # torch.where keeps autograd flowing through both branches.
        per_sample = torch.where(d < 0, early, late)
        return per_sample.mean()


# ----- Use it -----
torch.manual_seed(0)
B = 8
actual = torch.rand(B) * 100 + 20      # uniform RUL in [20, 120)
pred   = actual + torch.randn(B) * 8   # noisy predictions

loss_fn = AsymmetricRULLoss()
loss    = loss_fn(pred, actual)

print("d        =", (pred - actual).round(decimals=2).tolist())
print("loss     =", round(loss.item(), 4))
print("RMSE     =", round(((pred - actual) ** 2).mean().sqrt().item(), 4))
```

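Why torch.where and not a Python if? For a batch with more than one element, `if d < 0` inside forward() raises a RuntimeError, because a multi-element tensor has no single truth value. Looping over the samples with a scalar if would run, but it gives up vectorisation. torch.where evaluates both branches for the whole batch and selects element-wise, so autograd sees one batched, GPU-parallel computation and routes each sample's gradient through the branch that was selected. The loss is continuous at d = 0 but has a kink there: the slope jumps from -1/13 to +1/10.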
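What reaches the network through each branch can be written out by hand. The sketch below uses plain Python rather than PyTorch, and the helper names are illustrative. It differentiates the per-sample score analytically and compares the result with a central finite difference, which is roughly what autograd would produce per sample before the division by batch size.

```python
import math


def score(d: float, early_scale: float = 13.0, late_scale: float = 10.0) -> float:
    """Per-sample asymmetric score for a signed error d."""
    if d < 0:
        return math.exp(-d / early_scale) - 1.0
    return math.exp(d / late_scale) - 1.0


def score_grad(d: float, early_scale: float = 13.0, late_scale: float = 10.0) -> float:
    """Analytic ds/dd: negative on the early side, positive on the late side."""
    if d < 0:
        return -math.exp(-d / early_scale) / early_scale
    return math.exp(d / late_scale) / late_scale


def finite_difference(d: float, h: float = 1e-6) -> float:
    """Central-difference estimate of ds/dd, used only as a sanity check."""
    return (score(d + h) - score(d - h)) / (2.0 * h)


for d in (-10.0, 5.0):
    print(d, round(score_grad(d), 4), round(finite_difference(d), 4))
```

In this sketch a 5-cycle-late error and a 10-cycle-early error produce gradients of almost the same magnitude (about 0.165 and 0.166) with opposite signs: the late side needs only half the error to generate the same corrective push.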

Asymmetric Cost in Other Domains

The pattern “errors in one direction are categorically worse than the other” is everywhere once you start looking.

| Domain | Cheap direction | Expensive direction | Closest analogue to NASA score |
|---|---|---|---|
| Aviation RUL (this book) | Predict too early - wasted life | Predict too late - engine fails | NASA exp scoring (Saxena et al. 2008) |
| Insurance reserves | Over-reserve - opportunity cost | Under-reserve - solvency event | Asymmetric quantile / pinball loss |
| Hospital triage | Admit a stable patient - billing | Discharge a deteriorating one - death | Cost-sensitive learning matrices |
| Autonomous-vehicle braking | Brake too early - passenger discomfort | Brake too late - collision | Time-to-collision threshold |
| Cancer screening | False positive - biopsy | False negative - undetected disease | Weighted F-beta with beta > 1 |
| Battery state-of-charge | Range-anxiety - under-report | Stranded vehicle - over-report | Quantile + range guard-band |
| Server-capacity prediction | Over-provision - cloud bill | Under-provision - outage | Asymmetric MSE in SRE forecasting |

The mathematical machinery in this book — an asymmetric loss, a gradient-balancing controller, a Pareto-frontier picture — transfers line-for-line to any of these settings.
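As one example of the same idea in a different form, the pinball (quantile) loss named in the table is a linear rather than exponential asymmetric cost. The sketch below applies it to the worked RUL example from earlier in this section; the choice of tau = 0.3 is an arbitrary illustration, picked only so that over-prediction (lateness) costs more than under-prediction.

```python
def pinball_loss(actual: float, predicted: float, tau: float) -> float:
    """Quantile (pinball) loss for a target quantile tau in (0, 1).

    Under-prediction costs tau per unit; over-prediction costs (1 - tau).
    """
    diff = actual - predicted
    return max(tau * diff, (tau - 1.0) * diff)


TAU = 0.3   # hypothetical setting: over-prediction weighted 0.7, under-prediction 0.3

late_cost = pinball_loss(actual=20.0, predicted=40.0, tau=TAU)    # 20 cycles late
early_cost = pinball_loss(actual=20.0, predicted=0.0, tau=TAU)    # 20 cycles early

print(f"late:  {late_cost:.1f}")
print(f"early: {early_cost:.1f}")
```

With tau below 0.5 the late error is charged more than the early one, but the ratio stays fixed at (1 - tau) / tau for every error size, whereas the NASA score lets the ratio grow with $|d|$.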

The Trap: Optimising the Wrong Metric

The trap. Train with mean-squared-error, report RMSE, and celebrate. Your loss is symmetric. Your gradient is symmetric. There is no signal anywhere in the optimisation pipeline that says “late predictions are worse than early ones”. You will get a model that is dangerously unbiased.

It is exactly this trap that motivates the failure-biased weighted MSE in Chapter 14. By up-weighting near-failure samples we change the gradient landscape so the model pays disproportionate attention to the regime where late predictions are most expensive — without abandoning the smoothness of MSE that makes deep nets train.

The deep observation. Once you accept that the cost is asymmetric, all three of the proposed objectives in this book (AMNL, GABA, GRACE) follow as different ways to embed that asymmetry into a multi-task gradient.

Takeaway

  • RMSE is symmetric; reality is not. A late RUL prediction costs categorically more than an equally-wrong early one. RMSE cannot tell you which side of zero your errors live on.
  • The NASA score makes that asymmetry quantitative. $s(d) = e^{-d/13} - 1$ for early errors and $s(d) = e^{d/10} - 1$ for late ones. Small denominator, exponential growth: lateness compounds fast.
  • Two models with comparable RMSE can have very different safety profiles. Our late-biased model B and early-biased model C are close on RMSE (10.34 vs 9.77), but B loses badly on NASA score (180 vs 111).
  • The PyTorch loss is a four-line module. torch.where turns the piecewise definition into a fully differentiable, batched, GPU-friendly forward pass.
  • This is what the rest of the book is about. AMNL, GABA, and GRACE are three different engineering answers to the same question: how do we get a model whose gradient knows that late predictions are dangerous?