Why Late Predictions Cost More
Sit in on a quarterly review at any commercial airline's engineering department and the chief engineer asks one question about every prognostics tool: ‘does it tell us we have more time than we actually do?’ An RUL model that predicts a CFM-56 engine has 30 cycles left when it really has 20 exhibits the worst possible failure mode. The plane flies for ten more cycles past the safe envelope; if anything goes wrong, the cost is the airframe, possibly the crew, certainly the airline's certification. The same model predicting 10 cycles left when reality is 20 is wasteful but operationally safe: the engine is pulled early, no harm done.
RMSE cannot tell those two errors apart. Both have an absolute residual of 10 cycles; squared, they look identical. The C-MAPSS community recognised this in 2008 when Saxena et al. introduced the dataset, and they shipped an asymmetric scoring function alongside it: late predictions are penalised harder than early ones. This metric is the NASA score, and it is the metric GRACE wins on multi-condition C-MAPSS.
The NASA Score: An Asymmetric Cost
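The C-MAPSS scoring function is defined per engine unit and summed across the test set. Let $d_i = \hat{y}_i - y_i$ be the prediction residual on engine $i$ (predicted RUL minus true RUL, so $d_i > 0$ is a late prediction). The per-unit contribution is
$$s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \quad \text{(early)} \\ e^{d_i/10} - 1, & d_i \ge 0 \quad \text{(late)} \end{cases}$$
and the total score is $S = \sum_{i=1}^{N} s_i$, where $N$ is the number of test engines. Three properties to internalise:
- Lower is better. A perfect model predicts $\hat{y}_i = y_i$ for every engine, giving $s_i = 0$ per unit and a total of zero.
- Asymmetric. The late branch's time constant is 10; the early branch's is 13. A +10 cycle late prediction costs the same as a -13 cycle early one ($e^{1} - 1 \approx 1.72$ in both cases). Per cycle of error, the late exponent grows 30% faster than the early one (1/10 vs 1/13), so for small errors late is about 30% more expensive.
- Exponential. A +20 cycle late prediction contributes $e^{2} - 1 \approx 6.39$; a +30 cycle late prediction contributes $e^{3} - 1 \approx 19.09$. Big mistakes dominate the score: one badly-late engine swamps fifty slightly-early ones.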
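The three properties above can be checked numerically. The following is a minimal sketch of the per-unit formula using only the standard library; the function name `nasa_unit_score` and the choice of 3 cycles as "slightly early" are illustrative assumptions, not taken from the paper's code.

```python
import math

def nasa_unit_score(d: float) -> float:
    """Per-engine NASA score for residual d = predicted RUL - true RUL (cycles)."""
    if d < 0:
        # Early prediction: gentler time constant of 13
        return math.exp(-d / 13.0) - 1.0
    # Late (or exact) prediction: harsher time constant of 10
    return math.exp(d / 10.0) - 1.0

perfect = nasa_unit_score(0.0)      # exact prediction costs nothing
late_10 = nasa_unit_score(10.0)     # e^1 - 1
early_13 = nasa_unit_score(-13.0)   # also e^1 - 1: the asymmetry break-even point
late_20 = nasa_unit_score(20.0)     # e^2 - 1
late_30 = nasa_unit_score(30.0)     # e^3 - 1

# Assumption for illustration: "slightly early" means 3 cycles early.
fifty_slightly_early = 50 * nasa_unit_score(-3.0)

print(f"perfect: {perfect:.2f}")
print(f"+10 late: {late_10:.3f}   -13 early: {early_13:.3f}")
print(f"+20 late: {late_20:.2f}   +30 late: {late_30:.2f}")
print(f"fifty engines 3 cycles early: {fifty_slightly_early:.2f}")
```

Under these assumptions, a single engine predicted 30 cycles late contributes more to the total than fifty engines each predicted 3 cycles early.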
What ‘Multi-Condition’ Means On C-MAPSS
The four C-MAPSS subsets are not equally hard:
| Subset | Operating regimes | Fault modes | Train units | Test units | Difficulty |
|---|---|---|---|---|---|
| FD001 | 1 (cruise) | 1 (HPC) | 100 | 100 | Easiest — single-condition single-fault |
| FD002 | 6 (full envelope) | 1 (HPC) | 260 | 259 | Multi-condition, single fault |
| FD003 | 1 (cruise) | 2 (HPC + Fan) | 100 | 100 | Single-condition, multi-fault |
| FD004 | 6 (full envelope) | 2 (HPC + Fan) | 249 | 248 | Hardest — multi-condition multi-fault |
Multi-condition (FD002 + FD004) means the model has to learn condition-invariant features: the same engine produces wildly different sensor signatures at idle, climb, cruise, and descent. The 6 operating regimes overlap sensor ranges; without per-condition normalisation a generic model spends most of its capacity learning ‘what regime are we in’ rather than ‘how degraded is this engine’. FD002 alone has 17,631 training windows across 260 engines; FD004 has 19,520 across 249.
This is the regime where MTL helps the most — the auxiliary health-classification task pulls the shared backbone toward features that are robust to the operating-regime variation. Section 21·3 showed empirically that this is also the regime where GABA + GRACE's OUTER axis is most beneficial.
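To make the per-condition normalisation mentioned above concrete, here is a small sketch that z-scores one sensor channel separately within each operating regime. It assumes a regime label is already available for every reading (for example from clustering the operational settings); the function name, the toy readings, and the labels are all hypothetical, and in practice the statistics would be fitted on training data only.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalise_per_regime(values, regimes):
    """Z-score each reading against the mean/std of its own operating regime."""
    by_regime = defaultdict(list)
    for value, regime in zip(values, regimes):
        by_regime[regime].append(value)
    # Fall back to std = 1.0 when a regime has zero spread, to avoid dividing by zero
    stats = {r: (mean(vs), pstdev(vs) or 1.0) for r, vs in by_regime.items()}
    return [(v - stats[r][0]) / stats[r][1] for v, r in zip(values, regimes)]

# Made-up sensor channel: regime 0 reads around 500, regime 1 around 640.
readings = [500.0, 502.0, 640.0, 644.0, 504.0, 648.0]
regimes = [0, 0, 1, 1, 0, 1]

scaled = normalise_per_regime(readings, regimes)
print([round(s, 3) for s in scaled])
```

After scaling, the first reading of each regime maps to the same value and so does the last, so the upward drift within a regime is no longer masked by the offset between regimes.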
9-Method Multi-Condition Ranking
Numbers come from the paper's 5-seed-per-method runs in cmapss_h256_complete_140.csv + pcgrad_results/all_results.json + cagrad_results/all_results.json.
What the ranking shows. Ranked by NASA, GRACE, Uncertainty, and GABA cluster at the top (NASA 232.7–235.7). Ranked by RMSE, AMNL wins (7.45), but its NASA score is 446.7: about 1.6× the next-worst method (CAGrad, 282.1) and nearly 2× GRACE's 232.7. The two metrics rank the methods very differently, and choosing between them is a deployment decision (chapter 23·4).
Headline Result: GRACE Wins NASA, At ~0.5 RMSE Cost
| Method | Multi-cond RMSE | Multi-cond NASA | Multi-cond HA % | Trade-off vs Baseline |
|---|---|---|---|---|
| Single-task | — | — | — | Catastrophic on FD002 (RMSE 26.92) |
| Baseline | 8.06 | 252.5 | 96.37 | Reference |
| AMNL | 7.45 | 446.7 | 96.64 | RMSE -0.61, NASA +194 (worse safety) |
| GABA | 7.89 | 235.7 | 96.67 | RMSE -0.17, NASA -16.8 |
| GRACE | 7.92 | 232.7 | 96.80 | RMSE -0.14, NASA -19.8 (best NASA) |
| Uncertainty | 7.98 | 233.8 | 96.89 | RMSE -0.08, NASA -18.7 |
| GradNorm | 7.96 | 241.9 | 96.33 | RMSE -0.10, NASA -10.6 |
| DWA | 8.13 | 251.0 | 96.31 | RMSE +0.07, NASA -1.5 |
| PCGrad | 8.70 | 280.6 | 96.87 | RMSE +0.64, NASA +28.1 |
| CAGrad | 8.78 | 282.1 | 96.20 | RMSE +0.72, NASA +29.6 |
Three observations:
- GRACE is the safety-best. NASA = 232.7 is the lowest of all 9 MTL methods. The 1.1-point gap to Uncertainty (233.8) is within the seed-noise band (each method has SEM ≈ 1–2 on the multi-condition mean), so the paper does not claim a statistically significant edge over Uncertainty here; the supportable claim is that GRACE is best or tied for best on the multi-condition mean.
- Health-accuracy bonus. GRACE's 96.80% is third-best in the table. The OUTER axis pulls the backbone toward features that help BOTH tasks; the inner axis adds the failure-region weighting without hurting the health head.
- The RMSE cost is small. 7.92 vs the accuracy-best AMNL's 7.45 = 0.47 cycles. For an engine with mean RUL ~80 cycles, that is a 0.6% relative error difference, well below operational noise. AMNL pays for that 0.47 cycles with a NASA score of 446.7: +194 over Baseline and almost double GRACE's 232.7.
FD002 In Detail
FD002 is the cleaner of the two multi-condition subsets: 6 operating regimes, a single fault mode, and 259 test engines. GRACE achieves the best NASA on this subset alone, and ties for best health accuracy:
| Method | RMSE | NASA | HA % |
|---|---|---|---|
| Baseline | 7.37 | 224.5 | 95.87 |
| AMNL | 6.74 | 356.0 | 97.01 |
| DWA | 7.75 | 234.4 | 96.24 |
| GABA | 7.53 | 224.2 | 97.04 |
| GRACE | 7.72 | 223.4 | 97.22 (best) |
| GradNorm | 8.19 | 260.9 | 95.99 |
| Uncertainty | 7.77 | 224.4 | 97.14 |
| PCGrad | 8.75 | 295.6 | 96.87 |
| CAGrad | 8.59 | 269.4 | 96.20 |
Four methods cluster at NASA ≈ 223–225 (Baseline, GABA, GRACE, Uncertainty). The cluster reflects the noise floor of 5-seed estimation: the difference between any two of them is within seed variation. GRACE's 223.4 is the point estimate winner; the published claim is ‘tied for best NASA on FD002 with a slightly improved health accuracy’.
FD004 In Detail
FD004 is the hardest C-MAPSS subset: 6 operating regimes plus 2 fault modes overlapping in the same units, with 248 test engines and higher per-method variance.
| Method | RMSE | NASA | HA % |
|---|---|---|---|
| Baseline | 8.76 | 280.5 | 96.87 |
| AMNL | 8.16 | 537.5 | 96.27 |
| DWA | 8.51 | 267.7 | 96.38 |
| GABA | 8.25 | 247.2 | 96.30 |
| GRACE | 8.12 (2nd) | 242.0 | 96.39 |
| GradNorm | 7.74 (best) | 222.9 (best) | 96.67 |
| Uncertainty | 8.19 | 243.2 | 96.64 |
| PCGrad | 8.66 | 265.6 | 96.20 |
| CAGrad | 8.97 | 294.7 | 96.20 |
On FD004 specifically, GradNorm wins both RMSE (7.74) and NASA (222.9). GRACE comes in second on both: RMSE 8.12 (0.38 cycles behind) and NASA 242.0 (a 19-point gap). The multi-condition average still favours GRACE because of FD002, where GRACE scores 223.4 against GradNorm's 260.9, a 37.5-point gap that outweighs the FD004 deficit. Section 23·4 walks through when to choose GRACE vs GradNorm based on your dataset profile.
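The reversal between the FD004 ranking and the multi-condition ranking is simple arithmetic. The sketch below recomputes the multi-condition NASA figure for four of the methods, assuming (as the tabulated values suggest) that it is the unweighted mean of the FD002 and FD004 scores; the dictionary names are illustrative.

```python
# Per-subset NASA scores copied from the FD002 and FD004 tables.
fd002_nasa = {"GRACE": 223.4, "GradNorm": 260.9, "Uncertainty": 224.4, "GABA": 224.2}
fd004_nasa = {"GRACE": 242.0, "GradNorm": 222.9, "Uncertainty": 243.2, "GABA": 247.2}

# Assumption: the multi-condition score is the plain average of the two subsets.
multi = {m: (fd002_nasa[m] + fd004_nasa[m]) / 2 for m in fd002_nasa}

fd004_ranking = sorted(fd004_nasa, key=fd004_nasa.get)   # lower NASA is better
multi_ranking = sorted(multi, key=multi.get)

print("FD004 only:     ", fd004_ranking)
print("Multi-condition:", multi_ranking)
for method in multi_ranking:
    print(f"{method:12s} {multi[method]:.1f}")
```

GradNorm leads on FD004 alone but drops to last of these four once its FD002 score is averaged in, which is why per-dataset numbers are worth reporting next to the mean.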
Python: NASA Scoring From Scratch
Two synthetic models with identical residual magnitudes but opposite biases. RMSE cannot tell them apart; NASA reveals the 38% safety differential.
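The original listing is not reproduced here, so what follows is an independent sketch of the comparison rather than the chapter's own code. It assumes every residual has magnitude 5 cycles, a choice made because it yields a NASA ratio of about 1.38; the true RUL values are made up and the function names are illustrative.

```python
import math

def rmse(y_pred, y_true):
    """Root-mean-square error: symmetric in the sign of the residual."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

def nasa_score(y_pred, y_true):
    """Summed asymmetric score: time constant 13 for early, 10 for late."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total

y_true = [20.0, 45.0, 60.0, 90.0, 110.0]     # hypothetical true RULs
late_model = [t + 5.0 for t in y_true]       # always 5 cycles optimistic
early_model = [t - 5.0 for t in y_true]      # always 5 cycles pessimistic

rmse_late, rmse_early = rmse(late_model, y_true), rmse(early_model, y_true)
nasa_late, nasa_early = nasa_score(late_model, y_true), nasa_score(early_model, y_true)
ratio = nasa_late / nasa_early

print(f"RMSE  late={rmse_late:.2f}  early={rmse_early:.2f}")
print(f"NASA  late={nasa_late:.3f}  early={nasa_early:.3f}  ratio={ratio:.3f}")
```

Both models report the same RMSE of 5 cycles, while the late-biased model's NASA score is roughly 38% higher. The size of that differential depends on the residual magnitude: larger errors widen it.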
PyTorch: The Paper's Evaluator._nasa_score
The verbatim production scoring helper from grace/training/evaluator.py. Two parts: the per-sample asymmetric formula (lines 15–21) and the per-unit aggregation that picks each engine's LAST window (lines 47–51) before scoring. The aggregation matters because C-MAPSS test sets give partial trajectories and the scoring convention is defined on end-of-trajectory only.
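The production listing itself is not reproduced here. As a stand-in, the sketch below illustrates the two parts the description names, the per-sample asymmetric formula and the last-window-per-unit aggregation, in plain Python. It is not the paper's `Evaluator._nasa_score`: the function names, the argument layout, and the toy values are assumptions, and it relies on windows arriving in temporal order within each unit.

```python
import math

def last_window_per_unit(unit_ids, preds, targets):
    """Keep only each engine's final window (assumes temporal order per unit)."""
    last = {}
    for unit, pred, target in zip(unit_ids, preds, targets):
        last[unit] = (pred, target)   # later windows overwrite earlier ones
    return last

def nasa_score_per_unit(unit_ids, preds, targets, rul_cap=125.0):
    """Sum the asymmetric score over end-of-trajectory predictions only."""
    total = 0.0
    for pred, target in last_window_per_unit(unit_ids, preds, targets).values():
        d = min(pred, rul_cap) - target          # clip the prediction at the RUL cap
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0   # early branch
        else:
            total += math.exp(d / 10.0) - 1.0    # late branch
    return total

# Two hypothetical engines with 3 and 2 windows respectively.
unit_ids = [1, 1, 1, 2, 2]
preds = [80.0, 60.0, 42.0, 100.0, 30.0]
targets = [75.0, 60.0, 40.0, 90.0, 35.0]

score = nasa_score_per_unit(unit_ids, preds, targets)
print(f"{score:.4f}")  # only (42, 40) and (30, 35) are scored
```

Only the final window of each engine enters the sum, so the large early-trajectory errors in this toy data (such as 100 vs 90) do not affect the score.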
Asymmetric Cost In Other Domains
The pattern ‘late prediction costs more than early prediction’ appears anywhere safety-critical timing matters:
| Domain | Asymmetric loss equivalent | Why late > early |
|---|---|---|
| Cancer screening (mammography) | False-negative rate weighted higher than false-positive rate | Missing a tumour delays treatment by months; a false alarm causes a biopsy. |
| Earthquake early warning (ShakeAlert) | Pinball loss with τ < 0.5 on time-to-shake | Predicting the shake too late means no evacuation; too early means a few seconds of unnecessary alarm. |
| Battery management (electric vehicles) | Asymmetric capacity-loss penalty | Over-estimating remaining range strands drivers; under-estimating annoys them. |
| Climate ice-sheet collapse forecasting | Time-to-tipping-point with late-bias penalty | Predicting collapse late = adaptation strategies fail; predicting early = costly mitigation that may not be needed yet. |
| Medical drug-dosing PK/PD | AUC-asymmetric loss between underdose / overdose | Therapeutic windows are asymmetric — overdose toxicity is often steeper than underdose efficacy loss. |
In every row, the ‘train on RMSE / report on domain-specific score’ pattern is the same as C-MAPSS-with-NASA. GRACE's recipe — failure-biased inner loss + adaptive task-balanced outer loss — is a template for any of these problems.
Pitfalls When Reading NASA Numbers
Pitfall 1: comparing NASA across datasets without normalising
FD002 has 259 test engines, FD004 has 248. The NASA score is a SUM, not a mean. A NASA of 224 on FD002 (per-unit ≈ 0.86) is comparable to a NASA of 213 on FD004 (per-unit ≈ 0.86), not to a NASA of 224 on FD004. Always divide by the number of test units $N$ before across-dataset comparisons.
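The normalisation is a one-line division, sketched here with the test-unit counts from the subset table. The helper name `per_unit` is illustrative.

```python
# Test-unit counts as listed in the C-MAPSS subset table.
test_units = {"FD002": 259, "FD004": 248}

def per_unit(nasa_total: float, subset: str) -> float:
    """Convert a summed NASA score into a per-engine average."""
    return nasa_total / test_units[subset]

fd002_224 = per_unit(224.0, "FD002")
fd004_213 = per_unit(213.0, "FD004")
fd004_224 = per_unit(224.0, "FD004")

print(f"224 on FD002 -> {fd002_224:.3f} per unit")
print(f"213 on FD004 -> {fd004_213:.3f} per unit")
print(f"224 on FD004 -> {fd004_224:.3f} per unit")
```

The first two per-unit values agree to two decimals, whereas the same raw total of 224 on FD004 corresponds to a visibly higher per-engine cost.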
Pitfall 2: forgetting the cap clip on y_pred
The clip y_pred = np.minimum(y_pred, 125.0) is non-trivial. Without it, a single test engine where the model wildly over-predicts (RUL = 500 vs reality of 5) contributes $e^{495/10} - 1 \approx 3 \times 10^{21}$ and dominates the score. Saxena's convention clips first; any reimplementation must do the same.
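The effect of the clip on that single engine can be worked through directly. This sketch uses the standard-library `min` in place of `np.minimum` for a scalar; the constant and function names are illustrative.

```python
import math

RUL_CAP = 125.0  # cap applied to predictions before scoring

def late_cost(pred: float, true: float) -> float:
    """Late-branch NASA cost; valid when pred >= true."""
    return math.exp((pred - true) / 10.0) - 1.0

true_rul = 5.0
wild_pred = 500.0

unclipped = late_cost(wild_pred, true_rul)                 # d = 495
clipped = late_cost(min(wild_pred, RUL_CAP), true_rul)     # d = 120

print(f"unclipped: {unclipped:.3e}")
print(f"clipped:   {clipped:.3e}")
```

Even after clipping, this one engine still contributes on the order of 10^5, far above the totals reported in the tables, but the contribution shrinks by about sixteen orders of magnitude relative to the unclipped case.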
Pitfall 3: scoring on shuffled test data
The per-unit aggregation requires the loader to yield windows in temporal order so preds[mask][-1] picks the engine's LAST window, not a random one. Setting shuffle=True on the test DataLoader silently breaks this. Section 22·1 calls this out as a top-5 pitfall.
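One defensive option, sketched below under the assumption that each window carries its cycle index, is to select the last window by the largest cycle number rather than by arrival order. The function name and the record layout `(unit_id, cycle, pred, target)` are hypothetical and not taken from the paper's code.

```python
import random

def last_window_by_cycle(records):
    """Pick each unit's window with the largest cycle index, regardless of order."""
    last = {}
    for unit, cycle, pred, target in records:
        if unit not in last or cycle > last[unit][0]:
            last[unit] = (cycle, pred, target)
    return {unit: (pred, target) for unit, (cycle, pred, target) in last.items()}

# Hypothetical records: (unit_id, cycle, pred, target)
records = [
    (1, 1, 80.0, 75.0), (1, 2, 60.0, 60.0), (1, 3, 42.0, 40.0),
    (2, 1, 100.0, 90.0), (2, 2, 30.0, 35.0),
]

ordered = last_window_by_cycle(records)

shuffled_records = records[:]
random.Random(0).shuffle(shuffled_records)   # simulate a shuffled test loader
shuffled = last_window_by_cycle(shuffled_records)

print(ordered)
print(shuffled == ordered)
```

Because the selection keys on the cycle index, shuffling the input leaves the chosen windows unchanged, at the cost of having to carry that index through the evaluation loop.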
Pitfall 4: comparing NASA on FD001 with NASA on FD002
FD001 NASA scores cluster around 130–140 because single-condition models predict accurately. FD002 NASA scores cluster around 220–240 not because the methods are worse, but because there are more units and the difficulty is higher. Always report per-dataset scores; the across-dataset average can hide reversals (chapter 21·3).
Pitfall 5: claiming a NASA win without the standard error
Per-method 5-seed NASA standard deviations on FD002 are 20–40 points (chapter 22·3 §5 has the table). A 1-point difference in mean NASA is not a meaningful claim — the SEM is around 10–20. The paper's claim is ‘GRACE is best or tied for best’, not ‘GRACE is statistically significantly better than Uncertainty’.
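The step from standard deviation to standard error is a division by the square root of the number of seeds. The sketch below applies it to the 20–40 point range quoted above and compares the result with the FD002 gap between GRACE (223.4) and Uncertainty (224.4); variable names are illustrative.

```python
import math

n_seeds = 5
std_low, std_high = 20.0, 40.0        # quoted range of per-method NASA std on FD002

sem_low = std_low / math.sqrt(n_seeds)
sem_high = std_high / math.sqrt(n_seeds)

gap = 224.4 - 223.4                   # Uncertainty minus GRACE on FD002

print(f"SEM range: {sem_low:.1f} to {sem_high:.1f}")
print(f"gap: {gap:.1f}  (gap / smallest SEM = {gap / sem_low:.2f})")
```

With five seeds the standard error lands at roughly 9 to 18 points, so a gap of about one point is a small fraction of even the most favourable standard error.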
Takeaway
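- The NASA score is asymmetric: with residual $d$ (predicted minus true RUL), late predictions ($d \ge 0$) cost $e^{d/10} - 1$; early predictions ($d < 0$) cost $e^{-d/13} - 1$. The late exponent grows 1.3× faster per cycle of error.
- On the multi-condition mean of FD002 + FD004 (n=5 seeds × 2 datasets), GRACE achieves NASA = 232.7, the lowest of any of the 9 MTL methods.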
- The RMSE cost is 0.47 cycles vs AMNL's accuracy-best 7.45. AMNL pays for that 0.47 cycles with NASA = 446.7 (almost double GRACE's).
- On FD002 alone GRACE wins NASA (223.4) and ties for best health accuracy (97.22%). On FD004 alone GradNorm wins both RMSE (7.74) and NASA (222.9); GRACE is second on both.
- The asymmetric-cost pattern generalises to cancer screening, earthquake warning, EV range, ice-sheet forecasting, and drug dosing: each can swap in a domain-specific replacement for the NASA formula and reuse the GRACE training recipe as a template.