An Asymmetric Verdict
§16.1 celebrated AMNL's 6.74 RMSE on FD002. Symmetric metric. Symmetric celebration. Now we look at FD001 with the ASYMMETRIC NASA score - and the verdict flips. AMNL on FD001 scores 434.4 ± 239.8 on NASA - 3.0× the Baseline 0.5/0.5 (143.2) and 3.8× the Uncertainty method (115.4). RMSE was a wash; NASA is a cliff.
The Numbers Behind the Penalty
Real Table I FD001 columns (paper_ieee_tii/tables/table1_sota_comparison.md):
| Method | FD001 RMSE | FD001 NASA | NASA penalty vs best |
|---|---|---|---|
| Uncertainty (Kendall) | 9.15 ± 0.42 | **115.4 ± 4.7** | 0% (best) |
| GRACE (ours) | **9.14 ± 1.39** | 121.4 ± 36.8 | +5% |
| GradNorm | 9.38 ± 1.53 | 127.6 ± 44.3 | +11% |
| GABA (ours) | 9.63 ± 1.90 | 130.4 ± 41.8 | +13% |
| Baseline 0.5/0.5 | 10.08 ± 1.71 | 143.2 ± 48.4 | +24% |
| DWA | 10.58 ± 1.66 | 146.1 ± 40.3 | +27% |
| **AMNL (ours)** | 10.43 ± 1.94 | **434.4 ± 239.8** | **+277%** |
Why Failure-Bias Backfires on FD001
Three conditions combine to amplify the penalty:
| Condition | Effect on NASA on FD001 |
|---|---|
| 1. Few near-failure samples (~30%) | AMNL's 2× weighting on near-failure inflates noise from the ~30 critical engines. Gradient direction depends heavily on the seed. |
| 2. Long healthy plateau (RUL > 80 for 65% of cycles) | Model fits the healthy regime first; failure-weighted updates push it LATE on those healthy windows. Almost half of FD001's test set sits at RUL ≈ 70-90. |
| 3. Asymmetric NASA decay constants (a1=13, a2=10) | exp(d/10) − 1 grows ~50% faster than exp(−d/13) − 1 for the same |d|. A 2-cycle late shift on the median engine moves the score 50% more than a 2-cycle early shift would. |
Net result: a +2-cycle median shift in residuals - barely visible in RMSE - tripled the NASA score. The math is crystal clear; the empirical surprise was that AMNL would produce that shift on a single-condition subset at all. FD001 is where it shows because the healthy plateau is so dominant.
Interactive: Watch the Penalty Form
The histogram on top shows per-engine errors for Baseline (blue) and AMNL (green); the red bars on the bottom are the per-bin NASA penalty. Drag AMNL's late bias and watch the green histogram shift right - and the NASA total explode.
Try this. Set late_bias = +2.0 (the paper-realistic AMNL profile). The NASA score crosses ~430 - the paper's 434.4. Now slide back to 0. NASA drops to ~120 - matching Baseline. The shift is small (median engine moves 2 cycles); the penalty is huge (3-4×). NASA is the harshest metric in the literature for late predictions.
Python: Reproduce the Penalty
Pure NumPy. Sample 100 synthetic per-engine errors with a per-method (mean, std), compute RMSE + NASA totals + bias diagnostics, print a side-by-side comparison. The numbers match Table I within seed noise.
PyTorch: NASA Score Across Methods
Same simulation in PyTorch with the paper-canonical nasa_score from §13.1 factored out. The smoke test uses three method profiles and reproduces Table I's FD001 column to within seed noise.
Same Pattern, Other Single-Condition Datasets
FD001-style penalties show up wherever a regression model with failure-amplified loss meets a healthy-dominated distribution AND an asymmetric evaluation metric. Examples:
| Domain | Healthy plateau % | Late penalty asymmetry | Failure-weighted RMSE OK? |
|---|---|---|---|
| RUL on C-MAPSS FD001 (this section) | ~65% | exp(d/10) vs exp(-d/13) ⇒ 1.5× | no - NASA explodes |
| RUL on PRONOSTIA bearings (single regime) | ~70% | exp(d/12) vs exp(-d/15) ⇒ 1.6× | no - similar pattern |
| Battery SoH (one cycling protocol) | ~80% | 0.05× early vs 1× late ⇒ 20× | no - extreme penalty |
| Wind-turbine load on rated speed | ~50% | modest ~1.2× | moderate - check NASA-equivalent |
| MRI tumour pre/post-treatment | ~45% | 1× early vs 5× late | no - clinical asymmetry |
| Multi-condition C-MAPSS FD002 | ~30% | same NASA | yes - AMNL wins both metrics |
Three NASA-Penalty Pitfalls
The point. AMNL's 6.74 FD002 RMSE and 434.4 FD001 NASA are TWO FACES OF THE SAME MECHANISM. Failure-biased weighting helps when condition variability provides the gradient diversity to balance it; hurts otherwise. §16.3 covers the cross-pipeline caveat that muddies these comparisons further; §16.4 turns the pattern into a deployment rule.
Takeaway
- AMNL FD001 NASA = 434.4 ± 239.8. 3× the Baseline; 3.8× the best (Uncertainty 115.4).
- RMSE is a wash; NASA is a cliff. The two metrics disagree because NASA is asymmetric and AMNL biases LATE on FD001.
- Three causes: few near-failure samples, long healthy plateau, asymmetric NASA decay constants.
- The std doubles too. 239.8 cycles of variance ⇒ single-seed comparisons are doubly meaningless.
- Don't tune AMNL to fix it. Use a different method for single-condition deployments (GABA / GRACE / Uncertainty).