Boo-AI — Master Artificial Intelligence by Building from Scratch

Pareto, Cars, And Why You Cannot Have It All

Vilfredo Pareto, the Italian economist, noticed in 1896 that a policy reform that helps everyone equally is rare; what is far more common is a reform that helps some at the expense of others. The configurations where you cannot improve any individual's outcome without making someone else worse off are Pareto-optimal — the boundary of the attainable region.

Engineering knows this picture well. Tune an F1-car suspension for grip and you lose top speed; tune for top speed and you lose grip. The set of suspensions where no other configuration is BOTH grippier AND faster is the Pareto frontier. A team can choose a point on the frontier based on the day's circuit but cannot beat the frontier without changing something outside the suspension — a different tyre compound, a different aerodynamic package, a different driver.

Section 23·1 reported one number: GRACE wins NASA at 232.7 on multi-condition C-MAPSS. This section shows the picture underneath that number. RMSE and NASA define a 2D performance space; the 9 published MTL methods scatter inside it; only 3 are on the Pareto frontier; GRACE is one of them. The plot is an honest record of which methods to choose, when, and what you give up.

The headline. On the multi-condition mean of FD002 + FD004, the Pareto frontier consists of exactly three methods: AMNL (accuracy-best, RMSE 7.45 / NASA 446.7), GABA (balanced, 7.89 / 235.7), and GRACE (safety-best, 7.92 / 232.7). Six other methods are dominated — some other method beats them on both axes simultaneously.

Pareto Dominance In Two Lines

Two-axis Pareto dominance is a precise relation. For two candidate methods $p$ and $q$ with metrics $(r_p, n_p)$ and $(r_q, n_q)$ where lower is better on both axes:

$q \succ p \quad \Longleftrightarrow \quad r_q \leq r_p \;\text{and}\; n_q \leq n_p \;\text{and}\; (r_q < r_p \;\text{or}\; n_q < n_p).$

Read this as ‘$q$ dominates $p$’: $q$ is no worse on either axis AND strictly better on at least one. The strict-better clause excludes ties: if $(r_p, n_p) = (r_q, n_q)$ exactly, neither dominates the other and both are candidates for the front. The Pareto frontier is the set of methods that no other method dominates.
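A minimal sketch of the relation above as a Python predicate. The function name dominates and the (rmse, nasa) tuple layout are illustrative choices, not names from the paper's code; the three points are multi-condition means quoted in this section.

```python
from typing import Tuple

Point = Tuple[float, float]  # (rmse, nasa), lower is better on both axes


def dominates(q: Point, p: Point) -> bool:
    """True if q is no worse than p on both axes and strictly better on one."""
    no_worse = q[0] <= p[0] and q[1] <= p[1]
    strictly_better = q[0] < p[0] or q[1] < p[1]
    return no_worse and strictly_better


grace = (7.92, 232.7)
uncertainty = (7.98, 233.8)
amnl = (7.45, 446.7)

# GRACE is better on both axes, so it dominates Uncertainty.
print(dominates(grace, uncertainty))
# GRACE and AMNL make opposite trades, so neither dominates the other.
print(dominates(grace, amnl), dominates(amnl, grace))
# An exact tie never counts as dominance.
print(dominates(grace, grace))
```

The incomparable pair is the interesting case: both calls return False, which is exactly why both methods can sit on the front at once.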

Why both axes. A method with better RMSE but worse NASA cannot be flatly compared with one that has the opposite trade. The Pareto formalism says: report both, mark the frontier, let the practitioner pick a point. Single-number rankings (mean RMSE only, or weighted-average score) hide the choice. The C-MAPSS literature accumulated 15 years of accuracy-only races; the Pareto picture is what got mostly skipped.

Interactive: The Multi-Condition Pareto Picture

Below: 9 methods plotted in (RMSE, NASA) space, multi-condition mean. The dashed green line traces the Pareto frontier. Drag the preference slider to compose RMSE and NASA into a single score $J = \lambda \cdot \text{RMSE}_{\text{norm}} + (1-\lambda) \cdot \text{NASA}_{\text{norm}}$ and watch the optimal method change. Toggle FD002 / FD004 / multi-cond mean above the chart.


Two things to notice. Drag $\lambda$ from 0 to 1: the optimal method walks along the frontier from GRACE (safety-only) through GABA (balanced) to AMNL (accuracy-only). Switch from MULTI to FD002: the frontier gains Baseline as a fourth corner. Switch to FD004: the frontier collapses to GradNorm alone — one method dominates everything.

Three Methods Make The Multi-Condition Front

Front method	RMSE	NASA	Role
AMNL	7.45 (best)	446.7 (worst)	Accuracy-only: buys 0.44 RMSE over GABA at a cost of +211 NASA
GABA	7.89	235.7	Balanced — pays only 0.44 RMSE for NASA −211
GRACE	7.92	232.7 (best)	Safety-best — pays only 0.03 RMSE for NASA −3

The three corners answer different questions. AMNL says: you only care about accuracy (e.g. a PHM benchmark scoreboard). GABA says: you accept some accuracy cost to keep the safety score reasonable. GRACE says: safety dominates, so buy every possible NASA point and only stop when accuracy starts to regress.

The 6 dominated methods (Baseline, DWA, GradNorm, Uncertainty, PCGrad, CAGrad) all have at least one front method that beats them on both axes. Uncertainty (RMSE 7.98, NASA 233.8) is the closest to the front: GRACE beats it by 0.06 RMSE and 1.1 NASA, a gap that is within seed noise. The published claim is therefore that GRACE strictly dominates Uncertainty on the 5-seed means, by a small margin; section 22·3 §5 explained why this is robust at n=5 seeds.

FD002 Has A Four-Method Front

Front method	RMSE	NASA	Position
AMNL	6.74 (best)	356.0	Accuracy corner — costs +131 NASA
Baseline	7.37	224.5	Cheap balanced point
GABA	7.53	224.2	Middle of the front
GRACE	7.72	223.4 (best)	Safety corner

FD002 is the ‘easy’ multi-condition subset (single fault mode) and the front widens to four methods. Baseline appears here because plain MTL with fixed equal weights already does a reasonable job balancing RMSE and NASA when the data only has one fault to learn. GRACE still wins NASA, GABA is one tick higher in NASA but lower in RMSE, AMNL stretches all the way to the accuracy corner.

FD004 Has A Single Winner

Front method	RMSE	NASA	Status
GradNorm	7.74 (best)	222.9 (best)	Dominates every other method on FD004

FD004 (multi-condition multi-fault, the hardest C-MAPSS subset) produces a single winner: GradNorm beats every other method on BOTH axes. GRACE comes second on RMSE (8.12) and second on NASA (242.0) but is dominated by GradNorm. GRACE still wins NASA on the multi-condition mean because of its FD002 lead: its multi-condition NASA of 232.7 = mean(223.4, 242.0) is lower than any other method's mean over the same pair of datasets.
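The averaging in the paragraph above can be checked directly from the per-dataset values quoted in this section. This is only an arithmetic sketch; the constant names and the mean_point helper are illustrative.

```python
# (RMSE, NASA) values quoted in this section; lower is better on both.
GRACE_FD002 = (7.72, 223.4)
GRACE_FD004 = (8.12, 242.0)
GRACE_MULTI = (7.92, 232.7)  # the multi-condition mean reported above


def mean_point(a, b):
    """Average two (rmse, nasa) pairs axis by axis."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


grace_mean = mean_point(GRACE_FD002, GRACE_FD004)
print(f"RMSE {grace_mean[0]:.2f}  NASA {grace_mean[1]:.1f}")
```

Both axes of the recomputed mean agree with the multi-condition row for GRACE, so the headline number is a plain average of the two subsets.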

Why GradNorm wins FD004. The GradNorm controller uses learnable per-task weights with their own gradient (chapter 24 §3); on FD004's noisier multi-fault data the additional flexibility helps. GRACE's simpler closed-form GABA controller is more robust per-dataset but less adaptive to FD004's specific structure. Section 23·4 gives the decision rule for which to use.

Picking A Point With A Preference Weight

For deployment you need to pick ONE method, not the whole front. The textbook way: define a preference weight $\lambda \in [0, 1]$ capturing how much you care about RMSE relative to NASA, then minimise the composite score

$J(\text{method}) = \lambda \cdot \widetilde{\text{RMSE}} + (1 - \lambda) \cdot \widetilde{\text{NASA}}$

where the tildes denote min-max normalisation onto $[0, 1]$ (so the two axes can be added without unit-scale issues). At $\lambda = 0$ only NASA matters → GRACE wins. At $\lambda = 1$ only RMSE matters → AMNL wins. In between, the optimum walks along the front.

λ range	Optimal method	Operational reading
[0.00, 0.40]	GRACE	Safety-first deployment. Aviation, nuclear, medical.
[0.40, 0.85]	GABA	Balanced. General industrial PHM.
[0.85, 1.00]	AMNL	Accuracy-first. Benchmark leaderboards, research demos.

The cut-points are dataset-specific (the explorer above recomputes them live). The qualitative picture (GRACE for safety-heavy weights, AMNL for accuracy-heavy weights, GABA in between) is what the multi-condition mean shows. It does not carry over to every subset: on FD004 GradNorm dominates every other method, so it minimises $J$ at every $\lambda$. The N-CMAPSS DS02 set is covered in chapter 23·3.
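A minimal sketch of the composite score, assuming the min-max normalisation is taken over the nine multi-condition means used elsewhere in this section. The helper names min_max and best_method are illustrative. The exact crossover values depend on which methods enter the normalisation, so a sweep with this sketch need not reproduce the table's cut-points exactly.

```python
import numpy as np

methods = ["Baseline", "AMNL", "DWA", "GABA", "GRACE",
           "GradNorm", "Uncertainty", "PCGrad", "CAGrad"]
rmse = np.array([8.06, 7.45, 8.13, 7.89, 7.92, 7.96, 7.98, 8.70, 8.78])
nasa = np.array([252.5, 446.7, 251.0, 235.7, 232.7, 241.9, 233.8, 280.6, 282.1])


def min_max(x):
    """Rescale to [0, 1]; 0 is the best (lowest) value in the set."""
    return (x - x.min()) / (x.max() - x.min())


def best_method(lam):
    """Method minimising J = lam * RMSE_norm + (1 - lam) * NASA_norm."""
    score = lam * min_max(rmse) + (1.0 - lam) * min_max(nasa)
    return methods[int(np.argmin(score))]


for lam in (0.0, 0.2, 0.5, 0.9, 1.0):
    print(f"lambda = {lam:.1f} -> {best_method(lam)}")
```

A linear score like this can only ever select points on the convex part of the front, which is one more reason to publish the front itself and not just the winner at one weight.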

Python: Pareto Sorting From Scratch

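Implement Pareto-front identification on the 9-method results matrix. The algorithm is two nested loops, O(N²), which is plenty for N=9. For larger two-objective problems a sort-then-sweep finds the front in O(N log N); NSGA-II's fast non-dominated sort costs O(MN²) for M objectives because it ranks every point into successive fronts, but the dominance definition is the same.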

Pareto front identification by non-dominated sorting

🐍grace_pareto_front.py

Line-by-line notes on the listing below (29 of its 57 lines are annotated).

Module docstring
States the contract: identify the Pareto front for the 9-method × 2-axis (RMSE, NASA) results matrix. The function we'll write is the same non-dominated-sorting primitive used in NSGA-II and other multi-objective optimisers.

import numpy as np
NumPy. Used for the rmse and nasa arrays plus light array indexing.

methods = ["Baseline", "AMNL", "DWA", "GABA", "GRACE", "GradNorm", "Uncertainty", "PCGrad", "CAGrad"]
All 9 MTL methods studied in the paper. Order matters because the rmse and nasa arrays below are positionally aligned with this list. The 9 methods are 1 baseline + 8 MTL variants; single-task is excluded because its FD002 RMSE is catastrophic (~27).

rmse = np.array([8.06, 7.45, 8.13, 7.89, 7.92, 7.96, 7.98, 8.70, 8.78])
Multi-condition RMSE means, n=5 seeds × 2 datasets averaged. Values from data_analysis/cmapss_h256_complete_140.csv. np.array(list) converts a Python list to an ndarray, which enables vectorised arithmetic and indexing. rmse has shape (9,); AMNL is best at 7.45.

nasa = np.array([252.5, 446.7, 251.0, 235.7, 232.7, 241.9, 233.8, 280.6, 282.1])
Multi-condition NASA means. Same 9 methods, same order. nasa has shape (9,); GRACE wins at 232.7.

def is_dominated(i, rmse, nasa):
Predicate: is point i dominated by ANY other point j? A dominator must be no worse on every axis AND strictly better on at least one.
Inputs: i, the index of the candidate point to test; rmse, a (9,) array of RMSEs; nasa, a (9,) array of NASA scores.
Returns: a Boolean. True means at least one j dominates i.

for j in range(len(rmse)):
Iterate over j = 0..8 (9 iterations, one of which is skipped because j = i) and test the dominance condition against each. Worst case O(N) per call; combined with the loop over i in pareto_front it is O(N²), fine for the 9-method case.

if i == j:
Skip self-comparison. A point cannot dominate itself.

continue
Move to the next j.

if (rmse[j] <= rmse[i] and nasa[j] <= nasa[i]
Domination check, part 1: j is no worse on EITHER axis. A tie on one axis still allows domination if part 2 fires on the other.

and (rmse[j] < rmse[i] or nasa[j] < nasa[i])):
Domination check, part 2: j is STRICTLY better on AT LEAST ONE axis. Without this clause two equal points would dominate each other, which is logically inconsistent. Why both clauses? Strict dominance excludes ties: if (rmse_j, nasa_j) == (rmse_i, nasa_i), neither is dominated by the other (both are on the front if they are otherwise non-dominated).

return True
Found a dominator, so stop early.

return False
No dominator found across all j: point i is on the Pareto front.

def pareto_front(rmse, nasa, names):
Return the Pareto front as a list of (name, rmse, nasa) tuples sorted by RMSE ascending. The sort gives a natural left-to-right curve when plotting.
Inputs: rmse, a (9,) array; nasa, a (9,) array; names, the list of method names, positionally aligned with rmse and nasa.
Returns: List[(name, rmse, nasa)], the Pareto-front methods sorted left-to-right.

front_idx = [i for i in range(len(rmse))
List comprehension: collect indices where is_dominated returns False.

if not is_dominated(i, rmse, nasa)]
Filter clause for the comprehension. Illustrative state: front_idx = [1, 3, 4], corresponding to AMNL, GABA, GRACE.

front_idx.sort(key=lambda i: rmse[i])
In-place sort by RMSE ascending. list.sort(key) is a stable in-place sort; key is called once per element and the result is used for comparison. The lambda is the key function: lambda i: rmse[i] is equivalent to def f(i): return rmse[i], so smaller RMSEs come first.

return [(names[i], float(rmse[i]), float(nasa[i])) for i in front_idx]
Pack into (name, rmse, nasa) tuples. float() converts numpy scalars to plain Python floats so the result is JSON-serialisable.

front = pareto_front(rmse, nasa, methods)
Compute the front. Illustrative state: front = [('AMNL', 7.45, 446.7), ('GABA', 7.89, 235.7), ('GRACE', 7.92, 232.7)].

print("Pareto front (multi-condition mean):")
Header.

for name, r, n in front:
Iterate the 3-element front. Tuple-unpacking gives readable variable names. The three iterations print:
AMNL RMSE = 7.45 NASA = 446.7
GABA RMSE = 7.89 NASA = 235.7
GRACE RMSE = 7.92 NASA = 232.7

print(f"  {name:<14} RMSE = {r:5.2f}  NASA = {n:6.1f}")
Pretty print. {name:<14} pads to 14 chars left-aligned; {r:5.2f} formats as a 5-wide 2-decimal float.

grace_idx = methods.index("GRACE")
Find where GRACE sits in the methods list. list.index(value) returns the index of the first occurrence of value and raises ValueError if it is not found. Here grace_idx = 4 (zero-based).

dominated_by_grace = [
Comprehension start. Collect method names that GRACE strictly dominates.

methods[j] for j in range(len(methods))
Iterate all methods.

if j != grace_idx and rmse[grace_idx] <= rmse[j] and nasa[grace_idx] <= nasa[j]
Two-axis dominance check. j is excluded if it is GRACE itself; otherwise GRACE must be no worse on BOTH axes.

and (rmse[grace_idx] < rmse[j] or nasa[grace_idx] < nasa[j])
Strict-better-on-one-axis clause. Completes the comprehension filter.

]
Close the list comprehension.

print(f"\nGRACE strictly dominates: {dominated_by_grace}")
Print the methods strictly dominated by GRACE. Illustrative output: GRACE strictly dominates: ['Baseline', 'DWA', 'GradNorm', 'Uncertainty', 'PCGrad', 'CAGrad']. Reading: GRACE is on the front (not dominated). It also strictly dominates 6 of the other 8 methods. The two it doesn't dominate are AMNL (better RMSE) and GABA (better RMSE).

"""Pareto front identification on the GRACE results matrix.

A point p is 'dominated' if there is another point q that is
no worse than p on every axis AND strictly better on at least one.
The Pareto front is the set of points that are NOT dominated by any
other.
"""

import numpy as np


# ---------- 9-method results on multi-condition C-MAPSS ----------
# Source: paper data_analysis/cmapss_h256_complete_140.csv averaged
# across FD002 + FD004, n=5 seeds per dataset.
methods = ["Baseline","AMNL","DWA","GABA","GRACE",
           "GradNorm","Uncertainty","PCGrad","CAGrad"]
rmse    = np.array([8.06, 7.45, 8.13, 7.89, 7.92,
                    7.96, 7.98, 8.70, 8.78])
nasa    = np.array([252.5, 446.7, 251.0, 235.7, 232.7,
                    241.9, 233.8, 280.6, 282.1])


def is_dominated(i, rmse, nasa):
    """Return True if point i is dominated by any other point."""
    for j in range(len(rmse)):
        if i == j:
            continue
        if (rmse[j] <= rmse[i] and nasa[j] <= nasa[i]
                and (rmse[j] < rmse[i] or nasa[j] < nasa[i])):
            return True
    return False


def pareto_front(rmse, nasa, names):
    """Return the non-dominated points as (name, rmse, nasa) tuples, sorted by RMSE."""
    front_idx = [i for i in range(len(rmse))
                 if not is_dominated(i, rmse, nasa)]
    # Sort by RMSE ascending: gives the natural left-to-right curve
    front_idx.sort(key=lambda i: rmse[i])
    return [(names[i], float(rmse[i]), float(nasa[i])) for i in front_idx]


# ---------- Compute and print ----------
front = pareto_front(rmse, nasa, methods)

print("Pareto front (multi-condition mean):")
for name, r, n in front:
    print(f"  {name:<14} RMSE = {r:5.2f}  NASA = {n:6.1f}")

# Confirm GRACE dominates Uncertainty + 5 others
grace_idx = methods.index("GRACE")
dominated_by_grace = [
    methods[j] for j in range(len(methods))
    if j != grace_idx and rmse[grace_idx] <= rmse[j] and nasa[grace_idx] <= nasa[j]
       and (rmse[grace_idx] < rmse[j] or nasa[grace_idx] < nasa[j])
]
print(f"\nGRACE strictly dominates: {dominated_by_grace}")

What the script proves. GRACE strictly dominates 6 of the 8 other methods (Baseline, DWA, GradNorm, Uncertainty, PCGrad, CAGrad). The two it does not dominate are AMNL (better RMSE) and GABA (better RMSE). Together {AMNL, GABA, GRACE} form the Pareto front.
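The nested loops above are O(N²). For two objectives the same front can be found with one sort and one pass, which is O(N log N). The sketch below is an alternative to the script's is_dominated loop, not the paper's code; the function name pareto_front_sweep is illustrative. It keeps exact duplicates on the front, matching the tie rule stated earlier.

```python
import numpy as np


def pareto_front_sweep(rmse, nasa):
    """Indices of non-dominated points, ordered by ascending RMSE."""
    # Sort by RMSE, breaking RMSE ties by NASA (lexsort: the last key is primary).
    order = np.lexsort((nasa, rmse))
    front = []
    best_nasa = np.inf
    for i in order:
        # An exact copy of the last front point is not dominated by it.
        is_duplicate = (bool(front)
                        and rmse[i] == rmse[front[-1]]
                        and nasa[i] == nasa[front[-1]])
        # Every earlier point has RMSE <= this one, so the point survives
        # only if it strictly improves the best NASA seen so far.
        if nasa[i] < best_nasa or is_duplicate:
            front.append(int(i))
            best_nasa = nasa[i]
    return front


rmse = np.array([8.06, 7.45, 8.13, 7.89, 7.92, 7.96, 7.98, 8.70, 8.78])
nasa = np.array([252.5, 446.7, 251.0, 235.7, 232.7, 241.9, 233.8, 280.6, 282.1])
print(pareto_front_sweep(rmse, nasa))  # indices into the methods list
```

On the nine-method matrix the sweep returns the same three indices as the nested-loop version (AMNL, GABA, GRACE); the sweep only pays off once N reaches the thousands.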

PyTorch: Hooking RMSE+NASA Into An Eval Loop

Production usage: take a trained model, run the paper's Evaluator, place the result on the published Pareto picture, and decide whether the new model is publishable. The wrapper returns three signals: the raw $(\text{RMSE},\ \text{NASA})$ pair, a boolean ‘on Pareto front vs paper’, and the lists of paper methods that dominate or are dominated.

Place a new model on the published Pareto picture

🐍grace_pareto_eval.py

Line-by-line notes on the listing below (39 of its 60 lines are annotated).

Module docstring
States the contract: take a trained model, evaluate it on a test loader, and report where it sits relative to the paper's reference Pareto picture. Useful for sanity-checking new methods before claiming a win.

import torch
PyTorch core. We use torch.device for the eval helper.

from torch.utils.data import DataLoader
DataLoader type for the function signature.

from grace.training.evaluator import Evaluator
Paper's production evaluator. Returns rmse_last + rmse_all + mae_last + mae_all + r2 + nasa_score + health_accuracy + health_f1 in a single dict. Source: grace/training/evaluator.py:Evaluator. Evaluator(device, max_rul) is initialised with a torch.device and the RUL cap (125 = paper default); its .evaluate(model, loader) method runs one full pass.

from grace.training.seed_utils import set_seed
Reproducibility pin from chapter 22 §3.

REFERENCE = {
Hard-coded paper Table-I numbers. Five methods chosen as anchors for the Pareto comparison: the AMNL accuracy-best, the GABA balanced, GRACE safety-best, plus Uncertainty (closest to GRACE) and Baseline (the unbiased reference).

"AMNL": (7.45, 446.7),
AMNL multi-condition mean from the paper.

"GABA": (7.89, 235.7),
GABA, the middle of the front.

"GRACE": (7.92, 232.7),
GRACE, safety-best.

"Uncertainty": (7.98, 233.8),
Uncertainty, close to GRACE but dominated.

"Baseline": (8.06, 252.5),
Plain-MTL reference.

def evaluate_and_place(model, test_loader: DataLoader,
End-to-end helper: run a model and tell the user where it sits on the published Pareto picture.
Inputs: model, a trained DualTaskModel that should already be in eval mode with the EMA shadow applied if relevant; test_loader, a DataLoader yielding (seq, rul, health, uid), which MUST be shuffle=False for last-cycle NASA scoring.

device: torch.device, seed: int = 42):
Continued signature. seed is for the eval-time RNG (mostly cosmetic but pinned for reproducibility).
Inputs: device, a torch.device, the same as the training-time device; seed = 42, an integer pinned via set_seed for reproducibility.
Returns: dict with rmse, nasa, on_front_vs_paper, dominators, dominated_paper_methods.

set_seed(seed)
Pin the PRNG. Mostly a safety net: eval is supposed to be deterministic, but some PyTorch ops still produce slightly different floats run-to-run without the cuDNN flags from chapter 22 §3.

evaluator = Evaluator(device=device, max_rul=125)
Construct the evaluator. max_rul=125 matches the paper's piecewise-linear RUL cap.

metrics = evaluator.evaluate(model, test_loader)
Run the eval. Internally walks the loader, accumulates per-window predictions, picks last-cycle per unit, and computes all metrics. ~1 second per dataset on a GPU. evaluator.evaluate(model, loader) returns a dict with keys: rmse_all, mae_all, r2_all, rmse_last, mae_last, r2_last, nasa_score, n_units, health_accuracy, health_f1.

rmse = metrics["rmse_last"]
Extract the last-cycle RMSE. The paper's reported number is rmse_last, NOT rmse_all; last-cycle is what NASA scoring uses.

nasa = metrics["nasa_score"]
Extract the NASA score (also computed on last-cycle).

dominators = []
List of paper methods that strictly dominate the new model. If empty, the new model is on the Pareto front.

dominated = []
List of paper methods that the new model strictly dominates. Useful for marketing: ‘our model strictly dominates Baseline and Uncertainty’. Only the five methods in REFERENCE can appear here.

for name, (r_ref, n_ref) in REFERENCE.items():
Iterate the 5 reference methods. Tuple-unpacking on .items() gives (name, (rmse, nasa)) per iteration. Trace of the 5 iterations for a model that reproduces GRACE's numbers (7.92, 232.7):
AMNL → (7.45, 446.7): AMNL has lower RMSE but vastly higher NASA. Neither dominates either way.
GABA → (7.89, 235.7): GABA has lower RMSE and higher NASA than GRACE, so neither dominates.
GRACE → (7.92, 232.7): GRACE matches itself and is skipped via the strict-better clause.
Uncertainty → (7.98, 233.8): GRACE has lower RMSE AND lower NASA, so GRACE strictly dominates.
Baseline → (8.06, 252.5): GRACE strictly dominates.

if r_ref <= rmse and n_ref <= nasa and (r_ref < rmse or n_ref < nasa):
Reference dominates new model: r_ref no worse on RMSE, n_ref no worse on NASA, strictly better on at least one.

dominators.append(name)
Add to dominator list.

if rmse <= r_ref and nasa <= n_ref and (rmse < r_ref or nasa < n_ref):
New model dominates reference. Symmetric clause.

dominated.append(name)
Add to dominated list.

on_front = len(dominators) == 0
If NO reference dominates the new model, it is on the Pareto front of the {paper methods + new model} set. This is the publishable claim.

return {
Pack the result for downstream logging or printing.

"rmse": rmse, "nasa": nasa,
The two raw metrics.

"on_front_vs_paper": on_front,
Boolean: is the new model on the Pareto front of {paper methods + new model}? The answer to the publishable claim.

"dominators": dominators,
Names of paper methods that beat the new one on both axes.

"dominated_paper_methods": dominated,
Names of paper methods the new one beats on both axes. The marketing line ‘our method strictly dominates X, Y, Z’ applies only when this list is non-empty AND on_front is True.

}
Close the dict.

if __name__ == "__main__":
Standard guard. Lets the file be imported without running the demo.

pass
Stub.

# In production:
Pseudo-code marker.

# model = load_checkpoint("...")
Load a saved model.

# loader = make_test_loader("FD002")
Build the test loader for one dataset.

# result = evaluate_and_place(model, loader, torch.device("cuda"))
Call the helper.

# print(result)
Print the result. Example output:

{
  'rmse': 7.92,
  'nasa': 232.7,
  'on_front_vs_paper': True,
  'dominators': [],
  'dominated_paper_methods': ['Uncertainty', 'Baseline']
}

"""Compute RMSE and NASA on a real test set, then place the result.

Reuses the paper's Evaluator from grace/training/evaluator.py.
The evaluator returns BOTH metrics in a single dict; what we add
here is the Pareto-positioning logic that turns a single (rmse, nasa)
result into a recommendation.
"""

import torch
from torch.utils.data import DataLoader
from grace.training.evaluator import Evaluator
from grace.training.seed_utils import set_seed


# Reference points from paper Table I (multi-condition mean)
REFERENCE = {
    "AMNL":        (7.45, 446.7),
    "GABA":        (7.89, 235.7),
    "GRACE":       (7.92, 232.7),
    "Uncertainty": (7.98, 233.8),
    "Baseline":    (8.06, 252.5),
}


def evaluate_and_place(model, test_loader: DataLoader,
                       device: torch.device, seed: int = 42):
    """Run the model and report its RMSE+NASA position vs the reference."""
    set_seed(seed)
    evaluator = Evaluator(device=device, max_rul=125)

    metrics = evaluator.evaluate(model, test_loader)
    rmse  = metrics["rmse_last"]
    nasa  = metrics["nasa_score"]

    # Compare to references via Pareto dominance
    dominators = []
    dominated  = []
    for name, (r_ref, n_ref) in REFERENCE.items():
        if r_ref <= rmse and n_ref <= nasa and (r_ref < rmse or n_ref < nasa):
            dominators.append(name)
        if rmse <= r_ref and nasa <= n_ref and (rmse < r_ref or nasa < n_ref):
            dominated.append(name)

    on_front = len(dominators) == 0
    return {
        "rmse": rmse, "nasa": nasa,
        "on_front_vs_paper": on_front,
        "dominators": dominators,
        "dominated_paper_methods": dominated,
    }


# ---------- Stub usage ----------
if __name__ == "__main__":
    pass
    # In production:
    #   model = load_checkpoint("...")
    #   loader = make_test_loader("FD002")
    #   result = evaluate_and_place(model, loader, torch.device("cuda"))
    #   print(result)
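The positioning logic in the script does not need a model or the grace package to be exercised. The sketch below lifts the comparison loop into a pure function so it can be checked on plain numbers. The name place_on_front and the candidate point (7.95, 240.0) are hypothetical, invented for illustration only; the reference values are the ones in the script.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (rmse, nasa), lower is better on both axes

REFERENCE: Dict[str, Point] = {
    "AMNL": (7.45, 446.7),
    "GABA": (7.89, 235.7),
    "GRACE": (7.92, 232.7),
    "Uncertainty": (7.98, 233.8),
    "Baseline": (8.06, 252.5),
}


def place_on_front(rmse: float, nasa: float,
                   reference: Dict[str, Point]) -> Dict[str, object]:
    """Compare one (rmse, nasa) result against named reference points."""
    dominators: List[str] = []  # references that dominate the candidate
    dominated: List[str] = []   # references the candidate dominates
    for name, (r_ref, n_ref) in reference.items():
        if r_ref <= rmse and n_ref <= nasa and (r_ref < rmse or n_ref < nasa):
            dominators.append(name)
        if rmse <= r_ref and nasa <= n_ref and (rmse < r_ref or nasa < n_ref):
            dominated.append(name)
    return {
        "on_front_vs_paper": not dominators,
        "dominators": dominators,
        "dominated_paper_methods": dominated,
    }


# Hypothetical candidate: slightly worse than GABA and GRACE on both axes.
result = place_on_front(7.95, 240.0, REFERENCE)
print(result)
```

Feeding GRACE's own numbers (7.92, 232.7) into the same function gives an empty dominators list and ['Uncertainty', 'Baseline'] as the dominated methods, which matches the example output in the notes above.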

Comparing on a different test split is meaningless. The reference numbers in the script are 5-seed means on the official C-MAPSS test sets. If you evaluate a new model on a held-out validation split, on a re-shuffled test set, or on N-CMAPSS, the Pareto comparison is not valid: the new method and the reference methods would be scored on different data. The same applies to the stub above: the REFERENCE values are FD002 + FD004 means, so a single-dataset result has to be averaged over both test sets before it is placed. The paper's rule: evaluate on the published test split; the method changes, the data does not.

Pareto Tradeoffs In Other Domains

Domain	Axis 1 (typical accuracy)	Axis 2 (operational)	Pareto winner pattern
Object detection	mAP @ IoU=0.5	Inference latency (ms/frame)	Family of detectors (YOLO, EfficientDet, RT-DETR) trace a clean Pareto curve; pick by FPS budget.
LLM serving	MMLU / HumanEval score	Tokens/sec at fixed cost	Larger models dominate on quality, smaller dominate on speed; quantisation moves a model along the frontier.
Medical imaging	Dice / sensitivity	False-positive rate	Sensitivity-specificity Pareto via threshold sweep — every classifier has its own front.
Energy storage controllers	State-of-charge accuracy	Cell aging cost (calendar + cycle)	Aggressive controllers dominate on accuracy, gentle on aging — operator picks based on lifecycle economics.
Recommender systems	CTR / conversion rate	Catalog diversity / fairness	Greedy click-maximisers dominate on CTR and lose diversity; calibrated rankers Pareto-improve along both.

In every row the ‘publish your Pareto front, not your winner’ norm is becoming standard. Single-number comparisons (mAP, MMLU, Dice) hide deployment trade-offs the practitioner needs to see. GRACE's contribution to the C-MAPSS conversation is the same: shift the literature from ‘best RMSE wins’ to ‘here is the Pareto picture, here is where we sit, here is what you give up if you choose differently’.

Pitfalls When Reading A Pareto Plot

Pitfall 1: confusing ‘on the front’ with ‘the winner’

AMNL is on the front because no method has a lower RMSE; that does NOT make AMNL a good general-purpose method. Its NASA score (446.7) is nearly twice GRACE's (232.7). Being on the front means ‘optimal for SOME preference’, not ‘best overall’.

Pitfall 2: averaging across datasets before Pareto-ising

The multi-condition mean has 3 methods on the front; FD004 alone has 1 (GradNorm), FD002 alone has 4. Per-dataset and averaged fronts can disagree dramatically. Always report per-dataset Pareto pictures alongside the averaged one — the averaging hides reversals.
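The reversal is visible in two pairs of numbers already quoted in this section. A small sketch, with an illustrative dominates helper: GradNorm dominates GRACE on FD004, yet GRACE dominates GradNorm on the multi-condition mean.

```python
def dominates(q, p):
    """q dominates p: no worse on both axes, strictly better on at least one."""
    return (q[0] <= p[0] and q[1] <= p[1]) and (q[0] < p[0] or q[1] < p[1])


# (RMSE, NASA) pairs quoted in this section.
fd004 = {"GRACE": (8.12, 242.0), "GradNorm": (7.74, 222.9)}
multi = {"GRACE": (7.92, 232.7), "GradNorm": (7.96, 241.9)}

# On FD004 alone GradNorm is ahead on both axes.
print(dominates(fd004["GradNorm"], fd004["GRACE"]))
# After averaging with FD002 the relation flips.
print(dominates(multi["GRACE"], multi["GradNorm"]))
```

Both checks come out True, so a reader who only sees the averaged plot would never learn that GradNorm is the better choice on FD004.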

Pitfall 3: ignoring seed variance

On FD002 GRACE's NASA = 223.4 ± 26.5 (one-sigma) and Uncertainty's NASA = 224.4 ± 35.3. The two error bars overlap; the ‘GRACE strictly dominates Uncertainty’ claim is a POINT-ESTIMATE claim, not a statistical one. Always report SEM bars on Pareto plots; the paper's Fig. 3 does.
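A rough check of how small the FD002 gap is relative to seed noise, using the one-sigma values quoted above and n=5 seeds. It assumes the two methods' seeds are independent runs, which is an assumption of this sketch and not something the text states; a paired analysis could give a different answer.

```python
import math

N_SEEDS = 5  # seeds per dataset, as stated in this section

# FD002 NASA mean and one-sigma spread quoted above.
grace_mean, grace_std = 223.4, 26.5
unc_mean, unc_std = 224.4, 35.3


def sem(std, n):
    """Standard error of the mean for n independent runs."""
    return std / math.sqrt(n)


gap = unc_mean - grace_mean
# Standard error of the difference of two independent means.
se_diff = math.sqrt(sem(grace_std, N_SEEDS) ** 2 + sem(unc_std, N_SEEDS) ** 2)

print(f"gap = {gap:.1f} NASA, SE of difference = {se_diff:.1f}")
print(f"gap / SE = {gap / se_diff:.2f}")
```

The gap is a small fraction of one standard error, which is the quantitative version of ‘the error bars overlap’.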

Pitfall 4: choosing λ post-hoc

Picking the preference weight $\lambda$ AFTER seeing the results — e.g. ‘ $\lambda = 0.3$ happens to make my method win’ — is post-hoc cherry-picking. The $\lambda$ must come from the deployment context (operational cost ratios, regulatory requirements) not from a sweep that targets a desired winner.

Pitfall 5: forgetting that lower is better on both axes here

Some Pareto plots in the literature put accuracy on the x-axis (higher better) and cost on the y-axis (lower better). The frontier then sits in the lower-right. The C-MAPSS Pareto plot has both axes in ‘lower better’ orientation; the frontier sits in the lower-left. Read the axis legends before interpreting; mirroring an axis silently inverts the front.
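One way to guard against a misread axis is to carry the orientation with the data and convert everything to lower-is-better before testing dominance. A minimal sketch; the two detectors and their (accuracy, latency) values are hypothetical, invented for illustration only.

```python
def dominates(q, p, higher_is_better):
    """Pareto dominance when each axis carries its own orientation flag."""
    # Negate higher-is-better axes so that lower is better everywhere.
    q_min = [-v if up else v for v, up in zip(q, higher_is_better)]
    p_min = [-v if up else v for v, up in zip(p, higher_is_better)]
    no_worse = all(a <= b for a, b in zip(q_min, p_min))
    strictly_better = any(a < b for a, b in zip(q_min, p_min))
    return no_worse and strictly_better


# Hypothetical detectors as (accuracy, latency_ms).
slow_accurate = (0.50, 20.0)
fast_rough = (0.45, 15.0)

correct = (True, False)   # accuracy: higher is better; latency: lower is better
misread = (False, False)  # wrongly treating both axes as lower-is-better

# With the right orientation this is a genuine trade-off: no dominance.
print(dominates(fast_rough, slow_accurate, correct))
# With the misread orientation the faster model appears to dominate.
print(dominates(fast_rough, slow_accurate, misread))
```

The second call returns True only because the accuracy axis was read backwards, which is the silent inversion the pitfall warns about.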

Takeaway

Pareto dominance is precise: $q \succ p$ iff $q$ is no worse on either axis AND strictly better on at least one. The Pareto front is the set of methods that no other method dominates.
On multi-condition C-MAPSS the front has 3 methods: AMNL (accuracy corner), GABA (middle), GRACE (safety corner). On FD002 alone the front widens to 4 (Baseline added). On FD004 alone it collapses to 1 (GradNorm dominates everything).
GRACE strictly dominates 6 of the 8 other methods on multi-condition mean: Baseline, DWA, GradNorm, Uncertainty, PCGrad, CAGrad. It does NOT dominate AMNL or GABA — those have lower RMSE.
Picking a deployment point uses a preference weight $\lambda \in [0, 1]$ : GRACE for $\lambda \leq 0.4$ (safety-first), GABA for the middle, AMNL for $\lambda \geq 0.85$ (accuracy-first).
The Pareto-front norm is becoming standard across deep learning: object detection, LLM serving, medical imaging, energy, recommenders. Always publish the front, not the winner.