Surgeons run a pre-op checklist. Pilots run a pre-flight checklist. ML practitioners increasingly run a pre-training checklist: which method, on this data, with this deployment cost? The answer for an MTL prognostics task is one of GRACE, GABA, AMNL, GradNorm, or single-task. Six questions decide it. The diagnostics that answer them cost ten seconds of GPU time and save you the wrong 30-minute training run.
This section synthesises chapters 21–23 into a runnable recipe. The decision tree, the diagnostic Python script, and the production PyTorch helper are the same artefact viewed at three levels of abstraction. By the end of the section you should be able to look at any new dataset and pick the right method before training, with empirical evidence behind every choice.
The headline. Two scalar diagnostics, the gradient ratio g_max/g_min and the gradient cosine ρ = cos(∇L_a, ∇L_b) on the shared backbone, plus three deployment-context questions decide between GRACE / GABA / AMNL / Single-task. Compute time: ~10 s for the diagnostics, against the ~30 min lost to a wrong training run.
Anatomy Of The Decision
The choice is governed by three independent factors:
Imbalance (the OUTER axis premise). Is the per-task gradient ratio big enough that adaptive weighting has work to do? Threshold: ratio ≥10. Below that, GABA's correction is small and adds compute without payoff.
Alignment (the OUTER axis benefit). Does the auxiliary task pull the shared backbone toward features the primary task wants? Threshold: cosine ρ≥+0.05 mildly, ≥+0.20 strongly. Negative cosine means the auxiliary fights the primary — turn off MTL.
Asymmetry (the INNER axis premise). Is the deployment cost asymmetric (NASA-like, false-negative-heavy, late-prediction-heavy)? Yes → failure-biased MSE pays. No → standard MSE.
The 2×2 product of {outer pays, outer doesn't} × {inner pays, inner doesn't} maps to GRACE / GABA / AMNL / Baseline (the chapter 21 §1 grid). The decision wizard below walks the questions in dependency order and produces the recommendation in 6 clicks or fewer.
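As a minimal sketch of that 2×2 grid, the mapping can be written as a lookup keyed on the two axis verdicts. The cell assignment here is an assumption read off the order in which the four methods are listed, and `grid_recommendation` is a hypothetical name, not part of the paper's code.

```python
def grid_recommendation(outer_pays: bool, inner_pays: bool) -> str:
    """Map the two axis verdicts to a method name (assumed cell layout)."""
    grid = {
        (True, True): "GRACE",      # adaptive weighting + failure-biased loss
        (True, False): "GABA",      # adaptive weighting only
        (False, True): "AMNL",      # failure-biased loss with fixed weights
        (False, False): "Baseline", # neither axis earns its cost
    }
    return grid[(outer_pays, inner_pays)]


if __name__ == "__main__":
    for outer in (True, False):
        for inner in (True, False):
            print(f"outer={outer!s:5} inner={inner!s:5} -> "
                  f"{grid_recommendation(outer, inner)}")
```

Keeping the grid as data rather than nested if-statements makes it easy to compare against the chapter 21 §1 figure cell by cell.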
Interactive: The Decision Wizard
Answer six questions about your data and deployment context. Each click narrows the recommendation; the final verdict cites the chapter that empirically justifies it. Click ↻ restart to try alternative paths.
Try the canonical paths. ‘Yes → 100× → +0.31 → Yes failure region → Yes asymmetric → continuous regimes’ gives GRACE (FD002 / DS02 path). ‘Yes → 100× → −0.04 (orthogonal) → AMNL’ is the FD003 path. ‘No (single-task) → SingleTask’ is the baseline path.
The Full Decision Table
| Diagnostic profile | Deployment context | Recommendation | Justifying section |
| --- | --- | --- | --- |
| ratio ≥ 100, cos ≥ 0.20, large failure subgroup | Asymmetric cost (NASA / safety / late-pred) | GRACE | Chapters 21 §1, 23 §1, 23 §3 |
| ratio ≥ 100, cos ≥ 0.20, large failure subgroup | Symmetric cost (plain RMSE) | GABA | Chapters 18-19 (full GABA) |
| ratio ≥ 100, cos ∈ [0.05, 0.20) | Either cost regime | GABA (consider GRACE if asymmetric AND failure-rich) | Chapter 21 §3 cosine threshold |
| ratio ≥ 100, −0.05 < cos < 0.05 (orthogonal) | Asymmetric cost | AMNL | Chapter 21 §3 FD003 anomaly |
| ratio ≥ 100, cos ≤ −0.05 (anti-aligned) | Either cost regime | Single-task (separate models per task) | Chapter 21 §3 FD003 mechanism |
| ratio < 10 (already balanced) | Either cost regime | AMNL (fixed weights are fine) | Chapter 22 §2 GABA prerequisite |
| FD004-style: continuous regimes + multi-fault + cos ≥ 0.20 | Asymmetric cost | GradNorm (chapter 23 §1 winner on FD004) | Chapter 23 §1 FD004 row |
| No auxiliary task available | Either cost regime | Single-task with cost-matched per-sample loss | Chapter 22 §2 single-task baseline |
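One way to read the decision table is as a single function of the two diagnostics plus the deployment flags. The sketch below is an interpretation, not the wizard's actual code: `recommend` and its parameter names are hypothetical, any ratio ≥ 10 is treated as imbalanced (the table rows say ≥ 100), the boundary at exactly −0.05 follows the diagnostic script's `cosine < -0.05` test, and the "large failure subgroup" condition is left out.

```python
def recommend(ratio, cosine, asymmetric_cost,
              has_auxiliary=True, fd004_style=False):
    """Return a method name from the diagnostics and deployment context.

    Assumptions (not from the source): ratio >= 10 counts as imbalanced,
    and the failure-subgroup size is not checked.
    """
    if not has_auxiliary:
        return "Single-task"          # no auxiliary task to share with
    if ratio < 10:
        return "AMNL"                 # already balanced, fixed weights suffice
    if cosine < -0.05:
        return "Single-task"          # anti-aligned: auxiliary fights primary
    if cosine < 0.05:
        return "AMNL"                 # orthogonal: auxiliary does not help
    if cosine < 0.20:
        return "GABA"                 # mildly aligned; GRACE is optional
    if fd004_style and asymmetric_cost:
        return "GradNorm"             # FD004-style row of the table
    return "GRACE" if asymmetric_cost else "GABA"


if __name__ == "__main__":
    print(recommend(503.0, 0.808, asymmetric_cost=True))   # strongly aligned
    print(recommend(100.0, -0.04, asymmetric_cost=True))   # FD003-like path
```

Ordering the checks from "MTL is off the table" down to "which MTL variant" mirrors the dependency order the wizard walks through.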
Cheap Diagnostics To Run Before Training
The two scalar diagnostics — gradient ratio and gradient cosine — are computable in 10 seconds on a single GPU, without committing to a full training run. The recipe:
Build a single-task baseline model. The DualTaskModel architecture suffices. No training needed for the cosine; for the ratio, a few epochs of warmup help stabilise the magnitudes.
Run 50 mini-batches. For each batch b, compute g_rul = ‖∇L_rul‖₂ and g_health = ‖∇L_health‖₂ on the shared backbone, then ρ_b = cos(∇L_rul, ∇L_health).
Average. Report r̄ = mean_b(g_max,b / g_min,b) and ρ̄ = mean_b(ρ_b).
Apply the thresholds. Chapter 22 §3 §5 showed the SEM at n=50 batches drops to ~0.02 for cosine and ~10% for ratio, tight enough that the threshold rules are stable.
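The averaging step and the SEM figure quoted in the recipe can be illustrated with a few lines of standard-library Python. This is a sketch on made-up numbers: the per-batch cosines are simulated with an assumed mean of 0.30 and an assumed per-batch standard deviation of 0.14, chosen only so that the SEM at n=50 lands near 0.02.

```python
import math
import random
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of floats."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, sem


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-batch cosines; real values come from the model.
    cosines = [rng.gauss(0.30, 0.14) for _ in range(50)]
    mean, sem = mean_and_sem(cosines)
    print(f"mean cosine = {mean:+.3f}, SEM = {sem:.3f}")
```

Because the SEM shrinks only as 1/√n, halving it requires four times as many batches, which is why a fixed budget of 50 is a reasonable stopping point once the thresholds are several SEMs away.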
| Threshold | Rule | Cite |
| --- | --- | --- |
| ratio < 10 | GABA correction is small. Use AMNL. | Chapter 22 §2 |
| cos < −0.05 | Auxiliary task fights primary. Use single-task models. | Chapter 21 §3 |
| −0.05 ≤ cos < +0.05 | Auxiliary doesn't move the backbone usefully. Use AMNL. | Chapter 21 §3 |
| +0.05 ≤ cos < +0.20 | Mildly aligned. GABA pays; GRACE pays if cost is asymmetric. | Chapter 21 §3 |
| cos ≥ +0.20 | Strongly aligned. GRACE pays for the full composition. | Chapters 21 §1, 23 §3 |
Python: A Self-Contained Diagnostic Script
Compute both diagnostics on synthetic gradient vectors, then run the threshold rule. No PyTorch dependency — the same algebra you would do on a real model, just on toy ndarrays so the values are visible.
Gradient ratio + cosine + threshold rule
🐍grace_diagnostics_demo.py
Line 1: Module docstring (what this script does)
Triple-quoted string at the top of a module is the module docstring. Python stores it in __doc__. It announces the contract: two cheap pre-training diagnostics (gradient ratio + gradient cosine) decide between GRACE / GABA / AMNL / Single-task BEFORE committing to a 30-minute training run.
EXECUTION STATE
Output of running this script =
||g_rul|| (mean over 50 batches) = 53370.78
||g_health|| (mean over 50 batches) = 106.1058
gradient ratio (max/min) = 503.0x
gradient cosine on shared backbone = +0.808
→ cos >= 0.20: GRACE — both axes aligned, full composition wins
Line 8: import numpy as np
NumPy is Python's numerical-array library. It provides the ndarray (N-dimensional array) type plus optimised C implementations of every operator we use below: element-wise arithmetic, broadcasting, reductions, linear algebra. Aliased as `np` by universal convention.
EXECUTION STATE
📚 numpy = Library for fast N-dimensional arrays. Built on contiguous memory + SIMD; ~50–100× faster than pure Python loops for vector math.
Line 11: def compute_grad_norm(grad_lists) → float
Computes the L2 norm of a flat concatenation of per-parameter gradient arrays. Same algebra as the paper's compute_task_grad_norm (chapter 18 §1) but in pure NumPy. We never materialise the concatenation; instead we sum squares per parameter, which is mathematically identical and saves memory.
EXECUTION STATE
⬇ input: grad_lists = List of ndarrays, one per parameter. Each ndarray has the SAME shape as its underlying parameter tensor. Example: [(64, 14), (128, 64)] for a small CNN-BiLSTM slice.
→ grad_lists purpose = Holds ∂L/∂θ for every parameter θ. We treat the whole list as one long vector to compute its L2 norm — that is the single scalar ‘how big is this gradient?’
⬆ returns = Python float: the scalar L2 norm √(Σᵢ gᵢ²) over all gradient elements gᵢ. For our toy demo this is ~106.5 before scaling, ~53,000 after × 500.
Line 12: Function docstring
Documents the function's contract: it accepts a list of per-parameter gradient ndarrays and treats them as one long vector. Important for readers because the function does NOT explicitly concatenate — it relies on the algebraic identity ‖[a; b]‖² = ‖a‖² + ‖b‖².
Line 17: sq = 0.0
Initialise the running sum-of-squares accumulator. Float (not int) so the += on line 19 keeps float semantics.
EXECUTION STATE
sq = 0.0 (accumulator). After the loop it holds Σᵢ gᵢ² across all gradient elements.
Line 18: for g in grad_lists:
Walk every per-parameter ndarray. The loop body adds that parameter's contribution to the squared norm. With grad_lists = [(64,14), (128,64)] the loop runs exactly twice.
LOOP TRACE · 2 iterations
iter 1: g.shape = (64, 14)
g first 4 values = [-0.452, -0.422, 0.663, 0.015]
(g**2).sum() = ≈ 1065.50
sq after = ≈ 1065.50
iter 2: g.shape = (128, 64)
(g**2).sum() = ≈ 10289.48
sq after = ≈ 11354.98
Line 19: sq += float((g ** 2).sum())
Add this parameter's contribution to the running sum-of-squares. Three operations chained: element-wise square, sum reduction, cast to plain float.
EXECUTION STATE
📚 g ** 2 = ndarray element-wise power. (64,14) → (64,14) where every element is squared. Vectorised in C — no Python loop.
📚 .sum() = ndarray reduction. Adds every element of the array and returns a NumPy scalar (np.float64 here), not a plain Python float.
📚 float(scalar) = Cast the NumPy scalar → plain Python float. This keeps sq a plain Python float instead of letting += promote it to np.float64, which can cause subtle type promotions in long pipelines.
→ first iter result = (g_rul[0] ** 2).sum() = 1065.50 (sum of 64×14 = 896 squared values)
→ second iter result = (g_rul[1] ** 2).sum() = 10289.48 (sum of 128×64 = 8192 squared values)
Line 20: return float(np.sqrt(sq))
L2 norm = √(Σᵢ gᵢ²). For the demo: √11354.98 ≈ 106.56. We cast back to a Python float for downstream comparison and printing.
EXECUTION STATE
📚 np.sqrt(x) = Element-wise square root. On a Python float it returns a NumPy scalar (np.float64), hence the float(...) cast. Example: np.sqrt(4.0) = 2.0.
⬆ return value = ≈ 106.56 — the L2 norm of the unscaled toy gradient.
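The identity ‖[a; b]‖² = ‖a‖² + ‖b‖² that compute_grad_norm relies on is easy to check numerically. The sketch below uses freshly drawn toy arrays with the demo's shapes (the values are not the ones in the trace) and compares the per-parameter route with an explicit concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(0, 1, (64, 14)), rng.normal(0, 1, (128, 64))]

# Route 1: accumulate per-parameter sums of squares (no concatenation).
per_param = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))

# Route 2: materialise one long vector, then take its L2 norm.
flat = np.concatenate([g.reshape(-1) for g in grads])
concatenated = float(np.linalg.norm(flat))

print(f"per-parameter route = {per_param:.6f}")
print(f"concatenated route  = {concatenated:.6f}")
print("match:", bool(np.isclose(per_param, concatenated)))
```

Route 1 never allocates the 9088-element buffer, which matters little here but adds up when the shared backbone has millions of parameters.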
Line 23: def grad_ratio(g_a_per_batch, g_b_per_batch) → float
Computes the mean per-batch gradient-norm ratio max(g_a, g_b) / min(g_a, g_b). This is the OUTER-axis premise diagnostic: GABA only earns its compute when this ratio is ≥ 10. Below 10, fixed 50/50 weights work just as well.
EXECUTION STATE
⬇ input: g_a_per_batch = List/array of float gradient norms for task A (e.g. RUL) collected across N batches. For our demo: 50 floats around 53,000.
⬇ input: g_b_per_batch = Same for task B (e.g. health). 50 floats around 106.
Line 24: Function docstring
Documents the rule: ratio < 10 means GABA's adaptive correction is too small to pay back the per-step cost; AMNL fixed weights are sufficient.
Line 29: a = np.array(g_a_per_batch)
Convert the Python list to an ndarray so np.maximum / np.minimum vectorise over all batches at once.
EXECUTION STATE
📚 np.array(list) = Construct ndarray from a Python sequence. Infers dtype from contents (float64 here).
a = ndarray shape (50,) — one float per batch. Mean ≈ 53,370.78.
Line 30: b = np.array(g_b_per_batch)
Same conversion for task B. Result is shape (50,) with mean ≈ 106.11.
EXECUTION STATE
b = ndarray shape (50,) — health gradient norms.
Line 31: return float((np.maximum(a, b) / np.maximum(np.minimum(a, b), 1e-12)).mean())
Element-wise max divided by element-wise (max-of-min-and-1e-12), then mean across the 50 batches. The 1e-12 floor on the denominator prevents division-by-zero on a degenerate batch.
EXECUTION STATE
📚 np.maximum(a, b) = ELEMENT-WISE max. Different from np.max which is a reduction. Example: np.maximum([3, 2], [1, 5]) = [3, 5].
📚 np.minimum(a, b) = ELEMENT-WISE min. Example: np.minimum([3, 2], [1, 5]) = [1, 2].
→ numerator np.maximum(a, b) = [max(a₀, b₀), max(a₁, b₁), …] — for our demo this is just `a` (RUL) since RUL is always larger.
→ denominator np.maximum(np.minimum(a, b), 1e-12) = Element-wise min, then floor at 1e-12. For the demo this is just `b`.
→ why 1e-12 floor? = If any batch has a zero gradient on the smaller task, dividing by zero would emit NaN/inf and corrupt the mean.
⬆ return: ratio = ≈ 503.0 — almost exactly the 500× scaling we applied. Far above the 10 threshold → GABA pays.
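To see what the element-wise max/min and the 1e-12 floor do in practice, here is a small sketch that reuses the same expression on hand-picked norms. The `floor` keyword is an addition for illustration; the original function hard-codes 1e-12.

```python
import numpy as np


def grad_ratio(g_a_per_batch, g_b_per_batch, floor=1e-12):
    """Mean per-batch max/min ratio with a floored denominator."""
    a = np.asarray(g_a_per_batch, dtype=float)
    b = np.asarray(g_b_per_batch, dtype=float)
    return float((np.maximum(a, b) / np.maximum(np.minimum(a, b), floor)).mean())


# Two ordinary batches: ratios are 3/1 and 5/2, so the mean is 2.75.
balanced = grad_ratio([3.0, 2.0], [1.0, 5.0])

# First batch has a zero norm on one task: the floor keeps the result finite.
degenerate = grad_ratio([3.0, 2.0], [0.0, 5.0])

print(f"balanced   = {balanced}")
print(f"degenerate = {degenerate:.3e}")
```

Note that the floor avoids inf/NaN but a single degenerate batch still dominates the mean, so it may be worth inspecting per-batch ratios (or using a median) if the reported value looks implausibly large.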
Line 34: def grad_cosine(grad_a, grad_b) → float
Cosine similarity between two flattened gradient vectors. This is the chapter-21 §3 alignment diagnostic — predicts whether the OUTER axis (GABA) helps, hurts, or wastes effort. Scale-invariant: multiplying both gradients by any positive constant gives the same cosine.
EXECUTION STATE
⬇ input: grad_a = List of per-parameter ndarrays for task A. SAME shapes as grad_b (the parameters are shared backbone).
⬇ input: grad_b = List of per-parameter ndarrays for task B.
Line 35: Function docstring
Encodes the threshold rules right next to the function so a reader who clicks the def doesn't need to leave the file. Mirrors the table in the §diagnostics section above.
Line 41: flat_a = np.concatenate([g.reshape(-1) for g in grad_a])
Flatten every per-parameter ndarray to 1-D, then concatenate end-to-end into one long vector. For our demo: (64,14) + (128,64) → 896 + 8192 = 9088-dim vector.
EXECUTION STATE
📚 .reshape(-1) = Flatten ndarray to 1-D. The -1 sentinel tells NumPy ‘you figure out the length.’ (64, 14) → (896,). No data copy when memory is contiguous.
📚 np.concatenate([a, b, …]) = Stack 1-D arrays end-to-end into one vector. [(896,), (8192,)] → (9088,). Allocates a fresh buffer.
→ list comprehension = [g.reshape(-1) for g in grad_a] yields one flat ndarray per parameter, in the SAME ORDER as grad_a. Order matters — flat_a and flat_b must align element-by-element.
→ flat_a result = ndarray shape (9088,) — the whole task-A gradient as one vector.
Line 42: flat_b = np.concatenate([g.reshape(-1) for g in grad_b])
Identical operation for task B. CRITICAL: must use the same parameter ordering as flat_a so that flat_a[i] and flat_b[i] correspond to the SAME parameter element.
EXECUTION STATE
flat_b = ndarray shape (9088,) — the task-B gradient. Aligned element-wise with flat_a.
Line 43: num = float((flat_a * flat_b).sum())
Compute the dot product flat_a · flat_b. Element-wise multiply (still a 9088-vector) then sum-reduce to a scalar.
EXECUTION STATE
📚 a * b (ndarray) = Element-wise multiply. NOT matrix multiply (that's @). Returns same shape as inputs.
→ num value (demo) = Approximately ⟨shared, shared⟩ ≈ 9088 (since shared has variance 1 across 9088 elements) before scaling. After × 500 it scales linearly.
Line 44: den = float(np.linalg.norm(flat_a) * np.linalg.norm(flat_b))
Compute the product of the two L2 norms. This is the cosine denominator: cos(u, v) = ⟨u, v⟩ / (‖u‖ · ‖v‖).
EXECUTION STATE
📚 np.linalg.norm(v) = By default the L2 norm of a vector: √(Σᵢ vᵢ²). Equivalent to np.sqrt((v**2).sum()).
→ ‖flat_a‖ = ≈ 53,280 for the demo (500 × 106.56, the norm of g_rul_scaled).
→ ‖flat_b‖ = ≈ 106.11 for the demo.
→ den = ≈ 53,280 × 106.11 ≈ 5,653,500, the product.
Line 45: return num / (den + 1e-12)
Final cosine = numerator / denominator. The 1e-12 epsilon prevents NaN when both gradient vectors are exactly zero (rare but possible for the very first batch on a freshly-initialised model with zero-init heads).
EXECUTION STATE
→ cosine value (demo) = +0.808 — strongly aligned. Both gradients share the `shared` component; noise pulls them apart only weakly.
→ scale invariance check = If we multiply flat_a by 500: num scales by 500, den scales by 500, ratio unchanged. This is why the cosine is the alignment diagnostic — it strips out magnitude.
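The scale-invariance claim can be verified directly. The following sketch uses arbitrary toy vectors (not the demo's gradients) and a hypothetical `cosine` helper with the same formula; it also shows that a negative scale factor flips the sign, which is why only positive rescaling leaves the diagnostic unchanged.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two 1-D vectors with a small epsilon guard."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


rng = np.random.default_rng(0)
u = rng.normal(size=1000)
v = u + 0.5 * rng.normal(size=1000)   # correlated with u by construction

base = cosine(u, v)
scaled = cosine(500 * u, v)           # positive rescale of one vector
flipped = cosine(-500 * u, v)         # negative rescale flips the sign

print(f"base    = {base:+.6f}")
print(f"scaled  = {scaled:+.6f}")
print(f"flipped = {flipped:+.6f}")
```

In practice this means the 500× magnitude gap between the RUL and health gradients has no bearing on the alignment verdict; the ratio diagnostic carries that information separately.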
Line 48: # Synthetic example (section divider)
Comment delimiting the start of the demo. Below this line we manufacture two correlated gradient lists and run them through the diagnostics so the threshold rule fires on a known, reproducible input.
Line 49: rng = np.random.default_rng(0)
Construct a Generator with seed=0. Pinning the seed makes every printed number reproducible — a reader can run this script and see the exact same 503.0 / +0.808 / GRACE output.
EXECUTION STATE
📚 np.random.default_rng(seed) = Modern NumPy RNG (recommended over legacy np.random). Returns a Generator object with .normal(), .uniform(), .integers(), etc.
⬇ arg: seed = 0 = Pins the RNG state. Same seed → same numbers across runs. seed=None would use OS entropy and give different numbers each run.
Line 51: # Parameter shapes (section divider)
Comment introducing the toy backbone shapes used below.
Line 52: shapes = [(64, 14), (128, 64)]
Two parameter shapes representing a CNN-BiLSTM backbone slice: (64, 14) is a CNN layer (64 filters × 14 input channels) and (128, 64) is a hidden LSTM weight (128 hidden × 64 input). Total parameters: 64×14 + 128×64 = 896 + 8192 = 9088.
EXECUTION STATE
shapes = [(64, 14), (128, 64)] — two shape tuples. List of tuples, not ndarrays.
→ total params = 9,088 — small enough to keep the demo fast, big enough that the cosine concentration of measure kicks in.
Line 54: # Aligned gradients (section divider)
Comment explaining that the next three lines build two gradients that share a common direction (the `shared` term) plus independent noise.
Line 55: shared = [rng.normal(0, 1, s) for s in shapes]
Build the shared component used by both task gradients. List comprehension: one ndarray per shape, drawn from N(0, 1).
EXECUTION STATE
📚 rng.normal(loc, scale, size) = Draws size-many samples from a Gaussian with mean=loc and std=scale. size can be a tuple → returns ndarray of that shape.
⬇ arg: loc = 0 = Mean of the distribution. Zero-mean is conventional for gradient-like quantities.
⬇ arg: scale = 1 = Standard deviation. Sets variance to 1.0 — produces unit-variance shared component.
⬇ arg: size = s = Shape tuple from the outer comprehension. (64, 14) for the first iteration; (128, 64) for the second.
shared[0] first 4 values = [ 0.126, -0.132, 0.640, 0.105] — concrete N(0,1) draws under seed=0.
→ why shared? = By construction g_rul and g_health both contain this term, so their dot product is dominated by ‖shared‖². That guarantees positive cosine.
Line 56: g_rul = [s + rng.normal(0, 0.7, sh.shape) * 0.7 for s, sh in zip(shared, shared)]
Build the RUL task gradient. For each shared element s, add an independent Gaussian noise of std 0.7 × 0.7 = 0.49. The result is a slightly noisier copy of shared.
EXECUTION STATE
📚 zip(a, b) = Pair iterables element-wise. zip([1, 2], ['a', 'b']) → [(1, 'a'), (2, 'b')]. We pass shared twice so we can read both s (the value) and sh (just to get .shape).
⬇ noise scale = 0.7 * 0.7 = 0.49 = Effective noise std. Small relative to the shared component (std 1.0), so the cosine stays high.
→ ‖g_rul‖ before scaling = ≈ 106.56 — the L2 norm of the 9088-element flat concatenation.
Line 57: g_health = [s + rng.normal(0, 0.7, sh.shape) * 0.7 for s, sh in zip(shared, shared)]
Same construction for the health task — same `shared` core, fresh independent noise. The shared term gives positive cosine; the noise pulls cosine slightly toward zero.
EXECUTION STATE
→ ‖g_health‖ before scaling = ≈ 106.11 — almost identical to g_rul because variance is the same.
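The +0.808 cosine reported for this demo follows from the construction. With a unit-variance shared component and independent noise of standard deviation σ = 0.7 × 0.7 = 0.49 on each task, the expected dot product is about d and each squared norm is about d(1 + σ²), so the cosine should sit near 1/(1 + σ²). The sketch below checks this on a fresh draw; it flattens everything into one 9088-element vector, so its random values differ from the script's.

```python
import numpy as np

noise_std = 0.7 * 0.7                      # effective noise std in the demo
expected = 1.0 / (1.0 + noise_std ** 2)    # large-d approximation

rng = np.random.default_rng(0)
d = 64 * 14 + 128 * 64                     # 9088 backbone elements
shared = rng.normal(0, 1, d)
g_a = shared + rng.normal(0, noise_std, d)
g_b = shared + rng.normal(0, noise_std, d)

empirical = float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))

print(f"expected  1/(1+sigma^2) = {expected:.4f}")
print(f"empirical cosine        = {empirical:.4f}")
```

Raising the noise scale lowers the cosine along the same curve, which gives a quick way to generate toy inputs for the other branches of the threshold rule.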
Line 59: # Per-batch norms (section divider)
Comment introducing the simulation of 50 batches by adding scalar noise to a fixed gradient norm. In real training each batch has a slightly different gradient; here we approximate that with a fixed gradient + scalar perturbation.
Line 60: batches_rul = [compute_grad_norm(g_rul) + rng.normal(0, 5) for _ in range(50)]
Generate 50 simulated batch-level RUL gradient norms by perturbing the same compute_grad_norm(g_rul) with Gaussian noise of std 5.0.
EXECUTION STATE
📚 range(50) = Yields 0, 1, …, 49 — 50 iterations. Idiomatic Python range. We use _ as the loop variable because we don't need the index.
→ compute_grad_norm(g_rul) called once per iteration = Same value every iteration: ~106.56. The variation comes entirely from the rng.normal noise.
batches_rul[0:4] = [≈106.7, ≈101.5, ≈108.3, ≈106.0] — first four simulated batch norms.
Line 61: batches_health = [compute_grad_norm(g_health) + rng.normal(0, 0.05) for _ in range(50)]
Same idea for health, but the noise is 100× smaller (std 0.05 instead of 5.0). Once RUL is scaled by 500× on line 64, the two sets of norms differ by the ~500× ratio that makes GABA worthwhile.
EXECUTION STATE
→ why std=0.05? = Health gradient is 500× smaller than scaled RUL, so we want the noise scale to be proportionally smaller.
batches_health[0:4] = [≈106.13, ≈106.08, ≈106.16, ≈106.10] — health norms barely vary because noise is tiny.
Line 63: # RUL scaling (section divider)
Comment explaining the next two lines. The 500× factor mimics the typical C-MAPSS RUL/health gradient ratio. Without this scaling our toy gradients would have ratio ≈ 1 and we'd hit the AMNL branch instead of GRACE.
Line 64: batches_rul = [b * 500 for b in batches_rul]
Scalar multiply every per-batch RUL norm by 500. List comprehension iterates the existing list and produces a new one. After this line, mean(batches_rul) ≈ 53,370.
EXECUTION STATE
→ why 500? = Empirical C-MAPSS ratio (chapter 18 §1). RUL is regression with a target in [0, 125]; health is classification with cross-entropy bounded by ln(K) — naturally smaller magnitude.
batches_rul mean after scaling = ≈ 53,370.78
Line 65: g_rul_scaled = [g * 500 for g in g_rul]
Scale the per-parameter RUL gradients by 500 to match the per-batch norms. Cosine is scale-invariant — this affects only the magnitude shown, NOT the cosine value (we verify on line 45).
Line 68: # Compute the two diagnostics (section divider)
Comment marking the actual diagnostic computation. The next two lines call grad_ratio and grad_cosine on the simulated data.
Line 69: ratio = grad_ratio(batches_rul, batches_health)
Mean per-batch max/min ratio. With RUL scaled by 500 and health unchanged, the ratio lands almost exactly at 503.0 (slightly above 500 because of the noise on each batch).
EXECUTION STATE
ratio = ≈ 503.0 — far above the 10 threshold. OUTER axis (GABA) earns its compute.
Line 70: cosine = grad_cosine(g_rul_scaled, g_health)
Cosine similarity on the shared backbone. Because both gradients share the `shared` component, cosine is strongly positive.
EXECUTION STATE
cosine = +0.808 — well above the +0.20 threshold. OUTER axis composes constructively → GRACE branch.
Line 72: print(f"||g_rul|| (mean over 50 batches) = {np.mean(batches_rul):.2f}")
Pretty-print the mean RUL gradient norm with 2-decimal precision. f-strings let us embed `np.mean(batches_rul)` and a format spec inline.
EXECUTION STATE
📚 f-string = Python 3.6+ formatted string literal. {expr:format} substitutes expr with the given format spec. :.2f means ‘float, 2 decimal places’.
📚 np.mean(arr) = Mean reduction across all elements (or a specified axis). Equivalent to arr.sum() / arr.size for default axis.
Output = ||g_rul|| (mean over 50 batches) = 53370.78
Line 73: print(f"||g_health|| (mean over 50 batches) = {np.mean(batches_health):.4f}")
Same for health. We use 4 decimals because health norms are O(100) and the per-batch noise is O(0.05) — 4 decimals show the variation.
EXECUTION STATE
Output = ||g_health|| (mean over 50 batches) = 106.1058
Line 74: print(f"gradient ratio (max/min) = {ratio:.1f}x")
Print the imbalance ratio with 1-decimal precision. The trailing `x` makes the unit explicit.
EXECUTION STATE
Output = gradient ratio (max/min) = 503.0x
Line 75: print(f"gradient cosine on shared backbone = {cosine:+.3f}")
Print the cosine. The `+` flag in `+.3f` forces a sign character (`+0.808` instead of just `0.808`) so positive vs. negative is visible at a glance — important when the threshold rule branches on the sign.
EXECUTION STATE
📚 :+.3f format spec = Float, 3 decimal places, always show sign. Example: 0.808 → '+0.808', -0.04 → '-0.040'.
Output = gradient cosine on shared backbone = +0.808
Line 76: print()
Blank line for visual spacing between the diagnostic numbers and the verdict.
Line 78: # Decision rule (section divider)
Comment introducing the threshold cascade. The if/elif chain encodes the same rule table from the §diagnostics section — the source of truth lives in one place.
Line 79: if ratio < 10:
Branch 1 — small gradient ratio. If ratio < 10, GABA's adaptive correction would be tiny; AMNL fixed weights are sufficient. NOT taken in our demo (ratio = 503).
EXECUTION STATE
Branch taken? = No — 503 ≥ 10.
Line 80: print("→ ratio < 10: AMNL (fixed 0.5/0.5) — GABA correction is tiny")
AMNL recommendation when the gradient ratio is small. Skipped here because branch 1 isn't taken.
Line 81: elif cosine < -0.05:
Branch 2 — anti-aligned gradients. If the auxiliary task pulls the backbone AWAY from features the primary needs, MTL hurts. Train two single-task models instead. NOT taken (cosine = +0.808).
EXECUTION STATE
Branch taken? = No — +0.808 ≥ -0.05.
Line 82: print("→ cos < -0.05: SingleTask — auxiliary fights the primary task")
Single-task recommendation when cosine is anti-aligned (FD003 anomaly territory; chapter 21 §3).
Line 83: elif cosine < 0.05:
Branch 3 — orthogonal gradients. The auxiliary task neither helps nor hurts the backbone in a directional sense. AMNL fixed weights with the right inner-loss shape is the safest pick. NOT taken.
EXECUTION STATE
Branch taken? = No — +0.808 ≥ 0.05.
Line 84: print("→ |cos| < 0.05: AMNL — auxiliary doesn't help the backbone")
AMNL recommendation for orthogonal cosine. Different reason than branch 1 (here the auxiliary is useless rather than the controller being unnecessary).
Line 85: elif cosine < 0.20:
Branch 4 — mildly aligned. OUTER axis (GABA) pays; whether to upgrade to GRACE depends on the deployment cost-asymmetry question (Q5 in the wizard). NOT taken (cosine = +0.808 ≥ 0.20).
EXECUTION STATE
Branch taken? = No — +0.808 ≥ 0.20.
Line 86: print("→ 0.05 <= cos < 0.20: GABA — outer axis pays, inner depends on cost asymmetry")
GABA recommendation for mild alignment. The student then asks: is my deployment cost asymmetric? If yes, upgrade to GRACE; if no, stay at GABA.
Line 87: else:
Branch 5 — strongly aligned. Both axes work. Full GRACE composition is the recommendation. TAKEN (cosine = +0.808 ≥ 0.20).
EXECUTION STATE
Branch taken? = Yes — +0.808 ≥ 0.20.
Line 88: print("→ cos >= 0.20: GRACE — both axes aligned, full composition wins")
Final recommendation: GRACE. The full output of the script is shown below.
EXECUTION STATE
Final stdout =
||g_rul|| (mean over 50 batches) = 53370.78
||g_health|| (mean over 50 batches) = 106.1058
gradient ratio (max/min) = 503.0x
gradient cosine on shared backbone = +0.808
→ cos >= 0.20: GRACE — both axes aligned, full composition wins
"""Two cheap pre-training checks: gradient ratio and gradient cosine.

Run on a single-task baseline for ~50 batches per task. Total cost
~10 seconds on a single GPU. The decision wizard uses the two scalar
outputs to recommend GRACE / GABA / AMNL / Single-task.
"""

import numpy as np


def compute_grad_norm(grad_lists):
    """L2 norm of a flat concatenation of per-parameter gradient arrays.

    grad_lists: list of ndarrays, one per parameter — the same shape
    as the parameter tensor. Treats them as one long vector.
    """
    sq = 0.0
    for g in grad_lists:
        sq += float((g ** 2).sum())
    return float(np.sqrt(sq))


def grad_ratio(g_a_per_batch, g_b_per_batch):
    """Mean per-batch gradient ratio (max / min).

    Higher ratios mean GABA's adaptive correction has more work to do.
    Threshold: ratio < 10 → AMNL fixed weights are fine.
    """
    a = np.array(g_a_per_batch)
    b = np.array(g_b_per_batch)
    return float((np.maximum(a, b) / np.maximum(np.minimum(a, b), 1e-12)).mean())


def grad_cosine(grad_a, grad_b):
    """Cosine similarity of two flattened gradient vectors.

    >= +0.20       = strongly aligned (GRACE works well)
    -0.05 .. +0.05 = orthogonal (AMNL or single-task)
    < -0.05        = anti-aligned (GRACE will hurt — see chapter 21 §3 FD003)
    """
    flat_a = np.concatenate([g.reshape(-1) for g in grad_a])
    flat_b = np.concatenate([g.reshape(-1) for g in grad_b])
    num = float((flat_a * flat_b).sum())
    den = float(np.linalg.norm(flat_a) * np.linalg.norm(flat_b))
    return num / (den + 1e-12)


# ---------- Synthetic example: simulate two task gradients ----------
rng = np.random.default_rng(0)

# Simulate two parameter shapes (typical CNN-BiLSTM backbone slice)
shapes = [(64, 14), (128, 64)]

# Aligned task gradients: shared N(0, 1) component + independent noise (std 0.49)
shared = [rng.normal(0, 1, s) for s in shapes]
g_rul = [s + rng.normal(0, 0.7, sh.shape) * 0.7 for s, sh in zip(shared, shared)]
g_health = [s + rng.normal(0, 0.7, sh.shape) * 0.7 for s, sh in zip(shared, shared)]

# Per-batch norms (50 batches simulated as scalar perturbations)
batches_rul = [compute_grad_norm(g_rul) + rng.normal(0, 5) for _ in range(50)]
batches_health = [compute_grad_norm(g_health) + rng.normal(0, 0.05) for _ in range(50)]

# RUL grad is amplified by the typical C-MAPSS scale factor (~500x)
batches_rul = [b * 500 for b in batches_rul]
g_rul_scaled = [g * 500 for g in g_rul]


# ---------- Compute the two diagnostics ----------
ratio = grad_ratio(batches_rul, batches_health)
cosine = grad_cosine(g_rul_scaled, g_health)

print(f"||g_rul|| (mean over 50 batches) = {np.mean(batches_rul):.2f}")
print(f"||g_health|| (mean over 50 batches) = {np.mean(batches_health):.4f}")
print(f"gradient ratio (max/min) = {ratio:.1f}x")
print(f"gradient cosine on shared backbone = {cosine:+.3f}")
print()

# ---------- Decision rule ----------
if ratio < 10:
    print("→ ratio < 10: AMNL (fixed 0.5/0.5) — GABA correction is tiny")
elif cosine < -0.05:
    print("→ cos < -0.05: SingleTask — auxiliary fights the primary task")
elif cosine < 0.05:
    print("→ |cos| < 0.05: AMNL — auxiliary doesn't help the backbone")
elif cosine < 0.20:
    print("→ 0.05 <= cos < 0.20: GABA — outer axis pays, inner depends on cost asymmetry")
else:
    print("→ cos >= 0.20: GRACE — both axes aligned, full composition wins")
The cosine is scale-invariant. Multiplying the RUL gradient by 500 (lines 64–65) does not change the cosine, only the ratio. This is what makes the cosine such a clean diagnostic: per-task gradient magnitudes can vary by orders of magnitude across datasets, but the cosine captures the directional alignment regardless.
PyTorch: Production Diagnostic With Paper Helpers
The same logic, run on a real DualTaskModel and a real DataLoader. Imports the paper's compute_task_grad_norm and compute_gradient_cosine helpers from grace/core/gradient_utils.py. Returns a result dict the decision wizard above can directly consume.
Production diagnostic with paper helpers
🐍grace_diagnose.py
Line 1: Module docstring (production diagnostic contract)
Module docstring. Announces the contract: a function that takes a real DualTaskModel + DataLoader and returns the GRACE/GABA/AMNL/SingleTask verdict in ~10 seconds of GPU time. Calls into the paper's gradient_utils helpers from chapter 18 §1.
PyTorch core. Provides torch.Tensor, torch.device, torch.autograd, etc. Everything below is built on top of this — tensors live on GPU, autograd builds the dynamic computation graph, the optimiser updates parameters.
Submodule containing all neural-network building blocks (Module, Linear, Conv1d, LSTM, …). We use it ONLY for the type annotation `nn.Module` on the model parameter.
EXECUTION STATE
📚 torch.nn = Module subsystem. nn.Module is the base class for every learnable component. It provides .parameters(), .train(), .eval(), .state_dict(), etc.
Line 13: import torch.nn.functional as F
Functional API — stateless versions of nn modules. We use F.cross_entropy on line 44 for the health loss. Aliased as `F` by convention.
DataLoader is PyTorch's batching iterator. We use it ONLY for the type annotation on the `loader` parameter.
EXECUTION STATE
📚 DataLoader = Wraps a Dataset, yields batches. Handles shuffling, multi-process workers, pin_memory, drop_last. Iterating it returns a tuple of tensors per batch.
Line 16: from grace.core.gradient_utils import (
Paper code from chapter 18 §1. Multi-line import — we're bringing in two helper functions on the next two lines.
Line 17: compute_task_grad_norm,
First helper. Computes ‖∇L‖₂ on a list of shared parameters using torch.autograd.grad. Crucially uses retain_graph=True so a second call can compute on the SAME forward pass without re-running it.
→ why retain_graph? = Without it the autograd graph is freed after the first .grad call. The second call would crash with ‘Trying to backward through the graph a second time’.
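The retain_graph behaviour can be seen on a toy two-head model. This is a stand-in sketch, not the paper's compute_task_grad_norm: the tiny `backbone` / `rul_head` / `health_head` modules, the random data, and the local `grad_norm` helper are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical miniature of a dual-task model: one shared layer, two heads.
backbone = nn.Linear(8, 4)
rul_head = nn.Linear(4, 1)
health_head = nn.Linear(4, 3)

x = torch.randn(16, 8)
rul_target = torch.randn(16, 1)
health_target = torch.randint(0, 3, (16,))

# One forward pass feeds both losses, so they share the backbone's graph.
features = torch.relu(backbone(x))
loss_rul = F.mse_loss(rul_head(features), rul_target)
loss_health = F.cross_entropy(health_head(features), health_target)

shared = list(backbone.parameters())


def grad_norm(loss, params):
    """L2 norm of d(loss)/d(params); keeps the graph alive for reuse."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.sqrt(sum((g ** 2).sum() for g in grads)).item()


g_rul = grad_norm(loss_rul, shared)
g_health = grad_norm(loss_health, shared)   # works because the graph was retained
print(f"||g_rul|| = {g_rul:.4f}, ||g_health|| = {g_health:.4f}")
```

Using torch.autograd.grad rather than loss.backward() also leaves the parameters' .grad fields untouched, so the diagnostic does not interfere with a subsequent optimiser step.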
Line 18: compute_gradient_cosine,
Second helper. Computes cos(∇L_a, ∇L_b) on a list of shared parameters. Chapter 21 §3's key diagnostic. Internally calls torch.autograd.grad twice, flattens both gradient lists into single vectors, then F.cosine_similarity.
EXECUTION STATE
📚 compute_gradient_cosine(L_a, L_b, shared_params) → float = Source: grace/core/gradient_utils.py:70. Returns Python float in [-1, +1]. Memory ~2× a regular backward; runtime ~10s for 50 batches on a single GPU.
Line 19: )
Close the multi-import parenthesis. Pythonic way to import many names without a long single line.
Failure-biased MSE for the RUL task. Same loss used in production training (chapter 21 §2). Weights samples in the failure region (RUL ≤ 30) higher than samples far from failure.
EXECUTION STATE
📚 moderate_weighted_mse_loss(pred, target) → Tensor = Per-sample MSE × weight, then mean. Weight ramps from 1.0 (RUL > 60) to ~3.0 (RUL ≤ 10). Returns scalar.
Line 23: def diagnose_for_grace(model, loader, …) (signature, part 1)
Diagnostic entry point. Takes a real model + loader and returns the verdict dict the decision wizard above can directly consume. The signature spans two lines; line 24 has the rest of the parameters.
EXECUTION STATE
⬇ input: model: nn.Module = DualTaskModel (or any nn.Module with .get_shared_params() + a forward returning (rul_pred, health_logits)). Type-annotated for IDE help; PyTorch doesn't enforce types at runtime.
→ model purpose = Provides parameters whose gradients we measure. A single-task baseline works fine — we only need both heads to exist so we can compute both losses.
⬇ input: loader: DataLoader = Iterable yielding (seq, rul, health, uid) tuples. Same loader you'd use for actual training; the diagnostic must see the same data distribution.
24 device, n_batches=50): — signature line 2
Continued signature. device controls where tensors live; n_batches sets the diagnostic budget. Default n_batches=50 — chapter 22 §3 §5's SEM analysis shows this is enough for the verdict thresholds to be stable.
EXECUTION STATE
⬇ input: device: torch.device = torch.device('cuda:0') typically. Same as training-time device — diagnostic should reflect production conditions.
⬇ input: n_batches: int = 50 = Diagnostic batch budget. Default 50. Smaller (~20) gives noisier cosine (SEM ~0.04); larger (~100) wastes GPU time without reducing SEM materially.
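The batch-budget trade-off follows from the standard error of the mean, SEM = σ/√n. The sketch below assumes independent batches and a per-batch cosine standard deviation of about 0.18, which is simply the value implied by SEM ≈ 0.04 at n = 20; that σ is an illustrative assumption, not a measured number.

```python
import math

def sem(sigma, n):
    """Standard error of the mean of n independent per-batch values."""
    return sigma / math.sqrt(n)

# Assumed per-batch std, back-computed from SEM 0.04 at n = 20.
sigma = 0.04 * math.sqrt(20)

for n in (20, 50, 100):
    print(n, round(sem(sigma, n), 3))
```

Doubling the budget from 50 to 100 batches shrinks the SEM only by a factor of √2, which is the sense in which a larger budget stops paying.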
25 """Compute gradient ratio + cosine over n_batches.
Docstring (lines 25 to 28). States the contract: compute the two diagnostics across n_batches batches and return a wizard-ready dict.
29 model.train()
Switch the model into training mode. This affects modules with dual behaviour: Dropout activates (zeroes random units); BatchNorm uses batch statistics instead of running stats. The diagnostic must reflect what training will actually see.
EXECUTION STATE
📚 nn.Module.train(mode=True) = Recursively sets self.training = True on every submodule. Returns self (so you can chain calls). Counterpart: .eval() sets it to False.
→ why not .eval()? = If you ran the diagnostic in eval mode, BatchNorm would use frozen running stats and Dropout would be inactive — gradient distribution would not match training.
30 shared = model.get_shared_params()
Return the BACKBONE-only parameter list — excludes rul_head and health_head weights. Chapter 18 §1 explains why head parameters would corrupt the gradient norm: the head sees only its own loss, so its gradient on the ‘other’ loss is zero, biasing the ratio.
EXECUTION STATE
📚 model.get_shared_params() → list[Parameter] = Custom method on DualTaskModel (defined in grace/models/dual_task.py:42). Returns [p for n, p in self.named_parameters() if not (n.startswith('rul_head') or n.startswith('health_head'))].
shared = List of nn.Parameter — typically ~1.7M floats for the C-MAPSS backbone (CNN-BiLSTM with 2 conv layers, 1 LSTM layer, 1 attention head).
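The name-prefix filter behind get_shared_params() can be tried without PyTorch. In this sketch the parameter names are hypothetical and `None` stands in for the tensors; only the filtering logic is the point.

```python
HEAD_PREFIXES = ("rul_head", "health_head")

def shared_param_names(named_params):
    """Keep every parameter whose name does not start with a head prefix."""
    # str.startswith accepts a tuple of prefixes.
    return [name for name, _ in named_params if not name.startswith(HEAD_PREFIXES)]

# Hypothetical (name, parameter) pairs; None stands in for the tensors.
toy_named_params = [
    ("conv1.weight", None),
    ("lstm.weight_ih_l0", None),
    ("rul_head.weight", None),
    ("health_head.bias", None),
]
print(shared_param_names(toy_named_params))
```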
32 ratios = []
Per-batch gradient-ratio accumulator. After the loop, ratios will hold n_batches floats; we average them on line 55.
33 cosines = []
Per-batch cosine accumulator. After the loop, cosines will hold n_batches floats. The mean cosine drives the verdict.
35 for i, (seq, rul_tgt, health_tgt, _) in enumerate(loader):
Iterate the DataLoader with index. Tuple-unpack each batch into (sequence tensor, RUL target, health target, unit_id) — the underscore discards unit_id since we don't need it for the diagnostic.
EXECUTION STATE
📚 enumerate(iter, start=0) = Yields (index, item) pairs. Lets us count iterations cheaply and break after n_batches. Equivalent to: i = -1; for x in iter: i += 1; ….
📚 tuple unpacking (a, b, c, d) = Python destructuring. The DataLoader yields a 4-tuple per batch; we name three components and discard the fourth with _.
→ seq = Tensor shape (B, T, F) — batch of input sequences. Typical B=128, T=30, F=14 for C-MAPSS.
36 if i >= n_batches:
Cap the loop at n_batches iterations. Without this, the loop would consume the entire training set on every diagnostic call.
EXECUTION STATE
→ why >= and not >? = Index i starts at 0. After n_batches iterations we've processed indices 0…n_batches-1, so the next iteration has i == n_batches — that's where we want to stop.
37 break
Exit the for loop early. Control passes to line 55 (mean_ratio).
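The enumerate-and-break cap is easy to check in isolation. In this sketch plain lists stand in for DataLoaders and `take_batches` is a hypothetical helper; it also shows why the returned n_batches can be smaller than the budget.

```python
def take_batches(loader, n_batches):
    """Collect at most n_batches items using the enumerate-and-break pattern."""
    seen = []
    for i, batch in enumerate(loader):
        if i >= n_batches:  # i counts from 0, so this fires on item n_batches + 1
            break
        seen.append(batch)
    return seen

long_loader = list(range(200))   # stand-in for a loader with 200 batches
short_loader = list(range(12))   # loader shorter than the budget
print(len(take_batches(long_loader, 50)))
print(len(take_batches(short_loader, 50)))
```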
38 seq = seq.to(device)
Move the sequence batch to the GPU (or CPU, whichever device holds the model). Necessary because DataLoader yields CPU tensors by default — running model(seq) on a CUDA model with a CPU tensor crashes with a device mismatch error.
EXECUTION STATE
📚 tensor.to(device) = Returns a tensor on the target device. If already there, returns self (no copy). If different, allocates new device memory and copies.
⬇ arg: device = torch.device — typically 'cuda:0' on a single-GPU box.
39 rul_tgt = rul_tgt.to(device).view(-1, 1)
Move the RUL target to GPU AND reshape from (B,) to (B, 1). The reshape is required because moderate_weighted_mse_loss expects (B, 1) to match the model's rul_pred output.
EXECUTION STATE
📚 tensor.view(*shape) = Reshape — returns a view of the same data with new dimensions. Total elements must match. -1 in any position lets PyTorch infer that dimension.
⬇ arg: -1 = Inferred dimension. With original shape (B,) and target (-1, 1), PyTorch solves for -1 = B.
⬇ arg: 1 = Explicit second dimension. After view: shape (B, 1).
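→ why .view not .reshape? = Both work here. .view never copies: it returns a view of the same storage or raises if the memory layout is incompatible, whereas .reshape may silently fall back to a copy. Adding a trailing size-1 dimension to a (B,) tensor is always view-compatible, so .view simply makes the no-copy intent explicit.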
40 health_tgt = health_tgt.to(device)
Move the health label batch to GPU. No reshape — F.cross_entropy expects integer targets shape (B,) which is already what we have.
42 rul_pred, health_logits = model(seq)
Forward pass through the dual-task model. Both heads share the backbone activation, then branch into their respective output heads. Tuple unpack the two outputs.
EXECUTION STATE
📚 model(seq), which calls model.forward(seq) = PyTorch's nn.Module __call__ wrapper. Runs any registered forward pre-hooks and forward hooks around the call and dispatches to .forward.
46 g_rul = compute_task_grad_norm(L_rul, shared, retain_graph=True)
Compute ‖∇L_rul‖₂ on the shared backbone params. retain_graph=True keeps the autograd graph alive so the next call (line 47) can backprop through the SAME forward without re-running model(seq).
EXECUTION STATE
⬇ arg 1: L_rul = Scalar loss tensor with autograd graph.
⬇ arg 2: shared = List of backbone parameters (~1.7M floats for C-MAPSS).
⬇ arg 3: retain_graph=True = Tells torch.autograd.grad to NOT free the graph after computing gradients. Mandatory because we'll backprop again on line 47.
→ why retain_graph? = Default behaviour frees the graph to save memory. Without retain_graph=True the next .grad call would crash: ‘Trying to backward through the graph a second time, but the saved intermediate results have already been freed.’
⬆ returns: g_rul = 0-d Tensor on the same device — typical magnitude ~50,000 for the post-warmup C-MAPSS backbone.
47 g_health = compute_task_grad_norm(L_health, shared, retain_graph=True)
Same for health. After this line both gradient norms have been computed on the SAME forward pass. This is not the last backward pass of the batch: compute_gradient_cosine on line 50 differentiates both losses through the same graph again, so retain_graph=True is still required here.
EXECUTION STATE
⬆ returns: g_health = 0-d Tensor — typical magnitude ~100 for the post-warmup C-MAPSS backbone.
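49 ratio = (max(g_rul, g_health) / max(min(g_rul, g_health), 1e-12)).item()
Per-batch imbalance ratio. Python's built-in max and min compare the two 0-d tensors and return one of them; the division yields a 0-d tensor, and .item() unwraps it to a plain Python float.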
EXECUTION STATE
📚 max(t1, t2) / min(t1, t2) = When called with two 0-d tensors, returns the larger / smaller as a 0-d tensor. (Python builtins fall back to tensor comparison via __gt__/__lt__.)
📚 .item() = 0-d tensor → plain Python float. Drops the autograd graph and the device. Required for downstream Python arithmetic on line 55 (sum/len doesn't play well with mixed tensors and floats).
→ 1e-12 floor = Prevents division-by-zero on a degenerate batch where one task's gradient is exactly zero.
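The floored ratio behaves the same way on plain floats, which makes the degenerate case easy to inspect. `grad_ratio` is a hypothetical helper and the norms are the order-of-magnitude figures quoted above, not measurements.

```python
def grad_ratio(g_a, g_b, floor=1e-12):
    """Larger norm over smaller norm, with the smaller one floored."""
    return max(g_a, g_b) / max(min(g_a, g_b), floor)

print(grad_ratio(50_000.0, 100.0))  # magnitudes quoted for the C-MAPSS backbone
print(grad_ratio(100.0, 50_000.0))  # argument order does not matter
print(grad_ratio(3.0, 0.0))         # degenerate batch: finite, no ZeroDivisionError
```

A degenerate batch produces a finite but enormous ratio, and a single such value dominates an arithmetic mean, so it may be worth inspecting the per-batch list when the mean looks implausibly large.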
50 cos = compute_gradient_cosine(L_rul, L_health, shared)
Cosine of the two gradients on the shared backbone. Returns a Python float directly (the helper calls .item() internally).
EXECUTION STATE
⬇ arg 1: L_rul = RUL loss tensor (with autograd graph still alive thanks to retain_graph above).
⬇ arg 2: L_health = Health loss tensor.
⬇ arg 3: shared = Shared backbone parameter list.
→ cost inside the helper = 2× torch.autograd.grad calls (no extra graph because create_graph=False) + flatten + F.cosine_similarity. ~2× the memory of a regular backward.
cos (typical FD002 batch) = +0.31 — strongly aligned. Pushes the verdict toward GRACE.
52 ratios.append(ratio)
Append this batch's ratio. After 50 iterations, ratios is a list of 50 floats.
53 cosines.append(cos)
Append this batch's cosine. After 50 iterations, cosines is a list of 50 floats.
55 mean_ratio = sum(ratios) / len(ratios)
Mean across batches. SEM ≈ std / √50 — about 10% of the mean for typical C-MAPSS data, tight enough that the verdict thresholds are stable.
EXECUTION STATE
📚 sum(iterable) / len(iterable) = Pure Python mean. Simpler than np.mean for a list of floats; we don't need NumPy here.
mean_ratio (typical FD002) = ≈ 447.6
56 mean_cosine = sum(cosines) / len(cosines)
Mean cosine across the 50 batches.
EXECUTION STATE
mean_cosine (typical FD002) = ≈ +0.31
58 if mean_ratio < 10:
Branch 1 — small ratio. GABA correction would be tiny; recommend AMNL fixed weights.
EXECUTION STATE
Branch taken? (FD002) = No — 447.6 ≥ 10.
59 verdict = "AMNL"
AMNL recommendation when ratio is small. Skipped on FD002.
60 elif mean_cosine < -0.05:
Branch 2 — anti-aligned. Auxiliary task fights the primary; train two single-task models.
EXECUTION STATE
Branch taken? (FD002) = No — +0.31 ≥ -0.05.
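61 verdict = "SingleTask"
Single-task recommendation. Skipped on FD002. The FD003 anomaly (chapter 21 §3) does not land here either: its mean cosine of −0.04 is above the −0.05 threshold, so it falls through to branch 3 (AMNL).
62 elif mean_cosine < 0.05: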
Branch 3 — orthogonal. Auxiliary doesn't move the backbone usefully. AMNL with the right inner-loss shape.
EXECUTION STATE
Branch taken? (FD002) = No — +0.31 ≥ 0.05.
63 verdict = "AMNL"
AMNL recommendation for the orthogonal-cosine case. Same verdict as branch 1 but a different reason.
64 elif mean_cosine < 0.20:
Branch 4 — mildly aligned. OUTER axis (GABA) pays; whether to upgrade to GRACE depends on cost asymmetry (Q5 in the wizard).
EXECUTION STATE
Branch taken? (FD002) = No — +0.31 ≥ 0.20.
65 verdict = "GABA"
GABA recommendation for mild alignment.
66 else:
Branch 5 — strongly aligned. Both axes work. Full GRACE composition is the recommendation. TAKEN on FD002.
EXECUTION STATE
Branch taken? (FD002) = Yes — +0.31 ≥ 0.20.
67 verdict = "GRACE"
Final verdict for the FD002 path. GRACE wins.
69 return {
Open the result dict. Keys are wizard-friendly names; values are scalars + the verdict string.
70 "mean_grad_ratio": float(mean_ratio),
First scalar diagnostic. float() ensures we don't accidentally return a NumPy scalar (e.g. if someone changed sum() to np.sum()).
71 "mean_grad_cosine": float(mean_cosine),
Second scalar diagnostic.
72 "verdict": verdict,
The wizard-recommended method as a string.
73 "n_batches": len(ratios),
How many batches we actually used. Could be < n_batches if the loader is short — useful for debugging.
74 }
Close the dict and return.
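The branch logic on lines 58 to 67 can be lifted into a pure function and unit-tested without a model or a GPU. This is a sketch: `verdict_from_diagnostics` is a hypothetical name, and the example inputs are the ratio and cosine figures quoted in this section.

```python
def verdict_from_diagnostics(mean_ratio, mean_cosine):
    """Apply the threshold rules in order; the first matching rule wins."""
    if mean_ratio < 10:
        return "AMNL"        # imbalance too small for adaptive weighting
    if mean_cosine < -0.05:
        return "SingleTask"  # auxiliary task fights the primary
    if mean_cosine < 0.05:
        return "AMNL"        # roughly orthogonal gradients
    if mean_cosine < 0.20:
        return "GABA"        # mildly aligned
    return "GRACE"           # strongly aligned

print(verdict_from_diagnostics(447.6, 0.31))   # FD002 figures
print(verdict_from_diagnostics(496.0, -0.04))  # FD003 figures
print(verdict_from_diagnostics(5.0, 0.15))     # ratio below 10
```

Because the ratio test runs first, a well-aligned auxiliary task with a small ratio still maps to AMNL.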
77 # Stub usage (section divider)
Comment marking the example call site. The block below is intentionally a no-op so this file can be imported as a module without side effects.
78 if __name__ == "__main__":
Standard Python guard. The block runs only when this file is invoked directly with `python grace_diagnose.py`, NOT when imported as a module.
EXECUTION STATE
📚 __name__ = Module-level string. When run directly: '__main__'. When imported: the module's dotted name (e.g. 'grace.diagnose'). The guard prevents side effects on import.
79 pass
No-op statement. Required because the if-block needs at least one statement; the actual call site is in the comments below.
80 # In production:
Pseudo-code header for the four lines below — shows the canonical call sequence.
81 # model = DualTaskModel(...).to(device)
Build the model and move it to the target device. The diagnostic must run on the same device as production training.
82 # loader = make_train_loader("FD002")
Build a DataLoader for the dataset of interest. Use the SAME loader configuration as training (batch size, shuffle, workers) so the diagnostic sees the production data distribution.
83 # result = diagnose_for_grace(model, loader, device)
Call the diagnostic with the standard 3 positional args; n_batches defaults to 50.
84 # print(result)
Inspect the verdict. Pipe through json.dumps() for nicer formatting, or feed the dict directly into the decision wizard's controlled inputs.
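Formatting the result with json.dumps might look like the sketch below. The dict is hand-written with the FD002 figures quoted in this section and stands in for a real return value of diagnose_for_grace.

```python
import json

# Illustrative result dict using the FD002 figures quoted in this section.
result = {
    "mean_grad_ratio": 447.6,
    "mean_grad_cosine": 0.31,
    "verdict": "GRACE",
    "n_batches": 50,
}

text = json.dumps(result, indent=2, sort_keys=True)
print(text)

# The round trip works because every value is a plain Python scalar or string.
assert json.loads(text) == result
```

This is also where the float() casts on lines 70 and 71 pay off: some NumPy scalar types, for example numpy.float32, are not JSON serialisable.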
"""Production diagnostic: gradient ratio + cosine on a real model.

Uses paper helpers from grace/core/gradient_utils.py:
  - compute_task_grad_norm: per-task gradient norm on shared params
  - compute_gradient_cosine: cosine of the two gradient vectors

Run on a single-task baseline that has been trained for a few epochs
(or even fresh; gradient cosine stabilises within ~20 batches).
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from grace.core.gradient_utils import (
    compute_task_grad_norm,
    compute_gradient_cosine,
)
from grace.core.weighted_mse import moderate_weighted_mse_loss


def diagnose_for_grace(model: nn.Module, loader: DataLoader,
                       device: torch.device, n_batches: int = 50):
    """Compute gradient ratio + cosine over n_batches.

    Returns a dict the decision wizard can consume.
    """
    model.train()
    shared = model.get_shared_params()

    ratios = []
    cosines = []

    for i, (seq, rul_tgt, health_tgt, _) in enumerate(loader):
        if i >= n_batches:
            break
        seq = seq.to(device)
        rul_tgt = rul_tgt.to(device).view(-1, 1)
        health_tgt = health_tgt.to(device)

        rul_pred, health_logits = model(seq)
        L_rul = moderate_weighted_mse_loss(rul_pred, rul_tgt)
        L_health = F.cross_entropy(health_logits, health_tgt)

        g_rul = compute_task_grad_norm(L_rul, shared, retain_graph=True)
        g_health = compute_task_grad_norm(L_health, shared, retain_graph=True)

        ratio = (max(g_rul, g_health) / max(min(g_rul, g_health), 1e-12)).item()
        cos = compute_gradient_cosine(L_rul, L_health, shared)

        ratios.append(ratio)
        cosines.append(cos)

    mean_ratio = sum(ratios) / len(ratios)
    mean_cosine = sum(cosines) / len(cosines)

    if mean_ratio < 10:
        verdict = "AMNL"
    elif mean_cosine < -0.05:
        verdict = "SingleTask"
    elif mean_cosine < 0.05:
        verdict = "AMNL"
    elif mean_cosine < 0.20:
        verdict = "GABA"
    else:
        verdict = "GRACE"

    return {
        "mean_grad_ratio": float(mean_ratio),
        "mean_grad_cosine": float(mean_cosine),
        "verdict": verdict,
        "n_batches": len(ratios),
    }


# ---------- Stub usage ----------
if __name__ == "__main__":
    pass
    # In production:
    #   model = DualTaskModel(...).to(device)
    #   loader = make_train_loader("FD002")
    #   result = diagnose_for_grace(model, loader, device)
    #   print(result)
Run the diagnostic on the actual data, not synthetic. Cosine values from synthetic Gaussian gradients are heavily biased toward zero (random vectors in high dimensions are almost orthogonal). Real-data cosines are much higher because gradients on a real backbone have shared structure. Always run diagnose_for_grace on the actual training loader before deciding.
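The near-orthogonality of random high-dimensional vectors can be checked numerically. The sketch below uses only the standard library; the helper names, trial count, and seed are arbitrary choices made for illustration.

```python
import math
import random

def random_cosine(dim, rng):
    """Cosine between two independent standard-Gaussian vectors."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_abs_cosine(dim, trials=100, seed=0):
    """Average |cosine| over several random pairs of vectors."""
    rng = random.Random(seed)
    return sum(abs(random_cosine(dim, rng)) for _ in range(trials)) / trials

for dim in (10, 1_000):
    # The typical magnitude shrinks roughly like 1/sqrt(dim).
    print(dim, round(mean_abs_cosine(dim), 3))
```

At backbone scale (millions of parameters) a synthetic cosine is therefore indistinguishable from zero, so any sizeable cosine measured on real data reflects shared gradient structure rather than chance.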
Worked Examples: Five Real Domains
The decision tree, applied to five domains beyond C-MAPSS RUL. The diagnostic profile drives the recommendation; deployment context selects between GABA and GRACE.
| Domain | Aux task | Typical gradient ratio | Typical cosine | Asymmetric cost? | Recommendation |
|---|---|---|---|---|---|
| Aero engine RUL (FD002) | Health classification | 447× | +0.31 | Yes (NASA) | GRACE |
| Aero engine RUL (FD003) | Health classification | 496× | −0.04 | Yes | AMNL (cosine fails) |
| EV battery state-of-health | Charge-curve regression | ~50× | +0.40 | Yes | GRACE |
| Speech ASR | Speaker ID | ~100× | +0.10 | Asymmetric WER on rare words | GABA + focal CTC |
| Tumour segmentation + survival | Survival regression | ~30× | +0.25 | Yes (FN > FP) | GRACE with Dice + asymm survival |
| Recsys CTR + dwell | Dwell-time regression | ~5× | +0.15 | Symmetric | AMNL (ratio < 10) |
| Robotics policy + payload class | Payload classification | ~20× | +0.30 | Symmetric | GABA |
Two patterns emerge. First, the FD003 anomaly (cosine near zero) kicks GRACE off the menu, and AMNL wins. Second, when the cost is symmetric the INNER axis adds nothing: robotics stops at GABA, and recsys drops further to AMNL because its gradient ratio is below 10.
What Chapters 21-23 Add Up To
Three chapters, one decision tree:
Chapter 21: the algebra. The OUTER (per-task) and INNER (per-sample) axes are orthogonal in the formal sense — touching different indices — and compose into the four cells of a 2×2 grid: Baseline, AMNL, GABA, GRACE. The cosine diagnostic explains when the orthogonality composes constructively.
Chapter 22: the production training. The 14-stage pipeline runs the chosen method to completion in ~30 minutes per seed. Reproducibility recipe (5 seeds + cuDNN flags + PRNG pinning) makes results comparable.
Chapter 23: the empirical evidence. On C-MAPSS multi-condition GRACE wins NASA at minimal RMSE cost (§1). On the Pareto picture three methods make the front: AMNL (accuracy corner), GABA (middle), GRACE (safety corner) (§2). On N-CMAPSS DS02 GRACE wins both axes outright (§3). The decision tree of this section synthesises these patterns into a runnable recipe.
The thesis of Part VII. GRACE is not a single algorithm to apply universally; it is the corner cell of a 2×2 grid that wins under specific, measurable conditions. Chapters 21-23 give you both the grid and the diagnostics that identify which cell your problem is in. Skip the decision tree and you may train the wrong method on the wrong data — the FD003 anomaly is one such failure mode.
Pitfalls When Following The Tree
Pitfall 1: running the diagnostic on a tiny model
Cosine and ratio values are stable on the production backbone size (~1.7M params for C-MAPSS). On a 100-param toy backbone the cosine is dominated by random noise — the high-dimensional concentration of measure that gives the cosine its diagnostic power vanishes. Always run the diagnostic on the real architecture you intend to train.
Pitfall 2: averaging over too few batches
Per-batch cosine has high variance (single-batch values can span ±0.5 on real data). Use n≥50 batches; the SEM drops to ~0.02 at that scale, tight enough that the threshold rules are meaningful. With 5 batches the SEM is √10 ≈ 3 times larger (~0.06), so a ±2 SEM confidence interval is about ±0.13 wide and overlaps multiple rule branches.
Pitfall 3: using the diagnostic on a fully trained model
Late in training the gradient cosine drifts: late-epoch backpropagation produces highly correlated per-task gradients (both losses are near their minima, so gradients shrink proportionally). Run the diagnostic on a freshly-initialised or partly-trained (5-10 epochs) model to capture the alignment that drives the OUTER axis EARLY in training, when the controller's decisions matter most.
Pitfall 4: skipping the deployment-context questions
The diagnostic alone (ratio + cosine) tells you whether GABA or AMNL fits. It does NOT distinguish GABA from GRACE — that decision needs the cost-asymmetry question (Q5 in the wizard). On a benchmark scoreboard where only RMSE matters, GABA is the right call; on a deployment with NASA-like asymmetric cost, GRACE is.
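One way to fold the cost-asymmetry answer into the diagnostic verdict is sketched below. The function name is hypothetical, and treating Q5 as a single boolean is a simplifying assumption; the wizard may ask for more detail.

```python
def apply_cost_question(verdict, asymmetric_cost):
    """Refine the diagnostic verdict with the cost-asymmetry answer (Q5)."""
    # Assumption: with symmetric cost the failure-biased inner loss adds
    # nothing, so a GRACE verdict falls back to GABA.
    if verdict == "GRACE" and not asymmetric_cost:
        return "GABA"
    return verdict

print(apply_cost_question("GRACE", asymmetric_cost=True))   # e.g. NASA-scored RUL
print(apply_cost_question("GRACE", asymmetric_cost=False))  # RMSE-only scoreboard
print(apply_cost_question("AMNL", asymmetric_cost=True))    # left unchanged
```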
Pitfall 5: trusting a single seed's diagnostic
Cosine on a single random init has ±0.1 noise (chapter 22 §3). Run the diagnostic on at least 2 seeds and take the mean before applying the threshold rules. The decision wizard treats your reported number as a single point estimate; the thresholds have ~0.05 margin built in to absorb that noise.
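Averaging across seeds before thresholding could be done with a helper like the one below. It is a sketch: `average_over_seeds` is a hypothetical name, the input dicts mimic the keys returned by the diagnostic, and the per-seed numbers are made up.

```python
def average_over_seeds(results):
    """Mean ratio and mean cosine across per-seed diagnostic dicts."""
    n = len(results)
    if n < 2:
        raise ValueError("need diagnostics from at least 2 seeds")
    return {
        "mean_grad_ratio": sum(r["mean_grad_ratio"] for r in results) / n,
        "mean_grad_cosine": sum(r["mean_grad_cosine"] for r in results) / n,
        "n_seeds": n,
    }

# Made-up per-seed values, for illustration only.
per_seed = [
    {"mean_grad_ratio": 400.0, "mean_grad_cosine": 0.25},
    {"mean_grad_ratio": 500.0, "mean_grad_cosine": 0.35},
]
print(average_over_seeds(per_seed))
```

The threshold rules are then applied to the averaged values rather than to any single seed's verdict.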
Takeaway
Six questions decide the method: aux task? gradient ratio? cosine? failure-region size? cost asymmetry? regime structure? Each answer narrows the recommendation by one step.
Two cheap scalar diagnostics, the mean gradient ratio r̄ and the mean gradient cosine ρ̄, do most of the work. Compute time: ~10 s on a single GPU.
Threshold rules, applied in order: ratio < 10 → AMNL; cos < −0.05 → single-task; −0.05 ≤ cos < 0.05 → AMNL; 0.05 ≤ cos < 0.20 → GABA; cos ≥ 0.20 → GRACE (if cost is asymmetric and failure subgroup is sizable).
The Pareto picture from chapter 23 §2 reminds you the wizard finds the optimal CORNER. AMNL, GABA, GRACE all sit on the front for SOME deployment context; the wizard maps you to the right one.
Run the diagnostic on the real model, real data, ≥ 50 batches, ≥ 2 seeds. Apply the threshold rules. Pick GRACE for safety-first multi-condition; GABA for symmetric-cost balanced; AMNL for the FD003-like anomaly. Single-task when MTL hurts.