Chapter 17

How GABA Differs from GradNorm

Inverse-Gradient Balancing: The Idea

Two Thermostats

A building has two zones, both wanting $72^\circ\text{F}$. Two thermostats on the wall. Thermostat A reads the temperature, applies a closed-form correction, done. Thermostat B trains a tiny neural network on (temperature, occupancy, time-of-day) to LEARN the correction, gets there eventually but with a few cycles of overshoot. When the building catches fire, A hits the alarm instantly; B is still updating its weights.

That is the GABA-vs-GradNorm contrast. Both methods aim to balance task contributions to a shared backbone. GABA reads the gradient norms and applies the closed form $\lambda^*_i = g_j / (g_i + g_j)$ in one division. GradNorm sets up a target gradient ratio and runs SGD on a learnable weight vector to push the actual gradients toward that target. Same goal; different machinery; different stability properties.

The headline. On the paper's FD002 evaluation, GABA delivers RMSE $7.53 \pm 0.65$ and NASA $224.2 \pm 22.4$; GradNorm delivers RMSE $8.19 \pm 0.78$ and NASA $260.9 \pm 36.1$ (paper Table III). On N-CMAPSS, GradNorm diverged on 1 of 5 seeds (seed 789, NaN gradients); GABA never diverges, by construction.

What GradNorm Actually Does

GradNorm (Chen et al., ICML 2018) is a two-step algorithm that introduces an auxiliary loss to learn the per-task weights through SGD:

  • Step 1: target gradient norms. For each task, compute the desired backbone-gradient norm: $\text{tgt}_i = \overline{g} \cdot \bigl( L_i / L_i^{(0)} \bigr)^{\alpha}$, where $\overline{g} = (g_1 + g_2)/2$ is the average gradient norm, $L_i^{(0)}$ is the loss at step 0, and $\alpha \geq 0$ is a hand-tuned restoring exponent (paper sets $\alpha = 1.5$).
  • Step 2: auxiliary loss. Define $\mathcal{L}_{\text{GN}} = \sum_i \bigl| \| w_i \nabla L_i \| - \text{tgt}_i \bigr|$ and add it to the main loss with a small coupling weight (paper uses $0.1$). SGD on $w_i$ (treated as learnable parameters) drives $\mathcal{L}_{\text{GN}} \to 0$ and hence $\| w_i \nabla L_i \| \to \text{tgt}_i$.

The intuition: tasks that have made less progress relative to their initial loss get a higher target gradient norm. The auxiliary loss then tunes the per-task weight so the actual weighted gradient lands on that target. The exponent $\alpha$ controls how aggressively the algorithm pursues the slow-learning task.

Two consequences of being a learned algorithm. (1) GradNorm needs create_graph=True on its gradient-norm computation so backprop reaches the learnable weights — that is full second-order autograd, roughly 2× the memory of a normal step. (2) The auxiliary loss can diverge if the gradient magnitudes are extreme (the paper observed 1/5 N-CMAPSS seeds producing NaN on the gradient buffer).
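To make Step 2 concrete, here is a minimal pure-Python sketch of one subgradient step on the auxiliary loss, using scalar gradient norms and holding the targets constant during the step. The learning rates and helper names are illustrative assumptions, not values from the paper, and real GradNorm differentiates through the gradient norm with autograd rather than using this hand-written subgradient. The targets are the ones computed for the "health is lagging" scenario later in this section.

```python
def aux_loss(w, g, tgt):
    """L_GN = sum_i | w_i * g_i - tgt_i | for scalar gradient norms g_i."""
    return sum(abs(wi * gi - ti) for wi, gi, ti in zip(w, g, tgt))


def aux_subgradient(w, g, tgt):
    """d L_GN / d w_i = g_i * sign(w_i * g_i - tgt_i), targets held constant."""
    def sign(x):
        return (x > 0) - (x < 0)
    return [gi * sign(wi * gi - ti) for wi, gi, ti in zip(w, g, tgt)]


def sgd_step(w, g, tgt, lr):
    """One plain SGD step on the learnable weights w."""
    grad = aux_subgradient(w, g, tgt)
    return [wi - lr * gr for wi, gr in zip(w, grad)]


g = [5.0, 0.01]           # [g_rul, g_health] gradient norms
tgt = [0.3131, 1.6270]    # [tgt_rul, tgt_health] for "health is lagging"
w = [1.0, 1.0]            # initial learnable weights

w_small = sgd_step(w, g, tgt, lr=0.01)  # hypothetical conservative step
w_large = sgd_step(w, g, tgt, lr=0.25)  # hypothetical aggressive step
print(aux_loss(w, g, tgt))
print(w_small)
print(w_large)
```

With `lr=0.25` the RUL weight lands at -0.25 after a single update, which illustrates how an unconstrained learnable weight can leave the non-negative range.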

What GABA Does, By Contrast

GABA reads only the gradient norms $g_i = \|\nabla_\theta L_i\|$ and applies the closed form derived in §17.3:

$$\lambda^*_{\text{rul}} = \frac{g_{\text{health}}}{g_{\text{rul}} + g_{\text{health}}}$$

That is the entire algorithm at the math level. The paper adds three stabilisers (EMA $\beta = 0.99$, floor $\lambda_{\min} = 0.05$, warmup $W = 100$) but the underlying weight is closed-form. There are no learnable parameters; no inner optimisation; no auxiliary loss; no second-order autograd.

The structural property GradNorm cannot get for free. Because GABA returns a value on the simplex by construction (after floor + renormalisation), every task weight is provably bounded in $[\lambda_{\min}, 1 - \lambda_{\min}]$ at every step. GradNorm has no such bound: its learnable $w_i$ can take any real value and DOES go negative when the aux loss takes a single aggressive step (we'll see this empirically below).
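One way the three stabilisers could wrap the closed form is sketched below. The exact ordering used in the paper is not given in this section, so the choices here are assumptions: equal weights are returned during warmup, the EMA is applied to the raw closed-form weight, and the floor is enforced for K = 2 by clamping the RUL weight into [lam_min, 1 - lam_min] and giving the remainder to health. The class name `StabilisedGABA` is hypothetical.

```python
class StabilisedGABA:
    """Closed-form GABA weights for K = 2 with EMA, floor, and warmup.

    Assumed design (not confirmed by the source): equal weights during
    warmup, EMA over the raw closed-form weight, clamp-based floor.
    At least one gradient norm is assumed to be strictly positive.
    """

    def __init__(self, beta=0.99, lam_min=0.05, warmup=100):
        self.beta = beta
        self.lam_min = lam_min
        self.warmup = warmup
        self.step = 0
        self.ema_rul = 0.5  # EMA state starts at the equal-weight point

    def update(self, g_rul, g_health):
        self.step += 1
        # Closed form: one task's weight is the OTHER task's share of the sum.
        raw_rul = g_health / (g_rul + g_health)
        # Exponential moving average of the raw weight.
        self.ema_rul = self.beta * self.ema_rul + (1.0 - self.beta) * raw_rul
        if self.step <= self.warmup:
            return 0.5, 0.5
        # Floor: clamp into [lam_min, 1 - lam_min]; the pair still sums to 1.
        lam_rul = min(max(self.ema_rul, self.lam_min), 1.0 - self.lam_min)
        return lam_rul, 1.0 - lam_rul


balancer = StabilisedGABA()
first = balancer.update(g_rul=5.0, g_health=0.01)
for _ in range(999):
    lam_rul, lam_health = balancer.update(g_rul=5.0, g_health=0.01)
print(first)
print(lam_rul, lam_health)
```

With the 500× imbalance used in this section, the raw weight (about 0.002) sits below the floor, so under these assumptions the returned RUL weight settles at `lam_min` and never leaves the stated interval.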

Side-by-Side Mechanism Comparison

| Aspect | GradNorm (Chen 2018, paper α=1.5) | GABA (this work) |
|---|---|---|
| Inputs at step t | g_i (with grad-of-grad), L_i, L_i^(0), α | g_i only |
| State carried across steps | Initial losses L_i^(0); learnable weights w_i; their optimiser state | EMA λ̂_i; step counter |
| Mechanism for setting weights | Auxiliary loss + SGD on w_i (each step pushes toward tgt) | Closed form λ_i = (S − g_i) / ((K−1)·S) |
| Number of hyperparameters | α (paper 1.5) + aux-loss weight (paper 0.1) + w_i learning rate | β (0.99), λ_min (0.05), warmup (100) |
| Cost per step | create_graph=True (≈2× memory) + extra optimiser update | create_graph=False; one division |
| Bounded weights? | No: w_i can go negative under aggressive aux-loss steps | Yes: guaranteed in [λ_min, 1 − λ_min] |
| Behaviour with extreme imbalance (500×) | Aux loss dominated by larger-magnitude task; can produce NaN | Closed form is smooth; EMA + floor stabilise |
| Reduces to GABA when… | α = 0 OR all r_i are equal | Always: GABA is the ‘α=0’ case |

The takeaway: GABA is GradNorm with $\alpha = 0$, no aux loss, and a floor, minus the second-order autograd. Whatever GradNorm gains from the $r_i^\alpha$ signal it pays for in machinery and stability.
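The reduction can be checked numerically for two tasks. The sketch below implements the K-task closed form from the comparison table and the converged GradNorm weights (auxiliary loss exactly zero, then renormalised), and compares them at α = 0. The function names are illustrative, and the equivalence is only checked for K = 2 here.

```python
def gaba_closed_form(g):
    """lambda_i = (S - g_i) / ((K - 1) * S) for K >= 2 gradient norms."""
    K, S = len(g), sum(g)
    return [(S - gi) / ((K - 1) * S) for gi in g]


def gradnorm_fixed_point(g, r, alpha):
    """Weights at zero auxiliary loss: w_i = avg_g * r_i**alpha / g_i, renormalised."""
    avg_g = sum(g) / len(g)
    w = [avg_g * (ri ** alpha) / gi for gi, ri in zip(g, r)]
    total = sum(w)
    return [wi / total for wi in w]


g = [5.0, 0.01]
lam_gaba = gaba_closed_form(g)
# With alpha = 0 the progress ratios r_i drop out, whatever their values.
lam_gn_alpha0 = gradnorm_fixed_point(g, r=[0.25, 0.75], alpha=0.0)
sum_k3 = sum(gaba_closed_form([5.0, 0.01, 0.3]))  # K = 3 weights

print(lam_gaba)
print(lam_gn_alpha0)
print(sum_k3)
```

Both weight vectors agree to floating-point precision for K = 2, and the K = 3 weights still sum to 1 because the numerators add up to (K − 1)·S.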

Interactive: Same Gradients, Different Verdicts

Drag the four sliders. The left two bars show GABA's weights; the right two bars show what GradNorm would converge to if its auxiliary loss reached zero. When $r_{\text{rul}} = r_{\text{health}}$ the bars match. When the relative progresses diverge, GradNorm shifts further than GABA, sometimes usefully, sometimes catastrophically.

Try this. Set $\alpha = 0$: the GradNorm bars snap to the GABA bars. Now set $\alpha = 3$ and $r_{\text{rul}} = 0.1$, $r_{\text{health}} = 1.0$: GradNorm gives almost all the weight to health (the task that has made no progress) and squeezes RUL (the ‘already learned’ task) toward zero, because GradNorm interprets ‘high relative loss’ as ‘needs more practice’. That is the regime where GradNorm tends to misbehave on 500×-imbalanced gradients.
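The same experiment can be run numerically. The sketch below assumes the gradient norms g_rul = 5.0 and g_health = 0.01 used throughout this section (the widget's own defaults are not stated) and evaluates the converged GradNorm weights, that is, the weights at zero auxiliary loss after renormalisation, for a few values of α.

```python
def gradnorm_fixed_point(g_rul, g_health, r_rul, r_health, alpha):
    """Converged GradNorm weights (zero auxiliary loss), renormalised to sum to 1."""
    avg_g = (g_rul + g_health) / 2.0
    w_rul = avg_g * r_rul ** alpha / g_rul
    w_health = avg_g * r_health ** alpha / g_health
    total = w_rul + w_health
    return w_rul / total, w_health / total


g_rul, g_health = 5.0, 0.01   # assumed gradient norms
r_rul, r_health = 0.1, 1.0    # RUL has progressed; health has not

results = {}
for alpha in (0.0, 1.5, 3.0):
    results[alpha] = gradnorm_fixed_point(g_rul, g_health, r_rul, r_health, alpha)
    lam_rul, lam_health = results[alpha]
    print(f"alpha={alpha:<4} lam_rul={lam_rul:.2e} lam_health={lam_health:.6f}")
```

With these inputs the RUL weight shrinks as α grows, and at α = 3 it is on the order of 2e-6, so the health weight is essentially 1.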

Python: Three Scenarios, Two Verdicts

Implement both algorithms in pure NumPy and run them on the same gradient norms with three different progress profiles. The output table makes the conceptual difference unambiguous: GABA is constant in the L's; GradNorm moves with them.

GABA vs GradNorm — same gradients, three progress profiles
🐍gaba_vs_gradnorm.py
Line 1: Module docstring

States the thesis of this file in one sentence: GABA and GradNorm pursue the SAME goal (balance per-task contributions to a shared backbone) using DIFFERENT machinery (closed-form division vs learned auxiliary loss + SGD). Triple-quoted string at the top of a Python file becomes the module's __doc__ attribute.

EXECUTION STATE
→ thesis = Same goal, different machinery. Both produce a length-K weight vector λ that sums to 1; only the route to that vector differs.
Line 3: import numpy as np

NumPy is Python's numerical computing workhorse. We use it for ndarray (the typed N-dimensional array) and a single function — np.array() — to package the two task weights into a length-2 vector. Both methods' math is small enough to express in plain NumPy because each step touches at most a 2-element array.

EXECUTION STATE
📚 numpy = Library: ndarray (fast typed arrays), broadcasting, linear algebra, math functions. Backed by C/BLAS, so even a 2-element array goes through optimised paths.
📚 as np = Universal alias convention. Lets us write np.array() instead of numpy.array() — every NumPy tutorial uses this exact alias.
→ why NumPy here? = Both algorithms operate on (2,) weight vectors. We could use a Python list, but ndarray gives us element-wise division, broadcasting, and easy interop with PyTorch/Pandas downstream.
Line 6: # Realistic FD002 numbers from section 12.3

Anchor comment: the constants below are NOT toy numbers — they are the median values measured across n=4,120 training steps on the FD002 dataset (paper section 12.3). Using realistic numbers makes the GABA/GradNorm contrast visible; with toy numbers (like both gradients = 1) the methods would agree trivially.

Line 7: g_rul, g_health = 5.0, 0.01

Tuple unpacking: assigns 5.0 to g_rul and 0.01 to g_health in a single statement. These are the L2 norms of the two task losses' gradients with respect to the SHARED backbone parameters — i.e., how much each task is ‘pulling’ on the shared layers.

EXECUTION STATE
g_rul = 5.0 = RUL regression gradient norm. RUL uses MSE on a 0–125 cycle target, so the gradient is large.
g_health = 0.01 = Health classification gradient norm. Cross-entropy on a 3-class softmax with a near-converged head produces tiny gradients.
→ ratio = 5.0 / 0.01 = 500x. RUL is pulling the backbone 500 times harder than health. This is the imbalance regime where naive equal-weighting (λ = [0.5, 0.5]) lets RUL dominate the shared features.
→ why scalar? = g_i is the L2 norm of the gradient TENSOR — a single non-negative real number per task. We don't need the full gradient direction, only its magnitude.
Line 8: L0_rul, L0_health = 20.0, 2.0

Snapshot of each task's loss at training step 0 (the very first batch, before any backbone update). GradNorm freezes these and uses them as the denominator of its training-rate ratio r_i = L_i / L_i^(0). GABA never reads them — it is one of the structural advantages: GABA has nothing to remember.

EXECUTION STATE
L0_rul = 20.0 = Initial RUL MSE loss. Roughly the variance of the target distribution before the model has learned anything.
L0_health = 2.0 = Initial cross-entropy loss. Well above the random-guess baseline ln(3) ≈ 1.10.
→ why frozen? = GradNorm's r_i compares CURRENT loss to STEP-0 loss. If you re-baseline mid-training, the algorithm forgets what ‘made progress’ means and r_i suddenly reads 1.0 again — collapsing the alpha-signal to nothing.
→ GABA vs GradNorm here = GABA can run on a checkpoint resumed from step 50,000 with no special handling. GradNorm needs you to either (a) save L0 in the checkpoint or (b) re-baseline (and lose the progress signal). The paper bakes this asymmetry into its ‘continual learning’ recommendation.
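A minimal sketch of option (a), assuming a plain dictionary checkpoint serialised with the standard-library `json` module. The key names, the helper functions, and the example weight values are hypothetical and not taken from the paper's code.

```python
import json
import os
import tempfile


def save_gradnorm_state(path, step, initial_losses, weights):
    """Persist what GradNorm needs to resume: step, frozen L0, learnable w."""
    state = {"step": step, "initial_losses": initial_losses, "weights": weights}
    with open(path, "w") as f:
        json.dump(state, f)


def load_gradnorm_state(path):
    """Read the saved state back as a plain dictionary."""
    with open(path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "gradnorm_state.json")
save_gradnorm_state(
    path,
    step=50_000,
    initial_losses={"rul": 20.0, "health": 2.0},
    weights={"rul": 0.4, "health": 1.6},  # made-up values for illustration
)

state = load_gradnorm_state(path)
# r_i stays meaningful after resuming because L0 was restored, not re-measured.
r_rul = 10.0 / state["initial_losses"]["rul"]
print(state["step"], r_rul)
```

A real GradNorm checkpoint would also need the optimiser state for the learnable weights, which is omitted here.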
Line 9: alpha = 1.5

GradNorm's restoring-force exponent — the SINGLE hyperparameter that gives GradNorm its character beyond gradient equalisation. Paper value 1.5 follows Chen et al. ICML 2018. Setting alpha = 0 collapses GradNorm exactly into GABA (the r_i^0 = 1 term disappears).

EXECUTION STATE
alpha = 1.5 = Float. Controls how aggressively GradNorm pushes the lagging task. alpha=0 → no progress signal (≡ GABA); alpha=1 → linear; alpha=1.5 (paper) → super-linear; alpha=3 → very aggressive.
→ r_i^alpha example = r_rul=0.25, r_health=0.75. alpha=0: 1.000 vs 1.000 (no asymmetry); alpha=1: 0.250 vs 0.750 (3.0x); alpha=1.5: 0.125 vs 0.650 (5.2x); alpha=3: 0.016 vs 0.422 (27x).
→ why this is dangerous = Any extra hyperparameter that affects the task-weight loop interacts with the model learning rate, the optimiser, and the data scale. Tuning it on a new dataset usually requires a sweep. GABA has no analogue.
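The asymmetry figures for r_rul = 0.25 and r_health = 0.75 can be reproduced with a few lines of Python. Nothing here is specific to GradNorm beyond the exponentiation; the `ratios` dictionary is just a convenience for inspecting the results.

```python
r_rul, r_health = 0.25, 0.75

ratios = {}
for alpha in (0.0, 1.0, 1.5, 3.0):
    a, b = r_rul ** alpha, r_health ** alpha
    ratios[alpha] = b / a  # how many times larger the lagging task's factor is
    print(f"alpha={alpha}: {a:.3f} vs {b:.3f} ({ratios[alpha]:.1f}x)")
```

The ratio equals 3 raised to the power α, so each increase in α multiplies the asymmetry rather than adding to it.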
Line 13: def gaba_weights(g_rul, g_health) → np.ndarray

GABA in code. The complete algorithm fits on three lines below, has no state across calls, no learnable parameters, and no inner optimisation loop. Compare to gradnorm_targets() + gradnorm_converged() further down: more than twice the lines, five extra inputs, and an algebraic identity that hides an SGD loop.

EXECUTION STATE
⬇ input: g_rul (float) = RUL gradient norm. e.g. 5.0 — the value we set on line 7.
→ role = Used in BOTH the numerator (for the health weight) and the denominator (sum). One scalar drives both outputs.
⬇ input: g_health (float) = Health gradient norm. e.g. 0.01.
→ role = Mirror of g_rul. The closed form is symmetric in the two inputs by construction.
→ np.ndarray (return type hint) = Tells type-checkers that this function returns a NumPy ndarray, not a Python list. Pure documentation — Python doesn't enforce it at runtime.
⬆ returns = ndarray of shape (2,) summing to 1.0. Layout: [lambda_rul, lambda_health].
→ example call = gaba_weights(5.0, 0.01) → array([0.001996, 0.998004])
Line 14: Function docstring

Single-line docstring records the headline contrast: the entire body is one division and there are no learnable parameters anywhere. This is the property the rest of section 17.4 is comparing against.

Line 15: S = g_rul + g_health

Sum the two gradient norms. This single scalar S becomes the denominator of the closed form on the next line, normalising the weights so they land on the probability simplex (sum to 1).

EXECUTION STATE
S = 5.0 + 0.01 = 5.01
→ why this works = If lambda_rul = g_health/S and lambda_health = g_rul/S, then their sum is (g_health + g_rul)/S = S/S = 1. Simplex constraint satisfied without any explicit projection.
→ contribution interpretation = lambda_rul · g_rul = (g_health/S)·g_rul = g_rul·g_health/S. Symmetric in the two tasks ⇒ both tasks contribute EQUALLY to the backbone gradient. That is the whole point of GABA.
Line 16: return np.array([g_health / S, g_rul / S])

K=2 closed form. Numerator for lambda_rul is g_health (the OTHER task's norm) — that is the inverse-proportional rule from §17.3. Wrapping in np.array() gives us a typed, broadcasting-friendly vector for downstream math.

EXECUTION STATE
📚 np.array(object) = NumPy function: turns any array-like (list, tuple, nested list, scalar) into an ndarray. Infers dtype from contents — here both elements are float, so dtype=float64.
⬇ arg: [g_health / S, g_rul / S] = Python list literal of length 2. Element 0 will become lambda_rul (note: numerator is g_HEALTH because of the inverse-proportional rule); element 1 will become lambda_health.
→ element 0: g_health / S = 0.01 / 5.01 = 0.001996. Lambda for the RUL task — TINY because RUL is the loud task and we want to hold it back.
→ element 1: g_rul / S = 5.0 / 5.01 = 0.998004. Lambda for the health task — HUGE because health is the quiet task and we want to amplify it.
→ why inverse? = Larger gradient norm ⇒ smaller weight. This is the ‘every task contributes equally’ condition (lambda_i · g_i = constant) solved for K=2.
⬆ return: ndarray (2,) = [0.001996, 0.998004] — sums to 1.0 exactly. dtype=float64.
Line 20: def gradnorm_targets(g_rul, g_health, L_rul, L_health, L0_rul, L0_health, alpha)

GradNorm Step 1: compute the TARGET gradient norm each task should hit. The targets are then chased by SGD on a learnable weight vector (in real GradNorm) or solved for analytically (in this demo). Notice the input list is more than double GABA's — that is the price of the alpha-signal.

EXECUTION STATE
⬇ input: g_rul (float) = Current RUL backbone gradient norm. Same value GABA reads.
⬇ input: g_health (float) = Current health backbone gradient norm. Same value GABA reads.
⬇ input: L_rul (float) = CURRENT RUL loss. Changes every step. e.g. 10.0 if loss has halved from L0=20.
⬇ input: L_health (float) = CURRENT health loss. Changes every step. e.g. 1.0 if loss has halved from L0=2.
⬇ input: L0_rul (float) = Frozen step-0 loss. Stored at the very first training step and never updated.
⬇ input: L0_health (float) = Frozen step-0 loss for health.
⬇ input: alpha (float) = Restoring-force exponent (paper 1.5). Larger ⇒ more aggressive shift toward the slower-learning task.
→ input count = 7 inputs. GABA needed 2. Every extra input is a chance to misconfigure or to drift if the surrounding code mishandles state.
⬆ returns = Tuple (tgt_rul, tgt_health) — two non-negative floats. The desired ||w_i · ∇L_i|| for the next step.
Line 21: Function docstring

Records the formula tgt_i = avg_g · (L_i / L_i^0)^alpha — Chen et al. 2018 equation 2. The docstring also flags ‘Step 1’ — there is a Step 2 (the auxiliary loss + SGD) that is replaced here by gradnorm_converged() for clean comparison.

Line 22: avg_g = (g_rul + g_health) / 2.0

Average of the two backbone-gradient norms. Becomes the ‘reference scale’ that all task targets get pushed toward. GradNorm's philosophy: every task should eventually have gradient norm ≈ avg_g; the factor r_i^alpha biases each target around that mean.

EXECUTION STATE
avg_g = (5.0 + 0.01) / 2 = 2.505
→ /2.0 vs /K = K=2 here, so /2 is the arithmetic mean. For K tasks the canonical GradNorm formula is /K — this generalises straight to any number of tasks.
→ reading = ‘Both tasks should push the backbone with about 2.5 units of gradient norm.’ The factor r_i^alpha then scales each target away from 2.5, here by (L_i/L0_i)^1.5.
Line 23: r_rul = L_rul / (L0_rul + 1e-8)

Inverse training rate for the RUL task. r_rul < 1 means the loss has dropped from its step-0 value (= progress); r_rul ≈ 1 means no progress; r_rul > 1 means the loss has gone UP (very rare, indicates instability).

EXECUTION STATE
📚 / operator = NumPy/Python true division. Both operands are scalars here, so the result is a scalar float.
⬇ + 1e-8 = Numerical guard. Prevents 1/0 if L0_rul somehow equals 0 (it shouldn't, but defensive). 1e-8 is small enough to be invisible against L0_rul ≈ 20.
r_rul (example) = For L_rul=10, L0_rul=20: 10 / (20 + 1e-8) ≈ 0.5
→ reading scale = r=0 means task fully solved. r=0.5 means halfway. r=1 means no progress. r>1 means losing ground (alarm).
→ why ratio not difference? = Ratio is dimensionless — it works whether L is on a [0, 100] scale (MSE) or [0, ln(C)] scale (cross-entropy). Difference would not.
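A small illustration of the dimensionless property, assuming the loss is simply rescaled by a constant factor (for example by a change of units). The factor 100 is arbitrary and the helper name `progress_ratio` is illustrative.

```python
def progress_ratio(L, L0, eps=1e-8):
    """r = L / (L0 + eps), the same guarded ratio used in the listing."""
    return L / (L0 + eps)


L0, L = 20.0, 10.0
scale = 100.0  # arbitrary rescaling of the loss

ratio_before = progress_ratio(L, L0)
ratio_after = progress_ratio(scale * L, scale * L0)
diff_before = L0 - L
diff_after = scale * L0 - scale * L

print(ratio_before, ratio_after)  # both very close to 0.5
print(diff_before, diff_after)    # the difference grows with the scale
```

The ratio is unchanged up to the tiny epsilon guard, while the raw difference is multiplied by the scale factor.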
Line 24: r_health = L_health / (L0_health + 1e-8)

Same training-rate ratio for the health task. Note that r_rul and r_health are NOT directly comparable across tasks unless their L0 values are on similar scales — which is one of the latent assumptions of GradNorm.

EXECUTION STATE
r_health (example) = For L_health=1.0, L0_health=2.0: 1.0 / (2.0 + 1e-8) ≈ 0.5
→ cross-task pitfall = If L0_rul=20 (MSE) and L0_health=2 (CE), reaching r=0.5 for both means very different absolute progress. GradNorm treats them as comparable — this is fine in practice but worth noticing.
Line 25: return avg_g * (r_rul ** alpha), avg_g * (r_health ** alpha)

GradNorm targets. The TARGET gradient norm scales with r_i^alpha — so a task that has made LESS progress (large r_i) gets a LARGER target. Both targets share the same avg_g pre-factor so the average stays anchored.

EXECUTION STATE
📚 ** operator = Python exponentiation. r ** alpha = r^alpha. For non-integer alpha (1.5 here), r must be ≥ 0 to stay real-valued — guaranteed because r is L_i/L0_i ≥ 0.
→ r ** 1.5 examples = 0.0^1.5 = 0.000; 0.25^1.5 = 0.125; 0.5^1.5 = 0.354; 0.75^1.5 = 0.650; 1.0^1.5 = 1.000; 2.0^1.5 = 2.828.
tgt_rul = avg_g · r_rul^1.5 = For r_rul=0.5: 2.505 · 0.354 = 0.8857
tgt_health = avg_g · r_health^1.5 = For r_health=0.5: 2.505 · 0.354 = 0.8857
→ equal r → equal tgt = When r_rul = r_health, the alpha term contributes the SAME factor to both targets, so tgt_rul = tgt_health = avg_g · r^alpha. GradNorm reduces to balanced equalisation here ⇒ AGREES with GABA.
→ divergent r → divergent tgt = When r_health > r_rul (health is lagging), tgt_health > tgt_rul ⇒ GradNorm wants ||w_health · g_health|| to be larger than ||w_rul · g_rul||. This is the ‘restoring force’ and the source of all GABA/GradNorm disagreement.
→ tuple return = Python returns multiple values as a tuple. Caller can unpack: tgt_r, tgt_h = gradnorm_targets(...).
Line 28: def gradnorm_converged(tgt_rul, tgt_health, g_rul, g_health) → np.ndarray

GradNorm Step 2 (analytic shortcut). At the auxiliary-loss minimum, w_i · g_i = tgt_i, so we can solve for w_i = tgt_i / g_i and renormalise to the simplex. This bypasses the SGD inner loop and gives us the EXACT fixed point GradNorm would converge to. Real GradNorm uses SGD because at training time we don't pay for the analytic solve every step — but for comparison this is the cleanest reference.

EXECUTION STATE
⬇ input: tgt_rul (float) = From gradnorm_targets()[0]. The desired ||w_rul · g_rul||.
⬇ input: tgt_health (float) = From gradnorm_targets()[1]. The desired ||w_health · g_health||.
⬇ input: g_rul (float) = Current RUL gradient norm. Needed to invert tgt = w · g.
⬇ input: g_health (float) = Current health gradient norm.
⬆ returns = ndarray (2,) summing to 1.0. The lambda that GradNorm CONVERGES TO at zero auxiliary loss.
→ caveat = Real GradNorm runs SGD on a learnable nn.Parameter. It may not reach this fixed point on every step — and an aggressive step can overshoot into negative weights (the divergence regime). The closed-form value here is the BEST-CASE GradNorm.
Line 29: Function docstring

Records the algebraic identity: aux loss is zero ⇔ w_i · g_i = tgt_i for every i ⇔ w_i = tgt_i / g_i. Plus a final renormalise to project onto the simplex.

Line 30: w_rul = tgt_rul / g_rul

Solve the aux-loss zero condition for w_rul. UNNORMALISED — the value can be very small (if tgt_rul ≪ g_rul) or very large.

EXECUTION STATE
w_rul example (scenario 2) = tgt_rul=0.3131, g_rul=5.0 → 0.3131 / 5.0 = 0.06263
→ why squeezed? = Health is lagging (r_health > r_rul) so GradNorm wants to RAISE the health target and LOWER the RUL target. With g_rul large and tgt_rul small, w_rul collapses toward zero.
Line 31: w_health = tgt_health / g_health

Same identity for health. The killer is g_health = 0.01 in the denominator — any non-trivial tgt_health produces a HUGE unnormalised w_health.

EXECUTION STATE
w_health example (scenario 2) = tgt_health=1.6270, g_health=0.01 → 1.6270 / 0.01 = 162.70
→ why huge? = Tiny denominator. Without normalisation this is no longer a probability. The next two lines fix it.
→ divergence seed = If GradNorm uses SGD with learning rate 0.05, one step on the aux loss gradient can push w_rul from 1.0 down to ~ -2.83 — into the FORBIDDEN negative half-line. GABA cannot do this; closed form ≥ 0 by construction.
Line 32: s = w_rul + w_health

Sum-of-weights for normalisation onto the probability simplex. After dividing by s, both weights are non-negative and sum to 1.0.

EXECUTION STATE
s example (scenario 2) = 0.06263 + 162.70 = 162.7626
→ why renormalise? = tgt_i / g_i is in ‘weight units’ not probability units. Dividing by s puts the result on the simplex so it is comparable to GABA's output and fits the loss-combination convention L = sum(lambda_i · L_i).
Line 33: return np.array([w_rul / s, w_health / s])

Final normalised lambda. Same ndarray (2,) shape as GABA's output so the two methods can be compared element-by-element.

EXECUTION STATE
📚 np.array([...]) = Same constructor as line 16 — packs a 2-element list into a typed float64 array.
→ element 0 (RUL) example = scenario 2: 0.06263 / 162.7626 = 0.000385
→ element 1 (health) example = scenario 2: 162.70 / 162.7626 = 0.999615
⬆ return: ndarray (2,) = [0.000385, 0.999615] for scenario 2. Sum: 1.0. Compare against GABA's [0.001996, 0.998004] — GradNorm pushes ≈5x further toward health.
Line 37: scenarios = [...]

List of three (name, L_rul, L_health) triples. Same gradient norms across every scenario — only the per-task losses change. The test: does GABA budge? does GradNorm budge?

EXECUTION STATE
📚 list literal [...] = Python list, 3 elements. Each element is a tuple of (str, float, float).
scenario 1 — equal progress = ("equal progress", 10.0, 1.0) L_rul/L0_rul = 0.5, L_health/L0_health = 0.5 ⇒ r_rul = r_health → GABA == GradNorm
scenario 2 — health lagging = ("health is lagging", 5.0, 1.5) L_rul/L0_rul = 0.25, L_health/L0_health = 0.75 ⇒ r_health > r_rul → GradNorm shifts FURTHER toward health
scenario 3 — health learning fast = ("health is learning fast", 10.0, 0.5) L_rul/L0_rul = 0.5, L_health/L0_health = 0.25 ⇒ r_rul > r_health → GradNorm RELAXES the health weight, gives some back to RUL
→ fixed gradients = g_rul = 5.0, g_health = 0.01 in EVERY scenario. So GABA's answer is identical [0.001996, 0.998004] in all three. Only GradNorm moves.
Line 38: scenario 1: equal progress

Tuple literal. L_rul = L0_rul / 2 = 10.0 (loss has halved). L_health = L0_health / 2 = 1.0 (loss has halved). Both relative ratios are 0.5 — symmetric progress.

EXECUTION STATE
L0_rul / 2 = 20.0 / 2 = 10.0
L0_health / 2 = 2.0 / 2 = 1.0
→ expected behaviour = r_rul = r_health = 0.5 ⇒ tgt_rul = tgt_health ⇒ GradNorm gives the same answer as GABA. Test of correctness: when there is no asymmetry in progress, the alpha-signal vanishes.
Line 39: scenario 2: health is lagging

L_rul = L0_rul / 4 = 5.0 (loss has quartered — RUL is well ahead). L_health = L0_health · 0.75 = 1.5 (loss has barely moved — health is dragging).

EXECUTION STATE
L0_rul / 4 = 20.0 / 4 = 5.0
L0_health * 0.75 = 2.0 · 0.75 = 1.5
→ expected behaviour = r_rul = 0.25, r_health = 0.75. GradNorm raises tgt_health 5.2x above tgt_rul. Net effect: GradNorm pushes 99.96% to health vs GABA's 99.80%. The ‘restoring force’ in action.
Line 40: scenario 3: health is learning fast

L_rul = L0_rul / 2 = 10.0 (RUL halved). L_health = L0_health / 4 = 0.5 (health quartered — pulled ahead). Now RUL is the relatively-lagging task.

EXECUTION STATE
L0_rul / 2 = 20.0 / 2 = 10.0
L0_health / 4 = 2.0 / 4 = 0.5
→ expected behaviour = r_rul = 0.5, r_health = 0.25. Now GradNorm RAISES the RUL target and RELAXES the health weight from 99.80% (GABA) down to 99.44%. GABA stays put — it never sees the L's.
Line 41: ] (close scenarios list)

End of the list literal. Three tuples now bound to the name `scenarios`.

Line 42: print(f-string header)

Formatted header row for the comparison table. f-string with format-specs <24 (left-align in 24 chars), >10 (right-align in 10 chars), >12 (right-align in 12 chars).

EXECUTION STATE
📚 f-string = Python literal-string interpolation. f"...{expr:spec}..." replaces {expr:spec} with str(expr) formatted per spec.
→ :<24 = Left-aligned in a 24-char field, padded with spaces.
→ :>10 = Right-aligned in a 10-char field. Used for short numeric columns like ‘GABA rul’.
→ :>12 = Right-aligned in a 12-char field. For slightly wider headers like ‘GABA health’.
Output = scenario | GABA rul GABA health | GN rul GN health
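The format specs can be tried in isolation. The vertical bars below are only there to make the padding visible, and the variable names are illustrative.

```python
left = f"|{'scenario':<24}|"            # left-aligned, width 24
right = f"|{'GABA rul':>10}|"           # right-aligned, width 10
number = f"|{0.001996007984:>10.6f}|"   # right-aligned, 6 decimal places

print(left)
print(right)
print(number)
```

Each field pads with spaces up to the requested width, so rows built from the same specs line up column by column.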
Line 43: print("-" * 84)

Separator line of 84 hyphens — Python's string * int operator repeats the string. Width chosen to span the full table.

EXECUTION STATE
📚 "-" * 84 = String repetition operator. Result: a single string of 84 "-" characters.
Output = ------------------------------------------------------------------------------------
Line 44: for name, L_rul, L_health in scenarios:

Iterate the three scenarios. Each iteration tuple-unpacks the (str, float, float) triple into three names, then computes lam_g (GABA), the GradNorm targets, and lam_gn (GradNorm converged).

EXECUTION STATE
📚 for ... in iterable = Python for-loop. Calls iter(scenarios) once, then next(...) for each iteration.
→ tuple unpacking = `name, L_rul, L_health = (str, float, float)`. Three names assigned in one step. Mismatch in arity ⇒ ValueError at runtime.
LOOP TRACE · 3 iterations
iter 1 — name='equal progress', L_rul=10.0, L_health=1.0
lam_g (GABA) = [0.001996, 0.998004] (always — GABA does not see L)
r_rul, r_health = 0.5, 0.5
tgt_rul, tgt_health = 0.8857, 0.8857 (equal because r's are equal)
lam_gn (GradNorm) = [0.001996, 0.998004]
→ verdict = IDENTICAL. When both tasks have made the same relative progress, the alpha-signal vanishes and GradNorm collapses to the same answer GABA gives unconditionally.
iter 2 — name='health is lagging', L_rul=5.0, L_health=1.5
lam_g (GABA) = [0.001996, 0.998004] (unchanged)
r_rul, r_health = 0.25, 0.75
tgt_rul, tgt_health = 0.3131, 1.6270 — GradNorm targets 5.2x larger gradient on the lagging task
lam_gn (GradNorm) = [0.000385, 0.999615]
→ verdict = GradNorm is MORE aggressive — pushes 99.96% to health vs GABA's 99.80%. The ‘restoring force’ effect.
iter 3 — name='health is learning fast', L_rul=10.0, L_health=0.5
lam_g (GABA) = [0.001996, 0.998004] (unchanged)
r_rul, r_health = 0.5, 0.25
tgt_rul, tgt_health = 0.8857, 0.3131 — RUL is now the relatively-lagging task
lam_gn (GradNorm) = [0.005625, 0.994375]
→ verdict = GradNorm RELAXES the health weight from 99.80% (GABA) down to 99.44%, redirecting attention to RUL. GABA does not move.
Line 45: lam_g = gaba_weights(g_rul, g_health)

Call GABA. SAME for every iteration — GABA never reads the L's, so the answer is invariant in this loop. Three identical computations, only included to make the symmetry visible in the printed output.

EXECUTION STATE
⬇ args: g_rul=5.0, g_health=0.01 = Module-level constants from line 7, passed explicitly as arguments (no closure is involved). The function does not take L_rul or L_health.
⬆ lam_g (every iteration) = [0.001996, 0.998004]
→ invariance is the point = The whole demo exists to print this constant 3 times next to GradNorm's 3 different answers. Visualises ‘GABA is closed-form in g; GradNorm is dynamical in L'.
Line 46: tgt_r, tgt_h = gradnorm_targets(g_rul, g_health, L_rul, L_health, L0_rul, L0_health, alpha)

Call GradNorm Step 1. Returns a tuple of two floats; tuple-unpack into tgt_r and tgt_h. DIFFERENT every iteration because the L's feed in.

EXECUTION STATE
→ tuple unpack = Python feature — `a, b = func()` works iff func() returns a length-2 iterable.
iter 1 → (tgt_r, tgt_h) = (0.8857, 0.8857)
iter 2 → (tgt_r, tgt_h) = (0.3131, 1.6270)
iter 3 → (tgt_r, tgt_h) = (0.8857, 0.3131)
Line 47: lam_gn = gradnorm_converged(tgt_r, tgt_h, g_rul, g_health)

Call GradNorm Step 2 (analytic). Solves for the lambda at zero auxiliary loss using the targets just computed.

EXECUTION STATE
iter 1 → lam_gn = [0.001996, 0.998004] (matches GABA)
iter 2 → lam_gn = [0.000385, 0.999615] (more aggressive on health)
iter 3 → lam_gn = [0.005625, 0.994375] (relaxes toward RUL)
Line 48: print(f-string row)

One formatted row per scenario. Format specs >10.6f / >12.6f mean ‘right-aligned, 6 decimal places’.

EXECUTION STATE
📚 :>10.6f = Right-aligned in 10 chars, 6 digits after the decimal point. Lines up the columns visually.
Final output =
scenario                 |   GABA rul  GABA health |     GN rul    GN health
------------------------------------------------------------------------------------
equal progress           |   0.001996     0.998004 |   0.001996     0.998004
health is lagging        |   0.001996     0.998004 |   0.000385     0.999615
health is learning fast  |   0.001996     0.998004 |   0.005625     0.994375
→ reading the table = GABA column constant down all three rows; GradNorm column moves with relative progress. That single contrast is the entire conceptual gap between the methods.
"""GABA's closed form vs GradNorm's auxiliary loss - same goal, different machinery."""

import numpy as np


# ---------- Realistic FD002 numbers from section 12.3 ----------
g_rul, g_health = 5.0, 0.01           # backbone gradient norms
L0_rul, L0_health = 20.0, 2.0         # initial losses (frozen at step 0)
alpha = 1.5                            # GradNorm restoring-force exponent (paper)


# ---------- GABA: closed-form inverse-proportional ----------
def gaba_weights(g_rul: float, g_health: float) -> np.ndarray:
    """One division. Zero learnable parameters."""
    S = g_rul + g_health
    return np.array([g_health / S, g_rul / S])


# ---------- GradNorm: target gradient norms + auxiliary descent ----------
def gradnorm_targets(g_rul, g_health, L_rul, L_health, L0_rul, L0_health, alpha):
    """Step 1 of GradNorm: tgt_i = avg_g * (L_i / L_i^0) ** alpha."""
    avg_g    = (g_rul + g_health) / 2.0
    r_rul    = L_rul    / (L0_rul    + 1e-8)
    r_health = L_health / (L0_health + 1e-8)
    return avg_g * (r_rul ** alpha), avg_g * (r_health ** alpha)


def gradnorm_converged(tgt_rul, tgt_health, g_rul, g_health):
    """At aux-loss minimum: w_i * g_i = tgt_i. Solve and renormalise."""
    w_rul    = tgt_rul    / g_rul
    w_health = tgt_health / g_health
    s = w_rul + w_health
    return np.array([w_rul / s, w_health / s])


# ---------- Three scenarios on the SAME gradients ----------
scenarios = [
    ("equal progress",         L0_rul / 2,  L0_health / 2),
    ("health is lagging",       L0_rul / 4,  L0_health * 0.75),
    ("health is learning fast", L0_rul / 2,  L0_health / 4),
]
print(f"{'scenario':<24} | {'GABA rul':>10} {'GABA health':>12} | {'GN rul':>10} {'GN health':>12}")
print("-" * 84)
for name, L_rul, L_health in scenarios:
    lam_g    = gaba_weights(g_rul, g_health)
    tgt_r, tgt_h = gradnorm_targets(g_rul, g_health, L_rul, L_health, L0_rul, L0_health, alpha)
    lam_gn   = gradnorm_converged(tgt_r, tgt_h, g_rul, g_health)
    print(f"{name:<24} | {lam_g[0]:>10.6f} {lam_g[1]:>12.6f} | {lam_gn[0]:>10.6f} {lam_gn[1]:>12.6f}")
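As a sanity check on the scenarios above, here is a minimal sketch in plain Python (no NumPy) of one property worth seeing directly: when both tasks have made the same relative progress, the weights GradNorm converges to coincide with GABA's closed form, and they only drift apart once the loss ratios differ. The helper names `gaba` and `gradnorm_fixed_point` are illustrative, and the gradient norms 5.0 and 0.01 are reused from the listing above.

```python
def gaba(g_rul, g_health):
    """Closed-form inverse-proportional weights (lambda_rul, lambda_health)."""
    s = g_rul + g_health
    return g_health / s, g_rul / s


def gradnorm_fixed_point(g_rul, g_health, r_rul, r_health, alpha=1.5):
    """Normalised weights at the aux-loss minimum, where w_i * g_i = avg_g * r_i ** alpha."""
    avg_g = (g_rul + g_health) / 2.0
    w_rul = avg_g * r_rul ** alpha / g_rul
    w_health = avg_g * r_health ** alpha / g_health
    s = w_rul + w_health
    return w_rul / s, w_health / s


g_rul, g_health = 5.0, 0.01
lam = gaba(g_rul, g_health)
equal = gradnorm_fixed_point(g_rul, g_health, r_rul=0.5, r_health=0.5)
lagging = gradnorm_fixed_point(g_rul, g_health, r_rul=0.25, r_health=0.75)

print(f"GABA              lam_rul={lam[0]:.6f}")
print(f"GradNorm (equal)  lam_rul={equal[0]:.6f}")
print(f"GradNorm (lagging health) lam_rul={lagging[0]:.6f}")
```

With equal ratios the common factor `avg_g * r ** alpha` cancels in the normalisation, leaving weights proportional to `1 / g_i`, which for two tasks is exactly the GABA rule. When health lags, GradNorm shifts even more weight away from RUL than GABA does.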

PyTorch: Why GradNorm Needs create_graph=True

The single line that captures the operational gap is create_graph=False for GABA vs create_graph=True for GradNorm. The first is cheap; the second roughly doubles memory because the intermediate tensors of the gradient computation must be retained for the second backward pass.
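To make the flag concrete before the full walkthrough, here is a minimal sketch on a toy scalar problem. The names `w` (a stand-in task weight) and `theta` (a stand-in shared parameter) are hypothetical and not taken from the paper's code; only `torch.autograd.grad` and its `create_graph` / `retain_graph` arguments are real PyTorch API.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)      # hypothetical learnable task weight
theta = torch.tensor(3.0, requires_grad=True)  # hypothetical shared parameter
loss = w * theta ** 2                          # weighted loss: w * L(theta)

# First-order only: the gradient comes back as a plain value with no history.
(g_plain,) = torch.autograd.grad(loss, theta, retain_graph=True, create_graph=False)

# Second-order enabled: the gradient itself stays attached to the graph.
(g_diff,) = torch.autograd.grad(loss, theta, create_graph=True)

print(g_plain.requires_grad, g_plain.grad_fn)  # no history to back-propagate through
print(g_diff.requires_grad)                    # differentiable w.r.t. w and theta

# Because g_diff = 2 * w * theta is still on the graph, its magnitude can be
# differentiated w.r.t. the task weight: d|g|/dw = 2 * theta.
(dnorm_dw,) = torch.autograd.grad(g_diff.abs(), w)
print(dnorm_dw.item())
```

The last call is only possible on the `create_graph=True` result; attempting the same on `g_plain` fails because it carries no graph. That is the capability GradNorm pays for and GABA never needs.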

Same forward, two autograd budgets
🐍autograd_cost_difference.py
1Module docstring

States the operational contrast in one sentence: the SAME helper grad_norm() is called twice, once with create_graph=False (GABA path — cheap) and once with create_graph=True (GradNorm path — second-order autograd, ~2x memory). Everything below is built around that single flag.

3import torch

PyTorch's top-level package. Provides Tensor, autograd, optimisers, and tensor constructors (torch.randn, torch.rand, torch.randint, torch.ones).

EXECUTION STATE
📚 torch = Tensor library with reverse-mode autograd. Tensors are like NumPy arrays but track operations on a computation graph so gradients can be computed by .backward() or torch.autograd.grad().
→ why we need it here = We need autograd to compute gradient norms on the shared backbone. The whole demo turns on the create_graph flag exposed by torch.autograd.grad — there is no NumPy equivalent.
4import torch.nn as nn

Neural-network building blocks. We use nn.Linear (fully-connected layer) and nn.Parameter (learnable tensor wrapper). Aliasing as `nn` is the universal PyTorch convention.

EXECUTION STATE
📚 torch.nn = Submodule with layers (Linear, Conv2d, LayerNorm, ...) and Module/Parameter classes. nn.Module is the base class for all model components; nn.Parameter is how you mark a tensor as ‘trainable model state’.
→ as nn = Universal alias. Almost every PyTorch tutorial uses it.
6torch.manual_seed(0)

Fix the global PyTorch random number generator so this demo is bit-exactly reproducible. Without this line, every run of the script would produce different random weights, different losses, different gradient norms — making the printed numbers untrustworthy.

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets the global PRNG used by torch.randn, torch.rand, torch.randint, and the random initialisers in nn.Linear/Conv/etc. Returns a torch.Generator (which we ignore here).
⬇ arg: seed = 0 = Any int works. 0 is the conventional default for ‘reproducible demo’. Using a different seed would change the numerical values below but not the qualitative result.
→ caveat = Reproducibility on GPU also requires torch.use_deterministic_algorithms(True) and CUBLAS_WORKSPACE_CONFIG=:4096:8 — but on CPU (this demo), manual_seed alone is enough.
7backbone = nn.Linear(14, 32)

Tiny shared backbone. Both task heads (rul_head and hp_head) share its 480 parameters. In the real paper this is a much larger encoder (LSTM or Transformer); here a single Linear layer is enough to demonstrate the gradient-balancing mechanics.

EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias=True) = PyTorch module: fully-connected layer. Forward: output = x @ W.T + b. Stores W of shape (out, in) and b of shape (out,). By default W is initialised Kaiming-uniform and b uniformly within ±1/sqrt(in_features).
⬇ arg 1: in_features = 14 = Number of input dimensions. Matches the §5 C-MAPSS sensor count (14 turbofan sensors). Sets the number of COLUMNS in the weight matrix W.
⬇ arg 2: out_features = 32 = Number of output dimensions. Sets the number of ROWS in W. 32 hidden features is small enough to keep the demo fast but large enough that both heads have something to learn from.
→ parameter count = W has 14·32 = 448 entries; b has 32. Total = 480 trainable scalars. Both rul_head and hp_head will compute gradients on these same 480 parameters.
→ why ‘shared’? = These 480 params are the SHARED backbone the multi-task learning is trying to balance. The whole GABA-vs-GradNorm question is: how much should each task pull on these specific 480 numbers?
8rul_head = nn.Linear(32, 1)

RUL regression head: maps the 32-dim shared features to a single scalar — the predicted Remaining Useful Life in cycles.

EXECUTION STATE
⬇ arg 1: in_features = 32 = Must match the backbone's out_features so that rul_head(backbone(x)) is dimensionally valid.
⬇ arg 2: out_features = 1 = Single scalar output — the regression prediction.
→ parameter count = 32·1 + 1 = 33 trainable scalars. NOT counted as backbone: gradients on rul_head are NOT what we're balancing. Only gradients on `shared` matter.
9hp_head = nn.Linear(32, 3)

Health classification head: maps 32-dim features to 3 logits (Normal / Degrading / Critical). Softmax is applied implicitly inside cross_entropy on line 18.

EXECUTION STATE
⬇ arg 1: in_features = 32 = Same hidden width as the backbone output.
⬇ arg 2: out_features = 3 = Three classes. Each output is an unnormalised logit.
→ parameter count = 32·3 + 3 = 99 trainable scalars. Again NOT in `shared`.
10shared = list(backbone.parameters())

Materialise the backbone's parameters as a Python list. We compute gradient norms on THIS list for both RUL and health. That's how the ‘same backbone, two pulls’ story shows up in code.

EXECUTION STATE
📚 nn.Module.parameters() = Method on every nn.Module. Returns a generator yielding all nn.Parameter instances registered under this module (recursively). Generators can only be walked once, which is why we wrap in list().
📚 list(iterable) = Python builtin: drains an iterator into a list. Result: a concrete list we can re-walk multiple times (once per gradient-norm call).
shared = List of length 2: [W, b] for the backbone Linear layer. W is a (32, 14) Parameter; b is a (32,) Parameter. Total elements: 480.
→ why list, not generator? = torch.autograd.grad is called several times on the same params (once per task, on each of the two paths). A generator would be exhausted after the first call, so we materialise eagerly.
12x = torch.randn(64, 14)

Synthetic mini-batch of 64 samples, 14 sensor channels each, drawn from N(0, 1). Stand-in for a real C-MAPSS window; the gradient mechanics don't care about realism.

EXECUTION STATE
📚 torch.randn(*size) = Sample from the standard normal N(0, 1). Variadic: torch.randn(64, 14) returns shape (64, 14). dtype defaults to float32 on CPU.
⬇ arg: 64 = Batch size. Big enough that the loss is stable but small enough to keep the gradient computation snappy.
⬇ arg: 14 = Sensor channels. Matches backbone's in_features.
x = Tensor (64, 14), float32, no requires_grad. Only model parameters need gradients; inputs do not.
13rul_target = torch.rand(64, 1) * 125.0

Synthetic RUL targets uniformly in [0, 125] cycles. Multiply by 125 because torch.rand is uniform on [0, 1). 125 is the canonical FD002 piecewise-linear cap from the C-MAPSS preprocessing convention (paper §5).

EXECUTION STATE
📚 torch.rand(*size) = Uniform in [0, 1). Shape (64, 1). dtype float32. Different from torch.randn (which is normal).
⬇ arg: (64, 1) = Match the regression head output shape so we can compute (rul_head(feat) - rul_target) without broadcasting surprises.
→ * 125.0 = Element-wise scalar broadcast. Each entry in [0, 1) becomes a value in [0, 125). 125 is the standard RUL cap used in C-MAPSS papers.
rul_target = Tensor (64, 1) of float32 in [0, 125).
14hp_target = torch.randint(0, 3, (64,))

Synthetic class labels in {0, 1, 2}. Required for cross_entropy, which expects int64 class indices, NOT one-hot vectors.

EXECUTION STATE
📚 torch.randint(low, high, size) = Uniform integer tensor in the half-open interval [low, high). Default dtype: int64 (required by cross_entropy).
⬇ arg 1: low = 0 = Inclusive lower bound. Smallest class index.
⬇ arg 2: high = 3 = EXCLUSIVE upper bound. So the labels are in {0, 1, 2} — exactly 3 classes.
⬇ arg 3: size = (64,) = 1D tensor of length 64 — one class label per sample. Note the trailing comma: (64,) is a tuple of length 1, not the int 64.
hp_target = Tensor (64,) of int64. Each entry ∈ {0, 1, 2}. Roughly 21–22 of each class on average.
16feat = backbone(x)

Forward pass through the shared backbone. Calling a Module like a function invokes its .__call__, which calls .forward(x) plus any hooks. This is where the autograd graph starts being built — every subsequent op on `feat` is recorded.

EXECUTION STATE
📚 nn.Module.__call__(x) = Dunder method. Wraps .forward() with hook plumbing. Always prefer module(x) over module.forward(x) so hooks fire correctly.
→ computation = feat = x @ W.T + b. x is (64, 14), W is (32, 14), so x @ W.T is (64, 32). Add b (broadcast over batch dim) and you get (64, 32).
feat = Tensor (64, 32), float32, requires_grad=True (because backbone params have it). This is the SHARED representation both heads will consume.
17rul_loss = ((rul_head(feat) - rul_target) ** 2).mean()

Mean squared error on RUL. Four operations: pass features through rul_head, subtract target, square element-wise, then mean-reduce to a scalar. The whole computation is recorded on the autograd graph.

EXECUTION STATE
→ step 1: rul_head(feat) = Tensor (64, 1). Predicted RUL for each of the 64 samples.
→ step 2: - rul_target = Element-wise subtraction. Both are (64, 1) — no broadcast needed. Result is the residual e = ŷ - y.
→ step 3: ** 2 = Element-wise square. e^2 ≥ 0 everywhere.
📚 .mean() = Reduction. With no dim argument, reduces ALL dimensions to a 0-dim scalar tensor. Equivalent to .sum() / numel().
rul_loss = 0-dim tensor ≈ 5318 for seed 0 (random init, predictions are noise around 0; targets in [0, 125]; squared residual ≈ 5000).
→ why so big? = MSE is unbounded. With targets up to 125 and untrained predictions ~ 0, squared residuals can hit 15625. That magnitude is what makes g_rul ≫ g_health later.
18health_loss = nn.functional.cross_entropy(hp_head(feat), hp_target)

Standard 3-class cross-entropy. Combines log-softmax with negative log-likelihood in a single, numerically-stable op. This is the QUIET task in the imbalance: its gradient ends up ~200x smaller than RUL's.

EXECUTION STATE
📚 F.cross_entropy(input, target, ...) = Functional cross-entropy. input: logits (B, C) — UNNORMALISED scores. target: int64 class indices (B,) — NOT one-hot. Returns 0-dim mean loss.
⬇ arg 1: hp_head(feat) = Tensor (64, 3). Raw logits, NOT softmax probabilities — F.cross_entropy applies log-softmax internally.
⬇ arg 2: hp_target = Tensor (64,) of int64. Required dtype.
→ math = loss = -mean_b log(softmax(logits_b)[target_b]). For random init, softmax ≈ uniform 1/3, so log ≈ ln(1/3) ≈ -1.0986. Final loss ≈ 1.10.
health_loss = 0-dim tensor ≈ 1.10 (≈ ln 3, matching the random-guess floor).
→ vs RUL = rul_loss ≈ 5318 vs health_loss ≈ 1.10. ~5000x ratio at the LOSS level. The gradient ratio (after the chain rule through the heads) ends up ~200x. This is the imbalance regime GABA was built for.
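A back-of-the-envelope check of those two loss magnitudes can be sketched in plain Python. It assumes an idealised untrained model (RUL predictions of exactly zero, uniform class probabilities), which is an assumption for illustration and not the exact seed-0 state, so it reproduces the order of magnitude rather than the quoted 5318.

```python
import math

rul_cap = 125.0   # targets are uniform on [0, rul_cap)
n_classes = 3

# For y ~ Uniform[0, cap) and a prediction of 0, E[(0 - y)^2] = cap^2 / 3.
expected_mse = rul_cap ** 2 / 3

# Cross-entropy of a uniform prediction over n classes is ln(n).
expected_ce = math.log(n_classes)

print(f"expected MSE ~ {expected_mse:.1f}")
print(f"expected CE  ~ {expected_ce:.4f}")
print(f"loss ratio   ~ {expected_mse / expected_ce:.0f}x")
```

Both idealised values land close to the seed-0 figures, which suggests the imbalance comes from the loss definitions and target scale rather than from a lucky seed.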
21def grad_norm(loss, params, create_graph) → torch.Tensor

Helper that returns the L2 norm of the gradient of `loss` w.r.t. `params`. Same function called for both methods; the ONLY difference is the create_graph flag. That single bool decides whether we pay second-order autograd cost.

EXECUTION STATE
⬇ input: loss (torch.Tensor) = 0-dim tensor to differentiate. Could be rul_loss (≈ 5318) or weighted_rul (= w · rul_loss).
⬇ input: params (list) = List of nn.Parameter. The gradient is taken w.r.t. these. We pass `shared` (the 480 backbone params) every time.
⬇ input: create_graph (bool) = False (GABA): standard reverse-mode autograd. True (GradNorm): build a SECOND graph through the gradient computation itself, so the resulting norm tensor stays differentiable.
→ why this single flag matters = create_graph=True roughly DOUBLES memory because intermediate activations through the gradient computation must be retained for the second backward pass. That cost is the entire structural difference between the two methods.
→ torch.Tensor (return type hint) = Documentation only. Tells type-checkers we return a tensor, not a float.
⬆ returns = 0-dim tensor: ||grad(loss, params)||_2. With create_graph=True, the tensor carries autograd history; with False, it's a plain value.
22grads = torch.autograd.grad(loss, params, retain_graph=True,

Functional autograd call (continued on next line). Returns the gradients of `loss` w.r.t. each entry of `params` as a tuple of tensors, WITHOUT writing them to .grad. The three keyword arguments (one on this line, two on the next) tune the autograd behaviour.

EXECUTION STATE
📚 torch.autograd.grad(outputs, inputs, ...) = Functional differentiation. Returns a tuple of gradients with the same length as `inputs`. Unlike loss.backward(), it does NOT accumulate into param.grad, so it leaves the model untouched. Useful for any ‘compute a gradient as part of another loss’ pattern (meta-learning, GradNorm, etc.).
⬇ arg 1: outputs = loss = 0-dim tensor. The thing being differentiated. Must be a scalar unless grad_outputs is supplied.
⬇ arg 2: inputs = params = List of leaf tensors w.r.t. which to differentiate. Result tuple has same length and order.
⬇ kwarg: retain_graph=True = Keep the autograd graph alive after this call so we can run grad() AGAIN for the OTHER task on the same forward pass. Without it, PyTorch frees buffers after the first call ⇒ the second call would crash with ‘Trying to backward through the graph a second time’.
23 create_graph=create_graph, allow_unused=True)

Continuation of the call from the previous line. The remaining two kwargs control whether we build a second-order graph and how to handle params that don't affect this loss.

EXECUTION STATE
⬇ kwarg: create_graph=create_graph = If True, the returned gradient tensors carry an autograd graph back to whatever produced `loss` (including learnable task weights). Required for GradNorm so aux_loss.backward() can update task_weights. If False (GABA), gradients are plain values.
→ memory cost = create_graph=True ≈ 2x peak memory because intermediate activations through the grad computation are retained for the second backward. GABA pays nothing extra; GradNorm pays this every step. On a 100M-param model this can be the difference between fitting and OOM.
⬇ kwarg: allow_unused=True = If a param does not contribute to `loss` (e.g. rul_head params don't affect health_loss), tolerate it and return None for that entry instead of raising. We filter Nones on the next line.
grads = Tuple of length 2: (W.grad_tensor, b.grad_tensor) with shapes (32, 14) and (32,). Each is a Tensor; if create_graph=True it carries graph metadata back to task_weights.
24sq = sum((g.norm() ** 2 for g in grads if g is not None))

Sum of SQUARED L2 norms across all gradient tensors. Squared first, summed, then square-rooted on the next line: the standard recipe for ||g_total||_2 across multiple parameter tensors.

EXECUTION STATE
📚 generator expression = (expr for x in iterable if cond) — lazy iterator. Each `g.norm() ** 2` is computed only when sum() pulls it. Filters out None entries from allow_unused before squaring.
📚 Tensor.norm() = Tensor method. Default p='fro' (Frobenius), which on a 1D/2D tensor is the L2 norm. Returns a 0-dim tensor.
→ ** 2 = Element-wise square. On a 0-dim tensor this is just the scalar squared.
→ if g is not None = Drop unused-param entries. Required because allow_unused=True can return None.
📚 sum(iterable) = Python builtin. Adds successive elements with +; tensors override + so the result stays a 0-dim tensor (NOT a Python float).
sq = 0-dim tensor. ||W.grad||² + ||b.grad||². For seed 0 RUL: ≈ 5857.
→ why squared sum? = L2 norm of a CONCATENATED vector = sqrt(sum of squared L2 norms of pieces). Avoids actually concatenating the tensors — cheaper.
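The identity behind that shortcut is easy to confirm on toy numbers. The sketch below uses plain Python lists as stand-ins for the flattened weight and bias gradients; the values are made up so that the arithmetic is exact.

```python
import math


def l2(v):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


w_grad = [3.0, 4.0]   # stand-in for the flattened weight gradient (norm 5)
b_grad = [12.0]       # stand-in for the bias gradient (norm 12)

# Route used in grad_norm(): square each piece's norm, sum, then sqrt.
norm_from_pieces = math.sqrt(l2(w_grad) ** 2 + l2(b_grad) ** 2)

# Reference route: concatenate first, then take one norm.
norm_from_concat = l2(w_grad + b_grad)

print(norm_from_pieces, norm_from_concat)  # 13.0 13.0
```

Both routes give the same number; the piecewise route simply avoids allocating the concatenated vector.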
25return sq.sqrt()

Square root recovers ||g||_2 from the sum-of-squares. Returned as a 0-dim tensor, which is key: if create_graph=True it carries autograd history back into task_weights.

EXECUTION STATE
📚 Tensor.sqrt() = Element-wise square root. On a 0-dim tensor: just sqrt(scalar). Differentiable (derivative of sqrt(x) is 1/(2·sqrt(x))).
⬆ return = 0-dim tensor. For seed 0, RUL path: √5857 ≈ 76.5335. Health path: ≈ 0.3665.
→ autograd payload = If create_graph=False: plain tensor, .grad_fn=None. If create_graph=True: .grad_fn is a SqrtBackward0 node, and calling .backward() on this can flow gradients all the way back to task_weights.
28# ---------- GABA path: cheap (no second-order autograd) ----------

Section header marking the cheap path. Everything below until line 33 uses create_graph=False — vanilla autograd, ~1x memory.

29g_rul_g = grad_norm(rul_loss, shared, create_graph=False)

GABA reads ||grad L_rul|| on the backbone WITHOUT building a second-order graph. The returned tensor is a plain value with no autograd history — fast and cheap.

EXECUTION STATE
⬇ arg 1: rul_loss = The 0-dim MSE loss from line 17.
⬇ arg 2: shared = The 480-element backbone parameter list.
⬇ arg 3: create_graph=False = GABA flag. Standard reverse-mode autograd ≈ 1x memory. The returned norm has .grad_fn=None, so it cannot be back-propped further. That's fine because GABA never needs to.
g_rul_g = 0-dim tensor ≈ 76.5335 for seed 0. Plain value: it has no grad_fn, so calling .backward() on it would raise a RuntimeError.
→ cost = ONE forward + ONE backward through `rul_loss → shared`. Zero extra memory beyond standard training.
30g_health_g = grad_norm(health_loss, shared, create_graph=False)

Same call, different loss. retain_graph=True inside grad_norm() means the autograd graph from the joint forward is still alive — so this second gradient call works without re-running the forward pass.

EXECUTION STATE
⬇ arg 1: health_loss = The 0-dim cross-entropy loss from line 18.
g_health_g = 0-dim tensor ≈ 0.3665 for seed 0. ~200x smaller than g_rul_g.
→ ratio = 76.5335 / 0.3665 ≈ 209x imbalance. Same regime as the FD002 paper measurement (500x median across batches).
31S = g_rul_g + g_health_g

Sum-of-norms. The denominator of the GABA closed form. Note both operands are 0-dim tensors so the result is also a 0-dim tensor — element-wise addition on scalars.

EXECUTION STATE
S = 0-dim tensor ≈ 76.9000 (76.5335 + 0.3665).
→ contrast with NumPy version = On line 15 of the NumPy file we wrote the same expression on Python floats. Here the operands are tensors, so PyTorch's + produces another tensor that could (in principle) be used in a larger autograd graph. We won't: GABA is closed-form.
32print(f"GABA : g_rul={g_rul_g.item():.4f} g_health={g_health_g.item():.4f}")

Print the two gradient norms. .item() converts a 0-dim tensor to a Python float, so the :.4f format spec is applied to an ordinary number and the conversion is explicit.

EXECUTION STATE
📚 Tensor.item() = Returns a Python scalar (float or int). Requires a tensor with exactly one element (0-dim or shape (1,)); on a multi-element tensor it raises an error. Useful for logging, comparisons against Python ints, and f-string formatting.
→ :.4f = f-string format spec: floating-point with 4 decimals.
Output = GABA : g_rul=76.5335 g_health=0.3665
33print(f" lam_rul={(g_health_g / S).item():.6f} lam_health={(g_rul_g / S).item():.6f}")

Apply the GABA closed form INLINE inside the print f-string. Two divisions, six-decimal output. This is the entire GABA computation in real PyTorch — two scalar divisions on tensors, no .backward() needed.

EXECUTION STATE
(g_health_g / S) = 0.3665 / 76.9 = 0.004766. The lambda for RUL — note: numerator is g_HEALTH because of the inverse-proportional rule.
(g_rul_g / S) = 76.5335 / 76.9 = 0.995234. The lambda for health.
Output = lam_rul=0.004766 lam_health=0.995234
→ end of GABA path = Two prints, two divisions, no auxiliary loss, no inner SGD, no second-order autograd. Done. Compare line count and complexity to the GradNorm path below (lines 36–57).
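The closed form can be replayed by hand from the two printed norms. The sketch below uses the seed-0 values quoted above as plain floats and also checks the property that gives the rule its name: after weighting, both tasks pull on the backbone with the same effective gradient norm, g_rul · g_health / S.

```python
g_rul, g_health = 76.5335, 0.3665   # seed-0 norms quoted in the walkthrough
s = g_rul + g_health

# Inverse-proportional rule: each task is weighted by the OTHER task's norm.
lam_rul = g_health / s
lam_health = g_rul / s

# Effective pull of each task on the shared backbone after weighting.
pull_rul = lam_rul * g_rul
pull_health = lam_health * g_health

print(f"lam_rul={lam_rul:.6f}  lam_health={lam_health:.6f}")
print(f"pull_rul={pull_rul:.6f}  pull_health={pull_health:.6f}")
```

The two weights sum to one and the two pulls coincide, so the 209x imbalance is removed in a single step without any learnable state.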
36# ---------- GradNorm path: needs create_graph=True ----------

Section header. Everything below uses create_graph=True (lines 45–46), pays the second-order autograd cost, and ends with an aux_loss that still requires an SGD inner loop on task_weights to actually converge.

37task_weights = nn.Parameter(torch.ones(2))

Two LEARNABLE task weights, initialised to 1. nn.Parameter is the wrapper that marks a tensor as model state — it gets requires_grad=True automatically and shows up in module.parameters().

EXECUTION STATE
📚 nn.Parameter(data, requires_grad=True) = Subclass of Tensor. Difference from a plain tensor: when assigned to a Module attribute, it's registered with the module (shows up in .parameters() and is moved by .to(device), .cuda() etc.). requires_grad defaults to True.
📚 torch.ones(*size) = Tensor full of 1.0. Shape (2,) here.
task_weights = Parameter (2,) = [1.0, 1.0]. requires_grad=True. Index 0 = w_rul, index 1 = w_health.
→ cost vs GABA = GABA stores 0 learnable parameters. GradNorm stores K (here 2). For K=2 it's trivial; for K=10 detection task heads (bbox + cls + obj + ...) it adds 10 learnable scalars plus a separate optimiser state for them, and the aux-loss learning rate becomes a hyperparameter.
38L0_rul = rul_loss.detach().clone()

Snapshot the initial RUL loss into a tensor that has no autograd history and no shared storage with rul_loss. The chain .detach().clone() is the safe idiom for ‘take a frozen copy I can use later’.

EXECUTION STATE
📚 Tensor.detach() = Returns a new tensor that SHARES STORAGE with the original but has requires_grad=False and grad_fn=None. Cheap (no copy) but fragile if you mutate.
📚 Tensor.clone() = Returns a new tensor with a FRESH copy of the data. Independent storage. Combined with .detach(), gives a fully independent frozen snapshot.
→ why both? = Just .detach() shares storage, so if anything ever does in-place ops on rul_loss the snapshot is corrupted. Just .clone() preserves autograd history, which defeats the ‘frozen’ intent. Both together: independent + non-differentiable.
L0_rul = 0-dim tensor. Frozen value of rul_loss at this step (treated as ‘step 0’ for this single-step demo). For seed 0: ≈ 5318.
39L0_health = health_loss.detach().clone()

Same frozen snapshot for health.

EXECUTION STATE
L0_health = 0-dim tensor ≈ 1.10 (≈ ln 3). Independent of health_loss going forward.
40alpha = 1.5

GradNorm restoring-force exponent. Paper value (Chen 2018). Exactly the same as in the NumPy demo above.

EXECUTION STATE
alpha = 1.5 = Plain Python float — not a tensor because it never appears as something we differentiate.
42# Canonical GradNorm: g_i = ||grad(w_i * L_i, theta)|| -- depends on w_i

Critical comment. In CANONICAL GradNorm, the gradient norm is taken on the WEIGHTED loss (w_i · L_i), not on L_i alone. This is what makes the eventual aux_loss differentiable in task_weights — without it, the chain rule would not reach back to the learnable weights.

43weighted_rul = task_weights[0] * rul_loss

Scalar multiplication of a learnable Parameter and a 0-dim loss tensor. The product has BOTH `task_weights[0].grad_fn` and `rul_loss.grad_fn` in its computation graph.

EXECUTION STATE
📚 [0] (Tensor.__getitem__) = Indexing into a Parameter. Returns a 0-dim tensor view that still tracks gradients into the parent Parameter.
weighted_rul = 0-dim tensor ≈ 5318 (since task_weights[0]=1.0 initially). Differentiable in BOTH backbone params AND task_weights[0] — that dual differentiability is what makes the GradNorm aux-loss possible.
→ why this matters = When we later call grad_norm(weighted_rul, shared, create_graph=True), the resulting norm tensor will still have task_weights[0] in its history. So aux_loss.backward() can update task_weights[0].
44weighted_health = task_weights[1] * health_loss

Same construction for health.

EXECUTION STATE
weighted_health = 0-dim tensor ≈ 1.10. Differentiable in task_weights[1] and the backbone params.
45g_rul_gn = grad_norm(weighted_rul, shared, create_graph=True)

Gradient norm of the WEIGHTED loss with second-order autograd ENABLED. The returned tensor carries an autograd graph back to task_weights[0]. This is the line that pays ~2x memory.

EXECUTION STATE
⬇ arg 3: create_graph=True = GradNorm flag. Build a graph through the gradient computation itself so the result tensor remains differentiable. Required so aux_loss.backward() can update task_weights.
g_rul_gn = 0-dim tensor ≈ 76.5335 (NUMERICALLY identical to GABA's g_rul_g because task_weights[0]=1, but with a richer .grad_fn pointing back through the weighted product).
→ memory cost vs line 29 = Line 29 (create_graph=False): retains nothing extra. Line 45 (create_graph=True): retains all intermediate activations from the gradient computation so the second backward through the norm can run. On a 100M-param backbone, this is the difference between fitting and OOM.
46g_health_gn = grad_norm(weighted_health, shared, create_graph=True)

Same for health, also with second-order autograd. Two of these calls per step: that's the ongoing GradNorm tax.

EXECUTION STATE
g_health_gn = 0-dim tensor ≈ 0.3665. Carries autograd history back to task_weights[1].
48avg_g = (g_rul_gn + g_health_gn) / 2.0

Reference scale: average of the two weighted gradient norms. Becomes the pre-factor for both targets.

EXECUTION STATE
📚 / 2.0 = Tensor / Python-float. Dispatches to torch.div (scalar broadcast). Result is a 0-dim tensor.
avg_g = 0-dim tensor ≈ 38.4500.
49r_rul = rul_loss.detach() / (L0_rul + 1e-8)

Inverse training rate for RUL. .detach() because we don't want this ratio to backprop through rul_loss into the model: r_i is just a numerical signal driving the target, not part of the training objective.

EXECUTION STATE
📚 .detach() = Same idiom as line 38 but without .clone(). We don't mutate rul_loss, so sharing storage is fine here.
+ 1e-8 = Numerical guard against div-by-zero.
r_rul = 0-dim tensor = 1.0 (because L0_rul WAS rul_loss a moment ago — same value). In a real run with proper L0 from step 0, r_rul would drop below 1 as training progresses.
→ demo caveat = We took L0 = current loss for simplicity. So r=1 for both tasks ⇒ targets are EQUAL ⇒ this single-step demo agrees with GABA. The interesting GradNorm-vs-GABA divergence only appears across many steps.
50r_health = health_loss.detach() / (L0_health + 1e-8)

Same for health.

EXECUTION STATE
r_health = 0-dim tensor = 1.0 (same single-step caveat).
51tgt_rul = avg_g * (r_rul ** alpha)

GradNorm target gradient norm for the RUL task. With r=1.0 and alpha=1.5, r^alpha = 1.0, so tgt = avg_g.

EXECUTION STATE
📚 ** alpha = Tensor ** Python-float. Element-wise exponentiation, returns a tensor with autograd support.
tgt_rul = 0-dim tensor = 38.4500 · 1.0 = 38.4500.
52tgt_health = avg_g * (r_health ** alpha)

Same for health.

EXECUTION STATE
tgt_health = 0-dim tensor = 38.4500.
→ equal targets = Because we faked L0 = current loss for both tasks, both r=1 ⇒ both targets equal avg_g. In a real run they'd differ, and that's when GradNorm starts diverging from GABA.
54aux_loss = (g_rul_gn - tgt_rul).abs() + (g_health_gn - tgt_health).abs()

GradNorm AUXILIARY LOSS. Sum of absolute deviations between actual weighted gradient norms and their targets. This is what gets backpropped into task_weights — the SECOND backward pass that pays the create_graph=True memory bill.

EXECUTION STATE
📚 Tensor.abs() = Element-wise absolute value. Differentiable everywhere except 0 (where the subgradient is 0).
(g_rul_gn - tgt_rul).abs() = |76.5335 - 38.4500| = 38.0835. RUL gradient is way ABOVE its target — GradNorm wants to lower w_rul.
(g_health_gn - tgt_health).abs() = |0.3665 - 38.4500| = 38.0835. Health gradient is way BELOW its target — GradNorm wants to RAISE w_health.
aux_loss = 0-dim tensor ≈ 76.17. Large because the initial gradients are 200x apart while the targets are equal.
→ this is the entire GradNorm anchor = aux_loss = 0 ⇔ ||w_i · grad L_i|| = avg_g · r_i^alpha for every i. SGD on task_weights drives aux_loss → 0.
→ divergence seed = If we now do `aux_loss.backward()` and step task_weights with lr=0.05, the magnitude of grad of aux_loss w.r.t. task_weights[0] is ≈ 76.5 (chain rule through grad_norm). Step size ≈ 0.05·76.5 = 3.8 → w_rul = 1.0 - 3.8 = -2.8. NEGATIVE. The next forward pass inverts the RUL gradient — backbone walks AWAY from minimum — NaN within a few steps. This is the seed-789 N-CMAPSS divergence the paper observed.
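The arithmetic of that single unbounded step can be replayed in a few lines. This is a sketch of the scenario described above (plain SGD on w_rul with the quoted gradient magnitude and lr=0.05), not a reconstruction of the paper's actual training run; the extra gradient-norm pairs in the loop are made-up stress values.

```python
lr = 0.05
w_rul = 1.0
grad_w_rul = 76.5335   # magnitude quoted above for d(aux_loss)/d(w_rul)

# One plain SGD step on the learnable task weight, with no bound applied.
w_rul_next = w_rul - lr * grad_w_rul
print(f"w_rul after one step: {w_rul_next:.4f}")

# GABA's closed form for comparison: for any positive pair of norms the
# weight is a ratio of positives, so it stays strictly inside (0, 1).
lams = []
for g_rul, g_health in [(76.5335, 0.3665), (1e3, 1e-3), (1e-3, 1e3)]:
    lams.append(g_health / (g_rul + g_health))
print([f"{lam:.6f}" for lam in lams])
```

The learnable weight goes negative after one step, whereas the closed-form weight cannot leave the unit interval however extreme the imbalance.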
55print(f"GradNorm: g_rul(weighted)={g_rul_gn.item():.4f} g_health(weighted)={g_health_gn.item():.4f}")

Print the weighted gradient norms (.item() unwraps tensors to floats).

EXECUTION STATE
Output = GradNorm: g_rul(weighted)=76.5335 g_health(weighted)=0.3665
56print(f" tgt_rul={tgt_rul.item():.4f} tgt_health={tgt_health.item():.4f}")

Print the targets so the next line's aux_loss is interpretable as a deviation.

EXECUTION STATE
Output = tgt_rul=38.4500 tgt_health=38.4500
57print(f" aux_loss={aux_loss.item():.4f} ...")

Final print. The trailing parenthetical reminds the reader that this aux_loss is not a scalar to admire: it's the input to ANOTHER backward pass on task_weights. That backward is the second-order autograd cost we've been paying for since line 45.

EXECUTION STATE
Final output =
GABA   : g_rul=76.5335  g_health=0.3665
         lam_rul=0.004766  lam_health=0.995234
GradNorm: g_rul(weighted)=76.5335  g_health(weighted)=0.3665
          tgt_rul=38.4500  tgt_health=38.4500
          aux_loss=76.1670  (must be back-proppable to task_weights)
→ reading the gap = GABA produced FINAL lambdas in two divisions on line 33. GradNorm produced an aux_loss that still needs an SGD step on task_weights to nudge them — and another forward + (second-order) backward pair next iteration. The autograd cost difference is encoded in a single bool: create_graph.
→ why the paper picks GABA = Same operating point at convergence (when the r's are equal), roughly half the autograd memory, no learnable task weights to misconfigure, and no risk of a negative w_i sending the backbone divergent. The GradNorm path here is correct and well-motivated; it is costlier and less stable, and on the paper's multi-condition averages it does not score better.
"""GABA needs create_graph=False; GradNorm needs create_graph=True."""

import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(14, 32)
rul_head = nn.Linear(32, 1)
hp_head  = nn.Linear(32, 3)
shared = list(backbone.parameters())

x          = torch.randn(64, 14)
rul_target = torch.rand(64, 1) * 125.0
hp_target  = torch.randint(0, 3, (64,))

feat = backbone(x)
rul_loss    = ((rul_head(feat) - rul_target) ** 2).mean()
health_loss = nn.functional.cross_entropy(hp_head(feat), hp_target)


def grad_norm(loss: torch.Tensor, params: list, create_graph: bool) -> torch.Tensor:
    grads = torch.autograd.grad(loss, params, retain_graph=True,
                                 create_graph=create_graph, allow_unused=True)
    sq = sum((g.norm() ** 2 for g in grads if g is not None))
    return sq.sqrt()


# ---------- GABA path: cheap (no second-order autograd) ----------
g_rul_g    = grad_norm(rul_loss,    shared, create_graph=False)
g_health_g = grad_norm(health_loss, shared, create_graph=False)
S = g_rul_g + g_health_g
print(f"GABA   : g_rul={g_rul_g.item():.4f}  g_health={g_health_g.item():.4f}")
print(f"         lam_rul={(g_health_g / S).item():.6f}  lam_health={(g_rul_g / S).item():.6f}")


# ---------- GradNorm path: needs create_graph=True ----------
task_weights = nn.Parameter(torch.ones(2))
L0_rul    = rul_loss.detach().clone()
L0_health = health_loss.detach().clone()
alpha = 1.5

# Canonical GradNorm: g_i = ||grad(w_i * L_i, theta)||  --  depends on w_i
weighted_rul    = task_weights[0] * rul_loss
weighted_health = task_weights[1] * health_loss
g_rul_gn    = grad_norm(weighted_rul,    shared, create_graph=True)
g_health_gn = grad_norm(weighted_health, shared, create_graph=True)

avg_g    = (g_rul_gn + g_health_gn) / 2.0
r_rul    = rul_loss.detach()    / (L0_rul    + 1e-8)
r_health = health_loss.detach() / (L0_health + 1e-8)
tgt_rul    = avg_g * (r_rul    ** alpha)
tgt_health = avg_g * (r_health ** alpha)

aux_loss = (g_rul_gn - tgt_rul).abs() + (g_health_gn - tgt_health).abs()
print(f"GradNorm: g_rul(weighted)={g_rul_gn.item():.4f}  g_health(weighted)={g_health_gn.item():.4f}")
print(f"          tgt_rul={tgt_rul.item():.4f}  tgt_health={tgt_health.item():.4f}")
print(f"          aux_loss={aux_loss.item():.4f}  (must be back-proppable to task_weights)")

Paper Results: GABA vs GradNorm Head-to-Head

All numbers below are taken verbatim from paper_ieee_tii/latex/main.tex Table III (per-dataset) and Table IV (multi-condition average). Each configuration uses 5 random seeds with an identical model, preprocessing, and training pipeline; only the multi-task objective differs.

| Dataset | Method | RMSE | NASA Score | Comment |
|---|---|---|---|---|
| FD002 (multi-cond.) | GradNorm (α=1.5) | 8.19 ± 0.78 | 260.9 ± 36.1 | Worse than GABA on both metrics |
| FD002 (multi-cond.) | GABA | 7.53 ± 0.65 | 224.2 ± 22.4 | Best safety among adaptive methods |
| FD002 (multi-cond.) | GRACE (= GABA + AMNL) | 7.72 ± 0.66 | 223.4 ± 26.5 | Best NASA overall |
| FD004 (multi-cond.) | GradNorm (α=1.5) | 7.74 ± 0.59 | 222.9 ± 18.6 | Best RMSE on FD004 |
| FD004 (multi-cond.) | GABA | 8.25 ± 1.10 | 247.2 ± 60.3 | Wider seed variance |
| FD004 (multi-cond.) | GRACE | 8.12 ± 0.70 | 242.0 ± 25.6 | Tighter seed std than GABA on FD004 |
| MC avg (FD002+FD004) | GradNorm | 7.96 | 241.9 | |
| MC avg (FD002+FD004) | GABA | 7.89 | 235.7 | |
| MC avg (FD002+FD004) | GRACE | 7.92 | 232.7 | Best multi-condition NASA |
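The MC avg rows look like the unweighted mean of the FD002 and FD004 rows. That aggregation rule is an assumption here, not something the table states, but the arithmetic agrees with the reported values to within rounding. A quick check (the dictionary layout and function name are illustrative):

```python
# Per-dataset means copied from the table above: (RMSE, NASA score).
per_dataset = {
    "GradNorm": {"FD002": (8.19, 260.9), "FD004": (7.74, 222.9)},
    "GABA":     {"FD002": (7.53, 224.2), "FD004": (8.25, 247.2)},
    "GRACE":    {"FD002": (7.72, 223.4), "FD004": (8.12, 242.0)},
}


def multi_condition_average(rows: dict) -> tuple:
    """Unweighted mean over datasets (assumed aggregation rule)."""
    n = len(rows)
    rmse = sum(r for r, _ in rows.values()) / n
    nasa = sum(s for _, s in rows.values()) / n
    return rmse, nasa


mc_avg = {m: multi_condition_average(rows) for m, rows in per_dataset.items()}
for method, (rmse, nasa) in mc_avg.items():
    print(f"{method:8s} RMSE={rmse:.3f}  NASA={nasa:.2f}")
```

The averages make the trade-off visible: GradNorm's FD004 lead does not offset its FD002 deficit on NASA, which is why the multi-condition NASA ordering favours GABA and GRACE.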

Three observations from the paper's Table IV:

  • RMSE difference is not statistically significant. Friedman test across the ten methods on multi-condition data: $p = 0.077$ (paper Section 5.2). All gradient-aware methods land in the same RMSE band.
  • NASA difference IS significant. $p = 0.0002$ on the same Friedman test. GABA / GRACE hold the best-NASA positions on multi-condition data.
  • The ranking holds on the multi-condition average, not on every dataset. Within-pipeline (same model, same preprocessing), GABA and GRACE outperform GradNorm on NASA for FD002 and on the average, and are effectively tied on RMSE; on FD004 alone GradNorm leads on both metrics.

Why GradNorm Diverges (1 / 5 N-CMAPSS Seeds)

The paper reports (Section 5.1) that GradNorm produced NaN gradients on seed 789 of the N-CMAPSS DS02 evaluation — one out of five seeds. GABA never diverges. The mechanism is the absence of a weight bound:

  • The auxiliary loss is $\sum_i | \| w_i \nabla L_i \| - \text{tgt}_i |$. When $\| \nabla L_i \|$ is huge (500× imbalance regime), a single SGD step can push $w_i$ across zero into the negative half-line.
  • Negative $w_i$ inverts the gradient direction for that task. The backbone walks AWAY from the loss minimum.
  • Under AdamW with weight decay, the loss then explodes on the next forward pass, and the following backward pass produces NaN gradients.
  • GABA cannot do this: weights live on the simplex by construction. A step that would produce $\lambda < 0$ is impossible because the closed form returns a non-negative value and the floor / renormalise stage enforces $\lambda \in [\lambda_{\min}, 1 - \lambda_{\min}]$ on every step.

The PyTorch demo above already shows the seed of this issue. One SGD step on the GradNorm aux loss with $\text{lr} = 0.05$ on the realistic gradients pushes $w_{\text{rul}}$ from 1.0 down to $-2.83$. The aux-loss-weight choice of $0.1$ in the paper softens this but does not eliminate it.
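The sign flip can be reproduced with scalar arithmetic. The sketch below is a toy, not the paper's code: the gradient norms (100.0 and 0.2, a 500× ratio), `lam_min = 0.05`, and the clamp used as a stand-in for the floor / renormalise stage are all assumptions chosen for illustration, and the GradNorm target is held constant during the step. Because the inputs are hypothetical, the resulting weight differs from the value quoted above.

```python
def gradnorm_weight_step(w: float, raw_norm: float, target: float, lr: float) -> float:
    """One plain-SGD step on |w * raw_norm - target|, target treated as a constant."""
    residual = w * raw_norm - target
    sign = (residual > 0) - (residual < 0)   # subgradient of |.|
    return w - lr * sign * raw_norm          # d/dw |w*G - t| = sign * G


def gaba_weights(g_a: float, g_b: float, lam_min: float) -> tuple:
    """Two-task closed form lam_a = g_b / (g_a + g_b), then clamp (assumed floor stage)."""
    lam_a = g_b / (g_a + g_b)
    lam_a = min(max(lam_a, lam_min), 1.0 - lam_min)
    return lam_a, 1.0 - lam_a


# Hypothetical raw backbone-gradient norms with a 500x imbalance.
g_rul, g_health = 100.0, 0.2
target = (g_rul + g_health) / 2.0            # average norm at step 0 (r_i = 1)

w_rul_next = gradnorm_weight_step(1.0, g_rul, target, lr=0.05)
lam_rul, lam_health = gaba_weights(g_rul, g_health, lam_min=0.05)

print(f"GradNorm w_rul after one step: {w_rul_next:.2f}")
print(f"GABA lam_rul={lam_rul:.3f}  lam_health={lam_health:.3f}")
```

In this toy the update size is lr × raw norm, so at lr = 0.05 any raw norm above 20 is enough to carry a weight of 1.0 across zero in one step. The clamp keeps both GABA weights inside [0.05, 0.95] whatever the imbalance.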

Where Each Method Fits

| Setting | Better choice | Reason |
|---|---|---|
| Tasks with similar gradient magnitudes (< 5×) | Either | Both methods agree to within seed noise; pick whichever is cheaper to integrate |
| Severe gradient imbalance (50× or more) | GABA | Bounded weights guarantee convergence; GradNorm risks divergence |
| Tasks with dramatically different learning trajectories | GradNorm | $r_i^\alpha$ signal genuinely useful; gradient ratio alone may underweight slow learners |
| Safety-critical training (must not produce NaN) | GABA | Bounded by construction; never produces NaN even on extreme batches |
| Limited GPU memory | GABA | create_graph=False saves ~2× the memory of GradNorm's second-order pass |
| No knowledge of $L_i^{(0)}$ | GABA | Initial losses unavailable in continual learning / online adaptation |
| 3+ tasks with different scales (e.g. detection: bbox + cls + obj) | GABA | K-task closed form; no exponential blow-up in hyperparameters |
| RUL prediction with health classification (this paper) | GABA / GRACE | Paper's 500× imbalance is exactly the regime GABA was designed for |

Pitfalls In The Comparison

Pitfall 1: Comparing weights, not contributions. Don't conclude ‘GradNorm gives more weight to health, so it must care more about health’. The relevant quantity is $c_i = \lambda_i \|g_i\|$ (§17.2). GradNorm intentionally produces $c_i \propto r_i^\alpha$: UNequal contributions when $r_i \neq r_j$.
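To make the weights-versus-contributions distinction concrete, the sketch below compares contributions under the two-task GABA closed form with the contribution shares implied by GradNorm's targets. It assumes the idealised case where GradNorm has fully converged to its targets, so that contributions are proportional to $r_i^\alpha$; the function names and example numbers are illustrative.

```python
def gaba_contributions(g1: float, g2: float) -> tuple:
    """Contributions c_i = lam_i * g_i under lam_1 = g2/(g1+g2), lam_2 = g1/(g1+g2)."""
    s = g1 + g2
    lam1, lam2 = g2 / s, g1 / s
    return lam1 * g1, lam2 * g2


def gradnorm_target_shares(r1: float, r2: float, alpha: float) -> tuple:
    """Share of total contribution per task if each weighted norm reaches
    a target proportional to r_i ** alpha (idealised converged GradNorm)."""
    t1, t2 = r1 ** alpha, r2 ** alpha
    return t1 / (t1 + t2), t2 / (t1 + t2)


c1, c2 = gaba_contributions(g1=100.0, g2=0.2)       # hypothetical norms, 500x apart
sym = gradnorm_target_shares(0.5, 0.5, alpha=1.5)   # equal relative progress
asym = gradnorm_target_shares(0.9, 0.3, alpha=1.5)  # task 1 is learning more slowly

print(f"GABA contributions: c1={c1:.4f}  c2={c2:.4f}")
print(f"GradNorm shares, symmetric r:  {sym[0]:.3f} / {sym[1]:.3f}")
print(f"GradNorm shares, asymmetric r: {asym[0]:.3f} / {asym[1]:.3f}")
```

Under the closed form both contributions equal $g_1 g_2 / (g_1 + g_2)$, so they match however lopsided the weights look. GradNorm's shares match only when relative progress is equal; otherwise it deliberately tilts toward the slower-learning task.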
Pitfall 2: Forgetting the aux-loss weight. GradNorm's combined loss is $\mathcal{L} = \sum_i w_i L_i + 0.1 \, \mathcal{L}_{\text{GN}}$. Setting the auxiliary weight too high makes GradNorm oscillate; too low makes it ignore the signal. There is no analogue in GABA: the inverse-proportional rule has no auxiliary scalar to misconfigure.
Pitfall 3: Treating ‘learning rate’ for GradNorm as a free hyperparameter. GradNorm's $w_i$ SGD learning rate is a hidden hyperparameter that interacts with the model's main learning rate. Tuning it on a new dataset typically requires another sweep. GABA's three hyperparameters (β, λ_min, warmup) are dataset-independent in the paper's evaluation.
Why GRACE picks GABA over GradNorm despite a modestly worse RMSE on FD004. The paper's criterion for the ‘recommended deployment’ is BOUNDED safety, not best-case accuracy. A method that diverges on 1/5 seeds is unusable in fleet management even if it slightly out-performs on the 4 surviving seeds. GradNorm's lack of a weight bound turns its modest FD004 RMSE advantage into a deployment liability everywhere else.

Takeaway

  • GABA reads only $g_i$; GradNorm reads $g_i, L_i, L_i^{(0)}$ and $\alpha$. When relative progress is symmetric, the extra signals collapse and the methods agree.
  • GABA is closed form; GradNorm runs SGD on learnable weights. One division vs an inner optimisation with create_graph=True and an aux-loss coupling weight to tune.
  • GABA's weights are bounded by construction. $\lambda_i \in [\lambda_{\min}, 1 - \lambda_{\min}]$ at every step. GradNorm has no such guarantee, and the paper observed NaN divergence on 1 of 5 N-CMAPSS seeds (seed 789).
  • Within-pipeline FD002 numbers from the paper: GABA RMSE 7.53 / NASA 224.2; GradNorm RMSE 8.19 / NASA 260.9. Multi-condition NASA difference is statistically significant (p = 0.0002).
  • Pick GradNorm only when its α-signal is genuinely useful AND gradient magnitudes are moderate. For RUL with 500× imbalance, GABA is the deployment-safe choice. The next chapter (§18) walks through the full GABA algorithm step by step.