Imagine two people pulling on the same rope. One pulls with the force of a tractor (RUL regression). The other tugs with the force of a kitten (health classification). The rope moves, of course - but it goes wherever the tractor wants. The kitten's effort is essentially wasted.
That metaphor is not rhetorical. On the DualTaskModel of §11.4 the two tasks really do pull on the shared backbone with gradients that differ in magnitude by two-and-a-half orders of magnitude: roughly 500× at the start of training, 100-300× near convergence. This chapter measures that gap carefully; the rest of the book is about closing it.
Why this matters before you even see numbers. Adam, AdamW, RMSProp - all standard optimisers - rescale per parameter, not per task. A 500× pull from one task and a 1× pull from another arrive at the SAME parameter already summed into a single gradient; that sum is all those optimisers see, so they cannot tell the two apart, and the weaker task is structurally suppressed.
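To make the "cannot tell the two apart" point concrete, here is a minimal sketch of the very first Adam step on one shared parameter. It assumes bias-corrected Adam with commonly used default hyperparameters, and the per-task gradients (+500 for RUL, -1 for health) are hypothetical values chosen to mirror the 500× gap; `adam_first_step` is an illustrative helper, not part of the chapter's code.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter update produced by the first Adam step for one scalar gradient."""
    m = (1 - beta1) * grad          # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2     # second-moment estimate after one step
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


# Hypothetical per-task gradients on ONE shared parameter.
g_rul, g_hs = 500.0, -1.0

step_combined = adam_first_step(g_rul + g_hs)  # what the optimiser actually receives
step_rul_only = adam_first_step(g_rul)
step_hs_only = adam_first_step(g_hs)

print("combined step :", step_combined)
print("RUL-only step :", step_rul_only)
print("HS-only step  :", step_hs_only)
```

Under these assumptions the combined step is indistinguishable from the RUL-only step, while the health task on its own would have moved the parameter in the opposite direction.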
What “Per-Task Gradient Norm” Means
Let $\theta_{\text{shared}} \in \mathbb{R}^N$ be the parameters of the shared backbone (CNN + BiLSTM + Attention + FCFunnel - everything BEFORE the two heads). Let $L_{\text{rul}}$ and $L_{\text{hs}}$ be the per-task losses.
The per-task gradient on the shared parameters is just the partial derivative:
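$$g_{\text{rul}} = \frac{\partial L_{\text{rul}}}{\partial \theta_{\text{shared}}}, \qquad g_{\text{hs}} = \frac{\partial L_{\text{hs}}}{\partial \theta_{\text{shared}}}$$
Both live in the same $\mathbb{R}^N$. The norm we report is the L2 / Frobenius norm:
$$\|g_{\text{rul}}\|_2 = \sqrt{\sum_{i=1}^{N} g_{\text{rul},i}^2}, \qquad \|g_{\text{hs}}\|_2 = \sqrt{\sum_{i=1}^{N} g_{\text{hs},i}^2}$$
and the headline ratio:
$$\rho = \frac{\|g_{\text{rul}}\|_2}{\|g_{\text{hs}}\|_2}.$$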
Why L2 and not L1 or L∞. Adam's update is proportional to $g/\sqrt{v}$ where $v$ is a running mean of $g^2$. So the OPTIMISER's view of gradient magnitude is L2-shaped, and a per-task L2 norm is the right thing to compare. L1 / L∞ paint the same picture qualitatively but disagree on the exact ratio.
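The claim that the three norms agree qualitatively but not on the exact ratio can be checked by hand. The sketch below uses two small hypothetical gradient vectors (one concentrated in two coordinates, one small and evenly spread); the values are chosen only so the arithmetic is easy to follow, and `norm_ratio` is an illustrative helper.

```python
import numpy as np

# Hypothetical per-task gradients on four shared parameters.
g_rul = np.array([3.0, 4.0, 0.0, 0.0])      # concentrated in two coordinates
g_hs = np.array([0.01, 0.01, 0.01, 0.01])   # small and evenly spread


def norm_ratio(a, b, p):
    """Ratio ||a||_p / ||b||_p for vector norm order p."""
    return np.linalg.norm(a, ord=p) / np.linalg.norm(b, ord=p)


ratio_l1 = norm_ratio(g_rul, g_hs, 1)          # 7 / 0.04
ratio_l2 = norm_ratio(g_rul, g_hs, 2)          # 5 / 0.02
ratio_linf = norm_ratio(g_rul, g_hs, np.inf)   # 4 / 0.01

print(f"L1 ratio   : {ratio_l1:.1f}")
print(f"L2 ratio   : {ratio_l2:.1f}")
print(f"Linf ratio : {ratio_linf:.1f}")
```

All three ratios are in the hundreds, so the imbalance shows up under any norm, but the exact figure depends on how each gradient is distributed across coordinates.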
Interactive: Norms vs. Scales vs. Dim
Drag the dim slider to see norms grow as $\sqrt{N}$. Drag the per-task scales to see the ratio track the per-element scale gap. Watch the $\|g\|/\sqrt{N}$ column stay constant - that is what makes the ratio a meaningful invariant of the loss design, not an artefact of model size.
Try this. Set the RUL grad scale to 1.0 and the HS grad scale to 0.002. The ratio readout is exactly the 500× the chapter title advertises. Now slide the HS scale up to 1.0 - the ratio collapses to 1×. Section §14.3 - AMNL - changes the EFFECTIVE per-element scale. Section §17 - GABA - rescales per-task gradients directly. Both attack this slider.
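For readers without the interactive widget, the same experiment can be sketched offline. The code assumes gradient entries are i.i.d. zero-mean Gaussians with a fixed per-element scale, which is a simplification for illustration only and not a claim about the real DualTaskModel gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
scale_rul, scale_hs = 1.0, 0.002   # per-element gradient scales (the two sliders)

results = {}
for n in (1_000, 10_000, 100_000):              # the "dim" slider
    g_rul = rng.normal(0.0, scale_rul, size=n)
    g_hs = rng.normal(0.0, scale_hs, size=n)
    norm_rul = np.linalg.norm(g_rul)
    norm_hs = np.linalg.norm(g_hs)
    per_element = norm_rul / np.sqrt(n)         # should stay near scale_rul
    ratio = norm_rul / norm_hs                  # should stay near scale_rul / scale_hs
    results[n] = (norm_rul, per_element, ratio)
    print(f"N={n:>7}  norm_rul={norm_rul:8.2f}  "
          f"norm/sqrt(N)={per_element:.3f}  ratio={ratio:6.1f}")
```

The raw norm grows with the dimension, while the norm divided by the square root of N and the ratio both stay close to the values implied by the per-element scales.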
Python: NumPy from Scratch
Manually compute the chain-rule gradients of each loss with respect to a tiny shared linear layer. No autograd magic - so you can see WHY the ratio is what it is.
Manual chain rule for both task gradients
🐍grad_norms_numpy.py
import numpy as np
Standard alias.
np.random.seed(0)
Reproducible random init - same numbers every run.
D_in, D_shared, K = 4, 8, 3
Toy sizes. Tiny on purpose so the per-element math is easy to read; the real DualTaskModel has D_shared = 32 and ~3.4M shared params, but the principle is identical.
EXECUTION STATE
D_in = 4 (input feature dim)
D_shared = 8 (shared backbone output dim)
K = 3 (number of health classes)
W = np.random.randn(D_shared, D_in) * 0.1
The SHARED parameter matrix. This is the analogue of the CNN+BiLSTM+Attention+Funnel parameters in the real model - both heads' gradients will flow back here.
EXECUTION STATE
shape = (8, 4)
init scale = 0.1 - small enough to keep activations near zero
W_rul = np.random.randn(D_shared) * 0.1
RUL head weight vector. Maps 8-D shared features to a scalar. Stand-in for §11.2's 4-layer MLP.
EXECUTION STATE
shape = (8,)
W_hs = np.random.randn(K, D_shared) * 0.1
Health-classification head. Maps 8-D shared features to K=3 raw logits. Stand-in for §11.3's 3-layer MLP.
z = x @ W.T
Forward pass through the shared backbone. (B, D_in) @ (D_in, D_shared) = (B, D_shared).
EXECUTION STATE
z.shape = (32, 8) ← shared features
purpose = BOTH heads will read z. This is what makes the gradient analysis tricky.
rul_pred = z @ W_rul
(B, 8) @ (8,) = (B,). One scalar per engine.
EXECUTION STATE
shape = (32,)
logits = z @ W_hs.T
(B, 8) @ (8, 3) = (B, 3). Three logits per engine.
EXECUTION STATE
shape = (32, 3)
loss_rul = np.mean((rul_pred - y_rul) ** 2)
Mean squared error. With y_rul roughly uniform in [0, 125] and rul_pred near 0, the residuals are O(60). Squared they're O(3,600). The huge constant is the ROOT of the gradient imbalance.
log_p = shift - np.log(np.exp(shift).sum(-1, keepdims=True))
log-softmax. Equivalent to log(F.softmax(logits)) but numerically stable.
EXECUTION STATE
shape = (32, 3)
loss_hs = -np.mean(log_p[np.arange(B), y_hs])
Cross-entropy: pick the log-probability of the TRUE class for each row, negate, average. With 3 classes and random init, log_p ≈ log(1/3) ≈ -1.099, so loss_hs ≈ 1.10.
EXAMPLE
# Numerical scale check at init:
# logits ~ small, all classes ≈ equal probability
# log p_true ≈ log(1/3) ≈ -1.0986
# loss_hs ≈ 1.10 ← bounded, O(1)
g_z_hs = ((sm - oh) / B) @ W_hs
Chain rule for softmax+CE: ∂L_hs/∂z = (1/B)(p - y_onehot) @ W_hs. The factor (p - y_onehot) is bounded in [-1, 1], so g_z_hs is small.
EXAMPLE
# Per-row (sm - oh): bounded in [-1, 1]
# At init p ≈ (0.33, 0.33, 0.33), y_onehot is one-hot
# sm - oh ≈ (0.33, 0.33, -0.67) magnitude
# ⇒ g_z_hs per element ≈ 0.001 - 0.01
# This factor of 100-1000× smaller is exactly
# the source of the 500× imbalance.
EXECUTION STATE
shape = (32, 8)
g_W_rul = g_z_rul.T @ x
Backprop through the shared-feature linear: ∂L/∂W = (∂L/∂z).T @ x. (D_shared, B) @ (B, D_in) = (D_shared, D_in).
EXECUTION STATE
shape = (8, 4) - same as W
g_W_hs = g_z_hs.T @ x
Same chain-rule trick for the classification gradient.
EXECUTION STATE
shape = (8, 4) - same as W
norm_rul = np.linalg.norm(g_W_rul)
L2 (Frobenius) norm of the gradient matrix. np.linalg.norm with no axis argument flattens the matrix and returns sqrt(sum of squares).
EXAMPLE
# np.linalg.norm(M) for matrix M is identical to:
# np.sqrt((M**2).sum())
# i.e. the Frobenius norm.
norm_hs = np.linalg.norm(g_W_hs)
Same for the classification gradient.
ratio = norm_rul / max(norm_hs, 1e-12)
The headline number. The 1e-12 floor is just a divide-by-zero guard.
print("loss_rul :", ...)
The MSE loss at init - dominated by the y_rul magnitude.
EXECUTION STATE
Output (one realisation) = loss_rul : 3754.21
print("loss_hs :", ...)
Cross-entropy at init - bounded by log(K) ≈ 1.099.
EXECUTION STATE
Output (one realisation) = loss_hs : 1.0987
print("‖g_W_rul‖₂ :", ...)
Regression gradient L2 norm.
EXECUTION STATE
Output (one realisation) = ‖g_W_rul‖₂ : 4.812
print("‖g_W_hs‖₂ :", ...)
Classification gradient L2 norm.
EXECUTION STATE
Output (one realisation) = ‖g_W_hs‖₂ : 0.009
print("ratio :", ...)
The imbalance. On the toy problem we already see ~500×. On the real DualTaskModel, the empirical median over 4,120 batches is between 100× and 1,000× depending on epoch.
EXECUTION STATE
Output (one realisation) = ratio : 506.4 ×
```python
import numpy as np


# ----- toy shared backbone with two task heads -----
# A 1-layer linear backbone phi(x) = W x with weight W: (D_out=8, D_in=4)
# Two heads on top: a regression head and a classification head.
np.random.seed(0)
D_in, D_shared, K = 4, 8, 3
W = np.random.randn(D_shared, D_in) * 0.1       # shared params
W_rul = np.random.randn(D_shared) * 0.1         # RUL head: (D_shared,) -> scalar
W_hs = np.random.randn(K, D_shared) * 0.1       # HS head: (D_shared,) -> (K,)

# ----- one fake batch -----
B = 32
x = np.random.randn(B, D_in)
y_rul = np.random.randint(0, 126, B).astype(np.float32)   # capped RUL targets
y_hs = np.random.randint(0, K, B)                          # class indices

# ----- forward -----
z = x @ W.T               # (B, D_shared) shared features
rul_pred = z @ W_rul      # (B,)
logits = z @ W_hs.T       # (B, K)

# ----- per-task losses (averaged over batch) -----
loss_rul = np.mean((rul_pred - y_rul) ** 2)                     # MSE
shift = logits - logits.max(-1, keepdims=True)                  # log-sum-exp shift
log_p = shift - np.log(np.exp(shift).sum(-1, keepdims=True))
loss_hs = -np.mean(log_p[np.arange(B), y_hs])                   # cross-entropy

# ----- per-task gradients on the SHARED weights W -----
# dL_rul/dz = 2 * (rul_pred - y_rul) / B * W_rul[None, :]
g_z_rul = (2.0 / B) * (rul_pred - y_rul)[:, None] * W_rul[None, :]   # (B, D_shared)
# dL_hs/dz = (softmax(logits) - one_hot) / B @ W_hs
sm = np.exp(log_p)                                                   # (B, K)
oh = np.zeros_like(sm); oh[np.arange(B), y_hs] = 1.0
g_z_hs = ((sm - oh) / B) @ W_hs                                      # (B, D_shared)

# Backprop into W via the shared-feature chain rule: dL/dW = g_z.T @ x
g_W_rul = g_z_rul.T @ x   # (D_shared, D_in)
g_W_hs = g_z_hs.T @ x     # (D_shared, D_in)

# ----- per-task L2 norms on the shared parameters -----
norm_rul = np.linalg.norm(g_W_rul)
norm_hs = np.linalg.norm(g_W_hs)
ratio = norm_rul / max(norm_hs, 1e-12)

print("loss_rul :", round(loss_rul, 4))
print("loss_hs :", round(loss_hs, 4))
print("‖g_W_rul‖₂ :", round(norm_rul, 4))
print("‖g_W_hs‖₂ :", round(norm_hs, 6))
print("ratio :", round(ratio, 1), "×")
```
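Because the gradients above are derived by hand, it is worth confirming them numerically. The following is a minimal sketch of a central finite-difference check on the same toy architecture; the helper names (`losses`, `analytic_grads`, `numeric_grads`), the smaller batch size and the step size `h` are choices made here for illustration, not part of the original listing.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D_in, D_shared, K = 8, 4, 8, 3
W = rng.normal(size=(D_shared, D_in)) * 0.1      # shared weights
W_rul = rng.normal(size=D_shared) * 0.1          # regression head
W_hs = rng.normal(size=(K, D_shared)) * 0.1      # classification head
x = rng.normal(size=(B, D_in))
y_rul = rng.integers(0, 126, size=B).astype(np.float64)
y_hs = rng.integers(0, K, size=B)


def losses(W_shared):
    """Return (MSE loss, cross-entropy loss) for a given shared weight matrix."""
    z = x @ W_shared.T
    loss_rul = np.mean((z @ W_rul - y_rul) ** 2)
    logits = z @ W_hs.T
    shift = logits - logits.max(-1, keepdims=True)
    log_p = shift - np.log(np.exp(shift).sum(-1, keepdims=True))
    return loss_rul, -np.mean(log_p[np.arange(B), y_hs])


def analytic_grads(W_shared):
    """Manual chain-rule gradients of both losses with respect to W_shared."""
    z = x @ W_shared.T
    g_z_rul = (2.0 / B) * (z @ W_rul - y_rul)[:, None] * W_rul[None, :]
    logits = z @ W_hs.T
    shift = logits - logits.max(-1, keepdims=True)
    sm = np.exp(shift) / np.exp(shift).sum(-1, keepdims=True)
    oh = np.zeros_like(sm)
    oh[np.arange(B), y_hs] = 1.0
    g_z_hs = ((sm - oh) / B) @ W_hs
    return g_z_rul.T @ x, g_z_hs.T @ x


def numeric_grads(W_shared, h=1e-6):
    """Central finite differences, one shared weight at a time."""
    g_rul = np.zeros_like(W_shared)
    g_hs = np.zeros_like(W_shared)
    for idx in np.ndindex(*W_shared.shape):
        W_plus, W_minus = W_shared.copy(), W_shared.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        rul_p, hs_p = losses(W_plus)
        rul_m, hs_m = losses(W_minus)
        g_rul[idx] = (rul_p - rul_m) / (2 * h)
        g_hs[idx] = (hs_p - hs_m) / (2 * h)
    return g_rul, g_hs


a_rul, a_hs = analytic_grads(W)
n_rul, n_hs = numeric_grads(W)
print("max |analytic - numeric| (RUL):", np.max(np.abs(a_rul - n_rul)))
print("max |analytic - numeric| (HS) :", np.max(np.abs(a_hs - n_hs)))
```

If both differences are tiny relative to the gradient entries, the hand-written chain rule matches what the loss functions actually do, and the norm ratio can be trusted.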
PyTorch: retain_graph and grad.norm()
Production version. The trick is calling loss.backward(retain_graph=True) twice (once per task), zeroing grads in between, and reading p.grad.norm() on the shared-param list.
def per_task_grad_norms(model, loss_rul, loss_hs, shared_params) -> dict:
Reusable helper. Returns ‖g_rul‖, ‖g_hs‖, and their ratio - measured ONLY on the shared parameter list.
EXECUTION STATE
input: model = any nn.Module
input: loss_rul = scalar tensor with requires_grad=True
input: loss_hs = scalar tensor with requires_grad=True
input: shared_params = list of nn.Parameter from the shared backbone (NOT the task heads)
returns = dict with keys {rul, hs, ratio}
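out = {}
Output dict.
model.zero_grad(set_to_none=False)
Reset every existing .grad to ZERO (not None). With set_to_none=False the grad tensors are zeroed in place rather than dropped, so the next backward() accumulates into a clean buffer. The `if p.grad is not None` guard in the norm below then only skips parameters that received no gradient at all, and those contribute nothing to the norm anyway.
loss_rul.backward(retain_graph=True)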
Backprop ONLY the regression loss. retain_graph=True keeps the autograd graph alive so we can call .backward() again on loss_hs in a moment.
EXAMPLE
# Why retain_graph?
# By default, autograd frees the computation graph
# after .backward(). The second .backward() call
# would then crash with:
# RuntimeError: Trying to backward through the
# graph a second time...
# retain_graph=True keeps the graph so we can
# call .backward() once per task.
out["rul"] = torch.sqrt(sum((p.grad ** 2).sum() for p in shared_params if p.grad is not None)).item()
Frobenius norm across the shared-param list. Squares each grad element, sums across all parameters, takes sqrt, converts to a Python float with .item().
EXAMPLE
# Equivalent vector form:
# flat = torch.cat([p.grad.flatten() for p in shared_params])
# norm = flat.norm().item()
# Both compute the same number; the sum-form
# avoids allocating the flat tensor.
model.zero_grad(set_to_none=False)
Reset before measuring task 2 - critical, otherwise grads accumulate.
loss_hs.backward(retain_graph=True)
Backprop the classification loss into a FRESH grad buffer.
out["hs"] = torch.sqrt(sum((p.grad ** 2).sum() for p in shared_params if p.grad is not None)).item()
Same Frobenius norm as for the regression task, now read from the classification gradient.
class TinyDualTask(nn.Module):
Smaller mirror of DualTaskModel. One shared Linear, one regression head, one classification head. Same gradient story as the full 3.4M-param model.
self.shared = nn.Linear(4, 8)
Shared backbone. The ONLY parameters we want to measure.
self.rul = nn.Linear(8, 1)
Regression head.
self.hs = nn.Linear(8, 3)
Classification head.
def forward(self, x):
Same dual-output convention as the full model.
z = self.shared(x)
(B, 4) -> (B, 8). The shared features.
return self.rul(z).squeeze(-1), self.hs(z)
Tuple: (B,) regression, (B, 3) logits.
model = TinyDualTask()
Instantiate.
x = torch.randn(32, 4)
Fake batch.
y_rul = torch.randint(0, 126, (32,)).float()
Capped RUL targets in [0, 125].
y_hs = torch.randint(0, 3, (32,))
Class indices.
rul, logits = model(x)
One forward pass returns both outputs.
loss_rul = F.mse_loss(rul, y_rul)
Regression loss.
loss_hs = F.cross_entropy(logits, y_hs)
Classification loss.
shared_params = list(model.shared.parameters())
ONLY the shared backbone params. We deliberately exclude rul.parameters() and hs.parameters() - measuring those would muddle the imbalance because the heads' own weights see only their own task.
EXAMPLE
# Why filter?
# We want to know: how does each task pull on the
# SHARED state? Including head params would mix in
# task-private gradients and dilute the comparison.
print("‖g_hs‖₂ :", ...)
Classification gradient norm on shared params - 100-1000× smaller.
EXECUTION STATE
Output (one realisation) = ‖g_hs‖₂ : 0.0102
print("ratio :", ...)
The imbalance. Section §12.3 measures this on the real DualTaskModel across 4,120 batches.
EXECUTION STATE
Output (one realisation) = ratio : 502.3 ×
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def per_task_grad_norms(model: nn.Module,
                        loss_rul: torch.Tensor,
                        loss_hs: torch.Tensor,
                        shared_params: list) -> dict:
    """Compute the L2 norm of each task's gradient on the SHARED parameters.

    Calls .backward(retain_graph=True) once per task into a fresh grad
    buffer, reads the norm, and zeros grads for the next caller.
    """
    out = {}

    # --- task 1: regression ---
    model.zero_grad(set_to_none=False)
    loss_rul.backward(retain_graph=True)  # populate .grad
    out["rul"] = torch.sqrt(sum(
        (p.grad ** 2).sum() for p in shared_params if p.grad is not None
    )).item()

    # --- task 2: classification ---
    model.zero_grad(set_to_none=False)
    loss_hs.backward(retain_graph=True)
    out["hs"] = torch.sqrt(sum(
        (p.grad ** 2).sum() for p in shared_params if p.grad is not None
    )).item()

    # Leave clean buffers behind, as the docstring promises, so a later
    # training backward() does not accumulate on top of the HS gradient.
    model.zero_grad(set_to_none=False)

    out["ratio"] = out["rul"] / max(out["hs"], 1e-12)
    return out


# ---------- Smoke test ----------
torch.manual_seed(0)


class TinyDualTask(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(4, 8)  # the shared backbone
        self.rul = nn.Linear(8, 1)
        self.hs = nn.Linear(8, 3)

    def forward(self, x):
        z = self.shared(x)
        return self.rul(z).squeeze(-1), self.hs(z)


model = TinyDualTask()
x = torch.randn(32, 4)
y_rul = torch.randint(0, 126, (32,)).float()
y_hs = torch.randint(0, 3, (32,))

rul, logits = model(x)
loss_rul = F.mse_loss(rul, y_rul)
loss_hs = F.cross_entropy(logits, y_hs)

shared_params = list(model.shared.parameters())
norms = per_task_grad_norms(model, loss_rul, loss_hs, shared_params)

print("loss_rul :", round(loss_rul.item(), 4))
print("loss_hs :", round(loss_hs.item(), 4))
print("‖g_rul‖₂ :", round(norms['rul'], 4))
print("‖g_hs‖₂ :", round(norms['hs'], 6))
print("ratio :", round(norms['ratio'], 1), "×")
```
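An alternative worth knowing is `torch.autograd.grad`, which returns the gradients as a tuple instead of writing them into `.grad`, so no zeroing between tasks is needed. The sketch below is one possible variant, not the chapter's method; `per_task_grad_norms_functional` and the three stand-alone `nn.Linear` layers are hypothetical names introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def per_task_grad_norms_functional(loss_rul, loss_hs, shared_params):
    """Per-task L2 gradient norms via torch.autograd.grad (p.grad is never touched)."""
    out = {}
    for name, loss in (("rul", loss_rul), ("hs", loss_hs)):
        grads = torch.autograd.grad(
            loss, shared_params, retain_graph=True, allow_unused=True
        )
        # Parameters the loss does not depend on come back as None; skip them.
        squared = sum(((g ** 2).sum() for g in grads if g is not None), torch.zeros(()))
        out[name] = torch.sqrt(squared).item()
    out["ratio"] = out["rul"] / max(out["hs"], 1e-12)
    return out


torch.manual_seed(0)
shared = nn.Linear(4, 8)      # stand-in shared backbone
head_rul = nn.Linear(8, 1)    # stand-in regression head
head_hs = nn.Linear(8, 3)     # stand-in classification head

x = torch.randn(32, 4)
y_rul = torch.randint(0, 126, (32,)).float()
y_hs = torch.randint(0, 3, (32,))

z = shared(x)
loss_rul = F.mse_loss(head_rul(z).squeeze(-1), y_rul)
loss_hs = F.cross_entropy(head_hs(z), y_hs)

shared_params = list(shared.parameters())
norms = per_task_grad_norms_functional(loss_rul, loss_hs, shared_params)
print(norms)
```

Because nothing is accumulated into `.grad`, this variant cannot hit the "forgot to zero grads" failure mode, at the cost of holding one extra set of gradient tensors in memory per call.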
Same Trick, Different Domains
Per-task gradient norms are a standard MTL diagnostic, not an RUL-specific invention. The same helper applies to:
| Domain | Task A | Task B | Typical ratio at init |
| --- | --- | --- | --- |
| Predictive maintenance | RUL (MSE) | Health (CE) | 100-1,000× |
| Self-driving | Steering angle (MSE) | Lane occupancy (CE) | 50-200× |
| Speech recognition | Phoneme posteriors (CE) | Word boundary (CE) | 1-3× |
| Object detection | Bounding box (smooth-L1) | Class (CE) | 5-30× |
| Multi-spectral imagery | Reflectance (MSE) | Land-cover (CE) | 20-100× |
| Battery diagnostics | Capacity fade (MSE) | Cell-failure type (CE) | 100-500× |
Pattern. When one task is regression on an unbounded target and the other is bounded classification, the ratio is large. That is the rule, not the exception. The 500× of C-MAPSS is at the high end mostly because RUL targets run up to 125 cycles, not 1.0.
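The role of the target scale can be isolated with a few lines of arithmetic. The sketch below assumes predictions of exactly zero (a stand-in for "near zero at init") and compares the MSE gradient for targets in cycles against the same targets divided by 125. It only illustrates why the target range matters; it is not a fix proposed by the text, and `mse_grad_wrt_pred` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 32
y_cycles = rng.integers(0, 126, size=B).astype(np.float64)  # capped RUL targets, in cycles
y_unit = y_cycles / 125.0                                    # same targets rescaled to [0, 1]
pred = np.zeros(B)                                           # assumption: predictions are 0 at init


def mse_grad_wrt_pred(pred, target):
    """Gradient of mean((pred - target) ** 2) with respect to pred."""
    return 2.0 * (pred - target) / len(target)


norm_cycles = np.linalg.norm(mse_grad_wrt_pred(pred, y_cycles))
norm_unit = np.linalg.norm(mse_grad_wrt_pred(pred, y_unit))
scale_factor = norm_cycles / norm_unit

print("grad norm, targets in cycles :", norm_cycles)
print("grad norm, targets in [0, 1] :", norm_unit)
print("scale factor                 :", scale_factor)
```

With zero predictions the MSE gradient is linear in the targets, so the factor of 125 in the target range carries straight through to the gradient norm.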
Three Measurement Pitfalls
Pitfall 1: Forgetting retain_graph=True. The first backward() frees the graph by default. The second call crashes. If you see RuntimeError: Trying to backward through the graph a second time, that's why.
Pitfall 2: Forgetting to zero grads between tasks. Without model.zero_grad() between the two backwards, the second norm is $\|g_{\text{rul}} + g_{\text{hs}}\|_2$, not $\|g_{\text{hs}}\|_2$. The ratio collapses to roughly 1 and looks “balanced” - which is exactly the bug you DON'T want to silently ship.
Pitfall 3: Including head params in the norm. The classification head's OWN weights see a sensible CE gradient; including them inflates ‖g_hs‖ and hides the imbalance on the SHARED state. Always filter to the shared backbone.
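Pitfall 2 is easy to reproduce without any autograd at all. The sketch below uses hypothetical Gaussian gradient vectors with a 500× per-element scale gap and compares the correct ratio with the one obtained when the second "backward" accumulates on top of the first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
g_rul = rng.normal(0.0, 1.0, size=n)     # hypothetical RUL gradient, per-element scale 1.0
g_hs = rng.normal(0.0, 0.002, size=n)    # hypothetical health gradient, 500x smaller

norm_rul = np.linalg.norm(g_rul)
norm_hs = np.linalg.norm(g_hs)
norm_accumulated = np.linalg.norm(g_rul + g_hs)  # what .grad holds if zero_grad() is skipped

correct_ratio = norm_rul / norm_hs
buggy_ratio = norm_rul / norm_accumulated

print(f"correct ratio : {correct_ratio:.1f}")
print(f"buggy ratio   : {buggy_ratio:.4f}")
```

By the triangle inequality the accumulated norm can differ from the RUL norm by at most the health norm, so the buggy ratio is pinned near 1 no matter how severe the real imbalance is.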
The point. Two backward passes, two Frobenius norms, one ratio. Two-line summary: loss.backward(retain_graph=True); norm = sqrt(sum of g²) over shared params only. Section §12.2 explains WHY the ratio is huge; §12.3 measures it; §12.4 derives the consequence.
Takeaway
Per-task gradient norm. $\|\partial L_t / \partial \theta_{\text{shared}}\|_2$ on the shared backbone only.
retain_graph=True. One backward per task, model.zero_grad() in between.
Filter to shared params. Including head weights muddles the comparison.
Ratio ≈ 500× at init. Not a bug; it's a structural consequence of MSE-on-unbounded-targets vs. CE-on-bounded-probabilities.