Chapter 12

Consequences for Shared Feature Learning

The 500× Gradient Imbalance

What 500× Actually Costs

§12.1 found the imbalance. §12.2 explained why it must exist. §12.3 measured the distribution across 4,120 batches. This section is the punchline: under a 500× imbalance, the shared backbone learns RUL features and ignores classification. Auxiliary task accuracy plateaus below random guessing for the rare class. RUL itself trains but converges to a representation that ignores degradation-state structure - which is exactly the structure the auxiliary task was meant to inject.

The kitten-tractor metaphor pays off here. The rope ends up wherever the tractor wants. The kitten can be there too if the kitten is at the same place, but the kitten cannot move the rope. So if our two tasks happen to agree on the right shared representation, the imbalance is invisible. If they disagree - even slightly - one task is sacrificed. C-MAPSS is exactly the disagreeing case.

The Shared-Parameter Update Equation

For a single shared parameter $\theta$ with quadratic per-task losses $L_t(\theta) = \tfrac{a_t}{2}(\theta - \theta_t^*)^2$, the plain SGD step on the naive sum $L = L_{\text{rul}} + L_{\text{hs}}$ has fixed point

$$\theta^{*} = \dfrac{a_{\text{rul}}\, \theta_{\text{rul}}^{*} + a_{\text{hs}}\, \theta_{\text{hs}}^{*}}{a_{\text{rul}} + a_{\text{hs}}}.$$

With opposing optima ($\theta_{\text{rul}}^{*} = +1$, $\theta_{\text{hs}}^{*} = -1$) and $\rho = a_{\text{rul}} / a_{\text{hs}}$ this collapses to $\theta^{*} = (\rho - 1) / (\rho + 1)$.
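
The fixed point follows from setting the gradient of the summed loss to zero. A short derivation, using only the symbols already defined above:

```latex
\begin{aligned}
\frac{\partial L}{\partial \theta}
  &= a_{\text{rul}}\,(\theta - \theta_{\text{rul}}^{*})
   + a_{\text{hs}}\,(\theta - \theta_{\text{hs}}^{*}) = 0 \\
\Rightarrow\quad (a_{\text{rul}} + a_{\text{hs}})\,\theta^{*}
  &= a_{\text{rul}}\,\theta_{\text{rul}}^{*} + a_{\text{hs}}\,\theta_{\text{hs}}^{*} \\
\text{with } \theta_{\text{rul}}^{*} = +1,\ \theta_{\text{hs}}^{*} = -1:\quad
\theta^{*} &= \frac{a_{\text{rul}} - a_{\text{hs}}}{a_{\text{rul}} + a_{\text{hs}}}
            = \frac{\rho - 1}{\rho + 1}.
\end{aligned}
```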

| ρ | θ* | L_rul at θ* | L_hs at θ* | Interpretation |
|---|---|---|---|---|
| 1 | 0.000 | 0.500 | 0.500 | Both tasks served. Symmetry. |
| 10 | 0.818 | 0.017 | 1.653 | Mild imbalance. HS already 3× worse. |
| 100 | 0.980 | 0.000 | 1.961 | RUL solved. HS essentially abandoned. |
| 500 | 0.996 | 0.000 | 1.992 | C-MAPSS grade. HS at 4× initial loss. |
| 2000 | 0.999 | 0.000 | 1.998 | Tail batches. HS gradient is invisible. |

θ* depends only on ρ. Not on lr, not on batch size, not on optimiser. The bias is BAKED INTO the loss landscape - the only way to remove it is to change the landscape, which means rebalancing the gradients themselves. AMNL / GABA / GRACE do exactly that.
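
The table rows can be recomputed directly from the closed form. The sketch below assumes the toy setup of this section (optima at +1 and -1, unweighted residual losses); the helper names `fixed_point` and `losses_at` are illustrative, not part of the book's code.

```python
def fixed_point(rho: float) -> float:
    """Closed-form steady state theta* = (rho - 1) / (rho + 1)."""
    return (rho - 1.0) / (rho + 1.0)


def losses_at(theta: float) -> tuple[float, float]:
    """Unweighted residual losses for optima at +1 (RUL) and -1 (HS)."""
    return 0.5 * (theta - 1.0) ** 2, 0.5 * (theta + 1.0) ** 2


rows = []
for rho in (1, 10, 100, 500, 2000):
    theta_star = fixed_point(rho)
    l_rul, l_hs = losses_at(theta_star)
    rows.append((rho, theta_star, l_rul, l_hs))
    print(f"rho={rho:5d}  theta*={theta_star:.3f}  L_rul={l_rul:.3f}  L_hs={l_hs:.3f}")
```

No learning rate, batch size, or optimiser appears anywhere in this computation, which is the point of the paragraph above.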

Why Adam Cannot Save You

Adam's update rule is $\Delta\theta = -\eta \cdot \hat{m} / (\sqrt{\hat{v}} + \varepsilon)$, where $m$ and $v$ track the first and second moments of the COMBINED gradient $g = g_{\text{rul}} + g_{\text{hs}}$. Per-parameter rescaling normalises $g$ against its own running second moment - it has no concept of “which task contributed how much”. On a shared parameter, both tasks hit the SAME $\sqrt{\hat{v}}$ denominator, and the dominant task's gradient still dominates the numerator.
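
To make the "same denominator" point concrete, here is a hand-written sketch of Adam's very first update for one shared scalar. It is not PyTorch's implementation, just the textbook recurrences with the usual bias correction; `adam_first_step` is a hypothetical helper, and the gradients are the toy values at theta = 0 with rho = 500.

```python
import math


def adam_first_step(g_rul: float, g_hs: float, lr: float = 0.05,
                    beta1: float = 0.9, beta2: float = 0.999,
                    eps: float = 1e-8) -> float:
    """Return Adam's first update for one shared scalar, given two task gradients."""
    g = g_rul + g_hs                      # Adam only ever sees the sum
    m = (1.0 - beta1) * g                 # first-moment estimate after one step
    v = (1.0 - beta2) * g * g             # second-moment estimate after one step
    m_hat = m / (1.0 - beta1)             # bias correction for t = 1
    v_hat = v / (1.0 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients at theta = 0 for rho = 500, as in the toy problem.
with_hs = adam_first_step(g_rul=-500.0, g_hs=+1.0)
without_hs = adam_first_step(g_rul=-500.0, g_hs=0.0)
print(with_hs, without_hs)
```

Both calls return a step of about +lr: the normalisation rescales the summed gradient, so removing the HS gradient entirely leaves the update essentially unchanged.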

Concretely, at our toy fixed point $\theta^*$ the combined gradient is zero by construction, so Adam converges to the same biased fixed point as plain SGD. It just takes a slightly different path there. The PyTorch experiment below demonstrates this.
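
A dependency-free sketch of the same claim, assuming the toy quadratic losses of this section. The Adam loop is a hand-rolled version of the standard recurrences, and the SGD step size `1 / (rho + 1)` is an assumption chosen so that plain SGD is stable (it needs lr < 2 / (rho + 1)).

```python
import math


def combined_grad(theta: float, rho: float) -> float:
    """Gradient of (rho/2)(theta - 1)^2 + (1/2)(theta + 1)^2."""
    return rho * (theta - 1.0) + (theta + 1.0)


def run_sgd(rho: float, n_steps: int = 200) -> float:
    theta = 0.0
    lr = 1.0 / (rho + 1.0)            # assumed stable step size for this quadratic
    for _ in range(n_steps):
        theta -= lr * combined_grad(theta, rho)
    return theta


def run_adam(rho: float, n_steps: int = 2000, lr: float = 0.05,
             beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = combined_grad(theta, rho)
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


rho = 500.0
theta_star = (rho - 1.0) / (rho + 1.0)
theta_sgd = run_sgd(rho)
theta_adam = run_adam(rho)
print(f"closed form {theta_star:+.4f}  sgd {theta_sgd:+.4f}  adam {theta_adam:+.4f}")
```

Adam is given more steps here because its early updates are roughly lr in size regardless of gradient magnitude, so it needs time to reach and settle at the fixed point.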

Interactive: Watch The Bias Form

Drag $\rho$. Toggle SGD vs Adam. The steady-state values do not move when you change optimiser - only when you change $\rho$.

Try this. Set ρ = 1, optimiser = SGD. Both losses settle at 0.5; HS accuracy ~73%. Now flip optimiser to Adam - same final state. Now crank ρ to 500 - HS accuracy collapses to ~28% under both optimisers. The accuracy plateau is a property of the loss landscape, not the optimiser.

Python: Toy Two-Task SGD

Plain numerical simulation of a shared scalar parameter being pulled in two directions. We print the closed-form fixed point next to the simulator's final value to confirm the theory matches.

Closed-form bias matches the simulator
🐍two_task_simulator_numpy.py
import numpy as np

NumPy is the numerical workhorse here. We do not actually use ndarray in this toy simulation - everything is a Python float - but we keep the import in case the reader extends the simulator to vector theta.

EXECUTION STATE
📚 numpy = Library: ndarray, linear algebra, random, math.
as np = Universal alias.
def simulate_two_task(rho, n_steps=60, lr=0.05, theta_0=0.0) -> dict:

Toy two-task simulator. ONE scalar shared parameter theta. Both losses are quadratic with opposing minima. The relative pull strength is rho. We run plain SGD and record the trajectory so we can plot the bias.

EXECUTION STATE
⬇ input: rho = Per-step gradient ratio g_rul / g_hs near init. C-MAPSS empirical median is 500. The simulator works for any positive value.
⬇ input: n_steps = 60 = Number of SGD steps. 60 is enough for this 1-D problem to settle into its bias-equilibrium, provided the step size is stable.
⬇ input: lr = 0.05 = Learning rate. Plain SGD on this quadratic converges only if lr < 2/(rho + 1); 0.05 satisfies that for rho up to about 39, so larger rho needs a proportionally smaller step.
⬇ input: theta_0 = 0.0 = Initial value of theta. Sits exactly between the two optima (+1 and -1). Symmetric start so the trajectory's drift is purely from the gradient ratio.
⬆ returns = dict with three Python lists of length n_steps - theta, rul_loss, hs_loss per step.
theta = float(theta_0)

Cast to float in case the caller passed an int, so theta and every value stored in the history are floats from the first step on.

EXECUTION STATE
📚 float(x) = Python built-in. Converts int/str/etc to a Python float.
⬆ result: theta = 0.0 (float)
history = {"theta": [], "rul_loss": [], "hs_loss": []}

Dict of three empty lists. We append to these as we step, then return them at the end for plotting.

EXECUTION STATE
⬆ result: history = {'theta': [], 'rul_loss': [], 'hs_loss': []}
for step in range(n_steps):

Counted loop over the SGD steps.

EXECUTION STATE
📚 range(stop) = Lazy iterator over [0, stop). Memory O(1); no list materialised.
iter var: step = 0, 1, 2, …, n_steps - 1.
LOOP TRACE · 4 iterations
step 0
theta = 0.000
g = rho·(0-1) + (0+1) = -rho + 1
step 1
theta = lr·(rho-1)
g_rul direction = still toward +1
step 30
theta = approaching closed-form (rho-1)/(rho+1)
step 59
theta = ≈ (rho-1)/(rho+1) - the bias equilibrium
interpretation = θ* depends ONLY on rho. Bigger rho ⇒ θ closer to +1 ⇒ HS task fully sacrificed.
g_rul = rho * (theta - 1.0)

Analytic gradient of L_rul = (rho/2)(theta - 1)² wrt theta. Pulls theta toward the RUL optimum at +1 with strength rho.

EXECUTION STATE
operator: * = Scalar multiply.
operator: - = Scalar subtract.
rho = Per-step pull strength. C-MAPSS-flavour value is 500.
(theta - 1.0) = Distance from RUL optimum. Negative when theta < 1 (i.e. needs to grow).
⬆ result: g_rul = On step 0: rho · (0 - 1) = -rho. With rho=500 ⇒ -500. Pulls theta UP.
g_hs = (theta + 1.0)

Analytic gradient of L_hs = (1/2)(theta + 1)² wrt theta. Pulls theta toward -1 with strength 1.

EXECUTION STATE
operator: + = Scalar add.
(theta + 1.0) = Distance from HS optimum. Positive when theta > -1 (i.e. needs to shrink).
⬆ result: g_hs = On step 0: (0 + 1) = +1. Pulls theta DOWN with unit strength.
g = g_rul + g_hs

Naive sum of the two task gradients - the "multi-task loss" that almost every paper writes as L = L_rul + L_hs. With rho ≫ 1 the sum is dominated by g_rul.

EXECUTION STATE
operator: + = Scalar add.
→ bias = g ≈ g_rul whenever rho ≫ 1. The HS task contribution is structurally swamped.
⬆ result: g = Step 0 with rho=500: -500 + 1 = -499. The HS push is invisible.
theta = theta - lr * g

Plain SGD update step. New theta moves opposite the gradient by lr · g.

EXECUTION STATE
operator: * = Scalar multiply.
operator: - = Scalar subtract.
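lr = Step size. Nominally 0.05, but it must stay below 2/(rho + 1) for the iteration to converge.
→ step 0 = With rho = 500 and an uncapped lr = 0.05, theta_new = 0 - 0.05 · (-499) = +24.95, far past the optimum, and every later step overshoots further (the error is multiplied by 1 - lr·(rho + 1) = -24.05 per step). A stable step size, for example lr = 1/(rho + 1), lands on the fixed point below.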
→ fixed point = When g = 0: rho(theta - 1) + (theta + 1) = 0 ⇒ theta* = (rho - 1)/(rho + 1). For rho=500: theta* ≈ +0.996.
history["theta"].append(theta)

Record the new theta for plotting.

EXECUTION STATE
📚 list.append(x) = In-place append. Returns None. O(1) amortised.
history["rul_loss"].append(0.5 * (theta - 1.0) ** 2)

Record the unweighted RUL loss ½(theta - 1)², i.e. without the rho factor that appears in the gradient, so values are comparable across rho. Optimum at theta = 1 ⇒ L = 0.

EXECUTION STATE
operator: ** 2 = Square. Same as math.pow(x, 2).
(theta - 1.0) = Residual to RUL optimum.
⬆ recorded value = 0.5 · (theta - 1)². For theta=0.996, L_rul ≈ 8e-6 - the RUL task almost solved.
history["hs_loss"].append(0.5 * (theta + 1.0) ** 2)

Record L_hs = ½(theta + 1)². Optimum at theta = -1.

EXECUTION STATE
(theta + 1.0) = Residual to HS optimum.
⬆ recorded value = 0.5 · (theta + 1)². For theta=0.996, L_hs ≈ 1.992 - the HS task is at 4× its initial-state loss. SACRIFICED.
return history

Hand back the dict. Caller can plot trajectories and compute final losses.

EXECUTION STATE
⬆ return: history = Dict with three lists of n_steps Python floats each.
for rho in (1, 10, 100, 500):

Loop over four imbalance ratios that span equal-weight, mild, moderate, and C-MAPSS-grade.

EXECUTION STATE
iter var: rho = 1, 10, 100, 500
LOOP TRACE · 4 iterations
rho = 1
expectation = θ* = 0/2 = 0. Both losses equal. Both tasks served.
rho = 10
expectation = θ* = 9/11 ≈ 0.818. RUL nearly solved; HS loss = ½(1.818)² ≈ 1.65.
rho = 100
expectation = θ* = 99/101 ≈ 0.980. RUL solved; HS loss ≈ 1.96. Auxiliary task essentially abandoned.
rho = 500
expectation = θ* = 499/501 ≈ 0.996. RUL solved; HS loss ≈ 1.992. C-MAPSS-grade outcome.
h = simulate_two_task(rho)

Run the simulator for this rho. Returns the history dict.

EXECUTION STATE
⬇ arg: rho = Loop variable.
→ other args = Use defaults: n_steps=60, lr=0.05, theta_0=0.0.
⬆ result: h = history dict with three lists.
final_theta = h["theta"][-1]

Pick the last entry of the theta trajectory - the converged value.

EXECUTION STATE
📚 [-1] = Negative indexing. -1 is the last element of any sequence.
⬆ result: final_theta = For rho=500: ≈ +0.996.
final_rul = h["rul_loss"][-1]

Last RUL-loss value. Tells us how well the dominant task was served.

EXECUTION STATE
⬆ result: final_rul = For rho=500: ≈ 8e-6 - virtually solved.
final_hs = h["hs_loss"][-1]

Last HS-loss value. The bigger this gets relative to its starting value (½ · 1² = 0.5), the more the HS task was sacrificed.

EXECUTION STATE
⬆ result: final_hs = For rho=500: ≈ 1.992 - 4× the initial HS loss.
print(f"rho={rho:4d} → theta*={final_theta:+.3f} ...")

Format-string output. The :+.3f and :4d are Python format-spec mini-language - +.3f forces a sign and 3 decimals; 4d right-aligns the int in a field of width 4.

EXECUTION STATE
📚 f-string = Inline expression interpolation. f'{x}' evaluates x and inserts its formatted value (str(x) unless a format spec or conversion is given).
→ :4d = Format spec: int, min width 4 ⇒ '   1', '  10', ' 100', ' 500'.
→ :+.3f = Format spec: float, force sign, 3 decimals ⇒ '+0.000', '+0.818', '+0.980', '+0.996'.
⬆ Output (with a stable step size for every rho) =
rho=   1  →  theta*=+0.000  L_rul=0.500  L_hs=0.500  closed-form theta*=+0.000
rho=  10  →  theta*=+0.818  L_rul=0.017  L_hs=1.653  closed-form theta*=+0.818
rho= 100  →  theta*=+0.980  L_rul=0.000  L_hs=1.961  closed-form theta*=+0.980
rho= 500  →  theta*=+0.996  L_rul=0.000  L_hs=1.992  closed-form theta*=+0.996
→ reading the table = RUL loss → 0 monotonically; HS loss → 2 monotonically. Closed-form (rho-1)/(rho+1) matches the simulator within the 3-decimal print precision.
import numpy as np  # unused by the scalar toy; kept for readers who extend theta to a vector


def simulate_two_task(rho:     float,
                      n_steps: int   = 60,
                      lr:      float = 0.05,
                      theta_0: float = 0.0) -> dict:
    """One shared scalar parameter, two opposing optima.

    L_rul(theta) = (rho / 2) * (theta - 1.0)**2     # pulls toward +1
    L_hs (theta) = (1   / 2) * (theta + 1.0)**2     # pulls toward -1
    Combined gradient: g = rho*(theta - 1) + (theta + 1)
    Steady state of plain SGD: theta* = (rho - 1) / (rho + 1)
    """
    theta   = float(theta_0)
    history = {"theta": [], "rul_loss": [], "hs_loss": []}

    # Plain SGD on this quadratic diverges unless lr < 2 / (rho + 1),
    # so cap the step size. The fixed point itself does not depend on lr.
    lr = min(lr, 1.0 / (rho + 1.0))

    for step in range(n_steps):
        g_rul = rho * (theta - 1.0)                 # gradient of L_rul
        g_hs  =        (theta + 1.0)                # gradient of L_hs
        g     = g_rul + g_hs                        # naive sum

        theta = theta - lr * g                      # plain SGD update

        # Record the unweighted residual losses so values are comparable across rho.
        history["theta"].append(theta)
        history["rul_loss"].append(0.5 * (theta - 1.0) ** 2)
        history["hs_loss"].append(0.5 * (theta + 1.0) ** 2)

    return history


# ---------- Run for several imbalance ratios ----------
for rho in (1, 10, 100, 500):
    h = simulate_two_task(rho)
    final_theta = h["theta"][-1]
    final_rul   = h["rul_loss"][-1]
    final_hs    = h["hs_loss"][-1]
    print(f"rho={rho:4d}  →  theta*={final_theta:+.3f}  "
          f"L_rul={final_rul:.3f}  L_hs={final_hs:.3f}  "
          f"closed-form theta*={(rho - 1) / (rho + 1):+.3f}")
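
Whether plain SGD reaches the fixed point at all depends on the step size. For this quadratic the error obeys theta_{t+1} - theta* = (1 - lr·(rho + 1))·(theta_t - theta*), so the iteration converges only when lr < 2/(rho + 1). The sketch below checks that condition for a fixed lr = 0.05; the helper names are illustrative.

```python
def contraction_factor(rho: float, lr: float) -> float:
    """Per-step multiplier on the error (theta - theta*) under plain SGD."""
    return 1.0 - lr * (rho + 1.0)


def max_stable_lr(rho: float) -> float:
    """Step sizes below this value keep |contraction_factor| < 1."""
    return 2.0 / (rho + 1.0)


for rho in (1, 10, 100, 500):
    factor = contraction_factor(rho, lr=0.05)
    verdict = "converges" if abs(factor) < 1.0 else "diverges"
    print(f"rho={rho:4d}  factor={factor:+8.2f}  "
          f"max stable lr={max_stable_lr(rho):.4f}  {verdict}")
```

The step size only decides whether and how fast the trajectory settles; the location of the fixed point is still (rho - 1)/(rho + 1).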

PyTorch: Adam vs SGD on the Same Imbalance

Same toy two-task problem expressed as an nn.Module, trained under both SGD and Adam at ρ = 500. The print at the end confirms: both optimisers land at the same biased fixed point.

Same biased destination under SGD and Adam
🐍two_task_simulator_torch.py
import torch

Top-level PyTorch.

EXECUTION STATE
📚 torch = Tensor library + autograd + optim.
import torch.nn as nn

Modules and Parameter container.

EXECUTION STATE
📚 nn.Parameter = Tensor subclass that auto-registers as a learnable parameter when assigned as a module attribute.
import torch.nn.functional as F

Stateless ops (unused in this toy but conventional).

class TinyShared(nn.Module):

Tiny module that holds ONE learnable scalar parameter theta, plus a method to compute both task losses. Stand-in for the real DualTaskModel - same gradient story, smaller code.

def __init__(self, theta_0: float = 0.0):

Constructor. Single knob: where to initialise theta.

EXECUTION STATE
⬇ input: theta_0 = 0.0 = Initial value of the shared param. Sits exactly between the two task optima.
super().__init__()

Initialise nn.Module - sets up the parameter / buffer registries.

self.theta = nn.Parameter(torch.tensor(theta_0))

Wrap a 0-D tensor as a Parameter so PyTorch auto-tracks it. Assigning to self.theta also auto-registers it; model.parameters() will yield it.

EXECUTION STATE
📚 nn.Parameter(t) = Wrap a tensor so it counts as a learnable parameter of the module. Equivalent to setting requires_grad=True and registering with the module.
📚 torch.tensor(scalar) = Allocate a new tensor from a Python scalar. 0-D shape.
⬇ arg: theta_0 = 0.0 - the initial value.
⬆ result: self.theta = Parameter containing tensor(0.) with requires_grad=True.
def losses(self, rho: float):

Compute both per-task losses. Quadratic, with opposing minima. Symmetric except for the rho factor in front of L_rul.

EXECUTION STATE
⬇ input: rho = Per-step gradient ratio. The whole point of the experiment is to see what large rho does to the steady state.
⬆ returns = (L_rul, L_hs) tuple of 0-D tensors. Both depend on self.theta via autograd.
L_rul = 0.5 * rho * (self.theta - 1.0) ** 2

Quadratic in theta with optimum at +1. Multiplied by rho so its gradient magnitude is rho × the residual.

EXECUTION STATE
operator: ** 2 = Element-wise square (here on a 0-D tensor).
(self.theta - 1.0) = Residual to RUL optimum. Tensor minus Python float - PyTorch auto-broadcasts.
⬆ result: L_rul = 0-D tensor. At theta=0, rho=500: 0.5 · 500 · 1 = 250.
L_hs = 0.5 * (self.theta + 1.0) ** 2

Same quadratic but optimum at -1 and pull strength 1.

EXECUTION STATE
⬆ result: L_hs = 0-D tensor. At theta=0: 0.5 · 1 = 0.5.
→ ratio = L_rul / L_hs at init = 500 - matches the empirical median.
return L_rul, L_hs

Tuple. Caller adds them naively.

def run(rho, opt_name, n_steps=2000, lr=0.05) -> dict:

Driver function: build a fresh model and optimiser, then run n_steps of gradient descent under the chosen optimiser. Return the trajectory.

EXECUTION STATE
⬇ input: rho = Per-step gradient ratio.
⬇ input: opt_name = 'sgd' or 'adam'. Anything else raises.
⬇ input: n_steps = 2000 - Adam moves by roughly lr per step at first and then has to damp its overshoot, so it needs far more steps than a stably tuned SGD to settle.
⬇ input: lr = 0.05 - same as NumPy version.
⬆ returns = dict with three lists of length n_steps.
torch.manual_seed(0)

Repro. Strictly speaking unnecessary here (TinyShared has no random init) but it's harmless and habitual.

EXECUTION STATE
📚 torch.manual_seed(seed) = Sets PyTorch's global PRNG.
⬇ arg: seed = 0 = Conventional canonical seed.
model = TinyShared()

Instantiate with theta_0=0.0 (default).

EXECUTION STATE
⬆ result: model = TinyShared with one Parameter (theta = 0.0).
if opt_name == "sgd":

Branch on optimiser choice. Plain SGD is the trivial baseline.

optim = torch.optim.SGD(model.parameters(), lr=min(lr, 1.0 / (rho + 1.0)))

Vanilla SGD optimiser. No momentum, no weight decay. Update rule: theta ← theta - lr · g. The step size is capped at 1/(rho + 1) because plain SGD on this quadratic diverges once lr ≥ 2/(rho + 1).

EXECUTION STATE
📚 torch.optim.SGD(params, lr, momentum=0, weight_decay=0, nesterov=False) = Class implementing stochastic gradient descent. With defaults it is plain SGD.
⬇ arg: params = model.parameters() = Iterator yielding the (one) Parameter to optimise. Returned by every nn.Module.
⬇ arg: lr = min(lr, 1/(rho + 1)) = Learning rate. The nominal 0.05 is capped to ≈ 0.002 at rho = 500.
⬆ result: optim = An SGD instance.
elif opt_name == "adam":

Adam branch.

optim = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)

Adam optimiser with the default betas and eps (only lr differs from the library default). Per-parameter adaptive step. Update rule (sketch): m_t = β1·m + (1-β1)·g; v_t = β2·v + (1-β2)·g²; theta ← theta - lr · m̂ / (√v̂ + eps).

EXECUTION STATE
📚 torch.optim.Adam(params, lr, betas, eps, weight_decay) = Class implementing the Adam algorithm. Tracks first (m) and second (v) moment estimates per parameter.
⬇ arg: params = Same as SGD.
⬇ arg: lr = 0.05 = Step size scale.
⬇ arg: betas = (0.9, 0.999) = (β1, β2) - first- and second-moment decay rates. 0.9 / 0.999 are the canonical Adam defaults.
⬇ arg: eps = 1e-8 = Numerical stabiliser in the denominator. Guards against division by zero when a parameter's gradient history, and hence v̂, is exactly zero.
⬆ result: optim = An Adam instance.
else:

Fall-through.

raise ValueError(f"unknown optimiser: {opt_name}")

Defensive error - tells the caller exactly which input failed.

EXECUTION STATE
📚 raise ExceptionType(msg) = Python statement that throws. Stops the function and propagates up the stack.
hist = {"theta": [], "L_rul": [], "L_hs": []}

Trajectory dict.

EXECUTION STATE
⬆ result: hist = Three empty lists, ready to append.
for step in range(n_steps):

Counted training loop.

EXECUTION STATE
📚 range(stop) = Lazy iterator [0, stop).
iter var: step = 0..n_steps-1.
LOOP TRACE · 3 iterations
step 0
theta = 0.0000
L_rul = 250.0000 (= 0.5 · 500 · 1)
L_hs = 0.5000
g = g_rul + g_hs = -500 + 1 = -499
early steps
theta (SGD) = jumps to ≈ +0.996 within a few steps once the step size is stable (lr < 2/(rho + 1))
theta (Adam) = moves by roughly lr per step, because the update is normalised by √v̂, then overshoots and settles
final step
theta (SGD) = +0.9960 ≈ (rho-1)/(rho+1)
theta (Adam) = +0.9960 - Adam ARRIVES at the same biased fixed point
L_rul = ≈ 4e-3 (= 0.5 · 500 · 0.004², RUL solved)
L_hs = ≈ 1.992 (HS sacrificed)
→ conclusion = Different paths, same biased destination. Adam IS NOT a fix.
L_rul, L_hs = model.losses(rho)

Forward pass returns both task losses as 0-D tensors connected to theta via autograd.

EXECUTION STATE
⬆ result: L_rul = 0-D tensor. e.g. step 0: tensor(250.).
⬆ result: L_hs = 0-D tensor. e.g. step 0: tensor(0.5).
loss = L_rul + L_hs

Naive sum - the "multi-task loss" convention. With rho=500 the sum is dominated by L_rul.

EXECUTION STATE
operator: + = Tensor add.
⬆ result: loss = 0-D tensor. Step 0: tensor(250.5).
optim.zero_grad()

Clear gradients left over from the previous step so that backward() does not accumulate onto them.

EXECUTION STATE
📚 optim.zero_grad(set_to_none=True) = Since PyTorch 2.0 the default sets each .grad to None instead of filling it with zeros - faster, less memory.
loss.backward()

Reverse-mode autograd. Populates self.theta.grad with d(loss)/d(theta).

EXECUTION STATE
📚 .backward() = Backprops through the autograd graph and accumulates grads into all leaves with requires_grad=True.
→ effect = model.theta.grad becomes a 0-D tensor holding rho·(theta - 1) + (theta + 1).
optim.step()

Apply the optimiser update. For SGD: theta -= lr · grad. For Adam: theta -= lr · m̂ / (√v̂ + eps).

EXECUTION STATE
📚 optim.step() = Reads .grad off every parameter and applies the optimiser's update rule.
hist["theta"].append(model.theta.detach().item())

Record the current theta. .detach() avoids holding the autograd graph alive; .item() pulls a Python float out of a 0-D tensor.

EXECUTION STATE
📚 .detach() = Returns a tensor sharing storage but detached from autograd. Useful when you just want the numeric value, not the gradient.
📚 .item() = 0-D tensor → Python float.
hist["L_rul"].append(L_rul.detach().item())

Same trick for the RUL loss.

hist["L_hs"].append(L_hs.detach().item())

Same trick for the HS loss.

return hist

Hand back the trajectory.

EXECUTION STATE
⬆ return: hist = Three lists of n_steps Python floats.
for opt_name in ("sgd", "adam"):

Compare the two optimisers head-to-head at C-MAPSS-grade rho.

EXECUTION STATE
iter var: opt_name = 'sgd' then 'adam'.
LOOP TRACE · 2 iterations
opt_name = 'sgd'
expected = Reaches the closed-form bias fixed point within a few steps, provided the step size satisfies lr < 2/(rho + 1).
opt_name = 'adam'
expected = Reaches the SAME fixed point. Per-parameter rescaling cannot tell two task gradients apart on ONE shared param.
h = run(rho=500.0, opt_name=opt_name)

Run the simulator at C-MAPSS rho.

EXECUTION STATE
⬇ arg: rho = 500.0 = Empirical C-MAPSS median.
⬇ arg: opt_name = Loop variable.
⬆ result: h = history dict.
print(f"{opt_name:>4s} final theta={h['theta'][-1]:+.4f} L_rul={h['L_rul'][-1]:.5f} L_hs={h['L_hs'][-1]:.4f}")

f-string output. {opt_name:>4s} right-aligns the name in a field of width 4. {…[-1]:+.4f} pulls the last entry and formats with sign and 4 decimals.

EXECUTION STATE
→ :>4s = Format spec: string, right-aligned, min width 4.
→ :+.4f = Float, force sign, 4 decimals.
→ :.5f = Float, 5 decimals (we want to see the small RUL loss clearly).
⬆ Output (once both runs have converged) =
 sgd  final theta=+0.9960  L_rul=0.00398  L_hs=1.9920
adam  final theta=+0.9960  L_rul=0.00398  L_hs=1.9920
→ note = L_rul here includes the rho factor: 0.5 · 500 · (0.996 - 1)² ≈ 0.004.
→ reading = Both optimisers land at the SAME biased fixed point. Adam changes the trajectory; it does NOT change the destination.
import torch
import torch.nn as nn
import torch.nn.functional as F  # unused in this toy; kept as a conventional import


class TinyShared(nn.Module):
    """Single shared scalar param theta with two opposing task losses."""
    def __init__(self, theta_0: float = 0.0):
        super().__init__()
        self.theta = nn.Parameter(torch.tensor(theta_0))

    def losses(self, rho: float):
        L_rul = 0.5 * rho * (self.theta - 1.0) ** 2     # pulls toward +1
        L_hs  = 0.5 *        (self.theta + 1.0) ** 2     # pulls toward -1
        return L_rul, L_hs


def run(rho: float, opt_name: str, n_steps: int = 2000, lr: float = 0.05) -> dict:
    """Train the toy model with either plain SGD or Adam."""
    torch.manual_seed(0)
    model = TinyShared()

    if opt_name == "sgd":
        # Plain SGD diverges on this quadratic unless lr < 2 / (rho + 1),
        # so cap the step size. The fixed point does not depend on lr.
        optim = torch.optim.SGD(model.parameters(),
                                lr=min(lr, 1.0 / (rho + 1.0)))
    elif opt_name == "adam":
        optim = torch.optim.Adam(model.parameters(), lr=lr,
                                 betas=(0.9, 0.999), eps=1e-8)
    else:
        raise ValueError(f"unknown optimiser: {opt_name}")

    hist = {"theta": [], "L_rul": [], "L_hs": []}
    for step in range(n_steps):
        L_rul, L_hs = model.losses(rho)
        loss        = L_rul + L_hs                       # naive sum
        optim.zero_grad()
        loss.backward()
        optim.step()

        # theta is recorded after the update; the losses are the pre-update values.
        hist["theta"].append(model.theta.detach().item())
        hist["L_rul"].append(L_rul.detach().item())
        hist["L_hs"].append(L_hs.detach().item())
    return hist


# ---------- Compare SGD vs Adam at C-MAPSS rho ----------
for opt_name in ("sgd", "adam"):
    h = run(rho=500.0, opt_name=opt_name)
    print(f"{opt_name:>4s}  final theta={h['theta'][-1]:+.4f}  "
          f"L_rul={h['L_rul'][-1]:.5f}  "
          f"L_hs={h['L_hs'][-1]:.4f}")
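
The naive sum can hide what happens to the auxiliary task. A small sketch, assuming the same weighted toy losses as the PyTorch model (`task_losses` is an illustrative helper), compares the summed loss and the HS loss at initialisation and at the closed-form fixed point:

```python
def task_losses(theta: float, rho: float) -> tuple[float, float]:
    """Weighted RUL loss and HS loss, as summed in the naive objective."""
    l_rul = 0.5 * rho * (theta - 1.0) ** 2
    l_hs = 0.5 * (theta + 1.0) ** 2
    return l_rul, l_hs


rho = 500.0
theta_init = 0.0
theta_star = (rho - 1.0) / (rho + 1.0)

rul_0, hs_0 = task_losses(theta_init, rho)
rul_1, hs_1 = task_losses(theta_star, rho)

total_0, total_1 = rul_0 + hs_0, rul_1 + hs_1
print(f"total loss: {total_0:.3f} -> {total_1:.3f}")
print(f"HS loss:    {hs_0:.3f} -> {hs_1:.3f}")
```

The total falls from 250.5 to about 2.0 while the HS loss rises from 0.5 to about 1.99, so a falling summed loss says nothing about the auxiliary task on its own.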

Symptoms in the Wild

These are the empirical signatures of a 500× imbalance on the real DualTaskModel - exactly what you should expect to see if you train the §11.4 model with a vanilla L = L_rul + L_hs loss.

| Symptom | Where it shows up | Diagnostic |
|---|---|---|
| RUL trains, HS plateaus near 33% | TensorBoard / val curves | HS accuracy ≤ 1/K after 5 epochs |
| Critical-class recall < 30% | Confusion matrix | Rare class essentially unlearned |
| Shared features cluster by RUL bin | t-SNE of z (32-D) | No class structure in the embedding |
| Class boundaries depend on RUL only | Linear-probe on shared features | Probe accuracy = chance |
| Loss weight 1:1 fails reproducibly | Hyperparameter sweep | No fixed weight crosses both tasks |
| Adam vs SGD gives same final loss | Optimiser ablation | ≤ 1% relative gap in final HS accuracy |

The probe trick. Freeze the trunk, fit a linear classifier on the shared 32-D vector. If probe accuracy is at chance, the trunk has not learned class structure - and that is the ultimate evidence that the gradient imbalance has biased the representation.
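
A minimal sketch of the probe idea, under loud assumptions: the features below are synthetic stand-ins rather than real DualTaskModel activations, the probe is a least-squares linear classifier (one of several reasonable choices), and accuracy is measured on the fitting data for brevity where a real probe would use a held-out split.

```python
import numpy as np


def linear_probe_accuracy(features: np.ndarray, labels: np.ndarray, n_classes: int) -> float:
    """Least-squares linear classifier on frozen features; accuracy on the same data."""
    design = np.hstack([features, np.ones((len(features), 1))])  # append a bias column
    targets = np.eye(n_classes)[labels]                          # one-hot targets
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    predictions = (design @ weights).argmax(axis=1)
    return float((predictions == labels).mean())


# Hypothetical stand-in data: 400 samples, 3 health states, class 0 is the majority.
idx = np.arange(400)
labels = np.array([0, 0, 1, 2])[idx % 4]
rul = (idx // 4) / 100.0                     # same RUL value within each group of four

z_rul_only = rul[:, None]                                     # trunk encodes RUL only
z_with_class = np.hstack([rul[:, None], np.eye(3)[labels]])   # trunk also encodes class

acc_rul_only = linear_probe_accuracy(z_rul_only, labels, n_classes=3)
acc_with_class = linear_probe_accuracy(z_with_class, labels, n_classes=3)
print(f"RUL-only features: {acc_rul_only:.2f}   class-aware features: {acc_with_class:.2f}")
```

When the features carry no class information the probe can do no better than predicting the majority class, which is the chance-level signature the table describes.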

Three Diagnostic Pitfalls

Pitfall 1: Reading val loss instead of probe accuracy. Total val loss can DROP because the dominant task improves - even while the auxiliary task stays at chance. Always track per-task metrics, never just the sum.
Pitfall 2: Blaming the optimiser. “Switch to Adam” / “try AdamW” / “use cosine schedule” - none fix the bias. The toy simulation makes this brutally clear: the fixed point is a function of ρ alone. Rebalance the gradients, not the scheduler.
Pitfall 3: Trusting one fixed loss weight. A grid search over L = L_rul + λ·L_hs can find a λ that works at one epoch (typically near init), but the residual decays during training while the CE bound stays flat. The required λ doubles, then quadruples, by epoch 30. Adaptive weighting is the only reliable answer; §14 onward is about how to compute it.
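
Pitfall 3 can be illustrated in the toy problem by plugging a weighted loss L = L_rul + λ·L_hs into the fixed-point formula, which gives θ* = (ρ − λ)/(ρ + λ). The sketch below is only that toy calculation: the value of `lam_tuned` and the ρ values are illustrative, and it makes no claim about how ρ actually drifts during real training.

```python
def fixed_point(rho: float, lam: float = 1.0) -> float:
    """Steady state of L = L_rul + lam * L_hs in the toy problem (a_rul = rho, a_hs = lam)."""
    return (rho - lam) / (rho + lam)


lam_tuned = 500.0                      # hypothetical weight chosen to balance rho = 500
results = {}
for rho in (500.0, 1000.0, 2000.0):
    results[rho] = fixed_point(rho, lam_tuned)
    print(f"rho={rho:6.0f}  lambda={lam_tuned:.0f}  theta*={results[rho]:+.3f}")
```

A weight that centres θ* at one gradient ratio leaves it biased again as soon as the ratio changes, which is why a single grid-searched λ does not hold up.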
The point of all four sections. The 500× gradient imbalance is structural, not anecdotal. It biases the shared representation. It cannot be fixed by changing the optimiser, learning rate, or a single loss weight. Adaptive multi-task losses (the next part of the book) are the answer.

Takeaway — End of Part IV Setup

  • Bias is structural. θ* = (ρ − 1)/(ρ + 1) regardless of optimiser, lr, or batch size.
  • Adam does not fix it. Per-parameter rescaling cannot tell two task gradients apart on a shared parameter. Same fixed point as SGD.
  • HS is sacrificed first. Auxiliary task plateaus near chance; its rare class is invisible to the backbone.
  • Linear-probe diagnostic. If a frozen-trunk probe is at chance, the imbalance has bitten. Use this before declaring “multi-task works.”
  • Next stop: Part V. Chapter 13 reframes this as the accuracy-safety tradeoff (the IEEE/CAA JAS paper's framing). Chapter 14 introduces AMNL - the first of the three rebalancing methods.