What 500× Actually Costs
§12.1 found the imbalance. §12.2 explained why it must exist. §12.3 measured the distribution across 4,120 batches. This section is the punchline: under a 500× imbalance, the shared backbone learns RUL features and ignores classification. Auxiliary task accuracy plateaus below random guessing for the rare class. RUL itself trains but converges to a representation that ignores degradation-state structure, which is exactly the structure the auxiliary task was meant to inject.
The Shared-Parameter Update Equation
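For a single shared parameter $\theta$ with quadratic per-task losses $L_i(\theta) = \tfrac{1}{2} w_i (\theta - \theta_i)^2$, $i \in \{\text{rul}, \text{hs}\}$, the plain SGD step on the naive sum $L = L_{\text{rul}} + L_{\text{hs}}$ has fixed point

$$\theta^* = \frac{w_{\text{rul}}\,\theta_{\text{rul}} + w_{\text{hs}}\,\theta_{\text{hs}}}{w_{\text{rul}} + w_{\text{hs}}}.$$

With opposing optima ($\theta_{\text{rul}} = +1$, $\theta_{\text{hs}} = -1$) and $\rho = w_{\text{rul}} / w_{\text{hs}}$ this collapses to $\theta^* = (\rho - 1)/(\rho + 1)$.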
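The fixed point follows from setting the gradient of the summed loss to zero. The derivation below is a sketch under one assumption: each task loss is a quadratic with its own curvature weight $w_i$ (the symbol names are illustrative), and $\rho$ is the ratio of the RUL weight to the HS weight.

```latex
\begin{aligned}
L(\theta) &= \tfrac{1}{2} w_{\text{rul}} (\theta - \theta_{\text{rul}})^2
           + \tfrac{1}{2} w_{\text{hs}} (\theta - \theta_{\text{hs}})^2 \\
\frac{dL}{d\theta} &= w_{\text{rul}} (\theta - \theta_{\text{rul}})
           + w_{\text{hs}} (\theta - \theta_{\text{hs}}) = 0
\;\Longrightarrow\;
\theta^* = \frac{w_{\text{rul}}\,\theta_{\text{rul}} + w_{\text{hs}}\,\theta_{\text{hs}}}
                {w_{\text{rul}} + w_{\text{hs}}} \\
\theta_{t+1} - \theta^* &= \bigl(1 - \eta\,(w_{\text{rul}} + w_{\text{hs}})\bigr)\,(\theta_t - \theta^*) \\
\theta^* &= \frac{\rho - 1}{\rho + 1}
\quad \bigl(\theta_{\text{rul}} = +1,\ \theta_{\text{hs}} = -1,\ \rho = w_{\text{rul}} / w_{\text{hs}}\bigr) \\
\rho = 500: \quad \theta^* &= \tfrac{499}{501} \approx 0.996
\end{aligned}
```

The third line shows that SGD contracts toward $\theta^*$ for any step size $0 < \eta < 2/(w_{\text{rul}} + w_{\text{hs}})$, so the learning rate changes the speed but not the destination. The loss columns in the table match the unit-curvature values $\tfrac{1}{2}(\theta^* - 1)^2$ and $\tfrac{1}{2}(\theta^* + 1)^2$; for example, at $\rho = 10$ the RUL entry is $\tfrac{1}{2}(2/11)^2 \approx 0.017$.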
| ρ | θ* | L_rul at θ* | L_hs at θ* | Interpretation |
|---|---|---|---|---|
| 1 | 0.000 | 0.500 | 0.500 | Both tasks served. Symmetry. |
| 10 | 0.818 | 0.017 | 1.653 | Mild imbalance. HS already 3× worse. |
| 100 | 0.980 | 0.000 | 1.961 | RUL solved. HS essentially abandoned. |
| 500 | 0.996 | 0.000 | 1.992 | C-MAPSS grade. HS at 4× initial loss. |
| 2000 | 0.999 | 0.000 | 1.998 | Tail batches. HS gradient is invisible. |
Why Adam Cannot Save You
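Adam's update rule is $\theta_{t+1} = \theta_t - \eta\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$, where $\hat m_t$ and $\hat v_t$ track the first and second moments of the COMBINED gradient $g_t = g_{\text{rul}} + g_{\text{hs}}$. Per-parameter rescaling normalises each parameter's step by its own second-moment estimate; it has no concept of “which task contributed how much”. On a shared parameter, both tasks hit the SAME denominator, and the dominant task's gradient still dominates the numerator.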
Concretely, at our toy fixed point the combined gradient is zero by construction, so Adam converges to the same biased fixed point as plain SGD. It just takes a slightly different path there. The PyTorch experiment below demonstrates this.
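The shared-denominator argument can be checked by hand on the very first Adam step, where the bias-corrected moments reduce to the combined gradient and its square. The sketch below assumes the toy gradients $g_{\text{rul}} = \rho(\theta - 1)$ and $g_{\text{hs}} = \theta + 1$; the function name `adam_first_step` and the default hyperparameters are illustrative choices, not taken from the chapter.

```python
import math


def adam_first_step(g_rul, g_hs, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step on one shared parameter, split by task contribution."""
    g = g_rul + g_hs                      # Adam only ever sees this sum
    m = (1 - beta1) * g                   # first-moment estimate at t = 1
    v = (1 - beta2) * g * g               # second-moment estimate at t = 1
    m_hat = m / (1 - beta1)               # bias correction with t = 1
    v_hat = v / (1 - beta2)
    denom = math.sqrt(v_hat) + eps        # one denominator, shared by both tasks
    step = -lr * m_hat / denom
    # The numerator is linear in g, so the step splits exactly by task.
    step_rul = -lr * g_rul / denom
    step_hs = -lr * g_hs / denom
    return step, step_rul, step_hs


rho, theta = 500.0, 0.0                   # assumed toy setting: start at theta = 0
g_rul = rho * (theta - 1.0)               # RUL pulls theta toward +1
g_hs = theta + 1.0                        # HS pulls theta toward -1
step, step_rul, step_hs = adam_first_step(g_rul, g_hs)

print(f"total step   = {step:+.6f}")
print(f"RUL share    = {step_rul:+.6f}")
print(f"HS share     = {step_hs:+.6f}")
print(f"|HS| / |RUL| = {abs(step_hs / step_rul):.6f}")
```

Dividing by the shared denominator rescales the whole step but leaves the ratio between the two task contributions at $1/\rho$, which is the sense in which Adam cannot tell the tasks apart.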
Interactive: Watch The Bias Form
Drag ρ. Toggle SGD vs Adam. The steady-state values do not move when you change optimiser, only when you change ρ.
Try this. Set ρ = 1, optimiser = SGD. Both losses settle at 0.5; HS accuracy ~73%. Now flip optimiser to Adam: same final state. Now crank ρ to 500: HS accuracy collapses to ~28% under both optimisers. The accuracy plateau is a property of the loss landscape, not the optimiser.
Python: Toy Two-Task SGD
Plain numerical simulation of a shared scalar parameter being pulled in two directions. We print the closed-form fixed point next to the simulator's final value to confirm the theory matches.
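A minimal sketch of such a simulation follows. It assumes the toy gradients $g_{\text{rul}} = \rho(\theta - 1)$ and $g_{\text{hs}} = \theta + 1$, a start at $\theta = 0$, and a step size set to half the inverse curvature; the function names are illustrative.

```python
def closed_form_fixed_point(rho):
    """Fixed point of the naive sum: theta* = (rho - 1) / (rho + 1)."""
    return (rho - 1.0) / (rho + 1.0)


def simulate_sgd(rho, steps=200, theta0=0.0):
    """Plain gradient descent on rho * 0.5*(theta-1)^2 + 0.5*(theta+1)^2."""
    lr = 0.5 / (rho + 1.0)                # assumed: half the inverse curvature
    theta = theta0
    for _ in range(steps):
        g_rul = rho * (theta - 1.0)       # RUL pulls toward +1, scaled by rho
        g_hs = theta + 1.0                # HS pulls toward -1
        theta -= lr * (g_rul + g_hs)      # step on the naive sum
    return theta


def unit_losses(theta):
    """Unit-curvature losses, as reported in the fixed-point table."""
    return 0.5 * (theta - 1.0) ** 2, 0.5 * (theta + 1.0) ** 2


if __name__ == "__main__":
    print(f"{'rho':>6} {'simulated':>10} {'closed':>10} {'L_rul':>8} {'L_hs':>8}")
    for rho in (1, 10, 100, 500, 2000):
        theta_sim = simulate_sgd(rho)
        theta_cf = closed_form_fixed_point(rho)
        l_rul, l_hs = unit_losses(theta_sim)
        print(f"{rho:>6} {theta_sim:>10.3f} {theta_cf:>10.3f} "
              f"{l_rul:>8.3f} {l_hs:>8.3f}")
```

With this step size each update halves the distance to the fixed point, so 200 steps is far more than needed; any stable learning rate reaches the same value.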
PyTorch: Adam vs SGD on the Same Imbalance
Same toy two-task problem expressed as an nn.Module, trained under both SGD and Adam at ρ = 500. The print at the end confirms: both optimisers land at the same biased fixed point.
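One way such an experiment could look is sketched below. The class name `ToyTwoTask`, the helper `train`, the learning rates, and the step count are assumptions made for illustration; only `nn.Module`, `torch.optim.SGD`, and `torch.optim.Adam` are standard PyTorch.

```python
import torch
from torch import nn


class ToyTwoTask(nn.Module):
    """One shared scalar pulled toward +1 (RUL) and -1 (HS)."""

    def __init__(self, rho: float):
        super().__init__()
        self.rho = rho
        self.theta = nn.Parameter(torch.zeros(()))   # assumed start at 0

    def forward(self):
        l_rul = 0.5 * self.rho * (self.theta - 1.0) ** 2
        l_hs = 0.5 * (self.theta + 1.0) ** 2
        return l_rul, l_hs


def train(optimiser_name: str, rho: float = 500.0, steps: int = 2000) -> float:
    """Train on the naive sum L_rul + L_hs and return the final theta."""
    model = ToyTwoTask(rho)
    if optimiser_name == "sgd":
        # lr must stay below 2 / (rho + 1) for plain SGD to be stable
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        l_rul, l_hs = model()
        (l_rul + l_hs).backward()
        opt.step()
    return model.theta.item()


if __name__ == "__main__":
    rho = 500.0
    print(f"closed form: {(rho - 1) / (rho + 1):.4f}")
    print(f"SGD        : {train('sgd', rho):.4f}")
    print(f"Adam       : {train('adam', rho):.4f}")
```

The two optimisers take different paths, but both final values should sit next to the closed-form $\theta^* = (\rho - 1)/(\rho + 1)$.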
Symptoms in the Wild
These are the empirical signatures of a 500× imbalance on the real DualTaskModel: exactly what you should expect to see if you train the §11.4 model with a vanilla L = L_rul + L_hs loss.
| Symptom | Where it shows up | Diagnostic |
|---|---|---|
| RUL trains, HS plateaus near 33% | TensorBoard / val curves | HS accuracy ≤ 1/K after 5 epochs |
| Critical-class recall < 30% | Confusion matrix | Rare class essentially unlearned |
| Shared features cluster by RUL bin | t-SNE of z (32-D) | No class structure in the embedding |
| Class boundaries depend on RUL only | Linear-probe on shared features | Probe accuracy = chance |
| Loss weight 1:1 fails reproducibly | Hyperparameter sweep | No single fixed weight satisfies both tasks |
| Adam vs SGD gives same final loss | Optimiser ablation | ≤ 1% relative gap in final HS accuracy |
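The linear-probe row in the table can be sketched as a softmax-regression probe fitted on frozen features. Everything concrete here is an assumption for illustration: the helper `linear_probe_accuracy` is hypothetical, K = 3 classes is assumed, and the 32-D features are synthetic stand-ins (pure noise versus noise plus a class signal), not outputs of the real DualTaskModel.

```python
import numpy as np


def linear_probe_accuracy(z_train, y_train, z_test, y_test, n_classes,
                          lr=0.5, epochs=300):
    """Fit a softmax-regression probe on frozen features; return test accuracy."""
    mu, sd = z_train.mean(axis=0), z_train.std(axis=0) + 1e-8
    zt, zs = (z_train - mu) / sd, (z_test - mu) / sd   # standardise with train stats
    w = np.zeros((zt.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y_train]
    for _ in range(epochs):
        logits = zt @ w + b
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(zt)                  # d(mean CE) / d(logits)
        w -= lr * zt.T @ grad
        b -= lr * grad.sum(axis=0)
    pred = (zs @ w + b).argmax(axis=1)
    return float((pred == y_test).mean())


rng = np.random.default_rng(0)
K, n, d = 3, 900, 32                                   # assumed sizes
y = rng.integers(0, K, size=n)
z_noise = rng.normal(size=(n, d))                      # features with no class structure
z_info = z_noise.copy()
z_info[:, :K] += 3.0 * np.eye(K)[y]                    # features that encode the class

split = 600
acc_noise = linear_probe_accuracy(z_noise[:split], y[:split],
                                  z_noise[split:], y[split:], K)
acc_info = linear_probe_accuracy(z_info[:split], y[:split],
                                 z_info[split:], y[split:], K)
print(f"chance = {1 / K:.2f}, probe on noise = {acc_noise:.2f}, "
      f"probe on informative = {acc_info:.2f}")
```

On a real model, `z_train` and `z_test` would be the shared-trunk outputs with the trunk frozen; a held-out probe accuracy indistinguishable from 1/K is the signature listed in the table.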
Three Diagnostic Pitfalls
A sweep over L = L_rul + λ·L_hs can find a λ that works at one epoch (typically near init), but the residual decays during training while the CE bound stays flat. The required λ doubles, then quadruples, by epoch 30. Adaptive weighting is the only reliable answer; §14 onward is about how to compute it.

The point of all four sections. The 500× gradient imbalance is structural, not anecdotal. It biases the shared representation. It cannot be fixed by changing the optimiser, learning rate, or a single loss weight. Adaptive multi-task losses (the next part of the book) are the answer.
Takeaway — End of Part IV Setup
- Bias is structural. θ* = (ρ − 1)/(ρ + 1) regardless of optimiser, lr, or batch size.
- Adam does not fix it. Per-parameter rescaling cannot tell two task gradients apart on a shared parameter. Same fixed point as SGD.
- HS is sacrificed first. Auxiliary task plateaus near chance; its rare class is invisible to the backbone.
- Linear-probe diagnostic. If a frozen-trunk probe is at chance, the imbalance has bitten. Use this before declaring “multi-task works.”
- Next stop: Part V. Chapter 13 reframes this as the accuracy-safety tradeoff (the IEEE/CAA JAS paper's framing). Chapter 14 introduces AMNL, the first of the three rebalancing methods.