Tug-of-War on the Shared Parameters
Two children pull a rope from opposite ends. One is twice the size of the other: each pulls away from the centre, but with very different force. The rope's actual motion is the SUM of the two forces, so it moves toward the bigger child. Worse, if the smaller child suddenly redirects her pull by 90 degrees, the rope barely responds, because her pull is too small to matter.
That is what happens to the shared parameters of an MTL model during training. Every backward pass produces one gradient vector per task. The optimiser follows their (lambda-weighted) sum. If one task's gradient is much larger than the other's — or if their directions disagree — the shared parameters are being yanked one way while the other task's preferences are ignored.
The Gradient Decomposition
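With combined loss $\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$, the gradient with respect to a shared parameter $\theta$ decomposes linearly:
$$\nabla_\theta \mathcal{L} = \lambda_1 \nabla_\theta \mathcal{L}_1 + \lambda_2 \nabla_\theta \mathcal{L}_2$$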
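This is simply linearity of differentiation: the derivative of a weighted sum is the weighted sum of the derivatives. The combined update is a convex combination of the per-task gradients when $\lambda_k \ge 0$ and $\sum_k \lambda_k = 1$. Two properties of this linear combination determine whether learning succeeds: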
| Property | Symbol | Failure mode |
|---|---|---|
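| Per-task magnitude | $\lVert \nabla_\theta \mathcal{L}_k \rVert$ | If they differ by orders of magnitude, the larger task dominates |
| Per-task direction | $\nabla_\theta \mathcal{L}_k / \lVert \nabla_\theta \mathcal{L}_k \rVert$ | If two tasks pull opposite ways, gradients partially cancel |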
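The decomposition can be checked numerically. The sketch below is an illustration only: the two quadratic losses, the weights, and every name in it (`loss_1`, `loss_2`, `lam`, `numerical_grad`) are assumptions made up for this example, not the book's model. It compares a finite-difference gradient of the combined loss with the lambda-weighted sum of per-task gradients, then reports the two properties from the table.

```python
import numpy as np

# Hypothetical toy losses on a shared parameter vector theta (not the book's model).
def loss_1(theta):
    return 50.0 * np.sum((theta - 1.0) ** 2)   # large-scale task

def loss_2(theta):
    return 0.5 * np.sum((theta + 1.0) ** 2)    # small-scale task

def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

lam = (0.5, 0.5)                       # assumed equal weights
theta = np.array([0.3, -0.7, 2.0])

# Per-task gradients, and the gradient of the combined loss computed directly.
g1 = numerical_grad(loss_1, theta)
g2 = numerical_grad(loss_2, theta)
g_combined = numerical_grad(lambda t: lam[0] * loss_1(t) + lam[1] * loss_2(t), theta)
g_sum = lam[0] * g1 + lam[1] * g2

# The two properties from the table: magnitude ratio and direction agreement.
norm_ratio = np.linalg.norm(g1) / np.linalg.norm(g2)
cosine = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

print("max |combined - weighted sum|:", np.max(np.abs(g_combined - g_sum)))
print("magnitude ratio:", norm_ratio)
print("cosine between task gradients:", cosine)
```

The first printed number should be at the level of floating-point noise, which is the decomposition in action; the other two are the quantities to monitor.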
Two Failure Modes: Magnitude and Direction
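Magnitude imbalance. Suppose $\lVert \nabla_\theta \mathcal{L}_1 \rVert \approx 500\,\lVert \nabla_\theta \mathcal{L}_2 \rVert$, as we will measure on real C-MAPSS in Section 12. With $\lambda_1 = \lambda_2$, the combined gradient is approximately equal to $\lambda_1 \nabla_\theta \mathcal{L}_1$: task 2's influence is essentially noise.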
Direction conflict. Suppose two tasks have similar magnitudes but opposite directions: $\nabla_\theta \mathcal{L}_2 = -\nabla_\theta \mathcal{L}_1$. With equal lambda, their sum is zero. The optimiser does not move. With static lambda, the only escape is to abandon one task.
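Both failure modes can be read off a single identity. As a worked step, write $g_k = \nabla_\theta \mathcal{L}_k$ and let $\phi$ be the angle between $g_1$ and $g_2$ (shorthand introduced here for convenience). The squared norm of the combined gradient then expands as:

```latex
\lVert \lambda_1 g_1 + \lambda_2 g_2 \rVert^2
  = \lambda_1^2 \lVert g_1 \rVert^2
  + \lambda_2^2 \lVert g_2 \rVert^2
  + 2 \lambda_1 \lambda_2 \lVert g_1 \rVert \lVert g_2 \rVert \cos\phi
```

If $\lVert g_1 \rVert \gg \lVert g_2 \rVert$, the first term swamps the other two and the combined gradient is close to $\lambda_1 g_1$ (magnitude imbalance). If the weighted norms are equal and $\cos\phi = -1$, the three terms cancel exactly (direction conflict).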
Interactive: Combine Two Gradient Vectors
Below: two task gradients drawn as 2-D arrows from the same origin. Slide lambda; slide the angle of the health gradient; slide the magnitude ratio. The green arrow is what the optimiser actually follows.
Set magnitude ratio to 20x and the green arrow snaps to the red regardless of lambda — magnitude wins. Now drop ratio to 1x and rotate the health gradient to 180 degrees: green nearly disappears — direction conflict cancels both. These are the two failure modes the rest of the book engineers around.
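The same two experiments can be reproduced without the sliders. This is a minimal sketch under stated assumptions: the red arrow is taken to be the RUL gradient, the health gradient has unit length, and lambda is split as `lam_rul` and `1 - lam_rul`. The helper names `combine` and `cosine` are hypothetical.

```python
import numpy as np

def combine(lam_rul, angle_deg, ratio):
    """Return (g_rul, g_health, combined) for a 2-D toy.

    lam_rul   : weight on the RUL gradient; health gets 1 - lam_rul (assumed convention)
    angle_deg : angle of the health gradient relative to the RUL gradient
    ratio     : ||g_rul|| / ||g_health||, with ||g_health|| fixed at 1
    """
    g_rul = ratio * np.array([1.0, 0.0])
    phi = np.deg2rad(angle_deg)
    g_health = np.array([np.cos(phi), np.sin(phi)])
    combined = lam_rul * g_rul + (1.0 - lam_rul) * g_health
    return g_rul, g_health, combined

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Magnitude imbalance: 20x ratio, health gradient at 90 degrees.
g_rul, _, g = combine(lam_rul=0.5, angle_deg=90.0, ratio=20.0)
cos_imbalanced = cosine(g, g_rul)

# Direction conflict: equal magnitudes, opposite directions.
_, _, g_cancel = combine(lam_rul=0.5, angle_deg=180.0, ratio=1.0)
norm_cancelled = float(np.linalg.norm(g_cancel))

print(f"cosine(combined, RUL) at 20x: {cos_imbalanced:.4f}")
print(f"combined norm at 1x, 180 degrees: {norm_cancelled:.2e}")
```

A cosine close to 1 in the first case means the combined arrow is almost indistinguishable from the RUL arrow; a norm close to 0 in the second means the optimiser has nothing left to follow.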
Python: Per-Task Gradient Norms by Hand
A toy in NumPy that quantifies what the diagram shows. With a 100x magnitude imbalance and equal lambda, RUL contributes 99% of the total gradient. The corresponding measurement on real C-MAPSS produces a 500x ratio — meaning health contributes ~0.2% of the total update step.
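A minimal sketch of such a toy follows. It assumes that "contribution" means each task's share of the summed weighted gradient norms; that definition, the random directions, the parameter count `dim`, and the helper names `unit` and `norm_shares` are assumptions for illustration and may differ from the book's own listing.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def norm_shares(g_rul, g_health, lam_rul=0.5, lam_health=0.5):
    """Fraction of the summed weighted gradient norms contributed by each task."""
    n_rul = lam_rul * np.linalg.norm(g_rul)
    n_health = lam_health * np.linalg.norm(g_health)
    total = n_rul + n_health
    return n_rul / total, n_health / total

dim = 1000  # hypothetical number of shared parameters
direction_rul = unit(rng.standard_normal(dim))
direction_health = unit(rng.standard_normal(dim))

shares = {}
for ratio in (1.0, 100.0, 500.0):
    g_rul = ratio * direction_rul      # RUL gradient with norm equal to ratio
    g_health = direction_health        # health gradient with norm 1
    shares[ratio] = norm_shares(g_rul, g_health)
    print(f"ratio {ratio:>5.0f}x  RUL {shares[ratio][0]:.2%}  health {shares[ratio][1]:.2%}")
```

Under this definition the RUL share at 100x is 100/101, about 99%, and the health share at 500x is 1/501, about 0.2%, which matches the figures quoted above.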
PyTorch: torch.autograd.grad for Per-Task Inspection
Real measurements use torch.autograd.grad — it computes per-task gradients without mutating the model's .grad attributes, so you can inspect each task's signal in isolation. This is the same primitive Section 18 uses inside the GABA controller.
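A minimal sketch of that inspection is shown below. Everything about the model is assumed for illustration: the layer sizes, the three health classes, the synthetic batch, and the helper `grad_norm` are hypothetical and are not the book's architecture or C-MAPSS data. Only the use of torch.autograd.grad per task, followed by a single L2 norm, reflects the procedure described here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-head model: a shared trunk plus one head per task.
shared = nn.Sequential(nn.Linear(14, 32), nn.ReLU())
rul_head = nn.Linear(32, 1)       # regression head
health_head = nn.Linear(32, 3)    # classification head (3 classes assumed)

x = torch.randn(64, 14)                    # synthetic batch, not C-MAPSS
rul_target = 125.0 * torch.rand(64, 1)     # synthetic RUL targets
health_target = torch.randint(0, 3, (64,))

features = shared(x)
loss_rul = nn.functional.mse_loss(rul_head(features), rul_target)
loss_health = nn.functional.cross_entropy(health_head(features), health_target)

shared_params = [p for p in shared.parameters() if p.requires_grad]

def grad_norm(loss, params):
    """L2 norm of d(loss)/d(params), computed without touching .grad."""
    # retain_graph=True keeps the shared graph alive for the next task's call.
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    flat = torch.cat([g.reshape(-1) for g in grads if g is not None])
    return flat.norm().item()

norm_rul = grad_norm(loss_rul, shared_params)
norm_health = grad_norm(loss_health, shared_params)

print(f"||grad RUL||    = {norm_rul:.4f}")
print(f"||grad health|| = {norm_health:.4f}")
print(f"ratio           = {norm_rul / norm_health:.1f}x")
print("p.grad untouched:", all(p.grad is None for p in shared_params))
```

The ratio printed here comes from random data and says nothing about C-MAPSS. The point is the pattern: one call per task loss over the same list of shared parameters, with the model's .grad attributes left as they were.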
Gradient Conflicts in Other ML Areas
| Domain | Source of imbalance | Common fix |
|---|---|---|
| RUL (this book) | Regression vs classification scale | GABA / inverse-gradient (this book) |
| Reinforcement learning | Policy vs value loss | Coefficient sweep, PPO clip |
| GANs | Generator vs discriminator | Spectral normalisation, learning-rate balancing |
| NLP fine-tuning | Pre-train objective vs downstream task | Learning-rate warm-up, layer freezing |
| Self-driving | Steering vs detection vs depth | Loss weighting search, MTL routing |
| Federated learning | Per-client gradient drift | FedProx, FedAvg with momentum |
| Generative VAE | Reconstruction vs KL term | Beta-annealing |
Every row is gradient combat. The mathematical machinery is the same; what differs is which tool the field has settled on.
Preview: How Each Method Addresses This
| Method | Section | What it does to gradients |
|---|---|---|
| Fixed equal weighting | Baseline (Section 24) | Ignores the imbalance - regression dominates |
| AMNL | Section 14 | Reshapes the regression LOSS (failure-biased) - same gradient profile, different shape |
| GABA | Sections 17-20 | ADAPTIVELY rescales lambda inversely to gradient norm - kills magnitude imbalance |
| GRACE | Sections 21-23 | GABA + AMNL - balanced gradients with safety-tilted loss shape |
| GradNorm | Section 24 | Auxiliary loss that drives lambda toward equal-rate convergence |
| PCGrad | Section 25 | Projects out the direction conflict - addresses the WRONG failure mode here |
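To make the contrast between the GABA and PCGrad rows concrete, the sketch below applies two operations to a toy pair of gradients. The inverse-norm weighting is a simplified stand-in for the idea in the GABA row, not the controller from Sections 17-20, and `inverse_norm_weights` is a hypothetical helper. `pcgrad_project` follows the published PCGrad rule: remove the component along the other gradient only when the dot product is negative. The 500x gap and the 60-degree angle are assumed values chosen for illustration.

```python
import numpy as np

def inverse_norm_weights(grads, eps=1e-12):
    """Weights proportional to 1/||g_k||, normalised to sum to 1 (illustrative only)."""
    inv = np.array([1.0 / (np.linalg.norm(g) + eps) for g in grads])
    return inv / inv.sum()

def pcgrad_project(g_i, g_j):
    """Project g_i onto the normal plane of g_j if the two conflict (dot < 0)."""
    dot = g_i @ g_j
    if dot < 0:
        return g_i - dot / (g_j @ g_j) * g_j
    return g_i

# Toy gradients: 500x magnitude gap, directions 60 degrees apart (no conflict).
g_rul = 500.0 * np.array([1.0, 0.0])
g_health = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])

# Inverse-norm weighting equalises the weighted norms.
w = inverse_norm_weights([g_rul, g_health])
weighted_norms = [w[0] * np.linalg.norm(g_rul), w[1] * np.linalg.norm(g_health)]

# The projection leaves both gradients unchanged because the dot product is positive.
p_rul = pcgrad_project(g_rul, g_health)
p_health = pcgrad_project(g_health, g_rul)
ratio_after_projection = np.linalg.norm(p_rul) / np.linalg.norm(p_health)

print("inverse-norm weights:", w)
print("weighted norms:", weighted_norms)
print("norm ratio after projection:", ratio_after_projection)
```

In this toy the weighting brings both tasks to the same effective magnitude, while the projection changes nothing and the 500x gap survives. That is the sense in which a direction-only method targets the wrong failure mode when the problem is magnitude.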
The throughline. Every method in Parts V-VIII is a different answer to the same question: how do we get the shared parameters to receive sensible gradients from both tasks simultaneously? The next chapter (Part II) returns to data; Parts V-VII implement the three answers we recommend.
Takeaway
- Gradient = sum of per-task contributions. $\nabla_\theta \mathcal{L} = \sum_k \lambda_k \nabla_\theta \mathcal{L}_k$ on shared parameters.
- Two failure modes. Magnitude imbalance (one task dominates) and direction conflict (two tasks cancel).
- On C-MAPSS, magnitude wins. The 500x ratio in Section 12 explains why magnitude-balancing methods (GABA) beat direction-surgery methods (PCGrad).
- The diagnostic is autograd. torch.autograd.grad(L_k, shared_params) per task, then a single L2 norm. Run it once per epoch and watch the ratio.
- The rest of the book is solutions to this problem. AMNL changes the loss shape; GABA balances magnitudes; GRACE does both.