Two Voices in a Choir
In a duet, the two voices are not interchangeable. A mezzo-soprano and a baritone have different ranges, different volumes, different tonal weights. A naive sound engineer who slides both faders to the same level produces something the mezzo dominates — or, with the wrong room, a baritone that buries her. The job of mixing is to WEIGHT the channels so the result sounds balanced.
Multi-task learning has the same problem. The RUL regression loss is on the scale of squared cycles (tens to hundreds); the health classification loss is a cross-entropy (roughly zero to two). Add them with equal weight and the regression term dominates the optimisation. Picking the weights well, whether statically, dynamically, or adaptively, is the central engineering problem of every paper in this corner of the literature.
The Weighted-Sum Formulation
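The dominant approach is a single scalar loss formed as a weighted sum of the K per-task losses, with one non-negative weight lambda_k per task:

$$\mathcal{L}_{\text{total}} = \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_k$$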
Three properties of this formulation matter. (1) Smooth and differentiable: gradients flow through the sum into every shared parameter. (2) One scalar to optimise: standard back-prop and optimisers work unchanged. (3) Pareto-aware: every choice of lambda selects a point on the Pareto frontier of the per-task losses, a tradeoff curve where reducing one task's loss requires increasing another's.
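To make the scale problem concrete, the sketch below forms the weighted sum for two tasks and reports what share of the total each task contributes. The loss values (150.0 for RUL, 1.1 for health) are hypothetical numbers picked only to match the scales described above, and the helper names are illustrative.

```python
def weighted_sum(losses, weights):
    """Return sum_k lambda_k * L_k for matching lists of losses and weights."""
    return sum(w * l for w, l in zip(weights, losses))


def contribution_shares(losses, weights):
    """Fraction of the combined loss contributed by each task."""
    total = weighted_sum(losses, weights)
    return [w * l / total for w, l in zip(weights, losses)]


# Hypothetical per-batch values, chosen only to illustrate the scale gap.
rul_mse = 150.0    # regression loss, squared cycles
health_ce = 1.1    # classification loss, cross-entropy

shares = contribution_shares([rul_mse, health_ce], [0.5, 0.5])
print(f"RUL share: {shares[0]:.3f}, health share: {shares[1]:.3f}")
```

With equal weights, the regression term accounts for over 99% of the scalar being minimised, which is what "dominates" means in practice.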
Interactive: The Pareto Frontier
The diagram below sketches a toy Pareto frontier between two synthetic losses. Slide lambda from 0 to 1 and watch the optimum move along the curve. Notice the curve is non-linear: at the extreme ends, gaining a tiny bit on the favoured task costs a lot on the other.
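A toy frontier of this kind can be reproduced in a few lines. The sketch below is an assumption-laden stand-in, not the book's diagram: it uses a single scalar parameter theta, two quadratic losses with minima at 0 and 1, and weights lambda and 1 - lambda, so the weighted optimum has a closed form.

```python
def toy_losses(theta, a=0.0, b=1.0):
    """Two synthetic quadratic losses with minima at a and b (hypothetical)."""
    return (theta - a) ** 2, (theta - b) ** 2


def weighted_optimum(lam, a=0.0, b=1.0):
    """Minimiser of lam*(theta-a)^2 + (1-lam)*(theta-b)^2.

    Setting the derivative to zero gives theta = lam*a + (1-lam)*b.
    """
    return lam * a + (1.0 - lam) * b


def sweep(num_points=11):
    """Trace the frontier by sweeping lambda from 0 to 1."""
    frontier = []
    for i in range(num_points):
        lam = i / (num_points - 1)
        loss_a, loss_b = toy_losses(weighted_optimum(lam))
        frontier.append((lam, loss_a, loss_b))
    return frontier


for lam, loss_a, loss_b in sweep():
    print(f"lambda={lam:.1f}  loss_A={loss_a:.3f}  loss_B={loss_b:.3f}")
```

In this toy, moving lambda from 0.9 to 1.0 lowers loss_A by only about 0.01 while raising loss_B by about 0.19, which is the steep end-of-curve behaviour described above.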
Real C-MAPSS frontiers look qualitatively similar but with much wider scale gaps; one of the reasons GABA (Section 17) outperforms a simple sweep is that it moves the operating point during training instead of committing to one upfront.
Five Ways to Choose the Weights
| Strategy | Where lambda comes from | Section |
|---|---|---|
| Fixed equal | Set 1/K and forget | Section 4.1 baseline |
| Fixed unequal | Hand-tuned per task | AMNL (Sections 14-16) |
| Inverse-magnitude | 1/||grad|| or 1/L per task | Heuristic; precursor to GABA |
| Uncertainty-weighted | Learnable per-task sigmas | Kendall et al. 2018 |
| Adaptive (GABA) | Inverse-gradient-norm with EMA | Sections 17-20 (the paper) |
The first two are static; the next three update during training. The static ones are easy to reason about and reproducible; the dynamic ones tend to outperform them because the right lambda can change over the course of training (early epochs benefit from one balance, late epochs from another).
Python: One Loss, Five Weighting Strategies
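The five rows of the table can be sketched as five small functions that each return a list of weights, plus one function that combines the losses. This is a minimal illustration under stated assumptions, not the book's implementation: the function names are hypothetical, the uncertainty variant uses a commonly seen simplification of Kendall et al. (weight exp(-s_k) with s_k a learnable log-variance, plus a sum-of-s_k penalty added to the loss), and the last function only mimics the "inverse-gradient-norm with EMA" idea rather than reproducing GABA itself.

```python
import math


def fixed_equal(num_tasks):
    """Strategy 1: set 1/K and forget."""
    return [1.0 / num_tasks] * num_tasks


def fixed_unequal(weights):
    """Strategy 2: hand-tuned per-task weights, passed through unchanged."""
    return list(weights)


def inverse_magnitude(values, eps=1e-8):
    """Strategy 3: weights proportional to 1/value, normalised to sum to 1.

    `values` can be per-task loss values or per-task gradient norms.
    """
    inv = [1.0 / (v + eps) for v in values]
    total = sum(inv)
    return [w / total for w in inv]


def uncertainty_weights(log_vars):
    """Strategy 4 (simplified): weight exp(-s_k) for log-variance s_k.

    In training, s_k is a learnable parameter and sum(s_k) is added to
    the loss so the weights cannot collapse to zero.
    """
    return [math.exp(-s) for s in log_vars]


def ema_inverse_grad_norm(prev_weights, grad_norms, beta=0.9):
    """Strategy 5 (sketch): EMA-smoothed inverse-gradient-norm weights."""
    target = inverse_magnitude(grad_norms)
    return [beta * p + (1.0 - beta) * t for p, t in zip(prev_weights, target)]


def combine(losses, weights):
    """Weighted sum of per-task losses."""
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical loss values: RUL regression and health classification.
losses = [120.0, 0.9]
print(combine(losses, fixed_equal(2)))
print(combine(losses, inverse_magnitude(losses)))
```

Note that inverse-magnitude weighting on the loss values makes each task contribute the same amount to the combined loss, which is the sense in which it "balances" the two tasks.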
PyTorch: Combining Losses in the Training Loop
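The loop below is a minimal sketch of the pattern: forward pass, two losses, one combined loss, backward, step. Everything specific in it is an assumption for illustration: the `TwoHeadNet` architecture, the layer sizes, the synthetic batch, and the fixed equal weights. Swapping in a different strategy means changing only how `lambda_rul` and `lambda_health` are set.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TwoHeadNet(nn.Module):
    """Hypothetical shared trunk with a regression head and a 3-class head."""

    def __init__(self, in_features=8, hidden=16, num_classes=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.rul_head = nn.Linear(hidden, 1)
        self.health_head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = self.shared(x)
        return self.rul_head(h).squeeze(-1), self.health_head(h)


model = TwoHeadNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

# Synthetic batch standing in for real data.
x = torch.randn(32, 8)
rul_target = torch.rand(32) * 100.0          # RUL in cycles
health_target = torch.randint(0, 3, (32,))   # class indices 0..2

lambda_rul, lambda_health = 0.5, 0.5         # fixed equal weights

history = []
for step in range(5):
    optimizer.zero_grad()
    rul_pred, health_logits = model(x)
    loss_rul = mse(rul_pred, rul_target)
    loss_health = ce(health_logits, health_target)
    # One scalar: gradients from both tasks reach the shared trunk.
    loss = lambda_rul * loss_rul + lambda_health * loss_health
    loss.backward()
    optimizer.step()
    history.append((loss_rul.item(), loss_health.item(), loss.item()))

for rul_value, health_value, total in history:
    print(f"rul={rul_value:.2f}  health={health_value:.3f}  total={total:.2f}")
```

Logging the two per-task losses alongside the combined value, as `history` does here, makes it easy to see when one task is dominating the scalar being optimised.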
The Same Tradeoff in Other Domains
| Domain | Task A | Task B | Weighting strategy |
|---|---|---|---|
| RUL (this book) | Regression (cycles) | Classification (3 classes) | Adaptive (GABA / GRACE) |
| Self-driving | Steering angle | Object detection | Tesla HydraNet (per-task heads, learned losses) |
| Multi-modal LLM | Text autoregression | Image-caption matching | Static lambda + warm-up |
| Speech | Phoneme recognition | Speaker classification | Uncertainty-weighted |
| Object detection | Box regression (smooth-L1) | Class probability (CE) | Heuristic 1.0 / 1.0 |
| Generative models | Reconstruction (MSE) | KL divergence (latent prior) | Beta-VAE annealing |
| Reinforcement learning | Policy gradient | Value function (MSE) | Coefficient sweep |
Every row has its own body of literature on how to set the weights. The patterns transfer; what changes is the magnitude of each loss and the criticality of each task.
Two Pitfalls
What the rest of the book is about: choosing lambda well. Section 4.4 examines the gradients themselves; Sections 14, 17, and 21 present three concrete weighting strategies (failure-biased weights, GABA, and GRACE) and benchmark them.
Takeaway
- The combined loss is $\mathcal{L}_{\text{total}} = \sum_{k} \lambda_k \mathcal{L}_k$. Smooth, differentiable, single-scalar. Standard optimisers work.
- Equal lambda is rarely balanced. Per-task losses live on different scales; equal lambda silently lets the larger-scale loss dominate.
- Five weighting strategies, one design choice. Fixed equal, fixed unequal, inverse-magnitude, uncertainty, adaptive. The book's contribution lives in the last bucket.
- The training loop pattern is universal. Forward, two losses, one combined loss, backward, step. lambda is the only knob you change between AMNL, GABA, GRACE.