The Dosing Problem
A clinician is treating a patient with two simultaneous drugs. Drug A is 500× more potent than Drug B on a per-milligram basis. If she gives 50 mg of each, Drug A overwhelms the body and Drug B is pharmacologically invisible. The fair prescription is not equal mass — it is equal physiological effect. To get there she scales each dose inversely by potency: very little Drug A, much more Drug B.
Multi-task learning has the same shape. The two losses are the drugs; the shared backbone is the patient. Their ‘potency’ on the backbone is the gradient norm . The effective dose — what the parameters actually feel — is . Equal loss weights guarantee that the high-potency task dominates every step. Equal effects require inverse weighting. That is the entire principle behind GABA.
The Effective-Contribution Principle
Recall the gradient-descent update on the shared backbone parameters for a multi-task loss :
Each task contributes a vector to the update. Its magnitude is where . We name this scalar the effective contribution of task :
It is the unit-free measurement of how hard task is pulling on the backbone. Every weighting scheme defines a vector ; only one scheme makes the entries equal.
| Quantity | Symbol | What it measures |
|---|---|---|
| Per-task loss | L_i | How wrong task i currently is |
| Per-task gradient norm | ||g_i|| | How hard task i would pull if given full weight |
| Per-task weight | lambda_i | How much we let it pull (scaling knob) |
| Effective contribution | c_i = lambda_i ||g_i|| | How hard task i ACTUALLY pulls in this step — the only thing the optimiser sees |
Why Inverse-Proportional Is The Unique Solution
For two tasks, demand with . That is two equations in two unknowns. Substitute into the balance condition:
Solve for :
That is exactly the K=2 GABA formula. It is the unique solution: any other weighting violates either the simplex constraint or the equal-contribution constraint. The next section (§17.3) re-derives the same expression from a Lagrangian and shows the closed form survives small perturbations.
The intuition is symmetric. “Each task contributes equally” means — balance of permission. “Each task contributes equally to what is learned” means — balance of effect. They differ by the gradient ratio. With a 500× ratio they differ by 500×.
Plugging realistic FD002 numbers (paper : , ):
| Scheme | λ_rul | λ_health | c_rul = λ_rul·g_rul | c_health = λ_health·g_health | Imbalance c_max / c_min |
|---|---|---|---|---|---|
| Uniform (Fixed Baseline, paper §3.5) | 0.5000 | 0.5000 | 2.500000 | 0.005000 | 500.00x |
| Sqrt-inverse (softer rule) | 0.0428 | 0.9572 | 0.214035 | 0.009572 | 22.36x |
| GABA (inverse-proportional) | 0.0020 | 0.9980 | 0.009980 | 0.009980 | 1.00x |
Uniform preserves the 500× gradient imbalance in the contributions — doing nothing useful. Sqrt-inverse leaves a 22× residual. Only the exact inverse rule collapses the imbalance to 1.00×.
Interactive: Where The Curves Cross
Plot two functions of : the increasing line (in blue) and the decreasing line (in green). They cross at exactly one point, which is the GABA . Drag the red λ slider; only at the crossing point are the two contribution bars equal.
Try this. Set (move both sliders to 1.0). The two curves coincide and any balances them — GABA degenerates to vanilla 0.5/0.5 when nothing needs fixing. Now set the ratio to 500× (the paper default). The crossing point snaps far to the left, and only the bottom GABA card has balanced ✓.
Python: Three Schemes, Side By Side
Implement uniform, sqrt-inverse, and GABA in pure NumPy and print their effective contributions on the realistic 500× example. The point of the table is not to show the λ values — it is to show that only GABA equalises .
PyTorch: Verification On Real Autograd Gradients
The hand-picked NumPy demo is an existence proof. The empirical demo is a real autograd run on a tiny shared backbone. We compute and with torch.autograd.grad, apply the K=2 GABA closed form, and assert to . The assertion passes on every seed, every batch, every backbone — because the equality is algebraic, not approximate.
The Same Principle In Other Domains
“Equalise effective effect, not nominal weight” recurs whenever a single mechanism aggregates heterogeneous contributors:
| Domain | Mechanism | What ‘effective contribution’ is | Inverse-rule analogue |
|---|---|---|---|
| Pharmacology (the hook) | Combination drug therapy | dose × potency | Inverse-potency dosing |
| Portfolio risk (Markowitz) | Multi-asset allocation | weight × asset volatility | Risk parity (1/σ_i weights) |
| Federated learning (FedAvg) | Average client gradients | client weight × ||local update|| | Inverse-update-norm aggregation |
| Reinforcement learning (DQN with multi-reward) | Sum reward components | weight × reward gradient | Inverse-gradient reward shaping |
| Sound mixing (mastering) | Sum tracks at the bus | fader × stem loudness | LUFS-normalising fader |
| Climate-model ensembles (CMIP6) | Weight per model in multi-model mean | weight × model variance | Inverse-variance weighting |
| Object detection (Detectron2) | Sum bbox-regression + classification + objectness | weight × loss-component grad | GABA / Inv-grad weighting |
In each row, the ‘loud contributor’ is the one with the larger natural magnitude. Naive equal weighting lets it monopolise the mechanism; inverse-proportional weighting equalises effect. The mathematics is identical to K=2 GABA — just renamed for the domain.
Three Misconceptions About ‘Why It Works’
Takeaway
- Effective contribution is the right quantity to balance. is what the optimiser sees, not alone.
- Inverse-proportional is unique for K=2. Among all weightings on the simplex, only satisfies for every batch and every backbone.
- The identity is algebraic. . The PyTorch assertion passes to single-precision epsilon, on any seed.
- Sqrt / log / rank-based inverses do NOT balance. They reduce imbalance but leave 22× or more residual. Linear inverse is the only exact rule.
- The principle generalises. Risk parity, inverse-variance ensemble weighting, LUFS-normalised mixing, and federated-averaging all instantiate the same equal-effect identity under different names.
- This is why GABA fixes the safety failure. The 500× gradient gap is exactly what was burying the health signal; equalising unburies it, and the −55% NASA improvement on FD002 is the empirical witness.