A small restaurant kitchen runs an unremarkable rule: every cook spends equal time on every dish. Vegetable prep, sauces, plating — same minutes per cook per task. The rule is suboptimal in any individual moment (some dishes need more attention than others on any given evening), but it is bulletproof. No coordination overhead. No misprioritisation. No specialist becoming a single point of failure.
Fixed equal weighting is the same idea applied to a multi-task loss. Two tasks — RUL regression and health classification — with one knob set once at 0.5 each, then never touched. No controller, no learnable scalars, no per-batch correction. Surprisingly, this rule is both the simplest MTL strategy AND the empirical winner of the AMNL paper's exhaustive ablation on FD002 + FD004 — beating the heuristic 0.75/0.25 split, any unbalanced choice, and (on chapter 23 §1 evidence) keeping pace with adaptive controllers when the gradient ratio is moderate.
The headline. The complete fixed-weight loss is one line: $L = \lambda_{\mathrm{rul}} L_{\mathrm{rul}} + (1 - \lambda_{\mathrm{rul}}) L_{\mathrm{health}}$. At $\lambda_{\mathrm{rul}} = 0.5$ the paper's multi-condition C-MAPSS results are FD002 RMSE = 6.33 and FD004 RMSE = 7.30, beating both 0.75/0.25 (V7) and 0.9/0.1 by clear margins.
The Equation In One Line
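For two task losses $L_{\mathrm{rul}}$ and $L_{\mathrm{health}}$, the fixed-weight composition is
$$L = \lambda_{\mathrm{rul}} L_{\mathrm{rul}} + \lambda_{\mathrm{health}} L_{\mathrm{health}}, \qquad \lambda_{\mathrm{health}} = 1 - \lambda_{\mathrm{rul}} \ \text{by convention.}$$
Linearity. The gradient is $\nabla L = \lambda_{\mathrm{rul}} \nabla L_{\mathrm{rul}} + \lambda_{\mathrm{health}} \nabla L_{\mathrm{health}}$. It is a weighted sum of the per-task gradients, with no interaction terms and no per-batch state.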
Statelessness. The two weights are PYTHON FLOATS, not learnable parameters. They never appear in the optimiser's parameter list and never receive a gradient. Compare with Uncertainty (chapter 24 §2) which makes them learnable.
Sum-to-one is not enforced. The paper's FixedWeightLoss stores both weights independently; the user is responsible for the convex-combination convention. You CAN set 0.7/0.7 and it will still run (with an uninterpretable scale).
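The linearity property can be checked numerically. The sketch below is only an illustration: the two toy quadratic losses `l_rul` and `l_health` and the helper `grad` are hypothetical stand-ins, not the paper's models, and the gradient is estimated with central finite differences rather than autograd.

```python
def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += eps
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g


# Hypothetical toy losses sharing the same two parameters
def l_rul(theta):
    return (3.0 * theta[0] - 2.0) ** 2 + theta[1] ** 2


def l_health(theta):
    return (theta[0] + theta[1] - 1.0) ** 2


LAM_RUL, LAM_HEALTH = 0.5, 0.5  # plain floats, never updated


def l_total(theta):
    return LAM_RUL * l_rul(theta) + LAM_HEALTH * l_health(theta)


theta = [0.3, -0.7]
g_total = grad(l_total, theta)
# Weighted sum of the per-task gradients
g_combined = [LAM_RUL * a + LAM_HEALTH * b
              for a, b in zip(grad(l_rul, theta), grad(l_health, theta))]

print("grad of combined loss :", g_total)
print("weighted sum of grads :", g_combined)
```

The two printed vectors should agree up to floating-point noise, which is all "no interaction terms" means in practice.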
Why 0.5 / 0.5 — A Counter-Intuitive Finding
The natural prior is $\lambda_{\mathrm{rul}} \approx 1$: RUL regression is the primary task, so why give half the gradient budget to a 3-class auxiliary classifier? The AMNL paper's exhaustive ablation answers: because health classification REGULARISES the shared backbone toward condition-invariant features, and that regularisation buys more accuracy than the lost RUL focus.
| Configuration | FD002 RMSE | FD004 RMSE | Avg | Reading |
|---|---|---|---|---|
| AMNL 0.5/0.5 | 6.33 | 7.30 | 6.82 | ★ best: equal weight regularises |
| AMNL 0.6/0.4 | 6.88 | 7.78 | 7.33 | Slight RUL bias hurts |
| V7 baseline 0.75/0.25 | 7.24 | 8.14 | 7.69 | Heuristic, sub-optimal |
| AMNL 0.9/0.1 | 8.59 | 7.71 | 8.15 | Heavy RUL emphasis ≈ near-single-task |
| Single-task (RUL only) | 26.92 | 28.27 | 27.60 | Catastrophic: no regularisation |
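The four multi-task rows show the average RMSE rising steadily as $\lambda_{\mathrm{rul}}$ moves from 0.5 toward 0.9, and the single-task row shows the collapse at $\lambda_{\mathrm{rul}} = 1$. This is the right-hand arm of a U-shape with its minimum at 0.5: too little health weight loses regularisation; too much loses RUL accuracy (the table itself has no row below 0.5). 0.5/0.5 is the best configuration on both multi-condition subsets, so the rule is a real empirical optimum on this benchmark, not a coincidence.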
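The Avg column and the ranking can be recomputed directly from the table. This is a small bookkeeping sketch using only the numbers tabulated above; the variable names are illustrative.

```python
# (FD002 RMSE, FD004 RMSE) per configuration, copied from the ablation table
ABLATION = {
    "AMNL 0.5/0.5": (6.33, 7.30),
    "AMNL 0.6/0.4": (6.88, 7.78),
    "V7 baseline 0.75/0.25": (7.24, 8.14),
    "AMNL 0.9/0.1": (8.59, 7.71),
    "Single-task (RUL only)": (26.92, 28.27),
}

# Unweighted mean over the two multi-condition subsets
averages = {name: (fd002 + fd004) / 2 for name, (fd002, fd004) in ABLATION.items()}

# Configurations sorted from best (lowest average RMSE) to worst
ranking = sorted(averages, key=averages.get)

for name in ranking:
    print(f"{name:<24}{averages[name]:>8.3f}")
```

Note that the ranking by average hides one detail visible in the table: on FD004 alone, 0.9/0.1 (7.71) scores better than 0.75/0.25 (8.14).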
Why MTL with equal weights regularises so well. The shared backbone learns features that BOTH tasks can use. Down-weighting health to 0.1 means the backbone has to host both an RUL-specialist representation AND a marginal health-specialist representation; the available capacity is split inefficiently. Equal weighting forces the backbone to find features useful for both, which generalises better. Same logic that makes self-supervised auxiliary tasks improve downstream representations in language and vision models.
Interactive: λ_rul Sensitivity Sweep
Drag the slider to interpolate between the four AMNL ablation anchor points. The two curves are FD002 (blue) and FD004 (amber) RMSE. Both bottom out near $\lambda_{\mathrm{rul}} = 0.5$ and rise steeply as the weight approaches 1.0 (single-task).
Try the slider. Push λ_rul above 0.85 and the log-scale RMSE shoots up sharply — that is the regularisation cliff. Auxiliary task weight matters more than intuition suggests, because the backbone is shared. Below ~0.45 RMSE rises again as the backbone is dragged toward a health-specialist regime.
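For readers without the interactive widget, the interpolation between anchor points can be approximated offline. The sketch below assumes plain piecewise-linear interpolation between the four tabulated anchors and covers only the range 0.50 to 0.90; how the explorer itself interpolates (and how it extends toward 1.0) is not stated, so treat `interpolate_rmse` as a hypothetical stand-in.

```python
from bisect import bisect_right

# (lambda_rul, FD002 RMSE, FD004 RMSE) anchors from the ablation table
ANCHORS = [
    (0.50, 6.33, 7.30),
    (0.60, 6.88, 7.78),
    (0.75, 7.24, 8.14),
    (0.90, 8.59, 7.71),
]


def interpolate_rmse(lam_rul):
    """Piecewise-linear (FD002, FD004) RMSE estimate between tabulated anchors."""
    lams = [a[0] for a in ANCHORS]
    if not lams[0] <= lam_rul <= lams[-1]:
        raise ValueError("lam_rul is outside the tabulated range [0.50, 0.90]")
    # Right-hand anchor of the enclosing segment, clamped to the last segment
    hi = min(bisect_right(lams, lam_rul), len(lams) - 1)
    lo = hi - 1
    t = (lam_rul - lams[lo]) / (lams[hi] - lams[lo])
    fd002 = ANCHORS[lo][1] + t * (ANCHORS[hi][1] - ANCHORS[lo][1])
    fd004 = ANCHORS[lo][2] + t * (ANCHORS[hi][2] - ANCHORS[lo][2])
    return fd002, fd004


for lam in (0.50, 0.55, 0.70, 0.90):
    fd002, fd004 = interpolate_rmse(lam)
    print(f"lam_rul={lam:.2f}  FD002={fd002:.3f}  FD004={fd004:.3f}")
```

Values between anchors are estimates, not measurements; only the four anchor rows come from the paper's ablation.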
AMNL-Fixed: Magnitude-Normalised 0.5 / 0.5
The paper's repo includes a second fixed-weight variant, AMNLFixedLoss (baselines.py:55). It also fixes the weights at 0.5/0.5 but first normalises each per-task loss by its own magnitude, with floor scales $s_{\mathrm{rul}} = 1.0$ and $s_{\mathrm{health}} = 0.1$ to prevent division by tiny early-training values.
The intent is to scale the two losses to comparable magnitudes BEFORE combining, so the equal-weight rule is meaningful. In practice the unnormalised FixedWeightLoss is what the paper reports as ‘Baseline’ and ‘AMNL fixed (with WMSE)’ in chapter 23 §1; the AMNL-fixed variant is included as an ablation but not the main result.
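One way to read "normalise each loss by its own magnitude, with a floor" is sketched below on plain floats. This is an assumption about the form, not the paper's AMNLFixedLoss: the function name `amnl_fixed_sketch`, the use of `max(|loss|, floor)` as the divisor, and the treatment of the divisor as a constant scale are all illustrative choices.

```python
def amnl_fixed_sketch(rul_loss, health_loss,
                      rul_floor=1.0, health_floor=0.1,
                      rul_weight=0.5, health_weight=0.5):
    """Equal-weight sum of magnitude-normalised losses (hypothetical form)."""
    # Divide each loss by its own magnitude, but never by less than the floor
    rul_scale = max(abs(rul_loss), rul_floor)
    health_scale = max(abs(health_loss), health_floor)
    return (rul_weight * rul_loss / rul_scale
            + health_weight * health_loss / health_scale)


# Both losses above their floors: each normalised term is 1, total is 1.0
print(amnl_fixed_sketch(25.75, 0.90))

# Both losses below their floors: the floors take over as divisors
print(amnl_fixed_sketch(0.40, 0.02))
```

In an autograd framework the divisor would presumably need to be excluded from the gradient graph, otherwise each normalised term is constant and carries no gradient; whether and how the paper's class does this is not shown here.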
Python: FixedWeightLoss From Scratch
Evaluate the four ablation weightings with one function. No PyTorch needed, just NumPy and a softmax CE helper. The for-loop at the bottom prints the L_total values for the four configurations at the same input losses.
Fixed weighted MTL loss (fixed_weight_demo.py)

Line 1: docstring
Names the contract: the simplest MTL loss is a fixed weighted sum of the per-task losses. No state, no adaptation. The choice of λ_rul is the entire algorithm. Chapter 23 §1 framed this as the OUTER axis being ‘fixed’ (chapter 21 §1 cell A or B).

Line 11: `import numpy as np`
Numerical-array library. Used for the per-task loss computations and the softmax + cross-entropy helper.

Line 14: `def softmax_ce(logits, target):`
Standalone 3-class cross-entropy on (B, 3) logits with integer (B,) targets. Same algebra as F.cross_entropy in PyTorch.
EXECUTION STATE
⬇ input: logits (B, 3) = Pre-softmax scores per sample. Each row is unnormalised log-probabilities for {healthy, degrading, critical}.
⬇ input: target (B,) = Integer class indices in {0, 1, 2}. dtype=int64 in production.
⬆ returns = Float: mean cross-entropy across the batch.

Line 16: `z = logits - logits.max(axis=-1, keepdims=True)`
Log-sum-exp trick: subtract the per-row max for numerical stability before exp(). Without this, softmax with large logits would overflow.
EXECUTION STATE
📚 .max(axis=-1, keepdims=True) = Per-row max, kept as a (B, 1) column for broadcasting against (B, 3).

Line 17: `p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)`
Two-statement softmax. First statement: exp; second: normalise per row to sum to 1.
EXECUTION STATE
📚 /= = In-place division. p /= s reuses the buffer of p, faster than p = p / s.

Line 18: `return -np.log(p[np.arange(len(target)), target] + 1e-12).mean()`
Pull the predicted probability of the TRUE class for each sample, take -log, average. The 1e-12 floor prevents log(0) when the model is very confident on the wrong class.
EXECUTION STATE
📚 fancy indexing = p[np.arange(B), target] returns p[i, target[i]] for each row: the predicted probability of the true class.
→ why -log? = Cross-entropy = -log(p_correct). Lower is better; a perfectly correct prediction gives 0.

Line 26: `return lam_rul * rul_loss + (1.0 - lam_rul) * health_loss`
Convex combination. lam_rul=1 → pure RUL (single-task). lam_rul=0 → pure health (degenerate). lam_rul=0.5 → equal weight.
EXECUTION STATE
→ why convex? = Weights sum to 1 by construction. For lam_rul in [0, 1] the composed loss is bounded between min(L_rul, L_health) and max(L_rul, L_health), so there is no scale blow-up.
→ contrast with GABA = GABA's λ are ALSO sum-to-one (post-renormalisation), but they UPDATE per-batch from gradient measurements. Fixed weights never change.

Line 30: `y_rul = np.array([20., 80., 5., 100.])`
Four ground-truth RUL values, one per engine.
EXECUTION STATE
y_rul (4,) = [20., 80., 5., 100.]

Line 31: `y_pred = np.array([25., 75., 12., 98.])`
Four predictions. Residuals: [+5, −5, +7, −2], a slight late bias (mean +1.25 cycles).
EXECUTION STATE
y_pred (4,) = [25., 75., 12., 98.]
residuals = [+5, −5, +7, −2]

Line 32: `hp_target = np.array([2, 1, 2, 0])`
Health labels: critical / degrading / critical / healthy.

Lines 51-53: the `for lam, name in configs:` loop
→ reading = Larger λ_rul → larger L_total because L_rul (25.75) >> L_health (0.90), so 0.5/0.5 has the SMALLEST L_total of the four. That is arithmetic on fixed inputs, not evidence about model quality. The AMNL paper's ablation compares VALIDATION RMSE, which is also best at λ_rul=0.5. The two are NOT the same thing.
```python
"""Fixed equal weighting: the simplest MTL loss.

L = lam_rul * L_rul + (1 - lam_rul) * L_health.

The two task losses are NOT on the same scale: MSE is in cycles² and
cross-entropy is in nats (natural log). At lam_rul = 0.5 the two losses
get equal weight, not equal gradient contribution; the larger-magnitude
loss dominates. The OUTER axis is held fixed, no controller.
"""

import numpy as np


def softmax_ce(logits, target):
    """3-class cross-entropy on logits (B, 3) with int target (B,)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
    return -np.log(p[np.arange(len(target)), target] + 1e-12).mean()


def fixed_weight_loss(rul_loss, health_loss, lam_rul=0.5):
    """L = lam_rul * L_rul + (1 - lam_rul) * L_health.

    The whole MTL composition in one line. No state, no controller.
    """
    return lam_rul * rul_loss + (1.0 - lam_rul) * health_loss


# ---------- Tiny mini-batch (4 samples) ----------
y_rul = np.array([20., 80., 5., 100.])
y_pred = np.array([25., 75., 12., 98.])
hp_target = np.array([2, 1, 2, 0])
hp_logits = np.array([[0.1, 0.2, 0.7],
                      [0.5, 0.3, 0.2],
                      [0.4, 0.5, 0.6],
                      [0.8, 0.1, 0.1]])

L_rul = ((y_pred - y_rul) ** 2).mean()
L_health = softmax_ce(hp_logits, hp_target)


# ---------- Compare four fixed weightings ----------
configs = [(0.50, "AMNL 0.5/0.5"),
           (0.60, "AMNL 0.6/0.4"),
           (0.75, "V7 baseline 0.75/0.25"),
           (0.90, "AMNL 0.9/0.1 (heavy RUL)")]

print(f"L_rul = {L_rul:.4f} L_health = {L_health:.4f}")
print(f"{'config':<24}{'λ_rul':>6}{'λ_h':>6}{'L_total':>10}")
print("-" * 55)
for lam, name in configs:
    L = fixed_weight_loss(L_rul, L_health, lam_rul=lam)
    print(f"{name:<24}{lam:>6.2f}{1-lam:>6.2f}{L:>10.4f}")
```
The L_total numbers are misleading. 0.5/0.5 gives the LOWEST L_total here, but only because L_rul is numerically larger and gets weighted down; the input losses are identical across the four rows, so this says nothing about which model trains better. The evidence for 0.5/0.5 is the validation RMSE in the ablation table, where it is also lowest. That result is the regularisation effect: equal weights produce a backbone that GENERALISES better despite a less aggressive training objective on the primary task.
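The arithmetic behind the four L_total values can be written out by hand. The worked block below uses the mini-batch from the demo and the rounded value $L_{\mathrm{health}} \approx 0.90$ quoted in the walkthrough, so the last digits may differ slightly from what the script prints at four decimals.

```latex
\begin{aligned}
L_{\mathrm{rul}} &= \tfrac{1}{4}\left(5^2 + (-5)^2 + 7^2 + (-2)^2\right) = \tfrac{103}{4} = 25.75 \\
L_{\mathrm{total}}(0.50) &= 0.50 \cdot 25.75 + 0.50 \cdot 0.90 = 13.325 \\
L_{\mathrm{total}}(0.60) &= 0.60 \cdot 25.75 + 0.40 \cdot 0.90 = 15.810 \\
L_{\mathrm{total}}(0.75) &= 0.75 \cdot 25.75 + 0.25 \cdot 0.90 = 19.5375 \\
L_{\mathrm{total}}(0.90) &= 0.90 \cdot 25.75 + 0.10 \cdot 0.90 = 23.265
\end{aligned}
```

Every row uses the same two input losses; only the weights change, which is why the ordering of these totals carries no information about validation performance.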
PyTorch: The Paper's FixedWeightLoss
Verbatim from grace/core/baselines.py:34-49. 16 lines of class definition; 3 method implementations; no learnable parameters; no internal state. The simplest entry in the loss registry — and the baseline every other method in this chapter compares against.
Production FixedWeightLoss class (baselines.py:FixedWeightLoss)

Line 1: docstring
States the contract: this is the verbatim production class. It carries the same loss-registry name ‘fixed_050’ that Cell A of the 2×2 grid in chapter 21 §1 uses.

Line 9: `import torch`
PyTorch core. We only need tensors and autograd.

Line 10: `import torch.nn as nn`
nn.Module base class. Inheriting it gives the MTL loss class access to .to(device), state_dict() serialisation, and the standard Module call protocol.

Line 11: `from typing import Dict`
Type hint for the get_weights return type.

Line 14: `class BaseMTLLoss(nn.Module):`
Abstract base. Every MTL loss in the paper repo inherits from this class and implements three methods: forward, get_weights, get_name.
EXECUTION STATE
→ why a base class? = Uniform interface lets UnifiedTrainer (chapter 22 §1) call `mtl_loss(rul_loss, health_loss)` regardless of which controller is plugged in. Polymorphism over MTL strategies.

Line 16: `def forward(self, rul_loss, health_loss, **kwargs): ...`
Abstract method. The `...` placeholder marks it as something subclasses are expected to override; nothing enforces this, because the class does not use abc.abstractmethod. **kwargs accepts shared_params and other GABA-specific arguments without breaking the interface.
EXECUTION STATE
📚 **kwargs = Catches any extra keyword arguments. Lets the same call signature work for FixedWeightLoss (no extra args) and GABALoss (needs shared_params).

Line 17: `def get_weights(self) -> Dict[str, float]: ...`
Abstract. Returns the current per-task weights so the gradient logger can record them. For fixed weights, always returns the constants set in __init__.

Line 18: `def get_name(self) -> str: ...`
Abstract. Returns a human-readable name for logging (`Fixed(0.50/0.50)`, `GABA(beta=0.99)`, etc.).

Line 21: `class FixedWeightLoss(BaseMTLLoss):`
Concrete subclass. The simplest production loss: it wraps the fixed weighted sum in the BaseMTLLoss interface.

Line 22: `"""Static task weighting."""`
One-line description.

Line 24: `def __init__(self, rul_weight: float = 0.5,`
Constructor. Defaults to 0.5/0.5 (the AMNL recommendation).
EXECUTION STATE
⬇ rul_weight = 0.5 = RUL task weight. The paper registers two convenience aliases: `fixed_050` (this default) and `fixed_075` (the V7 baseline).

Line 25: `health_weight: float = 0.5) -> None:`
Health task weight. Default 0.5. The two weights are NOT auto-normalised; the user is responsible for their sum.
EXECUTION STATE
⬇ health_weight = 0.5 = Health task weight.
→ return type None = Constructor side-effects only; no return value.

Line 26: `super().__init__()`
Call nn.Module's __init__. Required so PyTorch can register parameters, buffers, and submodules. Skipping this breaks .to(device) and state_dict.

Line 27: `self.rul_weight = rul_weight`
Store as instance attribute. NOT registered as nn.Parameter: fixed weights are PYTHON FLOATS, not tensors. They never receive gradients.
EXECUTION STATE
→ contrast with Uncertainty = UncertaintyLoss stores log_var_rul as nn.Parameter, so it is LEARNABLE. FixedWeightLoss stores a Python float, so it is constant.

Line 31: `return self.rul_weight * rul_loss + self.health_weight * health_loss`
One line. Python multiplies a float by a tensor (autograd preserved) and adds two tensors (autograd preserved). Result keeps the gradient graph intact for .backward().
EXECUTION STATE
→ autograd flow = dL/dθ = w_rul * dL_rul/dθ + w_health * dL_health/dθ. Both task losses contribute to every shared backbone parameter.

Line 33: `def get_weights(self):`
Inspection helper. UnifiedTrainer's gradient_logger calls this every epoch to log the current weights.

Line 34: `return {"rul_weight": self.rul_weight,`
Return as a dict for JSON-serialisable logging.

Line 35: `"health_weight": self.health_weight}`
Two keys. Same dict structure as GABA.get_weights, so the trainer's logger doesn't branch on loss type.

Line 38: `return f"Fixed({self.rul_weight:.2f}/{self.health_weight:.2f})"`
f-string with format spec. `Fixed(0.50/0.50)` for the default.

Line 42: `if __name__ == "__main__":`
Standard guard. The block runs only when the file is executed directly.

Line 43: `# Equivalent to get_loss("fixed_050") from the loss_registry.`
Comment marker. The paper's loss_registry.py registers `fixed_050 = lambda **kw: FixedWeightLoss(0.5, 0.5)`, so calling get_loss('fixed_050') from anywhere in the codebase produces an instance equivalent to the explicit construction below.

Line 50: `L = loss(L_rul, L_health)`
Call the loss module. Equivalent to `loss.forward(L_rul, L_health)`. The autograd graph: L = 0.5·L_rul + 0.5·L_health.
EXECUTION STATE
→ why call the module? = nn.Module.__call__ runs forward() AND any registered hooks. Direct .forward() bypasses hooks. Always use `module(x)`, not `module.forward(x)`.

Line 51: `print(f"L_total = {L.item():.4f}")`
Convert the 0-d tensor to a Python float and print. .item() requires a tensor with exactly one element.

Line 52: `print(f"weights = {loss.get_weights()}")`
Inspect the per-task weights.

Line 53: `print(f"name = {loss.get_name()}")`
Inspect the loss name.

Line 54: `L.backward()`
Run autograd backward. Populates L_rul.grad and L_health.grad with dL/dL_rul and dL/dL_health respectively.
EXECUTION STATE
📚 .backward() = Tensor method. Default: pretend the gradient of L w.r.t. itself is 1.0, and propagate backwards through the autograd graph.

Line 55: `print(f"dL/dL_rul = {L_rul.grad.item():.4f}")`
Should print 0.5, the partial derivative dL/dL_rul = rul_weight = 0.5.
EXECUTION STATE
→ math = L = 0.5·L_rul + 0.5·L_health → dL/dL_rul = 0.5, dL/dL_health = 0.5. Verifies the loss does what it claims.
```python
"""Paper's FixedWeightLoss class, verbatim.

Source: paper_ieee_tii/grace/core/baselines.py:34-49.
The simplest entry in the loss registry. Wraps the fixed weighted sum
in the same BaseMTLLoss interface as GABA, GradNorm, Uncertainty, etc.,
so it plugs into UnifiedTrainer without special-casing.
"""

import torch
import torch.nn as nn
from typing import Dict


class BaseMTLLoss(nn.Module):
    """Abstract base. forward() returns a scalar combined loss."""
    def forward(self, rul_loss, health_loss, **kwargs): ...
    def get_weights(self) -> Dict[str, float]: ...
    def get_name(self) -> str: ...


class FixedWeightLoss(BaseMTLLoss):
    """Static task weighting."""

    def __init__(self, rul_weight: float = 0.5,
                 health_weight: float = 0.5) -> None:
        super().__init__()
        self.rul_weight = rul_weight
        self.health_weight = health_weight

    def forward(self, rul_loss, health_loss, **kwargs):
        return self.rul_weight * rul_loss + self.health_weight * health_loss

    def get_weights(self):
        return {"rul_weight": self.rul_weight,
                "health_weight": self.health_weight}

    def get_name(self):
        return f"Fixed({self.rul_weight:.2f}/{self.health_weight:.2f})"


# ---------- Drop-in usage in the trainer ----------
if __name__ == "__main__":
    # Equivalent to get_loss("fixed_050") from the loss_registry.
    loss = FixedWeightLoss(rul_weight=0.5, health_weight=0.5)

    # Synthetic per-task losses
    L_rul = torch.tensor(25.75, requires_grad=True)
    L_health = torch.tensor(0.90, requires_grad=True)

    L = loss(L_rul, L_health)
    print(f"L_total = {L.item():.4f}")
    print(f"weights = {loss.get_weights()}")
    print(f"name = {loss.get_name()}")
    L.backward()
    print(f"dL/dL_rul = {L_rul.grad.item():.4f}")
    print(f"dL/dL_health = {L_health.grad.item():.4f}")
```
The weights are floats, not Parameters. `self.rul_weight = 0.5` stores a Python float. If you wrote `self.rul_weight = nn.Parameter(torch.tensor(0.5))` instead, the optimiser would learn the weight (provided the loss module's parameters are handed to it), turning FixedWeightLoss into a (degenerate) UncertaintyLoss without the regularisation term. Chapter 24 §2 walks through the difference; subtle bugs hide in this distinction.
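The float-versus-Parameter distinction can be observed directly. The following is a minimal sketch, assuming PyTorch is installed; `FloatWeight` and `ParamWeight` are hypothetical toy modules written for this illustration, not classes from the paper's repo.

```python
import torch
import torch.nn as nn


class FloatWeight(nn.Module):
    def __init__(self):
        super().__init__()
        self.rul_weight = 0.5  # plain Python float: not registered anywhere


class ParamWeight(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered as a trainable parameter: the optimiser can move it
        self.rul_weight = nn.Parameter(torch.tensor(0.5))


n_float = len(list(FloatWeight().parameters()))  # nothing to optimise
n_param = len(list(ParamWeight().parameters()))  # one learnable scalar

# One SGD step on the toy losses from the demo
m = ParamWeight()
opt = torch.optim.SGD(m.parameters(), lr=0.01)
L_rul, L_health = torch.tensor(25.75), torch.tensor(0.90)
loss = m.rul_weight * L_rul + (1 - m.rul_weight) * L_health
loss.backward()  # d loss / d weight = L_rul - L_health
opt.step()
new_weight = m.rul_weight.item()

print(n_float, n_param, new_weight)
```

With nothing to stop it, the learnable weight moves away from 0.5 after a single step, because shrinking the weight on the larger loss is the cheapest way to reduce the total. That is the degenerate behaviour the paragraph above warns about.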
Fixed Weights In Other Domains
| Domain | Primary task | Auxiliary task | Typical fixed split | Why this split |
|---|---|---|---|---|
| Object detection | Bounding-box regression | Class score (softmax) | 1.0 / 1.0 (no weighting) | Per-anchor losses already on similar scales after Focal Loss; no weighting helps. |
| Speech recognition | CTC (frame-level) | Attention decoder CE | 0.7 / 0.3 | CTC needs more weight because attention can drift without strong alignment supervision. |
| Image segmentation + classification | Pixel-wise Dice | Image-level CE | 0.8 / 0.2 | Pixel loss is N pixels × class loss in scale; the 0.8/0.2 ratio normalises. |
| Recommender CTR + dwell | Click-through binary CE | Dwell-time MSE | 0.5 / 0.5 (equal) | Dwell-time signal is a noisy regulariser for CTR; equal weighting works as in C-MAPSS. |
| Self-driving multi-task | Detection mAP | Lane segmentation, depth | 1.0 / 1.0 / 1.0 | All sub-tasks framed at comparable losses; gradient surgery (PCGrad) used instead of weighting. |
The lesson: fixed weights are a valid first stop, but the right ratio is domain-specific. Run the AMNL ablation (4–5 fixed splits) on a small validation set before committing. If the U-shape minimum is at 0.5, you have the AMNL pattern; if it's elsewhere, the loss scales differ materially and you may benefit from magnitude-normalisation (AMNL-fixed) or learnable weighting (Uncertainty).
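The recommended sweep is easy to wrap in a small harness. The sketch below is illustrative: `sweep_fixed_weights` is a hypothetical helper, and `lookup_fd002` stands in for a real train-and-validate run by returning the FD002 RMSE values from the ablation table.

```python
def sweep_fixed_weights(evaluate, lambdas=(0.50, 0.60, 0.75, 0.90)):
    """Evaluate each fixed split and return (best_lambda, results).

    evaluate: callable mapping lambda_rul -> validation RMSE (lower is better).
    """
    results = {lam: evaluate(lam) for lam in lambdas}
    best = min(results, key=results.get)
    return best, results


# Stand-in evaluator: table lookup instead of an actual training run
FD002_RMSE = {0.50: 6.33, 0.60: 6.88, 0.75: 7.24, 0.90: 8.59}


def lookup_fd002(lam_rul):
    return FD002_RMSE[lam_rul]


best, results = sweep_fixed_weights(lookup_fd002)
follows_amnl_pattern = (best == 0.50)

print(f"best lambda_rul = {best:.2f}, AMNL pattern: {follows_amnl_pattern}")
```

In real use, `evaluate` would train a model with the given split and return its validation RMSE; adding a point below 0.5 to `lambdas` would also test the left arm of the curve.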
Pitfalls With Fixed Weights
Pitfall 1: weights that don't sum to 1
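FixedWeightLoss does not enforce $\lambda_{\mathrm{rul}} + \lambda_{\mathrm{health}} = 1$. Setting (0.7, 0.7) scales the combined loss by 1.4 relative to (0.5, 0.5): same gradient direction, 1.4× the magnitude. Under plain SGD that is equivalent to a 1.4× learning rate, which can destabilise a run tuned close to its limit; scale-invariant optimisers such as Adam largely absorb the factor, but logged loss values stop being comparable across runs. Always verify the sum.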
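A guard at construction time is one way to catch this. The helper below is a hypothetical addition for illustration (`check_convex_weights` is not part of the paper's code); it simply rejects weight pairs that are negative or do not sum to one.

```python
import math


def check_convex_weights(rul_weight, health_weight, tol=1e-9):
    """Return the weights unchanged if they form a convex combination."""
    if rul_weight < 0 or health_weight < 0:
        raise ValueError("weights must be non-negative")
    total = rul_weight + health_weight
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")
    return rul_weight, health_weight


# The four ablation splits all pass
for w in (0.50, 0.60, 0.75, 0.90):
    check_convex_weights(w, 1.0 - w)

# The (0.7, 0.7) example from the pitfall is rejected
try:
    check_convex_weights(0.7, 0.7)
except ValueError as err:
    print("rejected:", err)
```

Calling it before constructing the loss, for example `FixedWeightLoss(*check_convex_weights(0.75, 0.25))`, keeps the class itself unchanged.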
Pitfall 2: assuming 0.5 / 0.5 transfers
The 0.5 / 0.5 result is C-MAPSS-specific. On datasets where the per-task losses have different magnitudes (e.g. raw MSE vs cross-entropy with very different ranges), 0.5/0.5 will be dominated by the larger-magnitude loss. Either rescale the losses to comparable magnitudes (AMNL-fixed pattern) or learn the weights (Uncertainty — chapter 24 §2).
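A quick diagnostic for this pitfall is to look at how much of the weighted total each task supplies. The sketch below uses the toy loss values from the NumPy demo; `weighted_shares` is a hypothetical helper, and loss-value share is only a rough proxy for gradient share, which also depends on how steep each loss is.

```python
def weighted_shares(rul_loss, health_loss, lam_rul=0.5):
    """Fraction of the weighted total contributed by each task."""
    rul_term = lam_rul * rul_loss
    health_term = (1.0 - lam_rul) * health_loss
    total = rul_term + health_term
    return rul_term / total, health_term / total


# Toy values from the demo: L_rul = 25.75, L_health ≈ 0.90
rul_share, health_share = weighted_shares(25.75, 0.90, lam_rul=0.5)
print(f"RUL share = {rul_share:.3f}, health share = {health_share:.3f}")
```

If one share is close to 1 at equal weights, the losses sit on very different scales and the rescaling or learnable-weight options mentioned above are worth testing.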
Pitfall 3: forgetting that 0.9/0.1 is nearly single-task
At $\lambda_{\mathrm{rul}} = 0.9$ the health task carries only 10% of the loss weight, so the backbone receives very little auxiliary supervision. RMSE on FD002 climbs from 6.33 to 8.59. Single-task ($\lambda_{\mathrm{rul}} = 1.0$) is far worse: 26.92. The continuum from MTL to single-task has a sharp cliff.
Pitfall 4: comparing fixed vs adaptive without controlling for inner loss
Fixed 0.5/0.5 with weighted MSE = AMNL. Fixed 0.5/0.5 with standard MSE = Baseline. GABA + standard MSE = adaptive version of Baseline. GABA + weighted MSE = GRACE. The four cells of the chapter 21 §1 grid. Confounding fixed-vs-adaptive with which inner loss is used produces apples-to-oranges comparisons.
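The grid can be encoded so that a comparison is checked before it is reported. This is a bookkeeping sketch: the cell labels follow the text above, with "Adaptive Baseline" as a shorthand chosen here for GABA + standard MSE, and `is_controlled` is a hypothetical helper.

```python
# (outer weighting, inner RUL loss) -> method name
GRID = {
    ("fixed", "standard_mse"): "Baseline",
    ("fixed", "weighted_mse"): "AMNL",
    ("gaba", "standard_mse"): "Adaptive Baseline",
    ("gaba", "weighted_mse"): "GRACE",
}

CELL_OF = {name: cell for cell, name in GRID.items()}


def is_controlled(name_a, name_b):
    """True when the two methods differ along exactly one axis of the grid."""
    cell_a, cell_b = CELL_OF[name_a], CELL_OF[name_b]
    return sum(x != y for x, y in zip(cell_a, cell_b)) == 1


print(is_controlled("Baseline", "Adaptive Baseline"))  # only the outer axis differs
print(is_controlled("AMNL", "Adaptive Baseline"))      # both axes differ: confounded
```

A fixed-versus-adaptive claim is only clean when the pair passes this check, that is, when the inner loss is held constant.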
Pitfall 5: starting fixed and never re-checking
A fixed 0.5/0.5 model trained successfully on one dataset version may break on a refresh of the data. The optimal split depends on the relative scales of the per-task losses, which change with data preprocessing. Re-run the 4-point sweep when your training data changes meaningfully.
Takeaway
Fixed weighting is the simplest MTL loss: $L = \lambda_{\mathrm{rul}} L_{\mathrm{rul}} + (1 - \lambda_{\mathrm{rul}}) L_{\mathrm{health}}$. No state, no controller, no learnable parameters.
On C-MAPSS multi-condition the AMNL ablation finds $\lambda_{\mathrm{rul}} = 0.5$ the optimal split: FD002 RMSE 6.33, FD004 RMSE 7.30, beating the intuitive 0.75/0.25 by 0.91 and 0.84 cycles.
The U-shape RMSE-vs-λ_rul curve has a regularisation cliff near λ_rul → 1 (RMSE jumps to 26.92 at single-task) — the auxiliary task is doing real work.
The production class is FixedWeightLoss at grace/core/baselines.py:34-49. Inherits BaseMTLLoss for uniform interface with GABA/Uncertainty/GradNorm/DWA. Weights stored as Python floats — explicitly NOT learnable.
Use fixed weights as a baseline. Run a 4–5-point ablation on a validation set; if the U-minimum is at 0.5, you have the AMNL pattern. If not, escalate to magnitude normalisation (AMNL-fixed), learnable weights (Uncertainty), or gradient-magnitude adaptation (GABA / GRACE).