Chapter 24

Fixed Equal Weighting (0.5 / 0.5)

Adaptive MTL Baselines

The Simplest Multi-Task Loss

A small restaurant kitchen runs an unremarkable rule: every cook spends equal time on every dish. Vegetable prep, sauces, plating — same minutes per cook per task. The rule is suboptimal in any individual moment (some dishes need more attention than others on any given evening), but it is bulletproof. No coordination overhead. No misprioritisation. No specialist becoming a single point of failure.

Fixed equal weighting is the same idea applied to a multi-task loss. Two tasks — RUL regression and health classification — with one knob set once at 0.5 each, then never touched. No controller, no learnable scalars, no per-batch correction. Surprisingly, this rule is both the simplest MTL strategy AND the empirical winner of the AMNL paper's exhaustive ablation on FD002 + FD004 — beating the heuristic 0.75/0.25 split, any unbalanced choice, and (on chapter 23 §1 evidence) keeping pace with adaptive controllers when the gradient ratio is moderate.

The headline. The complete fixed-weight loss is one line: $\mathcal{L} = \lambda_{\text{rul}} \, \mathcal{L}_{\text{rul}} + (1 - \lambda_{\text{rul}}) \, \mathcal{L}_{\text{health}}$. At $\lambda_{\text{rul}} = 0.5$ on the paper's C-MAPSS multi-condition mean: FD002 RMSE = 6.33, FD004 RMSE = 7.30, ahead of both 0.75/0.25 (V7) and 0.9/0.1 by clear margins.

The Equation In One Line

For two task losses $\mathcal{L}_{\text{rul}}$ and $\mathcal{L}_{\text{health}}$, the fixed-weight composition is

$$\mathcal{L}(\theta) = \lambda_{\text{rul}} \, \mathcal{L}_{\text{rul}}(\theta) + \lambda_{\text{health}} \, \mathcal{L}_{\text{health}}(\theta), \quad \lambda_{\text{rul}} + \lambda_{\text{health}} = 1.$$

Three properties to internalise:

  • Linearity. The gradient is $\nabla\mathcal{L} = \lambda_{\text{rul}}\nabla\mathcal{L}_{\text{rul}} + \lambda_{\text{health}}\nabla\mathcal{L}_{\text{health}}$: a weighted sum of the per-task gradients, with no interaction terms and no per-batch state.
  • Statelessness. The two weights are PYTHON FLOATS, not learnable parameters. They never appear in the optimiser's parameter list and never receive a gradient. Compare with Uncertainty (chapter 24 §2) which makes them learnable.
  • Sum-to-one is not enforced. The paper's FixedWeightLoss stores both weights independently; the user is responsible for the convex-combination convention. You CAN set 0.7/0.7 and it will still run (with an uninterpretable scale).
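The linearity property can be checked numerically. The following is a minimal sketch, assuming two toy quadratic losses in a single shared parameter; `loss_rul`, `loss_health` and `central_diff` are illustrative stand-ins introduced here, not the paper's losses or code.

```python
"""Check that the gradient of a fixed-weight loss equals the weighted
sum of the per-task gradients (toy example, one shared parameter)."""


def loss_rul(theta):
    # Hypothetical stand-in for the RUL loss: quadratic with minimum at 3.
    return (theta - 3.0) ** 2


def loss_health(theta):
    # Hypothetical stand-in for the health loss: quadratic with minimum at -1.
    return 0.5 * (theta + 1.0) ** 2


def combined(theta, lam_rul, lam_health):
    # Fixed-weight composition; the weights are plain floats.
    return lam_rul * loss_rul(theta) + lam_health * loss_health(theta)


def central_diff(f, theta, eps=1e-6):
    # Central finite-difference approximation of df/dtheta.
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)


theta = 0.7
lam_rul, lam_health = 0.5, 0.5

grad_combined = central_diff(lambda t: combined(t, lam_rul, lam_health), theta)
grad_weighted = (lam_rul * central_diff(loss_rul, theta)
                 + lam_health * central_diff(loss_health, theta))

print(f"grad of combined loss : {grad_combined:.6f}")
print(f"weighted sum of grads : {grad_weighted:.6f}")
```

Both printed values should agree up to finite-difference rounding, which is the linearity property in miniature: nothing about the combination depends on state carried between calls.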

Why 0.5 / 0.5 — A Counter-Intuitive Finding

The natural prior is $\lambda_{\text{rul}} \approx 1$: RUL regression is the primary task, so why give half the gradient budget to a 3-class auxiliary classifier? The AMNL paper's exhaustive ablation answers: because health classification REGULARISES the shared backbone toward condition-invariant features, and that regularisation buys more accuracy than the lost RUL focus.

| Configuration | FD002 RMSE | FD004 RMSE | Avg | Reading |
|---|---|---|---|---|
| AMNL 0.5/0.5 | 6.33 | 7.30 | 6.82 | ★ best: equal weight regularises |
| AMNL 0.6/0.4 | 6.88 | 7.78 | 7.33 | Slight RUL bias hurts |
| V7 baseline 0.75/0.25 | 7.24 | 8.14 | 7.69 | Heuristic, sub-optimal |
| AMNL 0.9/0.1 | 8.59 | 7.71 | 8.15 | Heavy RUL emphasis ≈ near-single-task |
| Single-task (RUL only) | 26.92 | 28.27 | 27.60 | Catastrophic: no regularisation |

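The four multi-task rows show average RMSE rising steadily as λ_rul moves from 0.5 toward 1 (6.82 → 7.33 → 7.69 → 8.15), and the single-task row shows what happens when the health weight reaches zero. FD002 is monotonic; FD004 dips at 0.9/0.1 (7.71 against 8.14 at 0.75/0.25) but still sits above its 0.5/0.5 value. The table has no rows with λ_rul below 0.5, so it establishes only the right-hand arm of a U-shape: too little health weight loses regularisation. The left-hand arm (too much health weight costing RUL accuracy) is the expected behaviour but is not shown in these rows. Across both multi-condition subsets, 0.5/0.5 is the best of the tested splits on this benchmark, not a coincidence of one subset.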
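To make the margins concrete, here is a small sketch that recomputes the averages and the gap to the best row from the table values. The numbers are copied from the ablation table; the variable names and the choice to treat single-task as λ_rul = 1.0 are assumptions of this sketch.

```python
# Ablation rows copied from the table: name -> (lam_rul, FD002 RMSE, FD004 RMSE).
rows = {
    "AMNL 0.5/0.5": (0.50, 6.33, 7.30),
    "AMNL 0.6/0.4": (0.60, 6.88, 7.78),
    "V7 baseline 0.75/0.25": (0.75, 7.24, 8.14),
    "AMNL 0.9/0.1": (0.90, 8.59, 7.71),
    "Single-task (RUL only)": (1.00, 26.92, 28.27),
}

# Row with the lowest average RMSE over the two subsets.
best = min(rows, key=lambda name: sum(rows[name][1:]) / 2)
ref_fd002, ref_fd004 = rows[best][1], rows[best][2]

summary = {}
for name, (lam, fd002, fd004) in rows.items():
    avg = (fd002 + fd004) / 2
    # Gap to the best row, per subset (positive means worse).
    summary[name] = (avg, fd002 - ref_fd002, fd004 - ref_fd004)
    print(f"{name:<24} lam_rul={lam:.2f} avg={avg:6.3f} "
          f"dFD002={fd002 - ref_fd002:+.2f} dFD004={fd004 - ref_fd004:+.2f}")

print("lowest average RMSE:", best)
```

The per-subset gaps make it easy to see that the FD004 column is not strictly monotonic even though the average is.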

Why MTL with equal weights regularises so well. The shared backbone learns features that BOTH tasks can use. Down-weighting health to 0.1 means the backbone has to host both an RUL-specialist representation AND a marginal health-specialist representation; the available capacity is split inefficiently. Equal weighting forces the backbone to find features useful for both, which generalises better. Same logic that makes self-supervised auxiliary tasks improve downstream representations in language and vision models.

Interactive: λ_rul Sensitivity Sweep

Drag the slider to interpolate between the four AMNL ablation anchor points. The two curves are FD002 (blue) and FD004 (amber) RMSE. Both bottom out near $\lambda_{\text{rul}} = 0.5$ and rise steeply as the weight approaches 1.0 (single-task).

Try the slider. Push λ_rul above 0.85 and the log-scale RMSE shoots up sharply — that is the regularisation cliff. Auxiliary task weight matters more than intuition suggests, because the backbone is shared. Below ~0.45 RMSE rises again as the backbone is dragged toward a health-specialist regime.
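For readers without the interactive widget, the interpolation idea can be sketched statically. This is an assumption-laden sketch: it uses piecewise-linear interpolation (the widget's actual scheme is not stated), it treats the single-task row as λ_rul = 1.0, and it refuses to extrapolate below 0.5 because the ablation table gives no anchor there. Interpolated values are estimates, not measurements.

```python
from bisect import bisect_right

# Anchor points from the ablation table: (lam_rul, FD002 RMSE, FD004 RMSE).
ANCHORS = [
    (0.50, 6.33, 7.30),
    (0.60, 6.88, 7.78),
    (0.75, 7.24, 8.14),
    (0.90, 8.59, 7.71),
    (1.00, 26.92, 28.27),  # single-task row, assumed to sit at lam_rul = 1.0
]


def interpolate_rmse(lam_rul):
    """Piecewise-linear (FD002, FD004) RMSE estimate for lam_rul in [0.5, 1.0]."""
    lams = [a[0] for a in ANCHORS]
    if not lams[0] <= lam_rul <= lams[-1]:
        raise ValueError("no anchor data outside [0.5, 1.0]")
    # Index of the right-hand anchor of the segment containing lam_rul.
    hi = min(bisect_right(lams, lam_rul), len(ANCHORS) - 1)
    lo = hi - 1
    l0, a0, b0 = ANCHORS[lo]
    l1, a1, b1 = ANCHORS[hi]
    t = (lam_rul - l0) / (l1 - l0)
    return a0 + t * (a1 - a0), b0 + t * (b1 - b0)


for lam in (0.50, 0.70, 0.85, 0.95):
    fd002, fd004 = interpolate_rmse(lam)
    print(f"lam_rul={lam:.2f}  FD002~{fd002:6.2f}  FD004~{fd004:6.2f}")
```

The steep rise in the last segment is entirely driven by the single-task anchor, which is why the cliff appears so abrupt between 0.9 and 1.0.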

AMNL-Fixed: Magnitude-Normalised 0.5 / 0.5

The paper's repo includes a second fixed-weight variant, AMNLFixedLoss (baselines.py:55). It also fixes the weights at 0.5/0.5 but FIRST normalises each per-task loss by its own magnitude:

$$\mathcal{L}_{\text{AMNL-fix}} = 0.5 \cdot \frac{\mathcal{L}_{\text{rul}}}{\max(L_{\text{rul}},\ s_{\text{rul}})} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\max(L_{\text{health}},\ s_{\text{health}})}$$

with floor scales $s_{\text{rul}} = 1.0$ and $s_{\text{health}} = 0.1$ to prevent division by tiny early-training values. The intent: scale the two losses to comparable magnitudes BEFORE combining, so the equal-weight rule is meaningful. In practice the unnormalised FixedWeightLoss is what the paper reports as ‘Baseline’ and ‘AMNL fixed (with WMSE)’ in chapter 23 §1; the AMNL-fixed variant is included as an ablation but not the main result.
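A purely numeric sketch of the formula above may help. It is not the repo's AMNLFixedLoss: the function name `amnl_fixed_loss` is hypothetical, the inputs are plain floats, and how the denominators are handled under autograd is outside what this sketch covers.

```python
def amnl_fixed_loss(rul_loss, health_loss, s_rul=1.0, s_health=0.1):
    """0.5/0.5 combination of magnitude-normalised losses (numeric sketch).

    Each loss is divided by max(its own value, floor), following the
    formula in the text. Inputs are plain floats, not tensors.
    """
    rul_term = rul_loss / max(rul_loss, s_rul)
    health_term = health_loss / max(health_loss, s_health)
    return 0.5 * rul_term + 0.5 * health_term


# Both losses above their floors: each normalised term equals 1.
above_floor = amnl_fixed_loss(25.75, 0.90)

# Both losses below their floors: the floors take over as denominators.
below_floor = amnl_fixed_loss(0.40, 0.05)

print(f"above floors: {above_floor:.4f}")
print(f"below floors: {below_floor:.4f}")
```

When both losses exceed their floors the value is 1 by construction, so the formula is informative mainly through how it rescales each task's contribution. If the denominators are treated as constants during backpropagation, each task's gradient is scaled by the inverse of its loss magnitude; whether the repo does this is not stated here.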

Python: FixedWeightLoss From Scratch

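Compute the combined loss for the four ablation weightings with one function. No PyTorch needed, just NumPy plus a softmax cross-entropy helper. The for-loop at the bottom prints L_total for the four configurations on the same input losses. This reproduces the weightings, not the paper's RMSE numbers, which require full training runs.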

Fixed weighted MTL loss in 30 lines
🐍fixed_weight_demo.py
docstring

Names the contract: the simplest MTL loss is a fixed weighted sum of the per-task losses. No state, no adaptation. The choice of λ_rul is the entire algorithm. Chapter 23 §1 framed this as the OUTER axis being ‘fixed’ (chapter 21 §1 cell A or B).

import numpy as np

Numerical-array library. Used for the per-task loss computations and the softmax + cross-entropy helper.

def softmax_ce(logits, target):

Standalone 3-class cross-entropy on (B, 3) logits with integer (B,) targets. Same algebra as F.cross_entropy in PyTorch.

EXECUTION STATE
⬇ input: logits (B, 3) = Pre-softmax scores per sample. Each row is unnormalised log-probabilities for {healthy, degrading, critical}.
⬇ input: target (B,) = Integer class indices in {0, 1, 2}. dtype=int64 in production.
⬆ returns = Float — mean cross-entropy across the batch.

z = logits - logits.max(axis=-1, keepdims=True)

Log-sum-exp trick: subtract the per-row max for numerical stability before exp(). Without this, softmax with large logits would overflow.

EXECUTION STATE
📚 .max(axis=-1, keepdims=True) = Per-row max, kept as a (B, 1) column for broadcasting against (B, 3).

p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)

Two-statement softmax. First line: exp; second: normalise per row to sum to 1.

EXECUTION STATE
📚 /= = In-place division. p /= s reuses the buffer of p, faster than p = p / s.

return -np.log(p[np.arange(len(target)), target] + 1e-12).mean()

Pull the predicted probability of the TRUE class for each sample, take -log, average. The 1e-12 floor prevents log(0) when the model is very confident on the wrong class.

EXECUTION STATE
📚 fancy indexing = p[np.arange(B), target] returns p[i, target[i]] for each row — the predicted probability of the true class.
→ why -log? = Cross-entropy = -log(p_correct). Lower is better; a perfectly correct prediction gives 0.

def fixed_weight_loss(rul_loss, health_loss, lam_rul=0.5):

The complete fixed-weighting MTL loss. One line of math; no internal state. lam_rul=0.5 is the AMNL paper's recommended default.

EXECUTION STATE
⬇ input: rul_loss = Scalar RUL loss (e.g., MSE on predicted RULs).
⬇ input: health_loss = Scalar health-classification loss (cross-entropy).
⬇ input: lam_rul=0.5 = RUL weight in [0, 1]. λ_health = 1 − λ_rul, so the two weights always sum to 1.
⬆ returns = Scalar combined loss for one mini-batch.

return lam_rul * rul_loss + (1.0 - lam_rul) * health_loss

Convex combination. lam_rul=1 → pure RUL (single-task). lam_rul=0 → pure health (degenerate). lam_rul=0.5 → equal weight.

EXECUTION STATE
→ why convex? = Weights sum to 1 by construction. The composed loss is bounded between min(L_rul, L_health) and max(L_rul, L_health) — no scale blow-up.
→ contrast with GABA = GABA's λ are ALSO sum-to-one (post-renormalisation), but they UPDATE per-batch from gradient measurements. Fixed weights never change.

y_rul = np.array([20., 80., 5., 100.])

Four ground-truth RUL values, one per engine.

EXECUTION STATE
y_rul (4,) = [20., 80., 5., 100.]

y_pred = np.array([25., 75., 12., 98.])

Four predictions. Residuals: [+5, −5, +7, −2], a slight late bias.

EXECUTION STATE
y_pred (4,) = [25., 75., 12., 98.]
residuals = [+5, −5, +7, −2]

hp_target = np.array([2, 1, 2, 0])

Health labels: critical / degrading / critical / healthy.

hp_logits = np.array([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2], [0.4, 0.5, 0.6], [0.8, 0.1, 0.1]])

Pre-softmax health logits per sample. Columns: {healthy, degrading, critical}.

EXECUTION STATE
hp_logits (4, 3) =
      h    d    c
A   0.1  0.2  0.7    ← critical predicted (correct, target=2)
B   0.5  0.3  0.2    ← healthy predicted (target=1)
C   0.4  0.5  0.6    ← critical predicted (correct, target=2)
D   0.8  0.1  0.1    ← healthy predicted (correct, target=0)

L_rul = ((y_pred - y_rul) ** 2).mean()

Standard MSE. Three operations: subtract, square, average.

EXECUTION STATE
(y_pred - y_rul) ** 2 = [25, 25, 49, 4] — squared residuals.
L_rul = (25 + 25 + 49 + 4) / 4 = 25.75

L_health = softmax_ce(hp_logits, hp_target)

Cross-entropy on the health head.

EXECUTION STATE
L_health (illustrative) = ≈ 0.90 — partly-trained 3-class softmax.

configs = [(0.50, 'AMNL 0.5/0.5'),

Four fixed-weight configurations to compare. Each is a (λ_rul, label) tuple.

(0.60, 'AMNL 0.6/0.4'),

Slight RUL emphasis.

(0.75, 'V7 baseline 0.75/0.25'),

The AMNL paper's pre-ablation default: 0.75 to RUL, 0.25 to health. A heuristic: RUL is the primary task, so give it 3× the weight.

(0.90, 'AMNL 0.9/0.1 (heavy RUL)')]

Heavy RUL emphasis. Approaches single-task (λ_rul=1).

print(f"L_rul = {L_rul:.4f} L_health = {L_health:.4f}")

Print the per-task losses for context.

print(f"{'config':<24} {'λ_rul':>6} {'λ_h':>6} {'L_total':>10}")

Header row. Format: 24 chars left, 6/6/10 chars right.

EXECUTION STATE
📚 f-string format spec = {value:<width} = left-align in `width` chars; {value:>width} = right-align.

print("-" * 55)

Separator line.

for lam, name in configs:

Iterate the 4 configurations.

LOOP TRACE · 4 iterations
λ_rul = 0.50
L_total = 0.5 * 25.75 + 0.5 * 0.90 = 13.325
λ_rul = 0.60
L_total = 0.6 * 25.75 + 0.4 * 0.90 = 15.81
λ_rul = 0.75
L_total = 0.75 * 25.75 + 0.25 * 0.90 = 19.5375
λ_rul = 0.90
L_total = 0.9 * 25.75 + 0.1 * 0.90 = 23.265

L = fixed_weight_loss(L_rul, L_health, lam_rul=lam)

Compute the fixed-weight loss for this configuration. Just one line of arithmetic; that's the whole MTL composition.

print(f"{name:<24} {lam:>6.2f} {1-lam:>6.2f} {L:>10.4f}")

Print the row.

EXECUTION STATE
Final output (illustrative) =
L_rul    = 25.7500    L_health = 0.8999
config                    λ_rul    λ_h   L_total
-------------------------------------------------------
AMNL 0.5/0.5               0.50   0.50    13.3249
AMNL 0.6/0.4               0.60   0.40    15.8099
V7 baseline 0.75/0.25      0.75   0.25    19.5375
AMNL 0.9/0.1 (heavy RUL)   0.90   0.10    23.2650
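→ reading = Larger λ_rul → larger L_total because L_rul (25.75) >> L_health (0.90), so L_total is smallest at λ_rul=0.5. That ordering is arithmetic on fixed input losses, not a measure of model quality: each λ_rul defines a different objective, so L_total values are not comparable across rows. The AMNL paper's ablation measures VALIDATION RMSE after training, which is a separate quantity. The two are NOT the same thing.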

    """Fixed equal weighting: the simplest MTL loss.

    L = lam_rul * L_rul + (1 - lam_rul) * L_health.

    The two task losses are in different units (cycles² for MSE, nats for
    cross-entropy) and are NOT rescaled here. lam_rul = 0.5 gives them equal
    weights, not equal gradient contributions. The OUTER axis is held fixed:
    no controller.
    """

    import numpy as np


    def softmax_ce(logits, target):
        """3-class cross-entropy on logits (B, 3) with int target (B,)."""
        # Subtract the per-row max before exp() for numerical stability.
        z = logits - logits.max(axis=-1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
        # Probability of the true class per row; 1e-12 guards against log(0).
        return -np.log(p[np.arange(len(target)), target] + 1e-12).mean()


    def fixed_weight_loss(rul_loss, health_loss, lam_rul=0.5):
        """L = lam_rul * L_rul + (1 - lam_rul) * L_health.

        The whole MTL composition in one line. No state, no controller.
        """
        return lam_rul * rul_loss + (1.0 - lam_rul) * health_loss


    # ---------- Tiny mini-batch (4 samples) ----------
    y_rul     = np.array([20., 80.,  5., 100.])
    y_pred    = np.array([25., 75., 12.,  98.])
    hp_target = np.array([2, 1, 2, 0])
    hp_logits = np.array([[0.1, 0.2, 0.7],
                          [0.5, 0.3, 0.2],
                          [0.4, 0.5, 0.6],
                          [0.8, 0.1, 0.1]])

    L_rul    = ((y_pred - y_rul) ** 2).mean()
    L_health = softmax_ce(hp_logits, hp_target)


    # ---------- Compare four fixed weightings ----------
    configs = [(0.50, "AMNL 0.5/0.5"),
               (0.60, "AMNL 0.6/0.4"),
               (0.75, "V7 baseline 0.75/0.25"),
               (0.90, "AMNL 0.9/0.1 (heavy RUL)")]

    print(f"L_rul    = {L_rul:.4f}    L_health = {L_health:.4f}")
    print(f"{'config':<24} {'λ_rul':>6} {'λ_h':>6} {'L_total':>10}")
    print("-" * 55)
    for lam, name in configs:
        L = fixed_weight_loss(L_rul, L_health, lam_rul=lam)
        print(f"{name:<24} {lam:>6.2f} {1-lam:>6.2f} {L:>10.4f}")

The L_total numbers are misleading. 0.5/0.5 gives the LOWEST L_total in this demo only because L_rul is numerically larger and gets weighted down; the same arithmetic would favour an even smaller λ_rul. 0.5/0.5 also gives the LOWEST validation RMSE in the paper's ablation, but that result comes from training a model per split, and it reflects the regularisation effect: equal weights produce a backbone that GENERALISES better despite a less aggressive training objective on the primary task. Do not select λ_rul by comparing L_total across weightings.
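One way to see why L_total is a poor guide is to look at how much of it each task supplies. The sketch below reuses the demo's per-task losses (with L_health rounded to 0.90); the helper `contribution_shares` is hypothetical and measures shares of the loss value, which is not the same as shares of the gradient.

```python
# Per-task losses from the mini-batch demo (L_health rounded to 0.90).
L_rul, L_health = 25.75, 0.90


def contribution_shares(lam_rul, rul_loss=L_rul, health_loss=L_health):
    """Return (L_total, fraction of L_total supplied by the RUL term)."""
    rul_part = lam_rul * rul_loss
    health_part = (1.0 - lam_rul) * health_loss
    total = rul_part + health_part
    return total, rul_part / total


for lam in (0.50, 0.60, 0.75, 0.90):
    total, rul_share = contribution_shares(lam)
    print(f"lam_rul={lam:.2f}  L_total={total:8.4f}  RUL share={rul_share:6.1%}")
```

Even at equal weights the RUL term supplies well over nine tenths of the loss value on these inputs, so equal weights should not be read as equal contributions.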

PyTorch: The Paper's FixedWeightLoss

Verbatim from grace/core/baselines.py:34-49. 16 lines of class definition; 3 method implementations; no learnable parameters; no internal state. The simplest entry in the loss registry — and the baseline every other method in this chapter compares against.

Production FixedWeightLoss class
🐍baselines.py:FixedWeightLoss
docstring

States the contract: this is the verbatim production class. It carries the same loss-registry name ‘fixed_050’ that the chapter 21 §1 2×2 grid uses for Cell A.

import torch

PyTorch core. We only need .Tensor and autograd hooks.

import torch.nn as nn

nn.Module base class. Inheriting it gives the MTL loss class access to .to(device), state_dict() serialisation, and the standard Module call protocol.

from typing import Dict

Type hint for the get_weights return type.

class BaseMTLLoss(nn.Module):

Abstract base. Every MTL loss in the paper repo inherits from this class and implements three methods: forward, get_weights, get_name.

EXECUTION STATE
→ why a base class? = Uniform interface lets UnifiedTrainer (chapter 22 §1) call `mtl_loss(rul_loss, health_loss)` regardless of which controller is plugged in. Polymorphism over MTL strategies.

"""Abstract base. forward() returns a scalar combined loss."""

Class docstring explaining the contract.

def forward(self, rul_loss, health_loss, **kwargs): ...

Abstract method. The `...` placeholder means subclasses must implement it. **kwargs accepts shared_params and other GABA-specific arguments without breaking the interface.

EXECUTION STATE
📚 **kwargs = Catches any extra keyword arguments. Lets the same call signature work for FixedWeightLoss (no extra args) and GABALoss (needs shared_params).

def get_weights(self) -> Dict[str, float]: ...

Abstract. Returns the current per-task weights so the gradient logger can record them. For fixed weights, always returns the constants set in __init__.

def get_name(self) -> str: ...

Abstract. Returns a human-readable name for logging (`Fixed(0.50/0.50)`, `GABA(beta=0.99)`, etc.).

class FixedWeightLoss(BaseMTLLoss):

Concrete subclass. The simplest production loss: it wraps the fixed weighted sum in the BaseMTLLoss interface.

"""Static task weighting."""

One-line description.

def __init__(self, rul_weight: float = 0.5,

Constructor. Defaults to 0.5/0.5 (the AMNL recommendation).

EXECUTION STATE
⬇ rul_weight = 0.5 = RUL task weight. The paper registers two convenience aliases: `fixed_050` (this default) and `fixed_075` (the V7 baseline).

health_weight: float = 0.5) -> None:

Health task weight. Default 0.5. The two weights are NOT auto-normalised — the user is responsible for their sum.

EXECUTION STATE
⬇ health_weight = 0.5 = Health task weight.
→ return type None = Constructor side-effects only; no return value.

super().__init__()

Call nn.Module's __init__. Required so PyTorch can register parameters, buffers, and submodules. Skipping this breaks .to(device) and state_dict.

self.rul_weight = rul_weight

Store as instance attribute. NOT registered as nn.Parameter — fixed weights are PYTHON FLOATS, not tensors. They never receive gradients.

EXECUTION STATE
→ contrast with Uncertainty = UncertaintyLoss stores log_var_rul as nn.Parameter, so it's LEARNABLE. FixedWeightLoss stores a Python float, so it's constant.

self.health_weight = health_weight

Same for health.

def forward(self, rul_loss, health_loss, **kwargs):

Implement the BaseMTLLoss interface. **kwargs swallows shared_params and any other arguments other MTL losses use.

EXECUTION STATE
⬇ rul_loss = 0-d tensor — scalar RUL loss with autograd attached.
⬇ health_loss = 0-d tensor — scalar health loss.
⬇ **kwargs = Ignored. Lets this class accept the same call signature as GABALoss(rul_loss, health_loss, shared_params=...).
⬆ returns = 0-d tensor — combined loss with autograd.

return self.rul_weight * rul_loss + self.health_weight * health_loss

One line. Python multiplies a float by a tensor (autograd preserved) and adds two tensors (autograd preserved). Result keeps the gradient graph intact for .backward().

EXECUTION STATE
→ autograd flow = dL/dθ = w_rul * dL_rul/dθ + w_health * dL_health/dθ. Both task losses contribute to every shared backbone parameter.

def get_weights(self):

Inspection helper. UnifiedTrainer's gradient_logger calls this every epoch to log the current weights.

return {"rul_weight": self.rul_weight,

Return as a dict for JSON-serialisable logging.

"health_weight": self.health_weight}

Two keys. Same dict structure as GABA.get_weights, so the trainer's logger doesn't branch on loss type.

def get_name(self):

Human-readable name for logging.

return f"Fixed({self.rul_weight:.2f}/{self.health_weight:.2f})"

f-string with format spec. `Fixed(0.50/0.50)` for the default.

if __name__ == "__main__":

Standard guard. The block runs only when the file is executed directly.

# Equivalent to get_loss("fixed_050") from the loss_registry.

Comment marker. The paper's loss_registry.py registers `fixed_050 = lambda **kw: FixedWeightLoss(0.5, 0.5)`, so calling get_loss('fixed_050') from anywhere in the codebase produces an instance equivalent to the explicit construction below.

loss = FixedWeightLoss(rul_weight=0.5, health_weight=0.5)

Instantiate. No GPU placement needed because there are no parameters.

L_rul = torch.tensor(25.75, requires_grad=True)

Synthetic RUL loss as a 0-d tensor with autograd. requires_grad=True is needed so the .backward() call below populates .grad.

EXECUTION STATE
📚 torch.tensor(value, requires_grad) = Construct a leaf tensor. requires_grad=True turns on gradient tracking.

L_health = torch.tensor(0.90, requires_grad=True)

Synthetic health loss.

L = loss(L_rul, L_health)

Call the loss module. Equivalent to `loss.forward(L_rul, L_health)`. The autograd graph: L = 0.5·L_rul + 0.5·L_health.

EXECUTION STATE
→ why call the module? = nn.Module.__call__ runs forward() AND any registered hooks. Direct .forward() bypasses hooks. Always use `module(x)`, not `module.forward(x)`.

print(f"L_total = {L.item():.4f}")

Convert the 0-d tensor to a Python float and print. .item() requires the tensor to hold exactly one element.

print(f"weights = {loss.get_weights()}")

Inspect the per-task weights.

print(f"name = {loss.get_name()}")

Inspect the loss name.

L.backward()

Run autograd backward. Populates L_rul.grad and L_health.grad with dL/dL_rul and dL/dL_health respectively.

EXECUTION STATE
📚 .backward() = Tensor method. Default: pretend the gradient of L w.r.t. itself is 1.0, and propagate backwards through the autograd graph.

print(f"dL/dL_rul = {L_rul.grad.item():.4f}")

Should print 0.5 — the partial derivative dL/dL_rul = rul_weight = 0.5.

EXECUTION STATE
→ math = L = 0.5·L_rul + 0.5·L_health → dL/dL_rul = 0.5, dL/dL_health = 0.5. Verifies the loss does what it claims.

print(f"dL/dL_health = {L_health.grad.item():.4f}")

Should print 0.5 — the partial w.r.t. L_health.

EXECUTION STATE
Final output =
L_total      = 13.3250
weights      = {'rul_weight': 0.5, 'health_weight': 0.5}
name         = Fixed(0.50/0.50)
dL/dL_rul    = 0.5000
dL/dL_health = 0.5000

    """Paper's FixedWeightLoss class: verbatim.

    Source: paper_ieee_tii/grace/core/baselines.py:34-49.
    The simplest entry in the loss registry. Wraps the fixed weighted sum
    in the same BaseMTLLoss interface as GABA, GradNorm, Uncertainty, etc.,
    so it plugs into UnifiedTrainer without special-casing.
    """

    import torch
    import torch.nn as nn
    from typing import Dict


    class BaseMTLLoss(nn.Module):
        """Abstract base. forward() returns a scalar combined loss."""
        def forward(self, rul_loss, health_loss, **kwargs): ...
        def get_weights(self) -> Dict[str, float]: ...
        def get_name(self) -> str: ...


    class FixedWeightLoss(BaseMTLLoss):
        """Static task weighting."""

        def __init__(self, rul_weight: float = 0.5,
                     health_weight: float = 0.5) -> None:
            super().__init__()
            self.rul_weight    = rul_weight
            self.health_weight = health_weight

        def forward(self, rul_loss, health_loss, **kwargs):
            return self.rul_weight * rul_loss + self.health_weight * health_loss

        def get_weights(self):
            return {"rul_weight":    self.rul_weight,
                    "health_weight": self.health_weight}

        def get_name(self):
            return f"Fixed({self.rul_weight:.2f}/{self.health_weight:.2f})"


    # ---------- Drop-in usage in the trainer ----------
    if __name__ == "__main__":
        # Equivalent to get_loss("fixed_050") from the loss_registry.
        loss = FixedWeightLoss(rul_weight=0.5, health_weight=0.5)

        # Synthetic per-task losses
        L_rul    = torch.tensor(25.75, requires_grad=True)
        L_health = torch.tensor( 0.90, requires_grad=True)

        L = loss(L_rul, L_health)
        print(f"L_total      = {L.item():.4f}")
        print(f"weights      = {loss.get_weights()}")
        print(f"name         = {loss.get_name()}")
        L.backward()
        print(f"dL/dL_rul    = {L_rul.grad.item():.4f}")
        print(f"dL/dL_health = {L_health.grad.item():.4f}")

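The weights are floats, not Parameters. self.rul_weight = 0.5 stores a Python float. If you wrote self.rul_weight = nn.Parameter(torch.tensor(0.5)) instead, the weight would be registered with the module and, once handed to the optimiser, learned. That turns FixedWeightLoss into a (degenerate) UncertaintyLoss without the regularisation term: with nothing to hold the weights in place, gradient descent simply shrinks them, because the gradient of the total with respect to each weight is the (positive) task loss itself. Chapter 24 §2 walks through the difference; subtle bugs hide in this distinction.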
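The degenerate behaviour of unregularised learnable weights can be seen without any framework. The sketch below is a toy, not the paper's code: it assumes the two task losses stay constant and applies plain gradient descent to the weights themselves, with a hypothetical learning rate.

```python
# Toy setup (hypothetical): task losses held constant, weights made "learnable".
L_rul, L_health = 25.75, 0.90
w_rul, w_health = 0.5, 0.5
lr = 0.01

history = []
for step in range(5):
    total = w_rul * L_rul + w_health * L_health
    history.append((w_rul, w_health, total))
    # d(total)/d(w_rul) = L_rul and d(total)/d(w_health) = L_health,
    # both positive, so gradient descent pushes each weight downward.
    w_rul -= lr * L_rul
    w_health -= lr * L_health

for w_r, w_h, total in history:
    print(f"w_rul={w_r:+.4f}  w_health={w_h:+.4f}  total={total:+.4f}")
```

The combined loss falls at every step only because the weights shrink and then go negative, not because any prediction improved. This is the failure mode that a regularisation term on the weights is meant to prevent.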

Fixed Weights In Other Domains

| Domain | Primary task | Auxiliary task | Typical fixed split | Why this split |
|---|---|---|---|---|
| Object detection | Bounding-box regression | Class score (softmax) | 1.0 / 1.0 (no weighting) | Per-anchor losses already on similar scales after Focal Loss; no weighting helps. |
| Speech recognition | CTC (frame-level) | Attention decoder CE | 0.7 / 0.3 | CTC needs more weight because attention can drift without strong alignment supervision. |
| Image segmentation + classification | Pixel-wise Dice | Image-level CE | 0.8 / 0.2 | Pixel loss is N pixels × class loss in scale; the 0.8/0.2 ratio normalises. |
| Recommender CTR + dwell | Click-through binary CE | Dwell-time MSE | 0.5 / 0.5 (equal) | Dwell-time signal is a noisy regulariser for CTR; equal weighting works as in C-MAPSS. |
| Self-driving multi-task | Detection mAP | Lane segmentation, depth | 1.0 / 1.0 / 1.0 | All sub-tasks framed at comparable losses; gradient surgery (PCGrad) used instead of weighting. |

The lesson: fixed weights are a valid first stop, but the right ratio is domain-specific. Run the AMNL ablation (4–5 fixed splits) on a small validation set before committing. If the U-shape minimum is at 0.5, you have the AMNL pattern; if it's elsewhere, the loss scales differ materially and you may benefit from magnitude-normalisation (AMNL-fixed) or learnable weighting (Uncertainty).
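The decision rule in this paragraph can be written down as a small helper. This is a sketch under assumptions: `summarise_sweep` and its `tol` threshold are hypothetical, and the example input simply reuses the FD002 column of the ablation table rather than a fresh validation run.

```python
def summarise_sweep(val_rmse, tol=0.05):
    """Pick the best fixed split from a {lam_rul: validation RMSE} mapping.

    `tol` is a hypothetical tolerance for calling the minimum "at 0.5".
    """
    best_lam = min(val_rmse, key=val_rmse.get)
    if abs(best_lam - 0.5) <= tol:
        advice = "minimum near 0.5: equal weighting looks adequate"
    else:
        advice = ("minimum away from 0.5: consider magnitude normalisation "
                  "or learnable weights")
    return best_lam, advice


# FD002 column of the ablation table, reused here as example input.
fd002 = {0.50: 6.33, 0.60: 6.88, 0.75: 7.24, 0.90: 8.59}
best_lam, advice = summarise_sweep(fd002)
print(best_lam, "->", advice)
```

In practice the mapping would come from one short training run per split on your own validation set, and a sweep that also includes points below 0.5 gives a clearer picture of where the minimum sits.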

Pitfalls With Fixed Weights

Pitfall 1: weights that don't sum to 1

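FixedWeightLoss does not enforce $\lambda_{\text{rul}} + \lambda_{\text{health}} = 1$. Setting (0.7, 0.7) scales the combined loss by 1.4 relative to (0.5, 0.5): same gradient direction, larger magnitude. Under plain SGD that is equivalent to running a 1.4× learning rate on the (0.5, 0.5) loss, which can destabilise a run tuned for the smaller step. Optimisers that normalise gradient scale, such as Adam, are much less sensitive, but logged loss values still stop being comparable across runs. Always verify the sum.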
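Because the class itself does not check the convention, a guard at configuration time is cheap insurance. The helper below is a hypothetical addition, not part of the paper's repo; it rejects negative weights and sums that differ from 1.

```python
import math


def check_weights(rul_weight, health_weight, tol=1e-9):
    """Raise ValueError unless the two weights form a convex combination."""
    if rul_weight < 0 or health_weight < 0:
        raise ValueError("weights must be non-negative")
    total = rul_weight + health_weight
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")
    return rul_weight, health_weight


check_weights(0.5, 0.5)  # passes silently

try:
    check_weights(0.7, 0.7)  # sums to 1.4
except ValueError as err:
    print("rejected:", err)
```

Calling such a check before constructing the loss keeps the looseness of the class from turning into a silent change of effective step size.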

Pitfall 2: assuming 0.5 / 0.5 transfers

The 0.5 / 0.5 result is C-MAPSS-specific. On datasets where the per-task losses have different magnitudes (e.g. raw MSE vs cross-entropy with very different ranges), 0.5/0.5 will be dominated by the larger-magnitude loss. Either rescale the losses to comparable magnitudes (AMNL-fixed pattern) or learn the weights (Uncertainty — chapter 24 §2).

Pitfall 3: forgetting that 0.9/0.1 is nearly single-task

At $\lambda_{\text{rul}} = 0.9$ the health task carries only 10% of the loss weight, so the backbone receives very little auxiliary supervision. RMSE on FD002 climbs from 6.33 to 8.59. Single-task (λ_rul=1.0) is far worse: 26.92. The continuum from MTL to single-task has a sharp cliff.

Pitfall 4: comparing fixed vs adaptive without controlling for inner loss

Fixed 0.5/0.5 with weighted MSE = AMNL. Fixed 0.5/0.5 with standard MSE = Baseline. GABA + standard MSE = adaptive version of Baseline. GABA + weighted MSE = GRACE. These are the four cells of the chapter 21 §1 grid. Confounding fixed-vs-adaptive with the choice of inner loss produces apples-to-oranges comparisons, so change one axis at a time.
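The grid can be encoded as a lookup so that a comparison script can refuse confounded pairs. This is an illustrative sketch: the `GRID` dictionary, the axis labels, and `is_controlled_comparison` are names introduced here, with the cell contents taken from the paragraph above.

```python
# The 2x2 grid as a lookup: (outer weighting, inner RUL loss) -> method name.
GRID = {
    ("fixed 0.5/0.5", "standard MSE"): "Baseline",
    ("fixed 0.5/0.5", "weighted MSE"): "AMNL",
    ("GABA", "standard MSE"): "adaptive Baseline",
    ("GABA", "weighted MSE"): "GRACE",
}


def is_controlled_comparison(cell_a, cell_b):
    """True when two grid cells differ in exactly one of the two axes."""
    differing = sum(a != b for a, b in zip(cell_a, cell_b))
    return differing == 1


cells = list(GRID)
for i, cell_a in enumerate(cells):
    for cell_b in cells[i + 1:]:
        verdict = "controlled" if is_controlled_comparison(cell_a, cell_b) else "confounded"
        print(f"{GRID[cell_a]:<18} vs {GRID[cell_b]:<18} {verdict}")
```

Of the six possible pairs, the two diagonal ones (Baseline against GRACE, AMNL against the adaptive Baseline) change both axes at once and are the comparisons this pitfall warns about.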

Pitfall 5: starting fixed and never re-checking

A fixed 0.5/0.5 model trained successfully on one dataset version may break on a refresh of the data. The optimal split depends on the relative scales of the per-task losses, which change with data preprocessing. Re-run the 4-point sweep when your training data changes meaningfully.

Takeaway

  • Fixed weighting is the simplest MTL loss: $\mathcal{L} = \lambda_{\text{rul}}\mathcal{L}_{\text{rul}} + (1-\lambda_{\text{rul}})\mathcal{L}_{\text{health}}$. No state, no controller, no learnable parameters.
  • On C-MAPSS multi-condition the AMNL ablation finds $\lambda_{\text{rul}} = 0.5$ the optimal split: FD002 RMSE 6.33, FD004 RMSE 7.30, beating the intuitive 0.75/0.25 by 0.91 and 0.84 cycles.
  • The U-shape RMSE-vs-λ_rul curve has a regularisation cliff near λ_rul → 1 (RMSE jumps to 26.92 at single-task) — the auxiliary task is doing real work.
  • The production class is FixedWeightLoss at grace/core/baselines.py:34-49. Inherits BaseMTLLoss for uniform interface with GABA/Uncertainty/GradNorm/DWA. Weights stored as Python floats — explicitly NOT learnable.
  • Use fixed weights as a baseline. Run a 4–5-point ablation on a validation set; if the U-minimum is at 0.5, you have the AMNL pattern. If not, escalate to magnitude normalisation (AMNL-fixed), learnable weights (Uncertainty), or gradient-magnitude adaptation (GABA / GRACE).