Two Derivatives, Two Universes
§12.1 measured the imbalance. This section explains it. The gap is not an artefact of our model size or our optimiser - it falls straight out of the derivatives of MSE and CE.
Read the next four lines slowly. They are the entire chapter in compressed form.
| Loss | Derivative wrt logit / pred | Element bound |
|---|---|---|
| MSE L = (1/B) Σ (ŷ - y)² | ∂L/∂ŷ_i = (2/B)(ŷ_i - y_i) | unbounded - linear in the residual ŷ_i - y_i |
| CE L = -(1/B) Σ log p_y | ∂L/∂z_ik = (1/B)(p_ik - 1[k=y_i]) | ≤ 1/B (always) |
MSE: Derivative Grows With Residual
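Let $r_i = \hat{y}_i - y_i$ be the per-sample residual. Then
$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{B}\, r_i.$$
The gradient's magnitude is linear in the residual. With our capped RUL target (R_max = 125) and a freshly-initialised network predicting near zero, the per-sample residual is on the order of R_max, up to 125 cycles. For a batch of size $B$, the per-element gradient magnitude is then as large as $2 \cdot 125 / B = 250/B$.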
CE: Derivative Bounded By 1
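Cross-entropy after softmax has a famously clean derivative:
$$\frac{\partial L}{\partial z_{ik}} = \frac{1}{B}\left(p_{ik} - \mathbb{1}[k = y_i]\right).$$
Because $0 \le p_{ik} \le 1$ and $\mathbb{1}[k = y_i] \in \{0, 1\}$, every element of $p_{ik} - \mathbb{1}[k = y_i]$ is in $[-1, 1]$. Dividing by $B$ gives a per-element bound of $1/B$.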
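For readers who want the intermediate steps, the softmax cross-entropy derivative from the table above can be derived in a few lines. This is a standard derivation written in the table's notation; the per-sample loss symbol $\ell_i$ is introduced here for convenience and is not part of the original text.

```latex
\begin{aligned}
p_{ik} &= \frac{e^{z_{ik}}}{\sum_{j} e^{z_{ij}}},
\qquad
\ell_i = -\log p_{i y_i} = -z_{i y_i} + \log \sum_{j} e^{z_{ij}}, \\[4pt]
\frac{\partial \ell_i}{\partial z_{ik}}
  &= -\mathbb{1}[k = y_i] + \frac{e^{z_{ik}}}{\sum_{j} e^{z_{ij}}}
   = p_{ik} - \mathbb{1}[k = y_i], \\[4pt]
L = \frac{1}{B} \sum_{i} \ell_i
  \;&\Longrightarrow\;
\frac{\partial L}{\partial z_{ik}} = \frac{1}{B}\left(p_{ik} - \mathbb{1}[k = y_i]\right),
\qquad
\left|\frac{\partial L}{\partial z_{ik}}\right| \le \frac{1}{B}.
\end{aligned}
```

The bound holds for any logits, because it depends only on $p_{ik}$ being a probability.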
Interactive: Live Comparison
Drag the residual and probability sliders. Notice how the ratio crosses 100× when the residual goes above ~50 cycles (which it does at every freshly-seeded run on C-MAPSS).
Try this. Set residual to 0. The MSE bar disappears. Now set residual to ±125. The MSE bar saturates the chart. The CE bar barely moves either way. The MSE gradient is the thing growing without limit; everything else about the imbalance is bookkeeping.
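The same sweep can be reproduced offline. The sketch below is a minimal stand-in for the sliders, not the code behind the interactive figure; the batch size `B = 32`, the true-class probability of 1/3 (uniform softmax over K = 3 classes), and the helper name `gradient_magnitudes` are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_magnitudes(residual, p_true, batch_size):
    """Return (|MSE grad|, |CE grad on the true-class logit|) per sample."""
    mse = 2.0 * np.abs(residual) / batch_size   # |(2/B) * (y_hat - y)|
    ce = (1.0 - p_true) / batch_size            # |(1/B) * (p_true - 1)|
    return mse, ce

B = 32                 # hypothetical batch size; it cancels in the ratio
CE_BOUND = 1.0 / B     # worst-case magnitude of any CE gradient entry
P_TRUE = 1.0 / 3.0     # assumed: uniform softmax over K = 3 classes

residuals = np.array([0.0, 10.0, 50.0, 125.0])  # cycles
mse, ce = gradient_magnitudes(residuals, P_TRUE, B)
ratio_to_bound = mse / CE_BOUND                 # equals 2 * |residual|

for r, m, q in zip(residuals, mse, ratio_to_bound):
    print(f"residual={r:6.1f}  |mse grad|={m:7.4f}  "
          f"|ce grad|={ce:.4f}  mse / ce_bound={q:6.1f}x")
```

Measured against the worst-case CE bound, the ratio is simply 2 × |residual|, which is why it reaches 100× at a residual of 50 cycles regardless of batch size.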
Python: NumPy Verification
Compute both gradients analytically (no autograd), then print the ratio per sample. The point of doing this from scratch is to make the bound visible: every per-element CE entry stays within the same 1/B bound, while the MSE entries spread up to 2·R_max/B.
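A minimal sketch of such a check follows. It is not the chapter's original listing: the batch size, the seed, and the synthetic targets and logits are illustrative assumptions standing in for a freshly-initialised dual-head network on capped RUL data.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
B, K, R_MAX = 8, 3, 125.0        # B is an illustrative batch size

# Regression head: capped RUL targets, fresh network predicting near zero.
y_rul = rng.uniform(0.0, R_MAX, size=B)
y_hat = rng.normal(0.0, 0.1, size=B)

# Classification head: small random logits, random class labels.
labels = rng.integers(0, K, size=B)
logits = rng.normal(0.0, 0.1, size=(B, K))

# Analytic MSE gradient: dL/dy_hat_i = (2/B) * (y_hat_i - y_i).
grad_mse = (2.0 / B) * (y_hat - y_rul)

# Analytic CE gradient: dL/dz_ik = (1/B) * (p_ik - 1[k = y_i]).
shifted = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)
onehot = np.eye(K)[labels]
grad_ce = (probs - onehot) / B

# Per-sample ratio: |MSE grad| over the largest |CE grad| entry of that sample.
ce_max = np.abs(grad_ce).max(axis=1)
ratio = np.abs(grad_mse) / ce_max

for i in range(B):
    print(f"sample {i}: |mse|={abs(grad_mse[i]):8.4f}  "
          f"max|ce|={ce_max[i]:.4f}  ratio={ratio[i]:8.1f}x")
print(f"CE bound 1/B = {1.0 / B:.4f}")
```

Every entry of `grad_ce` stays at or below 1/B in magnitude, and each row sums to zero because the softmax probabilities and the one-hot target both sum to one.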
PyTorch: autograd Cross-Check
Same problem, autograd-verified. F.mse_loss and F.cross_entropy with .backward() reproduce the analytic numbers to floating-point precision.
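One way to run this cross-check is sketched below. It is an illustration under assumed inputs (batch size, seed, and synthetic tensors), not the chapter's original listing. It relies on the default `reduction="mean"` of both `F.mse_loss` and `F.cross_entropy`, which supplies the 1/B factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)             # seed chosen arbitrarily
B, K, R_MAX = 8, 3, 125.0        # B is an illustrative batch size

# Synthetic stand-ins for the two heads of a freshly-initialised network.
y_rul = torch.rand(B) * R_MAX
y_hat = (0.1 * torch.randn(B)).requires_grad_()
labels = torch.randint(0, K, (B,))
logits = (0.1 * torch.randn(B, K)).requires_grad_()

# Autograd gradients (default reduction="mean" averages over the batch).
F.mse_loss(y_hat, y_rul).backward()
F.cross_entropy(logits, labels).backward()

# Analytic gradients from the table formulas.
with torch.no_grad():
    analytic_mse = (2.0 / B) * (y_hat - y_rul)
    analytic_ce = (torch.softmax(logits, dim=1) - F.one_hot(labels, K).float()) / B

mse_ok = torch.allclose(y_hat.grad, analytic_mse, atol=1e-6)
ce_ok = torch.allclose(logits.grad, analytic_ce, atol=1e-6)
print("MSE matches analytic:", mse_ok)
print("CE  matches analytic:", ce_ok)
print("max |MSE grad|:", y_hat.grad.abs().max().item())
print("max |CE grad| :", logits.grad.abs().max().item(), " bound 1/B =", 1.0 / B)
```

The comparison uses `torch.allclose` rather than exact equality because the autograd and analytic paths round differently in float32.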
Same Math, Different Domains
Anywhere unbounded MSE meets bounded CE on a shared backbone, the same imbalance shows up. The fixes (AMNL / GABA / GRACE) generalise without modification.
| Domain | Regression scale | Classification cap | Source of gap |
|---|---|---|---|
| RUL prediction (C-MAPSS) | y up to 125 cycles | K=3 → bound 1/B | RUL cap |
| Battery capacity vs failure type | y in [0.7, 1.0] capacity ratio | K=4 fault types | scaled but still >5× |
| Power-grid load vs anomaly tag | y in MW, can hit 10⁴ | K=2 anomaly/normal | MW magnitude |
| Wind speed vs gear-fault tag | y in [0, 25] m/s | K=3 fault levels | moderate ~10× |
| Astronomy: redshift vs galaxy class | y in [0, 7] | K=10 morphology classes | small ~3-5× |
| NLP: sentiment score vs topic | y in [-5, 5] | K=20 topics | small ~1-3× |
Three Loss-Scaling Pitfalls
The point. One bounded gradient, one unbounded gradient, on the same shared parameters. Standard optimisers cannot tell the two apart - they see the bigger one. §12.3 measures this on the real DualTaskModel. §12.4 spells out the consequence. Then we fix it.
Takeaway
- MSE's derivative is unbounded. It is linear in the residual.
- CE's derivative is bounded. It is ≤ 1/B element-wise.
- The ratio is structural. It is not a bug, a hyperparameter, or an artefact of model size. Re-normalising targets does not fix it; rebalancing gradients does.
- NumPy and autograd agree. Both compute the same per-sample numbers. Use either to diagnose your own MTL setup.