The Thermostat That Overshoots
A thermostat with high gain heats the room past the setpoint, then over-corrects, then under-corrects. Eventually it settles - but only because the gain is bounded. Crank the gain to 10× and the system never settles. The same thing happens to AMNL training when w_max grows past ~3.
§14.1 said “more weight on near-failure” helps. §14.2 said the schedule should be linear. This section answers the remaining question: how MUCH weight? The paper sets w_max = 2.0 and never tunes it. That choice is not accidental.
What Goes Wrong As w_max Grows
With AMNL the per-sample gradient is . Variance across the batch:
For uniformly distributed RUL the schedule gives . Variance grows quadratically with . Adam's second-moment estimate tracks , so scales LINEARLY in . The effective per-step update is - and at near-failure samples the numerator can be × the median while the denominator only RMS-scales - so the ratio explodes.
Interactive: Stability vs w_max
Slide w_max from 1 (uniform MSE) to 10 (extreme). The left panel is the schedule curve; the right panel is the histogram of per-sample on a synthetic batch. Watch the right tail of the histogram explode as w_max grows.
Try this. Set w_max=2 - the histogram has a modest tail (max/median ≈ 5-6×). Push w_max to 5 - the tail nearly doubles (max/median ≈ 10-11×). At w_max=10 the tail is a long flat ramp - the optimiser can't tell “a bit critical” from “extremely critical” anymore.
Empirical Stability by w_max
Numbers from a 50-step Adam smoke test (the PyTorch block below). Same lr, same data, same seed - only w_max differs.
| w_max | interpretation | p99/median grad ratio | Adam overshoot factor | verdict |
|---|---|---|---|---|
| 1.0 | uniform MSE | ≈ 4.5× | ≈ 0.06 | stable, no emphasis |
| 2.0 | paper choice | ≈ 5.5× | ≈ 0.06 | stable, useful emphasis |
| 3.0 | edge of stable regime | ≈ 7.0× | ≈ 0.10 | stable but tight |
| 5.0 | “legacy strong” ablation | ≈ 10.5× | ≈ 0.17 | marginal - oscillates |
| 10.0 | extreme | ≈ 17.0× | ≈ 0.28 | unstable - frequent spikes |
Python: Simulate Three w_max Values
Compute the per-sample gradient distribution and stability diagnostics for w_max ∈ {1, 2, 5, 10} on a synthetic batch matching C-MAPSS shape. The numbers verify the quadratic-variance argument from the math section.
PyTorch: Ablation Harness
50 Adam steps on a single learnable prediction tensor, run once per w_max. The peak-vs-final overshoot ratio reveals which ceilings produce stable training. Same lr and seed across all four runs.
Cost-Ratio → w_max in Other Domains
The ceiling generalises to any AMNL-style weighting where you have an operational cost ratio. As long as the cost-ratio is at most 3, w_max stays in the stable regime.
| Domain | Late vs early cost ratio | Recommended w_max | Notes |
|---|---|---|---|
| RUL prediction (this book) | ~2× (NASA score asymmetry) | 2.0 | paper default |
| Battery state-of-health | ~1.5× (premature derate vs strand) | 1.5 | modest tilt |
| Wind-turbine SCADA | ~2.5× (crane swap vs outage) | 2.5 | tight but stable |
| Hospital ICU triage score | ~3× (extra hour vs missed deteriorate) | 3.0 | edge of stable regime |
| Wildfire risk forecast | > 10× (false alarm vs missed wildfire) | use focal loss instead | outside AMNL's linear regime |
| Power-grid frequency | > 50× (spot buy vs blackout) | use focal loss instead | outside AMNL's linear regime |
Three w_max Pitfalls
The point. w_max=2 is not arbitrary - it's the largest ceiling that keeps Adam's adaptive denominator in the regime where the per-task gradient variance scales linearly. §14.4 wires this into the §11.4 DualTaskModel and runs the full PyTorch loss as the paper ships it.
Takeaway
- Ceiling = 2.0. Paper choice. Largest stable value.
- Variance is quadratic in w_max. drives the per-sample gradient variance.
- Adam overshoot factor. peak/final loss ratio across a training run. Doubles from w_max=2 to w_max=5; quadruples to w_max=10.
- Cost-ratio > 3 ⇒ different loss family. AMNL is for modest asymmetry. Use NASA-style asymmetric loss (§13.1) for safety-critical ratios.
- Don't tune w_max on val. Set from operational cost ratio, freeze, never retune.