Chapter 14

Why Equal-Weight MSE Underweights Failure

AMNL — Failure-Biased Weighted MSE

Hospital Triage and Engine Triage

An ER doctor with 30 patients does not give every patient the same attention. Sprained ankle and chest pain go through the same door, but resources go to chest pain first. The cost of being wrong is not the same on both.

RUL prediction is the same problem in disguise. A 10-cycle residual on an engine with 100 cycles of life left is a minor scheduling adjustment. A 10-cycle residual on an engine with 15 cycles of life left can MISS the failure. Vanilla MSE treats them as identical.

What this section fixes. Chapter 12 showed the regression-vs-classification gradient imbalance between tasks. This chapter introduces a separate problem within the regression task: not all RUL samples deserve the same gradient pull. AMNL reweights each sample by how close it is to failure - so the model learns the part of the trajectory that actually matters.

Where the Uniform Gradient Goes

Plain MSE is $\mathcal{L}_{\text{MSE}} = \dfrac{1}{B} \sum_{i=1}^{B} (\hat{y}_i - y_i)^2$. The per-sample gradient on the prediction is $\partial \mathcal{L}_{\text{MSE}} / \partial \hat{y}_i = (2/B)(\hat{y}_i - y_i)$ - magnitude tracks the residual, NOT the RUL value.

AMNL multiplies each sample by a weight $w_i \in [1, 2]$ that grows as RUL shrinks:

$$w_i = 1 + \operatorname{clip}\bigl(1 - y_i / R_{\max}, \, 0, \, 1\bigr).$$

End-points: $y = 0 \Rightarrow w = 2$, $y = R_{\max} \Rightarrow w = 1$, $y > R_{\max} \Rightarrow w = 1$ (the clip floor handles above-cap engines). Total loss:

$$\mathcal{L}_{\text{AMNL}} = \dfrac{1}{B} \sum_{i=1}^{B} w_i \cdot (\hat{y}_i - y_i)^2.$$

Plain mean, not normalised average. The paper code (grace/core/weighted_mse.py) uses (w * residual**2).mean() - NOT $\sum_i w_i \cdot \text{sq}_i / \sum_i w_i$. Why: with plain mean, the loss VALUE goes UP when the batch is heavy on near-failure samples - which is exactly the operational signal we want. The normalised form would cancel it.
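The difference between the two reductions can be checked on two toy batches. The sketch below is illustrative only: the helper names and the two batches are made up for this comparison, and only the weight formula and the plain-mean reduction come from the text.

```python
import numpy as np


def amnl_weights(target, max_rul=125.0):
    """Weight schedule from the text: 1 + clip(1 - y / R_max, 0, 1)."""
    return 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)


def plain_mean_loss(pred, target, max_rul=125.0):
    """Reduction described in the text: mean(w * residual**2)."""
    w = amnl_weights(target, max_rul)
    return float(np.mean(w * (pred - target) ** 2))


def normalised_loss(pred, target, max_rul=125.0):
    """Alternative reduction, shown only for contrast: sum(w * sq) / sum(w)."""
    w = amnl_weights(target, max_rul)
    return float(np.sum(w * (pred - target) ** 2) / np.sum(w))


# Hypothetical batches: every prediction is 10 cycles late in both.
healthy_batch = np.array([125.0, 125.0, 125.0, 125.0])
critical_batch = np.array([5.0, 5.0, 5.0, 5.0])

for name, target in [("healthy", healthy_batch), ("critical", critical_batch)]:
    pred = target + 10.0
    print(f"{name:8s} plain mean = {plain_mean_loss(pred, target):.2f}   "
          f"normalised = {normalised_loss(pred, target):.2f}")
```

With identical residuals, the plain mean roughly doubles on the critical batch while the normalised form reports the same value for both, which is the cancelled signal the paragraph above warns about.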

Operational Cost by RUL Range

Mined from the legacy book's field-experience table. Same residual, very different operational consequences:

| True RUL | 10-cycle residual means | AMNL weight | Operational consequence |
|---|---|---|---|
| 100 cycles | wrong by 10% of remaining life | 1.20 | minor maintenance schedule shift |
| 50 cycles | wrong by 20% of remaining life | 1.60 | operations planning impact |
| 20 cycles | wrong by 50% of remaining life | 1.84 | critical safety concern |
| 5 cycles | wrong by 200% of remaining life | 1.96 | potential failure missed entirely |
| 1 cycle | wrong by 1000% of remaining life | 1.99 | almost certain failure missed |
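The weight and percentage columns follow directly from the weight formula with R_max = 125. A minimal sketch that recomputes them, assuming a scalar helper named amnl_weight (a name introduced here, not taken from the paper code):

```python
def amnl_weight(rul, max_rul=125.0):
    """w = 1 + clip(1 - rul / max_rul, 0, 1), written with min/max for scalars."""
    return 1.0 + min(max(1.0 - rul / max_rul, 0.0), 1.0)


residual = 10.0  # cycles: the same error size used in every table row

rows = []
for rul in [100, 50, 20, 5, 1]:
    relative_error_pct = 100.0 * residual / rul   # error as % of remaining life
    weight = amnl_weight(rul)
    rows.append((rul, relative_error_pct, weight))
    print(f"RUL={rul:>3}  error={relative_error_pct:5.0f}% of remaining life  "
          f"weight={weight:.2f}")
```

The relative error grows a hundredfold between the first and last row, while the weight only grows from 1.20 to 1.99: the schedule is a bounded tilt and does not scale inversely with RUL.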

Interactive: Per-RUL Gradient Density

Drag w_max from 1 (uniform MSE) to 5 (extreme emphasis). Drag the red test point along the RUL axis. Background shows the C-MAPSS RUL histogram - notice how few samples land at low RUL, which is exactly why uniform MSE starves that regime.

Try this. Set w_max = 1.0 - the green and grey curves overlap exactly (uniform MSE). Slide w_max to 2.0 (the AMNL paper default) - the green curve tilts left, amplifying the under-represented near-failure samples. Push w_max to 5.0 - §14.3 explains why this is too aggressive and destabilises training.
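The slider can be imitated offline. The text does not spell out how w_max enters the formula, so the sketch below assumes the natural linear generalisation w = 1 + (w_max - 1) · clip(1 - y/R_max, 0, 1), which reduces to uniform MSE at w_max = 1 and to the AMNL schedule at w_max = 2. The function name tilted_weights is hypothetical.

```python
import numpy as np


def tilted_weights(target, w_max=2.0, max_rul=125.0):
    """Assumed generalisation: w_max at RUL = 0, falling linearly to 1 at RUL >= max_rul."""
    return 1.0 + (w_max - 1.0) * np.clip(1.0 - target / max_rul, 0.0, 1.0)


# A few points along the RUL axis, including one above the cap.
rul = np.array([0.0, 25.0, 62.5, 125.0, 200.0])

for w_max in (1.0, 2.0, 5.0):
    print(f"w_max={w_max}:", tilted_weights(rul, w_max).round(2).tolist())
```

Under this assumption the floor stays at 1 for every setting, and only the near-failure end of the schedule moves with w_max.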

Python: Uniform vs Weighted from Scratch

Both losses computed analytically in NumPy. Six hand-picked engines, each predicted 10 cycles late - so the residual is constant. Any difference in per-sample gradient comes purely from the weighting.

uniform_mse() and amnl_weighted_mse()
🐍amnl_numpy.py
import numpy as np

NumPy provides the (B,)-shaped arrays used for the vectorised loss and gradient computation. We use np.array, np.mean, np.clip, plus broadcasting and element-wise arithmetic.

EXECUTION STATE
📚 numpy = Library: ndarray, broadcasting, math, random, linear algebra.
as np = Universal alias.
def uniform_mse(pred, target) -> tuple[float, np.ndarray]:

Standard mean squared error. Treats every sample as equally important - exactly the assumption AMNL will overturn.

EXECUTION STATE
⬇ input: pred = (B,) predicted RUL.
⬇ input: target = (B,) ground-truth RUL.
⬆ returns = (loss: float, grad: (B,)) - both quantities so we can inspect per-sample pulls.
residual = pred - target

Element-wise signed error. Positive = predicted late.

EXECUTION STATE
operator: - = Element-wise tensor subtraction.
⬆ result: residual = (B,) - in our example all entries are +10 (every sample is 10 cycles late).
loss = float(np.mean(residual ** 2))

Mean of squared residuals. With residual=+10 everywhere, loss = 100.

EXECUTION STATE
📚 np.mean(arr) = Reduce-mean. With no axis, returns a 0-D scalar.
operator: ** 2 = Element-wise squaring.
📚 float(x) = 0-D ndarray → Python float.
⬆ result: loss = 100.0 - constant since all residuals are +10.
grad = (2.0 / len(target)) * residual

Analytic ∂L/∂pred = (2/B)·residual for the mean reduction. Per-sample magnitude scales with |residual|, NOT with the sample's position on the RUL axis.

EXECUTION STATE
📚 len(seq) = Python built-in. For (B,) ndarray returns B.
operator: 2.0 / B = Constant outside.
operator: * = Scalar × array broadcast.
⬆ result: grad = [3.333, 3.333, 3.333, 3.333, 3.333, 3.333] - all equal because residual is constant. Uniform MSE gives EVERY sample the same gradient pull, regardless of how close to failure it is.
return loss, grad

Tuple of scalar loss and per-sample gradient.

def amnl_weighted_mse(pred, target, max_rul=125.0) -> tuple[float, np.ndarray, np.ndarray]:

The paper's AMNL formula (paper_ieee_tii/grace/core/weighted_mse.py). Same MSE shape, but each sample is multiplied by w(RUL) before averaging - low-RUL samples get up to 2× the pull.

EXECUTION STATE
⬇ input: pred = (B,) predicted RUL.
⬇ input: target = (B,) ground-truth RUL.
⬇ input: max_rul = 125.0 = RUL cap from §7.2 - the ‘healthy plateau’. Samples above this get weight 1.0 because the clip floor takes over.
⬆ returns = (loss, weights, grad) - three values so we can plot the weight schedule.
weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)

The exact paper line. Decompose: 1 - target/max_rul ⇒ 1 at RUL=0, 0 at RUL=125, NEGATIVE for RUL > 125. np.clip pins the negatives to 0. Add 1 ⇒ weight is 2 at RUL=0, 1 at RUL=125, 1 above.

EXECUTION STATE
📚 np.clip(arr, a_min, a_max) = Element-wise clip. Returns max(a_min, min(arr, a_max)). Either bound can be None to leave that side unclamped.
⬇ arg 1: arr = 1.0 - target / max_rul = Pre-clip values. Range: 1 (at RUL=0) down to negative (at RUL > 125).
⬇ arg 2: a_min = 0.0 = Lower bound. Above-cap engines get clipped to 0 ⇒ weight = 1 (the floor).
⬇ arg 3: a_max = 1.0 = Upper bound. Pre-clip values can never exceed 1 anyway (since RUL ≥ 0), so this is defensive but harmless.
→ numerical example = RUL=5: 1 + clip(1 - 5/125, 0, 1) = 1 + 0.96 = 1.96. RUL=125: 1 + clip(0, 0, 1) = 1 + 0 = 1.0. RUL=200: 1 + clip(-0.6, 0, 1) = 1 + 0 = 1.0.
⬆ result: weights = [1.960, 1.840, 1.600, 1.280, 1.000, 1.000] ← target [5, 20, 50, 90, 125, 200]
→ reading = RUL=5 (near failure) gets nearly 2× the weight of RUL=125 (healthy). Above-cap engines (RUL=200) get the floor weight 1.0 - same as RUL=125, NOT extrapolated.
residual = pred - target

Same as uniform.

EXECUTION STATE
⬆ result: residual = [10, 10, 10, 10, 10, 10] - constant by construction.
loss = float(np.mean(weights * residual ** 2))

Paper formula: (weights * (pred - target)**2).mean() - PLAIN .mean(), NOT Σ(w·sq)/Σw. The two differ in scale: the .mean() form lets the loss RISE if many samples have low RUL (high weight), preserving the operational signal that the batch is "heavy on critical engines".

EXECUTION STATE
operator: * = Element-wise multiply.
operator: ** 2 = Element-wise square.
📚 np.mean(arr) = Plain mean - sum divided by count, NOT divided by weight sum.
⬆ result: loss = Mean of [1.96·100, 1.84·100, 1.60·100, 1.28·100, 1.00·100, 1.00·100] = mean([196, 184, 160, 128, 100, 100]) = 144.67.
→ vs uniform = Uniform loss = 100. AMNL loss = 144.67. The 44.67-point gap is the "extra emphasis" on near-failure samples - it shows up as a larger total loss AND larger gradients on those samples.
grad = (2.0 / len(target)) * weights * residual

Per-sample gradient: ∂L/∂pred_i = (2/B) · w_i · residual_i. Same formula as uniform MSE except for the per-sample w_i factor.

EXECUTION STATE
→ derivation = L = (1/B) Σ w_i · (pred_i - target_i)². Differentiate wrt pred_i: ∂L/∂pred_i = (2/B) · w_i · (pred_i - target_i).
⬆ result: grad = [6.533, 6.133, 5.333, 4.267, 3.333, 3.333] (for residual=10 each, the magnitudes track the weight schedule).
→ emphasis ratio = RUL=5 grad / RUL=125 grad = 6.533 / 3.333 = 1.96× more pull. RUL=200 grad / RUL=125 grad = 1.00× (no change above the cap).
return loss, weights, grad

Three-tuple so we can plot the schedule.

np.random.seed(0)

Repro - though we use deterministic test data so this is mostly habit.

EXECUTION STATE
📚 np.random.seed(s) = Sets NumPy's legacy global PRNG.
⬇ arg: s = 0 = Conventional canonical seed.
target = np.array([5, 20, 50, 90, 125, 200], dtype=np.float64)

Six engines spanning the RUL range. The 200 deliberately tests the above-cap behaviour - AMNL should give it the floor weight 1.0, NOT a weight below 1.

EXECUTION STATE
📚 np.array(seq, dtype) = Construct an ndarray from a Python sequence. dtype is optional; defaults to inferred.
⬇ arg: seq = Python list of 6 RUL values.
⬇ arg: dtype = np.float64 = Double precision, so the rounded values print cleanly through .tolist(); float32 entries would print with conversion noise such as 1.9600000381469727.
⬆ result: target = [5., 20., 50., 90., 125., 200.]
pred = target + np.array([10, 10, 10, 10, 10, 10], dtype=np.float64)

Predict each engine 10 cycles LATE. The constant residual lets us isolate the WEIGHT effect from the RESIDUAL effect - any difference in per-sample gradient comes purely from the AMNL weight.

EXECUTION STATE
⬆ result: pred = [15., 30., 60., 100., 135., 210.]
→ why constant residual? = If we let residuals vary, the gradient comparison would mix two effects. Constant residual isolates the weighting mechanism.
mse_loss, mse_grad = uniform_mse(pred, target)

Run uniform MSE.

EXECUTION STATE
⬆ result: mse_loss = 100.0
⬆ result: mse_grad = [3.333, 3.333, 3.333, 3.333, 3.333, 3.333] - flat across all RUL.
amnl_loss, amnl_weights, amnl_grad = amnl_weighted_mse(pred, target)

Run AMNL weighted MSE.

EXECUTION STATE
⬆ result: amnl_loss = 144.67
⬆ result: amnl_weights = [1.960, 1.840, 1.600, 1.280, 1.000, 1.000]
⬆ result: amnl_grad = [6.533, 6.133, 5.333, 4.267, 3.333, 3.333]
print("RUL :", target.tolist())

.tolist() materialises the ndarray as a Python list for clean printing.

EXECUTION STATE
📚 .tolist() = ndarray → Python list (possibly nested).
Output = RUL : [5.0, 20.0, 50.0, 90.0, 125.0, 200.0]
print("|d|=10 everywhere - same residual size for every sample.\n")

Reminder header. The trailing \n is an escape sequence that adds a blank line after the message; a plain string is enough because nothing is interpolated.

print("uniform |grad| :", mse_grad.round(4).tolist())

.round(decimals) returns a new ndarray rounded to the given decimals; .tolist() converts to a Python list.

EXECUTION STATE
📚 .round(decimals) = Element-wise rounding.
Output = uniform |grad| : [3.3333, 3.3333, 3.3333, 3.3333, 3.3333, 3.3333]
print("AMNL weights:", amnl_weights.round(3).tolist())

Sample weights.

EXECUTION STATE
Output = AMNL weights: [1.96, 1.84, 1.6, 1.28, 1.0, 1.0]
print("AMNL |grad| :", amnl_grad.round(4).tolist())

Per-sample gradients.

EXECUTION STATE
Output = AMNL |grad| : [6.5333, 6.1333, 5.3333, 4.2667, 3.3333, 3.3333]
→ reading = Near-failure engines (RUL=5) pull on the model with ~2× the force of healthy engines (RUL=125), even though the residual is identical. THAT is what AMNL achieves.
print(f"uniform total loss : {mse_loss:.2f}")

Formatted output. :.2f = float with 2 decimals.

EXECUTION STATE
→ :.2f = Format spec: float, 2 decimals.
Output = uniform total loss : 100.00
print(f"AMNL total loss : {amnl_loss:.2f}")

AMNL total.

EXECUTION STATE
Output = AMNL total loss : 144.67
print(f"emphasis at RUL=5 : {amnl_grad[0] / mse_grad[0]:.2f} x more pull")

Per-sample emphasis ratio.

EXECUTION STATE
Output = emphasis at RUL=5 : 1.96 x more pull
print(f"emphasis at RUL=125: {amnl_grad[4] / mse_grad[4]:.2f} x (=1, no change)")

Floor sanity check.

EXECUTION STATE
Output = emphasis at RUL=125: 1.00 x (=1, no change)
→ invariant = Healthy engines see no change, near-failure engines see up to 2x more pull. Above-cap engines (RUL=200) ALSO see no change - they fall on the floor. AMNL is a TILT, not a re-scale.
```python
import numpy as np


def uniform_mse(pred: np.ndarray, target: np.ndarray) -> tuple[float, np.ndarray]:
    """Standard MSE - average squared residual over the batch.

    Returns (loss, per_sample_grad_on_pred).
    """
    residual = pred - target                                       # (B,)
    loss     = float(np.mean(residual ** 2))                       # scalar
    grad     = (2.0 / len(target)) * residual                      # (B,)
    return loss, grad


def amnl_weighted_mse(pred:    np.ndarray,
                      target:  np.ndarray,
                      max_rul: float = 125.0) -> tuple[float, np.ndarray, np.ndarray]:
    """AMNL weighted MSE - paper formula from grace/core/weighted_mse.py.

        w(RUL) = 1.0 + clip(1.0 - RUL / max_rul, 0, 1)
        loss   = mean( w * (pred - target)**2 )

    Returns (loss, per_sample_weight, per_sample_grad_on_pred).
    """
    weights  = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)     # (B,)
    residual = pred - target                                       # (B,)
    loss     = float(np.mean(weights * residual ** 2))             # scalar
    grad     = (2.0 / len(target)) * weights * residual            # (B,)
    return loss, weights, grad


# ---------- Worked example: 6 engines spanning the RUL range ----------
np.random.seed(0)
# float64 keeps the rounded printouts clean; float32 values pick up
# conversion noise (e.g. 1.9600000381469727) when passed through .tolist().
target = np.array([  5,  20,  50,  90, 125, 200], dtype=np.float64)  # last one above cap
pred   = target + np.array([10, 10, 10, 10, 10, 10], dtype=np.float64)  # all 10 cycles late

mse_loss,  mse_grad                = uniform_mse(pred, target)
amnl_loss, amnl_weights, amnl_grad = amnl_weighted_mse(pred, target)

print("RUL    :", target.tolist())
print("|d|=10 everywhere - same residual size for every sample.\n")
print("uniform |grad| :", mse_grad.round(4).tolist())
print("AMNL    weights:", amnl_weights.round(3).tolist())
print("AMNL   |grad|  :", amnl_grad.round(4).tolist())
print()
print(f"uniform total loss : {mse_loss:.2f}")
print(f"AMNL    total loss : {amnl_loss:.2f}")
print(f"emphasis at RUL=5  : {amnl_grad[0] / mse_grad[0]:.2f} x more pull")
print(f"emphasis at RUL=125: {amnl_grad[4] / mse_grad[4]:.2f} x  (=1, no change)")
```
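The analytic gradient (2/B) · w_i · residual_i can be sanity-checked against central finite differences. This is a self-contained sketch and not part of the paper code; the helper names amnl_loss, analytic_grad and numeric_grad, and the step size eps, are choices made here for illustration.

```python
import numpy as np


def amnl_loss(pred, target, max_rul=125.0):
    """mean(w * residual**2) with w = 1 + clip(1 - target / max_rul, 0, 1)."""
    weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)
    return float(np.mean(weights * (pred - target) ** 2))


def analytic_grad(pred, target, max_rul=125.0):
    """Closed form from the walkthrough: (2 / B) * w_i * (pred_i - target_i)."""
    weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)
    return (2.0 / len(target)) * weights * (pred - target)


def numeric_grad(pred, target, eps=1e-4):
    """Central finite differences, perturbing one prediction at a time."""
    grad = np.zeros_like(pred)
    for i in range(len(pred)):
        step = np.zeros_like(pred)
        step[i] = eps
        grad[i] = (amnl_loss(pred + step, target)
                   - amnl_loss(pred - step, target)) / (2.0 * eps)
    return grad


target = np.array([5.0, 20.0, 50.0, 90.0, 125.0, 200.0])  # float64 for a tight check
pred = target + 10.0

g_analytic = analytic_grad(pred, target)
g_numeric = numeric_grad(pred, target)
max_abs_diff = float(np.max(np.abs(g_analytic - g_numeric)))

print("analytic:", g_analytic.round(4).tolist())
print("numeric :", g_numeric.round(4).tolist())
print("max |difference|:", max_abs_diff)
```

Because the loss is quadratic in each prediction, the central difference has no truncation error, so any remaining gap between the two is floating-point rounding.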

PyTorch: The Paper's One-Liner

The exact function from paper_ieee_tii/grace/core/weighted_mse.py. Three lines of math, one .mean() reduction. Drop-in replacement for F.mse_loss.

moderate_weighted_mse_loss() — paper-canonical implementation
🐍weighted_mse.py
import torch

Top-level PyTorch.

EXECUTION STATE
📚 torch = Tensor library + autograd + nn modules + optim.
import torch.nn as nn

Module containers - imported here for consistency with the paper module layout even though the AMNL function is not class-based.

def moderate_weighted_mse_loss(pred, target, max_rul=125.0) -> torch.Tensor:

The exact function signature and body from paper_ieee_tii/grace/core/weighted_mse.py. Pure function; no module wrapper, no learnable parameters. Drop-in replacement for F.mse_loss.

EXECUTION STATE
⬇ input: pred = Predicted RUL of shape (N, 1) or (N,) - the function reshapes internally.
⬇ input: target = Ground-truth RUL of shape (N, 1) or (N,).
⬇ input: max_rul = 125.0 = Paper default. Matches the §7.2 RUL cap.
⬆ returns = 0-D scalar tensor with autograd graph attached.
pred_flat = pred.view(-1)

Reshape to a flat 1-D tensor. .view(-1) is a no-copy reshape; -1 means 'infer this dimension from the total element count'. Required because some upstream code passes (N, 1) and some passes (N,).

EXECUTION STATE
📚 .view(*shape) = Returns a view of the tensor with the requested shape, sharing storage with the original. Errors if the requested shape is not contiguous-compatible.
⬇ arg: shape = -1 = Single-element shape with -1 ⇒ output is 1-D with as many entries as the input has elements.
→ vs reshape = .view requires contiguous memory; .reshape silently copies if needed. Both work here.
⬆ result: pred_flat = Worked example: tensor([15., 30., 60., 100., 135., 210.]).
target_flat = target.view(-1)

Same flatten for the target.

EXECUTION STATE
⬆ result: target_flat = tensor([5., 20., 50., 90., 125., 200.])
weights = 1.0 + torch.clamp(1.0 - target_flat / max_rul, 0, 1.0)

The single line that makes AMNL different from MSE. torch.clamp is the differentiable equivalent of np.clip - gradient is 1 inside the range, 0 outside. Note: target is a constant (no requires_grad), so the gradient through clamp does not matter at training time.

EXECUTION STATE
📚 torch.clamp(input, min, max) = Element-wise clip. Differentiable: gradient is 1 inside the [min, max] range, 0 outside. Same shape as input.
⬇ arg 1: input = 1.0 - target_flat / max_rul = Pre-clip values. For target=[5, 20, 50, 90, 125, 200] gives [0.96, 0.84, 0.60, 0.28, 0.00, -0.60].
⬇ arg 2: min = 0 = Lower bound. Above-cap engines (target > 125) get clipped to 0 ⇒ weight = 1.0.
⬇ arg 3: max = 1.0 = Upper bound. Pre-clip values can never exceed 1 (since target ≥ 0), so this is defensive but harmless.
operator: 1.0 + = Shift the [0, 1] clipped range up to [1, 2].
→ end-points = target=0 ⇒ w = 2.0 (max emphasis, near failure) target=125 ⇒ w = 1.0 (min emphasis, healthy) target=200 ⇒ w = 1.0 (above-cap floor)
⬆ result: weights = tensor([1.960, 1.840, 1.600, 1.280, 1.000, 1.000])
return (weights * (pred_flat - target_flat) ** 2).mean()

Plain element-wise weighted-square then .mean(). NOT the normalised Σ(w·sq)/Σw form. Differences from the normalised version: (a) plain .mean() lets the loss VALUE rise when the batch is heavy on near-failure samples, signalling that this batch has more emphasis; (b) the gradient of plain .mean() at each sample is exactly w_i × (the standard MSE gradient), which is what we want.

EXECUTION STATE
operator: - = Tensor subtract.
operator: ** 2 = Element-wise square.
operator: * = Element-wise tensor multiply (with weights).
📚 .mean() = Reduce-mean. With no dim, reduces to a 0-D scalar.
⬆ return = 0-D scalar tensor. Worked example: tensor(144.6667).
torch.manual_seed(0)

Repro - the smoke test is deterministic anyway, but the seed is conventional.

EXECUTION STATE
📚 torch.manual_seed(s) = Sets PyTorch's global PRNG.
⬇ arg: s = 0 = Conventional canonical seed.
target = torch.tensor([5.0, 20.0, 50.0, 90.0, 125.0, 200.0])

Same six engines as the NumPy block.

EXECUTION STATE
📚 torch.tensor(seq) = Allocate a new tensor from a Python sequence. Default dtype float32 for floats.
⬆ result: target = tensor([5., 20., 50., 90., 125., 200.])
pred = target + 10.0

Predict each engine 10 cycles late. NB: pred inherits target's requires_grad=False at this point - we'll flip it in the next line.

EXECUTION STATE
⬆ result: pred = tensor([15., 30., 60., 100., 135., 210.])
pred.requires_grad_()

In-place flip of requires_grad to True. The trailing underscore marks it as in-place. We need this so .backward() can populate pred.grad.

EXECUTION STATE
📚 .requires_grad_(requires_grad=True) = In-place setter for the requires_grad flag. Same as x.requires_grad = True but chainable.
⬇ arg: requires_grad = True (default) = Turn autograd ON for this tensor.
→ effect = Subsequent ops on pred build an autograd graph; .backward() can now flow gradients into pred.grad.
loss = moderate_weighted_mse_loss(pred, target)

Run the AMNL loss.

EXECUTION STATE
⬆ result: loss = tensor(144.6667, grad_fn=<MeanBackward0>)
→ matches NumPy = Same number as the NumPy block - the formula is identical.
loss.backward()

Reverse-mode autograd. Populates pred.grad with d(loss)/d(pred).

EXECUTION STATE
📚 .backward(retain_graph=False) = Reverse-mode autograd through the graph; accumulates grads into all leaves with requires_grad=True. Default frees the graph.
print(f"AMNL loss : {loss.item():.4f}")

.item() pulls the Python float; :.4f formats to 4 decimals.

EXECUTION STATE
📚 .item() = 0-D tensor → Python float.
Output = AMNL loss : 144.6667
print(f"per-sample weights : {[round(w, 3) for w in (1.0 + torch.clamp(1.0 - target / 125.0, 0, 1.0)).tolist()]}")

Recompute weights inline so the printout is self-contained. .tolist() converts the tensor to Python floats and the built-in round() trims each one to 3 decimals; rounding after the conversion keeps float32 noise (e.g. 1.9600000381469727) out of the printout.

EXECUTION STATE
📚 .tolist() = Tensor → Python (nested) list.
📚 round(x, ndigits) = Python built-in: rounds a float to ndigits decimals.
Output = per-sample weights : [1.96, 1.84, 1.6, 1.28, 1.0, 1.0]
print(f"per-sample |grad| : {[round(g, 3) for g in pred.grad.abs().tolist()]}")

.abs() is element-wise absolute value.

EXECUTION STATE
📚 .abs() = Element-wise |x|.
Output = per-sample |grad| : [6.533, 6.133, 5.333, 4.267, 3.333, 3.333]
→ vs NumPy block = Identical to the NumPy gradients above, as expected: autograd differentiates the same mean reduction, so each entry is (2/B) · w_i · residual_i.
print(f"emphasis ratio : {(pred.grad[0] / pred.grad[4]).item():.2f}x")

RUL=5 grad / RUL=125 grad = the headline emphasis amplification.

EXECUTION STATE
Output = emphasis ratio : 1.96x
→ reading = Near-failure engines pull on the trunk with ~2x the force of healthy engines for the same residual size. That is the entire point of AMNL.
```python
import torch
import torch.nn as nn


def moderate_weighted_mse_loss(
    pred:    torch.Tensor,
    target:  torch.Tensor,
    max_rul: float = 125.0,
) -> torch.Tensor:
    """AMNL weighted MSE - the paper's exact loss function.

    Source: paper_ieee_tii/grace/core/weighted_mse.py
        w(RUL) = 1.0 + clamp(1.0 - RUL / max_rul, 0, 1.0)
        loss   = mean( w * (pred - target)**2 )
    """
    pred_flat   = pred.view(-1)
    target_flat = target.view(-1)
    weights     = 1.0 + torch.clamp(1.0 - target_flat / max_rul, 0, 1.0)
    return (weights * (pred_flat - target_flat) ** 2).mean()


# ---------- Smoke test ----------
torch.manual_seed(0)
target = torch.tensor([5.0, 20.0, 50.0, 90.0, 125.0, 200.0])
pred   = target + 10.0
pred.requires_grad_()

loss = moderate_weighted_mse_loss(pred, target)
loss.backward()

# Round after .tolist() so float32 conversion noise stays out of the printout.
print(f"AMNL loss : {loss.item():.4f}")
print(f"per-sample weights : {[round(w, 3) for w in (1.0 + torch.clamp(1.0 - target / 125.0, 0, 1.0)).tolist()]}")
print(f"per-sample |grad|  : {[round(g, 3) for g in pred.grad.abs().tolist()]}")
print(f"emphasis ratio     : {(pred.grad[0] / pred.grad[4]).item():.2f}x")
```

Same Asymmetry, Other Domains

Failure-proximity weighting generalises wherever a regression target has “dangerous” and “safe” regimes. Tilt the loss toward the dangerous one.

| Domain | Target | Dangerous regime | AMNL-style weight |
|---|---|---|---|
| RUL prediction (this book) | remaining cycles | RUL near 0 | w = 1 + clip(1 - y/R_max, 0, 1) |
| Battery state-of-health (SoH) | capacity ratio in [0.7, 1] | SoH near 0.8 | w = 1 + clip((0.85 - y)/0.15, 0, 1) |
| Tumour size on follow-up MRI | diameter (mm) | size growing past threshold | w = 1 + sigmoid(steepness · (y − thresh)) |
| Bridge crack length | length (mm) | crack near critical length | w = 1 + clip((y - y_safe)/(y_crit - y_safe), 0, 1) |
| Wildfire fuel-moisture content | % moisture | moisture near ignition threshold | w = 1 + clip((y_thresh - y)/y_thresh, 0, 1) |
| Power-grid frequency deviation | Hz from nominal | deviation near tripping point | w = 1 + clip(abs(y)/y_trip, 0, 1) |
Choose w_max from operational cost ratio. AMNL fixes w_max=2 because the C-MAPSS “late prediction” cost is roughly 2× “early prediction” cost in the operational regime the paper targets. Tune w_max to your domain's actual cost ratio. §14.3 explores why doubling further (w_max=5 or 10) backfires.
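Two rows of the table can be written out as code to show that only the weight function changes between domains. This is a rough sketch: the function names, the sample values, and the crack thresholds y_safe = 2 mm and y_crit = 10 mm are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def soh_weight(soh):
    """Battery row: w = 1 + clip((0.85 - y) / 0.15, 0, 1)."""
    return 1.0 + np.clip((0.85 - soh) / 0.15, 0.0, 1.0)


def crack_weight(length_mm, y_safe, y_crit):
    """Bridge row: w = 1 + clip((y - y_safe) / (y_crit - y_safe), 0, 1)."""
    return 1.0 + np.clip((length_mm - y_safe) / (y_crit - y_safe), 0.0, 1.0)


def weighted_mse(pred, target, weights):
    """Same plain-mean reduction as AMNL; only the weights differ per domain."""
    return float(np.mean(weights * (pred - target) ** 2))


# Hypothetical state-of-health readings, from fresh (1.0) to end of life (0.70).
soh = np.array([1.00, 0.90, 0.85, 0.775, 0.70])
soh_w = soh_weight(soh)

# Hypothetical crack lengths in mm with assumed thresholds.
crack = np.array([1.0, 2.0, 6.0, 10.0, 12.0])
crack_w = crack_weight(crack, y_safe=2.0, y_crit=10.0)

print("SoH weights  :", soh_w.round(3).tolist())
print("crack weights:", crack_w.round(3).tolist())
print("SoH loss for a constant 0.02 error:", weighted_mse(soh + 0.02, soh, soh_w))
```

In both cases the weight is 1 across the safe regime and rises linearly to 2 at the dangerous end, mirroring the RUL schedule with the axis flipped where needed.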

Three Sample-Weighting Pitfalls

Pitfall 1: Using normalised mean Σ(w·sq)/Σw. Cancels the “heavy batch” signal. Stick with the plain .mean() reduction the paper uses.
Pitfall 2: Forgetting the clip floor. Without clamp(min=0), above-cap engines (target > 125) get a NEGATIVE adjustment ⇒ weight LESS than 1: 0.4 at target = 200, zero at target = 250, and negative beyond that, where the loss would reward larger errors. The clip is what makes AMNL safe at the boundary.
Pitfall 3: Adding bias to weights via learnable parameters. AMNL is parameter-free. Making w(RUL) learnable (via uncertainty weighting or similar) is a different method (Kendall et al. 2018) and does not match the paper's plot-able schedule. Keep it analytic.
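Pitfall 2 is easy to reproduce numerically. A minimal sketch, assuming a handful of hypothetical targets that run past the R_max = 125 cap, compares the schedule with and without the clip:

```python
import numpy as np

max_rul = 125.0
# Hypothetical targets: one near failure, one at the cap, three above it.
target = np.array([5.0, 125.0, 200.0, 250.0, 300.0])

# Pitfall: the clip is omitted, so the adjustment goes negative above the cap.
unclipped = 1.0 + (1.0 - target / max_rul)

# Schedule from the text: the clip pins the adjustment to [0, 1].
clipped = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)

for t, u, c in zip(target, unclipped, clipped):
    print(f"target={t:5.0f}  unclipped w={u:5.2f}  clipped w={c:4.2f}")
```

The two schedules agree up to the cap and diverge above it: the unclipped weight drops through zero at twice the cap, while the clipped one stays at the floor of 1.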
The point. One line of NumPy / torch.clamp, no learnable parameters, ~2× more gradient pull on near-failure samples. §14.2 derives the choice of linear (vs exponential / sigmoid) decay; §14.3 argues why w_max=2 (not 5) is the right ceiling; §14.4 wires it into the §11.4 DualTaskModel.

Takeaway

  • Uniform MSE is unbiased per residual but biased per RUL. The C-MAPSS distribution skews toward high RUL ⇒ the gradient-on-shared-trunk is dominated by healthy engines.
  • AMNL weight schedule: $w(y) = 1 + \operatorname{clip}(1 - y/R_{\max}, 0, 1)$. Range [1, 2]. Floor at the cap.
  • Plain .mean() reduction. Paper code uses (w · residual²).mean(), NOT the normalised form.
  • Per-sample gradient. RUL=0 sample pulls $2 \times$ harder than RUL=125 sample for the same residual.
  • Parameter-free. No new learnable weights; just one constant $R_{\max}$.