An ER doctor with 30 patients does not give every patient the same attention. Sprained ankle and chest pain go through the same door, but resources go to chest pain first. The cost of being wrong is not the same on both.
RUL prediction is the same problem in disguise. A 10-cycle residual on an engine with 100 cycles of life left is a minor scheduling adjustment. A 10-cycle residual on an engine with 15 cycles of life left can MISS the failure. Vanilla MSE treats them as identical.
What this section fixes. Chapter 12 showed the regression-vs-classification gradient imbalance between tasks. This chapter introduces a separate problem within the regression task: not all RUL samples deserve the same gradient pull. AMNL reweights each sample by how close it is to failure - so the model learns the part of the trajectory that actually matters.
Where the Uniform Gradient Goes
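Plain MSE is $L_{\text{MSE}} = \frac{1}{B}\sum_{i=1}^{B}(\hat{y}_i - y_i)^2$. The per-sample gradient on the prediction is $\partial L_{\text{MSE}}/\partial \hat{y}_i = (2/B)(\hat{y}_i - y_i)$ - magnitude tracks the residual, NOT the RUL value.
AMNL multiplies each sample by a weight $w_i \in [1, 2]$ that grows as RUL shrinks:
$$w_i = 1 + \operatorname{clip}\!\left(1 - \frac{y_i}{R_{\max}},\, 0,\, 1\right).$$
End-points: $y = 0 \Rightarrow w = 2$, $y = R_{\max} \Rightarrow w = 1$, $y > R_{\max} \Rightarrow w = 1$ (the clip floor handles above-cap engines). Total loss:
$$L_{\text{AMNL}} = \frac{1}{B}\sum_{i=1}^{B} w_i \cdot (\hat{y}_i - y_i)^2.$$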
Plain mean, not normalised average. The paper code (grace/core/weighted_mse.py) uses (w * residual**2).mean() - NOT $\sum_i w_i \cdot \mathrm{sq}_i / \sum_i w_i$, where $\mathrm{sq}_i$ is the squared residual. Why: with plain mean, the loss VALUE goes UP when the batch is heavy on near-failure samples - which is exactly the operational signal we want. The normalised form would cancel it.
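To make the plain-vs-normalised distinction concrete, here is a minimal NumPy sketch. The two batches (`healthy_batch`, `critical_batch`) and the helper names are illustrative assumptions, not taken from the paper code; only the weight formula and the two reductions come from the text above.

```python
import numpy as np


def amnl_weights(target, max_rul=125.0):
    """Weight schedule from the text: 1 + clip(1 - RUL / max_rul, 0, 1)."""
    return 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)


def plain_mean_loss(pred, target, max_rul=125.0):
    """Plain mean of the weighted squared residuals (the form the paper uses)."""
    w = amnl_weights(target, max_rul)
    return float(np.mean(w * (pred - target) ** 2))


def normalised_loss(pred, target, max_rul=125.0):
    """Weighted average: sum(w * sq) / sum(w). Shown only for comparison."""
    w = amnl_weights(target, max_rul)
    return float(np.sum(w * (pred - target) ** 2) / np.sum(w))


# Hypothetical batches: every prediction is 10 cycles late in both.
healthy_batch = np.array([125.0, 125.0, 125.0, 125.0])   # all at the cap, w = 1
critical_batch = np.array([0.0, 0.0, 0.0, 0.0])          # all at failure, w = 2

for name, target in [("healthy", healthy_batch), ("critical", critical_batch)]:
    pred = target + 10.0
    print(f"{name:8s} plain={plain_mean_loss(pred, target):.1f} "
          f"normalised={normalised_loss(pred, target):.1f}")
```

With a constant residual the normalised form returns 100 for both batches, while the plain mean doubles to 200 on the all-critical batch. That doubling is the "heavy batch" signal the plain reduction preserves.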
Operational Cost by RUL Range
Mined from the legacy book's field-experience table. Same residual, very different operational consequences:
| True RUL | 10-cycle residual means | AMNL weight | Operational consequence |
|---|---|---|---|
| 100 cycles | wrong by 10% of remaining life | 1.20 | minor maintenance schedule shift |
| 50 cycles | wrong by 20% of remaining life | 1.60 | operations planning impact |
| 20 cycles | wrong by 50% of remaining life | 1.84 | critical safety concern |
| 5 cycles | wrong by 200% of remaining life | 1.96 | potential failure missed entirely |
| 1 cycle | wrong by 1000% of remaining life | 1.99 | almost certain failure missed |
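The weight and percentage columns follow directly from the formula with $R_{\max} = 125$. A small sketch that recomputes them (the helper name `amnl_weight` is illustrative, not from the paper code):

```python
R_MAX = 125.0  # RUL cap used throughout this section


def amnl_weight(rul, r_max=R_MAX):
    """Scalar form of w = 1 + clip(1 - rul / r_max, 0, 1)."""
    return 1.0 + min(max(1.0 - rul / r_max, 0.0), 1.0)


RESIDUAL = 10  # cycles, the fixed error used in the table

for rul in [100, 50, 20, 5, 1]:
    pct_of_life = 100.0 * RESIDUAL / rul   # residual as % of remaining life
    print(f"RUL={rul:>3}  residual={pct_of_life:>5.0f}% of life  "
          f"weight={amnl_weight(rul):.2f}")
```

Note how slowly the weight moves compared with the relative error: between RUL=100 and RUL=1 the relative error grows 100-fold, while the weight rises only from 1.20 to about 1.99.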
Interactive: Per-RUL Gradient Density
Drag w_max from 1 (uniform MSE) to 5 (extreme emphasis). Drag the red test point along the RUL axis. Background shows the C-MAPSS RUL histogram - notice how few samples land at low RUL, which is exactly why uniform MSE starves that regime.
Try this. Set w_max = 1.0 - the green and grey curves overlap exactly (uniform MSE). Slide w_max to 2.0 (the AMNL paper default) - the green curve tilts left, amplifying the under-represented near-failure samples. Push w_max to 5.0 - §14.3 explains why this is too aggressive and destabilises training.
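The text does not spell out the formula behind the w_max slider. One plausible reading, used here purely as an assumption, is a linear interpolation between 1 at the cap and w_max at RUL=0, which reduces to the paper formula when w_max = 2. The function name `tilted_weights` is hypothetical.

```python
import numpy as np


def tilted_weights(target, w_max=2.0, max_rul=125.0):
    """ASSUMED generalisation: w = 1 + (w_max - 1) * clip(1 - RUL / max_rul, 0, 1).

    At w_max = 2 this is exactly the AMNL schedule from the text;
    at w_max = 1 every weight is 1 (uniform MSE).
    """
    return 1.0 + (w_max - 1.0) * np.clip(1.0 - target / max_rul, 0.0, 1.0)


rul = np.array([0.0, 25.0, 125.0, 200.0])  # failure, near failure, cap, above cap
for w_max in (1.0, 2.0, 5.0):
    print(f"w_max={w_max}: {tilted_weights(rul, w_max).round(2).tolist()}")
```

Under this assumption the weight at the cap and above stays at 1 for every setting, and only the low-RUL end is stretched, which matches the "tilt" described for the slider.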
Python: Uniform vs Weighted from Scratch
Both losses computed analytically in NumPy. Six hand-picked engines, each predicted 10 cycles late - so the residual is constant. Any difference in per-sample gradient comes purely from the weighting.
uniform_mse() and amnl_weighted_mse()
🐍amnl_numpy.py
import numpy as np
NumPy provides the (B,) and matrix arrays for vectorised loss/grad computation. We use np.array, np.mean, np.clip, plus broadcasting and element-wise arithmetic.
EXECUTION STATE
📚 numpy = Library: ndarray, broadcasting, math, random, linear algebra.
def uniform_mse(pred: np.ndarray, target: np.ndarray) -> tuple[float, np.ndarray]:
Standard mean squared error. Treats every sample as equally important - exactly the assumption AMNL will overturn.
EXECUTION STATE
⬇ input: pred = (B,) predicted RUL.
⬇ input: target = (B,) ground-truth RUL.
⬆ returns = (loss: float, grad: (B,)) - both quantities so we can inspect per-sample pulls.
residual = pred - target
Element-wise signed error. Positive = predicted late.
EXECUTION STATE
operator: - = Element-wise tensor subtraction.
⬆ result: residual = (B,) - in our example all entries are +10 (every sample is 10 cycles late).
loss = float(np.mean(residual ** 2))
Mean of squared residuals. With residual=+10 everywhere, loss = 100.
EXECUTION STATE
📚 np.mean(arr) = Reduce-mean. With no axis, returns a 0-D scalar.
operator: ** 2 = Element-wise squaring.
📚 float(x) = 0-D ndarray → Python float.
⬆ result: loss = 100.0 - constant since all residuals are +10.
grad = (2.0 / len(target)) * residual
Analytic ∂L/∂pred = (2/B)·residual for the mean reduction. Per-sample magnitude scales with |residual|, NOT with the sample's position on the RUL axis.
EXECUTION STATE
📚 len(seq) = Python built-in. For (B,) ndarray returns B.
operator: 2.0 / B = Constant outside.
operator: * = Scalar × array broadcast.
⬆ result: grad = [3.333, 3.333, 3.333, 3.333, 3.333, 3.333] - all equal because residual is constant. Uniform MSE gives EVERY sample the same gradient pull, regardless of how close to failure it is.
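def amnl_weighted_mse(pred: np.ndarray, target: np.ndarray, max_rul: float = 125.0) -> tuple[float, np.ndarray, np.ndarray]:
The paper's AMNL formula (paper_ieee_tii/grace/core/weighted_mse.py). Same MSE shape, but each sample is multiplied by w(RUL) before averaging - low-RUL samples get up to 2× the pull.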
EXECUTION STATE
⬇ input: pred = (B,) predicted RUL.
⬇ input: target = (B,) ground-truth RUL.
⬇ input: max_rul = 125.0 = RUL cap from §7.2 - the ‘healthy plateau’. Samples above this get weight 1.0 because the clip floor takes over.
⬆ returns = (loss, weights, grad) - three values so we can plot the weight schedule.
weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)
The exact paper line. Decompose: 1 - target/max_rul ⇒ 1 at RUL=0, 0 at RUL=125, NEGATIVE for RUL > 125. np.clip pins the negatives to 0. Add 1 ⇒ weight is 2 at RUL=0, 1 at RUL=125, 1 above.
EXECUTION STATE
📚 np.clip(arr, a_min, a_max) = Element-wise clip. Returns max(a_min, min(arr, a_max)). Either bound can be None to indicate no clamp on that side.
→ reading = RUL=5 (near failure) gets nearly 2× the weight of RUL=125 (healthy). Above-cap engines (RUL=200) get the floor weight 1.0 - same as RUL=125, NOT extrapolated.
loss = float(np.mean(weights * residual ** 2))
Paper formula: (weights * (pred - target)**2).mean() - PLAIN .mean(), NOT Σ(w·sq)/Σw. The two differ in scale: the .mean() form lets the loss GROW if many samples have low RUL (high weight), preserving the operational signal that the batch is "heavy on critical engines".
EXECUTION STATE
operator: * = Element-wise multiply.
operator: ** 2 = Element-wise square.
📚 np.mean(arr) = Plain mean - sum divided by count, NOT divided by weight sum.
⬆ result: loss = Mean of [1.96·100, 1.84·100, 1.60·100, 1.28·100, 1.00·100, 1.00·100] = mean([196, 184, 160, 128, 100, 100]) = 144.67.
→ vs uniform = Uniform loss = 100. AMNL loss = 144.67. The 44.67-point gap is the "extra emphasis" on near-failure samples - it shows up as a larger total loss AND larger gradients on those samples.
grad = (2.0 / len(target)) * weights * residual
Per-sample gradient: ∂L/∂pred_i = (2/B) · w_i · residual_i. Same formula as uniform MSE except for the per-sample w_i factor.
target = np.array([5, 20, 50, 90, 125, 200], dtype=np.float32)
Six engines spanning the RUL range. The 200 deliberately tests the above-cap behaviour - AMNL should give it the floor weight 1.0, NOT a negative weight.
EXECUTION STATE
📚 np.array(seq, dtype) = Construct an ndarray from a Python sequence. dtype is optional; defaults to inferred.
⬇ arg: seq = Python list of 6 RUL values.
⬇ arg: dtype = np.float32 = Match the model output dtype.
pred = target + np.array([10, 10, 10, 10, 10, 10], dtype=np.float32)
Predict each engine 10 cycles LATE. The constant residual lets us isolate the WEIGHT effect from the RESIDUAL effect - any difference in per-sample gradient comes purely from the AMNL weight.
EXECUTION STATE
⬆ result: pred = [15., 30., 60., 100., 135., 210.]
→ why constant residual? = If we let residuals vary, the gradient comparison would mix two effects. Constant residual isolates the weighting mechanism.
mse_loss, mse_grad = uniform_mse(pred, target)
Run uniform MSE.
EXECUTION STATE
⬆ result: mse_loss = 100.0
⬆ result: mse_grad = [3.333, 3.333, 3.333, 3.333, 3.333, 3.333] - flat across all RUL.
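amnl_loss, amnl_weights, amnl_grad = amnl_weighted_mse(pred, target)
→ reading = Under AMNL, near-failure engines (RUL=5) pull on the model with ~2× the force of healthy engines (RUL=125), even though the residual is identical; under uniform MSE the two pulls are equal. THAT is what AMNL achieves.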
print(f"uniform total loss : {mse_loss:.2f}")
Formatted output. :.2f = float with 2 decimals.
EXECUTION STATE
→ :.2f = Format spec: float, 2 decimals.
Output = uniform total loss : 100.00
print(f"AMNL total loss : {amnl_loss:.2f}")
AMNL total.
EXECUTION STATE
Output = AMNL total loss : 144.67
print(f"emphasis at RUL=5 : {amnl_grad[0] / mse_grad[0]:.2f} x more pull")
Per-sample emphasis ratio.
EXECUTION STATE
Output = emphasis at RUL=5 : 1.96 x more pull
print(f"emphasis at RUL=125: {amnl_grad[4] / mse_grad[4]:.2f} x (=1, no change)")
Floor sanity check.
EXECUTION STATE
Output = emphasis at RUL=125: 1.00 x (=1, no change)
→ invariant = Healthy engines see no change, near-failure engines see up to 2x more pull. Above-cap engines (RUL=200) ALSO see no change - they fall on the floor. AMNL is a TILT, not a re-scale.
```python
import numpy as np


def uniform_mse(pred: np.ndarray, target: np.ndarray) -> tuple[float, np.ndarray]:
    """Standard MSE - average squared residual over the batch.

    Returns (loss, per_sample_grad_on_pred).
    """
    residual = pred - target                  # (B,)
    loss = float(np.mean(residual ** 2))      # scalar
    grad = (2.0 / len(target)) * residual     # (B,)
    return loss, grad


def amnl_weighted_mse(pred: np.ndarray,
                      target: np.ndarray,
                      max_rul: float = 125.0) -> tuple[float, np.ndarray, np.ndarray]:
    """AMNL weighted MSE - paper formula from grace/core/weighted_mse.py.

    w(RUL) = 1.0 + clip(1.0 - RUL / max_rul, 0, 1)
    loss   = mean( w * (pred - target)**2 )

    Returns (loss, per_sample_weight, per_sample_grad_on_pred).
    """
    weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)  # (B,)
    residual = pred - target                                     # (B,)
    loss = float(np.mean(weights * residual ** 2))               # scalar
    grad = (2.0 / len(target)) * weights * residual              # (B,)
    return loss, weights, grad


# ---------- Worked example: 6 engines spanning the RUL range ----------
np.random.seed(0)
target = np.array([5, 20, 50, 90, 125, 200], dtype=np.float32)        # last one above cap
pred = target + np.array([10, 10, 10, 10, 10, 10], dtype=np.float32)  # all 10 cycles late

mse_loss, mse_grad = uniform_mse(pred, target)
amnl_loss, amnl_weights, amnl_grad = amnl_weighted_mse(pred, target)

print("RUL :", target.tolist())
print("|d|=10 everywhere - same residual size for every sample.\n")
print("uniform |grad| :", mse_grad.round(4).tolist())
print("AMNL weights:", amnl_weights.round(3).tolist())
print("AMNL |grad| :", amnl_grad.round(4).tolist())
print()
print(f"uniform total loss : {mse_loss:.2f}")
print(f"AMNL total loss : {amnl_loss:.2f}")
print(f"emphasis at RUL=5 : {amnl_grad[0] / mse_grad[0]:.2f} x more pull")
print(f"emphasis at RUL=125: {amnl_grad[4] / mse_grad[4]:.2f} x (=1, no change)")
```
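The analytic gradient (2/B)·w_i·residual_i can be sanity-checked against a central finite difference. The sketch below is an illustrative check, not part of the paper code; it uses float64 rather than the listing's float32 so that the finite-difference step is not swamped by rounding, and the helper names are hypothetical.

```python
import numpy as np


def amnl_loss(pred, target, max_rul=125.0):
    """Plain-mean AMNL loss, same formula as the listing above."""
    weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)
    return float(np.mean(weights * (pred - target) ** 2))


def analytic_grad(pred, target, max_rul=125.0):
    """Closed-form d(loss)/d(pred_i) = (2/B) * w_i * (pred_i - target_i)."""
    weights = 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)
    return (2.0 / len(target)) * weights * (pred - target)


def numeric_grad(pred, target, eps=1e-4):
    """Central finite difference, perturbing one prediction at a time."""
    grad = np.zeros_like(pred)
    for i in range(len(pred)):
        step = np.zeros_like(pred)
        step[i] = eps
        grad[i] = (amnl_loss(pred + step, target)
                   - amnl_loss(pred - step, target)) / (2.0 * eps)
    return grad


target = np.array([5, 20, 50, 90, 125, 200], dtype=np.float64)
pred = target + 10.0  # every engine predicted 10 cycles late

print("analytic:", analytic_grad(pred, target).round(4).tolist())
print("numeric :", numeric_grad(pred, target).round(4).tolist())
print("match   :", np.allclose(analytic_grad(pred, target),
                               numeric_grad(pred, target), atol=1e-6))
```

Because the loss is quadratic in each prediction, the central difference agrees with the closed form up to floating-point rounding, so a mismatch would point to a bug in the weighting or the reduction.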
PyTorch: The Paper's One-Liner
The exact function from paper_ieee_tii/grace/core/weighted_mse.py. Three lines of math, one .mean() reduction. Drop-in replacement for F.mse_loss.
The exact function signature and body from paper_ieee_tii/grace/core/weighted_mse.py. Pure function; no module wrapper, no learnable parameters. Drop-in replacement for F.mse_loss.
EXECUTION STATE
⬇ input: pred = Predicted RUL of shape (N, 1) or (N,) - the function reshapes internally.
⬇ input: target = Ground-truth RUL of shape (N, 1) or (N,).
⬇ input: max_rul = 125.0 = Paper default. Matches the §7.2 RUL cap.
⬆ returns = 0-D scalar tensor with autograd graph attached.
pred_flat = pred.view(-1)
Reshape to a flat 1-D tensor. .view(-1) is a no-copy reshape; -1 means ‘infer this dimension from the total element count’. Required because some upstream code passes (N, 1) and some passes (N,).
EXECUTION STATE
📚 .view(*shape) = Returns a view of the tensor with the requested shape, sharing storage with the original. Errors if the requested shape is not contiguous-compatible.
⬇ arg: shape = -1 = Single-element shape with -1 ⇒ output is 1-D with as many entries as the input has elements.
→ vs reshape = .view requires contiguous memory; .reshape silently copies if needed. Both work here.
The single line that makes AMNL different from MSE. torch.clamp is the differentiable equivalent of np.clip - gradient is 1 inside the range, 0 outside. Note: target is a constant (no requires_grad), so the gradient through clamp does not matter at training time.
EXECUTION STATE
📚 torch.clamp(input, min, max) = Element-wise clip. Differentiable: gradient is 1 inside the [min, max] range, 0 outside. Same shape as input.
Plain element-wise weighted-square then .mean(). NOT the normalised Σ(w·sq)/Σw form. Differences from the normalised version: (a) plain .mean() lets the loss VALUE rise when the batch is heavy on near-failure samples, signalling that this batch has more emphasis; (b) the gradient of plain .mean() at each sample is exactly w_i × (the standard MSE gradient), which is what we want.
Predict each engine 10 cycles late. NB: pred inherits target's requires_grad=False at this point - we'll flip it in the next line.
EXECUTION STATE
⬆ result: pred = tensor([15., 30., 60., 100., 135., 210.])
pred.requires_grad_()
In-place flip of requires_grad to True. The trailing underscore marks it as in-place. We need this so .backward() can populate pred.grad.
EXECUTION STATE
📚 .requires_grad_(mode=True) = In-place setter for the requires_grad flag. Same as x.requires_grad = True but chainable.
⬇ arg: mode = True (default) = Turn autograd ON for this tensor.
→ effect = Subsequent ops on pred build an autograd graph; .backward() can now flow gradients into pred.grad.
loss = moderate_weighted_mse_loss(pred, target)
Run the AMNL loss.
EXECUTION STATE
⬆ result: loss = tensor(144.6667, grad_fn=<MeanBackward0>)
→ matches NumPy = Same number as the NumPy block - the formula is identical.
loss.backward()
Reverse-mode autograd. Populates pred.grad with d(loss)/d(pred).
EXECUTION STATE
📚 .backward(retain_graph=False) = Reverse-mode autograd through the graph; accumulates grads into all leaves with requires_grad=True. Default frees the graph.
print(f"AMNL loss : {loss.item():.4f}")
.item() pulls the Python float; :.4f formats to 4 decimals.
→ vs NumPy block = With the same six targets and the same plain mean over B=6, autograd yields (2/B)·w_i·residual_i per sample, so pred.grad should match the analytic NumPy gradients rather than differ by a constant factor. The RATIO across samples is what matters.
print(f"emphasis ratio : {(pred.grad[0] / pred.grad[4]).item():.2f}x")
RUL=5 grad / RUL=125 grad = the headline emphasis amplification.
EXECUTION STATE
Output = emphasis ratio : 1.96x
→ reading = Near-failure engines pull on the trunk with ~2x the force of healthy engines for the same residual size. That is the entire point of AMNL.
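The walkthrough above describes the PyTorch function step by step. The following is a minimal sketch assembled from those annotated steps (flatten with .view(-1), build the weights with torch.clamp, reduce with a plain .mean()). It is a reconstruction under those assumptions, not the verbatim contents of paper_ieee_tii/grace/core/weighted_mse.py; in particular, the `target_flat` variable and the keyword form of the clamp call are illustrative choices.

```python
import torch


def moderate_weighted_mse_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               max_rul: float = 125.0) -> torch.Tensor:
    """Sketch of the AMNL loss as described in the walkthrough (not verbatim)."""
    pred_flat = pred.view(-1)        # accept (N, 1) or (N,)
    target_flat = target.view(-1)    # assumed: target is flattened the same way
    # w = 1 + clip(1 - RUL / max_rul, 0, 1): 2 at RUL=0, 1 at and above the cap
    weights = 1.0 + torch.clamp(1.0 - target_flat / max_rul, min=0.0, max=1.0)
    # Plain mean, NOT sum(w * sq) / sum(w)
    return (weights * (pred_flat - target_flat) ** 2).mean()


# Same six engines as the NumPy example, each predicted 10 cycles late.
target = torch.tensor([5.0, 20.0, 50.0, 90.0, 125.0, 200.0])
pred = target + 10.0
pred.requires_grad_()                # make pred a leaf that collects gradients

loss = moderate_weighted_mse_loss(pred, target)
loss.backward()                      # fills pred.grad with d(loss)/d(pred)

print(f"AMNL loss      : {loss.item():.4f}")
print("grad on pred   :", [round(g, 4) for g in pred.grad.tolist()])
print(f"emphasis ratio : {(pred.grad[0] / pred.grad[4]).item():.2f}x")
```

Run as written, this should report a loss of about 144.6667 and an emphasis ratio of 1.96x, the values quoted in the walkthrough, with per-sample gradients equal to (2/6)·w_i·10.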
Choose w_max from operational cost ratio. AMNL fixes w_max=2 because the C-MAPSS “late prediction” cost is roughly 2× “early prediction” cost in the operational regime the paper targets. Tune w_max to your domain's actual cost ratio. §14.3 explores why doubling further (w_max=5 or 10) backfires.
Three Sample-Weighting Pitfalls
Pitfall 1: Using normalised mean Σ(w·sq)/Σw. Cancels the “heavy batch” signal. Stick with the plain .mean() reduction the paper uses.
Pitfall 2: Forgetting the clip floor. Without clamp(min=0), above-cap engines (target > 125) get a NEGATIVE adjustment ⇒ weight LESS than 1, and once target > 2·Rmax = 250 the weight itself turns negative, so minimising the loss would reward larger errors on those engines. The clip is what makes AMNL safe at the boundary.
Pitfall 3: Adding bias to weights via learnable parameters. AMNL is parameter-free. Making w(RUL) learnable (via uncertainty weighting or similar) is a different method (Kendall et al. 2018) and does not match the paper's plot-able schedule. Keep it analytic.
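Pitfall 2 is easy to see numerically. The sketch below compares the clipped schedule with the same formula minus the clip; the function names and the RUL=300 test value are hypothetical, chosen only to show the weight crossing zero.

```python
import numpy as np


def weights_clipped(target, max_rul=125.0):
    """AMNL schedule from the text: 1 + clip(1 - RUL / max_rul, 0, 1)."""
    return 1.0 + np.clip(1.0 - target / max_rul, 0.0, 1.0)


def weights_unclipped(target, max_rul=125.0):
    """Same formula with the clip removed (the pitfall)."""
    return 1.0 + (1.0 - target / max_rul)


# near failure, at the cap, above the cap, far above the cap (hypothetical)
target = np.array([5.0, 125.0, 200.0, 300.0])

print("clipped  :", weights_clipped(target).round(2).tolist())
print("unclipped:", weights_unclipped(target).round(2).tolist())
```

The two schedules agree at and below the cap. Above it, the unclipped weight falls to about 0.4 at RUL=200 and about -0.4 at RUL=300, while the clipped version stays at the floor of 1.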
The point. One line of NumPy / torch.clamp, no learnable parameters, ~2× more gradient pull on near-failure samples. §14.2 derives the choice of linear (vs exponential / sigmoid) decay; §14.3 argues why w_max=2 (not 5) is the right ceiling; §14.4 wires it into the §11.4 DualTaskModel.
Takeaway
Uniform MSE is unbiased per residual but biased per RUL. The C-MAPSS distribution skews toward high RUL ⇒ the gradient-on-shared-trunk is dominated by healthy engines.
AMNL weight schedule: $w(y) = 1 + \operatorname{clip}(1 - y/R_{\max},\, 0,\, 1)$. Range [1, 2]. Floor at the cap.
Plain .mean() reduction. Paper code uses (w · residual²).mean(), NOT the normalised form.
Per-sample gradient. RUL=0 sample pulls 2× harder than RUL=125 sample for the same residual.
Parameter-free. No new learnable weights; just one constant $R_{\max}$.