Chapter 14

The Linear-Decay Sample Weight w(RUL)

AMNL — Failure-Biased Weighted MSE

Designing a Schedule From Four Constraints

§14.1 said “weight near-failure samples more.” That is necessary but not sufficient - many curves do that. Pick the wrong one and training oscillates, or worse, the weight schedule introduces its own bias on top of the data bias it was meant to fix.

Four constraints any sensible weight schedule should satisfy:

| # | Constraint | Why |
|---|---|---|
| 1 | monotonic non-increasing in RUL | weight should not flip sign or oscillate as we move toward failure |
| 2 | bounded in [1, 2] | no sample should be IGNORED (w=0) and no sample should DOMINATE (w → ∞) |
| 3 | smooth (continuous + bounded slope) | SGD reads the gradient ∂L/∂pred which carries w. Discontinuous w ⇒ unstable updates |
| 4 | parameter-free apart from R_max | anything else is a hyperparameter that has to be tuned per dataset; the paper avoids tuning |
Why bounded above by 2. A weight ratio of 2 between the most-critical and least-critical samples is enough to re-tilt the gradient density (see the §14.1 viz). Pushing it higher (5×, 10×) makes the loss surface around near-failure samples steep enough that Adam's adaptive step becomes the dominant source of update direction - more on this in §14.3.
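The first three constraints can be checked numerically for any candidate. The sketch below is a minimal illustration in plain Python, not the paper's code: `check_schedule`, its grid, and the `max_slope` tolerance are hypothetical choices introduced here, and R_max = 125 is assumed from the §7.2 cap.

```python
def w_linear(y: float, r_max: float = 125.0) -> float:
    """Linear-decay weight: 1 + clip(1 - y / r_max, 0, 1)."""
    return 1.0 + min(max(1.0 - y / r_max, 0.0), 1.0)


def w_step(y: float, threshold: float = 50.0) -> float:
    """Hard-threshold candidate, used here as a counter-example."""
    return 2.0 if y < threshold else 1.0


def check_schedule(w, y_max: float = 250.0, step: float = 0.5,
                   max_slope: float = 0.05) -> dict:
    """Check constraints 1-3 on a grid (hypothetical helper).

    max_slope is an illustrative bound on |dw/dy| per cycle, not a paper value.
    """
    n = int(y_max / step) + 1
    ys = [i * step for i in range(n)]
    ws = [w(y) for y in ys]
    diffs = [b - a for a, b in zip(ws, ws[1:])]
    return {
        "monotone": all(d <= 1e-12 for d in diffs),          # never increases
        "bounded": all(1.0 <= v <= 2.0 for v in ws),          # stays in [1, 2]
        "smooth": all(abs(d) / step <= max_slope for d in diffs),  # no jumps
    }


print("linear:", check_schedule(w_linear))
print("step  :", check_schedule(w_step))
```

Under these assumptions the linear schedule passes all three checks, while the step schedule fails only the smoothness check because of its jump at the threshold. Constraint 4 (parameter-freeness) is a property of the function signature rather than something a grid check can detect.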

Five Candidates

Five reasonable schedules, ordered by how well they satisfy the four constraints. Only LINEAR satisfies all four.

| Schedule | Formula | Monotone? | Bounded [1, 2]? | Smooth? | Param-free? |
|---|---|---|---|---|---|
| constant | w(y) = w_0 | trivially | only if w_0 ∈ [1, 2] | trivially | no (w_0) |
| linear (paper) | w(y) = 1 + clip(1 - y/R_max, 0, 1) | yes | yes | yes (slope = -1/R_max below the cap, 0 above) | yes |
| exponential | w(y) = 1 + exp(-β · y/R_max) | yes | stays in (1, 2] for y ≥ 0, but reaches the floor 1 only asymptotically | yes (peaky) | no (β) |
| sigmoid | w(y) = 1 + σ(-k · (y - y_mid)) | yes | yes (asymptotic) | yes | no (k, y_mid) |
| step | w(y) = 2 if y < τ else 1 | yes | yes | NO (jump at τ) | no (τ) |
Linear is the only candidate that wins on all four. The paper's closed form $w(y) = 1 + \operatorname{clip}(1 - y / R_{\max}, 0, 1)$ is monotone, bounded, smooth, and parameter-free apart from $R_{\max}$ (which the §7.2 RUL cap already gave us). The above-cap clip is what makes it a TRUE schedule, not just the algebraic identity $2 - y/R_{\max}$.
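Written piecewise, the closed form makes the role of the clip explicit. This is only a restatement of the formula above, assuming $y \ge 0$:

```latex
w(y) = 1 + \operatorname{clip}\!\left(1 - \frac{y}{R_{\max}},\, 0,\, 1\right)
     = \begin{cases}
         2 - \dfrac{y}{R_{\max}}, & 0 \le y \le R_{\max}, \\[4pt]
         1, & y > R_{\max}.
       \end{cases}
```

The first branch is the algebraic identity; the second branch exists only because of the clip.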

Why Linear Wins

On $[0, R_{\max}]$ the schedule is $w(y) = 2 - y/R_{\max}$. The slope is constant: $dw/dy = -1/R_{\max} \approx -0.008$ per cycle for $R_{\max} = 125$. That “flat slope” matters because SGD reads the gradient ∂L/∂pred ∝ w(y), and a constant-slope w gives every cycle the same incremental emphasis. Exponential and sigmoid concentrate emphasis near their peaks, which means SGD's update direction depends on which RUL bin is over-represented in the current batch.
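The proportionality can be written out explicitly. The derivation below assumes the batch-mean weighted MSE $L = \frac{1}{B}\sum_i w(y_i)(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the prediction and $B$ the batch size (notation introduced here for illustration), and treats $w(y_i)$ as a constant because it depends only on the label:

```latex
\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{B}\, w(y_i)\,(\hat{y}_i - y_i),
\qquad
\frac{\partial}{\partial y}\!\left[\frac{2}{B}\, w(y)\, r\right]
  = -\frac{2r}{B\,R_{\max}}
  \quad \text{for a fixed residual } r,\ 0 \le y \le R_{\max}.
```

For a fixed residual, moving one cycle closer to failure therefore increases the gradient magnitude by the same amount everywhere below the cap.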

Above the cap ($y > R_{\max}$) the clip pins the schedule at the floor 1.0, the same weight an engine exactly at the cap receives. This matches the §7.2 design decision that engines “far from failure” are operationally equivalent.

Interactive: Schedule Comparison

Pick a schedule on the right; the green dashed line is the paper choice. Notice how exp front-loads its emphasis as you push β up; how sigmoid creates a cliff; how step is two flat plateaus. Only linear spans the [1, 2] band with a constant slope.

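Try this. Switch to “exp” and slide β to 8 - for RUL ≥ 0 the formula 1 + exp(-β · y/R_max) never exceeds w=2, but the initial slope grows to β/R_max = 0.064 (8× linear) and the weight collapses to about 1.04 by RUL=50, so almost all emphasis is squeezed into the last few cycles. Switch to “sigmoid” and crank k to 1 - the curve becomes nearly a step (large slope at y_mid, near-zero slope elsewhere). Switch to “step” - the slope is infinite at the threshold (the chart shows a vertical line). Only LINEAR keeps the slope bounded and constant across the capped range.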
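For readers without the interactive chart, the β slider can be approximated with a few lines of plain Python. This is a rough stand-in, not the widget's code: it assumes the exponential formula from the candidate table and R_max = 125, and `summarise` is a hypothetical helper.

```python
import math

R_MAX = 125.0  # assumed RUL cap (the §7.2 value)


def w_exponential(y: float, beta: float, r_max: float = R_MAX) -> float:
    """Exponential candidate: 1 + exp(-beta * y / r_max)."""
    return 1.0 + math.exp(-beta * y / r_max)


def summarise(beta: float) -> tuple:
    """Return (max weight on [0, R_MAX], weight at RUL=50, |slope| at RUL=0)."""
    grid = [float(y) for y in range(0, int(R_MAX) + 1)]
    w_max = max(w_exponential(y, beta) for y in grid)
    w_50 = w_exponential(50.0, beta)
    slope0 = beta / R_MAX  # analytic |dw/dy| at y = 0
    return w_max, w_50, slope0


for beta in (1.0, 2.0, 4.0, 8.0):
    w_max, w_50, slope0 = summarise(beta)
    print(f"beta={beta:>3.0f} | max w={w_max:.3f} | "
          f"w(50)={w_50:.3f} | |dw/dy| at 0={slope0:.4f}")
```

For non-negative RUL the maximum weight is always reached at RUL = 0, so raising β changes how quickly the emphasis disappears rather than how high it goes.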

Python: Five Schedules Side by Side

All five candidates implemented as pure functions, then evaluated on five test RUL values. The numerical-derivative helper at the bottom compares the slopes of the linear, exponential, and sigmoid schedules at RUL=20.

w_constant, w_linear, w_exponential, w_sigmoid, w_step
🐍weight_schedules_numpy.py
1import numpy as np

NumPy provides the (B,) RUL arrays and broadcasting we need to implement and compare five candidate weight schedules. We use np.full_like, np.clip, np.exp, np.where, np.array.

EXECUTION STATE
📚 numpy = Library: ndarray + broadcasting + math + linear algebra.
as np = Universal alias.
4def w_constant(rul, w0=1.0) -> np.ndarray:

Uniform weighting - the ‘do nothing’ baseline. Identical to standard MSE up to a constant scale.

EXECUTION STATE
⬇ input: rul = (B,) ground-truth RUL values.
⬇ input: w0 = 1.0 = Constant weight to assign every sample. Default 1.0 makes this exactly standard MSE.
⬆ returns = (B,) weight array, all entries equal to w0.
6return np.full_like(rul, w0)

Allocate an array of the same shape and dtype as rul, every element set to w0.

EXECUTION STATE
📚 np.full_like(arr, fill_value) = Create an array with the same shape and dtype as `arr`, filled with `fill_value`. Faster than np.ones * w0 because it skips a multiply.
⬇ arg 1: arr = rul = (B,) reference array - we copy its shape and dtype.
⬇ arg 2: fill_value = w0 = Scalar to broadcast into every element.
⬆ result = (B,) array of w0 values - e.g. [1, 1, 1, 1, 1].
9def w_linear(rul, max_rul=125.0) -> np.ndarray:

The paper formula. The clip-then-add-1 form is mathematically equivalent to `2 - rul/max_rul` on [0, max_rul] but ALSO defines the floor for above-cap engines.

EXECUTION STATE
⬇ input: rul = (B,) ground-truth RUL values.
⬇ input: max_rul = 125.0 = RUL cap. The §7.2 piecewise-linear cap. Above this the schedule saturates at w=1.0.
⬆ returns = (B,) weights in [1, 2]. Monotone non-increasing in RUL.
14return 1.0 + np.clip(1.0 - rul / max_rul, 0.0, 1.0)

The exact paper line. Read inside out: 1 - rul/max_rul ranges from 1 (at RUL=0) down through 0 (at RUL=max_rul) to negative (above the cap). np.clip pins the negative tail to 0 (the floor). Add 1 ⇒ weight in [1, 2].

EXECUTION STATE
📚 np.clip(arr, a_min, a_max) = Element-wise clip. Returns max(a_min, min(arr, a_max)). Either bound can be None to indicate no clamp.
⬇ arg 1: arr = 1.0 - rul / max_rul = Pre-clip values. Range: 1 (at RUL=0) ⇒ 0 (at RUL=125) ⇒ negative beyond.
⬇ arg 2: a_min = 0.0 = Lower bound. Floors above-cap engines at w=1.0.
⬇ arg 3: a_max = 1.0 = Upper bound. Pre-clip values can never exceed 1 (since RUL ≥ 0), so this is defensive but harmless.
→ end-points = RUL=0 ⇒ 1 + 1 = 2.0 RUL=62.5 ⇒ 1 + 0.5 = 1.5 RUL=125 ⇒ 1 + 0 = 1.0 RUL=200 ⇒ 1 + clip(-0.6, 0, 1) = 1.0
⬆ result on worked example = [1.96, 1.76, 1.52, 1.28, 1.00] ← for RUL=[5, 30, 60, 90, 125]
17def w_exponential(rul, beta=2.0, max_rul=125.0) -> np.ndarray:

Exponential decay alternative. Sharper drop-off near RUL=0 than linear; stays at or below 2 for RUL ≥ 0 (it would exceed 2 only for negative inputs); with a large beta it squeezes nearly all the emphasis into the last few cycles, which can be aggressive enough to destabilise SGD on residual-heavy batches.

EXECUTION STATE
⬇ input: rul = (B,) ground-truth RUL.
⬇ input: beta = 2.0 = Decay rate. Larger ⇒ steeper. Even beta=2 gives w=2.0 at RUL=0 with an initial slope of beta/max_rul = 0.016, 2× the linear slope.
⬇ input: max_rul = 125.0 = Normalisation constant - same as linear.
⬆ returns = (B,) weights in (1, 2] for RUL ≥ 0. The floor is never reached: for any beta > 0, RUL=0 gives w=2.0 and RUL→∞ gives w→1.
19return 1.0 + np.exp(-beta * rul / max_rul)

Element-wise exponential.

EXECUTION STATE
📚 np.exp(arr) = Element-wise e^x.
operator: -beta * rul / max_rul = Negative scaled RUL. At RUL=0 this is 0; at RUL=125 with beta=2 this is -2. exp(0)=1, exp(-2)≈0.135.
operator: 1.0 + = Shift to [1, 2].
→ numerical example = RUL=0 ⇒ 1 + e^0 = 2.000 RUL=30 ⇒ 1 + e^{-0.48} = 1.619 RUL=60 ⇒ 1 + e^{-0.96} = 1.383 RUL=125 ⇒ 1 + e^{-2.0} = 1.135
⬆ result on worked example = [1.923, 1.619, 1.383, 1.237, 1.135]
22def w_sigmoid(rul, k=0.1, y_mid=40.0) -> np.ndarray:

Sigmoid centred at y_mid with steepness k. Has a soft ‘cliff’ near y_mid; saturates toward w=1 (above) and w=2 (below).

EXECUTION STATE
⬇ input: rul = (B,) RUL values.
⬇ input: k = 0.1 = Steepness. Larger k ⇒ sharper transition. k=0.1 gives a soft transition over ~40 cycles; k=1 makes it nearly a step.
⬇ input: y_mid = 40.0 = Centre of the transition. The choice of y_mid is arbitrary - introducing a free hyperparameter is exactly what the paper avoids.
⬆ returns = (B,) weights, smoothly transitioning from ≈1.98 (RUL=0 with the defaults) to 1 (RUL→∞).
27return 1.0 + 1.0 / (1.0 + np.exp(k * (rul - y_mid)))

Sigmoid trick: 1/(1 + exp(z)) where z = k*(rul - y_mid). z > 0 ⇒ output near 0 ⇒ w near 1. z < 0 ⇒ output near 1 ⇒ w near 2.

EXECUTION STATE
→ derivation = We want w high at low RUL, low at high RUL. So we want sigmoid(-k*(rul - y_mid)) = 1/(1+exp(k*(rul - y_mid))) to track that pattern.
⬆ result on worked example = [1.971, 1.731, 1.119, 1.007, 1.000]
→ reading = Notice the dramatic drop between RUL=30 (w=1.73) and RUL=60 (w=1.12). That is the sigmoid's cliff.
30def w_step(rul, threshold=50.0) -> np.ndarray:

Discontinuous step. Non-differentiable at the threshold; the slope dw/dRUL is exactly 0 everywhere else, so the schedule carries no information about how close to failure a sample is - just ‘in the danger zone’ or not.

EXECUTION STATE
⬇ input: rul = (B,) RUL values.
⬇ input: threshold = 50.0 = Boundary cycle count. RUL strictly below the threshold gets w=2; equal-or-above gets w=1.
⬆ returns = (B,) weights ∈ {1, 2}. Two-valued, non-differentiable.
32return np.where(rul < threshold, 2.0, 1.0)

Element-wise ternary - sets weights based on the threshold mask.

EXECUTION STATE
📚 np.where(cond, a, b) = Element-wise ternary - returns a where cond is True, else b.
⬇ arg 1: cond = rul < threshold = Boolean mask of ‘dangerous’ samples.
⬇ arg 2: a = 2.0 = Weight for dangerous samples.
⬇ arg 3: b = 1.0 = Weight for safe samples.
⬆ result on worked example = [2.0, 2.0, 1.0, 1.0, 1.0]
→ reading = RUL=30 (w=2) and RUL=60 (w=1) are treated as completely different even though they're only 30 cycles apart. Linear gives 1.76 vs 1.52 - smooth transition.
36rul = np.array([5, 30, 60, 90, 125], dtype=np.float32)

Five test points spanning the RUL range. Hand-picked to show the schedule shape clearly.

EXECUTION STATE
📚 np.array(seq, dtype) = Construct an ndarray from a Python sequence.
⬇ arg: dtype = np.float32 = Match the model output dtype.
⬆ result: rul = [5., 30., 60., 90., 125.]
38print(f"{'RUL':>4s} | {'const':>5s} | {'linear':>6s} | {'exp':>5s} | {'sigm':>5s} | {'step':>5s}")

Header row for the comparison table. The :>Ns format spec right-aligns each header to width N.

EXECUTION STATE
→ :>4s = Format spec: string, right-aligned, min width 4.
→ other widths = Each column gets its own width to line up with the data printed below.
Output = RUL | const | linear | exp | sigm | step
39for r in rul:

Loop over the five test RUL values. Plain Python iteration over the ndarray.

EXECUTION STATE
iter var: r = One RUL value per iteration: 5.0, 30.0, 60.0, 90.0, 125.0.
LOOP TRACE · 5 iterations
r = 5.0
row = 5 | 1.00 | 1.960 | 1.923 | 1.971 | 2.0
r = 30.0
row = 30 | 1.00 | 1.760 | 1.619 | 1.731 | 2.0
r = 60.0
row = 60 | 1.00 | 1.520 | 1.383 | 1.119 | 1.0
r = 90.0
row = 90 | 1.00 | 1.280 | 1.237 | 1.007 | 1.0
r = 125.0
row = 125 | 1.00 | 1.000 | 1.135 | 1.000 | 1.0
40rs = np.array([r])

Wrap the scalar in a 1-element ndarray so it can be passed to the schedule functions (which expect arrays).

EXECUTION STATE
→ why wrap? = The print line indexes each result with [0]. Passing the bare NumPy scalar r would yield 0-d results, which cannot be indexed that way.
41print(f"{int(r):>4d} | ...

Print one row of the comparison table. {:>4d} = right-aligned int width 4; {:>5.2f} = right-aligned float width 5 with 2 decimals.

EXECUTION STATE
→ :>4d = Integer, right-aligned, min width 4.
→ :>5.2f / :>6.3f = Float, right-aligned, fixed width and decimal places. Used to keep the table columns aligned.
→ indexing [0] = Each schedule returns a (1,) array; we extract the scalar with [0] for printing.
49def slope_at(f, y, eps=0.5):

Numerical derivative via centred finite differences. (f(y+ε) - f(y-ε)) / (2ε). Used to compare smoothness of the candidate schedules at RUL=20 (the dangerous regime).

EXECUTION STATE
⬇ input: f = Schedule function (callable).
⬇ input: y = RUL value at which to compute the slope.
⬇ input: eps = 0.5 = Step size. Half a cycle is small enough for smooth schedules, large enough to avoid float noise.
⬆ returns = Numerical derivative as a Python float.
50return (f(np.array([y + eps]))[0] - f(np.array([y - eps]))[0]) / (2 * eps)

Centred difference - cancels first-order error term, giving O(eps²) accuracy.

EXECUTION STATE
→ centred-diff = (f(y+ε) - f(y-ε)) / (2ε) = f'(y) + O(ε²). One-sided diff (f(y+ε) - f(y))/ε would only be O(ε).
53print(f"|dw/dy| at RUL=20:")

Header for the slope comparison.

EXECUTION STATE
Output = |dw/dy| at RUL=20:
54print(f" linear : {abs(slope_at(w_linear, 20.0)):.5f}")

Linear slope - constant 1/max_rul = 1/125 = 0.008 everywhere on [0, max_rul].

EXECUTION STATE
📚 abs(x) = Built-in absolute value.
Output = linear : 0.00800
→ constant! = Linear has constant |dw/dy|. SGD sees uniform sample-weight gradient pressure across the RUL range.
55print(f" exp : {abs(slope_at(w_exponential, 20.0)):.5f} (often unstable)")

Exponential slope is large at low RUL and decays toward 0 at high RUL. At RUL=20 with beta=2: |dw/dy| = (beta/max_rul) · exp(-0.32) ≈ 0.0116, ~1.45× the linear value.

EXECUTION STATE
Output = exp : 0.01162 (often unstable)
→ why unstable? = Slope is largest very near RUL=0 (0.016 there, 2× linear), so the emphasis changes fastest among samples that are all already critical. With Adam the per-step update at RUL=0 can swamp gradient information from RUL=10.
56print(f" sigmoid: {abs(slope_at(w_sigmoid, 20.0)):.5f} (peaky near y_mid)")

Sigmoid slope peaks at y_mid (= 40 here) and is small elsewhere. At RUL=20 we are still on the ‘rising’ flank.

EXECUTION STATE
Output = sigmoid: 0.01050 (peaky near y_mid)
→ reading = At RUL=20 the slope is ~1.3× the linear value; the peak slope at y_mid is k/4 = 0.025, ~3× the linear slope. Far from y_mid the slope is near 0, so the schedule provides almost no gradient signal over most of the RUL range.
```python
import numpy as np


def w_constant(rul: np.ndarray, w0: float = 1.0) -> np.ndarray:
    """Uniform weighting - the baseline."""
    return np.full_like(rul, w0)


def w_linear(rul: np.ndarray, max_rul: float = 125.0) -> np.ndarray:
    """Paper formula: w(RUL) = 1 + clip(1 - RUL / max_rul, 0, 1).

    Equivalent on [0, max_rul] to 2 - RUL/max_rul.
    Saturates at 1.0 above max_rul (the 'floor').
    """
    return 1.0 + np.clip(1.0 - rul / max_rul, 0.0, 1.0)


def w_exponential(rul: np.ndarray, beta: float = 2.0,
                   max_rul: float = 125.0) -> np.ndarray:
    """Exponential decay: w = 1 + exp(-beta * RUL / max_rul). Aggressive."""
    return 1.0 + np.exp(-beta * rul / max_rul)


def w_sigmoid(rul: np.ndarray, k: float = 0.1,
                y_mid: float = 40.0) -> np.ndarray:
    """Sigmoid around y_mid: w = 1 + sigmoid(-k * (RUL - y_mid)).

    Has a soft 'cliff' near y_mid; flat far from it.
    """
    return 1.0 + 1.0 / (1.0 + np.exp(k * (rul - y_mid)))


def w_step(rul: np.ndarray, threshold: float = 50.0) -> np.ndarray:
    """Hard threshold: w = 2 if RUL < threshold else 1. Non-differentiable at the boundary."""
    return np.where(rul < threshold, 2.0, 1.0)


# ---------- Worked example: 5 RUL values, all 5 schedules ----------
rul = np.array([5, 30, 60, 90, 125], dtype=np.float32)

print(f"{'RUL':>4s} | {'const':>5s} | {'linear':>6s} | {'exp':>5s} | {'sigm':>5s} | {'step':>5s}")
for r in rul:
    rs = np.array([r])  # wrap the scalar so each schedule returns a (1,) array
    print(f"{int(r):>4d} | "
          f"{w_constant(rs)[0]:>5.2f} | "
          f"{w_linear(rs)[0]:>6.3f} | "
          f"{w_exponential(rs)[0]:>5.3f} | "
          f"{w_sigmoid(rs)[0]:>5.3f} | "
          f"{w_step(rs)[0]:>5.1f}")

# Slopes |dw/dy| at the dangerous regime (RUL=20)
def slope_at(f, y, eps=0.5):
    # Centred finite difference: (f(y + eps) - f(y - eps)) / (2 * eps)
    return (f(np.array([y + eps]))[0] - f(np.array([y - eps]))[0]) / (2 * eps)

print()
print(f"|dw/dy| at RUL=20:")
print(f"  linear : {abs(slope_at(w_linear, 20.0)):.5f}")
print(f"  exp    : {abs(slope_at(w_exponential, 20.0)):.5f}  (often unstable)")
print(f"  sigmoid: {abs(slope_at(w_sigmoid, 20.0)):.5f}  (peaky near y_mid)")
```
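A single probe at RUL=20 says little about how uniform a slope is across the range. One way to quantify uniformity, sketched here in plain Python so it runs without NumPy, is the ratio of the largest to the smallest |dw/dy| on a grid below the cap. The scalar re-implementations and the `slope_spread` helper are illustrative additions with hypothetical names; the defaults mirror the listing above.

```python
import math


def w_linear(y: float, max_rul: float = 125.0) -> float:
    return 1.0 + min(max(1.0 - y / max_rul, 0.0), 1.0)


def w_exponential(y: float, beta: float = 2.0, max_rul: float = 125.0) -> float:
    return 1.0 + math.exp(-beta * y / max_rul)


def w_sigmoid(y: float, k: float = 0.1, y_mid: float = 40.0) -> float:
    return 1.0 + 1.0 / (1.0 + math.exp(k * (y - y_mid)))


def slope_at(f, y: float, eps: float = 0.5) -> float:
    """Centred finite difference, same scheme as the listing above."""
    return (f(y + eps) - f(y - eps)) / (2 * eps)


def slope_spread(f, lo: float = 5.0, hi: float = 120.0, step: float = 5.0) -> float:
    """Ratio of largest to smallest |dw/dy| on a grid (hypothetical helper)."""
    n = int((hi - lo) / step) + 1
    slopes = [abs(slope_at(f, lo + i * step)) for i in range(n)]
    return max(slopes) / min(slopes)


for name, fn in [("linear", w_linear), ("exp", w_exponential), ("sigmoid", w_sigmoid)]:
    print(f"{name:<8s} max/min |dw/dy| on [5, 120]: {slope_spread(fn):.2f}")
```

A ratio of 1 means every cycle receives the same incremental emphasis; under these defaults the linear schedule is the only one of the three that achieves it, while the exponential and sigmoid ratios are well above 1.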

PyTorch: Paper Form + Ablation Hooks

The paper's exact linear_decay_weight plus a make_amnl closure factory that lets you swap the schedule with one line - perfect for ablations. The smoke test runs all three (linear / exp / step) on identical predictions.

linear_decay_weight() + make_amnl(weight_fn) closure
🐍schedule_ablation_torch.py
1import torch

Top-level PyTorch.

EXECUTION STATE
📚 torch = Tensor library + autograd + nn modules + optim.
2import torch.nn as nn

Module containers - imported for convention.

5def linear_decay_weight(target, max_rul=125.0) -> torch.Tensor:

The single line of math that defines AMNL's sample weight - paper-canonical. Pure function; no learnable parameters; no module wrapper.

EXECUTION STATE
⬇ input: target = (B,) ground-truth RUL.
⬇ input: max_rul = 125.0 = Paper default.
⬆ returns = (B,) weights in [1.0, 2.0]. Differentiable through autograd if target ever has requires_grad=True (rare for labels).
11return 1.0 + torch.clamp(1.0 - target / max_rul, 0.0, 1.0)

Element-wise. Same algebra as the NumPy version, identical numerical output. Differentiable via torch.clamp (gradient is 1 in-range, 0 out-of-range).

EXECUTION STATE
📚 torch.clamp(input, min, max) = Element-wise clip. Differentiable: gradient is 1 inside [min, max], 0 outside.
⬇ arg 1: input = 1.0 - target / max_rul = Pre-clip values.
⬇ arg 2: min = 0.0 = Lower bound. Above-cap engines floor at w=1.
⬇ arg 3: max = 1.0 = Upper bound. Defensive.
⬆ result on worked example = tensor([1.96, 1.76, 1.52, 1.28, 1.00])
15def make_amnl(weight_fn) -> "callable":

Higher-order function: takes a schedule (any callable that maps target → weight) and returns a CLOSURE that computes the AMNL loss using that schedule. Perfect for ablations.

EXECUTION STATE
⬇ input: weight_fn = Callable. Signature: target tensor → weight tensor of same shape.
⬆ returns = Callable. Signature: (pred, target) → 0-D scalar loss tensor with autograd graph.
17def loss(pred, target) -> torch.Tensor:

Inner closure. Captures weight_fn from the outer scope.

18w = weight_fn(target)

Compute per-sample weights. The outer-scope weight_fn determines which schedule.

EXECUTION STATE
→ closure capture = Python remembers weight_fn from the outer make_amnl call. Each returned `loss` function uses its own weight_fn.
⬆ result: w = (B,) tensor of weights, same shape and dtype as target.
19return (w * (pred - target) ** 2).mean()

Weighted MSE - paper formula with the active weight schedule.

EXECUTION STATE
operator: - = Element-wise tensor subtraction.
operator: ** 2 = Element-wise square.
operator: * = Element-wise multiply.
📚 .mean() = Reduce-mean over all elements. With no dim, returns a 0-D scalar.
⬆ result = 0-D tensor with autograd graph.
20return loss

Hand back the closure. Each call to make_amnl produces a NEW loss function with its own captured weight_fn.

24sched_linear = linear_decay_weight

First-class function reference - no call here, just naming the function for the ablation dict.

25sched_exp = lambda y, max_rul=125.0, beta=2.0: 1.0 + torch.exp(-beta * y / max_rul)

Inline lambda for the exponential schedule. Anonymous function; same signature as linear_decay_weight (target → weight).

EXECUTION STATE
📚 lambda = Python keyword for an anonymous function. lambda x: x+1 is equivalent to def f(x): return x+1.
📚 torch.exp(t) = Element-wise e^x.
→ at y=0 = 1 + e^0 = 2.0 (max emphasis).
→ at y=125 = 1 + e^{-2} ≈ 1.135 (NOT 1.0; the exponential never quite reaches the floor).
26sched_step = lambda y, threshold=50.0: torch.where(y < threshold, 2.0, 1.0)

Step schedule. `torch.where` is an element-wise ternary; the resulting weight is piecewise constant in y, so its derivative with respect to y is 0 away from the threshold and undefined at the jump y == threshold.

EXECUTION STATE
📚 torch.where(cond, a, b) = Element-wise ternary - returns a where cond is True, else b.
⬇ arg 1: cond = y < threshold = Boolean tensor mask.
⬇ arg 2: a = 2.0 = Weight for dangerous samples.
⬇ arg 3: b = 1.0 = Weight for safe samples.
28losses = { ... }

Dict of ablation candidates. Same forward signature ⇒ drop-in interchangeable.

29"linear (paper)": make_amnl(sched_linear),

Paper-canonical schedule wrapped in the AMNL loss.

30"exponential": make_amnl(sched_exp),

Exponential ablation.

31"step": make_amnl(sched_step),

Step ablation.

36torch.manual_seed(0)

Repro.

EXECUTION STATE
📚 torch.manual_seed(s) = Set the global PyTorch PRNG.
⬇ arg: s = 0 = Conventional canonical seed.
37target = torch.tensor([5.0, 30.0, 60.0, 90.0, 125.0])

Five test points spanning the RUL range.

EXECUTION STATE
📚 torch.tensor(seq) = Construct a new tensor from a Python sequence. Default float32.
⬆ result: target = tensor([5., 30., 60., 90., 125.])
38pred = target + 10.0

Constant residual = +10 across the batch. Isolates the schedule effect from the residual effect.

EXECUTION STATE
⬆ result: pred = tensor([15., 40., 70., 100., 135.])
39pred.requires_grad_()

In-place flip to True so .backward() can populate pred.grad.

EXECUTION STATE
📚 .requires_grad_(mode=True) = In-place setter for the requires_grad flag. Trailing underscore = in-place op.
⬇ arg: mode = True (default) = Turn autograd ON for this tensor.
41print(f"{'schedule':<18s} | loss | |grad@RUL=5| | |grad@RUL=125|")

Header row. :<18s = left-aligned string width 18.

EXECUTION STATE
Output = schedule | loss | |grad@RUL=5| | |grad@RUL=125|
42for name, loss_fn in losses.items():

Iterate the ablation dict.

EXECUTION STATE
📚 dict.items() = View of (key, value) pairs. Stable iteration order in Python 3.7+.
LOOP TRACE · 3 iterations
linear (paper)
expected = loss=150.40 | g[0]=7.8400 | g[4]=4.0000 (linear: w[0]/w[4] = 1.96)
exponential
expected = loss=145.94 | g[0]=7.6925 | g[4]=4.5413 (exp: w[0]/w[4] = 1.69, never reaches floor)
step
expected = loss=140.00 | g[0]=8.0000 | g[4]=4.0000 (step: w[0]/w[4] = 2.0 - close to the paper schedule at the extremes but jumps mid-range)
43pred.grad = None

Reset the gradient to None before the next backward pass. Here this has the same effect as zeroing it in place with `pred.grad.zero_()`, but it skips writing zeros into the buffer and also works on the first iteration, when pred.grad is still None.

EXECUTION STATE
→ why None? = PyTorch ACCUMULATES into .grad by default. Without this reset, the second iteration would add to the first's gradient and we'd see compounded numbers.
44loss = loss_fn(pred, target)

Run the active schedule's loss.

EXECUTION STATE
⬆ result: loss = 0-D tensor, e.g. a value of 150.4 for linear.
45loss.backward()

Reverse-mode autograd populates pred.grad with d(loss)/d(pred).

EXECUTION STATE
📚 .backward(retain_graph=False) = Accumulates grads into all leaves with requires_grad=True.
46g = pred.grad.abs()

Element-wise absolute value of the per-sample gradient.

EXECUTION STATE
📚 .abs() = Element-wise |x|.
47print(f"{name:<18s} | {loss.item():>9.4f} | {g[0].item():>11.4f} | {g[4].item():>11.4f}")

Format-string row. .item() pulls Python floats; the format specs keep columns aligned.

EXECUTION STATE
→ :<18s = Left-aligned string, min width 18.
→ :>9.4f = Right-aligned float, width 9, 4 decimals.
→ :>11.4f = Right-aligned float, width 11, 4 decimals.
Output = linear (paper) | 150.4000 | 7.8400 | 4.0000 exponential | 145.9411 | 7.6925 | 4.5413 step | 140.0000 | 8.0000 | 4.0000
→ reading = Paper linear: clean 1.96× emphasis. Exponential: smaller emphasis (1.69×) AND never reaches w=1.0 floor at RUL=125 - it keeps adding noise to healthy samples. Step: matches linear at the extremes but creates a gradient discontinuity at threshold=50.
```python
import torch
import torch.nn as nn


def linear_decay_weight(target: torch.Tensor, max_rul: float = 125.0) -> torch.Tensor:
    """Paper-canonical linear decay weight. Pure function, no learnable params.

    Source: paper_ieee_tii/grace/core/weighted_mse.py
        w(RUL) = 1.0 + clamp(1.0 - RUL / max_rul, 0, 1.0)
    """
    return 1.0 + torch.clamp(1.0 - target / max_rul, 0.0, 1.0)


# ---------- Ablation harness: run AMNL on three different schedules ----------
def make_amnl(weight_fn) -> "callable":
    """Return a closure that computes weighted MSE under the given schedule."""
    def loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        w = weight_fn(target)
        return (w * (pred - target) ** 2).mean()
    return loss


# Three candidate schedules (paper + two ablation alternatives)
sched_linear = linear_decay_weight
sched_exp     = lambda y, max_rul=125.0, beta=2.0:  1.0 + torch.exp(-beta * y / max_rul)
sched_step    = lambda y, threshold=50.0:           torch.where(y < threshold, 2.0, 1.0)

losses = {
    "linear (paper)": make_amnl(sched_linear),
    "exponential":    make_amnl(sched_exp),
    "step":           make_amnl(sched_step),
}


# ---------- Smoke test on identical predictions ----------
torch.manual_seed(0)
target = torch.tensor([5.0, 30.0, 60.0, 90.0, 125.0])
pred   = target + 10.0
pred.requires_grad_()

print(f"{'schedule':<18s} | loss      | |grad@RUL=5| | |grad@RUL=125|")
for name, loss_fn in losses.items():
    pred.grad = None  # clear the gradient accumulated by the previous schedule
    loss = loss_fn(pred, target)
    loss.backward()
    g = pred.grad.abs()
    print(f"{name:<18s} | {loss.item():>9.4f} | {g[0].item():>11.4f}  | {g[4].item():>11.4f}")
```
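The smoke-test numbers can be cross-checked by hand, because the gradient of a batch-mean weighted MSE with respect to each prediction is 2 · w(y_i) · (pred_i - y_i) / B. The sketch below redoes the computation in plain Python without autograd; it is an illustrative check rather than part of the paper code, and `weighted_mse` and `analytic_grad` are hypothetical names.

```python
import math


def sched_linear(y: float, max_rul: float = 125.0) -> float:
    return 1.0 + min(max(1.0 - y / max_rul, 0.0), 1.0)


def sched_exp(y: float, max_rul: float = 125.0, beta: float = 2.0) -> float:
    return 1.0 + math.exp(-beta * y / max_rul)


def sched_step(y: float, threshold: float = 50.0) -> float:
    return 2.0 if y < threshold else 1.0


def weighted_mse(pred, target, weight_fn) -> float:
    """Batch mean of w(y) * (pred - y)^2."""
    return sum(weight_fn(t) * (p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def analytic_grad(pred, target, weight_fn) -> list:
    """d(loss)/d(pred_i) = 2 * w(y_i) * (pred_i - y_i) / B."""
    b = len(target)
    return [2.0 * weight_fn(t) * (p - t) / b for p, t in zip(pred, target)]


target = [5.0, 30.0, 60.0, 90.0, 125.0]
pred = [t + 10.0 for t in target]  # constant residual of +10

for name, fn in [("linear (paper)", sched_linear),
                 ("exponential", sched_exp),
                 ("step", sched_step)]:
    loss = weighted_mse(pred, target, fn)
    g = analytic_grad(pred, target, fn)
    print(f"{name:<18s} | {loss:>9.4f} | {g[0]:>11.4f} | {g[4]:>11.4f}")
```

With a constant residual of 10 and B = 5, the loss is 100 times the mean weight and each gradient is 4 times the sample's weight, so the linear schedule gives a loss of 150.4 with gradients 7.84 at RUL=5 and 4.00 at RUL=125.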

Linear Decay Beyond C-MAPSS

The four constraints are not RUL-specific. Anywhere a regression target has “safe” and “dangerous” regimes with a known reference boundary $R_{\max}$, linear decay with $w(y) = 1 + \operatorname{clip}((R_{\max} - y) / R_{\max}, 0, 1)$ transfers cleanly.

| Domain | Target y | Reference R_max | Schedule call |
|---|---|---|---|
| RUL prediction (this book) | remaining cycles | 125 cycles (§7.2 cap) | linear_decay_weight(y, max_rul=125) |
| Battery state-of-health | capacity ratio | 1.0 (fresh) | linear_decay_weight(y, max_rul=1.0) |
| Tumour size on follow-up MRI | diameter (mm) | 20 mm (clinical thresh) | linear_decay_weight(20 - y, max_rul=20) |
| Bridge crack length | mm | max safe length | linear_decay_weight(L_safe - y, max_rul=L_safe) |
| Wildfire fuel-moisture | % moisture | ignition threshold | linear_decay_weight(y - thresh, max_rul=thresh) |
| Power-grid frequency deviation | absolute deviation from nominal (Hz) | trip threshold | linear_decay_weight(trip - y, max_rul=trip) |
The trick. If your dangerous regime is at HIGH y instead of LOW y (e.g. tumour size, crack length), pass $R_{\max} - y$ instead of $y$. The clip keeps the result in [1, 2] either way: the weight rises from 1 at y = 0 to 2 at y = R_max and saturates at 2 beyond it.
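A minimal sketch of the flipped call, using the tumour-diameter row as the example. It is written in plain Python for portability; `weight_high_is_dangerous` is a hypothetical wrapper name, and the 20 mm threshold is taken from the table rather than from any clinical source.

```python
def linear_decay_weight(y: float, max_rul: float) -> float:
    """Scalar version of the schedule: 1 + clip(1 - y / max_rul, 0, 1)."""
    return 1.0 + min(max(1.0 - y / max_rul, 0.0), 1.0)


def weight_high_is_dangerous(y: float, r_max: float) -> float:
    """Hypothetical wrapper: emphasis grows as y approaches r_max from below."""
    return linear_decay_weight(r_max - y, max_rul=r_max)


for diameter_mm in (0.0, 5.0, 10.0, 20.0, 30.0):
    w = weight_high_is_dangerous(diameter_mm, r_max=20.0)
    print(f"diameter={diameter_mm:>5.1f} mm -> w={w:.2f}")
```

In the flipped form the upper clip bound does real work: for y above R_max the pre-clip value exceeds 1, and the clip is what holds the weight at 2.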

Three Schedule-Design Pitfalls

Pitfall 1: Skipping the clip. Without clip(min=0), above-cap engines (y > R_max) get weights below the floor of 1: the weight hits 0 at y = 2·R_max and turns NEGATIVE beyond that. With a negative weight the optimiser would INCREASE the residual on those healthy engines. The clip is what makes the schedule operationally meaningful.
Pitfall 2: Tuning w_max as a hyperparameter. The paper sets w_max=2.0 and never tunes it. Treating w_max as a hyperparameter creates a false comparison axis and invites overfitting to the validation set. Pick once based on operational cost ratio, then never touch it.
Pitfall 3: Mixing schedule choice with loss-weight choice. The schedule decides per-SAMPLE weight; the AMNL constant lam_hs decides per-TASK weight. They are orthogonal. Tuning one to compensate for the wrong choice of the other produces a model that looks fine on val but fails in production.
The point. Linear decay is not the only schedule that emphasises near-failure samples - but it is the only one that does so without introducing a hyperparameter or destabilising training. §14.3 nails down why the ceiling sits at 2.0 specifically; §14.4 wires the whole thing into the §11.4 DualTaskModel.
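Pitfall 1 is easy to see numerically. The sketch below compares the unclipped identity 2 - y/R_max with the clipped schedule at a few above-cap values, assuming R_max = 125; both function names are illustrative.

```python
def w_unclipped(y: float, r_max: float = 125.0) -> float:
    """Algebraic identity only: valid on [0, r_max], wrong above it."""
    return 2.0 - y / r_max


def w_clipped(y: float, r_max: float = 125.0) -> float:
    """Paper form: 1 + clip(1 - y / r_max, 0, 1)."""
    return 1.0 + min(max(1.0 - y / r_max, 0.0), 1.0)


for y in (100.0, 125.0, 200.0, 250.0, 300.0):
    print(f"y={y:>5.0f} | unclipped={w_unclipped(y):>5.2f} | clipped={w_clipped(y):.2f}")
```

Below the cap the two agree. Above it, the unclipped weight falls under 1, reaches 0 at y = 250, and is negative at y = 300, while the clipped weight stays at the floor of 1.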

Takeaway

  • Four constraints. Monotonic, bounded, smooth, parameter-free. Linear is the only candidate that wins on all four.
  • Closed form. $w(y) = 1 + \operatorname{clip}(1 - y/R_{\max}, 0, 1)$ - paper code, one line.
  • Constant slope. $dw/dy = -1/R_{\max}$ on $[0, R_{\max}]$. Same incremental emphasis at every cycle.
  • Floor above the cap. The clip turns the algebraic identity into a true schedule.
  • Domain-agnostic. Same formula transfers to battery SoH, tumour diameter, crack length, wildfire moisture - pass R_max - y in place of y if your dangerous regime is at high y.