Chapter 11

Weighted MSE with Linear Decay

Advanced Loss Components

Learning Objectives

By the end of this section, you will:

  1. Understand the motivation for sample-level weighting in RUL prediction
  2. Analyze the linear decay design and its mathematical properties
  3. Compare alternative weighting schemes (exponential, piecewise, adaptive)
  4. Implement robust weighted MSE with numerical stability
  5. Choose appropriate weight bounds for stable training
Why This Matters: Sample weighting is a powerful technique for directing model attention to the most critical prediction regions. In RUL prediction, errors near failure have far greater operational consequences than errors during healthy operation. Linear decay provides an elegant, stable solution to this asymmetric importance problem.

Motivation for Sample Weighting

Not all prediction errors have equal consequences in predictive maintenance.

Operational Cost Analysis

Consider the real-world cost of prediction errors at different RUL values:

True RUL     Error Type   Operational Impact              Cost Level
120 cycles   ±15 cycles   Quarterly planning adjustment   Low
80 cycles    ±15 cycles   Monthly schedule modification   Medium
40 cycles    ±15 cycles   Weekly maintenance urgency      High
15 cycles    ±15 cycles   Potential unplanned failure     Critical

The same absolute error (±15 cycles) has vastly different consequences depending on where in the degradation trajectory it occurs.

The Standard MSE Problem

\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Standard MSE treats all samples equally, but the training data distribution is typically skewed:

```text
Sample distribution by RUL:
  RUL > 100:      ~35% of samples (healthy operation)
  50 < RUL ≤ 100: ~30% of samples (early degradation)
  20 < RUL ≤ 50:  ~20% of samples (late degradation)
  RUL ≤ 20:       ~15% of samples (critical phase)

With equal weighting:
  Model optimizes mostly for healthy/early phases
  Critical phase errors are underweighted
```

The Core Problem

Standard MSE optimizes for the average case. In RUL prediction, we need to optimize for the critical case—samples near failure where accurate prediction matters most.


Weight Function Design

We design a weight function that emphasizes low-RUL samples while maintaining training stability.

Linear Decay Formula

w(y) = w_{\min} + (w_{\max} - w_{\min}) \cdot \left(1 - \frac{\min(y, R_{\max})}{R_{\max}}\right)

Simplified with standard parameters:

w(y) = 2 - \frac{\min(y, 125)}{125}

Where:

  • w_{\min} = 1: Minimum weight (at RUL = R_max)
  • w_{\max} = 2: Maximum weight (at RUL = 0)
  • R_{\max} = 125: RUL cap
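As a sanity check, the general formula with these standard parameters reduces to the simplified form above. A minimal PyTorch sketch (the function name is illustrative):

```python
import torch

def linear_decay_weight(rul: torch.Tensor,
                        w_min: float = 1.0,
                        w_max: float = 2.0,
                        r_max: float = 125.0) -> torch.Tensor:
    """General linear decay: w_max at RUL = 0, w_min at RUL >= r_max."""
    capped = torch.clamp(rul, max=r_max)  # min(y, R_max)
    return w_min + (w_max - w_min) * (1.0 - capped / r_max)

rul = torch.tensor([0.0, 50.0, 62.5, 125.0, 150.0])
print(linear_decay_weight(rul))  # weights: 2.0, 1.6, 1.5, 1.0, 1.0
```

Note the cap in action: RUL = 150 still receives weight 1.0 rather than dropping below w_min.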

Weight Function Properties

Why Cap at R_max?

The min operation in min(y,Rmax)\min(y, R_{\max}) serves two purposes:

  • Consistency with piecewise RUL: Samples with RUL > 125 are in the healthy phase where exact RUL is less meaningful
  • Bounded weights: Prevents weights from becoming negative for very high RUL values

Alternative Weighting Schemes

We evaluated several weighting schemes before selecting linear decay.

Exponential Decay

w_{\text{exp}}(y) = \alpha + (1-\alpha) \cdot \exp\left(-\frac{y}{\tau}\right)

Parameter   Value   Effect
α           0.5     Base weight (long-tail minimum)
τ           50      Decay rate (cycles)
```python
import torch

def exponential_weight(rul: torch.Tensor, alpha: float = 0.5, tau: float = 50.0) -> torch.Tensor:
    """Exponential decay weight function."""
    return alpha + (1 - alpha) * torch.exp(-rul / tau)
```

Problem: Weights change too rapidly near RUL = 0, causing training instability. The gradient magnitude varies dramatically across the RUL range.

Piecewise Constant

w_{\text{piece}}(y) = \begin{cases} 2.0 & \text{if } y \leq 50 \\ 1.5 & \text{if } 50 < y \leq 100 \\ 1.0 & \text{if } y > 100 \end{cases}

Problem: Discontinuous gradients at boundaries. Samples near boundaries (e.g., RUL = 49 vs 51) have sudden weight changes, introducing noise into training.
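The boundary discontinuity is easy to demonstrate numerically. A minimal sketch of the piecewise scheme (the function name is illustrative):

```python
import torch

def piecewise_weight(rul: torch.Tensor) -> torch.Tensor:
    """Piecewise constant weights: 2.0 / 1.5 / 1.0 by RUL band."""
    w = torch.ones_like(rul)   # RUL > 100
    w[rul <= 100] = 1.5        # 50 < RUL <= 100
    w[rul <= 50] = 2.0         # RUL <= 50
    return w

# Two nearly identical samples receive different weights
print(piecewise_weight(torch.tensor([49.0, 51.0])))  # 2.0 vs 1.5
```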

Polynomial (Quadratic) Decay

w_{\text{quad}}(y) = 1 + \left(1 - \frac{y}{R_{\max}}\right)^2

Problem: Too gentle near R_max, too aggressive near 0. The quadratic shape puts insufficient emphasis on mid-range samples (50-100 cycles).
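The mid-range gap shows up directly in the numbers. A quick comparison against linear decay (function names are illustrative):

```python
import torch

def quadratic_weight(rul: torch.Tensor, r_max: float = 125.0) -> torch.Tensor:
    """Quadratic decay: 1 + (1 - min(y, R_max)/R_max)^2."""
    return 1.0 + (1.0 - torch.clamp(rul, max=r_max) / r_max) ** 2

def linear_weight(rul: torch.Tensor, r_max: float = 125.0) -> torch.Tensor:
    """Linear decay: 2 - min(y, R_max)/R_max."""
    return 2.0 - torch.clamp(rul, max=r_max) / r_max

mid_range = torch.tensor([50.0, 75.0, 100.0])
print(quadratic_weight(mid_range))  # ~ 1.36, 1.16, 1.04
print(linear_weight(mid_range))     #   1.60, 1.40, 1.20
```

At RUL = 50 the quadratic scheme assigns 1.36 where linear assigns 1.6, so mid-range samples get noticeably less emphasis.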

Comparison Results

Scheme          FD002 RMSE   Training Stability   Recommendation
Uniform (w=1)   16.8         Very stable          Baseline only
Linear decay    13.9         Stable               Recommended
Exponential     14.6         Unstable             Not recommended
Piecewise       14.3         Moderate             Acceptable
Quadratic       14.1         Stable               Alternative

Design Choice

Linear decay offers the best balance of performance and stability. It is simple to implement, easy to interpret, and performs consistently across all datasets.


Implementation Details

Our research implementation uses a clean, functional approach that achieves the same effect as the class-based version but with minimal code complexity.

AMNL Research Implementation

Moderate Weighted MSE Loss
From enhanced_train_nasa_cmapss_sota_v7.py:

  • Function signature: A simple functional interface taking predictions and targets. No class overhead is needed for this stateless operation.
  • Design philosophy: This represents V7 of our training approach, balancing aggressive exponential weighting (V1-V4) against plain MSE (V5-V6).
  • Weight distribution: Linear decay from 2.0 at RUL = 0 to 1.0 at RUL = 125. Critical samples near failure get 2× emphasis while healthy samples get normal weighting (e.g., RUL = 0 → w = 2.0, RUL = 62.5 → w = 1.5, RUL = 125 → w = 1.0).
  • Weight computation: The expression 1.0 + clamp(1.0 - target/125, 0, 1) maps RUL to weights; the clamp keeps weights in [1.0, 2.0] (e.g., target = 50: 1.0 + clamp(1.0 - 0.4, 0, 1) = 1.0 + 0.6 = 1.6).
  • Weighted MSE: Element-wise multiplication of weights with squared errors, then mean reduction. This is more efficient than the class-based approach.
```python
import torch

def moderate_weighted_mse_loss(pred, target):
    """
    ⭐ Moderate weighted MSE loss

    Balanced weighting - not as aggressive as exponential,
    but not as simple as plain MSE.

    Linear decay provides gentle emphasis on critical RUL:
    - RUL=0: weight=2.0 (2x emphasis)
    - RUL=50: weight=1.6 (1.6x emphasis)
    - RUL=125: weight=1.0 (normal)

    This is the sweet spot between overfitting and underfitting!
    """
    # Linear decay from 2.0 to 1.0 based on RUL
    weights = 1.0 + torch.clamp(1.0 - target / 125.0, 0, 1.0)
    return (weights * (pred - target) ** 2).mean()
```

Simplicity by Design

This functional implementation achieves the same result as the class-based version in just two lines of actual code. In our research, we found that simpler implementations are easier to debug and less prone to subtle bugs.
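For completeness, a self-contained usage sketch (the function body is repeated from the listing above; the example tensors are illustrative):

```python
import torch

def moderate_weighted_mse_loss(pred, target):
    """Weighted MSE with linear decay weights in [1.0, 2.0]."""
    weights = 1.0 + torch.clamp(1.0 - target / 125.0, 0, 1.0)
    return (weights * (pred - target) ** 2).mean()

pred   = torch.tensor([10.0, 60.0, 120.0])
target = torch.tensor([ 0.0, 50.0, 125.0])  # weights: 2.0, 1.6, 1.0

# The first two samples have the same squared error (100), but the
# near-failure sample contributes 2.0 * 100 vs 1.6 * 100 to the sum
loss = moderate_weighted_mse_loss(pred, target)
print(loss.item())  # (200 + 160 + 25) / 3 ≈ 128.33
```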

Weight Visualization

```python
# Visualize weight function
import matplotlib.pyplot as plt
import torch

rul_values = torch.linspace(0, 150, 100)
weights = 1.0 + torch.clamp(1.0 - rul_values / 125.0, 0, 1.0)

plt.figure(figsize=(10, 5))
plt.plot(rul_values.numpy(), weights.numpy(), 'b-', linewidth=2)
plt.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5)
plt.axhline(y=2.0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=125, color='red', linestyle='--', alpha=0.5, label='R_max')
plt.xlabel('True RUL (cycles)')
plt.ylabel('Sample Weight')
plt.title('Linear Decay Weight Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

Summary

In this section, we examined weighted MSE with linear decay:

  1. Motivation: Errors near failure have greater operational consequences
  2. Formula: w = 2 - min(y, 125)/125, giving weights in the range [1, 2]
  3. Properties: Smooth, bounded, interpretable
  4. Alternatives: Linear outperforms exponential, piecewise, quadratic
  5. Implementation: A simple functional loss with clamped weights and mean reduction
Parameter      Recommended Value
R_max          125 cycles
w_min          1.0
w_max          2.0
Weight ratio   2:1 (critical:healthy)
Looking Ahead: Linear decay addresses sample importance but treats over-prediction and under-prediction equally. The next section introduces asymmetric RUL loss that penalizes late predictions (under-estimation) more severely than early predictions.

With weighted MSE understood, we now address the asymmetric nature of RUL prediction errors.