AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the motivation for sample-level weighting in RUL prediction
Analyze the linear decay design and its mathematical properties
Compare alternative weighting schemes (exponential, piecewise, adaptive)
Implement robust weighted MSE with numerical stability
Choose appropriate weight bounds for stable training

Why This Matters: Sample weighting is a powerful technique for directing model attention to the most critical prediction regions. In RUL prediction, errors near failure have far greater operational consequences than errors during healthy operation. Linear decay provides an elegant, stable solution to this asymmetric importance problem.

Motivation for Sample Weighting

Not all prediction errors have equal consequences in predictive maintenance.

Operational Cost Analysis

Consider the real-world cost of prediction errors at different RUL values:

True RUL	Error Type	Operational Impact	Cost Level
120 cycles	±15 cycles	Quarterly planning adjustment	Low
80 cycles	±15 cycles	Monthly schedule modification	Medium
40 cycles	±15 cycles	Weekly maintenance urgency	High
15 cycles	±15 cycles	Potential unplanned failure	Critical

The same absolute error (±15 cycles) has vastly different consequences depending on where in the degradation trajectory it occurs.

The Standard MSE Problem

\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Standard MSE treats all samples equally, but the training data distribution is typically skewed:

📝text

Sample distribution by RUL:
  RUL > 100:      ~35% of samples (healthy operation)
  50 < RUL ≤ 100: ~30% of samples (early degradation)
  20 < RUL ≤ 50:  ~20% of samples (late degradation)
  RUL ≤ 20:       ~15% of samples (critical phase)

With equal weighting:
  Model optimizes mostly for healthy/early phases
  Critical phase errors are underweighted

The Core Problem

Standard MSE optimizes for the average case. In RUL prediction, we need to optimize for the critical case—samples near failure where accurate prediction matters most.
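To make the imbalance concrete, the sketch below estimates how much of the total loss each RUL band contributes. It assumes, purely for illustration, that every sample has the same squared error and that each band can be represented by a single RUL value (125, 75, 35, and 10 cycles; these representative values are assumptions, not taken from the data). The linear weight used for comparison is the one defined in the next subsection, and the names `BANDS`, `linear_weight`, and `loss_shares` are hypothetical.

```python
# Sample fractions come from the distribution above; the representative
# RUL per band is an illustrative assumption.
BANDS = [
    ("healthy (RUL > 100)", 0.35, 125.0),
    ("early degradation", 0.30, 75.0),
    ("late degradation", 0.20, 35.0),
    ("critical (RUL <= 20)", 0.15, 10.0),
]


def linear_weight(rul, r_max=125.0):
    """Linear decay weight: 2.0 at RUL = 0, 1.0 at RUL >= r_max."""
    return 2.0 - min(rul, r_max) / r_max


def loss_shares(weight_fn):
    """Share of the total loss per band, assuming equal squared error per sample."""
    raw = [fraction * weight_fn(rul) for _, fraction, rul in BANDS]
    total = sum(raw)
    return [r / total for r in raw]


uniform = loss_shares(lambda rul: 1.0)
weighted = loss_shares(linear_weight)

for (name, _, _), u, w in zip(BANDS, uniform, weighted):
    print(f"{name:22s} uniform={u:.1%}  linear decay={w:.1%}")
```

Under these assumptions the critical band's share of the loss rises from 15% to roughly 20%, while the healthy band's share falls from 35% to about 25%.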

Weight Function Design

We design a weight function that emphasizes low-RUL samples while maintaining training stability.

Linear Decay Formula

w(y) = w_{\min} + (w_{\max} - w_{\min}) \cdot \left(1 - \frac{\min(y, R_{\max})}{R_{\max}}\right)

Simplified with standard parameters:

w(y) = 2 - \frac{\min(y, 125)}{125}

Where:

$w_{\min} = 1$ : Minimum weight (at RUL = R_max)
$w_{\max} = 2$ : Maximum weight (at RUL = 0)
$R_{\max} = 125$ : RUL cap
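As a quick check that the simplified form agrees with the general formula, here is a minimal plain-Python sketch. The function names `linear_decay_weight` and `simplified_weight` and their keyword arguments are hypothetical, chosen to mirror the symbols above.

```python
def linear_decay_weight(y, w_min=1.0, w_max=2.0, r_max=125.0):
    """General linear decay: w_max at RUL = 0, w_min at RUL >= r_max."""
    capped = min(y, r_max)
    return w_min + (w_max - w_min) * (1.0 - capped / r_max)


def simplified_weight(y):
    """Simplified form with w_min = 1, w_max = 2, R_max = 125."""
    return 2.0 - min(y, 125.0) / 125.0


for rul in (0.0, 50.0, 62.5, 125.0, 150.0):
    general = linear_decay_weight(rul)
    # Both forms should agree up to floating-point rounding.
    assert abs(general - simplified_weight(rul)) < 1e-12
    print(f"RUL={rul:6.1f}  w={general:.3f}")
```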

Weight Function Properties

Why Cap at R_max?

The min operation in $\min(y, R_{\max})$ serves two purposes:

Consistency with piecewise RUL: Samples with RUL > 125 are in the healthy phase where exact RUL is less meaningful
Bounded weights: Keeps the weight at $w_{\min}$ for RUL above $R_{\max}$. Without the cap, $2 - y/125$ would fall below 1 for RUL > 125 and become negative for RUL > 250
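The effect of the cap can be seen by evaluating the weight with and without the min operation. This is a small illustrative sketch; `uncapped_weight` and `capped_weight` are hypothetical helper names.

```python
R_MAX = 125.0


def uncapped_weight(y):
    """Linear decay without the min(): keeps decreasing past R_MAX."""
    return 2.0 - y / R_MAX


def capped_weight(y):
    """Linear decay with the cap, as in the formula above."""
    return 2.0 - min(y, R_MAX) / R_MAX


for rul in (125.0, 150.0, 250.0, 300.0):
    print(
        f"RUL={rul:5.1f}  uncapped={uncapped_weight(rul):+.2f}  "
        f"capped={capped_weight(rul):+.2f}"
    )
```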

Alternative Weighting Schemes

We evaluated several weighting schemes before selecting linear decay.

Exponential Decay

w_{\text{exp}}(y) = \alpha + (1-\alpha) \cdot \exp\left(-\frac{y}{\tau}\right)

Parameter	Value	Effect
α	0.5	Base weight (long-tail minimum)
τ	50	Decay rate (cycles)

🐍python

import torch

def exponential_weight(rul: torch.Tensor, alpha: float = 0.5, tau: float = 50.0) -> torch.Tensor:
    """Exponential decay weight: 1.0 at RUL = 0, approaching alpha for large RUL."""
    return alpha + (1 - alpha) * torch.exp(-rul / tau)

Problem: Weights change too rapidly near RUL = 0, causing training instability. The gradient magnitude varies dramatically across the RUL range.
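One way to see the uneven behaviour is to compare how fast each weight changes per cycle. The sketch below uses the closed-form derivative of the exponential weight with the parameters from the table and compares it with the constant slope of a linear decay from 2 to 1 over 125 cycles. It only looks at the slope of the weight function, which is a rough proxy and not a measurement of training instability; the helper names are hypothetical.

```python
import math

ALPHA, TAU, R_MAX = 0.5, 50.0, 125.0


def exp_weight(y):
    """Exponential decay weight from the formula above."""
    return ALPHA + (1.0 - ALPHA) * math.exp(-y / TAU)


def exp_weight_slope(y):
    """d/dy of the exponential weight: -(1 - alpha) / tau * exp(-y / tau)."""
    return -(1.0 - ALPHA) / TAU * math.exp(-y / TAU)


LINEAR_SLOPE = -1.0 / R_MAX  # constant slope of w(y) = 2 - y / 125 on [0, 125]

for rul in (0.0, 25.0, 50.0, 125.0):
    print(
        f"RUL={rul:5.1f}  w_exp={exp_weight(rul):.3f}  "
        f"slope_exp={exp_weight_slope(rul):+.5f}  slope_lin={LINEAR_SLOPE:+.5f}"
    )

# Ratio of the exponential slope at RUL = 0 to the slope at RUL = R_MAX
slope_ratio = exp_weight_slope(0.0) / exp_weight_slope(R_MAX)
print(f"slope ratio (RUL=0 vs RUL=125): {slope_ratio:.1f}")
```

With τ = 50 the slope at RUL = 0 is exp(2.5) ≈ 12 times the slope at RUL = 125, whereas the linear scheme changes at the same rate everywhere below the cap.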

Piecewise Constant

w_{\text{piece}}(y) = \begin{cases} 2.0 & \text{if } y \leq 50 \\ 1.5 & \text{if } 50 < y \leq 100 \\ 1.0 & \text{if } y > 100 \end{cases}

Problem: Discontinuous gradients at boundaries. Samples near boundaries (e.g., RUL = 49 vs 51) have sudden weight changes, introducing noise into training.
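The size of the jump is easy to quantify. The following sketch, with hypothetical helper names, compares the weight change across each boundary for the piecewise scheme and for linear decay over the same two-cycle interval.

```python
def piecewise_weight(y):
    """Piecewise-constant weight from the definition above."""
    if y <= 50:
        return 2.0
    if y <= 100:
        return 1.5
    return 1.0


def linear_weight(y, r_max=125.0):
    """Linear decay weight used for comparison."""
    return 2.0 - min(y, r_max) / r_max


for low, high in ((49.0, 51.0), (99.0, 101.0)):
    drop_piecewise = piecewise_weight(low) - piecewise_weight(high)
    drop_linear = linear_weight(low) - linear_weight(high)
    print(
        f"RUL {low:.0f} -> {high:.0f}: "
        f"piecewise drop={drop_piecewise:.3f}, linear drop={drop_linear:.3f}"
    )
```

Two samples that differ by only two cycles see their weight change by 0.5 under the piecewise scheme, against 0.016 under linear decay.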

Polynomial (Quadratic) Decay

w_{\text{quad}}(y) = 1 + \left(1 - \frac{y}{R_{\max}}\right)^2

Problem: Too gentle near R_max, too aggressive near 0. The quadratic shape puts insufficient emphasis on mid-range samples (50-100 cycles).
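A side-by-side evaluation shows where the two curves diverge. This is an illustrative sketch with hypothetical helper names; the quadratic weight is only evaluated on [0, R_max], since the formula above has no cap.

```python
R_MAX = 125.0


def quadratic_weight(y):
    """Quadratic decay weight from the formula above (y assumed in [0, R_MAX])."""
    return 1.0 + (1.0 - y / R_MAX) ** 2


def linear_weight(y):
    """Linear decay weight used for comparison."""
    return 2.0 - min(y, R_MAX) / R_MAX


for rul in (0.0, 25.0, 50.0, 75.0, 100.0, 125.0):
    q, lin = quadratic_weight(rul), linear_weight(rul)
    print(f"RUL={rul:5.1f}  quadratic={q:.3f}  linear={lin:.3f}  gap={lin - q:.3f}")
```

Both schemes agree at RUL = 0 and RUL = 125, but in between the quadratic weight is lower, for example 1.36 versus 1.6 at RUL = 50.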

Comparison Results

Scheme	FD002 RMSE	Training Stability	Recommendation
Uniform (w=1)	16.8	Very stable	Baseline only
Linear decay	13.9	Stable	Recommended
Exponential	14.6	Unstable	Not recommended
Piecewise	14.3	Moderate	Acceptable
Quadratic	14.1	Stable	Alternative

Design Choice

Linear decay offers the best balance of performance and stability. It is simple to implement, easy to interpret, and performs consistently across all datasets.

Implementation Details

Our research implementation uses a clean, functional approach that achieves the same effect as the class-based version but with minimal code complexity.

AMNL Research Implementation

Moderate Weighted MSE Loss

Function Signature

Simple functional interface taking predictions and targets. No class overhead needed for this stateless operation.

Design Philosophy

This represents V7 of our training approach: finding the balance between aggressive exponential weighting (V1-V4) and plain MSE (V5-V6).

Weight Distribution

Linear decay from 2.0 at RUL=0 to 1.0 at RUL=125. Critical samples near failure get 2× emphasis while healthy samples get normal weighting.

Example: RUL=0 → w=2.0, RUL=62.5 → w=1.5, RUL=125 → w=1.0

Weight Computation

The formula 1.0 + clamp(1.0 - target/125, 0, 1) maps RUL to weights. The clamp ensures weights stay in [1.0, 2.0].

Example: target=50: 1.0 + clamp(1.0 - 0.4, 0, 1) = 1.0 + 0.6 = 1.6

Weighted MSE

Element-wise multiplication of weights with squared errors, then mean reduction (division by the number of samples). This is more efficient than the class-based approach.

🐍enhanced_train_nasa_cmapss_sota_v7.py

import torch


def moderate_weighted_mse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """
    ⭐ Moderate weighted MSE loss

    Balanced weighting - not as aggressive as exponential,
    but not as simple as plain MSE.

    Linear decay provides gentle emphasis on critical RUL:
    - RUL=0: weight=2.0 (2x emphasis)
    - RUL=50: weight=1.6 (1.6x emphasis)
    - RUL=125: weight=1.0 (normal)

    This is the sweet spot between overfitting and underfitting!
    """
    # Linear decay from 2.0 to 1.0 based on RUL; clamp keeps weights in [1.0, 2.0]
    weights = 1.0 + torch.clamp(1.0 - target / 125.0, 0, 1.0)
    # Weighted squared error, averaged over all samples
    return (weights * (pred - target) ** 2).mean()

Simplicity by Design

This functional implementation achieves the same result as the class-based version in just 3 lines of actual code. In our research, we found that simpler implementations are easier to debug and less prone to subtle bugs.
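The function above reduces with `.mean()`, which divides by the number of samples. A common alternative, not part of the function shown, is to divide by the sum of the weights so that the loss stays on the same scale as plain MSE. The plain-Python sketch below compares the two reductions on a tiny hand-made batch; the data and the helper names are illustrative assumptions, and plain lists are used so that it runs without PyTorch.

```python
def linear_weight(target, r_max=125.0):
    """Same weights as 1.0 + clamp(1.0 - target / 125, 0, 1)."""
    return 1.0 + min(max(1.0 - target / r_max, 0.0), 1.0)


def weighted_mse_mean(preds, targets):
    """Weighted squared errors divided by the sample count (mean reduction)."""
    terms = [linear_weight(t) * (p - t) ** 2 for p, t in zip(preds, targets)]
    return sum(terms) / len(terms)


def weighted_mse_normalized(preds, targets, eps=1e-8):
    """Weighted squared errors divided by the sum of the weights."""
    weights = [linear_weight(t) for t in targets]
    terms = [w * (p - t) ** 2 for w, p, t in zip(weights, preds, targets)]
    # eps guards against division by zero for an all-zero weight vector
    return sum(terms) / (sum(weights) + eps)


# Illustrative batch: every prediction is off by exactly 10 cycles.
targets = [0.0, 50.0, 125.0]
preds = [10.0, 60.0, 135.0]

print(f"mean reduction:        {weighted_mse_mean(preds, targets):.2f}")
print(f"weight-sum normalized: {weighted_mse_normalized(preds, targets):.2f}")
```

With equal errors, the mean-reduced loss is the plain MSE (100) scaled by the batch's average weight (about 1.53), while the weight-sum version stays at about 100. The two differ only by that average weight, which mostly acts like a rescaling of the learning rate.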

Weight Visualization

🐍python

# Visualize weight function
import matplotlib.pyplot as plt
import torch

rul_values = torch.linspace(0, 150, 100)
weights = 1.0 + torch.clamp(1.0 - rul_values / 125.0, 0, 1.0)

plt.figure(figsize=(10, 5))
plt.plot(rul_values.numpy(), weights.numpy(), 'b-', linewidth=2)
plt.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5)
plt.axhline(y=2.0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=125, color='red', linestyle='--', alpha=0.5, label='R_max')
plt.xlabel('True RUL (cycles)')
plt.ylabel('Sample Weight')
plt.title('Linear Decay Weight Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Summary

In this section, we examined weighted MSE with linear decay:

Motivation: Errors near failure have greater operational consequences
Formula: $w(y) = 2 - \min(y, 125)/125$ for weight range [1, 2]
Properties: Smooth, bounded, interpretable
Alternatives: Linear outperforms exponential, piecewise, quadratic
Implementation: Element-wise weights on squared errors, reduced with a mean over samples (the function shown divides by sample count, not by the weight sum)

Parameter	Recommended Value
R_max	125 cycles
w_min	1.0
w_max	2.0
Weight ratio	2:1 (critical:healthy)

Looking Ahead: Linear decay addresses sample importance but treats over-prediction and under-prediction equally. The next section introduces asymmetric RUL loss that penalizes late predictions (over-estimating the remaining life) more severely than early predictions.

With weighted MSE understood, we now address the asymmetric nature of RUL prediction errors.