Learning Objectives
By the end of this section, you will:
- Understand why standard MSE is suboptimal for RUL
- Design the weight function that emphasizes critical samples
- Derive the weighted MSE formula
- Analyze gradient behavior with sample weighting
- Implement weighted MSE in PyTorch
Why This Matters: Not all RUL prediction errors are equal. An error of 10 cycles when true RUL is 100 is less critical than an error of 10 cycles when true RUL is 20. Weighted MSE ensures the model focuses on the samples that matter most—those approaching failure.
Why Weighted MSE?
Standard MSE treats all samples equally, but RUL prediction has asymmetric importance.
The Problem with Standard MSE
Standard MSE weights all prediction errors equally, regardless of the true RUL value.
Real-World Importance
| True RUL | Prediction Error | Consequence |
|---|---|---|
| 100 cycles | ±10 cycles | Minor schedule adjustment |
| 50 cycles | ±10 cycles | Significant planning impact |
| 20 cycles | ±10 cycles | Critical safety concern |
| 5 cycles | ±10 cycles | Potential failure miss or false alarm |
Errors at low RUL have far greater practical consequences than errors at high RUL.
The Solution
Weight samples inversely to their RUL: low-RUL samples get higher weights.
The Weight Function
We use a linear decay weight function based on true RUL.
Linear Decay Weights
$$w_i = w_{\max} - (w_{\max} - w_{\min}) \cdot \frac{\min(R_i, R_{\max})}{R_{\max}}$$

Where:
- $R_i$: True RUL for sample $i$
- $R_{\max} = 125$: Maximum RUL (cap)
- $[w_{\min}, w_{\max}] = [1.0, 2.0]$: Weight range
Weight Distribution
```
Weight vs. True RUL:

Weight
 2.0 ─┤●
      │ ╲
 1.8 ─┤  ╲
      │   ╲
 1.6 ─┤    ╲
      │     ╲
 1.4 ─┤      ╲
      │       ╲
 1.2 ─┤        ╲
      │         ╲
 1.0 ─┤          ╲────────────●
      └──┬──┬──┬──┬──┬──┬──┬──┬
      0  15 30 45 60 75 90 105 125
                True RUL
```
Key points:
- RUL = 0: w = 2.0 (maximum weight)
- RUL = 62.5: w = 1.5 (midpoint)
- RUL = 125: w = 1.0 (minimum weight)

Why Linear Decay?
Linear decay keeps every weight bounded in [1.0, 2.0], so no single sample can dominate the loss, and the emphasis grows smoothly as RUL decreases. Sharper schemes such as inverse weighting (w ∝ 1/R) assign unbounded weight near failure, which can destabilize training.
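These key points can be checked with a few lines of Python (the constants and the helper name `linear_decay_weight` are our own, chosen to match this section's values):

```python
# Linear-decay weight: w = w_max - (w_max - w_min) * min(R, R_max) / R_max
R_MAX, W_MIN, W_MAX = 125.0, 1.0, 2.0

def linear_decay_weight(rul: float) -> float:
    """Weight in [W_MIN, W_MAX]; highest near failure (RUL = 0)."""
    return W_MAX - (W_MAX - W_MIN) * min(rul, R_MAX) / R_MAX

for rul in (0.0, 62.5, 125.0, 200.0):
    print(rul, linear_decay_weight(rul))  # → 2.0, 1.5, 1.0, 1.0 (capped)
```

Note that any RUL above 125 is capped, so it receives the same minimum weight as RUL = 125.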
Mathematical Formulation
The complete weighted MSE loss for RUL prediction.
Full Formula

$$\mathcal{L}_{\text{WMSE}} = \frac{\sum_{i=1}^{N} w_i \,(\hat{R}_i - R_i)^2}{\sum_{i=1}^{N} w_i}$$

With weight function:

$$w_i = w_{\max} - (w_{\max} - w_{\min}) \cdot \frac{\min(R_i, R_{\max})}{R_{\max}}, \quad w_{\min} = 1.0,\ w_{\max} = 2.0,\ R_{\max} = 125$$
RUL Capping
We use $\min(R_i, R_{\max})$ in the weight function to handle samples with RUL > 125. These samples are in the healthy phase, where the exact RUL is less meaningful, so we cap them at 125.
Implementation
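A minimal PyTorch sketch of this loss (the function name `weighted_mse_loss` and the constant names are our own, not from a standard library):

```python
import torch

R_MAX = 125.0  # RUL cap (healthy-phase targets are clamped to this value)
W_MIN = 1.0    # weight at RUL = R_MAX
W_MAX = 2.0    # weight at RUL = 0

def weighted_mse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Weighted MSE with linear-decay weights, normalized by the weight sum."""
    target = target.clamp(max=R_MAX)                     # RUL capping
    weights = W_MAX - (W_MAX - W_MIN) * target / R_MAX   # linear decay in [1, 2]
    sq_err = (pred - target) ** 2
    return (weights * sq_err).sum() / weights.sum()
```

Because the result is normalized by the sum of weights rather than the batch size, the loss stays on the same scale as plain MSE and batches with many low-RUL samples do not inflate its magnitude.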
Numerical Example
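As a hypothetical worked example, take two samples that are both 10 cycles off: one healthy (true RUL = 100) and one near failure (true RUL = 20). The values below follow the weight function defined above:

```python
# Two samples, both 10 cycles off: one healthy, one near failure.
R_MAX, W_MIN, W_MAX = 125.0, 1.0, 2.0
true_rul = [100.0, 20.0]
pred_rul = [110.0, 30.0]

weights = [W_MAX - (W_MAX - W_MIN) * min(r, R_MAX) / R_MAX for r in true_rul]
sq_err = [(p - r) ** 2 for p, r in zip(pred_rul, true_rul)]

wmse = sum(w * e for w, e in zip(weights, sq_err)) / sum(weights)
share = weights[1] / sum(weights)  # loss share of the near-failure sample

print(weights)  # ≈ [1.2, 1.84]
print(wmse)     # ≈ 100.0 (equal errors, so same value as plain MSE)
print(share)    # ≈ 0.605
```

With equal errors the normalized loss value matches plain MSE, but the near-failure sample now carries about 60% of the loss (and therefore of the gradient) instead of 50%.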
Gradient Analysis
Understanding the gradient helps us see how weighting affects learning.
Gradient Derivation
The gradient of the weighted MSE with respect to prediction $\hat{R}_i$:

$$\frac{\partial \mathcal{L}_{\text{WMSE}}}{\partial \hat{R}_i} = \frac{2\,w_i\,(\hat{R}_i - R_i)}{\sum_{j=1}^{N} w_j}$$
Key observation:
- Gradient magnitude is proportional to weight
- Low-RUL samples (high weight) produce larger gradients
- Model receives stronger learning signal from critical cases
Gradient Comparison
For the same prediction error (e.g., 10 cycles off):
| True RUL | Weight | Relative Gradient |
|---|---|---|
| 125 | 1.00 | 1.0× |
| 100 | 1.20 | 1.2× |
| 50 | 1.60 | 1.6× |
| 25 | 1.80 | 1.8× |
| 0 | 2.00 | 2.0× |
Low-RUL samples contribute up to 2× larger gradients, ensuring the model prioritizes learning to predict accurately near failure.
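The table can be verified with autograd. In this sketch all five samples share one batch (so they share the normalizer $\sum_j w_j$), each prediction is 10 cycles high, and the per-sample gradients come out proportional to the weights:

```python
import torch

R_MAX, W_MIN, W_MAX = 125.0, 1.0, 2.0

true_rul = torch.tensor([125.0, 100.0, 50.0, 25.0, 0.0])
pred = (true_rul + 10.0).requires_grad_(True)  # every prediction 10 cycles high

weights = W_MAX - (W_MAX - W_MIN) * true_rul.clamp(max=R_MAX) / R_MAX
loss = (weights * (pred - true_rul) ** 2).sum() / weights.sum()
loss.backward()

# Gradients relative to the RUL = 125 sample reproduce the table's ratios.
rel = pred.grad / pred.grad[0]
print(rel)  # ≈ [1.0, 1.2, 1.6, 1.8, 2.0]
```

The identical errors cancel in the ratio, leaving exactly the weight ratios from the table.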
Summary
In this section, we designed the weighted MSE loss for RUL:
- Motivation: Low-RUL errors are more critical than high-RUL errors
- Weight function: $w_i = 2.0 - \min(R_i, 125)/125$
- Weight range: [1.0, 2.0] (maximum 2× emphasis)
- Effect: Larger gradients for critical samples
- Normalization: Divide by sum of weights
| Parameter | Value |
|---|---|
| R_max | 125 cycles |
| Weight at RUL=0 | 2.0 |
| Weight at RUL=125 | 1.0 |
| Gradient amplification | Up to 2× |
Looking Ahead: With the RUL loss component defined, the next section describes the health classification loss—the cross-entropy component that provides regularization for the RUL prediction task.