Chapter 17

Loss Function Comparisons

Ablation Studies

Learning Objectives

By the end of this section, you will:

  1. Compare different loss functions for RUL prediction
  2. Understand weighted MSE and its benefits
  3. Analyze linear vs exponential weighting schemes
  4. Evaluate NASA Score as loss and its challenges
  5. Implement custom loss functions for RUL prediction
Key Finding: Linear-weighted MSE (emphasizing low-RUL predictions) improves RMSE by 10-18% compared to plain MSE, and outperforms exponential weighting by 4-8%. The moderate weighting scheme balances early prediction accuracy with critical-phase precision.

Loss Function Landscape

RUL prediction presents unique challenges for loss function design because the cost of a prediction error is asymmetric.

The Asymmetric Error Problem

In maintenance applications, late predictions (actual RUL < predicted RUL) are more dangerous than early predictions:

| Error Type | Consequence | Severity |
|---|---|---|
| Late prediction | Failure occurs before maintenance is scheduled | Catastrophic |
| Early prediction | Premature maintenance, some waste | Inconvenient |
| Accurate prediction | Optimal maintenance timing | Ideal |

Loss Function Candidates

| Loss Function | Formula | Characteristics |
|---|---|---|
| Plain MSE | $(y - \hat{y})^2$ | Symmetric, baseline |
| Weighted MSE (linear) | $w(y) \cdot (y - \hat{y})^2$ | Emphasizes low RUL |
| Weighted MSE (exponential) | $(1 + e^{-y/\tau}) \cdot (y - \hat{y})^2$ | Stronger low-RUL emphasis |
| Huber loss | Hybrid L1/L2 | Robust to outliers |
| NASA Score loss | Asymmetric exponential | Matches evaluation metric |

RUL-Specific Challenge

Standard loss functions treat all errors equally. For RUL prediction, an error of 5 cycles when true RUL=10 is far more critical than the same error when true RUL=100. Our weighted MSE addresses this asymmetry.


Weighted MSE Ablation

Comparing plain MSE with linear-weighted MSE that emphasizes low-RUL predictions.

Linear Weighted MSE Formulation

$$\mathcal{L}_{\text{weighted}} = \frac{1}{N} \sum_{i=1}^{N} w(y_i) \cdot (y_i - \hat{y}_i)^2$$

Where the weight function is:

$$w(y) = 1 + \frac{\text{RUL}_{\max} - \text{clamp}(y, 0, \text{RUL}_{\max})}{\text{RUL}_{\max}}$$

With $\text{RUL}_{\max} = 125$:

| True RUL | Weight | Interpretation |
|---|---|---|
| 125 (healthy) | 1.0 | Baseline importance |
| 100 | 1.2 | 20% more important |
| 50 | 1.6 | 60% more important |
| 25 | 1.8 | 80% more important |
| 0 (failure) | 2.0 | Maximum importance |
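As a sanity check, the weight table above can be reproduced in a few lines of plain Python. This is only an illustrative sketch of the weight function (the name `linear_weight` is ours, not the chapter's codebase):

```python
def linear_weight(y: float, max_rul: float = 125.0) -> float:
    """Linear low-RUL weight: 1.0 at max_rul, rising to 2.0 at RUL = 0."""
    clamped = min(max(y, 0.0), max_rul)          # clamp(y, 0, RUL_max)
    return 1.0 + (max_rul - clamped) / max_rul

for rul in (125, 100, 50, 25, 0):
    print(rul, linear_weight(rul))               # matches the table above
```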

Ablation Results: Plain MSE vs Weighted MSE

| Dataset | Weighted MSE (RMSE) | Plain MSE (RMSE) | Improvement |
|---|---|---|---|
| FD001 | 10.43 | 12.31 | +18.0% |
| FD002 | 6.74 | 7.89 | +14.6% |
| FD003 | 9.51 | 10.89 | +12.7% |
| FD004 | 8.16 | 9.62 | +15.2% |

NASA Score Analysis

| Dataset | Weighted MSE (NASA Score) | Plain MSE (NASA Score) | Change |
|---|---|---|---|
| FD001 | 434.3 | 612.8 | −29.1% |
| FD002 | 356.0 | 498.2 | −28.5% |
| FD003 | 338.9 | 456.7 | −25.8% |
| FD004 | 537.5 | 723.1 | −25.7% |

NASA Score Improvement

Weighted MSE improves NASA Score by 25-30%. Since NASA Score penalizes late predictions exponentially, weighted MSE's focus on low-RUL accuracy directly reduces late prediction penalties.


Linear vs Exponential Weighting

Comparing different weighting function shapes.

Weighting Function Formulations

Linear:

$$w_{\text{linear}}(y) = 1 + \frac{\text{RUL}_{\max} - y}{\text{RUL}_{\max}}$$

Exponential:

$$w_{\text{exp}}(y) = 1 + \exp\left(-\frac{y}{\tau}\right)$$

With $\tau = 50$ as the decay constant.

Weight Comparison at Different RUL Values

| RUL | Linear Weight | Exponential Weight | Ratio (Exp/Lin) |
|---|---|---|---|
| 125 | 1.0 | 1.08 | 1.08 |
| 100 | 1.2 | 1.14 | 0.95 |
| 50 | 1.6 | 1.37 | 0.86 |
| 25 | 1.8 | 1.61 | 0.89 |
| 10 | 1.92 | 1.82 | 0.95 |
| 0 | 2.0 | 2.0 | 1.0 |
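The comparison above can be verified directly with the two weight formulas. This is a small illustrative sketch (function names are ours):

```python
import math

def linear_weight(y: float, max_rul: float = 125.0) -> float:
    """Linear weight: 1.0 at max_rul, 2.0 at RUL = 0."""
    return 1.0 + (max_rul - min(max(y, 0.0), max_rul)) / max_rul

def exponential_weight(y: float, tau: float = 50.0) -> float:
    """Exponential weight: 1 + exp(-y / tau)."""
    return 1.0 + math.exp(-y / tau)

for rul in (125, 100, 50, 25, 10, 0):
    lin, exp_w = linear_weight(rul), exponential_weight(rul)
    print(f"RUL={rul:3d}  linear={lin:.2f}  exponential={exp_w:.2f}  ratio={exp_w / lin:.2f}")
```

Note that for mid-range RUL values the exponential scheme weights *less* than the linear one (ratio < 1): it concentrates its extra emphasis very close to failure, and the ablation below suggests this narrower focus is what costs it accuracy.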

Ablation Results: Linear vs Exponential

| Dataset | Linear (RMSE) | Exponential (RMSE) | Linear Advantage |
|---|---|---|---|
| FD001 | 10.43 | 11.21 | +7.5% |
| FD002 | 6.74 | 7.12 | +5.6% |
| FD003 | 9.51 | 9.89 | +4.0% |
| FD004 | 8.16 | 8.78 | +7.6% |

NASA Score as Loss

Directly optimizing the NASA asymmetric scoring function.

NASA Score Formulation

$$S = \sum_{i=1}^{N} \begin{cases} \exp\left(-\dfrac{d_i}{13}\right) - 1 & \text{if } d_i < 0 \text{ (early)} \\[6pt] \exp\left(\dfrac{d_i}{10}\right) - 1 & \text{if } d_i \geq 0 \text{ (late)} \end{cases}$$

Where $d_i = \hat{y}_i - y_i$ is the prediction error.
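To see the asymmetry concretely, here is an illustrative per-sample scoring function mirroring the formula (plain Python, name is ours):

```python
import math

def nasa_sample_score(d: float) -> float:
    """Per-sample NASA score for error d = predicted - true RUL."""
    if d < 0:                            # early prediction: gentler decay (13)
        return math.exp(-d / 13.0) - 1.0
    return math.exp(d / 10.0) - 1.0      # late prediction: harsher decay (10)

# A 10-cycle late error costs noticeably more than a 10-cycle early one:
print(nasa_sample_score(+10))            # ~1.72
print(nasa_sample_score(-10))            # ~1.16
```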

Challenges with NASA Score Loss

Training Instability

Direct optimization of NASA Score as loss leads to training instability due to:

  • Exponential gradients: Late prediction errors generate extremely large gradients
  • Non-convexity: The loss landscape has sharp valleys near d=0
  • Gradient explosion: Large errors can cause numerical overflow
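The gradient-explosion point is easy to quantify: for a late error $d$, the per-sample gradient of the score is $\exp(d/10)/10$, while MSE's gradient is only $2d$. A quick illustrative comparison (helper names are ours):

```python
import math

def nasa_late_grad(d: float) -> float:
    """Gradient of exp(d/10) - 1 with respect to a late error d >= 0."""
    return math.exp(d / 10.0) / 10.0

def mse_grad(d: float) -> float:
    """Gradient of d**2 with respect to the error d."""
    return 2.0 * d

for d in (10, 50, 100):
    print(f"d={d:3d}  NASA grad={nasa_late_grad(d):10.1f}  MSE grad={mse_grad(d):6.1f}")
```

At $d = 100$ the NASA-score gradient is roughly 2200 versus 200 for MSE, and it keeps growing exponentially, which is exactly what destabilizes training.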

NASA Score Loss Ablation

| Dataset | Weighted MSE (RMSE) | NASA Score Loss (RMSE) | Outcome |
|---|---|---|---|
| FD001 | 10.43 | 14.23 | −36.4% (worse) |
| FD002 | 6.74 | Training failed | Diverged |
| FD003 | 9.51 | 12.87 | −35.3% (worse) |
| FD004 | 8.16 | Training failed | Diverged |

On complex datasets (FD002, FD004), NASA Score loss causes training to diverge. Even on simpler datasets, it underperforms weighted MSE.

Alternative: Soft NASA Approximation

A smoother approximation for training stability:

$$S_{\text{soft}} = \sum_{i=1}^{N} \begin{cases} \alpha \cdot d_i^2 & \text{if } d_i < 0 \\ \beta \cdot d_i^2 & \text{if } d_i \geq 0 \end{cases}$$

With $\alpha = 0.5$ and $\beta = 1.0$ to approximate the asymmetry.

| Dataset | Weighted MSE (RMSE) | Soft NASA (RMSE) | Comparison |
|---|---|---|---|
| FD001 | 10.43 | 10.67 | −2.3% (similar) |
| FD002 | 6.74 | 6.92 | −2.7% (similar) |
| FD003 | 9.51 | 9.78 | −2.8% (similar) |
| FD004 | 8.16 | 8.45 | −3.6% (similar) |
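The stability argument is easy to see per sample: the penalty stays quadratic, so a late error simply costs a constant multiple of an equal early error. An illustrative plain-Python version of the formula (name is ours):

```python
def soft_asymmetric_sample(d: float, alpha: float = 0.5, beta: float = 1.0) -> float:
    """Per-sample soft-NASA penalty for error d = predicted - true."""
    return (alpha if d < 0 else beta) * d * d

# With the defaults, a late error costs exactly twice an equal early error,
# and the gradient stays linear in d (no exponential blow-up):
print(soft_asymmetric_sample(+10))   # 100.0
print(soft_asymmetric_sample(-10))   # 50.0
```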

Practical Recommendation

Linear-weighted MSE provides the best balance of training stability and performance. While soft NASA approximation is stable, it doesn't outperform weighted MSE. The optimal strategy is to train with weighted MSE and evaluate with NASA Score.
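That train/evaluate split can be sketched end to end in NumPy. The toy below fits a single bias term by gradient descent on linear-weighted MSE, then reports NASA Score; all data and names here are synthetic and illustrative, not the chapter's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 125, size=256)        # "true" RUL targets
x = y + rng.normal(0, 5, size=256)       # noisy base prediction
b = 0.0                                  # single trainable bias: pred = x + b

for _ in range(200):                     # gradient descent on weighted MSE
    err = (x + b) - y
    w = 1.0 + (125.0 - np.clip(y, 0, 125)) / 125.0
    b -= 0.01 * np.mean(2.0 * w * err)   # d/db of mean(w * err**2)

d = (x + b) - y                          # evaluate with NASA Score
nasa = np.sum(np.where(d < 0, np.exp(-d / 13) - 1, np.exp(d / 10) - 1))
print(f"learned bias = {b:.3f}, NASA Score = {nasa:.1f}")
```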


Implementation

Code for all loss function variants.

Loss Function Implementations

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def plain_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor
) -> torch.Tensor:
    """
    Standard Mean Squared Error loss.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]

    Returns:
        Scalar loss value
    """
    return F.mse_loss(pred, target)


def linear_weighted_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    max_rul: float = 125.0
) -> torch.Tensor:
    """
    Linear-weighted MSE emphasizing low-RUL predictions.

    Weight increases linearly from 1.0 at max_rul to 2.0 at 0.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        max_rul: Maximum RUL value for weight calculation

    Returns:
        Scalar loss value
    """
    # Clamp targets to valid range
    clamped_target = target.clamp(0, max_rul)

    # Linear weight: 1.0 at max_rul, 2.0 at 0
    weights = 1.0 + (max_rul - clamped_target) / max_rul

    # Weighted squared error
    squared_errors = (pred - target) ** 2
    weighted_loss = (weights * squared_errors).mean()

    return weighted_loss


def exponential_weighted_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    tau: float = 50.0
) -> torch.Tensor:
    """
    Exponential-weighted MSE with stronger low-RUL emphasis.

    Weight = 1 + exp(-target / tau)

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        tau: Decay constant (lower = sharper weighting)

    Returns:
        Scalar loss value
    """
    weights = 1.0 + torch.exp(-target / tau)
    squared_errors = (pred - target) ** 2
    weighted_loss = (weights * squared_errors).mean()

    return weighted_loss


def nasa_score_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    clip_value: float = 100.0
) -> torch.Tensor:
    """
    NASA asymmetric scoring function as loss.

    WARNING: Can cause training instability on complex datasets.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        clip_value: Maximum score per sample (for stability)

    Returns:
        Scalar loss value
    """
    errors = pred - target  # Positive = late prediction

    # NASA scoring function
    scores = torch.where(
        errors < 0,
        torch.exp(-errors / 13.0) - 1,  # Early: moderate penalty
        torch.exp(errors / 10.0) - 1    # Late: severe penalty
    )

    # Clip for stability; note that very large late errors can still
    # overflow inside exp() before this clamp is applied
    scores = scores.clamp(-clip_value, clip_value)

    return scores.mean()


def soft_asymmetric_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    alpha: float = 0.5,
    beta: float = 1.0
) -> torch.Tensor:
    """
    Soft asymmetric loss approximating NASA Score behavior.

    Uses quadratic penalty with different weights for early/late.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        alpha: Weight for early predictions (under-prediction)
        beta: Weight for late predictions (over-prediction)

    Returns:
        Scalar loss value
    """
    errors = pred - target
    squared_errors = errors ** 2

    # Build tensor weights (torch.where with two Python scalars is not
    # supported on older PyTorch versions)
    weights = torch.where(
        errors < 0,
        torch.full_like(errors, alpha),
        torch.full_like(errors, beta)
    )
    asymmetric_loss = (weights * squared_errors).mean()

    return asymmetric_loss
```
Loss Function Ablation Runner

```python
from typing import List

import pandas as pd

# Assumes the loss functions above are in scope, along with the chapter's
# training helper train_with_loss_function (defined elsewhere).
LOSS_FUNCTIONS = {
    'plain_mse': {
        'name': 'Plain MSE',
        'fn': plain_mse_loss,
    },
    'linear_weighted': {
        'name': 'Linear Weighted MSE',
        'fn': linear_weighted_mse_loss,
    },
    'exponential_weighted': {
        'name': 'Exponential Weighted MSE',
        'fn': exponential_weighted_mse_loss,
    },
    'soft_asymmetric': {
        'name': 'Soft Asymmetric',
        'fn': soft_asymmetric_loss,
    },
}


def run_loss_function_ablation(
    datasets: List[str] = ['FD002', 'FD004'],
    seeds: List[int] = [42, 123, 456],
    epochs: int = 300
) -> pd.DataFrame:
    """
    Compare different loss functions across datasets and seeds.
    """
    results = []

    for loss_name, loss_config in LOSS_FUNCTIONS.items():
        for dataset in datasets:
            for seed in seeds:
                print(f"Training with {loss_name} on {dataset}, seed {seed}")

                result = train_with_loss_function(
                    dataset=dataset,
                    seed=seed,
                    loss_fn=loss_config['fn'],
                    epochs=epochs
                )

                results.append({
                    'loss_function': loss_name,
                    'dataset': dataset,
                    'seed': seed,
                    'rmse': result['rmse'],
                    'nasa_score': result['nasa_score']
                })

    return pd.DataFrame(results)
```

Summary

Loss Function Comparison Summary:

  1. Weighted MSE wins: 10-18% improvement over plain MSE
  2. Linear beats exponential: 4-8% advantage for linear weighting
  3. NASA Score loss unstable: Causes divergence on complex datasets
  4. Soft asymmetric viable: Stable but doesn't outperform weighted MSE
  5. Best practice: Train with linear-weighted MSE, evaluate with NASA Score

Loss Function Ranking

| Rank | Loss Function | Avg RMSE | Stability |
|---|---|---|---|
| 1 | Linear Weighted MSE | 8.71 | Excellent |
| 2 | Soft Asymmetric | 8.96 | Excellent |
| 3 | Exponential Weighted MSE | 9.25 | Good |
| 4 | Plain MSE | 10.18 | Excellent |
| 5 | NASA Score Loss | Diverges | Poor |
Key Insight: The choice of loss function significantly impacts RUL prediction quality. Linear-weighted MSE provides the optimal balance: it emphasizes critical low-RUL predictions (improving NASA Score by ~28%) while maintaining stable training dynamics. Directly optimizing NASA Score is theoretically appealing but practically unstable. The lesson: match your loss function to the problem structure, but respect training stability constraints.

Chapter 17 Ablation Studies: Complete Summary

Across all ablation studies, we identified the key contributions to AMNL's state-of-the-art performance:

| Component | Impact | Insight |
|---|---|---|
| Equal weighting (0.5/0.5) | +28.7% vs asymmetric | Regularization from balanced tasks |
| Dual-task learning | Essential (removes +304% degradation) | Health task prevents overfitting |
| Multi-head attention | +20% on complex data | Captures temporal dependencies |
| Linear weighted MSE | +15% vs plain MSE | Emphasizes critical predictions |
| EMA + training components | ~10% combined | Stabilizes training dynamics |

Compound Effect

These components work synergistically. The total improvement from all ablations (~400% vs single-task baseline) far exceeds the sum of individual improvements, confirming that AMNL's success comes from the principled integration of multiple techniques.


With ablation studies complete, we move to Chapter 18 to analyze generalization and cross-dataset transfer.