Chapter 17

Loss Function Comparisons

Ablation Studies

Learning Objectives

By the end of this section, you will:

  1. Compare different loss functions for RUL prediction
  2. Understand weighted MSE and its benefits
  3. Analyze linear vs exponential weighting schemes
  4. Evaluate NASA Score as loss and its challenges
  5. Implement custom loss functions for RUL prediction
Key Finding: Linear-weighted MSE (emphasizing low-RUL predictions) improves RMSE by 10-18% compared to plain MSE, and outperforms exponential weighting by 4-8%. The moderate weighting scheme balances early prediction accuracy with critical-phase precision.

Loss Function Landscape

RUL prediction presents unique challenges for loss function design because the cost of a prediction error is asymmetric.

The Asymmetric Error Problem

In maintenance applications, late predictions (actual RUL < predicted RUL) are more dangerous than early predictions:

| Error Type | Consequence | Severity |
|---|---|---|
| Late prediction | Failure occurs before maintenance is scheduled | Catastrophic |
| Early prediction | Premature maintenance, some waste | Inconvenient |
| Accurate prediction | Optimal maintenance timing | Ideal |

Loss Function Candidates

| Loss Function | Formula | Characteristics |
|---|---|---|
| Plain MSE | $(y - \hat{y})^2$ | Symmetric, baseline |
| Weighted MSE (linear) | $w(y) \cdot (y - \hat{y})^2$ | Emphasizes low RUL |
| Weighted MSE (exponential) | $(1 + e^{-y/\tau}) \cdot (y - \hat{y})^2$ | Stronger low-RUL emphasis |
| Huber loss | Hybrid L1/L2 | Robust to outliers |
| NASA Score loss | Asymmetric exponential | Matches evaluation metric |

RUL-Specific Challenge

Standard loss functions treat all errors equally. For RUL prediction, an error of 5 cycles when true RUL=10 is far more critical than the same error when true RUL=100. Our weighted MSE addresses this asymmetry.


Weighted MSE Ablation

Comparing plain MSE with linear-weighted MSE that emphasizes low-RUL predictions.

Linear Weighted MSE Formulation

$$\mathcal{L}_{\text{weighted}} = \frac{1}{N} \sum_{i=1}^{N} w(y_i) \cdot (y_i - \hat{y}_i)^2$$

Where the weight function is:

$$w(y) = 1 + \frac{\text{RUL}_{\max} - \text{clamp}(y, 0, \text{RUL}_{\max})}{\text{RUL}_{\max}}$$

With $\text{RUL}_{\max} = 125$:

| True RUL | Weight | Interpretation |
|---|---|---|
| 125 (healthy) | 1.0 | Baseline importance |
| 100 | 1.2 | 20% more important |
| 50 | 1.6 | 60% more important |
| 25 | 1.8 | 80% more important |
| 0 (failure) | 2.0 | Maximum importance |
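As a sanity check, the weight table above can be reproduced in a few lines of plain Python. This is only an illustrative sketch of the weight function (the name `linear_weight` is ours, not the chapter's codebase):

```python
def linear_weight(y: float, max_rul: float = 125.0) -> float:
    """Linear low-RUL weight: 1.0 at max_rul, rising to 2.0 at RUL = 0."""
    clamped = min(max(y, 0.0), max_rul)          # clamp(y, 0, RUL_max)
    return 1.0 + (max_rul - clamped) / max_rul

for rul in (125, 100, 50, 25, 0):
    print(rul, linear_weight(rul))               # matches the table above
```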

Ablation Results: Plain MSE vs Weighted MSE

| Dataset | Weighted MSE (RMSE) | Plain MSE (RMSE) | Improvement |
|---|---|---|---|
| FD001 | 10.43 | 12.31 | +18.0% |
| FD002 | 6.74 | 7.89 | +14.6% |
| FD003 | 9.51 | 10.89 | +12.7% |
| FD004 | 8.16 | 9.62 | +15.2% |

NASA Score Analysis

| Dataset | Weighted MSE (NASA Score) | Plain MSE (NASA Score) | Change |
|---|---|---|---|
| FD001 | 434.3 | 612.8 | −29.1% |
| FD002 | 356.0 | 498.2 | −28.5% |
| FD003 | 338.9 | 456.7 | −25.8% |
| FD004 | 537.5 | 723.1 | −25.7% |

NASA Score Improvement

Weighted MSE improves NASA Score by 25-30%. Since NASA Score penalizes late predictions exponentially, weighted MSE's focus on low-RUL accuracy directly reduces late prediction penalties.


Linear vs Exponential Weighting

Comparing different weighting function shapes.

Weighting Function Formulations

Linear:

$$w_{\text{linear}}(y) = 1 + \frac{\text{RUL}_{\max} - y}{\text{RUL}_{\max}}$$

Exponential:

$$w_{\text{exp}}(y) = 1 + \exp\left(-\frac{y}{\tau}\right)$$

With $\tau = 50$ as the decay constant.

Weight Comparison at Different RUL Values

| RUL | Linear Weight | Exponential Weight | Ratio (Exp/Lin) |
|---|---|---|---|
| 125 | 1.0 | 1.08 | 1.08 |
| 100 | 1.2 | 1.14 | 0.95 |
| 50 | 1.6 | 1.37 | 0.86 |
| 25 | 1.8 | 1.61 | 0.89 |
| 10 | 1.92 | 1.82 | 0.95 |
| 0 | 2.0 | 2.0 | 1.0 |
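The comparison above can be verified directly with the two weight formulas. This is a small illustrative sketch (function names are ours):

```python
import math

def linear_weight(y: float, max_rul: float = 125.0) -> float:
    """Linear weight: 1.0 at max_rul, 2.0 at RUL = 0."""
    return 1.0 + (max_rul - min(max(y, 0.0), max_rul)) / max_rul

def exponential_weight(y: float, tau: float = 50.0) -> float:
    """Exponential weight: 1 + exp(-y / tau)."""
    return 1.0 + math.exp(-y / tau)

for rul in (125, 100, 50, 25, 10, 0):
    lin, exp_w = linear_weight(rul), exponential_weight(rul)
    print(f"RUL={rul:3d}  linear={lin:.2f}  exponential={exp_w:.2f}  ratio={exp_w / lin:.2f}")
```

Note that for mid-range RUL values the exponential scheme weights *less* than the linear one (ratio < 1): it concentrates its extra emphasis very close to failure, and the ablation below suggests this narrower focus is what costs it accuracy.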

Ablation Results: Linear vs Exponential

| Dataset | Linear (RMSE) | Exponential (RMSE) | Linear Advantage |
|---|---|---|---|
| FD001 | 10.43 | 11.21 | +7.5% |
| FD002 | 6.74 | 7.12 | +5.6% |
| FD003 | 9.51 | 9.89 | +4.0% |
| FD004 | 8.16 | 8.78 | +7.6% |

NASA Score as Loss

Directly optimizing the NASA asymmetric scoring function.

NASA Score Formulation

$$S = \sum_{i=1}^{N} \begin{cases} \exp\left(-\dfrac{d_i}{13}\right) - 1 & \text{if } d_i < 0 \text{ (early)} \\[6pt] \exp\left(\dfrac{d_i}{10}\right) - 1 & \text{if } d_i \geq 0 \text{ (late)} \end{cases}$$

Where $d_i = \hat{y}_i - y_i$ is the prediction error.
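To see the asymmetry concretely, here is an illustrative per-sample scoring function mirroring the formula (plain Python, name is ours):

```python
import math

def nasa_sample_score(d: float) -> float:
    """Per-sample NASA score for error d = predicted - true RUL."""
    if d < 0:                            # early prediction: gentler decay (13)
        return math.exp(-d / 13.0) - 1.0
    return math.exp(d / 10.0) - 1.0      # late prediction: harsher decay (10)

# A 10-cycle late error costs noticeably more than a 10-cycle early one:
print(nasa_sample_score(+10))            # ~1.72
print(nasa_sample_score(-10))            # ~1.16
```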

Challenges with NASA Score Loss

Training Instability

Direct optimization of NASA Score as loss leads to training instability due to:

  • Exponential gradients: Late prediction errors generate extremely large gradients
  • Non-convexity: The loss landscape has sharp valleys near d=0
  • Gradient explosion: Large errors can cause numerical overflow
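The gradient-explosion point is easy to quantify: for a late error $d$, the per-sample gradient of the score is $\exp(d/10)/10$, while MSE's gradient is only $2d$. A quick illustrative comparison (helper names are ours):

```python
import math

def nasa_late_grad(d: float) -> float:
    """Gradient of exp(d/10) - 1 with respect to a late error d >= 0."""
    return math.exp(d / 10.0) / 10.0

def mse_grad(d: float) -> float:
    """Gradient of d**2 with respect to the error d."""
    return 2.0 * d

for d in (10, 50, 100):
    print(f"d={d:3d}  NASA grad={nasa_late_grad(d):10.1f}  MSE grad={mse_grad(d):6.1f}")
```

At $d = 100$ the NASA-score gradient is roughly 2200 versus 200 for MSE, and it keeps growing exponentially, which is exactly what destabilizes training.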

NASA Score Loss Ablation

| Dataset | Weighted MSE (RMSE) | NASA Score Loss (RMSE) | Outcome |
|---|---|---|---|
| FD001 | 10.43 | 14.23 | −36.4% (worse) |
| FD002 | 6.74 | Training failed | Diverged |
| FD003 | 9.51 | 12.87 | −35.3% (worse) |
| FD004 | 8.16 | Training failed | Diverged |

On complex datasets (FD002, FD004), NASA Score loss causes training to diverge. Even on simpler datasets, it underperforms weighted MSE.

Alternative: Soft NASA Approximation

A smoother approximation for training stability:

$$S_{\text{soft}} = \sum_{i=1}^{N} \begin{cases} \alpha \cdot d_i^2 & \text{if } d_i < 0 \\ \beta \cdot d_i^2 & \text{if } d_i \geq 0 \end{cases}$$

With $\alpha = 0.5$ and $\beta = 1.0$ to approximate the asymmetry.

| Dataset | Weighted MSE (RMSE) | Soft NASA (RMSE) | Comparison |
|---|---|---|---|
| FD001 | 10.43 | 10.67 | −2.3% (similar) |
| FD002 | 6.74 | 6.92 | −2.7% (similar) |
| FD003 | 9.51 | 9.78 | −2.8% (similar) |
| FD004 | 8.16 | 8.45 | −3.6% (similar) |
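The stability argument is easy to see per sample: the penalty stays quadratic, so a late error simply costs a constant multiple of an equal early error. An illustrative plain-Python version of the formula (name is ours):

```python
def soft_asymmetric_sample(d: float, alpha: float = 0.5, beta: float = 1.0) -> float:
    """Per-sample soft-NASA penalty for error d = predicted - true."""
    return (alpha if d < 0 else beta) * d * d

# With the defaults, a late error costs exactly twice an equal early error,
# and the gradient stays linear in d (no exponential blow-up):
print(soft_asymmetric_sample(+10))   # 100.0
print(soft_asymmetric_sample(-10))   # 50.0
```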

Practical Recommendation

Linear-weighted MSE provides the best balance of training stability and performance. While soft NASA approximation is stable, it doesn't outperform weighted MSE. The optimal strategy is to train with weighted MSE and evaluate with NASA Score.
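That train/evaluate split can be sketched end to end in NumPy. The toy below fits a single bias term by gradient descent on linear-weighted MSE, then reports NASA Score; all data and names here are synthetic and illustrative, not the chapter's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 125, size=256)        # "true" RUL targets
x = y + rng.normal(0, 5, size=256)       # noisy base prediction
b = 0.0                                  # single trainable bias: pred = x + b

for _ in range(200):                     # gradient descent on weighted MSE
    err = (x + b) - y
    w = 1.0 + (125.0 - np.clip(y, 0, 125)) / 125.0
    b -= 0.01 * np.mean(2.0 * w * err)   # d/db of mean(w * err**2)

d = (x + b) - y                          # evaluate with NASA Score
nasa = np.sum(np.where(d < 0, np.exp(-d / 13) - 1, np.exp(d / 10) - 1))
print(f"learned bias = {b:.3f}, NASA Score = {nasa:.1f}")
```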


Implementation

Code for all loss function variants.

Loss Function Implementations

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def plain_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor
) -> torch.Tensor:
    """
    Standard Mean Squared Error loss.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]

    Returns:
        Scalar loss value
    """
    return F.mse_loss(pred, target)


def linear_weighted_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    max_rul: float = 125.0
) -> torch.Tensor:
    """
    Linear-weighted MSE emphasizing low-RUL predictions.

    Weight increases linearly from 1.0 at max_rul to 2.0 at 0.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        max_rul: Maximum RUL value for weight calculation

    Returns:
        Scalar loss value
    """
    # Clamp targets to valid range
    clamped_target = target.clamp(0, max_rul)

    # Linear weight: 1.0 at max_rul, 2.0 at 0
    weights = 1.0 + (max_rul - clamped_target) / max_rul

    # Weighted squared error
    squared_errors = (pred - target) ** 2
    weighted_loss = (weights * squared_errors).mean()

    return weighted_loss


def exponential_weighted_mse_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    tau: float = 50.0
) -> torch.Tensor:
    """
    Exponential-weighted MSE with stronger low-RUL emphasis.

    Weight = 1 + exp(-target / tau)

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        tau: Decay constant (lower = sharper weighting)

    Returns:
        Scalar loss value
    """
    weights = 1.0 + torch.exp(-target / tau)
    squared_errors = (pred - target) ** 2
    weighted_loss = (weights * squared_errors).mean()

    return weighted_loss


def nasa_score_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    clip_value: float = 100.0
) -> torch.Tensor:
    """
    NASA asymmetric scoring function as loss.

    WARNING: Can cause training instability on complex datasets.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        clip_value: Maximum score per sample (for stability)

    Returns:
        Scalar loss value
    """
    errors = pred - target  # Positive = late prediction

    # NASA scoring function
    scores = torch.where(
        errors < 0,
        torch.exp(-errors / 13.0) - 1,  # Early: moderate penalty
        torch.exp(errors / 10.0) - 1    # Late: severe penalty
    )

    # Clip for stability; note that very large late errors can still
    # overflow inside exp() before this clamp is applied
    scores = scores.clamp(-clip_value, clip_value)

    return scores.mean()


def soft_asymmetric_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    alpha: float = 0.5,
    beta: float = 1.0
) -> torch.Tensor:
    """
    Soft asymmetric loss approximating NASA Score behavior.

    Uses quadratic penalty with different weights for early/late.

    Args:
        pred: Predicted RUL values [batch_size]
        target: True RUL values [batch_size]
        alpha: Weight for early predictions (under-prediction)
        beta: Weight for late predictions (over-prediction)

    Returns:
        Scalar loss value
    """
    errors = pred - target
    squared_errors = errors ** 2

    # Build tensor weights (torch.where with two Python scalars is not
    # supported on older PyTorch versions)
    weights = torch.where(
        errors < 0,
        torch.full_like(errors, alpha),
        torch.full_like(errors, beta)
    )
    asymmetric_loss = (weights * squared_errors).mean()

    return asymmetric_loss
```
Loss Function Ablation Runner

```python
from typing import List

import pandas as pd

# Assumes the loss functions above are in scope, along with the chapter's
# training helper train_with_loss_function (defined elsewhere).
LOSS_FUNCTIONS = {
    'plain_mse': {
        'name': 'Plain MSE',
        'fn': plain_mse_loss,
    },
    'linear_weighted': {
        'name': 'Linear Weighted MSE',
        'fn': linear_weighted_mse_loss,
    },
    'exponential_weighted': {
        'name': 'Exponential Weighted MSE',
        'fn': exponential_weighted_mse_loss,
    },
    'soft_asymmetric': {
        'name': 'Soft Asymmetric',
        'fn': soft_asymmetric_loss,
    },
}


def run_loss_function_ablation(
    datasets: List[str] = ['FD002', 'FD004'],
    seeds: List[int] = [42, 123, 456],
    epochs: int = 300
) -> pd.DataFrame:
    """
    Compare different loss functions across datasets and seeds.
    """
    results = []

    for loss_name, loss_config in LOSS_FUNCTIONS.items():
        for dataset in datasets:
            for seed in seeds:
                print(f"Training with {loss_name} on {dataset}, seed {seed}")

                result = train_with_loss_function(
                    dataset=dataset,
                    seed=seed,
                    loss_fn=loss_config['fn'],
                    epochs=epochs
                )

                results.append({
                    'loss_function': loss_name,
                    'dataset': dataset,
                    'seed': seed,
                    'rmse': result['rmse'],
                    'nasa_score': result['nasa_score']
                })

    return pd.DataFrame(results)
```

Summary

Loss Function Comparison Summary:

  1. Weighted MSE wins: 10-18% improvement over plain MSE
  2. Linear beats exponential: 4-8% advantage for linear weighting
  3. NASA Score loss unstable: Causes divergence on complex datasets
  4. Soft asymmetric viable: Stable but doesn't outperform weighted MSE
  5. Best practice: Train with linear-weighted MSE, evaluate with NASA Score

Loss Function Ranking

| Rank | Loss Function | Avg RMSE | Stability |
|---|---|---|---|
| 1 | Linear Weighted MSE | 8.71 | Excellent |
| 2 | Soft Asymmetric | 8.96 | Excellent |
| 3 | Exponential Weighted MSE | 9.25 | Good |
| 4 | Plain MSE | 10.18 | Excellent |
| 5 | NASA Score Loss | Diverges | Poor |
Key Insight: The choice of loss function significantly impacts RUL prediction quality. Linear-weighted MSE provides the optimal balance: it emphasizes critical low-RUL predictions (improving NASA Score by ~28%) while maintaining stable training dynamics. Directly optimizing NASA Score is theoretically appealing but practically unstable. The lesson: match your loss function to the problem structure, but respect training stability constraints.

Chapter 17 Ablation Studies: Complete Summary

Across all ablation studies, we identified the key contributions to AMNL's state-of-the-art performance:

| Component | Impact | Insight |
|---|---|---|
| Equal weighting (0.5/0.5) | +28.7% vs asymmetric | Regularization from balanced tasks |
| Dual-task learning | Essential (removes +304% degradation) | Health task prevents overfitting |
| Multi-head attention | +20% on complex data | Captures temporal dependencies |
| Linear weighted MSE | +15% vs plain MSE | Emphasizes critical predictions |
| EMA + training components | ~10% combined | Stabilizes training dynamics |

Compound Effect

These components work synergistically. The total improvement from all ablations (~400% vs single-task baseline) far exceeds the sum of individual improvements, confirming that AMNL's success comes from the principled integration of multiple techniques.


With ablation studies complete, we move to Chapter 18 to analyze generalization and cross-dataset transfer.