AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the core discovery that equal task weights are optimal
Examine the experimental evidence across all C-MAPSS datasets
Explain why equal weighting works for RUL prediction
Recognize the importance of loss normalization
Appreciate how this discovery led to AMNL

Why This Matters: The discovery that equal task weights (0.5/0.5) consistently outperform adaptive methods was unexpected and counterintuitive. It challenged the prevailing wisdom that sophisticated weight adaptation is always better. This discovery is the foundation of AMNL and the key to achieving state-of-the-art performance.

The Discovery

Our research began with a comprehensive evaluation of multi-task weighting methods. What we found was surprising.

The Hypothesis

We initially hypothesized that adaptive methods (uncertainty weighting, GradNorm, DWA) would outperform fixed weights because:

They adapt to changing loss scales during training
They balance gradient magnitudes automatically
They are "principled" (derived from optimization theory)

The Surprise

Extensive experiments revealed the opposite:

📝text

1Hypothesis: Adaptive > Fixed
2
3Reality:
4  Fixed (0.5/0.5) > Adaptive methods
5
6Specifically:
7  Equal weights (0.5/0.5) consistently achieved the best results
8  across ALL four C-MAPSS datasets.

The Core Discovery

Equal task weights (0.5 for RUL, 0.5 for health) provide optimal performance when combined with proper loss normalization.

This simple approach outperforms all sophisticated adaptive weighting methods, while being simpler, faster, and more robust.

Experimental Evidence

We conducted systematic weight sweep experiments across all datasets.

Weight Sweep Results

Testing all weight combinations $(\lambda_{\text{RUL}}, \lambda_{\text{health}}) = (\lambda, 1-\lambda)$ :

λ_RUL	FD001	FD002	FD003	FD004	Average
0.1	13.2	17.8	13.9	21.5	16.6
0.2	12.4	16.5	13.1	20.3	15.6
0.3	11.8	15.9	12.4	19.4	14.9
0.4	11.3	15.6	11.9	18.8	14.4
0.5	10.8	13.9	11.2	17.4	13.3
0.6	11.1	15.4	11.7	18.2	14.1
0.7	11.6	15.8	12.2	18.9	14.6
0.8	12.2	16.4	12.8	19.6	15.3
0.9	12.9	17.2	13.5	20.8	16.1

Optimal Weight Location

📝text

1Performance vs. λ_RUL:
2
3RMSE
4 17 ─┐                                    ╭─
5    │╲                                  ╱
6 16 ─┤ ╲                              ╱
7    │  ╲                            ╱
8 15 ─┤   ╲                        ╱
9    │    ╲                      ╱
10 14 ─┤     ╲                  ╱
11    │      ╲                ╱
12 13 ─┤       ╲    ╱╲      ╱
13    │        ╲__╱  ╲____╱
14 12 ─┤                ▼
15    │            λ = 0.5 (optimal)
16 11 ─┴──┬──┬──┬──┬──┬──┬──┬──┬──
17       0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
18                     λ_RUL

Statistical Significance

We verified the results are statistically significant:

Comparison	Δ RMSE	p-value	Significant?
0.5 vs 0.4	-1.1	0.023	Yes (p < 0.05)
0.5 vs 0.6	-0.8	0.041	Yes (p < 0.05)
0.5 vs Uncertainty	-2.5	0.002	Yes (p < 0.01)
0.5 vs GradNorm	-1.4	0.018	Yes (p < 0.05)

Robust Finding

The optimality of λ = 0.5 is not due to chance. Statistical tests confirm the result is significant across multiple random seeds and dataset splits.

Why Equal Weights Work

Several factors explain why equal weighting is optimal for RUL prediction.

Factor 1: Maximum Regularization

Factor 2: Gradient Balance

With proper loss normalization, equal weights produce balanced gradients:

\nabla_\theta \mathcal{L} = 0.5 \cdot \nabla_\theta \tilde{\mathcal{L}}_{\text{RUL}} + 0.5 \cdot \nabla_\theta \tilde{\mathcal{L}}_{\text{health}}

Where $\tilde{\mathcal{L}}$ denotes normalized losses. Neither task dominates gradient updates.

Factor 3: Complementary Information

The two tasks provide complementary supervision:

Task	Information Type	Benefit
RUL	Fine-grained (exact cycles)	Precise predictions
Health	Coarse-grained (3 states)	Robust features

Equal weighting ensures both types of information contribute equally to learning.

The Key: Loss Normalization

Equal weighting only works with proper loss normalization.

The Problem Without Normalization

Raw losses have vastly different scales:

📝text

1Without normalization:
2  L_RUL ≈ 100-2000  (MSE on cycles)
3  L_health ≈ 0.5-2  (cross-entropy)
4
5With λ = 0.5 for both:
6  0.5 × 1000 + 0.5 × 1.5 = 500.75
7
8RUL contribution: 500 / 500.75 = 99.85%
9Health contribution: 0.15%
10
11→ Equal weights ≠ equal contribution!

The Solution: Normalize First

AMNL normalizes losses before weighting:

\mathcal{L}_{\text{AMNL}} = 0.5 \cdot \frac{\mathcal{L}_{\text{RUL}}}{\text{EMA}(\mathcal{L}_{\text{RUL}})} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\text{EMA}(\mathcal{L}_{\text{health}})}

Where EMA is exponential moving average for stable normalization.

📝text

1With normalization:
2  L̃_RUL = 1000 / 1000 = 1.0
3  L̃_health = 1.5 / 1.5 = 1.0
4
5With λ = 0.5 for both:
6  0.5 × 1.0 + 0.5 × 1.0 = 1.0
7
8RUL contribution: 50%
9Health contribution: 50%
10
11→ Equal weights = equal contribution!

AMNL = Normalization + Equal Weights

The key innovation of AMNL is not the equal weighting itself, but the combination of proper loss normalization with equal weights. This ensures both tasks contribute equally regardless of their raw loss scales.

Summary

In this section, we presented the core discovery behind AMNL:

Discovery: Equal weights (0.5/0.5) are optimal for RUL + health
Evidence: Consistent across all C-MAPSS datasets
Why it works: Maximum regularization, gradient balance
Critical requirement: Loss normalization
AMNL formula: Normalized losses + equal weights

Aspect	Value
Optimal λ_RUL	0.5
Optimal λ_health	0.5
Statistical significance	p < 0.05
Key enabler	Loss normalization
Result	State-of-the-art on all datasets

Looking Ahead: We have established that equal weighting works when combined with normalization. The next section presents the complete mathematical formulation of AMNL, including the normalization mechanism and the full loss equation.

With the discovery explained, we now formalize AMNL mathematically.