Learning Objectives
By the end of this section, you will:
- Understand the regularization theory behind equal weighting
- Analyze gradient balance under normalized losses
- Apply information-theoretic reasoning to task weighting
- Interpret empirical evidence supporting the 0.5/0.5 split
- Explain why unequal weights hurt performance in the RUL + health setting
Why This Matters: The discovery that equal weights are optimal is not arbitrary—it emerges from fundamental principles of regularization, gradient dynamics, and information theory. Understanding these principles explains why AMNL succeeds and provides confidence that the approach generalizes beyond C-MAPSS.
Regularization Theory
The health classification task acts as a regularizer for RUL prediction. Equal weighting maximizes this regularization effect.
Multi-Task Learning as Regularization
In multi-task learning, auxiliary tasks constrain the hypothesis space:

F = F_RUL ∩ F_health

Where:
- F_RUL: Functions that fit RUL data well
- F_health: Functions that fit health data well
- F_RUL ∩ F_health: Intersection (smaller, more regularized)

The auxiliary task shrinks the hypothesis space, reducing overfitting.
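One way to see the shared hypothesis space concretely: both task heads read the same encoder output, so any function either task realizes must factor through the shared representation. A minimal numpy sketch (all shapes and weights are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared-encoder model: both task heads read the SAME representation z,
# so each head can only realize functions in the shared hypothesis space.
W_shared = rng.normal(size=(8, 24))   # shared encoder: 24 sensor inputs -> 8 features
w_rul = rng.normal(size=8)            # RUL regression head
W_health = rng.normal(size=(2, 8))    # binary health classification head

x = rng.normal(size=24)               # one flattened input window
z = np.tanh(W_shared @ x)             # shared representation

rul_pred = w_rul @ z                  # scalar RUL estimate
health_logits = W_health @ z          # two class logits
```

Because gradients from both heads flow into `W_shared`, the health task constrains the features the RUL head can use, which is exactly the regularization effect described above.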
Weight Impact on Regularization Strength
Consider the regularization strength R(λ_health) as a function of the health weight λ_health: at λ_health = 0 the auxiliary constraint disappears, while at λ_health = 1 the primary RUL signal vanishes, so the useful regularization peaks between the extremes.
Formal Regularization Bound
From a bias-variance perspective, the expected error can be decomposed:

E[error] = Bias² + Variance + Irreducible noise
With multi-task learning:
- Bias: Slightly increased (must satisfy both tasks)
- Variance: Significantly reduced (constrained hypothesis space)
- Net effect: Lower total error when tasks are related
| λ_health | Bias Effect | Variance Effect | Net |
|---|---|---|---|
| 0.0 | Lowest | Highest | High error (overfit) |
| 0.3 | Low | Moderate | Moderate error |
| 0.5 | Moderate | Lowest | Lowest error |
| 0.7 | High | Low | Moderate error |
| 1.0 | Highest | Very low | High error (wrong task) |
Gradient Balance Analysis
Equal weights combined with normalization produce balanced gradient contributions.
Gradient Flow in AMNL
The total gradient under AMNL, with normalized per-task losses L_RUL and L_health:

∇L_total = 0.5·∇L_RUL + 0.5·∇L_health

Since L_RUL ≈ L_health in scale after normalization, we expect:

‖∇L_RUL‖ ≈ ‖∇L_health‖
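This balance can be illustrated numerically. In the sketch below, all gradient values and EMA constants are made up; dividing a loss by its EMA rescales its gradient by the same factor (treating the EMA as a constant):

```python
import numpy as np

# Toy per-task gradients on the shared encoder (values are made up):
# the raw RUL (MSE) gradient is ~50x larger than the health (CE) gradient.
g_rul = np.array([40.0, -20.0, 10.0])
g_health = np.array([0.8, 0.4, -0.2])

# Stand-ins for each task's loss EMA; dividing each gradient by its loss EMA
# mimics the effect of normalizing the loss before backpropagation.
ema_rul, ema_health = 50.0, 1.0

# Equal 0.5/0.5 combination of the normalized gradients.
g_total = 0.5 * (g_rul / ema_rul) + 0.5 * (g_health / ema_health)

# After normalization the two contributions have comparable norms.
norm_rul = np.linalg.norm(g_rul / ema_rul)
norm_health = np.linalg.norm(g_health / ema_health)
```

Without the normalization step, the RUL term would dominate `g_total` by roughly the ratio of the raw loss scales.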
Why Balance Matters
Gradient Interference
When tasks share parameters, their gradients can interfere. Let θ be the angle between ∇L_RUL and ∇L_health:
- cos(θ) > 0: Positive transfer (gradients aligned)
- cos(θ) < 0: Negative transfer (gradients conflict)
- cos(θ) = 0: Independent tasks
For RUL and health, we observe:
| Training Phase | cos(θ) | Interpretation |
|---|---|---|
| Early (epoch 1-20) | 0.6-0.8 | Strong alignment |
| Mid (epoch 20-50) | 0.4-0.6 | Moderate alignment |
| Late (epoch 50+) | 0.3-0.5 | Tasks specialize, less overlap |
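The cos(θ) values in the table can be computed directly from per-task gradient vectors; a small helper, shown here with toy gradients for illustration:

```python
import numpy as np

def grad_cosine(g1, g2):
    """cos(θ) between two task gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Toy gradients, mostly aligned -> positive transfer.
g_rul = np.array([1.0, 0.5, -0.2])
g_health = np.array([0.8, 0.6, 0.1])
c = grad_cosine(g_rul, g_health)   # close to 1: strong alignment
```

In practice one would flatten the gradients of each task's loss with respect to the shared encoder parameters and compare them the same way, once per epoch or per logging step.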
Positive Transfer
The positive gradient correlation confirms that health classification helps RUL prediction—the tasks share useful features. This is why the auxiliary task provides effective regularization.
Information-Theoretic View
Information theory provides another lens to understand equal weighting.
Mutual Information Perspective
The encoder learns a representation Z that should maximize:

I(Z; Y_RUL) + I(Z; Y_health)

Subject to the constraint that Z is computed from input X, which by the data processing inequality bounds each term:

I(Z; Y_RUL) ≤ I(X; Y_RUL),  I(Z; Y_health) ≤ I(X; Y_health)

Equal weighting maximizes the joint information extracted into Z.
Label Entropy Analysis
Compare the entropy of each task's labels: a balanced binary health label carries at most 1 bit (log₂ 2), while the RUL target is near-continuous and carries substantially more, so the two supervision signals are complementary rather than redundant.
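As a reference point, Shannon entropy for two hypothetical label distributions: a balanced binary health label versus RUL discretized into 8 uniform bins (frequencies are illustrative, not measured on C-MAPSS):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # 0·log(0) terms contribute nothing
    return float(-np.sum(p * np.log2(p)))

# Hypothetical label distributions for illustration.
h_health = entropy_bits([0.5, 0.5])   # balanced binary health label: 1 bit
h_rul = entropy_bits([1 / 8] * 8)     # RUL bucketed into 8 uniform bins: 3 bits
```

Even under this coarse discretization, the RUL labels carry several times more information than the binary health labels, which is why the tasks extract different signals from the same representation.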
Information Bottleneck Principle
The Information Bottleneck (IB) framework suggests the optimal representation Z minimizes:

L_IB = I(Z; X) − β·I(Z; Y)

For multi-task learning, this extends to:

L_IB = I(Z; X) − β·[λ_RUL·I(Z; Y_RUL) + λ_health·I(Z; Y_health)]

When λ_RUL = λ_health = 0.5, the representation must capture information relevant to both tasks equally, leading to more generalizable features.
Empirical Validation
Theory predicts 0.5/0.5 is optimal. Experiments confirm this across all datasets.
Weight Sensitivity Analysis
Fine-grained weight sweep results (RMSE):
| λ_RUL | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|
| 0.40 | 11.3 | 15.6 | 11.9 | 18.8 |
| 0.45 | 11.0 | 14.8 | 11.5 | 18.1 |
| 0.50 | 10.8 | 13.9 | 11.2 | 17.4 |
| 0.55 | 11.0 | 14.6 | 11.4 | 17.9 |
| 0.60 | 11.1 | 15.4 | 11.7 | 18.2 |
The optimum is consistently at or very near λ_RUL = 0.5.
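As a quick check, any column of the sweep above can be queried programmatically; here the FD002 results, keyed by λ_RUL:

```python
# FD002 column of the weight-sensitivity table above, keyed by λ_RUL.
fd002 = {0.40: 15.6, 0.45: 14.8, 0.50: 13.9, 0.55: 14.6, 0.60: 15.4}

# The λ minimizing RMSE on this dataset.
best_lam = min(fd002, key=fd002.get)
```

The same one-liner applied to the FD001, FD003, and FD004 columns also returns 0.50.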
Performance Degradation Analysis
How much does performance degrade as we deviate from 0.5?
Deviation from λ = 0.5 vs. RMSE increase (average across datasets):

| Δλ | RMSE Increase | Relative Degradation |
|---|---|---|
| ±0.00 | 0.00 | Baseline (optimal) |
| ±0.05 | 0.45 | +3.4% |
| ±0.10 | 1.10 | +8.3% |
| ±0.15 | 1.85 | +13.9% |
| ±0.20 | 2.70 | +20.3% |
| ±0.30 | 4.30 | +32.3% |
| ±0.40 | 6.20 | +46.6% |

Performance degrades roughly quadratically as we deviate from the optimal.
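The "roughly quadratic" claim can be checked by fitting a degree-2 polynomial to the degradation data above:

```python
import numpy as np

# (Δλ, RMSE increase) pairs from the degradation table above.
dlam = np.array([0.00, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40])
rmse_inc = np.array([0.00, 0.45, 1.10, 1.85, 2.70, 4.30, 6.20])

# Degree-2 least-squares fit; a positive leading coefficient indicates
# convex (accelerating) growth of the penalty away from λ = 0.5.
a, b, c = np.polyfit(dlam, rmse_inc, 2)
```

The positive curvature means small deviations from 0.5 are cheap while large ones are disproportionately costly, which matches the practical advice to simply fix the weights at 0.5/0.5.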
Ablation: Normalization Effect
Does equal weighting work without normalization?
| Configuration | FD002 RMSE | Notes |
|---|---|---|
| 0.5/0.5 + AMNL normalization | 13.9 | Best performance |
| 0.5/0.5 no normalization | 16.8 | RUL dominates gradients |
| Tuned weights, no norm | 15.9 | Manual tuning helps, but worse |
| 0.5/0.5 batch normalization | 15.2 | Helps, but not optimal |
Normalization is Essential
Equal weighting alone is not sufficient. The combination of EMA normalization + equal weights is what produces state-of-the-art results. Without normalization, equal weights actually perform worse than tuned unequal weights.
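A minimal sketch of an EMA-based loss normalizer, assuming a divide-by-EMA scheme (the class name, momentum value, and exact update rule are illustrative, not the paper's specification):

```python
class EMALossNormalizer:
    """Sketch of EMA loss normalization: each task's raw loss is divided by
    an exponential moving average of its own magnitude, so both tasks
    contribute at roughly unit scale before the 0.5/0.5 weighting."""

    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.ema = {}

    def normalize(self, name, loss):
        if name not in self.ema:
            self.ema[name] = loss                # seed EMA with first value
        else:
            m = self.momentum
            self.ema[name] = m * self.ema[name] + (1 - m) * loss
        return loss / (self.ema[name] + 1e-8)    # epsilon guards divide-by-zero

norm = EMALossNormalizer()
# Raw losses on very different scales (illustrative values): after
# normalization each term is ~1, so the equally weighted sum is ~1.
total = 0.5 * norm.normalize("rul", 2500.0) + 0.5 * norm.normalize("health", 0.7)
```

In a real training loop the EMA would be updated every step with detached loss values, so the normalizer rescales gradients without being differentiated through.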
Cross-Architecture Validation
Does 0.5/0.5 generalize across different encoder architectures?
| Encoder | Optimal λ_RUL | RMSE at λ=0.5 | Best RMSE |
|---|---|---|---|
| LSTM | 0.50 | 13.9 | 13.9 |
| GRU | 0.50 | 14.2 | 14.2 |
| Transformer | 0.50 | 14.8 | 14.7 |
| TCN | 0.50 | 15.1 | 15.1 |
| 1D-CNN | 0.50 | 15.4 | 15.3 |
The optimal weight is consistently 0.5 (or within 0.05) across all encoder architectures tested.
Summary
In this section, we established why 0.5/0.5 provides superior regularization:
- Regularization theory: Regularization strength peaks at λ = 0.5
- Gradient balance: Equal contribution from both tasks
- Information theory: Balanced extraction of complementary information
- Empirical validation: Consistent across datasets and architectures
- Normalization essential: Equal weights only work with proper normalization
| Perspective | Why 0.5/0.5 is Optimal |
|---|---|
| Regularization | Maximum constraint from auxiliary task |
| Gradient flow | Balanced updates to shared encoder |
| Information | Equal extraction of RUL and health signals |
| Bias-variance | Optimal trade-off point |
| Empirical | Consistent best performance across all tests |
Looking Ahead: With the theoretical foundation established, the final section presents the complete PyTorch implementation of AMNL—production-ready code that you can use directly in your own predictive maintenance projects.
Having understood why equal weighting works, we now implement AMNL in code.