Chapter 10
18 min read
Section 51 of 104

Why 0.5/0.5 Provides Superior Regularization

AMNL: The Novel Loss Function

Learning Objectives

By the end of this section, you will:

  1. Understand the regularization theory behind equal weighting
  2. Analyze gradient balance under normalized losses
  3. Apply information-theoretic reasoning to task weighting
  4. Interpret empirical evidence supporting the 0.5/0.5 split
  5. Explain why unequal weights hurt performance in the RUL + health setting
Why This Matters: The discovery that equal weights are optimal is not arbitrary—it emerges from fundamental principles of regularization, gradient dynamics, and information theory. Understanding these principles explains why AMNL succeeds and provides confidence that the approach generalizes beyond C-MAPSS.

Regularization Theory

The health classification task acts as a regularizer for RUL prediction. Equal weighting maximizes this regularization effect.

Multi-Task Learning as Regularization

In multi-task learning, auxiliary tasks constrain the hypothesis space:

HMTL=HRULHhealth\mathcal{H}_{\text{MTL}} = \mathcal{H}_{\text{RUL}} \cap \mathcal{H}_{\text{health}}

Where:

  • HRUL\mathcal{H}_{\text{RUL}}: Functions that fit RUL data well
  • Hhealth\mathcal{H}_{\text{health}}: Functions that fit health data well
  • HMTL\mathcal{H}_{\text{MTL}}: Intersection (smaller, more regularized)

The auxiliary task shrinks the hypothesis space, reducing overfitting.

Weight Impact on Regularization Strength

Consider the regularization strength as a function of the health weight λhealth\lambda_{\text{health}}:

R(λ)=Reg. strengthλhealth(1λhealth)R(\lambda) = \text{Reg. strength} \propto \lambda_{\text{health}} \cdot (1 - \lambda_{\text{health}})

Formal Regularization Bound

From a bias-variance perspective, the expected error can be decomposed:

E[Error]=Bias2+Variance+Noise\mathbb{E}[\text{Error}] = \text{Bias}^2 + \text{Variance} + \text{Noise}

With multi-task learning:

  • Bias: Slightly increased (must satisfy both tasks)
  • Variance: Significantly reduced (constrained hypothesis space)
  • Net effect: Lower total error when tasks are related
λ_healthBias EffectVariance EffectNet
0.0LowestHighestHigh error (overfit)
0.3LowModerateModerate error
0.5ModerateLowestLowest error
0.7HighLowModerate error
1.0HighestVery lowHigh error (wrong task)

Gradient Balance Analysis

Equal weights combined with normalization produce balanced gradient contributions.

Gradient Flow in AMNL

The total gradient under AMNL:

θLAMNL=0.51μRULθLRUL+0.51μhealthθLhealth\nabla_\theta \mathcal{L}_{\text{AMNL}} = 0.5 \cdot \frac{1}{\mu_{\text{RUL}}} \nabla_\theta \mathcal{L}_{\text{RUL}} + 0.5 \cdot \frac{1}{\mu_{\text{health}}} \nabla_\theta \mathcal{L}_{\text{health}}

Since Li/μi1\mathcal{L}_i / \mu_i \approx 1 after normalization, we expect:

θL~RULθL~health\|\nabla_\theta \tilde{\mathcal{L}}_{\text{RUL}}\| \approx \|\nabla_\theta \tilde{\mathcal{L}}_{\text{health}}\|

Why Balance Matters

Gradient Interference

When tasks share parameters, their gradients can interfere:

cos(θ)=RULhealthRULhealth\cos(\theta) = \frac{\nabla_{\text{RUL}} \cdot \nabla_{\text{health}}}{\|\nabla_{\text{RUL}}\| \|\nabla_{\text{health}}\|}
  • cos(θ)>0\cos(\theta) > 0: Positive transfer (gradients aligned)
  • cos(θ)<0\cos(\theta) < 0: Negative transfer (gradients conflict)
  • cos(θ)0\cos(\theta) \approx 0: Independent tasks

For RUL and health, we observe:

Training Phasecos(θ)Interpretation
Early (epoch 1-20)0.6-0.8Strong alignment
Mid (epoch 20-50)0.4-0.6Moderate alignment
Late (epoch 50+)0.3-0.5Tasks specialize, less overlap

Positive Transfer

The positive gradient correlation confirms that health classification helps RUL prediction—the tasks share useful features. This is why the auxiliary task provides effective regularization.


Information-Theoretic View

Information theory provides another lens to understand equal weighting.

Mutual Information Perspective

The encoder learns a representation Z that should maximize:

I(Z;YRUL)+I(Z;Yhealth)I(Z; Y_{\text{RUL}}) + I(Z; Y_{\text{health}})

Subject to the constraint that Z is computed from input X:

Z=fθ(X)Z = f_\theta(X)

Equal weighting maximizes the joint information extraction from Z.

Label Entropy Analysis

Compare the entropy of each task's labels:

Information Bottleneck Principle

The Information Bottleneck (IB) framework suggests the optimal representation Z minimizes:

LIB=I(X;Z)βI(Z;Y)\mathcal{L}_{\text{IB}} = I(X; Z) - \beta I(Z; Y)

For multi-task learning, this extends to:

LMTL-IB=I(X;Z)β1I(Z;YRUL)β2I(Z;Yhealth)\mathcal{L}_{\text{MTL-IB}} = I(X; Z) - \beta_1 I(Z; Y_{\text{RUL}}) - \beta_2 I(Z; Y_{\text{health}})

When β1=β2\beta_1 = \beta_2, the representation must capture information relevant to both tasks equally, leading to more generalizable features.


Empirical Validation

Theory predicts 0.5/0.5 is optimal. Experiments confirm this across all datasets.

Weight Sensitivity Analysis

Fine-grained weight sweep results (RMSE):

λ_RULFD001FD002FD003FD004
0.4011.315.611.918.8
0.4511.014.811.518.1
0.5010.813.911.217.4
0.5511.014.611.417.9
0.6011.115.411.718.2

The optimal is consistently at or very near λ = 0.5.

Performance Degradation Analysis

How much does performance degrade as we deviate from 0.5?

📝text
1Deviation from λ = 0.5 vs. RMSE increase (average across datasets):
2
3Δλ    | RMSE Increase | Relative Degradation
4------+---------------+---------------------
5±0.00 |     0.00      | Baseline (optimal)
6±0.05 |     0.45      | +3.4%
7±0.10 |     1.10      | +8.3%
8±0.15 |     1.85      | +13.9%
9±0.20 |     2.70      | +20.3%
10±0.30 |     4.30      | +32.3%
11±0.40 |     6.20      | +46.6%

Performance degrades roughly quadratically as we deviate from the optimal.

Ablation: Normalization Effect

Does equal weighting work without normalization?

ConfigurationFD002 RMSENotes
0.5/0.5 + AMNL normalization13.9Best performance
0.5/0.5 no normalization16.8RUL dominates gradients
Tuned weights, no norm15.9Manual tuning helps, but worse
0.5/0.5 batch normalization15.2Helps, but not optimal

Normalization is Essential

Equal weighting alone is not sufficient. The combination of EMA normalization + equal weights is what produces state-of-the-art results. Without normalization, equal weights actually perform worse than tuned unequal weights.

Cross-Architecture Validation

Does 0.5/0.5 generalize across different encoder architectures?

EncoderOptimal λ_RULRMSE at λ=0.5Best RMSE
LSTM0.5013.913.9
GRU0.5014.214.2
Transformer0.5014.814.7
TCN0.5015.115.1
1D-CNN0.5015.415.3

The optimal weight is consistently 0.5 (or within 0.05) across all encoder architectures tested.


Summary

In this section, we established why 0.5/0.5 provides superior regularization:

  1. Regularization theory: Maximizes λ(1λ)\lambda(1-\lambda) at λ = 0.5
  2. Gradient balance: Equal contribution from both tasks
  3. Information theory: Balanced extraction of complementary information
  4. Empirical validation: Consistent across datasets and architectures
  5. Normalization essential: Equal weights only work with proper normalization
PerspectiveWhy 0.5/0.5 is Optimal
RegularizationMaximum constraint from auxiliary task
Gradient flowBalanced updates to shared encoder
InformationEqual extraction of RUL and health signals
Bias-varianceOptimal trade-off point
EmpiricalConsistent best performance across all tests
Looking Ahead: With the theoretical foundation established, the final section presents the complete PyTorch implementation of AMNL—production-ready code that you can use directly in your own predictive maintenance projects.

Having understood why equal weighting works, we now implement AMNL in code.