AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the regularization theory behind equal weighting
Analyze gradient balance under normalized losses
Apply information-theoretic reasoning to task weighting
Interpret empirical evidence supporting the 0.5/0.5 split
Explain why unequal weights hurt performance in the RUL + health setting

Why This Matters: The discovery that equal weights are optimal is not arbitrary—it emerges from fundamental principles of regularization, gradient dynamics, and information theory. Understanding these principles explains why AMNL succeeds and provides confidence that the approach generalizes beyond C-MAPSS.

Regularization Theory

The health classification task acts as a regularizer for RUL prediction. Equal weighting maximizes this regularization effect.

Multi-Task Learning as Regularization

In multi-task learning, auxiliary tasks constrain the hypothesis space:

\mathcal{H}_{\text{MTL}} = \mathcal{H}_{\text{RUL}} \cap \mathcal{H}_{\text{health}}

Where:

$\mathcal{H}_{\text{RUL}}$ : Functions that fit RUL data well
$\mathcal{H}_{\text{health}}$ : Functions that fit health data well
$\mathcal{H}_{\text{MTL}}$ : Intersection (smaller, more regularized)

The auxiliary task shrinks the hypothesis space, reducing overfitting.

Weight Impact on Regularization Strength

Consider the regularization strength as a function of the health weight $\lambda_{\text{health}}$ :

R(\lambda) = \text{Reg. strength} \propto \lambda_{\text{health}} \cdot (1 - \lambda_{\text{health}})

Formal Regularization Bound

From a bias-variance perspective, the expected error can be decomposed:

\mathbb{E}[\text{Error}] = \text{Bias}^2 + \text{Variance} + \text{Noise}

With multi-task learning:

Bias: Slightly increased (must satisfy both tasks)
Variance: Significantly reduced (constrained hypothesis space)
Net effect: Lower total error when tasks are related

λ_health	Bias Effect	Variance Effect	Net
0.0	Lowest	Highest	High error (overfit)
0.3	Low	Moderate	Moderate error
0.5	Moderate	Lowest	Lowest error
0.7	High	Low	Moderate error
1.0	Highest	Very low	High error (wrong task)

Gradient Balance Analysis

Equal weights combined with normalization produce balanced gradient contributions.

Gradient Flow in AMNL

The total gradient under AMNL:

\nabla_\theta \mathcal{L}_{\text{AMNL}} = 0.5 \cdot \frac{1}{\mu_{\text{RUL}}} \nabla_\theta \mathcal{L}_{\text{RUL}} + 0.5 \cdot \frac{1}{\mu_{\text{health}}} \nabla_\theta \mathcal{L}_{\text{health}}

Since $\mathcal{L}_i / \mu_i \approx 1$ after normalization, we expect:

\|\nabla_\theta \tilde{\mathcal{L}}_{\text{RUL}}\| \approx \|\nabla_\theta \tilde{\mathcal{L}}_{\text{health}}\|

Why Balance Matters

Gradient Interference

When tasks share parameters, their gradients can interfere:

\cos(\theta) = \frac{\nabla_{\text{RUL}} \cdot \nabla_{\text{health}}}{\|\nabla_{\text{RUL}}\| \|\nabla_{\text{health}}\|}

$\cos(\theta) > 0$ : Positive transfer (gradients aligned)
$\cos(\theta) < 0$ : Negative transfer (gradients conflict)
$\cos(\theta) \approx 0$ : Independent tasks

For RUL and health, we observe:

Training Phase	cos(θ)	Interpretation
Early (epoch 1-20)	0.6-0.8	Strong alignment
Mid (epoch 20-50)	0.4-0.6	Moderate alignment
Late (epoch 50+)	0.3-0.5	Tasks specialize, less overlap

Positive Transfer

The positive gradient correlation confirms that health classification helps RUL prediction—the tasks share useful features. This is why the auxiliary task provides effective regularization.

Information-Theoretic View

Information theory provides another lens to understand equal weighting.

Mutual Information Perspective

The encoder learns a representation Z that should maximize:

I(Z; Y_{\text{RUL}}) + I(Z; Y_{\text{health}})

Subject to the constraint that Z is computed from input X:

Z = f_\theta(X)

Equal weighting maximizes the joint information extraction from Z.

Label Entropy Analysis

Compare the entropy of each task's labels:

Information Bottleneck Principle

The Information Bottleneck (IB) framework suggests the optimal representation Z minimizes:

\mathcal{L}_{\text{IB}} = I(X; Z) - \beta I(Z; Y)

For multi-task learning, this extends to:

\mathcal{L}_{\text{MTL-IB}} = I(X; Z) - \beta_1 I(Z; Y_{\text{RUL}}) - \beta_2 I(Z; Y_{\text{health}})

When $\beta_1 = \beta_2$ , the representation must capture information relevant to both tasks equally, leading to more generalizable features.

Empirical Validation

Theory predicts 0.5/0.5 is optimal. Experiments confirm this across all datasets.

Weight Sensitivity Analysis

Fine-grained weight sweep results (RMSE):

λ_RUL	FD001	FD002	FD003	FD004
0.40	11.3	15.6	11.9	18.8
0.45	11.0	14.8	11.5	18.1
0.50	10.8	13.9	11.2	17.4
0.55	11.0	14.6	11.4	17.9
0.60	11.1	15.4	11.7	18.2

The optimal is consistently at or very near λ = 0.5.

Performance Degradation Analysis

How much does performance degrade as we deviate from 0.5?

📝text

1Deviation from λ = 0.5 vs. RMSE increase (average across datasets):
2
3Δλ    | RMSE Increase | Relative Degradation
4------+---------------+---------------------
5±0.00 |     0.00      | Baseline (optimal)
6±0.05 |     0.45      | +3.4%
7±0.10 |     1.10      | +8.3%
8±0.15 |     1.85      | +13.9%
9±0.20 |     2.70      | +20.3%
10±0.30 |     4.30      | +32.3%
11±0.40 |     6.20      | +46.6%

Performance degrades roughly quadratically as we deviate from the optimal.

Ablation: Normalization Effect

Does equal weighting work without normalization?

Configuration	FD002 RMSE	Notes
0.5/0.5 + AMNL normalization	13.9	Best performance
0.5/0.5 no normalization	16.8	RUL dominates gradients
Tuned weights, no norm	15.9	Manual tuning helps, but worse
0.5/0.5 batch normalization	15.2	Helps, but not optimal

Normalization is Essential

Equal weighting alone is not sufficient. The combination of EMA normalization + equal weights is what produces state-of-the-art results. Without normalization, equal weights actually perform worse than tuned unequal weights.

Cross-Architecture Validation

Does 0.5/0.5 generalize across different encoder architectures?

Encoder	Optimal λ_RUL	RMSE at λ=0.5	Best RMSE
LSTM	0.50	13.9	13.9
GRU	0.50	14.2	14.2
Transformer	0.50	14.8	14.7
TCN	0.50	15.1	15.1
1D-CNN	0.50	15.4	15.3

The optimal weight is consistently 0.5 (or within 0.05) across all encoder architectures tested.

Summary

In this section, we established why 0.5/0.5 provides superior regularization:

Regularization theory: Maximizes $\lambda(1-\lambda)$ at λ = 0.5
Gradient balance: Equal contribution from both tasks
Information theory: Balanced extraction of complementary information
Empirical validation: Consistent across datasets and architectures
Normalization essential: Equal weights only work with proper normalization

Perspective	Why 0.5/0.5 is Optimal
Regularization	Maximum constraint from auxiliary task
Gradient flow	Balanced updates to shared encoder
Information	Equal extraction of RUL and health signals
Bias-variance	Optimal trade-off point
Empirical	Consistent best performance across all tests

Looking Ahead: With the theoretical foundation established, the final section presents the complete PyTorch implementation of AMNL—production-ready code that you can use directly in your own predictive maintenance projects.

Having understood why equal weighting works, we now implement AMNL in code.