AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the multi-task learning problem as loss combination
Examine the actual baseline implementation from our research
Identify weight selection challenges in practice
Recognize loss scale mismatch as a fundamental issue
Appreciate why fixed weights are suboptimal

Why This Matters: Before understanding AMNL's innovation, we must understand what came before. Fixed weight multi-task learning is the simplest approach and serves as the baseline against which all advanced methods are compared. Its limitations motivated the development of adaptive weighting schemes.

The Multi-Task Learning Problem

In multi-task learning (MTL), we train a single model on multiple related tasks simultaneously. The central question is: how do we combine task losses into a single optimization objective?

The AMNL Setting

Our model has two tasks with different characteristics:

Task	Type	Loss	Scale
RUL Prediction	Regression	MSE	~100-10000
Health Classification	Classification	Cross-Entropy	~0.1-3

The fundamental challenge: these losses have vastly different scales and semantics. How do we meaningfully combine them?

The Naive Approach

The simplest solution is to just add the losses:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{RUL}} + \mathcal{L}_{\text{health}}

This is problematic because the MSE loss for RUL (values in the hundreds to thousands) dwarfs the cross-entropy loss for health (values near 1), so the summed objective is almost entirely RUL error.

Fixed Weight Formulation

To address scale mismatch, we introduce fixed weights for each task.

Mathematical Formulation

\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{RUL}} + \lambda_2 \mathcal{L}_{\text{health}}

Where:

$\lambda_1$ : Weight for RUL task
$\lambda_2$ : Weight for health task
$\mathcal{L}_{\text{RUL}}$ : RUL regression loss (MSE or weighted MSE)
$\mathcal{L}_{\text{health}}$ : Health classification loss (cross-entropy)

Common Weight Schemes

Scheme	λ₁ (RUL)	λ₂ (Health)	Rationale
Equal	1.0	1.0	Treat tasks equally (ignores scale)
Normalized	0.5	0.5	Sum to 1 (still ignores scale)
Scale-balanced	0.001	1.0	Manual scale matching
Tuned	0.7	0.3	Grid search optimized (our default)
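
To see what each scheme does to the balance between tasks, the sketch below computes the fraction of the weighted total that comes from the health term. The loss values (1000 and 1) are illustrative assumptions picked from within the ranges in the scale table, and `health_share` is a hypothetical helper, not part of the research codebase.

```python
# Illustrative loss values (assumptions): the scale table gives ranges, not these numbers.
RUL_LOSS = 1000.0    # assumed MSE value
HEALTH_LOSS = 1.0    # assumed cross-entropy value

# Weight schemes copied from the table: (lambda_1 for RUL, lambda_2 for health).
SCHEMES = {
    "Equal": (1.0, 1.0),
    "Normalized": (0.5, 0.5),
    "Scale-balanced": (0.001, 1.0),
    "Tuned": (0.7, 0.3),
}


def health_share(rul_weight, health_weight, rul_loss=RUL_LOSS, health_loss=HEALTH_LOSS):
    """Fraction of the weighted total loss that comes from the health term."""
    rul_term = rul_weight * rul_loss
    health_term = health_weight * health_loss
    return health_term / (rul_term + health_term)


shares = {name: health_share(w1, w2) for name, (w1, w2) in SCHEMES.items()}

for name, share in shares.items():
    print(f"{name:<15} health share of total loss: {share:.4%}")
```

Under these assumed values, only the scale-balanced scheme gives the health term a meaningful share of the objective. Equal and Normalized produce identical shares because only the ratio of the weights matters.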

Research Implementation

Below is the actual implementation from our research codebase. This is the baseline loss function used in ablation studies to compare against AMNL.

Fixed Weight Loss: Baseline Implementation

🐍losses/multitask_losses.py

Class Definition

FixedWeightLoss inherits from nn.Module for PyTorch integration. This is the baseline against which AMNL is compared.

Baseline Method

This represents the simplest multi-task approach: manually set fixed weights that remain constant throughout training.

Default Weights

Default values of 0.7/0.3 prioritize RUL prediction over health classification. These are hyperparameters that require tuning.

Example: weight ratios tested were 0.9/0.1, 0.7/0.3, and 0.5/0.5.

Weight Storage

Weights are stored as plain instance attributes. Unlike AMNL, they never change during training, which is a key limitation.

Forward Method

Receives pre-computed losses from each task. The **kwargs argument is accepted and ignored so the call signature stays compatible with other loss functions that need additional parameters.

Weighted Sum

The core computation is a simple weighted addition. This is the fundamental limitation: the weights are static regardless of loss magnitudes.

Example: if rul_loss=1000 and health_loss=1, then 0.7×1000 + 0.3×1 = 700.3, so the health term contributes about 0.04% of the total.

Name Method

Returns a descriptive string for logging and experiment tracking. Used in ablation study result tables.

import torch.nn as nn


class FixedWeightLoss(nn.Module):
    """
    Baseline: Traditional fixed-weight multi-task loss
    Reference: Standard practice in multi-task learning
    """

    def __init__(self, rul_weight=0.7, health_weight=0.3):
        super().__init__()
        # Plain floats, not parameters: the optimizer never updates them
        self.rul_weight = rul_weight
        self.health_weight = health_weight

    def forward(self, rul_loss, health_loss, **kwargs):
        """
        Args:
            rul_loss: RUL regression loss
            health_loss: Health classification loss

        Returns:
            Combined loss
        """
        # Extra keyword arguments are accepted and ignored
        return self.rul_weight * rul_loss + self.health_weight * health_loss

    def get_name(self):
        # Label used in logs and result tables, e.g. "Fixed(0.7/0.3)"
        return f"Fixed({self.rul_weight:.1f}/{self.health_weight:.1f})"

Usage in Ablation Studies

The fixed weight loss is instantiated with different weight combinations in our ablation experiments:

🐍python

# Ablation study configurations (from run_ablation_studies.py)
ABLATION_CONFIGS = {
    'fixed_70_30': {
        'name': 'Fixed 0.7/0.3',
        'description': 'RUL-prioritized weighting',
        'changes': {'rul_weight': 0.7, 'health_weight': 0.3},
    },
    'fixed_50_50': {
        'name': 'Fixed 0.5/0.5',
        'description': 'Equal weighting without normalization',
        'changes': {'rul_weight': 0.5, 'health_weight': 0.5},
    },
    'fixed_90_10': {
        'name': 'Fixed 0.9/0.1',
        'description': 'Strong RUL preference',
        'changes': {'rul_weight': 0.9, 'health_weight': 0.1},
    },
}
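
One plausible way to turn these configurations into loss objects is to unpack each `changes` dictionary into the constructor. The sketch below assumes that wiring (the source does not show how run_ablation_studies.py consumes the configs), repeats the baseline class so it runs on its own, and uses random tensors with an assumed three health classes in place of real model outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FixedWeightLoss(nn.Module):
    """Same weighted sum as the baseline class, repeated so this snippet runs alone."""

    def __init__(self, rul_weight=0.7, health_weight=0.3):
        super().__init__()
        self.rul_weight = rul_weight
        self.health_weight = health_weight

    def forward(self, rul_loss, health_loss, **kwargs):
        return self.rul_weight * rul_loss + self.health_weight * health_loss

    def get_name(self):
        return f"Fixed({self.rul_weight:.1f}/{self.health_weight:.1f})"


# Only the 'changes' entries of the ablation configs are needed to build each loss.
CHANGES = {
    "fixed_70_30": {"rul_weight": 0.7, "health_weight": 0.3},
    "fixed_50_50": {"rul_weight": 0.5, "health_weight": 0.5},
    "fixed_90_10": {"rul_weight": 0.9, "health_weight": 0.1},
}
criteria = {key: FixedWeightLoss(**changes) for key, changes in CHANGES.items()}

# Random stand-ins for model outputs and labels (hypothetical shapes).
torch.manual_seed(0)
rul_pred = torch.randn(8, requires_grad=True)
rul_target = torch.randn(8)
health_logits = torch.randn(8, 3, requires_grad=True)  # 3 health classes assumed
health_target = torch.randint(0, 3, (8,))

rul_loss = F.mse_loss(rul_pred, rul_target)
health_loss = F.cross_entropy(health_logits, health_target)

totals = {key: crit(rul_loss, health_loss) for key, crit in criteria.items()}
for key, total in totals.items():
    print(f"{criteria[key].get_name()}: total loss = {total.item():.4f}")

# The combined loss is an ordinary scalar tensor, so backward() works as usual.
totals["fixed_70_30"].backward()
```

Because the weights are plain floats rather than parameters, nothing about the loss object changes between the first and last training step.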

Weight Selection Challenges

Choosing appropriate fixed weights is surprisingly difficult.

The Hyperparameter Search Problem

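With two tasks, we have two weights to tune, although the overall scale mostly acts like a learning-rate multiplier, so the ratio $\lambda_1 / \lambda_2$ is what sets the balance between tasks. Common approaches: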

Grid search: Try combinations like (0.1, 0.9), (0.5, 0.5), (0.9, 0.1), etc.
Random search: Sample weights from distributions
Bayesian optimization: Guided search based on validation performance
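
The grid search option can be sketched as a loop over candidate RUL weights, with the health weight constrained so the pair sums to 1. Everything here is hypothetical: `grid_search_weights` is an illustrative helper, and `toy_evaluate` is an invented placeholder for a full training run that returns a validation metric where lower is better.

```python
from typing import Callable, Iterable, Tuple


def grid_search_weights(
    rul_weights: Iterable[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[Tuple[float, float], float, int]:
    """Try each (rul_weight, 1 - rul_weight) pair and keep the lowest score.

    `evaluate` is expected to train a model with the given weights and return
    a validation metric where lower is better. Each call is one training run.
    """
    best_pair, best_score, runs = None, float("inf"), 0
    for rul_weight in rul_weights:
        health_weight = round(1.0 - rul_weight, 10)  # constrain weights to sum to 1
        score = evaluate(rul_weight, health_weight)
        runs += 1
        if score < best_score:
            best_pair, best_score = (rul_weight, health_weight), score
    return best_pair, best_score, runs


def toy_evaluate(rul_weight: float, health_weight: float) -> float:
    """Placeholder for a real training run; this curve is invented for the demo."""
    return (rul_weight - 0.7) ** 2


best_pair, best_score, runs = grid_search_weights([0.1, 0.3, 0.5, 0.7, 0.9], toy_evaluate)
print(f"best weights: {best_pair}, training runs needed: {runs}")
```

The `runs` counter is the point of the sketch: with a real `evaluate`, every grid point costs one complete training run, and the search must be repeated for each dataset.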

Problems with Fixed Weights

Expensive search: Each weight combination requires full training
Dataset-specific: Optimal weights differ across the four C-MAPSS subsets, FD001-FD004
Training dynamics: Optimal weights change during training
No principled selection: Weights are arbitrary, not derived from data

Loss Scale Mismatch

Even with careful weight tuning, fundamental issues remain.

Dynamic Scale Changes

Loss magnitudes change during training:

📝text

Training progression (from actual runs):

Epoch 1:   L_RUL = 2500,  L_health = 1.8
Epoch 10:  L_RUL = 800,   L_health = 0.9
Epoch 50:  L_RUL = 200,   L_health = 0.4
Epoch 100: L_RUL = 100,   L_health = 0.2

Relative scale (L_RUL / L_health):
Epoch 1:   1389x
Epoch 10:  889x
Epoch 50:  500x
Epoch 100: 500x

A fixed weight that balances the losses at epoch 1 will not balance them at epoch 100: the ratio falls from roughly 1389x to 500x, so the relative contribution of the two tasks shifts during training.
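
The drift can be made concrete with the logged values. The sketch below freezes the RUL weight that gives an even split at epoch 1 (with the health weight set to 1, an assumption made for simplicity) and tracks how the split changes. The helper names are hypothetical.

```python
# Loss values copied from the training log: epoch -> (L_RUL, L_health).
LOSS_LOG = {
    1: (2500.0, 1.8),
    10: (800.0, 0.9),
    50: (200.0, 0.4),
    100: (100.0, 0.2),
}


def balancing_rul_weight(rul_loss, health_loss):
    """RUL weight that equalizes both weighted terms when the health weight is 1."""
    return health_loss / rul_loss


def rul_share(rul_weight, rul_loss, health_loss, health_weight=1.0):
    """Fraction of the weighted total contributed by the RUL term."""
    rul_term = rul_weight * rul_loss
    return rul_term / (rul_term + health_weight * health_loss)


# Freeze the weight that gives a 50/50 split at epoch 1, then track the split.
frozen_weight = balancing_rul_weight(*LOSS_LOG[1])
shares = {epoch: rul_share(frozen_weight, *losses) for epoch, losses in LOSS_LOG.items()}

for epoch, share in shares.items():
    needed = balancing_rul_weight(*LOSS_LOG[epoch])
    print(f"epoch {epoch:>3}: RUL share = {share:.1%}, balancing weight = {needed:.5f}")
```

With the frozen weight, the RUL share of the objective falls from 50% at epoch 1 to roughly 26% by epoch 100, while the weight that would restore balance nearly triples.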

Gradient Magnitude Analysis

What matters for optimization is not loss magnitude but gradient magnitude:

\frac{\partial \mathcal{L}_{\text{total}}}{\partial \theta} = \lambda_1 \frac{\partial \mathcal{L}_{\text{RUL}}}{\partial \theta} + \lambda_2 \frac{\partial \mathcal{L}_{\text{health}}}{\partial \theta}

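If $\left\|\frac{\partial \mathcal{L}_{\text{RUL}}}{\partial \theta}\right\| \gg \left\|\frac{\partial \mathcal{L}_{\text{health}}}{\partial \theta}\right\|$, the RUL task dominates parameter updates unless $\lambda_1$ is small enough to offset the gap. Weights chosen to balance loss values do not guarantee this, because loss magnitude and gradient norm need not scale together.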

Gradient Domination

Even if loss values are balanced via weights, gradient magnitudes may still be imbalanced. This can cause one task to dominate learning, leading to poor performance on the other task.
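
A single-sample toy case illustrates the gap between balanced losses and balanced gradients. The sketch below is a simplification under stated assumptions: it uses binary cross-entropy in place of the multi-class health loss, takes gradients with respect to each head's output rather than the shared parameters $\theta$ (the chain rule carries these into the shared layers), and the sample values are hypothetical.

```python
import math


def mse_and_grad(pred, target):
    """Squared error for one sample and its derivative with respect to pred."""
    error = pred - target
    return error ** 2, 2.0 * error


def bce_and_grad(logit, label):
    """Binary cross-entropy for one sample and its derivative with respect to logit."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    loss = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return loss, prob - label


# Hypothetical sample: RUL prediction off by 30 cycles, health logit mildly wrong.
rul_loss, rul_grad = mse_and_grad(pred=95.0, target=125.0)
health_loss, health_grad = bce_and_grad(logit=-0.5, label=1)

# Choose the RUL weight so both weighted loss values are equal (health weight = 1).
rul_weight = health_loss / rul_loss
weighted_rul_grad = rul_weight * rul_grad
grad_ratio = abs(weighted_rul_grad) / abs(health_grad)

print(f"weighted losses:    {rul_weight * rul_loss:.4f} vs {health_loss:.4f}")
print(f"weighted gradients: {abs(weighted_rul_grad):.4f} vs {abs(health_grad):.4f}")
print(f"gradient ratio (RUL / health): {grad_ratio:.3f}")
```

In this toy setup the weighted losses match exactly, yet the weighted gradients differ by roughly an order of magnitude, here in favor of the health task. The direction of the imbalance depends on the sample. The point is only that equal loss values do not imply equal gradient magnitudes.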

Task Interference

Fixed weights can cause negative transfer when task gradients conflict:

\cos\left(\nabla_\theta \mathcal{L}_{\text{RUL}}, \nabla_\theta \mathcal{L}_{\text{health}}\right) < 0

When gradients point in opposing directions, fixed weights cannot resolve the conflict—they just scale the competing gradients without addressing the underlying tension.
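
The conflict condition is easy to check numerically. The sketch below uses two made-up gradient vectors over three shared parameters (purely hypothetical values) to show that positive weights leave the cosine unchanged, and that the combined update can still move against one task.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def scale(vec, factor):
    return [factor * x for x in vec]


# Hypothetical gradients of each task loss with respect to three shared parameters.
grad_rul = [1.0, -2.0, 0.5]
grad_health = [-0.8, 1.5, 0.2]

raw_cos = cosine(grad_rul, grad_health)
weighted_cos = cosine(scale(grad_rul, 0.7), scale(grad_health, 0.3))

# The gradient actually applied with fixed weights 0.7 / 0.3.
combined = [0.7 * a + 0.3 * b for a, b in zip(grad_rul, grad_health)]
health_alignment = dot(combined, grad_health)

print(f"cosine before weighting: {raw_cos:.4f}")
print(f"cosine after weighting:  {weighted_cos:.4f}")
print(f"combined gradient . health gradient = {health_alignment:.4f}")
```

A negative `health_alignment` means a descent step along the combined gradient increases the health loss to first order, so with these values the 0.7/0.3 weighting improves RUL at the health task's expense.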

Summary

In this section, we examined fixed weight multi-task loss:

Basic formulation: $\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$
Implementation: ~25 lines of Python, simple but limited
Scale mismatch: MSE (~100 to 10,000) vs CE (~0.1 to 3) requires manual balancing
Weight selection: Expensive grid search, no principled method
Gradient issues: Loss balancing ≠ gradient balancing

Issue	Impact	AMNL Solution
Scale mismatch	One task dominates	Dynamic normalization
Manual tuning	750+ GPU hours	Zero hyperparameters
Static weights	Can't adapt	Per-batch adaptation
Gradient conflicts	Not addressed	Equal contribution

Looking Ahead: Fixed weights are clearly inadequate. The next section introduces Uncertainty Weighting (Kendall et al.)—a principled approach that learns task weights from homoscedastic uncertainty, eliminating manual tuning.

With fixed weight limitations understood, we examine principled adaptive alternatives.