AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Master the complete AMNL formulation
Understand the actual research implementation
Derive the gradient properties of normalized losses
Appreciate the scale-invariance AMNL provides
Contrast AMNL with previous methods

Why This Matters: AMNL (Adaptive Magnitude-Normalized Loss) combines loss normalization with equal weighting to achieve state-of-the-art results. Understanding its mathematical formulation reveals why it succeeds where other methods fail, and provides the foundation for implementing it correctly.

AMNL Overview

AMNL is deceptively simple: normalize each loss by its current magnitude, then combine with equal weights.

The Core Formula

$$\mathcal{L}_{\text{AMNL}} = \lambda_{\text{RUL}} \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + \lambda_{\text{health}} \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}$$

Where:

$\lambda_{\text{RUL}} = \lambda_{\text{health}} = 0.5$ : Equal task weights (discovered optimal through ablation)
$\mathcal{L}_{\text{RUL}}$ : RUL regression loss (weighted MSE)
$\mathcal{L}_{\text{health}}$ : Health classification loss (cross-entropy)
$\mu_{\text{RUL}}, \mu_{\text{health}}$ : Current loss magnitudes, floored at a minimum scale and used as constant normalization factors
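To make the formula concrete, here is a minimal sketch that evaluates it on plain Python floats. The function name `amnl_value` is hypothetical, and the floor defaults (1.0 and 0.1) are taken from the implementation listed later in this section. No autograd is involved; it only shows the arithmetic.

```python
def amnl_value(rul_loss, health_loss, rul_min_scale=1.0, health_min_scale=0.1):
    """Evaluate the AMNL formula on plain floats (illustration only)."""
    # mu_RUL and mu_health: current loss magnitudes, floored at a minimum scale
    mu_rul = max(rul_loss, rul_min_scale)
    mu_health = max(health_loss, health_min_scale)
    # lambda_RUL = lambda_health = 0.5
    return 0.5 * rul_loss / mu_rul + 0.5 * health_loss / mu_health


# Both losses above their floors: each term normalizes to 1 before weighting.
print(amnl_value(1000.0, 1.2))
# Both losses below their floors: the terms shrink below 0.5 each.
print(amnl_value(0.01, 0.05))
```

When both raw losses exceed their floors, the combined value is 1 regardless of how different the raw magnitudes are.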

Why This Works

Component	Role	Effect
Normalization (÷μ)	Scale losses to ~1	Equal gradient contribution
Equal weights (0.5)	Balance tasks	Maximum regularization
Dynamic scaling	Adapt to training	No manual tuning needed

Research Implementation

Below is the actual implementation from our research codebase. This is the exact code used to achieve state-of-the-art results on all C-MAPSS datasets.

AMNL: Adaptive Magnitude-Normalized Loss Implementation

File: losses/multitask_losses.py

Explanation

Class Definition

AMNLLoss inherits from nn.Module, making it a PyTorch-compatible loss function that integrates seamlessly with autograd for gradient computation.

Docstring

AMNL (Adaptive Magnitude-Normalized Loss) is our novel contribution. The key insight: normalize each loss before weighting, so that neither task dominates the gradient simply because of its raw scale.

Zero Hyperparameters

Unlike Uncertainty Weighting or GradNorm, AMNL has no task weights to learn or tune. The 0.5/0.5 split works optimally across all C-MAPSS datasets, and the two minimum scales are left at their defaults.

Minimum Scales

These prevent division by very small numbers. rul_min_scale=1.0 ensures stability when the RUL loss approaches zero in late training.

Example: if rul_loss=0.01, the scale is clamped to 1.0. Dividing by 0.01 instead would multiply the gradient by 100.

Forward Method

Called during each training step. Receives the pre-computed task losses (MSE for RUL, cross-entropy for health) as scalar tensors.

Dynamic Normalization

The core innovation: use the current loss magnitude as the normalization factor. This adapts automatically as training progresses, without any learned parameters.

Example: at epoch 1, rul_loss=1000 gives scale=1000; at epoch 100, rul_loss=50 gives scale=50.

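RUL Scale Computation

max() ensures we never divide by less than rul_min_scale. Calling .item() returns a plain Python float, so the scale factor is a constant outside the computation graph.

Loss Normalization

After division, each normalized loss equals 1.0 whenever the raw loss is above its minimum scale, and is smaller than 1.0 otherwise. Each task's gradient is divided by its scale, which keeps the task with the larger raw loss from dominating updates to the shared encoder.

Example: 1000/1000 = 1.0 and 1.2/1.2 = 1.0, so both terms enter the sum with the same magnitude.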

Equal Weighting

The 0.5/0.5 split is not arbitrary: our experiments showed this provides maximum regularization benefit. See the ablation studies in Chapter 15.

Loss Factory Name

Used by the loss factory function get_loss_function() to instantiate AMNL by name in configuration files.

Effective Weights

Utility method for logging and debugging. Shows how the dynamic normalization translates to effective task weights at each step.

Weight Normalization

Normalizes effective weights to sum to 1 for interpretability. Useful for TensorBoard logging during training.

Code

import torch.nn as nn


class AMNLLoss(nn.Module):
    """
    AMNL: Adaptive Magnitude-Normalized Loss (Your Novel Method)

    Automatically balances multi-task losses through magnitude normalization
    - Zero hyperparameters
    - Prevents mode collapse
    - Simple and interpretable
    """
    def __init__(self, rul_min_scale=1.0, health_min_scale=0.1):
        super().__init__()
        self.rul_min_scale = rul_min_scale
        self.health_min_scale = health_min_scale

    def forward(self, rul_loss, health_loss, **kwargs):
        """
        AMNL loss balancing

        Args:
            rul_loss: RUL regression loss (typically MSE)
            health_loss: Health classification loss (typically CrossEntropy)

        Returns:
            Balanced combined loss
        """
        # Dynamic magnitude normalization
        # (.item() yields Python floats, so the scales are constants for autograd)
        rul_scale = max(rul_loss.item(), self.rul_min_scale)
        health_scale = max(health_loss.item(), self.health_min_scale)

        # Normalize losses to same magnitude
        normalized_rul = rul_loss / rul_scale
        normalized_health = health_loss / health_scale

        # Equal weighting after normalization
        total_loss = 0.5 * normalized_rul + 0.5 * normalized_health

        return total_loss

    def get_name(self):
        return "AMNL"

    def get_effective_weights(self, rul_loss, health_loss):
        """
        Compute effective weights for logging

        Since AMNL uses dynamic normalization, weights change per batch
        """
        rul_scale = max(rul_loss.item(), self.rul_min_scale)
        health_scale = max(health_loss.item(), self.health_min_scale)

        effective_rul = 0.5 / rul_scale
        effective_health = 0.5 / health_scale

        total = effective_rul + effective_health

        return {
            'rul_weight': effective_rul / total,
            'health_weight': effective_health / total,
            'rul_scale': rul_scale,
            'health_scale': health_scale
        }
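To see what get_effective_weights reports, the sketch below mirrors its arithmetic on plain floats, so it runs without PyTorch. The function name `effective_weights` is hypothetical, and the late-training health loss of 0.3 is an assumed value chosen for illustration.

```python
def effective_weights(rul_loss, health_loss, rul_min_scale=1.0, health_min_scale=0.1):
    """Float-only mirror of AMNLLoss.get_effective_weights (illustration only)."""
    rul_scale = max(rul_loss, rul_min_scale)
    health_scale = max(health_loss, health_min_scale)
    # Multiplier that each raw loss receives inside the combined loss
    effective_rul = 0.5 / rul_scale
    effective_health = 0.5 / health_scale
    total = effective_rul + effective_health
    return {
        "rul_weight": effective_rul / total,
        "health_weight": effective_health / total,
    }


early = effective_weights(1000.0, 1.2)  # large RUL loss early in training
late = effective_weights(50.0, 0.3)     # assumed values later in training
print(early)
print(late)
```

The reported weights are the multipliers applied to the raw losses, so they are inversely proportional to the loss magnitudes: a task with a large raw loss shows a small effective weight, even though both normalized terms contribute 0.5 to the total.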

Key Implementation Detail

Notice that we use rul_loss.item() for the scale factor but keep rul_loss (without .item()) in the numerator. This is critical: the scale factor is a constant for gradient computation, but the loss itself must remain in the computation graph.
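The effect of treating the scale as a constant can be checked numerically. The sketch below is an analogy in plain Python, not the PyTorch code path: it uses a hypothetical one-parameter loss and a central finite difference in place of autograd. Holding the scale fixed corresponds to using .item(); recomputing it from the parameter corresponds to dividing by the tensor itself.

```python
def raw_loss(w):
    """Hypothetical one-parameter loss that stays above the 1.0 floor."""
    return (w - 3.0) ** 2 + 2.0


def central_diff(f, w, eps=1e-6):
    """Numerical derivative of f at w."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w0 = 0.0
frozen_scale = max(raw_loss(w0), 1.0)  # plays the role of rul_loss.item()

# Scale held constant: the normalized loss still depends on w.
grad_frozen = central_diff(lambda w: raw_loss(w) / frozen_scale, w0)

# Scale recomputed from w: the ratio is identically 1, so the derivative vanishes.
grad_tracked = central_diff(lambda w: raw_loss(w) / max(raw_loss(w), 1.0), w0)

print(grad_frozen)   # close to raw_loss'(0) / raw_loss(0) = -6 / 11
print(grad_tracked)  # close to 0
```

If the scale were allowed to carry gradient, the normalized loss would be the constant 1 and the model would receive no learning signal from that task.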

Normalization Mechanism

The key innovation is using the current loss magnitude as the normalization factor, rather than learned parameters or running averages.

Dynamic vs Static Normalization

Approach	Normalization Factor	Pros	Cons
Fixed scale	Constant (e.g., 100)	Simple	Doesn't adapt
EMA	Running average	Smooth	Lags behind
AMNL (ours)	Current loss value	Instant adaptation	Slightly noisier
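The three rows of the table can be compared on a toy loss curve. This is a sketch under stated assumptions: the loss sequence, the fixed scale of 100, the EMA decay of 0.9, and the EMA initialisation are all made-up values, and the minimum-scale floor is omitted for brevity.

```python
def normalized_trace(losses, mode, fixed_scale=100.0, ema_decay=0.9):
    """Return loss / scale at each step under one normalization strategy."""
    out = []
    ema = losses[0]  # assumption: EMA initialised at the first loss value
    for loss in losses:
        if mode == "fixed":
            scale = fixed_scale
        elif mode == "ema":
            ema = ema_decay * ema + (1.0 - ema_decay) * loss
            scale = ema
        elif mode == "current":
            scale = loss
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(loss / scale)
    return out


# Hypothetical loss curve that halves at every step.
losses = [1000.0 * 0.5 ** step for step in range(6)]

for mode in ("fixed", "ema", "current"):
    print(mode, [round(v, 3) for v in normalized_trace(losses, mode)])
```

On a falling loss, the fixed scale lets the normalized value drift with the raw loss, the EMA scale stays too large and pushes the normalized value well below 1, and the current-value scale holds it at 1 on every step.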

Why Current Value Works

Minimum Scale Protection

The rul_min_scale=1.0 and health_min_scale=0.1 prevent division by very small numbers. In late training when losses are near zero, this prevents gradient explosion.
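A quick way to see the protection is to look at the factor that multiplies the raw RUL gradient, 0.5 / μ_RUL, with and without the floor. The helper names below are hypothetical and the loss values are illustrative.

```python
def rul_gradient_multiplier(rul_loss, rul_min_scale=1.0):
    """Factor applied to the raw RUL gradient: 0.5 / mu_RUL."""
    return 0.5 / max(rul_loss, rul_min_scale)


def unprotected_multiplier(rul_loss):
    """Same factor without the floor (for comparison only)."""
    return 0.5 / rul_loss


for loss in (1000.0, 1.0, 0.01, 0.0001):
    print(loss, rul_gradient_multiplier(loss), unprotected_multiplier(loss))
```

With the floor in place the multiplier never exceeds 0.5 / rul_min_scale, whereas the unprotected version grows without bound as the loss approaches zero.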

Complete AMNL Formula

Putting it all together, the complete AMNL loss computation:

Step-by-Step Formulation

Step 1: Compute raw losses

$$\mathcal{L}_{\text{RUL}} = \frac{1}{N}\sum_{i=1}^{N} w_i (y_i - \hat{y}_i)^2$$

$$\mathcal{L}_{\text{health}} = -\frac{1}{N}\sum_{i=1}^{N} \log p(c_i | x_i)$$

Step 2: Compute scale factors (treated as constants, so no gradient flows through them)

$$\mu_{\text{RUL}} = \max(\mathcal{L}_{\text{RUL}}, 1.0)$$

$$\mu_{\text{health}} = \max(\mathcal{L}_{\text{health}}, 0.1)$$

Step 3: Normalize and combine

$$\mathcal{L}_{\text{AMNL}} = 0.5 \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}$$
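The three steps can be traced end to end on a tiny batch. The sketch below uses plain Python lists in place of tensors; the batch values, the sample weights, and the helper names are all hypothetical (in particular, the weights are made-up numbers, not the RUL-dependent scheme used in the research code).

```python
import math


def weighted_mse(y_true, y_pred, weights):
    """Step 1: (1/N) * sum_i w_i * (y_i - y_hat_i)^2."""
    n = len(y_true)
    return sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights)) / n


def cross_entropy(true_class_probs):
    """Step 1: -(1/N) * sum_i log p(c_i | x_i), given each true-class probability."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


def amnl(rul_loss, health_loss, rul_min_scale=1.0, health_min_scale=0.1):
    mu_rul = max(rul_loss, rul_min_scale)            # Step 2
    mu_health = max(health_loss, health_min_scale)   # Step 2
    return 0.5 * rul_loss / mu_rul + 0.5 * health_loss / mu_health  # Step 3


# Hypothetical batch of three samples.
y_true = [100.0, 50.0, 10.0]
y_pred = [90.0, 55.0, 12.0]
weights = [1.0, 1.5, 2.0]            # made-up sample weights
true_class_probs = [0.7, 0.6, 0.9]   # predicted probability of the correct class

rul_loss = weighted_mse(y_true, y_pred, weights)
health_loss = cross_entropy(true_class_probs)
print(rul_loss, health_loss, amnl(rul_loss, health_loss))
```

Here the raw losses differ by roughly two orders of magnitude, yet both sit above their floors, so each normalized term is 1 and the combined value is 1.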

Symbol Table

Symbol	Meaning	Value in Code
λ_RUL, λ_health	Task weights	0.5, 0.5
μ_RUL	RUL scale factor	max(rul_loss.item(), 1.0)
μ_health	Health scale factor	max(health_loss.item(), 0.1)
wᵢ	Sample weight (RUL-dependent)	Linear decay based on RUL
N	Batch size	256 (default)

Mathematical Properties

AMNL has several desirable mathematical properties that explain its effectiveness.

Property 1: Scale Invariance

Normalized losses equal 1 regardless of raw scale, as long as the raw loss is at or above its minimum scale:

$$\frac{\mathcal{L}_i}{\mu_i} = 1 \quad \text{when } \mathcal{L}_i \ge \text{min\_scale}_i$$

Because each task's gradient is divided by $\mu_i$, the gradient contribution from each task is set by its weight (0.5) and is unaffected by the absolute scale of its loss.
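The gradient behind this property can be written out in a few lines. The derivation below is a sketch that assumes the scale factors are held constant during differentiation (as the implementation does with .item()); the symbol θ for the shared parameters and the constant c are notation introduced here.

```latex
% Gradient of the combined loss with \mu_i treated as constants
\nabla_\theta \mathcal{L}_{\text{AMNL}}
  = \frac{\lambda_{\text{RUL}}}{\mu_{\text{RUL}}} \nabla_\theta \mathcal{L}_{\text{RUL}}
  + \frac{\lambda_{\text{health}}}{\mu_{\text{health}}} \nabla_\theta \mathcal{L}_{\text{health}}

% When \mathcal{L}_i is at or above its minimum scale, \mu_i = \mathcal{L}_i, so
\frac{1}{\mu_i} \nabla_\theta \mathcal{L}_i
  = \frac{\nabla_\theta \mathcal{L}_i}{\mathcal{L}_i}
  = \nabla_\theta \log \mathcal{L}_i

% Rescaling a raw loss by a constant c > 0 leaves this term unchanged
\frac{\nabla_\theta (c\,\mathcal{L}_i)}{c\,\mathcal{L}_i}
  = \frac{\nabla_\theta \mathcal{L}_i}{\mathcal{L}_i}
```

Each task therefore contributes its relative gradient (the gradient of the log-loss) times 0.5. The two gradient norms need not be identical, but neither depends on the units or raw magnitude of its loss.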

Property 2: Zero Hyperparameters

Unlike other multi-task methods:

Method	Hyperparameters	Tuning Required
Fixed Weights	λ₁, λ₂	Grid search over weights
Uncertainty	None (learned)	Sensitive to initialization
GradNorm	α, lr_weights	Requires careful tuning
AMNL	None	Works out of the box

Property 3: Computational Efficiency

AMNL adds negligible overhead:

Method	Extra Computation	Extra Memory
Fixed Weights	2 multiplications	2 scalars
Uncertainty	Exp + log operations	2 learnable params
GradNorm	Full backward pass × 2	Gradient buffers
AMNL	2 divisions	2 scalars

Comparison with Other Methods

Property	AMNL	Uncertainty	GradNorm
Scale invariant	Yes	Partial	Yes
Computational overhead	Minimal	Minimal	~2× backward
Hyperparameters	0	0 (learned)	2 (α, lr)
Stability	High	Medium	Medium
Interpretable	Yes (fixed 0.5/0.5)	No	No
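The "Scale invariant" entry for AMNL can be checked against a fixed-weight baseline with a small numeric sketch. The loss value, gradient, and helper names below are hypothetical; the sketch covers only AMNL and fixed weights, not Uncertainty Weighting or GradNorm.

```python
def fixed_weight_grad(loss_grad, weight=0.5):
    """Gradient contribution of one task under a fixed weight."""
    return weight * loss_grad


def amnl_grad(loss_value, loss_grad, min_scale, weight=0.5):
    """Gradient contribution of one task under AMNL, scale treated as constant."""
    return weight * loss_grad / max(loss_value, min_scale)


# Hypothetical RUL task at one training step: loss value and gradient w.r.t. one parameter.
loss_value, loss_grad = 400.0, -40.0

# Rescaling the targets (e.g. a change of units) multiplies an MSE loss by a constant c.
for c in (1.0, 100.0, 10000.0):
    scaled_value, scaled_grad = c * loss_value, c * loss_grad
    print(c, fixed_weight_grad(scaled_grad), amnl_grad(scaled_value, scaled_grad, min_scale=1.0))
```

The fixed-weight contribution grows in proportion to c, while the AMNL contribution stays the same. The invariance holds only while the rescaled loss remains above its minimum scale.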

Summary

In this section, we presented the mathematical formulation and actual implementation of AMNL:

Core formula: $\mathcal{L} = 0.5 \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}$
Implementation: about 60 lines of Python
Scale invariance: Gradients balanced automatically
Zero hyperparameters: No tuning required
Minimal overhead: Just 2 divisions per step

Parameter	Value
λ_RUL	0.5
λ_health	0.5
rul_min_scale	1.0
health_min_scale	0.1
Typical AMNL value	≈ 1.0

Looking Ahead: We have the complete AMNL framework and implementation. The next section details the RUL loss component—specifically, the weighted MSE that emphasizes low-RUL samples where accurate prediction matters most.

With the AMNL formulation and implementation complete, we examine each loss component in detail.