Chapter 10

Mathematical Formulation of AMNL

AMNL: The Novel Loss Function

Learning Objectives

By the end of this section, you will:

  1. Master the complete AMNL formulation
  2. Understand the actual research implementation
  3. Derive the gradient properties of normalized losses
  4. Appreciate the scale-invariance AMNL provides
  5. Contrast AMNL with previous methods
Why This Matters: AMNL (Adaptive Magnitude-Normalized Loss) combines loss normalization with equal weighting to achieve state-of-the-art results. Understanding its mathematical formulation reveals why it succeeds where other methods fail, and provides the foundation for implementing it correctly.

AMNL Overview

AMNL is deceptively simple: normalize each loss by its current magnitude, then combine with equal weights.

The Core Formula

\mathcal{L}_{\text{AMNL}} = \lambda_{\text{RUL}} \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + \lambda_{\text{health}} \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}

Where:

  • λ_RUL = λ_health = 0.5: Equal task weights (found optimal through ablation)
  • L_RUL: RUL regression loss (weighted MSE)
  • L_health: Health classification loss (cross-entropy)
  • μ_RUL, μ_health: Current loss magnitudes, used as normalization factors
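To make the formula concrete, here is a minimal numeric sketch in plain Python. The loss values are hypothetical, chosen only to illustrate the scale gap between the two tasks:

```python
# Hypothetical raw losses at one training step (illustrative magnitudes only)
rul_loss = 850.0     # weighted MSE: large, since RUL is measured in cycles
health_loss = 1.3    # cross-entropy: naturally of order 1

# Scale factors are the current loss magnitudes themselves
mu_rul = rul_loss
mu_health = health_loss

# Normalize, then combine with equal weights
amnl = 0.5 * (rul_loss / mu_rul) + 0.5 * (health_loss / mu_health)
print(amnl)  # 1.0 -- each task contributes equally, regardless of raw scale
```

With fixed equal weights and no normalization, the RUL term would dominate this combination by a factor of roughly 650.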

Why This Works

| Component | Role | Effect |
|---|---|---|
| Normalization (÷ μ) | Scales losses to ≈ 1 | Equal gradient contribution |
| Equal weights (0.5) | Balances tasks | Maximum regularization |
| Dynamic scaling | Adapts during training | No manual tuning needed |

Research Implementation

Below is the actual implementation from our research codebase. This is the exact code used to achieve state-of-the-art results on all C-MAPSS datasets.

AMNL: Adaptive Magnitude-Normalized Loss Implementation
losses/multitask_losses.py

Line 1: Class Definition

AMNLLoss inherits from nn.Module, making it a PyTorch-compatible loss function that integrates seamlessly with autograd for gradient computation.

Line 3: Docstring

AMNL (Adaptive Magnitude-Normalized Loss) is our novel contribution. The key insight: normalize losses before weighting to ensure equal gradient contribution.

Line 7: Zero Hyperparameters

Unlike Uncertainty Weighting or GradNorm, AMNL requires no tuning. The 0.5/0.5 split works optimally across all C-MAPSS datasets.

Line 10: Minimum Scales

These prevent division by very small numbers. rul_min_scale=1.0 ensures stability when RUL loss approaches zero in late training.

Example: If rul_loss=0.01, we use scale=1.0 instead to avoid gradient explosion.

Line 14: Forward Method

Called during each training step. Receives pre-computed task losses from the model (MSE for RUL, CrossEntropy for health).

Line 24: Dynamic Normalization

The core innovation: use current loss magnitude as the normalization factor. This adapts automatically as training progresses without any learned parameters.

Example: Epoch 1: rul_loss=1000 → scale=1000. Epoch 100: rul_loss=50 → scale=50.

Line 25: RUL Scale Computation

max() ensures we never divide by less than rul_min_scale. Calling .item() returns a plain Python float, so the scale factor is detached from the computation graph.
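The two behaviors of this line, tracking the current magnitude and flooring it, can be sketched with a hypothetical loss trajectory (the numbers are illustrative only):

```python
# Hypothetical RUL-loss values across training (illustrative numbers only)
rul_min_scale = 1.0
for epoch, rul_loss in [(1, 1000.0), (100, 50.0), (200, 0.4)]:
    scale = max(rul_loss, rul_min_scale)  # current magnitude, floored
    print(epoch, scale, rul_loss / scale)
# Early and mid training: scale equals the loss, normalized value is 1.0.
# Late training (0.4 < floor): scale stays at 1.0, normalized value is 0.4.
```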

Line 28: Loss Normalization

After division, both normalized losses are approximately 1.0. This ensures equal gradient magnitude contribution to the shared encoder.

Example: 1000/1000 = 1.0, 1.2/1.2 = 1.0 → both contribute equally.

Line 32: Equal Weighting

The 0.5/0.5 split is not arbitrary—our experiments showed this provides maximum regularization benefit. See ablation studies in Chapter 15.

Line 37: Loss Factory Name

Used by the loss factory function get_loss_function() to instantiate AMNL by name in configuration files.

Line 40: Effective Weights

Utility method for logging and debugging. Shows how the dynamic normalization translates to effective task weights at each step.

Line 52: Weight Normalization

Normalizes effective weights to sum to 1 for interpretability. Useful for TensorBoard logging during training.

1class AMNLLoss(nn.Module):
2    """
3    AMNL: Adaptive Magnitude-Normalized Loss
4
5    Automatically balances multi-task losses through magnitude normalization
6    - Zero hyperparameters
7    - Prevents mode collapse
8    - Simple and interpretable
9    """
10    def __init__(self, rul_min_scale=1.0, health_min_scale=0.1):
11        super().__init__()
12        self.rul_min_scale = rul_min_scale
13        self.health_min_scale = health_min_scale
14
15    def forward(self, rul_loss, health_loss, **kwargs):
16        """
17        AMNL loss balancing
18
19        Args:
20            rul_loss: RUL regression loss (typically MSE)
21            health_loss: Health classification loss (typically CrossEntropy)
22
23        Returns:
24            Balanced combined loss
25        """
26        # Dynamic magnitude normalization
27        rul_scale = max(rul_loss.item(), self.rul_min_scale)
28        health_scale = max(health_loss.item(), self.health_min_scale)
29
30        # Normalize losses to same magnitude
31        normalized_rul = rul_loss / rul_scale
32        normalized_health = health_loss / health_scale
33
34        # Equal weighting after normalization
35        total_loss = 0.5 * normalized_rul + 0.5 * normalized_health
36
37        return total_loss
38
39    def get_name(self):
40        return "AMNL"
41
42    def get_effective_weights(self, rul_loss, health_loss):
43        """
44        Compute effective weights for logging
45
46        Since AMNL uses dynamic normalization, weights change per batch
47        """
48        rul_scale = max(rul_loss.item(), self.rul_min_scale)
49        health_scale = max(health_loss.item(), self.health_min_scale)
50
51        effective_rul = 0.5 / rul_scale
52        effective_health = 0.5 / health_scale
53
54        total = effective_rul + effective_health
55
56        return {
57            'rul_weight': effective_rul / total,
58            'health_weight': effective_health / total,
59            'rul_scale': rul_scale,
60            'health_scale': health_scale
61        }
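As a quick sanity check, the effective-weight logic of get_effective_weights can be traced in plain Python, torch-free, with hypothetical loss values:

```python
# Torch-free mirror of the effective-weight computation (hypothetical losses)
rul_loss, health_loss = 800.0, 1.25
rul_scale = max(rul_loss, 1.0)
health_scale = max(health_loss, 0.1)

effective_rul = 0.5 / rul_scale        # 0.000625
effective_health = 0.5 / health_scale  # 0.4
total = effective_rul + effective_health

rul_weight = effective_rul / total
health_weight = effective_health / total
print(rul_weight, health_weight)  # ~0.0016 vs ~0.9984
```

In the raw loss space, the health task receives almost all of the effective weight, exactly compensating for its tiny magnitude relative to the RUL loss.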

Key Implementation Detail

Notice that we use rul_loss.item() for the scale factor but keep rul_loss (without .item()) in the numerator. This is critical: the scale factor is a constant for gradient computation, but the loss itself must remain in the computation graph.
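A finite-difference sketch (plain Python, no autograd) shows why this matters: with the scale frozen, the gradient is the useful 1/μ, but if the scale stayed in the graph, the normalized loss would be identically 1 and its gradient would vanish:

```python
# Finite-difference sketch of the two choices.
# f(L) = L / c with c frozen at the current value  -> gradient 1/c (useful)
# g(L) = L / L (scale left in the graph)           -> constant 1, gradient 0
eps = 1e-6
L = 800.0
c = L  # plays the role of rul_loss.item(): a fixed number, not a graph node

grad_f = ((L + eps) / c - L / c) / eps          # ~ 1/800 = 0.00125
grad_g = ((L + eps) / (L + eps) - L / L) / eps  # exactly 0.0
print(grad_f, grad_g)
```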


Normalization Mechanism

The key innovation is using the current loss magnitude as the normalization factor, rather than learned parameters or running averages.

Dynamic vs Static Normalization

| Approach | Normalization Factor | Pros | Cons |
|---|---|---|---|
| Fixed scale | Constant (e.g., 100) | Simple | Doesn't adapt |
| EMA | Running average | Smooth | Lags behind |
| AMNL (ours) | Current loss value | Instant adaptation | Slightly noisier |

Why Current Value Works

Dividing by the current loss value makes each normalized loss exactly 1 at the moment it is computed (whenever it exceeds the minimum scale), so the gradient scale of each task is equalized at every step. A running average or fixed constant only approximates this, reacting with a delay whenever loss magnitudes shift.

Minimum Scale Protection

The rul_min_scale=1.0 and health_min_scale=0.1 prevent division by very small numbers. In late training when losses are near zero, this prevents gradient explosion.
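A small hypothetical example of what the floor prevents: as the RUL loss approaches zero, the effective weight 0.5/μ would blow up without the clamp:

```python
# Late-training scenario: RUL loss has collapsed toward zero (hypothetical value)
rul_loss = 1e-4

without_floor = 0.5 / rul_loss         # 5000.0: the gradient scale explodes
with_floor = 0.5 / max(rul_loss, 1.0)  # 0.5: safely capped by rul_min_scale
print(without_floor, with_floor)
```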


Complete AMNL Formula

Putting it all together, the complete AMNL loss computation:

Step-by-Step Formulation

Step 1: Compute raw losses

\mathcal{L}_{\text{RUL}} = \frac{1}{N}\sum_{i=1}^{N} w_i (y_i - \hat{y}_i)^2

\mathcal{L}_{\text{health}} = -\frac{1}{N}\sum_{i=1}^{N} \log p(c_i \mid x_i)

Step 2: Compute scale factors

\mu_{\text{RUL}} = \max(\mathcal{L}_{\text{RUL}}, 1.0)

\mu_{\text{health}} = \max(\mathcal{L}_{\text{health}}, 0.1)

Step 3: Normalize and combine

\mathcal{L}_{\text{AMNL}} = 0.5 \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}
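The three steps can be traced end-to-end on a toy batch of two samples in plain Python. All numbers (targets, predictions, sample weights, class probabilities) are hypothetical:

```python
import math

# Step 1: raw losses on a toy batch of N=2 (all values hypothetical)
y      = [120.0, 15.0]   # true RUL
y_hat  = [100.0, 10.0]   # predicted RUL
w      = [0.5, 1.0]      # sample weights, larger at low RUL
p_true = [0.7, 0.9]      # predicted probability of the true health class

N = 2
L_rul = sum(wi * (yi - yh) ** 2 for wi, yi, yh in zip(w, y, y_hat)) / N
L_health = -sum(math.log(p) for p in p_true) / N

# Step 2: scale factors with minimum-scale floors
mu_rul = max(L_rul, 1.0)
mu_health = max(L_health, 0.1)

# Step 3: normalize and combine with equal weights
L_amnl = 0.5 * L_rul / mu_rul + 0.5 * L_health / mu_health
print(L_rul, L_health, L_amnl)  # L_rul=112.5, L_health~0.231, L_amnl=1.0
```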

Symbol Table

| Symbol | Meaning | Value in Code |
|---|---|---|
| λ_RUL, λ_health | Task weights | 0.5, 0.5 |
| μ_RUL | RUL scale factor | max(rul_loss.item(), 1.0) |
| μ_health | Health scale factor | max(health_loss.item(), 0.1) |
| wᵢ | Sample weight (RUL-dependent) | Linear decay based on RUL |
| N | Batch size | 256 (default) |

Mathematical Properties

AMNL has several desirable mathematical properties that explain its effectiveness.

Property 1: Scale Invariance

Normalized losses are approximately 1 regardless of raw scale:

\frac{\mathcal{L}_i}{\mu_i} \approx 1 \quad \text{(when } \mathcal{L}_i > \text{min\_scale)}

This means the gradient contribution from each task is proportional to its weight (0.5), not its loss magnitude.

Property 2: Zero Hyperparameters

Unlike other multi-task methods:

| Method | Hyperparameters | Tuning Required |
|---|---|---|
| Fixed Weights | λ₁, λ₂ | Grid search over weights |
| Uncertainty | None (learned) | Sensitive to initialization |
| GradNorm | α, lr_weights | Requires careful tuning |
| AMNL | None | Works out of the box |

Property 3: Computational Efficiency

AMNL adds negligible overhead:

| Method | Extra Computation | Extra Memory |
|---|---|---|
| Fixed Weights | 2 multiplications | 2 scalars |
| Uncertainty | Exp + log operations | 2 learnable params |
| GradNorm | Full backward pass × 2 | Gradient buffers |
| AMNL | 2 divisions | 2 scalars |

Comparison with Other Methods

| Property | AMNL | Uncertainty | GradNorm |
|---|---|---|---|
| Scale invariant | Yes | Partial | Yes |
| Computational overhead | Minimal | Minimal | ~2× backward |
| Hyperparameters | 0 | 0 (learned) | 2 (α, lr) |
| Stability | High | Medium | Medium |
| Interpretable | Yes (fixed 0.5/0.5) | No | No |

Summary

In this section, we presented the mathematical formulation and actual implementation of AMNL:

  1. Core formula: \mathcal{L} = 0.5 \cdot \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + 0.5 \cdot \frac{\mathcal{L}_{\text{health}}}{\mu_{\text{health}}}
  2. Implementation: 60 lines of clean Python code
  3. Scale invariance: Gradients balanced automatically
  4. Zero hyperparameters: No tuning required
  5. Minimal overhead: Just 2 divisions per step
| Parameter | Value |
|---|---|
| λ_RUL | 0.5 |
| λ_health | 0.5 |
| rul_min_scale | 1.0 |
| health_min_scale | 0.1 |
| Typical AMNL value | ≈ 1.0 |
Looking Ahead: We have the complete AMNL framework and implementation. The next section details the RUL loss component—specifically, the weighted MSE that emphasizes low-RUL samples where accurate prediction matters most.

With the AMNL formulation and implementation complete, we examine each loss component in detail.