Learning Objectives
By the end of this section, you will:
- Master the complete AMNL formulation
- Understand the actual research implementation
- Derive the gradient properties of normalized losses
- Appreciate the scale-invariance AMNL provides
- Contrast AMNL with previous methods
Why This Matters: AMNL (Adaptive Multi-task Normalized Loss) combines loss normalization with equal weighting to achieve state-of-the-art results. Understanding its mathematical formulation reveals why it succeeds where other methods fail—and provides the foundation for implementing it correctly.
AMNL Overview
AMNL is deceptively simple: normalize each loss by its current magnitude, then combine with equal weights.
The Core Formula

L_AMNL = λ_RUL · (L_RUL / μ_RUL) + λ_health · (L_health / μ_health)

Where:
- λ_RUL = λ_health = 0.5: Equal task weights (found optimal through ablation)
- L_RUL: RUL regression loss (weighted MSE)
- L_health: Health classification loss (cross-entropy)
- μ_RUL, μ_health: Current loss magnitudes (used as normalization factors)
Why This Works
| Component | Role | Effect |
|---|---|---|
| Normalization (÷μ) | Scale losses to ~1 | Equal gradient contribution |
| Equal weights (0.5) | Balance tasks | Maximum regularization |
| Dynamic scaling | Adapt to training | No manual tuning needed |
Research Implementation
The research implementation used to achieve state-of-the-art results on all C-MAPSS datasets is roughly 60 lines of Python; its one subtle detail is described below.
Key Implementation Detail
Notice that we use rul_loss.item() for the scale factor but keep rul_loss (without .item()) in the numerator. This is critical: the scale factor is a constant for gradient computation, but the loss itself must remain in the computation graph.
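The exact research code is not reproduced here, but the combination logic it describes can be sketched in a few lines of plain Python (function and argument names are hypothetical; real losses would be framework tensors, with the scale factors detached as noted above):

```python
def amnl_loss(rul_loss, health_loss,
              rul_min_scale=1.0, health_min_scale=0.1):
    """Sketch of the AMNL combination (hypothetical names, plain floats).

    In an autograd framework the scale factors would come from detached
    values (e.g. rul_loss.item() in PyTorch) so they act as constants
    during backpropagation, while the numerators stay in the graph.
    """
    # Normalization factors: current loss magnitudes, clamped from below
    mu_rul = max(rul_loss, rul_min_scale)
    mu_health = max(health_loss, health_min_scale)

    # Equal task weights (0.5 / 0.5): each normalized term contributes ~0.5
    return 0.5 * rul_loss / mu_rul + 0.5 * health_loss / mu_health
```

For typical mid-training magnitudes, e.g. `amnl_loss(120.0, 0.4)`, both terms normalize to 0.5 and the total is 1.0, matching the "Typical AMNL value ≈ 1.0" entry in the summary table.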
Normalization Mechanism
The key innovation is using the current loss magnitude as the normalization factor, rather than learned parameters or running averages.
Dynamic vs Static Normalization
| Approach | Normalization Factor | Pros | Cons |
|---|---|---|---|
| Fixed scale | Constant (e.g., 100) | Simple | Doesn't adapt |
| EMA | Running average | Smooth | Lags behind |
| AMNL (ours) | Current loss value | Instant adaptation | Slightly noisier |
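The lag in the EMA row can be seen with a toy loss sequence (a sketch; the decay rate and loss values are arbitrary):

```python
def ema(values, beta=0.9):
    """Exponential moving average of a loss sequence (illustrative helper)."""
    avg, out = values[0], []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

losses = [100.0, 100.0, 10.0]   # the loss drops sharply at the last step

# Current-value normalization adapts instantly: 10 / 10 = 1.0
instant = losses[-1] / losses[-1]

# EMA normalization lags: the running average is still near the old level,
# so the normalized loss undershoots well below 1
lagged = losses[-1] / ema(losses)[-1]
```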
Why Current Value Works

The scale factor is computed from the detached loss value, so it behaves as a constant during backpropagation. Dividing a loss by its own current magnitude therefore leaves the forward value near 1 while scaling its gradient by 1/μ: ∂(L/μ)/∂θ = (1/μ) · ∂L/∂θ. Since μ tracks the loss itself, both tasks contribute gradients of comparable magnitude at every step, however far their raw losses drift apart during training.
Minimum Scale Protection
The rul_min_scale=1.0 and health_min_scale=0.1 floors prevent division by very small numbers: late in training, when the raw losses approach zero, the clamp keeps the normalized gradients from exploding.
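A quick numeric check of why the floor matters (a sketch; the loss value is arbitrary):

```python
def rul_scale(rul_loss, rul_min_scale=1.0):
    """Clamped normalization factor, mirroring max(loss, min_scale)."""
    return max(rul_loss, rul_min_scale)

late_loss = 1e-4   # raw RUL loss late in training

# Without the clamp, gradients of loss/mu are amplified by 1/mu = 10,000x
unclamped_gain = 1.0 / late_loss

# With the clamp, mu = 1.0 and the gradient scale stays bounded at 1x
clamped_gain = 1.0 / rul_scale(late_loss)
```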
Complete AMNL Formula
Putting it all together, the complete AMNL loss computation:
Step-by-Step Formulation

1. Compute the raw task losses: L_RUL = (1/N) Σᵢ wᵢ (ŷᵢ − yᵢ)² (weighted MSE over the batch) and L_health (batch cross-entropy of the health classifier).
2. Compute the detached scale factors: μ_RUL = max(L_RUL, 1.0) and μ_health = max(L_health, 0.1).
3. Combine with equal weights: L_AMNL = 0.5 · L_RUL / μ_RUL + 0.5 · L_health / μ_health.
Symbol Table
| Symbol | Meaning | Value in Code |
|---|---|---|
| λ_RUL, λ_health | Task weights | 0.5, 0.5 |
| μ_RUL | RUL scale factor | max(rul_loss.item(), 1.0) |
| μ_health | Health scale factor | max(health_loss.item(), 0.1) |
| wᵢ | Sample weight (RUL-dependent) | Linear decay based on RUL |
| N | Batch size | 256 (default) |
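The wᵢ row can be illustrated with a hypothetical linear-decay schedule (the exact schedule is defined in the next section; the RUL cap and endpoint weights below are assumptions for illustration only):

```python
def sample_weight(rul, rul_cap=125.0, w_max=2.0, w_min=1.0):
    """Hypothetical linear-decay sample weight: low-RUL samples count more.

    The research schedule may differ; this only illustrates the
    'linear decay based on RUL' entry in the symbol table.
    """
    frac = min(rul, rul_cap) / rul_cap       # 0 near failure, 1 when healthy
    return w_max - (w_max - w_min) * frac    # w_max at RUL=0, w_min at the cap

def weighted_mse(preds, targets):
    """Weighted MSE over a batch: (1/N) * sum(w_i * (pred_i - target_i)^2)."""
    n = len(preds)
    return sum(sample_weight(t) * (p - t) ** 2
               for p, t in zip(preds, targets)) / n
```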
Mathematical Properties
AMNL has several desirable mathematical properties that explain its effectiveness.
Property 1: Scale Invariance
Normalized losses are approximately 1 regardless of raw scale:

L_RUL / μ_RUL = L_RUL / max(L_RUL, 1.0) ≈ 1, and likewise L_health / μ_health ≈ 1

This means the gradient contribution from each task is proportional to its weight (0.5), not to its raw loss magnitude.
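The property can be checked numerically with a toy quadratic loss (a sketch; the quadratic form and evaluation point are arbitrary). Scaling the raw loss by 1000x leaves the gradient of the normalized term unchanged:

```python
def normalized_term_grad(k, theta=5.0, eps=1e-6):
    """Finite-difference gradient of 0.5 * L/mu for L(theta) = k*(theta-3)^2,
    with mu = max(L(theta), 1.0) held constant (detached) during the diff."""
    loss = lambda t: k * (t - 3.0) ** 2
    mu = max(loss(theta), 1.0)             # detached at the current theta
    term = lambda t: 0.5 * loss(t) / mu    # mu is fixed while differentiating
    return (term(theta + eps) - term(theta - eps)) / (2 * eps)

g_small = normalized_term_grad(k=1.0)      # raw loss ~ 4
g_large = normalized_term_grad(k=1000.0)   # raw loss ~ 4000 (1000x larger)
# g_small and g_large agree: normalization cancels the raw scale
```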
Property 2: Zero Hyperparameters
Unlike other multi-task methods:
| Method | Hyperparameters | Tuning Required |
|---|---|---|
| Fixed Weights | λ₁, λ₂ | Grid search over weights |
| Uncertainty | None (learned) | Sensitive to initialization |
| GradNorm | α, lr_weights | Requires careful tuning |
| AMNL | None | Works out of the box |
Property 3: Computational Efficiency
AMNL adds negligible overhead:
| Method | Extra Computation | Extra Memory |
|---|---|---|
| Fixed Weights | 2 multiplications | 2 scalars |
| Uncertainty | Exp + log operations | 2 learnable params |
| GradNorm | Full backward pass × 2 | Gradient buffers |
| AMNL | 2 divisions | 2 scalars |
Comparison with Other Methods
| Property | AMNL | Uncertainty | GradNorm |
|---|---|---|---|
| Scale invariant | Yes | Partial | Yes |
| Computational overhead | Minimal | Minimal | ~2× backward |
| Hyperparameters | 0 | 0 (learned) | 2 (α, lr) |
| Stability | High | Medium | Medium |
| Interpretable | Yes (fixed 0.5/0.5) | No | No |
Summary
In this section, we presented the mathematical formulation and actual implementation of AMNL:
- Core formula: L_AMNL = 0.5 · L_RUL / μ_RUL + 0.5 · L_health / μ_health
- Implementation: 60 lines of clean Python code
- Scale invariance: Gradients balanced automatically
- Zero hyperparameters: No tuning required
- Minimal overhead: Just 2 divisions per step
| Parameter | Value |
|---|---|
| λ_RUL | 0.5 |
| λ_health | 0.5 |
| rul_min_scale | 1.0 |
| health_min_scale | 0.1 |
| Typical AMNL value | ≈ 1.0 |
Looking Ahead: With the complete AMNL framework and implementation in hand, we examine each loss component in detail. The next section covers the RUL loss: the weighted MSE that emphasizes low-RUL samples, where accurate prediction matters most.