Learning Objectives
By the end of this section, you will:
- Understand DWA's loss-rate approach to task weighting
- Implement the softmax-based weight computation
- Tune the temperature hyperparameter for weight sensitivity
- Compare DWA with GradNorm in terms of cost and performance
- Identify DWA's limitations for RUL prediction
Why This Matters: Dynamic Weight Average (Liu et al., 2019) offers a computationally cheap alternative to GradNorm. Instead of computing per-task gradients, DWA adjusts weights based on how quickly each task's loss is decreasing—a proxy for learning difficulty that requires no additional backward passes.
DWA Motivation
DWA is motivated by a simple observation: tasks that are improving slowly likely need more focus.
The Core Intuition
Track the rate of loss decrease for each task:
- If loss decreased (ratio < 1): task is learning well → lower weight
- If loss increased (ratio > 1): task is struggling → higher weight
- If loss unchanged (ratio = 1): maintain current focus
Comparison with GradNorm
| Aspect | GradNorm | DWA |
|---|---|---|
| Signal | Gradient magnitudes | Loss change rates |
| Computation | Per-task gradients | Just loss values |
| Overhead | ~K× backward passes (K = number of tasks) | Negligible |
| Memory | O(K × params) | O(K) |
| Sensitivity | Per-step | Smoothed over epochs |
No Gradient Computation
DWA's key advantage: it only needs loss values from consecutive epochs, not per-task gradients. This makes it as fast as standard training while still adapting weights dynamically.
Mathematical Formulation
DWA computes weights using softmax over loss ratios.
Loss Rate Computation
Define the relative descending rate for task i:

rᵢ(t-1) = Lᵢ(t-1) / Lᵢ(t-2)

Where:
- Lᵢ(t-1): Loss at the previous epoch
- Lᵢ(t-2): Loss at the epoch before that
- rᵢ < 1: Loss decreased (task improving)
- rᵢ > 1: Loss increased (task struggling)
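A quick numeric check of the rate, using hypothetical loss values:

```python
# Hypothetical losses for one task over two consecutive epochs
loss_t2 = 1.00  # L_i(t-2), two epochs ago
loss_t1 = 0.80  # L_i(t-1), previous epoch

r = loss_t1 / loss_t2  # relative descending rate r_i
print(r)  # 0.8: ratio < 1, so the task is improving
```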
Softmax Weighting
Convert rates to weights using a softmax with temperature T:

λᵢ(t) = K · exp(rᵢ(t-1) / T) / Σₖ exp(rₖ(t-1) / T)

Where:
- K: Number of tasks (the factor of K ensures weights sum to K)
- T: Temperature hyperparameter
- Large T → weights approach uniform (less sensitive)
- Small T → weights more extreme (more sensitive)
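A minimal numeric sketch of the weighting step, with hypothetical rates for two tasks:

```python
import math

K = 2               # number of tasks
T = 2.0             # temperature
rates = [0.8, 1.1]  # hypothetical r_i values: task 0 improving, task 1 struggling

# Softmax over r_i / T, scaled by K so the weights sum to K
exps = [math.exp(r / T) for r in rates]
weights = [K * e / sum(exps) for e in exps]
# The struggling task (r > 1) receives the larger weight
```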
Implementation
DWA is straightforward to implement.
PyTorch Implementation
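The original paper does not prescribe a particular code structure; the sketch below is one possible PyTorch implementation, where the class name `DynamicWeightAverage`, its method names, and the epoch-level loss averaging are illustrative choices:

```python
import torch


class DynamicWeightAverage:
    """Sketch of DWA (Liu et al., 2019). Names and structure are illustrative."""

    def __init__(self, num_tasks: int, temperature: float = 2.0):
        self.num_tasks = num_tasks
        self.temperature = temperature
        self.loss_history = []  # one list of per-task average losses per epoch

    def record_epoch(self, avg_losses):
        """Store each task's average loss for the epoch that just finished."""
        self.loss_history.append([float(l) for l in avg_losses])

    def get_weights(self) -> torch.Tensor:
        # Fewer than two recorded epochs: no ratio available, use uniform weights
        if len(self.loss_history) < 2:
            return torch.ones(self.num_tasks)
        prev = torch.tensor(self.loss_history[-1])
        prev2 = torch.tensor(self.loss_history[-2])
        rates = prev / prev2                    # r_i = L_i(t-1) / L_i(t-2)
        weights = torch.softmax(rates / self.temperature, dim=0)
        return self.num_tasks * weights         # rescale so weights sum to K
```

In a training loop, `record_epoch` would be called once per epoch with each task's mean loss, and the returned weights multiply the per-task losses during the next epoch.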
Temperature Selection
The temperature T controls weight sensitivity:
| Temperature | Behavior | Use Case |
|---|---|---|
| T → 0 | Winner-take-all (focus on hardest) | Aggressive balancing |
| T = 1 | Proportional to rates | Moderate adaptation |
| T = 2 | Smoothed weights (recommended) | Stable training |
| T → ∞ | Uniform weights | No adaptation |
Default Choice
T = 2, the value used in the original paper, typically provides a good balance between responsiveness and stability. Lower values can cause oscillating weights; higher values reduce the benefit of adaptation.
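The effect of T can be seen by sweeping it over the same pair of hypothetical rates:

```python
import math

def dwa_weights(rates, temperature):
    # Softmax over rates, scaled so the weights sum to the number of tasks
    exps = [math.exp(r / temperature) for r in rates]
    return [len(rates) * e / sum(exps) for e in exps]

rates = [0.8, 1.2]  # hypothetical: task 0 improving, task 1 struggling
for T in (0.5, 1.0, 2.0, 10.0):
    print(T, dwa_weights(rates, T))
# As T grows, the weights move toward uniform [1.0, 1.0];
# as T shrinks, more weight shifts to the struggling task
```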
Comparison with Other Methods
How does DWA compare to the methods we have studied?
Method Overview
| Method | Weights From | Learnable | Overhead | Hyperparams |
|---|---|---|---|---|
| Fixed | Manual | No | None | K weights |
| Uncertainty | Log-variance | Yes | Minimal | None |
| GradNorm | Gradients | Yes | ~K× | α, lr |
| DWA | Loss rates | No* | None | T (temperature) |
DWA Weights
*DWA weights are computed from loss history, not learned via backpropagation. They adapt dynamically but are not gradient-updated parameters.
Pros and Cons
| Aspect | Pro | Con |
|---|---|---|
| Simplicity | Easy to implement | Less principled than others |
| Cost | No overhead | - |
| Adaptation | Responds to training dynamics | Lags by two epochs |
| Sensitivity | Temperature tunable | Can be too smooth or noisy |
Empirical Results on C-MAPSS
| Method | FD001 | FD002 | FD003 | FD004 | Time |
|---|---|---|---|---|---|
| Fixed | 11.8 | 16.2 | 12.5 | 19.8 | 1.0× |
| Uncertainty | 12.4 | 17.1 | 13.1 | 20.5 | 1.0× |
| GradNorm | 11.5 | 15.8 | 12.1 | 19.2 | 2.8× |
| DWA | 11.6 | 15.9 | 12.3 | 19.3 | 1.0× |
| AMNL | 10.8 | 13.9 | 11.2 | 17.4 | 1.0× |
DWA performs similarly to GradNorm but without the computational cost. However, both are significantly outperformed by AMNL.
Summary
In this section, we examined Dynamic Weight Average:
- Core idea: Weight tasks by their rate of loss descent; slower-improving tasks get higher weight
- Computation: rᵢ = Lᵢ(t-1) / Lᵢ(t-2)
- Weights: Softmax over rates with temperature
- Advantage: No computational overhead
- Limitation: Two-epoch lag, temperature sensitivity
| Property | Value |
|---|---|
| Weight update | Once per epoch |
| Memory | O(K) for loss history |
| Hyperparameter | Temperature T |
| Default T | 2.0 |
| RUL suitability | Moderate (lag and noise issues) |
Looking Ahead: We have surveyed four multi-task weighting methods: fixed, uncertainty, GradNorm, and DWA. The final section synthesizes why all of these methods fail for RUL prediction—setting the stage for AMNL's novel approach in Chapter 10.
With DWA understood, we analyze why existing methods fall short for RUL.