Learning Objectives
By the end of this section, you will:
- Understand multi-task learning as a paradigm for learning related tasks jointly
- Identify complementary tasks for RUL prediction: regression and health state classification
- Compare parameter sharing strategies: hard sharing vs soft sharing
- Formulate multi-task loss functions and understand the task balancing problem
- Derive the AMNL loss that gives our model its name
- Appreciate the benefits of multi-task learning for industrial applications
Why This Matters: The "M" in AMNL stands for Multi-task. Our model predicts both continuous RUL values and discrete health states simultaneously. This section explains why combining these tasks improves performance beyond training them separately—and how to balance their contributions effectively.
What is Multi-Task Learning?
Multi-task learning (MTL) is a machine learning paradigm where a single model learns to perform multiple related tasks simultaneously, sharing representations across tasks.
The Core Idea
Instead of training separate models for each task (one mapping f_1: x → ŷ_RUL, another f_2: x → ŷ_Health), we train a single model with shared parameters that outputs multiple predictions: f: x → (ŷ_RUL, ŷ_Health).
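A minimal sketch of this idea in pure Python, using toy stand-in functions (the names `shared_features`, `rul_head`, and `health_head` are illustrative, not part of the actual model):

```python
def shared_features(x):
    # Stand-in for the shared backbone: one transform reused by all tasks.
    return [2.0 * v for v in x]

def rul_head(features):
    # Task 1: continuous output (here, simply a sum of the features).
    return sum(features)

def health_head(features):
    # Task 2: discrete output (here, the index of the largest feature).
    return max(range(len(features)), key=lambda i: features[i])

x = [0.5, 1.5, 1.0]
feats = shared_features(x)                  # computed once, reused by both heads
print(rul_head(feats), health_head(feats))  # -> 6.0 1
```

The key structural point is that `shared_features` is evaluated once and both heads consume its output, so improving it for one task benefits the other.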
Why MTL Works
MTL provides several theoretical and practical benefits:
| Benefit | Mechanism | Effect |
|---|---|---|
| Regularization | Tasks constrain each other | Reduces overfitting |
| Data efficiency | Shared features trained on all data | Better with limited data |
| Feature learning | Tasks share useful representations | Richer feature extraction |
| Inductive bias | Related tasks provide hints | Better generalization |
Human Learning Analogy
Consider learning to drive. You don't learn steering, braking, and navigation as completely separate skills. Instead, you learn them together, with shared representations of road conditions, traffic patterns, and vehicle dynamics. Each skill reinforces the others—this is multi-task learning.
Why Multi-Task Learning for RUL?
For RUL prediction, we define two complementary tasks that share a common underlying phenomenon: equipment degradation.
Task 1: RUL Regression
Predict the continuous remaining useful life from the shared representation:

ŷ_RUL = w^T c + b

Where c is the context vector from attention and w, b are the parameters of the RUL head. We use MSE or Huber loss to train this task.
Task 2: Health State Classification
Classify the current health state into discrete categories:

p̂ = softmax(W c + b) ∈ ℝ^K

Where K is the number of health states (e.g., 5 states from "healthy" to "critical").
Why These Tasks are Complementary
| Aspect | RUL Regression | Health Classification |
|---|---|---|
| Output | Continuous (0 to 125) | Discrete (5 classes) |
| Precision | Exact cycles remaining | Broad category |
| Signal | Fine-grained trends | Categorical boundaries |
| Loss landscape | Smooth but noisy | Structured but sparse |
Parameter Sharing Strategies
The key design choice in MTL is how to share parameters between tasks. Two main strategies exist.
Hard Parameter Sharing
All tasks share a common representation backbone, with lightweight task-specific heads on top of it.
Our AMNL model uses hard sharing:
- Shared backbone: CNN → BiLSTM → Attention (produces the context vector c)
- RUL head: Linear layer → RUL prediction
- Health head: Linear → Softmax → Health state probabilities
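The two heads above can be sketched in pure Python. The context vector, weights, and dimensions are illustrative placeholders, not trained values:

```python
import math

def rul_head(c, w, b):
    # Linear layer producing a scalar RUL prediction.
    return sum(wi * ci for wi, ci in zip(w, c)) + b

def health_head(c, W, b):
    # Linear layer followed by softmax over K health states.
    logits = [sum(wi * ci for wi, ci in zip(row, c)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

c = [0.2, -0.1, 0.4]                         # shared context vector (toy, 3-dim)
rul = rul_head(c, w=[10.0, 5.0, 20.0], b=50.0)
probs = health_head(c, W=[[1, 0, 0], [0, 1, 0], [0, 0, 1]], b=[0.0, 0.0, 0.0])
print(rul)             # -> 59.5
print(sum(probs))      # softmax output sums to 1
```

Both heads read the same `c`, which is the hard-sharing property: all task-specific capacity lives in these small final layers.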
In soft sharing, by contrast, each task keeps its own backbone, and the backbones are encouraged to stay similar (e.g., via a distance penalty between their parameters).

| Aspect | Hard Sharing | Soft Sharing |
|---|---|---|
| Shared layers | All backbone layers identical | Separate backbones |
| Task coupling | Strong | Weak (learned) |
| Parameters | Fewer (shared) | More (separate) |
| Regularization | Implicit (shared features) | Explicit (constraints) |
| Use case | Closely related tasks | Loosely related tasks |
Why Hard Sharing for RUL?
For our application, hard sharing is appropriate because:
- Tasks are tightly coupled: Both RUL and health state derive from the same degradation process
- Data efficiency: C-MAPSS has limited training data; sharing parameters prevents overfitting
- Computational efficiency: Single backbone reduces inference time (important for real-time monitoring)
Multi-Task Loss Formulation
With two tasks, we need to combine their losses into a single objective. The standard approach is a weighted sum:

L_total = λ_RUL · L_RUL + λ_Health · L_Health
Individual Task Losses
RUL Regression Loss (MSE):

L_RUL = (1/N) Σ_i (y_i − ŷ_i)²

Health Classification Loss (Cross-Entropy):

L_Health = −(1/N) Σ_i Σ_k y_{i,k} log p̂_{i,k}
Scale Mismatch Problem

A critical issue arises: the two losses live on very different scales. With RUL values up to 125 cycles, the MSE can easily reach the hundreds or thousands, while cross-entropy over 5 classes is typically on the order of 1. With a naive sum, the regression loss dominates the gradients and the classification head barely learns.
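To make the mismatch concrete, here are both losses computed on small made-up batches (the target and prediction values are illustrative):

```python
import math

def mse(y, y_hat):
    # Mean squared error over a batch of scalar predictions.
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def cross_entropy(one_hot, probs):
    # Mean cross-entropy over a batch of one-hot targets and predicted probabilities.
    return -sum(
        t * math.log(p)
        for row_t, row_p in zip(one_hot, probs)
        for t, p in zip(row_t, row_p)
    ) / len(one_hot)

# RUL errors of 5-20 cycles already push the MSE into the hundreds.
y, y_hat = [120.0, 80.0, 30.0], [100.0, 95.0, 25.0]
# Cross-entropy over 5 health states stays on the order of 1.
targets = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
probs = [[0.6, 0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.3, 0.2, 0.1]]

print(mse(y, y_hat))               # hundreds
print(cross_entropy(targets, probs))  # roughly 0.86
```

Summing these two numbers directly would let the regression term contribute more than 99% of the total, which is exactly the imbalance the rest of this section addresses.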
Task Balancing Challenges
Balancing task losses is one of the most challenging aspects of MTL. Several approaches exist:
1. Manual Weighting

Fix the weights by hand, e.g., L = λ · L_RUL + (1 − λ) · L_Health for some λ ∈ (0, 1).

Problem: Requires extensive hyperparameter tuning, and the optimal weights may change during training.
2. Uncertainty Weighting
Kendall et al. (2018) proposed learning weights from task-dependent uncertainty, with a loss of the form:

L = Σ_t [ L_t / (2σ_t²) + log σ_t ]

Where σ_t are learned per-task uncertainty parameters (the exact coefficient differs slightly between regression and classification tasks). The log σ_t regularization term prevents the degenerate solution of driving σ_t to infinity.
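A hedged sketch of this weighting, using the simplified form L_t/(2σ_t²) + log σ_t for every task; in practice log σ_t² is usually learned as a free parameter for numerical stability:

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    # losses: per-task loss values; log_vars: learned log(sigma^2) per task.
    total = 0.0
    for loss, log_var in zip(losses, log_vars):
        sigma_sq = math.exp(log_var)
        # 0.5 * log_var equals log(sigma), the regularization term.
        total += loss / (2.0 * sigma_sq) + 0.5 * log_var
    return total

# A large-scale loss paired with a large learned variance is down-weighted:
# the 200.0 regression loss contributes only 0.5 after scaling.
print(uncertainty_weighted_loss([200.0, 0.9], [math.log(200.0), 0.0]))
```

The attraction of this scheme is that the balance is learned; the drawback, addressed by our approach below, is the extra learned parameters and the sensitivity of their optimization.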
3. Gradient Normalization
GradNorm (Chen et al., 2018) dynamically adjusts weights to balance gradient magnitudes across tasks.
4. Loss Normalization (Our Approach)
Normalize each loss by a running estimate of its own magnitude before combining:

L = L_RUL / μ_RUL + L_Health / μ_Health
This is the key insight behind our AMNL loss.
The AMNL Loss Function
AMNL stands for Adaptive Multi-task Normalized Loss. Our loss function addresses the scale mismatch through batch-level normalization.
AMNL Loss Formulation

L_AMNL = L_RUL / μ_RUL + L_Health / μ_Health

Where μ_RUL and μ_Health are the running means of each loss, computed with an exponential moving average:

μ^(t) = β · μ^(t−1) + (1 − β) · L^(t)
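The exponential-moving-average update described above is a one-liner; the β and loss values here are illustrative:

```python
def ema_update(mu_prev, loss, beta=0.99):
    # Exponential moving average: mostly keep the old mean,
    # nudge it a little toward the current batch loss.
    return beta * mu_prev + (1.0 - beta) * loss

mu = 100.0  # running mean, seeded from an early loss value
for batch_loss in [120.0, 90.0, 110.0]:
    mu = ema_update(mu, batch_loss)
print(round(mu, 3))  # -> 100.197
```

Note how little the mean moves per batch with β = 0.99: the EMA smooths out batch-to-batch noise while still tracking slow drifts in the loss scale.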
Why Normalization Works

Dividing each loss by its own running mean makes both normalized terms fluctuate around 1, so each task contributes a comparable share of the gradient regardless of its raw scale.
Adaptive Nature
The "Adaptive" in AMNL refers to how normalization constants evolve:
- Early training: Losses are large and variable; adapts quickly
- Late training: Losses stabilize; provides stable normalization
- Task difficulty changes: If one task becomes harder (loss increases), its normalized loss increases, drawing more attention
Complete AMNL Training Objective
Combining everything, our training minimizes L_AMNL = L_RUL / μ_RUL + L_Health / μ_Health over the parameters of the shared backbone and both task heads.
Gradient Flow
Note that the μ values are treated as constants during backpropagation (no gradient flows through the normalization). This ensures stable training: we don't want the network to reduce the loss simply by inflating μ.
Benefits of AMNL
| Benefit | Mechanism | Impact |
|---|---|---|
| No manual tuning | Automatic normalization | Reduces hyperparameters |
| Scale invariance | Division by running mean | Works across RUL ranges |
| Adaptive balancing | EMA tracks loss evolution | Responds to training dynamics |
| Stable training | Normalized gradients | Consistent learning rates |
Practical Implementation
In practice, we use β = 0.99 for the EMA, which means the running mean reflects approximately the last 1/(1 − β) = 100 batches. Initialize μ^(0) with the first batch's loss (or a small positive constant) to avoid division by zero at the start.
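Putting the pieces together, here is a framework-agnostic sketch of the AMNL combination as described in this section; in a real deep-learning framework the losses would be tensors and the μ updates would be computed on detached values:

```python
class AMNLLoss:
    """Adaptive multi-task normalized loss: sum of losses, each divided
    by an EMA of its own recent values. A sketch, not the reference code."""

    def __init__(self, beta=0.99):
        self.beta = beta
        self.mu = {}  # running mean per task, lazily initialized

    def __call__(self, losses):
        # losses: dict mapping task name -> scalar loss for the current batch.
        total = 0.0
        for task, loss in losses.items():
            if task not in self.mu:
                # Initialize with the first observed loss to avoid division by zero.
                self.mu[task] = loss
            # EMA update; in a real framework this uses detached loss values
            # so no gradient flows through the normalization constant.
            self.mu[task] = self.beta * self.mu[task] + (1 - self.beta) * loss
            total += loss / self.mu[task]
        return total

amnl = AMNLLoss()
total = amnl({"rul": 450.0, "health": 1.2})
print(round(total, 3))  # prints 2.0 -- on the first batch each term is exactly 1
```

On the very first batch each normalized term equals 1 by construction; afterwards, a task whose loss rises faster than its running mean contributes more than 1, which is the adaptive rebalancing behavior described above.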
Summary
In this section, we explored the multi-task learning framework that powers our AMNL model:
- Multi-task learning trains one model on multiple related tasks, sharing representations
- RUL and health state are complementary tasks—regression provides precision, classification provides structure
- Hard parameter sharing uses a shared backbone (CNN → BiLSTM → Attention) with task-specific heads
- Loss scale mismatch causes one task to dominate without proper balancing
- AMNL loss normalizes each task loss by its running mean, ensuring balanced contributions
- Benefits: Regularization, data efficiency, no manual weight tuning
| Component | Formula | Purpose |
|---|---|---|
| RUL loss | L_RUL = Σ(y - ŷ)² / N | Continuous prediction |
| Health loss | L_Health = -Σ y log(p̂) / N | Discrete classification |
| Running mean | μ^(t) = β·μ^(t-1) + (1-β)·L^(t) | Track loss scale |
| AMNL loss | L_AMNL = L_RUL/μ_RUL + L_Health/μ_Health | Balanced combination |
Chapter Summary: We have now covered all the mathematical foundations needed to understand our AMNL model: time series representation, CNN convolution, RNN/LSTM temporal processing, attention mechanisms, and multi-task learning. In the next chapter, we will dive deep into the NASA C-MAPSS dataset—understanding its structure, operating conditions, failure modes, and how to extract the features our model will learn from.
With the mathematical foundations complete, we are ready to work with real data. Chapter 3 introduces the benchmark dataset that defines state-of-the-art in predictive maintenance research.