Chapter 2

Multi-Task Learning Framework

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand multi-task learning as a paradigm for learning related tasks jointly
  2. Identify complementary tasks for RUL prediction: regression and health state classification
  3. Compare parameter sharing strategies: hard sharing vs soft sharing
  4. Formulate multi-task loss functions and understand the task balancing problem
  5. Derive the AMNL loss that gives our model its name
  6. Appreciate the benefits of multi-task learning for industrial applications
Why This Matters: The "M" in AMNL stands for Multi-task. Our model predicts both continuous RUL values and discrete health states simultaneously. This section explains why combining these tasks improves performance beyond training them separately—and how to balance their contributions effectively.

What is Multi-Task Learning?

Multi-task learning (MTL) is a machine learning paradigm where a single model learns to perform multiple related tasks simultaneously, sharing representations across tasks.

The Core Idea

Instead of training separate models for each task:

$$\text{Model}_1: \mathcal{X} \to \mathcal{Y}_1, \quad \text{Model}_2: \mathcal{X} \to \mathcal{Y}_2$$

We train a single model with shared parameters that outputs multiple predictions:

$$\text{Model}: \mathcal{X} \to (\mathcal{Y}_1, \mathcal{Y}_2)$$

Why MTL Works

MTL provides several theoretical and practical benefits:

| Benefit | Mechanism | Effect |
|---|---|---|
| Regularization | Tasks constrain each other | Reduces overfitting |
| Data efficiency | Shared features trained on all data | Better with limited data |
| Feature learning | Tasks share useful representations | Richer feature extraction |
| Inductive bias | Related tasks provide hints | Better generalization |

Human Learning Analogy

Consider learning to drive. You don't learn steering, braking, and navigation as completely separate skills. Instead, you learn them together, with shared representations of road conditions, traffic patterns, and vehicle dynamics. Each skill reinforces the others—this is multi-task learning.


Why Multi-Task Learning for RUL?

For RUL prediction, we define two complementary tasks that share a common underlying phenomenon: equipment degradation.

Task 1: RUL Regression

Predict the continuous remaining useful life:

$$\hat{y}_{\text{RUL}} = f_{\text{reg}}(\mathbf{c}) \in \mathbb{R}^+$$

where $\mathbf{c}$ is the context vector from attention. We use MSE or Huber loss to train this task.

Task 2: Health State Classification

Classify the current health state into discrete categories:

$$\hat{\mathbf{p}}_{\text{health}} = \text{softmax}(f_{\text{cls}}(\mathbf{c})) \in \mathbb{R}^K$$

where $K$ is the number of health states (e.g., 5 states from "healthy" to "critical").
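
The text doesn't specify how the discrete health-state labels are derived from the data. A common choice, sketched below, is to bin the clipped RUL into $K$ equal-width ranges; the thresholds, cap of 125 cycles, and function name here are illustrative assumptions, not the definitive labeling scheme:

```python
def health_state(rul, max_rul=125, num_states=5):
    """Map a RUL value (in cycles) to a discrete health state.

    State 0 = "critical" (RUL near 0); state num_states-1 = "healthy".
    Equal-width bins over [0, max_rul] are an illustrative choice.
    """
    rul = max(0.0, min(rul, max_rul))   # clip into [0, max_rul]
    width = max_rul / num_states        # 25 cycles per bin here
    return min(int(rul // width), num_states - 1)

print(health_state(10))   # -> 0 (critical)
print(health_state(60))   # -> 2
print(health_state(120))  # -> 4 (healthy)
```

With this scheme the classification targets are free: they come directly from the same RUL labels the regression task already uses.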

Why These Tasks are Complementary

| Aspect | RUL Regression | Health Classification |
|---|---|---|
| Output | Continuous (0 to 125) | Discrete (5 classes) |
| Precision | Exact cycles remaining | Broad category |
| Signal | Fine-grained trends | Categorical boundaries |
| Loss landscape | Smooth but noisy | Structured but sparse |

Parameter Sharing Strategies

The key design choice in MTL is how to share parameters between tasks. Two main strategies exist.

Hard Parameter Sharing

All tasks share a common representation backbone, with task-specific heads:

$$\mathbf{c} = \text{Shared}(\mathbf{X}), \quad \hat{y}_1 = \text{Head}_1(\mathbf{c}), \quad \hat{y}_2 = \text{Head}_2(\mathbf{c})$$

Our AMNL model uses hard sharing:

  • Shared backbone: CNN → BiLSTM → Attention (produces $\mathbf{c}$)
  • RUL head: Linear layer → RUL prediction
  • Health head: Linear → Softmax → Health state probabilities
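
The two task heads on top of the shared context vector can be sketched in plain NumPy. The shared backbone is elided here, and the dimensions, random initialization, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_context = 64   # size of the attention context vector c (illustrative)
K = 5            # number of health states

# Task-specific heads; the shared backbone (CNN -> BiLSTM -> Attention)
# that would produce c is not shown.
W_rul, b_rul = rng.normal(size=(d_context, 1)), np.zeros(1)
W_cls, b_cls = rng.normal(size=(d_context, K)), np.zeros(K)

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def heads(c):
    rul_hat = c @ W_rul + b_rul             # RUL head: single linear unit
    p_health = softmax(c @ W_cls + b_cls)   # health head: linear -> softmax
    return rul_hat, p_health

c = rng.normal(size=d_context)   # stand-in for the shared context vector
rul_hat, p_health = heads(c)
assert p_health.shape == (K,) and abs(float(p_health.sum()) - 1.0) < 1e-9
```

Because both heads read the same $\mathbf{c}$, gradients from both losses flow back into the shared backbone, which is exactly the hard-sharing coupling described above.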

| Aspect | Hard Sharing | Soft Sharing |
|---|---|---|
| Shared layers | All backbone layers identical | Separate backbones |
| Task coupling | Strong | Weak (learned) |
| Parameters | Fewer (shared) | More (separate) |
| Regularization | Implicit (shared features) | Explicit (constraints) |
| Use case | Closely related tasks | Loosely related tasks |

Why Hard Sharing for RUL?

For our application, hard sharing is appropriate because:

  1. Tasks are tightly coupled: Both RUL and health state derive from the same degradation process
  2. Data efficiency: C-MAPSS has limited training data; sharing parameters prevents overfitting
  3. Computational efficiency: Single backbone reduces inference time (important for real-time monitoring)

Multi-Task Loss Formulation

With two tasks, we need to combine their losses into a single objective. The standard approach is a weighted sum:

$$\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$$

Individual Task Losses

RUL Regression Loss (MSE):

$$\mathcal{L}_{\text{RUL}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Health Classification Loss (Cross-Entropy):

$$\mathcal{L}_{\text{Health}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{p}_{i,k})$$

Scale Mismatch Problem

A critical issue arises: the two losses live on very different scales. With RUL measured in cycles (up to 125), the MSE term can easily reach the hundreds, while cross-entropy over five classes is typically of order 1. In a naive unweighted sum, the regression loss dominates the gradient and the classification head barely trains.
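
To make the mismatch concrete, here is a quick computation of both losses on toy numbers (the values are illustrative, not drawn from C-MAPSS):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error over a batch of scalar targets."""
    return float(np.mean((y - y_hat) ** 2))

def cross_entropy_loss(y_onehot, p_hat, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot targets."""
    return float(-np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1)))

# RUL targets in cycles; predictions off by ~20 cycles give MSE of 400.
y_rul = np.array([110.0, 80.0, 30.0])
y_hat = np.array([90.0, 100.0, 50.0])

# A maximally uncertain 5-class prediction gives cross-entropy ln(5).
y_cls = np.eye(5)[[4, 3, 1]]
p_hat = np.full((3, 5), 0.2)

print(mse_loss(y_rul, y_hat))            # -> 400.0
print(cross_entropy_loss(y_cls, p_hat))  # -> ~1.609 (= ln 5)
```

A two-orders-of-magnitude gap like this is exactly what the balancing methods below are designed to absorb.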


Task Balancing Challenges

Balancing task losses is one of the most challenging aspects of MTL. Several approaches exist:

1. Manual Weighting

$$\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$$

Problem: Requires extensive hyperparameter tuning. Optimal weights may change during training.

2. Uncertainty Weighting

Kendall et al. (2018) proposed learning weights based on task uncertainty:

$$\mathcal{L} = \frac{1}{2\sigma_1^2} \mathcal{L}_1 + \frac{1}{2\sigma_2^2} \mathcal{L}_2 + \log(\sigma_1 \sigma_2)$$

where $\sigma_1, \sigma_2$ are learned parameters. The regularization term $\log(\sigma_1 \sigma_2)$ prevents degenerate solutions.
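
A direct transcription of this combination is below. Parameterizing via $\log\sigma$ so that $\sigma$ stays positive under gradient descent is a common trick, not something the text prescribes; the numbers are illustrative:

```python
import math

def uncertainty_weighted_loss(l1, l2, log_sigma1, log_sigma2):
    """Kendall et al. (2018)-style weighted combination of two losses.

    log_sigma parameters would normally be learned jointly with the model.
    """
    s1, s2 = math.exp(log_sigma1), math.exp(log_sigma2)
    return l1 / (2 * s1**2) + l2 / (2 * s2**2) + math.log(s1 * s2)

# With sigma_1 = 20, the large regression loss is scaled down to 0.5,
# while sigma_2 = 1 leaves the classification loss at 0.8.
print(uncertainty_weighted_loss(400.0, 1.6, math.log(20.0), 0.0))  # -> ~4.30
```

Note how a large $\sigma_1$ down-weights the noisy, large-scale regression loss, but the $\log(\sigma_1\sigma_2)$ term penalizes making $\sigma$ arbitrarily large.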

3. Gradient Normalization

GradNorm (Chen et al., 2018) dynamically adjusts weights to balance gradient magnitudes across tasks.

4. Loss Normalization (Our Approach)

Normalize each loss to have comparable scale before combining:

$$\mathcal{L} = \frac{\mathcal{L}_1}{\|\mathcal{L}_1\|} + \frac{\mathcal{L}_2}{\|\mathcal{L}_2\|}$$

This is the key insight behind our AMNL loss.


The AMNL Loss Function

AMNL stands for Adaptive Multi-task Normalized Loss. Our loss function addresses the scale mismatch through batch-level normalization.

AMNL Loss Formulation

$$\mathcal{L}_{\text{AMNL}} = \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + \frac{\mathcal{L}_{\text{Health}}}{\mu_{\text{Health}}}$$

where $\mu_{\text{RUL}}$ and $\mu_{\text{Health}}$ are the running means of each loss, computed with an exponential moving average:

$$\mu_{\text{RUL}}^{(t)} = \beta \cdot \mu_{\text{RUL}}^{(t-1)} + (1 - \beta) \cdot \mathcal{L}_{\text{RUL}}^{(t)}$$
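
The update itself is one line of code; a small sketch with illustrative loss values:

```python
def ema_update(mu_prev, loss, beta=0.99):
    """One EMA step: mu tracks the typical magnitude of a loss."""
    return beta * mu_prev + (1 - beta) * loss

mu = 1.0  # initialized at 1 (see Practical Implementation)
for batch_loss in [400.0, 380.0, 390.0]:
    mu = ema_update(mu, batch_loss)
# mu has moved from 1.0 toward the ~400 loss scale; with beta = 0.99
# it takes on the order of 100 batches to settle there.
```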

Why Normalization Works

Dividing each loss by a running estimate of its own typical magnitude brings both terms to order 1, so neither task dominates the gradient regardless of the raw scales of MSE and cross-entropy.

Adaptive Nature

The "Adaptive" in AMNL refers to how normalization constants evolve:

  • Early training: Losses are large and variable; $\mu$ adapts quickly
  • Late training: Losses stabilize; $\mu$ provides stable normalization
  • Task difficulty changes: If one task becomes harder (loss increases), its normalized loss increases, drawing more attention

Complete AMNL Training Objective

Combining everything, our training minimizes:

$$\mathcal{L}_{\text{AMNL}} = \frac{1}{\mu_{\text{RUL}}} \cdot \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \frac{1}{\mu_{\text{Health}}} \cdot \left( -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{p}_{i,k}) \right)$$

Gradient Flow

Note that the $\mu$ values are treated as constants during backpropagation (no gradient flows through the normalization). This ensures stable training: we don't want the network to minimize the loss by inflating $\mu$.

Benefits of AMNL

| Benefit | Mechanism | Impact |
|---|---|---|
| No manual tuning | Automatic normalization | Reduces hyperparameters |
| Scale invariance | Division by running mean | Works across RUL ranges |
| Adaptive balancing | EMA tracks loss evolution | Responds to training dynamics |
| Stable training | Normalized gradients | Consistent learning rates |

Practical Implementation

In practice, we use $\beta = 0.99$ for the EMA, which means the running mean reflects approximately the last 100 batches. Initialize $\mu = 1$ to avoid division by zero at the start.
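
Putting these pieces together, here is a minimal AMNL sketch in plain Python. The class structure, the names, and the choice to update $\mu$ before normalizing are our assumptions; in an autodiff framework the $\mu$ values would additionally be detached from the graph, per the Gradient Flow note:

```python
class AMNLLoss:
    """Sketch of the AMNL combination: each task loss is divided by an
    EMA estimate of its own typical magnitude. The mu values are plain
    floats here, i.e. constants with respect to backpropagation."""

    def __init__(self, beta=0.99):
        self.beta = beta
        self.mu_rul = 1.0      # init at 1 to avoid division by zero
        self.mu_health = 1.0

    def __call__(self, loss_rul, loss_health):
        # EMA update first (illustrative ordering); in PyTorch these
        # updates would use detached loss values.
        self.mu_rul = self.beta * self.mu_rul + (1 - self.beta) * loss_rul
        self.mu_health = self.beta * self.mu_health + (1 - self.beta) * loss_health
        return loss_rul / self.mu_rul + loss_health / self.mu_health

amnl = AMNLLoss()
total = amnl(400.0, 1.6)
# Early on mu is still adapting, so the first combined values are large;
# after many batches each normalized term hovers around 1.
```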


Summary

In this section, we explored the multi-task learning framework that powers our AMNL model:

  1. Multi-task learning trains one model on multiple related tasks, sharing representations
  2. RUL and health state are complementary tasks—regression provides precision, classification provides structure
  3. Hard parameter sharing uses a shared backbone (CNN → BiLSTM → Attention) with task-specific heads
  4. Loss scale mismatch causes one task to dominate without proper balancing
  5. AMNL loss normalizes each task loss by its running mean, ensuring balanced contributions
  6. Benefits: Regularization, data efficiency, no manual weight tuning

| Component | Formula | Purpose |
|---|---|---|
| RUL loss | $\mathcal{L}_{\text{RUL}} = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | Continuous prediction |
| Health loss | $\mathcal{L}_{\text{Health}} = -\frac{1}{N}\sum_i \sum_k y_{i,k} \log(\hat{p}_{i,k})$ | Discrete classification |
| Running mean | $\mu^{(t)} = \beta \cdot \mu^{(t-1)} + (1-\beta) \cdot \mathcal{L}^{(t)}$ | Track loss scale |
| AMNL loss | $\mathcal{L}_{\text{AMNL}} = \mathcal{L}_{\text{RUL}}/\mu_{\text{RUL}} + \mathcal{L}_{\text{Health}}/\mu_{\text{Health}}$ | Balanced combination |
Chapter Summary: We have now covered all the mathematical foundations needed to understand our AMNL model: time series representation, CNN convolution, RNN/LSTM temporal processing, attention mechanisms, and multi-task learning. In the next chapter, we will dive deep into the NASA C-MAPSS dataset—understanding its structure, operating conditions, failure modes, and how to extract the features our model will learn from.

With the mathematical foundations complete, we are ready to work with real data. Chapter 3 introduces the benchmark dataset that defines state-of-the-art in predictive maintenance research.