Chapter 2

Multi-Task Learning Framework

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand multi-task learning as a paradigm for learning related tasks jointly
  2. Identify complementary tasks for RUL prediction: regression and health state classification
  3. Compare parameter sharing strategies: hard sharing vs soft sharing
  4. Formulate multi-task loss functions and understand the task balancing problem
  5. Derive the AMNL loss that gives our model its name
  6. Appreciate the benefits of multi-task learning for industrial applications
Why This Matters: The "M" in AMNL stands for Multi-task. Our model predicts both continuous RUL values and discrete health states simultaneously. This section explains why combining these tasks improves performance beyond training them separately—and how to balance their contributions effectively.

What is Multi-Task Learning?

Multi-task learning (MTL) is a machine learning paradigm where a single model learns to perform multiple related tasks simultaneously, sharing representations across tasks.

The Core Idea

Instead of training separate models for each task:

$$\text{Model}_1: \mathcal{X} \to \mathcal{Y}_1, \quad \text{Model}_2: \mathcal{X} \to \mathcal{Y}_2$$

We train a single model with shared parameters that outputs multiple predictions:

$$\text{Model}: \mathcal{X} \to (\mathcal{Y}_1, \mathcal{Y}_2)$$

Why MTL Works

MTL provides several theoretical and practical benefits:

| Benefit | Mechanism | Effect |
|---|---|---|
| Regularization | Tasks constrain each other | Reduces overfitting |
| Data efficiency | Shared features trained on all data | Better with limited data |
| Feature learning | Tasks share useful representations | Richer feature extraction |
| Inductive bias | Related tasks provide hints | Better generalization |

Human Learning Analogy

Consider learning to drive. You don't learn steering, braking, and navigation as completely separate skills. Instead, you learn them together, with shared representations of road conditions, traffic patterns, and vehicle dynamics. Each skill reinforces the others—this is multi-task learning.


Why Multi-Task Learning for RUL?

For RUL prediction, we define two complementary tasks that share a common underlying phenomenon: equipment degradation.

Task 1: RUL Regression

Predict the continuous remaining useful life:

$$\hat{y}_{\text{RUL}} = f_{\text{reg}}(\mathbf{c}) \in \mathbb{R}^+$$

where $\mathbf{c}$ is the context vector from attention. We use MSE or Huber loss to train this task.

Task 2: Health State Classification

Classify the current health state into discrete categories:

$$\hat{\mathbf{p}}_{\text{health}} = \text{softmax}(f_{\text{cls}}(\mathbf{c})) \in \mathbb{R}^K$$

where $K$ is the number of health states (e.g., 5 states from "healthy" to "critical").
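
The text doesn't specify how the discrete health-state labels are derived from the data. A common choice, sketched below, is to bin the clipped RUL into $K$ equal-width ranges; the thresholds, cap of 125 cycles, and function name here are illustrative assumptions, not the definitive labeling scheme:

```python
def health_state(rul, max_rul=125, num_states=5):
    """Map a RUL value (in cycles) to a discrete health state.

    State 0 = "critical" (RUL near 0); state num_states-1 = "healthy".
    Equal-width bins over [0, max_rul] are an illustrative choice.
    """
    rul = max(0.0, min(rul, max_rul))   # clip into [0, max_rul]
    width = max_rul / num_states        # 25 cycles per bin here
    return min(int(rul // width), num_states - 1)

print(health_state(10))   # -> 0 (critical)
print(health_state(60))   # -> 2
print(health_state(120))  # -> 4 (healthy)
```

With this scheme the classification targets are free: they come directly from the same RUL labels the regression task already uses.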

Why These Tasks are Complementary

| Aspect | RUL Regression | Health Classification |
|---|---|---|
| Output | Continuous (0 to 125) | Discrete (5 classes) |
| Precision | Exact cycles remaining | Broad category |
| Signal | Fine-grained trends | Categorical boundaries |
| Loss landscape | Smooth but noisy | Structured but sparse |

Parameter Sharing Strategies

The key design choice in MTL is how to share parameters between tasks. Two main strategies exist.

Hard Parameter Sharing

All tasks share a common representation backbone, with task-specific heads:

$$\mathbf{c} = \text{Shared}(\mathbf{X}), \quad \hat{y}_1 = \text{Head}_1(\mathbf{c}), \quad \hat{y}_2 = \text{Head}_2(\mathbf{c})$$

Our AMNL model uses hard sharing:

  • Shared backbone: CNN → BiLSTM → Attention (produces $\mathbf{c}$)
  • RUL head: Linear layer → RUL prediction
  • Health head: Linear → Softmax → Health state probabilities
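
The two task heads on top of the shared context vector can be sketched in plain NumPy. The shared backbone is elided here, and the dimensions, random initialization, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_context = 64   # size of the attention context vector c (illustrative)
K = 5            # number of health states

# Task-specific heads; the shared backbone (CNN -> BiLSTM -> Attention)
# that would produce c is not shown.
W_rul, b_rul = rng.normal(size=(d_context, 1)), np.zeros(1)
W_cls, b_cls = rng.normal(size=(d_context, K)), np.zeros(K)

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def heads(c):
    rul_hat = c @ W_rul + b_rul             # RUL head: single linear unit
    p_health = softmax(c @ W_cls + b_cls)   # health head: linear -> softmax
    return rul_hat, p_health

c = rng.normal(size=d_context)   # stand-in for the shared context vector
rul_hat, p_health = heads(c)
assert p_health.shape == (K,) and abs(float(p_health.sum()) - 1.0) < 1e-9
```

Because both heads read the same $\mathbf{c}$, gradients from both losses flow back into the shared backbone, which is exactly the hard-sharing coupling described above.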

| Aspect | Hard Sharing | Soft Sharing |
|---|---|---|
| Shared layers | All backbone layers identical | Separate backbones |
| Task coupling | Strong | Weak (learned) |
| Parameters | Fewer (shared) | More (separate) |
| Regularization | Implicit (shared features) | Explicit (constraints) |
| Use case | Closely related tasks | Loosely related tasks |

Why Hard Sharing for RUL?

For our application, hard sharing is appropriate because:

  1. Tasks are tightly coupled: Both RUL and health state derive from the same degradation process
  2. Data efficiency: C-MAPSS has limited training data; sharing parameters prevents overfitting
  3. Computational efficiency: Single backbone reduces inference time (important for real-time monitoring)

Multi-Task Loss Formulation

With two tasks, we need to combine their losses into a single objective. The standard approach is a weighted sum:

$$\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$$

Individual Task Losses

RUL Regression Loss (MSE):

$$\mathcal{L}_{\text{RUL}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Health Classification Loss (Cross-Entropy):

$$\mathcal{L}_{\text{Health}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{p}_{i,k})$$

Scale Mismatch Problem

A critical issue arises: the two losses live on very different scales. With RUL measured in cycles (up to 125), the MSE term can easily reach the hundreds, while cross-entropy over five classes is typically of order 1. In a naive unweighted sum, the regression loss dominates the gradient and the classification head barely trains.
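
To make the mismatch concrete, here is a quick computation of both losses on toy numbers (the values are illustrative, not drawn from C-MAPSS):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error over a batch of scalar targets."""
    return float(np.mean((y - y_hat) ** 2))

def cross_entropy_loss(y_onehot, p_hat, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot targets."""
    return float(-np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1)))

# RUL targets in cycles; predictions off by ~20 cycles give MSE of 400.
y_rul = np.array([110.0, 80.0, 30.0])
y_hat = np.array([90.0, 100.0, 50.0])

# A maximally uncertain 5-class prediction gives cross-entropy ln(5).
y_cls = np.eye(5)[[4, 3, 1]]
p_hat = np.full((3, 5), 0.2)

print(mse_loss(y_rul, y_hat))            # -> 400.0
print(cross_entropy_loss(y_cls, p_hat))  # -> ~1.609 (= ln 5)
```

A two-orders-of-magnitude gap like this is exactly what the balancing methods below are designed to absorb.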


Task Balancing Challenges

Balancing task losses is one of the most challenging aspects of MTL. Several approaches exist:

1. Manual Weighting

$$\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$$

Problem: Requires extensive hyperparameter tuning. Optimal weights may change during training.

2. Uncertainty Weighting

Kendall et al. (2018) proposed learning weights based on task uncertainty:

$$\mathcal{L} = \frac{1}{2\sigma_1^2} \mathcal{L}_1 + \frac{1}{2\sigma_2^2} \mathcal{L}_2 + \log(\sigma_1 \sigma_2)$$

where $\sigma_1, \sigma_2$ are learned parameters. The regularization term $\log(\sigma_1 \sigma_2)$ prevents degenerate solutions.
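
A direct transcription of this combination is below. Parameterizing via $\log\sigma$ so that $\sigma$ stays positive under gradient descent is a common trick, not something the text prescribes; the numbers are illustrative:

```python
import math

def uncertainty_weighted_loss(l1, l2, log_sigma1, log_sigma2):
    """Kendall et al. (2018)-style weighted combination of two losses.

    log_sigma parameters would normally be learned jointly with the model.
    """
    s1, s2 = math.exp(log_sigma1), math.exp(log_sigma2)
    return l1 / (2 * s1**2) + l2 / (2 * s2**2) + math.log(s1 * s2)

# With sigma_1 = 20, the large regression loss is scaled down to 0.5,
# while sigma_2 = 1 leaves the classification loss at 0.8.
print(uncertainty_weighted_loss(400.0, 1.6, math.log(20.0), 0.0))  # -> ~4.30
```

Note how a large $\sigma_1$ down-weights the noisy, large-scale regression loss, but the $\log(\sigma_1\sigma_2)$ term penalizes making $\sigma$ arbitrarily large.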

3. Gradient Normalization

GradNorm (Chen et al., 2018) dynamically adjusts weights to balance gradient magnitudes across tasks.

4. Loss Normalization (Our Approach)

Normalize each loss to have comparable scale before combining:

$$\mathcal{L} = \frac{\mathcal{L}_1}{\|\mathcal{L}_1\|} + \frac{\mathcal{L}_2}{\|\mathcal{L}_2\|}$$

This is the key insight behind our AMNL loss.


The AMNL Loss Function

AMNL stands for Adaptive Multi-task Normalized Loss. Our loss function addresses the scale mismatch through batch-level normalization.

AMNL Loss Formulation

$$\mathcal{L}_{\text{AMNL}} = \frac{\mathcal{L}_{\text{RUL}}}{\mu_{\text{RUL}}} + \frac{\mathcal{L}_{\text{Health}}}{\mu_{\text{Health}}}$$

where $\mu_{\text{RUL}}$ and $\mu_{\text{Health}}$ are the running means of each loss, computed with an exponential moving average:

$$\mu_{\text{RUL}}^{(t)} = \beta \cdot \mu_{\text{RUL}}^{(t-1)} + (1 - \beta) \cdot \mathcal{L}_{\text{RUL}}^{(t)}$$
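
The update itself is one line of code; a small sketch with illustrative loss values:

```python
def ema_update(mu_prev, loss, beta=0.99):
    """One EMA step: mu tracks the typical magnitude of a loss."""
    return beta * mu_prev + (1 - beta) * loss

mu = 1.0  # initialized at 1 (see Practical Implementation)
for batch_loss in [400.0, 380.0, 390.0]:
    mu = ema_update(mu, batch_loss)
# mu has moved from 1.0 toward the ~400 loss scale; with beta = 0.99
# it takes on the order of 100 batches to settle there.
```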

Why Normalization Works

Dividing each loss by a running estimate of its own typical magnitude brings both terms to order 1, so neither task dominates the gradient regardless of the raw scales of MSE and cross-entropy.

Adaptive Nature

The "Adaptive" in AMNL refers to how normalization constants evolve:

  • Early training: Losses are large and variable; $\mu$ adapts quickly
  • Late training: Losses stabilize; $\mu$ provides stable normalization
  • Task difficulty changes: If one task becomes harder (loss increases), its normalized loss increases, drawing more attention

Complete AMNL Training Objective

Combining everything, our training minimizes:

$$\mathcal{L}_{\text{AMNL}} = \frac{1}{\mu_{\text{RUL}}} \cdot \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \frac{1}{\mu_{\text{Health}}} \cdot \left( -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{p}_{i,k}) \right)$$

Gradient Flow

Note that the $\mu$ values are treated as constants during backpropagation (no gradient flows through the normalization). This ensures stable training: we don't want the network to minimize the loss by inflating $\mu$.

Benefits of AMNL

| Benefit | Mechanism | Impact |
|---|---|---|
| No manual tuning | Automatic normalization | Reduces hyperparameters |
| Scale invariance | Division by running mean | Works across RUL ranges |
| Adaptive balancing | EMA tracks loss evolution | Responds to training dynamics |
| Stable training | Normalized gradients | Consistent learning rates |

Practical Implementation

In practice, we use $\beta = 0.99$ for the EMA, which means the running mean reflects approximately the last 100 batches. Initialize $\mu = 1$ to avoid division by zero at the start.
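
Putting these pieces together, here is a minimal AMNL sketch in plain Python. The class structure, the names, and the choice to update $\mu$ before normalizing are our assumptions; in an autodiff framework the $\mu$ values would additionally be detached from the graph, per the Gradient Flow note:

```python
class AMNLLoss:
    """Sketch of the AMNL combination: each task loss is divided by an
    EMA estimate of its own typical magnitude. The mu values are plain
    floats here, i.e. constants with respect to backpropagation."""

    def __init__(self, beta=0.99):
        self.beta = beta
        self.mu_rul = 1.0      # init at 1 to avoid division by zero
        self.mu_health = 1.0

    def __call__(self, loss_rul, loss_health):
        # EMA update first (illustrative ordering); in PyTorch these
        # updates would use detached loss values.
        self.mu_rul = self.beta * self.mu_rul + (1 - self.beta) * loss_rul
        self.mu_health = self.beta * self.mu_health + (1 - self.beta) * loss_health
        return loss_rul / self.mu_rul + loss_health / self.mu_health

amnl = AMNLLoss()
total = amnl(400.0, 1.6)
# Early on mu is still adapting, so the first combined values are large;
# after many batches each normalized term hovers around 1.
```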


Summary

In this section, we explored the multi-task learning framework that powers our AMNL model:

  1. Multi-task learning trains one model on multiple related tasks, sharing representations
  2. RUL and health state are complementary tasks—regression provides precision, classification provides structure
  3. Hard parameter sharing uses a shared backbone (CNN → BiLSTM → Attention) with task-specific heads
  4. Loss scale mismatch causes one task to dominate without proper balancing
  5. AMNL loss normalizes each task loss by its running mean, ensuring balanced contributions
  6. Benefits: Regularization, data efficiency, no manual weight tuning

| Component | Formula | Purpose |
|---|---|---|
| RUL loss | $\mathcal{L}_{\text{RUL}} = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ | Continuous prediction |
| Health loss | $\mathcal{L}_{\text{Health}} = -\frac{1}{N}\sum_i \sum_k y_{i,k} \log(\hat{p}_{i,k})$ | Discrete classification |
| Running mean | $\mu^{(t)} = \beta \cdot \mu^{(t-1)} + (1-\beta) \cdot \mathcal{L}^{(t)}$ | Track loss scale |
| AMNL loss | $\mathcal{L}_{\text{AMNL}} = \mathcal{L}_{\text{RUL}}/\mu_{\text{RUL}} + \mathcal{L}_{\text{Health}}/\mu_{\text{Health}}$ | Balanced combination |
Chapter Summary: We have now covered all the mathematical foundations needed to understand our AMNL model: time series representation, CNN convolution, RNN/LSTM temporal processing, attention mechanisms, and multi-task learning. In the next chapter, we will dive deep into the NASA C-MAPSS dataset—understanding its structure, operating conditions, failure modes, and how to extract the features our model will learn from.

With the mathematical foundations complete, we are ready to work with real data. Chapter 3 introduces the benchmark dataset that defines state-of-the-art in predictive maintenance research.