AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the encoder output as a shared feature representation
Explain why multi-task learning benefits RUL prediction
Design a shared representation that serves both regression and classification
Appreciate the inductive bias introduced by shared features

Why This Matters: The AMNL model predicts both Remaining Useful Life (continuous) and Health State (categorical) from the same encoder. This dual-task design is not merely convenient—it provides crucial regularization that leads to our +21% improvement over single-task methods. Understanding why sharing helps is key to understanding AMNL's success.

Encoder Output Recap

The encoder we built in Chapters 5-7 transforms raw sensor data into a rich 256-dimensional representation.

Complete Encoder Pipeline

Input: Raw sensor readings
       (batch, 30, 17)
            ↓
┌─────────────────────────────────┐
│  CNN Feature Extractor          │
│  17 channels → 64 channels      │
│  Local pattern extraction       │
└─────────────────────────────────┘
            ↓
       (batch, 30, 64)
            ↓
┌─────────────────────────────────┐
│  BiLSTM Encoder                 │
│  2 layers, 128 hidden           │
│  Temporal dependency modeling   │
└─────────────────────────────────┘
            ↓
       (batch, 30, 256)
            ↓
┌─────────────────────────────────┐
│  Multi-Head Attention           │
│  8 heads, residual + LayerNorm  │
│  Focus on informative timesteps │
└─────────────────────────────────┘
            ↓
       (batch, 256)
            ↓
        SHARED FEATURE
        REPRESENTATION

What the 256 Dimensions Capture

The encoder output $\mathbf{z} \in \mathbb{R}^{256}$ is a compressed representation of the entire 30-timestep sensor window:

Local patterns: CNN extracts sensor correlations and local trends
Temporal dynamics: BiLSTM captures how patterns evolve over time
Relevant focus: Attention emphasizes degradation signals over noise

$$\mathbf{z} = \text{Encoder}(\mathbf{X}) \in \mathbb{R}^{256}$$

Where $\mathbf{X} \in \mathbb{R}^{30 \times 17}$ is the input sensor window. This single vector $\mathbf{z}$ must contain all information needed for prediction.
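
The shape bookkeeping in the pipeline can be checked with a minimal PyTorch sketch. This is not the encoder built in Chapters 5-7: the single convolution, its kernel size of 3, and the mean pooling over time that takes (batch, 30, 256) to (batch, 256) are assumptions made only so the shapes line up, and the class name `EncoderSketch` is hypothetical.

```python
import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    """Shape-level stand-in for the CNN -> BiLSTM -> attention encoder."""

    def __init__(self, n_sensors=17, cnn_channels=64, lstm_hidden=128, n_heads=8):
        super().__init__()
        # Conv1d expects (batch, channels, time); padding=1 keeps all 30 timesteps.
        # A single layer with kernel_size=3 is an ASSUMPTION for illustration.
        self.cnn = nn.Conv1d(n_sensors, cnn_channels, kernel_size=3, padding=1)
        # Bidirectional: forward and backward states are concatenated, 2 * 128 = 256.
        self.bilstm = nn.LSTM(cnn_channels, lstm_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        d_model = 2 * lstm_hidden
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, 30, 17)
        h = self.cnn(x.transpose(1, 2))        # (batch, 64, 30)
        h = torch.relu(h).transpose(1, 2)      # (batch, 30, 64)
        h, _ = self.bilstm(h)                  # (batch, 30, 256)
        a, _ = self.attn(h, h, h)              # self-attention, (batch, 30, 256)
        h = self.norm(h + a)                   # residual + LayerNorm
        return h.mean(dim=1)                   # ASSUMED mean pooling -> (batch, 256)


encoder = EncoderSketch()
x = torch.randn(4, 30, 17)                     # 4 windows, 30 timesteps, 17 sensors
z = encoder(x)
print(z.shape)                                 # torch.Size([4, 256])
```

Whatever pooling the real encoder uses, the point is the same: each 30 × 17 window is reduced to one 256-dimensional vector per sample.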

Multi-Task Learning Motivation

Why predict both RUL and health state? Why not just RUL directly?

The Problem with Single-Task Learning

Training only for RUL prediction has limitations:

Noisy supervision: RUL labels have inherent uncertainty (exact failure time is stochastic)
Regression difficulty: Predicting exact cycle counts is harder than relative ordering
Overfitting risk: Model may memorize spurious correlations

Multi-Task as Regularization

Adding the health classification task provides implicit regularization:

Mechanism	Effect
Shared features	Forces encoder to learn generalizable representations
Multiple objectives	Prevents overfitting to single task noise
Complementary signals	Health state provides discrete checkpoints for RUL
Gradient diversity	Different loss gradients stabilize training

The Two Tasks

Task	Output	Loss	Purpose
RUL Prediction	Scalar (cycles)	Weighted MSE	Exact remaining life
Health Classification	3 classes	Cross-Entropy	Degradation stage
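
One way the two losses in the table could be combined into a single training objective is sketched below. The text does not specify the MSE weighting scheme or how the two terms are balanced, so `sample_weight`, `lam`, and the function name `joint_loss` are hypothetical placeholders rather than the AMNL formulation.

```python
import torch
import torch.nn.functional as F


def joint_loss(rul_pred, rul_true, health_logits, health_true,
               sample_weight=None, lam=0.5):
    """Weighted MSE on RUL plus lam * cross-entropy on health state.

    sample_weight and lam are HYPOTHETICAL; the text gives no values for them.
    """
    sq_err = (rul_pred - rul_true) ** 2                  # per-sample squared error
    if sample_weight is None:
        sample_weight = torch.ones_like(sq_err)          # falls back to plain MSE
    rul_loss = (sample_weight * sq_err).sum() / sample_weight.sum()
    health_loss = F.cross_entropy(health_logits, health_true)  # expects raw logits
    return rul_loss + lam * health_loss, rul_loss, health_loss


torch.manual_seed(0)
rul_pred = torch.tensor([100.0, 40.0, 5.0])              # predicted cycles
rul_true = torch.tensor([110.0, 35.0, 8.0])              # target cycles
health_logits = torch.randn(3, 3)                        # (batch, 3 classes)
health_true = torch.tensor([0, 1, 2])                    # class indices

total, rul_loss, health_loss = joint_loss(
    rul_pred, rul_true, health_logits, health_true)
print(float(total), float(rul_loss), float(health_loss))
```

Because `total` is a single scalar, one call to `total.backward()` would send gradients from both tasks through whatever network produced the predictions.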

Shared Representation Design

Both tasks receive the same encoder output but process it through separate heads.

Architecture Overview

Encoder Output: z ∈ ℝ²⁵⁶
                    │
        ┌───────────┴───────────┐
        ↓                       ↓
┌────────────────┐      ┌────────────────┐
│ RUL Head       │      │ Health Head    │
│ (Regression)   │      │(Classification)│
├────────────────┤      ├────────────────┤
│ Linear(256,128)│      │ Linear(256,64) │
│ ReLU           │      │ ReLU           │
│ Dropout(0.3)   │      │ Dropout(0.3)   │
│ Linear(128,1)  │      │ Linear(64,3)   │
└────────────────┘      └────────────────┘
        ↓                       ↓
    RUL ∈ ℝ               logits ∈ ℝ³
   (cycles)            (health classes)
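
The two heads translate almost line for line into PyTorch. In this sketch the layer sizes and dropout rate are taken from the diagram, while the class name `DualTaskHeads` and the random tensor standing in for the encoder output are illustrative assumptions, not the final head design.

```python
import torch
import torch.nn as nn


class DualTaskHeads(nn.Module):
    """Two task-specific MLP heads reading the same 256-dim feature vector."""

    def __init__(self, d_shared=256, n_health_classes=3, p_drop=0.3):
        super().__init__()
        self.rul_head = nn.Sequential(
            nn.Linear(d_shared, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 1),
        )
        self.health_head = nn.Sequential(
            nn.Linear(d_shared, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_health_classes),
        )

    def forward(self, z):                         # z: (batch, 256)
        rul = self.rul_head(z).squeeze(-1)        # (batch,) predicted cycles
        health_logits = self.health_head(z)       # (batch, 3), unnormalized scores
        return rul, health_logits


heads = DualTaskHeads().eval()                    # eval() disables dropout
z = torch.randn(4, 256)                           # stand-in for the encoder output
rul, health_logits = heads(z)
print(rul.shape, health_logits.shape)             # torch.Size([4]) torch.Size([4, 3])
```

Both heads read the same tensor `z`; nothing is copied or re-encoded per task.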

Mathematical Formulation

Given encoder output $\mathbf{z}$:

$$\hat{y}_{\text{RUL}} = f_{\text{RUL}}(\mathbf{z}; \theta_{\text{RUL}})$$

$$\hat{\mathbf{p}}_{\text{health}} = f_{\text{health}}(\mathbf{z}; \theta_{\text{health}})$$

Where:

$f_{\text{RUL}}$: RUL prediction head (MLP)
$f_{\text{health}}$: Health classification head (MLP with a softmax over its 3 output logits)
$\theta_{\text{RUL}}, \theta_{\text{health}}$: Head-specific parameters
$\mathbf{z}$: Shared encoder output (256-dim)

Component	Shared?	Rationale
CNN	Yes	Local patterns useful for both tasks
BiLSTM	Yes	Temporal dynamics shared
Attention	Yes	Relevant timesteps shared
RUL Head	No	Task-specific transformation
Health Head	No	Task-specific transformation

Hard Parameter Sharing

We use hard parameter sharing, where the encoder parameters are identical for both tasks. This is the most common multi-task approach and provides the strongest regularization, because every encoder weight must serve both losses. Soft sharing (separate encoders whose weights are regularized toward each other) is an alternative but adds complexity.
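
Hard sharing can be seen directly in the gradients. In the sketch below a single `Linear` layer is a hypothetical stand-in for the full encoder and each head is reduced to one layer; backpropagating only the health loss still produces gradients in the shared encoder, while the RUL head receives none.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# HYPOTHETICAL stand-ins: one Linear layer plays the role of the whole encoder.
encoder = nn.Sequential(nn.Linear(17, 256), nn.ReLU())   # shared by both tasks
rul_head = nn.Linear(256, 1)                              # task-specific
health_head = nn.Linear(256, 3)                           # task-specific

x = torch.randn(8, 17)
health_true = torch.randint(0, 3, (8,))

z = encoder(x)                                            # one shared representation
health_loss = F.cross_entropy(health_head(z), health_true)
health_loss.backward()                                    # backprop the health loss only

enc_grad = encoder[0].weight.grad
print(bool(enc_grad.abs().sum() > 0))     # True: the health task shapes the encoder
print(rul_head.weight.grad is None)       # True: the RUL head is untouched
```

With soft sharing there would be two encoders, and the health loss would reach the RUL encoder only indirectly through the regularization term.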

Inductive Bias from Sharing

Sharing the encoder introduces beneficial inductive biases.

What is Inductive Bias?

Inductive bias refers to assumptions built into the model that help generalization. Multi-task learning introduces the bias that:

Features useful for predicting RUL should also be useful for predicting health state, and vice versa.

Why This Bias is Valid

For predictive maintenance, this assumption is well-founded:

Same underlying process: Both tasks predict aspects of degradation
Same sensor inputs: Same physical signals contain both information types
Causal relationship: Health state directly relates to RUL range
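
The link between health state and RUL range can be made concrete with a small labeling function. The two thresholds and the class names (healthy, degrading, critical) are hypothetical, chosen only for illustration; the text states only that there are 3 health classes.

```python
def health_state(rul, degrading_below=125, critical_below=50):
    """Map an RUL value (cycles) to one of 3 health classes.

    Both thresholds are HYPOTHETICAL, not values from the text.
    Returns 0 = healthy, 1 = degrading, 2 = critical.
    """
    if rul < critical_below:
        return 2
    if rul < degrading_below:
        return 1
    return 0


rul_trajectory = [200, 130, 124, 60, 49, 0]     # one unit running to failure
labels = [health_state(r) for r in rul_trajectory]
print(labels)                                   # [0, 0, 1, 1, 2, 2]
```

Each class corresponds to an interval of RUL values, so a correct health prediction bounds the plausible RUL, which is the sense in which the classification task provides discrete checkpoints for the regression task.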

Regularization Effect

Empirical Evidence

Our ablation studies (Chapter 17) show that removing the health task degrades RUL performance by up to 304%: on FD002 the score, where lower is better, rises from 1,102 to 4,453. This drop supports the regularization value of multi-task learning.

Configuration	FD002 Score	Degradation
AMNL (dual-task)	1,102	Baseline
RUL only (single-task)	4,453	+304%
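
The degradation figure follows directly from the two scores in the table, taking the dual-task score as the baseline:

$$\frac{4453 - 1102}{1102} = \frac{3351}{1102} \approx 3.04 \quad\Rightarrow\quad +304\%$$

Equivalently, the single-task score is about 4.04 times the dual-task score.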

Key Finding

The health classification task is not just an auxiliary output—it is essential for achieving state-of-the-art RUL prediction. The shared representation forces the encoder to learn features that generalize across both tasks, preventing overfitting to RUL noise.

Summary

In this section, we introduced the shared feature representation:

Encoder output: 256-dimensional vector capturing sensor patterns
Multi-task motivation: Regularization through complementary tasks
Shared design: Same encoder, separate prediction heads
Inductive bias: Forces generalizable feature learning

Aspect	Value
Shared representation	z ∈ ℝ²⁵⁶
RUL output	Scalar (cycles)
Health output	3 classes (logits)
Sharing strategy	Hard parameter sharing
Single-task degradation	Up to +304%

Looking Ahead: We have established why sharing matters. The next section designs the RUL prediction head: the MLP that transforms the 256-dim representation into a single RUL value, including the architectural choices for depth, width, and activation.