Learning Objectives
By the end of this section, you will:
- Understand the encoder output as a shared feature representation
- Explain why multi-task learning benefits RUL prediction
- Design a shared representation that serves both regression and classification
- Appreciate the inductive bias introduced by shared features
Why This Matters: The AMNL model predicts both Remaining Useful Life (continuous) and Health State (categorical) from the same encoder. This dual-task design is not merely convenient: it provides crucial regularization that leads to our +21% improvement over single-task methods. Understanding why sharing helps is key to understanding AMNL's success.
Encoder Output Recap
The encoder we built in Chapters 5-7 transforms raw sensor data into a rich 256-dimensional representation.
Complete Encoder Pipeline
```
Input: Raw sensor readings
        (batch, 30, 17)
               ↓
┌─────────────────────────────────┐
│ CNN Feature Extractor           │
│ 17 channels → 64 channels       │
│ Local pattern extraction        │
└─────────────────────────────────┘
               ↓
        (batch, 30, 64)
               ↓
┌─────────────────────────────────┐
│ BiLSTM Encoder                  │
│ 2 layers, 128 hidden            │
│ Temporal dependency modeling    │
└─────────────────────────────────┘
               ↓
        (batch, 30, 256)
               ↓
┌─────────────────────────────────┐
│ Multi-Head Attention            │
│ 8 heads, residual + LayerNorm   │
│ Focus on informative timesteps  │
└─────────────────────────────────┘
               ↓
          (batch, 256)
               ↓
        SHARED FEATURE
        REPRESENTATION
```
What the 256 Dimensions Capture
The encoder output is a compressed representation of the entire 30-timestep sensor window:
- Local patterns: CNN extracts sensor correlations and local trends
- Temporal dynamics: BiLSTM captures how patterns evolve over time
- Relevant focus: Attention emphasizes degradation signals over noise
The encoder computes z = Encoder(X) ∈ ℝ²⁵⁶, where X ∈ ℝ³⁰ˣ¹⁷ is the input sensor window. This single vector must contain all information needed for prediction.
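The pipeline above can be sketched in PyTorch. This is a minimal illustration using the dimensions stated in the diagram; the kernel size, pooling strategy, and module names are assumptions, not the exact AMNL implementation:

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of CNN -> BiLSTM -> attention -> pooled 256-dim vector."""
    def __init__(self, n_sensors=17, cnn_channels=64, lstm_hidden=128, n_heads=8):
        super().__init__()
        # CNN feature extractor: 17 sensor channels -> 64 channels
        # (kernel_size=3 with padding=1 preserves the 30-step window; assumed)
        self.cnn = nn.Conv1d(n_sensors, cnn_channels, kernel_size=3, padding=1)
        # BiLSTM: 2 layers, 128 hidden per direction -> 256-dim outputs
        self.lstm = nn.LSTM(cnn_channels, lstm_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Multi-head self-attention over timesteps, residual + LayerNorm
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * lstm_hidden)

    def forward(self, x):                                 # x: (batch, 30, 17)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (batch, 30, 64)
        h, _ = self.lstm(h)                               # (batch, 30, 256)
        a, _ = self.attn(h, h, h)                         # attend over timesteps
        h = self.norm(h + a)                              # residual + LayerNorm
        return h.mean(dim=1)                              # pool -> (batch, 256)

z = EncoderSketch()(torch.randn(4, 30, 17))
print(z.shape)  # torch.Size([4, 256])
```

Tracking the shapes in the forward pass confirms the diagram: the window shrinks only at the final pooling step, which collapses the 30 timesteps into the single shared vector.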
Multi-Task Learning Motivation
Why predict both RUL and health state? Why not just RUL directly?
The Problem with Single-Task Learning
Training only for RUL prediction has limitations:
- Noisy supervision: RUL labels have inherent uncertainty (exact failure time is stochastic)
- Regression difficulty: Predicting exact cycle counts is harder than relative ordering
- Overfitting risk: Model may memorize spurious correlations
Multi-Task as Regularization
Adding the health classification task provides implicit regularization:
| Mechanism | Effect |
|---|---|
| Shared features | Forces encoder to learn generalizable representations |
| Multiple objectives | Prevents overfitting to single task noise |
| Complementary signals | Health state provides discrete checkpoints for RUL |
| Gradient diversity | Different loss gradients stabilize training |
The Two Tasks
| Task | Output | Loss | Purpose |
|---|---|---|---|
| RUL Prediction | Scalar (cycles) | Weighted MSE | Exact remaining life |
| Health Classification | 3 classes | Cross-Entropy | Degradation stage |
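The two losses in the table can be sketched as follows. The exact weighting scheme inside AMNL's weighted MSE is not specified in this section, so the asymmetric `late_weight` below (penalizing over-predicted RUL more heavily, since missing an imminent failure is costlier) is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def weighted_mse(rul_pred, rul_true, late_weight=2.0):
    # Hypothetical weighting: over-predictions (pred > true) are penalized more
    w = torch.where(rul_pred > rul_true,
                    torch.tensor(late_weight), torch.tensor(1.0))
    return (w * (rul_pred - rul_true) ** 2).mean()

rul_pred = torch.tensor([100.0, 50.0])
rul_true = torch.tensor([90.0, 60.0])
health_logits = torch.tensor([[2.0, 0.5, -1.0]])   # 3 degradation classes
health_label = torch.tensor([0])

loss_rul = weighted_mse(rul_pred, rul_true)        # (2*100 + 1*100) / 2 = 150.0
loss_health = F.cross_entropy(health_logits, health_label)
```

The classification loss is the standard cross-entropy over the 3 health-class logits; only the RUL loss carries the task-specific weighting.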
Shared Representation Design
Both tasks receive the same encoder output but process it through separate heads.
Architecture Overview
```
Encoder Output: z ∈ ℝ²⁵⁶
             ↓
     ┌───────┴───────┐
     ↓               ↓
┌───────────────┐ ┌────────────────┐
│   RUL Head    │ │  Health Head   │
│ (Regression)  │ │(Classification)│
├───────────────┤ ├────────────────┤
│Linear(256,128)│ │ Linear(256,64) │
│ ReLU          │ │ ReLU           │
│ Dropout(0.3)  │ │ Dropout(0.3)   │
│ Linear(128,1) │ │ Linear(64,3)   │
└───────────────┘ └────────────────┘
        ↓                 ↓
     RUL ∈ ℝ        logits ∈ ℝ³
    (cycles)      (health classes)
```
Mathematical Formulation
Given encoder output z ∈ ℝ²⁵⁶, the two heads produce:

ŷ_RUL = f_RUL(z; θ_RUL)        ŷ_health = f_health(z; θ_health)

Where:
- f_RUL: RUL prediction head (MLP)
- f_health: health classification head (MLP)
- θ_RUL, θ_health: head-specific parameters
- z: shared encoder output (256-dim)
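The two heads translate directly into PyTorch using the layer sizes from the architecture diagram (the variable names are ours):

```python
import torch
import torch.nn as nn

# RUL head: 256 -> 128 -> 1 (scalar cycles)
rul_head = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3), nn.Linear(128, 1))

# Health head: 256 -> 64 -> 3 (class logits)
health_head = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 3))

z = torch.randn(8, 256)       # shared encoder output for a batch of 8
rul = rul_head(z)             # (8, 1) predicted remaining cycles
logits = health_head(z)       # (8, 3) health-class logits
```

Note that both heads read the identical tensor z; nothing in either head sees the raw sensors, so all task-relevant information must survive the shared encoding.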
Parameter Sharing Strategy
| Component | Shared? | Rationale |
|---|---|---|
| CNN | Yes | Local patterns useful for both tasks |
| BiLSTM | Yes | Temporal dynamics shared |
| Attention | Yes | Relevant timesteps shared |
| RUL Head | No | Task-specific transformation |
| Health Head | No | Task-specific transformation |
Hard Parameter Sharing
We use hard parameter sharing where encoder parameters are identical for both tasks. This is the most common multi-task approach and provides strongest regularization. Soft sharing (separate encoders with regularization) is an alternative but adds complexity.
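Hard parameter sharing can be demonstrated in a few lines: because both heads consume the same z, a single backward pass deposits gradients from both losses into the shared encoder. The encoder is stubbed with one Linear layer here purely for brevity:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(17, 256)       # stand-in for the CNN+BiLSTM+attention stack
rul_head = nn.Linear(256, 1)
health_head = nn.Linear(256, 3)

x = torch.randn(4, 17)
z = encoder(x)                     # ONE shared representation for both tasks

# Both task losses are functions of the same z (dummy targets for illustration)
loss = rul_head(z).pow(2).mean() + nn.functional.cross_entropy(
    health_head(z), torch.tensor([0, 1, 2, 0]))
loss.backward()

# The shared encoder received gradient contributions from BOTH losses
print(encoder.weight.grad is not None)  # True
```

Under soft sharing, by contrast, each task would own a separate encoder and a penalty term would pull the two parameter sets together; hard sharing gets the coupling for free at the cost of zero task-specific encoder capacity.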
Inductive Bias from Sharing
Sharing the encoder introduces beneficial inductive biases.
What is Inductive Bias?
Inductive bias refers to assumptions built into the model that help generalization. Multi-task learning introduces the bias that:
Features useful for predicting RUL should also be useful for predicting health state, and vice versa.
Why This Bias is Valid
For predictive maintenance, this assumption is well-founded:
- Same underlying process: Both tasks predict aspects of degradation
- Same sensor inputs: Same physical signals contain both information types
- Causal relationship: Health state directly relates to RUL range
Regularization Effect
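Concretely, the two losses are optimized jointly. A common form of the combined objective is shown below; the trade-off weight λ is left unspecified here, as this section does not state AMNL's exact value:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{WMSE}}\!\left(\hat{y}_{\text{RUL}},\, y_{\text{RUL}}\right)
  + \lambda\, \mathcal{L}_{\text{CE}}\!\left(\hat{y}_{\text{health}},\, y_{\text{health}}\right)
```

The cross-entropy term acts as a regularizer on the shared encoder: its gradients penalize representations that fit RUL label noise but fail to separate degradation stages.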
Empirical Evidence
Our ablation studies (Chapter 17) show that removing the health task degrades RUL performance dramatically: the FD002 score (lower is better) rises by 304%. This drop confirms the regularization value of multi-task learning.
| Configuration | FD002 Score | Degradation |
|---|---|---|
| AMNL (dual-task) | 1,102 | Baseline |
| RUL only (single-task) | 4,453 | +304% |
Key Finding
The health classification task is not just an auxiliary outputβit is essential for achieving state-of-the-art RUL prediction. The shared representation forces the encoder to learn features that generalize across both tasks, preventing overfitting to RUL noise.
Summary
In this section, we introduced the shared feature representation:
- Encoder output: 256-dimensional vector capturing sensor patterns
- Multi-task motivation: Regularization through complementary tasks
- Shared design: Same encoder, separate prediction heads
- Inductive bias: Forces generalizable feature learning
| Aspect | Value |
|---|---|
| Shared representation | z ∈ ℝ²⁵⁶ |
| RUL output | Scalar (cycles) |
| Health output | 3 classes (logits) |
| Sharing strategy | Hard parameter sharing |
| Single-task degradation | Up to +304% |
Looking Ahead: We have established why sharing matters. The next section designs the RUL prediction headβthe MLP that transforms the 256-dim representation into a single RUL value, including the architectural choices for depth, width, and activation.