Chapter 8

Shared Feature Representation

Dual-Task Prediction Heads

Learning Objectives

By the end of this section, you will:

  1. Understand the encoder output as a shared feature representation
  2. Explain why multi-task learning benefits RUL prediction
  3. Design a shared representation that serves both regression and classification
  4. Appreciate the inductive bias introduced by shared features
Why This Matters: The AMNL model predicts both Remaining Useful Life (continuous) and Health State (categorical) from the same encoder. This dual-task design is not merely convenient: it provides crucial regularization that leads to our +21% improvement over single-task methods. Understanding why sharing helps is key to understanding AMNL's success.

Encoder Output Recap

The encoder we built in Chapters 5-7 transforms raw sensor data into a rich 256-dimensional representation.

Complete Encoder Pipeline

πŸ“text
1Input: Raw sensor readings
2       (batch, 30, 17)
3            ↓
4β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
5β”‚  CNN Feature Extractor          β”‚
6β”‚  17 channels β†’ 64 channels      β”‚
7β”‚  Local pattern extraction       β”‚
8β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
9            ↓
10       (batch, 30, 64)
11            ↓
12β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
13β”‚  BiLSTM Encoder                 β”‚
14β”‚  2 layers, 128 hidden           β”‚
15β”‚  Temporal dependency modeling   β”‚
16β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
17            ↓
18       (batch, 30, 256)
19            ↓
20β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
21β”‚  Multi-Head Attention           β”‚
22β”‚  8 heads, residual + LayerNorm  β”‚
23β”‚  Focus on informative timesteps β”‚
24β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
25            ↓
26       (batch, 256)
27            ↓
28        SHARED FEATURE
29        REPRESENTATION

What the 256 Dimensions Capture

The encoder output $\mathbf{z} \in \mathbb{R}^{256}$ is a compressed representation of the entire 30-timestep sensor window:

  • Local patterns: CNN extracts sensor correlations and local trends
  • Temporal dynamics: BiLSTM captures how patterns evolve over time
  • Relevant focus: Attention emphasizes degradation signals over noise
$$\mathbf{z} = \text{Encoder}(\mathbf{X}) \in \mathbb{R}^{256}$$

where $\mathbf{X} \in \mathbb{R}^{30 \times 17}$ is the input sensor window. This single vector $\mathbf{z}$ must contain all information needed for prediction.
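As a sanity check, the pipeline above can be sketched in PyTorch with dummy tensors to verify the shape at each stage. The layer sizes match the diagram, but details such as the kernel size and the final pooling step are illustrative assumptions, not AMNL's exact implementation from Chapters 5-7.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Illustrative stand-in for the CNN + BiLSTM + attention encoder."""
    def __init__(self, n_sensors=17, cnn_channels=64, lstm_hidden=128, n_heads=8):
        super().__init__()
        # CNN: 17 sensor channels -> 64 feature channels (local patterns)
        self.cnn = nn.Conv1d(n_sensors, cnn_channels, kernel_size=3, padding=1)
        # BiLSTM: 2 layers, 128 hidden per direction -> 256-dim per timestep
        self.lstm = nn.LSTM(cnn_channels, lstm_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Multi-head self-attention over the 256-dim timestep features
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * lstm_hidden)

    def forward(self, x):                     # x: (batch, 30, 17)
        h = self.cnn(x.transpose(1, 2))       # (batch, 64, 30)
        h = torch.relu(h).transpose(1, 2)     # (batch, 30, 64)
        h, _ = self.lstm(h)                   # (batch, 30, 256)
        a, _ = self.attn(h, h, h)             # (batch, 30, 256)
        h = self.norm(h + a)                  # residual + LayerNorm
        return h.mean(dim=1)                  # pool over time -> (batch, 256)

z = EncoderSketch()(torch.randn(4, 30, 17))
print(z.shape)  # torch.Size([4, 256])
```

Whatever the internal layers, the contract is the same: a `(batch, 30, 17)` window in, a `(batch, 256)` shared representation out.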


Multi-Task Learning Motivation

Why predict both RUL and health state? Why not just RUL directly?

The Problem with Single-Task Learning

Training only for RUL prediction has limitations:

  • Noisy supervision: RUL labels have inherent uncertainty (exact failure time is stochastic)
  • Regression difficulty: Predicting exact cycle counts is harder than relative ordering
  • Overfitting risk: Model may memorize spurious correlations

Multi-Task as Regularization

Adding the health classification task provides implicit regularization:

| Mechanism | Effect |
|---|---|
| Shared features | Forces encoder to learn generalizable representations |
| Multiple objectives | Prevents overfitting to single-task noise |
| Complementary signals | Health state provides discrete checkpoints for RUL |
| Gradient diversity | Different loss gradients stabilize training |

The Two Tasks

| Task | Output | Loss | Purpose |
|---|---|---|---|
| RUL Prediction | Scalar (cycles) | Weighted MSE | Exact remaining life |
| Health Classification | 3 classes | Cross-Entropy | Degradation stage |
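The two losses in the table can be sketched as follows. The health task uses standard cross-entropy; for the RUL task, the sketch below assumes a simple scheme that penalizes over-prediction (predicting failure later than it occurs) more heavily. AMNL's actual weighting is defined in its own loss chapter; `late_penalty` here is a hypothetical parameter.

```python
import torch
import torch.nn.functional as F

def weighted_mse(pred, target, late_penalty=2.0):
    # Weight over-predictions (pred > target) more heavily: predicting
    # failure too late is costlier than predicting it too early.
    # This weighting scheme is illustrative, not AMNL's exact one.
    err = pred - target
    w = torch.where(err > 0, torch.full_like(err, late_penalty), torch.ones_like(err))
    return (w * err ** 2).mean()

# Toy batch of two engines
rul_pred = torch.tensor([100.0, 60.0])   # predicted RUL in cycles
rul_true = torch.tensor([90.0, 70.0])    # true RUL in cycles
logits   = torch.tensor([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])  # 3-class health logits
health   = torch.tensor([0, 1])          # true health-state labels

loss_rul    = weighted_mse(rul_pred, rul_true)   # (2*10^2 + 1*10^2) / 2 = 150.0
loss_health = F.cross_entropy(logits, health)
print(loss_rul.item(), loss_health.item())
```

Note the asymmetry: the first engine's +10-cycle over-prediction contributes twice as much to the loss as the second engine's -10-cycle under-prediction.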

Shared Representation Design

Both tasks receive the same encoder output but process it through separate heads.

Architecture Overview

πŸ“text
1Encoder Output: z ∈ ℝ²⁡⁢
2                    β”‚
3        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4        ↓                       ↓
5β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
6β”‚  RUL Head     β”‚       β”‚  Health Head  β”‚
7β”‚  (Regression) β”‚       β”‚(Classification)β”‚
8β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€       β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
9β”‚ Linear(256,128)β”‚      β”‚ Linear(256,64) β”‚
10β”‚ ReLU           β”‚      β”‚ ReLU           β”‚
11β”‚ Dropout(0.3)   β”‚      β”‚ Dropout(0.3)   β”‚
12β”‚ Linear(128,1)  β”‚      β”‚ Linear(64,3)   β”‚
13β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
14        ↓                       ↓
15    RUL ∈ ℝ               logits ∈ ℝ³
16   (cycles)            (health classes)
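The two heads map directly onto a small PyTorch module. Layer widths and dropout rates follow the diagram; the module name itself is just illustrative.

```python
import torch
import torch.nn as nn

class DualHeads(nn.Module):
    """Two task-specific heads consuming the same 256-dim shared z."""
    def __init__(self, d=256, p_drop=0.3, n_classes=3):
        super().__init__()
        self.rul_head = nn.Sequential(
            nn.Linear(d, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 1),            # scalar RUL in cycles
        )
        self.health_head = nn.Sequential(
            nn.Linear(d, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_classes),     # 3-class health logits
        )

    def forward(self, z):                 # z: (batch, 256)
        return self.rul_head(z).squeeze(-1), self.health_head(z)

heads = DualHeads()
rul, logits = heads(torch.randn(8, 256))
print(rul.shape, logits.shape)  # torch.Size([8]) torch.Size([8, 3])
```

Both heads see exactly the same input; everything task-specific lives in these few thousand head parameters, while the millions of encoder parameters are shared.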

Mathematical Formulation

Given encoder output z\mathbf{z}:

$$\hat{y}_{\text{RUL}} = f_{\text{RUL}}(\mathbf{z}; \theta_{\text{RUL}})$$

$$\hat{\mathbf{p}}_{\text{health}} = f_{\text{health}}(\mathbf{z}; \theta_{\text{health}})$$

Where:

  • $f_{\text{RUL}}$: RUL prediction head (MLP)
  • $f_{\text{health}}$: Health classification head (MLP)
  • $\theta_{\text{RUL}}, \theta_{\text{health}}$: head-specific parameters
  • $\mathbf{z}$: shared encoder output (256-dim)

Parameter Sharing Strategy

| Component | Shared? | Rationale |
|---|---|---|
| CNN | Yes | Local patterns useful for both tasks |
| BiLSTM | Yes | Temporal dynamics shared |
| Attention | Yes | Relevant timesteps shared |
| RUL Head | No | Task-specific transformation |
| Health Head | No | Task-specific transformation |

Hard Parameter Sharing

We use hard parameter sharing: the encoder parameters are identical for both tasks. This is the most common multi-task approach and provides the strongest regularization. Soft sharing (task-specific encoders whose parameters are tied together by a similarity penalty) is an alternative, but it adds complexity.
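Hard parameter sharing can be illustrated with a minimal training step: one shared encoder, two heads, and a single combined loss, so gradients from both tasks flow into the same encoder parameters. The encoder below is a stand-in `nn.Linear` (not the real CNN+BiLSTM+attention stack), and the health-loss weight of 0.5 is an assumed hyperparameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder     = nn.Linear(30 * 17, 256)    # stand-in for the shared encoder
rul_head    = nn.Linear(256, 1)          # task-specific: RUL regression
health_head = nn.Linear(256, 3)          # task-specific: health classification

params = (list(encoder.parameters()) + list(rul_head.parameters())
          + list(health_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

x      = torch.randn(8, 30, 17)          # batch of sensor windows
y_rul  = torch.rand(8) * 125             # RUL targets (cycles)
y_hlth = torch.randint(0, 3, (8,))       # health-state labels

z = encoder(x.flatten(1))                # one shared representation for both tasks
loss = (F.mse_loss(rul_head(z).squeeze(-1), y_rul)
        + 0.5 * F.cross_entropy(health_head(z), y_hlth))  # weight 0.5 assumed

loss.backward()                          # gradients from BOTH losses hit the encoder
opt.step()
print(encoder.weight.grad is not None)   # True
```

Because the forward pass computes `z` once and both losses backpropagate through it, every encoder weight receives the "gradient diversity" signal described above.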


Inductive Bias from Sharing

Sharing the encoder introduces beneficial inductive biases.

What is Inductive Bias?

Inductive bias refers to assumptions built into the model that help generalization. Multi-task learning introduces the bias that:

Features useful for predicting RUL should also be useful for predicting health state, and vice versa.

Why This Bias is Valid

For predictive maintenance, this assumption is well-founded:

  • Same underlying process: Both tasks predict aspects of degradation
  • Same sensor inputs: Same physical signals contain both information types
  • Causal relationship: Health state directly relates to RUL range

Regularization Effect

Empirical Evidence

Our ablation studies (Chapter 17) show that removing the health task degrades RUL performance by up to 304%. This dramatic drop confirms the regularization value of multi-task learning.

| Configuration | FD002 Score | Degradation |
|---|---|---|
| AMNL (dual-task) | 1,102 | Baseline |
| RUL only (single-task) | 4,453 | +304% |

Key Finding

The health classification task is not just an auxiliary output: it is essential for achieving state-of-the-art RUL prediction. The shared representation forces the encoder to learn features that generalize across both tasks, preventing overfitting to RUL label noise.


Summary

In this section, we introduced the shared feature representation:

  1. Encoder output: 256-dimensional vector capturing sensor patterns
  2. Multi-task motivation: Regularization through complementary tasks
  3. Shared design: Same encoder, separate prediction heads
  4. Inductive bias: Forces generalizable feature learning
| Aspect | Value |
|---|---|
| Shared representation | z ∈ ℝ²⁵⁶ |
| RUL output | Scalar (cycles) |
| Health output | 3 classes (logits) |
| Sharing strategy | Hard parameter sharing |
| Single-task degradation | Up to +304% |
Looking Ahead: We have established why sharing matters. The next section designs the RUL prediction head: the MLP that transforms the 256-dim representation into a single RUL value, including the architectural choices for depth, width, and activation.

With the motivation for shared representations established, we now design the RUL prediction head.