Chapter 8

RUL Prediction Head Architecture

Dual Task Prediction Heads

Learning Objectives

By the end of this section, you will:

  1. Design the RUL prediction head as a two-layer MLP
  2. Understand dimension reduction from 256 to 1
  3. Justify activation and dropout choices
  4. Interpret the output as predicted remaining cycles
  5. Calculate parameter count for the RUL head
Why This Matters: The RUL head is the final transformation from rich 256-dimensional features to a single scalar prediction. Its design must balance expressiveness (enough capacity to capture nonlinear relationships) with simplicity (avoid overfitting when the encoder already does heavy lifting).

Head Architecture Design

The RUL prediction head is a simple two-layer MLP that transforms the shared representation into a scalar output.

Architecture Overview

πŸ“text
1Input: z ∈ ℝ²⁡⁢ (encoder output)
2            ↓
3β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4β”‚  Linear(256, 128)               β”‚
5β”‚  256 Γ— 128 + 128 = 32,896 paramsβ”‚
6β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7            ↓
8β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9β”‚  ReLU                           β”‚
10β”‚  0 parameters                   β”‚
11β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
12            ↓
13β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
14β”‚  Dropout(p=0.3)                 β”‚
15β”‚  0 parameters                   β”‚
16β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
17            ↓
18β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
19β”‚  Linear(128, 1)                 β”‚
20β”‚  128 Γ— 1 + 1 = 129 params       β”‚
21β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
22            ↓
23Output: Ε·_RUL ∈ ℝ (predicted cycles)
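The parameter counts in the diagram are simple arithmetic and can be checked with a few lines of plain Python (a quick sketch; the `linear_params` helper is illustrative, not part of the model code):

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Parameters in a fully connected layer: weight matrix plus bias vector."""
    return in_dim * out_dim + out_dim

first = linear_params(256, 128)   # 256 * 128 + 128 = 32,896
last = linear_params(128, 1)      # 128 * 1 + 1 = 129
total = first + last              # 33,025

print(first, last, total)
```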

PyTorch Implementation

RUL Prediction Head (model.py)

First Linear Layer

Projects 256-dimensional encoder output to 128 hidden features. Each hidden unit learns to detect RUL-relevant patterns.

EXAMPLE
# BEFORE: z.shape = (32, 256)
z = [[0.12, -0.45, 0.33, ..., 0.87],   # sample 1
     [0.08, -0.21, 0.56, ..., 0.92],   # sample 2
     ...
     [0.15, -0.38, 0.41, ..., 0.78]]   # sample 32

# AFTER Linear(256, 128): h.shape = (32, 128)
h = [[0.55, 0.12, ..., -0.34],   # sample 1
     [0.41, 0.28, ..., -0.19],   # sample 2
     ...
     [0.63, 0.05, ..., -0.42]]   # sample 32
ReLU Activation

Applies element-wise ReLU, zeroing negative values. Creates sparse activations where only relevant features fire.

EXAMPLE
# BEFORE ReLU: h with negative values
h = [[0.55, -0.12, 0.33, -0.45],
     [0.41, 0.28, -0.19, 0.56]]

# AFTER ReLU: negatives become 0
h = [[0.55, 0.00, 0.33, 0.00],
     [0.41, 0.28, 0.00, 0.56]]

# Sparsity: 3 of 8 values are now zero (~50% in expectation at random init)
Dropout Regularization

During training, randomly zeros 30% of hidden units. Forces network to learn redundant representations.

EXAMPLE
# Training mode: 30% randomly dropped, rest scaled
# BEFORE: a = [0.55, 0.00, 0.33, 0.00, 0.41, 0.28]

# AFTER Dropout(0.3): scale = 1/(1-0.3) = 1.43
a = [0.79, 0.00, 0.00, 0.00, 0.59, 0.00]
#    ↑scale  ↑ok   ↑drop  ↑ok   ↑scale ↑drop

# Inference mode: no dropout, no scaling
Output Linear Layer

Final projection from 128 hidden features to a single RUL scalar. No output activation is applied, so predictions are unbounded (they are clamped to a valid range at evaluation time).

EXAMPLE
# BEFORE: a.shape = (32, 128)
a = [[0.79, 0.00, 0.47, ..., 0.59],   # sample 1
     [0.58, 0.40, 0.00, ..., 0.81]]   # sample 2

# AFTER Linear(128, 1): rul.shape = (32, 1)
rul = [[47.3],    # sample 1: ~47 cycles left
       [92.1]]    # sample 2: ~92 cycles left
Forward Output

Returns RUL predictions with shape (batch, 1). Each value represents predicted cycles until failure.

EXAMPLE
# Input: encoder output z (batch, 256)
# Output: RUL predictions (batch, 1)

# Full flow for batch of 4 samples:
z.shape = (4, 256)
↓ Linear(256, 128)
h.shape = (4, 128)
↓ ReLU + Dropout
a.shape = (4, 128)
↓ Linear(128, 1)
rul.shape = (4, 1)  # [[47.3], [92.1], [15.8], [125.0]]
Complete implementation:
import torch
import torch.nn as nn


class RULHead(nn.Module):
    """
    RUL Prediction Head: 256 → 128 → 1

    Two-layer MLP that predicts remaining useful life in cycles.
    Uses ReLU activation and dropout for regularization.
    """

    def __init__(
        self,
        input_dim: int = 256,
        hidden_dim: int = 128,
        dropout: float = 0.3
    ):
        super().__init__()

        self.head = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # 256 → 128
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1)            # 128 → 1
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Args:
            z: Encoder output (batch, 256)

        Returns:
            RUL prediction (batch, 1)
        """
        return self.head(z)
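The same 256 → 128 → 1 flow can be traced in NumPy to verify the shapes at each step. This is a sketch with random stand-in weights, with dropout omitted since it is the identity at inference; it is not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the trained parameters (illustrative only)
W1, b1 = rng.standard_normal((256, 128)), np.zeros(128)   # Linear(256, 128)
w2, b2 = rng.standard_normal((128, 1)), np.zeros(1)       # Linear(128, 1)

z = rng.standard_normal((4, 256))   # batch of 4 encoder outputs

h = z @ W1 + b1                     # (4, 128) hidden representation
a = np.maximum(h, 0.0)              # ReLU; dropout is identity at inference
rul = a @ w2 + b2                   # (4, 1): one scalar RUL per sample

print(rul.shape)  # (4, 1)
```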

Design Rationale

| Choice | Rationale |
|---|---|
| Two layers | Sufficient nonlinearity without overfitting |
| Hidden dim 128 | Gradual reduction (256 → 128 → 1) |
| ReLU activation | Simple, effective, no vanishing gradient |
| Dropout 0.3 | Prevents co-adaptation of hidden units |
| No output activation | RUL is unbounded positive (could be >125) |

Layer-by-Layer Analysis

Let us trace the transformations through each layer.

Layer 1: Linear(256, 128)

The first linear layer reduces dimensionality while learning task-specific features:

$$\mathbf{h} = \mathbf{z} \mathbf{W}_1 + \mathbf{b}_1$$

Where:

  • $\mathbf{z} \in \mathbb{R}^{256}$: Encoder output
  • $\mathbf{W}_1 \in \mathbb{R}^{256 \times 128}$: Weight matrix
  • $\mathbf{b}_1 \in \mathbb{R}^{128}$: Bias vector
  • $\mathbf{h} \in \mathbb{R}^{128}$: Hidden representation

ReLU Activation

After the linear transformation, we apply ReLU:

$$\mathbf{a} = \text{ReLU}(\mathbf{h}) = \max(0, \mathbf{h})$$

ReLU provides:

  • Nonlinearity: Enables learning complex input-output mappings
  • Sparsity: Negative values become zero, creating sparse activations
  • Gradient flow: Gradient is 1 for positive inputs, avoiding vanishing gradients

Dropout Layer

Dropout randomly zeros elements during training:

$$\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$

With $p = 0.3$, approximately 30% of hidden units are dropped on each forward pass. The scaling by $1/(1-p)$ keeps the expected activation equal to its test-time value.
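The inverted-dropout rule above is easy to sketch in NumPy (the `dropout` helper here is illustrative; `nn.Dropout` implements the same scheme internally):

```python
import numpy as np

def dropout(a, p=0.3, training=True, rng=None):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    if not training:
        return a                         # identity at inference
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(a.shape) >= p      # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)

a = np.array([0.55, 0.0, 0.33, 0.0, 0.41, 0.28])
out = dropout(a, p=0.3)
# Every surviving value equals the original scaled by 1/0.7 ≈ 1.43;
# dropped values are exactly 0.
```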

Why Dropout Here?

The RUL head is relatively small (only ~33K parameters), but dropout still helps prevent the head from memorizing training examples. It forces the network to learn robust features that work even when some hidden units are missing.

Layer 2: Linear(128, 1)

The final layer produces the scalar RUL prediction:

$$\hat{y}_{\text{RUL}} = \tilde{\mathbf{a}} \cdot \mathbf{w}_2 + b_2$$

Where:

  • $\tilde{\mathbf{a}} \in \mathbb{R}^{128}$: Dropout-regularized hidden activations
  • $\mathbf{w}_2 \in \mathbb{R}^{128}$: Output weight vector
  • $b_2 \in \mathbb{R}$: Output bias
  • $\hat{y}_{\text{RUL}} \in \mathbb{R}$: Predicted RUL

Activation Function Choice

We use ReLU for the hidden layer but no activation for the output.

Why No Output Activation?

RUL is a continuous, non-negative value. Common choices and their issues:

| Activation | Range | Issue for RUL |
|---|---|---|
| None (Linear) | (-∞, +∞) | Preferred: unbounded output |
| ReLU | [0, +∞) | Could work, but adds unnecessary nonlinearity |
| Sigmoid | (0, 1) | Wrong range: RUL can be >1 |
| Softplus | (0, +∞) | Adds computation, marginal benefit |

We choose linear output because:

  • RUL can theoretically be any positive value (though capped at 125 in our labels)
  • The loss function handles the target range appropriately
  • Simpler models generalize better when expressiveness is sufficient

Handling Negative Predictions

Without output activation, the model can predict negative RUL values. In practice, we clamp predictions to [0, R_max] during evaluation. During training, allowing negative outputs helps gradient flow and the model quickly learns to avoid them.
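The evaluation-time clamp can be as simple as a `np.clip` call. A minimal sketch, where the `clamp_rul` helper and `R_MAX` constant are illustrative names (125 follows the label cap from Chapter 3):

```python
import numpy as np

R_MAX = 125.0  # label cap from the piecewise-linear RUL targets

def clamp_rul(preds):
    """Clip raw head outputs into the valid label range for evaluation."""
    return np.clip(preds, 0.0, R_MAX)

raw = np.array([[-3.2], [47.3], [140.6]])
clamped = clamp_rul(raw)   # rows become 0.0, 47.3, 125.0
```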


Output Interpretation

The RUL head output represents predicted remaining cycles until failure.

Output Semantics

πŸ“text
1Prediction: Ε·_RUL = 47.3
2
3Interpretation:
4  - The engine is predicted to fail in approximately 47 cycles
5  - Actual failure could occur earlier or later due to:
6    * Stochastic nature of degradation
7    * Sensor noise
8    * Model uncertainty
9
10Label format:
11  - Piecewise linear degradation (Chapter 3)
12  - RUL capped at 125 cycles (early life β‰ˆ constant health)


Parameter Count

| Layer | Parameters | Calculation |
|---|---|---|
| Linear(256, 128) | 32,896 | 256 × 128 + 128 |
| Linear(128, 1) | 129 | 128 × 1 + 1 |
| Total | 33,025 | Sum |

The RUL head adds approximately 33K parameters to the model, a small fraction of the total, reflecting its role as a lightweight task-specific adapter.


Summary

In this section, we designed the RUL prediction head:

  1. Architecture: Two-layer MLP (256 → 128 → 1)
  2. Activation: ReLU for hidden layer, linear output
  3. Regularization: Dropout(0.3) between layers
  4. Output: Single scalar (predicted cycles to failure)
  5. Parameters: ~33K

| Property | Value |
|---|---|
| Input dimension | 256 |
| Hidden dimension | 128 |
| Output dimension | 1 |
| Activation | ReLU (hidden), None (output) |
| Dropout rate | 0.3 |
| Total parameters | 33,025 |

Looking Ahead: The RUL head predicts exact remaining life. The next section designs the health classification head that predicts which of three degradation stages the engine is in, providing the complementary coarse localization that regularizes RUL learning.

With the RUL head designed, we now build the health classification head for discrete stage prediction.