Learning Objectives
By the end of this section, you will:
- Design the RUL prediction head as a two-layer MLP
- Understand dimension reduction from 256 to 1
- Justify activation and dropout choices
- Interpret the output as predicted remaining cycles
- Calculate parameter count for the RUL head
Why This Matters: The RUL head is the final transformation from rich 256-dimensional features to a single scalar prediction. Its design must balance expressiveness (enough capacity to capture nonlinear relationships) with simplicity (avoid overfitting when the encoder already does heavy lifting).
Head Architecture Design
The RUL prediction head is a simple two-layer MLP that transforms the shared representation into a scalar output.
Architecture Overview
Input: z ∈ ℝ²⁵⁶ (encoder output)
         │
┌──────────────────────────────────┐
│ Linear(256, 128)                 │
│ 256 × 128 + 128 = 32,896 params  │
└──────────────────────────────────┘
         │
┌──────────────────────────────────┐
│ ReLU                             │
│ 0 parameters                     │
└──────────────────────────────────┘
         │
┌──────────────────────────────────┐
│ Dropout(p=0.3)                   │
│ 0 parameters                     │
└──────────────────────────────────┘
         │
┌──────────────────────────────────┐
│ Linear(128, 1)                   │
│ 128 × 1 + 1 = 129 params         │
└──────────────────────────────────┘
         │
Output: ŷ_RUL ∈ ℝ (predicted cycles)

PyTorch Implementation
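A minimal sketch of the head as an `nn.Sequential`, matching the diagram above (the variable names are illustrative, not from the original codebase):

```python
import torch
import torch.nn as nn

# Two-layer MLP head: 256 -> 128 -> 1
rul_head = nn.Sequential(
    nn.Linear(256, 128),  # 256*128 + 128 = 32,896 params
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(128, 1),    # 128*1 + 1 = 129 params
)

z = torch.randn(32, 256)         # a batch of 32 encoder outputs
y_hat = rul_head(z).squeeze(-1)  # shape: (32,) — one scalar RUL per engine
```

The `squeeze(-1)` collapses the trailing singleton dimension so predictions align with scalar RUL labels during loss computation.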
Design Rationale
| Choice | Rationale |
|---|---|
| Two layers | Sufficient nonlinearity without overfitting |
| Hidden dim 128 | Gradual reduction (256 → 128 → 1) |
| ReLU activation | Simple, effective, no vanishing gradient |
| Dropout 0.3 | Prevent co-adaptation of hidden units |
| No output activation | RUL is unbounded positive (could be >125) |
Layer-by-Layer Analysis
Let us trace the transformations through each layer.
Layer 1: Linear(256, 128)
The first linear layer reduces dimensionality while learning task-specific features:

h = W₁z + b₁

Where:
- z ∈ ℝ²⁵⁶: Encoder output
- W₁ ∈ ℝ^{128×256}: Weight matrix
- b₁ ∈ ℝ¹²⁸: Bias vector
- h ∈ ℝ¹²⁸: Hidden representation
ReLU Activation
After the linear transformation, we apply ReLU elementwise:

a = ReLU(h) = max(0, h)
ReLU provides:
- Nonlinearity: Enables learning complex input-output mappings
- Sparsity: Negative values become zero, creating sparse activations
- Gradient flow: Gradient is 1 for positive inputs, avoiding vanishing gradients
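The three properties above can be checked directly on a small tensor (a sketch; the values are chosen only for illustration):

```python
import torch

h = torch.tensor([-2.0, -0.5, 0.0, 1.5], requires_grad=True)
a = torch.relu(h)   # negatives (and 0) map to 0 -> sparse activations
a.sum().backward()  # backprop a gradient of 1 through each output

# a       == [0.0, 0.0, 0.0, 1.5]  (nonlinearity + sparsity)
# h.grad  == [0.0, 0.0, 0.0, 1.0]  (gradient is exactly 1 where h > 0)
```

Note that PyTorch assigns gradient 0 at exactly h = 0, consistent with the subgradient convention.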
Dropout Layer
Dropout randomly zeros elements during training:

h̃ᵢ = (mᵢ / (1 − p)) · aᵢ,  where mᵢ ~ Bernoulli(1 − p)

With p = 0.3, approximately 30% of hidden units are dropped each forward pass. The scaling by 1/(1 − p) ensures expected values match at test time.
Why Dropout Here?
The RUL head is relatively small (only ~33K parameters), but dropout still helps prevent the head from memorizing training examples. It forces the network to learn robust features that work even when some hidden units are missing.
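The inverted-dropout scaling can be observed directly: in training mode surviving units are scaled up by 1/(1 − p), while in eval mode dropout is the identity (a small sketch with an all-ones input):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
a = torch.ones(8)

drop.train()
h_train = drop(a)  # each entry is either 0.0 or 1/(1-0.3) ≈ 1.4286
drop.eval()
h_eval = drop(a)   # identity at test time: expected values already match
```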
Layer 2: Linear(128, 1)
The final layer produces the scalar RUL prediction:

ŷ_RUL = w₂ᵀh̃ + b₂

Where:
- h̃ ∈ ℝ¹²⁸: Dropout-regularized hidden activations
- w₂ ∈ ℝ¹²⁸: Output weight vector
- b₂ ∈ ℝ: Output bias
- ŷ_RUL ∈ ℝ: Predicted RUL
Activation Function Choice
We use ReLU for the hidden layer but no activation for the output.
Why No Output Activation?
RUL is a continuous, non-negative value. Common choices and their issues:
| Activation | Range | Issue for RUL |
|---|---|---|
| None (Linear) | (−∞, +∞) | Preferred: unbounded output |
| ReLU | [0, +β) | Could work, but adds nonlinearity unnecessarily |
| Sigmoid | (0, 1) | Wrong range: RUL can be >1 |
| Softplus | (0, +β) | Adds computation, marginal benefit |
We choose linear output because:
- RUL can theoretically be any positive value (though capped at 125 in our labels)
- The loss function handles the target range appropriately
- Simpler models generalize better when expressiveness is sufficient
Handling Negative Predictions
Without output activation, the model can predict negative RUL values. In practice, we clamp predictions to [0, R_max] during evaluation. During training, allowing negative outputs helps gradient flow and the model quickly learns to avoid them.
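The evaluation-time clamp is a one-liner (a sketch; `R_MAX = 125` follows the label cap from Chapter 3, and the sample predictions are invented for illustration):

```python
import torch

R_MAX = 125.0  # label cap (Chapter 3)
preds = torch.tensor([-3.2, 47.3, 130.8])

# Clamp only at evaluation: negatives -> 0, values above the cap -> R_MAX
clamped = preds.clamp(0.0, R_MAX)
```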
Output Interpretation
The RUL head output represents predicted remaining cycles until failure.
Output Semantics
Prediction: ŷ_RUL = 47.3

Interpretation:
- The engine is predicted to fail in approximately 47 cycles
- Actual failure could occur earlier or later due to:
  * Stochastic nature of degradation
  * Sensor noise
  * Model uncertainty

Label format:
- Piecewise linear degradation (Chapter 3)
- RUL capped at 125 cycles (early life → constant health)

Prediction vs Label Relationship
Parameter Count
| Layer | Parameters | Calculation |
|---|---|---|
| Linear(256, 128) | 32,896 | 256 × 128 + 128 |
| Linear(128, 1) | 129 | 128 × 1 + 1 |
| Total | 33,025 | Sum |
The RUL head adds approximately 33K parameters to the modelβa small fraction of the total, reflecting its role as a lightweight task-specific adapter.
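The table's arithmetic can be verified in a few lines of plain Python (the helper name is illustrative):

```python
# Verify the parameter counts in the table above
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

layer1 = linear_params(256, 128)  # 32,896
layer2 = linear_params(128, 1)    # 129
total = layer1 + layer2           # 33,025
```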
Summary
In this section, we designed the RUL prediction head:
- Architecture: Two-layer MLP (256 → 128 → 1)
- Activation: ReLU for hidden layer, linear output
- Regularization: Dropout(0.3) between layers
- Output: Single scalar (predicted cycles to failure)
- Parameters: ~33K
| Property | Value |
|---|---|
| Input dimension | 256 |
| Hidden dimension | 128 |
| Output dimension | 1 |
| Activation | ReLU (hidden), None (output) |
| Dropout rate | 0.3 |
| Total parameters | 33,025 |
Looking Ahead: The RUL head predicts exact remaining life. The next section designs the health classification head that predicts which of three degradation stages the engine is inβproviding the complementary coarse localization that regularizes RUL learning.
With the RUL head designed, we now build the health classification head for discrete stage prediction.