Chapter 8

RUL Prediction Head Architecture

Dual Task Prediction Heads

Learning Objectives

By the end of this section, you will:

  1. Design the RUL prediction head as a two-layer MLP
  2. Understand dimension reduction from 256 to 1
  3. Justify activation and dropout choices
  4. Interpret the output as predicted remaining cycles
  5. Calculate parameter count for the RUL head
Why This Matters: The RUL head is the final transformation from rich 256-dimensional features to a single scalar prediction. Its design must balance expressiveness (enough capacity to capture nonlinear relationships) with simplicity (avoid overfitting when the encoder already does heavy lifting).

Head Architecture Design

The RUL prediction head is a simple two-layer MLP that transforms the shared representation into a scalar output.

Architecture Overview

πŸ“text
1Input: z ∈ ℝ²⁡⁢ (encoder output)
2            ↓
3β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4β”‚  Linear(256, 128)               β”‚
5β”‚  256 Γ— 128 + 128 = 32,896 paramsβ”‚
6β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7            ↓
8β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9β”‚  ReLU                           β”‚
10β”‚  0 parameters                   β”‚
11β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
12            ↓
13β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
14β”‚  Dropout(p=0.3)                 β”‚
15β”‚  0 parameters                   β”‚
16β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
17            ↓
18β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
19β”‚  Linear(128, 1)                 β”‚
20β”‚  128 Γ— 1 + 1 = 129 params       β”‚
21β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
22            ↓
23Output: Ε·_RUL ∈ ℝ (predicted cycles)
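The parameter counts in the diagram are simple arithmetic and can be checked with a few lines of plain Python (a quick sketch; the `linear_params` helper is illustrative, not part of the model code):

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Parameters in a fully connected layer: weight matrix plus bias vector."""
    return in_dim * out_dim + out_dim

first = linear_params(256, 128)   # 256 * 128 + 128 = 32,896
last = linear_params(128, 1)      # 128 * 1 + 1 = 129
total = first + last              # 33,025

print(first, last, total)
```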

PyTorch Implementation

RUL Prediction Head (model.py)

First Linear Layer

Projects 256-dimensional encoder output to 128 hidden features. Each hidden unit learns to detect RUL-relevant patterns.

EXAMPLE
# BEFORE: z.shape = (32, 256)
z = [[0.12, -0.45, 0.33, ..., 0.87],   # sample 1
     [0.08, -0.21, 0.56, ..., 0.92],   # sample 2
     ...
     [0.15, -0.38, 0.41, ..., 0.78]]   # sample 32

# AFTER Linear(256, 128): h.shape = (32, 128)
h = [[0.55, 0.12, ..., -0.34],   # sample 1
     [0.41, 0.28, ..., -0.19],   # sample 2
     ...
     [0.63, 0.05, ..., -0.42]]   # sample 32
ReLU Activation

Applies element-wise ReLU, zeroing negative values. Creates sparse activations where only relevant features fire.

EXAMPLE
# BEFORE ReLU: h with negative values
h = [[0.55, -0.12, 0.33, -0.45],
     [0.41, 0.28, -0.19, 0.56]]

# AFTER ReLU: negatives become 0
h = [[0.55, 0.00, 0.33, 0.00],
     [0.41, 0.28, 0.00, 0.56]]

# Sparsity: 3 of 8 values are now zero (~50% in expectation at random init)
Dropout Regularization

During training, randomly zeros 30% of hidden units. Forces network to learn redundant representations.

EXAMPLE
# Training mode: 30% randomly dropped, rest scaled
# BEFORE: a = [0.55, 0.00, 0.33, 0.00, 0.41, 0.28]

# AFTER Dropout(0.3): scale = 1/(1-0.3) = 1.43
a = [0.79, 0.00, 0.00, 0.00, 0.59, 0.00]
#    ↑scale  ↑ok   ↑drop  ↑ok   ↑scale ↑drop

# Inference mode: no dropout, no scaling
Output Linear Layer

Final projection from 128 hidden features to a single RUL scalar. No output activation is applied, so predictions are unbounded (they are clamped to a valid range at evaluation time).

EXAMPLE
# BEFORE: a.shape = (32, 128)
a = [[0.79, 0.00, 0.47, ..., 0.59],   # sample 1
     [0.58, 0.40, 0.00, ..., 0.81]]   # sample 2

# AFTER Linear(128, 1): rul.shape = (32, 1)
rul = [[47.3],    # sample 1: ~47 cycles left
       [92.1]]    # sample 2: ~92 cycles left
Forward Output

Returns RUL predictions with shape (batch, 1). Each value represents predicted cycles until failure.

EXAMPLE
# Input: encoder output z (batch, 256)
# Output: RUL predictions (batch, 1)

# Full flow for batch of 4 samples:
z.shape = (4, 256)
↓ Linear(256, 128)
h.shape = (4, 128)
↓ ReLU + Dropout
a.shape = (4, 128)
↓ Linear(128, 1)
rul.shape = (4, 1)  # [[47.3], [92.1], [15.8], [125.0]]
Complete implementation:
import torch
import torch.nn as nn


class RULHead(nn.Module):
    """
    RUL Prediction Head: 256 → 128 → 1

    Two-layer MLP that predicts remaining useful life in cycles.
    Uses ReLU activation and dropout for regularization.
    """

    def __init__(
        self,
        input_dim: int = 256,
        hidden_dim: int = 128,
        dropout: float = 0.3
    ):
        super().__init__()

        self.head = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # 256 → 128
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1)            # 128 → 1
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Args:
            z: Encoder output (batch, 256)

        Returns:
            RUL prediction (batch, 1)
        """
        return self.head(z)
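The same 256 → 128 → 1 flow can be traced in NumPy to verify the shapes at each step. This is a sketch with random stand-in weights, with dropout omitted since it is the identity at inference; it is not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the trained parameters (illustrative only)
W1, b1 = rng.standard_normal((256, 128)), np.zeros(128)   # Linear(256, 128)
w2, b2 = rng.standard_normal((128, 1)), np.zeros(1)       # Linear(128, 1)

z = rng.standard_normal((4, 256))   # batch of 4 encoder outputs

h = z @ W1 + b1                     # (4, 128) hidden representation
a = np.maximum(h, 0.0)              # ReLU; dropout is identity at inference
rul = a @ w2 + b2                   # (4, 1): one scalar RUL per sample

print(rul.shape)  # (4, 1)
```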

Design Rationale

| Choice | Rationale |
|---|---|
| Two layers | Sufficient nonlinearity without overfitting |
| Hidden dim 128 | Gradual reduction (256 → 128 → 1) |
| ReLU activation | Simple, effective, no vanishing gradient |
| Dropout 0.3 | Prevents co-adaptation of hidden units |
| No output activation | RUL is unbounded positive (could be >125) |

Layer-by-Layer Analysis

Let us trace the transformations through each layer.

Layer 1: Linear(256, 128)

The first linear layer reduces dimensionality while learning task-specific features:

$$\mathbf{h} = \mathbf{z} \mathbf{W}_1 + \mathbf{b}_1$$

Where:

  • $\mathbf{z} \in \mathbb{R}^{256}$: Encoder output
  • $\mathbf{W}_1 \in \mathbb{R}^{256 \times 128}$: Weight matrix
  • $\mathbf{b}_1 \in \mathbb{R}^{128}$: Bias vector
  • $\mathbf{h} \in \mathbb{R}^{128}$: Hidden representation

ReLU Activation

After the linear transformation, we apply ReLU:

$$\mathbf{a} = \text{ReLU}(\mathbf{h}) = \max(0, \mathbf{h})$$

ReLU provides:

  • Nonlinearity: Enables learning complex input-output mappings
  • Sparsity: Negative values become zero, creating sparse activations
  • Gradient flow: Gradient is 1 for positive inputs, avoiding vanishing gradients

Dropout Layer

Dropout randomly zeros elements during training:

$$\tilde{a}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$

With $p = 0.3$, approximately 30% of hidden units are dropped on each forward pass. The scaling by $1/(1-p)$ keeps the expected activation equal to its test-time value.
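The inverted-dropout rule above is easy to sketch in NumPy (the `dropout` helper here is illustrative; `nn.Dropout` implements the same scheme internally):

```python
import numpy as np

def dropout(a, p=0.3, training=True, rng=None):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    if not training:
        return a                         # identity at inference
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(a.shape) >= p      # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)

a = np.array([0.55, 0.0, 0.33, 0.0, 0.41, 0.28])
out = dropout(a, p=0.3)
# Every surviving value equals the original scaled by 1/0.7 ≈ 1.43;
# dropped values are exactly 0.
```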

Why Dropout Here?

The RUL head is relatively small (only ~33K parameters), but dropout still helps prevent the head from memorizing training examples. It forces the network to learn robust features that work even when some hidden units are missing.

Layer 2: Linear(128, 1)

The final layer produces the scalar RUL prediction:

$$\hat{y}_{\text{RUL}} = \tilde{\mathbf{a}} \cdot \mathbf{w}_2 + b_2$$

Where:

  • $\tilde{\mathbf{a}} \in \mathbb{R}^{128}$: Dropout-regularized hidden activations
  • $\mathbf{w}_2 \in \mathbb{R}^{128}$: Output weight vector
  • $b_2 \in \mathbb{R}$: Output bias
  • $\hat{y}_{\text{RUL}} \in \mathbb{R}$: Predicted RUL

Activation Function Choice

We use ReLU for the hidden layer but no activation for the output.

Why No Output Activation?

RUL is a continuous, non-negative value. Common choices and their issues:

| Activation | Range | Issue for RUL |
|---|---|---|
| None (Linear) | (-∞, +∞) | Preferred: unbounded output |
| ReLU | [0, +∞) | Could work, but adds unnecessary nonlinearity |
| Sigmoid | (0, 1) | Wrong range: RUL can be >1 |
| Softplus | (0, +∞) | Adds computation, marginal benefit |

We choose linear output because:

  • RUL can theoretically be any positive value (though capped at 125 in our labels)
  • The loss function handles the target range appropriately
  • Simpler models generalize better when expressiveness is sufficient

Handling Negative Predictions

Without output activation, the model can predict negative RUL values. In practice, we clamp predictions to [0, R_max] during evaluation. During training, allowing negative outputs helps gradient flow and the model quickly learns to avoid them.
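The evaluation-time clamp can be as simple as a `np.clip` call. A minimal sketch, where the `clamp_rul` helper and `R_MAX` constant are illustrative names (125 follows the label cap from Chapter 3):

```python
import numpy as np

R_MAX = 125.0  # label cap from the piecewise-linear RUL targets

def clamp_rul(preds):
    """Clip raw head outputs into the valid label range for evaluation."""
    return np.clip(preds, 0.0, R_MAX)

raw = np.array([[-3.2], [47.3], [140.6]])
clamped = clamp_rul(raw)   # rows become 0.0, 47.3, 125.0
```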


Output Interpretation

The RUL head output represents predicted remaining cycles until failure.

Output Semantics

πŸ“text
1Prediction: Ε·_RUL = 47.3
2
3Interpretation:
4  - The engine is predicted to fail in approximately 47 cycles
5  - Actual failure could occur earlier or later due to:
6    * Stochastic nature of degradation
7    * Sensor noise
8    * Model uncertainty
9
10Label format:
11  - Piecewise linear degradation (Chapter 3)
12  - RUL capped at 125 cycles (early life β‰ˆ constant health)


Parameter Count

| Layer | Parameters | Calculation |
|---|---|---|
| Linear(256, 128) | 32,896 | 256 × 128 + 128 |
| Linear(128, 1) | 129 | 128 × 1 + 1 |
| Total | 33,025 | Sum |

The RUL head adds approximately 33K parameters to the model, a small fraction of the total, reflecting its role as a lightweight task-specific adapter.


Summary

In this section, we designed the RUL prediction head:

  1. Architecture: Two-layer MLP (256 → 128 → 1)
  2. Activation: ReLU for hidden layer, linear output
  3. Regularization: Dropout(0.3) between layers
  4. Output: Single scalar (predicted cycles to failure)
  5. Parameters: ~33K

| Property | Value |
|---|---|
| Input dimension | 256 |
| Hidden dimension | 128 |
| Output dimension | 1 |
| Activation | ReLU (hidden), None (output) |
| Dropout rate | 0.3 |
| Total parameters | 33,025 |

Looking Ahead: The RUL head predicts exact remaining life. The next section designs the health classification head that predicts which of three degradation stages the engine is in, providing the complementary coarse localization that regularizes RUL learning.

With the RUL head designed, we now build the health classification head for discrete stage prediction.