AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the single-task ablation experiment design
Analyze the catastrophic +304.7% degradation when removing health classification
Interpret why single-task RUL prediction fails on complex datasets
Connect overfitting patterns to the regularization hypothesis
Understand the importance of auxiliary tasks in multi-task learning

Critical Finding: Removing the health classification task and training with RUL prediction only increases RMSE on FD002 by 304.7% (27.27 vs. 6.74). This catastrophic failure provides the strongest evidence for AMNL's regularization hypothesis.

Single-Task Baseline

The single-task ablation removes the health classification head and trains the model using only RUL prediction loss.

Experimental Design

Configuration	RUL Task	Health Task	Description
AMNL (0.5/0.5)	✓	✓	Full dual-task architecture
Single-Task RUL	✓	✗	RUL prediction only

Architecture Comparison

The single-task model uses the same CNN-BiLSTM-Attention encoder but removes the health classification head:

$$\text{Single-Task}: \mathcal{L} = \mathcal{L}_{\text{RUL}}$$

$$\text{Dual-Task}: \mathcal{L} = 0.5 \cdot \mathcal{L}_{\text{RUL}} + 0.5 \cdot \mathcal{L}_{\text{Health}}$$
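The two objectives differ only in their task weights. The following is a minimal pure-Python sketch, assuming mean squared error for the RUL loss and cross-entropy for the health loss (the section does not spell out either loss, so both are assumptions). The function names `mse`, `cross_entropy`, and `total_loss` are illustrative only.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, standing in for L_RUL (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy over a batch, standing in for L_Health (an assumption)."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[label]
    return total / len(labels)


def total_loss(rul_loss: float, health_loss: float,
               w_rul: float = 0.5, w_health: float = 0.5) -> float:
    """Weighted sum of the two task losses."""
    return w_rul * rul_loss + w_health * health_loss


rul_loss = mse([100.0, 40.0], [110.0, 50.0])           # toy RUL predictions
health_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])    # uniform logits, 3 classes
dual = total_loss(rul_loss, health_loss)               # AMNL weighting (0.5/0.5)
single = total_loss(rul_loss, health_loss, 1.0, 0.0)   # ablation: RUL term only
```

Setting `w_health` to zero recovers the single-task objective, which is why the ablation isolates the contribution of the health term.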

Component	AMNL	Single-Task
CNN Feature Extractor	✓	✓
BiLSTM Temporal Encoder	✓	✓
Multi-Head Attention	✓	✓
RUL Prediction Head	✓	✓
Health Classification Head	✓	✗
Total Parameters	~2.3M	~2.2M

Minimal Architecture Difference

The health classification head adds only ~100K parameters (about 4% of the total). A capacity difference this small is unlikely to explain the performance gap; the more plausible explanation is the auxiliary task's regularization effect.

Catastrophic Failure

The single-task model exhibits catastrophic performance degradation on complex multi-condition datasets.

Per-Dataset Results

Dataset	AMNL RMSE	Single-Task RMSE	Degradation
FD001	10.43	15.63	+49.9%
FD002	6.74	27.27	+304.7%
FD003	9.51	17.62	+85.3%
FD004	8.16	25.41	+211.4%
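The Degradation column can be recomputed from the two RMSE columns. The sketch below assumes degradation is the relative RMSE increase over AMNL, which matches the formula used in the experiment runner later in this section. The dictionary name `RESULTS` and the helper `degradation_pct` are illustrative.

```python
# Values copied from the Per-Dataset Results table:
# (AMNL RMSE, single-task RMSE, reported degradation in %).
RESULTS = {
    "FD001": (10.43, 15.63, 49.9),
    "FD002": (6.74, 27.27, 304.7),
    "FD003": (9.51, 17.62, 85.3),
    "FD004": (8.16, 25.41, 211.4),
}


def degradation_pct(baseline_rmse: float, ablated_rmse: float) -> float:
    """Relative RMSE increase of the ablated model over the baseline, in percent."""
    return (ablated_rmse - baseline_rmse) / baseline_rmse * 100.0


for name, (amnl, single, reported) in RESULTS.items():
    computed = degradation_pct(amnl, single)
    print(f"{name}: computed {computed:+.1f}%, reported {reported:+.1f}%")
```

Recomputing from the rounded table values should agree with the reported figures to within about 0.1 percentage point. A small residual of that size is what one would expect if the reported percentages were derived from unrounded RMSE values.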

Catastrophic on Multi-Condition Data

The single-task model's RMSE on FD002 (27.27) is worse than a trivial baseline: always predicting the mean RUL yields an RMSE of about 25. The features the model has learned generalize so poorly that they actively hurt accuracy on unseen engines.
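The constant-prediction baseline mentioned above has a simple interpretation: the RMSE of always predicting the mean equals the population standard deviation of the targets. The sketch below illustrates this with made-up RUL values; the `targets` list is hypothetical and is not C-MAPSS data.

```python
import math
from statistics import fmean, pstdev


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target)))


# Hypothetical RUL targets, for illustration only (not C-MAPSS data).
targets = [5.0, 20.0, 45.0, 70.0, 95.0, 125.0]

constant = fmean(targets)                            # the "predict the mean" model
baseline = rmse([constant] * len(targets), targets)  # its RMSE
spread = pstdev(targets)                             # population std of the targets
print(f"constant-mean RMSE = {baseline:.2f}, target std = {spread:.2f}")
```

A model that scores above this baseline is doing worse than one that ignores its inputs entirely.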

Detailed FD002 Analysis

Training Dynamics Comparison

Metric	AMNL	Single-Task	Interpretation
Train Loss (final)	0.023	0.018	Single-task overfits more
Validation RMSE	6.74	27.27	Massive generalization gap
Train/Val Gap	Small	Very Large	Clear overfitting signal
Epochs to Best	219	156	Early peak, then degradation

Overfitting Signature

The single-task model achieves lower training loss but much higher validation error. This classic overfitting pattern indicates that the model memorizes training data rather than learning generalizable features.

NASA Score Impact

Dataset	AMNL NASA	Single-Task NASA	Degradation
FD001	434.3	892.1	+105.4%
FD002	356.0	3,247.8	+812.3%
FD003	338.9	1,023.4	+202.0%
FD004	537.5	4,891.2	+810.0%

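The NASA Score degradation is even more severe than the RMSE degradation. The score grows exponentially with error size and penalizes late predictions (predicted RUL above the true RUL) more heavily than early ones, so the single-task model's large errors, and any overestimates of RUL in particular, translate into catastrophic scoring penalties.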
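For reference, here is a minimal sketch of the scoring function, assuming the section uses the standard C-MAPSS (PHM08) definition with time constants 13 for early and 10 for late predictions. The text does not restate the formula, so treat the constants as an assumption.

```python
import math
from typing import Sequence


def nasa_score(pred_rul: Sequence[float], true_rul: Sequence[float]) -> float:
    """Asymmetric exponential score summed over engines; lower is better.

    d = predicted - true. d < 0 is an early prediction, d >= 0 a late one.
    """
    score = 0.0
    for p, t in zip(pred_rul, true_rul):
        d = p - t
        if d < 0:
            score += math.exp(-d / 13.0) - 1.0  # early: milder penalty
        else:
            score += math.exp(d / 10.0) - 1.0   # late: harsher penalty
    return score


early = nasa_score([90.0], [100.0])   # predicts failure 10 cycles too soon
late = nasa_score([110.0], [100.0])   # predicts failure 10 cycles too late
print(f"early: {early:.3f}, late: {late:.3f}")
```

Because the penalty is exponential in the error, a handful of badly mispredicted engines can dominate the total, which is one reason the score can degrade far more than RMSE.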

Why Single-Task Fails

Understanding the mechanisms behind single-task failure illuminates AMNL's success.

The Overfitting Hypothesis

Without the health classification task, the model has no constraint on what features it learns. It can exploit any pattern in the training data that correlates with RUL—including spurious correlations that don't generalize.

$$\text{Sensor}_{i}(t) = f(\text{Degradation}) + g(\text{Condition}) + h(\text{Engine-specific}) + \epsilon$$

$f(\text{Degradation})$ : True degradation signal (generalizes)
$g(\text{Condition})$ : Operating condition effects (must be factored out)
$h(\text{Engine-specific})$ : Individual engine quirks (noise)

The single-task model can minimize training loss by learning $g + h$ instead of $f$ , since all three correlate with RUL in the training set.
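A toy simulation of the decomposition can make the risk concrete. Every scale below is hypothetical, chosen only so that the condition term is visibly larger than the degradation trend; nothing here is fitted to C-MAPSS.

```python
import random
from statistics import pvariance

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# All magnitudes are hypothetical and purely illustrative.
CONDITION_OFFSETS = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]  # g: six operating regimes
ENGINE_OFFSET = rng.gauss(0.0, 1.0)                      # h: one engine's quirk

cycles = range(120)
f = [0.05 * t for t in cycles]                    # slow degradation trend
g = [CONDITION_OFFSETS[t % 6] for t in cycles]    # regime changes every cycle
h = [ENGINE_OFFSET for _ in cycles]               # constant per-engine bias
eps = [rng.gauss(0.0, 0.2) for _ in cycles]       # measurement noise

# Sensor reading as the sum of the four terms in the equation above.
sensor = [a + b + c + d for a, b, c, d in zip(f, g, h, eps)]

print(f"var(f) = {pvariance(f):.2f}, var(g) = {pvariance(g):.2f}")
```

In this toy setup the condition term accounts for far more raw variance than the degradation trend, which is one way to see why an unconstrained model might fit $g$ rather than $f$.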

Health Task as Regularizer

The health classification task constrains what features the encoder can learn:

Discrete supervision: Health states (Healthy, Degrading, Critical) are defined by RUL thresholds, forcing the model to learn RUL-predictive features
Condition-invariant targets: Health labels don't depend on operating condition—an engine at RUL=10 is "Critical" regardless of altitude
Smooth decision boundaries: Classification loss encourages features that cleanly separate health states
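The first two points can be sketched as a labeling function. The thresholds below are hypothetical: the section names the three states but does not give the cut-offs, beyond implying that RUL=10 falls in "Critical".

```python
from typing import List

# Hypothetical thresholds; the section does not state the actual cut-offs.
CRITICAL_BELOW = 30.0
DEGRADING_BELOW = 80.0
STATES = ["Healthy", "Degrading", "Critical"]


def health_label(rul: float) -> int:
    """Map an RUL value to a class index into STATES."""
    if rul < CRITICAL_BELOW:
        return 2
    if rul < DEGRADING_BELOW:
        return 1
    return 0


def health_labels(ruls: List[float]) -> List[int]:
    """Vector version: one class index per RUL value."""
    return [health_label(r) for r in ruls]


labels = health_labels([120.0, 60.0, 10.0])
print([STATES[i] for i in labels])
```

The function takes only the RUL as input, with no operating-condition argument, so the resulting targets are condition-invariant by construction.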

Condition-Invariance Requirement

The single-task failure is most severe on FD002 and FD004, each of which spans 6 operating conditions:

Dataset	Conditions	Single-Task Degradation	Explanation
FD001	1	+49.9%	Less condition variation to exploit
FD002	6	+304.7%	Overfits to condition-specific patterns
FD003	1	+85.3%	Multiple fault modes add complexity
FD004	6	+211.4%	Six conditions plus multiple fault modes; severe overfitting

Pattern Confirmation

The severity of the single-task failure tracks dataset complexity. On the multi-condition datasets (FD002, FD004), single-task RMSE is 3.1-4.0× the AMNL RMSE; on the single-condition datasets (FD001, FD003) it is 1.5-1.9×. This pattern is consistent with the health task helping the encoder learn condition-invariant features.
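Two related quantities are easy to conflate here: the relative increase in RMSE (the Degradation column) and the ratio of single-task to AMNL RMSE, which is one plus that increase. The sketch below converts the reported percentages to ratios and groups them by number of operating conditions. The names `DEGRADATION` and `rmse_ratio` are illustrative.

```python
from statistics import fmean

# (operating conditions, reported degradation in %) from the table above.
DEGRADATION = {
    "FD001": (1, 49.9),
    "FD002": (6, 304.7),
    "FD003": (1, 85.3),
    "FD004": (6, 211.4),
}


def rmse_ratio(degradation_pct: float) -> float:
    """Single-task RMSE expressed as a multiple of AMNL RMSE."""
    return 1.0 + degradation_pct / 100.0


# Group the ratios by the number of operating conditions.
by_conditions = {}
for name, (n_cond, pct) in DEGRADATION.items():
    by_conditions.setdefault(n_cond, []).append(rmse_ratio(pct))

for n_cond, ratios in sorted(by_conditions.items()):
    rounded = [round(r, 2) for r in ratios]
    print(f"{n_cond} condition(s): ratios {rounded}, mean {fmean(ratios):.2f}x")
```

With only four datasets this is a descriptive comparison rather than a statistical test, but the two groups do not overlap.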

Implementation

Code for the single-task ablation experiment.

Single-Task Model Definition

🐍python

import torch
import torch.nn as nn


class SingleTaskRULModel(nn.Module):
    """
    Single-task RUL prediction model (for ablation).

    Same architecture as AMNL but without health classification head.
    """

    def __init__(
        self,
        input_size: int = 17,
        sequence_length: int = 30,  # not used by any layer below
        hidden_size: int = 256,
        dropout: float = 0.2,
        use_attention: bool = True
    ):
        super().__init__()

        # CNN Feature Extractor (same as AMNL)
        self.cnn = nn.Sequential(
            nn.Conv1d(input_size, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )

        # BiLSTM Temporal Encoder (same as AMNL)
        self.lstm = nn.LSTM(
            input_size=128,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
            dropout=dropout
        )

        # Multi-Head Attention (same as AMNL)
        # embed_dim is 2 * hidden_size because the LSTM is bidirectional
        self.use_attention = use_attention
        if use_attention:
            self.attention = nn.MultiheadAttention(
                embed_dim=hidden_size * 2,
                num_heads=8,
                dropout=dropout,
                batch_first=True
            )

        # RUL Prediction Head ONLY (no health head)
        self.rul_head = nn.Sequential(
            nn.Linear(hidden_size * 2, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1)
        )

        # NOTE: No health_head - this is the ablation!

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CNN: [batch, seq, features] -> [batch, features, seq]
        x = x.transpose(1, 2)
        x = self.cnn(x)
        x = x.transpose(1, 2)  # Back to [batch, seq, features]

        # BiLSTM
        lstm_out, _ = self.lstm(x)

        # Self-attention over the LSTM outputs
        if self.use_attention:
            attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
            lstm_out = lstm_out + attn_out  # Residual

        # Pool and predict RUL
        pooled = lstm_out[:, -1, :]  # Take last timestep
        rul_pred = self.rul_head(pooled)  # [batch, 1]

        # Return ONLY RUL prediction (no health logits)
        return rul_pred

Single-Task Training Loop

🐍python

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def train_single_task(
    model: SingleTaskRULModel,
    train_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    use_weighted_mse: bool = True
) -> float:
    """
    Train single-task model for one epoch.

    Note: No health labels, no health loss, no dual-task balancing.
    Assumes each batch_rul tensor has shape [batch].
    """
    model.train()
    total_loss = 0.0

    for batch_x, batch_rul in train_loader:
        batch_x = batch_x.to(device)
        batch_rul = batch_rul.to(device)

        optimizer.zero_grad()

        # Forward pass - single output, [batch, 1] -> [batch]
        # squeeze(-1) keeps the batch dimension even when batch size is 1
        rul_pred = model(batch_x).squeeze(-1)

        # Loss - RUL only (no health component)
        if use_weighted_mse:
            # Weight rises linearly from 1.0 (RUL >= 125) to 2.0 (RUL = 0)
            weights = 1.0 + (125.0 - batch_rul.clamp(0, 125)) / 125.0
            loss = (weights * (rul_pred - batch_rul) ** 2).mean()
        else:
            loss = F.mse_loss(rul_pred, batch_rul)

        # Backward pass with gradient clipping
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()

    # Mean loss per batch
    return total_loss / len(train_loader)
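The weighted-MSE branch of the training loop scales each squared error by a factor that depends on the true RUL. The sketch below restates that formula in plain Python so the weights can be inspected without PyTorch. The helper name `rul_weight` is illustrative; the cap of 125 is taken from the loop above.

```python
def rul_weight(rul: float, cap: float = 125.0) -> float:
    """Per-sample weight: 1 + (cap - clamp(rul, 0, cap)) / cap."""
    clamped = min(max(rul, 0.0), cap)
    return 1.0 + (cap - clamped) / cap


for rul in (0.0, 62.5, 125.0, 200.0):
    print(f"RUL={rul:6.1f} -> weight {rul_weight(rul):.2f}")
```

Errors on engines close to failure therefore count up to twice as much as errors on healthy engines, and any RUL at or above the cap receives the minimum weight of 1.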

Ablation Experiment Runner

🐍python

from typing import Dict, List, Optional

import numpy as np


def run_dual_task_ablation(
    datasets: Optional[List[str]] = None,
    seeds: Optional[List[int]] = None,
    epochs: int = 300
) -> Dict:
    """
    Compare dual-task AMNL vs single-task RUL.

    Returns results for statistical comparison.

    DualTaskAMNL, train_amnl, and train_single_task_model are assumed to be
    defined elsewhere; the two training functions are expected to return a
    dict with 'rmse' and 'nasa_score' keys.
    """
    # Use None defaults to avoid shared mutable default arguments
    if datasets is None:
        datasets = ['FD001', 'FD002', 'FD003', 'FD004']
    if seeds is None:
        seeds = [42, 123, 456]

    results = {
        'dual_task': [],
        'single_task': []
    }

    for dataset in datasets:
        for seed in seeds:
            print(f"Dataset: {dataset}, Seed: {seed}")

            # Train dual-task AMNL (constructor arguments elided in the original)
            amnl_model = DualTaskAMNL(...)
            amnl_result = train_amnl(amnl_model, dataset, seed, epochs)
            results['dual_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': amnl_result['rmse'],
                'nasa_score': amnl_result['nasa_score']
            })

            # Train single-task (ablation); arguments elided in the original
            single_model = SingleTaskRULModel(...)
            single_result = train_single_task_model(
                single_model, dataset, seed, epochs
            )
            results['single_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': single_result['rmse'],
                'nasa_score': single_result['nasa_score']
            })

    # Calculate degradation from the seed-averaged RMSE per dataset
    for ds in datasets:
        dual_rmse = np.mean([
            r['rmse'] for r in results['dual_task']
            if r['dataset'] == ds
        ])
        single_rmse = np.mean([
            r['rmse'] for r in results['single_task']
            if r['dataset'] == ds
        ])
        degradation = (single_rmse - dual_rmse) / dual_rmse * 100
        print(f"{ds}: Dual={dual_rmse:.2f}, Single={single_rmse:.2f}, "
              f"Degradation={degradation:+.1f}%")

    return results
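The runner returns one record per dataset and seed. A natural next step is to aggregate those records before comparing configurations. The helper below is a hypothetical sketch of that aggregation using only the standard library; the name `summarize` is an assumption, and the toy records are placeholders, not experimental results.

```python
from collections import defaultdict
from statistics import fmean, stdev
from typing import Dict, List, Tuple


def summarize(runs: List[dict], metric: str = "rmse") -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of `metric` per dataset across seeds."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["dataset"]].append(run[metric])
    return {
        ds: (fmean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for ds, vals in grouped.items()
    }


# Placeholder numbers, NOT experimental results: they only exercise the function.
toy_runs = [
    {"dataset": "FD001", "seed": 42, "rmse": 1.0, "nasa_score": 10.0},
    {"dataset": "FD001", "seed": 123, "rmse": 2.0, "nasa_score": 20.0},
    {"dataset": "FD001", "seed": 456, "rmse": 3.0, "nasa_score": 30.0},
]
summary = summarize(toy_runs)
print(summary)
```

Applied to `results['dual_task']` and `results['single_task']`, this would give a per-dataset mean and spread for each configuration, which indicates whether a gap is large relative to seed-to-seed variation.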

Summary

Dual-Task vs Single-Task Summary:

Catastrophic failure: Single-task RUL shows +304.7% degradation on FD002
Overfitting confirmed: Lower training loss but much higher validation error
Condition-dependent severity: On 6-condition datasets, single-task RMSE is 3-4× the AMNL RMSE, versus 1.5-1.9× on single-condition datasets
Health task essential: Not merely helpful but critical for generalization
Regularization validated: Results strongly support the regularization hypothesis

Key Finding	Value
Maximum degradation	+304.7% (FD002)
Minimum degradation	+49.9% (FD001)
Average degradation	+162.8% (all datasets)
NASA Score degradation	Up to +812.3%

Key Insight: The +304.7% degradation when removing the health classification task provides the strongest evidence for AMNL's design. The health task is not an optional enhancement; it acts as essential regularization that keeps the model from overfitting to dataset-specific patterns. It also offers a plausible explanation for why equal weighting (0.5/0.5) works: the health task receives enough weight to regularize the shared encoder.

With dual-task importance established, we now examine the impact of the attention mechanism.