Learning Objectives
By the end of this section, you will:
- Understand the single-task ablation experiment design
- Analyze the catastrophic +304.7% degradation when removing health classification
- Interpret why single-task RUL prediction fails on complex datasets
- Connect overfitting patterns to the regularization hypothesis
- Understand the importance of auxiliary tasks in multi-task learning
Critical Finding: Removing the health classification task and training with RUL prediction only results in +304.7% degradation on FD002 (RMSE: 27.27 vs 6.74). This catastrophic failure provides the strongest evidence for AMNL's regularization hypothesis.
Single-Task Baseline
The single-task ablation removes the health classification head and trains the model using only RUL prediction loss.
Experimental Design
| Configuration | RUL Task | Health Task | Description |
|---|---|---|---|
| AMNL (0.5/0.5) | ✓ | ✓ | Full dual-task architecture |
| Single-Task RUL | ✓ | ✗ | RUL prediction only |
Architecture Comparison
The single-task model uses the same CNN-BiLSTM-Attention encoder but removes the health classification head:
| Component | AMNL | Single-Task |
|---|---|---|
| CNN Feature Extractor | ✓ | ✓ |
| BiLSTM Temporal Encoder | ✓ | ✓ |
| Multi-Head Attention | ✓ | ✓ |
| RUL Prediction Head | ✓ | ✓ |
| Health Classification Head | ✓ | ✗ |
| Total Parameters | ~2.3M | ~2.2M |
Minimal Architecture Difference
The health classification head adds only ~100K parameters (4% of total). The dramatic performance difference cannot be attributed to model capacity—it must be due to the auxiliary task's regularization effect.
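The exact width of the health head is not shown in this section, but its parameter cost is easy to bound. A sketch assuming a head that mirrors the RUL head (512 → 128 hidden units, 3 health classes) gives ~66K parameters; the ~100K figure cited above implies a slightly wider head, but either way the head is a small fraction of the ~2.2M shared encoder:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameter count of a fully connected layer: weights + biases."""
    return n_in * n_out + n_out

# Hypothetical health head mirroring the RUL head: 512 -> 128 -> 3 classes
health_head_params = linear_params(512, 128) + linear_params(128, 3)
print(health_head_params)  # 66051, well under 5% of ~2.2M
```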
Catastrophic Failure
The single-task model exhibits catastrophic performance degradation on complex multi-condition datasets.
Per-Dataset Results
| Dataset | AMNL RMSE | Single-Task RMSE | Degradation |
|---|---|---|---|
| FD001 | 10.43 | 15.63 | +49.9% |
| FD002 | 6.74 | 27.27 | +304.7% |
| FD003 | 9.51 | 17.62 | +85.3% |
| FD004 | 8.16 | 25.41 | +211.4% |
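The degradation column follows directly from each RMSE pair as `(single − dual) / dual × 100`. A quick check against the table (small discrepancies are due to rounding of the reported RMSE values):

```python
amnl_rmse   = {'FD001': 10.43, 'FD002': 6.74, 'FD003': 9.51, 'FD004': 8.16}
single_rmse = {'FD001': 15.63, 'FD002': 27.27, 'FD003': 17.62, 'FD004': 25.41}

for ds in amnl_rmse:
    # Relative degradation of single-task vs dual-task, in percent
    deg = (single_rmse[ds] - amnl_rmse[ds]) / amnl_rmse[ds] * 100
    print(f"{ds}: {deg:+.1f}%")
```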
Catastrophic on Multi-Condition Data
The single-task model's RMSE on FD002 (27.27) is worse than a trivial baseline: a constant prediction of the mean RUL yields ~25 RMSE. The model has learned patterns that actively hurt prediction accuracy.
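The ~25 RMSE figure is the cost of ignoring the sensors entirely: the best constant predictor is the mean RUL, and its RMSE equals the standard deviation of the true RULs. A sketch with synthetic stand-in targets (not the real FD002 labels) illustrates the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for test-set RUL targets, capped at 125 cycles
y_true = rng.uniform(0, 125, size=259)

const_pred = y_true.mean()  # best constant predictor under MSE
rmse = np.sqrt(np.mean((const_pred - y_true) ** 2))

# RMSE of the mean-constant baseline equals the std of the targets
assert np.isclose(rmse, y_true.std())
```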
Detailed FD002 Analysis
Training Dynamics Comparison
| Metric | AMNL | Single-Task | Interpretation |
|---|---|---|---|
| Train Loss (final) | 0.023 | 0.018 | Single-task overfits more |
| Validation RMSE | 6.74 | 27.27 | Massive generalization gap |
| Train/Val Gap | Small | Very Large | Clear overfitting signal |
| Epochs to Best | 219 | 156 | Early peak, then degradation |
Overfitting Signature
The single-task model achieves lower training loss but much higher validation error. This classic overfitting pattern confirms the model memorizes training data rather than learning generalizable features.
NASA Score Impact
| Dataset | AMNL NASA | Single-Task NASA | Degradation |
|---|---|---|---|
| FD001 | 434.3 | 892.1 | +105.4% |
| FD002 | 356.0 | 3,247.8 | +812.3% |
| FD003 | 338.9 | 1,023.4 | +202.0% |
| FD004 | 537.5 | 4,891.2 | +810.0% |
The NASA Score degradation is even more severe than RMSE degradation. Since NASA Score penalizes late predictions exponentially, the single-task model's tendency to under-predict RUL results in catastrophic scoring penalties.
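The asymmetry comes from the standard C-MAPSS/PHM08 scoring function: with error d = predicted − true, each engine contributes exp(d/10) − 1 when late (d ≥ 0) and exp(−d/13) − 1 when early (d < 0), so late predictions are penalized more harshly. A minimal sketch:

```python
import numpy as np

def nasa_score(y_true, y_pred):
    """PHM08/C-MAPSS score: exponential, asymmetric penalty on RUL error."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    penalties = np.where(d < 0, np.exp(-d / 13.0) - 1, np.exp(d / 10.0) - 1)
    return float(penalties.sum())

# Being 20 cycles late costs ~1.7x more than being 20 cycles early,
# and the gap widens rapidly as the error grows
late  = nasa_score([50], [70])   # d = +20 -> exp(2) - 1
early = nasa_score([50], [30])   # d = -20 -> exp(20/13) - 1
```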
Why Single-Task Fails
Understanding the mechanisms behind single-task failure illuminates AMNL's success.
The Overfitting Hypothesis
Without the health classification task, the model has no constraint on what features it learns. It can exploit any pattern in the training data that correlates with RUL—including spurious correlations that don't generalize.
In the training data, three kinds of patterns correlate with RUL:
- True degradation signal (generalizes to unseen engines)
- Operating condition effects (must be factored out)
- Individual engine quirks (noise)

The single-task model can minimize training loss by learning the latter two instead of the true degradation signal, since all three correlate with RUL in the training set.
Health Task as Regularizer
The health classification task constrains what features the encoder can learn:
- Discrete supervision: Health states (Healthy, Degrading, Critical) are defined by RUL thresholds, forcing the model to learn RUL-predictive features
- Condition-invariant targets: Health labels don't depend on operating condition—an engine at RUL=10 is "Critical" regardless of altitude
- Smooth decision boundaries: Classification loss encourages features that cleanly separate health states
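The discrete supervision can be sketched as a simple RUL-to-label mapping. The thresholds below (50 and 20 cycles) are illustrative placeholders, since the text specifies only that states are defined by RUL cutoffs:

```python
def health_label(rul: float,
                 degrading_at: float = 50.0,
                 critical_at: float = 20.0) -> int:
    """Map an RUL value to a health state: 0=Healthy, 1=Degrading, 2=Critical.

    The label depends only on RUL, never on operating condition, so it is
    condition-invariant by construction.
    """
    if rul <= critical_at:
        return 2
    if rul <= degrading_at:
        return 1
    return 0

# An engine at RUL=10 is "Critical" regardless of altitude or throttle
assert [health_label(r) for r in (100.0, 35.0, 10.0)] == [0, 1, 2]
```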
Condition-Invariance Requirement
On FD002 and FD004 (6 operating conditions), the single-task failure is most severe because:
| Dataset | Conditions | Single-Task Degradation | Explanation |
|---|---|---|---|
| FD001 | 1 | +49.9% | Less condition variation to exploit |
| FD002 | 6 | +304.7% | Overfits to condition-specific patterns |
| FD003 | 1 | +85.3% | Multiple fault modes add complexity |
| FD004 | 6 | +211.4% | Maximum complexity, maximum overfitting |
Pattern Confirmation
The severity of single-task failure correlates with dataset complexity. Multi-condition datasets (FD002, FD004) show roughly 2-3× degradation, while single-condition datasets (FD001, FD003) stay under 1×. This confirms the health task's role in learning condition-invariant features.
Implementation
Code for the single-task ablation experiment.
Single-Task Model Definition
```python
import torch
import torch.nn as nn


class SingleTaskRULModel(nn.Module):
    """
    Single-task RUL prediction model (for ablation).

    Same architecture as AMNL but without the health classification head.
    """

    def __init__(
        self,
        input_size: int = 17,
        sequence_length: int = 30,
        hidden_size: int = 256,
        dropout: float = 0.2,
        use_attention: bool = True
    ):
        super().__init__()

        # CNN Feature Extractor (same as AMNL)
        self.cnn = nn.Sequential(
            nn.Conv1d(input_size, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )

        # BiLSTM Temporal Encoder (same as AMNL)
        self.lstm = nn.LSTM(
            input_size=128,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
            dropout=dropout
        )

        # Multi-Head Attention (same as AMNL)
        self.use_attention = use_attention
        if use_attention:
            self.attention = nn.MultiheadAttention(
                embed_dim=hidden_size * 2,
                num_heads=8,
                dropout=dropout,
                batch_first=True
            )

        # RUL Prediction Head ONLY (no health head)
        self.rul_head = nn.Sequential(
            nn.Linear(hidden_size * 2, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1)
        )

        # NOTE: No health_head - this is the ablation!

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CNN: [batch, seq, features] -> [batch, features, seq]
        x = x.transpose(1, 2)
        x = self.cnn(x)
        x = x.transpose(1, 2)  # Back to [batch, seq, features]

        # BiLSTM
        lstm_out, _ = self.lstm(x)

        # Attention
        if self.use_attention:
            attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
            lstm_out = lstm_out + attn_out  # Residual

        # Pool and predict RUL
        pooled = lstm_out[:, -1, :]  # Take last timestep
        rul_pred = self.rul_head(pooled)

        # Return ONLY RUL prediction (no health logits)
        return rul_pred
```

Single-Task Training Loop
```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def train_single_task(
    model: SingleTaskRULModel,
    train_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    use_weighted_mse: bool = True
) -> float:
    """
    Train single-task model for one epoch.

    Note: No health labels, no health loss, no dual-task balancing.
    """
    model.train()
    total_loss = 0.0

    for batch_x, batch_rul in train_loader:
        batch_x = batch_x.to(device)
        batch_rul = batch_rul.to(device)

        optimizer.zero_grad()

        # Forward pass - single output
        rul_pred = model(batch_x)

        # Loss - RUL only (no health component)
        if use_weighted_mse:
            # Weight low-RUL samples more heavily (up to 2x near failure)
            weights = 1.0 + (125.0 - batch_rul.clamp(0, 125)) / 125.0
            loss = (weights * (rul_pred.squeeze() - batch_rul) ** 2).mean()
        else:
            loss = F.mse_loss(rul_pred.squeeze(), batch_rul)

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(train_loader)
```

Ablation Experiment Runner
```python
from typing import Dict, List

import numpy as np


def run_dual_task_ablation(
    datasets: List[str] = ['FD001', 'FD002', 'FD003', 'FD004'],
    seeds: List[int] = [42, 123, 456],
    epochs: int = 300
) -> Dict:
    """
    Compare dual-task AMNL vs single-task RUL.

    Returns results for statistical comparison.
    """
    results = {
        'dual_task': [],
        'single_task': []
    }

    for dataset in datasets:
        for seed in seeds:
            print(f"Dataset: {dataset}, Seed: {seed}")

            # Train dual-task AMNL
            amnl_model = DualTaskAMNL(...)
            amnl_result = train_amnl(amnl_model, dataset, seed, epochs)
            results['dual_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': amnl_result['rmse'],
                'nasa_score': amnl_result['nasa_score']
            })

            # Train single-task (ablation)
            single_model = SingleTaskRULModel(...)
            single_result = train_single_task_model(
                single_model, dataset, seed, epochs
            )
            results['single_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': single_result['rmse'],
                'nasa_score': single_result['nasa_score']
            })

    # Calculate degradation
    for ds in datasets:
        dual_rmse = np.mean([
            r['rmse'] for r in results['dual_task']
            if r['dataset'] == ds
        ])
        single_rmse = np.mean([
            r['rmse'] for r in results['single_task']
            if r['dataset'] == ds
        ])
        degradation = (single_rmse - dual_rmse) / dual_rmse * 100
        print(f"{ds}: Dual={dual_rmse:.2f}, Single={single_rmse:.2f}, "
              f"Degradation={degradation:+.1f}%")

    return results
```

Summary
Dual-Task vs Single-Task Summary:
- Catastrophic failure: Single-task RUL shows +304.7% degradation on FD002
- Overfitting confirmed: Lower training loss but much higher validation error
- Condition-dependent severity: 6-condition datasets show 3-4× worse degradation
- Health task essential: Not merely helpful but critical for generalization
- Regularization validated: Results strongly support the regularization hypothesis
| Key Finding | Value |
|---|---|
| Maximum degradation | +304.7% (FD002) |
| Minimum degradation | +49.9% (FD001) |
| Average degradation | +162.8% (all datasets) |
| NASA Score degradation | Up to +812.3% |
Key Insight: The +304.7% degradation when removing the health classification task provides the strongest evidence for AMNL's design. The health task is not an optional enhancement—it is essential regularization that prevents the model from overfitting to dataset-specific patterns. This explains why equal weighting (0.5/0.5) works: giving the health task sufficient weight ensures adequate regularization.
With dual-task importance established, we now examine the impact of the attention mechanism.