Chapter 17

Dual-Task vs Single-Task: +304% Degradation

Ablation Studies

Learning Objectives

By the end of this section, you will:

  1. Understand the single-task ablation experiment design
  2. Analyze the catastrophic +304.7% degradation when removing health classification
  3. Interpret why single-task RUL prediction fails on complex datasets
  4. Connect overfitting patterns to the regularization hypothesis
  5. Understand the importance of auxiliary tasks in multi-task learning
Critical Finding: Removing the health classification task and training with RUL prediction only results in +304.7% degradation on FD002 (RMSE: 27.27 vs 6.74). This catastrophic failure provides the strongest evidence for AMNL's regularization hypothesis.

Single-Task Baseline

The single-task ablation removes the health classification head and trains the model using only RUL prediction loss.

Experimental Design

| Configuration | RUL Task | Health Task | Description |
|---|---|---|---|
| AMNL (0.5/0.5) | ✓ | ✓ | Full dual-task architecture |
| Single-Task RUL | ✓ | ✗ | RUL prediction only |

Architecture Comparison

The single-task model uses the same CNN-BiLSTM-Attention encoder but removes the health classification head:

$$\text{Single-Task:}\quad \mathcal{L} = \mathcal{L}_{\text{RUL}}$$

$$\text{Dual-Task:}\quad \mathcal{L} = 0.5 \cdot \mathcal{L}_{\text{RUL}} + 0.5 \cdot \mathcal{L}_{\text{Health}}$$
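The dual-task objective above can be sketched as a small PyTorch function. MSE for the RUL term matches the training code later in this section; using cross-entropy for the 3-class health term is an assumption about the classification loss:

```python
import torch
import torch.nn.functional as F

def dual_task_loss(rul_pred, rul_true, health_logits, health_true,
                   w_rul=0.5, w_health=0.5):
    """Equal-weighted (0.5/0.5) dual-task objective, as in AMNL."""
    # Regression term: MSE on predicted vs true RUL
    l_rul = F.mse_loss(rul_pred.squeeze(-1), rul_true)
    # Classification term: cross-entropy over health states (assumed)
    l_health = F.cross_entropy(health_logits, health_true)
    return w_rul * l_rul + w_health * l_health
```

The single-task ablation simply drops the second term (equivalently, sets `w_health=0` and removes the health head).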
| Component | AMNL | Single-Task |
|---|---|---|
| CNN Feature Extractor | ✓ | ✓ |
| BiLSTM Temporal Encoder | ✓ | ✓ |
| Multi-Head Attention | ✓ | ✓ |
| RUL Prediction Head | ✓ | ✓ |
| Health Classification Head | ✓ | ✗ |
| Total Parameters | ~2.3M | ~2.2M |

Minimal Architecture Difference

The health classification head adds only ~100K parameters (4% of total). The dramatic performance difference cannot be attributed to model capacity—it must be due to the auxiliary task's regularization effect.
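The size of the ablated head is easy to verify by counting parameters directly. The layer widths below are assumptions that mirror the RUL head's shape (512 → 128, with 3 output classes), so the exact count will differ from the ~100K quoted above if the real health head is wider:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Hypothetical health head mirroring the RUL head's layout (widths assumed)
health_head = nn.Sequential(
    nn.Linear(512, 128),  # 512 = bidirectional hidden size (2 * 256)
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 3),    # 3 health classes
)
print(count_params(health_head))  # tens of thousands of parameters
```

Either way, the head is a small fraction of the ~2.3M total, which is the point: capacity alone cannot explain the performance gap.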


Catastrophic Failure

The single-task model exhibits catastrophic performance degradation on complex multi-condition datasets.

Per-Dataset Results

| Dataset | AMNL RMSE | Single-Task RMSE | Degradation |
|---|---|---|---|
| FD001 | 10.43 | 15.63 | +49.9% |
| FD002 | 6.74 | 27.27 | +304.7% |
| FD003 | 9.51 | 17.62 | +85.3% |
| FD004 | 8.16 | 25.41 | +211.4% |

Catastrophic on Multi-Condition Data

The single-task model's RMSE on FD002 (27.27) is worse than random guessing (a constant prediction of mean RUL yields ~25 RMSE). The model has learned patterns that actively hurt prediction accuracy.
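The constant-baseline claim can be sanity-checked with a few lines of NumPy. The uniform RUL spread below is an illustrative assumption, not the actual FD002 test set:

```python
import numpy as np

def constant_baseline_rmse(true_rul: np.ndarray) -> float:
    """RMSE of always predicting the mean RUL (equals the label std)."""
    pred = np.full_like(true_rul, true_rul.mean())
    return float(np.sqrt(np.mean((pred - true_rul) ** 2)))

# Illustrative: test RULs spread roughly uniformly between 0 and 86 cycles
rul = np.arange(87, dtype=float)
print(round(constant_baseline_rmse(rul), 1))  # 25.1
```

Any model scoring above this baseline is extracting anti-signal from the inputs.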

Detailed FD002 Analysis

Training Dynamics Comparison

| Metric | AMNL | Single-Task | Interpretation |
|---|---|---|---|
| Train Loss (final) | 0.023 | 0.018 | Single-task overfits more |
| Validation RMSE | 6.74 | 27.27 | Massive generalization gap |
| Train/Val Gap | Small | Very Large | Clear overfitting signal |
| Epochs to Best | 219 | 156 | Early peak, then degradation |
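"Epochs to Best" can be read off a per-epoch validation history with a small helper like this (a generic sketch, not the project's actual logging code):

```python
def best_epoch(val_rmse_history):
    """Return (epoch, rmse) of the lowest validation RMSE (1-indexed)."""
    idx = min(range(len(val_rmse_history)), key=val_rmse_history.__getitem__)
    return idx + 1, val_rmse_history[idx]

# A model that peaks early and then degrades shows the overfitting signature
history = [30.0, 25.0, 24.0, 26.0, 29.0]
print(best_epoch(history))  # (3, 24.0)
```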

Overfitting Signature

The single-task model achieves lower training loss but much higher validation error. This classic overfitting pattern confirms the model memorizes training data rather than learning generalizable features.

NASA Score Impact

| Dataset | AMNL NASA | Single-Task NASA | Degradation |
|---|---|---|---|
| FD001 | 434.3 | 892.1 | +105.4% |
| FD002 | 356.0 | 3,247.8 | +812.3% |
| FD003 | 338.9 | 1,023.4 | +202.0% |
| FD004 | 537.5 | 4,891.2 | +810.0% |

The NASA Score degradation is even more severe than the RMSE degradation. Because the NASA Score grows exponentially with error magnitude, and penalizes late predictions (over-estimated RUL) hardest, the single-task model's large prediction errors translate into catastrophic scoring penalties.
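The asymmetric penalty can be made concrete with the standard C-MAPSS/PHM08 scoring function, where a late prediction (d > 0) costs more than an equally sized early one:

```python
import numpy as np

def nasa_score(pred, true):
    """Standard C-MAPSS/PHM08 scoring function (lower is better)."""
    d = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    # Early predictions (d < 0) use the gentler exp(-d/13) branch;
    # late predictions (d > 0) use the harsher exp(d/10) branch.
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / 13) - 1.0,
                                 np.exp(d / 10) - 1.0)))

print(round(nasa_score([90], [100]), 2))   # 10 cycles early -> ~1.16
print(round(nasa_score([110], [100]), 2))  # 10 cycles late  -> ~1.72
```

Because both branches are exponential, a handful of badly mispredicted engines can dominate the total score, which is exactly what the 3,000+ single-task scores indicate.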


Why Single-Task Fails

Understanding the mechanisms behind single-task failure illuminates AMNL's success.

The Overfitting Hypothesis

Without the health classification task, the model has no constraint on what features it learns. It can exploit any pattern in the training data that correlates with RUL—including spurious correlations that don't generalize.

$$\text{Sensor}_i(t) = f(\text{Degradation}) + g(\text{Condition}) + h(\text{Engine-specific}) + \epsilon$$

  • $f(\text{Degradation})$: True degradation signal (generalizes)
  • $g(\text{Condition})$: Operating condition effects (must be factored out)
  • $h(\text{Engine-specific})$: Individual engine quirks (noise)

The single-task model can minimize training loss by learning $g + h$ instead of $f$, since all three correlate with RUL in the training set.
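This failure mode can be reproduced in miniature: fit a linear model on a sensor whose reading is dominated by an operating condition that is spuriously correlated with RUL at training time, then evaluate when that correlation disappears. All signal magnitudes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def sensor(rul, cond):
    # f(degradation) is weak; g(condition) dominates the reading
    return 0.2 * (1.0 - rul / 125.0) + 2.0 * cond + rng.normal(0, 0.01, len(rul))

# Training: operating condition happens to track RUL (spurious correlation)
rul_tr = rng.uniform(0, 125, n)
cond_tr = rul_tr / 125.0 + rng.normal(0, 0.05, n)
X_tr = np.column_stack([sensor(rul_tr, cond_tr), np.ones(n)])
w, *_ = np.linalg.lstsq(X_tr, rul_tr, rcond=None)

# Test: condition is independent of RUL, so g no longer predicts anything
rul_te = rng.uniform(0, 125, n)
cond_te = rng.uniform(0, 1, n)
X_te = np.column_stack([sensor(rul_te, cond_te), np.ones(n)])

rmse = lambda X, y: float(np.sqrt(np.mean((X @ w - y) ** 2)))
print(rmse(X_tr, rul_tr), rmse(X_te, rul_te))  # test RMSE is far worse
```

The fit looks excellent on the training distribution and collapses the moment the condition-RUL correlation breaks, mirroring the FD002 result.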

Health Task as Regularizer

The health classification task constrains what features the encoder can learn:

  1. Discrete supervision: Health states (Healthy, Degrading, Critical) are defined by RUL thresholds, forcing the model to learn RUL-predictive features
  2. Condition-invariant targets: Health labels don't depend on operating condition—an engine at RUL=10 is "Critical" regardless of altitude
  3. Smooth decision boundaries: Classification loss encourages features that cleanly separate health states
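The discrete supervision in point 1 can be sketched as a thresholding function. The 100/30-cycle cut-offs here are illustrative assumptions; the text only specifies that health states are defined by RUL thresholds:

```python
import torch

def health_from_rul(rul: torch.Tensor,
                    degrading_below: float = 100.0,
                    critical_below: float = 30.0) -> torch.Tensor:
    """Map RUL to health class: 0=Healthy, 1=Degrading, 2=Critical."""
    labels = torch.zeros_like(rul, dtype=torch.long)
    labels[rul < degrading_below] = 1
    labels[rul < critical_below] = 2  # an engine at RUL=10 is Critical
    return labels

print(health_from_rul(torch.tensor([120.0, 50.0, 10.0])))  # tensor([0, 1, 2])
```

Because the mapping depends on RUL alone, the labels are identical across operating conditions, which is what forces condition-invariant features.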

Condition-Invariance Requirement

On FD002 and FD004 (6 operating conditions), the single-task failure is most severe because:

| Dataset | Conditions | Single-Task Degradation | Explanation |
|---|---|---|---|
| FD001 | 1 | +49.9% | Less condition variation to exploit |
| FD002 | 6 | +304.7% | Overfits to condition-specific patterns |
| FD003 | 1 | +85.3% | Multiple fault modes add complexity |
| FD004 | 6 | +211.4% | Maximum complexity, maximum overfitting |

Pattern Confirmation

The severity of single-task failure correlates with dataset complexity. Multi-condition datasets (FD002, FD004) degrade by +211% to +305%, while single-condition datasets (FD001, FD003) degrade by only +50% to +85%. This confirms the health task's role in learning condition-invariant features.


Implementation

Code for the single-task ablation experiment.

Single-Task Model Definition

```python
import torch
import torch.nn as nn


class SingleTaskRULModel(nn.Module):
    """
    Single-task RUL prediction model (for ablation).

    Same architecture as AMNL but without the health classification head.
    """

    def __init__(
        self,
        input_size: int = 17,
        sequence_length: int = 30,
        hidden_size: int = 256,
        dropout: float = 0.2,
        use_attention: bool = True
    ):
        super().__init__()

        # CNN Feature Extractor (same as AMNL)
        self.cnn = nn.Sequential(
            nn.Conv1d(input_size, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )

        # BiLSTM Temporal Encoder (same as AMNL)
        self.lstm = nn.LSTM(
            input_size=128,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
            dropout=dropout
        )

        # Multi-Head Attention (same as AMNL)
        self.use_attention = use_attention
        if use_attention:
            self.attention = nn.MultiheadAttention(
                embed_dim=hidden_size * 2,
                num_heads=8,
                dropout=dropout,
                batch_first=True
            )

        # RUL Prediction Head ONLY (no health head)
        self.rul_head = nn.Sequential(
            nn.Linear(hidden_size * 2, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1)
        )

        # NOTE: No health_head - this is the ablation!

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CNN: [batch, seq, features] -> [batch, features, seq]
        x = x.transpose(1, 2)
        x = self.cnn(x)
        x = x.transpose(1, 2)  # Back to [batch, seq, features]

        # BiLSTM
        lstm_out, _ = self.lstm(x)

        # Attention
        if self.use_attention:
            attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
            lstm_out = lstm_out + attn_out  # Residual

        # Pool and predict RUL
        pooled = lstm_out[:, -1, :]  # Take last timestep
        rul_pred = self.rul_head(pooled)

        # Return ONLY RUL prediction (no health logits)
        return rul_pred
```

Single-Task Training Loop

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def train_single_task(
    model: SingleTaskRULModel,
    train_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    use_weighted_mse: bool = True
) -> float:
    """
    Train single-task model for one epoch.

    Note: No health labels, no health loss, no dual-task balancing.
    """
    model.train()
    total_loss = 0.0

    for batch_x, batch_rul in train_loader:
        batch_x = batch_x.to(device)
        batch_rul = batch_rul.to(device)

        optimizer.zero_grad()

        # Forward pass - single output
        rul_pred = model(batch_x)

        # Loss - RUL only (no health component)
        if use_weighted_mse:
            # Weight low-RUL (near-failure) samples more heavily
            weights = 1.0 + (125.0 - batch_rul.clamp(0, 125)) / 125.0
            loss = (weights * (rul_pred.squeeze() - batch_rul) ** 2).mean()
        else:
            loss = F.mse_loss(rul_pred.squeeze(), batch_rul)

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(train_loader)
```

Ablation Experiment Runner

```python
from typing import Dict, List

import numpy as np


def run_dual_task_ablation(
    datasets: List[str] = ['FD001', 'FD002', 'FD003', 'FD004'],
    seeds: List[int] = [42, 123, 456],
    epochs: int = 300
) -> Dict:
    """
    Compare dual-task AMNL vs single-task RUL.

    Returns results for statistical comparison.
    """
    results = {
        'dual_task': [],
        'single_task': []
    }

    for dataset in datasets:
        for seed in seeds:
            print(f"Dataset: {dataset}, Seed: {seed}")

            # Train dual-task AMNL
            amnl_model = DualTaskAMNL(...)
            amnl_result = train_amnl(amnl_model, dataset, seed, epochs)
            results['dual_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': amnl_result['rmse'],
                'nasa_score': amnl_result['nasa_score']
            })

            # Train single-task (ablation)
            single_model = SingleTaskRULModel(...)
            single_result = train_single_task_model(
                single_model, dataset, seed, epochs
            )
            results['single_task'].append({
                'dataset': dataset,
                'seed': seed,
                'rmse': single_result['rmse'],
                'nasa_score': single_result['nasa_score']
            })

    # Calculate degradation
    for ds in datasets:
        dual_rmse = np.mean([
            r['rmse'] for r in results['dual_task']
            if r['dataset'] == ds
        ])
        single_rmse = np.mean([
            r['rmse'] for r in results['single_task']
            if r['dataset'] == ds
        ])
        degradation = (single_rmse - dual_rmse) / dual_rmse * 100
        print(f"{ds}: Dual={dual_rmse:.2f}, Single={single_rmse:.2f}, "
              f"Degradation={degradation:+.1f}%")

    return results
```
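The per-seed results the runner collects support a paired significance test across seeds. A minimal NumPy sketch (assuming matched seed order; with only three seeds, treat the p-value with caution):

```python
import numpy as np

def paired_t(dual, single):
    """Paired t-statistic across seeds for single-task minus dual-task RMSE."""
    d = np.asarray(single, dtype=float) - np.asarray(dual, dtype=float)
    n = len(d)
    # Standard paired t: mean difference over its standard error
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(n)))

# Example: per-seed RMSEs on one dataset (made-up numbers)
print(paired_t([6.7, 6.8, 6.7], [27.1, 27.4, 27.3]))
# Large positive t: single-task is consistently worse across seeds
```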

Summary

Dual-Task vs Single-Task Summary:

  1. Catastrophic failure: Single-task RUL shows +304.7% degradation on FD002
  2. Overfitting confirmed: Lower training loss but much higher validation error
  3. Condition-dependent severity: 6-condition datasets show 3-4× worse degradation
  4. Health task essential: Not merely helpful but critical for generalization
  5. Regularization validated: Results strongly support the regularization hypothesis

| Key Finding | Value |
|---|---|
| Maximum degradation | +304.7% (FD002) |
| Minimum degradation | +49.9% (FD001) |
| Average degradation | +162.8% (all datasets) |
| NASA Score degradation | Up to +812.3% |

Key Insight: The +304.7% degradation when removing the health classification task provides the strongest evidence for AMNL's design. The health task is not an optional enhancement—it is essential regularization that prevents the model from overfitting to dataset-specific patterns. This explains why equal weighting (0.5/0.5) works: giving the health task sufficient weight ensures adequate regularization.

With dual-task importance established, we now examine the impact of the attention mechanism.