AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand conventional multi-task learning weight selection
Analyze weight experiments across multiple configurations
Discover why 0.5/0.5 weighting outperforms asymmetric schemes
Understand the regularization mechanism of equal weighting
Implement weight ablation experiments systematically

Key Finding: Equal weighting (0.5/0.5) between RUL prediction and health classification outperforms every asymmetric weighting scheme we tested. This contradicts the conventional multi-task learning wisdom that the primary task should receive a higher weight than auxiliary tasks.

Conventional Wisdom

Multi-task learning typically assumes the primary task should be weighted more heavily than auxiliary tasks.

Traditional Approach

In standard multi-task learning, the combined loss is typically formulated as:

$$\mathcal{L}_{\text{combined}} = \alpha \cdot \mathcal{L}_{\text{primary}} + (1 - \alpha) \cdot \mathcal{L}_{\text{auxiliary}}$$

Where $\alpha > 0.5$ is the common choice, based on the reasoning that:

The primary task (RUL) is what we ultimately care about
Auxiliary tasks provide support but shouldn't dominate
Higher weight ensures the model prioritizes primary task optimization
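The weighted sum above can be sketched in a few lines of plain Python. The function name `combined_loss` and the scalar loss values are illustrative assumptions, not taken from the original training code.

```python
def combined_loss(primary_loss: float, auxiliary_loss: float, alpha: float) -> float:
    """Weighted two-task loss: alpha * L_primary + (1 - alpha) * L_auxiliary."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * primary_loss + (1.0 - alpha) * auxiliary_loss


# Illustrative per-batch loss values (made up for this example).
rul_loss, health_loss = 2.0, 0.4

conventional = combined_loss(rul_loss, health_loss, alpha=0.75)  # asymmetric
equal = combined_loss(rul_loss, health_loss, alpha=0.5)          # equal weighting
print(conventional, equal)
```

With a single `alpha`, the two weights always sum to 1, so changing the weighting redistributes emphasis between tasks without rescaling the overall loss.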

Our V7 Baseline Configuration

Parameter	V7 Baseline Value	Rationale
RUL Weight (α)	0.75	Primary task gets majority weight
Health Weight (1-α)	0.25	Auxiliary task supports learning
Weighting Strategy	Asymmetric	Follow conventional wisdom

The Surprising Discovery

During systematic ablation studies, we discovered that equal weighting (0.5/0.5) consistently outperformed our carefully tuned asymmetric baseline. This led to the development of AMNL.

Weight Experiments

Systematic evaluation of different task weighting configurations across multiple datasets and seeds.

Experimental Design

Configuration	RUL Weight	Health Weight	Description
V7 Baseline	0.75	0.25	Strong RUL preference
AMNL 0.9/0.1	0.90	0.10	Maximum RUL preference
AMNL 0.7/0.3	0.70	0.30	Moderate RUL preference
AMNL 0.6/0.4	0.60	0.40	Slight RUL preference
AMNL 0.5/0.5	0.50	0.50	Equal weighting (AMNL)

Results: FD002 (6 Operating Conditions)

Configuration	RMSE	Δ vs V7 (+ = lower RMSE)	NASA Score
V7 Baseline (0.75/0.25)	9.45	—	498.0
AMNL 0.9/0.1	11.23	-18.8%	612.4
AMNL 0.7/0.3	8.12	+14.1%	421.3
AMNL 0.6/0.4	7.45	+21.2%	389.7
AMNL 0.5/0.5	6.74	+28.7%	356.0

Results: FD004 (6 Conditions, 2 Faults)

Configuration	RMSE	Δ vs V7 (+ = lower RMSE)	NASA Score
V7 Baseline (0.75/0.25)	8.41	—	945.0
AMNL 0.9/0.1	10.67	-26.9%	1123.8
AMNL 0.7/0.3	8.89	-5.7%	712.4
AMNL 0.6/0.4	8.34	+0.8%	623.1
AMNL 0.5/0.5	8.16	+3.0%	537.5
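The Δ vs V7 column in both tables can be reproduced from the RMSE values alone. The sketch below assumes the convention that a positive percentage means lower RMSE than the baseline; the helper name `improvement_pct` is ours.

```python
def improvement_pct(baseline_rmse: float, rmse: float) -> float:
    """Percent improvement over the baseline; positive means lower RMSE."""
    return (baseline_rmse - rmse) / baseline_rmse * 100.0


# RMSE values copied from the FD002 and FD004 tables above.
rmse_table = {
    "FD002": {"baseline": 9.45, "0.9/0.1": 11.23, "0.7/0.3": 8.12,
              "0.6/0.4": 7.45, "0.5/0.5": 6.74},
    "FD004": {"baseline": 8.41, "0.9/0.1": 10.67, "0.7/0.3": 8.89,
              "0.6/0.4": 8.34, "0.5/0.5": 8.16},
}

# Percent change of every configuration relative to its dataset's baseline.
deltas = {
    dataset: {
        name: round(improvement_pct(rows["baseline"], rmse), 1)
        for name, rmse in rows.items()
        if name != "baseline"
    }
    for dataset, rows in rmse_table.items()
}

for dataset, row in deltas.items():
    for name, delta in row.items():
        print(f"{dataset}  {name}: {delta:+.1f}%")
```

Note that on FD004 the 0.7/0.3 configuration comes out negative, meaning it is slightly worse than the V7 baseline on RMSE even though its NASA score is better.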

Statistical Comparison

Comparison	FD002 Δ RMSE	FD004 Δ RMSE	p-value
0.5/0.5 vs 0.75/0.25	-2.71 (-28.7%)	-0.25 (-3.0%)	< 0.01
0.5/0.5 vs 0.9/0.1	-4.49 (-40.0%)	-2.51 (-23.5%)	< 0.001
0.5/0.5 vs 0.6/0.4	-0.71 (-9.5%)	-0.18 (-2.1%)	0.034

Statistically Significant

Equal weighting (0.5/0.5) significantly outperforms each asymmetric configuration in the table above at p < 0.05; negative Δ RMSE values mean lower error for 0.5/0.5. The improvement is largest against the extreme asymmetric weighting (0.9/0.1).

Why Equal Weighting Works

Three complementary explanations for the surprising success of equal task weighting.

Hypothesis 1: Regularization Effect

Health state classification provides discrete supervision signals that anchor continuous RUL predictions to meaningful degradation stages.

$$\text{Health State} = \begin{cases} 0 & \text{if RUL} > 50 \\ 1 & \text{if } 15 < \text{RUL} \leq 50 \\ 2 & \text{if RUL} \leq 15 \end{cases}$$

By forcing the model to correctly classify these discrete states, we implicitly constrain the RUL predictions to be consistent with degradation physics:

Healthy predictions must correspond to high RUL values
Critical predictions must correspond to low RUL values
Transition regions are explicitly supervised
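The piecewise definition above translates directly into a labeling function. This is a minimal sketch: the function names are ours, and the name "degrading" for the middle class is an assumption (the text only names the healthy and critical ends).

```python
def health_state(rul: float) -> int:
    """Map a RUL value (in cycles) to the 3-class health label defined above."""
    if rul > 50:
        return 0  # healthy
    if rul > 15:
        return 1  # degrading (15 < RUL <= 50); class name is an assumption
    return 2      # critical (RUL <= 15)


def is_consistent(predicted_rul: float, predicted_state: int) -> bool:
    """True when a RUL prediction falls inside the band of the predicted class."""
    return health_state(predicted_rul) == predicted_state


# Values on either side of both thresholds.
labels = [health_state(r) for r in (120, 51, 50, 16, 15, 0)]
print(labels)
```

A check like `is_consistent` is one way to see the constraint the health head imposes: a model that predicts "critical" while regressing a RUL of 80 is penalized by at least one of the two losses.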

Hypothesis 2: Gradient Balance

Equal weighting maintains gradient balance in shared encoder layers, encouraging features that capture fundamental degradation physics.

$$\nabla_\theta \mathcal{L} = 0.5 \cdot \nabla_\theta \mathcal{L}_{\text{RUL}} + 0.5 \cdot \nabla_\theta \mathcal{L}_{\text{Health}}$$

Weighting	Gradient Behavior	Effect
0.9/0.1	RUL dominates encoder updates	May overfit to RUL-specific features
0.75/0.25	RUL still dominates	Some regularization from health task
0.5/0.5	Balanced gradient flow	Learns generalizable features
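To make the table concrete, the toy sketch below combines two per-task gradients on the shared parameters and measures how much of the weighted gradient comes from the RUL term. The gradient vectors are hypothetical and were chosen to have equal magnitude, so the share simply tracks the task weights; in a real model the raw gradient norms differ and would shift these numbers.

```python
def weighted_gradient(grad_rul, grad_health, w_rul, w_health):
    """Combine per-task gradients of the shared parameters element-wise."""
    return [w_rul * gr + w_health * gh for gr, gh in zip(grad_rul, grad_health)]


def rul_share(grad_rul, grad_health, w_rul, w_health):
    """Fraction of the weighted gradient magnitude (L1) contributed by the RUL term."""
    rul_part = sum(abs(w_rul * g) for g in grad_rul)
    health_part = sum(abs(w_health * g) for g in grad_health)
    return rul_part / (rul_part + health_part)


# Hypothetical per-task gradients with equal L1 norm, for illustration only.
g_rul = [0.4, -0.2, 0.4]
g_health = [-0.1, 0.5, 0.4]

combined = weighted_gradient(g_rul, g_health, 0.5, 0.5)

shares = {
    (w_rul, w_health): rul_share(g_rul, g_health, w_rul, w_health)
    for w_rul, w_health in [(0.9, 0.1), (0.75, 0.25), (0.5, 0.5)]
}
for weights, share in shares.items():
    print(weights, f"RUL share of encoder gradient: {share:.2f}")
```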

Hypothesis 3: Implicit Curriculum

The easier health classification task provides an implicit curriculum that stabilizes learning of the harder RUL regression task.

Task	Difficulty	Convergence
Health Classification	Easier (3 classes)	Faster, more stable
RUL Regression	Harder (continuous)	Slower, less stable

During early training, the health classification task converges first, providing a stable foundation for the shared encoder. This prevents early training instability that can derail RUL learning.

Evidence from Single-Task Failure

The catastrophic failure of single-task RUL prediction (+304.7% degradation on FD002, covered in the next section) supports the regularization hypothesis: without the health task, the model appears to overfit to dataset-specific patterns.

Implementation

Our ablation study defines one baseline configuration and expresses each experiment as a set of overrides on it, so every weight combination is tested under otherwise identical settings.

V7 Baseline Configuration

🐍run_ablation_studies.py

V7 Baseline (V7_BASELINE_CONFIG)

The original training configuration, used before equal weighting was discovered. It serves as the baseline for all ablation comparisons.

RUL Weight (amnl_weight_rul)

The primary task receives 75% of the loss contribution, following conventional multi-task learning wisdom.

Example: loss = 0.75 * rul_loss + 0.25 * health_loss

Health Weight (amnl_weight_health)

The auxiliary health classification task receives only 25% of the loss weight.

Attention (use_attention)

Multi-head attention is enabled in the baseline configuration.

Weighted MSE (use_weighted_mse)

The RUL loss uses a weighted MSE instead of standard MSE.

Linear Decay (weighted_mse_type)

The weight function uses linear decay (not exponential) for stability.

EMA Enabled (use_ema)

An exponential moving average (EMA) is used for stable weight updates.

# V7 baseline configuration
V7_BASELINE_CONFIG = {
    'amnl_weight_rul': 0.75,      # loss weight for the primary RUL task
    'amnl_weight_health': 0.25,   # loss weight for the auxiliary health task
    'use_attention': True,
    'use_weighted_mse': True,
    'weighted_mse_type': 'linear',  # 'linear' or 'exponential'
    'use_warmup': True,
    'warmup_epochs': 10,
    'scheduler_type': 'reduce_on_plateau',  # 'reduce_on_plateau' or 'step'
    'use_ema': True,
    'use_adaptive_weight_decay': True,
    'initial_weight_decay': 1e-4,
}

Weight Ablation Configurations

🐍run_ablation_studies.py

Ablation Dictionary (ABLATION_CONFIGS)

Each ablation experiment is defined as a dictionary with a name, a description, and its changes from the baseline.

Equal Weighting (amnl_50_50)

The key discovery: equal weighting (0.5/0.5) consistently outperforms asymmetric configurations.

Example: loss = 0.5 * rul_loss + 0.5 * health_loss

Slight RUL Preference (amnl_60_40)

0.6/0.4 is tested to understand the sensitivity curve around equal weighting.

Strong RUL Preference (amnl_90_10)

0.9/0.1 tests extreme asymmetry; results show it performs worst among the weighted dual-task configurations.

Single-Task Ablation (no_dual_task)

The most important ablation: removing the health task entirely causes a +304.7% degradation on FD002.

# Define ablation experiments
ABLATION_CONFIGS = {
    # Ablation 2: Different AMNL weights
    'amnl_50_50': {
        'name': 'AMNL 0.5/0.5',
        'description': 'Equal weighting for RUL and health tasks',
        'changes': {'amnl_weight_rul': 0.5, 'amnl_weight_health': 0.5},
    },
    'amnl_60_40': {
        'name': 'AMNL 0.6/0.4',
        'description': 'Slight RUL preference',
        'changes': {'amnl_weight_rul': 0.6, 'amnl_weight_health': 0.4},
    },
    'amnl_90_10': {
        'name': 'AMNL 0.9/0.1',
        'description': 'Strong RUL preference',
        'changes': {'amnl_weight_rul': 0.9, 'amnl_weight_health': 0.1},
    },

    # Ablation 1: No dual-task (single-task RUL only)
    'no_dual_task': {
        'name': 'Single-Task RUL Only',
        'description': 'Remove health classification, use only RUL prediction',
        'changes': {'use_dual_task': False},
    },
}
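Because every weight ablation follows the same schema, the entries could also be generated rather than written by hand. The sketch below is one possible approach, not part of the original script: `make_weight_ablation` is a hypothetical helper, and the key pattern `amnl_<rul>_<health>` is inferred from the entries shown above.

```python
from typing import Dict, Tuple


def make_weight_ablation(rul_weight: float) -> Tuple[str, Dict]:
    """Build one ABLATION_CONFIGS-style entry for a given RUL weight."""
    # Round to avoid float noise such as 1 - 0.7 = 0.30000000000000004.
    health_weight = round(1.0 - rul_weight, 2)
    pct_rul, pct_health = round(rul_weight * 100), round(health_weight * 100)
    key = f"amnl_{pct_rul}_{pct_health}"
    entry = {
        "name": f"AMNL {rul_weight}/{health_weight}",
        "description": f"RUL weight {rul_weight}, health weight {health_weight}",
        "changes": {
            "amnl_weight_rul": rul_weight,
            "amnl_weight_health": health_weight,
        },
    }
    return key, entry


# The four AMNL weightings listed in the experimental design table.
sweep = dict(make_weight_ablation(w) for w in (0.5, 0.6, 0.7, 0.9))
for key, entry in sweep.items():
    print(key, entry["changes"])
```

Generating the sweep this way guarantees the two weights always sum to 1, which is easy to get wrong when entries are edited by hand.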

Ablation Training Function

🐍run_ablation_studies.py

Function Signature (train_with_ablation)

Takes the dataset name, a seed for reproducibility, the configuration dictionary, the output directory, and the epoch count.

Config Merging (full_config)

Ablation changes are merged over the baseline, so any parameter the ablation does not specify keeps its baseline default.

Example: {**V7_BASELINE_CONFIG, 'amnl_weight_rul': 0.5}

Dual-Task Check (use_dual_task)

Determines whether to build the dual-task AMNL model or the single-task model, based on the ablation config. The flag defaults to True when the config does not set it.

Dual-Task Model (DualTaskEnhancedModel)

Used for standard AMNL experiments, with both RUL and health heads.

Single-Task Model (EnhancedSOTATurbofanRULModel)

Used for the single-task ablation, with no health classification head.

from pathlib import Path
from typing import Dict

import torch

# ABLATION_EPOCHS, V7_BASELINE_CONFIG, set_seed, and the two model classes
# are defined or imported elsewhere in the script (not shown in this excerpt).


def train_with_ablation(
    dataset_name: str,
    seed: int,
    config: Dict,
    output_dir: Path,
    epochs: int = ABLATION_EPOCHS
) -> Dict:
    """Train model with specific ablation configuration."""

    # Merge baseline with ablation changes (ablation values override defaults)
    full_config = {**V7_BASELINE_CONFIG, **config.get('changes', {})}

    # Set seed for reproducibility
    set_seed(seed)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Determine model type based on configuration (dual-task unless disabled)
    use_dual_task = full_config.get('use_dual_task', True)

    # NOTE: `dropout` is not defined in this excerpt; it must be set
    # elsewhere in the script before the models are constructed.
    if use_dual_task:
        model = DualTaskEnhancedModel(
            input_size=17,
            sequence_length=30,
            hidden_size=256,
            num_health_states=3,
            dropout=dropout,
            use_attention=full_config['use_attention'],
            use_residual=True
        ).to(device)
    else:
        # Single-task model for ablation
        model = EnhancedSOTATurbofanRULModel(
            input_size=17,
            sequence_length=30,
            hidden_size=256,
            dropout=dropout,
            use_attention=full_config['use_attention'],
            use_residual=True
        ).to(device)

    # (Excerpt ends here; the training loop and returned metrics are not shown.)
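The merge step in `train_with_ablation` relies on Python's dict unpacking, where later keys override earlier ones. The standalone sketch below uses a trimmed copy of the baseline to show the override behavior and the `.get()` fallback that selects the model type.

```python
# Trimmed copy of the baseline, enough to demonstrate the merge.
V7_BASELINE_CONFIG = {
    "amnl_weight_rul": 0.75,
    "amnl_weight_health": 0.25,
    "use_attention": True,
}

ablation = {
    "name": "AMNL 0.5/0.5",
    "changes": {"amnl_weight_rul": 0.5, "amnl_weight_health": 0.5},
}

# Later keys win, so 'changes' overrides the baseline; untouched keys carry over.
# The baseline dict itself is not modified.
full_config = {**V7_BASELINE_CONFIG, **ablation.get("changes", {})}

# 'use_dual_task' is absent from both dicts, so the default (True) applies.
# Only the single-task ablation sets it to False.
use_dual_task = full_config.get("use_dual_task", True)

print(full_config)
print(use_dual_task)
```

Because the merge builds a new dict, the same baseline can be reused safely across every ablation run without one experiment leaking settings into the next.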

Running All Ablations

🐍run_ablation_studies.py

Statistical Seeds (ABLATION_SEEDS)

Three seeds (42, 123, 456) allow a mean and standard deviation to be reported for each configuration.

Dataset Selection (ABLATION_DATASETS)

Focus on FD002 and FD004, the multi-condition datasets where AMNL shows the greatest improvement.

Total Runs (total_runs)

Calculates the total number of experiments: (baseline + ablations) × datasets × seeds. With 9 ablations, 2 datasets, and 3 seeds, that is (1 + 9) × 2 × 3 = 60 runs. The ABLATION_CONFIGS excerpt shown earlier lists only four of the ablations.

Baseline First (baseline_config)

The V7 baseline runs first with an empty 'changes' dict, so it uses all default V7_BASELINE_CONFIG values.

Ablation Loop (ABLATION_CONFIGS.items())

Each ablation configuration runs on all datasets with all seeds for a comprehensive comparison.

Summary Generation (generate_ablation_summary)

Generates summary tables showing mean ± std for each configuration, plus the delta from baseline.

def run_all_ablations():
    """Run all ablation experiments."""

    # Ablation seeds (3 seeds for statistical validity)
    ABLATION_SEEDS = [42, 123, 456]

    # Datasets for ablation (focus on best performers)
    ABLATION_DATASETS = ['FD002', 'FD004']

    all_results = {}

    # Calculate total runs: (baseline + ablations) x datasets x seeds
    total_runs = (1 + len(ABLATION_CONFIGS)) * len(ABLATION_DATASETS) * len(ABLATION_SEEDS)

    # Run baseline first (empty 'changes' means pure V7_BASELINE_CONFIG)
    print(">>> Running V7 Baseline...")
    baseline_config = {
        'name': 'V7 Baseline',
        'description': 'Full V7 configuration',
        'changes': {}
    }

    # NOTE: `output_dir` is not defined in this excerpt; it must be set
    # elsewhere in the script.
    for dataset in ABLATION_DATASETS:
        all_results[f'baseline_{dataset}'] = []
        for seed in ABLATION_SEEDS:
            result = train_with_ablation(dataset, seed, baseline_config, output_dir)
            all_results[f'baseline_{dataset}'].append(result)

    # Run each ablation on every dataset and seed
    for ablation_key, ablation_config in ABLATION_CONFIGS.items():
        for dataset in ABLATION_DATASETS:
            all_results[f'{ablation_key}_{dataset}'] = []
            for seed in ABLATION_SEEDS:
                result = train_with_ablation(dataset, seed, ablation_config, output_dir)
                all_results[f'{ablation_key}_{dataset}'].append(result)

    # Generate summary
    generate_ablation_summary(all_results)
    return all_results
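The body of `generate_ablation_summary` is not shown, so the following is only a sketch of the kind of aggregation it describes (mean ± std per configuration and delta from baseline). It assumes each run result is a dict with an `'rmse'` entry and that keys follow the `<config>_<dataset>` pattern used above; the function names and the per-seed numbers are placeholders, not real results.

```python
from statistics import mean, stdev


def summarize(all_results, metric="rmse"):
    """Return {run_key: (mean, sample std)} for one metric across seeds."""
    summary = {}
    for key, runs in all_results.items():
        values = [run[metric] for run in runs]
        summary[key] = (mean(values), stdev(values))
    return summary


def delta_vs_baseline(summary):
    """Percent improvement of each entry over its dataset's baseline mean."""
    deltas = {}
    for key, (avg, _) in summary.items():
        dataset = key.rsplit("_", 1)[1]  # 'amnl_50_50_FD002' -> 'FD002'
        base_avg = summary[f"baseline_{dataset}"][0]
        deltas[key] = (base_avg - avg) / base_avg * 100.0
    return deltas


# Placeholder per-seed results (made-up numbers, three seeds each).
all_results = {
    "baseline_FD002": [{"rmse": 10.0}, {"rmse": 9.0}, {"rmse": 11.0}],
    "amnl_50_50_FD002": [{"rmse": 8.0}, {"rmse": 7.0}, {"rmse": 9.0}],
}

summary = summarize(all_results)
deltas = delta_vs_baseline(summary)

# Run count from the text: (1 baseline + 9 ablations) x 2 datasets x 3 seeds.
total_runs = (1 + 9) * 2 * 3

for key, (avg, std) in summary.items():
    print(f"{key}: {avg:.2f} ± {std:.2f}  ({deltas[key]:+.1f}% vs baseline)")
```

With only three seeds per configuration the sample standard deviation is a rough estimate, which is worth keeping in mind when reading the mean ± std tables.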

Summary

Task Weight Analysis Summary:

Conventional wisdom fails: Giving the primary task a higher weight is not optimal for RUL prediction
Equal weighting wins: 0.5/0.5 outperforms all asymmetric schemes tested
Improvement magnitude: Up to 28.7% lower RMSE than the 0.75/0.25 baseline (FD002); 3.0% on FD004
Near-monotonic trend: RMSE falls as the health weight rises toward 0.5, with one exception (0.7/0.3 on FD004 is worse than the 0.75/0.25 baseline); NASA score improves at every step on both datasets
Three hypotheses: Regularization, gradient balance, implicit curriculum

Key Finding	Evidence
0.5/0.5 is optimal	Best RMSE and NASA score on both datasets tested (FD002, FD004)
Asymmetric hurts	0.5/0.5 RMSE is 40.0% lower than 0.9/0.1 on FD002 (23.5% on FD004)
Statistically robust	p < 0.01 vs 0.75/0.25, p < 0.001 vs 0.9/0.1, p = 0.034 vs 0.6/0.4
Works across complexity	Both FD002 and FD004 show the same pattern

Key Insight: The success of equal weighting challenges a common assumption in multi-task learning. For predictive maintenance, the auxiliary health classification task is not merely "supportive"; it provides regularization that helps the model learn generalizable degradation features. The next section examines what happens when we remove the health task entirely.

With weight analysis complete, we examine the catastrophic failure of single-task learning.