AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand each training component in the AMNL pipeline
Analyze EMA's contribution to model stability
Evaluate learning rate warmup impact
Compare scheduler strategies for RUL prediction
Assess adaptive weight decay effectiveness

Key Finding: Each training component contributes measurably to AMNL's performance. Exponential Moving Average (EMA) provides 8-12% improvement, warmup adds 5-8%, and adaptive weight decay contributes 3-6%. While individually modest, these components compound to significant gains.

Component Overview

AMNL incorporates several training techniques beyond the core architecture.

Training Pipeline Components

Component	Purpose	When Active
EMA	Smooth weight updates for stability	During training and inference
LR Warmup	Prevent early training instability	First 10 epochs
ReduceLROnPlateau	Adaptive learning rate reduction	When validation plateaus
Adaptive Weight Decay	Reduce regularization as training progresses	After epoch 100

Baseline Configuration

🐍python

V7_BASELINE_CONFIG = {
    # Core architecture
    'hidden_size': 256,
    'num_layers': 2,
    'dropout': 0.2,  # 0.3 for FD001/FD003

    # AMNL weights
    'amnl_weight_rul': 0.5,
    'amnl_weight_health': 0.5,

    # Training components
    'use_ema': True,
    'ema_decay': 0.999,
    'use_warmup': True,
    'warmup_epochs': 10,
    'scheduler_type': 'reduce_on_plateau',
    'use_adaptive_weight_decay': True,
    'initial_weight_decay': 1e-4,
}

Ablation Strategy

Each ablation removes one component while keeping all others active. This isolates the contribution of each component and reveals any interactions between them.
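The one-at-a-time strategy can be sketched as a small config generator. This is a minimal sketch, not the study's actual harness: the `make_ablation_configs` helper and the ablation names are hypothetical, the baseline dictionary is trimmed to the component switches from the baseline configuration, and the `'step'` value is the alternative scheduler type used in the scheduler comparison.

```python
from copy import deepcopy

# Trimmed baseline: only the training-component switches (assumption for brevity)
BASELINE = {
    'use_ema': True,
    'use_warmup': True,
    'scheduler_type': 'reduce_on_plateau',
    'use_adaptive_weight_decay': True,
}

# Hypothetical ablation names, each overriding exactly one key
ABLATIONS = {
    'no_ema': {'use_ema': False},
    'no_warmup': {'use_warmup': False},
    'step_scheduler': {'scheduler_type': 'step'},
    'fixed_weight_decay': {'use_adaptive_weight_decay': False},
}


def make_ablation_configs(baseline, ablations):
    """Return one config per ablation, identical to baseline except for one override."""
    configs = {}
    for name, override in ablations.items():
        cfg = deepcopy(baseline)
        cfg.update(override)
        configs[name] = cfg
    return configs


configs = make_ablation_configs(BASELINE, ABLATIONS)
for name, cfg in configs.items():
    changed = [key for key in BASELINE if cfg[key] != BASELINE[key]]
    print(f"{name}: changed {changed}")
```

Because every run differs from the baseline in a single key, any change in the score can be attributed to that component alone.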

EMA Ablation

Exponential Moving Average maintains a smoothed version of model weights for inference.

EMA Formulation

$$\theta_{\text{EMA}}^{(t)} = \beta \cdot \theta_{\text{EMA}}^{(t-1)} + (1 - \beta) \cdot \theta^{(t)}$$

where $\beta = 0.999$ is the decay factor and $\theta^{(t)}$ are the current model weights.
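To get a feel for what a decay of 0.999 means, the recursion can be unrolled: the weights from k steps ago carry a coefficient of (1 - β)·β^k, so the average looks back over roughly 1 / (1 - β) = 1000 optimizer steps. The sketch below checks this numerically on a scalar stand-in for a single weight. The names `ema_update` and `weight_of_last_n_steps` are illustrative and not part of the AMNL code.

```python
def ema_update(shadow, value, beta=0.999):
    """One EMA step: beta * shadow + (1 - beta) * value."""
    return beta * shadow + (1.0 - beta) * value


def weight_of_last_n_steps(n, beta=0.999):
    """Total EMA coefficient carried by the most recent n updates: 1 - beta**n."""
    return 1.0 - beta ** n


# Scalar stand-in: the shadow starts at 0.0 while the raw weight sits at 1.0
shadow = 0.0
for _ in range(1000):
    shadow = ema_update(shadow, 1.0)

print(f"shadow after 1000 steps: {shadow:.3f}")
print(f"coefficient mass of the last 1000 steps: {weight_of_last_n_steps(1000):.3f}")
```

After 1000 steps the shadow has covered about 63% of the gap to the raw weight, which is why the EMA weights lag the raw weights early in training and smooth over step-to-step noise later.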

EMA Ablation Results

Dataset	With EMA	Without EMA	Degradation
FD001	10.43	11.21	+7.5%
FD002	6.74	7.56	+12.2%
FD003	9.51	10.27	+8.0%
FD004	8.16	9.14	+12.0%
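The Degradation column is the relative increase in the reported score when EMA is removed, computed as (without - with) / with. A small helper, named `degradation_pct` here purely for illustration, reproduces the column from the two score columns of the table:

```python
def degradation_pct(with_component, without_component):
    """Relative score increase (in percent) when a component is removed."""
    return 100.0 * (without_component - with_component) / with_component


# (with EMA, without EMA) scores copied from the table above
ema_results = {
    'FD001': (10.43, 11.21),
    'FD002': (6.74, 7.56),
    'FD003': (9.51, 10.27),
    'FD004': (8.16, 9.14),
}

for dataset, (with_ema, without_ema) in ema_results.items():
    print(f"{dataset}: +{degradation_pct(with_ema, without_ema):.1f}%")
```

The same helper applies unchanged to the warmup, scheduler, and weight decay tables in this section.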

EMA Implementation

🐍python

import torch
import torch.nn as nn


class ExponentialMovingAverage:
    """
    Maintains an EMA of model parameters.

    Usage:
        ema = ExponentialMovingAverage(model, decay=0.999)
        for batch in train_loader:
            optimizer.step()
            ema.update(model)

        # For evaluation
        ema.apply_shadow(model)
        evaluate(model)
        ema.restore(model)
    """

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {}
        self.backup = {}

        # Initialize shadow weights as detached copies of the trainable parameters
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.detach().clone()

    @torch.no_grad()
    def update(self, model: nn.Module):
        """Update EMA weights after each optimizer step."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                # shadow <- decay * shadow + (1 - decay) * param, in place
                self.shadow[name].mul_(self.decay).add_(
                    param.detach(), alpha=1 - self.decay
                )

    @torch.no_grad()
    def apply_shadow(self, model: nn.Module):
        """Swap in EMA weights for inference, backing up the raw weights."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.detach().clone()
                # copy_ avoids aliasing the shadow tensor with the live parameter
                param.copy_(self.shadow[name])

    @torch.no_grad()
    def restore(self, model: nn.Module):
        """Restore the raw training weights after inference."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.copy_(self.backup[name])
        self.backup = {}

Learning Rate Warmup

Warmup gradually increases the learning rate during early training.

Warmup Schedule

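$$\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t + 1}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \eta_{\max} & \text{otherwise} \end{cases}$$

Here $t$ is the 0-indexed epoch, $T_{\text{warmup}} = 10$ epochs, and $\eta_{\max} = 0.001$. The $t + 1$ in the numerator means the first epoch trains at $\eta_{\max} / 10$ rather than at a learning rate of zero.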
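As a worked example, the warmup learning rates can be listed epoch by epoch. The sketch assumes 0-indexed epochs and the (epoch + 1) / T_warmup ramp used by the implementation in this section, so training starts at one tenth of the target rate. The name `warmup_lr` is illustrative.

```python
def warmup_lr(epoch, warmup_epochs=10, max_lr=0.001):
    """Linear warmup: ramp from max_lr / warmup_epochs up to max_lr, then hold."""
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs
    return max_lr


# Epochs 0-9 ramp linearly; from epoch 10 on the rate stays at max_lr
for epoch in range(12):
    print(f"epoch {epoch:2d}: lr = {warmup_lr(epoch):.4f}")
```

The ramp reaches the full rate in the last warmup epoch (epoch 9), so there is no jump when warmup ends.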

Warmup Ablation Results

Dataset	With Warmup	Without Warmup	Degradation
FD001	10.43	11.02	+5.7%
FD002	6.74	7.28	+8.0%
FD003	9.51	10.12	+6.4%
FD004	8.16	8.78	+7.6%

Why Warmup Helps

Initialization sensitivity: Random initialization means early gradients may be noisy or poorly scaled
Large early updates: Without warmup, large gradients can push weights to poor regions
Dual-task interaction: The health and RUL heads may initially conflict; warmup allows gradual adaptation

Critical for Multi-Task

Warmup is especially important for multi-task learning. Without it, the two task heads can initially pull the shared encoder in conflicting directions, leading to unstable early training.

Warmup Implementation

🐍python

import torch


def apply_warmup(
    optimizer: torch.optim.Optimizer,
    epoch: int,
    warmup_epochs: int = 10,
    base_lr: float = 0.001
) -> float:
    """
    Apply linear warmup to the learning rate.

    After warmup the optimizer is left untouched, so reductions made by a
    scheduler such as ReduceLROnPlateau are not overwritten.

    Args:
        optimizer: The optimizer to modify
        epoch: Current epoch (0-indexed)
        warmup_epochs: Number of warmup epochs
        base_lr: Target learning rate after warmup

    Returns:
        The learning rate in effect for this epoch
    """
    if epoch < warmup_epochs:
        current_lr = base_lr * (epoch + 1) / warmup_epochs
        for param_group in optimizer.param_groups:
            param_group['lr'] = current_lr
    else:
        # Warmup is over: leave the learning rate to the scheduler
        current_lr = optimizer.param_groups[0]['lr']

    return current_lr


# In training loop (optimizer, num_epochs, train_loader defined elsewhere)
for epoch in range(num_epochs):
    # Ramps the LR for the first 10 epochs, then becomes a no-op
    current_lr = apply_warmup(optimizer, epoch)

    for batch in train_loader:
        # Training step...
        pass

Scheduler Comparison

This ablation compares ReduceLROnPlateau (adaptive) with StepLR (fixed schedule).

Scheduler Configurations

Scheduler	Strategy	Parameters
ReduceLROnPlateau	Reduce when validation plateaus	factor=0.5, patience=30
StepLR	Reduce at fixed intervals	step_size=50, gamma=0.5
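The difference between the two strategies is easiest to see by tracing the learning rate over time. The sketch below is a simplified pure-Python model of both schedulers, not PyTorch's implementation: `step_lr` and `plateau_lr_trace` are illustrative names, the plateau model ignores PyTorch's improvement threshold and cooldown options, and the base rate of 0.001 and floor of 5e-6 are taken from the configuration used elsewhere in this section.

```python
def step_lr(epoch, base_lr=1e-3, step_size=50, gamma=0.5):
    """Fixed schedule: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def plateau_lr_trace(val_losses, base_lr=1e-3, factor=0.5, patience=30, min_lr=5e-6):
    """Simplified plateau rule: cut the rate once more than `patience`
    consecutive epochs pass without a new best validation loss."""
    lr = base_lr
    best = float("inf")
    bad_epochs = 0
    trace = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        trace.append(lr)
    return trace


# StepLR halves at epochs 50, 100, 150 regardless of validation behaviour
print([step_lr(e) for e in (0, 49, 50, 100, 150)])

# A validation loss that improves for 20 epochs and then stalls
val_losses = [1.0 - 0.01 * e for e in range(20)] + [0.81] * 60
trace = plateau_lr_trace(val_losses)
first_cut = next(i for i, lr in enumerate(trace) if lr < 1e-3)
print(f"plateau rule first reduces the rate at epoch index {first_cut}")
```

The fixed schedule cuts the rate at epoch 50 even if the model is still improving, while the plateau rule waits until progress has actually stalled for more than 30 epochs.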

Scheduler Ablation Results

Dataset	ReduceLROnPlateau	StepLR	Difference
FD001	10.43	10.89	+4.4%
FD002	6.74	7.12	+5.6%
FD003	9.51	9.87	+3.8%
FD004	8.16	8.62	+5.6%

Scheduler Implementation

🐍python

from typing import Union

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR


def create_scheduler(
    optimizer: torch.optim.Optimizer,
    scheduler_type: str = 'reduce_on_plateau'
) -> Union[ReduceLROnPlateau, StepLR]:
    """
    Create learning rate scheduler.

    Args:
        optimizer: The optimizer
        scheduler_type: 'reduce_on_plateau' or 'step'

    Returns:
        Configured scheduler
    """
    if scheduler_type == 'reduce_on_plateau':
        scheduler = ReduceLROnPlateau(
            optimizer,
            mode='min',      # Minimize validation loss
            factor=0.5,      # Halve learning rate
            patience=30,     # Tolerate 30 epochs without improvement
            min_lr=5e-6      # Don't reduce below this
        )
    elif scheduler_type == 'step':
        scheduler = StepLR(
            optimizer,
            step_size=50,    # Reduce every 50 epochs
            gamma=0.5        # Halve learning rate
        )
    else:
        raise ValueError(f"Unknown scheduler_type: {scheduler_type!r}")

    return scheduler


# Usage in training loop, once per epoch after validation
if scheduler_type == 'reduce_on_plateau':
    scheduler.step(val_loss)  # Pass validation metric
else:
    scheduler.step()  # No metric needed

Adaptive Weight Decay

Weight decay is reduced during training to allow fine-tuning in later epochs.

Adaptive Schedule

$$\lambda(t) = \begin{cases} 10^{-4} & \text{if } t \leq 100 \\ 5 \times 10^{-5} & \text{if } 100 < t \leq 200 \\ 10^{-5} & \text{if } t > 200 \end{cases}$$

where $t$ is the epoch number, counted from 1.

Weight Decay Ablation Results

Dataset	Adaptive WD	Fixed WD	Degradation
FD001	10.43	10.78	+3.4%
FD002	6.74	7.12	+5.6%
FD003	9.51	9.89	+4.0%
FD004	8.16	8.61	+5.5%

Rationale

Early training (epochs 1-100): Strong regularization prevents overfitting while learning coarse features
Mid training (epochs 101-200): Moderate regularization allows refinement while maintaining stability
Late training (epochs 201+): Minimal regularization enables fine-grained adjustments

Connection to Learning Rate

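Adaptive weight decay complements the learning rate schedule: as the scheduler lowers the learning rate, weight decay is lowered as well, so regularization does not come to dominate the shrinking update signal late in training. Because weight decay follows fixed epoch thresholds while ReduceLROnPlateau reacts to validation loss, the ratio of learning signal to regularization stays roughly, not exactly, constant.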

Implementation

🐍python

import torch


def update_weight_decay(
    optimizer: torch.optim.Optimizer,
    epoch: int,
    use_adaptive: bool = True
):
    """
    Update weight decay based on training progress.

    Args:
        optimizer: The optimizer
        epoch: Current epoch (0-indexed)
        use_adaptive: Whether to use adaptive schedule
    """
    if not use_adaptive:
        return  # Keep fixed weight decay

    # epoch is 0-indexed, so epoch < 100 covers epochs 1-100 of the schedule
    if epoch < 100:
        weight_decay = 1e-4
    elif epoch < 200:
        weight_decay = 5e-5
    else:
        weight_decay = 1e-5

    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = weight_decay


# Full training loop with all components
# (model, optimizer, scheduler, ema, data loaders, compute_loss, and
# evaluate are assumed to be defined elsewhere)
warmup_epochs = 10
for epoch in range(num_epochs):
    # 1. Apply warmup only during the early epochs, so that later
    #    scheduler reductions are not reset to the base learning rate
    if epoch < warmup_epochs:
        apply_warmup(optimizer, epoch, warmup_epochs)

    # 2. Update weight decay (adaptive)
    update_weight_decay(optimizer, epoch)

    # 3. Training step
    model.train()
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch_x, batch_y)
        loss.backward()
        optimizer.step()
        ema.update(model)  # 4. EMA update

    # 5. Validation with the EMA weights, then restore the raw weights
    ema.apply_shadow(model)
    val_loss = evaluate(model, val_loader)
    ema.restore(model)

    # 6. Scheduler step
    scheduler.step(val_loss)

Summary

Training Component Analysis Summary:

EMA: 8-12% improvement, most impactful on multi-condition datasets
Warmup: 5-8% improvement, critical for multi-task stability
Adaptive scheduler: 4-6% improvement over fixed schedule
Adaptive weight decay: 3-6% improvement, enables fine-tuning
Compound effect: Together, these components provide substantial cumulative gains

Component Importance Ranking

Rank	Component	Average Impact	Critical For
1	EMA	~10%	Prediction stability
2	Warmup	~7%	Multi-task convergence
3	Adaptive Scheduler	~5%	Optimal learning dynamics
4	Adaptive Weight Decay	~4.5%	Late-stage fine-tuning
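The Average Impact column can be reproduced by averaging the per-dataset percentages (FD001 to FD004) from the four ablation tables in this section. The sketch below does that and re-derives the ranking. The variable names are illustrative, and the percentages are copied from the tables rather than recomputed from raw runs.

```python
from statistics import mean

# Per-dataset degradation (%) for FD001-FD004, copied from the ablation tables
degradation = {
    'EMA': [7.5, 12.2, 8.0, 12.0],
    'Warmup': [5.7, 8.0, 6.4, 7.6],
    'Adaptive Scheduler': [4.4, 5.6, 3.8, 5.6],
    'Adaptive Weight Decay': [3.4, 5.6, 4.0, 5.5],
}

averages = {name: mean(values) for name, values in degradation.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {averages[name]:.1f}%")

print(f"sum of average impacts: {sum(averages.values()):.1f}%")
```

The averages come out near 9.9%, 6.9%, 4.9%, and 4.6%, matching the rounded values in the table, and they sum to roughly 26%.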

Component Interactions

The components interact rather than acting independently. For example, EMA benefits more when warmup is active, since stable early training gives the EMA a better starting point. The total improvement from all components (~26%) is slightly below the sum of the individual improvements (~26.5%), which suggests some overlap between components while confirming that each one adds value.

Key Insight: While each training component provides modest individual improvements (3-12%), they compound to significant cumulative gains. These "engineering" choices are not mere implementation details—they represent principled decisions that enable the core AMNL architecture to reach its full potential.

With the training components analyzed, we conclude the ablation studies with loss function comparisons.