Learning Objectives
By the end of this section, you will:
- Understand each training component in the AMNL pipeline
- Analyze EMA's contribution to model stability
- Evaluate learning rate warmup impact
- Compare scheduler strategies for RUL prediction
- Assess adaptive weight decay effectiveness
Key Finding: Each training component contributes measurably to AMNL's performance. Exponential Moving Average (EMA) provides 8-12% improvement, warmup adds 5-8%, and adaptive weight decay contributes 3-6%. While individually modest, these components compound to significant gains.
Component Overview
AMNL incorporates several training techniques beyond the core architecture.
Training Pipeline Components
| Component | Purpose | When Active |
|---|---|---|
| EMA | Smooth weight updates for stability | During training and inference |
| LR Warmup | Prevent early training instability | First 10 epochs |
| ReduceLROnPlateau | Adaptive learning rate reduction | When validation plateaus |
| Adaptive Weight Decay | Reduce regularization as training progresses | After epoch 100 |
Baseline Configuration
```python
V7_BASELINE_CONFIG = {
    # Core architecture
    'hidden_size': 256,
    'num_layers': 2,
    'dropout': 0.2,  # 0.3 for FD001/FD003

    # AMNL weights
    'amnl_weight_rul': 0.5,
    'amnl_weight_health': 0.5,

    # Training components
    'use_ema': True,
    'ema_decay': 0.999,
    'use_warmup': True,
    'warmup_epochs': 10,
    'scheduler_type': 'reduce_on_plateau',
    'use_adaptive_weight_decay': True,
    'initial_weight_decay': 1e-4,
}
```
Ablation Strategy
Each ablation removes one component while keeping all others active. This isolates the contribution of each component and reveals any interactions between them.
EMA Ablation
Exponential Moving Average maintains a smoothed version of model weights for inference.
EMA Formulation
$$\theta_{\text{EMA}}^{(t)} = \alpha\,\theta_{\text{EMA}}^{(t-1)} + (1 - \alpha)\,\theta^{(t)}$$

where $\alpha$ is the decay factor (0.999 here) and $\theta^{(t)}$ are the current model weights.
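As a scalar illustration of this update rule (the constant "target" weight and the step count here are hypothetical, chosen to show the smoothing timescale):

```python
# Scalar sketch of the EMA update: shadow <- a*shadow + (1 - a)*weight.
# The constant "target" weight and the step count are hypothetical.
decay = 0.999
shadow = 0.0   # EMA starts at the initial weight value
target = 1.0   # pretend the raw weight jumps to 1.0 and stays there

for _ in range(1000):
    shadow = decay * shadow + (1 - decay) * target

# After n steps the EMA has closed the gap by 1 - decay**n, so the
# effective averaging window is roughly 1 / (1 - decay) = 1000 steps.
print(f"shadow after 1000 steps: {shadow:.4f}")  # ~0.632
```

With `decay=0.999` the EMA therefore averages over roughly the last thousand optimizer steps, which is why a single noisy update barely moves the inference weights.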
EMA Ablation Results
| Dataset | With EMA | Without EMA | Degradation |
|---|---|---|---|
| FD001 | 10.43 | 11.21 | +7.5% |
| FD002 | 6.74 | 7.56 | +12.2% |
| FD003 | 9.51 | 10.27 | +8.0% |
| FD004 | 8.16 | 9.14 | +12.0% |
EMA Implementation
```python
import torch
import torch.nn as nn


class ExponentialMovingAverage:
    """
    Maintains EMA of model parameters.

    Usage:
        ema = ExponentialMovingAverage(model, decay=0.999)
        for batch in train_loader:
            optimizer.step()
            ema.update(model)

        # For evaluation
        ema.apply_shadow(model)
        evaluate(model)
        ema.restore(model)
    """

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {}
        self.backup = {}

        # Initialize shadow weights
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model: nn.Module):
        """Update EMA weights after each training step."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = (
                    self.decay * self.shadow[name] +
                    (1 - self.decay) * param.data
                )

    def apply_shadow(self, model: nn.Module):
        """Replace model weights with EMA weights for inference."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data = self.shadow[name]

    def restore(self, model: nn.Module):
        """Restore original weights after inference."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data = self.backup[name]
```
Learning Rate Warmup
Warmup gradually increases the learning rate during early training.
Warmup Schedule
$$\eta_e = \eta_{\text{base}} \cdot \frac{e + 1}{E_{\text{warmup}}} \;\; \text{for } e < E_{\text{warmup}}, \qquad \eta_e = \eta_{\text{base}} \;\; \text{otherwise}$$

with $E_{\text{warmup}} = 10$ epochs and $\eta_{\text{base}} = 0.001$.
Warmup Ablation Results
| Dataset | With Warmup | Without Warmup | Degradation |
|---|---|---|---|
| FD001 | 10.43 | 11.02 | +5.7% |
| FD002 | 6.74 | 7.28 | +8.0% |
| FD003 | 9.51 | 10.12 | +6.4% |
| FD004 | 8.16 | 8.78 | +7.6% |
Why Warmup Helps
- Initialization sensitivity: Random initialization means early gradients may be noisy or poorly scaled
- Large early updates: Without warmup, large gradients can push weights to poor regions
- Dual-task interaction: The health and RUL heads may initially conflict; warmup allows gradual adaptation
Critical for Multi-Task
Warmup is especially important for multi-task learning. Without it, the two task heads can initially pull the shared encoder in conflicting directions, leading to unstable early training.
Warmup Implementation
```python
import torch


def apply_warmup(
    optimizer: torch.optim.Optimizer,
    epoch: int,
    warmup_epochs: int = 10,
    base_lr: float = 0.001
):
    """
    Apply linear warmup to learning rate.

    Args:
        optimizer: The optimizer to modify
        epoch: Current epoch (0-indexed)
        warmup_epochs: Number of warmup epochs
        base_lr: Target learning rate after warmup
    """
    if epoch < warmup_epochs:
        warmup_factor = (epoch + 1) / warmup_epochs
        current_lr = base_lr * warmup_factor
    else:
        current_lr = base_lr

    for param_group in optimizer.param_groups:
        param_group['lr'] = current_lr

    return current_lr


# In training loop
for epoch in range(num_epochs):
    # Apply warmup (first 10 epochs)
    current_lr = apply_warmup(optimizer, epoch)

    for batch in train_loader:
        # Training step...
        pass
```
Scheduler Comparison
Comparing ReduceLROnPlateau (adaptive) with StepLR (fixed schedule).
Scheduler Configurations
| Scheduler | Strategy | Parameters |
|---|---|---|
| ReduceLROnPlateau | Reduce when validation plateaus | factor=0.5, patience=30 |
| StepLR | Reduce at fixed intervals | step_size=50, gamma=0.5 |
Scheduler Ablation Results
| Dataset | ReduceLROnPlateau | StepLR | Difference |
|---|---|---|---|
| FD001 | 10.43 | 10.89 | +4.4% |
| FD002 | 6.74 | 7.12 | +5.6% |
| FD003 | 9.51 | 9.87 | +3.8% |
| FD004 | 8.16 | 8.62 | +5.6% |
Scheduler Implementation
```python
import torch


def create_scheduler(
    optimizer: torch.optim.Optimizer,
    scheduler_type: str = 'reduce_on_plateau'
):
    """
    Create learning rate scheduler.

    Args:
        optimizer: The optimizer
        scheduler_type: 'reduce_on_plateau' or 'step'

    Returns:
        Configured scheduler
    """
    if scheduler_type == 'reduce_on_plateau':
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            mode='min',    # Minimize validation loss
            factor=0.5,    # Halve learning rate
            patience=30,   # Wait 30 epochs before reducing
            min_lr=5e-6    # Don't reduce below this
        )
    else:  # 'step'
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size=50,  # Reduce every 50 epochs
            gamma=0.5      # Halve learning rate
        )

    return scheduler


# Usage in training loop
if scheduler_type == 'reduce_on_plateau':
    scheduler.step(val_loss)  # Pass validation metric
else:
    scheduler.step()  # No metric needed
```
Adaptive Weight Decay
Weight decay is reduced during training to allow fine-tuning in later epochs.
Adaptive Schedule

$$\lambda(e) = \begin{cases} 10^{-4} & e \le 100 \\ 5 \times 10^{-5} & 100 < e \le 200 \\ 10^{-5} & e > 200 \end{cases}$$
Weight Decay Ablation Results
| Dataset | Adaptive WD | Fixed WD | Degradation |
|---|---|---|---|
| FD001 | 10.43 | 10.78 | +3.4% |
| FD002 | 6.74 | 7.12 | +5.6% |
| FD003 | 9.51 | 9.89 | +4.0% |
| FD004 | 8.16 | 8.61 | +5.5% |
Rationale
- Early training (epochs 1-100): Strong regularization prevents overfitting while learning coarse features
- Mid training (epochs 101-200): Moderate regularization allows refinement while maintaining stability
- Late training (epochs 201+): Minimal regularization enables fine-grained adjustments
Connection to Learning Rate
Adaptive weight decay complements the learning rate schedule. As learning rate decreases (via scheduler), weight decay also decreases. This maintains a consistent ratio of learning signal to regularization throughout training.
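A minimal sketch of this ratio argument, using hypothetical snapshot values (it assumes the plateau scheduler has halved the learning rate once by the time the weight-decay schedule takes its first step down):

```python
# Hypothetical snapshots of (learning rate, weight decay) in two phases.
# lr values assume one halving by the plateau scheduler; wd values follow
# the adaptive schedule (1e-4 for epochs <= 100, 5e-5 for epochs 101-200).
early = {'lr': 1e-3, 'wd': 1e-4}   # early training
mid = {'lr': 5e-4, 'wd': 5e-5}     # mid training, after one LR halving

ratio_early = early['lr'] / early['wd']
ratio_mid = mid['lr'] / mid['wd']

# Both ratios are ~10: the learning signal and the regularization
# strength shrink together, keeping their balance roughly constant.
print(ratio_early, ratio_mid)
```

If weight decay stayed fixed while the learning rate fell, this ratio would shrink and regularization would increasingly dominate the updates, which is the imbalance the adaptive schedule avoids.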
Implementation
```python
import torch


def update_weight_decay(
    optimizer: torch.optim.Optimizer,
    epoch: int,
    use_adaptive: bool = True
):
    """
    Update weight decay based on training progress.

    Args:
        optimizer: The optimizer
        epoch: Current epoch (0-indexed)
        use_adaptive: Whether to use adaptive schedule
    """
    if not use_adaptive:
        return  # Keep fixed weight decay

    if epoch <= 100:
        weight_decay = 1e-4
    elif epoch <= 200:
        weight_decay = 5e-5
    else:
        weight_decay = 1e-5

    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = weight_decay


# Full training loop with all components
for epoch in range(num_epochs):
    # 1. Apply warmup (early epochs)
    apply_warmup(optimizer, epoch)

    # 2. Update weight decay (adaptive)
    update_weight_decay(optimizer, epoch)

    # 3. Training step
    model.train()
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch_x, batch_y)
        loss.backward()
        optimizer.step()
        ema.update(model)  # 4. EMA update

    # 5. Validation
    val_loss = evaluate(model, val_loader)

    # 6. Scheduler step
    scheduler.step(val_loss)
```
Summary
Training Component Analysis Summary:
- EMA: 8-12% improvement, most impactful on multi-condition datasets
- Warmup: 5-8% improvement, critical for multi-task stability
- Adaptive scheduler: 4-6% improvement over fixed schedule
- Adaptive weight decay: 3-6% improvement, enables fine-tuning
- Compound effect: Together, these components provide substantial cumulative gains
Component Importance Ranking
| Rank | Component | Average Impact | Critical For |
|---|---|---|---|
| 1 | EMA | ~10% | Prediction stability |
| 2 | Warmup | ~7% | Multi-task convergence |
| 3 | Adaptive Scheduler | ~5% | Optimal learning dynamics |
| 4 | Adaptive Weight Decay | ~4.5% | Late-stage fine-tuning |
Synergistic Effects
The components interact synergistically. For example, EMA benefits more when warmup is active (stable early training leads to better EMA initialization). The total improvement from all components (~26%) falls slightly below the sum of individual improvements (~26.5%), suggesting some overlap between components while confirming that each adds value.
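This additive-versus-compounded comparison can be sanity-checked with quick arithmetic, using the approximate average impacts from the ranking table:

```python
# Approximate average per-component impacts from the ranking table above.
impacts = {'ema': 0.10, 'warmup': 0.07, 'scheduler': 0.05, 'weight_decay': 0.045}

# Naive additive estimate: sum the individual improvements.
additive = sum(impacts.values())

# Multiplicative estimate: apply each relative improvement in sequence.
remaining_error = 1.0
for gain in impacts.values():
    remaining_error *= (1 - gain)
compounded = 1 - remaining_error

print(f"additive: {additive:.1%}, compounded: {compounded:.1%}")
# additive: 26.5%, compounded: 24.1%
```

The compounded estimate (~24.1%) and the plain sum (~26.5%) bracket the reported total of ~26%, consistent with the components being mostly, but not fully, independent.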
Key Insight: While each training component provides modest individual improvements (3-12%), they compound to significant cumulative gains. These "engineering" choices are not mere implementation details—they represent principled decisions that enable the core AMNL architecture to reach its full potential.
With architecture components analyzed, we conclude the ablation studies with loss function comparisons.