Learning Objectives
By the end of this section, you will:
- Understand the Adam optimizer and its momentum mechanisms
- Explain the difference between Adam and AdamW in weight decay handling
- Configure AdamW hyperparameters for RUL prediction
- Recognize common pitfalls in optimizer configuration
- Implement AdamW with proper initialization
Why This Matters: The optimizer is the engine that drives learning. AdamW (Loshchilov & Hutter, 2019) fixes a subtle but important bug in standard Adam that affects regularization. For deep networks like our CNN-BiLSTM-Attention model, proper optimizer configuration can mean the difference between state-of-the-art results and training failure.
Adam Optimizer Review
Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates.
Core Adam Algorithm
Adam maintains two exponential moving averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Where:
- $g_t$: Gradient at step t
- $\beta_1$: First moment decay (momentum)
- $\beta_2$: Second moment decay (variance)
Bias Correction
Both moments are bias-corrected to offset their zero initialization:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Parameter Update
$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The adaptive learning rate scales inversely with gradient variance: parameters with noisy gradients get smaller updates.
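The full update above can be sketched in plain Python for a single scalar parameter (a minimal illustration of the algorithm, not a production implementation):

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m, v are the running first/second moments; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta=0 with gradient 1.0: bias correction makes
# m_hat = v_hat = 1, so the step size is approximately lr.
theta, m, v = adam_step(0.0, 0.0, 0.0, grad=1.0, t=1)
```

Note how on the first step bias correction exactly cancels the $(1 - \beta)$ factors, so the update magnitude is close to the raw learning rate regardless of the gradient's scale.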
AdamW: Decoupled Weight Decay
AdamW fixes how weight decay interacts with adaptive learning rates.
The Problem with L2 Regularization in Adam
Standard Adam with L2 regularization adds a penalty term to the loss:

$$L_{\text{total}} = L + \frac{\lambda}{2} \|\theta\|^2$$

The gradient becomes:

$$g_t = \nabla_\theta L + \lambda \theta_t$$

Problem: The regularization term is now scaled by Adam's adaptive factor $\eta / (\sqrt{\hat{v}_t} + \epsilon)$. This means weight decay is weaker for parameters with high gradient variance, which is the opposite of the intended behavior.
AdamW Solution: Decoupled Weight Decay
AdamW applies weight decay directly to parameters, not through the gradient:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

The weight decay term $\lambda \theta_t$ is not scaled by the adaptive factor. This is called "decoupled" weight decay.
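The decoupling is easiest to see in code. Compared with the Adam step earlier, the only change is that the decay term sits outside the adaptive scaling (again a minimal scalar sketch, not a production implementation):

```python
import math

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-4):
    """One AdamW update: weight decay is applied directly to theta."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: wd * theta is NOT divided by sqrt(v_hat) + eps
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# With a zero gradient, the update reduces to a pure decay step:
# theta shrinks by lr * wd * theta, independent of the gradient history.
theta, m, v = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)
```

Because the decay step depends only on `lr`, `wd`, and the parameter value, every parameter is regularized at the same relative rate, which is exactly the uniformity the comparison table below describes.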
Comparison
| Aspect | Adam + L2 | AdamW |
|---|---|---|
| Weight decay | Through gradient | Direct on parameters |
| Adaptive scaling | Affects regularization | Only affects gradient |
| Effective decay | Varies by parameter | Uniform across parameters |
| Recommended | No | Yes |
Use AdamW, Not Adam
For any model with weight decay (which should be all production models), use AdamW instead of Adam. The difference is subtle but significant—AdamW provides more consistent regularization and often better generalization.
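In PyTorch the switch is a one-line change. The snippet below is a configuration sketch using the hyperparameters adopted later in this section; the `Linear` model is a placeholder for illustration:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

# Adam + L2 (not recommended): decay is coupled to the adaptive scaling
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AdamW (recommended): decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)
```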
Hyperparameter Configuration
Proper hyperparameter selection is critical for training success.
Learning Rate
| Model Size | Recommended η | Range to Search |
|---|---|---|
| Small (<1M params) | 1e-3 | [5e-4, 5e-3] |
| Medium (1-10M params) | 1e-3 to 5e-4 | [1e-4, 1e-3] |
| Large (>10M params) | 1e-4 | [5e-5, 5e-4] |
For our 3.5M parameter CNN-BiLSTM-Attention model, we use η = 1e-3.
Beta Parameters
| Parameter | Default | When to Change |
|---|---|---|
| β₁ | 0.9 | Lower (0.85) for very noisy gradients |
| β₂ | 0.999 | Lower (0.99) for non-stationary problems |
| ε | 1e-8 | Increase (1e-6) for mixed precision |
Default Betas Work Well
For most applications, including RUL prediction, the default β₁=0.9 and β₂=0.999 work well. Focus tuning effort on learning rate and weight decay instead.
Weight Decay
Weight decay prevents overfitting by penalizing large weights:
| λ Value | Effect | Use Case |
|---|---|---|
| 0 | No regularization | Very large datasets |
| 1e-5 | Light regularization | Large models, big data |
| 1e-4 | Standard regularization | Most cases (recommended) |
| 1e-3 | Strong regularization | Small datasets, overfitting |
| 1e-2 | Very strong | Extreme overfitting only |
For C-MAPSS with ~20K training samples and a 3.5M parameter model, we use λ = 1e-4.
Implementation
Our research implementation combines AdamW with adaptive weight decay and an aggressive learning rate scheduler.
AMNL Research Configuration
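The full AMNL configuration is not reproduced here, but the parameter-group split noted in the summary (no weight decay on biases and normalization parameters) can be sketched as a name-based filter. The helper name and the `no_decay_keywords` matching rule are illustrative assumptions, not the exact research code:

```python
def split_decay_groups(named_params, weight_decay=1e-4,
                       no_decay_keywords=('bias', 'norm')):
    """Split (name, param) pairs into two optimizer parameter groups:
    weights get weight decay, biases/norm parameters get none."""
    decay, no_decay = [], []
    for name, param in named_params:
        if any(kw in name.lower() for kw in no_decay_keywords):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]
```

The returned list can be passed directly in place of `model.parameters()` when constructing the optimizer, since AdamW accepts a list of parameter groups with per-group options.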
Adaptive Weight Decay
V7 introduces adaptive weight decay that decreases during training to allow late-stage fine-tuning.
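The exact V7 schedule is not shown here; one plausible sketch is a linear ramp from a stronger initial decay to a lighter final value. The helper name matches the training loop below, but the endpoint values and linear shape are illustrative assumptions:

```python
def get_adaptive_weight_decay(epoch, total_epochs=100,
                              wd_initial=1e-4, wd_final=1e-5):
    """Linearly anneal weight decay from wd_initial to wd_final,
    loosening regularization for late-stage fine-tuning."""
    progress = min(epoch / max(total_epochs - 1, 1), 1.0)
    return wd_initial + (wd_final - wd_initial) * progress
```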
Dynamic Update in Training Loop
The weight decay is updated at the start of each epoch by modifying the optimizer's parameter groups. This allows the regularization strength to adapt to the training phase.
```python
# Apply adaptive weight decay in training loop
for epoch in range(epochs):
    current_wd = get_adaptive_weight_decay(epoch)
    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = current_wd

    # ... training code ...
```

Summary
In this section, we configured the AdamW optimizer:
- Adam: Adaptive learning rates via momentum + variance tracking
- AdamW: Decoupled weight decay for proper regularization
- Learning rate: 1e-3 for our 3.5M parameter model
- Weight decay: 1e-4 for balanced regularization
- Parameter groups: No weight decay on biases and norms
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate (η) | 1e-3 |
| β₁ (momentum) | 0.9 |
| β₂ (variance) | 0.999 |
| ε (stability) | 1e-8 |
| Weight decay (λ) | 1e-4 |
Looking Ahead: A constant learning rate often leads to suboptimal convergence. The next section introduces learning rate warmup—a technique that prevents early training instability by gradually increasing the learning rate.
With the optimizer configured, we implement learning rate scheduling.