Chapter 12

AdamW Optimizer Configuration

Optimization Strategy

Learning Objectives

By the end of this section, you will:

  1. Understand the Adam optimizer and its momentum mechanisms
  2. Explain the difference between Adam and AdamW in weight decay handling
  3. Configure AdamW hyperparameters for RUL prediction
  4. Recognize common pitfalls in optimizer configuration
  5. Implement AdamW with proper initialization
Why This Matters: The optimizer is the engine that drives learning. AdamW (Loshchilov & Hutter, 2019) fixes a subtle but important bug in standard Adam that affects regularization. For deep networks like our CNN-BiLSTM-Attention model, proper optimizer configuration can mean the difference between state-of-the-art results and training failure.

Adam Optimizer Review

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates.

Core Adam Algorithm

Adam maintains two exponential moving averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment / momentum)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment / variance)}$$

Where:

  • $g_t = \nabla_\theta \mathcal{L}$: Gradient at step $t$
  • $\beta_1 = 0.9$: First moment decay (momentum)
  • $\beta_2 = 0.999$: Second moment decay (variance)

Bias Correction

Both moments are bias-corrected:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Parameter Update

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The adaptive learning rate $\eta / \sqrt{\hat{v}_t}$ scales inversely with gradient variance—parameters with noisy gradients get smaller updates.
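The three equations above can be collapsed into a few lines of NumPy. This is a minimal single-step sketch for illustration (the function name `adam_step` is ours, not PyTorch's internal implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta, given its gradient at step t."""
    m = beta1 * m + (1 - beta1) * grad      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (variance)
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

At step $t=1$ the bias correction exactly undoes the $(1-\beta)$ damping, so the first update has magnitude close to the full learning rate regardless of gradient scale.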


AdamW: Decoupled Weight Decay

AdamW fixes how weight decay interacts with adaptive learning rates.

The Problem with L2 Regularization in Adam

Standard Adam with L2 regularization adds penalty to the loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L} + \frac{\lambda}{2}\|\theta\|^2$$

The gradient becomes:

$$g_t = \nabla_\theta \mathcal{L} + \lambda \theta$$

Problem: The regularization term $\lambda \theta$ is now scaled by Adam's adaptive learning rate $1/\sqrt{\hat{v}_t}$. This means weight decay is weaker for parameters with high gradient variance—the opposite of the intended behavior.

AdamW Solution: Decoupled Weight Decay

AdamW applies weight decay directly to parameters, not through the gradient:

θt+1=θtηm^tv^t+ϵηλθt\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_t

The weight decay term $\eta \lambda \theta_t$ is not scaled by the adaptive factor. This is called "decoupled" weight decay.
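A tiny numerical sketch (toy numbers of our own, with the loss gradient set to zero to isolate the decay term) makes the difference concrete: coupled L2 decay shrinks as gradient variance grows, while AdamW's decoupled decay is identical for every parameter.

```python
import numpy as np

lr, wd, eps = 1e-3, 1e-2, 1e-8
theta = 1.0

def coupled_decay_step(v_hat):
    """Adam + L2: the penalty gradient lambda*theta passes through 1/sqrt(v_hat)."""
    g = wd * theta                       # gradient of the L2 penalty
    return lr * g / (np.sqrt(v_hat) + eps)

def decoupled_decay_step():
    """AdamW: decay applied directly, independent of gradient variance."""
    return lr * wd * theta

print(coupled_decay_step(1e-4))   # low-variance parameter: strong decay
print(coupled_decay_step(1.0))    # high-variance parameter: ~100x weaker decay
print(decoupled_decay_step())     # same for every parameter
```

With these numbers the coupled decay step differs by two orders of magnitude between the two parameters, while the decoupled step does not vary at all.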

Comparison

| Aspect | Adam + L2 | AdamW |
|---|---|---|
| Weight decay | Through gradient | Direct on parameters |
| Adaptive scaling | Affects regularization | Only affects gradient |
| Effective decay | Varies by parameter | Uniform across parameters |
| Recommended | No | Yes |

Use AdamW, Not Adam

For any model with weight decay (which should be all production models), use AdamW instead of Adam. The difference is subtle but significant—AdamW provides more consistent regularization and often better generalization.
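In PyTorch the switch is a one-line change. A minimal sketch (the `nn.Linear` module is just a stand-in for any model):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # stand-in for any nn.Module

# Coupled L2 (avoid): the decay term enters the gradient and is then
# rescaled by the adaptive denominator.
opt_adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Decoupled (preferred): decay is applied directly to the parameters.
opt_adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```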


Hyperparameter Configuration

Proper hyperparameter selection is critical for training success.

Learning Rate

| Model Size | Recommended η | Range to Search |
|---|---|---|
| Small (<1M params) | 1e-3 | [5e-4, 5e-3] |
| Medium (1-10M params) | 1e-3 to 5e-4 | [1e-4, 1e-3] |
| Large (>10M params) | 1e-4 | [5e-5, 5e-4] |

For our 3.5M parameter CNN-BiLSTM-Attention model, we use $\eta = 10^{-3}$.

Beta Parameters

| Parameter | Default | When to Change |
|---|---|---|
| β₁ | 0.9 | Lower (0.85) for very noisy gradients |
| β₂ | 0.999 | Lower (0.99) for non-stationary problems |
| ε | 1e-8 | Increase (1e-6) for mixed precision |

Default Betas Work Well

For most applications, including RUL prediction, the default β₁=0.9 and β₂=0.999 work well. Focus tuning effort on learning rate and weight decay instead.

Weight Decay

Weight decay prevents overfitting by penalizing large weights:

| λ Value | Effect | Use Case |
|---|---|---|
| 0 | No regularization | Very large datasets |
| 1e-5 | Light regularization | Large models, big data |
| 1e-4 | Standard regularization | Most cases (recommended) |
| 1e-3 | Strong regularization | Small datasets, overfitting |
| 1e-2 | Very strong | Extreme overfitting only |

For C-MAPSS with ~20K training samples and a 3.5M parameter model, we use $\lambda = 10^{-4}$.


Implementation

Our research implementation combines AdamW with adaptive weight decay and an aggressive learning rate scheduler.

AMNL Research Configuration

AdamW with ReduceLROnPlateau Scheduler
From `enhanced_train_nasa_cmapss_sota_v7.py`:

  • `optim.AdamW`: Decoupled weight decay for proper L2 regularization; weight decay is applied directly to parameters, not through gradients.
  • `lr=learning_rate`: A base learning rate of 1e-3 works well for our 3.5M parameter model; the scheduler adjusts it during training.
  • `weight_decay=1e-4`: Starting weight decay. Our V7 approach adapts this dynamically during training to allow fine-tuning in later epochs.
  • `betas=(0.9, 0.999)`: Standard momentum (0.9) and variance (0.999) decay rates; these defaults work well for most deep learning tasks.
  • `ReduceLROnPlateau`: Monitors a validation metric and reduces the LR when training plateaus; more flexible than step-based schedules.
  • `mode='min'`: Monitors for decreasing RMSE; use `'max'` for metrics like accuracy where higher is better.
  • `factor=0.5`: Aggressive 50% reduction when a plateau is detected, a key V7 improvement over previous versions that used 0.7 (e.g., LR = 1e-3 → 5e-4 → 2.5e-4 → ...).
  • `patience=30`: Wait 30 epochs without improvement before reducing the LR; balances premature reduction against wasted epochs.
  • `min_lr=5e-6`: Floor that prevents the LR from becoming negligible, so training keeps making progress after multiple reductions.

```python
# Optimizer (will update weight decay dynamically)
optimizer = optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    weight_decay=1e-4,  # Will be updated dynamically
    betas=(0.9, 0.999),
    eps=1e-8
)

# ⭐ V7 KEY: More aggressive ReduceLROnPlateau
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,      # Aggressive: 50% reduction (was 0.7)
    patience=30,     # Moderate patience
    min_lr=5e-6,
    verbose=True
)
```

Adaptive Weight Decay

V7 introduces adaptive weight decay that decreases during training to allow fine-tuning:

Adaptive Weight Decay Strategy
From `enhanced_train_nasa_cmapss_sota_v7.py`:

  • `get_adaptive_weight_decay(epoch, initial_wd)`: Takes the current epoch and initial weight decay, returns the adjusted weight decay for that epoch.
  • Epochs 0-99 (early training): Full weight decay (1e-4); strong regularization prevents overfitting while the model learns primary patterns (e.g., epoch 50: wd = 1e-4).
  • Epochs 100-199 (middle training): Reduced weight decay (5e-5); the model has learned the main patterns and now refines them under lighter regularization (e.g., epoch 150: wd = 5e-5).
  • Epochs 200+ (fine-tuning): Minimal weight decay (1e-5); allows fine-grained optimization without a strong regularization penalty (e.g., epoch 250: wd = 1e-5).

```python
def get_adaptive_weight_decay(epoch: int, initial_wd: float = 1e-4) -> float:
    """
    ⭐ V7 NEW: Adaptive weight decay

    Reduce weight decay as training progresses to allow fine-tuning.
    """
    if epoch < 100:
        return initial_wd        # Full weight decay early on
    elif epoch < 200:
        return initial_wd * 0.5  # Half weight decay
    else:
        return initial_wd * 0.1  # Minimal weight decay for fine-tuning
```

Dynamic Update in Training Loop

The weight decay is updated at the start of each epoch by modifying the optimizer's parameter groups. This allows the regularization strength to adapt to the training phase.

```python
# Apply adaptive weight decay in training loop
for epoch in range(epochs):
    current_wd = get_adaptive_weight_decay(epoch)
    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = current_wd

    # ... training code ...
```
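For completeness, here is a self-contained toy version of this wiring, with a stand-in model and synthetic data in place of the project's actual training routines (a sketch, not the production loop). Note that `ReduceLROnPlateau` is stepped once per epoch with the monitored validation metric:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def get_adaptive_weight_decay(epoch: int, initial_wd: float = 1e-4) -> float:
    if epoch < 100:
        return initial_wd
    elif epoch < 200:
        return initial_wd * 0.5
    return initial_wd * 0.1

# Toy stand-ins so the loop runs end to end
model = nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=30, min_lr=5e-6
)

for epoch in range(5):
    # 1) Adapt weight decay for this epoch
    wd = get_adaptive_weight_decay(epoch)
    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = wd

    # 2) One (toy) training step
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

    # 3) Step the scheduler with the monitored metric (mode='min' -> RMSE)
    val_rmse = loss.item() ** 0.5   # stand-in for real validation RMSE
    scheduler.step(val_rmse)
```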

Summary

In this section, we configured the AdamW optimizer:

  1. Adam: Adaptive learning rates via momentum + variance tracking
  2. AdamW: Decoupled weight decay for proper regularization
  3. Learning rate: 1e-3 for our 3.5M parameter model
  4. Weight decay: 1e-4 for balanced regularization
  5. Parameter groups: No weight decay on biases and norms
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate (η) | 1e-3 |
| β₁ (momentum) | 0.9 |
| β₂ (variance) | 0.999 |
| ε (stability) | 1e-8 |
| Weight decay (λ) | 1e-4 |
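The last summary point, excluding biases and normalization parameters from weight decay, is typically implemented with optimizer parameter groups. A hedged sketch (the helper `build_param_groups` is ours, and it uses the common heuristic that 1-D tensors are biases or norm scales):

```python
import torch.nn as nn
import torch.optim as optim

def build_param_groups(model: nn.Module, weight_decay: float = 1e-4):
    """Split parameters so 1-D tensors (biases, norm scales) get no decay."""
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        (no_decay if param.ndim <= 1 else decay).append(param)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 1))
optimizer = optim.AdamW(build_param_groups(model), lr=1e-3)
```

Here the two `Linear` weight matrices land in the decayed group, while both biases and the `LayerNorm` scale and shift are left unregularized.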
Looking Ahead: A constant learning rate often leads to suboptimal convergence. The next section introduces learning rate warmup—a technique that prevents early training instability by gradually increasing the learning rate.

With the optimizer configured, we implement learning rate scheduling.