AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the Adam optimizer and its momentum mechanisms
Explain the difference between Adam and AdamW in weight decay handling
Configure AdamW hyperparameters for RUL prediction
Recognize common pitfalls in optimizer configuration
Implement AdamW with proper initialization

Why This Matters: The optimizer is the engine that drives learning. AdamW (Loshchilov & Hutter, 2019) fixes a subtle but important flaw in how standard Adam applies weight decay. For deep networks like our CNN-BiLSTM-Attention model, proper optimizer configuration can mean the difference between state-of-the-art results and training failure.

Adam Optimizer Review

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates.

Core Adam Algorithm

Adam maintains two exponential moving averages of the gradient, both initialized to zero ($m_0 = v_0 = 0$):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment / momentum)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment / uncentered variance)}$$

Where:

$g_t = \nabla_\theta \mathcal{L}(\theta_t)$: Gradient at step $t$
$\beta_1 = 0.9$: First moment decay (momentum), default value
$\beta_2 = 0.999$: Second moment decay (uncentered variance), default value

Bias Correction

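Because both moving averages start at zero, they are biased toward zero during the first steps. Adam corrects this by dividing each by $1 - \beta^t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$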
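To make the effect of the correction concrete, here is a minimal sketch in plain Python. The function name and the constant gradient of 2.0 are illustrative assumptions, not part of the training script.

```python
def first_moment_estimates(grads, beta1=0.9):
    """Return (raw, bias-corrected) first-moment estimates after each step."""
    m = 0.0  # m_0 = 0, as in Adam
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta1 ** t))
    return raw, corrected


# A constant gradient of 2.0 (arbitrary illustrative value).
raw, corrected = first_moment_estimates([2.0] * 5)
for step, (m, m_hat) in enumerate(zip(raw, corrected), start=1):
    print(f"t={step}: m_t={m:.4f}  m_hat_t={m_hat:.4f}")
```

With a constant gradient, the raw average starts at only 10% of the true value and approaches it slowly, while the corrected estimate matches the gradient from the first step.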

Parameter Update

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The effective step size $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ shrinks as the recent squared gradients grow, so parameters with large or noisy gradients get smaller updates.
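The three steps above (moment updates, bias correction, parameter update) can be sketched for a single scalar parameter as follows. This is an illustrative pure-Python version, not the PyTorch implementation; the quadratic objective and the learning rate of 1e-2 are assumptions chosen only to give a quick demonstration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, lr=1e-2, target=3.0):
    """Minimize f(theta) = (theta - target)^2, starting from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - target)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


print(minimize_quadratic())
```

Note that the very first update has magnitude close to `lr` regardless of the gradient's size, because $\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / |g_1|$.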

AdamW: Decoupled Weight Decay

AdamW fixes how weight decay interacts with adaptive learning rates.

The Problem with L2 Regularization in Adam

Standard Adam with L2 regularization adds a penalty to the loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L} + \frac{\lambda}{2}\|\theta\|^2$$

The gradient becomes:

$$g_t = \nabla_\theta \mathcal{L} + \lambda \theta$$

Problem: The regularization term $\lambda \theta$ now passes through Adam's adaptive scaling $1/(\sqrt{\hat{v}_t} + \epsilon)$ together with the loss gradient. Parameters with large gradient magnitudes are therefore decayed less than parameters with small gradients, so the regularization strength varies from weight to weight instead of being set by $\lambda$ alone.
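A rough numeric illustration of this effect follows. It assumes the gradient has been constant long enough that $\hat{m}_t = g$ and $\hat{v}_t = g^2$; the function name, the weight value, and $\lambda = 10^{-2}$ are hypothetical choices made only to keep the numbers readable.

```python
def l2_decay_through_adam(theta, loss_grad, lam=1e-2, lr=1e-3, eps=1e-8):
    """Portion of one Adam + L2 update that comes from the penalty term.

    Assumes a constant gradient, so m_hat = g and v_hat = g**2,
    where g = loss_grad + lam * theta.
    """
    g = loss_grad + lam * theta
    return lr * (lam * theta) / (abs(g) + eps)


theta = 1.0  # same weight value in both cases
small = l2_decay_through_adam(theta, loss_grad=0.1)
large = l2_decay_through_adam(theta, loss_grad=10.0)
print(f"decay with small loss gradient: {small:.3e}")
print(f"decay with large loss gradient: {large:.3e}")
print(f"ratio: {small / large:.1f}")
```

Both weights have the same value, yet under these assumptions the one with the larger loss gradient is shrunk roughly 90 times less per step.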

AdamW Solution: Decoupled Weight Decay

AdamW applies weight decay directly to parameters, not through the gradient:

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_t$$

The weight decay term $\eta \lambda \theta_t$ is not scaled by the adaptive factor. This is called "decoupled" weight decay.
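As a sketch of the decoupled update, the scalar function below keeps the adaptive step and the decay step as two separate terms. It is an illustrative pure-Python version written for this explanation, not the PyTorch source; `adamw_step` and its arguments are assumed names.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter; grad excludes any penalty."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * weight_decay * theta  # not divided by sqrt(v_hat)
    return theta - adaptive_step - decay_step, m, v


# With a zero loss gradient, only the decay term moves the weight.
theta, m, v = adamw_step(theta=2.0, grad=0.0, m=0.0, v=0.0, t=1)
print(theta)
```

Because `decay_step` never touches `v_hat`, every weight shrinks by the same relative amount `lr * weight_decay` per step, whatever its gradient history.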

Comparison

Aspect	Adam + L2	AdamW
Weight decay	Through gradient	Direct on parameters
Adaptive scaling	Affects regularization	Only affects gradient
Effective decay	Varies by parameter	Uniform across parameters
Recommended	No	Yes

Use AdamW, Not Adam

For any model trained with weight decay (which covers most production models), use AdamW instead of Adam with an L2 penalty. The difference is subtle but significant: AdamW decays all weights at the same relative rate, which gives more consistent regularization and often better generalization.

Hyperparameter Configuration

Proper hyperparameter selection is critical for training success.

Learning Rate

Model Size	Recommended η	Range to Search
Small (<1M params)	1e-3	[5e-4, 5e-3]
Medium (1-10M params)	1e-3 to 5e-4	[1e-4, 1e-3]
Large (>10M params)	1e-4	[5e-5, 5e-4]

For our 3.5M parameter CNN-BiLSTM-Attention model, we use $\eta = 10^{-3}$ .
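When searching one of the ranges in the table, candidates are commonly spaced on a log scale rather than linearly. The helper below is a small sketch of that idea; `log_spaced` and the choice of five candidates are assumptions for illustration.

```python
import math


def log_spaced(low, high, num):
    """Return `num` values spaced evenly on a log scale from low to high."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (math.log10(high) - math.log10(low)) / (num - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(num)]


# Search range for a medium-sized model, taken from the table above.
candidates = log_spaced(1e-4, 1e-3, num=5)
print([f"{lr:.2e}" for lr in candidates])
```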

Beta Parameters

Parameter	Default	When to Change
β₁	0.9	Lower (0.85) for very noisy gradients
β₂	0.999	Lower (0.99) for non-stationary problems
ε	1e-8	Increase (1e-6) for mixed precision

Default Betas Work Well

For most applications, including RUL prediction, the default β₁=0.9 and β₂=0.999 work well. Focus tuning effort on learning rate and weight decay instead.
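One way to read the beta values is through the approximate averaging window of an exponential moving average, about $1/(1-\beta)$ steps. The snippet below is a rule-of-thumb sketch; `ema_window` is a hypothetical helper name.

```python
def ema_window(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)


for name, beta in [("beta1", 0.9), ("beta1 (noisy)", 0.85),
                   ("beta2", 0.999), ("beta2 (non-stationary)", 0.99)]:
    print(f"{name}: beta={beta} -> ~{ema_window(beta):.0f} steps")
```

The default β₁ averages over roughly 10 steps and the default β₂ over roughly 1000. Lowering β₂ to 0.99 shortens the window to about 100 steps, which lets the second-moment estimate follow changing gradient scales more quickly.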

Weight Decay

Weight decay prevents overfitting by penalizing large weights:

λ Value	Effect	Use Case
0	No regularization	Very large datasets
1e-5	Light regularization	Large models, big data
1e-4	Standard regularization	Most cases (recommended)
1e-3	Strong regularization	Small datasets, overfitting
1e-2	Very strong	Extreme overfitting only

For C-MAPSS with ~20K training samples and a 3.5M parameter model, we use $\lambda = 10^{-4}$ .
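To get a feel for these magnitudes, recall that each AdamW step multiplies a weight by $(1 - \eta\lambda)$ before the gradient term is applied. The sketch below isolates that factor by assuming a zero loss gradient; the step count of 100,000 is a hypothetical figure, not the length of our training run.

```python
def decay_only_shrink(lr, weight_decay, steps):
    """Factor by which a weight shrinks after `steps` AdamW updates
    if the loss gradient were zero (decay term only)."""
    return (1.0 - lr * weight_decay) ** steps


lr = 1e-3
steps = 100_000  # hypothetical number of optimizer steps
for wd in (1e-5, 1e-4, 1e-3, 1e-2):
    factor = decay_only_shrink(lr, wd, steps)
    print(f"lambda={wd:.0e}: weight scaled by {factor:.4f}")
```

Under these assumptions, $\lambda = 10^{-4}$ removes about 1% of a weight's magnitude over the whole run, while $\lambda = 10^{-2}$ would remove well over half, which is why the largest value is reserved for extreme overfitting.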

Implementation

Our research implementation combines AdamW with adaptive weight decay and an aggressive learning rate scheduler.

AMNL Research Configuration

AdamW with ReduceLROnPlateau Scheduler (enhanced_train_nasa_cmapss_sota_v7.py)

import torch.optim as optim

learning_rate = 1e-3  # Base learning rate; `model` is the network defined earlier

# Optimizer (weight decay is updated dynamically during training)
optimizer = optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    weight_decay=1e-4,  # Initial value; updated dynamically
    betas=(0.9, 0.999),
    eps=1e-8
)

# ⭐ V7 KEY: More aggressive ReduceLROnPlateau
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,      # Aggressive: 50% reduction (was 0.7)
    patience=30,     # Moderate patience
    min_lr=5e-6,
    verbose=True     # Deprecated in recent PyTorch releases; drop it there
)

Notes on the configuration:

AdamW optimizer (optim.AdamW): Uses decoupled weight decay, applied directly to the parameters rather than through the gradients.
Learning rate (lr): A base learning rate of 1e-3 works well for our 3.5M parameter model. The scheduler adjusts it during training.
Initial weight decay (weight_decay): Starts at 1e-4. Our V7 approach adapts it during training to allow fine-tuning in later epochs.
Beta parameters (betas): Standard decay rates for the first moment (0.9) and the second moment (0.999). These defaults work well for most deep learning tasks.
ReduceLROnPlateau: Monitors a validation metric and reduces the LR when that metric stops improving. More flexible than step-based schedules.
Monitor mode (mode): mode='min' treats a decreasing RMSE as improvement. Use 'max' for metrics like accuracy where higher is better.
Reduction factor (factor): Aggressive 50% reduction (factor=0.5) when a plateau is detected, a key V7 change from the 0.7 used in previous versions. Example: LR = 1e-3 → 5e-4 → 2.5e-4 → ...
Patience (patience): Wait 30 epochs without improvement before reducing the LR. Balances premature reduction against wasted epochs.
Minimum LR (min_lr): A floor at 5e-6 prevents the LR from becoming negligible, so training can still make progress after multiple reductions.
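To see how factor, patience, and min_lr interact, the following is a simplified pure-Python stand-in for the plateau logic in 'min' mode. It is a sketch written for this explanation, not PyTorch's implementation: it ignores cooldown and other options, and the relative improvement threshold of 1e-4 is assumed to mirror the library default. The constant metric of 25.0 is an arbitrary worst case in which validation RMSE never improves.

```python
class PlateauSketch:
    """Simplified stand-in for ReduceLROnPlateau in 'min' mode."""

    def __init__(self, lr=1e-3, factor=0.5, patience=30, min_lr=5e-6,
                 threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.threshold = threshold
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats the best
        # value by a relative margin of `threshold`.
        if metric < self.best * (1.0 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # Reduce once the number of bad epochs exceeds the patience.
        if self.num_bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.num_bad_epochs = 0
        return self.lr


sched = PlateauSketch()
history = []
for epoch in range(300):
    lr = sched.step(25.0)  # validation RMSE that never improves
    if not history or lr != history[-1]:
        history.append(lr)
print(history)
```

In this sketch each reduction requires 31 consecutive epochs without improvement, and the floor of 5e-6 is reached on the eighth reduction, since 1e-3 × 0.5⁸ falls below min_lr.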

Adaptive Weight Decay

V7 introduces adaptive weight decay that decreases during training to allow fine-tuning:

Adaptive Weight Decay Strategy (enhanced_train_nasa_cmapss_sota_v7.py)

def get_adaptive_weight_decay(epoch: int, initial_wd: float = 1e-4) -> float:
    """
    ⭐ V7 NEW: Adaptive weight decay

    Reduce weight decay as training progresses to allow fine-tuning.
    """
    if epoch < 100:
        return initial_wd  # Full weight decay early on
    elif epoch < 200:
        return initial_wd * 0.5  # Half weight decay
    else:
        return initial_wd * 0.1  # Minimal weight decay for fine-tuning

Notes on the schedule:

Function signature: Takes the current epoch and the initial weight decay, and returns the adjusted weight decay for that epoch.
Early training phase (epochs 0-99): Full weight decay (1e-4). Strong regularization prevents overfitting while the model learns the primary patterns. Example: epoch 50 gives wd = 1e-4.
Middle training phase (epochs 100-199): Reduced weight decay (5e-5). The model has learned the main patterns and is now refined with lighter regularization. Example: epoch 150 gives wd = 5e-5.
Fine-tuning phase (epochs 200+): Minimal weight decay (1e-5). Allows fine-grained optimization without a strong regularization penalty. Example: epoch 250 gives wd = 1e-5.

Dynamic Update in Training Loop

The weight decay is updated at the start of each epoch by modifying the optimizer's parameter groups. This allows the regularization strength to adapt to the training phase.

# Apply adaptive weight decay in training loop
for epoch in range(epochs):
    current_wd = get_adaptive_weight_decay(epoch)
    for param_group in optimizer.param_groups:
        param_group['weight_decay'] = current_wd

    # ... training code ...
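The mechanics of this update can be checked without a GPU or a model. In the sketch below, `fake_optimizer` is a hypothetical stand-in that mimics only the `param_groups` attribute of a PyTorch optimizer, and `staged_weight_decay` restates the three-stage schedule so the block runs on its own; `set_weight_decay` is an assumed helper name.

```python
from types import SimpleNamespace


def set_weight_decay(optimizer, weight_decay):
    """Write a new weight decay into every parameter group."""
    for group in optimizer.param_groups:
        group["weight_decay"] = weight_decay


def staged_weight_decay(epoch, initial_wd=1e-4):
    """Same three-stage schedule as get_adaptive_weight_decay."""
    if epoch < 100:
        return initial_wd
    if epoch < 200:
        return initial_wd * 0.5
    return initial_wd * 0.1


# Stand-in for a torch.optim.AdamW instance: only param_groups is mimicked.
fake_optimizer = SimpleNamespace(
    param_groups=[{"lr": 1e-3, "weight_decay": 1e-4}]
)

seen = {}
for epoch in (0, 99, 100, 199, 200, 250):  # phase boundaries
    set_weight_decay(fake_optimizer, staged_weight_decay(epoch))
    seen[epoch] = fake_optimizer.param_groups[0]["weight_decay"]
    print(epoch, seen[epoch])
```

With a real optimizer, the value is read from each parameter group on every call to `optimizer.step()`, so a change made at the start of an epoch takes effect from the next update.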

Summary

In this section, we configured the AdamW optimizer:

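Adam: Adaptive learning rates via momentum + second-moment tracking
AdamW: Decoupled weight decay for proper regularization
Learning rate: 1e-3 for our 3.5M parameter model, halved by ReduceLROnPlateau when the validation metric plateaus
Weight decay: 1e-4 initially, reduced to 5e-5 and then 1e-5 as training progresses
Parameter groups: Weight decay is updated each epoch through optimizer.param_groups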

Hyperparameter	Value
Optimizer	AdamW
Learning rate (η)	1e-3
β₁ (momentum)	0.9
β₂ (variance)	0.999
ε (stability)	1e-8
Weight decay (λ)	1e-4

Looking Ahead: A constant learning rate often leads to suboptimal convergence. The next section introduces learning rate warmup—a technique that prevents early training instability by gradually increasing the learning rate.

With the optimizer configured, we implement learning rate scheduling.