The Three-Layer Optimiser Stack
AMNL ships with a three-piece optimiser stack: an AdamW base with decoupled weight decay, a linear warmup over the first 10 epochs, and a ReduceLROnPlateau scheduler that cuts lr by 50% every time the validation RMSE has not improved for 30 epochs. Each piece does one job; together they keep training stable from epoch 0 to convergence.
AdamW(lr=1e-3, weight_decay=1e-4, betas=(0.9, 0.999)); warmup is linear from 0.1 → 1.0 over 10 epochs; plateau scheduler is ReduceLROnPlateau(mode='min', factor=0.5, patience=30, min_lr=5e-6). All from paper_ieee_tii/experiments/train_amnl_v7.py.AdamW: Adam with Decoupled Weight Decay
Adam's update rule is where are bias-corrected EMAs of the first and second gradient moments. Plain L2 regularisation in Adam produces a regulariser strength that is divided by - so parameters with large running gradients get effectively LESS regularisation. AdamW decouples the wd term:
The second term is INDEPENDENT of . Every parameter gets the same regularisation strength regardless of its gradient history.
Warmup: First 10 Epochs
Linear warmup multiplies the base lr by:
with . So lr rises from at epoch 0 to at epoch 10.
ReduceLROnPlateau: Cuts on Stalls
After warmup, the lr is reactive: as long as validation RMSE keeps improving, lr stays at base_lr. Once 30 consecutive epochs pass without a new best RMSE, the scheduler halves the lr. This continues until lr hits the floor min_lr = 5×10⁻⁶ (about 200× below base).
| Hyperparameter | Value | Why |
|---|---|---|
| mode | 'min' | RMSE is a metric that should DECREASE - watch for new minima |
| factor | 0.5 | 50% reduction per cut. Default in PyTorch is 0.1 (more aggressive); paper picked 0.5 to be gentler |
| patience | 30 | wait 30 epochs without improvement before cutting. Long enough to absorb random batch-to-batch jitter |
| min_lr | 5e-6 | floor; below this the scheduler stops cutting and training continues at fixed small lr |
Interactive: Walk a 200-Epoch Schedule
Drag the patience slider to see how aggressive cuts feel. Drag the number-of-plateaus slider to add or remove synthetic stall events. Watch lr drop in lock-step with each plateau marker.
Try this. Set patience=10 and 4 plateaus - lr collapses to min_lr by epoch 100. Set patience=80 with 1 plateau - lr stays at base for ~50 epochs after the stall, a more conservative regime. The paper's patience=30 is the middle ground: aggressive enough to escape local minima, conservative enough not to cut prematurely.
Python: Schedule from Scratch
Pure NumPy reimplementation - replicates PyTorch's ReduceLROnPlateau(mode='min', factor=0.5, patience=30, min_lr=5e-6) step-for-step on a synthetic validation curve.
PyTorch: The Paper's Stack
Three factory functions plus a smoke test. Each function returns the exact paper-canonical config from paper_ieee_tii/experiments/train_amnl_v7.py.
Same Stack, Other Domains
The (AdamW + warmup + plateau) recipe transfers wherever you train a deep model with a reactive validation metric. Tune the patience to your epoch budget; everything else stays.
| Domain | lr | wd | warmup | patience | Notes |
|---|---|---|---|---|---|
| RUL prediction (this book) | 1e-3 | 1e-4 | 10 | 30 | paper default |
| BERT-base fine-tune (NLP) | 5e-5 | 0.01 | 500 steps | linear decay | warmup steps not epochs |
| Vision Transformer (ImageNet) | 1e-4 | 0.05 | 5000 steps | cosine | uses cosine decay instead of plateau |
| Stable Diffusion fine-tune | 1e-5 | 1e-2 | 100 steps | constant | wd higher to prevent collapse |
| GAN training | 2e-4 | 0 | 0 | n/a | wd=0 for generators - typically |
| Reinforcement learning (PPO) | 3e-4 | 0 | 0 | n/a | no decay - throughput matters more |
Three Optimiser Pitfalls
scheduler.step(val) during the warmup window (where lr is being externally set by the warmup callback), the scheduler's internal ‘best’ gets corrupted and the first post-warmup cut fires too early. Paper trainer skips scheduler.step() until epoch ≥ warmup_epochs.The point. AdamW + warmup + plateau is a three-piece stack that handles the early-instability, steady-state, and late-stall regimes of training with minimal hyperparameter sensitivity. §15.3 adds the two remaining tricks: gradient clipping and weight EMA.
Takeaway
- AdamW(lr=1e-3, wd=1e-4). Decoupled weight decay - critical with sample-weighted losses.
- Warmup 0.1 → 1.0 over 10 epochs. Lets Adam's stabilise before letting it drive the step size.
- ReduceLROnPlateau(factor=0.5, patience=30, min_lr=5e-6). Reactive scheduler. 50% cut after 30 stall epochs; floor at 5×10⁻⁶.
- ~1-2 decades of lr drop over 200 epochs on typical AMNL FD002 training.
- Skip scheduler.step() during warmup - paper trainer guards this explicitly.