Gradient-Aware Multi-Task Learning for Predictive Maintenance
AMNL, GABA, and GRACE for RUL Prediction Under Multi-Condition Degradation
A research-grade walkthrough of three multi-task learning strategies for Remaining Useful Life prediction — AMNL (accuracy-first), GABA (safety-first), and GRACE (balanced) — all built on a shared CNN-BiLSTM-Attention backbone. Validated across 335 experiments on NASA C-MAPSS and N-CMAPSS DS02, beating the published SOTA (DKAMFormer) on multi-condition data.
Foundations: RUL, benchmarks, math preliminaries, and MTL theory.
Predictive Maintenance & RUL
Why Remaining Useful Life prediction matters, the cost of being late, and the three deployment regimes that motivate this book.
Benchmarks: C-MAPSS & N-CMAPSS
The two benchmarks that define progress in turbofan RUL prediction, and why multi-condition data is the real challenge.
Mathematical Preliminaries
The minimal math you need: time series tensors, 1D convolution, recurrent networks, attention, and softmax cross-entropy.
Multi-Task Learning Theory
Shared backbones, task-specific heads, and the loss-combination problem that the rest of the book is dedicated to solving.
Data Pipeline: C-MAPSS / N-CMAPSS, per-condition normalization, sequences and labels.
NASA Datasets Deep Dive
Sensor catalog, operating conditions, fault modes, and the file formats you will actually load into PyTorch.
Per-Condition Normalization
The silent hero of the framework: why global Z-score fails on multi-condition data and how per-condition normalization fixes it.
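The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's pipeline: the function name `per_condition_zscore` and the toy readings are hypothetical, and a real pipeline would fit the per-condition statistics on training data only.

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_condition_zscore(values, condition_ids, eps=1e-8):
    """Z-score each reading with the mean/std of its own operating condition."""
    groups = defaultdict(list)
    for v, c in zip(values, condition_ids):
        groups[c].append(v)
    # One (mean, std) pair per operating condition.
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in groups.items()}
    return [(v - stats[c][0]) / (stats[c][1] + eps)
            for v, c in zip(values, condition_ids)]

# Toy sensor: two operating conditions with very different baselines.
readings = [100.0, 102.0, 104.0, 500.0, 510.0, 520.0]
conditions = [0, 0, 0, 1, 1, 1]
normalized = per_condition_zscore(readings, conditions)
```

After normalization the two conditions land on the same scale, so a shift in operating regime no longer looks like degradation. A single global Z-score over these six readings would instead push every condition-0 value far below zero.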
Sequences, RUL Cap & Health Labels
Building the (B, 30, 17) input tensor: sliding windows, the piecewise-linear RUL cap, three-class health labels, and a reusable PyTorch Dataset.
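A rough sketch of those three steps in plain Python follows. The window length of 30 and the 17 features come from the tensor shape above; the cap of 125 cycles, the health thresholds (80 and 30), and all function names are assumptions made for illustration only.

```python
def capped_rul(total_cycles, cap=125):
    """Piecewise-linear target: flat at `cap` early in life, then linear down to 0."""
    return [min(cap, total_cycles - t) for t in range(1, total_cycles + 1)]

def health_label(rul, degrading_at=80, critical_at=30):
    """Three classes: 0 = healthy, 1 = degrading, 2 = critical (thresholds assumed)."""
    if rul > degrading_at:
        return 0
    if rul > critical_at:
        return 1
    return 2

def sliding_windows(rows, window=30):
    """Every run of `window` consecutive rows, with the index of its last row."""
    return [(rows[i:i + window], i + window - 1)
            for i in range(len(rows) - window + 1)]

# Toy engine: 200 cycles, 17 features per cycle (zeros as placeholders).
rows = [[0.0] * 17 for _ in range(200)]
rul = capped_rul(len(rows))
# Each sample is labeled with the RUL and health class at its final timestep.
samples = [(w, rul[last], health_label(rul[last]))
           for w, last in sliding_windows(rows)]
```

Stacking B of these windows gives the (B, 30, 17) input; a PyTorch Dataset would return one `(window, rul, label)` triple per index.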
Shared Backbone: CNN, BiLSTM, Multi-Head Attention, dual-task heads.
CNN Feature Extractor
Three 1D conv layers (17 → 64 → 128 → 64) extract local degradation patterns from sensor streams.
Bidirectional LSTM Encoder
Two-layer BiLSTM (h=256) captures long-range temporal dependencies in degradation signatures.
Multi-Head Self-Attention
Eight-head self-attention with residual connection lets the model focus on degradation-relevant timesteps.
Dual-Task Heads & Model Assembly
Two task-specific heads (RUL regression + 3-class health classification) on top of a shared 32-d feature, totaling ~3.5M parameters.
The Core Discovery: The 500x gradient imbalance and the accuracy-safety tradeoff.
The 500x Gradient Imbalance
The empirical discovery that motivates the rest of the book: regression gradients exceed classification gradients by 500x on shared parameters.
The Accuracy-Safety Tradeoff
Why low RMSE can coincide with a high (worse) NASA score, and why the tradeoff cannot be hidden behind a single metric.
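A small numeric sketch makes the point, assuming the standard asymmetric scoring function from the PHM08 challenge (time constants 13 for early and 10 for late predictions). The toy predictions are invented for illustration.

```python
import math

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def nasa_score(pred, true):
    """Asymmetric score: late predictions (pred > true) cost more than early ones."""
    total = 0.0
    for p, t in zip(pred, true):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0   # early: milder penalty
        else:
            total += math.exp(d / 10.0) - 1.0    # late: harsher penalty
    return total

true = [50.0, 50.0, 50.0]
early = [30.0, 30.0, 30.0]   # predicts failure 20 cycles too soon
late = [70.0, 70.0, 70.0]    # predicts failure 20 cycles too late

rmse_early, rmse_late = rmse(early, true), rmse(late, true)
score_early, score_late = nasa_score(early, true), nasa_score(late, true)
```

Both predictors have an RMSE of 20, yet the late one scores roughly 19 against roughly 11 for the early one. RMSE alone cannot tell a conservative model from a dangerous one.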
Model 1 (AMNL): Failure-biased weighted MSE for accuracy-first deployment.
Failure-Biased Weighted MSE
Up-weighting near-failure samples so the regressor pays attention where errors hurt the most.
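One possible shape for such a loss is sketched below. The linear weight `w = 1 + alpha * (1 - RUL/cap)`, the cap of 125, and the name `failure_biased_mse` are assumptions chosen for illustration; the chapter's actual weighting function may differ.

```python
def failure_biased_mse(pred, true, cap=125.0, alpha=1.0):
    """Weighted MSE whose per-sample weight grows as the true RUL approaches 0."""
    # Hypothetical weight: 1.0 at or above the cap, rising to 1 + alpha at failure.
    weights = [1.0 + alpha * (1.0 - min(t, cap) / cap) for t in true]
    weighted = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, true))
    return weighted / sum(weights)

true = [125.0, 5.0]                       # one healthy sample, one near failure
miss_on_healthy = failure_biased_mse([135.0, 5.0], true)
miss_on_failing = failure_biased_mse([125.0, 15.0], true)
plain_mse = failure_biased_mse([125.0, 15.0], true, alpha=0.0)
```

The same 10-cycle error costs about twice as much on the near-failure sample as on the healthy one, and setting `alpha=0` recovers ordinary MSE.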
AMNL Training Pipeline
Fixed 0.5/0.5 task weighting + failure-biased MSE, with the optimizer, scheduler, and EMA tricks that hold it together.
AMNL Results & When to Use It
Best-in-literature RMSE on FD002/FD003, the FD001 NASA penalty, and the cross-pipeline caveat you must report.
Model 2 (GABA): Inverse-gradient adaptive balancing for safety-first deployment.
Inverse-Gradient Balancing: The Idea
Equalize each task's contribution to the shared backbone by giving lower weight to whichever task has bigger gradients.
The GABA Algorithm
Per-step gradient norms, EMA smoothing (β = 0.99), minimum floor (λ_min = 0.05), and a 100-step warmup — the full pseudocode walked end to end.
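The ingredients listed above can be arranged as in the sketch below, for the two-task case. Only β = 0.99, λ_min = 0.05, and the 100-step warmup come from the description. The class name, the choice to smooth the weight rather than the norms, and the equal weights returned during warmup are assumptions, so treat this as one plausible reading rather than the chapter's pseudocode.

```python
class GabaWeights:
    """Two-task inverse-gradient weighting with EMA smoothing, floor, and warmup."""

    def __init__(self, beta=0.99, lam_min=0.05, warmup_steps=100):
        self.beta = beta
        self.lam_min = lam_min
        self.warmup_steps = warmup_steps
        self.step = 0
        self.lam_rul = 0.5  # smoothed weight on the regression loss

    def update(self, grad_norm_rul, grad_norm_cls, eps=1e-12):
        """Take this step's gradient norms; return (lam_rul, lam_cls)."""
        self.step += 1
        # Inverse-gradient target: the task with the larger gradient gets less weight.
        target = grad_norm_cls / (grad_norm_rul + grad_norm_cls + eps)
        # EMA smoothing (assumed here to act on the weight itself).
        self.lam_rul = self.beta * self.lam_rul + (1.0 - self.beta) * target
        if self.step <= self.warmup_steps:
            return 0.5, 0.5  # assumed: equal weights while the EMA settles
        # Floor on both sides so neither task is starved of gradient.
        lam = min(max(self.lam_rul, self.lam_min), 1.0 - self.lam_min)
        return lam, 1.0 - lam

# Simulate a constant 500x imbalance between regression and classification.
gaba = GabaWeights()
history = [gaba.update(grad_norm_rul=500.0, grad_norm_cls=1.0) for _ in range(1000)]
```

Under a constant 500x imbalance the raw inverse-gradient weight on regression would be about 0.002, so the floor takes over and the weights settle at 0.05 and 0.95. In PyTorch the two norms could be obtained with `torch.autograd.grad` of each loss with respect to the shared parameters.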
Control-Theoretic Interpretation
GABA viewed as a proportional feedback controller with an IIR filter and anti-windup floor — the property that gives it stronger stability guarantees than GradNorm.
Training GABA & Results
GABA + standard MSE: best NASA among adaptive methods, no auxiliary loss, no learned parameters, and a single λ that converges within 10 epochs.
Model 3 (GRACE): GABA + weighted MSE for balanced deployment.
Combining GABA + Weighted MSE
Adaptive weighting and loss-shape are orthogonal — GRACE composes them and resolves the accuracy-safety tradeoff.
GRACE Training Pipeline
The full reproducible pipeline: 5 seeds, AdamW, ReduceLROnPlateau, EMA, gradient clipping, and exact hyperparameters.
GRACE Results & the Pareto Frontier
Best NASA on multi-condition C-MAPSS, the RMSE-NASA Pareto picture, and the only method to win on N-CMAPSS DS02.
Baselines & SOTA: Adaptive baselines, gradient surgery, and the unified comparison.
Adaptive MTL Baselines
Fixed weighting, Homoscedastic Uncertainty, GradNorm, and DWA — the published baselines we ran inside the same framework.
Gradient-Surgery Baselines
PCGrad and CAGrad project away conflicting gradient directions. Useful, but magnitude correction beats direction correction in this domain.
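A toy two-dimensional example of the PCGrad projection step hints at why direction correction alone may not be enough here. The gradient vectors are invented, with the regression gradient made far larger to mimic the imbalance described earlier. Full PCGrad also randomizes task order, which this sketch omits.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcgrad_project(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), drop g_i's component along g_j."""
    d = dot(g_i, g_j)
    if d >= 0:
        return list(g_i)  # no conflict: leave the gradient untouched
    scale = d / dot(g_j, g_j)
    return [x - scale * y for x, y in zip(g_i, g_j)]

g_rul = [500.0, 0.0]   # large regression gradient
g_cls = [-0.6, 0.8]    # small, partly opposing classification gradient

# Each task is projected against the other's original gradient.
g_cls_proj = pcgrad_project(g_cls, g_rul)
g_rul_proj = pcgrad_project(g_rul, g_cls)

norm_rul = math.sqrt(dot(g_rul_proj, g_rul_proj))
norm_cls = math.sqrt(dot(g_cls_proj, g_cls_proj))
```

After projection the conflict is gone, but the regression gradient is still hundreds of times larger than the classification gradient. The summed update remains dominated by one task, which is the gap that magnitude balancing targets.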
Unified Comparison vs. Published SOTA
AMNL, GABA, GRACE plus six baselines vs. DKAMFormer, DMHA-ATCN, STAR, DVGTformer, and the rest of the field.
Ablations & Statistics: Architecture, normalization, robustness, and statistical tests.
Architecture Ablation
Five backbones (CNN-only, MLP, Transformer, LSTM-only, Full) and the result that the framework benefit is architecture-agnostic.
Normalization, Robustness & Statistical Tests
Per-condition vs. global normalization, GABA hyperparameter sweeps, and the formal Friedman / Wilcoxon tests behind every claim.
Production: Edge deployment, model selection, and future directions.
Deployment, Model Selection & Future Directions
Edge deployment profile, ONNX export, the model-selection decision tree, and where this research goes next.
Where the book lands in practice.
121 sections. Begin with one.
Chapter 1 — Predictive Maintenance & RUL — is where every reader starts.