Book · Advanced · 70+ hours

Gradient-Aware Multi-Task Learning for Predictive Maintenance

AMNL, GABA, and GRACE for RUL Prediction Under Multi-Condition Degradation

A research-grade walkthrough of three multi-task learning strategies for Remaining Useful Life prediction — AMNL (accuracy-first), GABA (safety-first), and GRACE (balanced) — all built on a shared CNN-BiLSTM-Attention backbone. Validated across 335 experiments on NASA C-MAPSS and N-CMAPSS DS02, beating the published SOTA (DKAMFormer) on multi-condition data.

29 chapters · 121 sections · 28 h reading · 10 parts
Part I · 4 chapters · 17 sections

Foundations
RUL, benchmarks, math preliminaries, and MTL theory.

Predictive Maintenance & RUL

Why Remaining Useful Life prediction matters, the cost of being late, and the three deployment regimes that motivate this book.

4 sections · 44 min read
  1. What is Predictive Maintenance? · 10 min
  2. The RUL Prediction Problem · 12 min
  3. Why Safety Matters: NASA Score vs. RMSE · 12 min
  4. Three Deployment Regimes (Accuracy / Safety / Balanced) · 10 min

Benchmarks: C-MAPSS & N-CMAPSS

The two benchmarks that define progress in turbofan RUL prediction, and why multi-condition data is the real challenge.

4 sections · 54 min read
  1. NASA C-MAPSS Overview · 12 min
  2. FD001 to FD004: Single vs. Multi-Condition · 15 min
  3. N-CMAPSS DS02: Realistic Flight Envelopes · 15 min
  4. Why Multi-Condition Datasets Are Hard · 12 min

Mathematical Preliminaries

The minimal math you need: time series tensors, 1D convolution, recurrent networks, attention, and softmax cross-entropy.

5 sections · 72 min read
  1. Time Series & Tensors · 12 min
  2. 1D Convolution for Sensor Streams · 15 min
  3. Recurrent Networks & LSTM Cells · 18 min
  4. Self-Attention · 15 min
  5. Softmax & Cross-Entropy · 12 min

Multi-Task Learning Theory

Shared backbones, task-specific heads, and the loss-combination problem that the rest of the book is dedicated to solving.

4 sections · 54 min read
  1. Why Multi-Task Learning? · 12 min
  2. Shared vs. Task-Specific Parameters · 12 min
  3. The Loss-Combination Problem · 15 min
  4. A Gradient-Level View of MTL · 15 min
Part II · 3 chapters · 13 sections

Data Pipeline
C-MAPSS / N-CMAPSS, per-condition normalization, sequences and labels.

NASA Datasets Deep Dive

Sensor catalog, operating conditions, fault modes, and the file formats you will actually load into PyTorch.

5 sections · 66 min read
  1. C-MAPSS File Structure · 12 min
  2. The 21 Sensors and 3 Operational Settings · 15 min
  3. Selecting 14 Informative Sensors · 12 min
  4. Operating-Condition Discovery · 15 min
  5. Fault Modes (HPC, Fan, and Combinations) · 12 min

Per-Condition Normalization

The silent hero of the framework: why global Z-score fails on multi-condition data and how per-condition normalization fixes it.

4 sections · 49 min read
  1. Why Global Z-Score Fails · 12 min
  2. Discovering Operating Conditions (k-Means) · 12 min
  3. Per-Condition Z-Score Implementation · 15 min
  4. Preventing Test-Set Leakage · 10 min
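
The idea behind this chapter can be illustrated with a small sketch. It assumes each row already carries an operating-condition ID (the outline discovers these with k-means) and uses population standard deviation with a small epsilon. The helper names `fit_condition_stats` and `transform` are hypothetical and are not taken from the book. Statistics are fitted on training data only and then reused for test data, which is one common way to avoid test-set leakage.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_condition_stats(values, conditions):
    """Compute (mean, std) of one sensor separately for each condition ID."""
    groups = defaultdict(list)
    for value, cond in zip(values, conditions):
        groups[cond].append(value)
    return {cond: (mean(vs), pstdev(vs)) for cond, vs in groups.items()}


def transform(values, conditions, stats, eps=1e-8):
    """Z-score each value with the statistics of its own condition."""
    out = []
    for value, cond in zip(values, conditions):
        mu, sigma = stats[cond]
        out.append((value - mu) / (sigma + eps))  # eps guards against sigma = 0
    return out


# Toy sensor: condition 0 hovers near 100, condition 1 near 500.
train_vals = [99.0, 101.0, 499.0, 501.0]
train_cond = [0, 0, 1, 1]

stats = fit_condition_stats(train_vals, train_cond)  # fitted on train only
train_z = transform(train_vals, train_cond, stats)
test_z = transform([100.0, 500.0], [0, 1], stats)    # reuses train statistics
```

With a single global mean and standard deviation, the gap between the two conditions would dominate the normalized values; per-condition statistics put both regimes on the same scale so that the remaining variation reflects the sensor itself.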

Sequences, RUL Cap & Health Labels

Building the (B, 30, 17) input tensor: sliding windows, the piecewise-linear RUL cap, three-class health labels, and a reusable PyTorch Dataset.

4 sections · 55 min read
  1. Sliding-Window Sequences (length = 30) · 15 min
  2. Piecewise-Linear RUL Cap (R_max = 125) · 12 min
  3. Health-State Discretization (3 Classes) · 10 min
  4. PyTorch Dataset & DataLoader · 18 min
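
As a rough illustration of the first two sections, the sketch below builds capped RUL labels and length-30 windows for one toy engine in plain Python. The window length (30), cap (125), and feature count (17) come from the outline; everything else is an assumption: the function names are hypothetical, the last cycle is assigned RUL 0, each window is labelled with the RUL of its final cycle, and engines shorter than one window are skipped rather than padded.

```python
R_MAX = 125   # RUL cap from the chapter outline
WINDOW = 30   # window length from the chapter outline


def capped_rul(num_cycles, r_max=R_MAX):
    """Piecewise-linear RUL labels for one run-to-failure engine.

    Cycle t (0-based) has raw RUL num_cycles - 1 - t; the label stays
    flat at r_max early in life, then decays linearly to 0.
    """
    return [min(r_max, num_cycles - 1 - t) for t in range(num_cycles)]


def sliding_windows(rows, labels, window=WINDOW):
    """Return (window_rows, label) pairs, labelled by the window's last cycle."""
    pairs = []
    for end in range(window, len(rows) + 1):
        pairs.append((rows[end - window:end], labels[end - 1]))
    return pairs


# Toy engine: 200 cycles, 17 placeholder features per cycle.
rows = [[float(t)] * 17 for t in range(200)]
labels = capped_rul(len(rows))
pairs = sliding_windows(rows, labels)
```

Stacking the `window_rows` of a batch of such pairs gives the (B, 30, 17) input shape named in the chapter summary.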
Part III · 4 chapters · 16 sections

Shared Backbone
CNN, BiLSTM, Multi-Head Attention, dual-task heads.

CNN Feature Extractor

Three 1D conv layers (17 → 64 → 128 → 64) extract local degradation patterns from sensor streams.

4 sections · 54 min read
  1. 1D Convolution for Sensor Series · 12 min
  2. Three-Layer Conv Stack · 15 min
  3. BatchNorm and Dropout for Stability · 12 min
  4. PyTorch Implementation · 15 min

Bidirectional LSTM Encoder

Two-layer BiLSTM (h=256) captures long-range temporal dependencies in degradation signatures.

4 sections · 60 min read
  1. Why Bidirectional Beats Unidirectional · 12 min
  2. LSTM Cell Mathematics · 18 min
  3. Two-Layer BiLSTM Design (h = 256) · 15 min
  4. PyTorch Implementation · 15 min

Multi-Head Self-Attention

Eight-head self-attention with residual connection lets the model focus on degradation-relevant timesteps.

4 sections · 57 min read
  1. Scaled Dot-Product Attention · 15 min
  2. Multi-Head Attention with 8 Heads · 15 min
  3. Residual Connection and LayerNorm · 12 min
  4. PyTorch Implementation · 15 min

Dual-Task Heads & Model Assembly

Two task-specific heads (RUL regression + 3-class health classification) on top of a shared 32-d feature, totaling ~3.5 M parameters.

4 sections · 52 min read
  1. Shared 32-Dimensional Feature Vector · 10 min
  2. RUL Regression Head · 12 min
  3. Health Classification Head (3 Classes) · 12 min
  4. Complete DualTaskModel and Parameter Count · 18 min
Part IV · 2 chapters · 8 sections

The Core Discovery
The 500x gradient imbalance and the accuracy-safety tradeoff.

The 500x Gradient Imbalance

The empirical discovery that motivates the rest of the book: regression gradients exceed classification gradients by 500x on shared parameters.

4 sections · 60 min read
  1. Computing Per-Task Gradient Norms · 15 min
  2. Why MSE Gradients Dominate Cross-Entropy · 18 min
  3. Empirical Measurement (n = 4,120 samples) · 15 min
  4. Consequences for Shared Feature Learning · 12 min

The Accuracy-Safety Tradeoff

Why a low RMSE can coincide with a high (that is, worse) NASA score, and why the tradeoff cannot be hidden behind a single metric.

4 sections · 54 min read
  1. NASA Score: The Asymmetric Cost of Lateness · 15 min
  2. Visualizing the RMSE-NASA Pareto Frontier · 15 min
  3. Three Deployment Regimes Revisited · 12 min
  4. Mapping Regimes to AMNL, GABA, and GRACE · 12 min
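
To make the asymmetry concrete, here is a minimal sketch of the scoring function commonly used with C-MAPSS (the PHM08 challenge score), where d is predicted minus true RUL: early predictions (d < 0) are penalized with exp(-d/13) - 1 and late predictions with exp(d/10) - 1. The function names and toy values are illustrative only, and the book's exact definition should be checked against the chapter itself.

```python
import math


def nasa_score(y_true, y_pred):
    """Asymmetric score: late predictions cost more than early ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t  # d > 0 means the model promised more life than remains
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0   # early: milder penalty
        else:
            total += math.exp(d / 10.0) - 1.0    # late: steeper penalty
    return total


def rmse(y_true, y_pred):
    """Symmetric root-mean-square error for comparison."""
    sq = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


y_true = [50.0, 50.0]
early = [40.0, 40.0]   # 10 cycles early on both engines
late = [60.0, 60.0]    # 10 cycles late on both engines

same_rmse = rmse(y_true, early) == rmse(y_true, late)
late_is_worse = nasa_score(y_true, late) > nasa_score(y_true, early)
```

Both prediction sets have identical RMSE, yet the late set receives the larger NASA score, which is why a single symmetric metric cannot capture the safety side of the tradeoff.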
Part V · 3 chapters · 13 sections

Model 1: AMNL
Failure-biased weighted MSE for accuracy-first deployment.

Failure-Biased Weighted MSE

Up-weighting near-failure samples so the regressor pays attention where errors hurt the most.

4 sections · 51 min read
  1. Why Equal-Weight MSE Underweights Failure · 12 min
  2. The Linear-Decay Sample Weight w(RUL) · 15 min
  3. Choosing w_max = 2.0 (Not 5.0 or 10.0) · 12 min
  4. PyTorch Implementation · 12 min
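
One plausible reading of a linear-decay sample weight is sketched below. The outline gives only w_max = 2.0 and the RUL cap of 125; the exact form used here, w(RUL) = 1 + (w_max - 1)(1 - RUL / R_max), and the choice to average the weighted squared errors over the batch are assumptions made for illustration, as are the function names.

```python
def sample_weight(rul, w_max=2.0, r_max=125.0):
    """Weight decays linearly from w_max at RUL = 0 to 1.0 at RUL = r_max.

    Assumed form for illustration; the book's definition may differ.
    """
    frac = min(max(rul / r_max, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 + (w_max - 1.0) * (1.0 - frac)


def weighted_mse(y_true, y_pred, w_max=2.0, r_max=125.0):
    """Mean of per-sample squared errors, each scaled by w(true RUL)."""
    terms = [
        sample_weight(t, w_max, r_max) * (p - t) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


# The same 10-cycle error costs twice as much at failure as at the cap.
near_failure = weighted_mse([0.0], [10.0])
healthy = weighted_mse([125.0], [115.0])
```

Under this form a modest w_max keeps the healthy-regime samples in play while still doubling the penalty on errors made right at failure.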

AMNL Training Pipeline

Fixed 0.5/0.5 task weighting + failure-biased MSE, with the optimizer, scheduler, and EMA tricks that hold it together.

5 sections · 74 min read
  1. The Fixed 0.5/0.5 Combined Loss · 12 min
  2. Optimizer & Scheduler (AdamW + Warmup + Plateau) · 15 min
  3. Gradient Clipping and Weight EMA · 12 min
  4. Per-Dataset Dropout Tuning (Legacy-Pipeline Note) · 10 min
  5. Full Training Script Walkthrough · 25 min

AMNL Results & When to Use It

Best-in-literature RMSE on FD002/FD003, the FD001 NASA penalty, and the cross-pipeline caveat you must report.

4 sections · 49 min read
  1. Best RMSE in the Literature (FD002, FD003) · 15 min
  2. The FD001 NASA Penalty · 12 min
  3. Cross-Pipeline Caveats · 12 min
  4. When to Choose AMNL · 10 min
Part VI · 4 chapters · 16 sections

Model 2: GABA
Inverse-gradient adaptive balancing for safety-first deployment.

Inverse-Gradient Balancing: The Idea

Equalize each task's contribution to the shared backbone by giving lower weight to whichever task has bigger gradients.

4 sections · 54 min read
  1. Equalizing Task Contributions · 12 min
  2. Why Inverse-Proportional Weights Work · 15 min
  3. The Two-Task Closed Form · 12 min
  4. How GABA Differs from GradNorm · 15 min

The GABA Algorithm

Per-step gradient norms, EMA smoothing (β = 0.99), minimum floor (λ_min = 0.05), and a 100-step warmup — the full pseudocode walked end to end.

5 sections · 74 min read
  1. Per-Step Gradient Norm Computation · 15 min
  2. Exponential Moving Average (β = 0.99) · 15 min
  3. Minimum Floor and Renormalization · 12 min
  4. Warmup (First 100 Steps) · 10 min
  5. Full PyTorch Implementation · 22 min
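
The ingredients listed above can be combined in a short framework-free sketch. Only β = 0.99, λ_min = 0.05, the 100-step warmup, and the inverse-proportional idea come from the outline. Several details are assumptions: the EMA is applied to the gradient norms rather than to the weights, equal 0.5/0.5 weights are used during warmup, and the floor is enforced by clamping into [λ_min, 1 - λ_min] so the two weights always sum to 1. The class name `GabaWeights` is hypothetical, and the gradient norms are passed in as plain floats rather than computed from a model.

```python
class GabaWeights:
    """Two-task inverse-gradient weighting with EMA, floor, and warmup."""

    def __init__(self, beta=0.99, lam_min=0.05, warmup_steps=100):
        self.beta = beta
        self.lam_min = lam_min
        self.warmup_steps = warmup_steps
        self.ema_rul = None   # smoothed regression gradient norm
        self.ema_cls = None   # smoothed classification gradient norm
        self.step = 0

    def update(self, g_rul, g_cls):
        """Take this step's gradient norms; return (lam_rul, lam_cls)."""
        self.step += 1
        if self.ema_rul is None:
            self.ema_rul, self.ema_cls = g_rul, g_cls  # initialise the EMA
        else:
            b = self.beta
            self.ema_rul = b * self.ema_rul + (1.0 - b) * g_rul
            self.ema_cls = b * self.ema_cls + (1.0 - b) * g_cls

        total = self.ema_rul + self.ema_cls
        if self.step <= self.warmup_steps or total == 0.0:
            return 0.5, 0.5  # assumed warmup behaviour: equal weights

        # Inverse-proportional: the task with the larger norm gets less weight.
        lam_rul = self.ema_cls / total
        # Floor on both tasks, keeping lam_rul + lam_cls == 1.
        lam_rul = min(max(lam_rul, self.lam_min), 1.0 - self.lam_min)
        return lam_rul, 1.0 - lam_rul


# Toy run with a constant 500:1 imbalance and a short warmup.
gaba = GabaWeights(warmup_steps=2)
history = [gaba.update(500.0, 1.0) for _ in range(3)]
```

In this toy run the raw inverse-proportional weight for regression would be about 0.002, so the floor is what keeps the regression task from being switched off entirely.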

Control-Theoretic Interpretation

GABA viewed as a proportional feedback controller with an IIR filter and anti-windup floor — the property that gives it stronger stability guarantees than GradNorm.

3 sections · 39 min read
  1. GABA as a Proportional Feedback Controller · 15 min
  2. EMA as a First-Order IIR Low-Pass Filter · 12 min
  3. Floor as Anti-Windup; Bounded-Weight Guarantee · 12 min

Training GABA & Results

GABA + standard MSE: best NASA among adaptive methods, no auxiliary loss, no learned parameters, and a single λ that converges within 10 epochs.

4 sections · 52 min read
  1. GABA + Standard MSE Training Pipeline · 15 min
  2. Watching the Weights Converge · 15 min
  3. Best NASA Among Adaptive Methods · 12 min
  4. When to Choose GABA · 10 min
Part VII · 3 chapters · 11 sections

Model 3: GRACE
GABA + weighted MSE for balanced deployment.

Combining GABA + Weighted MSE

Adaptive weighting and loss-shape are orthogonal — GRACE composes them and resolves the accuracy-safety tradeoff.

3 sections · 39 min read
  1. Separation of Concerns: Adaptation vs. Loss Shape · 12 min
  2. The GRACE Loss Equation · 12 min
  3. Why It Works (and When It Does Not: FD003) · 15 min

GRACE Training Pipeline

The full reproducible pipeline: 5 seeds, AdamW, ReduceLROnPlateau, EMA, gradient clipping, and exact hyperparameters.

4 sections · 64 min read
  1. Putting It All Together · 15 min
  2. Hyperparameters and Defaults · 12 min
  3. Reproducibility (Seeds, Environment, Determinism) · 12 min
  4. Full Training Script Walkthrough · 25 min

GRACE Results & the Pareto Frontier

Best NASA on multi-condition C-MAPSS, the RMSE-NASA Pareto picture, and the only method to win on N-CMAPSS DS02.

4 sections · 55 min read
  1. Best NASA on Multi-Condition C-MAPSS · 15 min
  2. The RMSE-NASA Pareto Picture · 15 min
  3. Best Overall on N-CMAPSS DS02 · 15 min
  4. When to Choose GRACE · 10 min
Part VIII · 3 chapters · 13 sections

Baselines & SOTA
Adaptive baselines, gradient surgery, and the unified comparison.

Adaptive MTL Baselines

Fixed weighting, Homoscedastic Uncertainty, GradNorm, and DWA — the published baselines we ran inside the same framework.

5 sections · 73 min read
  1. Fixed Equal Weighting (0.5 / 0.5) · 10 min
  2. Homoscedastic Uncertainty (Kendall et al.) · 18 min
  3. GradNorm, and Why It Diverged on N-CMAPSS · 18 min
  4. Dynamic Weight Average (DWA) · 15 min
  5. Implementation & Results Summary · 12 min

Gradient-Surgery Baselines

PCGrad and CAGrad project away conflicting gradient directions. Useful, but magnitude correction beats direction correction in this domain.

3 sections · 48 min read
  1. PCGrad: Projecting Conflicting Gradients · 18 min
  2. CAGrad: Conflict-Averse Updates · 18 min
  3. Why Magnitude Beats Direction Here · 12 min

Unified Comparison vs. Published SOTA

AMNL, GABA, GRACE plus six baselines vs. DKAMFormer, DMHA-ATCN, STAR, DVGTformer, and the rest of the field.

5 sections · 81 min read
  1. The Reference Set (DKAMFormer, DMHA-ATCN, STAR, DVGTformer) · 15 min
  2. Our Three Models vs. SOTA on C-MAPSS · 18 min
  3. Multi-Condition Dominance (FD002, FD004) · 15 min
  4. N-CMAPSS: GRACE vs. DKAMFormer · 15 min
  5. The Unified Comparison Table & Decision Matrix · 18 min
Part IX · 2 chapters · 9 sections

Ablations & Statistics
Architecture, normalization, robustness, and statistical tests.

Architecture Ablation

Five backbones (CNN-only, MLP, Transformer, LSTM-only, Full) and the result that the framework benefit is architecture-agnostic.

4 sections · 52 min read
  1. Five Backbones Compared · 15 min
  2. Architecture-Agnostic Result (Kruskal-Wallis p = 0.605) · 15 min
  3. The Tiny Transformer (208K Parameters) · 12 min
  4. Implications for Backbone Choice · 10 min

Normalization, Robustness & Statistical Tests

Per-condition vs. global normalization, GABA hyperparameter sweeps, and the formal Friedman / Wilcoxon tests behind every claim.

5 sections · 75 min read
  1. Per-Condition vs. Global Normalization (40 Runs) · 15 min
  2. GABA β and λ_min Sensitivity Sweeps · 15 min
  3. Friedman Tests (RMSE vs. NASA) · 18 min
  4. Wilcoxon Pairwise + Holm-Bonferroni · 15 min
  5. Effect Sizes & Power Analysis · 12 min
Part X · 1 chapter · 5 sections

Production
Edge deployment, model selection, and future directions.

Deployment, Model Selection & Future Directions

Edge deployment profile, ONNX export, the model-selection decision tree, and where this research goes next.

5 sections · 75 min read
  1. Edge Deployment Profile (3.5M Parameters, 4.6 ms Latency) · 15 min
  2. ONNX Export & Real-Time Inference · 18 min
  3. The Model-Selection Decision Tree · 12 min
  4. Extensions: Bearings, Batteries, and Beyond · 15 min
  5. Limitations & Open Research Questions · 15 min
The capstone

Where the book lands in practice.

Chapter 14 · Failure-Biased Weighted MSE · 4 sections
Chapter 15 · AMNL Training Pipeline · 5 sections
Chapter 16 · AMNL Results & When to Use It · 4 sections
Chapter 17 · Inverse-Gradient Balancing: The Idea · 4 sections
Chapter 18 · The GABA Algorithm · 5 sections
Chapter 19 · Control-Theoretic Interpretation · 3 sections
Chapter 20 · Training GABA & Results · 4 sections
Chapter 21 · Combining GABA + Weighted MSE · 3 sections
Chapter 22 · GRACE Training Pipeline · 4 sections
Chapter 23 · GRACE Results & the Pareto Frontier · 4 sections

121 sections. Begin with one.

Chapter 1 — Predictive Maintenance & RUL — is where every reader starts.