Learning Objectives
By the end of this section, you will:
- Trace the evolution of RUL prediction methods from traditional ML to modern deep learning
- Understand LSTM-based approaches and their success in capturing temporal dependencies
- Know how CNN-LSTM hybrids combine local feature extraction with sequence modeling
- Appreciate transformer architectures like DVGTformer and DKAMFormer that achieved previous SOTA
- Identify the critical limitation: inconsistent performance across operating conditions
- See where AMNL fits in this evolution and why its approach is novel
Why This Matters: Understanding the landscape of existing methods helps you appreciate why certain design decisions were made in AMNL. Every architecture choice has predecessors, and knowing what worked (and what didn't) guides the development of better solutions.
The Evolution of RUL Methods
RUL prediction has evolved through several distinct phases, each addressing limitations of previous approaches:
| Era | Methods | RMSE Range (FD001) | Key Innovation |
|---|---|---|---|
| Pre-2015 | Physics-based, Classical ML | 20-30+ | Domain knowledge, hand-crafted features |
| 2015-2017 | LSTM, GRU | 12-17 | End-to-end temporal modeling |
| 2016-2018 | CNN-LSTM hybrids | 11-14 | Local feature extraction + sequence modeling |
| 2017-2020 | Attention mechanisms | 10-13 | Adaptive focus on relevant timesteps |
| 2021-2024 | Transformers (DVGTformer, DKAMFormer) | 10.5-11.5 | Pure attention, domain knowledge integration |
| 2024-Present | AMNL (Ours) | 8.69-10.43 | Equal-weight multi-task learning |
The Narrowing Gap
On FD001 (the easiest dataset), methods have converged to similar performance—improvements are measured in fractions of RMSE points. But on complex datasets like FD002 and FD004, the gap remains enormous:
| Method Type | FD001 RMSE | FD002 RMSE | FD004 RMSE | FD002/FD001 Ratio |
|---|---|---|---|---|
| Classical ML | ~15 | ~22 | ~25 | 1.47× |
| LSTM | ~13 | ~18 | ~20 | 1.38× |
| CNN-LSTM-Attention | ~11 | ~14 | ~16 | 1.27× |
| DKAMFormer | 10.68 | 10.70 | 12.89 | 1.00× |
| AMNL (Ours) | 10.43 | 6.74 | 8.16 | 0.65× |
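The ratio column is simple arithmetic, but it is worth making explicit: a hypothetical helper (illustrative, not part of any benchmark suite) reproduces the table's values.

```python
def multi_condition_ratio(rmse_fd002, rmse_fd001):
    """FD002/FD001 RMSE ratio. A value > 1 means the method degrades on
    the multi-condition dataset; < 1 means it actually does better there."""
    return round(rmse_fd002 / rmse_fd001, 2)

# AMNL is the only entry in the table whose ratio falls below 1.
amnl_ratio = multi_condition_ratio(6.74, 10.43)       # 0.65
classical_ratio = multi_condition_ratio(22.0, 15.0)   # 1.47
```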
The Multi-Condition Gap
The FD002/FD001 ratio makes the gap explicit: every prior method performs worse on the multi-condition dataset than on the single-condition one, while AMNL is the only method whose FD002 error is actually lower than its FD001 error (ratio 0.65×).
LSTM-Based Approaches
Long Short-Term Memory networks marked the first major breakthrough in data-driven RUL prediction. Unlike traditional methods that required manual feature engineering, LSTMs could learn temporal patterns directly from raw sensor sequences.
How LSTMs Helped
- Long-range dependencies: The cell state mechanism carries information across many timesteps, essential for tracking gradual degradation
- Adaptive memory: Gates learn which information to remember and forget based on training data
- End-to-end training: No need for hand-crafted features—the network learns representations optimized for RUL
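The gating mechanism described above can be made concrete with a deliberately minimal sketch: one LSTM step with scalar input, hidden state, and cell state. This is illustrative only (real LSTMs operate on vectors with weight matrices), but the gate algebra is the same.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, squashing a pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One LSTM step with scalar states, exposing the gating math.
    `weights` maps gate name -> (w_x, w_h, b), all hypothetical values."""
    def gate(name, act):
        w_x, w_h, b = weights[name]
        return act(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)   # fraction of the old cell state to keep
    i = gate("input", sigmoid)    # how much of the candidate to write
    g = gate("cand", math.tanh)   # candidate cell content
    o = gate("output", sigmoid)   # how much of the cell state to expose
    c_new = f * c_prev + i * g    # cell state: the long-range memory path
    h_new = o * math.tanh(c_new)  # hidden state emitted at this timestep
    return h_new, c_new
```

The additive update `f * c_prev + i * g` is why the cell state can carry slow degradation signals across many timesteps: information is scaled, not repeatedly transformed.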
Early LSTM Results
| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Zheng et al. | 2017 | 12.56 | 16.18 | 12.10 | 17.93 |
| Zhang et al. | 2018 | 13.65 | 15.72 | 13.43 | 16.21 |
| BiLSTM baseline | 2019 | 13.18 | 17.04 | 14.32 | 18.45 |
Limitations of Pure LSTM Approaches
- No local pattern extraction: LSTMs process one timestep at a time, missing local sensor patterns
- Sequential processing: Cannot parallelize across timesteps, which slows training
- Vanishing gradients: Despite gating, very long sequences still pose challenges
- Multi-condition struggle: Performance degrades significantly on FD002/FD004
CNN-LSTM Hybrid Architectures
The insight that convolutional networks excel at local pattern extraction led to hybrid architectures: CNNs for features, LSTMs for temporal modeling.
Architecture Pattern
```
Raw Sensors → CNN Layers → Feature Maps → LSTM Layers → RUL Prediction
     ↓             ↓             ↓             ↓              ↓
  (30, 17)       Local        (30, 64)     Temporal       Scalar
                Patterns                    Context
```
Why This Combination Works
- CNNs extract local patterns: Sensor spikes, local trends, noise filtering
- LSTMs model sequence: How patterns evolve over time
- Better gradient flow: CNN outputs are smoother, making LSTM training easier
- Reduced sequence length: Pooling layers can reduce temporal dimension
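The "local pattern extraction" role of the CNN front-end can be illustrated without any framework: a valid-mode 1-D convolution slides a small kernel over a sensor signal and produces one activation per local window. This is a sketch of the idea, not the layers any particular paper uses.

```python
def conv1d_valid(signal, kernel):
    """'Valid' 1-D convolution (really cross-correlation, as in deep
    learning libraries): one activation per local window of the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference kernel responds to local changes such as a sensor spike,
# exactly the kind of local pattern a CNN front-end hands to the LSTM.
spike = [0.0, 0.0, 1.0, 0.0, 0.0]
edges = conv1d_valid(spike, [-1.0, 1.0])  # nonzero only around the spike
```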
CNN-LSTM Results
| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Li et al. (DCNN) | 2018 | 12.61 | 16.94 | 12.64 | 17.30 |
| Zhao et al. | 2019 | 11.37 | 15.12 | 12.45 | 16.78 |
| Enhanced CNN-LSTM | 2022 | 10.93 | 14.45 | 11.71 | 16.64 |
Adding Attention
Attention mechanisms provided the next improvement by allowing the model to focus on the most diagnostic timesteps:
c = Σ_t α_t h_t,   with   α_t = exp(e_t) / Σ_k exp(e_k)

where α_t is the attention weight for timestep t, e_t is its learned relevance score, h_t is the hidden state, and c is the context vector.
CNN-LSTM-Attention models achieved RMSE of 14.45 and 16.64 on FD002 and FD004 respectively—substantial improvements but still far from solving the multi-condition challenge.
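The weighted-context computation reduces to a softmax over per-timestep scores followed by a weighted sum. The sketch below uses scalar hidden states for clarity; real models apply the same arithmetic per feature dimension.

```python
import math

def attention_context(hidden_states, scores):
    """Turn per-timestep relevance scores into softmax weights, then
    form the context vector as the weighted sum of hidden states."""
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]           # attention weights, sum to 1
    context = sum(a * h for a, h in zip(alphas, hidden_states))
    return alphas, context
```

With equal scores the weights are uniform; raising one score concentrates nearly all the weight on that timestep, which is what "adaptive focus on relevant timesteps" means in practice.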
Transformer-Based Methods
The Transformer architecture, originally developed for NLP, brought pure attention-based sequence modeling to time series. Unlike RNNs, Transformers process all timesteps in parallel through self-attention.
Key Transformer Methods
DVGTformer (2024)
The Dual-View Graph Transformer introduced:
- Temporal view: Self-attention across timesteps
- Spatial view: Graph attention across sensors
- Dual fusion: Combining both views for comprehensive representation
DKAMFormer (2025)
The Domain Knowledge-Augmented Multiscale Transformer represented the previous state-of-the-art:
- Knowledge graphs: Incorporated domain knowledge about sensor relationships
- Multiscale processing: Parallel paths for different temporal resolutions
- Domain embeddings: Encoded physical knowledge into the architecture
Transformer Results (Previous SOTA)
| Method | Params | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| DVGTformer | ~5M | 11.49 | 19.77 | 11.71 | 20.67 |
| DKAMFormer | ~8M | 10.68 | 10.70 | 10.52 | 12.89 |
| AMNL (Ours) | ~0.8M | 10.43 | 6.74 | 9.51 | 8.16 |
Model Complexity
Notably, AMNL achieves these results with roughly 0.8M parameters, about 10× fewer than DKAMFormer's ~8M, suggesting the gains come from the training formulation rather than from scale.
Multi-Task Learning Approaches
Multi-task learning (MTL) has been explored in prognostics as a way to leverage auxiliary tasks for improved feature learning. The core insight is that health state assessment and RUL prediction share underlying features.
Common MTL Architecture
```
Shared Encoder (CNN/LSTM/Transformer)
                ↓
         Shared Features
            ↙      ↘
     RUL Head    Health Head
         ↓            ↓
    ŷ_RUL ∈ ℝ   ŷ_health ∈ {0,1,2}
```
Conventional Wisdom: Asymmetric Weighting
Prior MTL work in prognostics almost universally used asymmetric task weighting, with the primary RUL task receiving higher weight:
Typical settings in prior work:
| Paper | λ_RUL | λ_health | Rationale |
|---|---|---|---|
| Liu et al. (2021) | 0.9 | 0.1 | Primary task should dominate |
| Xu et al. (2023) | 0.8 | 0.2 | Auxiliary as regularizer |
| Wang et al. (2023) | 0.7 | 0.3 | Balanced but RUL-focused |
| AMNL (Ours) | 0.5 | 0.5 | Equal importance (novel) |
The Gradient Imbalance Problem
Asymmetric weighting creates gradient imbalance. When λ_RUL ≫ λ_health, the shared encoder is dominated by RUL gradients, and the health classification task has minimal influence on the learned representations.
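A toy example makes the imbalance concrete. Assume a single shared parameter pulled toward different targets by two squared-error losses (this is an illustration of the gradient arithmetic, not AMNL's actual losses):

```python
def shared_gradients(w, t_rul, t_health, lam_rul, lam_health):
    """Per-task gradient contributions to a single shared parameter w for
    L = lam_rul * (w - t_rul)**2 + lam_health * (w - t_health)**2."""
    g_rul = 2.0 * lam_rul * (w - t_rul)
    g_health = 2.0 * lam_health * (w - t_health)
    return g_rul, g_health

# With a 0.9/0.1 split and equally wrong predictions, the RUL task pulls
# the shared parameter 9x harder than the health task; at 0.5/0.5 the
# pulls are balanced.
g_r, g_h = shared_gradients(0.0, 1.0, -1.0, 0.9, 0.1)
```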
Critical Limitations of Existing Methods
Despite years of progress, existing methods share several critical limitations that prevent truly robust RUL prediction:
1. Inconsistent Multi-Condition Performance
No prior method achieves consistent performance across all datasets. Methods that excel on FD001 often fail on FD002/FD004:
| Method | FD001 Rank (at publication) | FD002 Rank (at publication) | Consistent? |
|---|---|---|---|
| DVGTformer | 1st (11.49) | 6th (19.77) | No ❌ |
| CNN-LSTM-Attention | 3rd (10.93) | 4th (14.45) | No ❌ |
| DKAMFormer | 1st (10.68) | 1st (10.70) | Yes ✓ |
| AMNL (Ours) | 1st (10.43) | 1st (6.74) | Yes ✓ |
DKAMFormer was the first to achieve relatively consistent rankings, but its FD002/FD004 numbers remain significantly higher than FD001/FD003.
2. The Condition Confusion Problem
On multi-condition datasets, models often confuse operating condition effects with degradation. A sensor reading that indicates degradation in one condition might be perfectly normal in another.
Prior solutions include:
- Per-condition normalization: Normalize features separately for each operating condition (helps, but doesn't fully solve)
- Domain adaptation: Explicitly align features across conditions (requires target domain data)
- Condition embeddings: Add condition ID as input feature (limits generalization)
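Per-condition normalization, the first of these workarounds, is easy to sketch: group readings by operating condition and z-score each value against its own condition's statistics. A minimal version, assuming discrete condition IDs are already known:

```python
from collections import defaultdict

def per_condition_zscore(readings):
    """Z-score each sensor value within its own operating condition.
    `readings` is a list of (condition_id, value) pairs."""
    groups = defaultdict(list)
    for cond, val in readings:
        groups[cond].append(val)
    stats = {}
    for cond, vals in groups.items():
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        stats[cond] = (mean, std if std > 0 else 1.0)  # guard constant sensors
    return [(cond, (val - stats[cond][0]) / stats[cond][1])
            for cond, val in readings]
```

After this transform, readings from different conditions live on a comparable scale, which helps but, as noted above, does not remove condition effects that interact with degradation.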
3. Auxiliary Task Underutilization
By weighting auxiliary tasks low (0.1-0.3), prior MTL approaches fail to fully leverage their regularization potential. The health classification task provides clean categorical supervision that could guide feature learning—but only if given sufficient influence.
4. Overfitting to Dataset Characteristics
Methods tuned on FD001 (single condition, single fault) often overfit to its specific characteristics:
- Assume a single degradation pattern
- Learn condition-specific features that don't generalize
- Optimize hyperparameters that work only for simple scenarios
The Generalization Problem
Cross-dataset experiments reveal a troubling pattern: methods trained on one dataset often show negative transfer when applied to others. A model trained on FD002 (6 conditions) may perform worse on FD001 (1 condition) than a model trained directly on FD001.
How AMNL Addresses These Limitations
AMNL was designed specifically to overcome these limitations through a single key insight: equal task weighting.
The Core Innovation
Instead of down-weighting the auxiliary task, AMNL fixes both task weights at 0.5:

L_total = 0.5 · L_RUL + 0.5 · L_health

This simple change has profound effects:
1. Condition-Invariant Representations
Health states are defined by RUL thresholds, not operating conditions:
- Normal: RUL > 80 (regardless of condition)
- Early Degradation: 30 < RUL ≤ 80 (regardless of condition)
- Critical: RUL ≤ 30 (regardless of condition)
By forcing the model to predict health states accurately, we implicitly enforce condition invariance. The model cannot "cheat" by using condition-specific features—it must learn true degradation signatures.
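The thresholds above reduce to a small labeling function. The class indices below are an illustrative convention; only the RUL cutoffs (80 and 30) come from the text.

```python
def health_state(rul, early_thresh=80, critical_thresh=30):
    """Map an RUL value to a discrete health state using the text's
    thresholds: > 80 Normal, 30 < RUL <= 80 Early Degradation, <= 30 Critical.
    Operating condition is deliberately NOT an input."""
    if rul > early_thresh:
        return 0  # Normal
    if rul > critical_thresh:
        return 1  # Early Degradation
    return 2  # Critical
```

Because the label depends only on RUL, any condition-specific shortcut the encoder might learn is useless for this task, which is the mechanism behind the claimed condition invariance.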
2. Balanced Gradient Flow
Equal weighting ensures both tasks contribute meaningfully to the shared representation:
- RUL gradients encourage fine-grained regression accuracy
- Health gradients encourage categorical separation between degradation phases
- Together, they create features that are both discriminative and continuous
3. Implicit Regularization
The classification task acts as a strong regularizer:
- Prevents overfitting to dataset-specific noise in RUL labels
- Encourages learning of phase-level features (Normal → Degradation → Critical)
- Forces the model to maintain meaningful feature structure
Results: AMNL vs Prior SOTA
| Metric | DKAMFormer | AMNL | Improvement |
|---|---|---|---|
| FD001 RMSE | 10.68 | 10.43 | +2.3% |
| FD002 RMSE | 10.70 | 6.74 | +37.0% |
| FD003 RMSE | 10.52 | 9.51 | +9.6% |
| FD004 RMSE | 12.89 | 8.16 | +36.7% |
| Multi-condition avg | 11.80 | 7.45 | +36.9% |
| Parameters | ~8M | ~0.8M | 10× smaller |
The Counterintuitive Discovery: By treating the auxiliary task as equally important as the primary task, AMNL achieves better primary task performance than methods that prioritize it. This challenges fundamental assumptions in multi-task learning.
Summary
In this section, we have surveyed the landscape of RUL prediction methods:
- LSTM-based methods (2015-2017) established deep learning baselines with RMSE ~12-17 on FD001
- CNN-LSTM hybrids (2016-2018) combined local feature extraction with temporal modeling, reaching RMSE ~11-14
- Attention mechanisms enabled adaptive focus, improving to RMSE ~10-13
- Transformer methods (2021-2024) like DKAMFormer achieved previous SOTA at RMSE ~10.5-13
- Critical limitation: all prior methods show degraded performance on multi-condition datasets (FD002/FD004)
- AMNL's innovation: equal task weighting (0.5/0.5) creates condition-invariant representations
| Evolution Stage | Key Insight | Limitation Addressed |
|---|---|---|
| LSTM | Temporal modeling | Manual feature engineering |
| CNN-LSTM | Local + temporal | Local pattern extraction |
| Attention | Adaptive focus | Fixed attention patterns |
| Transformers | Parallel attention | Sequential processing |
| AMNL | Equal task weighting | Multi-condition generalization |
Looking Ahead: With Chapter 1 complete, you now have a solid foundation in predictive maintenance, the RUL prediction problem, deep learning architectures, the C-MAPSS benchmark, and the state of the art. In Part II, we will dive into the data—exploring the C-MAPSS dataset in detail and building the preprocessing pipeline that enables our model to learn effectively.
You are now ready to move beyond theory and start working with real data and code.