Chapter 1: Introduction to Predictive Maintenance

Section 5: State-of-the-Art Methods and Their Limitations

Learning Objectives

By the end of this section, you will:

  1. Trace the evolution of RUL prediction methods from traditional ML to modern deep learning
  2. Understand LSTM-based approaches and their success in capturing temporal dependencies
  3. Know how CNN-LSTM hybrids combine local feature extraction with sequence modeling
  4. Appreciate transformer architectures like DVGTformer and DKAMFormer that achieved previous SOTA
  5. Identify the critical limitation: inconsistent performance across operating conditions
  6. See where AMNL fits in this evolution and why its approach is novel

Why This Matters: Understanding the landscape of existing methods helps you appreciate why certain design decisions were made in AMNL. Every architecture choice has predecessors, and knowing what worked (and what didn't) guides the development of better solutions.

The Evolution of RUL Methods

RUL prediction has evolved through several distinct phases, each addressing limitations of previous approaches:

| Era | Methods | RMSE Range (FD001) | Key Innovation |
|---|---|---|---|
| Pre-2015 | Physics-based, classical ML | 20-30+ | Domain knowledge, hand-crafted features |
| 2015-2017 | LSTM, GRU | 12-17 | End-to-end temporal modeling |
| 2016-2018 | CNN-LSTM hybrids | 11-14 | Local feature extraction + sequence modeling |
| 2017-2020 | Attention mechanisms | 10-13 | Adaptive focus on relevant timesteps |
| 2021-2024 | Transformers (DVGTformer, DKAMFormer) | 10.5-11.5 | Pure attention, domain knowledge integration |
| 2024-Present | AMNL (Ours) | 8.69-10.43 | Equal-weight multi-task learning |

The Narrowing Gap

On FD001 (the easiest dataset), methods have converged to similar performance—improvements are measured in fractions of RMSE points. But on complex datasets like FD002 and FD004, the gap remains enormous:

| Method Type | FD001 RMSE | FD002 RMSE | FD004 RMSE | FD002/FD001 Ratio |
|---|---|---|---|---|
| Classical ML | ~15 | ~22 | ~25 | 1.47× |
| LSTM | ~13 | ~18 | ~20 | 1.38× |
| CNN-LSTM-Attention | ~11 | ~14 | ~16 | 1.27× |
| DKAMFormer | 10.68 | 10.70 | 12.89 | 1.00× |
| AMNL (Ours) | 10.43 | 6.74 | 8.16 | 0.65× |

The Multi-Condition Gap

Most methods show worse performance on multi-condition datasets (FD002, FD004). AMNL is the first method where FD002 RMSE is actually lower than FD001 RMSE—a complete inversion of the expected difficulty ordering.

LSTM-Based Approaches

Long Short-Term Memory networks marked the first major breakthrough in data-driven RUL prediction. Unlike traditional methods that required manual feature engineering, LSTMs could learn temporal patterns directly from raw sensor sequences.

How LSTMs Helped

  • Long-range dependencies: The cell state mechanism carries information across many timesteps, essential for tracking gradual degradation
  • Adaptive memory: Gates learn which information to remember and forget based on training data
  • End-to-end training: No need for hand-crafted features—the network learns representations optimized for RUL
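The gating behavior described above can be sketched in a few lines of NumPy. This is an illustrative single-cell forward pass, not any paper's implementation; the window shape (30 timesteps × 17 sensors) matches the C-MAPSS setup used throughout this book, while the hidden size and random weights are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: gates decide what to forget, write, and expose.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    f = sigmoid(z[:H])        # forget gate: what to drop from the cell state
    i = sigmoid(z[H:2*H])     # input gate: how much new information to write
    g = np.tanh(z[2*H:3*H])   # candidate cell update
    o = sigmoid(z[3*H:])      # output gate: what to expose as hidden state
    c = f * c_prev + i * g    # cell state carries long-range degradation info
    h = o * np.tanh(c)
    return h, c

# Tiny demo: run a window of 30 timesteps of 17 sensors through one cell
rng = np.random.default_rng(0)
D, H = 17, 8
W = rng.normal(size=(4*H, D)) * 0.1
U = rng.normal(size=(4*H, H)) * 0.1
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(30, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (8,)
```

The final hidden state `h` is what an RUL regression head would consume; the cell state `c` is the long-range memory that makes gradual degradation trackable.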

Early LSTM Results

| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Zheng et al. | 2017 | 12.56 | 16.18 | 12.10 | 17.93 |
| Zhang et al. | 2018 | 13.65 | 15.72 | 13.43 | 16.21 |
| BiLSTM baseline | 2019 | 13.18 | 17.04 | 14.32 | 18.45 |

Limitations of Pure LSTM Approaches

  • No local pattern extraction: LSTMs process one timestep at a time, missing local sensor patterns
  • Sequential processing: Cannot parallelize across timesteps, slow training
  • Vanishing gradients: Despite gating, very long sequences still pose challenges
  • Multi-condition struggle: Performance degrades significantly on FD002/FD004

CNN-LSTM Hybrid Architectures

The insight that convolutional networks excel at local pattern extraction led to hybrid architectures: CNNs for features, LSTMs for temporal modeling.

Architecture Pattern

```text
Raw Sensors → CNN Layers → Feature Maps → LSTM Layers → RUL Prediction
     ↓             ↓             ↓             ↓              ↓
 (30, 17)       Local       (30, 64)      Temporal        Scalar
               Patterns                    Context
```

Why This Combination Works

  1. CNNs extract local patterns: Sensor spikes, local trends, noise filtering
  2. LSTMs model sequence: How patterns evolve over time
  3. Better gradient flow: CNN outputs are smoother, making LSTM training easier
  4. Reduced sequence length: Pooling layers can reduce temporal dimension
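Point 1 (and the sequence-shortening effect noted in point 4) can be illustrated with a minimal NumPy sketch of 1D convolution over a sensor window. The kernel count and width here are arbitrary illustrative choices, not values from any of the cited papers.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """1D 'valid' convolution over time.
    x: (T, D) sensor window; kernels: (K, k, D); returns (T-k+1, K) feature maps."""
    T, D = x.shape
    K, k, _ = kernels.shape
    out = np.empty((T - k + 1, K))
    for t in range(T - k + 1):
        patch = x[t:t+k]  # local slice of k consecutive timesteps
        out[t] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
window = rng.normal(size=(30, 17))  # (timesteps, sensors)
feats = conv1d_valid(window, rng.normal(size=(64, 5, 17)) * 0.05, np.zeros(64))
print(feats.shape)  # (26, 64)
```

The 30-step raw window becomes a 26-step sequence of 64 richer features, which is what the LSTM stage would then consume.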

CNN-LSTM Results

| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Li et al. (DCNN) | 2018 | 12.61 | 16.94 | 12.64 | 17.30 |
| Zhao et al. | 2019 | 11.37 | 15.12 | 12.45 | 16.78 |
| Enhanced CNN-LSTM | 2022 | 10.93 | 14.45 | 11.71 | 16.64 |

Adding Attention

Attention mechanisms provided the next improvement by allowing the model to focus on the most diagnostic timesteps:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \quad \mathbf{c} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t$$

where $\alpha_t$ is the attention weight for timestep $t$ (computed from a learned relevance score $e_t$) and $\mathbf{c}$ is the context vector.
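This attention pooling is short enough to sketch directly in NumPy. The dot-product scorer used here (a learned vector scoring each hidden state) is one common choice; the actual scoring function varies across papers.

```python
import numpy as np

def attention_pool(h, w):
    """h: (T, H) hidden states from the LSTM; w: (H,) learned scoring vector.
    Computes e_t = w·h_t, alpha = softmax(e), c = sum_t alpha_t * h_t."""
    e = h @ w
    e = e - e.max()  # subtract max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()
    c = alpha @ h    # context vector: weighted mix of all timesteps
    return alpha, c

rng = np.random.default_rng(0)
h = rng.normal(size=(30, 64))
alpha, c = attention_pool(h, rng.normal(size=64))
print(alpha.shape, c.shape)  # alpha sums to ~1; c mixes all 30 timesteps
```

The regression head then reads `c` instead of only the last hidden state, so diagnostic timesteps anywhere in the window can dominate the prediction.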

CNN-LSTM-Attention models achieved RMSE of 14.45 and 16.64 on FD002 and FD004 respectively—substantial improvements but still far from solving the multi-condition challenge.


Transformer-Based Methods

The Transformer architecture, originally developed for NLP, brought pure attention-based sequence modeling to time series. Unlike RNNs, Transformers process all timesteps in parallel through self-attention.
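A single-head self-attention sketch in NumPy makes the parallelism concrete: the T×T score matrix relates every pair of timesteps at once, with no recurrence. Dimensions are arbitrary, and this deliberately omits multi-head projections, positional encoding, and masking.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every timestep attends to every other,
    computed in parallel. X: (T, D); Wq/Wk/Wv: (D, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (T, T) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # softmax over timesteps
    return A @ V                                  # (T, d) contextualized features

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 17))  # one C-MAPSS-style window
out = self_attention(X, *(rng.normal(size=(17, 16)) * 0.1 for _ in range(3)))
print(out.shape)  # (30, 16)
```

Unlike an LSTM, nothing here iterates over time, which is why Transformers train efficiently on long windows.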

Key Transformer Methods

DVGTformer (2024)

The Dual-View Graph Transformer introduced:

  • Temporal view: Self-attention across timesteps
  • Spatial view: Graph attention across sensors
  • Dual fusion: Combining both views for comprehensive representation

DKAMFormer (2025)

The Domain Knowledge-Augmented Multiscale Transformer represented the previous state-of-the-art:

  • Knowledge graphs: Incorporated domain knowledge about sensor relationships
  • Multiscale processing: Parallel paths for different temporal resolutions
  • Domain embeddings: Encoded physical knowledge into the architecture

Transformer Results (Previous SOTA)

| Method | Params | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| DVGTformer | ~5M | 11.49 | 19.77 | 11.71 | 20.67 |
| DKAMFormer | ~8M | 10.68 | 10.70 | 10.52 | 12.89 |
| AMNL (Ours) | ~0.8M | 10.43 | 6.74 | 9.51 | 8.16 |

Model Complexity

AMNL achieves better results with 10× fewer parameters than DKAMFormer. This suggests that the performance gains come from the training approach (equal task weighting), not just model capacity.

Multi-Task Learning Approaches

Multi-task learning (MTL) has been explored in prognostics as a way to leverage auxiliary tasks for improved feature learning. The core insight is that health state assessment and RUL prediction share underlying features.

Common MTL Architecture

```text
Shared Encoder (CNN/LSTM/Transformer)
                ↓
         Shared Features
          ↙          ↘
    RUL Head      Health Head
       ↓               ↓
  ŷ_RUL ∈ ℝ    ŷ_health ∈ {0,1,2}
```
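The shared-encoder/two-head pattern reduces to a single forward pass producing two outputs. The linear encoder and heads below are hypothetical stand-ins for whatever backbone a given paper uses; only the shape of the computation is the point.

```python
import numpy as np

def mtl_forward(x, enc_W, rul_w, health_W):
    """Hypothetical linear stand-ins for the shared encoder and the two heads."""
    z = np.maximum(enc_W @ x, 0.0)       # shared features (both tasks read these)
    rul = float(rul_w @ z)               # regression head: scalar RUL estimate
    logits = health_W @ z                # classification head: 3 health states
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the 3 states
    return rul, probs

rng = np.random.default_rng(0)
x = rng.normal(size=510)                 # e.g. a flattened (30, 17) window
rul, probs = mtl_forward(
    x,
    rng.normal(size=(32, 510)) * 0.05,   # shared encoder weights
    rng.normal(size=32),                 # RUL head
    rng.normal(size=(3, 32)),            # health head
)
print(round(probs.sum(), 6))  # 1.0
```

Both losses backpropagate into `enc_W`, which is exactly why the choice of task weights determines what the shared features learn.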

Conventional Wisdom: Asymmetric Weighting

Prior MTL work in prognostics almost universally used asymmetric task weighting, with the primary RUL task receiving higher weight:

$$\mathcal{L}_{\text{MTL}} = \lambda_{\text{RUL}} \cdot \mathcal{L}_{\text{RUL}} + \lambda_{\text{health}} \cdot \mathcal{L}_{\text{health}}$$

Typical settings in prior work:

| Paper | λ_RUL | λ_health | Rationale |
|---|---|---|---|
| Liu et al. (2021) | 0.9 | 0.1 | Primary task should dominate |
| Xu et al. (2023) | 0.8 | 0.2 | Auxiliary as regularizer |
| Wang et al. (2023) | 0.7 | 0.3 | Balanced but RUL-focused |
| AMNL (Ours) | 0.5 | 0.5 | Equal importance (novel) |

The Gradient Imbalance Problem

Asymmetric weighting creates gradient imbalance. When $\lambda_{\text{RUL}} = 0.9$, the shared encoder is dominated by RUL gradients, and the health classification task has minimal influence on the learned representations.
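Because gradients are linear in the loss weights, each task's contribution to the shared-encoder update is scaled directly by its λ. A toy calculation (assuming, for illustration, equal raw per-task gradient magnitudes) makes the imbalance explicit:

```python
# Each task's gradient contribution to the shared encoder is scaled by its
# lambda, so asymmetric weights let the RUL term dominate the update.
grad_rul, grad_health = 1.0, 1.0  # assume equal raw per-task gradient magnitudes
for lam_rul, lam_health in [(0.9, 0.1), (0.5, 0.5)]:
    total = lam_rul * grad_rul + lam_health * grad_health
    share = lam_health * grad_health / total
    print(f"lambda_health={lam_health}: health task drives {share:.0%} of the update")
```

With 0.9/0.1 weighting the health task supplies only 10% of the encoder's update; with 0.5/0.5 it supplies half.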


Critical Limitations of Existing Methods

Despite years of progress, existing methods share several critical limitations that prevent truly robust RUL prediction:

1. Inconsistent Multi-Condition Performance

No prior method achieves consistent performance across all datasets. Methods that excel on FD001 often fail on FD002/FD004:

| Method | FD001 Rank | FD002 Rank | Consistent? |
|---|---|---|---|
| DVGTformer | 1st (11.49) | 6th (19.77) | No ❌ |
| CNN-LSTM-Attention | 3rd (10.93) | 4th (14.45) | No ❌ |
| DKAMFormer | 1st (10.68) | 1st (10.70) | Yes ✓ |
| AMNL (Ours) | 1st (10.43) | 1st (6.74) | Yes ✓ |

DKAMFormer was the first to achieve relatively consistent rankings, but its FD002/FD004 numbers remain significantly higher than FD001/FD003.

2. The Condition Confusion Problem

On multi-condition datasets, models often confuse operating condition effects with degradation. A sensor reading that indicates degradation in one condition might be perfectly normal in another.

$$P(\text{degradation} \mid \mathbf{x}) \neq P(\text{degradation} \mid \mathbf{x}, \text{condition})$$

Prior solutions include:

  • Per-condition normalization: Normalize features separately for each operating condition (helps, but doesn't fully solve)
  • Domain adaptation: Explicitly align features across conditions (requires target domain data)
  • Condition embeddings: Add condition ID as input feature (limits generalization)
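The first of these, per-condition normalization, is simple to sketch: standardize each sensor separately within each operating-condition group. The synthetic data below (six conditions, each shifting every sensor) is purely illustrative.

```python
import numpy as np

def per_condition_normalize(x, cond):
    """Standardize each sensor within each operating condition.
    x: (N, D) sensor readings; cond: (N,) integer condition IDs."""
    out = np.empty_like(x, dtype=float)
    for c in np.unique(cond):
        m = cond == c
        mu = x[m].mean(axis=0)
        sigma = x[m].std(axis=0) + 1e-8       # guard against zero variance
        out[m] = (x[m] - mu) / sigma          # zero mean / unit variance per group
    return out

rng = np.random.default_rng(0)
cond = rng.integers(0, 6, size=1000)          # six operating conditions, as in FD002
x = rng.normal(size=(1000, 17)) + cond[:, None] * 5.0  # condition shifts every sensor
z = per_condition_normalize(x, cond)
print(np.allclose(z[cond == 0].mean(axis=0), 0.0, atol=1e-6))  # True
```

After this transform, the large condition-induced offsets vanish, but subtler condition-dependent degradation dynamics remain, which is why normalization alone does not solve the problem.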

3. Auxiliary Task Underutilization

By weighting auxiliary tasks low (0.1-0.3), prior MTL approaches fail to fully leverage their regularization potential. The health classification task provides clean categorical supervision that could guide feature learning—but only if given sufficient influence.

4. Overfitting to Dataset Characteristics

Methods tuned on FD001 (single condition, single fault) often overfit to its specific characteristics:

  • Assume a single degradation pattern
  • Learn condition-specific features that don't generalize
  • Optimize hyperparameters that work only for simple scenarios

The Generalization Problem

Cross-dataset experiments reveal a troubling pattern: methods trained on one dataset often show negative transfer when applied to others. A model trained on FD002 (6 conditions) may perform worse on FD001 (1 condition) than a model trained directly on FD001.


How AMNL Addresses These Limitations

AMNL was designed specifically to overcome these limitations through a single key insight: equal task weighting.

The Core Innovation

$$\mathcal{L}_{\text{AMNL}} = 0.5 \times \mathcal{L}_{\text{RUL}} + 0.5 \times \mathcal{L}_{\text{health}}$$

This simple change has profound effects:

1. Condition-Invariant Representations

Health states are defined by RUL thresholds, not operating conditions:

  • Normal: RUL > 80 (regardless of condition)
  • Early Degradation: 30 < RUL ≤ 80 (regardless of condition)
  • Critical: RUL ≤ 30 (regardless of condition)

By forcing the model to predict health states accurately, we implicitly enforce condition invariance. The model cannot "cheat" by using condition-specific features—it must learn true degradation signatures.
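Because the health label is a pure function of RUL, the mapping is a few lines. The thresholds come from the list above; encoding the three states as 0/1/2 follows the head diagram shown earlier.

```python
def health_state(rul):
    """Map RUL to a condition-independent health label.
    Thresholds: Normal (RUL > 80), Early Degradation (30 < RUL <= 80),
    Critical (RUL <= 30) -- the same rule under every operating condition."""
    if rul > 80:
        return 0   # Normal
    if rul > 30:
        return 1   # Early Degradation
    return 2       # Critical

print([health_state(r) for r in (120, 80, 50, 30, 5)])  # [0, 1, 1, 2, 2]
```

Note the boundary behavior: RUL of exactly 80 is already Early Degradation, and exactly 30 is already Critical, matching the ≤ in the definitions.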

2. Balanced Gradient Flow

Equal weighting ensures both tasks contribute meaningfully to the shared representation:

  • RUL gradients encourage fine-grained regression accuracy
  • Health gradients encourage categorical separation between degradation phases
  • Together, they create features that are both discriminative and continuous
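Combining the two gradient sources, the equal-weight objective is simply the average of the two losses. A NumPy sketch (MSE for the RUL regression, cross-entropy over the three health states; the batch values below are random placeholders):

```python
import numpy as np

def amnl_loss(rul_pred, rul_true, health_logits, health_true):
    """Equal-weight multi-task loss: 0.5 * MSE(RUL) + 0.5 * CE(health)."""
    mse = np.mean((rul_pred - rul_true) ** 2)
    # numerically stable log-softmax for the classification term
    z = health_logits - health_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(health_true)), health_true])
    return 0.5 * mse + 0.5 * ce

rng = np.random.default_rng(0)
loss = amnl_loss(
    rng.normal(size=8),                # predicted RUL (placeholder batch of 8)
    rng.normal(size=8),                # true RUL
    rng.normal(size=(8, 3)),           # health logits over 3 states
    rng.integers(0, 3, size=8),        # true health labels
)
print(loss > 0)  # True
```

Note there is nothing to tune here: the single 0.5/0.5 setting replaces the λ hyperparameter search that asymmetric schemes require.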

3. Implicit Regularization

The classification task acts as a strong regularizer:

  • Prevents overfitting to dataset-specific noise in RUL labels
  • Encourages learning of phase-level features (Normal → Degradation → Critical)
  • Forces the model to maintain meaningful feature structure

Results: AMNL vs Prior SOTA

| Metric | DKAMFormer | AMNL | Improvement |
|---|---|---|---|
| FD001 RMSE | 10.68 | 10.43 | +2.3% |
| FD002 RMSE | 10.70 | 6.74 | +37.0% |
| FD003 RMSE | 10.52 | 9.51 | +9.6% |
| FD004 RMSE | 12.89 | 8.16 | +36.7% |
| Multi-condition avg | 11.80 | 7.45 | +36.9% |
| Parameters | ~8M | ~0.8M | 10× smaller |

The Counterintuitive Discovery: By treating the auxiliary task as equally important as the primary task, AMNL achieves better primary-task performance than methods that prioritize it. This challenges fundamental assumptions in multi-task learning.

Summary

In this section, we have surveyed the landscape of RUL prediction methods:

  1. LSTM-based methods (2015-2017) established deep learning baselines with RMSE ~12-17 on FD001
  2. CNN-LSTM hybrids (2016-2018) combined local feature extraction with temporal modeling, reaching RMSE ~11-14
  3. Attention mechanisms enabled adaptive focus, improving to RMSE ~10-13
  4. Transformer methods (2021-2024) like DKAMFormer achieved previous SOTA at RMSE ~10.5-13
  5. Critical limitation: all prior methods show degraded performance on multi-condition datasets (FD002/FD004)
  6. AMNL's innovation: equal task weighting (0.5/0.5) creates condition-invariant representations

| Evolution Stage | Key Insight | Limitation Addressed |
|---|---|---|
| LSTM | Temporal modeling | Manual feature engineering |
| CNN-LSTM | Local + temporal | Local pattern extraction |
| Attention | Adaptive focus | Fixed attention patterns |
| Transformers | Parallel attention | Sequential processing |
| AMNL | Equal task weighting | Multi-condition generalization |

Looking Ahead: With Chapter 1 complete, you now have a solid foundation in predictive maintenance, the RUL prediction problem, deep learning architectures, the C-MAPSS benchmark, and the state of the art. In Part II, we will dive into the data—exploring the C-MAPSS dataset in detail and building the preprocessing pipeline that enables our model to learn effectively.

You are now ready to move beyond theory and start working with real data and code.