Learning Objectives
By the end of this section, you will:
- Trace the evolution of RUL prediction methods from traditional ML to modern deep learning
- Understand LSTM-based approaches and their success in capturing temporal dependencies
- Know how CNN-LSTM hybrids combine local feature extraction with sequence modeling
- Appreciate transformer architectures like DVGTformer and DKAMFormer that achieved previous SOTA
- Identify the critical limitation: inconsistent performance across operating conditions
- See where AMNL fits in this evolution and why its approach is novel
Why This Matters: Understanding the landscape of existing methods helps you appreciate why certain design decisions were made in AMNL. Every architecture choice has predecessors, and knowing what worked (and what didn't) guides the development of better solutions.
The Evolution of RUL Methods
RUL prediction has evolved through several distinct phases, each addressing limitations of previous approaches:
| Era | Methods | RMSE Range (FD001) | Key Innovation |
|---|---|---|---|
| Pre-2015 | Physics-based, Classical ML | 20-30+ | Domain knowledge, hand-crafted features |
| 2015-2017 | LSTM, GRU | 12-17 | End-to-end temporal modeling |
| 2016-2018 | CNN-LSTM hybrids | 11-14 | Local feature extraction + sequence modeling |
| 2017-2020 | Attention mechanisms | 10-13 | Adaptive focus on relevant timesteps |
| 2021-2024 | Transformers (DVGTformer, DKAMFormer) | 10.5-11.5 | Pure attention, domain knowledge integration |
| 2024-Present | AMNL (Ours) | 8.69-10.43 | Equal-weight multi-task learning |
The Narrowing Gap
On FD001 (the easiest dataset), methods have converged to similar performance—improvements are measured in fractions of RMSE points. But on complex datasets like FD002 and FD004, the gap remains enormous:
| Method Type | FD001 RMSE | FD002 RMSE | FD004 RMSE | FD002/FD001 Ratio |
|---|---|---|---|---|
| Classical ML | ~15 | ~22 | ~25 | 1.47× |
| LSTM | ~13 | ~18 | ~20 | 1.38× |
| CNN-LSTM-Attention | ~11 | ~14 | ~16 | 1.27× |
| DKAMFormer | 10.68 | 10.70 | 12.89 | 1.00× |
| AMNL (Ours) | 10.43 | 6.74 | 8.16 | 0.65× |
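The ratio column is simple arithmetic, but it is worth making explicit: a hypothetical helper (illustrative, not part of any benchmark suite) reproduces the table's values.

```python
def multi_condition_ratio(rmse_fd002, rmse_fd001):
    """FD002/FD001 RMSE ratio. A value > 1 means the method degrades on
    the multi-condition dataset; < 1 means it actually does better there."""
    return round(rmse_fd002 / rmse_fd001, 2)

# AMNL is the only entry in the table whose ratio falls below 1.
amnl_ratio = multi_condition_ratio(6.74, 10.43)       # 0.65
classical_ratio = multi_condition_ratio(22.0, 15.0)   # 1.47
```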
The Multi-Condition Gap
The FD002/FD001 ratio makes the gap explicit: every prior method performs worse on the multi-condition dataset than on the single-condition one, while AMNL is the only method whose FD002 error is actually lower than its FD001 error (ratio 0.65×).
LSTM-Based Approaches
Long Short-Term Memory networks marked the first major breakthrough in data-driven RUL prediction. Unlike traditional methods that required manual feature engineering, LSTMs could learn temporal patterns directly from raw sensor sequences.
How LSTMs Helped
- Long-range dependencies: The cell state mechanism carries information across many timesteps, essential for tracking gradual degradation
- Adaptive memory: Gates learn which information to remember and forget based on training data
- End-to-end training: No need for hand-crafted features—the network learns representations optimized for RUL
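The gating mechanism described above can be made concrete with a deliberately minimal sketch: one LSTM step with scalar input, hidden state, and cell state. This is illustrative only (real LSTMs operate on vectors with weight matrices), but the gate algebra is the same.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, squashing a pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One LSTM step with scalar states, exposing the gating math.
    `weights` maps gate name -> (w_x, w_h, b), all hypothetical values."""
    def gate(name, act):
        w_x, w_h, b = weights[name]
        return act(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)   # fraction of the old cell state to keep
    i = gate("input", sigmoid)    # how much of the candidate to write
    g = gate("cand", math.tanh)   # candidate cell content
    o = gate("output", sigmoid)   # how much of the cell state to expose
    c_new = f * c_prev + i * g    # cell state: the long-range memory path
    h_new = o * math.tanh(c_new)  # hidden state emitted at this timestep
    return h_new, c_new
```

The additive update `f * c_prev + i * g` is why the cell state can carry slow degradation signals across many timesteps: information is scaled, not repeatedly transformed.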
Early LSTM Results
| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Zheng et al. | 2017 | 12.56 | 16.18 | 12.10 | 17.93 |
| Zhang et al. | 2018 | 13.65 | 15.72 | 13.43 | 16.21 |
| BiLSTM baseline | 2019 | 13.18 | 17.04 | 14.32 | 18.45 |
Limitations of Pure LSTM Approaches
- No local pattern extraction: LSTMs process one timestep at a time, missing local sensor patterns
- Sequential processing: Cannot parallelize across timesteps, which slows training
- Vanishing gradients: Despite gating, very long sequences still pose challenges
- Multi-condition struggle: Performance degrades significantly on FD002/FD004
CNN-LSTM Hybrid Architectures
The insight that convolutional networks excel at local pattern extraction led to hybrid architectures: CNNs for features, LSTMs for temporal modeling.
Architecture Pattern
```
Raw Sensors → CNN Layers → Feature Maps → LSTM Layers → RUL Prediction
     ↓             ↓             ↓             ↓              ↓
  (30, 17)       Local        (30, 64)     Temporal       Scalar
                Patterns                    Context
```
Why This Combination Works
- CNNs extract local patterns: Sensor spikes, local trends, noise filtering
- LSTMs model sequence: How patterns evolve over time
- Better gradient flow: CNN outputs are smoother, making LSTM training easier
- Reduced sequence length: Pooling layers can reduce temporal dimension
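The "local pattern extraction" role of the CNN front-end can be illustrated without any framework: a valid-mode 1-D convolution slides a small kernel over a sensor signal and produces one activation per local window. This is a sketch of the idea, not the layers any particular paper uses.

```python
def conv1d_valid(signal, kernel):
    """'Valid' 1-D convolution (really cross-correlation, as in deep
    learning libraries): one activation per local window of the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference kernel responds to local changes such as a sensor spike,
# exactly the kind of local pattern a CNN front-end hands to the LSTM.
spike = [0.0, 0.0, 1.0, 0.0, 0.0]
edges = conv1d_valid(spike, [-1.0, 1.0])  # nonzero only around the spike
```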
CNN-LSTM Results
| Paper | Year | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| Li et al. (DCNN) | 2018 | 12.61 | 16.94 | 12.64 | 17.30 |
| Zhao et al. | 2019 | 11.37 | 15.12 | 12.45 | 16.78 |
| Enhanced CNN-LSTM | 2022 | 10.93 | 14.45 | 11.71 | 16.64 |
Adding Attention
Attention mechanisms provided the next improvement by allowing the model to focus on the most diagnostic timesteps:
c = Σ_t α_t h_t,   with   α_t = exp(e_t) / Σ_k exp(e_k)

where α_t is the attention weight for timestep t, e_t is its learned relevance score, h_t is the hidden state, and c is the context vector.
CNN-LSTM-Attention models achieved RMSE of 14.45 and 16.64 on FD002 and FD004 respectively—substantial improvements but still far from solving the multi-condition challenge.
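The weighted-context computation reduces to a softmax over per-timestep scores followed by a weighted sum. The sketch below uses scalar hidden states for clarity; real models apply the same arithmetic per feature dimension.

```python
import math

def attention_context(hidden_states, scores):
    """Turn per-timestep relevance scores into softmax weights, then
    form the context vector as the weighted sum of hidden states."""
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]           # attention weights, sum to 1
    context = sum(a * h for a, h in zip(alphas, hidden_states))
    return alphas, context
```

With equal scores the weights are uniform; raising one score concentrates nearly all the weight on that timestep, which is what "adaptive focus on relevant timesteps" means in practice.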
Transformer-Based Methods
The Transformer architecture, originally developed for NLP, brought pure attention-based sequence modeling to time series. Unlike RNNs, Transformers process all timesteps in parallel through self-attention.
Key Transformer Methods
DVGTformer (2024)
The Dual-View Graph Transformer introduced:
- Temporal view: Self-attention across timesteps
- Spatial view: Graph attention across sensors
- Dual fusion: Combining both views for comprehensive representation
DKAMFormer (2025)
The Domain Knowledge-Augmented Multiscale Transformer represented the previous state-of-the-art:
- Knowledge graphs: Incorporated domain knowledge about sensor relationships
- Multiscale processing: Parallel paths for different temporal resolutions
- Domain embeddings: Encoded physical knowledge into the architecture
Transformer Results (Previous SOTA)
| Method | Params | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|---|
| DVGTformer | ~5M | 11.49 | 19.77 | 11.71 | 20.67 |
| DKAMFormer | ~8M | 10.68 | 10.70 | 10.52 | 12.89 |
| AMNL (Ours) | ~0.8M | 10.43 | 6.74 | 9.51 | 8.16 |
Model Complexity
Notably, AMNL achieves these results with roughly 0.8M parameters, about 10× fewer than DKAMFormer's ~8M, suggesting the gains come from the training formulation rather than from scale.
Multi-Task Learning Approaches
Multi-task learning (MTL) has been explored in prognostics as a way to leverage auxiliary tasks for improved feature learning. The core insight is that health state assessment and RUL prediction share underlying features.
Common MTL Architecture
```
Shared Encoder (CNN/LSTM/Transformer)
                ↓
         Shared Features
            ↙      ↘
     RUL Head    Health Head
         ↓            ↓
    ŷ_RUL ∈ ℝ   ŷ_health ∈ {0,1,2}
```
Conventional Wisdom: Asymmetric Weighting
Prior MTL work in prognostics almost universally used asymmetric task weighting, with the primary RUL task receiving higher weight:
Typical settings in prior work:
| Paper | λ_RUL | λ_health | Rationale |
|---|---|---|---|
| Liu et al. (2021) | 0.9 | 0.1 | Primary task should dominate |
| Xu et al. (2023) | 0.8 | 0.2 | Auxiliary as regularizer |
| Wang et al. (2023) | 0.7 | 0.3 | Balanced but RUL-focused |
| AMNL (Ours) | 0.5 | 0.5 | Equal importance (novel) |
The Gradient Imbalance Problem
Asymmetric weighting creates gradient imbalance. When λ_RUL ≫ λ_health, the shared encoder is dominated by RUL gradients, and the health classification task has minimal influence on the learned representations.
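A toy example makes the imbalance concrete. Assume a single shared parameter pulled toward different targets by two squared-error losses (this is an illustration of the gradient arithmetic, not AMNL's actual losses):

```python
def shared_gradients(w, t_rul, t_health, lam_rul, lam_health):
    """Per-task gradient contributions to a single shared parameter w for
    L = lam_rul * (w - t_rul)**2 + lam_health * (w - t_health)**2."""
    g_rul = 2.0 * lam_rul * (w - t_rul)
    g_health = 2.0 * lam_health * (w - t_health)
    return g_rul, g_health

# With a 0.9/0.1 split and equally wrong predictions, the RUL task pulls
# the shared parameter 9x harder than the health task; at 0.5/0.5 the
# pulls are balanced.
g_r, g_h = shared_gradients(0.0, 1.0, -1.0, 0.9, 0.1)
```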
Critical Limitations of Existing Methods
Despite years of progress, existing methods share several critical limitations that prevent truly robust RUL prediction:
1. Inconsistent Multi-Condition Performance
No prior method achieves consistent performance across all datasets. Methods that excel on FD001 often fail on FD002/FD004:
| Method | FD001 Rank (at publication) | FD002 Rank (at publication) | Consistent? |
|---|---|---|---|
| DVGTformer | 1st (11.49) | 6th (19.77) | No ❌ |
| CNN-LSTM-Attention | 3rd (10.93) | 4th (14.45) | No ❌ |
| DKAMFormer | 1st (10.68) | 1st (10.70) | Yes ✓ |
| AMNL (Ours) | 1st (10.43) | 1st (6.74) | Yes ✓ |
DKAMFormer was the first to achieve relatively consistent rankings, but its FD002/FD004 numbers remain significantly higher than FD001/FD003.
2. The Condition Confusion Problem
On multi-condition datasets, models often confuse operating condition effects with degradation. A sensor reading that indicates degradation in one condition might be perfectly normal in another.
Prior solutions include:
- Per-condition normalization: Normalize features separately for each operating condition (helps, but doesn't fully solve)
- Domain adaptation: Explicitly align features across conditions (requires target domain data)
- Condition embeddings: Add condition ID as input feature (limits generalization)
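Per-condition normalization, the first of these workarounds, is easy to sketch: group readings by operating condition and z-score each value against its own condition's statistics. A minimal version, assuming discrete condition IDs are already known:

```python
from collections import defaultdict

def per_condition_zscore(readings):
    """Z-score each sensor value within its own operating condition.
    `readings` is a list of (condition_id, value) pairs."""
    groups = defaultdict(list)
    for cond, val in readings:
        groups[cond].append(val)
    stats = {}
    for cond, vals in groups.items():
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        stats[cond] = (mean, std if std > 0 else 1.0)  # guard constant sensors
    return [(cond, (val - stats[cond][0]) / stats[cond][1])
            for cond, val in readings]
```

After this transform, readings from different conditions live on a comparable scale, which helps but, as noted above, does not remove condition effects that interact with degradation.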
3. Auxiliary Task Underutilization
By weighting auxiliary tasks low (0.1-0.3), prior MTL approaches fail to fully leverage their regularization potential. The health classification task provides clean categorical supervision that could guide feature learning—but only if given sufficient influence.
4. Overfitting to Dataset Characteristics
Methods tuned on FD001 (single condition, single fault) often overfit to its specific characteristics:
- Assume a single degradation pattern
- Learn condition-specific features that don't generalize
- Optimize hyperparameters that work only for simple scenarios
The Generalization Problem
Cross-dataset experiments reveal a troubling pattern: methods trained on one dataset often show negative transfer when applied to others. A model trained on FD002 (6 conditions) may perform worse on FD001 (1 condition) than a model trained directly on FD001.
How AMNL Addresses These Limitations
AMNL was designed specifically to overcome these limitations through a single key insight: equal task weighting.
The Core Innovation
Instead of down-weighting the auxiliary task, AMNL fixes both task weights at 0.5:

L_total = 0.5 · L_RUL + 0.5 · L_health

This simple change has profound effects:
1. Condition-Invariant Representations
Health states are defined by RUL thresholds, not operating conditions:
- Normal: RUL > 80 (regardless of condition)
- Early Degradation: 30 < RUL ≤ 80 (regardless of condition)
- Critical: RUL ≤ 30 (regardless of condition)
By forcing the model to predict health states accurately, we implicitly enforce condition invariance. The model cannot "cheat" by using condition-specific features—it must learn true degradation signatures.
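The thresholds above reduce to a small labeling function. The class indices below are an illustrative convention; only the RUL cutoffs (80 and 30) come from the text.

```python
def health_state(rul, early_thresh=80, critical_thresh=30):
    """Map an RUL value to a discrete health state using the text's
    thresholds: > 80 Normal, 30 < RUL <= 80 Early Degradation, <= 30 Critical.
    Operating condition is deliberately NOT an input."""
    if rul > early_thresh:
        return 0  # Normal
    if rul > critical_thresh:
        return 1  # Early Degradation
    return 2  # Critical
```

Because the label depends only on RUL, any condition-specific shortcut the encoder might learn is useless for this task, which is the mechanism behind the claimed condition invariance.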
2. Balanced Gradient Flow
Equal weighting ensures both tasks contribute meaningfully to the shared representation:
- RUL gradients encourage fine-grained regression accuracy
- Health gradients encourage categorical separation between degradation phases
- Together, they create features that are both discriminative and continuous
3. Implicit Regularization
The classification task acts as a strong regularizer:
- Prevents overfitting to dataset-specific noise in RUL labels
- Encourages learning of phase-level features (Normal → Degradation → Critical)
- Forces the model to maintain meaningful feature structure
Results: AMNL vs Prior SOTA
| Metric | DKAMFormer | AMNL | Improvement |
|---|---|---|---|
| FD001 RMSE | 10.68 | 10.43 | +2.3% |
| FD002 RMSE | 10.70 | 6.74 | +37.0% |
| FD003 RMSE | 10.52 | 9.51 | +9.6% |
| FD004 RMSE | 12.89 | 8.16 | +36.7% |
| Multi-condition avg | 11.80 | 7.45 | +36.9% |
| Parameters | ~8M | ~0.8M | 10× smaller |
The Counterintuitive Discovery: By treating the auxiliary task as equally important as the primary task, AMNL achieves better primary task performance than methods that prioritize it. This challenges fundamental assumptions in multi-task learning.
Summary
In this section, we have surveyed the landscape of RUL prediction methods:
- LSTM-based methods (2015-2017) established deep learning baselines with RMSE ~12-17 on FD001
- CNN-LSTM hybrids (2016-2018) combined local feature extraction with temporal modeling, reaching RMSE ~11-14
- Attention mechanisms enabled adaptive focus, improving to RMSE ~10-13
- Transformer methods (2021-2024) like DKAMFormer achieved previous SOTA at RMSE ~10.5-13
- Critical limitation: all prior methods show degraded performance on multi-condition datasets (FD002/FD004)
- AMNL's innovation: equal task weighting (0.5/0.5) creates condition-invariant representations
| Evolution Stage | Key Insight | Limitation Addressed |
|---|---|---|
| LSTM | Temporal modeling | Manual feature engineering |
| CNN-LSTM | Local + temporal | Local pattern extraction |
| Attention | Adaptive focus | Fixed attention patterns |
| Transformers | Parallel attention | Sequential processing |
| AMNL | Equal task weighting | Multi-condition generalization |
Looking Ahead: With Chapter 1 complete, you now have a solid foundation in predictive maintenance, the RUL prediction problem, deep learning architectures, the C-MAPSS benchmark, and the state of the art. In Part II, we will dive into the data—exploring the C-MAPSS dataset in detail and building the preprocessing pipeline that enables our model to learn effectively.
You are now ready to move beyond theory and start working with real data and code.