AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Understand the two evaluation paradigms for RUL prediction
Know when to use each approach and why
Implement proper last-cycle extraction from predictions
Compare metrics across both evaluation modes
Align with published benchmarks using correct protocols

Why This Matters: The NASA C-MAPSS benchmark has a specific evaluation protocol—last-cycle RMSE and NASA score. Using all-cycles metrics instead gives different (and non-comparable) results. Understanding both approaches ensures accurate comparison with published state-of-the-art results.

Evaluation Paradigms

RUL predictions can be evaluated in two fundamentally different ways.

All-Cycles Evaluation

Evaluate every prediction made throughout the degradation process:

\text{Metric}_{\text{all}} = f\left(\{(y_i, \hat{y}_i)\}_{i=1}^{N}\right)

Where N is the total number of predictions across all engines and all cycles.

Last-Cycle Evaluation

Evaluate only the final prediction for each engine, made at its last recorded cycle:

\text{Metric}_{\text{last}} = f\left(\{(y_j^{\text{last}}, \hat{y}_j^{\text{last}})\}_{j=1}^{M}\right)

Where M is the number of test engines, and "last" denotes the final operating cycle.
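To make the two definitions concrete, here is a minimal sketch that applies the same metric f (RMSE here) to all N pairs and then to the M last-cycle pairs. The toy arrays are made up for illustration, and the rows are assumed to be ordered by engine and then by cycle.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error over whatever samples are passed in."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Toy data (made up for illustration): 2 engines, 3 cycles each,
# rows ordered by engine and then by cycle.
unit_ids = np.array([1, 1, 1, 2, 2, 2])
y_true = np.array([50.0, 49.0, 48.0, 20.0, 19.0, 18.0])
y_pred = np.array([60.0, 55.0, 49.0, 30.0, 22.0, 18.0])

# All-cycles: f is applied to all N = 6 (target, prediction) pairs.
rmse_all = rmse(y_true, y_pred)

# Last-cycle: f is applied to M = 2 pairs, one per engine.
# A row is "last" when the next row belongs to a different engine.
is_last = np.append(unit_ids[1:] != unit_ids[:-1], True)
rmse_last = rmse(y_true[is_last], y_pred[is_last])

print(f"RMSE (all-cycles): {rmse_all:.3f}")
print(f"RMSE (last-cycle): {rmse_last:.3f}")
```

In this toy case the early predictions carry most of the error, so the two numbers differ widely even though they come from the same model output.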

Key Differences

Aspect	All-Cycles	Last-Cycle
Sample size	All predictions (~13K for FD001)	One per engine (100 for FD001)
Measures	Overall consistency	End-of-life accuracy
Literature standard	Rarely reported	Primary benchmark
Practical relevance	Monitoring throughout	Maintenance decision

Last-Cycle Protocol

The standard benchmark protocol focuses on final predictions.

Why Last-Cycle Matters

Maintenance decision point: The final prediction determines if maintenance is scheduled
Maximum information: Model has seen the entire recorded history for that engine
Highest stakes: Error at this point has the largest consequence
Literature standard: Most published C-MAPSS results report last-cycle metrics

Extraction Process

```text
Engine 1: cycles 1, 2, 3, ..., 128  →  Take prediction at cycle 128
Engine 2: cycles 1, 2, 3, ..., 145  →  Take prediction at cycle 145
Engine 3: cycles 1, 2, 3, ..., 98   →  Take prediction at cycle 98
...
Engine M: cycles 1, 2, ..., k       →  Take prediction at cycle k

Result: M predictions (one per engine)
```
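The extraction step above can be sketched in vectorized form. This version assumes a per-row `cycles` array is available alongside `unit_ids` (an assumption, not something this section defines) and selects each engine's highest cycle explicitly, so it does not depend on the rows arriving in chronological order. The helper name `last_cycle_indices` is hypothetical.

```python
import numpy as np


def last_cycle_indices(unit_ids: np.ndarray, cycles: np.ndarray) -> np.ndarray:
    """Return, for each engine, the row index of its highest cycle number."""
    if unit_ids.size == 0:
        return np.array([], dtype=int)
    # np.lexsort treats the LAST key as primary: sort by engine, then by cycle.
    order = np.lexsort((cycles, unit_ids))
    sorted_units = unit_ids[order]
    # The last row of each engine's run is where the engine ID changes.
    is_last = np.append(sorted_units[1:] != sorted_units[:-1], True)
    return order[is_last]


# Toy data in shuffled order (made up for illustration).
unit_ids = np.array([2, 1, 1, 2, 1])
cycles = np.array([1, 3, 1, 2, 2])
predictions = np.array([40.0, 10.0, 30.0, 35.0, 20.0])

idx = last_cycle_indices(unit_ids, cycles)
print(predictions[idx])  # one prediction per engine, ordered by engine ID
```

The returned indices can be applied to the prediction and target arrays alike, which keeps the two aligned.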

True RUL at Last Cycle

At the last operating cycle, the true RUL varies by engine:

Dataset	Engines	RUL at Last Cycle
FD001	100	Varies (typically 0-125, median ~40)
FD002	259	Varies (typically 0-125)
FD003	100	Varies (typically 0-125)
FD004	248	Varies (typically 0-125)

Test Set Ground Truth

The NASA C-MAPSS test sets provide ground truth RUL for the last cycle of each engine in a separate file (RUL_FD00x.txt). These are the targets used for last-cycle evaluation.
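As a rough sketch, the ground-truth file can be read with NumPy. This assumes the file holds one RUL value per line, with line i belonging to test engine i, and it applies the same 125-cycle cap used by the evaluation code in this section. The helper name `load_last_cycle_targets` and the example path are illustrative assumptions.

```python
import numpy as np


def load_last_cycle_targets(path: str, rul_cap: float = 125.0) -> np.ndarray:
    """Load one true-RUL value per test engine and apply the RUL cap."""
    # ndmin=1 keeps the result one-dimensional even for a single-engine file.
    true_rul = np.loadtxt(path, dtype=float, ndmin=1)
    return np.minimum(true_rul, rul_cap)


# Example call (the path is an assumption about where the data lives):
# last_targets = load_last_cycle_targets("RUL_FD001.txt")
```

The resulting array has one entry per engine, so it lines up with the last-cycle predictions only if those are also ordered by engine ID.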

Literature Benchmark Values

Dataset	SOTA RMSE (last)	Our RMSE (last)
FD001	11.48	11.21
FD002	17.44	10.98
FD003	11.67	10.55
FD004	18.15	11.49

All-Cycles Analysis

While not the primary benchmark, all-cycles metrics provide valuable diagnostic information.

Use Cases for All-Cycles

Model consistency: Does accuracy vary through degradation?
Early detection: How accurate are early-stage predictions?
Training diagnostics: Track learning across all timesteps
Degradation pattern: Analyze errors at different RUL ranges

All-Cycles vs. Last-Cycle Comparison

Typical relationship between the two metrics:

Metric	All-Cycles	Last-Cycle
RMSE	Usually higher	Usually lower (benchmark)
NASA Score	Much higher	Lower (benchmark)
Sample size	~13,000 (FD001)	100 (FD001)
Variance	Lower (more samples)	Higher (fewer samples)
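This section does not define the NASA score itself. The sketch below assumes the asymmetric exponential form commonly used for C-MAPSS: with d = predicted minus true RUL, a late prediction (d > 0) costs exp(d/10) - 1 and an early one (d < 0) costs exp(-d/13) - 1, summed over samples. Under that assumption it shows why the all-cycles value comes out much higher: the score is a sum, not a mean, so it grows with the number of samples. `nasa_score_sum` is a hypothetical helper for illustration, not the `nasa_scoring_function_comprehensive` used elsewhere in this section.

```python
import numpy as np


def nasa_score_sum(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Asymmetric exponential score, summed over all samples passed in."""
    d = y_pred - y_true  # d < 0: early prediction, d > 0: late prediction
    penalties = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(np.sum(penalties))


# Same per-sample error (5 cycles late) at two sample counts.
score_100 = nasa_score_sum(np.zeros(100), np.full(100, 5.0))
score_13000 = nasa_score_sum(np.zeros(13000), np.full(13000, 5.0))
print(f"100 samples:    {score_100:.1f}")
print(f"13,000 samples: {score_13000:.1f}")

# Late predictions are penalized more heavily than equally large early ones.
late = nasa_score_sum(np.array([0.0]), np.array([10.0]))
early = nasa_score_sum(np.array([0.0]), np.array([-10.0]))
print(f"late: {late:.3f}, early: {early:.3f}")
```

Because of this scaling, an all-cycles NASA score and a last-cycle NASA score should not be compared directly, even for the same model.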

Per-RUL-Range Analysis

Break down all-cycles performance by RUL range:

```python
from typing import Dict

import numpy as np


def analyze_by_rul_range(
    predictions: np.ndarray,
    targets: np.ndarray
) -> Dict[str, float]:
    """Analyze prediction accuracy across RUL ranges."""

    results = {}

    # Define RUL ranges as half-open intervals [low, high).
    # The last range is open-ended so that targets sitting exactly at the
    # 125-cycle cap are counted as healthy instead of being dropped.
    ranges = [
        ('critical', 0, 15),        # Near failure
        ('degrading', 15, 50),      # Active degradation
        ('healthy', 50, np.inf),    # Normal operation (includes capped RUL = 125)
    ]

    for name, low, high in ranges:
        mask = (targets >= low) & (targets < high)
        n_samples = int(np.sum(mask))
        if n_samples > 0:
            range_preds = predictions[mask]
            range_targets = targets[mask]
            rmse = np.sqrt(np.mean((range_preds - range_targets) ** 2))
            results[f'RMSE_{name}'] = float(rmse)
            results[f'n_{name}'] = n_samples

    return results
```

Implementation

Complete implementation for both evaluation paradigms.

Last-Cycle Extraction

```python
from typing import Tuple

import numpy as np


def extract_last_cycle_predictions(
    predictions: np.ndarray,
    targets: np.ndarray,
    unit_ids: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Extract the last prediction for each engine.

    Assumes the rows of each engine are in chronological order, so the
    final row for an engine corresponds to its last recorded cycle.

    Args:
        predictions: All predictions
        targets: All true RUL values
        unit_ids: Engine ID for each prediction

    Returns:
        Tuple of (last_predictions, last_targets), ordered by engine ID
    """
    unique_units = np.unique(unit_ids)
    last_predictions = []
    last_targets = []

    for unit_id in unique_units:
        mask = unit_ids == unit_id
        unit_preds = predictions[mask]
        unit_targets = targets[mask]

        if len(unit_preds) > 0:
            # Take the last (final) cycle
            last_predictions.append(unit_preds[-1])
            last_targets.append(unit_targets[-1])

    return np.array(last_predictions), np.array(last_targets)
```

Dual-Mode Evaluation

```python
from typing import Dict

import numpy as np

# This function relies on extract_last_cycle_predictions (defined above) and
# on nasa_scoring_function_comprehensive, which is not defined in this
# section and must already be available in scope.


def evaluate_dual_mode(
    predictions: np.ndarray,
    targets: np.ndarray,
    unit_ids: np.ndarray
) -> Dict[str, float]:
    """
    Evaluate using both all-cycles and last-cycle modes.

    Args:
        predictions: All predicted RUL values
        targets: All true RUL values
        unit_ids: Engine ID for each prediction

    Returns:
        Dictionary with all metrics in both modes
    """
    # Cap RUL values at 125
    predictions = np.minimum(predictions, 125.0)
    targets = np.minimum(targets, 125.0)

    results = {}

    # ═══════════════════════════════════════════════════════════
    # ALL-CYCLES METRICS
    # ═══════════════════════════════════════════════════════════
    rmse_all = np.sqrt(np.mean((predictions - targets) ** 2))
    mae_all = np.mean(np.abs(predictions - targets))

    ss_res_all = np.sum((targets - predictions) ** 2)
    ss_tot_all = np.sum((targets - np.mean(targets)) ** 2)
    r2_all = 1 - (ss_res_all / ss_tot_all) if ss_tot_all != 0 else 0.0

    results['RMSE_all_cycles'] = float(rmse_all)
    results['MAE_all_cycles'] = float(mae_all)
    results['R2_all_cycles'] = float(r2_all)
    results['n_total_predictions'] = len(predictions)

    # ═══════════════════════════════════════════════════════════
    # LAST-CYCLE METRICS
    # ═══════════════════════════════════════════════════════════
    last_preds, last_targets = extract_last_cycle_predictions(
        predictions, targets, unit_ids
    )

    rmse_last = np.sqrt(np.mean((last_preds - last_targets) ** 2))
    mae_last = np.mean(np.abs(last_preds - last_targets))

    ss_res_last = np.sum((last_targets - last_preds) ** 2)
    ss_tot_last = np.sum((last_targets - np.mean(last_targets)) ** 2)
    r2_last = 1 - (ss_res_last / ss_tot_last) if ss_tot_last != 0 else 0.0

    results['RMSE_last_cycle'] = float(rmse_last)
    results['MAE_last_cycle'] = float(mae_last)
    results['R2_last_cycle'] = float(r2_last)
    results['n_units_evaluated'] = len(last_preds)

    # ═══════════════════════════════════════════════════════════
    # NASA SCORES (BOTH MODES)
    # ═══════════════════════════════════════════════════════════
    # All-cycles NASA score
    nasa_all = nasa_scoring_function_comprehensive(
        targets, predictions, method='raw_sum'
    )
    results['nasa_score_raw'] = nasa_all.get('nasa_score_raw', 0.0)

    # Last-cycle NASA score; unit_ids is passed so the helper can group by engine
    nasa_last = nasa_scoring_function_comprehensive(
        targets, predictions, method='paper_style', unit_ids=unit_ids
    )
    results['nasa_score_paper'] = nasa_last.get('nasa_score_paper', 0.0)

    # ═══════════════════════════════════════════════════════════
    # DIAGNOSTIC METRICS
    # ═══════════════════════════════════════════════════════════
    # error < 0: predicted RUL below true RUL (early); error > 0: late
    errors = predictions - targets
    results['early_predictions'] = int(np.sum(errors < 0))
    results['late_predictions'] = int(np.sum(errors > 0))
    results['early_percentage'] = float(results['early_predictions'] / len(errors) * 100)
    results['late_percentage'] = float(results['late_predictions'] / len(errors) * 100)
    results['mean_predictions_per_unit'] = float(len(predictions) / len(np.unique(unit_ids)))

    return results
```

Usage Example

```python
# After model evaluation
results = evaluate_dual_mode(
    predictions=all_predictions,
    targets=all_targets,
    unit_ids=all_unit_ids
)

# Report for publication (last-cycle)
print(f"RMSE (last-cycle): {results['RMSE_last_cycle']:.2f}")
print(f"NASA Score (paper): {results['nasa_score_paper']:.1f}")

# Diagnostic (all-cycles)
print(f"RMSE (all-cycles): {results['RMSE_all_cycles']:.2f}")
print(f"Early predictions: {results['early_percentage']:.1f}%")
print(f"Late predictions: {results['late_percentage']:.1f}%")
```

Benchmark Reporting

When comparing with published results, report last-cycle RMSE and the paper-style NASA score. These are the metrics most C-MAPSS papers use, so all-cycles numbers cannot be compared with them directly.

Summary

In this section, we covered last-cycle vs. all-cycles evaluation:

Two paradigms: All-cycles (consistency) vs. last-cycle (benchmark)
Last-cycle is standard: Most published C-MAPSS results use it
Sample size differs: ~13K all-cycles vs. 100 last-cycle (FD001)
Both are useful: Last-cycle for benchmark, all-cycles for diagnostics
Proper extraction: Take final prediction per engine

Evaluation Mode	Purpose	Reported In Literature
Last-cycle	Benchmark comparison	Yes (primary)
All-cycles	Model diagnostics	Rarely
Per-RUL-range	Detailed analysis	Sometimes

Looking Ahead: Our dual-task model also predicts health states (healthy, degrading, critical). The next section covers health classification metrics—accuracy, F1 score, and per-class analysis.

With evaluation protocols understood, we examine classification metrics.