Learning Objectives
By the end of this section, you will:
- Understand the two evaluation paradigms for RUL prediction
- Know when to use each approach and why
- Implement proper last-cycle extraction from predictions
- Compare metrics across both evaluation modes
- Align with published benchmarks using correct protocols
Why This Matters: The NASA C-MAPSS benchmark has a specific evaluation protocol—last-cycle RMSE and NASA score. Using all-cycles metrics instead gives different (and non-comparable) results. Understanding both approaches ensures accurate comparison with published state-of-the-art results.
Evaluation Paradigms
RUL predictions can be evaluated in two fundamentally different ways.
All-Cycles Evaluation
Evaluate every prediction made throughout the degradation process:

$$\mathrm{RMSE}_{\text{all}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2}$$

where N is the total number of predictions across all engines and all cycles, $y_i$ is the true RUL, and $\hat{y}_i$ the predicted RUL.
Last-Cycle Evaluation
Evaluate only the final prediction for each engine before failure:

$$\mathrm{RMSE}_{\text{last}} = \sqrt{\frac{1}{M} \sum_{j=1}^{M} \left( \hat{y}_{j,\text{last}} - y_{j,\text{last}} \right)^2}$$

where M is the number of test engines, and "last" denotes each engine's final operating cycle.
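The two formulas differ only in which predictions enter the sum. A minimal NumPy sketch on toy data (the array names and values are purely illustrative):

```python
import numpy as np

# Toy data: predictions for 2 engines, a few cycles each
preds    = np.array([90.0, 70.0, 48.0, 95.0, 62.0])
targets  = np.array([85.0, 65.0, 50.0, 100.0, 60.0])
unit_ids = np.array([1, 1, 1, 2, 2])

# All-cycles RMSE: every prediction counts
rmse_all = np.sqrt(np.mean((preds - targets) ** 2))

# Last-cycle RMSE: only the final prediction per engine
last_idx = [np.where(unit_ids == u)[0][-1] for u in np.unique(unit_ids)]
rmse_last = np.sqrt(np.mean((preds[last_idx] - targets[last_idx]) ** 2))

print(rmse_all, rmse_last)  # rmse_all ≈ 4.07, rmse_last = 2.0
```

Note that the same model yields two different numbers depending on which subset of predictions is scored, which is why the two protocols are not comparable.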
Key Differences
| Aspect | All-Cycles | Last-Cycle |
|---|---|---|
| Sample size | All predictions (~13K for FD001) | One per engine (100 for FD001) |
| Measures | Overall consistency | End-of-life accuracy |
| Literature standard | Rarely reported | Primary benchmark |
| Practical relevance | Monitoring throughout | Maintenance decision |
Last-Cycle Protocol
The standard benchmark protocol focuses on final predictions.
Why Last-Cycle Matters
- Maintenance decision point: The final prediction determines if maintenance is scheduled
- Maximum information: Model has seen the full degradation history
- Highest stakes: Error at this point has the largest consequence
- Literature standard: All published C-MAPSS results use last-cycle
Extraction Process
```
Engine 1: cycles 1, 2, 3, ..., 128 → Take prediction at cycle 128
Engine 2: cycles 1, 2, 3, ..., 145 → Take prediction at cycle 145
Engine 3: cycles 1, 2, 3, ..., 98  → Take prediction at cycle 98
...
Engine M: cycles 1, 2, ..., k → Take prediction at cycle k

Result: M predictions (one per engine)
```
True RUL at Last Cycle
At the last operating cycle, the true RUL varies by engine:
| Dataset | Engines | RUL at Last Cycle |
|---|---|---|
| FD001 | 100 | Varies (typically 0-125, median ~40) |
| FD002 | 259 | Varies (typically 0-125) |
| FD003 | 100 | Varies (typically 0-125) |
| FD004 | 248 | Varies (typically 0-125) |
Test Set Ground Truth
The NASA C-MAPSS test sets provide ground truth RUL for the last cycle of each engine in a separate file (RUL_FD00x.txt). These are the targets used for last-cycle evaluation.
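Reading these targets is straightforward; here is a sketch assuming the standard file layout (one integer RUL per line, ordered by engine ID — the example path is illustrative):

```python
import numpy as np

def load_rul_targets(path: str) -> np.ndarray:
    """Load per-engine ground-truth RUL from a C-MAPSS RUL_FD00x.txt file.

    The file contains one value per line, ordered by engine ID, so the
    i-th entry is the true RUL at the last cycle of engine i+1.
    """
    return np.loadtxt(path, dtype=float)

# Example usage (path is an assumption about your data layout):
# rul = load_rul_targets("RUL_FD001.txt")  # shape (100,) for FD001
```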
Literature Benchmark Values
| Dataset | SOTA RMSE (last) | Our RMSE (last) |
|---|---|---|
| FD001 | 11.48 | 11.21 |
| FD002 | 17.44 | 10.98 |
| FD003 | 11.67 | 10.55 |
| FD004 | 18.15 | 11.49 |
All-Cycles Analysis
While not the primary benchmark, all-cycles metrics provide valuable diagnostic information.
Use Cases for All-Cycles
- Model consistency: Does accuracy vary through degradation?
- Early detection: How accurate are early-stage predictions?
- Training diagnostics: Track learning across all timesteps
- Degradation pattern: Analyze errors at different RUL ranges
All-Cycles vs. Last-Cycle Comparison
Typical relationship between the two metrics:
| Metric | All-Cycles | Last-Cycle |
|---|---|---|
| RMSE | Usually higher | Usually lower (benchmark) |
| NASA Score | Much higher | Lower (benchmark) |
| Sample size | ~13,000 (FD001) | 100 (FD001) |
| Variance | Lower (more samples) | Higher (fewer samples) |
Per-RUL-Range Analysis
Break down all-cycles performance by RUL range:
```python
from typing import Dict

import numpy as np

def analyze_by_rul_range(
    predictions: np.ndarray,
    targets: np.ndarray
) -> Dict[str, float]:
    """Analyze prediction accuracy across RUL ranges."""
    results = {}

    # Define RUL ranges
    ranges = [
        ('critical', 0, 15),     # Near failure
        ('degrading', 15, 50),   # Active degradation
        ('healthy', 50, 126),    # Normal operation (126 so capped RUL = 125 is included)
    ]

    for name, low, high in ranges:
        mask = (targets >= low) & (targets < high)
        if np.sum(mask) > 0:
            range_preds = predictions[mask]
            range_targets = targets[mask]
            rmse = np.sqrt(np.mean((range_preds - range_targets) ** 2))
            results[f'RMSE_{name}'] = float(rmse)
            results[f'n_{name}'] = int(np.sum(mask))

    return results
```
Implementation
Complete implementation for both evaluation paradigms.
Last-Cycle Extraction
```python
from typing import Tuple

import numpy as np

def extract_last_cycle_predictions(
    predictions: np.ndarray,
    targets: np.ndarray,
    unit_ids: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Extract the last prediction for each engine.

    Assumes rows are ordered by cycle within each engine, so the final
    masked element is the last operating cycle.

    Args:
        predictions: All predictions
        targets: All true RUL values
        unit_ids: Engine ID for each prediction

    Returns:
        Tuple of (last_predictions, last_targets)
    """
    unique_units = np.unique(unit_ids)
    last_predictions = []
    last_targets = []

    for unit_id in unique_units:
        mask = unit_ids == unit_id
        unit_preds = predictions[mask]
        unit_targets = targets[mask]

        if len(unit_preds) > 0:
            # Take the last (final) cycle
            last_predictions.append(unit_preds[-1])
            last_targets.append(unit_targets[-1])

    return np.array(last_predictions), np.array(last_targets)
```
Dual-Mode Evaluation
```python
from typing import Dict

import numpy as np

def evaluate_dual_mode(
    predictions: np.ndarray,
    targets: np.ndarray,
    unit_ids: np.ndarray
) -> Dict[str, float]:
    """
    Evaluate using both all-cycles and last-cycle modes.

    Args:
        predictions: All predicted RUL values
        targets: All true RUL values
        unit_ids: Engine ID for each prediction

    Returns:
        Dictionary with all metrics in both modes
    """
    # Cap RUL values at 125
    predictions = np.minimum(predictions, 125.0)
    targets = np.minimum(targets, 125.0)

    results = {}

    # ═══════════════════════════════════════════════════════════
    # ALL-CYCLES METRICS
    # ═══════════════════════════════════════════════════════════
    rmse_all = np.sqrt(np.mean((predictions - targets) ** 2))
    mae_all = np.mean(np.abs(predictions - targets))

    ss_res_all = np.sum((targets - predictions) ** 2)
    ss_tot_all = np.sum((targets - np.mean(targets)) ** 2)
    r2_all = 1 - (ss_res_all / ss_tot_all) if ss_tot_all != 0 else 0.0

    results['RMSE_all_cycles'] = float(rmse_all)
    results['MAE_all_cycles'] = float(mae_all)
    results['R2_all_cycles'] = float(r2_all)
    results['n_total_predictions'] = len(predictions)

    # ═══════════════════════════════════════════════════════════
    # LAST-CYCLE METRICS
    # ═══════════════════════════════════════════════════════════
    last_preds, last_targets = extract_last_cycle_predictions(
        predictions, targets, unit_ids
    )

    rmse_last = np.sqrt(np.mean((last_preds - last_targets) ** 2))
    mae_last = np.mean(np.abs(last_preds - last_targets))

    ss_res_last = np.sum((last_targets - last_preds) ** 2)
    ss_tot_last = np.sum((last_targets - np.mean(last_targets)) ** 2)
    r2_last = 1 - (ss_res_last / ss_tot_last) if ss_tot_last != 0 else 0.0

    results['RMSE_last_cycle'] = float(rmse_last)
    results['MAE_last_cycle'] = float(mae_last)
    results['R2_last_cycle'] = float(r2_last)
    results['n_units_evaluated'] = len(last_preds)

    # ═══════════════════════════════════════════════════════════
    # NASA SCORES (BOTH MODES)
    # ═══════════════════════════════════════════════════════════
    nasa_all = nasa_scoring_function_comprehensive(
        targets, predictions, method='raw_sum'
    )
    results['nasa_score_raw'] = nasa_all.get('nasa_score_raw', 0.0)

    nasa_last = nasa_scoring_function_comprehensive(
        targets, predictions, method='paper_style', unit_ids=unit_ids
    )
    results['nasa_score_paper'] = nasa_last.get('nasa_score_paper', 0.0)

    # ═══════════════════════════════════════════════════════════
    # DIAGNOSTIC METRICS
    # ═══════════════════════════════════════════════════════════
    errors = predictions - targets
    results['early_predictions'] = int(np.sum(errors < 0))
    results['late_predictions'] = int(np.sum(errors > 0))
    results['early_percentage'] = float(results['early_predictions'] / len(errors) * 100)
    results['late_percentage'] = float(results['late_predictions'] / len(errors) * 100)
    results['mean_predictions_per_unit'] = float(len(predictions) / len(np.unique(unit_ids)))

    return results
```
Usage Example
```python
# After model evaluation
results = evaluate_dual_mode(
    predictions=all_predictions,
    targets=all_targets,
    unit_ids=all_unit_ids
)

# Report for publication (last-cycle)
print(f"RMSE (last-cycle): {results['RMSE_last_cycle']:.2f}")
print(f"NASA Score (paper): {results['nasa_score_paper']:.1f}")

# Diagnostic (all-cycles)
print(f"RMSE (all-cycles): {results['RMSE_all_cycles']:.2f}")
print(f"Early predictions: {results['early_percentage']:.1f}%")
print(f"Late predictions: {results['late_percentage']:.1f}%")
```
Benchmark Reporting
When comparing with published results, always report last-cycle RMSE and paper-style NASA score. These are the standard benchmark metrics used in all C-MAPSS literature.
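The NASA score is asymmetric: late predictions (predicted RUL above the true value) are penalized more heavily than early ones, because missing a failure is costlier than maintaining early. A minimal sketch of the standard C-MAPSS scoring formula (the function name is illustrative, not the helper used above):

```python
import numpy as np

def nasa_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Standard asymmetric C-MAPSS score, summed over predictions.

    With error d = predicted - true:
      early (d < 0):     exp(-d / 13) - 1
      late  (d >= 0):    exp( d / 10) - 1
    Lower is better; a perfect prediction contributes 0.
    """
    d = y_pred - y_true
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / 13.0) - 1.0,
                                 np.exp(d / 10.0) - 1.0)))
```

Because the penalty grows exponentially, a single badly late prediction can dominate the score, which is why the paper-style variant is computed on last-cycle predictions only.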
Summary
In this section, we covered last-cycle vs. all-cycles evaluation:
- Two paradigms: All-cycles (consistency) vs. last-cycle (benchmark)
- Last-cycle is standard: All published C-MAPSS results use it
- Sample size differs: ~13K all-cycles vs. 100 last-cycle (FD001)
- Both are useful: Last-cycle for benchmark, all-cycles for diagnostics
- Proper extraction: Take final prediction per engine
| Evaluation Mode | Purpose | Reported In Literature |
|---|---|---|
| Last-cycle | Benchmark comparison | Yes (primary) |
| All-cycles | Model diagnostics | Rarely |
| Per-RUL-range | Detailed analysis | Sometimes |
Looking Ahead: Our dual-task model also predicts health states (healthy, degrading, critical). The next section covers health classification metrics—accuracy, F1 score, and per-class analysis.
With evaluation protocols understood, we examine classification metrics.