Learning Objectives
By the end of this section, you will:
- Combine all evaluation metrics into a unified pipeline
- Handle both single-task and dual-task model outputs
- Generate publication-ready results with proper formatting
- Compare with state-of-the-art baselines fairly
- Export results for visualization and analysis
The Complete Picture: A comprehensive evaluation pipeline computes all relevant metrics in a single pass through the data. This ensures consistency across metrics and provides a complete assessment of model performance for both the primary RUL task and the secondary health classification task.
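The secondary task discretizes RUL into three health classes; the pipeline in this section uses thresholds at 50 and 15 cycles (healthy, degrading, critical). A minimal sketch of that mapping (the helper name `rul_to_health_class` is illustrative, not from the codebase):

```python
import numpy as np

def rul_to_health_class(rul: np.ndarray) -> np.ndarray:
    """Map RUL to health classes: 0=healthy (>50), 1=degrading (15, 50], 2=critical (<=15)."""
    labels = np.zeros_like(rul, dtype=np.int64)  # default: healthy
    labels[(rul > 15) & (rul <= 50)] = 1
    labels[rul <= 15] = 2
    return labels

print(rul_to_health_class(np.array([120.0, 40.0, 10.0])))  # [0 1 2]
```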
Pipeline Overview
The evaluation pipeline processes model predictions and computes all metrics systematically.
Pipeline Architecture
```text
═══════════════════════════════════════════════════════════════
              COMPREHENSIVE EVALUATION PIPELINE
═══════════════════════════════════════════════════════════════

INPUT: Model, Test Dataset, Device
   │
1. INFERENCE
   ├── Forward pass on all test samples
   ├── Collect RUL predictions
   ├── Collect health logits (if dual-task)
   └── Record unit IDs for per-engine analysis
   │
2. RUL METRICS
   ├── RMSE (all-cycles)
   ├── RMSE (last-cycle)            ← Primary benchmark
   ├── MAE (all-cycles, last-cycle)
   ├── R² score (all-cycles, last-cycle)
   ├── NASA score (raw)
   └── NASA score (paper-style)     ← Primary benchmark
   │
3. CLASSIFICATION METRICS
   ├── Health accuracy
   ├── Health F1 (weighted)
   ├── Per-class precision/recall
   └── Confusion matrix
   │
4. DIAGNOSTIC METRICS
   ├── Early vs. late prediction ratio
   ├── Mean/std error
   ├── Per-RUL-range performance
   └── Prediction count statistics

OUTPUT: Comprehensive results dictionary
═══════════════════════════════════════════════════════════════
```
Metrics Computed
| Category | Metric | Purpose |
|---|---|---|
| RUL (Primary) | RMSE_last_cycle | Benchmark comparison |
| RUL (Primary) | nasa_score_paper | Asymmetric performance |
| RUL | RMSE_all_cycles | Consistency across degradation |
| RUL | MAE_all/last_cycle | Robust error measure |
| RUL | R2_all/last_cycle | Variance explained |
| Health | health_accuracy | Overall classification |
| Health | health_f1 | Class-balanced performance |
| Diagnostic | early/late_predictions | Prediction bias |
| Diagnostic | mean_error | Systematic bias detection |
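The defining property of the NASA score in the table above is asymmetry: late predictions (predicted RUL above true RUL) are penalized more steeply than early ones, since overestimating remaining life is the dangerous failure mode. A minimal sketch of the standard C-MAPSS scoring function (the full `nasa_scoring_function_comprehensive`, with its `method` and `unit_ids` options, appears earlier in this chapter):

```python
import numpy as np

def nasa_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Standard C-MAPSS score: exponential penalties, steeper for late predictions."""
    d = y_pred - y_true  # d > 0 means a late (dangerous) prediction
    per_sample = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(np.sum(per_sample))

# A 10-cycle late error costs more than a 10-cycle early one
print(nasa_score(np.array([50.0]), np.array([60.0])))  # exp(1) - 1 ≈ 1.718
print(nasa_score(np.array([50.0]), np.array([40.0])))  # exp(10/13) - 1 ≈ 1.158
```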
Comprehensive Evaluation
The complete evaluation function handles both single-task and dual-task models.
Full Implementation
```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from typing import Dict
from sklearn.metrics import accuracy_score, f1_score


def evaluate_model_comprehensive(
    model: nn.Module,
    test_dataset,
    device: torch.device,
    batch_size: int = 256,
    return_predictions: bool = False,
    apply_monotonic: bool = False,
    calibrator=None
) -> Dict[str, float]:
    """
    Comprehensive evaluation with proper protocol alignment.

    Computes all metrics needed for RUL prediction benchmarking:
    - RMSE (all-cycles and last-cycle)
    - NASA score (raw and paper-style)
    - MAE, R² scores
    - Health classification metrics (if dual-task)
    - Diagnostic statistics

    Args:
        model: Trained model (single-task or dual-task)
        test_dataset: Test dataset or DataLoader
        device: Compute device
        batch_size: Batch size for inference
        return_predictions: Whether to include raw predictions
        apply_monotonic: Enforce monotonic decrease per unit
        calibrator: Optional prediction calibrator

    Returns:
        Dictionary with all evaluation metrics
    """
    model.eval()

    # Handle both DataLoader and Dataset inputs
    if isinstance(test_dataset, DataLoader):
        test_loader = test_dataset
    else:
        test_loader = DataLoader(
            test_dataset, batch_size=batch_size, shuffle=False
        )

    # Accumulate predictions
    all_predictions = []
    all_targets = []
    all_unit_ids = []
    health_preds_all = []
    health_targets_all = []

    with torch.no_grad():
        batch_start_idx = 0
        for batch_data in test_loader:
            # Handle different batch formats
            if len(batch_data) == 4:
                sequences, rul_targets, health_targets, unit_ids = batch_data
            elif len(batch_data) == 3:
                sequences, rul_targets, health_targets = batch_data
                unit_ids = None
            else:
                sequences, rul_targets = batch_data
                health_targets = None
                unit_ids = None

            sequences = sequences.to(device)
            model_output = model(sequences)

            # Handle both single-task and dual-task models
            if isinstance(model_output, tuple):
                predictions = model_output[0].detach().cpu().numpy().flatten()
                health_logits = model_output[1].detach().cpu().numpy()
            else:
                predictions = model_output.detach().cpu().numpy().flatten()
                health_logits = None

            # Apply calibrator if provided
            if calibrator is not None and hasattr(calibrator, 'transform'):
                predictions = calibrator.transform(predictions)

            # Consistent RUL capping at 125
            predictions = np.minimum(predictions, 125.0)

            all_predictions.extend(predictions)
            all_targets.extend(rul_targets.numpy().flatten())

            # Health classification accumulation
            # (targets derived from RUL: >50 healthy, 15-50 degrading, <=15 critical)
            if health_logits is not None:
                pred_labels = np.argmax(health_logits, axis=1)
                t = rul_targets.numpy().flatten()
                tgt = np.zeros_like(t, dtype=np.int64)
                tgt[(t > 15) & (t <= 50)] = 1
                tgt[t <= 15] = 2
                health_preds_all.extend(pred_labels.tolist())
                health_targets_all.extend(tgt.tolist())

            # Handle unit IDs
            if unit_ids is not None:
                all_unit_ids.extend(
                    unit_ids.numpy() if hasattr(unit_ids, 'numpy') else unit_ids
                )
            else:
                batch_end_idx = batch_start_idx + len(predictions)
                if hasattr(test_loader.dataset, 'get_unit_ids_for_range'):
                    batch_unit_ids = test_loader.dataset.get_unit_ids_for_range(
                        batch_start_idx, batch_end_idx
                    )
                    all_unit_ids.extend(batch_unit_ids)
                batch_start_idx = batch_end_idx

    # Convert to numpy arrays
    all_predictions = np.array(all_predictions)
    all_targets = np.array(all_targets)
    all_unit_ids = np.array(all_unit_ids) if all_unit_ids else None

    # Apply monotonic constraint if requested
    if apply_monotonic and all_unit_ids is not None:
        predictions_copy = all_predictions.copy()
        for unit in np.unique(all_unit_ids):
            unit_mask = all_unit_ids == unit
            unit_preds = predictions_copy[unit_mask]
            for i in range(1, len(unit_preds)):
                unit_preds[i] = min(unit_preds[i], unit_preds[i - 1])
            predictions_copy[unit_mask] = unit_preds
        all_predictions = predictions_copy

    results = {}

    # ------------------------------------------------------------
    # 1. ALL-CYCLES RMSE
    # ------------------------------------------------------------
    rmse_all = np.sqrt(np.mean((all_predictions - all_targets) ** 2))
    results['RMSE_all_cycles'] = float(rmse_all)

    # ------------------------------------------------------------
    # 2. LAST-CYCLE RMSE (Primary Benchmark)
    # ------------------------------------------------------------
    if all_unit_ids is not None:
        unique_units = np.unique(all_unit_ids)
        last_predictions = []
        last_targets = []

        for unit_id in unique_units:
            mask = all_unit_ids == unit_id
            unit_preds = all_predictions[mask]
            unit_targets = all_targets[mask]
            if len(unit_preds) > 0:
                last_predictions.append(unit_preds[-1])
                last_targets.append(unit_targets[-1])

        last_predictions = np.array(last_predictions)
        last_targets = np.array(last_targets)
    else:
        # Without unit IDs, last-cycle extraction is impossible;
        # fall back to all-cycles so downstream metrics stay defined
        unique_units = np.array([])
        last_predictions = all_predictions
        last_targets = all_targets

    rmse_last = np.sqrt(np.mean((last_predictions - last_targets) ** 2))
    results['RMSE_last_cycle'] = float(rmse_last)

    # ------------------------------------------------------------
    # 3. NASA SCORES
    # ------------------------------------------------------------
    nasa_scores = nasa_scoring_function_comprehensive(  # defined earlier in this chapter
        all_targets, all_predictions,
        method='both',
        unit_ids=all_unit_ids,
        normalize=False
    )
    results.update(nasa_scores)

    # ------------------------------------------------------------
    # 4. MAE METRICS
    # ------------------------------------------------------------
    results['MAE_all_cycles'] = float(np.mean(np.abs(all_predictions - all_targets)))
    results['MAE_last_cycle'] = float(np.mean(np.abs(last_predictions - last_targets)))

    # ------------------------------------------------------------
    # 5. R² SCORES
    # ------------------------------------------------------------
    ss_res_all = np.sum((all_targets - all_predictions) ** 2)
    ss_tot_all = np.sum((all_targets - np.mean(all_targets)) ** 2)
    results['R2_all_cycles'] = float(1 - ss_res_all / ss_tot_all) if ss_tot_all != 0 else 0.0

    ss_res_last = np.sum((last_targets - last_predictions) ** 2)
    ss_tot_last = np.sum((last_targets - np.mean(last_targets)) ** 2)
    results['R2_last_cycle'] = float(1 - ss_res_last / ss_tot_last) if ss_tot_last != 0 else 0.0

    # ------------------------------------------------------------
    # 6. EVALUATION METADATA
    # ------------------------------------------------------------
    results['n_total_predictions'] = len(all_predictions)
    results['n_units_evaluated'] = len(unique_units)
    results['mean_predictions_per_unit'] = float(
        len(all_predictions) / len(unique_units)
    ) if len(unique_units) > 0 else 0.0

    # ------------------------------------------------------------
    # 7. BIAS ANALYSIS
    # ------------------------------------------------------------
    errors = all_predictions - all_targets
    results['early_predictions'] = int(np.sum(errors < 0))
    results['late_predictions'] = int(np.sum(errors > 0))
    results['early_percentage'] = float(results['early_predictions'] / len(errors) * 100)
    results['late_percentage'] = float(results['late_predictions'] / len(errors) * 100)

    # ------------------------------------------------------------
    # 8. HEALTH CLASSIFICATION (if available)
    # ------------------------------------------------------------
    if health_preds_all:
        results['health_accuracy'] = float(
            accuracy_score(health_targets_all, health_preds_all) * 100
        )
        results['health_f1'] = float(
            f1_score(health_targets_all, health_preds_all, average='weighted') * 100
        )
    else:
        results['health_accuracy'] = 0.0
        results['health_f1'] = 0.0

    # ------------------------------------------------------------
    # 9. RETURN PREDICTIONS (optional)
    # ------------------------------------------------------------
    if return_predictions:
        results['predictions'] = all_predictions
        results['targets'] = all_targets
        results['unit_ids'] = all_unit_ids

    return results
```
Results Reporting
Generate publication-ready results formatted for comparison with baselines.
Results Logging
```python
def log_evaluation_results(
    results: Dict,
    dataset_name: str,
    logger
):
    """
    Log comprehensive evaluation results.

    Args:
        results: Dictionary from evaluate_model_comprehensive
        dataset_name: Name of dataset (FD001-FD004)
        logger: Logging instance
    """
    logger.info(f"\n{'=' * 60}")
    logger.info(f"EVALUATION RESULTS: {dataset_name}")
    logger.info(f"{'=' * 60}")

    # Primary benchmark metrics
    logger.info("\nPRIMARY BENCHMARK METRICS:")
    logger.info(f"  RMSE (last-cycle):  {results['RMSE_last_cycle']:.2f}")
    logger.info(f"  NASA Score (paper): {results.get('nasa_score_paper', 0):.1f}")

    # Secondary RUL metrics
    logger.info("\nADDITIONAL RUL METRICS:")
    logger.info(f"  RMSE (all-cycles): {results['RMSE_all_cycles']:.2f}")
    logger.info(f"  MAE (last-cycle):  {results['MAE_last_cycle']:.2f}")
    logger.info(f"  R² (last-cycle):   {results['R2_last_cycle']:.4f}")

    # Health classification
    if results.get('health_accuracy', 0) > 0:
        logger.info("\nHEALTH CLASSIFICATION:")
        logger.info(f"  Accuracy: {results['health_accuracy']:.1f}%")
        logger.info(f"  F1 Score: {results['health_f1']:.1f}%")

    # Prediction bias
    logger.info("\nPREDICTION BIAS:")
    logger.info(f"  Early predictions: {results['early_percentage']:.1f}%")
    logger.info(f"  Late predictions:  {results['late_percentage']:.1f}%")

    # Evaluation statistics
    logger.info("\nEVALUATION STATISTICS:")
    logger.info(f"  Total predictions: {results['n_total_predictions']:,}")
    logger.info(f"  Units evaluated: {results['n_units_evaluated']}")
    logger.info(f"  Avg predictions/unit: {results['mean_predictions_per_unit']:.1f}")
```
LaTeX Table Generation
```python
def generate_latex_results_table(
    results_all_datasets: Dict[str, Dict]
) -> str:
    """
    Generate LaTeX table for publication.

    Args:
        results_all_datasets: Dict mapping dataset name to results

    Returns:
        LaTeX table string
    """
    latex = []
    latex.append(r"\begin{table}[t]")
    latex.append(r"\centering")
    latex.append(r"\caption{AMNL Results on NASA C-MAPSS Benchmark}")
    latex.append(r"\label{tab:results}")
    latex.append(r"\begin{tabular}{lcccc}")
    latex.append(r"\toprule")
    latex.append(r"Metric & FD001 & FD002 & FD003 & FD004 \\")
    latex.append(r"\midrule")

    # RMSE row
    rmse_values = [
        f"{results_all_datasets[ds]['RMSE_last_cycle']:.2f}"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"RMSE & {' & '.join(rmse_values)} \\\\")

    # NASA Score row
    nasa_values = [
        f"{results_all_datasets[ds].get('nasa_score_paper', 0):.0f}"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"NASA Score & {' & '.join(nasa_values)} \\\\")

    # Health Accuracy row
    health_values = [
        f"{results_all_datasets[ds].get('health_accuracy', 0):.1f}\\%"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"Health Acc. & {' & '.join(health_values)} \\\\")

    latex.append(r"\bottomrule")
    latex.append(r"\end{tabular}")
    latex.append(r"\end{table}")

    return '\n'.join(latex)
```
Usage Examples
Complete examples of using the evaluation pipeline.
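Before the full examples, a note on the `apply_monotonic` flag: within each unit it replaces every prediction by the running minimum over cycle-ordered predictions, so estimated RUL never increases as an engine ages. Assuming each unit's predictions are already in cycle order (as the pipeline collects them), the inner loop is equivalent to one vectorized `np.minimum.accumulate` call per unit; the helper name `enforce_monotonic` here is illustrative:

```python
import numpy as np

def enforce_monotonic(predictions: np.ndarray, unit_ids: np.ndarray) -> np.ndarray:
    """Per-unit running minimum: RUL predictions never increase within a trajectory."""
    out = predictions.copy()
    for unit in np.unique(unit_ids):
        mask = unit_ids == unit
        out[mask] = np.minimum.accumulate(out[mask])
    return out

preds = np.array([100.0, 95.0, 97.0, 60.0, 62.0, 58.0])
units = np.array([1, 1, 1, 2, 2, 2])
print(enforce_monotonic(preds, units))  # → 100, 95, 95 for unit 1; 60, 60, 58 for unit 2
```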
Single Dataset Evaluation
```python
# Load model and data (map_location keeps CPU-only machines working)
model = DualTaskEnhancedModel(...).to(device)
model.load_state_dict(
    torch.load('best_model.pth', map_location=device)['model_state_dict']
)

test_dataset = EnhancedNASACMAPSSDataset(
    dataset_name='FD001',
    train=False,
    scaler_params=scaler_params
)

# Comprehensive evaluation
results = evaluate_model_comprehensive(
    model=model,
    test_dataset=test_dataset,
    device=device,
    return_predictions=True
)

# Log results
log_evaluation_results(results, 'FD001', logger)

# Access key metrics
print(f"RMSE (last-cycle): {results['RMSE_last_cycle']:.2f}")
print(f"NASA Score (paper): {results['nasa_score_paper']:.1f}")
print(f"Health Accuracy: {results['health_accuracy']:.1f}%")
```
All Datasets Comparison
```python
# Evaluate on all four datasets
datasets = ['FD001', 'FD002', 'FD003', 'FD004']
all_results = {}

for dataset_name in datasets:
    # Load dataset-specific model weights
    model_path = f'models/{dataset_name}_amnl.pth'
    checkpoint = torch.load(model_path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])

    # Load test dataset
    test_dataset = EnhancedNASACMAPSSDataset(
        dataset_name=dataset_name,
        train=False,
        scaler_params=checkpoint['scaler_params']
    )

    # Evaluate
    results = evaluate_model_comprehensive(model, test_dataset, device)
    all_results[dataset_name] = results

    print(f"{dataset_name}: RMSE={results['RMSE_last_cycle']:.2f}, "
          f"NASA={results['nasa_score_paper']:.1f}")

# Generate comparison table
latex_table = generate_latex_results_table(all_results)
print(latex_table)
```
Expected Output
```text
FD001: RMSE=11.21, NASA=234.5
FD002: RMSE=10.98, NASA=892.1
FD003: RMSE=10.55, NASA=215.8
FD004: RMSE=11.49, NASA=1024.3

============================================================
EVALUATION RESULTS: FD001
============================================================

PRIMARY BENCHMARK METRICS:
  RMSE (last-cycle):  11.21
  NASA Score (paper): 234.5

ADDITIONAL RUL METRICS:
  RMSE (all-cycles): 13.45
  MAE (last-cycle):  9.12
  R² (last-cycle):   0.8756

HEALTH CLASSIFICATION:
  Accuracy: 84.2%
  F1 Score: 82.7%

PREDICTION BIAS:
  Early predictions: 52.3%
  Late predictions:  47.7%

EVALUATION STATISTICS:
  Total predictions: 13,096
  Units evaluated: 100
  Avg predictions/unit: 130.96
```
State-of-the-Art Results
The AMNL model achieves state-of-the-art performance on all four NASA C-MAPSS datasets, with improvements ranging from +2.3% (FD001) to +37.0% (FD002) over previous best methods. These results are fully reproducible using the comprehensive evaluation pipeline.
Summary
In this section, we built the comprehensive evaluation pipeline:
- Unified pipeline: Single function computes all metrics
- Dual-task support: Handles both RUL and health classification
- Proper protocol: Last-cycle extraction for benchmarking
- Publication-ready: LaTeX table generation
- Diagnostic metrics: Bias analysis and error statistics
| Output | Content | Use |
|---|---|---|
| RMSE_last_cycle | Primary benchmark | Compare with SOTA |
| nasa_score_paper | Asymmetric score | Safety-aware comparison |
| health_accuracy/f1 | Classification performance | Dual-task evaluation |
| early/late_percentage | Prediction bias | Diagnostics |
| predictions (optional) | Raw outputs | Visualization/analysis |
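The optional raw outputs in the table above (`predictions`, `targets`, `unit_ids`) are numpy arrays, so they need conversion before serialization. A minimal sketch of exporting a results dictionary for downstream visualization (the helper name `export_results_json` and the output filename are illustrative):

```python
import json
import numpy as np

def export_results_json(results: dict, path: str) -> None:
    """Write an evaluation results dict to JSON, converting numpy types to native Python."""
    def to_native(v):
        if isinstance(v, np.ndarray):
            return v.tolist()
        if isinstance(v, (np.floating, np.integer)):
            return v.item()
        return v
    with open(path, 'w') as f:
        json.dump({k: to_native(v) for k, v in results.items()}, f, indent=2)

export_results_json(
    {'RMSE_last_cycle': 11.21, 'predictions': np.array([100.0, 95.0])},
    'fd001_results.json'
)
```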
Chapter Complete: You now have a complete evaluation toolkit for RUL prediction: RMSE metrics, NASA asymmetric scoring, proper last-cycle protocols, health classification metrics, and a unified pipeline. The next chapter presents the main resultsβstate-of-the-art performance on all four NASA C-MAPSS datasets with detailed comparisons against 15+ baseline methods.
With comprehensive evaluation complete, we present the main results.