Chapter 15

Comprehensive Evaluation Pipeline

Evaluation Metrics

Learning Objectives

By the end of this section, you will:

  1. Combine all evaluation metrics into a unified pipeline
  2. Handle both single-task and dual-task model outputs
  3. Generate publication-ready results with proper formatting
  4. Compare with state-of-the-art baselines fairly
  5. Export results for visualization and analysis

The Complete Picture: A comprehensive evaluation pipeline computes all relevant metrics in a single pass through the data. This ensures consistency across metrics and provides a complete assessment of model performance for both the primary RUL task and the secondary health classification task.

Pipeline Overview

The evaluation pipeline processes model predictions and computes all metrics systematically.

Pipeline Architecture

πŸ“text
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚               COMPREHENSIVE EVALUATION PIPELINE             β”‚
3β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
4β”‚                                                             β”‚
5β”‚  INPUT: Model, Test Dataset, Device                         β”‚
6β”‚         ↓                                                   β”‚
7β”‚  1. INFERENCE                                               β”‚
8β”‚     β”œβ”€β”€ Forward pass on all test samples                    β”‚
9β”‚     β”œβ”€β”€ Collect RUL predictions                             β”‚
10β”‚     β”œβ”€β”€ Collect health logits (if dual-task)                β”‚
11β”‚     └── Record unit IDs for per-engine analysis             β”‚
12β”‚         ↓                                                   β”‚
13β”‚  2. RUL METRICS                                             β”‚
14β”‚     β”œβ”€β”€ RMSE (all-cycles)                                   β”‚
15β”‚     β”œβ”€β”€ RMSE (last-cycle) ← Primary benchmark               β”‚
16β”‚     β”œβ”€β”€ MAE (all-cycles, last-cycle)                        β”‚
17β”‚     β”œβ”€β”€ RΒ² score (all-cycles, last-cycle)                   β”‚
18β”‚     β”œβ”€β”€ NASA score (raw)                                    β”‚
19β”‚     └── NASA score (paper-style) ← Primary benchmark        β”‚
20β”‚         ↓                                                   β”‚
21β”‚  3. CLASSIFICATION METRICS                                  β”‚
22β”‚     β”œβ”€β”€ Health accuracy                                     β”‚
23β”‚     β”œβ”€β”€ Health F1 (weighted)                                β”‚
24β”‚     β”œβ”€β”€ Per-class precision/recall                          β”‚
25β”‚     └── Confusion matrix                                    β”‚
26β”‚         ↓                                                   β”‚
27β”‚  4. DIAGNOSTIC METRICS                                      β”‚
28β”‚     β”œβ”€β”€ Early vs. late prediction ratio                     β”‚
29β”‚     β”œβ”€β”€ Mean/std error                                      β”‚
30β”‚     β”œβ”€β”€ Per-RUL-range performance                           β”‚
31β”‚     └── Prediction count statistics                         β”‚
32β”‚         ↓                                                   β”‚
33β”‚  OUTPUT: Comprehensive results dictionary                   β”‚
34β”‚                                                             β”‚
35β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Metrics Computed

| Category | Metric | Purpose |
|---|---|---|
| RUL (Primary) | RMSE_last_cycle | Benchmark comparison |
| RUL (Primary) | nasa_score_paper | Asymmetric performance |
| RUL | RMSE_all_cycles | Consistency across degradation |
| RUL | MAE_all/last_cycle | Robust error measure |
| RUL | R2_all/last_cycle | Variance explained |
| Health | health_accuracy | Overall classification |
| Health | health_f1 | Class-balanced performance |
| Diagnostic | early/late_predictions | Prediction bias |
| Diagnostic | mean_error | Systematic bias detection |
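
The NASA-score entries in the table come from `nasa_scoring_function_comprehensive`, introduced in an earlier section. For reference, here is a minimal sketch of the raw C-MAPSS asymmetric penalty that underlies it (the helper name `nasa_score_raw` is illustrative, not part of the pipeline's API):

```python
import numpy as np

def nasa_score_raw(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Standard C-MAPSS asymmetric score.

    Late predictions (pred > true, d > 0) are penalised with exp(d/10) - 1,
    early ones (d < 0) with exp(-d/13) - 1, so over-estimating remaining
    life costs more than under-estimating it by the same margin.
    """
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    per_sample = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(np.sum(per_sample))
```

Because the penalty is summed rather than averaged, the score grows with test-set size, which is why per-dataset comparisons must use the same evaluation protocol.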

Comprehensive Evaluation

The complete evaluation function handles both single-task and dual-task models.

Full Implementation

```python
from typing import Dict

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score


def evaluate_model_comprehensive(
    model: nn.Module,
    test_dataset,
    device: torch.device,
    batch_size: int = 256,
    return_predictions: bool = False,
    apply_monotonic: bool = False,
    calibrator=None
) -> Dict[str, float]:
    """
    Comprehensive evaluation with proper protocol alignment.

    Computes all metrics needed for RUL prediction benchmarking:
    - RMSE (all-cycles and last-cycle)
    - NASA score (raw and paper-style)
    - MAE, RΒ² scores
    - Health classification metrics (if dual-task)
    - Diagnostic statistics

    Args:
        model: Trained model (single-task or dual-task)
        test_dataset: Test dataset or DataLoader
        device: Compute device
        batch_size: Batch size for inference
        return_predictions: Whether to include raw predictions
        apply_monotonic: Enforce monotonic decrease per unit
        calibrator: Optional prediction calibrator

    Returns:
        Dictionary with all evaluation metrics
    """
    model.eval()

    # Handle both DataLoader and Dataset inputs
    if isinstance(test_dataset, DataLoader):
        test_loader = test_dataset
    else:
        test_loader = DataLoader(
            test_dataset, batch_size=batch_size, shuffle=False
        )

    # Accumulate predictions
    all_predictions = []
    all_targets = []
    all_unit_ids = []
    health_preds_all = []
    health_targets_all = []

    with torch.no_grad():
        batch_start_idx = 0
        for batch_data in test_loader:
            # Handle different batch formats
            if len(batch_data) == 4:
                sequences, rul_targets, health_targets, unit_ids = batch_data
            elif len(batch_data) == 3:
                # Dataset-provided health labels; recomputed below from RUL
                sequences, rul_targets, health_targets = batch_data
                unit_ids = None
            else:
                sequences, rul_targets = batch_data
                health_targets = None
                unit_ids = None

            sequences = sequences.to(device)
            model_output = model(sequences)

            # Handle both single-task and dual-task models
            if isinstance(model_output, tuple):
                predictions = model_output[0].detach().cpu().numpy().flatten()
                health_logits = model_output[1].detach().cpu().numpy()
            else:
                predictions = model_output.detach().cpu().numpy().flatten()
                health_logits = None

            # Apply calibrator if provided
            if calibrator is not None and hasattr(calibrator, 'transform'):
                predictions = calibrator.transform(predictions)

            # Consistent RUL capping at 125
            predictions = np.minimum(predictions, 125.0)

            all_predictions.extend(predictions)
            all_targets.extend(rul_targets.numpy())

            # Health classification accumulation
            if health_logits is not None:
                pred_labels = np.argmax(health_logits, axis=1)
                t = rul_targets.numpy()
                tgt = np.zeros_like(t, dtype=np.int64)
                tgt[(t > 15) & (t <= 50)] = 1
                tgt[t <= 15] = 2
                health_preds_all.extend(pred_labels.tolist())
                health_targets_all.extend(tgt.tolist())

            # Handle unit IDs
            if unit_ids is not None:
                all_unit_ids.extend(
                    unit_ids.numpy() if hasattr(unit_ids, 'numpy') else unit_ids
                )
            else:
                batch_end_idx = batch_start_idx + len(predictions)
                if hasattr(test_loader.dataset, 'get_unit_ids_for_range'):
                    batch_unit_ids = test_loader.dataset.get_unit_ids_for_range(
                        batch_start_idx, batch_end_idx
                    )
                    all_unit_ids.extend(batch_unit_ids)
                batch_start_idx = batch_end_idx

    # Convert to numpy arrays
    all_predictions = np.array(all_predictions)
    all_targets = np.array(all_targets)
    all_unit_ids = np.array(all_unit_ids) if all_unit_ids else None

    # Apply monotonic constraint if requested
    if apply_monotonic and all_unit_ids is not None:
        predictions_copy = all_predictions.copy()
        for unit in np.unique(all_unit_ids):
            unit_mask = all_unit_ids == unit
            unit_preds = predictions_copy[unit_mask]
            for i in range(1, len(unit_preds)):
                unit_preds[i] = min(unit_preds[i], unit_preds[i - 1])
            predictions_copy[unit_mask] = unit_preds
        all_predictions = predictions_copy

    results = {}

    # ═══════════════════════════════════════════════════════════
    # 1. ALL-CYCLES RMSE
    # ═══════════════════════════════════════════════════════════
    rmse_all = np.sqrt(np.mean((all_predictions - all_targets) ** 2))
    results['RMSE_all_cycles'] = float(rmse_all)

    # ═══════════════════════════════════════════════════════════
    # 2. LAST-CYCLE RMSE (Primary Benchmark)
    # ═══════════════════════════════════════════════════════════
    if all_unit_ids is not None:
        unique_units = np.unique(all_unit_ids)
        last_predictions = []
        last_targets = []

        for unit_id in unique_units:
            mask = all_unit_ids == unit_id
            unit_preds = all_predictions[mask]
            unit_targets = all_targets[mask]
            if len(unit_preds) > 0:
                last_predictions.append(unit_preds[-1])
                last_targets.append(unit_targets[-1])

        last_predictions = np.array(last_predictions)
        last_targets = np.array(last_targets)
    else:
        # Without unit IDs, last-cycle extraction is impossible;
        # fall back to all samples so the metrics remain defined.
        unique_units = np.array([])
        last_predictions = all_predictions
        last_targets = all_targets

    rmse_last = np.sqrt(np.mean((last_predictions - last_targets) ** 2))
    results['RMSE_last_cycle'] = float(rmse_last)

    # ═══════════════════════════════════════════════════════════
    # 3. NASA SCORES
    # ═══════════════════════════════════════════════════════════
    nasa_scores = nasa_scoring_function_comprehensive(
        all_targets, all_predictions,
        method='both',
        unit_ids=all_unit_ids,
        normalize=False
    )
    results.update(nasa_scores)

    # ═══════════════════════════════════════════════════════════
    # 4. MAE METRICS
    # ═══════════════════════════════════════════════════════════
    results['MAE_all_cycles'] = float(np.mean(np.abs(all_predictions - all_targets)))
    results['MAE_last_cycle'] = float(np.mean(np.abs(last_predictions - last_targets)))

    # ═══════════════════════════════════════════════════════════
    # 5. RΒ² SCORES
    # ═══════════════════════════════════════════════════════════
    ss_res_all = np.sum((all_targets - all_predictions) ** 2)
    ss_tot_all = np.sum((all_targets - np.mean(all_targets)) ** 2)
    results['R2_all_cycles'] = float(1 - ss_res_all / ss_tot_all) if ss_tot_all != 0 else 0.0

    ss_res_last = np.sum((last_targets - last_predictions) ** 2)
    ss_tot_last = np.sum((last_targets - np.mean(last_targets)) ** 2)
    results['R2_last_cycle'] = float(1 - ss_res_last / ss_tot_last) if ss_tot_last != 0 else 0.0

    # ═══════════════════════════════════════════════════════════
    # 6. EVALUATION METADATA
    # ═══════════════════════════════════════════════════════════
    results['n_total_predictions'] = len(all_predictions)
    results['n_units_evaluated'] = len(unique_units) if all_unit_ids is not None else 0
    results['mean_predictions_per_unit'] = float(
        len(all_predictions) / len(unique_units)
    ) if len(unique_units) > 0 else 0

    # ═══════════════════════════════════════════════════════════
    # 7. BIAS ANALYSIS
    # ═══════════════════════════════════════════════════════════
    errors = all_predictions - all_targets
    results['early_predictions'] = int(np.sum(errors < 0))
    results['late_predictions'] = int(np.sum(errors > 0))
    results['early_percentage'] = float(results['early_predictions'] / len(errors) * 100)
    results['late_percentage'] = float(results['late_predictions'] / len(errors) * 100)
    results['mean_error'] = float(np.mean(errors))
    results['std_error'] = float(np.std(errors))

    # ═══════════════════════════════════════════════════════════
    # 8. HEALTH CLASSIFICATION (if available)
    # ═══════════════════════════════════════════════════════════
    if health_preds_all:
        results['health_accuracy'] = float(
            accuracy_score(health_targets_all, health_preds_all) * 100
        )
        results['health_f1'] = float(
            f1_score(health_targets_all, health_preds_all, average='weighted') * 100
        )
    else:
        results['health_accuracy'] = 0.0
        results['health_f1'] = 0.0

    # ═══════════════════════════════════════════════════════════
    # 9. RETURN PREDICTIONS (optional)
    # ═══════════════════════════════════════════════════════════
    if return_predictions:
        results['predictions'] = all_predictions
        results['targets'] = all_targets
        results['unit_ids'] = all_unit_ids

    return results
```
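
The element-wise loop that enforces the per-unit monotonic constraint is equivalent to a running minimum, which NumPy can compute directly. A small sketch (the helper name `enforce_monotonic` is illustrative):

```python
import numpy as np

def enforce_monotonic(preds: np.ndarray, unit_ids: np.ndarray) -> np.ndarray:
    """Clamp predictions so RUL never increases within a unit.

    np.minimum.accumulate computes a running minimum over each unit's
    predictions, exactly matching the element-wise clamping loop.
    """
    out = preds.astype(float).copy()
    for unit in np.unique(unit_ids):
        mask = unit_ids == unit
        out[mask] = np.minimum.accumulate(out[mask])
    return out
```

This assumes the samples for each unit are stored in cycle order, the same assumption the pipeline's own loop makes.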

Results Reporting

Generate publication-ready results formatted for comparison with baselines.

Results Logging

```python
def log_evaluation_results(
    results: Dict,
    dataset_name: str,
    logger
):
    """
    Log comprehensive evaluation results.

    Args:
        results: Dictionary from evaluate_model_comprehensive
        dataset_name: Name of dataset (FD001-FD004)
        logger: Logging instance
    """
    logger.info(f"\n{'='*60}")
    logger.info(f"EVALUATION RESULTS: {dataset_name}")
    logger.info(f"{'='*60}")

    # Primary benchmark metrics
    logger.info("\nπŸ“Š PRIMARY BENCHMARK METRICS:")
    logger.info(f"  RMSE (last-cycle): {results['RMSE_last_cycle']:.2f}")
    logger.info(f"  NASA Score (paper): {results.get('nasa_score_paper', 0):.1f}")

    # Secondary RUL metrics
    logger.info("\nπŸ“ˆ ADDITIONAL RUL METRICS:")
    logger.info(f"  RMSE (all-cycles): {results['RMSE_all_cycles']:.2f}")
    logger.info(f"  MAE (last-cycle): {results['MAE_last_cycle']:.2f}")
    logger.info(f"  RΒ² (last-cycle): {results['R2_last_cycle']:.4f}")

    # Health classification
    if results.get('health_accuracy', 0) > 0:
        logger.info("\nπŸ₯ HEALTH CLASSIFICATION:")
        logger.info(f"  Accuracy: {results['health_accuracy']:.1f}%")
        logger.info(f"  F1 Score: {results['health_f1']:.1f}%")

    # Prediction bias
    logger.info("\nβš–οΈ PREDICTION BIAS:")
    logger.info(f"  Early predictions: {results['early_percentage']:.1f}%")
    logger.info(f"  Late predictions: {results['late_percentage']:.1f}%")

    # Evaluation statistics
    logger.info("\nπŸ“‹ EVALUATION STATISTICS:")
    logger.info(f"  Total predictions: {results['n_total_predictions']:,}")
    logger.info(f"  Units evaluated: {results['n_units_evaluated']}")
    logger.info(f"  Avg predictions/unit: {results['mean_predictions_per_unit']:.1f}")
```
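
The pipeline diagram also lists per-class precision/recall and a confusion matrix, which the logging above does not cover. These can be added with scikit-learn; a hedged sketch (the helper name and the class names `healthy`/`degrading`/`critical` are illustrative labels for the pipeline's RUL thresholds, not names from the codebase):

```python
from sklearn.metrics import confusion_matrix, classification_report

def health_diagnostics(health_targets, health_preds):
    """Confusion matrix and per-class precision/recall for the three
    health classes used by the pipeline (0: RUL > 50, 1: 15 < RUL <= 50,
    2: RUL <= 15)."""
    labels = [0, 1, 2]
    cm = confusion_matrix(health_targets, health_preds, labels=labels)
    report = classification_report(
        health_targets, health_preds, labels=labels,
        target_names=['healthy', 'degrading', 'critical'],
        zero_division=0
    )
    return cm, report
```

Rows of the confusion matrix are true classes and columns are predictions, so off-diagonal mass in the bottom row flags critical engines misclassified as healthier than they are.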

LaTeX Table Generation

```python
def generate_latex_results_table(
    results_all_datasets: Dict[str, Dict]
) -> str:
    """
    Generate LaTeX table for publication.

    Args:
        results_all_datasets: Dict mapping dataset name to results

    Returns:
        LaTeX table string
    """
    latex = []
    latex.append(r"\begin{table}[t]")
    latex.append(r"\centering")
    latex.append(r"\caption{AMNL Results on NASA C-MAPSS Benchmark}")
    latex.append(r"\label{tab:results}")
    latex.append(r"\begin{tabular}{lcccc}")
    latex.append(r"\toprule")
    latex.append(r"Metric & FD001 & FD002 & FD003 & FD004 \\")
    latex.append(r"\midrule")

    # RMSE row
    rmse_values = [
        f"{results_all_datasets[ds]['RMSE_last_cycle']:.2f}"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"RMSE & {' & '.join(rmse_values)} \\\\")

    # NASA Score row
    nasa_values = [
        f"{results_all_datasets[ds].get('nasa_score_paper', 0):.0f}"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"NASA Score & {' & '.join(nasa_values)} \\\\")

    # Health Accuracy row
    health_values = [
        f"{results_all_datasets[ds].get('health_accuracy', 0):.1f}\\%"
        for ds in ['FD001', 'FD002', 'FD003', 'FD004']
    ]
    latex.append(f"Health Acc. & {' & '.join(health_values)} \\\\")

    latex.append(r"\bottomrule")
    latex.append(r"\end{tabular}")
    latex.append(r"\end{table}")

    return '\n'.join(latex)
```
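
Objective 5 also calls for exporting results for visualization and analysis. One minimal approach is JSON serialization of the results dictionary; a sketch, assuming JSON output is acceptable (the helper name `export_results_json` is illustrative):

```python
import json
import numpy as np

def export_results_json(results: dict, path: str) -> None:
    """Serialise an evaluation-results dict to JSON.

    The only non-JSON-native values the pipeline produces are the optional
    NumPy arrays (predictions/targets/unit_ids); convert those to lists.
    """
    serialisable = {
        key: (value.tolist() if isinstance(value, np.ndarray) else value)
        for key, value in results.items()
    }
    with open(path, 'w') as f:
        json.dump(serialisable, f, indent=2)
```

The resulting file can be loaded directly by plotting scripts or notebooks without a PyTorch dependency.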

Usage Examples

Complete examples of using the evaluation pipeline.

Single Dataset Evaluation

```python
# Load model and data
model = DualTaskEnhancedModel(...).to(device)
model.load_state_dict(torch.load('best_model.pth')['model_state_dict'])

test_dataset = EnhancedNASACMAPSSDataset(
    dataset_name='FD001',
    train=False,
    scaler_params=scaler_params
)

# Comprehensive evaluation
results = evaluate_model_comprehensive(
    model=model,
    test_dataset=test_dataset,
    device=device,
    return_predictions=True
)

# Log results
log_evaluation_results(results, 'FD001', logger)

# Access key metrics
print(f"RMSE (last-cycle): {results['RMSE_last_cycle']:.2f}")
print(f"NASA Score (paper): {results['nasa_score_paper']:.1f}")
print(f"Health Accuracy: {results['health_accuracy']:.1f}%")
```

All Datasets Comparison

```python
# Evaluate on all four datasets
datasets = ['FD001', 'FD002', 'FD003', 'FD004']
all_results = {}

for dataset_name in datasets:
    # Load dataset-specific model
    model_path = f'models/{dataset_name}_amnl.pth'
    checkpoint = torch.load(model_path)
    model.load_state_dict(checkpoint['model_state_dict'])

    # Load test dataset
    test_dataset = EnhancedNASACMAPSSDataset(
        dataset_name=dataset_name,
        train=False,
        scaler_params=checkpoint['scaler_params']
    )

    # Evaluate
    results = evaluate_model_comprehensive(model, test_dataset, device)
    all_results[dataset_name] = results

    print(f"{dataset_name}: RMSE={results['RMSE_last_cycle']:.2f}, "
          f"NASA={results['nasa_score_paper']:.1f}")

# Generate comparison table
latex_table = generate_latex_results_table(all_results)
print(latex_table)
```

Expected Output

πŸ“text
1FD001: RMSE=11.21, NASA=234.5
2FD002: RMSE=10.98, NASA=892.1
3FD003: RMSE=10.55, NASA=215.8
4FD004: RMSE=11.49, NASA=1024.3
5
6============================================================
7EVALUATION RESULTS: FD001
8============================================================
9
10πŸ“Š PRIMARY BENCHMARK METRICS:
11  RMSE (last-cycle): 11.21
12  NASA Score (paper): 234.5
13
14πŸ“ˆ ADDITIONAL RUL METRICS:
15  RMSE (all-cycles): 13.45
16  MAE (last-cycle): 9.12
17  RΒ² (last-cycle): 0.8756
18
19πŸ₯ HEALTH CLASSIFICATION:
20  Accuracy: 84.2%
21  F1 Score: 82.7%
22
23βš–οΈ PREDICTION BIAS:
24  Early predictions: 52.3%
25  Late predictions: 47.7%
26
27πŸ“‹ EVALUATION STATISTICS:
28  Total predictions: 13,096
29  Units evaluated: 100
30  Avg predictions/unit: 130.96

State-of-the-Art Results

The AMNL model achieves state-of-the-art performance on all four NASA C-MAPSS datasets, with improvements ranging from +2.3% (FD001) to +37.0% (FD002) over previous best methods. These results are fully reproducible using the comprehensive evaluation pipeline.
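
Improvement percentages like these are conventionally computed against the prior best result for a lower-is-better metric such as RMSE; a one-line sketch (illustrative helper, no real baseline values assumed):

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage improvement of `ours` over `baseline` for a
    lower-is-better metric such as RMSE (positive means better)."""
    return (baseline - ours) / baseline * 100.0
```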


Summary

In this section, we built the comprehensive evaluation pipeline:

  1. Unified pipeline: Single function computes all metrics
  2. Dual-task support: Handles both RUL and health classification
  3. Proper protocol: Last-cycle extraction for benchmarking
  4. Publication-ready: LaTeX table generation
  5. Diagnostic metrics: Bias analysis and error statistics

| Output | Content | Use |
|---|---|---|
| RMSE_last_cycle | Primary benchmark | Compare with SOTA |
| nasa_score_paper | Asymmetric score | Safety-aware comparison |
| health_accuracy/f1 | Classification performance | Dual-task evaluation |
| early/late_percentage | Prediction bias | Diagnostics |
| predictions (optional) | Raw outputs | Visualization/analysis |

Chapter Complete: You now have a complete evaluation toolkit for RUL prediction: RMSE metrics, NASA asymmetric scoring, proper last-cycle protocols, health classification metrics, and a unified pipeline. The next chapter presents the main resultsβ€”state-of-the-art performance on all four NASA C-MAPSS datasets with detailed comparisons against 15+ baseline methods.

With comprehensive evaluation complete, we present the main results.