Learning Objectives
By the end of this section, you will:
- Know what metrics to monitor during training
- Set up comprehensive logging to file and console
- Track training history for analysis
- Create informative visualizations of training progress
- Diagnose training issues from logged data
Why This Matters: Without proper monitoring, you are training blind. Good logging helps you catch problems early (exploding gradients, stuck training), understand model behavior, and reproduce experiments. Visualizations reveal patterns that raw numbers hide.
What to Monitor
Effective monitoring tracks multiple aspects of the training process.
Loss Metrics
| Metric | Why Monitor | Warning Signs |
|---|---|---|
| Training loss | Overall learning progress | Not decreasing, NaN |
| RUL loss | Primary task performance | Stuck, oscillating |
| Health loss | Secondary task performance | Much higher than RUL |
| Validation loss | Generalization | Diverging from train loss |
Optimization Metrics
| Metric | Why Monitor | Warning Signs |
|---|---|---|
| Learning rate | Scheduler behavior | Dropping too fast/slow |
| Gradient norm | Training stability | >10 or <0.001 |
| Weight decay | Regularization strength | Unexpected changes |
| Clip ratio | Gradient clipping frequency | >50% is concerning |
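The gradient norm and clip ratio in the table above are easy to compute from logged values. A minimal sketch, assuming gradients are available as plain lists of floats (in PyTorch, `torch.nn.utils.clip_grad_norm_` returns the pre-clip global norm, which you can log directly):

```python
import math

def global_grad_norm(grads):
    """Global L2 norm across all parameter gradients (lists of floats)."""
    return math.sqrt(sum(g * g for arr in grads for g in arr))

def clip_ratio(grad_norms, max_norm=1.0):
    """Fraction of steps where clipping would have triggered (norm > max_norm).
    A ratio above ~0.5 suggests the clip threshold or learning rate needs tuning."""
    clipped = sum(1 for n in grad_norms if n > max_norm)
    return clipped / len(grad_norms)
```

Logging the per-step norms and summarizing them per epoch keeps the overhead negligible while making instability visible early.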
Evaluation Metrics
| Metric | Why Monitor | Warning Signs |
|---|---|---|
| RMSE (last-cycle) | Primary benchmark | Not improving after 50 epochs |
| RMSE (all-cycles) | Model consistency | Much higher than last-cycle |
| NASA score | Asymmetric performance | Sudden spikes |
| Health accuracy | Classification quality | <70% |
| Early/late ratio | Prediction bias | Strong bias either way |
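Two of these metrics deserve a concrete definition. The sketch below uses the standard C-MAPSS (PHM08) scoring function, which penalizes late predictions (overestimated RUL) more heavily than early ones, and computes the early/late split from signed errors; treat it as a reference implementation of the formulas, not necessarily the exact code used elsewhere in this project:

```python
import math

def nasa_score(errors):
    """C-MAPSS scoring function over errors d = predicted RUL - true RUL.
    Early (d < 0) uses exp(-d/13) - 1; late (d >= 0) uses exp(d/10) - 1,
    so late predictions are penalized more heavily."""
    return sum(
        math.exp(-d / 13) - 1 if d < 0 else math.exp(d / 10) - 1
        for d in errors
    )

def early_late_split(errors):
    """Percentage of early (error < 0) and late (error > 0) predictions."""
    n = len(errors)
    early = sum(1 for d in errors if d < 0) / n * 100
    late = sum(1 for d in errors if d > 0) / n * 100
    return early, late
```

The asymmetry is why a model with a decent RMSE can still have a poor NASA score: a few strongly late predictions dominate the sum.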
Logging Setup
A proper logging setup writes to both console and file.
Logger Configuration
```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logging(
    experiment_name: str,
    log_dir: str = 'logs'
) -> logging.Logger:
    """
    Configure logging for a training experiment.

    Logs to both console (INFO level) and file (DEBUG level).

    Args:
        experiment_name: Name for this experiment
        log_dir: Directory for log files

    Returns:
        Configured logger
    """
    # Create log directory
    log_path = Path(log_dir)
    log_path.mkdir(exist_ok=True)

    # Generate timestamped log filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_path / f'{experiment_name}_{timestamp}.log'

    # Create logger
    logger = logging.getLogger(experiment_name)
    logger.setLevel(logging.DEBUG)

    # Clear any existing handlers
    logger.handlers = []

    # Console handler (INFO and above)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_format = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%H:%M:%S'
    )
    console_handler.setFormatter(console_format)

    # File handler (DEBUG and above)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    file_handler.setFormatter(file_format)

    # Add handlers
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)

    logger.info(f"Logging to {log_file}")

    return logger
```

Logging During Training
```python
# Example usage in training loop
logger = setup_logging('amnl_fd001')

logger.info("=" * 60)
logger.info("Starting AMNL training on FD001")
logger.info("Model: DualTaskEnhancedModel")
logger.info(f"Parameters: {total_params:,}")
logger.info("=" * 60)

for epoch in range(epochs):
    # Training
    train_results = train_epoch(...)

    # Evaluation
    eval_results = evaluate_model_comprehensive(...)

    # Log progress
    logger.info(
        f"Epoch [{epoch+1:3d}/{epochs}] | "
        f"Loss: {train_results['loss']:.4f} | "
        f"RMSE: {eval_results['RMSE_last_cycle']:.2f} | "
        f"NASA: {eval_results['nasa_score_paper']:.1f} | "
        f"Health: {eval_results['health_accuracy']:.1f}% | "
        f"LR: {current_lr:.6f}"
    )

    # Log detailed debug info
    logger.debug(
        f"Gradient norm: {train_results['avg_grad_norm']:.4f}, "
        f"RUL loss: {train_results['rul_loss']:.4f}, "
        f"Health loss: {train_results['health_loss']:.4f}"
    )

    # Log best model updates
    if is_new_best:
        logger.info(f" New best model! RMSE: {eval_results['RMSE_last_cycle']:.2f}")

logger.info("Training complete")
logger.info(f"Best RMSE: {best_rmse:.2f} at epoch {best_epoch+1}")
```

Training History
Storing the complete training history enables post-hoc analysis.
History Structure
```python
from collections import defaultdict

# Initialize history dictionary
history = defaultdict(list)

# During training, append each epoch's metrics
for epoch in range(epochs):
    # ... training and evaluation ...

    # Store all metrics
    history['train_loss'].append(train_results['loss'])
    history['rul_loss'].append(train_results['rul_loss'])
    history['health_loss'].append(train_results['health_loss'])

    history['test_rmse_all'].append(eval_results['RMSE_all_cycles'])
    history['test_rmse_last'].append(eval_results['RMSE_last_cycle'])
    history['nasa_score_paper'].append(eval_results['nasa_score_paper'])
    history['nasa_score_raw'].append(eval_results['nasa_score_raw'])

    history['health_accuracy'].append(eval_results['health_accuracy'])
    history['health_f1'].append(eval_results['health_f1'])

    history['r2_all_cycles'].append(eval_results['R2_all_cycles'])
    history['r2_last_cycle'].append(eval_results['R2_last_cycle'])

    history['learning_rate'].append(current_lr)
    history['gradient_norm'].append(train_results['avg_grad_norm'])
    history['weight_decay'].append(current_wd)

    # Prediction timing analysis
    early_pct = eval_results['early_predictions'] / eval_results['n_total_predictions'] * 100
    late_pct = eval_results['late_predictions'] / eval_results['n_total_predictions'] * 100
    history['early_predictions_pct'].append(early_pct)
    history['late_predictions_pct'].append(late_pct)
```

Saving History
```python
import json

import numpy as np

def convert_numpy_types(obj):
    """Convert numpy types to Python types for JSON serialization."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, (np.int64, np.int32)):
        return int(obj)
    elif isinstance(obj, (np.float64, np.float32)):
        return float(obj)
    elif isinstance(obj, dict):
        return {k: convert_numpy_types(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_numpy_types(v) for v in obj]
    return obj


def save_training_history(
    history: dict,
    filepath: str
):
    """Save training history to JSON file."""
    # Convert numpy types (json.dump cannot serialize them directly)
    history_clean = convert_numpy_types(dict(history))

    with open(filepath, 'w') as f:
        json.dump(history_clean, f, indent=2)

    print(f"Training history saved to {filepath}")
```

Visualization
Visualizations reveal training dynamics that numbers alone cannot show.
Comprehensive Training Plots
```python
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

def create_training_visualizations(
    history: dict,
    dataset_name: str,
    save_dir: str = 'visualizations'
):
    """
    Create comprehensive training visualization.

    Generates a 3x3 grid of plots covering all key metrics.
    """
    save_path = Path(save_dir) / dataset_name
    save_path.mkdir(parents=True, exist_ok=True)

    epochs = range(1, len(history['train_loss']) + 1)

    # Set style
    plt.style.use('seaborn-v0_8-darkgrid')
    sns.set_palette("husl")

    fig, axes = plt.subplots(3, 3, figsize=(18, 15))
    fig.suptitle(f'AMNL {dataset_name} Training Results', fontsize=16, fontweight='bold')

    # Row 1: Loss and RUL Metrics
    # Plot 1: Training Loss
    axes[0, 0].plot(epochs, history['train_loss'], 'b-', linewidth=2, label='Train Loss')
    axes[0, 0].set_title('Training Loss')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Plot 2: RMSE Evolution
    axes[0, 1].plot(epochs, history['test_rmse_all'], 'g-', linewidth=2, label='All Cycles')
    axes[0, 1].plot(epochs, history['test_rmse_last'], 'r-', linewidth=2, label='Last Cycle')
    axes[0, 1].axhline(y=min(history['test_rmse_last']), color='blue', linestyle='--',
                       alpha=0.7, label=f'Best: {min(history["test_rmse_last"]):.2f}')
    axes[0, 1].set_title('RMSE Evolution')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('RMSE')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Plot 3: NASA Score
    axes[0, 2].plot(epochs, history['nasa_score_paper'], 'purple', linewidth=2)
    axes[0, 2].set_title('NASA Score (Paper Style)')
    axes[0, 2].set_xlabel('Epoch')
    axes[0, 2].set_ylabel('NASA Score')
    axes[0, 2].grid(True, alpha=0.3)

    # Row 2: Health Classification and R²
    # Plot 4: Health Accuracy
    axes[1, 0].plot(epochs, history['health_accuracy'], 'cyan', linewidth=2)
    axes[1, 0].axhline(y=80, color='g', linestyle='--', alpha=0.7, label='Target 80%')
    axes[1, 0].set_title('Health State Accuracy')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Accuracy (%)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Plot 5: R² Score
    axes[1, 1].plot(epochs, history['r2_all_cycles'], 'brown', linewidth=2, label='All Cycles')
    axes[1, 1].plot(epochs, history['r2_last_cycle'], 'pink', linewidth=2, label='Last Cycle')
    axes[1, 1].set_title('R² Score Evolution')
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('R² Score')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Plot 6: Learning Rate
    axes[1, 2].plot(epochs, history['learning_rate'], 'red', linewidth=2)
    axes[1, 2].set_title('Learning Rate Schedule')
    axes[1, 2].set_xlabel('Epoch')
    axes[1, 2].set_ylabel('Learning Rate')
    axes[1, 2].set_yscale('log')
    axes[1, 2].grid(True, alpha=0.3)

    # Row 3: Advanced Metrics
    # Plot 7: Prediction Timing
    axes[2, 0].plot(epochs, history['early_predictions_pct'], 'blue', linewidth=2, label='Early')
    axes[2, 0].plot(epochs, history['late_predictions_pct'], 'red', linewidth=2, label='Late')
    axes[2, 0].set_title('Prediction Timing Analysis')
    axes[2, 0].set_xlabel('Epoch')
    axes[2, 0].set_ylabel('Percentage')
    axes[2, 0].legend()
    axes[2, 0].grid(True, alpha=0.3)

    # Plot 8: Gradient Norm
    axes[2, 1].plot(epochs, history['gradient_norm'], 'purple', linewidth=2)
    axes[2, 1].set_title('Gradient Norm')
    axes[2, 1].set_xlabel('Epoch')
    axes[2, 1].set_ylabel('Norm')
    axes[2, 1].set_yscale('log')
    axes[2, 1].grid(True, alpha=0.3)

    # Plot 9: Weight Decay
    axes[2, 2].plot(epochs, history['weight_decay'], 'green', linewidth=2)
    axes[2, 2].set_title('Adaptive Weight Decay')
    axes[2, 2].set_xlabel('Epoch')
    axes[2, 2].set_ylabel('Weight Decay')
    axes[2, 2].set_yscale('log')
    axes[2, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path / 'training_history.png', dpi=300, bbox_inches='tight')
    plt.close()

    print(f"Visualizations saved to {save_path}")
```

Summary
In this section, we covered training monitoring and logging:
- Key metrics: Loss, optimization, and evaluation metrics
- Dual logging: Console (INFO) and file (DEBUG)
- Training history: Store all metrics for analysis
- Visualization: 3×3 grid covering all aspects
- Diagnostics: Gradient norms, LR schedule, timing
| Metric Category | Examples |
|---|---|
| Loss | train_loss, rul_loss, health_loss |
| RUL Performance | RMSE (last/all), NASA score, R² |
| Health Classification | Accuracy, F1, per-class metrics |
| Optimization | LR, gradient norm, weight decay |
| Diagnostics | Early/late ratio, clip ratio |
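The warning signs in the tables above can also be checked programmatically against a saved history file. A minimal sketch; the thresholds and the `diagnose` helper are illustrative assumptions, and the history keys match the ones stored earlier in this section:

```python
import json

def diagnose(history, grad_norm_max=10.0, divergence_ratio=1.5):
    """Scan a saved training history for common warning signs.
    Thresholds are illustrative, not universal."""
    warnings = []
    # Exploding gradients: any logged norm above the threshold
    if any(n > grad_norm_max for n in history.get('gradient_norm', [])):
        warnings.append('gradient norm exceeded threshold (possible instability)')
    # Regression from best: final RMSE far above the best seen
    rmse = history.get('test_rmse_last', [])
    if rmse and rmse[-1] > divergence_ratio * min(rmse):
        warnings.append('final RMSE well above best (possible overfitting)')
    return warnings
```

Run it on a history loaded with `json.load`, e.g. `diagnose(json.load(open('logs/history.json')))` (path hypothetical), as a cheap sanity check before deeper analysis.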
Looking Ahead: Monitoring tells us what is happening; checkpointing preserves it. The next section covers checkpointing strategy—how to save and restore model state for recovery and deployment.
With monitoring in place, we implement checkpointing.