Chapter 14

Training Monitoring and Logging

Complete Training Script

Learning Objectives

By the end of this section, you will:

  1. Know what metrics to monitor during training
  2. Set up comprehensive logging to file and console
  3. Track training history for analysis
  4. Create informative visualizations of training progress
  5. Diagnose training issues from logged data

Why This Matters: Without proper monitoring, you are training blind. Good logging helps you catch problems early (exploding gradients, stuck training), understand model behavior, and reproduce experiments. Visualizations reveal patterns that raw numbers hide.

What to Monitor

Effective monitoring tracks multiple aspects of the training process.

Loss Metrics

| Metric | Why Monitor | Warning Signs |
| --- | --- | --- |
| Training loss | Overall learning progress | Not decreasing, NaN |
| RUL loss | Primary task performance | Stuck, oscillating |
| Health loss | Secondary task performance | Much higher than RUL |
| Validation loss | Generalization | Diverging from train loss |
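
The NaN warning sign above is worth automating: a single non-finite loss can silently poison every subsequent update. A minimal guard, sketched here as an illustration (the function name and the stop-on-failure convention are assumptions, not part of the training script):

```python
import math

def loss_is_usable(loss_value: float, epoch: int) -> bool:
    """Return False when the loss is NaN or infinite, so the loop can halt early."""
    if not math.isfinite(loss_value):  # catches NaN, +inf, and -inf
        print(f"Epoch {epoch}: non-finite loss ({loss_value}) - stopping")
        return False
    return True

# In the training loop: if not loss_is_usable(loss.item(), epoch): break
```

Checking once per epoch is usually enough; checking every batch catches the divergence closer to its source at negligible cost.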

Optimization Metrics

| Metric | Why Monitor | Warning Signs |
| --- | --- | --- |
| Learning rate | Scheduler behavior | Dropping too fast/slow |
| Gradient norm | Training stability | >10 or <0.001 |
| Weight decay | Regularization strength | Unexpected changes |
| Clip ratio | Gradient clipping frequency | >50% is concerning |
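
The gradient norm and clip ratio in this table can be computed framework-independently. The sketch below uses NumPy arrays as stand-ins for per-step gradient tensors, and the threshold of 1.0 is an assumed clipping value, not one specified by the text:

```python
import numpy as np

def global_grad_norm(grads) -> float:
    """Global L2 norm over all gradient tensors (same convention as clip_grad_norm_)."""
    return float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))

max_norm = 1.0       # assumed clipping threshold
clip_events = 0
steps = 0
# Stand-ins for the gradients seen at two optimizer steps
for grads in ([np.ones(4)], [np.full(4, 0.1)]):
    if global_grad_norm(grads) > max_norm:
        clip_events += 1
    steps += 1

clip_ratio = clip_events / steps  # per the table, >0.5 over an epoch is concerning
```

Logging the average norm and clip ratio once per epoch (as `avg_grad_norm` is logged later in this section) keeps the overhead negligible.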

Evaluation Metrics

| Metric | Why Monitor | Warning Signs |
| --- | --- | --- |
| RMSE (last-cycle) | Primary benchmark | Not improving after 50 epochs |
| RMSE (all-cycles) | Model consistency | Much higher than last-cycle |
| NASA score | Asymmetric performance | Sudden spikes |
| Health accuracy | Classification quality | <70% |
| Early/late ratio | Prediction bias | Strong bias either way |
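
The NASA score's asymmetry is why the table flags prediction bias: late predictions (overestimating remaining life) are penalized more heavily than early ones. Below is a sketch of the standard C-MAPSS scoring function; whether the `nasa_score_paper` metric used later follows exactly this form is an assumption:

```python
import numpy as np

def nasa_score(y_true, y_pred) -> float:
    """C-MAPSS score: d = pred - true; late (d > 0) costs more than early (d < 0)."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / 13.0) - 1.0,   # early prediction
                                 np.exp(d / 10.0) - 1.0)))  # late prediction
```

Because the score is exponential in the error, a handful of badly late predictions can dominate it, which is what the "sudden spikes" warning sign refers to.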

Logging Setup

A proper logging setup writes to both console and file.

Logger Configuration

```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logging(
    experiment_name: str,
    log_dir: str = 'logs'
) -> logging.Logger:
    """
    Configure logging for a training experiment.

    Logs to both console (INFO level) and file (DEBUG level).

    Args:
        experiment_name: Name for this experiment
        log_dir: Directory for log files

    Returns:
        Configured logger
    """
    # Create log directory
    log_path = Path(log_dir)
    log_path.mkdir(exist_ok=True)

    # Generate timestamped log filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_path / f'{experiment_name}_{timestamp}.log'

    # Create logger
    logger = logging.getLogger(experiment_name)
    logger.setLevel(logging.DEBUG)

    # Clear any existing handlers
    logger.handlers = []

    # Console handler (INFO and above)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_format = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%H:%M:%S'
    )
    console_handler.setFormatter(console_format)

    # File handler (DEBUG and above)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    file_handler.setFormatter(file_format)

    # Add handlers
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)

    logger.info(f"Logging to {log_file}")

    return logger
```

Logging During Training

```python
# Example usage in a training loop
logger = setup_logging('amnl_fd001')

logger.info("=" * 60)
logger.info("Starting AMNL training on FD001")
logger.info("Model: DualTaskEnhancedModel")
logger.info(f"Parameters: {total_params:,}")
logger.info("=" * 60)

for epoch in range(epochs):
    # Training
    train_results = train_epoch(...)

    # Evaluation
    eval_results = evaluate_model_comprehensive(...)

    # Log progress
    logger.info(
        f"Epoch [{epoch+1:3d}/{epochs}] | "
        f"Loss: {train_results['loss']:.4f} | "
        f"RMSE: {eval_results['RMSE_last_cycle']:.2f} | "
        f"NASA: {eval_results['nasa_score_paper']:.1f} | "
        f"Health: {eval_results['health_accuracy']:.1f}% | "
        f"LR: {current_lr:.6f}"
    )

    # Log detailed debug info
    logger.debug(
        f"Gradient norm: {train_results['avg_grad_norm']:.4f}, "
        f"RUL loss: {train_results['rul_loss']:.4f}, "
        f"Health loss: {train_results['health_loss']:.4f}"
    )

    # Log best model updates
    if is_new_best:
        logger.info(f"  New best model! RMSE: {eval_results['RMSE_last_cycle']:.2f}")

logger.info("Training complete")
logger.info(f"Best RMSE: {best_rmse:.2f} at epoch {best_epoch+1}")
```

Training History

Storing the complete training history enables post-hoc analysis.

History Structure

```python
from collections import defaultdict

# Initialize history dictionary
history = defaultdict(list)

# During training, append each epoch's metrics
for epoch in range(epochs):
    # ... training and evaluation ...

    # Store all metrics
    history['train_loss'].append(train_results['loss'])
    history['rul_loss'].append(train_results['rul_loss'])
    history['health_loss'].append(train_results['health_loss'])

    history['test_rmse_all'].append(eval_results['RMSE_all_cycles'])
    history['test_rmse_last'].append(eval_results['RMSE_last_cycle'])
    history['nasa_score_paper'].append(eval_results['nasa_score_paper'])
    history['nasa_score_raw'].append(eval_results['nasa_score_raw'])

    history['health_accuracy'].append(eval_results['health_accuracy'])
    history['health_f1'].append(eval_results['health_f1'])

    history['r2_all_cycles'].append(eval_results['R2_all_cycles'])
    history['r2_last_cycle'].append(eval_results['R2_last_cycle'])

    history['learning_rate'].append(current_lr)
    history['gradient_norm'].append(train_results['avg_grad_norm'])
    history['weight_decay'].append(current_wd)

    # Prediction timing analysis
    early_pct = eval_results['early_predictions'] / eval_results['n_total_predictions'] * 100
    late_pct = eval_results['late_predictions'] / eval_results['n_total_predictions'] * 100
    history['early_predictions_pct'].append(early_pct)
    history['late_predictions_pct'].append(late_pct)
```

Saving History

```python
import json

import numpy as np

def convert_numpy_types(obj):
    """Convert numpy types to Python types for JSON serialization."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, (np.int64, np.int32)):
        return int(obj)
    elif isinstance(obj, (np.float64, np.float32)):
        return float(obj)
    elif isinstance(obj, dict):
        return {k: convert_numpy_types(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_numpy_types(v) for v in obj]
    return obj


def save_training_history(
    history: dict,
    filepath: str
):
    """Save training history to JSON file."""
    # Convert numpy types (json.dump cannot serialize them directly)
    history_clean = convert_numpy_types(dict(history))

    with open(filepath, 'w') as f:
        json.dump(history_clean, f, indent=2)

    print(f"Training history saved to {filepath}")
```

Visualization

Visualizations reveal training dynamics that numbers alone cannot show.

Comprehensive Training Plots

```python
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

def create_training_visualizations(
    history: dict,
    dataset_name: str,
    save_dir: str = 'visualizations'
):
    """
    Create comprehensive training visualization.

    Generates a 3x3 grid of plots covering all key metrics.
    """
    save_path = Path(save_dir) / dataset_name
    save_path.mkdir(parents=True, exist_ok=True)

    epochs = range(1, len(history['train_loss']) + 1)

    # Set style
    plt.style.use('seaborn-v0_8-darkgrid')
    sns.set_palette("husl")

    fig, axes = plt.subplots(3, 3, figsize=(18, 15))
    fig.suptitle(f'AMNL {dataset_name} Training Results', fontsize=16, fontweight='bold')

    # Row 1: Loss and RUL Metrics
    # Plot 1: Training Loss
    axes[0, 0].plot(epochs, history['train_loss'], 'b-', linewidth=2, label='Train Loss')
    axes[0, 0].set_title('Training Loss')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Plot 2: RMSE Evolution
    axes[0, 1].plot(epochs, history['test_rmse_all'], 'g-', linewidth=2, label='All Cycles')
    axes[0, 1].plot(epochs, history['test_rmse_last'], 'r-', linewidth=2, label='Last Cycle')
    axes[0, 1].axhline(y=min(history['test_rmse_last']), color='blue', linestyle='--',
                       alpha=0.7, label=f'Best: {min(history["test_rmse_last"]):.2f}')
    axes[0, 1].set_title('RMSE Evolution')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('RMSE')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Plot 3: NASA Score
    axes[0, 2].plot(epochs, history['nasa_score_paper'], 'purple', linewidth=2)
    axes[0, 2].set_title('NASA Score (Paper Style)')
    axes[0, 2].set_xlabel('Epoch')
    axes[0, 2].set_ylabel('NASA Score')
    axes[0, 2].grid(True, alpha=0.3)

    # Row 2: Health Classification and R²
    # Plot 4: Health Accuracy
    axes[1, 0].plot(epochs, history['health_accuracy'], 'cyan', linewidth=2)
    axes[1, 0].axhline(y=80, color='g', linestyle='--', alpha=0.7, label='Target 80%')
    axes[1, 0].set_title('Health State Accuracy')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Accuracy (%)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Plot 5: R² Score
    axes[1, 1].plot(epochs, history['r2_all_cycles'], 'brown', linewidth=2, label='All Cycles')
    axes[1, 1].plot(epochs, history['r2_last_cycle'], 'pink', linewidth=2, label='Last Cycle')
    axes[1, 1].set_title('R² Score Evolution')
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('R² Score')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Plot 6: Learning Rate
    axes[1, 2].plot(epochs, history['learning_rate'], 'red', linewidth=2)
    axes[1, 2].set_title('Learning Rate Schedule')
    axes[1, 2].set_xlabel('Epoch')
    axes[1, 2].set_ylabel('Learning Rate')
    axes[1, 2].set_yscale('log')
    axes[1, 2].grid(True, alpha=0.3)

    # Row 3: Advanced Metrics
    # Plot 7: Prediction Timing
    axes[2, 0].plot(epochs, history['early_predictions_pct'], 'blue', linewidth=2, label='Early')
    axes[2, 0].plot(epochs, history['late_predictions_pct'], 'red', linewidth=2, label='Late')
    axes[2, 0].set_title('Prediction Timing Analysis')
    axes[2, 0].set_xlabel('Epoch')
    axes[2, 0].set_ylabel('Percentage')
    axes[2, 0].legend()
    axes[2, 0].grid(True, alpha=0.3)

    # Plot 8: Gradient Norm
    axes[2, 1].plot(epochs, history['gradient_norm'], 'purple', linewidth=2)
    axes[2, 1].set_title('Gradient Norm')
    axes[2, 1].set_xlabel('Epoch')
    axes[2, 1].set_ylabel('Norm')
    axes[2, 1].set_yscale('log')
    axes[2, 1].grid(True, alpha=0.3)

    # Plot 9: Weight Decay
    axes[2, 2].plot(epochs, history['weight_decay'], 'green', linewidth=2)
    axes[2, 2].set_title('Adaptive Weight Decay')
    axes[2, 2].set_xlabel('Epoch')
    axes[2, 2].set_ylabel('Weight Decay')
    axes[2, 2].set_yscale('log')
    axes[2, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path / 'training_history.png', dpi=300, bbox_inches='tight')
    plt.close()

    print(f"Visualizations saved to {save_path}")
```
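
Per-epoch curves, particularly the gradient norm, can be noisy enough to obscure their trend. A moving average often makes these plots easier to read; the helper below is an optional addition, not part of the script above:

```python
import numpy as np

def smooth(values, window: int = 5):
    """Centered moving average; returns the input unchanged if it is too short."""
    v = np.asarray(values, dtype=float)
    if len(v) < window:
        return v
    return np.convolve(v, np.ones(window) / window, mode='valid')
```

Plotting both the raw curve (at low alpha) and its smoothed version is a common convention that preserves outliers while highlighting the trend.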

Summary

In this section, we covered training monitoring and logging:

  1. Key metrics: Loss, optimization, and evaluation metrics
  2. Dual logging: Console (INFO) and file (DEBUG)
  3. Training history: Store all metrics for analysis
  4. Visualization: 3×3 grid covering all aspects
  5. Diagnostics: Gradient norms, LR schedule, timing

| Metric Category | Examples |
| --- | --- |
| Loss | train_loss, rul_loss, health_loss |
| RUL Performance | RMSE (last/all), NASA score, R² |
| Health Classification | Accuracy, F1, per-class metrics |
| Optimization | LR, gradient norm, weight decay |
| Diagnostics | Early/late ratio, clip ratio |

Looking Ahead: Monitoring tells us what is happening; checkpointing preserves it. The next section covers checkpointing strategy—how to save and restore model state for recovery and deployment.

With monitoring in place, we implement checkpointing.