AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Know what metrics to monitor during training
Set up comprehensive logging to file and console
Track training history for analysis
Create informative visualizations of training progress
Diagnose training issues from logged data

Why This Matters: Without proper monitoring, you are training blind. Good logging helps you catch problems early (exploding gradients, stuck training), understand model behavior, and reproduce experiments. Visualizations reveal patterns that raw numbers hide.

What to Monitor

Effective monitoring tracks multiple aspects of the training process.

Loss Metrics

Metric	Why Monitor	Warning Signs
Training loss	Overall learning progress	Not decreasing, NaN
RUL loss	Primary task performance	Stuck, oscillating
Health loss	Secondary task performance	Much higher than RUL
Validation loss	Generalization	Diverging from train loss
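
The warning signs in the table above can be checked automatically at the end of each epoch. The following is a minimal sketch, not the book's implementation: the function name `check_loss_health` and the `patience` and `min_delta` thresholds are illustrative assumptions that you would tune to your own run.

```python
import math


def check_loss_health(train_losses, val_losses, patience=5, min_delta=1e-4):
    """Return a list of warning strings for per-epoch loss histories.

    `patience` and `min_delta` are illustrative defaults, not values from the text.
    """
    warnings = []

    # A NaN or inf training loss means the run has already diverged.
    if any(not math.isfinite(x) for x in train_losses):
        warnings.append("non-finite training loss")
        return warnings

    # Stalled: the best loss of the last `patience` epochs does not beat
    # the best loss seen before them by at least `min_delta`.
    if len(train_losses) > patience:
        best_before = min(train_losses[:-patience])
        best_recent = min(train_losses[-patience:])
        if best_before - best_recent < min_delta:
            warnings.append("training loss not decreasing")

    # Diverging: validation loss is higher than `patience` epochs ago
    # while training loss is lower over the same window.
    if len(train_losses) > patience and len(val_losses) > patience:
        val_rising = val_losses[-1] > val_losses[-1 - patience]
        train_falling = train_losses[-1] < train_losses[-1 - patience]
        if val_rising and train_falling:
            warnings.append("validation loss diverging from training loss")

    return warnings
```

Each returned string can be passed straight to `logger.warning(...)`, so the problem shows up in both the console and the log file.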

Optimization Metrics

Metric	Why Monitor	Warning Signs
Learning rate	Scheduler behavior	Dropping too fast/slow
Gradient norm	Training stability	>10 or <0.001
Weight decay	Regularization strength	Unexpected changes
Clip ratio	Gradient clipping frequency	>50% is concerning
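
Gradient norm and clip ratio are both derived from the same per-step quantity: the global L2 norm of all gradients before clipping. The sketch below uses plain Python lists so that it runs without a deep learning framework. `global_grad_norm` and `ClipTracker` are hypothetical names introduced here. If the training loop uses PyTorch, `torch.nn.utils.clip_grad_norm_` already returns the pre-clip total norm, and that value could be passed to `update` instead.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every parameter as one flat vector."""
    return math.sqrt(sum(g * g for param_grads in grads for g in param_grads))


class ClipTracker:
    """Accumulates per-step gradient norms and counts how often clipping would trigger."""

    def __init__(self, max_norm):
        self.max_norm = max_norm  # clipping threshold (assumed to match the optimizer setup)
        self.steps = 0
        self.clipped = 0
        self.norm_sum = 0.0

    def update(self, grad_norm):
        """Record the pre-clip gradient norm of one optimization step."""
        self.steps += 1
        self.norm_sum += grad_norm
        if grad_norm > self.max_norm:
            self.clipped += 1

    @property
    def clip_ratio(self):
        """Fraction of steps whose norm exceeded max_norm."""
        return self.clipped / self.steps if self.steps else 0.0

    @property
    def avg_grad_norm(self):
        """Mean pre-clip gradient norm over the recorded steps."""
        return self.norm_sum / self.steps if self.steps else 0.0
```

Creating a fresh tracker at the start of every epoch gives one `avg_grad_norm` and one `clip_ratio` value per epoch, which is the granularity the warning thresholds in the table refer to.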

Evaluation Metrics

Metric	Why Monitor	Warning Signs
RMSE (last-cycle)	Primary benchmark	Not improving after 50 epochs
RMSE (all-cycles)	Model consistency	Much higher than last-cycle
NASA score	Asymmetric performance	Sudden spikes
Health accuracy	Classification quality	<70%
Early/late ratio	Prediction bias	Strong bias either way
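
The NASA score and the early/late ratio both depend on the sign of the prediction error. As a sketch, the code below implements the standard asymmetric scoring function commonly used with the C-MAPSS benchmark, which penalizes late predictions more heavily than early ones. It is an assumption that this matches the book's metric: the `nasa_score_paper` and `nasa_score_raw` variants logged later in this section may normalize or clip differently. The convention that "early" means predicted RUL below true RUL is also assumed here.

```python
import math


def nasa_score(y_true, y_pred):
    """Asymmetric RUL score; lower is better.

    d = predicted - true. Early predictions (d < 0) use exp(-d / 13) - 1,
    late predictions (d >= 0) use the steeper exp(d / 10) - 1.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total


def timing_breakdown(y_true, y_pred):
    """Return (early %, late %), where early means pred < true and late means pred > true."""
    n = len(y_true)
    if n == 0:
        return 0.0, 0.0
    early = sum(1 for t, p in zip(y_true, y_pred) if p < t)
    late = sum(1 for t, p in zip(y_true, y_pred) if p > t)
    return 100.0 * early / n, 100.0 * late / n
```

Because the score is exponential in the error, a single badly late prediction can dominate the sum, which is why sudden spikes in this metric are worth investigating even when RMSE looks stable.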

Logging Setup

A proper logging setup writes to both console and file.

Logger Configuration

🐍python

import logging
from datetime import datetime
from pathlib import Path

def setup_logging(
    experiment_name: str,
    log_dir: str = 'logs'
) -> logging.Logger:
    """
    Configure logging for training experiment.

    Logs to both console (INFO level) and file (DEBUG level).

    Args:
        experiment_name: Name for this experiment
        log_dir: Directory for log files

    Returns:
        Configured logger
    """
    # Create log directory (including any missing parent directories)
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    # Generate timestamped log filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_path / f'{experiment_name}_{timestamp}.log'

    # Create logger
    logger = logging.getLogger(experiment_name)
    logger.setLevel(logging.DEBUG)

    # Remove and close handlers left over from a previous call,
    # so repeated calls do not duplicate output or leak file handles
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    # Do not pass records on to the root logger (avoids duplicate console lines)
    logger.propagate = False

    # Console handler (INFO and above)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_format = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%H:%M:%S'
    )
    console_handler.setFormatter(console_format)

    # File handler (DEBUG and above)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    file_handler.setFormatter(file_format)

    # Add handlers
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)

    logger.info(f"Logging to {log_file}")

    return logger

Logging During Training

🐍python

# Example usage in training loop
# (total_params, epochs, current_lr, is_new_best, best_rmse and best_epoch
#  are assumed to be defined by the surrounding training script)
logger = setup_logging('amnl_fd001')

logger.info("=" * 60)
logger.info("Starting AMNL training on FD001")
logger.info("Model: DualTaskEnhancedModel")
logger.info(f"Parameters: {total_params:,}")
logger.info("=" * 60)

for epoch in range(epochs):
    # Training
    train_results = train_epoch(...)

    # Evaluation
    eval_results = evaluate_model_comprehensive(...)

    # Log progress (console and file)
    logger.info(
        f"Epoch [{epoch+1:3d}/{epochs}] | "
        f"Loss: {train_results['loss']:.4f} | "
        f"RMSE: {eval_results['RMSE_last_cycle']:.2f} | "
        f"NASA: {eval_results['nasa_score_paper']:.1f} | "
        f"Health: {eval_results['health_accuracy']:.1f}% | "
        f"LR: {current_lr:.6f}"
    )

    # Log detailed debug info (file only, because the console handler is INFO level)
    logger.debug(
        f"Gradient norm: {train_results['avg_grad_norm']:.4f}, "
        f"RUL loss: {train_results['rul_loss']:.4f}, "
        f"Health loss: {train_results['health_loss']:.4f}"
    )

    # Log best model updates
    if is_new_best:
        logger.info(f"  New best model! RMSE: {eval_results['RMSE_last_cycle']:.2f}")

logger.info("Training complete")
logger.info(f"Best RMSE: {best_rmse:.2f} at epoch {best_epoch+1}")

Training History

Storing the complete training history enables post-hoc analysis.

History Structure

🐍python

from collections import defaultdict

# Initialize history dictionary
history = defaultdict(list)

# During training, append each epoch's metrics
for epoch in range(epochs):
    # ... training and evaluation ...

    # Store all metrics
    history['train_loss'].append(train_results['loss'])
    history['rul_loss'].append(train_results['rul_loss'])
    history['health_loss'].append(train_results['health_loss'])

    history['test_rmse_all'].append(eval_results['RMSE_all_cycles'])
    history['test_rmse_last'].append(eval_results['RMSE_last_cycle'])
    history['nasa_score_paper'].append(eval_results['nasa_score_paper'])
    history['nasa_score_raw'].append(eval_results['nasa_score_raw'])

    history['health_accuracy'].append(eval_results['health_accuracy'])
    history['health_f1'].append(eval_results['health_f1'])

    history['r2_all_cycles'].append(eval_results['R2_all_cycles'])
    history['r2_last_cycle'].append(eval_results['R2_last_cycle'])

    history['learning_rate'].append(current_lr)
    history['gradient_norm'].append(train_results['avg_grad_norm'])
    history['weight_decay'].append(current_wd)

    # Prediction timing analysis
    early_pct = eval_results['early_predictions'] / eval_results['n_total_predictions'] * 100
    late_pct = eval_results['late_predictions'] / eval_results['n_total_predictions'] * 100
    history['early_predictions_pct'].append(early_pct)
    history['late_predictions_pct'].append(late_pct)

Saving History

🐍python

import json

import numpy as np

def convert_numpy_types(obj):
    """Convert numpy types to Python types for JSON serialization."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, np.bool_):
        return bool(obj)
    elif isinstance(obj, np.integer):   # covers int32, int64, and other widths
        return int(obj)
    elif isinstance(obj, np.floating):  # covers float16, float32, float64
        return float(obj)
    elif isinstance(obj, dict):
        return {k: convert_numpy_types(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [convert_numpy_types(v) for v in obj]
    return obj


def save_training_history(
    history: dict,
    filepath: str
):
    """Save training history to JSON file."""
    # Convert numpy types (dict() also turns a defaultdict into a plain dict)
    history_clean = convert_numpy_types(dict(history))

    with open(filepath, 'w') as f:
        json.dump(history_clean, f, indent=2)

    print(f"Training history saved to {filepath}")
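
Saving the history only pays off if it can be read back for analysis. The following sketch reloads a saved JSON file and extracts every metric at the best epoch. The names `load_training_history` and `summarize_best_epoch` are hypothetical helpers introduced here, and the default metric key `test_rmse_last` assumes the history layout shown earlier in this section.

```python
import json


def load_training_history(filepath):
    """Load a history dict (metric name -> list of per-epoch values) from JSON."""
    with open(filepath) as f:
        return json.load(f)


def summarize_best_epoch(history, metric="test_rmse_last"):
    """Return (1-based best epoch, metrics at that epoch) for the lowest `metric` value."""
    values = history[metric]
    n_epochs = len(values)
    best_idx = min(range(n_epochs), key=values.__getitem__)

    # Only include series that have one entry per epoch.
    snapshot = {
        name: series[best_idx]
        for name, series in history.items()
        if isinstance(series, list) and len(series) == n_epochs
    }
    return best_idx + 1, snapshot
```

Selecting on last-cycle RMSE mirrors the "best model" criterion used in the training loop. For a metric where higher is better, such as health accuracy, `min` would need to be replaced with `max`.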

Visualization

Visualizations reveal training dynamics that numbers alone cannot show.

Comprehensive Training Plots

🐍python

from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

def create_training_visualizations(
    history: dict,
    dataset_name: str,
    save_dir: str = 'visualizations'
):
    """
    Create comprehensive training visualization.

    Generates a 3x3 grid of plots covering all key metrics.
    """
    save_path = Path(save_dir) / dataset_name
    save_path.mkdir(parents=True, exist_ok=True)

    epochs = range(1, len(history['train_loss']) + 1)

    # Set style
    plt.style.use('seaborn-v0_8-darkgrid')
    sns.set_palette("husl")

    fig, axes = plt.subplots(3, 3, figsize=(18, 15))
    fig.suptitle(f'AMNL {dataset_name} Training Results', fontsize=16, fontweight='bold')

    # Row 1: Loss and RUL Metrics
    # Plot 1: Training Loss
    axes[0, 0].plot(epochs, history['train_loss'], 'b-', linewidth=2, label='Train Loss')
    axes[0, 0].set_title('Training Loss')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Plot 2: RMSE Evolution
    axes[0, 1].plot(epochs, history['test_rmse_all'], 'g-', linewidth=2, label='All Cycles')
    axes[0, 1].plot(epochs, history['test_rmse_last'], 'r-', linewidth=2, label='Last Cycle')
    axes[0, 1].axhline(y=min(history['test_rmse_last']), color='blue', linestyle='--',
                       alpha=0.7, label=f'Best: {min(history["test_rmse_last"]):.2f}')
    axes[0, 1].set_title('RMSE Evolution')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('RMSE')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Plot 3: NASA Score
    axes[0, 2].plot(epochs, history['nasa_score_paper'], 'purple', linewidth=2)
    axes[0, 2].set_title('NASA Score (Paper Style)')
    axes[0, 2].set_xlabel('Epoch')
    axes[0, 2].set_ylabel('NASA Score')
    axes[0, 2].grid(True, alpha=0.3)

    # Row 2: Health Classification and R²
    # Plot 4: Health Accuracy
    axes[1, 0].plot(epochs, history['health_accuracy'], 'cyan', linewidth=2)
    axes[1, 0].axhline(y=80, color='g', linestyle='--', alpha=0.7, label='Target 80%')
    axes[1, 0].set_title('Health State Accuracy')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Accuracy (%)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Plot 5: R² Score
    axes[1, 1].plot(epochs, history['r2_all_cycles'], 'brown', linewidth=2, label='All Cycles')
    axes[1, 1].plot(epochs, history['r2_last_cycle'], 'pink', linewidth=2, label='Last Cycle')
    axes[1, 1].set_title('R² Score Evolution')
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('R² Score')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Plot 6: Learning Rate
    axes[1, 2].plot(epochs, history['learning_rate'], 'red', linewidth=2)
    axes[1, 2].set_title('Learning Rate Schedule')
    axes[1, 2].set_xlabel('Epoch')
    axes[1, 2].set_ylabel('Learning Rate')
    axes[1, 2].set_yscale('log')
    axes[1, 2].grid(True, alpha=0.3)

    # Row 3: Advanced Metrics
    # Plot 7: Prediction Timing
    axes[2, 0].plot(epochs, history['early_predictions_pct'], 'blue', linewidth=2, label='Early')
    axes[2, 0].plot(epochs, history['late_predictions_pct'], 'red', linewidth=2, label='Late')
    axes[2, 0].set_title('Prediction Timing Analysis')
    axes[2, 0].set_xlabel('Epoch')
    axes[2, 0].set_ylabel('Percentage')
    axes[2, 0].legend()
    axes[2, 0].grid(True, alpha=0.3)

    # Plot 8: Gradient Norm
    axes[2, 1].plot(epochs, history['gradient_norm'], 'purple', linewidth=2)
    axes[2, 1].set_title('Gradient Norm')
    axes[2, 1].set_xlabel('Epoch')
    axes[2, 1].set_ylabel('Norm')
    axes[2, 1].set_yscale('log')
    axes[2, 1].grid(True, alpha=0.3)

    # Plot 9: Weight Decay
    axes[2, 2].plot(epochs, history['weight_decay'], 'green', linewidth=2)
    axes[2, 2].set_title('Adaptive Weight Decay')
    axes[2, 2].set_xlabel('Epoch')
    axes[2, 2].set_ylabel('Weight Decay')
    axes[2, 2].set_yscale('log')
    axes[2, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path / 'training_history.png', dpi=300, bbox_inches='tight')
    plt.close()

    print(f"Visualizations saved to {save_path}")
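
The plotting function above reads twelve history keys and plots each against the same epoch axis, so a missing key or a series of the wrong length fails partway through building the figure. A small pre-check can catch this before any plotting starts. This is a sketch under stated assumptions: `validate_history`, `PLOT_KEYS`, and `LOG_SCALE_KEYS` are hypothetical names, and the key lists are copied from the plotting code rather than from any official schema.

```python
PLOT_KEYS = [
    "train_loss", "test_rmse_all", "test_rmse_last", "nasa_score_paper",
    "health_accuracy", "r2_all_cycles", "r2_last_cycle", "learning_rate",
    "early_predictions_pct", "late_predictions_pct", "gradient_norm", "weight_decay",
]
# Series drawn with set_yscale('log'); non-positive values cannot be shown on a log axis.
LOG_SCALE_KEYS = ["learning_rate", "gradient_norm", "weight_decay"]


def validate_history(history, required_keys=PLOT_KEYS, log_keys=LOG_SCALE_KEYS):
    """Return a list of problems that would break or distort plotting; empty means OK."""
    problems = []

    # Use `in` rather than indexing: indexing a defaultdict(list)
    # would silently create the missing key with an empty list.
    present = [k for k in required_keys if k in history]
    for k in required_keys:
        if k not in history:
            problems.append(f"missing key: {k}")
    if not present:
        return problems

    # Every series must have one value per epoch.
    n_epochs = len(history[present[0]])
    for k in present:
        if len(history[k]) != n_epochs:
            problems.append(
                f"length mismatch: {k} has {len(history[k])}, expected {n_epochs}"
            )

    for k in log_keys:
        if k in history and any(v <= 0 for v in history[k]):
            problems.append(f"non-positive value on log axis: {k}")

    return problems
```

Calling `validate_history(history)` before `create_training_visualizations(...)` and logging any returned problems keeps a long training run from ending with an unhelpful plotting traceback.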

Summary

In this section, we covered training monitoring and logging:

Key metrics: Loss, optimization, and evaluation metrics
Dual logging: Console (INFO) and file (DEBUG)
Training history: Store all metrics for analysis
Visualization: 3×3 grid covering all aspects
Diagnostics: Gradient norms, LR schedule, timing

Metric Category	Examples
Loss	train_loss, rul_loss, health_loss
RUL Performance	RMSE (last/all), NASA score, R²
Health Classification	Accuracy, F1, per-class metrics
Optimization	LR, gradient norm, weight decay
Diagnostics	Early/late ratio, clip ratio

Looking Ahead: Monitoring tells us what is happening; checkpointing preserves it. With monitoring in place, the next section covers checkpointing strategy: how to save and restore model state for recovery and deployment.