Learning Objectives
By the end of this section, you will:
- Understand AMNL's training time of ~90 minutes per dataset
- Learn optimization techniques to accelerate training
- Analyze hardware scaling for multi-GPU training
- Apply practical tips for efficient experimentation
- Plan compute budgets for model development
Core Insight: AMNL trains in approximately 90 minutes per dataset on a single GPU, enabling rapid experimentation and hyperparameter tuning. With mixed precision and optimized data loading, training time can be reduced to under 60 minutes without sacrificing model quality.
Training Time Overview
Training efficiency directly impacts research velocity and operational costs. AMNL strikes a practical balance between model complexity and training speed.
Baseline Training Times
| Dataset | Training Samples | Epochs | Time (RTX 5000) |
|---|---|---|---|
| FD001 | ~17,700 | 100 | ~75 min |
| FD002 | ~53,200 | 100 | ~95 min |
| FD003 | ~21,800 | 100 | ~80 min |
| FD004 | ~61,200 | 100 | ~105 min |
| Average | - | 100 | ~90 min |
Training Breakdown
Where does training time go?
| Phase | Time | Percentage |
|---|---|---|
| Forward pass | ~35 min | 39% |
| Backward pass | ~40 min | 44% |
| Data loading | ~8 min | 9% |
| Validation | ~5 min | 6% |
| Overhead (logging, etc.) | ~2 min | 2% |
Backward Pass Dominates
The backward pass takes longer than the forward pass due to gradient computation and accumulation across the dual-task loss. This is typical for multi-task models with shared encoders.
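The breakdown above can be measured directly with phase-level timing. A minimal sketch, using a stand-in model and batch rather than the real AMNL components:

```python
import time
import torch
import torch.nn as nn

# Stand-in model and batch; substitute the real AMNL model and loader
model = nn.Sequential(nn.Linear(14, 64), nn.ReLU(), nn.Linear(64, 1))
batch, target = torch.randn(256, 14), torch.randn(256, 1)

fwd_time = bwd_time = 0.0
for _ in range(10):
    t0 = time.perf_counter()
    loss = nn.functional.mse_loss(model(batch), target)
    # On GPU, call torch.cuda.synchronize() before each timestamp,
    # since CUDA kernels launch asynchronously
    t1 = time.perf_counter()
    loss.backward()
    t2 = time.perf_counter()
    model.zero_grad()
    fwd_time += t1 - t0
    bwd_time += t2 - t1

print(f"forward {fwd_time * 1e3:.1f} ms, backward {bwd_time * 1e3:.1f} ms")
```

For deeper analysis, `torch.profiler` attributes time to individual operators, but coarse timestamps like these are enough to confirm the forward/backward split.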
Optimization Techniques
Several techniques can significantly reduce training time without affecting model quality.
1. Mixed Precision Training
Using FP16 for forward/backward passes with FP32 accumulation provides a significant speedup:
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        rul_pred, health_pred = model(batch)
        loss = amnl_loss(rul_pred, health_pred, targets)

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Speedup: ~1.4x (90 min → ~65 min)
```

| Precision | Training Time | Speedup | Memory |
|---|---|---|---|
| FP32 (baseline) | 90 min | 1.0x | 3.2 GB |
| Mixed (FP16+FP32) | 65 min | 1.4x | 2.1 GB |
| Pure FP16 | 58 min | 1.55x | 1.8 GB* |
FP16 Caution
Pure FP16 training can cause numerical instability with small gradients. Mixed precision (FP16 forward, FP32 accumulation) is recommended for reliable training.
2. Optimized Data Loading
```python
from torch.utils.data import DataLoader

# Optimized DataLoader configuration
train_loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,            # Parallel data loading
    pin_memory=True,          # Faster GPU transfer
    prefetch_factor=2,        # Prefetch batches
    persistent_workers=True   # Keep workers alive
)

# Speedup: ~1.15x (8 min data loading → ~4 min)
```

3. Gradient Accumulation
For larger effective batch sizes without memory increase:
```python
accumulation_steps = 4
effective_batch_size = 256 * 4  # = 1024

for i, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Benefit: Larger effective batch → faster convergence
# ~10% fewer epochs needed for same performance
```

4. Early Stopping
Avoid unnecessary training epochs with proper early stopping:
```python
class EarlyStopping:
    def __init__(self, patience=15, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return False  # Continue training
        else:
            self.counter += 1
            return self.counter >= self.patience  # Stop if patience exceeded

# Typical result: Training stops at epoch 60-80 instead of 100
# Time savings: ~20-30%
```

Combined Optimization Impact
| Optimization | Time | Cumulative Speedup |
|---|---|---|
| Baseline | 90 min | 1.0x |
| + Mixed precision | 65 min | 1.4x |
| + Data loading | 57 min | 1.6x |
| + Early stopping | 45 min | 2.0x |
| + Gradient accumulation | 40 min | 2.25x |
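The individual techniques compose naturally in one training loop. A hedged sketch of how they fit together, with a placeholder model, loss, and synthetic batches standing in for the real AMNL components:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Placeholders for the real AMNL model, loss, and optimized DataLoader
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
model = nn.Linear(14, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler(enabled=use_amp)

accumulation_steps = 4
patience, min_delta = 15, 1e-3
best_val, counter = float("inf"), 0

batches = [(torch.randn(64, 14), torch.randn(64, 1)) for _ in range(8)]

for epoch in range(5):
    for i, (x, y) in enumerate(batches):
        with autocast(enabled=use_amp):                 # mixed precision
            loss = nn.functional.mse_loss(model(x.to(device)), y.to(device))
        # Scale down so accumulated gradients match one large-batch step
        scaler.scale(loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:           # gradient accumulation
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    # Placeholder validation loss; real code evaluates a held-out set
    val_loss = loss.item()
    if val_loss < best_val - min_delta:                 # early stopping
        best_val, counter = val_loss, 0
    elif (counter := counter + 1) >= patience:
        break
```

On CPU the AMP machinery degrades to a no-op (`enabled=False`), so the same loop runs anywhere; the speedups apply only on CUDA hardware.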
Hardware Scaling
Training can be accelerated further with better hardware or distributed training.
Single GPU Scaling
| GPU | VRAM | Training Time | Cost/Hour |
|---|---|---|---|
| RTX 3060 | 12 GB | ~140 min | $0.15 |
| RTX 3080 | 10 GB | ~95 min | $0.30 |
| RTX 5000 | 16 GB | ~90 min | $0.50 |
| A100 40GB | 40 GB | ~45 min | $2.50 |
| H100 80GB | 80 GB | ~25 min | $4.00 |
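Raw speed is not the same as cost-efficiency: per-run cost follows directly from the table above by multiplying training hours by the hourly rate. A quick check, using the table's illustrative prices:

```python
# Cost per full training run = (minutes / 60) * hourly rate, from the table above
gpu_costs = {
    "RTX 3060": (140, 0.15),
    "RTX 3080": (95, 0.30),
    "RTX 5000": (90, 0.50),
    "A100 40GB": (45, 2.50),
    "H100 80GB": (25, 4.00),
}
for name, (minutes, rate) in gpu_costs.items():
    print(f"{name}: ${minutes / 60 * rate:.2f} per training run")
```

Note that the slowest GPU (RTX 3060, ~$0.35/run) is the cheapest per run, while the H100 (~$1.67) undercuts the A100 (~$1.88), so the right choice depends on whether wall-clock time or budget dominates.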
Multi-GPU Training
```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training
dist.init_process_group("nccl")
model = DDP(model, device_ids=[local_rank])

# Training loop remains similar;
# the data sampler ensures each GPU sees different data
```

| Configuration | Training Time | Speedup | Efficiency |
|---|---|---|---|
| 1× RTX 5000 | 90 min | 1.0x | 100% |
| 2× RTX 5000 | 48 min | 1.88x | 94% |
| 4× RTX 5000 | 28 min | 3.21x | 80% |
| 8× RTX 5000 | 18 min | 5.00x | 62% |
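The `local_rank` in the DDP snippet above comes from the launcher: starting the script with `torchrun --nproc_per_node=4 train.py` (the script name is illustrative) spawns one process per GPU and sets rank variables in each process's environment. A minimal sketch of how a worker discovers its device:

```python
import os
import torch

# torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for every worker it spawns;
# the defaults below let the same script also run single-process for debugging
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
print(f"rank {local_rank} of {world_size} on {device}")
```

Pair this with a `torch.utils.data.distributed.DistributedSampler` in the DataLoader so each rank trains on a disjoint shard of the data.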
Practical Training Tips
Based on extensive experimentation, here are practical recommendations for efficient AMNL training.
Hyperparameter Tuning Strategy
- Start with defaults: Our published hyperparameters work well across all datasets
- Quick iterations first: Run 20-epoch tests to validate changes
- Focus on high-impact params: Learning rate and batch size matter most
- Use learning rate finder: Automated LR range test saves time
```python
# Learning rate finder
from torch_lr_finder import LRFinder

lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader, end_lr=1, num_iter=100)
lr_finder.plot()  # Find the steepest descent point

# Typical result for AMNL: lr ≈ 1e-3 to 3e-3
```

Checkpointing Strategy
```python
# Save checkpoints efficiently
def save_checkpoint(model, optimizer, epoch, best_loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'best_loss': best_loss,
    }, path)

# Save every N epochs + best model
if epoch % 10 == 0:
    save_checkpoint(model, optimizer, epoch, loss, f'checkpoint_{epoch}.pt')
if loss < best_loss:
    save_checkpoint(model, optimizer, epoch, loss, 'best_model.pt')
```

Efficient Experimentation
| Practice | Benefit | Time Saved |
|---|---|---|
| Use validation subset | Faster validation | 10-15% |
| Cache preprocessed data | Eliminate preprocessing | 5-10% |
| Profile before optimizing | Target real bottlenecks | Variable |
| Use TensorBoard | Catch issues early | Prevent wasted runs |
| Version control configs | Reproducible experiments | Debug time |
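Caching preprocessed data, for instance, can be as simple as serializing the windowed tensors once and reloading them on later runs. A sketch under the assumption that preprocessing produces a single tensor of sliding windows (the shapes and cache filename are illustrative):

```python
import os
import tempfile
import torch

def load_windows(cache_path):
    """Load preprocessed training windows, building the cache on first use."""
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    # Stand-in for the real sliding-window preprocessing over sensor data
    windows = torch.randn(1000, 30, 14)  # (samples, window, features)
    torch.save(windows, cache_path)
    return windows

cache = os.path.join(tempfile.gettempdir(), "amnl_windows.pt")
first = load_windows(cache)   # preprocesses and saves
second = load_windows(cache)  # loads from disk, skipping preprocessing
```

Invalidate the cache (delete the file) whenever the preprocessing parameters change, or encode those parameters in the filename.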
Training Schedule Recommendations
| Phase | Epochs | Learning Rate | Purpose |
|---|---|---|---|
| Warmup | 1-5 | 0 → base_lr | Stable initialization |
| Main training | 5-60 | base_lr | Learn patterns |
| Cosine annealing | 60-80 | Decaying | Fine-tune |
| Final convergence | 80-100 | Very low | Polish |
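The schedule above maps onto standard PyTorch schedulers chained with `SequentialLR`. A hedged sketch using the table's phase boundaries and a base LR of 1e-3 (these are illustrative values, not verified AMNL defaults):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import (
    LinearLR, ConstantLR, CosineAnnealingLR, SequentialLR,
)

model = nn.Linear(14, 1)  # placeholder for the AMNL model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # base_lr

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=5),   # warmup
        ConstantLR(optimizer, factor=1.0, total_iters=55),      # main training
        CosineAnnealingLR(optimizer, T_max=40, eta_min=1e-5),   # anneal + polish
    ],
    milestones=[5, 60],  # switch schedulers at epochs 5 and 60
)

lrs = []
for epoch in range(100):
    optimizer.step()      # one epoch of training would go here
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

The recorded `lrs` ramp up over the first 5 epochs, hold at the base LR through epoch 60, then decay along a cosine curve toward `eta_min` for the final phase.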
Early Convergence
Most AMNL models converge to near-optimal performance by epoch 60-70. The final 30 epochs provide marginal improvements (~0.5-1% RMSE). For rapid experimentation, early stopping at epoch 70 is acceptable.
Summary
Training Time Optimization - Summary:
- Baseline training: ~90 minutes per dataset on RTX 5000
- Mixed precision: 1.4× speedup with minimal code changes
- All optimizations: 2.25× speedup (90 min → 40 min)
- Multi-GPU scaling: Near-linear up to 4 GPUs
- Early stopping: Training typically converges by epoch 70
| Scenario | Hardware | Optimizations | Time per Dataset |
|---|---|---|---|
| Research (quick) | 1× RTX 3060 | All + early stop | ~60 min |
| Research (full) | 1× RTX 5000 | Mixed precision | ~65 min |
| Production | 1× A100 | All optimizations | ~25 min |
| Large-scale | 4× A100 | Distributed + all | ~10 min |
Key Insight: AMNL's training efficiency enables rapid iteration during research and cost-effective retraining in production. With optimizations, a complete training run takes under an hour on commodity hardware, making predictive maintenance model development accessible without expensive compute infrastructure. For production deployments requiring frequent retraining, the 10-minute training time on multi-GPU systems enables continuous model updates.
This concludes our analysis of computational efficiency. AMNL achieves an excellent balance of performance and efficiency: state-of-the-art accuracy with only 3.5M parameters, 31K samples/second inference, under 500 MB memory, and 90-minute training time. This efficiency enables deployment across a wide range of industrial scenarios.