Learning Objectives
By the end of this section, you will:
- Understand why early stopping prevents overfitting
- Configure patience and minimum delta parameters
- Implement best weight restoration for optimal checkpoints
- Choose appropriate stopping criteria for RUL prediction
- Integrate early stopping with other training enhancements
Why This Matters: Neural networks can memorize training data if trained too long, leading to poor generalization. Early stopping monitors validation performance and halts training when the model stops improving, automatically finding the sweet spot between underfitting and overfitting.
Why Early Stopping?
Early stopping is one of the most effective regularization techniques for deep learning.
The Overfitting Problem
Training and validation loss typically follow different trajectories:
```
Training vs. Validation Loss Over Time

Loss
 |\
 |  \             Training Loss
 |   \_____________________________
 |
 |  \
 |   \___                 _________
 |       \______    _____/
 |              \__/     Validation Loss
 |               ^       (starts rising!)
 |               |
 |        Early stopping point
 +---------------+------------------- Epochs
                 |
            Optimal Stop
```

- Before this point: underfitting (both losses high)
- At this point: optimal generalization
- After this point: overfitting (validation loss rises)

When to Stop
The goal is to stop training when validation performance stops improving. This requires:
- Monitoring validation loss (or another metric like RMSE)
- Detecting when improvement stops for a sustained period
- Saving the best weights encountered during training
- Restoring best weights when training ends
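These requirements can be sketched as a small, framework-agnostic stopping check. The validation losses below are made up purely for illustration:

```python
# Minimal sketch of the stopping logic described above.
# The validation losses are hypothetical, for illustration only.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.59, 0.61, 0.62, 0.63, 0.64]

patience = 3                      # epochs to wait for improvement
best_loss = float('inf')
best_epoch = -1
counter = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:          # improvement: remember this checkpoint
        best_loss = loss
        best_epoch = epoch
        counter = 0               # reset the patience counter
    else:
        counter += 1              # no improvement this epoch
        if counter >= patience:   # patience exhausted: stop
            print(f"Stop at epoch {epoch}, restore epoch {best_epoch}")
            break
```

Here the best loss occurs at epoch 4; three consecutive epochs without improvement then exhaust the patience budget, so training stops at epoch 7 and the epoch-4 checkpoint would be restored.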
Early Stopping vs. Other Regularization
| Technique | How It Works | Computational Cost |
|---|---|---|
| Early stopping | Stop before overfitting | Saves time (shorter training) |
| Weight decay | Penalize large weights | Minimal overhead |
| Dropout | Random neuron masking | Slight overhead |
| Data augmentation | Increase data diversity | Data processing cost |
Early Stopping Saves Time
Unlike other regularization techniques that add cost, early stopping actually reduces training time by ending training early. It is the only regularization technique that speeds up training.
Patience and Minimum Delta
Two key parameters control early stopping behavior.
Patience
Patience is the number of epochs to wait for improvement before stopping:
| Patience | Behavior | Use Case |
|---|---|---|
| 10 | Aggressive stopping | Quick experiments, small datasets |
| 20-30 | Moderate | Standard training |
| 50-80 | Conservative | Long training, complex models |
| 100+ | Very conservative | Large models, slow convergence |
Minimum Delta (min_delta)
Minimum delta defines the threshold for what counts as "improvement":
Setting min_delta > 0 prevents stopping on tiny, meaningless improvements:
| min_delta | Effect | Recommendation |
|---|---|---|
| 0 | Any improvement counts | May stop prematurely on noise |
| 0.0001 | Ignore tiny improvements | Standard choice |
| 0.001 | Only significant improvements | Conservative |
| 0.01 | Only major improvements | Very conservative |
AMNL Settings
For RUL prediction with AMNL, we use patience = 80 and min_delta = 0.0001. This conservative setting allows the model time to escape plateaus while ignoring noise.
Restoring Best Weights
The key insight: training often continues past the optimal point before stopping.
Why Restore Best Weights?
```
Validation Loss During Training

Loss
 |      Best weights
 |           |
 |           v            Training stops here
 | \                ____  (after patience runs out),
 |  \          ____/      but these weights are
 |   \        /           NOT the best!
 |    \______/
 |           ^
 |           |  Patience period starts
 |              (20 epochs of no improvement)
 +-----------+---------+-------------- Epochs
        Best epoch  Stop epoch
```

- Without restore: use the weights at the stop epoch (suboptimal)
- With restore: use the weights at the best epoch (optimal)

Implementation Strategy
At each epoch, if validation performance improves:
- Save a deep copy of model.state_dict()
- Update best_loss to current loss
- Reset patience counter to 0
When training stops:
- Load the saved best weights back into the model
- Use these restored weights for evaluation and deployment
Deep Copy Required
You must use copy.deepcopy(model.state_dict()) when saving best weights. A shallow copy would be a reference to the same tensor objects, which continue to be modified during training.
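The aliasing problem can be demonstrated with a plain dict standing in for a state_dict. Real state_dicts map parameter names to tensors rather than lists, but the reference semantics are the same:

```python
import copy

# A plain dict of lists standing in for model.state_dict();
# real state_dicts hold tensors, but aliasing behaves the same way.
state = {'layer.weight': [0.5, -0.3]}

reference = state                # shallow: same underlying objects
snapshot = copy.deepcopy(state)  # deep: fully independent copy

state['layer.weight'][0] = 99.0  # training keeps mutating the weights

print(reference['layer.weight'][0])  # 99.0: the "saved" reference moved too
print(snapshot['layer.weight'][0])   # 0.5: the deep copy is safe
```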
Implementation
Our research implementation provides a clean, efficient early stopping class with best weight restoration.
AMNL Research Implementation
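The exact research code is not reproduced here; the following is a minimal sketch consistent with how `EarlyStopping` is used in the training loop that follows (same `patience`, `min_delta`, and `restore_best_weights` arguments, and the same `best_loss` and `counter` attributes). It relies only on `state_dict()`/`load_state_dict()`, so it works with any PyTorch `nn.Module`:

```python
import copy


class EarlyStopping:
    """Stop training when validation loss stops improving.

    Minimal sketch: tracks the best validation loss, counts epochs
    without meaningful improvement, and restores the best weights
    when training stops.
    """

    def __init__(self, patience: int = 80, min_delta: float = 0.0001,
                 restore_best_weights: bool = True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss: float, model) -> bool:
        """Return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best_weights:
                # Deep copy: a plain reference would keep changing
                # as training continues to update the weights.
                self.best_weights = copy.deepcopy(model.state_dict())
        else:
            self.counter += 1
            if self.counter >= self.patience:
                if self.restore_best_weights and self.best_weights is not None:
                    model.load_state_dict(self.best_weights)
                return True
        return False
```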
Integration with Training Loop
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_with_early_stopping(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    epochs: int,
    patience: int = 80,
    min_delta: float = 0.0001,
    device: torch.device = torch.device('cuda')
) -> dict:
    """
    Training loop with early stopping.

    Returns:
        Dictionary with training history and best epoch
    """
    early_stopping = EarlyStopping(
        patience=patience,
        min_delta=min_delta,
        restore_best_weights=True
    )

    history = {
        'train_loss': [],
        'val_loss': [],
        'best_epoch': -1
    }

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for batch in train_loader:
            x, y = batch
            x, y = x.to(device), y.to(device)

            optimizer.zero_grad()
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        avg_train_loss = train_loss / len(train_loader)

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                x, y = batch
                x, y = x.to(device), y.to(device)
                pred = model(x)
                val_loss += criterion(pred, y).item()

        avg_val_loss = val_loss / len(val_loader)

        # Record history
        history['train_loss'].append(avg_train_loss)
        history['val_loss'].append(avg_val_loss)

        # Track best epoch (checked before the early stopping call
        # updates best_loss below)
        if avg_val_loss < early_stopping.best_loss:
            history['best_epoch'] = epoch

        # Early stopping check
        if early_stopping(avg_val_loss, model):
            print(f"Early stopping triggered at epoch {epoch + 1}")
            print(f"Best epoch was {history['best_epoch'] + 1} "
                  f"with val_loss: {early_stopping.best_loss:.4f}")
            break

        # Progress logging
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}/{epochs} | "
                  f"Train: {avg_train_loss:.4f} | "
                  f"Val: {avg_val_loss:.4f} | "
                  f"Patience: {early_stopping.counter}/{patience}")

    return history
```

Using RMSE Instead of Loss
For RUL prediction, we typically monitor RMSE rather than raw loss:
```python
import copy

import torch.nn as nn


class EarlyStoppingRMSE:
    """
    Early stopping based on validation RMSE.

    For RUL prediction, RMSE is more interpretable than raw loss.
    """

    def __init__(
        self,
        patience: int = 80,
        min_delta: float = 0.01,  # RMSE units (e.g., 0.01 cycles)
        restore_best_weights: bool = True
    ):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights

        self.best_rmse = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_rmse: float, model: nn.Module) -> bool:
        """
        Check if training should stop based on RMSE.

        Args:
            val_rmse: Current validation RMSE
            model: Model to potentially save/restore weights

        Returns:
            True if training should stop, False otherwise
        """
        if val_rmse < self.best_rmse - self.min_delta:
            self.best_rmse = val_rmse
            self.counter = 0

            if self.restore_best_weights:
                self.best_weights = copy.deepcopy(model.state_dict())
        else:
            self.counter += 1

            if self.counter >= self.patience:
                if self.restore_best_weights and self.best_weights is not None:
                    model.load_state_dict(self.best_weights)
                return True

        return False
```

Summary
In this section, we covered early stopping with best weights:
- Purpose: Prevent overfitting by stopping when validation performance degrades
- Patience: Number of epochs to wait before stopping (80 for AMNL)
- min_delta: Threshold for meaningful improvement (0.0001)
- Best weights: Deep copy of weights at best validation performance
- Restoration: Load best weights when training ends
| Parameter | Value |
|---|---|
| Patience | 80 epochs |
| min_delta | 0.0001 |
| Restore best weights | Yes |
| Monitoring metric | Validation RMSE |
Looking Ahead: Early stopping tells us when to stop. The next section covers mixed precision training, a technique that uses 16-bit floats to speed up training by 2-3× while maintaining accuracy.
With early stopping configured, we explore mixed precision training.