Learning Objectives
By the end of this section, you will:
- Understand why reproducibility is crucial for research and production
- Identify all sources of randomness in deep learning training
- Set seeds correctly for Python, NumPy, and PyTorch
- Configure deterministic CUDA operations
- Know the trade-offs between reproducibility and performance
Why This Matters: Without proper seeding, running the same code twice produces different results. This makes debugging impossible, invalidates comparisons between experiments, and undermines scientific claims. A reproducible training pipeline is essential for trustworthy research and reliable production deployments.
Why Reproducibility Matters
Reproducibility is a cornerstone of both scientific research and production machine learning.
Research Requirements
- Ablation studies: Compare model variants fairly
- Peer review: Others must replicate your results
- Debugging: Reproduce errors to fix them
- Baselines: Compare against previous work accurately
Production Requirements
- Testing: Unit tests need deterministic outputs
- Validation: Verify model before deployment
- Auditing: Demonstrate consistent behavior
- Rollback: Recreate previous model versions
The Reproducibility Crisis
Many published deep learning results cannot be reproduced. Common causes:
| Issue | Impact |
|---|---|
| Random seeds not set | Different runs give different results |
| Seeds not reported | Cannot replicate exact experiment |
| CUDA non-determinism | Results vary between GPUs |
| Library versions differ | Different implementations |
| Data preprocessing varies | Different input data |
AMNL Reproducibility
All experiments in this book use seed = 42. We also report the exact configuration and library versions so that every result can be replicated exactly.
Sources of Randomness
Deep learning training has multiple sources of randomness that must all be controlled.
Python Random Module
The Python random module is used for data shuffling and random sampling:
- random.shuffle() for shuffling lists
- random.sample() for random sampling
- random.choice() for random selection
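As a quick sanity check, seeding Python's RNG makes these calls fully deterministic. A minimal sketch (the helper name `shuffled_copy` is ours, for illustration):

```python
import random

def shuffled_copy(seed: int, items: list) -> list:
    """Seed Python's RNG, then shuffle a copy of items."""
    random.seed(seed)
    copy = list(items)
    random.shuffle(copy)
    return copy

# Two calls with the same seed yield the identical order
run_a = shuffled_copy(42, list(range(10)))
run_b = shuffled_copy(42, list(range(10)))
assert run_a == run_b
```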
NumPy Random
NumPy is used extensively for numerical operations:
- np.random.shuffle() for array shuffling
- np.random.randn() for random initialization
- Train/validation splits
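The same holds for NumPy: re-seeding reproduces the same "random" draws bit-for-bit, which is what makes seeded initialization and data splits repeatable:

```python
import numpy as np

# Seed, draw, re-seed, draw again: identical arrays
np.random.seed(42)
weights_a = np.random.randn(3, 3)

np.random.seed(42)
weights_b = np.random.randn(3, 3)

assert np.array_equal(weights_a, weights_b)
```

Note that NumPy's newer `Generator` API (`np.random.default_rng(seed)`) avoids global state and is generally preferred for new code, though `np.random.seed()` is what controls the legacy global generator used in the examples here.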
PyTorch Random
PyTorch has its own random number generator for:
- Weight initialization (Xavier, Kaiming, etc.)
- Dropout masks
- DataLoader shuffling
- Data augmentation
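A small sketch showing that `torch.manual_seed()` governs both weight initialization and dropout masks on CPU:

```python
import torch
import torch.nn.functional as F

# Same seed -> identical initial weights
torch.manual_seed(42)
layer_a = torch.nn.Linear(4, 4)

torch.manual_seed(42)
layer_b = torch.nn.Linear(4, 4)

assert torch.equal(layer_a.weight, layer_b.weight)

# Same seed -> identical dropout masks
torch.manual_seed(42)
mask_a = F.dropout(torch.ones(8), p=0.5)

torch.manual_seed(42)
mask_b = F.dropout(torch.ones(8), p=0.5)

assert torch.equal(mask_a, mask_b)
```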
CUDA Random
GPU operations have additional randomness sources:
- cuDNN algorithm selection
- Atomics for parallel operations
- Multi-stream execution order
| Source | Examples | Control Method |
|---|---|---|
| Python random | Data splitting, shuffling | random.seed() |
| NumPy | Initialization, preprocessing | np.random.seed() |
| PyTorch CPU | Weight init, dropout | torch.manual_seed() |
| PyTorch CUDA | GPU operations | torch.cuda.manual_seed_all() |
| cuDNN | Algorithm selection | cudnn.deterministic = True |
Controlling Randomness
Each source of randomness requires specific configuration.
Setting All Seeds
The complete seed-setting function must address every source:
```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """
    Set random seeds for reproducibility.

    Controls all sources of randomness in typical PyTorch training:
    - Python's random module
    - NumPy's random generator
    - PyTorch's CPU random generator
    - PyTorch's CUDA random generator (all GPUs)
    - cuDNN deterministic mode

    Args:
        seed: Integer seed value (default: 42)
    """
    # Python random module
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # PyTorch CPU
    torch.manual_seed(seed)

    # PyTorch CUDA (all GPUs)
    torch.cuda.manual_seed_all(seed)

    # cuDNN deterministic mode
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    print(f"Set random seed to {seed} for reproducibility")
```

cuDNN Configuration
cuDNN has two important flags:
| Flag | Default | For Reproducibility |
|---|---|---|
| cudnn.deterministic | False | True |
| cudnn.benchmark | False | False (keep as is) |
DataLoader Worker Seeds
When using multiple DataLoader workers, each worker needs its own seed:
```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id: int):
    """
    Seed function for DataLoader workers.

    Each worker gets a deterministic seed derived from the
    DataLoader's base seed; torch.initial_seed() already differs
    per worker, so worker_id is not needed directly.
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Create DataLoader with worker seeding
train_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,                  # Seed each worker
    generator=torch.Generator().manual_seed(42)  # Seed shuffling
)
```

Implementation
Complete reproducibility setup for AMNL training.
Full Reproducibility Configuration
```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def configure_reproducibility(seed: int = 42):
    """
    Configure complete reproducibility for training.

    Sets all random seeds and configures deterministic behavior.
    Should be called at the very start of the training script.

    Args:
        seed: Random seed for all generators

    Returns:
        Generator for DataLoader shuffling
    """
    # Set all seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Configure cuDNN for determinism
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Create generator for DataLoader
    generator = torch.Generator()
    generator.manual_seed(seed)

    print(f"Reproducibility configured with seed={seed}")
    print("  - Python random: seeded")
    print("  - NumPy random: seeded")
    print("  - PyTorch manual_seed: seeded")
    print("  - CUDA manual_seed_all: seeded")
    print("  - cuDNN deterministic: True")
    print("  - cuDNN benchmark: False")

    return generator


def seed_worker(worker_id: int):
    """Deterministic seeding for DataLoader workers."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def create_reproducible_dataloader(
    dataset,
    batch_size: int,
    shuffle: bool,
    num_workers: int,
    generator: torch.Generator
) -> DataLoader:
    """
    Create a fully reproducible DataLoader.

    Args:
        dataset: PyTorch dataset
        batch_size: Batch size
        shuffle: Whether to shuffle
        num_workers: Number of worker processes
        generator: Seeded random generator

    Returns:
        Reproducible DataLoader
    """
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
        pin_memory=True,
        drop_last=False
    )
```

Usage in Training Script
```python
# At the very start of the training script
SEED = 42
generator = configure_reproducibility(SEED)

# Create reproducible data loaders
train_loader = create_reproducible_dataloader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    generator=generator
)

test_loader = create_reproducible_dataloader(
    test_dataset,
    batch_size=256,
    shuffle=False,
    num_workers=4,
    generator=generator
)

# Initialize model (weight init uses PyTorch's seeded RNG)
model = DualTaskEnhancedModel(
    input_size=15,
    hidden_size=256,
    dropout=0.3
)

# Rest of training...
```

Verifying Reproducibility
```python
def verify_reproducibility(seed: int = 42, n_runs: int = 3):
    """
    Verify that training is reproducible.

    Runs training multiple times and checks that results match.
    """
    results = []

    for run in range(n_runs):
        # Reset all seeds
        configure_reproducibility(seed)

        # Run training (abbreviated)
        model = create_model()
        train(model)
        val_loss = evaluate(model)

        results.append(val_loss)
        print(f"Run {run + 1}: val_loss = {val_loss:.6f}")

    # Check all results match exactly (bitwise reproducibility)
    if len(set(results)) == 1:
        print(f"✓ Reproducibility verified: all {n_runs} runs match")
    else:
        print("✗ Reproducibility failed: results differ")
        print(f"  Results: {results}")
```

Remaining Non-Determinism
Even with all seeds set, some PyTorch operations remain non-deterministic on GPU (e.g., atomicAdd). For complete determinism, use torch.use_deterministic_algorithms(True), but this may disable some operations or reduce performance significantly.
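A minimal sketch of opting into full determinism. The `CUBLAS_WORKSPACE_CONFIG` environment variable is required by cuBLAS on CUDA 10.2+ and must be set before any CUDA work happens; `warn_only=True` (available in PyTorch 1.11+) reports non-deterministic ops as warnings instead of raising errors:

```python
import os
import torch

# Required for deterministic cuBLAS on CUDA >= 10.2;
# set before CUDA is initialized
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Flag every op that lacks a deterministic implementation;
# drop warn_only=True to make such ops raise instead of warn
torch.use_deterministic_algorithms(True, warn_only=True)

print(torch.are_deterministic_algorithms_enabled())  # True
```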
Summary
In this section, we covered reproducibility:
- Multiple sources: Python, NumPy, PyTorch CPU/CUDA, cuDNN
- All must be seeded: Missing any one breaks reproducibility
- cuDNN flags: deterministic=True, benchmark=False
- DataLoader workers: Each needs its own seed
- Trade-off: Determinism may cost 10-20% performance
| Configuration | Value |
|---|---|
| Seed | 42 |
| random.seed() | 42 |
| np.random.seed() | 42 |
| torch.manual_seed() | 42 |
| torch.cuda.manual_seed_all() | 42 |
| cudnn.deterministic | True |
| cudnn.benchmark | False |
Chapter Complete: You now have a complete training enhancement toolkit: EMA for weight smoothing, early stopping to prevent overfitting, mixed precision for speed, gradient accumulation for larger batches, and reproducibility for reliable experiments. The next chapter brings it all together in the complete training script.
With all enhancements covered, we assemble the complete training pipeline.