Chapter 13

Training Enhancements

Reproducibility: Setting Random Seeds

Learning Objectives

By the end of this section, you will:

  1. Understand why reproducibility is crucial for research and production
  2. Identify all sources of randomness in deep learning training
  3. Set seeds correctly for Python, NumPy, and PyTorch
  4. Configure deterministic CUDA operations
  5. Know the trade-offs between reproducibility and performance
Why This Matters: Without proper seeding, running the same code twice produces different results. This makes debugging impossible, invalidates comparisons between experiments, and undermines scientific claims. A reproducible training pipeline is essential for trustworthy research and reliable production deployments.

Why Reproducibility Matters

Reproducibility is a cornerstone of both scientific research and production machine learning.

Research Requirements

  • Ablation studies: Compare model variants fairly
  • Peer review: Others must replicate your results
  • Debugging: Reproduce errors to fix them
  • Baselines: Compare against previous work accurately

Production Requirements

  • Testing: Unit tests need deterministic outputs
  • Validation: Verify model before deployment
  • Auditing: Demonstrate consistent behavior
  • Rollback: Recreate previous model versions

The Reproducibility Crisis

Many published deep learning results cannot be reproduced. Common causes:

| Issue | Impact |
|---|---|
| Random seeds not set | Different runs give different results |
| Seeds not reported | Cannot replicate the exact experiment |
| CUDA non-determinism | Results vary between GPUs |
| Library versions differ | Different implementations |
| Data preprocessing varies | Different input data |

AMNL Reproducibility

All experiments in this book use seed = 42 for reproducibility. We report the full configuration and library versions so our state-of-the-art results can be replicated exactly.


Sources of Randomness

Deep learning training has multiple sources of randomness that must all be controlled.

Python Random Module

The Python random module is used for data shuffling and random sampling:

  • random.shuffle() for shuffling lists
  • random.sample() for random sampling
  • random.choice() for random selection
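A quick standalone illustration (not part of the training code): re-seeding the module replays the exact same sequence of random decisions.

```python
import random

# First run: seed, then shuffle
random.seed(42)
a = list(range(10))
random.shuffle(a)

# Second run: identical seed produces an identical shuffle
random.seed(42)
b = list(range(10))
random.shuffle(b)

assert a == b  # same order both times
```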

NumPy Random

NumPy is used extensively for numerical operations:

  • np.random.shuffle() for array shuffling
  • np.random.randn() for random initialization
  • Train/validation splits
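A standalone sketch showing that NumPy's global generator is likewise deterministic once seeded:

```python
import numpy as np

# Identical seeds yield bitwise-identical samples
np.random.seed(42)
a = np.random.randn(5)

np.random.seed(42)
b = np.random.randn(5)

assert np.array_equal(a, b)
```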

PyTorch Random

PyTorch has its own random number generator for:

  • Weight initialization (Xavier, Kaiming, etc.)
  • Dropout masks
  • DataLoader shuffling
  • Data augmentation
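A standalone sketch showing that PyTorch's CPU generator behaves the same way once torch.manual_seed is called (weight initialization and dropout masks draw from this generator):

```python
import torch

# Identical seeds yield identical tensors
torch.manual_seed(42)
a = torch.randn(5)

torch.manual_seed(42)
b = torch.randn(5)

assert torch.equal(a, b)
```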

CUDA Random

GPU operations have additional randomness sources:

  • cuDNN algorithm selection
  • Atomics for parallel operations
  • Multi-stream execution order

| Source | Examples | Control Method |
|---|---|---|
| Python random | Data splitting, shuffling | random.seed() |
| NumPy | Initialization, preprocessing | np.random.seed() |
| PyTorch CPU | Weight init, dropout | torch.manual_seed() |
| PyTorch CUDA | GPU operations | torch.cuda.manual_seed_all() |
| cuDNN | Algorithm selection | cudnn.deterministic = True |

Controlling Randomness

Each source of randomness requires specific configuration.

Setting All Seeds

The complete seed-setting function must address every source:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """
    Set random seeds for reproducibility.

    Controls all sources of randomness in typical PyTorch training:
    - Python's random module
    - NumPy's random generator
    - PyTorch's CPU random generator
    - PyTorch's CUDA random generator (all GPUs)
    - cuDNN deterministic mode

    Args:
        seed: Integer seed value (default: 42)
    """
    # Python random module
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # PyTorch CPU
    torch.manual_seed(seed)

    # PyTorch CUDA (all GPUs)
    torch.cuda.manual_seed_all(seed)

    # cuDNN deterministic mode
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    print(f"Set random seed to {seed} for reproducibility")
```

cuDNN Configuration

cuDNN has two important flags:

| Flag | Default | For Reproducibility |
|---|---|---|
| cudnn.deterministic | False | True |
| cudnn.benchmark | False | False (keep as is) |

DataLoader Worker Seeds

When using multiple DataLoader workers, each worker needs its own seed:

```python
def seed_worker(worker_id: int):
    """
    Seed function for DataLoader workers.

    Each worker gets a deterministic seed based on the
    initial seed and its worker ID.
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Create DataLoader with worker seeding
train_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # Seed each worker
    generator=torch.Generator().manual_seed(42)  # Seed shuffling
)
```

Implementation

Complete reproducibility setup for AMNL training.

Full Reproducibility Configuration

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def configure_reproducibility(seed: int = 42):
    """
    Configure complete reproducibility for training.

    Sets all random seeds and configures deterministic behavior.
    Should be called at the very start of the training script.

    Args:
        seed: Random seed for all generators

    Returns:
        Generator for DataLoader shuffling
    """
    # Set all seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Configure cuDNN for determinism
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Create generator for DataLoader
    generator = torch.Generator()
    generator.manual_seed(seed)

    print(f"Reproducibility configured with seed={seed}")
    print("  - Python random: seeded")
    print("  - NumPy random: seeded")
    print("  - PyTorch manual_seed: seeded")
    print("  - CUDA manual_seed_all: seeded")
    print("  - cuDNN deterministic: True")
    print("  - cuDNN benchmark: False")

    return generator


def seed_worker(worker_id: int):
    """Deterministic seeding for DataLoader workers."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def create_reproducible_dataloader(
    dataset,
    batch_size: int,
    shuffle: bool,
    num_workers: int,
    generator: torch.Generator
) -> DataLoader:
    """
    Create a fully reproducible DataLoader.

    Args:
        dataset: PyTorch dataset
        batch_size: Batch size
        shuffle: Whether to shuffle
        num_workers: Number of worker processes
        generator: Seeded random generator

    Returns:
        Reproducible DataLoader
    """
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
        pin_memory=True,
        drop_last=False
    )
```

Usage in Training Script

```python
# At the very start of the training script
SEED = 42
generator = configure_reproducibility(SEED)

# Create reproducible data loaders
train_loader = create_reproducible_dataloader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    generator=generator
)

test_loader = create_reproducible_dataloader(
    test_dataset,
    batch_size=256,
    shuffle=False,
    num_workers=4,
    generator=generator
)

# Initialize model (weight init uses PyTorch's seeded RNG)
model = DualTaskEnhancedModel(
    input_size=15,
    hidden_size=256,
    dropout=0.3
)

# Rest of training...
```

Verifying Reproducibility

```python
def verify_reproducibility(seed: int = 42, n_runs: int = 3):
    """
    Verify that training is reproducible.

    Runs training multiple times and checks that results match.
    """
    results = []

    for run in range(n_runs):
        # Reset all seeds before each run
        configure_reproducibility(seed)

        # Run training (abbreviated)
        model = create_model()
        train(model)
        val_loss = evaluate(model)

        results.append(val_loss)
        print(f"Run {run + 1}: val_loss = {val_loss:.6f}")

    # Check that all results match exactly
    if len(set(results)) == 1:
        print(f"✓ Reproducibility verified: all {n_runs} runs match")
    else:
        print("✗ Reproducibility failed: results differ")
        print(f"  Results: {results}")
```

Remaining Non-Determinism

Even with all seeds set, some PyTorch operations remain non-deterministic on GPU (e.g., operations that rely on atomicAdd, where floating-point accumulation order varies between runs). For complete determinism, use torch.use_deterministic_algorithms(True), but this may disable some operations or reduce performance significantly.
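A minimal sketch of opting in, following PyTorch's documented API: warn_only=True downgrades the error for non-deterministic ops to a warning, and the CUBLAS_WORKSPACE_CONFIG environment variable is required for deterministic cuBLAS on CUDA 10.2+.

```python
import os
import torch

# cuBLAS needs this env var for determinism on CUDA >= 10.2;
# it must be set before any CUDA work happens
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Warn (instead of erroring) whenever an op has no
# deterministic implementation
torch.use_deterministic_algorithms(True, warn_only=True)
```

With warn_only omitted (or False), PyTorch raises a RuntimeError at the first non-deterministic op, which is useful for auditing a training pipeline.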


Summary

In this section, we covered reproducibility:

  1. Multiple sources: Python, NumPy, PyTorch CPU/CUDA, cuDNN
  2. All must be seeded: Missing any one breaks reproducibility
  3. cuDNN flags: deterministic=True, benchmark=False
  4. DataLoader workers: Each needs its own seed
  5. Trade-off: Determinism may cost 10-20% performance

| Configuration | Value |
|---|---|
| Seed | 42 |
| random.seed() | 42 |
| np.random.seed() | 42 |
| torch.manual_seed() | 42 |
| torch.cuda.manual_seed_all() | 42 |
| cudnn.deterministic | True |
| cudnn.benchmark | False |

Chapter Complete: You now have a complete training enhancement toolkit: EMA for weight smoothing, early stopping to prevent overfitting, mixed precision for speed, gradient accumulation for larger batches, and reproducibility for reliable experiments. The next chapter brings it all together in the complete training script.

With all enhancements covered, we assemble the complete training pipeline.