Learning Objectives
By the end of this section, you will:
- Understand why reproducibility is crucial for research and production
- Identify all sources of randomness in deep learning training
- Set seeds correctly for Python, NumPy, and PyTorch
- Configure deterministic CUDA operations
- Know the trade-offs between reproducibility and performance
Why This Matters: Without proper seeding, running the same code twice produces different results. This makes debugging impossible, invalidates comparisons between experiments, and undermines scientific claims. A reproducible training pipeline is essential for trustworthy research and reliable production deployments.
Why Reproducibility Matters
Reproducibility is a cornerstone of both scientific research and production machine learning.
Research Requirements
- Ablation studies: Compare model variants fairly
- Peer review: Others must replicate your results
- Debugging: Reproduce errors to fix them
- Baselines: Compare against previous work accurately
Production Requirements
- Testing: Unit tests need deterministic outputs
- Validation: Verify model before deployment
- Auditing: Demonstrate consistent behavior
- Rollback: Recreate previous model versions
The Reproducibility Crisis
Many published deep learning results cannot be reproduced. Common causes:
| Issue | Impact |
|---|---|
| Random seeds not set | Different runs give different results |
| Seeds not reported | Cannot replicate exact experiment |
| CUDA non-determinism | Results vary between GPUs |
| Library versions differ | Different implementations |
| Data preprocessing varies | Different input data |
AMNL Reproducibility
All experiments in this book use seed = 42. We also report the exact configuration and library versions so that every result can be replicated exactly.
Sources of Randomness
Deep learning training has multiple sources of randomness that must all be controlled.
Python Random Module
The Python random module is used for data shuffling and random sampling:
- random.shuffle() for shuffling lists
- random.sample() for random sampling
- random.choice() for random selection
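As a quick sanity check, seeding Python's RNG makes these calls fully deterministic. A minimal sketch (the helper name `shuffled_copy` is ours, for illustration):

```python
import random

def shuffled_copy(seed: int, items: list) -> list:
    """Seed Python's RNG, then shuffle a copy of items."""
    random.seed(seed)
    copy = list(items)
    random.shuffle(copy)
    return copy

# Two calls with the same seed yield the identical order
run_a = shuffled_copy(42, list(range(10)))
run_b = shuffled_copy(42, list(range(10)))
assert run_a == run_b
```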
NumPy Random
NumPy is used extensively for numerical operations:
- np.random.shuffle() for array shuffling
- np.random.randn() for random initialization
- Train/validation splits
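The same holds for NumPy: re-seeding reproduces the same "random" draws bit-for-bit, which is what makes seeded initialization and data splits repeatable:

```python
import numpy as np

# Seed, draw, re-seed, draw again: identical arrays
np.random.seed(42)
weights_a = np.random.randn(3, 3)

np.random.seed(42)
weights_b = np.random.randn(3, 3)

assert np.array_equal(weights_a, weights_b)
```

Note that NumPy's newer `Generator` API (`np.random.default_rng(seed)`) avoids global state and is generally preferred for new code, though `np.random.seed()` is what controls the legacy global generator used in the examples here.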
PyTorch Random
PyTorch has its own random number generator for:
- Weight initialization (Xavier, Kaiming, etc.)
- Dropout masks
- DataLoader shuffling
- Data augmentation
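A small sketch showing that `torch.manual_seed()` governs both weight initialization and dropout masks on CPU:

```python
import torch
import torch.nn.functional as F

# Same seed -> identical initial weights
torch.manual_seed(42)
layer_a = torch.nn.Linear(4, 4)

torch.manual_seed(42)
layer_b = torch.nn.Linear(4, 4)

assert torch.equal(layer_a.weight, layer_b.weight)

# Same seed -> identical dropout masks
torch.manual_seed(42)
mask_a = F.dropout(torch.ones(8), p=0.5)

torch.manual_seed(42)
mask_b = F.dropout(torch.ones(8), p=0.5)

assert torch.equal(mask_a, mask_b)
```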
CUDA Random
GPU operations have additional randomness sources:
- cuDNN algorithm selection
- Atomics for parallel operations
- Multi-stream execution order
| Source | Examples | Control Method |
|---|---|---|
| Python random | Data splitting, shuffling | random.seed() |
| NumPy | Initialization, preprocessing | np.random.seed() |
| PyTorch CPU | Weight init, dropout | torch.manual_seed() |
| PyTorch CUDA | GPU operations | torch.cuda.manual_seed_all() |
| cuDNN | Algorithm selection | cudnn.deterministic = True |
Controlling Randomness
Each source of randomness requires specific configuration.
Setting All Seeds
The complete seed-setting function must address every source:
```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """
    Set random seeds for reproducibility.

    Controls all sources of randomness in typical PyTorch training:
    - Python's random module
    - NumPy's random generator
    - PyTorch's CPU random generator
    - PyTorch's CUDA random generator (all GPUs)
    - cuDNN deterministic mode

    Args:
        seed: Integer seed value (default: 42)
    """
    # Python random module
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # PyTorch CPU
    torch.manual_seed(seed)

    # PyTorch CUDA (all GPUs)
    torch.cuda.manual_seed_all(seed)

    # cuDNN deterministic mode
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    print(f"Set random seed to {seed} for reproducibility")
```

cuDNN Configuration
cuDNN has two important flags:
| Flag | Default | For Reproducibility |
|---|---|---|
| cudnn.deterministic | False | True |
| cudnn.benchmark | False | False (keep as is) |
DataLoader Worker Seeds
When using multiple DataLoader workers, each worker needs its own seed:
```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id: int):
    """
    Seed function for DataLoader workers.

    Each worker gets a deterministic seed derived from the
    DataLoader's base seed; torch.initial_seed() already differs
    per worker, so worker_id is not needed directly.
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Create DataLoader with worker seeding
train_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,                  # Seed each worker
    generator=torch.Generator().manual_seed(42)  # Seed shuffling
)
```

Implementation
Complete reproducibility setup for AMNL training.
Full Reproducibility Configuration
```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def configure_reproducibility(seed: int = 42):
    """
    Configure complete reproducibility for training.

    Sets all random seeds and configures deterministic behavior.
    Should be called at the very start of the training script.

    Args:
        seed: Random seed for all generators

    Returns:
        Generator for DataLoader shuffling
    """
    # Set all seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Configure cuDNN for determinism
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Create generator for DataLoader
    generator = torch.Generator()
    generator.manual_seed(seed)

    print(f"Reproducibility configured with seed={seed}")
    print("  - Python random: seeded")
    print("  - NumPy random: seeded")
    print("  - PyTorch manual_seed: seeded")
    print("  - CUDA manual_seed_all: seeded")
    print("  - cuDNN deterministic: True")
    print("  - cuDNN benchmark: False")

    return generator


def seed_worker(worker_id: int):
    """Deterministic seeding for DataLoader workers."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def create_reproducible_dataloader(
    dataset,
    batch_size: int,
    shuffle: bool,
    num_workers: int,
    generator: torch.Generator
) -> DataLoader:
    """
    Create a fully reproducible DataLoader.

    Args:
        dataset: PyTorch dataset
        batch_size: Batch size
        shuffle: Whether to shuffle
        num_workers: Number of worker processes
        generator: Seeded random generator

    Returns:
        Reproducible DataLoader
    """
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
        pin_memory=True,
        drop_last=False
    )
```

Usage in Training Script
```python
# At the very start of the training script
SEED = 42
generator = configure_reproducibility(SEED)

# Create reproducible data loaders
train_loader = create_reproducible_dataloader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    generator=generator
)

test_loader = create_reproducible_dataloader(
    test_dataset,
    batch_size=256,
    shuffle=False,
    num_workers=4,
    generator=generator
)

# Initialize model (weight init uses PyTorch's seeded RNG)
model = DualTaskEnhancedModel(
    input_size=15,
    hidden_size=256,
    dropout=0.3
)

# Rest of training...
```

Verifying Reproducibility
```python
def verify_reproducibility(seed: int = 42, n_runs: int = 3):
    """
    Verify that training is reproducible.

    Runs training multiple times and checks that results match.
    """
    results = []

    for run in range(n_runs):
        # Reset all seeds
        configure_reproducibility(seed)

        # Run training (abbreviated)
        model = create_model()
        train(model)
        val_loss = evaluate(model)

        results.append(val_loss)
        print(f"Run {run + 1}: val_loss = {val_loss:.6f}")

    # Check all results match exactly (bitwise reproducibility)
    if len(set(results)) == 1:
        print(f"✓ Reproducibility verified: all {n_runs} runs match")
    else:
        print("✗ Reproducibility failed: results differ")
        print(f"  Results: {results}")
```

Remaining Non-Determinism
Even with all seeds set, some PyTorch operations remain non-deterministic on GPU (e.g., atomicAdd). For complete determinism, use torch.use_deterministic_algorithms(True), but this may disable some operations or reduce performance significantly.
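A minimal sketch of opting into full determinism. The `CUBLAS_WORKSPACE_CONFIG` environment variable is required by cuBLAS on CUDA 10.2+ and must be set before any CUDA work happens; `warn_only=True` (available in PyTorch 1.11+) reports non-deterministic ops as warnings instead of raising errors:

```python
import os
import torch

# Required for deterministic cuBLAS on CUDA >= 10.2;
# set before CUDA is initialized
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Flag every op that lacks a deterministic implementation;
# drop warn_only=True to make such ops raise instead of warn
torch.use_deterministic_algorithms(True, warn_only=True)

print(torch.are_deterministic_algorithms_enabled())  # True
```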
Summary
In this section, we covered reproducibility:
- Multiple sources: Python, NumPy, PyTorch CPU/CUDA, cuDNN
- All must be seeded: Missing any one breaks reproducibility
- cuDNN flags: deterministic=True, benchmark=False
- DataLoader workers: Each needs its own seed
- Trade-off: Determinism may cost 10-20% performance
| Configuration | Value |
|---|---|
| Seed | 42 |
| random.seed() | 42 |
| np.random.seed() | 42 |
| torch.manual_seed() | 42 |
| torch.cuda.manual_seed_all() | 42 |
| cudnn.deterministic | True |
| cudnn.benchmark | False |
Chapter Complete: You now have a complete training enhancement toolkit: EMA for weight smoothing, early stopping to prevent overfitting, mixed precision for speed, gradient accumulation for larger batches, and reproducibility for reliable experiments. The next chapter brings it all together in the complete training script.
With all enhancements covered, we assemble the complete training pipeline.