A Recipe That Survives The Kitchen Move
The Tartine Bakery cookbook gives precise instructions: 1000 g flour, 700 g water, 20 g salt, 200 g levain, 12-hour autolyse, 4-hour bulk ferment at 26 °C, three folds, 2-hour proof. Follow the recipe in San Francisco and in Singapore and the bread is recognisably the same loaf — the chemistry doesn't depend on which kitchen you stand in. Now imagine the cookbook had said ‘mix some flour with water, ferment until it feels right’. Same ingredients, unrecognisable outcomes.
Reproducible machine learning is the same engineering problem. The paper's 7.72 RMSE on FD002 with 5 seeds is the loaf; the recipe is grace/training/seed_utils.py:set_seed, the environment lockfile, the cuDNN flags, the per-worker DataLoader hooks, and the choice of 5 seeds rather than 1. This section makes every ingredient explicit, shows the three levels of reproducibility you can target, and explains why the paper commits to distribution-stable rather than the impossible bit-exact ideal.
Three Levels Of Reproducibility
Reproducibility is not a single property; it is a hierarchy. Different scientific claims need different levels:
| Level | Tolerance | What is controlled | Used for |
|---|---|---|---|
| Bit-exact | Δ = 0 | Hardware, software, seeds, cuDNN flags, OS — everything | Regression tests, CI, internal sanity checks |
| Seed-stable | Δ ≲ 0.01 RMSE (FP drift) | Seeds, cuDNN flags, major library versions | Re-running the SAME experiment on a different machine |
| Distribution-stable | Mean within ±2·SEM, std within ±20% | Code, hyperparameters, 5 seeds — but NOT init draws or shuffle order | Publishing ‘our method beats X’ claims |
The paper targets distribution-stable. The published FD002 result is 7.72 ± 0.66 RMSE: a 5-seed mean with its standard deviation. Re-running with a fresh set of seeds (say [11, 22, 33, 44, 55]) would give a mean within 2·SEM of 7.72 (SEM = 0.66/√5 ≈ 0.30, so roughly ±0.59), but the per-seed numbers would differ. This is what every published ML result actually claims, even when the paper writes it as a single number.
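The distribution-stable criterion from the table can be written as a short check. This is a minimal sketch, not code from the paper: the function names are hypothetical, the rerun values are invented for illustration, and it assumes the reported ±0.66 is a sample standard deviation over 5 seeds.

```python
import math
import statistics


def summarize(rmse_per_seed):
    """Return (mean, sample std, SEM) for a list of per-seed RMSE values."""
    mean = statistics.mean(rmse_per_seed)
    std = statistics.stdev(rmse_per_seed)  # sample std (n - 1 denominator)
    sem = std / math.sqrt(len(rmse_per_seed))
    return mean, std, sem


def is_distribution_stable(ref_mean, ref_std, n_ref, rerun, std_tol=0.20):
    """Apply the two table criteria: mean within 2*SEM, std within +/-20%."""
    mean, std, _ = summarize(rerun)
    ref_sem = ref_std / math.sqrt(n_ref)
    mean_ok = abs(mean - ref_mean) <= 2 * ref_sem
    std_ok = abs(std - ref_std) <= std_tol * ref_std
    return mean_ok and std_ok


# Hypothetical rerun with fresh seeds; these are NOT numbers from the paper.
rerun_rmse = [7.10, 8.45, 7.95, 6.98, 8.02]
mean, std, sem = summarize(rerun_rmse)
print(f"rerun mean={mean:.2f} std={std:.2f} sem={sem:.2f}")
print("distribution-stable:", is_distribution_stable(7.72, 0.66, 5, rerun_rmse))
```

Note that no per-seed value has to match the published run; only the summary statistics are compared.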
Interactive: Pick A Level, See The Tradeoffs
[Interactive panel: for each level it lists what must be controlled, what remains uncontrolled at that tolerance, and the paper's accept/reject status. The distribution-stable panel shows the actual per-method seed-std on FD002 from cmapss_h256_complete_140.csv, the empirical noise floor that any reproducibility claim must clear.]
Why this matters for reading the paper. A result like ‘GRACE 7.72 RMSE’ without a standard deviation is silently single-seed and not reproducible at any meaningful level. The paper reports 7.72 ± 0.66 precisely because the ±0.66 is what defines ‘reproduced’ for a third party.
Seed Pinning: Five Streams Of Randomness
A typical PyTorch training run draws random numbers from five distinct streams. Each must be pinned independently:
| Stream | API | What it controls | Pinned by |
|---|---|---|---|
| 1. Python builtin | random.* | random.shuffle / random.choice calls in custom samplers, dataset code, and augmentation libraries that use the stdlib RNG | random.seed(seed) |
| 2. NumPy global | np.random.* | Anywhere np.random.* is used (datasets, augmentation, baselines) | np.random.seed(seed) |
| 3. NumPy generator | rng = np.random.default_rng(seed); rng.integers(...) | Modern API — independent state, doesn't read np.random.seed | Pass seed at construction time |
| 4. PyTorch CPU | torch.randn, torch.rand, torch.randperm | CPU tensors and any non-CUDA op, including the default DataLoader shuffle when no generator is passed | torch.manual_seed(seed) |
| 5. PyTorch CUDA | Same APIs but on .cuda() tensors | GPU tensors. Per-device — must seed ALL devices | torch.cuda.manual_seed_all(seed) |
The paper's set_seed at grace/training/seed_utils.py:14-22 pins streams 1, 2, 4, and 5 plus two cuDNN flags. Stream 3 (the modern Generator API) is handled separately wherever it is used; in production the GRACE codebase mostly uses the legacy global NumPy API, which is stream 2.
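A function matching that description could look like the sketch below. It is reconstructed from the prose (streams 1, 2, 4, 5 plus the two cuDNN flags) and is not the contents of grace/training/seed_utils.py; the draw_all helper is hypothetical and exists only to show the effect.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Pin streams 1, 2, 4, 5 and the two cuDNN flags (sketch, not the paper's file)."""
    random.seed(seed)                 # stream 1: Python builtin
    np.random.seed(seed)              # stream 2: NumPy legacy global
    torch.manual_seed(seed)           # stream 4: PyTorch CPU
    torch.cuda.manual_seed_all(seed)  # stream 5: all CUDA devices; ignored without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def draw_all():
    """Hypothetical helper: one draw from each pinned CPU-side stream."""
    return random.random(), float(np.random.rand()), torch.rand(3)


set_seed(42)
first = draw_all()
set_seed(42)
second = draw_all()
print(first[0] == second[0], first[1] == second[1], torch.equal(first[2], second[2]))
```

Stream 3 is deliberately absent: a np.random.default_rng(...) instance has to receive its seed where it is constructed.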
Why Five Seeds — A Statistical-Power Argument
Five is not arbitrary. It is the smallest seed count that gives enough statistical power to publish a comparison. The Friedman test on the 9-method × 10-block (FD002 + FD004, 5 seeds each) NASA matrix from statistical_tests_results.json reports a difference between methods that is significant at p < 0.05. With 3 seeds the same test would have produced p ≈ 0.05 (borderline); with 1 seed there would be no test to run.
Concretely: GRACE's FD002 RMSE has a standard deviation of 0.66 across the 5 published seeds. The standard error of the mean is 0.66/√5 ≈ 0.30. This is the precision at which we can compare GRACE with, for example, GABA: GABA's 5-seed mean is 7.53 with SEM 0.29. The difference (7.72 − 7.53 = 0.19) is within the combined 2·SEM band, so the paper does NOT claim that either method has the lower RMSE on FD002; it cannot, with this seed budget. What it does claim, justifiably, is that GRACE has the lowest NASA score across all methods (223.4): the NASA differences are large enough to clear the noise floor.
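The arithmetic behind that comparison is short enough to check directly. The sketch below uses only the figures quoted in this section; how the two SEMs are combined is an assumption (added in quadrature, as for independent runs), since the text does not spell out the formula.

```python
import math

n_seeds = 5
grace_mean, grace_std = 7.72, 0.66  # FD002 figures quoted in this section
gaba_mean, gaba_sem = 7.53, 0.29    # FD002 figures quoted in this section

grace_sem = grace_std / math.sqrt(n_seeds)
diff = abs(grace_mean - gaba_mean)

# Assumption: the two SEMs combine in quadrature (independent seed runs).
combined_sem = math.sqrt(grace_sem ** 2 + gaba_sem ** 2)
band = 2 * combined_sem

print(f"GRACE SEM = {grace_sem:.3f}")
print(f"difference = {diff:.2f}, 2 x combined SEM = {band:.2f}")
print("distinguishable" if diff > band else "not distinguishable at this seed budget")
```

With these inputs the difference sits well inside the band, which is why no RMSE ranking between the two methods is claimed.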
Locking The Environment
Seeds alone are not enough. A 2024 cuDNN release can change the convolution algorithm catalog and silently produce different gradients than the 2023 release with the same seed. The paper's reproducibility recipe locks five environment layers:
- Python interpreter. Minor versions (3.10 vs 3.11) can differ in random internals. Lock to a specific minor version in requirements.txt or pyproject.toml.
- Library versions. torch>=2.0.0, numpy>=1.24.0, scipy>=1.10.0, scikit-learn>=1.2.0 (the paper's requirements.txt). Pin exact versions for publication runs.
- CUDA toolkit + cuDNN. Each torch wheel embeds a specific CUDA + cuDNN. Match what the paper used. A version mismatch can silently change conv algorithms.
- GPU hardware. Different micro-architectures (A100 vs H100) compute matrix products in different orders → ~10⁻⁶ FP drift per op. Cumulative over 500 epochs this is ~0.01 RMSE.
- OS / glibc. Numerical libraries call into glibc libm for some transcendentals. Linux distros differ in libm. Containerise with a fixed base image (the paper used Ubuntu 22.04 in a Docker container).
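One low-cost way to make these layers auditable is to save a snapshot of them next to every result. The helper below is a hypothetical sketch using only the standard library; the package list mirrors the requirements quoted above, and packages that are not installed are recorded as None.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages=("torch", "numpy", "scipy", "scikit-learn")):
    """Collect interpreter, OS, and installed package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "libc": "-".join(platform.libc_ver()),
        "packages": versions,
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```

This covers the interpreter, library, and OS layers only. When torch is installed, torch.version.cuda and torch.backends.cudnn.version() can be added to the dict for the CUDA/cuDNN layer; the GPU model still has to be recorded separately.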
DataLoader Determinism: Workers And Generators
The paper's phase1_cmapss.py uses num_workers=0 (single-process loading) because the C-MAPSS dataset fits in RAM and the BiLSTM forward pass dominates. With num_workers>0, three additional controls are required:
| Control | Why |
|---|---|
| torch.Generator passed to DataLoader | Pins shuffle order independently of the global torch RNG. Without it, the shuffle is sensitive to any other torch.* call between epochs. |
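| worker_init_fn = seed_worker | Worker processes can start with duplicated state for any RNG that PyTorch does not re-seed itself (for example NumPy in older PyTorch releases, or third-party libraries) and then produce identical augmentation noise. This function re-seeds each worker from torch.initial_seed(), which differs per worker. |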
| Persistent workers (PyTorch 1.7+) | persistent_workers=True keeps workers alive across epochs, avoiding fork() overhead. Worker init is run only once. |
set_seed() alone is enough for num_workers=0 reproducibility. The PyTorch section below extends it with the generator + worker_init_fn pattern that scales to multi-worker production training.
Determinism Limits — What Not To Expect
Some sources of variation cannot be eliminated even with full seed pinning. The paper accepts these and clears them via repeated runs:
- Cross-GPU FP drift. A100 vs H100 produces ~10⁻⁶ FP differences per op, accumulating to ~0.01 RMSE over 500 epochs. Bit-exact reproduction across GPU families is impossible without emulation.
- Atomic-add reductions. Some PyTorch ops (e.g. scatter_add) use atomic floating-point adds on CUDA that are non-deterministic regardless of cuDNN flags. By default PyTorch runs them silently; after torch.use_deterministic_algorithms(True) it uses a deterministic implementation where one exists and raises a RuntimeError otherwise.
- Multi-threaded BLAS. MKL/OpenBLAS reduce floats in different orders depending on thread count. Set OMP_NUM_THREADS=1 for bit-exact CPU reproduction.
- PyTorch version updates. Even patch releases (2.1.0 → 2.1.1) can change op implementations. Lock to exact versions; treat upgrades as re-validations.
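Most of the limits above come down to one fact: floating-point addition is not associative, so any change in reduction order can change the result. A minimal pure-Python illustration follows (not tied to any GPU or BLAS library; the naive_sum helper is hypothetical):

```python
import random

# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)  # False 0.6000000000000001 0.6


def naive_sum(values):
    """Accumulate left to right, like a single-threaded reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same values summed in two different orders, standing in for two
# thread counts or two GPU kernels that reduce in different orders.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]
forward = naive_sum(values)
backward = naive_sum(reversed(values))
print("order-dependent difference:", forward - backward)
```

Seeds fix which numbers are drawn; they do not fix the order in which hardware adds them up.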
Python: Reproducibility Recipe From Scratch
Five streams of randomness, three different APIs, two seeded runs side by side. The script below shows that random.seed, np.random.seed, and np.random.default_rng are mutually independent: seeding any one does not affect the others, and the legacy global numpy RNG produces a different sequence from the modern Generator at the same seed.
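A minimal sketch of such a script, assuming only the standard library and NumPy (the draws helper and the seed values are illustrative, not taken from the paper's code):

```python
import random

import numpy as np


def draws(py_seed, np_seed, gen_seed):
    """Seed each stream separately and return one draw from each."""
    random.seed(py_seed)                   # stream 1: Python builtin
    np.random.seed(np_seed)                # stream 2: NumPy legacy global
    rng = np.random.default_rng(gen_seed)  # stream 3: NumPy Generator
    return random.random(), float(np.random.random()), float(rng.random())


base = draws(42, 42, 42)
print("same seeds, same draws:", draws(42, 42, 42) == base)

# Change one seed at a time: only the matching stream moves.
py_changed = draws(7, 42, 42)
np_changed = draws(42, 7, 42)
gen_changed = draws(42, 42, 7)
print("python only :", py_changed[0] != base[0], py_changed[1:] == base[1:])
print("numpy only  :", np_changed[1] != base[1], np_changed[0] == base[0], np_changed[2] == base[2])
print("generator   :", gen_changed[2] != base[2], gen_changed[:2] == base[:2])

# Same seed, different algorithm: legacy global and Generator disagree.
print("legacy vs Generator differ:", base[1] != base[2])
```

Every line should print True values only, which is the independence claim in executable form.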
PyTorch: The Paper's set_seed() Plus Worker Init
Production reproducibility plumbing. set_seed is a verbatim copy of grace/training/seed_utils.py:14-22; the script extends it with the DataLoader generator + worker init function so the recipe works at num_workers>0. The final torch.equal check is the empirical proof that the recipe holds.
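The DataLoader half of that recipe can be sketched as follows. It follows the generator + worker_init_fn pattern from the PyTorch reproducibility notes rather than the paper's own file; the toy dataset, make_loader, and first_epoch are hypothetical names introduced for the demonstration.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Re-seed NumPy and Python RNGs inside each worker process."""
    worker_seed = torch.initial_seed() % 2**32  # NumPy seeds must fit in 32 bits
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int, num_workers: int = 0) -> DataLoader:
    """Toy loader whose shuffle order depends only on `seed`."""
    dataset = TensorDataset(torch.arange(32, dtype=torch.float32).unsqueeze(1))
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,  # only invoked when num_workers > 0
        generator=g,
    )


def first_epoch(seed: int, num_workers: int = 0) -> torch.Tensor:
    """Concatenate one epoch of batches so the sample order can be compared."""
    return torch.cat([batch[0] for batch in make_loader(seed, num_workers)])


if __name__ == "__main__":
    a = first_epoch(42)
    torch.rand(100)  # unrelated draw from the global RNG between runs
    b = first_epoch(42)
    print("same order:", torch.equal(a, b))
```

Passing num_workers=2 to first_epoch exercises the worker path; the __main__ guard is there because platforms that spawn worker processes re-import the module.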
With torch.use_deterministic_algorithms(True), an op that has no deterministic implementation raises a RuntimeError at runtime. Two options: switch to a deterministic-friendly op (preferred), or call torch.use_deterministic_algorithms(True, warn_only=True) to demote errors to warnings. The paper takes the first route, which ensures the published numbers are bit-reproducible per seed.
Reproducibility In Other Domains
| Domain | Critical reproducibility ingredient | Common pitfall |
|---|---|---|
| Self-driving perception | Augmentation pipeline (random crops, flips, color jitter) must use seeded RNG per worker. | Forgetting worker_init_fn produces identical augmentation across all workers — model never sees a diverse batch. |
| Language model fine-tuning | torch.use_deterministic_algorithms(True) for backward passes; pin tokenizer version exactly. | Tokenizer minor-version updates (e.g. transformers 4.36 → 4.37) silently change BPE merges; loss curves diverge. |
| Reinforcement learning | Environment seed (env.reset(seed=...) in Gymnasium and Gym ≥ 0.26) plus the action-sampling RNG: two separate seeds. | Environment seed alone reproduces transitions but not exploration; agent's action RNG must be pinned too. |
| Medical imaging segmentation | Patch-extraction RNG must be deterministic; Dice score is highly sensitive to which patches are seen. | Random patch sampling without seeding gives ±2% Dice variation seed-to-seed. |
| Hyperparameter search frameworks (Optuna, Ray Tune) | Pin sampler seed AND per-trial seed; otherwise trial assignment is non-reproducible. | Pinning only the trial seed, not the sampler — different trials run on different reruns even with the same trial seed. |
The pattern: reproducibility recipes generalise structurally, but the specific RNG sources depend on the domain. The invariant is that every stream of randomness must be identified and pinned; the variable part is which streams exist in your stack.
Pitfalls That Break Reproducibility Silently
Pitfall 1: relying on a global seed for a Generator API
Calling np.random.seed(42) does NOT affect rng = np.random.default_rng() instances. The modern Generator carries its own state. Either pass the seed at construction (default_rng(42)) or stick to the legacy global API consistently — never mix.
Pitfall 2: forgetting worker_init_fn with num_workers > 0
With num_workers=4 and no worker_init_fn, any RNG that PyTorch does not re-seed per worker starts every fork()'ed worker from the same parent state, so the workers produce identical random augmentation. Symptom: training loss looks fine but validation loss is unstable seed-to-seed because the model has effectively seen 1/4 of the augmentation diversity it should have.
Pitfall 3: cudnn.benchmark = True for ‘speed’
The benchmark flag picks the fastest cuDNN algorithm based on first-batch timing. Timings depend on system load → the chosen algorithm varies → seeds are no longer enough for reproducibility. The paper leaves benchmark = False; the <5% speedup is not worth the loss of reproducibility on a publication run.
Pitfall 4: claiming a single-seed result
A single-seed result is not reproducible at any meaningful level — another seed will give a different number. Anything published without a standard deviation is implicitly using the bit-exact level of reproducibility, which fails as soon as the reader is on different hardware. Always run at least 3 seeds for sweep decisions and 5 for publication.
Pitfall 5: not recording the random state alongside results
The paper's output JSON includes the seed in the file path: phase1/<DS>/<method>/seed_<n>/results.json. Without the seed in the artefact, you cannot re-run a single result — it has to be re-run from the full seed list. Make the seed part of the result.
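A small sketch of that convention: the path layout is the one quoted above, while the save_result helper, the seed value, and the metric value are hypothetical placeholders. Writing the seed inside the JSON as well is an extra safeguard, not something the text attributes to the paper.

```python
import json
import tempfile
from pathlib import Path


def save_result(root, dataset, method, seed, metrics):
    """Write metrics to <root>/phase1/<DS>/<method>/seed_<n>/results.json."""
    out_dir = Path(root) / "phase1" / dataset / method / f"seed_{seed}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "results.json"
    # Store the seed inside the file too, so it survives a rename or copy.
    out_file.write_text(json.dumps({"seed": seed, **metrics}, indent=2))
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder seed and metric value, not a published result.
    path = save_result(tmp, "FD002", "GRACE", 42, {"rmse": 7.5})
    relative = path.relative_to(tmp).as_posix()
    saved = json.loads(path.read_text())

print(relative)
print(saved)
```

With this layout a single suspicious number can be re-run on its own by reading the seed back from the path or the file.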
Takeaway
- Reproducibility has three levels: bit-exact, seed-stable, and distribution-stable. The paper targets the third — mean within ±2·SEM across 5 seeds.
- Five streams of randomness must be pinned: Python random, NumPy global, NumPy generator, PyTorch CPU, PyTorch CUDA. Plus two cuDNN flags (deterministic = True, benchmark = False).
- With num_workers > 0 on the DataLoader, also pin a torch.Generator for shuffle order and use a worker_init_fn for per-worker augmentation state.
- 5 seeds is the minimum for a Friedman test to reach p < 0.05. With a standard deviation of 0.66 on FD002, the SEM is 0.66/√5 ≈ 0.30, the smallest difference the paper can statistically claim.
- Bit-exact reproduction across hardware (A100 vs H100) is impossible. The paper's recipe is: pin everything that can be pinned, average over 5 seeds, report mean ± std.