Boo-AI — Master Artificial Intelligence by Building from Scratch

A Recipe That Survives The Kitchen Move

The Tartine Bakery cookbook gives precise instructions: 1000 g flour, 700 g water, 20 g salt, 200 g levain, 12-hour autolyse, 4-hour bulk ferment at 26 °C, three folds, 2-hour proof. Follow the recipe in San Francisco and in Singapore and the bread is recognisably the same loaf — the chemistry doesn't depend on which kitchen you stand in. Now imagine the cookbook had said ‘mix some flour with water, ferment until it feels right’. Same ingredients, unrecognisable outcomes.

Reproducible machine learning is the same engineering problem. The paper's 7.72 RMSE on FD002 with 5 seeds is the loaf; the recipe is grace/training/seed_utils.py:set_seed, the environment lockfile, the cuDNN flags, the per-worker DataLoader hooks, and the choice of 5 seeds rather than 1. This section makes every ingredient explicit, shows the three levels of reproducibility you can target, and explains why the paper commits to distribution-stable rather than the impossible bit-exact ideal.

The headline. Five streams of randomness (Python random, NumPy global, NumPy generator, PyTorch CPU, PyTorch CUDA), two cuDNN flags, one DataLoader generator, one worker_init_fn, and 5 seeds — that is the full recipe. Drop any one and the published 7.72 RMSE on FD002 cannot be reproduced by a third party.

Three Levels Of Reproducibility

Reproducibility is not a single property; it is a hierarchy. Different scientific claims need different levels:

Level	Tolerance	What is controlled	Used for
Bit-exact	Δ = 0	Hardware, software, seeds, cuDNN flags, OS — everything	Regression tests, CI, internal sanity checks
Seed-stable	Δ ≲ 0.01 RMSE (FP drift)	Seeds, cuDNN flags, major library versions	Re-running the SAME experiment on a different machine
Distribution-stable	Mean within ±2·SEM, std within ±20%	Code, hyperparameters, 5 seeds — but NOT init draws or shuffle order	Publishing ‘our method beats X’ claims

The paper targets distribution-stable. The published 7.72 RMSE on FD002 is $7.72 \pm 0.66$: a 5-seed mean with its standard deviation. Re-running with a fresh set of seeds (say [11, 22, 33, 44, 55]) should give a mean within 2·SEM of 7.72 (SEM ≈ 0.30, so ≈ ±0.59), but the per-seed numbers would differ. This is what every published ML result actually claims, even when the paper writes it as a single number.
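A minimal sketch of the distribution-stable acceptance check, assuming the tolerances from the table above (mean within ±2·SEM, std within ±20%). The function name `is_distribution_stable` and the per-seed values are hypothetical and chosen only to illustrate the arithmetic; they are not the paper's per-seed results.

```python
import math
import statistics


def is_distribution_stable(published_mean, published_std, rerun_scores):
    """Return True if a rerun matches a published mean/std at the
    distribution-stable level: mean within 2*SEM, std within 20%."""
    n = len(rerun_scores)
    sem = published_std / math.sqrt(n)          # standard error of the mean
    rerun_mean = statistics.mean(rerun_scores)
    rerun_std = statistics.stdev(rerun_scores)  # sample std (n - 1 denominator)
    mean_ok = abs(rerun_mean - published_mean) <= 2 * sem
    std_ok = abs(rerun_std - published_std) <= 0.20 * published_std
    return mean_ok and std_ok


# Hypothetical rerun with fresh seeds (made-up RMSE values, not real results).
rerun = [7.1, 8.4, 7.9, 6.9, 8.0]
stable = is_distribution_stable(7.72, 0.66, rerun)
print(stable)
```

Note that the check compares distributions, not individual seeds: no per-seed value in the rerun needs to match any published per-seed value.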

Interactive: Pick A Level, See The Tradeoffs

[Interactive viewer: for each level, shows what must be controlled, what remains uncontrolled at that tolerance, and the paper's accept/reject status. The distribution-stable panel shows the actual per-method seed-std on FD002 from cmapss_h256_complete_140.csv, the empirical noise floor that any reproducibility claim must clear.]

Why this matters for reading the paper. A result like ‘GRACE 7.72 RMSE’ without a standard deviation is silently single-seed and not reproducible at any meaningful level. The paper reports $7.72 \pm 0.66$ precisely because the ±0.66 is what defines ‘reproduced’ for a third party.

Seed Pinning: Five Streams Of Randomness

A typical PyTorch training run draws random numbers from five distinct streams. Each must be pinned independently:

Stream	API	What it controls	Pinned by
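1. Python builtin	random.*	Any code that calls random.shuffle, random.choice, or random.sample (custom samplers, augmentation). The DataLoader shuffle itself is driven by the torch RNG, not this stream	random.seed(seed)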
2. NumPy global	np.random.*	Anywhere np.random.* is used (datasets, augmentation, baselines)	np.random.seed(seed)
3. NumPy generator	rng = np.random.default_rng(seed); rng.integers(...)	Modern API — independent state, doesn't read np.random.seed	Pass seed at construction time
4. PyTorch CPU	torch.randn, torch.rand, torch.randperm	CPU tensors and any non-CUDA op	torch.manual_seed(seed)
5. PyTorch CUDA	Same APIs but on .cuda() tensors	GPU tensors. Per-device — must seed ALL devices	torch.cuda.manual_seed_all(seed)

The paper's set_seed at grace/training/seed_utils.py:14-22 pins streams 1, 2, 4, 5 plus two cuDNN flags. Stream 3 (the modern Generator API) is handled separately wherever it is used; in production the GRACE codebase mostly uses the legacy global pinned by stream 2.

Why Five Seeds — A Statistical-Power Argument

Five is not arbitrary. It is the smallest seed count that gives enough statistical power to publish a comparison. The Friedman test on the 9-method × 10-block (FD002+FD004 × 5 seeds) NASA matrix from statistical_tests_results.json reports $\chi^2 = 29.84,\ p = 0.0002$ — significant at p < 0.05. With 3 seeds the same test would have produced p ≈ 0.05 (borderline); with 1 seed there would be no test to run.
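As a sanity check on the reported p-value: under the null hypothesis the Friedman statistic is approximately chi-square distributed with k − 1 degrees of freedom, where k is the number of methods (here 9, so 8 degrees of freedom). For an even number of degrees of freedom the chi-square survival function has a closed form, so the check below needs only the standard library. The helper name `chi2_sf_even_df` is illustrative and not part of the paper's code.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable with an
    even number of degrees of freedom (closed-form Poisson sum)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


k_methods = 9
p_value = chi2_sf_even_df(29.84, df=k_methods - 1)
print(f"p = {p_value:.4f}")
```

The result rounds to the 0.0002 quoted above, which confirms that the statistic and the p-value are mutually consistent for a 9-method comparison.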

Concretely: GRACE's FD002 RMSE has $\sigma \approx 0.66$ across the 5 published seeds. The standard error of the mean is $\sigma/\sqrt{5} \approx 0.30$. This is the precision at which we can compare GRACE with, for example, GABA, whose 5-seed mean is 7.53 with SEM 0.29. The difference (7.72 − 7.53 = 0.19) is within the combined 2·SEM band, so the paper does NOT claim ‘GRACE has lower RMSE than GABA’ on FD002; it cannot, with this seed budget. What it does claim, justifiably, is that GRACE has the lowest NASA score across all methods (223.4), because the NASA differences are large enough to clear the noise floor.
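The comparison above can be reproduced in a few lines. This is a sketch under one assumption that the text does not spell out: the two standard errors are combined in quadrature. Any other reasonable way of combining them gives the same verdict here, since the difference is far inside the band.

```python
import math

N_SEEDS = 5
grace_mean, grace_std = 7.72, 0.66   # FD002 RMSE, values quoted in the text
gaba_mean, gaba_sem = 7.53, 0.29     # values quoted in the text

grace_sem = grace_std / math.sqrt(N_SEEDS)
# Assumption: combine the two standard errors in quadrature.
combined_sem = math.sqrt(grace_sem ** 2 + gaba_sem ** 2)
difference = abs(grace_mean - gaba_mean)
distinguishable = difference > 2 * combined_sem

print(f"GRACE SEM      = {grace_sem:.2f}")
print(f"difference     = {difference:.2f}")
print(f"2*combined SEM = {2 * combined_sem:.2f}")
print("distinguishable:", distinguishable)
```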

The minimum-seeds rule. Use $n \geq 4\,\sigma^2 / \delta^2$ seeds, where $\delta$ is the smallest difference you want to detect with 95% confidence. For GRACE on FD002 with $\sigma = 0.66$ and a 0.5-RMSE detection threshold, n ≈ 7. The paper's 5 is a compromise between cost and power; some claims need 5, some are robust with 3, none are credible at 1.
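One way to read the rule: requiring the detectable difference to be at least two standard errors, $\delta \geq 2\sigma/\sqrt{n}$, and solving for $n$ gives $n \geq 4\sigma^2/\delta^2$. The sketch below evaluates it; the function name `min_seeds` is illustrative.

```python
import math


def min_seeds(sigma, delta):
    """Smallest integer n satisfying n >= 4 * sigma**2 / delta**2."""
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    return math.ceil(4.0 * sigma ** 2 / delta ** 2)


print(min_seeds(0.66, 0.5))   # 4 * 0.4356 / 0.25 = 6.97 -> 7
print(min_seeds(0.66, 0.6))   # 4 * 0.4356 / 0.36 = 4.84 -> 5
```

With σ = 0.66, a budget of 5 seeds therefore corresponds to a detection threshold of roughly 0.6 RMSE.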

Locking The Environment

Seeds alone are not enough. A 2024 cuDNN release can change the convolution algorithm catalog and silently produce different gradients than the 2023 release with the same seed. The paper's reproducibility recipe locks five environment layers:

Python interpreter. Minor versions (3.10 vs 3.11) can differ in random internals. Lock to a specific minor version, for example with requires-python in pyproject.toml or through the container's base image.
Library versions. torch>=2.0.0, numpy>=1.24.0, scipy>=1.10.0, scikit-learn>=1.2.0 (paper's requirements.txt). Pin exact versions for publication runs.
CUDA toolkit + cuDNN. Each torch wheel embeds a specific CUDA + cuDNN. Match what the paper used. Version mismatch silently changes conv algorithms.
GPU hardware. Different micro-architectures (A100 vs H100) compute matrix products in different orders → ~10⁻⁶ FP drift per op. Cumulative over 500 epochs this is ~0.01 RMSE.
OS / glibc. Numerical libraries call into glibc libm for some transcendentals. Linux distros differ in libm. Containerise with a fixed base image (paper used Ubuntu 22.04 in a Docker container).
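A lockfile pins what should be installed; it is also worth recording what actually was installed next to each result. The sketch below collects an environment fingerprint using only the standard library. The function name `environment_fingerprint` and the output layout are assumptions for illustration, not the paper's tooling.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Collect interpreter, OS, and installed package versions.
    Packages that are not installed are recorded as None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "platform": platform.platform(),
        "packages": versions,
    }


fingerprint = environment_fingerprint(["torch", "numpy", "scipy", "scikit-learn"])
print(json.dumps(fingerprint, indent=2, sort_keys=True))
```

This covers the interpreter, library, and OS layers. When torch is installed, torch.version.cuda and torch.backends.cudnn.version() can be added to the same dictionary to cover the CUDA and cuDNN layer.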

DataLoader Determinism: Workers And Generators

The paper's phase1_cmapss.py uses num_workers=0 (single-process loading) because the C-MAPSS dataset fits in RAM and the BiLSTM forward pass dominates. With num_workers>0, three additional controls are required:

Control	Why
torch.Generator passed to DataLoader	Pins shuffle order independently of the global torch RNG. Without it, the shuffle is sensitive to any other torch.* call between epochs.
worker_init_fn = seed_worker	Each fork()'ed worker inherits the parent's RNG state and produces identical augmentation noise. This function re-seeds per worker using torch.initial_seed().
Persistent workers (PyTorch 1.7+)	persistent_workers=True keeps workers alive across epochs, avoiding fork() overhead. Worker init is run only once.

For num_workers=0, set_seed() alone is enough for reproducibility. The PyTorch demo below extends this with the generator + worker_init_fn pattern that scales to multi-worker production training.

Determinism Limits — What Not To Expect

Some sources of variation cannot be eliminated even with full seed pinning. The paper accepts these and absorbs them by averaging over repeated runs:

Cross-GPU FP drift. A100 vs H100 produces ~10⁻⁶ FP differences per op, accumulating to ~0.01 RMSE over 500 epochs. Bit-exact reproduction across GPU families is impossible without emulation.
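Atomic-add reductions. Some PyTorch ops (e.g. scatter_add on CUDA) use atomic floating-point adds that are non-deterministic regardless of cuDNN flags. By default PyTorch runs them without complaint; after torch.use_deterministic_algorithms(True), ops with no deterministic implementation raise a RuntimeError instead (or only warn, with warn_only=True).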
Multi-threaded BLAS. MKL/OpenBLAS reduce floats in different orders depending on thread count. Set OMP_NUM_THREADS=1 for bit-exact CPU reproduction.
PyTorch version updates. Even patch releases (2.1.0 → 2.1.1) can change op implementations. Lock to exact versions; treat upgrades as re-validations.
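The root cause behind the first three items is that floating-point addition is not associative, so any change in reduction order can change the last bits of a result. A minimal standard-library illustration follows; `naive_sum` is a hypothetical helper that adds strictly left to right.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, the order a simple loop would use."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)     # False
print(left, right)       # 0.6000000000000001 0.6

# Same three numbers, two different summation orders.
forward = naive_sum([0.1, 0.2, 0.3])
backward = naive_sum([0.3, 0.2, 0.1])
print(forward == backward)   # False

# An exactly rounded sum does not depend on the order.
print(math.fsum([0.1, 0.2, 0.3]) == math.fsum([0.3, 0.2, 0.1]))   # True
```

A GPU reduction or a multi-threaded BLAS call is this effect at scale: the values are the same, but the order in which partial sums are combined is not.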

Python: Reproducibility Recipe From Scratch

Five streams of randomness, three different APIs, two seeded runs side by side. The script below shows that random.seed, np.random.seed, and np.random.default_rng are mutually independent: seeding any one does not affect the others, and the legacy global numpy RNG produces a different sequence from the modern Generator at the same seed.

Five randomness streams, isolated

🐍grace_seed_streams.py

1docstring

Names the contract: enumerate the five streams of randomness in a typical PyTorch training run. The script focuses on the Python-level streams; the cuDNN flags require torch and are shown in the PyTorch demo below.

9import random

Python standard library. Not used for DataLoader shuffling (that is driven by the torch RNG), but used by any custom sampler or augmentation code that calls random.*.

EXECUTION STATE

📚 random.seed(s) = Sets the seed of the GLOBAL Python RNG. Affects random.randint, random.choice, random.shuffle, random.sample.

📚 random.randint(a, b) = Inclusive integer in [a, b].

10import numpy as np

NumPy. Has its own global RNG (np.random.*) and a modern Generator API (np.random.default_rng).

EXECUTION STATE

→ two NumPy RNGs = Legacy global (np.random.seed/np.random.randint) and modern Generator (rng = np.random.default_rng(seed)). The legacy one is shared with anything else using np.random.*; the Generator is local to its instance.

13def set_all_seeds(seed: int = 42) -> None:

Helper that pins every Python-level randomness source in one call. Mirrors the pattern in grace/training/seed_utils.py:set_seed.

EXECUTION STATE

⬇ input: seed = 42 = Integer. The paper's default; first of [42, 123, 456, 789, 1024].

⬆ returns = None. Side-effects only.

15random.seed(seed)

Pin the Python global RNG. Without this, any code that calls random.shuffle, random.choice, or random.sample is non-reproducible.

EXECUTION STATE

→ side effect = Future random.* calls in this process produce a deterministic stream.

16np.random.seed(seed)

Pin the legacy NumPy global RNG. Affects np.random.* anywhere in the code, including any random sampling done inside the dataset class.

EXECUTION STATE

📚 np.random.seed = Sets the seed of the legacy global RNG. Returns None.

→ caveat = Code using np.random.default_rng (the modern generator API) is NOT affected by np.random.seed. Each Generator carries its own state.

18# torch.manual_seed(seed)

Commented-out reminder. The PyTorch demo below shows the full set_seed.

19# torch.cuda.manual_seed_all(seed)

Pins ALL CUDA devices, not just the current one. Important for multi-GPU training where every device has its own RNG.

23def demo_python_random(seed: int = 42) -> list[int]:

Helper: seed Python RNG and produce 5 random ints. Lets us prove that the same seed yields the same sequence.

EXECUTION STATE

⬇ input: seed = Integer to pin.

⬆ returns = list[int] — 5 ints in [0, 99].

24random.seed(seed)

Pin the seed locally to this call.

25return [random.randint(0, 99) for _ in range(5)]

List comprehension: 5 random ints. With seed=42 always returns the same sequence.

EXECUTION STATE

📚 [random.randint(0, 99) for _ in range(5)] = Five INDEPENDENT draws, one per loop iteration. Each call advances the RNG state.

→ seed=42 sequence = [81, 14, 3, 94, 35]

29def demo_numpy_global(seed: int = 42) -> np.ndarray:

Same demo for NumPy's legacy global RNG.

EXECUTION STATE

⬆ returns = ndarray (5,) of int64 in [0, 99].

30np.random.seed(seed)

Pin the legacy NumPy global RNG.

31return np.random.randint(0, 100, size=5)

Single vectorised call — five random ints at once. Different from looping random.randint() five times because vectorised draws are reproducible per-shape, not per-element-call.

EXECUTION STATE

📚 np.random.randint(low, high, size) = Uniform integers in [low, high). Returns ndarray of shape `size`.

→ seed=42 sequence = [51, 92, 14, 71, 60]

35def demo_numpy_generator(seed: int = 42) -> np.ndarray:

Same demo using the modern Generator API. Independent state from the legacy global.

36rng = np.random.default_rng(seed)

Construct a fresh Generator. Has its own state — does not interact with np.random.seed.

EXECUTION STATE

📚 np.random.default_rng(seed) = Returns numpy.random.Generator. The recommended modern API since NumPy 1.17.

→ why prefer this? = Self-contained state. Two separate Generators with the same seed produce the same stream regardless of what np.random.* code elsewhere is doing.

37return rng.integers(0, 100, size=5)

Generator method (not a free function).

EXECUTION STATE

📚 rng.integers(low, high, size) = Note the API renaming: legacy uses `randint`, Generator uses `integers`. Same semantics: high is EXCLUSIVE.

→ seed=42 sequence = [ 9, 77, 65, 44, 43] — DIFFERENT from the legacy global. The two RNG implementations are not interchangeable even at the same seed.

41print("Python random.randint(seed=42):")

Header for the Python random demo.

42print(" run 1:", demo_python_random(42))

First run with seed 42.

EXECUTION STATE

Output = run 1: [81, 14, 3, 94, 35]

43print(" run 2:", demo_python_random(42))

Second run with seed 42 — IDENTICAL output.

EXECUTION STATE

Output = run 2: [81, 14, 3, 94, 35] — proves seeding works.

45print("\nNumPy global np.random.randint(seed=42):")

Header for the legacy NumPy demo.

46print(" run 1:", demo_numpy_global(42).tolist())

First run.

EXECUTION STATE

Output = run 1: [51, 92, 14, 71, 60]

47print(" run 2:", demo_numpy_global(42).tolist())

Second run — IDENTICAL.

EXECUTION STATE

Output = run 2: [51, 92, 14, 71, 60]

49print("\nNumPy default_rng(seed=42):")

Header for the modern Generator demo.

50print(" run 1:", demo_numpy_generator(42).tolist())

First run with the Generator API.

EXECUTION STATE

Output = run 1: [9, 77, 65, 44, 43] — different from legacy global because the algorithms differ.

51print(" run 2:", demo_numpy_generator(42).tolist())

Second run — IDENTICAL.

EXECUTION STATE

Output = run 2: [9, 77, 65, 44, 43]

54print("\nPython random.randint with two different seeds:")

Final demo: same code, different seed → different sequence.

55print(" seed=42: ", demo_python_random(42))

Seed 42.

EXECUTION STATE

Output = seed=42: [81, 14, 3, 94, 35]

56print(" seed=123: ", demo_python_random(123))

Seed 123 — completely different sequence.

EXECUTION STATE

Final output =

Python random.randint(seed=42):
  run 1: [81, 14, 3, 94, 35]
  run 2: [81, 14, 3, 94, 35]

NumPy global np.random.randint(seed=42):
  run 1: [51, 92, 14, 71, 60]
  run 2: [51, 92, 14, 71, 60]

NumPy default_rng(seed=42):
  run 1: [9, 77, 65, 44, 43]
  run 2: [9, 77, 65, 44, 43]

Python random.randint with two different seeds:
  seed=42:   [81, 14, 3, 94, 35]
  seed=123:  [6, 27, 76, 39, 89]

"""Reproducibility from first principles.

Five streams of randomness need to be pinned for a PyTorch run to
reproduce: Python random, NumPy, PyTorch CPU, PyTorch CUDA, and the
cuDNN auto-tuner. This script shows what each one controls, in
isolation.
"""

import random
import numpy as np


def set_all_seeds(seed: int = 42) -> None:
    """Pin every Python-level randomness source."""
    random.seed(seed)             # used by random.shuffle, random.choice
    np.random.seed(seed)          # legacy global numpy RNG
    # The next two lines would also fire if torch were imported:
    #   torch.manual_seed(seed)
    #   torch.cuda.manual_seed_all(seed)


# ---------- Stream 1: Python random ----------
def demo_python_random(seed: int = 42) -> list[int]:
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(5)]


# ---------- Stream 2: NumPy global RNG ----------
def demo_numpy_global(seed: int = 42) -> np.ndarray:
    np.random.seed(seed)
    return np.random.randint(0, 100, size=5)


# ---------- Stream 3: NumPy generator (recommended modern API) ----------
def demo_numpy_generator(seed: int = 42) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.integers(0, 100, size=5)


# ---------- Run twice with the same seed ----------
print("Python random.randint(seed=42):")
print("  run 1:", demo_python_random(42))
print("  run 2:", demo_python_random(42))

print("\nNumPy global np.random.randint(seed=42):")
print("  run 1:", demo_numpy_global(42).tolist())
print("  run 2:", demo_numpy_global(42).tolist())

print("\nNumPy default_rng(seed=42):")
print("  run 1:", demo_numpy_generator(42).tolist())
print("  run 2:", demo_numpy_generator(42).tolist())

# ---------- Different seeds give different streams ----------
print("\nPython random.randint with two different seeds:")
print("  seed=42:  ", demo_python_random(42))
print("  seed=123: ", demo_python_random(123))

PyTorch: The Paper's set_seed() Plus Worker Init

Production reproducibility plumbing. set_seed is a verbatim copy of grace/training/seed_utils.py:14-22; the script extends it with the DataLoader generator + worker init function so the recipe works at num_workers>0. The final torch.equal check is the empirical proof that the recipe holds.

Seed pinning + DataLoader determinism

🐍grace_reproducibility.py

1docstring

States the contract. The script reproduces the paper's set_seed and adds the DataLoader-level reproducibility scaffolding (Generator + worker_init_fn) that production training pipelines need.

9import random

Python standard RNG. Pinned by random.seed inside set_seed.

10import numpy as np

NumPy. The legacy global RNG is also pinned.

11import torch

PyTorch core. We need torch.manual_seed, torch.cuda.manual_seed_all, torch.backends.cudnn flags, torch.Generator, and torch.initial_seed.

12from torch.utils.data import DataLoader, TensorDataset

Standard PyTorch data plumbing. TensorDataset wraps two tensors as a (x_i, y_i) iterable; DataLoader batches and shuffles.

EXECUTION STATE

📚 TensorDataset(*tensors) = Each tensor must have the same first-dim length. Index i returns the i-th element of each tensor.

📚 DataLoader(dataset, batch_size, shuffle, num_workers, worker_init_fn, generator) = Iterable of batches. With num_workers>0, fork() worker processes that prefetch in parallel.

15def set_seed(seed: int = 42) -> None:

Paper-canonical pin. Six side-effecting calls: four seed pinners and two cuDNN flags. Source: grace/training/seed_utils.py:14.

EXECUTION STATE

⬇ input: seed = 42 = Integer. The paper's default seed. The 5-seed publication runs use [42, 123, 456, 789, 1024].

21random.seed(seed)

Pin the Python global RNG. Covers any main-process code that calls random.* (custom samplers, augmentation).

22np.random.seed(seed)

Pin the legacy NumPy global RNG. Affects np.random.* anywhere in the code base — datasets, augmentation, etc.

23torch.manual_seed(seed)

Pin the CPU PyTorch RNG. Affects torch.randn, torch.rand, torch.randperm, weight inits when run on CPU.

EXECUTION STATE

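📚 torch.manual_seed(s) = Seeds PyTorch's default generators. In current PyTorch releases this covers CUDA devices as well as the CPU, so the explicit torch.cuda.manual_seed_all call is a belt-and-braces measure. Returns the default torch.Generator (rarely used).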

24torch.cuda.manual_seed_all(seed)

Pin EVERY CUDA device's RNG. The `_all` variant is critical: torch.cuda.manual_seed only pins the current device. With multi-GPU runs, the other devices would still have unpinned state without this call.

EXECUTION STATE

📚 torch.cuda.manual_seed_all(s) = Pins all visible CUDA devices. No-op on CPU-only systems. Safe to call always.

25torch.backends.cudnn.deterministic = True

Force cuDNN to use deterministic algorithm choices. Some cuDNN convolution implementations are non-deterministic (atomic-add reductions); this flag forbids them.

EXECUTION STATE

→ cost = ~10-15% slowdown on convolution-heavy backbones. For GRACE (BiLSTM-heavy), the impact is <5%.

→ caveat = This flag only restricts cuDNN's algorithm choice; it never raises. The stricter torch.use_deterministic_algorithms(True) is what raises a RuntimeError when an op has no deterministic implementation. To match paper behaviour, leave the flag at True.

26torch.backends.cudnn.benchmark = False

Disable cuDNN's heuristic algorithm picker. With benchmark=True, cuDNN times several conv algorithms on the first batch and picks the fastest — which can vary run-to-run because timings depend on system load.

EXECUTION STATE

→ why off? = Run-to-run reproducibility. With benchmark=True you can get bit-different outputs from the SAME seed depending on what algorithm cuDNN happened to pick.

29def seed_worker(worker_id: int) -> None:

Per-worker seeding for DataLoader subprocesses. Required when num_workers > 0 because each worker is a fork()'ed process with its own RNG state.

EXECUTION STATE

⬇ input: worker_id = Integer 0..num_workers-1. Passed in by DataLoader for each worker.

→ why per-worker? = Without this, every worker starts with the SAME state (inherited from fork()) and produces the SAME augmentation noise on every sample. Effectively no augmentation diversity.

31worker_seed = torch.initial_seed() % 2 ** 32

Read the per-worker base seed that PyTorch derived from the main-process seed + worker_id. Modulo 2³² to fit numpy's seed type (uint32).

EXECUTION STATE

📚 torch.initial_seed() = Returns the seed used to initialise the current worker's torch RNG. Different per worker.

→ recipe origin = PyTorch docs (torch.utils.data ‘reproducibility’ section). Standard pattern for any DataLoader with num_workers>0.

32np.random.seed(worker_seed)

Pin numpy in this worker. The dataset class uses np.random for any per-sample augmentation.

33random.seed(worker_seed)

Pin Python random in this worker. Covers any per-sample code in the worker (custom transforms, samplers) that calls random.*.

36def make_loader(seed: int = 42, num_workers: int = 4) -> DataLoader:

Build a fully reproducible DataLoader. Three ingredients: the global set_seed (called BEFORE), the per-worker init function, and a torch.Generator that pins the SHUFFLE order across workers.

39g = torch.Generator()

Local torch RNG. Carries shuffle state for this DataLoader instance. Independent of the global torch RNG.

EXECUTION STATE

📚 torch.Generator() = Creates a fresh CPU torch.Generator. Methods: .manual_seed, .seed, .get_state, .set_state.

40g.manual_seed(seed)

Seed the local Generator. Without this, multi-worker shuffle order would still drift run-to-run because the worker_init_fn does not control the shuffle order — only the augmentation RNG.

42x = torch.randn(1024, 14)

Synthetic features. 1024 samples, 14-dim — matches the C-MAPSS sensor count.

EXECUTION STATE

📚 torch.randn(*size) = Sample from N(0, 1). Reproducible if torch.manual_seed has been called.

43y = torch.rand(1024) * 125.0

Synthetic RUL targets uniform in [0, 125].

44ds = TensorDataset(x, y)

Pair (x, y) tensors into an indexable dataset.

46return DataLoader(

Construct the loader with all reproducibility hooks wired up.

47ds, batch_size=256, shuffle=True,

Standard args. shuffle=True is what makes the Generator necessary.

48num_workers=num_workers,

Number of worker subprocesses. 0 = main process only (deterministic by default). 4 = four workers (need worker_init_fn).

EXECUTION STATE

→ trade-off = More workers → faster I/O, more reproducibility plumbing required. Paper uses num_workers=0 in phase1_cmapss.py because the C-MAPSS dataset is small enough to load entirely in RAM.

49worker_init_fn=seed_worker,

Hook that runs in each worker after fork(). Sets the per-worker RNG state.

50generator=g,

Pin the shuffle order. Without this, the shuffle is driven by the GLOBAL torch RNG, which can be advanced by ANY torch.* call in the main process between epochs — making epoch-to-epoch shuffle order non-deterministic in non-trivial training loops.

EXECUTION STATE

→ why a separate Generator? = Isolation. The shuffle order should NOT be affected by parameter init or augmentation noise. A dedicated Generator gives the loader its own bookkeeping.

55set_seed(42)

Global seed before constructing the loader. The data tensors x, y are reproducible because torch.randn / torch.rand use the global torch RNG.

56loader1 = make_loader(seed=42)

Build the deterministic loader.

57batch1 = next(iter(loader1))[1][:5]

Pull the first batch and take its first 5 RUL values. Index [1] is `y` (TensorDataset returns (x, y)); [:5] takes the first 5 of the batch.

EXECUTION STATE

📚 next(iter(loader)) = Get the first batch without consuming the iterator further. Standard idiom for ‘peek at one batch’.

batch1 (illustrative) = Tensor (5,) with RUL values from the first batch's first 5 samples. With seed=42 reproducible.

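59set_seed(42)

Re-pin all RNGs. Critical: if we skip this, make_loader would draw different synthetic x and y from the already-advanced global torch RNG, so batch2 would differ. The Generator `g` is re-seeded inside make_loader either way.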

60loader2 = make_loader(seed=42)

Build a fresh loader.

61batch2 = next(iter(loader2))[1][:5]

Pull the first batch from the new loader.

63print("Batch 0 first-5 RUL values:")

Header.

64print(f" pass 1: {batch1.tolist()}")

First pass output.

65print(f" pass 2: {batch2.tolist()}")

Second pass output. With set_seed pinned both times, batch1 == batch2 element-by-element.

66print(f" identical: {torch.equal(batch1, batch2)}")

torch.equal returns True iff both tensors have identical shape AND identical values. Final reproducibility check.

EXECUTION STATE

📚 torch.equal(a, b) = Element-wise equality + shape match. Returns Python bool.

Output (illustrative) =

Batch 0 first-5 RUL values:
  pass 1: [82.34, 17.59, 5.21, 110.55, 64.87]
  pass 2: [82.34, 17.59, 5.21, 110.55, 64.87]
  identical: True

"""Full PyTorch reproducibility recipe: paper's set_seed plus
DataLoader-level determinism via generator + worker_init_fn.

Source: grace/training/seed_utils.py:set_seed (lines 14-22).
This script extends it with DataLoader plumbing, i.e. what the paper's
phase1_cmapss.py wires through to its DataLoader instances.
"""

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def set_seed(seed: int = 42) -> None:
    """Paper-canonical reproducibility setup.

    Mirrors grace/training/seed_utils.py:set_seed verbatim; we copy
    instead of importing so the script is self-contained.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """Per-DataLoader-worker seeding. Required when num_workers > 0."""
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int = 42, num_workers: int = 4) -> DataLoader:
    """Build a deterministic DataLoader. The Generator carries the
    shuffle state so cross-worker order is reproducible too."""
    g = torch.Generator()
    g.manual_seed(seed)

    x = torch.randn(1024, 14)
    y = torch.rand(1024) * 125.0
    ds = TensorDataset(x, y)

    return DataLoader(
        ds, batch_size=256, shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=g,
    )


# ---------- Run two passes (with a spawn start method, wrap these in `if __name__ == "__main__":`) ----------
set_seed(42)
loader1 = make_loader(seed=42)
batch1 = next(iter(loader1))[1][:5]      # first 5 RUL values from batch 0

set_seed(42)
loader2 = make_loader(seed=42)
batch2 = next(iter(loader2))[1][:5]

print("Batch 0 first-5 RUL values:")
print(f"  pass 1: {batch1.tolist()}")
print(f"  pass 2: {batch2.tolist()}")
print(f"  identical: {torch.equal(batch1, batch2)}")

Which determinism flag can raise. torch.backends.cudnn.deterministic = True only restricts cuDNN to deterministic algorithms; on its own it does not raise. The stricter torch.use_deterministic_algorithms(True) does: some PyTorch ops have no deterministic implementation (e.g. certain backward passes), and calling them then raises RuntimeError at runtime. Two options: switch to a deterministic-friendly op (preferred), or call torch.use_deterministic_algorithms(True, warn_only=True) to demote errors to warnings. The paper takes the first route, which keeps the published numbers bit-reproducible per seed on a fixed hardware and software stack.

Reproducibility In Other Domains

Domain	Critical reproducibility ingredient	Common pitfall
Self-driving perception	Augmentation pipeline (random crops, flips, color jitter) must use seeded RNG per worker.	Forgetting worker_init_fn produces identical augmentation across all workers — model never sees a diverse batch.
Language model fine-tuning	torch.use_deterministic_algorithms(True) for backward passes; pin tokenizer version exactly.	Tokenizer or library version updates (e.g. a transformers minor release) can silently change tokenization; loss curves then diverge.
Reinforcement learning	Environment seed (e.g. env.reset(seed=...) in Gymnasium) plus action-sampling RNG: TWO separate seeds.	Environment seed alone reproduces transitions but not exploration; agent's action RNG must be pinned too.
Medical imaging segmentation	Patch-extraction RNG must be deterministic; Dice score is highly sensitive to which patches are seen.	Random patch sampling without seeding gives ±2% Dice variation seed-to-seed.
Hyperparameter search frameworks (Optuna, Ray Tune)	Pin sampler seed AND per-trial seed; otherwise trial assignment is non-reproducible.	Pinning only the trial seed, not the sampler — different trials run on different reruns even with the same trial seed.

The pattern: reproducibility recipes generalise structurally, but the specific RNG sources depend on the domain. The invariant is that every stream of randomness must be identified and pinned; the variable part is which streams exist in your stack.

Pitfalls That Break Reproducibility Silently

Pitfall 1: relying on a global seed for a Generator API

Calling np.random.seed(42) does NOT affect rng = np.random.default_rng() instances. The modern Generator carries its own state. Either pass the seed at construction (default_rng(42)) or stick to the legacy global API consistently — never mix.
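A short demonstration of this pitfall, assuming NumPy is installed. An unseeded default_rng() draws fresh entropy from the operating system, so re-seeding the legacy global RNG before constructing it changes nothing.

```python
import numpy as np

# Seeding the legacy global RNG does not reach an unseeded Generator.
np.random.seed(42)
a = np.random.default_rng().integers(0, 2**62, size=4)
np.random.seed(42)
b = np.random.default_rng().integers(0, 2**62, size=4)
print(np.array_equal(a, b))   # False: the global seed was ignored

# Seeding at construction time is what makes a Generator reproducible.
c = np.random.default_rng(42).integers(0, 2**62, size=4)
d = np.random.default_rng(42).integers(0, 2**62, size=4)
print(np.array_equal(c, d))   # True
```

The large integer range is used only so that an accidental match between `a` and `b` is practically impossible.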

Pitfall 2: forgetting worker_init_fn with num_workers > 0

With num_workers=4 and no worker_init_fn, every worker fork() inherits the same parent state and produces identical random augmentation per sample. Symptom: training loss looks fine but validation loss is unstable seed-to-seed because the model has effectively seen 1/4 the augmentation diversity it should have.
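The effect can be simulated without PyTorch. The sketch below uses the standard library only: it copies one RNG state into four simulated workers to mimic inherited state, then contrasts that with giving each worker its own seed. Deriving the seed as base seed plus worker id mirrors the scheme described in the PyTorch DataLoader documentation; everything else here (names, noise function) is hypothetical.

```python
import random

NUM_WORKERS = 4
BASE_SEED = 42


def noise_from(rng, n=3):
    """Draw n 'augmentation' values from a given RNG."""
    return [round(rng.random(), 4) for _ in range(n)]


# Without per-worker seeding: every worker starts from the same state.
parent_state = random.Random(BASE_SEED).getstate()
unseeded = []
for worker_id in range(NUM_WORKERS):
    rng = random.Random()
    rng.setstate(parent_state)        # simulates state inherited from the parent
    unseeded.append(noise_from(rng))

# With per-worker seeding: each worker derives its own seed.
seeded = [noise_from(random.Random((BASE_SEED + worker_id) % 2 ** 32))
          for worker_id in range(NUM_WORKERS)]

print(len({tuple(x) for x in unseeded}))  # 1: all workers draw identical noise
print(len({tuple(x) for x in seeded}))    # 4: every worker draws different noise
```

Both variants are fully reproducible from run to run; only the second gives the workers distinct streams.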

Pitfall 3: cudnn.benchmark = True for ‘speed’

The benchmark flag picks the fastest cuDNN algorithm based on first-batch timing. Timings depend on system load → the chosen algorithm varies → seeds are no longer enough for reproducibility. The paper leaves benchmark = False; the <5% speedup is not worth the loss of reproducibility on a publication run.

Pitfall 4: claiming a single-seed result

A single-seed result is not reproducible at any meaningful level — another seed will give a different number. Anything published without a standard deviation is implicitly using the bit-exact level of reproducibility, which fails as soon as the reader is on different hardware. Always run at least 3 seeds for sweep decisions and 5 for publication.

Pitfall 5: not recording the random state alongside results

The paper's output JSON includes the seed in the file path: phase1/<DS>/<method>/seed_<n>/results.json. Without the seed in the artefact, you cannot re-run a single result — it has to be re-run from the full seed list. Make the seed part of the result.
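A minimal sketch of this convention, following the path pattern quoted above. The function name `save_result`, the JSON fields, the method directory name, and the placeholder metric are assumptions for illustration; the paper's actual results.json schema is not shown in this section.

```python
import json
import tempfile
from pathlib import Path


def save_result(root, dataset, method, seed, metrics):
    """Write metrics to <root>/phase1/<dataset>/<method>/seed_<seed>/results.json,
    storing the seed inside the JSON as well as in the path."""
    out_dir = Path(root) / "phase1" / dataset / method / f"seed_{seed}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "results.json"
    payload = {"seed": seed, "dataset": dataset, "method": method, **metrics}
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    path = save_result(tmp, "FD002", "grace", 42, {"rmse": 0.0})  # placeholder metric
    relative = path.relative_to(tmp).as_posix()
    loaded = json.loads(path.read_text())

print(relative)         # phase1/FD002/grace/seed_42/results.json
print(loaded["seed"])   # 42
```

Storing the seed in both places means a results file that has been copied or renamed can still be traced back to the run that produced it.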

Takeaway

Reproducibility has three levels: bit-exact, seed-stable, and distribution-stable. The paper targets the third — mean within ±2·SEM across 5 seeds.
Five streams of randomness must be pinned: Python random, NumPy global, NumPy generator, PyTorch CPU, PyTorch CUDA. Plus two cuDNN flags (deterministic = True, benchmark = False).
With num_workers > 0 on the DataLoader, also pin a torch.Generator for shuffle order and use a worker_init_fn for per-worker augmentation state.
5 seeds is the smallest budget at which the paper's Friedman test is clearly significant ($p = 0.0002$); with 3 seeds it would be borderline. With $\sigma \approx 0.66$ on FD002, the SEM is $\approx 0.30$, so differences smaller than about 2·SEM cannot be statistically claimed.
Bit-exact reproduction across hardware (A100 vs H100) is impossible. The paper's recipe is: pin everything that can be pinned, average over 5 seeds, report mean ± std.