Walk onto a film set during principal photography and you find the director holding a master shot list — one page per scene, every camera angle numbered, every actor blocked, every light cued. The list is the contract between intention and execution. Production runs all the way through; nothing is improvised.
The paper's training script is the same artefact for the published 7.72 RMSE on FD002. Every line is a numbered shot. To reproduce the result you do not interpret the script — you execute it. This section walks the entire production entry point run_single_experiment() from grace/experiments/phase1_cmapss.py, plus the UnifiedTrainer.fit() body and the per-batch _train_epoch() inner loop, then traces what actually happened across 190 epochs of one published GABA run.
The headline. Production training is ~150 lines of Python: 60 lines to set up dataset / model / loss / optimiser, 90 lines to drive the 500-epoch loop, save checkpoints, and persist results. Every component is reused from chapters 5–21; nothing new is invented in chapter 22.
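| Phase | Code | What runs | Time |
| --- | --- | --- | --- |
| 1. Setup | run_single_experiment(), 60 lines | Build dataset, model, loss, optimiser. | ~10 seconds |
| 2. Main loop | UnifiedTrainer.fit() | Epoch counter, warmup, adaptive WD schedule, _train_epoch call, EMA-aware eval, scheduler step, best tracking, early stopping, history accumulation. | ~20-30 minutes |
| 3. Persist | phase1_cmapss tail, 15 lines | Build serialisable dict, json.dump to results.json, log to stderr. | ~1 second |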
Time budget on a single A100: ~30 minutes per seed × 5 seeds × 4 datasets × 7 methods = 70 hours for the full Phase 1 sweep. With a 4-GPU host running seeds in parallel, the whole sweep finishes in ~18 hours.
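The arithmetic behind that budget can be checked in a few lines. This is a sketch that uses only the figures quoted above; the variable names are illustrative, and perfectly parallel runs with no queueing overhead are an assumption.

```python
# Sweep size and wall-clock estimate, using only the figures quoted in the text.
minutes_per_run = 30                  # ~30 minutes per seed on one A100
seeds, datasets, methods = 5, 4, 7
n_gpus = 4                            # assumption: runs parallelise perfectly

n_runs = seeds * datasets * methods
serial_hours = n_runs * minutes_per_run / 60
parallel_hours = serial_hours / n_gpus

print(n_runs, serial_hours, parallel_hours)  # 140 70.0 17.5
```

The 17.5-hour figure is what the text rounds to ~18 hours.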
Interactive: A Real Run From The Paper Repo
Below is the actual per-epoch trajectory of GABA on FD002 with seed 42: 190 epochs of real data from grace/outputs/phase1/FD002/gaba/seed_42/logs/gradient_stats.csv. The chart plots λ*_rul and the gradient ratio at every epoch, with markers at the boundaries where the trainer changed behaviour.
What the trajectory tells you. Epoch 0: λ*_rul = 0.164 because the GABA controller is in its 100-step warmup phase, with uniform-ish weights. By epoch 5 (post-warmup), the closed form kicks in and λ*_rul → 0.0006 because the gradient ratio measures ~2400×. The floor-renormalised steady state hovers around 0.003 for the rest of training. Best epoch lands at 109; early stopping fires at 189 after 80 stagnant epochs.
Phase 1 — Setup: run_single_experiment()
The PyTorch walkthrough below is the verbatim production function. The setup is dataset-aware (per-condition normalisation auto-enabled for FD002/FD004) and method-aware (weighted MSE on for GRACE, off for GABA). The output directory is constructed deterministically, so the artefact path encodes the run's full identity: output_dir/phase1/FD002/gaba/seed_42.
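The deterministic path layout can be sketched with pathlib alone. The helper name build_run_dir is hypothetical (the production code builds the path inline), and a temporary directory stands in for the real output_dir.

```python
import tempfile
from pathlib import Path


def build_run_dir(output_dir, dataset, method, seed):
    """Hypothetical helper: output_dir/phase1/<dataset>/<method>/seed_<seed>."""
    run_dir = Path(output_dir) / "phase1" / dataset / method / f"seed_{seed}"
    run_dir.mkdir(parents=True, exist_ok=True)  # idempotent: safe on re-runs
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    first = build_run_dir(tmp, "FD002", "gaba", 42)
    second = build_run_dir(tmp, "FD002", "gaba", 42)  # same inputs, same path
    same_path = first == second
    existed = first.is_dir()
    relative = first.relative_to(tmp).as_posix()

print(relative)  # phase1/FD002/gaba/seed_42
```

Because the path is a pure function of dataset, method, and seed, two runs with the same identity always write to the same place, and a results file can be located without any run registry.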
Two non-obvious responsibilities of the setup:
Test scaler reuse. scaler_params=train_ds.get_scaler_params() on the test dataset prevents data leakage. The model sees the test set normalised by the TRAIN-FIT scaler, not the test scaler.
MTL parameter union. params = list(model.parameters()) + list(mtl_loss.parameters()) ensures any LEARNABLE parameters in the MTL loss (Uncertainty σ's, GradNorm weights) are included in the optimiser. GABA has none; the union pattern is uniform across all 9 method variants in loss_registry.py.
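The scaler-reuse point can be illustrated with plain NumPy. This is a minimal sketch on synthetic data: fit_scaler and apply_scaler are hypothetical stand-ins, and the actual structure returned by get_scaler_params() is not shown in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(500, 3))
test = rng.normal(loc=12.0, scale=2.0, size=(200, 3))  # shifted on purpose


def fit_scaler(x):
    """Per-feature mean and std (hypothetical stand-in for get_scaler_params)."""
    return {"mean": x.mean(axis=0), "std": x.std(axis=0)}


def apply_scaler(x, params):
    return (x - params["mean"]) / params["std"]


train_params = fit_scaler(train)
test_clean = apply_scaler(test, train_params)      # reuse TRAIN statistics
test_leaky = apply_scaler(test, fit_scaler(test))  # test fits its own scaler

# With its own scaler the test set is re-centred at zero, which hides the shift.
print(np.abs(test_leaky.mean(axis=0)).max() < 1e-9)  # True
# With the train scaler the shift relative to training data stays visible.
print(test_clean.mean(axis=0).round(2))
```

The leaky variant uses test-set statistics that would not be available at deployment time, which is why the train-fit parameters are passed through explicitly.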
Phase 2 — Main Loop: UnifiedTrainer.fit()
fit() is 130 lines, but the structural pattern is a single nested loop:
| Lines (in trainer.py) | What runs | Cadence |
| --- | --- | --- |
| 126 | for epoch in range(epochs): | Outer loop (≤500 iterations) |
| 128 | self.warmup.apply(self.optimizer, epoch) | Per epoch: sets LR via warmup ramp |
| 131-134 | if adaptive_weight_decay and epoch > 100: rescale wd | … |
| … | if rmse_last < best: snapshot weights + checkpointer.save | Per epoch: best tracking |
| 210 | if early_stopping(rmse_last, model): break | Per epoch: patience check |
| 179-193 | history[*].append(...) | Per epoch: log accumulation |
The order matters. Warmup sets the LR BEFORE training so the very first batch already runs at the correct rate. Adaptive WD runs AFTER warmup so the schedule's epoch-100 trigger applies to the post-warmup state. Eval runs AFTER the inner training loop with EMA shadow weights. Scheduler step runs only after warmup (otherwise the scheduler would fight the linear ramp). Early stopping runs LAST so the current epoch's improvement is counted before the patience check.
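The ordering described above can be made concrete with a stubbed loop that only records which step runs when. This is a sketch, not the trainer: the function name, the step labels, and the toy RMSE values are all hypothetical, and no training happens.

```python
def fit_skeleton(n_epochs, warmup_epochs, wd_switch_epoch, val_rmse, patience):
    """Record the per-epoch step order described in the text (stubs only)."""
    log = []
    best, bad_epochs = float("inf"), 0
    for epoch in range(n_epochs):
        steps = []
        if epoch < warmup_epochs:
            steps.append("warmup")          # LR ramp set before any batch runs
        if epoch > wd_switch_epoch:
            steps.append("adaptive_wd")     # weight-decay rescale, late in training
        steps.append("train_epoch")         # inner per-batch loop
        rmse_last = val_rmse[epoch]
        steps.append("eval_ema")            # evaluation on EMA shadow weights
        if epoch >= warmup_epochs:
            steps.append("scheduler_step")  # plateau scheduler only after warmup
        if rmse_last < best:
            best, bad_epochs = rmse_last, 0
            steps.append("save_best")
        else:
            bad_epochs += 1
        steps.append("early_stop_check")    # last, so this epoch's result counts
        log.append(steps)
        if bad_epochs >= patience:
            break
    return log


log = fit_skeleton(n_epochs=6, warmup_epochs=2, wd_switch_epoch=3,
                   val_rmse=[9.0, 8.0, 7.5, 7.6, 7.7, 7.8], patience=3)
for epoch, steps in enumerate(log):
    print(epoch, steps)
```

In this toy run the scheduler first appears at epoch 2 (the first post-warmup epoch), the weight-decay step first appears at epoch 4, and the loop stops at epoch 5 after three non-improving epochs.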
Phase 3 — Per-Epoch: _train_epoch()
Inside _train_epoch (trainer.py:243-290), the per-batch loop is the eight stages walked through in section 22.1. The shape of one iteration:
for batch in loader: — pull (seq, rul, health, uid).
if self.ema: self.ema.update(self.model) — EMA shadow update.
Eight stages × ~110 batches per epoch on FD002 = 880 stage executions per epoch. At the 500-epoch cap that is 440,000 stage executions, or 55,000 forward+backward iterations (110 × 500): the budget that produces the published 7.72 RMSE.
Python: A Minimal End-To-End Mini-fit()
Every callback the production trainer wires together, written out by hand in 120 lines of NumPy. The task is linear regression on synthetic data: the model is trivial; the scaffolding is what matters. Replace the closed-form gradient on lines 79–80 with loss.backward() and you have the production loop.
Production training loop, minimal NumPy version (grace_minimal_fit.py)

Annotated lines (line numbers refer to the listing that follows):

Line 1: docstring
    States the contract: a fully self-contained mini training loop that exercises every callback in the paper's UnifiedTrainer.fit(): warmup, AdamW, gradient clip, EMA, ReduceLROnPlateau, early stopping. Replace the linear closed-form gradient with PyTorch autograd and you have the production loop.
Line 11: import numpy as np
    Numerical-array library. Single dependency for the whole demo.
Line 13: rng = np.random.default_rng(42)
    Modern NumPy RNG. seed=42 makes the synthetic data and the initialisation reproducible.
Line 17: N_TRAIN, N_TEST = 1024, 256
    Sample counts. Ratio 4:1, enough for a stable RMSE and a bit of validation headroom.
Line 18: x_train = rng.normal(0, 1, (N_TRAIN, 3))
    Three features, drawn from N(0, 1).
    Note: rng.normal(loc, scale, size) returns Gaussian samples; shape (N, 3) means N rows × 3 features.
Line 56: return float(np.sqrt(((predict(theta, x) - y) ** 2).mean()))
    RMSE in one expression: predict → squared residual → mean → sqrt, then cast to a plain float.
Line 60: best_rmse = float("inf")
    Best-so-far RMSE. Initialised to infinity so any real value wins.
Line 61: best_theta = theta.copy()
    Snapshot of the best parameters. Updated whenever val RMSE improves.
Line 62: no_improve = 0
    Counter for ReduceLROnPlateau and EarlyStopping. Reset on improvement.
Line 63: scheduler_lr = lr0
    Learning rate the scheduler maintains AFTER warmup. Halved on plateaus.
Line 64: step = 0
    Global step counter. Used for AdamW bias correction.
Line 65: history = []
    Per-epoch log. Each tuple = (epoch, lr, val_rmse). Used for plotting and final reporting.
Line 69: for epoch in range(n_epochs):
    Outer loop over epochs. Mirrors UnifiedTrainer.fit's for-loop in trainer.py:126.
Line 70: lr = lr_at(epoch) if epoch < warmup_epochs else scheduler_lr
    Choose the learning rate. Warmup phase uses the linear ramp; after that, the scheduler value applies.
    Note: this is a conditional expression (Python ternary), <expr_if_true> if <cond> else <expr_if_false>, a compact form of an if/else assignment.
Line 73: idx = rng.permutation(N_TRAIN)
    Shuffle order for THIS epoch. Equivalent to DataLoader(shuffle=True). New permutation per epoch.
    Note: rng.permutation(n) returns a random permutation of the integers 0..n-1.
Line 74: for start in range(0, N_TRAIN, batch_size):
    Inner loop over mini-batches. 16 iterations at batch_size=64.
Line 75: b = idx[start:start + batch_size]
    Slice of the shuffled indices for this batch.
Line 76: xb, yb = x_train[b], y_train[b]
    Fancy indexing: pulls rows of x_train at positions b. yb is the matching subset of targets.
Line 79: residual = predict(theta, xb) - yb
    Per-sample prediction error. Shape (B,).
Line 80: g = (2.0 / len(b)) * (xb.T @ residual)
    Closed-form MSE gradient: dL/dθ = (2/B)·X^T·(Xθ - y). For PyTorch this would be loss.backward(); here we have the closed form because the model is linear.
    Note: .T is the transpose, xb (B, 3) → xb.T (3, B).
    Note: (3, B) @ (B,) → (3,), the gradient vector for theta.
Line 83: gn = np.linalg.norm(g)
    Global gradient L2 norm. The clip threshold is applied to this scalar.
Line 119: print(f"final theta: {best_theta.round(3).tolist()}")
    Recovered coefficients. Should be close to the target [2.0, 0.5, -1.0].
    Output (illustrative):
        best epoch: 32 | rmse 0.0623
        final theta: [1.999, 0.501, -0.999]
        target theta: [2.0, 0.5, -1.0]
Line 120: print(f"target theta: [2.0, 0.5, -1.0]")
    The truth, for visual comparison. AdamW recovers the coefficients to ~0.001 accuracy on the noise-free test.
```python
"""End-to-end mini training loop. NumPy + closed-form gradient.

Stitches every UnifiedTrainer.fit() callback into one self-contained
script: data, model, AdamW, gradient clip, EMA, LR warmup,
ReduceLROnPlateau, early stopping, best checkpoint.

Synthetic single-task regression, so there is no MTL loss weighting
(no GABA controller) here. The point is the SCAFFOLDING, not the model.
"""

import numpy as np

rng = np.random.default_rng(42)


# ---------- Synthetic data ----------
N_TRAIN, N_TEST = 1024, 256
x_train = rng.normal(0, 1, (N_TRAIN, 3))
y_train = (
    2.0 * x_train[:, 0] + 0.5 * x_train[:, 1] - x_train[:, 2]
    + rng.normal(0, 0.3, N_TRAIN)
)
x_test = rng.normal(0, 1, (N_TEST, 3))
y_test = 2.0 * x_test[:, 0] + 0.5 * x_test[:, 1] - x_test[:, 2]


# ---------- Model: 3-feature linear ----------
theta = rng.normal(0, 0.1, size=3)
m, v = np.zeros(3), np.zeros(3)
shadow = theta.copy()


# ---------- Hyperparameters ----------
lr0, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
weight_decay = 1e-4
ema_alpha = 0.999
warmup_epochs = 5
patience_es = 20
batch_size = 64
n_epochs = 100


def lr_at(epoch):
    """Linear LR warmup."""
    if epoch < warmup_epochs:
        return lr0 * (0.1 + 0.9 * epoch / warmup_epochs)
    return lr0


def predict(theta, x):
    """Linear forward pass."""
    return x @ theta


def rmse(theta, x, y):
    return float(np.sqrt(((predict(theta, x) - y) ** 2).mean()))


# ---------- Training state ----------
best_rmse = float("inf")
best_theta = theta.copy()
no_improve = 0
scheduler_lr = lr0
step = 0
history = []


# ---------- Training loop ----------
for epoch in range(n_epochs):
    lr = lr_at(epoch) if epoch < warmup_epochs else scheduler_lr

    # Shuffle and iterate mini-batches
    idx = rng.permutation(N_TRAIN)
    for start in range(0, N_TRAIN, batch_size):
        b = idx[start:start + batch_size]
        xb, yb = x_train[b], y_train[b]

        # Forward + backward (closed-form gradient for linear MSE)
        residual = predict(theta, xb) - yb  # (B,)
        g = (2.0 / len(b)) * (xb.T @ residual)  # (3,)

        # Gradient clip (global norm 1.0)
        gn = np.linalg.norm(g)
        if gn > 1.0:
            g = g / gn

        # AdamW update
        step += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        theta -= lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)

        # EMA shadow
        shadow = ema_alpha * shadow + (1 - ema_alpha) * theta

    # Per-epoch eval (on EMA shadow)
    val_rmse = rmse(shadow, x_test, y_test)
    history.append((epoch, lr, val_rmse))

    # Track best + early stopping
    if val_rmse < best_rmse - 1e-4:
        best_rmse = val_rmse
        best_theta = theta.copy()  # raw weights; val_rmse above was measured on the EMA shadow
        no_improve = 0
    else:
        no_improve += 1

    # ReduceLROnPlateau (patience 5 here for the toy)
    if no_improve > 0 and no_improve % 5 == 0:
        scheduler_lr = max(scheduler_lr * 0.5, 1e-6)

    if no_improve >= patience_es:
        print(f"early stop at epoch {epoch}")
        break

print(f"best epoch: {history[-1][0] - no_improve} | rmse {best_rmse:.4f}")
print(f"final theta: {best_theta.round(3).tolist()}")
print(f"target theta: [2.0, 0.5, -1.0]")
```
What the toy proves. AdamW + warmup + EMA + clip + ReduceLROnPlateau + EarlyStopping recover the true coefficients [2.0,0.5,−1.0] on the synthetic regression to ~0.001 accuracy in 30–50 epochs. The same scaffolding applied to a 1.7M-parameter CNN-BiLSTM-Attention backbone produces the published GRACE result in 100–200 epochs.
PyTorch: The Paper's run_single_experiment, Verbatim
The actual production entry point, line by line, with per-line notes on the constructor calls, keyword arguments, and config-dict keys. Every published row of the paper's 140-experiment table came out of this function.
Annotated lines (line numbers refer to the listing that follows):

Line 1: docstring
    Names the contract: this is the paper's production single-seed entry point. Every published row of Table I (FD001-FD004 × 7 methods × 5 seeds = 140 runs) was produced by calling this function.
Line 8: import json
    Standard library JSON. Used to persist the per-seed results.json artefact.
Line 9: from pathlib import Path
    Modern path manipulation. Path objects support / for joining and .mkdir(parents=True). Replaces the older os.path API.
Line 11: import torch
    Core PyTorch.
Line 12: import torch.nn as nn
    Module/parameter machinery.
Line 13: import torch.optim as optim
    AdamW + ReduceLROnPlateau scheduler.
Line 14: from torch.utils.data import DataLoader
    Mini-batch iterator over the MTL-wrapped CMAPSSDataset.
Line 68: params = list(model.parameters()) + list(mtl_loss.parameters())
    Concatenate model parameters with any LEARNABLE params in the MTL loss. GABA has none; Uncertainty has σ_rul, σ_health; GradNorm has the loss weights themselves.
```python
"""Paper's production training entry point, verbatim.

Source: paper_ieee_tii/grace/experiments/phase1_cmapss.py:37-156.
Single-seed runner that produces one row of the published 140-experiment
table.
"""

import json
from pathlib import Path

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from ..experiments.config import ExperimentConfig
from ..core.loss_registry import get_loss
from ..core.weighted_mse import moderate_weighted_mse_loss
from ..models.backbone import UnifiedBackbone
from ..models.dual_task_model import DualTaskModel
from ..models.model_configs import get_model_config
from ..data.cmapss_dataset import CMAPSSDataset
from ..data.health_labels import rul_to_health_3class
from ..data.mtl_wrapper import MTLDatasetWrapper
from ..training.trainer import UnifiedTrainer
from ..training.seed_utils import set_seed


def run_single_experiment(cfg: ExperimentConfig, seed: int):
    """One seed of one method on one dataset. Returns serialisable dict."""
    set_seed(seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # ---- Data ----
    train_ds = CMAPSSDataset(
        dataset_name=cfg.dataset, data_dir=cfg.data_dir,
        sequence_length=cfg.sequence_length, train=True,
        random_seed=seed, per_condition_norm=cfg.per_condition_norm,
    )
    test_ds = CMAPSSDataset(
        dataset_name=cfg.dataset, data_dir=cfg.data_dir,
        sequence_length=cfg.sequence_length, train=False,
        scaler_params=train_ds.get_scaler_params(),
        random_seed=seed, per_condition_norm=cfg.per_condition_norm,
    )
    train_mtl = MTLDatasetWrapper(train_ds, rul_to_health_3class)
    test_mtl = MTLDatasetWrapper(test_ds, rul_to_health_3class)
    train_loader = DataLoader(train_mtl, batch_size=cfg.batch_size, shuffle=True)
    test_loader = DataLoader(test_mtl, batch_size=cfg.batch_size, shuffle=False)

    # ---- Model ----
    mc = get_model_config(cfg.model_config)
    backbone = UnifiedBackbone(
        input_size=mc.input_size, hidden_size=mc.hidden_size,
        cnn_channels=mc.cnn_channels, num_attn_heads=mc.num_attn_heads,
        fc_dims=mc.fc_dims, dropout=mc.dropout,
        use_attention=mc.use_attention, use_residual=mc.use_residual,
    )
    model = DualTaskModel(backbone, num_health_states=mc.num_health_states,
                          dropout=mc.dropout)

    # ---- Loss ----
    mtl_loss = get_loss(cfg.mtl_method)
    rul_criterion = (moderate_weighted_mse_loss if cfg.use_weighted_mse
                     else nn.MSELoss())

    # ---- Optimiser + scheduler ----
    params = list(model.parameters()) + list(mtl_loss.parameters())
    optimizer = optim.AdamW(params, lr=cfg.lr, weight_decay=cfg.weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=cfg.scheduler_factor,
        patience=cfg.scheduler_patience, min_lr=cfg.min_lr,
    )

    # ---- Output dir ----
    run_dir = Path(cfg.output_dir) / "phase1" / cfg.dataset / cfg.mtl_method / f"seed_{seed}"
    run_dir.mkdir(parents=True, exist_ok=True)

    # ---- Trainer ----
    trainer = UnifiedTrainer(
        model=model, mtl_loss=mtl_loss,
        rul_criterion=rul_criterion,
        health_criterion=nn.CrossEntropyLoss(),
        optimizer=optimizer, scheduler=scheduler, device=device,
        config={
            "epochs": cfg.epochs, "patience": cfg.patience,
            "grad_clip": cfg.grad_clip, "use_ema": cfg.use_ema,
            "ema_decay": cfg.ema_decay, "warmup_epochs": cfg.warmup_epochs,
            "lr": cfg.lr,
            "checkpoint_dir": str(run_dir / "checkpoints"),
            "log_dir": str(run_dir / "logs"),
            "adaptive_weight_decay": cfg.adaptive_weight_decay,
            "initial_weight_decay": cfg.initial_weight_decay,
        },
    )

    # ---- Run + persist ----
    results = trainer.fit(train_loader, test_loader)

    serialisable = {
        "dataset": cfg.dataset, "method": cfg.mtl_method, "seed": seed,
        "best_epoch": results["best_epoch"],
        "best_rmse": results["best_rmse"],
        "metrics": {k: v for k, v in results["best_metrics"].items()},
    }
    with open(run_dir / "results.json", "w") as f:
        json.dump(serialisable, f, indent=2, default=str)
    return serialisable
```
The MTL parameter union pattern (line 68). list(mtl_loss.parameters()) appears innocuous: for GABA it returns an empty list. For Uncertainty (Kendall et al.) it returns the two learnable σ scalars; for GradNorm (Chen et al.) it returns the per-task weight parameters. Without the concatenation, those scalars would never see a gradient step and the algorithm would degenerate to fixed-σ scaling. The published tables for those baselines depend on this one line.
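The effect of leaving loss parameters out of the optimiser can be reproduced without PyTorch. The sketch below assumes the common log-variance form of uncertainty weighting, L = exp(-s)·L_task + s, with a single task and hand-written gradients; the function name, the toy task, and all values are hypothetical, and the paper's own Uncertainty baseline is not shown in this section.

```python
import math


def train(trainable, n_steps=2000, lr=0.05):
    """Plain SGD on exp(-s) * task_loss + s; only names in `trainable` are updated."""
    params = {"w": 0.0, "s": 0.0}  # w: model weight, s: learnable log-variance
    target, noise_floor = 3.0, 4.0
    for _ in range(n_steps):
        task_loss = (params["w"] - target) ** 2 + noise_floor
        grads = {
            "w": math.exp(-params["s"]) * 2.0 * (params["w"] - target),
            "s": 1.0 - math.exp(-params["s"]) * task_loss,
        }
        for name in trainable:  # the "optimiser" only sees this list
            params[name] -= lr * grads[name]
    return params


with_union = train(trainable=["w", "s"])  # model params + loss params
model_only = train(trainable=["w"])       # loss params left out

print(model_only["s"])            # 0.0, never updated
print(round(with_union["s"], 3))  # settles near log(4), the log of the residual loss
```

In the model-only run the loss weight exp(-s) stays at its initial value of 1, which is the fixed-σ degeneration the paragraph above describes.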
What Happens Across 190 Epochs
Reading the timeline above with the trainer code in mind, here is the narrative of the published GABA-on-FD002-seed-42 run:
| Epoch range | What's happening | Observable |
| --- | --- | --- |
| 0–9 | LR warmup + GABA warmup_steps. Model is in tiny-LR territory; GABA emits uniform 0.5/0.5 weights for the first ~110 batches. | λ_rul stays near 0.16, gradient ratio still moderate (~100×) because the model is essentially un-trained. |
| 10–30 | Full LR. GABA enters its closed-form regime. Gradient ratio explodes to ~2500× as the model learns coarse features and L_rul shoots up. | λ_rul drops sharply to ~5×10⁻⁴. Training loss falls fastest in this window. |
| 30–100 | Steady descent. EMA smoothing settles λ_rul at ~2×10⁻³. ReduceLROnPlateau hasn't fired yet: rmse_last keeps improving. | λ_rul climbs slowly upward as the gradient ratio narrows from ~2000× to ~500×. |
| 100 | Adaptive weight decay halves (1e-4 → 5e-5). Optimisation enters its second phase. | Vertical orange dashed line in the chart. |
| 100–109 | Best window. The model lands its lowest rmse_last at epoch 109 (RMSE = 7.42). Checkpointer saves model + optimiser + EMA shadow. | Violet dashed line at the best epoch. |
| 109–189 | Patience countdown. 80 stagnant epochs. ReduceLROnPlateau fires at ~epoch 140 (LR halved); fires again at ~epoch 170 (halved again). | λ_rul keeps drifting upward (gradient ratio shrinks as the model becomes more accurate); rmse_last oscillates ±0.5 around 7.5. |
| 189 | Early stopping fires. Best weights restored. Final eval recomputed with best EMA shadow. | Red dashed line. Run terminates. |
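The relationship between the best epoch (109), the patience (80), and the stopping epoch (189) follows from a plain patience counter. The class below is a hypothetical minimal version, not the trainer's actual early-stopping object, and the RMSE curve is synthetic rather than taken from the logged run.

```python
class EarlyStopping:
    """Hypothetical minimal patience counter (not the trainer's actual class)."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, metric):
        """Return True when training should stop."""
        if metric < self.best:
            self.best, self.best_epoch, self.bad_epochs = metric, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Synthetic curve: improves through epoch 109, then plateaus above the best value.
curve = [20.0 - 0.1 * e for e in range(110)] + [9.5] * 400

stopper = EarlyStopping(patience=80)
for epoch, rmse in enumerate(curve):
    if stopper.step(epoch, rmse):
        break

print(stopper.best_epoch, epoch)  # 109 189
```

The 80th non-improving epoch after 109 is epoch 189, so a 190-epoch log (epochs 0 to 189) is exactly what a patience of 80 produces when the best epoch is 109.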
The GABA controller is doing its job throughout. At every epoch the closed form λ_raw ∝ 1/g_rul is being recomputed; EMA smooths the per-batch fluctuation; the floor (0.05) bounds each per-task weight from below; renormalisation restores sum-to-one. The trajectory is the controller's record, not a tuned schedule.
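The four steps named above (closed form, EMA, floor, renormalise) can be sketched in NumPy. This is an illustration of the pipeline's shape only: the function name, the EMA coefficient, and the choice to smooth the already-renormalised weights are assumptions, and the sketch is not expected to reproduce the logged λ_rul values.

```python
import numpy as np


def gaba_weights(grad_norms, prev_weights, ema_beta=0.9, floor=0.05):
    """Inverse-gradient-norm weights -> EMA -> floor -> renormalise (illustrative)."""
    inv = 1.0 / np.asarray(grad_norms, dtype=float)
    raw = inv / inv.sum()                      # closed form: weight proportional to 1/g
    smoothed = ema_beta * prev_weights + (1.0 - ema_beta) * raw
    floored = np.maximum(smoothed, floor)      # lift any weight that falls below the floor
    return floored / floored.sum()             # restore sum-to-one


weights = np.array([0.5, 0.5])                 # uniform start, order: (rul, health)
for _ in range(200):
    weights = gaba_weights([2400.0, 1.0], weights)  # RUL gradient ~2400x larger

print(weights.round(4))
```

With one gradient thousands of times larger than the other, the raw inverse-norm weight for the dominant task is close to zero; the floor and renormalisation are what keep it from vanishing entirely in this sketch.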
The Same Skeleton For Other Domains
The 150-line script generalises by replacing three things: dataset, backbone, MTL loss. Everything else — AdamW, scheduler, warmup, EMA, clip, early stopping, checkpointer — is unchanged.
| Domain | Replace dataset with | Replace backbone with | Replace MTL loss with |
| --- | --- | --- | --- |
| Self-driving 3D detection | nuScenes / Waymo Open Dataset class with multi-modal input | PointPillars or CenterPoint encoder | PCGrad or GradNorm: the per-task gradients can conflict on far-vs-near objects |
| Speech recognition | LibriSpeech with mel-spectrogram windows | Conformer encoder | Uncertainty (Kendall): the CTC and attention-decoder losses have different scales |
| Medical imaging | BraTS / MIMIC-CXR with paired image+text | UNet++ or SegFormer encoder | DWA (Liu et al.): the loss-ratio history is well-suited to multi-modal medical tasks |
| … | … | … | PCGrad: the per-task gradients can flip signs around the task switch |
The skeleton invariant is what makes ML papers reproducible: the contribution lives in the three replaceable slots, not in the surrounding training loop. A reviewer who knows the skeleton can read any new MTL paper's training script in five minutes by spotting which slot is the contribution.
Pitfalls When Reading Or Modifying The Script
Pitfall 1: changing one line without checking the cadence
The placement of scheduler.step() AFTER warmup (line 161 in trainer.py) is non-trivial. Moving it before warmup makes the scheduler fight the warmup ramp; moving it inside _train_epoch turns ReduceLROnPlateau into a per-batch decay. Either change silently destroys the published result.
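The per-batch failure mode is easy to see with a toy plateau rule. The class below is a simplified, hypothetical stand-in (it halves the LR after `patience` consecutive non-improving calls) and does not replicate every detail of PyTorch's ReduceLROnPlateau; the patience, the constant metric, and the batch count are illustrative.

```python
class PlateauScheduler:
    """Hypothetical minimal plateau rule: halve LR after `patience` non-improving steps."""

    def __init__(self, lr, factor=0.5, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_steps = float("inf"), 0

    def step(self, metric):
        if metric < self.best:
            self.best, self.bad_steps = metric, 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr *= self.factor
                self.bad_steps = 0


epochs, batches_per_epoch = 5, 110
per_epoch = PlateauScheduler(lr=1e-3)
per_batch = PlateauScheduler(lr=1e-3)

for _ in range(epochs):
    for _ in range(batches_per_epoch):
        per_batch.step(7.5)  # wrong: stepped inside the batch loop
    per_epoch.step(7.5)      # right: stepped once per epoch

print(per_epoch.lr)          # 0.001
print(per_batch.lr < 1e-15)  # True
```

Five flat epochs are well within a patience of 10 when counted per epoch, but the same five epochs are 550 calls when counted per batch, which is enough to halve the learning rate dozens of times.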
Pitfall 2: forgetting to pass the train scaler to the test set
The scaler_params=train_ds.get_scaler_params() argument (line 43) is the difference between leaked and clean evaluation. Skipping it lets the test set fit its own scaler — the model implicitly sees test-set statistics, RMSE typically drops by 2–3 cycles, and the result is not publishable.
Pitfall 3: shuffle=True on the test loader
Line 49: shuffle=False. The evaluator's per-unit last-cycle NASA score depends on the loader yielding windows in original temporal order. Setting shuffle=True makes ‘last cycle per engine’ pick a random window per engine; the NASA score becomes meaningless.
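The order dependence can be shown with a toy stream of (unit, cycle) pairs. The sketch assumes an evaluator that keeps whichever window arrives last for each engine, as the paragraph above describes; the function name is hypothetical, and a reversed stream stands in for one particular order that shuffle=True could produce.

```python
# (unit_id, cycle) pairs in original temporal order: 3 engines, 5 windows each.
windows = [(uid, cycle) for uid in (1, 2, 3) for cycle in range(1, 6)]


def last_seen_per_unit(stream):
    """Keep whichever window arrives last for each unit (order-dependent)."""
    last = {}
    for uid, cycle in stream:
        last[uid] = cycle
    return last


ordered = last_seen_per_unit(windows)
reordered = last_seen_per_unit(windows[::-1])  # one possible non-temporal order

print(ordered)    # {1: 5, 2: 5, 3: 5}
print(reordered)  # {3: 1, 2: 1, 1: 1}
```

In temporal order the evaluator scores each engine on its true final cycle; in any other order it scores an arbitrary window, so the per-unit score no longer measures end-of-life prediction.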
Pitfall 4: not creating output_dir before the trainer
Line 77's run_dir.mkdir(parents=True, exist_ok=True) runs BEFORE the trainer is constructed (line 80). The Checkpointer fails fast if the directory does not exist; doing the mkdir inside the trainer would couple two responsibilities.
Pitfall 5: skipping the params union for non-GABA methods
The line params = list(model.parameters()) + list(mtl_loss.parameters()) is harmless for GABA (the second list is empty). Removing it for ‘cleanliness’ silently breaks Uncertainty, GradNorm, and AMNL-v7 — their learnable parameters never get optimised.
Takeaway
Production training is run_single_experiment at grace/experiments/phase1_cmapss.py:37 — ~150 lines that compose every component from chapters 5–21 into one reproducible call.
Three phases: setup (60 lines, ~10 s), main loop (130 lines, ~30 min), persist (15 lines, ~1 s). Every row of the published 140-experiment table was produced by calling this function.
The fit() body has a strict cadence: warmup → adaptive WD → train_epoch → EMA-eval → scheduler → best track → early stop. Reordering breaks reproducibility.
The published GABA-FD002-seed-42 run lands its best at epoch 109 (RMSE = 7.42), and early-stops at epoch 189. The trajectory is visible in the timeline viz from real gradient_stats.csv data.
The 150-line skeleton is domain-agnostic. Replace dataset, backbone, MTL loss; everything else is unchanged. New MTL papers typically change one of those three slots and reuse the rest.