Two bottles of red Burgundy can be made from the same Pinot Noir grapes harvested from neighbouring hillsides. What separates them is the recipe: yeast strain, fermentation temperature, time on lees, oak ageing. The grapes set the ceiling; the recipe sets the floor. Same logic in machine learning: the architecture sets the ceiling on what the model can represent, and the hyperparameters set the floor on what it actually does.
GRACE's architecture has been fixed since chapter 8 (CNN + BiLSTM + multi-head attention + FC). The published 7.72 RMSE on FD002 depends on every hyperparameter being at its production default. This section is the catalogue: every default, every value's sensitivity (from the paper's ablation), and the search protocol the authors used to pick them. Section 22.3 gives the reproducibility recipe, but it cannot reproduce anything if you don't know which knob to leave alone.
The headline. 18 hyperparameters across two categories. Architecture defaults set in grace/models/model_configs.py; training defaults set in grace/experiments/config.py. All are 5-seed validated. The published GABA controller has been measured to converge to the same λhealth equilibrium across all four C-MAPSS subsets within ±0.008 — robustness is built in.
Two Distinct Categories: Architecture vs. Training
Hyperparameters in any deep-learning system fall into two categories with very different sensitivity profiles:
Category | What it controls | Lives in | Tuning frequency
Architecture | Backbone width, number of heads, FC dims, dropout, sequence length | model_configs.py:ModelConfig | Once per dataset family (e.g. C-MAPSS vs. N-CMAPSS)
Training | lr, batch size, optimiser, scheduler, EMA, warmup, early stopping | experiments/config.py:ExperimentConfig | Once per training-method change (GABA → GRACE)
The split matters because the search budgets are different. Architecture sweeps need full-data training (~30 min per trial × 5 seeds = 2.5 hours per point), so the paper sweeps only a handful of named configs (cmapss, ncmapss_20feat, ncmapss_32feat). Training sweeps run 4–5 candidates per axis by coordinate descent, totalling ~20 hours for all axes on FD002.
Architecture Defaults
The C-MAPSS production architecture, from grace/models/model_configs.py:"cmapss":
Field | Value | Role
input_size | 17 | Number of selected sensors after the paper's feature filter.
hidden_size | 256 | BiLSTM hidden size. GRACE used 256; GABA-only paper used 128.
cnn_channels | (64, 128, 64) | Three 1D-conv layers; widen to 128, then narrow back to 64.
num_lstm_layers | 2 | Two stacked BiLSTM layers; deeper saturates.
num_attn_heads | 8 | Multi-head self-attention. Removing attention costs +0.94 RMSE on FD002.
fc_dims | (256, 64, 32) | Three FC layers from the BiLSTM output (= 512) down to the heads.
dropout | 0.3 | Applied after BiLSTM and FC layers. Same for ALL datasets; no per-dataset tuning.
use_attention | True | Enable multi-head self-attention block. Disable only for CNN-only ablation.
use_residual | True | Residual connections inside the backbone.
num_health_states | 3 | 3-class health labels: healthy / degrading / critical.
Two siblings exist for N-CMAPSS: ncmapss_20feat (input_size=20, hidden_size=256, 8 heads — matches the DKAMFormer protocol) and ncmapss_32feat (input_size=32, hidden_size=512, 16 heads — uses all sensors plus operating conditions for higher capacity).
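The table and its two siblings can be mirrored in a small registry. This is a minimal sketch, not the contents of grace/models/model_configs.py: the frozen dataclass layout, the MODEL_CONFIGS dict, and the lstm_output_size property are assumptions, as is the choice to let the N-CMAPSS siblings inherit every field the text does not mention.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical mirror of the architecture table (C-MAPSS defaults)."""
    input_size: int = 17
    hidden_size: int = 256
    cnn_channels: Tuple[int, ...] = (64, 128, 64)
    num_lstm_layers: int = 2
    num_attn_heads: int = 8
    fc_dims: Tuple[int, ...] = (256, 64, 32)
    dropout: float = 0.3
    use_attention: bool = True
    use_residual: bool = True
    num_health_states: int = 3

    @property
    def lstm_output_size(self) -> int:
        # A bidirectional LSTM concatenates forward and backward states.
        return 2 * self.hidden_size


CMAPSS = ModelConfig()

# Assumption: siblings override only the fields named in the text and
# inherit everything else from the C-MAPSS config.
MODEL_CONFIGS = {
    "cmapss": CMAPSS,
    "ncmapss_20feat": replace(CMAPSS, input_size=20),
    "ncmapss_32feat": replace(
        CMAPSS, input_size=32, hidden_size=512, num_attn_heads=16
    ),
}

for name, cfg in MODEL_CONFIGS.items():
    print(name, cfg.input_size, cfg.hidden_size,
          cfg.num_attn_heads, cfg.lstm_output_size)
```

Freezing the dataclass means a named config cannot be edited by accident; variants are always derived with replace().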
Training Defaults
Production training hyperparameters, from grace/experiments/config.py:ExperimentConfig:
Hyperparameter | Value | Why this value
epochs | 500 | Loose cap; early stopping fires at 130-230 in practice.
batch_size | 256 | GPU-fits; ablation shows 0.06 RMSE penalty for 128, 0.13 for 64.
lr | 1e-3 | AdamW base. Warmup ramps from 1e-4. Broad minimum on FD002.
patience (early stop) | 80 | No-improvement budget before stopping + restoring best weights.
grad_clip max_norm | 1.0 | Global L2 threshold. Defends against BiLSTM gradient spikes.
use_ema | True | EMA shadow weights at evaluation. ~0.3 RMSE benefit on FD002.
ema_decay | 0.999 | 1000-step memory. Sharp minimum on this axis.
warmup_epochs | 10 | Linear LR ramp 1e-4 → 1e-3 over first 10 epochs.
scheduler_factor | 0.5 | Halve LR on plateau.
scheduler_patience | 30 | 30 stagnant epochs before halving.
min_lr (scheduler) | 5e-6 | Floor. Below this the optimiser is essentially frozen.
weight_decay | 1e-4 | AdamW decoupled. Halved at epoch 100, /10 at 200 (GRACE schedule).
sequence_length | 30 | Sliding-window length. Longer windows didn't help in the sweep.
use_weighted_mse | True | Failure-biased w(y) on RUL task. GRACE-only switch.
adaptive_weight_decay | True | GRACE-only. Marginal but consistent improvement.
initial_weight_decay | 1e-4 | Pre-schedule wd value.
seeds | [42, 123, 456, 789, 1024] | 5 seeds for reported means.
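Two rows of the table describe schedules, not constants: the linear warmup (1e-4 → 1e-3 over 10 epochs) and the GRACE weight-decay schedule (1e-4, halved at epoch 100, divided by 10 at epoch 200). The sketch below writes both out as plain functions. The function names, zero-based epoch counting, and the exact epoch at which the ramp reaches the base lr are assumptions; the plateau scheduler that acts after warmup is not modelled.

```python
def warmup_lr(epoch: int, base_lr: float = 1e-3, warmup_epochs: int = 10,
              start_factor: float = 0.1) -> float:
    """Linear ramp from start_factor * base_lr up to base_lr."""
    if epoch >= warmup_epochs:
        return base_lr
    frac = epoch / warmup_epochs
    return base_lr * (start_factor + (1.0 - start_factor) * frac)


def scheduled_weight_decay(epoch: int, initial_wd: float = 1e-4) -> float:
    """Halve the initial value at epoch 100, divide it by 10 at epoch 200."""
    if epoch >= 200:
        return initial_wd / 10.0
    if epoch >= 100:
        return initial_wd / 2.0
    return initial_wd


for epoch in (0, 5, 10, 99, 100, 200):
    print(epoch,
          f"lr={warmup_lr(epoch):.2e}",
          f"wd={scheduled_weight_decay(epoch):.1e}")
```

Both steps of the weight-decay schedule are taken relative to initial_weight_decay, which matches the 1e-4 → 5e-5 → 1e-5 sequence quoted for the GRACE schedule.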
GABA-Specific Defaults And Robustness
The OUTER controller adds three of its own knobs:
Hyperparameter | Value | Role
GABA β (EMA smoothing) | 0.99 | 100-step memory for the per-task weights. Different from the model EMA (0.999): the controller needs a faster response.
min_weight (floor) | 0.05 | 5% per task; prevents the over-gradient task from being completely silenced when the EMA converges.
warmup_steps | 100 | First 100 mini-batches use uniform 1/K weights. Stops the controller from reacting before EMA has a meaningful estimate.
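The "100-step" and "1000-step" memories quoted for β = 0.99 and ema_decay = 0.999 follow from the usual rule of thumb that an exponential moving average with decay d averages over roughly 1 / (1 − d) updates. A minimal sketch; the helper name ema_memory_steps is illustrative and not part of the GRACE codebase.

```python
def ema_memory_steps(decay: float) -> float:
    """Approximate averaging horizon of an EMA: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


for name, decay in [("GABA beta", 0.99), ("model ema_decay", 0.999)]:
    print(f"{name}: decay={decay} -> ~{ema_memory_steps(decay):.0f} steps")
```

The tenfold gap between the two horizons is the "faster response" the controller row refers to.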
The remarkable property of GABA documented in the paper's Fig. 6: the converged λhealth sits in [0.985,1.001] across all four C-MAPSS subsets and all five seeds (n=20 runs total). Standard deviation is σ≈0.003. This is the OUTER axis being self-calibrating: regardless of which dataset you put in, GABA finds the same equilibrium because the gradient ratio is structurally similar (450–650× across all subsets — chapter 21 §3 Pareto explorer).
Why this robustness matters. A typical adaptive controller (DWA, GradNorm, Uncertainty) varies its converged weights by >0.1 across these same four subsets. GABA's ±0.008 spread means you don't need to retune β, ε_floor, or warmup_steps when moving between datasets. The same configuration handles FD001 (the easy single-condition subset) and FD004 (the hardest multi-condition multi-fault subset).
Interactive: Hyperparameter Impact On RMSE
[Interactive table: hyperparameter sensitivity.] The bars show RMSE on FD002 (left column) and FD004 (right column) relative to the V7 baseline. Green = better than baseline, red = worse. Numbers come from the paper's data_analysis/comprehensive_analysis.py ablation block.
What stands out. Removing the health task (Single-Task) is catastrophic (+260% RMSE) — the auxiliary is doing real regularisation work. Removing attention costs ~13%. Switching from WMSE to plain MSE swaps the dataset winners (FD002 worsens, FD004 improves) — a useful diagnostic when GRACE underperforms on a new dataset. Most other choices are within 1–5% of baseline; the architecture and the choice of multi-task learning are what matter most.
How The Defaults Were Found: A Search Protocol
The paper authors did not run a full 18-dimensional grid search: that would be 3^18 ≈ 387 million trials at 30 minutes each. Instead the protocol was sequential, prioritised by sensitivity:
Step 1: literature defaults. AdamW lr=1e-3, β=(0.9, 0.999), weight_decay=1e-4, batch_size=256. These match transformer-era literature. No tuning yet.
Step 2: architecture sweep. Five named configs (full / lstm_only / transformer / mlp / cnn_only) with GABA on FD001+FD002, 5 seeds each. Locks in CNN + BiLSTM + Attention as the backbone (chapter 27).
Step 3: GABA controller sweep. β ∈ {0.9, 0.99, 0.999}, ε_floor ∈ {0.01, 0.05, 0.10}, warmup_steps ∈ {0, 100, 500}. 27-trial grid on FD002 only (single-dataset is enough — chapter 22 §2 robustness). Locks in (0.99, 0.05, 100).
Step 4: training-cadence coordinate descent. One axis at a time: warmup_epochs, scheduler_patience, ema_decay, grad_clip. Single-axis sweeps because the surfaces are approximately separable. Total ~20 trials.
Step 5: weight-decay schedule (GRACE-only). Three options compared: fixed 1e-4, halve-at-100, schedule (1e-4 → 5e-5 → 1e-5). The schedule won by 0.05 RMSE on FD002; adopted as default.
Step 6: 5-seed cross-validation. Final configuration run with 5 seeds on all four C-MAPSS subsets + N-CMAPSS DS02. Reported numbers are 5-seed means.
Total compute: ~150 trials × 30 min = 75 hours of GPU time. A full grid search would have needed hundreds of millions of trials; coordinate descent and judgement reduce the budget by about six orders of magnitude.
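The budget arithmetic can be checked in a few lines. The sketch below uses only figures quoted in the protocol (18 axes, 30 minutes per trial, roughly 150 sequential trials) and assumes three candidates per axis for the hypothetical full grid; the variable names are illustrative.

```python
import math

MINUTES_PER_TRIAL = 30        # one full training run, as quoted in the text
grid_trials = 3 ** 18         # assumed: 3 candidates on each of 18 axes
sequential_trials = 150       # approximate total for the sequential protocol

grid_hours = grid_trials * MINUTES_PER_TRIAL / 60
sequential_hours = sequential_trials * MINUTES_PER_TRIAL / 60
orders = math.log10(grid_trials / sequential_trials)

print(f"grid:       {grid_trials:,} trials, {grid_hours:,.0f} GPU-hours")
print(f"sequential: {sequential_trials} trials, {sequential_hours:.0f} GPU-hours")
print(f"reduction:  about {orders:.1f} orders of magnitude")
```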
Python: Coordinate-Descent Hyperparameter Search
A self-contained NumPy reproduction of the search protocol on a toy RMSE landscape that mirrors the paper's actual landscape on FD002. Grid search uses 27 trials; coordinate descent finds the same optimum with 9. Read the code line by line to see how the dict-spread + min-with-key idioms encode coordinate descent in three lines.
Grid vs. coordinate descent on a toy landscape
grace_hp_search_demo.py: line-by-line notes

Module docstring
Names the contract: a comparison of grid search and coordinate descent on a synthetic RMSE landscape. This is the strategy the paper authors used to choose lr, batch_size, and EMA beta: sequential, one axis at a time, far cheaper than full grid.

import numpy as np
Numerical-array library with math helpers, under its standard alias np. Used for log10, log2, and inf.

def toy_rmse(lr, bs, beta):
Synthetic landscape with a global minimum at (lr=1e-3, bs=256, beta=0.999). Mirrors the paper's actual landscape on FD002: lr is forgiving (broad), bs has a mild slope, beta has a sharp minimum.
Input lr: learning rate in {1e-4, 1e-3, 1e-2}.
Input bs: batch size in {64, 128, 256}.
Input beta: EMA decay in {0.99, 0.999, 0.9999}.
Returns: float, the synthetic RMSE for that (lr, bs, beta) triple.

lr_term = 5.0 * ((np.log10(lr) + 3) / 2) ** 2
Parabolic penalty around log10(lr) = -3. lr=1e-3 is the minimum; lr=1e-4 and lr=1e-2 are equally penalised.

beta_term = 2.0 * ((beta - 0.999) / 0.001) ** 2
Asymmetry: 0.9999 is far less penalised than 0.99 because the (beta - 0.999) / 0.001 scaling puts 0.99 nine step-units away and 0.9999 less than one. This mirrors the paper's observation: high beta is forgiving, low beta is dangerous.

return 8.0 + lr_term + bs_term + beta_term
Floor of 8.0 RMSE plus the three additive penalties. Minimum total is 8.0 at (1e-3, 256, 0.999).
Return value: float in [8, ~173] on this grid. The reference good config sits at 8.0.

LRS = [1e-4, 1e-3, 1e-2]
Three-point lr grid spanning two orders of magnitude. The paper used five (3e-4, 5e-4, 1e-3, 2e-3, 5e-3); we keep the toy at 3 for compactness.

n_trials = 0
Trial counter for the comparison.

def coordinate_descent(lr0=1e-3, bs0=128, beta0=0.99):
Cheaper alternative: sweep one axis at a time, freezing the others. 3+3+3=9 trials. Works when the surface is approximately separable, which the paper's ablations show it is for these three axes.
Input lr0: starting lr. Defaults to 1e-3 (the literature default).
Input bs0: starting bs. 128, the middle of the range.
Input beta0: starting beta. 0.99, the low end of the range.

cur = {"lr": lr0, "bs": bs0, "beta": beta0}
Mutable dict carrying the current best. Each axis sweep updates one key.

for axis, candidates in [("lr", LRS), ("bs", BSS), ("beta", BETAS)]:
Outer loop over the three axes. ORDER MATTERS for coordinate descent: searching lr before beta typically converges faster because lr has the broadest minimum.

print(f"Speedup: {g_n / c_n:.1f}× — same optimum, less compute")
27 / 9 = 3.0× speedup. On a 30-min-per-trial production budget, the difference is 13.5 hours vs 4.5 hours of GPU time per dataset.
Final output:
Grid search: 27 trials → RMSE 8.0000 at (0.001, 256, 0.999)
Coordinate descent: 9 trials → RMSE 8.0000 at (0.001, 256, 0.999)
Speedup: 3.0× — same optimum, less compute
Caveat: coordinate descent finds the global optimum HERE because the surface is separable. On non-separable landscapes (interactions between lr and bs), it can stall at a saddle and grid search wins. Always validate the assumption with at least one off-axis check.
```python
"""Hyperparameter search by coordinate descent: what the paper did.

Compares grid search (27 trials) with coordinate descent (9 trials).
Both find the same optimum on a toy RMSE surface that mimics the
landscape the GRACE authors saw on FD002.
"""

import numpy as np


def toy_rmse(lr, bs, beta):
    """Synthetic RMSE landscape with a sharp minimum at (1e-3, 256, 0.999).

    Mirrors the paper's observation: lr is fairly forgiving (broad
    minimum), batch size has a mild slope, and EMA beta has a sharp
    minimum at 0.999; outside that band the EMA either tracks too
    slowly (0.9999) or too fast (0.99) for the late-training noise.
    """
    lr_term = 5.0 * ((np.log10(lr) + 3) / 2) ** 2
    bs_term = 0.5 * ((np.log2(bs) - np.log2(256)) / 1.0) ** 2
    beta_term = 2.0 * ((beta - 0.999) / 0.001) ** 2
    return 8.0 + lr_term + bs_term + beta_term


# ---------- Search space ----------
LRS = [1e-4, 1e-3, 1e-2]
BSS = [64, 128, 256]
BETAS = [0.99, 0.999, 0.9999]


# ---------- Grid search (27 trials) ----------
def grid_search():
    best, best_r = None, np.inf
    n_trials = 0
    for lr in LRS:
        for bs in BSS:
            for beta in BETAS:
                r = toy_rmse(lr, bs, beta)
                n_trials += 1
                if r < best_r:
                    best, best_r = (lr, bs, beta), r
    return best, best_r, n_trials


# ---------- Coordinate descent (9 trials) ----------
def coordinate_descent(lr0=1e-3, bs0=128, beta0=0.99):
    cur = {"lr": lr0, "bs": bs0, "beta": beta0}
    n_trials = 0
    for axis, candidates in [("lr", LRS), ("bs", BSS), ("beta", BETAS)]:
        # Sweep ONLY this axis; freeze the others
        scores = [(c, toy_rmse(**{**cur, axis: c})) for c in candidates]
        n_trials += len(scores)
        cur[axis] = min(scores, key=lambda t: t[1])[0]
    best = (cur["lr"], cur["bs"], cur["beta"])
    return best, toy_rmse(*best), n_trials


# ---------- Run both ----------
g_best, g_rmse, g_n = grid_search()
c_best, c_rmse, c_n = coordinate_descent()

print(f"Grid search: {g_n} trials → RMSE {g_rmse:.4f} at {g_best}")
print(f"Coordinate descent: {c_n} trials → RMSE {c_rmse:.4f} at {c_best}")
print(f"Speedup: {g_n / c_n:.1f}× — same optimum, less compute")
```
Coordinate descent is not always optimal. It works when the surface is approximately separable along the chosen axes — which the paper's ablations show is true for GRACE's training hyperparameters. On non-separable landscapes (notable in joint architecture+lr searches) it can stall at saddles. Always validate with at least one off-axis spot check before committing to the result.
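To make the separability caveat concrete, here is a toy surface with an interaction term on which coordinate descent stalls while a single off-axis check reveals a better point. Everything in it (coupled_loss, the 3×3 grid, the off_axis_check helper) is invented for illustration and does not come from the paper; the off-axis check shown is one simple way to probe diagonal neighbours, not a prescribed procedure.

```python
from itertools import product

GRID = [0, 1, 2]


def coupled_loss(x: int, y: int) -> float:
    """Toy surface with an interaction: the valley runs along x == y."""
    return (x - y) ** 2 + 0.1 * (x + y - 4) ** 2


def coordinate_descent(x: int, y: int):
    """One sweep per axis, keeping the other axis frozen."""
    x = min(GRID, key=lambda c: coupled_loss(c, y))
    y = min(GRID, key=lambda c: coupled_loss(x, c))
    return (x, y)


def off_axis_check(point):
    """Evaluate diagonal neighbours that single-axis sweeps never visit."""
    x, y = point
    diagonals = [(x + dx, y + dy)
                 for dx, dy in product((-1, 1), repeat=2)
                 if x + dx in GRID and y + dy in GRID]
    better = [p for p in diagonals if coupled_loss(*p) < coupled_loss(x, y)]
    return min(better, key=lambda p: coupled_loss(*p)) if better else None


cd_point = coordinate_descent(0, 0)
grid_point = min(product(GRID, GRID), key=lambda p: coupled_loss(*p))

print("coordinate descent:", cd_point, coupled_loss(*cd_point))
print("grid search:       ", grid_point, coupled_loss(*grid_point))
print("off-axis neighbour that beats it:", off_axis_check(cd_point))
```

Starting from (0, 0), moving either axis alone makes the loss worse, so the axis sweeps never leave the start point; only moving both together reaches the valley.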
PyTorch: Hyperparameter Scoping In ExperimentConfig
Production search uses the paper's ExperimentConfig dataclass + a generic sweep_axis() helper. Every result is a (value, mean_rmse) tuple recoverable from disk. Reproducibility is built in — the dataclass serialises to JSON and back, so a sweep can be paused, audited, and resumed.
Production hyperparameter sweep with paper helpers
grace_hp_sweep.py: line-by-line notes

Module docstring
Names the contract: a reproducible hyperparameter scope. Every result is a config dict plus an RMSE: easy to re-run, easy to compare.

from dataclasses import dataclass, field, replace
Python's dataclass machinery. dataclass adds __init__ and __repr__ automatically; replace returns a NEW instance with the given fields changed and leaves the original untouched.
@dataclass: decorator that generates __init__ from class-level type-annotated fields.
field(default_factory=...): used for mutable defaults (e.g. list). default_factory is called per instance to avoid sharing state.
replace(obj, **changes): returns a copy of obj with the specified fields overridden. Like dict.copy() with selective override.

@dataclass
Decorator. Auto-generates __init__(self, dataset='FD002', mtl_method='gaba', epochs=500, ...). Without it we'd need to write the constructor by hand.

class ExperimentConfig:
The single source of truth for every training hyperparameter. Mirrors grace/experiments/config.py:ExperimentConfig; every field below has the exact paper default.

ExperimentConfig docstring
Records the source: this dataclass is the paper's production config. Keeping it as a dataclass (not a dict) gets type-checking and IDE autocomplete for free.

dataset: str = "FD002"
Dataset name. One of FD001, FD002, FD003, FD004, ncmapss_ds02.
Side effect: FD002 / FD004 also automatically enable per_condition_norm=True (the dataset class detects this).

mtl_method: str = "gaba"
MTL loss name from grace/core/loss_registry.py. To run a baseline, change to 'fixed_050', 'dwa', 'gradnorm', etc. Same training loop, different controller.

epochs: int = 500
Maximum epoch budget. Early stopping (patience=80) usually fires at 130-230 in practice.

batch_size: int = 256
Per-batch sample count. 256 fits in 24GB VRAM with the BiLSTM backbone; doubling halves throughput. The ablation in §22.2 viz shows bs=128 costs ~0.5% RMSE, small enough that hardware constraints can override.

lr: float = 1e-3
AdamW base learning rate. Warmup ramps from 1e-4. ReduceLROnPlateau halves on plateau down to min_lr. The 1e-3 default is robust across all four C-MAPSS subsets.

patience: int = 80
Early stopping patience. 80 epochs without rmse_last improvement triggers stop + restore best weights.

grad_clip: float = 1.0
Global L2-norm threshold for torch.nn.utils.clip_grad_norm_. 1.0 protects against the BiLSTM's occasional gradient spikes on long sequences without throttling normal updates.

use_ema: bool = True
Enable EMA shadow weights at evaluation time. Disabling costs ~0.3 RMSE on FD002.

ema_decay: float = 0.999
Effective memory of 1000 steps. The ablation viz above shows this axis is sharp: smaller decays (0.99) lose the smoothing benefit; larger (0.9999) lag the live weights too much late in training.

warmup_epochs: int = 10
Linear LR ramp from 0.1·lr to lr over the first 10 epochs. Critical for GRACE because the GABA controller is uniform-weighted for its first 100 STEPS and the model is in a noisy regime.

scheduler_factor: float = 0.5
ReduceLROnPlateau halves the LR when triggered. Standard.

scheduler_patience: int = 30
ReduceLROnPlateau patience. 30 epochs of stagnation before halving. Different from EarlyStopping's 80.

min_lr: float = 5e-6
Floor for the scheduler. Below this, the optimiser is essentially frozen.

weight_decay: float = 1e-4
AdamW decoupled weight decay. Halved at epoch 100 and divided by 10 at epoch 200 if adaptive_weight_decay=True.

sequence_length: int = 30
Sliding-window length. 30 cycles is long enough to capture short-term degradation trends; longer windows did not help in the paper's sweep (data_analysis/comprehensive_analysis.py).

per_condition_norm: bool = False
Per-operating-condition feature normalisation. Set automatically to True for FD002/FD004 (multi-condition) by the dataset class. Manual override here for ncmapss.

use_weighted_mse: bool = True
Failure-biased w(y) = 1 + clip(1 - y/125, 0, 1) on the RUL task. GRACE-specific. Set to False to switch to plain MSE (= GABA).

adaptive_weight_decay: bool = True
GRACE-only addition. Halves wd at epoch 100, /10 at epoch 200. Hurts only marginally on AMNL but adds ~0.1 RMSE for GRACE.

seeds: List[int] = field(default_factory=lambda: [42, 123, 456, 789, 1024])
Five seeds for variance estimation. field(default_factory=...) is required for mutable defaults; sharing a single list across instances would corrupt state.
lambda: [42, ...]: a nullary function returning a fresh list. default_factory calls it once per dataclass instance.
Published numbers: every result in the paper is a 5-seed mean. Single-seed numbers are not reportable.

def sweep_axis(base_cfg: ExperimentConfig, axis: str, candidates: list):
Generic coordinate-descent helper. One axis at a time, one config at a time.
Input base_cfg: production defaults; the unchanged template.
Input axis: field name to vary, e.g. 'lr' or 'ema_decay'.
Input candidates: list of values for that axis.
Returns: List[(value, mean_rmse)], one entry per candidate, averaged over the seeds in base_cfg.seeds.

sweep_axis docstring
Records the function's contract. Coupled with type hints, this is enough for IDEs to surface the function's purpose at the call site.

results = []
Accumulator for (value, mean_rmse) tuples.

for value in candidates:
Outer loop over axis values.
Loop trace, 5 iterations (axis='lr'):
value=3e-4: cfg.lr = 3e-4 (overridden); all other fields = base_cfg defaults. seed_rmses = 5-element list (one per seed).
value=5e-4: cfg.lr = 5e-4.
value=1e-3 (default): cfg.lr = 1e-3, same as base_cfg, expected best.
value=2e-3: cfg.lr = 2e-3.
value=5e-3: cfg.lr = 5e-3, likely diverges or plateaus high.

cfg = replace(base_cfg, **{axis: value})
Build a new config that differs from base_cfg in exactly ONE field. replace() is the dataclass equivalent of dict copy-with-update.
replace(base, **kwargs): returns a copy. Original base_cfg is untouched.
**{axis: value}: dict spread converts the axis name string into a kwarg. Equivalent to replace(base, lr=value) when axis='lr'.

seed_rmses = []
Per-seed scratch list. Mean computed at the end of the inner loop.

for seed in cfg.seeds:
Inner loop over seeds (5 by default).
Loop trace, 5 iterations:
seed=42: r['best_rmse'] is a float, the per-seed result for this config.
seed=123: r['best_rmse'] is a float. Variance estimator builds across seeds.
seed=456: r['best_rmse'] is a float.
seed=789: r['best_rmse'] is a float.
seed=1024: r['best_rmse'] is a float. Average across all 5 yields a stable estimate.

r = run_single_experiment(cfg, seed)
Full 500-epoch training run. About 30 minutes per call on a single GPU. The reason the paper authors used coordinate descent rather than grid: 27 trials × 5 seeds × 30 min = 67.5 hours; 9 trials × 5 seeds × 30 min = 22.5 hours.

seed_rmses.append(r['best_rmse'])
Collect the per-seed best RMSE.

mean_rmse = sum(seed_rmses) / len(seed_rmses)
Across-seed average. The paper reports this scalar; the per-seed list goes into the std error.

results.append((value, mean_rmse))
One entry per candidate.

return results
Return the list. Caller plots a curve, picks the minimum, or feeds the next coordinate-descent step.
```python
"""Production hyperparameter scope: ExperimentConfig + sweep loop.

The paper centralises every default in a single dataclass at
grace/experiments/config.py. Sweeping a hyperparameter means
overriding ONE field and rerunning. This pattern keeps the search
reproducible: every result is a (config_dict, rmse) tuple that can
be re-run from disk.
"""

from dataclasses import dataclass, field, replace
from typing import List

from grace.experiments.phase1_cmapss import run_single_experiment


@dataclass
class ExperimentConfig:
    """Mirror of grace/experiments/config.py with the production defaults."""
    dataset: str = "FD002"
    mtl_method: str = "gaba"

    # Training schedule
    epochs: int = 500
    batch_size: int = 256
    lr: float = 1e-3
    patience: int = 80
    grad_clip: float = 1.0
    use_ema: bool = True
    ema_decay: float = 0.999
    warmup_epochs: int = 10
    scheduler_factor: float = 0.5
    scheduler_patience: int = 30
    min_lr: float = 5e-6
    weight_decay: float = 1e-4

    # Data
    sequence_length: int = 30
    per_condition_norm: bool = False

    # GRACE-specific
    use_weighted_mse: bool = True
    adaptive_weight_decay: bool = True
    initial_weight_decay: float = 1e-4

    # Reproducibility
    seeds: List[int] = field(default_factory=lambda: [42, 123, 456, 789, 1024])


# ---------- Coordinate-descent sweep helper ----------
def sweep_axis(base_cfg: ExperimentConfig, axis: str, candidates: list):
    """Run the same training pipeline once per candidate value on one axis.

    Args:
        base_cfg: Production-default ExperimentConfig.
        axis: Field name to override (e.g. 'lr', 'ema_decay').
        candidates: Values to try for that field.

    Returns:
        List of (value, mean_rmse) tuples, averaged over base_cfg.seeds.
    """
    results = []
    for value in candidates:
        cfg = replace(base_cfg, **{axis: value})
        seed_rmses = []
        for seed in cfg.seeds:
            r = run_single_experiment(cfg, seed)
            seed_rmses.append(r["best_rmse"])
        mean_rmse = sum(seed_rmses) / len(seed_rmses)
        results.append((value, mean_rmse))
    return results


# ---------- Run the three production sweeps ----------
base = ExperimentConfig(dataset="FD002")

lr_results = sweep_axis(base, "lr", [3e-4, 5e-4, 1e-3, 2e-3, 5e-3])
ema_results = sweep_axis(base, "ema_decay", [0.99, 0.995, 0.999, 0.9995])
bs_results = sweep_axis(base, "batch_size", [64, 128, 256, 512])

print("LR sweep:", lr_results)
print("EMA sweep:", ema_results)
print("BS sweep:", bs_results)
```
Why dataclass + replace. Mutating a dict in-place (cfg['lr'] = 1e-4) propagates the change to every reference of the dict and silently breaks result accounting. dataclasses.replace returns a new instance, leaving the original untouched. Pure-functional sweeping is the safest pattern for multi-day jobs that may be interrupted and resumed.
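The copy-on-override behaviour and the JSON round trip can be seen in isolation with the standard library alone. The sketch below uses a cut-down stand-in, MiniConfig, which is hypothetical and carries only four of the real fields; how the paper's own code writes configs to disk is not shown in this section, so the asdict + json.dumps route is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from typing import List


@dataclass
class MiniConfig:
    """Hypothetical subset of ExperimentConfig, enough to show the pattern."""
    dataset: str = "FD002"
    lr: float = 1e-3
    ema_decay: float = 0.999
    seeds: List[int] = field(default_factory=lambda: [42, 123, 456, 789, 1024])


base = MiniConfig()
trial = replace(base, lr=3e-4)      # new instance; base keeps lr=1e-3

# Serialise the exact config behind a result, then rebuild it from text.
payload = json.dumps(asdict(trial), sort_keys=True)
restored = MiniConfig(**json.loads(payload))

print("base lr: ", base.lr)
print("trial lr:", trial.lr)
print("round-trip equal:", restored == trial)

# Caveat: replace() copies shallowly, so the seeds list is shared.
print("seeds shared:", trial.seeds is base.seeds)
```

Because replace() is shallow, appending to trial.seeds would also change base.seeds; treat mutable fields as read-only or pass a fresh list when overriding them.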
Adapting These Defaults To Other Domains
Domain | Architecture changes | Training changes
Self-driving perception | Replace CNN-BiLSTM with FPN-Transformer; sequence_length=1 (single image). | lr=2e-3 (smaller param count tolerates higher lr), no warmup, batch_size=4096.
The pattern is consistent: training cadence values (warmup, scheduler, EMA) generalise within ±2× without retuning. Architecture-specific values (sequence_length, hidden_size, n_heads) need a per-domain sweep because they interact with the data's temporal/spatial structure. Lr and batch_size sit in between — transferable as orders of magnitude but not as exact values.
Pitfalls When Tuning Hyperparameters
Pitfall 1: tuning on the test set
The C-MAPSS ‘test’ set has only 100–260 units. Sweeping multiple training hyperparameters and reporting the best test rmse_last leaks information. The paper avoids this by holding out a 10% validation slice from training (chapter 7 §3) for sweep decisions; test numbers are computed only at the very end on the locked configuration.
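A held-out slice of this kind can be carved off with a seeded shuffle. The sketch below is one way to do it under stated assumptions: it splits by engine unit (the text does not say whether the paper splits by unit or by window), the helper name split_units is invented, and the 260 unit ids are placeholders.

```python
import random


def split_units(unit_ids, val_fraction=0.1, seed=42):
    """Hold out a fraction of training units for sweep decisions."""
    ids = sorted(unit_ids)
    rng = random.Random(seed)   # local RNG; global random state is untouched
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])


# Placeholder ids: 260 hypothetical training units.
train_units, val_units = split_units(range(1, 261))
print(len(train_units), "train units,", len(val_units), "validation units")
```

Splitting by unit keeps all windows from one engine on the same side of the split, so validation RMSE is not inflated by near-duplicate windows.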
Pitfall 2: insufficient seeds for a sweep decision
Single-seed differences of 0.3 RMSE on FD002 are within the seed-to-seed noise. Always average at least 3 seeds before choosing between two candidates; the paper used 5. A two-seed difference of 1.0 RMSE may still be noise — check the per-seed std (chapter 22 §3 reproducibility).
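One way to apply this check is to compare the gap between two candidates' means with the standard error of that gap. The sketch below does so with the standard library; the function name, the factor-of-two threshold, and both RMSE lists are illustrative assumptions, not results or procedures from the paper.

```python
from statistics import mean, stdev


def compare_candidates(rmse_a, rmse_b):
    """Return the mean gap and the standard error of that gap."""
    gap = mean(rmse_a) - mean(rmse_b)
    # Standard error of the difference of two independent means.
    noise = (stdev(rmse_a) ** 2 / len(rmse_a)
             + stdev(rmse_b) ** 2 / len(rmse_b)) ** 0.5
    return gap, noise


# Illustrative per-seed RMSEs only; NOT numbers from the paper.
candidate_a = [7.9, 7.5, 8.1, 7.6, 7.8]
candidate_b = [7.7, 7.9, 7.4, 8.0, 7.6]

gap, noise = compare_candidates(candidate_a, candidate_b)
# Rule of thumb (an assumption): require the gap to exceed two standard errors.
verdict = "distinguishable" if abs(gap) > 2 * noise else "within seed noise"
print(f"gap={gap:+.3f} RMSE, noise scale={noise:.3f} -> {verdict}")
```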
Pitfall 3: optimising for RMSE only
AMNL (cell B from chapter 21 §1) wins on RMSE (6.74 on FD002) but loses on the NASA score (356). If the production goal is safety-aware prediction, optimising only on RMSE picks the wrong winner. Always run the search on the metric that matches the deployment objective, or jointly on a Pareto frontier.
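Selecting on a Pareto frontier means keeping every configuration that no other configuration beats on both metrics at once. A minimal sketch, assuming lower is better for both RMSE and NASA score; only the AMNL row uses figures from the text, and the other two rows are placeholders invented for the example.

```python
def pareto_front(candidates):
    """Keep configs that no other config beats on both metrics (lower is better)."""
    front = []
    for name, rmse, nasa in candidates:
        dominated = any(
            r <= rmse and n <= nasa and (r < rmse or n < nasa)
            for _, r, n in candidates
        )
        if not dominated:
            front.append(name)
    return front


candidates = [
    ("AMNL", 6.74, 356.0),       # figures quoted in the text
    ("config_x", 7.5, 250.0),    # hypothetical placeholder
    ("config_y", 8.0, 400.0),    # hypothetical placeholder
]
print(pareto_front(candidates))
```

The frontier narrows the field without picking a winner; the final choice between the surviving configs still depends on the deployment objective.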
Pitfall 4: assuming separability across categories
Architecture and training are not separable when moving between datasets with very different scale (C-MAPSS ~4M windows vs. tabular toy datasets ~1k rows). Doubling hidden_size on a small dataset hurts because the lr and batch_size have to come down too. Always sweep architecture first; only after locking it should you tune training.
Pitfall 5: not recording the random seeds
Reproducing a result needs the seeds used for both the data split and the model init. The paper records both via set_seed(seed) and writes the seed into the output directory name. A ‘best result’ without its seed is not reproducible.
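The bookkeeping side of this can be sketched with the standard library. The set_seed below is a stdlib-only stand-in for the paper's helper of the same name, whose implementation is not shown here, and the run_dir naming scheme is hypothetical.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Stdlib-only stand-in for the paper's set_seed(seed)."""
    random.seed(seed)
    # A real training setup would also seed numpy and torch here.


def run_dir(root: str, dataset: str, method: str, seed: int) -> str:
    """Hypothetical naming scheme that embeds the seed in the directory name."""
    return os.path.join(root, f"{dataset}_{method}_seed{seed}")


for seed in [42, 123, 456, 789, 1024]:
    set_seed(seed)
    first_draw = random.random()   # identical on every rerun with this seed
    print(run_dir("runs", "FD002", "gaba", seed), first_draw)
```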
Takeaway
GRACE has 18 hyperparameters split between architecture (10, in model_configs.py) and training (8+, in experiments/config.py) plus 3 GABA-specific knobs.
The published values are 5-seed validated; the GABA controller is empirically robust across all four C-MAPSS subsets (converged λhealth=0.99±0.008).
The largest sensitivity directions are the multi-task formulation and the architecture (single-task is +260% RMSE; no attention is +13%); most training-cadence hyperparameters move RMSE by <5%.
The search protocol is coordinate descent, not grid: ~150 trials, ~75 GPU-hours total. Grid search would need 387M.
Adapt to a new domain by sweeping architecture first, then training. Training-cadence values (warmup, scheduler, EMA) generalise within ±2× without retuning.