Boo-AI — Master Artificial Intelligence by Building from Scratch

Wine, Grapes, And Hyperparameters

Two bottles of red Burgundy can be made from the same Pinot Noir grapes harvested from neighbouring hillsides. What separates them is the recipe: yeast strain, fermentation temperature, time on lees, oak ageing. The grapes set the ceiling; the recipe sets the floor. Same logic in machine learning: the architecture sets the ceiling on what the model can represent, and the hyperparameters set the floor on what it actually does.

GRACE's architecture has been fixed since chapter 8 (CNN + BiLSTM + multi-head attention + FC). The published 7.72 RMSE on FD002 depends on every hyperparameter being at its production default. This section is the catalogue: every default, every value's sensitivity (from the paper's ablation), and the search protocol the authors used to pick them. Section 22.3 gives the reproducibility recipe, but it cannot reproduce anything if you don't know which knob to leave alone.

The headline. 18 hyperparameters across two categories. Architecture defaults are set in grace/models/model_configs.py; training defaults are set in grace/experiments/config.py. All are 5-seed validated. The published GABA controller has been measured to converge to the same $\lambda_{\text{health}}$ equilibrium across all four C-MAPSS subsets within ±0.008: robustness is built in.

Two Distinct Categories: Architecture vs. Training

Hyperparameters in any deep-learning system fall into two categories with very different sensitivity profiles:

Category	What it controls	Lives in	Tuning frequency
Architecture	Backbone width, # heads, FC dims, dropout, sequence length	model_configs.py:ModelConfig	Once per dataset family (e.g. C-MAPSS vs. N-CMAPSS)
Training	lr, batch size, optimiser, scheduler, EMA, warmup, early stopping	experiments/config.py:ExperimentConfig	Once per training-method change (GABA → GRACE)

The split matters because the search budgets are different. Architecture sweeps need full-data training (~30 min per trial × 5 seeds = 2.5 hours per point), so the paper sweeps only a handful of named configs (cmapss, ncmapss_20feat, ncmapss_32feat). Training sweeps run 4–5 candidates per axis by coordinate descent, totalling ~20 hours for all axes on FD002.

Architecture Defaults

The C-MAPSS production architecture, from grace/models/model_configs.py:"cmapss":

Field	Value	Role
input_size	17	Number of selected sensors after the paper's feature filter.
hidden_size	256	BiLSTM hidden size. GRACE used 256; GABA-only paper used 128.
cnn_channels	(64, 128, 64)	Three 1D-conv layers; bottleneck at 128 then narrow back.
num_lstm_layers	2	Two stacked BiLSTM layers; deeper saturates.
num_attn_heads	8	Multi-head self-attention. Removing attention costs +0.94 RMSE on FD002.
fc_dims	(256, 64, 32)	Three FC layers from the BiLSTM output (= 512) down to the heads.
dropout	0.3	Applied after BiLSTM and FC layers. Same for ALL datasets — no per-dataset tuning.
use_attention	True	Enable multi-head self-attention block. Disable only for CNN-only ablation.
use_residual	True	Residual connections inside the backbone.
num_health_states	3	3-class health labels: healthy / degrading / critical.

Two siblings exist for N-CMAPSS: ncmapss_20feat (input_size=20, hidden_size=256, 8 heads — matches the DKAMFormer protocol) and ncmapss_32feat (input_size=32, hidden_size=512, 16 heads — uses all sensors plus operating conditions for higher capacity).
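The three named configurations can be pictured as a small registry. The sketch below is an illustration only, not the contents of grace/models/model_configs.py: the dataclass layout, the `MODEL_CONFIGS` dictionary name and the `lstm_output_dim` helper are hypothetical, and the N-CMAPSS entries override only the fields stated above (every other field is assumed to keep its C-MAPSS value).

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical mirror of the architecture fields listed in the table."""
    input_size: int = 17
    hidden_size: int = 256
    cnn_channels: Tuple[int, ...] = (64, 128, 64)
    num_lstm_layers: int = 2
    num_attn_heads: int = 8
    fc_dims: Tuple[int, ...] = (256, 64, 32)
    dropout: float = 0.3
    use_attention: bool = True
    use_residual: bool = True
    num_health_states: int = 3

    @property
    def lstm_output_dim(self) -> int:
        # A bidirectional LSTM concatenates forward and backward states.
        return 2 * self.hidden_size


cmapss = ModelConfig()
MODEL_CONFIGS = {
    "cmapss": cmapss,
    # Assumption: the siblings differ only in the fields named in the text.
    "ncmapss_20feat": replace(cmapss, input_size=20),
    "ncmapss_32feat": replace(cmapss, input_size=32, hidden_size=512,
                              num_attn_heads=16),
}

for name, cfg in MODEL_CONFIGS.items():
    print(name, cfg.input_size, cfg.lstm_output_dim, cfg.num_attn_heads)
```

The `lstm_output_dim` property makes the "(= 512)" in the fc_dims row explicit: it is twice hidden_size, so the 32-feature sibling feeds 1024 features into its FC stack.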

Training Defaults

Production training hyperparameters, from grace/experiments/config.py:ExperimentConfig:

Hyperparameter	Value	Why this value
epochs	500	Loose cap; early stopping fires at 130-230 in practice.
batch_size	256	GPU-fits; ablation shows 0.06 RMSE penalty for 128, 0.13 for 64.
lr	1e-3	AdamW base. Warmup ramps from 1e-4. Broad minimum on FD002.
patience (early stop)	80	No-improvement budget before stopping + restoring best weights.
grad_clip max_norm	1.0	Global L2 threshold. Defends BiLSTM gradient spikes.
use_ema	True	EMA shadow weights at evaluation. ~0.3 RMSE benefit on FD002.
ema_decay	0.999	1000-step memory. Sharp minimum on this axis.
warmup_epochs	10	Linear LR ramp 1e-4 → 1e-3 over first 10 epochs.
scheduler_factor	0.5	Halve LR on plateau.
scheduler_patience	30	30 stagnant epochs before halving.
min_lr (scheduler)	5e-6	Floor. Below this the optimiser is essentially frozen.
weight_decay	1e-4	AdamW decoupled. Halved at epoch 100, /10 at 200 (GRACE schedule).
sequence_length	30	Sliding-window length. Longer windows didn't help in the sweep.
use_weighted_mse	True	Failure-biased w(y) on RUL task. GRACE-only switch.
adaptive_weight_decay	True	GRACE-only. Marginal but consistent improvement.
initial_weight_decay	1e-4	Pre-schedule wd value.
seeds	[42, 123, 456, 789, 1024]	5 seeds for reported means.
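Several rows of the table describe schedules rather than constants. A minimal sketch of the warmup ramp, the GRACE weight-decay steps and the EMA memory rule of thumb follows. The function names are hypothetical, and the boundary behaviour (whether a change takes effect at the start of epoch 100 or after it, and on which epoch the ramp first reaches 1e-3) is an assumption, since the table gives only the endpoints.

```python
def warmup_lr(epoch: int, base_lr: float = 1e-3, warmup_epochs: int = 10,
              start_factor: float = 0.1) -> float:
    """Linear ramp from start_factor * base_lr up to base_lr."""
    if epoch >= warmup_epochs:
        return base_lr
    frac = epoch / warmup_epochs          # 0.0 at epoch 0, below 1.0 during warmup
    return base_lr * (start_factor + (1.0 - start_factor) * frac)


def scheduled_weight_decay(epoch: int, initial_wd: float = 1e-4) -> float:
    """Step schedule 1e-4 -> 5e-5 -> 1e-5 (halved at 100, divided by 10 at 200)."""
    if epoch >= 200:
        return initial_wd / 10
    if epoch >= 100:
        return initial_wd / 2
    return initial_wd


def ema_memory_steps(decay: float) -> float:
    """Rule-of-thumb averaging horizon of an EMA: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


for epoch in (0, 5, 10, 99, 100, 200):
    print(epoch, warmup_lr(epoch), scheduled_weight_decay(epoch))
print(ema_memory_steps(0.999), ema_memory_steps(0.99))
```

Writing the schedules as pure functions of the epoch keeps them easy to unit-test and to log alongside each run.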

GABA-Specific Defaults And Robustness

The OUTER controller adds three of its own knobs:

Hyperparameter	Value	Role
GABA β (EMA smoothing)	0.99	100-step memory for the per-task weights. Different from the model EMA (0.999) — the controller needs a faster response.
min_weight (floor)	0.05	5% per task — prevents the over-gradient task from being completely silenced when the EMA converges.
warmup_steps	100	First 100 mini-batches use uniform 1/K weights. Stops the controller from reacting before EMA has a meaningful estimate.
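How the three knobs interact can be sketched without committing to GABA's actual weighting rule. In the illustration below the raw per-task weights are simply passed in. The function name `smooth_task_weights`, the order of operations (EMA, then floor, then renormalise) and the renormalisation itself are assumptions made for this sketch, not the paper's implementation.

```python
from typing import List, Tuple


def smooth_task_weights(ema: List[float], raw: List[float], step: int,
                        beta: float = 0.99, min_weight: float = 0.05,
                        warmup_steps: int = 100) -> Tuple[List[float], List[float]]:
    """Return (new_ema, weights_to_use) for one mini-batch."""
    k = len(raw)
    # The EMA is updated on every step so it is meaningful once warmup ends.
    new_ema = [beta * e + (1.0 - beta) * r for e, r in zip(ema, raw)]
    if step < warmup_steps:
        return new_ema, [1.0 / k] * k     # uniform 1/K during warmup
    # Floor, then renormalise so the weights sum to 1. Renormalising can
    # leave the floored task slightly below min_weight.
    floored = [max(w, min_weight) for w in new_ema]
    total = sum(floored)
    return new_ema, [w / total for w in floored]


ema = [0.5, 0.5]
for step in range(300):
    # Hypothetical raw signal that strongly favours task 0.
    ema, weights = smooth_task_weights(ema, raw=[0.999, 0.001], step=step)
    if step in (0, 99, 100, 299):
        print(step, [round(w, 4) for w in weights])
```

With beta = 0.99 the smoothed weights need a few hundred steps to approach the raw signal, which is the "100-step memory" in the table; the floor only becomes active once the smaller weight has decayed below 0.05.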

The remarkable property of GABA documented in the paper's Fig. 6: the converged $\lambda_{\text{health}}$ sits in $[0.985,\ 1.001]$ across all four C-MAPSS subsets and all five seeds (n=20 runs total). Standard deviation is $\sigma \approx 0.003$. This is the OUTER axis being self-calibrating: regardless of which dataset you put in, GABA finds the same equilibrium because the gradient ratio is structurally similar (450–650× across all subsets; see chapter 21 §3, Pareto explorer).

Why this robustness matters. A typical adaptive controller (DWA, GradNorm, Uncertainty) varies its converged weights by >0.1 across these same four subsets. GABA's ±0.008 spread means you don't need to retune β, ε_floor, or warmup_steps when moving between datasets. The same configuration handles FD001 (the easy single-condition subset) and FD004 (the hardest multi-condition multi-fault subset).

Interactive: Hyperparameter Impact On RMSE

Click any row to see the description and the GRACE recommendation. The bars show RMSE on FD002 (left column) and FD004 (right column) relative to the V7 baseline. Green = better than baseline, red = worse. Numbers come from the paper's data_analysis/comprehensive_analysis.py ablation block.


What stands out. Removing the health task (Single-Task) is catastrophic (+260% RMSE) — the auxiliary is doing real regularisation work. Removing attention costs ~13%. Switching from WMSE to plain MSE swaps the dataset winners (FD002 worsens, FD004 improves) — a useful diagnostic when GRACE underperforms on a new dataset. Most other choices are within 1–5% of baseline; the architecture and the choice of multi-task learning are what matter most.

How The Defaults Were Found: A Search Protocol

The paper authors did not run a full 18-dimensional grid search — that would be 3¹⁸ ≈ 387 million trials at 30 minutes each. Instead the protocol was sequential, prioritised by sensitivity:

Step 1: literature defaults. AdamW lr=1e-3, β=(0.9, 0.999), weight_decay=1e-4, batch_size=256. These match transformer-era literature. No tuning yet.
Step 2: architecture sweep. Five named configs (full / lstm_only / transformer / mlp / cnn_only) with GABA on FD001+FD002, 5 seeds each. Locks in CNN + BiLSTM + Attention as the backbone (chapter 27).
Step 3: GABA controller sweep. β ∈ {0.9, 0.99, 0.999}, ε_floor ∈ {0.01, 0.05, 0.10}, warmup_steps ∈ {0, 100, 500}. 27-trial grid on FD002 only (single-dataset is enough — chapter 22 §2 robustness). Locks in (0.99, 0.05, 100).
Step 4: training-cadence coordinate descent. One axis at a time: warmup_epochs, scheduler_patience, ema_decay, grad_clip. Single-axis sweeps because the surfaces are approximately separable. Total ~20 trials.
Step 5: weight-decay schedule (GRACE-only). Three options compared: fixed 1e-4, halve-at-100, schedule (1e-4 → 5e-5 → 1e-5). The schedule won by 0.05 RMSE on FD002; adopted as default.
Step 6: 5-seed cross-validation. Final configuration run with 5 seeds on all four C-MAPSS subsets + N-CMAPSS DS02. Reported numbers are 5-seed means.

Total compute: ~150 trials × 30 min = 75 hours of GPU time. A full grid search would have needed hundreds of millions of trials; coordinate descent and judgement reduce the budget by roughly six orders of magnitude.
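The budget arithmetic is easy to check. The short script below takes the figures quoted in this section (three candidates on each of 18 axes, roughly 150 sequential trials, 30 minutes per trial) and counts single-seed trials only; the variable names are illustrative.

```python
import math

MINUTES_PER_TRIAL = 30

grid_trials = 3 ** 18                 # three candidates on each of 18 axes
sequential_trials = 150               # approximate total quoted for the protocol

grid_hours = grid_trials * MINUTES_PER_TRIAL / 60
sequential_hours = sequential_trials * MINUTES_PER_TRIAL / 60
ratio = grid_trials / sequential_trials

print(f"grid:       {grid_trials:,} trials = {grid_hours:,.0f} GPU-hours")
print(f"sequential: {sequential_trials} trials = {sequential_hours:.0f} GPU-hours")
print(f"reduction:  {ratio:,.0f}x (about 10^{math.log10(ratio):.1f})")
```

The ratio comes out at a few million, that is, between six and seven orders of magnitude.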

Python: Coordinate-Descent Hyperparameter Search

A self-contained NumPy reproduction of the search protocol on a toy RMSE landscape that mirrors the paper's actual landscape on FD002. Grid search uses 27 trials; coordinate descent finds the same optimum with 9. Read the code line by line to see how the dict-spread + min-with-key idioms encode coordinate descent in three lines.

Grid vs. coordinate descent on a toy landscape

🐍grace_hp_search_demo.py


1docstring

Names the contract: a comparison of grid search and coordinate descent on a synthetic RMSE landscape. This is the strategy the paper authors used to choose lr, batch_size, and EMA beta — sequential, one-axis-at-a-time, far cheaper than full grid.

8import numpy as np

Numerical-array library. Used for log10, log2, and inf.

EXECUTION STATE

📚 numpy = ndarray + math helpers. Standard alias np.

11def toy_rmse(lr, bs, beta):

Synthetic landscape with a global minimum at (lr=1e-3, bs=256, beta=0.999). Mirrors the paper's actual landscape on FD002: lr is forgiving (broad), bs has a mild slope, beta has a sharp minimum.

EXECUTION STATE

⬇ input: lr = Learning rate in {1e-4, 1e-3, 1e-2}.

⬇ input: bs = Batch size in {64, 128, 256}.

⬇ input: beta = EMA decay in {0.99, 0.999, 0.9999}.

⬆ returns = Float — synthetic RMSE for that (lr, bs, beta) triple.

19lr_term = 5.0 * ((np.log10(lr) + 3) / 2) ** 2

Parabolic penalty around log10(lr) = -3. lr=1e-3 is the minimum; lr=1e-4 and lr=1e-2 are equally penalised.

EXECUTION STATE

📚 np.log10(x) = Base-10 logarithm. log10(1e-3) = -3.0.

lr=1e-3 term = 5.0 * ((-3 + 3)/2)² = 0

lr=1e-4 term = 5.0 * ((-4 + 3)/2)² = 5.0 * 0.25 = 1.25

lr=1e-2 term = 5.0 * ((-2 + 3)/2)² = 5.0 * 0.25 = 1.25

20bs_term = 0.5 * ((np.log2(bs) - np.log2(256)) / 1.0) ** 2

Mild parabolic penalty in log-batch-size space. Minimum at bs=256.

EXECUTION STATE

📚 np.log2(x) = Base-2 logarithm. log2(256)=8.

bs=256 term = 0.5 * (8 - 8)² = 0

bs=128 term = 0.5 * (7 - 8)² = 0.5

bs=64 term = 0.5 * (6 - 8)² = 2.0

21beta_term = 2.0 * ((beta - 0.999) / 0.001) ** 2

Sharp parabolic penalty around beta=0.999. The (beta - 0.999)/0.001 scaling puts beta=0.99 nine units from the optimum and beta=0.9999 only 0.9 units away.

EXECUTION STATE

beta=0.999 term = 2.0 * (0)² = 0

beta=0.99 term = 2.0 * ((-0.009)/0.001)² = 2.0 * 81 = 162.0 — large penalty.

beta=0.9999 term = 2.0 * (0.0009/0.001)² = 2.0 * 0.81 = 1.62

→ asymmetry = 0.9999 is far less penalised than 0.99 because the (beta-0.999)/0.001 scaling puts 0.99 nine step-units away but 0.9999 only 0.9. This mirrors the paper's observation: high-beta is forgiving, low-beta is dangerous.

22return 8.0 + lr_term + bs_term + beta_term

Floor of 8.0 RMSE plus the three additive penalties. Minimum total ≈ 8.0 at (1e-3, 256, 0.999).

EXECUTION STATE

⬆ return: toy_rmse = Float in [8.0, ~173] on this grid (worst case 8 + 1.25 + 2.0 + 162 = 173.25). The reference good config sits at 8.0.

26LRS = [1e-4, 1e-3, 1e-2]

Three-point lr grid spanning two orders of magnitude. The paper used five (3e-4, 5e-4, 1e-3, 2e-3, 5e-3); we keep the toy at 3 for compactness.

27BSS = [64, 128, 256]

Three-point batch-size grid. Doubling steps — typical hardware-aligned choice.

28BETAS = [0.99, 0.999, 0.9999]

Three-point EMA-beta grid. Each step changes 1 − beta by a factor of 10 (0.01, 0.001, 0.0001) around the standard 0.999.

32def grid_search():

Exhaustive: try every (lr, bs, beta) triple. 3³ = 27 trials. The reference benchmark.

EXECUTION STATE

⬆ returns = Tuple (best_config, best_rmse, n_trials).

33best, best_r = None, np.inf

Initialise the running best. np.inf so any real RMSE wins on first comparison.

EXECUTION STATE

📚 np.inf = NumPy infinity constant. Useful as an initial best-so-far value.

34n_trials = 0

Counter for how many configurations we evaluated.

35for lr in LRS:

Outer loop over lr.

LOOP TRACE · 3 iterations

lr=1e-4

→ explores 9 (bs, beta) combinations = All bs and beta values for this lr.

lr=1e-3

→ explores 9 (bs, beta) combinations = Includes the global optimum (1e-3, 256, 0.999) → RMSE 8.0.

lr=1e-2

→ explores 9 (bs, beta) combinations = Same 9 combinations again, all worse than the lr=1e-3 group.

36for bs in BSS:

Middle loop over bs.

37for beta in BETAS:

Inner loop over beta. Triple-nested for product space.

38r = toy_rmse(lr, bs, beta)

Evaluate this configuration. In production this would be a 30-min training run; here it is a closed-form evaluation.

39n_trials += 1

Increment counter.

40if r < best_r:

Compare to running best.

41best, best_r = (lr, bs, beta), r

Update both the config tuple and the score.

42return best, best_r, n_trials

Triple of (best_config, best_rmse, n_trials).

EXECUTION STATE

⬆ best = (1e-3, 256, 0.999)

⬆ best_r = 8.0

⬆ n_trials = 27

46def coordinate_descent(lr0=1e-3, bs0=128, beta0=0.99):

Cheaper alternative: sweep one axis at a time, freezing the others. 3+3+3=9 trials. Works when the surface is approximately separable — which the paper's ablations show it is for these three axes.

EXECUTION STATE

⬇ input: lr0 = Starting lr. Defaults to 1e-3 (the literature default).

⬇ input: bs0 = Starting bs. 128 — middle of the range.

⬇ input: beta0 = Starting beta. 0.99, the low end of the range and the worst value on this landscape.

⬆ returns = Tuple (best_config, best_rmse, n_trials).

47cur = {'lr': lr0, 'bs': bs0, 'beta': beta0}

Mutable dict carrying the current best. Each axis sweep updates one key.

48n_trials = 0

Trial counter for the comparison.

49for axis, candidates in [('lr', LRS), ('bs', BSS), ('beta', BETAS)]:

Outer loop over the three axes. On a separable surface such as this toy the sweep order does not change the final answer; on coupled surfaces it can, which is why the protocol above prioritises axes by sensitivity.

LOOP TRACE · 3 iterations

axis='lr', candidates=[1e-4, 1e-3, 1e-2]

scores = [(1e-4, 171.75), (1e-3, 170.50), (1e-2, 171.75)]

best per lr = lr=1e-3 wins. cur['lr'] = 1e-3.

axis='bs', candidates=[64, 128, 256]

scores = [(64, 172.0), (128, 170.5), (256, 170.0)]

best per bs = bs=256 wins. cur['bs'] = 256.

axis='beta', candidates=[0.99, 0.999, 0.9999]

scores = [(0.99, 170.0), (0.999, 8.0), (0.9999, 9.62)]

best per beta = beta=0.999 wins. cur['beta'] = 0.999.

→ reading = The beta sweep is what brings the score from 170 down to 8.0. EMA beta is the ‘sharp’ axis on this landscape.

51scores = [(c, toy_rmse(**{**cur, axis: c})) for c in candidates]

List comprehension. {**cur, axis: c} unpacks the current config and overrides ONE axis with c. ** unpacks for kwargs into toy_rmse.

EXECUTION STATE

📚 dict spread {**a, k: v} = Python 3.5+: copy `a`, then set key `k` to value `v`. Used here to vary just the current axis.

📚 ** unpack = Unpacks dict into keyword arguments. toy_rmse(**{'lr': 1e-3, 'bs': 256, 'beta': 0.999}) = toy_rmse(lr=1e-3, bs=256, beta=0.999).

→ result type = List of (candidate_value, score) tuples.

52n_trials += len(scores)

Bump the counter by the number of candidates on this axis (3 each).

53cur[axis] = min(scores, key=lambda t: t[1])[0]

Pick the candidate with the lowest score. min(..., key=lambda t: t[1]) compares the tuples by score (second element) without sorting the list; [0] extracts the candidate value.

EXECUTION STATE

📚 min(iter, key) = Python builtin. Returns the element minimising the key function.

📚 lambda t: t[1] = Anonymous function: take the second element of t. Used to compare tuples by their second component.

54best = (cur['lr'], cur['bs'], cur['beta'])

Pack the final config tuple in the canonical (lr, bs, beta) order.

55return best, toy_rmse(*best), n_trials

Re-evaluate at the final config and return.

EXECUTION STATE

📚 *best = Positional unpack: toy_rmse(*[1e-3, 256, 0.999]) = toy_rmse(1e-3, 256, 0.999).

⬆ best = (1e-3, 256, 0.999)

⬆ rmse = 8.0

⬆ n_trials = 9

59g_best, g_rmse, g_n = grid_search()

Run grid search.

EXECUTION STATE

(g_best, g_rmse, g_n) = ((1e-3, 256, 0.999), 8.0, 27)

60c_best, c_rmse, c_n = coordinate_descent()

Run coordinate descent.

EXECUTION STATE

(c_best, c_rmse, c_n) = ((1e-3, 256, 0.999), 8.0, 9)

62print(f"Grid search: {g_n} trials → RMSE {g_rmse:.4f} at {g_best}")

Report grid search.

EXECUTION STATE

Output = Grid search: 27 trials → RMSE 8.0000 at (0.001, 256, 0.999)

63print(f"Coordinate descent: {c_n} trials → RMSE {c_rmse:.4f} at {c_best}")

Report coordinate descent.

EXECUTION STATE

Output = Coordinate descent: 9 trials → RMSE 8.0000 at (0.001, 256, 0.999)

64print(f"Speedup: {g_n / c_n:.1f}× — same optimum, less compute")

27 / 9 = 3.0× speedup. On a 30-min-per-trial production budget, the difference is 13.5 hours vs 4.5 hours of GPU time per dataset.

EXECUTION STATE

Final output =

Grid search:        27 trials → RMSE 8.0000 at (0.001, 256, 0.999)
Coordinate descent: 9 trials → RMSE 8.0000 at (0.001, 256, 0.999)
Speedup:            3.0× — same optimum, less compute

→ caveat = Coordinate descent finds the global optimum HERE because the surface is separable. On non-separable landscapes (interactions between lr and bs), it can stall at a saddle and grid search wins. Always validate the assumption with at least one off-axis check.


"""Hyperparameter search by coordinate descent: what the paper did.

Compares grid search (27 trials) with coordinate descent (9 trials).
Both find the same optimum on a toy RMSE surface that mimics the
landscape the GRACE authors saw on FD002.
"""

import numpy as np


def toy_rmse(lr, bs, beta):
    """Synthetic RMSE landscape with a sharp minimum at (1e-3, 256, 0.999).

    Mirrors the paper's observation: lr is fairly forgiving (broad
    minimum), batch size has a mild slope, and EMA beta has a sharp
    minimum at 0.999; outside that band the EMA either tracks too
    slowly (0.9999) or too fast (0.99) for the late-training noise.
    """
    lr_term   = 5.0 * ((np.log10(lr) + 3) / 2) ** 2
    bs_term   = 0.5 * ((np.log2(bs) - np.log2(256)) / 1.0) ** 2
    beta_term = 2.0 * ((beta - 0.999) / 0.001) ** 2
    return 8.0 + lr_term + bs_term + beta_term


# ---------- Search space ----------
LRS   = [1e-4, 1e-3, 1e-2]
BSS   = [64, 128, 256]
BETAS = [0.99, 0.999, 0.9999]


# ---------- Grid search (27 trials) ----------
def grid_search():
    best, best_r = None, np.inf
    n_trials = 0
    for lr in LRS:
        for bs in BSS:
            for beta in BETAS:
                r = toy_rmse(lr, bs, beta)
                n_trials += 1
                if r < best_r:
                    best, best_r = (lr, bs, beta), r
    return best, best_r, n_trials


# ---------- Coordinate descent (9 trials) ----------
def coordinate_descent(lr0=1e-3, bs0=128, beta0=0.99):
    cur = {"lr": lr0, "bs": bs0, "beta": beta0}
    n_trials = 0
    for axis, candidates in [("lr", LRS), ("bs", BSS), ("beta", BETAS)]:
        # Sweep ONLY this axis; freeze the others
        scores = [(c, toy_rmse(**{**cur, axis: c})) for c in candidates]
        n_trials += len(scores)
        cur[axis] = min(scores, key=lambda t: t[1])[0]
    best = (cur["lr"], cur["bs"], cur["beta"])
    return best, toy_rmse(*best), n_trials


# ---------- Run both ----------
g_best, g_rmse, g_n = grid_search()
c_best, c_rmse, c_n = coordinate_descent()

print(f"Grid search:        {g_n} trials → RMSE {g_rmse:.4f} at {g_best}")
print(f"Coordinate descent: {c_n} trials → RMSE {c_rmse:.4f} at {c_best}")
print(f"Speedup:            {g_n / c_n:.1f}× — same optimum, less compute")

Coordinate descent is not always optimal. It works when the surface is approximately separable along the chosen axes — which the paper's ablations show is true for GRACE's training hyperparameters. On non-separable landscapes (notable in joint architecture+lr searches) it can stall at saddles. Always validate with at least one off-axis spot check before committing to the result.
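To make the caveat concrete, here is a deliberately coupled toy surface on which single-axis sweeps fail. The surface `coupled_loss`, its 3×3 grid and the helper names are invented for illustration; they do not come from the paper. The off-axis check simply evaluates the diagonal neighbours that no single-axis sweep ever visits.

```python
from itertools import product

GRID = [0, 1, 2]


def coupled_loss(a: int, b: int) -> float:
    """Toy surface with a strong a-b interaction; global minimum at (2, 2)."""
    return 10.0 * (a - b) ** 2 + (a - 2) ** 2 + (b - 2) ** 2


def one_pass_coordinate_descent(a: int = 0, b: int = 0):
    a = min(GRID, key=lambda c: coupled_loss(c, b))   # sweep a, b frozen
    b = min(GRID, key=lambda c: coupled_loss(a, c))   # sweep b, a frozen
    return (a, b), coupled_loss(a, b)


def off_axis_check(point):
    """Return the best diagonal neighbour of point (or point if none exist)."""
    a, b = point
    diagonals = [(a + da, b + db) for da, db in product((-1, 1), repeat=2)
                 if a + da in GRID and b + db in GRID]
    return min(diagonals, key=lambda p: coupled_loss(*p), default=point)


cd_point, cd_loss = one_pass_coordinate_descent()
grid_point = min(product(GRID, GRID), key=lambda p: coupled_loss(*p))
better = off_axis_check(cd_point)

print("coordinate descent:", cd_point, cd_loss)
print("grid search:       ", grid_point, coupled_loss(*grid_point))
print("off-axis neighbour:", better, coupled_loss(*better))
```

Starting from (0, 0), every single-axis move increases the loss, so coordinate descent stays at (0, 0) with loss 8.0 while the grid finds (2, 2) with loss 0.0. The diagonal neighbour (1, 1) scores 2.0, which is the signal that the axes are coupled and a joint search is needed.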

PyTorch: Hyperparameter Scoping In ExperimentConfig

Production search uses the paper's ExperimentConfig dataclass + a generic sweep_axis() helper. Every result is a (value, mean_rmse) tuple recoverable from disk. Reproducibility is built in — the dataclass serialises to JSON and back, so a sweep can be paused, audited, and resumed.

Production hyperparameter sweep with paper helpers

🐍grace_hp_sweep.py


1docstring

Names the contract: a reproducible hyperparameter scope. Every result is a config dict plus an RMSE — easy to re-run, easy to compare.

10from dataclasses import dataclass, field, replace

Python's dataclass machinery. dataclass adds __init__, __repr__ automatically; replace returns a NEW instance with the given fields changed (a shallow copy; the original is left untouched).

EXECUTION STATE

📚 @dataclass = Decorator that generates __init__ from class-level type-annotated fields.

📚 field(default_factory=...) = Used for mutable defaults (e.g. list). default_factory is called per instance to avoid sharing state.

📚 replace(obj, **changes) = Returns a copy of obj with the specified fields overridden. Like dict.copy() with selective override.

11from typing import List

Type hint for List[int] on the seeds field.

12from grace.experiments.phase1_cmapss import run_single_experiment

Paper's production training entry point. Takes an ExperimentConfig + seed, returns a results dict with best_rmse and full metrics.

EXECUTION STATE

📚 run_single_experiment(cfg, seed) = Source: grace/experiments/phase1_cmapss.py:37. Builds dataset, model, GABA, optimiser; calls trainer.fit; returns serialisable result.

15@dataclass

Decorator. Auto-generates __init__(self, dataset='FD002', mtl_method='gaba', epochs=500, ...). Without it we'd need to write the constructor by hand.

16class ExperimentConfig:

The single source of truth for every training hyperparameter. Mirrors grace/experiments/config.py:ExperimentConfig — every field below has the exact paper default.

17docstring

Records the source: this dataclass is the paper's production config. Keeping it as a dataclass (not a dict) gets type-checking and IDE autocomplete for free.

18dataset: str = "FD002"

Dataset name. One of FD001, FD002, FD003, FD004, ncmapss_ds02.

EXECUTION STATE

→ triggers = FD002 / FD004 also automatically enable per_condition_norm=True (the dataset class detects this).

19mtl_method: str = "gaba"

MTL loss name from grace/core/loss_registry.py. To run a baseline, change to ‘fixed_050’, ‘dwa’, ‘gradnorm’, etc. Same training loop, different controller.

22epochs: int = 500

Maximum epoch budget. Early stopping (patience=80) usually fires at 130-230 in practice.

23batch_size: int = 256

Per-batch sample count. 256 fits in 24GB VRAM with the BiLSTM backbone; doubling halves throughput. The ablation in §22.2 shows bs=128 costs about 0.06 RMSE (under 1%), small enough that hardware constraints can override.

24lr: float = 1e-3

AdamW base learning rate. Warmup ramps from 1e-4. ReduceLROnPlateau halves on plateau down to min_lr. The 1e-3 default is robust across all four C-MAPSS subsets.

25patience: int = 80

Early stopping patience. 80 epochs without rmse_last improvement triggers stop + restore best weights.

26grad_clip: float = 1.0

Global L2-norm threshold for torch.nn.utils.clip_grad_norm_. 1.0 protects against the BiLSTM's occasional gradient spikes on long sequences without throttling normal updates.

27use_ema: bool = True

Enable EMA shadow weights at evaluation time. Disabling costs ~0.3 RMSE on FD002.

28ema_decay: float = 0.999

Effective memory of 1000 steps. The ablation viz above shows this axis is sharp — smaller decays (0.99) lose the smoothing benefit; larger (0.9999) lag the live weights too much late in training.

29warmup_epochs: int = 10

Linear LR ramp from 0.1·lr to lr over the first 10 epochs. Critical for GRACE because the GABA controller is uniform-weighted for its first 100 STEPS and the model is in a noisy regime.

30scheduler_factor: float = 0.5

ReduceLROnPlateau halves the LR when triggered. Standard.

31scheduler_patience: int = 30

ReduceLROnPlateau patience. 30 epochs of stagnation before halving. Different from EarlyStopping's 80.

32min_lr: float = 5e-6

Floor for the scheduler. Below this, the optimiser is essentially frozen.

33weight_decay: float = 1e-4

AdamW decoupled weight decay. Halved at epoch 100 and divided by 10 at epoch 200 if adaptive_weight_decay=True.

36sequence_length: int = 30

Sliding-window length. 30 cycles is long enough to capture short-term degradation trends; longer windows did not help in the paper's sweep (data analysis comprehensive_analysis.py).

37per_condition_norm: bool = False

Per-operating-condition feature normalisation. Set automatically to True for FD002/FD004 (multi-condition) by the dataset class. Manual override here for ncmapss.

40use_weighted_mse: bool = True

Failure-biased w(y) = 1 + clip(1 - y/125, 0, 1) on the RUL task. GRACE-specific. Set to False to switch to plain MSE (= GABA).

41adaptive_weight_decay: bool = True

GRACE-only addition. Halves wd at epoch 100, /10 at epoch 200. Hurts only marginally on AMNL but improves GRACE by roughly 0.1 RMSE.

42initial_weight_decay: float = 1e-4

Starting wd value before the schedule kicks in.

45seeds: List[int] = field(default_factory=lambda: [42, 123, 456, 789, 1024])

Five seeds for variance estimation. field(default_factory=...) is required for mutable defaults — sharing a single list across instances would corrupt state.

EXECUTION STATE

📚 lambda: [42, ...] = A nullary function returning a fresh list. default_factory calls it once per dataclass instance.

→ published numbers = Every result in the paper is a 5-seed mean. Single-seed numbers are not reportable.

49def sweep_axis(base_cfg: ExperimentConfig, axis: str, candidates: list):

Generic coordinate-descent helper. One axis at a time, one config at a time.

EXECUTION STATE

⬇ input: base_cfg = Production defaults; the unchanged template.

⬇ input: axis = Field name to vary, e.g. 'lr' or 'ema_decay'.

⬇ input: candidates = List of values for that axis.

⬆ returns = List[(value, mean_rmse)] — one entry per candidate, averaged over the seeds in base_cfg.seeds.

50docstring

Records the function's contract. Coupled with type hints, this is enough for IDEs to surface the function's purpose at the call site.

60results = []

Accumulator for (value, mean_rmse) tuples.

61for value in candidates:

Outer loop over axis values.

LOOP TRACE · 5 iterations

value=3e-4 (lr)

cfg.lr = 3e-4 (overridden); all other fields = base_cfg defaults.

seed_rmses = 5-element list (one per seed).

value=5e-4

cfg.lr = 5e-4.

value=1e-3 (default)

cfg.lr = 1e-3 — same as base_cfg, expected best.

value=2e-3

cfg.lr = 2e-3.

value=5e-3

cfg.lr = 5e-3 — likely diverges or plateaus high.

62cfg = replace(base_cfg, **{axis: value})

Build a new config that differs from base_cfg in exactly ONE field. replace() is the dataclass equivalent of dict copy-with-update.

EXECUTION STATE

📚 replace(base, **kwargs) = Returns a copy. Original base_cfg is untouched.

📚 **{axis: value} = Dict spread converts the axis name string into a kwarg. Equivalent to replace(base, lr=value) when axis='lr'.

63seed_rmses = []

Per-seed scratch list. Mean computed at the end of the inner loop.

64for seed in cfg.seeds:

Inner loop over seeds (5 by default).

LOOP TRACE · 5 iterations

seed=42

r['best_rmse'] = Float. Per-seed result for this config.

seed=123

r['best_rmse'] = Float. Variance estimator builds across seeds.

seed=456

r['best_rmse'] = Float.

seed=789

r['best_rmse'] = Float.

seed=1024

r['best_rmse'] = Float. Average across all 5 yields a stable estimate.

65r = run_single_experiment(cfg, seed)

Full 500-epoch training run. About 30 minutes per call on a single GPU. The reason the paper authors used coordinate descent rather than grid: 27 trials × 5 seeds × 30 min = 67.5 hours; 9 trials × 5 seeds × 30 min = 22.5 hours.

66seed_rmses.append(r['best_rmse'])

Collect the per-seed best RMSE.

67mean_rmse = sum(seed_rmses) / len(seed_rmses)

Across-seed average. The paper reports this scalar; the per-seed list goes into the std error.

68results.append((value, mean_rmse))

One entry per candidate.

69return results

Return the list. Caller plots a curve, picks the minimum, or feeds the next coordinate-descent step.

EXECUTION STATE

⬆ return = List[(value, mean_rmse)]. Length = len(candidates).

73base = ExperimentConfig(dataset='FD002')

Production defaults, FD002 picked. Override only what differs from the dataclass defaults.

75lr_results = sweep_axis(base, 'lr', [3e-4, 5e-4, 1e-3, 2e-3, 5e-3])

Sweep the learning rate. Five points centred on the default 1e-3.

EXECUTION STATE

lr_results (illustrative) = [(3e-4, 8.2), (5e-4, 7.9), (1e-3, 7.72), (2e-3, 8.1), (5e-3, 9.6)]

→ reading = 1e-3 wins, both adjacent points within 0.4 RMSE — a broad minimum, exactly what the paper's sensitivity analysis shows.

76ema_results = sweep_axis(base, 'ema_decay', [0.99, 0.995, 0.999, 0.9995])

Sweep EMA decay. Four points around the default 0.999.

EXECUTION STATE

ema_results (illustrative) = [(0.99, 8.0), (0.995, 7.85), (0.999, 7.72), (0.9995, 7.78)]

→ reading = Tighter minimum than lr — 0.999 is sharply best. Consistent with the toy landscape from the NumPy demo.

77bs_results = sweep_axis(base, 'batch_size', [64, 128, 256, 512])

Sweep batch size. Four points doubling from 64.

EXECUTION STATE

bs_results (illustrative) = [(64, 7.95), (128, 7.78), (256, 7.72), (512, 7.85)]

→ reading = Mild slope. 256 wins, but 128 is within 0.06. Hardware constraints can override.

79print("LR    sweep:", lr_results)

Final report of the LR sweep.

80print("EMA   sweep:", ema_results)

Final report of the EMA sweep.

81print("BS    sweep:", bs_results)

Final report of the BS sweep.

EXECUTION STATE

Final output (illustrative) =

LR    sweep: [(0.0003, 8.2), (0.0005, 7.9), (0.001, 7.72), (0.002, 8.1), (0.005, 9.6)]
EMA   sweep: [(0.99, 8.0), (0.995, 7.85), (0.999, 7.72), (0.9995, 7.78)]
BS    sweep: [(64, 7.95), (128, 7.78), (256, 7.72), (512, 7.85)]


"""Production hyperparameter scope: ExperimentConfig + sweep loop.

The paper centralises every default in a single dataclass at
grace/experiments/config.py. Sweeping a hyperparameter means
overriding ONE field and rerunning. This pattern keeps the search
reproducible: every result is a (config_dict, rmse) tuple that can
be re-run from disk.
"""

from dataclasses import dataclass, field, replace
from typing import List
from grace.experiments.phase1_cmapss import run_single_experiment


@dataclass
class ExperimentConfig:
    """Mirror of grace/experiments/config.py with the production defaults."""
    dataset:       str   = "FD002"
    mtl_method:    str   = "gaba"

    # Training schedule
    epochs:                int   = 500
    batch_size:            int   = 256
    lr:                    float = 1e-3
    patience:              int   = 80
    grad_clip:             float = 1.0
    use_ema:               bool  = True
    ema_decay:             float = 0.999
    warmup_epochs:         int   = 10
    scheduler_factor:      float = 0.5
    scheduler_patience:    int   = 30
    min_lr:                float = 5e-6
    weight_decay:          float = 1e-4

    # Data
    sequence_length:       int   = 30
    per_condition_norm:    bool  = False

    # GRACE-specific
    use_weighted_mse:        bool  = True
    adaptive_weight_decay:   bool  = True
    initial_weight_decay:    float = 1e-4

    # Reproducibility
    seeds: List[int] = field(default_factory=lambda: [42, 123, 456, 789, 1024])


# ---------- Coordinate-descent sweep helper ----------
def sweep_axis(base_cfg: ExperimentConfig, axis: str, candidates: list):
    """Run the same training pipeline once per candidate value on one axis.

    Args:
        base_cfg:    Production-default ExperimentConfig.
        axis:        Field name to override (e.g. 'lr', 'ema_decay').
        candidates:  Values to try for that field.

    Returns:
        List of (value, mean_rmse) tuples, averaged over base_cfg.seeds.
    """
    results = []
    for value in candidates:
        cfg = replace(base_cfg, **{axis: value})
        seed_rmses = []
        for seed in cfg.seeds:
            r = run_single_experiment(cfg, seed)
            seed_rmses.append(r["best_rmse"])
        mean_rmse = sum(seed_rmses) / len(seed_rmses)
        results.append((value, mean_rmse))
    return results


# ---------- Run the three production sweeps ----------
base = ExperimentConfig(dataset="FD002")

lr_results        = sweep_axis(base, "lr",         [3e-4, 5e-4, 1e-3, 2e-3, 5e-3])
ema_results       = sweep_axis(base, "ema_decay",  [0.99,  0.995, 0.999, 0.9995])
bs_results        = sweep_axis(base, "batch_size", [64, 128, 256, 512])

print("LR    sweep:", lr_results)
print("EMA   sweep:", ema_results)
print("BS    sweep:", bs_results)

Why dataclass + replace. Mutating a dict in-place (cfg['lr'] = 1e-4) propagates the change to every reference of the dict and silently breaks result accounting. dataclasses.replace returns a new instance, leaving the original untouched. Pure-functional sweeping is the safest pattern for multi-day jobs that may be interrupted and resumed.
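The difference between the two patterns can be seen in a few lines. The sketch below uses a hypothetical two-field TinyConfig rather than the real ExperimentConfig, and marks it frozen=True (an addition not present in the listing above) so that accidental mutation raises an error. The JSON record format is likewise an assumption about how a (config_dict, rmse) tuple might be written to disk.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TinyConfig:
    """Hypothetical stand-in for ExperimentConfig with two fields."""
    lr: float = 1e-3
    batch_size: int = 256


# Dict pattern: both names point at the same object, so the base changes too.
base_dict = {"lr": 1e-3, "batch_size": 256}
trial_dict = base_dict            # an alias, not a copy
trial_dict["lr"] = 1e-4
dict_base_changed = base_dict["lr"] == 1e-4   # True: the base was overwritten

# Dataclass pattern: replace() builds a new instance and leaves base alone.
base = TinyConfig()
trial = replace(base, lr=1e-4)

# One JSON line per trial. The rmse here is a placeholder, not a result.
record = json.dumps({"config": asdict(trial), "rmse": None})

# A resumed job can rebuild exactly the same config from the stored record.
restored = TinyConfig(**json.loads(record)["config"])
```

Because every trial's config is an independent value, an interrupted sweep can skip any candidate whose record already exists and re-run only the missing ones.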

Adapting These Defaults To Other Domains

Domain	Architecture changes	Training changes
Self-driving perception	Replace CNN-BiLSTM with FPN-Transformer; sequence_length=1 (single image).	lr=2e-4 (transformer schedule), batch_size=32 (larger images), warmup_epochs=5 (shorter pretrain).
Speech recognition (ASR)	Conformer encoder; sequence_length variable (10s windows).	lr=1e-3 OK, but use Noam schedule (linear-warmup + 1/√t decay). EMA still beneficial.
Medical imaging segmentation	U-Net or DeepLab backbone; sequence_length=1, image_size=512.	lr=1e-4 (smaller datasets, higher overfitting risk), grad_clip=5.0 (looser, stable in U-Nets).
Time-series forecasting	TimesFM/Chronos backbone; sequence_length=336 (~14 days).	lr=1e-3, batch_size=64-128 (long sequences), warmup=2000 steps.
Tabular classification	TabNet or shallow MLP; sequence_length=N/A.	lr=2e-3 (smaller param count tolerates higher lr), no warmup, batch_size=4096.

The pattern is consistent: training cadence values (warmup, scheduler, EMA) generalise within ±2× without retuning. Architecture-specific values (sequence_length, hidden_size, n_heads) need a per-domain sweep because they interact with the data's temporal/spatial structure. The lr and batch_size values sit in between: transferable as orders of magnitude but not as exact values.
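The Noam schedule named in the speech-recognition row is a good example of a training-cadence change that needs no search. A minimal sketch follows, written in the peak-lr form (linear warmup to peak_lr, then 1/√t decay). The function name noam_lr and the default warmup_steps=4000 are illustrative assumptions, not values from the GRACE code base.

```python
import math


def noam_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 4000) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard against division by zero at step 0
    warmup_factor = step / warmup_steps             # rises linearly to 1.0
    decay_factor = math.sqrt(warmup_steps / step)   # falls as 1/sqrt(step)
    return peak_lr * min(warmup_factor, decay_factor)


# The two branches meet at step == warmup_steps, where the rate peaks.
lr_at_half_warmup = noam_lr(2000)    # half of peak_lr
lr_at_peak = noam_lr(4000)           # peak_lr
lr_at_4x_warmup = noam_lr(16000)     # half of peak_lr again
```

Only peak_lr and warmup_steps need choosing, and both fall under the "orders of magnitude" transfer rule described above.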

Pitfalls When Tuning Hyperparameters

Pitfall 1: tuning on the test set

The C-MAPSS ‘test’ set has only 100–260 units. Sweeping multiple training hyperparameters and reporting the best test rmse_last leaks information. The paper avoids this by holding out a 10% validation slice from training (chapter 7 §3) for sweep decisions; test numbers are computed only at the very end on the locked configuration.
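One way to carve out such a validation slice is to split by engine unit rather than by window, so that no unit contributes windows to both sides. The sketch below assumes this unit-level split and a hypothetical helper name split_units. The paper's own split logic (chapter 7 §3) may differ in detail.

```python
import random


def split_units(unit_ids, val_fraction=0.1, seed=42):
    """Return (train_units, val_units), split by engine unit, not by window."""
    units = sorted(set(unit_ids))
    rng = random.Random(seed)   # local RNG, so global random state is untouched
    rng.shuffle(units)
    n_val = max(1, round(len(units) * val_fraction))
    return sorted(units[n_val:]), sorted(units[:n_val])


# 260 hypothetical unit IDs, numbered from 1.
train_units, val_units = split_units(range(1, 261))
```

Sweep decisions then read only the validation units; the test set stays untouched until the configuration is locked.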

Pitfall 2: insufficient seeds for a sweep decision

Single-seed differences of 0.3 RMSE on FD002 are within the seed-to-seed noise. Always average at least 3 seeds before choosing between two candidates; the paper used 5. A two-seed difference of 1.0 RMSE may still be noise — check the per-seed std (chapter 22 §3 reproducibility).
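A simple guard is to compare the gap between two candidates against the standard error implied by their per-seed spread. The sketch below is one such check, not the paper's procedure. The threshold k=2.0 and all RMSE lists are made-up illustrative inputs, not measured results.

```python
from statistics import mean, stdev


def clearly_better(a, b, k=2.0):
    """True if mean(a) is lower than mean(b) by more than k standard errors."""
    std_err = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return (mean(b) - mean(a)) > k * std_err


# Hypothetical per-seed RMSEs for three candidates (5 seeds each).
cand_a = [7.6, 7.9, 7.7, 7.8, 7.6]
cand_b = [7.9, 8.1, 7.7, 8.3, 7.5]   # slightly worse mean, but noisy
cand_c = [9.5, 9.6, 9.7, 9.6, 9.6]   # much worse mean, low noise

a_beats_b = clearly_better(cand_a, cand_b)   # False: gap is within the noise
a_beats_c = clearly_better(cand_a, cand_c)   # True: gap far exceeds the noise
```

When the check returns False, the honest conclusion is that the two candidates are indistinguishable at this seed count, and the production default should stay.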

Pitfall 3: optimising for RMSE only

AMNL (cell B from chapter 21 §1) wins on RMSE (6.74 on FD002) but loses on the NASA score (356). If the production goal is safety-aware prediction, optimising only for RMSE picks the wrong winner. Always run the search on the metric that matches the deployment objective, or jointly on a Pareto frontier.
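Selecting on a Pareto frontier means keeping every candidate that no other candidate beats on both metrics at once. A minimal sketch follows, assuming lower is better for both RMSE and NASA score. The candidate names and numbers are hypothetical, not the paper's results.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (rmse, nasa); lower is better for both."""
    front = []
    for name, rmse, nasa in candidates:
        # Dominated: some other candidate is no worse on both metrics
        # and strictly better on at least one of them.
        dominated = any(
            r <= rmse and n <= nasa and (r < rmse or n < nasa)
            for _, r, n in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, rmse, nasa_score) triples.
trials = [
    ("A", 7.0, 300.0),   # balanced
    ("B", 6.5, 360.0),   # best RMSE, worst NASA score
    ("C", 7.5, 350.0),   # beaten by A on both metrics
]
front = pareto_front(trials)   # ["A", "B"]
```

The frontier leaves the final choice between A and B to the deployment objective rather than to whichever metric happened to be logged.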

Pitfall 4: assuming separability across categories

Architecture and training are not separable when moving between datasets of very different scale (C-MAPSS ~4M windows vs. tabular toy datasets ~1k rows). Doubling hidden_size on a small dataset hurts unless lr and batch_size come down too. Always sweep architecture first; only after locking it should you tune training.

Pitfall 5: not recording the random seeds

Reproducing a result needs the seeds used for both the data split and the model init. The paper records both via set_seed(seed) and writes the seed into the output directory name. A ‘best result’ without its seed is not reproducible.
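The paper's own set_seed is not reproduced here. The sketch below is a hypothetical stand-in (set_seed_sketch) showing what such a helper typically covers, and the directory naming pattern in run_dir is an assumption. NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def set_seed_sketch(seed: int) -> None:
    """Seed the RNGs that are available; a stand-in, not the paper's set_seed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def run_dir(root: str, dataset: str, seed: int) -> str:
    """Hypothetical naming scheme that embeds the seed in the output path."""
    return os.path.join(root, f"{dataset}_seed{seed}")


set_seed_sketch(42)
first_draw = random.random()
set_seed_sketch(42)
second_draw = random.random()   # identical to first_draw
out_path = run_dir("runs", "FD002", 42)
```

Embedding the seed in the path means a result file can never be separated from the seed that produced it.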

Takeaway

GRACE has 18 hyperparameters split between architecture (10, in model_configs.py) and training (8+, in experiments/config.py) plus 3 GABA-specific knobs.
The published values are 5-seed validated; the GABA controller is empirically robust across all four C-MAPSS subsets (converged $\lambda_{\text{health}} = 0.99 \pm 0.008$).
The largest sensitivity directions are the loss formulation (single-task is +260% RMSE; no attention is +13%); most training-cadence hyperparameters move RMSE by <5%.
The search protocol is coordinate descent, not grid search: ~150 trials, ~75 GPU-hours total. A full grid would need roughly 387M trials.
Adapt to a new domain by sweeping architecture first, then training. Training-cadence values (warmup, scheduler, EMA) generalise within ±2× without retuning.
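The trial counts in the takeaway can be checked with a few lines of arithmetic. The sketch assumes 3 candidates per hyperparameter for the grid, an assumption chosen because 3 to the power 18 lands on the quoted 387M. The half-hour per trial is implied by ~75 GPU-hours over ~150 trials.

```python
N_HYPERPARAMS = 18           # from the takeaway above
CANDIDATES_PER_AXIS = 3      # assumption: 3**18 is about 387M
COORD_DESCENT_TRIALS = 150   # ~150 trials, from the takeaway
HOURS_PER_TRIAL = 0.5        # implied by ~75 GPU-hours / ~150 trials

# Grid search multiplies the candidates; coordinate descent adds them.
grid_trials = CANDIDATES_PER_AXIS ** N_HYPERPARAMS
coord_hours = COORD_DESCENT_TRIALS * HOURS_PER_TRIAL
grid_hours = grid_trials * HOURS_PER_TRIAL

print(grid_trials)   # 387420489
print(coord_hours)   # 75.0
```

At the same cost per trial, the full grid would take on the order of 190 million GPU-hours, which is why the search proceeds one axis at a time.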