Tailored vs Off-the-Rack
A bespoke suit fits one body better than any off-the-rack option - but it costs five times as much, takes four weeks to deliver, and only fits THAT body. AMNL's legacy V7 pipeline started bespoke (per-dataset dropout) and the GRACE refactor moved off-the-rack (uniform 0.3). The penalty was about half a cycle of RMSE on two of four C-MAPSS subsets - within seed variance. The savings: one fewer hyperparameter and a config that transfers to NEW datasets without retuning.
Legacy V7 uses {FD001: 0.3, FD002: 0.2, FD003: 0.3, FD004: 0.2}; GRACE picks 0.3 for everything. Both come from paper_ieee_tii/. The book recommends GRACE's uniform default for any new pipeline - but understanding the legacy choice tells you when to tune your own.
Two Regimes, Two Optima
Legacy V7's per-dataset choice maps directly to dataset size and condition variability:
| Subset | # train engines | Conditions | Trajectories | Best dropout p |
|---|---|---|---|---|
| FD001 | 100 | 1 op cond × 1 fault mode | 100 | 0.30 |
| FD002 | 260 | 6 op conds × 1 fault mode | 260 | 0.20 |
| FD003 | 100 | 1 op cond × 2 fault modes | 100 | 0.30 |
| FD004 | 248 | 6 op conds × 2 fault modes | 248 | 0.20 |
Why Dropout Behaves This Way
Dropout zeros each activation independently with probability $p$ during training and rescales the rest by $1/(1-p)$ to preserve expectations. Effective network capacity scales as $(1-p)^L$ for an $L$-layer stack. With $L$ counting just the dropped layers (CNN + BiLSTM + Attention + FC funnel), capacity at p=0.2 is $0.8^L$ and at p=0.3 is $0.7^L$. Almost a 2× difference in effective capacity - which is why the choice matters more on smaller datasets.
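The mechanics above can be checked numerically. The following is a minimal pure-Python sketch, not the book's implementation: `inverted_dropout` and `capacity` are hypothetical helper names, and the $(1-p)^L$ capacity figure is the heuristic from the text rather than a measured quantity.

```python
import random


def inverted_dropout(xs, p, training=True, rng=None):
    """Zero each value with probability p; rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(xs)  # eval mode: dropout is the identity
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # rng.random() >= p is true with probability 1 - p, so that unit is kept.
    return [x * keep_scale if rng.random() >= p else 0.0 for x in xs]


def capacity(p, num_layers):
    """Heuristic effective capacity (1 - p)^L from the text."""
    return (1.0 - p) ** num_layers


# Expectation check: the mean of many dropped-and-rescaled ones stays near 1.
sample = inverted_dropout([1.0] * 100_000, p=0.3, rng=random.Random(0))
print(f"mean after dropout at p=0.3: {sum(sample) / len(sample):.3f}")

# Capacity at p=0.2 vs p=0.3 for a range of dropped-layer counts L.
for L in range(1, 7):
    ratio = capacity(0.2, L) / capacity(0.3, L)
    print(f"L={L}: 0.8^L={capacity(0.2, L):.3f}  "
          f"0.7^L={capacity(0.3, L):.3f}  ratio={ratio:.2f}")
```

The ratio is $(8/7)^L$, so it grows with depth: about 1.71 at four dropped layers, about 1.95 at five, and about 2.23 at six.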
Interactive: RMSE vs Dropout
Drag p from 0.05 (under-regularised) to 0.50 (over-regularised). The U-shape per dataset shows the bias-variance trade-off. Each curve's minimum is marked; toggle datasets in the legend to focus.
Try this. Toggle off FD001 and FD003 (the single-condition subsets). The remaining curves (FD002, FD004, Average) bottom out at p=0.20. Toggle them back on: the minima of FD001 and FD003 sit at p=0.30. The average curve compromises around p=0.20, but the difference at p=0.30 is small enough that GRACE's simpler choice is defensible.
Python: Two Configurations
Both policies as standalone functions. The smoke test prints a side-by-side comparison and computes the average / max RMSE penalty for using uniform 0.3 vs per-dataset tuning.
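A minimal sketch of what such a pair of policy functions could look like. The dropout values are the ones quoted in this section; the function names (`dropout_legacy_v7`, `dropout_grace`, `rmse_penalty`) are assumptions for illustration, and no RMSE numbers are hard-coded because the penalty has to come from your own runs.

```python
from statistics import mean

SUBSETS = ("FD001", "FD002", "FD003", "FD004")
LEGACY_V7_DROPOUT = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}
GRACE_DROPOUT = 0.3


def dropout_legacy_v7(dataset_name):
    """Per-dataset policy: look up the tuned rate for a known subset."""
    try:
        return LEGACY_V7_DROPOUT[dataset_name]
    except KeyError:
        raise ValueError(f"no tuned dropout for {dataset_name!r}") from None


def dropout_grace(dataset_name):
    """Uniform policy: the same rate for every dataset, known or new."""
    return GRACE_DROPOUT


def rmse_penalty(rmse_uniform, rmse_tuned):
    """Return (average, max) of uniform-minus-tuned RMSE per subset.

    Both arguments are dicts of measured RMSE keyed by subset name.
    """
    diffs = [rmse_uniform[name] - rmse_tuned[name] for name in rmse_tuned]
    return mean(diffs), max(diffs)


# Side-by-side view of the two policies.
print(f"{'subset':<8}{'legacy V7':>10}{'GRACE':>8}")
for name in SUBSETS:
    print(f"{name:<8}{dropout_legacy_v7(name):>10.2f}{dropout_grace(name):>8.2f}")
```

Note the asymmetry: `dropout_legacy_v7` fails on an unseen dataset name, while `dropout_grace` returns a usable value for anything, which is the transferability argument in code form.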
PyTorch: Legacy V7 vs GRACE Refactor
Two builders, paper-canonical: build_model_v7(dataset_name, input_size) uses the per-dataset dict; build_model_grace(cfg: ModelConfig) uses a dataclass with a single dropout field. The smoke test instantiates both for all four C-MAPSS subsets and prints the per-subset dropout used.
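The shape of the two builders can be sketched without importing PyTorch. This is an assumption-laden outline, not the paper's code: the function and class names follow the signatures quoted above, but the returned dict is a stand-in for the `nn.Module` a real builder would return, `input_size=14` is a placeholder, and the `head_dropouts` method is a hypothetical helper that applies the task-head fractions (`dropout * 0.5`, `dropout * 0.3`) this section attributes to grace/models/task_heads.py.

```python
from dataclasses import dataclass

LEGACY_V7_DROPOUT = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}


@dataclass(frozen=True)
class ModelConfig:
    input_size: int
    dropout: float = 0.3  # GRACE's uniform default

    def head_dropouts(self):
        """Scaled rates for the first and second task-head layers."""
        return (self.dropout * 0.5, self.dropout * 0.3)


def _spec(input_size, dropout):
    """Plain-dict stand-in for the module a real builder would construct."""
    return {"input_size": input_size, "dropout": dropout}


def build_model_v7(dataset_name, input_size):
    """Legacy style: the dropout rate is looked up from the dataset name."""
    return _spec(input_size, LEGACY_V7_DROPOUT[dataset_name])


def build_model_grace(cfg):
    """GRACE style: everything, including dropout, comes from the config."""
    return _spec(cfg.input_size, cfg.dropout)


# Smoke test: print the dropout each builder would use per subset.
for name in LEGACY_V7_DROPOUT:
    v7 = build_model_v7(name, input_size=14)
    grace = build_model_grace(ModelConfig(input_size=14))
    print(f"{name}: v7={v7['dropout']:.2f}  grace={grace['dropout']:.2f}")
```

The design difference is where the dataset name lives: the V7 builder needs it as an argument, whereas the GRACE builder never sees it, so a new dataset requires no code change.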
Where Per-Dataset Tuning Earns Its Keep
| Domain | Per-dataset benefit | Verdict |
|---|---|---|
| RUL prediction (this book) | <0.5 cycles RMSE | uniform 0.3 - book recommends |
| NLP fine-tuning (BERT on GLUE) | +1-3 dev-set points | per-task often worth it |
| Medical imaging (small-cohort) | +5-15% AUROC | per-cohort essential |
| Industrial sensor classification (~300 classes) | +0.5-1.5% accuracy | moderate gain - case by case |
| Robotics policy (sim-to-real) | varies wildly | must tune per environment |
| Self-driving perception (multi-camera) | <0.3% mAP | uniform fine |
Three Dropout-Tuning Pitfalls
- Call model.eval() before measuring metrics, so that dropout is disabled during evaluation.
- The task heads use dropout * 0.5 and dropout * 0.3 for the first and second layers respectively (paper file grace/models/task_heads.py). Setting them all equal (or all to 0.3) loses ~0.2 cycles RMSE. Keep the scaled fractions the paper ships.

The point. Per-dataset dropout tuning is a legacy V7 detail. The GRACE refactor abandoned it because the gain was within seed variance. The book recommends uniform 0.3 for any new C-MAPSS-style pipeline. §15.5 ties the whole AMNL pipeline together with a full training script.
Takeaway
- Legacy V7: {FD001: 0.3, FD002: 0.2, FD003: 0.3, FD004: 0.2}.
- GRACE refactor: uniform 0.3 across all C-MAPSS subsets.
- Penalty: <0.5 cycles RMSE on FD002/FD004; 0 on FD001/FD003.
- Effective capacity: $(1-p)^L$. p=0.2 vs 0.3 is ~2× capacity difference.
- Per-layer fractions: task heads use dropout * 0.5 / dropout * 0.3 - do NOT flatten.
- Decision rule: per-dataset only if gain > 2× seed variance.
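The decision rule in the last bullet can be written as a small check. This is a sketch under a stated assumption: "seed variance" is read here as the sample standard deviation of RMSE across seeds (the larger of the two configurations), which is one plausible interpretation and not something the text pins down. The function name `prefer_per_dataset` is hypothetical.

```python
from statistics import mean, stdev


def prefer_per_dataset(rmse_uniform_seeds, rmse_tuned_seeds, factor=2.0):
    """Return True if tuned dropout beats uniform by more than factor x spread.

    Each argument is a list of RMSE values, one per random seed (at least
    two per configuration). Spread is the larger sample standard deviation
    of the two lists, an assumed stand-in for "seed variance".
    """
    gain = mean(rmse_uniform_seeds) - mean(rmse_tuned_seeds)
    spread = max(stdev(rmse_uniform_seeds), stdev(rmse_tuned_seeds))
    return gain > factor * spread
```

Pass per-seed RMSE lists from your own runs for one subset at a time; a `False` result means the uniform setting is indistinguishable from the tuned one at that threshold.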