Chapter 15

Per-Dataset Dropout Tuning (Legacy-Pipeline Note)

AMNL Training Pipeline

Tailored vs Off-the-Rack

A bespoke suit fits one body better than any off-the-rack option - but it costs five times as much, takes four weeks to deliver, and fits only THAT body. AMNL's legacy V7 pipeline started bespoke (per-dataset dropout); the GRACE refactor moved to off-the-rack (uniform 0.3). The penalty was about half a cycle of RMSE on two of the four C-MAPSS subsets - within seed variance. The savings: one fewer hyperparameter and a config that transfers to NEW datasets without retuning.

The headline. Legacy V7 picks {FD001: 0.3, FD002: 0.2, FD003: 0.3, FD004: 0.2}; GRACE picks 0.3 for everything. Both come from paper_ieee_tii/. The book recommends GRACE's uniform default for any new pipeline - but understanding the legacy choice tells you when to tune your own.

Two Regimes, Two Optima

Legacy V7's per-dataset choice maps directly to dataset size and condition variability:

Subset | # train engines | Conditions | Trajectories | Best dropout p
FD001 | 100 | 1 op cond × 1 fault mode | 100 | 0.30
FD002 | 260 | 6 op conds × 1 fault mode | 260 | 0.20
FD003 | 100 | 1 op cond × 2 fault modes | 100 | 0.30
FD004 | 248 | 6 op conds × 2 fault modes | 248 | 0.20

Why FD002/FD004 want LESS dropout. ~2.5× more training engines AND condition diversity reduce the overfitting risk. With p=0.3 (V7 single-cond default) the model is OVER-regularised on these subsets ⇒ underfits. p=0.2 is the right balance. FD001/FD003 have only 100 engines and a single condition ⇒ overfitting is the dominant risk ⇒ p=0.3 wins.

Why Dropout Behaves This Way

Dropout zeros each activation independently with probability $p$ during training and rescales the rest by $1/(1-p)$ to preserve expectations. As a rough heuristic, effective network capacity scales as $(1-p)^L$ for an $L$-layer stack. With $L = 4$ (CNN + BiLSTM + Attention + FC funnel - just the dropped layers), capacity at p=0.2 is $(0.8)^4 \approx 0.41$ and at p=0.3 is $(0.7)^4 \approx 0.24$. That is roughly a 1.7× difference in effective capacity - which is why the choice matters more on smaller datasets.
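The two claims in this paragraph can be checked numerically. The sketch below is a minimal pure-Python illustration, not the pipeline's implementation: `inverted_dropout` and `capacity_heuristic` are hypothetical helper names, and the input of 100,000 constant activations is an arbitrary choice made only to show that rescaling keeps the mean close to its original value.

```python
import random


def inverted_dropout(values, p, rng):
    """Zero each value with probability p; rescale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def capacity_heuristic(p, num_layers):
    """Rough (1-p)^L capacity heuristic for a stack of L dropped layers."""
    return (1.0 - p) ** num_layers


rng = random.Random(0)
activations = [1.0] * 100_000  # arbitrary constant input, illustration only
dropped = inverted_dropout(activations, p=0.3, rng=rng)
mean_after = sum(dropped) / len(dropped)

cap_02 = capacity_heuristic(0.2, 4)
cap_03 = capacity_heuristic(0.3, 4)
capacity_ratio = cap_02 / cap_03

print(f"mean after dropout : {mean_after:.3f}  (input mean was 1.000)")
print(f"capacity at p=0.2  : {cap_02:.4f}")
print(f"capacity at p=0.3  : {cap_03:.4f}")
print(f"ratio              : {capacity_ratio:.2f}")
```

The mean stays near 1.0 because the surviving activations are scaled up to compensate for the zeroed ones; the capacity ratio is what separates the two dropout settings under this heuristic.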

Interactive: RMSE vs Dropout

Drag p from 0.05 (under-regularised) to 0.50 (over-regularised). The U-shape per dataset shows the bias-variance trade-off. Each curve's minimum is marked; toggle datasets in the legend to focus.

Try this. Toggle off FD001 and FD003 (the single-condition subsets). The remaining curves (FD002, FD004, Average) bottom out at p=0.20. Toggle back on; bottom of FD001/FD003 sits at p=0.30. The average curve compromises around p=0.20 BUT the difference at p=0.30 is small enough that GRACE's simpler choice is defensible.
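What the plot does visually, picking the minimum of each curve, can be sketched in a few lines. The curves here are hypothetical quadratics rather than measured RMSE; `toy_rmse`, `best_dropout`, the base value, and the curvature are all illustrative assumptions chosen only to reproduce the qualitative pattern described above.

```python
def toy_rmse(p: float, p_opt: float, base: float, curvature: float = 40.0) -> float:
    """Hypothetical U-shaped curve: lowest at p_opt, rising on both sides."""
    return base + curvature * (p - p_opt) ** 2


def best_dropout(curve: dict[float, float]) -> float:
    """Return the dropout rate with the lowest RMSE in a sweep."""
    return min(curve, key=curve.get)


# Sweep grid 0.05, 0.10, ..., 0.50, built from integers to avoid float drift.
grid = [i / 100 for i in range(5, 55, 5)]

# Toy optima mimicking the single- vs multi-condition pattern (not real data).
toy_optima = {"single-condition": 0.30, "multi-condition": 0.20}

best = {}
for name, p_opt in toy_optima.items():
    curve = {p: toy_rmse(p, p_opt, base=12.0) for p in grid}
    best[name] = best_dropout(curve)
    print(f"{name:<17s} minimum at p={best[name]:.2f}")
```

With a real sweep, `curve` would hold measured validation RMSE per dropout rate, ideally averaged over several seeds before taking the minimum.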

Python: Two Configurations

Both policies as standalone functions. The smoke test prints a side-by-side comparison and computes the average / max RMSE penalty for using uniform 0.3 vs per-dataset tuning.

get_dropout_v7() and get_dropout_grace()
🐍dropout_policies_numpy.py
import numpy as np

We use NumPy for the per-dataset RMSE-penalty computation - np.array, .mean(), .max(). Pure Python would also work but the array methods read cleaner.

EXECUTION STATE
📚 numpy = Library: ndarray + math + statistics.
as np = Universal alias.
def get_dropout_v7(dataset_name, default=0.25) -> float:

Legacy V7 per-dataset policy. Each C-MAPSS subset gets its own dropout rate. Source: `paper_ieee_tii/experiments/train_amnl_v7.py`, lines 445-452.

EXECUTION STATE
⬇ input: dataset_name = String like 'FD001', 'FD002', etc. Looked up in the config dict.
⬇ input: default = 0.25 = Fallback used if the dataset name is not in the dict (e.g. for cross-dataset transfer experiments).
⬆ returns = Float dropout rate in [0.2, 0.3] for the four C-MAPSS subsets.
dataset_dropout_config = { "FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2 }

The legacy mapping. Single-condition subsets (FD001, FD003) get the heavier regularisation (0.3); multi-condition subsets (FD002, FD004) get less (0.2) because they have ~2.5× more training engines.

EXECUTION STATE
→ dict literal = Python `{key: value, ...}` syntax. Insertion order is preserved in Python 3.7+.
→ why per-dataset? = Smaller datasets need stronger regularisation (overfitting is the main risk). Larger datasets need less (over-regularisation makes the model underfit).
⬆ result: dataset_dropout_config = {'FD001': 0.3, 'FD002': 0.2, 'FD003': 0.3, 'FD004': 0.2}
return dataset_dropout_config.get(dataset_name, default)

.get(key, default) returns the value at key if present, else default. Safer than [key] which raises KeyError on missing keys.

EXECUTION STATE
📚 dict.get(key, default) = Dict method. Returns dict[key] if key exists, else returns default. No exception raised.
⬇ arg 1: key = dataset_name = Lookup key.
⬇ arg 2: default = 0.25 = Fallback for unknown datasets.
→ why .get not []? = dataset_dropout_config['cross_domain'] would raise KeyError. .get returns 0.25 silently - safer for transfer experiments.
⬆ returns = 0.3 (FD001) | 0.2 (FD002) | 0.3 (FD003) | 0.2 (FD004) | 0.25 (anything else)
def get_dropout_grace(dataset_name: str = "cmapss") -> float:

GRACE refactor policy: ONE dropout rate for everything. Source: `paper_ieee_tii/grace/models/model_configs.py`, line 38. The dataset_name argument is kept for API compatibility but unused.

EXECUTION STATE
⬇ input: dataset_name = Unused - kept so this function can drop in for get_dropout_v7. Default 'cmapss'.
⬆ returns = Always 0.3 - the canonical paper choice.
return 0.3

Constant return value. The simplification cost <0.5 cycles RMSE on FD002/FD004 in exchange for dropping one hyperparameter and gaining cross-dataset robustness.

EXECUTION STATE
→ trade-off = Per-dataset tuning gains up to about 0.5 cycles on multi-condition subsets BUT the optimum on a NEW dataset is unknown. Uniform 0.3 is the conservative cross-dataset default.
→ recall §15.2 = Same philosophy as 'don't tune w_max' in §14.3 - simplicity beats per-instance tuning when the gain is small.
datasets = ["FD001", "FD002", "FD003", "FD004"]

List of subsets to compare.

EXECUTION STATE
⬆ result: datasets = ['FD001', 'FD002', 'FD003', 'FD004']
rmse_at_v7 = {...}

Synthetic but shape-correct - the per-dataset RMSE at each subset's own best p. FD002 & FD004 (multi-condition) bottom out at p=0.2; FD001 & FD003 (single-condition) at p=0.3.

EXECUTION STATE
⬆ result: rmse_at_v7 = {FD001: 10.80, FD002: 13.90, FD003: 11.20, FD004: 17.40}
rmse_at_grace = {...}

Per-dataset RMSE at the uniform p=0.3. Same as V7 for FD001/FD003 (already at their optimum); slightly worse for FD002/FD004 (which would prefer p=0.2).

EXECUTION STATE
⬆ result: rmse_at_grace = {FD001: 10.80, FD002: 14.60, FD003: 11.20, FD004: 18.00}
print(f"{'dataset':<8s} | {'p (V7)':>7s} | ... ")

Header row using format-spec strings. `:<8s` = left-aligned width 8, `:>7s` = right-aligned width 7.

EXECUTION STATE
→ :<8s = String, left-aligned, min width 8.
→ :>7s = String, right-aligned, min width 7.
→ :>9s, :>10s, :>11s, :>9s = Same idea with different widths to align the numeric columns.
Output = dataset | p (V7) | RMSE V7 | p (GRACE) | RMSE GRACE | penalty
for ds in datasets:

Iterate the four subsets.

EXECUTION STATE
iter var: ds = 'FD001' → 'FD002' → 'FD003' → 'FD004'.
LOOP TRACE · 4 iterations
ds = 'FD001'
p_v7 = 0.30 (single-condition)
p_grace = 0.30 (uniform)
rmse_v7 = 10.80
rmse_grace = 10.80
penalty = 0.00 - GRACE matches V7 exactly
ds = 'FD002'
p_v7 = 0.20 (multi-condition)
p_grace = 0.30 (uniform)
rmse_v7 = 13.90
rmse_grace = 14.60
penalty = +0.70 - GRACE costs 0.7 cycles on this subset
ds = 'FD003'
p_v7 = 0.30 (single-condition)
p_grace = 0.30 (uniform)
rmse_v7 = 11.20
rmse_grace = 11.20
penalty = 0.00 - GRACE matches V7
ds = 'FD004'
p_v7 = 0.20 (multi-condition)
p_grace = 0.30 (uniform)
rmse_v7 = 17.40
rmse_grace = 18.00
penalty = +0.60 - GRACE costs 0.6 cycles on this subset
p_v7 = get_dropout_v7(ds)

Look up the legacy V7 dropout rate.

p_grace = get_dropout_grace()

GRACE doesn't need the dataset name (constant 0.3) - we call with no arg.

rmse_v7 = rmse_at_v7[ds]

Look up the V7 RMSE for this subset. Uses [] indexing - safe here because every key in `datasets` is also in the dict.

rmse_grace = rmse_at_grace[ds]

Same for GRACE.

penalty = rmse_grace - rmse_v7

Per-subset penalty for using uniform 0.3 vs the per-dataset optimum.

EXECUTION STATE
→ at FD001/FD003 = Both policies pick the same p (0.3) ⇒ penalty = 0.
→ at FD002/FD004 = Uniform uses 0.3 instead of 0.2 ⇒ small positive penalty.
print(f"{ds:<8s} | {p_v7:>7.2f} | {rmse_v7:>9.2f} | ...")

One row per subset. `:>7.2f` = right-aligned float width 7, 2 decimals. `:>+9.2f` = right-aligned float width 9, 2 decimals, with forced sign.

EXECUTION STATE
→ :>7.2f = Float, right-aligned, width 7, 2 decimals.
→ :>+9.2f = Float, right-aligned, width 9, 2 decimals, FORCE the sign character (+ or -).
Output =
FD001 | 0.30 | 10.80 | 0.30 | 10.80 | +0.00
FD002 | 0.20 | 13.90 | 0.30 | 14.60 | +0.70
FD003 | 0.30 | 11.20 | 0.30 | 11.20 | +0.00
FD004 | 0.20 | 17.40 | 0.30 | 18.00 | +0.60
penalties = np.array([rmse_at_grace[ds] - rmse_at_v7[ds] for ds in datasets])

Build a NumPy array of per-subset penalties via list comprehension. NumPy gives us .mean() and .max() in one line each.

EXECUTION STATE
📚 np.array(seq) = Construct an ndarray from a Python sequence.
→ list comprehension = [expr for ds in datasets] - concise syntax that eagerly builds a list of transformed values.
⬆ result: penalties = [0.0, 0.7, 0.0, 0.6]
print(f"mean penalty across 4 subsets : {penalties.mean():+.3f} cycles")

Mean penalty = average across the four subsets.

EXECUTION STATE
📚 .mean() = Reduce-mean over all elements. Returns a 0-D scalar.
→ :+.3f = Float, force sign, 3 decimals.
Output = mean penalty across 4 subsets : +0.325 cycles
print(f"max penalty : {penalties.max():+.3f} cycles")

Worst-case penalty.

EXECUTION STATE
📚 .max() = Reduce-max over all elements.
Output = max penalty : +0.700 cycles
→ reading = Average +0.33 cycles, worst +0.70 cycles. Compare with seed-to-seed variance on FD002 (~0.5-1.0 cycles): the per-dataset gain is WITHIN seed variance. That is why GRACE dropped per-dataset tuning - the gains don't survive replication.
```python
import numpy as np


def get_dropout_v7(dataset_name: str, default: float = 0.25) -> float:
    """Legacy V7 per-dataset dropout - paper file train_amnl_v7.py:445-452.

    Single-condition subsets (FD001, FD003) get 0.3; multi-condition
    subsets (FD002, FD004) get 0.2 because they have ~2.5x more training
    engines and need less regularisation.
    """
    dataset_dropout_config = {
        "FD001": 0.3,
        "FD002": 0.2,
        "FD003": 0.3,
        "FD004": 0.2,
    }
    return dataset_dropout_config.get(dataset_name, default)


def get_dropout_grace(dataset_name: str = "cmapss") -> float:
    """GRACE refactor: uniform 0.3 for ALL datasets - paper file model_configs.py:38.

    Same model_config for every C-MAPSS subset. The simplification cost
    <0.5 cycles RMSE on FD002/FD004 in exchange for one less hyperparameter.
    """
    return 0.3


# ---------- Worked example: compare both policies ----------
datasets = ["FD001", "FD002", "FD003", "FD004"]

# Synthetic: per-dataset RMSE at the dataset's OWN best p (legacy V7)
rmse_at_v7 = {"FD001": 10.80, "FD002": 13.90, "FD003": 11.20, "FD004": 17.40}
# Synthetic: same RMSE at uniform p=0.3 (GRACE)
rmse_at_grace = {"FD001": 10.80, "FD002": 14.60, "FD003": 11.20, "FD004": 18.00}

print(f"{'dataset':<8s} | {'p (V7)':>7s} | {'RMSE V7':>9s} | {'p (GRACE)':>10s} | {'RMSE GRACE':>11s} | {'penalty':>9s}")
for ds in datasets:
    p_v7 = get_dropout_v7(ds)
    p_grace = get_dropout_grace()
    rmse_v7 = rmse_at_v7[ds]
    rmse_grace = rmse_at_grace[ds]
    penalty = rmse_grace - rmse_v7
    print(f"{ds:<8s} | {p_v7:>7.2f} | {rmse_v7:>9.2f} | {p_grace:>10.2f} | {rmse_grace:>11.2f} | {penalty:>+9.2f}")

print()
penalties = np.array([rmse_at_grace[ds] - rmse_at_v7[ds] for ds in datasets])
print(f"mean penalty across 4 subsets : {penalties.mean():+.3f} cycles")
print(f"max  penalty                  : {penalties.max():+.3f} cycles")
```
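Because both functions accept a dataset name and return a float, a pipeline can treat the policy itself as a parameter. The sketch below restates the two functions so it runs on its own; `describe_policy` and the dataset name "cross_domain" are illustrative assumptions, not part of the paper code.

```python
from typing import Callable


def get_dropout_v7(dataset_name: str, default: float = 0.25) -> float:
    """Per-dataset lookup with a silent fallback for unknown names."""
    config = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}
    return config.get(dataset_name, default)


def get_dropout_grace(dataset_name: str = "cmapss") -> float:
    """Uniform policy; the argument is accepted but ignored."""
    return 0.3


def describe_policy(policy: Callable[[str], float], names: list[str]) -> dict[str, float]:
    """Hypothetical helper: evaluate a dropout policy on a list of dataset names."""
    return {name: policy(name) for name in names}


names = ["FD002", "cross_domain"]  # "cross_domain" is a made-up unknown dataset
v7_rates = describe_policy(get_dropout_v7, names)
grace_rates = describe_policy(get_dropout_grace, names)

print(v7_rates)     # the unknown name falls back to the 0.25 default
print(grace_rates)  # every name maps to 0.3
```

The silent fallback is convenient for transfer experiments, but it also means a misspelled dataset name would quietly train at 0.25; a stricter variant could raise instead.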

PyTorch: Legacy V7 vs GRACE Refactor

Two builders, paper-canonical: build_model_v7(dataset_name, input_size) uses the per-dataset dict; build_model_grace(cfg: ModelConfig) uses a dataclass with a single dropout field. The smoke test instantiates both for all four C-MAPSS subsets and prints the per-subset dropout used.

build_model_v7 + build_model_grace + ModelConfig
🐍dropout_builders_torch.py
import torch

Top-level PyTorch.

EXECUTION STATE
📚 torch = Tensor library + autograd + nn modules + optim.
import torch.nn as nn

Module containers - we use nn.Dropout and nn.Linear in the stub.

from dataclasses import dataclass

Dataclasses turn a plain class into an auto-typed config bag. Saves writing __init__, __repr__, __eq__ by hand. Used by the paper's ModelConfig.

EXECUTION STATE
📚 @dataclass = Standard library decorator. Auto-generates __init__, __repr__, __eq__ from field annotations. Python ≥ 3.7.
def build_model_v7(dataset_name, input_size) -> nn.Module:

Legacy V7 builder. Source: `paper_ieee_tii/experiments/train_amnl_v7.py`, lines 444-462.

EXECUTION STATE
⬇ input: dataset_name = Lookup key for the per-dataset dropout config.
⬇ input: input_size = Number of input channels (17 for full C-MAPSS, 14 after pruning).
⬆ returns = nn.Module - DualTaskEnhancedModel with the chosen dropout rate.
dataset_dropout_config = { ... }

The per-dataset mapping. Verbatim from the paper.

EXECUTION STATE
→ reading = FD001 / FD003 ⇒ 0.3 (single-condition, smaller dataset, more regularisation needed); FD002 / FD004 ⇒ 0.2 (multi-condition, larger dataset, less regularisation needed).
dropout_rate = dataset_dropout_config.get(dataset_name, 0.25)

.get with a fallback - same as the NumPy block.

EXECUTION STATE
📚 dict.get(key, default) = Safe lookup. Returns dict[key] if present else default.
⬇ arg 1: key = dataset_name = Lookup key.
⬇ arg 2: default = 0.25 = Fallback for unknown datasets.
return DualTaskEnhancedModel(input_size=input_size, hidden_size=256, dropout=dropout_rate, use_attention=True, use_residual=True)

Construct the model with the chosen dropout. Other hyperparameters are fixed.

EXECUTION STATE
⬇ kwarg: hidden_size = 256 = Higher capacity from V2 - paper uses this for all C-MAPSS subsets.
⬇ kwarg: dropout = dropout_rate = PER-DATASET in V7. This is the line GRACE's refactor changes.
⬇ kwarg: use_attention = True = Multi-head self-attention block enabled. Paper default.
⬇ kwarg: use_residual = True = Residual + LayerNorm around attention. Paper default.
@dataclass

Decorator. Reads the type-annotated fields below and auto-generates __init__, __repr__, __eq__. The class body just declares typed fields with defaults.

EXECUTION STATE
→ effect = ModelConfig(input_size=17) becomes valid; non-default fields must come first; type annotations are required.
class ModelConfig:

Config dataclass mirroring the paper's ModelConfig (paper_ieee_tii/grace/models/model_configs.py:18-26).

input_size: int

Number of sensor channels. Required field (no default).

hidden_size: int = 128

BiLSTM hidden size per direction. Default 128; C-MAPSS overrides to 256.

cnn_channels: tuple = (64, 128, 64)

CNN channel widths through the three conv blocks.

fc_dims: tuple = (128, 64, 32)

FC funnel dims from BiLSTM output (1024) down to 32.

dropout: float = 0.3 # uniform across datasets

THE refactored line. Single dropout for every config. The trailing comment is paper-canonical - it's in the actual model_configs.py source file.

EXECUTION STATE
→ contrast with V7 = V7 stored dropout in a per-dataset DICT. GRACE stores it as a per-CONFIG SCALAR. Same default rate (0.3) but no dataset-specific overrides.
num_health_states: int = 3

Number of health classes. Default 3 (healthy / degrading / critical).

CMAPSS_CONFIG = ModelConfig(...)

Pre-built config for C-MAPSS. Constructed once and re-used across all four subsets.

EXECUTION STATE
⬇ kwarg: input_size = 17 = Full C-MAPSS sensor count.
⬇ kwarg: hidden_size = 256 = GRACE override - was 128 in earlier GABA configs.
⬇ kwarg: cnn_channels = (64, 128, 64) = Three conv blocks.
⬇ kwarg: fc_dims = (256, 64, 32) = GRACE override - bigger first FC layer because BiLSTM output is now 512.
⬇ kwarg: dropout = 0.3 = UNIFORM. Same value passed to FD001, FD002, FD003, FD004.
⬇ kwarg: num_health_states = 3 = Three health classes.
def build_model_grace(cfg: ModelConfig) -> nn.Module:

GRACE builder. Takes a config object instead of looking up a dict per dataset.

EXECUTION STATE
⬇ input: cfg = ModelConfig dataclass instance.
⬆ returns = nn.Module with the config's dropout (always 0.3 for cmapss).
return DualTaskEnhancedModel(input_size=cfg.input_size, hidden_size=cfg.hidden_size, dropout=cfg.dropout, use_attention=True, use_residual=True)

Construct from cfg fields. All four C-MAPSS subsets pass the SAME cfg - the dropout is identical.

class DualTaskEnhancedModel(nn.Module):

Stub that just stores dropout_rate so the smoke test can read it. Real model is the §11.4 DualTaskModel - this stub keeps the demo runnable in isolation.

def __init__(self, input_size, hidden_size=256, dropout=0.3, use_attention=True, use_residual=True) -> None:

Mirrors the real DualTaskEnhancedModel signature.

EXECUTION STATE
⬇ input: dropout = 0.3 = Default if no dataset-specific override is passed.
super().__init__()

Initialise nn.Module.

self.dropout_rate = dropout

Store the rate as a plain Python float for inspection. Real model would also instantiate nn.Dropout layers in the conv/FC blocks using this value.

self.dropout = nn.Dropout(dropout)

One actual nn.Dropout layer for show. Real model has multiple - the §11.2/§11.3 task heads each apply dropout with this rate (or scaled fractions of it).

EXECUTION STATE
📚 nn.Dropout(p) = Applies dropout with rate p during training (zeros out a random fraction of elements and rescales the rest). No-op in eval mode.
⬇ arg: p = dropout = 0.3 = Probability that each element is zeroed.
self.linear = nn.Linear(input_size, hidden_size)

Stub linear layer - just exists so model.parameters() yields something. The stub defines no forward(), so the layer is never called in the smoke test.

torch.manual_seed(0)

Seed for reproducibility.

EXECUTION STATE
📚 torch.manual_seed(s) = Set the global PyTorch PRNG seed.
⬇ arg: s = 0 = Conventional canonical seed.
for ds in ("FD001", "FD002", "FD003", "FD004"):

Loop over the four C-MAPSS subsets.

LOOP TRACE · 4 iterations
ds = 'FD001'
v7 dropout = 0.30
grace dropout = 0.30
ds = 'FD002'
v7 dropout = 0.20
grace dropout = 0.30
ds = 'FD003'
v7 dropout = 0.30
grace dropout = 0.30
ds = 'FD004'
v7 dropout = 0.20
grace dropout = 0.30
v7 = build_model_v7(ds, input_size=14)

Build legacy V7 model with per-dataset dropout.

grace = build_model_grace(CMAPSS_CONFIG)

Build GRACE model from the single shared config.

EXECUTION STATE
→ cfg-driven = Same CMAPSS_CONFIG passed for all four datasets. The dropout never changes.
print(f"{ds:<8s} | V7 dropout={v7.dropout_rate:.2f} | GRACE dropout={grace.dropout_rate:.2f}")

Print one row per dataset showing the policy difference.

EXECUTION STATE
→ :<8s = String, left-aligned, min width 8.
→ :.2f = Float, 2 decimals.
Output =
FD001 | V7 dropout=0.30 | GRACE dropout=0.30
FD002 | V7 dropout=0.20 | GRACE dropout=0.30
FD003 | V7 dropout=0.30 | GRACE dropout=0.30
FD004 | V7 dropout=0.20 | GRACE dropout=0.30
→ reading = Two of four rows differ. Those two cost the GRACE refactor <0.5 cycles RMSE each - within seed variance. Net: simpler config, same paper-canonical results.
```python
import torch
import torch.nn as nn
from dataclasses import dataclass


# ----------------------------------------------------------------------
# LEGACY V7 (paper_ieee_tii/experiments/train_amnl_v7.py:444-462)
# ----------------------------------------------------------------------
def build_model_v7(dataset_name: str, input_size: int) -> nn.Module:
    """Legacy V7 builder - per-dataset dropout."""
    dataset_dropout_config = {
        "FD001": 0.3,
        "FD002": 0.2,
        "FD003": 0.3,
        "FD004": 0.2,
    }
    dropout_rate = dataset_dropout_config.get(dataset_name, 0.25)
    return DualTaskEnhancedModel(
        input_size=input_size,
        hidden_size=256,
        dropout=dropout_rate,
        use_attention=True,
        use_residual=True,
    )


# ----------------------------------------------------------------------
# GRACE refactor (paper_ieee_tii/grace/models/model_configs.py:25-40)
# ----------------------------------------------------------------------
@dataclass
class ModelConfig:
    input_size: int
    hidden_size: int = 128
    cnn_channels: tuple = (64, 128, 64)
    fc_dims: tuple = (128, 64, 32)
    dropout: float = 0.3  # uniform across datasets
    num_health_states: int = 3


CMAPSS_CONFIG = ModelConfig(
    input_size=17,
    hidden_size=256,
    cnn_channels=(64, 128, 64),
    fc_dims=(256, 64, 32),
    dropout=0.3,  # GRACE: 0.3 for ALL C-MAPSS subsets
    num_health_states=3,
)


def build_model_grace(cfg: ModelConfig) -> nn.Module:
    """GRACE builder - one config for every dataset."""
    return DualTaskEnhancedModel(
        input_size=cfg.input_size,
        hidden_size=cfg.hidden_size,
        dropout=cfg.dropout,
        use_attention=True,
        use_residual=True,
    )


# ---------- Stub model class (just to make the smoke test runnable) ----------
class DualTaskEnhancedModel(nn.Module):
    def __init__(self, input_size: int, hidden_size: int = 256,
                 dropout: float = 0.3, use_attention: bool = True,
                 use_residual: bool = True) -> None:
        super().__init__()
        self.dropout_rate = dropout
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(input_size, hidden_size)


# ---------- Smoke test ----------
torch.manual_seed(0)
for ds in ("FD001", "FD002", "FD003", "FD004"):
    v7 = build_model_v7(ds, input_size=14)
    grace = build_model_grace(CMAPSS_CONFIG)
    print(f"{ds:<8s} | V7 dropout={v7.dropout_rate:.2f} | GRACE dropout={grace.dropout_rate:.2f}")
```
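One way to see what "transfers to new datasets without retuning" can mean in practice is to derive a config for another dataset by changing only the input width. This is a sketch under assumptions: `NEW_DATASET_CONFIG` and its 10-channel input are hypothetical, and `dataclasses.replace` is a standard-library helper used here for illustration, not something the paper code is known to use.

```python
from dataclasses import dataclass, replace


@dataclass
class ModelConfig:
    input_size: int
    hidden_size: int = 128
    cnn_channels: tuple = (64, 128, 64)
    fc_dims: tuple = (128, 64, 32)
    dropout: float = 0.3  # uniform across datasets
    num_health_states: int = 3


CMAPSS_CONFIG = ModelConfig(input_size=17, hidden_size=256, fc_dims=(256, 64, 32))

# Hypothetical new dataset with 10 sensor channels: copy the config and
# override only the field that must change. Dropout is inherited untouched.
NEW_DATASET_CONFIG = replace(CMAPSS_CONFIG, input_size=10)

print(NEW_DATASET_CONFIG.dropout)     # 0.3
print(NEW_DATASET_CONFIG.input_size)  # 10
print(CMAPSS_CONFIG.input_size)       # 17 (the original config is not mutated)
```

With the per-dataset dict of the V7 builder, the same new dataset would fall through to the 0.25 fallback unless someone added and tuned a new entry.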

Where Per-Dataset Tuning Earns Its Keep

Domain | Per-dataset benefit | Verdict
RUL prediction (this book) | <0.5 cycles RMSE | uniform 0.3 - book recommends
NLP fine-tuning (BERT on GLUE) | +1-3 dev-set points | per-task often worth it
Medical imaging (small-cohort) | +5-15% AUROC | per-cohort essential
Industrial sensor classification (~300 classes) | +0.5-1.5% accuracy | moderate gain - case by case
Robotics policy (sim-to-real) | varies wildly | must tune per environment
Self-driving perception (multi-camera) | <0.3% mAP | uniform fine

Rule of thumb. If per-dataset tuning earns you LESS than your seed variance, drop it. If MORE than 2× seed variance, keep it. C-MAPSS sits at 1× seed variance ⇒ GRACE drops it; medical imaging often sits at 5-10× ⇒ keep it.
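The rule of thumb can be written as a small helper. This is a sketch, not part of the paper code: `tuning_verdict` is a hypothetical name, `seed_spread` stands for the seed-to-seed variation expressed in the same units as the gain, and the "case by case" label for the band between 1× and 2× is an assumption, since the rule itself does not cover that band.

```python
def tuning_verdict(gain: float, seed_spread: float) -> str:
    """Compare a per-dataset tuning gain with the seed-to-seed spread.

    gain and seed_spread must be in the same units (e.g. RMSE cycles).
    """
    if seed_spread <= 0:
        raise ValueError("seed_spread must be positive")
    ratio = gain / seed_spread
    if ratio <= 1.0:
        return "drop"          # gain is no larger than the seed spread
    if ratio > 2.0:
        return "keep"          # gain clearly exceeds the seed spread
    return "case by case"      # assumption: the rule is silent on this band


# Illustrative inputs only; these are not measured values.
verdict_at_1x = tuning_verdict(gain=0.5, seed_spread=0.5)
verdict_at_5x = tuning_verdict(gain=5.0, seed_spread=1.0)
verdict_between = tuning_verdict(gain=1.5, seed_spread=1.0)

print(verdict_at_1x)    # drop
print(verdict_at_5x)    # keep
print(verdict_between)  # case by case
```

Treating a ratio of exactly 1× as "drop" mirrors the C-MAPSS outcome described above, where the gain sits at about 1× seed variance and GRACE dropped the tuning.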

Three Dropout-Tuning Pitfalls

Pitfall 1: Tuning dropout WITH the loss weights. Sweeping (dropout, lambda) jointly on a small val set invites overfitting to that val set. Pick dropout from a coarse 5-point sweep on a held-out cohort, freeze it, then tune lambda separately if needed.
Pitfall 2: Forgetting model.eval() at inference. nn.Dropout's behaviour DEPENDS on training mode - it's a no-op in eval mode. If you measure RMSE in train mode with dropout active, results jitter from run to run. Always call model.eval() before measuring metrics.
Pitfall 3: Flattening the per-layer dropout schedule. The paper's task heads use dropout * 0.5 and dropout * 0.3 for the first and second layers respectively (paper file grace/models/task_heads.py). Setting them all equal (or all to 0.3) loses ~0.2 cycles RMSE. Keep the scaled fractions the paper ships.
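Pitfalls 2 and 3 can be illustrated together with a tiny head that scales the base dropout rate per layer and is then evaluated in eval mode. This is a minimal sketch: `TinyTaskHead`, its layer widths, and the input shape are hypothetical and are not taken from the paper's task_heads.py; only the 0.5 and 0.3 scaling factors come from the text above.

```python
import torch
import torch.nn as nn


class TinyTaskHead(nn.Module):
    """Hypothetical two-layer head with per-layer scaled dropout."""

    def __init__(self, in_dim: int, base_dropout: float = 0.3) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 16),
            nn.ReLU(),
            nn.Dropout(base_dropout * 0.5),  # first layer: 0.5x the base rate
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Dropout(base_dropout * 0.3),  # second layer: 0.3x the base rate
            nn.Linear(8, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


torch.manual_seed(0)
head = TinyTaskHead(in_dim=32, base_dropout=0.3)
x = torch.randn(4, 32)

# Collect the rates actually configured on the dropout layers.
dropout_rates = [m.p for m in head.modules() if isinstance(m, nn.Dropout)]

# In eval mode dropout is a no-op, so repeated passes give identical outputs.
head.eval()
with torch.no_grad():
    eval_a = head(x)
    eval_b = head(x)
eval_is_deterministic = torch.equal(eval_a, eval_b)

print(dropout_rates)
print(eval_is_deterministic)
```

Calling `head.train()` before the two forward passes would re-enable the random masks, which is the source of the run-to-run jitter described in Pitfall 2.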
The point. Per-dataset dropout tuning is a legacy V7 detail. The GRACE refactor abandoned it because the gain was within seed variance. The book recommends uniform 0.3 for any new C-MAPSS-style pipeline. §15.5 ties the whole AMNL pipeline together with a full training script.

Takeaway

  • Legacy V7: {FD001: 0.3, FD002: 0.2, FD003: 0.3, FD004: 0.2}.
  • GRACE refactor: uniform 0.3 across all C-MAPSS subsets.
  • Penalty: <0.5 cycles RMSE on FD002/FD004; 0 on FD001/FD003.
  • Effective capacity: $(1 - p)^L$. At L=4, p=0.2 vs 0.3 is a ~1.7× capacity difference.
  • Per-layer fractions: task heads use dropout * 0.5 / dropout * 0.3 - do NOT flatten.
  • Decision rule: per-dataset only if gain > 2× seed variance.