Chapter 16

Cross-Pipeline Caveats

AMNL Empirical Results

What the Dagger Means

Read the AMNL row of paper Table I carefully and you will see a † next to the method name. The footnote spells it out: “AMNL uses a legacy training pipeline (separate model class, per-dataset dropout); comparisons are cross-pipeline.” This section explains exactly what that means and how to read the numbers responsibly.

The headline. AMNL was trained with the original V7 stack (train_amnl_v7.py + DualTaskEnhancedModel); GABA / GRACE were trained with the refactored stack (grace/training/ + DualTaskModel). The compute graph is the same, but the head names, dropout policy, and config plumbing differ. Three medium-impact differences ⇒ “USE WITH CAVEAT”.

The Three Real Differences

Most of the V7 → GRACE refactor was code organisation: the compute graph, the training schedule, and every hyperparameter other than dropout are identical. The remaining differences are:

Feature | V7 legacy | GRACE refactor | Impact
Model class | DualTaskEnhancedModel | DualTaskModel | med
Head naming | rul_branch + health_branch | rul_head + health_head | low (cosmetic)
Dropout policy | {FD001: 0.3, FD002: 0.2, FD003: 0.3, FD004: 0.2} | uniform 0.3 | med (~0.5 cycles RMSE)
Config source | hard-coded inline in train_amnl_v7.py | ModelConfig dataclass in model_configs.py | low (organisation only)
Loss / Optimiser / Scheduler / EMA | moderate_weighted_mse_loss, AdamW, ReduceLROnPlateau, EMA(0.999) | same | low

What “medium impact” means here. The difference can be compensated for: re-running the GRACE refactor with V7's per-dataset dropout would shift results by less than the seed std on most subsets. So AMNL's 6.74 FD002 RMSE can be compared with GABA's 7.53 to within cross-pipeline noise (~0.5 cycles). The dagger signals “defensible but annotated”, not “invalid”.

Interactive: Build Your Own Comparison

Ten potential pipeline differences, three default-on. Click to add or remove from the comparison; the verdict updates live. The default selection (model class + dropout + task combiner) reflects what the paper actually flags with the dagger.

Try this. Toggle ALL ten features ON. The verdict stays at “USE WITH CAVEAT” - never goes red. Toggle just the cosmetic ones (head naming, config source). The verdict turns green: cross-pipeline but nothing material differs. The honest reading of the AMNL row is the latter, but the conservative reporting style keeps the dagger.
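
The text does not spell out how the widget aggregates the toggles. One rule consistent with the behaviour described above is: the verdict is green when every selected difference is low-impact, and “USE WITH CAVEAT” as soon as one medium-impact difference is selected. The sketch below applies that assumed rule to the five rows of the V7-vs-GRACE table; the function name, the dictionary keys, and the "OK" label are hypothetical.

```python
# Impact levels copied from the V7-vs-GRACE table in this section.
# The key names are illustrative, not identifiers from the paper's code.
IMPACT = {
    "model_class": "med",
    "head_naming": "low",
    "dropout_policy": "med",
    "config_source": "low",
    "loss_optimiser_scheduler_ema": "low",
}


def widget_verdict(selected: set) -> str:
    """Assumed rule: any medium-impact difference => caveat, otherwise OK."""
    unknown = selected - set(IMPACT)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    if any(IMPACT[name] == "med" for name in selected):
        return "USE WITH CAVEAT"
    return "OK"


# Only the cosmetic differences selected.
print(widget_verdict({"head_naming", "config_source"}))
# Every listed difference selected.
print(widget_verdict(set(IMPACT)))
```

Because the rule has no level above “USE WITH CAVEAT”, adding more differences can never push the verdict further, which matches the “never goes red” behaviour.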

What These Differences Cost

The dropout policy difference is the only one with a quantifiable RMSE cost. From §15.4's ablation, switching from per-dataset to uniform 0.3 costs about +0.5 cycles RMSE on FD002 / FD004 (which would prefer 0.2) and 0 cycles on FD001 / FD003 (which already use 0.3). The head rename (rul_branch → rul_head) has no compute-graph effect: same arithmetic, same gradients, same final weights up to numerical noise.

Cross-pipeline corrected estimates (rough rule of thumb):

Subset | AMNL (V7) RMSE | Dropout correction | AMNL-on-GRACE RMSE estimate
FD001 | 10.43 ± 1.94 | 0 | 10.43 ± 1.94 (no change)
FD002 | 6.74 ± 0.91 | +0.5 | ≈ 7.2 ± 1.0
FD003 | 9.51 ± 1.74 | 0 | 9.51 ± 1.74 (no change)
FD004 | 8.16 ± 2.17 | +0.5 | ≈ 8.7 ± 2.3

Even after correction, AMNL FD002 wins. The estimated AMNL-on-GRACE RMSE on FD002 is ≈ 7.2 - still below GABA's 7.53 and well below DMHA-ATCN's 16.95. The conclusion of §16.1 doesn't change; only the magnitude of the gap softens slightly.
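
The corrected means in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch of the rule of thumb as stated: add 0.5 cycles wherever the V7 dropout differs from the uniform GRACE rate, and nothing otherwise. The variable and function names are illustrative, and the widened ± values in the table are not modelled because the text does not say how they were derived.

```python
# AMNL (V7) RMSE mean and seed std per subset, from the table above.
amnl_v7 = {
    "FD001": (10.43, 1.94),
    "FD002": (6.74, 0.91),
    "FD003": (9.51, 1.74),
    "FD004": (8.16, 2.17),
}
# V7 per-dataset dropout versus the uniform GRACE rate.
v7_dropout = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}
GRACE_DROPOUT = 0.3
DROPOUT_PENALTY = 0.5  # cycles RMSE, the rough figure quoted from the ablation


def corrected_mean(subset: str) -> float:
    """Add the penalty only where the dropout rate would actually change."""
    mean, _std = amnl_v7[subset]
    changed = v7_dropout[subset] != GRACE_DROPOUT
    return mean + (DROPOUT_PENALTY if changed else 0.0)


estimates = {subset: corrected_mean(subset) for subset in amnl_v7}
for subset, est in estimates.items():
    print(f"{subset}: {amnl_v7[subset][0]:.2f} -> {est:.2f}")

GABA_FD002 = 7.53  # GABA's FD002 RMSE as quoted in this section
print("AMNL still ahead on FD002:", estimates["FD002"] < GABA_FD002)
```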

Python: Detect Which Pipeline a Model Came From

Pure Python heuristic. Look at state-dict key names and dropout config; classify as V7 / GRACE / UNKNOWN. comparison_verdict then decides whether two methods can be compared head-to-head.

detect_pipeline() + comparison_verdict()
🐍pipeline_detector_numpy.py
import numpy as np

Imported by convention. The detector logic is pure-Python set ops; we do not actually use NumPy here.

EXECUTION STATE
📚 numpy = Library: ndarray + math. Unused in this script.
as np = Universal alias.
def detect_pipeline(state_dict_keys, dropout_per_dataset=None) -> str:

Heuristic classifier. Looks at module-name signatures in the model state-dict and at the dropout config to decide whether a checkpoint came from V7 (legacy AMNL) or the GRACE refactor.

EXECUTION STATE
⬇ input: state_dict_keys = List of strings - the keys returned by model.state_dict().keys(). Each key is a fully-qualified parameter path like ‘backbone.cnn.0.weight’.
⬇ input: dropout_per_dataset = Optional dict {dataset_name: dropout_rate}. None ⇒ unknown (e.g. inference-only checkpoint).
⬆ returns = 'V7', 'GRACE', or 'UNKNOWN'.
has_v7_branches = any("rul_branch." in k for k in state_dict_keys)

V7 model class (DualTaskEnhancedModel) names its task heads ‘rul_branch’ and ‘health_branch’. Search the state-dict keys for that prefix.

EXECUTION STATE
📚 any(iterable) = Built-in. Returns True if any element of the iterable is truthy.
📚 generator expression = (expr for x in iterable) - lazy iterator. Same as a list comprehension but without materialising the list.
📚 "... in str" = Substring containment test. Returns True if the left string is found anywhere in the right string.
→ why substring match? = Direct equality (k == ‘rul_branch’) misses children like ‘rul_branch.0.weight’. The substring test catches all sub-keys with one check, and it still matches when a wrapper puts an extra prefix in front of the key.
⬆ result: has_v7_branches = Boolean. True ⇒ at least one state-dict key contains ‘rul_branch.’.
has_grace_heads = any("rul_head." in k for k in state_dict_keys)

Same trick for GRACE's naming convention.

EXECUTION STATE
→ naming convention difference = V7 used ‘branch’ (because the original code thought of them as separate compute branches). GRACE uses ‘head’ (matching the §11.4 nomenclature). The rename was the largest API-visible change in the refactor.
has_per_dataset_dropout = (dropout_per_dataset is not None and len(set(dropout_per_dataset.values())) > 1)

True if the caller supplied a config AND that config has > 1 unique dropout value (e.g. {0.2, 0.3}). Uniform configs (all 0.3) collapse to a 1-element set and return False.

EXECUTION STATE
📚 set(iterable) = Build a set from an iterable. Removes duplicates, no defined order.
📚 .values() = Dict view of the dict's values.
📚 len(seq) = Built-in. Returns the count of elements.
📚 is not None = Identity check against the None singleton. Preferred over `!= None` for None comparisons.
→ V7 case = set(v7_dropout.values()) ⇒ {0.3, 0.2}, which has len 2. has_per_dataset_dropout = True.
→ GRACE case = All 0.3 ⇒ set has len 1. has_per_dataset_dropout = False.
⬆ result = Boolean.
if has_v7_branches and has_per_dataset_dropout:

Both signals must agree. Belt-and-suspenders. A checkpoint with V7-style names but uniform dropout is unusual and we don't want to claim V7 confidently.

return "V7"

V7 verdict.

if has_grace_heads and not has_per_dataset_dropout:

GRACE: head-style names AND uniform dropout. A missing dropout config (None) also satisfies this branch, because has_per_dataset_dropout is then False.

return "GRACE"

GRACE verdict.

return "UNKNOWN"

Mixed or missing signals. Should be rare; flag for human inspection.

EXECUTION STATE
⬆ returns = 'UNKNOWN' - signals incomplete metadata, not necessarily a corrupt checkpoint.
def comparison_verdict(method_a_pipeline, method_b_pipeline) -> str:

Decide whether two methods can be compared head-to-head. Three outcomes: DIRECT (same pipeline), CROSS-PIPELINE (different but both known - annotate with dagger), or INSUFFICIENT-INFO (at least one is UNKNOWN).

EXECUTION STATE
⬇ input: method_a_pipeline = 'V7' / 'GRACE' / 'UNKNOWN' for method A.
⬇ input: method_b_pipeline = Same for method B.
⬆ returns = Verdict string.
if "UNKNOWN" in (method_a_pipeline, method_b_pipeline):

Tuple membership test. True if either argument is the UNKNOWN string. Cleaner than two separate equality checks. It has to run before the equality check; otherwise two UNKNOWN inputs would compare equal and be reported as DIRECT.

EXECUTION STATE
📚 "x" in tuple = Membership test. O(n) for tuples; O(1) on average for sets. Tuple is fine here because we only have 2 elements.
return "INSUFFICIENT-INFO"

Cannot make a confident verdict.

if method_a_pipeline == method_b_pipeline:

Same known pipeline ⇒ direct comparison is valid.

return "DIRECT"

Both methods come from the same training stack.

return "CROSS-PIPELINE (annotate with dagger)"

Both pipelines known but different. The paper's convention is to mark this case with † in tables (and explain in the footnote).

v7_keys = [ ... ]

Real example state-dict keys from a V7 checkpoint. Notice rul_branch and health_branch - the legacy names.

EXECUTION STATE
→ 'backbone.cnn.0.weight' = Conv layer weight. Same across V7 and GRACE - both use the same backbone.
→ 'backbone.lstm.weight_ih_l0' = BiLSTM input-to-hidden weight, layer 0. Same across pipelines.
→ 'rul_branch.0.weight' = V7 marker. The .0 indicates the first nn.Sequential element.
→ 'health_branch.0.weight' = V7 marker - the classification head.
grace_keys = [ ... ]

Real example state-dict keys from a GRACE checkpoint. Notice rul_head and health_head.

EXECUTION STATE
→ 'rul_head.0.weight' = GRACE marker. The renaming was the most visible refactor change.
→ 'health_head.0.weight' = GRACE marker.
v7_dropout = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}

Real V7 per-dataset dropout (paper file experiments/train_amnl_v7.py:445-450).

EXECUTION STATE
→ unique values = {0.3, 0.2} - 2 distinct values ⇒ has_per_dataset_dropout = True.
grace_dropout = {"FD001": 0.3, "FD002": 0.3, "FD003": 0.3, "FD004": 0.3}

GRACE refactor uniform dropout (paper file grace/models/model_configs.py:38).

EXECUTION STATE
→ unique values = {0.3} - 1 distinct value ⇒ has_per_dataset_dropout = False.
print(f"AMNL row pipeline : {detect_pipeline(v7_keys, v7_dropout)}")

Detect AMNL row pipeline.

EXECUTION STATE
Output = AMNL row pipeline : V7
print(f"GABA row pipeline : {detect_pipeline(grace_keys, grace_dropout)}")

Detect GABA row pipeline.

EXECUTION STATE
Output = GABA row pipeline : GRACE
print(f"GRACE row pipeline : {detect_pipeline(grace_keys, grace_dropout)}")

Detect GRACE row pipeline (same model class as GABA).

EXECUTION STATE
Output = GRACE row pipeline : GRACE
print()

Blank line.

print(f"AMNL vs GABA : {comparison_verdict('V7', 'GRACE')}")

AMNL came out of V7; GABA came out of GRACE. The verdict is CROSS-PIPELINE - paper marks the AMNL row with † and notes the caveat.

EXECUTION STATE
Output = AMNL vs GABA : CROSS-PIPELINE (annotate with dagger)
print(f"GABA vs GRACE : {comparison_verdict('GRACE', 'GRACE')}")

Both GRACE refactor. Direct comparison valid.

EXECUTION STATE
Output = GABA vs GRACE : DIRECT
print(f"AMNL vs unknown: {comparison_verdict('V7', 'UNKNOWN')}")

If you cannot identify the second pipeline (e.g. someone hands you a checkpoint without metadata), refuse the comparison rather than guess.

EXECUTION STATE
Output = AMNL vs unknown: INSUFFICIENT-INFO
→ reading = Always demand pipeline metadata BEFORE quoting comparative numbers. Reviewers who skip this step routinely produce misleading SOTA tables.
from __future__ import annotations  # lets the `X | None` hints parse on older Pythons

import numpy as np  # imported by convention; not used below


def detect_pipeline(state_dict_keys: list[str],
                    dropout_per_dataset: dict[str, float] | None = None) -> str:
    """Heuristic to identify which paper pipeline produced a checkpoint.

    V7 (legacy AMNL):
        - state-dict keys start with "backbone.", "rul_branch.", "health_branch."
        - per-dataset dropout dict (FD001:0.3, FD002:0.2, ...)
        - separate model class DualTaskEnhancedModel

    GRACE refactor:
        - state-dict keys start with "backbone.", "rul_head.", "health_head."
        - uniform dropout (constant 0.3 across all datasets)
        - unified DualTaskModel + ModelConfig dataclass
    """
    # Substring tests on the head names (the only naming difference).
    has_v7_branches = any("rul_branch." in k for k in state_dict_keys)
    has_grace_heads = any("rul_head." in k for k in state_dict_keys)

    # More than one distinct rate means a per-dataset policy.
    has_per_dataset_dropout = (
        dropout_per_dataset is not None
        and len(set(dropout_per_dataset.values())) > 1
    )

    if has_v7_branches and has_per_dataset_dropout:
        return "V7"
    if has_grace_heads and not has_per_dataset_dropout:
        return "GRACE"
    return "UNKNOWN"


def comparison_verdict(method_a_pipeline: str,
                       method_b_pipeline: str) -> str:
    """Decide if two methods can be compared head-to-head."""
    # Check UNKNOWN first: otherwise two UNKNOWN inputs would compare
    # equal and be reported as DIRECT.
    if "UNKNOWN" in (method_a_pipeline, method_b_pipeline):
        return "INSUFFICIENT-INFO"
    if method_a_pipeline == method_b_pipeline:
        return "DIRECT"
    return "CROSS-PIPELINE (annotate with dagger)"


# ---------- Worked example: classify three real checkpoints ----------
v7_keys = [
    "backbone.cnn.0.weight",   "backbone.cnn.2.weight",
    "backbone.lstm.weight_ih_l0",
    "rul_branch.0.weight",     "rul_branch.2.weight",     # ← V7 marker
    "health_branch.0.weight",
]
grace_keys = [
    "backbone.cnn.0.weight",   "backbone.cnn.2.weight",
    "backbone.lstm.weight_ih_l0",
    "rul_head.0.weight",       "rul_head.2.weight",       # ← GRACE marker
    "health_head.0.weight",
]

v7_dropout    = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}
grace_dropout = {"FD001": 0.3, "FD002": 0.3, "FD003": 0.3, "FD004": 0.3}

print(f"AMNL row pipeline    : {detect_pipeline(v7_keys, v7_dropout)}")
print(f"GABA row pipeline    : {detect_pipeline(grace_keys, grace_dropout)}")
print(f"GRACE row pipeline   : {detect_pipeline(grace_keys, grace_dropout)}")
print()
print(f"AMNL vs GABA  : {comparison_verdict('V7', 'GRACE')}")
print(f"GABA vs GRACE : {comparison_verdict('GRACE', 'GRACE')}")
print(f"AMNL vs unknown: {comparison_verdict('V7', 'UNKNOWN')}")
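
A few edge cases show how the heuristic behaves when the two signals disagree or one of them is missing. The snippet repeats the body of detect_pipeline so that it runs on its own; the key lists are made up for illustration, and the "module." prefix stands in for the kind of prefix a wrapper such as nn.DataParallel adds to state-dict keys.

```python
def detect_pipeline(state_dict_keys, dropout_per_dataset=None):
    """Same logic as the detector in this section, repeated to be runnable."""
    has_v7_branches = any("rul_branch." in k for k in state_dict_keys)
    has_grace_heads = any("rul_head." in k for k in state_dict_keys)
    has_per_dataset_dropout = (
        dropout_per_dataset is not None
        and len(set(dropout_per_dataset.values())) > 1
    )
    if has_v7_branches and has_per_dataset_dropout:
        return "V7"
    if has_grace_heads and not has_per_dataset_dropout:
        return "GRACE"
    return "UNKNOWN"


uniform = {"FD001": 0.3, "FD002": 0.3, "FD003": 0.3, "FD004": 0.3}

cases = {
    # V7-style names but uniform dropout: the signals disagree.
    "v7 names + uniform dropout": (["rul_branch.0.weight"], uniform),
    # V7-style names with no dropout config at all.
    "v7 names + no dropout info": (["rul_branch.0.weight"], None),
    # GRACE-style names with no dropout config: None counts as "not per-dataset".
    "grace names + no dropout info": (["rul_head.0.weight"], None),
    # An extra wrapper prefix does not defeat the substring test.
    "wrapped grace names": (["module.rul_head.0.weight"], uniform),
}

results = {label: detect_pipeline(keys, cfg) for label, (keys, cfg) in cases.items()}
for label, verdict in results.items():
    print(f"{label:<32s}: {verdict}")
```

The third case is worth noticing: with no dropout information the detector still answers GRACE, so the two pipelines are not treated symmetrically when metadata is missing.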

PyTorch: Read State-Dict Signatures

Production version - takes a live nn.Module and inspects its state_dict + module tree to fingerprint the pipeline. Stub V7 / GRACE classes mirror the real ones well enough for the smoke test.

pipeline_signature() with V7 / GRACE stubs
🐍pipeline_signature_torch.py
import torch

Top-level PyTorch.

EXECUTION STATE
📚 torch = Tensor library + autograd + nn modules + optim.
import torch.nn as nn

Module containers - we use nn.Linear, nn.Dropout, nn.Sequential to build the stub models.

def pipeline_signature(model: nn.Module) -> dict:

Inspect a live model and return its pipeline fingerprint. Combines class name, state-dict naming convention, and a representative dropout rate.

EXECUTION STATE
⬇ input: model = Any nn.Module. We read .state_dict().keys() and walk .modules().
⬆ returns = Dict with class name, two boolean signature flags, and a dropout value.
keys = list(model.state_dict().keys())

Materialise the state-dict keys as a Python list. state_dict() returns an OrderedDict with all parameter and buffer paths.

EXECUTION STATE
📚 .state_dict() = PyTorch nn.Module method. Returns an OrderedDict of {name: tensor} covering parameters and persistent buffers. The keys are dotted paths like ‘backbone.cnn.0.weight’.
📚 .keys() = Dict method - returns a view of the keys.
📚 list(view) = Materialise a dict view as a list. Useful for repeated iteration.
sig = {...}

Initialise the signature dict.

"model_class": type(model).__name__,

Read the model's class name. type(obj) returns the class object; .__name__ returns the name as a string.

EXECUTION STATE
📚 type(obj) = Built-in. Returns the class of obj.
→ .__name__ = Special attribute on classes. Returns the class name as a string.
→ DualTaskEnhancedModel ⇒ V7 = The model_class string is the strongest single signal.
"has_v7_branches": any("rul_branch." in k for k in keys),

Same substring-match trick as in the pure-Python detector above.

"has_grace_heads": any("rul_head." in k for k in keys),

Same for GRACE.

for module in model.modules():

.modules() yields every submodule in the tree (depth-first). Iterating it lets us find the FIRST nn.Dropout instance and read its rate.

EXECUTION STATE
📚 .modules() = Iterator yielding self + every child + every grandchild, recursively. Compare with .children() (only direct children).
if isinstance(module, nn.Dropout):

Type check - we only care about nn.Dropout instances.

EXECUTION STATE
📚 isinstance(obj, cls) = Built-in. Returns True if obj is an instance of cls (or a subclass).
sig["dropout_value"] = module.p

nn.Dropout exposes its rate as .p. Read it and stash it in the signature.

EXECUTION STATE
→ why first? = Real models may have several Dropout layers (e.g. one per head, one in the FC funnel). The first one is representative since the GRACE refactor uses uniform rate; the V7 model also uses one rate per training run.
break

Stop after the first Dropout.

EXECUTION STATE
📚 break = Exit the closest enclosing for/while loop.
else:

Loop-else clause. Runs ONLY if the for-loop completed without break. Niche Python feature; here it handles the case where no Dropout module was found.

EXECUTION STATE
📚 for/else = Python loop-else. Runs after the for-loop body if the loop did NOT exit via break. Counter-intuitive but useful for ‘searched but did not find’ patterns.
sig["dropout_value"] = None

Fallback: model has no Dropout layers.

return sig

Return the fingerprint dict.

EXECUTION STATE
⬆ returns = Dict with 4 entries.
class DualTaskEnhancedModel(nn.Module): # V7 legacy

Stub V7 model. Real one has CNN+BiLSTM+Attention+heads but the only thing the detector cares about is the head naming (rul_branch / health_branch).

EXECUTION STATE
→ naming = rul_branch + health_branch ⇒ V7 signature.
def __init__(self, dropout: float = 0.3):

Stub constructor.

super().__init__()

Initialise nn.Module.

self.backbone = nn.Linear(14, 256)

Single-layer stand-in for the real CNN+BiLSTM+Attention backbone.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Fully-connected layer.
⬇ args = (14, 256) - 14 inputs (sensors) → 256 hidden.
self.rul_branch = nn.Sequential(nn.Linear(256, 1), nn.Dropout(dropout))

V7 marker - notice the ‘branch’ suffix. nn.Sequential wraps Linear + Dropout into one named submodule.

EXECUTION STATE
📚 nn.Sequential(*modules) = Run children in order. Each becomes a numbered child (.0, .1, ...) - that is why state-dict keys look like ‘rul_branch.0.weight’.
self.health_branch = nn.Sequential(nn.Linear(256, 3), nn.Dropout(dropout))

Same pattern for the classification head.

class DualTaskModel(nn.Module): # GRACE refactor

Stub GRACE model. Same compute graph as V7 but renamed heads.

EXECUTION STATE
→ naming = rul_head + health_head ⇒ GRACE signature.
def __init__(self, dropout: float = 0.3):

Same constructor.

super().__init__()

Initialise nn.Module.

self.backbone = nn.Linear(14, 256)

Same backbone.

self.rul_head = nn.Sequential(nn.Linear(256, 1), nn.Dropout(dropout))

GRACE marker - ‘head’ suffix.

self.health_head = nn.Sequential(nn.Linear(256, 3), nn.Dropout(dropout))

Same for classification head.

torch.manual_seed(0)

Repro.

EXECUTION STATE
📚 torch.manual_seed(s) = Set the global PyTorch PRNG.
⬇ arg: s = 0 = Conventional canonical seed.
v7_model = DualTaskEnhancedModel(dropout=0.2)

Build a V7 model with FD002 dropout=0.2.

grace_model = DualTaskModel(dropout=0.3)

Build a GRACE model with uniform 0.3.

print(f"{'model':<26s} | class | v7_branches | grace_heads | dropout")

Header row. Format spec :<26s = left-aligned width 26.

print(f"{'-' * 110}")

Separator. str * int repeats the string.

EXECUTION STATE
→ 'x' * n = Python: repeat the string n times. '-' * 5 = '-----'.
for name, m in [("AMNL FD002 (V7 legacy)", v7_model), ("GRACE FD002 (refactor)", grace_model)]:

Iterate two (label, model) pairs.

LOOP TRACE · 2 iterations
name = 'AMNL FD002 (V7 legacy)'
model_class = DualTaskEnhancedModel
has_v7_branches = True
has_grace_heads = False
dropout_value = 0.2
name = 'GRACE FD002 (refactor)'
model_class = DualTaskModel
has_v7_branches = False
has_grace_heads = True
dropout_value = 0.3
sig = pipeline_signature(m)

Run the detector.

print(f"{name:<26s} | {sig['model_class']:<27s} | "

Format-string row.

EXECUTION STATE
→ :<26s = String, left-aligned, min width 26.
→ :<27s = String, left-aligned, min width 27.
f"{str(sig['has_v7_branches']):<11s} | {str(sig['has_grace_heads']):<11s} | "

Continuation. str(bool) converts True/False to 'True'/'False'. The format spec uses :<11s which only works on strings, so the conversion is mandatory.

EXECUTION STATE
📚 str(obj) = Built-in. Calls obj.__str__(). For bools yields 'True' or 'False'.
f"{sig['dropout_value']}")

Tail of the row.

EXECUTION STATE
Output (column padding collapsed) =
model | class | v7_branches | grace_heads | dropout
--------------------------------------------------------------------------------------------------------------
AMNL FD002 (V7 legacy) | DualTaskEnhancedModel | True | False | 0.2
GRACE FD002 (refactor) | DualTaskModel | False | True | 0.3
→ reading = All four signature signals agree per row. The detector would classify the first as V7 and the second as GRACE. The dropout values (0.2 vs 0.3) match the per-dataset and uniform conventions respectively.
import torch
import torch.nn as nn


def pipeline_signature(model: nn.Module) -> dict:
    """Read state-dict keys + dropout config to fingerprint a model.

    Returns:
        Dict with 'model_class', 'has_v7_branches', 'has_grace_heads',
        'dropout_value' if extractable.
    """
    keys = list(model.state_dict().keys())
    sig = {
        "model_class":     type(model).__name__,
        "has_v7_branches": any("rul_branch." in k for k in keys),
        "has_grace_heads": any("rul_head." in k for k in keys),
    }
    # Try to extract a representative dropout value (first nn.Dropout found)
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            sig["dropout_value"] = module.p
            break
    else:
        # Loop finished without break: the model has no Dropout layer
        sig["dropout_value"] = None
    return sig


# ---------- Stub model classes for the smoke test ----------
class DualTaskEnhancedModel(nn.Module):       # V7 legacy
    def __init__(self, dropout: float = 0.3):
        super().__init__()
        self.backbone      = nn.Linear(14, 256)
        self.rul_branch    = nn.Sequential(nn.Linear(256, 1), nn.Dropout(dropout))
        self.health_branch = nn.Sequential(nn.Linear(256, 3), nn.Dropout(dropout))


class DualTaskModel(nn.Module):               # GRACE refactor
    def __init__(self, dropout: float = 0.3):
        super().__init__()
        self.backbone    = nn.Linear(14, 256)
        self.rul_head    = nn.Sequential(nn.Linear(256, 1), nn.Dropout(dropout))
        self.health_head = nn.Sequential(nn.Linear(256, 3), nn.Dropout(dropout))


# ---------- Smoke test ----------
torch.manual_seed(0)
v7_model    = DualTaskEnhancedModel(dropout=0.2)        # FD002 V7 dropout
grace_model = DualTaskModel(dropout=0.3)                # GRACE uniform

print(f"{'model':<26s} | class                       | v7_branches | grace_heads | dropout")
print(f"{'-' * 110}")
for name, m in [("AMNL FD002 (V7 legacy)", v7_model),
                ("GRACE FD002 (refactor)", grace_model)]:
    sig = pipeline_signature(m)
    print(f"{name:<26s} | {sig['model_class']:<27s} | "
          f"{str(sig['has_v7_branches']):<11s} | {str(sig['has_grace_heads']):<11s} | "
          f"{sig['dropout_value']}")
Cross-Pipeline Caveats Beyond C-MAPSS

The same problem shows up wherever a research codebase is refactored mid-paper or methods come from different codebases. Common sources of cross-pipeline drift:

Domain | Common drift source | Detection
RUL prediction (this book) | model class rename + per-dataset dropout | state_dict key prefixes
NLP fine-tuning across HF model cards | tokeniser version, prompt template, data preprocessing | tokenizer.json hash
Object detection (mmdet vs Detectron2) | anchor generator, NMS, label-encoding format | config file diff
Graph neural networks (PyG vs DGL) | message-passing implementation, edge representation | module.__module__ string
Time-series forecasting (Darts vs Prophet) | lag-feature construction, exogenous handling | feature manifest hash
Recsys (DLRM vs Wide&Deep) | embedding init, negative sampling protocol | training config
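
Several of the detection methods in the table come down to hashing an artefact (a tokenizer.json, a feature manifest) and comparing digests. The sketch below shows one way to do that with the standard library; the function names and the two throwaway files are hypothetical, and SHA-256 is simply a convenient choice of digest, not something the text prescribes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter(callable, sentinel) keeps reading until an empty chunk.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(path_a, path_b) -> bool:
    """True when both files have byte-identical contents."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)


# Demo with two small stand-in files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "tokenizer_a.json"
    b = Path(tmp) / "tokenizer_b.json"
    a.write_text('{"version": "1.0"}')
    b.write_text('{"version": "1.1"}')
    print(same_artifact(a, a))  # True
    print(same_artifact(a, b))  # False
```

A byte-level hash is strict: reordering keys or changing whitespace in a JSON file changes the digest even when the meaning is the same, so a mismatch is a prompt to diff the files, not proof of drift.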

Three Cross-Pipeline Pitfalls

Pitfall 1: Comparing methods without naming the pipeline. The most common failure mode in SOTA tables. Always cite the codebase, the commit hash, the seed count, AND the dataset split version. If a method's pipeline is unknown, refuse the comparison rather than guess.
Pitfall 2: Trusting state-dict naming alone. Someone could load a V7 checkpoint into a wrapper that renames the keys; the detector would mis-classify. Combine multiple signals (class name + key prefix + dropout dict + CONFIG VERSION).
Pitfall 3: Treating “cross-pipeline” as “invalid”. Honest cross-pipeline comparisons with explicit annotation ARE valid - they just need the † footnote. The danger is SILENTLY mixing pipelines, not using both at all.
The point. AMNL's Table I row carries a dagger because three medium-impact differences exist between its V7 training pipeline and the GRACE refactor that GABA / GRACE use. The differences cost < 1 cycle RMSE. The conclusion (AMNL wins FD002) survives correction. §16.4 turns the chapter into a deployment rule.
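
Pitfall 2 suggests combining several signals instead of trusting key names alone. A minimal sketch of a two-out-of-three vote over class name, key names, and dropout config follows. The function name, the abstention rules, and the min_votes parameter are assumptions made for illustration; they are not taken from the paper's code.

```python
from collections import Counter


def vote_pipeline(class_name, state_dict_keys, dropout_per_dataset=None, min_votes=2):
    """Each signal votes 'V7' or 'GRACE' or abstains; a label needs min_votes."""
    votes = []

    # Signal 1: model class name.
    if class_name == "DualTaskEnhancedModel":
        votes.append("V7")
    elif class_name == "DualTaskModel":
        votes.append("GRACE")

    # Signal 2: head naming in the state-dict keys (abstain if both or neither).
    has_v7 = any("rul_branch." in k for k in state_dict_keys)
    has_grace = any("rul_head." in k for k in state_dict_keys)
    if has_v7 and not has_grace:
        votes.append("V7")
    elif has_grace and not has_v7:
        votes.append("GRACE")

    # Signal 3: dropout policy (abstain when the config is missing or empty).
    if dropout_per_dataset:
        distinct = set(dropout_per_dataset.values())
        votes.append("V7" if len(distinct) > 1 else "GRACE")

    if not votes:
        return "UNKNOWN"
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else "UNKNOWN"


v7_dropout = {"FD001": 0.3, "FD002": 0.2, "FD003": 0.3, "FD004": 0.2}

# A V7 checkpoint whose keys were renamed by a wrapper: names say GRACE,
# but class name and dropout config still say V7.
print(vote_pipeline("DualTaskEnhancedModel", ["rul_head.0.weight"], v7_dropout))
# Only one signal available: not enough to commit.
print(vote_pipeline("SomeWrapper", ["rul_head.0.weight"], None))
```

With renamed keys the vote still lands on V7, whereas a detector that looked at key names alone would have been fooled.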

Takeaway

  • The dagger means cross-pipeline. AMNL was trained with V7; GABA / GRACE with the refactor.
  • Three real differences: model class name, dropout policy, config source. Two are organisational; one (dropout) costs ~0.5 cycles RMSE on FD002 / FD004.
  • AMNL's wins survive correction. AMNL-on-GRACE FD002 estimate is ≈ 7.2 - still below GABA and DMHA-ATCN.
  • Detector heuristic. state-dict key prefix + dropout config + class name. Three signals; two must agree.
  • Annotate, don't hide. Cross-pipeline comparisons are defensible if explicitly flagged.