The Cold-Start Problem
On a cold morning your car's engine doesn't go straight to peak efficiency. It runs richer for a minute or two, the catalytic converter waits, the emissions-control loop holds back. The on-board computer knows that running the full closed-loop fuel-injection algorithm against COLD sensors produces nonsense, so it gates the controller off until the engine reaches a sensible operating point.
GABA has the same cold-start problem. The closed form assumes that the per-task gradient norms reflect the steady-state task structure of the loss landscape. At step 0 the model is at random initialisation: the gradients reflect random projections of random labels, not the 500× imbalance the paper characterises in §12.3. Feeding those transient gradients into the closed form, and into the EMA on top, produces garbage for the first few hundred steps.
Why Step-0 Gradients Are Untrustworthy
Three reasons the first ~50 training steps look nothing like the steady-state regime characterised in §12.3:
- Random init dominates. At step 0 the backbone is at Kaiming-init values; logits are random. Cross-entropy on near-uniform random logits is close to log C (for C classes) regardless of the true labels. MSE on random RUL predictions is dominated by random-prediction variance, not by useful regression structure.
- The 500× imbalance is a steady-state property. Paper main.tex:319 measures it on epoch-level samples FROM TRAINED MODELS — not at init. At step 0 the ratio can be anywhere from 1× to 100×, depending on which random seed was drawn.
- Adam's own bias-correction kicks in over the first ~100 steps. Adam's first-moment estimator is biased toward zero for roughly 1/(1−β₁) steps; the second-moment estimator for roughly 1/(1−β₂) steps but with smaller variance. Combining adaptive loss weighting with adaptive optimiser estimates during THEIR transient phase amplifies oscillation.
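The length of Adam's transient can be made concrete with a small sketch. It tracks the factor 1 − β^t, which is the fraction of a constant gradient that a zero-initialised moving average has absorbed by step t and also the quantity Adam divides by when it corrects the bias. The value β₁ = 0.9 is the one used elsewhere in this section; β₂ = 0.999 is assumed here as the usual Adam default and is not stated in the text.

```python
def uncorrected_fraction(beta: float, step: int) -> float:
    """Fraction of a constant gradient captured by an EMA started at zero.

    Adam divides its moment estimates by this same factor (1 - beta**step)
    to undo the bias toward zero.
    """
    return 1.0 - beta ** step


for step in (1, 10, 100):
    first = uncorrected_fraction(0.9, step)     # first moment, beta1 = 0.9
    second = uncorrected_fraction(0.999, step)  # second moment, beta2 = 0.999 (assumed default)
    print(f"step {step:>3}: first-moment {first:.3f}, second-moment {second:.3f}")
```

The first-moment factor is close to 1 well before step 100, while the second-moment factor is still small there, which is why the two estimators leave their transient phases at different times.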
Trying to use GABA from step 1 produces wild oscillations in the task weights as the EMA chases transient noise. The visualization below shows this directly: with W = 0 (no warmup) the trajectory starts dropping from 0.5 immediately and overshoots before settling. With the paper's W = 100 the trajectory stays at 0.5 until the model has had a chance to find a stable operating point.
The Warmup Gate (Paper Algorithm 1)
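Paper main.tex:362-374 specifies the gate: while t ≤ W every task gets the uniform weight 1/K; once t > W the weights come from the GABA pipeline (closed form, EMA, floor).
Here W is the warmup duration and t is a 1-indexed step counter. Two important details: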
- The comparison is inclusive. The gate tests t ≤ W, with t the step counter, so step 100 is still in warmup; step 101 is the first active step. Paper code: `self.step_count.item() <= self.warmup_steps`.
- The EMA buffer is NOT updated during warmup. The closed form is not computed; there is nothing to feed into the EMA. The buffer stays at its initial value (paper Algorithm 1 line 2). When the gate flips at step 101, the EMA starts from a sensible 1/K value, which matches the warmup output, so there's no discontinuity in the weights themselves.
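The inclusive boundary is easy to get wrong by one, so here is a minimal sketch of the gate in plain Python. The function name `gate_weights` and the `adaptive_weights` argument are hypothetical; `adaptive_weights` stands in for whatever the GABA pipeline would return after warmup and is not the paper's API.

```python
def gate_weights(step, warmup_steps, num_tasks, adaptive_weights=None):
    """Return uniform 1/K weights during warmup, else the adaptive weights.

    `step` is 1-indexed. The comparison is inclusive, so step == warmup_steps
    still counts as warmup. `adaptive_weights` is a placeholder for the
    output of the adaptive pipeline.
    """
    if adaptive_weights is None or step <= warmup_steps:
        return [1.0 / num_tasks] * num_tasks
    return list(adaptive_weights)


W, K = 100, 2
placeholder = [0.3, 0.7]  # hypothetical post-warmup output, for illustration only

print(gate_weights(100, W, K, placeholder))  # last warmup step: uniform
print(gate_weights(101, W, K, placeholder))  # first active step: adaptive
```

Writing the test as `step <= warmup_steps` rather than `step < warmup_steps` is what makes step 100 the last uniform step when W = 100.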
What Uniform 1/K Means In Practice
For K = 2 (RUL + health), uniform weighting means each task gets weight 1/K = 0.5. The combined loss during warmup is exactly:
L = 0.5 · L_RUL + 0.5 · L_health
In other words, GABA falls back to the ‘Fixed Baseline’ method (paper §3.5) for the first 100 steps. This is intentional. Fixed Baseline is the simplest, safest, most-studied multi-task scheme: it is the standard reference point in MTL work, deep-learning libraries behave predictably on it, and Adam's bias correction runs against a loss whose weights do not move underneath it.
Why warmup doesn't hurt the 500× imbalance regime. One could worry that 100 steps of equal weighting lets the RUL gradient dominate and train backbone features that are useless for health. Empirically the paper's ablations show no measurable harm: 100 steps is a tiny fraction of the full 500-epoch training horizon, and the active GABA controller has 400+ epochs to re-shape the backbone afterwards.
Why W = 100 Steps
The choice W = 100 is paper canonical (main.tex:362). Four justifications:
| Reason | Quantitative tie | Implication |
|---|---|---|
| Match the EMA time constant | τ = 1/(1−β) = 1/0.01 = 100 steps for β=0.99 | Warmup lasts exactly one τ, so the gate and the smoother work on the same timescale |
| Cover Adam first-moment bias correction | 1/(1−β₁) = 10 steps for β₁=0.9 | 10× safety margin over Adam's first-moment transient |
| Long enough that gradient norms reflect data, not init | Empirically: ~50 steps suffice for gradient magnitudes to stabilise | 100 steps gives margin even for harder datasets |
| Short enough not to waste training | 100 / (500 epochs × 100 steps/epoch) ≈ 0.2% of training | Negligible cost for the safety margin |
The robustness ablation in paper §5.8 shows results are statistically indistinguishable across the range of W values it tests; the canonical 100 sits comfortably in the middle.
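The quantitative ties in the table reduce to a few lines of arithmetic. The sketch below recomputes them from the values quoted in the table (β = 0.99, β₁ = 0.9, 500 epochs at 100 steps per epoch); the variable names are illustrative only.

```python
beta_ema = 0.99        # GABA EMA decay, from the table
beta1 = 0.9            # Adam first-moment decay, from the table
warmup_steps = 100
epochs, steps_per_epoch = 500, 100

tau_ema = 1.0 / (1.0 - beta_ema)          # EMA time constant in steps
tau_adam = 1.0 / (1.0 - beta1)            # Adam first-moment time constant
total_steps = epochs * steps_per_epoch
warmup_fraction = warmup_steps / total_steps
adam_margin = warmup_steps / tau_adam     # warmup length relative to Adam's transient

print(f"EMA time constant:      {tau_ema:.0f} steps")
print(f"Adam first-moment:      {tau_adam:.0f} steps")
print(f"Margin over Adam:       {adam_margin:.0f}x")
print(f"Warmup share of run:    {warmup_fraction:.1%}")
```

Changing β or the schedule in this snippet is a quick way to check whether a different W would still cover one EMA time constant without taking a noticeable share of training.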
What Happens At Step W + 1
At step W + 1 = 101 the gate flips and three things happen in one update:
- Closed form computes. For the paper-realistic 500× imbalance the raw weights land far from uniform.
- EMA absorbs 1% of the new value. With β = 0.99 the buffer keeps 99% of its old value and mixes in 1% of the raw one. Tiny shift: the EMA has a long memory.
- Floor is inactive. Both EMA values are still above the floor; clamp is a no-op; renormalisation divides by 1.
So the first active step moves the weight only slightly away from 0.5: a 0.5% relative shift, invisible to the optimiser. Convergence to the steady-state 0.04762 takes another 300 steps. Total: warmup + post-warmup settling = 400 steps to reach the floor-bound regime, out of typically tens of thousands of training steps (50,000 for a 500-epoch run at 100 steps per epoch).
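How small the first active step is can be bounded without knowing the raw closed-form value. The sketch below assumes the standard EMA form, buffer ← β · buffer + (1 − β) · raw, with β = 0.99 and a buffer starting at 0.5; the raw inputs in the loop are hypothetical. Because any raw weight lies in [0, 1], the shift can never exceed (1 − β) · 0.5 = 0.005, that is 1% of the starting value.

```python
def ema_step(buffer: float, raw: float, beta: float = 0.99) -> float:
    """One EMA update: keep `beta` of the buffer, mix in (1 - beta) of `raw`."""
    return beta * buffer + (1.0 - beta) * raw


start = 0.5  # uniform 1/K for K = 2, the buffer's value when the gate flips

for raw in (0.0, 0.25, 1.0):  # hypothetical raw closed-form outputs
    new = ema_step(start, raw)
    relative_shift = abs(new - start) / start
    print(f"raw={raw:.2f} -> weight {new:.4f} (relative shift {relative_shift:.2%})")
```

The bound holds for every task and every raw value, so the hand-over from warmup to the active controller stays gentle regardless of how skewed the closed form is.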
Interactive: Slide The Warmup Boundary
Drag the W slider. The amber band marks the warmup region; inside it the weight is pinned at 0.5. Outside the band the GABA pipeline runs. The dashed grey line is the same simulation with W = 0 for comparison. Watch how W = 0 immediately starts dropping while W = 100 stays flat through the window.
Try this. Set W = 0 and watch the blue and grey traces collapse onto each other — no warmup, immediate GABA. Set W = 300 and watch the trace stay at 0.5 for the entire visible window — warmup eats the whole simulation. Paper's W = 100 is the sweet spot: enough warmup to dodge the cold-start transient, short enough to leave the rest of training in active mode.
Python: Warmup Gate From Scratch
Implement the full GABA per-step update with the warmup gate in pure NumPy. Run for 500 steps and print the trajectory at milestone steps so the warmup plateau and the post-warmup settling are visible.
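What follows is a minimal NumPy sketch of such a simulation, not the paper's code. Three things are assumptions made here for illustration: the closed form is taken to be inverse-gradient-norm weighting (for K = 2, each task's raw weight is the other task's share of the total gradient norm), the floor is set to 0.05, and the gradient norms are held constant at a 500:1 ratio with no noise. The function name `gaba_step` is hypothetical.

```python
import numpy as np


def gaba_step(step, ema, grad_norms, warmup_steps=100, beta=0.99, floor=0.05):
    """One weight update. `step` is 1-indexed. Returns (weights, ema)."""
    num_tasks = len(grad_norms)
    if step <= warmup_steps:
        # Inclusive gate: uniform weights, EMA buffer left untouched.
        return np.full(num_tasks, 1.0 / num_tasks), ema
    total = grad_norms.sum()
    # ASSUMED closed form: tasks with larger gradient norms get smaller weights.
    raw = (total - grad_norms) / ((num_tasks - 1) * total)
    ema = beta * ema + (1.0 - beta) * raw
    # ASSUMED floor value; clamp, then renormalise so the weights sum to 1.
    clamped = np.maximum(ema, floor)
    return clamped / clamped.sum(), ema


num_tasks = 2
grad_norms = np.array([500.0, 1.0])           # [RUL, health], fixed 500x ratio
ema = np.full(num_tasks, 1.0 / num_tasks)     # buffer initialised to 1/K
history = {}
for step in range(1, 501):
    weights, ema = gaba_step(step, ema, grad_norms)
    history[step] = weights

for step in (1, 50, 100, 101, 200, 300, 400, 500):
    w_rul, w_health = history[step]
    print(f"step {step:>3}: w_rul={w_rul:.4f}  w_health={w_health:.4f}")
```

Under these assumptions the printout shows the plateau at 0.5 through step 100, a barely visible change at step 101, and a slow decay of the RUL weight until the floor takes over. The exact settling value depends on the paper's actual closed form and floor, which this sketch does not reproduce.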
PyTorch: The Paper's Branching Code
The actual paper code lives at grace/core/gaba.py:107-108 and is two lines: `if shared_params is None or self.step_count.item() <= self.warmup_steps:` followed by `weights = torch.ones(K, device=device) / K`. We extract the gate (and its surrounding state-management machinery) into a standalone class for clarity.
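A standalone version of the gate might look like the sketch below. The class name `WarmupGate` and the `adaptive_weights` argument are hypothetical: the paper's condition checks `shared_params`, whereas here a precomputed weight tensor stands in for the rest of the pipeline. The step counter is registered as a buffer so that it is saved and restored with the module's state.

```python
import torch
from torch import nn


class WarmupGate(nn.Module):
    """Uniform 1/K weights for the first `warmup_steps` calls (illustrative)."""

    def __init__(self, num_tasks: int, warmup_steps: int = 100):
        super().__init__()
        self.num_tasks = num_tasks
        self.warmup_steps = warmup_steps
        # A buffer is included in state_dict(), so a resumed run keeps its count.
        self.register_buffer("step_count", torch.zeros((), dtype=torch.long))

    def forward(self, adaptive_weights=None):
        self.step_count += 1  # 1-indexed: the first call is step 1
        if adaptive_weights is None or self.step_count.item() <= self.warmup_steps:
            device = self.step_count.device
            return torch.ones(self.num_tasks, device=device) / self.num_tasks
        return adaptive_weights


gate = WarmupGate(num_tasks=2, warmup_steps=3)
placeholder = torch.tensor([0.2, 0.8])  # hypothetical pipeline output
for _ in range(4):
    print(gate(placeholder))            # three uniform outputs, then the placeholder

# Simulate a checkpoint resume: the counter travels with the state dict.
resumed = WarmupGate(num_tasks=2, warmup_steps=3)
resumed.load_state_dict(gate.state_dict())
print(int(resumed.step_count))
```

Keeping the counter as a plain Python attribute instead would reset it to zero on resume and re-enter warmup in the middle of training.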
Warmup Patterns In Other Pipelines
| Field | Warmup mechanism | Typical duration | Why it's needed |
|---|---|---|---|
| Predictive maintenance (this paper) | GABA gate (uniform 1/K for first W steps) | 100 steps | Avoid cold-start transient in gradient norms |
| Optimisation: linear LR warmup | Linear ramp from 0 to peak LR | 500-2,000 steps (BERT, GPT) | Adam's second-moment estimator is biased; small LR avoids divergence |
| Computer vision: BatchNorm running stats | First few epochs use batch stats only | 1-5 epochs | Running-mean estimates need data to be meaningful |
| Reinforcement learning: replay buffer fill | No gradient updates until buffer holds N samples | 10K-50K steps | Off-policy methods need a non-trivial buffer to sample from |
| Generative models: classifier-free guidance | First few steps use unconditional generation | 10% of denoising steps | Avoids over-conditioning on weak class signal early in sampling |
| Training schedules: gradual unfreezing | Train head first, then unfreeze backbone layers progressively | 1 epoch per layer | Prevents catastrophic forgetting in fine-tuning |
| Continual learning: rehearsal warmup | Train on old + new data with fixed mixing for first epoch | 1-3 epochs | Lets the new task settle before adaptive task-balancing kicks in |
The pattern — gate the adaptive controller off until the system has reached a sensible operating point — recurs across nearly every adaptive learning pipeline. GABA inherits the pattern; the implementation is just a step counter and an inclusive comparison.
Pitfalls In Warmup Gating
- `step_count` must be a registered buffer so that `state_dict()` captures it.
Takeaway
- Warmup gates the adaptive controller off for the first W = 100 steps. Paper Algorithm 1 lines 4-6: if t ≤ W return uniform weights; otherwise run the GABA pipeline.
- W = 100 matches the EMA time constant. One τ = 1/(1−β) = 100 steps. This alignment isn't coincidence: the weights stay uniform for exactly one time constant before adaptive logic kicks in.
- The EMA buffer is not updated during warmup. It stays at the initial 1/K, which matches the warmup output, so there is no discontinuity in the weights at step W+1.
- The transition is smooth. The first active step shifts the weight by about 0.5% for the realistic 500× imbalance, well below Adam's noise floor.
- step_count must be a buffer. So checkpoint resume doesn't restart the warmup counter mid-training.
- The pattern is universal. Linear LR warmup, BatchNorm running stats, replay-buffer fill, gradual unfreezing — every adaptive controller benefits from a startup gate.