Hook: A Microphone, A Hiss, And A Capacitor
Plug a vocal microphone into a cheap mixing board and you hear two things: the singer's voice (the slow signal you want) and an air hiss from the preamp (the fast noise you don't). To clean it up, every audio engineer reaches for the same tool: a low-pass filter. Pass the slow voice through; reject the high-frequency hiss. The simplest realisation is a single resistor and a single capacitor (a textbook RC low-pass network), and the cutoff frequency is set by $f_c = 1/(2\pi RC)$. That circuit dates to 1899 and it has not gone out of style.
GABA solves the same problem in software. The closed-form rule from §19.1 emits a per-step weight that is brutally noisy: the per-batch gradient ratio in the paper fluctuates between 10× and 10,000× from one minibatch to the next (paper main.tex:576), so the un-smoothed weight swings wildly. Feeding that directly into the optimiser would make the loss combination oscillate and wreck training. The fix is exactly the audio-engineer move: put a low-pass filter between the controller and the actuator. Paper main.tex:387 names that filter explicitly: the EMA with $\beta = 0.99$ “serves as a first-order IIR low-pass filter” with a time constant of about 100 steps “that smooths stochastic gradient noise, preventing oscillation.”
Why this section matters. The EMA buffer is not just an empirical smoother; it is a digital realisation of the same RC low-pass topology that an audio engineer reaches for. Once we identify the filter, we get its frequency response, its time constant, and its noise-rejection figure for free from a hundred years of DSP textbooks. We'll show that the time constant is $\tau = 1/(1-\beta)$ samples, derive the magnitude response $|H(e^{j\omega})|$, and show why $\beta = 0.99$ is the lowest setting that still rejects the paper's observed batch-to-batch noise by roughly 30 dB or more.
What Is A First-Order IIR Low-Pass Filter?
A digital filter is a recipe that turns one sequence of samples $x[t]$ into another sequence $y[t]$. There are exactly two flavours.
| Flavour | Recursion | Memory | Examples |
|---|---|---|---|
| FIR (Finite Impulse Response) | y[t] = Σ_k a_k · x[t−k] (no past y) | Bounded — only depends on the last N inputs | Moving average; convolution kernel; CNN layer |
| IIR (Infinite Impulse Response) | y[t] = Σ_k a_k · x[t−k] + Σ_k b_k · y[t−k] (past y appears) | Unbounded — recursive memory of all past inputs | EMA; RC circuit; Kalman filter; GABA |
IIR filters are recursive: the new output depends on past outputs. That recursion is what gives them infinite memory and an exponential decay structure. The simplest non-trivial IIR is the first-order low-pass:
$$y[t] = \beta\,y[t-1] + (1-\beta)\,x[t]$$
Two coefficients, both fixed by a single parameter $\beta \in [0, 1)$. The new output is a convex combination of yesterday's output (weight $\beta$) and today's input (weight $1-\beta$). Because the coefficients sum to 1, the filter passes a constant signal through unchanged, which is what engineers call “unity DC gain.”
The transfer function (z-transform of the recursion) is
$$H(z) = \frac{Y(z)}{X(z)} = \frac{1-\beta}{1-\beta z^{-1}}$$
That single pole on the real axis at $z = \beta$ is what makes this a one-pole low-pass. The pole at $z = 0.99$ sits very close to the unit circle (very close to the boundary of instability), which is exactly why it has such a long memory.
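For readers who want the intermediate step, the transfer function of the recursion $y[t] = \beta\,y[t-1] + (1-\beta)\,x[t]$ follows from a standard z-transform manipulation. This is a textbook sketch, not a derivation quoted from the paper:

$$
\begin{aligned}
Y(z) &= \beta z^{-1}\,Y(z) + (1-\beta)\,X(z) \\
H(z) &= \frac{Y(z)}{X(z)} = \frac{1-\beta}{1-\beta z^{-1}} = \frac{(1-\beta)\,z}{z-\beta} \\
H(1) &= \frac{1-\beta}{1-\beta} = 1
\end{aligned}
$$

The denominator vanishes at $z = \beta$, which is the single pole, and evaluating at $z = 1$ recovers the unity DC gain.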
Why The EMA Update IS That Filter
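Compare the two recursions side by side. The paper's EMA update is paper eq. 5 (main.tex:346), written here with $\hat{\lambda}[t]$ for the raw weight and $\bar{\lambda}[t]$ for the buffer (this section's shorthand; the paper's own symbols may differ):
$$\bar{\lambda}[t] = \beta\,\bar{\lambda}[t-1] + (1-\beta)\,\hat{\lambda}[t]$$
The first-order IIR low-pass recursion is
$$y[t] = \beta\,y[t-1] + (1-\beta)\,x[t]$$
Rename $x[t] \to \hat{\lambda}[t]$ (the raw per-step weight from the closed-form P-controller) and $y[t] \to \bar{\lambda}[t]$ (the smoothed weight). The two recursions are the same equation, character for character. The EMA buffer is a first-order digital low-pass IIR filter applied to the closed-form weight stream.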
The impulse response (what the filter does to a single, isolated sample at $t = 0$) falls out of the recursion by unrolling:
$$h[k] = (1-\beta)\,\beta^{k}, \quad k \ge 0$$
Geometric decay with ratio $\beta$. At $\beta = 0.99$, the weight on a sample 100 steps ago is $(1-\beta)\,\beta^{100} \approx 0.0037$, about 37% of the weight on the most recent sample. Every past sample keeps a nonzero weight forever; this unbounded but exponentially-decaying memory is the “I” in IIR.
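The unrolling can be checked numerically. The sketch below runs the first-order recursion on a unit impulse and compares it with the closed form $(1-\beta)\beta^k$; the function name `ema_impulse_response` is illustrative and does not come from the paper's code.

```python
def ema_impulse_response(beta, n):
    """Run y[t] = beta*y[t-1] + (1-beta)*x[t] on a unit impulse at t = 0."""
    y, out = 0.0, []
    for t in range(n):
        x = 1.0 if t == 0 else 0.0
        y = beta * y + (1.0 - beta) * x
        out.append(y)
    return out


beta = 0.99
h = ema_impulse_response(beta, 501)

# Closed-form geometric weights for comparison
closed_form = [(1.0 - beta) * beta**k for k in range(501)]
max_err = max(abs(a - b) for a, b in zip(h, closed_form))

print(f"h[0]          = {h[0]:.5f}")
print(f"h[100]        = {h[100]:.5f}")
print(f"h[100] / h[0] = {h[100] / h[0]:.3f}")
print(f"max |recursion - closed form| = {max_err:.2e}")
```

The ratio `h[100] / h[0]` is the "about 37%" figure: a sample one time constant old still carries roughly a third of the weight of the newest sample.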
The Time Constant: τ = 1 / (1 − β) = 100 Steps
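Every first-order low-pass has a single number that summarises its memory: the time constant $\tau$. For the discrete EMA filter it is
$$\tau = \frac{1}{1-\beta}$$
(The exact e-folding time of $\beta^k$ is $-1/\ln\beta$, which agrees with $1/(1-\beta)$ to within about 1% for $\beta$ close to 1.) The paper's default $\beta = 0.99$ gives $\tau = 100$ steps. Three concrete consequences fall out of this number: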
- Step-response settling. If the input jumps from 0 to 1 at $t = 0$, the filter output reaches about 63% at $t = \tau$, 86% at $t = 2\tau$, and 95% at $t = 3\tau$. These are the same canonical milestones as an analog RC low-pass.
- Effective averaging window. The geometric weights $(1-\beta)\beta^k$ have an effective window of about $\tau$ samples: ~63% of the total weight comes from the most recent $\tau$ samples. This is roughly comparable to a uniform moving average of length $\tau$, but causal and recursive (constant memory and compute per step instead of $O(\tau)$).
- Adaptation speed. If the ‘true’ weight jumps from 0.5 to 0.998 (the paper's post-warmup transition), the filter takes about $3\tau = 300$ steps to close 95% of the gap to the new value. Paper main.tex:576 confirms this empirically, reporting convergence from the equal initialization (0.5) “within ~10 epochs”; 10 epochs at the paper's ~30 batches/epoch is about 300 steps, i.e. $3\tau$.
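The three consequences above reduce to a few lines of arithmetic. The sketch below assumes the step response $1 - \beta^n$ after $n$ samples and reads "95% settled" as a residual of at most 5% of the jump size; variable names are illustrative.

```python
import math

beta = 0.99
tau = 1.0 / (1.0 - beta)            # nominal time constant (100 steps)
efold = -1.0 / math.log(beta)       # exact e-folding time of beta**k

# Fraction of the total EMA weight carried by the most recent tau samples
recent_fraction = 1.0 - beta ** round(tau)

# Smallest n with beta**n <= 0.05, i.e. output within 5% of the jump size
n_settle = math.ceil(math.log(0.05) / math.log(beta))

# Applied to the 0.5 -> 0.998 transition discussed above
start, target = 0.5, 0.998
value_at_settle = target + (start - target) * beta**n_settle

print(f"tau = {tau:.1f} steps, exact e-folding time = {efold:.1f} steps")
print(f"weight in most recent {round(tau)} samples: {recent_fraction:.3f}")
print(f"steps to close 95% of a jump: {n_settle}")
print(f"smoothed weight after {n_settle} steps: {value_at_settle:.4f}")
```

The settling count lands just under $3\tau$, which is why "about 300 steps" is a safe rule of thumb for $\beta = 0.99$.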
Frequency Response And The −3 dB Cutoff
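Substitute $z = e^{j\omega}$ into the transfer function to get the steady-state response to a sinusoid at angular frequency $\omega$ rad/sample:
$$|H(e^{j\omega})| = \frac{1-\beta}{\sqrt{1 - 2\beta\cos\omega + \beta^{2}}}$$
At $\omega = 0$ (DC): $|H| = 1$, unity gain at zero frequency. At $\omega = \pi$ (Nyquist): $|H| = (1-\beta)/(1+\beta)$, the maximum possible attenuation. For $\beta = 0.99$, that is $0.01/1.99 \approx 0.0050$, or $-46.0$ dB. Between DC and Nyquist the magnitude is monotonically decreasing.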
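The conventional summary number is the −3 dB cutoff: the frequency at which the magnitude drops to $1/\sqrt{2}$. Solving $|H(e^{j\omega_c})|^2 = 1/2$ for the first-order IIR gives
$$\cos\omega_c = 1 - \frac{(1-\beta)^2}{2\beta}, \qquad \omega_c \approx 1-\beta \ \text{ for } \beta \to 1$$
For $\beta = 0.99$: $\omega_c \approx 0.010$ rad/sample, so $f_c = \omega_c / 2\pi \approx 0.0016$ cycles/step. Past that cutoff the filter rolls off at the textbook −20 dB/decade for any first-order low-pass, the same Bode slope you see on a single-RC stage.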
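As a numerical cross-check of the cutoff, the sketch below solves $|H|^2 = 1/2$ for the recursion $y[t] = \beta\,y[t-1] + (1-\beta)\,x[t]$ in closed form and confirms that the magnitude at that frequency is $1/\sqrt{2}$. The helper names are illustrative. The closed form only applies when the arccos argument lies in $[-1, 1]$, which holds for $\beta$ above roughly 0.17; below that the filter never reaches −3 dB.

```python
import math


def magnitude(beta, omega):
    """|H(e^{j*omega})| for the first-order low-pass, omega in rad/sample."""
    return (1.0 - beta) / math.sqrt(1.0 - 2.0 * beta * math.cos(omega) + beta**2)


def cutoff_omega(beta):
    """Exact -3 dB frequency in rad/sample, from |H|^2 = 1/2."""
    return math.acos(1.0 - (1.0 - beta) ** 2 / (2.0 * beta))


for beta in (0.5, 0.9, 0.99, 0.999):
    w_c = cutoff_omega(beta)
    f_c = w_c / (2.0 * math.pi)
    print(
        f"beta={beta:<6} omega_c={w_c:.5f} rad/sample  "
        f"f_c={f_c:.5f} cycles/step  |H(omega_c)|={magnitude(beta, w_c):.4f}  "
        f"approx (1-beta)={1.0 - beta:.5f}"
    )
```

The last column shows how quickly the approximation $\omega_c \approx 1-\beta$ degrades as $\beta$ moves away from 1.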
Interactive: Impulse And Frequency Response
Two coordinated panels. The left shows the impulse-response weights $h[k]$: the memory of the filter, stem-plot style. The right shows the magnitude Bode plot $|H|$ in dB versus normalised frequency. Move the $\beta$ slider to retune the filter; move the probe-frequency slider to inject a sinusoidal disturbance and read off the steady-state attenuation.
Three guided experiments to run on the viz:
- Memory length. Click the $\beta = 0.99$ preset. Read the time-constant marker at $\tau = 100$ on the impulse panel. About 63% of the total weight sits to the left of that marker; this is the “effective averaging window.”
- Noise rejection at paper-realistic frequencies. Set the probe slider to $f = 0.1$ cycles/step (one oscillation every 10 steps, paper-realistic for batch-to-batch gradient noise). Read the dB number on the right panel: at $\beta = 0.99$ the attenuation is about −36 dB, i.e. the noise amplitude shrinks by a factor of about 62.
- Why $\beta$ can't be smaller. Drop $\beta$ to 0.5. The time constant collapses to 2 steps, the cutoff frequency moves right by a factor of roughly 70 (from about 0.0016 to about 0.115 cycles/step), and the same probe at $f = 0.1$ now sees only about −2.5 dB of attenuation. The smoothed weight would shadow the noisy raw weight almost step for step, which is precisely the oscillation regime that paper main.tex:553 attributes to GradNorm's divergence on 1/5 N-CMAPSS seeds.
Why β = 0.99 Survives The 10× to 10,000× Disturbance Range
The paper's Fig. gradient_dynamics reports that the per-batch gradient ratio fluctuates between 10× and 10,000× on the log scale, thin-line per-seed curves piling on top of a slowly-drifting mean (paper main.tex:576). On the simplex this corresponds to a raw un-smoothed weight swinging between approximately 0.0001 and 0.1 every few batches.
| β | τ (samples) | f_c (cycles/step) | Gain at f = 0.1 cycles/step | Attenuation | Verdict |
|---|---|---|---|---|---|
| 0.50 | 2 | 0.115 | 0.75 | 1.3× | Almost no smoothing: output ≈ input. Oscillates. |
| 0.90 | 10 | 0.017 | 0.17 | 5.9× | Mild smoothing. Still visibly noisy. |
| 0.99 (paper) | 100 | 0.0016 | 0.016 | ≈ 62× | 30+ dB rejection. Tracks the trend, ignores per-batch noise. |
| 0.999 | 1000 | 1.6e-4 | 0.0016 | ≈ 620× | Over-smoothed: adapts 10× slower than paper default. |
| 1.000 | ∞ | 0 | 0 | ∞ | Degenerate: the input weight is 0, so the output stays frozen at its initial state. Not a usable filter. |
The $\beta = 0.99$ row sits at the knee: enough rejection (about 62×) that the per-batch noise is invisible in the smoothed weight, but not so much that the filter drags behind a real regime shift. The paper main.tex:647 reports that across 40 runs the converged smoothed weight has a coefficient of variation of 0.14%, which is only achievable with this much rejection.
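The gain and attenuation columns of a table like this can be regenerated from the exact magnitude formula $|H| = (1-\beta)/\sqrt{1 - 2\beta\cos\omega + \beta^2}$. The sketch below does that for a probe at 0.1 cycles/step; the function name `gain` and the row layout are illustrative choices, not taken from the paper's code.

```python
import math


def gain(beta, f):
    """|H| of y[t] = beta*y[t-1] + (1-beta)*x[t] at f cycles/step."""
    omega = 2.0 * math.pi * f
    return (1.0 - beta) / math.sqrt(1.0 - 2.0 * beta * math.cos(omega) + beta**2)


probe_f = 0.1  # cycles/step: one oscillation every 10 steps
rows = []
for beta in (0.5, 0.9, 0.99, 0.999):
    g = gain(beta, probe_f)
    # (beta, tau, gain, attenuation factor, attenuation in dB)
    rows.append((beta, 1.0 / (1.0 - beta), g, 1.0 / g, -20.0 * math.log10(g)))

print("beta     tau      gain   atten      dB")
for beta, tau, g, atten, db in rows:
    print(f"{beta:<6} {tau:6.0f} {g:9.4f} {atten:7.1f}x {db:7.1f}")
```

Changing `probe_f` to 0.05 shows how the rejection at $\beta = 0.99$ drops to roughly 30 dB at the lower edge of the disturbance band.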
Python: IIR Filter From Scratch (NumPy)
Build the filter as a small class with three methods: a single sample step, a vector wrapper, and a frequency-domain probe. Run a unit-step input to observe the canonical 63%/86%/95% milestones at $t = \tau, 2\tau, 3\tau$, and a sinusoidal probe to verify the magnitude formula numerically.
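A minimal NumPy sketch of such a class might look like the following. The class and method names (`FirstOrderLowPass`, `step`, `filter`, `magnitude_db`) are assumptions made for illustration and are not the names used in grace/core/gaba.py. The sinusoidal probe estimates the steady-state amplitude from the RMS of the last 20 full periods.

```python
import numpy as np


class FirstOrderLowPass:
    """y[t] = beta * y[t-1] + (1 - beta) * x[t], with y[-1] = y0."""

    def __init__(self, beta=0.99, y0=0.0):
        if not 0.0 <= beta < 1.0:
            raise ValueError("beta must lie in [0, 1)")
        self.beta = beta
        self.y = y0

    def step(self, x):
        """Consume one sample and return the updated output."""
        self.y = self.beta * self.y + (1.0 - self.beta) * x
        return self.y

    def filter(self, xs):
        """Vector wrapper: apply step() to each sample in order."""
        return np.array([self.step(x) for x in np.asarray(xs, dtype=float)])

    def magnitude_db(self, f):
        """Theoretical gain in dB at f cycles/step, from H(z) at z = e^{j 2 pi f}."""
        z = np.exp(1j * 2.0 * np.pi * f)
        h = (1.0 - self.beta) / (1.0 - self.beta / z)
        return 20.0 * np.log10(np.abs(h))


# Unit-step response: percentage of the step absorbed after n samples
lp = FirstOrderLowPass(beta=0.99)
y = lp.filter(np.ones(300))
for n in (1, 100, 200, 300):
    print(f"after {n:3d} samples: {100 * y[n - 1]:.2f}%")

# Sinusoidal probe at 0.05 cycles/step (period 20 samples), fresh filter state
f_probe = 0.05
t = np.arange(3000)
probe = FirstOrderLowPass(beta=0.99)
out = probe.filter(np.sin(2.0 * np.pi * f_probe * t))
tail = out[-400:]  # 20 full periods, long after the transient has decayed
measured_db = 20.0 * np.log10(np.sqrt(2.0 * np.mean(tail**2)))
print(f"measured {measured_db:.1f} dB, formula {probe.magnitude_db(f_probe):.1f} dB")
```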
The output is the textbook step response: 1.00% absorbed in one sample, 63.4% at $t = 100$, 86.6% at $t = 200$, 95.1% at $t = 300$. The frequency probe at $f = 0.05$ cycles/step hits −29.9 dB, a 31× attenuation on a paper-realistic disturbance frequency.
PyTorch: The EMA Branch Of GABALoss Verbatim
The production code in grace/core/gaba.py writes the EMA recursion directly inside GABALoss.forward_k. To isolate it as a standalone module, we extract just the recursion and the buffer-registration logic. The result is bit-exact to the EMA branch of GABALoss when the floor is disabled.
The smoke test reproduces the paper's warmup-to-active transition: 100 steps of equal-share input leave the buffer untouched (it was already at 0.5); the next 100 steps push it from 0.5 toward 0.998 with the canonical exponential trajectory, hitting about 0.184/0.816 at 100 steps post-transition, the textbook ~63% settling.
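The numbers in that smoke test can be reproduced without PyTorch. The sketch below is a framework-free stand-in, not the GABALoss code: it assumes a two-task weight vector on the simplex, a hypothetical post-warmup raw weight of (0.002, 0.998), and an illustrative helper name `ema_update`.

```python
def ema_update(buffer, raw, beta=0.99):
    """One EMA write-back per task weight: beta * old + (1 - beta) * raw."""
    return [beta * b + (1.0 - beta) * r for b, r in zip(buffer, raw)]


weights = [0.5, 0.5]  # equal initialization

# Warmup: raw weights equal the buffer, so nothing moves
for _ in range(100):
    weights = ema_update(weights, [0.5, 0.5])
after_warmup = list(weights)

# Active phase: raw weights jump to (0.002, 0.998) (hypothetical values)
for _ in range(100):
    weights = ema_update(weights, [0.002, 0.998])

print("after warmup:     ", [round(w, 3) for w in after_warmup])
print("100 steps later:  ", [round(w, 3) for w in weights])
print("fraction of jump: ", round((weights[1] - 0.5) / (0.998 - 0.5), 3))
```

Because each update is a convex combination of two vectors that sum to 1, the smoothed weights stay on the simplex at every step.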
The Same Filter In Other Domains
The first-order IIR low-pass is one of the most widely-deployed algorithms in engineering. It shows up under many names because the recursion is the same regardless of what is being filtered.
| Domain | Filter purpose | β-equivalent | Time constant |
|---|---|---|---|
| GABA EMA buffer (this paper) | Smooth per-batch gradient-ratio noise | β = 0.99 | τ = 100 steps |
| Adam optimiser (Kingma & Ba 2015), 1st moment | Smooth per-step gradient direction | β₁ = 0.9 | τ ≈ 10 steps |
| Adam optimiser, 2nd moment | Smooth per-step squared gradient | β₂ = 0.999 | τ ≈ 1000 steps |
| BatchNorm running statistics | Smooth per-batch activation mean/var for inference | β = 0.99 (Keras default; PyTorch's default momentum=0.1 corresponds to β = 0.9) | τ ≈ 100 batches (≈ 10 at the PyTorch default) |
| BYOL / MoCo target encoder (Grill et al. 2020; He et al. 2020) | EMA-track the online network for stable contrastive targets | β = 0.99 to 0.9999 | τ = 100 to 10,000 steps |
| Polyak averaging (1992) for trained NN evaluation | Smooth final epochs of weights for evaluation | β = 0.999 | τ = 1000 steps |
| Reinforcement learning Q-target soft update (DQN extensions) | Smooth target Q-network parameters for stable bootstrapping | τ_polyak = 0.005 ⇒ β = 0.995 | τ = 200 updates |
| RC low-pass audio filter (analog) | Reject high-frequency hiss while passing voice | β = exp(−Δt / RC) | τ = RC seconds |
| Pulse-oximeter heart-rate display | Smooth per-pulse rate to a stable Hz reading | β ≈ 0.95 | τ ≈ 20 pulses |
| Smartphone accelerometer gravity vector | Estimate gravity by low-pass filtering (g + linear acceleration) | β ≈ 0.9 | τ ≈ 10 samples |
The pattern is “recursion + state buffer + DC-unity gain.” The specific value of $\beta$ in each row encodes a domain-specific tradeoff between adaptation speed and noise rejection, but the algorithm and the analysis are identical. Recognising the filter in one row gives you the time constant, the cutoff frequency, and the noise-rejection figure in every other row for free.
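To make the "same filter, different $\beta$" point concrete, the sketch below converts a few of the conventions in the table into the common $(\beta, \tau)$ pair. The helper names are illustrative, and the RC example uses hypothetical component values (RC = 1 ms sampled at 48 kHz) chosen only to show the conversion.

```python
import math


def tau_from_beta(beta):
    """Nominal time constant in samples, tau = 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


def beta_from_soft_update(rate):
    """Soft target update target <- (1 - rate)*target + rate*online, so beta = 1 - rate."""
    return 1.0 - rate


def beta_from_rc(dt, rc):
    """Discretised RC stage sampled every dt seconds: beta = exp(-dt / RC)."""
    return math.exp(-dt / rc)


examples = {
    "GABA EMA buffer": 0.99,
    "Adam first moment": 0.9,
    "Adam second moment": 0.999,
    "RL soft update, rate 0.005": beta_from_soft_update(0.005),
    # Hypothetical analog stage, values for illustration only
    "RC = 1 ms sampled at 48 kHz": beta_from_rc(1.0 / 48_000, 1e-3),
}

for name, beta in examples.items():
    print(f"{name:30s} beta = {beta:.5f}   tau = {tau_from_beta(beta):7.1f} samples")
```

For the RC row the result is close to RC divided by the sample period (48 samples), which is the discrete counterpart of $\tau = RC$ seconds.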
Cross-domain analogy: BatchNorm running stats. BatchNorm tracks the running mean and variance of activations with an EMA; at $\beta = 0.99$ (the Keras default; PyTorch's default momentum of 0.1 corresponds to $\beta = 0.9$) it has identical numerics to GABA's weight smoother, identical time constant, and identical motivation (per-batch statistics are noisy and you want the steady trend). When you train a CNN on ImageNet you rely on this filter every step without thinking about it. GABA is applying exactly the same trick to a different signal: per-batch gradient ratios instead of per-batch activation statistics.
Pitfalls In Tuning Or Replacing The EMA
.detach() on the EMA write-back in PyTorch. Without raw.detach() in the recursion, every training step grows the autograd graph by one more EMA-update node; over a few hundred steps the graph balloons to gigabytes and the process is killed by an out-of-memory error. The code at grace/core/gaba.py:94 uses .detach() for exactly this reason; sec. 18.5 calls this out as Pitfall 3 in the GABALoss wiring discussion.
get_gradient_stats() exposes both keys (raw_weight_health and post-floor) so you can inspect the gap during training.
Takeaway
- The EMA recursion IS a first-order IIR low-pass filter, character for character. Paper eq. 5 and the textbook DSP recursion are the same equation; the EMA buffer is a digital realisation of the same RC low-pass topology audio engineers have been using since 1899.
- Time constant $\tau = 1/(1-\beta) = 100$ samples. Step response hits 63%/86%/95% at $t = \tau, 2\tau, 3\tau$. Effective averaging window is ~τ samples.
- Cutoff frequency $f_c \approx 0.0016$ cycles/step. Magnitude rolls off at −20 dB/decade past the cutoff. At paper-realistic disturbance bandwidths (~0.05 to 0.1 cycles/step) the attenuation is about 30 to 36 dB, in line with the 30 dB engineering rule of thumb.
- The default $\beta = 0.99$ is the lowest value satisfying the 30 dB rule on the paper's observed disturbance bandwidth. Going higher slows adaptation unnecessarily; going lower fails the rule and re-introduces the oscillation regime.
- The same filter is everywhere. Adam moments, BatchNorm running stats, BYOL/MoCo targets, Polyak averaging, RL target networks, RC analog audio filters — the recursion, the time constant, and the Bode plot are identical. Recognising the EMA in GABA gives you all of those analyses for free.
- Coming next. §19.3 identifies the floor as the anti-windup mechanism that completes the controller, and proves the bounded-weight guarantee that paper main.tex:387 calls ‘a stability property absent from loss-based approaches.’