Training a deep network is mostly smooth. Mostly. Two kinds of shock disrupt that smoothness: occasional outlier batches with huge gradient norms, and per-step weight jitter that makes any single ‘final’ snapshot a poor representative of the trajectory's plateau. AMNL pipes both shocks through dedicated stabilisers: gradient clipping caps the first; weight EMA smooths the second.
The pair. Clip joint gradient norm to 1.0 (paper-canonical); maintain an EMA of every parameter with decay 0.999. Both come from paper_ieee_tii/grace/training/callbacks.py lines 54-111. Together they recover ~0.3-0.5 cycles RMSE on FD002 over a no-stabiliser baseline.
Gradient Clipping: Rescale by Norm
Concatenate every per-parameter gradient into one big vector; compute its L2 norm; if it exceeds a threshold $c$, rescale every gradient by the same factor $c / \lVert g \rVert_2$. The DIRECTION is preserved; only the magnitude shrinks.
$$g_i \leftarrow \begin{cases} g_i \cdot c / \lVert g \rVert_2 & \text{if } \lVert g \rVert_2 > c \\ g_i & \text{otherwise} \end{cases}$$
Why JOINT norm and not per-parameter? Per-parameter clipping would break the gradient's direction - one parameter shrinks while another doesn't, so the combined ‘arrow’ rotates. Joint norm clipping rescales every entry by the same scalar, preserving the direction. Mathematically: clipping to a ball, not a cube.
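The ball-versus-cube claim can be checked numerically. The sketch below is illustrative only: the three helper functions and the toy gradient values are made up for this example and are not taken from the paper's code. It measures the cosine similarity between the original gradient and each clipped version; only joint-norm clipping should keep it at 1.

```python
import numpy as np

def clip_by_joint_norm(grads, c):
    """Rescale all arrays by ONE shared factor if the joint L2 norm exceeds c."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, c / (total + 1e-12))
    return [g * scale for g in grads]

def clip_per_parameter(grads, c):
    """Clip each array to norm c on its own (a different factor per parameter)."""
    out = []
    for g in grads:
        n = float(np.linalg.norm(g))
        out.append(g * min(1.0, c / (n + 1e-12)))
    return out

def clip_per_element(grads, c):
    """Clamp every element to [-c, c] independently (the 'cube')."""
    return [np.clip(g, -c, c) for g in grads]

def cosine(a_list, b_list):
    """Cosine similarity between two lists of arrays, each flattened into one vector."""
    a = np.concatenate([x.ravel() for x in a_list])
    b = np.concatenate([x.ravel() for x in b_list])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy gradients (hypothetical values): one parameter dominates the joint norm.
grads = [np.array([3.0, 0.1]), np.array([0.2])]

cos_joint = cosine(grads, clip_by_joint_norm(grads, c=1.0))
cos_param = cosine(grads, clip_per_parameter(grads, c=1.0))
cos_elem = cosine(grads, clip_per_element(grads, c=1.0))

print(f"joint-norm clip    : cosine = {cos_joint:.6f}")
print(f"per-parameter clip : cosine = {cos_param:.6f}")
print(f"per-element clip   : cosine = {cos_elem:.6f}")
```

In this toy case the per-parameter and per-element variants shrink the large entries but leave the small ones alone, so the clipped vector points in a slightly different direction from the original.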
Weight EMA: Inertial Average
After every optim.step(), update a SHADOW copy of every parameter:
$$\bar{\theta}_t = \beta \cdot \bar{\theta}_{t-1} + (1 - \beta) \cdot \theta_t$$
with $\beta = 0.999$. The shadow is a single-pole IIR low-pass filter on the weight trajectory; its half-life is $t_{1/2} = \log(0.5) / \log(\beta) \approx 693$ steps. At validation time, swap the shadow IN as the live weights, run forward, then swap them back.
| β | half-life (steps) | behaviour |
|---|---|---|
| 0.9 | ≈ 7 | barely smooths - shadow tracks raw closely |
| 0.99 | ≈ 69 | modest smoothing - useful when training is short |
| 0.999 | ≈ 693 | paper choice - smooth but lags by ~1 epoch |
| 0.9999 | ≈ 6,931 | very heavy lag - shadow may stay near init for long runs |
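The half-life column follows directly from the update rule: unrolling the recursion shows that after t updates the shadow still carries a weight of β^t on its initial value. A minimal sketch of that arithmetic, with helper names chosen here for illustration:

```python
import math

def half_life(decay: float) -> float:
    """Number of updates after which an old value's weight has dropped to 0.5."""
    return math.log(0.5) / math.log(decay)

def init_weight_remaining(decay: float, steps: int) -> float:
    """Weight the shadow still places on its initial value after `steps` updates."""
    return decay ** steps

for decay in (0.9, 0.99, 0.999, 0.9999):
    print(
        f"decay={decay}: half-life ~ {half_life(decay):.0f} steps, "
        f"weight on init after 1000 steps = {init_weight_remaining(decay, 1000):.3f}"
    )
```

The second number is why a very heavy decay can leave the shadow close to its initialisation on short runs: until a few half-lives have passed, most of the shadow is still the starting weights.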
Interactive: Clip Threshold and EMA Decay
Drag the clip slider to see what fraction of mini-batches gets rescaled. Drag the decay slider to see how strongly the shadow smooths a noisy weight trajectory. Both panels update together.
Try this. Set clip = 0.5 - the histogram shows ~50% of batches get clipped, training would crawl. Set clip = 5.0 - 0% clipped, no protection against spike batches. Paper's clip = 1.0 catches just the top ~10-20% spike batches without throttling normal training.
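Without the interactive panel, the same experiment can be approximated offline. The sketch below assumes a synthetic log-normal distribution of per-batch gradient norms (median 0.5, sigma 0.67); that distribution is a made-up stand-in chosen to roughly mimic the percentages quoted above, not data from the paper. On a real run you would feed in the logged pre-clip norms instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic per-batch gradient norms, log-normal with median 0.5.
# Replace with the pre-clip norms logged by your own trainer.
norms = rng.lognormal(mean=np.log(0.5), sigma=0.67, size=100_000)

def clipped_fraction(norms: np.ndarray, max_norm: float) -> float:
    """Fraction of batches whose joint gradient norm exceeds max_norm."""
    return float(np.mean(norms > max_norm))

fractions = {c: clipped_fraction(norms, c) for c in (0.5, 1.0, 5.0)}
for c, frac in fractions.items():
    print(f"max_norm={c}: {100 * frac:.1f}% of batches clipped")
```

A threshold near the median clips about half of all batches, while a threshold far in the tail almost never fires; a reasonable starting point is a value that only catches the upper tail of the logged norms.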
Python: Both From Scratch
Pure NumPy. clip_grad_by_norm mirrors PyTorch's clip_grad_norm_; EMA mirrors paper's ExponentialMovingAverage. The smoke test runs 2000 SGD steps on a synthetic 8×8 weight matrix and prints raw vs shadow trajectories.
clip_grad_by_norm() + class EMA
🐍clip_ema_numpy.py
import numpy as np
NumPy provides the (P,) gradient and weight arrays plus the math we need: np.sum, np.log, np.random.seed, np.random.randn. The whole stabiliser stack runs in pure NumPy.
EXECUTION STATE
📚 numpy = Library: ndarray + linear algebra + math + random.
def clip_grad_by_norm(grads, max_norm=1.0, norm_type=2.0) -> tuple[float, list[np.ndarray]]:
Reimplementation of torch.nn.utils.clip_grad_norm_(parameters, max_norm). Computes the JOINT norm of all gradient arrays (treated as one big concatenated vector) and rescales them all by the same factor if the joint norm exceeds max_norm.
EXECUTION STATE
⬇ input: grads = List of ndarrays - one per parameter group. Joint-norm convention treats them as one big vector.
⬇ input: max_norm = 1.0 = Paper default. Joint norm is rescaled to at most this value.
⬇ input: norm_type = 2.0 = Lp norm. Default is L2 (Euclidean). Same default as PyTorch's clip_grad_norm_.
⬆ returns = (pre_clip_norm, clipped_grads_list) tuple. Pre-clip norm is logged; clipped list replaces the originals.
sq = 0.0
Accumulator for sum of squared gradient elements (across every group).
for g in grads:
Iterate parameter groups. Each g is one ndarray of gradient values.
EXECUTION STATE
iter var: g = One ndarray of gradients. Could be (D,), (D, D'), etc - shape doesn't matter for the joint norm.
LOOP TRACE · 1 iteration
g = grad of W (8, 8)
g shape = (8, 8) - 64 elements
Σ g² = Σ over all 64 elements
running sq = += 64-element sum of squares
sq += float(np.sum(g ** norm_type))
Accumulate the sum of |g_ij|^p across all groups. With norm_type=2 this is Σ g_ij². With norm_type='inf' we'd use np.max(|g|) instead.
EXECUTION STATE
operator: ** norm_type = Element-wise power. For p=2 this squares each element.
📚 np.sum(arr) = Reduce-sum over all elements.
📚 float(x) = 0-D ndarray → Python float so the running accumulator stays scalar.
if coef < 1.0:
Only rescale if joint norm exceeded max_norm. Otherwise leave grads untouched.
clipped = [g * coef for g in grads]
List comprehension. Each gradient array is multiplied by coef. The result is a NEW list of new arrays - originals are not modified (PyTorch's clip_grad_norm_, by contrast, modifies the gradients in place).
EXECUTION STATE
→ list comprehension = [expr for x in iter] - builds a new list.
operator: * = Scalar × ndarray broadcast. Each element of g is multiplied by coef.
⬆ result: clipped = List of rescaled gradient arrays. Joint norm is now max_norm (marginally below it, because of the 1e-6 epsilon in coef).
else:
Pass-through path.
clipped = [g.copy() for g in grads]
Even on the pass-through path we return COPIES so the caller can't accidentally mutate originals through the returned list.
EXECUTION STATE
📚 .copy() = ndarray method. Allocates a new array with the same shape and values. NOT a view; later edits don't affect the original.
return total_norm, clipped
Return the PRE-clip norm (for logging) and the post-clip gradients. Trainer logs the pre-clip norm so you can see how often clipping fires.
class EMA:
Pure-NumPy port of paper's ExponentialMovingAverage class (paper_ieee_tii/grace/training/callbacks.py:54-87). Stores a SHADOW copy of every parameter; updates that shadow toward the live weights at every step using exponential decay.
self.shadow = {name: w.copy() for name, w in params.items()}
Dict comprehension. Make a fresh copy of every parameter array - the shadow needs to live independently so updates don't accidentally mutate the live weights.
EXECUTION STATE
📚 dict comprehension = {key_expr: value_expr for k, v in iter} - builds a dict by evaluating both exprs per pair.
📚 dict.items() = View of (key, value) pairs.
📚 .copy() = Independent copy of the ndarray.
⬆ result: self.shadow = {name: ndarray} - exact replica of params at construction time.
self.backup: dict[str, np.ndarray] = {}
Empty backup dict. Used by apply_shadow / restore to swap live weights with shadow values temporarily during validation.
EXECUTION STATE
→ why backup? = Validation uses the SHADOW weights (smoother, generalises better). But training continues with the LIVE weights. apply_shadow swaps them in for eval; restore swaps them back.
def update(self, params) -> None:
Call this AFTER optim.step() each iteration. Pulls the shadow toward the freshly-updated live weights.
EXECUTION STATE
⬇ input: params = Dict {name: ndarray} of CURRENT live weights.
for name, w in params.items():
Iterate parameters.
LOOP TRACE · 1 iteration
name = 'W'
shadow_new = 0.999 · shadow_old + 0.001 · w
interpretation = 99.9% inertia from past + 0.1% from this step
→ reading = Shadow values cluster near 0; raw values jitter. EMA smooths exactly the noise we don't want to evaluate on.
print(f"raw final - shadow final : {raw_traj[-1] - shadow_traj[-1]:+.4f}")
Per-element gap at the final logged step. The shadow lags the raw by ~1 half-life worth of trajectory.
EXECUTION STATE
→ :+.4f = Float, force sign, 4 decimals.
Output (one realisation) = raw final - shadow final : -0.0103
```python
import numpy as np


def clip_grad_by_norm(grads: list[np.ndarray],
                      max_norm: float = 1.0,
                      norm_type: float = 2.0) -> tuple[float, list[np.ndarray]]:
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm.

    Mirrors torch.nn.utils.clip_grad_norm_(parameters, max_norm).
    Returns (pre_clip_norm, clipped_grads_list).
    """
    # 1. Compute the joint norm across every parameter group.
    #    Note: g ** norm_type assumes an even norm_type (such as the default 2.0);
    #    an odd p would need np.abs(g) ** norm_type.
    sq = 0.0
    for g in grads:
        sq += float(np.sum(g ** norm_type))
    total_norm = float(sq ** (1.0 / norm_type))

    # 2. If the joint norm exceeds max_norm, rescale every grad
    coef = max_norm / (total_norm + 1e-6)
    if coef < 1.0:
        clipped = [g * coef for g in grads]
    else:
        clipped = [g.copy() for g in grads]
    return total_norm, clipped


class EMA:
    """Per-parameter exponential moving average of model weights.

    Mirrors paper_ieee_tii/grace/training/callbacks.py::ExponentialMovingAverage.

        shadow_t = decay * shadow_{t-1} + (1 - decay) * theta_t
    """

    def __init__(self, params: dict[str, np.ndarray], decay: float = 0.999) -> None:
        self.decay = decay
        self.shadow = {name: w.copy() for name, w in params.items()}
        self.backup: dict[str, np.ndarray] = {}

    def update(self, params: dict[str, np.ndarray]) -> None:
        """Call once per training step AFTER optim.step()."""
        for name, w in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * w

    def apply_shadow(self, params: dict[str, np.ndarray]) -> None:
        """Replace live weights with shadow values; backup the live ones."""
        for name, w in params.items():
            self.backup[name] = w.copy()
            params[name] = self.shadow[name].copy()

    def restore(self, params: dict[str, np.ndarray]) -> None:
        """Put back the original live weights from the backup."""
        for name in self.backup:
            params[name] = self.backup[name].copy()


# ---------- Smoke test ----------
np.random.seed(0)
W = np.random.randn(8, 8).astype(np.float32) * 0.1
ema = EMA({"W": W.copy()}, decay=0.999)

half_life = np.log(0.5) / np.log(0.999)                       # ≈ 693 steps
shadow_traj = []
raw_traj = []
params = {"W": W.copy()}
for step in range(2000):
    grad = np.random.randn(*W.shape).astype(np.float32)       # synthetic noisy grad
    norm, clipped = clip_grad_by_norm([grad], max_norm=1.0)
    params["W"] -= 1e-2 * clipped[0]                          # plain SGD step
    ema.update(params)
    if step % 200 == 0:
        raw_traj.append(float(params["W"][0, 0]))
        shadow_traj.append(float(ema.shadow["W"][0, 0]))

print(f"half-life of decay=0.999 : {half_life:.0f} steps")
print(f"raw θ_0 (every 200 steps) : {[round(v, 3) for v in raw_traj]}")
print(f"ema θ_0 (every 200 steps) : {[round(v, 3) for v in shadow_traj]}")
print(f"raw final - shadow final : {raw_traj[-1] - shadow_traj[-1]:+.4f}")
```
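How much jitter the shadow removes can be estimated from the low-pass view of the EMA. For white noise of variance σ² fed through the update rule, the steady-state output variance is σ²(1 − β)/(1 + β). The sketch below checks that ratio empirically; it uses β = 0.9 so the filter settles quickly, and the function name `ema_filter` is a hypothetical helper for this example. Real weight jitter is not white noise, so treat the result as an order-of-magnitude guide only.

```python
import numpy as np

def ema_filter(x: np.ndarray, decay: float) -> np.ndarray:
    """Run the EMA recursion over a 1-D signal and return the shadow at every step."""
    out = np.empty_like(x)
    shadow = 0.0  # start at the noise mean so there is no initial offset
    for i, value in enumerate(x):
        shadow = decay * shadow + (1.0 - decay) * value
        out[i] = shadow
    return out

rng = np.random.default_rng(0)
decay = 0.9
noise = rng.standard_normal(200_000)   # unit-variance white noise
smoothed = ema_filter(noise, decay)

burn_in = 1_000                        # discard the start-up transient
empirical_ratio = float(np.var(smoothed[burn_in:]) / np.var(noise[burn_in:]))
theoretical_ratio = (1.0 - decay) / (1.0 + decay)

print(f"empirical variance ratio   : {empirical_ratio:.4f}")
print(f"theoretical (1-b)/(1+b)    : {theoretical_ratio:.4f}")
```

Plugging β = 0.999 into the same formula gives a variance ratio of about 0.0005, i.e. white-noise jitter would be attenuated by a factor of roughly 45 in standard deviation, at the price of the lag discussed earlier.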
PyTorch: Paper's Implementations
The exact ExponentialMovingAverage class from paper_ieee_tii/grace/training/callbacks.py lines 54-87, plus a smoke test that uses torch.nn.utils.clip_grad_norm_ after each backward and validates with the shadow weights.
ExponentialMovingAverage class + clip_grad_norm_ usage
import torch.nn as nn
Module containers - we use nn.Linear in the smoke test.
class ExponentialMovingAverage:
EXACT paper class from paper_ieee_tii/grace/training/callbacks.py lines 54-87. Plain Python class (NOT an nn.Module) - the shadow lives outside the autograd graph by design.
def __init__(self, model, decay=0.999) -> None:
Build the shadow at construction time.
EXECUTION STATE
⬇ input: model = An nn.Module - we read .named_parameters() off it.
self.shadow: dict[str, torch.Tensor] = {}
Empty dict. Will hold one shadow tensor per parameter, keyed by name.
self.backup: dict[str, torch.Tensor] = {}
Empty backup dict for the apply_shadow / restore swap.
for name, param in model.named_parameters():
Iterate every parameter with its qualified name (e.g. 'weight', 'cnn.stack.0.bias').
EXECUTION STATE
📚 .named_parameters() = Iterator yielding (full_qualified_name, parameter) for every param in the module tree.
iter vars = name (str), param (nn.Parameter).
if param.requires_grad:
Only shadow LEARNABLE parameters. Frozen layers (e.g. pretrained backbones with requires_grad=False) get skipped - their values don't change so a shadow would be redundant.
self.shadow[name] = param.data.clone()
.data accesses the underlying Tensor without autograd tracking. .clone() copies storage.
EXECUTION STATE
📚 .data = Direct access to the parameter's tensor value, bypassing autograd. Useful for non-gradient bookkeeping like EMA.
📚 .clone() = Returns a tensor with its own storage and the same values.
→ why .data and .clone()? = .data avoids accidentally creating an autograd graph; .clone() ensures the shadow lives independently of the live weight.
def update(self, model) -> None:
Call AFTER optimizer.step() each iteration. Pulls the shadow toward freshly-updated live weights.
EXECUTION STATE
⬇ input: model = The same nn.Module passed at __init__. Its weights have been updated by optim.step() since the last call.
for name, param in model.named_parameters():
Iterate.
if param.requires_grad and name in self.shadow:
Defensive check: skip params that are frozen now (even if they were trainable at init) or that never got into the shadow dict.
EXECUTION STATE
→ why both checks? = name in self.shadow handles the case where a parameter was added AFTER EMA construction (e.g. dynamic head registration). param.requires_grad handles the case where a parameter got frozen mid-training.
operator: * / + = Tensor arithmetic, no autograd graph since param.data is used.
→ no learnable params = self.decay is a Python float. self.shadow[name] is a tensor but with no grad. The shadow stays out of the autograd graph by design.
def apply_shadow(self, model) -> None:
Swap live weights with shadow values for validation.
param.data.copy_(self.shadow[name].to(param.device))
IN-PLACE copy of the shadow values into the live parameter. The trailing underscore in copy_ marks it in-place. We need in-place because the live param tensor is referenced from many places (optimiser state, autograd graph, etc.) and replacing the storage would break those references.
EXECUTION STATE
📚 .copy_(src) = In-place copy from src into self. Matches PyTorch convention: trailing underscore = in-place.
→ why in-place? = Optimiser holds references to param.data; replacing the tensor object would orphan those references. In-place copy keeps the same Tensor object but overwrites the values.
def restore(self, model) -> None:
Reverse of apply_shadow.
for name, param in model.named_parameters():
Iterate.
if param.requires_grad and name in self.backup:
Defensive check - only restore parameters we actually backed up.
param.data.copy_(self.backup[name])
In-place copy from backup.
torch.manual_seed(0)
Repro.
EXECUTION STATE
📚 torch.manual_seed(s) = Set the global PyTorch PRNG.
pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Compute joint L2 norm of all gradients and rescale them in-place if it exceeds max_norm. Returns the PRE-clip total norm for logging.
EXECUTION STATE
📚 torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinite=False, foreach=None) = Computes the total p-norm of `parameters[*].grad`; if it exceeds max_norm, scales every grad in place by max_norm / total_norm. Trailing underscore = in-place.
⬇ arg 1: parameters = model.parameters() = Iterator over the params whose grads we want to clip jointly.
⬇ arg 2: max_norm = 1.0 = Paper default. After clipping, joint L2 norm ≤ 1.0.
→ in-place = Modifies param.grad tensors directly. The optimizer that runs next sees the clipped gradients.
⬆ result: pre_clip = 0-D tensor - the PRE-clip joint norm (useful for logging).
optimizer.step()
Apply the AdamW update with the (now clipped) gradients.
ema.update(model)
Pull the shadow toward the freshly-updated live weights. ORDER MATTERS: must be AFTER optimizer.step() - otherwise the shadow tracks the pre-step weight.
ema.restore(model)
Put live weights back so training continues correctly.
print(f"val_loss with shadow weights : {val_loss:.4f}")
Log.
EXECUTION STATE
Output (one realisation) = val_loss with shadow weights : 1.1820
→ reading = After 5 steps the shadow has barely moved from the initial weights (decay=0.999, half-life=693). Val loss is therefore essentially the loss at initialisation, which after only 5 small AdamW steps is also close to step-4's loss with raw weights. The smoothing benefit shows after hundreds of steps.
```python
import torch
import torch.nn as nn


# Source: paper_ieee_tii/grace/training/callbacks.py:54-87
class ExponentialMovingAverage:
    """Maintains an EMA of model parameters for evaluation."""

    def __init__(self, model: nn.Module, decay: float = 0.999) -> None:
        self.decay = decay
        self.shadow: dict[str, torch.Tensor] = {}
        self.backup: dict[str, torch.Tensor] = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model: nn.Module) -> None:
        for name, param in model.named_parameters():
            if param.requires_grad and name in self.shadow:
                self.shadow[name] = self.shadow[name].to(param.device)
                self.shadow[name] = (
                    self.decay * self.shadow[name] + (1.0 - self.decay) * param.data
                )

    def apply_shadow(self, model: nn.Module) -> None:
        for name, param in model.named_parameters():
            if param.requires_grad and name in self.shadow:
                self.backup[name] = param.data.clone()
                param.data.copy_(self.shadow[name].to(param.device))

    def restore(self, model: nn.Module) -> None:
        for name, param in model.named_parameters():
            if param.requires_grad and name in self.backup:
                param.data.copy_(self.backup[name])


# ---------- Smoke test: clip + EMA in a training step ----------
torch.manual_seed(0)
model = nn.Linear(64, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
ema = ExponentialMovingAverage(model, decay=0.999)

x = torch.randn(32, 64)
target = torch.randn(32, 1)

for step in range(5):
    optimizer.zero_grad()
    pred = model(x)
    loss = ((pred - target) ** 2).mean()
    loss.backward()

    # 1. Clip joint gradient norm to 1.0 (paper default)
    pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

    # 2. Update EMA shadow AFTER optimizer.step()
    ema.update(model)

    print(f"step={step} pre_clip_norm={pre_clip.item():.4f} loss={loss.item():.4f}")

# At validation time: swap in the shadow weights
ema.apply_shadow(model)
val_pred = model(x)
val_loss = ((val_pred - target) ** 2).mean().item()
ema.restore(model)
print(f"val_loss with shadow weights : {val_loss:.4f}")
```
Where Else These Two Show Up
Both stabilisers transfer to nearly every deep-learning domain. The hyperparameters change with training length and gradient regime; the underlying recipe stays.
| Domain | max_norm | EMA decay | Notes |
|---|---|---|---|
| RUL prediction (this book) | 1.0 | 0.999 | paper default |
| Transformer language modelling | 1.0 | 0.9999 | long runs ⇒ heavier EMA |
| Vision Transformer (ImageNet) | 1.0 | 0.9998 | Polyak averaging - same idea |
| GAN generator training | 5.0 - 10.0 | 0.999 | rare clipping; EMA on G is critical |
| Reinforcement learning (PPO) | 0.5 | not used | no EMA - policy distribution changes too fast |
| Diffusion model training | 1.0 | 0.9999 | EMA at inference is THE main eval trick |
Diffusion models live or die on EMA. Many diffusion training runs report two metrics: with and without EMA. The EMA version often improves FID by 30-50%. It's not a small effect.
Three Stabiliser Pitfalls
Pitfall 1: Calling EMA.update() BEFORE optimizer.step(). The shadow then tracks the PRE-step weight, which is one step stale. Always order: backward → clip → step → EMA update.
Pitfall 2: Per-element (or per-parameter) clipping instead of joint norm. torch.nn.utils.clip_grad_value_ exists but clips each ELEMENT independently - it BREAKS the gradient direction. Always use clip_grad_norm_ with norm_type=2 for direction-preserving rescaling.
Pitfall 3: Forgetting to restore() after validation. If you call apply_shadow for val and forget to restore, training continues with the SHADOW weights instead of the live ones. The next optimizer.step() applies updates to weights that lag the trainer's internal state - subtle bug, plausible-looking loss curves, irreproducible runs.
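One way to make Pitfall 3 hard to hit is to wrap the swap in a context manager, so that restore runs even if validation raises. The sketch below is an illustration, not part of the paper's code: `TinyEMA` is a stripped-down stand-in for the NumPy EMA class and `shadow_weights` is a hypothetical helper name.

```python
from contextlib import contextmanager

import numpy as np

class TinyEMA:
    """Minimal stand-in for the EMA class (hypothetical, for illustration only)."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}
        self.backup = {}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v

    def apply_shadow(self, params):
        for k, v in params.items():
            self.backup[k] = v.copy()
            params[k] = self.shadow[k].copy()

    def restore(self, params):
        for k in self.backup:
            params[k] = self.backup[k].copy()
        self.backup = {}

@contextmanager
def shadow_weights(ema, params):
    """Swap shadow weights in for the duration of the block, then always restore."""
    ema.apply_shadow(params)
    try:
        yield params
    finally:
        ema.restore(params)  # runs even if the validation code raises

params = {"W": np.ones((2, 2))}
ema = TinyEMA(params, decay=0.5)      # decay 0.5 only to make the numbers easy to read
params["W"] = params["W"] + 1.0       # pretend training step: live weights become 2.0
ema.update(params)                    # shadow = 0.5 * 1.0 + 0.5 * 2.0 = 1.5

with shadow_weights(ema, params):
    inside = params["W"].copy()       # validation would see the shadow values here
after = params["W"].copy()            # live weights are back

print("inside block:", inside[0, 0], "| after block:", after[0, 0])
```

The same pattern carries over to the PyTorch class by calling apply_shadow(model) and restore(model) around the yield.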
The point. Two cheap tricks: clip joint gradient norm to 1.0, maintain an EMA of every parameter with decay 0.999. Both are paper-canonical and add about 2 lines to the training loop. §15.4 turns to per-dataset dropout - the last AMNL pipeline knob before the full training-script walkthrough in §15.5.
Takeaway
clip_grad_norm_(params, max_norm=1.0). Joint L2 norm; preserves direction; runs after backward, before step.
ExponentialMovingAverage(model, decay=0.999). Shadow per-parameter; updates after step; half-life ~693 steps.
apply_shadow → eval → restore. Validation uses shadow weights but training continues with live ones.
Order matters. backward → clip → step → EMA.update.
Cheap. Two extra lines per training step. Recovers 0.3-0.5 cycles RMSE on FD002.