State the difference between continuity at a point and uniform continuity on a set — and pinpoint the single quantifier swap that separates them.
Produce the standard counter-examples (x2 on R, 1/x on (0,1]) and explain visually WHY the same δ cannot survive a slide along the x-axis.
Use the sequential test to refute uniform continuity by finding two sequences xn,yn with ∣xn−yn∣→0 but ∣f(xn)−f(yn)∣→0.
State and apply the Heine–Cantor theorem: continuous + compact ⇒ uniformly continuous.
Recognise Lipschitz and Hölder conditions as quantitative refinements of uniform continuity, and use them to certify robustness of neural networks.
Why Plain Continuity Is Not Enough
Ordinary continuity is a local promise. At each point x0, for each tolerance ε>0, we can find some δ>0 that keeps ∣f(y)−f(x0)∣<ε on the window ∣y−x0∣<δ. Different points are allowed to demand different δ's. The function f(x)=1/x on (0,1] illustrates the problem: near x0=1 a moderate δ works, but near x0=10−6 you need a far tinier δ for the same ε.
Many results in analysis need something stronger — they need a singleδ that works everywhere for a given ε. Riemann integrability, convergence of numerical methods, robustness of neural networks, and approximation by step functions all rely on a single global δ. That is the strengthened concept we now formalise: uniform continuity.
The big idea. A function is uniformly continuous on a set if a single δ works for the entire set — the tolerance you need is independent of where on the x-axis you are testing. It is continuity with the supremum's permission slip already filled in.
Two Definitions, Side by Side
The only structural difference is the order of two quantifiers. Read both rows and watch the position of x0 migrate.
Concept
Quantifier order
Plain English
Continuous on A
∀x0∈A∀ε>0∃δ>0∀y∈A,∣y−x0∣<δ⇒∣f(y)−f(x0)∣<ε
First fix the centre x0, then a tolerance, then hunt for δ. The δ may depend on bothε and x0.
Uniformly continuous on A
∀ε>0∃δ>0∀x0,y∈A,∣y−x0∣<δ⇒∣f(y)−f(x0)∣<ε
Pick a tolerance first, then a SINGLE δ must work for every pair of points in A. The δ depends only on ε, not on the location.
The migration matters. Moving ∀x0∈A from the inside to the outside of ∃δ is the entire difference. With x0 inside, δ may shrink as x0 changes. With x0 outside, δ has to be picked once-and-for-all.
The Formal Definition
Definition (Uniform Continuity). Let A⊆R and f:A→R. Then f is uniformly continuous onA if∀ε>0∃δ>0∀x,y∈A,∣x−y∣<δ⟹∣f(x)−f(y)∣<ε.Equivalently: as the gap between two inputs in A shrinks to zero, the gap between their outputs shrinks to zero uniformly — at a rate that does not depend on where the inputs are.
Two structural remarks. First, uniform continuity is a property of a function on a set, never of a function at a point — "uniformly continuous at x0" is a nonsensical phrase. Second, uniform continuity is strictly stronger than continuity: every uniformly continuous function is continuous (take the same δ), but the converse fails.
The Sequential Test for Failure
Negating the definition gives a powerful tool for disproving uniform continuity.
Sequential criterion for non-uniform continuity.f is NOT uniformly continuous on A if and only if there exist sequences (xn),(yn) in A with ∣xn−yn∣→0 but ∣f(xn)−f(yn)∣→0.
The strategy is mechanical: build two sequences that crowd together while their function values stay separated. For f(x)=x2 on R, take xn=n,yn=n+n1. Then ∣xn−yn∣=1/n→0, but∣f(xn)−f(yn)∣=∣xn−yn∣(xn+yn)=n1(2n+n1)=2+n21→2.The gap refuses to shrink, so x2 is NOT uniformly continuous on R. Same argument with xn=1/n,yn=1/(n+1) proves 1/x is not uniformly continuous on (0,1].
A Tour of Four Functions
Four functions, four different stories. Before opening the explorer, predict each verdict on its own.
Function
Domain
Uniformly continuous?
Reason
f(x)=x2
R
No
Slope 2x is unbounded; sequential test with (n, n + 1/n) gives gap → 2.
f(x)=x2
[0,3]
Yes
Heine–Cantor: continuous on a closed bounded interval. Lipschitz constant 6 suffices: take δ = ε/6.
f(x)=1/x
(0,1]
No
Domain is bounded but NOT closed; slope blows up at 0. Sequential test with (1/n, 1/(n+1)).
f(x)=sinx
R
Yes
1-Lipschitz: |sin a − sin b| ≤ |a − b|. Take δ = ε. Domain is non-compact but the bound is global.
f(x)=x
[0,∞)
Yes
Hölder of exponent 1/2: |√a − √b| ≤ √|a − b|. Take δ = ε². Slope blows up at 0, yet UC holds.
Rows 4 and 5 are the most instructive: an unbounded domain does not forbid uniform continuity. Heine–Cantor is sufficient, not necessary.
Geometric Intuition — The Travelling Box
Picture the (ε,δ) challenge as a small rectangle of width 2δ and height 2ε centred on the graph at (x0,f(x0)). Plain continuity at x0 says: I can pick a (possibly tiny) δ so that the graph in the vertical strip (x0−δ,x0+δ) stays inside the box.
Uniform continuity is more demanding. Pick a δ once. Now slide the box horizontally along the graph, with the same δ and the same ε. At every centre x0, the graph in the box's vertical strip must still fit inside. If it ever spills over the top or bottom of the box, f is not uniformly continuous with that δ.
Mental movie. A function is uniformly continuous on A iff for every height 2ε there is a width 2δ small enough that a rectangle of those dimensions, translated along the graph over A, never catches the curve escaping the top or bottom edge.
Interactive: The Travelling-Box Explorer
Pick a function, then fix ε and δ. Drag the x0 slider to push the box along the graph and watch whether it still contains the curve. The green box means the local (ε,δ) check passes; red means a point inside the δ-window lies more than ε above or below f(x0). Press stress-test to make the computer do the sliding for you.
Loading uniform continuity explorer…
Things to try. (1) On f=x2 over R, slide x0 toward x0=5 and watch the green box turn red as the slope 2x0 wins. (2) Switch to x2 on the COMPACT interval [0,3] — the stress-test confirms the same δ survives everywhere. (3) On 1/x, slide x0 from 1 down toward 0.1 and watch the vertical gap explode.
Heine–Cantor Theorem — The Bridge to Compactness
Heine–Cantor Theorem. Let f:[a,b]→R be continuous on the closed bounded interval [a,b]. Then f is uniformly continuous on [a,b].
This is the single most important upgrade in the section. Continuity on a compact interval is automatically uniform — so all the technical advantages of uniform continuity (a single global δ) come for free as long as we restrict attention to closed bounded sets. Once we leave compactness, as with x2 on R or 1/x on (0,1], uniform continuity has to be earned by other means.
Proof sketch (sequential). Assume not — then some ε0>0 admits sequences xn,yn∈[a,b] with ∣xn−yn∣→0 but ∣f(xn)−f(yn)∣≥ε0. By the Bolzano–Weierstrass theorem (compactness), (xn) has a subsequence (xnk) converging to some c∈[a,b]. The same indices force ynk→c as well. Continuity of f at c yields f(xnk),f(ynk)→f(c), so their difference tends to 0 — contradicting the assumed lower bound ε0.
Why both hypotheses matter. Compactness gives the convergent subsequence. Continuity transfers the convergence of x to convergence of f(x). Drop either and the chain breaks — and the explorer's x2-on-R red boxes are precisely the case where compactness is missing.
Lipschitz and Hölder — Quantitative Cousins
Uniform continuity says "a δ exists" but not how big it is. Two named refinements pin down the dependence of δ on ε explicitly — both are uniform continuity in disguise.
Name
Condition
Implied δ(ε)
Lipschitz with constant L
∣f(x)−f(y)∣≤L∣x−y∣∀x,y∈A
δ=ε/L — linear in ε.
Hölder with exponent α∈(0,1] and constant C
∣f(x)−f(y)∣≤C∣x−y∣α∀x,y∈A
δ=(ε/C)1/α — polynomial in ε.
Hölder generalises Lipschitz: α=1 recovers Lipschitz. A textbook example: x is Hölder-1/2 on [0,∞) (with C=1) but not Lipschitz — its slope explodes at 0. These quantitative cousins are exactly the right vocabulary for stability and robustness analysis, which is why they reappear immediately in the PyTorch section.
Worked Example — Prove x² Fails by Hand
We refute uniform continuity of f(x)=x2 on R in full detail, then check the claim against the explorer. The argument is the prototype every UC failure proof copies.
▶▼Show full worked solution — f(x) = x² is not uniformly continuous on ℝ
Step 1 — State what we have to refute.
We want to show NOT uniformly continuous: there exists some ε0>0 such that for every δ>0 we can find a pair x,y∈R with ∣x−y∣<δ and ∣f(x)−f(y)∣≥ε0.
Step 2 — Pick a witness ε0.
Take ε0=1. (Any positive number ≤2 would work. We will see why 2 is the natural ceiling at the end.)
Step 3 — Build the failing pair from a free parameter.
For each n∈N, setxn=n,yn=n+n1.The distance is ∣xn−yn∣=1/n, which can be made smaller than any prescribed δ by choosing n>1/δ.
Step 4 — Compute the function-value gap.
Using a2−b2=(a−b)(a+b),∣f(yn)−f(xn)∣=∣yn−xn∣⋅(yn+xn)=n1⋅(2n+n1)=2+n21.For every n≥1 this gap is strictly greater than 2, hence certainly greater than ε0=1.
Step 5 — Assemble the refutation.
Given any candidate δ>0, pick n>1/δ. Then ∣xn−yn∣=1/n<δ while ∣f(xn)−f(yn)∣>2>1=ε0. So δ does NOT work for ε0=1. Since this is true for everyδ, no single δ can serve all of R. QED.
Step 6 — Cross-check with the explorer.
In the explorer set the preset to f(x)=x2, set ε=1, set δ=0.1. Slide x0 to about 5: the box turns red, and the worst gap inside the window reads ≈1.01 — matching the prediction 2x0δ=2⋅5⋅0.1=1.
Takeaway recipe — refuting uniform continuity in four moves.
Pick a fixed ε0>0.
Engineer two sequences indexed by n with ∣xn−yn∣→0.
Show ∣f(xn)−f(yn)∣ stays bounded below by ε0.
Choose n large to defeat any candidate δ.
Python — Hunting for a Failing Pair
The hand-proof says "for every δ there is a pair". Below is exactly the same negation, encoded as a Python search. We feed it five candidate δ's and watch it defeat each one in turn.
Plain Python — refute UC for f(x) = x² by exhibiting a failing pair
🐍uniform_continuity_failure.py
Explanation(30)
Code(32)
1# --- Step 1: the function we suspect is NOT uniformly continuous on R ---
We pick the smallest counter-example: f(x) = x². It is continuous at every real number (a polynomial), so plain continuity holds. The question is whether the SAME δ can work for every ε, regardless of where on the x-axis we centre the test.
2def f(x) → float
The function under test. We deliberately keep it tiny so the example stays focused on the uniform-continuity question, not on the function's algebra.
EXECUTION STATE
⬇ input: x = Any real number. We will call f at carefully chosen pairs (a, b) with a → ∞ to force the gap |f(a) − f(b)| to stay large.
formula = f(x) = x². Slope f′(x) = 2x is unbounded as x → ∞ — that is the entire source of the failure.
⬆ returns = Python float. f(1) = 1, f(100) = 10 000, f(100.01) = 10 002.0001 — note how a tiny δ = 0.01 step changes the value by ≈ 2 at x = 100, but only by ≈ 0.02 at x = 1.
3Docstring — "Plain squaring."
States the function and flags the property that drives the failure (slope grows without bound). Keeping the WHY in the docstring is the difference between a reusable example and a throwaway snippet.
4return x * x
Pure arithmetic. No library calls. We use x * x instead of x**2 because multiplication is marginally faster and avoids any tiny floating-point surprises from the ** operator on negative bases.
Mechanical translation of the negation of UC. Given a candidate δ and a target ε, we look for any (a, b) with |a − b| < δ but |f(a) − f(b)| ≥ ε.
EXECUTION STATE
⬇ input: epsilon = The tolerance the candidate δ is supposed to enforce. We fix ε = 1 in the driver code — meaning we want |f(a) − f(b)| < 1 for every nearby pair.
⬇ input: delta = The proposed step size that someone CLAIMS works uniformly. Our job is to refute that claim by finding a single failing pair.
⬇ input: x_max=10_000 = Search budget. Because x² has slope 2x, the failing pair lives somewhere around x = ε/(2δ) — at ε = 1, δ = 0.0001 we expect a ≈ 5000. Budget must exceed that.
⬆ returns = Either a tuple (a, b, gap) proving UC fails at this δ, or None when the search budget was exhausted without finding one. None does NOT prove UC — only a smarter or wider search might still find a pair.
8Docstring — "Search for (a, b) with |a-b| < delta but |f(a)-f(b)| >= epsilon."
A literal restatement of the negation of UC. Documenting the contract here means anyone reading the function knows immediately what to expect.
10# The classic witness sequence: a_n = n, b_n = n + 1/n.
We could brute-force scan; instead we use a known winning strategy. The pair (n, n + 1/n) is the textbook witness for x² not being uniformly continuous — we will see exactly why it works.
11# |a-b| = 1/n -> 0,
First half of the cleverness: the distance between the two test points shrinks to zero as n grows. Any candidate δ — no matter how small — is eventually exceeded by 1/n.
Algebra reveals the failure. a² − b² factors as (a − b)(a + b). The (a − b) factor is shrinking like 1/n, but the (a + b) factor is GROWING like 2n. Their product is bounded BELOW by 2.
13# = 2 + 1/n^2 -> 2.
So no matter how small δ is, you can pick n large enough to fit a and b inside a δ-window AND have their function values 2 units apart. ε = 1 < 2 is impossible to enforce. UC fails.
14for n in range(1, x_max):
Try n = 1, 2, 3, …, x_max − 1. The loop will exit early via `return` as soon as we hit a pair whose distance is below δ.
LOOP TRACE · 3 iterations
Driver loop's δ = 1.0
n = 1 = a = 1.0, b = 2.0, |a−b| = 1.0 — NOT < 1.0, skip
n = Loop must reach n ≈ 10 001 before 1/n becomes < 0.0001
found pair = a ≈ 10001, b ≈ 10001.0001, gap ≈ 2 — confirms the failure pattern at every scale
15a = float(n)
First member of the candidate pair. Cast to float so subsequent arithmetic stays in floating-point and doesn't get promoted to int by accident.
EXECUTION STATE
a (during loop) =
n=1 → 1.0
n=2 → 2.0
n=10 → 10.0
n=100 → 100.0
16b = a + 1.0 / n # |a-b| = 1/n
Second member of the pair, at horizontal distance exactly 1/n from a. The comment annotates the key fact the search relies on. Floating-point division produces a float — promotion is automatic.
Filter out pairs that are not close enough. UC's requirement is strict inequality, so we use >= to skip — the case |a − b| = δ is on the boundary and is NOT considered a failure.
EXECUTION STATE
📚 abs(x) = Python builtin: absolute value for ints, floats, and complex numbers. For real numbers, identical to math.fabs but slightly faster on plain floats.
18continue # too far apart — skip
Skip to the next iteration. As n grows, 1/n shrinks, so eventually we will be inside δ. This step is just patience.
19gap = abs(f(a) - f(b)) # |f(a) - f(b)|
Compute the function-value gap. For our chosen pair, gap = |a² − b²| = (b − a)(a + b) = (1/n)(2n + 1/n) = 2 + 1/n². So gap is always > 2.
EXECUTION STATE
gap (during loop, all pairs that pass the δ filter) =
n=2: |4 − 6.25| = 2.2500
n=10: |100 − 102.01| = 2.0100
n=100:|10000 − 10002.0001|= 2.0001
ALL ≥ 2 — the bound is sharp
20if gap >= epsilon:
Have we found the witness? The condition |f(a) − f(b)| ≥ ε is exactly the second half of the UC failure clause. If yes, we return immediately.
21return (a, b, gap)
Hand back the smoking gun: the pair (a, b) and the gap it produces. The caller can print it, plot it, or assert it.
22return None # no failing pair found within search budget
Fell off the end of the search. This is NOT a proof of uniform continuity — it might just mean our budget was too small. For f(x) = x² with ε = 1 and δ = 0.0001, the failing pair lives near n = 10 000, so x_max must be at least that high.
EXECUTION STATE
⚠ asymmetry = Finding a failing pair PROVES non-uniform continuity. Failing to find one PROVES nothing — uniform continuity has to be established by a positive argument (e.g., Heine–Cantor on a compact interval, or a Lipschitz bound).
We probe the function at decreasing δ values. UC would require a SINGLE δ to work for every nearby pair. Watching the search succeed at every δ confirms that no such δ exists — non-uniform.
25epsilon = 1.0
Fix ε. The choice ε = 1 is arbitrary — any positive number would do, since the gap goes to 2 ≥ any ε ≤ 2. We could also use ε = 0.1; the search would find pairs further out but still find them.
26for delta in [1.0, 0.1, 0.01, 0.001, 0.0001]:
Iterate over a geometric sequence of candidate δ values, each 10× smaller than the last. UC's contract is 'pick ANY δ > 0 and I will defeat it' — so we hammer five increasingly hopeful δ values to make the point.
27result = failing_pair(epsilon, delta)
Call the searcher. The function returns either a (a, b, gap) tuple (failure proven) or None (search exhausted).
28if result is None:
Branch on the search outcome. Use `is None` not `== None` — the identity test is unambiguous for the singleton None.
29print(f"delta={delta:<8}: no failing pair found (NOT a proof of UC!)")
When our budget is too small, we cannot conclude anything. The warning string makes that explicit so a reader does not over-interpret the empty result.
30else:
The interesting branch: we found a witness. Unpack and print.
31a, b, gap = result
Tuple destructuring. Python pattern: split a returned tuple into named variables in one line. Order matters — must match the order in `return (a, b, gap)` above.
Closing fragment of the f-string. The repeated pattern — every δ produces a failing pair, and the gap is always ≈ 2 — is the visual proof that the slope of x² is dragging the failure all the way out to infinity. UC is decisively refuted.
2 lines without explanation
1# --- Step 1: the function we suspect is NOT uniformly continuous on R ---2deff(x):3"""Plain squaring. Continuous everywhere, but slope grows without bound."""4return x * x
56# --- Step 2: encode the "uniform continuity FAILS" definition directly ---7deffailing_pair(epsilon, delta, x_max=10_000):8"""Search for (a, b) with |a-b| < delta but |f(a)-f(b)| >= epsilon.
9 If such a pair exists, plain continuity is NOT uniform on R."""10# The classic witness sequence: a_n = n, b_n = n + 1/n.11# |a-b| = 1/n -> 0, but |a^2 - b^2| = |a-b|*(a+b) = (1/n)*(2n + 1/n)12# = 2 + 1/n^2 -> 2.13for n inrange(1, x_max):14 a =float(n)15 b = a +1.0/ n # |a-b| = 1/n16ifabs(a - b)>= delta:17continue# too far apart — skip18 gap =abs(f(a)- f(b))# |f(a) - f(b)|19if gap >= epsilon:20return(a, b, gap)21returnNone# no failing pair found within search budget2223# --- Step 3: try several delta values, fix epsilon = 1 ---24epsilon =1.025for delta in[1.0,0.1,0.01,0.001,0.0001]:26 result = failing_pair(epsilon, delta)27if result isNone:28print(f"delta={delta:<8}: no failing pair found (NOT a proof of UC!)")29else:30 a, b, gap = result
31print(f"delta={delta:<8}: failing pair a={a:.4f}, b={b:.4f}, "32f"|f(a)-f(b)|={gap:.4f} >= epsilon={epsilon}")
The asymmetry. Finding a failing pair proves non-uniform continuity. Failing to find one proves nothing — UC has to be established by a positive argument (Heine–Cantor on a compact set, or a Lipschitz / Hölder bound). This is the same logical asymmetry as P vs NP: a witness is easy to verify, impossible to verify by exhaustion.
PyTorch — Uniform Continuity and Adversarial Robustness
Why should an ML engineer care about uniform continuity? Because a neural network built from Lipschitz layers is uniformly continuous on all of input space, and the Lipschitz constant L gives an immediate robustness certificate: an adversary perturbing the input by ∥δ∥≤ε cannot move the output by more than Lε. The whole argument is just uniform continuity with explicit δ(ε)=ε/L.
Loads the core PyTorch package. We need tensors (for the network's input), torch.linalg.svdvals (to read the singular values of each weight matrix), and torch.manual_seed (so the demo is reproducible).
EXECUTION STATE
torch = Tensor + autograd + neural-network stack.
2import torch.nn as nn
Subpackage with neural-network building blocks. We use nn.Linear (affine layer), nn.ReLU (non-linearity), and nn.Sequential (containers that wires them in order).
4# --- Step 1: a small "MLP" with ONE hidden layer ---
We build the smallest network that still has interesting Lipschitz structure: 3-D input → 4-D hidden → 1-D output. The Lipschitz argument scales identically to networks with millions of weights.
5torch.manual_seed(0)
Pin the random number generator so weight initialisation is reproducible. Without this line, every run of the script produces different singular values and different printed numbers.
EXECUTION STATE
📚 torch.manual_seed(int) = Seeds the PyTorch CPU RNG. For GPU runs, also call torch.cuda.manual_seed_all. Different seeds — different weights — different Lipschitz bound, but the relationship empirical <= upper-bound still holds.
6net = nn.Sequential(
Chain three modules end-to-end. nn.Sequential exposes the modules as a Python list while wiring forward(x) to compose them in order.
First affine layer: y = W₁ x + b₁. W₁ is a (4, 3) matrix — 4 output features, 3 input features. The PyTorch convention is rows = outputs, cols = inputs, opposite from some textbooks.
EXECUTION STATE
📚 nn.Linear(in_features, out_features, bias) = Builds a torch.nn.Linear module. Stores .weight (shape (out, in)) and .bias (shape (out,)) as learnable Parameters. Forward: input @ weight.T + bias.
operator norm = ‖W₁‖_op = largest singular value of W₁ = max_x ‖W₁ x‖ / ‖x‖. It is the Lipschitz constant of the linear map y = W₁ x.
8nn.ReLU(),
Element-wise ReLU: relu(x) = max(0, x). Crucially, ReLU is 1-Lipschitz: |relu(a) − relu(b)| ≤ |a − b|. Composition with a 1-Lipschitz map cannot inflate the upstream Lipschitz constant.
EXECUTION STATE
why 1-Lipschitz? = If a, b ≥ 0: |a − b|. If a, b ≤ 0: |0 − 0| = 0. If a > 0 > b: |a − 0| ≤ |a − b|. So always ≤ |a − b|.
Second affine layer, mapping the 4-D hidden state to a scalar output. Combined with the ReLU, this is a tiny binary-classifier head.
10)
Close the Sequential container. net is now a callable module: net(x) runs Linear → ReLU → Linear in order.
12# --- Step 2: an upper bound on the network's LIPSCHITZ constant ---
A Lipschitz constant L for f means |f(x) − f(y)| ≤ L · |x − y|. This is a quantitative form of uniform continuity: take δ = ε / L and the UC condition holds at every x. Networks built from Lipschitz layers are uniformly continuous on all of Rᴺ.
13# For a composition of layers, ||f(x) - f(y)|| <= L * ||x - y||
Composition rule. If f₁ is L₁-Lipschitz and f₂ is L₂-Lipschitz, then f₂ ∘ f₁ is (L₁·L₂)-Lipschitz. Proof in one line: |f₂(f₁(x)) − f₂(f₁(y))| ≤ L₂|f₁(x) − f₁(y)| ≤ L₁L₂|x − y|.
14# where L = product of the OPERATOR NORMS of all linear layers.
Operator (spectral) norm of W = max ‖Wx‖/‖x‖ = top singular value. Since the bias just shifts and doesn't stretch, it does not enter the Lipschitz bound.
15# (ReLU is 1-Lipschitz, so it doesn't increase the bound.)
ReLU contributes a factor of 1. Same for tanh, sigmoid, GELU, LayerNorm (with standard parameterisation). For dropout and batch-norm the analysis is more delicate.
Walk through the Sequential, multiply the operator norms of every linear layer, return the product. Other layers contribute factor 1.
EXECUTION STATE
⬇ input: net = Any nn.Sequential. Could equally be a Module with `.modules()` — we just keep the demo simple.
⬆ returns = A Python float L such that |net(x) − net(y)| ≤ L · ‖x − y‖ for all x, y ∈ R³. With this seed: L ≈ 1.30.
17L = 1.0
Initialise the running product. Identity factor for multiplication.
18for layer in net:
Iterate over the Sequential's children in forward order. nn.Sequential implements __iter__, so this works directly.
19if isinstance(layer, nn.Linear):
Only Linear layers contribute a non-trivial factor. Everything else (ReLU, etc.) we treat as 1-Lipschitz and silently skip.
EXECUTION STATE
📚 isinstance(obj, cls) = Python builtin: True if obj is an instance of cls (or a subclass). The standard duck-typing alternative is hasattr(layer, 'weight') — both work.
20# Operator (spectral) norm = largest singular value of W
Mathematical reminder. For any matrix W, ‖W‖_op = σ₁(W), the top singular value. We will read it directly from PyTorch's SVD.
Compute the SVD of the weight matrix and take its top singular value. .item() converts the resulting 0-d tensor to a Python float.
EXECUTION STATE
📚 torch.linalg.svdvals(M) = Returns the singular values of M, sorted in descending order, as a 1-D tensor. Faster than torch.linalg.svd because it skips the U, V factors.
layer.weight =
The weight matrix Parameter. For Linear(3, 4): shape (4, 3) — out × in.
⬆ first-layer value (seed 0) = σ₁(W₁) ≈ 0.9684 (PyTorch's default Kaiming init gives spectral norm close to 1, slightly below 1).
⬆ second-layer value (seed 0) = σ₁(W₂) ≈ 1.341
22L *= sigma_max
Accumulate the product. After the first Linear: L ≈ 0.9684. After the ReLU (skipped): unchanged. After the second Linear: L ≈ 0.9684 · 1.341 ≈ 1.30.
23return L
Hand back the upper bound. This is a PROVEN bound — no further empirical testing can violate it.
25L = lipschitz_upper_bound(net)
Run the audit. The result is a single number that fully characterises how much the network can amplify any input perturbation.
26print(f"upper bound on Lipschitz constant L = {L:.4f}")
Display L. Typical printed value with seed 0: L ≈ 1.3000. Take δ = ε / L and uniform continuity holds with that δ — everywhere on R³.
EXECUTION STATE
expected printed line = upper bound on Lipschitz constant L = 1.2988
28# --- Step 3: an EMPIRICAL Lipschitz estimate by random sampling ---
We now sanity-check the theoretical bound by sampling. The empirical ratio MUST be ≤ L for every pair, so the largest observed ratio gives a lower bound on the TRUE Lipschitz constant.
29# UC + Lipschitz means: for delta > 0, |f(x) - f(y)| <= L * delta whenever ||x-y|| <= delta.
Restates the contract. The empirical loop tests this directly: pick random pairs (x, y) with small ‖x − y‖ and measure the resulting output gap.
30# We measure the worst observed ratio |f(x) - f(y)| / ||x - y||.
Worst-case ratio is a LOWER bound on the true Lipschitz constant of the network. If this ratio ever exceeded L, our theoretical bound would be wrong.
Randomly sample n_pairs nearby pairs and track the maximum |f(x) − f(y)| / ‖x − y‖.
EXECUTION STATE
⬇ input: n_pairs = Sample size. 5000 is a comfortable trade-off — enough to surface the worst-case direction with reasonable probability, fast enough to run in milliseconds.
32worst = 0.0
Running maximum of the observed ratios. We are looking for a LOWER bound on the true Lipschitz constant, so we maximise.
33for _ in range(n_pairs):
Iterate 5000 times. Underscore convention for an unused loop variable.
34x = torch.randn(dim)
Draw a random 3-D input from the standard normal. randn produces float32 tensors by default.
EXECUTION STATE
📚 torch.randn(*shape) = Returns a tensor of given shape with values ~ N(0, 1). dim=3 → shape (3,). For higher dimensions: torch.randn(B, 3) for a batch.
35y = x + 0.01 * torch.randn(dim) # nearby point
Build a nearby y. Mean perturbation ‖y − x‖ ≈ 0.01·√3 ≈ 0.017. Smaller perturbations sharpen the empirical estimate but at the cost of more floating-point noise.
36with torch.no_grad():
Disable autograd for the forward pass — we are only measuring, not training. Skips graph construction so the inner loop runs ≈ 3× faster.
EXECUTION STATE
📚 torch.no_grad() = Context manager that turns off gradient tracking. Required for inference loops; failing to use it leads to memory leaks because activation tensors stay alive for a backward that never comes.
Compute the local stretch factor. Both norms are L² (default for .norm() on a vector). .item() pulls a Python float out of each 0-d tensor.
38worst = max(worst, ratio)
Update the running maximum. After the loop, worst is a sampled lower bound on the true Lipschitz constant.
39return worst
Hand back the observed worst ratio.
41emp = empirical_lipschitz(net)
Run the sampling. With seed 0 and 5000 samples in 3-D, the empirical worst is around 1.10 — comfortably below L ≈ 1.30, which is exactly the theoretical guarantee.
42print(f"empirical Lipschitz ratio = {emp:.4f} (must be <= L)")
Print and remind the reader of the inequality. If you ever observe emp > L, you have a bug — either in the operator-norm calculation or in your network definition.
EXECUTION STATE
expected output = empirical Lipschitz ratio = 1.0974 (must be <= L)
44# --- Step 4: an ADVERSARIAL example uses the SAME Lipschitz bound ---
Pivot from theory to security. An adversarial attack tries to perturb x by a small amount and flip the network's prediction. Lipschitz bounds tell us the SMALLEST such perturbation must satisfy ‖δ‖ ≥ |margin| / L.
45# If we perturb the input by epsilon, the output cannot move by more than L*epsilon.
Direct consequence of |f(x + δ) − f(x)| ≤ L‖δ‖. This is the network-level analogue of the uniform-continuity δ-ε contract.
46# Conversely, an attacker who wants to flip a binary decision needs ||delta|| >= |margin| / L.
Robustness certificate. If your decision boundary is f(x) = 0, the confidence margin |f(x)| / L is a GUARANTEED radius of safety. Training networks to maximise this margin gives certified robust models — and the entire argument rests on uniform continuity.
47x = torch.tensor([1.0, 0.0, -1.0])
A specific test input we want to certify. Mid-magnitude features; nothing special.
48y0 = net(x).item()
Forward-pass the input through the network and extract a Python float. This is the model's score on x.
Pick an attacker's budget — the maximum ‖δ‖ they are allowed to push the input by. Common choice in robustness research for ImageNet is eps = 8/255 ≈ 0.03.
50print(f"f(x) = {y0:.4f}. Any perturbation of norm <= {eps} can move f by at most {L*eps:.4f}.")
Punchline: the network's output cannot move by more than L · eps. If |y0| > L · eps, the prediction is provably safe against any attack inside the eps-ball — a certificate that flows from uniform continuity all the way to robustness.
EXECUTION STATE
example printed line = f(x) = 0.3592. Any perturbation of norm <= 0.05 can move f by at most 0.0649.
6 lines without explanation
1import torch
2import torch.nn as nn
34# --- Step 1: a small "MLP" with ONE hidden layer ---5torch.manual_seed(0)6net = nn.Sequential(7 nn.Linear(3,4, bias=True),# W1 : (4, 3), b1 : (4,)8 nn.ReLU(),9 nn.Linear(4,1, bias=True),# W2 : (1, 4), b2 : (1,)10)1112# --- Step 2: an upper bound on the network's LIPSCHITZ constant ---13# For a composition of layers, ||f(x) - f(y)|| <= L * ||x - y||14# where L = product of the OPERATOR NORMS of all linear layers.15# (ReLU is 1-Lipschitz, so it doesn't increase the bound.)16deflipschitz_upper_bound(net: nn.Sequential)->float:17 L =1.018for layer in net:19ifisinstance(layer, nn.Linear):20# Operator (spectral) norm = largest singular value of W21 sigma_max = torch.linalg.svdvals(layer.weight).max().item()22 L *= sigma_max
23return L
2425L = lipschitz_upper_bound(net)26print(f"upper bound on Lipschitz constant L = {L:.4f}")2728# --- Step 3: an EMPIRICAL Lipschitz estimate by random sampling ---29# UC + Lipschitz means: for delta > 0, |f(x) - f(y)| <= L * delta whenever ||x-y|| <= delta.30# We measure the worst observed ratio |f(x) - f(y)| / ||x - y||.31defempirical_lipschitz(net, n_pairs=5000, dim=3):32 worst =0.033for _ inrange(n_pairs):34 x = torch.randn(dim)35 y = x +0.01* torch.randn(dim)# nearby point36with torch.no_grad():37 ratio =(net(y)- net(x)).norm().item()/(y - x).norm().item()38 worst =max(worst, ratio)39return worst
4041emp = empirical_lipschitz(net)42print(f"empirical Lipschitz ratio = {emp:.4f} (must be <= L)")4344# --- Step 4: an ADVERSARIAL example uses the SAME Lipschitz bound ---45# If we perturb the input by epsilon, the output cannot move by more than L*epsilon.46# Conversely, an attacker who wants to flip a binary decision needs ||delta|| >= |margin| / L.47x = torch.tensor([1.0,0.0,-1.0])48y0 = net(x).item()49eps =0.0550print(f"f(x) = {y0:.4f}. Any perturbation of norm <= {eps} can move f by at most {L*eps:.4f}.")
The connecting thread. Uniform continuity says: small input movement ⇒ small output movement, at a rate independent of where you are in input space. Lipschitz quantifies the rate. The Lipschitz constant times the attacker's budget is the maximum possible output shift — giving a provable minimum distance to the decision boundary. Modern robust deep learning (Parseval networks, spectral normalisation, 1-Lipschitz GANs) is fundamentally an exercise in controlling this constant.
Why Uniform Continuity Matters
Result
How UC enters
Riemann integrability of continuous functions
On [a, b], continuous ⇒ uniformly continuous (Heine–Cantor) ⇒ for any ε there is a partition fine enough that the upper and lower Riemann sums differ by less than ε(b − a). Without UC the proof collapses.
Continuous functions extend to closures
A uniformly continuous function on a dense subset of a metric space extends uniquely to the closure. The completion of the rationals into the reals is a UC-extension.
Numerical analysis
Error bounds for trapezoidal rule, Simpson's rule, and piecewise polynomial interpolation all require a UC modulus to convert step-size estimates into global error.
Convergence of approximation schemes
Stone–Weierstrass and polynomial approximation guarantee uniform convergence — possible only if the limit function is uniformly continuous.
ODE existence (Picard–Lindelöf)
The standard fixed-point proof needs the RHS to be Lipschitz in the state variable — a uniform form of continuity that makes the contraction argument work.
Adversarial robustness in deep learning
A network with Lipschitz constant L cannot have any adversarial example closer than |margin|/L. Spectral-normalised and 1-Lipschitz architectures bake UC directly into the model.
Solving PDEs by Galerkin / finite-element methods
Convergence of the discrete solution to the continuous one demands a uniform-continuity-style bound on the problem's solution operator.
Common Pitfalls
Saying "f is uniformly continuous at x₀". Uniform continuity is a property of functions on sets, not of functions at points. The natural phrase is "f is uniformly continuous on A".
Forgetting that the domain matters.f(x)=x2 is uniformly continuous on [0,3] but NOT on R. The function is the same; the verdict depends on the set.
Concluding non-UC from an empty search. If your computer fails to find a failing pair, that does NOT prove uniform continuity. The search might just be too small. UC needs a positive argument (Heine–Cantor or Lipschitz / Hölder).
Confusing "bounded slope" with "Lipschitz".x is uniformly continuous but has unbounded slope at 0. Bounded slope ⇒ Lipschitz ⇒ UC, but the converse fails.
Thinking UC requires compact domain.sinx and x are uniformly continuous on non-compact domains. Compactness is sufficient (Heine–Cantor) but not necessary.
Assuming the Lipschitz constant of a network equals its operator-norm product. The product is an UPPER bound, often loose by factors of 2–10 for ReLU networks. Tighter bounds exist (e.g., LipSDP) at higher computational cost.
Summary
Uniform continuity upgrades pointwise continuity by demanding a single global δ per ε — independent of the location.
The sequential test refutes UC: build xn,yn with ∣xn−yn∣→0 but ∣f(xn)−f(yn)∣→0. x2 on R and 1/x on (0,1] are the canonical counter-examples.
Heine–Cantor says continuous + compact ⇒ uniformly continuous. Compactness is the cleanest sufficient hypothesis; the proof flows through Bolzano–Weierstrass.
Lipschitz and Hölder conditions are quantitative refinements of UC: they give an explicit δ(ε) rather than promising one exists.
UC underlies Riemann integrability, error analysis of numerical schemes, existence theorems for ODEs, and — most concretely — the certified-robustness guarantees of Lipschitz-controlled neural networks.