Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

State the difference between continuity at a point and uniform continuity on a set — and pinpoint the single quantifier swap that separates them.
Produce the standard counter-examples ( $x^2$ on $\mathbb{R}$ , $1/x$ on $(0, 1]$ ) and explain visually WHY the same $\delta$ cannot survive a slide along the x-axis.
Use the sequential test to refute uniform continuity by finding two sequences $x_n, y_n$ with $|x_n - y_n| \to 0$ but $|f(x_n) - f(y_n)| \not\to 0$ .
State and apply the Heine–Cantor theorem: continuous $+$ compact $\Rightarrow$ uniformly continuous.
Recognise Lipschitz and Hölder conditions as quantitative refinements of uniform continuity, and use them to certify robustness of neural networks.

Why Plain Continuity Is Not Enough

Ordinary continuity is a local promise. At each point $x_0$ , for each tolerance $\varepsilon > 0$ , we can find some $\delta > 0$ that keeps $|f(y) - f(x_0)| < \varepsilon$ on the window $|y - x_0| < \delta$ . Different points are allowed to demand different $\delta$ 's. The function $f(x) = 1/x$ on $(0, 1]$ illustrates the problem: near $x_0 = 1$ a moderate $\delta$ works, but near $x_0 = 10^{-6}$ you need a far tinier $\delta$ for the same $\varepsilon$ .

Many results in analysis need something stronger — they need a single $\delta$ that works everywhere for a given $\varepsilon$ . Riemann integrability, convergence of numerical methods, robustness of neural networks, and approximation by step functions all rely on a single global $\delta$ . That is the strengthened concept we now formalise: uniform continuity.

The big idea. A function is uniformly continuous on a set if a single $\delta$ works for the entire set — the tolerance you need is independent of where on the x-axis you are testing. It is continuity with the supremum's permission slip already filled in.

Two Definitions, Side by Side

The only structural difference is the order of two quantifiers. Read both rows and watch the position of $x_0$ migrate.

Concept	Quantifier order	Plain English
Continuous on $A$	$\forall x_0 \in A \;\; \forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall y \in A, \|y - x_0\| < \delta \Rightarrow \|f(y) - f(x_0)\| < \varepsilon$	First fix the centre $x_0$ , then a tolerance, then hunt for $\delta$ . The $\delta$ may depend on both $\varepsilon$ and $x_0$ .
Uniformly continuous on $A$	$\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x_0, y \in A, \|y - x_0\| < \delta \Rightarrow \|f(y) - f(x_0)\| < \varepsilon$	Pick a tolerance first, then a SINGLE $\delta$ must work for every pair of points in $A$ . The $\delta$ depends only on $\varepsilon$ , not on the location.

The migration matters. Moving

\forall x_0 \in A

from the inside to the outside of

\exists \delta

is the entire difference. With

x_0

inside,

\delta

may shrink as

x_0

changes. With

x_0

outside,

\delta

has to be picked once-and-for-all.

The Formal Definition

Definition (Uniform Continuity). Let

A \subseteq \mathbb{R}

and

f: A \to \mathbb{R}

. Then

f

is uniformly continuous on

A

\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall x, y \in A, \;\; |x - y| < \delta \;\Longrightarrow\; |f(x) - f(y)| < \varepsilon.

Equivalently: as the gap between two inputs in

A

shrinks to zero, the gap between their outputs shrinks to zero uniformly — at a rate that does not depend on where the inputs are.

Two structural remarks. First, uniform continuity is a property of a function on a set, never of a function at a point — "uniformly continuous at $x_0$ " is a nonsensical phrase. Second, uniform continuity is strictly stronger than continuity: every uniformly continuous function is continuous (take the same $\delta$ ), but the converse fails.

The Sequential Test for Failure

Negating the definition gives a powerful tool for disproving uniform continuity.

Sequential criterion for non-uniform continuity.

f

is NOT uniformly continuous on

A

if and only if there exist sequences

(x_n), (y_n)

A

with

|x_n - y_n| \to 0

but

|f(x_n) - f(y_n)| \not\to 0

The strategy is mechanical: build two sequences that crowd together while their function values stay separated. For $f(x) = x^2$ on $\mathbb{R}$ , take $x_n = n, y_n = n + \tfrac{1}{n}$ . Then $|x_n - y_n| = 1/n \to 0$ , but $|f(x_n) - f(y_n)| = |x_n - y_n|\,(x_n + y_n) = \tfrac{1}{n}\,\left(2n + \tfrac{1}{n}\right) = 2 + \tfrac{1}{n^2} \to 2.$ The gap refuses to shrink, so $x^2$ is NOT uniformly continuous on $\mathbb{R}$ . Same argument with $x_n = 1/n, y_n = 1/(n+1)$ proves $1/x$ is not uniformly continuous on $(0, 1]$ .

A Tour of Four Functions

Four functions, four different stories. Before opening the explorer, predict each verdict on its own.

Function	Domain	Uniformly continuous?	Reason
$f(x) = x^2$	$\mathbb{R}$	No	Slope 2x is unbounded; sequential test with (n, n + 1/n) gives gap → 2.
$f(x) = x^2$	$[0, 3]$	Yes	Heine–Cantor: continuous on a closed bounded interval. Lipschitz constant 6 suffices: take δ = ε/6.
$f(x) = 1/x$	$(0, 1]$	No	Domain is bounded but NOT closed; slope blows up at 0. Sequential test with (1/n, 1/(n+1)).
$f(x) = \sin x$	$\mathbb{R}$	Yes	1-Lipschitz: \|sin a − sin b\| ≤ \|a − b\|. Take δ = ε. Domain is non-compact but the bound is global.
$f(x) = \sqrt{x}$	$[0, \infty)$	Yes	Hölder of exponent 1/2: \|√a − √b\| ≤ √\|a − b\|. Take δ = ε². Slope blows up at 0, yet UC holds.

Rows 4 and 5 are the most instructive: an unbounded domain does not forbid uniform continuity. Heine–Cantor is sufficient, not necessary.

Geometric Intuition — The Travelling Box

Picture the $(\varepsilon, \delta)$ challenge as a small rectangle of width $2\delta$ and height $2\varepsilon$ centred on the graph at $(x_0, f(x_0))$ . Plain continuity at $x_0$ says: I can pick a (possibly tiny) $\delta$ so that the graph in the vertical strip $(x_0 - \delta, x_0 + \delta)$ stays inside the box.

Uniform continuity is more demanding. Pick a $\delta$ once. Now slide the box horizontally along the graph, with the same $\delta$ and the same $\varepsilon$ . At every centre $x_0$ , the graph in the box's vertical strip must still fit inside. If it ever spills over the top or bottom of the box, $f$ is not uniformly continuous with that $\delta$ .

Mental movie. A function is uniformly continuous on

A

iff for every height

2\varepsilon

there is a width

2\delta

small enough that a rectangle of those dimensions, translated along the graph over

A

, never catches the curve escaping the top or bottom edge.

Interactive: The Travelling-Box Explorer

Pick a function, then fix $\varepsilon$ and $\delta$ . Drag the $x_0$ slider to push the box along the graph and watch whether it still contains the curve. The green box means the local $(\varepsilon, \delta)$ check passes; red means a point inside the $\delta$ -window lies more than $\varepsilon$ above or below $f(x_0)$ . Press stress-test to make the computer do the sliding for you.

Loading uniform continuity explorer…

Things to try. (1) On

f = x^2

over

\mathbb{R}

, slide

x_0

toward

x_0 = 5

and watch the green box turn red as the slope

2x_0

wins. (2) Switch to

x^2

on the COMPACT interval

[0, 3]

— the stress-test confirms the same

\delta

survives everywhere. (3) On

1/x

, slide

x_0

from

1

down toward

0.1

and watch the vertical gap explode.

Heine–Cantor Theorem — The Bridge to Compactness

Heine–Cantor Theorem. Let

f: [a, b] \to \mathbb{R}

be continuous on the closed bounded interval

[a, b]

. Then

f

is uniformly continuous on

[a, b]

This is the single most important upgrade in the section. Continuity on a compact interval is automatically uniform — so all the technical advantages of uniform continuity (a single global $\delta$ ) come for free as long as we restrict attention to closed bounded sets. Once we leave compactness, as with $x^2$ on $\mathbb{R}$ or $1/x$ on $(0, 1]$ , uniform continuity has to be earned by other means.

Proof sketch (sequential). Assume not — then some $\varepsilon_0 > 0$ admits sequences $x_n, y_n \in [a, b]$ with $|x_n - y_n| \to 0$ but $|f(x_n) - f(y_n)| \ge \varepsilon_0$ . By the Bolzano–Weierstrass theorem (compactness), $(x_n)$ has a subsequence $(x_{n_k})$ converging to some $c \in [a, b]$ . The same indices force $y_{n_k} \to c$ as well. Continuity of $f$ at $c$ yields $f(x_{n_k}), f(y_{n_k}) \to f(c)$ , so their difference tends to $0$ — contradicting the assumed lower bound $\varepsilon_0$ .

Why both hypotheses matter. Compactness gives the convergent subsequence. Continuity transfers the convergence of x to convergence of f(x). Drop either and the chain breaks — and the explorer's

x^2

-on-

\mathbb{R}

red boxes are precisely the case where compactness is missing.

Lipschitz and Hölder — Quantitative Cousins

Uniform continuity says "a $\delta$ exists" but not how big it is. Two named refinements pin down the dependence of $\delta$ on $\varepsilon$ explicitly — both are uniform continuity in disguise.

Name	Condition	Implied δ(ε)
Lipschitz with constant $L$	$\|f(x) - f(y)\| \le L\,\|x - y\| \;\;\forall x, y \in A$	$\delta = \varepsilon / L$ — linear in $\varepsilon$ .
Hölder with exponent $\alpha \in (0, 1]$ and constant $C$	$\|f(x) - f(y)\| \le C\,\|x - y\|^{\alpha} \;\;\forall x, y \in A$	$\delta = (\varepsilon / C)^{1/\alpha}$ — polynomial in $\varepsilon$ .

Hölder generalises Lipschitz: $\alpha = 1$ recovers Lipschitz. A textbook example: $\sqrt{x}$ is Hölder-1/2 on $[0, \infty)$ (with $C = 1$ ) but not Lipschitz — its slope explodes at 0. These quantitative cousins are exactly the right vocabulary for stability and robustness analysis, which is why they reappear immediately in the PyTorch section.

Worked Example — Prove x² Fails by Hand

We refute uniform continuity of $f(x) = x^2$ on $\mathbb{R}$ in full detail, then check the claim against the explorer. The argument is the prototype every UC failure proof copies.

▶Show full worked solution — f(x) = x² is not uniformly continuous on ℝ

Step 1 — State what we have to refute.

We want to show NOT uniformly continuous: there exists some $\varepsilon_0 > 0$ such that for every $\delta > 0$ we can find a pair $x, y \in \mathbb{R}$ with $|x - y| < \delta$ and $|f(x) - f(y)| \ge \varepsilon_0$ .

Step 2 — Pick a witness $\varepsilon_0$ .

Take $\varepsilon_0 = 1$ . (Any positive number $\le 2$ would work. We will see why $2$ is the natural ceiling at the end.)

Step 3 — Build the failing pair from a free parameter.

For each $n \in \mathbb{N}$ , set $x_n = n, \qquad y_n = n + \tfrac{1}{n}.$ The distance is $|x_n - y_n| = 1/n$ , which can be made smaller than any prescribed $\delta$ by choosing $n > 1/\delta$ .

Step 4 — Compute the function-value gap.

Using $a^2 - b^2 = (a - b)(a + b)$ , $|f(y_n) - f(x_n)| = |y_n - x_n| \cdot (y_n + x_n) = \tfrac{1}{n} \cdot \left(2n + \tfrac{1}{n}\right) = 2 + \tfrac{1}{n^2}.$ For every $n \ge 1$ this gap is strictly greater than $2$ , hence certainly greater than $\varepsilon_0 = 1$ .

Step 5 — Assemble the refutation.

Given any candidate $\delta > 0$ , pick $n > 1/\delta$ . Then $|x_n - y_n| = 1/n < \delta$ while $|f(x_n) - f(y_n)| > 2 > 1 = \varepsilon_0$ . So $\delta$ does NOT work for $\varepsilon_0 = 1$ . Since this is true for every $\delta$ , no single $\delta$ can serve all of $\mathbb{R}$ . QED.

Step 6 — Cross-check with the explorer.

In the explorer set the preset to $f(x) = x^2$ , set $\varepsilon = 1$ , set $\delta = 0.1$ . Slide $x_0$ to about $5$ : the box turns red, and the worst gap inside the window reads $\approx 1.01$ — matching the prediction $2x_0 \delta = 2 \cdot 5 \cdot 0.1 = 1$ .

Takeaway recipe — refuting uniform continuity in four moves.

Pick a fixed $\varepsilon_0 > 0$ .
Engineer two sequences indexed by $n$ with $|x_n - y_n| \to 0$ .
Show $|f(x_n) - f(y_n)|$ stays bounded below by $\varepsilon_0$ .
Choose $n$ large to defeat any candidate $\delta$ .

Python — Hunting for a Failing Pair

The hand-proof says "for every $\delta$ there is a pair". Below is exactly the same negation, encoded as a Python search. We feed it five candidate $\delta$ 's and watch it defeat each one in turn.

Plain Python — refute UC for f(x) = x² by exhibiting a failing pair

🐍uniform_continuity_failure.py

Explanation(30)

Code(32)

1# --- Step 1: the function we suspect is NOT uniformly continuous on R ---

We pick the smallest counter-example: f(x) = x². It is continuous at every real number (a polynomial), so plain continuity holds. The question is whether the SAME δ can work for every ε, regardless of where on the x-axis we centre the test.

2def f(x) → float

The function under test. We deliberately keep it tiny so the example stays focused on the uniform-continuity question, not on the function's algebra.

EXECUTION STATE

⬇ input: x = Any real number. We will call f at carefully chosen pairs (a, b) with a → ∞ to force the gap |f(a) − f(b)| to stay large.

formula = f(x) = x². Slope f′(x) = 2x is unbounded as x → ∞ — that is the entire source of the failure.

⬆ returns = Python float. f(1) = 1, f(100) = 10 000, f(100.01) = 10 002.0001 — note how a tiny δ = 0.01 step changes the value by ≈ 2 at x = 100, but only by ≈ 0.02 at x = 1.

3Docstring — "Plain squaring."

States the function and flags the property that drives the failure (slope grows without bound). Keeping the WHY in the docstring is the difference between a reusable example and a throwaway snippet.

4return x * x

Pure arithmetic. No library calls. We use x * x instead of x**2 because multiplication is marginally faster and avoids any tiny floating-point surprises from the ** operator on negative bases.

6# --- Step 2: encode the "uniform continuity FAILS" definition directly ---

7def failing_pair(epsilon, delta, x_max=10_000) → tuple | None

Mechanical translation of the negation of UC. Given a candidate δ and a target ε, we look for any (a, b) with |a − b| < δ but |f(a) − f(b)| ≥ ε.

EXECUTION STATE

⬇ input: epsilon = The tolerance the candidate δ is supposed to enforce. We fix ε = 1 in the driver code — meaning we want |f(a) − f(b)| < 1 for every nearby pair.

⬇ input: delta = The proposed step size that someone CLAIMS works uniformly. Our job is to refute that claim by finding a single failing pair.

⬇ input: x_max=10_000 = Search budget. Because x² has slope 2x, the failing pair lives somewhere around x = ε/(2δ) — at ε = 1, δ = 0.0001 we expect a ≈ 5000. Budget must exceed that.

⬆ returns = Either a tuple (a, b, gap) proving UC fails at this δ, or None when the search budget was exhausted without finding one. None does NOT prove UC — only a smarter or wider search might still find a pair.

8Docstring — "Search for (a, b) with |a-b| < delta but |f(a)-f(b)| >= epsilon."

A literal restatement of the negation of UC. Documenting the contract here means anyone reading the function knows immediately what to expect.

10# The classic witness sequence: a_n = n, b_n = n + 1/n.

We could brute-force scan; instead we use a known winning strategy. The pair (n, n + 1/n) is the textbook witness for x² not being uniformly continuous — we will see exactly why it works.

11# |a-b| = 1/n -> 0,

First half of the cleverness: the distance between the two test points shrinks to zero as n grows. Any candidate δ — no matter how small — is eventually exceeded by 1/n.

12# but |a^2 - b^2| = |a-b|*(a+b) = (1/n)*(2n + 1/n)

Algebra reveals the failure. a² − b² factors as (a − b)(a + b). The (a − b) factor is shrinking like 1/n, but the (a + b) factor is GROWING like 2n. Their product is bounded BELOW by 2.

13# = 2 + 1/n^2 -> 2.

So no matter how small δ is, you can pick n large enough to fit a and b inside a δ-window AND have their function values 2 units apart. ε = 1 < 2 is impossible to enforce. UC fails.

14for n in range(1, x_max):

Try n = 1, 2, 3, …, x_max − 1. The loop will exit early via `return` as soon as we hit a pair whose distance is below δ.

LOOP TRACE · 3 iterations

Driver loop's δ = 1.0

n = 1 = a = 1.0, b = 2.0, |a−b| = 1.0 — NOT < 1.0, skip

n = 2 = a = 2.0, b = 2.5, |a−b| = 0.5 < 1.0 ✓

gap = |f(2) − f(2.5)| = |4 − 6.25| = 2.25 ≥ ε = 1 → RETURN (2.0, 2.5, 2.25)

Driver loop's δ = 0.01

n = 100 = a = 100.0, b = 100.01, |a−b| = 0.01 — equal, NOT strictly less than δ → skip

n = 101 = a = 101.0, b = 101.00990…, |a−b| = 0.00990 < 0.01 ✓

gap = |101² − 101.00990…²| = (1/101)·(2·101 + 1/101) ≈ 2.00010 ≥ ε = 1 → RETURN

Driver loop's δ = 0.0001

n = Loop must reach n ≈ 10 001 before 1/n becomes < 0.0001

found pair = a ≈ 10001, b ≈ 10001.0001, gap ≈ 2 — confirms the failure pattern at every scale

15a = float(n)

First member of the candidate pair. Cast to float so subsequent arithmetic stays in floating-point and doesn't get promoted to int by accident.

EXECUTION STATE

a (during loop) =

n=1 → 1.0
n=2 → 2.0
n=10 → 10.0
n=100 → 100.0

16b = a + 1.0 / n # |a-b| = 1/n

Second member of the pair, at horizontal distance exactly 1/n from a. The comment annotates the key fact the search relies on. Floating-point division produces a float — promotion is automatic.

EXECUTION STATE

b (during loop) =

n=1 → 2.0     (|a−b|=1.0)
n=2 → 2.5     (|a−b|=0.5)
n=10 → 10.1   (|a−b|=0.1)
n=100 → 100.01 (|a−b|=0.01)

17if abs(a - b) >= delta:

Filter out pairs that are not close enough. UC's requirement is strict inequality, so we use >= to skip — the case |a − b| = δ is on the boundary and is NOT considered a failure.

EXECUTION STATE

📚 abs(x) = Python builtin: absolute value for ints, floats, and complex numbers. For real numbers, identical to math.fabs but slightly faster on plain floats.

18continue # too far apart — skip

Skip to the next iteration. As n grows, 1/n shrinks, so eventually we will be inside δ. This step is just patience.

19gap = abs(f(a) - f(b)) # |f(a) - f(b)|

Compute the function-value gap. For our chosen pair, gap = |a² − b²| = (b − a)(a + b) = (1/n)(2n + 1/n) = 2 + 1/n². So gap is always > 2.

EXECUTION STATE

gap (during loop, all pairs that pass the δ filter) =

n=2:  |4 − 6.25|        = 2.2500
n=10: |100 − 102.01|    = 2.0100
n=100:|10000 − 10002.0001|= 2.0001
ALL ≥ 2 — the bound is sharp

20if gap >= epsilon:

Have we found the witness? The condition |f(a) − f(b)| ≥ ε is exactly the second half of the UC failure clause. If yes, we return immediately.

21return (a, b, gap)

Hand back the smoking gun: the pair (a, b) and the gap it produces. The caller can print it, plot it, or assert it.

22return None # no failing pair found within search budget

Fell off the end of the search. This is NOT a proof of uniform continuity — it might just mean our budget was too small. For f(x) = x² with ε = 1 and δ = 0.0001, the failing pair lives near n = 10 000, so x_max must be at least that high.

EXECUTION STATE

⚠ asymmetry = Finding a failing pair PROVES non-uniform continuity. Failing to find one PROVES nothing — uniform continuity has to be established by a positive argument (e.g., Heine–Cantor on a compact interval, or a Lipschitz bound).

24# --- Step 3: try several delta values, fix epsilon = 1 ---

We probe the function at decreasing δ values. UC would require a SINGLE δ to work for every nearby pair. Watching the search succeed at every δ confirms that no such δ exists — non-uniform.

25epsilon = 1.0

Fix ε. The choice ε = 1 is arbitrary — any positive number would do, since the gap goes to 2 ≥ any ε ≤ 2. We could also use ε = 0.1; the search would find pairs further out but still find them.

26for delta in [1.0, 0.1, 0.01, 0.001, 0.0001]:

Iterate over a geometric sequence of candidate δ values, each 10× smaller than the last. UC's contract is 'pick ANY δ > 0 and I will defeat it' — so we hammer five increasingly hopeful δ values to make the point.

27result = failing_pair(epsilon, delta)

Call the searcher. The function returns either a (a, b, gap) tuple (failure proven) or None (search exhausted).

28if result is None:

Branch on the search outcome. Use `is None` not `== None` — the identity test is unambiguous for the singleton None.

29print(f"delta={delta:<8}: no failing pair found (NOT a proof of UC!)")

When our budget is too small, we cannot conclude anything. The warning string makes that explicit so a reader does not over-interpret the empty result.

30else:

The interesting branch: we found a witness. Unpack and print.

31a, b, gap = result

Tuple destructuring. Python pattern: split a returned tuple into named variables in one line. Order matters — must match the order in `return (a, b, gap)` above.

32print(f"delta={delta:<8}: failing pair a={a:.4f}, b={b:.4f}, "

Format the witness. {delta:<8} left-aligns δ in an 8-character field so the output lines up in a column. {a:.4f} prints a with four decimal places.

EXECUTION STATE

expected printed output =

delta=1.0     : failing pair a=2.0000, b=2.5000, |f(a)-f(b)|=2.2500  >=  epsilon=1.0
delta=0.1     : failing pair a=11.0000, b=11.0909, |f(a)-f(b)|=2.0083 >= 1.0
delta=0.01    : failing pair a=101.0000, b=101.0099, |f(a)-f(b)|=2.0001 >= 1.0
delta=0.001   : failing pair a=1001.0000, b=1001.0010, |f(a)-f(b)|≈2.000 >= 1.0
delta=0.0001  : failing pair a=10001.0000, b=10001.0001, |f(a)-f(b)|≈2.000 >= 1.0

33f"|f(a)-f(b)|={gap:.4f} >= epsilon={epsilon}")

Closing fragment of the f-string. The repeated pattern — every δ produces a failing pair, and the gap is always ≈ 2 — is the visual proof that the slope of x² is dragging the failure all the way out to infinity. UC is decisively refuted.

2 lines without explanation

1# --- Step 1: the function we suspect is NOT uniformly continuous on R ---
2def f(x):
3    """Plain squaring. Continuous everywhere, but slope grows without bound."""
4    return x * x
5
6# --- Step 2: encode the "uniform continuity FAILS" definition directly ---
7def failing_pair(epsilon, delta, x_max=10_000):
8    """Search for (a, b) with |a-b| < delta but |f(a)-f(b)| >= epsilon.
9    If such a pair exists, plain continuity is NOT uniform on R."""
10    # The classic witness sequence: a_n = n, b_n = n + 1/n.
11    # |a-b| = 1/n -> 0, but |a^2 - b^2| = |a-b|*(a+b) = (1/n)*(2n + 1/n)
12    #                                  = 2 + 1/n^2  ->  2.
13    for n in range(1, x_max):
14        a = float(n)
15        b = a + 1.0 / n            # |a-b| = 1/n
16        if abs(a - b) >= delta:
17            continue               # too far apart — skip
18        gap = abs(f(a) - f(b))     # |f(a) - f(b)|
19        if gap >= epsilon:
20            return (a, b, gap)
21    return None                    # no failing pair found within search budget
22
23# --- Step 3: try several delta values, fix epsilon = 1 ---
24epsilon = 1.0
25for delta in [1.0, 0.1, 0.01, 0.001, 0.0001]:
26    result = failing_pair(epsilon, delta)
27    if result is None:
28        print(f"delta={delta:<8}: no failing pair found  (NOT a proof of UC!)")
29    else:
30        a, b, gap = result
31        print(f"delta={delta:<8}: failing pair a={a:.4f}, b={b:.4f}, "
32              f"|f(a)-f(b)|={gap:.4f}  >=  epsilon={epsilon}")

The asymmetry. Finding a failing pair proves non-uniform continuity. Failing to find one proves nothing — UC has to be established by a positive argument (Heine–Cantor on a compact set, or a Lipschitz / Hölder bound). This is the same logical asymmetry as P vs NP: a witness is easy to verify, impossible to verify by exhaustion.

PyTorch — Uniform Continuity and Adversarial Robustness

Why should an ML engineer care about uniform continuity? Because a neural network built from Lipschitz layers is uniformly continuous on all of input space, and the Lipschitz constant $L$ gives an immediate robustness certificate: an adversary perturbing the input by $\|\boldsymbol{\delta}\| \le \varepsilon$ cannot move the output by more than $L \varepsilon$ . The whole argument is just uniform continuity with explicit $\delta(\varepsilon) = \varepsilon / L$ .

PyTorch — compute a Lipschitz upper bound, sample-verify it, certify robustness

🐍lipschitz_certificate.py

Explanation(44)

Code(50)

1import torch

Loads the core PyTorch package. We need tensors (for the network's input), torch.linalg.svdvals (to read the singular values of each weight matrix), and torch.manual_seed (so the demo is reproducible).

EXECUTION STATE

torch = Tensor + autograd + neural-network stack.

2import torch.nn as nn

Subpackage with neural-network building blocks. We use nn.Linear (affine layer), nn.ReLU (non-linearity), and nn.Sequential (containers that wires them in order).

4# --- Step 1: a small "MLP" with ONE hidden layer ---

We build the smallest network that still has interesting Lipschitz structure: 3-D input → 4-D hidden → 1-D output. The Lipschitz argument scales identically to networks with millions of weights.

5torch.manual_seed(0)

Pin the random number generator so weight initialisation is reproducible. Without this line, every run of the script produces different singular values and different printed numbers.

EXECUTION STATE

📚 torch.manual_seed(int) = Seeds the PyTorch CPU RNG. For GPU runs, also call torch.cuda.manual_seed_all. Different seeds — different weights — different Lipschitz bound, but the relationship empirical <= upper-bound still holds.

6net = nn.Sequential(

Chain three modules end-to-end. nn.Sequential exposes the modules as a Python list while wiring forward(x) to compose them in order.

7nn.Linear(3, 4, bias=True), # W1 : (4, 3), b1 : (4,)

First affine layer: y = W₁ x + b₁. W₁ is a (4, 3) matrix — 4 output features, 3 input features. The PyTorch convention is rows = outputs, cols = inputs, opposite from some textbooks.

EXECUTION STATE

📚 nn.Linear(in_features, out_features, bias) = Builds a torch.nn.Linear module. Stores .weight (shape (out, in)) and .bias (shape (out,)) as learnable Parameters. Forward: input @ weight.T + bias.

operator norm = ‖W₁‖_op = largest singular value of W₁ = max_x ‖W₁ x‖ / ‖x‖. It is the Lipschitz constant of the linear map y = W₁ x.

8nn.ReLU(),

Element-wise ReLU: relu(x) = max(0, x). Crucially, ReLU is 1-Lipschitz: |relu(a) − relu(b)| ≤ |a − b|. Composition with a 1-Lipschitz map cannot inflate the upstream Lipschitz constant.

EXECUTION STATE

why 1-Lipschitz? = If a, b ≥ 0: |a − b|. If a, b ≤ 0: |0 − 0| = 0. If a > 0 > b: |a − 0| ≤ |a − b|. So always ≤ |a − b|.

9nn.Linear(4, 1, bias=True), # W2 : (1, 4), b2 : (1,)

Second affine layer, mapping the 4-D hidden state to a scalar output. Combined with the ReLU, this is a tiny binary-classifier head.

10)

Close the Sequential container. net is now a callable module: net(x) runs Linear → ReLU → Linear in order.

12# --- Step 2: an upper bound on the network's LIPSCHITZ constant ---

A Lipschitz constant L for f means |f(x) − f(y)| ≤ L · |x − y|. This is a quantitative form of uniform continuity: take δ = ε / L and the UC condition holds at every x. Networks built from Lipschitz layers are uniformly continuous on all of Rᴺ.

13# For a composition of layers, ||f(x) - f(y)|| <= L * ||x - y||

Composition rule. If f₁ is L₁-Lipschitz and f₂ is L₂-Lipschitz, then f₂ ∘ f₁ is (L₁·L₂)-Lipschitz. Proof in one line: |f₂(f₁(x)) − f₂(f₁(y))| ≤ L₂|f₁(x) − f₁(y)| ≤ L₁L₂|x − y|.

14# where L = product of the OPERATOR NORMS of all linear layers.

Operator (spectral) norm of W = max ‖Wx‖/‖x‖ = top singular value. Since the bias just shifts and doesn't stretch, it does not enter the Lipschitz bound.

15# (ReLU is 1-Lipschitz, so it doesn't increase the bound.)

ReLU contributes a factor of 1. Same for tanh, sigmoid, GELU, LayerNorm (with standard parameterisation). For dropout and batch-norm the analysis is more delicate.

16def lipschitz_upper_bound(net: nn.Sequential) → float

Walk through the Sequential, multiply the operator norms of every linear layer, return the product. Other layers contribute factor 1.

EXECUTION STATE

⬇ input: net = Any nn.Sequential. Could equally be a Module with `.modules()` — we just keep the demo simple.

⬆ returns = A Python float L such that |net(x) − net(y)| ≤ L · ‖x − y‖ for all x, y ∈ R³. With this seed: L ≈ 1.30.

17L = 1.0

Initialise the running product. Identity factor for multiplication.

18for layer in net:

Iterate over the Sequential's children in forward order. nn.Sequential implements __iter__, so this works directly.

19if isinstance(layer, nn.Linear):

Only Linear layers contribute a non-trivial factor. Everything else (ReLU, etc.) we treat as 1-Lipschitz and silently skip.

EXECUTION STATE

📚 isinstance(obj, cls) = Python builtin: True if obj is an instance of cls (or a subclass). The standard duck-typing alternative is hasattr(layer, 'weight') — both work.

20# Operator (spectral) norm = largest singular value of W

Mathematical reminder. For any matrix W, ‖W‖_op = σ₁(W), the top singular value. We will read it directly from PyTorch's SVD.

21sigma_max = torch.linalg.svdvals(layer.weight).max().item()

Compute the SVD of the weight matrix and take its top singular value. .item() converts the resulting 0-d tensor to a Python float.

EXECUTION STATE

📚 torch.linalg.svdvals(M) = Returns the singular values of M, sorted in descending order, as a 1-D tensor. Faster than torch.linalg.svd because it skips the U, V factors.

layer.weight =

The weight matrix Parameter. For Linear(3, 4): shape (4, 3) — out × in.

⬆ first-layer value (seed 0) = σ₁(W₁) ≈ 0.9684 (PyTorch's default Kaiming init gives spectral norm close to 1, slightly below 1).

⬆ second-layer value (seed 0) = σ₁(W₂) ≈ 1.341

22L *= sigma_max

Accumulate the product. After the first Linear: L ≈ 0.9684. After the ReLU (skipped): unchanged. After the second Linear: L ≈ 0.9684 · 1.341 ≈ 1.30.

23return L

Hand back the upper bound. This is a PROVEN bound — no further empirical testing can violate it.

25L = lipschitz_upper_bound(net)

Run the audit. The result is a single number that fully characterises how much the network can amplify any input perturbation.

26print(f"upper bound on Lipschitz constant L = {L:.4f}")

Display L. Typical printed value with seed 0: L ≈ 1.3000. Take δ = ε / L and uniform continuity holds with that δ — everywhere on R³.

EXECUTION STATE

expected printed line = upper bound on Lipschitz constant L = 1.2988

28# --- Step 3: an EMPIRICAL Lipschitz estimate by random sampling ---

We now sanity-check the theoretical bound by sampling. The empirical ratio MUST be ≤ L for every pair, so the largest observed ratio gives a lower bound on the TRUE Lipschitz constant.

29# UC + Lipschitz means: for delta > 0, |f(x) - f(y)| <= L * delta whenever ||x-y|| <= delta.

Restates the contract. The empirical loop tests this directly: pick random pairs (x, y) with small ‖x − y‖ and measure the resulting output gap.

30# We measure the worst observed ratio |f(x) - f(y)| / ||x - y||.

Worst-case ratio is a LOWER bound on the true Lipschitz constant of the network. If this ratio ever exceeded L, our theoretical bound would be wrong.

31def empirical_lipschitz(net, n_pairs=5000, dim=3):

Randomly sample n_pairs nearby pairs and track the maximum |f(x) − f(y)| / ‖x − y‖.

EXECUTION STATE

⬇ input: n_pairs = Sample size. 5000 is a comfortable trade-off — enough to surface the worst-case direction with reasonable probability, fast enough to run in milliseconds.

32worst = 0.0

Running maximum of the observed ratios. We are looking for a LOWER bound on the true Lipschitz constant, so we maximise.

33for _ in range(n_pairs):

Iterate 5000 times. Underscore convention for an unused loop variable.

34x = torch.randn(dim)

Draw a random 3-D input from the standard normal. randn produces float32 tensors by default.

EXECUTION STATE

📚 torch.randn(*shape) = Returns a tensor of given shape with values ~ N(0, 1). dim=3 → shape (3,). For higher dimensions: torch.randn(B, 3) for a batch.

35y = x + 0.01 * torch.randn(dim) # nearby point

Build a nearby y. Mean perturbation ‖y − x‖ ≈ 0.01·√3 ≈ 0.017. Smaller perturbations sharpen the empirical estimate but at the cost of more floating-point noise.

36with torch.no_grad():

Disable autograd for the forward pass — we are only measuring, not training. Skips graph construction so the inner loop runs ≈ 3× faster.

EXECUTION STATE

📚 torch.no_grad() = Context manager that turns off gradient tracking. Required for inference loops; failing to use it leads to memory leaks because activation tensors stay alive for a backward that never comes.

37ratio = (net(y) - net(x)).norm().item() / (y - x).norm().item()

Compute the local stretch factor. Both norms are L² (default for .norm() on a vector). .item() pulls a Python float out of each 0-d tensor.

38worst = max(worst, ratio)

Update the running maximum. After the loop, worst is a sampled lower bound on the true Lipschitz constant.

39return worst

Hand back the observed worst ratio.

41emp = empirical_lipschitz(net)

Run the sampling. With seed 0 and 5000 samples in 3-D, the empirical worst is around 1.10 — comfortably below L ≈ 1.30, which is exactly the theoretical guarantee.

42print(f"empirical Lipschitz ratio = {emp:.4f} (must be <= L)")

Print and remind the reader of the inequality. If you ever observe emp > L, you have a bug — either in the operator-norm calculation or in your network definition.

EXECUTION STATE

expected output = empirical Lipschitz ratio = 1.0974 (must be <= L)

44# --- Step 4: an ADVERSARIAL example uses the SAME Lipschitz bound ---

Pivot from theory to security. An adversarial attack tries to perturb x by a small amount and flip the network's prediction. Lipschitz bounds tell us the SMALLEST such perturbation must satisfy ‖δ‖ ≥ |margin| / L.

45# If we perturb the input by epsilon, the output cannot move by more than L*epsilon.

Direct consequence of |f(x + δ) − f(x)| ≤ L‖δ‖. This is the network-level analogue of the uniform-continuity δ-ε contract.

46# Conversely, an attacker who wants to flip a binary decision needs ||delta|| >= |margin| / L.

Robustness certificate. If your decision boundary is f(x) = 0, the confidence margin |f(x)| / L is a GUARANTEED radius of safety. Training networks to maximise this margin gives certified robust models — and the entire argument rests on uniform continuity.

47x = torch.tensor([1.0, 0.0, -1.0])

A specific test input we want to certify. Mid-magnitude features; nothing special.

48y0 = net(x).item()

Forward-pass the input through the network and extract a Python float. This is the model's score on x.

EXECUTION STATE

expected y0 (seed 0) = y0 ≈ 0.36 (depends on Kaiming init).

49eps = 0.05

Pick an attacker's budget — the maximum ‖δ‖ they are allowed to push the input by. Common choice in robustness research for ImageNet is eps = 8/255 ≈ 0.03.

50print(f"f(x) = {y0:.4f}. Any perturbation of norm <= {eps} can move f by at most {L*eps:.4f}.")

Punchline: the network's output cannot move by more than L · eps. If |y0| > L · eps, the prediction is provably safe against any attack inside the eps-ball — a certificate that flows from uniform continuity all the way to robustness.

EXECUTION STATE

example printed line = f(x) = 0.3592. Any perturbation of norm <= 0.05 can move f by at most 0.0649.

6 lines without explanation

1import torch
2import torch.nn as nn
3
4# --- Step 1: a small "MLP" with ONE hidden layer ---
5torch.manual_seed(0)
6net = nn.Sequential(
7    nn.Linear(3, 4, bias=True),   # W1 : (4, 3),  b1 : (4,)
8    nn.ReLU(),
9    nn.Linear(4, 1, bias=True),   # W2 : (1, 4),  b2 : (1,)
10)
11
12# --- Step 2: an upper bound on the network's LIPSCHITZ constant ---
13# For a composition of layers, ||f(x) - f(y)|| <= L * ||x - y||
14# where L = product of the OPERATOR NORMS of all linear layers.
15# (ReLU is 1-Lipschitz, so it doesn't increase the bound.)
16def lipschitz_upper_bound(net: nn.Sequential) -> float:
17    L = 1.0
18    for layer in net:
19        if isinstance(layer, nn.Linear):
20            # Operator (spectral) norm = largest singular value of W
21            sigma_max = torch.linalg.svdvals(layer.weight).max().item()
22            L *= sigma_max
23    return L
24
25L = lipschitz_upper_bound(net)
26print(f"upper bound on Lipschitz constant L = {L:.4f}")
27
28# --- Step 3: an EMPIRICAL Lipschitz estimate by random sampling ---
29# UC + Lipschitz means: for delta > 0, |f(x) - f(y)| <= L * delta whenever ||x-y|| <= delta.
30# We measure the worst observed ratio |f(x) - f(y)| / ||x - y||.
31def empirical_lipschitz(net, n_pairs=5000, dim=3):
32    worst = 0.0
33    for _ in range(n_pairs):
34        x = torch.randn(dim)
35        y = x + 0.01 * torch.randn(dim)        # nearby point
36        with torch.no_grad():
37            ratio = (net(y) - net(x)).norm().item() / (y - x).norm().item()
38        worst = max(worst, ratio)
39    return worst
40
41emp = empirical_lipschitz(net)
42print(f"empirical Lipschitz ratio   = {emp:.4f}  (must be <= L)")
43
44# --- Step 4: an ADVERSARIAL example uses the SAME Lipschitz bound ---
45# If we perturb the input by epsilon, the output cannot move by more than L*epsilon.
46# Conversely, an attacker who wants to flip a binary decision needs ||delta|| >= |margin| / L.
47x   = torch.tensor([1.0, 0.0, -1.0])
48y0  = net(x).item()
49eps = 0.05
50print(f"f(x) = {y0:.4f}.  Any perturbation of norm <= {eps} can move f by at most {L*eps:.4f}.")

The connecting thread. Uniform continuity says: small input movement ⇒ small output movement, at a rate independent of where you are in input space. Lipschitz quantifies the rate. The Lipschitz constant times the attacker's budget is the maximum possible output shift — giving a provable minimum distance to the decision boundary. Modern robust deep learning (Parseval networks, spectral normalisation, 1-Lipschitz GANs) is fundamentally an exercise in controlling this constant.

Why Uniform Continuity Matters

Result	How UC enters
Riemann integrability of continuous functions	On [a, b], continuous ⇒ uniformly continuous (Heine–Cantor) ⇒ for any ε there is a partition fine enough that the upper and lower Riemann sums differ by less than ε(b − a). Without UC the proof collapses.
Continuous functions extend to closures	A uniformly continuous function on a dense subset of a metric space extends uniquely to the closure. The completion of the rationals into the reals is a UC-extension.
Numerical analysis	Error bounds for trapezoidal rule, Simpson's rule, and piecewise polynomial interpolation all require a UC modulus to convert step-size estimates into global error.
Convergence of approximation schemes	Stone–Weierstrass and polynomial approximation guarantee uniform convergence — possible only if the limit function is uniformly continuous.
ODE existence (Picard–Lindelöf)	The standard fixed-point proof needs the RHS to be Lipschitz in the state variable — a uniform form of continuity that makes the contraction argument work.
Adversarial robustness in deep learning	A network with Lipschitz constant L cannot have any adversarial example closer than \|margin\|/L. Spectral-normalised and 1-Lipschitz architectures bake UC directly into the model.
Solving PDEs by Galerkin / finite-element methods	Convergence of the discrete solution to the continuous one demands a uniform-continuity-style bound on the problem's solution operator.

Common Pitfalls

Saying "f is uniformly continuous at x₀". Uniform continuity is a property of functions on sets, not of functions at points. The natural phrase is " $f$ is uniformly continuous on $A$ ".
Forgetting that the domain matters. $f(x) = x^2$ is uniformly continuous on $[0, 3]$ but NOT on $\mathbb{R}$ . The function is the same; the verdict depends on the set.
Concluding non-UC from an empty search. If your computer fails to find a failing pair, that does NOT prove uniform continuity. The search might just be too small. UC needs a positive argument (Heine–Cantor or Lipschitz / Hölder).
Confusing "bounded slope" with "Lipschitz". $\sqrt{x}$ is uniformly continuous but has unbounded slope at 0. Bounded slope ⇒ Lipschitz ⇒ UC, but the converse fails.
Thinking UC requires compact domain. $\sin x$ and $\sqrt{x}$ are uniformly continuous on non-compact domains. Compactness is sufficient (Heine–Cantor) but not necessary.
Assuming the Lipschitz constant of a network equals its operator-norm product. The product is an UPPER bound, often loose by factors of 2–10 for ReLU networks. Tighter bounds exist (e.g., LipSDP) at higher computational cost.

Summary

Uniform continuity upgrades pointwise continuity by demanding a single global $\delta$ per $\varepsilon$ — independent of the location.
The sequential test refutes UC: build $x_n, y_n$ with $|x_n - y_n| \to 0$ but $|f(x_n) - f(y_n)| \not\to 0$ . $x^2$ on $\mathbb{R}$ and $1/x$ on $(0, 1]$ are the canonical counter-examples.
Heine–Cantor says continuous + compact ⇒ uniformly continuous. Compactness is the cleanest sufficient hypothesis; the proof flows through Bolzano–Weierstrass.
Lipschitz and Hölder conditions are quantitative refinements of UC: they give an explicit $\delta(\varepsilon)$ rather than promising one exists.
UC underlies Riemann integrability, error analysis of numerical schemes, existence theorems for ODEs, and — most concretely — the certified-robustness guarantees of Lipschitz-controlled neural networks.

Learning Objectives

Why Plain Continuity Is Not Enough

Two Definitions, Side by Side

The Formal Definition

The Sequential Test for Failure

A Tour of Four Functions

Geometric Intuition — The Travelling Box

Interactive: The Travelling-Box Explorer

Heine–Cantor Theorem — The Bridge to Compactness

Lipschitz and Hölder — Quantitative Cousins

Worked Example — Prove x² Fails by Hand

Step 1 — State what we have to refute.

Step 2 — Pick a witness ε0\varepsilon_0ε0​.

Step 3 — Build the failing pair from a free parameter.

Step 4 — Compute the function-value gap.

Step 5 — Assemble the refutation.

Step 6 — Cross-check with the explorer.

Python — Hunting for a Failing Pair

PyTorch — Uniform Continuity and Adversarial Robustness

Why Uniform Continuity Matters

Common Pitfalls

Summary

Step 2 — Pick a witness $\varepsilon_0$ .