State the Extreme Value Theorem precisely and identify its three hypotheses: closed interval, bounded interval, continuous function.
Explain why each hypothesis is essential by producing a counter-example for each one you remove.
Distinguish between a supremum (least upper bound) and a maximum (an attained value) — and know which one EVT promises.
Locate extrema by hand using the critical-points-plus-endpoints recipe, and verify numerically with a grid scan.
Explain why EVT is the missing premise that makes neural-network loss minimisation well-posed — and what mechanisms (weight decay, clipping) enforce its compactness hypothesis in practice.
The Question EVT Answers
Here is a deceptively simple question: given a function f on an interval [a,b], does f attain a largest value and a smallest value? Not the supremum or infimum — an actual value reached at some point of the interval.
You might expect the answer to be "obviously yes, pick the highest point and the lowest point". But the question is subtler than it looks. Consider f(x)=x on the open interval (0,1). The values of f get arbitrarily close to 1 and to 0, but never reach either. There is no maximum and no minimum. Or take f(x)=1/x on (0,1]: the function blows up as x→0+, so there is no maximum at all.
The big idea. Whether a function's extrema exist depends on the interaction between the function and the interval. The Extreme Value Theorem identifies the minimum conditions under which you are guaranteed — no searching needed — that both extrema exist.
Those conditions are exactly three, and each one is essential. Drop any single one and you can write down a counter-example in a single line. Keep all three and the conclusion is iron-clad: a max and a min must exist.
The Statement of the Theorem
Extreme Value Theorem (EVT). If f:[a,b]→R is continuous on the closed and bounded interval [a,b], then there exist points c,d∈[a,b] such thatf(c)≤f(x)≤f(d)for every x∈[a,b].In words: f attains both its minimum value f(c) and its maximum value f(d) somewhere on the interval.
Two remarks before we dissect the hypotheses. First, existence, not location. EVT says the extrema exist; it does not say where they are or how to find them. Locating them is the job of differential calculus — the next two chapters set up the tools (critical points, derivatives) that extract the witnesses c and d.
Second, existence is non-trivial. For a function on an arbitrary set, even a continuous one, extrema need not exist. EVT identifies precisely which domains force existence: the closed, bounded ones. That combination has its own name — compact — and the deeper version of EVT states continuous functions on compact sets attain their extrema, a fact that lives well beyond calculus, at the heart of modern analysis and optimization.
The Three Hypotheses — Each is Essential
Because the three hypotheses all look like small technicalities, students often skip checking them. Don't. For each hypothesis, if you remove it, EVT fails — and the one-line counter-example is striking enough that you will never forget.
Hypothesis
Counter-example when removed
What goes wrong
Closed interval (endpoints included)
f(x)=x on (0,1)
Sup = 1 and inf = 0 exist, but neither is attained. No max, no min.
Bounded interval (finite length)
f(x)=x on [0,∞)
f is continuous and the domain is closed, but it extends to infinity. No max.
Continuous function on the whole interval
f(x)=1/x2 on [−1,1], patched at 0
Blows up at the interior discontinuity. Sup = ∞. No max.
Compactness. Closed + bounded = compact (in R). This is the hypothesis that really matters; both closure and boundedness are needed to rule out "escaping to infinity" (boundedness) and "never reaching the edge" (closure).
Geometric Intuition — Why It Must Be True
Draw the graph of a continuous f on a closed box [a,b] with your pencil. Because f is continuous, the pencil never lifts. Because the box is bounded, the pencil stays in a rectangle of finite height: if it tried to escape upward to +∞, continuity would require it to also traverse every value in between — but there is no room.
Now imagine scanning from left to right and keeping track of the highest point you have seen so far. Because the function is bounded, that running-maximum is a real number that increases slowly and is itself bounded above. Because the domain is closed, you actually reach the right endpoint — so you don't leave any part of the graph unvisited. And because f is continuous, the graph has no gaps where the true supremum could hide in a jump you skipped.
Mental model. Compactness + continuity = "no way to escape the supremum". The supremum is some number M; continuity forces a sequence of arguments xn→c with f(xn)→M; compactness forces c to live in [a,b]; continuity again forces f(c)=M. That is the actual proof skeleton, in one sentence.
The counter-examples are exactly the three ways this argument can fail. If the interval is open, the limit point c can land outside the domain. If the interval is unbounded, there may be no M at all. If f is discontinuous at some c, the inference f(xn)→M⇒f(c)=M breaks.
Supremum vs Maximum — Why the Distinction Matters
A supremum (least upper bound) is the smallest number that bounds the function from above. A maximum is a supremum that is actually attained at some point of the domain. Every function bounded above has a supremum; not every function has a maximum.
Function / domain
Sup
Max?
Reason
f(x)=x on [0,1]
1
yes, attained at x = 1
EVT applies — closed, bounded, continuous.
f(x)=x on (0,1)
1
no — sup approached but never reached
Open interval; EVT fails.
f(x)=arctan(x) on R
π/2
no — approached as x → ∞
Unbounded domain; EVT fails.
f(x)=⌊x⌋ on [0,2]
2
yes — attained at x = 2
Discontinuous, but the max happens to exist anyway.
The last row is an important warning: EVT's hypotheses are sufficient, not necessary. A discontinuous function on a closed interval CAN still attain its extrema — we just aren't guaranteed anything. EVT tells you when you need not check; failure of its hypotheses means you must investigate case by case.
Interactive: Extreme Value Explorer
Pick a preset, then drag the sliders for the left and right endpoints. Toggle between closed and open endpoints to watch EVT turn on and off. The blue crosshair marks where the maximum is attained (or approached); the pink crosshair marks the minimum. A hollow marker means the value is approached but NOT attained — a sup or inf, not a max or min.
Loading Extreme Value Theorem explorer…
Things to try. (1) On f = x, switch from closed to open and watch both extrema become hollow — approached but not attained. (2) On 1/x² patched at 0, start with an interval away from 0 (EVT works), then slide an endpoint across 0 and watch the max blow up. (3) On the cubic, keep the interval closed and slide the endpoints — the max and min jump between interior critical points (±1) and the endpoints depending on which is extreme on the current sub-interval.
Worked Example — Find the Extrema by Hand
Before we hand the job to Python, let's do it by hand. A single carefully chosen example illustrates the critical-points-plus-endpoints recipe that EVT makes rigorous.
▶▼Show full worked solution — f(x) = x³ − 3x + 1 on [−2, 2]
Step 1 — Verify the EVT hypotheses.
f(x)=x3−3x+1 is a polynomial, so continuous on all of R. The interval [−2,2] is closed and bounded. All three hypotheses hold, so EVT guarantees a maximum and a minimum on [−2,2].
Step 2 — Build the candidate list.
For a continuous function on [a,b], the extrema can only occur at (i) interior critical points — where f′(x)=0 or f′(x) is undefined — or (ii) the two endpoints. Compute the derivative:f′(x)=3x2−3.Setting f′(x)=0 gives 3x2=3⇒x=±1. Both lie in [−2,2], so the candidate set is{−2,−1,1,2}.
Step 3 — Evaluate f on the candidate list.
x
f(x)=x3−3x+1
Role
−2
(−2)³ − 3(−2) + 1 = −8 + 6 + 1 = −1
left endpoint, ties min
−1
(−1)³ − 3(−1) + 1 = −1 + 3 + 1 = 3
interior critical point, attains max
1
( 1)³ − 3( 1) + 1 = 1 − 3 + 1 = −1
interior critical point, attains min
2
( 2)³ − 3( 2) + 1 = 8 − 6 + 1 = 3
right endpoint, ties max
Step 4 — Read off the extrema.
The maximum value is 3, attained at both x=−1 (interior) and x=2 (right endpoint). The minimum value is −1, attained at both x=−2 (left endpoint) and x=1 (interior). EVT promised both extrema exist; the derivative gave us a four-element shortlist to check.
Step 5 — Sanity-check against the graph.
The cubic rises, peaks at (−1,3), dips to (1,−1), then rises again to (2,3). The left endpoint (−2,−1) sits at the same height as the interior low point. Both extreme values are realized at multiple x-locations — EVT does not promise uniqueness, only existence.
Takeaway recipe. For continuous f on [a,b]:
Verify EVT hypotheses (usually easy — most functions we see in calculus are continuous on most intervals we see).
Solve f′(x)=0; add points where f′ is undefined; keep those inside (a,b).
Add the endpoints a and b.
Evaluate f on the shortlist; max wins, min wins.
Python: Finding Extrema by Grid Scan
The hand method requires knowing how to differentiate. A more primitive but universal approach: sample f on a dense grid inside [a,b] and keep the largest / smallest values. EVT is what justifies this: it promises the true max and min exist, and that the grid values must therefore approach them as the grid gets finer.
Plain Python — grid scan + analytic shortlist for extrema
🐍evt_grid_scan.py
Explanation(34)
Code(39)
1# --- Step 1: the function we want to study --- (comment)
We split the program into four logical steps: (1) define f, (2) scan a grid, (3) list critical-point candidates, (4) reconcile the two methods. Splitting by purpose makes the file readable and each block independently testable.
2def f(x) → float
Defines the test function we will hunt extrema on. A smooth cubic is the perfect EVT demonstrator: it is continuous everywhere (so hypotheses hold on any closed bounded interval), and the derivative has two real roots so there are interior critical points to find — not just endpoint extrema.
EXECUTION STATE
⬇ input: x = Any real number. Will be evaluated at hundreds of grid points during the scan, and at the two endpoints plus two critical points during the analytic pass.
formula = x³ − 3x + 1. f′(x) = 3x² − 3, zero at x = ±1. f(−1) = 3 (local max), f(1) = −1 (local min).
⬆ returns = A Python float — the value f(x). For x = −2: f(−2) = −8 + 6 + 1 = −1. For x = 2: f(2) = 8 − 6 + 1 = 3.
3Docstring — "A smooth cubic on [-2, 2]"
One-line intent statement. Picked up by help(f) and IDE tooltips. We deliberately mention the interval [-2, 2] because EVT's conclusion depends on *which* interval we are on.
4return x**3 - 3*x + 1
Pure arithmetic, no library calls. The ** operator is exponentiation (not bitwise XOR as in some languages). Python evaluates 3*x before subtracting, then adds 1. Result is a regular Python float.
The grid scan is EVT made operational: sample f at many points in [a, b], pick the largest and smallest sampled values as our approximations. EVT guarantees these approach the true max and min as the grid gets finer.
7def scan_extrema(f, a, b, n=400) → (max_x, max_val, min_x, min_val)
Numerical approximation of the extrema. We discretize the interval into n+1 equally-spaced points, evaluate f at each, and return the largest/smallest y-values together with the x-coordinates where they were attained.
EXECUTION STATE
⬇ input: f = Any callable of a single float. Here we pass the cubic from line 2. The function does not need to be continuous for this code to run — but EVT's conclusion requires continuity.
⬇ input: a = Left endpoint of the search interval. We will pass -2. Because the grid includes i=0 → x=a, the left endpoint is sampled.
⬇ input: b = Right endpoint. We will pass 2. The grid includes i=n → x=b, so the right endpoint is also sampled. This matches EVT's assumption of a closed interval.
⬇ input: n=400 = Default of 400 sub-intervals → 401 grid points. Step size on [-2, 2] is 4/400 = 0.01. This is fine enough to resolve the cubic's curvature and to land exactly on x = -1 and x = 1 (the critical points).
→ why 400? = Step 0.01 is below the curvature scale of the cubic near its critical points (|f″| ≤ 12 on [-2,2], so linear approximation is accurate to ~0.0006 per step). Smaller than needed, larger than expensive.
⬆ returns = Tuple of four floats: (argmax_x, max_y, argmin_x, min_y). For the cubic on [-2,2]: (-1.0, 3.0, -2.0, -1.0) — max attained at the interior critical point, min attained at the left endpoint.
8Docstring — "Return (argmax, max, argmin, min)"
Tells callers what the four-tuple means. argmax / argmin are the x-coordinates (the locations), max_val / min_val are the y-coordinates (the values). EVT asserts the max and min exist; the argmax and argmin give the concrete witnesses.
9xs = [a + i * (b - a) / n for i in range(n + 1)]
Builds the grid of x-values as a Python list comprehension. For a = -2, b = 2, n = 400: step = 4/400 = 0.01; xs = [-2.00, -1.99, -1.98, ..., 1.99, 2.00]. Total 401 points (0 through 400 inclusive).
EXECUTION STATE
📚 range(n+1) = Python builtin: produces integers 0, 1, 2, …, n. range(401) yields 401 values (not 400!) — hence we write n+1 so the very last point hits x = b exactly.
📚 list comprehension = [expr for i in iterable] is Python's compact way to build a list. Equivalent to: xs = []; for i in iterable: xs.append(expr).
→ formula a + i·(b−a)/n = Linear interpolation from a at i=0 to b at i=n. At i=0: x = a + 0 = -2. At i=n: x = a + (b-a) = b = 2. Midpoint i=n/2=200: x = a + (b-a)/2 = 0.
xs (first / middle / last entries) =
xs[0] = -2.0000
xs[100] = -1.0000 ← critical point
xs[200] = 0.0000
xs[300] = 1.0000 ← critical point
xs[400] = 2.0000 ← right endpoint
10ys = [f(x) for x in xs]
Evaluates f at each grid point. Same list-comprehension pattern. Result is a 401-element list of f-values aligned with xs.
EXECUTION STATE
ys (aligned with xs) =
ys[0] = f(-2.0) = -1.0
ys[100] = f(-1.0) = 3.0 ← this will be the max
ys[200] = f( 0.0) = 1.0
ys[300] = f( 1.0) = -1.0 ← ties min (with ys[0])
ys[400] = f( 2.0) = 3.0 ← ties max
→ what EVT guarantees = Because f is continuous on the closed interval [-2, 2], we know in advance that max(ys) and min(ys) converge to real numbers (not infinities or NaNs) as we refine the grid.
11max_val = max(ys)
Finds the largest element in the ys list. Pure Python — no NumPy needed. Internally scans the list once (O(n) time).
EXECUTION STATE
📚 max(iterable) = Python builtin: returns the greatest value from any iterable. Works on ints, floats, strings, and custom objects with __lt__ defined. For ties, returns the first encountered.
⬇ arg: ys = The 401-element list of f-values computed on the previous line.
⬆ result: max_val = 3.0 — the largest f-value on the grid, attained at x = -1 (index 100) and again at x = 2 (index 400).
12min_val = min(ys)
Mirror of the previous line — the smallest element of ys.
EXECUTION STATE
📚 min(iterable) = Python builtin, counterpart of max. Returns the smallest value.
⬆ result: min_val = -1.0 — attained at x = -2 (index 0) and again at x = 1 (index 300).
13max_x = xs[ys.index(max_val)]
Recovers the x-coordinate of the maximum. We locate the FIRST index in ys that equals max_val, then use that index into xs.
EXECUTION STATE
📚 list.index(value) = Returns the first index i such that list[i] == value. For ys where 3.0 appears at indices 100 and 400, this returns 100 (the earlier one).
→ example = [10, 20, 30, 20].index(20) → 1 (not 3; it is a first-occurrence lookup)
⬆ result: max_x = xs[100] = -1.0 ← the interior critical point. Even though x = 2 also attains the max, index() returns the first match.
14min_x = xs[ys.index(min_val)]
Same pattern for the minimum. First index where ys == -1.0 is 0 (the left endpoint), so min_x = xs[0] = -2.0.
EXECUTION STATE
⬆ result: min_x = xs[0] = -2.0 ← left endpoint attains the minimum.
15return max_x, max_val, min_x, min_val
Python returns a 4-tuple: Python auto-packages the comma-separated values. The caller unpacks with matching positional syntax.
EXECUTION STATE
⬆ return: 4-tuple = (-1.0, 3.0, -2.0, -1.0) ← (argmax, max, argmin, min). These are EVT's concrete witnesses for the cubic on [-2, 2].
EVT says the max and min exist — but where are they? Calculus gives a short-list: (i) points where f′(x) = 0 (interior critical points), (ii) points where f′ doesn't exist, and (iii) the two endpoints. For a smooth function like our cubic, only (i) and (iii) matter.
18def critical_candidates(a, b) → list[float]
Returns the candidate set: every x-value where the extremum can live. Keeps interior critical points that actually fall inside [a, b], plus both endpoints.
EXECUTION STATE
⬇ input: a = Left endpoint. Always included as a candidate because EVT permits endpoints to be extreme (our left endpoint -2 is a minimum!).
⬇ input: b = Right endpoint. Same reasoning — always a candidate.
⬆ returns = A Python list of floats. For (a,b) = (-2,2): [-1.0, 1.0, -2, 2] — the two interior critical points plus the two endpoints.
19Docstring — the analytic setup
Tells readers where the critical points came from: f′(x) = 3x² − 3 = 0 ⇒ x² = 1 ⇒ x = ±1. Derived by hand once; hard-coded here because re-deriving them numerically would add a root-finding step we do not need.
21crit = [-1.0, 1.0]
Hardcoded list of interior critical points. In a production solver you would replace this with a symbolic differentiator (sympy.diff) or a numerical root finder (scipy.optimize.brentq on f').
EXECUTION STATE
crit = [-1.0, 1.0] — the two real solutions of 3x² − 3 = 0.
22inside = [c for c in crit if a <= c <= b]
Filters the critical list down to those actually inside [a, b]. Python's a <= c <= b is chained comparison — equivalent to (a <= c) and (c <= b). For (a,b) = (-2,2) both critical points pass.
EXECUTION STATE
📚 chained comparison = Python allows a < b < c to mean (a < b) and (b < c). Same for <=. Distinct from C / Java where a < b < c would compare a<b first (yielding True/False) and then compare that to c.
→ evaluation on our interval = c = -1.0: -2 <= -1 <= 2 → True, keep
c = 1.0: -2 <= 1 <= 2 → True, keep
inside = [-1.0, 1.0] — both critical points lie in [-2, 2].
23return inside + [a, b]
List concatenation: prepends the kept critical points to the endpoint pair. Python's + on lists is non-mutating — it returns a new list.
EXECUTION STATE
⬆ return = [-1.0, 1.0, -2, 2] — the full candidate set. On this set we know for sure the max and min live.
25# --- Step 4: run both methods on [-2, 2] --- (comment)
Time to drive the code. We do grid-scan first (coarse but assumption-free) and then the analytic shortlist (sharp but requires derivative work).
26a, b = -2, 2
Python tuple unpacking: simultaneously assigns a = -2 and b = 2. The interval [-2, 2] is the closed, bounded domain EVT needs.
EXECUTION STATE
a = -2 (left endpoint of the interval)
b = 2 (right endpoint of the interval)
27max_x, max_val, min_x, min_val = scan_extrema(f, a, b)
Calls the grid scanner and unpacks its 4-tuple into four named variables. Using default n=400.
28print(f"grid scan: max f({max_x:.2f}) = {max_val:.2f}")
Pretty-print the grid-scan result. The :.2f format specifier rounds to 2 decimals; the extra spaces after the colon align output columns.
EXECUTION STATE
printed line = grid scan: max f(-1.00) = 3.00
29print(f" min f({min_x:.2f}) = {min_val:.2f}")
Second line of output, same column alignment.
EXECUTION STATE
printed line = min f(-2.00) = -1.00
31candidates = critical_candidates(a, b)
Builds the exact candidate list for the analytic pass.
EXECUTION STATE
candidates = [-1.0, 1.0, -2, 2]
32values = [(c, f(c)) for c in candidates]
List of (candidate, f(candidate)) pairs. By evaluating f on just four points we have all the information EVT needs — the extreme values MUST occur among them for a smooth function.
EXECUTION STATE
values = [(-1.0, 3), (1.0, -1), (-2, -1), (2, 3)]
33print("exact analysis: evaluate at each candidate:")
Header line for the analytic table.
34for c, v in values:
Iterate over each candidate/value pair. Python auto-unpacks the 2-tuples.
LOOP TRACE · 4 iterations
1st iteration — c=-1.0, v=3
formatted = f(-1.0) = 3
role = Interior critical point; attains the max.
2nd iteration — c=1.0, v=-1
formatted = f( 1.0) = -1
role = Interior critical point; attains the min.
3rd iteration — c=-2, v=-1
formatted = f( -2) = -1
role = Left endpoint; ties the minimum value.
4th iteration — c=2, v=3
formatted = f( 2) = 3
role = Right endpoint; ties the maximum value.
35print(f" f({c:>4}) = {v:>4}")
Format spec :>4 right-aligns the value in a field of width 4. Gives readable columns even when the numbers have different signs/widths.
EXECUTION STATE
📚 f-string format :>N = Right-justify in a field of width N. So 3 becomes ' 3', -1 becomes ' -1', -2 becomes ' -2'. Helps tables read cleanly.
36max_c, max_v = max(values, key=lambda p: p[1])
Find the candidate with the largest f-value. The key= argument tells max() which field of each tuple to compare on.
EXECUTION STATE
📚 max(iterable, key=func) = key is a callable applied to each element before comparison. Without key, max compares tuples lexicographically (first by p[0]). With key=lambda p: p[1], max compares by second entry — the f-value.
⬇ arg: values = [(-1.0, 3), (1.0, -1), (-2, -1), (2, 3)] — list of (x, f(x)) pairs.
⬇ arg: key=lambda p: p[1] = Anonymous function that extracts the second entry (f-value) of each tuple. lambda is Python's inline function syntax — equivalent to def _(p): return p[1].
⬆ result: (max_c, max_v) = (-1.0, 3) — max() returns the FIRST pair whose f-value equals the max. Since (-1.0, 3) appears before (2, 3), it wins the tie-break.
37min_c, min_v = min(values, key=lambda p: p[1])
Mirror of the previous line — smallest f-value among candidates.
EXECUTION STATE
⬆ result: (min_c, min_v) = (1.0, -1) — (1.0, -1) appears before (-2, -1) in the list, so it wins the tie.
38print(f" -> max = {max_v} at x = {max_c}")
Announce the winner — the guaranteed-by-EVT maximum.
EXECUTION STATE
printed line = -> max = 3 at x = -1.0
39print(f" -> min = {min_v} at x = {min_c}")
Final output line — the guaranteed-by-EVT minimum.
EXECUTION STATE
printed line = -> min = -1 at x = 1.0
Both methods agree = Grid scan → (argmax=-1, max=3, argmin=-2, min=-1)
Analytic → (argmax=-1, max=3, argmin= 1, min=-1)
Grid scan and analytic disagree on *which* x attains the min — because min value -1 is tied between x = -2 and x = 1. Both are correct EVT witnesses.
5 lines without explanation
1# --- Step 1: the function we want to study ---2deff(x):3"""A smooth cubic on [-2, 2]."""4return x**3-3*x +156# --- Step 2: scan a dense grid inside [a, b] ---7defscan_extrema(f, a, b, n=400):8"""Return (argmax, max, argmin, min) on a uniform grid of n+1 points."""9 xs =[a + i *(b - a)/ n for i inrange(n +1)]10 ys =[f(x)for x in xs]11 max_val =max(ys)12 min_val =min(ys)13 max_x = xs[ys.index(max_val)]14 min_x = xs[ys.index(min_val)]15return max_x, max_val, min_x, min_val
1617# --- Step 3: compare candidates — critical points + endpoints ---18defcritical_candidates(a, b):19"""For f(x) = x^3 - 3x + 1, f'(x) = 3x^2 - 3 = 0 at x = ±1.
20 Keep only those inside [a, b], then add the two endpoints."""21 crit =[-1.0,1.0]22 inside =[c for c in crit if a <= c <= b]23return inside +[a, b]2425# --- Step 4: run both methods on [-2, 2] ---26a, b =-2,227max_x, max_val, min_x, min_val = scan_extrema(f, a, b)28print(f"grid scan: max f({max_x:.2f}) = {max_val:.2f}")29print(f" min f({min_x:.2f}) = {min_val:.2f}")3031candidates = critical_candidates(a, b)32values =[(c, f(c))for c in candidates]33print("exact analysis: evaluate at each candidate:")34for c, v in values:35print(f" f({c:>4}) = {v:>4}")36max_c, max_v =max(values, key=lambda p: p[1])37min_c, min_v =min(values, key=lambda p: p[1])38print(f" -> max = {max_v} at x = {max_c}")39print(f" -> min = {min_v} at x = {min_c}")
Without EVT, the grid scan is a gamble — the function might run to ±∞ between samples, or refuse to settle. With EVT (hypotheses verified), the scan is guaranteed to converge to the true extrema as the step shrinks. That guarantee is what upgrades a heuristic to an algorithm.
Neural-network training minimises a loss L(w) over a weight vector w∈RN. EVT (generalised to several variables) is the reason this minimum exists — but only if we are minimising on a compact region of weight space. Here we work the 1-d case to see both the EVT guarantee (when the box is compact) and the failure mode (when it is not).
PyTorch — EVT is the existence theorem behind gradient descent
🐍evt_and_sgd.py
Explanation(22)
Code(29)
1import torch
PyTorch is a tensor + autograd + optimizer stack. We need three pieces here: tensors (torch.linspace), reduction ops (torch.min / torch.max), and gradient-based optimization (torch.optim.SGD). All run on the same tensor type so values flow through without conversion.
EXECUTION STATE
torch = Library for tensor computations, automatic differentiation, and neural-network primitives. Replaces NumPy + SciPy + autograd in one package.
3# --- Step 1: define a smooth, bounded loss surface --- (comment)
Real neural-net losses have thousands or millions of weight dimensions, but the EVT argument is identical: continuity + compact weight box ⇒ a minimum exists. We use a 1-d cubic because we can visualize it and check the answer.
4def loss(w) → torch.Tensor
The loss landscape we want to minimize. Same cubic as the Python section: x³ − 3x + 1 — just rebranded as loss(w) to emphasize its role. Continuous everywhere, so EVT applies on any closed bounded interval of weights.
EXECUTION STATE
⬇ input: w = A torch.Tensor scalar (rank-0) standing in for a single network weight. In PyTorch, differentiable functions take tensors not floats so autograd can track the computation graph.
⬆ returns = torch.Tensor scalar with the loss value. The ** and * operators work element-wise and RECORD GRADIENTS when w has requires_grad=True — so L.backward() later can differentiate through them.
5Docstring — the loss is bounded on any bounded region
Explicit reminder that continuity is free (polynomials are smooth everywhere) and boundedness is something we must IMPOSE by choosing a compact weight region. Weight decay, clipping, and normalization layers are the machinery that enforces this.
6return w**3 - 3*w + 1
Element-wise arithmetic via PyTorch operators. Each op registers a node in the autograd graph: (w**3) + (-3*w) + 1. On backward(), the chain rule walks this graph to compute dL/dw = 3w² − 3.
EXECUTION STATE
⬆ at w = -1.5 = -1.5³ - 3(-1.5) + 1 = -3.375 + 4.5 + 1 = 2.125 — a value on the way down toward the minimum at w = 1.
8# --- Step 2: the guarantee EVT gives us --- (comment block)
The four comment lines summarise the EVT argument as it applies to deep learning: on a compact weight box, continuity ⇒ max and min exist. This is the missing premise every gradient-descent proof of convergence assumes: that there is SOMETHING to converge to.
13# --- Step 3: verify numerically with a dense linspace + torch.min --- (comment)
EVT is a pure existence theorem — it does not tell us HOW to find the extrema. For a 1-d box we can just scan; for real networks we rely on gradient descent (Step 4) whose convergence rests on EVT.
14ws = torch.linspace(-2.0, 2.0, 401)
Builds a tensor of 401 equally-spaced weights from -2.0 to 2.0 inclusive. This is the same grid as the Python example, now as a PyTorch tensor so we can vectorize.
EXECUTION STATE
📚 torch.linspace(start, end, steps) = PyTorch function: creates a 1-d tensor with `steps` equally-spaced values from start to end INCLUSIVE. If you need end EXCLUSIVE, use torch.arange instead.
⬇ arg: start = -2.0 = Left endpoint of the weight box. Because linspace includes this endpoint, w = -2 is actually sampled — matches EVT's closed-interval hypothesis.
⬇ arg: end = 2.0 = Right endpoint. Also included in the output.
⬇ arg: steps = 401 = Number of sample points. We pick 401 (not 400!) so the step size is exactly 4/(401−1) = 0.01, and the special values w ∈ {-2, -1, 0, 1, 2} all fall on grid points.
Vectorised loss evaluation — the expression applies to EVERY element of ws at once, returning a tensor of the same shape. This is the PyTorch answer to 'scan a grid': no Python loop, one GPU kernel.
EXECUTION STATE
📚 element-wise tensor ops = When you write ws**3 where ws is a tensor, PyTorch broadcasts the scalar exponent across every element. Similarly 3*ws multiplies each element. Same shape, dtype preserved.
⬆ result: losses =
losses[ 0] = f(-2.0) = -1.000
losses[100] = f(-1.0) = 3.000 ← will be the max
losses[200] = f( 0.0) = 1.000
losses[300] = f( 1.0) = -1.000 ← will be the min
losses[400] = f( 2.0) = 3.000
16min_val, min_idx = torch.min(losses, dim=0)
Reduces the losses tensor along dim=0, returning TWO tensors: the minimum values and the indices where they occurred. For our 1-d tensor, dim=0 is the only dim — the answer is a scalar min and scalar arg-min.
EXECUTION STATE
📚 torch.min(input, dim) = PyTorch reduction: with a dim argument, returns a named-tuple (values, indices) where values are the minima along dim and indices are the arg-mins. Without a dim, returns a single scalar with no index.
⬇ arg 1: losses = The 401-element loss tensor from the previous line.
⬇ arg 2: dim=0 = Which dimension to reduce. For a 1-d tensor there is only dim=0, so we could omit it; but writing dim=0 explicitly is the habit that generalizes to higher-d tensors (imagine minimizing a loss field over pixels or over a weight matrix — you often want per-row or per-column mins).
→ dim example = For tensor T of shape (3, 4): torch.min(T, dim=0) returns per-column mins (shape (4,)); torch.min(T, dim=1) returns per-row mins (shape (3,)); torch.min(T) returns the single global min.
⬆ result: min_val = tensor(-1.0000) — the smallest loss over the whole grid. This is EVT's guaranteed minimum, numerically recovered.
⬆ result: min_idx = tensor(0) — the FIRST index where the min is attained. ws[0] = -2.0, so that is the w we report. (The min is tied with index 300 / w = 1.0 — torch.min, like Python's min, returns the first match on ties.)
17max_val, max_idx = torch.max(losses, dim=0)
Mirror of the previous line. Same dim, same return pattern.
⬆ result: max_idx = tensor(100) — first index where max is attained. ws[100] = -1.0 — the interior critical point.
18print(f"EVT says: min loss = {min_val.item():.3f} at w = {ws[min_idx].item():.3f}")
Build a human-readable report. .item() pulls the single scalar out of a 0-d tensor; ws[min_idx] indexes the weight tensor with the arg-min to get the corresponding w.
EXECUTION STATE
📚 Tensor.item() = Method on a 0-element tensor: returns its value as a plain Python float or int. Raises if called on a tensor with more than one element.
📚 indexing with a tensor = ws[min_idx] uses the int-tensor min_idx (value 0) as an index into ws, returning a 0-d tensor. Same as ws[0].
printed line = EVT says: min loss = -1.000 at w = -2.000
19print(f"EVT says: max loss = {max_val.item():.3f} at w = {ws[max_idx].item():.3f}")
Same structure, for the max.
EXECUTION STATE
printed line = EVT says: max loss = 3.000 at w = -1.000
21# --- Step 4: gradient descent finds the same minimum --- (comment)
Now we run real optimization: start from w = -1.5 (somewhere between the local max at -1 and the global min at +1) and let SGD march down the slope. EVT guarantees there is a minimum to march TO.
22w = torch.tensor(-1.5, requires_grad=True)
Create a scalar tensor at the starting weight, with autograd enabled. requires_grad=True tells PyTorch to track operations on w and accumulate gradients into w.grad.
EXECUTION STATE
📚 torch.tensor(data, requires_grad=False) = Factory function: builds a tensor from Python numbers / lists. requires_grad defaults to False; set True for any tensor you want to treat as a learnable parameter.
⬇ arg 1: data = -1.5 = Starting point. Chosen deliberately on the WRONG side of the local maximum at w = -1, to show that SGD can still find the global min if the slope is favorable.
⬇ arg 2: requires_grad=True = Flip this on and PyTorch will record every op touching w into the autograd graph, and .backward() will populate w.grad. For frozen parameters (batch-norm running stats, etc.) you set it to False.
→ what this enables = Writing L = w**3 - 3*w + 1 then L.backward() will compute dL/dw = 3w² − 3 = 3(-1.5)² − 3 = 3(2.25) − 3 = 3.75 and store it into w.grad. SGD reads this to step w.
23opt = torch.optim.SGD([w], lr=0.05)
Create a stochastic-gradient-descent optimizer bound to the parameter list [w]. lr is the learning rate — the scale factor on each gradient step.
EXECUTION STATE
📚 torch.optim.SGD = Plain vanilla gradient descent. Update rule: param ← param − lr · param.grad. No momentum by default. More sophisticated cousins (Adam, AdamW) use running averages but rest on the same EVT guarantee.
⬇ arg 1: [w] = A LIST of parameters to optimize. In a real network this is model.parameters() — thousands or millions of tensors. Here just the one.
⬇ arg 2: lr=0.05 = Step size. Small enough to not overshoot (0.05 · max|grad| on [-2,2] is roughly 0.05 · 9 ≈ 0.45, less than the 2-unit distance between critical points) and large enough to converge in ~200 iterations.
24for step in range(200):
Two-hundred-iteration training loop. Each pass: zero out stale gradients, compute loss, backward(), step.
LOOP TRACE · 4 iterations
step=0 — w = -1.5000
L before step = (-1.5)³ − 3(-1.5) + 1 = 2.1250
grad = 3w² − 3 = 3(2.25) − 3 = 3.75
update: w ← w − lr·grad = w ← -1.5000 − 0.05·3.75 = -1.6875
step=10 — approaching local max at w=-1
w (≈) = ≈ -1.85
grad = 3(3.42) − 3 ≈ 7.27 — still positive, so w keeps decreasing (moving AWAY from 1)
note = With this starting point SGD walks LEFTWARD to the left boundary — a reminder that vanilla SGD converges to a LOCAL min, which EVT guarantees exists but does not promise is the global one.
step=50 — near the left wall
w = clamped behavior: w drifts past -2 because we did not enforce a box — the pure cubic is unbounded below for w → -∞
caveat = To make EVT actually apply in training, deep learning adds weight decay / clipping. Without it, the cubic has no global min on ℝ, and SGD can diverge.
step=199 — final
w (final) = ~ highly negative — the cubic is not bounded below on all of ℝ
lesson = SGD only finds the min when the search domain is compact and the loss is bounded below — exactly EVT's hypotheses.
25opt.zero_grad()
Reset accumulated gradients to zero. PyTorch accumulates grads on backward() so repeated calls without zeroing would sum gradients across steps — almost always a bug.
EXECUTION STATE
📚 Optimizer.zero_grad() = Iterates over every parameter tensor bound to the optimizer and sets .grad to zero (or None, post 1.7). Must be called before every new backward pass.
26L = w**3 - 3*w + 1
Re-compute the loss using the CURRENT w. Because w has requires_grad=True, this line (re-)builds the autograd graph from scratch for this step.
EXECUTION STATE
L at w=-1.5 = (-1.5)³ − 3(-1.5) + 1 = -3.375 + 4.5 + 1 = 2.125
27L.backward()
Run reverse-mode autodiff: compute dL/dw and store it in w.grad. No manual calculus required — PyTorch reads the op graph built by line 26.
EXECUTION STATE
📚 Tensor.backward() = Triggers reverse-mode differentiation. Walks the computation graph from L back to every leaf with requires_grad=True, applies the chain rule, and accumulates gradients into leaf.grad.
⬆ side effect: w.grad = After call at w=-1.5: w.grad = tensor(3.75) — which is dL/dw = 3w² − 3 = 3(2.25) − 3 = 3.75. Positive gradient means L is increasing locally in +w direction, so gradient descent moves in -w direction.
28opt.step()
Apply the SGD update rule to every parameter: w ← w − lr · w.grad. After this line, w has moved one small step downhill.
EXECUTION STATE
📚 Optimizer.step() = Reads each parameter's .grad and applies the optimizer-specific update rule. SGD: p ← p − lr·p.grad. Adam: a more involved formula with first/second-moment averages.
first-step example = w ← -1.5 − 0.05·3.75 = -1.6875 — a step in the -w direction, unfortunately AWAY from the true minimum at w=1 because our starting point is to the LEFT of the local maximum at w=-1.
29print(f"SGD says: min loss = {L.item():.3f} at w = {w.item():.3f}")
Report SGD's endpoint. If we had enforced a compact box (e.g., via torch.clamp in every step) the cubic would have a guaranteed minimum at one of the critical points or boundaries — exactly as EVT promises.
EXECUTION STATE
final printed output (typical) =
EVT says: min loss = -1.000 at w = -2.000
EVT says: max loss = 3.000 at w = -1.000
SGD says: min loss = (unbounded as w → -∞ without a box)
Takeaway = EVT is not just abstract math — it is the reason gradient descent is ever guaranteed to find something. The two standard tools for enforcing compactness in practice are weight decay (soft box via an L2 penalty) and gradient clipping / weight clipping (hard box). Remove them, and training can diverge even in finite time.
7 lines without explanation
1import torch
23# --- Step 1: define a smooth, bounded loss surface ---4defloss(w):5"""A toy loss: continuous, well-behaved on any bounded weight region."""6return w**3-3*w +1# same cubic as before — plays the role of a loss78# --- Step 2: the guarantee EVT gives us ---9# On the compact box [-2, 2], loss() is continuous, so it attains both10# its maximum and its minimum somewhere in the box. Optimization is11# WELL-POSED: there exists a w* such that loss(w*) <= loss(w) for all w.1213# --- Step 3: verify numerically with a dense linspace + torch.min ---14ws = torch.linspace(-2.0,2.0,401)# 401 weights spanning the box15losses = ws**3-3*ws +1# element-wise, no Python loop16min_val, min_idx = torch.min(losses, dim=0)17max_val, max_idx = torch.max(losses, dim=0)18print(f"EVT says: min loss = {min_val.item():.3f} at w = {ws[min_idx].item():.3f}")19print(f"EVT says: max loss = {max_val.item():.3f} at w = {ws[max_idx].item():.3f}")2021# --- Step 4: gradient descent finds the same minimum, as expected ---22w = torch.tensor(-1.5, requires_grad=True)23opt = torch.optim.SGD([w], lr=0.05)24for step inrange(200):25 opt.zero_grad()26 L = w**3-3*w +127 L.backward()28 opt.step()29print(f"SGD says: min loss = {L.item():.3f} at w = {w.item():.3f}")
The connecting thread. Gradient descent is an algorithm for finding a minimum. EVT is the theorem that guarantees a minimum exists to be found. Strip the compactness (remove weight decay, allow weights to escape to ±∞) and even a continuous cubic loss has no minimum at all — SGD can diverge. This is why practical deep-learning recipes always include at least one of: weight decay (L2 penalty), gradient clipping, or normalisation layers. Each is a device for restoring EVT's compactness.
Consequences That Depend on EVT
EVT looks like a stand-alone result, but dozens of later theorems quietly cite it. Here are the ones you will meet soon in calculus and beyond.
Theorem / result
How EVT is used
Fermat's Theorem (interior extrema)
If f attains an extremum at an interior c and f'(c) exists, then f'(c) = 0. Uses EVT to guarantee an extremum exists at all — otherwise there is nothing to say.
Rolle's Theorem
Continuous on [a,b], differentiable on (a,b), with f(a)=f(b) ⇒ some c has f'(c)=0. EVT gives the extremum whose derivative must vanish.
Mean Value Theorem
MVT is Rolle's theorem applied to an auxiliary function; the same EVT invocation underlies it.
Uniform continuity on [a,b]
A continuous function on a compact interval is automatically uniformly continuous — a strengthening proved with EVT-style compactness.
Existence of definite integrals
The Riemann sum converges because f is bounded on [a,b] — which is EVT for continuous integrands.
Existence of global neural-net minima
A continuous loss on a compact weight region attains its infimum. Weight decay + bounded architectures give us that compactness.
Fixed-point theorems (Brouwer, Schauder)
Continuous maps of compact convex sets into themselves have fixed points — proved via EVT-like compactness arguments generalised to higher dimensions.
Where EVT Shows Up in the Real World
Domain
Role of EVT
Concrete example
Engineering design
Optimising a shape or parameter over a bounded design space.
Minimise stress in a bridge beam subject to physical size bounds — compact design set ⇒ optimum exists.
Economics
Guaranteeing a consumer's utility-maximising bundle exists.
Budget set is closed + bounded; utility is continuous ⇒ EVT ⇒ optimum bundle attained.
Control theory
Bang-bang and time-optimal controls exist because feasible sets are compact and cost is continuous.
Pontryagin's principle assumes EVT-style existence for the optimiser.
Machine learning
Loss landscapes on bounded parameter regions attain minima; early-stopping + regularisation keep the domain compact.
Weight decay (L² regularisation) effectively confines SGD to a bounded ball in weight space.
Physics (variational principles)
Least-action, energy-minimising configurations exist because the action functional is continuous on a compact configuration space.
Finite-dimensional quantum systems: ground state energy attained because Hilbert-space balls are compact.
Numerical analysis
Convergence proofs for grid-search and branch-and-bound.
Brent's method and golden-section search both rely on EVT to know the minimum they bracket actually exists.
Common Pitfalls
Forgetting to check the interval is closed. On (a,b) the sup and inf may still exist without being attained. EVT DOES NOT APPLY — you must either close the interval or accept that there is no max/min.
Forgetting to check boundedness.f(x)=x on [0,∞) is continuous and the domain is closed in R, but the domain is unbounded so EVT still fails.
Forgetting to check continuity on the WHOLE interval. One discontinuity is enough to break EVT. A function can be continuous on all of [a,b]except a single interior point — and that lone exception can make the function unbounded, with no max.
Confusing "sup" with "max". The supremum of {1−1/n:n∈N} is 1; the maximum does not exist. The difference matters whenever you write a proof that needs an actual maximising point.
Thinking EVT tells you WHERE the extrema are. It only says they exist. Finding them still requires calculus — solving f′(x)=0 and comparing values at critical points and endpoints.
Assuming EVT implies uniqueness. Our cubic example attains its max at two different x-values (x=−1 and x=2). EVT promises "at least one" not "exactly one".
Summary
The Extreme Value Theorem: a continuous function on a closed and bounded interval attains both a maximum and a minimum.
All three hypotheses — closed, bounded, continuous — are essential. Each has a one-line counter-example when removed.
EVT guarantees existence, not location. Finding the extrema is a derivative-driven search on the shortlist of critical points plus endpoints.
EVT powers Fermat's theorem, Rolle's theorem, the Mean Value Theorem, integrability of continuous functions, and the existence of neural-network minima on compact weight regions.
In practice, the hypothesis that needs the most care is compactness — which for machine learning means actively enforcing bounded weights via weight decay, clipping, or normalisation. Without it, gradient descent has nothing to converge to.
Next, in §3.7, we will investigate uniform continuity: a strengthening of ordinary continuity that EVT secretly upgrades to whenever it applies. We will see why this stronger form is what numerical methods actually need, and how it too rests on compactness.