Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain why the intuitive idea of "getting close to L" is not rigorous enough to build calculus on.
State the ε–δ definition of a limit and identify each quantifier and inequality in plain language.
Construct a δ that responds to a given ε for linear and quadratic functions, both visually and algebraically.
Recognise when no δ can possibly work — the geometric signature of a failed limit.
Connect the limiting ratio ε / δ to the derivative and to the gradient autograd computes in PyTorch.

The Problem with "Approaches"

"What does it mean for a quantity to approach a value? For two centuries, this question stalled the foundations of calculus."

In every previous section we have written things like $\lim_{x \to 2} x^{2} = 4$ and trusted that "as x gets close to 2, x² gets close to 4". That sentence is convincing, but it is not a definition — it is a description of feelings.

Three uncomfortable questions

How close is close? If I demand that x² be within 0.0001 of 4, how close does x have to be to 2?
What if someone disagrees with my answer? "Close" is subjective; we need a referee that anyone can check.
For which functions does "close in x ⇒ close in f(x)" even hold? Some functions (think of a step) jump the moment you cross a boundary.

In the nineteenth century, Augustin-Louis Cauchy and, a few decades later, Karl Weierstrass faced this head-on. They flipped the question upside-down: instead of letting the prover decide what "close" meant, they let any opponent name the tolerance. The prover then has to produce a matching guarantee. If the prover can win every round of this game, for arbitrarily small tolerances, the limit is real. Otherwise it isn't.

The ε–δ Game: A Conversation Between Two People

Picture two mathematicians arguing about whether $\lim_{x \to c} f(x) = L$ :

🗡 The Challenger

Names a positive number $\varepsilon$ . Means: "I don't believe you. Here is how close to L the output must come — within $\varepsilon$ . Beat that."

🛡 The Defender

Responds with a positive number $\delta$ . Means: "If you keep x within $\delta$ of c (and not equal to c), I promise f(x) will land inside your $\varepsilon$ tolerance."

The limit equals L if and only if the Defender has a winning strategy — a recipe that produces a working $\delta$ for every positive $\varepsilon$ the Challenger could possibly throw, no matter how small.

Why the Challenger goes first

If the Defender went first and chose $\delta$ , they would be tempted to hide behind a big window and call it close enough. By forcing the Challenger to pre-commit to $\varepsilon$ , the definition rules out cheating: the Defender must actually shrink to fit any tolerance.

The Formal Definition

ε–δ definition of a limit

We say $\lim_{x \to c} f(x) = L$ if and only if for every $\varepsilon > 0$ there exists a $\delta > 0$ such that

$$0 < |x - c| < \delta \implies |f(x) - L| < \varepsilon.$$

In symbolic shorthand: $\forall\,\varepsilon > 0\;\exists\,\delta > 0:\;0 < |x-c| < \delta \Rightarrow |f(x)-L| < \varepsilon$ .

Anatomy of the Definition

The definition is dense. Every symbol is doing serious work. Let's slow down and translate piece-by-piece.

| Symbol | Reads as | What it actually means |
| --- | --- | --- |
| ∀ ε > 0 | for every positive ε | The Challenger is allowed to pick any tolerance, no matter how tiny. |
| ∃ δ > 0 | there exists a positive δ | The Defender must produce some response, usually a recipe δ(ε) that depends on ε. |
| 0 < \|x − c\| | x is not equal to c | We never require f(c) itself to behave. The limit is about behavior NEAR c, not AT c. |
| \|x − c\| < δ | x is within δ of c | x lives in the punctured input window (c − δ, c) ∪ (c, c + δ). |
| ⇒ | implies | Whenever the input condition holds, the output condition is forced to hold. |
| \|f(x) − L\| < ε | f(x) is within ε of L | f(x) lives in the output window (L − ε, L + ε), the green band. |

The order of quantifiers is everything

∀ε ∃δ means the Defender hears ε before choosing δ, so δ is allowed to depend on ε. Swapping to ∃δ ∀ε would mean a single δ has to handle every ε. That forces f(x) = L at every point of the punctured input window, so it fails for any function that is not constant (equal to L) near c. Many failed limit proofs collapse the moment you check the order.
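To make the swapped order concrete, here is a minimal sketch for f(x) = 2x + 1 at c = 3, the example worked later in this section. The helper name `challenger_reply` is hypothetical. For any δ fixed in advance, the point x = c + δ/2 sits inside the input window and misses L by exactly δ, so the Challenger wins simply by naming ε = δ.

```python
def f(x):
    return 2 * x + 1

c, L = 3.0, 7.0

def challenger_reply(delta):
    """Given a delta fixed in advance, return (epsilon, witness x) that defeats it."""
    x = c + delta / 2          # inside the window: 0 < |x - c| = delta / 2 < delta
    epsilon = abs(f(x) - L)    # equals 2 * (delta / 2) = delta for this f
    return epsilon, x

for delta in [1.0, 0.25, 0.0625]:
    epsilon, x = challenger_reply(delta)
    in_window = 0 < abs(x - c) < delta
    inside_band = abs(f(x) - L) < epsilon
    print(f"delta={delta}: epsilon={epsilon}, x={x}, "
          f"in window={in_window}, inside epsilon band={inside_band}")
```

Every row should report a point that is in the window but not inside the band, which is why the Defender must be allowed to answer after hearing ε.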

Interactive: Watch ε Force δ

The picture is the easiest way to feel the definition. Drag the ε slider — that is the Challenger setting the height of the green band. Then drag δ to set the width of the violet band. The limit holds when the curve segment inside the violet band stays entirely inside the green band.

[Interactive figure: ε–δ visualizer]

Try this experimental loop:

Pick the linear function. Set ε = 1.0. Click Auto-fit best δ. Note δ = 0.5.
Drop ε to 0.1. Auto-fit again. δ shrinks to 0.05 — exactly ε/2.
Switch to $f(x) = x^{2}$ . Observe that δ now depends on the curve's steepness near c = 2.
Click Make δ too big — red dots appear: those are the witness points where the curve escapes the green band.
Switch to the sinc function. Even though f(0) is undefined, the limit is still 1 — δ traps the surrounding values just fine.
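The sinc claim in the last step can also be checked without the widget. The sketch below assumes "sinc" means sin(x)/x (consistent with the limit 1 and the undefined value at 0) and relies on the standard bound |sin(x)/x − 1| ≤ x²/6, which is not stated in the text; the resulting choice δ = √(6ε) and the helper name `delta_for_sinc` are therefore illustrative assumptions.

```python
import math

def sinc(x):
    # Deliberately undefined at x = 0 (division by zero), like the text's example.
    return math.sin(x) / x

def delta_for_sinc(epsilon):
    # Assumed bound: |sin(x)/x - 1| <= x**2 / 6 for x != 0,
    # so |x| < sqrt(6 * epsilon) keeps the output within epsilon of 1.
    return math.sqrt(6 * epsilon)

for epsilon in [0.5, 0.01, 1e-4]:
    delta = delta_for_sinc(epsilon)
    # Sample both sides of 0, never touching 0 itself and staying strictly inside the window.
    xs = [s * delta * k / 1001 for k in range(1, 1001) for s in (-1, 1)]
    worst = max(abs(sinc(x) - 1) for x in xs)
    print(f"epsilon={epsilon:g}  delta={delta:.4f}  worst |f(x)-1| on samples={worst:.3e}")
```

The function is never evaluated at 0, yet every sampled value stays inside the ε band, which is the point of the punctured window.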

Worked Example: lim (2x + 1) = 7

Let's do a complete ε–δ proof for $\lim_{x \to 3}(2x + 1) = 7$ . The strategy has two parts: scratch-work (private — figure out what δ must be) and the polished proof (public — present δ and verify).

📝 Step-by-step numerical walkthrough — try it yourself first

Step 1 — Set up the bound we must satisfy. The Challenger picks some ε > 0 (for instance 0.1). We need

$$|f(x) - L| < \varepsilon \quad\Longleftrightarrow\quad |(2x + 1) - 7| < \varepsilon.$$

Step 2 — Simplify the left side.

$$|(2x + 1) - 7| = |2x - 6| = 2|x - 3|.$$

Step 3 — Solve for |x − 3|. Divide both sides by 2:

$$2|x - 3| < \varepsilon \;\Longleftrightarrow\; |x - 3| < \varepsilon / 2.$$

Step 4 — Read off δ. The condition we need on x is exactly |x − 3| < ε/2. So choose

$$\delta = \varepsilon / 2.$$

Step 5 — Verify. Suppose 0 < |x − 3| < δ = ε/2. Then

$$|f(x) - 7| = 2|x - 3| < 2 \cdot (\varepsilon / 2) = \varepsilon.\;\;\blacksquare$$

Step 6 — Plug in numbers and feel it. If ε = 0.1, take δ = 0.05. Then x = 3.04 gives f(x) = 7.08 and |7.08 − 7| = 0.08 < 0.1. ✓ Try x = 2.97: f(x) = 6.94, |6.94 − 7| = 0.06 < 0.1. ✓

Notice that we never had to choose specific x values to win — δ = ε/2 handles all of them at once. That is what makes this a proof rather than a spot-check.

The pattern works for every linear function

For $f(x) = mx + b$ with m ≠ 0, the same algebra gives $\delta = \varepsilon / |m|$ . Steeper lines (bigger |m|) force smaller δ — the input window has to shrink faster to compensate.
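The recipe δ = ε/|m| is easy to spot-check numerically. The sketch below uses hypothetical helper names (`delta_for_line`, `spot_check`) and random sampling, so a passing run is supporting evidence rather than a proof.

```python
import random

def delta_for_line(m, epsilon):
    """delta = epsilon / |m| for f(x) = m*x + b with m != 0."""
    if m == 0:
        raise ValueError("m must be non-zero; a constant function works with any delta")
    return epsilon / abs(m)

def spot_check(m, b, c, epsilon, trials=10_000, seed=0):
    """Sample x with 0 < |x - c| < delta and confirm |f(x) - L| < epsilon."""
    rng = random.Random(seed)
    L = m * c + b
    delta = delta_for_line(m, epsilon)
    for _ in range(trials):
        # Offsets stay within 99.9% of delta, i.e. strictly inside the window.
        offset = rng.uniform(-0.999, 0.999) * delta
        if offset == 0:
            continue                      # the definition excludes x = c
        x = c + offset
        if abs((m * x + b) - L) >= epsilon:
            return False, x               # witness that breaks the bound
    return True, None

print(spot_check(m=2, b=1, c=3, epsilon=0.1))       # the worked example, delta = 0.05
print(spot_check(m=-7.5, b=4, c=-1, epsilon=1e-3))  # a steeper, negative slope
```

Using `abs(m)` is what lets the same formula handle negative slopes.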

Play the Challenger–Defender Game

Below, you are the Challenger. Pick any ε you like — even unreasonably small. Watch the Defender produce a δ that always wins. The numerical witness table proves that every x in the input window really does land in the output window.

[Interactive figure: challenger–defender game]

A Trickier Case: lim x² = 4

For curves, the Defender's job is harder because steepness changes from point to point. Take $\lim_{x \to 2} x^{2} = 4$ . We want

$$|x^{2} - 4| < \varepsilon \quad\Longleftrightarrow\quad |x - 2|\cdot|x + 2| < \varepsilon.$$

The factor $|x - 2|$ is the one we control with δ. The factor $|x + 2|$ wobbles as x moves. The standard trick: tame the wobble first, then choose δ.

Tame the wobble. Restrict δ ≤ 1 once and for all. Then $|x - 2| < 1$ means $1 < x < 3$ , which gives $3 < x + 2 < 5$ , so $|x + 2| < 5$ .
Bound the product. Under the restriction, $|x - 2|\cdot|x + 2| < 5|x - 2|$ . We need this to be less than ε, so we need $|x - 2| < \varepsilon / 5$ .
Combine the two requirements. Choose $\delta = \min\!\left(1,\; \varepsilon / 5\right)$ .

Why min?

We need both conditions: δ ≤ 1 (so the wobble is bounded) and δ ≤ ε/5 (so the product is small enough). The smaller of the two satisfies both at once.

For ε = 0.1: $\delta = \min(1, 0.02) = 0.02$ . Verification: if $|x - 2| < 0.02$ then $|x^{2} - 4| < 0.02 \cdot 4.02 = 0.0804 < 0.1$ . ✓
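The choice δ = min(1, ε/5) can be stress-tested on a grid, including a large ε where the cap at 1 is the binding constraint. This is a sketch with hypothetical helper names (`delta_for_square`, `worst_gap`); it samples finitely many points and so illustrates the proof rather than replacing it.

```python
def delta_for_square(epsilon):
    """The text's choice for lim_{x->2} x**2 = 4: delta = min(1, epsilon / 5)."""
    return min(1.0, epsilon / 5)

def worst_gap(epsilon, n=2000):
    """Largest |x**2 - 4| over a grid strictly inside 0 < |x - 2| < delta."""
    delta = delta_for_square(epsilon)
    worst = 0.0
    for k in range(1, n):
        offset = delta * k / n            # 0 < offset < delta
        for x in (2 - offset, 2 + offset):
            worst = max(worst, abs(x * x - 4))
    return delta, worst

for epsilon in [10.0, 1.0, 0.1, 1e-4]:
    delta, worst = worst_gap(epsilon)
    print(f"epsilon={epsilon:g}  delta={delta:g}  worst |x^2-4|={worst:.6g}  ok={worst < epsilon}")
```

For ε = 10 the formula returns δ = 1 rather than 2: without the cap, the bound |x + 2| < 5 used in the proof would no longer be justified.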

The Limit Cage in 3D

The two-dimensional picture has the function as a curve and the bands as strips. In three dimensions we can see the bands as slabs, and the place where the green slab and the violet slab intersect is a yellow box — the limit cage. The ε–δ definition becomes a single sentence: the curve segment for x in the input window must fit inside the cage.

[Interactive figure: 3D limit cage]

Rotate to look down the x-axis: you see the green ε-strip head-on. Now look down the y-axis: the violet δ-strip becomes obvious. Look from above: the box footprint shows the actual region of the (x, y) plane that the definition cares about.

When the Limit Fails

The definition is also a powerful tool for proving a limit does not exist. Consider the sign function near 0:

$$f(x) = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \end{cases}.$$

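Suppose someone claims $\lim_{x \to 0} f(x) = 0$ . The Challenger picks ε = 0.5. Whatever δ the Defender names, the input window (−δ, 0) ∪ (0, δ) contains points where f = +1 and points where f = −1. Both values are 1 unit away from 0, far outside the 0.5 tolerance, so no δ can ever work. Changing the candidate does not help: +1 and −1 are 2 units apart, so no L lies within 0.5 of both, and the same ε defeats every L. The limit does not exist.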
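The Challenger's strategy against the sign function can be sketched in a few lines. The helper `witness` is hypothetical; it looks at one point on each side of 0 and returns whichever one lands outside the ε band around a candidate L. Trying several candidates shows that ε = 0.5 defeats all of them, not only L = 0.

```python
def sign(x):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    raise ValueError("sign is left undefined at 0, matching the piecewise definition")

def witness(L, delta, epsilon=0.5):
    """Return an x with 0 < |x| < delta and |sign(x) - L| >= epsilon, or None."""
    for x in (delta / 2, -delta / 2):     # one point on each side of 0
        if abs(sign(x) - L) >= epsilon:
            return x
    return None

for L in [0.0, 1.0, -1.0, 0.3]:
    for delta in [1.0, 1e-3, 1e-9]:
        x = witness(L, delta)
        print(f"L={L:+.1f}  delta={delta:g}  witness x={x:+.1e}  |f(x)-L|={abs(sign(x) - L):.1f}")
```

No matter how small δ gets, a witness is always found, because shrinking the window never removes either side of 0.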

The geometry of failure

Whenever the function makes an unavoidable jump, twist, or oscillation that no shrinking input window can suppress, the limit fails. Visually this is red in the 2D visualizer — change the function or pick a small ε at a discontinuity and watch the green band turn rose.
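Symbolically, "no δ can work" is the quantifier-by-quantifier negation of the definition. This is a standard logical restatement, written here in the section's notation:

```latex
f(x) \not\to L \ \text{ as } x \to c
\quad\Longleftrightarrow\quad
\exists\,\varepsilon > 0\;\;\forall\,\delta > 0\;\;\exists\,x:\quad
0 < |x - c| < \delta \ \text{ and } \ |f(x) - L| \ge \varepsilon .
```

The roles flip: now the Challenger names one fixed ε, and for every δ the Defender tries there must be a witness x inside the input window whose output escapes the band. For the sign function, ε = 0.5 and a point on the appropriate side of 0 serve as that witness.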

Python: Verifying ε–δ Numerically

The ε–δ definition is a quantifier statement, but a computer can approximate the Defender's job. We pick a function, sample the input window densely, and check every sample against the ε bound. To find a workable δ, we start with a generous one and halve it until the check passes. A passing grid is evidence, not proof: it only tests finitely many x.

Pure Python: search for δ that beats any ε

🐍epsilon_delta.py


Line 1: `import math`

Python's math module provides math.isclose(), which compares floating-point numbers safely so we can exclude the point x = c itself without being tripped by binary-floating-point noise.

EXECUTION STATE

math = Python standard library module for elementary math. Provides isclose, sqrt, log, etc. It ships with Python, so no NumPy is needed for this demo.

Line 3: `def f(x):` the function under test

Defines the function whose limit we want to verify. We use the simple linear function f(x) = 2x + 1 because it has a known limit at x = 3 (namely L = 7) and the algebra |f(x) − 7| = 2|x − 3| makes ε–δ exact.

EXECUTION STATE

⬇ input: x = Any real number. We will probe x near c = 3, but the function itself is defined for all of R.

⬆ returns = 2 * x + 1 — a single float.

→ example = f(3.001) = 2(3.001) + 1 = 7.002

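Line 7: `def epsilon_delta_check(c, L, epsilon, delta, n_samples)`

Decides whether a single (ε, δ) pair passes a sampled version of the ε–δ definition. It places 2001 evenly spaced x values across [c − δ, c + δ], endpoints included, and checks that every one of them (except x = c) lands in (L − ε, L + ε). If even one escapes, it returns False along with the witness. Including the endpoints makes the test slightly stricter than the definition, which only constrains |x − c| < δ.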

EXECUTION STATE

⬇ input: c = 3 — the point we are taking the limit at.

⬇ input: L = 7 — the candidate limit value.

⬇ input: epsilon = How close f(x) must come to L. Smaller ε = stricter test.

⬇ input: delta = How close x is allowed to c. The defender's response.

⬇ input: n_samples = 2001 = Density of the test grid inside (c − δ, c + δ). Odd number ensures a sample lands exactly at c so we can skip it.

⬆ returns = (True, None) if every sample passes, else (False, x) where x is the witness that violates the condition.

Line 14: `step = (2 * delta) / (n_samples - 1)`

Computes the spacing between adjacent test x-values so that the first sample lands exactly at c − δ and the last at c + δ. Using n_samples − 1 in the denominator (not n_samples) is the standard linspace convention.

EXECUTION STATE

step = Distance between adjacent x samples. With δ = 0.5 and 2001 samples: step = 1.0 / 2000 = 0.0005.

Line 15: `for i in range(n_samples)`

Walk through every sample in the input window. We check each one against the ε bound. The loop is a brute-force discretization of the universal quantifier ∀x in the formal definition.

EXECUTION STATE

i = Index 0, 1, …, n_samples − 1. We map it to an actual x in the next line.

Line 16: `x = (c - delta) + i * step`

Convert the integer index i into a real x value. When i = 0 we get the left edge c − δ; when i = n_samples − 1 we get the right edge c + δ. Linear interpolation across the input window.

EXECUTION STATE

x = A test point inside the window (c − δ, c + δ). Example: with c = 3, δ = 0.5, i = 1000 → x = 3.0.

Line 17: `if math.isclose(x, c, ...): continue`

The formal definition requires 0 < |x − c| < δ — strictly positive distance. We must skip the case x = c itself. math.isclose handles the floating-point case where x lands within rounding error of c.

EXECUTION STATE

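📚 math.isclose(a, b, rel_tol, abs_tol) = Returns True if |a − b| ≤ max(rel_tol · max(|a|, |b|), abs_tol). Safer than `a == b` for floats. Example: 0.1 + 0.2 == 0.3 is False, but math.isclose(0.1 + 0.2, 0.3) is True.

⬇ arg: abs_tol = 1e-12 = Absolute tolerance: anything within 10⁻¹² of c is treated as equal to c. Be aware that math.isclose also applies its default relative tolerance rel_tol = 1e-9 unless you override it. Near c = 3 that treats every x within about 3 × 10⁻⁹ of c as equal to c, which would skip the entire window once δ is that small. Passing rel_tol=0.0 makes abs_tol the only tolerance.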

→ why skip c? = Because the definition only constrains f(x) for x ≠ c. Whether f(c) is defined, undefined, or wildly different is irrelevant — that is the whole power of limits.

Line 19: `if abs(f(x) - L) >= epsilon: return False, x`

The ε test. Compute f(x), measure its distance from the candidate limit L, and reject the (ε, δ) pair the moment any x violates |f(x) − L| < ε. Return the offending x as a witness so the user sees concretely why δ failed.

EXECUTION STATE

abs(f(x) - L) = How far f(x) is from L. Example: f(3.6) = 8.2, |8.2 − 7| = 1.2.

>= epsilon = Strictly fails the ε bound. Note we use >= (not >) because the definition is strict <, so equality at the boundary is also a failure.

⬆ return: (False, x) = Tells the caller that the (ε, δ) pair is not valid, and hands back the witness point that broke it.

Line 21: `return True, None`

If we exit the loop without finding a single bad x, every test point passed. The (ε, δ) pair is valid for this discretization. None occupies the witness slot because there is no offender.

EXECUTION STATE

⬆ return: (True, None) = The defender succeeded — every x in (c − δ, c + δ) lands inside (L − ε, L + ε).

Line 23: `def find_smallest_delta(c, L, epsilon, start, shrink, tries)`

Implements the defender's strategy: start with a generous δ and halve it until the ε–δ check passes. This is how a defender might reason during the game: try a large δ, and if it fails, shrink and try again. Despite the name, the function returns the largest δ in the halving sequence that passes, not the smallest δ that works (any smaller positive δ works too).

EXECUTION STATE

⬇ input: epsilon = The ε value the challenger named.

⬇ input: start = 1.0 = The first δ to try. We start large because for many functions a wide δ already works.

⬇ input: shrink = 0.5 = Multiplicative shrink factor. Halving δ each round means we converge in O(log) iterations even for very small ε.

⬇ input: tries = 40 = Cap on iterations. With shrink = 0.5, the 40 candidates run from 1 down to 2⁻³⁹ ≈ 1.8 × 10⁻¹², which covers every ε in this demo (the smallest, 10⁻⁹, needs δ ≈ 5 × 10⁻¹⁰).

⬆ returns = The first δ that satisfies the ε–δ condition, or None if even the smallest δ in the search failed (a sign the limit is wrong).

Line 29: `delta = start`

Initialize the candidate δ to the starting value (1.0). Each iteration of the upcoming loop will multiply it by `shrink` until the ε–δ condition holds.

EXECUTION STATE

delta = 1.0 initially. After one halving: 0.5. After two: 0.25. After ten: ≈ 0.000977.

Line 30: `for _ in range(tries)`

Run the shrink loop up to `tries` times. We use `_` because we don't need the loop index — only the iteration count matters.

EXECUTION STATE

range(tries) = An iterator over 0, 1, …, tries − 1. Using it with `_` is the Pythonic way to say 'do this N times'.

Line 31: `ok, _ = epsilon_delta_check(c, L, epsilon, delta)`

Call our checker with the current δ. We unpack the (bool, witness) tuple. We discard the witness here with `_` because we just want the boolean — the user is not debugging at this level.

EXECUTION STATE

ok = True if this δ satisfies the ε–δ condition, False otherwise.

→ tuple unpacking = Python feature: `a, b = (1, 2)` assigns a = 1, b = 2 in one statement. We use `_` as the conventional 'I don't care' name.

Line 32: `if ok: return delta`

Found one! As soon as the ε–δ check passes, return the current δ. We don't need the smallest possible δ — the definition only requires that some δ exists.

EXECUTION STATE

⬆ return: delta = The first δ in the shrink sequence that worked. Often within a factor of 2 of the analytic optimum δ* = ε/2.

Line 34: `delta *= shrink`

Did not pass. Halve δ and try again. The compound assignment `delta *= shrink` is shorthand for `delta = delta * shrink`.

EXECUTION STATE

delta after halving = If we entered with δ = 0.5, we leave with δ = 0.25.

Line 35: `return None`

Loop exhausted without finding a working δ. This is a strong signal that the proposed limit L is wrong, or that the function is misbehaving (e.g., a removable discontinuity at c with the wrong L).

EXECUTION STATE

⬆ return: None = Defender failed. The challenger wins this round.

Line 38: `c, L = 3, 7`

Set up the limit we are testing: c is the input we approach (3), L is the value f(x) should approach (7). This is exactly the limit lim_{x→3}(2x + 1) = 7.

EXECUTION STATE

c = 3 — the limit point.

L = 7 — the candidate limit, equal to f(c) for this continuous function.

Line 39: print header line

Tell the reader what experiment is about to run. Embedding the values c and L directly in the message keeps the output self-documenting.

EXECUTION STATE

f-string = Python 3.6+ syntax: `f"...{expr}..."` interpolates `expr` into the string. Faster and more readable than `"...".format(...)`.

Line 41: print column headers

Format the output as a small table. The format spec `{...:>10}` right-aligns text in a 10-character field — important so the columns line up regardless of how big each ε is.

EXECUTION STATE

:>10 = Format spec: right-align in a width-10 field. Example: 'x'.rjust(10) is the same idea.

Line 42: `print("-" * 42)`

Repeats the dash 42 times to produce a horizontal rule under the headers. Python's string multiplication is a quick way to build separator lines.

EXECUTION STATE

"-" * 42 = Returns the string '------------------------------------------' (42 dashes).

Line 44: `for epsilon in [1.0, 0.1, 1e-3, 1e-6, 1e-9]`

The challenger plays five rounds, each more demanding than the last. By spanning nine orders of magnitude in ε we will see the defender's δ shrink in lockstep — concrete evidence that ε can be made arbitrarily small.

LOOP TRACE · 5 iterations

round 1: ε = 1.0

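expected δ = ≈ 0.5 (the analytic bound ε/2 for f = 2x + 1; in every round the search returns a power of 1/2 that is no larger than ε/2 and within a factor of 2 of it)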

round 2: ε = 0.1

expected δ = ≈ 0.05

round 3: ε = 1e-3

expected δ = ≈ 5e-4

round 4: ε = 1e-6

expected δ = ≈ 5e-7

round 5: ε = 1e-9

expected δ = ≈ 5e-10

Line 45: `delta = find_smallest_delta(c, L, epsilon)`

Run the defender's halving search for the current ε. The keyword arguments `start`, `shrink`, `tries` use their defaults (1.0, 0.5, 40), giving more than enough resolution for the ε values in our list.

EXECUTION STATE

delta (returned) = First δ in the halving sequence that satisfies the ε bound for every test x. For ε = 1e-3, the loop returns δ = 2⁻¹¹ ≈ 4.88e-4, the largest power of 1/2 below ε/2 = 5e-4.

Line 46: `ratio = epsilon / delta if delta else float('inf')`

Compute ε/δ. For a linear function with slope 2, the largest δ that works is ε/2, so the smallest achievable ratio is exactly 2, a numerical fingerprint of the local slope. The halving search only tries powers of 1/2, so the printed ratio lands between 2 and 4. This is the connecting tissue between ε–δ and the derivative.

EXECUTION STATE

ratio = ε / δ = between 2 and 4 for f(x) = 2x + 1; the lower bound 2 comes from |f(x) − L| / |x − c| = 2 exactly.

if delta else float('inf') = Conditional expression. If `delta` is None or 0 (falsy), use ∞ instead — avoids ZeroDivisionError and signals defender failure visually.

Line 47: print formatted row

Print one row of the result table. `:.1e` gives scientific notation with one digit after the decimal point (e.g. 1.0e-03). `:.4e` shows four. `:.2f` shows two decimal places of fixed-point. Tuned column-by-column for readability.

EXECUTION STATE

:.1e = Scientific notation, 1 fractional digit. 0.001 → '1.0e-03'.

:.4e = Scientific notation, 4 fractional digits. 0.0005 → '5.0000e-04'.

:.2f = Fixed-point, 2 decimals. 2.0 → '2.00'.


```python
import math

def f(x):
    """The function whose limit we are testing."""
    return 2 * x + 1

def epsilon_delta_check(c, L, epsilon, delta, n_samples=2001):
    """
    Check the ε–δ condition for one (epsilon, delta) pair on a sample grid.

    Returns (True, None) if every sampled x != c in [c - delta, c + delta]
    satisfies |f(x) - L| < epsilon, else (False, witness_x).
    """
    step = (2 * delta) / (n_samples - 1)   # grid includes both endpoints c ± delta
    for i in range(n_samples):
        x = (c - delta) + i * step
        if math.isclose(x, c, rel_tol=0.0, abs_tol=1e-12):
            continue                       # 0 < |x - c| excludes c itself
        if abs(f(x) - L) >= epsilon:
            return False, x                # Found a witness that breaks it
    return True, None

def find_smallest_delta(c, L, epsilon, start=1.0, shrink=0.5, tries=40):
    """
    Halve δ until the ε–δ condition holds.

    This mimics the *defender's* search in the ε–δ game.
    """
    delta = start
    for _ in range(tries):
        ok, _ = epsilon_delta_check(c, L, epsilon, delta)
        if ok:
            return delta
        delta *= shrink
    return None  # Defender failed: limit probably wrong

# The challenger names ε. The defender finds δ.
c, L = 3, 7
print(f"Testing  lim_(x→{c}) f(x) = {L}  for f(x) = 2x + 1")
print()
print(f"{'epsilon':>10}  {'delta found':>14}  {'ratio ε/δ':>10}")
print("-" * 42)

for epsilon in [1.0, 0.1, 1e-3, 1e-6, 1e-9]:
    delta = find_smallest_delta(c, L, epsilon)
    ratio = epsilon / delta if delta else float("inf")
    print(f"{epsilon:>10.1e}  {delta if delta else float('nan'):>14.4e}  {ratio:>10.2f}")
```

What you should see in the output

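Every row prints a smaller ε and a correspondingly smaller δ. Because the search only tries powers of 1/2, δ is the largest power of 1/2 that passes the check, so the ratio ε/δ lands between 2 and 4 rather than exactly on 2.00. (If the last row shows a ratio below 2, the skip test around x = c has swallowed the whole window: math.isclose applies a default relative tolerance of 1e-9 unless rel_tol=0.0 is passed.) The lower edge of that range, 2, is the slope of the line. That floor on the ratio is the derivative sneaking out of the ε–δ definition.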
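A halving search only tries powers of 1/2, so its δ can undershoot the best one by up to a factor of 2. The sketch below refines the answer by bisecting between the last failing δ and the first passing δ, which lets the ratio ε/δ settle on the slope. The helpers `passes` and `refine_delta` are hypothetical names, and the grid check (endpoints included, which is slightly conservative) is an assumption of this sketch rather than part of the definition.

```python
def f(x):
    return 2 * x + 1

def passes(c, L, epsilon, delta, n=200):
    """Grid check on both sides of c; the endpoints c ± delta are included."""
    for k in range(1, n + 1):
        offset = delta * k / n                 # 0 < offset <= delta
        for x in (c - offset, c + offset):
            if abs(f(x) - L) >= epsilon:
                return False
    return True

def refine_delta(c, L, epsilon, start=1.0, halvings=60, rounds=50):
    """Halve until the check passes, then bisect between the last failure and the first pass."""
    hi = start
    if passes(c, L, epsilon, hi):
        return hi                              # the starting delta already works
    lo = hi / 2
    for _ in range(halvings):
        if passes(c, L, epsilon, lo):
            break
        hi, lo = lo, lo / 2
    else:
        return None                            # never passed: the candidate limit is suspect
    for _ in range(rounds):                    # invariant: lo passes, hi fails
        mid = (lo + hi) / 2
        if passes(c, L, epsilon, mid):
            lo = mid
        else:
            hi = mid
    return lo

c, L = 3.0, 7.0
for epsilon in [1.0, 0.1, 1e-3, 1e-6, 1e-9]:
    delta = refine_delta(c, L, epsilon)
    print(f"{epsilon:>8.0e}  {delta:.6e}  {epsilon / delta:.4f}")
```

With the bisection step, the ratio column should sit at the slope 2 to within rounding, instead of wandering between 2 and 4.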

PyTorch: From ε–δ to Automatic Differentiation

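The Python search estimates how steep the function is by trial and error. PyTorch's autograd gives us that slope in one call to .backward(). The slope is the limiting ratio ε/δ, so once we have it we can predict δ for any ε without searching: exactly for a linear function, and to first order (for small ε) for any function that is differentiable at c with non-zero slope.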

PyTorch: autograd reads the limiting ratio in one shot

🐍epsilon_delta_torch.py


Line 1: `import torch`

PyTorch is a tensor library with built-in automatic differentiation. The same kernel that powers neural network training will, in this section, compute exact derivatives at a single point — and that derivative is precisely the limiting ratio ε/δ we have been chasing by hand.

EXECUTION STATE

torch = Provides tensors, autograd, optimizers. Used here only for autograd on a single scalar.

Line 3: Comment: build the input as a tensor that tracks gradients

The next line will define the variable c not as a Python float but as a tracked tensor. Every operation involving c will be recorded in PyTorch's computation graph so we can later backpropagate through it.

Line 4: `c = torch.tensor(3.0, ...)` with gradient tracking

Wrap the scalar 3.0 in a PyTorch tensor and turn on gradient tracking. After this call, c is a leaf node of the autograd graph, ready to receive a gradient from any downstream backward() call.

EXECUTION STATE

📚 torch.tensor(data, requires_grad) = Constructor that builds a tensor from Python data. With requires_grad=True the tensor is registered as a leaf in the autograd graph.

⬇ arg 1: 3.0 = The numeric value. We pick c = 3 because we are testing lim_{x→3}(2x + 1) = 7. Without an explicit dtype, PyTorch stores this as a 0-dim float32 tensor; passing dtype=torch.float64 is worthwhile here because the difference-quotient check at the end is sensitive to rounding.

⬇ arg 2: requires_grad=True = Tells autograd 'remember every operation involving this tensor so I can compute d(output)/dc later'. Without it, .backward() on the output would error.

→ c.grad = Initially None. Filled in by .backward(). Gradients accumulate: each further backward call adds to c.grad unless you reset it (for example with c.grad = None).

Line 6: Comment: define the same function as before

We will define f below. Reusing the same f as the Python example keeps the experiment honest — only the implementation strategy differs (manual ε/δ search vs. autograd).

Line 7: `def f(x): return 2 * x + 1`

Same function, but now it accepts a tensor and returns a tensor. PyTorch overloads `*` and `+` on tensors so this one line of source is identical to the Python version yet builds an autograd graph.

EXECUTION STATE

⬇ input: x (tensor) = A tracked tensor. Each arithmetic op creates a new node in the autograd graph that records its parents and the local derivative.

→ 2 * x = MulBackward node. Local derivative ∂(2x)/∂x = 2, stored on the node.

→ ... + 1 = AddBackward node. Local derivative ∂(y+1)/∂y = 1, stored on the node.

⬆ returns = A scalar tensor whose `.grad_fn` chain encodes the entire computation 2x + 1.

Line 12: `y = f(c)`

Run the function on the tracked tensor c. The result y is itself a tensor with `.grad_fn = <AddBackward0>`, linking back through MulBackward0 to c. The forward value is 2(3) + 1 = 7, but more importantly the graph is now built.

EXECUTION STATE

y = A 0-dim tensor with value 7.0 and a grad_fn that knows how to compute dy/dc.

→ y.item() = 7.0 — the plain Python float you would expect.

→ y.grad_fn = <AddBackward0> — the function autograd will call when we backpropagate.

Line 14: Comment: backpropagate

The big payoff is one line away. y.backward() walks the graph from y back to c and accumulates ∂y/∂c into c.grad. For our linear f, that derivative is the constant 2.

Line 15: `y.backward()`

Trigger reverse-mode automatic differentiation. PyTorch traverses the graph y → AddBackward → MulBackward → c, multiplying local Jacobians (1, 2) using the chain rule. The result, 2 × 1 = 2, is stored in c.grad.

EXECUTION STATE

📚 .backward() = PyTorch tensor method: for a scalar output, computes the gradient with respect to every leaf tensor that has requires_grad=True.

→ side effect = c.grad becomes tensor(2.0). y itself is unchanged.

→ why scalar? = backward() on a non-scalar tensor needs a gradient argument. Scalars (0-dim) implicitly use 1.0.

Line 17: `slope = c.grad.item()`

Extract the numerical gradient from the tensor and convert it to a plain Python float. This single number, 2.0, IS the limit lim_{Δx→0} (f(c+Δx) − f(c)) / Δx — the same ratio that ε/δ was approaching in the previous experiment.

EXECUTION STATE

📚 .item() = Tensor method: returns the underlying Python scalar from a 0-dim tensor. Errors on multi-element tensors. Detaches from autograd graph.

slope = 2.0 — exactly f'(3) for f(x) = 2x + 1.

→ connection to ε/δ = In the Python table the ratio ε/δ converged to ≈ 2.00. That convergence target is the derivative slope returned here in one autograd call.

Line 18: print f-string for f(c)

Print the forward value 7.0000 with four decimals so the reader sees that PyTorch agrees with hand-arithmetic before we look at the gradient.

EXECUTION STATE

:.4f = Format spec: fixed-point with 4 fractional digits. 7.0 → '7.0000'.

Line 19: print f-string for slope

Print the slope side-by-side with a label that names what it physically is — the limiting ratio ε/δ. Reinforces the conceptual link the section is building.

EXECUTION STATE

slope value = 2.0000 — the analytic derivative.

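Line 21: Comment: predict delta for any epsilon

Now we use the gradient as a prediction tool. For smooth functions, near c we have f(c + h) − L ≈ f'(c) · h, so |x − c| < ε / |f'(c)| keeps |f(x) − L| below ε to first order. For a linear f the approximation is exact; for a curved f it is an estimate that improves as ε shrinks. Note that ε / |f'(c)| estimates the largest δ that works, not the smallest: any smaller positive δ works as well.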

Line 23: `for epsilon in [1.0, 0.1, 1e-3, 1e-6]`

Loop over the same ε challenge values, but instead of searching for δ we calculate it directly from the slope. This is the asymptotic shortcut that derivatives give us.

LOOP TRACE · 4 iterations

ε = 1.0

predicted δ = 1.0 / 2.0 = 0.500

ε = 0.1

predicted δ = 0.05

ε = 1e-3

predicted δ = 5e-4

ε = 1e-6

predicted δ = 5e-7

Line 24: `predicted_delta = epsilon / abs(slope)`

The asymptotic δ formula. abs(slope) handles the case of a negative slope (still a valid local sensitivity). For our slope = 2, predicted_delta = epsilon / 2 — agreeing exactly with the linear algebra |2x − 6| < ε ⇔ |x − 3| < ε/2.

EXECUTION STATE

abs(slope) = |2.0| = 2.0. Important when f' is negative; ε/δ uses absolute value because we measure distance.

predicted_delta = ε / 2 — matches our hand calculation perfectly because the function is linear.

Line 25: print row of the prediction table

Format with `:>8.0e` for ε (right-aligned, scientific, no fractional digits) and `:.3e` for δ (three fractional digits in scientific). The visual parallel with the Python table makes the agreement obvious.

EXECUTION STATE

:>8.0e = Right-align in width 8, scientific notation, 0 fractional digits. 0.001 → '   1e-03' (three leading spaces pad it to width 8).

:.3e = Scientific notation, 3 fractional digits. 0.0005 → '5.000e-04'.

Line 27: Comment: sanity-check the autograd slope numerically

The next block computes the difference quotient (f(c + h) − f(c))/h for shrinking h and compares to slope. As h → 0, this should approach the autograd value. Because f is linear, the quotient equals the slope for every h in exact arithmetic; any deviation in the printout is floating-point rounding.

Line 29: print blank line

Empty print() inserts a newline between the two tables so the output is not visually crowded. Cosmetic only.

Line 30: print headers for the second table

Headers describing the numerical-vs-autograd comparison. Wide format strings keep columns aligned even when the slope value is negative.

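Line 31: `for h in [1e-1, 1e-3, 1e-5, 1e-7]`

Sweep h across four values spanning six orders of magnitude. In exact arithmetic the difference quotient equals 2 for every h because f is linear, and the trace below shows those ideal values. In floating point, c + h is rounded, so the printed quotient matches 2.000000 only while h is well above the spacing of representable numbers near 3. In float32 (the torch.tensor default) that spacing is about 2.4 × 10⁻⁷, so h = 1e-7 is rounded away entirely and the quotient collapses to 0. In float64 all four rows print 2.000000.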

LOOP TRACE · 4 iterations

h = 1e-1

(f(c+h) - f(c))/h = 2.000000

h = 1e-3

(f(c+h) - f(c))/h = 2.000000

h = 1e-5

(f(c+h) - f(c))/h = 2.000000

h = 1e-7

(f(c+h) - f(c))/h = 2.000000

Line 32: `with torch.no_grad():`

Temporarily disable autograd while we compute the difference quotient. We do not need to backprop through this; we just want the numbers. Skipping graph construction saves time and memory.

EXECUTION STATE

📚 torch.no_grad() = Context manager: every tensor op inside the block has requires_grad implicitly off. Standard tool for inference, evaluation, and any pure-numerics computation.

→ why here? = We are not training; we are sanity-checking. Building an autograd graph for tensors we never backprop through wastes memory.

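Line 33: `diff = (f(c + h) - f(c)) / h`

Classical forward difference quotient. For our linear f the numerator simplifies to 2h algebraically, so diff = 2 for every h in exact arithmetic, which confirms that autograd returned the exact slope. In floating point, c + h and the subtraction are both rounded, so the computed value drifts away from 2 as h shrinks.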

EXECUTION STATE

f(c + h) = 2(3 + h) + 1 = 7 + 2h

f(c) = 7

diff = (7 + 2h − 7) / h = 2.0, independent of h (in exact arithmetic).

Line 34: print formatted comparison row

Lay the difference-quotient value next to its absolute error against the autograd slope. For this linear function the error column contains only floating-point rounding, which grows as h shrinks. For a non-linear function the error would first shrink with h and then grow again once rounding dominates.

EXECUTION STATE

:>20.6f = Right-align in width 20, fixed-point with 6 decimals. Pads the column for clean alignment regardless of sign.

:.2e = Error column: scientific notation, 2 fractional digits. Always visually readable even when the error is exactly zero.


 1  import torch
 2
 3  # 1) Build the input as a float64 tensor that tracks gradients.
 4  c = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
 5
 6  # 2) Define the same function as before.
 7  def f(x):
 8      return 2 * x + 1
 9
10  # 3) Evaluate once at x = c. The result is a tensor connected
11  #    to c by an autograd graph, NOT just a plain number.
12  y = f(c)
13
14  # 4) Backpropagate. y.backward() fills c.grad with df/dx evaluated at c.
15  y.backward()
16
17  slope = c.grad.item()                  # local sensitivity ≈ ε / δ
18  print(f"f(c)     = {y.item():.4f}")
19  print(f"f'(c)    = {slope:.4f}        (this is the LIMITING ratio ε/δ)")
20
21  # 5) Use the slope to PREDICT the largest delta that works for any epsilon.
22  #    For a smooth function near c:   delta_max ≈ epsilon / |f'(c)|
23  for epsilon in [1.0, 0.1, 1e-3, 1e-6]:
24      predicted_delta = epsilon / abs(slope)
25      print(f"epsilon = {epsilon:>8.0e}  →  predicted delta = {predicted_delta:.3e}")
26
27  # 6) Sanity-check: the autograd slope agrees with the numerical
28  #    difference quotient up to floating-point rounding.
29  print()
30  print("h           numerical (f(c+h) - f(c)) / h     |error vs slope|")
31  for h in [1e-1, 1e-3, 1e-5, 1e-7]:
32      with torch.no_grad():
33          diff = (f(c + h) - f(c)) / h
34      print(f"{h:>9.0e}    {diff.item():>20.6f}            {abs(diff.item() - slope):.2e}")
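The walkthrough notes that a non-linear function behaves differently from the linear one: the error shrinks with h until floating-point noise takes over. Below is a minimal sketch of that behaviour in plain Python floats (double precision). The function g(x) = x² and the point c = 2 are illustrative assumptions, and the analytic derivative 2c stands in for the slope autograd would return.

```python
# Assumption: g(x) = x**2 and c = 2.0 are illustrative, not taken from the listing above.
def g(x):
    return x * x

c = 2.0
exact_slope = 2 * c  # analytic derivative of x**2 at c, standing in for c.grad

errors = {}
print("h           (g(c+h) - g(c)) / h     |error vs slope|")
for h in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11, 1e-13]:
    diff = (g(c + h) - g(c)) / h          # forward difference quotient
    errors[h] = abs(diff - exact_slope)
    print(f"{h:>9.0e}    {diff:>20.10f}    {errors[h]:.2e}")
```

In exact arithmetic the error for this g equals h, so the first rows track the step size. At the smallest steps, rounding in c + h typically dominates and the error stops improving.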

The bridge between ε–δ and machine learning

When you train a neural network with billions of parameters, every backward pass is asking, in effect, "If I nudge this parameter by a tiny δ, how much does the loss change per unit of nudge?" The answer is the gradient: the same $\lim_{\delta \to 0}\frac{\Delta\, \text{loss}}{\delta}$ that the ε–δ game makes precise. Without this definition, the mathematical foundations of modern AI would still rest on hand-waving.
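The nudge-and-measure idea can be sketched without any framework. The example below assumes a hypothetical one-parameter model w * x with squared-error loss. The gradient is derived by hand to keep the sketch dependency-free; an autograd system would be expected to return the same value.

```python
# Assumption: a toy one-parameter model; x, y, w and the loss are illustrative.
x, y = 3.0, 5.0

def loss(w):
    return (w * x - y) ** 2

def grad_loss(w):
    # Hand-derived d(loss)/dw = 2 * (w*x - y) * x.
    return 2.0 * (w * x - y) * x

w = 1.0
analytic = grad_loss(w)

ratios = {}
for delta in [1e-1, 1e-2, 1e-4, 1e-6]:
    ratios[delta] = (loss(w + delta) - loss(w)) / delta   # change in loss per unit nudge
    print(f"delta = {delta:.0e}   ratio = {ratios[delta]:.6f}   gradient = {analytic:.6f}")
```

For this loss the ratio equals the gradient plus 9δ, so the gap closes linearly as the nudge shrinks.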

Why This Definition Changed Everything

📐 Rigorous derivatives

The derivative is a limit of difference quotients. Once limits are made air-tight, every theorem of differentiation (chain rule, mean value theorem) becomes provable from first principles.

🌊 Real analysis

Convergence of sequences, continuity of functions, the completeness of the real numbers — all of modern analysis is built on top of the ε–δ language.

🤖 Numerical methods

Every error bound in scientific computing — Newton's method, Runge–Kutta, finite differences — is an ε statement: "the answer is within ε if you push the algorithm far enough."

🧠 Deep learning

Universal approximation theorems, gradient convergence proofs, and the very concept of a continuous loss landscape all rest on the ε–δ definition of continuity and differentiability.
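The numerical-methods card above describes error bounds as ε statements. As a rough sketch of that reading, the code below runs Newton's method on x² − 2 and reports the first step N at which the iterate is within ε of √2. The starting point, the step cap, and the function name are assumptions made for illustration.

```python
import math

def newton_sqrt2_steps(eps, x0=2.0, max_steps=50):
    """Return (N, x_N): the first step at which |x_N - sqrt(2)| < eps."""
    target = math.sqrt(2.0)   # reference value, used only to measure the error
    x = x0
    for n in range(max_steps + 1):
        if abs(x - target) < eps:
            return n, x
        x = 0.5 * (x + 2.0 / x)   # Newton update for x**2 - 2 = 0
    raise RuntimeError("tolerance not reached within max_steps")

steps = {eps: newton_sqrt2_steps(eps)[0] for eps in [1e-1, 1e-3, 1e-6, 1e-12]}
for eps, n in steps.items():
    print(f"eps = {eps:.0e}  ->  N = {n}")
```

This is the sequence version of the game: the Challenger names ε and the Defender answers with a step count N instead of a window δ.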

Common Pitfalls

Pitfall 1 — Choosing δ that depends on x

δ may depend on ε (and on the fixed point c) but not on the variable x. δ is a single number that must work uniformly for every x in the input window. Sneaking x into the formula for δ is a frequent (and silent) source of broken proofs.
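To see the pitfall concretely, here is a sketch using the limit $\lim_{x \to 2} x^{2} = 4$ from the start of this section. The preliminary restriction $|x - 2| < 1$ is one conventional choice, assumed here for illustration.

```latex
\begin{aligned}
|x^{2} - 4| &= |x + 2|\,|x - 2| \\
\text{invalid:}\quad \delta &= \frac{\varepsilon}{|x + 2|} \quad \text{(changes with } x\text{)} \\
\text{valid:}\quad |x - 2| < 1 &\implies |x + 2| < 5, \qquad \delta = \min\!\left(1, \tfrac{\varepsilon}{5}\right) \\
0 < |x - 2| < \delta &\implies |x^{2} - 4| < 5\,\delta \le \varepsilon
\end{aligned}
```

The valid choice replaces the x-dependent factor $|x + 2|$ with a constant bound before δ is named.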

Pitfall 2 — Forgetting 0 < |x − c|

The condition 0 < |x − c| is what excludes x = c from the discussion. If you drop the strict 0 <, you have implicitly required f(c) = L, yet many limits hold even though f(c) is undefined, such as $\sin(x)/x$ at 0.
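A small numerical illustration of the punctured window, assuming the example sin(x)/x at c = 0 with L = 1. The tolerance and window size are hand-picked for the sketch, relying on the standard bound |sin(x)/x − 1| ≤ x²/6.

```python
import math

def f(x):
    return math.sin(x) / x   # undefined at x = 0 (division by zero)

L, eps = 1.0, 1e-3
delta = 0.05                  # assumption: hand-picked so that delta**2 / 6 < eps

# Sample the punctured window 0 < |x - 0| < delta on both sides of 0.
samples = [s * delta * k / 1000 for k in range(1, 1000) for s in (-1, 1)]
worst = max(abs(f(x) - L) for x in samples)
print(f"worst |f(x) - L| on samples = {worst:.2e}  (eps = {eps:.0e})")

try:
    f(0.0)                    # including x = c breaks the check entirely
    defined_at_c = True
except ZeroDivisionError:
    defined_at_c = False
print("f(0) defined:", defined_at_c)
```

Every sampled point satisfies the bound, while the point x = c itself cannot even be evaluated, which is exactly why the definition leaves it out.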

Pitfall 3 — Treating ε and δ as fixed numbers

ε is a stand-in for "any positive number, no matter how small". A proof that works only for ε = 0.1 is no proof at all. Always carry ε through the algebra; the symbol must survive to the end.
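The difference between a fixed δ and a rule δ(ε) can be sketched with the linear function from the PyTorch example, f(x) = 2x + 1 at c = 3. The probe point and the hand-tuned value 0.05 are assumptions chosen for illustration.

```python
def f(x):
    return 2 * x + 1

c, L = 3.0, 7.0

def within(eps, delta):
    """Check one probe point just inside the window 0 < |x - c| < delta."""
    x = c + 0.999 * delta
    return abs(f(x) - L) < eps

fixed_delta = 0.05                       # tuned by hand for eps = 0.1 only
results_fixed = {eps: within(eps, fixed_delta) for eps in [0.1, 0.01, 0.001]}
results_rule = {eps: within(eps, eps / 2) for eps in [0.1, 0.01, 0.001]}
print("fixed delta :", results_fixed)
print("delta(eps)  :", results_rule)
```

The fixed window answers only the first challenge, while the rule δ = ε / 2 keeps responding as the tolerance shrinks.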

Pitfall 4 — Confusing the definition with the test

Sampling x values numerically and checking the bound is a great way to gain confidence, but a finite computation cannot verify ∀ε. A numerical check can expose a δ that fails; only algebra proves that a working δ always exists.
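A sampling check of this kind might look like the sketch below. The helper name find_counterexample and the sampling grid are hypothetical. The step function and x² are used as test cases.

```python
def find_counterexample(f, c, L, eps, delta, n=10_000):
    """Return a sampled x with 0 < |x - c| < delta and |f(x) - L| >= eps, or None."""
    for k in range(1, n + 1):
        for sign in (-1.0, 1.0):
            x = c + sign * delta * k / (n + 1)
            if abs(f(x) - L) >= eps:
                return x
    return None

def step(x):
    return 0.0 if x < 0 else 1.0

# Claimed limit of step at 0 is L = 1: every delta tried fails for eps = 0.5.
bad = [find_counterexample(step, 0.0, 1.0, 0.5, d) for d in (1.0, 1e-3, 1e-9)]

# x**2 at 2 with delta = min(1, eps / 5): no counterexample is found,
# which is evidence for this one eps, not a proof for all eps.
ok = find_counterexample(lambda x: x * x, 2.0, 4.0, 1e-3, min(1.0, 1e-3 / 5))
print("step counterexamples:", bad)
print("x**2 counterexample :", ok)
```

A returned x settles the matter for that δ. A returned None only says the sampled grid found nothing.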

Summary

The ε–δ definition closed a 200-year-old gap in the foundations of calculus. It turned the vague notion of "approaching" into a precise game played between a Challenger and a Defender, and gave mathematicians the language they needed to prove every theorem about limits, continuity, derivatives, and integrals from scratch.

| Concept | What it means in the game |
| --- | --- |
| ε > 0 | The Challenger's tolerance: how close to L the output must be |
| δ > 0 | The Defender's response: how close to c the input must be |
| 0 < \|x − c\| < δ | x is in the punctured input window around c |
| \|f(x) − L\| < ε | f(x) is in the output window around L |
| ∀ε ∃δ | For every challenge there is a working response |
| Limit fails | Some ε exists for which no δ can ever work |
| ε / δ ratio | Local sensitivity → the derivative as δ → 0 |

Key Takeaways

A limit is the formal answer to the question "how close can the output be made by squeezing the input?"
The Challenger always moves first; δ may depend on ε but never on x.
For linear functions $\delta = \varepsilon / |m|$ works exactly. For curves, first bound the wobble on a fixed window, then take $\delta$ to be the minimum of the resulting bounds.
Limits care about behavior near c, not at c — that is why limits generalise function evaluation.
The limiting ratio ε / δ converges to $|f'(c)|$, the magnitude of the derivative that autograd computes in PyTorch.
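The last takeaway can be checked numerically. The sketch below assumes f(x) = x² at c = 2 with L = 4. For the ε values used here, the largest working δ is sqrt(4 + ε) − 2, set by the right-hand side of the window where x² grows fastest.

```python
import math

# Assumption: f(x) = x**2, c = 2, L = 4; largest_delta is a hypothetical helper.
def largest_delta(eps):
    # Solve (2 + delta)**2 - 4 = eps for delta.
    return math.sqrt(4.0 + eps) - 2.0

ratios = {}
for eps in [1.0, 1e-2, 1e-4, 1e-6]:
    d = largest_delta(eps)
    ratios[eps] = eps / d
    print(f"eps = {eps:.0e}   delta = {d:.6e}   eps/delta = {ratios[eps]:.6f}")
```

As ε shrinks, the ratio settles toward 4, which is the derivative 2c of x² at c = 2.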

The ε–δ promise:

"Name your tolerance. I will name a window. Every input in that window will land inside your tolerance — no matter how small you make it."

Coming Next: Now that limits are rigorous, we can stop computing them from scratch every time. The next section develops the Limit Laws — algebraic shortcuts that let us combine known limits into new ones, all justified by the ε–δ definition you just learned.