Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

Classify any point where a function fails continuity as removable, jump, infinite, or essential.
Use the three-condition test from §3.2 to diagnose whether $f(c)$ exists, whether $\lim_{x \to c} f(x)$ exists, and whether the two are equal.
Decide whether a discontinuity can be repaired by redefining $f(c)$ , or whether it is structurally broken.
Detect each type numerically from one-sided probes, and justify why smooth activations dominate modern machine learning.

Why Classify Discontinuities?

In §3.2 we distilled continuity down to a three-step checklist at a point $c$: $f(c)$ is defined, $\lim_{x \to c} f(x)$ exists, and $\lim_{x \to c} f(x) = f(c)$. When any one of these fails, the function is discontinuous at $c$. But saying "the function is broken" is vague: how is it broken?

The way the test fails determines everything about how we repair (or can't repair) the function. A missing dot you can fill in. A finite jump you cannot smooth out without changing the function on a whole interval around it. A vertical asymptote signals that the function has left the planet. An oscillation that never settles leaves no value to aim for. Naming these four modes is the first step to doing something about them, whether that means canceling a common factor, adding a Heaviside term in a physics model, or rejecting an activation that kills gradients.

The big idea. Every discontinuity fits one of four templates. Pin down which template and you immediately know: (i) whether the function has a limit there, (ii) whether a redefinition can fix it, (iii) what the graph looks like locally, and (iv) what the function will do to a differential equation, a numerical solver, or a learning algorithm that has to cross that point.

The Three Tests That Can Fail

Continuity at $c$ is the conjunction of three conditions. A discontinuity is any pattern of failures in those three. The table shows which tests fail for each of our four types — that mapping IS the classification.

Type	$f(c)$ defined?	$\lim_{x\to c} f(x)$ exists?	$\lim = f(c)$?	Repairable?
Continuous (no discontinuity)	yes	yes	yes	n/a
Removable	sometimes	yes	no	yes, by redefining $f(c)$
Jump	either	no (one-sided limits disagree)	no	no
Infinite	either	no (blows up)	no	no
Essential / oscillating	either	no (never settles)	no	no

Mnemonic. If both one-sided limits exist and agree, the discontinuity is removable — or there is none. If they exist but disagree, it's a jump. If even one blows up, it's infinite. Everything else is essential.

Type 1 — Removable (the Hole)

A removable discontinuity at $c$ is a point where the two-sided limit exists and is finite, but the function either has a different value at $c$ or has no value at all. Visually, the graph is a smooth curve with a single dot missing — or a dot placed somewhere above or below the curve.

Canonical example. $f(x) = \dfrac{x^2 - 1}{x - 1}$. Algebraically, factoring gives $\dfrac{(x-1)(x+1)}{x-1} = x + 1 \quad (x \ne 1)$, so the function is just the line $y = x + 1$ with a single dot missing at $x = 1$. The left and right limits both equal $2$, but the literal formula gives $0/0$ at $x = 1$.

Why "removable"? We can define $\tilde f(x) = f(x)$ for $x \ne c$ and $\tilde f(c) = \lim_{x\to c} f(x)$. The new function $\tilde f$ agrees with the old one everywhere it was already defined, and is continuous at $c$. The defect has been removed.

Second flavour — the floating dot. Sometimes $f(c)$ is defined but placed somewhere the curve doesn't go: e.g. $g(x) = \begin{cases} x + 1 & x \ne 1 \\ 5 & x = 1 \end{cases}.$ The two-sided limit is still $2$ , but $g(1) = 5$ . Same diagnosis: redefine $g(1) = 2$ and continuity is restored.
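The repair can be sketched in a few lines of plain Python. This is only an illustration: the helper name `patch_removable` and its arguments are assumptions made for this example, and the limit is supplied by hand rather than computed.

```python
def patch_removable(f, c, limit):
    """Return f_tilde: equal to f away from c, equal to `limit` at c."""
    def f_tilde(x):
        return limit if x == c else f(x)
    return f_tilde

def f(x):
    # (x^2 - 1) / (x - 1): raises ZeroDivisionError at x = 1
    return (x * x - 1) / (x - 1)

f_tilde = patch_removable(f, c=1, limit=2.0)

print(f_tilde(1))    # the patched value at the former hole
print(f_tilde(0.5))  # unchanged away from c: same as f(0.5)
```

The same wrapper handles the floating-dot flavour: passing a function that returns 5 at $x = 1$ together with `limit=2.0` overrides the misplaced value.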

Type 2 — Jump (the Cliff)

A jump discontinuity at $c$ is where the two one-sided limits each exist and are finite, but they disagree: $\lim_{x \to c^-} f(x) \ne \lim_{x \to c^+} f(x)$. The graph drops (or rises) instantaneously. Unlike a removable hole, no single dot can bridge the gap: the function really does "teleport" across an interval of values.

Canonical example. The Heaviside step $H(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$. At $x = 0$, $\lim_{x\to 0^-} H(x) = 0$ and $\lim_{x\to 0^+} H(x) = 1$, so the gap is exactly 1. The value $H(0) = 1$ is a convention; whichever value is chosen for $H(0)$ (common choices are $0$, $\tfrac{1}{2}$, and $1$), the jump stays in place.

Jumps are topologically permanent. No single-point redefinition eliminates them. The only way to "fix" a jump is to change the function on a whole interval — and that may break whatever property you wanted the function for.
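A quick numerical check makes the point concrete. In this sketch the factory `make_step` and the probe distance `h = 1e-6` are illustrative choices, not part of the lesson's code; the one-sided probes never look at $H(0)$, so the gap is the same for every convention.

```python
def make_step(value_at_zero):
    """Heaviside step with a configurable value at x = 0."""
    def H(x):
        if x < 0:
            return 0.0
        if x > 0:
            return 1.0
        return value_at_zero
    return H

h = 1e-6  # probe distance (arbitrary small value)
gaps = []
for v in (0.0, 0.5, 1.0):
    H = make_step(v)
    left, right = H(-h), H(h)
    gaps.append(right - left)
    print(f"H(0) = {v:.1f}   left probe = {left:.1f}   right probe = {right:.1f}   gap = {right - left:.1f}")
```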

Type 3 — Infinite (the Tower)

An infinite discontinuity is where at least one one-sided limit is $\pm \infty$ . The graph has a vertical asymptote at $c$ .

Canonical example. $f(x) = \dfrac{1}{x^2}$. Both one-sided limits are $+\infty$: $\lim_{x\to 0} \tfrac{1}{x^2} = +\infty$. A cousin is $1/x$, which has $\lim_{x\to 0^-} \tfrac{1}{x} = -\infty$ and $\lim_{x\to 0^+} \tfrac{1}{x} = +\infty$.
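The blow-up is easy to watch numerically. The following is a minimal sketch; the function names `inv_sq` and `inv` and the chosen probe distances are assumptions for illustration only.

```python
def inv_sq(x):
    return 1.0 / (x * x)

def inv(x):
    return 1.0 / x

# Probe both functions on each side of 0 at shrinking distances h.
rows = []
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    rows.append((h, inv_sq(-h), inv_sq(h), inv(-h), inv(h)))

for h, sq_left, sq_right, inv_left, inv_right in rows:
    print(f"h={h:.0e}   1/x^2: {sq_left:.3e}  {sq_right:.3e}   1/x: {inv_left:.3e}  {inv_right:.3e}")
```

Both columns for $1/x^2$ grow without bound and stay positive, while the two columns for $1/x$ grow in magnitude with opposite signs.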

Infinite discontinuities dominate in physics: the Coulomb force $F \propto 1/r^2$ blows up at $r = 0$, and the gravitational potential $\Phi \propto -GM/r$ does the same. These are not bugs; they are telling you the model has run out of validity at that point and a deeper theory is required (e.g. point particles are an idealization; real charges have extent).

Type 4 — Essential / Oscillating

An essential (or oscillating) discontinuity is the catch-all for pathologies where no side blows up, yet at least one one-sided limit exists neither as a finite number nor as $\pm\infty$. The function doesn't settle.

Canonical example. $f(x) = \sin(1/x)$ as $x \to 0$ . As $x$ shrinks, $1/x$ runs through all values in $[M, \infty)$ for any $M$ , and so $\sin(1/x)$ visits every value in $[-1, 1]$ infinitely many times. No single limit.
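One way to see the failure concretely is to sample along a sequence that tends to 0. In this sketch the sequence $x_k = 1/(\pi/2 + k\pi)$ is our own choice: along it $\sin(1/x_k) = (-1)^k$, so the samples keep flipping between $+1$ and $-1$ however close to 0 we get.

```python
import math

def f(x):
    return math.sin(1.0 / x)

# x_k = 1 / (pi/2 + k*pi) tends to 0 as k grows,
# and sin(1/x_k) = sin(pi/2 + k*pi) = (-1)^k.
samples = []
for k in (10, 11, 1000, 1001, 100000, 100001):
    x_k = 1.0 / (math.pi / 2 + k * math.pi)
    samples.append((x_k, f(x_k)))
    print(f"x = {x_k:.3e}   sin(1/x) = {f(x_k):+.6f}")
```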

Naming. Some textbooks use "essential" only for the oscillating case; others use it as an umbrella for any non-removable, non-jump, non-infinite discontinuity. We adopt the umbrella view. If you meet one in the wild, the practical advice is always the same: your function is unfit for purpose here, so choose a different model.

The Family Tree of Discontinuities

All four types fit into a single decision tree driven by the one-sided limits $L = \lim_{x\to c^-} f(x)$ and $R = \lim_{x\to c^+} f(x)$ :

$L$	$R$	$f(c)$	Classification
finite	finite, $= L$	$= L$	continuous
finite	finite, $= L$	$\ne L$ or undefined	removable
finite	finite, $\ne L$	any	jump
±∞, finite, or DNE	±∞, finite, or DNE (at least one side is ±∞)	any	infinite
finite or DNE	finite or DNE (at least one side is DNE)	any	essential (oscillating)

The first three rows cover the case where both one-sided limits are finite; they split on whether the two agree and, if so, on how $f(c)$ compares with the common value. The bottom two rows cover the two ways a one-sided limit can fail to be a finite number: blowing up (Type 3) or failing to settle at all (Type 4).
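The decision tree can be written down directly when the one-sided limits are known exactly. The sketch below is an illustration under our own encoding, not part of the lesson's code: `math.inf` stands for $\pm\infty$, `None` stands for "does not exist" (for a limit) or "undefined" (for $f(c)$), and the function name `classify_from_limits` is hypothetical.

```python
import math

def classify_from_limits(L, R, fc):
    """Name the behaviour at c from exact one-sided limits.

    L, R : float, +/-math.inf, or None when that one-sided limit does not exist.
    fc   : float, or None when f(c) is undefined.
    """
    sides = (L, R)
    # If even one side blows up, the discontinuity is infinite.
    if any(s is not None and math.isinf(s) for s in sides):
        return "infinite"
    # Otherwise, a side with no limit at all makes it essential.
    if any(s is None for s in sides):
        return "essential"
    # Both sides are finite from here on.
    if L != R:
        return "jump"
    if fc is None or fc != L:
        return "removable"
    return "continuous"

print(classify_from_limits(2.0, 2.0, None))             # (x^2 - 1)/(x - 1) at 1
print(classify_from_limits(0.0, 1.0, 1.0))              # Heaviside step at 0
print(classify_from_limits(-math.inf, math.inf, None))  # 1/x at 0
print(classify_from_limits(None, None, None))           # sin(1/x) at 0
print(classify_from_limits(1.0, 1.0, 1.0))              # x^2 + 1 at 0
```

Exact equality (`L != R`) is appropriate here only because the inputs are exact limits; numerical probes need a tolerance instead.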

Interactive: Discontinuity Explorer

Pick a preset, then slide $\delta$ to watch the left (orange) and right (blue) probes approach $c$ . The diagnostic panel reports the running left limit, right limit, the actual value $f(c)$ , and which of the four types the function exhibits. For the removable case, the "Patch the hole" checkbox demonstrates how a single-point redefinition restores continuity.


Things to try. (1) On Jump, watch how the probes never agree, no matter how small $\delta$ gets: the gap is hard-wired. (2) On Infinite, both probes race to larger and larger values. (3) On Oscillating, shrink $\delta$ and watch the probes jitter between ±1; there is simply no single number to approach.

Worked Example — Classify Three Functions

Let's work by hand through three small problems so the classifier in your head is well-calibrated before we hand the job to a computer.


Part (a) — Classify f at x = 2, where f(x) = (x² − 4) / (x − 2).

Simplify. $\dfrac{x^2-4}{x-2} = \dfrac{(x-2)(x+2)}{x-2} = x + 2$ for $x \ne 2$ .

Left limit. $\lim_{x\to 2^-} f(x) = 2 + 2 = 4$ . Right limit. $\lim_{x\to 2^+} f(x) = 4$ . f(2): $0/0$ , undefined.

Verdict. Both one-sided limits exist and agree at 4, but $f(2)$ is undefined. This is a removable discontinuity. Repair by defining $f(2) = 4$ .

Part (b) — Classify g at x = 0, where g(x) = |x| / x for x ≠ 0, g(0) = 0.

Rewrite piecewise. $g(x) = \begin{cases} -1 & x < 0 \\ +1 & x > 0 \\ 0 & x = 0 \end{cases}.$

Left limit: $\lim_{x\to 0^-} g = -1$ . Right limit: $\lim_{x\to 0^+} g = +1$ . They are finite but disagree.

Verdict. Jump discontinuity at 0. No single-point redefinition can fix it — the gap is of size $+1 - (-1) = 2$ .

Bonus: $g(0) = 0$ sits in the middle of the gap. That value is harmless but does nothing to help — even if we changed it to $+1$ or $-1$ , the two-sided limit would still not exist.

Part (c) — Classify h at x = 3, where h(x) = 1 / (x − 3).

Left limit: as $x \to 3^-$ , $x - 3 \to 0^-$ , so $h(x) \to -\infty$ .

Right limit: as $x \to 3^+$ , $x - 3 \to 0^+$ , so $h(x) \to +\infty$ .

Verdict. Infinite discontinuity at 3. The function has a vertical asymptote, with a left-blowup to $-\infty$ and a right-blowup to $+\infty$.

Takeaway. The routine is always the same:

Compute (or estimate) the one-sided limits.
Compare them to each other and to $f(c)$ .
Use the decision table to name the type.
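The three verdicts can be sanity-checked with one-sided probes. This is a minimal sketch: the probe distance `delta = 1e-6` is an arbitrary choice, and the Python names `f`, `g`, `h` simply mirror the functions of parts (a) to (c).

```python
def f(x):
    return (x * x - 4) / (x - 2)           # part (a), undefined at x = 2

def g(x):
    return abs(x) / x if x != 0 else 0.0   # part (b), with g(0) = 0

def h(x):
    return 1 / (x - 3)                     # part (c), undefined at x = 3

delta = 1e-6  # probe distance (arbitrary small value)
probes = {}
for name, fn, c in [("f", f, 2), ("g", g, 0), ("h", h, 3)]:
    probes[name] = (fn(c - delta), fn(c + delta))
    left, right = probes[name]
    print(f"{name}: left probe = {left:.6g}   right probe = {right:.6g}")
```

The probes for `f` agree near 4, the probes for `g` sit at −1 and +1, and the probes for `h` are huge with opposite signs, matching the removable, jump, and infinite verdicts.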

Python: Detecting Discontinuities Numerically

We now encode the decision table as a short Python function. The strategy is exactly the one we used by hand — probe the function at $c \pm h$ , inspect $f(c)$ , and branch on the outcomes. No symbolic algebra needed.

Plain Python: numerical classifier for removable, jump, and infinite discontinuities

🐍classify_discontinuity.py


import math

We import the math module, Python's standard scalar-math library (sin, cos, exp, log, sqrt, etc.). The classifier itself does not need it, but the examples below use scalar arithmetic that could easily call math.* functions; importing it up front keeps the file ready to extend.

EXECUTION STATE

math = Python standard-library module for scalar math (sin, cos, exp, log, sqrt). It ships with the interpreter and needs no extra dependencies.

Step 1: probing one-sided limits (comment)

We split the whole classifier into three conceptual steps: probe, decide, apply. This first step tries to approximate the two one-sided limits lim x→c− f(x) and lim x→c+ f(x) just by evaluating f at c−h and c+h for a tiny h.

def one_sided_limits(f, c, h=1e-5) → (L, R)

Defines a helper that numerically probes the two one-sided limits. It returns a pair (L, R). If the function raises (e.g. division by zero at x = c), we record None for that side.

EXECUTION STATE

⬇ input: f = Any single-variable function f: ℝ → ℝ. Could be a closed-form like (x²−1)/(x−1), a piecewise rule, or even a lookup table.

⬇ input: c = The point where we suspect a discontinuity. For (x²−1)/(x−1) we pass c = 1 because the denominator vanishes there.

⬇ input: h = 1e-5 = Step size for the probe — how close we sample on each side of c. Default 0.00001 is small enough to resolve jumps and removable holes, but large enough to avoid floating-point noise.

→ why h = 1e-5? = Too big (say 0.1) and we miss fine structure; too small (say 1e-15) and c ± h is barely distinguishable from c in 64-bit floats, so rounding error swamps the probe. 1e-5 is a reasonable compromise for well-scaled functions in double precision.

⬆ returns = A pair (L, R) of floats or Nones. L ≈ f(c − h), R ≈ f(c + h). None means f is undefined on that side.

Docstring: what the helper does

Short sentence describing intent. Picked up by help() and IDE tooltips so callers see why the function exists without reading the body.

try: attempt the left probe

Wraps the left-side evaluation in try/except. If f raises (ValueError, ZeroDivisionError, …) at c − h we do not want the whole classifier to crash.

L = f(c - h)

Evaluate f just to the left of c. For f(x) = (x²−1)/(x−1) at c = 1 this is f(0.99999) ≈ 1.99999 — almost 2, which is the limit.

EXECUTION STATE

c - h = Left probe point. Example: 1 − 1e-5 = 0.99999.

L (for (x²−1)/(x−1) at c=1) = 1.999990 — extremely close to the true limit 2.

except Exception:

Catch any exception the user's function might raise. We deliberately catch the broad Exception base class because classifiers need to be robust to arbitrary user-supplied rules.

L = None

If the left probe failed, record None. Downstream code treats None as 'no observation on this side', distinct from a finite or infinite value.

EXECUTION STATE

L = None — sentinel meaning the left side could not be evaluated.

try: attempt the right probe

Same pattern as the left side — a try wrapper around the right probe.

R = f(c + h)

Evaluate f just to the right of c. For (x²−1)/(x−1) at c = 1 this is f(1.00001) ≈ 2.00001.

EXECUTION STATE

c + h = Right probe point. Example: 1 + 1e-5 = 1.00001.

R (for (x²−1)/(x−1) at c=1) = 2.000010 — again essentially 2.

except Exception:

Catch any failure on the right side so the classifier keeps going.

R = None

Flag the right-side observation as missing.

EXECUTION STATE

R = None — right side could not be evaluated.

return L, R

Hand the pair of probe values back to the caller. Python packs them into a tuple so the classifier can unpack with L, R = one_sided_limits(...).

EXECUTION STATE

⬆ return: (L, R) = Example for (x²−1)/(x−1) at c = 1: (1.999990, 2.000010). Example for step at c = 0: (0.0, 1.0). Example for 1/x² at c = 0: (1e10, 1e10).

Step 2: classify (comment)

Now that we have L and R we can make a decision: continuous, removable, jump, or infinite. The logic follows the three-condition test for continuity, but applied numerically.

def classify(f, c, tol=1e-4, blowup=1e6) → str

The main classifier. Given a function f and a point c, returns a short string describing what happens at c.

EXECUTION STATE

⬇ input: f = Same f as before — the function under test.

⬇ input: c = The suspected trouble spot.

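⬇ input: tol = 1e-4 = Tolerance for deciding whether two values are 'equal'. If |L − R| > tol we call it a jump. Because the probes sit 2h apart, even a continuous function gives |L − R| ≈ 2h·|f′(c)|, so tol must stay above that: with h = 1e-5, values between about 1e-4 and 1e-3 work for functions whose slope near c is of order 1.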

⬇ input: blowup = 1e6 = Threshold above which we declare a limit infinite. 1e6 is comfortably large for ordinary functions — 1/x² at x = 1e-5 is already 1e10, so it trips this test easily.

⬆ returns = A human-readable classification string: 'continuous', 'jump (gap = …)', 'infinite discontinuity', 'removable (…)', or 'undefined near c'.

Docstring

Documents that the function decides among four possibilities (continuous, removable, jump, infinite). A fifth string, 'undefined near c', is returned when a probe fails.

L, R = one_sided_limits(f, c)

Call the probe we defined above, then unpack the tuple into two named variables. L ≈ lim x→c− f(x); R ≈ lim x→c+ f(x).

EXECUTION STATE

L = Numerical approximation of the left limit. E.g. for step at c = 0: L = 0.0.

R = Numerical approximation of the right limit. E.g. for step at c = 0: R = 1.0.

if L is None or R is None:

If either probe failed we cannot say anything sensible — the function is not even defined on some side of c.

return "undefined near c"

Early-exit with a clear message. This case catches things like f(x) = √x at c = −0.1 where the function is not real-valued.

if abs(L) > blowup or abs(R) > blowup:

Check whether either probe has blown up past the blowup threshold. For 1/x² at h = 1e-5 we get 1/1e-10 = 1e10, which is far larger than blowup = 1e6.

EXECUTION STATE

📚 abs() = Python built-in: returns the absolute value of a number. abs(-3) = 3, abs(1e10) = 1e10. For complex numbers it returns the modulus.

Example trip = 1/x² at c = 0, h = 1e-5 → L ≈ R ≈ 1e10 > 1e6, condition fires.

return "infinite discontinuity"

This is the Type 3 outcome — a vertical asymptote. We could refine further into +∞ / −∞ / ±∞ by looking at signs of L and R, but for classification purposes 'infinite' is enough.

if abs(L - R) > tol:

Both sides are finite, but do they agree? If the numerical gap between L and R exceeds tol we have a jump.

EXECUTION STATE

Example = Step: abs(0.0 − 1.0) = 1.0, which is ≫ 1e-4 → classified as jump.

return f"jump (gap = {R - L:.3f})"

Return a formatted string reporting the signed gap. For the step: 'jump (gap = 1.000)'.

EXECUTION STATE

📚 f-string = Python literal string formatting. f"…{expr}…" embeds Python expressions. {R - L:.3f} evaluates the expression and formats with 3 decimal places.

try: evaluate f at c itself

At this point L ≈ R are finite — the only two continuity conditions left are (i) f(c) is defined, (ii) f(c) equals that common limit.

v = f(c)

Evaluate the function exactly at c. For (x²−1)/(x−1) at c = 1 this raises — the removable case. For x²+1 at c = 0 we get v = 1.

EXECUTION STATE

v (x²+1 at c=0) = 1.0, which matches L = R, so the test falls through to the final return and yields 'continuous'.

v ((x²−1)/(x−1) at c=1) = Raises ValueError because our definition of f_hole raises whenever x == 1. Execution jumps to the except block below.

except Exception:

f is undefined at c. One-sided limits both exist and agree, so the discontinuity is removable — we just need to define f(c) to equal the common limit.

return f"removable (limit = ..., f(c) undefined)"

Report the removable case along with the common limit. For (x²−1)/(x−1) we get 'removable (limit = 2.000, f(c) undefined)'.

EXECUTION STATE

(L + R) / 2 = Averaging L and R cancels tiny numerical noise. For (x²−1)/(x−1) at c = 1: (1.999990 + 2.000010) / 2 = 2.000000.

if abs(v - (L + R) / 2) > tol:

f(c) exists, but is it equal to the shared limit? If not, we still have a removable discontinuity — just one where someone has placed an out-of-line dot.

return "removable (value mismatch)"

Classic 'floating dot' example: f(x) = x for x ≠ 0, f(0) = 5. Limits agree at 0 but the value does not. Redefining f(0) to 0 makes it continuous.

return "continuous"

All three continuity tests passed: f(c) exists, the two-sided limit exists, and they are equal. Nothing broken here.

EXECUTION STATE

⬆ return = 'continuous' — clean bill of health at c.

Step 3: apply to the four canonical examples (comment)

The rest of the script defines four test functions (a removable hole, a jump, an infinite blow-up, and a continuous control) and runs classify() on each. The essential (oscillating) type is not covered by this classifier.

def f_hole(x):

Removable case: (x² − 1) / (x − 1). Factoring the numerator: (x − 1)(x + 1) / (x − 1) = x + 1 whenever x ≠ 1. So the limit at 1 is 2, but the literal formula is undefined there.

EXECUTION STATE

⬇ input: x = A real number. The function raises exactly at x = 1; elsewhere it returns x + 1.

⬆ returns = A float equal to (x² − 1)/(x − 1) for x ≠ 1. Example: f_hole(0.9) = 1.9, f_hole(1.1) = 2.1.

if x == 1: raise ValueError(...)

We raise explicitly at x = 1. Without the guard, Python would evaluate 0/0 and raise ZeroDivisionError anyway; either way the classify() helper catches the exception and routes to the removable branch.

return (x * x - 1) / (x - 1)

Algebraically this equals x + 1, so as x → 1 the output approaches 2. Verify: f_hole(0.999) = (0.998001 − 1)/(0.999 − 1) = −0.001999 / −0.001 = 1.999.

EXECUTION STATE

Sample values =

f_hole(0.9)  = 1.9000
f_hole(0.99) = 1.9900
f_hole(0.999)= 1.9990
f_hole(1.001)= 2.0010
f_hole(1.01) = 2.0100
f_hole(1.1)  = 2.1000

def f_jump(x):

Jump case: Heaviside step. Returns 0 for negative x and 1 for non-negative x. Finite on both sides but the two sides disagree.

EXECUTION STATE

⬇ input: x = Real number. Output flips from 0 to 1 exactly at x = 0.

⬆ returns = 0.0 if x < 0 else 1.0.

return 0.0 if x < 0 else 1.0

Python ternary. Exactly encodes the Heaviside step. L = 0 from the left, R = 1 from the right — classic finite-but-unequal jump.

def f_inf(x):

Infinite case: 1/x². Blows up symmetrically to +∞ as x → 0 from either side.

EXECUTION STATE

⬇ input: x = Real number, ≠ 0. The function raises at x = 0 and returns 1/x² everywhere else.

⬆ returns = 1.0 / (x * x). Examples: f_inf(0.1) = 100, f_inf(0.01) = 10 000, f_inf(0.001) = 1 000 000.

if x == 0: raise ValueError(...)

Guard against division by zero. Not strictly necessary — Python would naturally raise ZeroDivisionError — but explicit is friendlier.

return 1.0 / (x * x)

Always positive, since x * x > 0 for x ≠ 0. Blows up symmetrically: at h = 1e-5 both probes already yield about 1e10, which triggers the 'blowup' branch of classify.

def f_good(x):

Sanity-check case: a polynomial. f_good(x) = x² + 1 is continuous everywhere, so classify() should return 'continuous'.

EXECUTION STATE

⬇ input: x = Any real number.

⬆ returns = x * x + 1. f_good(0) = 1, f_good(1) = 2, f_good(-2) = 5.

return x * x + 1

A smooth parabola. Every continuity test is guaranteed to pass.

for name, f, c in [...]:

Loop over the four test cases. Each tuple is (human-readable name, function, suspected trouble point).

LOOP TRACE · 4 iterations

name="(x^2-1)/(x-1)", f=f_hole, c=1

classify result = 'removable (limit = 2.000, f(c) undefined)'

why = L = 1.999990, R = 2.000010 agree; f(1) raises → Type 1 Removable.

name="H(x) step", f=f_jump, c=0

classify result = 'jump (gap = 1.000)'

why = L = 0.0, R = 1.0; finite but unequal → Type 2 Jump.

name="1/x^2", f=f_inf, c=0

classify result = 'infinite discontinuity'

why = L = R ≈ 1e10, both exceed blowup = 1e6 → Type 3 Infinite.

name="x^2 + 1", f=f_good, c=0

classify result = 'continuous'

why = L = 1.000000, R = 1.000000, f(0) = 1 — all three continuity tests pass.

print(f"{name:<16} at c={c:>2}  ->  {classify(f, c)}")

Pretty-print each classification on its own line. The format specifiers align the columns: {name:<16} left-aligns the name in a field of width 16, and {c:>2} right-aligns c in a field of width 2.

EXECUTION STATE

📚 f-string format spec = {:<N} left-align in width N; {:>N} right-align; {:.3f} three decimals. Keeps output aligned without manual spacing.

Final printed output =

(x^2-1)/(x-1)    at c= 1  ->  removable (limit = 2.000, f(c) undefined)
H(x) step        at c= 0  ->  jump (gap = 1.000)
1/x^2            at c= 0  ->  infinite discontinuity
x^2 + 1          at c= 0  ->  continuous


import math

# --- Step 1: numerically probe left and right one-sided limits ---
def one_sided_limits(f, c, h=1e-5):
    """Evaluate f just below and just above c."""
    try:
        L = f(c - h)
    except Exception:
        L = None
    try:
        R = f(c + h)
    except Exception:
        R = None
    return L, R

# --- Step 2: classify using only the probe values ---
def classify(f, c, tol=1e-4, blowup=1e6):
    """Decide which of the 4 possibilities f(x) exhibits at c."""
    L, R = one_sided_limits(f, c)
    if L is None or R is None:
        return "undefined near c"
    if abs(L) > blowup or abs(R) > blowup:
        return "infinite discontinuity"
    if abs(L - R) > tol:
        return f"jump (gap = {R - L:.3f})"
    try:
        v = f(c)
    except Exception:
        return f"removable (limit = {(L + R)/2:.3f}, f(c) undefined)"
    if abs(v - (L + R) / 2) > tol:
        return "removable (value mismatch)"
    return "continuous"

# --- Step 3: apply to the four canonical examples ---
def f_hole(x):
    if x == 1: raise ValueError("undefined")
    return (x * x - 1) / (x - 1)

def f_jump(x):
    return 0.0 if x < 0 else 1.0

def f_inf(x):
    if x == 0: raise ValueError("undefined")
    return 1.0 / (x * x)

def f_good(x):
    return x * x + 1

for name, f, c in [("(x^2-1)/(x-1)", f_hole, 1),
                   ("H(x) step",     f_jump, 0),
                   ("1/x^2",         f_inf,  0),
                   ("x^2 + 1",       f_good, 0)]:
    print(f"{name:<16} at c={c:>2}  ->  {classify(f, c)}")

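This classifier is deliberately simple: numerical probes can be fooled by functions that oscillate too fast to catch at $h = 10^{-5}$, or by functions whose limits only show up at much smaller $h$. In particular, it has no branch for essential (oscillating) discontinuities, so a function like $\sin(1/x)$ at 0 is labelled by whatever its two probe values happen to suggest. In §3.4 we will see how the formal definition of continuity delivers guarantees that no finite probe can.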
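One partial remedy is to probe at several scales and ask whether the values settle. The sketch below is a heuristic under our own assumptions, not a guarantee: the helper names `probe_side` and `settles`, the list of scales, and the tolerance `1e-2` are all illustrative choices.

```python
import math

def probe_side(f, c, sign, scales=(1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8)):
    """Evaluate f at c + sign*h for a sequence of shrinking h."""
    return [f(c + sign * h) for h in scales]

def settles(values, tol=1e-2):
    """Heuristic: the probes 'settle' if the last three lie within tol of each other."""
    tail = values[-3:]
    return max(tail) - min(tail) < tol

def osc(x):
    return math.sin(1.0 / x)        # oscillates ever faster as x approaches 0

def hole(x):
    return (x * x - 1) / (x - 1)    # removable hole at x = 1

for name, f, c in [("sin(1/x)", osc, 0.0), ("(x^2-1)/(x-1)", hole, 1.0)]:
    left_ok = settles(probe_side(f, c, -1))
    right_ok = settles(probe_side(f, c, +1))
    print(f"{name:<14} left settles: {left_ok}   right settles: {right_ok}")
```

A side that fails to settle without blowing up is a candidate for the essential category. The check can still be fooled, for instance by a function that happens to take similar values at the three smallest scales, which is exactly the kind of gap the formal definition closes.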

PyTorch: Why Continuity Matters for Gradients

Continuity is not just an aesthetic property — it is the precondition for differentiability, which is the precondition for gradient-based training. The clearest way to see this is to compare a continuous activation (ReLU) with a discontinuous one (the step function) and watch what PyTorch's autograd engine does with each.

PyTorch — continuity is the gatekeeper for autograd

🐍continuity_vs_gradients.py


import torch

PyTorch is a tensor library with automatic differentiation. We use it here to (a) evaluate functions at tensor inputs exactly like we did in the plain-Python classifier, and (b) ask 'what gradient does this function expose to a training loop?' via y.backward().

EXECUTION STATE

torch = PyTorch library. Provides torch.Tensor (a multi-dimensional array with autograd), torch.nn (neural-net building blocks), and torch.autograd (reverse-mode automatic differentiation).

Two candidate activations (comment)

Setting up the narrative: we will compare ReLU (continuous) with the step function (discontinuous). In deep learning, continuity is not just aesthetic — it is the ticket that lets gradients flow during backprop.

def smooth_relu(x):

Defines a PyTorch-friendly ReLU. ReLU(x) = max(0, x) is continuous everywhere — L = R = 0 at x = 0 and f(0) = 0 — so it passes every continuity test. The only defect is smoothness: the left derivative is 0 while the right derivative is 1, so the point x = 0 is a 'corner'.

EXECUTION STATE

⬇ input: x = A torch.Tensor (scalar or batched). Example: torch.tensor(0.3) or torch.tensor([-1.0, 0.0, 2.5]).

⬆ returns = A tensor of the same shape with negative entries replaced by 0.0. smooth_relu(−0.1) = 0.0, smooth_relu(0.3) = 0.3.

return torch.clamp(x, min=0.0)

torch.clamp implements ReLU element-wise. The 'min=0.0' floor clips negatives to 0 but leaves positives untouched.

EXECUTION STATE

📚 torch.clamp(input, min, max) = PyTorch function: clamps each element into the interval [min, max]. Equivalent to max(min, min(max, x)). Supports autograd — gradient is 1 inside the interval, 0 outside.

⬇ arg: min = 0.0 = Lower bound. Everything below 0 is replaced by 0. We omit 'max' so there is no upper clip.

→ Example: clamp applied to a batch = torch.clamp(tensor([-1.0, 0.0, 0.3, 2.5]), min=0.0) → tensor([0.0, 0.0, 0.3, 2.5])

def step(x):

Defines the Heaviside step function on a tensor input. This is the canonical jump discontinuity and, historically, the activation of Rosenblatt's original perceptron, which later networks replaced with differentiable alternatives so that backpropagation could train them.

EXECUTION STATE

⬇ input: x = A torch.Tensor. Scalar or batch of real numbers.

⬆ returns = A float tensor of 0s and 1s: 0 where x < 0, 1 where x ≥ 0.

return (x >= 0).float()

The comparison x >= 0 produces a Bool tensor; .float() casts it to 0.0 / 1.0 floats so it can be fed into downstream numeric code.

EXECUTION STATE

📚 tensor comparison (>=) = Element-wise >= returns a torch.BoolTensor of the same shape. tensor([-0.1, 0.0, 0.3]) >= 0 → tensor([False, True, True]).

📚 .float() = Converts a tensor to torch.float32. For BoolTensor this maps False → 0.0, True → 1.0, so the result can be used in numeric code. The comparison itself is not differentiable, so the output is detached from the autograd graph even when x requires grad.

→ Example = (tensor([-0.1, 0.0, 0.3]) >= 0).float() → tensor([0.0, 1.0, 1.0])

Probe each function near x = 0 (comment)

Same idea as the plain-Python probe: evaluate each function at c − h and c + h. The only change is that h and c are tensors.

h = torch.tensor(1e-4)

Creates a scalar tensor holding the probe distance 0.0001. Using a tensor (not a raw Python float) makes sure the arithmetic c ± h is performed in tensor land.

EXECUTION STATE

📚 torch.tensor(data) = PyTorch factory function: builds a new tensor from Python numbers, lists, or numpy arrays. Picks a reasonable dtype unless overridden.

h = torch.tensor(0.0001) — dtype float32, shape ().

c = torch.tensor(0.0)

The candidate discontinuity point. Both candidate functions are suspicious at x = 0, so we centre the probe there.

EXECUTION STATE

c = torch.tensor(0.0) — scalar float32, shape ().

for name, f in [('smooth_relu', smooth_relu), ('step', step)]:

Iterate over the two candidate functions. Each iteration prints L, R, f(0) and a verdict.

LOOP TRACE · 2 iterations

name="smooth_relu", f=smooth_relu

f(c - h) = smooth_relu(−1e-4) → 0.0000

f(c + h) = smooth_relu( 1e-4) → 0.0001

f(c) = smooth_relu(0) → 0.0000

gap = |0.0001 − 0.0000| = 0.0001 (< 1e-3)

verdict = continuous ✓

name="step", f=step

f(c - h) = step(−1e-4) → 0.0000

f(c + h) = step( 1e-4) → 1.0000

f(c) = step(0) → 1.0000

gap = |1.0000 − 0.0000| = 1.0000 (> 1e-3)

verdict = discontinuous ✗

left = f(c - h).item()

Evaluate f at c − h, then call .item() to pull the single scalar out of the 0-d tensor as a plain Python float for formatting.

EXECUTION STATE

📚 .item() = Tensor method: returns the single number inside a 0-d tensor as a Python float. Raises if the tensor has more than one element.

→ Example = torch.tensor(0.3).item() → 0.3 (Python float, not tensor).

right = f(c + h).item()

Same but on the right probe. For step this is 1.0000, for ReLU 0.0001.

fc = f(c).item()

Value at c itself. ReLU(0) = 0. step(0) = 1 because our rule uses the non-strict comparison x >= 0.

gap = abs(right - left)

Size of the jump from left probe to right probe. If this is small the function behaves continuously; if it is ≳ 1 there is a visible discontinuity.

EXECUTION STATE

smooth_relu gap = |0.0001 − 0.0000| = 0.0001

step gap = |1.0000 − 0.0000| = 1.0000 — full unit jump!

verdict = "continuous" if gap < 1e-3 else "discontinuous"

Ternary that labels the result. The threshold 1e-3 sits well above the ~1e-4 change that a slope-1 function shows across the probes, but far below the ~1.0 gap of a true jump.

print(f"{name:<12} L={left:.4f} …")

Pretty-print the results. Runs once per function, producing the output shown below.

EXECUTION STATE

Final printed output =

smooth_relu  L=0.0000  R=0.0001  f(0)=0.0000  gap=0.0001  continuous
step         L=0.0000  R=1.0000  f(0)=1.0000  gap=1.0000  discontinuous

Autograd: what gradient does each function deliver at x = 0.3? (comment)

Now for the payoff. Continuity is the gatekeeper for differentiability, which in turn is the gatekeeper for gradient-based learning. Let us ask each function for a gradient and see what happens.

for name, f in [...]:

Another pass over the same two functions, this time to run a miniature autograd experiment at x = 0.3 (well away from the trouble point so we are testing smoothness, not the kink itself).

LOOP TRACE · 2 iterations

name="smooth_relu", f=smooth_relu

x = tensor(0.3, requires_grad=True)

y = f(x) = tensor(0.3) — clamp passes positives through

x.grad after backward = 1.0000

why = Derivative of clamp(x, 0) for x > 0 is 1 — the gradient flows cleanly.

name="step", f=step

x = tensor(0.3, requires_grad=True)

y = f(x) = tensor(1.0), produced via a Bool comparison that is not differentiable, so y.requires_grad is False

x.grad after backward = None: y is detached from the graph, so calling y.backward() on it raises a RuntimeError

why = Mathematically the step function is constant (= 1) for all x > 0, so dy/dx = 0 there. Either way, a training loop using step gets no usable gradient: no signal, no learning.

x = torch.tensor(0.3, requires_grad=True)

Create a leaf tensor that autograd will track. requires_grad=True tells PyTorch: remember every operation applied to x so we can later call .backward() and accumulate dy/dx into x.grad.

EXECUTION STATE

⬇ arg: requires_grad = True = Enables gradient recording. Without this flag, PyTorch does not build a computation graph and .backward() fails with 'element 0 of tensors does not require grad'.

x = tensor(0.3000, requires_grad=True). x.is_leaf == True. x.grad == None initially.

y = f(x)

Forward pass. For smooth_relu this is torch.clamp(x, min=0.0) → tensor(0.3). For step this is (x >= 0).float() → tensor(1.0).

y.backward()

Reverse-mode autodiff. PyTorch walks the computation graph from y to each leaf, computing dy/dx for each leaf tensor and storing the result in leaf.grad.

EXECUTION STATE

📚 Tensor.backward() = Triggers reverse-mode differentiation. If y is a scalar, equivalent to torch.autograd.grad(y, x)[0] but with results accumulated into x.grad. If y is non-scalar you must pass an explicit gradient argument.

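After call: x.grad = smooth_relu: tensor(1.0), since the derivative of clamp is 1 for positive inputs. step: 0. The derivative of a piecewise-constant output is 0, but autograd cannot compute it here: y has no grad_fn, so an unguarded y.backward() raises a RuntimeError instead of writing to x.grad.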

print(f"{name:<12} dy/dx at x=0.3 = {x.grad.item():.4f}")

Report the gradient. This is the moment the lesson lands: the continuous function returns a usable gradient (1.0) while the discontinuous one returns zero (or raises).

EXECUTION STATE

Final printed output =

smooth_relu  dy/dx at x=0.3 = 1.0000
step         dy/dx at x=0.3 = 0.0000

Takeaway = A jump discontinuity leaves gradient descent with nothing to follow: the derivative is zero wherever it exists and undefined at the jump itself. That is why standard neural nets use continuous activations such as ReLU, GELU, and tanh rather than raw step functions.


import torch

# Two candidate "activations" from deep learning:
#   ReLU:  continuous everywhere, smooth except at 0.
#   Step:  jump discontinuity at 0; gradient is 0 almost everywhere.
def smooth_relu(x):
    """ReLU = max(0, x). Continuous; kink at x = 0."""
    return torch.clamp(x, min=0.0)

def step(x):
    """Heaviside step. Equals 0 for x < 0, 1 for x >= 0."""
    return (x >= 0).float()

# --- Probe each function just below and just above x = 0 ---
h = torch.tensor(1e-4)
c = torch.tensor(0.0)

for name, f in [("smooth_relu", smooth_relu), ("step", step)]:
    left  = f(c - h).item()
    right = f(c + h).item()
    fc    = f(c).item()
    gap   = abs(right - left)
    verdict = "continuous" if gap < 1e-3 else "discontinuous"
    print(f"{name:<12} L={left:.4f}  R={right:.4f}  f(0)={fc:.4f}  gap={gap:.4f}  {verdict}")

# --- Autograd: what gradient does each function deliver at x = 0.3? ---
for name, f in [("smooth_relu", smooth_relu), ("step", step)]:
    x = torch.tensor(0.3, requires_grad=True)
    y = f(x)
    if y.requires_grad:
        y.backward()
    else:
        # (x >= 0) returns a Bool tensor with no grad_fn, so y.backward()
        # would raise a RuntimeError. Record the mathematical derivative, 0.
        x.grad = torch.zeros_like(x)
    print(f"{name:<12} dy/dx at x=0.3 = {x.grad.item():.4f}")

The connecting thread. A discontinuous activation gives you either zero gradient (learning stalls) or an undefined gradient (training crashes). Every modern neural net therefore uses continuous activations: ReLU, LeakyReLU, GELU, SiLU, tanh. Even when we want a hard on/off decision (binary classification), we implement it as a continuous logistic sigmoid and only threshold at inference time — so the gradient path is never broken.
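The sigmoid-then-threshold pattern can be sketched in a few lines. This is a minimal illustration in plain Python (so it runs without PyTorch), assuming a hypothetical one-weight model and a single training example; the names sigmoid_grad and hard_decision, the starting weight, and the learning rate are illustrative choices, not part of the walkthrough above.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: continuous and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative s(z) * (1 - s(z)); strictly positive for finite z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def hard_decision(z, threshold=0.5):
    """Inference-time rule only: threshold the sigmoid probability."""
    return 1 if sigmoid(z) >= threshold else 0

# Toy setup (assumed for illustration): one weight, one example with label 1.
w, x, target, lr = -2.0, 1.0, 1.0, 1.0
print("decision before training:", hard_decision(w * x))
print("sigmoid slope at the start:", sigmoid_grad(w * x))

for _ in range(50):
    p = sigmoid(w * x)
    # Gradient of binary cross-entropy w.r.t. w for a sigmoid output: (p - target) * x
    w -= lr * (p - target) * x

print("decision after training:", hard_decision(w * x))
```

Training only ever touches the smooth sigmoid, so every update has a nonzero slope to follow. The threshold appears only in hard_decision, after learning is done.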

Where Each Type Shows Up in the Real World

Domain	Type that appears	Why
Rational functions & symbolic simplification	Removable	Common factors in numerator and denominator produce holes; canceling them is a removable repair.
Digital signal processing	Jump	Sampled / quantized signals and on/off controllers produce Heaviside-like edges.
Classical physics (point sources)	Infinite	Coulomb force 1/r², gravitational potential 1/r, line charges: all blow up at the source location.
Thermodynamics phase transitions	Jump (first-order) / Essential (higher-order)	Density / enthalpy jumps at first-order phase transitions; derivatives of free energy misbehave at critical points.
Economics & finance	Jump	Tax brackets, dividend ex-dates, piecewise-defined payoff functions.
Machine learning	Jump (to be avoided)	Gradient descent needs a continuous loss landscape; step activations were among the first components to be abandoned for that reason.
Numerical ODE solvers	All four — catastrophically	Adaptive step controllers can get stuck at a jump or asymptote; detecting discontinuities is a whole subfield (event detection).

Common Pitfalls

Calling a removable discontinuity "smooth". Without the repair, the function is not continuous at $c$ . Canceling a common factor gives a new formula that agrees with the original everywhere except at $c$ ; the original function is still not continuous at $c$ until $f(c)$ is explicitly set to the limit.
Confusing corners with jumps. $|x|$ has a corner at $x = 0$ — the function is continuous there, but not differentiable. A corner is a failure of smoothness, not of continuity.
Believing numerical probes always find discontinuities. Functions like $\sin(1/x)$ oscillate infinitely often in every neighborhood of $c = 0$ . A probe at $h = 10^{-5}$ may return a value near 0 purely by coincidence, even though the function is swinging between $-1$ and $1$ all around it.
Treating "f(c) defined" as enough. The Heaviside step has $H(0) = 1$ perfectly defined, yet $H$ is still discontinuous at 0 because the two-sided limit does not exist.
Assuming "infinity from both sides" is continuous. Even when $1/x^2$ has left and right limits that both equal $+\infty$ , the function is not continuous — the value $+\infty$ is not a real number; continuity requires a finite matching value.
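The probe pitfall for $\sin(1/x)$ is easy to reproduce. The sketch below is a hedged illustration in plain Python: the probe offsets are chosen deliberately (an assumption made for demonstration) so that one set lands on zeros of $\sin(1/x)$ and the other on its peaks, even though both sets shrink toward $c = 0$ .

```python
import math

def f(x):
    """sin(1/x): essential (oscillating) discontinuity at x = 0."""
    return math.sin(1.0 / x)

# Offsets h = 1/(k*pi) land on zeros of sin(1/x).
zero_probes = [1.0 / (k * math.pi) for k in (10, 100, 1000)]
# Offsets h = 1/((2k + 1/2)*pi) land on peaks, where sin(1/x) = 1.
peak_probes = [1.0 / ((2 * k + 0.5) * math.pi) for k in (10, 100, 1000)]

print("probes that happen to hit zeros:")
for h in zero_probes:
    print(f"  h={h:.3e}  f(h)={f(h):+.6f}")

print("probes that happen to hit peaks:")
for h in peak_probes:
    print(f"  h={h:.3e}  f(h)={f(h):+.6f}")
```

Both probe sequences approach 0, yet one suggests a limit of 0 and the other a limit of 1. No single probe value can be trusted here, which is why the oscillating case needs an argument rather than a measurement.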

Summary

Discontinuities come in four flavours (removable, jump, infinite, and essential/oscillating), determined by what happens to the two one-sided limits at $c$ and how they compare with $f(c)$ .
Removable holes can be repaired by a single-point redefinition. No other type can.
The classification table (§3.3, "Family Tree") is a complete decision procedure — compute L, R, and $f(c)$ , then read off the type.
Numerically, a simple two-probe algorithm with appropriate tolerances classifies the common cases; pathological oscillations require the formal $\varepsilon$ - $\delta$ machinery of §2.5.
In machine learning, discontinuities must be avoided in activations and losses because they either zero out or undefine the gradient needed by backprop.
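As a recap of the numerical side, here is a minimal sketch of a two-probe classifier in plain Python. The function name classify and the thresholds h, tol, and big are illustrative assumptions, not values fixed by this section, and the heuristic is only meant for the common cases; oscillating functions such as $\sin(1/x)$ can be mislabeled.

```python
def classify(f, c, h=1e-6, tol=1e-3, big=1e6):
    """Heuristic two-probe classifier; h, tol, and big are illustrative thresholds."""
    left, right = f(c - h), f(c + h)
    # A huge one-sided value is treated as evidence of a vertical asymptote.
    if abs(left) > big or abs(right) > big:
        return "infinite"
    # Finite one-sided values that disagree indicate a jump.
    if abs(right - left) > tol:
        return "jump"
    # One-sided values agree, so a (numerical) limit exists.
    try:
        fc = f(c)
    except (ZeroDivisionError, ValueError):
        return "removable"  # limit exists but f(c) is undefined
    limit = 0.5 * (left + right)
    return "continuous" if abs(fc - limit) <= tol else "removable"

examples = {
    "hole (x^2-1)/(x-1) at 1": (lambda x: (x * x - 1.0) / (x - 1.0), 1.0),
    "step at 0": (lambda x: 1.0 if x >= 0 else 0.0, 0.0),
    "1/x^2 at 0": (lambda x: 1.0 / x**2, 0.0),
    "|x| at 0": (abs, 0.0),
}
for label, (func, point) in examples.items():
    print(f"{label:<26} {classify(func, point)}")
```

Note that $|x|$ comes back as continuous: the corner at 0 is a failure of differentiability, which this test does not examine.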

Next, in §3.4, we turn from naming failures to celebrating successes: the algebraic properties that guarantee large families of functions — sums, products, compositions of continuous pieces — are continuous by construction.