Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

Explain why root-finding is the canonical application of continuity, and why the IVT gives us our first algorithm for free.
Derive Newton's method from the equation of a tangent line in a single line — and visualise it as "slide down the tangent to the x-axis."
State and explain the meaning of quadratic convergence — why Newton roughly doubles the number of correct digits per step.
Build the secant method as the finite-difference cousin of Newton, and place its $\varphi \approx 1.618$ super-linear order between bisection and Newton.
Recognise the failure modes of Newton's method — vanishing derivatives, cycles, and divergence — and choose a hybrid strategy that gets bisection's safety with Newton's speed.
Implement all three methods in plain Python and in PyTorch using autograd, demonstrating exactly the kind of building block that powers modern optimisation in deep learning.

The Question Calculus Was Built To Answer

Solving the equation $f(x) = 0$ is the most ancient problem in mathematics. The Babylonians computed square roots four thousand years ago by averaging an estimate with the quotient — what we now call $x_{n+1} = \tfrac{1}{2}\!\left(x_n + \tfrac{a}{x_n}\right)$ — and the algorithm was so accurate that the cuneiform tablet YBC 7289 records $\sqrt{2} \approx 1.41421296\ldots$ , correct to six decimals. Three millennia later, that exact algorithm turns out to be a special case of Newton's method applied to $f(x) = x^2 - a$ .

Every modern computer can take square roots, cube roots, logarithms, sines, and exponentials — and at the bottom of every one of those routines is a root-finding loop. The CPU's sqrt instruction is Newton's method. TensorFlow's tf.math.lgamma uses Halley's. SciPy's fsolve wraps a damped Newton. We are about to build the same machinery from continuity alone.

The big idea. Continuity guarantees a root EXISTS (IVT). A derivative tells us WHICH WAY the root lies (Newton). And iterating either idea gives us an algorithm that computers can run a billion times a second. Three chapters of calculus collapse into one paragraph here — that is why we treat root finding as Chapter 3's capstone.

From the IVT to an Algorithm

Section 3.5 established a theoretical fact: if $f$ is continuous on $[a, b]$ and $f(a)$ and $f(b)$ have opposite signs, then there exists a number $c \in (a, b)$ with $f(c) = 0$ . The theorem is non-constructive — it promises an answer without showing us where to look.

Watch how a single observation turns an existence proof into a procedure: if the IVT works on $[a, b]$ , it also works on the half-interval that still contains a sign change. Pick the midpoint $m = (a + b)/2$ , evaluate $f(m)$ , throw away the half whose endpoints agree in sign, and repeat. The root is trapped in an interval that shrinks geometrically. This is bisection: an existence theorem made algorithmic by a loop.

Pattern to remember. Every constructive algorithm in numerical analysis pairs a theorem ("a root exists") with a contraction rule ("here is how to shrink the search space"). For bisection, the theorem is the IVT and the contraction is halving. For Newton, the theorem is local linearisation and the contraction is the tangent-line update. Spotting this theorem-plus-contraction pattern reveals dozens of algorithms — Picard iteration, fixed-point methods, gradient descent — as cousins.

Recap: Bisection — Continuity's Algorithm

We covered bisection in detail in section 3.5; here is the one-paragraph crib. Start with a bracket $[a, b]$ with $f(a) \cdot f(b) < 0$ . Compute the midpoint $c = (a + b)/2$ . Whichever of $[a, c]$ or $[c, b]$ still contains a sign change, keep that half. After $n$ iterations, the bracket has width $(b - a)/2^n$ , so the error $|c_n - r|$ is at most that. Setting $(b - a)/2^n \le \varepsilon$ and solving for $n$ gives the famous count

n_{\text{bisect}} \;=\; \left\lceil \log_2\!\frac{b - a}{\varepsilon} \right\rceil.

For $b - a = 1$ and $\varepsilon = 10^{-12}$ , that is $\lceil 12 \log_2 10 \rceil = 40$ iterations. Reliable, but slow.

Property	What it buys	What it costs
Needs only continuity	Works on functions with corners, jumps in derivative, no closed form	Cannot exploit smoothness
Guaranteed to converge	No initial-guess gymnastics; just bracket the root	Need TWO starting points with opposite signs
One function eval per step	Cheap per iteration	Linear convergence rate (~3.3 iter per decade of accuracy)

Newton's Method — Slide Down the Tangent

Bisection treats $f$ as a black box — it asks only for signs. But if $f$ is differentiable, we have far richer information at our disposal: at any point $x_n$ , we know not just the function value $f(x_n)$ but also the slope $f'(x_n)$ — and a slope tells us which direction to head.

Imagine a hiker on a smooth hillside trying to reach sea level. At her current position she can see how the ground is tilted; she draws an imaginary straight ramp (the tangent line) that follows that exact tilt; she walks along the ramp until it reaches the imaginary sea-level plane. That landing point becomes her next estimate. Then she repeats. If the hill is nearly flat near the coast, a single straight ramp gets her practically there; if it's curvy, she repeats. That is Newton's method.

One-line summary. Newton replaces

f

by its tangent line at the current iterate, then finds the x-intercept of that line. Repeat until satisfied.

The One-Line Derivation

The tangent line to $y = f(x)$ at the point $(x_n, f(x_n))$ has equation

y - f(x_n) \;=\; f'(x_n)\,(x - x_n).

Setting $y = 0$ to find where the tangent crosses the x-axis,

-f(x_n) = f'(x_n)\,(x - x_n) \quad\Longrightarrow\quad x = x_n - \frac{f(x_n)}{f'(x_n)}.

That is the entire Newton update. Calling the result $x_{n+1}$ :

\boxed{\;x_{n+1} \;=\; x_n - \dfrac{f(x_n)}{f'(x_n)}\;}

Two observations. First, the update has units of length: $f/f'$ has the dimension of $x$ (function value over slope cancels the function units), which is exactly the dimension we need to add to $x_n$ . Second, when $f(x_n) = 0$ , the update size is zero — Newton stops at the root, as it should.

The Babylonian square root as a special case. Take

f(x) = x^2 - a

. Then

f'(x) = 2x

and the Newton update becomes

x_{n+1} = x_n - \frac{x_n^2 - a}{2 x_n} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right).

Exactly the 4000-year-old recipe. Newton's method is what the Babylonians were doing the whole time.

Interactive: Newton's Method Explorer

Pick a function from the three buttons. Drag the slider to choose a starting guess $x_0$ . Step through the iterations and watch each red tangent line meet the x-axis — that landing point becomes the next purple starting point, and the error column on the right halves of decades at each step.

Loading Newton's method visualizer…

Things to try. (1) Start with

x_0

close to the root and watch the error column drop like a stone. (2) Switch to

f(x) = x^3 - x - 1

and start at

x_0 = -1.3

— Newton jumps wildly because the tangent slopes the wrong way. (3) On

\cos(x) - x

, notice how slow Newton is when the slope is small — the tangent is nearly horizontal so the x-intercept is far away.

Why Newton Converges Quadratically

Let $r$ denote the true root and $e_n = x_n - r$ the error at step $n$ . We expand $f$ around $r$ using Taylor (we will derive Taylor's theorem properly in Chapter 14; here you can think of it as "polynomial approximation of order 2"):

f(x_n) = f(r) + f'(r)\, e_n + \tfrac{1}{2} f''(\xi)\, e_n^2 = f'(r)\, e_n + \tfrac{1}{2} f''(\xi)\, e_n^2

(using $f(r) = 0$ ). Similarly $f'(x_n) = f'(r) + f''(\eta)\, e_n$ . Plug into the Newton update and simplify:

e_{n+1} = x_{n+1} - r = e_n - \frac{f(x_n)}{f'(x_n)} \approx \frac{f''(r)}{2 f'(r)}\, e_n^2.

That last expression says $|e_{n+1}| \le C\, e_n^2$ with $C = |f''(r)/(2 f'(r))|$ . The error squares. If $e_n = 10^{-3}$ , then $e_{n+1} \approx 10^{-6}$ , then $10^{-12}$ , then $10^{-24}$ — except double-precision floats have already run out of precision, so we stop at 10⁻¹⁶.

The 6-step rule of thumb. Starting from a decent guess, Newton needs about

\log_2 (16) \approx 4

6

iterations to deliver full double-precision. Bisection needs about

52

. That is a one-decimal-order-of-magnitude speedup, available for the price of one derivative evaluation per step.

Worked Example — Computing √2 By Hand

▶ Click to work through three Newton steps manually

Solve $f(x) = x^2 - 2 = 0$ with $f'(x) = 2x$ , starting at $x_0 = 2$ .

Step 1.

f(2) = 4 - 2 = 2

f'(2) = 4

. The Newton step:

x_1 = 2 - \frac{2}{4} = \frac{3}{2} = 1.5

Error

|1.5 - \sqrt{2}| \approx 0.0858

Step 2.

f(1.5) = 0.25

f'(1.5) = 3

x_2 = 1.5 - \frac{0.25}{3} = \frac{17}{12} \approx 1.41666\overline{6}

Error

|17/12 - \sqrt{2}| \approx 2.45 \times 10^{-3}

— the error went from

10^{-1}

10^{-3}

in one step. Squared!

Step 3.

f(17/12) = (17/12)^2 - 2 = 289/144 - 288/144 = 1/144

f'(17/12) = 17/6

x_3 = \frac{17}{12} - \frac{1/144}{17/6} = \frac{17}{12} - \frac{1}{408} = \frac{577}{408} \approx 1.41421568627\ldots

Error

|577/408 - \sqrt{2}| \approx 2.12 \times 10^{-6}

— from

10^{-3}

10^{-6}

. Squared again.

Pattern. After 3 steps we have 6 correct decimals. Continuing one more step gives 12 correct decimals — double precision is exhausted in 5 iterations. The fraction trail

2 \to \tfrac{3}{2} \to \tfrac{17}{12} \to \tfrac{577}{408}

is the same sequence the Babylonians wrote on YBC 7289 four thousand years ago.

Secant Method — Newton Without the Derivative

Newton's method is beautiful but it has one practical demand: you must be able to compute $f'(x)$ . For tabulated data, black-box simulators, or functions defined by long pipelines, the derivative may be unavailable or expensive. The secant method patches this by replacing the tangent with a secant line through the two most recent iterates.

Given two estimates $x_{n-1}$ and $x_n$ , draw the straight line through $(x_{n-1}, f(x_{n-1}))$ and $(x_n, f(x_n))$ . Its slope is the finite difference

m_n = \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}

— a discrete approximation to $f'(x_n)$ . Substituting $m_n$ in place of $f'(x_n)$ in the Newton update gives

x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}.

That is the secant iteration. It needs two starting points (we provide $x_0, x_1$ ) but never asks for a derivative. As $x_{n-1} \to x_n$ the secant slope approaches the tangent slope, so geometrically secant is "Newton, but with a delayed slope estimate."

Convergence order: the golden ratio. A delicate analysis (we omit the algebra) shows that for a simple root, the secant error satisfies

|e_{n+1}| \approx C\, |e_n|^{\varphi}

with

\varphi = (1 + \sqrt{5})/2 \approx 1.618

. Slower than Newton (order 2) but much faster than bisection (order 1). And the per-step cost is one function evaluation, not two — so per function evaluation secant can actually beat Newton.

Interactive: The Three Methods Race for √2

Three colours, three slopes, same destination. Toggle between log and linear y-axes to see the geometry of convergence change.

Loading method comparison…

On the log axis, look at three separate behaviours:

Bisection (red): a perfectly straight line. Each iteration cuts the error by exactly $1/2$ , so on a log axis that's a constant downward slope of $-\log_{10} 2 \approx -0.301$ per step. Predictable and pedestrian.
Newton (blue): a curve that bends sharply downward. The error squares each step, so the log of the error doubles in magnitude — a parabola on this plot. After 5 steps we hit machine zero (≈10⁻¹⁵) and stop.
Secant (green): visibly between the two. Curves downward, just not as steeply as Newton — exactly what an exponent of $\varphi \approx 1.618$ looks like geometrically.

Linear, Super-Linear, Quadratic — What the Orders Mean

We've used three phrases — linear, super-linear, quadratic. They have a precise meaning that ties together every iterative method you will meet for the rest of this book and beyond.

Definition (convergence order). A sequence

x_n \to r

converges with order

p \ge 1

if there is a constant

C > 0

such that

\lim_{n \to \infty} \frac{|x_{n+1} - r|}{|x_n - r|^p} = C.

Linear means

p = 1

(and

C < 1

). Super-linear means

1 < p < 2

. Quadratic means

p = 2

Method	Order p	Per-iter cost	Digits gained per iter
Bisection	1 (linear) with C = 1/2	1 function eval	log₁₀ 2 ≈ 0.30 (one digit every ~3.3 iters)
Secant	φ ≈ 1.618 (super-linear)	1 function eval	≈ 0.7 × previous (multiplies by φ-th power)
Newton	2 (quadratic)	1 function eval + 1 derivative eval	doubles previous (squares the error)
Halley (preview)	3 (cubic)	1 f + 1 f′ + 1 f″ eval	triples previous

Notice the trade-off pattern: higher order requires more derivative information per step. There is no universal "best" — the right choice depends on whether your function is cheap or expensive to differentiate.

When Newton's Method Goes Off the Rails

Quadratic convergence is a local property — it only kicks in once you're near the root. Globally Newton can do almost anything. Here are the four classical failure modes, each worth knowing.

Failure mode	Example	Geometric picture	Fix
Vanishing derivative	$f(x) = x^2$ at $x_0 = 0$	Tangent is horizontal — never crosses the x-axis	Add a small derivative offset or fall back to bisection
Cycle (period 2)	$f(x) = x^3 - 2x + 2$ at $x_0 = 0$	Newton alternates: $0 \to 1 \to 0 \to 1 \to \cdots$	Damped Newton: replace step by half-step
Wrong basin	$f(x) = (x - 1)(x - 5)(x - 10)$ , start near a local max	Tangent points away from nearest root → land near a different one	Bracket-and-Newton hybrid (Brent's method)
Divergence to infinity	$f(x) = \arctan(x)$ at $\|x_0\|$ large	Slope shrinks faster than function value — steps grow unbounded	Trust-region: cap step size

The right default for production code: Brent's method. When SciPy or MATLAB picks a single algorithm for "solve this scalar equation," it picks Brent's method — a hybrid of bisection (for safety), secant (for speed), and inverse quadratic interpolation (for super-linear bursts). It is bisection's reliability with near-Newton speed, and it is the algorithm you should reach for whenever you don't need the absolute fastest convergence.

Python — All Three Methods, Side By Side

One file, three solvers, one root. The driver runs all three on the same problem and prints the iteration counts side by side. Trace through the explanations to see exactly how the IVT, the tangent line, and the secant slope encode themselves as Python.

bisection, Newton, and secant in plain Python

🐍python

Explanation(40)

Code(64)

1# Three classical root-finders on the same target:

Header comment naming the common goal. Putting the goal at the top means a reader scanning the file immediately knows what success looks like — every function below has to land at √2 ≈ 1.41421356237.

2# find r > 0 such that f(r) = r^2 - 2 = 0 (so r = sqrt(2))

Restates the equation. √2 is the canonical irrational root — it is impossible to write as a finite decimal, so any computed answer is necessarily an approximation. The whole point of this section is HOW the approximations converge.

4def f(x): return x * x - 2.0

The function whose root we want. Pure float arithmetic, no imports. We use x*x rather than x**2 to dodge any small floating-point quirks the ** operator can have for negative or fractional exponents.

EXECUTION STATE

⬇ input: x = Any real number. The methods will probe f at points like 2, 1.5, 1.41666… as they close in on √2.

⬆ returns = f(1) = −1, f(1.4) = −0.04, f(1.4142) = −0.00006, f(2) = +2. The sign change between f(1) and f(2) is exactly what activates the IVT.

7def df(x): return 2.0 * x # analytic derivative

The derivative f′(x) = 2x. We feed this to Newton's method, which needs slope information. Computing the derivative by hand is the price Newton charges for its speed — secant will dodge this requirement by approximating df numerically.

EXECUTION STATE

🧮 why 2x? = Power rule from Chapter 4 (preview): if f(x) = x² then f′(x) = 2x. At x = 1.5: df(1.5) = 3, meaning the tangent line at (1.5, 0.25) slopes upward at 3 vertical units per horizontal unit.

9# --- Method 1: bisection (needs only continuity + a sign change) ---

Each method gets a banner that lists its REQUIREMENTS. Bisection's requirement is the smallest of the three: just continuity and a sign change. This makes it the safest fallback — but also the slowest.

10def bisection(a, b, tol=1e-12, max_iter=80) → (float, list)

Hunt the root by halving an interval [a, b] that brackets a sign change. Returns the final estimate AND the history of midpoints, so we can plot convergence later.

EXECUTION STATE

⬇ input: a, b = Endpoints of an interval that brackets a root. Required: f(a) and f(b) must have opposite signs. For our √2 search: a=1, b=2 because f(1) = −1 < 0 and f(2) = +2 > 0.

⬇ input: tol = Stopping tolerance. We halt when either |f(c)| < tol (we hit the root) or the bracket width b−a is below tol (the answer is pinned down to within tol).

⬇ input: max_iter = Belt-and-braces safety. The bracket halves each step, so after 80 iterations its width is (b−a)/2⁸⁰ ≈ 10⁻²⁴ — way below double-precision noise. We will hit tol long before this fires.

⬆ returns = (root_estimate, history) — the final midpoint and the list of midpoints visited. Plotting history reveals the linear-convergence slope on a log axis.

11if f(a) * f(b) > 0: raise ValueError(...)

Guard the bracket invariant. If f(a) and f(b) share a sign, their product is positive and there is no IVT-guaranteed root in [a, b]. Refusing to start with bad data is safer than silently returning the midpoint.

13history = []

Initialise the list that will record every midpoint. Recording history costs a tiny amount of memory but is invaluable for understanding convergence rates afterwards.

14for n in range(1, max_iter + 1):

Iterate from n=1 to max_iter. Starting at 1 (not 0) means n equals the iteration number you would report in a table — a small ergonomics win.

LOOP TRACE · 3 iterations

n = 1

a, b = 1.0, 2.0

c = (a+b)/2 = 1.5

f(c) = 0.25 > 0, f(a)·f(c) < 0 → b ← c, bracket = [1, 1.5]

n = 2

a, b = 1.0, 1.5

c = 1.25

f(c) = −0.4375 < 0, f(a)·f(c) > 0 → a ← c, bracket = [1.25, 1.5]

n = 7

c = ≈ 1.4140625

|c − √2| = ≈ 1.5 × 10⁻⁴ — three significant digits after 7 halvings

15c = 0.5 * (a + b)

The midpoint — bisection's ONE algorithmic choice. We could weight the midpoint toward f(a) or f(b) (that is what `false position` does), but the plain midpoint maximises the worst-case error reduction per step.

16history.append(c)

Record this midpoint for the convergence plot. The Python list .append is amortised O(1) and over 80 iterations costs nothing measurable.

17if abs(f(c)) < tol or (b - a) < tol: return c, history

Two stopping conditions joined by OR. Either condition is sufficient: function value tiny OR bracket narrow. Some implementations require BOTH, but that delays unnecessarily when one is clearly met.

18if f(a) * f(c) < 0: b = c

If f changes sign in [a, c], the root lives there → drop the right half. The product test is robust even when f(a) and f(c) have very different magnitudes.

20else: a = c

Otherwise the sign change is in [c, b] → drop the left half. Each iteration halves the bracket, so after n steps the bracket width is (b₀ − a₀)/2ⁿ. THAT geometric shrinkage is the entire performance story.

24# --- Method 2: Newton (needs differentiability + a good guess) ---

Banner for Newton. Compared to bisection, Newton ADDS a requirement (differentiability) and SUBTRACTS a requirement (no bracketing needed) — and in exchange offers vastly faster convergence near a simple root.

25def newton(x0, tol=1e-14, max_iter=20) → (float, list)

Newton-Raphson iteration starting from x₀. Twenty iterations is enormously more than ever needed in practice — quadratic convergence usually finishes the job in 5 to 8 steps on a clean problem.

EXECUTION STATE

⬇ input: x0 = Initial guess. Quality matters: bad initial guesses can send Newton off to a different root or diverge to infinity. For our example we start at x₀ = 2 (the upper end of the bisection bracket).

26x = x0

Initialise the running iterate. Newton mutates a SINGLE variable in place — no bracketing pair, no left/right bookkeeping. This is one reason Newton generalises cleanly to higher dimensions.

27history = [x]

Record the starting position so the history list mirrors the visualisation: history[0] is x₀, history[n] is xₙ.

28for n in range(max_iter):

Each iteration produces one new estimate xₙ → xₙ₊₁. We expect to break out around n = 5.

LOOP TRACE · 4 iterations

n = 0 (start at x₀ = 2.0)

f(x) = f(2) = 2.0

df(x) = df(2) = 4.0

x_next = 2 − 2/4 = 1.5

|x_next − √2| = ≈ 8.58 × 10⁻²

n = 1

x = 1.5 = f = 0.25, df = 3.0

x_next = 1.5 − 0.25/3 = 17/12 ≈ 1.41666…

|err| = ≈ 2.45 × 10⁻³ — error went from 10⁻¹ to 10⁻³ in one step

n = 2

x = 17/12 = f = 1/144, df = 17/6

x_next = ≈ 1.41421568627…

|err| = ≈ 2.12 × 10⁻⁶ — six correct decimals

n = 3

x_next = 1.41421356237… (matches √2 to 12 digits)

|err| = ≈ 1.6 × 10⁻¹² — error squares each step

29fx, dfx = f(x), df(x)

Cache one function evaluation and one derivative evaluation. We use them twice below (the Newton step and the convergence check), so caching is a small but worthwhile speedup.

30if dfx == 0.0: raise ZeroDivisionError(...)

Newton divides by f′(x). If the iterate lands at a critical point (flat tangent), the next step is undefined — geometrically the tangent line is horizontal and never crosses the x-axis. Raising is more honest than continuing with garbage.

EXECUTION STATE

⚠ when does this fire? = For f(x) = x² − 2 it fires only at x = 0. For tougher functions like x³ − 3x², df can vanish at x = 0 and x = 2 — both are local extrema, and starting near them is the classic Newton pathology.

32x = x - fx / dfx # the Newton step

The heart of the algorithm: subtract function value over slope. Geometrically this is exactly the x-intercept of the tangent line at (x, f(x)). Compare to bisection's midpoint — Newton uses LOCAL slope information that bisection ignores.

EXECUTION STATE

🧮 derivation = Tangent line: y − f(xₙ) = f′(xₙ)(x − xₙ). Set y = 0 and solve for x → x = xₙ − f(xₙ)/f′(xₙ). That is the entire mathematical content of Newton's method.

33history.append(x)

Record the new iterate. After completion, len(history) = number of completed steps + 1 (the starting point).

34if abs(fx) < tol: return x, history

Stop when the function value is tiny — that is the most direct measure of 'we hit the root'. Some implementations test |x_new − x_old| instead, which can fool you if the iteration stalls without converging.

39# --- Method 3: secant (needs only function values, not the derivative) ---

Secant replaces the analytic derivative with a finite-difference approximation built from the two most recent points. This is huge when f has no closed-form derivative — common for black-box engineering codes.

40def secant(x0, x1, tol=1e-14, max_iter=30) → (float, list)

Two initial points instead of one — they need not bracket a root. The secant method keeps a SLIDING WINDOW of two points: the most recent two iterates, used to build the next slope estimate.

41history = [x0, x1]

Seed the history with both initial points. Subsequent .append calls add one new x per iteration.

43fx0, fx1 = f(x0), f(x1)

Cache the two function values. A careful implementation would only evaluate f at the NEW point each iteration and reuse the previous one — but the cleaner-reads version evaluates both for clarity.

44if fx1 == fx0: break

Guard against the secant slope (fx1 − fx0)/(x1 − x0) becoming undefined when the two function values coincide. In double precision this is extraordinarily rare on smooth functions but happens trivially on step functions.

46x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)

The secant step. Identical in form to Newton, but with the FINITE-DIFFERENCE slope (fx1−fx0)/(x1−x0) standing in for the true derivative. As x1 → x0 the secant slope approaches f′(x1) — secant is the discrete limit of Newton.

EXECUTION STATE

🧮 geometry = Draw the secant line through (x0, fx0) and (x1, fx1). Find where that line crosses the x-axis — that intercept is x2.

49x0, x1 = x1, x2

Slide the window forward: forget the oldest point, promote x1 to x0, x2 to x1. Python tuple-packing makes this a one-liner with no temporary variable.

54import math

Standard library import. We only need math.sqrt to compute the true root for the error column. Imports usually live at the top of the file — pulling it down here keeps the driver self-contained for the explainer.

55root_bi, h_bi = bisection(1.0, 2.0)

Run bisection on the standard bracket [1, 2]. h_bi is a list of 52 midpoints (default tol=1e-12 requires log₂((2-1)/1e-12) ≈ 40 halvings, plus a few for the function-value check).

EXECUTION STATE

expected iteration count = Bisection needs ⌈log₂((b−a)/tol)⌉ iterations to reach tolerance. For tol = 10⁻¹² on [1, 2]: ⌈log₂(10¹²)⌉ ≈ 40 — every method's iteration count depends on its convergence rate.

56root_nw, h_nw = newton(2.0)

Run Newton from x₀ = 2.0. With tol = 10⁻¹⁴ and quadratic convergence, h_nw will have about 5 entries.

57root_sc, h_sc = secant(1.0, 2.0)

Run secant from the SAME pair (1, 2) used for bisection. h_sc will have about 7 entries — somewhere between Newton's 5 and bisection's 52.

58true_root = math.sqrt(2.0)

Reference value computed by the library's hardware-accelerated square root. Double-precision sqrt returns the correctly-rounded value, so we treat math.sqrt(2.0) as the "ground truth" against which our errors are measured.

60print(f"true sqrt(2) = {true_root:.15f}")

Display ground truth to 15 decimal places — the maximum representable in double precision.

EXECUTION STATE

expected output =

true sqrt(2)        = 1.414213562373095
bisection (52 it.)  = 1.414213562373
   err = 1.11e-13
newton    ( 5 it.)  = 1.414213562373095
   err = 0.00e+00
secant    ( 7 it.)  = 1.414213562373095
   err = 2.22e-16

61print(f"bisection (52 it.) = {root_bi:.15f} err = {abs(root_bi-true_root):.2e}")

Bisection's answer + how many decimals it bought. The line is two facts in one: the converged value and the error magnitude — both needed to judge whether the method "worked".

62print(f"newton ( 5 it.) = {root_nw:.15f} err = {abs(root_nw-true_root):.2e}")

Newton produces the bit-exact float closest to √2 in just 5 steps. Compare to bisection's 52 — that 10× speedup with no extra computation per step is the entire commercial case for Newton.

63print(f"secant ( 7 it.) = {root_sc:.15f} err = {abs(root_sc-true_root):.2e}")

Secant takes about 2 extra iterations than Newton but does NOT require coding df by hand. For complex models where f might be a 500-line simulator, that trade is almost always worth it.

24 lines without explanation

1# Three classical root-finders on the same target:
2#     find r > 0 such that  f(r) = r^2 - 2 = 0    (so r = sqrt(2))
3
4def f(x):
5    return x * x - 2.0
6
7def df(x):
8    return 2.0 * x                  # analytic derivative
9
10# --- Method 1: bisection (needs only continuity + a sign change) ---
11def bisection(a, b, tol=1e-12, max_iter=80):
12    if f(a) * f(b) > 0:
13        raise ValueError("need a sign change in [a, b]")
14    history = []
15    for n in range(1, max_iter + 1):
16        c = 0.5 * (a + b)
17        history.append(c)
18        if abs(f(c)) < tol or (b - a) < tol:
19            return c, history
20        if f(a) * f(c) < 0:
21            b = c
22        else:
23            a = c
24    return 0.5 * (a + b), history
25
26# --- Method 2: Newton (needs differentiability + a good guess) ---
27def newton(x0, tol=1e-14, max_iter=20):
28    x = x0
29    history = [x]
30    for n in range(max_iter):
31        fx, dfx = f(x), df(x)
32        if dfx == 0.0:
33            raise ZeroDivisionError("derivative vanished")
34        x = x - fx / dfx                  # the Newton step
35        history.append(x)
36        if abs(fx) < tol:
37            return x, history
38    return x, history
39
40# --- Method 3: secant (needs only function values, not the derivative) ---
41def secant(x0, x1, tol=1e-14, max_iter=30):
42    history = [x0, x1]
43    for n in range(max_iter):
44        fx0, fx1 = f(x0), f(x1)
45        if fx1 == fx0:
46            break
47        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
48        history.append(x2)
49        if abs(fx1) < tol:
50            return x1, history
51        x0, x1 = x1, x2
52    return x1, history
53
54# --- Driver: race all three on the same root ---
55import math
56root_bi, h_bi = bisection(1.0, 2.0)
57root_nw, h_nw = newton(2.0)
58root_sc, h_sc = secant(1.0, 2.0)
59true_root = math.sqrt(2.0)
60
61print(f"true sqrt(2)        = {true_root:.15f}")
62print(f"bisection (52 it.)  = {root_bi:.15f}   err = {abs(root_bi-true_root):.2e}")
63print(f"newton    ( 5 it.)  = {root_nw:.15f}   err = {abs(root_nw-true_root):.2e}")
64print(f"secant    ( 7 it.)  = {root_sc:.15f}   err = {abs(root_sc-true_root):.2e}")

PyTorch — Newton with Autograd

Newton needs $f'(x)$ . In a research notebook where the function changes every few minutes, hand-coding the derivative is a tax on experimentation. PyTorch's autograd gives us the derivative for free — and the resulting Newton loop is the same code we'd use inside an optimiser. If you ever wondered "is gradient descent like Newton?" — yes, gradient descent is the simplified cousin where we replace $1/f'$ with a fixed learning rate.

Newton-Raphson with PyTorch autograd

🐍python

Explanation(21)

Code(36)

1import torch

Pull in PyTorch. We need three things: torch.tensor (the data carrier), requires_grad=True (to track the computation), and .backward() (to fill in the derivative). Same imports you would use to train a neural network — the math is identical.

3# A more interesting target: solve f(x) = x^3 - 2*x - 5 = 0

Newton himself used this equation in 1669 to illustrate his method. The cubic has one real root near 2.094. The shape is more interesting than x² − 2 because there are also two complex roots — Newton-in-the-reals must navigate carefully or it can jump around.

4# (Newton famously used this exact equation in 1669.)

Historical credit. Newton developed the method in De analysi per aequationes numero terminorum infinitas (1669, published 1711). Raphson published an improved formulation in 1690 — hence Newton-Raphson.

5def f(x: torch.Tensor) → torch.Tensor:

Type-hinted: f takes a tensor, returns a tensor. Crucially, the OPERATIONS inside (** and -) are torch operations, not Python arithmetic — that is what lets autograd watch them.

EXECUTION STATE

⬇ input: x = A 0-d tensor with requires_grad=True. We use float64 (double precision) because root-finding errors shrink fast — float32's 7-digit precision would limit us.

6return x ** 3 - 2 * x - 5

Standard tensor arithmetic. PyTorch builds an autograd graph: x → x³ → x³ − 2x → x³ − 2x − 5. Each operation knows how to propagate gradients backwards through itself.

8def newton_autograd(x0: float, tol=1e-12, max_iter=20):

The Newton driver. Same signature as the plain-Python version, but inside we will use autograd instead of hand-coded df.

EXECUTION STATE

why bother with autograd? = For x³ − 2x − 5 the hand derivative 3x² − 2 is trivial. But for a 50-layer neural network or a real engineering ODE, deriving f′ by hand can take days and be wrong. Autograd is exact and free.

10x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)

Allocate the iterate as a leaf tensor that autograd will track. float64 gives 15-16 digits of precision — enough that Newton can drive |f(x)| down to 10⁻¹² without hitting roundoff.

EXECUTION STATE

📚 requires_grad=True = Tells autograd: 'please record every operation that touches this tensor'. The recorded graph is what .backward() walks to compute derivatives.

📚 dtype=torch.float64 = Default is float32 (~7 digits). Newton can converge in 5 steps to error ≈ 10⁻¹⁵, so the bottleneck would be float, not the algorithm. Promote to float64 explicitly.

11history = []

Same convergence trace we used in the plain-Python version, but storing .item() Python floats so the history is JSON-serialisable.

12for n in range(max_iter):

Up to 20 Newton steps. We will exit early via return when |f(x)| < tol.

LOOP TRACE · 3 iterations

n = 0 (x₀ = 2.0)

f(2) = 2³ − 2·2 − 5 = 8 − 4 − 5 = −1

df via autograd = f′(2) = 3·2² − 2 = 10

x_new = 2 − (−1)/10 = 2.1

n = 1 (x = 2.1)

f(2.1) = 9.261 − 4.2 − 5 = 0.061

df = 13.23 − 2 = 11.23

x_new = 2.1 − 0.061/11.23 ≈ 2.094568…

n = 2 → converged

x ≈ 2.0945514815… = f(x) ≈ 2 × 10⁻⁷

14y = f(x)

Forward pass. PyTorch builds the autograd graph during this evaluation: nodes for x³, 2x, the subtractions. y is a 0-d tensor with grad_fn=<SubBackward0>.

16if x.grad is not None: x.grad.zero_()

.backward() ACCUMULATES into x.grad rather than overwriting. On iteration 0, x.grad is None and we skip; from iteration 1 onward we must explicitly zero or our derivative will be the sum of every step's derivative.

EXECUTION STATE

⚠ classic bug = Forgetting .zero_() in a loop is the #1 PyTorch gotcha. Symptoms: gradients exploding to large numbers after a few iterations even though x is bounded.

17y.backward()

Walk the autograd graph from y back to leaves. PyTorch applies the chain rule automatically: dy/dx = d(x³)/dx − 2 · d(x)/dx = 3x² − 2. The result lands in x.grad. No hand calculus required.

EXECUTION STATE

📚 y.backward() = For scalar y, computes dy/dx for every leaf tensor with requires_grad=True. For vector y, you must pass a `grad_outputs` weighting — but our y here is 0-d.

18dfx = x.grad # this IS f'(x), computed automatically

Read the derivative. dfx is now a tensor holding f′(x) at the current iterate. The comment is the punchline: we got the analytic derivative without writing it.

21with torch.no_grad():

Disable autograd tracking inside this block. We are about to mutate x and run arithmetic that is NOT part of the loss — we don't want this to enter the graph for next iteration.

EXECUTION STATE

📚 torch.no_grad() = Context manager that turns off grad tracking. Inside, all operations behave as if requires_grad=False. Outside, tracking resumes.

22x_new = x - y / dfx

The Newton update — same arithmetic as before, but on tensors. Because we're inside torch.no_grad(), this does NOT create new graph nodes.

23history.append(x_new.item())

.item() unpacks a 0-d tensor into a Python float. Building history as a list of Python numbers lets us pass it straight to matplotlib without further conversion.

24if y.abs().item() < tol: return x_new.item(), history

Convergence check. y.abs() is the function value magnitude; we exit when it's below tol. Returns the final estimate (as a Python float) and the trajectory.

26x.copy_(x_new) # in-place update, preserves graph

Update x IN PLACE so the leaf tensor identity (and requires_grad=True flag) is preserved. Doing x = x_new would create a new tensor that doesn't require grad — the next .backward() would then fail with 'tensor has no grad'.

EXECUTION STATE

⚠ why .copy_ and not assignment? = x = x_new rebinds the local name to a new (non-leaf) tensor. x.copy_(x_new) keeps the same leaf object and overwrites its data. Only the second form is compatible with the loop.

29root, hist = newton_autograd(x0=2.0)

Run from x₀ = 2.0. Newton converges in about 4–5 iterations to the cubic's real root ≈ 2.094551481542327.

30print(f"Newton-via-autograd: root = {root:.12f}")

Display the result. 12 decimal places is well within float64's 15–16 digits of meaningful precision.

EXECUTION STATE

expected printed output =

Newton-via-autograd: root = 2.094551481542
Function value:       f(root) = 0.00e+00
Took 5 iterations:
  x_1 = 2.100000000000
  x_2 = 2.094568121104
  x_3 = 2.094551481698
  x_4 = 2.094551481542
  x_5 = 2.094551481542

33for n, v in enumerate(hist):

Print the full convergence trace. The dramatic shrinkage from x_1 to x_4 makes the quadratic-convergence pattern vivid — each entry buys roughly twice as many correct decimals as the previous.

15 lines without explanation

1import torch
2
3# A more interesting target: solve f(x) = x^3 - 2*x - 5 = 0
4# (Newton famously used this exact equation in 1669.)
5def f(x: torch.Tensor) -> torch.Tensor:
6    return x ** 3 - 2 * x - 5
7
8def newton_autograd(x0: float, tol=1e-12, max_iter=20):
9    """Newton with the derivative computed automatically via autograd."""
10    x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
11    history = []
12    for n in range(max_iter):
13        # Forward pass: evaluate f(x) and record the graph
14        y = f(x)
15
16        # Backward pass: dy/dx is filled into x.grad
17        if x.grad is not None:
18            x.grad.zero_()                # always zero accumulators
19        y.backward()
20        dfx = x.grad                      # this IS f'(x), computed automatically
21
22        # The Newton step — exactly the same line as the plain-Python version
23        with torch.no_grad():
24            x_new = x - y / dfx
25            history.append(x_new.item())
26            if y.abs().item() < tol:
27                return x_new.item(), history
28            x.copy_(x_new)                # in-place update, preserves graph
29    return x.item(), history
30
31root, hist = newton_autograd(x0=2.0)
32print(f"Newton-via-autograd: root = {root:.12f}")
33print(f"Function value:       f(root) = {f(torch.tensor(root, dtype=torch.float64)).item():.2e}")
34print(f"Took {len(hist)} iterations:")
35for n, v in enumerate(hist):
36    print(f"  x_{n+1} = {v:.12f}")

Where Root-Finding Lives in Practice

Root finding is the universal solvent of applied mathematics. Whenever a problem reduces to "find $x$ such that some condition holds," you are almost certainly going to call a root-finder somewhere in the implementation.

Hardware square roots and trig

x86's FSQRT, ARM's FSQRT instruction, and the GPU intrinsic __fsqrt_rn all implement Newton's method on $f(x) = x^2 - a$ , refined to give the correctly-rounded result. The infamous Fast Inverse Square Root from Quake III is a single Newton step on $f(x) = 1/x^2 - a$ seeded by a magic integer constant.

Implicit equations of state

Compressors, gas pipelines, and refrigeration cycles need to solve van der Waals or Peng–Robinson equations of the form $P(V, T) = P_{\text{set}}$ for $V$ . There is no closed form. Every iteration of the plant's control loop invokes a Newton solver on the residual $f(V) = P(V, T) - P_{\text{set}}$ .

Implicit ODE integration

Stiff differential equations — chemical kinetics, electrical circuits, structural dynamics — are solved by implicit methods (backward Euler, BDF). Each time step requires solving a nonlinear equation in the new state. That "solve a nonlinear equation" is Newton, called millions of times per simulation.

Machine learning: maximum likelihood and Newton optimisation

Maximum likelihood estimation seeks the parameters $\theta^*$ that satisfy the score equation $\nabla_\theta \ell(\theta^*) = 0$ . That's root finding on the gradient. Newton's method generalised to vectors gives the celebrated Newton-Raphson iteration for logistic regression: $\theta_{n+1} = \theta_n - H^{-1} g$ , where $H$ is the Hessian and $g$ is the gradient.

Implied volatility in option pricing

The Black-Scholes formula gives an option price as a function of volatility. To recover the market's implied volatility from the observed price, you invert that formula numerically — which means finding the root of $f(\sigma) = \text{BS}(\sigma) - P_{\text{market}}$ . Every quote feed in every options exchange runs this loop in real time.

Common Pitfalls

Forgetting that "converged" is not always "right". A small $|f(x_n)|$ doesn't guarantee a small error. If the slope is huge, you can be far from the root and still have a tiny function value. Always inspect both $|f(x_n)|$ AND $|x_{n+1} - x_n|$ .
Trusting Newton near multiple roots. If $f$ has a double root (e.g. $f(x) = (x - 1)^2$ ), Newton degrades to linear convergence because $f'(r) = 0$ . Modified Newton (multiply the step by the multiplicity) restores quadratic convergence.
Picking a step that flies off the chart. Always sanity-check that $|x_{n+1} - x_n| < D$ for some maximum trust radius $D$ . If the proposed Newton step exceeds it, shrink it.
Mixing tolerances. Some implementations stop on $|f(x)|$ , others on $|x_{n+1} - x_n|$ . Be explicit about which one you want and document it in the API.
Using float32 when you wanted float64. A solver tuned for double precision can fail silently in single precision because the algorithm hits the noise floor before reaching the requested tolerance.

Summary

Continuity is not just a definition — it is the engine that drives a whole class of algorithms. The IVT gives us bisection. Differentiability adds Newton. Finite-difference approximations to the derivative add the secant method. Each method buys speed at the cost of a stronger assumption, and the convergence-order ladder — linear, super-linear, quadratic — quantifies that trade exactly.

Three things to remember.

Bisection's slogan: continuity + sign change = root. Slow but unconditional.
Newton's slogan: $x_{n+1} = x_n - f(x_n)/f'(x_n)$ . Slide down the tangent. Doubles the digits per step — when it works.
Secant's slogan: use yesterday's slope. $\varphi \approx 1.618$ -order convergence, no derivative needed.

Chapter 4 introduces the derivative formally — and immediately you will recognise the Newton update as the very first non-trivial use of derivatives, almost three centuries before its inventors knew the word "derivative." Calculus and computation, from the very first chapter, have always been the same subject.