Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section you will be able to:

See why the ordinary chain rule fails when noise has unbounded variation.
Derive the heuristic $(dW)^2 = dt$ from quadratic variation.
State and apply Itô's lemma to functions of a stochastic process.
Transform a stochastic differential equation by changing variables (log‑transform of GBM).
Simulate SDEs with Euler–Maruyama in Python and in vectorised PyTorch.

Why the Ordinary Chain Rule Breaks

In ordinary calculus, if $x(t)$ is a smooth function and $f$ is differentiable, the chain rule says

\displaystyle df = f'(x)\, dx.

That is, the change in $f$ is linear in the change in $x$ . The reason this works is buried in a Taylor expansion:

\displaystyle df = f'(x)\,dx + \tfrac{1}{2}f''(x)\,(dx)^2 + \cdots

For smooth $x(t)$ , the increment $dx$ is of order $dt$ , so $(dx)^2$ is of order $(dt)^2$ — utterly negligible as $dt \to 0$ . We drop it without guilt.

The Twist for Brownian Motion

If $X(t) = W(t)$ , a Brownian motion, then $dW$ is not of order $dt$ . It is of order $\sqrt{dt}$ .

So $(dW)^2$ is of order $dt$ — the same order as the drift term we are trying to keep. We cannot throw it away.

This single observation is the entire reason Itô's lemma exists, and the entire reason stochastic calculus is its own subject. Everything below is just careful book-keeping around that one fact.
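The $\sqrt{dt}$ scaling is easy to check numerically. The sketch below is a minimal illustration using only Python's standard library; the helper name `mean_abs_increment` and the sample sizes are illustrative choices, not part of the original text. It estimates $\mathbb{E}|dW|$ for several step sizes and compares it against both $\sqrt{dt}$ and $dt$.

```python
import math
import random

def mean_abs_increment(dt, n_samples=20_000, seed=0):
    """Monte Carlo estimate of E|dW| for a Brownian increment dW ~ N(0, dt)."""
    rng = random.Random(seed)
    sd = math.sqrt(dt)
    return sum(abs(rng.gauss(0.0, sd)) for _ in range(n_samples)) / n_samples

for dt in (1e-1, 1e-2, 1e-3, 1e-4):
    m = mean_abs_increment(dt)
    # m / sqrt(dt) stays near sqrt(2/pi) ~ 0.798, while m / dt keeps growing.
    print(f"dt={dt:.0e}  E|dW|/sqrt(dt)={m / math.sqrt(dt):.3f}  E|dW|/dt={m / dt:.1f}")
```

The first ratio stays flat as $dt$ shrinks, while the second grows like $1/\sqrt{dt}$, which is the numerical face of "$dW$ is of order $\sqrt{dt}$".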

The Secret: (dW)² = dt

Let us pin down what "of order dt" really means. Subdivide $[0, T]$ into $N$ intervals of size $\Delta t = T/N$ . The Brownian increment over each interval is $\Delta W_i = W(t_{i+1}) - W(t_i) \sim \mathcal{N}(0, \Delta t)$ .

The quadratic variation of the path is the limit

\displaystyle \langle W \rangle_T \;=\; \lim_{N\to\infty}\, \sum_{i=0}^{N-1} (\Delta W_i)^2.

Each summand is the square of a Gaussian with $\mathbb{E}[(\Delta W_i)^2] = \Delta t$, so the expectation of the sum is exactly $N \cdot \Delta t = T$. The variance of each summand is $2(\Delta t)^2$, and the increments are independent, so the variance of the whole sum is $N \cdot 2(\Delta t)^2 = 2T \cdot \Delta t \to 0$. The mean stays at $T$ and the fluctuations vanish:

\displaystyle \langle W \rangle_T = T \quad \text{almost surely.}

Heuristic shorthand: $(dW)^2 = dt$ . Plus, by similar but easier arguments, $dt \cdot dW = 0$ and $(dt)^2 = 0$ . These three rules are the entire algebra of stochastic differentials.
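The two "easier arguments" can be sketched on the same partition, under the same assumptions as above ($\Delta t = T/N$ and independent increments $\Delta W_i \sim \mathcal{N}(0, \Delta t)$). This is a heuristic outline rather than a full proof:

```latex
\begin{aligned}
\sum_{i=0}^{N-1} (\Delta t)^2 &= N\,(\Delta t)^2 = T\,\Delta t \;\to\; 0, \\
\mathbb{E}\Big[\sum_{i=0}^{N-1} \Delta t\,\Delta W_i\Big] &= 0, \qquad
\operatorname{Var}\Big[\sum_{i=0}^{N-1} \Delta t\,\Delta W_i\Big]
  = N\,(\Delta t)^2\,\Delta t = T\,(\Delta t)^2 \;\to\; 0 .
\end{aligned}
```

Both sums vanish in the limit, so neither $(dt)^2$ nor $dt\,dW$ leaves anything behind, whereas $\sum (\Delta W_i)^2$ converges to $T$.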

Compare this with a smooth function: the same sum of $(\Delta x_i)^2$ would be of order $\Delta t$ and vanish. Brownian motion is wiggly enough that its squared increments add up to something finite and deterministic. That paradox — random in every increment, deterministic in the sum of squares — is the engine of Itô's lemma.

Interactive: Quadratic Variation Explorer

Below we sample a Brownian path on $[0, 1]$ , then compute two sums over the increments: $\sum (\Delta W_i)^2$ and $\sum |\Delta W_i|$ . Move the slider to increase $N$ . Watch the green box snap to $T = 1$ while the red box blows up.

[Interactive widget: Quadratic Variation Explorer. Sample reading at N = 256 steps: Σ (ΔW)² = 1.0757 against the target T = 1.0000 as N → ∞, and Σ |ΔW| = 12.87, which diverges to ∞ as N grows because Brownian motion has unbounded variation.]

The squared sum locks onto T; the absolute sum keeps growing. That stable squared sum is the entire reason Itô's lemma needs an extra term.

The squared sum is a stable, deterministic quantity, which is exactly what we need to define $(dW)^2 = dt$. The absolute sum being infinite is why you cannot define stochastic integrals pathwise as Riemann–Stieltjes integrals. You have to build a new machinery: the Itô integral.
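The experiment can also be reproduced offline. The following is a minimal sketch using only the standard library; the function name `variation_sums` and the choice of step counts are illustrative. Each value of N draws a fresh path rather than refining one fixed path, which is enough to see the two behaviours.

```python
import math
import random

def variation_sums(n_steps, T=1.0, seed=0):
    """Return (sum of squared increments, sum of absolute increments)
    for one simulated Brownian path on [0, T] with n_steps equal steps."""
    rng = random.Random(seed)
    sd = math.sqrt(T / n_steps)          # each increment is N(0, T / n_steps)
    increments = [rng.gauss(0.0, sd) for _ in range(n_steps)]
    quad = sum(dw * dw for dw in increments)
    total = sum(abs(dw) for dw in increments)
    return quad, total

for n in (16, 256, 4096, 65536):
    quad, total = variation_sums(n)
    print(f"N={n:6d}  sum dW^2 = {quad:.4f}   sum |dW| = {total:8.2f}")
```

The squared sum settles near $T = 1$, while the absolute sum grows roughly like $\sqrt{2NT/\pi}$, its expected value for $N$ independent Gaussian increments.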

Itô's Lemma: Statement and Intuition

Suppose $X_t$ follows a stochastic differential equation

\displaystyle dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dW_t,

and let $f(x, t)$ be any function with two continuous spatial derivatives and one time derivative. Then Itô's lemma says

\displaystyle df = \left(\,\frac{\partial f}{\partial t} + \mu\,\frac{\partial f}{\partial x} + \tfrac{1}{2}\sigma^2\,\frac{\partial^2 f}{\partial x^2}\right) dt \;+\; \sigma\,\frac{\partial f}{\partial x}\, dW.

Compare this with what the chain rule from ordinary calculus would have given you:

\displaystyle df_{\text{naive}} = \left(\,\frac{\partial f}{\partial t} + \mu\,\frac{\partial f}{\partial x}\right) dt \;+\; \sigma\,\frac{\partial f}{\partial x}\, dW.

The Itô Correction

The only difference is the extra term $\tfrac{1}{2}\sigma^2 f_{xx}\, dt$ . It is called the Itô correction and it comes entirely from the $(dW)^2 = dt$ rule.

Where the correction comes from. Taylor-expand $f(X_{t+dt}, t+dt)$ in both arguments to second order:

\displaystyle df = f_t\, dt + f_x\, dX + \tfrac{1}{2} f_{xx}\,(dX)^2 + \text{higher order}.

Now substitute $dX = \mu\, dt + \sigma\, dW$ into $(dX)^2$ and apply the three multiplication rules:

\displaystyle (dX)^2 = \mu^2(dt)^2 + 2\mu\sigma\,dt\,dW + \sigma^2(dW)^2 = 0 + 0 + \sigma^2\, dt.

Plug that back in, gather $dt$ and $dW$ terms, and Itô's lemma falls out. The whole derivation is a Taylor expansion plus the bookkeeping rule $(dW)^2 = dt$ .

Analogy. Think of $dW$ as a coin flip that has zero average but bounces with size $\sqrt{dt}$ . Squaring it knocks out the sign and what survives is the average size of a single bounce — $dt$ . That residual deterministic effect is the Itô correction.
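One way to see the correction in numbers is to take $f(x) = e^x$ with $X = W$ (so $\mu = 0$ and $\sigma = 1$). Itô's lemma gives $d(e^W) = \tfrac{1}{2} e^W\, dt + e^W\, dW$, which implies $\mathbb{E}[e^{W_T}] = e^{T/2}$, whereas the naive chain rule has no drift term and would predict a mean of 1. The sketch below checks this by Monte Carlo; the function name and sample size are illustrative choices.

```python
import math
import random

def mean_exp_brownian(T=1.0, n_samples=200_000, seed=1):
    """Monte Carlo estimate of E[exp(W_T)], using W_T ~ N(0, T)."""
    rng = random.Random(seed)
    sd = math.sqrt(T)
    return sum(math.exp(rng.gauss(0.0, sd)) for _ in range(n_samples)) / n_samples

T = 1.0
estimate = mean_exp_brownian(T)
print(f"Monte Carlo E[exp(W_T)] : {estimate:.4f}")
print(f"Ito prediction exp(T/2) : {math.exp(T / 2):.4f}")
print("Naive chain-rule guess  : 1.0000")
```

The estimate should land close to $e^{1/2} \approx 1.6487$, well away from the naive value.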

Application: Geometric Brownian Motion

The most famous SDE in finance is geometric Brownian motion (GBM):

\displaystyle dS_t = \mu\, S_t\, dt + \sigma\, S_t\, dW_t.

Read it like this: in a small time $dt$, the stock price moves by a deterministic drift $\mu S\, dt$ plus a random kick $\sigma S\, dW$. Both pieces scale with the current price, which is what we want: a \$100 stock should fluctuate more (in dollar terms) than a \$1 stock.

Here $\mu$ is the expected return per unit time and $\sigma$ is the volatility. Typical values: a broad equity index has $\mu \approx 8\%/\text{yr}$ , $\sigma \approx 16\%/\text{yr}$ ; a single biotech stock can have $\sigma \approx 60\%/\text{yr}$ .

Interactive: Stock Price Simulator

Below you can play with $\mu$ and $\sigma$ and see how an ensemble of futures for the same stock evolves. The orange curve is the deterministic mean. Each blue line is one of many possible paths the world could take.

[Interactive widget: Geometric Brownian Motion, dS = μS dt + σS dW. Sample reading with drift μ = 0.10 /yr, volatility σ = 0.25 /yr and 40 sample paths, at t = T: empirical mean 110.63 (theory S₀ e^{μT} = 110.52), empirical std 28.85 (theory 28.07).]

The orange curve is the deterministic mean E[S_t]. Each thin blue line is one possible future of the stock, generated using the closed-form lognormal update derived from Itô's lemma. Crank σ up to feel why volatility, not drift, is the dominant force.

Notice that turning the drift up shifts the orange line, but turning the volatility up fans the paths out exponentially. Volatility is what makes options valuable in the first place — and we are about to see why.
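The terminal statistics can be reproduced with a few lines of standard-library Python. This is a sketch under stated assumptions: $S_0 = 100$ and $T = 1$ are inferred from the theoretical mean of 110.52, the closed-form lognormal update is the one derived in the worked example below, and the function name `gbm_terminal_values` is hypothetical.

```python
import math
import random
import statistics

def gbm_terminal_values(S0, mu, sigma, T, n_paths, seed=0):
    """Sample S_T = S0 * exp((mu - sigma^2/2) T + sigma sqrt(T) Z), Z ~ N(0, 1)."""
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma**2) * T
    vol = sigma * math.sqrt(T)
    return [S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for _ in range(n_paths)]

S0, mu, sigma, T = 100.0, 0.10, 0.25, 1.0     # S0 and T are assumed values
samples = gbm_terminal_values(S0, mu, sigma, T, n_paths=20_000)

theory_mean = S0 * math.exp(mu * T)
theory_std = theory_mean * math.sqrt(math.exp(sigma**2 * T) - 1.0)
print(f"mean: empirical {statistics.fmean(samples):.2f}, theory {theory_mean:.2f}")
print(f"std : empirical {statistics.stdev(samples):.2f}, theory {theory_std:.2f}")
```

With only 40 paths the empirical numbers wobble noticeably from run to run; 20,000 paths are used here so the comparison with theory is tight.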

Worked Example: From dS to log(S)

The whole reason Itô's lemma is worth knowing is that it lets us change variables in an SDE. The single most important change of variables in finance is $f(S) = \log S$ . Let us apply Itô's lemma by hand.


Step 1. Identify the pieces. From $dS = \mu S\, dt + \sigma S\, dW$ we read off $\mu_X = \mu S$ and $\sigma_X = \sigma S$ in the generic SDE notation.

Step 2. Choose $f(S, t) = \log S$ . Compute the partials:

$\frac{\partial f}{\partial t} = 0,\quad \frac{\partial f}{\partial S} = \frac{1}{S},\quad \frac{\partial^2 f}{\partial S^2} = -\frac{1}{S^2}.$

Step 3. Plug into Itô's lemma:

$d(\log S) = \left(0 + \mu S \cdot \frac{1}{S} + \tfrac{1}{2}(\sigma S)^2 \cdot \left(-\frac{1}{S^2}\right)\right) dt + (\sigma S)\cdot \frac{1}{S}\, dW.$

Step 4. Simplify each piece. The two $S$ factors in the drift cancel, leaving $\mu$ . The $(\sigma S)^2 / S^2 = \sigma^2$ term gives the Itô correction. The diffusion term collapses to $\sigma$ .

$\boxed{\, d(\log S) = \left(\mu - \tfrac{1}{2}\sigma^2\right) dt + \sigma\, dW.\,}$

Step 5. Integrate from $0$ to $T$. Both coefficients are constants, so the drift integrates to $(\mu - \tfrac{1}{2}\sigma^2)\,T$ and the noise term integrates to $\sigma$ times the total Brownian increment $W_T - W_0 = W_T$:

$\log\frac{S_T}{S_0} \;=\; \left(\mu - \tfrac{1}{2}\sigma^2\right) T \;+\; \sigma\, W_T.$

Because $W_T \sim \mathcal{N}(0, T)$ , the log return is normally distributed:

$\log\frac{S_T}{S_0} \sim \mathcal{N}\!\left((\mu - \tfrac{1}{2}\sigma^2)T,\, \sigma^2 T\right).$

Numerical check. With $\mu=0.10, \sigma=0.20, T=1$ :

Mean of log return = $0.10 - \tfrac{1}{2}(0.04) = 0.08$ .
Std of log return = $0.20$ .
Naive (wrong) answer: mean = 0.10. The difference, $-\tfrac{1}{2}\sigma^2 = -0.02$ , is exactly the Itô correction.
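As a consistency check on the boxed result (a sketch, not part of the original derivation), one can run Itô's lemma in the opposite direction. Write $Y_t = \log S_t$, so that $dY = (\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW$, and apply the lemma to $g(y) = e^y$, for which $g' = g'' = e^y = S$:

```latex
\begin{aligned}
dS = d\big(e^{Y}\big)
  &= \Big[\big(\mu - \tfrac{1}{2}\sigma^2\big)\,e^{Y} + \tfrac{1}{2}\sigma^2\,e^{Y}\Big]\,dt + \sigma\,e^{Y}\,dW \\
  &= \mu\,S\,dt + \sigma\,S\,dW .
\end{aligned}
```

The two $\tfrac{1}{2}\sigma^2$ terms cancel and the original GBM equation comes back, which confirms the sign of the correction.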

Takeaway. A stock with an expected return of 10% does not have an expected log return of 10%; it has 8%. That two-percent gap, hammered out by volatility, is the origin of the $\tfrac{1}{2}\sigma^2$ terms inside $d_1$ and $d_2$ in the Black–Scholes formula you will meet in section 31.5.

Two stocks with the same expected return but different volatilities compound to different typical futures. Volatility eats geometric growth.
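To put numbers on "typical futures", the sketch below compares the mean and the median of $S_T$ for two volatilities, using the lognormal law of $\log(S_T/S_0)$ derived above. The starting price of 100 and the helper name `gbm_mean_and_median` are illustrative assumptions; the 20% and 60% volatilities echo the index and biotech examples earlier in the section.

```python
import math

def gbm_mean_and_median(S0, mu, sigma, T):
    """Mean and median of S_T under GBM, from the lognormal law of log(S_T / S0)."""
    mean = S0 * math.exp(mu * T)
    median = S0 * math.exp((mu - 0.5 * sigma**2) * T)
    return mean, median

for sigma in (0.20, 0.60):
    mean, median = gbm_mean_and_median(S0=100.0, mu=0.10, sigma=sigma, T=1.0)
    print(f"sigma={sigma:.2f}  mean S_T = {mean:.2f}  median S_T = {median:.2f}")
# sigma=0.20  mean S_T = 110.52  median S_T = 108.33
# sigma=0.60  mean S_T = 110.52  median S_T = 92.31
```

Both stocks have the same mean, but at 60% volatility the median outcome after one year is below the starting price.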

Python: Euler–Maruyama from First Principles

Let us write the SDE simulator with no abstractions — one path, one explicit step at a time. This is the discrete cousin of the SDE itself: drift plus diffusion, step by step.

Plain Python Euler–Maruyama simulator: simulate_gbm.py, annotated walkthrough followed by the full listing

NumPy import

We need fast vector math and a good random generator. `numpy.random.default_rng` returns the modern `Generator` API (PCG64 by default), which is preferable to the legacy `np.random.seed`/`np.random.randn` pair.

Function signature

Four model parameters, one numerical parameter, and a seed. `S0` is today's price, `mu` is the annualised expected return, `sigma` is the annualised volatility, and `T` is the time horizon in years; `N` is the number of Euler steps. Setting `seed` makes the path reproducible, which is essential for unit tests.

EXAMPLE

Defaults: S₀=100, μ=10%/yr, σ=20%/yr, T=1 yr, N=1000 steps.

Create the random generator

`default_rng(seed)` returns a fresh `Generator` with its own internal state. All randomness in this function pulls from it, so the same seed always reproduces the same path.

Time-step size dt

We split [0, T] into N equal intervals of length dt = T/N. Smaller dt ⇒ closer to the true continuous SDE, but more compute. For SDEs the pathwise (strong) Euler error is O(√dt), much slower than the O(dt) you get for ODEs. This is the price of randomness.

EXAMPLE

T = 1.0, N = 1000 → dt = 0.001 years ≈ 0.25 trading days (assuming 252 trading days per year).

Time grid

`linspace(0, T, N+1)` is the array of N+1 sample times. We need N+1 points because we keep S[0] and produce N updates.

Allocate the price array

Pre-allocate an uninitialised float64 array of length N+1. This avoids growing a list inside the loop and gives a fixed-size array that can be returned directly.

Initial condition

Set S[0] = S0. The SDE is first-order, so a single initial value pins down the whole path (given the same Brownian sample).

Step loop

We walk the timeline left to right. At each step we sample one new Brownian increment and use it to push S forward by one dt.

Brownian increment dW

Theoretically dW ~ N(0, dt). We synthesise it as √dt · Z where Z ~ N(0, 1). This is the discretised version of the integral ∫_t^{t+dt} dW = W(t+dt) − W(t).

EXAMPLE

If Z = 0.3 and dt = 0.001, dW = √0.001 · 0.3 ≈ 0.00949.

Euler–Maruyama update for dS

Direct discretisation of dS = μ·S·dt + σ·S·dW. The first term is deterministic drift; the second is the random kick. Both scale with the current price S[i], which is what makes GBM multiplicative. Note that the discrete Euler step, unlike the exact solution, does not strictly guarantee S > 0.

EXAMPLE

S[i]=100, μ=0.10, σ=0.20, dt=0.001, dW=0.00949 → drift = 100·0.10·0.001 = 0.01, diffusion = 100·0.20·0.00949 ≈ 0.1898, dS ≈ 0.2.

Advance the price

S[i+1] = S[i] + dS, the simplest possible time-marching scheme. There is no implicit step, no Newton solve. All the randomness lives in dW.

Return both arrays

Returning `(t, S)` lets the caller plot, log-transform, or compute summary statistics. We never mutate global state.

Call the simulator

Run with default parameters, capture both arrays.

Print summary

The log return ln(S_T/S_0) is the natural quantity. By Itô's lemma it should be approximately Normal with mean (μ − σ²/2)·T = 0.08 and standard deviation σ√T = 0.20. Because the seed is fixed, each run prints the same value; across different seeds, most paths land within two standard deviations (±0.40) of the mean.

EXAMPLE

Theory: mean = 0.08, std = 0.20. One realised path could be +0.0612, −0.1145, +0.3417, etc.


import numpy as np

def simulate_gbm(S0=100.0, mu=0.10, sigma=0.20, T=1.0, N=1000, seed=42):
    """
    Simulate one sample path of geometric Brownian motion
        dS = mu * S * dt + sigma * S * dW
    using the Euler-Maruyama scheme.
    """
    rng = np.random.default_rng(seed)   # local, seeded random generator
    dt = T / N                          # step size
    t = np.linspace(0.0, T, N + 1)      # N + 1 sample times
    S = np.empty(N + 1)                 # price path, filled in below
    S[0] = S0

    for i in range(N):
        dW = np.sqrt(dt) * rng.standard_normal()   # Brownian increment ~ N(0, dt)
        dS = mu * S[i] * dt + sigma * S[i] * dW    # drift plus diffusion
        S[i + 1] = S[i] + dS

    return t, S

t, S = simulate_gbm()
print(f"S_0   = {S[0]:.4f}")
print(f"S_T   = {S[-1]:.4f}")
print(f"log return = {np.log(S[-1] / S[0]):+.4f}")

Verify it yourself. Copy the snippet, run it, and you should see something close to:

S_0 = 100.0000
S_T = 113.4205
log return = +0.1259

With seed=42 the value is deterministic. The expected log return is $0.08$ and the standard deviation is $0.20$, so $+0.126$ sits about a quarter of a standard deviation above the mean, which is perfectly normal for a single sample.
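The $O(\sqrt{dt})$ pathwise error quoted in the walkthrough can be checked empirically. The sketch below is a minimal, standard-library illustration: it drives the Euler scheme and the exact lognormal solution with the same Brownian increments and averages the terminal gap. The function name `strong_error`, the path count, and the step counts are illustrative choices.

```python
import math
import random

def strong_error(n_steps, n_paths=400, S0=100.0, mu=0.10, sigma=0.20, T=1.0, seed=0):
    """Mean |S_T(Euler) - S_T(exact)| when both use the same Brownian increments."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        s, w = S0, 0.0
        for _ in range(n_steps):
            dw = sqrt_dt * rng.gauss(0.0, 1.0)
            s += mu * s * dt + sigma * s * dw   # Euler-Maruyama step
            w += dw                             # accumulate W_T for the exact solution
        exact = S0 * math.exp((mu - 0.5 * sigma**2) * T + sigma * w)
        total += abs(s - exact)
    return total / n_paths

for n in (4, 16, 64, 256):
    print(f"N={n:4d}  mean |error| = {strong_error(n):.4f}")
```

If the error scales like $\sqrt{dt}$, quadrupling N should roughly halve the printed value at each row.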

PyTorch: Vectorised Path Generation

For pricing options by Monte Carlo we want tens of thousands of paths. A Python loop crawls; a tensor batch flies. The trick is to use the closed-form log-Euler update we derived above, then $\texttt{cumsum}$ along the time axis.

Vectorised PyTorch simulator (GPU-ready): simulate_gbm_torch.py, annotated walkthrough followed by the full listing

Import torch

Same job as `numpy` but with GPU support and autograd. We will not need autograd here; we just want the fast tensorised random number generator and `cumsum`.

Vectorised function signature

Same five SDE parameters as the NumPy version, plus `num_paths`, `device`, `dtype`, and `seed`. The big change: we now simulate `num_paths` Brownian motions simultaneously. With 10,000 paths and N=1000 we are about to crunch 10 million normals in one shot.

EXAMPLE

On a modern GPU this can run in tens of milliseconds; the pure-Python loop above is orders of magnitude slower for the same number of steps.

Per-call generator

`torch.Generator(device=device).manual_seed(seed)` plays the role of NumPy's `default_rng`: a generator object that lives on the chosen device. Passing it explicitly keeps the function reproducible without touching the global RNG state set by `torch.manual_seed`.

Time-step dt

Same definition as before: dt = T/N. We pull it out of the loop because we are about to vectorise the loop away entirely.

Sample standard normals Z

`torch.randn(num_paths, N, ...)` produces a (num_paths × N) matrix of i.i.d. N(0,1) samples. Row p column i is the standard normal we will use for path p at step i.

EXAMPLE

Z[0, 0] = 0.32, Z[0, 1] = -1.14, … each row is one path's random ingredients.

Scale to Brownian increments

Multiply elementwise by √dt to convert each Z into a dW with variance dt. Now `dW[p, i]` is the Brownian increment for path p during step i.

Deterministic per-step drift

From Itô's lemma applied to log(S): d(log S) = (μ − σ²/2) dt + σ dW. The −σ²/2 part of the drift is the Itô correction; naive calculus would have given just μ.

EXAMPLE

μ = 0.10, σ = 0.20, dt = 0.001 → drift = (0.10 − 0.02)·0.001 = 8e-5.

Log-price increment per step

Each entry of `log_increments` is one realisation of d(log S) for one path at one step. Same shape (num_paths × N).

cumsum builds the path

`cumsum(... dim=1)` integrates along the time axis: column i becomes the sum of log increments from step 0 up to step i. This is the discrete analogue of ∫_0^t (μ − σ²/2) ds + ∫_0^t σ dW(s).

EXAMPLE

If increments along one path are [0.001, -0.002, 0.003], cumsum gives [0.001, -0.001, 0.002].

Prepend a zero column

We need a column of zeros at t = 0 so the final array has N+1 time points. `torch.cat` glues them on. The next line will lift everything up by log(S0).

Add log(S0)

Broadcasting `+ log(S0)` shifts every path so that S(0) = S0. Now `log_S[p, 0] = log(S0)` and `log_S[p, N]` is the realised log final price for path p.

Exponentiate to get S

`S = exp(log_S)` recovers the actual price levels. This step is what guarantees S stays positive: taking a Wiener path through the log space and then exponentiating is the geometric in 'geometric Brownian motion'.

Time grid

Same as NumPy version. Used only for plotting.

Call and report

Run the batch and check that the empirical mean of S_T matches the theoretical S₀·exp(μT). With 10,000 paths the agreement is typically within 1%.

EXAMPLE

Empirical mean ≈ 110.4. Theory: 100·exp(0.10) ≈ 110.52.


import torch

def simulate_gbm_batch(S0=100.0, mu=0.10, sigma=0.20,
                       T=1.0, N=1000, num_paths=10_000,
                       device="cpu", dtype=torch.float32, seed=42):
    """
    Vectorised geometric Brownian motion. Generates many paths in parallel.
    Uses the exact log-Euler update from Ito's lemma applied to log(S).
    """
    g = torch.Generator(device=device).manual_seed(seed)
    dt = T / N

    # Brownian increments: shape (num_paths, N), each ~ N(0, dt)
    Z = torch.randn(num_paths, N, generator=g, device=device, dtype=dtype)
    dW = Z * (dt ** 0.5)

    # Exact GBM increment for log S:  d(log S) = (mu - sigma^2/2) dt + sigma dW
    drift = (mu - 0.5 * sigma * sigma) * dt
    log_increments = drift + sigma * dW

    # Cumulative sum along the time axis, prepend a zero column for t = 0,
    # then shift every path by log(S0)
    log_S = torch.cumsum(log_increments, dim=1)
    log_S = torch.cat([torch.zeros(num_paths, 1, device=device, dtype=dtype), log_S], dim=1)
    log_S = log_S + torch.log(torch.tensor(S0, device=device, dtype=dtype))

    S = torch.exp(log_S)
    t = torch.linspace(0.0, T, N + 1, device=device, dtype=dtype)
    return t, S

t, S = simulate_gbm_batch()
print("S shape:", tuple(S.shape))
print("Mean S_T :", S[:, -1].mean().item())
print("Theory   :", 100.0 * torch.exp(torch.tensor(0.10)).item())

Why the closed-form update beats Euler–Maruyama on $S$ directly. The naive Euler step $\;S_{i+1} = S_i + \mu S_i\, dt + \sigma S_i\, dW$ can produce negative prices when $dW$ is very negative and $\sigma$ is large. Working in $\log S$ space and exponentiating at the end is exact in distribution and never goes negative.
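The negative-price failure is easy to provoke on purpose. The sketch below uses deliberately extreme, illustrative settings ($\sigma = 1.5$ and only 4 steps per year, neither taken from the original text) and drives both schemes with the same increments. The function name `count_nonpositive_paths` is hypothetical.

```python
import math
import random

def count_nonpositive_paths(n_paths=10_000, n_steps=4, S0=100.0, mu=0.10,
                            sigma=1.5, T=1.0, seed=0):
    """Count paths that reach S <= 0 at any step, for the Euler scheme on S
    and for the exact update in log-space, driven by the same increments."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    euler_bad = log_bad = 0
    for _ in range(n_paths):
        s_euler, log_s = S0, math.log(S0)
        euler_hit = log_hit = False
        for _ in range(n_steps):
            dw = sqrt_dt * rng.gauss(0.0, 1.0)
            s_euler += mu * s_euler * dt + sigma * s_euler * dw
            log_s += (mu - 0.5 * sigma**2) * dt + sigma * dw
            euler_hit = euler_hit or s_euler <= 0.0
            log_hit = log_hit or math.exp(log_s) <= 0.0
        euler_bad += int(euler_hit)
        log_bad += int(log_hit)
    return euler_bad, log_bad

euler_bad, log_bad = count_nonpositive_paths()
print(f"Euler on S : {euler_bad} of 10000 paths hit S <= 0")
print(f"log-space  : {log_bad} of 10000 paths hit S <= 0")
```

With realistic volatilities and small steps the Euler scheme rarely misbehaves this way, but the log-space update removes the possibility altogether.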

Stochastic Differential Equations

A general one-dimensional SDE has the form

\displaystyle dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dW_t,\qquad X_0 \text{ given.}

The two ingredients are the drift coefficient $\mu$ and the diffusion coefficient $\sigma$ . Different choices give very different processes:

Name	Drift μ(x,t)	Diffusion σ(x,t)	Typical use
Geometric Brownian motion	μ x	σ x	Stock prices in Black–Scholes
Ornstein–Uhlenbeck	−θ(x − m)	σ	Mean-reverting interest rates, asset volatility
Cox–Ingersoll–Ross	κ(θ − x)	σ √x	Interest rate models that stay non-negative
Heston (variance process)	κ(θ − x)	ξ √x	Stochastic-volatility option pricing
Langevin equation	−∇U(x)	√(2/β)	Statistical physics, diffusion models in ML

Itô's lemma applies uniformly to all of them. Once you can differentiate a function and substitute its drift and diffusion into the formula, you can transform any SDE into a new SDE for any smooth function of the state.
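Because every row of the table is just a choice of drift and diffusion, one generic Euler–Maruyama routine can simulate all of the one-dimensional cases. The following is a minimal standard-library sketch; the function name `euler_maruyama`, its signature, and the Ornstein–Uhlenbeck parameter values are illustrative assumptions.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, T, n_steps, rng):
    """Simulate one path of dX = drift(x, t) dt + diffusion(x, t) dW.
    Returns the list of n_steps + 1 states."""
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    xs = [x0]
    for i in range(n_steps):
        t, x = i * dt, xs[-1]
        dw = sqrt_dt * rng.gauss(0.0, 1.0)
        xs.append(x + drift(x, t) * dt + diffusion(x, t) * dw)
    return xs

# Ornstein-Uhlenbeck row of the table, with illustrative parameter values.
theta, m, sigma = 2.0, 0.05, 0.02
rng = random.Random(0)
terminal = [
    euler_maruyama(lambda x, t: -theta * (x - m), lambda x, t: sigma,
                   x0=0.10, T=5.0, n_steps=500, rng=rng)[-1]
    for _ in range(500)
]
mean_T = sum(terminal) / len(terminal)
print(f"mean of X_T over 500 paths: {mean_T:.4f} (long-run level m = {m})")
```

Swapping in `lambda x, t: mu * x` and `lambda x, t: sigma * x` gives the GBM row; the square-root diffusions would additionally need a guard against negative states, which this sketch does not include.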

Important caveat. Itô calculus is one of two consistent stochastic calculi. The other, Stratonovich, evaluates the integrand at the midpoint of each step instead of the left endpoint; under that convention the ordinary chain rule holds and the $\tfrac{1}{2}\sigma^2 f_{xx}$ term disappears from the change-of-variables formula. Finance and machine learning conventionally use Itô (left-endpoint, non-anticipating). Physics and engineering often use Stratonovich (consistent with the smooth limit). They are translatable, but mixing them silently is a classic bug.
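The difference between the two conventions shows up already for $\int_0^T W\, dW$. The sketch below, a standard-library illustration with a hypothetical function name, approximates the integral along one simulated path twice: once evaluating $W$ at the left endpoint of each step (Itô) and once averaging the two endpoints (a common discretisation of Stratonovich).

```python
import math
import random

def ito_and_stratonovich_sums(n_steps=100_000, T=1.0, seed=0):
    """Approximate the integral of W dW on [0, T] with left-point (Ito)
    and averaged-endpoint (Stratonovich) sums along one simulated path."""
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(T / n_steps)
    w = ito = strat = 0.0
    for _ in range(n_steps):
        dw = sqrt_dt * rng.gauss(0.0, 1.0)
        ito += w * dw                    # integrand evaluated at the left endpoint
        strat += (w + 0.5 * dw) * dw     # integrand averaged over both endpoints
        w += dw
    return w, ito, strat

w_T, ito, strat = ito_and_stratonovich_sums()
print(f"W_T^2 / 2          = {0.5 * w_T**2:.4f}")
print(f"Stratonovich sum   = {strat:.4f}   (ordinary chain rule: W_T^2 / 2)")
print(f"Ito sum            = {ito:.4f}   (Ito: W_T^2 / 2 - T / 2)")
```

The two sums differ by half the quadratic variation, which is approximately $T/2$: the same $(dW)^2 = dt$ bookkeeping, seen from the integral side.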

Where Itô's Lemma Lives

Itô's lemma is not a finance-only tool. It is the chain rule of every field that deals with continuous-time noise.

Black–Scholes PDE. Apply Itô to an option price $V(S, t)$. The $dW$ term can be hedged away by holding $\partial V / \partial S$ shares of stock against the option. Requiring the resulting riskless portfolio to earn the risk-free rate gives the PDE, the subject of the next section.
Interest-rate models. The short rate $r_t$ is an SDE; bond prices are expectations of $e^{-\int r\, ds}$ . Itô is how you derive the bond PDE.
Filtering and Kalman–Bucy. Conditioning a signal on noisy observations gives an SDE for the conditional mean. Itô delivers the gain equation.
Stochastic gradient Langevin dynamics. In Bayesian deep learning, parameters evolve via $d\theta = -\nabla U(\theta)\, dt + \sqrt{2/\beta}\, dW$ . Itô calculus lets you analyse the long-run distribution.
Diffusion generative models. The forward and reverse SDEs of score-based models are analysed with Itô calculus. The second-order term in the Fokker–Planck equation that governs their densities, and hence the score term in the reverse drift, traces back to the Itô correction.
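As a concrete instance of the Langevin item above, the sketch below integrates $d\theta = -\nabla U(\theta)\, dt + \sqrt{2/\beta}\, dW$ with Euler–Maruyama and compares the long-run average of $\theta^2$ with the value predicted by the stationary density proportional to $e^{-\beta U}$. The quartic potential $U(\theta) = \theta^4/4$, the step size, and the function name are illustrative assumptions, and the Euler discretisation carries a small step-size bias.

```python
import math
import random

def langevin_second_moment(beta=1.0, dt=0.01, n_steps=200_000, burn_in=10_000, seed=0):
    """Time-average of theta^2 along an Euler-Maruyama path of
    d(theta) = -U'(theta) dt + sqrt(2 / beta) dW with U(theta) = theta^4 / 4."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * dt / beta)
    theta, acc = 0.0, 0.0
    for step in range(n_steps):
        grad_u = theta**3                                  # U'(theta) for the quartic potential
        theta += -grad_u * dt + noise_scale * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            acc += theta * theta
    return acc / (n_steps - burn_in)

beta = 1.0
estimate = langevin_second_moment(beta)
# Second moment of the density proportional to exp(-beta * theta^4 / 4).
theory = math.sqrt(4.0 / beta) * math.gamma(0.75) / math.gamma(0.25)
print(f"time-averaged theta^2 = {estimate:.3f},  Gibbs prediction = {theory:.3f}")
```

The time average should sit close to the Gibbs prediction of about 0.676 for $\beta = 1$, up to Monte Carlo noise and discretisation bias.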

The slogan. $(dW)^2 = dt$ is the smallest mathematical statement with the largest economic, scientific, and ML footprint of the twentieth century.

Summary

Brownian motion has nonzero quadratic variation: $\sum (\Delta W_i)^2 \to T$ , almost surely.
Heuristic algebra of differentials: $(dW)^2 = dt$ , $dt\, dW = 0$ , $(dt)^2 = 0$ .
Itô's lemma: for $f(X_t, t)$ with $dX = \mu\, dt + \sigma\, dW$ , $\; df = (f_t + \mu f_x + \tfrac{1}{2}\sigma^2 f_{xx})\, dt + \sigma f_x\, dW$ .
The extra term $\tfrac{1}{2}\sigma^2 f_{xx}$ is the Itô correction. It is what makes stochastic calculus different from ordinary calculus.
Applied to $f = \log S$ on geometric Brownian motion, Itô gives the closed-form lognormal solution that lets us simulate paths exactly in log-space.
The Euler–Maruyama scheme is the simplest numerical solver for an SDE: drift step plus $\sqrt{dt}$ -scaled Gaussian noise.
Itô's lemma is the foundation of option pricing, interest-rate modelling, filtering, Langevin dynamics, and modern diffusion generative models.

Up next (Section 31.4): we will apply Itô's lemma to an option price $V(S, t)$ , build a self-financing replicating portfolio, and derive the famous Black–Scholes partial differential equation.