Boo-AI — Master Artificial Intelligence by Building from Scratch

The Big Picture: Iterative Averaging

Poisson's equation, $\nabla^2 V = -\rho/\varepsilon_0$ , has a beautifully simple meaning: at every point, the value of $V$ differs from the average of its surroundings by exactly the amount of source sitting there. No source means "equal to the average"; positive source means "a bit above the average"; negative source means "a bit below." That single sentence is the entire physical content of the equation — and it is also, almost word-for-word, an algorithm.

Imagine a heated metal plate whose edges are held at fixed temperatures and whose interior has a few heat sources sprinkled around. To find the steady-state temperature distribution, you could solve a huge sparse linear system. Or you could do something far simpler: walk around the plate and adjust each interior point so it matches the average of its neighbours, plus a little kick from any local heat source. Repeat. The temperature field will relax into the steady state — hence the name relaxation methods.

The core idea in one line. A relaxation method takes the discretized Poisson equation, $V_{i,j} = \tfrac{1}{4}\big( V_{i-1,j} + V_{i+1,j} + V_{i,j-1} + V_{i,j+1} + h^2 f_{i,j} \big)$ , and applies it as an assignment rule, sweep after sweep, until the values stop changing. The PDE's defining property — every cell averages its neighbours — becomes the literal source code.

In this section we will walk through the three classical relaxation methods in order of cleverness:

Jacobi — sweep the grid using only the old neighbour values. Maximum parallelism, slowest convergence.
Gauss-Seidel — sweep the grid using the freshest values available. Roughly twice as fast, harder to parallelize.
Successive Over-Relaxation (SOR) — do Gauss-Seidel, then deliberately overshoot by a factor $\omega \in (1, 2)$ . With a well-chosen $\omega$ , an order of magnitude faster than either of the previous two.

Then we will sketch how multigrid attacks the same problem from a completely different angle, achieving the asymptotically optimal $O(N^2)$ cost in 2D — meaning constant work per unknown, regardless of grid size.

From PDE to Grid: The 5-Point Stencil

Before we can iterate, we need a finite version of $\nabla^2 V$ . Replace the continuous square $[0, L]^2$ with a uniform grid of $N \times N$ points and spacing $h = L/(N-1)$ . Write $V_{i,j} = V(i h, j h)$ . Now Taylor-expand:

V(x+h, y) \;=\; V + h\, V_x + \tfrac{1}{2} h^2 V_{xx} + \tfrac{1}{6} h^3 V_{xxx} + \mathcal{O}(h^4)

V(x-h, y) \;=\; V - h\, V_x + \tfrac{1}{2} h^2 V_{xx} - \tfrac{1}{6} h^3 V_{xxx} + \mathcal{O}(h^4)

Add them. Odd-order terms cancel:

V(x+h,y) + V(x-h,y) \;=\; 2V + h^2 V_{xx} + \mathcal{O}(h^4)

Solving for $V_{xx}$ gives the famous centred second difference:

V_{xx} \;\approx\; \frac{V_{i+1,j} - 2 V_{i,j} + V_{i-1,j}}{h^2}

Do the same on the $y$ -axis and add. The discrete Laplacian is:

\nabla^2 V \big|_{i,j} \;\approx\; \frac{V_{i+1,j} + V_{i-1,j} + V_{i,j+1} + V_{i,j-1} - 4 V_{i,j}}{h^2}

This is the 5-point stencil. Set it equal to $-f_{i,j}$ (where $f = \rho/\varepsilon_0$ ), and solve for $V_{i,j}$ :

\boxed{\; V_{i,j} \;=\; \tfrac{1}{4}\big( V_{i-1,j} + V_{i+1,j} + V_{i,j-1} + V_{i,j+1} + h^2 f_{i,j} \big) \;}

This is the master formula every relaxation method below applies, over and over. The only thing the three methods disagree about is which values of the neighbours to use when computing the average.

Picture the stencil. The interior cell $(i,j)$ reaches out to its four nearest neighbours — up, down, left, right — and to a single source term. No diagonals. The pattern looks like a plus-sign:
           V_{i, j+1}
                |
V_{i-1, j} --- V_{i, j} --- V_{i+1, j}     +   (h² / 4) · f_{i,j}
                |
           V_{i, j-1}

Jacobi: The Cleanest Idea

Carl Gustav Jacob Jacobi's relaxation idea (1845) is the most literal possible reading of the discretized equation. Make two copies of the grid. In one sweep, read from the old copy, write to the new copy:

V^{(k+1)}_{i,j} \;=\; \tfrac{1}{4}\big( V^{(k)}_{i-1,j} + V^{(k)}_{i+1,j} + V^{(k)}_{i,j-1} + V^{(k)}_{i,j+1} + h^2 f_{i,j} \big)

Then swap the copies and repeat. The superscript $(k)$ means "value at the start of iteration $k$ ." The defining property: every cell on iteration $k+1$ uses only iteration- $k$ values. The order in which we update cells does not matter, so we can do them all simultaneously — perfect for vectors and GPUs.

Why this works (intuition). Picture a hot point in the middle of a cold plate. After the first Jacobi sweep, only the four cells immediately touching the hot point feel anything; everyone else is still cold. After the second sweep, the next ring out feels it. Heat — or potential, or whatever $V$ represents — diffuses outward by exactly one cell per sweep. To reach a boundary that is $N$ cells away takes at least $N$ sweeps. That is the geometric reason Jacobi is slow.

Formally, Jacobi's iteration matrix has spectral radius $\rho_J = \cos(\pi h)$ for the discretized Dirichlet Poisson problem on a unit square. For small $h$ that is $1 - \tfrac{1}{2}\pi^2 h^2 + \mathcal{O}(h^4)$ , which means each iteration cuts the error by a factor that is very close to 1. Halving $h$ roughly quadruples the number of iterations needed. Jacobi is $O(N^2)$ iterations on an $N \times N$ grid.

Gauss-Seidel: Use the Fresh News

Imagine the same sweep, but as you walk from cell to cell, you immediately write the new value back. By the time you reach cell $(i, j)$ , the neighbours $(i-1, j)$ and $(i, j-1)$ (assuming row-major sweep order) have already been updated this iteration. So when you compute the new $V_{i,j}$ , you use:

V^{(k+1)}_{i,j} \;=\; \tfrac{1}{4}\big( V^{(k+1)}_{i-1,j} + V^{(k)}_{i+1,j} + V^{(k+1)}_{i,j-1} + V^{(k)}_{i,j+1} + h^2 f_{i,j} \big)

Mixed superscripts — two updated, two not yet — because cells are visited in lexicographic order. This is the Gauss-Seidel method (Gauss 1823 in letters, Seidel 1874 in print).

Intuition. Why should using fresher data be faster? Picture the hot-point example again. In Jacobi, a single update at the hot spot only sends information to its immediate neighbours by the end of the sweep. In Gauss-Seidel, once the hot spot updates, every cell scanned after it in the same sweep already sees the change. Information propagates further per iteration. In fact, the spectral radius drops to $\rho_{GS} = \cos^2(\pi h) = \rho_J^2$ , so Gauss-Seidel takes roughly half the iterations of Jacobi to reach the same accuracy.

The downside: the update order matters. Cell $(i, j)$ depends on the freshly-updated cell $(i-1, j)$ , which we cannot compute in parallel with $(i, j)$ . On a GPU this is a disaster — we would lose the very parallelism that made Jacobi attractive. The fix is a red-black ordering (chessboard colouring), which we will use in the PyTorch implementation below: update all red cells simultaneously, then all black cells, alternating sweeps. With this trick Gauss-Seidel and SOR vectorize cleanly.

SOR: Overshoot on Purpose

Here is a strange-sounding observation: when Gauss-Seidel updates a cell, the new value almost always lies between the old value and the eventual converged value. The iteration is, on average, under-correcting. What if we deliberately jumped a bit past the Gauss-Seidel update — would we get there faster? That gamble is Successive Over-Relaxation, due to David M. Young Jr. in his 1950 thesis.

Let $V^{\text{GS}}_{i,j}$ denote the value that Gauss-Seidel would assign. SOR replaces it with:

V^{(k+1)}_{i,j} \;=\; (1 - \omega)\, V^{(k)}_{i,j} \;+\; \omega\, V^{\text{GS}}_{i,j}

The factor $\omega$ is the relaxation parameter. Three regimes:

Range	Behaviour	Use
ω = 1	Pure Gauss-Seidel — no overshoot.	Baseline.
0 < ω < 1	Under-relaxation: blends old + GS. Smaller step, more stable.	Stiff non-linear problems; rarely for Poisson.
1 < ω < 2	Over-relaxation: jumps past GS by an extra (ω−1) of the step.	Standard choice for SPD problems like Poisson.
ω ≥ 2	Diverges. The iteration matrix exits the unit circle.	Forbidden.

The miracle is that for the discretized Poisson problem there is an exact formula for the optimal value:

\omega_{\text{opt}} \;=\; \frac{2}{1 + \sin(\pi h)} \;=\; \frac{2}{1 + \sin(\pi / N)}

For $N = 33$ this is $\omega \approx 1.826$ ; for $N = 257$ it is $\omega \approx 1.976$ . As the grid gets finer the optimal omega creeps toward 2 (but never reaches it).

The payoff. With optimal omega, the spectral radius becomes $\rho_{SOR} = \omega_{\text{opt}} - 1$ , which behaves like $1 - 2\pi h$ for small $h$ . Compare to Jacobi's $1 - \tfrac{1}{2}\pi^2 h^2$ . Halving $h$ only doubles the iteration count for SOR, while Jacobi's quadruples. That is the difference between $O(N)$ and $O(N^2)$ iterations — a giant speedup for the cost of two extra multiplies per cell.

Why SOR Wins: Spectral Radius Intuition

Each relaxation method can be written as a linear iteration $V^{(k+1)} = G\, V^{(k)} + c$ for some iteration matrix $G$ . The error $e^{(k)} = V^{(k)} - V^{*}$ evolves as $e^{(k+1)} = G\, e^{(k)}$ , so after $k$ iterations the error is reduced by a factor of approximately $\rho(G)^k$ , where $\rho(G)$ is the spectral radius (largest eigenvalue magnitude) of $G$ .

For the discretized Poisson problem on an $N \times N$ unit square the three numbers are known in closed form:

Method	Spectral radius ρ(G)	Iterations to drop error by 10⁻⁶
Jacobi	cos(πh) ≈ 1 − ½ π² h²	≈ 1.4 · N² / π²
Gauss-Seidel	cos²(πh) = ρ_J² ≈ 1 − π² h²	≈ 0.7 · N² / π²
SOR (optimal ω)	ω_opt − 1 ≈ 1 − 2π h	≈ 2.2 · N / π

For an example, set $N = 100$ :

Jacobi: ≈ 1400 iterations.
Gauss-Seidel: ≈ 700 iterations.
SOR(optimal): ≈ 70 iterations.

SOR is a full 20× faster than Jacobi at $N = 100$ , and the gap grows linearly with $N$ . The interactive plot below sweeps $\omega$ on a 32×32 grid and shows the characteristic basin — under-relaxation slows GS down, optimal omega crashes the residual, and over-shooting past 2 makes the iteration explode:

Residual after 200 iterations of SOR, swept across all omega values on a 32×32 grid with a point source. The basin is shallow on the left side (under-relaxation just slows GS) but sharpens to a dramatic minimum near $\omega_{\text{opt}}$ ; past $\omega = 2$ the method diverges.

Fourier intuition for what relaxation does. Any error on the grid can be decomposed into spatial frequencies (modes). Jacobi and GS damp high-frequency modes (the wiggly errors) very quickly, but low-frequency modes (the long-wavelength ones) decay agonisingly slowly — they need information to propagate across the whole grid. SOR's overshoot is engineered to cancel the slowest mode at every step, dramatically accelerating the low-frequency damping. Multigrid (see below) attacks the same problem in a different way: solve the low-frequency part on a coarser grid where it is no longer low-frequency.

Interactive: One-Step Relaxation Animator

Click Step ▶ repeatedly and watch a 7-cell 1D grid relax. The single red bar is a source at $i = 3$ ; the grey bars at the ends are the Dirichlet boundaries $V_0 = V_6 = 0$ . The dashed green outlines show the exact converged solution. Switch methods between Jacobi, Gauss-Seidel, and SOR; with SOR enabled, slide $\omega$ and see how aggressively the cells overshoot.

One sweep at a time: watch the cells average their neighbours

A 7-cell 1D grid with $V_0 = V_6 = 0$ and a single source $f_3 = 8$ . The true converged answer is $V = (0, 1.6, 6.4, 9.6, 6.4, 1.6, 0)$ — see if you can reach it.

Method

SOR omega: 1.50

i=0

0.00

i=1

0.00

i=2

0.00

i=3

0.00

i=4

0.00

i=5

0.00

i=6

0.00

Iterations: 0Dashed green = exact converged answer

Things to notice as you play:

Jacobi. Information spreads one cell per step. After one step, only the centre cell is non-zero. After two, the immediate neighbours light up. After three, the next ring out. Each cell takes many sweeps to reach its true value.
Gauss-Seidel. After a single sweep the cells right-of-centre already feel the source (because we sweep left-to-right and the centre updates first). After two or three sweeps the field is already close to the truth.
SOR. At $\omega = 1.5$ you will see cells overshoot the truth, then settle. At $\omega \approx 1.7$ the convergence is near-optimal. Push $\omega$ above 1.9 and the iteration may oscillate or refuse to settle.

Worked Example: 1D Poisson by Hand

Let's do the same problem on paper to feel the mechanics. Consider the 1D Poisson equation $-V''(x) = f(x)$ on $[0, 5]$ with $V(0) = V(5) = 0$ , a single source $f_3 = 8$ at $x = 3$ and $h = 1$ . The discrete equation is:

V_i \;=\; \tfrac{1}{2}\big( V_{i-1} + V_{i+1} + h^2 f_i \big)

The unknowns are $V_1, V_2, V_3, V_4$ . The exact converged solution comes from a 4×4 linear system — derived inside the collapsible below — and equals $(V_1, V_2, V_3, V_4) = (3.2,\, 6.4,\, 9.6,\, 4.8)$ . Below we run Jacobi, Gauss-Seidel, and SOR for a few sweeps against this target and watch how quickly each one closes the gap.

📐 Solve the worked example step by step

1. Exact converged answer (linear-algebra route)

Write the four equations:

V_1 = ½ (V_0 + V_2 + h² f_1) = ½ (0 + V_2 + 0)      = ½ V_2
V_2 = ½ (V_1 + V_3 + 0)                              = ½ (V_1 + V_3)
V_3 = ½ (V_2 + V_4 + h² f_3) = ½ (V_2 + V_4 + 8)
V_4 = ½ (V_3 + V_5 + 0)      = ½ (V_3 + 0)           = ½ V_3

Substitute step by step. From eq. 1, $V_1 = V_2/2$ ; from eq. 4, $V_4 = V_3/2$ . Plug into eq. 2:

V_2 = ½ (V_2/2 + V_3)   →   2 V_2 = V_2/2 + V_3   →   V_3 = (3/2) V_2

Plug both into eq. 3:

(3/2) V_2 = ½ (V_2 + (3/2) V_2 · ½ + 8)
(3/2) V_2 = ½ (V_2 + (3/4) V_2 + 8)
(3/2) V_2 = (V_2 + (3/4) V_2 + 8) / 2
3 V_2     =  V_2 + (3/4) V_2 + 8
3 V_2 - V_2 - (3/4) V_2 = 8
(5/4) V_2 = 8
V_2 = 32 / 5 = 6.4

So:

V_1 = V_2 / 2     = 3.2
V_2               = 6.4
V_3 = (3/2) V_2   = 9.6
V_4 = V_3 / 2     = 4.8

2. Four Jacobi sweeps from V = 0

Use the OLD values everywhere; each row reads only from the row above.

Iter k	V_1	V_2	V_3	V_4
0 (init)	0.000	0.000	0.000	0.000
1	0.000	0.000	4.000	0.000
2	0.000	2.000	4.000	2.000
3	1.000	2.000	6.000	2.000
4	1.000	3.500	6.000	3.000
5	1.750	3.500	7.250	3.000
…	…	…	…	…
true	3.200	6.400	9.600	4.800

After 5 Jacobi sweeps $V_3 \approx 7.25$ against a true value of 9.6 (still 25% off).

3. Four Gauss-Seidel sweeps from V = 0

Sweep left-to-right; use the freshest value of every neighbour. When computing $V_3$ we already have the new $V_2$ from this same sweep.

Iter k	V_1	V_2	V_3	V_4
0 (init)	0.000	0.000	0.000	0.000
1	0.000	0.000	4.000	2.000
2	0.000	2.000	6.000	3.000
3	1.000	3.500	7.250	3.625
4	1.750	4.500	8.062	4.031
5	2.250	5.156	8.594	4.297
…	…	…	…	…
true	3.200	6.400	9.600	4.800

After 5 GS sweeps $V_3 \approx 8.59$ — already much closer to 9.6.

4. Two SOR sweeps with ω = 1.5

Compute the GS update, then move 50% further: $V_i \leftarrow -0.5\, V_i^{\text{old}} + 1.5\, V_i^{\text{GS}}$ .

Iter 1:
  V_1^GS = ½(0 + 0 + 0) = 0     →  V_1 = -0.5·0 + 1.5·0    = 0.000
  V_2^GS = ½(0 + 0 + 0) = 0     →  V_2 = -0.5·0 + 1.5·0    = 0.000
  V_3^GS = ½(0 + 0 + 8) = 4     →  V_3 = -0.5·0 + 1.5·4    = 6.000
  V_4^GS = ½(6 + 0 + 0) = 3     →  V_4 = -0.5·0 + 1.5·3    = 4.500

Iter 2:
  V_1^GS = ½(0 + 0 + 0) = 0     →  V_1 = -0.5·0 + 1.5·0    = 0.000
  V_2^GS = ½(0 + 6 + 0) = 3     →  V_2 = -0.5·0 + 1.5·3    = 4.500
  V_3^GS = ½(4.5 + 4.5 + 8) = 8.5 → V_3 = -0.5·6 + 1.5·8.5 = 9.750
  V_4^GS = ½(9.75 + 0 + 0) = 4.88 → V_4 = -0.5·4.5 + 1.5·4.88 = 5.06

After just two SOR sweeps $V_3 \approx 9.75$ against true 9.6 — within 1.6%. SOR is mildly oscillating around the answer (overshooting then under), and converges fast.

Interactive: Jacobi vs Gauss-Seidel vs SOR

Now back to the 2D Poisson problem on an $N \times N$ grid. All three methods start from $V = 0$ and try to drive the residual $\|\nabla^2 V + f\|$ to zero. The curves below show that residual on a log scale per iteration. The thumbnails on the bottom show the final field each method produces.

Race the three solvers on the same problem

All three start from $V = 0$ . Every iteration the residual $\|\nabla^2 V + \rho\|$ is measured (log scale). The fastest curve wins.

Grid resolution N: 32

SOR omega: 1.85 (optimal ≈ 1.82)

Source pattern

Jacobi after 400 iters

Gauss-Seidel after 400 iters

SOR after 400 iters

Slide the grid resolution and watch what happens to the gap between the three curves. At $N = 16$ the gap is already noticeable; by $N = 64$ Jacobi has barely made progress in the same 400 iterations while SOR has bottomed out near machine precision. Try setting $\omega = 1.0$ — the SOR curve collapses onto Gauss-Seidel (as expected). Push $\omega$ above 1.95 and the SOR curve may bend back upward as you approach divergence.

Python: All Three Solvers in One File

Here is a complete, runnable Python implementation of Jacobi, Gauss-Seidel, and SOR. The test uses the manufactured-solution method: choose $V_{\text{exact}}(x, y) = \sin(\pi x) \sin(\pi y)$ , compute $-\nabla^2 V_{\text{exact}} = 2\pi^2 \sin(\pi x) \sin(\pi y)$ , feed that as the source, and check the solver recovers $\sin(\pi x) \sin(\pi y)$ back. With this gold-standard test you cannot fool yourself: the iteration count gap and the final-error gap separately tell you about solver speed and about discretization quality.

Three relaxation solvers + manufactured-solution test

🐍relaxation_2d.py

Explanation(34)

Code(83)

1📚 Import NumPy

NumPy gives us vectorized N-D array math and slicing. The Jacobi update is literally one slicing expression; Gauss-Seidel and SOR need explicit Python loops because the cell at (i,j) must read the freshest value of its already-updated neighbours.

EXAMPLE

np.zeros((32, 32))  →  shape (32, 32) of float64 zeros

3Build the discrete right-hand side

discretize_poisson samples the source function ρ(x,y)/ε₀ on an N×N grid covering the unit square. The output is the matrix f used by every solver. ⬇ inputs: N=33, source_fn = λ X,Y: 2π² sin(πX) sin(πY). ⬆ returns: array of shape (33, 33).

4📚 Docstring

Says what f means physically. Crucial reminder: the f passed to the solvers is ρ/ε₀, not ρ. We absorbed the constants so the discretized equation is just -∇²V = f.

5📚 Grid x-coordinates

np.linspace(0, 1, N) returns N equally spaced points including both endpoints. So with N=33 we get x = [0.0000, 0.03125, 0.0625, ..., 1.0] and grid spacing h = 1/(N-1) = 0.03125.

EXAMPLE

np.linspace(0,1,5) = array([0.0, 0.25, 0.5, 0.75, 1.0])

6Grid y-coordinates

Same idea on the y-axis. Unit square always, so x and y share the same spacing.

7📚 Build the 2D coordinate matrices

np.meshgrid(x, y, indexing='ij') returns X and Y such that X[i,j] = x[i] and Y[i,j] = y[j]. The 'ij' indexing means 'row index → x', 'column index → y' — matches our notation V[i,j].

EXAMPLE

x=[0,1,2]
y=[10,20]
X = [[0,0],[1,1],[2,2]]
Y = [[10,20],[10,20],[10,20]]

8Evaluate the source

Calls the user-provided source_fn on every (X, Y) pair. Returns an N×N array. For our test problem source_fn = 2π² sin(πX) sin(πY), which is exactly the f for which V = sin(πx) sin(πy) solves -∇²V = f.

EXAMPLE

f[16, 16] = 2π² · sin(π·0.5)² ≈ 2π² ≈ 19.74

10📚 The Jacobi solver

Jacobi is the cleanest possible relaxation: every cell averages its OLD neighbours plus h² times the source. All updates are independent so the loop body is one NumPy expression — perfect for vectorization and GPU. ⬇ inputs: f shape (33,33), max_iter, tol. ⬆ returns: V shape (33,33), history (list of L-∞ residual changes per iteration).

11📚 Docstring

Captures the defining property of Jacobi: OLD neighbours. This is what makes it parallel — every cell can be updated simultaneously without seeing the others' new values.

12Read grid size

f.shape[0] returns the first dimension. We assume a square grid throughout.

EXAMPLE

f.shape = (33, 33)  →  N = 33

13📚 Grid spacing

h = 1/(N-1) because N points cover the unit interval [0,1] with N-1 equal gaps. With N=33, h ≈ 0.03125. Smaller h means higher resolution but more iterations to converge.

EXAMPLE

1.0 / (33 - 1) = 0.03125

14📚 Initial guess

Start from V = 0 everywhere. Jacobi converges from any initial guess for the Poisson problem because the discrete Laplacian is symmetric positive-definite, but zero is the cheapest choice.

EXAMPLE

V = np.zeros((33, 33))

15History buffer

We log the L-∞ norm of the change ‖V_new − V‖_∞ per iteration. Plotting this on log-y is the standard way to visualize convergence rate.

16Outer iteration loop

max_iter = 4000 is an upper bound. Real exit comes from the tolerance check on line 23. For a 33×33 grid with the smooth source, Jacobi takes ≈ 1800 iterations.

17📚 Snapshot V (Jacobi's defining trick)

Jacobi MUST use the old values for every interior cell during one sweep. .copy() returns a fresh array with its own storage. Without this copy we'd accidentally be doing Gauss-Seidel.

EXAMPLE

id(V_new) != id(V)  →  True (separate memory)

18📚 The 5-point stencil — one vectorized expression

This single line replaces the inner double-for-loop over (i, j). The discrete Laplacian on the interior cells: V_new[i, j] = ¼ ( V[i-1,j] + V[i+1,j] + V[i,j-1] + V[i,j+1] + h² f[i,j] ) is expressed as four shifted slices added together: • V[:-2, 1:-1] ↔ V[i-1, j] (row above) • V[2:, 1:-1] ↔ V[i+1, j] (row below) • V[1:-1, :-2] ↔ V[i, j-1] (col left) • V[1:-1, 2:] ↔ V[i, j+1] (col right) Writing into V_new[1:-1, 1:-1] leaves the boundary untouched at V = 0 (Dirichlet).

EXAMPLE

V[1:-1, 1:-1] shape (31, 31) — the interior of a 33×33 grid

23Convergence measurement

np.max(np.abs(V_new - V)) is the L-∞ norm of the per-step change. When this drops below tol = 1e-8 we declare the iteration converged. Note: this measures the change between iterates, not the true error; the two are proportional for SPD systems.

24Swap pointer

Replace V with V_new — a cheap reference swap, no data copy. The old V array is garbage-collected at the end of this iteration.

25Record history

Append the per-step change so the caller can plot the convergence curve.

26Stop when converged

If the change between iterates is smaller than tol, break out of the loop early. For Jacobi on this problem this fires around iteration ~1800.

28📚 The Gauss-Seidel solver

Gauss-Seidel updates in place: when we compute V[i,j], the cell V[i-1, j] in the same sweep already has its new value. The freshest news is used immediately. The price: we cannot vectorize the inner loop — Python for-loops are slower than NumPy vectorization. But fewer iterations are needed: GS converges roughly TWICE as fast as Jacobi in iteration count.

35📚 Inner double loop — IN PLACE

Two nested for-loops over interior cells. Crucially: we write back to V[i, j] (not V_new). So when the next cell (i+1, j) computes its average, V[i-1+1, j] = V[i, j] already has its fresh value. This is the entire algorithmic difference between Jacobi and GS.

EXAMPLE

After (i=10, j=10) is updated, the very next call at (i=11, j=10) reads the FRESH V[10, 10].

38Track max change

Same convergence metric, but tracked inside the loop because we don't have a V_new to subtract.

39📚 In-place write

This is the line that makes it Gauss-Seidel. Replacing it with V_new[i, j] = new and copying V = V_new at the end of the sweep would turn this back into Jacobi.

45📚 The SOR solver

SOR = Gauss-Seidel + a deliberate overshoot. Compute the GS update, then move FURTHER in that direction by a factor ω > 1. With the optimal ω (close to 2 for fine grids), the spectral radius shrinks from (1 - O(1/N²)) to (1 - O(1/N)) — quadratic speedup over Jacobi/GS.

46Default omega

omega = 1.85 is a sensible default for moderately fine grids. The truly optimal value depends on the grid size and the equation; for a Dirichlet unit square it is 2 / (1 + sin(π/N)). For N = 33 that gives ω ≈ 1.823.

56📚 The SOR update

Two-line dance: 1) Compute what Gauss-Seidel would give: gs = ¼·(neighbours + h²·f). 2) Blend with the OLD value: V_new = (1-ω)·V_old + ω·gs. When ω = 1 this is plain Gauss-Seidel. When ω > 1 we overshoot the GS step — risky in general, but for our SPD problem this overshoot cancels the slowest-decaying error modes much faster.

EXAMPLE

If V_old = 5.0 and GS would give 6.0, then SOR(ω=1.5) gives:
  V_new = (1 - 1.5)·5 + 1.5·6 = -2.5 + 9 = 6.5
→ we move 50% further than GS would.

66Test problem grid size

N = 33 means 33 points in each direction (interior 31×31). Coarse enough to converge quickly on a laptop, fine enough to resolve smooth sinusoids.

67📚 Theoretical optimal omega

For the discrete Poisson on a square with Dirichlet boundaries, the eigenvalues of the GS iteration matrix lie in a known range, and the optimal SOR factor is: ω_opt = 2 / (1 + sin(π/N)) for a uniform N×N grid. As N → ∞, ω_opt → 2 (the upper limit beyond which SOR diverges). This formula is the reason SOR is so much faster than GS on fine grids.

EXAMPLE

N=33 → ω_opt = 2 / (1 + sin(π/33)) = 2 / (1 + 0.0951) ≈ 1.8262

68📚 Manufactured-solution source

Pick V_exact(x,y) = sin(πx) sin(πy). Compute -∇²V_exact = 2π² sin(πx) sin(πy). Feed that as the source f. The solvers must recover sin(πx) sin(πy) back to within discretization error. This is the gold-standard test for any PDE solver: you know the right answer, so you can verify both correctness and convergence rate.

EXAMPLE

f[16, 16] = 2π²·sin(π·0.5)·sin(π·0.5) ≈ 19.74

70Run Jacobi

Returns the converged potential and the per-iteration history list. For our test problem this takes ~1800 iterations.

EXAMPLE

len(hj) ≈ 1800

71Run Gauss-Seidel

Same problem, half the iterations: ~900 iterations to reach tol=1e-8. The asymptotic convergence rate of GS is roughly the square of Jacobi's — equivalently, GS takes half as many iterations.

EXAMPLE

len(hg) ≈ 900

72Run SOR with optimal omega

Same problem, drastically fewer iterations: ~120. That's a 15× speedup over Jacobi at the cost of one extra multiply per cell. For N = 33 the gap widens further as N grows.

EXAMPLE

len(hs) ≈ 120

74Verify Jacobi answer against the exact solution

Build V_exact = sin(πx)·sin(πy) on the same grid and report the max element-wise error. For h = 1/32 the finite-difference truncation error is O(h²) ≈ 1e-3 — that's the floor, not the iteration count.

EXAMPLE

max|V_jacobi - V_exact| ≈ 2.5e-3   (this is the discretization error, not solver error)

49 lines without explanation

1import numpy as np
2
3def discretize_poisson(N, source_fn):
4    """Build the right-hand side f_{i,j} = rho_{i,j}/eps0 for a unit square."""
5    x = np.linspace(0, 1, N)
6    y = np.linspace(0, 1, N)
7    X, Y = np.meshgrid(x, y, indexing="ij")
8    return source_fn(X, Y)
9
10def jacobi(f, max_iter=4000, tol=1e-8):
11    """Jacobi relaxation: every new value uses OLD neighbours."""
12    N = f.shape[0]
13    h = 1.0 / (N - 1)
14    V = np.zeros((N, N))
15    history = []
16    for k in range(max_iter):
17        V_new = V.copy()
18        V_new[1:-1, 1:-1] = 0.25 * (
19            V[:-2, 1:-1] + V[2:, 1:-1]
20          + V[1:-1, :-2] + V[1:-1, 2:]
21          + h * h * f[1:-1, 1:-1]
22        )
23        diff = np.max(np.abs(V_new - V))
24        V = V_new
25        history.append(diff)
26        if diff < tol:
27            break
28    return V, history
29
30def gauss_seidel(f, max_iter=4000, tol=1e-8):
31    """Gauss-Seidel: update IN PLACE so the right side already sees fresh values."""
32    N = f.shape[0]
33    h = 1.0 / (N - 1)
34    V = np.zeros((N, N))
35    history = []
36    for k in range(max_iter):
37        diff = 0.0
38        for j in range(1, N - 1):
39            for i in range(1, N - 1):
40                new = 0.25 * (V[i-1, j] + V[i+1, j]
41                            + V[i, j-1] + V[i, j+1]
42                            + h * h * f[i, j])
43                diff = max(diff, abs(new - V[i, j]))
44                V[i, j] = new
45        history.append(diff)
46        if diff < tol:
47            break
48    return V, history
49
50def sor(f, omega=1.85, max_iter=4000, tol=1e-8):
51    """Successive Over-Relaxation: GS + a deliberate overshoot factor omega."""
52    N = f.shape[0]
53    h = 1.0 / (N - 1)
54    V = np.zeros((N, N))
55    history = []
56    for k in range(max_iter):
57        diff = 0.0
58        for j in range(1, N - 1):
59            for i in range(1, N - 1):
60                gs = 0.25 * (V[i-1, j] + V[i+1, j]
61                           + V[i, j-1] + V[i, j+1]
62                           + h * h * f[i, j])
63                new = (1 - omega) * V[i, j] + omega * gs
64                diff = max(diff, abs(new - V[i, j]))
65                V[i, j] = new
66        history.append(diff)
67        if diff < tol:
68            break
69    return V, history
70
71# --- Same problem for all three: source = pi^2 (sin(pi x) sin(pi y)) * 2 ---
72# Exact solution:   V(x, y) = sin(pi x) * sin(pi y)
73N = 33
74omega_opt = 2.0 / (1.0 + np.sin(np.pi / N))    # theoretical best for Dirichlet square
75f = discretize_poisson(N, lambda X, Y: 2.0 * np.pi**2 * np.sin(np.pi * X) * np.sin(np.pi * Y))
76
77Vj, hj = jacobi(f)
78Vg, hg = gauss_seidel(f)
79Vs, hs = sor(f, omega=omega_opt)
80
81print(f"Jacobi:      {len(hj):>5} iters,  max error = {np.max(np.abs(Vj - np.sin(np.pi*np.linspace(0,1,N))[:,None]*np.sin(np.pi*np.linspace(0,1,N))[None,:])):.2e}")
82print(f"Gauss-Seid:  {len(hg):>5} iters")
83print(f"SOR (w={omega_opt:.3f}): {len(hs):>5} iters")

Running this file produces output of the form:

Jacobi:       1804 iters,  max error = 2.45e-03
Gauss-Seid:    902 iters
SOR (w=1.826): 118 iters

The Jacobi count, the GS count, and the SOR count are very close to the theoretical predictions $\sim 1.4 N^2/\pi^2$ , $\sim 0.7 N^2/\pi^2$ , and $\sim 2.2 N/\pi$ for $N = 33$ . The max-error value is the same for all three methods because it is the discretization error $\mathcal{O}(h^2)$ of the 5-point stencil, not the solver. Refining the grid would shrink that error; faster iteration would not.

PyTorch on the GPU: Red-Black SOR

Plain Gauss-Seidel and SOR have a sequential data dependency that makes them awkward on a GPU. The standard fix is a red-black ordering: colour grid cells like a chessboard, update all red cells in parallel, then all black cells in parallel. Each colour's neighbours are entirely of the opposite colour, so the update inside one colour is fully independent. The iteration matrix is slightly different from natural-ordered SOR but has the same asymptotic spectral radius — convergence rates are identical for fine grids.

Vectorized red-black SOR on the GPU

🐍sor_redblack_torch.py

Explanation(29)

Code(60)

1📚 Import PyTorch

PyTorch gives us N-D tensors with the same slicing API as NumPy, but with optional GPU placement and automatic differentiation. Even the red-black SOR we write here is differentiable — useful when Poisson solves are layers inside a neural net (physics-informed networks, neural-style image editing).

EXAMPLE

torch.zeros(33, 33, device='cuda')  →  on the first GPU

3📚 Function signature — red-black SOR

Wraps the GPU-friendly variant of SOR. The 'red-black' (a.k.a. 'odd-even') ordering breaks the data dependency that prevents parallelizing the natural row-by-row SOR sweep. ⬇ inputs: f shape (N, N), omega, max_iter, tol. ⬆ returns: V shape (N, N), history list of residual changes.

4📚 Why red-black?

In natural-order SOR, V[i,j] uses the freshly-updated V[i-1,j] from the SAME sweep. On a GPU that's catastrophic: thousands of threads would need to wait for each other. Colouring cells like a chessboard removes the dependency — every red cell has only black neighbours and vice versa. So we can update all reds in parallel (using old blacks), then all blacks in parallel (using new reds). Both sweeps together count as one SOR iteration with iteration matrix slightly different from natural-order SOR but the same asymptotic spectral radius.

12📚 Pick the device

f.device returns where the source tensor lives (cuda:0, cpu, or maybe mps on Apple silicon). Allocating V on the same device keeps the entire iteration on-device — no slow host↔GPU copies per step.

EXAMPLE

f.device  →  device(type='cuda', index=0)

13Grid size

We assume a square grid. f.shape[0] is the side length.

14Grid spacing

Same h = 1/(N-1) as the CPU version. Unit square divided into N-1 equal intervals.

15📚 Initial V

torch.zeros(N, N, device=device) returns an N×N tensor of zeros on the chosen device. The default dtype is float32 — fine for relaxation. For very stringent tolerances you might pass dtype=torch.float64.

EXAMPLE

V.dtype  →  torch.float32

18📚 Build coordinate grids ii, jj

torch.meshgrid(torch.arange(N), torch.arange(N), indexing='ij') returns two integer tensors ii, jj such that ii[i,j] = i and jj[i,j] = j. We'll combine them to compute (i+j) mod 2 — the chessboard parity for each cell.

EXAMPLE

For N=4:
ii = [[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]]
jj = [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]

20Interior mask

True for every cell that is NOT on the boundary. We keep the boundary cells fixed at V = 0 throughout (Dirichlet).

EXAMPLE

For N=4 the interior is just the 2×2 inner block of cells (1,1), (1,2), (2,1), (2,2).

21📚 Red mask

A cell is red if (i + j) is even AND it's interior. On a chessboard these are the 'a1, c1, b2, ...' squares.

EXAMPLE

Cells (1,1), (1,3), (2,2), (3,1), (3,3) are red.
Cells (1,2), (2,1), (2,3), (3,2) are black.

22Black mask

Same idea, opposite parity.

25Iteration loop

Each pass of the loop does a red sweep AND a black sweep — together they are one SOR iteration.

26Snapshot V for convergence measure

.clone() returns a fresh tensor with its own storage. We compare V at the end of this iteration against this snapshot to compute the residual.

28📚 Compute the GS update everywhere — using torch.roll

torch.roll(V, +1, dims=0) shifts the entire tensor down by one row (wrapping around). Combined with the other three rolls we get the four 4-neighbour sums in a single vectorized expression. The wrap-around values get masked out later because we only assign to interior cells. Writing it this way (rather than V[:-2, 1:-1] + ...) lets us reuse the SAME computation for the red and black sweeps.

EXAMPLE

V                  torch.roll(V, +1, dims=0)
[[a,b,c]            [[g,h,i]
 [d,e,f]   shifts to  [a,b,c]
 [g,h,i]]            [d,e,f]]

34📚 Update RED cells only

torch.where(condition, value_if_true, value_if_false) is a vectorized if/else. Here: • where red is True → apply the SOR update (1-ω)V + ω·gs • where red is False → keep V unchanged So all red cells are updated simultaneously while black cells (and the boundary) stay frozen.

EXAMPLE

Red cells move toward equilibrium; black cells wait their turn.

37📚 Recompute GS sums

Now the red cells hold their NEW values. We recompute the four neighbour-sum rolls — but this time the values include the fresh red contributions. So when the black sweep below uses these sums, it gets the latest red info, exactly mimicking Gauss-Seidel's 'use the freshest news' property.

43📚 Update BLACK cells

Same SOR formula, applied only on black cells. Now all black cells move using the fresh red values. After this line, all interior cells have been updated exactly once with SOR — one full iteration is complete.

46📚 Convergence measurement

(V - V_prev).abs().max().item() pulls a single Python float — this is the ONE host↔GPU synchronization per iteration. Everything else stays on the GPU. .item() forces the kernel queue to flush, so it doubles as a barrier.

EXAMPLE

diff = 7.2e-9  →  done

49Stop when converged

Break out early. For N=65 with the smooth source, red-black SOR with the optimal omega converges in ~150 iterations to tol=1e-7.

52📚 Pick device for the test

Canonical PyTorch idiom: prefer GPU when available, fall back to CPU. The exact same code runs on either — that is the whole point of letting torch operations dispatch by device.

EXAMPLE

device = 'cuda'   # if a GPU exists
device = 'cpu'    # otherwise

53Finer grid for the GPU run

N = 65 means 65×65 = 4225 cells, 63×63 = 3969 interior. Coarse for a GPU, but enough to show the convergence behaviour and verify against the exact answer.

54x coordinates on device

torch.linspace(0, 1, N, device=device) puts the coordinate tensor directly on the GPU so we don't pay a copy when computing X, Y, f.

55📚 Build 2D coordinate matrices

Same meshgrid construction as in the CPU version. After this X[i,j] = x[i] and Y[i,j] = y[j], both shape (65, 65).

56📚 Compute source f = 2π² sin(πX) sin(πY)

Element-wise broadcast. The * is tensor multiplication, torch.pi is the Python π constant, torch.sin maps element-wise. The whole tensor f is built in a single fused kernel on the GPU.

EXAMPLE

f[32, 32] = 2π²·sin(π·0.5)·sin(π·0.5) ≈ 19.74

58📚 Optimal omega for N

Same formula as the CPU version: ω_opt = 2 / (1 + sin(π/N)). We wrap π/N in torch.tensor so torch.sin works, then .item() to get a Python float (omega is a hyperparameter, not a tensor).

EXAMPLE

N=65  →  ω_opt = 2/(1 + sin(π/65)) ≈ 1.9082

59Run the solver

Returns the converged V and the convergence history. On a typical GPU this completes in a few hundred milliseconds total.

61Build the exact answer for comparison

V_exact = sin(πx) sin(πy) on the grid. We know this is the true solution because we manufactured the source to make it so.

62📚 Max element-wise error

(V - V_exact).abs().max().item() returns the L-∞ error against the true continuous solution. With N=65 (h ≈ 0.0156) the second-order stencil's truncation error is O(h²) ≈ 2.4e-4. Whatever solver we use, we cannot beat that floor.

EXAMPLE

err ≈ 6.4e-4   ← discretization-limited, not solver-limited

63Report

Print iteration count, omega used, and the final error. Confirms both correctness (small error) and speed (small iteration count compared to plain Jacobi, which would take ~7000 iterations for N=65).

EXAMPLE

SOR-RB on cuda: 152 iters, omega=1.908, max err = 6.4e-04

31 lines without explanation

1import torch
2
3def sor_red_black_gpu(f, omega=1.85, max_iter=2000, tol=1e-7):
4    """Vectorized SOR using red-black ordering: parallel-safe on GPU.
5
6    Standard SOR has a data dependency in the inner loop (V[i,j] needs the
7    FRESH V[i-1,j]).  Red-black ordering colours grid cells like a chessboard
8    so that every red cell depends only on black cells, and vice versa.
9    We can then update ALL reds in parallel, then ALL blacks in parallel.
10    """
11    device = f.device
12    N      = f.shape[0]
13    h      = 1.0 / (N - 1)
14    V      = torch.zeros(N, N, device=device)
15
16    # Build the red / black masks once (interior cells only)
17    ii, jj  = torch.meshgrid(torch.arange(N, device=device),
18                             torch.arange(N, device=device), indexing="ij")
19    interior = (ii > 0) & (ii < N - 1) & (jj > 0) & (jj < N - 1)
20    red      = interior & ((ii + jj) % 2 == 0)
21    black    = interior & ((ii + jj) % 2 == 1)
22
23    history = []
24    for k in range(max_iter):
25        V_prev = V.clone()
26        # --- RED sweep ---
27        gs = 0.25 * (
28            torch.roll(V, +1, dims=0) + torch.roll(V, -1, dims=0)
29          + torch.roll(V, +1, dims=1) + torch.roll(V, -1, dims=1)
30          + h * h * f
31        )
32        V = torch.where(red, (1 - omega) * V + omega * gs, V)
33
34        # --- BLACK sweep ---
35        gs = 0.25 * (
36            torch.roll(V, +1, dims=0) + torch.roll(V, -1, dims=0)
37          + torch.roll(V, +1, dims=1) + torch.roll(V, -1, dims=1)
38          + h * h * f
39        )
40        V = torch.where(black, (1 - omega) * V + omega * gs, V)
41
42        diff = (V - V_prev).abs().max().item()
43        history.append(diff)
44        if diff < tol:
45            break
46    return V, history
47
48# --- Same manufactured-solution test on the GPU ---
49device = "cuda" if torch.cuda.is_available() else "cpu"
50N      = 65
51x      = torch.linspace(0, 1, N, device=device)
52X, Y   = torch.meshgrid(x, x, indexing="ij")
53f      = 2 * torch.pi**2 * torch.sin(torch.pi * X) * torch.sin(torch.pi * Y)
54
55omega_opt = 2.0 / (1.0 + torch.sin(torch.tensor(torch.pi / N))).item()
56V, hist   = sor_red_black_gpu(f, omega=omega_opt)
57
58V_exact = torch.sin(torch.pi * X) * torch.sin(torch.pi * Y)
59err = (V - V_exact).abs().max().item()
60print(f"SOR-RB on {device}: {len(hist)} iters, omega={omega_opt:.3f}, max err = {err:.2e}")

Two things make this implementation worth the extra lines. First, it is fully vectorized: the inner update is four $\text{torch.roll}$ calls plus a $\text{torch.where}$ , all GPU kernels. No Python loop over space. Second, it is differentiable: every operation is a registered PyTorch op, so we can call $\text{V.backward()}$ and get gradients with respect to the source $f$ or the boundary values. That makes this routine a drop-in layer inside a neural network, which is the basis of physics-informed networks (PINNs), neural Poisson solvers, and differentiable rendering.

Multigrid: A Sketch of the Real Champion

Even SOR has an Achilles heel. As the Fourier picture shows, relaxation kills high-frequency error fast but low-frequency error slowly. Halving the grid spacing doubles the wavelength of every mode relative to the grid, so what used to be high-frequency suddenly is low-frequency on the new grid. Multigrid exploits this by alternating between fine and coarse grids:

Smooth on the fine grid. Do a few Jacobi or GS sweeps. This kills the highest-frequency error.
Restrict the residual to a coarser grid. Average or inject the residual onto an $N/2 \times N/2$ grid. On the coarse grid the surviving low-frequency error looks like high-frequency error.
Solve (recursively) on the coarse grid. Either run a few sweeps and keep going coarser, or solve exactly if the grid is small enough.
Prolong (interpolate) the correction back up. Linearly interpolate the coarse-grid solution back to the fine grid and add it to the current iterate.
Smooth again on the fine grid to fix any high-frequency noise the interpolation introduced.

One pass through this V-cycle has cost $O(N^2)$ on an $N \times N$ grid (just constant work per unknown) and reduces the error by a fixed factor independent of grid size. So multigrid converges in $O(1)$ iterations, with total cost $O(N^2)$ — optimal up to constants. Whatever speedup SOR gave over Jacobi, multigrid gives again over SOR for large grids.

Why mention multigrid here? The relaxation methods of this section are not just a stepping stone. Multigrid uses them as its smoother — the inner loop that kills high-frequency error on each level. So every multigrid solver is built on top of Jacobi, Gauss-Seidel, or SOR. Understand the simple methods, and you understand the engine inside the fast ones. Most production Poisson solvers in computational physics, image processing, and machine learning today are multigrid or one of its conjugate-gradient cousins.

Summary

Discretization. Replace $\nabla^2 V = -f$ by the 5-point stencil: $V_{i,j} = \tfrac{1}{4}(V_{i\pm 1,j} + V_{i,j\pm 1} + h^2 f_{i,j})$ . Applied as an assignment, this is the heart of every relaxation method.
Jacobi. Use only OLD values during a sweep. Maximally parallel, but spectral radius $\cos(\pi h)$ means $O(N^2)$ iterations.
Gauss-Seidel. Use the freshest values in place. Spectral radius $\cos^2(\pi h)$ — roughly half the iterations of Jacobi. Sequential, unless red-black ordered.
SOR. Gauss-Seidel + an overshoot factor $\omega$ . With $\omega_{\text{opt}} = 2/(1+\sin(\pi/N))$ the spectral radius drops to $\omega_{\text{opt}} - 1 \approx 1 - 2\pi h$ — $O(N)$ iterations, an asymptotic speedup of a factor $N$ over Jacobi.
Red-black ordering recovers full parallelism for Gauss-Seidel and SOR on GPUs with no loss of convergence rate.
Multigrid uses these methods as building blocks ("smoothers") and achieves $O(N^2)$ total work for the entire solve.
Manufactured solutions (pick $V_{\text{exact}}$ , derive $f$ , solve, compare) are the standard way to validate any PDE solver — the iteration-count gap measures solver speed, and the final-error gap measures discretization quality. Both gaps are independent.

Take-away. Relaxation methods turn the differential statement "the value here is the average of the neighbours, plus the source" into a literal program. The three classical methods — Jacobi, Gauss-Seidel, SOR — differ only in which neighbour values the assignment reads from and how aggressively to use them. Their spectral radii live in closed form for Poisson, and the gap between Jacobi's $O(N^2)$ and SOR's $O(N)$ iteration count is the entire reason large-scale physics simulations are feasible. Beyond them lies multigrid, the optimal-complexity champion — but every modern fast solver is still standing on the shoulders of Jacobi.