Chapter 25
35 min read
Section 155 of 175

Gaussian Processes

Stochastic Processes

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • • Explain what a Gaussian Process is and why it defines a distribution over functions
  • • Understand the role of kernel functions in encoding prior assumptions
  • • Derive the posterior mean and variance for GP regression
  • • Interpret GP uncertainty and its connection to data coverage

🔧 Practical Skills

  • • Implement GP regression from scratch using NumPy
  • • Choose appropriate kernel functions for different problems
  • • Apply GPs for Bayesian optimization and hyperparameter tuning
  • • Understand the computational complexity and scalability challenges

🧠 Deep Learning Connections

  • Neural Network GPs — Infinite-width neural networks converge to Gaussian Processes (Neal, 1996)
  • Bayesian Optimization — GPs power hyperparameter tuning for deep learning models
  • Uncertainty Quantification — GP-based methods like Deep Ensembles and MC Dropout
  • Neural Tangent Kernels — Modern theory connecting GPs to neural network training dynamics
Where You'll Apply This: Bayesian optimization for AutoML, uncertainty-aware predictions, active learning, spatial modeling (geostatistics), time series forecasting, and understanding deep neural network behavior in the infinite-width limit.

The Big Picture

Imagine you want to model an unknown function f(x)f(x) based on a few observations. Traditional approaches might fit a specific parametric form (like a polynomial or neural network), but what if you want to express uncertainty about the entire function—not just its parameters?

The Core Insight

A Gaussian Process is a probability distribution over functions. Just as a Gaussian distribution describes uncertainty about a single number, a GP describes uncertainty about an entire function. Any finite collection of function values follows a multivariate Gaussian distribution.

📈

Function Prior: Express beliefs about functions before seeing data

🎯

Posterior: Update beliefs given observations via Bayes' rule

📊

Uncertainty: Know when predictions are reliable vs. uncertain

Historical Context

📜
Andrey Kolmogorov (1941)

Kolmogorov developed the mathematical foundation for stochastic processes, including the concept of Gaussian random fields that would later become Gaussian Processes.

🌍
Danie Krige & Georges Matheron (1950s-60s)

Developed "Kriging" for spatial interpolation in geostatistics—essentially GP regression for predicting mineral deposits. This is why GPs are sometimes called Kriging in spatial statistics.

🧠
Radford Neal (1996)

Showed that Bayesian neural networks with infinite hidden units converge to Gaussian Processes—a profound connection between neural networks and GPs that continues to inspire research today.

Why Gaussian Processes Matter

GPs occupy a unique position in machine learning: they are non-parametric (the model complexity grows with data), fully Bayesian (providing principled uncertainty), and analytically tractable (no MCMC required for standard regression).

  1. Principled Uncertainty Quantification: GPs naturally tell you when predictions are reliable and when they're not. This is critical for safety-sensitive applications and active learning.
  2. Sample Efficiency: GPs extract maximum information from limited data by leveraging prior assumptions encoded in the kernel. Ideal for expensive experiments.
  3. Interpretable Priors: Kernel hyperparameters have clear interpretations—length scales, variances, periodicities—making it easier to incorporate domain knowledge.
  4. Foundation for Bayesian Optimization: GPs enable efficient global optimization of expensive black-box functions, revolutionizing hyperparameter tuning in deep learning.

Mathematical Definition

Formal Definition

Definition: Gaussian Process

A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully specified by:

m(x)=E[f(x)]m(x) = \mathbb{E}[f(x)]

Mean function

k(x,x)=Cov[f(x),f(x)]k(x, x') = \text{Cov}[f(x), f(x')]

Covariance (kernel) function

We write:

f(x)GP(m(x),k(x,x))f(x) \sim \mathcal{GP}(m(x), k(x, x'))

The key property: for any finite set of points {x1,,xn}\{x_1, \ldots, x_n\}, the corresponding function values follow a multivariate Gaussian:

(f(x1)f(x2)f(xn))N((m(x1)m(x2)m(xn)),(k(x1,x1)k(x1,xn)k(xn,x1)k(xn,xn)))\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix}, \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} \right)

For simplicity, we often assume the mean function is zero: m(x)=0m(x) = 0. This is not restrictive because the mean can be absorbed into the data preprocessing, and the kernel captures all the interesting structure.

Why "Gaussian"? The multivariate Gaussian distribution is the only distribution with the property that all marginals and conditionals are also Gaussian. This makes GPs analytically tractable—we can compute posteriors in closed form without sampling.

Interactive: GP Prior Samples

Before observing any data, a GP prior defines a distribution over possible functions. Different kernel functions encode different assumptions about the functions we expect. Experiment with kernels and parameters below:

Kernel Function

Infinitely differentiable, very smooth functions

Controls how quickly correlation decreases with distance

Controls the amplitude of the functions

-2-1012-2-1012xf(x)Samples from GP Prior with RBF (Squared Exponential) Kernel
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Key Observation: Notice how the length scale controls "wiggliness"—smaller length scales produce more rapidly varying functions, while larger length scales produce smoother functions. The variance controls the amplitude of variations around zero.

Kernel Functions

The kernel (or covariance function) is the heart of a GP. It encodes our prior beliefs about the functions we expect to see: Are they smooth? Periodic? Linear? The kernel determines everything about the GP's behavior.

What Makes a Valid Kernel?

A function k(x,x)k(x, x') is a valid kernel if and only if the covariance matrix it generates is positive semi-definite for any set of points. This ensures the multivariate Gaussian is well-defined.

Symmetry: k(x,x)=k(x,x)k(x, x') = k(x', x)
Positive Definiteness: i,jcicjk(xi,xj)0\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0

Interactive: Kernel Explorer

Explore different kernel functions and see how they affect the covariance structure. The 1D slice showsk(0,x)k(0, x')—how correlation with a fixed point decreases with distance.

Kernel Function Explorer

k(x, x') = σ² exp(-|x - x'|² / 2ℓ²)

The most commonly used kernel. Produces infinitely smooth (C∞) functions. Excellent for modeling smooth phenomena.

Controls how quickly covariance decreases with distance

Controls the overall magnitude (k(x, x) = \u03C3\u00B2)

Covariance Matrix K(X, X)
x'x
Kernel Slice: k(0, x')
-4-20240.00.51.0k(0,0) = \u03C3\u00B2\u2113 = 1.0x'k(0, x')
Interpretation
  • \u2022 At x = 0: k(0, 0) = 1.00 (maximum correlation with itself)
  • \u2022 As |x'| increases, correlation decreases based on kernel shape
  • \u2022 Length scale \u2113 = 1.00 determines how quickly correlation falls

Common Kernel Functions

KernelFormulaPropertiesUse Cases
RBF (SE)σ² exp(-|x-x'|²/2ℓ²)C∞ smooth, stationaryDefault choice, smooth phenomena
Matérn 1/2σ² exp(-|x-x'|/ℓ)C⁰ (rough), stationaryDiscontinuous functions
Matérn 3/2σ²(1+√3r)exp(-√3r)C¹ (once diff.), stationaryPhysical systems, good default
Periodicσ² exp(-2sin²(π|x-x'|/p)/ℓ²)Periodic with period pSeasonal patterns, cyclical data
Linearσ²(x-c)(x'-c)Non-stationary, polynomialLinear trends, polynomial regression

Kernels can be combined to create more expressive priors:

  • Sum: k1(x,x)+k2(x,x)k_1(x,x') + k_2(x,x') — Functions that are a superposition of components
  • Product: k1(x,x)k2(x,x)k_1(x,x') \cdot k_2(x,x') — Modulating one component by another (e.g., growing amplitude)

Gaussian Process Regression

Given training data D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n where yi=f(xi)+ϵy_i = f(x_i) + \epsilon with Gaussian noise ϵN(0,σn2)\epsilon \sim \mathcal{N}(0, \sigma_n^2), we want to compute the posterior distribution p(fX,y,X)p(f_* | X, y, X_*) at test points XX_*.

Posterior Derivation

The joint distribution of training outputs and test function values is:

(yf)N(0,(K+σn2IKKTK))\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\left( 0, \begin{pmatrix} K + \sigma_n^2 I & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)

where K=k(X,X)K = k(X, X), K=k(X,X)K_* = k(X, X_*), and K=k(X,X)K_{**} = k(X_*, X_*). Using the formula for Gaussian conditioning:

GP Posterior Formulas

Posterior Mean:

fˉ=KT(K+σn2I)1y\bar{f}_* = K_*^T (K + \sigma_n^2 I)^{-1} y

Posterior Covariance:

cov(f)=KKT(K+σn2I)1K\text{cov}(f_*) = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*
Computational Cost: The matrix inversion (or Cholesky solve) costs O(n3)O(n^3) time and O(n2)O(n^2) memory. This limits standard GPs to datasets of a few thousand points. Sparse approximations and inducing point methods reduce this to O(nm2)O(nm^2) where mnm \ll n.

Interactive: GP Regression

Click on the plot to add data points and watch the GP posterior update. Notice how uncertainty is low near observations and high in unexplored regions.

-3-2-10123-2-1012xf(x)Gaussian Process Regression (Click to Add Points)
Observations (n=4)
Posterior Mean
\u00b12\u03C3 Confidence
Sample 1
Sample 2
Sample 3

Controls smoothness of the function

Controls amplitude of variations

Controls observation noise level

Key Observations
  • \u2022 Uncertainty is low near observed data points
  • \u2022 Uncertainty increases far from observations
  • \u2022 Mean function interpolates training data
  • \u2022 Posterior samples all pass near data points
Try This: Add points in a cluster and notice how uncertainty remains high far away. Then add a single point in the uncertain region and watch the entire posterior reshape. This illustrates how GPs propagate information through the correlation structure.

Application: Bayesian Optimization

Bayesian Optimization (BO) is one of the most impactful applications of GPs. It efficiently optimizes expensive black-box functions by using a GP to model the objective and an acquisition function to decide where to sample next.

The Bayesian Optimization Loop

  1. Fit GP: Train a GP on all observations so far
  2. Compute Acquisition: Evaluate an acquisition function (e.g., Expected Improvement) everywhere
  3. Sample: Query the true objective at the point maximizing the acquisition function
  4. Repeat: Add the new observation and go to step 1

Common acquisition functions include:

FunctionFormulaBehavior
Expected Improvement (EI)E[max(f(x) - f_best, 0)]Balances exploration and exploitation elegantly
Upper Confidence Bound (UCB)μ(x) + κσ(x)κ controls exploration-exploitation tradeoff
Probability of Improvement (PI)P(f(x) > f_best + ξ)Simple but can be too exploitative

Interactive: Bayesian Optimization

Watch Bayesian Optimization find the maximum of a multi-modal function. The acquisition function (bottom panel) shows where the algorithm wants to sample next.

Bayesian Optimization - Iteration 0-2-1012Nextf(x)Acquisition Function (Expected Improvement)-2-1012x
True Function
GP Mean
Sampled
Latest
Best Found
0.000
at x = 0.00
How It Works
  1. Fit GP to current observations
  2. Compute acquisition function everywhere
  3. Sample where acquisition is maximum
  4. Repeat until budget exhausted
Real-World Consideration: In practice, BO is most valuable when each function evaluation is expensive (minutes to hours). For cheap functions, random search or grid search may be sufficient and faster. BO shines for hyperparameter tuning of deep learning models, experiment design, and engineering optimization.

Applications in Machine Learning

🎯 Hyperparameter Optimization

Tools like Spearmint, GPyOpt, and Ax use GP-based Bayesian optimization to tune neural network hyperparameters. This can find good configurations in 10-20 trials vs. hundreds for random search.

🔬 Active Learning

GP uncertainty guides which examples to label next. Sample where uncertainty is highest to maximize information gain. Critical in domains where labeling is expensive (medical imaging, robotics).

🧠 Neural Network Theory

Neal's 1996 result shows infinite-width networks are GPs. The Neural Tangent Kernel (Jacot et al., 2018) shows that even finite-width networks trained with gradient descent behave like kernel methods, connecting GPs to modern deep learning theory.

🤖 Robotics and Control

GPs model unknown dynamics in model-based reinforcement learning. The uncertainty estimates allow safe exploration—the robot knows when it's entering unfamiliar territory.

📈 Time Series Forecasting

GPs naturally handle irregular time series and provide uncertainty bands on forecasts. Composite kernels can model trends, seasonality, and noise simultaneously.


Python Implementation

Let's implement GP regression from scratch. The key insight is using Cholesky decomposition for numerical stability—never directly invert the covariance matrix!

Gaussian Process Regression from Scratch
🐍gaussian_process.py
1Import NumPy

NumPy provides the linear algebra operations essential for GP computations, including matrix operations and Cholesky decomposition.

2Import SciPy

SciPy's linalg module provides numerically stable implementations of Cholesky decomposition and triangular solvers.

5RBF Kernel Function

The Radial Basis Function (RBF) kernel, also called Squared Exponential, is the most commonly used kernel. It produces infinitely differentiable (smooth) functions.

8Squared Distance

Compute the squared Euclidean distance between all pairs of points. This is the core computation for distance-based kernels like RBF.

EXAMPLE
For X1 of shape (n,d) and X2 of shape (m,d), output is (n,m)
11RBF Formula

Apply the RBF kernel formula: k(x,x') = variance * exp(-0.5 * ||x-x'||^2 / length_scale^2). The length_scale controls smoothness.

14GP Class Definition

A GaussianProcess class that encapsulates the kernel, hyperparameters, and methods for fitting and prediction.

24Fit Method

Store training data and precompute the Cholesky decomposition of K + noise*I. This is the expensive O(n^3) operation.

28Compute Training Kernel

Build the n×n kernel matrix K(X,X) between all pairs of training points.

31Add Noise to Diagonal

Add observation noise variance to the diagonal for numerical stability and to model measurement uncertainty.

34Cholesky Decomposition

Compute L such that LL^T = K + noise*I. Cholesky is more numerically stable and efficient than direct inversion.

37Precompute Alpha

Solve L alpha = y for alpha = (K + noise*I)^{-1} y. This is reused for all predictions, making them O(n) instead of O(n^3).

41Predict Method

Given test points X*, compute the posterior mean and variance using the stored Cholesky factors.

44Cross-Kernel Matrix

Compute K(X*, X) - the kernel between test points and training points. This is an m×n matrix where m is the number of test points.

47Posterior Mean

The posterior mean is K(X*,X) @ alpha = K(X*,X) @ (K + noise*I)^{-1} @ y. This is the GP prediction at each test point.

50Solve for v

Solve L v = K(X,X*)^T to efficiently compute the posterior variance without explicit matrix inversion.

53Posterior Variance

Variance = K(X*,X*) - v^T v = K(X*,X*) - K(X*,X) (K+noise*I)^{-1} K(X,X*). Diagonal gives pointwise uncertainty.

44 lines without explanation
1import numpy as np
2from scipy import linalg
3
4# RBF (Squared Exponential) Kernel
5def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
6    # Compute squared Euclidean distance
7    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + \
8             np.sum(X2**2, 1) - 2 * X1 @ X2.T
9    # Apply RBF formula
10    return variance * np.exp(-0.5 * sqdist / length_scale**2)
11
12class GaussianProcess:
13    def __init__(self, kernel=rbf_kernel, noise=1e-4,
14                 length_scale=1.0, variance=1.0):
15        self.kernel = kernel
16        self.noise = noise
17        self.length_scale = length_scale
18        self.variance = variance
19        self.X_train = None
20        self.L_ = None
21        self.alpha_ = None
22
23    def fit(self, X, y):
24        self.X_train = X
25        self.y_train = y
26
27        # Compute kernel matrix K(X, X)
28        K = self.kernel(X, X, self.length_scale, self.variance)
29
30        # Add noise to diagonal for numerical stability
31        K += self.noise * np.eye(len(X))
32
33        # Cholesky decomposition: K = L @ L.T
34        self.L_ = linalg.cholesky(K, lower=True)
35
36        # Solve L @ alpha = y for alpha
37        self.alpha_ = linalg.cho_solve((self.L_, True), y)
38        return self
39
40    def predict(self, X_test, return_std=True):
41        # Compute cross-covariance K(X*, X)
42        K_star = self.kernel(X_test, self.X_train,
43                             self.length_scale, self.variance)
44
45        # Posterior mean: K(X*, X) @ alpha
46        mean = K_star @ self.alpha_
47
48        if not return_std:
49            return mean
50
51        # Solve L @ v = K(X, X*)^T
52        v = linalg.solve_triangular(self.L_, K_star.T, lower=True)
53
54        # Posterior variance: K(X*, X*) - v.T @ v
55        K_ss = self.kernel(X_test, X_test,
56                           self.length_scale, self.variance)
57        cov = K_ss - v.T @ v
58        std = np.sqrt(np.diag(cov))
59
60        return mean, std
Production Use: For real applications, use GPyTorch (GPU-accelerated, with PyTorch integration), GPflow (TensorFlow-based), or scikit-learn's GaussianProcessRegressor. These handle numerical issues, hyperparameter optimization, and sparse approximations.

Knowledge Check

Test your understanding of Gaussian Processes with these questions:

Question 1 of 8Score: 0/0

What does a Gaussian Process define a distribution over?


Summary

Key Takeaways

A Gaussian Process defines a distribution over functions—any finite collection of values is jointly Gaussian.
The kernel function encodes prior assumptions about smoothness, periodicity, and other function properties.
GP regression provides closed-form posterior mean and variance without sampling.
Uncertainty is naturally low near observations and high in unexplored regions.
Bayesian optimization uses GPs to efficiently optimize expensive black-box functions.
The main computational bottleneck is O(n³) matrix inversion, limiting scalability.
Infinite-width neural networks converge to GPs, connecting GPs to deep learning theory.
GPs are ideal for small data, uncertainty-critical applications, and active learning.

Looking Ahead

In the next section, we'll explore Poisson Processes—stochastic processes that model random events occurring in time or space. While GPs model continuous function values, Poisson processes model discrete counts of events, with applications in queuing theory, network traffic, and spatial statistics.

Loading comments...