Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Explain what a Gaussian Process is and why it defines a distribution over functions
- Understand the role of kernel functions in encoding prior assumptions
- Derive the posterior mean and variance for GP regression
- Interpret GP uncertainty and its connection to data coverage
🔧 Practical Skills
- Implement GP regression from scratch using NumPy
- Choose appropriate kernel functions for different problems
- Apply GPs for Bayesian optimization and hyperparameter tuning
- Understand the computational complexity and scalability challenges
🧠 Deep Learning Connections
- Neural Network GPs: Infinite-width neural networks converge to Gaussian Processes (Neal, 1996)
- Bayesian Optimization: GPs power hyperparameter tuning for deep learning models
- Uncertainty Quantification: GPs give exact Bayesian uncertainty and serve as a reference point for approximate deep learning methods such as Deep Ensembles and MC Dropout
- Neural Tangent Kernels: Modern theory connecting GPs to neural network training dynamics
Where You'll Apply This: Bayesian optimization for AutoML, uncertainty-aware predictions, active learning, spatial modeling (geostatistics), time series forecasting, and understanding deep neural network behavior in the infinite-width limit.
The Big Picture
Imagine you want to model an unknown function based on a few observations. Traditional approaches might fit a specific parametric form (like a polynomial or neural network), but what if you want to express uncertainty about the entire function—not just its parameters?
The Core Insight
A Gaussian Process is a probability distribution over functions. Just as a Gaussian distribution describes uncertainty about a single number, a GP describes uncertainty about an entire function. Any finite collection of function values follows a multivariate Gaussian distribution.
Function Prior: Express beliefs about functions before seeing data
Posterior: Update beliefs given observations via Bayes' rule
Uncertainty: Know when predictions are reliable vs. uncertain
Historical Context
Andrey Kolmogorov (1941)
Kolmogorov developed the mathematical foundation for stochastic processes, including the concept of Gaussian random fields that would later become Gaussian Processes.
Danie Krige & Georges Matheron (1950s-60s)
Developed "Kriging" for spatial interpolation in geostatistics—essentially GP regression for predicting mineral deposits. This is why GPs are sometimes called Kriging in spatial statistics.
Radford Neal (1996)
Showed that Bayesian neural networks with a single hidden layer converge to Gaussian Processes as the number of hidden units goes to infinity, a profound connection between neural networks and GPs that continues to inspire research today.
Why Gaussian Processes Matter
GPs occupy a unique position in machine learning: they are non-parametric (the model complexity grows with data), fully Bayesian (providing principled uncertainty), and analytically tractable (no MCMC required for standard regression).
- Principled Uncertainty Quantification: GPs naturally tell you when predictions are reliable and when they're not. This is critical for safety-sensitive applications and active learning.
- Sample Efficiency: GPs extract maximum information from limited data by leveraging prior assumptions encoded in the kernel. Ideal for expensive experiments.
- Interpretable Priors: Kernel hyperparameters have clear interpretations—length scales, variances, periodicities—making it easier to incorporate domain knowledge.
- Foundation for Bayesian Optimization: GPs enable efficient global optimization of expensive black-box functions, revolutionizing hyperparameter tuning in deep learning.
Mathematical Definition
Formal Definition
Definition: Gaussian Process
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully specified by:
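Mean function: $m(x) = \mathbb{E}[f(x)]$
Covariance (kernel) function: $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$
We write:
$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$
The key property: for any finite set of points $x_1, \dots, x_n$, the corresponding function values follow a multivariate Gaussian:
$$[f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mathbf{m}, K), \quad \mathbf{m}_i = m(x_i), \quad K_{ij} = k(x_i, x_j)$$
For simplicity, we often assume the mean function is zero: $m(x) = 0$. This is rarely restrictive in practice: a constant or fitted mean can be subtracted from the targets during preprocessing, leaving the kernel to capture the remaining structure.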
Interactive: GP Prior Samples
Before observing any data, a GP prior defines a distribution over possible functions. Different kernel functions encode different assumptions about the functions we expect. The interactive demo exposes three controls:
- Kernel function: selects the prior family (for example, the RBF kernel gives infinitely differentiable, very smooth functions)
- Length scale: controls how quickly correlation decreases with distance
- Signal variance: controls the amplitude of the functions
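Drawing functions from a GP prior can be sketched in a few lines of NumPy. This is a minimal illustration that assumes 1-D inputs, a zero mean, and an RBF kernel; the names `rbf_kernel`, `sample_gp_prior`, and the `jitter` value are choices made for this example, not part of the original material.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays xa and xb."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def sample_gp_prior(x, n_samples=3, length_scale=1.0, signal_var=1.0,
                    jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior evaluated on the grid x."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, x, length_scale, signal_var)
    # A small diagonal jitter (assumed value) keeps the Cholesky factor stable.
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))
    # If z ~ N(0, I), then L @ z ~ N(0, K): each column is one sampled function.
    z = rng.standard_normal((len(x), n_samples))
    return (L @ z).T  # shape (n_samples, len(x))

x_grid = np.linspace(-5.0, 5.0, 100)
prior_samples = sample_gp_prior(x_grid, n_samples=3, length_scale=1.0)
```

Shrinking `length_scale` should make the sampled curves wigglier, and raising `signal_var` should stretch them vertically, mirroring the two sliders described above.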
Kernel Functions
The kernel (or covariance function) is the heart of a GP. It encodes our prior beliefs about the functions we expect to see: Are they smooth? Periodic? Linear? The kernel determines everything about the GP's behavior.
What Makes a Valid Kernel?
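A function $k(x, x')$ is a valid kernel if and only if it is symmetric and the covariance matrix it generates, $K_{ij} = k(x_i, x_j)$, is positive semi-definite for any finite set of points, meaning $\mathbf{v}^\top K \mathbf{v} \ge 0$ for every vector $\mathbf{v}$. This ensures the multivariate Gaussian is well-defined.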
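The positive semi-definiteness condition can be probed numerically by looking at the eigenvalues of a kernel matrix. The sketch below assumes 1-D inputs; `is_psd` and its tolerance are illustrative names and values. A passing check on one point set does not prove a kernel is valid, but a clearly negative eigenvalue does show that a candidate function is not a kernel.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays xa and xb."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def is_psd(K, tol=1e-8):
    """True if the symmetrised matrix K has no eigenvalue below -tol."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(eigvals.min() >= -tol)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=20)

K_valid = rbf_kernel(x, x)
# Symmetric but not a kernel: the matrix has zero trace and nonzero entries,
# so it must have at least one negative eigenvalue.
K_invalid = -(x[:, None] - x[None, :]) ** 2

valid_ok = is_psd(K_valid)
invalid_ok = is_psd(K_invalid)
```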
Interactive: Kernel Explorer
Explore different kernel functions and see how they affect the covariance structure. The demo shows two panels: the covariance matrix K(X, X) and the 1D kernel slice k(0, x'), which shows how correlation with a fixed point decreases with distance.
The RBF kernel is the most commonly used kernel. It produces infinitely smooth (C∞) functions and is excellent for modeling smooth phenomena. Its two parameters are:
- Length scale ℓ: controls how quickly covariance decreases with distance
- Signal variance σ²: controls the overall magnitude (k(x, x) = σ²)
Interpretation (with ℓ = 1.00 and σ² = 1.00):
- At x' = 0: k(0, 0) = 1.00 (maximum correlation with itself)
- As |x'| increases, correlation decreases based on kernel shape
- Length scale ℓ = 1.00 determines how quickly correlation falls
Common Kernel Functions
In the formulas below, d = |x - x'| is the distance between inputs and r = d/ℓ is that distance scaled by the length scale.
| Kernel | Formula | Properties | Use Cases |
|---|---|---|---|
| RBF (SE) | σ² exp(-d²/(2ℓ²)) | C∞ smooth, stationary | Default choice, smooth phenomena |
| Matérn 1/2 | σ² exp(-r) | C⁰ (continuous but not differentiable), stationary | Rough, jagged functions |
| Matérn 3/2 | σ²(1+√3r)exp(-√3r) | C¹ (once differentiable), stationary | Physical systems, good default |
| Periodic | σ² exp(-2sin²(πd/p)/ℓ²) | Periodic with period p | Seasonal patterns, cyclical data |
| Linear | σ²(x-c)(x'-c) | Non-stationary, linear in x | Linear trends; products of linear kernels give polynomial regression |
Kernels can be combined to create more expressive priors:
- Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$. Functions that are a superposition of components
- Product: $k(x, x') = k_1(x, x') \, k_2(x, x')$. Modulating one component by another (e.g., growing amplitude)
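The kernels in the table and the two composition rules translate fairly directly into NumPy. The sketch below assumes 1-D inputs; the function names and parameter values are illustrative choices for this example.

```python
import numpy as np

def rbf(xa, xb, ell=1.0, sigma2=1.0):
    d = xa[:, None] - xb[None, :]
    return sigma2 * np.exp(-0.5 * d**2 / ell**2)

def matern32(xa, xb, ell=1.0, sigma2=1.0):
    r = np.abs(xa[:, None] - xb[None, :]) / ell
    return sigma2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def periodic(xa, xb, ell=1.0, sigma2=1.0, period=1.0):
    d = np.abs(xa[:, None] - xb[None, :])
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell**2)

def linear(xa, xb, sigma2=1.0, c=0.0):
    return sigma2 * (xa[:, None] - c) * (xb[None, :] - c)

x = np.linspace(0.0, 4.0, 50)

# Sum: a slowly varying trend plus a repeating seasonal component.
K_sum = rbf(x, x, ell=2.0) + periodic(x, x, period=1.0)

# Product: a periodic pattern whose amplitude grows with |x - c|.
K_prod = linear(x, x) * periodic(x, x, period=1.0)
```

Sums and elementwise products of valid kernels are themselves valid kernels, so both `K_sum` and `K_prod` can be used wherever a single kernel matrix is expected.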
Gaussian Process Regression
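Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ where $y_i = f(x_i) + \varepsilon_i$ with Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, we want to compute the posterior distribution at test points $X_*$.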
Posterior Derivation
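The joint distribution of training outputs and test function values is:
$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^\top & K_{**} \end{bmatrix}\right)$$
where $K = k(X, X)$ is the kernel matrix of the training inputs, $K_* = k(X, X_*)$ is the train-test cross-covariance, $K_{**} = k(X_*, X_*)$ is the test covariance, and $\sigma_n^2$ is the noise variance. Using the formula for Gaussian conditioning:
GP Posterior Formulas
Posterior Mean:
$$\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
Posterior Covariance:
$$\Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*$$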
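For reference, the Gaussian conditioning step relies on a standard identity for jointly Gaussian vectors. It is restated here in generic symbols; $a$, $b$, $A$, $B$, and $C$ are placeholder names chosen for this note rather than notation from the section.

```latex
\begin{bmatrix} a \\ b \end{bmatrix}
\sim \mathcal{N}\!\left(
  \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
  \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}
\right)
\;\Longrightarrow\;
b \mid a \sim \mathcal{N}\!\left(
  \mu_b + C^\top A^{-1} (a - \mu_a),\;
  B - C^\top A^{-1} C
\right)
```

Taking $a$ as the training outputs $\mathbf{y}$, $b$ as the test function values $\mathbf{f}_*$, zero means, $A = k(X, X) + \sigma_n^2 I$, $C = k(X, X_*)$, and $B = k(X_*, X_*)$ recovers the posterior mean and posterior covariance of GP regression.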
Interactive: GP Regression
Click on the plot to add data points and watch the GP posterior update. Notice how uncertainty is low near observations and high in unexplored regions. The demo exposes three controls:
- Length scale: controls smoothness of the function
- Signal variance: controls amplitude of variations
- Noise variance: controls observation noise level
Key Observations
- Uncertainty is low near observed data points
- Uncertainty increases far from observations
- The posterior mean passes close to the training data, interpolating it exactly only when the noise variance is zero
- Posterior samples all pass near data points
Application: Bayesian Optimization
Bayesian Optimization (BO) is one of the most impactful applications of GPs. It efficiently optimizes expensive black-box functions by using a GP to model the objective and an acquisition function to decide where to sample next.
The Bayesian Optimization Loop
1. Fit GP: Train a GP on all observations so far
2. Compute Acquisition: Evaluate an acquisition function (e.g., Expected Improvement) everywhere
3. Sample: Query the true objective at the point maximizing the acquisition function
4. Repeat: Add the new observation and go to step 1
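The loop above can be sketched end to end on a 1-D toy problem. This is an illustration under several assumptions: a zero-mean GP with a fixed RBF kernel, hand-picked hyperparameters (`ELL`, `SIGMA2`, `JITTER`), a UCB acquisition evaluated on a finite candidate grid, and a made-up `toy_objective`. None of these come from the original text.

```python
import numpy as np

ELL, SIGMA2, JITTER = 0.3, 1.0, 1e-6  # assumed, hand-picked values

def rbf(xa, xb):
    d = xa[:, None] - xb[None, :]
    return SIGMA2 * np.exp(-0.5 * d**2 / ELL**2)

def gp_posterior(x_obs, y_obs, x_test):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf(x_obs, x_obs) + JITTER * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    v = np.linalg.solve(L, K_s)
    var = np.clip(SIGMA2 - np.sum(v**2, axis=0), 0.0, None)
    return K_s.T @ alpha, np.sqrt(var)

def toy_objective(x):
    """Hypothetical multi-modal function standing in for an expensive objective."""
    return np.sin(3.0 * x) + 0.5 * np.cos(7.0 * x)

def bayes_opt(objective, candidates, n_init=3, n_iter=10, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x_obs = rng.choice(candidates, size=n_init, replace=False)
    y_obs = objective(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, candidates)  # 1. fit GP
        acquisition = mean + kappa * std                    # 2. acquisition (UCB)
        x_next = candidates[np.argmax(acquisition)]         # 3. pick the maximiser
        x_obs = np.append(x_obs, x_next)                    # 4. add observation, repeat
        y_obs = np.append(y_obs, objective(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best], x_obs, y_obs

candidates = np.linspace(0.0, 2.0, 201)
x_best, y_best, x_hist, y_hist = bayes_opt(toy_objective, candidates)
```

Maximising the acquisition over a fixed grid keeps the example short; in higher dimensions the inner maximisation is usually handed to a numerical optimiser instead.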
Common acquisition functions include:
| Function | Formula | Behavior |
|---|---|---|
| Expected Improvement (EI) | E[max(f(x) - f_best, 0)] | Balances exploration and exploitation elegantly |
| Upper Confidence Bound (UCB) | μ(x) + κσ(x) | κ controls exploration-exploitation tradeoff |
| Probability of Improvement (PI) | P(f(x) > f_best + ξ) | Simple but can be too exploitative |
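When the GP posterior at a point is Gaussian with mean μ(x) and standard deviation σ(x), the three acquisition functions in the table have simple closed forms for a maximisation problem. The sketch below uses only NumPy and the standard library; the function names, the `xi` and `kappa` defaults, and the small floor on the standard deviation are assumptions made for this example.

```python
import math
import numpy as np

def _norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def _norm_cdf(z):
    # Standard normal CDF via the error function, applied elementwise.
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def expected_improvement(mean, std, f_best, xi=0.0):
    """E[max(f(x) - f_best - xi, 0)] under f(x) ~ N(mean, std^2)."""
    std = np.maximum(std, 1e-12)  # avoid division by zero at observed points
    improvement = mean - f_best - xi
    z = improvement / std
    return improvement * _norm_cdf(z) + std * _norm_pdf(z)

def upper_confidence_bound(mean, std, kappa=2.0):
    """mu(x) + kappa * sigma(x); larger kappa favours exploration."""
    return mean + kappa * std

def probability_of_improvement(mean, std, f_best, xi=0.01):
    """P(f(x) > f_best + xi) under f(x) ~ N(mean, std^2)."""
    std = np.maximum(std, 1e-12)
    return _norm_cdf((mean - f_best - xi) / std)

# Two candidate points with equal uncertainty but different predicted means.
mu = np.array([0.0, 1.0])
sd = np.array([1.0, 1.0])
ei = expected_improvement(mu, sd, f_best=0.0)
ucb = upper_confidence_bound(mu, sd)
pi = probability_of_improvement(mu, sd, f_best=0.0, xi=0.0)
```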
Interactive: Bayesian Optimization
Watch Bayesian Optimization find the maximum of a multi-modal function. The acquisition function (bottom panel) shows where the algorithm wants to sample next.
How It Works
- Fit GP to current observations
- Compute acquisition function everywhere
- Sample where acquisition is maximum
- Repeat until budget exhausted
Applications in Machine Learning
🎯 Hyperparameter Optimization
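Tools like Spearmint, GPyOpt, and Ax use GP-based Bayesian optimization to tune neural network hyperparameters. On low-dimensional search spaces this often finds good configurations in tens of trials, where random search may need many more; the advantage tends to shrink as the number of hyperparameters grows.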
🔬 Active Learning
GP uncertainty guides which examples to label next. Sample where uncertainty is highest to maximize information gain. Critical in domains where labeling is expensive (medical imaging, robotics).
🧠 Neural Network Theory
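Neal's 1996 result shows that infinite-width networks are GPs at initialization. The Neural Tangent Kernel (Jacot et al., 2018) shows that, in the infinite-width limit, networks trained with gradient descent behave like kernel regression with a fixed kernel, and sufficiently wide finite networks approximate this behavior. Together these results connect GPs to modern deep learning theory.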
🤖 Robotics and Control
GPs model unknown dynamics in model-based reinforcement learning. The uncertainty estimates allow safe exploration—the robot knows when it's entering unfamiliar territory.
📈 Time Series Forecasting
GPs naturally handle irregular time series and provide uncertainty bands on forecasts. Composite kernels can model trends, seasonality, and noise simultaneously.
Python Implementation
Let's implement GP regression from scratch. The key insight is using Cholesky decomposition for numerical stability—never directly invert the covariance matrix!
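A minimal from-scratch sketch of GP regression along these lines is shown below. It assumes 1-D inputs, a zero mean, and an RBF kernel with fixed hyperparameters; the class name `GPRegressor` and its method names are choices made for this example. The Cholesky factor of $K + \sigma_n^2 I$ is computed once and reused for the mean, the covariance, and the log marginal likelihood.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays xa and xb."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

class GPRegressor:
    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-2):
        self.length_scale = length_scale
        self.signal_var = signal_var
        self.noise_var = noise_var

    def _k(self, xa, xb):
        return rbf_kernel(xa, xb, self.length_scale, self.signal_var)

    def fit(self, x_train, y_train):
        self.x_train = np.asarray(x_train, dtype=float)
        self.y_train = np.asarray(y_train, dtype=float)
        n = len(self.x_train)
        K = self._k(self.x_train, self.x_train) + self.noise_var * np.eye(n)
        self.L = np.linalg.cholesky(K)  # K + noise = L @ L.T
        # alpha = (K + noise)^{-1} y via two solves against L; no explicit inverse.
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, self.y_train))
        return self

    def predict(self, x_test):
        x_test = np.asarray(x_test, dtype=float)
        K_s = self._k(self.x_train, x_test)
        mean = K_s.T @ self.alpha                 # posterior mean
        v = np.linalg.solve(self.L, K_s)
        cov = self._k(x_test, x_test) - v.T @ v   # posterior covariance
        return mean, cov

    def log_marginal_likelihood(self):
        n = len(self.y_train)
        return (-0.5 * self.y_train @ self.alpha
                - np.sum(np.log(np.diag(self.L)))
                - 0.5 * n * np.log(2.0 * np.pi))

x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
gp = GPRegressor(noise_var=1e-6).fit(x_train, y_train)
mean, cov = gp.predict(np.array([0.0, 10.0]))  # one observed input, one far away
```

Here `np.linalg.solve` is a general solver and does not exploit the triangular structure of `L`; with SciPy available, `scipy.linalg.solve_triangular` or `scipy.linalg.cho_solve` would be the more efficient choice. The fit costs O(n³) time in the number of training points, which is the main scalability limit of exact GP regression.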
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Poisson Processes—stochastic processes that model random events occurring in time or space. While GPs model continuous function values, Poisson processes model discrete counts of events, with applications in queuing theory, network traffic, and spatial statistics.