Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Explain how MRFs differ from Bayesian Networks
- • Derive the Gibbs distribution from energy functions
- • State and apply the Hammersley-Clifford theorem
- • Understand the three Markov properties for undirected graphs
🔧 Practical Skills
- • Construct MRFs for image denoising and segmentation
- • Implement Gibbs sampling for MRF inference
- • Apply the Ising model to understand phase transitions
- • Connect MRF concepts to deep learning architectures
🧠 Deep Learning Connections
- • Energy-Based Models — Modern EBMs use neural networks to parameterize MRF energy functions
- • Restricted Boltzmann Machines — A bipartite MRF that enabled deep belief networks and pretraining
- • Conditional Random Fields — MRFs conditioned on inputs; fundamental to sequence labeling in NLP
- • Graph Neural Networks — Message passing in GNNs mirrors belief propagation in MRFs
Where You'll Apply This: Image segmentation and denoising, semantic labeling, protein structure prediction, social network analysis, stereo vision, medical image analysis, and understanding the foundations of energy-based deep learning.
The Big Picture
A Markov Random Field (MRF), also called an undirected graphical model, represents the joint distribution of random variables using an undirected graph. Unlike Bayesian Networks where edges have directions representing causal relationships, MRF edges are undirected, representing symmetric relationships like "neighbors" or "similar."
The Core Insight
MRFs model probability through energy: low-energy configurations are more probable. This physics-inspired view leads to the Gibbs distribution:. The energy function encodes our beliefs about which configurations are preferred—for example, that neighboring pixels in an image should have similar values.
Undirected: Edges represent symmetric relationships
Energy-Based: Probability from Gibbs distribution
Clique Factorization: Potentials over graph cliques
Historical Context
Ernst Ising (1925)
The Ising model, proposed to explain ferromagnetism, is the prototypical MRF. Ising's doctoral work showed that the 1D model lacks phase transitions, but the 2D model (solved by Onsager in 1944) exhibits rich critical behavior and became foundational to statistical physics and machine learning.
Hammersley & Clifford (1971)
The Hammersley-Clifford theorem established the fundamental connection between conditional independence in graphs and factorization of probability distributions. This theorem is the theoretical foundation relating MRF structure to Gibbs distributions.
Computer Vision Revolution (1980s-2000s)
Geman & Geman (1984) applied MRFs to image restoration, introducing Gibbs sampling to computer vision. MRFs became the dominant framework for low-level vision tasks until deep learning, and continue to inform modern approaches like graph neural networks.
Real-World Motivation
Consider scenarios where undirected relationships are natural:
- Image Pixels: Neighboring pixels tend to have similar values—this is a symmetric relationship. There's no causal direction; they just "influence each other."
- Social Networks: Friendship is mutual. If Alice is Bob's friend, Bob is Alice's friend. Opinions propagate bidirectionally.
- Protein Interactions: Amino acids in a protein interact based on spatial proximity, not causal chains.
- Sensor Networks: Nearby sensors measure correlated values due to spatial correlation in the underlying phenomenon.
Undirected Graphical Models
MRFs vs Bayesian Networks
Let's contrast the two major families of probabilistic graphical models:
| Property | Bayesian Networks | Markov Random Fields |
|---|---|---|
| Graph type | Directed acyclic graph (DAG) | Undirected graph |
| Factorization | P(x) = Π P(xᵢ | parents(xᵢ)) | P(x) = (1/Z) Π ψ_c(x_c) |
| Parameters | Conditional probabilities | Potential functions (non-negative) |
| Normalization | Automatic (product of CPDs) | Requires partition function Z |
| Interpretation | Causal, generative | Correlational, discriminative |
| Independence | d-separation | Graph separation |
| Common use | Causal reasoning, diagnostics | Image processing, labeling |
Graph Structure and Notation
An MRF is defined by an undirected graph where:
- Vertices V: Represent random variables
- Edges E: Connect variables with direct statistical dependencies
- Cliques: Fully connected subgraphs (every pair of nodes is connected)
- Maximal Cliques: Cliques that cannot be extended by adding more nodes
Graph Structure Example
Consider a chain graph: X₁ — X₂ — X₃ — X₄
- • Edges: (X₁,X₂), (X₂,X₃), (X₃,X₄)
- • Cliques of size 2: {X₁,X₂}, {X₂,X₃}, {X₃,X₄}
- • Maximal cliques: Same as above (no larger complete subgraphs exist)
- • Neighbors of X₂: {X₁, X₃} (Markov blanket)
Markov Properties
MRFs satisfy three equivalent Markov properties that characterize conditional independence relationships implied by the graph structure.
Global Markov Property
Global Markov Property
If removing nodes in C disconnects nodes in A from nodes in B, then A and B are conditionally independent given C.
Local Markov Property
Local Markov Property
Each node is conditionally independent of all non-neighbors given its neighbors (Markov blanket).
The local Markov property is particularly important for inference: to update node , we only need to know its neighbors' values.
Pairwise Markov Property
Pairwise Markov Property
Non-adjacent nodes are conditionally independent given all other nodes.
Gibbs Distribution and Energy
MRFs specify probability through potential functions and the Gibbs distribution, borrowing concepts from statistical physics.
Energy Functions
An energy function assigns a real number to each configuration . Lower energy means higher probability:
Gibbs Distribution
where is temperature and is the partition function.
The energy typically decomposes as a sum of terms over cliques:
where:
- Unary terms : Energy contribution from individual nodes (data fidelity)
- Pairwise terms : Energy contribution from pairs (smoothness/compatibility)
- Higher-order terms: Energy over larger cliques (rare in practice)
The Partition Function
The partition function normalizes the distribution:
Hammersley-Clifford Theorem
The Hammersley-Clifford theorem is the fundamental result connecting graph structure to probability factorization in MRFs.
Hammersley-Clifford Theorem
A positive distribution for all satisfies the Markov properties with respect to graph if and only if it can be written as:
where is the set of maximal cliques and are non-negative potential functions.
Clique Factorization
The theorem tells us that the joint distribution factorizes over maximal cliques. Each potential function captures local interactions within the clique.
Key implications:
- Graph determines factorization: The clique structure of the graph dictates which variables can appear together in potential functions.
- Potentials need not be probabilities: Unlike Bayesian Networks, are not conditional probabilities—they can be any non-negative functions.
- Energy interpretation: Taking logarithms,, giving an additive energy decomposition.
Interactive: MRF Explorer
Explore a Markov Random Field interactively. Click nodes to toggle their values and observe how energy and probability change. Watch how the Gibbs sampler updates node states based on their neighbors.
Click nodes to toggle their values. Observe how energy and probability change.
Low T → sharp distribution (low entropy), High T → uniform distribution
Gibbs Sampling
Watch Gibbs sampling update nodes based on their neighbors
The Ising Model
The Ising model is the simplest non-trivial MRF, yet exhibits remarkably rich behavior. Originally designed to explain ferromagnetism, it serves as the prototype for understanding MRF behavior.
Ising Energy Function
Consider a lattice of spins . The Ising energy is:
Ising Model Energy
Favors aligned spins when J > 0
Biases spins toward +1 when h > 0
The notation means summing over neighboring pairs on the lattice.
Phase Transitions
The 2D Ising model on a square lattice exhibits a phase transition at the critical temperature :
| Regime | Temperature | Behavior | Magnetization |
|---|---|---|---|
| Ordered (Ferromagnetic) | T < Tꜿ | Spins align spontaneously | |M| → 1 |
| Critical | T = Tꜿ | Power-law correlations, scale invariance | |M| → 0 slowly |
| Disordered (Paramagnetic) | T > Tꜿ | Spins random, no long-range order | |M| → 0 quickly |
This phase transition is continuous (second-order), with the magnetization vanishing smoothly at . The critical point exhibits universality: many different systems share the same critical exponents.
Interactive: Ising Model
Watch the 2D Ising model evolve under Metropolis-Hastings sampling. Below the critical temperature, observe spontaneous magnetization emerging. Above it, observe random disorder.
Watch the 2D Ising model evolve via Metropolis-Hastings sampling. Below critical temperature Tc = 2.27, spontaneous magnetization occurs.
J > 0: ferromagnetic (prefer alignment)
Phase Transition
- • T < Tc: Ordered phase. Spins align, |M| → 1
- • T = Tc: Critical point. Power-law correlations
- • T > Tc: Disordered phase. Random spins, |M| → 0
Applications in AI and ML
MRFs have been foundational in computer vision and continue to influence modern deep learning. Understanding them provides insight into both classical and contemporary approaches.
Image Processing
MRFs are natural for image modeling: pixels form a grid, and the MRF structure encodes that neighboring pixels should be similar. Key applications include:
🖼️ Image Denoising
Energy: E(x|y) = Σ(xᵢ - yᵢ)² + β Σ(xᵢ - xⱼ)²
Balance data fidelity (match noisy observation) with smoothness (neighbors similar).
🎨 Image Segmentation
Assign labels to pixels (e.g., foreground/background). Pairwise potentials encourage neighboring pixels with similar colors to have the same label.
📏 Stereo Vision
Estimate depth from stereo pairs. MRF encodes that depth should vary smoothly except at object boundaries.
Interactive: Image Denoising
Watch a Markov Random Field denoise a binary image. The MRF balances matching the noisy observations (data term) with encouraging smoothness (pairwise term).
Watch a Markov Random Field denoise a binary image using the ICM algorithm. The MRF balances data fidelity with spatial smoothness.
Higher β = smoother result, may lose detail
MRF Energy Function
E(x|y) = Σ(xi - yi)2 + β Σ(xi - xj)2
- • Data term: Denoised should match noisy observation
- • Smoothness term: Neighboring pixels should be similar
- • β: Balances fidelity vs smoothness
Deep Learning Connections
MRFs have deeply influenced modern deep learning, both historically and conceptually:
🧠 Restricted Boltzmann Machines (RBMs)
A bipartite MRF with visible and hidden units. No connections within layers, enabling efficient block Gibbs sampling. RBMs were key to the deep learning renaissance—they enabled unsupervised pretraining of deep belief networks (Hinton, 2006).
🏷️ Conditional Random Fields (CRFs)
MRFs conditioned on input features: P(y|x) ∝ exp(-E(y; x)). Dominant in NLP for sequence labeling (POS tagging, NER, chunking). Now often used as the final layer in neural sequence models (BiLSTM-CRF).
⚡ Energy-Based Models (EBMs)
Modern EBMs use neural networks to parameterize energy functions. The energy function can be highly expressive, but training requires estimating gradients of log Z—typically via contrastive divergence or noise contrastive estimation.
🔗 Graph Neural Networks
Message passing in GNNs mirrors belief propagation in MRFs. Each node aggregates information from neighbors—exactly as in MRF inference. This connection helps explain why GNNs work and suggests improvements.
Python Implementation
Let's implement a Markov Random Field from scratch, including Gibbs sampling and mean-field variational inference.
pgmpy for general graphical models, pystruct for structured prediction, or OpenGM for optimization-based inference.Knowledge Check
Test your understanding of Markov Random Fields with these questions covering key concepts from this section:
What is the key difference between Bayesian Networks and Markov Random Fields?
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Inference in Graphical Models—the algorithms that make MRFs and BNs practical. We'll cover exact methods like variable elimination and the junction tree algorithm, plus approximate methods including belief propagation, variational inference, and sampling techniques.