Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain how MRFs differ from Bayesian Networks
• Derive the Gibbs distribution from energy functions
• State and apply the Hammersley-Clifford theorem
• Understand the three Markov properties for undirected graphs

🔧 Practical Skills

• Construct MRFs for image denoising and segmentation
• Implement Gibbs sampling for MRF inference
• Apply the Ising model to understand phase transitions
• Connect MRF concepts to deep learning architectures

🧠 Deep Learning Connections

• Energy-Based Models — Modern EBMs use neural networks to parameterize MRF energy functions
• Restricted Boltzmann Machines — A bipartite MRF that enabled deep belief networks and pretraining
• Conditional Random Fields — MRFs conditioned on inputs; fundamental to sequence labeling in NLP
• Graph Neural Networks — Message passing in GNNs mirrors belief propagation in MRFs

Where You'll Apply This: Image segmentation and denoising, semantic labeling, protein structure prediction, social network analysis, stereo vision, medical image analysis, and understanding the foundations of energy-based deep learning.

The Big Picture

A Markov Random Field (MRF), also called an undirected graphical model, represents the joint distribution of random variables using an undirected graph. Unlike Bayesian Networks where edges have directions representing causal relationships, MRF edges are undirected, representing symmetric relationships like "neighbors" or "similar."

The Core Insight

MRFs model probability through energy: low-energy configurations are more probable. This physics-inspired view leads to the Gibbs distribution: $P(x) \propto e^{-E(x)}$ . The energy function encodes our beliefs about which configurations are preferred—for example, that neighboring pixels in an image should have similar values.

🔗

Undirected: Edges represent symmetric relationships

⚡

Energy-Based: Probability from Gibbs distribution

🧩

Clique Factorization: Potentials over graph cliques

Historical Context

📜

Ernst Ising (1925)

The Ising model, proposed to explain ferromagnetism, is the prototypical MRF. Ising's doctoral work showed that the 1D model lacks phase transitions, but the 2D model (solved by Onsager in 1944) exhibits rich critical behavior and became foundational to statistical physics and machine learning.

🏛️

Hammersley & Clifford (1971)

The Hammersley-Clifford theorem established the fundamental connection between conditional independence in graphs and factorization of probability distributions. This theorem is the theoretical foundation relating MRF structure to Gibbs distributions.

🖼️

Computer Vision Revolution (1980s-2000s)

Geman & Geman (1984) applied MRFs to image restoration, introducing Gibbs sampling to computer vision. MRFs became the dominant framework for low-level vision tasks until deep learning, and continue to inform modern approaches like graph neural networks.

Real-World Motivation

Consider scenarios where undirected relationships are natural:

Image Pixels: Neighboring pixels tend to have similar values—this is a symmetric relationship. There's no causal direction; they just "influence each other."
Social Networks: Friendship is mutual. If Alice is Bob's friend, Bob is Alice's friend. Opinions propagate bidirectionally.
Protein Interactions: Amino acids in a protein interact based on spatial proximity, not causal chains.
Sensor Networks: Nearby sensors measure correlated values due to spatial correlation in the underlying phenomenon.

Undirected Graphical Models

MRFs vs Bayesian Networks

Let's contrast the two major families of probabilistic graphical models:

Property	Bayesian Networks	Markov Random Fields
Graph type	Directed acyclic graph (DAG)	Undirected graph
Factorization	P(x) = Π P(xᵢ \| parents(xᵢ))	P(x) = (1/Z) Π ψ_c(x_c)
Parameters	Conditional probabilities	Potential functions (non-negative)
Normalization	Automatic (product of CPDs)	Requires partition function Z
Interpretation	Causal, generative	Correlational, discriminative
Independence	d-separation	Graph separation
Common use	Causal reasoning, diagnostics	Image processing, labeling

Converting Between Representations: Any BN can be converted to an MRF by "moralizing" (connecting parents and dropping directions), but this may lose independence information. Converting an MRF to a BN requires adding edges, potentially creating a denser graph.

Graph Structure and Notation

An MRF is defined by an undirected graph $G = (V, E)$ where:

Vertices V: Represent random variables $X_1, X_2, \ldots, X_n$
Edges E: Connect variables with direct statistical dependencies
Cliques: Fully connected subgraphs (every pair of nodes is connected)
Maximal Cliques: Cliques that cannot be extended by adding more nodes

Graph Structure Example

Consider a chain graph: X₁ — X₂ — X₃ — X₄

• Edges: (X₁,X₂), (X₂,X₃), (X₃,X₄)
• Cliques of size 2: {X₁,X₂}, {X₂,X₃}, {X₃,X₄}
• Maximal cliques: Same as above (no larger complete subgraphs exist)
• Neighbors of X₂: {X₁, X₃} (Markov blanket)

Markov Properties

MRFs satisfy three equivalent Markov properties that characterize conditional independence relationships implied by the graph structure.

Global Markov Property

X_A \perp\!\!\!\perp X_B \mid X_C \text{ if } C \text{ separates } A \text{ from } B \text{ in } G

If removing nodes in C disconnects nodes in A from nodes in B, then A and B are conditionally independent given C.

Local Markov Property

X_i \perp\!\!\!\perp X_{V \setminus (\{i\} \cup N(i))} \mid X_{N(i)}

Each node is conditionally independent of all non-neighbors given its neighbors (Markov blanket).

The local Markov property is particularly important for inference: to update node $i$ , we only need to know its neighbors' values.

Pairwise Markov Property

X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}} \text{ if } (i,j) \notin E

Non-adjacent nodes are conditionally independent given all other nodes.

Equivalence: For positive distributions (all configurations have positive probability), these three properties are equivalent. The local property is most useful for Gibbs sampling; the global property for understanding conditional independence.

Gibbs Distribution and Energy

MRFs specify probability through potential functions and the Gibbs distribution, borrowing concepts from statistical physics.

Energy Functions

An energy function $E(x)$ assigns a real number to each configuration $x = (x_1, \ldots, x_n)$ . Lower energy means higher probability:

Gibbs Distribution

P(x) = \frac{1}{Z} \exp\left(-\frac{E(x)}{T}\right)

where $T$ is temperature and $Z$ is the partition function.

The energy typically decomposes as a sum of terms over cliques:

E(x) = \sum_{c \in \mathcal{C}} \phi_c(x_c)

where:

Unary terms $\phi_i(x_i)$ : Energy contribution from individual nodes (data fidelity)
Pairwise terms $\phi_{ij}(x_i, x_j)$ : Energy contribution from pairs (smoothness/compatibility)
Higher-order terms: Energy over larger cliques (rare in practice)

The Partition Function

The partition function $Z$ normalizes the distribution:

Z = \sum_{x} \exp\left(-\frac{E(x)}{T}\right)

Computational Challenge: Computing

Z

requires summing over all possible configurations—exponential in the number of variables. This makes exact inference intractable for most MRFs, motivating approximate methods like MCMC and variational inference.

Hammersley-Clifford Theorem

The Hammersley-Clifford theorem is the fundamental result connecting graph structure to probability factorization in MRFs.

Hammersley-Clifford Theorem

A positive distribution $P(x) > 0$ for all $x$ satisfies the Markov properties with respect to graph $G$ if and only if it can be written as:

P(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)

where $\mathcal{C}$ is the set of maximal cliques and $\psi_c$ are non-negative potential functions.

Clique Factorization

The theorem tells us that the joint distribution factorizes over maximal cliques. Each potential function $\psi_c(x_c)$ captures local interactions within the clique.

Key implications:

Graph determines factorization: The clique structure of the graph dictates which variables can appear together in potential functions.
Potentials need not be probabilities: Unlike Bayesian Networks, $\psi_c$ are not conditional probabilities—they can be any non-negative functions.
Energy interpretation: Taking logarithms, $E(x) = -\sum_c \log \psi_c(x_c)$ , giving an additive energy decomposition.

Pairwise MRFs: When the graph has no triangles (or we restrict to pairwise potentials), the factorization becomes

P(x) \propto \prod_i \psi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)

. This is the most common form in applications.

Interactive: MRF Explorer

Explore a Markov Random Field interactively. Click nodes to toggle their values and observe how energy and probability change. Watch how the Gibbs sampler updates node states based on their neighbors.

🕸️Interactive Markov Random Field

Click nodes to toggle their values. Observe how energy and probability change.

Value = 1

Value = 0

Aligned

Misaligned

Energy E(x)

-0.30

exp(-E/T)

1.350

Temperature (T)1.00

Low T → sharp distribution (low entropy), High T → uniform distribution

Gibbs Sampling

Watch Gibbs sampling update nodes based on their neighbors

Maximal Cliques:

Clique 1 (A,B,C)

Clique 2 (B,D)

Clique 3 (C,E)

Clique 4 (D,E)

The Ising Model

The Ising model is the simplest non-trivial MRF, yet exhibits remarkably rich behavior. Originally designed to explain ferromagnetism, it serves as the prototype for understanding MRF behavior.

Ising Energy Function

Consider a lattice of spins $s_i \in \{-1, +1\}$ . The Ising energy is:

Ising Model Energy

E(s) = -J \sum_{\langle i,j \rangle} s_i s_j - h \sum_i s_i

Coupling term:

-J \sum s_i s_j

Favors aligned spins when J > 0

External field:

-h \sum s_i

Biases spins toward +1 when h > 0

The notation $\langle i,j \rangle$ means summing over neighboring pairs on the lattice.

Phase Transitions

The 2D Ising model on a square lattice exhibits a phase transition at the critical temperature $T_c = \frac{2J}{\ln(1+\sqrt{2})} \approx 2.27J$ :

Regime	Temperature	Behavior	Magnetization
Ordered (Ferromagnetic)	T < Tꜿ	Spins align spontaneously	\|M\| → 1
Critical	T = Tꜿ	Power-law correlations, scale invariance	\|M\| → 0 slowly
Disordered (Paramagnetic)	T > Tꜿ	Spins random, no long-range order	\|M\| → 0 quickly

This phase transition is continuous (second-order), with the magnetization vanishing smoothly at $T_c$ . The critical point exhibits universality: many different systems share the same critical exponents.

Interactive: Ising Model

Watch the 2D Ising model evolve under Metropolis-Hastings sampling. Below the critical temperature, observe spontaneous magnetization emerging. Above it, observe random disorder.

🧲Interactive Ising Model

Watch the 2D Ising model evolve via Metropolis-Hastings sampling. Below critical temperature T_c = 2.27, spontaneous magnetization occurs.

Steps: 0

Magnetization |M|

0.057

Energy per Spin

0.016

Temperature (T)

2.00Ordered

0.5 (Cold)T_c = 2.275.0 (Hot)

Coupling J1.0

J > 0: ferromagnetic (prefer alignment)

Phase Transition

• T < T_c: Ordered phase. Spins align, |M| → 1
• T = T_c: Critical point. Power-law correlations
• T > T_c: Disordered phase. Random spins, |M| → 0

Spin +1 (Up)

Spin -1 (Down)

Applications in AI and ML

MRFs have been foundational in computer vision and continue to influence modern deep learning. Understanding them provides insight into both classical and contemporary approaches.

Image Processing

MRFs are natural for image modeling: pixels form a grid, and the MRF structure encodes that neighboring pixels should be similar. Key applications include:

🖼️ Image Denoising

Energy: E(x|y) = Σ(xᵢ - yᵢ)² + β Σ(xᵢ - xⱼ)²
Balance data fidelity (match noisy observation) with smoothness (neighbors similar).

🎨 Image Segmentation

Assign labels to pixels (e.g., foreground/background). Pairwise potentials encourage neighboring pixels with similar colors to have the same label.

📏 Stereo Vision

Estimate depth from stereo pairs. MRF encodes that depth should vary smoothly except at object boundaries.

Interactive: Image Denoising

Watch a Markov Random Field denoise a binary image. The MRF balances matching the noisy observations (data term) with encouraging smoothness (pairwise term).

🖼️MRF Image Denoising

Watch a Markov Random Field denoise a binary image using the ICM algorithm. The MRF balances data fidelity with spatial smoothness.

Original

Noisy (20%)

Denoised (PSNR: 7.0 dB)

Pattern

Noise Level20%

Smoothing (β)2.0

Higher β = smoother result, may lose detail

Iterations: 0

MRF Energy Function

E(x|y) = Σ(x_i - y_i)² + β Σ(x_i - x_j)²

• Data term: Denoised should match noisy observation
• Smoothness term: Neighboring pixels should be similar
• β: Balances fidelity vs smoothness

Deep Learning Connections

MRFs have deeply influenced modern deep learning, both historically and conceptually:

🧠 Restricted Boltzmann Machines (RBMs)

A bipartite MRF with visible and hidden units. No connections within layers, enabling efficient block Gibbs sampling. RBMs were key to the deep learning renaissance—they enabled unsupervised pretraining of deep belief networks (Hinton, 2006).

E(v,h) = -v^TWh - b^Tv - c^Th

🏷️ Conditional Random Fields (CRFs)

MRFs conditioned on input features: P(y|x) ∝ exp(-E(y; x)). Dominant in NLP for sequence labeling (POS tagging, NER, chunking). Now often used as the final layer in neural sequence models (BiLSTM-CRF).

⚡ Energy-Based Models (EBMs)

Modern EBMs use neural networks to parameterize energy functions. The energy function can be highly expressive, but training requires estimating gradients of log Z—typically via contrastive divergence or noise contrastive estimation.

🔗 Graph Neural Networks

Message passing in GNNs mirrors belief propagation in MRFs. Each node aggregates information from neighbors—exactly as in MRF inference. This connection helps explain why GNNs work and suggests improvements.

Python Implementation

Let's implement a Markov Random Field from scratch, including Gibbs sampling and mean-field variational inference.

Markov Random Field Implementation

🐍markov_random_field.py

Explanation(12)

Code(106)

1Import NumPy

NumPy provides array operations for efficiently computing energies and potentials over grid structures.

5Class Definition

We encapsulate the MRF as a class storing the graph structure, node potentials (unary), and edge potentials (pairwise).

6Constructor

Takes the number of nodes, their possible states, and initial potential functions. Edges define the graph topology.

14Add Edge Method

Adds an undirected edge between two nodes. Since MRFs are undirected, we don't need to specify direction.

19Set Unary Potential

Sets the potential ψ(x_i) for a single node. This captures the 'preference' for each state based on observed data.

23Set Pairwise Potential

Sets the potential ψ(x_i, x_j) for an edge. This encodes how pairs of states interact—do they attract or repel?

EXAMPLE

For Ising: ψ(+1,+1) = ψ(-1,-1) = exp(J) and ψ(+1,-1) = ψ(-1,+1) = exp(-J)

28Compute Energy

The total energy E(x) is the sum of all potentials. We use negative log of potentials: E = -log(ψ).

37Sum Over Cliques

For each maximal clique, we add the corresponding energy term. For pairwise MRFs, cliques are just edges.

44Compute Probability

The Gibbs distribution: P(x) = exp(-E(x)/T) / Z. Temperature T controls the sharpness of the distribution.

49Gibbs Sampling

Sample from the joint by iteratively sampling each node conditioned on its neighbors (Markov blanket).

55Conditional Distribution

For node i, compute P(x_i | x_{-i}) ∝ ψ_i(x_i) × Π_{j∈N(i)} ψ_{ij}(x_i, x_j). This is the key to Gibbs sampling.

63Normalize and Sample

Normalize the conditional probabilities and sample the new state for node i. This maintains detailed balance.

94 lines without explanation

1import numpy as np
2from typing import Dict, List, Tuple, Optional
3from collections import defaultdict
4
5class MarkovRandomField:
6    def __init__(self, n_nodes: int, n_states: int = 2):
7        self.n_nodes = n_nodes
8        self.n_states = n_states
9        self.edges: List[Tuple[int, int]] = []
10        self.neighbors: Dict[int, List[int]] = defaultdict(list)
11        self.unary_potentials = np.ones((n_nodes, n_states))
12        self.pairwise_potentials: Dict[Tuple[int, int], np.ndarray] = {}
13
14    def add_edge(self, i: int, j: int):
15        """Add undirected edge between nodes i and j."""
16        self.edges.append((i, j))
17        self.neighbors[i].append(j)
18        self.neighbors[j].append(i)
19
20    def set_unary_potential(self, node: int, potential: np.ndarray):
21        """Set node potential ψ_i(x_i)."""
22        self.unary_potentials[node] = potential
23
24    def set_pairwise_potential(self, i: int, j: int, potential: np.ndarray):
25        """Set edge potential ψ_{ij}(x_i, x_j)."""
26        edge = (min(i, j), max(i, j))
27        self.pairwise_potentials[edge] = potential
28
29    def compute_energy(self, config: np.ndarray, temperature: float = 1.0) -> float:
30        """Compute total energy E(x) = -log P(x) + const."""
31        energy = 0.0
32
33        # Unary terms
34        for i in range(self.n_nodes):
35            state = int(config[i])
36            energy -= np.log(self.unary_potentials[i, state] + 1e-10)
37
38        # Pairwise terms
39        for (i, j) in self.edges:
40            edge = (min(i, j), max(i, j))
41            if edge in self.pairwise_potentials:
42                si, sj = int(config[i]), int(config[j])
43                energy -= np.log(self.pairwise_potentials[edge][si, sj] + 1e-10)
44
45        return energy / temperature
46
47    def compute_probability(self, config: np.ndarray, temperature: float = 1.0) -> float:
48        """Compute unnormalized probability P(x) ∝ exp(-E(x)/T)."""
49        return np.exp(-self.compute_energy(config, temperature))
50
51    def gibbs_sample(self, n_samples: int, burn_in: int = 100,
52                     temperature: float = 1.0) -> np.ndarray:
53        """Generate samples using Gibbs sampling."""
54        # Initialize randomly
55        config = np.random.randint(0, self.n_states, self.n_nodes)
56        samples = []
57
58        for step in range(burn_in + n_samples):
59            # Update each node in random order
60            order = np.random.permutation(self.n_nodes)
61            for i in order:
62                # Compute conditional distribution P(x_i | x_{-i})
63                log_probs = np.log(self.unary_potentials[i] + 1e-10)
64
65                for j in self.neighbors[i]:
66                    edge = (min(i, j), max(i, j))
67                    if edge in self.pairwise_potentials:
68                        sj = int(config[j])
69                        log_probs += np.log(
70                            self.pairwise_potentials[edge][:, sj] + 1e-10
71                        )
72
73                # Sample from conditional
74                probs = np.exp(log_probs / temperature)
75                probs /= probs.sum()
76                config[i] = np.random.choice(self.n_states, p=probs)
77
78            if step >= burn_in:
79                samples.append(config.copy())
80
81        return np.array(samples)
82
83    def mean_field_inference(self, n_iters: int = 100,
84                             temperature: float = 1.0) -> np.ndarray:
85        """Variational inference using mean field approximation."""
86        # Initialize beliefs uniformly
87        q = np.ones((self.n_nodes, self.n_states)) / self.n_states
88
89        for _ in range(n_iters):
90            for i in range(self.n_nodes):
91                log_belief = np.log(self.unary_potentials[i] + 1e-10)
92
93                for j in self.neighbors[i]:
94                    edge = (min(i, j), max(i, j))
95                    if edge in self.pairwise_potentials:
96                        # Expected message from neighbor j
97                        log_belief += np.dot(
98                            self.pairwise_potentials[edge],
99                            q[j]
100                        )
101
102                # Update belief with temperature
103                q[i] = np.exp(log_belief / temperature)
104                q[i] /= q[i].sum()
105
106        return q

Libraries for MRFs: For production use, consider pgmpy for general graphical models, pystruct for structured prediction, or OpenGM for optimization-based inference.

Knowledge Check

Test your understanding of Markov Random Fields with these questions covering key concepts from this section:

📝Knowledge Check

Question 1 of 8

What is the key difference between Bayesian Networks and Markov Random Fields?

Summary

Key Takeaways

✅MRFs use undirected graphs to represent symmetric dependencies, unlike the directed edges in Bayesian Networks.

✅Gibbs distribution: P(x) ∝ exp(-E(x)/T) assigns higher probability to lower-energy configurations.

✅Hammersley-Clifford theorem shows the distribution factorizes as a product of potentials over maximal cliques.

✅The local Markov property states that each node is conditionally independent of non-neighbors given its neighbors.

✅The Ising model is the prototypical MRF, exhibiting phase transitions at critical temperature.

✅Computing Z is intractable; we use approximate inference methods like Gibbs sampling or mean-field.

✅Image processing naturally fits the MRF framework: grid graphs with pairwise smoothness priors.

✅RBMs, CRFs, and EBMs are MRF-based models that remain relevant in modern deep learning.

Looking Ahead

In the next section, we'll explore Inference in Graphical Models—the algorithms that make MRFs and BNs practical. We'll cover exact methods like variable elimination and the junction tree algorithm, plus approximate methods including belief propagation, variational inference, and sampling techniques.