Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain the structure learning problem and its challenges
• Define Markov equivalence and its implications
• Compare constraint-based vs score-based approaches
• Understand v-structures and edge orientation rules

🔧 Practical Skills

• Implement the PC algorithm for skeleton discovery
• Apply BIC and other scoring functions
• Perform greedy search over DAG space
• Choose appropriate methods for different data settings

🧠 Deep Learning Connections

• Causal Discovery — Learning causal structure from observational data for interpretable AI
• Neural Causal Models — Using neural networks to model nonlinear causal relationships
• Fairness & Explainability — Identifying causal paths for algorithmic fairness and model interpretability
• Transfer Learning — Leveraging causal invariances for robust generalization

Where You'll Apply This: Causal discovery from observational data, gene regulatory network inference, root cause analysis in complex systems, interpretable machine learning, fairness auditing, and scientific hypothesis generation.

The Big Picture

So far, we have learned how to perform inference in graphical models when the structure is known. But in many real-world applications, we don't know the true structure—we must discover it from data. This is the structure learning problem, one of the most important and challenging problems in probabilistic modeling.

The Core Question

Given data, can we discover the underlying causal structure?

📊

Data: Observations of multiple variables

🔍

Goal: Infer the graph structure

🎯

Result: DAG representing causal relationships

Why Learn Structure?

Structure learning serves several critical purposes:

🔬

Scientific Discovery

Discover causal mechanisms from observational data. Gene regulatory networks, protein interactions, and brain connectivity are all inferred through structure learning.

🎯

Causal Inference

Enable causal reasoning: "What would happen if we intervened on X?" Structure reveals which paths are causal vs spurious correlations.

🧠

Interpretable ML

Build interpretable models by learning feature dependencies. Understand why a model makes predictions, not just what it predicts.

Historical Development

📜

Spirtes, Glymour & Scheines (1993)

The PC algorithm (named after Peter and Clark) was developed at CMU. This constraint-based method uses conditional independence tests to systematically prune edges from a complete graph, then orients remaining edges using v-structures.

📊

Bayesian Structure Learning (1990s-2000s)

Score-based methods emerged with Bayesian scoring (BDe, BGe) and efficient search algorithms. The BIC score provided a principled way to balance fit and complexity. MCMC methods allowed sampling from the posterior over structures.

🚀

Modern Era (2010s-present)

Continuous optimization reformulations (NOTEARS), neural causal discovery, and integration with deep learning. Methods now scale to high-dimensional data and can handle nonlinear relationships.

Mathematical Foundations

The Structure Learning Problem

Given a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ of $n$ i.i.d. samples over $p$ variables, we want to find the DAG $G = (V, E)$ that best represents the causal or conditional independence structure.

Formalization

Search Space:

\mathcal{G} = \{\text{all DAGs over } p \text{ nodes}\}

Size grows super-exponentially: $|\mathcal{G}| \sim O(2^{p^2})$

Objective (Score-Based):

G^* = \arg\max_{G \in \mathcal{G}} \text{Score}(G, \mathcal{D})

Objective (Constraint-Based):

Find G such that CI relations in data match d-separation in G

Markov Equivalence Classes

A fundamental challenge is that different DAGs can encode the same conditional independencies. Such DAGs are called Markov equivalent.

Markov Equivalence Theorem

Two DAGs are Markov equivalent if and only if they have:

The same skeleton (undirected edges)
The same v-structures (colliders: X → Z ← Y where X, Y not adjacent)

X → Y

X causes Y

X ← Y

Y causes X

Same CI!
Both imply X ⫫̸ Y (dependent)

Identifiability Limit: From observational data alone, we can only identify the Markov equivalence class, not the unique causal DAG. To distinguish between equivalent DAGs, we need interventional data or additional assumptions (e.g., linear non-Gaussian models, additive noise models).

Constraint-Based Methods

Constraint-based methods learn structure by testing conditional independence relationships in the data. The key idea: if X and Y are conditionally independent given Z, there is no direct edge between them (with Z on the path).

Conditional Independence Testing

The foundation of constraint-based learning is the statistical test for conditional independence: $X \perp\!\!\!\perp Y \mid Z$ .

Partial Correlation Test (Gaussian Data)

For Gaussian data, conditional independence is equivalent to zero partial correlation.

\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}{\sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}}

Under the null hypothesis of conditional independence, Fisher's z-transform of the partial correlation follows a normal distribution:

z = \frac{1}{2}\ln\frac{1+r}{1-r} \cdot \sqrt{n - |Z| - 3} \sim N(0, 1)

The PC Algorithm

The PC algorithm (Peter-Clark) is the foundational constraint-based method. It operates in three phases:

Skeleton Discovery

• Start with a complete undirected graph
• For conditioning sets of increasing size d = 0, 1, 2, ...
• Test $X \perp\!\!\!\perp Y \mid S$ for all subsets S of size d from neighbors
• Remove edge X—Y if independent for any conditioning set
• Store the separating set S_XY

V-Structure Orientation

• Find unshielded triples: X—Z—Y where X,Y not adjacent
• If Z ∉ S_XY (separating set), orient as X → Z ← Y
• This is a v-structure (collider at Z)

Edge Orientation (Meek Rules)

• R1: If X → Z—Y and X,Y not adjacent, orient Z → Y
• R2: If X → Z → Y and X—Y, orient X → Y (avoid cycle)
• R3: If X—Z, X—Y, Z → W ← Y, orient X → Z
• Apply iteratively until no more changes

Interactive: PC Algorithm

Watch the PC algorithm discover structure step by step. Observe how conditional independence tests remove spurious edges and how v-structures guide edge orientation.

PC Algorithm: Constraint-Based Structure Learning

Discover causal structure through systematic conditional independence testing

Current Graph State

Undirected

Directed

Removed

Current Phase

InitSkeletonOrientDone

Significance Level (α)

0.05

Lower α = fewer edges removed (more conservative)

Independence Tests

No tests run yet

Key Insight: The PC algorithm uses conditional independence tests to systematically eliminate edges. If X and Y are conditionally independent given some set Z, there is no direct edge between them. The algorithm discovers v-structures (colliders like A → D ← B) to orient edges, since conditioning on a collider creates dependence where none existed.

Score-Based Methods

Score-based methods take a different approach: assign a score to each candidate DAG and search for the highest-scoring structure. This turns structure learning into an optimization problem.

Scoring Functions

Score	Formula	Properties
BIC	log P(D\|G,θ̂) - (k/2)log(n)	Asymptotically consistent, penalizes complexity
AIC	log P(D\|G,θ̂) - k	Less penalty than BIC, tends to select larger graphs
BGe	log P(D\|G) (marginal likelihood)	Fully Bayesian, integrates over parameters
BDe/BDeu	Dirichlet-multinomial marginal	For discrete data, with equivalent sample size

BIC Score Decomposition

A key property: BIC decomposes into local scores for each node:

\text{BIC}(G) = \sum_{i=1}^{p} \text{BIC}_{\text{local}}(X_i, \text{Pa}_G(X_i))

This decomposition enables efficient local search—we only need to recompute scores for affected nodes when modifying an edge.

Search Strategies

Greedy Hill Climbing

• Start from empty/random DAG
• Consider: add, delete, reverse edge
• Accept change that improves score most
• Stop when no improvement possible

Pro: Fast, simple. Con: Local optima.

MCMC Structure Sampling

• Sample from posterior P(G|D)
• Propose edge changes
• Accept/reject via Metropolis-Hastings
• Averages over model uncertainty

Pro: Principled uncertainty. Con: Slow convergence.

Exact Methods (Dynamic Programming)

• Optimal for small graphs (p ≤ 20-25)
• Order-based DP exploits decomposability
• O(p · 2^p) time and space

Pro: Guaranteed optimal. Con: Exponential complexity.

NOTEARS (Continuous Optimization)

• Reformulate as continuous optimization
• Acyclicity via algebraic constraint
• Use gradient-based optimization
• Scales to 1000s of variables

Pro: Scalable. Con: Local optima, assumptions.

Interactive: Score-Based Search

Explore how greedy search navigates the space of DAGs, guided by the scoring function. Compare how different scoring criteria lead to different learned structures.

Score-Based Structure Search

Find the highest-scoring DAG through greedy search over structures

Scoring Function

Bayesian Information Criterion: balances fit and complexity

Current Structure

Score: -150.2

Empty

0 edge(s)

Best Structure Found

Score: -150.2

Empty

Structure Space (3 nodes)

Empty

-150.2

X→Y

-125.3

X→Z

-130.1

Y→Z

-135.4

Fork (X)

-98.7

Chain

-102.5

Collider (X)

-110.3

Full (acyclic)

-89.2

Collider (Y)

-95.8

Search Path:

Empty

Scoring Functions

BIC:

log P(D|G, θ̂) - (k/2)log(n)

AIC:

log P(D|G, θ̂) - k

BGe:

log P(D|G) (marginal likelihood)

where k = number of parameters, n = sample size, D = data, G = graph

Hybrid and Modern Methods

Modern structure learning often combines the strengths of both approaches:

GES (Greedy Equivalence Search)

Two-phase search: first add edges greedily to increase score, then delete edges. Operates on equivalence classes (CPDAGs) rather than individual DAGs. Proven consistent under faithfulness.

Hybrid Methods (MMHC, H2PC)

Use constraint-based methods to find skeleton (restricts search space), then score-based search over orientations. Combines efficiency of CI tests with optimality of scoring.

Neural Structure Learning (DAG-GNN, NOTEARS-MLP)

Use neural networks to model nonlinear relationships while learning structure. Can handle high-dimensional data and complex functional forms. Active research area.

Interactive: Structure Learning Explorer

This comprehensive demo shows the full structure learning process from data to DAG. Watch how statistical tests and edge removals reveal the underlying causal structure.

Structure Learning Explorer

Watch how we discover the true causal structure from observational data

Start

Start: Empty Graph

We begin with only the nodes (variables) and no edges. Our goal is to discover the true causal structure from data alone.

True Causal Structure (Hidden)

Discovered Structure (From Data)

Sample Correlations (n = 500)

Pair	Correlation	Interpretation
X ↔ Y	0.928	Strong
X ↔ Z	0.836	Strong
X ↔ W	0.901	Strong
Y ↔ Z	0.766	Strong
Y ↔ W	0.905	Strong
Z ↔ W	0.904	Strong

Undirected (uncertain)

Directed (causal)

From Structure to Causation

Once we have learned the graphical structure, we can use it for causal reasoning. The do-calculus allows us to answer interventional questions.

The Power of Structure

Observational (P(Y|X))

What is Y when we observe X = x?
Includes confounding

Interventional (P(Y|do(X)))

What is Y when we set X = x?
True causal effect

The do-operator severs incoming edges to the intervened variable, removing confounding. Structure learning reveals what to condition on for valid causal inference.

Interactive: Causal Discovery

Explore how learned structure enables causal reasoning. See how interventions differ from observations and how confounders affect our conclusions.

From Structure to Causation

See how learned graph structure enables causal reasoning and interventions

Causal Graph

Click on a node to perform an intervention (do-operator)

Observational vs Interventional

Click a node to see how interventions (do-operator) differ from observations. Interventions cut incoming edges, eliminating confounding.

Causal Paths: Smoking → Cancer

Direct effect = 0.3

STC

Indirect effect = 0.4

GS/C

Confounder (blocks back-door path)

Total causal effect: do(S=1) → E[C] = 0.7

Why Structure Matters

• Observational: P(C|S) includes confounding from G
• Interventional: P(C|do(S)) captures true causal effect
• Structure learning reveals what to condition on

Applications in AI and Deep Learning

🎯 Interpretable Feature Selection

Structure learning identifies which features directly influence the target vs. which are correlated via confounders. This enables feature selection based on causal relevance rather than mere correlation—leading to more robust models.

⚖️ Algorithmic Fairness

Causal graphs help identify discriminatory paths from protected attributes to outcomes. By understanding the causal mechanism, we can block unfair paths while preserving legitimate relationships—a principled approach to fairness.

🔄 Transfer Learning & Domain Adaptation

Causal relationships are often invariant across domains, while spurious correlations change. Learning causal structure helps identify what transfers reliably—enabling better domain adaptation and out-of-distribution generalization.

🔍 Root Cause Analysis

In complex systems (microservices, networks, manufacturing), learned causal graphs help trace anomalies back to root causes. When something goes wrong, the graph shows which upstream factors could be responsible.

Application	How Structure Learning Helps	Example
Healthcare	Identify causal disease factors	Gene regulatory networks
Economics	Policy effect estimation	Central bank interventions
Robotics	Learn world dynamics	Model-based RL
NLP	Understand concept relations	Knowledge graph construction
Computer Vision	Disentangled representations	Causal image generation

Python Implementation

Let's implement the core structure learning algorithms from scratch. We'll build both the PC algorithm (constraint-based) and a score-based greedy search.

Structure Learning Implementation

🐍structure_learning.py

Explanation(16)

Code(148)

1Imports

We use NumPy for numerical operations, scipy.stats for statistical tests, and itertools for generating conditioning set combinations.

6PC Algorithm Class

The PC algorithm is the foundational constraint-based method. It starts with a complete graph and removes edges based on conditional independence tests.

9Significance Level

Alpha controls the threshold for conditional independence tests. Lower alpha = more conservative (fewer edges removed). Typical values: 0.01-0.05.

12Conditional Independence Test

Tests if X and Y are conditionally independent given Z using partial correlation. Returns True if the p-value exceeds alpha.

EXAMPLE

ci_test('X', 'Y', ['Z']) tests X ⊥ Y | Z

17Partial Correlation Formula

The partial correlation between X and Y given Z is: r_XY.Z = (r_XY - r_XZ * r_YZ) / sqrt((1-r_XZ²)(1-r_YZ²)). This measures linear association after removing Z's effect.

23Fisher's Z Transform

Converts partial correlation to a z-score for hypothesis testing: z = 0.5 * ln((1+r)/(1-r)) * sqrt(n-|Z|-3). Under H0, z ~ N(0,1).

30Skeleton Discovery

Phase 1: Start with complete undirected graph and systematically remove edges. For each edge (X,Y), test CI with conditioning sets of increasing size.

36Edge Iteration

For each undirected edge in the current graph, we will test if it should be removed by finding a separating set.

40Neighbor-Based Conditioning

We only condition on neighbors of X or Y (the adjacency faithful approach). This reduces the number of tests and is valid under faithfulness.

45Conditioning Set Enumeration

Try all subsets of size d from the neighbors. If X ⊥ Y | S for any S, remove the edge and record S as the separating set.

52V-Structure Orientation

Phase 2: Find unshielded triples X - Z - Y where X and Z are not adjacent. If Z is not in the separating set of X,Y, orient as X → Z ← Y (v-structure).

58Meek Rules

Phase 3: Apply Meek's orientation rules iteratively to propagate edge directions without creating cycles or new v-structures.

65Score-Based Learning

Alternative approach: assign a score to each DAG and search for the highest-scoring structure. The BIC score balances model fit against complexity.

70BIC Score Computation

BIC = log P(D|G,θ̂_MLE) - (k/2)log(n), where k is the number of parameters and n is sample size. The penalty prevents overfitting.

78Greedy Hill Climbing

Start from empty/random DAG. At each step, consider all single-edge additions, deletions, and reversals. Accept the change that most improves the score.

85Acyclicity Check

After each proposed change, verify the graph remains a DAG (no cycles). This can be done via topological sorting or DFS cycle detection.

132 lines without explanation

1import numpy as np
2from scipy import stats
3from itertools import combinations
4
5# =============================================================
6class PCAlgorithm:
7    """Constraint-based structure learning using the PC algorithm."""
8
9    def __init__(self, data, alpha=0.05):
10        self.data = data  # DataFrame with variables as columns
11        self.alpha = alpha  # Significance level for CI tests
12        self.sep_sets = {}  # Store separating sets
13
14    def conditional_independence_test(self, x, y, z_set):
15        """Test if X _||_ Y | Z using partial correlation."""
16        if len(z_set) == 0:
17            r = np.corrcoef(self.data[x], self.data[y])[0, 1]
18        else:
19            # Compute partial correlation
20            r_xy = np.corrcoef(self.data[x], self.data[y])[0, 1]
21            r_xz = [np.corrcoef(self.data[x], self.data[z])[0, 1] for z in z_set]
22            r_yz = [np.corrcoef(self.data[y], self.data[z])[0, 1] for z in z_set]
23            r = self._partial_correlation(r_xy, r_xz, r_yz)
24
25        # Fisher's z-transform for hypothesis testing
26        n = len(self.data)
27        z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(z_set) - 3)
28        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
29
30        return p_value > self.alpha  # True if independent
31
32    def learn_skeleton(self):
33        """Phase 1: Learn undirected skeleton by removing edges."""
34        variables = list(self.data.columns)
35        # Start with complete undirected graph
36        adj = {v: set(variables) - {v} for v in variables}
37
38        d = 0  # Conditioning set size
39        while True:
40            for x in variables:
41                for y in list(adj[x]):
42                    if y not in adj[x]:
43                        continue
44
45                    # Get neighbors excluding y
46                    neighbors = adj[x] - {y}
47                    if len(neighbors) < d:
48                        continue
49
50                    # Test all subsets of size d
51                    for z_set in combinations(neighbors, d):
52                        if self.conditional_independence_test(x, y, z_set):
53                            # Remove edge and store separating set
54                            adj[x].discard(y)
55                            adj[y].discard(x)
56                            self.sep_sets[(x, y)] = set(z_set)
57                            break
58
59            d += 1
60            if d > max(len(adj[v]) for v in variables):
61                break
62
63        return adj
64
65    def orient_v_structures(self, adj):
66        """Phase 2: Orient edges at v-structures (colliders)."""
67        directed = {v: {'parents': set(), 'children': set()} for v in adj}
68
69        for z in adj:
70            neighbors = list(adj[z])
71            for i, x in enumerate(neighbors):
72                for y in neighbors[i+1:]:
73                    # Check if x -- z -- y is unshielded (x,y not adjacent)
74                    if y not in adj[x]:
75                        # Check if z is NOT in sep_set(x,y)
76                        sep = self.sep_sets.get((x, y), self.sep_sets.get((y, x), set()))
77                        if z not in sep:
78                            # Orient as x -> z <- y (v-structure)
79                            directed[z]['parents'].add(x)
80                            directed[z]['parents'].add(y)
81
82        return directed
83
84# =============================================================
85def compute_bic_score(data, graph):
86    """Compute BIC score for a DAG structure."""
87    n = len(data)
88    score = 0
89
90    for node in graph.nodes:
91        parents = list(graph.predecessors(node))
92
93        # Fit local regression: node ~ parents
94        if len(parents) == 0:
95            # No parents: just variance
96            residual_var = np.var(data[node])
97        else:
98            # OLS regression
99            X = data[parents].values
100            y = data[node].values
101            beta = np.linalg.lstsq(X, y, rcond=None)[0]
102            residuals = y - X @ beta
103            residual_var = np.var(residuals)
104
105        # Log-likelihood contribution
106        ll = -n/2 * np.log(2 * np.pi * residual_var) - n/2
107
108        # BIC penalty: k = len(parents) + 1 (intercept + coefficients)
109        k = len(parents) + 1
110        bic_local = ll - (k / 2) * np.log(n)
111
112        score += bic_local
113
114    return score
115
116def greedy_search(data, max_iter=1000):
117    """Greedy hill-climbing for structure learning."""
118    import networkx as nx
119
120    variables = list(data.columns)
121    G = nx.DiGraph()  # Start with empty graph
122    G.add_nodes_from(variables)
123
124    best_score = compute_bic_score(data, G)
125
126    for _ in range(max_iter):
127        improved = False
128
129        for x in variables:
130            for y in variables:
131                if x == y:
132                    continue
133
134                # Try adding edge x -> y
135                if not G.has_edge(x, y):
136                    G.add_edge(x, y)
137                    if nx.is_directed_acyclic_graph(G):
138                        score = compute_bic_score(data, G)
139                        if score > best_score:
140                            best_score = score
141                            improved = True
142                            break
143                    G.remove_edge(x, y)
144
145        if not improved:
146            break
147
148    return G, best_score

Using libraries: For production use, consider causal-learn (Python implementation of PC, FCI, GES), pgmpy (structure learning + inference), NOTEARS (continuous optimization), or DoWhy (causal inference framework).

Computational Complexity: The PC algorithm has worst-case complexity O(p^d+2) where d is the maximum degree. Score-based search is NP-hard in general, but heuristic methods work well in practice. Modern methods like NOTEARS scale to thousands of variables.

Knowledge Check

Test your understanding of structure learning with these questions:

Knowledge Check

Question 1 of 8

What is the fundamental difference between constraint-based and score-based structure learning?

Summary

Key Takeaways

✅Structure learning discovers graphical model topology from data.

✅Markov equivalence limits identifiability—different DAGs can encode same CI relations.

✅Constraint-based methods (PC) use conditional independence tests to prune edges.

✅Score-based methods (BIC, BGe) search for highest-scoring DAG structures.

✅V-structures (colliders) are key for orienting edges in the skeleton.

✅Hybrid methods combine CI tests for skeleton with score-based orientation.

✅Learned structure enables causal reasoning via the do-calculus.

✅Applications span interpretable ML, fairness, transfer learning, and root cause analysis.

Looking Ahead

You have now completed our exploration of Probabilistic Graphical Models! We covered Bayesian Networks, Markov Random Fields, inference algorithms, and structure learning. In the next chapter, we'll explore Statistical Decision Theory—the mathematical framework for making optimal decisions under uncertainty, which unifies many concepts from Bayesian inference and machine learning.