Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Explain the structure learning problem and its challenges
- • Define Markov equivalence and its implications
- • Compare constraint-based vs score-based approaches
- • Understand v-structures and edge orientation rules
🔧 Practical Skills
- • Implement the PC algorithm for skeleton discovery
- • Apply BIC and other scoring functions
- • Perform greedy search over DAG space
- • Choose appropriate methods for different data settings
🧠 Deep Learning Connections
- • Causal Discovery — Learning causal structure from observational data for interpretable AI
- • Neural Causal Models — Using neural networks to model nonlinear causal relationships
- • Fairness & Explainability — Identifying causal paths for algorithmic fairness and model interpretability
- • Transfer Learning — Leveraging causal invariances for robust generalization
Where You'll Apply This: Causal discovery from observational data, gene regulatory network inference, root cause analysis in complex systems, interpretable machine learning, fairness auditing, and scientific hypothesis generation.
The Big Picture
So far, we have learned how to perform inference in graphical models when the structure is known. But in many real-world applications, we don't know the true structure—we must discover it from data. This is the structure learning problem, one of the most important and challenging problems in probabilistic modeling.
The Core Question
Given data, can we discover the underlying causal structure?
Data: Observations of multiple variables
Goal: Infer the graph structure
Result: DAG representing causal relationships
Why Learn Structure?
Structure learning serves several critical purposes:
Scientific Discovery
Discover causal mechanisms from observational data. Gene regulatory networks, protein interactions, and brain connectivity are all inferred through structure learning.
Causal Inference
Enable causal reasoning: "What would happen if we intervened on X?" Structure reveals which paths are causal vs spurious correlations.
Interpretable ML
Build interpretable models by learning feature dependencies. Understand why a model makes predictions, not just what it predicts.
Historical Development
Spirtes, Glymour & Scheines (1993)
The PC algorithm (named after Peter and Clark) was developed at CMU. This constraint-based method uses conditional independence tests to systematically prune edges from a complete graph, then orients remaining edges using v-structures.
Bayesian Structure Learning (1990s-2000s)
Score-based methods emerged with Bayesian scoring (BDe, BGe) and efficient search algorithms. The BIC score provided a principled way to balance fit and complexity. MCMC methods allowed sampling from the posterior over structures.
Modern Era (2010s-present)
Continuous optimization reformulations (NOTEARS), neural causal discovery, and integration with deep learning. Methods now scale to high-dimensional data and can handle nonlinear relationships.
Mathematical Foundations
The Structure Learning Problem
Given a dataset of i.i.d. samples over variables, we want to find the DAG that best represents the causal or conditional independence structure.
Formalization
Search Space:
Size grows super-exponentially:
Objective (Score-Based):
Objective (Constraint-Based):
Find G such that CI relations in data match d-separation in G
Markov Equivalence Classes
A fundamental challenge is that different DAGs can encode the same conditional independencies. Such DAGs are called Markov equivalent.
Markov Equivalence Theorem
Two DAGs are Markov equivalent if and only if they have:
- The same skeleton (undirected edges)
- The same v-structures (colliders: X → Z ← Y where X, Y not adjacent)
Both imply X ⫫̸ Y (dependent)
Constraint-Based Methods
Constraint-based methods learn structure by testing conditional independence relationships in the data. The key idea: if X and Y are conditionally independent given Z, there is no direct edge between them (with Z on the path).
Conditional Independence Testing
The foundation of constraint-based learning is the statistical test for conditional independence: .
Partial Correlation Test (Gaussian Data)
For Gaussian data, conditional independence is equivalent to zero partial correlation.
Under the null hypothesis of conditional independence, Fisher's z-transform of the partial correlation follows a normal distribution:
The PC Algorithm
The PC algorithm (Peter-Clark) is the foundational constraint-based method. It operates in three phases:
Skeleton Discovery
- • Start with a complete undirected graph
- • For conditioning sets of increasing size d = 0, 1, 2, ...
- • Test for all subsets S of size d from neighbors
- • Remove edge X—Y if independent for any conditioning set
- • Store the separating set SXY
V-Structure Orientation
- • Find unshielded triples: X—Z—Y where X,Y not adjacent
- • If Z ∉ SXY (separating set), orient as X → Z ← Y
- • This is a v-structure (collider at Z)
Edge Orientation (Meek Rules)
- • R1: If X → Z—Y and X,Y not adjacent, orient Z → Y
- • R2: If X → Z → Y and X—Y, orient X → Y (avoid cycle)
- • R3: If X—Z, X—Y, Z → W ← Y, orient X → Z
- • Apply iteratively until no more changes
Interactive: PC Algorithm
Watch the PC algorithm discover structure step by step. Observe how conditional independence tests remove spurious edges and how v-structures guide edge orientation.
PC Algorithm: Constraint-Based Structure Learning
Discover causal structure through systematic conditional independence testing
Current Graph State
Current Phase
Significance Level (α)
0.05Lower α = fewer edges removed (more conservative)
Independence Tests
No tests run yet
Score-Based Methods
Score-based methods take a different approach: assign a score to each candidate DAG and search for the highest-scoring structure. This turns structure learning into an optimization problem.
Scoring Functions
| Score | Formula | Properties |
|---|---|---|
| BIC | log P(D|G,θ̂) - (k/2)log(n) | Asymptotically consistent, penalizes complexity |
| AIC | log P(D|G,θ̂) - k | Less penalty than BIC, tends to select larger graphs |
| BGe | log P(D|G) (marginal likelihood) | Fully Bayesian, integrates over parameters |
| BDe/BDeu | Dirichlet-multinomial marginal | For discrete data, with equivalent sample size |
BIC Score Decomposition
A key property: BIC decomposes into local scores for each node:
This decomposition enables efficient local search—we only need to recompute scores for affected nodes when modifying an edge.
Search Strategies
Greedy Hill Climbing
- • Start from empty/random DAG
- • Consider: add, delete, reverse edge
- • Accept change that improves score most
- • Stop when no improvement possible
Pro: Fast, simple. Con: Local optima.
MCMC Structure Sampling
- • Sample from posterior P(G|D)
- • Propose edge changes
- • Accept/reject via Metropolis-Hastings
- • Averages over model uncertainty
Pro: Principled uncertainty. Con: Slow convergence.
Exact Methods (Dynamic Programming)
- • Optimal for small graphs (p ≤ 20-25)
- • Order-based DP exploits decomposability
- • O(p · 2p) time and space
Pro: Guaranteed optimal. Con: Exponential complexity.
NOTEARS (Continuous Optimization)
- • Reformulate as continuous optimization
- • Acyclicity via algebraic constraint
- • Use gradient-based optimization
- • Scales to 1000s of variables
Pro: Scalable. Con: Local optima, assumptions.
Interactive: Score-Based Search
Explore how greedy search navigates the space of DAGs, guided by the scoring function. Compare how different scoring criteria lead to different learned structures.
Score-Based Structure Search
Find the highest-scoring DAG through greedy search over structures
Scoring Function
Bayesian Information Criterion: balances fit and complexity
Current Structure
Score: -150.2Best Structure Found
Score: -150.2Structure Space (3 nodes)
Search Path:
Scoring Functions
where k = number of parameters, n = sample size, D = data, G = graph
Hybrid and Modern Methods
Modern structure learning often combines the strengths of both approaches:
GES (Greedy Equivalence Search)
Two-phase search: first add edges greedily to increase score, then delete edges. Operates on equivalence classes (CPDAGs) rather than individual DAGs. Proven consistent under faithfulness.
Hybrid Methods (MMHC, H2PC)
Use constraint-based methods to find skeleton (restricts search space), then score-based search over orientations. Combines efficiency of CI tests with optimality of scoring.
Neural Structure Learning (DAG-GNN, NOTEARS-MLP)
Use neural networks to model nonlinear relationships while learning structure. Can handle high-dimensional data and complex functional forms. Active research area.
Interactive: Structure Learning Explorer
This comprehensive demo shows the full structure learning process from data to DAG. Watch how statistical tests and edge removals reveal the underlying causal structure.
Structure Learning Explorer
Watch how we discover the true causal structure from observational data
Start: Empty Graph
We begin with only the nodes (variables) and no edges. Our goal is to discover the true causal structure from data alone.
True Causal Structure (Hidden)
Discovered Structure (From Data)
Sample Correlations (n = 500)
| Pair | Correlation | Interpretation |
|---|---|---|
| X ↔ Y | 0.928 | Strong |
| X ↔ Z | 0.836 | Strong |
| X ↔ W | 0.901 | Strong |
| Y ↔ Z | 0.766 | Strong |
| Y ↔ W | 0.905 | Strong |
| Z ↔ W | 0.904 | Strong |
From Structure to Causation
Once we have learned the graphical structure, we can use it for causal reasoning. The do-calculus allows us to answer interventional questions.
The Power of Structure
Observational (P(Y|X))
What is Y when we observe X = x?
Includes confounding
Interventional (P(Y|do(X)))
What is Y when we set X = x?
True causal effect
The do-operator severs incoming edges to the intervened variable, removing confounding. Structure learning reveals what to condition on for valid causal inference.
Interactive: Causal Discovery
Explore how learned structure enables causal reasoning. See how interventions differ from observations and how confounders affect our conclusions.
From Structure to Causation
See how learned graph structure enables causal reasoning and interventions
Causal Graph
Click on a node to perform an intervention (do-operator)
Observational vs Interventional
Click a node to see how interventions (do-operator) differ from observations. Interventions cut incoming edges, eliminating confounding.
Causal Paths: Smoking → Cancer
Why Structure Matters
- • Observational: P(C|S) includes confounding from G
- • Interventional: P(C|do(S)) captures true causal effect
- • Structure learning reveals what to condition on
Applications in AI and Deep Learning
🎯 Interpretable Feature Selection
Structure learning identifies which features directly influence the target vs. which are correlated via confounders. This enables feature selection based on causal relevance rather than mere correlation—leading to more robust models.
⚖️ Algorithmic Fairness
Causal graphs help identify discriminatory paths from protected attributes to outcomes. By understanding the causal mechanism, we can block unfair paths while preserving legitimate relationships—a principled approach to fairness.
🔄 Transfer Learning & Domain Adaptation
Causal relationships are often invariant across domains, while spurious correlations change. Learning causal structure helps identify what transfers reliably—enabling better domain adaptation and out-of-distribution generalization.
🔍 Root Cause Analysis
In complex systems (microservices, networks, manufacturing), learned causal graphs help trace anomalies back to root causes. When something goes wrong, the graph shows which upstream factors could be responsible.
| Application | How Structure Learning Helps | Example |
|---|---|---|
| Healthcare | Identify causal disease factors | Gene regulatory networks |
| Economics | Policy effect estimation | Central bank interventions |
| Robotics | Learn world dynamics | Model-based RL |
| NLP | Understand concept relations | Knowledge graph construction |
| Computer Vision | Disentangled representations | Causal image generation |
Python Implementation
Let's implement the core structure learning algorithms from scratch. We'll build both the PC algorithm (constraint-based) and a score-based greedy search.
causal-learn (Python implementation of PC, FCI, GES), pgmpy (structure learning + inference), NOTEARS (continuous optimization), or DoWhy (causal inference framework).Knowledge Check
Test your understanding of structure learning with these questions:
Knowledge Check
Question 1 of 8What is the fundamental difference between constraint-based and score-based structure learning?
Summary
Key Takeaways
Looking Ahead
You have now completed our exploration of Probabilistic Graphical Models! We covered Bayesian Networks, Markov Random Fields, inference algorithms, and structure learning. In the next chapter, we'll explore Statistical Decision Theory—the mathematical framework for making optimal decisions under uncertainty, which unifies many concepts from Bayesian inference and machine learning.