Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Define Bayesian Networks and their key components (DAG, CPTs)
- • Explain how the joint distribution factorizes according to the graph
- • Apply d-separation to determine conditional independence
- • Understand the difference between exact and approximate inference
🔧 Practical Skills
- • Build and reason about simple Bayesian Networks
- • Compute probabilities using the chain rule factorization
- • Implement rejection sampling and likelihood weighting
- • Read and interpret network structures in research papers
🧠 Deep Learning Connections
- • Variational Autoencoders — The VAE generative model p(x|z)p(z) is a simple Bayesian Network; the ELBO comes from variational inference
- • Belief Networks in Deep Learning — Deep belief networks and Helmholtz machines use layered Bayesian Network structures
- • Causal Reasoning — Understanding interventions (do-calculus) requires Bayesian Network foundations
- • Probabilistic Programming — Languages like Pyro, Stan, and Edward express models as Bayesian Networks
Where You'll Apply This: Medical diagnosis systems, fraud detection, recommendation engines, speech recognition, computer vision, robotics, and any domain where you need to reason under uncertainty with structured dependencies.
The Big Picture
A Bayesian Network (also called a belief network or directed graphical model) is a compact representation of a joint probability distribution over many random variables. Instead of storing an exponentially large probability table, we exploit the structure of dependencies to factor the distribution into manageable pieces.
The Core Insight
Real-world systems have sparse dependencies. Your headache might depend on whether you slept well and whether you have a cold, but not directly on the stock market. Bayesian Networks capture these dependency patterns in a graph, allowing us to reason efficiently about complex systems with many variables.
Graph Structure: DAG encodes dependencies
Local Tables: CPTs specify conditional probs
Efficient: Exponential savings in parameters
Historical Context
Judea Pearl (1980s)
Judea Pearl, who later won the Turing Award, developed the theoretical foundations of Bayesian Networks in the 1980s. His work unified probability theory with graph theory, enabling efficient reasoning under uncertainty. His book "Probabilistic Reasoning in Intelligent Systems" (1988) remains a foundational text.
From Expert Systems to Deep Learning
Bayesian Networks powered the first generation of AI systems for medical diagnosis, troubleshooting, and decision support. Today, they remain essential: VAEs use them for generative modeling, causal inference builds on their semantics, and probabilistic programming languages express models as Bayesian Networks.
Real-World Motivation
Consider a medical diagnosis scenario with 20 symptoms and 10 possible diseases:
Naive Approach: Full Joint Table
With 30 binary variables, we'd need entries. Impossible to estimate from data, store, or query efficiently.
Bayesian Network Approach
If each variable has at most 3 parents, we need only parameters. A reduction by a factor of 4 million!
What is a Bayesian Network?
A Bayesian Network consists of two components that together define a joint probability distribution:
- A Directed Acyclic Graph (DAG) where nodes represent random variables and edges represent direct probabilistic dependencies
- Conditional Probability Tables (CPTs) that specify the probability of each variable given its parents in the graph
Directed Acyclic Graphs (DAGs)
The graph structure encodes the dependency relationships between variables. An edge from node A to node B means "A is a direct cause or influence on B" or more precisely, "B depends directly on A."
| Term | Definition | Example |
|---|---|---|
| Parent | Node with outgoing edge to a child | Rain is a parent of Wet Grass |
| Child | Node with incoming edge from a parent | Wet Grass is a child of Rain |
| Ancestor | Parent, grandparent, etc. | If A → B → C, A is an ancestor of C |
| Descendant | Child, grandchild, etc. | If A → B → C, C is a descendant of A |
| Root | Node with no parents | Rain (no incoming edges) |
| Leaf | Node with no children | Wet Grass (no outgoing edges) |
Conditional Probability Tables (CPTs)
Each node has a Conditional Probability Table that specifies . For a node with binary parents, the CPT has rows, one for each parent configuration.
Example: Wet Grass CPT
| Sprinkler | Rain | P(WetGrass=T) | P(WetGrass=F) |
|---|---|---|---|
| F | F | 0.00 | 1.00 |
| F | T | 0.80 | 0.20 |
| T | F | 0.90 | 0.10 |
| T | T | 0.99 | 0.01 |
Each row sums to 1.0. The table captures how sprinkler and rain jointly influence wet grass.
Interactive: Network Explorer
Explore the classic "Sprinkler" Bayesian Network. Click on nodes to set evidence (observed values) and watch how beliefs propagate through the network. Toggle the CPT view to see the underlying probability tables.
Interactive Bayesian Network: Sprinkler Example
Click nodes to set evidence. Watch how beliefs propagate through the network.
The Factorization Property
The key mathematical property of Bayesian Networks is how they factor the joint distribution into a product of local conditional distributions:
Bayesian Network Factorization
The joint probability equals the product of each variable's CPT entry for its parent values.
Chain Rule for Bayesian Networks
This factorization follows from the chain rule of probability, but with a crucial simplification: we condition only on parents, not all preceding variables.
General Chain Rule
Each term conditions on ALL previous variables
Bayesian Network Simplification
Each term conditions only on PARENTS (often much smaller set)
This simplification is valid because of the conditional independence assumptions encoded in the graph: each variable is independent of its non-descendants given its parents.
Interactive: Factorization
Watch how the joint probability factors step-by-step according to the graph structure. The highlighted nodes show which variable is being factored at each step.
Chain Rule Factorization in Bayesian Networks
See how the joint distribution factors according to the graph structure.
We want to compute the joint probability of Rain (R), Sprinkler (S), and Wet Grass (W).
Parameter Savings
The factorization yields dramatic savings in the number of parameters needed:
| Representation | Parameters for n Binary Variables | Example: n=20 |
|---|---|---|
| Full Joint Table | 2^n - 1 | 1,048,575 |
| BN (max k parents) | n × 2^k | 20 × 8 = 160 (k=3) |
| Savings Factor | 2^(n-k) / n | ~6,500× |
Conditional Independence
The graph structure of a Bayesian Network encodes a rich set of conditional independence relationships. These independencies are what make efficient inference possible.
Local Markov Property
Every Bayesian Network satisfies the Local Markov Property:
Each variable is conditionally independent of its non-descendants given its parents.
This is the formal statement of the intuition that "knowing the immediate causes (parents) screens off the more distant causes."
D-Separation
D-separation is a graphical criterion that determines whether two sets of variables are conditionally independent given a third set. It's one of the most important concepts for reasoning about Bayesian Networks.
D-separation depends on three canonical structures:
Chain: X \u2192 Z \u2192 Y
Blocked when Z is observed. Information flows from X to Y through Z. Observing Z intercepts this flow.
Fork: X \u2190 Z \u2192 Y
Blocked when Z is observed. Z is a common cause of X and Y. Observing Z explains away the correlation.
Collider: X \u2192 Z \u2190 Y
Blocked when Z is NOT observed. X and Y are marginally independent but become dependent when we observe their common effect Z.
Interactive: D-Separation
Explore the three fundamental structures and see how observing nodes affects the flow of information. Click on nodes to observe them and watch whether X and Y become d-separated.
D-Separation: Understanding Conditional Independence
Explore how information flows through different graph structures. Click on nodes to observe them.
Chain: X → Z → Y
In a chain, information flows from X to Y through Z. Observing Z blocks this flow.
Information can flow between X and Y through the unblocked path.
D-Separation Rules:
- \u2022 Chain/Fork: Blocked when middle node is observed
- \u2022 Collider: Blocked when middle node is NOT observed
Click on nodes to toggle observation. Purple nodes (X, Y) are the query nodes. Blue indicates observed.
Inference in Bayesian Networks
Inference is the task of computing probabilities of interest given observed evidence. The most common query is: .
Exact Inference
For small networks or networks with special structure (trees, low treewidth), we can compute exact probabilities using algorithms like:
- Variable Elimination: Systematically sum out hidden variables. Complexity depends on the elimination order and the graph's treewidth.
- Belief Propagation: Message-passing algorithm that works exactly on trees and approximately on graphs with loops.
- Junction Tree Algorithm: Converts the graph into a tree of cliques for exact inference. Works for any graph but exponential in treewidth.
Approximate Inference
For large or densely connected networks, we use sampling-based methods that trade exactness for tractability:
Rejection Sampling
Sample from the prior and reject samples inconsistent with evidence. Simple but inefficient when evidence is unlikely.
Likelihood Weighting
Sample non-evidence variables, fix evidence to observed values, and weight samples by. More efficient than rejection.
MCMC (Gibbs Sampling)
Construct a Markov chain whose stationary distribution is the posterior. Resample each variable conditioned on all others. Asymptotically exact but requires burn-in.
Interactive: Inference Methods
Compare exact inference with sampling-based methods. See how many samples are needed for the estimates to converge to the true probability.
Inference in Bayesian Networks
Compare exact inference with sampling-based approximate methods. Query: P(Rain = True | Wet Grass = True)
Exact Inference via Variable Elimination
Computes the exact probability by marginalizing over hidden variables. Uses Bayes' rule: P(R|WG) = P(WG|R)P(R) / P(WG). Tractable for small networks but exponential in treewidth.
Applications in Deep Learning
Bayesian Networks aren't just classical AI—they're deeply connected to modern deep learning architectures and techniques:
🔮 Variational Autoencoders (VAEs)
The VAE generative model p_\\theta(x|z)p(z) is a two-node Bayesian Network: latent z generates observed x. The ELBO objective comes directly from variational inference on this network. Understanding BNs demystifies the VAE loss.
🔗 Normalizing Flows
Flows can be viewed as deep Bayesian Networks with deterministic nodes. Each layer transforms the distribution, and the Jacobian terms in the likelihood correspond to the deterministic CPTs.
🎯 Causal Inference for ML
Understanding when ML models will generalize under distribution shift requires causal reasoning. Bayesian Networks provide the semantics for interventions (do-calculus) that distinguish correlation from causation.
📊 Probabilistic Programming
Languages like Pyro, Stan, NumPyro, and Edward let you define models as code that implicitly creates Bayesian Networks. Neural networks can parameterize CPTs, combining deep learning with probabilistic modeling.
🧬 Structure Learning
Learning the graph structure from data is an active research area. Score-based methods (BIC, BDeu) and constraint-based methods (PC algorithm) can discover causal structure, which is increasingly important for interpretable ML.
Python Implementation
Let's implement a basic Bayesian Network from scratch. This implementation includes network construction, forward sampling, rejection sampling, and likelihood weighting.
pgmpy (Python), bnlearn (R/Python), or probabilistic programming languages like Pyro orNumPyro for deep learning integration.Knowledge Check
Test your understanding of Bayesian Networks with these questions covering the key concepts from this section:
Knowledge Check
Question 1 of 8What is the defining property of a Bayesian Network?
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Markov Random Fields—undirected graphical models that are natural for modeling spatial data, images, and situations where the direction of influence is unclear. We'll see how they differ from Bayesian Networks and where each is most appropriate.