Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Define Bayesian Networks and their key components (DAG, CPTs)
• Explain how the joint distribution factorizes according to the graph
• Apply d-separation to determine conditional independence
• Understand the difference between exact and approximate inference

🔧 Practical Skills

• Build and reason about simple Bayesian Networks
• Compute probabilities using the chain rule factorization
• Implement rejection sampling and likelihood weighting
• Read and interpret network structures in research papers

🧠 Deep Learning Connections

• Variational Autoencoders — The VAE generative model p(x|z)p(z) is a simple Bayesian Network; the ELBO comes from variational inference
• Belief Networks in Deep Learning — Deep belief networks and Helmholtz machines use layered Bayesian Network structures
• Causal Reasoning — Understanding interventions (do-calculus) requires Bayesian Network foundations
• Probabilistic Programming — Languages like Pyro, Stan, and Edward express models as Bayesian Networks

Where You'll Apply This: Medical diagnosis systems, fraud detection, recommendation engines, speech recognition, computer vision, robotics, and any domain where you need to reason under uncertainty with structured dependencies.

The Big Picture

A Bayesian Network (also called a belief network or directed graphical model) is a compact representation of a joint probability distribution over many random variables. Instead of storing an exponentially large probability table, we exploit the structure of dependencies to factor the distribution into manageable pieces.

The Core Insight

Real-world systems have sparse dependencies. Your headache might depend on whether you slept well and whether you have a cold, but not directly on the stock market. Bayesian Networks capture these dependency patterns in a graph, allowing us to reason efficiently about complex systems with many variables.

📊

Graph Structure: DAG encodes dependencies

📋

Local Tables: CPTs specify conditional probs

⚡

Efficient: Exponential savings in parameters

Historical Context

📜

Judea Pearl (1980s)

Judea Pearl, who later won the Turing Award, developed the theoretical foundations of Bayesian Networks in the 1980s. His work unified probability theory with graph theory, enabling efficient reasoning under uncertainty. His book "Probabilistic Reasoning in Intelligent Systems" (1988) remains a foundational text.

🔬

From Expert Systems to Deep Learning

Bayesian Networks powered the first generation of AI systems for medical diagnosis, troubleshooting, and decision support. Today, they remain essential: VAEs use them for generative modeling, causal inference builds on their semantics, and probabilistic programming languages express models as Bayesian Networks.

Real-World Motivation

Consider a medical diagnosis scenario with 20 symptoms and 10 possible diseases:

Naive Approach: Full Joint Table

With 30 binary variables, we'd need $2^{30} \approx 10^9$ entries. Impossible to estimate from data, store, or query efficiently.

Bayesian Network Approach

If each variable has at most 3 parents, we need only $30 \times 2^3 = 240$ parameters. A reduction by a factor of 4 million!

What is a Bayesian Network?

A Bayesian Network consists of two components that together define a joint probability distribution:

A Directed Acyclic Graph (DAG) where nodes represent random variables and edges represent direct probabilistic dependencies
Conditional Probability Tables (CPTs) that specify the probability of each variable given its parents in the graph

Directed Acyclic Graphs (DAGs)

The graph structure encodes the dependency relationships between variables. An edge from node A to node B means "A is a direct cause or influence on B" or more precisely, "B depends directly on A."

Term	Definition	Example
Parent	Node with outgoing edge to a child	Rain is a parent of Wet Grass
Child	Node with incoming edge from a parent	Wet Grass is a child of Rain
Ancestor	Parent, grandparent, etc.	If A → B → C, A is an ancestor of C
Descendant	Child, grandchild, etc.	If A → B → C, C is a descendant of A
Root	Node with no parents	Rain (no incoming edges)
Leaf	Node with no children	Wet Grass (no outgoing edges)

Acyclic Requirement: The graph must have no directed cycles. You cannot have A \u2192 B \u2192 C \u2192 A. This ensures we can define a valid joint probability distribution and enables efficient inference algorithms.

Conditional Probability Tables (CPTs)

Each node $X$ has a Conditional Probability Table that specifies $P(X \mid \text{Parents}(X))$ . For a node with $k$ binary parents, the CPT has $2^k$ rows, one for each parent configuration.

Example: Wet Grass CPT

Sprinkler	Rain	P(WetGrass=T)	P(WetGrass=F)
F	F	0.00	1.00
F	T	0.80	0.20
T	F	0.90	0.10
T	T	0.99	0.01

Each row sums to 1.0. The table captures how sprinkler and rain jointly influence wet grass.

Interactive: Network Explorer

Explore the classic "Sprinkler" Bayesian Network. Click on nodes to set evidence (observed values) and watch how beliefs propagate through the network. Toggle the CPT view to see the underlying probability tables.

Interactive Bayesian Network: Sprinkler Example

Click nodes to set evidence. Watch how beliefs propagate through the network.

Evidence: True

Evidence: False

Unobserved (color = belief)

The Factorization Property

The key mathematical property of Bayesian Networks is how they factor the joint distribution into a product of local conditional distributions:

Bayesian Network Factorization

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))

The joint probability equals the product of each variable's CPT entry for its parent values.

Chain Rule for Bayesian Networks

This factorization follows from the chain rule of probability, but with a crucial simplification: we condition only on parents, not all preceding variables.

General Chain Rule

P(X_1, \ldots, X_n) = P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1, X_2) \cdots

Each term conditions on ALL previous variables

Bayesian Network Simplification

P(X_1, \ldots, X_n) = \prod_{i} P(X_i \mid \text{Parents}(X_i))

Each term conditions only on PARENTS (often much smaller set)

This simplification is valid because of the conditional independence assumptions encoded in the graph: each variable is independent of its non-descendants given its parents.

Interactive: Factorization

Watch how the joint probability factors step-by-step according to the graph structure. The highlighted nodes show which variable is being factored at each step.

Chain Rule Factorization in Bayesian Networks

See how the joint distribution factors according to the graph structure.

P(R, S, W)

We want to compute the joint probability of Rain (R), Sprinkler (S), and Wet Grass (W).

Parameter Savings

The factorization yields dramatic savings in the number of parameters needed:

Representation	Parameters for n Binary Variables	Example: n=20
Full Joint Table	2^n - 1	1,048,575
BN (max k parents)	n × 2^k	20 × 8 = 160 (k=3)
Savings Factor	2^(n-k) / n	~6,500×

Real-World Sparsity: In practice, most variables depend on only a few others. A disease might depend on 3-5 risk factors, not hundreds of symptoms. This sparsity is why Bayesian Networks work so well for real-world problems.

Conditional Independence

The graph structure of a Bayesian Network encodes a rich set of conditional independence relationships. These independencies are what make efficient inference possible.

Local Markov Property

Every Bayesian Network satisfies the Local Markov Property:

X_i \perp\!\!\!\perp \text{NonDescendants}(X_i) \mid \text{Parents}(X_i)

Each variable is conditionally independent of its non-descendants given its parents.

This is the formal statement of the intuition that "knowing the immediate causes (parents) screens off the more distant causes."

Markov Blanket: A variable is conditionally independent of all other variables given its Markov blanket: its parents, children, and children's other parents. This is the minimal set needed for prediction.

D-Separation

D-separation is a graphical criterion that determines whether two sets of variables are conditionally independent given a third set. It's one of the most important concepts for reasoning about Bayesian Networks.

D-separation depends on three canonical structures:

Chain: X \u2192 Z \u2192 Y

Blocked when Z is observed. Information flows from X to Y through Z. Observing Z intercepts this flow.

Fork: X \u2190 Z \u2192 Y

Blocked when Z is observed. Z is a common cause of X and Y. Observing Z explains away the correlation.

Collider: X \u2192 Z \u2190 Y

Blocked when Z is NOT observed. X and Y are marginally independent but become dependent when we observe their common effect Z.

Collider Bias: The collider is the "gotcha" structure. Conditioning on a common effect can create dependencies that didn't exist before. This is called "explaining away" or "Berkson's paradox." For example, if admission to a program requires either high test scores OR athletic ability, admitted students show a negative correlation between these traits—even if they're independent in the population.

Interactive: D-Separation

Explore the three fundamental structures and see how observing nodes affects the flow of information. Click on nodes to observe them and watch whether X and Y become d-separated.

D-Separation: Understanding Conditional Independence

Explore how information flows through different graph structures. Click on nodes to observe them.

Chain: X → Z → Y

In a chain, information flows from X to Y through Z. Observing Z blocks this flow.

D-Connected

Information can flow between X and Y through the unblocked path.

D-Separation Rules:

\u2022 Chain/Fork: Blocked when middle node is observed
\u2022 Collider: Blocked when middle node is NOT observed

Click on nodes to toggle observation. Purple nodes (X, Y) are the query nodes. Blue indicates observed.

Inference in Bayesian Networks

Inference is the task of computing probabilities of interest given observed evidence. The most common query is: $P(\text{Query} \mid \text{Evidence})$ .

Exact Inference

For small networks or networks with special structure (trees, low treewidth), we can compute exact probabilities using algorithms like:

Variable Elimination: Systematically sum out hidden variables. Complexity depends on the elimination order and the graph's treewidth.
Belief Propagation: Message-passing algorithm that works exactly on trees and approximately on graphs with loops.
Junction Tree Algorithm: Converts the graph into a tree of cliques for exact inference. Works for any graph but exponential in treewidth.

P(Q \mid E=e) = \frac{P(Q, E=e)}{P(E=e)} = \frac{\sum_{H} P(Q, H, E=e)}{\sum_{Q, H} P(Q, H, E=e)}

Computational Complexity: Exact inference in general Bayesian Networks is NP-hard. The complexity is exponential in the network's treewidth—roughly, how "tree-like" the graph is. For dense graphs, we must resort to approximations.

Approximate Inference

For large or densely connected networks, we use sampling-based methods that trade exactness for tractability:

Rejection Sampling

Sample from the prior $P(X_1, \ldots, X_n)$ and reject samples inconsistent with evidence. Simple but inefficient when evidence is unlikely.

Likelihood Weighting

Sample non-evidence variables, fix evidence to observed values, and weight samples by $\prod P(e_i \mid \text{parents})$ . More efficient than rejection.

MCMC (Gibbs Sampling)

Construct a Markov chain whose stationary distribution is the posterior. Resample each variable conditioned on all others. Asymptotically exact but requires burn-in.

Interactive: Inference Methods

Compare exact inference with sampling-based methods. See how many samples are needed for the estimates to converge to the true probability.

Inference in Bayesian Networks

Compare exact inference with sampling-based approximate methods. Query: P(Rain = True | Wet Grass = True)

Exact Inference via Variable Elimination

Computes the exact probability by marginalizing over hidden variables. Uses Bayes' rule: P(R|WG) = P(WG|R)P(R) / P(WG). Tractable for small networks but exponential in treewidth.

Applications in Deep Learning

Bayesian Networks aren't just classical AI—they're deeply connected to modern deep learning architectures and techniques:

🔮 Variational Autoencoders (VAEs)

The VAE generative model $p_\\theta(x|z)p(z)$ is a two-node Bayesian Network: latent z generates observed x. The ELBO objective comes directly from variational inference on this network. Understanding BNs demystifies the VAE loss.

🔗 Normalizing Flows

Flows can be viewed as deep Bayesian Networks with deterministic nodes. Each layer transforms the distribution, and the Jacobian terms in the likelihood correspond to the deterministic CPTs.

🎯 Causal Inference for ML

Understanding when ML models will generalize under distribution shift requires causal reasoning. Bayesian Networks provide the semantics for interventions (do-calculus) that distinguish correlation from causation.

📊 Probabilistic Programming

Languages like Pyro, Stan, NumPyro, and Edward let you define models as code that implicitly creates Bayesian Networks. Neural networks can parameterize CPTs, combining deep learning with probabilistic modeling.

🧬 Structure Learning

Learning the graph structure from data is an active research area. Score-based methods (BIC, BDeu) and constraint-based methods (PC algorithm) can discover causal structure, which is increasingly important for interpretable ML.

Python Implementation

Let's implement a basic Bayesian Network from scratch. This implementation includes network construction, forward sampling, rejection sampling, and likelihood weighting.

Bayesian Network Implementation from Scratch

🐍bayesian_network.py

Explanation(11)

Code(116)

1Import NumPy

NumPy provides efficient array operations for storing and manipulating probability tables.

5Node Class

Each node in the Bayesian Network stores its name, parent references, and conditional probability table (CPT).

11CPT Shape Validation

The CPT must have shape (2^num_parents, 2) for binary variables. Each row corresponds to a parent configuration.

18Parent Config to Index

Converts a dictionary of parent values to a row index in the CPT. Uses binary encoding: {A:0, B:1} maps to index 0*2 + 1 = 1.

EXAMPLE

parents={&apos;Rain&apos;: True, &apos;Sprinkler&apos;: False} → index = 1*2 + 0 = 2

27Conditional Probability

Returns P(node=value | parent_values) by looking up the appropriate row in the CPT.

34BayesianNetwork Class

The main class that holds all nodes and provides methods for sampling and inference.

42Topological Sort

Orders nodes so that parents come before children. Essential for forward sampling and belief propagation.

55Forward Sampling

Generates a complete sample by visiting nodes in topological order, sampling each conditioned on its parents.

67Rejection Sampling

Approximate inference by sampling and rejecting inconsistent samples. Simple but can be inefficient.

82Likelihood Weighting

More efficient than rejection: always include evidence but weight samples by P(evidence | parents).

103Joint Probability

Computes P(assignment) = ∏ P(X_i | Parents(X_i)) using the factorization property.

105 lines without explanation

1import numpy as np
2from typing import Dict, List, Optional, Tuple
3from collections import defaultdict
4
5class Node:
6    def __init__(self, name: str, parents: List['Node'], cpt: np.ndarray):
7        self.name = name
8        self.parents = parents
9        self.cpt = cpt  # Shape: (2^num_parents, 2) for binary vars
10
11        # Validate CPT shape
12        expected_rows = 2 ** len(parents)
13        if cpt.shape != (expected_rows, 2):
14            raise ValueError(f"CPT shape must be ({expected_rows}, 2)")
15        if not np.allclose(cpt.sum(axis=1), 1.0):
16            raise ValueError("CPT rows must sum to 1")
17
18    def _parent_config_to_index(self, parent_values: Dict[str, int]) -> int:
19        """Convert parent values to CPT row index (binary encoding)."""
20        index = 0
21        for i, parent in enumerate(self.parents):
22            index += parent_values[parent.name] * (2 ** i)
23        return index
24
25    def prob(self, value: int, parent_values: Dict[str, int]) -> float:
26        """Return P(self=value | parent_values)."""
27        if not self.parents:
28            return self.cpt[0, value]
29        idx = self._parent_config_to_index(parent_values)
30        return self.cpt[idx, value]
31
32
33class BayesianNetwork:
34    def __init__(self):
35        self.nodes: Dict[str, Node] = {}
36
37    def add_node(self, node: Node):
38        self.nodes[node.name] = node
39
40    def topological_sort(self) -> List[Node]:
41        """Return nodes in topological order (parents before children)."""
42        visited = set()
43        order = []
44
45        def visit(node: Node):
46            if node.name in visited:
47                return
48            visited.add(node.name)
49            for parent in node.parents:
50                visit(parent)
51            order.append(node)
52
53        for node in self.nodes.values():
54            visit(node)
55        return order
56
57    def sample(self) -> Dict[str, int]:
58        """Generate a sample from the joint distribution (forward sampling)."""
59        sample = {}
60        for node in self.topological_sort():
61            parent_vals = {p.name: sample[p.name] for p in node.parents}
62            p_true = node.prob(1, parent_vals)
63            sample[node.name] = 1 if np.random.random() < p_true else 0
64        return sample
65
66    def rejection_sampling(self, query: str, evidence: Dict[str, int],
67                          n_samples: int = 10000) -> float:
68        """Estimate P(query=1 | evidence) via rejection sampling."""
69        count_query = 0
70        count_consistent = 0
71
72        for _ in range(n_samples):
73            s = self.sample()
74            # Check if sample is consistent with evidence
75            if all(s[k] == v for k, v in evidence.items()):
76                count_consistent += 1
77                if s[query] == 1:
78                    count_query += 1
79
80        return count_query / count_consistent if count_consistent > 0 else 0.0
81
82    def likelihood_weighting(self, query: str, evidence: Dict[str, int],
83                            n_samples: int = 10000) -> float:
84        """Estimate P(query=1 | evidence) via likelihood weighting."""
85        total_weight = 0.0
86        query_weight = 0.0
87
88        for _ in range(n_samples):
89            sample = {}
90            weight = 1.0
91
92            for node in self.topological_sort():
93                parent_vals = {p.name: sample[p.name] for p in node.parents}
94
95                if node.name in evidence:
96                    # Evidence node: fix value, multiply weight
97                    sample[node.name] = evidence[node.name]
98                    weight *= node.prob(evidence[node.name], parent_vals)
99                else:
100                    # Sample as usual
101                    p_true = node.prob(1, parent_vals)
102                    sample[node.name] = 1 if np.random.random() < p_true else 0
103
104            total_weight += weight
105            if sample[query] == 1:
106                query_weight += weight
107
108        return query_weight / total_weight if total_weight > 0 else 0.0
109
110    def joint_probability(self, assignment: Dict[str, int]) -> float:
111        """Compute P(assignment) using the factorization property."""
112        prob = 1.0
113        for node in self.nodes.values():
114            parent_vals = {p.name: assignment[p.name] for p in node.parents}
115            prob *= node.prob(assignment[node.name], parent_vals)
116        return prob

Production Libraries: For real applications, use established libraries:pgmpy (Python), bnlearn (R/Python), or probabilistic programming languages like Pyro orNumPyro for deep learning integration.

Knowledge Check

Test your understanding of Bayesian Networks with these questions covering the key concepts from this section:

Knowledge Check

Question 1 of 8

What is the defining property of a Bayesian Network?

Current Score: 0/0

Summary

Key Takeaways

✅A Bayesian Network = DAG + CPTs, representing a joint distribution compactly.

✅The factorization property lets us write

P(\\mathbf{X}) = \\prod_i P(X_i | \\text{Pa}_i)

✅D-separation determines conditional independence from the graph structure.

✅Colliders are special: observing them creates dependencies between parents.

✅Exact inference uses variable elimination or belief propagation; NP-hard in general.

✅Approximate inference via sampling (rejection, likelihood weighting, MCMC).

✅VAEs and other generative models are Bayesian Networks with neural CPTs.

✅Parameter savings can be exponential when variables have sparse dependencies.

Looking Ahead

In the next section, we'll explore Markov Random Fields—undirected graphical models that are natural for modeling spatial data, images, and situations where the direction of influence is unclear. We'll see how they differ from Bayesian Networks and where each is most appropriate.