Chapter 19
25 min read
Section 120 of 175

Monte Carlo Integration

Computational Bayesian Methods

Learning Objectives

By the end of this section, you will be able to:

๐Ÿ“ Core Mathematical Concepts

  • โ€ข Derive the Monte Carlo estimator from the Law of Large Numbers
  • โ€ข Prove that MC error decreases as 1/n1/\sqrt{n}
  • โ€ข Explain why MC is dimension-independent while grids are not
  • โ€ข Calculate confidence intervals for MC estimates

๐Ÿ”ง Practical Skills

  • โ€ข Implement basic Monte Carlo integration in Python
  • โ€ข Apply importance sampling for variance reduction
  • โ€ข Diagnose convergence and choose appropriate sample sizes
  • โ€ข Use MC to compute Bayesian posterior expectations

๐Ÿง  AI/ML Connections

  • โ€ข Neural Network Training: Mini-batch SGD is Monte Carlo estimation of the gradient
  • โ€ข Reinforcement Learning: Policy gradient methods use MC to estimate expected return
  • โ€ข Generative Models: VAEs and diffusion models use MC for intractable expectations
  • โ€ข Bayesian Deep Learning: MC Dropout for uncertainty quantification
Where You'll Apply This: Computing expectations in high-dimensional spaces, training neural networks with stochastic gradient descent, uncertainty quantification, Bayesian inference, reinforcement learning, and any situation where exact integration is impossible.

The Big Picture

Many problems in statistics, machine learning, and physics require computing integrals that have no closed-form solution. Consider computing the expected value of a neural network's output over all possible inputs, or the posterior mean in a Bayesian model with thousands of parameters. These integrals are intractable - we cannot solve them analytically.

The Core Problem

I=โˆซf(x)โ€‰p(x)โ€‰dx=Ep[f(X)]I = \int f(x) \, p(x) \, dx = \mathbb{E}_p[f(X)]

We want to compute this integral, but it's analytically intractable. What can we do?

Monte Carlo integration provides a remarkably simple and powerful solution: instead of computing the integral exactly, we approximate it by drawing random samples. This idea - replacing integrals with random sampling - is one of the most important computational techniques in modern science and engineering.

Historical Development

๐ŸŽฒ
Comte de Buffon (1777)

First recorded use of random sampling for computation. Buffon's needle problem: drop a needle repeatedly on lined paper to estimate ฯ€. This predates computers by nearly two centuries!

๐Ÿ’ฃ
Stanislaw Ulam & John von Neumann (1946)

Working on nuclear weapons at Los Alamos, Ulam invented the modern Monte Carlo method while recovering from illness and playing solitaire. Von Neumann implemented it on early electronic computers. The name "Monte Carlo" was chosen as a code name, referencing the famous casino.

๐Ÿ“Š
Modern Era (1990s-Present)

Monte Carlo methods now underpin virtually all of computational statistics and machine learning. Every time you train a neural network with SGD, you're doing Monte Carlo gradient estimation. Billions of MC samples are computed every second across the world's computers.


The Fundamental Idea

The genius of Monte Carlo lies in its simplicity. To estimate an expected valueEp[f(X)]\mathbb{E}_p[f(X)], we:

  1. Draw samples X1,X2,โ€ฆ,XnX_1, X_2, \ldots, X_n from the distribution p(x)p(x)
  2. Evaluate the function f(Xi)f(X_i) at each sample point
  3. Average the results: I^n=1nโˆ‘i=1nf(Xi)\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i)

The Monte Carlo Estimator

I^n=1nโˆ‘i=1nf(Xi)โ†’nโ†’โˆžEp[f(X)]=โˆซf(x)โ€‰p(x)โ€‰dx\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \xrightarrow[n \to \infty]{} \mathbb{E}_p[f(X)] = \int f(x) \, p(x) \, dx

That's it! The sample average converges to the true expected value as the number of samples increases. This works for any distribution p(x)p(x) we can sample from, and any function f(x)f(x) we can evaluate.

Law of Large Numbers Connection

Why does Monte Carlo work? The answer is the Law of Large Numbers (LLN), which we covered in Chapter 10. The LLN guarantees that the sample average converges to the population mean:

The LLN Guarantee

Strong Law: I^nโ†’a.s.E[f(X)]\hat{I}_n \xrightarrow{a.s.} \mathbb{E}[f(X)] as nโ†’โˆžn \to \infty

This means with probability 1, our Monte Carlo estimate converges to the true integral. The only requirement is that E[โˆฃf(X)โˆฃ]<โˆž\mathbb{E}[|f(X)|] < \infty (finite first moment).

Interactive: Basic Monte Carlo

Let's see Monte Carlo integration in action. We'll estimate the integral of a complex function using random samples. Watch how the estimate converges to the true value as you add more samples.

๐ŸŽฒBasic Monte Carlo Integration

Sample random x values, evaluate f(x), and average to estimate the integral

Samples: 100
Samples
0
MC Estimate
0.0000
True Value
2.0665
Absolute Error
2.0665

How it works: We draw random samples x1, x2, ..., xn uniformly from [โˆ’3, 3], evaluate the function at each point, and compute:

โˆซf(x)dxโ‰ˆ(b โˆ’ a) ยท (1/n) ยท ฮฃ f(xi)

Convergence Analysis

Knowing that Monte Carlo converges isn't enough - we need to know how fastit converges and how accurate our estimates are. This is where the Central Limit Theorembecomes essential.

Variance and the CLT

By the CLT, the Monte Carlo estimator is approximately normally distributed:

n(I^nโˆ’I)โ†’dN(0,ฯƒ2)\sqrt{n}(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma^2)

where ฯƒ2=Varp[f(X)]\sigma^2 = \text{Var}_p[f(X)]

This tells us several crucial things:

PropertyMathematical FormPractical Meaning
Standard ErrorSE = ฯƒ/โˆšnError decreases as 1/โˆšn
95% CII&#770; ยฑ 1.96ฯƒ/โˆšnWe can quantify uncertainty
To halve errorNeed 4ร— samplesAccuracy is expensive
Dimension-independentRate = O(1/โˆšn) for any dMC scales to high dimensions
The Square Root Law: The MC error decreases as 1/n1/\sqrt{n}. This rate is independent of the dimension of the integral. Whether you're integrating in 1D or 1000D, you still get O(1/n)O(1/\sqrt{n}) convergence. This is the secret sauce that makes Monte Carlo powerful for high-dimensional problems.

Interactive: Convergence Visualizer

Watch multiple independent Monte Carlo runs and observe how they all converge to the true value. The "confidence band" shrinks as 1/n1/\sqrt{n}, just as the theory predicts.

Monte Carlo Convergence: The 1/โˆšn Law

Watch multiple independent runs converge to the true value as n increases

Traces: 5
Max Samples: 500

Key Insight: The Square Root Law

The standard error of Monte Carlo estimation decreases as 1/โˆšn. This means:

  • To halve the error, you need 4ร— more samples
  • To reduce error by 10ร—, you need 100ร— more samples
  • This rate is dimension-independent - the key advantage for high-dimensional integrals!

Why MC Beats Grids in High Dimensions

To understand Monte Carlo's power, compare it to deterministic grid-based integration (like the trapezoid rule):

Grid-Based Methods

Error rate: O(nโˆ’2/d)O(n^{-2/d}) where d is dimension

  • In 1D: 100 points โ†’ 10,000 for same accuracy
  • In 10D: Need 10010=1020100^{10} = 10^{20} points!
  • Suffers from "curse of dimensionality"

Monte Carlo

Error rate: O(nโˆ’1/2)O(n^{-1/2}) independent of d

  • Same rate in 1D, 10D, or 10,000D
  • 100ร— more samples โ†’ 10ร— more accuracy
  • The only viable method for high-d integrals
Why This Matters for ML: Neural networks have millions of parameters. Computing expectations over weight space (for Bayesian neural networks) or input space (for expected outputs) is only possible with Monte Carlo methods. Grid-based integration would require more points than atoms in the universe!

Classic Example: Estimating ฯ€

The most famous demonstration of Monte Carlo integration is estimating ฯ€ by throwing random "darts" at a square with an inscribed quarter circle. This example perfectly illustrates the core MC idea: probability ratios equal area ratios.

The Geometric Setup

Unit Square:

  • Points (x, y) where 0 โ‰ค x, y โ‰ค 1
  • Area = 1 ร— 1 = 1

Quarter Circle:

  • Points where xยฒ + yยฒ โ‰ค 1
  • Area = ฯ€ ร— 1ยฒ / 4 = ฯ€/4
Pointsย insideย circleTotalย pointsโ‰ˆAreaย ofย circleAreaย ofย square=ฯ€/41=ฯ€4\frac{\text{Points inside circle}}{\text{Total points}} \approx \frac{\text{Area of circle}}{\text{Area of square}} = \frac{\pi/4}{1} = \frac{\pi}{4}

Therefore: ฯ€โ‰ˆ4ร—insidetotal\pi \approx 4 \times \frac{\text{inside}}{\text{total}}

Interactive: ฯ€ Estimator

Watch as random points accumulate and the estimate of ฯ€ improves. Notice how the convergence follows the 1/n1/\sqrt{n} law we derived above.

Estimating ฯ€ with Monte Carlo

The classic demonstration: throw darts at a square with inscribed quarter circle

Inside Circle
0
Outside Circle
0
ฯ€ Estimate
0.00000
Error
3.14159
True value of ฯ€
3.1415926536...

Why Does This Work?

The Geometry: We inscribe a quarter circle of radius 1 in a unit square.

  • Area of square = 1 ร— 1 = 1
  • Area of quarter circle = ฯ€rยฒ/4 = ฯ€/4
  • Ratio = ฯ€/4

The Monte Carlo Logic:

P(point inside circle)
ย ย = Area(circle) / Area(square)
ย ย = ฯ€/4

So: ฯ€ โ‰ˆ 4 ร— (inside / total)

Variance Reduction Techniques

While basic Monte Carlo is powerful, we can often do much better by being clever abouthow we sample. Variance reduction techniques maintain the correct expected value while reducing the variance of our estimator - giving us the same accuracy with fewer samples.

Importance Sampling

The key observation is that when we estimate Ep[f(X)]\mathbb{E}_p[f(X)], not all regions of the input space contribute equally. If f(x)f(x) varies greatly, we waste effort sampling wheref(x)โ‰ˆ0f(x) \approx 0 while missing regions where f(x)f(x) is large.

Importance Sampling Identity

Ep[f(X)]=โˆซf(x)p(x)dx=โˆซf(x)p(x)q(x)q(x)dx=Eq[f(X)p(X)q(X)]\mathbb{E}_p[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_q\left[f(X) \frac{p(X)}{q(X)}\right]

Sample from q(x), reweight by p(x)/q(x) to get an unbiased estimate of Ep[f(X)]

The proposal distribution q(x)q(x) should concentrate probability mass where โˆฃf(x)โˆฃโ‹…p(x)|f(x)| \cdot p(x) is large. The ideal choice (which achieves zero variance!) is:

qโˆ—(x)โˆโˆฃf(x)โˆฃโ‹…p(x)q^*(x) \propto |f(x)| \cdot p(x)

Of course, we usually can't use the optimal proposal (computing it requires the integral we're trying to estimate!), but even approximate matches can dramatically reduce variance.

Interactive: Importance Sampling

Compare naive uniform sampling with importance sampling. Notice how importance sampling focuses samples where the integrand is large, achieving lower variance with the same sample count.

Importance Sampling: Variance Reduction

Sample from a proposal distribution that focuses on important regions

The Idea: When f(x) varies greatly, uniform sampling wastes effort on regions where f(x) is small. Importance sampling draws more samples from regions where f(x) ยท p(x) is large.

How it works: Sample x from proposal q(x), then reweight:

E[f(X)] โ‰ˆ (1/n) ฮฃ f(xi) ยท p(xi) / q(xi)
Other Variance Reduction Techniques: Beyond importance sampling, there are several other powerful methods:
  • Control Variates: Subtract a known quantity with the same expected value
  • Antithetic Variables: Use negatively correlated sample pairs
  • Stratified Sampling: Divide domain into strata and sample proportionally
  • Latin Hypercube Sampling: Ensure samples cover the space evenly

Monte Carlo in Bayesian Inference

One of the most important applications of Monte Carlo integration is in Bayesian inference. Given data, we want to compute quantities like:

  • The posterior mean: E[ฮธโˆฃdata]\mathbb{E}[\theta | \text{data}]
  • The posterior variance: Var[ฮธโˆฃdata]\text{Var}[\theta | \text{data}]
  • Posterior probabilities: P(ฮธ>cโˆฃdata)P(\theta > c | \text{data})
  • The predictive distribution: p(ynewโˆฃdata)p(y_{\text{new}} | \text{data})

All of these are expectations with respect to the posterior distribution. Monte Carlo makes them trivial: just draw samples from the posterior and compute the sample average!

Monte Carlo for Posterior Expectations

Step 1: Draw samples ฮธ1,โ€ฆ,ฮธnโˆผp(ฮธโˆฃdata)\theta_1, \ldots, \theta_n \sim p(\theta | \text{data})

Step 2: Estimate any posterior quantity g(ฮธ):

E[g(ฮธ)โˆฃdata]โ‰ˆ1nโˆ‘i=1ng(ฮธi)\mathbb{E}[g(\theta) | \text{data}] \approx \frac{1}{n} \sum_{i=1}^n g(\theta_i)

This is incredibly powerful: once you have posterior samples, you can estimate anyfunction of the parameters without recomputing anything. Need the mean? Average the samples. Need the variance? Compute sample variance. Need P(ฮธ>0.5)P(\theta > 0.5)? Count the fraction above 0.5.

Interactive: Bayesian Integration

See Monte Carlo in action for Bayesian inference. We compute the posterior mean for a Beta-Binomial model and compare the Monte Carlo estimate to the exact analytical solution.

Bayesian Posterior Integration

Computing the posterior mean E[ฮธ | data] via Monte Carlo sampling

Exact Mean
0.7083
MC Estimate
โ€”
Rel. Error
โ€”
Data
15/20

The Bayesian Setup: We observe 15 successes in 20 Bernoulli trials. With a Beta(2, 2) prior on the success probability ฮธ, the posterior is Beta(17, 7).

Monte Carlo Approach: Instead of computing E[ฮธ | data] analytically, we draw samples from the posterior and take the sample mean. For complex posteriors (e.g., in neural network weight posteriors), this is often the only feasible approach!


Applications in AI/ML

Monte Carlo methods are ubiquitous in modern machine learning. Here are some key applications:

๐ŸŽฏ Stochastic Gradient Descent

Mini-batch SGD is Monte Carlo gradient estimation. Instead of computing the true gradient over all data:

โˆ‡L(ฮธ) = (1/n) ฮฃแตข โˆ‡โ„“แตข(ฮธ)
โ‰ˆ (1/B) ฮฃแตขโˆˆbatch โˆ‡โ„“แตข(ฮธ)

The mini-batch gradient is an unbiased MC estimate of the true gradient.

๐ŸŽฒ Reinforcement Learning

Policy gradients estimate the expected return gradient:

โˆ‡J(ฮธ) = E[R ยท โˆ‡log ฯ€(a|s;ฮธ)]
โ‰ˆ (1/N) ฮฃ Rแตข ยท โˆ‡log ฯ€แตข

REINFORCE, A2C, PPO all use Monte Carlo estimation of policy gradients.

๐Ÿง  Variational Autoencoders

VAEs use the reparameterization trick to get MC gradients through sampling:

z = ฮผ(x) + ฯƒ(x) ยท ฮต, ฮต~N(0,1)
E[f(z)] โ‰ˆ (1/L) ฮฃโ‚— f(zโ‚—)

This enables backprop through stochastic layers.

๐Ÿ”ฎ MC Dropout for Uncertainty

Run inference multiple times with dropout active:

mean = (1/T) ฮฃโ‚œ f(x; ฮธ, mask_t)
var = (1/T) ฮฃโ‚œ (f_t - mean)ยฒ

Dropout at test time approximates Bayesian inference over weights.

๐ŸŒซ๏ธ Diffusion Models

Score matching and denoising use MC estimation of expectations:

L = E[||ฮต - ฮตฮธ(xt, t)||ยฒ]
โ‰ˆ (1/B) ฮฃ ||ฮตi - ฮตฮธ(xti, ti)||ยฒ

DDPM, DALL-E, Stable Diffusion all rely on MC training.

๐Ÿ”— Bayesian Neural Networks

Full Bayesian inference over weights uses MCMC or VI:

p(y|x, D) = โˆซ p(y|x,w) p(w|D) dw
โ‰ˆ (1/S) ฮฃโ‚› p(y|x, wโ‚›)

Enables principled uncertainty quantification in deep learning.


Python Implementation

Let's implement Monte Carlo integration from scratch. Click on code lines to see detailed explanations of each component.

Monte Carlo Integration in Python
๐Ÿmonte_carlo.py
1NumPy Import

NumPy is essential for vectorized operations on sample arrays. Monte Carlo involves many independent samples, making NumPy&apos;s array operations crucial for efficiency.

EXAMPLE
np.random.uniform(0, 1, 10000) - draw 10K samples
8Monte Carlo Integration Function

This is the core Monte Carlo integrator. It estimates โˆซf(x)dx from a to b using random samples. The key insight: E[f(X)] for X ~ Uniform(a,b) equals the integral divided by (b-a).

17Uniform Sampling

We draw samples uniformly from [a, b]. Each sample x_i is equally likely anywhere in the integration domain. This is the simplest Monte Carlo approach.

EXAMPLE
For [0, 1]: each x has equal probability of being anywhere in the interval
20Function Evaluation

We evaluate f at all sample points. NumPy&apos;s broadcasting makes this efficient - f is applied elementwise to the entire array.

23Monte Carlo Estimate

The estimate is (b-a) ร— mean(f). This works because: โˆซf(x)dx = (b-a) ร— E[f(X)] where X ~ Uniform(a,b). The sample mean estimates this expectation.

EXAMPLE
For f(x)=xยฒ on [0,1]: estimate = 1 ร— mean(x_iยฒ) โ‰ˆ 1/3
26Standard Error

The standard error is ฯƒ/โˆšn where ฯƒ is the std of f(X). This quantifies uncertainty in our estimate. Note the 1/โˆšn scaling - this is the key Monte Carlo convergence rate!

31Importance Sampling

Instead of sampling uniformly, we sample from a proposal q(x) that focuses on regions where f(x)ยทp(x) is large. This reduces variance when the proposal is well-chosen.

44Importance Weights

The weight w(x) = p(x)/q(x) corrects for the fact that we&apos;re sampling from q instead of p. This is the fundamental importance sampling identity: E_p[f] = E_q[fยทp/q].

47Self-Normalized Weights

Self-normalization divides by the sum of weights. This makes the estimator robust when we only know p up to a normalizing constant - crucial in Bayesian inference!

53Effective Sample Size

ESS measures how many independent samples our weighted samples are &apos;worth&apos;. ESS << n indicates high weight variance - the proposal is a poor match to the target.

EXAMPLE
ESS = n means all weights are equal (perfect proposal). ESS = 1 means one sample dominates.
62Bayesian Example

This demonstrates Monte Carlo for Bayesian inference. We compute E[ฮธ|data] by sampling from the posterior. This is the foundation of computational Bayesian methods.

73Posterior Sampling

We draw samples directly from the posterior Beta distribution. In practice, for complex posteriors, we&apos;d use MCMC (covered in the next section).

82Posterior Probability

Monte Carlo makes computing ANY posterior quantity easy: just compute the quantity for each sample and average. Here: P(ฮธ > 0.7 | data) โ‰ˆ fraction of samples > 0.7.

95 lines without explanation
1import numpy as np
2from scipy import stats
3
4# ================================================
5# Monte Carlo Integration: Core Implementation
6# ================================================
7
8def monte_carlo_integrate(f, a, b, n_samples=10000, seed=None):
9    """
10    Estimate integral of f(x) from a to b using Monte Carlo.
11
12    Returns: (estimate, standard_error)
13    """
14    if seed is not None:
15        np.random.seed(seed)
16
17    # Draw uniform samples from [a, b]
18    x_samples = np.random.uniform(a, b, n_samples)
19
20    # Evaluate function at sample points
21    f_values = f(x_samples)
22
23    # Monte Carlo estimate: (b-a) * mean(f)
24    estimate = (b - a) * np.mean(f_values)
25
26    # Standard error estimate
27    std_error = (b - a) * np.std(f_values) / np.sqrt(n_samples)
28
29    return estimate, std_error
30
31
32def importance_sampling(f, p_pdf, q_pdf, q_sampler, n_samples=10000):
33    """
34    Estimate E_p[f(X)] using importance sampling with proposal q.
35
36    Args:
37        f: Function to integrate
38        p_pdf: Target density p(x)
39        q_pdf: Proposal density q(x)
40        q_sampler: Function that draws samples from q
41
42    Returns: (estimate, effective_sample_size)
43    """
44    # Sample from proposal distribution
45    x_samples = q_sampler(n_samples)
46
47    # Compute importance weights: w(x) = p(x) / q(x)
48    weights = p_pdf(x_samples) / q_pdf(x_samples)
49
50    # Normalize weights (self-normalized importance sampling)
51    normalized_weights = weights / np.sum(weights)
52
53    # Weighted estimate
54    estimate = np.sum(normalized_weights * f(x_samples))
55
56    # Effective sample size (diagnostic for weight quality)
57    ess = 1 / np.sum(normalized_weights ** 2)
58
59    return estimate, ess
60
61
62# ================================================
63# Example: Posterior Mean via Monte Carlo
64# ================================================
65
66def bayesian_mc_example():
67    """
68    Compute E[theta | data] for Beta-Binomial model.
69    """
70    # Prior: Beta(2, 2), Data: 15 successes in 20 trials
71    prior_a, prior_b = 2, 2
72    successes, trials = 15, 20
73
74    # Posterior: Beta(prior_a + successes, prior_b + failures)
75    post_a = prior_a + successes
76    post_b = prior_b + (trials - successes)
77
78    # Exact posterior mean (for comparison)
79    exact_mean = post_a / (post_a + post_b)
80
81    # Monte Carlo estimate
82    n_samples = 10000
83    theta_samples = np.random.beta(post_a, post_b, n_samples)
84    mc_mean = np.mean(theta_samples)
85    mc_std = np.std(theta_samples) / np.sqrt(n_samples)
86
87    print(f"Posterior: Beta({post_a}, {post_b})")
88    print(f"Exact Mean:     {exact_mean:.6f}")
89    print(f"MC Estimate:    {mc_mean:.6f} ยฑ {mc_std:.6f}")
90    print(f"MC Error:       {abs(mc_mean - exact_mean):.6f}")
91
92    # Also estimate posterior probability P(theta > 0.7 | data)
93    prob_gt_07 = np.mean(theta_samples > 0.7)
94    exact_prob = 1 - stats.beta.cdf(0.7, post_a, post_b)
95    print(f"\nP(ฮธ > 0.7 | data):")
96    print(f"  MC Estimate:  {prob_gt_07:.4f}")
97    print(f"  Exact:        {exact_prob:.4f}")
98
99
100# Run the example
101if __name__ == "__main__":
102    # Basic integration example
103    f = lambda x: np.exp(-x**2) * (1 + np.sin(3*x)**2)
104    est, se = monte_carlo_integrate(f, -3, 3, n_samples=100000)
105    print(f"Integral estimate: {est:.4f} ยฑ {se:.4f}")
106
107    print("\n" + "="*50 + "\n")
108    bayesian_mc_example()

Common Pitfalls

โŒ

Ignoring the 1/โˆšn Convergence Rate

Thinking "more samples = proportionally better" is wrong. To reduce error by 10ร—, you need 100ร— more samples, not 10ร—. Plan your compute budget accordingly.

Fix: Understand that accuracy gains diminish rapidly with sample size. Consider variance reduction techniques before brute-force sampling.

โŒ

Bad Importance Sampling Proposals

If your proposal q(x) doesn't cover regions where p(x)ยทf(x) is large, you'll miss important contributions. If q(x) has thin tails relative to p(x)ยทf(x), you'll get huge variance from rare extreme weights.

Fix: Always check the Effective Sample Size (ESS). If ESS << n, your proposal is poor. Use proposals with heavier tails than the target.

โŒ

Using Fixed Random Seeds in Production

While fixed seeds are great for reproducibility in research, using them in production can lead to systematic biases. Your "random" samples are always the same!

Fix: Use fixed seeds only for debugging. In production, use high-quality random number generators with proper seeding.

โŒ

Not Reporting Standard Errors

A Monte Carlo estimate without uncertainty bounds is incomplete. The standard error tells you how much your estimate might vary with different random samples.

Fix: Always compute and report SE = ฯƒ/โˆšn. Use this to construct confidence intervals: estimate ยฑ 1.96 ร— SE for 95% CI.

โŒ

Infinite Variance Functions

If Var[f(X)]=โˆž\text{Var}[f(X)] = \infty, the CLT doesn't apply and standard error estimates are meaningless. Heavy-tailed functions can have this property.

Fix: Check if your function has finite variance. If not, consider truncation, importance sampling with lighter-tailed proposals, or robust estimators.


Knowledge Check

Test your understanding of Monte Carlo integration with this interactive quiz.

Knowledge Check: Monte Carlo Integration

Question 1 of 8

What is the key insight behind Monte Carlo integration?

Score: 0/0

Summary

Key Takeaways

  1. Monte Carlo replaces integrals with averages: Draw samples from p(x), evaluate f, and average. The sample mean converges to E[f(X)] by the LLN.
  2. Error decreases as 1/โˆšn: This rate is dimension-independent, making MC the only viable method for high-dimensional integrals.
  3. The CLT gives confidence intervals: MC estimates are approximately normal with standard error ฯƒ/โˆšn, enabling uncertainty quantification.
  4. Importance sampling reduces variance: By focusing samples where the integrand is large, we get the same accuracy with fewer samples.
  5. MC enables Bayesian computation: Once you have posterior samples, any posterior quantity becomes a simple average.
  6. MC is everywhere in ML: SGD, policy gradients, VAEs, diffusion models, and Bayesian deep learning all use Monte Carlo estimation.
  7. Always report uncertainty: A MC estimate without standard error is incomplete. Use SE = ฯƒ/โˆšn to quantify precision.
Looking Ahead: In the next section, we'll tackle a fundamental problem: what if we can't directly sample from our target distribution? Markov Chain Monte Carlo(MCMC) provides the answer, using carefully constructed random walks to generate samples from complex, high-dimensional distributions.
Loading comments...