Learning Objectives
By the end of this section, you will:
- Deeply understand what expectation means intuitively
- See expectation as the center of mass of a distribution
- Understand why the formula is a weighted average
- Master LOTUS (Law of the Unconscious Statistician)
- Know why expectation minimizes mean squared error
- Apply Jensen's Inequality to ML problems
- Connect expectation to the Law of Large Numbers
- Understand why expectation appears everywhere in ML
- Avoid common pitfalls with expectation
- Preview conditional expectation and tail risks
- See how the integral formula arises from discrete sums
Historical Context
The Birth of Expected Value
The concept of expectation was born from gambling! In 1654, the French mathematicians Blaise Pascal and Pierre de Fermat exchanged letters about the "problem of points"—how to fairly divide stakes in an interrupted game of chance.
Christiaan Huygens published the first treatise on probability in 1657, introducing the term "expectatio" (Latin for expectation). He framed it as: "If I have equal chances of getting a or b, my expectation is (a+b)/2."
Historical Insight: The term "expected value" originally meant "what you should expect to win" in a fair game. Today it means the long-run average of any random variable.
What is Expectation Intuitively?
Expectation = the long-run average value of a random variable if you could repeat the experiment forever.
Keep this picture in mind: a random variable produces values—sometimes small, sometimes large, sometimes medium—and the expectation is the single number that summarizes where the outcomes concentrate on average.
The Core Insight: Expectation is the "average destination of randomness." Even though randomness produces chaos moment-to-moment, expectation captures where everything gravitates toward in the long run.
Correcting Common Misconceptions
Common Misconception
"Expectation measures how random the values are"
Correct Understanding
Expectation measures where the randomness is centered. Variance measures how random/spread the values are.
Common Misconception
"Expectation is the most likely value"
Correct Understanding
The most likely value is the mode. Expectation is the weighted average of all possible values.
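To make the distinction concrete, here is a minimal NumPy sketch (the distribution and its numbers are illustrative, not from the text) contrasting the mode with the expectation:

```python
import numpy as np

# A right-skewed discrete distribution where the mode and the mean differ
values = np.array([1, 2, 3, 10])
probs = np.array([0.5, 0.3, 0.1, 0.1])     # sums to 1

mode = values[np.argmax(probs)]            # most likely value
mean = np.sum(values * probs)              # probability-weighted average

print(f"mode = {mode}, E[X] = {mean}")     # mode = 1, E[X] = 2.4
```

The rare but large value 10 pulls the mean far above the mode, even though 10 is the least likely outcome.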
The Formula: Why Sum and Integral?
For Discrete Random Variables

$$E[X] = \sum_i x_i \, P(X = x_i)$$

For Continuous Random Variables

$$E[X] = \int_{-\infty}^{\infty} x \, f(x)\, dx$$
Interpretation: You multiply each possible value by how likely it is. Then you add (or integrate) them up. The result is the weighted average of all possibilities.
The Core Truth
Expectation is just average = value × likelihood. Nothing more mystical. The formula simply weights each outcome by its probability.
More Generally: Functions of Random Variables
In reality, we often care about functions of $X$, not just $X$ itself:

$$E[g(X)]$$

This is necessary because in real systems:
- Power = $E[X^2]$
- Loss = $E[(\hat{y}(X) - y)^2]$
- Log-likelihood = $E[\log p(X; \theta)]$
LOTUS: Law of the Unconscious Statistician
One of the most powerful formulas in probability is the Law of the Unconscious Statistician (LOTUS). It lets you compute $E[g(X)]$ without finding the distribution of $g(X)$:

$$E[g(X)] = \sum_i g(x_i)\, P(X = x_i) \quad \text{(discrete)} \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx \quad \text{(continuous)}$$
Why "Unconscious"?
It's called "unconscious" because students often use it without realizing they're applying a theorem! The formula looks obvious but requires proof. You can compute E[X²] directly from f_X(x) without first finding the distribution of X².
LOTUS in Practice
| Goal | LOTUS Formula | Example |
|---|---|---|
| Needed for variance | $E[X^2] = \int x^2 f_X(x)\,dx$ | $\operatorname{Var}(X) = E[X^2] - (E[X])^2$ |
| Entropy, log-likelihood | $E[\log X] = \int \log(x)\, f_X(x)\,dx$ | Expected log-likelihood terms |
| Moment generating function | $E[e^{tX}] = \int e^{tx} f_X(x)\,dx$ | $M_X(t)$ |
| Skewness calculation | $E[X^3] = \int x^3 f_X(x)\,dx$ | Third moment |
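As a quick sanity check of LOTUS, this sketch uses `scipy.integrate.quad` (an assumed tool choice, not prescribed by the text) to compute $E[X^2]$ for Uniform(0, 1) directly from the density, with no need to derive the distribution of $X^2$:

```python
from scipy.integrate import quad

# LOTUS: E[g(X)] = ∫ g(x) f(x) dx, with no need for the density of g(X)
f = lambda x: 1.0        # density of Uniform(0, 1) on its support
g = lambda x: x**2

e_g, _ = quad(lambda x: g(x) * f(x), 0, 1)
print(e_g)               # 1/3: matches the known E[X²] for Uniform(0, 1)
```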
Interactive: Weighted Average
Experiment with this visualization to see how expectation is computed as a weighted average:
Expectation as Center of Mass
Think of your random variable as mass spread on a number line. The expectation is the point where you could balance the distribution on a needle.
This physics analogy is not just a metaphor—it is mathematically exact! Just as center of mass is the weighted average of positions (weighted by mass), expectation is the weighted average of values (weighted by probability).
Expectations of Common Distributions
Here is a quick reference for the expectations of distributions you'll encounter frequently in ML and statistics:
Discrete Distributions
| Distribution | Notation | $E[X]$ | Intuition |
|---|---|---|---|
| Bernoulli | $\text{Bern}(p)$ | $p$ | Probability of success |
| Binomial | $\text{Bin}(n, p)$ | $np$ | Expected number of successes in $n$ trials |
| Geometric | $\text{Geom}(p)$ | $1/p$ | Expected trials until first success |
| Poisson | $\text{Pois}(\lambda)$ | $\lambda$ | Expected count equals rate parameter |
| Uniform (discrete) | $\text{Unif}\{a, \dots, b\}$ | $(a+b)/2$ | Middle of the range |
Continuous Distributions
| Distribution | Notation | $E[X]$ | Intuition |
|---|---|---|---|
| Uniform | $\text{Unif}(a, b)$ | $(a+b)/2$ | Midpoint of interval |
| Exponential | $\text{Exp}(\lambda)$ | $1/\lambda$ | Inverse of rate = mean waiting time |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | $\mu$ | Mean parameter directly gives expectation |
| Gamma | $\text{Gamma}(\alpha, \beta)$ | $\alpha/\beta$ | Shape/rate |
| Beta | $\text{Beta}(\alpha, \beta)$ | $\alpha/(\alpha+\beta)$ | Weighted proportion of $\alpha$ |
| Chi-squared | $\chi^2_k$ | $k$ | Degrees of freedom |
| Log-normal | $\text{LogN}(\mu, \sigma^2)$ | $e^{\mu + \sigma^2/2}$ | Note: NOT $e^\mu$! |
Log-normal Trap
For $X \sim \text{LogN}(\mu, \sigma^2)$, $E[X] = e^{\mu + \sigma^2/2} > e^\mu$. This is a consequence of Jensen's inequality, since $\exp$ is convex: $E[e^Y] \geq e^{E[Y]}$.
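A short simulation (sample size and seed are arbitrary choices) makes the log-normal trap visible:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
y = rng.normal(mu, sigma, size=1_000_000)     # Y ~ N(mu, sigma²)

e_exp_y = np.exp(y).mean()                    # E[e^Y]: mean of the log-normal
exp_e_y = np.exp(y.mean())                    # e^{E[Y]}: the naive (wrong) guess
theory = np.exp(mu + sigma**2 / 2)            # exact log-normal mean ≈ 1.6487

print(e_exp_y, exp_e_y, theory)               # E[e^Y] clearly exceeds e^{E[Y]}
```

The naive guess lands near $e^\mu = 1$, while the true mean is about 65% larger, exactly the Jensen gap.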
Why Statisticians Love Expectation
Expectation has magical properties that make it the foundation of all statistical analysis:
1. It Compresses the Whole Distribution into One Stable Number
Even if the distribution is complicated, expectation gives a stable center that summarizes the "typical" behavior.
2. It is LINEAR (This is HUGE!)

$$E[aX + bY] = a\,E[X] + b\,E[Y] \quad \text{for any constants } a, b \text{ — whether or not } X \text{ and } Y \text{ are independent}$$

No other summary behaves this nicely! This makes derivations, proofs, estimators, and ML algorithms beautifully simple.
Linearity is Power
The linearity of expectation is used everywhere: in gradient descent, Bayesian inference, signal processing, and control theory. When you see a sum of random variables, you can immediately split the expectation!
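A minimal sketch of linearity in action. Note that $Y$ is deliberately built to depend strongly on $X$, yet the identity still holds (the distributions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.exponential(2.0, size=n)
y = x**2 + rng.normal(0, 1, size=n)   # y is strongly DEPENDENT on x

# Linearity needs no independence: E[X + Y] = E[X] + E[Y]
lhs = (x + y).mean()
rhs = x.mean() + y.mean()
print(lhs, rhs)   # agree up to floating-point rounding
```

Contrast this with $E[XY]$, which would need independence to factor.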
3. It Connects to Reality Through the Law of Large Numbers

$$\frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{\;n \to \infty\;}\; E[X]$$
This means expectation is not an imaginary math object. It is literally what you observe in the real world if you take enough samples!
Moment Generating Functions
The Moment Generating Function (MGF) is a powerful tool that encodes all moments of a distribution in a single function. It's defined as:

$$M_X(t) = E[e^{tX}]$$
The name comes from the remarkable property: derivatives of the MGF give moments.
Why "Moment Generating"?
Expand $e^{tX}$ as a Taylor series:

$$e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots$$

Taking expectation term by term:

$$M_X(t) = 1 + t\,E[X] + \frac{t^2\, E[X^2]}{2!} + \frac{t^3\, E[X^3]}{3!} + \cdots$$
The Key Result
The n-th derivative of $M_X(t)$ evaluated at $t = 0$ gives the n-th moment:

$$E[X^n] = M_X^{(n)}(0) = \left.\frac{d^n}{dt^n} M_X(t)\right|_{t=0}$$
MGFs of Common Distributions
| Distribution | $M_X(t)$ | Domain of $t$ |
|---|---|---|
| Bernoulli($p$) | $1 - p + pe^t$ | all $t$ |
| Binomial($n, p$) | $(1 - p + pe^t)^n$ | all $t$ |
| Poisson($\lambda$) | $e^{\lambda(e^t - 1)}$ | all $t$ |
| Normal($\mu, \sigma^2$) | $e^{\mu t + \sigma^2 t^2/2}$ | all $t$ |
| Exponential($\lambda$) | $\lambda/(\lambda - t)$ | $t < \lambda$ |
Why MGFs Matter in ML
- Uniqueness: If two distributions have the same MGF, they're identical. Useful for proving distributional results.
- Sum of independent RVs: $M_{X+Y}(t) = M_X(t)\,M_Y(t)$ — products are easier than convolutions!
- Central Limit Theorem: The CLT proof uses MGF convergence.
- Concentration bounds: Chernoff bounds use $P(X \geq a) \leq e^{-ta}\, M_X(t)$ for $t > 0$
Characteristic Functions
When the MGF doesn't exist (heavy tails), use the characteristic function: $\varphi_X(t) = E[e^{itX}]$. It always exists and has similar properties. The Fourier transform connection makes it fundamental in signal processing.
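The moment-generating property can be checked numerically. This sketch (standard normal, illustrative sample size) compares the empirical MGF with the closed form $e^{t^2/2}$ and recovers $E[X^2] = 1$ with a finite-difference second derivative at $t = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, size=1_000_000)

# Empirical MGF of a standard normal vs the closed form M(t) = e^{t²/2}
M = lambda t: np.exp(t * x).mean()
for t in (0.5, 1.0):
    print(f"M({t}) = {M(t):.4f}   closed form = {np.exp(t**2 / 2):.4f}")

# Second derivative at t = 0 recovers E[X²] (central difference)
h = 1e-3
e_x2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2
print(f"E[X^2] via MGF ≈ {e_x2:.4f}")
```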
Jensen's Inequality
Jensen's Inequality is one of the most important results connecting expectation with function transformations. It tells us exactly when E[g(X)] differs from g(E[X]).
Understanding the Two Quantities
"Average of the transformed values"
First apply function g to each possible value of X, then take the average. You transform first, average second.
"Transformation of the average value"
First find the average of X, then apply function g to that single average. You average first, transform second.
🔑 Key Question: Does the order matter? Yes! Jensen's inequality tells us exactly how.
Jensen's Inequality
For a convex function $g$ (curves upward like $x^2$, $e^x$):

$$E[g(X)] \;\geq\; g(E[X])$$
📖 Intuitive Meaning:
"The average of squares is always greater than or equal to the square of the average."
When you apply a convex function to random values and then average, you get a larger result than if you first averaged and then applied the function. Convex functions "amplify" spread—the more variable your data, the bigger the gap.
For a concave function $g$ (curves downward like $\log x$, $\sqrt{x}$):

$$E[g(X)] \;\leq\; g(E[X])$$
📖 Intuitive Meaning:
"The average of logarithms is always less than or equal to the logarithm of the average."
When you apply a concave function to random values and then average, you get a smaller result than if you first averaged and then applied the function. Concave functions "compress" spread—they penalize variability.
When does equality hold?
$E[g(X)] = g(E[X])$ happens in two cases:
- No randomness: X is a constant (no spread at all)
- Linear function: g(x) = ax + b (neither convex nor concave)
This is why expectation is linear: E[aX + b] = aE[X] + b always holds!
Why Jensen's Inequality Matters in ML
| Application | Function | Convexity | Consequence |
|---|---|---|---|
| ELBO (VAEs) | $\log$ | Concave | $E_q[\log(\cdot)] \leq \log E_q[\cdot]$ → maximize lower bound |
| Cross-entropy | $-\log$ | Convex | Nice optimization landscape |
| Bias in estimators | $x^2$ | Convex | $E[\hat\theta^{\,2}] \geq (E[\hat\theta])^2$ |
| Sample variance | $x^2$ | Convex | Naive plug-in estimator is biased |
| KL divergence | $-\log$ | Convex | $D_{KL}(P \| Q) \geq 0$ always |
Geometric Intuition
For a convex function, the curve lies below any chord (line connecting two points on the curve). This means:
- Points on the curve: $(x_1, g(x_1))$ and $(x_2, g(x_2))$
- Average of points: $\big(E[X],\, E[g(X)]\big)$ (on or above the chord)
- Value at average: $\big(E[X],\, g(E[X])\big)$ (on the curve)
- Result: Chord is above curve, so $E[g(X)] \geq g(E[X])$
Interactive 2D: Drag & Explore
This powerful visualization lets you drag distribution points along the curve, adjust probabilities, and instantly see how Jensen's inequality responds. Try different functions to build deep intuition!
Interactive 3D: Surface View
See Jensen's inequality come alive in 3D! For functions of two variables, convex surfaces curve upward like a bowl, and the weighted average of surface points always lies above the surface at the average point. Rotate the view to see this geometric truth from every angle.
Interactive: Jensen's Inequality (Basic)
Here's a simpler view focusing on the core concept with fewer controls:
Law of Large Numbers in Action
Watch how the sample average converges to the true expectation as you take more samples:
Why This Matters
Every machine learning algorithm uses expectation implicitly. When you train a model, you are approximating the expected loss. The Law of Large Numbers guarantees that the empirical training loss converges to the true risk as the sample size grows.
Physical and Engineering Meaning
What does expectation mean in real-world engineering applications?
| If X represents... | Expectation means... |
|---|---|
| Voltage | Average voltage level |
| Noise | Bias in the noise (DC component) |
| Component lifetime | Expected lifetime (MTTF) |
| Daily stock return | Average daily gain/loss |
| Model prediction error | True risk (expected loss) |
| Sensor reading | True underlying value |
| Queue waiting time | Average wait time |
Engineering Insight: Engineers LOVE expectation because we design for average energy, average power, expected error. It gives us a single number to optimize against.
What Information Does It Give?
Expectation answers one fundamental question:
"If randomness continues forever, what do I typically see?"
Expectation also allows us to define other key quantities:
- Variance: $\operatorname{Var}(X) = E[(X - E[X])^2]$
- Covariance: $\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
- Risk in ML: $R(\theta) = E[L(\theta; X)]$
- KL Divergence: $D_{KL}(P \| Q) = E_{X \sim P}\!\left[\log \frac{P(X)}{Q(X)}\right]$
Expectation is the foundation of all statistical learning.
Expectation in Machine Learning
In ML, we always optimize:

$$\min_\theta \; E_X[L(\theta; X)]$$
Why expectation? Because:
- You train a model to minimize the expected loss
- You never know the real future inputs
- But expectation gives their "average behavior"
Every ML Algorithm Uses Expectation
Your network's gradient is literally:

$$\nabla_\theta\, E[L(\theta; X)] = E[\nabla_\theta L(\theta; X)]$$
This interchange (linearity!) is why gradient descent works. SGD is just Monte Carlo approximation of this expectation.
Comprehensive ML Applications
| Algorithm/Concept | How Expectation Appears | Formula |
|---|---|---|
| Cross-Entropy Loss | Expected negative log-likelihood | E[-log p(y|x)] |
| Policy Gradient (RL) | Expected reward under policy | E_π[R·∇log π] |
| Dropout | Ensemble averaging at test time | E[f(x; mask)] |
| Batch Normalization | Normalize using E[x] and Var(x) | (x - E[x])/√Var(x) |
| VAE ELBO | Expected reconstruction + KL | E_q[log p(x|z)] - KL |
| Attention Weights | Weighted average of values | E[V | Q,K] = softmax(QK^T)V |
| Monte Carlo Tree Search | Expected value of game state | E[reward | state, action] |
| Bayesian Neural Nets | Predictive uncertainty | E[f(x) | data] |
The Reparameterization Trick
In VAEs, we need gradients through expectations. The reparameterization trick rewrites:

$$z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2) \quad \Longleftrightarrow \quad z = \mu_\phi + \sigma_\phi \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)$$

so that $E_{z \sim q_\phi}[f(z)] = E_{\varepsilon}[f(\mu_\phi + \sigma_\phi\, \varepsilon)]$.
Now the expectation is over ε which doesn't depend on φ, so we can backpropagate through μ and σ!
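Here is a minimal numerical sketch of the trick for $f(z) = z^2$, where the pathwise gradient estimate can be checked against the analytic answer $2\mu$ (the values of $\mu$ and $\sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7                      # illustrative "variational" parameters

# Reparameterize z ~ N(mu, sigma²) as z = mu + sigma * eps, eps ~ N(0, 1)
eps = rng.normal(0, 1, size=1_000_000)    # noise does NOT depend on (mu, sigma)
z = mu + sigma * eps

# Pathwise gradient of E[f(z)] wrt mu for f(z) = z²: E[f'(z) · dz/dmu] = E[2z]
grad_mu = (2 * z).mean()
print(grad_mu)   # analytic: d/dmu E[(mu + sigma·eps)²] = 2·mu = 3.0
```

Because the randomness lives in `eps`, the sample average is an ordinary differentiable function of `mu` and `sigma`, which is what lets autodiff frameworks backpropagate through it.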
Population vs Sample World
One of the most important distinctions in statistics and machine learning is between the population world and the sample world. Understanding this distinction is key to understanding why expectation matters so deeply.
Population World (True, Infinite, Theoretical)
- The true data-generating process
- Infinite possible observations
- Governed by unknown parameter $\theta^*$
- We can never fully observe it
Sample World (Finite, Observed, Practical)
- The data we actually collect
- Finite observations
- Used to estimate $\theta^*$
- All we have access to in practice
What "Expectation under " ACTUALLY Means
When we write:
we are doing this thought experiment:
"Imagine the universe is truly generating data using the true but unknown parameter . If we could repeatedly collect infinite datasets from that universe, and for each dataset compute the loss using our guess , what would be the long-run average loss?"
So:
| Symbol | Meaning |
|---|---|
| $X$ | Random data generated from the true world |
| $\theta^*$ | True data-generating parameter |
| $\theta$ | Your trial / guess |
| $L(\theta; X)$ | Error of your guess $\theta$ on data $X$ |
| $E_{X \sim P_{\theta^*}}[\cdot]$ | Average over the true world |
The Risk Function
This expectation has a special name—it's called the risk function:

$$R(\theta) = E_{X \sim P_{\theta^*}}[L(\theta; X)]$$

It is a population-level truth curve over all possible data. The risk function tells us: "For any guess $\theta$, what is the true expected error?"
True Risk vs Empirical Risk
| Quantity | Formula | Meaning |
|---|---|---|
| True Risk | $R(\theta) = E_{X \sim P_{\theta^*}}[L(\theta; X)]$ | Infinite-world average |
| Empirical Risk | $\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i)$ | Finite-sample average |
So:
- True risk = infinite-world average (what we want)
- Empirical risk = finite-sample average (what we can compute)
- We minimize empirical risk because that's all we have
- Empirical risk converges to true risk by LLN
- Therefore the minimizer $\hat\theta_n$ converges to $\theta^*$
The Fundamental Theorem of Statistical Learning
Here is the mathematically precise version of this idea. Let

$$\hat\theta_n = \arg\min_\theta \hat{R}_n(\theta), \qquad \theta^* = \arg\min_\theta R(\theta)$$

Then:

$$\hat\theta_n \;\xrightarrow{\;n \to \infty\;}\; \theta^*$$
This is Consistency
This is exactly consistency of estimators. This is exactly how MLE, least squares, and Empirical Risk Minimization (ERM) work!
Summary: Two Worlds, One Bridge
| World | What happens |
|---|---|
| True world | $P_{\theta^*}$ generates infinite data |
| Risk function | Measures theoretical error of any guess $\theta$ |
| Sample world | You only see $x_1, \dots, x_n$ |
| Training | You minimize the empirical average $\hat{R}_n(\theta)$ |
| As $n \to \infty$ | Empirical ≈ Population |
Deep Intuition in One Sentence
Expectation under $\theta^*$ means: "How wrong would my guess $\theta$ be on average if Nature keeps generating data using the true parameter $\theta^*$ forever?"
What Does $\arg\min$ Mean?

Before we dive into examples, let's clarify a notation you'll see everywhere in ML:

$$\arg\min_\theta f(\theta)$$

It means:

"Choose the value of $\theta$ for which the function $f(\theta)$ becomes as small as possible."
Very important distinction:
- `min` → gives you the minimum value of the function
- `arg min` → gives you the argument (input) that achieves that minimum
Tiny Numerical Example (Concrete)
Suppose:

$$f(\theta) = (\theta - 2)^2$$
Let's test values:
| $\theta$ | $f(\theta)$ |
|---|---|
| 0 | 4 |
| 1 | 1 |
| 2 | 0 ← minimum value |
| 3 | 1 |
| 4 | 4 |
- The minimum value is: $\min_\theta f(\theta) = 0$
- The theta that gives this minimum is: $\theta = 2$

So:

$$\arg\min_\theta f(\theta) = 2$$
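In NumPy terms, the `min` / `arg min` distinction is one line each (using the same toy objective as the table):

```python
import numpy as np

theta = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f = (theta - 2) ** 2                # the toy objective f(θ) = (θ - 2)²

min_value = f.min()                 # min      -> the smallest value of f
arg_min = theta[f.argmin()]         # arg min  -> the θ that achieves it

print(min_value, arg_min)           # 0.0 and 2.0
```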
In the Context of Machine Learning / Statistics
When you see:
It literally means:
"Choose the parameter that makes the average loss on the data as small as possible."
This is:
- Parameter estimation
- Model training
- Learning
- Optimization
- Fitting the model to data
All the same thing.
In Bayesian Form (MAP)
When you see:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \text{data})$$
That means:
"Choose the value of that is most probable after seeing the data."
And since:

$$\log p(\theta \mid \text{data}) = \log p(\text{data} \mid \theta) + \log p(\theta) + \text{const}$$

You again get:
"Choose that minimizes (data loss + regularization)."
In Deep Learning (GPT, Diffusion, etc.)
When training a network:

$$\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n} L(w; x_i)$$

this means:
"Adjust the weights so that prediction error becomes as small as possible."
Backprop + SGD are just numerical machines that search for this arg min.
"Arg min means: return the input value that makes the function as small as it can possibly be."
A Fully Numerical Toy Example
(See empirical risk → true risk → $\theta^*$)
We'll use the simplest possible model so everything is visible.
◆ True data-generating world (unknown to us)
Assume Nature uses a Normal distribution:

$$X \sim \mathcal{N}(2, \sigma^2)$$
We don't know that 2 is the truth. We only see samples.
◆ Loss function (your )
Use squared error:

$$L(\theta; X) = (X - \theta)^2$$
◆ Step 1: The True Risk Function

$$R(\theta) = E[(X - \theta)^2]$$

For a normal distribution, this simplifies to:

$$R(\theta) = (\theta - 2)^2 + \sigma^2$$
So:
- This is a perfect upward parabola
- It is minimized exactly at $\theta = 2$
- This is a population-level truth curve
◆ Step 2: What you actually observe (finite data)
Say you observe a few samples $x_1, \dots, x_n$ from this distribution.
Your empirical risk:

$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \theta)^2$$
Plug a few trial values of $\theta$ into $\hat{R}_n(\theta)$ and compare the results.
The minimum occurs near 2, the true $\theta^*$.
◆ Step 3: What happens as $n \to \infty$

$$\hat{R}_n(\theta) \to R(\theta) \quad \text{and} \quad \hat\theta_n \to \theta^* = 2$$
The Punchline
By minimizing what we can compute (empirical risk on finite data), we get closer and closer to what we want (the true parameter ). This is the magic of statistical learning!
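A tiny simulation of this toy setup (seed and sample sizes are arbitrary) shows the empirical minimizer, which under squared loss is the sample mean, homing in on $\theta^* = 2$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star = 2.0   # Nature's true (but in practice unknown) parameter

for n in (10, 1_000, 100_000):
    x = rng.normal(theta_star, 1.0, size=n)
    # Under squared loss, the empirical risk (1/n) Σ (x_i - θ)² is
    # minimized exactly at the sample mean.
    theta_hat = x.mean()
    print(n, theta_hat)   # estimates tighten around 2.0 as n grows
```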
How This Is EXACTLY What Deep Learning Does
(Cross-entropy, GPT, diffusion, everything)
Let's rewrite the core identity:

$$\frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i) \;\approx\; E_{X \sim P_{\theta^*}}[L(\theta; X)]$$

Statistical learning principle: we want to minimize the true risk. Since we don't know the expectation, we minimize the sample average instead:

$$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i)$$
- This is Empirical Risk Minimization (ERM)
- Every neural network is trained using this
- GPT, ResNet, Diffusion, everything
◆ GPT Training = Your Exact Framework
For GPT:
| Your notation | GPT equivalent |
|---|---|
| $X$ | Token sequences |
| $\theta$ | Network weights |
| $P_{\theta^*}$ | True language distribution |
| $L(\theta; X)$ | Cross-entropy loss |
Loss:

$$L(\theta; X) = -\sum_t \log p_\theta(x_t \mid x_{<t})$$

Empirical training:

$$\hat\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(\theta; X_i)$$

Population objective (never directly accessible):

$$\min_\theta \; E_{X \sim P_{\theta^*}}[L(\theta; X)]$$
- Training GPT = trying to match the true unknown language generator
- With only a finite dataset
◆ Diffusion models, VAEs, GANs = same thing
All minimize:

$$E_{X \sim P_{\theta^*}}[L(\theta; X)] \quad \text{approximated by} \quad \frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i)$$
Only the loss form changes, not the principle.
How This Explains Overfitting (Perfectly)
Now the most important insight:
◆ What you WANT to minimize

$$R(\theta) = E_{X \sim P_{\theta^*}}[L(\theta; X)]$$

◆ What you CAN minimize

$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i)$$

◆ Overfitting happens when:

$$\hat{R}_n(\hat\theta) \text{ is small} \quad \text{but} \quad R(\hat\theta) \text{ is large}$$

Meaning:
- The model memorizes the finite data
- It stops representing the true population
◆ Why this happens geometrically
Your model space grows:
| Model | Risk curve |
|---|---|
| Small | Smooth, stable |
| Huge | Wild oscillations |
With few samples:
- Many parameter values give zero training error
- But only one minimizes true risk
◆ This creates the famous gap $R(\hat\theta) - \hat{R}_n(\hat\theta)$:
| Quantity | Behavior |
|---|---|
| Training loss | Always decreases |
| Test loss | Decreases → then increases |
| This gap | Overfitting |
◆ The Complete Picture: How We Fight Overfitting
We minimize empirical error to approximate population truth, regularize to encode prior beliefs, and stop early to prevent the optimizer from hallucinating structure that does not exist in Nature.
| Technique | What it does | In terms of risk |
|---|---|---|
| ERM | Minimize $\hat{R}_n(\theta)$ | Approximates $\min_\theta R(\theta)$ |
| Regularization | Add penalty (e.g., $\lambda\|\theta\|^2$) | Encodes prior: "simpler more likely" |
| Early stopping | Stop before $\hat{R}_n(\theta) \to 0$ | Prevents memorizing noise |
| Dropout | Random neuron masking | Implicit ensemble averaging |
| Data augmentation | Expand training set | Better approximation of $P_{\theta^*}$ |
Final Unified Truth
We minimize empirical averages to approximate an unobservable population expectation. As the dataset grows, the empirical landscape converges to the true risk landscape, and the minimizer converges to the true parameter.
One-Line Master Equation (ML + Stats + DL Unified)
$$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(\theta; x_i) \;\xrightarrow{\;n \to \infty\;}\; \theta^* = \arg\min_\theta E_{X \sim P_{\theta^*}}[L(\theta; X)]$$

This single equation is:
- Estimation theory
- Machine learning
- Neural network training
- GPT training
- Diffusion training
- Bayesian posterior mode (MAP)
- Risk minimization
- Consistency of M-estimators
Monte Carlo Estimation
In practice, we rarely compute expectations analytically. Instead, we use Monte Carlo estimation: approximate E[g(X)] by averaging samples.
Monte Carlo Estimator

$$E[g(X)] \;\approx\; \frac{1}{n}\sum_{i=1}^{n} g(x_i), \qquad x_i \overset{iid}{\sim} f_X$$
As n → ∞, this converges to the true expectation by the Law of Large Numbers.
Properties of Monte Carlo Estimators
| Property | Value | Interpretation |
|---|---|---|
| Unbiased | E[estimator] = E[g(X)] | No systematic error |
| Variance | Var(g(X))/n | Decreases as 1/n |
| Standard Error | σ/√n | Error decreases as 1/√n |
| 95% CI Width | ≈ 4σ/√n | Need 4× samples to halve width |
Variance Reduction Techniques
The challenge with Monte Carlo is high variance. Modern ML uses several techniques to reduce it:
- Importance Sampling: Sample from a different distribution $q(x)$ and reweight: $E_{p}[g(X)] = E_{q}\!\left[g(X)\,\frac{p(X)}{q(X)}\right]$
  Used in: Off-policy RL, rare event simulation, variational inference
- Control Variates: Subtract a known-mean variable to reduce variance: estimate $E[g(X)]$ via $g(X) - c\,\big(h(X) - E[h(X)]\big)$, which has the same expectation but lower variance for a good choice of $c$
  Used in: Policy gradient baselines, variance reduction in REINFORCE
- Antithetic Variates: Use negatively correlated samples
- Stratified Sampling: Divide the space into strata and sample from each
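As a sketch of the first technique, importance sampling can estimate the rare-event probability $P(X > 4)$ for a standard normal, where naive Monte Carlo almost never sees a hit (the proposal $\mathcal{N}(4, 1)$ and the sample size are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100_000

# Rare event: P(X > 4) for X ~ N(0, 1). Naive MC almost never sees a hit.
naive = (rng.normal(0, 1, n) > 4).mean()

# Importance sampling: draw from q = N(4, 1), reweight by p(x)/q(x)
y = rng.normal(4, 1, n)
w = stats.norm.pdf(y, loc=0, scale=1) / stats.norm.pdf(y, loc=4, scale=1)
is_est = ((y > 4) * w).mean()

print(naive, is_est, stats.norm.sf(4))   # exact answer ≈ 3.17e-05
```

By sampling where the event actually happens and correcting with the density ratio, the estimator keeps the right expectation while its variance drops by orders of magnitude.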
Mini-batch SGD is Monte Carlo
When you train a neural network with mini-batch gradient descent, you're doing Monte Carlo estimation of the expected gradient! Each mini-batch $B$ gives an unbiased estimate:

$$\frac{1}{|B|}\sum_{i \in B} \nabla_\theta L(\theta; x_i) \;\approx\; E_X[\nabla_\theta L(\theta; X)]$$
When to Use Monte Carlo
Use Monte Carlo When:
- Analytical integration is intractable
- High-dimensional integrals (curse of dimensionality)
- Complex, non-standard distributions
- Sampling is cheap but integration is hard
Prefer Analytical When:
- Closed-form solutions exist
- Low-dimensional problems
- Standard distributions with known moments
- Need exact answers (not approximations)
Why E[X] is the Best Predictor
Imagine someone tells you: "You MUST predict a random variable X using only one number. What number should you choose?"
This is a compression problem. Examples:
- Predict tomorrow's temperature with one number
- Predict a random lifetime with one number
- Predict sensor noise level with one number
The answer—proved rigorously—is:

$$a^* = E[X]$$
Mathematical Proof: Expectation Minimizes MSE
We want to choose a number $a$ that best approximates $X$. Meaning, we want:

$$\min_a \; E[(X - a)^2]$$

Step 1: Expand the squared error

$$E[(X - a)^2] = E[X^2] - 2a\,E[X] + a^2$$

Step 2: Take derivative with respect to $a$

$$\frac{d}{da}\, E[(X - a)^2] = -2E[X] + 2a$$

Step 3: Set derivative = 0

$$-2E[X] + 2a = 0 \quad\Longrightarrow\quad a = E[X]$$
Conclusion
Expectation is the number that minimizes error in the L2 (least squares) sense. This is why we call it the best single-number summary of a random variable.
Interactive: MSE Minimization
See for yourself! Drag the slider to find the value that minimizes MSE:
Common Pitfalls and Gotchas
Even experienced practitioners fall into these traps. Understanding these pitfalls will save you from subtle bugs in your ML code:
Summary of Common Mistakes
E[g(X)] ≠ g(E[X]) in general (Jensen's inequality)
E[XY] ≠ E[X]·E[Y] unless X, Y are independent
E[X] may not exist for heavy-tailed distributions (e.g., Cauchy)
Sample mean ≠ E[X] for finite samples (converges only as n→∞)
Preview: Conditional Expectation
One of the most powerful extensions of expectation is conditional expectation. This is so important that it gets its own section, but here's a preview:
Conditional Expectation
$$E[X \mid Y = y] = \sum_x x \, P(X = x \mid Y = y)$$

This is the expected value of $X$ given that we know $Y = y$.
The Tower Property (Law of Total Expectation)
One of the most useful formulas in all of probability:

$$E\big[\,E[X \mid Y]\,\big] = E[X]$$
This says: "The average of the conditional averages equals the overall average."
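The tower property can be verified on a simple two-group mixture (the groups and their parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

# Y picks a group; X depends on the group (a two-component mixture)
y = rng.integers(0, 2, size=n)                # P(Y=0) = P(Y=1) = 1/2
x = np.where(y == 0,
             rng.normal(1.0, 1.0, n),         # E[X | Y=0] = 1
             rng.normal(5.0, 1.0, n))         # E[X | Y=1] = 5

# Average of the conditional averages, weighted by P(Y=y)...
p0 = (y == 0).mean()
tower = p0 * x[y == 0].mean() + (1 - p0) * x[y == 1].mean()

# ...equals the overall average: E[E[X|Y]] = E[X]
print(tower, x.mean())
```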
Where You'll See This in ML
| Application | Conditional Expectation Used |
|---|---|
| Bayesian inference | E[θ | data]—posterior mean as point estimate |
| Reinforcement learning | E[R | s, a]—value function is conditional expectation |
| Variational inference | E_q[log p(x|z)]—expected reconstruction |
| Dropout | E[output | mask]—averaging over random masks |
| Kalman filter | E[state | observations]—optimal state estimate |
Coming Up
Section 3.5 covers conditional expectation in depth, including the law of iterated expectations and its applications in Bayesian statistics.
Tail Expectation and CVaR
In risk-sensitive applications (finance, safety-critical ML), we care not just about the average, but about what happens in the tail—the worst-case scenarios.
Conditional Value at Risk (CVaR)
Also called Expected Shortfall, CVaR answers: "What is the expected value of X given that we're in the worst $(1-\alpha)$ fraction of cases?"

$$\text{CVaR}_\alpha(X) = E\big[X \,\big|\, X \geq \text{VaR}_\alpha(X)\big]$$

where $\text{VaR}_\alpha$ is the $\alpha$-quantile (e.g., the 95th percentile for $\alpha = 0.95$).
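A sample-based CVaR sketch (the Student-t losses are an illustrative stand-in for a heavy-tailed loss distribution):

```python
import numpy as np

rng = np.random.default_rng(7)
losses = rng.standard_t(df=3, size=1_000_000)   # heavy-tailed loss samples

alpha = 0.95
var_alpha = np.quantile(losses, alpha)          # VaR: the 95th percentile
cvar = losses[losses >= var_alpha].mean()       # mean of the worst 5% of cases

print(var_alpha, cvar)                          # CVaR ≥ VaR always
```

CVaR is a conditional expectation over the tail, which is why it is strictly more pessimistic than VaR whenever the tail extends beyond the quantile.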
Applications in ML
- Safe Reinforcement Learning: Optimize for worst-case outcomes, not just average reward
- Robust Optimization: Minimize expected loss in worst α% of scenarios
- Financial ML: Portfolio risk management using CVaR constraints
- Fairness: Ensure good performance for the worst-off groups
Risk-Aware ML: Standard ML minimizes E[Loss]. Risk-aware ML minimizes CVaR[Loss] to protect against tail events. This is crucial for safety-critical applications!
From Discrete to Continuous
Let's build the continuous expectation formula from scratch, starting with the discrete case.
Step 1: Start with Discrete
For discrete $X$ with values $x_i$ and probabilities $p_i$:

$$E[X] = \sum_i x_i \, p_i$$
Step 2: Imagine Points Getting Closer
Now imagine many closely spaced values $x_i$ with spacing $\Delta x$. Rewrite each probability as:

$$p_i \approx f(x_i)\,\Delta x$$

Call $f(x_i) = p_i / \Delta x$ the probability per unit length. Then:

$$E[X] \approx \sum_i x_i\, f(x_i)\,\Delta x$$
Step 3: Take the Limit
As $\Delta x \to 0$, the sum becomes a Riemann integral:

$$E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$$
Key Insight: The density is probability per unit length, just like mass density is mass per unit length. That's why we call it a "density" function!
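The limiting process can be watched directly. This sketch approximates $E[X]$ for an Exponential(1) density with ever-finer bins (the truncation at $x = 30$ is an assumption to make the sum finite; it discards only about $3 \times 10^{-12}$ of the mass):

```python
import numpy as np

# E[X] for X ~ Exponential(1): approximate ∫ x e^{-x} dx with finite bins
for n_bins in (10, 100, 10_000):
    x, dx = np.linspace(0, 30, n_bins, retstep=True)
    p = np.exp(-x) * dx                  # probability mass of each tiny bin
    approx = np.sum(x * p)               # Σ x_i · f(x_i) · Δx
    print(n_bins, approx)                # → 1.0 as the bins shrink
```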
Interactive: Riemann Sum to Integral
Watch how the Riemann sum converges to the integral as we use more bins:
Why Density Integrates to 1
This follows directly from probability conservation using the same bin logic:
- Chop the real line into tiny bins: $[x_i, x_i + \Delta x)$
- Probability of each bin: $P(X \in [x_i, x_i + \Delta x)) \approx f(x_i)\,\Delta x$
- Total probability must be 1: $\sum_i f(x_i)\,\Delta x = 1$
- Take the limit: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Not a Separate Rule
The normalization condition $\int_{-\infty}^{\infty} f(x)\,dx = 1$ is not an arbitrary rule—it's simply saying "total probability of all possible outcomes = 1."
Advanced: Hilbert Space View
For those who want the deepest insight (PhD-level):
Define an inner product between two random variables:

$$\langle X, Y \rangle = E[XY]$$
Then:
- The space $L^2$ of square-integrable random variables becomes a Hilbert space
- Expectation becomes the inner product with the constant 1: $E[X] = \langle X, 1 \rangle$
Geometric Meaning: Expectation is the projection of X onto the constant function 1. This explains why expectation minimizes MSE—it's the orthogonal projection!
This explains:
- Why expectation minimizes MSE
- Why variance is squared distance
- Why "uncorrelated" means "orthogonal"
- Why PCA, Kalman filters, and least squares all work
Final Mental Model
When someone says "Take expectation," your mind should see:
You are averaging all possible outcomes
You weight them by how likely they are
You are extracting the center of the distribution
You are describing average behavior
You are computing what happens in the long run
You find what randomness converges to
Expectation is the bridge between randomness and determinism.
Python Implementation
1"""
2Expected Value: Complete Python Implementation
3===============================================
4This module demonstrates all key concepts of expectation with
5comprehensive examples and visualizations.
6"""
7
8import numpy as np
9import matplotlib.pyplot as plt
10from scipy import stats
11from typing import Callable, Tuple
12
13# Set random seed for reproducibility
14np.random.seed(42)
15
16# =============================================================================
17# 1. DISCRETE EXPECTATION
18# =============================================================================
19# Formula: E[X] = Σ x_i * P(X = x_i)
20# This is simply a weighted average where weights are probabilities
21
22def discrete_expectation(values: np.ndarray, probs: np.ndarray) -> float:
23 """
24 Compute expectation for a discrete random variable.
25
26 Args:
27 values: Array of possible values x_i
28 probs: Array of probabilities P(X = x_i)
29
30 Returns:
31 E[X] = sum of value * probability
32 """
33 assert np.isclose(probs.sum(), 1.0), "Probabilities must sum to 1"
34 return np.sum(values * probs)
35
36# Example: Custom discrete distribution
37values = np.array([1, 2, 3, 4, 5])
38probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
39
40expectation = discrete_expectation(values, probabilities)
41print(f"E[X] (discrete) = {expectation}") # Output: 3.0
42
43# =============================================================================
44# 2. CONTINUOUS EXPECTATION WITH SCIPY
45# =============================================================================
46# Formula: E[X] = ∫ x * f(x) dx
47# scipy.stats distributions have a .mean() method
48
49distributions = {
50 "Uniform(0, 1)": stats.uniform(0, 1),
51 "Exponential(λ=2)": stats.expon(scale=0.5), # scale = 1/λ
52 "Normal(μ=0, σ=1)": stats.norm(0, 1),
53 "Beta(α=2, β=5)": stats.beta(2, 5),
54 "Gamma(α=3, β=2)": stats.gamma(a=3, scale=0.5), # scale = 1/β
55}
56
57print("
58Expectations of common distributions:")
59for name, dist in distributions.items():
60 print(f" E[X] for {name} = {dist.mean():.4f}")
61
62# =============================================================================
63# 3. LAW OF LARGE NUMBERS VISUALIZATION
64# =============================================================================
65# As n → ∞, sample mean → E[X]
66# This is the fundamental connection between theory and practice
67
68def visualize_lln(n_samples: int = 10000) -> None:
69 """Visualize Law of Large Numbers convergence."""
70 samples = np.random.uniform(0, 1, size=n_samples)
71 running_avg = np.cumsum(samples) / np.arange(1, n_samples + 1)
72 true_mean = 0.5
73
74 plt.figure(figsize=(12, 5))
75
76 # Plot 1: Running average convergence
77 plt.subplot(1, 2, 1)
78 plt.plot(running_avg, "b-", alpha=0.7, linewidth=0.8)
79 plt.axhline(y=true_mean, color="r", linestyle="--",
80 label=f"True E[X] = {true_mean}")
81 plt.xlabel("Number of Samples (n)")
82 plt.ylabel("Sample Mean")
83 plt.title("Law of Large Numbers: Convergence to E[X]")
84 plt.legend()
85 plt.xscale("log") # Log scale shows convergence better
86
87 # Plot 2: Error decreases as 1/√n
88 plt.subplot(1, 2, 2)
89 errors = np.abs(running_avg - true_mean)
90 n_values = np.arange(1, n_samples + 1)
91 plt.loglog(n_values, errors, "b-", alpha=0.5, label="Actual error")
92 plt.loglog(n_values, 1/np.sqrt(n_values), "r--",
93 label=r"$1/sqrt{n}$ bound")
94 plt.xlabel("Number of Samples (n)")
95 plt.ylabel("|Sample Mean - E[X]|")
96 plt.title("Convergence Rate: O(1/√n)")
97 plt.legend()
98
99 plt.tight_layout()
100 plt.savefig("lln_convergence.png", dpi=150)
101 plt.show()
102
103# Uncomment to run: visualize_lln()
104
105# =============================================================================
106# 4. LOTUS: LAW OF THE UNCONSCIOUS STATISTICIAN
107# =============================================================================
108# E[g(X)] = ∫ g(x) * f(x) dx (no need to find distribution of g(X))
109
110def monte_carlo_lotus(
111 g: Callable[[np.ndarray], np.ndarray],
112 dist: stats.rv_continuous,
113 n_samples: int = 100000
114) -> Tuple[float, float]:
115 """
116 Estimate E[g(X)] using Monte Carlo (LOTUS in action).
117
118 Args:
119 g: Function to apply to samples
120 dist: scipy distribution to sample from
121 n_samples: Number of Monte Carlo samples
122
123 Returns:
124 (estimate, standard_error)
125 """
126 samples = dist.rvs(size=n_samples)
127 g_samples = g(samples)
128 estimate = np.mean(g_samples)
129 std_error = np.std(g_samples) / np.sqrt(n_samples)
130 return estimate, std_error
131
132# Example: E[X²] for Uniform(0,1) - theoretical value is 1/3
133uniform = stats.uniform(0, 1)
134e_x2, se = monte_carlo_lotus(lambda x: x**2, uniform)
135print(f"
136E[X²] Monte Carlo = {e_x2:.6f} ± {se:.6f}")
137print(f"E[X²] Theoretical = {1/3:.6f}")
138
139# Example: E[log(X)] for Exp(1) - theoretical value is -γ (Euler-Mascheroni)
140exp_dist = stats.expon(scale=1)
141e_log, se = monte_carlo_lotus(lambda x: np.log(x), exp_dist)
142euler_mascheroni = 0.5772156649
143print(f"
144E[log(X)] Monte Carlo = {e_log:.6f} ± {se:.6f}")
145print(f"E[log(X)] Theoretical = {-euler_mascheroni:.6f}")
146
# =============================================================================
# 5. MSE MINIMIZATION PROOF
# =============================================================================
# E[X] uniquely minimizes E[(X - a)²] over all constants a

def visualize_mse_minimization() -> None:
    """Show that E[X] minimizes MSE."""
    # Generate samples from a skewed distribution
    samples = np.random.gamma(shape=2, scale=2, size=10000)
    true_mean = np.mean(samples)

    # Compute MSE for different values of a
    a_values = np.linspace(0, 10, 200)
    mse_values = [np.mean((samples - a)**2) for a in a_values]

    plt.figure(figsize=(10, 6))
    plt.plot(a_values, mse_values, "b-", linewidth=2)
    plt.axvline(x=true_mean, color="r", linestyle="--",
                label=f"E[X] = {true_mean:.2f}")
    plt.scatter([true_mean], [np.mean((samples - true_mean)**2)],
                color="r", s=100, zorder=5)
    plt.xlabel("Prediction value (a)")
    plt.ylabel("Mean Squared Error E[(X - a)²]")
    plt.title("E[X] Minimizes MSE: Proof by Visualization")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig("mse_minimization.png", dpi=150)
    plt.show()

# Uncomment to run: visualize_mse_minimization()

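The reason E[X] wins is the decomposition E[(X − a)²] = Var(X) + (a − E[X])²: the MSE is the irreducible variance plus a squared penalty for missing the mean, so it is minimized exactly at a = E[X]. A quick numerical check of this identity (our addition to the listing):

```python
import numpy as np

# The identity behind the plot: for any constant prediction a,
#   E[(X - a)²] = Var(X) + (a - E[X])²,
# i.e. irreducible variance plus a penalty for missing the mean,
# minimized exactly at a = E[X].
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=2.0, size=100_000)
mean, var = samples.mean(), samples.var()

for a in [0.0, 2.0, mean, 7.5]:
    mse = np.mean((samples - a) ** 2)
    decomposed = var + (a - mean) ** 2
    print(f"a={a:6.3f}  E[(X-a)²]={mse:9.4f}  Var+(a-E[X])²={decomposed:9.4f}")
```

The two columns agree for every a, and the a = mean row attains the smallest value, equal to the variance itself.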
# =============================================================================
# 6. JENSEN'S INEQUALITY DEMONSTRATION
# =============================================================================
# For convex g: E[g(X)] ≥ g(E[X])
# For concave g: E[g(X)] ≤ g(E[X])

def demonstrate_jensen() -> None:
    """Demonstrate Jensen's inequality numerically."""
    samples = np.random.uniform(1, 10, size=100000)
    mean_x = np.mean(samples)

    # Convex function: x²
    e_x_squared = np.mean(samples**2)
    squared_e_x = mean_x**2
    print("\nJensen's Inequality (convex g(x) = x²):")
    print(f"  E[X²] = {e_x_squared:.4f}")
    print(f"  (E[X])² = {squared_e_x:.4f}")
    print(f"  E[X²] ≥ (E[X])²? {e_x_squared >= squared_e_x}")

    # Concave function: log(x)
    e_log_x = np.mean(np.log(samples))
    log_e_x = np.log(mean_x)
    print("\nJensen's Inequality (concave g(x) = log(x)):")
    print(f"  E[log(X)] = {e_log_x:.4f}")
    print(f"  log(E[X]) = {log_e_x:.4f}")
    print(f"  E[log(X)] ≤ log(E[X])? {e_log_x <= log_e_x}")

demonstrate_jensen()

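A useful connection, added here as a sketch of our own: for g(x) = x², the Jensen gap E[g(X)] − g(E[X]) is exactly the variance, so Jensen's inequality for this g is just the statement that variance is non-negative.

```python
import numpy as np

# For g(x) = x², the Jensen gap is exactly the variance:
#   E[X²] - (E[X])² = Var(X) ≥ 0,
# so Jensen's inequality for this g restates that variance is non-negative.
rng = np.random.default_rng(42)
samples = rng.uniform(1, 10, size=200_000)

jensen_gap = np.mean(samples**2) - np.mean(samples) ** 2
variance = np.var(samples)
print(f"Jensen gap = {jensen_gap:.4f}, Var(X) = {variance:.4f}")  # both ≈ 81/12 = 6.75
```

This previews the next section: the variance shortcut formula is nothing but the Jensen gap of the squaring function.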
# =============================================================================
# 7. VARIANCE VIA EXPECTATION
# =============================================================================
# Var(X) = E[X²] - (E[X])² = E[(X - μ)²]

def compute_variance(samples: np.ndarray) -> dict:
    """Compute variance using both formulas to verify equality."""
    mean = np.mean(samples)

    # Method 1: E[(X - μ)²]
    var_method1 = np.mean((samples - mean)**2)

    # Method 2: E[X²] - (E[X])²
    e_x2 = np.mean(samples**2)
    var_method2 = e_x2 - mean**2

    return {
        "E[X]": mean,
        "E[X²]": e_x2,
        "Var (centered)": var_method1,
        "Var (shortcut)": var_method2,
        "Std Dev": np.sqrt(var_method1)
    }

samples = np.random.normal(loc=5, scale=2, size=100000)
stats_dict = compute_variance(samples)
print("\nVariance computation:")
for key, value in stats_dict.items():
    print(f"  {key} = {value:.4f}")

To run the visualizations, uncomment the function calls at the end of each section. The code produces publication-quality plots showing LLN convergence and MSE minimization.
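One caveat worth adding about the shortcut formula: E[X²] − (E[X])² is algebraically exact, but it subtracts two nearly equal numbers whenever |E[X]| dwarfs the spread, and floating-point cancellation can destroy the result. A sketch of our own:

```python
import numpy as np

# Var(X) = E[X²] - (E[X])² is algebraically exact but numerically fragile:
# when |E[X]| is huge relative to the spread, E[X²] and (E[X])² are nearly
# equal giants, and subtracting them cancels away float64 precision.
rng = np.random.default_rng(0)
samples = 1e9 + rng.normal(loc=0.0, scale=1.0, size=100_000)  # true Var = 1

mean = samples.mean()
var_centered = np.mean((samples - mean) ** 2)  # stable, ≈ 1
var_shortcut = np.mean(samples**2) - mean**2   # catastrophic cancellation

print(f"centered : {var_centered:.6f}")
print(f"shortcut : {var_shortcut:.6f}")  # wildly wrong, possibly 0 or negative
```

This is why numerical libraries compute variance from centered deviations (or with streaming algorithms such as Welford's) rather than with the shortcut formula.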
Summary
Key Takeaways
Expectation = long-run average of a random variable
Formula = weighted average: value × probability, summed up
LOTUS: E[g(X)] = ∫g(x)f(x)dx without finding g(X)'s distribution
Linearity makes it incredibly useful for calculations
Jensen's Inequality: E[g(X)] ≥ g(E[X]) for convex g
Law of Large Numbers: sample mean converges to E[X]
Best predictor: E[X] minimizes mean squared error
Pitfall awareness: E[1/X] ≠ 1/E[X], E[XY] ≠ E[X]E[Y] in general
Foundation of ML: training losses are expectations over the data distribution
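The reciprocal pitfall above is easy to verify numerically; here is a small sketch of ours for X ~ Uniform(1, 3), where E[X] = 2 but E[1/X] = ln(3)/2 ≈ 0.549:

```python
import numpy as np

# Pitfall check: E[1/X] ≠ 1/E[X]. For X ~ Uniform(1, 3):
#   1/E[X] = 1/2 = 0.5
#   E[1/X] = ∫₁³ (1/x)(1/2) dx = ln(3)/2 ≈ 0.5493
rng = np.random.default_rng(7)
x = rng.uniform(1, 3, size=1_000_000)

e_inv = np.mean(1 / x)
inv_e = 1 / np.mean(x)
print(f"E[1/X] ≈ {e_inv:.4f},  1/E[X] ≈ {inv_e:.4f}")  # ≈ 0.549 vs ≈ 0.500
```

The gap is another instance of Jensen's inequality: 1/x is convex on (0, ∞), so E[1/X] ≥ 1/E[X].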
One-Sentence Deep Intuition
"Expectation is the unique linear projection that compresses infinite randomness into the deterministic average behavior of the system, by summing all possible values weighted by how often Nature produces them."