Chapter 3

Expected Value - Definition and Properties

Expectation and Moments

Learning Objectives

By the end of this section, you will:

  • Deeply understand what expectation means intuitively
  • See expectation as the center of mass of a distribution
  • Understand why the formula is a weighted average
  • Master LOTUS (Law of the Unconscious Statistician)
  • Know why expectation minimizes mean squared error
  • Apply Jensen's Inequality to ML problems
  • Connect expectation to the Law of Large Numbers
  • Understand why expectation appears everywhere in ML
  • Avoid common pitfalls with expectation
  • Preview conditional expectation and tail risks
  • See how the integral formula arises from discrete sums

Historical Context

The Birth of Expected Value

The concept of expectation was born from gambling! In 1654, the French mathematicians Blaise Pascal and Pierre de Fermat exchanged letters about the "problem of points"—how to fairly divide stakes in an interrupted game of chance.

Christiaan Huygens published the first treatise on probability in 1657, introducing the term "expectatio" (Latin for expectation). He framed it as: "If I have equal chances of getting a or b, my expectation is (a+b)/2."

  • 1654: Pascal-Fermat correspondence
  • 1657: Huygens publishes the first treatise on probability
  • 1713: Bernoulli's Law of Large Numbers
Historical Insight: The term "expected value" originally meant "what you should expect to win" in a fair game. Today it means the long-run average of any random variable.

What is Expectation Intuitively?

Expectation = the long-run average value of a random variable if you could repeat the experiment forever.

Picture it this way: a random variable X produces values—sometimes small, sometimes large, sometimes medium. The expectation is the single number that summarizes where those outcomes concentrate on average.

The Core Insight: Expectation is the "average destination of randomness." Even though randomness produces chaos moment-to-moment, expectation captures where everything gravitates toward in the long run.

Correcting Common Misconceptions

Common Misconception

"Expectation measures how random the values are"

Correct Understanding

Expectation measures where the randomness is centered. Variance measures how random/spread the values are.

Common Misconception

"Expectation is the most likely value"

Correct Understanding

The most likely value is the mode. Expectation is the weighted average of all possible values.


The Formula: Why Sum and Integral?

For Discrete Random Variables

$$\mathbb{E}[X] = \sum_x x \cdot P(X = x)$$

For Continuous Random Variables

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$$


Interpretation: You multiply each possible value by how likely it is. Then you add (or integrate) them up. The result is the weighted average of all possibilities.

The Core Truth

Expectation is just average = value × likelihood. Nothing more mystical. The formula simply weights each outcome by its probability.
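The discrete formula really is just a one-line weighted sum in code. A minimal sketch for a fair six-sided die:

```python
# Expected value of a fair six-sided die:
# weight each face by its probability, then sum.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected = sum(v * p for v, p in zip(values, probs))  # ≈ 3.5
```

The same pattern (value times probability, summed) works for any finite distribution.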

More Generally: Functions of Random Variables

In reality, we often care about functions of X, not just X itself:

$$\mathbb{E}[g(X)] = \int g(x) \cdot f_X(x) \, dx$$

This is necessary because in real systems:

  • Power: $g(X) = X^2$
  • Loss: $g(X) = \ell(X)$
  • Log-likelihood: $g(X) = \log p(X)$

LOTUS: Law of the Unconscious Statistician

One of the most powerful formulas in probability is the Law of the Unconscious Statistician (LOTUS). It lets you compute E[g(X)] without finding the distribution of g(X):

$$\mathbb{E}[g(X)] = \sum_x g(x) \cdot P(X = x) \quad \text{(discrete)}$$

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx \quad \text{(continuous)}$$

Why "Unconscious"?

It's called "unconscious" because students often use it without realizing they're applying a theorem! The formula looks obvious but requires proof. You can compute E[X²] directly from f_X(x) without first finding the distribution of X².

LOTUS in Practice

| Goal | LOTUS Formula | Example |
| --- | --- | --- |
| $\mathbb{E}[X^2]$ | $\int x^2 f_X(x) \, dx$ | Needed for variance: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ |
| $\mathbb{E}[\log X]$ | $\int \log(x) f_X(x) \, dx$ | Entropy, log-likelihood |
| $\mathbb{E}[e^X]$ | $\int e^x f_X(x) \, dx$ | Moment generating function $M(1)$ |
| $\mathbb{E}[(X-\mu)^3]$ | $\int (x-\mu)^3 f_X(x) \, dx$ | Skewness calculation |
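LOTUS is equally direct in code: reuse the pmf of X and plug values through g. A sketch computing E[X²] and the variance for a fair die:

```python
# LOTUS: compute E[g(X)] directly from the pmf of X,
# without ever deriving the distribution of g(X).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

e_x = sum(x * p for x, p in zip(values, probs))        # E[X]   ≈ 3.5
e_x2 = sum(x**2 * p for x, p in zip(values, probs))    # E[X^2] = 91/6
variance = e_x2 - e_x**2                               # 35/12 ≈ 2.92
```

Note that we never needed the pmf of X²; the original probabilities carry all the weight.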



Expectation as Center of Mass

Think of your random variable as mass spread on a number line. The expectation is the point where you could balance the distribution on a needle.


This physics analogy is not just a metaphor—it is mathematically exact! Just as center of mass is the weighted average of positions (weighted by mass), expectation is the weighted average of values (weighted by probability).


Expectations of Common Distributions

Here is a quick reference for the expectations of distributions you'll encounter frequently in ML and statistics:

Discrete Distributions

| Distribution | Notation | $\mathbb{E}[X]$ | Intuition |
| --- | --- | --- | --- |
| Bernoulli | $\text{Bernoulli}(p)$ | $p$ | Probability of success |
| Binomial | $\text{Binomial}(n, p)$ | $np$ | Expected number of successes in $n$ trials |
| Geometric | $\text{Geometric}(p)$ | $\frac{1}{p}$ | Expected trials until first success |
| Poisson | $\text{Poisson}(\lambda)$ | $\lambda$ | Expected count equals rate parameter |
| Uniform (discrete) | $\text{Uniform}\{1,\ldots,n\}$ | $\frac{n+1}{2}$ | Middle of the range |

Continuous Distributions

| Distribution | Notation | $\mathbb{E}[X]$ | Intuition |
| --- | --- | --- | --- |
| Uniform | $\text{Uniform}(a, b)$ | $\frac{a+b}{2}$ | Midpoint of interval |
| Exponential | $\text{Exp}(\lambda)$ | $\frac{1}{\lambda}$ | Inverse of rate = mean waiting time |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | $\mu$ | Mean parameter directly gives expectation |
| Gamma | $\text{Gamma}(\alpha, \beta)$ | $\frac{\alpha}{\beta}$ | Shape over rate |
| Beta | $\text{Beta}(\alpha, \beta)$ | $\frac{\alpha}{\alpha + \beta}$ | Proportion of weight on $\alpha$ |
| Chi-squared | $\chi^2(k)$ | $k$ | Degrees of freedom |
| Log-normal | $\text{LogN}(\mu, \sigma^2)$ | $e^{\mu + \sigma^2/2}$ | Note: NOT $e^\mu$! |

Log-normal Trap

For $X \sim \text{LogN}(\mu, \sigma^2)$, $\mathbb{E}[X] = e^{\mu + \sigma^2/2} \neq e^\mu$. This is a consequence of Jensen's inequality, since $\exp$ is strictly convex: $\mathbb{E}[e^Y] > e^{\mathbb{E}[Y]}$ whenever $Y$ has any spread.
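The trap is easy to verify numerically. A sketch assuming μ = 0 and σ = 1 (illustrative values), comparing the naive guess e^μ with the correct mean and a sampled estimate:

```python
import math
import random

mu, sigma = 0.0, 1.0
naive = math.exp(mu)                    # wrong guess: e^mu = 1
correct = math.exp(mu + sigma**2 / 2)   # true mean, ≈ 1.649

# Sample X = e^Y with Y ~ N(mu, sigma^2); the sample mean tracks
# `correct`, not `naive`.
random.seed(0)
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]
sample_mean = sum(samples) / len(samples)
```

With these values the gap is large: the naive answer underestimates the true mean by roughly 40%.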


Why Statisticians Love Expectation

Expectation has magical properties that make it the foundation of all statistical analysis:

1. It Compresses the Whole Distribution into One Stable Number

Even if the distribution is complicated, expectation gives a stable center that summarizes the "typical" behavior.

2. It is LINEAR (This is HUGE!)

$$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$

No other summary behaves this nicely, and linearity holds even when X and Y are dependent! This makes derivations, proofs, estimators, and ML algorithms beautifully simple.

Linearity is Power

The linearity of expectation is used everywhere: in gradient descent, Bayesian inference, signal processing, and control theory. When you see a sum of random variables, you can immediately split the expectation!

3. It Connects to Reality Through the Law of Large Numbers

$$\frac{1}{n}\sum_{i=1}^{n} X_i \to \mathbb{E}[X] \quad \text{as } n \to \infty$$

This means expectation is not an imaginary math object. It is literally what you observe in the real world if you take enough samples!
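You can watch this convergence in a few lines. A sketch averaging fair-die rolls (the seed and sample size are arbitrary choices):

```python
import random

# Law of Large Numbers: the running average of die rolls
# drifts toward the true expectation E[X] = 3.5.
random.seed(42)
n = 100_000
total = 0.0
for _ in range(n):
    total += random.randint(1, 6)   # one fair-die roll

running_avg = total / n             # ≈ 3.5 for large n
```

Rerunning with small n (say 10) shows how noisy the average is before the law kicks in.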


Moment Generating Functions

The Moment Generating Function (MGF) is a powerful tool that encodes all moments of a distribution in a single function. It's defined as:

$$M_X(t) = \mathbb{E}[e^{tX}]$$

The name comes from the remarkable property: derivatives of the MGF give moments.

Why "Moment Generating"?

Expand $e^{tX}$ as a Taylor series:

$$e^{tX} = 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots$$

Taking expectation term by term:

$$M_X(t) = 1 + t\mathbb{E}[X] + \frac{t^2}{2!}\mathbb{E}[X^2] + \frac{t^3}{3!}\mathbb{E}[X^3] + \cdots$$

The Key Result

The n-th derivative of $M_X(t)$ evaluated at $t = 0$ gives the n-th moment:

$$M_X^{(n)}(0) = \mathbb{E}[X^n]$$

MGFs of Common Distributions

| Distribution | $M_X(t)$ | Domain |
| --- | --- | --- |
| $\text{Bernoulli}(p)$ | $1 - p + pe^t$ | $\forall t$ |
| $\text{Binomial}(n, p)$ | $(1 - p + pe^t)^n$ | $\forall t$ |
| $\text{Poisson}(\lambda)$ | $\exp\bigl(\lambda(e^t - 1)\bigr)$ | $\forall t$ |
| $\text{Exponential}(\lambda)$ | $\frac{\lambda}{\lambda - t}$ | $t < \lambda$ |
| $\mathcal{N}(\mu, \sigma^2)$ | $\exp\bigl(\mu t + \frac{\sigma^2 t^2}{2}\bigr)$ | $\forall t$ |
| $\text{Gamma}(\alpha, \beta)$ | $\bigl(1 - \frac{t}{\beta}\bigr)^{-\alpha}$ | $t < \beta$ |

Why MGFs Matter in ML

  • Uniqueness: If two distributions have the same MGF (finite on a neighborhood of zero), they're identical. Useful for proving distributional results.
  • Sum of independent RVs: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ — products are easier than convolutions!
  • Central Limit Theorem: The standard CLT proof uses MGF (or characteristic function) convergence.
  • Concentration bounds: Chernoff bounds use $P(X > a) \leq \inf_{t > 0} e^{-ta} M_X(t)$

Characteristic Functions

When the MGF doesn't exist (heavy tails), use the characteristic function: $\phi_X(t) = \mathbb{E}[e^{itX}]$. It always exists and has similar properties. The Fourier transform connection makes it fundamental in signal processing.


Jensen's Inequality

Jensen's Inequality is one of the most important results connecting expectation with function transformations. It tells us exactly when E[g(X)] differs from g(E[X]).

Understanding the Two Quantities

$\mathbb{E}[g(X)]$

"Average of the transformed values"

First apply function g to each possible value of X, then take the average. You transform first, average second.

Example: If X can be 1, 2, or 3 and g(x) = x², compute 1², 2², 3² first, then average those squares.
$g(\mathbb{E}[X])$

"Transformation of the average value"

First find the average of X, then apply function g to that single average. You average first, transform second.

Example: If X can be 1, 2, or 3, compute the average (which is 2), then square it: 2² = 4.

🔑 Key Question: Does the order matter? Yes! Jensen's inequality tells us exactly how.

Jensen's Inequality

For a convex function $g$ (one that curves upward, like $x^2$ or $e^x$):

$$\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$$

📖 Intuitive Meaning:

"The average of squares is always greater than or equal to the square of the average."

When you apply a convex function to random values and then average, you get a larger result than if you first averaged and then applied the function. Convex functions "amplify" spread—the more variable your data, the bigger the gap.

Real-world analogy: Your average daily income squared is LESS than the average of your daily incomes squared. High-earning days contribute disproportionately when you square first.

For a concave function $g$ (one that curves downward, like $\log x$ or $\sqrt{x}$):

$$\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])$$

📖 Intuitive Meaning:

"The average of logarithms is always less than or equal to the logarithm of the average."

When you apply a concave function to random values and then average, you get a smaller result than if you first averaged and then applied the function. Concave functions "compress" spread—they penalize variability.

Real-world analogy: The average satisfaction from variable-quality meals is LESS than the satisfaction from consistently average meals. (Diminishing returns from good meals, but bad meals hurt a lot.)

When does equality hold?

$\mathbb{E}[g(X)] = g(\mathbb{E}[X])$ happens in two cases:

  • No randomness: X is a constant (no spread at all)
  • Linear function: g(x) = ax + b (neither strictly convex nor strictly concave)

Consistent with the linear case, expectation itself is linear: $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$ always holds!

Why Jensen's Inequality Matters in ML

| Application | Function | Convexity | Consequence |
| --- | --- | --- | --- |
| ELBO (VAEs) | $\log(x)$ | Concave | $\mathbb{E}[\log p] \leq \log \mathbb{E}[p]$ → maximize lower bound |
| Cross-entropy | $-\log(x)$ | Convex | Nice optimization landscape |
| Bias in estimators | $\frac{1}{x}$ | Convex | $\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]$ |
| Sample variance | $x^2$ | Convex | $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$ |
| KL divergence | $x \log(x)$ | Convex | $D_{KL} \geq 0$ always |

Geometric Intuition

For a convex function, the curve lies below any chord (the line segment connecting two points on the curve). This means:

  • Points on the curve: $g(x_1), g(x_2), \ldots$
  • Their weighted average, $\mathbb{E}[g(X)]$, lies on the chord (or in the convex hull of the curve points)
  • The value at the average input, $g(\mathbb{E}[X])$, lies on the curve itself
  • Since the chord sits above the curve, $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$
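The inequality is easy to check numerically on the three-point example from earlier, with the convex choice g(x) = x²:

```python
# Jensen's inequality for g(x) = x^2 on X uniform over {1, 2, 3}.
values = [1, 2, 3]
probs = [1 / 3, 1 / 3, 1 / 3]

e_x = sum(x * p for x, p in zip(values, probs))            # E[X] = 2
g_of_mean = e_x**2                                          # g(E[X]) = 4
mean_of_g = sum(x**2 * p for x, p in zip(values, probs))    # E[g(X)] = 14/3

gap = mean_of_g - g_of_mean   # for g(x) = x^2 the gap is exactly Var(X)
```

For the square function the Jensen gap is exactly the variance, which is why it is zero only when X is constant.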



Law of Large Numbers in Action

As you take more samples, the sample average settles ever closer to the true expectation.

Why This Matters

Every machine learning algorithm uses expectation implicitly. When you train a model, you are approximating the expected loss, and the Law of Large Numbers guarantees that the empirical training loss converges to the true risk as the dataset grows.


Physical and Engineering Meaning

What does expectation mean in real-world engineering applications?

| If X represents... | Expectation means... |
| --- | --- |
| Voltage | Average voltage level |
| Noise | Bias in the noise (DC component) |
| Component lifetime | Expected lifetime (MTTF) |
| Daily stock return | Average daily gain/loss |
| Model prediction error | True risk (expected loss) |
| Sensor reading | True underlying value |
| Queue waiting time | Average wait time |

Engineering Insight: Engineers LOVE expectation because we design for average energy, average power, expected error. It gives us a single number to optimize against.

What Information Does It Give?

Expectation answers one fundamental question:

"If randomness continues forever, what do I typically see?"

Expectation also allows us to define other key quantities:

  • Variance: $\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$
  • Covariance: $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$
  • Risk in ML: $R(\theta) = \mathbb{E}[\text{Loss}(X, \theta)]$
  • KL Divergence: $D_{KL}(p \| q) = \mathbb{E}_p\left[\log\frac{p}{q}\right]$

Expectation is the foundation of all statistical learning.


Expectation in Machine Learning

In ML, we always optimize:

$$\theta^* = \arg\min_\theta \mathbb{E}_{x,y}[\text{Loss}(f_\theta(x), y)]$$

Why expectation? Because:

  1. You train a model to minimize the expected loss
  2. You never know the real future inputs
  3. But expectation gives their "average behavior"

Every ML Algorithm Uses Expectation

Your network's gradient is literally:

$$\nabla_\theta \mathbb{E}[\text{Loss}] = \mathbb{E}[\nabla_\theta \text{Loss}]$$

This interchange (linearity!) is why gradient descent works. SGD is just Monte Carlo approximation of this expectation.

Comprehensive ML Applications

| Algorithm/Concept | How Expectation Appears | Formula |
| --- | --- | --- |
| Cross-Entropy Loss | Expected negative log-likelihood | E[−log p(y\|x)] |
| Policy Gradient (RL) | Expected reward under policy | E_π[R·∇log π] |
| Dropout | Ensemble averaging at test time | E[f(x; mask)] |
| Batch Normalization | Normalize using E[x] and Var(x) | (x − E[x])/√Var(x) |
| VAE ELBO | Expected reconstruction + KL | E_q[log p(x\|z)] − KL |
| Attention Weights | Weighted average of values | E[V \| Q,K] = softmax(QKᵀ)V |
| Monte Carlo Tree Search | Expected value of game state | E[reward \| state, action] |
| Bayesian Neural Nets | Predictive uncertainty | E[f(x) \| data] |

The Reparameterization Trick

In VAEs, we need gradients through expectations. The reparameterization trick rewrites:

$$\mathbb{E}_{z \sim q_\phi(z|x)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[f(\mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon)]$$

Now the expectation is over ε which doesn't depend on φ, so we can backpropagate through μ and σ!
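A minimal numerical sketch of the trick, with the toy objective f(z) = z² standing in for the VAE loss (an illustrative choice): after reparameterizing, the gradient passes inside the expectation and is estimated by plain averaging.

```python
import random

def grad_estimate(mu, sigma, n=100_000):
    """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu,
    via the reparameterization z = mu + sigma * eps, eps ~ N(0, 1).
    For f(z) = z^2, d/dmu f(mu + sigma*eps) = 2 * (mu + sigma*eps)."""
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        total += 2.0 * (mu + sigma * eps)
    return total / n

random.seed(0)
# Analytically E[z^2] = mu^2 + sigma^2, so the true gradient is 2*mu = 2.0.
g = grad_estimate(mu=1.0, sigma=0.5)
```

Frameworks like PyTorch do exactly this averaging automatically when you sample ε and build z = μ + σ·ε inside the computation graph.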


Population vs Sample World

One of the most important distinctions in statistics and machine learning is between the population world and the sample world. Understanding this distinction is key to understanding why expectation matters so deeply.

Population World (True, Infinite, Theoretical)

  • The true data-generating process
  • Infinite possible observations
  • Governed by the unknown parameter $\theta_0$
  • We can never fully observe it

Sample World (Finite, Observed, Practical)

  • The data we actually collect
  • Finite $n$ observations
  • Used to estimate $\theta_0$
  • All we have access to in practice

What "Expectation under $\theta_0$" ACTUALLY Means

When we write:

$$\mathbb{E}_{\theta_0}[\rho(X, \theta)]$$

we are doing this thought experiment:

"Imagine the universe is truly generating data using the true but unknown parameter $\theta_0$. If we could repeatedly collect infinite datasets from that universe, and for each dataset compute the loss using our guess $\theta$, what would be the long-run average loss?"

So:

| Symbol | Meaning |
| --- | --- |
| $X$ | Random data generated from the true world |
| $\theta_0$ | True data-generating parameter |
| $\theta$ | Your trial / guess |
| $\rho(X, \theta)$ | Error of your guess on data |
| $\mathbb{E}_{\theta_0}$ | Average over the true world |

The Risk Function

This expectation has a special name—it's called the risk function:

$$R(\theta) = \mathbb{E}_{\theta_0}[\rho(X, \theta)]$$

It is a population-level truth curve over all possible data. The risk function tells us: "For any guess $\theta$, what is the true expected error?"

True Risk vs Empirical Risk

| Quantity | Formula | Meaning |
| --- | --- | --- |
| True Risk | $R(\theta) = \mathbb{E}_{\theta_0}[\rho(X, \theta)]$ | Infinite-world average |
| Empirical Risk | $\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \rho(X_i, \theta)$ | Finite-sample average |

So:

  • True risk = infinite-world average (what we want)
  • Empirical risk = finite-sample average (what we can compute)
  • We minimize empirical risk because that's all we have
  • Empirical risk converges to true risk by LLN
  • Therefore the minimizer converges to $\theta_0$

The Fundamental Theorem of Statistical Learning

Here is the mathematically precise version of this idea:

$$\hat{\theta}_n = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, \theta)$$

Then, under standard regularity conditions:

$$\hat{\theta}_n \xrightarrow{\;n \to \infty\;} \theta_0$$

This is Consistency

This is exactly consistency of estimators. This is exactly how MLE, least squares, and Empirical Risk Minimization (ERM) work!

Summary: Two Worlds, One Bridge

| World | What happens |
| --- | --- |
| True world | $\theta_0$ generates infinite data |
| Risk function | Measures theoretical error of any guess |
| Sample world | You only see $X_1, \ldots, X_n$ |
| Training | You minimize the empirical average |
| As $n \to \infty$ | Empirical ≈ Population |

Deep Intuition in One Sentence

Expectation under $\theta_0$ means: "How wrong would my guess $\theta$ be on average if Nature keeps generating data using the true parameter forever?"

What Does argmin\arg\min Mean?

Before we dive into examples, let's clarify a notation you'll see everywhere in ML:

$$\arg\min_\theta f(\theta)$$

It means:

"Choose the value of $\theta$ for which the function $f(\theta)$ becomes as small as possible."

Very important distinction:

  • min → gives you the minimum value of the function
  • arg min → gives you the argument (input) that achieves that minimum

Tiny Numerical Example (Concrete)

Suppose:

$$f(\theta) = (\theta - 2)^2$$

Let's test values:

| $\theta$ | $f(\theta)$ |
| --- | --- |
| 0 | 4 |
| 1 | 1 |
| 2 | 0 ← minimum value |
| 3 | 1 |
| 4 | 4 |

  • The minimum value is $\min_\theta f(\theta) = 0$
  • The $\theta$ that gives this minimum is $\arg\min_\theta f(\theta) = 2$

So:

$$\arg\min_\theta (\theta - 2)^2 = 2$$
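The min vs. arg min distinction is easy to demonstrate on the same function with a grid of candidates:

```python
# min vs arg min for f(theta) = (theta - 2)^2 over a few candidates.
candidates = [0, 1, 2, 3, 4]

def f(theta):
    return (theta - 2) ** 2

min_value = min(f(t) for t in candidates)   # 0: the smallest function VALUE
arg_min = min(candidates, key=f)            # 2: the INPUT that achieves it
```

`min(..., key=f)` is Python's built-in arg min over a finite set; continuous problems replace this grid search with calculus or gradient descent.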

In the Context of Machine Learning / Statistics

When you see:

$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, \theta)$$

It literally means:

"Choose the parameter $\theta$ that makes the average loss on the data as small as possible."

This is:

  • Parameter estimation
  • Model training
  • Learning
  • Optimization
  • Fitting the model to data

All the same thing.

In Bayesian Form (MAP)

When you see:

$$\arg\max_\theta p(\theta \mid X)$$

That means:

"Choose the value of $\theta$ that is most probable after seeing the data."

And since:

$$\arg\max_\theta p(\theta \mid X) = \arg\min_\theta \bigl[-\log p(X \mid \theta) - \log p(\theta)\bigr]$$

You again get:

"Choose $\theta$ that minimizes (data loss + regularization)."

In Deep Learning (GPT, Diffusion, etc.)

When training a network:

$$\theta^* = \arg\min_\theta \text{CrossEntropyLoss}(\theta)$$

Means:

"Adjust the weights so that prediction error becomes as small as possible."

Backprop + SGD are just numerical machines that search for this arg min.

"Arg min means: return the input value that makes the function as small as it can possibly be."

A Fully Numerical Toy Example

(See empirical risk → true risk → $\theta_0$)

We'll use the simplest possible model so everything is visible.

True data-generating world (unknown to us)

Assume Nature uses a Normal distribution:

$$X \sim \mathcal{N}(\theta_0, 1), \quad \text{with} \quad \theta_0 = 2$$

We don't know that 2 is the truth. We only see samples.

Loss function (your $\rho$)

Use squared error:

$$\rho(x, \theta) = (x - \theta)^2$$

Step 1: The True Risk Function

$$R(\theta) = \mathbb{E}_{\theta_0}[(X - \theta)^2]$$

Expanding the square (the bias-variance decomposition, valid for any finite-variance distribution):

$$R(\theta) = (\theta - \theta_0)^2 + \underbrace{\text{Var}(X)}_{=1}$$

So:

$$R(\theta) = (\theta - 2)^2 + 1$$

  • This is a perfect upward parabola
  • It is minimized exactly at $\theta = 2$
  • This is a population-level truth curve

Step 2: What you actually observe (finite data)

Say you observe:

$$X_1 = 1.4, \quad X_2 = 2.3, \quad X_3 = 2.0$$

Your empirical risk:

$$\hat{R}_3(\theta) = \frac{1}{3} \sum_{i=1}^{3} (X_i - \theta)^2$$

Try some values:

Try $\theta = 1$

$$(1.4-1)^2 = 0.16, \quad (2.3-1)^2 = 1.69, \quad (2.0-1)^2 = 1$$

$$\hat{R}_3(1) = \frac{2.85}{3} = 0.95$$

Try $\theta = 2$

$$(1.4-2)^2 = 0.36, \quad (2.3-2)^2 = 0.09, \quad (2.0-2)^2 = 0$$

$$\hat{R}_3(2) = \frac{0.45}{3} = 0.15 \quad \checkmark$$

Try $\theta = 3$

$$(1.4-3)^2 = 2.56, \quad (2.3-3)^2 = 0.49, \quad (2.0-3)^2 = 1$$

$$\hat{R}_3(3) = \frac{4.05}{3} = 1.35$$

The minimum occurs near 2, the true $\theta_0$. (The exact empirical minimizer is the sample mean, $\bar{X} = 1.9$.)
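The arithmetic above can be reproduced in a few lines:

```python
# Empirical risk for the three observations in the worked example.
data = [1.4, 2.3, 2.0]

def empirical_risk(theta):
    return sum((x - theta) ** 2 for x in data) / len(data)

r1 = empirical_risk(1)   # ≈ 0.95
r2 = empirical_risk(2)   # ≈ 0.15  <- smallest of the three
r3 = empirical_risk(3)   # ≈ 1.35

best = min([1, 2, 3], key=empirical_risk)   # theta = 2 wins on this grid
```

Replacing the three-point grid with a fine sweep (or taking a derivative) lands on the sample mean 1.9, the exact minimizer.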

Step 3: What happens as nn \to \infty

$$\hat{R}_n(\theta) \xrightarrow{\;n \to \infty\;} R(\theta)$$

and

$$\hat{\theta}_n = \arg\min_\theta \hat{R}_n(\theta) \xrightarrow{\;n \to \infty\;} \theta_0$$

The Punchline

By minimizing what we can compute (empirical risk on finite data), we get closer and closer to what we want (the true parameter $\theta_0$). This is the magic of statistical learning!

How This Is EXACTLY What Deep Learning Does

(Cross-entropy, GPT, diffusion, everything)

Let's rewrite the core identity:

Statistical learning principle:

$$\theta^* = \arg\min_\theta \mathbb{E}_{\theta_0}[\rho(X, \theta)]$$

Since we don't know the expectation:

$$\theta^* \approx \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, \theta)$$

  • This is Empirical Risk Minimization (ERM)
  • Every neural network is trained using this
  • GPT, ResNet, Diffusion, everything

GPT Training = Your Exact Framework

For GPT:

| Notation here | GPT equivalent |
| --- | --- |
| $X$ | Token sequences |
| $\theta$ | Network weights |
| $\theta_0$ | True language distribution |
| $\rho$ | Cross-entropy loss |

Loss:

$$\rho(x, \theta) = -\log p_\theta(x)$$

Empirical training:

$$\hat{R}_n(\theta) = \frac{1}{n} \sum_i -\log p_\theta(x_i)$$

Population objective (never directly accessible):

$$R(\theta) = \mathbb{E}_{\text{true language}}[-\log p_\theta(X)]$$

  • Training GPT = trying to match the true unknown language generator $\theta_0$
  • With only a finite dataset

Diffusion models, VAEs, GANs = same thing

All minimize:

$$\frac{1}{n} \sum_i \rho(X_i, \theta) \approx \mathbb{E}_{\theta_0}[\rho(X, \theta)]$$

Only the loss form changes, not the principle.

How This Explains Overfitting (Perfectly)

Now the most important insight:

What you WANT to minimize

$$R(\theta) = \mathbb{E}_{\theta_0}[\rho(X, \theta)] \quad \text{(true future error)}$$

What you CAN minimize

$$\hat{R}_n(\theta) = \frac{1}{n} \sum_i \rho(X_i, \theta) \quad \text{(training error)}$$

Overfitting happens when:

$$\hat{R}_n(\theta) \downarrow \quad \text{but} \quad R(\theta) \uparrow$$

Meaning:

  • The model memorizes the finite data
  • It stops representing the true population

Why this happens geometrically

Your model space grows:

| Model | Risk curve |
| --- | --- |
| Small | Smooth, stable |
| Huge | Wild oscillations |

With few samples:

  • Many parameter values give zero training error
  • But only one minimizes true risk

This creates the famous gap:

| Quantity | Behavior |
| --- | --- |
| Training loss | Always decreases |
| Test loss | Decreases, then increases |
| This gap | Overfitting |

The Complete Picture: How We Fight Overfitting

We minimize empirical error to approximate population truth, regularize to encode prior beliefs, and stop early to prevent the optimizer from hallucinating structure that does not exist in Nature.

| Technique | What it does | In terms of risk |
| --- | --- | --- |
| ERM | Minimize $\hat{R}_n(\theta)$ | Approximates $R(\theta)$ |
| Regularization | Add penalty $\lambda \lVert\theta\rVert^2$ | Encodes prior: "simpler $\theta$ more likely" |
| Early stopping | Stop before $\hat{R}_n \to 0$ | Prevents memorizing noise |
| Dropout | Random neuron masking | Implicit ensemble averaging |
| Data augmentation | Expand training set | Better approximation of $\mathbb{E}_{\theta_0}$ |

Final Unified Truth

We minimize empirical averages to approximate an unobservable population expectation. As the dataset grows, the empirical landscape converges to the true risk landscape, and the minimizer converges to the true parameter.

One-Line Master Equation (ML + Stats + DL Unified)

$$\arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, \theta) \xrightarrow{\;n \to \infty\;} \arg\min_\theta \mathbb{E}_{\theta_0}[\rho(X, \theta)]$$

This single equation is:

  • Estimation theory
  • Machine learning
  • Neural network training
  • GPT training
  • Diffusion training
  • Bayesian posterior mode (MAP)
  • Risk minimization
  • Consistency of M-estimators

Monte Carlo Estimation

In practice, we rarely compute expectations analytically. Instead, we use Monte Carlo estimation: approximate E[g(X)] by averaging samples.

Monte Carlo Estimator

$$\mathbb{E}[g(X)] \approx \frac{1}{n}\sum_{i=1}^{n} g(X_i), \quad X_i \overset{\text{iid}}{\sim} p(x)$$

As n → ∞, this converges to the true expectation by the Law of Large Numbers.

Properties of Monte Carlo Estimators

| Property | Value | Interpretation |
| --- | --- | --- |
| Unbiased | E[estimator] = E[g(X)] | No systematic error |
| Variance | Var(g(X))/n | Decreases as 1/n |
| Standard Error | σ/√n | Error decreases as 1/√n |
| 95% CI Width | ≈ 4σ/√n | Need 4× samples to halve width |
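These properties are visible in a few lines. A sketch estimating E[X²] for X ~ Uniform(0, 1) (true value 1/3), along with its standard error:

```python
import math
import random

# Monte Carlo estimate of E[g(X)] with g(x) = x^2, X ~ Uniform(0, 1).
random.seed(1)
n = 100_000
evals = [random.random() ** 2 for _ in range(n)]

estimate = sum(evals) / n                                # ≈ 1/3
var_hat = sum((v - estimate) ** 2 for v in evals) / (n - 1)
std_error = math.sqrt(var_hat / n)                       # shrinks like 1/sqrt(n)
```

Quadrupling n halves `std_error`, exactly the 1/√n scaling in the table.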

Variance Reduction Techniques

The challenge with Monte Carlo is high variance. Modern ML uses several techniques to reduce it:

  1. Importance Sampling: Sample from a different distribution q(x) and reweight:

    $$\mathbb{E}_p[g(X)] = \mathbb{E}_q\left[g(X) \cdot \frac{p(X)}{q(X)}\right]$$

    Used in: Off-policy RL, rare event simulation, variational inference
  2. Control Variates: Subtract a known-mean variable to reduce variance:

    $$\hat{\mu} = \frac{1}{n}\sum_i \bigl[g(X_i) - c\,(h(X_i) - \mathbb{E}[h(X)])\bigr]$$

    Used in: Policy gradient baselines, variance reduction in REINFORCE
  3. Antithetic Variates: Use negatively correlated samples
  4. Stratified Sampling: Divide the space into strata and sample from each
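Importance sampling (item 1) can be sketched on a rare-event toy problem: estimating P(X > 3) for X ~ N(0, 1) using the shifted proposal N(3, 1). Both the target and the proposal are illustrative choices.

```python
import math
import random

def normal_pdf(x, mu=0.0):
    """Standard-normal density, optionally shifted to mean mu."""
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

random.seed(7)
n = 50_000
total = 0.0
for _ in range(n):
    # Sample from the proposal q = N(3, 1), which puts mass where
    # the rare event {x > 3} actually happens...
    x = random.gauss(3.0, 1.0)
    if x > 3:
        # ...then reweight by p(x)/q(x) to correct for the wrong distribution.
        total += normal_pdf(x) / normal_pdf(x, mu=3.0)

estimate = total / n   # ≈ P(X > 3) ≈ 0.00135 under N(0, 1)
```

Naive sampling from N(0, 1) would see the event only about once per 750 draws; the shifted proposal sees it about half the time, slashing the variance.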

Mini-batch SGD is Monte Carlo

When you train a neural network with mini-batch gradient descent, you're doing Monte Carlo estimation of the expected gradient! Each mini-batch gives an unbiased estimate: $\nabla_\theta \mathbb{E}[\text{Loss}] \approx \frac{1}{B}\sum_{i=1}^{B} \nabla_\theta \text{Loss}(x_i)$

When to Use Monte Carlo

Use Monte Carlo When:

  • Analytical integration is intractable
  • High-dimensional integrals (curse of dimensionality)
  • Complex, non-standard distributions
  • Sampling is cheap but integration is hard

Prefer Analytical When:

  • Closed-form solutions exist
  • Low-dimensional problems
  • Standard distributions with known moments
  • Need exact answers (not approximations)

Why E[X] is the Best Predictor

Imagine someone tells you: "You MUST predict a random variable X using only one number. What number should you choose?"

This is a compression problem. Examples:

  • Predict tomorrow's temperature with one number
  • Predict a random lifetime with one number
  • Predict sensor noise level with one number

The answer—proved rigorously—is:

$$\boxed{\mathbb{E}[X]}$$

Mathematical Proof: Expectation Minimizes MSE

We want to choose a single number $a$ that best approximates X:

$$\text{Pick } a \text{ to minimize } \mathbb{E}[(X - a)^2]$$

Step 1: Expand the squared error

$$\mathbb{E}[(X-a)^2] = \mathbb{E}[X^2] - 2a\mathbb{E}[X] + a^2$$

Step 2: Take the derivative with respect to $a$

$$\frac{d}{da}\,\mathbb{E}[(X-a)^2] = -2\mathbb{E}[X] + 2a$$

Step 3: Set the derivative to 0 and solve

$$a^* = \mathbb{E}[X]$$

Conclusion

Expectation is the number that minimizes error in the L2 (least squares) sense. This is why we call it the best single-number summary of a random variable.
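A quick numerical check that the mean wins: scan candidate predictions over a grid and compare mean squared errors (the data values are made up for illustration).

```python
# The sample mean minimizes mean squared error over any grid of candidates.
data = [1.0, 2.0, 2.5, 4.0, 5.5]
mean = sum(data) / len(data)            # 3.0

def mse(a):
    return sum((x - a) ** 2 for x in data) / len(data)

candidates = [i / 100 for i in range(0, 601)]   # grid over [0, 6]
best = min(candidates, key=mse)                 # lands exactly on the mean
```

Swapping the squared error for absolute error makes the median win instead, which is why the choice of loss decides the "best" summary.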




Common Pitfalls and Gotchas

Even experienced practitioners fall into these traps. Understanding these pitfalls will save you from subtle bugs in your ML code:


Summary of Common Mistakes

E[g(X)] ≠ g(E[X]) in general (Jensen's inequality)

E[XY] ≠ E[X]·E[Y] unless X, Y are independent

E[X] may not exist for heavy-tailed distributions (e.g., Cauchy)

Sample mean ≠ E[X] for finite samples (converges only as n→∞)


Preview: Conditional Expectation

One of the most powerful extensions of expectation is conditional expectation. This is so important that it gets its own section, but here's a preview:

Conditional Expectation

$$\mathbb{E}[X \mid Y = y] = \int x \cdot f_{X|Y}(x \mid y) \, dx$$

This is the expected value of X given that we know Y = y.

The Tower Property (Law of Total Expectation)

One of the most useful formulas in all of probability:

\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big]

This says: "The average of the conditional averages equals the overall average."
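The tower property can be verified on a simple simulated mixture (a hypothetical two-group setup; the group means 1 and 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Y selects a group; X is drawn with a group-dependent mean
y = rng.integers(0, 2, size=n)              # P(Y=0) = P(Y=1) = 0.5
x = np.where(y == 0,
             rng.normal(1.0, 1.0, size=n),  # E[X | Y=0] = 1
             rng.normal(5.0, 1.0, size=n))  # E[X | Y=1] = 5

# Left side: the overall average E[X]
e_x = x.mean()

# Right side: average the conditional means, weighted by P(Y=y)
cond_means = np.array([x[y == 0].mean(), x[y == 1].mean()])
weights = np.array([(y == 0).mean(), (y == 1).mean()])
e_of_cond = weights @ cond_means

print(e_x, e_of_cond)  # equal (up to floating point), both near 3.0
```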

Where You'll See This in ML

  • Bayesian inference: E[θ | data]—posterior mean as point estimate
  • Reinforcement learning: E[R | s, a]—value function is a conditional expectation
  • Variational inference: E_q[log p(x|z)]—expected reconstruction
  • Dropout: E[output | mask]—averaging over random masks
  • Kalman filter: E[state | observations]—optimal state estimate

Coming Up

Section 3.5 covers conditional expectation in depth, including the law of iterated expectations and its applications in Bayesian statistics.


Tail Expectation and CVaR

In risk-sensitive applications (finance, safety-critical ML), we care not just about the average, but about what happens in the tail—the worst-case scenarios.

Conditional Value at Risk (CVaR)

Also called Expected Shortfall, CVaR answers: "What is the expected value of X given that we're in the worst (1 − α) fraction of cases?"

\text{CVaR}_\alpha(X) = \mathbb{E}[X \mid X > \text{VaR}_\alpha(X)]

where VaR_α(X) is the α-quantile of X (e.g., the 95th percentile for α = 0.95, so the tail average is taken over the worst 5% of outcomes).
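A minimal empirical CVaR estimator, following the quantile convention above with X treated as a loss (the lognormal losses and α = 0.95 are illustrative choices):

```python
import numpy as np

def cvar(losses: np.ndarray, alpha: float = 0.95) -> float:
    """Average of the losses exceeding the alpha-quantile (VaR_alpha)."""
    var_alpha = np.quantile(losses, alpha)
    tail = losses[losses > var_alpha]
    return float(tail.mean())

rng = np.random.default_rng(3)
losses = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)  # heavy right tail

print(f"mean loss = {losses.mean():.3f}")
print(f"VaR_0.95  = {np.quantile(losses, 0.95):.3f}")
print(f"CVaR_0.95 = {cvar(losses):.3f}")  # well above both: the tail is expensive
```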

Applications in ML

  • Safe Reinforcement Learning: Optimize for worst-case outcomes, not just average reward
  • Robust Optimization: Minimize expected loss in worst α% of scenarios
  • Financial ML: Portfolio risk management using CVaR constraints
  • Fairness: Ensure good performance for the worst-off groups
Risk-Aware ML: Standard ML minimizes E[Loss]. Risk-aware ML minimizes CVaR[Loss] to protect against tail events. This is crucial for safety-critical applications!

From Discrete to Continuous

Let's build the continuous expectation formula from scratch, starting with the discrete case.

Step 1: Start with Discrete

For discrete X with values x_1, x_2, \ldots and probabilities p_1, p_2, \ldots:

\mathbb{E}[X] = \sum_i x_i \cdot p_i

Step 2: Imagine Points Getting Closer

Now imagine many closely spaced values with spacing Δx. Rewrite each probability as:

p_i = \underbrace{\frac{p_i}{\Delta x}}_{\text{density}} \cdot \Delta x

Call f(x_i) = p_i / Δx. Then:

\mathbb{E}[X] = \sum_i x_i \cdot f(x_i) \cdot \Delta x

Step 3: Take the Limit

As Δx → 0, the sum becomes a Riemann integral:

\sum_i x_i \cdot f(x_i) \cdot \Delta x \longrightarrow \int x \cdot f(x) \, dx

Key Insight: The density f(x) is probability per unit length, just like mass density is mass per unit length. That's why we call it a "density" function!
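The three steps above can be reproduced directly in code (a sketch using a Normal(1, 2) density as an example; the truncation range [-20, 20] assumes the tails beyond it are negligible):

```python
import numpy as np
from scipy import stats

def riemann_expectation(dist, lo: float, hi: float, n_bins: int) -> float:
    """Approximate E[X] by the bin sum  sum_i x_i * f(x_i) * dx."""
    x, dx = np.linspace(lo, hi, n_bins, retstep=True)
    return float(np.sum(x * dist.pdf(x) * dx))

dist = stats.norm(loc=1.0, scale=2.0)  # true E[X] = 1.0
for n_bins in (10, 100, 10_000):
    approx = riemann_expectation(dist, -20.0, 20.0, n_bins)
    print(f"{n_bins:>6} bins: E[X] ≈ {approx:.6f}")
```

As the bins shrink, the sum converges to the true mean.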

Interactive: Riemann Sum to Integral

Watch how the Riemann sum converges to the integral as we use more bins:



Why Density Integrates to 1

This follows directly from probability conservation using the same bin logic:

  1. Chop the real line into tiny bins: [x_i, x_i + \Delta x]
  2. Probability of each bin: P(X \in [x_i, x_i+\Delta x]) \approx f(x_i) \cdot \Delta x
  3. Total probability must be 1: \sum_i f(x_i) \cdot \Delta x = 1
  4. Take the limit: \int_{-\infty}^{\infty} f(x) \, dx = 1

Not a Separate Rule

The normalization condition is not an arbitrary rule—it's simply saying "total probability of all possible outcomes = 1."
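The same bin logic checks out numerically (a sketch using the standard normal density; the range [-10, 10] captures essentially all the mass):

```python
import numpy as np
from scipy import stats

f = stats.norm(0.0, 1.0).pdf

# Steps 1-3: sum f(x_i) * dx over tiny bins covering (effectively) the real line
x, dx = np.linspace(-10.0, 10.0, 100_001, retstep=True)
total = np.sum(f(x) * dx)
print(f"sum of f(x_i) * dx = {total:.8f}")  # approaches 1 as dx -> 0
```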


Advanced: Hilbert Space View

For those who want the deepest insight (PhD-level):

Define an inner product between two random variables:

\langle U, V \rangle = \mathbb{E}[UV]

Then:

  • The space of square-integrable random variables becomes a Hilbert space
  • Expectation becomes the inner product with the constant 1: \mathbb{E}[X] = \langle X, 1 \rangle
Geometric Meaning: Expectation is the projection of X onto the constant function 1. This explains why expectation minimizes MSE—it's the orthogonal projection!

This explains:

  • Why expectation minimizes MSE
  • Why variance is squared distance
  • Why "uncorrelated" means "orthogonal"
  • Why PCA, Kalman filters, and least squares all work
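Under a sample-average approximation of the inner product, the projection picture can be sketched as follows (the gamma distribution and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=1.0, size=100_000)
ones = np.ones_like(x)

def inner(u: np.ndarray, v: np.ndarray) -> float:
    """Empirical inner product <U, V> = E[UV] (sample average)."""
    return float(np.mean(u * v))

# Projection of X onto the constant function 1: <X, 1> / <1, 1> = E[X]
proj = inner(x, ones) / inner(ones, ones)
print(proj, x.mean())  # identical

# The residual X - E[X] is orthogonal to the constant 1 (it has zero mean)
residual = x - proj
print(inner(residual, ones))  # ≈ 0
```

This is exactly why subtracting the mean ("centering") makes data orthogonal to the constant direction in least squares and PCA.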

Final Mental Model

When someone says "Take expectation," your mind should see:

📊

You are averaging all possible outcomes

⚖️

You weight them by how likely they are

🎯

You are extracting the center of the distribution

📈

You are describing average behavior

♾️

You are computing what happens in the long run

🔗

You find what randomness converges to

Expectation is the bridge between randomness and determinism.


Python Implementation

```python
"""
Expected Value: Complete Python Implementation
===============================================
This module demonstrates all key concepts of expectation with
comprehensive examples and visualizations.
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from typing import Callable, Tuple

# Set random seed for reproducibility
np.random.seed(42)

# =============================================================================
# 1. DISCRETE EXPECTATION
# =============================================================================
# Formula: E[X] = Σ x_i * P(X = x_i)
# This is simply a weighted average where weights are probabilities

def discrete_expectation(values: np.ndarray, probs: np.ndarray) -> float:
    """
    Compute expectation for a discrete random variable.

    Args:
        values: Array of possible values x_i
        probs: Array of probabilities P(X = x_i)

    Returns:
        E[X] = sum of value * probability
    """
    assert np.isclose(probs.sum(), 1.0), "Probabilities must sum to 1"
    return float(np.sum(values * probs))

# Example: Custom discrete distribution
values = np.array([1, 2, 3, 4, 5])
probabilities = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

expectation = discrete_expectation(values, probabilities)
print(f"E[X] (discrete) = {expectation}")  # Output: 3.0

# =============================================================================
# 2. CONTINUOUS EXPECTATION WITH SCIPY
# =============================================================================
# Formula: E[X] = ∫ x * f(x) dx
# scipy.stats distributions have a .mean() method

distributions = {
    "Uniform(0, 1)": stats.uniform(0, 1),
    "Exponential(λ=2)": stats.expon(scale=0.5),  # scale = 1/λ
    "Normal(μ=0, σ=1)": stats.norm(0, 1),
    "Beta(α=2, β=5)": stats.beta(2, 5),
    "Gamma(α=3, β=2)": stats.gamma(a=3, scale=0.5),  # scale = 1/β
}

print("\nExpectations of common distributions:")
for name, dist in distributions.items():
    print(f"  E[X] for {name} = {dist.mean():.4f}")

# =============================================================================
# 3. LAW OF LARGE NUMBERS VISUALIZATION
# =============================================================================
# As n → ∞, sample mean → E[X]
# This is the fundamental connection between theory and practice

def visualize_lln(n_samples: int = 10000) -> None:
    """Visualize Law of Large Numbers convergence."""
    samples = np.random.uniform(0, 1, size=n_samples)
    running_avg = np.cumsum(samples) / np.arange(1, n_samples + 1)
    true_mean = 0.5

    plt.figure(figsize=(12, 5))

    # Plot 1: Running average convergence
    plt.subplot(1, 2, 1)
    plt.plot(running_avg, "b-", alpha=0.7, linewidth=0.8)
    plt.axhline(y=true_mean, color="r", linestyle="--",
                label=f"True E[X] = {true_mean}")
    plt.xlabel("Number of Samples (n)")
    plt.ylabel("Sample Mean")
    plt.title("Law of Large Numbers: Convergence to E[X]")
    plt.legend()
    plt.xscale("log")  # Log scale shows convergence better

    # Plot 2: Error decreases as 1/√n
    plt.subplot(1, 2, 2)
    errors = np.abs(running_avg - true_mean)
    n_values = np.arange(1, n_samples + 1)
    plt.loglog(n_values, errors, "b-", alpha=0.5, label="Actual error")
    plt.loglog(n_values, 1/np.sqrt(n_values), "r--",
               label=r"$1/\sqrt{n}$ bound")
    plt.xlabel("Number of Samples (n)")
    plt.ylabel("|Sample Mean - E[X]|")
    plt.title("Convergence Rate: O(1/√n)")
    plt.legend()

    plt.tight_layout()
    plt.savefig("lln_convergence.png", dpi=150)
    plt.show()

# Uncomment to run: visualize_lln()

# =============================================================================
# 4. LOTUS: LAW OF THE UNCONSCIOUS STATISTICIAN
# =============================================================================
# E[g(X)] = ∫ g(x) * f(x) dx  (no need to find distribution of g(X))

def monte_carlo_lotus(
    g: Callable[[np.ndarray], np.ndarray],
    dist,
    n_samples: int = 100000
) -> Tuple[float, float]:
    """
    Estimate E[g(X)] using Monte Carlo (LOTUS in action).

    Args:
        g: Function to apply to samples
        dist: Frozen scipy distribution to sample from
        n_samples: Number of Monte Carlo samples

    Returns:
        (estimate, standard_error)
    """
    samples = dist.rvs(size=n_samples)
    g_samples = g(samples)
    estimate = np.mean(g_samples)
    std_error = np.std(g_samples) / np.sqrt(n_samples)
    return estimate, std_error

# Example: E[X²] for Uniform(0,1) - theoretical value is 1/3
uniform = stats.uniform(0, 1)
e_x2, se = monte_carlo_lotus(lambda x: x**2, uniform)
print(f"\nE[X²] Monte Carlo = {e_x2:.6f} ± {se:.6f}")
print(f"E[X²] Theoretical = {1/3:.6f}")

# Example: E[log(X)] for Exp(1) - theoretical value is -γ (Euler-Mascheroni)
exp_dist = stats.expon(scale=1)
e_log, se = monte_carlo_lotus(lambda x: np.log(x), exp_dist)
euler_mascheroni = 0.5772156649
print(f"\nE[log(X)] Monte Carlo = {e_log:.6f} ± {se:.6f}")
print(f"E[log(X)] Theoretical = {-euler_mascheroni:.6f}")

# =============================================================================
# 5. MSE MINIMIZATION PROOF
# =============================================================================
# E[X] uniquely minimizes E[(X - a)²] over all constants a

def visualize_mse_minimization() -> None:
    """Show that E[X] minimizes MSE."""
    # Generate samples from a skewed distribution
    samples = np.random.gamma(shape=2, scale=2, size=10000)
    true_mean = np.mean(samples)

    # Compute MSE for different values of a
    a_values = np.linspace(0, 10, 200)
    mse_values = [np.mean((samples - a)**2) for a in a_values]

    plt.figure(figsize=(10, 6))
    plt.plot(a_values, mse_values, "b-", linewidth=2)
    plt.axvline(x=true_mean, color="r", linestyle="--",
                label=f"E[X] = {true_mean:.2f}")
    plt.scatter([true_mean], [np.mean((samples - true_mean)**2)],
                color="r", s=100, zorder=5)
    plt.xlabel("Prediction value (a)")
    plt.ylabel("Mean Squared Error E[(X - a)²]")
    plt.title("E[X] Minimizes MSE: Proof by Visualization")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig("mse_minimization.png", dpi=150)
    plt.show()

# Uncomment to run: visualize_mse_minimization()

# =============================================================================
# 6. JENSEN'S INEQUALITY DEMONSTRATION
# =============================================================================
# For convex g: E[g(X)] ≥ g(E[X])
# For concave g: E[g(X)] ≤ g(E[X])

def demonstrate_jensen() -> None:
    """Demonstrate Jensen's inequality numerically."""
    samples = np.random.uniform(1, 10, size=100000)
    mean_x = np.mean(samples)

    # Convex function: x²
    e_x_squared = np.mean(samples**2)
    squared_e_x = mean_x**2
    print("\nJensen's Inequality (convex g(x) = x²):")
    print(f"  E[X²] = {e_x_squared:.4f}")
    print(f"  (E[X])² = {squared_e_x:.4f}")
    print(f"  E[X²] ≥ (E[X])²? {e_x_squared >= squared_e_x}")

    # Concave function: log(x)
    e_log_x = np.mean(np.log(samples))
    log_e_x = np.log(mean_x)
    print("\nJensen's Inequality (concave g(x) = log(x)):")
    print(f"  E[log(X)] = {e_log_x:.4f}")
    print(f"  log(E[X]) = {log_e_x:.4f}")
    print(f"  E[log(X)] ≤ log(E[X])? {e_log_x <= log_e_x}")

demonstrate_jensen()

# =============================================================================
# 7. VARIANCE VIA EXPECTATION
# =============================================================================
# Var(X) = E[X²] - (E[X])² = E[(X - μ)²]

def compute_variance(samples: np.ndarray) -> dict:
    """Compute variance using both formulas to verify equality."""
    mean = np.mean(samples)

    # Method 1: E[(X - μ)²]
    var_method1 = np.mean((samples - mean)**2)

    # Method 2: E[X²] - (E[X])²
    e_x2 = np.mean(samples**2)
    var_method2 = e_x2 - mean**2

    return {
        "E[X]": mean,
        "E[X²]": e_x2,
        "Var (centered)": var_method1,
        "Var (shortcut)": var_method2,
        "Std Dev": np.sqrt(var_method1)
    }

samples = np.random.normal(loc=5, scale=2, size=100000)
stats_dict = compute_variance(samples)
print("\nVariance computation:")
for key, value in stats_dict.items():
    print(f"  {key} = {value:.4f}")
```

To run the visualizations, uncomment the function calls at the end of each section. The code produces publication-quality plots showing LLN convergence and MSE minimization.


Test Your Understanding

Put your knowledge to the test with this interactive quiz covering the key concepts from this section:



Summary

Key Takeaways

1

Expectation = long-run average of a random variable

2

Formula = weighted average: value × probability, summed up

3

LOTUS: E[g(X)] = ∫g(x)f(x)dx without finding g(X)'s distribution

4

Linearity makes it incredibly useful for calculations

5

Jensen's Inequality: E[g(X)] ≥ g(E[X]) for convex g

6

Law of Large Numbers: sample mean converges to E[X]

7

Best predictor: E[X] minimizes mean squared error

8

Pitfall awareness: E[1/X] ≠ 1/E[X], E[XY] ≠ E[X]E[Y] in general

9

Foundation of ML: all loss functions are expectations

One-Sentence Deep Intuition

"Expectation is the unique linear projection that compresses infinite randomness into the deterministic average behavior of the system, by summing all possible values weighted by how often Nature produces them."