Learning Objectives
Building on Previous Sections
This section builds on concepts from Sections 01-03. You should understand estimators, likelihood functions, and the concept of Fisher Information. We'll see how sufficiency connects to efficiency!
By the end of this section, you will be able to:
- Understand what it means for a statistic to "capture all information" about θ
- Apply the Factorization Theorem, the practical tool for finding sufficient statistics from the likelihood
- Identify sufficient statistics for common distributions: Normal, Bernoulli, Poisson, Exponential
- Recognize a minimal sufficient statistic: the "most compressed" sufficient statistic with no redundancy
- See why the MLE is always a function of sufficient statistics
Historical Context
"We may say that a statistic is sufficient for the purposes of estimation if it contains all the information which the sample contains of the value of the parameter."
— R.A. Fisher, 1922
Why This Matters Today
Fisher's question — "what should we compute from data?" — is more relevant than ever in the age of big data. Sufficient statistics answer this precisely: compute only what captures all information about your parameter of interest.
Pitfalls and Caveats
Check the model assumptions
- Misspecification: If the model is wrong, a "sufficient" statistic may discard signal you actually care about.
- Non-iid data: Dependence or changing supports can break sufficiency and minimality arguments.
- Privacy: Sufficient statistics expose exact aggregates; add noise (e.g., differential privacy) before sharing.
Ancillarity & Completeness Bridge
Basu's Theorem
If T(X) is complete and sufficient for θ and A(X) is ancillary (its distribution does not depend on θ), then T and A are independent. This splits the data into an "information-carrying" part and a "noise-only" part.
Normal example
For iid N(μ, σ²) with known σ²: the sample mean X̄ is complete & sufficient for μ, the sample variance S² is ancillary for μ, and Basu implies X̄ ⊥ S². This independence powers t- and z-tests.
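The independence of X̄ and S² can be probed with a quick simulation. The sketch below is illustrative only: the sample size, replication count, and seed are arbitrary assumptions, and a near-zero correlation is consistent with independence but does not prove it. Exponential data is included as a contrast, since there the sample mean and sample variance are clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed (assumption)
n, reps = 10, 20000              # arbitrary sample size and replication count

# Normal samples: Basu's theorem says the mean and variance are independent
normal_samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
corr_normal = np.corrcoef(
    normal_samples.mean(axis=1),
    normal_samples.var(axis=1, ddof=1),
)[0, 1]

# Exponential samples: no such independence, so the correlation is clearly positive
exp_samples = rng.exponential(scale=1.0, size=(reps, n))
corr_exp = np.corrcoef(
    exp_samples.mean(axis=1),
    exp_samples.var(axis=1, ddof=1),
)[0, 1]

print(f"Normal:      corr(mean, var) = {corr_normal:+.3f}")
print(f"Exponential: corr(mean, var) = {corr_exp:+.3f}")
```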
Completeness and UMVUE
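Completeness guarantees uniqueness: if T is complete, two unbiased estimators that are both functions of T must agree (almost surely). By Lehmann-Scheffé, any unbiased estimator that is a function of a complete sufficient statistic is therefore the UMVUE.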
- Bernoulli(p): T = ΣXi is complete and sufficient; X̄ is the UMVUE for p.
- Poisson(λ): T = ΣXi is complete and sufficient; X̄ is the UMVUE for λ.
We'll lean on completeness and ancillarity in the next section to build best unbiased estimators and tests.
The Big Picture: Data Compression Without Loss
Sufficiency asks: Can we summarize our data without losing any information about the parameter? It's the statistical equivalent of lossless compression.
In previous sections, we learned about estimator quality (bias, variance, MSE, consistency, efficiency). Now we ask a more fundamental question:
Given data X = (X1, X2, ..., Xn), what is the minimum information we need to keep for inference about θ?
Can we throw away the original data and keep just a summary, without losing anything?
The Tax Return Analogy
Think about filing taxes.
What you start with:
- Every paycheck amount
- Every receipt
- Every bank transaction
- Thousands of data points!
What goes on the return:
- Total income
- Total deductions by category
- Just a few numbers
- All you need for tax calculation!
The IRS doesn't need to see every receipt — the totals are sufficient. Similarly, for many statistical problems, a simple summary is sufficient for inference.
| Benefit | Description |
|---|---|
| Data Compression | Store/transmit summaries instead of raw data |
| Simpler Estimators | Work with low-dimensional statistics |
| Privacy | Release summaries without individual data |
| Theoretical Foundation | Basis for Rao-Blackwell improvement and UMVUE |
What Is Sufficiency?
Intuitive Understanding
A statistic T(X) is sufficient for parameter θ if, once you know T(X), the original data X tells you nothing more about θ.
Key Insight: Given T(X), the conditional distribution of X does not depend on θ. All the "information about θ" flows through T(X).
Formal Definition
A statistic T(X) is sufficient for θ if the conditional distribution of X given T(X) does not depend on θ:
$$P_\theta(X = x \mid T(X) = t) = P(X = x \mid T(X) = t) \quad \text{for all } \theta$$
Equivalently: X ⊥ θ | T(X) (X is independent of θ given T)
What does this mean practically?
- If two samples have the same T(X), they provide identical information about θ
- Any estimator of θ based on X can be replaced by one based on T(X) alone
- The MLE will always be a function of T(X)
For Bernoulli data, the sum is sufficient: n data points compress into a single number without losing any information about p. For example, a sample of n = 10 observations containing 6 successes reduces to T(X) = ΣXi = 6.
This single number contains all information about p.
Compression: 10 values → 1 value (10.0% of the original size)
What we can still compute from T(X) = 6:
- MLE: p̂ = 6/10 = 0.600
- Likelihood ratio tests
- Confidence intervals for p
- Any inference about p!
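The claim that nothing is lost can be checked directly from the definition. The sketch below enumerates every binary sequence of length n = 4 with sum t = 2 (both values are arbitrary choices for illustration) and computes the conditional probability of each sequence given the sum under two very different values of p. If the sum is sufficient, the two conditional distributions should coincide.

```python
from itertools import product

def conditional_given_sum(p, n=4, t=2):
    """Return P(X = x | sum(X) = t) for every binary sequence x with sum t."""
    seqs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    # Joint probability of each sequence under iid Bernoulli(p)
    joint = [p ** sum(x) * (1 - p) ** (n - sum(x)) for x in seqs]
    total = sum(joint)  # this is P(T = t)
    return {x: j / total for x, j in zip(seqs, joint)}

cond_low = conditional_given_sum(p=0.2)
cond_high = conditional_given_sum(p=0.9)

# Each sequence gets the same conditional probability under both values of p
for x in cond_low:
    print(x, round(cond_low[x], 6), round(cond_high[x], 6))
```

Given the sum, every arrangement of the successes is equally likely whatever p is, which is why the arrangement carries no information about p.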
The Factorization Theorem
The definition of sufficiency is conceptual. How do we actually find sufficient statistics? The Factorization Theorem provides the answer.
A statistic T(X) is sufficient for θ if and only if the joint density (or PMF) can be factored as:
$$f(x; \theta) = g(T(x), \theta) \cdot h(x)$$
where:
- g(T(x), θ): Depends on data only through T(x), and can depend on θ
- h(x): Depends on data x, but does NOT depend on θ
Why Factorization Works
The factorization theorem says:
- All dependence on θ is captured in the factor g(T(x), θ). This means T(x) contains everything about θ.
- The factor h(x) is just "noise" with respect to θ. It affects probabilities but carries no information about θ.
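For N(μ, σ²) with known σ², the likelihood factors as:
$$f(x; \mu) = \underbrace{\exp\left\{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right\}}_{g(\bar{x},\, \mu)} \cdot \underbrace{(2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{2\sigma^2}\right\}}_{h(x)}$$
- g(x̄, μ): only involves T(x) = x̄ (sample mean)
- h(x): only involves deviations from the mean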
The Key Insight
The factor g(·) that depends on μ only uses x̄ (the sample mean). Therefore, x̄ is sufficient for μ!
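A numerical check can make this factorization concrete. The sketch below assumes the split g(x̄, μ) = exp{−n(x̄ − μ)²/(2σ²)} and h(x) = (2πσ²)^(−n/2) exp{−Σ(xi − x̄)²/(2σ²)}, and confirms on simulated data that log f = log g + log h for several values of μ. The seed, sample size, and parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                              # known standard deviation (assumed value)
x = rng.normal(5.0, sigma, size=50)      # simulated sample
n, x_bar = x.size, x.mean()

def log_joint(mu):
    """Log of the full joint density of the sample under N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

def log_g(mu):
    """Log of g(x_bar, mu): the only factor that involves mu."""
    return -n * (x_bar - mu) ** 2 / (2 * sigma**2)

# Log of h(x): depends on the data through deviations from the mean, never on mu
log_h = -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((x - x_bar) ** 2) / (2 * sigma**2)

for mu in (3.0, 5.0, 7.5):
    print(mu, np.isclose(log_joint(mu), log_g(mu) + log_h))
```

The check rests on the identity Σ(xi − μ)² = Σ(xi − x̄)² + n(x̄ − μ)², which is what separates the μ-dependent part from the rest.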
Worked Examples
Proof Sketch
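- (→) If f(x; θ) = g(T(x), θ)h(x), then for any t with P(T(X) = t) > 0 and any x with T(x) = t, the conditional probability is
$$P_\theta(X = x \mid T(X) = t) = \frac{g(t, \theta)\, h(x)}{\sum_{y:\, T(y) = t} g(t, \theta)\, h(y)} = \frac{h(x)}{\sum_{y:\, T(y) = t} h(y)}$$
The θ term cancels, so X | T(X) does not depend on θ → T is sufficient.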
- (←) If T is sufficient, define h(x) = P(X = x | T(X) = t) for any x with T(x) = t (the choice is θ-free by sufficiency) and set g(T(x), θ) = f_T(T(x); θ), the marginal of T. Then f(x; θ) = g(T(x), θ) h(x), giving the factorization.
The proof works the same for densities with integrals instead of sums; measurability assumptions aside, it shows the factorization theorem is an equivalence, not just a handy trick.
Finding Sufficient Statistics
Exponential Family Shortcut
For distributions in the exponential family, finding sufficient statistics is automatic!
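A distribution is in the exponential family if its density (or PMF) can be written as:
$$f(x; \theta) = h(x) \exp\left\{\eta(\theta)^\top T(x) - A(\theta)\right\}$$
Then, for an iid sample X1, ..., Xn, the statistic T(X) = Σ T(Xi) is automatically sufficient for θ!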
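As a concrete instance, the Poisson PMF can be written in exponential-family form h(x) exp{η T(x) − A} with h(x) = 1/x!, η = log λ, T(x) = x, and A = λ. The sketch below checks that this form reproduces the usual PMF; the function names are hypothetical and chosen only for this illustration.

```python
from math import exp, factorial, log

def poisson_pmf_direct(x, lam):
    """Textbook Poisson PMF: lam^x e^{-lam} / x!"""
    return lam ** x * exp(-lam) / factorial(x)

def poisson_pmf_expfam(x, lam):
    """Same PMF written as h(x) * exp(eta * T(x) - A)."""
    h = 1.0 / factorial(x)   # parameter-free factor
    eta = log(lam)           # natural parameter
    T = x                    # sufficient statistic for a single observation
    A = lam                  # log-normalizer
    return h * exp(eta * T - A)

for x in range(6):
    print(x, poisson_pmf_direct(x, 2.5), poisson_pmf_expfam(x, 2.5))
```

For an iid sample the exponents add, so the data enter the θ-dependent part only through the sum of the T(xi), here the total count.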
| Distribution | Parameter(s) | Sufficient Statistic T(X) | Dimension |
|---|---|---|---|
| Bernoulli(p) | p | ΣXi | 1 |
| Poisson(λ) | λ | ΣXi | 1 |
| N(μ, σ²) (σ² known) | μ | ΣXi or X̄ | 1 |
| N(μ, σ²) (both unknown) | (μ, σ²) | (ΣXi, ΣXi²) | 2 |
| Exponential(λ) | λ | ΣXi | 1 |
| Gamma(α, β) | (α, β) | (ΣXi, Σlog Xi) | 2 |
| Beta(α, β) | (α, β) | (Σlog Xi, Σlog(1-Xi)) | 2 |
Pattern to Notice
The dimension of the sufficient statistic equals the number of unknown parameters. This is a hint that these are minimal sufficient statistics!
Pitman-Koopman-Darmois
Under mild regularity (iid samples, support not depending on θ), only exponential-family models admit a fixed-dimensional sufficient statistic as n grows. This is the Pitman-Koopman-Darmois theorem: if T(Xn) has dimension not growing with n, the density must belong to an exponential family.
Counterexample: Cauchy
For the Cauchy location-scale family, no fixed-dimensional sufficient statistic exists: the order statistics are minimal sufficient, so any summary with fewer than n components loses information about location/scale.
Regularity Conditions
The factorization theorem and related results require certain regularity conditions. Understanding when these fail helps avoid misapplying sufficiency.
- When the support of the distribution depends on θ, special care is needed. Example: Uniform(0, θ). Here X(n) = max(Xi) is sufficient, but the factorization includes an indicator function that changes with θ.
- When no common dominating measure exists across the parameter space, the standard factorization theorem may not apply. Example: Mixture of continuous and discrete distributions.
- If different θ values produce identical distributions, sufficiency becomes degenerate. Example: Mixture models where component labels can be swapped.
- In nonparametric settings where θ is a function (like a density), classical sufficiency concepts may not directly apply. Example: Kernel density estimation, where the full data is needed.
Practical Takeaway
Always verify that:
- The support of your distribution is fixed (doesn't depend on θ)
- You have a proper parametric model (finite-dimensional θ)
- The parameter is identifiable
If any of these fail, proceed carefully with sufficiency arguments.
Canonical GLM Examples
For canonical generalized linear models with fixed design matrix X, the sufficient statistics are the aggregated score terms:
- Logistic regression: Σ xi yi (vector) is sufficient for β.
- Poisson regression: Σ xi yi is sufficient for β.
- Gaussian GLM (known σ²): Σ xi yi and Σ xi xiᵀ summarize all β-information.
This is why distributed/mini-batch training can communicate gradients (scores) instead of raw data while retaining all likelihood information about the coefficients.
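For logistic regression this can be seen by writing the log-likelihood as βᵀ(Σ xi yi) − Σ log(1 + exp(xiᵀβ)): the responses appear only through Σ xi yi. The sketch below compares that form with the observation-by-observation log-likelihood on simulated data. The design matrix, coefficients, and seed are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))                  # fixed design matrix (simulated here)
beta_true = np.array([0.5, -1.0, 0.25])      # hypothetical coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def loglik_raw(beta):
    """Bernoulli log-likelihood evaluated from every (x_i, y_i) pair."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

T = X.T @ y   # sufficient statistic: sum_i x_i y_i

def loglik_from_T(beta):
    """Same log-likelihood, touching y only through T = X^T y."""
    return beta @ T - np.sum(np.log1p(np.exp(X @ beta)))

for beta in (np.zeros(d), beta_true, np.array([1.0, 1.0, -2.0])):
    print(np.isclose(loglik_raw(beta), loglik_from_T(beta)))
```

The second term still depends on the design matrix, which is treated as fixed and known, so only the d-dimensional vector T needs to be retained from the responses.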
Minimal Sufficiency
Intuitive Understanding: Maximum Compression
Many statistics can be sufficient. The original data X is always (trivially) sufficient. But we want the most compressed sufficient statistic — one with no redundancy.
A minimal sufficient statistic is a sufficient statistic that is a function of every other sufficient statistic. It's the "coarsest" summary that still captures all information about θ.
Formal Definition
A sufficient statistic T(X) is minimal sufficient if for any other sufficient statistic S(X), there exists a function g such that:
$$T(X) = g(S(X))$$
In other words, T can be computed from any other sufficient statistic. It contains no extra information beyond what's needed.
For normal data with both μ and σ² unknown, let's compare different statistics:
| Statistic | Value | Dimension | Sufficient? | Minimal? |
|---|---|---|---|---|
| Original Data | (2, 5, 3, 8, 1, 7, 4, 6) | 8 | ✔ | ✘ |
| Order Statistics | (1, 2, 3, 4, 5, 6, 7, 8) | 8 | ✔ | ✘ |
| (Sum, Sum of Squares) | (36, 204) | 2 | ✔ | ✔ |
| (Mean, Variance) | (4.50, 5.25) | 2 | ✔ | ✔ |
| Just the Mean | 4.50 | 1 | ✘ | — |
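- Original data and order statistics: sufficient but not minimal. They contain more info than needed.
- (Sum, Sum of Squares) and (Mean, Variance): minimal sufficient! Maximum compression without information loss.
- Just the mean: NOT sufficient. Loses information about σ².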
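The table entries can be reproduced in a few lines. The sketch below computes both two-dimensional summaries for the sample (2, 5, 3, 8, 1, 7, 4, 6), shows that each can be recovered from the other (so they carry the same information), and shows that the mean alone cannot distinguish this sample from a second, made-up sample with the same mean but a different spread. The variance here divides by n, which is the convention that gives 5.25.

```python
import numpy as np

data = np.array([2, 5, 3, 8, 1, 7, 4, 6], dtype=float)
n = data.size

# (sum, sum of squares)
total, total_sq = data.sum(), np.sum(data**2)

# (mean, variance), using the divide-by-n variance
mean = total / n
var_pop = total_sq / n - mean**2

# One-to-one map back: (mean, variance) -> (sum, sum of squares)
total_back = n * mean
total_sq_back = n * (var_pop + mean**2)

# Hypothetical second sample: same mean, different spread
other = np.array([4, 5, 4, 5, 4, 5, 4, 5], dtype=float)

print("sum, sum of squares:", total, total_sq)
print("mean, variance:     ", mean, var_pop)
print("same mean:", other.mean() == mean, "| same sum of squares:", np.sum(other**2) == total_sq)
```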
Discrete Minimality Example
For Bernoulli(p) data, the likelihood ratio method becomes explicit. For two samples x and y:
$$\frac{f(x; p)}{f(y; p)} = \frac{p^{\sum x_i}(1-p)^{n - \sum x_i}}{p^{\sum y_i}(1-p)^{n - \sum y_i}} = \left(\frac{p}{1-p}\right)^{\sum x_i - \sum y_i}$$
This ratio is free of p if and only if Σxi = Σyi. Therefore, T(X) = ΣXi is minimal sufficient for p. The same logic works for Poisson(λ): equality of sums defines the equivalence classes, so the total count is minimal sufficient.
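The likelihood-ratio argument is easy to check numerically. In the sketch below, the three samples are made-up examples: two share the same sum and one does not. The ratio of likelihoods is evaluated over a grid of p values to see whether it stays constant.

```python
import numpy as np

def likelihood(x, p):
    """Bernoulli likelihood p^sum(x) * (1 - p)^(n - sum(x))."""
    x = np.asarray(x)
    return p ** x.sum() * (1 - p) ** (x.size - x.sum())

p_grid = np.linspace(0.1, 0.9, 9)
x = [1, 0, 1, 1, 0]        # sum = 3
y_same = [0, 1, 1, 0, 1]   # sum = 3 (same T)
y_diff = [1, 0, 0, 0, 0]   # sum = 1 (different T)

ratio_same = np.array([likelihood(x, p) / likelihood(y_same, p) for p in p_grid])
ratio_diff = np.array([likelihood(x, p) / likelihood(y_diff, p) for p in p_grid])

print("same sum:     ", np.round(ratio_same, 4))   # constant in p
print("different sum:", np.round(ratio_diff, 4))   # varies with p
```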
Uniqueness and Existence
Minimal sufficient statistics are unique only up to one-to-one transforms. In irregular models (e.g., non-dominated families or changing support), a minimal sufficient statistic may not exist.
Finding Minimal Sufficient Statistics
There's a powerful technique for finding minimal sufficient statistics:
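T(X) is minimal sufficient if, for all sample points x and y:
$$\frac{f(x; \theta)}{f(y; \theta)} \text{ does not depend on } \theta \iff T(x) = T(y)$$
In words: T(x) = T(y) if and only if the likelihood ratio doesn't depend on θ.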
Rao-Blackwell Improvement
The Rao-Blackwell Theorem shows that conditioning any estimator on a sufficient statistic can only reduce (or maintain) its variance. This provides a systematic way to improve estimators.
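Let δ(X) be any estimator of g(θ) with finite variance and T(X) be sufficient for θ. Define:
$$\delta^*(X) = E[\delta(X) \mid T(X)]$$
Then for all θ:
$$E_\theta[\delta^*(X)] = E_\theta[\delta(X)] \quad \text{and} \quad \operatorname{Var}_\theta(\delta^*(X)) \le \operatorname{Var}_\theta(\delta(X))$$
with equality if and only if δ is already a function of T.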
The Improvement Process:
- Start with any unbiased estimator δ(X)
- Find a sufficient statistic T(X)
- Compute δ*(X) = E[δ(X) | T(X)]
- The new estimator δ* has smaller (or equal) variance and is still unbiased
Bernoulli demonstration: conditioning an estimator on a sufficient statistic gives a tighter sampling distribution.
- Crude estimator: just use the first observation X₁ as an estimator for p
- Rao-Blackwellized estimator: condition X₁ on T = ΣXᵢ to get the sample mean X̄
- Result: by conditioning on the sufficient statistic T, the variance falls from 0.2419 to 0.0105
Why This Works
The Rao-Blackwell theorem guarantees: Var(E[δ|T]) ≤ Var(δ). Equality holds only if δ is already a function of T. For Bernoulli data, conditioning X₁ on T = ΣXᵢ gives X̄, which has variance p(1-p)/n instead of p(1-p)!
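A small simulation illustrates the variance reduction. The values p = 0.4 and n = 25, the replication count, and the seed are arbitrary assumptions for this sketch, so the simulated variances will not match any particular figures quoted elsewhere. The theory predicts roughly p(1 − p) for the crude estimator and p(1 − p)/n for the Rao-Blackwellized one.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.4, 25, 50000      # assumed values for illustration
samples = rng.binomial(1, p, size=(reps, n))

crude = samples[:, 0]                  # delta(X) = X_1, unbiased but noisy
rao_blackwell = samples.mean(axis=1)   # E[X_1 | sum(X)] = sum(X) / n

print(f"crude:         mean={crude.mean():.4f}, var={crude.var():.4f}")
print(f"Rao-Blackwell: mean={rao_blackwell.mean():.4f}, var={rao_blackwell.var():.4f}")
print(f"theory:        p(1-p)={p * (1 - p):.4f}, p(1-p)/n={p * (1 - p) / n:.4f}")
```

Both estimators are centered on p; only the spread changes, which is the content of the theorem.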
Likelihood Surface Visualization
One of the most powerful ways to understand sufficiency is to see that the entire likelihood function can be reconstructed, up to a constant factor that does not involve θ, from just the sufficient statistic. Two datasets with the same T(X) produce likelihood curves of identical shape.
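A Poisson example shows what "same likelihood curve" means in practice. The two samples below are made up and share the same sum but contain different values; their log-likelihoods differ only by a constant coming from the parameter-free factor h(x), so they have the same shape and the same maximizer. The grid of λ values is an arbitrary choice.

```python
import numpy as np
from math import lgamma

def poisson_loglik(data, lam):
    """Poisson log-likelihood: sum(x) * log(lam) - n * lam - sum(log(x_i!))."""
    const = sum(lgamma(k + 1) for k in data)   # log of prod(x_i!), free of lam
    return sum(data) * np.log(lam) - len(data) * lam - const

lam_grid = np.linspace(0.5, 6.0, 56)
a = [0, 1, 2, 9]   # sum = 12
b = [3, 3, 3, 3]   # sum = 12

ll_a = np.array([poisson_loglik(a, lam) for lam in lam_grid])
ll_b = np.array([poisson_loglik(b, lam) for lam in lam_grid])

gap = ll_a - ll_b   # constant across the whole grid
print("gap is constant:", np.allclose(gap, gap[0]))
print("argmax a:", lam_grid[ll_a.argmax()], "| argmax b:", lam_grid[ll_b.argmax()])
```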
Sufficiency and MLE
There's a beautiful connection between sufficiency and Maximum Likelihood Estimation:
If T(X) is sufficient for θ, then the MLE θ̂MLE (when it is unique) is a function of T(X).
Why? By factorization:
$$L(\theta) = f(X; \theta) = g(T(X), \theta) \cdot h(X)$$
Since h(X) doesn't depend on θ, maximizing L(θ) is equivalent to maximizing g(T(X), θ), which only depends on T(X).
This Is Why MLE Is Smart
MLE automatically uses only the information that matters. It never "looks at" the parts of data that are irrelevant to θ!
| Connection | Relationship |
|---|---|
| Sufficiency + Efficiency | Efficient estimators are functions of sufficient statistics |
| Rao-Blackwell Theorem | Conditioning on sufficient statistics improves estimators |
| UMVUE | Best unbiased estimators are functions of complete sufficient statistics |
| Data Reduction | Sufficient statistics enable lossless compression |
Common Mistakes
Here are frequent errors students and practitioners make with sufficiency. Test yourself by identifying the error before reading the explanation.
Why it's wrong: The sample mean is sufficient for the mean of a Normal distribution, but NOT for other distributions.
Why it's wrong: Any one-to-one function of a sufficient statistic is also sufficient.
Why it's wrong: Sufficiency and unbiasedness are completely independent properties.
Why it's wrong: When the support depends on θ, the indicator function is crucial and must be included in the factorization.
Why it's wrong: Being sufficient for θ doesn't mean you can easily estimate θ from the statistic.
Bayesian Perspective
Sufficiency has a beautiful interpretation in Bayesian inference: the posterior depends on the data only through the sufficient statistic.
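If T(X) is sufficient for θ, then:
$$p(\theta \mid X) = p(\theta \mid T(X))$$
The posterior distribution given the full data equals the posterior given just the sufficient statistic.
Why this works: By Bayes' theorem, with prior π(θ):
$$p(\theta \mid x) \propto f(x; \theta)\, \pi(\theta) = g(T(x), \theta)\, h(x)\, \pi(\theta) \propto g(T(x), \theta)\, \pi(\theta)$$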
Since h(X) doesn't depend on θ, it factors out and only g(T(X), θ) matters for the posterior.
Connection to Conjugate Priors
The exponential family connection explains why conjugate priors work so elegantly:
| Likelihood | Conjugate Prior | Posterior Update via Sufficient Statistic |
|---|---|---|
| Bernoulli(p) | Beta(α, β) | α' = α + Σxi, β' = β + n - Σxi |
| Poisson(λ) | Gamma(α, β), rate β | α' = α + Σxi, β' = β + n |
| N(μ, σ²) (σ² known) | N(μ0, τ²) | Posterior mean is weighted avg of μ0 and X̄ |
| Exponential(λ) | Gamma(α, β), rate β | α' = α + n, β' = β + Σxi |
Computational Advantage
In Bayesian updating, you only need to store the sufficient statistics and prior hyperparameters. As new data arrives, update the hyperparameters — no need to store or reprocess raw data!
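The Beta-Bernoulli case makes this concrete. The sketch below assumes a Beta(2, 2) prior and simulated 0/1 data (both hypothetical), and compares a single batch update against a streaming update that only ever holds the two hyperparameters in memory.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.binomial(1, 0.35, size=1000)   # simulated outcomes; 0.35 is an assumed rate

alpha0, beta0 = 2.0, 2.0                  # hypothetical Beta prior hyperparameters

# Batch update: uses the whole dataset at once, but only through (n, sum)
alpha_batch = alpha0 + data.sum()
beta_batch = beta0 + data.size - data.sum()

# Streaming update: process chunks as they arrive, keep only two numbers
alpha_stream, beta_stream = alpha0, beta0
for chunk in np.array_split(data, 10):
    alpha_stream += chunk.sum()
    beta_stream += chunk.size - chunk.sum()

print("batch posterior:    Beta(%.0f, %.0f)" % (alpha_batch, beta_batch))
print("streaming posterior: Beta(%.0f, %.0f)" % (alpha_stream, beta_stream))
```

The two routes land on the same posterior because each chunk contributes only its count of successes and failures.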
Information-Theoretic View
Information theory provides another lens for understanding sufficiency: a statistic is sufficient if and only if it preserves all mutual information with the parameter.
T(X) is sufficient for θ if and only if:
$$I(T(X); \theta) = I(X; \theta)$$
where I(·; ·) denotes mutual information.
The Data Processing Inequality:
For any function T, we always have I(T(X); θ) ≤ I(X; θ) (processing data can only lose information). Sufficiency is the special case where equality holds!
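Mutual information with θ only makes sense once θ is given a prior, so the sketch below assumes a uniform prior over three candidate values of p and three Bernoulli observations (all arbitrary choices). It computes the mutual information exactly by enumeration for the full sample, for the sum, and for the first observation alone.

```python
from itertools import product
from math import log2

thetas = [0.2, 0.5, 0.8]        # assumed candidate values of p
prior = [1 / 3, 1 / 3, 1 / 3]   # assumed uniform prior
n = 3

def mutual_information(statistic):
    """I(statistic(X); theta) in bits, by exact enumeration."""
    joint = {}
    for theta, weight in zip(thetas, prior):
        for x in product([0, 1], repeat=n):
            prob = weight * theta ** sum(x) * (1 - theta) ** (n - sum(x))
            key = (statistic(x), theta)
            joint[key] = joint.get(key, 0.0) + prob
    # Marginal distribution of the statistic
    marg_s = {}
    for (s, _), prob in joint.items():
        marg_s[s] = marg_s.get(s, 0.0) + prob
    marg_theta = dict(zip(thetas, prior))
    return sum(prob * log2(prob / (marg_s[s] * marg_theta[theta]))
               for (s, theta), prob in joint.items() if prob > 0)

mi_full = mutual_information(lambda x: x)       # the whole sample
mi_sum = mutual_information(sum)                # sufficient statistic
mi_first = mutual_information(lambda x: x[0])   # not sufficient

print(f"I(X; theta)      = {mi_full:.4f} bits")
print(f"I(sum(X); theta) = {mi_sum:.4f} bits")
print(f"I(X_1; theta)    = {mi_first:.4f} bits")
```

The sum attains equality in the data processing inequality, while the first observation alone falls strictly short.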
Connection to Information Bottleneck
The Information Bottleneck method in ML seeks to find T(X) that minimizes I(X; T) (compression) while maximizing I(T; Y) (relevant information). Sufficient statistics achieve perfect I(T; θ) = I(X; θ), making them the optimal choice when feasible.
Machine Learning Connections
Neural networks are machines for learning (approximate) sufficient statistics.
The deepest insight connecting classical statistics to modern AI is this: sufficiency is the hidden soul of deep learning. Every neural network, every transformer, every generative model is secretly performing the same ancient operation that Fisher described a century ago — compressing infinite-dimensional reality into finite-dimensional representations that preserve all decision-relevant information.
The Ancient Question in a Modern World
Long before neural networks existed, statisticians asked a deceptively simple question:
“What is the smallest piece of data that still lets me make the best possible decision?”
Today, machine learning practitioners ask:
“What is the smallest latent representation that still lets my model reason, generate, and predict correctly?”
These are the same question, separated by 100 years of mathematics and hardware.
One side calls the answer a sufficient statistic. The other calls it a latent embedding.
They are not metaphors. They are the same object viewed through different lenses.
The Central Mystery: How can we take an infinite, continuous, chaotic stream of reality — millions of pixels, billions of words, countless sensor readings — and compress it into a finite representation that loses nothing important?
The answer that emerged from both classical statistics and modern deep learning is remarkably unified: find the minimal sufficient statistic for your task.
“The universe presents us with infinite data. Wisdom is knowing what to keep and what to discard. A sufficient statistic is what remains after perfect compression — everything you need, nothing you don't.”
The Story: From Raw Reality to a Decision Crystal
Imagine you're holding a 10-megapixel photograph — about 30 million numbers (RGB values). You need to answer one question: “Is this a cat or a dog?”
Raw Reality (30,000,000 numbers) → Neural Network (Compression) → Decision Crystal (1 number: P(cat))
The question is: What is the right compression? Not too lossy (you might misclassify), not too large (inefficient, prone to overfitting). The answer: the minimal sufficient statistic for the classification task.
Two Worlds, One Principle
In statistics, we start with:
$$X \;\longrightarrow\; T(X) \;\longrightarrow\; \text{inference about } \theta$$
Where:
- X = observed sample (measurements, counts, values)
- θ = parameter we want to estimate (μ, σ, p, λ, ...)
- T(X) = a function that compresses the data
We search for T(X) such that:
$$P(\theta \mid X) = P(\theta \mid T(X))$$
- Left: What you believe about θ after seeing ALL the raw data
- Right: What you believe about θ after seeing ONLY the compressed summary
Meaning: If someone gives me T(X), I can throw away the original data X — I won't lose any information about θ. T(X) is a perfect summary.
→ That is the definition of a sufficient statistic
Now look at a neural network:
$$x \;\longrightarrow\; z \;\longrightarrow\; \hat{y}$$
Where:
- x = raw pixels, text tokens, audio waveforms
- z = latent vector, embedding, hidden state
- ŷ = prediction, next token, image, action
The entire job of deep learning is to learn a representation z such that knowing z is as good as knowing x for the task.
In probability language, the dream objective is:
$$P(y \mid x) = P(y \mid z)$$
- Left: Prediction from ALL raw input (pixels, tokens, ...)
- Right: Prediction from ONLY the learned embedding
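Which is exactly the sufficiency condition:
$$y \perp x \mid z$$
So the embedding z plays the role of the sufficient statistic T(X), with the target y in place of the parameter θ.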
The Unifying Insight
Both are doing exactly the same thing: Finding a function T such that T(X) contains all the information in X about the target parameter/prediction. The difference is only in how the function is discovered — classical statistics derives it analytically from the likelihood; deep learning learns it from data.
Modern AI Models as Sufficiency Machines
Every major deep learning architecture can be understood as a sufficiency machine — a system that compresses input into a representation that preserves all task-relevant information.
The Information Bottleneck = Minimal Sufficiency
The Information Bottleneck (IB) principle, introduced by Tishby et al., provides the formal connection between deep learning and minimal sufficiency.
The Information Bottleneck Objective:
$$\min_{Z} \; I(X; Z) - \beta\, I(Z; Y)$$
- I(X; Z), minimize: Compress! Forget everything about X that isn't needed.
- I(Z; Y), preserve: Keep all information about Y that X contains.
This matches the definition of minimal sufficiency:
- Sufficiency: I(Z; Y) = I(X; Y), so Z contains ALL info about Y
- Minimality: min I(X; Z), the smallest Z that works
What this means for ML:
- Overfitting: Z too large (not minimal)
- Underfitting: Z not sufficient
- Generalization: Z is near-minimal sufficient
Tishby's Hypothesis: During training, networks first increase I(Z; Y) (fitting), then decrease I(X; Z) (compression). The result approximates the minimal sufficient statistic.
The Grand Unification
Here is the mapping between classical statistics and deep learning, revealing they are two languages for the same underlying mathematics:
Consider this statement:
“All generative AI models compress raw data into a latent space from which we generate or decide. If that latent space gives correct inference, it is sufficient.”
When the latent z supports the same inference as the raw input x, this can be written formally as:
$$P(y \mid x) = P(y \mid z)$$
Which is the definition of sufficiency.
And furthermore:
“Neural networks are used to create that latent space.”
This is the deepest correct interpretation of neural networks:
Neural networks are automatic sufficient-statistic discovery engines.
Classical statistics asks: “What part of the data carries all the truth?”
Deep learning answers: “We will learn that part automatically.”
Sufficiency is the mathematical soul that unifies them.
Every successful AI model is secretly a sufficiency engine: it learns how to throw away everything that does not matter and keep only what is necessary to make optimal predictions and generate reality.
Where Fisher and his successors had to derive T(X) analytically from known likelihood functions, neural networks learn T from data. They solve the same problem — find the compression that preserves all task-relevant information — but without requiring us to specify the model family in advance.
“Classical statistics found sufficiency through mathematics.
Deep learning finds it through gradient descent.
Both arrive at the same destination: the essence of information.”
Practical Applications
Real-World Applications
A/B Testing Case Study
A/B testing provides a perfect real-world example of sufficiency in action. Consider testing two versions of a website button to see which gets more clicks.
Version A:
- nA = 10,000 visitors
- xA = 312 clicks
- Click rate: 3.12%
Version B:
- nB = 10,000 visitors
- xB = 347 clicks
- Click rate: 3.47%
Key insight: For Bernoulli trials, the sufficient statistic for each group's click probability p is just the count of clicks (xA, xB).
Why Sufficiency Matters for A/B Testing:
- Privacy-Preserving Analysis: Share only (nA, xA, nB, xB) with analysts. No need to reveal individual user actions — the sufficient statistics contain all information needed for inference about pA and pB.
- Efficient Storage: Store only 4 numbers instead of 20,000 individual records. For ongoing tests, just update the running totals.
- Complete Inference: From these 4 numbers, we can compute:
  - Point estimates: p̂A = xA/nA, p̂B = xB/nB
  - Confidence intervals for pA, pB, and pB - pA
  - Chi-square test or Fisher's exact test
  - Bayesian posterior distributions
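Two-proportion z-test:
$$z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$
where p̂ = (xA + xB)/(nA + nB) is the pooled proportion.
Result: z ≈ 1.39, two-sided p-value ≈ 0.17. Not statistically significant at α = 0.05.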
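The test statistic can be recomputed from the four sufficient statistics alone. The sketch below uses the pooled two-proportion z-test with a two-sided p-value obtained from the standard normal distribution via `math.erfc`; the function name is hypothetical and chosen for this illustration.

```python
from math import erfc, sqrt

def two_proportion_z_test(n_a, x_a, n_b, x_b):
    """Pooled two-proportion z-test computed from counts only."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

z, p_value = two_proportion_z_test(10_000, 312, 10_000, 347)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

No individual visitor record is needed at any point, which is the practical meaning of sufficiency here.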
Industry Application
Companies like Google, Netflix, and Amazon run thousands of A/B tests simultaneously. They store and analyze only sufficient statistics (counts and sums), not billions of individual user actions. This is sufficiency at scale!
Python Implementation
Let's implement and verify sufficiency concepts in Python:
import numpy as np


def demonstrate_sufficiency():
    """Show that the MLE depends on the data only through the sufficient statistic."""
    np.random.seed(42)

    # Generate Bernoulli data
    true_p = 0.7
    n = 100
    data = np.random.binomial(1, true_p, n)

    # Original data approach
    mle_from_data = np.mean(data)

    # Sufficient statistic approach
    T = np.sum(data)  # Sufficient statistic
    mle_from_T = T / n

    print("Demonstrating Sufficiency for Bernoulli")
    print("=" * 50)
    print(f"Original data: {data[:20]}... (n={n})")
    print(f"Sufficient statistic T = sum(X) = {T}")
    print()
    print(f"MLE from original data: {mle_from_data:.4f}")
    print(f"MLE from sufficient T:  {mle_from_T:.4f}")
    print(f"Same answer: {np.isclose(mle_from_data, mle_from_T)}")


def verify_factorization_normal():
    """Compute the minimal sufficient statistics for a Normal sample."""
    np.random.seed(42)

    mu_true, sigma_true = 5.0, 2.0
    n = 50
    data = np.random.normal(mu_true, sigma_true, n)

    # Compute sufficient statistics
    T1 = np.sum(data)      # Sum
    T2 = np.sum(data**2)   # Sum of squares

    # Derived (equivalent) sufficient statistics
    x_bar = T1 / n
    s_squared = (T2 - n * x_bar**2) / (n - 1)

    print("\nNormal Distribution: Sufficient Statistics")
    print("=" * 50)
    print(f"Sample size: n = {n}")
    print("\nMinimal Sufficient Statistics:")
    print(f"  T1 = sum(X)   = {T1:.4f}")
    print(f"  T2 = sum(X^2) = {T2:.4f}")
    print("\nEquivalent forms:")
    print(f"  Sample mean = {x_bar:.4f} (true mu = {mu_true})")
    print(f"  Sample var  = {s_squared:.4f} (true sigma^2 = {sigma_true**2})")


def compare_sufficient_vs_insufficient():
    """Compare sufficient vs insufficient statistics for Uniform(0, theta)."""
    np.random.seed(42)

    theta_true = 10.0
    n = 30
    data = np.random.uniform(0, theta_true, n)

    # Sufficient statistic: max(X)
    T_sufficient = np.max(data)
    mle_theta = T_sufficient  # MLE is the max (biased, but consistent)

    # Insufficient statistic: mean (loses information!)
    insufficient_stat = np.mean(data)
    naive_estimate = 2 * insufficient_stat  # Method of moments

    print("\nUniform(0, theta): Sufficient vs Insufficient")
    print("=" * 50)
    print(f"True theta: {theta_true}")
    print(f"\nSufficient statistic T = max(X) = {T_sufficient:.4f}")
    print(f"MLE from sufficient: {mle_theta:.4f}")
    print(f"\nInsufficient stat (mean) = {insufficient_stat:.4f}")
    print(f"Naive estimate 2*mean = {naive_estimate:.4f}")
    print("\nNote: Mean loses information about theta!")
    print("Two samples with same mean but different max")
    print("would give SAME mean-based estimate but DIFFERENT")
    print("information about theta.")


def demonstrate_minimal_sufficiency():
    """Show that minimal sufficient = maximum compression."""
    np.random.seed(42)

    mu_true, sigma_true = 5.0, 2.0
    n = 20
    data = np.random.normal(mu_true, sigma_true, n)

    # Each entry: (value, dimension, is_sufficient, is_minimal).
    # The flags are stated explicitly: they come from theory, not from the dimension.
    statistics = {
        "Original data": (tuple(data), n, True, False),
        "Order statistics": (tuple(sorted(data)), n, True, False),
        "(sum, sum_sq)": ((np.sum(data), np.sum(data**2)), 2, True, True),
        "(mean, var)": ((np.mean(data), np.var(data)), 2, True, True),
        "Just the mean": ((np.mean(data),), 1, False, False),
    }

    print("\nMinimal Sufficiency: Data Compression")
    print("=" * 50)
    print(f"Original sample size: n = {n}")
    print("Parameters: mu (unknown), sigma^2 (unknown) -> 2 params")
    print()

    for name, (_, dim, is_sufficient, is_minimal) in statistics.items():
        sufficient = "Yes" if is_sufficient else "No"
        minimal = "Yes" if is_minimal else "No"
        compression = f"{n} -> {dim}" if dim < n else "None"

        print(f"{name:20s}: dim={dim:2d}, sufficient={sufficient:3s}, "
              f"minimal={minimal:3s}, compression={compression}")


def streaming_sufficient_statistics():
    """Demonstrate streaming computation using sufficient statistics."""
    np.random.seed(42)

    print("\nStreaming Computation via Sufficient Statistics")
    print("=" * 50)

    # Initialize sufficient statistics
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0

    # Stream data in batches
    true_mu, true_sigma = 100, 15
    batch_size = 100
    n_batches = 10

    print(f"Streaming {n_batches} batches of {batch_size} observations each")
    print(f"True parameters: mu={true_mu}, sigma={true_sigma}")
    print()

    for batch in range(n_batches):
        # New batch of data
        new_data = np.random.normal(true_mu, true_sigma, batch_size)

        # Update sufficient statistics (work proportional to the batch, O(1) storage)
        n += len(new_data)
        sum_x += np.sum(new_data)
        sum_x2 += np.sum(new_data**2)

        # Compute estimates from sufficient statistics
        mean_estimate = sum_x / n
        var_estimate = (sum_x2 - n * mean_estimate**2) / (n - 1)

        print(f"After batch {batch+1}: n={n:4d}, "
              f"mu_hat={mean_estimate:.2f}, sigma_hat={np.sqrt(var_estimate):.2f}")

    print()
    print(f"Final estimates: mu_hat={mean_estimate:.2f}, sigma_hat={np.sqrt(var_estimate):.2f}")
    print(f"True values: mu={true_mu}, sigma={true_sigma}")
    print(f"\nMemory used: O(1) (three numbers) regardless of n = {n}!")


# Run all demonstrations
if __name__ == "__main__":
    demonstrate_sufficiency()
    verify_factorization_normal()
    compare_sufficient_vs_insufficient()
    demonstrate_minimal_sufficiency()
    streaming_sufficient_statistics()

Try It Yourself
Run this code to see sufficiency in action. The streaming example shows how you can process unlimited data with fixed memory using sufficient statistics!
Interactive Quiz
Test your understanding of sufficiency concepts with this interactive quiz. Each question has immediate feedback with detailed explanations.
Q1: For iid Bernoulli(p) data, which statistic is sufficient for p?
Practice Checks
Quick prompts to internalize the concepts:
- Find T(X) for Beta(α, β) and Negative Binomial(r, p). Are they minimal?
- Use the likelihood-ratio criterion to show minimality for Poisson(λ) (hint: compare sums).
- Propose a statistic for Uniform(0, θ) that is not sufficient and explain why.
- Compute the compression ratio n → dim(T) for Normal with known vs unknown variance.
- Explain how you would add differential privacy noise to a sufficient statistic before sharing it.
Key Insights
A sufficient statistic captures all information about θ. You can throw away the original data without losing anything for inference!
To find sufficient statistics, factor the likelihood. The part that depends on θ only uses T(X) — that's your sufficient statistic!
Minimal sufficient statistics achieve maximum compression. For full-rank exponential families, the dimension equals the number of parameters!
The Maximum Likelihood Estimator automatically depends only on sufficient statistics. This is one reason MLE is so powerful!
Sufficient statistics enable processing infinite streams with fixed memory, and sharing data summaries without revealing individual records.
Summary
| Symbol | Name | Meaning |
|---|---|---|
| T(X) | Statistic | A function of the data |
| g(T, θ) | Parameter-dependent factor | Depends on θ only through T |
| h(x) | Parameter-free factor | Depends on x but not θ |
| X(n) | Largest order statistic | Maximum value in sample |
| X ⊥ θ \| T | Conditional independence | X independent of θ given T |
Sufficiency:
- T(X) captures all info about θ
- Factorization: f(x; θ) = g(T, θ) · h(x)
- Enables data compression
- MLE depends only on T(X)
Minimal sufficiency:
- Maximum compression without loss
- Function of every other sufficient stat
- Dimension = number of parameters (in the full-rank exponential family examples above)
- Use likelihood ratio method to find
In the next section, we'll explore Completeness and Ancillarity — completeness tells us when a sufficient statistic has "no extra parts", and ancillarity identifies statistics that carry no information about θ. Together with sufficiency, these concepts lead to the powerful Lehmann-Scheffé theorem for finding the best unbiased estimators!