Learning Objectives
By the end of this section, you will be able to:
- Understand fundamentally why the chi-square distribution arises naturally as the sum of squared standard normal random variables
- Define the chi-square distribution and explain the meaning of degrees of freedom
- Recognize chi-square as a special case of Gamma: $\chi^2_k = \text{Gamma}(k/2, 1/2)$
- Apply chi-square tests for goodness-of-fit and independence testing
- Use chi-square for variance estimation and confidence intervals
- Implement chi-square feature selection in machine learning pipelines
- Connect chi-square to the F-distribution and Student's t-distribution
Deep Intuition: The Surprise Meter
The chi-square distribution measures "how surprising" your data is compared to what you expected.
Imagine you roll a die 600 times. You expect about 100 of each face. But you get: 90, 105, 115, 95, 85, 110. Is this die fair, or is something fishy?
The chi-square distribution answers this question. It tells you: "Given random variation, how likely is it to see this much deviation from expectation?"
The Core Idea
Chi-square ($\chi^2$) is built from squared standard normal variables:

$$\chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

where each $Z_i \sim N(0, 1)$ is independent. The subscript $k$ is the degrees of freedom.
Why Squared?
Mental Model
Think of chi-square as a "surprise accumulator." Each squared deviation adds to your total surprise. The larger the chi-square value, the more your data deviates from what you expected.
The Historical Story
The chi-square distribution emerged from fundamental questions about data, errors, and testing scientific theories.
Friedrich Robert Helmert (1876)
The German mathematician Helmert first derived the distribution of the sample variance for normally distributed data. He showed that $(n-1)S^2/\sigma^2$ follows what we now call the chi-square distribution (with $n-1$ degrees of freedom).
Karl Pearson (1900)
The breakthrough came when Pearson published "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling."
Pearson's Problem: Are the deviations I see in my data just due to random chance, or is there something systematic going on?
His motivation was practical: testing whether dice were fair, validating Mendel's genetic inheritance data, and checking if theoretical distributions fit observed data.
Ronald Fisher (1920s)
Fisher rigorously developed the concept of degrees of freedom and showed how to apply chi-square tests correctly. He established that when you estimate parameters from data, you "use up" degrees of freedom.
Why Do We Need the Chi-Square Distribution?
Chi-square underlies many classical statistical tests. It answers fundamental questions about data:
✅ Use Chi-Square When:
- Testing goodness-of-fit (does data match expected distribution?)
- Testing independence (are two categorical variables related?)
- Constructing confidence intervals for variance
- Feature selection for categorical data in ML
- Analyzing contingency tables
❌ Do NOT Use Chi-Square When:
- Expected frequencies are too small (E < 5)
- Data are continuous (need to bin first)
- Observations are not independent
- Testing means or correlations (use t-test, ANOVA)
Mathematical Definition
Definition from Normal Variables
If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then:

$$X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2(k)$$

We say $X$ follows a chi-square distribution with $k$ degrees of freedom.
The Probability Density Function

$$f(x; k) = \frac{x^{k/2-1}\, e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}, \quad x > 0$$
Symbol by Symbol
| Symbol | Meaning | Intuition |
|---|---|---|
| x | Chi-square value | Sum of squared deviations |
| k | Degrees of freedom | Number of independent squared normals |
| Γ(k/2) | Gamma function | Normalization constant (generalizes factorial) |
| e^(-x/2) | Exponential decay | Large values become rare |
| x^(k/2-1) | Power term | Shapes the distribution based on k |
Connection to Gamma Distribution
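The chi-square distribution is a special case of Gamma:

$$\chi^2(k) = \text{Gamma}\left(\alpha = \tfrac{k}{2},\ \beta = \tfrac{1}{2}\right)$$

This means: shape parameter $\alpha = k/2$, rate parameter $\beta = 1/2$.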
Why This Connection Matters
Since chi-square is a special case of Gamma, it inherits many properties. Gamma is closed under addition (for the same rate), so chi-square is too: the sum of independent chi-squares is chi-square with summed degrees of freedom.
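The Gamma connection can be checked numerically. The snippet below is a minimal sketch that assumes SciPy's parameterization, where `stats.gamma` takes the shape as `a` and a `scale` equal to 1 / rate, so a rate of 1/2 corresponds to `scale=2`. The value k = 5 and the grid of x values are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

k = 5                          # degrees of freedom (arbitrary illustrative value)
x = np.linspace(0.1, 20, 50)   # grid of points at which to compare densities

# SciPy's gamma uses shape `a` and `scale` = 1 / rate, so rate 1/2 means scale 2
chi2_pdf = stats.chi2.pdf(x, df=k)
gamma_pdf = stats.gamma.pdf(x, a=k / 2, scale=2)

print(np.allclose(chi2_pdf, gamma_pdf))  # True
```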
Exploring the Distribution
[Interactive visualizer: PDF of χ²(k) as the degrees of freedom k varies. The captured state is k = 5: right-skewed with an interior mode, equivalent to Gamma(5/2, 1/2) = Gamma(2.5, 0.5).]
For k = 1 and k = 2, the mode is at 0. For k ≥ 3, the mode is at k - 2.
PDF formula: $f(x) = \dfrac{x^{k/2-1}\, e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}$. For k = 5: $f(x) = \dfrac{x^{1.5}\, e^{-x/2}}{2^{2.5}\,\Gamma(2.5)}$.
Key Observations
- k = 1: Heavily right-skewed, mode at 0, unbounded spike at origin
- k = 2: Exponential distribution! Mode at 0 but no spike
- k ≥ 3: Interior mode at k - 2, bell-like but still skewed
- Large k: Becomes more symmetric, approximates Normal by CLT
- Mean always equals k: Larger k shifts distribution right
Where Chi-Square Comes From
The chi-square distribution arises naturally when you square and sum independent standard normal variables. This interactive demonstration shows you why.
[Interactive demonstration: histogram of simulated values of χ²₄ = Z₁² + Z₂² + Z₃² + Z₄², where each Zᵢ ~ N(0, 1) is independent, overlaid with the theoretical χ²(4) PDF.]
Key Insight: As you generate more samples, the histogram converges to the theoretical chi-square PDF (red curve). This demonstrates that summing k squared standard normals produces the χ²(k) distribution. The mean equals k (degrees of freedom), and variance equals 2k.
This fundamental relationship is why chi-square appears everywhere in statistics: any time you're working with squared errors or deviations from normality, chi-square is lurking.
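The same experiment can be reproduced offline. The following is a minimal simulation sketch, with k = 4, the seed, and the sample count chosen arbitrarily: it sums k squared standard normals many times and compares the result with the theoretical mean k, variance 2k, and a few quantiles of χ²(k).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
k = 4                            # degrees of freedom
n_samples = 100_000

# Each row holds k independent standard normals; square and sum across the row
z = rng.standard_normal(size=(n_samples, k))
chi2_samples = (z ** 2).sum(axis=1)

print(f"Sample mean:     {chi2_samples.mean():.3f} (theory: {k})")
print(f"Sample variance: {chi2_samples.var():.3f} (theory: {2 * k})")

# Empirical quantiles should sit close to the theoretical chi-square quantiles
for q in (0.5, 0.9, 0.95):
    empirical = np.quantile(chi2_samples, q)
    theoretical = stats.chi2.ppf(q, df=k)
    print(f"q={q}: empirical {empirical:.3f}, theoretical {theoretical:.3f}")
```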
Key Properties
Summary Statistics
| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[X] = k | Average equals degrees of freedom |
| Variance | Var(X) = 2k | Twice the degrees of freedom |
| Mode | k - 2 (for k ≥ 2) | Most likely value |
| Skewness | √(8/k) | Always positive, decreases with k |
| MGF | (1 - 2t)^(-k/2) | For t < 1/2 |
Additivity Property
If $X_1 \sim \chi^2(k_1)$ and $X_2 \sim \chi^2(k_2)$ are independent, then:

$$X_1 + X_2 \sim \chi^2(k_1 + k_2)$$
Why this matters: You can decompose complex chi-square statistics into simpler components, or combine independent tests.
Scaling Property
If $X \sim \chi^2(k)$ and $c > 0$:

$$cX \sim \text{Gamma}\left(\tfrac{k}{2},\ \tfrac{1}{2c}\right)$$
Scaling doesn't preserve chi-square (unlike the Normal), but stays in the Gamma family.
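Both properties can be sanity-checked numerically. This sketch uses arbitrary illustrative values (k1 = 3, k2 = 4, c = 3, k = 5) and assumes SciPy's `scale` = 1 / rate convention for the Gamma distribution. Additivity is checked by simulation; scaling is checked exactly through quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
q = [0.25, 0.5, 0.75, 0.95]

# Additivity: chi2(k1) + chi2(k2) should behave like chi2(k1 + k2)
k1, k2, n = 3, 4, 200_000
total = (stats.chi2.rvs(df=k1, size=n, random_state=rng)
         + stats.chi2.rvs(df=k2, size=n, random_state=rng))
empirical = np.quantile(total, q)
theoretical = stats.chi2.ppf(q, df=k1 + k2)
print("Additivity, empirical:  ", empirical.round(3))
print("Additivity, theoretical:", theoretical.round(3))

# Scaling: c * chi2(k) is Gamma(shape k/2, rate 1/(2c)), i.e. SciPy scale = 2c
c, k = 3.0, 5
scaled_q = c * stats.chi2.ppf(q, df=k)
gamma_q = stats.gamma.ppf(q, a=k / 2, scale=2 * c)
print("Scaling matches Gamma:", np.allclose(scaled_q, gamma_q))
```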
The Chi-Square Test Framework
The chi-square test compares observed frequencies to expected frequencies using this statistic:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
What Each Term Means
| Symbol | Meaning | Role |
|---|---|---|
| O_i | Observed frequency | What you actually counted |
| E_i | Expected frequency | What theory predicts |
| (O_i - E_i)² | Squared deviation | How far off each cell is |
| (O_i - E_i)²/E_i | Standardized deviation | Weight by expected (larger E allows larger variance) |
The Key Insight
We divide by $E_i$ because a deviation of 10 means different things depending on whether you expected 20 (50% off) or 1000 (1% off). This standardization makes deviations comparable.
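As a worked instance, applying the statistic to the die example from the start of this section (600 rolls, observed counts 90, 105, 115, 95, 85, 110, expected count 100 per face) gives:

$$\chi^2 = \frac{(-10)^2 + 5^2 + 15^2 + (-5)^2 + (-15)^2 + 10^2}{100} = \frac{700}{100} = 7.0$$

With 6 - 1 = 5 degrees of freedom, the 5% critical value is 11.07, so 7.0 does not reach it (the p-value is about 0.22). This much deviation is unremarkable for a fair die.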
Goodness-of-Fit Test: Is This Die Fair?
The goodness-of-fit test asks: Does my observed data match a hypothesized distribution? The table below shows the calculation layout for a die rolled 600 times, here with perfectly uniform counts:
| Face | Observed (O) | Expected (E) | O - E | (O - E)² / E |
|---|---|---|---|---|
| Face 1 | 100 | 100.0 | 0.0 | 0.000 |
| Face 2 | 100 | 100.0 | 0.0 | 0.000 |
| Face 3 | 100 | 100.0 | 0.0 | 0.000 |
| Face 4 | 100 | 100.0 | 0.0 | 0.000 |
| Face 5 | 100 | 100.0 | 0.0 | 0.000 |
| Face 6 | 100 | 100.0 | 0.0 | 0.000 |
| χ² statistic | | | | 0.000 |

Result: fail to reject H₀.
- χ² statistic: 0.0000
- Degrees of freedom: 5 (6 faces - 1)
- P-value: 1.0000
- Critical value (α = 0.05): 11.07
- Conclusion: no significant evidence the die is unfair (p ≥ 0.05)
How to Interpret
- H₀ (Null Hypothesis): The die is fair (all faces equally likely)
- H₁ (Alternative): The die is not fair
- Large χ²: Observed data differs significantly from expected
- Small p-value: Unlikely to see this data if die were fair
Steps for a Goodness-of-Fit Test
- State hypotheses: $H_0$: the data follow the expected distribution
- Calculate expected frequencies $E_i$ under $H_0$
- Compute the chi-square statistic: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$
- Determine degrees of freedom: $df = (\text{number of categories}) - 1$ (subtract 1 more for each estimated parameter)
- Find the p-value from the $\chi^2(df)$ distribution
- Decide: reject $H_0$ if p-value < significance level $\alpha$
Degrees of Freedom
For goodness-of-fit:

$$df = (\text{number of categories}) - 1 - (\text{number of estimated parameters})$$
If you estimate the mean from data, subtract 1. If you estimate mean AND variance, subtract 2.
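The effect of an estimated parameter can be sketched with `scipy.stats.chisquare`, whose `ddof` argument removes extra degrees of freedom. The counts below are hypothetical, and the example makes a simplifying assumption when estimating the Poisson mean: the open-ended "4 or more" bin is treated as exactly 4.

```python
import numpy as np
from scipy import stats

# Hypothetical data: number of intervals with 0, 1, 2, 3, and "4 or more" events
observed = np.array([105, 120, 51, 18, 6])
n = observed.sum()

# Estimate the Poisson mean from the data
# (simplifying assumption: treat the "4 or more" bin as exactly 4)
lam_hat = (np.arange(5) * observed).sum() / n

# Expected counts under Poisson(lam_hat); the last bin takes the remaining tail
probs = stats.poisson.pmf(np.arange(4), lam_hat)
probs = np.append(probs, 1 - probs.sum())
expected = n * probs

# ddof=1 removes one extra degree of freedom for the estimated mean
stat, pvalue = stats.chisquare(observed, f_exp=expected, ddof=1)
df = len(observed) - 1 - 1
print(f"lambda_hat = {lam_hat:.2f}, chi2 = {stat:.3f}, df = {df}, p = {pvalue:.4f}")
```

Without `ddof=1`, the p-value would be computed with 4 degrees of freedom instead of 3, which makes the test too lenient toward the fitted model.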
Real-World Applications
🎲 Gaming & Fairness Testing
- Testing if dice, cards, or RNGs are fair
- Casino regulation and auditing
- Video game loot box probability verification
🧬 Genetics & Biology
- Testing Mendelian inheritance ratios
- Hardy-Weinberg equilibrium
- Gene-disease associations
📊 Survey Research
- Testing independence of demographic variables
- Cross-tabulation analysis
- Polling accuracy assessment
🏥 Medical Research
- Drug efficacy with categorical outcomes
- Treatment vs. control group comparisons
- Epidemiological studies
AI/ML Applications
Chi-square is a fundamental tool in machine learning, especially for categorical data analysis and feature selection.
1. Chi-Square Feature Selection
One of the most important uses of chi-square in ML is selecting the most informative features for classification. The idea: if a feature is statistically independent of the target, it is unlikely to be useful for prediction on its own.
How Chi-Square Feature Selection Works:
For each categorical feature, we test whether it's independent of the target variable. Features with high chi-square scores (low p-values) are statistically associated with the target and are more likely to be useful for prediction.
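The per-feature test described above can be sketched directly with a contingency table. Everything in this example is hypothetical: the helper name `chi2_feature_score`, the simulated `device` feature (built to depend on the target), and the simulated `browser` feature (built to be independent of it).

```python
import numpy as np
from scipy import stats

def chi2_feature_score(feature, target):
    """Chi-square independence test between one categorical feature and the target."""
    _, f_idx = np.unique(feature, return_inverse=True)
    _, t_idx = np.unique(target, return_inverse=True)
    # Build the contingency table of (feature category, target class) counts
    table = np.zeros((f_idx.max() + 1, t_idx.max() + 1))
    np.add.at(table, (f_idx, t_idx), 1)
    stat, pvalue, dof, _ = stats.chi2_contingency(table, correction=False)
    return stat, pvalue, dof

rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)

# Hypothetical features: "device" depends on the target, "browser" does not
device = np.where(rng.random(n) < 0.3 + 0.4 * target, "mobile", "desktop")
browser = rng.choice(["chrome", "firefox", "safari", "edge"], size=n)

device_stat, device_p, device_dof = chi2_feature_score(device, target)
browser_stat, browser_p, browser_dof = chi2_feature_score(browser, target)
print(f"device:  chi2 = {device_stat:.2f}, df = {device_dof}, p = {device_p:.4g}")
print(f"browser: chi2 = {browser_stat:.2f}, df = {browser_dof}, p = {browser_p:.4g}")
```

Ranking features by p-value rather than by raw score is often safer when features have different numbers of categories, because the scores then come from chi-square distributions with different degrees of freedom.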
Feature Rankings by Chi-Square Score
| Rank | Feature | Description | χ² Score | p-value | df | Actually Predictive? | Selected? |
|---|---|---|---|---|---|---|---|
| #1 | income_level | Income bracket | 20.641 | < 0.001 | 2 | ✓ Yes | Selected |
| #2 | age_group | Customer age group | 18.224 | < 0.001 | 2 | ✓ Yes | Selected |
| #3 | device | Device type | 7.540 | 0.0230 | 2 | ✓ Yes | Selected |
| #4 | signup_day | Day of week signed up | 1.076 | 0.2995 | 1 | ✗ No | - |
| #5 | browser | Web browser used | 0.449 | 0.9299 | 3 | ✗ No | - |

Selection accuracy: selected 3 out of 3 truly predictive features (age_group, income_level, device). Larger samples make it easier for chi-square to identify the truly predictive features.
sklearn Equivalent
```python
from sklearn.feature_selection import chi2, SelectKBest

# X: non-negative feature matrix, y: class labels (both assumed to be defined)
# Select top 3 features by chi-square score
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)
# Selected features: income_level, age_group, device
```
2. Text Classification & NLP
Chi-square is heavily used in text classification to:
- Select the most discriminative words/n-grams for each class
- Reduce vocabulary size while preserving predictive power
- Identify terms strongly associated with categories
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# documents: list of raw text strings, y: class labels (both assumed to be defined)

# Create bag-of-words features
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(documents)

# Select top 1000 features by chi-square score
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, y)

# Get the selected feature names
feature_names = vectorizer.get_feature_names_out()
selected_features = feature_names[selector.get_support()]
```

3. Categorical Encoding Decisions
When deciding how to encode categorical features, chi-square helps determine:
- Whether a categorical feature is worth including
- Which categories to merge (similar chi-square contributions)
- Target encoding quality assessment
4. Model Calibration Testing
Chi-square goodness-of-fit can test if a model's predicted probabilities match reality:
- Bin predictions into groups (e.g., 0-10%, 10-20%, ...)
- Compare predicted vs. actual positive rates in each bin
- Large chi-square indicates poor calibration
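The binning procedure above can be sketched as follows; it resembles the Hosmer-Lemeshow test. The function name `calibration_chi2` and the simulated data are hypothetical. The degrees of freedom are an assumption: this sketch uses the number of non-empty bins, a common choice for held-out data, whereas the Hosmer-Lemeshow convention on training data is bins minus 2.

```python
import numpy as np
from scipy import stats

def calibration_chi2(y_true, y_prob, n_bins=10):
    """Chi-square calibration check over equal-width probability bins."""
    edges = np.linspace(0, 1, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])  # bin index 0 .. n_bins - 1
    stat, used_bins = 0.0, 0
    for b in range(n_bins):
        mask = bin_idx == b
        n_b = mask.sum()
        if n_b == 0:
            continue  # skip empty bins
        obs_pos = y_true[mask].sum()   # observed positives in the bin
        exp_pos = y_prob[mask].sum()   # positives the model expects in the bin
        # Compare both positives and negatives to their expected counts
        stat += (obs_pos - exp_pos) ** 2 / exp_pos
        stat += ((n_b - obs_pos) - (n_b - exp_pos)) ** 2 / (n_b - exp_pos)
        used_bins += 1
    # Assumption: df = number of non-empty bins (held-out data convention)
    return stat, stats.chi2.sf(stat, df=used_bins)

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=5000)           # predicted probabilities
y_good = (rng.random(5000) < p).astype(int)      # outcomes match predictions
y_bad = (rng.random(5000) < p ** 2).astype(int)  # true rate is p^2: overestimated

stat_good, p_good = calibration_chi2(y_good, p)
stat_bad, p_bad = calibration_chi2(y_bad, p)
print(f"Calibrated model:    chi2 = {stat_good:.2f}, p = {p_good:.4f}")
print(f"Miscalibrated model: chi2 = {stat_bad:.2f}, p = {p_bad:.4g}")
```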
5. Independence in Graphical Models
In probabilistic graphical models and causal inference, chi-square tests help:
- Test conditional independence assumptions
- Validate Bayesian network structure
- Identify spurious correlations vs. causal relationships
Related Distributions
| Distribution | Relationship to Chi-Square |
|---|---|
| Gamma(k/2, 1/2) | Chi-square IS Gamma with these parameters |
| Exponential(1/2) | Chi-square with k=2 is Exponential |
| F-distribution | Ratio of two independent chi-squares, each divided by its df |
| t-distribution | Standard normal divided by sqrt(independent chi-square / k) |
| Normal (large k) | Chi-square ≈ Normal(k, 2k) by CLT |
| Noncentral χ² | Sum of squared unit-variance normals with nonzero means |
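The t and F rows of the table can be illustrated by simulation. This is a sketch with arbitrary degrees of freedom (k = 5 for t; 3 and 10 for F) and an arbitrary seed: it builds both variables from chi-square draws and compares their empirical 95th percentiles with SciPy's theoretical values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

# t: standard normal divided by sqrt(independent chi-square / k)
k = 5
z = rng.standard_normal(n)
v = rng.chisquare(df=k, size=n)
t_samples = z / np.sqrt(v / k)

# F: ratio of two independent chi-squares, each divided by its df
d1, d2 = 3, 10
u1 = rng.chisquare(df=d1, size=n)
u2 = rng.chisquare(df=d2, size=n)
f_samples = (u1 / d1) / (u2 / d2)

t_emp, t_theory = np.quantile(t_samples, 0.95), stats.t.ppf(0.95, df=k)
f_emp, f_theory = np.quantile(f_samples, 0.95), stats.f.ppf(0.95, dfn=d1, dfd=d2)
print(f"t({k}) 95th percentile:      empirical {t_emp:.3f}, theoretical {t_theory:.3f}")
print(f"F({d1},{d2}) 95th percentile: empirical {f_emp:.3f}, theoretical {f_theory:.3f}")
```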
The Statistical Trinity
Chi-square, t, and F distributions form the backbone of classical statistics:
- Chi-square: Variance estimation, goodness-of-fit
- t-distribution: Mean estimation (unknown variance)
- F-distribution: Comparing variances, ANOVA
All three are intimately connected through the Normal distribution.
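Variance estimation is the most direct of these uses. For a normal sample of size n with sample variance S², the quantity (n - 1)S²/σ² follows χ²(n - 1), which gives a confidence interval for σ². The sketch below uses simulated data with hypothetical parameters (mean 10, standard deviation 2, n = 50) and assumes the data are normally distributed, which this interval requires.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sigma_true = 2.0                              # hypothetical true standard deviation
x = rng.normal(loc=10.0, scale=sigma_true, size=50)

n = x.size
s2 = x.var(ddof=1)                            # unbiased sample variance
alpha = 0.05

# (n - 1) * S^2 / sigma^2 ~ chi2(n - 1), so invert the two tail quantiles
lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)

print(f"Sample variance: {s2:.3f}")
print(f"95% CI for the variance: ({lower:.3f}, {upper:.3f})")
```

The interval is not symmetric around the sample variance, because the chi-square distribution is right-skewed. Both tails of the distribution are used here, unlike in the goodness-of-fit and independence tests.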
Python Implementation
Basic Operations with SciPy
```python
import numpy as np
from scipy import stats

# Create chi-square distribution with k=5 degrees of freedom
k = 5
chi2_dist = stats.chi2(df=k)

# PDF, CDF, quantiles
x = 5
print(f"PDF at x={x}: {chi2_dist.pdf(x):.4f}")
print(f"CDF at x={x}: {chi2_dist.cdf(x):.4f}")
print(f"P(X > {x}): {1 - chi2_dist.cdf(x):.4f}")

# Critical values for common significance levels
print(f"95th percentile: {chi2_dist.ppf(0.95):.4f}")
print(f"99th percentile: {chi2_dist.ppf(0.99):.4f}")

# Summary statistics
print(f"Mean: {chi2_dist.mean():.4f}")  # Should equal k
print(f"Variance: {chi2_dist.var():.4f}")  # Should equal 2k

# Generate random samples
samples = chi2_dist.rvs(size=10000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")
```

Goodness-of-Fit Test
```python
import numpy as np
from scipy import stats

# Example: Testing if a die is fair
# Observed counts from 600 rolls
observed = np.array([90, 105, 115, 95, 85, 110])

# Expected counts if fair (600/6 = 100 each)
expected = np.array([100, 100, 100, 100, 100, 100])

# Method 1: Using scipy.stats.chisquare
stat, pvalue = stats.chisquare(observed, f_exp=expected)
print(f"Chi-square statistic: {stat:.4f}")
print(f"P-value: {pvalue:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")

# Method 2: Manual calculation
chi_sq_manual = np.sum((observed - expected)**2 / expected)
df = len(observed) - 1
pvalue_manual = 1 - stats.chi2.cdf(chi_sq_manual, df)
print(f"Manual calculation: χ² = {chi_sq_manual:.4f}, p = {pvalue_manual:.4f}")

# Decision at α = 0.05
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df)
print(f"Critical value at α={alpha}: {critical_value:.4f}")
print(f"Reject H₀: {stat > critical_value}")
```

Test of Independence
```python
import numpy as np
from scipy import stats

# Contingency table: Gender vs. Product Preference
# Rows: Male, Female
# Columns: Product A, Product B, Product C
contingency_table = np.array([
    [45, 35, 20],  # Male
    [30, 50, 25]   # Female
])

# Perform chi-square test of independence
stat, pvalue, dof, expected = stats.chi2_contingency(contingency_table)

print("Observed frequencies:")
print(contingency_table)
print("\nExpected frequencies (under independence):")
print(expected.round(2))
print(f"\nChi-square statistic: {stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {pvalue:.4f}")

# Calculate contributions from each cell
contributions = (contingency_table - expected)**2 / expected
print("\nContributions to chi-square:")
print(contributions.round(4))
```

Feature Selection with Chi-Square
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, SelectKBest
import numpy as np

# Load data (continuous measurements; they are not discretized here)
iris = load_iris()
X = iris.data
y = iris.target

# Chi-square requires non-negative features
# For continuous features, you might bin them first
# Iris values are already non-negative; the shift shows a generic way to
# enforce non-negativity (note that shifting changes the scores)
X_positive = X - X.min(axis=0)

# Calculate chi-square scores for each feature
scores, pvalues = chi2(X_positive, y)

print("Feature chi-square scores:")
for name, score, pval in zip(iris.feature_names, scores, pvalues):
    print(f"  {name}: χ² = {score:.2f}, p = {pval:.4f}")

# Select top k features
k = 2
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_positive, y)

# Which features were selected?
selected_mask = selector.get_support()
selected_features = np.array(iris.feature_names)[selected_mask]
print(f"\nSelected features: {selected_features}")
```

Common Pitfalls
Expected Frequency Rule
Expected frequency should be ≥ 5 in each cell. With small expected frequencies, the chi-square approximation breaks down. Solutions:
- Combine categories to increase expected counts
- Use Fisher's exact test for small samples
- Use Yates' continuity correction for 2×2 tables
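A check of the expected-frequency rule, with a fallback to Fisher's exact test, might look like the sketch below. The 2×2 table is hypothetical, and the threshold of 5 is the rule of thumb stated above rather than a hard limit.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

# Hypothetical small 2x2 table (rows: group, columns: outcome)
table = np.array([[3, 1],
                  [1, 3]])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(table)
print("Smallest expected count:", expected.min())  # 2.0

if expected.min() < 5:
    # Small expected counts: fall back to Fisher's exact test (2x2 tables)
    _, pvalue = stats.fisher_exact(table)
    method = "Fisher's exact"
else:
    _, pvalue, _, _ = stats.chi2_contingency(table)
    method = "Chi-square"

print(f"{method} test p-value: {pvalue:.4f}")
```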
Independence Assumption
Each observation must be independent. If the same person can appear multiple times or observations are clustered, the chi-square test is invalid. Common sources of dependence:
- Clustered sampling methods
- Time series with autocorrelation
- Repeated measures designs
Categorical Data Only
Chi-square tests are for categorical data. For continuous data:
- Bin into categories first (lose information)
- Use Kolmogorov-Smirnov test instead
- Use appropriate parametric tests
One-Tailed Nature
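Pearson chi-square tests for goodness-of-fit and independence are right-tailed: large chi-square values indicate poor fit, and a value of 0 means the observed counts match the expected counts exactly. Unlike two-sided t-tests, the left tail is not used as evidence against H₀.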
Test Your Understanding
If X ~ χ²(10), what is E[X]?
Summary
The chi-square distribution is foundational to statistical testing. It naturally arises from squared standard normal variables and provides the basis for testing goodness-of-fit, independence, and variance.
Key Formulas
| Property | Formula |
|---|---|
| Definition | χ²_k = Z₁² + Z₂² + ... + Z_k² where Z_i ~ N(0,1) |
| PDF | f(x) = x^(k/2-1)e^(-x/2) / [2^(k/2)Γ(k/2)] |
| Mean | E[X] = k |
| Variance | Var(X) = 2k |
| Mode | k - 2 (for k ≥ 2) |
| Test Statistic | χ² = Σ(O_i - E_i)² / E_i |
| Gamma Form | χ²_k = Gamma(k/2, 1/2) |
Key Takeaways
- Sum of squared normals: Chi-square arises naturally from squaring and summing independent standard normal variables
- Degrees of freedom = k: This single parameter controls shape, mean, and variance
- Special case of Gamma: χ²_k = Gamma(k/2, 1/2)
- Goodness-of-fit: Test if observed data matches expected distribution
- Independence test: Test if categorical variables are related
- ML feature selection: Chi-square score ranks categorical features by predictive power
- Foundation for other tests: the t- and F-distributions behind the t-test, F-test, and ANOVA are built from chi-square variables
Coming Next: In the next section, we'll explore the Student's t-distribution—essential for inference about means when the population variance is unknown.