Chapter 5

Chi-Square Distribution

Continuous Distributions

Learning Objectives

By the end of this section, you will be able to:

  1. Understand fundamentally why the chi-square distribution arises naturally as the sum of squared standard normal random variables
  2. Define the chi-square distribution $\chi^2_k$ and explain the meaning of degrees of freedom
  3. Recognize chi-square as a special case of Gamma: $\chi^2_k = \text{Gamma}(k/2, 1/2)$
  4. Apply chi-square tests for goodness-of-fit and independence testing
  5. Use chi-square for variance estimation and confidence intervals
  6. Implement chi-square feature selection in machine learning pipelines
  7. Connect chi-square to the F-distribution and Student's t-distribution

Deep Intuition: The Surprise Meter

The chi-square distribution measures "how surprising" your data is compared to what you expected.

Imagine you roll a die 600 times. You expect about 100 of each face. But you get: 90, 105, 115, 95, 85, 110. Is this die fair, or is something fishy?

The chi-square distribution answers this question. It tells you: "Given random variation, how likely is it to see this much deviation from expectation?"

The Core Idea

Chi-square ($\chi^2$) is built from squared standard normal variables:

$$\chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

where each $Z_i \sim N(0, 1)$ is independent. The subscript $k$ is the degrees of freedom.

Why Squared?

  • Always Positive: Deviations above and below expectation both count as "surprise"
  • Amplifies Large Deviations: A deviation of 10 contributes 100; a deviation of 2 contributes only 4
  • Connects to Normal Distribution: The Normal distribution is the workhorse of statistics; chi-square captures squared normal behavior

Mental Model

Think of chi-square as a "surprise accumulator." Each squared deviation adds to your total surprise. The larger the chi-square value, the more your data deviates from what you expected.


The Historical Story

The chi-square distribution emerged from fundamental questions about data, errors, and testing scientific theories.

Friedrich Robert Helmert (1876)

The German mathematician Helmert first derived the distribution of the sample variance for normally distributed data. He showed that $(n-1)S^2/\sigma^2$ follows what we now call the chi-square distribution.

Karl Pearson (1900)

The breakthrough came when Pearson published "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling."

Pearson's Problem: Are the deviations I see in my data just due to random chance, or is there something systematic going on?

His motivation was practical: testing whether dice were fair, validating Mendel's genetic inheritance data, and checking if theoretical distributions fit observed data.

Ronald Fisher (1920s)

Fisher rigorously developed the concept of degrees of freedom and showed how to apply chi-square tests correctly. He established that when you estimate parameters from data, you "use up" degrees of freedom.


Why Do We Need the Chi-Square Distribution?

Chi-square underlies many classical statistical tests. It answers fundamental questions about data:

Use Chi-Square When:

  • Testing goodness-of-fit (does data match expected distribution?)
  • Testing independence (are two categorical variables related?)
  • Constructing confidence intervals for variance
  • Feature selection for categorical data in ML
  • Analyzing contingency tables

Do NOT Use Chi-Square When:

  • Expected frequencies are too small (E < 5)
  • Data are continuous (need to bin first)
  • Observations are not independent
  • Testing means or correlations (use t-test, ANOVA)

Mathematical Definition

Definition from Normal Variables

If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then:

$$X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$$

We say $X$ follows a chi-square distribution with $k$ degrees of freedom.

The Probability Density Function

$$f(x; k) = \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)} \quad \text{for } x > 0$$

Symbol by Symbol

| Symbol | Meaning | Intuition |
|---|---|---|
| x | Chi-square value | Sum of squared deviations |
| k | Degrees of freedom | Number of independent squared normals |
| Γ(k/2) | Gamma function | Normalization constant (generalizes factorial) |
| e^(-x/2) | Exponential decay | Large values become rare |
| x^(k/2-1) | Power term | Shapes the distribution based on k |

Connection to Gamma Distribution

The chi-square distribution is a special case of Gamma:

$$\chi^2_k = \text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right)$$

This means: shape parameter $\alpha = k/2$, rate parameter $\beta = 1/2$.

Why This Connection Matters

Since chi-square is a special case of Gamma, it inherits many properties. Gamma is closed under addition (for the same rate), so chi-square is too: the sum of independent chi-squares is chi-square with summed degrees of freedom.
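The Gamma identity can be checked numerically. The sketch below assumes SciPy is available and uses an arbitrary illustrative choice of k = 5; note that SciPy parameterizes the Gamma by shape and scale, so a rate of 1/2 corresponds to `scale=2`.

```python
import numpy as np
from scipy import stats

k = 5  # degrees of freedom (arbitrary illustrative choice)
x = np.linspace(0.1, 20, 50)

# SciPy's gamma takes a shape `a` and a scale; scale = 1 / rate = 2.
chi2_pdf = stats.chi2.pdf(x, df=k)
gamma_pdf = stats.gamma.pdf(x, a=k / 2, scale=2)

# The two densities should agree up to floating-point rounding.
max_abs_diff = float(np.max(np.abs(chi2_pdf - gamma_pdf)))
print(f"Largest PDF difference on the grid: {max_abs_diff:.2e}")
```

If the two curves agree to rounding error for several values of k, that is a quick sanity check on the shape/rate bookkeeping rather than a proof.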


Exploring the Distribution

Use the interactive visualizer below to see how the degrees of freedom $k$ affect the shape of the chi-square distribution.

Chi-Square Distribution Explorer (interactive figure; snapshot for k = 5)

Shape: right-skewed with interior mode. The plot shows f(x) against x, with the mean μ = 5, the mode, and the α = 0.05 critical value marked.

Statistics for X ~ χ²(5)

  • Mean: 5
  • Variance: 10
  • Std Dev: 3.162
  • Mode: 3
  • Skewness: 1.2649
  • As Gamma: Gamma(5/2, 1/2) = Gamma(2.5, 0.5)

Critical Values for χ²(5)

  • α = 0.05: 11.070
  • α = 0.01: 15.086
  • α = 0.001: 20.515

Reject H₀ if your test statistic exceeds these values.

The PDF shows the probability density. For k = 1, 2, the mode is at 0. For k ≥ 3, the mode is at k − 2.

PDF Formula

$$f(x) = \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}$$

For k = 5:

$$f(x) = \frac{x^{1.5} e^{-x/2}}{2^{2.5}\, \Gamma(2.5)}$$

Key Observations

  • k = 1: Heavily right-skewed, mode at 0, unbounded spike at origin
  • k = 2: Exponential distribution! Mode at 0 but no spike
  • k ≥ 3: Interior mode at $k - 2$, bell-like but still skewed
  • Large k: Becomes more symmetric, approximates Normal by CLT
  • Mean always equals k: Larger k shifts distribution right
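These observations can be spot-checked numerically. The following is a minimal sketch assuming SciPy; the grid and the particular values of k are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 30, 3000)  # evaluation grid with spacing 0.01

# k = 2: the chi-square PDF coincides with an Exponential with rate 1/2 (scale 2).
diff_k2 = float(np.max(np.abs(stats.chi2.pdf(x, df=2) - stats.expon.pdf(x, scale=2))))
print(f"max |chi2(2) - Exp(rate=1/2)| = {diff_k2:.2e}")

# k >= 3: locate the mode on the grid and compare it with k - 2.
modes = {k: float(x[np.argmax(stats.chi2.pdf(x, df=k))]) for k in (3, 5, 10)}
for k, mode in modes.items():
    print(f"k={k}: numerical mode ~ {mode:.2f}, formula k-2 = {k - 2}")

# Skewness sqrt(8/k) shrinks as k grows, so the shape becomes more symmetric.
skews = {k: float(stats.chi2.stats(k, moments="s")) for k in (1, 5, 50, 500)}
for k, skew in skews.items():
    print(f"k={k}: skewness = {skew:.4f}")
```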

Where Chi-Square Comes From

The chi-square distribution arises naturally when you square and sum independent standard normal variables. This interactive demonstration shows you why.

Sum of Squared Standard Normals (interactive demo; snapshot for k = 4)

The chi-square distribution arises from:

$$\chi^2_4 = Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2$$

where each $Z_i \sim N(0, 1)$ is independent.

[Figure: histogram of simulated samples of X = Z₁² + Z₂² + Z₃² + Z₄² against the theoretical χ²(4) density, with E[X] = 4 marked.]

Key Insight: As you generate more samples, the histogram converges to the theoretical chi-square PDF (red curve). This demonstrates that summing k squared standard normals produces the χ²(k) distribution. The mean equals k (degrees of freedom), and variance equals 2k.

This fundamental relationship is why chi-square appears everywhere in statistics: any time you're working with squared errors or deviations from normality, chi-square is lurking.
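The same experiment can be reproduced offline. The sketch below assumes NumPy and SciPy; the seed, k = 4, and the sample size are illustrative choices, not values prescribed by the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
k = 4                           # degrees of freedom
n_samples = 100_000

# Draw k independent standard normals per sample, then square and sum them.
z = rng.standard_normal(size=(n_samples, k))
x = np.sum(z**2, axis=1)

sample_mean = float(x.mean())
sample_var = float(x.var(ddof=1))
print(f"Sample mean: {sample_mean:.3f} (theory: {k})")
print(f"Sample variance: {sample_var:.3f} (theory: {2 * k})")

# Compare a few empirical quantiles with the theoretical chi-square quantiles.
probs = [0.5, 0.9, 0.99]
empirical_q = np.quantile(x, probs)
theoretical_q = stats.chi2.ppf(probs, df=k)
for p, eq, tq in zip(probs, empirical_q, theoretical_q):
    print(f"q={p}: empirical {eq:.3f} vs. theoretical {tq:.3f}")
```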


Key Properties

Summary Statistics

| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[X] = k | Average equals degrees of freedom |
| Variance | Var(X) = 2k | Twice the degrees of freedom |
| Mode | k - 2 (for k ≥ 2) | Most likely value |
| Skewness | √(8/k) | Always positive, decreases with k |
| MGF | (1 - 2t)^(-k/2) | For t < 1/2 |
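The mean and variance follow directly from the sum-of-squares definition. A short derivation, using the standard fact that a standard normal has fourth moment $E[Z^4] = 3$:

```latex
\begin{aligned}
E[Z_i^2] &= \operatorname{Var}(Z_i) + (E[Z_i])^2 = 1 + 0 = 1, \\
\operatorname{Var}(Z_i^2) &= E[Z_i^4] - (E[Z_i^2])^2 = 3 - 1 = 2, \\
E[X] &= \sum_{i=1}^{k} E[Z_i^2] = k, \qquad
\operatorname{Var}(X) = \sum_{i=1}^{k} \operatorname{Var}(Z_i^2) = 2k .
\end{aligned}
```

The variance line uses independence of the $Z_i$; the mean line needs only linearity of expectation.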

Additivity Property

If $X_1 \sim \chi^2_{k_1}$ and $X_2 \sim \chi^2_{k_2}$ are independent, then:

$$X_1 + X_2 \sim \chi^2_{k_1 + k_2}$$

Why this matters: You can decompose complex chi-square statistics into simpler components, or combine independent tests.
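One way to see additivity is through the moment generating function from the summary table: for independent variables the MGF of the sum is the product of the MGFs.

```latex
M_{X_1 + X_2}(t) = M_{X_1}(t)\, M_{X_2}(t)
  = (1 - 2t)^{-k_1/2}\, (1 - 2t)^{-k_2/2}
  = (1 - 2t)^{-(k_1 + k_2)/2}, \qquad t < \tfrac{1}{2}
```

The result is the MGF of a chi-square with $k_1 + k_2$ degrees of freedom, which identifies the distribution of the sum.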

Scaling Property

If $X \sim \chi^2_k$ and $c > 0$:

$$cX \sim \text{Gamma}(k/2,\; 1/(2c))$$

Scaling doesn't preserve chi-square (unlike the Normal), but stays in the Gamma family.
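Scaled chi-squares show up in variance estimation: for normal data, $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ (Helmert's result above), so $S^2$ is a chi-square scaled by $\sigma^2/(n-1)$. Inverting that relationship gives a confidence interval for $\sigma^2$. The sketch below is a minimal illustration assuming NumPy and SciPy; the data are simulated, and the true standard deviation, sample size, and seed are hypothetical values chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_sigma = 2.0                                  # hypothetical population std dev
data = rng.normal(loc=10.0, scale=true_sigma, size=30)

n = data.size
s2 = float(data.var(ddof=1))                      # unbiased sample variance S^2
alpha = 0.05

# P(chi2_{alpha/2} <= (n-1) S^2 / sigma^2 <= chi2_{1-alpha/2}) = 1 - alpha,
# so dividing (n-1) S^2 by the upper quantile gives the lower bound and vice versa.
lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)

print(f"Sample variance: {s2:.3f}")
print(f"95% CI for the variance: [{lower:.3f}, {upper:.3f}]")
```

The interval is not symmetric around $S^2$ because the chi-square distribution is skewed, and it relies on the data being approximately normal.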


The Chi-Square Test Framework

The chi-square test compares observed frequencies to expected frequencies using this statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

What Each Term Means

| Symbol | Meaning | Role |
|---|---|---|
| O_i | Observed frequency | What you actually counted |
| E_i | Expected frequency | What theory predicts |
| (O_i - E_i)² | Squared deviation | How far off each cell is |
| (O_i - E_i)²/E_i | Standardized deviation | Weight by expected (larger E allows larger variance) |

The Key Insight

We divide by $E_i$ because a deviation of 10 means different things depending on whether you expected 20 (50% off) or 1000 (1% off). This standardization makes deviations comparable.
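As a worked example, take the die from the introduction: 600 rolls with observed counts 90, 105, 115, 95, 85, 110 and an expected count of 100 per face.

```latex
\chi^2 = \frac{(90-100)^2 + (105-100)^2 + (115-100)^2 + (95-100)^2 + (85-100)^2 + (110-100)^2}{100}
       = \frac{100 + 25 + 225 + 25 + 225 + 100}{100}
       = \frac{700}{100} = 7.0
```

With 6 − 1 = 5 degrees of freedom, the critical value at α = 0.05 is 11.07. Since 7.0 < 11.07, this amount of deviation is consistent with a fair die at that significance level.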


Goodness-of-Fit Test: Is This Die Fair?

The goodness-of-fit test asks: Does my observed data match a hypothesized distribution? Try it yourself:

[Interactive demo: Goodness-of-Fit Test for a die. Enter observed counts for Faces 1 to 6; the demo tabulates Observed (O), Expected (E), O − E, and (O − E)² / E for each face and plots the statistic against the χ²(5) density with the critical value marked.]

Default state (100 observations per face, expected 100.0 each):

  • χ² Statistic: 0.0000
  • Degrees of Freedom: 5 (6 faces − 1)
  • P-value: 1.0000
  • Critical Value (α = 0.05): 11.07
  • Decision: Fail to reject H₀. No significant evidence the die is unfair (p ≥ 0.05)

How to Interpret

  • H₀ (Null Hypothesis): The die is fair (all faces equally likely)
  • H₁ (Alternative): The die is not fair
  • Large χ²: Observed data differs significantly from expected
  • Small p-value: Unlikely to see this data if die were fair

Steps for a Goodness-of-Fit Test

  1. State hypotheses: $H_0$: Data follows expected distribution
  2. Calculate expected frequencies under $H_0$
  3. Compute chi-square statistic: $\chi^2 = \sum (O_i - E_i)^2 / E_i$
  4. Determine degrees of freedom: $k = \text{categories} - 1$ (subtract 1 more for each estimated parameter)
  5. Find p-value from $\chi^2_k$ distribution
  6. Decide: Reject $H_0$ if p-value < significance level

Degrees of Freedom

For goodness-of-fit: $\text{df} = (\text{categories} - 1) - (\text{parameters estimated})$

If you estimate the mean from data, subtract 1. If you estimate mean AND variance, subtract 2.
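The degrees-of-freedom adjustment can be sketched with `scipy.stats.chisquare`, whose `ddof` argument subtracts extra degrees of freedom. The example below tests a Poisson fit where the rate is estimated from the data; the data are simulated, and the seed, sample size, and choice of bins (0, 1, 2, 3, 4, and "5 or more") are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(lam=2.0, size=500)   # hypothetical count data

lam_hat = counts.mean()                   # one parameter estimated from the data

# Categories: 0, 1, 2, 3, 4, and "5 or more" (pooled to keep expected counts >= 5).
max_cat = 5
observed = np.array(
    [np.sum(counts == v) for v in range(max_cat)] + [np.sum(counts >= max_cat)]
)
probs = stats.poisson.pmf(np.arange(max_cat), mu=lam_hat)
probs = np.append(probs, 1.0 - probs.sum())   # remaining mass for the pooled tail
expected = probs * counts.size

# ddof=1 removes one extra degree of freedom for the estimated rate.
stat, pvalue = stats.chisquare(observed, f_exp=expected, ddof=1)
dof = len(observed) - 1 - 1

print(f"Estimated rate: {lam_hat:.3f}")
print(f"Chi-square statistic: {stat:.4f}, df = {dof}, p = {pvalue:.4f}")
```

Here there are 6 categories and 1 estimated parameter, so the reference distribution is χ²(4) rather than χ²(5).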


Real-World Applications

🎲 Gaming & Fairness Testing

  • Testing if dice, cards, or RNGs are fair
  • Casino regulation and auditing
  • Video game loot box probability verification

🧬 Genetics & Biology

  • Testing Mendelian inheritance ratios
  • Hardy-Weinberg equilibrium
  • Gene-disease associations

📊 Survey Research

  • Testing independence of demographic variables
  • Cross-tabulation analysis
  • Polling accuracy assessment

🏥 Medical Research

  • Drug efficacy with categorical outcomes
  • Treatment vs. control group comparisons
  • Epidemiological studies

AI/ML Applications

Chi-square is a fundamental tool in machine learning, especially for categorical data analysis and feature selection.

1. Chi-Square Feature Selection

One of the most important uses of chi-square in ML is selecting the most informative features for classification. The idea: if a feature is independent of the target, it's not useful for prediction.

🎯Chi-Square Feature Selection for ML

How Chi-Square Feature Selection Works:

For each categorical feature, we test whether it's independent of the target variable. Features with high chi-square scores (low p-values) are statistically associated with the target and are more likely to be useful for prediction.

Dataset: 500 samples | Target: purchased vs. not_purchased

Feature Rankings by Chi-Square Score

| Rank | Feature | Description | χ² Score | p-value | df | Actually Predictive? | Selected (top 3)? |
|---|---|---|---|---|---|---|---|
| 1 | income_level | Income bracket | 20.641 | < 0.001 | 2 | Yes | Yes |
| 2 | age_group | Customer age group | 18.224 | < 0.001 | 2 | Yes | Yes |
| 3 | device | Device type | 7.540 | 0.0230 | 2 | Yes | Yes |
| 4 | signup_day | Day of week signed up | 1.076 | 0.2995 | 1 | No | No |
| 5 | browser | Web browser used | 0.449 | 0.9299 | 3 | No | No |

Selection Accuracy

Selected 3 out of 3 truly predictive features.

In the interactive version, increasing the sample size helps chi-square better identify the truly predictive features (age_group, income_level, device).

sklearn Equivalent

from sklearn.feature_selection import chi2, SelectKBest

# Select top 3 features by chi-square score
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)

# Selected features: income_level, age_group, device
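Note that scikit-learn's `chi2` expects non-negative numeric features such as counts or one-hot indicators, so raw category labels need encoding first. The per-feature independence test described in this subsection can also be sketched directly with SciPy. Everything in the example below is hypothetical: the data are simulated so that `income_level` influences the target while `browser` does not, and `chi2_score` is a helper name introduced here for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Hypothetical integer-coded categorical features.
income_level = rng.integers(0, 3, size=n)   # 3 income brackets
browser = rng.integers(0, 4, size=n)        # 4 browsers

# Purchase probability depends on income_level only.
purchase_prob = np.array([0.2, 0.4, 0.6])[income_level]
purchased = rng.binomial(1, purchase_prob)


def chi2_score(feature, target):
    """Chi-square independence test between one categorical feature and the target."""
    table = np.zeros((feature.max() + 1, target.max() + 1), dtype=int)
    np.add.at(table, (feature, target), 1)  # build the contingency table
    stat, pvalue, dof, _ = stats.chi2_contingency(table, correction=False)
    return stat, pvalue, dof


features = {"income_level": income_level, "browser": browser}
results = {name: chi2_score(col, purchased) for name, col in features.items()}

# Rank features from highest to lowest chi-square score.
for name, (stat, pvalue, dof) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: chi2 = {stat:.3f}, df = {dof}, p = {pvalue:.4f}")
```

Each feature gets (categories − 1) × (classes − 1) degrees of freedom, which is why scores for features with different numbers of categories are better compared through their p-values.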

2. Text Classification & NLP

Chi-square is heavily used in text classification to:

  • Select the most discriminative words/n-grams for each class
  • Reduce vocabulary size while preserving predictive power
  • Identify terms strongly associated with categories
text_feature_selection.py

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# Assumes `documents` (a list of strings) and `y` (one class label per
# document) are already defined.

# Create bag-of-words features
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(documents)

# Select top 1000 features by chi-square score
# (k should not exceed the vocabulary size; k="all" keeps every feature)
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, y)

# Get the selected feature names
feature_names = vectorizer.get_feature_names_out()
selected_features = feature_names[selector.get_support()]

3. Categorical Encoding Decisions

When deciding how to encode categorical features, chi-square helps determine:

  • Whether a categorical feature is worth including
  • Which categories to merge (similar chi-square contributions)
  • Target encoding quality assessment

4. Model Calibration Testing

Chi-square goodness-of-fit can test if a model's predicted probabilities match reality:

  • Bin predictions into groups (e.g., 0-10%, 10-20%, ...)
  • Compare predicted vs. actual positive rates in each bin
  • Large chi-square indicates poor calibration
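The binning procedure above can be sketched as a Hosmer-Lemeshow-style statistic, which is one common way to turn it into a chi-square test; the text does not name a specific method, so treat this as an assumption. The data are simulated so that the model is calibrated by construction, and the reference distribution with (bins − 2) degrees of freedom is the conventional Hosmer-Lemeshow choice rather than something derived here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 5000
p_pred = rng.uniform(0.02, 0.98, size=n)   # hypothetical predicted probabilities
y_true = rng.binomial(1, p_pred)           # outcomes drawn from those probabilities

n_bins = 10
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_idx = np.digitize(p_pred, edges[1:-1])  # bin index in 0 .. n_bins - 1

stat = 0.0
used_bins = 0
for b in range(n_bins):
    mask = bin_idx == b
    n_b = int(mask.sum())
    if n_b == 0:
        continue                            # skip empty bins
    observed_pos = y_true[mask].sum()       # actual positives in the bin
    expected_pos = p_pred[mask].sum()       # positives the model predicts
    mean_p = expected_pos / n_b
    # Squared deviation scaled by the binomial variance of the bin.
    stat += (observed_pos - expected_pos) ** 2 / (n_b * mean_p * (1 - mean_p))
    used_bins += 1

dof = used_bins - 2                         # conventional Hosmer-Lemeshow df (assumption)
pvalue = float(stats.chi2.sf(stat, df=dof))
print(f"Calibration statistic: {stat:.3f}, df = {dof}, p = {pvalue:.4f}")
```

A small p-value would suggest the predicted probabilities do not match observed frequencies; the result is sensitive to the number and placement of bins.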

5. Independence in Graphical Models

In probabilistic graphical models and causal inference, chi-square tests help:

  • Test conditional independence assumptions
  • Validate Bayesian network structure
  • Identify spurious correlations vs. causal relationships

Related Distributions

| Distribution | Relationship to Chi-Square |
|---|---|
| Gamma(k/2, 1/2) | Chi-square IS Gamma with these parameters |
| Exponential(1/2) | Chi-square with k=2 is Exponential |
| F-distribution | Ratio of two independent chi-squares, each divided by its df |
| t-distribution | Standard normal divided by sqrt(chi-square/k), with the two independent |
| Normal (large k) | Chi-square ≈ Normal(k, 2k) by CLT |
| Noncentral χ² | Sum of squared independent unit-variance normals with nonzero means |

The Statistical Trinity

Chi-square, t, and F distributions form the backbone of classical statistics:

  • Chi-square: Variance estimation, goodness-of-fit
  • t-distribution: Mean estimation (unknown variance)
  • F-distribution: Comparing variances, ANOVA

All three are intimately connected through the Normal distribution.
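The constructions of t and F from chi-square variables can be checked by simulation. This is a minimal sketch assuming NumPy and SciPy; the degrees of freedom, sample size, and seed are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n = 200_000

# t with k df: standard normal divided by sqrt(independent chi-square / k).
k = 5
z = rng.standard_normal(n)
v = rng.chisquare(df=k, size=n)
t_samples = z / np.sqrt(v / k)

# F with (d1, d2) df: ratio of independent chi-squares, each divided by its df.
d1, d2 = 3, 10
u1 = rng.chisquare(df=d1, size=n)
u2 = rng.chisquare(df=d2, size=n)
f_samples = (u1 / d1) / (u2 / d2)

# Compare an upper quantile of each simulated sample with the theoretical value.
t_q_emp = float(np.quantile(t_samples, 0.975))
t_q_theory = float(stats.t.ppf(0.975, df=k))
f_q_emp = float(np.quantile(f_samples, 0.95))
f_q_theory = float(stats.f.ppf(0.95, dfn=d1, dfd=d2))

print(f"t(5) 97.5% quantile: simulated {t_q_emp:.3f} vs. theory {t_q_theory:.3f}")
print(f"F(3, 10) 95% quantile: simulated {f_q_emp:.3f} vs. theory {f_q_theory:.3f}")
```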


Python Implementation

Basic Operations with SciPy

chi_square_basics.py

import numpy as np
from scipy import stats

# Create chi-square distribution with k=5 degrees of freedom
k = 5
chi2_dist = stats.chi2(df=k)

# PDF, CDF, and upper-tail probability (sf = 1 - cdf, computed more accurately)
x = 5
print(f"PDF at x={x}: {chi2_dist.pdf(x):.4f}")
print(f"CDF at x={x}: {chi2_dist.cdf(x):.4f}")
print(f"P(X > {x}): {chi2_dist.sf(x):.4f}")

# Critical values for common significance levels
print(f"95th percentile: {chi2_dist.ppf(0.95):.4f}")
print(f"99th percentile: {chi2_dist.ppf(0.99):.4f}")

# Summary statistics
print(f"Mean: {chi2_dist.mean():.4f}")  # Should equal k
print(f"Variance: {chi2_dist.var():.4f}")  # Should equal 2k

# Generate random samples
samples = chi2_dist.rvs(size=10000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")

Goodness-of-Fit Test

goodness_of_fit.py

import numpy as np
from scipy import stats

# Example: Testing if a die is fair
# Observed counts from 600 rolls
observed = np.array([90, 105, 115, 95, 85, 110])

# Expected counts if fair (600/6 = 100 each)
expected = np.array([100, 100, 100, 100, 100, 100])

# Method 1: Using scipy.stats.chisquare
stat, pvalue = stats.chisquare(observed, f_exp=expected)
print(f"Chi-square statistic: {stat:.4f}")
print(f"P-value: {pvalue:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")

# Method 2: Manual calculation (sf gives the upper-tail probability)
chi_sq_manual = np.sum((observed - expected)**2 / expected)
df = len(observed) - 1
pvalue_manual = stats.chi2.sf(chi_sq_manual, df)
print(f"Manual calculation: χ² = {chi_sq_manual:.4f}, p = {pvalue_manual:.4f}")

# Decision at α = 0.05
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df)
print(f"Critical value at α={alpha}: {critical_value:.4f}")
print(f"Reject H₀: {stat > critical_value}")

Test of Independence

independence_test.py

import numpy as np
from scipy import stats

# Contingency table: Gender vs. Product Preference
# Rows: Male, Female
# Columns: Product A, Product B, Product C
contingency_table = np.array([
    [45, 35, 20],   # Male
    [30, 50, 25]    # Female
])

# Perform chi-square test of independence
stat, pvalue, dof, expected = stats.chi2_contingency(contingency_table)

print("Observed frequencies:")
print(contingency_table)
print("\nExpected frequencies (under independence):")
print(expected.round(2))
print(f"\nChi-square statistic: {stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {pvalue:.4f}")

# Calculate contributions from each cell
contributions = (contingency_table - expected)**2 / expected
print("\nContributions to chi-square:")
print(contributions.round(4))

Feature Selection with Chi-Square

feature_selection_ml.py

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, SelectKBest
import numpy as np

# Load data. The iris features are continuous measurements, not counts,
# so the scores here are illustrative; binning first gives a stricter test.
iris = load_iris()
X = iris.data
y = iris.target

# Chi-square requires non-negative features.
# Shift each feature so its minimum is 0. Iris values are already positive,
# so this step matters only for data with negative values; it does change the scores.
X_positive = X - X.min(axis=0)

# Calculate chi-square scores for each feature
scores, pvalues = chi2(X_positive, y)

print("Feature chi-square scores:")
for name, score, pval in zip(iris.feature_names, scores, pvalues):
    print(f"  {name}: χ² = {score:.2f}, p = {pval:.4f}")

# Select top k features
k = 2
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_positive, y)

# Which features were selected?
selected_mask = selector.get_support()
selected_features = np.array(iris.feature_names)[selected_mask]
print(f"\nSelected features: {selected_features}")

Common Pitfalls

Expected Frequency Rule

Expected frequency should be ≥ 5 in each cell. With small expected frequencies, the chi-square approximation breaks down. Solutions:

  • Combine categories to increase expected counts
  • Use Fisher's exact test for small samples
  • Use Yates' continuity correction for 2×2 tables

Independence Assumption

Each observation must be independent. If the same person can appear multiple times or observations are clustered, chi-square is invalid. Consider:

  • Clustered sampling methods
  • Time series with autocorrelation
  • Repeated measures designs

Categorical Data Only

Chi-square tests are for categorical data. For continuous data:

  • Bin into categories first (lose information)
  • Use Kolmogorov-Smirnov test instead
  • Use appropriate parametric tests

One-Tailed Nature

Chi-square goodness-of-fit and independence tests are right-tailed: large chi-square values indicate poor fit. Unlike a two-sided t-test, you do not reject for values in the left tail; a chi-square value of 0 means the observed counts match the expected counts exactly.


Test Your Understanding

Chi-Square Quiz (interactive, 7 questions)

Question 1: If X ~ χ²(10), what is E[X]?


Summary

The chi-square distribution is foundational to statistical testing. It naturally arises from squared standard normal variables and provides the basis for testing goodness-of-fit, independence, and variance.

Key Formulas

| Property | Formula |
|---|---|
| Definition | χ²_k = Z₁² + Z₂² + ... + Z_k² where Z_i ~ N(0,1) |
| PDF | f(x) = x^(k/2-1) e^(-x/2) / [2^(k/2) Γ(k/2)] |
| Mean | E[X] = k |
| Variance | Var(X) = 2k |
| Mode | k - 2 (for k ≥ 2) |
| Test Statistic | χ² = Σ(O_i - E_i)² / E_i |
| Gamma Form | χ²_k = Gamma(k/2, 1/2) |

Key Takeaways

  1. Sum of squared normals: Chi-square arises naturally from squaring and summing independent standard normal variables
  2. Degrees of freedom = k: This single parameter controls shape, mean, and variance
  3. Special case of Gamma: $\chi^2_k = \text{Gamma}(k/2, 1/2)$
  4. Goodness-of-fit: Test if observed data matches expected distribution
  5. Independence test: Test if categorical variables are related
  6. ML feature selection: Chi-square score ranks categorical features by predictive power
  7. Foundation for other tests: the t and F distributions behind t-tests, F-tests, and ANOVA are built from chi-square variables
The Essence of Chi-Square:
"How surprising is this data given what we expected?"
From testing fair dice to selecting ML features—chi-square quantifies the gap between expectation and reality.
Coming Next: In the next section, we'll explore the Student's t-distribution—essential for inference about means when the population variance is unknown.