Chapter 15

Chi-Square Tests

Common Statistical Tests

Learning Objectives

By the end of this section, you will be able to:

Core Knowledge

  • Understand why chi-square tests work for categorical data
  • Compute chi-square statistics for goodness-of-fit tests
  • Construct and analyze contingency tables
  • Calculate expected frequencies under H₀
  • Determine correct degrees of freedom

Practical Skills

  • Apply goodness-of-fit tests for distribution testing
  • Use independence tests to find relationships between categorical variables
  • Interpret chi-square test results correctly
  • Implement chi-square tests in Python

AI/ML Applications

  • Feature Selection - Select categorical features significantly associated with the target
  • Data Drift Detection - Detect when categorical feature distributions change in production
  • A/B Testing - Compare conversion rates across treatment groups
  • Model Evaluation - Test if model predictions differ from random chance
  • Fairness Analysis - Test for disparate treatment across demographic groups

Where You'll Apply This: Chi-square tests are essential whenever you work with categorical data, from analyzing user segments and survey responses to validating ML models and detecting dataset shift in production systems.

The Big Picture: Karl Pearson's Legacy

The year is 1900. Karl Pearson, a mathematician at University College London, faces a fundamental question: How do you test whether observed data follows a theoretical distribution when the data is categorical?

Karl Pearson's Insight

Pearson realized that when we observe counts falling into categories, we can measure how far these observed counts deviate from what we would expect if some null hypothesis were true. The key insight: the sum of squared, standardized deviations follows a chi-square distribution.

The Problem That Needed Solving

Before Pearson's chi-square test, scientists had no rigorous method to answer questions like:

  • Is this die fair? (Do all faces appear equally often?)
  • Does this genetic data follow Mendelian ratios? (The 3:1 ratio)
  • Is there a relationship between smoking and lung disease?
  • Do customers prefer one product variant over another?

What makes categorical data special is that we cannot compute means and standard deviations in the usual sense. Instead, we work with counts (frequencies) and proportions. Pearson's brilliant solution was to construct a statistic that measures the discrepancy between observed and expected counts.

The Fundamental Question

"Given observed categorical data, are the differences from expected counts too large to be due to chance?"


The Chi-Square Distribution Connection

Before diving into the tests, let's understand why the chi-square distribution appears here. This connection reveals the beautiful mathematical foundation underlying these tests.

The Mathematical Foundation

If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then:

$$\sum_{i=1}^{k} Z_i^2 \sim \chi^2(k)$$

The sum of squared standard normals follows a chi-square distribution with k degrees of freedom.

The connection to categorical data: When sample sizes are large enough, the standardized differences between observed and expected counts are approximately normally distributed. When we square and sum these, we get a chi-square distribution!

Why This Matters

  1. Under H₀: If the null hypothesis is true, observed counts should be close to expected counts.
  2. Standardized residuals: $\frac{O_i - E_i}{\sqrt{E_i}}$ are approximately standard normal for each category.
  3. Sum of squares: Adding up all squared standardized residuals gives us a chi-square statistic.
  4. Large values = evidence against H₀: The further our data deviates, the larger the statistic.
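
The four steps above can be checked empirically. The following is a minimal simulation sketch, assuming a fair six-sided die and 600 rolls per experiment (both are illustrative choices, not taken from the text). It draws counts under H₀, computes the sum of squared standardized residuals for each experiment, and compares the results with what a chi-square distribution with k - 1 = 5 degrees of freedom predicts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
k, n, n_sims = 6, 600, 20_000        # illustrative choices: fair die, 600 rolls
probs = np.full(k, 1 / k)

# Draw n_sims sets of counts under H0 (each row sums to n)
counts = rng.multinomial(n, probs, size=n_sims)
expected = n * probs

# Sum of squared standardized residuals for every simulated experiment
sim_stats = ((counts - expected) ** 2 / expected).sum(axis=1)

# Under H0 the statistic should behave like chi-square with k - 1 = 5 df
critical = stats.chi2.ppf(0.95, df=k - 1)
sim_mean = sim_stats.mean()
reject_rate = (sim_stats > critical).mean()

print(f"mean of simulated statistics: {sim_mean:.3f} (theory: {k - 1})")
print(f"share above the 5% critical value: {reject_rate:.3f} (theory: 0.05)")
```

The simulated mean should land close to 5 and the rejection share close to 0.05, which is what the chi-square approximation predicts when expected counts are large.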

Interactive: Chi-Square Distribution

Explore how the chi-square distribution changes with different degrees of freedom. Notice how it becomes more symmetric as df increases, approaching a normal distribution.

[Interactive chart: Chi-Square Distribution Explorer, showing the χ²(5) density f(x) against x with the mean (μ = 5), the mode, and the α = 0.05 critical value marked]

For χ²(5), the distribution is right-skewed with an interior mode:

  • Mean: 5
  • Variance: 10
  • Std Dev: 3.162
  • Mode: 3
  • Skewness: 1.2649
  • Notation: X ~ χ²(5)
  • As Gamma: Gamma(5/2, 1/2) = Gamma(2.5, 0.5) (shape, rate)

Critical Values for χ²(5)

  • α = 0.05: 11.070
  • α = 0.01: 15.086
  • α = 0.001: 20.515

Reject H₀ if your test statistic exceeds the critical value for your chosen α.

The PDF shows the probability density. For k = 1, 2, the mode is at 0. For k ≥ 3, the mode is at k - 2.

PDF Formula

$$f(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2} \, \Gamma(k/2)}, \quad x > 0$$

For k = 5:

$$f(x) = \frac{x^{1.5} e^{-x/2}}{2^{2.5} \, \Gamma(2.5)}$$
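
The χ²(5) summary statistics and critical values shown in the explorer can be reproduced with SciPy. This is a small sketch using `scipy.stats.chi2` and `scipy.stats.gamma`; the variable names are illustrative.

```python
from scipy import stats

dist = stats.chi2(df=5)

# Mean, variance and skewness of chi-square(5)
mean, var, skew = dist.stats(moments="mvs")
mode = 5 - 2                      # mode of chi-square(k) is k - 2 for k >= 3

# Upper-tail critical values: reject H0 when the statistic exceeds them
crit = {alpha: dist.ppf(1 - alpha) for alpha in (0.05, 0.01, 0.001)}

# chi-square(k) is a Gamma distribution with shape k/2 and rate 1/2 (scale 2)
gamma_equiv = stats.gamma(a=2.5, scale=2)

print(float(mean), float(var), float(skew))
for alpha, value in crit.items():
    print(f"alpha = {alpha}: critical value = {value:.3f}")
print(dist.pdf(3.0), gamma_equiv.pdf(3.0))   # the two densities agree
```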

Goodness-of-Fit Test

The goodness-of-fit test answers the question: "Does my observed data follow a hypothesized distribution?" This is the original chi-square test developed by Pearson.

Null Hypothesis (H₀)

The observed data follows the specified theoretical distribution.

e.g., "This die is fair" or "The data follows a Poisson distribution"

Alternative Hypothesis (H₁)

The observed data does not follow the specified distribution.

e.g., "This die is biased" or "The data does not follow Poisson"

The Chi-Square Statistic

The test statistic measures how far the observed frequencies ($O_i$) are from the expected frequencies ($E_i$) under H₀:

Pearson's Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

  • $O_i$: Observed count in category i
  • $E_i$: Expected count in category i
  • $k$: Number of categories

Intuition behind the formula:

  • $(O_i - E_i)^2$: Squared differences measure deviation from expectation. Squaring ensures positive and negative deviations don't cancel.
  • Dividing by $E_i$: Normalizes the contribution. A difference of 10 matters more when $E_i = 20$ than when $E_i = 1000$.
  • Summing across categories: Aggregates evidence from all categories into a single measure.

The Residual Perspective: Each term $\frac{(O_i - E_i)^2}{E_i}$ is the squared Pearson residual. Large residuals indicate categories where the fit is poor.
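
To make the formula concrete, here is a short hand computation. The counts are hypothetical (400 offspring tested against a 3:1 ratio) and are not taken from the text; the cross-check uses `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (not from the text): 400 offspring, 3:1 ratio expected
observed = np.array([290, 110])
expected = observed.sum() * np.array([0.75, 0.25])     # [300, 100]

contributions = (observed - expected) ** 2 / expected  # per-category terms
chi2_stat = contributions.sum()
df = len(observed) - 1
p_value = stats.chi2.sf(chi2_stat, df)                 # upper-tail probability

# Cross-check against SciPy's built-in goodness-of-fit test
scipy_stat, scipy_p = stats.chisquare(observed, f_exp=expected)

print(contributions)
print(f"chi2 = {chi2_stat:.4f}, df = {df}, p = {p_value:.4f}")
```

The same deviation of 10 contributes three times as much in the category with the smaller expected count, which is the normalization effect described above.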

Degrees of Freedom

For a goodness-of-fit test with k categories and p estimated parameters:

$$\text{df} = k - 1 - p$$

| Situation | k | p | df |
|---|---|---|---|
| Fully specified distribution (e.g., fair die) | 6 | 0 | 5 |
| Binomial with known n and p | n+1 | 0 | n |
| Poisson with λ estimated from data | k | 1 | k-2 |
| Normal with μ and σ estimated | k | 2 | k-3 |

Why subtract 1? Because the observed counts must sum to n (the sample size), one count is determined by the others. This constraint costs us one degree of freedom.
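
When a parameter is estimated from the data, `scipy.stats.chisquare` can account for it through its `ddof` argument. The sketch below uses hypothetical binned counts and an assumed estimate `lam_hat` (neither comes from the text) to show the Poisson row of the table, where p = 1.

```python
import numpy as np
from scipy import stats

# Hypothetical data: number of events per interval, binned as 0, 1, 2, 3+
observed = np.array([30, 40, 20, 10])
n = observed.sum()
lam_hat = 1.13   # assumed to have been estimated from the raw (unbinned) data

# Expected counts under Poisson(lam_hat); the last bin collects the upper tail
probs = np.append(stats.poisson.pmf([0, 1, 2], lam_hat),
                  stats.poisson.sf(2, lam_hat))
expected = n * probs

# ddof = p = 1 tells SciPy that one parameter was estimated
chi2_stat, p_value = stats.chisquare(observed, f_exp=expected, ddof=1)

k, p = len(observed), 1
df = k - 1 - p            # 4 - 1 - 1 = 2
manual_p = stats.chi2.sf(chi2_stat, df)

print(f"df = {df}, chi2 = {chi2_stat:.3f}, p = {p_value:.4f}")
```

Merging the upper tail into a single "3+" bin keeps every expected count above 5 here; the number of bins after merging is the k that enters the formula.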

Interactive: Goodness-of-Fit Test

Roll a die (fair or biased) and perform a chi-square goodness-of-fit test to determine if the die is fair. Watch how the test statistic and p-value change with different outcomes.

[Interactive widget: Goodness-of-Fit Test: Is This Die Fair? Default state shown below: 600 rolls with 100 on each face, α = 0.05]

| Face | Observed (O) | Expected (E) | O - E | (O - E)² / E |
|---|---|---|---|---|
| Face 1 | 100 | 100.0 | 0.0 | 0.000 |
| Face 2 | 100 | 100.0 | 0.0 | 0.000 |
| Face 3 | 100 | 100.0 | 0.0 | 0.000 |
| Face 4 | 100 | 100.0 | 0.0 | 0.000 |
| Face 5 | 100 | 100.0 | 0.0 | 0.000 |
| Face 6 | 100 | 100.0 | 0.0 | 0.000 |

χ² Statistic = 0.000

Result: Fail to Reject H₀

  • χ² Statistic: 0.0000
  • Degrees of Freedom: 5 (6 faces - 1)
  • P-value: 1.0000
  • Critical Value (α = 0.05): 11.07

No significant evidence the die is unfair (p ≥ 0.05).

How to Interpret

  • H₀ (Null Hypothesis): The die is fair (all faces equally likely)
  • H₁ (Alternative): The die is not fair
  • Large χ²: Observed data differs significantly from expected
  • Small p-value: Unlikely to see this data if die were fair

Test of Independence

The test of independence determines whether two categorical variables are related or independent. This is perhaps the most widely used chi-square test in practice.

The Question We're Asking

"Is there a relationship between Variable A and Variable B, or are they independent?"

H₀: Variables are independent

P(A and B) = P(A) × P(B)

H₁: Variables are associated

P(A and B) ≠ P(A) × P(B)

Contingency Tables

Data for independence tests is organized in a contingency table (also called a cross-tabulation or two-way table). Each cell shows the count of observations falling into that combination of categories.

Example: Customer Satisfaction by Product Type

| | Satisfied | Neutral | Dissatisfied | Row Total |
|---|---|---|---|---|
| Product A | 150 | 40 | 10 | 200 |
| Product B | 90 | 60 | 50 | 200 |
| Column Total | 240 | 100 | 60 | 400 |

Is satisfaction level independent of product type?

Computing Expected Frequencies

Under the null hypothesis of independence, the expected count for each cell is:

Expected Frequency Formula

$$E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{n}$$

Intuition: If the variables are truly independent, the proportion in each cell should be the product of the marginal proportions. The expected count is this product times the total sample size.

Expected Frequencies (Under Independence)

| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | (200×240)/400 = 120 | (200×100)/400 = 50 | (200×60)/400 = 30 |
| Product B | (200×240)/400 = 120 | (200×100)/400 = 50 | (200×60)/400 = 30 |

The degrees of freedom for a contingency table with r rows and c columns:

$$\text{df} = (r - 1)(c - 1)$$

For our 2×3 table: df = (2-1)(3-1) = 1×2 = 2
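
The expected-frequency formula is an outer product of the row and column totals divided by n. As a sketch, the computation for the satisfaction table can be written with NumPy and cross-checked against `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy import stats

# Observed counts from the satisfaction example (rows: Product A, Product B)
observed = np.array([[150, 40, 10],
                     [90, 60, 50]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# E_ij = (row i total) * (column j total) / n
expected = np.outer(row_totals, col_totals) / n

chi2_stat = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, df)

# Cross-check with SciPy (no continuity correction is applied when df > 1)
scipy_stat, scipy_p, scipy_df, scipy_expected = stats.chi2_contingency(observed)

print(expected)
print(f"chi2 = {chi2_stat:.3f}, df = {df}, p = {p_value:.2e}")
```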

Interactive: Contingency Table Explorer

Build your own contingency table and see how the chi-square statistic changes. Observe how deviations from expected counts contribute to the test statistic.

[Interactive widget: Contingency Table Explorer, preloaded with the customer satisfaction table (row totals 200 and 200; column totals 240, 100, 60; n = 400). Cell colors show standardized residuals: green = higher than expected, red = lower than expected. Cells with an expected value below 5 are highlighted in amber.]

Expected Frequencies (E) under Independence

| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | 120.0 | 50.0 | 30.0 |
| Product B | 120.0 | 50.0 | 30.0 |

Chi-Square Contributions: (O - E)² / E

| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | 7.500 | 2.000 | 13.333 |
| Product B | 7.500 | 2.000 | 13.333 |

Total χ² = 45.667

Result: Reject H₀ (the variables are associated)

  • Chi-Square Statistic: 45.6667
  • Degrees of Freedom: 2 = (2-1) × (3-1)
  • P-value: < 0.0001
  • Cramér's V: 0.338 (medium effect)
  • Min Expected: 30.0
  • Sample Size: 400

How to Interpret

  • H₀: Row and column variables are independent
  • H₁: Row and column variables are associated
  • Large residuals: Cells that contribute most to dependence
  • Cramér's V: Effect size measure (0 = no association, 1 = perfect association)
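
The residuals and the effect size reported for the satisfaction table can be computed in a few lines. This is a sketch; `cramers_v` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    table = np.asarray(table)
    # correction=False gives the plain Pearson statistic (no Yates correction)
    chi2_stat = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return np.sqrt(chi2_stat / (n * (min(table.shape) - 1)))

observed = np.array([[150, 40, 10],
                     [90, 60, 50]])
expected = expected_freq(observed)

# Pearson residuals: positive = more than expected, negative = fewer
pearson_residuals = (observed - expected) / np.sqrt(expected)
v = cramers_v(observed)

print(np.round(pearson_residuals, 2))
print(f"Cramér's V = {v:.3f}")
```

The largest residuals sit in the Dissatisfied column, which matches the cells with the largest chi-square contributions.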

Interactive: Independence Test Simulator

Simulate data from independent or associated variables and see how well the chi-square test detects the relationship. Explore how sample size affects the test's power.

[Interactive widget: Independence Test Simulator. Each simulated χ² statistic is plotted against the chi-square distribution with df = 1; the critical value at α = 0.05 is 3.84.]

Understanding the Simulation

  • Association = 0: Variables are independent. Rejection rate should be close to α (the Type I error rate)
  • Association > 0: Variables are associated. Higher rejection rate = test has power to detect it
  • Sample size matters: Larger samples give more power to detect weak associations
  • Dots on axis: Each simulation's test statistic (green = fail to reject, red = reject)
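
The behaviour described above can be reproduced offline. The sketch below is one possible version of such a simulation, assuming two binary variables, n = 200 per experiment, and conditional probabilities of 0.4 versus 0.6 for the associated case (all illustrative settings, not taken from the widget).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims, alpha = 200, 2000, 0.05       # illustrative settings

def rejection_rate(p_b_given_a0, p_b_given_a1):
    """Share of simulated 2x2 tables for which the test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.random(n) < 0.5                          # variable A
        p_b = np.where(a, p_b_given_a1, p_b_given_a0)    # P(B = 1 | A)
        b = rng.random(n) < p_b                          # variable B
        table = np.array([[np.sum(~a & ~b), np.sum(~a & b)],
                          [np.sum(a & ~b), np.sum(a & b)]])
        # Plain Pearson test (df = 1), without Yates's correction
        p_value = stats.chi2_contingency(table, correction=False)[1]
        rejections += p_value < alpha
    return rejections / n_sims

type_i_rate = rejection_rate(0.5, 0.5)   # independent: B ignores A
power = rejection_rate(0.4, 0.6)         # associated: B depends on A

print(f"rejection rate under independence: {type_i_rate:.3f}")
print(f"rejection rate under association:  {power:.3f}")
```

Under independence the rejection rate should sit near α, while under association it estimates the power of the test for that effect and sample size.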

Test of Homogeneity

The test of homogeneity is mathematically identical to the test of independence, but the research question is different.

Test of Independence

One sample is drawn, and two variables are observed on each unit.

Question: Are the variables associated?

Example: Survey 500 people, record their gender and voting preference.

Test of Homogeneity

Multiple populations are sampled separately, and one variable is observed.

Question: Do the populations have the same distribution?

Example: Sample 200 men and 300 women separately, ask about voting preference.
Key Insight: The calculation is exactly the same for both tests. The difference is in how the data was collected and how we interpret the results. In practice, you'll often see "chi-square test" used for both without distinguishing between them.

Choosing the Right Test

| Test | Purpose | Data Structure | Example |
|---|---|---|---|
| Goodness-of-Fit | Does data follow a specific distribution? | One categorical variable, k categories | Is this die fair? Do births follow a uniform distribution across weekdays? |
| Independence | Are two variables associated? | Two categorical variables, r×c table | Is customer satisfaction related to product type? |
| Homogeneity | Do populations have the same distribution? | Multiple samples, one categorical variable | Do different age groups have the same preferences? |

Assumptions and Validity Conditions

Chi-square tests are approximate tests. The approximation to the chi-square distribution requires certain conditions to be met.

Condition 1: Expected Frequencies Large Enough

Rule of thumb: All expected frequencies should be ≥ 5.

Some texts relax this: up to 20% of cells may have expected counts below 5, provided every expected count is at least 1.

Condition 2: Independent Observations

Each observation should be independent. No individual should contribute to multiple cells.

Condition 3: Random Sampling

Data should come from random sampling or a randomized experiment.

What to do when assumptions are violated:
  • Expected counts too small: Combine categories or use Fisher's exact test
  • 2×2 table with small n: Use Yates's continuity correction or Fisher's exact test
  • Paired/matched data: Use McNemar's test instead
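
One way to apply the first two remedies in code is to check the expected counts before choosing a test. The helper below, `association_test`, is a hypothetical function written for illustration; its fallback only covers 2×2 tables, and the threshold of 5 is the rule of thumb from Condition 1.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

def association_test(table, min_expected=5.0):
    """Chi-square test when expected counts allow it, Fisher's exact otherwise."""
    table = np.asarray(table)
    expected = expected_freq(table)
    if expected.min() >= min_expected:
        # For 2x2 tables SciPy applies Yates's correction by default
        _, p_value, _, _ = stats.chi2_contingency(table)
        return "chi-square", p_value
    if table.shape == (2, 2):
        _, p_value = stats.fisher_exact(table)
        return "fisher", p_value
    raise ValueError("Expected counts too small: combine categories first")

small = [[8, 2], [1, 5]]               # some expected counts fall below 5
large = [[120, 880], [150, 850]]       # all expected counts are large

method_small, p_small = association_test(small)
method_large, p_large = association_test(large)

print(method_small, round(p_small, 4))
print(method_large, round(p_large, 4))
```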

Worked Examples


Applications in AI/ML

Chi-square tests are fundamental tools in machine learning pipelines. Here are the key applications every ML engineer should know:

Feature Selection

The chi-square test measures the dependence between each categorical feature and the target. Features with high χ² values (low p-values) are more strongly associated with the target and are therefore candidates to keep. sklearn.feature_selection.SelectKBest with the chi2 score function is a common implementation; it expects non-negative features such as counts or one-hot encodings.

Data Drift Detection

Compare training and production distributions for categorical features. Significant differences indicate covariate shift, which can degrade model performance. This is critical for ML monitoring in production.

Fairness Analysis

Test whether model predictions or decisions are independent of protected attributes (gender, race, age). A significant chi-square indicates potential disparate impact.

A/B Testing

Compare conversion rates, click-through rates, or categorical outcomes across treatment groups. The chi-square test is a standard method for categorical A/B test analysis when sample sizes are large.

Interactive: Chi-Square Feature Selection

See how the chi-square test ranks features by their association with the target variable. Features with higher chi-square scores (lower p-values) are more predictive.

[Interactive widget: Chi-Square Feature Selection for ML, showing a bar chart of χ² scores per feature]

How Chi-Square Feature Selection Works:

For each categorical feature, we test whether it's independent of the target variable. Features with high chi-square scores (low p-values) are statistically associated with the target and are more likely to be useful for prediction.

Dataset: 500 samples | Target: purchased vs not_purchased

Feature Rankings by Chi-Square Score

| Rank | Feature | Description | χ² Score | p-value | df | Actually Predictive? | Selected? |
|---|---|---|---|---|---|---|---|
| #1 | income_level | Income bracket | 42.744 | < 0.001 | 2 | Yes | Selected |
| #2 | age_group | Customer age group | 18.888 | < 0.001 | 2 | Yes | Selected |
| #3 | device | Device type | 9.210 | 0.0100 | 2 | Yes | Selected |
| #4 | signup_day | Day of week signed up | 1.688 | 0.1938 | 1 | No | - |
| #5 | browser | Web browser used | 1.177 | 0.7586 | 3 | No | - |

Selection Accuracy

Selected 3 out of 3 truly predictive features.

In the interactive version, increasing the sample size lets chi-square better identify the truly predictive features (age_group, income_level, device).
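
A ranking like the one above can be built by running a test of independence between each feature and the target. The sketch below uses synthetic data made up for illustration (one feature that depends on the target and one that does not); `chi2_score` is a hypothetical helper, and the numbers it produces are not the ones shown in the table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Synthetic data, made up for illustration
target = rng.integers(0, 2, size=n)                 # purchased or not
# "informative" has a level distribution that depends on the target
informative = np.array([
    rng.choice(3, p=[0.2, 0.3, 0.5] if t else [0.5, 0.3, 0.2]) for t in target
])
noise = rng.integers(0, 3, size=n)                  # unrelated to the target

def chi2_score(feature, target):
    """Chi-square test of independence between one feature and the target."""
    levels, f_idx = np.unique(feature, return_inverse=True)
    classes, t_idx = np.unique(target, return_inverse=True)
    table = np.zeros((len(levels), len(classes)), dtype=int)
    np.add.at(table, (f_idx, t_idx), 1)             # cross-tabulate the columns
    stat, p_value, dof, _ = stats.chi2_contingency(table, correction=False)
    return stat, p_value, dof

features = {"informative": informative, "noise": noise}
ranking = sorted(
    ((name, *chi2_score(col, target)) for name, col in features.items()),
    key=lambda row: row[1],
    reverse=True,                                   # highest chi-square first
)

for name, stat, p_value, dof in ranking:
    print(f"{name:12s} chi2 = {stat:7.3f}  p = {p_value:.4f}  df = {dof}")
```

Ranking by p-value instead of the raw score is safer when features have different numbers of levels, because the degrees of freedom then differ between features.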

sklearn Equivalent

```python
from sklearn.feature_selection import chi2, SelectKBest

# X must contain non-negative values (e.g., one-hot encoded categories or counts)
# Select top 3 features by chi-square score
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)

# Selected features: income_level, age_group, device
```

Python Implementation

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import SelectKBest, chi2

# ============================================
# 1. Goodness-of-Fit Test
# ============================================

def chi_square_goodness_of_fit(observed, expected_ratios=None):
    """
    Perform a chi-square goodness-of-fit test.

    Parameters
    ----------
    observed : array-like
        Observed frequencies in each category
    expected_ratios : array-like, optional
        Expected proportions for each category.
        If None, assumes uniform distribution.

    Returns
    -------
    dict with test results
    """
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    k = len(observed)

    if expected_ratios is None:
        expected = np.full(k, n / k)  # Uniform
    else:
        expected_ratios = np.asarray(expected_ratios, dtype=float)
        expected = expected_ratios * n / expected_ratios.sum()

    # Chi-square statistic
    chi2_stat = np.sum((observed - expected) ** 2 / expected)
    df = k - 1  # assumes no parameters were estimated from the data
    # sf (upper tail) is more accurate than 1 - cdf for very small p-values
    p_value = stats.chi2.sf(chi2_stat, df)

    return {
        'chi2_statistic': chi2_stat,
        'degrees_of_freedom': df,
        'p_value': p_value,
        'observed': observed,
        'expected': expected,
        'residuals': (observed - expected) / np.sqrt(expected)
    }


# Example: Testing if a die is fair
die_rolls = [95, 108, 102, 97, 105, 93]  # 600 total rolls
result = chi_square_goodness_of_fit(die_rolls)
print(f"Die fairness test: χ² = {result['chi2_statistic']:.3f}, p = {result['p_value']:.4f}")


# ============================================
# 2. Test of Independence
# ============================================

def chi_square_independence_test(contingency_table):
    """
    Perform chi-square test of independence on a contingency table.

    Parameters
    ----------
    contingency_table : 2D array-like
        The contingency table with observed counts

    Returns
    -------
    dict with test results
    """
    table = np.asarray(contingency_table)
    # Note: for 2x2 tables SciPy applies Yates's continuity correction by
    # default; pass correction=False for the plain Pearson statistic.
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

    # Pearson residuals: (O - E) / sqrt(E)
    residuals = (table - expected) / np.sqrt(expected)

    return {
        'chi2_statistic': chi2_stat,
        'degrees_of_freedom': dof,
        'p_value': p_value,
        'observed': table,
        'expected': expected,
        'standardized_residuals': residuals
    }


# Example: A/B test for conversion rates
ab_test_data = [
    [120, 880],   # Design A: [converted, not converted]
    [150, 850]    # Design B: [converted, not converted]
]
result = chi_square_independence_test(ab_test_data)
print(f"A/B test: χ² = {result['chi2_statistic']:.3f}, p = {result['p_value']:.4f}")


# ============================================
# 3. Feature Selection with Chi-Square
# ============================================

def chi_square_feature_selection(X, y, k='all'):
    """
    Select top k features using chi-square test.

    Parameters
    ----------
    X : DataFrame or 2D array
        Feature matrix (counts or non-negative values)
    y : array-like
        Target labels
    k : int or 'all'
        Number of features to select

    Returns
    -------
    DataFrame with feature scores and rankings
    """
    if k == 'all':
        k = X.shape[1]

    selector = SelectKBest(chi2, k=k)
    selector.fit(X, y)

    if hasattr(X, 'columns'):
        feature_names = X.columns
    else:
        feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    scores = selector.scores_
    p_values = selector.pvalues_

    results = pd.DataFrame({
        'feature': feature_names,
        'chi2_score': scores,
        'p_value': p_values
    }).sort_values('chi2_score', ascending=False)

    return results


# ============================================
# 4. Data Drift Detection
# ============================================

def detect_categorical_drift(
    reference_data: np.ndarray,
    production_data: np.ndarray,
    alpha: float = 0.05
) -> dict:
    """
    Detect drift in categorical feature distributions.

    Parameters
    ----------
    reference_data : array-like
        Categorical values from training/reference data
    production_data : array-like
        Categorical values from production data
    alpha : float
        Significance level for drift detection

    Returns
    -------
    dict with drift detection results
    """
    reference_data = np.asarray(reference_data)
    production_data = np.asarray(production_data)

    # Get unique categories across both datasets
    all_categories = np.unique(np.concatenate([reference_data, production_data]))

    # Count frequencies
    ref_counts = np.array([np.sum(reference_data == cat) for cat in all_categories])
    prod_counts = np.array([np.sum(production_data == cat) for cat in all_categories])

    # Build contingency table (rows: reference, production)
    contingency = np.vstack([ref_counts, prod_counts])

    # Chi-square test of homogeneity
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)

    # Effect size (Cramér's V)
    n = contingency.sum()
    min_dim = min(contingency.shape) - 1
    cramers_v = np.sqrt(chi2_stat / (n * min_dim)) if min_dim > 0 else 0

    return {
        'chi2_statistic': chi2_stat,
        'p_value': p_value,
        'drift_detected': p_value < alpha,
        'cramers_v': cramers_v,  # Effect size
        'categories': all_categories,
        'reference_distribution': ref_counts / ref_counts.sum(),
        'production_distribution': prod_counts / prod_counts.sum()
    }


# Example: Drift detection (seeded so the output is reproducible)
rng = np.random.default_rng(42)
training_devices = rng.choice(['mobile', 'desktop', 'tablet'],
                              size=1000, p=[0.5, 0.4, 0.1])
production_devices = rng.choice(['mobile', 'desktop', 'tablet'],
                                size=1000, p=[0.65, 0.3, 0.05])
drift_result = detect_categorical_drift(training_devices, production_devices)
print(f"Drift detected: {drift_result['drift_detected']}, p = {drift_result['p_value']:.4f}")


# ============================================
# 5. Quick SciPy Examples
# ============================================

# Goodness-of-fit (using scipy directly)
observed = [16, 18, 16, 14, 12, 12]  # Die rolls (n = 88)
# Expected counts must sum to the observed total, or SciPy raises an error
expected = [sum(observed) / 6] * 6   # Fair die expectation
# Use a name other than chi2 so the sklearn import above is not shadowed
chi2_stat, p = stats.chisquare(observed, f_exp=expected)
print(f"Goodness-of-fit: χ² = {chi2_stat:.3f}, p = {p:.4f}")

# Independence test
table = [[50, 30, 20],
         [45, 35, 20]]
chi2_stat, p, dof, exp = stats.chi2_contingency(table)
print(f"Independence test: χ² = {chi2_stat:.3f}, p = {p:.4f}")

# Fisher's exact test (for small samples, 2x2 tables)
table_small = [[8, 2], [1, 5]]
odds_ratio, p = stats.fisher_exact(table_small)
print(f"Fisher's exact: OR = {odds_ratio:.2f}, p = {p:.4f}")
```
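Production Tip: For data drift monitoring, track chi-square statistics over time for each categorical feature and set up alerts when p-values drop below your threshold. With large production samples even small shifts become statistically significant, so pair the p-value with an effect size such as Cramér's V. Tools like evidently, deepchecks, and whylabs automate this process.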


Summary

Key Takeaways

  1. Chi-square tests are for categorical data: They measure whether observed frequencies differ significantly from expected frequencies.
  2. Three main variants: Goodness-of-fit (one variable, theoretical distribution), Independence (two variables, same sample), Homogeneity (one variable, multiple populations).
  3. The test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$ approximately follows a chi-square distribution with the appropriate degrees of freedom.
  4. Degrees of freedom: For goodness-of-fit: df = k - 1 - p. For contingency tables: df = (r-1)(c-1).
  5. Assumptions matter: Expected frequencies should be ≥ 5 for the chi-square approximation to be valid. Use Fisher's exact test for small samples.
  6. Essential in ML: Feature selection, data drift detection, A/B testing, and fairness analysis all rely heavily on chi-square tests.

Quick Reference

| Aspect | Formula / Rule |
|---|---|
| Chi-square statistic | χ² = Σ (O - E)² / E |
| Expected (independence) | Eᵢⱼ = (Row Total × Col Total) / n |
| df (goodness-of-fit) | k - 1 (or k - 1 - p if estimating parameters) |
| df (contingency table) | (r - 1)(c - 1) |
| Assumption check | All Eᵢ ≥ 5 (or use Fisher's exact) |
| Effect size | Cramér's V = √(χ² / (n × min(r-1, c-1))) |
Looking Ahead: In the next section, we'll explore F-tests, which compare variances across groups. F-tests are the foundation of ANOVA (Analysis of Variance), essential for comparing multiple treatment groups in experiments and ML model selection.