Learning Objectives
By the end of this section, you will be able to:
Core Knowledge
- Understand why chi-square tests work for categorical data
- Compute chi-square statistics for goodness-of-fit tests
- Construct and analyze contingency tables
- Calculate expected frequencies under H₀
- Determine correct degrees of freedom
Practical Skills
- Apply goodness-of-fit tests for distribution testing
- Use independence tests to find relationships between categorical variables
- Interpret chi-square test results correctly
- Implement chi-square tests in Python
AI/ML Applications
- Feature Selection - Select categorical features significantly associated with the target
- Data Drift Detection - Detect when categorical feature distributions change in production
- A/B Testing - Compare conversion rates across treatment groups
- Model Evaluation - Test if model predictions differ from random chance
- Fairness Analysis - Test for disparate treatment across demographic groups
Where You'll Apply This: Chi-square tests are essential whenever you work with categorical data, from analyzing user segments and survey responses to validating ML models and detecting dataset shift in production systems.
The Big Picture: Karl Pearson's Legacy
The year is 1900. Karl Pearson, a mathematician at University College London, faces a fundamental question: How do you test whether observed data follows a theoretical distribution when the data is categorical?
Karl Pearson's Insight
Pearson realized that when we observe counts falling into categories, we can measure how far these observed counts deviate from what we would expect if some null hypothesis were true. The key insight: when the expected counts are large enough, the sum of squared, standardized deviations approximately follows a chi-square distribution.
The Problem That Needed Solving
Before Pearson's chi-square test, scientists had no rigorous method to answer questions like:
- Is this die fair? (Do all faces appear equally often?)
- Does this genetic data follow Mendelian ratios? (The 3:1 ratio)
- Is there a relationship between smoking and lung disease?
- Do customers prefer one product variant over another?
What makes categorical data special is that we cannot compute means and standard deviations in the usual sense. Instead, we work with counts (frequencies) and proportions. Pearson's brilliant solution was to construct a statistic that measures the discrepancy between observed and expected counts.
The Fundamental Question
"Given observed categorical data, are the differences from expected counts too large to be due to chance?"
The Chi-Square Distribution Connection
Before diving into the tests, let's understand why the chi-square distribution appears here. This connection reveals the beautiful mathematical foundation underlying these tests.
The Mathematical Foundation
If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then:
$$Z_1^2 + Z_2^2 + \cdots + Z_k^2 \sim \chi^2_k$$
The sum of squared standard normals follows a chi-square distribution with k degrees of freedom.
The connection to categorical data: When sample sizes are large enough, the standardized differences between observed and expected counts are approximately normally distributed. When we square and sum these, we get a chi-square distribution!
Why This Matters
- Under H₀: If the null hypothesis is true, observed counts should be close to expected counts.
- Standardized residuals: $(O_i - E_i) / \sqrt{E_i}$ are approximately standard normal for each category.
- Sum of squares: Adding up all squared standardized residuals gives us a chi-square statistic.
- Large values = evidence against H₀: The further our data deviates, the larger the statistic.
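The link between squared standard normals and the chi-square distribution can be checked by simulation. The sketch below is illustrative only: the seed, the choice k = 5, and the number of simulations are arbitrary assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
k = 5                            # number of independent standard normals (assumed)
n_sims = 100_000                 # number of simulated sums (assumed)

z = rng.standard_normal(size=(n_sims, k))
q = (z ** 2).sum(axis=1)         # one sum of squares per simulated row

# A chi-square(k) variable has mean k and variance 2k.
print(f"simulated mean: {q.mean():.3f} (theory: {k})")
print(f"simulated var:  {q.var():.3f} (theory: {2 * k})")

# Compare the empirical 95th percentile with the theoretical one.
empirical_q95 = np.quantile(q, 0.95)
theoretical_q95 = stats.chi2.ppf(0.95, df=k)
print(f"95th percentile: {empirical_q95:.3f} vs {theoretical_q95:.3f}")
```

The simulated mean, variance, and upper percentile should all land close to their theoretical chi-square values, which is what justifies comparing a test statistic against chi-square critical values.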
Chi-Square Distribution Shape
The chi-square distribution changes shape with its degrees of freedom k. It becomes more symmetric as k increases, approaching a normal distribution. For k = 1, 2, the mode is at 0. For k ≥ 3, the mode is at k - 2, so the density is right-skewed with an interior mode.
As Gamma: χ²(k) is a Gamma distribution with shape k/2 and rate 1/2. For k = 5: Gamma(5/2, 1/2) = Gamma(2.5, 0.5).
Critical values: Reject H₀ if your test statistic exceeds the critical value of χ²(k) at the chosen significance level.
PDF Formula
$$f(x) = \frac{x^{k/2-1}\, e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}, \quad x > 0$$
For k = 5:
$$f(x) = \frac{x^{1.5}\, e^{-x/2}}{2^{2.5}\, \Gamma(2.5)}$$
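Critical values and densities for a chi-square distribution can be computed with SciPy. The sketch below assumes k = 5 and a few common significance levels, and compares the PDF formula written out by hand with `scipy.stats.chi2.pdf` and with the equivalent Gamma density. The helper name `chi2_pdf` is hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

k = 5  # degrees of freedom (assumed for illustration)

# Upper-tail critical values: reject H0 when the statistic exceeds them.
for alpha in (0.10, 0.05, 0.01):
    critical = stats.chi2.ppf(1 - alpha, df=k)
    print(f"alpha = {alpha:.2f}: critical value = {critical:.3f}")

def chi2_pdf(x, k):
    """Chi-square density written out directly from the PDF formula."""
    return x ** (k / 2 - 1) * np.exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

x = 3.0  # the mode for k = 5 is k - 2 = 3
manual = chi2_pdf(x, k)
library = stats.chi2.pdf(x, df=k)
as_gamma = stats.gamma.pdf(x, a=k / 2, scale=2)  # rate 1/2 corresponds to scale 2
print(manual, library, as_gamma)
```

All three density values should agree, and the α = 0.05 critical value for 5 degrees of freedom is approximately 11.07.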
Goodness-of-Fit Test
The goodness-of-fit test answers the question: "Does my observed data follow a hypothesized distribution?" This is the original chi-square test developed by Pearson.
Null Hypothesis (H₀)
The observed data follows the specified theoretical distribution.
e.g., "This die is fair" or "The data follows a Poisson distribution"
Alternative Hypothesis (H₁)
The observed data does not follow the specified distribution.
e.g., "This die is biased" or "The data does not follow Poisson"
The Chi-Square Statistic
The test statistic measures how far the observed frequencies ($O_i$) are from the expected frequencies ($E_i$) under H₀:
Pearson's Chi-Square Statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
- $O_i$: Observed count in category i
- $E_i$: Expected count in category i
- $k$: Number of categories
Intuition behind the formula:
- $(O_i - E_i)^2$: Squared differences measure deviation from expectation. Squaring ensures positive and negative deviations don't cancel.
- Dividing by $E_i$: Normalizes the contribution. A difference of 10 matters more when $E_i = 20$ than when $E_i = 1000$.
- Summing across categories: Aggregates evidence from all categories into a single measure.
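To make the formula concrete, the sketch below computes the statistic term by term for a six-sided die and compares it with `scipy.stats.chisquare`. The counts are the illustrative 600-roll example used in this section's Python listing; the variable names are only for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([95, 108, 102, 97, 105, 93])   # illustrative die counts (600 rolls)
expected = np.full(6, observed.sum() / 6)          # fair die: 100 per face

# Per-category contributions (O - E)^2 / E, then their sum.
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

df = len(observed) - 1                             # 6 categories, no estimated parameters
p_value = stats.chi2.sf(chi2_stat, df)             # upper-tail probability

# SciPy assumes equal expected counts when f_exp is omitted.
library_stat, library_p = stats.chisquare(observed)

print(f"manual:  chi2 = {chi2_stat:.3f}, p = {p_value:.4f}")
print(f"library: chi2 = {library_stat:.3f}, p = {library_p:.4f}")
```

Here the squared deviations sum to 176, so the statistic is 176 / 100 = 1.76, far below the α = 0.05 critical value for 5 degrees of freedom.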
Degrees of Freedom
For a goodness-of-fit test with k categories and p estimated parameters:
$$df = k - 1 - p$$
| Situation | k | p | df |
|---|---|---|---|
| Fully specified distribution (e.g., fair die) | 6 | 0 | 5 |
| Binomial with known n and success probability | n+1 | 0 | n |
| Poisson with λ estimated from data | k | 1 | k-2 |
| Normal with μ and σ estimated | k | 2 | k-3 |
Why subtract 1? Because the observed counts must sum to n (the sample size), one count is determined by the others. This constraint costs us one degree of freedom.
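When a parameter is estimated from the data, the extra lost degree of freedom can be passed to `scipy.stats.chisquare` through its `ddof` argument. The sketch below is an assumption-laden illustration: the Poisson sample is simulated, and the binning into 0 to 4 and "5 or more" is one reasonable choice, not a prescribed one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.poisson(lam=2.0, size=500)       # hypothetical event counts

# Bin as 0, 1, 2, 3, 4 and "5 or more" to keep expected counts reasonably large.
observed = np.bincount(np.minimum(sample, 5), minlength=6)
n = observed.sum()

lam_hat = sample.mean()                       # one parameter estimated from the data
probs = np.append(
    stats.poisson.pmf(np.arange(5), lam_hat), # P(X = 0), ..., P(X = 4)
    stats.poisson.sf(4, lam_hat),             # P(X >= 5)
)
expected = n * probs

n_estimated = 1
df = len(observed) - 1 - n_estimated          # k - 1 - p = 6 - 1 - 1 = 4

# ddof tells SciPy to use k - 1 - ddof degrees of freedom for the p-value.
stat, p_value = stats.chisquare(observed, f_exp=expected, ddof=n_estimated)
print(f"lambda_hat = {lam_hat:.3f}, chi2 = {stat:.3f}, df = {df}, p = {p_value:.4f}")
```

Strictly, the k - 1 - p rule assumes the parameter is estimated from the binned counts; estimating λ from the raw sample, as done here for simplicity, is a common approximation.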
Example: Goodness-of-Fit Test for a Die
Roll a die (fair or biased) and perform a chi-square goodness-of-fit test to determine if the die is fair. The table shows a perfectly balanced outcome of 600 rolls, for which the statistic is 0 and the p-value is 1. Any imbalance in the counts raises the statistic and lowers the p-value.
| Face | Observed (O) | Expected (E) | O - E | (O - E)² / E |
|---|---|---|---|---|
| Face 1 | 100 | 100.0 | 0.0 | 0.000 |
| Face 2 | 100 | 100.0 | 0.0 | 0.000 |
| Face 3 | 100 | 100.0 | 0.0 | 0.000 |
| Face 4 | 100 | 100.0 | 0.0 | 0.000 |
| Face 5 | 100 | 100.0 | 0.0 | 0.000 |
| Face 6 | 100 | 100.0 | 0.0 | 0.000 |
| χ² statistic | | | | 0.000 |
Result: Fail to reject H₀
χ² Statistic: 0.0000
Degrees of Freedom: 5 (6 faces - 1)
P-value: 1.0000
Critical Value (α = 0.05): 11.07
No significant evidence the die is unfair (p ≥ 0.05)
How to Interpret
- H₀ (Null Hypothesis): The die is fair (all faces equally likely)
- H₁ (Alternative): The die is not fair
- Large χ²: Observed data differs significantly from expected
- Small p-value: Unlikely to see this data if die were fair
Test of Independence
The test of independence determines whether two categorical variables are related or independent. This is perhaps the most widely used chi-square test in practice.
The Question We're Asking
"Is there a relationship between Variable A and Variable B, or are they independent?"
H₀: Variables are independent
P(A and B) = P(A) × P(B)
H₁: Variables are associated
P(A and B) ≠ P(A) × P(B)
Contingency Tables
Data for independence tests is organized in a contingency table (also called a cross-tabulation or two-way table). Each cell shows the count of observations falling into that combination of categories.
Example: Customer Satisfaction by Product Type
| | Satisfied | Neutral | Dissatisfied | Row Total |
|---|---|---|---|---|
| Product A | 150 | 40 | 10 | 200 |
| Product B | 90 | 60 | 50 | 200 |
| Column Total | 240 | 100 | 60 | 400 |
Is satisfaction level independent of product type?
Computing Expected Frequencies
Under the null hypothesis of independence, the expected count for each cell is:
Expected Frequency Formula
$$E_{ij} = \frac{(\text{Row } i \text{ total}) \times (\text{Column } j \text{ total})}{n}$$
Intuition: If the variables are truly independent, the proportion in each cell should be the product of the marginal proportions. The expected count is this product times the total sample size.
Expected Frequencies (Under Independence)
| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | (200×240)/400 = 120 | (200×100)/400 = 50 | (200×60)/400 = 30 |
| Product B | (200×240)/400 = 120 | (200×100)/400 = 50 | (200×60)/400 = 30 |
The degrees of freedom for a contingency table with r rows and c columns:
$$df = (r - 1)(c - 1)$$
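The expected-frequency rule and the degrees of freedom for the satisfaction table can be reproduced in a few lines. This is a minimal sketch using the counts from the example; the outer product of the marginal totals is one convenient way to apply the row-total × column-total rule to every cell at once.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Product A, Product B; columns = Satisfied, Neutral, Dissatisfied
observed = np.array([[150, 40, 10],
                     [90, 60, 50]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# E_ij = (row i total) * (column j total) / n, for all cells at once
expected = np.outer(row_totals, col_totals) / n

chi2_stat = ((observed - expected) ** 2 / expected).sum()
r, c = observed.shape
df = (r - 1) * (c - 1)
p_value = stats.chi2.sf(chi2_stat, df)

# Cross-check against SciPy's built-in test
lib_stat, lib_p, lib_df, lib_expected = stats.chi2_contingency(observed)

print(expected)
print(f"chi2 = {chi2_stat:.3f}, df = {df}, p = {p_value:.2e}")
```

Every expected count is well above 5, so the chi-square approximation is reasonable for this table.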
Example: Contingency Table Analysis
The tables below work through the customer satisfaction example and show how deviations from expected counts contribute to the test statistic.
Observed Frequencies (O)
| | Satisfied | Neutral | Dissatisfied | Total |
|---|---|---|---|---|
| Product A | 150 | 40 | 10 | 200 |
| Product B | 90 | 60 | 50 | 200 |
| Total | 240 | 100 | 60 | 400 |
Expected Frequencies (E) under Independence
| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | 120.0 | 50.0 | 30.0 |
| Product B | 120.0 | 50.0 | 30.0 |
Chi-Square Contributions: (O - E)² / E
| | Satisfied | Neutral | Dissatisfied |
|---|---|---|---|
| Product A | 7.500 | 2.000 | 13.333 |
| Product B | 7.500 | 2.000 | 13.333 |
| Total χ² | | | 45.667 |
Result: Reject H₀ (variables are associated)
Chi-Square Statistic: 45.6667
Degrees of Freedom: 2 = (2-1) × (3-1)
P-value: < 0.0001
Cramér's V: 0.338 (medium effect)
Min Expected: 30.0
Sample Size: 400
How to Interpret
- H₀: Row and column variables are independent
- H₁: Row and column variables are associated
- Large residuals: Cells that contribute most to dependence
- Cramér's V: Effect size measure (0 = no association, 1 = perfect association)
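Effect size and per-cell residuals follow directly from the test output. The sketch below reuses the satisfaction counts and computes Cramér's V along with Pearson residuals, (O - E) / √E, whose squares are the per-cell chi-square contributions. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([[150, 40, 10],
                     [90, 60, 50]])

chi2_stat, p_value, df, expected = stats.chi2_contingency(observed)

# Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * min_dim))

# Pearson residuals: sign shows direction, magnitude shows contribution
residuals = (observed - expected) / np.sqrt(expected)

print(f"Cramér's V = {cramers_v:.3f}")
print(np.round(residuals, 2))
```

The largest residuals are in the Dissatisfied column, which is where the two products differ most from what independence would predict.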
Simulating the Independence Test
Simulating data from independent or associated variables shows how well the chi-square test detects a relationship, and how sample size affects the test's power. For a 2×2 table, the test statistic is compared against a chi-square distribution with df = 1.
Understanding the Simulation
- Association = 0: Variables are independent. Rejection rate should equal α (Type I error rate)
- Association > 0: Variables are associated. Higher rejection rate = test has power to detect it
- Sample size matters: Larger samples give more power to detect weak associations
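The behaviour described in these points can be reproduced with a small Monte Carlo experiment. The sketch below is one possible setup, not the original simulator: it draws two groups of fixed size with success probabilities `p_a` and `p_b` (hypothetical parameters), runs the chi-square test on each simulated 2×2 table, and reports the fraction of rejections.

```python
import numpy as np
from scipy import stats

def rejection_rate(p_a, p_b, n_per_group, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of simulated 2x2 tables in which the test rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.binomial(n_per_group, p_a)   # successes in group A
        b = rng.binomial(n_per_group, p_b)   # successes in group B
        table = [[a, n_per_group - a], [b, n_per_group - b]]
        # correction=False disables Yates's correction, which would make
        # the test conservative and push the rejection rate below alpha.
        _, p_value, _, _ = stats.chi2_contingency(table, correction=False)
        rejections += int(p_value < alpha)
    return rejections / n_sims

type_i = rejection_rate(0.5, 0.5, n_per_group=200)       # no association
power_small = rejection_rate(0.5, 0.6, n_per_group=100)  # weak association, small n
power_large = rejection_rate(0.5, 0.6, n_per_group=500)  # same association, larger n

print(f"Type I error rate: {type_i:.3f}")
print(f"Power (n = 100 per group): {power_small:.3f}")
print(f"Power (n = 500 per group): {power_large:.3f}")
```

With no association the rejection rate should sit near α = 0.05, and for the same effect the larger sample should reject far more often.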
Test of Homogeneity
The test of homogeneity is mathematically identical to the test of independence, but the research question is different.
Test of Independence
One sample is drawn, and two variables are observed on each unit.
Question: Are the variables associated?
Test of Homogeneity
Multiple populations are sampled separately, and one variable is observed.
Question: Do the populations have the same distribution?
Choosing the Right Test
| Test | Purpose | Data Structure | Example |
|---|---|---|---|
| Goodness-of-Fit | Does data follow a specific distribution? | One categorical variable, k categories | Is this die fair? Do births follow uniform distribution across weekdays? |
| Independence | Are two variables associated? | Two categorical variables, r×c table | Is customer satisfaction related to product type? |
| Homogeneity | Do populations have same distribution? | Multiple samples, one categorical variable | Do different age groups have same preferences? |
Assumptions and Validity Conditions
Chi-square tests are approximate tests. The approximation to the chi-square distribution requires certain conditions to be met.
Condition 1: Expected Frequencies Large Enough
Rule of thumb: All expected frequencies should be ≥ 5.
Some texts relax this: no expected frequency below 1, and at most 20% of cells with an expected frequency below 5.
Condition 2: Independent Observations
Each observation should be independent. No individual should contribute to multiple cells.
Condition 3: Random Sampling
Data should come from random sampling or a randomized experiment.
If a condition is violated:
- Expected counts too small: Combine categories or use Fisher's exact test
- 2×2 table with small n: Use Yates's continuity correction or Fisher's exact test
- Paired/matched data: Use McNemar's test instead
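These fallbacks can be wired into a small helper. The sketch below is one possible approach, not a standard recipe: the function name `analyze_2x2` and the paired counts `b` and `c` are hypothetical, and McNemar's statistic is computed by hand as (b - c)² / (b + c) with 1 degree of freedom and no continuity correction.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

def analyze_2x2(table):
    """Use Fisher's exact test when any expected count is below 5."""
    table = np.asarray(table)
    expected = expected_freq(table)
    if expected.min() < 5:
        _, p_value = stats.fisher_exact(table)
        return "fisher_exact", p_value
    # For 2x2 tables chi2_contingency applies Yates's correction by default.
    _, p_value, _, _ = stats.chi2_contingency(table)
    return "chi2_yates", p_value

small_method, small_p = analyze_2x2([[8, 2], [1, 5]])          # small counts
large_method, large_p = analyze_2x2([[120, 880], [150, 850]])  # large counts
print(small_method, round(small_p, 4))
print(large_method, round(large_p, 4))

# McNemar's test for paired data uses only the discordant pairs:
# b = changed from yes to no, c = changed from no to yes (hypothetical counts).
b, c = 25, 10
mcnemar_stat = (b - c) ** 2 / (b + c)
mcnemar_p = stats.chi2.sf(mcnemar_stat, df=1)
print(f"McNemar: chi2 = {mcnemar_stat:.3f}, p = {mcnemar_p:.4f}")
```

The threshold of 5 mirrors the rule of thumb above; a stricter or looser cutoff is a judgment call.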
Applications in AI/ML
Chi-square tests are fundamental tools in machine learning pipelines. Here are the key applications every ML engineer should know:
Feature Selection
The chi-square test measures the dependence between each categorical feature and the target. Features with high χ² values (low p-values) are more strongly associated with the target. sklearn.feature_selection.SelectKBest with the chi2 score function is the standard implementation; it expects non-negative feature values such as counts or one-hot encodings.
Data Drift Detection
Compare training and production distributions for categorical features. Significant differences indicate covariate shift, which can degrade model performance. This is critical for ML monitoring in production.
Fairness Analysis
Test whether model predictions or decisions are independent of protected attributes (gender, race, age). A significant chi-square indicates potential disparate impact.
A/B Testing
Compare conversion rates, click-through rates, or categorical outcomes across treatment groups. The chi-square test is the standard method for categorical A/B test analysis when sample sizes are large.
Example: Chi-Square Feature Selection
The chi-square test ranks features by their association with the target variable. Features with higher chi-square scores (lower p-values) are more predictive.
How Chi-Square Feature Selection Works:
For each categorical feature, we test whether it's independent of the target variable. Features with high chi-square scores (low p-values) are statistically associated with the target and are more likely to be useful for prediction.
Feature Rankings by Chi-Square Score
| Rank | Feature | χ² Score | p-value | df | Actually Predictive? | Selected? |
|---|---|---|---|---|---|---|
| #1 | income_level (Income bracket) | 42.744 | < 0.001 | 2 | Yes | Selected |
| #2 | age_group (Customer age group) | 18.888 | < 0.001 | 2 | Yes | Selected |
| #3 | device (Device type) | 9.210 | 0.0100 | 2 | Yes | Selected |
| #4 | signup_day (Day of week signed up) | 1.688 | 0.1938 | 1 | No | - |
| #5 | browser (Web browser used) | 1.177 | 0.7586 | 3 | No | - |
Selection Accuracy
Selected 3 out of 3 truly predictive features (age_group, income_level, device). Larger samples make it easier for the chi-square test to identify the truly predictive features.
sklearn Equivalent
from sklearn.feature_selection import chi2, SelectKBest

# X: non-negative encoded feature matrix, y: target labels (defined elsewhere)
# Select top 3 features by chi-square score
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)
# Selected features: income_level, age_group, device
Python Implementation
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import SelectKBest, chi2

# ============================================
# 1. Goodness-of-Fit Test
# ============================================

def chi_square_goodness_of_fit(observed, expected_ratios=None):
    """
    Perform a chi-square goodness-of-fit test.

    Parameters
    ----------
    observed : array-like
        Observed frequencies in each category
    expected_ratios : array-like, optional
        Expected proportions for each category.
        If None, assumes uniform distribution.

    Returns
    -------
    dict with test results
    """
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    k = len(observed)

    if expected_ratios is None:
        expected = np.full(k, n / k)  # Uniform
    else:
        expected_ratios = np.asarray(expected_ratios, dtype=float)
        # Normalize so the ratios do not need to sum to 1
        expected = expected_ratios * n / expected_ratios.sum()

    # Chi-square statistic
    chi2_stat = np.sum((observed - expected) ** 2 / expected)
    df = k - 1  # No parameters are estimated from the data here
    # Survival function is more accurate than 1 - cdf for tiny p-values
    p_value = stats.chi2.sf(chi2_stat, df)

    return {
        'chi2_statistic': chi2_stat,
        'degrees_of_freedom': df,
        'p_value': p_value,
        'observed': observed,
        'expected': expected,
        'residuals': (observed - expected) / np.sqrt(expected)  # Pearson residuals
    }


# Example: Testing if a die is fair
die_rolls = [95, 108, 102, 97, 105, 93]  # 600 total rolls
result = chi_square_goodness_of_fit(die_rolls)
print(f"Die fairness test: χ² = {result['chi2_statistic']:.3f}, p = {result['p_value']:.4f}")

# ============================================
# 2. Test of Independence
# ============================================

def chi_square_independence_test(contingency_table):
    """
    Perform chi-square test of independence on a contingency table.

    Parameters
    ----------
    contingency_table : 2D array-like
        The contingency table with observed counts

    Returns
    -------
    dict with test results
    """
    table = np.array(contingency_table)
    # Note: for 2x2 tables SciPy applies Yates's continuity correction by default
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

    # Pearson residuals: (O - E) / sqrt(E)
    residuals = (table - expected) / np.sqrt(expected)

    return {
        'chi2_statistic': chi2_stat,
        'degrees_of_freedom': dof,
        'p_value': p_value,
        'observed': table,
        'expected': expected,
        'standardized_residuals': residuals
    }


# Example: A/B test for conversion rates
ab_test_data = [
    [120, 880],  # Design A: [converted, not converted]
    [150, 850]   # Design B: [converted, not converted]
]
result = chi_square_independence_test(ab_test_data)
print(f"A/B test: χ² = {result['chi2_statistic']:.3f}, p = {result['p_value']:.4f}")

# ============================================
# 3. Feature Selection with Chi-Square
# ============================================

def chi_square_feature_selection(X, y, k='all'):
    """
    Select top k features using chi-square test.

    Parameters
    ----------
    X : DataFrame or 2D array
        Feature matrix (counts or non-negative values)
    y : array-like
        Target labels
    k : int or 'all'
        Number of features to select

    Returns
    -------
    DataFrame with feature scores and rankings
    """
    if k == 'all':
        k = X.shape[1]

    # sklearn's chi2 raises an error if X contains negative values
    selector = SelectKBest(chi2, k=k)
    selector.fit(X, y)

    if hasattr(X, 'columns'):
        feature_names = X.columns
    else:
        feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    scores = selector.scores_
    p_values = selector.pvalues_

    results = pd.DataFrame({
        'feature': feature_names,
        'chi2_score': scores,
        'p_value': p_values
    }).sort_values('chi2_score', ascending=False)

    return results

# ============================================
# 4. Data Drift Detection
# ============================================

def detect_categorical_drift(
    reference_data: np.ndarray,
    production_data: np.ndarray,
    alpha: float = 0.05
) -> dict:
    """
    Detect drift in categorical feature distributions.

    Parameters
    ----------
    reference_data : array-like
        Categorical values from training/reference data
    production_data : array-like
        Categorical values from production data
    alpha : float
        Significance level for drift detection

    Returns
    -------
    dict with drift detection results
    """
    reference_data = np.asarray(reference_data)
    production_data = np.asarray(production_data)

    # Get unique categories across both datasets
    all_categories = np.unique(np.concatenate([reference_data, production_data]))

    # Count frequencies
    ref_counts = np.array([np.sum(reference_data == cat) for cat in all_categories])
    prod_counts = np.array([np.sum(production_data == cat) for cat in all_categories])

    # Build contingency table (rows: reference, production)
    contingency = np.vstack([ref_counts, prod_counts])

    # Chi-square test of homogeneity
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)

    # Effect size (Cramér's V)
    n = contingency.sum()
    min_dim = min(contingency.shape) - 1
    cramers_v = np.sqrt(chi2_stat / (n * min_dim)) if min_dim > 0 else 0

    return {
        'chi2_statistic': chi2_stat,
        'p_value': p_value,
        'drift_detected': p_value < alpha,
        'cramers_v': cramers_v,  # Effect size
        'categories': all_categories,
        'reference_distribution': ref_counts / ref_counts.sum(),
        'production_distribution': prod_counts / prod_counts.sum()
    }


# Example: Drift detection (seeded so the output is reproducible)
rng = np.random.default_rng(42)
training_devices = rng.choice(['mobile', 'desktop', 'tablet'],
                              size=1000, p=[0.5, 0.4, 0.1])
production_devices = rng.choice(['mobile', 'desktop', 'tablet'],
                                size=1000, p=[0.65, 0.3, 0.05])
drift_result = detect_categorical_drift(training_devices, production_devices)
print(f"Drift detected: {drift_result['drift_detected']}, p = {drift_result['p_value']:.4f}")

# ============================================
# 5. Quick SciPy Examples
# ============================================

# Goodness-of-fit (using scipy directly)
observed = [16, 18, 16, 14, 12, 12]  # Die rolls (n = 88)
# Expected counts must sum to the observed total, or chisquare raises an error
expected = [sum(observed) / 6] * 6   # Fair die expectation
# Named chi2_stat to avoid shadowing sklearn's chi2 imported above
chi2_stat, p = stats.chisquare(observed, f_exp=expected)
print(f"Goodness-of-fit: χ² = {chi2_stat:.3f}, p = {p:.4f}")

# Independence test
table = [[50, 30, 20],
         [45, 35, 20]]
chi2_stat, p, dof, exp = stats.chi2_contingency(table)
print(f"Independence test: χ² = {chi2_stat:.3f}, p = {p:.4f}")

# Fisher's exact test (for small samples, 2x2 tables)
table_small = [[8, 2], [1, 5]]
odds_ratio, p = stats.fisher_exact(table_small)
print(f"Fisher's exact: OR = {odds_ratio:.2f}, p = {p:.4f}")

Tools such as evidently, deepchecks, and whylabs automate this kind of drift monitoring.
Knowledge Check
Test your understanding of chi-square tests.
Question 1: What type of data are chi-square tests designed for?
Summary
Key Takeaways
- Chi-square tests are for categorical data: They measure whether observed frequencies differ significantly from expected frequencies.
- Three main variants: Goodness-of-fit (one variable, theoretical distribution), Independence (two variables, same sample), Homogeneity (one variable, multiple populations).
- The test statistic: $\chi^2 = \sum (O_i - E_i)^2 / E_i$ follows a chi-square distribution with appropriate degrees of freedom.
- Degrees of freedom: For goodness-of-fit: df = k - 1 - p. For contingency tables: df = (r-1)(c-1).
- Assumptions matter: Expected frequencies should be ≥ 5 for the chi-square approximation to be valid. Use Fisher's exact test for small samples.
- Essential in ML: Feature selection, data drift detection, A/B testing, and fairness analysis all rely heavily on chi-square tests.
Quick Reference
| Aspect | Formula / Rule |
|---|---|
| Chi-square statistic | χ² = Σ (O - E)² / E |
| Expected (independence) | Eᵢⱼ = (Row Total × Col Total) / n |
| df (goodness-of-fit) | k - 1 (or k - 1 - p if estimating parameters) |
| df (contingency table) | (r - 1)(c - 1) |
| Assumption check | All Eᵢ ≥ 5 (or use Fisher's exact) |
| Effect size | Cramér's V = √(χ² / (n × min(r-1, c-1))) |
Looking Ahead: In the next section, we'll explore F-tests, which compare variances across groups. F-tests are the foundation of ANOVA (Analysis of Variance), essential for comparing multiple treatment groups in experiments and ML model selection.