Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Understand fundamentally why the chi-square distribution arises naturally as the sum of squared standard normal random variables
Define the chi-square distribution $\chi^2_k$ and explain the meaning of degrees of freedom
Recognize chi-square as a special case of Gamma: $\chi^2_k = \text{Gamma}(k/2, 1/2)$
Apply chi-square tests for goodness-of-fit and independence testing
Use chi-square for variance estimation and confidence intervals
Implement chi-square feature selection in machine learning pipelines
Connect chi-square to the F-distribution and Student's t-distribution

Deep Intuition: The Surprise Meter

The chi-square distribution measures "how surprising" your data is compared to what you expected.

Imagine you roll a die 600 times. You expect about 100 of each face. But you get: 90, 105, 115, 95, 85, 110. Is this die fair, or is something fishy?

The chi-square distribution answers this question. It tells you: "Given random variation, how likely is it to see this much deviation from expectation?"

The Core Idea

Chi-square ($\chi^2$) is built from squared standard normal variables:

$$\chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

where the $Z_i \sim N(0, 1)$ are independent. The subscript $k$ is the degrees of freedom.

Why Squared?

➕

Always Positive

Deviations above and below expectation both count as "surprise"

📈

Amplifies Large Deviations

A deviation of 10 contributes 100; deviation of 2 contributes only 4

🔗

Connects to Normal Distribution

The Normal distribution is the workhorse of statistics; chi-square captures squared normal behavior

Mental Model

Think of chi-square as a "surprise accumulator." Each squared deviation adds to your total surprise. The larger the chi-square value, the more your data deviates from what you expected.

The Historical Story

The chi-square distribution emerged from fundamental questions about data, errors, and testing scientific theories.

Friedrich Robert Helmert (1876)

The German mathematician Helmert first derived the distribution of the sample variance for normally distributed data. He showed that $(n-1)S^2/\sigma^2$ follows what we now call the chi-square distribution.

Karl Pearson (1900)

The breakthrough came when Pearson published "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling."

Pearson's Problem: Are the deviations I see in my data just due to random chance, or is there something systematic going on?

His motivation was practical: testing whether dice and roulette wheels were fair, and checking whether theoretical distributions fit observed data. The same test was later applied to problems such as Mendelian inheritance ratios.

Ronald Fisher (1920s)

Fisher rigorously developed the concept of degrees of freedom and showed how to apply chi-square tests correctly. He established that when you estimate parameters from data, you "use up" degrees of freedom.

Why Do We Need the Chi-Square Distribution?

Chi-square underpins many classical statistical tests. It answers fundamental questions about data:

✅ Use Chi-Square When:

Testing goodness-of-fit (does data match expected distribution?)
Testing independence (are two categorical variables related?)
Constructing confidence intervals for variance
Feature selection for categorical data in ML
Analyzing contingency tables

❌ Do NOT Use Chi-Square When:

Expected frequencies are too small (E < 5)
Data are continuous (need to bin first)
Observations are not independent
Testing means or correlations (use t-test, ANOVA)

Mathematical Definition

Definition from Normal Variables

If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables, then:

$$X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$$

We say $X$ follows a chi-square distribution with k degrees of freedom.

The Probability Density Function

$$f(x; k) = \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)} \quad \text{for } x > 0$$

Symbol by Symbol

Symbol	Meaning	Intuition
x	Chi-square value	Sum of squared deviations
k	Degrees of freedom	Number of independent squared normals
Γ(k/2)	Gamma function	Normalization constant (generalizes factorial)
e^(-x/2)	Exponential decay	Large values become rare
x^(k/2-1)	Power term	Shapes the distribution based on k
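
To make the symbol table concrete, here is a minimal sketch that evaluates the density term by term and compares it with SciPy's built-in version. The helper name `chi2_pdf` is made up for this illustration; it is not part of any library.

```python
import math
from scipy import stats

def chi2_pdf(x: float, k: int) -> float:
    """Chi-square density f(x; k), written out term by term (illustrative helper)."""
    if x <= 0:
        return 0.0                                  # the formula is stated for x > 0
    power_term = x ** (k / 2 - 1)                   # shapes the curve based on k
    decay_term = math.exp(-x / 2)                   # makes large values rare
    normalizer = 2 ** (k / 2) * math.gamma(k / 2)   # makes the density integrate to 1
    return power_term * decay_term / normalizer

k = 5
for x in (0.5, 3.0, 10.0):
    print(f"x={x}: manual={chi2_pdf(x, k):.6f}, scipy={stats.chi2.pdf(x, df=k):.6f}")
```

The two columns should agree to floating-point precision, which is a quick sanity check that the formula has been read correctly.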

Connection to Gamma Distribution

The chi-square distribution is a special case of Gamma:

$$\chi^2_k = \text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right)$$

This means: shape parameter $\alpha = k/2$, rate parameter $\beta = 1/2$.

Why This Connection Matters

Since chi-square is a special case of Gamma, it inherits many properties. Sums of independent Gamma variables with the same rate are again Gamma (the shapes add), so chi-square is closed under addition too: the sum of independent chi-squares is chi-square with summed degrees of freedom.
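
The identity can be checked numerically. The sketch below assumes SciPy's parameterization of the Gamma distribution, which takes a shape `a` and a `scale` equal to 1/rate, so a rate of 1/2 becomes `scale=2`.

```python
import numpy as np
from scipy import stats

k = 5
x = np.linspace(0.1, 20, 50)

# SciPy's gamma uses shape `a` and scale = 1 / rate, so rate 1/2 means scale 2.
chi2_pdf_vals = stats.chi2.pdf(x, df=k)
gamma_pdf_vals = stats.gamma.pdf(x, a=k / 2, scale=2)

max_gap = np.max(np.abs(chi2_pdf_vals - gamma_pdf_vals))
print(f"Largest PDF difference on the grid: {max_gap:.2e}")
```

Mixing up rate and scale is the most common mistake when moving between textbook notation and library calls, so it is worth running this kind of check once.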

Exploring the Distribution

The snapshot below shows the distribution for $k = 5$ degrees of freedom, where the shape is right-skewed with an interior mode. Changing $k$ changes the shape, as summarized under Key Observations.

Statistics for χ²(5)

Statistic	Value
Mean	5
Variance	10
Std Dev	3.162
Mode	3
Skewness	1.2649

Notation: X ~ χ²(5)
As Gamma: Gamma(5/2, 1/2) = Gamma(2.5, 0.5)

Critical Values for χ²(5)

α	Critical value
0.05	11.070
0.01	15.086
0.001	20.515

Reject H₀ if your test statistic exceeds the critical value for your chosen α.

The PDF shows the probability density. For k=1,2, the mode is at 0. For k≥3, the mode is at k-2.

PDF Formula

$$f(x) = \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}$$

For $k = 5$:

$$f(x) = \frac{x^{1.5} e^{-x/2}}{2^{2.5}\, \Gamma(2.5)}$$

Key Observations

k = 1: Heavily right-skewed, mode at 0, unbounded spike at origin
k = 2: Exponential distribution! Mode at 0 but no spike
k ≥ 3: Interior mode at $k - 2$, bell-like but still skewed
Large k: Becomes more symmetric, approximates Normal by CLT
Mean always equals k: Larger k shifts distribution right
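
Two of these observations are easy to check numerically. The following sketch compares χ²(2) with an exponential of rate 1/2, and measures how close χ²(k) gets to Normal(k, 2k) as k grows. The helper `normal_gap` and the grid it uses are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 15, 200)

# k = 2: chi-square coincides with an exponential of rate 1/2 (SciPy scale = 2).
gap_k2 = np.max(np.abs(stats.chi2.pdf(x, df=2) - stats.expon.pdf(x, scale=2)))
print(f"k=2 vs Exponential(rate=1/2), max PDF gap: {gap_k2:.2e}")

def normal_gap(k: int) -> float:
    """Largest CDF gap between chi2(k) and Normal(mean=k, var=2k) on a grid."""
    half_width = 5 * np.sqrt(2 * k)
    grid = np.linspace(max(k - half_width, 0.0), k + half_width, 2001)
    approx = stats.norm.cdf(grid, loc=k, scale=np.sqrt(2 * k))
    return float(np.max(np.abs(stats.chi2.cdf(grid, df=k) - approx)))

for k in (5, 50, 500):
    print(f"k={k}: max CDF gap to Normal(k, 2k) = {normal_gap(k):.4f}")
```

The gap shrinks as k increases but does so slowly, which is why exact chi-square quantiles are preferred over the Normal approximation for small k.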

Where Chi-Square Comes From

The chi-square distribution arises naturally when you square and sum independent standard normal variables. For example, with $k = 4$ degrees of freedom:

$$\chi^2_4 = Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2$$

where each $Z_i \sim N(0, 1)$ is independent. Drawing many such sums and plotting their histogram against the theoretical χ²(4) PDF shows how closely the two agree.

Key Insight: As you generate more samples, the histogram converges to the theoretical chi-square PDF (red curve). This demonstrates that summing k squared standard normals produces the χ²(k) distribution. The mean equals k (degrees of freedom), and variance equals 2k.

This fundamental relationship is why chi-square appears everywhere in statistics: any time you're working with squared errors or deviations from normality, chi-square is lurking.
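
The construction can be reproduced in a few lines. This is a minimal simulation sketch, with the seed, sample count, and k = 4 chosen arbitrarily for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)   # fixed seed, chosen arbitrarily
k = 4
n_samples = 100_000

# Each row holds k independent standard normals; square and sum across the row.
z = rng.standard_normal(size=(n_samples, k))
chi2_samples = (z ** 2).sum(axis=1)

print(f"Sample mean: {chi2_samples.mean():.3f} (theory: {k})")
print(f"Sample variance: {chi2_samples.var(ddof=1):.3f} (theory: {2 * k})")

# Kolmogorov-Smirnov distance between the samples and the chi2(k) CDF.
ks_stat, ks_pvalue = stats.kstest(chi2_samples, stats.chi2(df=k).cdf)
print(f"KS distance to chi2({k}): {ks_stat:.4f}")
```

A small KS distance means the empirical distribution of the sums tracks the theoretical χ²(4) curve closely.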

Key Properties

Summary Statistics

Property	Formula	Interpretation
Mean	E[X] = k	Average equals degrees of freedom
Variance	Var(X) = 2k	Twice the degrees of freedom
Mode	k - 2 (for k ≥ 2)	Most likely value
Skewness	√(8/k)	Always positive, decreases with k
MGF	(1 - 2t)^(-k/2)	For t < 1/2

Additivity Property

If $X_1 \sim \chi^2_{k_1}$ and $X_2 \sim \chi^2_{k_2}$ are independent, then:

$$X_1 + X_2 \sim \chi^2_{k_1 + k_2}$$

Why this matters: You can decompose complex chi-square statistics into simpler components, or combine independent tests.
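
As a rough check of additivity, the sketch below adds independent χ²(3) and χ²(7) samples and compares a few empirical quantiles of the sum with those of χ²(10). The degrees of freedom, seed, and quantile levels are illustrative choices only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
k1, k2 = 3, 7
n_samples = 100_000

x1 = stats.chi2.rvs(df=k1, size=n_samples, random_state=rng)
x2 = stats.chi2.rvs(df=k2, size=n_samples, random_state=rng)
total = x1 + x2   # should behave like chi2(k1 + k2)

probs = [0.5, 0.9, 0.99]
empirical = np.quantile(total, probs)
theoretical = stats.chi2.ppf(probs, df=k1 + k2)
for p, e, t in zip(probs, empirical, theoretical):
    print(f"q={p}: empirical={e:.3f}, chi2({k1 + k2})={t:.3f}")
```

The property relies on independence: adding a chi-square variable to itself gives 2X, which is a scaled chi-square rather than a chi-square with doubled degrees of freedom.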

Scaling Property

If $X \sim \chi^2_k$ and $c > 0$:

$$cX \sim \text{Gamma}\left(\frac{k}{2}, \frac{1}{2c}\right)$$

Scaling doesn't preserve chi-square (unlike the Normal), but stays in the Gamma family.
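
The scaling rule can be verified without simulation by comparing CDFs. The sketch assumes SciPy's shape/scale convention for Gamma, so a rate of 1/(2c) becomes `scale=2*c`. The values k = 6 and c = 3 are arbitrary.

```python
import numpy as np
from scipy import stats

k, c = 6, 3.0
t = np.linspace(0.5, 60, 100)

# P(cX <= t) = P(X <= t / c), computed from the chi-square CDF.
cdf_scaled_chi2 = stats.chi2.cdf(t / c, df=k)

# Gamma with shape k/2 and rate 1/(2c); SciPy takes scale = 1 / rate = 2c.
cdf_gamma = stats.gamma.cdf(t, a=k / 2, scale=2 * c)

max_gap = np.max(np.abs(cdf_scaled_chi2 - cdf_gamma))
print(f"Largest CDF difference: {max_gap:.2e}")
```

This is the same relationship used for the sample variance: since $(n-1)S^2/\sigma^2$ is chi-square, $S^2$ itself is a scaled chi-square, that is, a Gamma variable.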

The Chi-Square Test Framework

The chi-square test compares observed frequencies to expected frequencies using this statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

What Each Term Means

Symbol	Meaning	Role
O_i	Observed frequency	What you actually counted
E_i	Expected frequency	What theory predicts
(O_i - E_i)²	Squared deviation	How far off each cell is
(O_i - E_i)²/E_i	Standardized deviation	Weight by expected (larger E allows larger variance)

The Key Insight

We divide by $E_i$ because a deviation of 10 means different things depending on whether you expected 20 (50% off) or 1000 (1% off). This standardization makes deviations comparable.
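
A tiny sketch makes the effect of the division visible. The function name `chi_square_contributions` is invented for this example; it computes the per-cell terms of the statistic.

```python
import numpy as np

def chi_square_contributions(observed, expected):
    """Per-cell terms (O_i - E_i)^2 / E_i of the chi-square statistic."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return (observed - expected) ** 2 / expected

# The same raw deviation of 10 in two very different cells.
small_cell = chi_square_contributions([30], [20])[0]       # expected 20, observed 30
large_cell = chi_square_contributions([1010], [1000])[0]   # expected 1000, observed 1010
print(f"Expected 20:   contribution = {small_cell:.2f}")
print(f"Expected 1000: contribution = {large_cell:.2f}")
```

The deviation of 10 contributes 5.0 when only 20 were expected but just 0.1 when 1000 were expected, matching the intuition that it is far more surprising in the small cell.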

Goodness-of-Fit Test: Is This Die Fair?

The goodness-of-fit test asks: does my observed data match a hypothesized distribution? The table below shows the baseline case of 600 rolls in which every face comes up exactly 100 times.


Face	Observed (O)	Expected (E)	O - E
Face 1	100	100.0	0.0
Face 2	100	100.0	0.0
Face 3	100	100.0	0.0
Face 4	100	100.0	0.0
Face 5	100	100.0	0.0
Face 6	100	100.0	0.0
Result: ✅ Fail to Reject H₀

χ² Statistic: 0.0000
Degrees of Freedom: 5 (6 faces - 1)
P-value: 1.0000
Critical Value (α = 0.05): 11.07

No significant evidence the die is unfair (p ≥ 0.05).

How to Interpret

H₀ (Null Hypothesis): The die is fair (all faces equally likely)
H₁ (Alternative): The die is not fair
Large χ²: Observed data differs significantly from expected
Small p-value: Unlikely to see this data if die were fair

Steps for a Goodness-of-Fit Test

State hypotheses: $H_0$: data follow the expected distribution; $H_1$: they do not
Calculate expected frequencies under $H_0$
Compute chi-square statistic: $\chi^2 = \sum (O_i - E_i)^2 / E_i$
Determine degrees of freedom: $k = \text{categories} - 1$ (subtract 1 more for each estimated parameter)
Find p-value from $\chi^2_k$ distribution
Decide: Reject $H_0$ if p-value < significance level
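
As a worked instance of these steps, take the die from the introduction: 600 rolls with observed counts 90, 105, 115, 95, 85, 110 and an expected count of 100 per face under $H_0$.

$$\chi^2 = \frac{(-10)^2 + 5^2 + 15^2 + (-5)^2 + (-15)^2 + 10^2}{100} = \frac{100 + 25 + 225 + 25 + 225 + 100}{100} = \frac{700}{100} = 7.0$$

With $k = 6 - 1 = 5$ degrees of freedom, the critical value at $\alpha = 0.05$ is 11.07. Since $7.0 < 11.07$ (the p-value is roughly 0.22), we fail to reject $H_0$: deviations of this size are unremarkable for a fair die.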

Degrees of Freedom

For goodness-of-fit: $\text{df} = (\text{categories} - 1) - (\text{parameters estimated})$

If you estimate the mean from data, subtract 1. If you estimate mean AND variance, subtract 2.
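
The following sketch shows one way the degrees-of-freedom adjustment might look in practice: a goodness-of-fit test against a Poisson distribution whose rate is estimated from the data. The data are synthetic, the binning into 0, 1, 2, 3, and "4 or more" is an arbitrary choice, and `ddof=1` is SciPy's way of subtracting one extra degree of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
data = rng.poisson(lam=2.0, size=500)   # synthetic counts, for illustration only

# Bin the data into categories 0, 1, 2, 3, and "4 or more".
n_bins = 5
observed = np.bincount(np.minimum(data, n_bins - 1), minlength=n_bins)

# Estimate the single Poisson parameter from the data (this costs one df).
lam_hat = data.mean()
probs = np.append(stats.poisson.pmf(np.arange(n_bins - 1), lam_hat),
                  stats.poisson.sf(n_bins - 2, lam_hat))   # P(X >= 4)
expected = probs * data.size

# ddof=1 makes SciPy use df = (categories - 1) - 1 when computing the p-value.
stat, pvalue = stats.chisquare(observed, f_exp=expected, ddof=1)
df = n_bins - 1 - 1
print(f"chi2 = {stat:.3f}, df = {df}, p = {pvalue:.4f}")
```

Forgetting the adjustment would compare the statistic against χ²(4) instead of χ²(3), which makes the test too lenient.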

Real-World Applications

🎲 Gaming & Fairness Testing

Testing if dice, cards, or RNGs are fair
Casino regulation and auditing
Video game loot box probability verification

🧬 Genetics & Biology

Testing Mendelian inheritance ratios
Hardy-Weinberg equilibrium
Gene-disease associations

📊 Survey Research

Testing independence of demographic variables
Cross-tabulation analysis
Polling accuracy assessment

🏥 Medical Research

Drug efficacy with categorical outcomes
Treatment vs. control group comparisons
Epidemiological studies

AI/ML Applications

Chi-square is a fundamental tool in machine learning, especially for categorical data analysis and feature selection.

1. Chi-Square Feature Selection

One of the most important uses of chi-square in ML is selecting the most informative features for classification. The idea: if a feature is independent of the target, it's not useful for prediction.

🎯Chi-Square Feature Selection for ML

How Chi-Square Feature Selection Works:

For each categorical feature, we test whether it's independent of the target variable. Features with high chi-square scores (low p-values) are statistically associated with the target and are more likely to be useful for prediction.

Dataset: 500 samples | Target: purchased vs not_purchased

Feature Rankings by Chi-Square Score (top 3 selected)

Rank	Feature	Description	χ² Score	p-value	df	Actually Predictive?	Selected?
#1	income_level	Income bracket	20.641	< 0.001	2	✓ Yes	Selected
#2	age_group	Customer age group	18.224	< 0.001	2	✓ Yes	Selected
#3	device	Device type	7.540	0.0230	2	✓ Yes	Selected
#4	signup_day	Day of week signed up	1.076	0.2995	1	✗ No	-
#5	browser	Web browser used	0.449	0.9299	3	✗ No	-

Selection Accuracy

Selected 3 out of 3 truly predictive features.

With larger sample sizes, chi-square identifies the truly predictive features (age_group, income_level, device) more reliably.
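
The ranking above boils down to running one independence test per feature. Here is a minimal sketch of that idea on synthetic data; the two features (`informative`, `noise`), the helper `chi2_score`, and the way the data are generated are all assumptions made for illustration, not the dataset behind the table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n = 2000
target = rng.integers(0, 2, size=n)   # 0 = not_purchased, 1 = purchased

# Hypothetical features: one tied to the target, one pure noise.
informative = np.where(rng.random(n) < 0.7, target, rng.integers(0, 2, size=n))
noise = rng.integers(0, 3, size=n)

def chi2_score(feature, target):
    """Chi-square independence test between one categorical feature and the target."""
    table = np.zeros((feature.max() + 1, target.max() + 1), dtype=int)
    np.add.at(table, (feature, target), 1)   # count each (feature, target) pair
    stat, pvalue, dof, _ = stats.chi2_contingency(table, correction=False)
    return stat, pvalue, dof

scores = {name: chi2_score(col, target)
          for name, col in {"informative": informative, "noise": noise}.items()}
for name, (stat, pvalue, dof) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: chi2={stat:.2f}, p={pvalue:.4g}, df={dof}")
```

Note that the raw score grows with the number of categories as well as with the strength of association, so p-values are the safer basis for comparing features with different df.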

sklearn Equivalent

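from sklearn.feature_selection import chi2, SelectKBest

# X: feature matrix with non-negative values (required by sklearn's chi2)
# y: class labels; both are assumed to be defined already
# Select top 3 features by chi-square score
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)

# Selected features: income_level, age_group, device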

2. Text Classification & NLP

Chi-square is heavily used in text classification to:

Select the most discriminative words/n-grams for each class
Reduce vocabulary size while preserving predictive power
Identify terms strongly associated with categories

🐍text_feature_selection.py

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# documents: list of raw text strings, y: class labels (assumed to be defined)
# Create bag-of-words features
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(documents)

# Select top 1000 features by chi-square score
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, y)

# Get the selected feature names
feature_names = vectorizer.get_feature_names_out()
selected_features = feature_names[selector.get_support()]

3. Categorical Encoding Decisions

When deciding how to encode categorical features, chi-square helps determine:

Whether a categorical feature is worth including
Which categories to merge (similar chi-square contributions)
Target encoding quality assessment

4. Model Calibration Testing

Chi-square goodness-of-fit can test if a model's predicted probabilities match reality:

Bin predictions into groups (e.g., 0-10%, 10-20%, ...)
Compare predicted vs. actual positive rates in each bin
Large chi-square indicates poor calibration
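
One possible way to turn these steps into code is sketched below, in the spirit of the Hosmer-Lemeshow test. The function name `calibration_chi_square`, the equal-width bins, and the choice of df (one per non-empty bin, which assumes the predictions were not fitted on the same data) are assumptions for illustration. The data are synthetic.

```python
import numpy as np
from scipy import stats

def calibration_chi_square(y_true, y_prob, n_bins=10):
    """Chi-square calibration statistic over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])   # bin index 0 .. n_bins - 1
    stat, used_bins = 0.0, 0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue                             # skip empty bins
        expected_pos = y_prob[mask].sum()        # predicted number of positives
        expected_neg = mask.sum() - expected_pos
        observed_pos = y_true[mask].sum()        # actual number of positives
        observed_neg = mask.sum() - observed_pos
        stat += (observed_pos - expected_pos) ** 2 / expected_pos
        stat += (observed_neg - expected_neg) ** 2 / expected_neg
        used_bins += 1
    # Assumption: held-out predictions, so df = number of non-empty bins.
    return stat, stats.chi2.sf(stat, df=used_bins)

rng = np.random.default_rng(seed=4)
p_true = rng.uniform(0.05, 0.95, size=5000)
y = (rng.random(5000) < p_true).astype(int)

stat_good, p_good = calibration_chi_square(y, p_true)   # calibrated scores
stat_bad, p_bad = calibration_chi_square(y, np.clip(p_true ** 2, 0.01, 0.99))  # distorted
print(f"calibrated: chi2={stat_good:.1f}, p={p_good:.3f}")
print(f"distorted:  chi2={stat_bad:.1f}, p={p_bad:.3g}")
```

The distorted scores systematically under-predict, so their statistic should be far larger than that of the calibrated scores.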

5. Independence in Graphical Models

In probabilistic graphical models and causal inference, chi-square tests help:

Test conditional independence assumptions
Validate Bayesian network structure
Identify spurious correlations vs. causal relationships
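
A simple way to test conditional independence with chi-square is to run the test within each level of the conditioning variable and add up the statistics and degrees of freedom. The sketch below uses synthetic data in which X and Y both depend on a common cause Z but not on each other; the variable names, probabilities, and helper `contingency` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def contingency(a, b):
    """2-D table of counts for two integer-coded variables."""
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    return table

rng = np.random.default_rng(seed=5)
n = 4000
z = rng.integers(0, 2, size=n)                                 # common cause
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)   # depends only on z
y = (rng.random(n) < np.where(z == 1, 0.7, 0.3)).astype(int)   # depends only on z

# Marginal test: X and Y look dependent because both follow Z.
stat_marginal, p_marginal, dof_marginal, _ = stats.chi2_contingency(
    contingency(x, y), correction=False)

# Conditional test: sum the statistics and df across the strata of Z.
stat_cond, dof_cond = 0.0, 0
for value in np.unique(z):
    in_stratum = z == value
    s, _, d, _ = stats.chi2_contingency(
        contingency(x[in_stratum], y[in_stratum]), correction=False)
    stat_cond += s
    dof_cond += d
p_cond = stats.chi2.sf(stat_cond, dof_cond)

print(f"marginal:    chi2={stat_marginal:.1f}, df={dof_marginal}, p={p_marginal:.3g}")
print(f"given Z:     chi2={stat_cond:.1f}, df={dof_cond}, p={p_cond:.3f}")
```

The marginal association is spurious here: once Z is held fixed, the statistic drops to the range expected under independence. Stratifying only works when each stratum has enough data to satisfy the expected-frequency rule.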

Related Distributions

Distribution	Relationship to Chi-Square
Gamma(k/2, 1/2)	Chi-square IS Gamma with shape k/2 and rate 1/2
Exponential(1/2)	Chi-square with k=2 is Exponential with rate 1/2 (mean 2)
F-distribution	Ratio of two independent chi-squares, each divided by its own df
t-distribution	Standard Normal divided by sqrt(chi-square/k), with the two independent
Normal (large k)	Chi-square ≈ Normal(k, 2k) by CLT
Noncentral χ²	Sum of squared independent unit-variance normals with nonzero means
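
The t and F rows of this table can be checked by building both distributions from chi-square samples. This is a simulation sketch with arbitrary degrees of freedom (5 and 10), seed, and quantile level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)
n = 200_000
k1, k2 = 5, 10

z = rng.standard_normal(n)
u = stats.chi2.rvs(df=k1, size=n, random_state=rng)
v = stats.chi2.rvs(df=k2, size=n, random_state=rng)

t_samples = z / np.sqrt(v / k2)     # Z / sqrt(V / k)      -> t with k2 df
f_samples = (u / k1) / (v / k2)     # (U / k1) / (V / k2)  -> F(k1, k2)

q = 0.9
t_emp, t_theory = np.quantile(t_samples, q), stats.t.ppf(q, df=k2)
f_emp, f_theory = np.quantile(f_samples, q), stats.f.ppf(q, dfn=k1, dfd=k2)
print(f"t({k2}) 90th percentile: simulated={t_emp:.3f}, exact={t_theory:.3f}")
print(f"F({k1},{k2}) 90th percentile: simulated={f_emp:.3f}, exact={f_theory:.3f}")
```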

The Statistical Trinity

Chi-square, t, and F distributions form the backbone of classical statistics:

Chi-square: Variance estimation, goodness-of-fit
t-distribution: Mean estimation (unknown variance)
F-distribution: Comparing variances, ANOVA

All three are intimately connected through the Normal distribution.
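
For variance estimation specifically, the fact that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ for normal data gives a confidence interval for $\sigma^2$ by inverting two chi-square quantiles. A minimal sketch follows, assuming normally distributed data; the function name `variance_confidence_interval` and the synthetic sample are illustrative.

```python
import numpy as np
from scipy import stats

def variance_confidence_interval(sample, confidence=0.95):
    """CI for the variance of normally distributed data via chi-square quantiles."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s2 = sample.var(ddof=1)   # unbiased sample variance S^2
    alpha = 1 - confidence
    # (n-1) S^2 / sigma^2 ~ chi2(n-1), so invert the two tail quantiles.
    lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
    return s2, lower, upper

rng = np.random.default_rng(seed=7)
sample = rng.normal(loc=10.0, scale=2.0, size=30)   # synthetic data, true variance 4
s2, lower, upper = variance_confidence_interval(sample)
print(f"S^2 = {s2:.3f}, 95% CI for variance: [{lower:.3f}, {upper:.3f}]")
```

The interval is not symmetric around $S^2$ because the chi-square distribution is skewed. It is also sensitive to departures from normality, so treat it with caution for heavy-tailed data.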

Python Implementation

Basic Operations with SciPy

🐍chi_square_basics.py

from scipy import stats

# Create chi-square distribution with k=5 degrees of freedom
k = 5
chi2_dist = stats.chi2(df=k)

# PDF, CDF, and upper-tail probability
x = 5
print(f"PDF at x={x}: {chi2_dist.pdf(x):.4f}")
print(f"CDF at x={x}: {chi2_dist.cdf(x):.4f}")
print(f"P(X > {x}): {chi2_dist.sf(x):.4f}")  # sf = 1 - cdf, more accurate in the tail

# Critical values for common significance levels
print(f"95th percentile: {chi2_dist.ppf(0.95):.4f}")
print(f"99th percentile: {chi2_dist.ppf(0.99):.4f}")

# Summary statistics
print(f"Mean: {chi2_dist.mean():.4f}")  # Should equal k
print(f"Variance: {chi2_dist.var():.4f}")  # Should equal 2k

# Generate random samples
samples = chi2_dist.rvs(size=10000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")

Goodness-of-Fit Test

🐍goodness_of_fit.py

import numpy as np
from scipy import stats

# Example: Testing if a die is fair
# Observed counts from 600 rolls
observed = np.array([90, 105, 115, 95, 85, 110])

# Expected counts if fair (600/6 = 100 each)
expected = np.array([100, 100, 100, 100, 100, 100])

# Method 1: Using scipy.stats.chisquare
stat, pvalue = stats.chisquare(observed, f_exp=expected)
print(f"Chi-square statistic: {stat:.4f}")
print(f"P-value: {pvalue:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")

# Method 2: Manual calculation
chi_sq_manual = np.sum((observed - expected)**2 / expected)
df = len(observed) - 1
pvalue_manual = stats.chi2.sf(chi_sq_manual, df)  # upper-tail probability
print(f"Manual calculation: χ² = {chi_sq_manual:.4f}, p = {pvalue_manual:.4f}")

# Decision at α = 0.05
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df)
print(f"Critical value at α={alpha}: {critical_value:.4f}")
print(f"Reject H₀: {stat > critical_value}")

Test of Independence

🐍independence_test.py

import numpy as np
from scipy import stats

# Contingency table: Gender vs. Product Preference
# Rows: Male, Female
# Columns: Product A, Product B, Product C
contingency_table = np.array([
    [45, 35, 20],   # Male
    [30, 50, 25]    # Female
])

# Perform chi-square test of independence
stat, pvalue, dof, expected = stats.chi2_contingency(contingency_table)

print("Observed frequencies:")
print(contingency_table)
print("\nExpected frequencies (under independence):")
print(expected.round(2))
print(f"\nChi-square statistic: {stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {pvalue:.4f}")

# Calculate contributions from each cell
contributions = (contingency_table - expected)**2 / expected
print("\nContributions to chi-square:")
print(contributions.round(4))

Feature Selection with Chi-Square

🐍feature_selection_ml.py

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, SelectKBest
import numpy as np

# Load data (features are continuous measurements, not counts)
iris = load_iris()
X = iris.data
y = iris.target

# sklearn's chi2 requires non-negative features and treats them like counts.
# For continuous features, you might bin them first.
X_positive = X - X.min(axis=0)  # Shift each column so its minimum is 0

# Calculate chi-square scores for each feature
scores, pvalues = chi2(X_positive, y)

print("Feature chi-square scores:")
for name, score, pval in zip(iris.feature_names, scores, pvalues):
    print(f"  {name}: χ² = {score:.2f}, p = {pval:.4f}")

# Select top k features
k = 2
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_positive, y)

# Which features were selected?
selected_mask = selector.get_support()
selected_features = np.array(iris.feature_names)[selected_mask]
print(f"\nSelected features: {selected_features}")

Common Pitfalls

Expected Frequency Rule

Expected frequency should be ≥ 5 in each cell. With small expected frequencies, the chi-square approximation breaks down. Solutions:

Combine categories to increase expected counts
Use Fisher's exact test for small samples
Use Yates' continuity correction for 2×2 tables
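
A sketch of how this rule might be applied in code: compute the expected counts first, and fall back to Fisher's exact test when any of them is below 5. The 2×2 table is a made-up example.

```python
import numpy as np
from scipy import stats

table = np.array([[3, 1],
                  [1, 3]])   # hypothetical small 2x2 table

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
stat, p_chi2, dof, expected = stats.chi2_contingency(table)
print("Smallest expected count:", expected.min())

if expected.min() < 5:
    # Rule of thumb violated: fall back to Fisher's exact test.
    _, p_exact = stats.fisher_exact(table)
    print(f"Fisher exact p-value: {p_exact:.4f}; chi-square p-value: {p_chi2:.4f}")
```

Every expected count here is 2, so the chi-square approximation is unreliable and the exact p-value is the one to report.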

Independence Assumption

Each observation must be independent. If the same person can appear multiple times or observations are clustered, chi-square is invalid. Consider:

Clustered sampling methods
Time series with autocorrelation
Repeated measures designs

Categorical Data Only

Chi-square tests are for categorical data. For continuous data:

Bin into categories first (lose information)
Use Kolmogorov-Smirnov test instead
Use appropriate parametric tests

One-Tailed Nature

Chi-square goodness-of-fit and independence tests are right-tailed: large chi-square values indicate poor fit, and a value of 0 means the observed counts match the expected counts exactly. Unlike a two-sided t-test, you do not reject for small values of the statistic.

Test Your Understanding


If X ~ χ²(10), what is E[X]?

Summary

The chi-square distribution is foundational to statistical testing. It naturally arises from squared standard normal variables and provides the basis for testing goodness-of-fit, independence, and variance.

Key Formulas

Property	Formula
Definition	χ²_k = Z₁² + Z₂² + ... + Z_k² where Z_i ~ N(0,1)
PDF	f(x) = x^(k/2-1)e^(-x/2) / [2^(k/2)Γ(k/2)]
Mean	E[X] = k
Variance	Var(X) = 2k
Mode	k - 2 (for k ≥ 2)
Test Statistic	χ² = Σ(O_i - E_i)² / E_i
Gamma Form	χ²_k = Gamma(k/2, 1/2)

Key Takeaways

Sum of squared normals: Chi-square arises naturally from squaring and summing independent standard normal variables
Degrees of freedom = k: This single parameter controls shape, mean, and variance
Special case of Gamma: $\chi^2_k = \text{Gamma}(k/2, 1/2)$
Goodness-of-fit: Test if observed data matches expected distribution
Independence test: Test if categorical variables are related
ML feature selection: Chi-square score ranks categorical features by strength of association with the target
Foundation for other tests: the t and F distributions behind the t-test, F-test, and ANOVA are built from chi-square variables

The Essence of Chi-Square:

"How surprising is this data given what we expected?"

From testing fair dice to selecting ML features—chi-square quantifies the gap between expectation and reality.

Coming Next: In the next section, we'll explore the Student's t-distribution—essential for inference about means when the population variance is unknown.