Chapter 14

Hypothesis Testing Framework

Fundamentals of Testing

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Define null and alternative hypotheses precisely
  • Explain test statistics and their role in testing
  • Understand critical regions and rejection rules
  • Interpret the significance level (α) correctly
  • Distinguish one-tailed from two-tailed tests

🔧 Practical Skills

  • Formulate hypotheses for real-world problems
  • Choose appropriate test types for different scenarios
  • Calculate critical values and make test decisions
  • Interpret test results in practical contexts

🧠 AI/ML Applications

  • A/B Testing - Validate ML model improvements before deployment
  • Model Comparison - Statistically test if Model A outperforms Model B
  • Feature Selection - Use statistical tests to identify relevant features
  • Drift Detection - Test if production data distribution has changed
  • Fairness Testing - Statistical tests for bias in ML predictions

Where You'll Apply This: A/B tests, model evaluation, feature selection pipelines, experiment validation in research papers, and quality control in ML systems all rely on hypothesis testing.

The Big Picture: A Historical Journey

It's 1925. Scientists, manufacturers, and researchers face a fundamental problem: How do you make decisions about the world based on limited data? If a new fertilizer produces 10% more crop yield in a small trial, is that a real improvement or just random luck?

Two Foundational Approaches

Modern hypothesis testing emerged from two influential statisticians:

Sir Ronald Fisher (1890-1962)

Popularized the p-value as a measure of evidence. Focused on "how strongly does the data contradict the null hypothesis?"

Neyman & Pearson (1930s)

Formalized the framework of H₀ vs H₁, Type I/II errors, and power. Focused on decision-making with controlled error rates.

The Problem They Solved

Before hypothesis testing, scientists made ad-hoc judgments about their data. Two researchers could look at the same experiment and draw opposite conclusions. Hypothesis testing provides a principled, reproducible framework for making decisions under uncertainty.

The Fundamental Question

"Given my observed data, is there enough evidence to reject my default assumption?"


The Hypothesis Testing Framework

Hypothesis testing is a structured approach to decision-making. Think of it like a legal trial: there's a presumption of innocence (null hypothesis), and we need sufficient evidence to convict (reject the null).

Null and Alternative Hypotheses

The Two Hypotheses

Null Hypothesis (H₀)

$$H_0: \theta = \theta_0$$

The default assumption - typically "no effect" or "no difference." We assume H₀ is true until evidence suggests otherwise.

Like "innocent until proven guilty"

Alternative Hypothesis (H₁)

$$H_1: \theta \neq \theta_0$$

What we want to provide evidence for - typically an effect exists or there is a difference.

Like "guilty" - requires evidence

| Symbol | Meaning | Example |
| --- | --- | --- |
| H₀ | Null hypothesis (default) | The drug has no effect |
| H₁ (or Hₐ) | Alternative hypothesis | The drug reduces symptoms |
| θ | Parameter of interest | Population mean, proportion, etc. |
| θ₀ | Hypothesized value under H₀ | μ = 100, p = 0.5, etc. |

The Key Insight: We never "prove" H₀ is true. We either reject H₀ (finding sufficient evidence against it) or fail to reject H₀ (not finding sufficient evidence). "Fail to reject" is very different from "accept"!

Test Statistics

A test statistic is a function of the sample data that summarizes how much the observed data differs from what we'd expect under H₀.

Common Test Statistics

Z-Statistic (known σ)

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

t-Statistic (unknown σ)

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Intuition: The test statistic measures "how many standard errors away from the null hypothesis is our sample?" If H₀ is true, this value should be close to zero. Values far from zero, in either direction, suggest H₀ might be wrong.
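
To make the two formulas concrete, here is a minimal sketch that computes both statistics by hand. The helper names `z_statistic` and `t_statistic`, the sample values, and the assumed known σ = 15 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def z_statistic(sample, mu0, sigma):
    """Z-statistic for a sample mean when the population sigma is known."""
    sample = np.asarray(sample, dtype=float)
    return (sample.mean() - mu0) / (sigma / np.sqrt(sample.size))

def t_statistic(sample, mu0):
    """t-statistic for a sample mean when sigma is estimated from the data."""
    sample = np.asarray(sample, dtype=float)
    s = sample.std(ddof=1)  # sample standard deviation (n - 1 denominator)
    return (sample.mean() - mu0) / (s / np.sqrt(sample.size))

# Hypothetical data: 9 measurements, testing H0: mu = 100
scores = [104, 98, 110, 101, 107, 95, 103, 109, 100]
z = z_statistic(scores, mu0=100, sigma=15)  # sigma = 15 is an assumed known value
t = t_statistic(scores, mu0=100)
print(f"z = {z:.3f}, t = {t:.3f}")  # z = 0.600, t = 1.782
```

The two values differ because the Z-statistic divides by the assumed population σ, while the t-statistic divides by the (here much smaller) sample standard deviation.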

Critical Regions and Rejection Rules

The critical region (or rejection region) is the set of test statistic values that lead us to reject H₀.

Rejection Rule

For a two-tailed test at significance level α:

$$\text{Reject } H_0 \text{ if } |Z| > z_{\alpha/2}$$

For α = 0.05, the critical value is z₀.₀₂₅ = 1.96, so reject if |Z| > 1.96

Intuition: If our test statistic falls in the "extreme" tails of the distribution, we have observed something that would rarely happen if H₀ were true. We set a threshold (the critical value) to define "extreme enough to doubt H₀."
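
The rejection rule can be sketched in a few lines. The function name `two_tailed_decision` and the two observed z values are hypothetical; `scipy.stats.norm.ppf` is the inverse CDF of the standard normal and supplies the critical value.

```python
from scipy import stats

def two_tailed_decision(z, alpha=0.05):
    """Return (critical_value, reject) for a two-tailed z-test."""
    critical = stats.norm.ppf(1 - alpha / 2)  # z_{alpha/2}
    return critical, bool(abs(z) > critical)

# Two hypothetical observed statistics
for z_obs in (1.80, -2.30):
    crit, reject = two_tailed_decision(z_obs)
    print(f"z = {z_obs:+.2f}, critical = ±{crit:.3f}, reject H0: {reject}")
```

At α = 0.05 the first value (1.80) stays inside ±1.96 and the second (-2.30) falls in the left rejection region.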

Significance Level (α)

Significance Level Definition

$$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$$

The probability of a false positive (Type I error)

What α controls: When we set α = 0.05, we're saying: "I'm willing to incorrectly reject a true H₀ about 5% of the time." This is our tolerance for false alarms.

| α Value | Critical z (two-tailed) | Common Usage |
| --- | --- | --- |
| 0.10 | ±1.645 | Exploratory research |
| 0.05 | ±1.96 | Standard (most common) |
| 0.01 | ±2.576 | Conservative |
| 0.001 | ±3.291 | Very stringent (particle physics is stricter still, at 5σ) |

α is NOT the probability that H₀ is true! This is a common misinterpretation. α is P(Reject H₀ | H₀ true), not P(H₀ true | Rejected). These are very different quantities (see Bayes' theorem).
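
The two-tailed critical values quoted for each α can be reproduced from the standard normal quantile function. This is a small sketch; the variable names are illustrative.

```python
from scipy import stats

alphas = [0.10, 0.05, 0.01, 0.001]

# Two-tailed test: alpha is split evenly, so each tail holds alpha / 2
critical_values = {a: stats.norm.ppf(1 - a / 2) for a in alphas}

for a, c in critical_values.items():
    print(f"alpha = {a:<5} -> reject H0 if |Z| > {c:.3f}")
```

Smaller α pushes the critical value further into the tail, which is the price paid for fewer false alarms.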

Interactive: Hypothesis Testing Visualizer

Explore the complete hypothesis testing framework interactively. Adjust the significance level, test type, and test statistic to see how they affect the decision.

[Interactive figure: Hypothesis Testing Visualizer. The plot shows the standard normal distribution under H₀, with z (the test statistic) on the horizontal axis, the critical values ±z_{α/2} (±1.96 at α = 0.05), the rejection region (α), the p-value region, and the observed test statistic. Controls set the significance level, the test type, and the test statistic.]

How to read this visualization:

  • The blue curve is the standard normal distribution under H₀
  • The red shaded areas are the rejection regions (total area = α)
  • The purple line shows where your test statistic falls
  • If the test statistic falls in a red region, reject H₀
  • The blue overlay shows the p-value (the probability, assuming H₀ is true, of getting a result this extreme or more)
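
The p-value region described above corresponds to a tail area of the standard normal. The sketch below computes it for each test type; the function name `p_value_from_z` and the observed value z = 2.10 are hypothetical.

```python
from scipy import stats

def p_value_from_z(z, alternative="two-sided"):
    """Tail probability of a z-statistic under H0 (standard normal)."""
    if alternative == "two-sided":
        return 2 * stats.norm.sf(abs(z))  # both tails beyond |z|
    if alternative == "greater":
        return stats.norm.sf(z)           # right tail; sf(z) = 1 - cdf(z)
    if alternative == "less":
        return stats.norm.cdf(z)          # left tail
    raise ValueError(f"unknown alternative: {alternative!r}")

z_obs = 2.10  # hypothetical observed statistic
for alt in ("two-sided", "greater", "less"):
    print(f"{alt:>9}: p = {p_value_from_z(z_obs, alt):.4f}")
```

Rejecting when the p-value is below α is the same decision as rejecting when the statistic falls in the rejection region.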

One-Tailed vs Two-Tailed Tests

The choice between one-tailed and two-tailed tests depends on your research question:

Two-Tailed Test

$$H_1: \theta \neq \theta_0$$

Use when any deviation matters. "Is the coin biased?" (either direction)

Left-Tailed Test

$$H_1: \theta < \theta_0$$

Use when only decreases matter. "Does the drug reduce blood pressure?"

Right-Tailed Test

$$H_1: \theta > \theta_0$$

Use when only increases matter. "Is the new model better?"

Interactive: Critical Region Explorer

Compare how the rejection regions differ between one-tailed and two-tailed tests. Notice how one-tailed tests have a less extreme critical value, giving them more power to detect effects in the specified direction.

[Interactive figure: Critical Region Explorer: One-Tailed vs Two-Tailed Tests (α = 0.05). Three panels:]

  • Two-Tailed Test: reject if too extreme in either direction; critical values ±1.96, with α/2 in each tail. H₁: θ ≠ θ₀ (different, either direction)
  • Left-Tailed Test: reject only if significantly less than H₀; critical value -1.645, with all of α in the left tail. H₁: θ < θ₀ (less than)
  • Right-Tailed Test: reject only if significantly greater than H₀; critical value +1.645, with all of α in the right tail. H₁: θ > θ₀ (greater than)

| Property | Two-Tailed | Left-Tailed | Right-Tailed |
| --- | --- | --- | --- |
| Alternative H₁ | θ ≠ θ₀ | θ < θ₀ | θ > θ₀ |
| Critical Value | ±1.960 | -1.645 | +1.645 |
| Rejection Region | Both tails (α/2 each) | Left tail only (α) | Right tail only (α) |
| Use When | Any difference matters | Only decreases matter | Only increases matter |

💡 Key Insight

One-tailed tests have more power to detect effects in the specified direction because all of α is concentrated in one tail. The critical value for one-tailed tests (1.645) is less extreme than for two-tailed (1.960), making it easier to reject H₀. However, one-tailed tests cannot detect effects in the opposite direction.

Use Two-Tailed When:

  • You care about deviations in either direction
  • No prior hypothesis about direction of effect
  • Most common in exploratory research
  • Example: "Is this coin biased?" (either way)

Use One-Tailed When:

  • Strong prior reason to test one direction only
  • Effect in opposite direction is irrelevant
  • Example: "Does this drug reduce symptoms?"
  • Example: "Is new model better?" (not worse)
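
The practical consequence of the tail choice shows up when a statistic lands between the two thresholds. In this sketch the observed value z = 1.80 is hypothetical; the critical values come from the standard normal quantile function.

```python
from scipy import stats

alpha = 0.05
critical = {
    "two-sided": stats.norm.ppf(1 - alpha / 2),  # about 1.960
    "greater": stats.norm.ppf(1 - alpha),        # about +1.645
    "less": stats.norm.ppf(alpha),               # about -1.645
}

z_obs = 1.80  # hypothetical observed statistic
decisions = {
    "two-sided": bool(abs(z_obs) > critical["two-sided"]),
    "greater": bool(z_obs > critical["greater"]),
    "less": bool(z_obs < critical["less"]),
}

for alt, reject in decisions.items():
    print(f"{alt:>9}: critical = {critical[alt]:+.3f}, reject H0: {reject}")
```

The same data reject H₀ under the right-tailed test but not under the two-tailed one, which is why the direction has to be chosen before looking at the data rather than after.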

Where Test Statistics Come From

Why do we know the distribution of the test statistic under H₀? This comes from the Central Limit Theorem and properties of sampling distributions.

The Key Result

If we take repeated random samples of size n and compute the sample mean each time, the distribution of sample means (the sampling distribution) is approximately normal for sufficiently large n (and exactly normal when the population itself is normal), with:

Mean of sampling distribution

$$E[\bar{X}] = \mu$$

Standard error

$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

This means if H₀ is true (μ = μ₀), then:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$$

We know the distribution of Z under H₀, which is why we can compute critical values and p-values!

Interactive: Sampling Distribution Simulator

Watch the sampling distribution build up from repeated samples. See how the distribution of z-statistics approximates the standard normal under H₀.

[Interactive figure: Sampling Distribution Simulator. A histogram of observed z-statistics (z-statistic on the horizontal axis, count on the vertical axis) builds up against the theoretical N(0, 1) curve, with ±1.96 marked.]

Watch how the sampling distribution of the test statistic emerges from repeated sampling. Under H₀ (true mean = μ₀), the z-statistic follows a standard normal distribution.

💡 Key Insight

This is why hypothesis testing works! Under H₀, the test statistic follows a known distribution (standard normal in this case). About 5% of samples produce |z| > 1.96 even when H₀ is true. When we observe such an extreme z, we either witnessed a rare event (~5% chance) OR H₀ is false. We bet on the latter and reject H₀.
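
The "about 5%" claim can be checked by simulation. The sketch below draws many experiments from a world where H₀ is true; the population parameters, sample size, number of experiments, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu0, sigma, n, n_experiments = 100.0, 15.0, 30, 20_000

# Every experiment is sampled with true mean mu0, so H0 holds by construction
samples = rng.normal(loc=mu0, scale=sigma, size=(n_experiments, n))
z_values = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))

# Fraction of experiments that would (wrongly) reject H0 at alpha = 0.05
false_positive_rate = np.mean(np.abs(z_values) > 1.96)

print(f"mean of z: {z_values.mean():.3f}, std of z: {z_values.std():.3f}")
print(f"share with |z| > 1.96: {false_positive_rate:.3f}")
```

The z-values should have mean near 0 and standard deviation near 1, and the rejection share should sit close to 0.05, up to simulation noise.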


The Complete Testing Process

Here's the systematic procedure for conducting a hypothesis test:

  1. State the hypotheses:
    Define H₀ (null) and H₁ (alternative) clearly based on the research question.
  2. Choose significance level α:
    Common choices: 0.05, 0.01, 0.001. Set this before looking at data.
  3. Select the appropriate test statistic:
    Based on what you know (σ known → Z-test, σ unknown → t-test, proportions → z for proportions, etc.)
  4. Collect data and compute the test statistic:
    Calculate the observed value of your test statistic from the sample.
  5. Determine the critical value or p-value:
    Find the threshold for rejection or the probability of observing such an extreme result.
  6. Make a decision:
    If the test statistic falls in the rejection region (for a two-tailed test, |test statistic| > critical value), or equivalently p-value < α: Reject H₀
    Otherwise: Fail to reject H₀
  7. Interpret in context:
    State what the decision means for the original research question.
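
The seven steps above can be sketched end to end as a left-tailed one-sample t-test. The scenario (a 200 ms latency target), the measurements, and all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Step 1: hypotheses. H0: mu = 200 ms, H1: mu < 200 ms (hypothetical target)
mu0 = 200.0

# Step 2: significance level, fixed before looking at the data
alpha = 0.05

# Step 3: sigma is unknown, so use a one-sample t-test

# Step 4: collect data (made-up latencies) and compute the test statistic
latencies = np.array([192, 205, 188, 197, 201, 185, 194, 199, 190, 196], dtype=float)
n = latencies.size
se = latencies.std(ddof=1) / np.sqrt(n)
t_stat = (latencies.mean() - mu0) / se

# Step 5: critical value and p-value for a left-tailed test
df = n - 1
critical = stats.t.ppf(alpha, df)
p_value = stats.t.cdf(t_stat, df)

# Step 6: decision
reject = bool(t_stat < critical)

# Step 7: interpretation in context
print(f"t = {t_stat:.3f}, critical = {critical:.3f}, p = {p_value:.4f}")
print("Mean latency is significantly below 200 ms" if reject
      else "Insufficient evidence that mean latency is below 200 ms")
```

With these made-up numbers the statistic falls below the critical value, so H₀ is rejected at α = 0.05.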

Worked Examples


Applications in AI/ML

Hypothesis testing is essential in machine learning and AI. Here's how it's used in practice:

🧪 A/B Testing for ML Models

Google, Netflix, Meta, Amazon - all use A/B testing to validate model improvements. Before deploying a new recommendation algorithm, they test it against the current version using hypothesis testing to ensure improvements are real, not random noise.

🎯 Feature Selection

Statistical tests (chi-square, ANOVA F-test) help identify which features are significantly associated with the target variable. This reduces dimensionality and improves model interpretability.

⚠️ Drift Detection

ML models in production degrade when the data distribution changes (data drift; a change in the relationship between inputs and target is called concept drift). Statistical tests (KS test, chi-square) detect when production data significantly differs from training data, triggering model retraining.

⚖️ Fairness Testing

Testing whether ML models treat different demographic groups fairly. Hypothesis tests check for disparate impact - whether prediction rates differ significantly between protected groups.
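
One way to frame such a check is a chi-square test of independence between group membership and the model's prediction. This is only a sketch: the counts below are invented, and a real fairness audit would involve more than a single test.

```python
import numpy as np
from scipy import stats

# Hypothetical prediction counts per group: [positive, negative]
observed = np.array([
    [120, 380],  # group A: 24% positive predictions
    [90, 410],   # group B: 18% positive predictions
])

# H0: the positive-prediction rate is the same in both groups
# (for 2x2 tables SciPy applies Yates' continuity correction by default)
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

rates = observed[:, 0] / observed.sum(axis=1)
print(f"positive rates: A = {rates[0]:.2%}, B = {rates[1]:.2%}")
print(f"chi-square = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

A small p-value says the rate difference is unlikely to be sampling noise; whether a difference of that size matters is a separate, practical question.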

Interactive: A/B Testing Simulator

Experience a realistic A/B testing scenario. Compare two ML models and determine when you have enough statistical evidence to declare a winner.

[Interactive figure: A/B Testing Simulator: ML Model Comparison. Two panels report users, conversions, and conversion rate for Model A (control, current) and Model B (treatment, new). The simulation parameters are the true rates, which are unknown in real life; in the default setting the true difference is 2.0% absolute, a 20.0% relative lift.]

Simulate an A/B test comparing two ML models. Model A is the current production model (control), Model B is a new model (treatment). The conversion rate represents the proportion of users who take a desired action (click, purchase, etc.).

🧠 How This Works in Real ML Systems

H₀: pA = pB (Models have the same conversion rate)

H₁: pA ≠ pB (Models have different conversion rates)

Companies like Google, Netflix, Meta, and Amazon use this exact framework to:

  • Validate new ML model improvements before full deployment
  • Test UI changes that affect user engagement
  • Compare recommendation algorithms
  • Measure the impact of ranking changes

⚠️ Warning: Peeking Problem

In real A/B tests, do NOT stop the test early just because you see a significant p-value! Repeated checking inflates the false positive rate (this is called the "peeking problem"). Either use sequential testing methods or pre-determine your sample size and only check significance once at the end.
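
The inflation caused by peeking can be demonstrated with a simulation in which both arms share the same true rate, so every "significant" result is a false positive. All settings (rate, sample sizes, number of looks, seed) are arbitrary assumptions for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_sims, n_max, check_every = 2_000, 1_000, 100
p_true = 0.10   # same conversion rate in both arms, so H0 is true
alpha = 0.05

def p_value(conv_a, conv_b, n):
    """Two-sided pooled two-proportion z-test with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # no variation at all: no evidence of a difference
    z = ((conv_b - conv_a) / n) / se
    return 2 * stats.norm.sf(abs(z))

peeking_hits = 0  # experiments "significant" at any interim look
final_hits = 0    # experiments significant at the single final look
for _ in range(n_sims):
    cum_a = (rng.random(n_max) < p_true).cumsum()
    cum_b = (rng.random(n_max) < p_true).cumsum()
    looks = range(check_every, n_max + 1, check_every)
    p_values = [p_value(cum_a[n - 1], cum_b[n - 1], n) for n in looks]
    peeking_hits += any(p < alpha for p in p_values)
    final_hits += int(p_values[-1] < alpha)

peeking_rate = peeking_hits / n_sims
final_rate = final_hits / n_sims
print(f"false positive rate, one final check: {final_rate:.3f}")
print(f"false positive rate, stop at first p < alpha: {peeking_rate:.3f}")
```

Checking once at the end keeps the error rate near α, while stopping at the first significant look out of ten pushes it well above α.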


Common Misconceptions

The Replication Crisis: Many published "significant" findings have failed to replicate because of: (1) p-hacking, (2) publication bias, (3) small sample sizes, and (4) misunderstanding of what p-values mean. Always report effect sizes and confidence intervals alongside p-values.
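
As a sketch of what reporting effect sizes and confidence intervals alongside p-values can look like for a one-sample test, the example below uses Cohen's d as the effect size. The data and the null value of 100 are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, testing H0: mu = 100
data = np.array([104, 98, 110, 101, 107, 95, 103, 109, 100], dtype=float)
mu0 = 100.0

n = data.size
mean, s = data.mean(), data.std(ddof=1)
se = s / np.sqrt(n)

# p-value from a two-sided one-sample t-test
t_stat = (mean - mu0) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Effect size: Cohen's d (difference in units of standard deviations)
cohens_d = (mean - mu0) / s

# 95% confidence interval for the population mean
margin = stats.t.ppf(0.975, df=n - 1) * se
ci_low, ci_high = mean - margin, mean + margin

print(f"p-value = {p_value:.3f}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Here the interval contains 100 and the p-value exceeds 0.05, yet the estimated effect is moderate in size: with only nine observations the test simply cannot say much either way.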

Python Implementation

```python
import numpy as np
from scipy import stats

# ============================================
# One-Sample t-Test
# ============================================

def one_sample_ttest(
    data: np.ndarray,
    hypothesized_mean: float,
    alpha: float = 0.05,
    alternative: str = 'two-sided'  # 'less', 'greater', 'two-sided'
) -> dict:
    """
    Perform a one-sample t-test.

    Parameters
    ----------
    data : array-like
        Sample data
    hypothesized_mean : float
        The mean under H₀
    alpha : float
        Significance level
    alternative : str
        'two-sided', 'less', 'greater'

    Returns
    -------
    dict with test results
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  # Sample std (n-1)
    se = sample_std / np.sqrt(n)

    # t-statistic
    t_stat = (sample_mean - hypothesized_mean) / se
    df = n - 1

    # p-value and critical value
    # (sf is the survival function, 1 - cdf, and is more accurate in the tails)
    if alternative == 'two-sided':
        p_value = 2 * stats.t.sf(abs(t_stat), df)
        critical = stats.t.ppf(1 - alpha/2, df)
        reject = abs(t_stat) > critical
    elif alternative == 'greater':
        p_value = stats.t.sf(t_stat, df)
        critical = stats.t.ppf(1 - alpha, df)
        reject = t_stat > critical
    elif alternative == 'less':
        p_value = stats.t.cdf(t_stat, df)
        critical = stats.t.ppf(alpha, df)
        reject = t_stat < critical
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

    return {
        'sample_mean': sample_mean,
        't_statistic': t_stat,
        'p_value': p_value,
        'critical_value': critical,
        'reject_h0': bool(reject),
        'alpha': alpha,
        'alternative': alternative,
        'degrees_freedom': df
    }


# ============================================
# Two-Proportion Z-Test (A/B Testing)
# ============================================

def two_proportion_ztest(
    successes_a: int,
    n_a: int,
    successes_b: int,
    n_b: int,
    alpha: float = 0.05,
    alternative: str = 'two-sided'
) -> dict:
    """
    Two-proportion z-test for A/B testing.

    H₀: p_a = p_b
    'greater' tests H₁: p_b > p_a; 'less' tests H₁: p_b < p_a.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    # Pooled proportion under H₀
    p_pooled = (successes_a + successes_b) / (n_a + n_b)

    # Standard error under H₀
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))

    # z-statistic (se is 0 only when every user or no user converted;
    # the data then carry no evidence of a difference, so z is set to 0)
    z_stat = (p_b - p_a) / se if se > 0 else 0.0

    # p-value
    if alternative == 'two-sided':
        p_value = 2 * stats.norm.sf(abs(z_stat))
    elif alternative == 'greater':
        p_value = stats.norm.sf(z_stat)
    elif alternative == 'less':
        p_value = stats.norm.cdf(z_stat)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

    return {
        'p_a': p_a,
        'p_b': p_b,
        'difference': p_b - p_a,
        'relative_lift': (p_b - p_a) / p_a if p_a > 0 else 0,
        'z_statistic': z_stat,
        'p_value': p_value,
        'reject_h0': bool(p_value < alpha),
        'alpha': alpha
    }


# ============================================
# Using SciPy (recommended for production)
# ============================================

rng = np.random.default_rng(42)  # seeded so the example is reproducible

# One-sample t-test
data = rng.normal(loc=102, scale=15, size=50)
result = stats.ttest_1samp(data, popmean=100)
print(f"t-statistic: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")

# Two-sample t-test (comparing two groups; assumes equal variances by
# default, pass equal_var=False for Welch's t-test)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=105, scale=15, size=50)
result = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")

# Chi-square test for independence (categorical features)
observed = np.array([[50, 30], [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p:.4f}")

# Kolmogorov-Smirnov test (distribution comparison)
sample1 = rng.normal(0, 1, 100)
sample2 = rng.normal(0.5, 1, 100)  # Slightly shifted
ks_stat, p_value = stats.ks_2samp(sample1, sample2)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.4f}")
```

Knowledge Check

Test your understanding of the hypothesis testing framework with this interactive quiz.

[Interactive quiz: Hypothesis Testing Knowledge Check, 8 questions. Sample question: What is the null hypothesis (H₀) in a hypothesis test?]

Summary

Key Takeaways

  1. Hypothesis testing is a decision framework: We formalize questions about populations, collect sample evidence, and make principled decisions.
  2. H₀ is the null hypothesis (default assumption): We assume it's true and only reject it with sufficient evidence. Like "innocent until proven guilty."
  3. The test statistic measures evidence against H₀: It tells us how extreme our data is compared to what we'd expect if H₀ were true.
  4. α controls the false positive rate: P(Reject H₀ | H₀ true). It is NOT the probability that H₀ is true.
  5. "Fail to reject" ≠ "Accept H₀": Absence of evidence is not evidence of absence. We may just need more data.
  6. Critical for AI/ML: A/B testing, model comparison, feature selection, drift detection, and fairness testing all rely on hypothesis testing.
Looking Ahead: In the next section, we'll explore the two types of errors in hypothesis testing (Type I and Type II errors) and understand the tradeoffs involved in choosing significance levels.