Chapter 15

Z-Tests and T-Tests

Common Statistical Tests

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand when to use Z-tests vs t-tests
  • Derive and interpret the t-statistic formula
  • Explain why the t-distribution has heavier tails
  • Distinguish one-sample, two-sample, and paired t-tests
  • Calculate degrees of freedom for each test type

🔧 Practical Skills

  • Choose the appropriate test for different scenarios
  • Perform Z-tests and t-tests by hand and in Python
  • Interpret test results and p-values correctly
  • Understand assumptions and when they can be relaxed

🧠 AI/ML Applications

  • A/B Testing - Compare conversion rates between model variants
  • Model Evaluation - Test if a new model significantly outperforms the baseline
  • Hyperparameter Tuning - Verify improvements are not due to random variation
  • Paired Comparisons - Compare models on the same test set (paired t-test)
  • Sample Size Planning - Determine how much data you need for reliable conclusions
Central Message: Z-tests and t-tests are the workhorses of parametric hypothesis testing. Understanding when and how to use each test is essential for any data scientist making evidence-based decisions.

The Big Picture: A Historical Journey

The story of Z-tests and t-tests begins in the early 20th century, when scientists and industrialists faced a fundamental problem: how do you draw reliable conclusions from limited data?

🏭

The Industrial Revolution Problem

Factories needed quality control. Breweries needed consistent products. Scientists needed to test hypotheses. But running large experiments was expensive. Could you trust conclusions from small samples?

The Z-test worked when population parameters were known (rare in practice). But what if you only had a small sample and had to estimate the variance? This extra uncertainty needed to be accounted for.

The solution came from an unlikely source: a brewer at Guinness in Dublin, working under a pseudonym to protect trade secrets. His discovery would revolutionize small-sample statistics.


The Z-Test: When Sigma is Known

The Z-test is the foundational test for means when the population standard deviation $\sigma$ is known. While this situation is rare in practice, understanding the Z-test provides crucial intuition for all parametric tests.

Z-Test Formulation

One-Sample Z-Test Statistic

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Under H₀, Z follows a standard normal distribution N(0,1)

| Symbol | Meaning | Known From |
|---|---|---|
| $\bar{X}$ | Sample mean | Calculated from data |
| μ₀ | Hypothesized population mean | Null hypothesis |
| σ | Population standard deviation | Prior knowledge (RARE) |
| n | Sample size | Study design |
| σ/√n | Standard error of the mean | Known formula |

Why it works: By the Central Limit Theorem, the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Under H₀ (μ = μ₀), standardizing with μ₀ therefore gives Z ~ N(0,1), exactly for normal data and approximately otherwise.

Worked Example: Quality Control

Example: Precision Manufacturing

A machine produces metal rods that should have a mean length of 100 mm. From historical data (thousands of measurements), we know $\sigma = 2$ mm. A sample of 36 rods has mean length 100.8 mm. Has the machine drifted from its target?

Setup
$$H_0: \mu = 100$$
$$H_1: \mu \neq 100$$

Two-tailed test at α = 0.05

Calculation
$$Z = \frac{100.8 - 100}{2/\sqrt{36}} = \frac{0.8}{0.333} = 2.4$$
Decision

Critical value at α = 0.05 is ±1.96. Since |Z| = 2.4 > 1.96, we reject H₀.

The p-value = 2 × (1 - Φ(2.4)) = 2 × 0.0082 = 0.0164 < 0.05

Conclusion: There is significant evidence that the machine has drifted from its target of 100 mm. The production line should be recalibrated.
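The arithmetic in this worked example can be checked in a few lines. This is a minimal sketch that uses only the numbers stated above; the variable names are illustrative, and `scipy.stats.norm` supplies the standard normal tail probability and critical value.

```python
from math import sqrt
from scipy import stats

# Values taken from the worked example above
x_bar, mu_0, sigma, n = 100.8, 100.0, 2.0, 36
alpha = 0.05

standard_error = sigma / sqrt(n)            # 2 / 6, about 0.333
z = (x_bar - mu_0) / standard_error         # about 2.4
p_value = 2 * stats.norm.sf(abs(z))         # two-tailed p-value, about 0.0164
z_crit = stats.norm.ppf(1 - alpha / 2)      # about 1.96

print(f"Z = {z:.2f}, p = {p_value:.4f}, reject H0: {abs(z) > z_crit}")
```

Using `norm.sf` (the survival function) rather than `1 - norm.cdf` avoids loss of precision for large |Z|.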

When do we actually know σ? Almost never in practice! Possible scenarios include: (1) standardized tests with established norms, (2) instruments with known measurement error, (3) historical processes with massive prior data. In most real applications, we use the t-test instead.

The t-Test: Accounting for Unknown Variance

In reality, we almost never know the true population standard deviation. We must estimate it from the sample using the sample standard deviation $s$. This introduces additional uncertainty that the Z-test ignores.

William Sealy Gosset and Student's t

🍺

The Guinness Statistician (1908)

William Sealy Gosset worked as a chemist at Guinness Brewery in Dublin. He needed to analyze small batches of barley and hops but couldn't run large experiments.

Guinness prohibited employees from publishing, so Gosset used the pseudonym "Student". His 1908 paper introduced what we now call Student's t-distribution.

"The problem is to find a way of using the sample to test a hypothesis about the population when the only knowledge of the population we have is what the sample tells us."

Gosset's insight was profound: when we replace the unknown $\sigma$ with the sample estimate $s$, the resulting statistic no longer follows a normal distribution. It follows a new distribution with heavier tails that depends on the sample size.

One-Sample t-Test Statistic

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Under H₀, t follows a Student's t-distribution with df = n - 1

Properties of the t-Distribution

Key Properties

  • Symmetric about zero (like the normal)
  • Unimodal (single peak at center)
  • Bell-shaped but with heavier tails
  • Defined by degrees of freedom (df)
  • Converges to N(0,1) as df → ∞

Why Heavier Tails?

When we estimate $\sigma$ with $s$, sometimes we underestimate it (making |t| too large) and sometimes we overestimate it (making |t| too small).

This extra randomness makes extreme t-values more likely than extreme Z-values. With small n, the estimate $s$ is unstable, so the tails are heavier.
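A small simulation makes the heavier tails visible. The sketch below assumes normal data with true mean 0 and true σ = 1 (so H₀ holds), draws many samples of size n = 5, and compares how often the statistic exceeds 1.96 in magnitude when the true σ is used versus when the estimate s is used. The seed, sample size, and repetition count are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 100_000

# Many small samples from N(0, 1), so the null hypothesis mu = 0 is true
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
means = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)

z_stats = means / (1.0 / np.sqrt(n))   # standardized with the true sigma = 1
t_stats = means / (s / np.sqrt(n))     # standardized with the estimate s

frac_z = np.mean(np.abs(z_stats) > 1.96)       # close to the nominal 5%
frac_t = np.mean(np.abs(t_stats) > 1.96)       # noticeably larger
exact_t = 2 * stats.t.sf(1.96, df=n - 1)       # exact tail area of t with 4 df

print(f"|Z| > 1.96: {frac_z:.3f}")
print(f"|t| > 1.96: {frac_t:.3f} (exact t tail: {exact_t:.3f})")
```

The simulated fraction for t should agree with the exact tail area of the t-distribution with n - 1 degrees of freedom, which is why the t critical value must be larger than 1.96 to keep the Type I error at 5%.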

| df | t₀.₀₂₅ (two-tailed) | z₀.₀₂₅ | Difference |
|---|---|---|---|
| 5 | 2.571 | 1.96 | +31% |
| 10 | 2.228 | 1.96 | +14% |
| 20 | 2.086 | 1.96 | +6% |
| 30 | 2.042 | 1.96 | +4% |
| ∞ | 1.96 | 1.96 | 0% |

Rule of Thumb: When n ≥ 30, the t-distribution is close enough to the normal that the practical difference is small. However, with modern computing, there's no reason not to use the exact t-distribution regardless of sample size.
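The critical values in the table above can be reproduced with `scipy.stats.t.ppf`. This is a short sketch; the extra df values (100 and 1000) are added only to show the convergence toward the normal critical value.

```python
from scipy import stats

z_crit = float(stats.norm.ppf(0.975))   # two-tailed alpha = 0.05

# Upper 2.5% critical value of the t-distribution for several df
t_crits = {df: float(stats.t.ppf(0.975, df)) for df in [5, 10, 20, 30, 100, 1000]}

for df, t_crit in t_crits.items():
    excess = 100 * (t_crit / z_crit - 1)
    print(f"df={df:5d}  t_crit={t_crit:.3f}  z_crit={z_crit:.3f}  excess={excess:.1f}%")
```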

Interactive: Z vs t Distribution Comparison

Explore how the t-distribution differs from the standard normal. Adjust the degrees of freedom to see how the t-distribution converges to the normal as df increases.

[Interactive figure: Z-Distribution vs t-Distribution Comparison. Curves: Z (Standard Normal) and t (df = 5). Critical values shown: Z critical ±1.960, t critical ±2.571. x-axis: Test Statistic Value; y-axis: Probability Density.]

Key Observations

  • Heavier Tails: The t-distribution has heavier tails than the normal, especially with low df.
  • Wider Critical Values: t-critical values are always larger than z-critical values (more conservative).
  • Convergence: As df increases, t approaches the standard normal (try df = 30+).
  • Rule of Thumb: When df ≥ 30, the difference becomes negligible for practical purposes.

One-Sample t-Test

The one-sample t-test tests whether a population mean equals a specified value when the population standard deviation is unknown.

One-Sample t-Test Procedure

  1. State hypotheses:
    $$H_0: \mu = \mu_0 \quad \text{vs} \quad H_1: \mu \neq \mu_0$$

    (or one-tailed: < or >)

  2. Calculate sample statistics:
    $$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$$
  3. Compute the t-statistic:
    $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
  4. Determine degrees of freedom: df = n - 1
  5. Find p-value or compare to critical value
  6. Make decision and interpret in context
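The steps above can be sketched by hand and checked against `scipy.stats.ttest_1samp`. The latency values and the hypothesized mean of 50 ms are the ones used in this section's Python implementation; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu = 50 vs H1: mu != 50 (two-tailed)
latencies = np.array([54, 48, 56, 51, 53, 49, 55, 52, 50, 54])
mu_0 = 50

# Step 2: sample statistics (ddof=1 gives the n - 1 denominator)
n = latencies.size
x_bar = latencies.mean()
s = latencies.std(ddof=1)

# Step 3: t-statistic
t_stat = (x_bar - mu_0) / (s / np.sqrt(n))

# Step 4: degrees of freedom
df = n - 1

# Step 5: two-tailed p-value from the t-distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's implementation
check = stats.ttest_1samp(latencies, mu_0)
print(f"by hand: t = {t_stat:.4f}, p = {p_value:.4f}, df = {df}")
print(f"scipy:   t = {check.statistic:.4f}, p = {check.pvalue:.4f}")
```

Step 6, the decision, then compares `p_value` with the chosen α and states the conclusion in terms of the original question.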


Two-Sample t-Test

The two-sample t-test (also called independent samples t-test) compares the means of two independent groups to determine if they come from populations with equal means.

Two-Sample t-Test Statistic

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\text{SE}(\bar{X}_1 - \bar{X}_2)}$$

Testing H₀: μ₁ = μ₂ (or equivalently, μ₁ - μ₂ = 0)

Pooled vs Welch's t-Test

Pooled t-Test

Assumption: Equal variances (σ₁² = σ₂²)

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$

$$\text{SE} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

df = n₁ + n₂ - 2

Welch's t-Test

No assumption: Variances can differ

$$\text{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

df ≈ Welch-Satterthwaite approximation (non-integer)

Default in R's t.test. In SciPy, ttest_ind defaults to the pooled test (equal_var=True), so request Welch's test explicitly with equal_var=False.

Modern Best Practice: Use Welch's t-test by default. It is more robust and performs well even when variances are equal. The pooled t-test only has slightly more power when variances are truly equal - a condition we can rarely verify.
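To make the two standard errors and the Welch-Satterthwaite degrees of freedom concrete, the sketch below computes both tests by hand and checks them against `scipy.stats.ttest_ind`. The two samples are hypothetical values chosen to have clearly different spreads and sizes.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with different variances and sample sizes
group_1 = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2])
group_2 = np.array([13.5, 10.2, 14.8, 11.1, 15.3, 9.9, 13.0, 12.6])

n1, n2 = group_1.size, group_2.size
v1, v2 = group_1.var(ddof=1), group_2.var(ddof=1)
mean_diff = group_1.mean() - group_2.mean()

# Welch: separate variances, Welch-Satterthwaite degrees of freedom
se_welch = np.sqrt(v1 / n1 + v2 / n2)
t_welch = mean_diff / se_welch
df_welch = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)

# Pooled: one shared variance estimate, df = n1 + n2 - 2
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled = mean_diff / se_pooled
df_pooled = n1 + n2 - 2

# Cross-check against SciPy (equal_var selects the variant)
welch = stats.ttest_ind(group_1, group_2, equal_var=False)
pooled = stats.ttest_ind(group_1, group_2, equal_var=True)

print(f"Welch:  t = {t_welch:.4f}, df = {df_welch:.2f}, p = {p_welch:.4f}")
print(f"Pooled: t = {t_pooled:.4f}, df = {df_pooled}")
```

The Welch degrees of freedom are generally non-integer and never exceed n₁ + n₂ - 2, which is how the test pays for not assuming equal variances.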

Paired t-Test

The paired t-test (also called dependent samples t-test) is used when observations come in natural pairs: the same subjects measured twice, or matched pairs.

Paired t-Test Statistic

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where $d_i = X_{i,\text{after}} - X_{i,\text{before}}$ and df = n - 1
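The formula above says that a paired t-test is simply a one-sample t-test on the differences. A minimal sketch with hypothetical before/after measurements shows that the hand computation, `scipy.stats.ttest_rel`, and `scipy.stats.ttest_1samp` on the differences all agree.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (same six subjects, measured twice)
before = np.array([12.0, 15.5, 9.8, 14.2, 11.1, 13.7])
after = np.array([11.2, 14.9, 9.9, 13.0, 10.4, 13.1])

d = after - before                  # per-pair differences d_i
n = d.size
d_bar = d.mean()
s_d = d.std(ddof=1)                 # standard deviation of the differences

t_by_hand = d_bar / (s_d / np.sqrt(n))
p_by_hand = 2 * stats.t.sf(abs(t_by_hand), df=n - 1)

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(d, 0.0)

print(f"by hand:     t = {t_by_hand:.4f}, p = {p_by_hand:.4f}")
print(f"ttest_rel:   t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"ttest_1samp: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```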

| Scenario | Why Paired? |
|---|---|
| Before/After Treatment | Same subjects measured twice |
| Left/Right (eyes, hands) | Naturally matched within subject |
| Matched Case-Control | Pairs matched on confounders |
| Same ML Model, Different Seeds | Same model architecture tested multiple times |
| Cross-validation Folds | Same data split compared across models |

The Power Advantage: Paired tests are more powerful than independent tests when pairs are correlated. By computing differences, we eliminate between-subject variability and focus only on within-subject changes.

Interactive: Paired vs Unpaired Comparison

Compare the power of paired vs independent t-tests. See how correlation between pairs affects the relative performance of each test.

Paired vs Independent (Unpaired) t-Test Comparison

When observations are naturally paired (same subjects measured twice, matched pairs), the paired t-test is more powerful because it removes between-subject variability.

[Interactive figure: Measurement 1 (Before/Control) versus Measurement 2 (After/Treatment) for each pair.]

Snapshot of the widget output (observed correlation: r = 0.865):

| Quantity | Paired t-Test | Independent t-Test (ignoring pairing) |
|---|---|---|
| t-statistic | 1.160 | 0.531 |
| df | 14 | 28 |
| SE | 2.117 | 4.627 |
| p-value | 0.2653 | 0.5996 |

Higher correlation = larger advantage for paired test

Why Paired Tests Have More Power

The standard error for the paired test is based on the variance of differences, which accounts for correlation between measurements:

$$\text{SE}_{\text{paired}} = \frac{s_d}{\sqrt{n}}$$

Uses within-pair variability only

$$\text{SE}_{\text{indep}} = \sqrt{s_p^2\left(\frac{1}{n} + \frac{1}{n}\right)}$$

Includes all between-subject variability

When pairs are positively correlated, Cov(X,Y) > 0, so Var(X-Y) = Var(X) + Var(Y) - 2Cov(X,Y) is smaller than the Var(X) + Var(Y) that results from treating the groups as independent. Higher correlation = more variance reduction = more power!
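The variance identity and its effect on the test can be checked numerically. The sketch below uses hypothetical data in which subjects differ a lot from one another but each subject changes only slightly, so the pairs are strongly correlated; it then runs both tests on the same numbers.

```python
import numpy as np
from scipy import stats

# Hypothetical data: large between-subject spread, small within-subject change
before = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
after = before + np.array([1, 2, 1, 3, 2, 1, 2, 3], dtype=float)

# Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y)
var_diff = np.var(after - before, ddof=1)
cov_xy = np.cov(after, before, ddof=1)[0, 1]
var_identity = np.var(after, ddof=1) + np.var(before, ddof=1) - 2 * cov_xy

r = np.corrcoef(after, before)[0, 1]

paired = stats.ttest_rel(after, before)
independent = stats.ttest_ind(after, before, equal_var=False)  # ignores pairing

print(f"correlation r = {r:.4f}")
print(f"Var(diff) = {var_diff:.4f}, identity gives {var_identity:.4f}")
print(f"paired:      t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"independent: t = {independent.statistic:.3f}, p = {independent.pvalue:.4f}")
```

The mean difference is identical in both tests; only the standard error changes, and with it the conclusion.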


Interactive: Sample Size Effect

Explore how sample size affects the t-distribution, standard error, and statistical power. See how larger samples make it easier to detect real effects.

Sample Size Effect on t-Distribution

[Interactive figure: Distribution of t-statistics (n = 10, df = 9) with "Fail to Reject" and "Reject H0" regions; x-axis: t-Statistic.]

Settings shown: H₀: μ = 100 (null hypothesis); truth: μ = 105 (real population mean); effect size d = 0.33.

Small n (n < 30)

  • Heavy-tailed t-distribution
  • Wider critical values
  • Lower power to detect effects
  • More uncertainty in estimate

Medium n (30-100)

  • t approaches normal
  • Moderate power
  • Standard error decreases
  • More reliable estimates

Large n (n > 100)

  • t nearly identical to Z
  • High power
  • Can detect small effects
  • Practical vs statistical significance
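The effect of sample size on power can be estimated by simulation. The sketch below uses the scenario from this section (H₀: μ = 100, true μ = 105, d = 0.33) and assumes normal data with σ = 15, which is the standard deviation implied by d ≈ 5/15; the seed and repetition count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu_0, mu_true = 100.0, 105.0
sigma = 15.0            # assumption: implied by d = (105 - 100) / sigma = 0.33
alpha, reps = 0.05, 5000

def simulated_power(n):
    """Fraction of simulated samples of size n in which H0 is rejected."""
    samples = rng.normal(mu_true, sigma, size=(reps, n))
    p_values = stats.ttest_1samp(samples, mu_0, axis=1).pvalue
    return float(np.mean(p_values < alpha))

power_by_n = {n: simulated_power(n) for n in [10, 30, 100]}
for n, power in power_by_n.items():
    print(f"n = {n:3d}: estimated power = {power:.3f}")
```

Because the true mean really is 105, every rejection here is a correct decision, so the rejection rate estimates the power of the test at that sample size.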

Interactive: t-Test Calculator

Use this calculator to run one-sample, two-sample, or paired t-tests with your own data. Enter comma-separated values and see the full statistical output.



Decision Guide: Which Test to Use?

Test Selection Flowchart

Question 1: Do you know the population standard deviation σ?

Yes → Use Z-test (rare)
No → Continue to Question 2

Question 2: How many groups are you comparing?

1 group → One-sample t-test
2 groups → Continue to Q3
3+ groups → ANOVA (not t-test)

Question 3: Are the observations paired or independent?

Paired → Paired t-test
Independent → Two-sample t-test (Welch's)

| Scenario | Recommended Test |
|---|---|
| Testing if sample mean equals a known value | One-sample t-test |
| Comparing means of two independent groups | Two-sample t-test (Welch's) |
| Before/after measurements on same subjects | Paired t-test |
| Two models evaluated on same test set | Paired t-test |
| Large n, known σ (rare) | Z-test |
| Proportions (success/failure data) | Z-test for proportions (Ch 15.2) |
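The flowchart can be written down as a small helper. This is only a sketch of the three questions above; the function name `choose_test` and its parameters are hypothetical, and it does not cover proportions or any assumption checking.

```python
def choose_test(sigma_known: bool, n_groups: int, paired: bool = False) -> str:
    """Return the test suggested by the three-question flowchart."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Question 1: is the population standard deviation known?
    if sigma_known:
        return "Z-test"
    # Question 2: how many groups are being compared?
    if n_groups == 1:
        return "One-sample t-test"
    if n_groups >= 3:
        return "ANOVA"
    # Question 3: paired or independent observations?
    if paired:
        return "Paired t-test"
    return "Two-sample t-test (Welch's)"


print(choose_test(sigma_known=False, n_groups=2, paired=True))
print(choose_test(sigma_known=False, n_groups=2, paired=False))
```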

Applications in AI/ML

Z-tests and t-tests are essential tools in the ML practitioner's toolkit. Here's how they're used in practice:

🔬 Model Comparison

Use a paired t-test when comparing two models on the same test set or cross-validation folds. The pairing accounts for dataset-specific effects.

```python
from scipy import stats

# Accuracy on 10 CV folds (illustrative values). The per-fold gains of
# model_b vary, so the standard deviation of the differences is not zero.
model_a = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b = [0.84, 0.86, 0.84, 0.85, 0.86, 0.88, 0.83, 0.87, 0.86, 0.85]

# Paired t-test (same folds!)
t, p = stats.ttest_rel(model_b, model_a)
print(f"t={t:.3f}, p={p:.4f}")
```

🧪 A/B Testing for Conversions

For conversion rates (proportions), use a two-proportion Z-test. For continuous metrics like revenue per user, use a two-sample t-test.

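```python
from scipy import stats

# Revenue per user (continuous).
# "..." marks observations elided here; supply the full samples before running.
control = [45, 52, 38, 61, 55, 42, ...]
treatment = [48, 56, 41, 65, 58, 45, ...]

# Welch's t-test: SciPy's default is the pooled test, so pass equal_var=False
t, p = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t={t:.3f}, p={p:.4f}")
```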
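For the conversion-rate case mentioned above, a two-proportion Z-test can be sketched by hand. The helper name `two_proportion_ztest` and the conversion counts are hypothetical; the standard error uses the pooled proportion, which is the usual choice under H₀: p₁ = p₂.

```python
from math import sqrt
from scipy import stats

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided Z-test for H0: p_a = p_b using the pooled proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Hypothetical experiment: 120/1000 conversions (control) vs 150/1000 (treatment)
z, p_value = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The normal approximation behind this test is reasonable when each group has a fair number of both conversions and non-conversions; for very small counts an exact test is a safer choice.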

🎯 Hyperparameter Tuning Validation

Before deploying a model with new hyperparameters, verify that improvements are statistically significant. Random seed variation can cause apparent gains that are just noise.

⚠️ Drift Detection

Use t-tests to detect if model predictions or feature distributions have shifted in production. Significant drift may trigger model retraining pipelines.

Multiple Testing Problem: If you run many t-tests (e.g., comparing 10 models pairwise), false positives accumulate! With 45 pairwise comparisons at α = 0.05, you expect about 45 × 0.05 = 2.25 false positives by chance if every null hypothesis is true. Apply a correction such as Bonferroni, or control the false discovery rate (FDR).
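Both the count of comparisons and the two corrections named above are easy to sketch. The helper names and the list of p-values below are hypothetical; the FDR procedure shown is Benjamini-Hochberg, one common way to control the false discovery rate.

```python
from math import comb
import numpy as np

n_models, alpha = 10, 0.05
n_comparisons = comb(n_models, 2)                  # 45 pairwise comparisons
expected_false_positives = n_comparisons * alpha   # 2.25 if all nulls are true

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg_reject(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= q*k/m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = int(bonferroni_reject(p_values).sum())
n_bh = int(benjamini_hochberg_reject(p_values).sum())

print(f"{n_comparisons} comparisons, expected false positives: {expected_false_positives:.2f}")
print(f"uncorrected rejections: {sum(p < alpha for p in p_values)}")
print(f"Bonferroni rejections:  {n_bonferroni}")
print(f"BH (FDR) rejections:    {n_bh}")
```

Bonferroni is the more conservative of the two, so it never rejects more hypotheses than Benjamini-Hochberg at the same level.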

Assumptions and Robustness

Like all parametric tests, t-tests make assumptions about the data. Understanding these assumptions and their robustness is crucial for valid inference.

| Assumption | Consequence of Violation | Robustness |
|---|---|---|
| Random Sampling | Bias, invalid inference | Not robust - must be satisfied |
| Independence | Wrong standard errors, inflated Type I error | Not robust for severe violations |
| Normality | Affects small samples most | Very robust for n ≥ 30 (CLT) |
| Equal Variances (pooled) | Wrong SE, biased test | Use Welch's t-test instead |

When t-Tests Are Robust

  • Large sample sizes (n ≥ 30 per group)
  • Symmetric or only mildly skewed data
  • No extreme outliers
  • Similar group sizes (for two-sample)

When to Use Alternatives

  • Heavy-tailed data: Use Mann-Whitney U or Wilcoxon
  • Very small n with non-normality: Use permutation tests
  • Ordinal data: Use nonparametric tests
  • Highly skewed continuous: Consider log transformation
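As a rough illustration of two of these alternatives, the sketch below compares Welch's t-test with the Mann-Whitney U test and a simple permutation test on hypothetical data containing one extreme outlier. The helper `permutation_test_mean_diff` is an illustrative implementation written for this example, not a SciPy function.

```python
import numpy as np
from scipy import stats

# Hypothetical samples; group_b contains one extreme outlier (42.0)
group_a = np.array([1.2, 1.9, 2.3, 2.8, 3.1, 3.6, 4.0])
group_b = np.array([2.9, 3.4, 3.8, 4.4, 5.1, 5.9, 42.0])

welch = stats.ttest_ind(group_a, group_b, equal_var=False)
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

def permutation_test_mean_diff(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: x.size].mean() - shuffled[x.size:].mean())
        count += int(diff >= observed)
    # Add one to numerator and denominator so the p-value is never exactly zero
    return (count + 1) / (n_permutations + 1)

p_perm = permutation_test_mean_diff(group_a, group_b)

print(f"Welch t-test:     p = {welch.pvalue:.4f}")
print(f"Mann-Whitney U:   p = {mwu.pvalue:.4f}")
print(f"Permutation test: p = {p_perm:.4f}")
```

The outlier inflates the sample variance of the second group, which weakens the t-test, while the rank-based Mann-Whitney U test is unaffected by how extreme the largest value is.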

Python Implementation

```python
import numpy as np
from scipy import stats
from scipy.stats import nct  # Non-central t distribution
from scipy.stats import t as t_dist

# ============================================
# One-Sample t-Test
# ============================================

# Testing if model latency differs from 50ms
latencies = np.array([54, 48, 56, 51, 53, 49, 55, 52, 50, 54])
hypothesized_mean = 50

result = stats.ttest_1samp(latencies, hypothesized_mean)
print("One-Sample t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value (two-tailed): {result.pvalue:.4f}")
print(f"  Sample mean: {latencies.mean():.2f}")

# One-tailed p-value (testing if latency > 50)
p_greater = result.pvalue / 2 if result.statistic > 0 else 1 - result.pvalue / 2
print(f"  p-value (one-tailed, greater): {p_greater:.4f}")


# ============================================
# Two-Sample t-Test (Independent Groups)
# ============================================

# Comparing accuracy of two models on different test sets
model_a_acc = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b_acc = [0.84, 0.87, 0.83, 0.86, 0.85, 0.88, 0.84, 0.86, 0.87, 0.85]

# Welch's t-test (does not assume equal variances).
# SciPy's default is equal_var=True (pooled), so Welch must be requested explicitly.
result = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=False)
print("\nTwo-Sample t-Test (Welch's):")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {np.mean(model_b_acc) - np.mean(model_a_acc):.4f}")

# Pooled t-test (assumes equal variances)
result_pooled = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=True)
print(f"  Pooled t-statistic: {result_pooled.statistic:.4f}")


# ============================================
# Paired t-Test
# ============================================

# Same model evaluated before and after optimization.
# Illustrative values: the per-item improvement varies, so the standard
# deviation of the differences is nonzero and the t-statistic is finite.
before = np.array([100, 105, 98, 102, 99, 103, 101, 104, 97, 100])
after = np.array([96, 99, 94, 97, 93, 99, 95, 100, 92, 94])

result = stats.ttest_rel(after, before)
print("\nPaired t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {(after - before).mean():.2f}")
print(f"  Std of differences: {(after - before).std(ddof=1):.2f}")


# ============================================
# Effect Size (Cohen's d)
# ============================================

def cohens_d_independent(group1, group2):
    """Cohen's d for independent samples (pooled std)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def cohens_d_paired(diff):
    """Cohen's d for paired samples (using std of differences)."""
    return np.mean(diff) / np.std(diff, ddof=1)

d_indep = cohens_d_independent(model_b_acc, model_a_acc)
d_paired = cohens_d_paired(after - before)

print("\nEffect Sizes:")
print(f"  Cohen's d (independent): {d_indep:.3f}")
print(f"  Cohen's d (paired): {d_paired:.3f}")
print("  Interpretation: |d| around 0.2 small, 0.5 medium, 0.8 large")


# ============================================
# Confidence Interval for Mean Difference
# ============================================

def mean_diff_ci(data, confidence=0.95):
    """CI for one-sample mean or paired mean difference."""
    n = len(data)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    t_crit = t_dist.ppf((1 + confidence) / 2, df=n-1)
    return mean - t_crit * se, mean + t_crit * se

ci_lower, ci_upper = mean_diff_ci(after - before)
print(f"\n95% CI for mean difference: [{ci_lower:.2f}, {ci_upper:.2f}]")


# ============================================
# Power Analysis (for sample size planning)
# ============================================

def power_one_sample_ttest(n, effect_size, alpha=0.05, alternative='two-sided'):
    """Calculate power for one-sample t-test."""
    df = n - 1
    ncp = effect_size * np.sqrt(n)  # Non-centrality parameter

    if alternative == 'two-sided':
        t_crit = t_dist.ppf(1 - alpha/2, df)
        power = 1 - nct.cdf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)
    elif alternative == 'greater':
        t_crit = t_dist.ppf(1 - alpha, df)
        power = 1 - nct.cdf(t_crit, df, ncp)
    else:  # less
        t_crit = t_dist.ppf(alpha, df)
        power = nct.cdf(t_crit, df, ncp)

    return power

# What sample size needed to detect d=0.5 with 80% power?
for n in [10, 20, 30, 50, 100]:
    power = power_one_sample_ttest(n, effect_size=0.5)
    print(f"n={n:3d}: power = {power:.3f}")
```

Knowledge Check

Test your understanding of Z-tests and t-tests with this interactive quiz.


When should you use a Z-test instead of a t-test?



Summary

Key Takeaways

  1. Z-tests require known σ: Rarely applicable in practice. Use t-tests when you must estimate the population standard deviation from the sample.
  2. t-distribution has heavier tails: This accounts for the uncertainty in estimating σ with s. As n increases, t converges to Z.
  3. One-sample t-test: Tests if a population mean equals a hypothesized value. df = n - 1.
  4. Two-sample t-test: Compares means of two independent groups. Use Welch's version by default (no equal variance assumption).
  5. Paired t-test: For matched or repeated measures. More powerful than independent tests when pairs are correlated.
  6. Always report effect sizes: Statistical significance alone doesn't tell you if an effect is practically meaningful. Cohen's d provides interpretable magnitudes.
  7. t-tests are robust: For n ≥ 30, violations of normality are usually not critical. However, independence and random sampling are essential.

Quick Reference

| Test | Use Case | Degrees of Freedom |
|---|---|---|
| Z-test | Known σ (rare) | N/A (uses N(0,1)) |
| One-sample t | Sample mean vs hypothesized value | n - 1 |
| Two-sample t (pooled) | Two independent groups, equal var | n₁ + n₂ - 2 |
| Two-sample t (Welch) | Two independent groups, unequal var | Satterthwaite approx. |
| Paired t | Matched pairs or repeated measures | n - 1 (n = pairs) |

Looking Ahead: In the next section, we'll explore Chi-Square tests for categorical data - testing goodness of fit and independence in contingency tables.