Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand when to use Z-tests vs t-tests
- Derive and interpret the t-statistic formula
- Explain why the t-distribution has heavier tails
- Distinguish one-sample, two-sample, and paired t-tests
- Calculate degrees of freedom for each test type
🔧 Practical Skills
- Choose the appropriate test for different scenarios
- Perform Z-tests and t-tests by hand and in Python
- Interpret test results and p-values correctly
- Understand assumptions and when they can be relaxed
🧠 AI/ML Applications
- A/B Testing - Compare conversion rates between model variants
- Model Evaluation - Test if a new model significantly outperforms the baseline
- Hyperparameter Tuning - Verify improvements are not due to random variation
- Paired Comparisons - Compare models on the same test set (paired t-test)
- Sample Size Planning - Determine how much data you need for reliable conclusions
Central Message: Z-tests and t-tests are the workhorses of parametric hypothesis testing. Understanding when and how to use each test is essential for any data scientist making evidence-based decisions.
The Big Picture: A Historical Journey
The story of Z-tests and t-tests begins in the early 20th century, when scientists and industrialists faced a fundamental problem: how do you draw reliable conclusions from limited data?
The Industrial Revolution Problem
Factories needed quality control. Breweries needed consistent products. Scientists needed to test hypotheses. But running large experiments was expensive. Could you trust conclusions from small samples?
The Z-test worked when population parameters were known (rare in practice). But what if you only had a small sample and had to estimate the variance? This extra uncertainty needed to be accounted for.
The solution came from an unlikely source: a brewer at Guinness in Dublin, working under a pseudonym to protect trade secrets. His discovery would revolutionize small-sample statistics.
The Z-Test: When Sigma is Known
The Z-test is the foundational test for means when the population standard deviation is known. While this situation is rare in practice, understanding the Z-test provides crucial intuition for all parametric tests.
Z-Test Formulation
One-Sample Z-Test Statistic
Z = (X̄ - μ₀) / (σ / √n)
Under H₀, Z follows a standard normal distribution N(0,1)
| Symbol | Meaning | Known From |
|---|---|---|
| X̄ | Sample mean | Calculated from data |
| μ₀ | Hypothesized population mean | Null hypothesis |
| σ | Population standard deviation | Prior knowledge (RARE) |
| n | Sample size | Study design |
| σ/√n | Standard error of the mean | Known formula |
Why it works: By the Central Limit Theorem, the sample mean X̄ is approximately normally distributed with mean μ and standard deviation σ/√n. Standardizing under H₀ (where μ = μ₀) gives Z ~ N(0,1).
Worked Example: Quality Control
Example: Precision Manufacturing
A machine produces metal rods that should have a mean length of 100 mm. From historical data (thousands of measurements), we know σ = 2 mm. A sample of 36 rods has mean length 100.8 mm. Has the machine drifted from its target?
Setup
H₀: μ = 100 mm vs H₁: μ ≠ 100 mm
Two-tailed test at α = 0.05
Calculation
Z = (100.8 - 100) / (2 / √36) = 0.8 / 0.333 = 2.4
Decision
Critical value at α = 0.05 is ±1.96. Since |Z| = 2.4 > 1.96, we reject H₀.
The p-value = 2 × (1 - Φ(2.4)) = 2 × 0.0082 = 0.0164 < 0.05
Conclusion: There is significant evidence that the machine has drifted from its target of 100 mm. The production line should be recalibrated.
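The rod example can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the helper name `one_sample_z_test` is hypothetical, and σ = 2 mm is the value implied by Z = 2.4 with n = 36 and a 0.8 mm deviation from the target.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for a one-sample Z-test with known sigma."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two-sided p-value: probability of a statistic at least this extreme under N(0, 1)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Rod example: target 100 mm, sigma = 2 mm (assumed), n = 36, observed mean 100.8 mm
z, p = one_sample_z_test(sample_mean=100.8, mu0=100.0, sigma=2.0, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.40, p = 0.0164
```

The same function covers one-tailed tests if the p-value line is swapped for a single tail probability.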
The t-Test: Accounting for Unknown Variance
In reality, we almost never know the true population standard deviation σ. We must estimate it from the sample using the sample standard deviation s. This introduces additional uncertainty that the Z-test ignores.
William Sealy Gosset and Student's t
The Guinness Statistician (1908)
William Sealy Gosset worked as a chemist at Guinness Brewery in Dublin. He needed to analyze small batches of barley and hops but couldn't run large experiments.
Guinness prohibited employees from publishing, so Gosset used the pseudonym "Student". His 1908 paper introduced what we now call Student's t-distribution.
"The problem is to find a way of using the sample to test a hypothesis about the population when the only knowledge of the population we have is what the sample tells us."
Gosset's insight was profound: when we replace the unknown σ with the sample estimate s, the resulting statistic no longer follows a normal distribution. It follows a new distribution with heavier tails whose shape depends on the sample size.
One-Sample t-Test Statistic
t = (X̄ - μ₀) / (s / √n)
Under H₀, t follows a Student's t-distribution with df = n - 1
Properties of the t-Distribution
Key Properties
- Symmetric about zero (like the normal)
- Unimodal (single peak at center)
- Bell-shaped but with heavier tails
- Defined by degrees of freedom (df)
- Converges to N(0,1) as df → ∞
Why Heavier Tails?
When we estimate σ with s, sometimes we underestimate σ (making |t| too large) and sometimes overestimate it (making |t| too small).
This extra randomness makes extreme t-values more likely than extreme Z-values. With small n, the estimate is unstable, so tails are heavier.
| df | t₀.₀₂₅ (two-tailed) | z₀.₀₂₅ | Difference |
|---|---|---|---|
| 5 | 2.571 | 1.96 | +31% |
| 10 | 2.228 | 1.96 | +14% |
| 20 | 2.086 | 1.96 | +6% |
| 30 | 2.042 | 1.96 | +4% |
| ∞ | 1.96 | 1.96 | 0% |
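The critical values in the table can be reproduced with SciPy's quantile functions. The sketch below assumes SciPy is installed; `df = 1000` is used as a stand-in for the ∞ row.

```python
from scipy import stats

# Two-tailed 5% critical value for the standard normal
z_crit = stats.norm.ppf(0.975)

for df in [5, 10, 20, 30, 1000]:
    # Same quantile, but from the t-distribution with df degrees of freedom
    t_crit = stats.t.ppf(0.975, df)
    excess = 100 * (t_crit / z_crit - 1)
    print(f"df={df:5d}  t_crit={t_crit:.3f}  z_crit={z_crit:.3f}  excess={excess:+.0f}%")
```

Each row shows how much wider the t rejection threshold is than the normal one at the same α.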
Interactive: Z vs t Distribution Comparison
(Interactive figure: the standard normal density overlaid with a t-distribution whose degrees of freedom can be adjusted. At df = 5, the two-tailed 5% critical values are ±1.960 for Z and ±2.571 for t.)
Key Observations
- Heavier Tails: The t-distribution has heavier tails than the normal, especially with low df.
- Wider Critical Values: t-critical values are always larger than z-critical values (more conservative).
- Convergence: As df increases, t approaches the standard normal (try df = 30+).
- Rule of Thumb: When df ≥ 30, the difference becomes negligible for practical purposes.
One-Sample t-Test
The one-sample t-test tests whether a population mean equals a specified value when the population standard deviation is unknown.
One-Sample t-Test Procedure
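- State hypotheses: H₀: μ = μ₀ vs H₁: μ ≠ μ₀
(or one-tailed: H₁: μ < μ₀ or H₁: μ > μ₀)
- Calculate sample statistics: the sample mean X̄ and the sample standard deviation s
- Compute the t-statistic: t = (X̄ - μ₀) / (s / √n)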
- Determine degrees of freedom: df = n - 1
- Find p-value or compare to critical value
- Make decision and interpret in context
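The procedure above can be sketched step by step and checked against SciPy. The latency values are the ones used in the Python Implementation section of this lesson; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

latencies = np.array([54, 48, 56, 51, 53, 49, 55, 52, 50, 54])  # ms
mu0 = 50.0  # hypothesized mean under H0

# Step 2: sample statistics
n = latencies.size
x_bar = latencies.mean()
s = latencies.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Steps 3-4: t-statistic and degrees of freedom
t_stat = (x_bar - mu0) / (s / np.sqrt(n))
df = n - 1

# Step 5: two-sided p-value from the t-distribution's survival function
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's built-in test
reference = stats.ttest_1samp(latencies, mu0)
print(f"by hand: t = {t_stat:.4f}, df = {df}, p = {p_value:.4f}")
print(f"scipy:   t = {reference.statistic:.4f}, p = {reference.pvalue:.4f}")
```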
Two-Sample t-Test
The two-sample t-test (also called independent samples t-test) compares the means of two independent groups to determine if they come from populations with equal means.
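Two-Sample t-Test Statistic
t = (X̄₁ - X̄₂) / SE(X̄₁ - X̄₂)
Testing H₀: μ₁ = μ₂ (or equivalently, μ₁ - μ₂ = 0). The standard error in the denominator depends on whether the two sample variances are pooled.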
Pooled vs Welch's t-Test
Pooled t-Test
Assumption: Equal variances (σ₁² = σ₂²)
df = n₁ + n₂ - 2
Welch's t-Test
No assumption: Variances can differ
df ≈ Welch-Satterthwaite approximation (non-integer)
Default in R's t.test. In scipy, ttest_ind defaults to the pooled test, so pass equal_var=False to get Welch's test
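To make the difference concrete, here is a sketch that computes both versions by hand and compares them with `scipy.stats.ttest_ind`. The function `two_sample_t` and the two data lists are hypothetical, chosen to have unequal spread and unequal sizes. Note that SciPy's `equal_var` argument defaults to `True` (pooled), so Welch's test has to be requested explicitly.

```python
import numpy as np
from scipy import stats


def two_sample_t(x, y, equal_var=False):
    """Return (t, df, two-sided p) for a pooled or Welch two-sample t-test."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    v1, v2 = x.var(ddof=1), y.var(ddof=1)
    if equal_var:
        # Pooled: one shared variance estimate, df = n1 + n2 - 2
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Welch: separate variances, Welch-Satterthwaite degrees of freedom
        se = np.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    t_stat = (x.mean() - y.mean()) / se
    p = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p


# Hypothetical measurements
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
group_b = [13.5, 10.2, 14.8, 11.1, 15.0, 9.9, 13.0, 12.6, 14.1, 10.7]

t_w, df_w, p_w = two_sample_t(group_a, group_b, equal_var=False)
t_p, df_p, p_p = two_sample_t(group_a, group_b, equal_var=True)
ref_w = stats.ttest_ind(group_a, group_b, equal_var=False)
ref_p = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"Welch:  t = {t_w:.4f}, df = {df_w:.2f}, p = {p_w:.4f}")
print(f"Pooled: t = {t_p:.4f}, df = {df_p}, p = {p_p:.4f}")
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom differ; the gap grows when both the variances and the sample sizes are unequal.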
Paired t-Test
The paired t-test (also called dependent samples t-test) is used when observations come in natural pairs: the same subjects measured twice, or matched pairs.
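Paired t-Test Statistic
t = d̄ / (s_d / √n)
where dᵢ = x₁ᵢ - x₂ᵢ is the difference within pair i, d̄ and s_d are the mean and standard deviation of the differences, n is the number of pairs, and df = n - 1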
| Scenario | Why Paired? |
|---|---|
| Before/After Treatment | Same subjects measured twice |
| Left/Right (eyes, hands) | Naturally matched within subject |
| Matched Case-Control | Pairs matched on confounders |
| Two Models, Same Random Seeds | Each seed yields one matched pair of scores |
| Cross-validation Folds | Same data split compared across models |
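A paired t-test is a one-sample t-test applied to the within-pair differences. The sketch below illustrates this with per-fold accuracies for two models; the `model_b` values are hypothetical and chosen so that the per-fold gains are not all identical (a constant difference would give a zero standard deviation and an undefined t-statistic).

```python
import numpy as np
from scipy import stats

# Accuracy of two models on the same 10 cross-validation folds (illustrative values)
model_a = np.array([0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83])
model_b = np.array([0.84, 0.86, 0.83, 0.87, 0.85, 0.88, 0.83, 0.86, 0.88, 0.85])

# Within-pair differences, one per fold
d = model_b - model_a
n = d.size

# Paired t-statistic by hand: mean difference over its standard error
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test two ways in SciPy
paired = stats.ttest_rel(model_b, model_a)
one_sample = stats.ttest_1samp(d, 0.0)

print(f"by hand:     t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"ttest_rel:   t = {paired.statistic:.4f}, p = {paired.pvalue:.6f}")
print(f"ttest_1samp: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.6f}")
```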
Interactive: Paired vs Unpaired Comparison
When observations are naturally paired (same subjects measured twice, matched pairs), the paired t-test is more powerful because it removes between-subject variability. The higher the correlation between the paired measurements, the larger the advantage of the paired test.
(Interactive figure: a paired t-test and an independent t-test that ignores the pairing, applied to the same simulated data; observed correlation between pairs r = 0.865.)
Why Paired Tests Have More Power
The standard error for the paired test is based on the variance of differences, which accounts for correlation between measurements:
SE_paired = s_d / √n
Uses within-pair variability only
SE_indep = √(s_p² (1/n + 1/n))
Includes all between-subject variability
For paired measurements, Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y). When the pairs are positively correlated, the covariance term makes this smaller than Var(X) + Var(Y), the value that applies to independent groups. Higher correlation means more variance reduction and therefore more power.
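The variance identity and its effect on the standard error can be checked on simulated data. This is a sketch under stated assumptions: the correlation (0.8), sample size (200), and means are arbitrary illustration values, not taken from the lesson's interactive demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200, 0.8  # number of pairs and assumed correlation within pairs

# Draw correlated pairs (x_i, y_i) from a bivariate normal with unit variances
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.5], cov=cov, size=n)
x, y = xy[:, 0], xy[:, 1]

# Variance of the differences, computed directly and via the identity
d = x - y
var_d = d.var(ddof=1)
var_identity = x.var(ddof=1) + y.var(ddof=1) - 2 * np.cov(x, y)[0, 1]

# Standard errors used by the paired and the independent (pooled) tests
se_paired = np.sqrt(var_d / n)
sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2  # pooled variance when both groups have n items
se_indep = np.sqrt(sp2 * (1 / n + 1 / n))

print(f"Var(X-Y) direct = {var_d:.4f}, via identity = {var_identity:.4f}")
print(f"SE paired = {se_paired:.4f}, SE independent = {se_indep:.4f}")
```

Setting `rho` to 0 makes the two standard errors roughly equal, which is why pairing helps only when the measurements are actually correlated.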
Interactive: Sample Size Effect
Sample size affects the t-distribution, the standard error, and statistical power. With larger samples, the t-distribution becomes more similar to the normal, and we have more power to detect true effects.
(Interactive figure: sampling distributions under H₀: μ = 100 and under the true mean μ = 105, effect size d = 0.33, with adjustable sample size.)
Small n (n < 30)
- Heavy-tailed t-distribution
- Wider critical values
- Lower power to detect effects
- More uncertainty in estimate
Medium n (30-100)
- t approaches normal
- Moderate power
- Standard error decreases
- More reliable estimates
Large n (n > 100)
- t nearly identical to Z
- High power
- Can detect small effects
- Statistically significant effects may still be too small to matter in practice
Decision Guide: Which Test to Use?
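Test Selection Flowchart
Question 1: Do you know the population standard deviation σ? If yes, use a Z-test; if no, use a t-test.
Question 2: How many groups are you comparing? One group calls for a one-sample test; two groups lead to Question 3.
Question 3: Are the observations paired or independent? Paired observations call for a paired t-test; independent groups call for a two-sample t-test (Welch's by default).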
| Scenario | Recommended Test |
|---|---|
| Testing if a population mean equals a hypothesized value (σ unknown) | One-sample t-test |
| Comparing means of two independent groups | Two-sample t-test (Welch's) |
| Before/after measurements on same subjects | Paired t-test |
| Two models evaluated on same test set | Paired t-test |
| Large n, known σ (rare) | Z-test |
| Proportions (success/failure data) | Z-test for proportions (Ch 15.2) |
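The three flowchart questions can also be written as a small helper. This is only a sketch: the function name `choose_test` and its arguments are hypothetical, and it encodes just the branches listed in this section (it does not cover proportions or more than two groups).

```python
def choose_test(sigma_known: bool, n_groups: int, paired: bool = False) -> str:
    """Map the three flowchart questions to a test name."""
    if n_groups not in (1, 2):
        raise ValueError("Z-tests and t-tests compare one or two groups")
    # Question 1: a known population standard deviation allows a Z-test
    if sigma_known:
        return "Z-test"
    # Question 2: one group means a one-sample t-test
    if n_groups == 1:
        return "one-sample t-test"
    # Question 3: two groups, paired or independent
    return "paired t-test" if paired else "two-sample t-test (Welch's)"


print(choose_test(sigma_known=False, n_groups=1))               # one-sample t-test
print(choose_test(sigma_known=False, n_groups=2, paired=True))  # paired t-test
print(choose_test(sigma_known=False, n_groups=2))               # two-sample t-test (Welch's)
```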
Applications in AI/ML
Z-tests and t-tests are essential tools in the ML practitioner's toolkit. Here's how they're used in practice:
🔬 Model Comparison
Use a paired t-test when comparing two models on the same test set or cross-validation folds. The pairing accounts for dataset-specific effects.
from scipy import stats

# Accuracy on 10 CV folds (illustrative values; per-fold gains vary,
# so the differences have a nonzero standard deviation)
model_a = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b = [0.84, 0.86, 0.83, 0.87, 0.85, 0.88, 0.83, 0.86, 0.88, 0.85]

# Paired t-test: fold i of model_a is matched with fold i of model_b
t, p = stats.ttest_rel(model_b, model_a)
print(f"t={t:.3f}, p={p:.4f}")
🧪 A/B Testing for Conversions
For conversion rates (proportions), use a two-proportion Z-test. For continuous metrics like revenue per user, use a two-sample t-test.
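For the conversion-rate case, a pooled two-proportion Z-test can be sketched with the standard library alone. The helper name `two_proportion_z_test` and the visitor and conversion counts are hypothetical; the normal approximation assumes each group has a reasonably large number of successes and failures.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: the two conversion rates are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Under H0 both groups share one rate, estimated from the pooled data
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 200/2000 conversions in control, 250/2000 in treatment
z, p = two_proportion_z_test(successes_a=200, n_a=2000, successes_b=250, n_b=2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```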
from scipy import stats

# Revenue per user (continuous); "..." marks observations omitted here
control = [45, 52, 38, 61, 55, 42, ...]
treatment = [48, 56, 41, 65, 58, 45, ...]

# Welch's t-test: scipy's default is the pooled test, so set equal_var=False
t, p = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t={t:.3f}, p={p:.4f}")
🎯 Hyperparameter Tuning Validation
Before deploying a model with new hyperparameters, verify that improvements are statistically significant. Random seed variation can cause apparent gains that are just noise.
⚠️ Drift Detection
Use t-tests to detect whether the mean of a model's predictions or of a feature has shifted in production. A significant shift may trigger model retraining pipelines. A t-test compares means only, so changes in spread or shape need other checks.
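A mean-shift check of this kind might look like the sketch below. Everything here is an assumption for illustration: the data are simulated, the window size of 500 and the 0.01 alert threshold are arbitrary, and a production monitor would also need to handle repeated testing over time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated feature values: a training-time reference window and a shifted production window
reference = rng.normal(loc=0.0, scale=1.0, size=500)
production = rng.normal(loc=0.5, scale=1.2, size=500)

# Welch's t-test, since the two windows need not share a variance
result = stats.ttest_ind(production, reference, equal_var=False)
drift_detected = result.pvalue < 0.01  # hypothetical alert threshold

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2e}, drift = {drift_detected}")
```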
Assumptions and Robustness
Like all parametric tests, t-tests make assumptions about the data. Understanding these assumptions and their robustness is crucial for valid inference.
| Assumption | Consequence of Violation | Robustness |
|---|---|---|
| Random Sampling | Bias, invalid inference | Not robust - must be satisfied |
| Independence | Wrong standard errors, inflated Type I error | Not robust for severe violations |
| Normality | Affects small samples most | Usually robust for n ≥ 30 (CLT), less so with strong skew or outliers |
| Equal Variances (pooled) | Wrong SE, biased test | Use Welch's t-test instead |
When t-Tests Are Robust
- Large sample sizes (n ≥ 30 per group)
- Symmetric or only mildly skewed data
- No extreme outliers
- Similar group sizes (for two-sample)
When to Use Alternatives
- Heavy-tailed data: Use Mann-Whitney U (independent groups) or Wilcoxon signed-rank (paired data)
- Very small n with non-normality: Use permutation tests
- Ordinal data: Use nonparametric tests
- Highly skewed continuous: Consider log transformation
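The alternatives listed above can be sketched on simulated skewed data. The lognormal samples are hypothetical, and `permutation_test_mean_diff` is a hand-written illustration of a two-sided permutation test for a difference in means, not a library function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical right-skewed metric (e.g. session length) for two independent groups
control = rng.lognormal(mean=3.0, sigma=1.0, size=40)
treatment = rng.lognormal(mean=3.4, sigma=1.0, size=40)

# Rank-based alternative to the two-sample t-test
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")


def permutation_test_mean_diff(x, y, n_resamples=5000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    perm_rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the difference
        shuffled = perm_rng.permutation(pooled)
        diff = shuffled[: x.size].mean() - shuffled[x.size :].mean()
        count += int(abs(diff) >= abs(observed))
    # Add-one correction keeps the estimated p-value strictly positive
    return (count + 1) / (n_resamples + 1)


p_perm = permutation_test_mean_diff(treatment, control)

# Log transformation followed by Welch's t-test
log_result = stats.ttest_ind(np.log(treatment), np.log(control), equal_var=False)

print(f"Mann-Whitney U: p = {p_mw:.4f}")
print(f"Permutation:    p = {p_perm:.4f}")
print(f"Welch on logs:  p = {log_result.pvalue:.4f}")
```

The three tests answer slightly different questions (rank ordering, raw mean difference, and mean difference on the log scale), so their p-values need not agree.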
Python Implementation
import numpy as np
from scipy import stats
from scipy.stats import nct  # Non-central t distribution
from scipy.stats import t as t_dist

# ============================================
# One-Sample t-Test
# ============================================

# Testing if model latency differs from 50ms
latencies = np.array([54, 48, 56, 51, 53, 49, 55, 52, 50, 54])
hypothesized_mean = 50

result = stats.ttest_1samp(latencies, hypothesized_mean)
print("One-Sample t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value (two-tailed): {result.pvalue:.4f}")
print(f"  Sample mean: {latencies.mean():.2f}")

# One-tailed p-value (testing if latency > 50), derived from the two-tailed result.
# SciPy 1.6+ also accepts alternative='greater' in ttest_1samp.
p_greater = result.pvalue / 2 if result.statistic > 0 else 1 - result.pvalue / 2
print(f"  p-value (one-tailed, greater): {p_greater:.4f}")


# ============================================
# Two-Sample t-Test (Independent Groups)
# ============================================

# Comparing accuracy of two models on different test sets
model_a_acc = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b_acc = [0.84, 0.87, 0.83, 0.86, 0.85, 0.88, 0.84, 0.86, 0.87, 0.85]

# Welch's t-test (does not assume equal variances).
# equal_var=False is required: scipy's default is the pooled test.
result = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=False)
print("\nTwo-Sample t-Test (Welch's):")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {np.mean(model_b_acc) - np.mean(model_a_acc):.4f}")

# Pooled t-test (assumes equal variances)
result_pooled = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=True)
print(f"  Pooled t-statistic: {result_pooled.statistic:.4f}")


# ============================================
# Paired t-Test
# ============================================

# Same model evaluated before and after optimization (illustrative values).
# The improvements vary from pair to pair; identical differences would give
# a zero standard deviation and an undefined t-statistic.
before = np.array([100, 105, 98, 102, 99, 103, 101, 104, 97, 100])
after = np.array([96, 99, 94, 97, 93, 99, 95, 100, 92, 94])

result = stats.ttest_rel(after, before)
print("\nPaired t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {(after - before).mean():.2f}")
print(f"  Std of differences: {(after - before).std(ddof=1):.2f}")


# ============================================
# Effect Size (Cohen's d)
# ============================================

def cohens_d_independent(group1, group2):
    """Cohen's d for independent samples (pooled std)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std


def cohens_d_paired(diff):
    """Cohen's d for paired samples (using std of differences)."""
    return np.mean(diff) / np.std(diff, ddof=1)


d_indep = cohens_d_independent(model_b_acc, model_a_acc)
d_paired = cohens_d_paired(after - before)

print("\nEffect Sizes:")
print(f"  Cohen's d (independent): {d_indep:.3f}")
print(f"  Cohen's d (paired): {d_paired:.3f}")
print("  Interpretation: |d| ≈ 0.2 small, 0.5 medium, 0.8 large")


# ============================================
# Confidence Interval for Mean Difference
# ============================================

def mean_diff_ci(data, confidence=0.95):
    """CI for one-sample mean or paired mean difference."""
    n = len(data)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    t_crit = t_dist.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * se, mean + t_crit * se


ci_lower, ci_upper = mean_diff_ci(after - before)
print(f"\n95% CI for mean difference: [{ci_lower:.2f}, {ci_upper:.2f}]")


# ============================================
# Power Analysis (for sample size planning)
# ============================================

def power_one_sample_ttest(n, effect_size, alpha=0.05, alternative='two-sided'):
    """Calculate power for one-sample t-test."""
    df = n - 1
    ncp = effect_size * np.sqrt(n)  # Non-centrality parameter

    if alternative == 'two-sided':
        t_crit = t_dist.ppf(1 - alpha / 2, df)
        power = 1 - nct.cdf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)
    elif alternative == 'greater':
        t_crit = t_dist.ppf(1 - alpha, df)
        power = 1 - nct.cdf(t_crit, df, ncp)
    else:  # less
        t_crit = t_dist.ppf(alpha, df)
        power = nct.cdf(t_crit, df, ncp)

    return power


# What sample size is needed to detect d=0.5 with 80% power?
for n in [10, 20, 30, 50, 100]:
    power = power_one_sample_ttest(n, effect_size=0.5)
    print(f"n={n:3d}: power = {power:.3f}")
Knowledge Check
Test your understanding of Z-tests and t-tests. Sample question: When should you use a Z-test instead of a t-test?
Summary
Key Takeaways
- Z-tests require known σ: Rarely applicable in practice. Use t-tests when you must estimate the population standard deviation from the sample.
- t-distribution has heavier tails: This accounts for the uncertainty in estimating σ with s. As n increases, t converges to Z.
- One-sample t-test: Tests if a population mean equals a hypothesized value. df = n - 1.
- Two-sample t-test: Compares means of two independent groups. Use Welch's version by default (no equal variance assumption).
- Paired t-test: For matched or repeated measures. More powerful than independent tests when pairs are correlated.
- Always report effect sizes: Statistical significance alone doesn't tell you if an effect is practically meaningful. Cohen's d provides interpretable magnitudes.
- t-tests are robust: For n ≥ 30, violations of normality are usually not critical. However, independence and random sampling are essential.
Quick Reference
| Test | Use Case | Degrees of Freedom |
|---|---|---|
| Z-test | Known σ (rare) | N/A (uses N(0,1)) |
| One-sample t | Sample mean vs hypothesized value | n - 1 |
| Two-sample t (pooled) | Two independent groups, equal var | n₁ + n₂ - 2 |
| Two-sample t (Welch) | Two independent groups, unequal var | Satterthwaite approx. |
| Paired t | Matched pairs or repeated measures | n - 1 (n = pairs) |
Looking Ahead: In the next section, we'll explore Chi-Square tests for categorical data - testing goodness of fit and independence in contingency tables.