Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand when to use Z-tests vs t-tests
• Derive and interpret the t-statistic formula
• Explain why the t-distribution has heavier tails
• Distinguish one-sample, two-sample, and paired t-tests
• Calculate degrees of freedom for each test type

🔧 Practical Skills

• Choose the appropriate test for different scenarios
• Perform Z-tests and t-tests by hand and in Python
• Interpret test results and p-values correctly
• Understand assumptions and when they can be relaxed

🧠 AI/ML Applications

• A/B Testing - Compare conversion rates between model variants
• Model Evaluation - Test if a new model significantly outperforms the baseline
• Hyperparameter Tuning - Verify improvements are not due to random variation
• Paired Comparisons - Compare models on the same test set (paired t-test)
• Sample Size Planning - Determine how much data you need for reliable conclusions

Central Message: Z-tests and t-tests are the workhorses of parametric hypothesis testing. Understanding when and how to use each test is essential for any data scientist making evidence-based decisions.

The Big Picture: A Historical Journey

The story of Z-tests and t-tests begins in the early 20th century, when scientists and industrialists faced a fundamental problem: how do you draw reliable conclusions from limited data?

🏭

The Industrial Revolution Problem

Factories needed quality control. Breweries needed consistent products. Scientists needed to test hypotheses. But running large experiments was expensive. Could you trust conclusions from small samples?

The Z-test worked when population parameters were known (rare in practice). But what if you only had a small sample and had to estimate the variance? This extra uncertainty needed to be accounted for.

The solution came from an unlikely source: a brewer at Guinness in Dublin, working under a pseudonym to protect trade secrets. His discovery would revolutionize small-sample statistics.

The Z-Test: When Sigma is Known

The Z-test is the foundational test for means when the population standard deviation $\sigma$ is known. While this situation is rare in practice, understanding the Z-test provides crucial intuition for all parametric tests.

Z-Test Formulation

One-Sample Z-Test Statistic

Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}

Under H₀, Z follows a standard normal distribution N(0,1)

Symbol	Meaning	Known From
X̄	Sample mean	Calculated from data
μ₀	Hypothesized population mean	Null hypothesis
σ	Population standard deviation	Prior knowledge (RARE)
n	Sample size	Study design
σ/√n	Standard error of the mean	Known formula

Why it works: By the Central Limit Theorem, the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Under H₀ we have $\mu = \mu_0$, so subtracting $\mu_0$ and dividing by $\sigma/\sqrt{n}$ gives Z ~ N(0,1).

Worked Example: Quality Control

Example: Precision Manufacturing

A machine produces metal rods that should have a mean length of 100 mm. From historical data (thousands of measurements), we know $\sigma = 2$ mm. A sample of 36 rods has mean length 100.8 mm. Has the machine drifted from its target?

Setup

H_0: \mu = 100

H_1: \mu \neq 100

Two-tailed test at α = 0.05

Calculation

Z = \frac{100.8 - 100}{2/\sqrt{36}} = \frac{0.8}{0.333} = 2.4

Decision

Critical value at α = 0.05 is ±1.96. Since |Z| = 2.4 > 1.96, we reject H₀.

The p-value = 2 × (1 - Φ(2.4)) = 2 × 0.0082 = 0.0164 < 0.05

Conclusion: There is significant evidence that the machine has drifted from its target of 100 mm. The production line should be recalibrated.
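
The arithmetic in this worked example can be checked in a few lines. The sketch below is a minimal illustration, assuming SciPy is available; the helper name `one_sample_z_test` is made up for this example and is not a SciPy function.

```python
import math
from scipy.stats import norm

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample Z-test with a known population sigma (hypothetical helper)."""
    se = sigma / math.sqrt(n)        # standard error of the mean
    z = (sample_mean - mu0) / se     # standardized distance from the null value
    p_value = 2 * norm.sf(abs(z))    # two-tailed p-value, 2 * (1 - Phi(|z|))
    return z, p_value

# Numbers from the rod example: target 100 mm, sigma = 2 mm, n = 36, mean 100.8 mm
z, p = one_sample_z_test(sample_mean=100.8, mu0=100, sigma=2, n=36)
print(f"Z = {z:.2f}, p = {p:.4f}")
```

Using `norm.sf` (the survival function) rather than `1 - norm.cdf` avoids losing precision when the tail probability is very small.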

When do we actually know σ? Almost never in practice! Possible scenarios include: (1) standardized tests with established norms, (2) instruments with known measurement error, (3) historical processes with massive prior data. In most real applications, we use the t-test instead.

The t-Test: Accounting for Unknown Variance

In reality, we almost never know the true population standard deviation. We must estimate it from the sample using the sample standard deviation $s$. This introduces additional uncertainty that the Z-test ignores.

William Sealy Gosset and Student's t

🍺

The Guinness Statistician (1908)

William Sealy Gosset worked as a chemist at Guinness Brewery in Dublin. He needed to analyze small batches of barley and hops but couldn't run large experiments.

Guinness prohibited employees from publishing, so Gosset used the pseudonym "Student". His 1908 paper introduced what we now call Student's t-distribution.

"The problem is to find a way of using the sample to test a hypothesis about the population when the only knowledge of the population we have is what the sample tells us."

Gosset's insight was profound: when we replace the unknown $\sigma$ with the sample estimate $s$, the resulting statistic no longer follows a normal distribution. It follows a new distribution with heavier tails that depends on the sample size.

One-Sample t-Test Statistic

t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}

Under H₀, t follows a Student's t-distribution with df = n - 1

Properties of the t-Distribution

Key Properties

Symmetric about zero (like the normal)
Unimodal (single peak at center)
Bell-shaped but with heavier tails
Defined by degrees of freedom (df)
Converges to N(0,1) as df → ∞

Why Heavier Tails?

When we estimate $\sigma$ with $s$, sometimes we underestimate (making |t| too large) and sometimes overestimate (making |t| too small).

This extra randomness makes extreme t-values more likely than extreme Z-values. With small n, the estimate $s$ is unstable, so tails are heavier.
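
A small simulation can make this concrete. The sketch below is illustrative only: the sample size, seed, and number of repetitions are arbitrary choices, and the data are drawn from N(0, 1) so that the null hypothesis is true by construction.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n, reps = 5, 100_000             # tiny samples, many repetitions (arbitrary choices)

# Each row is one small sample from N(0, 1), so H0: mu = 0 holds.
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
means = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)  # sample standard deviation of each row

z_stats = means / (1.0 / np.sqrt(n))  # standardized with the true sigma = 1
t_stats = means / (s / np.sqrt(n))    # standardized with the noisy estimate s

# Fraction of statistics beyond the normal critical value 1.96
z_rate = np.mean(np.abs(z_stats) > 1.96)
t_rate = np.mean(np.abs(t_stats) > 1.96)
print(f"P(|Z| > 1.96) ~ {z_rate:.3f}")
print(f"P(|t| > 1.96) ~ {t_rate:.3f}")
```

The Z rate should sit near the nominal 5%, while the t rate should be noticeably higher. That gap is why applying the normal cutoff to a t-statistic inflates the false positive rate in small samples.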

df	t₀.₀₂₅ (two-tailed)	z₀.₀₂₅	Difference
5	2.571	1.96	+31%
10	2.228	1.96	+14%
20	2.086	1.96	+6%
30	2.042	1.96	+4%
∞	1.96	1.96	0%

Rule of Thumb: When n ≥ 30, the t-distribution is close enough to the normal that the practical difference is small. However, with modern computing, there's no reason not to use the exact t-distribution regardless of sample size.
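
The critical values in the table above can be reproduced with SciPy's quantile functions. This is a minimal sketch; df = 1000 stands in for the ∞ row.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # upper 2.5% point of N(0, 1), about 1.96

for df in [5, 10, 20, 30, 1000]:
    t_crit = t.ppf(0.975, df)                # upper 2.5% point of t with df degrees of freedom
    excess = 100 * (t_crit / z_crit - 1)     # how much wider than the normal cutoff, in percent
    print(f"df={df:4d}: t_crit={t_crit:.3f}, z_crit={z_crit:.3f}, +{excess:.0f}%")
```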

Interactive: Z vs t Distribution Comparison

Explore how the t-distribution differs from the standard normal. Adjust the degrees of freedom to see how the t-distribution converges to the normal as df increases.

[Interactive widget: Z-distribution vs t-distribution comparison. Example settings: df = 5, α = 0.05. Resulting critical values: Z = ±1.960, t (df = 5) = ±2.571.]

Key Observations

Heavier Tails: The t-distribution has heavier tails than the normal, especially with low df.
Wider Critical Values: At the same α, t-critical values are always larger in magnitude than z-critical values (more conservative).
Convergence: As df increases, t approaches the standard normal (try df = 30+).
Rule of Thumb: When df ≥ 30, the difference becomes negligible for practical purposes.

One-Sample t-Test

The one-sample t-test tests whether a population mean equals a specified value when the population standard deviation is unknown.

One-Sample t-Test Procedure

1. State hypotheses: $H_0: \mu = \mu_0 \quad \text{vs} \quad H_1: \mu \neq \mu_0$ (or one-tailed: < or >)
2. Calculate sample statistics: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$
3. Compute the t-statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4. Determine degrees of freedom: df = n - 1
5. Find p-value or compare to critical value
6. Make decision and interpret in context
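
The procedure above can be sketched step by step and checked against `scipy.stats.ttest_1samp`. The helper name `one_sample_t_by_hand` is made up for this illustration; the latency values are the ones used in this section's Python Implementation.

```python
import numpy as np
from scipy import stats

def one_sample_t_by_hand(x, mu0):
    """Two-sided one-sample t-test computed from the formulas (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x_bar = x.mean()                           # sample mean
    s = x.std(ddof=1)                          # sample standard deviation (n - 1 denominator)
    t_stat = (x_bar - mu0) / (s / np.sqrt(n))  # t-statistic
    df = n - 1                                 # degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value
    return t_stat, df, p_value

latencies = [54, 48, 56, 51, 53, 49, 55, 52, 50, 54]  # latency in ms
t_stat, df, p_value = one_sample_t_by_hand(latencies, mu0=50)
check = stats.ttest_1samp(latencies, 50)

print(f"by hand: t={t_stat:.4f}, df={df}, p={p_value:.4f}")
print(f"scipy:   t={check.statistic:.4f}, p={check.pvalue:.4f}")
```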

Two-Sample t-Test

The two-sample t-test (also called independent samples t-test) compares the means of two independent groups to determine if they come from populations with equal means.

Two-Sample t-Test Statistic

t = \frac{\bar{X}_1 - \bar{X}_2}{\text{SE}(\bar{X}_1 - \bar{X}_2)}

Testing H₀: μ₁ = μ₂ (or equivalently, μ₁ - μ₂ = 0)

Pooled vs Welch's t-Test

Pooled t-Test

Assumption: Equal variances (σ₁² = σ₂²)

s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}

$\text{SE} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

df = n₁ + n₂ - 2

Welch's t-Test

No assumption: Variances can differ

\text{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

df ≈ Welch-Satterthwaite approximation (non-integer)

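Default in some software (e.g., R's t.test). In SciPy, stats.ttest_ind defaults to the pooled test (equal_var=True), so pass equal_var=False to get Welch's test.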

Modern Best Practice: Use Welch's t-test by default. It is more robust and performs well even when variances are equal. The pooled t-test only has slightly more power when variances are truly equal - a condition we can rarely verify.
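
To see how the two versions differ in practice, the sketch below computes both by hand from the formulas above, including the Welch-Satterthwaite degrees of freedom, and compares them with `scipy.stats.ttest_ind`. The helper names and the two small groups are invented for illustration.

```python
import numpy as np
from scipy import stats

def welch_t_by_hand(x1, x2):
    """Welch's t-test: unpooled SE and Welch-Satterthwaite df (hypothetical helper)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = x1.size, x2.size
    v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2   # s_i^2 / n_i
    t_stat = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df, 2 * stats.t.sf(abs(t_stat), df)

def pooled_t_by_hand(x1, x2):
    """Pooled t-test: assumes equal population variances (hypothetical helper)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = x1.size, x2.size
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    t_stat = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    return t_stat, df, 2 * stats.t.sf(abs(t_stat), df)

# Made-up groups with different sizes and visibly different spreads
group_a = [12.1, 11.8, 12.6, 12.0, 11.9, 12.3]
group_b = [13.5, 10.9, 14.2, 12.8, 11.1, 13.9, 12.4, 14.8]

w_t, w_df, w_p = welch_t_by_hand(group_a, group_b)
p_t, p_df, p_p = pooled_t_by_hand(group_a, group_b)
scipy_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
scipy_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"Welch : t={w_t:.4f}, df={w_df:.2f}, p={w_p:.4f}")
print(f"Pooled: t={p_t:.4f}, df={p_df}, p={p_p:.4f}")
```

The Welch degrees of freedom are generally not an integer and never exceed n₁ + n₂ - 2.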

Paired t-Test

The paired t-test (also called dependent samples t-test) is used when observations come in natural pairs: the same subjects measured twice, or matched pairs.

Paired t-Test Statistic

t = \frac{\bar{d}}{s_d / \sqrt{n}}

where $d_i = X_{i,\text{after}} - X_{i,\text{before}}$ and df = n - 1

Scenario	Why Paired?
Before/After Treatment	Same subjects measured twice
Left/Right (eyes, hands)	Naturally matched within subject
Matched Case-Control	Pairs matched on confounders
Two Model Variants, Same Random Seeds	Runs matched on seed, so each pair shares the same source of randomness
Cross-validation Folds	Same data split compared across models

The Power Advantage: Paired tests are more powerful than independent tests when pairs are positively correlated. By computing differences, we eliminate between-subject variability and focus only on within-subject changes.

Interactive: Paired vs Unpaired Comparison

Compare the power of paired vs independent t-tests. See how correlation between pairs affects the relative performance of each test.

[Interactive widget: paired vs independent (unpaired) t-test comparison. Example settings: n = 15 pairs, true mean difference = 5, standard deviation = 10, within-pair correlation = 0.70. One simulated sample had observed correlation r = 0.865 and gave the results below.]

Test	t-statistic	df	SE	p-value
Paired t-test	1.160	14	2.117	0.2653
Independent t-test (ignoring pairing)	0.531	28	4.627	0.5996

Higher correlation = larger advantage for the paired test.

Why Paired Tests Have More Power

The standard error for the paired test is based on the variance of differences, which accounts for correlation between measurements:

\text{SE}_{\text{paired}} = \frac{s_d}{\sqrt{n}}

Uses within-pair variability only

\text{SE}_{\text{indep}} = s_p\sqrt{\frac{1}{n} + \frac{1}{n}}

Includes all between-subject variability

When pairs are positively correlated, Cov(X,Y) > 0, so Var(X-Y) = Var(X) + Var(Y) - 2Cov(X,Y) is smaller than the Var(X) + Var(Y) we get by treating the groups as independent. Higher correlation = more variance reduction = more power!
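
The variance identity and its effect on the standard error can be checked numerically. The sketch below simulates correlated before/after pairs; the seed and the settings (15 pairs, correlation 0.7, standard deviation 10, shift 5) are illustrative assumptions, not results from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, rho, sd, shift = 15, 0.7, 10.0, 5.0   # illustrative settings

# Draw n correlated (before, after) pairs from a bivariate normal
cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
before, after = rng.multivariate_normal([100.0, 100.0 + shift], cov, size=n).T

d = after - before
sample_cov = np.cov(after, before, ddof=1)[0, 1]

# Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y) also holds for sample quantities
var_identity = after.var(ddof=1) + before.var(ddof=1) - 2 * sample_cov

se_paired = d.std(ddof=1) / np.sqrt(n)
s_p2 = (after.var(ddof=1) + before.var(ddof=1)) / 2   # pooled variance with equal group sizes
se_indep = np.sqrt(s_p2 * (1 / n + 1 / n))

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(d, 0.0)                # same test, run on the differences
indep = stats.ttest_ind(after, before, equal_var=True)

print(f"SE paired = {se_paired:.3f}, SE independent = {se_indep:.3f}")
print(f"paired p = {paired.pvalue:.4f}, independent p = {indep.pvalue:.4f}")
```

A paired t-test is exactly a one-sample t-test on the differences, which is why `ttest_rel` and `ttest_1samp` agree here.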

Interactive: Sample Size Effect

Explore how sample size affects the t-distribution, standard error, and statistical power. See how larger samples make it easier to detect real effects.

[Interactive widget: sample size effect on the t-distribution. Example settings: n = 10, H₀: μ = 100, true population mean μ = 105, true population standard deviation = 15, giving effect size d = (105 - 100)/15 ≈ 0.33.]

Small n (n < 30)

Heavy-tailed t-distribution
Wider critical values
Lower power to detect effects
More uncertainty in estimate

Medium n (30-100)

t approaches normal
Moderate power
Standard error decreases
More reliable estimates

Large n (n > 100)

t nearly identical to Z
High power
Can detect small effects
Practical vs statistical significance

Interactive: t-Test Calculator

Use this calculator to run one-sample, two-sample, or paired t-tests with your own data. Enter comma-separated values and see the full statistical output.


Decision Guide: Which Test to Use?

Test Selection Flowchart

Question 1: Do you know the population standard deviation σ?

Yes → Use Z-test (rare)

No → Continue to Question 2

Question 2: How many groups are you comparing?

1 group → One-sample t-test

2 groups → Continue to Q3

3+ groups → ANOVA (not t-test)

Question 3: Are the observations paired or independent?

Paired → Paired t-test

Independent → Two-sample t-test (Welch's)

Scenario	Recommended Test
Testing if a population mean equals a hypothesized value (σ unknown)	One-sample t-test
Comparing means of two independent groups	Two-sample t-test (Welch's)
Before/after measurements on same subjects	Paired t-test
Two models evaluated on same test set	Paired t-test
Large n, known σ (rare)	Z-test
Proportions (success/failure data)	Z-test for proportions (Ch 15.2)
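
The flowchart can also be written as a small function. This is only a sketch of the decision logic above; the function name `choose_test` and its arguments are invented for illustration, and it does not cover proportions.

```python
def choose_test(sigma_known: bool, n_groups: int, paired: bool = False) -> str:
    """Return the test suggested by the flowchart (hypothetical helper)."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Question 1: is the population standard deviation known?
    if sigma_known:
        return "Z-test"
    # Question 2: how many groups are being compared?
    if n_groups == 1:
        return "One-sample t-test"
    if n_groups >= 3:
        return "ANOVA"
    # Question 3: two groups, paired or independent?
    return "Paired t-test" if paired else "Two-sample t-test (Welch's)"

print(choose_test(sigma_known=False, n_groups=2, paired=True))
print(choose_test(sigma_known=False, n_groups=2))
```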

Applications in AI/ML

Z-tests and t-tests are essential tools in the ML practitioner's toolkit. Here's how they're used in practice:

🔬 Model Comparison

Use a paired t-test when comparing two models on the same test set or cross-validation folds. The pairing accounts for dataset-specific effects.

```python
from scipy import stats

# Accuracy on 10 CV folds (illustrative values). The per-fold gap varies:
# a perfectly constant gap would give zero variance in the differences
# and a degenerate t-statistic.
model_a = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b = [0.84, 0.86, 0.83, 0.87, 0.85, 0.87, 0.85, 0.86, 0.88, 0.84]

# Paired t-test (same folds!)
t, p = stats.ttest_rel(model_b, model_a)
print(f"t={t:.3f}, p={p:.4f}")
```

🧪 A/B Testing for Conversions

For conversion rates (proportions), use a two-proportion Z-test. For continuous metrics like revenue per user, use a two-sample t-test.

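```python
from scipy import stats

# Revenue per user (continuous). The lists are truncated here;
# real data would contain many more users.
control = [45, 52, 38, 61, 55, 42]    # ...
treatment = [48, 56, 41, 65, 58, 45]  # ...

# Welch's t-test: SciPy's default is the pooled test, so pass equal_var=False
t, p = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t={t:.3f}, p={p:.4f}")
```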
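
For the conversion-rate case, a two-proportion Z-test can be written directly from the pooled-proportion formula. The sketch below is a minimal version under that assumption; the helper name and the conversion counts are invented for illustration.

```python
import math
from scipy.stats import norm

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided Z-test for H0: p_a = p_b using the pooled proportion (hypothetical helper)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)              # estimate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))     # SE of the difference
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Made-up counts: 120 of 1000 control users convert, 150 of 1000 treatment users convert
z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z={z:.3f}, p={p:.4f}")
```

The normal approximation behind this test is usually considered reasonable only when each group has a fair number of both successes and failures.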

🎯 Hyperparameter Tuning Validation

Before deploying a model with new hyperparameters, verify that improvements are statistically significant. Random seed variation can cause apparent gains that are just noise.

⚠️ Drift Detection

Use t-tests to detect if model predictions or feature distributions have shifted in production. Significant drift may trigger model retraining pipelines.

Multiple Testing Problem: If you run many t-tests (e.g., comparing 10 models pairwise), false positives accumulate! With 45 pairwise comparisons at α = 0.05, you expect about 45 × 0.05 = 2.25 false positives by chance even if no model truly differs from any other. Apply corrections like Bonferroni or control the false discovery rate (FDR).
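
Both corrections are short enough to sketch by hand. The code below is a minimal illustration: the helper names are made up, the p-values are invented, and the FDR procedure shown is Benjamini-Hochberg, one common way to control FDR.

```python
from math import comb
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()            # largest rank meeting its threshold
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

n_comparisons = comb(10, 2)            # pairwise comparisons among 10 models
expected_fp = n_comparisons * 0.05     # expected false positives if every null is true
print(n_comparisons, expected_fp)

p_values = [0.001, 0.012, 0.025, 0.041, 0.2]   # made-up p-values
print("Bonferroni:", bonferroni(p_values))
print("BH (FDR):  ", benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg tolerates a controlled fraction of false discoveries and typically rejects more hypotheses.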

Assumptions and Robustness

Like all parametric tests, t-tests make assumptions about the data. Understanding these assumptions and their robustness is crucial for valid inference.

Assumption	Consequence of Violation	Robustness
Random Sampling	Bias, invalid inference	Not robust - must be satisfied
Independence	Wrong standard errors, inflated Type I error	Not robust for severe violations
Normality	Affects small samples most	Very robust for n ≥ 30 (CLT)
Equal Variances (pooled)	Wrong SE, biased test	Use Welch's t-test instead

When t-Tests Are Robust

Large sample sizes (n ≥ 30 per group)
Symmetric or only mildly skewed data
No extreme outliers
Similar group sizes (for two-sample)

When to Use Alternatives

Heavy-tailed data: Use Mann-Whitney U or Wilcoxon
Very small n with non-normality: Use permutation tests
Ordinal data: Use nonparametric tests
Highly skewed continuous: Consider log transformation
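
Several of these alternatives are available in SciPy. The sketch below runs them on small invented datasets: two independent groups that each contain one extreme outlier, and one set of paired measurements.

```python
import numpy as np
from scipy import stats

# Made-up independent groups, each with one extreme outlier
group_a = np.array([1.2, 0.8, 1.5, 0.9, 1.1, 14.0, 1.3, 1.0])
group_b = np.array([2.1, 1.9, 2.6, 1.7, 2.4, 2.2, 30.0, 2.0])

welch = stats.ttest_ind(group_a, group_b, equal_var=False)            # sensitive to outliers
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")   # rank-based
log_welch = stats.ttest_ind(np.log(group_a), np.log(group_b),
                            equal_var=False)                          # t-test on the log scale

# Made-up paired measurements for the Wilcoxon signed-rank test
before = np.array([10.2, 9.8, 11.5, 10.9, 9.4, 10.1, 12.0, 10.6])
after = np.array([9.9, 9.1, 10.4, 10.4, 9.6, 8.6, 10.2, 9.7])
wilcoxon = stats.wilcoxon(after, before)

print(f"Welch t-test:        p={welch.pvalue:.4f}")
print(f"Mann-Whitney U:      p={mwu.pvalue:.4f}")
print(f"Welch on log values: p={log_welch.pvalue:.4f}")
print(f"Wilcoxon (paired):   p={wilcoxon.pvalue:.4f}")
```

With these particular numbers the two outliers inflate the sample variances, so the raw t-test is far less decisive than the rank-based test. Note that a log transformation requires strictly positive data.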

Python Implementation

```python
import numpy as np
from scipy import stats
from scipy.stats import nct, t as t_dist  # non-central t and Student's t

# ============================================
# One-Sample t-Test
# ============================================

# Testing if model latency differs from 50ms
latencies = np.array([54, 48, 56, 51, 53, 49, 55, 52, 50, 54])
hypothesized_mean = 50

result = stats.ttest_1samp(latencies, hypothesized_mean)
print("One-Sample t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value (two-tailed): {result.pvalue:.4f}")
print(f"  Sample mean: {latencies.mean():.2f}")

# One-tailed p-value (testing if latency > 50)
p_greater = result.pvalue / 2 if result.statistic > 0 else 1 - result.pvalue / 2
print(f"  p-value (one-tailed, greater): {p_greater:.4f}")


# ============================================
# Two-Sample t-Test (Independent Groups)
# ============================================

# Comparing accuracy of two models on different test sets
model_a_acc = [0.82, 0.85, 0.81, 0.84, 0.83, 0.86, 0.82, 0.84, 0.85, 0.83]
model_b_acc = [0.84, 0.87, 0.83, 0.86, 0.85, 0.88, 0.84, 0.86, 0.87, 0.85]

# Welch's t-test: equal_var=False is required, because SciPy's default
# (equal_var=True) is the pooled test
result = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=False)
print("\nTwo-Sample t-Test (Welch's):")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {np.mean(model_b_acc) - np.mean(model_a_acc):.4f}")

# Pooled t-test (assumes equal variances)
result_pooled = stats.ttest_ind(model_b_acc, model_a_acc, equal_var=True)
print(f"  Pooled t-statistic: {result_pooled.statistic:.4f}")


# ============================================
# Paired t-Test
# ============================================

# Same model evaluated before and after optimization (illustrative values).
# The improvement varies from pair to pair: a constant difference would give
# zero variance in the differences and an undefined t-statistic.
before = np.array([100, 105, 98, 102, 99, 103, 101, 104, 97, 100])
after = np.array([96, 99, 94, 97, 93, 99, 95, 100, 92, 96])

result = stats.ttest_rel(after, before)
print("\nPaired t-Test:")
print(f"  t-statistic: {result.statistic:.4f}")
print(f"  p-value: {result.pvalue:.4f}")
print(f"  Mean difference: {(after - before).mean():.2f}")
print(f"  Std of differences: {(after - before).std(ddof=1):.2f}")


# ============================================
# Effect Size (Cohen's d)
# ============================================

def cohens_d_independent(group1, group2):
    """Cohen's d for independent samples (pooled std)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def cohens_d_paired(diff):
    """Cohen's d for paired samples (using std of differences)."""
    return np.mean(diff) / np.std(diff, ddof=1)

d_indep = cohens_d_independent(model_b_acc, model_a_acc)
d_paired = cohens_d_paired(after - before)

print("\nEffect Sizes:")
print(f"  Cohen's d (independent): {d_indep:.3f}")
print(f"  Cohen's d (paired): {d_paired:.3f}")
print("  Interpretation: |d| around 0.2 small, 0.5 medium, 0.8 large")


# ============================================
# Confidence Interval for Mean Difference
# ============================================

def mean_diff_ci(data, confidence=0.95):
    """CI for one-sample mean or paired mean difference."""
    n = len(data)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    t_crit = t_dist.ppf((1 + confidence) / 2, df=n-1)
    return mean - t_crit * se, mean + t_crit * se

ci_lower, ci_upper = mean_diff_ci(after - before)
print(f"\n95% CI for mean difference: [{ci_lower:.2f}, {ci_upper:.2f}]")


# ============================================
# Power Analysis (for sample size planning)
# ============================================

def power_one_sample_ttest(n, effect_size, alpha=0.05, alternative='two-sided'):
    """Calculate power for one-sample t-test."""
    df = n - 1
    ncp = effect_size * np.sqrt(n)  # Non-centrality parameter

    if alternative == 'two-sided':
        t_crit = t_dist.ppf(1 - alpha/2, df)
        power = 1 - nct.cdf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)
    elif alternative == 'greater':
        t_crit = t_dist.ppf(1 - alpha, df)
        power = 1 - nct.cdf(t_crit, df, ncp)
    else:  # less
        t_crit = t_dist.ppf(alpha, df)
        power = nct.cdf(t_crit, df, ncp)

    return power

# What sample size needed to detect d=0.5 with 80% power?
for n in [10, 20, 30, 50, 100]:
    power = power_one_sample_ttest(n, effect_size=0.5)
    print(f"n={n:3d}: power = {power:.3f}")
```
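
The loop above only checks a handful of sample sizes. A small search answers the planning question directly. This sketch repeats the same two-sided power formula so that it runs on its own; the function names and the 80% target are illustrative choices.

```python
import numpy as np
from scipy.stats import nct, t as t_dist

def power_two_sided(n, effect_size, alpha=0.05):
    """Power of a two-sided one-sample t-test via the non-central t distribution."""
    df = n - 1
    ncp = effect_size * np.sqrt(n)              # non-centrality parameter
    t_crit = t_dist.ppf(1 - alpha / 2, df)
    return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)

def smallest_n(effect_size, target_power=0.80, alpha=0.05, n_max=10_000):
    """Smallest n whose power reaches the target (hypothetical helper, linear search)."""
    for n in range(2, n_max + 1):
        if power_two_sided(n, effect_size, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")

n_needed = smallest_n(effect_size=0.5)
print(f"n needed for d=0.5 at 80% power: {n_needed}")
print(f"power at that n: {power_two_sided(n_needed, 0.5):.3f}")
```

A linear search is enough here because power increases with n for a fixed effect size; for very small effects a bisection search would be faster.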


Summary

Key Takeaways

Z-tests require known σ: Rarely applicable in practice. Use t-tests when you must estimate the population standard deviation from the sample.
t-distribution has heavier tails: This accounts for the uncertainty in estimating σ with s. As n increases, t converges to Z.
One-sample t-test: Tests if a population mean equals a hypothesized value. df = n - 1.
Two-sample t-test: Compares means of two independent groups. Use Welch's version by default (no equal variance assumption).
Paired t-test: For matched or repeated measures. More powerful than independent tests when pairs are correlated.
Always report effect sizes: Statistical significance alone doesn't tell you if an effect is practically meaningful. Cohen's d provides interpretable magnitudes.
t-tests are robust: For n ≥ 30, violations of normality are usually not critical. However, independence and random sampling are essential.

Quick Reference

Test	Use Case	Degrees of Freedom
Z-test	Known σ (rare)	N/A (uses N(0,1))
One-sample t	Sample mean vs hypothesized value	n - 1
Two-sample t (pooled)	Two independent groups, equal var	n₁ + n₂ - 2
Two-sample t (Welch)	Two independent groups, unequal var	Satterthwaite approx.
Paired t	Matched pairs or repeated measures	n - 1 (n = pairs)

Looking Ahead: In the next section, we'll explore Chi-Square tests for categorical data - testing goodness of fit and independence in contingency tables.
