Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand the Wald test and its connection to confidence intervals
• Derive and interpret the Score (Rao) test statistic
• Explain why all three large-sample tests are asymptotically equivalent
• Recognize when each test is most appropriate
• Connect Fisher Information to test construction

🔧 Practical Skills

• Compute Wald, Score, and LRT statistics from data
• Visualize the three tests on the likelihood surface
• Choose the appropriate test for different scenarios
• Implement these tests in Python/statsmodels

🧠 AI/ML Applications

• Gradient-Based Optimization - The score function is the log-likelihood gradient that gradient-based optimizers follow
• Neural Network Training - Gradient-norm stopping criteria are loosely related to Score test statistics
• Feature Significance - Wald tests determine which features significantly affect predictions
• Model Comparison - All three tests are used in nested model selection
• Natural Gradient Methods - Fisher Information normalization connects to Score tests

Central Message: The Wald, Score, and Likelihood Ratio tests are three approaches to the same fundamental question—whether data is consistent with a null hypothesis. Understanding when and why they differ is essential for rigorous statistical inference and underlies key concepts in modern machine learning.

The Big Picture: Three Paths to the Same Truth

In the previous sections, we encountered the Likelihood Ratio Test (LRT). Now we explore two complementary approaches—the Wald test and the Score test—that provide different perspectives on the same underlying question.

🏔️ The Mountain Analogy

Imagine the log-likelihood function as a mountain landscape, with the peak at the MLE. Each test asks a different question about where the null hypothesis $\theta_0$ sits relative to the peak:

Wald Test

"Standing at the peak (MLE), how far away is θ₀?"

Score Test

"Standing at θ₀, how steep is the slope toward the peak?"

LRT

"How much higher is the peak than where we're standing at θ₀?"

Historical Context: The Wald test was developed by Abraham Wald in the 1940s, building on earlier work by Ronald Fisher. The Score test was introduced by C. R. Rao in 1948 and is often called the Rao test; in econometrics an equivalent formulation is known as the Lagrange Multiplier (LM) test. Together with Neyman and Pearson's likelihood ratio approach, these form the "Holy Trinity" of large-sample hypothesis testing.

The Wald Test

The Wald test is perhaps the most intuitive of the three tests. It directly measures how far the Maximum Likelihood Estimate (MLE) is from the null hypothesis value, standardized by the estimated standard error.

Mathematical Formulation

Wald Test Statistic

W = \frac{\hat{\theta} - \theta_0}{\text{SE}(\hat{\theta})} = \frac{\hat{\theta} - \theta_0}{\sqrt{\widehat{\text{Var}}(\hat{\theta})}}

Under H₀, $W \stackrel{d}{\to} N(0,1)$ as $n \to \infty$

Equivalently, the squared Wald statistic follows a chi-square distribution:

W^2 = \frac{(\hat{\theta} - \theta_0)^2}{\widehat{\text{Var}}(\hat{\theta})} \stackrel{d}{\to} \chi^2(1)

Symbol	Meaning	Interpretation
θ̂	Maximum Likelihood Estimate	Best guess for the true parameter
θ₀	Null hypothesis value	The value we're testing against
SE(θ̂)	Standard error of MLE	Uncertainty in our estimate (from Fisher Info)
W	Wald statistic	Number of SEs the MLE is from θ₀

Key insight: The Wald test is the natural extension of confidence intervals. Rejecting H₀: θ = θ₀ at level α is equivalent to checking whether θ₀ falls outside the (1 - α) confidence interval for θ.

Connection to Confidence Intervals: The Wald test inverts the confidence interval. If θ̂ ± z_α/2 × SE(θ̂) contains θ₀, we fail to reject H₀. This duality is fundamental.
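The duality can be checked numerically. The following is a minimal sketch, assuming illustrative values θ̂ = 0.5, θ₀ = 0, and SE = 0.2; the helper name `wald_test` is hypothetical and not part of any library.

```python
from scipy import stats

def wald_test(theta_hat, theta0, se, alpha=0.05):
    """Wald z statistic, two-sided p-value, and the matching (1 - alpha) CI."""
    w = (theta_hat - theta0) / se
    p_value = 2 * stats.norm.sf(abs(w))
    z_crit = stats.norm.ppf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    ci = (theta_hat - z_crit * se, theta_hat + z_crit * se)
    reject = bool(abs(w) > z_crit)
    return w, p_value, ci, reject

# Illustrative inputs (assumed values, not real data)
theta0 = 0.0
w, p_value, ci, reject = wald_test(theta_hat=0.5, theta0=theta0, se=0.2)
theta0_outside_ci = not (ci[0] <= theta0 <= ci[1])

print(f"W = {w:.3f}, p = {p_value:.4f}")
print(f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"reject H0: {reject}; theta0 outside CI: {theta0_outside_ci}")
```

The two booleans always agree, because |W| > z_α/2 and "θ₀ lies outside θ̂ ± z_α/2 × SE(θ̂)" are the same inequality rearranged.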

Interactive: Wald Test Explorer

Explore how the Wald test works. Adjust the MLE, null hypothesis value, and standard error to see how the test statistic and decision change.

Wald Test Explorer

Null value (θ₀): 0.00

MLE (θ̂): 0.50

Standard Error SE(θ̂): 0.200

Significance level (α): 0.05

Wald Statistic Formula

W = (θ̂ - θ₀) / SE(θ̂) = (0.50 - 0.00) / 0.200 = 2.500

Sampling Distribution Under H₀ (Standard Normal)

Decision at α = 0.05

Critical value: ±1.960

|W| = 2.500 > 1.960

Reject H₀

95% Confidence Interval

θ̂ ± z_α/2 · SE(θ̂)

0.500 ± 1.960 × 0.200

[0.108, 0.892]

CI does not contain θ₀ (consistent with rejecting H₀)

Two-sided P-value

p = 0.0124

p < α = 0.05 → Reject H₀

Wald Test Pitfalls:

Boundary problems: When θ̂ is near parameter boundaries (e.g., p near 0 or 1), the SE shrinks, potentially giving misleading results
Parameterization sensitivity: The Wald test is not invariant to reparameterization. For example, H₀: θ = 1 and H₀: log θ = 0 state the same hypothesis but can give different Wald statistics
Finite sample bias: The asymptotic SE may poorly approximate the true SE for small samples
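The parameterization pitfall is easy to see on a binomial proportion. The sketch below assumes illustrative data (30 successes in 50 trials) and tests the same null hypothesis twice: once on the probability scale and once on the log-odds scale, using the delta-method standard error for the log-odds.

```python
import numpy as np

x, n, p0 = 30, 50, 0.5          # illustrative data (assumed values)
p_hat = x / n

# Wald test on the probability scale: H0: p = 0.5
se_prob = np.sqrt(p_hat * (1 - p_hat) / n)
z_prob = (p_hat - p0) / se_prob

# Wald test on the log-odds scale: H0: log(p / (1 - p)) = 0 (the same hypothesis)
logit_hat = np.log(p_hat / (1 - p_hat))
logit0 = np.log(p0 / (1 - p0))
se_logit = np.sqrt(1 / (n * p_hat * (1 - p_hat)))   # delta-method SE of the log-odds
z_logit = (logit_hat - logit0) / se_logit

print(f"Wald z (probability scale): {z_prob:.4f}")
print(f"Wald z (log-odds scale):    {z_logit:.4f}")
```

The two z values differ even though the hypothesis and the data are identical. The LRT, by contrast, depends only on the maximized likelihoods and is unchanged by this transformation.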

The Score (Rao) Test

The Score test takes a fundamentally different approach. Instead of asking how far the MLE is from θ₀, it asks: "If θ₀ were true, how much does the likelihood want to move away from θ₀?"

The Geometric Intuition

The score function is the derivative of the log-likelihood with respect to the parameter:

Score Function and Test Statistic

Score Function:

U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \log L(\theta)

Score Test Statistic:

S = \frac{U(\theta_0)}{\sqrt{I(\theta_0)}} = \frac{U(\theta_0)}{\sqrt{-E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]_{\theta_0}}}

Under H₀, $S \stackrel{d}{\to} N(0,1)$

Why "Score"?

The score function tells us the direction and magnitude the parameter should move to increase the likelihood. At the MLE, U(θ̂) = 0 (the likelihood is maximized).

The Key Insight

If θ₀ is the true value (H₀ is true), then we're already at or near the peak of the likelihood—the slope (score) should be close to zero. A large |U(θ₀)| suggests we're on a steep slope, far from the peak.

Fisher Information I(θ): The denominator uses the Fisher Information, which measures the curvature of the log-likelihood. Higher curvature means the data is more informative about θ. This normalizes the score to create a proper test statistic.
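Why dividing by √I(θ₀) yields a standard normal statistic can be illustrated by simulation. The sketch below assumes binomial data with illustrative values n = 50 and p₀ = 0.3, and checks that under H₀ the score has mean close to 0 and variance close to I(p₀).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0, reps = 50, 0.3, 20_000            # illustrative settings (assumed values)

# Draw many datasets under H0 and evaluate the score at p0 for each one
x = rng.binomial(n, p0, size=reps)
score = x / p0 - (n - x) / (1 - p0)      # U(p0) for Binomial(n, p) data
fisher_info = n / (p0 * (1 - p0))        # I(p0)

s = score / np.sqrt(fisher_info)         # standardized score statistic

print(f"mean of U(p0):     {score.mean():.3f}  (theory: 0)")
print(f"variance of U(p0): {score.var():.1f}  (theory: I(p0) = {fisher_info:.1f})")
print(f"variance of S:     {s.var():.3f}  (theory: 1)")
```

Because the score has variance I(θ₀) under the null, rescaling by √I(θ₀) gives a statistic with unit variance, which is what the N(0,1) reference distribution requires.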

Interactive: Score Test Explorer

Visualize how the Score test works. See the tangent line (score) at the null hypothesis value and how its slope relates to the test decision.

Score (Rao) Test Explorer

Sample size (n): 30

Sample mean (x̄): 2.50

Known σ: 2.00

Null hypothesis (μ₀): 2.00

The Score Function: Derivative of Log-Likelihood

Score Function: U(μ₀) = ∂ℓ/∂μ |_μ=μ₀ = n(x̄ - μ₀)/σ²

Fisher Information: I(μ₀) = n/σ² = 7.50

Score Statistic: S = U(μ₀)/√I(μ₀) = 1.369

[Figure: Log-Likelihood Curve with Score (Tangent Slope at μ₀). Legend: log-likelihood curve, tangent (score) at μ₀, null hypothesis point, MLE (maximum).]

The Score Test Intuition

The score is the slope of the log-likelihood at μ₀. If the null hypothesis is true, μ₀ should be near the maximum of the likelihood, so the slope (score) should be near zero. A large absolute score indicates that μ₀ is far from the MLE, providing evidence against H₀.

Score near 0: We're near the peak → H₀ consistent with data

Large |Score|: We're on a steep slope → Evidence against H₀

Test Calculations

Score: U(μ₀) = n(x̄ - μ₀)/σ² = 30(2.50 - 2.00)/2.00² = 3.750

Fisher Info: I(μ₀) = n/σ² = 30/2.00² = 7.500

Score Statistic: S = U/√I = 3.750/2.739 = 1.369

Test Decision (α = 0.05)

Score Statistic S² = 1.875

p-value = 0.1709

Fail to reject H₀
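The calculation above can be reproduced in a few lines. This is a minimal sketch for the normal mean with known σ, using the same inputs as the explorer (n = 30, x̄ = 2.50, σ = 2.00, μ₀ = 2.00).

```python
import numpy as np
from scipy import stats

n, x_bar, sigma, mu0 = 30, 2.5, 2.0, 2.0

score = n * (x_bar - mu0) / sigma**2       # U(mu0)
fisher_info = n / sigma**2                 # I(mu0)
s = score / np.sqrt(fisher_info)           # score statistic S
p_value = 2 * stats.norm.sf(abs(s))        # two-sided p-value from N(0, 1)

print(f"U(mu0) = {score:.3f}, I(mu0) = {fisher_info:.3f}")
print(f"S = {s:.3f}, S^2 = {s**2:.3f}, p = {p_value:.4f}")
```

In this particular model S simplifies to (x̄ - μ₀)/(σ/√n), the familiar z statistic, because the Fisher Information does not depend on μ.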

Why Use the Score Test?

Only requires fitting the null model: Unlike Wald and LRT, we don't need to find the MLE under H₁
Computationally efficient: Useful for screening many candidate parameters in large models, because none of the larger models has to be fit
Better small-sample behavior: Less sensitive to reparameterization than Wald
Lagrange Multiplier Test: Also known by this name in econometrics

Computational Advantage: The Score test only requires evaluating the likelihood and its derivatives at θ₀. Unlike the Wald and LRT tests, you don't need to find the MLE under the alternative hypothesis. This makes the Score test computationally efficient for complex models.
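To make the "null model only" point concrete, here is a sketch of a Score test for adding one feature to an intercept-only logistic regression. The function name `score_test_added_feature` and the tiny dataset are hypothetical, chosen for illustration. The only fitting step is the null model, whose MLE is simply the sample mean of y.

```python
import numpy as np
from scipy import stats

def score_test_added_feature(x, y):
    """Score test of H0: beta = 0 in logit P(y = 1) = alpha + beta * x.

    Only the null (intercept-only) model is fit; its MLE is p_hat = mean(y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    p_hat = y.mean()                                   # null-model fit
    score = np.sum(x * (y - p_hat))                    # d(log-lik)/d(beta) at beta = 0
    # Fisher information for beta, adjusted for the estimated intercept
    info = p_hat * (1 - p_hat) * np.sum((x - x.mean()) ** 2)
    stat = score**2 / info
    p_value = stats.chi2.sf(stat, df=1)
    return stat, p_value

# Hypothetical data for illustration
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 1, 0, 1]
stat, p_value = score_test_added_feature(x, y)
print(f"Score chi-square = {stat:.4f}, p = {p_value:.4f}")
```

No iterative optimization over β is needed, so the same null fit can be reused to screen many candidate features.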

Interactive: Comparing All Three Tests

Now let's see all three tests side by side on the same likelihood curve. This visualization reveals how they approach the same problem from different perspectives.

Wald, Score, and Likelihood Ratio Test Comparison

Sample size (n): 50

Successes (x): 30 (60.0%)

Null hypothesis (p₀): 0.50

[Figure: Log-Likelihood Curve: Visual Comparison of the Three Tests. Legend: Wald (horizontal distance at the MLE), LRT (vertical drop), Score (tangent at p₀).]

Test	Z-statistic	Chi-square	P-value	Evaluated at
Wald	1.443	2.083	0.1489	MLE (p̂)
Score (Rao)	1.414	2.000	0.1573	H₀ (p₀)
Likelihood Ratio	1.419 (signed)	2.014	0.1559	Both p₀ and p̂

Key Observations

- The three tests are asymptotically equivalent as n grows large
- For finite samples, LRT is generally most reliable (uses both null and MLE)
- Wald test can be misleading when p̂ is near 0 or 1 (boundary effects)
- Score test only requires fitting the null model (efficient for complex models)
- Notice how the three p-values are nearly identical at this sample size

Asymptotic Equivalence

One of the most beautiful results in mathematical statistics is that all three tests are asymptotically equivalent—they converge to the same distribution and give the same conclusions as sample size grows.

Asymptotic Equivalence Theorem

Under the null hypothesis and suitable regularity conditions, as n → ∞:

W^2 \approx S^2 \approx \Lambda \stackrel{d}{\to} \chi^2(k)

where k is the number of parameters being tested, W is Wald, S is Score, and Λ is the LRT statistic

The mathematical reason: W² and S² are both quadratic approximations to the LRT statistic Λ near the MLE. Expanding the log-likelihood in a Taylor series around θ̂, the first-order term vanishes because U(θ̂) = 0:

Log-likelihood expansion around MLE:

$\ell(\theta_0) \approx \ell(\hat{\theta}) - \frac{1}{2}(\hat{\theta} - \theta_0)^2 I(\hat{\theta})$

This gives us:

$\Lambda = 2[\ell(\hat{\theta}) - \ell(\theta_0)] \approx (\hat{\theta} - \theta_0)^2 I(\hat{\theta}) \approx W^2$
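The same argument extends to the Score statistic. As a sketch, assuming the usual regularity conditions and that for large n the observed curvature satisfies −ℓ''(θ̂) ≈ I(θ̂) ≈ I(θ₀), a first-order expansion of the score around the MLE gives:

```latex
U(\theta_0) \approx U(\hat{\theta}) + (\theta_0 - \hat{\theta})\,\ell''(\hat{\theta})
            \approx (\hat{\theta} - \theta_0)\, I(\hat{\theta}),
\qquad \text{since } U(\hat{\theta}) = 0 \text{ and } -\ell''(\hat{\theta}) \approx I(\hat{\theta})

S^2 = \frac{U(\theta_0)^2}{I(\theta_0)}
    \approx \frac{(\hat{\theta} - \theta_0)^2\, I(\hat{\theta})^2}{I(\theta_0)}
    \approx (\hat{\theta} - \theta_0)^2\, I(\hat{\theta})
    \approx W^2
```

So all three statistics reduce to the same quadratic form (θ̂ − θ₀)² I(θ̂) to leading order, and they differ only in higher-order terms that vanish as n grows.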

Interactive: Asymptotic Equivalence Demo

Run simulations to see how the three test statistics converge as sample size increases. Watch the scatter plots tighten around the diagonal (perfect agreement) as n grows.

Asymptotic Equivalence Demonstration

W² (Wald) ≈ S² (Score) ≈ Λ (LRT); in this demo all three converge to χ²(1) under H₀.

Sample size (n): 50

Increase n to see tests converge

True probability (p): 0.50

Testing H₀: p = 0.5 vs H₁: p ≠ 0.5
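A simulation along these lines can be sketched in a few lines of Python. The helper `three_statistics`, the seed, and the sample sizes are assumptions made for illustration. The code draws binomial data under H₀: p = 0.5 and reports how far apart the statistics are on average.

```python
import numpy as np
from scipy.special import xlogy

def three_statistics(x, n, p0):
    """Return (W^2, S^2, LRT) for binomial counts x out of n under H0: p = p0."""
    p_hat = x / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)
    lrt = 2 * (xlogy(x, p_hat / p0) + xlogy(n - x, (1 - p_hat) / (1 - p0)))
    return wald, score, lrt

rng = np.random.default_rng(1)
p0, reps = 0.5, 5_000
mean_gap = {}
for n in (20, 200, 2000):
    x = rng.binomial(n, p0, size=reps)
    x = x[(x > 0) & (x < n)]     # drop degenerate samples where the Wald SE is zero
    wald, score, lrt = three_statistics(x, n, p0)
    mean_gap[n] = (np.mean(np.abs(wald - score)), np.mean(np.abs(wald - lrt)))
    print(f"n = {n:5d}: mean|W2 - S2| = {mean_gap[n][0]:.4f}, "
          f"mean|W2 - LRT| = {mean_gap[n][1]:.4f}")
```

The average gaps shrink roughly in proportion to 1/n, which is the numerical counterpart of the scatter plots tightening around the diagonal.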

Finite Sample Ordering

For finite samples, the statistics often follow this ordering:

S² (Score) ≤ Λ (LRT) ≤ W² (Wald)

When it holds, the Wald test is the most likely to reject H₀ (most "liberal"), the Score test is the least likely to reject (most "conservative"), and the LRT sits in between. The ordering is exact for linear restrictions in the normal linear regression model, but it is not guaranteed in general: for a binomial proportion with p₀ = 0.1 and p̂ = 0.2, for example, S² exceeds W².
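A quick numeric check on the binomial proportion shows how the ordering behaves in practice. The sketch below evaluates all three statistics for two cases. The helper `three_statistics` and the data values are illustrative assumptions.

```python
from scipy.special import xlogy

def three_statistics(x, n, p0):
    """Return (W^2, S^2, LRT) for x successes in n trials under H0: p = p0."""
    p_hat = x / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)
    lrt = 2 * (xlogy(x, p_hat / p0) + xlogy(n - x, (1 - p_hat) / (1 - p0)))
    return float(wald), float(score), float(lrt)

# Case A: p0 = 0.5, p_hat = 0.6
wald_a, score_a, lrt_a = three_statistics(x=30, n=50, p0=0.5)
# Case B: p0 = 0.1, p_hat = 0.2
wald_b, score_b, lrt_b = three_statistics(x=10, n=50, p0=0.1)

print(f"Case A: S2 = {score_a:.3f}, LRT = {lrt_a:.3f}, W2 = {wald_a:.3f}")
print(f"Case B: S2 = {score_b:.3f}, LRT = {lrt_b:.3f}, W2 = {wald_b:.3f}")
```

With p₀ = 0.5 the statistics line up as S² ≤ Λ ≤ W². With p₀ = 0.1 and p̂ = 0.2 the order reverses, so the ordering is best treated as a common pattern rather than a guarantee.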

Which Test Should You Use?

Test	Best When	Pros	Cons
Wald	MLE is easy to compute; want confidence intervals	Intuitive; directly connects to CI; widely reported	Unreliable near boundaries; not invariant to reparameterization
Score (Rao)	Model fitting is expensive; testing many parameters	Only needs null model; computationally efficient; good small-sample behavior	Less intuitive; requires derivative computations
LRT	Both models can be fit; want reliable small-sample performance	Best overall properties; invariant to reparameterization; uses all information	Requires fitting both models; not always available in closed form

Worked Examples

Applications in AI/ML

The Wald, Score, and LRT tests have deep connections to machine learning, often appearing in unexpected places:

⚡ Gradient Descent is a Score Test

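When the training loss is a negative log-likelihood, the gradient ∇θL is exactly the negative of the score function. A stopping criterion such as |∇θL| < ε checks whether the score is small, which is the same idea a Score test formalizes, although a Score test also normalizes by the Fisher Information and compares the result against a reference distribution. Such a criterion asks: "Is the gradient small enough that we're near a local optimum?"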

🎯 Natural Gradient and Fisher Information

Natural gradient methods use the Fisher Information matrix to precondition updates: θ ← θ - α·I(θ)⁻¹·∇θL. This is the same normalization used in Score tests! In the limit of small step sizes, it makes the update direction invariant to how the model is parameterized.
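The effect of the Fisher preconditioning can be seen in a one-parameter example. The sketch below assumes a binomial likelihood in the mean parameterization with illustrative data (30 successes in 50 trials). In this special case I(p)⁻¹U(p) = p̂ − p, so a natural-gradient step with α = 1 lands exactly on the MLE.

```python
x, n = 30, 50            # illustrative data: 30 successes in 50 trials
p_hat = x / n            # MLE, for reference

def grad_nll(p):
    """Gradient of the negative log-likelihood, i.e. minus the score U(p)."""
    return -(x / p - (n - x) / (1 - p))

def fisher_info(p):
    """Fisher Information I(p) for Binomial(n, p)."""
    return n / (p * (1 - p))

p_start = 0.2

# Natural-gradient step: theta <- theta - alpha * I(theta)^(-1) * grad L, with alpha = 1
p_natural = p_start - 1.0 * grad_nll(p_start) / fisher_info(p_start)

# Plain gradient step with a small, hand-picked step size for comparison
p_plain = p_start - 0.001 * grad_nll(p_start)

print(f"MLE:                   {p_hat:.3f}")
print(f"natural-gradient step: {p_natural:.3f}")
print(f"plain gradient step:   {p_plain:.3f}")
```

The plain gradient step depends on a step size that must be tuned to the scale of the problem, whereas the Fisher-normalized step is automatically scaled by the curvature. The one-step convergence is specific to this simple model and should not be expected in general.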

📊 Feature Selection via Wald Tests

In statsmodels' Logit, the coefficient p-values come from Wald tests (scikit-learn's LogisticRegression does not report p-values). Features with |W| > 1.96 (or equivalently, p < 0.05) are considered significant predictors at the 5% level. This is the statistical foundation of stepwise selection.

🧠 Model Comparison and Nested Models

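When comparing neural architectures (e.g., more layers vs. fewer), the LRT framework applies in principle if the models are nested and the usual regularity conditions hold. AIC and BIC are penalized log-likelihood criteria, so the difference between two nested models' AIC or BIC values is a penalized log-likelihood ratio, which connects the LRT to model selection.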

Pearson's χ² is a Score Test: The familiar Σ(O-E)²/E statistic for contingency tables is actually a Score test for multinomial data! This connection is rarely taught but deeply unifies categorical data analysis.
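For the simplest case, a binomial outcome viewed as a two-cell table, this identity can be verified directly. The sketch below assumes illustrative data (30 successes in 50 trials, p₀ = 0.5) and compares `scipy.stats.chisquare` with the squared Score statistic.

```python
import numpy as np
from scipy import stats

x, n, p0 = 30, 50, 0.5      # illustrative data (assumed values)

# Pearson chi-square: sum over cells of (O - E)^2 / E
observed = np.array([x, n - x])
expected = np.array([n * p0, n * (1 - p0)])
pearson = stats.chisquare(f_obs=observed, f_exp=expected).statistic

# Squared Score statistic for H0: p = p0
score = x / p0 - (n - x) / (1 - p0)      # U(p0)
fisher_info = n / (p0 * (1 - p0))        # I(p0)
score_sq = score**2 / fisher_info

print(f"Pearson chi-square: {pearson:.4f}")
print(f"Score statistic^2:  {score_sq:.4f}")
```

Both lines print the same value, since Σ(O-E)²/E reduces algebraically to n(p̂ - p₀)²/(p₀(1 - p₀)) when there are two cells.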

Python Implementation

Wald, Score, and LRT Implementation

Imports

Import necessary libraries: numpy for numerical operations, scipy.stats for distributions and tests, and statsmodels for regression analysis.

Function Definition

Function docstring explaining the three tests we'll compute. Each test answers the same question (is p = p0?) but uses different information from the likelihood.

Wald Test

WALD TEST: Evaluate the standard error at the MLE (p̂). The test statistic measures how many standard errors the MLE is from the null value. The SE formula uses p̂, not p0!

Score Test

SCORE TEST: Compute the score (gradient of log-likelihood) and Fisher Information at the NULL value p0. The test asks: 'if H0 is true, how steep is the slope toward the MLE?'

LRT

LRT: Compare the log-likelihood at MLE vs at null. The test statistic 2(l̂ - l0) follows chi-square under H0. This is the most direct measure of evidence.

Example Usage

Example showing that all three tests give similar conclusions for this moderate sample size. In this example the Wald z is largest (most liberal) and the Score z is smallest (most conservative).

Regression Example

Logistic regression example. Statsmodels automatically computes Wald tests for each coefficient - that's what the 'z' column in regression output represents.

Model Comparison

LRT for model comparison: Testing whether the last 2 coefficients are jointly zero. Fit both models, compare log-likelihoods. This is how we test nested models.

import numpy as np
from scipy import stats
from scipy.special import xlogy
import statsmodels.api as sm

# =============================================
# Example: Binomial proportion test
# =============================================

def binomial_tests(x: int, n: int, p0: float = 0.5):
    """
    Compute Wald, Score, and LRT for H0: p = p0 vs H1: p != p0.

    Parameters
    ----------
    x : int
        Number of successes
    n : int
        Total trials
    p0 : float
        Null hypothesis proportion (strictly between 0 and 1)

    Returns
    -------
    dict with all three test statistics and p-values
    """
    p_hat = x / n  # MLE

    # --- WALD TEST ---
    # Uses SE evaluated at MLE
    se_wald = np.sqrt(p_hat * (1 - p_hat) / n)
    if se_wald > 0:
        wald_z = (p_hat - p0) / se_wald
    else:
        # p_hat is 0 or 1: the estimated SE collapses to zero, so the Wald
        # statistic is unbounded (the boundary problem noted earlier)
        wald_z = np.sign(p_hat - p0) * np.inf
    wald_p = 2 * stats.norm.sf(abs(wald_z))

    # --- SCORE TEST ---
    # Uses Fisher Information at null hypothesis value
    score = x / p0 - (n - x) / (1 - p0)  # Derivative of log-L at p0
    fisher_info = n / (p0 * (1 - p0))     # Fisher Information at p0
    score_z = score / np.sqrt(fisher_info)
    score_p = 2 * stats.norm.sf(abs(score_z))

    # --- LIKELIHOOD RATIO TEST ---
    # Compares log-likelihoods at MLE and null
    def log_lik(p):
        # xlogy(0, 0) = 0, so the boundary cases p_hat = 0 or 1 are handled
        return xlogy(x, p) + xlogy(n - x, 1 - p)

    ll_mle = log_lik(p_hat)
    ll_null = log_lik(p0)
    lrt_stat = 2 * (ll_mle - ll_null)
    lrt_p = stats.chi2.sf(lrt_stat, df=1)

    return {
        'mle': p_hat,
        'wald': {'z': wald_z, 'chi2': wald_z**2, 'p_value': wald_p},
        'score': {'z': score_z, 'chi2': score_z**2, 'p_value': score_p},
        'lrt': {'statistic': lrt_stat, 'p_value': lrt_p}
    }

# Example usage
result = binomial_tests(x=62, n=100, p0=0.5)
print(f"MLE: p̂ = {result['mle']:.4f}")
print(f"Wald:  z = {result['wald']['z']:.3f}, p = {result['wald']['p_value']:.4f}")
print(f"Score: z = {result['score']['z']:.3f}, p = {result['score']['p_value']:.4f}")
print(f"LRT:   Λ = {result['lrt']['statistic']:.3f}, p = {result['lrt']['p_value']:.4f}")


# =============================================
# Using statsmodels for regression
# =============================================

def regression_wald_tests():
    """Demonstrate Wald tests in logistic regression."""
    # Generate synthetic data
    np.random.seed(42)
    n = 1000
    X = np.random.randn(n, 3)
    X = sm.add_constant(X)
    true_beta = np.array([0.5, -1.0, 0.0, 0.5])  # Note: X2 has no effect
    prob = 1 / (1 + np.exp(-X @ true_beta))
    y = np.random.binomial(1, prob)

    # Fit logistic regression
    model = sm.Logit(y, X)
    result = model.fit(disp=0)

    # Extract Wald statistics (this is what statsmodels reports)
    print("\n=== Logistic Regression Wald Tests ===")
    print(result.summary2().tables[1])  # Coefficient table

    # The z-column is the Wald statistic: z = coef / SE
    # The P>|z| column is the two-sided p-value

    # Manual verification for the X1 coefficient (index 0 is the intercept):
    coef = result.params[1]
    se = result.bse[1]
    z = coef / se
    p = 2 * stats.norm.sf(abs(z))
    print("\nManual check for X1:")
    print(f"  Coefficient: {coef:.4f}")
    print(f"  SE: {se:.4f}")
    print(f"  Wald z: {z:.4f}")
    print(f"  p-value: {p:.4f}")

    return result

# Run regression example
reg_result = regression_wald_tests()


# =============================================
# Comparing models with LRT (nested models)
# =============================================

def model_comparison_lrt():
    """Compare nested models using LRT."""
    np.random.seed(42)
    n = 500
    X = np.random.randn(n, 4)
    true_beta = np.array([1.0, -0.5, 0.0, 0.0])  # Only first 2 are nonzero
    y = X @ true_beta + np.random.randn(n)

    # Full model (all 4 predictors)
    X_full = sm.add_constant(X)
    model_full = sm.OLS(y, X_full).fit()

    # Reduced model (first 2 predictors only)
    X_reduced = sm.add_constant(X[:, :2])
    model_reduced = sm.OLS(y, X_reduced).fit()

    # LRT: Are the last 2 coefficients jointly zero?
    ll_full = model_full.llf
    ll_reduced = model_reduced.llf
    lrt_stat = 2 * (ll_full - ll_reduced)
    df = 2  # Number of restrictions (2 coefficients set to 0)
    p_value = stats.chi2.sf(lrt_stat, df)

    print("\n=== LRT for Nested Models ===")
    print(f"Full model log-lik: {ll_full:.2f}")
    print(f"Reduced model log-lik: {ll_reduced:.2f}")
    print(f"LRT statistic: {lrt_stat:.4f}")
    print(f"df: {df}")
    print(f"p-value: {p_value:.4f}")

    if p_value > 0.05:
        print("Conclusion: Fail to reject H0. Reduced model is adequate.")
    else:
        print("Conclusion: Reject H0. Full model is significantly better.")

    return model_full, model_reduced

model_comparison_lrt()


Summary

Key Takeaways

Three tests, one truth: Wald, Score, and LRT are three different approaches to the same hypothesis testing problem. They are asymptotically equivalent but can differ for finite samples.
Wald test: Measures (θ̂ - θ₀)/SE(θ̂). Evaluated at the MLE. Most intuitive but can be unreliable near parameter boundaries.
Score test: Measures U(θ₀)/√I(θ₀). Evaluated at the null hypothesis. Computationally efficient—only requires fitting the null model.
Finite sample ordering: Often S² ≤ Λ ≤ W², in which case Wald is the most liberal (rejects most often) and Score the most conservative; the ordering is not guaranteed in every model.
Asymptotic equivalence: As n → ∞, all three converge to χ²(k) under H₀. This deep result connects them as different views of the same likelihood curvature.
Connections to ML: Gradient descent uses the score function, natural gradient normalizes by Fisher Information, and Wald tests determine coefficient significance.

Quick Reference

Test	Formula	Evaluated At	Uses
Wald (W)	(θ̂ - θ₀)/SE(θ̂)	MLE	SE from Fisher Info at MLE
Score (S)	U(θ₀)/√I(θ₀)	Null value	Score and Fisher Info at θ₀
LRT (Λ)	2[ℓ(θ̂) - ℓ(θ₀)]	Both	Log-likelihoods at both points

All three have χ²(k) distribution under H₀ as n → ∞

Looking Ahead: In the next section, we'll explore Permutation Tests—a powerful non-parametric approach that makes no distributional assumptions. While Wald, Score, and LRT rely on asymptotic theory, permutation tests provide exact inference for any sample size.
