Chapter 15

Wald and Score Tests

Common Statistical Tests

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand the Wald test and its connection to confidence intervals
  • Derive and interpret the Score (Rao) test statistic
  • Explain why all three large-sample tests are asymptotically equivalent
  • Recognize when each test is most appropriate
  • Connect Fisher Information to test construction

🔧 Practical Skills

  • Compute Wald, Score, and LRT statistics from data
  • Visualize the three tests on the likelihood surface
  • Choose the appropriate test for different scenarios
  • Implement these tests in Python/statsmodels

🧠 AI/ML Applications

  • Gradient-Based Optimization - When the loss is a negative log-likelihood, its gradient is the negative of the score function that Score tests are built on
  • Neural Network Training - Gradient-norm stopping criteria loosely resemble an unnormalized Score statistic
  • Feature Significance - Wald tests determine which features significantly affect predictions
  • Model Comparison - All three tests used in nested model selection
  • Natural Gradient Methods - Fisher Information normalization connects to Score tests
Central Message: The Wald, Score, and Likelihood Ratio tests are three approaches to the same fundamental question—whether data is consistent with a null hypothesis. Understanding when and why they differ is essential for rigorous statistical inference and underlies key concepts in modern machine learning.

The Big Picture: Three Paths to the Same Truth

In the previous sections, we encountered the Likelihood Ratio Test (LRT). Now we explore two complementary approaches—the Wald test and the Score test—that provide different perspectives on the same underlying question.

🏔️

The Mountain Analogy

Imagine the log-likelihood function as a mountain landscape, with the peak at the MLE. Each test asks a different question about where the null hypothesis θ₀ sits relative to the peak:

Wald Test

"Standing at the peak (MLE), how far away is θ₀?"

Score Test

"Standing at θ₀, how steep is the slope toward the peak?"

LRT

"How much higher is the peak than where we're standing at θ₀?"

Historical Context: The Wald test was developed by Abraham Wald in the 1940s, building on earlier work by Ronald Fisher. The Score test was introduced by C. R. Rao in 1948, which is why it is often called the Rao test; in econometrics the same test is usually known as the Lagrange Multiplier (LM) test. Together with Neyman and Pearson's likelihood ratio approach, these form the "Holy Trinity" of large-sample hypothesis testing.


The Wald Test

The Wald test is perhaps the most intuitive of the three tests. It directly measures how far the Maximum Likelihood Estimate (MLE) is from the null hypothesis value, standardized by the estimated standard error.

Mathematical Formulation

Wald Test Statistic

$$W = \frac{\hat{\theta} - \theta_0}{\text{SE}(\hat{\theta})} = \frac{\hat{\theta} - \theta_0}{\sqrt{\widehat{\text{Var}}(\hat{\theta})}}$$

Under H₀, W → N(0,1) in distribution as n → ∞

Equivalently, the squared Wald statistic follows a chi-square distribution:

$$W^2 = \frac{(\hat{\theta} - \theta_0)^2}{\widehat{\text{Var}}(\hat{\theta})} \stackrel{d}{\to} \chi^2(1)$$

| Symbol | Meaning | Interpretation |
| --- | --- | --- |
| θ̂ | Maximum Likelihood Estimate | Best guess for the true parameter |
| θ₀ | Null hypothesis value | The value we're testing against |
| SE(θ̂) | Standard error of MLE | Uncertainty in our estimate (from Fisher Info) |
| W | Wald statistic | Number of SEs the MLE is from θ₀ |

Key insight: The Wald test is the natural extension of confidence intervals. Rejecting H₀: θ = θ₀ at level α is equivalent to checking whether θ₀ falls outside the (1 - α) confidence interval for θ.

Connection to Confidence Intervals: The Wald confidence interval is obtained by inverting the Wald test. If θ̂ ± zα/2 × SE(θ̂) contains θ₀, we fail to reject H₀ at level α; otherwise we reject. This duality is fundamental.
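
The duality between the test decision and the interval can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `wald_test` and the inputs (θ̂ = 0.5, θ₀ = 0, SE = 0.2) are illustrative choices, not part of any library.

```python
from statistics import NormalDist

def wald_test(theta_hat, theta_0, se, alpha=0.05):
    """Wald z statistic, two-sided p-value, and the matching (1 - alpha) interval."""
    std_normal = NormalDist()
    w = (theta_hat - theta_0) / se
    p_value = 2 * (1 - std_normal.cdf(abs(w)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    ci = (theta_hat - z_crit * se, theta_hat + z_crit * se)
    reject = abs(w) > z_crit
    return w, p_value, ci, reject

# Illustrative inputs (assumed for this sketch)
theta_0 = 0.0
w, p_value, ci, reject = wald_test(theta_hat=0.5, theta_0=theta_0, se=0.2)
theta_0_outside_ci = not (ci[0] <= theta_0 <= ci[1])

print(f"W = {w:.3f}, p = {p_value:.4f}, CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"reject H0: {reject}, theta_0 outside CI: {theta_0_outside_ci}")
```

For any inputs, `reject` and `theta_0_outside_ci` agree, because both compare |θ̂ - θ₀| with the same multiple of the standard error.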

Interactive: Wald Test Explorer

Explore how the Wald test works. Adjust the MLE, null hypothesis value, and standard error to see how the test statistic and decision change.

Wald Test Explorer

Wald Statistic Formula

W = (θ̂ - θ₀) / SE(θ̂) = (0.50 - 0.00) / 0.200 = 2.500

Sampling Distribution Under H₀ (Standard Normal)

[Figure: standard normal density on the axis from -4 to 4, with "Reject H₀" regions beyond the critical values ±1.96, a "Fail to Reject H₀" region between them, and the observed W = 2.50 marked.]

Decision at α = 0.05

Critical value: ±1.960

|W| = 2.500 > 1.960

Reject H₀

95% Confidence Interval

θ̂ ± zα/2 · SE(θ̂)

0.500 ± 1.960 × 0.200

[0.108, 0.892]

CI does not contain θ₀ (consistent with rejecting H₀)

Two-sided P-value

p = 0.0124

p < α = 0.05 → Reject H₀

Wald Test Pitfalls:
  • Boundary problems: When θ̂ is near parameter boundaries (e.g., p near 0 or 1), the SE shrinks, potentially giving misleading results
  • Parameterization sensitivity: The Wald test is not invariant to reparameterization. Testing H₀: θ = 1 and testing the equivalent H₀: log θ = 0 can give different statistics and p-values
  • Finite sample bias: The asymptotic SE may poorly approximate the true SE for small samples
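
The parameterization pitfall can be seen with a binomial proportion. The sketch below tests the same hypothesis on the probability scale and on the logit scale, using the delta method for the logit standard error. The data (62 successes in 100 trials) and the function names are assumptions chosen for illustration.

```python
import math

def wald_z_probability_scale(x, n, p0):
    """Wald z for H0: p = p0, with SE = sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - p0) / se

def wald_z_logit_scale(x, n, p0):
    """Wald z for the equivalent H0: logit(p) = logit(p0)."""
    p_hat = x / n
    logit_hat = math.log(p_hat / (1 - p_hat))
    logit_0 = math.log(p0 / (1 - p0))
    # Delta method: Var(logit(p_hat)) is approximately 1 / (n * p_hat * (1 - p_hat))
    se_logit = 1 / math.sqrt(n * p_hat * (1 - p_hat))
    return (logit_hat - logit_0) / se_logit

x, n, p0 = 62, 100, 0.5  # illustrative data
z_prob = wald_z_probability_scale(x, n, p0)
z_logit = wald_z_logit_scale(x, n, p0)
print(f"Wald z on probability scale: {z_prob:.3f}")
print(f"Wald z on logit scale:       {z_logit:.3f}")
```

Both versions test the same null hypothesis with the same data, yet the two z values differ. The LRT would give one answer regardless of the scale.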

The Score (Rao) Test

The Score test takes a fundamentally different approach. Instead of asking how far the MLE is from θ₀, it asks: "If θ₀ were true, how much does the likelihood want to move away from θ₀?"

The Geometric Intuition

The score function is the derivative of the log-likelihood with respect to the parameter:

Score Function and Test Statistic

Score Function:

$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \log L(\theta)$$

Score Test Statistic:

$$S = \frac{U(\theta_0)}{\sqrt{I(\theta_0)}} = \frac{U(\theta_0)}{\sqrt{-E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]_{\theta_0}}}$$

Under H₀, S → N(0,1) in distribution

Why "Score"?

The score function tells us the direction and magnitude the parameter should move to increase the likelihood. At the MLE, U(θ̂) = 0 (the likelihood is maximized).

The Key Insight

If θ₀ is the true value (H₀ is true), then we're already at or near the peak of the likelihood—the slope (score) should be close to zero. A large |U(θ₀)| suggests we're on a steep slope, far from the peak.

Fisher Information I(θ): The denominator uses the Fisher Information, which measures the curvature of the log-likelihood. Higher curvature means the data is more informative about θ. This normalizes the score to create a proper test statistic.

Interactive: Score Test Explorer

Visualize how the Score test works. See the tangent line (score) at the null hypothesis value and how its slope relates to the test decision.

Score (Rao) Test Explorer

The Score Function: Derivative of Log-Likelihood

Score Function

U(μ₀) = ∂ℓ/∂μ |μ=μ₀

= n(x̄ - μ₀)/σ²

Fisher Information

I(μ₀) = n/σ²

= 7.50

Score Statistic

S = U(μ₀)/√I(μ₀)

= 1.369

Log-Likelihood Curve with Score (Tangent Slope at μ₀)

[Figure: log-likelihood ℓ(μ) plotted against the parameter μ, marking the null hypothesis point μ₀ = 2.00, the MLE x̄ = 2.50 at the maximum, and the tangent at μ₀ whose slope is the score, 3.75.]

The Score Test Intuition

The score is the slope of the log-likelihood at μ₀. If the null hypothesis is true, μ₀ should be near the maximum of the likelihood, so the slope (score) should be near zero. A large absolute score indicates that μ₀ is far from the MLE, providing evidence against H₀.

Score near 0: We're near the peak → H₀ consistent with data
Large |Score|: We're on a steep slope → Evidence against H₀
Test Calculations

Score: U(μ₀) = n(x̄ - μ₀)/σ² = 30(2.50 - 2.00)/2.00² = 3.750

Fisher Info: I(μ₀) = n/σ² = 30/2.00² = 7.500

Score Statistic: S = U/√I = 3.750/2.739 = 1.369

Test Decision (α = 0.05)

Score Statistic S² = 1.875

p-value = 0.1709

Fail to reject H₀
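
The calculation above can be reproduced in a few lines. This is a minimal sketch for a normal mean with known σ, using the explorer's values (n = 30, x̄ = 2.50, μ₀ = 2.00, σ = 2.00); the function name is a hypothetical helper, and the finite-difference check is included only to confirm that the score is the slope of the log-likelihood at μ₀.

```python
import math
from statistics import NormalDist

def score_test_normal_mean(x_bar, n, mu_0, sigma):
    """Score test of H0: mu = mu_0 for normal data with known sigma."""
    score = n * (x_bar - mu_0) / sigma**2   # U(mu_0)
    fisher_info = n / sigma**2              # I(mu_0)
    s = score / math.sqrt(fisher_info)
    p_value = 2 * (1 - NormalDist().cdf(abs(s)))
    return score, fisher_info, s, p_value

x_bar, n, mu_0, sigma = 2.5, 30, 2.0, 2.0
score, fisher_info, s, p_value = score_test_normal_mean(x_bar, n, mu_0, sigma)

def log_lik(mu):
    # Log-likelihood in mu, dropping terms that do not depend on mu
    return -n * (x_bar - mu) ** 2 / (2 * sigma**2)

# Central finite difference: numerical slope of the log-likelihood at mu_0
h = 1e-6
numerical_slope = (log_lik(mu_0 + h) - log_lik(mu_0 - h)) / (2 * h)

print(f"U = {score:.3f}, I = {fisher_info:.3f}, S = {s:.3f}, S^2 = {s**2:.3f}")
print(f"p-value = {p_value:.4f}, numerical slope = {numerical_slope:.3f}")
```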

Why Use the Score Test?
  • Only requires fitting the null model: Unlike Wald and LRT, we don't need to find the MLE under H₁
  • Computationally efficient: Useful for screening many candidate parameters in large models, because every test reuses the single null fit
  • Better small-sample behavior: Often closer to its nominal level than Wald in small samples, and invariant to reparameterization when the expected Fisher Information is used
  • Lagrange Multiplier Test: Also known by this name in econometrics
Computational Advantage: The Score test only requires evaluating the likelihood and its derivatives at θ₀. Unlike the Wald test and the LRT, it does not require the MLE under the alternative hypothesis. This makes the Score test computationally efficient for complex models.
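
To illustrate the "null model only" property, here is a sketch of a Score test for the slope in a one-covariate logistic regression. The null model is intercept-only, so its fit is just the sample mean of y and no iterative optimization is needed. The function name and the synthetic data (true intercept 0.2, slope 0.8) are assumptions made for this example.

```python
import math
import random

def score_test_logistic_slope(x, y):
    """Score (Rao) chi-square statistic for H0: slope = 0 in a one-covariate logistic model.

    Only the intercept-only (null) model is needed; its fitted probability is mean(y).
    Assumes y contains both 0s and 1s and x is not constant.
    """
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    # Score for the slope at the null fit: sum of x_i * (y_i - fitted probability)
    score = sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
    # Fisher information for the slope after accounting for the estimated intercept
    info = y_bar * (1 - y_bar) * sum((xi - x_bar) ** 2 for xi in x)
    stat = score**2 / info
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

# Hypothetical synthetic data for illustration
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(0.2 + 0.8 * xi))) else 0 for xi in x]

stat, p_value = score_test_logistic_slope(x, y)
print(f"Score chi-square = {stat:.3f}, p-value = {p_value:.4g}")
```

In this special case the statistic equals n times the squared sample correlation between x and y. The same idea extends to several candidate covariates by replacing the scalar information with a matrix.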

Interactive: Comparing All Three Tests

Now let's see all three tests side by side on the same likelihood curve. This visualization reveals how they approach the same problem from different perspectives.

Wald, Score, and Likelihood Ratio Test Comparison

Log-Likelihood Curve: Visual Comparison of the Three Tests

[Figure: log-likelihood plotted against p on [0, 1], marking p₀ = 0.50 and p̂ = 0.60. The Wald test appears as the horizontal distance at the MLE, the LRT as the vertical drop, and the Score test as the tangent at p₀.]

| Test | Z-statistic | Chi-square | P-value | Evaluated at |
| --- | --- | --- | --- | --- |
| Wald | 1.443 | 2.083 | 0.1489 | MLE (p̂) |
| Score (Rao) | 1.414 | 2.000 | 0.1573 | H₀ (p₀) |
| Likelihood Ratio | 1.419 (signed) | 2.014 | 0.1559 | Both p₀ and p̂ |

Key Observations
  • The three tests are asymptotically equivalent as n grows large
  • For finite samples, LRT is generally most reliable (uses both null and MLE)
  • Wald test can be misleading when p̂ is near 0 or 1 (boundary effects)
  • Score test only requires fitting the null model (efficient for complex models)
  • Notice how the three p-values are nearly identical at this sample size

Asymptotic Equivalence

One of the most beautiful results in mathematical statistics is that all three tests are asymptotically equivalent—they converge to the same distribution and give the same conclusions as sample size grows.

Asymptotic Equivalence Theorem

Under the null hypothesis and suitable regularity conditions, as n → ∞:

$$W^2 \approx S^2 \approx \Lambda \stackrel{d}{\to} \chi^2(k)$$

where k is the number of parameters being tested, W is Wald, S is Score, and Λ is the LRT statistic

The mathematical reason: All three statistics are quadratic approximations to the log-likelihood ratio around its maximum. Using Taylor expansions:

Log-likelihood expansion around MLE:

$$\ell(\theta_0) \approx \ell(\hat{\theta}) - \frac{1}{2}(\hat{\theta} - \theta_0)^2 I(\hat{\theta})$$

This gives us:

$$\Lambda = 2[\ell(\hat{\theta}) - \ell(\theta_0)] \approx (\hat{\theta} - \theta_0)^2 I(\hat{\theta}) \approx W^2$$
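
The same quadratic approximation also ties in the Score statistic. As a sketch, differentiate the expansion with respect to θ, evaluate at θ₀, and assume the information changes little between θ₀ and θ̂, so that I(θ₀) ≈ I(θ̂):

```latex
U(\theta_0) = \left.\frac{\partial \ell}{\partial \theta}\right|_{\theta_0}
  \approx (\hat{\theta} - \theta_0)\, I(\hat{\theta})
\quad\Longrightarrow\quad
S^2 = \frac{U(\theta_0)^2}{I(\theta_0)}
  \approx (\hat{\theta} - \theta_0)^2 I(\hat{\theta})
  \approx W^2 \approx \Lambda
```

The three statistics therefore differ only in where the information is evaluated and in the higher-order terms that the quadratic approximation drops.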

Interactive: Asymptotic Equivalence Demo

Run simulations to see how the three test statistics converge as sample size increases. Watch the scatter plots tighten around the diagonal (perfect agreement) as n grows.


Testing H₀: p = 0.5 vs H₁: p ≠ 0.5
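
A rough version of this simulation can be run offline. The sketch below draws binomial samples under H₀: p = 0.5, computes W², S², and Λ for each, and reports the average spread between the largest and smallest of the three. The replication count, the seed, and the helper names are arbitrary choices for illustration.

```python
import math
import random

def three_statistics(x, n, p0):
    """Return (W^2, S^2, LRT) for H0: p = p0 given x successes in n trials (0 < x < n)."""
    p_hat = x / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)
    lrt = 2 * (x * math.log(p_hat / p0) + (n - x) * math.log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

def mean_spread(n, p0=0.5, reps=300, seed=0):
    """Average of max - min across the three statistics, simulated under H0."""
    rng = random.Random(seed)
    spreads = []
    for _ in range(reps):
        x = sum(rng.random() < p0 for _ in range(n))  # one Binomial(n, p0) draw
        if 0 < x < n:  # skip boundary samples, where the Wald SE is zero
            values = three_statistics(x, n, p0)
            spreads.append(max(values) - min(values))
    return sum(spreads) / len(spreads)

spread_by_n = {n: mean_spread(n) for n in (20, 200, 2000)}
for n, spread in spread_by_n.items():
    print(f"n = {n:4d}: mean spread between W^2, S^2 and LRT = {spread:.4f}")
```

The spread should shrink roughly in proportion to 1/n, which is the numerical counterpart of the scatter plots tightening around the diagonal.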

Finite Sample Ordering

For finite samples, the statistics often follow this ordering:

Score S² ≤ LRT Λ ≤ Wald W²

When the ordering holds, the Wald test is the most likely to reject H₀ (most "liberal"), the Score test is the least likely to reject (most "conservative"), and the LRT sits in between. The ordering is exact for linear restrictions in the normal linear regression model, and it holds in the binomial comparison above with p₀ = 0.5. It is not guaranteed in general: for a binomial proportion with p₀ far from 0.5, the order can reverse.
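
The ordering is easy to probe numerically. The sketch below evaluates the three statistics for a binomial proportion in two hypothetical scenarios; the data values are chosen only to illustrate that the ordering seen at p₀ = 0.5 need not carry over to other null values.

```python
import math

def three_statistics(x, n, p0):
    """Return (W^2, S^2, LRT) for H0: p = p0 given x successes in n trials (0 < x < n)."""
    p_hat = x / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)
    lrt = 2 * (x * math.log(p_hat / p0) + (n - x) * math.log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

# Hypothetical scenarios: (successes, trials, null proportion)
cases = {
    "p0 = 0.5": (30, 50, 0.5),
    "p0 = 0.1": (20, 100, 0.1),
}
results = {label: three_statistics(*args) for label, args in cases.items()}

for label, (wald, score, lrt) in results.items():
    print(f"{label}: W^2 = {wald:.3f}, S^2 = {score:.3f}, LRT = {lrt:.3f}")
```

In the first scenario the statistics satisfy S² ≤ Λ ≤ W². In the second the order is reversed, with the Wald statistic smallest and the Score statistic largest, because the variance estimated at p̂ = 0.2 is larger than the variance implied by p₀ = 0.1.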


Which Test Should You Use?

| Test | Best When | Pros | Cons |
| --- | --- | --- | --- |
| Wald | MLE is easy to compute; want confidence intervals | Intuitive; directly connects to CI; widely reported | Unreliable near boundaries; not invariant to reparameterization |
| Score (Rao) | Model fitting is expensive; testing many parameters | Only needs null model; computationally efficient; good small-sample behavior | Less intuitive; requires derivative computations |
| LRT | Both models can be fit; want reliable small-sample performance | Best overall properties; invariant to reparameterization; uses all information | Requires fitting both models; not always available in closed form |


Worked Examples


Applications in AI/ML

The Wald, Score, and LRT tests have deep connections to machine learning, often appearing in unexpected places:

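Gradient Descent and the Score Function

When a neural network is trained by minimizing a negative log-likelihood loss L, the gradient ∇θL is the negative of the score function. A gradient-norm stopping criterion |∇θL| < ε checks whether the score is close to zero, which is the same quantity a Score test examines. Without the Fisher Information normalization and a reference distribution it is not a formal test, but the question is similar: "Is the gradient small enough that we're near a local optimum?"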

🎯 Natural Gradient and Fisher Information

Natural gradient methods use the Fisher Information matrix to precondition updates: θ ← θ - α·I(θ)⁻¹·∇θL. This is the same normalization used in Score tests! It makes the update direction invariant to smooth reparameterization, at least to first order in the step size.
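
A one-parameter example makes the preconditioning concrete. The sketch below applies one natural-gradient step and one plain-gradient step to a binomial log-likelihood. It is written as ascent on the log-likelihood, which is the same as descent on the negative log-likelihood loss; the data (62 successes in 100 trials), starting point, and step sizes are illustrative assumptions.

```python
def score(p, x, n):
    """Derivative of the binomial log-likelihood with respect to p."""
    return x / p - (n - x) / (1 - p)

def fisher_information(p, n):
    """Expected information about p from n Bernoulli trials."""
    return n / (p * (1 - p))

def natural_gradient_step(p, x, n, step_size=1.0):
    """One ascent step, preconditioned by the inverse Fisher information."""
    return p + step_size * score(p, x, n) / fisher_information(p, n)

def plain_gradient_step(p, x, n, step_size):
    """One ordinary gradient ascent step on the log-likelihood."""
    return p + step_size * score(p, x, n)

x, n = 62, 100   # illustrative data
p_start = 0.5
p_natural = natural_gradient_step(p_start, x, n)
p_plain = plain_gradient_step(p_start, x, n, step_size=1e-3)

print(f"MLE:                        {x / n:.3f}")
print(f"after one natural step:     {p_natural:.3f}")
print(f"after one plain step (1e-3): {p_plain:.3f}")
```

For this model a natural-gradient step with step size 1 lands exactly on the MLE x/n from any interior starting point, while the plain step depends heavily on the chosen step size. That is a special property of this simple model, not a general guarantee.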

📊 Feature Selection via Wald Tests

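scikit-learn's LogisticRegression does not report coefficient p-values. Packages such as statsmodels do, and the z statistics and p-values in their logistic regression output are Wald tests. Coefficients with |W| > 1.96 (equivalently, p < 0.05) are conventionally called significant at the 5% level. Per-coefficient tests of this kind are commonly used in stepwise selection.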

🧠 Model Comparison and Nested Models

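When comparing nested models (e.g., a network and a restricted version of it), the LRT framework applies in principle, although the regularity conditions behind the χ² approximation often fail for neural networks. AIC and BIC are penalized versions of the maximized log-likelihood, so the difference in AIC or BIC between two models is a penalized log-likelihood ratio.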

Pearson's χ² is a Score Test: The familiar Σ(O-E)²/E statistic for contingency tables is actually a Score test for multinomial data! This connection is rarely taught but deeply unifies categorical data analysis.
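
The identity is easy to verify in the two-category case. This sketch computes Pearson's Σ(O-E)²/E and the squared binomial Score statistic from the same data and compares them; the helper names and the data (62 successes in 100 trials, p₀ = 0.5) are illustrative assumptions.

```python
def pearson_chi2(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def score_chi2_binomial(x, n, p0):
    """Squared Score statistic U(p0)^2 / I(p0) for a binomial proportion."""
    score = x / p0 - (n - x) / (1 - p0)
    fisher_info = n / (p0 * (1 - p0))
    return score**2 / fisher_info

x, n, p0 = 62, 100, 0.5  # illustrative data
pearson = pearson_chi2([x, n - x], [n * p0, n * (1 - p0)])
score_stat = score_chi2_binomial(x, n, p0)

print(f"Pearson chi-square: {pearson:.4f}")
print(f"Score chi-square:   {score_stat:.4f}")
```

Both expressions reduce algebraically to (x - n·p₀)² / (n·p₀·(1 - p₀)), so they agree for any p₀ strictly between 0 and 1.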

Python Implementation

Wald, Score, and LRT Implementation

Notes on the Python code that follows:

  • Imports - numpy for numerical operations, scipy.stats for distributions and tests, and statsmodels for regression analysis.
  • Function Definition - The docstring explains the three tests we'll compute. Each test answers the same question (is p = p0?) but uses different information from the likelihood.
  • Wald Test - Evaluate the standard error at the MLE (p̂). The test statistic measures how many standard errors the MLE is from the null value. The SE formula uses p̂, not p0!
  • Score Test - Compute the score (gradient of log-likelihood) and Fisher Information at the NULL value p0. The test asks: 'if H0 is true, how steep is the slope toward the MLE?'
  • LRT - Compare the log-likelihood at the MLE vs at the null. The test statistic 2(l̂ - l0) is asymptotically chi-square under H0. This is the most direct measure of evidence.
  • Example Usage - All three tests give similar conclusions for this moderate sample size. In this example (p0 = 0.5) the Wald z is the largest and the Score z is the smallest.
  • Regression Example - Logistic regression example. Statsmodels automatically computes Wald tests for each coefficient; that's what the 'z' column in the regression output represents.
  • Model Comparison - LRT for model comparison: testing whether the last 2 coefficients are jointly zero. Fit both models and compare log-likelihoods. This is how we test nested models.

import numpy as np
from scipy import stats
from scipy.special import xlogy
import statsmodels.api as sm

# =============================================
# Example: Binomial proportion test
# =============================================

def binomial_tests(x: int, n: int, p0: float = 0.5):
    """
    Compute Wald, Score, and LRT for H0: p = p0 vs H1: p != p0.

    Parameters
    ----------
    x : int
        Number of successes
    n : int
        Total trials
    p0 : float
        Null hypothesis proportion, strictly between 0 and 1

    Returns
    -------
    dict with all three test statistics and p-values
    """
    p_hat = x / n  # MLE

    # --- WALD TEST ---
    # Uses SE evaluated at MLE
    se_wald = np.sqrt(p_hat * (1 - p_hat) / n)
    if se_wald > 0:
        wald_z = (p_hat - p0) / se_wald
    else:
        # p_hat is 0 or 1: the estimated SE is 0, so the Wald statistic is undefined
        wald_z = np.nan
    wald_p = 2 * stats.norm.sf(abs(wald_z))

    # --- SCORE TEST ---
    # Uses Fisher Information at null hypothesis value
    score = x / p0 - (n - x) / (1 - p0)  # Derivative of log-L at p0
    fisher_info = n / (p0 * (1 - p0))     # Fisher Information at p0
    score_z = score / np.sqrt(fisher_info)
    score_p = 2 * stats.norm.sf(abs(score_z))

    # --- LIKELIHOOD RATIO TEST ---
    # Compares log-likelihoods at MLE and null
    def log_lik(p):
        # xlogy(a, b) = a * log(b) with the convention 0 * log(0) = 0,
        # so the log-likelihood stays finite when p_hat is 0 or 1
        return xlogy(x, p) + xlogy(n - x, 1 - p)

    ll_mle = log_lik(p_hat)
    ll_null = log_lik(p0)
    lrt_stat = 2 * (ll_mle - ll_null)
    lrt_p = stats.chi2.sf(lrt_stat, df=1)

    return {
        'mle': p_hat,
        'wald': {'z': wald_z, 'chi2': wald_z**2, 'p_value': wald_p},
        'score': {'z': score_z, 'chi2': score_z**2, 'p_value': score_p},
        'lrt': {'statistic': lrt_stat, 'p_value': lrt_p}
    }

# Example usage
result = binomial_tests(x=62, n=100, p0=0.5)
print(f"MLE: p̂ = {result['mle']:.4f}")
print(f"Wald:  z = {result['wald']['z']:.3f}, p = {result['wald']['p_value']:.4f}")
print(f"Score: z = {result['score']['z']:.3f}, p = {result['score']['p_value']:.4f}")
print(f"LRT:   Λ = {result['lrt']['statistic']:.3f}, p = {result['lrt']['p_value']:.4f}")


# =============================================
# Using statsmodels for regression
# =============================================

def regression_wald_tests():
    """Demonstrate Wald tests in logistic regression."""
    # Generate synthetic data
    np.random.seed(42)
    n = 1000
    X = np.random.randn(n, 3)
    X = sm.add_constant(X)
    true_beta = np.array([0.5, -1.0, 0.0, 0.5])  # Note: X2 has no effect
    prob = 1 / (1 + np.exp(-X @ true_beta))
    y = np.random.binomial(1, prob)

    # Fit logistic regression
    model = sm.Logit(y, X)
    result = model.fit(disp=0)

    # Extract Wald statistics (this is what statsmodels reports)
    print("\n=== Logistic Regression Wald Tests ===")
    print(result.summary2().tables[1])  # Coefficient table

    # The z-column is the Wald statistic: z = coef / SE
    # The P>|z| column is the two-sided p-value

    # Manual verification for the X1 coefficient (index 0 is the constant):
    coef = result.params[1]
    se = result.bse[1]
    z = coef / se
    p = 2 * stats.norm.sf(abs(z))
    print("\nManual check for X1:")
    print(f"  Coefficient: {coef:.4f}")
    print(f"  SE: {se:.4f}")
    print(f"  Wald z: {z:.4f}")
    print(f"  p-value: {p:.4f}")

    return result

# Run regression example
reg_result = regression_wald_tests()


# =============================================
# Comparing models with LRT (nested models)
# =============================================

def model_comparison_lrt():
    """Compare nested models using LRT."""
    np.random.seed(42)
    n = 500
    X = np.random.randn(n, 4)
    true_beta = np.array([1.0, -0.5, 0.0, 0.0])  # Only first 2 are nonzero
    y = X @ true_beta + np.random.randn(n)

    # Full model (all 4 predictors)
    X_full = sm.add_constant(X)
    model_full = sm.OLS(y, X_full).fit()

    # Reduced model (first 2 predictors only)
    X_reduced = sm.add_constant(X[:, :2])
    model_reduced = sm.OLS(y, X_reduced).fit()

    # LRT: Are the last 2 coefficients jointly zero?
    ll_full = model_full.llf
    ll_reduced = model_reduced.llf
    lrt_stat = 2 * (ll_full - ll_reduced)
    df = 2  # Number of restrictions (2 coefficients set to 0)
    p_value = stats.chi2.sf(lrt_stat, df)

    print("\n=== LRT for Nested Models ===")
    print(f"Full model log-lik: {ll_full:.2f}")
    print(f"Reduced model log-lik: {ll_reduced:.2f}")
    print(f"LRT statistic: {lrt_stat:.4f}")
    print(f"df: {df}")
    print(f"p-value: {p_value:.4f}")

    if p_value > 0.05:
        print("Conclusion: Fail to reject H0. Reduced model is adequate.")
    else:
        print("Conclusion: Reject H0. Full model is significantly better.")

    return model_full, model_reduced

model_comparison_lrt()


Summary

Key Takeaways

  1. Three tests, one truth: Wald, Score, and LRT are three different approaches to the same hypothesis testing problem. They are asymptotically equivalent but can differ for finite samples.
  2. Wald test: Measures (θ̂ - θ₀)/SE(θ̂). Evaluated at the MLE. Most intuitive but can be unreliable near parameter boundaries.
  3. Score test: Measures U(θ₀)/√I(θ₀). Evaluated at the null hypothesis. Computationally efficient—only requires fitting the null model.
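  4. Finite sample ordering: Often S² ≤ Λ ≤ W², in which case Wald is the most liberal (rejects most often) and Score is the most conservative. The ordering is exact for linear restrictions in normal linear regression but is not guaranteed in general.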
  5. Asymptotic equivalence: As n → ∞, all three converge to χ²(k) under H₀. This deep result connects them as different views of the same likelihood curvature.
  6. Connections to ML: Gradient descent uses the score function, natural gradient normalizes by Fisher Information, and Wald tests determine coefficient significance.

Quick Reference

| Test | Formula | Evaluated At | Uses |
| --- | --- | --- | --- |
| Wald (W) | (θ̂ - θ₀)/SE(θ̂) | MLE | SE from Fisher Info at MLE |
| Score (S) | U(θ₀)/√I(θ₀) | Null value | Score and Fisher Info at θ₀ |
| LRT (Λ) | 2[ℓ(θ̂) - ℓ(θ₀)] | Both | Log-likelihoods at both points |

W², S², and Λ all have a χ²(k) distribution under H₀ as n → ∞

Looking Ahead: In the next section, we'll explore Permutation Tests, a non-parametric approach that does not assume a particular distribution for the data. While Wald, Score, and LRT rely on asymptotic theory, permutation tests can provide exact inference at any sample size when the observations are exchangeable under H₀.