Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand the Wald test and its connection to confidence intervals
- Derive and interpret the Score (Rao) test statistic
- Explain why all three large-sample tests are asymptotically equivalent
- Recognize when each test is most appropriate
- Connect Fisher Information to test construction
🔧 Practical Skills
- Compute Wald, Score, and LRT statistics from data
- Visualize the three tests on the likelihood surface
- Choose the appropriate test for different scenarios
- Implement these tests in Python/statsmodels
🧠 AI/ML Applications
- Gradient-Based Optimization: the score function is the gradient of the log-likelihood, the quantity that gradient-based optimizers follow
- Neural Network Training: gradient-norm stopping criteria resemble an unstandardized Score statistic
- Feature Significance: Wald tests assess which coefficients differ significantly from zero
- Model Comparison: all three tests are used in nested model selection
- Natural Gradient Methods: Fisher Information normalization connects to Score tests
Central Message: The Wald, Score, and Likelihood Ratio tests are three approaches to the same fundamental question—whether data is consistent with a null hypothesis. Understanding when and why they differ is essential for rigorous statistical inference and underlies key concepts in modern machine learning.
The Big Picture: Three Paths to the Same Truth
In the previous sections, we encountered the Likelihood Ratio Test (LRT). Now we explore two complementary approaches—the Wald test and the Score test—that provide different perspectives on the same underlying question.
The Mountain Analogy
Imagine the log-likelihood function as a mountain landscape, with the peak at the MLE. Each test asks a different question about where the null hypothesis sits relative to the peak:
Wald Test
"Standing at the peak (MLE), how far away is θ₀?"
Score Test
"Standing at θ₀, how steep is the slope toward the peak?"
LRT
"How much higher is the peak than where we're standing at θ₀?"
Historical Context: The Wald test was developed by Abraham Wald in the 1940s, building on earlier work by Ronald Fisher. The Score test was introduced by C. R. Rao in 1948 and is often called the Rao test; in econometrics the same test is known as the Lagrange Multiplier (LM) test. Together with Neyman and Pearson's likelihood ratio approach, these form the "Holy Trinity" of large-sample hypothesis testing.
The Wald Test
The Wald test is perhaps the most intuitive of the three tests. It directly measures how far the Maximum Likelihood Estimate (MLE) is from the null hypothesis value, standardized by the estimated standard error.
Mathematical Formulation
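Wald Test Statistic
$$W = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})}$$
Under H₀, as n → ∞:
$$W \xrightarrow{d} N(0, 1)$$
Equivalently, the squared Wald statistic follows a chi-square distribution with one degree of freedom:
$$W^2 = \frac{(\hat{\theta} - \theta_0)^2}{\mathrm{SE}(\hat{\theta})^2} \xrightarrow{d} \chi^2_1$$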
| Symbol | Meaning | Interpretation |
|---|---|---|
| θ̂ | Maximum Likelihood Estimate | Best guess for the true parameter |
| θ₀ | Null hypothesis value | The value we're testing against |
| SE(θ̂) | Standard error of MLE | Uncertainty in our estimate (from Fisher Info) |
| W | Wald statistic | Number of SEs the MLE is from θ₀ |
Key insight: The Wald test is the natural extension of confidence intervals. Rejecting H₀: θ = θ₀ at level α is equivalent to checking whether θ₀ falls outside the (1 - α) confidence interval for θ.
Worked Example: Wald Test
Take θ̂ = 0.500 and SE(θ̂) = 0.200, and test H₀: θ = θ₀ with θ₀ = 0 (the null value implied by the reported statistic).
Wald Statistic
W = (θ̂ - θ₀)/SE(θ̂) = (0.500 - 0)/0.200 = 2.500
Sampling distribution under H₀: standard normal
Decision at α = 0.05
Critical value: ±1.960
|W| = 2.500 > 1.960
Reject H₀
95% Confidence Interval
θ̂ ± zα/2 · SE(θ̂)
0.500 ± 1.960 × 0.200
[0.108, 0.892]
CI does not contain θ₀ (consistent with rejecting H₀)
Two-sided P-value
p = 0.0124
p < α = 0.05 → Reject H₀
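The Wald calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming SciPy is available; the function name `wald_test` and the null value θ₀ = 0 are assumptions chosen to match W = 2.500 with θ̂ = 0.500 and SE = 0.200.

```python
from scipy import stats

def wald_test(theta_hat, theta_0, se, alpha=0.05):
    """Two-sided Wald test and the matching Wald confidence interval."""
    w = (theta_hat - theta_0) / se                # number of SEs between MLE and null
    z_crit = stats.norm.ppf(1 - alpha / 2)        # two-sided critical value
    p_value = 2 * stats.norm.sf(abs(w))           # two-sided p-value
    ci = (theta_hat - z_crit * se, theta_hat + z_crit * se)
    return {"w": w, "z_crit": z_crit, "p_value": p_value,
            "ci": ci, "reject": abs(w) > z_crit}

# Hypothetical inputs: theta_0 = 0 is inferred, not stated in the text
result = wald_test(theta_hat=0.5, theta_0=0.0, se=0.2)
print(result)
```

Rejecting H₀ and finding θ₀ outside the interval are the same event here, because both compare |θ̂ - θ₀| with z·SE.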
Limitations of the Wald test:
- Boundary problems: When θ̂ is near a parameter boundary (e.g., p̂ near 0 or 1), the estimated SE shrinks toward zero, potentially giving misleading results
- Parameterization sensitivity: The Wald test is not invariant to reparameterization. Testing θ = 1 and testing log θ = 0 state the same hypothesis but can give different statistics
- Finite sample bias: The asymptotic SE may poorly approximate the true SE for small samples
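The parameterization sensitivity can be seen numerically. The sketch below uses hypothetical binomial data (15 successes in 20 trials) and tests the same hypothesis twice: once as p = 0.5 and once as logit(p) = 0, with the logit-scale standard error taken from the delta method.

```python
import numpy as np
from scipy import stats

n, x = 20, 15            # hypothetical data: 15 successes out of 20
p_hat = x / n
p0 = 0.5

# Wald statistic on the probability scale
se_p = np.sqrt(p_hat * (1 - p_hat) / n)
w_p = (p_hat - p0) / se_p

# Wald statistic on the logit scale (delta-method SE: 1 / sqrt(n p(1-p)))
def logit(p):
    return np.log(p / (1 - p))

se_logit = 1 / np.sqrt(n * p_hat * (1 - p_hat))
w_logit = (logit(p_hat) - logit(p0)) / se_logit

for name, w in [("p scale", w_p), ("logit scale", w_logit)]:
    print(f"{name}: W = {w:.3f}, p-value = {2 * stats.norm.sf(abs(w)):.4f}")
```

The two statistics differ even though the hypotheses are equivalent, which is the non-invariance described above.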
The Score (Rao) Test
The Score test takes a fundamentally different approach. Instead of asking how far the MLE is from θ₀, it asks: "If θ₀ were true, how much does the likelihood want to move away from θ₀?"
The Geometric Intuition
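The score function is the derivative of the log-likelihood with respect to the parameter:
Score Function and Test Statistic
Score Function:
$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta}$$
Score Test Statistic:
$$S = \frac{U(\theta_0)}{\sqrt{I(\theta_0)}}$$
where I(θ₀) is the Fisher Information of the full sample evaluated at θ₀.
Under H₀, as n → ∞:
$$S \xrightarrow{d} N(0, 1), \qquad S^2 \xrightarrow{d} \chi^2_1$$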
Why "Score"?
The score function tells us the direction and magnitude the parameter should move to increase the likelihood. At the MLE, U(θ̂) = 0 (the likelihood is maximized).
The Key Insight
If θ₀ is the true value (H₀ is true), then we're already at or near the peak of the likelihood—the slope (score) should be close to zero. A large |U(θ₀)| suggests we're on a steep slope, far from the peak.
Worked Example: Score Test for a Normal Mean
Setup: n = 30 observations with sample mean x̄ = 2.50 and known σ = 2.00, testing H₀: μ = μ₀ with μ₀ = 2.00.
Figure: log-likelihood curve with the score shown as the tangent slope at μ₀.
Score Function
U(μ₀) = ∂ℓ/∂μ |μ=μ₀ = n(x̄ - μ₀)/σ² = 30(2.50 - 2.00)/2.00² = 3.750
Fisher Information
I(μ₀) = n/σ² = 30/2.00² = 7.500
Score Statistic
S = U(μ₀)/√I(μ₀) = 3.750/2.739 = 1.369
The Score Test Intuition
The score is the slope of the log-likelihood at μ₀. If the null hypothesis is true, μ₀ should be near the maximum of the likelihood, so the slope (score) should be near zero. A large absolute score indicates that μ₀ is far from the MLE, providing evidence against H₀.
Test Decision (α = 0.05)
S² = 1.875
p-value = 0.1709
Fail to reject H₀
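The same calculation can be reproduced in code. This is a minimal sketch of the normal-mean Score test with known σ, using the numbers from the example (n = 30, x̄ = 2.50, μ₀ = 2.00, σ = 2.00); the function name `score_test_normal_mean` is an illustrative choice.

```python
import numpy as np
from scipy import stats

def score_test_normal_mean(xbar, mu0, sigma, n):
    """Score test of H0: mu = mu0 for a normal mean with known sigma."""
    u = n * (xbar - mu0) / sigma**2      # score: slope of log-likelihood at mu0
    info = n / sigma**2                  # Fisher Information of the full sample
    s = u / np.sqrt(info)                # standardized score statistic
    p_value = 2 * stats.norm.sf(abs(s))  # two-sided p-value
    return u, info, s, p_value

u, info, s, p_value = score_test_normal_mean(xbar=2.5, mu0=2.0, sigma=2.0, n=30)
print(f"U = {u:.3f}, I = {info:.3f}, S = {s:.3f}, S^2 = {s**2:.3f}, p = {p_value:.4f}")
```

Nothing in the function depends on the MLE beyond x̄ entering the score, which is why the Score test only needs quantities evaluated at the null value.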
Why Use the Score Test?
- Only requires fitting the null model: Unlike Wald and LRT, we don't need to find the unrestricted MLE (the fit under H₁)
- Computationally efficient: Useful when refitting is expensive, for example when screening many candidate parameters in a large model without refitting for each one
- Better small-sample behavior: Often better calibrated than Wald in small samples, and invariant to reparameterization when computed with the expected Fisher Information
- Lagrange Multiplier Test: Also known by this name in econometrics
Comparing All Three Tests
Placing all three tests side by side on the same likelihood curve shows how they approach the same problem from different perspectives.
Figure: log-likelihood curve comparing the Wald test, the Score (Rao) test, and the Likelihood Ratio Test.
Key Observations
- The three tests are asymptotically equivalent as n grows large
- For finite samples, LRT is generally most reliable (uses both null and MLE)
- Wald test can be misleading when p̂ is near 0 or 1 (boundary effects)
- Score test only requires fitting the null model (efficient for complex models)
- At moderate to large sample sizes, the three p-values are typically nearly identical
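To make the comparison concrete, the sketch below computes all three statistics for a binomial proportion. The data (60 successes in 100 trials, H₀: p = 0.5) are hypothetical, and the function name `binomial_three_tests` is illustrative.

```python
import numpy as np
from scipy import stats

def binomial_three_tests(x, n, p0):
    """Chi-square forms of the Wald, Score and LR statistics for H0: p = p0."""
    p_hat = x / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)   # variance at the MLE
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)        # variance at the null
    lrt = 2 * (x * np.log(p_hat / p0)
               + (n - x) * np.log((1 - p_hat) / (1 - p0)))  # 2[l(p_hat) - l(p0)]
    return {"wald": wald, "score": score, "lrt": lrt}

# Hypothetical data; requires 0 < x < n so the logs are defined
tests = binomial_three_tests(x=60, n=100, p0=0.5)
for name, value in tests.items():
    print(f"{name:5s}: statistic = {value:.4f}, p-value = {stats.chi2.sf(value, df=1):.4f}")
```

The only difference between the Wald and Score statistics here is where the variance p(1 - p)/n is evaluated: at p̂ for Wald and at p₀ for Score.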
Asymptotic Equivalence
One of the most beautiful results in mathematical statistics is that all three tests are asymptotically equivalent—they converge to the same distribution and give the same conclusions as sample size grows.
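Asymptotic Equivalence Theorem
Under the null hypothesis and suitable regularity conditions, as n → ∞:
$$W^2,\; S^2,\; \Lambda \;\xrightarrow{d}\; \chi^2_k$$
where k is the number of parameters being tested, W² is the squared Wald statistic, S² is the squared Score statistic, and Λ is the LRT statistic. The pairwise differences between the three statistics converge to zero in probability.
The mathematical reason: The Wald and Score statistics are quadratic approximations to the log-likelihood ratio around its maximum. Using Taylor expansions:
Log-likelihood expansion around MLE:
$$\ell(\theta_0) \approx \ell(\hat{\theta}) + (\theta_0 - \hat{\theta})\,U(\hat{\theta}) - \tfrac{1}{2}(\theta_0 - \hat{\theta})^2 I(\hat{\theta}) = \ell(\hat{\theta}) - \tfrac{1}{2}(\hat{\theta} - \theta_0)^2 I(\hat{\theta})$$
using U(θ̂) = 0 and -ℓ''(θ̂) ≈ I(θ̂).
This gives us, with SE(θ̂) = 1/√I(θ̂) and I(θ₀) ≈ I(θ̂) under H₀:
$$\Lambda = 2[\ell(\hat{\theta}) - \ell(\theta_0)] \approx (\hat{\theta} - \theta_0)^2 I(\hat{\theta}) = W^2$$
$$U(\theta_0) \approx (\hat{\theta} - \theta_0)\, I(\hat{\theta}) \;\Rightarrow\; S^2 = \frac{U(\theta_0)^2}{I(\theta_0)} \approx (\hat{\theta} - \theta_0)^2 I(\hat{\theta}) = W^2$$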
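As a check on the quadratic approximation, consider the normal-mean model with known σ from the Score test example. Its log-likelihood is exactly quadratic in μ, so the approximations become equalities and the three statistics coincide. The derivation below is a worked sketch under that assumption (known variance, MLE equal to x̄).

$$
\begin{aligned}
\ell(\mu) &= \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \hat{\mu} = \bar{x}, \qquad I(\mu) = \frac{n}{\sigma^2} \\
\Lambda &= 2[\ell(\bar{x}) - \ell(\mu_0)] = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2} \\
W^2 &= \frac{(\bar{x} - \mu_0)^2}{\sigma^2 / n} = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2} \\
S^2 &= \frac{U(\mu_0)^2}{I(\mu_0)} = \frac{\left[n(\bar{x} - \mu_0)/\sigma^2\right]^2}{n/\sigma^2} = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2}
\end{aligned}
$$

With n = 30, x̄ = 2.50, μ₀ = 2.00 and σ = 2.00, each expression equals 30 × 0.25 / 4 = 1.875.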
Simulation: Asymptotic Equivalence
Simulating data under H₀ at increasing sample sizes shows the three test statistics converging: scatter plots of one statistic against another tighten around the diagonal (perfect agreement) as n grows.
Setup: testing H₀: p = 0.5 vs H₁: p ≠ 0.5. All three statistics converge to χ²(1) under H₀.
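A simulation along these lines can be sketched as follows. It draws binomial counts under H₀: p = 0.5 and reports the average absolute gap between the statistics at several sample sizes. The sample sizes, replication count, seed, and the helper name `three_stats` are illustrative assumptions.

```python
import numpy as np
from scipy.special import xlogy

def three_stats(x, n, p0):
    """Vectorised W^2, S^2 and Lambda for binomial counts x out of n."""
    x = np.asarray(x, dtype=float)
    p_hat = x / n
    with np.errstate(divide="ignore", invalid="ignore"):
        wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))  # inf if p_hat is 0 or 1
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
    # xlogy(a, b) = a * log(b) with the convention 0 * log(0) = 0
    lrt = 2 * (xlogy(x, p_hat / p0) + xlogy(n - x, (1 - p_hat) / (1 - p0)))
    return wald, score, lrt

rng = np.random.default_rng(42)
p0 = 0.5
gaps = {}
for n in (20, 200, 2000):
    x = rng.binomial(n, p0, size=5000)          # data generated under H0
    wald, score, lrt = three_stats(x, n, p0)
    ok = np.isfinite(wald)                       # drop boundary samples, if any
    gaps[n] = float(np.mean(np.abs(wald[ok] - score[ok])))
    lrt_gap = float(np.mean(np.abs(lrt[ok] - score[ok])))
    print(f"n = {n:5d}: mean |W^2 - S^2| = {gaps[n]:.4f}, mean |LRT - S^2| = {lrt_gap:.4f}")
```

The gaps shrink as n increases, which is the numerical counterpart of the scatter plots tightening around the diagonal.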
Finite Sample Ordering
For finite samples, the statistics often follow this ordering:
$$S^2 \le \Lambda \le W^2$$
When this ordering holds, the Wald test is the most likely to reject H₀ (most "liberal"), the Score test is the least likely to reject (most "conservative"), and the LRT sits in between. The ordering is exact for linear hypotheses in the normal linear regression model; in other models it is a common pattern rather than a guarantee. For a binomial proportion, S² ≤ W² whenever p₀ = 0.5, but with p̂ = 0.2 and p₀ = 0.1 the Score statistic (n/9) exceeds the Wald statistic (n/16).
Which Test Should You Use?
| Test | Best When | Pros | Cons |
|---|---|---|---|
| Wald | MLE is easy to compute; want confidence intervals | Intuitive; directly connects to CI; widely reported | Unreliable near boundaries; not invariant to reparameterization |
| Score (Rao) | Model fitting is expensive; testing many parameters | Only needs null model; computationally efficient; good small-sample behavior | Less intuitive; requires derivative computations |
| LRT | Both models can be fit; want reliable small-sample performance | Best overall properties; invariant to reparameterization; uses all information | Requires fitting both models; not always available in closed form |
Worked Examples
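One possible worked example, sketched here with hypothetical data, is a test of a Poisson rate. Suppose n = 50 counts sum to 180 (so λ̂ = 3.6) and the null value is λ₀ = 3. The Fisher Information for the sample is n/λ, which gives the closed forms used below; the numbers are illustrative and do not come from a real dataset.

```python
import numpy as np
from scipy import stats

n, total = 50, 180        # hypothetical: 50 observations summing to 180
lam0 = 3.0                # null value
lam_hat = total / n       # MLE of the Poisson rate

# Wald: squared distance, variance evaluated at the MLE (lam_hat / n)
wald = n * (lam_hat - lam0) ** 2 / lam_hat
# Score: U(lam0)^2 / I(lam0), with U = n(lam_hat - lam0)/lam0 and I = n/lam0
score = n * (lam_hat - lam0) ** 2 / lam0
# LRT: 2[l(lam_hat) - l(lam0)]
lrt = 2 * n * (lam_hat * np.log(lam_hat / lam0) - (lam_hat - lam0))

for name, value in [("Wald", wald), ("Score", score), ("LRT", lrt)]:
    print(f"{name:5s}: statistic = {value:.4f}, p-value = {stats.chi2.sf(value, df=1):.4f}")
```

All three statistics are compared with the χ²(1) distribution; they lead to the same decision at α = 0.05 in this sketch but are not numerically identical.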
Applications in AI/ML
The Wald, Score, and LRT tests have deep connections to machine learning, often appearing in unexpected places:
⚡ Gradient Descent is a Score Test
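When a network is trained by minimizing a negative log-likelihood loss L (e.g., cross-entropy), the gradient ∇θL is the negative of the score function. A stopping criterion such as |∇θL| < ε checks whether the score is close to zero, which is the same quantity a Score test examines, although without the Fisher Information standardization it is not a formal hypothesis test. The question being asked is: "Is the gradient small enough that we're near a local optimum?"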
🎯 Natural Gradient and Fisher Information
Natural gradient methods use the Fisher Information matrix to precondition updates: θ ← θ - α·I(θ)⁻¹·∇θL. This is the same normalization used in Score tests, and it makes the update direction invariant to parameterization (exactly so in the limit of small step sizes).
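The update rule can be illustrated on logistic regression, where the score and Fisher Information have closed forms. The sketch below uses synthetic data with hypothetical coefficients and a step size of α = 1, in which case the natural gradient update coincides with Fisher scoring; it is an illustration of the idea, not a recipe for neural network training.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one feature
beta_true = np.array([-0.5, 1.0])                        # hypothetical coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def score(beta):
    """Score vector: gradient of the log-likelihood, X^T (y - p)."""
    p = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (y - p)

def fisher_info(beta):
    """Fisher Information matrix: X^T diag(p(1-p)) X."""
    p = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (X * (p * (1 - p))[:, None])

beta = np.zeros(2)
alpha = 1.0
for _ in range(25):
    grad_nll = -score(beta)                               # gradient of the loss
    beta = beta - alpha * np.linalg.solve(fisher_info(beta), grad_nll)

score_norm = float(np.linalg.norm(score(beta)))
print("estimate:", beta, "| score norm at estimate:", score_norm)
```

At convergence the score is numerically zero, which is the defining property of the MLE mentioned earlier.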
📊 Feature Selection via Wald Tests
scikit-learn's LogisticRegression does not report coefficient p-values; in statsmodels' Logit, the coefficient p-values come from Wald tests. Features with |W| > 1.96 (or equivalently, p < 0.05) are considered significant predictors at the 5% level. Stepwise selection procedures are commonly built on tests of this kind.
🧠 Model Comparison and Nested Models
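When comparing nested models (e.g., a network with and without an extra block of parameters), the LRT framework applies in principle, although the regularity conditions behind the χ² approximation often fail for heavily overparameterized networks. AIC and BIC are penalized log-likelihoods: the difference in AIC or BIC between two nested models is the log-likelihood ratio statistic adjusted by a complexity penalty, which connects them to model selection.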
Python Implementation
Summary
Key Takeaways
- Three tests, one truth: Wald, Score, and LRT are three different approaches to the same hypothesis testing problem. They are asymptotically equivalent but can differ for finite samples.
- Wald test: Measures (θ̂ - θ₀)/SE(θ̂). Evaluated at the MLE. Most intuitive but can be unreliable near parameter boundaries.
- Score test: Measures U(θ₀)/√I(θ₀). Evaluated at the null hypothesis. Computationally efficient—only requires fitting the null model.
- Finite sample ordering: Often S² ≤ Λ ≤ W², in which case Wald is most liberal (rejects most often) and Score is most conservative. The ordering is exact in the normal linear regression model but is not guaranteed in general.
- Asymptotic equivalence: As n → ∞, all three converge to χ²(k) under H₀. This deep result connects them as different views of the same likelihood curvature.
- Connections to ML: Gradient descent uses the score function, natural gradient normalizes by Fisher Information, and Wald tests determine coefficient significance.
Quick Reference
| Test | Formula | Evaluated At | Uses |
|---|---|---|---|
| Wald (W) | (θ̂ - θ₀)/SE(θ̂) | MLE | SE from Fisher Info at MLE |
| Score (S) | U(θ₀)/√I(θ₀) | Null value | Score and Fisher Info at θ₀ |
| LRT (Λ) | 2[ℓ(θ̂) - ℓ(θ₀)] | Both | Log-likelihoods at both points |
W², S², and Λ each have a χ²(k) distribution under H₀ as n → ∞; for a single parameter, W and S themselves are asymptotically N(0, 1)
Looking Ahead: In the next section, we'll explore Permutation Tests—a powerful non-parametric approach that makes no distributional assumptions. While Wald, Score, and LRT rely on asymptotic theory, permutation tests provide exact inference for any sample size.