Chapter 16

A/B Testing for ML Applications

Multiple Testing and Modern Issues

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Design and analyze A/B tests for ML model comparison
  • Calculate required sample sizes for desired power
  • Understand the peeking problem and its consequences
  • Distinguish frequentist vs Bayesian A/B testing approaches
  • Compare A/B testing with multi-armed bandit strategies

🔧 Practical Skills

  • Implement A/B tests for model deployment decisions
  • Choose appropriate metrics for ML experiments
  • Handle multiple metrics and guardrail metrics
  • Avoid common pitfalls in online experimentation

🧠 Real-World Applications

  • Model Deployment - Validate that new ML models improve key business metrics before full rollout
  • Recommendation Systems - Test different ranking algorithms at companies like Netflix, Spotify, Amazon
  • Search Engines - Google runs 10,000+ A/B tests per year to optimize search quality
  • Personalization - Validate that personalized experiences outperform generic ones
Central Message: A/B testing is the gold standard for making causal claims about whether a new ML model or feature improves business outcomes. Understanding its statistical foundations is essential for any ML engineer deploying models in production.

The Big Picture: Why A/B Testing Matters for ML

You've built a new recommendation model. Offline metrics (AUC, precision, recall) look great on your holdout set. But here's the uncomfortable truth: offline metrics often don't translate to real-world improvements. Users behave differently with live systems. Your model might have higher precision but introduce latency that frustrates users. It might optimize for clicks but reduce long-term engagement.

The Fundamental Question

"Does my new model actually improve business outcomes when deployed to real users, or could the apparent improvement be due to chance?"

A/B testing (also called online controlled experiments or randomized controlled trials) provides the only rigorous way to answer this question. By randomly assigning users to either the current model (control) or the new model (treatment), we can make causal inferences about the effect of our changes.

From Fisher to Silicon Valley


A Brief History of A/B Testing

1920s-1930s: Ronald Fisher develops randomized controlled experiments for agricultural research. His work on hypothesis testing laid the statistical foundations we still use today.

1950s-1990s: Clinical trials adopt these methods rigorously. Pharmaceutical companies use randomized controlled trials to test drug efficacy.

2000: Google engineers run the company's first A/B test, on how many search results to show per page, launching the modern era of web experimentation. Today, tech companies run thousands of simultaneous experiments.

"At Google, experimentation is practically a religion." - Diane Tang, Google

| Company | Experiments/Year | Notable Use Case |
|---|---|---|
| Google | ~10,000+ | Search ranking, ad optimization |
| Microsoft (Bing) | ~15,000 | Search features, UI changes |
| Netflix | ~300 | Recommendation algorithms, UI |
| Amazon | ~1,000+ | Product recommendations, pricing |
| Meta | ~10,000+ | News feed ranking, ad targeting |

The A/B Testing Framework

An A/B test is a randomized controlled experiment with the following structure:

  1. Define the hypothesis: What improvement do you expect from the new model?
  2. Choose metrics: Select a primary metric and guardrail metrics
  3. Calculate sample size: Determine how many users you need
  4. Randomize: Randomly assign users to control (A) or treatment (B)
  5. Collect data: Run the experiment for the predetermined duration
  6. Analyze: Perform hypothesis test and interpret results
  7. Decide: Ship, iterate, or abandon based on statistical evidence
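
Step 4 is often implemented by hashing a stable user identifier instead of drawing a fresh random number on every request, so that a returning user always sees the same variant. The sketch below shows one minimal way to do this. The function name `assign_variant`, the experiment name `ranker-v2`, and the SHA-256 bucketing scheme are illustrative assumptions, not something this section prescribes.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment' (illustrative sketch)."""
    # Salt the hash with the experiment name so that assignments in
    # different experiments are not correlated with each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical user IDs, used only to check the split looks balanced.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "ranker-v2")] += 1
print(counts)
```

Because the assignment depends only on the user ID and the experiment name, it can be recomputed anywhere in the serving stack without storing state.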

Hypothesis Formulation for ML Metrics

For ML model comparisons, we typically test whether the treatment model improves a key business metric. Let $p_A$ be the metric value for control and $p_B$ for treatment.

Standard A/B Test Hypotheses

Null Hypothesis (H₀)

$$p_A = p_B$$

The models have the same effect on the metric

Alternative Hypothesis (H₁)

$$p_A \neq p_B$$

The models have different effects (two-tailed)

Statistical Tests for A/B Testing

For comparing proportions (conversion rates, click-through rates), we use the two-proportion z-test:

Two-Proportion Z-Test

Pooled proportion:

$$\hat{p} = \frac{X_A + X_B}{n_A + n_B}$$

Standard error:

$$SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

Z-statistic:

$$z = \frac{\hat{p}_B - \hat{p}_A}{SE}$$

where $X_A, X_B$ are the numbers of successes, $n_A, n_B$ are the sample sizes, and $\hat{p}_A = X_A / n_A$, $\hat{p}_B = X_B / n_B$ are the observed rates.
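
To make the three formulas concrete, here is a small worked example that plugs in numbers by hand. It is a sketch using only the standard library; the counts (1,200 of 10,000 conversions for control, 1,350 of 10,000 for treatment) are hypothetical values chosen for illustration.

```python
import math

# Hypothetical counts: conversions (X) and users (n) per arm.
x_a, n_a = 1200, 10_000   # control
x_b, n_b = 1350, 10_000   # treatment

p_a, p_b = x_a / n_a, x_b / n_b

# Pooled proportion under H0.
p_pool = (x_a + x_b) / (n_a + n_b)

# Standard error of the difference under H0.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

# z-statistic.
z = (p_b - p_a) / se

# Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"pooled = {p_pool:.4f}, SE = {se:.5f}, z = {z:.2f}, p = {p_value:.4f}")
```

With these counts the pooled rate is 0.1275 and z comes out at roughly 3.18, well beyond the 1.96 threshold for a two-tailed test at α = 0.05.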

Interactive: A/B Test Simulator

[Interactive simulator: runs a simulated A/B test between Model A (control, the current production model) and Model B (treatment, a new model). The conversion rate is the proportion of users who take a desired action (click, purchase, etc.). Adjusting the true conversion rates, which are unknown in real life, shows how sample size affects the ability to detect a real difference. The default setting uses a true difference of 2.0% absolute (20.0% relative lift).]

🧠 How This Works in Real ML Systems

H₀: pA = pB (Models have the same conversion rate)

H₁: pA ≠ pB (Models have different conversion rates)

Companies like Google, Netflix, Meta, and Amazon use this exact framework to:

  • Validate new ML model improvements before full deployment
  • Test UI changes that affect user engagement
  • Compare recommendation algorithms
  • Measure the impact of ranking changes

⚠️ Warning: Peeking Problem

In real A/B tests, do NOT stop the test early just because you see a significant p-value! Repeated checking inflates the false positive rate (this is called the "peeking problem"). Either use sequential testing methods or pre-determine your sample size and only check significance once at the end.


Power Analysis and Sample Size

One of the most critical decisions in A/B testing is determining how many users you need. Too few, and you won't detect real improvements. Too many, and you're wasting traffic on experiments when you could be shipping.

Sample Size Formula (Per Group)

$$n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_\beta\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right)^2}{(p_B - p_A)^2}$$

where $\bar{p} = (p_A + p_B)/2$, $z_{\alpha/2}$ is the two-sided critical value, and $z_\beta$ is the standard normal quantile for the desired power $1-\beta$.

Simplified approximation: $n \approx \frac{16\sigma^2}{\delta^2}$ for 80% power at α = 0.05, where $\sigma^2 = \bar{p}(1-\bar{p})$ and $\delta = p_B - p_A$

| Parameter | Symbol | Typical Value | Effect on Sample Size |
|---|---|---|---|
| Significance level | α | 0.05 | Lower α → larger n |
| Power | 1-β | 0.80 | Higher power → larger n |
| Effect size | δ = p_B - p_A | 1-5% | Smaller effect → much larger n |
| Baseline rate | p_A | Varies | Affects variance |

The Effect Size Trap: Sample size scales with the inverse square of the effect size. Detecting a 1% improvement requires about 4× as many samples as detecting a 2% improvement. This is why power analysis before the experiment is critical.
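
The inverse-square relationship is easy to verify numerically. The sketch below uses the rule-of-thumb approximation (16σ²/δ²) with the Bernoulli variance taken at the baseline rate, which is a simplifying assumption. The helper name `approx_sample_size` and the 10% baseline are illustrative.

```python
def approx_sample_size(baseline: float, delta: float) -> float:
    """Rule-of-thumb users per group: 16 * sigma^2 / delta^2 (80% power, alpha = 0.05)."""
    # Assumption: approximate sigma^2 by the Bernoulli variance at the baseline rate.
    sigma_sq = baseline * (1 - baseline)
    return 16 * sigma_sq / delta ** 2

# Halving the detectable effect quadruples the required sample size.
for delta in (0.04, 0.02, 0.01, 0.005):
    n = approx_sample_size(baseline=0.10, delta=delta)
    print(f"delta = {delta:.3f} -> about {n:,.0f} users per group")
```

Each halving of the effect size multiplies the required traffic by four, so a modest-looking change in the minimum detectable effect can turn a one-week test into a one-month test.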

Interactive: Sample Size Calculator

[Interactive calculator: takes a baseline conversion rate and a desired minimum detectable effect (MDE) and computes the required sample size, statistical power, or MDE. All calculations assume a two-tailed test for a difference in proportions. It also plots a power curve (power versus sample size per group, with the 80% level marked).]

Example: with a 10% baseline, detecting a 2.0% absolute change (20% relative) with 80% power at α = 0.05 requires about 3,841 users per group (7,682 in total).

💡 Key Insights for ML Engineers

  • Smaller effects need more data: A 1% improvement needs ~4x more samples than a 2% improvement
  • Industry standard: 80% power and α = 0.05 are conventional choices
  • Traffic constraints: If you can't reach the required sample size, either accept lower power or focus on larger improvements
  • Multiple metrics: Running tests on multiple metrics requires correction (Bonferroni, etc.) which increases sample size requirements

Testing ML-Specific Metrics

ML systems often require testing multiple metrics simultaneously. A single "primary metric" doesn't capture all the effects of a model change.

Primary Metrics

The main metric you're trying to improve. The experiment decision is based on this.

  • Click-through rate (CTR)
  • Conversion rate
  • Revenue per user
  • Engagement (time spent, actions)

Guardrail Metrics

Metrics that must not regress, even if the primary metric improves.

  • Latency / page load time
  • Error rates
  • User retention
  • Long-term engagement
ML Model Metrics: When comparing ML models, you might also track:
  • Model-specific metrics (AUC, precision, recall) measured on live traffic
  • Prediction confidence/calibration
  • Feature coverage (does the new model handle more cases?)
  • Model latency (inference time)
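
When several metrics are tested in the same experiment, the chance that at least one looks significant by luck grows with the number of metrics. The sketch below shows two standard corrections, Bonferroni and the Holm step-down procedure, applied to a set of per-metric p-values. The metric names and p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for a metric only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical per-metric p-values from one experiment.
metrics = ["ctr", "revenue_per_user", "retention", "latency"]
p_values = [0.004, 0.015, 0.030, 0.450]

for name, b, h in zip(metrics, bonferroni(p_values), holm(p_values)):
    print(f"{name:18s} bonferroni={b!s:5s} holm={h}")
```

Holm controls the same family-wise error rate as Bonferroni but is never less powerful; in this example it flags two metrics where Bonferroni flags one.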

Interactive: Multi-Metric ML Testing

[Interactive demo: compares a baseline model (Model A) and a challenger (Model B) across multiple metrics at once. You set the "true" performance values (unknown in real life), choose a primary metric such as AUC-ROC (higher is better), and run the experiment to see how the observed metrics converge, when statistical significance is reached, and how an improvement in one metric can come with tradeoffs in others.]

The Peeking Problem

One of the most common mistakes in A/B testing is peeking: checking results multiple times during the experiment and stopping early when you see significance.

⚠️ Why Peeking is Dangerous

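When you check significance k times during an experiment, your true false positive rate can be as high as approximately:

P(at least one false positive) ≈ 1 - (1 - α)^k

| Number of checks (k) | Chance of at least one false positive |
|---|---|
| k = 1 | 5% |
| k = 5 | ~23% |
| k = 10 | ~40% |
| k = 20 | ~64% |

This approximation treats the k checks as independent. In a real experiment each check reuses the data from earlier checks, so the looks are correlated and the formula overstates the inflation, but the actual rate is still far above the nominal 5%. Under the independence approximation, checking daily for a month (k = 30) with α = 0.05 gives up to a ~79% chance of a false positive!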
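
A short Monte Carlo simulation shows the effect directly. The sketch below simulates many A/A experiments (both arms share the same true rate, so every "significant" result is a false positive) and stops at the first peek where |z| > 1.96. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def peeking_false_positive_rate(n_peeks, users_per_peek=500, rate=0.10,
                                n_experiments=2000, z_crit=1.96, seed=0):
    """Share of A/A experiments declared significant at any of n_peeks looks."""
    rng = np.random.default_rng(seed)
    shape = (n_experiments, n_peeks)
    # Cumulative conversions per arm at each peek (same true rate in both arms).
    conv_a = rng.binomial(users_per_peek, rate, size=shape).cumsum(axis=1)
    conv_b = rng.binomial(users_per_peek, rate, size=shape).cumsum(axis=1)
    n = users_per_peek * np.arange(1, n_peeks + 1)  # users per arm at each peek

    # Two-proportion z-test at every peek.
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se

    # An experiment is a false positive if any peek crosses the threshold.
    return float(np.mean((np.abs(z) > z_crit).any(axis=1)))

for k in (1, 5, 10, 20):
    print(f"{k:2d} peeks: false positive rate = {peeking_false_positive_rate(k):.3f}")
```

The simulated rates grow steadily with the number of peeks. They stay below the 1 - (1 - α)^k figure because consecutive looks share data, yet they are several times the nominal 5% once a test is checked ten or twenty times.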

Interactive: Peeking Simulation

[Interactive simulation: runs experiments in which H₀ is true (both variants have the same conversion rate) and checks significance repeatedly at α = 0.05 (10 peeks per experiment by default), showing how repeated checking inflates the false positive rate.]

Solutions to the Peeking Problem

1. Fixed Sample Size

Pre-register sample size and only analyze once at the end.

2. Sequential Testing

Use O'Brien-Fleming or Pocock boundaries that account for multiple looks.

3. Bayesian Methods

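Bayesian A/B testing doesn't have the same peeking penalty, because a posterior probability can be interpreted at any point in the experiment. Decision thresholds still need to be chosen with care, since stopping at the first favorable posterior can still lead to wrong launches.

4. α-Spending Functions

Distribute α budget across interim analyses (e.g., Lan-DeMets).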
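
To give a feel for how an α budget is distributed, the sketch below evaluates two commonly used Lan-DeMets spending functions: the O'Brien-Fleming type, which spends almost nothing early, and the Pocock type, which spends more evenly. Each returns the cumulative α allowed by information fraction t (the share of the planned sample collected so far). Turning these budgets into z-boundaries requires numerical integration that is normally left to group-sequential software, so that step is deliberately not shown. The function names are illustrative.

```python
import math
from scipy import stats

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha spent at information fraction t."""
    z = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z / math.sqrt(t)))

def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha spent at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)

# Four equally spaced looks (an assumed schedule).
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: OBF-type = {obf_spending(t):.5f}, "
          f"Pocock-type = {pocock_spending(t):.5f}")
```

Both functions reach the full α = 0.05 at t = 1; they differ only in how much of the budget is available at early looks.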


Bayesian A/B Testing

Bayesian A/B testing offers an alternative paradigm that many find more intuitive and practical for real-world ML applications.

Frequentist Approach

  • Answer: "Is the difference significant?" (Yes/No)
  • Output: p-value, confidence interval
  • Requires fixed sample size
  • Peeking inflates false positive rate
  • Cannot say "B is 90% likely to be better"

Bayesian Approach

  • Answer: "What's P(B > A)?" (probability)
  • Output: posterior distributions, credible intervals
  • Can stop anytime based on decision criteria
  • Natural handling of continuous monitoring
  • Directly answers business questions

Bayesian A/B Test with Beta Priors

Prior (uninformative):

$$p_A, p_B \sim \text{Beta}(1, 1) = \text{Uniform}(0, 1)$$

Posterior after data:

$$p_A \mid \text{data} \sim \text{Beta}(1 + \text{successes}_A, 1 + \text{failures}_A)$$

Probability B is better (via Monte Carlo):

$$P(p_B > p_A \mid \text{data}) = \int\int \mathbf{1}[p_B > p_A] \, d\pi(p_A) \, d\pi(p_B)$$
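
Monte Carlo is the usual way to evaluate this double integral, but for two independent Beta posteriors it also reduces to a one-dimensional integral, P(p_B > p_A) = ∫ f_B(x) F_A(x) dx, where f_B is the posterior density of B and F_A the posterior CDF of A. The sketch below evaluates that integral on a grid as a cross-check on sampling-based estimates. The function name, grid size, and example counts are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def prob_b_beats_a(succ_a, n_a, succ_b, n_b, grid_size=20_001):
    """P(p_B > p_A | data) under Beta(1, 1) priors, by trapezoidal integration."""
    post_a = stats.beta(1 + succ_a, 1 + n_a - succ_a)
    post_b = stats.beta(1 + succ_b, 1 + n_b - succ_b)

    x = np.linspace(0.0, 1.0, grid_size)
    # Density of p_B at x times the probability that p_A falls below x.
    integrand = post_b.pdf(x) * post_a.cdf(x)

    # Trapezoidal rule written out by hand.
    dx = x[1] - x[0]
    return float(dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))

# Hypothetical data: 120/1000 conversions for A, 145/1000 for B.
print(f"P(B > A) = {prob_b_beats_a(120, 1000, 145, 1000):.3f}")
```

With no data at all, both posteriors are uniform and the function returns 0.5, which matches the starting point of any Beta(1, 1) analysis.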

Interactive: Bayesian A/B Testing

[Interactive simulator: compares two variants, A (control) and B (treatment), using Bayesian inference with Beta distributions. Both posteriors start at Beta(1, 1), where P(B better than A) = 50.0%. As more data are collected, the posterior distributions become more concentrated and the probability that B outperforms A updates.]

Why Bayesian A/B Testing?

  • Get a probability that B is better, not just "significant" or not
  • Can stop early when confident enough (no peeking penalty)
  • Works with small sample sizes through prior information
  • Naturally handles uncertainty - wider curves = more uncertain
When to Use Bayesian A/B Testing:
  • When you want to continuously monitor results without peeking penalties
  • When stakeholders want probabilities rather than p-values
  • When you have prior information about likely effect sizes
  • For faster decisions when resources are limited

A/B Testing vs Multi-Armed Bandits

A/B testing and multi-armed bandits represent different philosophies for online experimentation:

| Aspect | A/B Testing | Multi-Armed Bandits |
|---|---|---|
| Allocation | Fixed (50/50) | Adaptive (explore/exploit) |
| Goal | Statistical inference | Minimize regret |
| Traffic efficiency | Lower | Higher |
| Statistical rigor | High | Lower |
| Stopping rule | Fixed sample size | Continuous |
| Best for | Hypothesis testing | Optimization |

Regret is the key concept in bandit algorithms. It measures the cumulative reward lost by not always showing the best arm:

$$\text{Regret}(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]$$

where $\mu^*$ is the mean reward of the best arm.
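
The regret definition can be made concrete with a small simulation. The sketch below compares a fixed 50/50 split against Thompson Sampling with Beta(1, 1) priors and accumulates the expected regret of each pull, μ* minus the true rate of the chosen arm. The function name, the true rates, and the horizon are illustrative assumptions.

```python
import numpy as np

def simulate_regret(true_rates, horizon, policy, seed=0):
    """Cumulative expected regret of a policy on Bernoulli arms (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.zeros(k)
    failures = np.zeros(k)
    best = max(true_rates)
    regret = 0.0
    for t in range(horizon):
        if policy == "thompson":
            # Draw one plausible rate per arm from its Beta posterior, play the largest.
            arm = int(np.argmax(rng.beta(1 + successes, 1 + failures)))
        else:
            # "ab": fixed, equal allocation that ignores the data so far.
            arm = t % k
        reward = float(rng.random() < true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - true_rates[arm]  # expected reward lost on this pull
    return regret

rates = [0.05, 0.15]  # hypothetical true conversion rates
ab_regret = simulate_regret(rates, horizon=5000, policy="ab")
ts_regret = simulate_regret(rates, horizon=5000, policy="thompson")
print(f"A/B regret: {ab_regret:.1f}, Thompson Sampling regret: {ts_regret:.1f}")
```

The fixed split loses a constant amount on every pull of the worse arm, so its regret grows linearly with T, while Thompson Sampling shifts traffic toward the better arm as evidence accumulates.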

Interactive: Bandits Comparison

[Interactive simulation: compares A/B testing (equal allocation) with adaptive bandit strategies (Thompson Sampling, UCB1, ε-Greedy) for chosen true conversion rates and experiment settings. Bandits allocate more traffic to the winning arm over time, reducing regret, the reward lost by not always choosing the best arm.]

💡 When to Use Each Approach

Use A/B Testing When:

  • You need rigorous statistical inference
  • Regulatory requirements apply (e.g., medical)
  • You want interpretable p-values and CIs
  • Long-term effects matter (not just immediate reward)

Use Bandits When:

  • Minimizing regret (lost reward) is critical
  • You need faster convergence to the winner
  • Traffic is limited and every conversion counts
  • Continuous optimization (not one-time test)
When NOT to Use Bandits:
  • When you need rigorous p-values or confidence intervals
  • When regulatory requirements apply (medical, financial)
  • When delayed conversions make immediate feedback impossible
  • When you need to measure effect size precisely

Practical Considerations


Python Implementation

Let's implement a complete A/B testing analysis pipeline in Python:

Two-Proportion Z-Test for A/B Testing
```python
from scipy import stats
import numpy as np

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test for A/B testing."""

    # Conversion rates: successes divided by total visitors for each group
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    # Pooled conversion rate under H0 (groups are identical)
    p_pooled = (successes_a + successes_b) / (n_a + n_b)

    # Standard error of the difference under H0
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))

    # z-statistic: how many standard errors the observed difference is from zero
    z = (p_b - p_a) / se

    # Two-tailed p-value: probability of a difference this extreme if H0 is true
    # (sf is the survival function, 1 - cdf, which is more accurate in the tails)
    p_value = 2 * stats.norm.sf(abs(z))

    # 95% confidence interval for the difference (unpooled standard error)
    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    z_crit = stats.norm.ppf(0.975)
    ci = (p_b - p_a - z_crit*se_diff, p_b - p_a + z_crit*se_diff)

    # Relative lift: percentage improvement of treatment over control
    lift = (p_b - p_a) / p_a * 100 if p_a > 0 else np.inf

    return {
        'p_a': p_a, 'p_b': p_b,
        'z_statistic': z, 'p_value': p_value,
        'ci_95': ci, 'lift_percent': lift
    }
```

Now let's add power analysis and Bayesian methods:

```python
import numpy as np
from scipy import stats

# ================================================
# 1. Sample Size Calculation
# ================================================

def sample_size_for_proportion(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for A/B test on proportions.

    Parameters
    ----------
    baseline_rate : float
        Expected conversion rate of control group
    mde : float
        Minimum detectable effect (absolute difference)
    alpha : float
        Significance level (Type I error rate)
    power : float
        Desired power (1 - Type II error rate)

    Returns
    -------
    int : Required sample size per group
    """
    if mde == 0:
        raise ValueError("mde must be non-zero")

    p1 = baseline_rate
    p2 = baseline_rate + mde

    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    p_bar = (p1 + p2) / 2

    n = (z_alpha * np.sqrt(2*p_bar*(1-p_bar)) +
         z_beta * np.sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p2 - p1)**2

    return int(np.ceil(n))

# Example
n = sample_size_for_proportion(baseline_rate=0.10, mde=0.02)
print(f"Required sample size per group: {n:,}")


# ================================================
# 2. Bayesian A/B Testing
# ================================================

def bayesian_ab_test(successes_a, n_a, successes_b, n_b,
                     prior_alpha=1, prior_beta=1, n_samples=100000, seed=None):
    """
    Bayesian A/B test using Beta-Binomial model.

    Returns probability that B is better than A.
    Pass a seed to make the Monte Carlo estimates reproducible.
    """
    rng = np.random.default_rng(seed)

    # Posterior distributions (conjugate update)
    alpha_a = prior_alpha + successes_a
    beta_a = prior_beta + (n_a - successes_a)

    alpha_b = prior_alpha + successes_b
    beta_b = prior_beta + (n_b - successes_b)

    # Monte Carlo sampling from both posteriors
    samples_a = rng.beta(alpha_a, beta_a, n_samples)
    samples_b = rng.beta(alpha_b, beta_b, n_samples)

    # P(B > A): share of draws in which B's rate exceeds A's
    prob_b_better = np.mean(samples_b > samples_a)

    # Expected relative lift of B over A, in percent
    expected_lift = np.mean((samples_b - samples_a) / samples_a) * 100

    # 95% credible intervals
    ci_a = np.percentile(samples_a, [2.5, 97.5])
    ci_b = np.percentile(samples_b, [2.5, 97.5])
    ci_diff = np.percentile(samples_b - samples_a, [2.5, 97.5])

    return {
        'prob_b_better': prob_b_better,
        'expected_lift_percent': expected_lift,
        'posterior_mean_a': samples_a.mean(),
        'posterior_mean_b': samples_b.mean(),
        'ci_95_a': ci_a,
        'ci_95_b': ci_b,
        'ci_95_diff': ci_diff
    }

# Example
result = bayesian_ab_test(
    successes_a=120, n_a=1000,  # 12% conversion
    successes_b=145, n_b=1000   # 14.5% conversion
)
print(f"P(B > A) = {result['prob_b_better']:.1%}")
print(f"Expected lift: {result['expected_lift_percent']:.1f}%")


# ================================================
# 3. Sequential Testing (O'Brien-Fleming)
# ================================================

def obrien_fleming_boundary(alpha, n_looks, look_number):
    """
    Approximate O'Brien-Fleming boundary for sequential testing.

    Returns z-score threshold for rejecting H0 at this look.
    """
    if not 1 <= look_number <= n_looks:
        raise ValueError("look_number must be between 1 and n_looks")

    # Information fraction: share of the planned sample seen at this look
    t = look_number / n_looks

    # O'Brien-Fleming boundaries are proportional to 1/sqrt(t).
    # Approximation: the fixed-sample critical value is used as the constant,
    # which is slightly anti-conservative. Exact constants are somewhat larger
    # and come from group-sequential software.
    z_alpha_full = stats.norm.ppf(1 - alpha/2)
    z_boundary = z_alpha_full / np.sqrt(t)

    return z_boundary

# Example: 3 interim looks + 1 final
for look in [1, 2, 3, 4]:
    boundary = obrien_fleming_boundary(alpha=0.05, n_looks=4, look_number=look)
    print(f"Look {look}: |z| > {boundary:.3f} to reject H0")


# ================================================
# 4. Complete A/B Test Analysis
# ================================================

def analyze_ab_test(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Complete A/B test analysis with both frequentist and Bayesian results."""

    # Frequentist analysis: pooled SE for the test, unpooled SE for the CI
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))

    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci = (p_b - p_a - z_crit*se_diff, p_b - p_a + z_crit*se_diff)

    # Bayesian analysis
    bayes = bayesian_ab_test(successes_a, n_a, successes_b, n_b)

    # Print report
    print("=" * 60)
    print("A/B TEST ANALYSIS REPORT")
    print("=" * 60)
    print(f"\nSample Sizes: Control = {n_a:,}, Treatment = {n_b:,}")
    print(f"Conversions:  Control = {successes_a:,}, Treatment = {successes_b:,}")
    print(f"\nConversion Rates:")
    print(f"  Control:   {p_a*100:.2f}%")
    print(f"  Treatment: {p_b*100:.2f}%")
    print(f"  Lift:      {(p_b-p_a)/p_a*100:+.2f}%")
    print(f"\nFrequentist Results:")
    print(f"  z-statistic: {z:.3f}")
    print(f"  p-value:     {p_value:.4f}")
    print(f"  95% CI:      [{ci[0]*100:.2f}%, {ci[1]*100:.2f}%]")
    print(f"  Significant: {'Yes' if p_value < alpha else 'No'} (α = {alpha})")
    print(f"\nBayesian Results:")
    print(f"  P(B > A):    {bayes['prob_b_better']*100:.1f}%")
    print(f"  95% Credible Interval: [{bayes['ci_95_diff'][0]*100:.2f}%, {bayes['ci_95_diff'][1]*100:.2f}%]")

    if bayes['prob_b_better'] > 0.95:
        print("\n✅ RECOMMENDATION: Ship treatment (high confidence B is better)")
    elif bayes['prob_b_better'] < 0.05:
        print("\n❌ RECOMMENDATION: Keep control (high confidence A is better)")
    else:
        print("\n⏳ RECOMMENDATION: Continue collecting data")

# Example usage
analyze_ab_test(
    successes_a=1200, n_a=10000,  # 12% baseline
    successes_b=1350, n_b=10000   # 13.5% treatment
)
```

Knowledge Check

Test your understanding of A/B testing for ML applications with this quiz.


A company runs an A/B test comparing two recommendation models. After 1 week, the p-value is 0.08. After 2 weeks, it drops to 0.03. What is the correct interpretation?


Summary

Key Takeaways

  1. A/B testing is essential for ML deployment: Offline metrics don't always predict online performance. Randomized experiments provide causal evidence.
  2. Power analysis is critical: Calculate sample size before starting. Required sample size scales inversely with the square of the effect size.
  3. Beware the peeking problem: Repeated checking at α = 0.05 inflates false positive rates dramatically. Use sequential testing or Bayesian methods.
  4. Multiple metrics need multiple considerations: Define primary and guardrail metrics. Consider corrections for multiple testing.
  5. Bayesian A/B testing offers practical advantages: Continuous monitoring, probability statements, and faster decisions, but it requires a different interpretation.
  6. Bandits optimize differently than A/B tests: Multi-armed bandits minimize regret but sacrifice statistical rigor. Choose based on your goals.

Quick Reference

| Topic | Key Formula / Rule |
|---|---|
| Z-test statistic | z = (p_B - p_A) / √[p̂(1-p̂)(1/n_A + 1/n_B)] |
| Sample size (approx) | n ≈ 16σ²/δ² for 80% power, α=0.05 |
| False positive with k peeks | P(FP) ≈ 1 - (1-α)^k |
| Bayesian posterior | p ∣ data ~ Beta(1 + successes, 1 + failures) |
| Regret | T·μ* - Σ E[r_t] |

Industry Best Practices

Before the Test

  • Define hypothesis and success criteria
  • Calculate required sample size
  • Pre-register analysis plan
  • Set up proper logging and metrics

During and After

  • Check for sample ratio mismatch
  • Don't peek without correction
  • Run long enough for novelty effects
  • Document and share learnings
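
A sample ratio mismatch (SRM) check compares the observed split of users between arms with the split the experiment was configured to produce. A common way to do this is a chi-square goodness-of-fit test, sketched below with `scipy.stats.chisquare`. The function name, the example counts, and the strict 0.001 alert threshold are assumptions for illustration, not values given in this section.

```python
from scipy import stats

def sample_ratio_mismatch(n_control, n_treatment, expected_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test of the observed split against the planned split."""
    total = n_control + n_treatment
    # Expected counts under the configured allocation.
    expected = [total * (1 - expected_share), total * expected_share]
    chi2, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        # A very small p-value suggests the assignment or logging pipeline is biased.
        "srm_detected": bool(p_value < threshold),
    }

# Hypothetical counts for a planned 50/50 split.
print(sample_ratio_mismatch(10_000, 10_050))   # small imbalance
print(sample_ratio_mismatch(10_000, 10_600))   # large imbalance
```

If the check fires, the usual response is to debug assignment and logging before interpreting any metric, because a biased split undermines the randomization the analysis depends on.
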
Looking Ahead: In the next section, we'll explore Sequential Testing, which provides a principled framework for monitoring experiments over time while maintaining proper error control.